
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
Question:
What are the most common themes or categories of books available on an online bookstore,
and how does the distribution of these categories vary across the website?

Data to Collect:
1. Book titles
2. Book categories
3. Price and availability of each book

Steps for Collecting Data:
Web Scraping Setup:
We have to use Python’s requests and BeautifulSoup libraries to scrape the website books.toscrape.com,
which contains a collection of books.

Data Collection:
Scrape the book titles, categories, prices, and availability from multiple pages of the website.
Collect at least 1000 book entries to ensure enough data for analysis.


The final step is to save the scraped data in a CSV file for easy analysis of book categories, price distribution, and availability.
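
The category analysis described above could look roughly like the sketch below. It assumes the saved CSV has a `Category` column, which is a hypothetical column name used only for illustration; the function name `category_distribution` and the sample rows are also made up and are not scraped data.

```python
import csv
import io
from collections import Counter


def category_distribution(csv_file):
    """Return {category: share of all books} from a CSV with a 'Category' column.

    'Category' is an assumed column name for this sketch.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row["Category"] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Illustrative rows only, not real scraped data.
sample = io.StringIO(
    "Title,Category,Price,Availability\n"
    "Book A,Poetry,£10.00,In stock\n"
    "Book B,Poetry,£12.50,In stock\n"
    "Book C,Travel,£20.00,In stock\n"
    "Book D,Mystery,£8.99,In stock\n"
)
shares = category_distribution(sample)
print(shares)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open('books_data.csv', newline='', encoding='utf-8-sig')`.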


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [1]:
import requests
from bs4 import BeautifulSoup
import csv

url = 'http://books.toscrape.com/catalogue/page-{}.html'
with open('books_data.csv', mode='w', newline='', encoding='utf-8-sig') as file:
    a = csv.writer(file)
    a.writerow(['Title', 'Price', 'Availability'])
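    # Each catalogue page lists 20 books, so these 10 pages yield 200 rows.
    # The full catalogue of 1,000 books spans 50 pages, i.e. range(1, 51).
    for i in range(1, 11):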
        res = requests.get(url.format(i))
        soup = BeautifulSoup(res.content, 'html.parser')
        books = soup.find_all('article', class_='product_pod')
        for j in books:
            title = j.find('h3').find('a')['title']
            price = j.find('p', class_='price_color').text
            availability = j.find('p', class_='instock availability').text.strip()
            a.writerow([title, price, availability])
        print(f"Scraped page {i}")

print("Data collection completed and saved to 'books_data.csv'.")
# After executing the code, the books_data.csv file is created in the working folder.


Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Scraped page 6
Scraped page 7
Scraped page 8
Scraped page 9
Scraped page 10
Data collection completed and saved to 'books_data.csv'.
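
The `Price` and `Availability` columns are saved as raw strings, so a price distribution cannot be computed until they are converted. The sketch below shows one possible cleanup step, assuming prices look like `£51.77` and availability text starts with `In stock`. The helper names `parse_price` and `is_in_stock` are hypothetical.

```python
import re


def parse_price(text):
    """Extract the numeric amount from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric amount in {text!r}")
    return float(match.group())


def is_in_stock(text):
    """Return True when the availability text starts with 'In stock'."""
    return text.strip().lower().startswith("in stock")


# Illustrative values only.
print(parse_price("£51.77"))
print(is_in_stock("  In stock  "))
```

Searching for the digits rather than stripping the currency symbol keeps the helper working even if the symbol was mis-decoded on the way in.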


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [2]:
import requests
import pandas as pd
import time
url = 'https://api.crossref.org/works'
file = 'articles.csv'
data = []
total = 0
limit = 100
start = 0
keyword = 'XYZ'
start_year = 2014
end_year = 2024
def wait(attempts):
    wait_time = min(2 ** attempts, 60)
    print(f"Rate limit exceeded. Waiting for {wait_time} seconds...")
    time.sleep(wait_time)

while total < 1000:
    retries = 0
    while True:
        params = {
            'query': keyword,
            'offset': start,
            'rows': limit,
            'filter': f'from-pub-date:{start_year}-01-01,until-pub-date:{end_year}-12-31'
        }

        try:
            response = requests.get(url, params=params)
            if response.status_code == 429:
                retries += 1
                wait(retries)
                continue
            response.raise_for_status()
            json_data = response.json()
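            for item in json_data.get('message', {}).get('items', []):
                # Use the publication date ('issued'); 'created' is when the DOI record was registered.
                year = item.get('issued', {}).get('date-parts', [[None]])[0][0]
                if year and start_year <= int(year) <= end_year:
                    # 'title' and 'container-title' can be missing or empty lists.
                    title = (item.get('title') or ['No title'])[0]
                    venue = (item.get('container-title') or ['No venue'])[0]
                    # Crossref stores personal names as 'given'/'family'; 'name' is used for organisations.
                    authors = ', '.join(
                        author.get('name') or f"{author.get('given', '')} {author.get('family', '')}".strip()
                        for author in item.get('author', [])
                    ) or 'No author'
                    abstract = item.get('abstract', 'No abstract')
                    data.append([title, venue, year, authors, abstract])
                    total += 1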
                    if total >= 1000:
                        break
            start += limit
            print(f"Collected {total} articles so far.")
            break
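        except requests.RequestException as e:
            # A bare break here makes the outer loop retry the same offset forever
            # on a persistent error, so back off and give up after a few attempts.
            print(f"Error: {e}")
            retries += 1
            if retries > 5:
                raise
            wait(retries)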
df = pd.DataFrame(data, columns=['Title', 'Venue', 'Year', 'Authors', 'Abstract'])
df.to_csv(file, index=False)
print(f"Finished collecting {total} articles. Data saved to {file}.")


Collected 100 articles so far.
Collected 200 articles so far.
Collected 300 articles so far.
Collected 400 articles so far.
Collected 500 articles so far.
Collected 600 articles so far.
Collected 700 articles so far.
Collected 800 articles so far.
Collected 900 articles so far.
Collected 1000 articles so far.
Finished collecting 1000 articles. Data saved to articles.csv.
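
The `wait` helper in the cell above doubles the delay after each rate-limited attempt and caps it at 60 seconds. The sketch below isolates that schedule and wraps it in a generic retry helper. The names `backoff_delays` and `call_with_backoff` are hypothetical, and catching every `Exception` is a simplification; real code would catch only the errors it expects, such as `requests.RequestException`.

```python
import time


def backoff_delays(max_attempts, cap=60):
    """Delays in seconds for attempts 1..max_attempts: 2, 4, 8, ... capped at `cap`."""
    return [min(2 ** attempt, cap) for attempt in range(1, max_attempts + 1)]


def call_with_backoff(func, max_attempts=5, sleep=time.sleep):
    """Call func() until it succeeds, sleeping between failed attempts.

    Re-raises the last error once max_attempts calls have failed.
    """
    delays = backoff_delays(max_attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay)


print(backoff_delays(7))
```

Passing `sleep` as a parameter makes the helper easy to exercise without actually waiting, for example by handing it a function that only records the requested delays.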


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [4]:
!pip install feedparser
import feedparser
import pandas as pd
def scrape_reddit(subreddit):
    url = f'https://www.reddit.com/r/{subreddit}/top/.rss'
    feed = feedparser.parse(url)
    data = []
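    for entry in feed.entries:
        # Entries may omit some fields, so use .get() with a default
        # instead of attribute access, which raises AttributeError.
        title = entry.get('title', '')
        summary = entry.get('summary', '')
        link = entry.get('link', '')
        published = entry.get('published', '')
        author = entry.get('author', '')
        data.append([title, summary, link, published, author])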
    df = pd.DataFrame(data, columns=['Title', 'Summary', 'Link', 'Published', 'Author'])
    return df
subreddit = 'Python'
df = scrape_reddit(subreddit)
df.to_csv('reddit_data.csv', index=False)
print("Data has been saved to 'reddit_data.csv'")


Data has been saved to 'reddit_data.csv'
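
Feed summaries usually arrive as HTML markup rather than plain text, which makes the `Summary` column awkward to analyse. The sketch below reduces such markup to text using only the standard library. It assumes the summaries are HTML fragments; the helper name `html_to_text` and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed.

    Text nodes are joined with a space, so inline tags that split a word
    (for example '<b>wor</b>ld') would leave a space inside that word.
    """
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Illustrative fragment only, not real feed content.
print(html_to_text('<div class="md"><p>Hello &amp; <b>world</b></p></div>'))
```

The helper could be applied to the column after scraping, for example with `df['Summary'].map(html_to_text)`.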


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I learnt a lot overall from working on the web scraping assignments. My knowledge of the structure of data on
websites and how to extract it using tools like BeautifulSoup and Selenium has significantly improved.
Understanding the differences between static and dynamic content as well as how APIs may be helpful for data
scraping from websites like Reddit and Twitter was interesting to me. Finding a way to obtain the information
I needed from several sources felt nice.

I did run into some difficulties, particularly on sites like Twitter where obtaining data without an
API key was restricted. Managing SSL certificate issues while scraping was another difficult aspect,
but I was able to resolve it by adjusting a few parameters. It was a little challenging at first to use
Selenium to handle JavaScript for websites with dynamic content (like Instagram), but I ultimately figured it out.

The ability to gather information from the internet is beneficial for many purposes.
In my line of work, real-time data from web sources can be very helpful for trend analysis,
machine learning data collection, or targeted research. It can support better decisions and
more robust data-driven projects.

'''