<a href="https://colab.research.google.com/github/praveen1608/Praveen-Reddy_INFO5731_Spring2024/blob/main/Kadasani_PraveenReddy_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
"""
How do customer reviews affect a product's sales performance?

We use web scraping techniques to collect product reviews from Amazon.

For a thorough analysis we should collect a substantial amount of data: at least 1000 reviews for a single product, so that we can analyze customer sentiment, i.e., how customers rate the product overall.

First, we obtain the URL of the product we are working on. Next, we run the scraping code to collect as many reviews as we need. Finally, we save the scraped data in a structured format such as CSV or JSON.
"""

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [25]:
# write your answer here
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_amazon_reviews(product_url, max_samples):
    reviews_data = []
    page_number = 1

    while len(reviews_data) < max_samples:
        # Construct URL for the current page
        url = f"{product_url}&pageNumber={page_number}"

        # Send HTTP GET request to product URL
        response = requests.get(url)

        # Parse HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

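        # Extract review elements
        review_elements = soup.find_all('div', class_='a-section review aok-relative')

        # Extract review text and star ratings
        added = 0
        for review in review_elements:
            text_tag = review.find('span', class_='review-text')
            rating_tag = review.find('span', class_='a-icon-alt')
            # Skip reviews that lack either field instead of raising AttributeError
            if text_tag is None or rating_tag is None:
                continue
            review_text = text_tag.text.strip()
            star_rating = float(rating_tag.text.split()[0])  # Extract star rating from alt text
            reviews_data.append({'Review Text': review_text, 'Star Rating': star_rating})
            added += 1

            # Check if we have collected enough samples
            if len(reviews_data) >= max_samples:
                break

        # Stop if this page yielded nothing usable; otherwise the loop would never end
        if added == 0:
            break

        # Move to the next page
        page_number += 1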

    return reviews_data

def main():
    # Product URL to scrape reviews from
    product_url = 'https://www.amazon.com/2022-Apple-iPad-10-9-inch-Wi-Fi/dp/B09V3JJT5D/ref=sr_1_1?crid=U18Z0W66Y3D2&keywords=ipad+air+5th+generation&qid=1708058290&sprefix=ipad+air+5th+generation+%2Caps%2C325&sr=8-1-spons&ufe=app_do%3Aamzn1.fos.c3015c4a-46bb-44b9-81a4-dc28e6d374b3&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&psc=1'

    # Number of samples to collect
    max_samples = 1000

    # Scrape product reviews from Amazon
    reviews_data = scrape_amazon_reviews(product_url, max_samples)

    # Convert reviews data to DataFrame
    df = pd.DataFrame(reviews_data)

    # Save data to CSV file
    df.to_csv('amazon_reviews.csv', index=False)
    print("Data saved successfully.")

    print(df)

if __name__ == '__main__':
    main()

Data saved successfully.
                                           Review Text  Star Rating
0    The iPad Air 5th Generation has been a perfect...          5.0
1    I bought ipad air to carry in small bags when ...          5.0
2    Started using an IPad back when the IPad 2 was...          5.0
3    It's been years I've used an apple device with...          5.0
4    I received this iPad Air as my Christmas gift ...          5.0
..                                                 ...          ...
995  The M1 chip in this thing rips. Was torn betwe...          5.0
996  The iPad Air 5th Generation has been a perfect...          5.0
997  I bought ipad air to carry in small bags when ...          5.0
998  Started using an IPad back when the IPad 2 was...          5.0
999  It's been years I've used an apple device with...          5.0

[1000 rows x 2 columns]
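
In the output above, rows 996 to 999 begin with the same text as rows 0 to 3, which suggests that the same reviews were collected more than once. A minimal sketch of removing exact repeats before saving is shown below. It assumes each review is a dict with the `Review Text` and `Star Rating` keys used by the scraper; `dedupe_reviews` is a hypothetical helper and the sample records are made up.

```python
def dedupe_reviews(reviews):
    """Return reviews with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for review in reviews:
        # Two reviews count as the same when both text and rating match
        key = (review["Review Text"], review["Star Rating"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


# Made-up records: the third entry repeats the first
sample = [
    {"Review Text": "Great tablet.", "Star Rating": 5.0},
    {"Review Text": "Too expensive.", "Star Rating": 3.0},
    {"Review Text": "Great tablet.", "Star Rating": 5.0},
]

print(len(dedupe_reviews(sample)))  # prints 2
```

If the data is already in a pandas DataFrame, `df.drop_duplicates()` has the same effect. Checking the number of unique rows is a quick way to confirm whether 1000 distinct samples were really collected.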


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [24]:
# write your answer here
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_google_scholar(keyword, max_articles):
    base_url = "https://scholar.google.com/scholar"
    params = {'q': keyword, 'hl': 'en', 'start': 0}
    articles_data = []
    articles_collected = 0

    try:
        while articles_collected < max_articles:
            response = requests.get(base_url, params=params)
            print("Scraping URL:", response.url)  # Print the URL being scraped
            soup = BeautifulSoup(response.text, 'html.parser')
            articles = soup.find_all('div', class_='gs_r gs_or gs_scl')

            if not articles:
                break

            for article in articles:
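                title_tag = article.find('h3', class_='gs_rt')
                byline_tag = article.find('div', class_='gs_a')
                abstract_tag = article.find('div', class_='gs_rs')
                # Skip results without a title or byline
                if title_tag is None or byline_tag is None:
                    continue
                title = title_tag.text.strip()
                # The byline usually reads "Authors - Venue, Year - Publisher"
                byline = byline_tag.text.replace('\xa0', ' ').strip()
                parts = [p.strip() for p in byline.split(' - ')]
                authors = parts[0]
                venue_year = parts[1] if len(parts) > 1 else ''
                # Treat the text after the last comma as the year if it is four digits
                venue, _, year = venue_year.rpartition(',')
                venue, year = venue.strip(), year.strip()
                if not (year.isdigit() and len(year) == 4):
                    venue, year = venue_year, ''
                abstract = abstract_tag.text.strip() if abstract_tag else ''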

                articles_data.append({
                    'Title': title,
                    'Authors': authors,
                    'Venue': venue,
                    'Year': year,
                    'Abstract': abstract
                })

                articles_collected += 1
                if articles_collected >= max_articles:
                    break

            # Update params for next page
            params['start'] += 10

            # Print the length of articles_data after scraping each page
            print(f"Articles collected so far: {len(articles_data)}")

    except Exception as e:
        print("An error occurred during scraping:", e)

    return articles_data

def main():
    keyword = "XYZ"
    max_articles = 1000

    articles_data = scrape_google_scholar(keyword, max_articles)

    # Convert data to DataFrame
    df = pd.DataFrame(articles_data)

    # Save data to CSV file
    df.to_csv('articles_data.csv', index=False)
    print("Data saved successfully.")

    # Print the collected data
    print(df)

if __name__ == "__main__":
    main()


Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=0
Articles collected so far: 10
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=10
Articles collected so far: 20
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=20
Articles collected so far: 30
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=30
Articles collected so far: 40
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=40
Articles collected so far: 50
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=50
Articles collected so far: 60
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=60
Articles collected so far: 70
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=70
Articles collected so far: 80
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=80
Articles collected so far: 90
Scraping URL: https://scholar.google.com/scholar?q=XYZ&hl=en&start=90
Articles collected so far: 100
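
Scholar results typically pack authors, venue, and year into a single byline string, while the question asks for these as separate fields and only for articles from 2014 to 2024. The sketch below shows one way to split such a byline and check the year range. It assumes the common layout "Authors - Venue, Year - Publisher"; `parse_byline` and `in_year_range` are hypothetical helpers, and the example byline is made up.

```python
import re


def parse_byline(byline):
    """Split a Scholar-style byline into authors, venue, and year.

    Assumes the layout "Authors - Venue, Year - Publisher". A missing
    venue is returned as an empty string and a missing year as None.
    """
    # Non-breaking spaces sometimes appear in scraped text; normalize them first
    parts = [p.strip() for p in byline.replace("\xa0", " ").split(" - ")]
    authors = parts[0]
    venue_year = parts[1] if len(parts) > 1 else ""
    match = re.search(r"\b(?:19|20)\d{2}\b", venue_year)
    year = int(match.group(0)) if match else None
    venue = venue_year[: match.start()].rstrip(", ") if match else venue_year
    return {"Authors": authors, "Venue": venue, "Year": year}


def in_year_range(year, first=2014, last=2024):
    """Return True only when a year was found and lies within [first, last]."""
    return year is not None and first <= year <= last


# Made-up byline, for illustration only
example = "A Author, B Writer - Journal of Examples, 2019 - example.org"
record = parse_byline(example)
print(record, in_year_range(record["Year"]))
```

Records for which `in_year_range` returns `False` could simply be skipped before they are appended to the results. Scholar's own custom date range appears in result URLs as the `as_ylo` and `as_yhi` parameters, so adding those to the request parameters may be a simpler alternative.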


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.
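
As a rough illustration of a keyword-based collection with more than four columns, the sketch below flattens a Reddit search listing into rows. It relies on assumptions: it uses Reddit's public `search.json` endpoint, which may throttle or reject unauthenticated requests (the official OAuth API is the supported route), and the `User-Agent` string, function names, and sample listing are all hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Six columns per post, satisfying the "more than four columns" requirement
COLUMNS = ["title", "author", "subreddit", "score", "num_comments", "created_utc"]


def flatten_listing(listing):
    """Turn a parsed Reddit listing into a list of flat row dicts."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        # Missing fields become None rather than raising KeyError
        rows.append({col: post.get(col) for col in COLUMNS})
    return rows


def search_reddit(keyword, limit=25):
    """Fetch one page of search results for a keyword (needs network access)."""
    query = urllib.parse.urlencode({"q": keyword, "limit": limit})
    request = urllib.request.Request(
        f"https://www.reddit.com/search.json?{query}",
        headers={"User-Agent": "info5731-exercise-sketch/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return flatten_listing(json.load(response))


# Made-up listing that mimics the shape of a search response
sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {
                "kind": "t3",
                "data": {
                    "title": "Example post",
                    "author": "example_user",
                    "subreddit": "datascience",
                    "score": 42,
                    "num_comments": 7,
                    "created_utc": 1708058290.0,
                },
            }
        ]
    },
}

rows = flatten_listing(sample_listing)
```

The rows returned by `flatten_listing` can be passed directly to `pd.DataFrame(rows)` and saved with `to_csv`, in the same way as the earlier questions.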


In [None]:
# write your answer here

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.

It feels great to learn new things. I had never done web scraping before, so this was a rewarding experience.
It is like building things ourselves rather than depending on others for data.
I learned about new libraries and how to use them effectively for web scraping instead of writing long code by hand. These libraries save time and were very useful.

I tried scraping Twitter through its API, but I ran into issues like "you need more access or there is only limited access to you".
I think access is restricted and Twitter no longer allows this, so I went with product reviews instead.

For question 4 I wanted to try a different tool, so I worked with "ParseHub". At first I had a lot of trouble figuring out how to use it, and it was very confusing.
After several tries, I succeeded.
'''