<a href="https://colab.research.google.com/github/kc6699c/Komal_INFO5731_Fall2024/blob/main/CHERUKURI_INFO5731_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions will incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import os

def get_review_text(soup): # Function to extract the review details
    reviews = []
    review_blocks = soup.find_all("div", {"data-hook": "review"})

    for block in review_blocks:
        try:
            review_text = block.find("span", {"data-hook": "review-body"}).text.strip()
            rating = block.find("i", {"data-hook": "review-star-rating"}).text.strip()
            date = block.find("span", {"data-hook": "review-date"}).text.strip()
            title = block.find("a", {"data-hook": "review-title"}).text.strip()
            reviews.append({
                "title": title,
                "rating": rating,
                "review_text": review_text,
                "date": date
            })
        except AttributeError:
            continue
    return reviews

def scrape_review_page(url, headers): # Function to scrape a single page of reviews
    try:
        review_page = requests.get(url, headers=headers, timeout=10)
        review_page.raise_for_status()  # Treat HTTP error statuses (e.g. 503 when blocked) as failures
        soup = BeautifulSoup(review_page.content, 'html.parser')
        return get_review_text(soup)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []

def scrape_amazon_reviews(product_base_url, headers, max_pages=20): # Function to scrape reviews across multiple pages
    reviews_data = []
    for page_num in range(1, max_pages + 1):
        review_url = f"{product_base_url}&pageNumber={page_num}"
        print(f"Scraping page {page_num}: {review_url}")

        reviews = scrape_review_page(review_url, headers)
        if reviews:
            reviews_data.extend(reviews)

        time.sleep(2)

        if len(reviews_data) >= 1000:
            break

    return reviews_data

if __name__ == '__main__': # Main function to execute the scraping
    # Define headers to avoid being blocked
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Accept-Language": "en-US, en;q=0.5",
    }

    base_url = "https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5?crid=3S099968VXMXA&dib=eyJ2IjoiMSJ9.w9FjOgRJLM0vYdIVImsScUafugbNLSs5DshepgWg8oT-U-iYhsc89jpMVDMGQ0crEEj7joKGCKZCPzcJ4YnVL1YXnfjX9zHmLf7RDM7I9hxOMkUwBb5_dem7Mm1pJKG9atvE48H397MgYFMyfSBF2fRICZJUixqmPsOBhufU3q2KEqoQhKwrkM-UZjsnQfz0DaJgmLAnBYt2ljFEDFf6DPGDztKAGyePD08yjQ0nP0vtKjCWSrk5NT7Wpu6jYB6-W1A-wApfVlXNmWqmvoAIcM_6EUlWeQAq_7YSSSVjkKU.S65d_duoQ2duZ0JW_O-BrC4slV0OqymKzdzobkWORoA&dib_tag=se&keywords=body+wash&qid=1727489826&rdc=1&sprefix=Body+%2Caps%2C92&sr=8-5"
    # Scraping the reviews
    scraped_reviews = scrape_amazon_reviews(base_url, HEADERS, max_pages=100)

    # Check if reviews are scraped
    if not scraped_reviews:
        print("No reviews scraped. Check if the product URL is correct or if you're being blocked by Amazon.")
    else:
        reviews_df = pd.DataFrame(scraped_reviews) # Converting to DataFrame

        csv_file_path = os.path.join(os.getcwd(), "Amazon_Product_Reviews.csv")  # Define the file path where the CSV will be saved

        reviews_df.to_csv(csv_file_path, header=True, index=False)  # Save the DataFrame to CSV

        print(f"CSV file has been saved successfully as {csv_file_path}")
        print(reviews_df.head())


Scraping page 1: https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5?crid=3S099968VXMXA&dib=eyJ2IjoiMSJ9.w9FjOgRJLM0vYdIVImsScUafugbNLSs5DshepgWg8oT-U-iYhsc89jpMVDMGQ0crEEj7joKGCKZCPzcJ4YnVL1YXnfjX9zHmLf7RDM7I9hxOMkUwBb5_dem7Mm1pJKG9atvE48H397MgYFMyfSBF2fRICZJUixqmPsOBhufU3q2KEqoQhKwrkM-UZjsnQfz0DaJgmLAnBYt2ljFEDFf6DPGDztKAGyePD08yjQ0nP0vtKjCWSrk5NT7Wpu6jYB6-W1A-wApfVlXNmWqmvoAIcM_6EUlWeQAq_7YSSSVjkKU.S65d_duoQ2duZ0JW_O-BrC4slV0OqymKzdzobkWORoA&dib_tag=se&keywords=body+wash&qid=1727489826&rdc=1&sprefix=Body+%2Caps%2C92&sr=8-5&pageNumber=1
Scraping page 2: https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5?crid=3S099968VXMXA&dib=eyJ2IjoiMSJ9.w9FjOgRJLM0vYdIVImsScUafugbNLSs5DshepgWg8oT-U-iYhsc89jpMVDMGQ0crEEj7joKGCKZCPzcJ4YnVL1YXnfjX9zHmLf7RDM7I9hxOMkUwBb5_dem7Mm1pJKG9atvE48H397MgYFMyfSBF2fRICZJUixqmPsOBhufU3q2KEqoQhKwrkM-UZjsnQfz0DaJgmLAnBYt2ljFEDFf6DPGDztKAGyePD08yjQ0nP0vtKjCWSrk5NT7Wpu6jYB6-W1A-wApfVlXNmWqmvoAIcM_6EUlWeQAq_7YSSSVjkKU.
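
The scraper above appends `&pageNumber=` to a product detail URL (`/dp/<ASIN>/...`), which may not actually page through reviews. One possible alternative, assuming Amazon serves paginated reviews under a `/product-reviews/<ASIN>/` path, is to derive the review URL from the ASIN. The path pattern and the helper names `extract_asin` and `build_review_url` are assumptions for illustration, not something the notebook confirms.

```python
import re
from urllib.parse import urlencode

# A 10-character ASIN following "/dp/" in a product URL (assumed URL layout).
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})")

def extract_asin(product_url):
    """Return the ASIN embedded in a product URL, or None if no ASIN is found."""
    match = ASIN_PATTERN.search(product_url)
    return match.group(1) if match else None

def build_review_url(asin, page_num):
    """Build a review-listing URL for one page (the path pattern is an assumption)."""
    query = urlencode({"pageNumber": page_num})
    return f"https://www.amazon.com/product-reviews/{asin}/?{query}"

product_url = "https://www.amazon.com/Dove-Body-Wash-Pump-Moisture/dp/B00MEDOY2G/ref=sr_1_5"
asin = extract_asin(product_url)
print(build_review_url(asin, 2))
```

Dropping the search-tracking parameters (`crid`, `dib`, `qid`, and so on) also keeps the logged URLs short enough to read.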

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

**Remove Noise (Special Characters and Punctuations)**

In [6]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import re

df = pd.read_csv('Amazon_Product_Reviews.csv')

def remove_noise(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters and punctuations
    return text

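# Empty reviews are read back from the CSV as NaN, so coerce them to empty strings first
df['cleaned_text'] = df['review_text'].fillna('').astype(str).apply(remove_noise)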

# Display the DataFrame with cleaned text
print(df[['review_text', 'cleaned_text']].head())

                                         review_text  \
0  The pump dispenser is a nice touch, making it ...   
1  I have a hard time finding smells that aren't ...   
2  I’ve been using this product for years. It’s g...   
3  This product is wonderful to use.  I have dry ...   
4  It's nice and creamy.  It leaves skin soft and...   

                                        cleaned_text  
0  The pump dispenser is a nice touch making it c...  
1  I have a hard time finding smells that arent t...  
2  Ive been using this product for years Its grea...  
3  This product is wonderful to use  I have dry a...  
4  Its nice and creamy  It leaves skin soft and s...  
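
Deleting punctuation outright can fuse neighboring words when a review omits the space after a period, which is likely how tokens such as `eczemaI...` appear in the later outputs. A minimal sketch of one alternative, replacing each punctuation character with a space and then collapsing whitespace, follows. The function name `remove_noise_spaced` is hypothetical and not part of the notebook.

```python
import re

def remove_noise_spaced(text):
    """Replace punctuation and special characters with spaces, then collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space instead of empty string
    text = text.replace("_", " ")         # \w also matches underscore, so handle it explicitly
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise_spaced("Great for eczema.It works, really!"))
# prints: Great for eczema It works really
```

The trade-off is that contractions are split as well (`It's` becomes `It s`), so this variant fits best when contractions are handled in an earlier step.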


**Remove Numbers**

In [7]:
def remove_numbers(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    return text

df['cleaned_text'] = df['cleaned_text'].apply(remove_numbers)

# Display the DataFrame with numbers removed
print(df[['review_text', 'cleaned_text']].head())


                                         review_text  \
0  The pump dispenser is a nice touch, making it ...   
1  I have a hard time finding smells that aren't ...   
2  I’ve been using this product for years. It’s g...   
3  This product is wonderful to use.  I have dry ...   
4  It's nice and creamy.  It leaves skin soft and...   

                                        cleaned_text  
0  The pump dispenser is a nice touch making it c...  
1  I have a hard time finding smells that arent t...  
2  Ive been using this product for years Its grea...  
3  This product is wonderful to use  I have dry a...  
4  Its nice and creamy  It leaves skin soft and s...  


**Remove Stop Words**

In [8]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

df['cleaned_text'] = df['cleaned_text'].apply(remove_stopwords)

# Display the DataFrame with stopwords removed
print(df[['review_text', 'cleaned_text']].head())


                                         review_text  \
0  The pump dispenser is a nice touch, making it ...   
1  I have a hard time finding smells that aren't ...   
2  I’ve been using this product for years. It’s g...   
3  This product is wonderful to use.  I have dry ...   
4  It's nice and creamy.  It leaves skin soft and...   

                                        cleaned_text  
0  pump dispenser nice touch making convenient us...  
1  hard time finding smells arent strong soap lea...  
2  Ive using product years great dry skin eczemaI...  
3  product wonderful use dry sensitive skin works...  
4  nice creamy leaves skin soft smooth sting burn...  


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
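
Because punctuation was stripped before this step, contractions such as `aren't` had already become `arent`, which survives the stopword filter (see row 1 above). The sketch below shows one possible ordering: normalize curly apostrophes, lowercase, filter stopwords, and only then strip the remaining punctuation. It assumes a tiny hand-written set, `DEMO_STOPWORDS`, as a stand-in for the NLTK list, and the function name is hypothetical.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOPWORDS = {"i", "have", "a", "that", "aren't", "i've", "it's", "the", "is"}

def clean_with_contractions(text):
    """Filter stopwords while contractions are still intact, then strip punctuation."""
    text = text.replace("\u2019", "'").lower()  # curly -> straight apostrophe, then lowercase
    kept = []
    for word in text.split():
        core = word.strip(".,!?;:\"()")  # trim edge punctuation but keep inner apostrophes
        if core and core not in DEMO_STOPWORDS:
            kept.append(core)
    return re.sub(r"[^\w\s]", "", " ".join(kept))  # remaining punctuation removed last

print(clean_with_contractions("I have a hard time finding smells that aren\u2019t too strong."))
# prints: hard time finding smells too strong
```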


**Lowercase All Texts**

In [9]:
def lowercase_text(text):
    return text.lower()

df['cleaned_text'] = df['cleaned_text'].apply(lowercase_text)

# Display the DataFrame with lowercase text
print(df[['review_text', 'cleaned_text']].head())

                                         review_text  \
0  The pump dispenser is a nice touch, making it ...   
1  I have a hard time finding smells that aren't ...   
2  I’ve been using this product for years. It’s g...   
3  This product is wonderful to use.  I have dry ...   
4  It's nice and creamy.  It leaves skin soft and...   

                                        cleaned_text  
0  pump dispenser nice touch making convenient us...  
1  hard time finding smells arent strong soap lea...  
2  ive using product years great dry skin eczemai...  
3  product wonderful use dry sensitive skin works...  
4  nice creamy leaves skin soft smooth sting burn...  


**Stemming**

In [10]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

df['stemmed_text'] = df['cleaned_text'].apply(stem_text)

# Display the DataFrame with stemming
print(df[['review_text', 'stemmed_text']].head())

                                         review_text  \
0  The pump dispenser is a nice touch, making it ...   
1  I have a hard time finding smells that aren't ...   
2  I’ve been using this product for years. It’s g...   
3  This product is wonderful to use.  I have dry ...   
4  It's nice and creamy.  It leaves skin soft and...   

                                        stemmed_text  
0  pump dispens nice touch make conveni use showe...  
1  hard time find smell arent strong soap leav sk...  
2  ive use product year great dri skin eczemait m...  
3  product wonder use dri sensit skin work great ...  
4  nice creami leav skin soft smooth sting burn c...  


**Lemmatization**

In [11]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

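df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)

# Display the DataFrame with lemmatization
print(df[['review_text', 'lemmatized_text']].head())

# Save the cleaned, stemmed, and lemmatized columns back to the CSV, as the question requires
df.to_csv('Amazon_Product_Reviews.csv', index=False)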

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


                                         review_text  \
0  The pump dispenser is a nice touch, making it ...   
1  I have a hard time finding smells that aren't ...   
2  I’ve been using this product for years. It’s g...   
3  This product is wonderful to use.  I have dry ...   
4  It's nice and creamy.  It leaves skin soft and...   

                                     lemmatized_text  
0  pump dispenser nice touch making convenient us...  
1  hard time finding smell arent strong soap leaf...  
2  ive using product year great dry skin eczemait...  
3  product wonderful use dry sensitive skin work ...  
4  nice creamy leaf skin soft smooth sting burn c...  
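
In the output above, `leaves` became `leaf` because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given. A minimal sketch of a tag mapping that could supply that argument follows. It assumes Penn Treebank tags such as those returned by `nltk.pos_tag`, and the helper name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the single-letter POS codes WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, mirroring the lemmatizer's own default

# Hypothetical usage with NLTK (requires the tagger and WordNet data to be downloaded):
#   tagged = nltk.pos_tag(text.split())
#   lemmas = [lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in tagged]

for tag in ["NNS", "VBZ", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tagging tends to be more reliable on the original sentences than on text with stopwords already removed, so this approach would likely work best if lemmatization ran before stopword removal.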


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [12]:
# Your code here

import spacy
from collections import Counter

# Load the English model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the cleaned text with spaCy
df['doc'] = df['lemmatized_text'].apply(nlp)

# Function to count POS tags
def count_pos(doc):
    pos_counts = Counter([token.pos_ for token in doc])
    return pos_counts

# Apply POS counting to each doc
df['pos_counts'] = df['doc'].apply(count_pos)

# Summarize the total counts across all texts
total_counts = Counter()
for pos_count in df['pos_counts']:
    total_counts.update(pos_count)

# Print the total counts for nouns, verbs, adjectives, and adverbs
print("Total POS counts across all reviews:")
print(f"Nouns (NOUN): {total_counts['NOUN']}")
print(f"Verbs (VERB): {total_counts['VERB']}")
print(f"Adjectives (ADJ): {total_counts['ADJ']}")
print(f"Adverbs (ADV): {total_counts['ADV']}")

Total POS counts across all reviews:
Nouns (NOUN): 18868
Verbs (VERB): 8847
Adjectives (ADJ): 10219
Adverbs (ADV): 2917


In [16]:
%pip install spacy benepar



In [18]:
import spacy
import benepar
from collections import Counter

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")

# Download benepar model if not already available
benepar.download('benepar_en3')

# Add benepar to spaCy pipeline if not already present
if "benepar" not in nlp.pipe_names:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# Example DataFrame for demonstration. Note: this overwrites the review `df` built above; replace it with your actual cleaned text DataFrame
import pandas as pd
df = pd.DataFrame({
    'lemmatized_text': ["The product works well and is very efficient.",
                        "This bike is great for long rides.",
                        "The car is expensive but worth it."]
})

# Process the cleaned text with spaCy
df['doc'] = df['lemmatized_text'].apply(nlp)

# Task 1: POS Tagging and Count of Nouns, Verbs, Adjectives, Adverbs
def count_pos(doc):
    pos_counts = Counter([token.pos_ for token in doc])
    return pos_counts

df['pos_counts'] = df['doc'].apply(count_pos)

# Summarize the total counts across all texts
total_counts = Counter()
for pos_count in df['pos_counts']:
    total_counts.update(pos_count)

print("Total POS counts across all reviews:")
print(f"Nouns (NOUN): {total_counts['NOUN']}")
print(f"Verbs (VERB): {total_counts['VERB']}")
print(f"Adjectives (ADJ): {total_counts['ADJ']}")
print(f"Adverbs (ADV): {total_counts['ADV']}")
print()

# Task 2: Constituency Parsing and Dependency Parsing
for doc in df['doc']:
    for sent in doc.sents:
        if sent._.parse_string:  # Check if benepar has successfully parsed the sentence
            print("Constituency Parse Tree:")
            print(sent._.parse_string)  # Print constituency tree

        print("\nDependency Parse Tree:")
        for token in sent:
            print(f"{token.text} --> {token.dep_} --> {token.head.text}")  # Print dependency tree
        print()

# Example sentence for explanation
example_sentence = df['lemmatized_text'].iloc[0]
doc = nlp(example_sentence)
sent = list(doc.sents)[0]

if sent._.parse_string:
    print("\nExample Constituency Parsing Tree:")
    print(sent._.parse_string)  # Example constituency tree

print("\nExample Dependency Parsing Tree:")
for token in sent:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")  # Example dependency tree
print()

# Task 3: Named Entity Recognition (NER) and Counting Entities
def extract_entities(doc):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

df['entities'] = df['doc'].apply(extract_entities)

def count_entity_types(entities):
    entity_counts = Counter([ent[1] for ent in entities])
    return entity_counts

df['entity_counts'] = df['entities'].apply(count_entity_types)

# Summarize the total entity counts across all reviews
total_entity_counts = Counter()
for entity_count in df['entity_counts']:
    total_entity_counts.update(entity_count)

print("Total entity counts across all reviews:")
print(f"Person (PERSON): {total_entity_counts['PERSON']}")
print(f"Organizations (ORG): {total_entity_counts['ORG']}")
print(f"Locations (GPE): {total_entity_counts['GPE']}")
print(f"Products (PRODUCT): {total_entity_counts['PRODUCT']}")
print(f"Dates (DATE): {total_entity_counts['DATE']}")

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!
  state_dict = torch.load(
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Total POS counts across all reviews:
Nouns (NOUN): 4
Verbs (VERB): 1
Adjectives (ADJ): 5
Adverbs (ADV): 2

Constituency Parse Tree:
(S (NP (DT The) (NN product)) (VP (VP (VBZ works) (ADVP (RB well))) (CC and) (VP (VBZ is) (ADJP (RB very) (JJ efficient)))) (. .))

Dependency Parse Tree:
The --> det --> product
product --> nsubj --> works
works --> ROOT --> works
well --> advmod --> works
and --> cc --> works
is --> conj --> works
very --> advmod --> efficient
efficient --> acomp --> is
. --> punct --> works

Constituency Parse Tree:
(S (NP (DT This) (NN bike)) (VP (VBZ is) (ADJP (JJ great) (PP (IN for) (NP (JJ long) (NNS rides))))) (. .))

Dependency Parse Tree:
This --> det --> bike
bike --> nsubj --> is
is --> ROOT --> is
great --> acomp --> is
for --> prep --> is
long --> amod --> rides
rides --> pobj --> for
. --> punct --> is

Constituency Parse Tree:
(S (NP (DT The) (NN car)) (VP (VBZ is) (ADJP (ADJP (JJ expensive)) (CC but) (ADJP (JJ worth) (NP (PRP it))))) (. .))

Dependency Par
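
To help read the first example, the dependency triples printed above can be regrouped under their heads. The sketch below hard-codes those triples and assumes every token text in the sentence is unique; `build_children` and `render` are hypothetical helper names, not part of spaCy.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the first example output above
TRIPLES = [
    ("The", "det", "product"),
    ("product", "nsubj", "works"),
    ("works", "ROOT", "works"),
    ("well", "advmod", "works"),
    ("and", "cc", "works"),
    ("is", "conj", "works"),
    ("very", "advmod", "efficient"),
    ("efficient", "acomp", "is"),
    (".", "punct", "works"),
]

def build_children(triples):
    """Map each head to its (child, label) list; return the map and the root token."""
    children = defaultdict(list)
    root = None
    for token, dep, head in triples:
        if dep == "ROOT":
            root = token
        else:
            children[head].append((token, dep))
    return children, root

def render(children, node, label="ROOT", depth=0):
    """Return indented lines, one per token, placing each token under its head."""
    lines = [f"{'  ' * depth}{node} ({label})"]
    for child, dep in children.get(node, []):
        lines.extend(render(children, child, dep, depth + 1))
    return lines

children, root = build_children(TRIPLES)
tree_lines = render(children, root)
print("\n".join(tree_lines))
```

In the dependency view, `works` is the root, `product` is its subject, and `is` attaches to it as a conjunct with `efficient` as its complement. The constituency tree for the same sentence instead groups the words into nested phrases (NP, VP, ADVP, ADJP) under a single S node.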



In [19]:
# Extract named entities and their types
def extract_entities(doc):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Apply the entity extraction
df['entities'] = df['doc'].apply(extract_entities)

# Function to count entity types
def count_entity_types(entities):
    entity_counts = Counter([ent[1] for ent in entities])
    return entity_counts

# Apply the entity counting function
df['entity_counts'] = df['entities'].apply(count_entity_types)

# Summarize the total entity counts across all reviews
total_entity_counts = Counter()
for entity_count in df['entity_counts']:
    total_entity_counts.update(entity_count)

# Print out counts of specific entities of interest
print("Total entity counts across all reviews:")
print(f"Person (PERSON): {total_entity_counts['PERSON']}")
print(f"Organizations (ORG): {total_entity_counts['ORG']}")
print(f"Locations (GPE): {total_entity_counts['GPE']}")
print(f"Products (PRODUCT): {total_entity_counts['PRODUCT']}")
print(f"Dates (DATE): {total_entity_counts['DATE']}")

Total entity counts across all reviews:
Person (PERSON): 0
Organizations (ORG): 0
Locations (GPE): 0
Products (PRODUCT): 0
Dates (DATE): 0


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
https://docs.google.com/spreadsheets/d/1aPcT9KPIBToGAZGmC5O-yXVNwoRTn_AyVRDlLl-QA1E/edit?usp=sharing

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
# It is important to break the assignment down into pieces and allow extra time to actually work through and learn each part