
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.
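
For option (2), a scraper tends to crash when a review is missing a title or a rating, because `find` returns `None`. The following is a minimal sketch of None-safe field extraction, assuming BeautifulSoup is installed. The HTML snippet is hand-written for illustration, and the `rating-other-user-rating` class and the `text_or_default` helper are assumptions for this example, not taken from the live IMDb page or from the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that mimics the container classes used in this notebook.
SAMPLE_HTML = """
<div class="review-container">
  <a class="title"> A stunning epic </a>
  <div class="text show-more__control">Loved every minute of it.</div>
  <span class="rating-other-user-rating"><span>9</span><span>/10</span></span>
</div>
<div class="review-container">
  <a class="title">Too long</a>
  <div class="text show-more__control">It dragged in the middle.</div>
</div>
"""

def text_or_default(node, default="N/A"):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return node.get_text(strip=True) if node is not None else default

def parse_reviews(html):
    """Turn every review container in the HTML into a dict of fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for container in soup.find_all("div", class_="review-container"):
        rating_tag = container.find("span", class_="rating-other-user-rating")
        rating_value = rating_tag.find("span") if rating_tag is not None else None
        rows.append({
            "Review Title": text_or_default(container.find("a", class_="title")),
            "Review": text_or_default(container.find("div", class_="text")),
            "Rating": text_or_default(rating_value),
        })
    return rows

rows = parse_reviews(SAMPLE_HTML)
for row in rows:
    print(row)
```

The second review has no rating element, so its `Rating` falls back to `"N/A"` instead of raising an `AttributeError`.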


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# List of IMDb movie review links for 2023 and 2024 movies
movie_urls = [
    "https://www.imdb.com/title/tt0974015/reviews?ref_=nmawd_awd_1",  # Barbie
    "https://www.imdb.com/title/tt15398776/reviews?ref_=nmawd_awd_1",  # Oppenheimer
    "https://www.imdb.com/title/tt12844910/reviews?ref_=nmawd_awd_1",  # Guardians of the Galaxy Vol. 3
    "https://www.imdb.com/title/tt5537002/reviews?ref_=nmawd_awd_1",   # Killers of the Flower Moon
    "https://www.imdb.com/title/tt15284066/reviews?ref_=nmawd_awd_1",  # Dune: Part Two
    "https://www.imdb.com/title/tt6782946/reviews?ref_=nmawd_awd_1",   # The Marvels
    "https://www.imdb.com/title/tt10327262/reviews?ref_=nmawd_awd_1",  # John Wick: Chapter 4
    "https://www.imdb.com/title/tt9366396/reviews?ref_=nmawd_awd_1",   # Spider-Man: Across the Spider-Verse
    "https://www.imdb.com/title/tt5123916/reviews?ref_=nmawd_awd_1",   # Fast X
    "https://www.imdb.com/title/tt9603212/reviews?ref_=nmawd_awd_1",   # Mission: Impossible – Dead Reckoning Part One
    "https://www.imdb.com/title/tt0439572/reviews?ref_=nmawd_awd_1",   # The Flash
    "https://www.imdb.com/title/tt12844910/reviews?ref_=nmawd_awd_1",  # Ant-Man and The Wasp: Quantumania
    "https://www.imdb.com/title/tt10402542/reviews?ref_=nmawd_awd_1",  # Transformers: Rise of the Beasts
    "https://www.imdb.com/title/tt14708376/reviews?ref_=nmawd_awd_1",  # Aquaman and the Lost Kingdom
    "https://www.imdb.com/title/tt2027122/reviews?ref_=nmawd_awd_1",   # Elemental
    "https://www.imdb.com/title/tt10298830/reviews?ref_=nmawd_awd_1"   # Nimona
]

# Create a DataFrame to hold all the reviews
# (the first column stores each review's headline, not the movie's name)
reviews_df = pd.DataFrame(columns=["Review Title", "Review", "Rating"])

# Function to fetch reviews from a given URL
def get_reviews(url):
    page_number = 1
    while True:
        # Construct the complete URL with the page number
        full_url = f"{url}&page={page_number}"
        print(f"Collecting reviews from {full_url} (Page {page_number})")

        # Set headers for the request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
        }

        response = requests.get(full_url, headers=headers, timeout=10)  # timeout avoids hanging on a stalled connection

        # Check if the page was fetched successfully
        if response.status_code != 200:
            print(f"Failed to fetch page {page_number}. Status code: {response.status_code}")
            break

        # Parse the page content
        soup = BeautifulSoup(response.text, 'html.parser')
        review_containers = soup.find_all("div", class_="review-container")

        # If no more reviews are found, break the loop
        if not review_containers:
            print(f"No more reviews found on page {page_number} of {full_url}.")
            break

        # Extract review details from each review container
        for review in review_containers:
            title_tag = review.find("a", class_="title")
            content_tag = review.find("div", class_="text show-more__control")
            # Auto-generated class name kept from the original; it may not match the live page
            rating_tag = review.find("span", class_="sc-16z0m2g-0 eGgqDq")

            # Fall back to "N/A" when a field is missing instead of raising AttributeError
            title = title_tag.text.strip() if title_tag is not None else "N/A"
            content = content_tag.text.strip() if content_tag is not None else "N/A"
            rating = rating_tag.text.strip() if rating_tag is not None else "N/A"
            # Add the review details to the DataFrame
            reviews_df.loc[len(reviews_df)] = [title, content, rating]

        print(f"Fetched {len(review_containers)} reviews from {full_url} (Page {page_number}).")
        page_number += 1
        time.sleep(1)  # Pause to be respectful of the website's resources

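# Fetch reviews from all initial movie links, skipping repeated URLs
# (tt12844910 is listed twice above, under two different titles)
for movie_url in dict.fromkeys(movie_urls):
    get_reviews(movie_url)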

# If fewer than 1000 reviews were collected, fetch a second batch of movies once.
# (A `while` loop here would re-fetch the same links forever if they never reach 1000.)
if len(reviews_df) < 1000:
    print(f"Only {len(reviews_df)} reviews collected. Fetching additional movies.")
    # Add additional movie links
    additional_movie_urls = [
        "https://www.imdb.com/title/tt13560574/reviews?ref_=nmawd_awd_1",  # A Haunting in Venice
        "https://www.imdb.com/title/tt15339312/reviews?ref_=nmawd_awd_1",  # The Equalizer 3
        "https://www.imdb.com/title/tt1630029/reviews?ref_=nmawd_awd_1",   # Asteroid City
        "https://www.imdb.com/title/tt15484934/reviews?ref_=nmawd_awd_1"   # The Hunger Games: The Ballad of Songbirds and Snakes
    ]
    movie_urls.extend(additional_movie_urls)

    # Fetch reviews from the additional links only
    for movie_url in additional_movie_urls:
        get_reviews(movie_url)

# Print the final number of reviews collected
print(f"Total reviews collected: {len(reviews_df)}")

# Save all reviews to a CSV file
reviews_df.to_csv('imdb_reviews.csv', index=False)
print("All the reviews are fetched into the imdb_reviews.csv.")


Collecting reviews from https://www.imdb.com/title/tt0974015/reviews?ref_=nmawd_awd_1&page=1 (Page 1)
No more reviews found on page 1 of https://www.imdb.com/title/tt0974015/reviews?ref_=nmawd_awd_1&page=1.
Collecting reviews from https://www.imdb.com/title/tt15398776/reviews?ref_=nmawd_awd_1&page=1 (Page 1)
Fetched 25 reviews from https://www.imdb.com/title/tt15398776/reviews?ref_=nmawd_awd_1&page=1 (Page 1).
Collecting reviews from https://www.imdb.com/title/tt15398776/reviews?ref_=nmawd_awd_1&page=2 (Page 2)
No more reviews found on page 2 of https://www.imdb.com/title/tt15398776/reviews?ref_=nmawd_awd_1&page=2.
Collecting reviews from https://www.imdb.com/title/tt12844910/reviews?ref_=nmawd_awd_1&page=1 (Page 1)
Fetched 25 reviews from https://www.imdb.com/title/tt12844910/reviews?ref_=nmawd_awd_1&page=1 (Page 1).
Collecting reviews from https://www.imdb.com/title/tt12844910/reviews?ref_=nmawd_awd_1&page=2 (Page 2)
Fetched 25 reviews from https://www.imdb.com/title/tt12844910/revie
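
The URL list above contains the title ID tt12844910 twice, so the collected frame can hold repeated rows. One possible safeguard is to drop exact duplicates before saving. The following is a minimal sketch, assuming pandas; the rows and column names are made up for illustration and stand in for the scraped data.

```python
import pandas as pd

# Hypothetical rows standing in for scraped reviews; the last two are identical.
reviews = pd.DataFrame(
    [
        ["Great", "Loved it.", "9"],
        ["Meh", "Too long.", "5"],
        ["Meh", "Too long.", "5"],
    ],
    columns=["Review Title", "Review", "Rating"],
)

# Keep the first occurrence of each (title, text) pair and renumber the rows.
deduped = reviews.drop_duplicates(subset=["Review Title", "Review"]).reset_index(drop=True)
print(f"{len(reviews)} rows before, {len(deduped)} rows after")
```

Checking `len(deduped)` rather than the raw row count gives a more honest picture of whether the 1000-review target was reached.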

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
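
Since output is required for each part, it can help to keep the intermediate result of every step instead of only the final string. The following is a minimal sketch of steps (1) to (4) using only the standard library. The `STOP_WORDS` set and the `cleaning_steps` helper are hypothetical and exist only for this example; the notebook itself uses NLTK's full English stopword list, and stemming and lemmatization (steps 5 and 6) still need NLTK.

```python
import re

# Tiny illustrative stopword set (hypothetical); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "was", "and", "it", "a", "in", "of", "i"}

def cleaning_steps(text):
    """Return the text after each step so every stage can be printed and checked."""
    steps = {}
    # (1) Remove punctuation and special characters, keeping letters, digits, and spaces
    steps["no_noise"] = re.sub(r"[^\w\s]|_", "", text)
    # (2) Remove digits
    steps["no_numbers"] = re.sub(r"\d+", "", steps["no_noise"])
    # (4) Lowercase first, so the stopword lookup below is case-insensitive
    steps["lowercase"] = steps["no_numbers"].lower()
    # (3) Remove stopwords; split() also collapses the extra spaces left by earlier steps
    kept = [word for word in steps["lowercase"].split() if word not in STOP_WORDS]
    steps["no_stopwords"] = " ".join(kept)
    return steps

sample = "The movie was AMAZING!!! I watched it 3 times in 2023."
steps = cleaning_steps(sample)
for name, value in steps.items():
    print(f"{name}: {value}")
```

Lowercasing before the stopword lookup matters here: a capitalized "The" would otherwise not match the lowercase entry in the stopword set.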

In [22]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk

# Download necessary NLTK resources if not done already
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file containing the reviews
reviews_df = pd.read_csv('imdb_reviews.csv')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Get English stopwords
stop_words = set(stopwords.words('english'))

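# Define a function to clean the text
def clean_text(text):
    # Missing reviews are read by pandas as NaN (a float); treat them as empty text
    if not isinstance(text, str):
        return ''

    # Remove noise and numbers: drop anything that isn't a letter or whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)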

    # Lowercase all texts
    text = text.lower()

    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)

    # Stemming
    text = ' '.join(stemmer.stem(word) for word in text.split())

    # Lemmatization
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split())

    return text

# Apply the cleaning function to the 'Review' column and store the result in a new column 'Cleaned Review'
reviews_df['Cleaned Review'] = reviews_df['Review'].apply(clean_text)

# Save the cleaned data to a new CSV file
reviews_df.to_csv('imdb_reviews_cleaned.csv', index=False)
print("Cleaned data has been saved to 'imdb_reviews_cleaned.csv'.")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data has been saved to 'imdb_reviews_cleaned.csv'.


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.
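
A constituency tree groups words into nested phrases (noun phrase, verb phrase, and so on), whereas a dependency tree links each word directly to its head word. The following is a minimal sketch of a constituency parse using NLTK's chart parser, which needs no downloaded models. The grammar is a toy one written only for this example sentence, so it is an illustration of the tree structure and not a general-purpose parser.

```python
import nltk

# Toy grammar (hypothetical) that covers only the example sentence below.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP
Det -> 'the'
Adj -> 'young'
N -> 'director' | 'film'
V -> 'made'
""")

parser = nltk.ChartParser(grammar)
tokens = "the young director made the film".split()

# parse() yields every tree the grammar allows for these tokens
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

In the printed tree, "the young director" forms one noun phrase (the subject) and "made the film" forms the verb phrase, which itself contains the object noun phrase "the film". A dependency parse of the same sentence would instead attach "director" and "film" to the verb "made" as its subject and object.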

In [28]:
import pandas as pd
import nltk
import spacy
from collections import Counter

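nltk.download('punkt')  # tokenizer models required by nltk.word_tokenize
nltk.download('averaged_perceptron_tagger')

# Read the cleaned reviews saved by the previous cell
df = pd.read_csv('imdb_reviews_cleaned.csv')
# Reviews that were empty after cleaning are read back as NaN; replace them with ''
df['Cleaned Review'] = df['Cleaned Review'].fillna('')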

# Load spacy model for Dependency Parsing and Named Entity Recognition
nlp = spacy.load("en_core_web_sm")

# POS Tagging: Tag Parts of Speech and count Nouns, Verbs, Adjectives, Adverbs
def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)

    # Count the Nouns, Verbs, Adjectives, Adverbs
    pos_counts = Counter(tag for word, tag in pos_tags)

    # Simplify the POS tag categories for Nouns, Verbs, Adjectives, Adverbs
    noun_count = pos_counts['NN'] + pos_counts['NNS'] + pos_counts['NNP'] + pos_counts['NNPS']
    verb_count = pos_counts['VB'] + pos_counts['VBD'] + pos_counts['VBG'] + pos_counts['VBN'] + pos_counts['VBP'] + pos_counts['VBZ']
    adj_count = pos_counts['JJ'] + pos_counts['JJR'] + pos_counts['JJS']
    adv_count = pos_counts['RB'] + pos_counts['RBR'] + pos_counts['RBS']

    return pos_tags, noun_count, verb_count, adj_count, adv_count

# Function to print POS tagging results
def analyze_pos(df):
    total_nouns, total_verbs, total_adjectives, total_adverbs = 0, 0, 0, 0
    for text in df['Cleaned Review']:
        _, nouns, verbs, adjectives, adverbs = pos_tagging(text)
        total_nouns += nouns
        total_verbs += verbs
        total_adjectives += adjectives
        total_adverbs += adverbs

    print(f"Total Nouns: {total_nouns}, Verbs: {total_verbs}, Adjectives: {total_adjectives}, Adverbs: {total_adverbs}")

# Dependency Parsing using spaCy
# (spaCy has no built-in constituency parser, so only dependency parses are printed)
def parse_sentences(df, max_reviews=1):
    # Print the parse of the first `max_reviews` reviews to keep the output readable
    for index, row in df.head(max_reviews).iterrows():
        review = row['Cleaned Review']
        doc = nlp(review)

        # Each line shows: token --> dependency relation --> head token
        print(f"\nDependency Parsing of review {index + 1}:")
        for token in doc:
            print(f"{token.text} --> {token.dep_} --> {token.head.text}")

# Named Entity Recognition (NER)
def named_entity_recognition(df):
    entity_counts = Counter()
    for text in df['Cleaned Review']:
        doc = nlp(text)
        for ent in doc.ents:
            entity_counts[ent.label_] += 1
            print(f"Entity: {ent.text}, Label: {ent.label_}")

    print("\nNamed Entity Counts:")
    for entity, count in entity_counts.items():
        print(f"{entity}: {count}")

# POS analysis
analyze_pos(df)

# Dependency Parsing
parse_sentences(df)

# Named Entity Recognition
named_entity_recognition(df)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Streaming output truncated to the last 5000 lines.
Entity: half, Label: CARDINAL
Entity: extrem surpris, Label: PERSON
Entity: sam levinson name, Label: PERSON
Entity: gore, Label: PERSON
Entity: third, Label: ORDINAL
Entity: extrem repuls, Label: PERSON
Entity: third, Label: ORDINAL
Entity: akin jeeper creeper came, Label: ORG
Entity: materi, Label: PERSON
Entity: third, Label: ORDINAL
Entity: crusheri, Label: NORP
Entity: ti west, Label: PERSON
Entity: nightmar imagin place, Label: ORG
Entity: chang one, Label: PERSON
Entity: yearn bygon, Label: PERSON
Entity: melancholi, Label: GPE
Entity: everi, Label: NORP
Entity: realli enjoy, Label: PERSON
Entity: first, Label: ORDINAL
Entity: half, Label: CARDINAL
Entity: happen chang, Label: PERSON
Entity: gore, Label: PERSON
Entity: realli, Label: PERSON
Entity: movi good chanc, Label: ORG
Entity: first, Label: ORDINAL
Entity: first, Label: ORDINAL
Entity: american, Label: NORP
Entity: teenagersecond charact sleazi stori sleazi 

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [30]:
from google.colab import files
files.download('imdb_reviews_cleaned.csv')



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
'''
Challenges I faced:
1. Data Cleaning: The initial data cleaning process required multiple iterations to ensure all unwanted characters,
numbers, and stopwords were correctly removed. Fine-tuning the regex patterns and ensuring the results met the
specifications took considerable effort.
2. Understanding NLP Techniques: Getting comfortable with concepts like POS tagging, dependency parsing, and named
entity recognition was initially overwhelming. There were moments when figuring out how to apply these techniques
effectively in code felt challenging, especially when encountering errors.
3. Environment Setup: Setting up the necessary libraries and ensuring everything ran smoothly in Google Colab
took some time. Managing dependencies and handling errors related to package installations could be frustrating.

Enjoyable Aspects:
1. Learning Opportunities: The assignment provided a great opportunity to deepen my understanding of Natural
Language Processing (NLP) techniques. I enjoyed exploring how different text analysis methods could yield insights
from the cleaned reviews.
2. Visualization: The ability to visualize parsing trees with spaCy was particularly enjoyable. It helped me better
understand the relationships between words in a sentence and how they function together syntactically.
3. Real-World Application: Working with movie reviews made the assignment feel relevant and engaging. Analyzing
actual sentiments expressed in reviews was interesting, and it provided a glimpse into how sentiment analysis
could be useful in various fields.

Opinion on the Time Provided:
The timeframe for completing the assignment felt reasonable, but a bit more time could have allowed for deeper
exploration of the NLP concepts. While the initial phases of cleaning and analyzing data were manageable,
the complexity of the analyses might require additional time for those unfamiliar with the tools or concepts.
Overall, the deadline was sufficient but could benefit from slight adjustments for future assignments to
accommodate varying levels of experience among students.

'''