<a href="https://colab.research.google.com/github/hemareddyyanala/HemaReddy_INFO5731_Fall2024/blob/main/Yanala_Hema_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests  #importing this library to send HTTP requests and handle responses
from bs4 import BeautifulSoup  # importing bs4 library to parse HTML and extract data
import pandas as pd  # importing pandas for data manipulation

In [2]:
# creating a function to send an HTTP GET request to the IMDb URL and return the parsed HTML using BeautifulSoup.
def retrieve_page_content(url):
    try:
        #sending a GET request to the URL and storing the response
        response = requests.get(url)
        # Using the if condition to check if the request was successful
        if response.status_code == 200:
            # if true, we parse the response content using BeautifulSoup and return it
            return BeautifulSoup(response.content, 'html.parser')
        else:
            # if not, we print an error message if the status code is not 200
            print(f"Request failed, the status code is {response.status_code}")
            return None
    except Exception as e:
        # printing an error message if the request fails
        print(f"sorry, failed to retrieve page content: {e}")
        return None

# defining a function to extract reviews from the parsed HTML content
def extract_reviews_from_page(soup):
    #finding all the review elements using their specific HTML class
    review_elements = soup.select("div.text.show-more__control")
    #initializing an empty list to store the cleaned reviews
    reviews = []
    # using for loop to go through each review element and get its text content
    for review in review_elements:
        # now, stripping any surrounding whitespace and adding it to the list
        cleaned_review = review.get_text(strip=True)
        reviews.append(cleaned_review)
    # Returning the list of cleaned reviews
    return reviews

# defining a function to get the pagination key for getting more reviews
def get_pagination_key(soup):
    # Finding the 'div' element with class 'load-more-data', as it contains the pagination key
    load_more_section = soup.find("div", class_="load-more-data")
    #if the load-more section is found, return the value of pagination key
    if load_more_section:
        return load_more_section.get("data-key")
    # It returns None if there is no pagination key found
    return None

# this is the IMDb URL setup for collecting reviews for the movie Furiosa: A Mad Max Saga (2024)
base_imdb_url = "https://www.imdb.com/title/tt6263850/reviews/_ajax?ref_=undefined"
#initiating an empty list for collected reviews and start with no pagination key
all_reviews = []  # empty List
pagination_token = None  # Initializing pagination_token as None

# using while loop to repeat the process of collecting reviews until we get 1000
while len(all_reviews) < 1000:
    # using if to check if there's a pagination token, then we append it to the base URL
    current_url = base_imdb_url
    if pagination_token:
        current_url += f"&paginationKey={pagination_token}"  #here, we append the pagination key to the URL

    #fetching the page content using the IMDb URL with the current pagination key
    soup = retrieve_page_content(current_url)
    #exiting the loop if we can't fetch the page content using break
    if soup is None:
        break

    #extracting the reviews from the current page using the extract_reviews_from_page() function
    reviews_on_page = extract_reviews_from_page(soup)
    #add the extracted reviews to the all_reviews list
    all_reviews.extend(reviews_on_page)
    #if the required number of reviews is collected, stop the loop using break
    if len(all_reviews) >= 1000:
        break
    #fetching the pagination key for the next page of reviews
    pagination_token = get_pagination_key(soup)
    # If no pagination token is found, break the process
    if pagination_token is None:
        print("No more reviews to load.")
        break

#truncating the review list to exactly 1000 reviews to ensure we don't collect more
collected_reviews = all_reviews[:1000]
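
The loop above appends the pagination key to the URL by string concatenation. Keys can contain characters such as `+`, `/` or `=` that have a special meaning in a query string, so a hedged alternative is to let `urllib.parse.urlencode` build the query. The helper name `build_review_url` and the sample key are made up for illustration; this is a sketch, not the code that produced the results in this notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the cell above, without its query string.
BASE_IMDB_URL = "https://www.imdb.com/title/tt6263850/reviews/_ajax"

def build_review_url(base_url, pagination_key=None):
    """Return the review URL, adding paginationKey only when a key is available."""
    params = {"ref_": "undefined"}
    if pagination_key:
        params["paginationKey"] = pagination_key
    # urlencode percent-encodes characters that are unsafe in a query string
    return f"{base_url}?{urlencode(params)}"

first_page = build_review_url(BASE_IMDB_URL)
next_page = build_review_url(BASE_IMDB_URL, "abc+/=123")  # made-up key for illustration
print(first_page)
print(next_page)
```

The first call reproduces the original base URL, and later calls differ only in the encoded `paginationKey` parameter.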


In [3]:
#converting the list of collected reviews into a DataFrame using pandas
imdb_reviews_df = pd.DataFrame(collected_reviews, columns=["Review"])
# Saving the DataFrame to a CSV file named '1000_imdb_reviews.csv'
imdb_reviews_df.to_csv("1000_imdb_reviews.csv", index=False)

print(f"Collected {len(collected_reviews)} reviews and saved it to '1000_imdb_reviews.csv'.")


Collected 1000 reviews and saved it to '1000_imdb_reviews.csv'.


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [4]:
imdb_reviews_df = pd.read_csv("1000_imdb_reviews.csv") #Loading the reviews from the CSV file '1000_imdb_reviews.csv' into a DataFrame

####(1) Remove noise, such as special characters and punctuations.

In [5]:
import re  #Importing the regular expression module

#removing punctuation and special characters from the 'Review' column in the DataFrame
# and store the cleaned reviews in a new column called 'Cleaned_Reviews'
imdb_reviews_df['Cleaned_Reviews'] = imdb_reviews_df['Review'].str.replace(r'[^\w\s]', '', regex=True)

In [6]:
print("Output after Removing Noise, such as special characters and punctuations:")
print(imdb_reviews_df.head())

Output after Removing Noise, such as special characters and punctuations:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  Hugh Jackman is the perfect Wolverine What a f...  
1  What a crazy blast  Bonkers Sooo \nWhat I can ...  
2  Weve waited so long for this moment and it was...  
3  So many Easter Eggs so true to the comic chara...  
4  I read an IGN review where the guy gave it a 7...  
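
To make the effect of the pattern `[^\w\s]` concrete, the sketch below applies it to a made-up sentence. It also shows two side effects: `\w` matches the underscore, so underscores survive, and deleting punctuation outright can glue neighbouring words together or leave double spaces. The second variant, which substitutes a space and then collapses whitespace, is an assumption offered for comparison and is not what the notebook runs.

```python
import re

sample = "We've waited so long -- 10/10, a must_see!!"  # made-up example text

# Same pattern as the cell above: delete anything that is not a word character or whitespace.
no_punct = re.sub(r"[^\w\s]", "", sample)
print(no_punct)

# Variant (assumption): replace punctuation and underscores with a space,
# then collapse runs of whitespace so words are never fused together.
spaced = re.sub(r"[^\w\s]|_", " ", sample)
spaced = re.sub(r"\s+", " ", spaced).strip()
print(spaced)
```

Fused tokens such as `poorThe` in the outputs may come from the first behaviour, although the scraping step can also contribute to them.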


####(2) Remove numbers.

In [7]:
#remove all numeric characters from the 'Cleaned_Reviews' column in the DataFrame
imdb_reviews_df['Cleaned_Reviews'] = imdb_reviews_df['Cleaned_Reviews'].str.replace(r'\d+', '', regex=True)

# Display the 'Review' and 'Cleaned_Reviews' columns
print("Output after Removing Numbers:")
print(imdb_reviews_df.head())  # Display original review and cleaned review

Output after Removing Numbers:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  Hugh Jackman is the perfect Wolverine What a f...  
1  What a crazy blast  Bonkers Sooo \nWhat I can ...  
2  Weve waited so long for this moment and it was...  
3  So many Easter Eggs so true to the comic chara...  
4  I read an IGN review where the guy gave it a  ...  
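
The pattern `\d+` removes every run of digits, including digits inside tokens, and leaves the surrounding spaces behind (visible as the double space in row 4). A minimal sketch on a made-up sentence, with an alternative pattern that only removes standalone numbers; the alternative is an assumption, not the notebook's method.

```python
import re

sample = "rated 7 out of 10 in 4k"  # made-up example text

# Pattern used in the cell above: strips all digits, even inside "4k".
all_digits_removed = re.sub(r"\d+", "", sample)
print(repr(all_digits_removed))

# Alternative (assumption): remove only standalone numbers, then tidy the spacing.
standalone_removed = re.sub(r"\b\d+\b", "", sample)
standalone_removed = re.sub(r"\s+", " ", standalone_removed).strip()
print(repr(standalone_removed))
```

Which behaviour is preferable depends on whether tokens such as `4k` carry meaning for the later analysis.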


####(3) Remove stopwords by using the stopwords list.

In [8]:
import nltk  #importing the Natural Language Toolkit (NLTK) library for NLP
from nltk.corpus import stopwords  #importing the stopwords list from NLTK
from nltk.tokenize import word_tokenize  #import the word_tokenize function for splitting text into individual words

nltk.download('stopwords')  #downloading the stopwords dataset, which includes common words like 'and', 'the', etc.
nltk.download('punkt')  # Downloading the Punkt tokenizer models for tokenizing sentences and words


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# Getting the set of stopwords for English
stop_words = set(stopwords.words('english'))
clean_reviews = []  # Initializing an empty list to store cleaned reviews

# Using a for loop to iterate through each review in the Cleaned_Reviews column
for text in imdb_reviews_df['Cleaned_Reviews']:
    tokens = word_tokenize(text)  # Tokenizing the review text
    filtered_words = []  # Initializing an empty list to store words that are not stopwords

    # Using a for loop to check each token and add it to the filtered_words list if it's not a stopword
    for word in tokens:
        if word.lower() not in stop_words:  # Check in lowercase to match stopwords
            filtered_words.append(word)  # Append only non-stopwords

    # Joining the filtered words back into a single string and adding it to clean_reviews
    clean_reviews.append(' '.join(filtered_words))  # Properly join words with a space

# Assigning the cleaned reviews back to the DataFrame
imdb_reviews_df['Cleaned_Reviews'] = clean_reviews

# Printing the DataFrame after removing stopwords
print("\nOutput after Removing Stopwords:")
print(imdb_reviews_df.head())



Output after Removing Stopwords:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  Hugh Jackman perfect Wolverine fun movie like ...  
1  crazy blast Bonkers Sooo say movie whole team ...  
2  Weve waited long moment beyond fun wholesome f...  
3  many Easter Eggs true comic characters may pos...  
4  read IGN review guy gave story poorThe guy rea...  
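
Because punctuation was removed before this step, contractions such as `didn't` have already become `didnt` and no longer match the stopword list, which is why forms like `Weve` and `didnt` remain in the cleaned text. The sketch below illustrates the effect of the step order. It uses a tiny hand-picked stopword set and whitespace splitting as stand-ins for the NLTK list and `word_tokenize`, so it is an illustration under those assumptions rather than a reproduction of the cell above.

```python
import re

# Hand-picked subset for illustration only; the notebook uses the full NLTK list.
STOPWORDS = {"i", "it", "but", "did", "didn't"}

text = "I liked it but didn't love it"  # fragment of the example review

# Order used in the notebook: strip punctuation first, then filter stopwords.
stripped_first = re.sub(r"[^\w\s]", "", text).split()
punct_then_stop = [w for w in stripped_first if w.lower() not in STOPWORDS]
print(punct_then_stop)

# Alternative order: filter stopwords on the raw tokens, then strip punctuation.
stop_then_punct = [
    re.sub(r"[^\w\s]", "", w) for w in text.split() if w.lower() not in STOPWORDS
]
print(stop_then_punct)
```

With the second order the contraction is dropped as a stopword instead of surviving as `didnt`.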


####(4) Lowercase all texts

In [10]:
#converting all text in the 'Cleaned_Reviews' column to lowercase to standardize the text using str.lower() method
imdb_reviews_df['Cleaned_Reviews'] = imdb_reviews_df['Cleaned_Reviews'].str.lower()
# Print the message indicating the completion of the lowercasing step
print("Output after Lowercasing all texts:")
# Display the first few rows of the DataFrame, showing both original and cleaned reviews
print(imdb_reviews_df.head())


Output after Lowercasing all texts:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  hugh jackman perfect wolverine fun movie like ...  
1  crazy blast bonkers sooo say movie whole team ...  
2  weve waited long moment beyond fun wholesome f...  
3  many easter eggs true comic characters may pos...  
4  read ign review guy gave story poorthe guy rea...  


####(5) Stemming.

In [11]:
#importing the PorterStemmer class from the NLTK library for stemming words
from nltk.stem import PorterStemmer

In [12]:
# initializing the stemmer
stemmer = PorterStemmer()
# creating an empty list to store the cleaned reviews
clean_reviews = []

for text in imdb_reviews_df['Cleaned_Reviews']: # using a for loop to go through each cleaned review in the DataFrame
    tokens = word_tokenize(text) # tokenizing the review into individual words
    cleaned_text = "" # initializing an empty string to store the cleaned text

    for word in tokens: # using another for loop to stem each token
        cleaned_text += stemmer.stem(word) + " " # stemming the word and adding it to the cleaned_text string with a space
    # stripping any extra spaces and adding the cleaned review to the list using append() function
    clean_reviews.append(cleaned_text.strip())

# updating the DataFrame with the cleaned reviews
imdb_reviews_df['Cleaned_Reviews'] = clean_reviews

# printing the DataFrame to show the results after stemming
print("Output after Stemming:")
print(imdb_reviews_df.head())


Output after Stemming:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  hugh jackman perfect wolverin fun movi like di...  
1  crazi blast bonker sooo say movi whole team be...  
2  weve wait long moment beyond fun wholesom full...  
3  mani easter egg true comic charact may possibl...  
4  read ign review guy gave stori poorth guy real...  


####(6) Lemmatization

In [13]:
#importing the WordNetLemmatizer class from the NLTK library
from nltk.stem import WordNetLemmatizer
# downloading the WordNet data used for lemmatization
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
lemmatizer = WordNetLemmatizer() #initialising the lemmatizer
clean_reviews = [] #creating an empty list to store the lemmatized reviews

# using for loop to iterate through each review in the 'Cleaned_Reviews' column of the DataFrame
for review in imdb_reviews_df['Cleaned_Reviews']:
    tokens = word_tokenize(review) # Tokenizing the review into individual words
    lemmatized = "" # Initializing an empty string to hold the lemmatized words

    for word in tokens: # using nested for to iterate through each token
        lemmatized_word = lemmatizer.lemmatize(word) #applying lemmatization to the current word using lemmatize() function
        lemmatized += lemmatized_word + " " # Adding the lemmatized word to the string with a space
    clean_reviews.append(lemmatized.strip()) #removing the space and adding the lemmatized review to the list

# updating the DataFrame with the lemmatized reviews
imdb_reviews_df['Cleaned_Reviews'] = clean_reviews

# Printing the DataFrame after lemmatization
print("Output after Lemmatization:")
print(imdb_reviews_df.head())

Output after Lemmatization:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  hugh jackman perfect wolverin fun movi like di...  
1  crazi blast bonker sooo say movi whole team be...  
2  weve wait long moment beyond fun wholesom full...  
3  mani easter egg true comic charact may possibl...  
4  read ign review guy gave stori poorth guy real...  


####Saving the clean data in a new column in the CSV file

In [15]:
# Saving the cleaned DataFrame to the CSV file '1000_imdb_reviews.csv'
imdb_reviews_df.to_csv("1000_imdb_reviews.csv", index=False)

# showing the first few rows of the cleaned DataFrame to verify the changes
imdb_reviews_df.head()


| | Review | Cleaned_Reviews |
|---|---|---|
| 0 | Hugh Jackman is the perfect Wolverine. What a ... | hugh jackman perfect wolverin fun movi like di... |
| 1 | What a crazy blast ! Bonkers !!Sooo !...\nWhat... | crazi blast bonker sooo say movi whole team be... |
| 2 | We've waited so long for this moment, and it w... | weve wait long moment beyond fun wholesom full... |
| 3 | So many Easter Eggs, so true to the comic char... | mani easter egg true comic charact may possibl... |
| 4 | I read an IGN review where the guy gave it a 7... | read ign review guy gave stori poorth guy real... |


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

####1) Parts of Speech (POS) Tagging on clean text

In [16]:
#Importing the part-of-speech tagging function from NLTK
from nltk import pos_tag
# downloading the 'averaged_perceptron_tagger' resource for POS tagging
# to ensure the POS tagger can work properly
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [17]:
pos_list = [] #initializing an empty list to hold the POS counts and reviews

# using for loop to iterate through each review in the 'Cleaned_Reviews' column
for review in imdb_reviews_df['Cleaned_Reviews']:
    # Initializing counters for nouns, verbs, adjectives, and adverbs
    noun_count, verb_count, adj_count, adv_count = 0, 0, 0, 0

    for word, tag in pos_tag(word_tokenize(review)): #tokenizing the review and applying pos tagging
        # Checking if the tag indicates a noun (NN)
        if tag.startswith('NN'):
            noun_count += 1  # Increment noun counter
        # Checking if the tag indicates a verb (VB)
        elif tag.startswith('VB'):
            verb_count += 1  # Increment verb counter
        # Checking if the tag indicates an adjective (JJ)
        elif tag.startswith('JJ'):
            adj_count += 1  # Increment adjective counter
        # Checking if the tag indicates an adverb (RB)
        elif tag.startswith('RB'):
            adv_count += 1  # Increment adverb counter

    #now, we append the review and its corresponding counts to the pos_list list using append() function
    pos_list.append([review, noun_count, verb_count, adj_count, adv_count])

#creating a DataFrame from the list of counts and reviews
pos_df = pd.DataFrame(pos_list, columns=['Cleaned_Reviews', 'Nouns', 'Verbs', 'Adjectives', 'Adverbs'])

# showing the first few rows of the resulting DataFrame
(pos_df.head())


| | Cleaned_Reviews | Nouns | Verbs | Adjectives | Adverbs |
|---|---|---|---|---|---|
| 0 | hugh jackman perfect wolverin fun movi like di... | 39 | 9 | 12 | 2 |
| 1 | crazi blast bonker sooo say movi whole team be... | 66 | 19 | 23 | 4 |
| 2 | weve wait long moment beyond fun wholesom full... | 99 | 30 | 51 | 14 |
| 3 | mani easter egg true comic charact may possibl... | 34 | 14 | 12 | 1 |
| 4 | read ign review guy gave stori poorth guy real... | 34 | 12 | 15 | 3 |


In [18]:
# calculating the total count for each part of speech (POS) type using sum() function
total_nouns = pos_df['Nouns'].sum() #Sum of noun counts
total_verbs = pos_df['Verbs'].sum()   #Sum of verb counts
total_adjectives = pos_df['Adjectives'].sum() #Sum of adjective counts
total_adverbs = pos_df['Adverbs'].sum() #Sum of adverb counts

#printing the counts of pos
print("Summary of Part-of-Speech Counts in the DataFrame:")
print(f"Total number of nouns: {total_nouns}")
print(f"Total number of verbs: {total_verbs}")
print(f"Total number of adjectives: {total_adjectives}")
print(f"Total number of adverbs: {total_adverbs}")


Summary of Part-of-Speech Counts in the DataFrame:
Total number of nouns: 64147
Total number of verbs: 16395
Total number of adjectives: 24420
Total number of adverbs: 5478
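
The counting logic relies on Penn Treebank tag prefixes: `NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives and `RB*` for adverbs. A compact way to express the same idea is a prefix lookup combined with `collections.Counter`. The sketch below runs on a short hand-tagged list that stands in for `pos_tag(word_tokenize(review))`; the tags are written by hand for illustration and are not tagger output, and the helper name `count_pos` is hypothetical.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse categories counted above.
PREFIX_TO_CATEGORY = {"NN": "Nouns", "VB": "Verbs", "JJ": "Adjectives", "RB": "Adverbs"}

def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in a list of (word, tag) pairs."""
    counts = Counter({name: 0 for name in PREFIX_TO_CATEGORY.values()})
    for _, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])  # first two letters identify the class
        if category:
            counts[category] += 1
    return counts

# Hand-tagged example (not tagger output).
tagged = [("Hugh", "NNP"), ("Jackman", "NNP"), ("is", "VBZ"), ("the", "DT"),
          ("perfect", "JJ"), ("Wolverine", "NNP"), ("really", "RB")]
pos_counts = count_pos(tagged)
print(dict(pos_counts))
```

Tags outside the four prefixes, such as `DT`, are simply ignored, which matches the `if`/`elif` chain used in the cell above.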


####1) Parts of Speech (POS) Tagging on originally scraped text

In [19]:
pos_list = [] #initializing an empty list to hold the POS counts and reviews

# using for loop to iterate through each review in the original 'Review' column
for review in imdb_reviews_df['Review']:
    # Initializing counters for nouns, verbs, adjectives, and adverbs
    noun_count, verb_count, adj_count, adv_count = 0, 0, 0, 0

    for word, tag in pos_tag(word_tokenize(review)): #tokenizing the review and applying pos tagging
        # Checking if the tag indicates a noun (NN)
        if tag.startswith('NN'):
            noun_count += 1  # Increment noun counter
        # Checking if the tag indicates a verb (VB)
        elif tag.startswith('VB'):
            verb_count += 1  # Increment verb counter
        # Checking if the tag indicates an adjective (JJ)
        elif tag.startswith('JJ'):
            adj_count += 1  # Increment adjective counter
        # Checking if the tag indicates an adverb (RB)
        elif tag.startswith('RB'):
            adv_count += 1  # Increment adverb counter

    #now, we append the review and its corresponding counts to the pos_list list using append() function
    pos_list.append([review, noun_count, verb_count, adj_count, adv_count])

#creating a DataFrame from the list of counts and reviews
pos_df = pd.DataFrame(pos_list, columns=['Review', 'Nouns', 'Verbs', 'Adjectives', 'Adverbs'])

# showing the first few rows of the resulting DataFrame
(pos_df.head())


| | Review | Nouns | Verbs | Adjectives | Adverbs |
|---|---|---|---|---|---|
| 0 | Hugh Jackman is the perfect Wolverine. What a ... | 32 | 24 | 11 | 13 |
| 1 | What a crazy blast ! Bonkers !!Sooo !...\nWhat... | 60 | 38 | 23 | 21 |
| 2 | We've waited so long for this moment, and it w... | 103 | 76 | 38 | 33 |
| 3 | So many Easter Eggs, so true to the comic char... | 32 | 16 | 10 | 13 |
| 4 | I read an IGN review where the guy gave it a 7... | 28 | 22 | 14 | 7 |


In [20]:
# calculating the total count for each part of speech (POS) type using sum() function
total_nouns = pos_df['Nouns'].sum() #Sum of noun counts
total_verbs = pos_df['Verbs'].sum()   #Sum of verb counts
total_adjectives = pos_df['Adjectives'].sum() #Sum of adjective counts
total_adverbs = pos_df['Adverbs'].sum() #Sum of adverb counts

#printing the counts of pos
print("Summary of Part-of-Speech Counts in the DataFrame:")
print(f"Total number of nouns: {total_nouns}")
print(f"Total number of verbs: {total_verbs}")
print(f"Total number of adjectives: {total_adjectives}")
print(f"Total number of adverbs: {total_adverbs}")


Summary of Part-of-Speech Counts in the DataFrame:
Total number of nouns: 58372
Total number of verbs: 40313
Total number of adjectives: 20479
Total number of adverbs: 16800


####(2) Constituency Parsing and Dependency Parsing

In [21]:
#installing the spaCy library
!pip install spacy
#downloading the English language model for spaCy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [22]:
#installing the Benepar library, used for constituency parsing
!pip install benepar



In [23]:
import spacy  #importing the spaCy library
from benepar import BeneparComponent  #importing the BeneparComponent from the benepar library
import benepar  #importing the benepar library

#downloading the Benepar model for English
benepar.download('benepar_en3')

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


True

**Constituency Parsing**


Example Sentence: George Miller... "Why yes, yes I do. Well sort of anyway..." I really really wanted to love Furiosa but in the end I didn't, I liked it but didn't love it.

Cleaned Sentence: georg miller ye ye well sort anywayi realli realli want love furiosa end didnt like didnt love

Constituency Parsing: It focuses on the hierarchical structure of a sentence, breaking it down into constituents that form a tree showing how words combine into phrases. For the example sentence:

NP (Noun Phrase): "georg miller", which represents the subject of the sentence.

VP (Verb Phrase): Contains the main verb "want" along with its complement "love furiosa". This phrase carries the action being described.

PP (Prepositional Phrase): Indicates relationships within the sentence, showing how different elements are connected.
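
The idea of constituents nesting inside one another can be illustrated without running a parser. The sketch below reads a bracketed tree for the short sentence "The quick brown fox jumps over the lazy dog" and lists the words covered by each NP, VP and PP. The bracketing is written by hand for illustration and is not Benepar output, and the helper names `parse_bracketed`, `leaves` and `phrases` are hypothetical.

```python
def parse_bracketed(text):
    """Parse a Penn Treebank style bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]

def phrases(node, label):
    """Collect the text of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1]:
        found.extend(phrases(child, label))
    return found

# Hand-written bracketing, not parser output.
tree = parse_bracketed(
    "(S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
    "(VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))"
)
print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
print(phrases(tree, "PP"))
```

The PP sits inside the VP and itself contains an NP, which is the hierarchical structure a constituency tree is meant to expose.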

In [24]:
# import pandas as pd  # Import pandas for data manipulation
# import spacy  # Import spaCy for NLP tasks
# from benepar import BeneparComponent  # Import Benepar for constituency parsing
# from nltk import Tree  # Import Tree from NLTK to handle and display constituency trees

# # Download the Benepar model for English
# benepar.download('benepar_en3')

# # Load the CSV file containing IMDb reviews into a DataFrame
# imdb_reviews_df = pd.read_csv("1000_imdb_reviews.csv")

# # Load spaCy's English language model for text processing
# nlp = spacy.load("en_core_web_sm")

# # Add the Benepar component to the spaCy pipeline for constituency parsing
# nlp.add_pipe("benepar", config={"model": "benepar_en3"}, after='parser')

# # Define a function for constituency parsing
# def constituency_parsing(text):
#     doc = nlp(text)  # Process the text with spaCy

#     # Iterate through the sentences in the text
#     for sent in doc.sents:
#         # Extract the constituency parse tree in string format
#         constituency_tree = sent._.parse_string
#         # Convert the string representation into an NLTK Tree object
#         tree = Tree.fromstring(constituency_tree)
#         # Pretty print the constituency tree for better visualization
#         tree.pretty_print()

# # Example sentence to explain constituency parsing
# example_sentence = "The quick brown fox jumps over the lazy dog."

# print("\nConstituency Parsing Example:")
# constituency_parsing(example_sentence)  # Perform constituency parsing on the example sentence

# # Perform constituency parsing on all IMDb reviews
# print("\nConstituency Parsing for IMDb Reviews:")
# for review in imdb_reviews_df['Cleaned_Reviews']:
#     constituency_parsing(review)  # Parse each review
#     print("\n")


**Dependency Parsing**

Example Sentence: George Miller... "Why yes, yes I do. Well sort of anyway..." I really really wanted to love Furiosa but in the end I didn't, I liked it but didn't love it.

Cleaned Sentence: georg miller ye ye well sort anywayi realli realli want love furiosa end didnt like didnt love

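Dependency parsing: It shows how the words in a sentence depend on each other by linking each word to its head, the word it modifies or depends on. Each line below is written as token -> dependency relation -> part of speech.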

"georg" -> compound -> X: "georg" is a compound element contributing to the subject.

"miller" -> nsubj -> X: "miller" is the nominal subject of the main verb "want."

"want" -> ROOT -> VERB: The main verb of the sentence, representing the primary action.

"realli" -> nsubj -> X: This connects as a subject for the verb "want," indicating emphasis on wanting.

"did" -> aux -> AUX: This acts as an auxiliary verb modifying "like," indicating the past tense.

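
A dependency parse can be thought of as a list of arcs, each linking a token to its head with a relation label. The sketch below stores a few arcs for the fragment "I wanted to love Furiosa" and looks up the root and its direct dependents. The arcs are annotated by hand for illustration and are not output captured from spaCy; the helper names `find_root` and `dependents` are hypothetical.

```python
# Hand-annotated arcs (token, relation, head); illustrative labels, not spaCy output.
arcs = [
    ("I", "nsubj", "wanted"),
    ("wanted", "ROOT", "wanted"),   # the root is conventionally its own head
    ("to", "aux", "love"),
    ("love", "xcomp", "wanted"),
    ("Furiosa", "dobj", "love"),
]

def find_root(arcs):
    """Return the token whose relation is ROOT."""
    return next(token for token, relation, _ in arcs if relation == "ROOT")

def dependents(arcs, head):
    """Return (token, relation) pairs attached directly to the given head."""
    return [(t, r) for t, r, h in arcs if h == head and r != "ROOT"]

root = find_root(arcs)
print(root)
print(dependents(arcs, root))
print(dependents(arcs, "love"))
```

On the uncleaned fragment the subject and complement attach cleanly to the main verb, whereas on stemmed text without stopwords the parser has much less grammatical evidence to work with.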
In [25]:
# import pandas as pd  # Importing pandas for data manipulation
# import spacy  # Importing spaCy for NLP tasks
# from spacy import displacy  # Importing displacy for visualizing dependency parsing

# # Load the CSV file containing IMDb reviews into a DataFrame
# imdb_reviews_df = pd.read_csv("1000_imdb_reviews.csv")

# # Load spaCy's English language model for text processing
# nlp = spacy.load("en_core_web_sm")

# # Define a function to process and display dependency parsing for each review
# def dependency_parsing(text):
#     # Process the text with spaCy
#     doc = nlp(text)

#     # Visualize the dependency tree in Jupyter Notebook
#     displacy.render(doc, style='dep', jupyter=True)

#     # Print each token, its dependency relation, and its part of speech
#     for token in doc:
#         print(f"{token.text} -> {token.dep_} --> {token.pos_}")

#     print("\n")

# # Example: Apply dependency parsing for each review in the DataFrame
# for review in imdb_reviews_df['Cleaned_Reviews']:
#     print("Dependency Parsing for Review:")
#     dependency_parsing(review)


####(3) Named Entity Recognition on cleaned text

In [26]:
from collections import Counter  #Importing Counter to count entity occurrences
nlp = spacy.load("en_core_web_sm") #loading the spaCy model for English language processing
imdb_reviews_df = pd.read_csv("1000_imdb_reviews.csv") #loading the IMDb reviews dataset from the CSV file
all_entities = []  # Initializing an empty list to hold all entities

# using for loop to iterate over each cleaned review in the DataFrame
for review in imdb_reviews_df['Cleaned_Reviews']:
    doc = nlp(review)  # Using spaCy to process the review and extract entities
    for ent in doc.ents: # again using for loop to iterate over each entity found in the review
        all_entities.append((ent.text.strip(), ent.label_)) #appending a tuple of the entity text and its label to the all_entities list

#Counting the occurrences of each entity using Counter
entity_count = Counter(all_entities)
# Print entity counts in a table
print(f"{'Entity Type':<10}  {'Entity Name':<30}  {'Count'}")
for (entity, label), count in entity_count.items():  #Iterating over counted entities
    print(f"{label:<10}  {entity:<30}  {count}")  #Printing entity type, name, and count


Entity Type  Entity Name                     Count
ORG         hugh jackman                    106
ORG         fox                             41
ORG         funni                           70
CARDINAL    two                             350
ORG         promot movi hard                1
ORG         crazi blast bonker sooo         2
ORDINAL     second                          100
ORDINAL     first                           398
PERSON      doingbetween onelin             2
ORG         park overflowingli craycray frame  2
ORG         stuffsoh tone movi              2
FAC         someth el intro sequenc         2
NORP        surpris                         59
PERSON      hugh jackman                    197
ORG         best inevit bromanc             1
LOC         nova                            65
ORG         prais movi                      2
CARDINAL    one                             536
PERSON      realli beauti nowher            1
DATE        last year                       22
ORG      
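
The table above is printed in first-seen order, so frequent entities sit next to one-off noise. As a minimal sketch, assuming input shaped like `all_entities` (a list of `(entity, label)` tuples), `Counter.most_common` returns the pairs sorted by descending count. The helper name `top_entities` and the sample tuples are illustrative and do not come from the notebook.

```python
from collections import Counter


def top_entities(entity_pairs, n=5):
    """Return the n most frequent (entity, label) pairs with their counts."""
    return Counter(entity_pairs).most_common(n)


# Illustrative sample only; in the notebook this would be `all_entities`.
sample_entities = [
    ("hugh jackman", "PERSON"),
    ("hugh jackman", "PERSON"),
    ("hugh jackman", "ORG"),
    ("fox", "ORG"),
    ("two", "CARDINAL"),
    ("two", "CARDINAL"),
    ("two", "CARDINAL"),
]

# Same column layout as the table printed by the cell above.
print(f"{'Entity Type':<10}  {'Entity Name':<30}  {'Count'}")
for (entity, label), count in top_entities(sample_entities, n=3):
    print(f"{label:<10}  {entity:<30}  {count}")
```

Limiting the table to the top `n` rows also keeps the printed output short enough to read without scrolling.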

In [27]:
# Nested dictionary: entity label -> {entity text -> count}
entity_types = {}

for entity, label in all_entities:  # go through every extracted entity
    if label not in entity_types:  # first time this label is seen
        entity_types[label] = {}  # create an inner dictionary for that label

    # Check whether the entity is already recorded under its label
    if entity in entity_types[label]:
        entity_types[label][entity] += 1  # if yes, increment its count
    else:
        entity_types[label][entity] = 1  # if no, initialize its count to 1

print("Summary of Named Entities:\n")
for label, entities in entity_types.items():
    total_count = sum(entities.values())  # total number of entity mentions for this label
    print(f"\n{label} Type: Total Count = {total_count}")  # print the label and its total count
    for entity, count in entities.items():
        print(f"  {entity}: {count}")  # print each entity and its count


Summary of Named Entities:


ORG Type: Total Count = 1293
  hugh jackman: 106
  fox: 41
  funni: 70
  promot movi hard: 1
  crazi blast bonker sooo: 2
  park overflowingli craycray frame: 2
  stuffsoh tone movi: 2
  best inevit bromanc: 1
  prais movi: 2
  lucki nz: 2
  hardcor movi meta: 2
  hardcor: 3
  failur marvel sinc: 1
  funni someth: 1
  theme instantli: 1
  tva: 36
  issu: 25
  aris actual: 1
  sloppi disorgan: 1
  dri: 1
  spectacl: 13
  everyth: 3
  movi middl: 1
  neg movi: 1
  awesom movi lot funni: 1
  uniqu substanc movi run: 1
  everyth movi perfect end: 1
  funni moment: 2
  disney: 8
  hugh jackmanth: 1
  allur superhero movi: 1
  funni rip roaringli: 1
  ga: 1
  funni comedian: 1
  watch movi watch movi: 1
  referencesal movi: 1
  countless nod: 1
  impactth: 1
  summar movi awsom moment: 1
  exactli: 7
  believ ban: 1
  funni flat: 1
  outcom movi mani: 1
  funni uniqu disney: 1
  superhero movi subpar: 1
  fightingso movi mixtur: 1
  hugh jackman back: 1
  believ 
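
The per-label summary can also be built with less bookkeeping. The sketch below is one compact equivalent, assuming the same `(entity, label)` tuples: a `defaultdict` of `Counter` objects removes the explicit existence checks, and `most_common()` lists the most frequent entities of each type first. The function name `group_by_label` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_label(entity_pairs):
    """Map each entity label to a Counter of the entity texts seen under it."""
    grouped = defaultdict(Counter)
    for entity, label in entity_pairs:
        grouped[label][entity] += 1
    return grouped


# Illustrative sample only; in the notebook this would be `all_entities`.
sample_entities = [
    ("fox", "ORG"),
    ("disney", "ORG"),
    ("fox", "ORG"),
    ("first", "ORDINAL"),
    ("second", "ORDINAL"),
    ("first", "ORDINAL"),
    ("first", "ORDINAL"),
]

grouped = group_by_label(sample_entities)

print("Summary of Named Entities:\n")
for label, entities in grouped.items():
    print(f"\n{label} Type: Total Count = {sum(entities.values())}")
    for entity, count in entities.most_common():  # highest count first
        print(f"  {entity}: {count}")
```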

#### (3) Named Entity Recognition on scraped text

In [28]:
from collections import Counter  # Counter tallies how often each (entity, label) pair occurs

nlp = spacy.load("en_core_web_sm")  # load spaCy's small English pipeline
imdb_reviews_df = pd.read_csv("1000_imdb_reviews.csv")  # load the IMDb reviews dataset from the CSV file
all_entities = []  # holds one (entity text, entity label) tuple per entity mention

# Run named entity recognition on each raw (uncleaned) review in the DataFrame
for review in imdb_reviews_df['Review']:
    doc = nlp(review)  # process the review with spaCy
    for ent in doc.ents:  # iterate over each entity found in the review
        all_entities.append((ent.text.strip(), ent.label_))  # store the entity text with its label

# Count the occurrences of each (entity, label) pair
entity_count = Counter(all_entities)

# Print the entity counts as a table
print(f"{'Entity Type':<10}  {'Entity Name':<30}  {'Count'}")
for (entity, label), count in entity_count.items():  # iterate over the counted entities
    print(f"{label:<10}  {entity:<30}  {count}")  # entity type, name, and count


Entity Type  Entity Name                     Count
PERSON      Hugh Jackman                    115
PERSON      Wolverine                       282
ORG         Fox                             118
CARDINAL    two                             355
ORG         Hot Ones                        1
CARDINAL    90                              4
ORDINAL     second                          91
ORDINAL     first                           380
CARDINAL    one                             263
PERSON      Ryan Reynolds                   352
GPE         Hugh Jackman                    212
ORG         MCU                             417
PERSON      Shawn Levy                      59
ORG         FOX                             20
PERSON      Emma Corrin                     52
PERSON      Cassandra Nova                  65
ORG         Deadpool & Wolverine            228
CARDINAL    5                               13
PERSON      Easter Eggs                     3
DATE        the last 5 years                5
PER

In [29]:
# Nested dictionary: entity label -> {entity text -> count}
entity_types = {}

for entity, label in all_entities:  # go through every extracted entity
    if label not in entity_types:  # first time this label is seen
        entity_types[label] = {}  # create an inner dictionary for that label

    # Check whether the entity is already recorded under its label
    if entity in entity_types[label]:
        entity_types[label][entity] += 1  # if yes, increment its count
    else:
        entity_types[label][entity] = 1  # if no, initialize its count to 1

print("Summary of Named Entities:\n")
for label, entities in entity_types.items():
    total_count = sum(entities.values())  # total number of entity mentions for this label
    print(f"\n{label} Type: Total Count = {total_count}")  # print the label and its total count
    for entity, count in entities.items():
        print(f"  {entity}: {count}")  # print each entity and its count


Summary of Named Entities:


PERSON Type: Total Count = 2988
  Hugh Jackman: 115
  Wolverine: 282
  Ryan Reynolds: 352
  Shawn Levy: 59
  Emma Corrin: 52
  Cassandra Nova: 65
  Easter Eggs: 3
  Endgame: 6
  God: 2
  Levy: 17
  Imax: 2
  Collusus: 1
  Dulpinder: 1
  Marvel: 186
  Cassandra: 35
  Ryan: 105
  Charles: 2
  gore: 62
  I'll: 2
  Marvel Jesus: 20
  Logan: 128
  Deadpool: 30
  Elektra: 7
  Channing Tatum's Gambit: 1
  Laura: 14
  Paradox: 28
  Avril Lavigne: 2
  goof ball: 1
  genre!Emma Corrin: 1
  Matthew Macfadyen: 22
  Kevin Feige: 10
  Stan Lee's: 1
  Free Guy: 2
  Yeppy: 1
  Sigh: 1
  Strange: 3
  Iron Man: 5
  buttercup-you're: 1
  Easter: 3
  Peter: 12
  Henry Cavill: 18
  Chris Evans: 42
  Cinema Photography 10/10
Story: 1
  Cary Grant: 1
  Butch Cassady: 1
  Mad Max: 9
  Blade: 30
  Harun Can: 1
  Ryan reynolds definitely: 1
  Aussie: 1
  Weak: 1
  Loki: 36
  Sean Levy: 1
  Spiderman: 4
  Spider-Man:: 4
  Sheeva: 1
  Family Guy: 1
  Channing Tatum: 18
  Ryan Reynold'
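
The raw-text output counts some surface variants separately, for example `Fox` and `FOX`, or `Spider-Man:` with a trailing colon. One possible way to merge such variants before counting is sketched below, assuming that lowercasing and trimming surrounding punctuation is acceptable for this analysis. The helpers `normalize_entity` and `merged_counts` and the sample tuples are hypothetical, and the approach does not merge different spellings such as `Spiderman` and `Spider-Man`.

```python
import string
from collections import Counter


def normalize_entity(text):
    """Trim surrounding whitespace and punctuation, then lowercase the entity."""
    return text.strip().strip(string.punctuation).strip().casefold()


def merged_counts(entity_pairs):
    """Count (normalized entity, label) pairs, skipping entities that normalize to ''."""
    counts = Counter()
    for entity, label in entity_pairs:
        key = normalize_entity(entity)
        if key:  # ignore entities that were only punctuation or whitespace
            counts[(key, label)] += 1
    return counts


# Illustrative sample only; in the notebook this would be `all_entities`.
sample_entities = [
    ("Fox", "ORG"),
    ("FOX", "ORG"),
    ("Spider-Man:", "PERSON"),
    ("Spider-Man", "PERSON"),
    ("...", "PERSON"),
]

merged = merged_counts(sample_entities)
for (entity, label), count in merged.most_common():
    print(f"{label:<10}  {entity:<30}  {count}")
```

Normalizing only the counting key leaves the text passed to spaCy unchanged, so the model still sees the original capitalization it relies on for entity detection.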

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [30]:
# Write your response below

'''
I think the assignment was great, and I had fun scraping the IMDb website for review data.
The most challenging part was Question 3, especially the constituency and dependency parsing,
but I figured it out with the help of Stack Overflow and some research.
I think the time given to complete the assignment was too short; I would appreciate
more time for the upcoming assignments.

'''

'\nI think the assignment was great, and I had fun scraping the IMDb website for review data.\nThe most challenging part was Question 3, especially the constituency and dependency parsing,\nbut I figured it out with the help of Stack Overflow and some research.\nI think the time given to complete the assignment was too short; I would appreciate\nmore time for the upcoming assignments.\n\n'