<a href="https://colab.research.google.com/github/hemareddyyanala/HemaReddy_INFO5731_Fall2024/blob/main/Yanala_Hema_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests  #importing this library to send HTTP requests and handle responses
from bs4 import BeautifulSoup  # importing bs4 library to parse HTML and extract data
import pandas as pd  # importing pandas for data manipulation

In [2]:
# creating a function to send an HTTP GET request to the IMDb URL and return the parsed HTML using BeautifulSoup.
def retrieve_page_content(url):
    try:
        # sending a GET request to the URL and storing the response;
        # the timeout keeps a stalled connection from hanging the collection loop
        response = requests.get(url, timeout=10)
        # Using the if condition to check if the request was successful
        if response.status_code == 200:
            # if true, we parse the response content using BeautifulSoup and return it
            return BeautifulSoup(response.content, 'html.parser')
        else:
            # if not, we print an error message with the unexpected status code
            print(f"Request failed: the status code is {response.status_code}")
            return None
    except requests.RequestException as e:
        # printing an error message if the request itself fails (connection error, timeout, ...)
        print(f"Sorry, failed to retrieve page content: {e}")
        return None

# defining a function to extract reviews from the parsed HTML content
def extract_reviews_from_page(soup):
    #finding all the review elements using their specific HTML class
    review_elements = soup.select("div.text.show-more__control")
    #initializing an empty list to store the cleaned reviews
    reviews = []
    # using for loop to go through each review element and get its text content
    for review in review_elements:
        # now, stripping any surrounding whitespace and adding the text to the list
        cleaned_review = review.get_text(strip=True)
        reviews.append(cleaned_review)
    # Returning the list of cleaned reviews
    return reviews

# defining a function to get the pagination key for getting more reviews
def get_pagination_key(soup):
    # Finding the 'div' element with class 'load-more-data', as it contains the pagination key
    load_more_section = soup.find("div", class_="load-more-data")
    #if the load-more section is found, return the value of pagination key
    if load_more_section:
        return load_more_section.get("data-key")
    # It returns None if there is no pagination key found
    return None

# this is the IMDb reviews URL for the movie Deadpool & Wolverine (2024)
base_imdb_url = "https://www.imdb.com/title/tt6263850/reviews/_ajax?ref_=undefined"
# starting with an empty list of collected reviews and no pagination key
all_reviews = []  # empty list
pagination_token = None  # no pagination key before the first request

# using while loop to repeat the process of collecting reviews until we get 1000
while len(all_reviews) < 1000:
    # using if to check if there's a pagination token, then we append it to the base URL
    current_url = base_imdb_url
    if pagination_token:
        current_url += f"&paginationKey={pagination_token}"  #here, we append the pagination key to the URL

    #fetching the page content using the IMDb URL with the current pagination key
    soup = retrieve_page_content(current_url)
    #exiting the loop if we can't fetch the page content using break
    if soup is None:
        break

    #extracting the reviews from the current page using the extract_reviews_from_page() function
    reviews_on_page = extract_reviews_from_page(soup)
    #add the extracted reviews to the all_reviews list
    all_reviews.extend(reviews_on_page)
    #if the required number of reviews is collected, stop the loop using break
    if len(all_reviews) >= 1000:
        break
    #fetching the pagination key for the next page of reviews
    pagination_token = get_pagination_key(soup)
    # If no pagination token is found, break the process
    if pagination_token is None:
        print("No more reviews to load.")
        break

#truncating the review list to exactly 1000 reviews to ensure we don't collect more
collected_reviews = all_reviews[:1000]


In [3]:
#converting the list of collected reviews into a DataFrame using pandas
imdb_reviews_df = pd.DataFrame(collected_reviews, columns=["Review"])
# Saving the DataFrame to a CSV file named '1000_imdb_reviews.csv'
imdb_reviews_df.to_csv("1000_imdb_reviews.csv", index=False)

print(f"Collected {len(collected_reviews)} reviews of the movie Deadpool & Wolverine (2024) and saved it to '1000_imdb_reviews.csv'.")

Collected 1000 reviews of the movie Deadpool & Wolverine (2024) and saved it to '1000_imdb_reviews.csv'.
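
The collection loop above follows a general token-based pagination pattern: fetch a page, keep its items, read the token of the next page, and stop once the target count is reached or no token comes back. A minimal offline sketch of that pattern is given here. The `FAKE_PAGES` dictionary and the `fetch_page` and `collect` helpers are hypothetical stand-ins for the real IMDb requests, so the example runs without network access.

```python
# Hypothetical stand-in for the remote site: each key is a pagination token,
# each value holds the items on that page and the token of the next page
# (None on the last page).
FAKE_PAGES = {
    None: (["review 1", "review 2", "review 3"], "tok-a"),
    "tok-a": (["review 4", "review 5", "review 6"], "tok-b"),
    "tok-b": (["review 7", "review 8"], None),
}


def fetch_page(token):
    """Return (items, next_token) for the given pagination token."""
    return FAKE_PAGES[token]


def collect(limit):
    """Collect up to `limit` items by following pagination tokens."""
    items = []
    token = None  # the first request carries no token
    while len(items) < limit:
        page_items, token = fetch_page(token)
        items.extend(page_items)
        if token is None:  # no further pages available
            break
    return items[:limit]  # trim any overshoot from the last page


print(collect(5))
print(len(collect(100)))
```

Asking for more items than exist simply returns everything that was available, which mirrors the "No more reviews to load." branch in the notebook.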


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [4]:
imdb_reviews_df = pd.read_csv("1000_imdb_reviews.csv") #Loading the reviews from the CSV file '1000_imdb_reviews.csv' into a DataFrame

####(1) Remove noise, such as special characters and punctuations.

In [5]:
import re  #Importing the regular expression module

#removing punctuation and special characters from the 'Review' column in the DataFrame
# and store the cleaned reviews in a new column called 'Cleaned_Reviews'
imdb_reviews_df['Cleaned_Reviews'] = imdb_reviews_df['Review'].str.replace(r'[^\w\s]', '', regex=True)

In [6]:
print("Output after Removing Noise, such as special characters and punctuations:")
print(imdb_reviews_df.head())

Output after Removing Noise, such as special characters and punctuations:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  Hugh Jackman is the perfect Wolverine What a f...  
1  What a crazy blast  Bonkers Sooo \nWhat I can ...  
2  Weve waited so long for this moment and it was...  
3  So many Easter Eggs so true to the comic chara...  
4  I read an IGN review where the guy gave it a 7...  
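
The pattern `[^\w\s]` deletes every character that is neither a word character nor whitespace. Two side effects are worth knowing: underscores survive, because `\w` includes `_`, and apostrophes are deleted without a replacement, which is why "We've" turns into "Weve" in the output above. A small sketch with the standard `re` module, using made-up sample strings and a hypothetical `remove_noise` helper:

```python
import re

NOISE = re.compile(r"[^\w\s]")  # same pattern as in the cell above


def remove_noise(text):
    """Delete every character that is not a word character or whitespace."""
    return NOISE.sub("", text)


samples = ["We've waited so long!", "snake_case stays, #hashtag goes"]
for sample in samples:
    print(repr(remove_noise(sample)))
```

If underscores should be removed as well, one option is to extend the pattern to `[^\w\s]|_`.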


####(2) Remove numbers.

In [7]:
#remove all numeric characters from the 'Cleaned_Reviews' column in the DataFrame
imdb_reviews_df['Cleaned_Reviews'] = imdb_reviews_df['Cleaned_Reviews'].str.replace(r'\d+', '', regex=True)

# Display the 'Review' and 'Cleaned_Reviews' columns
print("Output after Removing Numbers:")
print(imdb_reviews_df.head())  # Display original review and cleaned review

Output after Removing Numbers:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  Hugh Jackman is the perfect Wolverine What a f...  
1  What a crazy blast  Bonkers Sooo \nWhat I can ...  
2  Weve waited so long for this moment and it was...  
3  So many Easter Eggs so true to the comic chara...  
4  I read an IGN review where the guy gave it a  ...  
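
Deleting digits leaves a gap behind, which is visible in row 4 above ("gave it a  ..." now has two spaces). A possible follow-up step is to collapse repeated whitespace after removing the numbers. The sketch below assumes a hypothetical `remove_numbers` helper and a made-up sample sentence, since the full review text is truncated in the output:

```python
import re


def remove_numbers(text):
    """Delete digit runs, then collapse the whitespace they leave behind."""
    without_digits = re.sub(r"\d+", "", text)
    # split() with no arguments splits on any whitespace run and drops empties
    return " ".join(without_digits.split())


print(remove_numbers("the guy gave it a 7 out of 10"))
```

Note that this also turns line breaks inside a review into single spaces, which may or may not be wanted.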


####(3) Remove stopwords by using the stopwords list.

In [8]:
import nltk  #importing the Natural Language Toolkit (NLTK) library for NLP
from nltk.corpus import stopwords  #importing the stopwords list from NLTK
from nltk.tokenize import word_tokenize  #import the word_tokenize function for splitting text into individual words

nltk.download('stopwords')  #downloading the stopwords dataset, which includes common words like 'and', 'the', etc.
nltk.download('punkt')  # Downloading the Punkt tokenizer models for tokenizing sentences and words


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# Getting the set of stopwords for English
stop_words = set(stopwords.words('english'))
clean_reviews = []  # Initializing an empty list to store cleaned reviews

# Using a for loop to iterate through each review in the Cleaned_Reviews column
for text in imdb_reviews_df['Cleaned_Reviews']:
    tokens = word_tokenize(text)  # Tokenizing the review text
    filtered_words = []  # Initializing an empty list to store words that are not stopwords

    # Using a for loop to check each token and add it to the filtered_words list if it's not a stopword
    for word in tokens:
        if word.lower() not in stop_words:  # Check in lowercase to match stopwords
            filtered_words.append(word)  # Append only non-stopwords

    # Joining the filtered words back into a single string and adding it to clean_reviews
    clean_reviews.append(' '.join(filtered_words))  # Properly join words with a space

# Assigning the cleaned reviews back to the DataFrame
imdb_reviews_df['Cleaned_Reviews'] = clean_reviews

# Printing the DataFrame after removing stopwords
print("\nOutput after Removing Stopwords:")
print(imdb_reviews_df.head())



Output after Removing Stopwords:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  Hugh Jackman perfect Wolverine fun movie like ...  
1  crazy blast Bonkers Sooo say movie whole team ...  
2  Weve waited long moment beyond fun wholesome f...  
3  many Easter Eggs true comic characters may pos...  
4  read IGN review guy gave story poorThe guy rea...  
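
The nested loop above can also be written as a single comprehension. The sketch below is only an illustration: it uses a tiny hand-written stopword set instead of NLTK's English list and plain `str.split()` instead of `word_tokenize`, so that it runs without any downloads. The `remove_stopwords` name is hypothetical.

```python
# Tiny hand-written stopword set, a stand-in for NLTK's English list.
STOP_WORDS = {"is", "the", "a", "what", "so", "for", "this", "and", "it"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop stopwords case-insensitively while keeping the original casing."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


print(remove_stopwords("Hugh Jackman is the perfect Wolverine What a fun movie"))
```

As in the notebook, the comparison is done on the lowercased word, so "What" is removed even though lowercasing of the column only happens in the next step.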


####(4) Lowercase all texts

In [10]:
#converting all text in the 'Cleaned_Reviews' column to lowercase to standardize the text using str.lower() method
imdb_reviews_df['Cleaned_Reviews'] = imdb_reviews_df['Cleaned_Reviews'].str.lower()
# Print the message indicating the completion of the lowercasing step
print("Output after Lowercasing all texts:")
# Display the first few rows of the DataFrame, showing both original and cleaned reviews
print(imdb_reviews_df.head())


Output after Lowercasing all texts:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  hugh jackman perfect wolverine fun movie like ...  
1  crazy blast bonkers sooo say movie whole team ...  
2  weve waited long moment beyond fun wholesome f...  
3  many easter eggs true comic characters may pos...  
4  read ign review guy gave story poorthe guy rea...  


####(5) Stemming.

In [11]:
#importing the PorterStemmer class from the NLTK library for stemming words
from nltk.stem import PorterStemmer

In [12]:
# initialize the stemmer
stemmer = PorterStemmer()
# creating an empty list to store the cleaned reviews
clean_reviews = []

for text in imdb_reviews_df['Cleaned_Reviews']: # using a for loop to go through each cleaned review in the DataFrame
    tokens = word_tokenize(text) # tokenizing the review into individual words
    cleaned_text = "" # initializing an empty string to store the cleaned text

    for word in tokens: # using a nested for loop to go through each token and stem it
        cleaned_text += stemmer.stem(word) + " " # stemming the word and adding it to the cleaned_text string with a space
    # stripping the trailing space and adding the cleaned review to the list using the append() method
    clean_reviews.append(cleaned_text.strip())

# updating the DataFrame with the cleaned reviews
imdb_reviews_df['Cleaned_Reviews'] = clean_reviews

# printing the DataFrame to show the results after stemming
print("Output after Stemming:")
print(imdb_reviews_df.head())


Output after Stemming:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  hugh jackman perfect wolverin fun movi like di...  
1  crazi blast bonker sooo say movi whole team be...  
2  weve wait long moment beyond fun wholesom full...  
3  mani easter egg true comic charact may possibl...  
4  read ign review guy gave stori poorth guy real...  


####(6) Lemmatization

In [13]:
#importing the WordNetLemmatizer class from the NLTK library
from nltk.stem import WordNetLemmatizer
# downloading the WordNet data used for lemmatization
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
lemmatizer = WordNetLemmatizer() #initialising the lemmatizer
clean_reviews = [] #creating an empty list to store the lemmatized reviews

# using for loop to iterate through each review in the 'Cleaned_Reviews' column of the DataFrame
for review in imdb_reviews_df['Cleaned_Reviews']:
    tokens = word_tokenize(review) # Tokenizing the review into individual words
    lemmatized = "" # Initializing an empty string to hold the lemmatized words

    for word in tokens: # using nested for to iterate through each token
        lemmatized_word = lemmatizer.lemmatize(word) #applying lemmatization to the current word using lemmatize() function
        lemmatized += lemmatized_word + " " # Adding the lemmatized word to the string with a space
    clean_reviews.append(lemmatized.strip()) #removing the space and adding the lemmatized review to the list

# updating the DataFrame with the lemmatized reviews
imdb_reviews_df['Cleaned_Reviews'] = clean_reviews

# Printing the DataFrame after lemmatization
print("Output after Lemmatization:")
print(imdb_reviews_df.head())

Output after Lemmatization:
                                              Review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                     Cleaned_Reviews  
0  hugh jackman perfect wolverin fun movi like di...  
1  crazi blast bonker sooo say movi whole team be...  
2  weve wait long moment beyond fun wholesom full...  
3  mani easter egg true comic charact may possibl...  
4  read ign review guy gave stori poorth guy real...  
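
The lemmatized column above is identical to the stemmed column. A likely explanation is that a dictionary-based lemmatizer returns a word unchanged when it cannot find it, and stems such as "movi" or "charact" are not dictionary words. The toy sketch below illustrates only that effect. `toy_stem` and `toy_lemmatize` are hypothetical helpers with a hand-written suffix rule and lemma table; they are not the Porter algorithm and not WordNet.

```python
# Toy stand-ins: NOT the Porter algorithm and NOT WordNet.
LEMMAS = {"movies": "movie", "characters": "character", "eggs": "egg"}


def toy_stem(word):
    """Crude suffix stripping that, like real stemmers, can yield non-words."""
    for suffix in ("ers", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


words = ["movies", "characters", "eggs"]
stemmed = [toy_stem(w) for w in words]
print(stemmed)
print([toy_lemmatize(w) for w in stemmed])  # lookup misses the non-word stems
print([toy_lemmatize(w) for w in words])    # lookup on the original words
```

Under this reading, one option is to lemmatize the lowercased text before stemming and to keep the stemmed and lemmatized results in separate columns, so that each step shows its own effect.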


####Saving the clean data in a new column in the CSV file

In [15]:
# Saving the cleaned DataFrame to the CSV file '1000_imdb_reviews.csv'
imdb_reviews_df.to_csv("1000_imdb_reviews.csv", index=False)

# showing the first few rows of the cleaned DataFrame to verify the changes
imdb_reviews_df.head()

Unnamed: 0,Review,Cleaned_Reviews
0,Hugh Jackman is the perfect Wolverine. What a ...,hugh jackman perfect wolverin fun movi like di...
1,What a crazy blast ! Bonkers !!Sooo !...\nWhat...,crazi blast bonker sooo say movi whole team be...
2,"We've waited so long for this moment, and it w...",weve wait long moment beyond fun wholesom full...
3,"So many Easter Eggs, so true to the comic char...",mani easter egg true comic charact may possibl...
4,I read an IGN review where the guy gave it a 7...,read ign review guy gave stori poorth guy real...
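
Raw reviews contain commas and line breaks, so it is worth checking that a saved file reads back unchanged. The sketch below shows such a round-trip check with the standard `csv` module and an in-memory buffer. The sample row is made up, and the buffer is a stand-in for the file on disk; the notebook itself relies on pandas for writing and reading.

```python
import csv
import io

# Made-up row with a comma and a line break inside the raw review.
rows = [
    {"Review": "We've waited so long, and it was worth it!\nLoved it.",
     "Cleaned_Reviews": "weve wait long worth love"},
]

buffer = io.StringIO()  # in-memory stand-in for the CSV file on disk
writer = csv.DictWriter(buffer, fieldnames=["Review", "Cleaned_Reviews"])
writer.writeheader()
writer.writerows(rows)  # fields with commas or newlines are quoted automatically

buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```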


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

####1) Parts of Speech (POS) Tagging on clean text

In [16]:
#Importing the part-of-speech tagging function from NLTK
from nltk import pos_tag
# downloading the 'averaged_perceptron_tagger' resource for POS tagging
# to ensure the POS tagger can work properly
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [17]:
pos_list = [] #initializing an empty list to hold the POS counts and reviews

# using for loop to iterate through each review in the 'Cleaned_Reviews' column
for review in imdb_reviews_df['Cleaned_Reviews']:
    # Initializing counters for nouns, verbs, adjectives, and adverbs
    noun_count, verb_count, adj_count, adv_count = 0, 0, 0, 0

    for word, tag in pos_tag(word_tokenize(review)): #tokenizing the review and applying pos tagging
        # Checking if the tag indicates a noun (NN)
        if tag.startswith('NN'):
            noun_count += 1  # Increment noun counter
        # Checking if the tag indicates a verb (VB)
        elif tag.startswith('VB'):
            verb_count += 1  # Increment verb counter
        # Checking if the tag indicates an adjective (JJ)
        elif tag.startswith('JJ'):
            adj_count += 1  # Increment adjective counter
        # Checking if the tag indicates an adverb (RB)
        elif tag.startswith('RB'):
            adv_count += 1  # Increment adverb counter

    #now, we append the review and its corresponding counts to pos_list using the append() method
    pos_list.append([review, noun_count, verb_count, adj_count, adv_count])

#creating a DataFrame from the list of counts and reviews
pos_df = pd.DataFrame(pos_list, columns=['Cleaned_Reviews', 'Nouns', 'Verbs', 'Adjectives', 'Adverbs'])

# showing the first few rows of the resulting DataFrame
pos_df.head()


Unnamed: 0,Cleaned_Reviews,Nouns,Verbs,Adjectives,Adverbs
0,hugh jackman perfect wolverin fun movi like di...,39,9,12,2
1,crazi blast bonker sooo say movi whole team be...,66,19,23,4
2,weve wait long moment beyond fun wholesom full...,99,30,51,14
3,mani easter egg true comic charact may possibl...,34,14,12,1
4,read ign review guy gave stori poorth guy real...,34,12,15,3


In [18]:
# calculating the total count for each part of speech (POS) type using sum() function
total_nouns = pos_df['Nouns'].sum() #Sum of noun counts
total_verbs = pos_df['Verbs'].sum()   #Sum of verb counts
total_adjectives = pos_df['Adjectives'].sum() #Sum of adjective counts
total_adverbs = pos_df['Adverbs'].sum() #Sum of adverb counts

#printing the counts of pos
print("Summary of Part-of-Speech Counts in the DataFrame:")
print(f"Total number of nouns: {total_nouns}")
print(f"Total number of verbs: {total_verbs}")
print(f"Total number of adjectives: {total_adjectives}")
print(f"Total number of adverbs: {total_adverbs}")

Summary of Part-of-Speech Counts in the DataFrame:
Total number of nouns: 63727
Total number of verbs: 16335
Total number of adjectives: 24318
Total number of adverbs: 5437
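
The counting logic above maps Penn Treebank tag prefixes (NN, VB, JJ, RB) to four coarse classes. The same idea can be sketched with `collections.Counter`. The tagged tokens below are hand-written for illustration, and `count_pos` is a hypothetical helper; a real run would take its input from `nltk.pos_tag`.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse classes counted above.
PREFIX_TO_CLASS = {"NN": "Nouns", "VB": "Verbs", "JJ": "Adjectives", "RB": "Adverbs"}


def count_pos(tagged_tokens):
    """Count coarse POS classes in a list of (word, tag) pairs."""
    counts = Counter()
    for _, tag in tagged_tokens:
        label = PREFIX_TO_CLASS.get(tag[:2])  # e.g. NNS and NNP both start with NN
        if label:
            counts[label] += 1
    return counts


# Hand-written tags for illustration only.
tagged = [("hugh", "NN"), ("jackman", "NN"), ("perfect", "JJ"),
          ("really", "RB"), ("watch", "VB"), ("movies", "NNS"), ("the", "DT")]
print(count_pos(tagged))
```

Tags outside the four classes, such as DT, are ignored, which matches the `if`/`elif` chain in the notebook.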


####(2) Constituency Parsing and Dependency Parsing

In [19]:
#installing the spaCy library
!pip install spacy
#downloading the English language model for spaCy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 80.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [20]:
#installing the Benepar library, used for constituency parsing
!pip install benepar



In [21]:
import spacy  #importing the spaCy library
from benepar import BeneparComponent  #importing the BeneparComponent from the benepar library
import benepar  #importing the benepar library
from nltk import Tree  #importing Tree from NLTK to display constituency trees
#downloading the Benepar model for English
benepar.download('benepar_en3')

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


True

**Constituency Parsing**

Considering 1st review

Example sentence: Hugh Jackman is the perfect Wolverine.

Cleaned sentence: hugh jackman perfect wolverin

Constituency Parsing: It focuses on the hierarchical structure of a sentence, breaking it down into constituents (phrases) that form a tree showing how words combine into larger units. The example sentence is analyzed as follows.

Identify Constituents:

"Hugh Jackman": This is a Noun Phrase (NP). It refers to a specific person and serves as the subject of the sentence.

"perfect": This is an Adjective (ADJ). It modifies the noun "Wolverine" that follows it.

"Wolverine": This is a Noun (N) that serves as the subject complement, indicating the role attributed to the subject. In the original sentence, "the perfect Wolverine" forms a second NP inside the Verb Phrase (VP) headed by "is".

Hierarchical Structure:
The entire sentence can be considered a Sentence (S), which includes:

An NP ("Hugh Jackman") as the subject.

An ADJ ("perfect") modifying the noun "Wolverine".

A Noun (N) ("Wolverine") that completes the thought as the subject complement.

Because the stopwords "is" and "the" were removed during cleaning, the cleaned sentence has no verb or determiner left, so the parser has to guess the structure from the remaining four tokens.
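
To make the hierarchy concrete, the sketch below reads a bracketed tree for the original example sentence and lists the word span of each phrase. The bracketed string is a hand-written Penn Treebank style analysis, not actual benepar output, and `parse_brackets`, `leaves`, and `phrases` are hypothetical helpers written with the standard library only.

```python
def parse_brackets(text):
    """Parse a bracketed tree string into nested [label, child, ...] lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                          # consume "("
        node = [tokens[pos]]              # first item is the constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested constituent
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                          # consume ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def phrases(node, label):
    """Collect the word span of every constituent carrying the given label."""
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(phrases(child, label))
    return found


# Hand-written Penn Treebank style analysis, not actual benepar output.
TREE = ("(S (NP (NNP Hugh) (NNP Jackman)) "
        "(VP (VBZ is) (NP (DT the) (JJ perfect) (NNP Wolverine))))")

tree = parse_brackets(TREE)
print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
```

The two NPs and the single VP correspond to the subject, the subject complement, and the predicate described above.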

In [22]:
#loading spaCy's English language model for text processing
nlp = spacy.load("en_core_web_sm")

#adding the Benepar component to the spaCy pipeline for constituency parsing
nlp.add_pipe("benepar", config={"model": "benepar_en3"}, after='parser')

#creating a function that prints the constituency tree of every sentence in a review; sentences longer than max_length tokens are skipped
def constituency_parsing(text, max_length=512):
    document = nlp(text)  # processing the text with spaCy using nlp()
    # using for loop to iterate through the sentences in the document
    for sent in document.sents:
        # using if condition to check if the sentence length exceeds the maximum length as benepar has a max token length of 512
        if len(sent) > max_length:
            print(f"Skipping sentence due to length: {sent}")
            continue

        #extracting the constituency parse tree in string format
        constituency_tree = sent._.parse_string
        #converting the string representation into an NLTK Tree object
        tree = Tree.fromstring(constituency_tree)
        # Using pretty_print() function to visualize the constituency tree
        tree.pretty_print()

print("\nConstituency Parsing for IMDb Reviews:")
#using a for loop to perform constituency parsing on the first 5 cleaned IMDb reviews
for review in imdb_reviews_df['Cleaned_Reviews'][:5]:
    print(f"Review:\n{review}\n")
    constituency_parsing(review)  # parsing the current review
    print("\n")


  state_dict = torch.load(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Constituency Parsing for IMDb Reviews:
Review:
hugh jackman perfect wolverin fun movi like dialogu clever quip f bomb sprinkl definit take serious ton fun cameo didnt expect normal watch spoiler video ahead time didnt occas im glad didnt oh snap moment good action pack fun film break fox joke speak camera joke funni definit see sequel two horizon promot movi hard watch two hot one eat chicken wing make dynam duo wolverin lol





                                                                                                                       S                                                                                                        
       ________________________________________________________________________________________________________________|______________________________________________________________                                           
      |             |       |          |             |            |     |    |   |      |       |            |              |                                                         VP                                        
      |             |       |          |             |            |     |    |   |      |       |            |              |     ____________________________________________________|____                                      
      |             |       |          |             |            |     |    |   |      |       | 

**Dependency Parsing**

Considering 1st review

Example sentence: Hugh Jackman is the perfect Wolverine.

Cleaned sentence: hugh jackman perfect wolverin

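Dependency parsing: It shows how the words in a sentence depend on each other by linking each word to its head (the word it modifies or complements) with a labeled relation.

hugh  -->  compound  -->  PROPN <br>
Relation: compound  <br>
POS: PROPN (Proper Noun) <br>
This means "hugh" combines with "jackman" to form a compound proper name.

jackman  -->  nsubj  -->  PROPN  <br>
Relation: nsubj  <br>
POS: PROPN (Proper Noun) <br>
Here, "jackman" is the nominal subject of "perfect", the word the parser treats as the verb.

perfect  -->  ROOT  -->  VERB  <br>
Relation: ROOT  <br>
POS: VERB <br>
The parser treats "perfect" as the main verb of the sentence. In the original sentence "perfect" is an adjective and the verb is "is", but "is" was removed as a stopword, so the parser picks "perfect" as the root.

wolverin  -->  amod  -->  ADJ  <br>
Relation: amod  <br>
POS: ADJ (Adjective) <br>
Here, the stemmed token "wolverin" (from "Wolverine") is tagged as an adjective acting as an adjectival modifier of a later noun. In the original sentence it is "perfect" that modifies "Wolverine"; the mislabeling is a side effect of parsing stemmed text without stopwords.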
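
For comparison, the sketch below stores a dependency analysis of the original, uncleaned example sentence as (token, relation, head) triples and groups the dependents under each head. The arcs are hand-written under the assumption that they follow the label scheme used by spaCy's English models; they are not actual spaCy output, and `children_by_head` and `find_root` are hypothetical helpers.

```python
from collections import defaultdict

# Hand-written arcs (token, relation, head) for the original sentence;
# an assumption for illustration, not actual spaCy output.
ARCS = [
    ("Hugh", "compound", "Jackman"),
    ("Jackman", "nsubj", "is"),
    ("is", "ROOT", "is"),            # the root is conventionally its own head
    ("the", "det", "Wolverine"),
    ("perfect", "amod", "Wolverine"),
    ("Wolverine", "attr", "is"),
]


def children_by_head(arcs):
    """Group dependents under their head word, skipping the root's self-loop."""
    grouped = defaultdict(list)
    for token, relation, head in arcs:
        if relation != "ROOT":
            grouped[head].append((token, relation))
    return dict(grouped)


def find_root(arcs):
    """Return the token labelled ROOT."""
    return next(token for token, relation, _ in arcs if relation == "ROOT")


print("root:", find_root(ARCS))
for head, dependents in children_by_head(ARCS).items():
    print(head, "<-", dependents)
```

In this analysis the verb "is" is the root and "perfect" attaches to "Wolverine", which differs from the parse of the cleaned text explained above.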

In [23]:
from spacy import displacy  #importing displacy for visualizing dependency parsing

#again, loading the spaCy's English language model for text processing
nlp = spacy.load("en_core_web_sm")

#defining a function to process and display dependency parsing for cleaned review
def dependency_parsing(text):
    # Processing the text with spaCy using nlp()
    document = nlp(text)
    # visualizing the dependency tree with the help of the render() function from the displacy module
    displacy.render(document, style='dep', jupyter=True)
    #printing each token, its dependency relation, and its part of speech
    for token in document:
        print(f"{token.text}  -->  {token.dep_}  -->  {token.pos_}")
    print("\n") #printing a new line

#applying dependency parsing to the first 5 cleaned reviews in the 'imdb_reviews_df' DataFrame
for review in imdb_reviews_df['Cleaned_Reviews'][:5]:
    print(f"Dependency Parsing for Review:\n{review}\n")
    dependency_parsing(review)
    print('\n')

Dependency Parsing for Review:
hugh jackman perfect wolverin fun movi like dialogu clever quip f bomb sprinkl definit take serious ton fun cameo didnt expect normal watch spoiler video ahead time didnt occas im glad didnt oh snap moment good action pack fun film break fox joke speak camera joke funni definit see sequel two horizon promot movi hard watch two hot one eat chicken wing make dynam duo wolverin lol



hugh  -->  compound  -->  PROPN
jackman  -->  nsubj  -->  PROPN
perfect  -->  ROOT  -->  VERB
wolverin  -->  amod  -->  ADJ
fun  -->  compound  -->  NOUN
movi  -->  dobj  -->  NOUN
like  -->  prep  -->  ADP
dialogu  -->  amod  -->  ADJ
clever  -->  amod  -->  ADJ
quip  -->  pobj  -->  PROPN
f  -->  compound  -->  PROPN
bomb  -->  nsubj  -->  PROPN
sprinkl  -->  conj  -->  VERB
definit  -->  dobj  -->  NOUN
take  -->  conj  -->  VERB
serious  -->  amod  -->  ADJ
ton  -->  compound  -->  NOUN
fun  -->  compound  -->  NOUN
cameo  -->  nsubj  -->  NOUN
did  -->  aux  -->  AUX
nt  -->  neg  -->  PART
expect  -->  conj  -->  VERB
normal  -->  amod  -->  ADJ
watch  -->  compound  -->  NOUN
spoiler  -->  compound  -->  NOUN
video  -->  dobj  -->  NOUN
ahead  -->  amod  -->  ADJ
time  -->  npadvmod  -->  NOUN
did  -->  aux  -->  AUX
nt  -->  neg  -->  PART
occas  -->  conj  -->  NOUN
i  -->  nsubj  -->  PRON
m  -->  appos  -->  VERB
glad  -->  acomp  -->  ADJ
did  -->  prep  -->  AUX
nt  -->  p

crazi  -->  compound  -->  PROPN
blast  -->  compound  -->  PROPN
bonker  -->  compound  -->  PROPN
sooo  -->  nsubj  -->  NOUN
say  -->  ROOT  -->  VERB
movi  -->  nmod  -->  PROPN
whole  -->  amod  -->  ADJ
team  -->  nsubj  -->  NOUN
behind  -->  prep  -->  ADP
movi  -->  pobj  -->  PROPN
never  -->  neg  -->  ADV
hesit  -->  ccomp  -->  VERB
second  -->  advmod  -->  ADV
go  -->  xcomp  -->  VERB
everyth  -->  compound  -->  NOUN
store  -->  compound  -->  NOUN
throw  -->  compound  -->  NOUN
kitchen  -->  compound  -->  NOUN
sink  -->  dobj  -->  VERB
everyth  -->  nmod  -->  PROPN
elsewhat  -->  compound  -->  PROPN
love  -->  compound  -->  NOUN
highli  -->  compound  -->  PROPN
satisfi  -->  compound  -->  PROPN
movi  -->  dobj  -->  PROPN
first  -->  advmod  -->  ADV
last  -->  amod  -->  ADJ
second  -->  advmod  -->  ADV
come  -->  ccomp  -->  VERB
littl  -->  compound  -->  PROPN
littl  -->  compound  -->  PROPN
wink  -->  compound  -->  PROPN
audienc  -->  nsubj  -->  PROPN

we  -->  nsubj  -->  PRON
ve  -->  aux  -->  AUX
wait  -->  ccomp  -->  VERB
long  -->  amod  -->  ADJ
moment  -->  npadvmod  -->  NOUN
beyond  -->  prep  -->  ADP
fun  -->  pobj  -->  NOUN
wholesom  -->  advcl  -->  NOUN
full  -->  amod  -->  ADJ
surpris  -->  compound  -->  NOUN
emot  -->  dobj  -->  NOUN
epic  -->  compound  -->  PROPN
ryan  -->  compound  -->  PROPN
reynold  -->  compound  -->  PROPN
hugh  -->  compound  -->  PROPN
jackman  -->  nsubj  -->  PROPN
shawn  -->  compound  -->  PROPN
levi  -->  compound  -->  PROPN
pour  -->  compound  -->  PROPN
heart  -->  compound  -->  NOUN
movi  -->  compound  -->  PROPN
itit  -->  npadvmod  -->  NOUN
beyond  -->  prep  -->  ADP
mcu  -->  compound  -->  PROPN
timelin  -->  pobj  -->  PROPN
beyond  -->  prep  -->  ADP
even  -->  advmod  -->  ADV
fox  -->  compound  -->  PROPN
xmen  -->  compound  -->  PROPN
movi  -->  compound  -->  PROPN
kid  -->  nsubj  -->  PROPN
grew  -->  conj  -->  VERB
movi  -->  compound  -->  PROPN
charact 

mani  -->  nsubj  -->  PROPN
easter  -->  amod  -->  PROPN
egg  -->  nmod  -->  NOUN
true  -->  amod  -->  ADJ
comic  -->  amod  -->  ADJ
charact  -->  dobj  -->  NOUN
may  -->  aux  -->  AUX
possibl  -->  ROOT  -->  VERB
singl  -->  amod  -->  PROPN
handedli  -->  dobj  -->  PROPN
save  -->  conj  -->  VERB
mcu  -->  dobj  -->  NOUN
everyth  -->  advmod  -->  ADV
you  -->  compound  -->  PRON
d  -->  nsubj  -->  PROPN
expect  -->  conj  -->  VERB
deadpool  -->  compound  -->  PROPN
movi  -->  compound  -->  PROPN
plu  -->  nsubj  -->  PROPN
have  -->  aux  -->  AUX
nt  -->  neg  -->  PART
left  -->  ccomp  -->  VERB
theatr  -->  amod  -->  ADJ
buzzi  -->  dobj  -->  NOUN
long  -->  amod  -->  ADJ
time  -->  npadvmod  -->  NOUN
rip  -->  oprd  -->  VERB
shred  -->  amod  -->  ADJ
tedium  -->  dobj  -->  NOUN
last  -->  amod  -->  ADJ
year  -->  npadvmod  -->  NOUN
mcu  -->  nsubj  -->  NOUN
make  -->  ccomp  -->  VERB
think  -->  dobj  -->  NOUN
may  -->  aux  -->  AUX
life  -->  nmod 

read  -->  advcl  -->  VERB
ign  -->  compound  -->  PROPN
review  -->  compound  -->  NOUN
guy  -->  nsubj  -->  NOUN
gave  -->  ROOT  -->  VERB
stori  -->  dative  -->  NOUN
poorth  -->  compound  -->  ADJ
guy  -->  compound  -->  NOUN
realli  -->  compound  -->  PROPN
need  -->  compound  -->  PROPN
read  -->  compound  -->  NOUN
roomyou  -->  dobj  -->  NOUN
go  -->  dobj  -->  VERB
see  -->  advcl  -->  VERB
deadpool  -->  amod  -->  ADJ
great  -->  amod  -->  ADJ
indepth  -->  amod  -->  ADJ
stori  -->  nsubj  -->  NOUN
make  -->  ccomp  -->  VERB
thinkyou  -->  nsubj  -->  PRON
go  -->  ccomp  -->  VERB
see  -->  advcl  -->  VERB
deadpool  -->  amod  -->  ADJ
fun  -->  compound  -->  NOUN
obscen  -->  compound  -->  NOUN
fight  -->  compound  -->  NOUN
scene  -->  compound  -->  NOUN
great  -->  amod  -->  ADJ
joke  -->  nmod  -->  NOUN
bad  -->  amod  -->  ADJ
joke  -->  nsubj  -->  NOUN
ridicul  -->  ccomp  -->  VERB
death  -->  compound  -->  NOUN
sarcasm  -->  dobj  -->  NOU

#### (3) Named Entity Recognition on cleaned text

In [24]:
from collections import Counter  # Counter tallies entity occurrences

nlp = spacy.load("en_core_web_sm")  # Load the small English spaCy pipeline
all_entities = []  # Collects (entity text, entity label) tuples

# Iterate over every cleaned review in the imdb_reviews_df DataFrame
for review in imdb_reviews_df['Cleaned_Reviews']:
    doc = nlp(review)  # Run the spaCy pipeline on the review
    for ent in doc.ents:  # Iterate over each entity found in the review
        all_entities.append((ent.text.strip(), ent.label_))  # Store the entity text with its label

# Count the occurrences of each (entity, label) pair
entity_count = Counter(all_entities)
# Print the entity counts as a table
print(f"{'Entity Type':<10}  {'Entity Name':<30}  {'Count'}")
for (entity, label), count in entity_count.items():  # Iterate over the counted entities
    print(f"{label:<10}  {entity:<30}  {count}")  # Print entity type, name, and count


Entity Type  Entity Name                     Count
ORG         hugh jackman                    105
ORG         fox                             41
ORG         funni                           68
CARDINAL    two                             347
ORG         promot movi hard                1
ORG         crazi blast bonker sooo         2
ORDINAL     second                          97
ORDINAL     first                           394
PERSON      doingbetween onelin             2
ORG         park overflowingli craycray frame  2
ORG         stuffsoh tone movi              2
FAC         someth el intro sequenc         2
NORP        surpris                         59
PERSON      hugh jackman                    196
ORG         best inevit bromanc             1
LOC         nova                            64
ORG         prais movi                      2
CARDINAL    one                             534
PERSON      realli beauti nowher            1
DATE        last year                       22
ORG       
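The table above lists entities in first-seen order, and because the `Entity Type` header is 11 characters wide while its column is padded to 10, the header sits one character off from the rows. The following is a minimal sketch of one way to print the most frequent entities first with a wider first column. It assumes the same `(entity, label)` tuple layout as `all_entities`; the helper name `format_entity_table`, the `top_n` parameter, and the sample list are hypothetical and not part of the original notebook.

```python
from collections import Counter


def format_entity_table(entities, top_n=10):
    """Return table lines for the top_n most frequent (entity, label) pairs."""
    counts = Counter(entities)
    # A width of 12 fits the 11-character 'Entity Type' header
    lines = [f"{'Entity Type':<12}  {'Entity Name':<30}  {'Count'}"]
    # most_common() yields pairs sorted by descending count
    for (entity, label), count in counts.most_common(top_n):
        lines.append(f"{label:<12}  {entity:<30}  {count}")
    return lines


# Hypothetical sample standing in for all_entities
sample_entities = [
    ("hugh jackman", "PERSON"),
    ("hugh jackman", "PERSON"),
    ("two", "CARDINAL"),
    ("hugh jackman", "ORG"),
    ("two", "CARDINAL"),
    ("hugh jackman", "PERSON"),
]

table_lines = format_entity_table(sample_entities)
print("\n".join(table_lines))
```

In the notebook, the same helper could be called as `format_entity_table(all_entities, top_n=20)` to inspect only the most frequent entities.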

In [25]:
# Nested dictionary: entity label -> {entity text -> count}
entity_types = {}

for entity, label in all_entities:  # Go through all extracted entities
    if label not in entity_types:  # Check whether the label is already in the dictionary
        entity_types[label] = {}  # If not, create a new dictionary for that label

    # Check whether the entity already exists under its label
    if entity in entity_types[label]:
        entity_types[label][entity] += 1  # If yes, increment its count
    else:
        entity_types[label][entity] = 1  # If no, initialize its count to 1

print('Summary of Named Entities:\n')
for label, entities in entity_types.items():
    total_count = sum(entities.values())  # Total count of entities for this label
    print(f"\n{label} Type: Total Count = {total_count}")  # Print the label and total count
    for entity, count in entities.items():
        print(f"  {entity}: {count}")  # Print each entity and its count


Summary of Named Entities:


ORG Type: Total Count = 1279
  hugh jackman: 105
  fox: 41
  funni: 68
  promot movi hard: 1
  crazi blast bonker sooo: 2
  park overflowingli craycray frame: 2
  stuffsoh tone movi: 2
  best inevit bromanc: 1
  prais movi: 2
  lucki nz: 2
  hardcor movi meta: 2
  hardcor: 3
  failur marvel sinc: 1
  funni someth: 1
  theme instantli: 1
  tva: 35
  issu: 24
  aris actual: 1
  sloppi disorgan: 1
  dri: 1
  spectacl: 13
  everyth: 3
  funni moment: 2
  movi middl: 1
  neg movi: 1
  awesom movi lot funni: 1
  funni flat: 1
  everyth movi perfect end: 1
  uniqu substanc movi run: 1
  disney: 8
  hugh jackmanth: 1
  allur superhero movi: 1
  funni rip roaringli: 1
  ga: 1
  funni comedian: 1
  impactth: 1
  referencesal movi: 1
  outcom movi mani: 1
  funni uniqu disney: 1
  summar movi awsom moment: 1
  crazi movi realli: 1
  feel lazi uninspir pull punchesreynold: 1
  world loki tv: 1
  el show: 1
  loki reason lazi: 1
  conveni: 4
  vfx: 1
  choreographi: 1
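
The per-label summary above can also be built without the explicit membership checks. This is a minimal sketch, assuming the same `(entity, label)` tuples as `all_entities`; the helper name `group_entities_by_label` and the sample list are hypothetical. It uses a `defaultdict` of `Counter` objects so that missing labels and entities start at zero automatically.

```python
from collections import Counter, defaultdict


def group_entities_by_label(entities):
    """Group (entity, label) tuples into {label: Counter({entity: count})}."""
    grouped = defaultdict(Counter)
    for entity, label in entities:
        grouped[label][entity] += 1  # Missing keys default to a count of 0
    return grouped


# Hypothetical sample standing in for all_entities
sample_entities = [
    ("hugh jackman", "PERSON"),
    ("hugh jackman", "PERSON"),
    ("two", "CARDINAL"),
    ("hugh jackman", "ORG"),
    ("two", "CARDINAL"),
    ("hugh jackman", "PERSON"),
]

grouped = group_entities_by_label(sample_entities)
for label, entity_counts in grouped.items():
    total_count = sum(entity_counts.values())
    print(f"\n{label} Type: Total Count = {total_count}")
    # List the entities under each label from most to least frequent
    for entity, count in entity_counts.most_common():
        print(f"  {entity}: {count}")
```

Sorting within each label makes it easier to spot cases like `hugh jackman` appearing under both `ORG` and `PERSON`.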
 

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

https://drive.google.com/drive/folders/1W6yN3WRrncqOTj_HPk5muiMdtceF7B-d?usp=sharing

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [26]:
# Write your response below

'''
I think the assignment was great and I had fun scraping the IMDB website for reviews data.
What I found the most challenging was Question 3, especially the constituency and dependency parsing,
but I figured it out with the help of StackOverflow and some research.
I think the time given to complete the assignment was too short; I would appreciate it if you gave
more time for the upcoming assignments.

'''

'\nI think the assignment was great and I had fun scraping the IMDB website for reviews data.\nWhat I found the most challenging was Question 3, especially the constituency and dependency parsing,\nbut I figured it out with the help of StackOverflow and some research.\nI think the time given to complete the assignment was too short; I would appreciate it if you gave\nmore time for the upcoming assignments.\n\n'