
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# Base URL of the IMDb reviews AJAX endpoint for the chosen movie; the pagination key is appended to it
url = 'https://www.imdb.com/title/tt9362722/reviews/_ajax?ref_=undefined&paginationKey='

# Headers to mimic a real browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Initialize the variables used for pagination and for storing reviews
reviews = []
pagination_key = ''
total_reviews_needed = 1000  # top 1000 reviews

# Open the CSV file for writing
with open('imdb_reviews.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Review'])  # Write header row

    # Loop to go through all the pages until we have enough(1000) reviews
    while len(reviews) < total_reviews_needed:
        # Construct the full URL with the pagination key
        full_url = url + pagination_key

        # Send a GET request to the URL; fail fast on timeouts and HTTP errors
        response = requests.get(full_url, headers=headers, timeout=30)
        response.raise_for_status()

        # Parsing the page content
        soup = BeautifulSoup(response.content, 'html.parser')

        # Finding all the review-containers
        new_reviews = soup.find_all('div', class_='text show-more__control')

        # If no new reviews are found, we have reached the last page, so stop
        if not new_reviews:
            print("No more reviews found.")
            break

        # Add the new reviews to the running list and write them to the CSV
        for review in new_reviews:
            review_text = review.get_text().strip()
            if len(reviews) < total_reviews_needed:
                reviews.append(review_text)
                writer.writerow([review_text])
            else:
                break  # Break if we already have the 1000 reviews

        # Update the pagination key for the next page
        load_more_data = soup.find('div', {'class': 'load-more-data'})
        if load_more_data and load_more_data.has_attr('data-key'):
            pagination_key = load_more_data['data-key']
        else:
            break  # Break here, if no pagination key is being found

        # Pausing between requests to avoid being blocked
        time.sleep(1)

# Printing the number of reviews scraped
print(f'Scraped {len(reviews)} reviews successfully!')


Scraped 1000 reviews successfully!
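
The loop above follows a cursor-style pagination pattern: each response carries a batch of reviews plus a key that identifies the next page. Below is a minimal sketch of that pattern with the network call factored out, so the control flow can be checked offline. The names `collect_items`, `fake_fetch`, and `FAKE_PAGES` are hypothetical stand-ins introduced for illustration; they are not part of IMDb's site or of the notebook.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher takes a pagination key and returns (items, next_key).
# next_key is None when there are no further pages.
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def collect_items(fetch_page: PageFetcher, limit: int) -> List[str]:
    """Follow pagination keys until `limit` items are collected or pages run out."""
    items: List[str] = []
    key: Optional[str] = ""  # an empty key requests the first page
    while key is not None and len(items) < limit:
        page_items, key = fetch_page(key)
        if not page_items:
            break  # an empty page is treated as the end of the listing
        items.extend(page_items[: limit - len(items)])
    return items


# Hypothetical in-memory "site" with three pages, used only for illustration.
FAKE_PAGES = {
    "": (["review 1", "review 2"], "k1"),
    "k1": (["review 3", "review 4"], "k2"),
    "k2": (["review 5"], None),
}


def fake_fetch(key: str) -> Tuple[List[str], Optional[str]]:
    return FAKE_PAGES[key]


first_three = collect_items(fake_fetch, limit=3)
everything = collect_items(fake_fetch, limit=100)
print(first_three)
print(len(everything))
```

Keeping the fetch step behind a function makes it possible to swap in a real HTTP call later without touching the stopping conditions.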


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
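
As a rough orientation, steps (1) to (4) can be chained into a single function. The sketch below uses only the standard library; `TOY_STOPWORDS` is a tiny made-up list (the notebook relies on NLTK's English stopword list), the noise pattern here deliberately keeps digits so that step (2) is visible on its own, and stemming and lemmatization are left out because they need NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; an assumption for this sketch only.
TOY_STOPWORDS = {"the", "is", "a", "of", "it", "this", "in", "and"}


def clean_text(text: str) -> str:
    """Apply steps (1)-(4): strip noise, strip digits, drop stopwords, lowercase."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)   # (1) punctuation and symbols
    text = re.sub(r"\d+", "", text)              # (2) numbers
    tokens = [t for t in text.split() if t.lower() not in TOY_STOPWORDS]  # (3)
    return " ".join(tokens).lower()              # (4) lowercase


sample = "This film is a visual concert, easily 10/10!"
cleaned = clean_text(sample)
print(cleaned)
```

Comparing tokens in lowercase during stopword removal means the lowercasing step can run afterwards without letting capitalized stopwords such as "This" slip through.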

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Downloading the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
import pandas as pd

# Read the scraped reviews from the local CSV file
csv_path = "/content/imdb_reviews.csv"
df = pd.read_csv(csv_path)

# Display the first few rows of the DataFrame
print(df.head())


                                              Review
0  It's honestly absurd how good the "Spider-Vers...
1  If it wasn't already obvious in the first film...
2  First off, you should know that this is the fi...
3  This film is a visual concert. The animation a...
4  The animation, flow of everything, genius char...


# (1) Remove noise, such as special characters and punctuation.

In [6]:
import re
def remove_noise(text):
    return re.sub(r'[^A-Za-z\s]', '', text)  # Keep only letters and whitespace

# Applying the function to remove noise
df['Noisy Removed'] = df['Review'].apply(remove_noise)

# Display the updated DataFrame with original and cleaned reviews
print(df[['Review', 'Noisy Removed']].head())


                                              Review  \
0  It's honestly absurd how good the "Spider-Vers...   
1  If it wasn't already obvious in the first film...   
2  First off, you should know that this is the fi...   
3  This film is a visual concert. The animation a...   
4  The animation, flow of everything, genius char...   

                                       Noisy Removed  
0  Its honestly absurd how good the SpiderVerse m...  
1  If it wasnt already obvious in the first film ...  
2  First off you should know that this is the fir...  
3  This film is a visual concert The animation an...  
4  The animation flow of everything genius charac...  


# (2) Remove numbers.

In [7]:
# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)  # Remove all digits

# Apply the function to remove numbers from the 'Noisy Removed' column
df['Numbers Removed'] = df['Noisy Removed'].apply(remove_numbers)

# Display the updated DataFrame with original, cleaned, and number-removed reviews
print(df[['Review', 'Noisy Removed', 'Numbers Removed']].head())



                                              Review  \
0  It's honestly absurd how good the "Spider-Vers...   
1  If it wasn't already obvious in the first film...   
2  First off, you should know that this is the fi...   
3  This film is a visual concert. The animation a...   
4  The animation, flow of everything, genius char...   

                                       Noisy Removed  \
0  Its honestly absurd how good the SpiderVerse m...   
1  If it wasnt already obvious in the first film ...   
2  First off you should know that this is the fir...   
3  This film is a visual concert The animation an...   
4  The animation flow of everything genius charac...   

                                     Numbers Removed  
0  Its honestly absurd how good the SpiderVerse m...  
1  If it wasnt already obvious in the first film ...  
2  First off you should know that this is the fir...  
3  This film is a visual concert The animation an...  
4  The animation flow of everything genius charac...  
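
One detail worth checking, assuming the two patterns from the cells above: `[^A-Za-z\s]` keeps only letters and whitespace, so digits are already gone after step (1) and step (2) then leaves the text unchanged. The sketch below verifies this on a made-up sentence and also shows one way to collapse the extra spaces left behind by removed characters.

```python
import re

NOISE = re.compile(r"[^A-Za-z\s]")   # pattern from step (1)
DIGITS = re.compile(r"\d+")          # pattern from step (2)

sample = "Rated 10/10 in 2023!"      # made-up sentence for illustration

after_noise = NOISE.sub("", sample)
after_digits = DIGITS.sub("", after_noise)

# Step (2) has nothing left to remove once step (1) has run.
print(after_noise == after_digits)

# Collapse the runs of whitespace left behind by the removed characters.
tidy = " ".join(after_digits.split())
print(repr(tidy))
```

If the two steps should be observable separately, one option is to let the noise pattern keep digits (for example `[^A-Za-z0-9\s]`) and leave their removal to step (2).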

# (3) Remove stopwords by using the stopwords list.

In [10]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Downloading the NLTK resources if not already available
nltk.download('stopwords')

# Initializing the stopwords
stop_words = set(stopwords.words('english'))

# Function for removing the stopwords
def remove_stopwords(text):
    text_tokens = text.split()  # Split the text into tokens (words)
    return ' '.join([word for word in text_tokens if word.lower() not in stop_words])  # Remove stopwords and join back to string

# Applying the function to remove stopwords from the 'Numbers Removed' column
df['Stopwords Removed'] = df['Numbers Removed'].apply(remove_stopwords)

# Displaying the updated DataFrame with original review and stopwords removed
print(df[['Review', 'Stopwords Removed']].head())


                                              Review  \
0  It's honestly absurd how good the "Spider-Vers...   
1  If it wasn't already obvious in the first film...   
2  First off, you should know that this is the fi...   
3  This film is a visual concert. The animation a...   
4  The animation, flow of everything, genius char...   

                                   Stopwords Removed  
0  honestly absurd good SpiderVerse movies Across...  
1  wasnt already obvious first film officially cl...  
2  First know first part one big movie split two ...  
3  film visual concert animation character design...  
4  animation flow everything genius character dev...  


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# (4) Lowercase all texts

In [11]:
# Function for lowercasing all texts
def lowercase_text(text):
    return text.lower()  # Convert text to lowercase

# Applying the function to the 'Stopwords Removed' column
df['Lowercased'] = df['Stopwords Removed'].apply(lowercase_text)

# Displaying the updated DataFrame with the original review and lowercased text
print(df[['Review', 'Lowercased']].head())



                                              Review  \
0  It's honestly absurd how good the "Spider-Vers...   
1  If it wasn't already obvious in the first film...   
2  First off, you should know that this is the fi...   
3  This film is a visual concert. The animation a...   
4  The animation, flow of everything, genius char...   

                                          Lowercased  
0  honestly absurd good spiderverse movies across...  
1  wasnt already obvious first film officially cl...  
2  first know first part one big movie split two ...  
3  film visual concert animation character design...  
4  animation flow everything genius character dev...  


# (5) Stemming.

In [12]:
from nltk.stem import PorterStemmer

# Initializing the stemmer
stemmer = PorterStemmer()

# Function for the purpose of stemming
def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])  # Apply stemming to each word

# Applying the function to the 'Lowercased' column
df['Stemmed'] = df['Lowercased'].apply(stem_text)

# Displaying the updated DataFrame with the original review and stemmed text
print(df[['Review', 'Stemmed']].head())



                                              Review  \
0  It's honestly absurd how good the "Spider-Vers...   
1  If it wasn't already obvious in the first film...   
2  First off, you should know that this is the fi...   
3  This film is a visual concert. The animation a...   
4  The animation, flow of everything, genius char...   

                                             Stemmed  
0  honestli absurd good spidervers movi across sp...  
1  wasnt alreadi obviou first film offici clear d...  
2  first know first part one big movi split two h...  
3  film visual concert anim charact design neatli...  
4  anim flow everyth geniu charact develop action...  


# (6) Lemmatization.

In [13]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)

In [14]:
import pandas as pd
import spacy
import contractions

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

# df is the DataFrame built in the cells above

# Function to expand contractions
def expand_contractions(text):
    return contractions.fix(text)  # Expanding contractions

# Function for lemmatization using spaCy
def lemmatize_text(text):
    doc = nlp(text)  # Process the text with spaCy
    return ' '.join([token.lemma_ for token in doc])  # Join lemmas into a string

# Applying the functions to expand contractions and then lemmatize.
# Note: the input is the already-stemmed column, so the lemmas are derived from stems.
df['Lemmatized'] = df['Stemmed'].apply(expand_contractions).apply(lemmatize_text)

# Displaying the updated DataFrame with original review and lemmatized text
print(df[['Review', 'Lemmatized']].head())



                                              Review  \
0  It's honestly absurd how good the "Spider-Vers...   
1  If it wasn't already obvious in the first film...   
2  First off, you should know that this is the fi...   
3  This film is a visual concert. The animation a...   
4  The animation, flow of everything, genius char...   

                                          Lemmatized  
0  honestli absurd good spiderver movi across spi...  
1  be not alreadi obviou first film offici clear ...  
2  first know first part one big movi split two h...  
3  film visual concert anim charact design neatli...  
4  anim flow everyth geniu charact develop action...  


In [15]:
# Saving the cleaned data to a new CSV file called 'imdb_reviews_cleaned.csv'
df.to_csv('imdb_reviews_cleaned.csv', index=False)

# Print a confirmation message
print("Cleaned data saved to 'imdb_reviews_cleaned.csv'")



Cleaned data saved to 'imdb_reviews_cleaned.csv'


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

# (1) Parts of Speech (POS) Tagging

In [16]:
import nltk
import pandas as pd
from collections import Counter

# Downloading all necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Function for POS tagging
def pos_tagging(text):
    words = nltk.word_tokenize(text)  # Tokenize the text
    pos_tags = nltk.pos_tag(words)     # Get POS tags
    return pos_tags

# Function to count POS tags
def count_pos(pos_tags):
    pos_counts = Counter(tag for word, tag in pos_tags)  # Count POS tags
    return {
        'Nouns': sum(pos_counts[tag] for tag in ['NN', 'NNS', 'NNP', 'NNPS']),
        'Verbs': sum(pos_counts[tag] for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']),
        'Adjectives': sum(pos_counts[tag] for tag in ['JJ', 'JJR', 'JJS']),
        'Adverbs': sum(pos_counts[tag] for tag in ['RB', 'RBR', 'RBS'])
    }

# Loading the cleaned data
dataFrame = pd.read_csv('imdb_reviews_cleaned.csv')

# Performing POS tagging on cleaned text
dataFrame['POS_tags'] = dataFrame['Lemmatized'].apply(pos_tagging)

# Counting the POS for each review
dataFrame['POS_counts'] = dataFrame['POS_tags'].apply(count_pos)

# Calculating the total POS counts
total_pos_counts = dataFrame['POS_counts'].apply(pd.Series).sum()

# Display total POS counts
print("Total POS counts:")
print(total_pos_counts)

# Saving all results to a CSV file
dataFrame.to_csv('imdb_reviews_pos_tagged.csv', index=False)
print("POS tagging results saved to 'imdb_reviews_pos_tagged.csv'")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Total POS counts:
Nouns         57117
Verbs         16042
Adjectives    22432
Adverbs        7637
dtype: int64
POS tagging results saved to 'imdb_reviews_pos_tagged.csv'
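
The counting step above groups Penn Treebank tags into four coarse classes by listing each tag explicitly. A minimal sketch of an equivalent grouping by two-letter tag prefix follows; the tagged tokens are hand-written for illustration (in the notebook they come from `nltk.pos_tag`), and `count_coarse_pos` is a hypothetical helper name.

```python
from collections import Counter

# Penn Treebank tags grouped by prefix: NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs.
PREFIX_TO_CLASS = {"NN": "Nouns", "VB": "Verbs", "JJ": "Adjectives", "RB": "Adverbs"}


def count_coarse_pos(tagged_tokens):
    """Count coarse word classes in a list of (word, tag) pairs."""
    counts = Counter({name: 0 for name in PREFIX_TO_CLASS.values()})
    for _, tag in tagged_tokens:
        word_class = PREFIX_TO_CLASS.get(tag[:2])
        if word_class is not None:
            counts[word_class] += 1
    return counts


# Hand-written tags for illustration only.
tagged = [("film", "NN"), ("take", "VB"), ("bold", "JJ"), ("risk", "NN"),
          ("throughout", "IN"), ("really", "RB"), ("movies", "NNS")]

pos_counts = count_coarse_pos(tagged)
print(dict(pos_counts))
```

Tags outside the four classes, such as `IN` for prepositions, are simply skipped, which matches the behaviour of the explicit tag lists.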


# (2) Constituency Parsing and Dependency Parsing

In [17]:
import nltk
import spacy
from nltk import Tree
import pandas as pd

# Downloading all the necessary NLTK data quietly
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)

# Loading the spaCy model
nlp = spacy.load("en_core_web_sm")

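# Function for the "constituency" step.
# Note: nltk.ne_chunk builds a shallow chunk tree (POS-tagged tokens with
# named-entity chunks grouped), not a full constituency parse.
def constituency_parse(sentence):
    words = nltk.word_tokenize(sentence)  # Tokenize the sentence
    pos_tags = nltk.pos_tag(words)        # Get POS tags
    tree = nltk.ne_chunk(pos_tags)        # Build the shallow chunk tree
    return tree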

# Function for dependency parsing
def dependency_parse(sentence):
    doc = nlp(sentence)                    # Process the sentence with spaCy
    return [(token.text, token.dep_, token.head.text) for token in doc]

# Loading the cleaned data from CSV file
dataFrame = pd.read_csv('imdb_reviews_cleaned.csv')

# Processing all the sentences
for index, row in dataFrame.iterrows():
    sentence = row['Lemmatized']  # Access the lemmatized sentence
    print(f"\n\nSentence {index + 1}:")
    print(sentence)

    # Constituency parsing
    try:
        print("\nConstituency Parse Tree:")
        constituency_tree = constituency_parse(sentence)
        print(constituency_tree)
    except LookupError as e:
        print(f"Error in constituency parsing: {e}")

    # Dependency parsing
    print("\nDependency Parse:")
    dependency_relations = dependency_parse(sentence)
    for word, dep, head in dependency_relations:
        print(f"{word} --{dep}--> {head}")

print("\nParsing completed for all sentences.")


Streaming output truncated to the last 5000 lines.
  wisecrack/NN
  peter/NN
  b/NN
  parker/NN
  enigmat/NN
  miguel/NN
  ohara/NN
  team/NN
  spiderpeopl/NN
  form/NN
  bond/NN
  endear/JJ
  empow/NN
  chemistri/NN
  growth/NN
  throughout/IN
  film/NN
  keep/VB
  hook/NN
  start/VB
  finishat/WP
  core/NN
  film/NN
  delv/NN
  profound/NN
  theme/NN
  famili/NN
  friendship/NN
  selfdiscoveri/JJ
  explor/NN
  complex/JJ
  ident/NN
  univer/NN
  desir/NN
  belong/JJ
  script/NN
  master/NN
  weav/NN
  theme/NN
  narr/NN
  leav/NN
  last/JJ
  impact/NN
  reson/VBD
  audienc/RB
  long/JJ
  credit/NN
  rollspiderman/NN
  across/IN
  spiderver/NN
  sequel/NN
  testament/NN
  power/NN
  anim/IN
  art/JJ
  form/NN
  transcend/NN
  boundari/NN
  genr/NN
  age/NN
  appeal/NN
  young/JJ
  adult/NN
  audienc/JJ
  cinemat/NN
  achiev/NN
  rememb/NN
  landmark/NN
  anim/NN
  storytellingin/NN
  hand/NN
  visionari/NN
  filmmak/NN
  film/NN
  take/VB
  bold/JJ
  risk/NN
  redefin/NN

# (3) Named Entity Recognition

In [18]:
import spacy
import pandas as pd
from collections import Counter

# Loading the spaCy model
nlp = spacy.load("en_core_web_sm")

# Function to perform NER
def perform_ner(text):
    doc = nlp(text)  # Process the text using the spaCy model
    return [(ent.text, ent.label_) for ent in doc.ents]  # Extract entities and their labels

# Loading the cleaned data from CSV file
dataFrame = pd.read_csv('imdb_reviews_cleaned.csv')

# Performing NER on all the cleaned texts extracted from the CSV file
all_entities = []

for index, row in dataFrame.iterrows():
    text = row['Lemmatized']  # Access the lemmatized text
    entities = perform_ner(text)
    all_entities.extend(entities)

    # Printing all entities for each text
    print(f"\nEntities in text {index + 1}:")
    for entity, label in entities:
        print(f"{entity} - {label}")

# Calculating the count of each entity type
entity_counts = Counter(label for _, label in all_entities)

print("\nTotal entity counts:")
for entity_type, count in entity_counts.items():
    print(f"{entity_type}: {count}")

# Creating a list of all unique entities
unique_entities = list(set(all_entities))

print("\nSample of unique entities found (up to 20):")
for entity, label in unique_entities[:20]:
    print(f"{entity} - {label}")

# Saving the results to CSV
results_dataFrame = pd.DataFrame(unique_entities, columns=['Entity', 'Type'])
results_dataFrame.to_csv('named_entities.csv', index=False)
print("\nFull list of named entities saved to 'named_entities.csv'")


Streaming output truncated to the last 5000 lines.

Entities in text 391:
cool movi long time - ORG
mile - QUANTITY

Entities in text 392:
five - CARDINAL
five - CARDINAL
cliffhang - ORG
one - CARDINAL
realiti accompani undeni truth - PERSON
believ challeng - ORG
hurdl emerg - ORG
vividli display mile - EVENT
lesson impart - PERSON
mile emerg - PERSON
relat - PERSON
battl strike - PERSON
coupl apprehen - PERSON
urg safeguard - PERSON
peter demis experi parallel - PERSON
world instanc - ORG
age mile - DATE
effortlessi distinctli - PERSON
realiti spiderperson life integr intric detail fluid - PERSON
narr - ORG
genr - GPE
reson profoundli - PERSON

Entities in text 393:
first - ORDINAL
one - CARDINAL
one - CARDINAL
movi charact thereinc wait - ORG
next one - DATE

Entities in text 394:
thrill - GPE
precis creativ - PERSON
innov - PERSON

Entities in text 395:

Entities in text 396:
confus - ORG
everi movi - PERSON
two - CARDINAL

Entities in text 397:
first - ORDINAL
everi d
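
The cell above totals entities per label and stores the unique `(entity, label)` pairs. If the frequency of each individual entity is also of interest, a `Counter` over the pairs keeps that information. The sketch below uses a handful of pairs copied from the sample output above; it is an illustration, not a rerun of the notebook.

```python
from collections import Counter

# (entity text, label) pairs copied from the sample output above.
entities = [
    ("five", "CARDINAL"), ("five", "CARDINAL"), ("cliffhang", "ORG"),
    ("one", "CARDINAL"), ("first", "ORDINAL"), ("one", "CARDINAL"),
    ("one", "CARDINAL"), ("next one", "DATE"),
]

# Count per label (what the cell above reports) ...
label_counts = Counter(label for _, label in entities)
# ... and per distinct (entity, label) pair, which keeps how often each entity occurs.
entity_counts = Counter(entities)

print(label_counts.most_common())
print(entity_counts.most_common(3))
```

Converting the pairs to a `set`, as the notebook does for the CSV export, discards these per-entity frequencies, so the two approaches answer slightly different questions.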

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [19]:

dataFrame.to_csv('final_cleaned_dataset.csv', index=False)
print("Final cleaned dataset with all steps saved to 'final_cleaned_dataset.csv'")




Final cleaned dataset with all steps saved to 'final_cleaned_dataset.csv'


In [20]:
import pandas as pd

# Loading the final cleaned CSV file
dataFrame = pd.read_csv('final_cleaned_dataset.csv')

# Keeping only the final cleaned 'Lemmatized' column since it is the last step performed and renaming it to 'Final Review'
final_dataFrame = dataFrame[['Lemmatized']].rename(columns={'Lemmatized': 'Final Review'})

# Saving the entire final dataset to a new CSV file called 'final_reviews_dataset.csv'
final_dataFrame.to_csv('final_reviews_dataset.csv', index=False)

# Print a final confirmation message
print("Cleaned and final reviews dataset has been saved to 'final_reviews_dataset.csv'")




Cleaned and final reviews dataset has been saved to 'final_reviews_dataset.csv'


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [21]:
# Write your response below

'''

The assignment was engaging, especially the text processing techniques that improved the quality of movie reviews. I found it challenging to handle certain tasks like removing noise and stemming words. I enjoyed using NLP techniques, particularly Named Entity Recognition, to identify different entities in the text. The time provided felt reasonable, allowing me to explore concepts without feeling rushed.
'''
