# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
!pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time



In [2]:
# Defining a function
def BarbieIMDbReviews(reviews_url, no_of_reviews=1000):
    review_titles = []
    review_texts = []

    while len(review_titles) < no_of_reviews:
        # GET request to the URL
        response = requests.get(reviews_url)
        # Initializing BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Finding all reviews
        review_boxes = soup.find_all('div', class_='lister-item-content')

        # Using for loop to iterate through all the reviews
        for single_review in review_boxes:
            if len(review_titles) >= no_of_reviews:
                break # in case of more than 1000 reviews
            # Extracting the review title
            title = single_review.find('a', class_='title').text.strip()
            review_titles.append(title)

            # Extracting the review text
            text = single_review.find('div', class_='text show-more__control').text.strip()
            review_texts.append(text)

        # For next page, handles pagination
        load_more = soup.find('div', class_='load-more-data')
        if not load_more:
            break  # if no reviews

        key = load_more['data-key']
        reviews_url = f"https://www.imdb.com/title/tt1517268/reviews/_ajax?ref_=undefined&paginationKey={key}"

        # to avoid overloading IMDb server
        time.sleep(2)

    # Creating a DataFrame
    reviews_dataframe = pd.DataFrame({'Review_Title': review_titles, 'Review_Text': review_texts})
    print(f"Collected top {len(reviews_dataframe)} User Reviews of Movie - Barbie (2023)")
    # Print the first 5 reviews
    print("\n The first 5 reviews among 1000 collected :")
    print(reviews_dataframe.head())
    return reviews_dataframe

# First page URL for the Barbie movie reviews
imdb_url = 'https://www.imdb.com/title/tt1517268/reviews?ref_=tt_urv'
# Calling the function
reviews_dataframe = BarbieIMDbReviews(imdb_url)
# Saving the data into csv file
reviews_dataframe.to_csv('barbie_imdb_reviews.csv', index=False)
print("Saved the dataset as 'barbie_imdb_reviews.csv'.")

Collected top 1000 User Reviews of Movie - Barbie (2023)

 The first 5 reviews among 1000 collected :
                                      Review_Title  \
0                   Beautiful film, but so preachy   
1                                  A Hot Pink Mess   
2  Could Have Been Great. 2nd Half Brings It Down.   
3  As a guy I felt some discomfort, and that's ok.   
4                                 Too heavy handed   

                                         Review_Text  
0  Margot does the best with what she's given, bu...  
1  Before making Barbie (2023), Greta Gerwig sing...  
2  The quality, the humor, and the writing of the...  
3  As much as it pains me to give a movie called ...  
4  As a woman that grew up with Barbie, I was ver...  
Saved the dataset as 'barbie_imdb_reviews.csv'.


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [3]:
!pip install pandas nltk
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Downloading NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
# loading barbie_imdb_reviews dataset
df = pd.read_csv('barbie_imdb_reviews.csv')

# Define a function to print outputs for each part
def output_function(part, title_text, review_text):
    print(f"Output after {part} for title:\n{title_text}\n")
    print(f"Output after {part} for review:\n{review_text}\n\n")

# Initialize NLTK resources
stopwords_list = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (1) Remove noise, such as special characters and punctuations
# Using regex to remove anything that is not a letter or space
df['Title_clean'] = df['Review_Title'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
df['Review_clean'] = df['Review_Text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
output_function('noise removal', df['Title_clean'].iloc[0], df['Review_clean'].iloc[0])

# (2) Remove numbers
# Using regex to remove digits
df['Title_clean'] = df['Title_clean'].apply(lambda x: re.sub(r'\d+', '', x))
df['Review_clean'] = df['Review_clean'].apply(lambda x: re.sub(r'\d+', '', x))
output_function('number removal', df['Title_clean'].iloc[0], df['Review_clean'].iloc[0])

# (3) Remove stopwords
# Tokenizing and filtering out stopwords
def remove_stopwords(text):
    tokens = word_tokenize(text)
    return ' '.join([word for word in tokens if word not in stopwords_list])
df['Title_clean'] = df['Title_clean'].apply(remove_stopwords)
df['Review_clean'] = df['Review_clean'].apply(remove_stopwords)
output_function('stopword removal', df['Title_clean'].iloc[0], df['Review_clean'].iloc[0])

# (4) Lowercase all texts
# Converting all characters to lowercase
df['Title_clean'] = df['Title_clean'].apply(str.lower)
df['Review_clean'] = df['Review_clean'].apply(str.lower)
output_function('lowercasing', df['Title_clean'].iloc[0], df['Review_clean'].iloc[0])

# (5) Stemming
# Applying stemming to words
def apply_stemming(text):
    tokens = word_tokenize(text)
    return ' '.join([stemmer.stem(word) for word in tokens])
df['Title_clean'] = df['Title_clean'].apply(apply_stemming)
df['Review_clean'] = df['Review_clean'].apply(apply_stemming)
output_function('stemming', df['Title_clean'].iloc[0], df['Review_clean'].iloc[0])

# (6) Lemmatization
# Applying lemmatization to words
def apply_lemmatization(text):
    tokens = word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(word) for word in tokens])
df['Title_clean'] = df['Title_clean'].apply(apply_lemmatization)
df['Review_clean'] = df['Review_clean'].apply(apply_lemmatization)
output_function('lemmatization', df['Title_clean'].iloc[0], df['Review_clean'].iloc[0])

# Save the cleaned DataFrame in CSV
df.to_csv('barbie_imdb_reviews_cleaned.csv', index=False)
print("Saved the cleaned dataset as 'barbie_imdb_reviews_cleaned.csv'")

Output after noise removal for title:
Beautiful film but so preachy

Output after noise removal for review:
Margot does the best with what shes given but this film was very disappointing to me It was marketed as a fun quirky satire with homages to other movies It started that way but ended with overdramatized speeches and an ending that clearly tried to make the audience feel something but left everyone just feeling confused And before you say Im a crotchety old man Im a woman in my s so Im pretty sure Im this movies target audience The saddest part is there were parents with their kids in the theater that were victims of the poor marketing because this is not a kids movie Overall the humor was fun on occasion and the film is beautiful to look at but the whole concept falls apart in the second half of the film and becomes a pity party for the strong woman


Output after number removal for title:
Beautiful film but so preachy

Output after number removal for review:
Margot does the best

In [15]:
# For downloading the cleaned CSV file
from google.colab import files

files.download('barbie_imdb_reviews_cleaned.csv')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [5]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m69.8 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [13]:
import pandas as pd
import spacy
from spacy import displacy
from collections import Counter
import nltk
from nltk.tree import Tree

# Load Spacy's English-language model
spacy_model = spacy.load("en_core_web_sm")

# Load the cleaned dataset
reviews_df = pd.read_csv('barbie_imdb_reviews_cleaned.csv')

# Select a sample review for detailed analysis
example_review = reviews_df['Review_clean'].iloc[0]

# Process the text with Spacy
processed_text = spacy_model(example_review)

# (1) Parts of Speech (POS) Tagging
speech_parts_tags = {'NOUN': 0, 'VERB': 0, 'ADJ': 0, 'ADV': 0}
parts_of_speech_counts = processed_text.count_by(spacy.attrs.POS)
for key, val in parts_of_speech_counts.items():
    tag = processed_text.vocab[key].text
    if tag in speech_parts_tags:
        speech_parts_tags[tag] = val
print(f"POS Tagging Counts: {speech_parts_tags}")

# (2) Dependency Parsing
print("Dependency Parsing Trees:")
for sentence in processed_text.sents:
    displacy.render(sentence, style='dep', jupyter=True, options={'distance': 90})
    break  # Displaying only one tree for brevity

# Constituency Parsing Tree(hardcoded example)
tree_structure_str = """
(S
    (NP (PRP$ My) (NN friend))
    (VP (VBD enjoyed)
        (VP (VBG watching)
            (NP (DT the) (NNP Barbie) (NN Movie))))
    (. .))
"""
syntax_tree = Tree.fromstring(tree_structure_str)
print("Constituency Parsing Tree for 'My friend enjoyed watching the Barbie Movie'")
syntax_tree.pretty_print()

# (3) Named Entity Recognition (NER)
named_entities = list(processed_text.ents)
named_entity_counts = Counter(entity.label_ for entity in named_entities)
print(f"Entity Counts: {named_entity_counts}")
for entity in named_entities:
    print(f"Text: {entity.text}, Entity: {entity.label_}")


POS Tagging Counts: {'NOUN': 28, 'VERB': 14, 'ADJ': 9, 'ADV': 4}
Dependency Parsing Trees:


Constituency Parsing Tree for 'My friend enjoyed watching the Barbie Movie'
                           S                             
       ____________________|___________________________   
      |                           VP                   | 
      |             ______________|___                 |  
      |            |                  VP               | 
      |            |        __________|____            |  
      NP           |       |               NP          | 
  ____|____        |       |       ________|______     |  
PRP$       NN     VBD     VBG     DT      NNP     NN   . 
 |         |       |       |      |        |      |    |  
 My      friend enjoyed watching the     Barbie Movie  . 

Entity Counts: Counter({'GPE': 1, 'PERSON': 1, 'ORDINAL': 1, 'CARDINAL': 1})
Text: overdramat, Entity: GPE
Text: crotcheti, Entity: PERSON
Text: second, Entity: ORDINAL
Text: half, Entity: CARDINAL


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [8]:
# I feel the 1st and 2nd question was interesting and I enjoyed web scraping the reviews of my chosen movie over IMDB
# and perfoming operations such as removing noise, stopwords, digits, lemmatization, stemming along with outputs made me understand how the initial text can change accordingly to each part
# the 3rd question was bit challenging, as I had to learn how to conduct syntax and structure analysis of the clean text and took alot of online research and reference