# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


# Scraping IMDB movie reviews

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
from google.colab import drive


drive.mount('/content/drive')

# Function for scraping
def scrape_reviews(url):
    all_reviews = []
    while len(all_reviews) < 1000:
      user_agent = {'User-agent': 'Mozilla/5.0'}
      # Using requests module
      response = requests.get(url, headers=user_agent)
      soup = BeautifulSoup(response.text, 'html.parser')
      review_containers = soup.find_all("div", class_="review-container")
      if not review_containers:
        print("No review containers found on the page:", url)
        break
      for container in review_containers:
        #getting tags for extraction
        review_text = container.find("div", class_="text").get_text(strip=True)
        rating_element = container.find("span", class_="rating-other-user-rating")
        rating = rating_element.find("span").text if rating_element else None
        reviewer_name = container.find("span", class_="display-name-link").text
        review_date = container.find("span", class_="review-date").text
        all_reviews.append({"Review": review_text, "Rating": rating, "Reviewer": reviewer_name, "Date": review_date})
        if len(all_reviews) >= 1000:
            break
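      # Look for the "load more" control and the pagination key that comes with it
      load_more_button = soup.find("button", class_="ipl-load-more__button")
      load_more_data = soup.find("div", class_="load-more-data")
      if load_more_button and load_more_data:
        next_page_key = load_more_data.get("data-key")
        if next_page_key:
            # Rebuild from the base URL so earlier pagination keys are not appended repeatedly
            url = f"{url.split('?')[0]}?paginationKey={next_page_key}"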
        else:
            print("Data-key attribute not found on the Load More button.")
            break
      else:
        break
    return all_reviews

url = "https://www.imdb.com/title/tt9603212/reviews/"

#Function calling
all_reviews = scrape_reviews(url)

# Writing the csv file
csv_file_path = '/content/drive/My Drive/imdb_reviews_dataset.csv'
with open(csv_file_path, mode='w', newline='', encoding="utf-8") as file:
  fieldnames = ['Review', 'Rating', 'Reviewer', 'Date']
  writer = csv.DictWriter(file, fieldnames=fieldnames)
  writer.writeheader()
  for review in all_reviews:
    writer.writerow(review)

print("Reviews have been stored in CSV file:", csv_file_path)


Mounted at /content/drive
Reviews have been stored in CSV file: /content/drive/My Drive/imdb_reviews_dataset.csv
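
Because the scraping loop keeps requesting pages until it has 1000 rows, a page that is served twice would add the same reviews again without any warning. Below is a minimal sketch of a de-duplication pass that could run before the CSV is written. It assumes that the combination of reviewer, date, and review text identifies a review; the helper name `drop_duplicate_reviews` and the sample rows are hypothetical.

```python
def drop_duplicate_reviews(reviews):
    """Return the reviews with exact repeats removed, keeping first occurrences."""
    seen = set()
    unique = []
    for review in reviews:
        # Assumption: reviewer + date + text is enough to identify one review
        key = (review["Reviewer"], review["Date"], review["Review"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique

# Hypothetical rows in the same shape as the dictionaries built by scrape_reviews
sample = [
    {"Review": "Great stunts", "Rating": "8", "Reviewer": "user_a", "Date": "12 July 2023"},
    {"Review": "Too long", "Rating": "5", "Reviewer": "user_b", "Date": "13 July 2023"},
    {"Review": "Great stunts", "Rating": "8", "Reviewer": "user_a", "Date": "12 July 2023"},
]
unique_reviews = drop_duplicate_reviews(sample)
print(len(sample), "rows collected,", len(unique_reviews), "unique")
```

If the two numbers differ on the real data, the pagination logic is probably returning the same page more than once.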


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import re

# csv to dataframe using pandas
noiseless_review = pd.read_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv')

def remove_noise(text):
  cleaned_text = re.sub(r'[^\w\s]', '', text)
  return cleaned_text

noiseless_review['Clean_Review'] = noiseless_review['Review'].apply(remove_noise)

noiseless_review.to_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv', index=False)

noiseless_review.head()

Unnamed: 0,Review,Rating,Reviewer,Date,Clean_Review
0,Man.... I wish I loved this movie more than I ...,7.0,Paragon240,12 July 2023,Man I wish I loved this movie more than I did ...
1,Ethan Hunt has left the mere secret agent stat...,6.0,JackRJosie,15 July 2023,Ethan Hunt has left the mere secret agent stat...
2,After the first 30 minutes that promised an in...,5.0,ragingbull_2005,18 July 2023,After the first 30 minutes that promised an in...
3,4 considerations for those with high expectati...,5.0,imseeg,12 July 2023,4 considerations for those with high expectati...
4,Mission Impossible is one of those rare franch...,5.0,BA_Harrison,11 July 2023,Mission Impossible is one of those rare franch...


In [5]:
# Function to remove numbers from the cleaned review text
def remove_numbers(text):
  cleaned_text = re.sub(r'\d+', '', text)
  return cleaned_text

noiseless_review['Clean_Review'] = noiseless_review['Clean_Review'].apply(remove_numbers)

noiseless_review.to_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv', index=False)

noiseless_review.head()

Unnamed: 0,Review,Rating,Reviewer,Date,Clean_Review
0,Man.... I wish I loved this movie more than I ...,7.0,Paragon240,12 July 2023,Man I wish I loved this movie more than I did ...
1,Ethan Hunt has left the mere secret agent stat...,6.0,JackRJosie,15 July 2023,Ethan Hunt has left the mere secret agent stat...
2,After the first 30 minutes that promised an in...,5.0,ragingbull_2005,18 July 2023,After the first minutes that promised an inte...
3,4 considerations for those with high expectati...,5.0,imseeg,12 July 2023,considerations for those with high expectatio...
4,Mission Impossible is one of those rare franch...,5.0,BA_Harrison,11 July 2023,Mission Impossible is one of those rare franch...
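
Deleting digits with `re.sub(r'\d+', '', text)` leaves behind the spaces that surrounded each number, so `first 30 minutes` becomes `first  minutes` with two spaces. The `split()` and `join()` calls in the later stopword step happen to normalise this, but a sketch that makes the clean-up explicit could look like the following (the helper name `remove_numbers_and_tidy` is hypothetical):

```python
import re

def remove_numbers_and_tidy(text):
    """Drop digit runs, then collapse the repeated whitespace left behind."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

sample = "4 considerations after the first 30 minutes"
digits_only_removed = re.sub(r"\d+", "", sample)
tidied = remove_numbers_and_tidy(sample)
print(repr(digits_only_removed))  # gaps remain where the digits were
print(repr(tidied))               # single spaces, no leading space
```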


In [6]:
pip install nltk



In [7]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Build the stopword set once instead of on every call
stopwords_list = set(stopwords.words('english'))

def remove_stopwords(text):
  words = text.split()
  filtered_words = [word for word in words if word.lower() not in stopwords_list]
  filtered_text = ' '.join(filtered_words)
  return filtered_text

noiseless_review['Clean_Review'] = noiseless_review['Clean_Review'].apply(remove_stopwords)

noiseless_review.to_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv', index=False)

noiseless_review.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Review,Rating,Reviewer,Date,Clean_Review
0,Man.... I wish I loved this movie more than I ...,7.0,Paragon240,12 July 2023,Man wish loved movie Dont get wrong solid acti...
1,Ethan Hunt has left the mere secret agent stat...,6.0,JackRJosie,15 July 2023,Ethan Hunt left mere secret agent status ascen...
2,After the first 30 minutes that promised an in...,5.0,ragingbull_2005,18 July 2023,first minutes promised intellectual action thr...
3,4 considerations for those with high expectati...,5.0,imseeg,12 July 2023,considerations high expectations want avoid se...
4,Mission Impossible is one of those rare franch...,5.0,BA_Harrison,11 July 2023,Mission Impossible one rare franchises getting...
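
Row 0 above still contains `Dont`. One likely reason is ordering: punctuation was stripped first, so `Don't` became `Dont`, which no longer matches the `don't` entry in the stopword list. The sketch below shows one way around this, by normalising the stopwords with the same regex that was applied to the text. The small `STOPWORDS` set is an illustrative stand-in for NLTK's English list, not a copy of it.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead
STOPWORDS = {"i", "this", "more", "than", "did", "don't", "me"}

def strip_punct(text):
    return re.sub(r"[^\w\s]", "", text)

def remove_stopwords(text, stopwords):
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

# Normalise the stopwords the same way as the text so contractions still match
NORMALISED = {strip_punct(word) for word in STOPWORDS}

sample = "Don't get me wrong"
before = remove_stopwords(strip_punct(sample), STOPWORDS)
after = remove_stopwords(strip_punct(sample), NORMALISED)
print(before)  # the contraction survives because "dont" is not in the raw list
print(after)   # the contraction is filtered out
```

An alternative with the same effect would be to remove stopwords before stripping punctuation.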


In [8]:
noiseless_review['Clean_Review'] = noiseless_review['Clean_Review'].str.lower()

noiseless_review.to_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv', index=False)

noiseless_review.head()

Unnamed: 0,Review,Rating,Reviewer,Date,Clean_Review
0,Man.... I wish I loved this movie more than I ...,7.0,Paragon240,12 July 2023,man wish loved movie dont get wrong solid acti...
1,Ethan Hunt has left the mere secret agent stat...,6.0,JackRJosie,15 July 2023,ethan hunt left mere secret agent status ascen...
2,After the first 30 minutes that promised an in...,5.0,ragingbull_2005,18 July 2023,first minutes promised intellectual action thr...
3,4 considerations for those with high expectati...,5.0,imseeg,12 July 2023,considerations high expectations want avoid se...
4,Mission Impossible is one of those rare franch...,5.0,BA_Harrison,11 July 2023,mission impossible one rare franchises getting...


In [9]:
from nltk.stem import PorterStemmer
nltk.download('punkt')

stemmer = PorterStemmer()

def stem_text(text):
  words = nltk.word_tokenize(text)
  stemmed_words = [stemmer.stem(word) for word in words]
  stemmed_text = ' '.join(stemmed_words)
  return stemmed_text

noiseless_review['Clean_Review'] = noiseless_review['Clean_Review'].apply(stem_text)

noiseless_review.to_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv', index=False)

noiseless_review.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,Review,Rating,Reviewer,Date,Clean_Review
0,Man.... I wish I loved this movie more than I ...,7.0,Paragon240,12 July 2023,man wish love movi dont get wrong solid action...
1,Ethan Hunt has left the mere secret agent stat...,6.0,JackRJosie,15 July 2023,ethan hunt left mere secret agent statu ascend...
2,After the first 30 minutes that promised an in...,5.0,ragingbull_2005,18 July 2023,first minut promis intellectu action thriller ...
3,4 considerations for those with high expectati...,5.0,imseeg,12 July 2023,consider high expect want avoid sever disappoi...
4,Mission Impossible is one of those rare franch...,5.0,BA_Harrison,11 July 2023,mission imposs one rare franchis get better be...


In [10]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

noiseless_review['Clean_Review'] = noiseless_review['Clean_Review'].apply(lemmatize_text)

noiseless_review.to_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv', index=False)

noiseless_review.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,Review,Rating,Reviewer,Date,Clean_Review
0,Man.... I wish I loved this movie more than I ...,7.0,Paragon240,12 July 2023,man wish love movi dont get wrong solid action...
1,Ethan Hunt has left the mere secret agent stat...,6.0,JackRJosie,15 July 2023,ethan hunt left mere secret agent statu ascend...
2,After the first 30 minutes that promised an in...,5.0,ragingbull_2005,18 July 2023,first minut promis intellectu action thriller ...
3,4 considerations for those with high expectati...,5.0,imseeg,12 July 2023,consider high expect want avoid sever disappoi...
4,Mission Impossible is one of those rare franch...,5.0,BA_Harrison,11 July 2023,mission imposs one rare franchis get better be...
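
The `Clean_Review` column looks identical before and after lemmatization because the lemmatizer receives stems such as `movi`; words it cannot find in WordNet are returned unchanged. The sketch below shows an alternative layout that derives stems and lemmas from the same lowercased text and keeps them in separate fields. The two lookup tables are toy stand-ins for `PorterStemmer().stem` and `WordNetLemmatizer().lemmatize`, included only so that the example runs without NLTK data.

```python
# Toy stand-ins (assumptions) for the NLTK stemmer and lemmatizer
TOY_STEMS = {"movies": "movi", "franchises": "franchis", "loved": "love"}
TOY_LEMMAS = {"movies": "movie", "franchises": "franchise"}

def stem(word):
    return TOY_STEMS.get(word, word)

def lemmatize(word):
    # Unknown words, including stems such as "movi", come back unchanged
    return TOY_LEMMAS.get(word, word)

def apply_per_word(func, text):
    return " ".join(func(word) for word in text.split())

row = {"Lowercased": "loved movies franchises"}
# Both derived fields read from the same input instead of from each other
row["Stemmed"] = apply_per_word(stem, row["Lowercased"])
row["Lemmatized"] = apply_per_word(lemmatize, row["Lowercased"])
# The chained order used in the notebook: lemmatize text that is already stemmed
row["Lemmatized_after_stemming"] = apply_per_word(lemmatize, row["Stemmed"])

for name, value in row.items():
    print(f"{name}: {value}")
```

With separate fields, the stemmed and lemmatized versions can be compared side by side, and later analysis can pick whichever suits it.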


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [11]:
# Your code here
from collections import Counter
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# csv to dataframe
data_analysis = pd.read_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv')

def pos_tag_and_count(text):
    # Tag every token and count how often each Penn Treebank tag occurs
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    counts = Counter(tag for word, tag in pos_tags)
    return counts

# Merge the per-review counters into one corpus-wide counter
total_counts = sum(data_analysis['Clean_Review'].apply(pos_tag_and_count), Counter())

print("POS Tag Counts:")
print("Nouns (N):", total_counts['NN'] + total_counts['NNS'] + total_counts['NNP'] + total_counts['NNPS'])
print("Verbs (V):", total_counts['VB'] + total_counts['VBD'] + total_counts['VBG'] + total_counts['VBN'] + total_counts['VBP'] + total_counts['VBZ'])
print("Adjectives (Adj):", total_counts['JJ'] + total_counts['JJR'] + total_counts['JJS'])
print("Adverbs (Adv):", total_counts['RB'] + total_counts['RBR'] + total_counts['RBS'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


POS Tag Counts:
Nouns (N): 88640
Verbs (V): 20960
Adjectives (Adj): 33000
Adverbs (Adv): 8120
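
The four totals above are sums over Penn Treebank tags that share a prefix (`NN*`, `VB*`, `JJ*`, `RB*`). The sketch below does the same grouping by prefix instead of listing every tag. The `tag_counts` values are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Coarse category -> Penn Treebank tag prefix
PREFIXES = {"Noun": "NN", "Verb": "VB", "Adjective": "JJ", "Adverb": "RB"}

def group_pos_counts(tag_counts):
    """Sum fine-grained tag counts into the four coarse categories."""
    grouped = Counter()
    for tag, count in tag_counts.items():
        for category, prefix in PREFIXES.items():
            if tag.startswith(prefix):
                grouped[category] += count
    return grouped

# Made-up counts; WRB and IN are included to show that they are ignored
tag_counts = Counter({"NN": 5, "NNS": 2, "VBD": 3, "JJ": 4, "RB": 1, "WRB": 6, "IN": 7})
grouped = group_pos_counts(tag_counts)
for category in PREFIXES:
    print(category, grouped[category])
```

Because the tagger in this notebook runs on stemmed, stopword-free text, the tags themselves are probably less reliable than they would be on the original sentences.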


In [12]:
pip install stanfordcorenlp

Collecting stanfordcorenlp
  Downloading stanfordcorenlp-3.9.1.1-py2.py3-none-any.whl (5.7 kB)
Installing collected packages: stanfordcorenlp
Successfully installed stanfordcorenlp-3.9.1.1
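
The `stanfordcorenlp` package is installed here but is not called anywhere else in the notebook, so the cells that follow print dependency relations only. To illustrate what a constituency tree looks like, here is a self-contained CYK sketch over a toy grammar in Chomsky normal form. The grammar, the lexicon, and the example sentence are all assumptions made for illustration; parsing real reviews would need a full parser such as the one CoreNLP provides.

```python
# Toy grammar in Chomsky normal form (an assumption, for illustration only)
BINARY_RULES = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICON = {"ethan": "NP", "makes": "V", "a": "Det", "plan": "N"}

def cyk_parse(tokens):
    """Return a bracketed constituency tree for tokens, or None if no S spans them.

    Tokens outside the toy lexicon raise KeyError.
    """
    n = len(tokens)
    # chart[(i, j)] maps a phrase label to the bracketed subtree over tokens[i:j]
    chart = {}
    for i, token in enumerate(tokens):
        label = LEXICON[token]
        chart[(i, i + 1)] = {label: f"({label} {token})"}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # every split point inside the span
                for left, left_tree in chart[(i, k)].items():
                    for right, right_tree in chart[(k, j)].items():
                        parent = BINARY_RULES.get((left, right))
                        if parent and parent not in cell:
                            cell[parent] = f"({parent} {left_tree} {right_tree})"
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")

tree = cyk_parse(["ethan", "makes", "a", "plan"])
print(tree)
```

In the bracketed result, `S` splits into a noun phrase and a verb phrase, and the verb phrase in turn contains the object noun phrase. A constituency tree groups words into nested phrases like this, whereas a dependency tree links each word directly to its head word.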


In [13]:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

def dependency_parsing(review):
    doc = nlp(review)
    for sent in doc.sents:
        print("\nDependency Parsing Tree:")
        for token in sent:
            print(token.text, token.dep_, token.head.text)
        print("\n")

data_analysis['Clean_Review'].apply(dependency_parsing)


Streaming output truncated to the last 5000 lines.
pocket dobj kept
mani compound time
time npadvmod kept
make xcomp kept
wonder compound script
lazi compound script
script nsubj kill
kill ccomp make
old amod gf
gf dobj kill
make dep pull
new amod gf
gf ccomp make
also advmod make
lazi compound mi
writingunlik compound mi
mi compound ethan
ethan nummod plan
singl amod plan
smart amod plan
move compound plan
plan nsubj felt
felt ccomp make
like intj let
let advcl felt
go nsubj see
see ccomp let
happensam dobj see
sure advmod gave
mi compound ignor
fan compound ignor
ignor nsubj gave
weak amod ignor
gave ccomp see
film compound rate
rate dobj gave



Dependency Parsing Tree:
instal amod stunt
franchis det entertain
wellmad compound highli
highli nsubj entertain
entertain nmod stunt
decent amod fun
fun dobj entertain
alway advmod entertain
tri amod stunt
invent nmod stunt
better amod stunt
stunt nsubj noth
also advmod noth
noth ROOT noth
wrong amod cinemat
cinemat compound d

0      None
1      None
2      None
3      None
4      None
       ... 
995    None
996    None
997    None
998    None
999    None
Name: Clean_Review, Length: 1000, dtype: object
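
The flat `token dep head` lines above are easier to read as a tree. The sketch below rebuilds the hierarchy for the last printed sentence from its first fourteen triples; the final two lines are left out because their head `d` is not listed as a token in the captured output. It assumes every token text is unique within the sentence, which holds here but not in general. With spaCy objects one would key on `token.i` instead.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above
TRIPLES = [
    ("instal", "amod", "stunt"), ("franchis", "det", "entertain"),
    ("wellmad", "compound", "highli"), ("highli", "nsubj", "entertain"),
    ("entertain", "nmod", "stunt"), ("decent", "amod", "fun"),
    ("fun", "dobj", "entertain"), ("alway", "advmod", "entertain"),
    ("tri", "amod", "stunt"), ("invent", "nmod", "stunt"),
    ("better", "amod", "stunt"), ("stunt", "nsubj", "noth"),
    ("also", "advmod", "noth"), ("noth", "ROOT", "noth"),
]

def build_tree(triples):
    """Return the root token and a head -> [(child, label)] mapping."""
    children = defaultdict(list)
    root = None
    for token, dep, head in triples:
        if dep == "ROOT":
            root = token  # the root is the token whose head is itself
        else:
            children[head].append((token, dep))
    return root, children

def render(token, dep, children, depth=0):
    """Return indented lines for the subtree rooted at token."""
    lines = ["  " * depth + f"{token} ({dep})"]
    for child, child_dep in children.get(token, []):
        lines.extend(render(child, child_dep, children, depth + 1))
    return lines

root, children = build_tree(TRIPLES)
tree_lines = render(root, "ROOT", children)
print("\n".join(tree_lines))
```

Each word has exactly one head, and the label on the line names the relation to that head. For example, `stunt (nsubj)` sits directly under the root `noth`, and `entertain (nmod)` is in turn a modifier of `stunt`. Since the input is stemmed text with stopwords removed, these relations should be read as an illustration of the structure rather than as a reliable parse.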

In [14]:
from collections import Counter
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [17]:
import spacy
import pandas as pd
from collections import Counter

nlp = spacy.load("en_core_web_sm")

reviews_data = pd.read_csv('/content/drive/MyDrive/imdb_reviews_dataset.csv')

def ner_extraction(review):
    doc = nlp(review)
    person_names = Counter()
    organizations = Counter()
    locations = Counter()
    product_names = Counter()
    dates = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            person_names[ent.text] += 1
        elif ent.label_ == 'ORG':
            organizations[ent.text] += 1
        elif ent.label_ == 'GPE':
            locations[ent.text] += 1
        elif ent.label_ == 'PRODUCT':
            product_names[ent.text] += 1
        elif ent.label_ == 'DATE':
            dates[ent.text] += 1
    return person_names, organizations, locations, product_names, dates

ner_results = reviews_data['Clean_Review'].apply(ner_extraction)

person_names_total = Counter()
organizations_total = Counter()
locations_total = Counter()
product_names_total = Counter()
dates_total = Counter()

for person_names, organizations, locations, product_names, dates in ner_results:
    person_names_total += person_names
    organizations_total += organizations
    locations_total += locations
    product_names_total += product_names
    dates_total += dates

print("Total Sum of Each Entity Type:")
print("Person Names:", sum(person_names_total.values()))
print("Organizations:", sum(organizations_total.values()))
print("Locations:", sum(locations_total.values()))
print("Product Names:", sum(product_names_total.values()))
print("Dates:", sum(dates_total.values()))


Total Sum of Each Entity Type:
Person Names: 3640
Organizations: 960
Locations: 360
Product Names: 0
Dates: 360
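
The function above counts only `GPE` as a location, although spaCy's English models also use `LOC` for non-political places such as mountain ranges and bodies of water. The sketch below is a table-driven version of the same aggregation that folds both labels into one category. The mapping is an assumption about how the question's categories line up with spaCy's labels, and the entity pairs are hypothetical examples shaped like `(ent.text, ent.label_)`.

```python
from collections import Counter

# Assumed mapping from spaCy entity labels to the categories in the question
CATEGORY_BY_LABEL = {
    "PERSON": "Person Names",
    "ORG": "Organizations",
    "GPE": "Locations",
    "LOC": "Locations",
    "PRODUCT": "Product Names",
    "DATE": "Dates",
}

def count_entities(entities):
    """Count (text, label) pairs per category, ignoring all other labels."""
    totals = Counter()
    for _text, label in entities:
        category = CATEGORY_BY_LABEL.get(label)
        if category:
            totals[category] += 1
    return totals

# Hypothetical pairs, shaped like (ent.text, ent.label_) taken from doc.ents
entities = [("Ethan Hunt", "PERSON"), ("Paramount", "ORG"), ("Rome", "GPE"),
            ("the Alps", "LOC"), ("July 2023", "DATE"), ("two", "CARDINAL")]
totals = count_entities(entities)
for category, count in totals.items():
    print(f"{category}: {count}")
```

Running the recognizer on the original `Review` column rather than on `Clean_Review` may also help, since lowercasing and stemming remove the capitalisation and word forms that named entity models tend to rely on.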


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [18]:
# Write your response below
'''Question 2 was good enough to execute, but 1 & 3 were very time
consuming and difficult to crack. Time given to finish this
assignment was okay compared to the time usually given for exercises.'''

'Question 2 was good enough to execute, but 1 & 3 were very time \nconsuming and difficult to crack. Time given to finish this \nassignment was okay compared to the time usually given for exercises.'