# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
import requests as req
import pandas as pd1
from bs4 import BeautifulSoup

# Function to fetch IMDb reviews
def get_reviews(movie_id):
  list_reviews = []
  page_num=1
  while page_num<41:
    #generating dynamic url to fetch data
    imdb_url = "https://www.imdb.com/title/"+movie_id+"/reviews?start="+str(page_num)
    response_1 = req.get(imdb_url)
    soup = BeautifulSoup(response_1.content, 'html.parser')
    review_divs = soup.find_all('div', class_='text show-more__control')
    list_reviews.extend([review123.text for review123 in review_divs])
    page_num+=1
  return list_reviews[:1000]

# I am collecting the reviews of Oppenheimer (2023) from IMDB. Its title ID is 'tt15398776'.
movie_reviews = get_reviews('tt15398776')

# Save the data into a CSV file
review_df = pd1.DataFrame(movie_reviews, columns=['Review content'])
review_df.to_csv('openheimer_reviews.csv', index=False)

In [None]:
review_df

Unnamed: 0,Review content
0,One of the most anticipated films of the year ...
1,You'll have to have your wits about you and yo...
2,I'm a big fan of Nolan's work so was really lo...
3,"""Oppenheimer"" is a biographical thriller film ..."
4,This movie is just... wow! I don't think I hav...
...,...
995,It's isn't a masterpiece. It's a decent biopic...
996,I'm a big Nolan fan. Maybe this one just wasn'...
997,My Review - Oppenheimer\nMy Rating Ten plus 10...
998,Nolan is good at constructing complicated timi...
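
Because `get_reviews` requests 40 pages in a loop and keeps whatever each response contains, it may be worth checking that the pages did not return overlapping reviews before the list is saved. The following is a minimal sketch using only the standard library; `dedupe_reviews` and the sample list are hypothetical names and data introduced for illustration, not part of the original notebook.

```python
def dedupe_reviews(reviews):
    """Return reviews with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for text in reviews:
        # Collapse runs of whitespace so trivially different copies compare equal
        key = " ".join(text.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Hypothetical sample standing in for the scraped list
sample = ["Great film.", "Great  film.", "Too long.", "", "Great film."]
unique_reviews = dedupe_reviews(sample)
print(len(sample), "collected,", len(unique_reviews), "unique")
```

In the notebook this could be applied as `movie_reviews = dedupe_reviews(movie_reviews)` before building the DataFrame. If the unique count is far below 1000, the pagination parameter is probably not being honoured by the site.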


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

**1. Remove special characters and punctuations**

In [None]:
# Install nltk first, then import the required packages
!pip install nltk
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the nltk resources used below
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_data(text1):

  # Remove punctuation
  text1 = text1.translate(str.maketrans('', '', string.punctuation))

  #remove numbers
  text1 = ''.join([i for i in text1 if not i.isdigit()])

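  #remove stopwords (the nltk list is lowercase and this runs before lowercasing,
  #so capitalized stopwords such as "This" or "I" are not removed here)
  stop_words_list = stopwords.words('english')
  text1=' '.join([ x for x in text1.split() if x not in stop_words_list])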

  #convert to lowercase
  text1=text1.lower()

  #tokenize (the PorterStemmer defined above is not applied, so tokens stay unstemmed)
  stem_tokens =[word_token for word_token in nltk.word_tokenize(text1)]

  #Lemmatization
  final_cleaned_tokens = [lemmatizer.lemmatize(word1) for word1 in stem_tokens]


  #return clean data
  return ' '.join(final_cleaned_tokens)


review_df["clean Review content"]=review_df["Review content"].apply(clean_data)
review_df.head(10)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Review content,clean Review content
0,One of the most anticipated films of the year ...,one anticipated film year many people included...
1,You'll have to have your wits about you and yo...,youll wit brain fully switched watching oppenh...
2,I'm a big fan of Nolan's work so was really lo...,im big fan nolans work really looking forward ...
3,"""Oppenheimer"" is a biographical thriller film ...",oppenheimer biographical thriller film written...
4,This movie is just... wow! I don't think I hav...,this movie wow i dont think i ever felt like w...
5,I was familiar with the Manhattan project and ...,i familiar manhattan project social political ...
6,I'm still collecting my thoughts after experie...,im still collecting thought experiencing film ...
7,Is it just me or did anyone else find this mov...,is anyone else find movie i hate say boring i ...
8,I may consider myself lucky to be alive to wat...,i may consider lucky alive watch christopher n...
9,"Okay, Nolan fans, get your fingers poised to d...",okay nolan fan get finger poised downvote im s...


In [None]:
review_df.to_csv('movie_reviews_cleaned.csv', index=False)

review_df.head(10)

Unnamed: 0,Review content,clean Review content
0,One of the most anticipated films of the year ...,one anticipated film year many people included...
1,You'll have to have your wits about you and yo...,youll wit brain fully switched watching oppenh...
2,I'm a big fan of Nolan's work so was really lo...,im big fan nolans work really looking forward ...
3,"""Oppenheimer"" is a biographical thriller film ...",oppenheimer biographical thriller film written...
4,This movie is just... wow! I don't think I hav...,this movie wow i dont think i ever felt like w...
5,I was familiar with the Manhattan project and ...,i familiar manhattan project social political ...
6,I'm still collecting my thoughts after experie...,im still collecting thought experiencing film ...
7,Is it just me or did anyone else find this mov...,is anyone else find movie i hate say boring i ...
8,I may consider myself lucky to be alive to wat...,i may consider lucky alive watch christopher n...
9,"Okay, Nolan fans, get your fingers poised to d...",okay nolan fan get finger poised downvote im s...
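
The cleaned column above still contains words such as "this" and "i", which suggests that the order of the steps matters: matching against a lowercase stopword list only works after the text has been lowercased. Below is a minimal sketch of steps (1) to (4) in that order, using only the standard library. `basic_clean` and the tiny `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses nltk's English stopword list.

```python
import string

# Tiny illustrative stopword set (an assumption); the notebook uses nltk's English list
STOP_WORDS = {"i", "this", "is", "a", "the", "to", "of", "and", "it"}


def basic_clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    # Lowercase first so capitalized stopwords match the lowercase list
    text = text.lower()
    # Remove ASCII punctuation and digits in a single pass
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Split on whitespace and keep only non-stopword tokens
    return " ".join(word for word in text.split() if word not in stop_words)


print(basic_clean("This movie is just... wow! I watched it 2 times."))
```

Stemming and lemmatization are left out of the sketch to keep it dependency-free. In the notebook, calling `stemmer.stem(token)` on each token would apply the Porter stemmer that is already instantiated there.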


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
import pandas as pd
import spacy
from spacy import displacy  # imported explicitly so spacy.displacy is available
from collections import Counter

# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Function to perform POS tagging and calculate counts
def pos_tagging(text1):
    doc = nlp(text1)
    return Counter([token.pos_ for token in doc])

# Function to perform constituency parsing
def constituency_parsing(sentence):
    # Placeholder only: the spaCy pipeline loaded above does not produce
    # constituency trees, so a separate parser (for example Stanza or benepar)
    # would have to be plugged in here.
    print(f"Constituency parsing tree for: {sentence}")

# Dependency Parsing with SpaCy
def dependency_parsing(text):
    doc = nlp(text)
    for sentence in doc.sents:
        print(f"Dependency parsing tree for: {sentence}")
        spacy.displacy.render(sentence, style='dep', jupyter=True, options={'distance': 90})

# Named Entity Recognition
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    entity_labels = [ent.label_ for ent in doc.ents]
    entity_counts = Counter(entity_labels)
    return entities, entity_counts


df = review_df

# Applying POS tagging
df['POS_counts'] = df['clean Review content'].apply(pos_tagging)

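# Print dependency parses for the first five reviews (constituency_parsing is not called).
# Punctuation was removed during cleaning, so split(".") leaves each review whole;
# spaCy's doc.sents does the sentence segmentation inside dependency_parsing.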
for content in df['clean Review content'].head():
    sentences = content.split(".")
    for sentence in sentences:
        if sentence.strip():
            dependency_parsing(sentence)

# Applying Named Entity Recognition
df['Entities'], df['Entity_counts'] = zip(*df['clean Review content'].apply(named_entity_recognition))

# Display results
print(df[['POS_counts', 'Entities', 'Entity_counts']].head())

print(df.head(10))

Dependency parsing tree for: one anticipated film year many people included oppenheimer largely delivers much great


Dependency parsing tree for: i feel like i loved two three hour liked hour fact stop adoring entire thing i know christopher nolans dunkirk clicked second watch maybe oppenheimer need one that said i dont feel need rush see soon long exhausting filmbut many way i cant deny exceptionally well made one it look sound amazing you


Dependency parsing tree for: d expect feeling though accurately capture time period set containing amazing sound design one year best score far every performance good great film belongs cillian murphy i feel like he lead actor beat stage talking early award considerationthe film best focus psychological thriller featuring famous historical figure one point even turn psychological horror film there one sequence involving speech thats particularly terrifying it also manages suspenseful moment even though story commonly known history pointi really feel length final hour though maybe i wish final act extended epilogue rather whole third movie


Dependency parsing tree for: i currently feel though i wouldve loved oppenheimer hour instead nothing bad mean little patience testing subjective i remember feeling like similarly long babylon totally justified runtime though others didnt feel wayim left feeling like i watched film wasnt slam dunk incredible runtime wasnt and thats still worth celebrating make oppenheimer worth seeing cinema sure


Dependency parsing tree for: youll wit brain fully switched watching oppenheimer could easily get away nonattentive viewer this intelligent filmmaking show audience great respect it fire dialogue packed information relentless pace jump different time oppenheimer life continuously hour runtime there visual clue guide viewer time youll get grip quite quickly this relentlessness help express urgency u attacked chase atomic bomb germany could an absolute career best performance consistenly brilliant cillian murphy anchor film this nailed oscar performance in fact whole cast fantastic apart maybe sometimes overwrought emily blunt performance rdj also particularly brilliant return proper acting decade calling the screenplay dense layered id say thick bible cinematography quite stark spare part imbued rich lucious colour moment especially scene florence pugh score beautiful time mostly anxious oppressive adding relentless pacing the hour runtime fly all i found intense taxing highly rewarding

Dependency parsing tree for: im big fan nolans work really looking forward


Dependency parsing tree for: i understood would flipping timeline id need concentrate i didnt find problem storytelling beautifully done the acting universally excellent i saw review saying emily blunt rather


Dependency parsing tree for: ott i didnt find alli think biggest gripe film may mean im getting old i found direction quite jarring jump cut galore while keep thing moving along apace rather exhausting i also found music sound loud point intrusion much like nolan film go interstellar i love also loud


Dependency parsing tree for: musicall quality watch


Dependency parsing tree for: it left longing day called cerebral biopics little tranquil


Dependency parsing tree for: oppenheimer biographical thriller film written directed christopher nolan the dark knight trilogy inception interstellar dunkirk based biography american prometheus kai bird martin j sherwin starring cillian murphy lead role addition matt damon robert downey jr emily blunt florence pugh subverts usual biopic formula create brilliantly layered examination man throughout incredible accomplishment fundamental flawsduring height second world war theoretical physicist j robert oppenheimer cillian murphy recruited united state government oversee manhattan project top secret operation intended develop world first nuclear weapon after becoming acquainted project director major general leslie grove matt damon oppenheimer general come agreement best place carry undertaking vast desert los alamo new mexico a numerous scientist family brought discreet location oppenheimer work tirelessly around clock build weapon mass destruction nazi devise with war raging personal tr

Dependency parsing tree for: i cant imagine amount pressure play effectively the combined effort murphy acting nolans direction


Dependency parsing tree for: help make oppenheimer one fascinating individual th century this man viewed simply face value many layer character bear indepth exploration movie like accomplish the film paint oppenheimer neither hero villain rather complicated man whose human quality undermine remembered history book murphy approach like shakespearian figure rife flaw haughtiness sense hubris end sealing inevitable fate one scene may admiring remarkable talent field nuclear physic another might cause hate unfaithfulness family he viewed simultaneously martyr scapegoat way helped bring end deadliest global conflict history consequently ushering something even worsethe rest film cast fantastic job


Dependency parsing tree for: well standouts matt damon robert downey


Dependency parsing tree for: jr emily blunt florence pugh damon take major general leslie grove simply stock military character rather important figure seizes opportunity use oppenheimer talent advantage we watch grove form unlikely alliance physicist often questioning ramification theoretical nature experimenting nuclear power groves ignorance oppenheimer extensive scientific knowledge allows audience learn along explained basic detail to effect provides important third party perspective oppenheimer achievementsits also great see


Dependency parsing tree for: robert downey jr shine lewis strauss best postmcu role one best role general strauss man viewed favourably history due role exposing oppenheimer tie communism he hold grudge oppenheimer practically consider true villain story downey take every opportunity show strauss twofaced nature biding time right moment strip oppenheimer record book damage reputation reportedly downey considers best role date definitely seems like putting everything performanceemily blunt florence pugh also contributed significantly kitty oppenheimer jean tatlock respectively each two woman represent something significant oppenheimer life kitty jean personally want this draw parallel oppenheimer choosing acting instinct acting intellect assisting construction bomb reminds audience flawed human quality it difficult give following heart fate world rest pragmatic decision making sometimes choiceas biopic christopher nolan film oppenheimer exceeds virtually expectation become one best fiel

Dependency parsing tree for: this movie


Dependency parsing tree for: wow i dont think i ever felt like watching movie it like blend sad also scared i read christopher nolan said kind theme horror watching movie think i knew meant very movie make feel quite like one cannolan show expertly craftsman filmmaking this stand perhaps one humble movie also one greatest reminds earlier moviesthe cast also amazing cillian murphy delivering performance carrer oppenheimer esentially becoming pretty much securing oscar nomination best lead actor robert downey junior also give one best performance reminding u despite year iron man still actthe soundtrack sound editing also masterfull creates cinematic experience like otheroverall esential viewing experience historic event still remains relevant day one favorite nolan movie


                                          POS_counts  \
0  {'NUM': 7, 'VERB': 50, 'NOUN': 58, 'ADJ': 31, ...   
1  {'PRON': 6, 'AUX': 5, 'NOUN': 49, 'ADV': 17, '...   
2  {'PRON': 10, 'VERB': 26, 'ADJ': 12, 'NOUN': 22...   
3  {'NOUN': 338, 'PROPN': 99, 'VERB': 167, 'DET':...   
4  {'DET': 4, 'NOUN': 32, 'INTJ': 2, 'PRON': 5, '...   

                                            Entities  \
0  [one, three hour, christopher, dunkirk, second...   
1        [germany, cillian murphy, decade, the hour]   
2                                              [day]   
3  [christopher, american prometheus kai, martin,...   
4  [christopher nolan, one, one, cillian, robert ...   

                                       Entity_counts  
0  {'CARDINAL': 3, 'TIME': 1, 'PERSON': 3, 'ORG':...  
1      {'GPE': 1, 'PERSON': 1, 'DATE': 1, 'TIME': 1}  
2                                        {'DATE': 1}  
3  {'PERSON': 21, 'ORG': 2, 'GPE': 3, 'ORDINAL': ...  
4            {'PERSON': 2, 'CARDINAL': 3, 'NORP': 1
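
The columns above hold one counter per review, while the question asks for overall totals. One way to get them is to sum the per-review counters. The sketch below assumes each row is a `Counter` (or a plain dict) of tag to count; `total_counts` and the sample rows are hypothetical and use small illustrative numbers, not the notebook's results.

```python
from collections import Counter


def total_counts(per_row_counts, keep=None):
    """Sum an iterable of tag->count mappings; optionally keep only some tags."""
    total = Counter()
    for counts in per_row_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    if keep is not None:
        return {tag: total[tag] for tag in keep}  # missing tags come back as 0
    return dict(total)


# Hypothetical per-review counts used only for illustration
rows = [
    Counter({"NOUN": 58, "VERB": 50, "ADJ": 31}),
    Counter({"NOUN": 49, "ADV": 17}),
]
totals = total_counts(rows, keep=("NOUN", "VERB", "ADJ", "ADV"))
print(totals)
```

With the notebook's DataFrame, `total_counts(df['POS_counts'], keep=("NOUN", "VERB", "ADJ", "ADV"))` would give the four requested part-of-speech totals, and `total_counts(df['Entity_counts'])` would give the count of each entity type across all reviews.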

In [6]:
df.to_csv('openheimer_reviews_cleaned_data.csv', index=False)
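
To illustrate the difference between the two kinds of parse asked for in part (2): a constituency tree groups words into nested phrases (noun phrase, verb phrase, and so on), while a dependency tree links each word directly to its head word with a labelled relation. The sketch below encodes both for the short sentence "Nolan directed the film". The parses are written by hand for illustration and are not the output of any parser; `bracketed` and `leaves` are hypothetical helper names.

```python
# Constituency tree as nested tuples: (label, child, ...); leaves are plain words
constituency = (
    "S",
    ("NP", ("NNP", "Nolan")),
    ("VP", ("VBD", "directed"), ("NP", ("DT", "the"), ("NN", "film"))),
)

# Dependency parse as (word, relation, head word); the root points to itself
dependency = [
    ("Nolan", "nsubj", "directed"),
    ("directed", "ROOT", "directed"),
    ("the", "det", "film"),
    ("film", "dobj", "directed"),
]


def bracketed(node):
    """Render a nested-tuple tree in bracketed (Penn Treebank style) notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracketed(child) for child in children) + ")"


def leaves(node):
    """Return the words of a nested-tuple tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


print(bracketed(constituency))
for word, relation, head in dependency:
    print(f"{word} --{relation}--> {head}")
```

In the constituency view, "the film" forms a noun phrase that sits inside the verb phrase "directed the film". In the dependency view there are no phrase nodes: "film" attaches to "directed" as its object and "the" attaches to "film" as its determiner.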

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
'''
I enjoyed collecting the reviews of one of my favourite movies, Oppenheimer, from the IMDB website using my own code.
Through this assignment, I learnt how to collect large amounts of data from websites and the steps needed to pre-process it.
The example code uploaded to Canvas helped me with the data preprocessing steps and parsing.
The time provided for this assignment was sufficient to complete the tasks.
'''