# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
# Your code here
from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ql_2'

def reviews(url):
    """Return the review titles, dates, and texts found on one page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # 'a.title' holds the title of each review (not the movie title)
    review_titles = [tag.text.strip() for tag in soup.select('a.title')]
    review_dates = [tag.text.strip() for tag in soup.select('span.review-date')]
    review_texts = [tag.text.strip() for tag in soup.select('div.text.show-more__control')]
    return review_titles, review_dates, review_texts

all_reviews = []
reviews_scraped = 0
while reviews_scraped < 10000:
    # NOTE: base_url never changes, so every pass downloads the same page and
    # the CSV ends up with repeated reviews; real pagination is still missing.
    url = base_url
    review_titles, review_dates, review_texts = reviews(url)
    if not review_titles:
        # Stop instead of looping forever when a page yields no reviews
        break
    all_reviews.extend(zip(review_titles, review_dates, review_texts))
    reviews_scraped += len(review_titles)

reviews_df = pd.DataFrame(all_reviews, columns=['Review Title', 'Review Date', 'Review Description'])
reviews_df.to_csv('movies_reviews.csv', index=False)

print("Total reviews scraped:", reviews_scraped)


Total reviews scraped: 10000
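
Because the loop above sets `url = base_url` on every pass, each request returns the same page, so the 10000 rows are most likely repeats of a small set of reviews. The sketch below shows one way to drop exact duplicates while keeping the original order. The `rows` list is made-up sample data standing in for `all_reviews`, and `unique_rows` is a hypothetical helper name, not part of the notebook.

```python
def unique_rows(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(rows))

# Made-up sample standing in for all_reviews: (title, date, text) tuples
rows = [
    ("I am groot", "3 May 2023", "Short review text."),
    ("Rocket and Love", "3 May 2023", "Another review text."),
    ("I am groot", "3 May 2023", "Short review text."),
]

deduplicated = unique_rows(rows)
print(len(rows), "rows scraped,", len(deduplicated), "unique")
```

Comparing the two counts is a quick check on whether the scraper really advanced through different pages.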


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.
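
Steps (1) to (4) can be sketched with the standard library alone, as below. This is an illustration under stated assumptions, not the required solution: `clean_review` is a hypothetical helper name, and `STOPWORDS` is a tiny hand-picked set standing in for the full NLTK stopword list. Punctuation is replaced with a space rather than deleted, so that words on either side of a full stop are not fused together.

```python
import re

# Tiny illustrative stopword set (an assumption); a real run would use
# nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "this", "i", "for"}

def clean_review(text):
    """Apply steps (1)-(4): strip noise and numbers, lowercase, drop stopwords."""
    # (1) + (2): replace anything that is not a letter or whitespace with a
    # space, so "themes.I" becomes "themes I" rather than "themesI"
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # (4): lowercase before the stopword lookup so "The" matches "the"
    tokens = text.lower().split()
    # (3): keep only tokens that are not stopwords
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean_review("This is one of the best MCU movies, hands down!"))
```

Stemming and lemmatization, steps (5) and (6), need a language resource such as NLTK or TextBlob and are therefore left out of this sketch.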

In [None]:
# Write code for each of the sub parts with proper comments.
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the tokenizer model, stopword list, and WordNet data used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

df = pd.read_csv('movies_reviews.csv')
# Work on an explicit copy of the first 30 rows so that later column
# assignments do not trigger pandas' SettingWithCopyWarning
dfs = df.head(30).copy()
print(dfs)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                         Review Title       Review Date  \
0   A scene between an intelligent raccoon, a rabb...       10 May 2023   
1          A Fitting and Heartfelt End to the Trilogy        4 May 2023   
2                       One of the Best MCU Trilogies        8 May 2023   
3      This is one of the best MCU movies, hands down        7 May 2023   
4                                          I am groot        3 May 2023   
5                A Beautifully Dark and Goofy Goodbye        3 May 2023   
6                          Best Marvel film this year        4 May 2023   
7                A perfect send off for the Guardians        3 May 2023   
8                                  I Cried Four Times       10 May 2023   
9                            Emotional and energetic!        3 May 2023   
10                                    Rocket and Love        3 May 2023   
11       Finally, Marvel start to return to its roots       11 May 2023   
12  Much better than most

In [None]:
def clean_text(text):
    """Remove special characters and numbers, then lowercase the text."""
    # (1) Remove special characters and punctuation
    text = re.sub("[^A-Za-z0-9 ]", "", text)
    # (2) Remove numbers
    text = re.sub("[^A-Za-z ]", "", text)
    # (4) Lowercase and trim surrounding whitespace
    return text.lower().strip()


# Apply the same cleaning to every column.
# Note: for 'Review Date' this strips the day and year, leaving only the month name.
for column in ['Review Title', 'Review Description', 'Review Date']:
    dfs[column] = [clean_text(value) for value in dfs[column]]

print("Printing first few rows")
print(dfs)

Printing first few rows
                                         Review Title Review Date  \
0   a scene between an intelligent raccoon a rabbi...         may   
1          a fitting and heartfelt end to the trilogy         may   
2                       one of the best mcu trilogies         may   
3       this is one of the best mcu movies hands down         may   
4                                          i am groot         may   
5                a beautifully dark and goofy goodbye         may   
6                          best marvel film this year         may   
7                a perfect send off for the guardians         may   
8                                  i cried four times         may   
9                             emotional and energetic         may   
10                                    rocket and love         may   
11        finally marvel start to return to its roots         may   
12  much better than most of the recent marvel pro...         may   
13        

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfs['Review Title'] = updated_title_col
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfs['Review Description'] = updated_description_col
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfs['Review Date'] = updated_date_col


In [None]:
# Removing the stopwords

stop_word = set(stopwords.words('english'))
column_description = []

for i in dfs['Review Description']:
    word_token = word_tokenize(i)
    # Keep only the tokens that are not in the stopword list
    filter_sentence = [w for w in word_token if w.lower() not in stop_word]
    print(filter_sentence)
    column_description.append(filter_sentence)

['guardians', 'galaxy', 'volume', 'chaotic', 'weird', 'oftentimes', 'ridiculous', 'also', 'full', 'heart', 'emotion', 'great', 'themesi', 'must', 'say', 'best', 'marvel', 'movie', 'since', 'endgame', 'thats', 'necessarily', 'hard', 'though', 'needed', 'surpass', 'way', 'home', 'amazing', 'high', 'moments', 'lazy', 'others', 'marvel', 'desperate', 'need', 'hit', 'theyve', 'finally', 'got', 'ithighlightsevery', 'member', 'crew', 'got', 'time', 'shine', 'rocket', 'definitely', 'one', 'stood', 'though', 'sees', 'friend', 'die', 'wail', 'pain', 'grief', 'made', 'bawl', 'high', 'evolutionary', 'mocks', 'man', 'heavy', 'stuff', 'proceeds', 'rip', 'face', 'two', 'friends', 'shot', 'well', 'thus', 'rockets', 'traumatic', 'pastime', 'revealedchukwudi', 'iwuji', 'fantastic', 'villain', 'certain', 'points', 'downright', 'terrifying', 'really', 'liked', 'line', 'god', 'thats', 'took', 'place', 'convincing', 'villainthe', 'moment', 'starlord', 'screams', 'agony', 'rocket', 'live', 'moment', 'resonat

In [None]:
# Removing the stopwords
stop_word = set(stopwords.words('english'))
column_title = []

for i in dfs['Review Title']:
    word_token = word_tokenize(i)
    # Keep only the tokens that are not in the stopword list
    filter_sentence = [w for w in word_token if w.lower() not in stop_word]
    print(filter_sentence)
    column_title.append(filter_sentence)

['scene', 'intelligent', 'raccoon', 'rabbit', 'artificial', 'legs', 'walrus', 'wheels', 'otter', 'metal', 'arms', 'made', 'cry']
['fitting', 'heartfelt', 'end', 'trilogy']
['one', 'best', 'mcu', 'trilogies']
['one', 'best', 'mcu', 'movies', 'hands']
['groot']
['beautifully', 'dark', 'goofy', 'goodbye']
['best', 'marvel', 'film', 'year']
['perfect', 'send', 'guardians']
['cried', 'four', 'times']
['emotional', 'energetic']
['rocket', 'love']
['finally', 'marvel', 'start', 'return', 'roots']
['much', 'better', 'recent', 'marvel', 'projects', 'strong', 'previous', 'guardians', 'galaxy', 'instalments']
['messy', 'enjoyable', 'sendoff']
['wasnt', 'prepared', 'traumatic', 'would']
['aint', 'thing', 'like', 'except', 'james', 'gunn', 'really', 'isnt']
['outstanding', 'emotional', 'payoff']
['heroes', 'start', 'somewhere']
['poignant', 'heartfelt', 'finale']
['different', 'necessarily', 'good', 'way']
['hero', 'story']
['goodbut', 'oversold', 'underwhelming']
['goodness', 'gracious']
['underwh

In [None]:
from textblob import Word
portstem = PorterStemmer()
for i in column_description:
    updated_description_column = []
    words = word_tokenize(" ".join(i))
    for w in words:
        # create updated column data, after running the word tokens through porter stemmer and stemming process
        updated_description_column.append(portstem.stem(w))
    # display each comment after stemming
    print(updated_description_column)

['guardian', 'galaxi', 'volum', 'chaotic', 'weird', 'oftentim', 'ridicul', 'also', 'full', 'heart', 'emot', 'great', 'themesi', 'must', 'say', 'best', 'marvel', 'movi', 'sinc', 'endgam', 'that', 'necessarili', 'hard', 'though', 'need', 'surpass', 'way', 'home', 'amaz', 'high', 'moment', 'lazi', 'other', 'marvel', 'desper', 'need', 'hit', 'theyv', 'final', 'got', 'ithighlightseveri', 'member', 'crew', 'got', 'time', 'shine', 'rocket', 'definit', 'one', 'stood', 'though', 'see', 'friend', 'die', 'wail', 'pain', 'grief', 'made', 'bawl', 'high', 'evolutionari', 'mock', 'man', 'heavi', 'stuff', 'proce', 'rip', 'face', 'two', 'friend', 'shot', 'well', 'thu', 'rocket', 'traumat', 'pastim', 'revealedchukwudi', 'iwuji', 'fantast', 'villain', 'certain', 'point', 'downright', 'terrifi', 'realli', 'like', 'line', 'god', 'that', 'took', 'place', 'convinc', 'villainth', 'moment', 'starlord', 'scream', 'agoni', 'rocket', 'live', 'moment', 'reson', 'lost', 'mani', 'peopl', 'close', 'cant', 'stand', 'l

In [None]:
for i in column_title:
    updated_title_column = []
    words = word_tokenize(" ".join(i))
    for w in words:
        # create updated column data, after running the word tokens through porter stemmer and stemming process
        updated_title_column.append(portstem.stem(w))
    # display each comment after stemming
    print(updated_title_column)

['scene', 'intellig', 'raccoon', 'rabbit', 'artifici', 'leg', 'walru', 'wheel', 'otter', 'metal', 'arm', 'made', 'cri']
['fit', 'heartfelt', 'end', 'trilog']
['one', 'best', 'mcu', 'trilog']
['one', 'best', 'mcu', 'movi', 'hand']
['groot']
['beauti', 'dark', 'goofi', 'goodby']
['best', 'marvel', 'film', 'year']
['perfect', 'send', 'guardian']
['cri', 'four', 'time']
['emot', 'energet']
['rocket', 'love']
['final', 'marvel', 'start', 'return', 'root']
['much', 'better', 'recent', 'marvel', 'project', 'strong', 'previou', 'guardian', 'galaxi', 'instal']
['messi', 'enjoy', 'sendoff']
['wasnt', 'prepar', 'traumat', 'would']
['aint', 'thing', 'like', 'except', 'jame', 'gunn', 'realli', 'isnt']
['outstand', 'emot', 'payoff']
['hero', 'start', 'somewher']
['poignant', 'heartfelt', 'final']
['differ', 'necessarili', 'good', 'way']
['hero', 'stori']
['goodbut', 'oversold', 'underwhelm']
['good', 'graciou']
['underwhelm']
['total', 'miss', 'didnt', 'need']
['scene', 'intellig', 'raccoon', 'rabbi

In [None]:
# import the required library
from textblob import Word
import nltk
nltk.download('wordnet')
# run the loop for each value from the Review Description column
for i in column_description:
    updated_description_column = []
    words = word_tokenize(" ".join(i))
    for w in words:
        # lemmatize each word token and collect the result
        updated_description_column.append(Word(w).lemmatize())
    print(updated_description_column)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['guardian', 'galaxy', 'volume', 'chaotic', 'weird', 'oftentimes', 'ridiculous', 'also', 'full', 'heart', 'emotion', 'great', 'themesi', 'must', 'say', 'best', 'marvel', 'movie', 'since', 'endgame', 'thats', 'necessarily', 'hard', 'though', 'needed', 'surpass', 'way', 'home', 'amazing', 'high', 'moment', 'lazy', 'others', 'marvel', 'desperate', 'need', 'hit', 'theyve', 'finally', 'got', 'ithighlightsevery', 'member', 'crew', 'got', 'time', 'shine', 'rocket', 'definitely', 'one', 'stood', 'though', 'see', 'friend', 'die', 'wail', 'pain', 'grief', 'made', 'bawl', 'high', 'evolutionary', 'mock', 'man', 'heavy', 'stuff', 'proceeds', 'rip', 'face', 'two', 'friend', 'shot', 'well', 'thus', 'rocket', 'traumatic', 'pastime', 'revealedchukwudi', 'iwuji', 'fantastic', 'villain', 'certain', 'point', 'downright', 'terrifying', 'really', 'liked', 'line', 'god', 'thats', 'took', 'place', 'convincing', 'villainthe', 'moment', 'starlord', 'scream', 'agony', 'rocket', 'live', 'moment', 'resonated', 'lo

In [None]:
# import the required library
from textblob import Word
import nltk
nltk.download('wordnet')

# run the loop for each value from the Review Title column
for i in column_title:
    updated_title_column = []
    words = word_tokenize(" ".join(i))
    for w in words:
        # lemmatize each word token and collect the result
        updated_title_column.append(Word(w).lemmatize())
    print(updated_title_column)

['scene', 'intelligent', 'raccoon', 'rabbit', 'artificial', 'leg', 'walrus', 'wheel', 'otter', 'metal', 'arm', 'made', 'cry']
['fitting', 'heartfelt', 'end', 'trilogy']
['one', 'best', 'mcu', 'trilogy']
['one', 'best', 'mcu', 'movie', 'hand']
['groot']
['beautifully', 'dark', 'goofy', 'goodbye']
['best', 'marvel', 'film', 'year']
['perfect', 'send', 'guardian']
['cried', 'four', 'time']
['emotional', 'energetic']
['rocket', 'love']
['finally', 'marvel', 'start', 'return', 'root']
['much', 'better', 'recent', 'marvel', 'project', 'strong', 'previous', 'guardian', 'galaxy', 'instalment']
['messy', 'enjoyable', 'sendoff']
['wasnt', 'prepared', 'traumatic', 'would']
['aint', 'thing', 'like', 'except', 'james', 'gunn', 'really', 'isnt']
['outstanding', 'emotional', 'payoff']
['hero', 'start', 'somewhere']
['poignant', 'heartfelt', 'finale']
['different', 'necessarily', 'good', 'way']
['hero', 'story']
['goodbut', 'oversold', 'underwhelming']
['goodness', 'gracious']
['underwhelming']
['tota

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
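
Question 2 asks for the cleaned text to be saved in a new column of the CSV file, and the cells above only print their results. A minimal sketch of that last step follows, using only the standard library. The helper name `add_clean_column`, the column name `Clean Description`, and the sample row are assumptions for illustration, and the in-line cleaning is a placeholder for the full pipeline.

```python
import csv
import io
import re

def add_clean_column(reader_file, writer_file, source_col, clean_col):
    """Copy a CSV, appending clean_col derived from source_col."""
    reader = csv.DictReader(reader_file)
    fieldnames = list(reader.fieldnames) + [clean_col]
    writer = csv.DictWriter(writer_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Placeholder cleaning: keep letters and spaces only, then lowercase
        cleaned = re.sub(r"[^A-Za-z ]", "", row[source_col])
        row[clean_col] = cleaned.lower().strip()
        writer.writerow(row)

# In-memory stand-ins for the input and output files (sample row is made up)
source = io.StringIO("Review Title,Review Description\nI am groot,Loved it 10/10!\n")
target = io.StringIO()
add_clean_column(source, target, "Review Description", "Clean Description")
print(target.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open(..., newline='')` handles, which is the mode the `csv` module documentation recommends.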


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
# Your code here
import spacy
from spacy import displacy

# Load the small English pipeline once and reuse it in the cells below
nlp = spacy.load('en_core_web_sm')

# Running totals for the four parts of speech required by the question
noun_count = 0
adjective_count = 0
adverb_count = 0
verb_count = 0

# Tag the review descriptions first, then the review titles
for i in column_description + column_title:
    text = " ".join(i)
    for word in nlp(text):
        # Print the word and its part of speech
        print(word.text, ':', word.pos_)

        # Count the occurrences of different parts of speech
        if word.pos_ == 'NOUN':
            noun_count += 1
        elif word.pos_ == 'ADJ':
            adjective_count += 1
        elif word.pos_ == 'ADV':
            adverb_count += 1
        elif word.pos_ == 'VERB':
            verb_count += 1

print('No. of Verbs = ', verb_count)
print('No. of Nouns = ', noun_count)
print('No. of Adverbs = ', adverb_count)
print('No. of Adjectives = ', adjective_count)


Streaming output truncated to the last 5000 lines.
guardians : PROPN
galaxy : VERB
volume : NOUN
story : NOUN
family : NOUN
loss : NOUN
technology : NOUN
prominent : ADJ
todays : NOUN
zeitgeist : NOUN
themes : NOUN
sometimes : ADV
cause : VERB
forget : VERB
watching : VERB
marvel : NOUN
movie : NOUN
movies : NOUN
backstory : NOUN
dark : ADJ
helps : VERB
hammer : PROPN
home : PROPN
themes : NOUN
villainthe : X
universe : NOUN
full : ADJ
characters : NOUN
weird : ADJ
sometimes : ADV
weird : ADJ
unique : ADJ
compared : VERB
rest : NOUN
mcu : NOUN
that : PRON
s : VERB
mostly : ADV
good : ADJ
thing : NOUN
despite : SCONJ
movie : NOUN
dark : ADJ
times : NOUN
also : ADV
lot : NOUN
humour : NOUN
fits : VERB
especially : ADV
well : ADV
guardians : PROPN
galaxy : VERB
franchise : NOUN
compared : VERB
marvel : NOUN
franchises : NOUN
action : NOUN
course : NOUN
music : NOUN
level : NOUN
expect : VERB
guardians : NOUN
galaxy : VERB
movies : NOUN
one : NUM
better : ADJ
post : NOUN
endg
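
The four totals can also be kept in a single `collections.Counter` instead of four separate variables. The sketch below works on hand-copied `(token, tag)` pairs in the same form as the printed output, so it runs without a spaCy model; the `tagged` list and the `WANTED` tuple are illustrative names. Whether proper nouns (`PROPN`) should be added to the noun total is a judgment call that this sketch leaves out.

```python
from collections import Counter

# Hand-copied (token, tag) pairs in the same form as the printed output
tagged = [
    ("volume", "NOUN"),
    ("story", "NOUN"),
    ("prominent", "ADJ"),
    ("sometimes", "ADV"),
    ("cause", "VERB"),
    ("forget", "VERB"),
    ("guardians", "PROPN"),
]

# Only the four categories the question asks for
WANTED = ("NOUN", "VERB", "ADJ", "ADV")
pos_counts = Counter(tag for _, tag in tagged if tag in WANTED)

for tag in WANTED:
    print(tag, pos_counts[tag])
```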

In [None]:
# Dependency parsing: for every token, print the token, its dependency label,
# its head word, and the part of speech of the head.
# Note: the spaCy pipeline loaded above provides a dependency parser only, so
# this cell does not produce constituency trees; those need a separate parser.
def analyze(text):
    for word in nlp(text):
        print(word.text, word.dep_, word.head.text, word.head.pos_)

for i in column_title:
    text = " ".join(i)
    analyze(text)
for i in column_description:
    text = " ".join(i)
    analyze(text)

Streaming output truncated to the last 5000 lines.
though mark sees VERB
sees advcl stood VERB
friend compound grief NOUN
die compound wail NOUN
wail compound grief NOUN
pain compound grief NOUN
grief dobj sees VERB
made acl grief NOUN
bawl ccomp made VERB
high amod man NOUN
evolutionary amod mocks NOUN
mocks nmod man NOUN
man compound proceeds NOUN
heavy amod stuff NOUN
stuff compound proceeds NOUN
proceeds nsubj rip VERB
rip nsubj face VERB
face ccomp galaxy VERB
two nummod friends NOUN
friends dobj face VERB
shot acl friends NOUN
well advmod shot VERB
thus advmod shot VERB
rockets nsubj liked VERB
traumatic nmod villain PROPN
pastime compound iwuji PROPN
revealedchukwudi compound iwuji PROPN
iwuji nmod villain PROPN
fantastic amod villain PROPN
villain nmod points NOUN
certain amod points NOUN
points appos rockets PROPN
downright advmod terrifying VERB
terrifying acl points NOUN
really advmod liked VERB
liked ccomp galaxy VERB
line compound god PROPN
god dobj liked VER
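
The output above lists dependency relations only. For the constituency side, one lightweight approximation is chunking with `nltk.RegexpParser`, which groups tagged tokens into phrases such as NP and VP. The sketch below is shallow chunking under stated assumptions, not a full constituency parse: the Penn Treebank tags are assigned by hand for the review title "I cried four times", and the two-rule grammar is illustrative.

```python
import nltk

# Hand-assigned Penn Treebank tags (an assumption; a real run would use a tagger)
tagged = [("I", "PRP"), ("cried", "VBD"), ("four", "CD"), ("times", "NNS")]

# Two-stage chunk grammar: noun phrases first, then verb followed by a noun phrase
grammar = r"""
  NP: {<CD>?<JJ>*<NN.*>+}
  VP: {<VB.*><NP>}
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

# Collect the words under each NP node
noun_phrases = [
    [word for word, _ in subtree.leaves()]
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

In the resulting tree, "four times" forms a noun phrase nested inside the verb phrase headed by "cried", whereas a dependency parse would instead link each word directly to its head word.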

In [None]:
options = {'compact': True, 'font': 'Source Sans Pro', 'distance': 100}
def visualization(text):
  displacy.render(nlp(text), options=options, style = 'dep', jupyter=True)
for i in column_title:
  text = " ".join(i)
  visualization(text)
for i in column_description:
  text = " ".join(i)
  visualization(text)

In [None]:
entity_types = ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW',
                'LANGUAGE','DATE','TIME','PERCENT','MONEY','QUANTITY','ORDINAL','CARDINAL']
entity_name = ['person','Nationality','Building','Institution','country','location','PRODUCT','EVENT','Title','LAW',
                'LANGUAGE','DATE','TIME','PERCENT','MONEY','QUANTITY','ORDINAL','CARDINAL']

def extraction(sentence):
    entities = []
    doc = nlp(sentence)
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))
    return entities
for i in column_title:
    entity_list = extraction(" ".join(i))
    if len(entity_list) > 0:
        print(entity_list)
for i in column_description:
    entity_list = extraction(" ".join(i))
    if len(entity_list) > 0:
        print(entity_list)


[('fitting heartfelt end', 'PERSON')]
[('one', 'CARDINAL')]
[('one', 'CARDINAL')]
[('beautifully dark goofy goodbye', 'PERSON')]
[('four', 'CARDINAL')]
[('james gunn', 'PERSON')]
[('fitting heartfelt end', 'PERSON')]
[('one', 'CARDINAL')]
[('one', 'CARDINAL')]
[('one', 'CARDINAL'), ('two', 'CARDINAL'), ('revealedchukwudi iwuji', 'ORG'), ('starlord', 'ORG'), ('quill', 'PERSON')]
[('one', 'CARDINAL')]
[('one', 'CARDINAL'), ('james gunn', 'PERSON'), ('one', 'CARDINAL'), ('hours minutes', 'TIME'), ('aroundbetween', 'NORP'), ('third', 'ORDINAL'), ('one', 'CARDINAL'), ('two', 'CARDINAL')]
[('one', 'CARDINAL'), ('quill gamora', 'PERSON'), ('james gunn', 'PERSON'), ('one', 'CARDINAL'), ('james gunn mind', 'PERSON')]
[('firstly', 'ORDINAL'), ('warlocks intro', 'PERSON'), ('two', 'CARDINAL'), ('half', 'CARDINAL'), ('third', 'ORDINAL'), ('third', 'ORDINAL'), ('one', 'CARDINAL')]
[('chris pratt', 'PERSON'), ('james gunn', 'PERSON')]
[('james gunn', 'PERSON')]
[('one', 'CARDINAL'), ('first', 'ORDIN
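
Question 3 also asks for the count of each entity, which the cell above does not compute. A minimal sketch with `collections.Counter` follows. The `entities` list is a hand-copied sample of `(text, label)` pairs in the same form as the output above, not the full result, and both counter names are illustrative.

```python
from collections import Counter

# Hand-copied sample of (text, label) pairs in the form printed above
entities = [
    ("james gunn", "PERSON"),
    ("one", "CARDINAL"),
    ("chris pratt", "PERSON"),
    ("james gunn", "PERSON"),
    ("third", "ORDINAL"),
]

# How many entities of each type were found
label_counts = Counter(label for _, label in entities)
# How often each individual (text, label) pair occurs
entity_counts = Counter(entities)

print(label_counts.most_common())
print(entity_counts.most_common(1))
```

In the notebook, the pairs returned by `extraction` for every row could be collected into one list and counted the same way.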

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
# Working on this assignment was quite challenging for me. This is the first time I have worked on web scraping. I collected 10000 reviews from IMDb through scraping. I enjoyed collecting the data from the IMDb page and performing the cleaning tasks.