<a href="https://colab.research.google.com/github/mdurgasrikari/Durga_Srikari_INFO5731_Spring2024/blob/main/Maguluri_Durga_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [138]:
# Program to collect the top 1000 user reviews of The Super Mario Bros. Movie (2023) from IMDB.

# Import requests for HTTP requests, urljoin to resolve relative URLs against the base URL, BeautifulSoup for HTML parsing, and pandas for data analysis
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd

#Setting the base url
imdb_url = 'https://www.imdb.com/title/tt6718170/reviews?sort=submissionDate&dir=asc&ratingFilter=0'

# Retrieve the HTML from the IMDB URL and parse it so that information can be extracted for further analysis
result = requests.get(imdb_url)
soup = BeautifulSoup(result.text, "html.parser")

# Read the 'data-ajaxurl' attribute of the first '.load-more-data' element, i.e. the relative URL the page uses to load more reviews
Load_more_first_element = soup.select(".load-more-data")[0]['data-ajaxurl']

# Define a function that extracts the reviews from a single page. Each review is stored as a dictionary with the keys 'Title' and 'Review', and the dictionaries are collected in a list.
def extract_reviews(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    reviews_page = soup.select(".review-container")
    reviews = []
    for i in reviews_page:
        title = i.select(".title")[0].text.strip()
        review = i.select(".text")[0].text.strip()
        reviews.append({'Title': title, 'Review': review})
    return reviews

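# Collect reviews over repeated requests (range(1, 100) issues 99 requests) and add all of them to the list all_reviews.
# NOTE: the 'data-ajaxurl' value does not appear to contain a '{}' placeholder, so .format(j) likely leaves it unchanged and
# every request may return the same 25 reviews; the 2475 = 99 x 25 rows reported below are consistent with that.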
all_reviews = []
for j in range(1, 100):
    page_url = urljoin(imdb_url, Load_more_first_element.format(j))
    reviews_on_page = extract_reviews(page_url)
    all_reviews.extend(reviews_on_page)

# Organize the reviews in a data frame
df = pd.DataFrame(all_reviews)

df.to_csv('The_Mario_Bros_Movie_Reviews.csv')
print(df.head(5))
print(df.tail(5))

                                    Title  \
0                 Better than anticipated   
1                            Lots of Fun.   
2  I wanted to love it but just couldn't.   
3                     For the gamers only   
4                        Boring & joyless   

                                              Review  
0  We've all seen the crappy version of Chris Pra...  
1  I went in with low expectations . Over the yea...  
2  When I was a kid I remember anticipating the n...  
3  This movie is like a mario game coming to life...  
4  A Joyless, humourless, plotless waste of time....  
                                                  Title  \
2470  1UP for the Gamers: A complete love letter to ...   
2471  Can someone tell me what the plot for this mov...   
2472  THIS IS THE BEST MOVIE I EVER SEEN! This movie...   
2473                       It has Donkey Kong and more!   
2474  If you like constant fan service, this movie i...   

                                            

In [110]:
df.shape

(2475, 2)
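
The row count above, 2475 = 99 × 25, is consistent with the same 25-review page being fetched on every request, because the loop never passes a pagination token. Below is a minimal sketch of token-based pagination with duplicate removal. It assumes each response exposes a token for the next page (on IMDB's review pages this may be the `data-key` attribute of the `.load-more-data` element, which is an assumption and not something the cell above reads). `collect_reviews`, `FAKE_PAGES`, and the fetcher interface are hypothetical names, and the in-memory pages stand in for HTTP calls.

```python
from typing import Callable, Dict, List, Optional, Tuple

Review = Dict[str, str]
# A page fetcher takes a pagination key (None for the first page) and returns
# that page's reviews plus the key of the next page (None when there is none).
PageFetcher = Callable[[Optional[str]], Tuple[List[Review], Optional[str]]]


def collect_reviews(fetch_page: PageFetcher, limit: int = 1000) -> List[Review]:
    """Follow pagination keys until `limit` unique reviews are collected."""
    collected: List[Review] = []
    seen = set()
    key: Optional[str] = None
    while len(collected) < limit:
        reviews, key = fetch_page(key)
        new_on_page = 0
        for review in reviews:
            fingerprint = (review["Title"], review["Review"])
            if fingerprint in seen:          # skip reviews already collected
                continue
            seen.add(fingerprint)
            collected.append(review)
            new_on_page += 1
            if len(collected) == limit:
                break
        # Stop when there is no next page or the page brought nothing new.
        if key is None or new_on_page == 0:
            break
    return collected


# Hypothetical in-memory "site" with three pages (assumption, for illustration).
FAKE_PAGES = {
    None: ([{"Title": "a", "Review": "first"}, {"Title": "b", "Review": "second"}], "k1"),
    "k1": ([{"Title": "b", "Review": "second"}, {"Title": "c", "Review": "third"}], "k2"),
    "k2": ([{"Title": "d", "Review": "fourth"}], None),
}

sample = collect_reviews(lambda key: FAKE_PAGES[key], limit=3)
print([r["Title"] for r in sample])
```

In the notebook, the fetcher would wrap `requests.get` and `BeautifulSoup` in the same way `extract_reviews` does, returning the parsed reviews together with the next token. The "nothing new" check also ends the loop if a site keeps returning the same page.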

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [139]:
#Program to clean the text data and save the clean data in the CSV file.

# Import the string module for text manipulation and the NLTK library for preprocessing and analyzing the data, including stop-word removal, stemming, lemmatization, and noise removal
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the NLTK punkt tokenizer for tokenization, the collection of stop words to be removed from the text, and the WordNet dataset used for lemmatization.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [140]:
# Program to remove special characters from the data. In the first review, for example, the apostrophe is removed and "We've" becomes "Weve".

import string
# Remove noise such as special characters and punctuation by defining a function that returns the string without any character in string.punctuation
def clean_text(text):
    return ''.join(i for i in text if i not in string.punctuation)


# Apply cleaning steps to the 'Review' column in the DataFrame and display the first 5 rows in the data frame with the original review and the cleaned review in another column.
df['Cleaned_Review'] = df['Review'].apply(clean_text)
df.head(5)

| | Title | Review | Cleaned_Review |
|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | Weve all seen the crappy version of Chris Prat... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | I went in with low expectations Over the year... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | When I was a kid I remember anticipating the n... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | This movie is like a mario game coming to life... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | A Joyless humourless plotless waste of timeThe... |
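
In the last row above, deleting the punctuation fuses two words ("time....The" becomes "timeThe"), because nothing is left between them. One possible alternative, sketched here as an assumption rather than the method used in this notebook, is to replace each punctuation mark with a space and then collapse the whitespace. `clean_text_keep_boundaries` is a hypothetical name, and the sample string is an illustrative fragment.

```python
import string

# Translation table that maps every ASCII punctuation mark to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_keep_boundaries(text: str) -> str:
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())


print(clean_text_keep_boundaries("waste of time....The plot"))
```

The trade-off is that contractions are split instead of fused ("We've" becomes "We ve"), so the choice depends on which of the two effects matters more for the later steps.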


In [141]:
# Program to remove numbers from the data. For example, the last sentence of the first review, "A Mario Number 2", becomes "A Mario Number".
# Define a function that uses string manipulation to return the text with every digit character removed
def remove_numbers_from_clean_text(text):
    return ''.join(j for j in text if not j.isdigit())

# Apply cleaning steps to the 'Cleaned_Review' column in the DataFrame
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_numbers_from_clean_text)
# Display first 5 rows of the modified data frame
df.head(5)

| | Title | Review | Cleaned_Review |
|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | Weve all seen the crappy version of Chris Prat... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | I went in with low expectations Over the year... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | When I was a kid I remember anticipating the n... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | This movie is like a mario game coming to life... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | A Joyless humourless plotless waste of timeThe... |


In [142]:
# Program to remove the stop words contained in the list downloaded with "nltk.download('stopwords')". In the result, the second word of the first review ("all") no longer appears in the cleaned review
# Define a function that compares each lowercased word with the stop-word list and returns the text without the stop words (the case of the remaining words is left unchanged)
stop_words = set(stopwords.words('english'))
def remove_stopwords_from_clean_text(text):
    return ' '.join(k for k in text.split() if k.lower() not in stop_words)

# Apply cleaning steps to the previous cleaned review where the special characters and numbers were already removed in previous steps.
df['Cleaned_Review'] = df['Cleaned_Review'].apply(remove_stopwords_from_clean_text)
# Display first 5 rows of the modified data frame
df.head(5)

| | Title | Review | Cleaned_Review |
|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | Weve seen crappy version Chris Prats wahoo mar... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | went low expectations years watched Hollywood ... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | kid remember anticipating new Super Mario Bros... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | movie like mario game coming life great time f... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | Joyless humourless plotless waste timeThe plot... |
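
The order of the cleaning steps affects this result: "Weve" in the first row survives stop-word removal, and a contraction stripped of its apostrophe can no longer match a stop-word entry that is written with one. The sketch below illustrates the effect with a small hypothetical stop-word set (`STOP_WORDS` is a stand-in, not the NLTK list) and a made-up sample sentence.

```python
import string

# Hypothetical stand-in for a stop-word list (assumption, not the NLTK list).
STOP_WORDS = {"you've", "all", "the"}


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def remove_stopwords(text: str) -> str:
    """Drop words whose lowercase form is in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


sample = "You've seen all the trailers"

# Punctuation first: "You've" turns into "Youve" and no longer matches.
punct_first = remove_stopwords(remove_punctuation(sample))
# Stop words first: "You've" is matched and removed before punctuation goes.
stop_first = remove_punctuation(remove_stopwords(sample))

print(punct_first)
print(stop_first)
```

Running stop-word removal before punctuation removal is one way to let apostrophe forms match; whether that order suits the rest of the pipeline is a design choice.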


In [143]:
# Program to convert the text to lowercase. In the result, the first word of the first review is now lowercase.
# Define a function that converts the cleaned review text to lowercase
def lowercase_text(text):
    return text.lower()

# Apply the step to the 'Cleaned_Review' column of the DataFrame. The lowercased 'Cleaned_Review' column is the common input for the next two steps:
# stemming and lemmatization each write their result to a separate column, so the difference between the two processes on the same text
# can be seen by the end of the lemmatization step
df['Cleaned_Review'] = df['Cleaned_Review'].apply(lowercase_text)
# Display first 5 rows of the modified data frame
df.head(5)

| | Title | Review | Cleaned_Review |
|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | weve seen crappy version chris prats wahoo mar... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | went low expectations years watched hollywood ... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | kid remember anticipating new super mario bros... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | movie like mario game coming life great time f... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | joyless humourless plotless waste timethe plot... |


In [144]:
# Program to apply stemming, which reduces words to their stems. In the first review, for example, "chris" becomes "chri"
# Create a PorterStemmer instance and define a function that stems every word of a review
word_stemmer = PorterStemmer()
def review_word_stem(text):
    return ' '.join(word_stemmer.stem(word) for word in text.split())

# Apply cleaning steps to the 'Cleaned_Review' column in the DataFrame
df['Cleaned_Review_stemming'] = df['Cleaned_Review'].apply(review_word_stem)
# Display first 5 rows of the modified data frame
df.head(5)

| | Title | Review | Cleaned_Review | Cleaned_Review_stemming |
|---|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | weve seen crappy version chris prats wahoo mar... | weve seen crappi version chri prat wahoo mario... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | went low expectations years watched hollywood ... | went low expect year watch hollywood cut time ... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | kid remember anticipating new super mario bros... | kid rememb anticip new super mario bro movi bo... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | movie like mario game coming life great time f... | movi like mario game come life great time fan ... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | joyless humourless plotless waste timethe plot... | joyless humourless plotless wast timeth plot w... |


In [145]:
# Program to apply lemmatization, using the downloaded WordNet data, to reduce words to meaningful base forms. Comparing the result with the stemming column shows how
# the two processes differ on the same text
# Create a WordNetLemmatizer instance and define a function that lemmatizes every word of the cleaned text
review_word_lemmatizer = WordNetLemmatizer()
def word_lemmatization(text):
    return ' '.join(review_word_lemmatizer.lemmatize(word) for word in text.split())

# Apply the step to the 'Cleaned_Review' column in the DataFrame and create a new column to display the changes after lemmatization
df['Cleaned_Review_Lemmentization'] = df['Cleaned_Review'].apply(word_lemmatization)
df.head(5)

| | Title | Review | Cleaned_Review | Cleaned_Review_stemming | Cleaned_Review_Lemmentization |
|---|---|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | weve seen crappy version chris prats wahoo mar... | weve seen crappi version chri prat wahoo mario... | weve seen crappy version chris prat wahoo mari... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | went low expectations years watched hollywood ... | went low expect year watch hollywood cut time ... | went low expectation year watched hollywood cu... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | kid remember anticipating new super mario bros... | kid rememb anticip new super mario bro movi bo... | kid remember anticipating new super mario bros... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | movie like mario game coming life great time f... | movi like mario game come life great time fan ... | movie like mario game coming life great time f... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | joyless humourless plotless waste timethe plot... | joyless humourless plotless wast timeth plot w... | joyless humourless plotless waste timethe plot... |
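
The question asks for the cleaned text to be saved as a new column in the CSV file, while the cells above end by displaying the DataFrame. In the notebook this would presumably be a single `df.to_csv(..., index=False)` call. The sketch below shows the same idea with only the standard library so that it runs anywhere; `write_reviews_csv` is a hypothetical helper, and the row holds shortened sample values rather than the full data.

```python
import csv
import io

# Shortened sample row (illustrative values, not the full reviews).
rows = [
    {
        "Title": "Lots of Fun.",
        "Review": "I went in with low expectations .",
        "Cleaned_Review": "went low expectations",
    },
]


def write_reviews_csv(rows, handle):
    """Write a list of dictionaries as CSV, one column per dictionary key."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_reviews_csv(rows, buffer)
print(buffer.getvalue())
```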


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [146]:
# Commands for installing the libraries needed to answer all the requirements of Question 3

# pip is the package installer used to install Python libraries. In this case we install the nltk and spaCy libraries
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_sm

# Import all necessary libraries for POS tagging, parsing, and named entity recognition
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# Import spaCy, an advanced NLP library, and load its small English language model
import spacy
nlp = spacy.load("en_core_web_sm")

# Download the NLTK resources punkt (a tokenizer) and averaged_perceptron_tagger (a part-of-speech tagger)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [147]:
# Define a function that performs POS tagging on a single review and counts the tags. It maps the Penn Treebank tags returned by nltk.pos_tag onto the
# dictionary Review_Analysis_POS_tags by their first letter to count the number of nouns, verbs, adjectives, and adverbs
def review_pos_tag(review):
    Review_Analysis_POS_tags = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}
    pos_tags = nltk.pos_tag(nltk.word_tokenize(review))
    for word, tag in pos_tags:
        if tag.startswith('N'):
            Review_Analysis_POS_tags['Noun'] += 1
        elif tag.startswith('V'):
            Review_Analysis_POS_tags['Verb'] += 1
        elif tag.startswith('J'):
            Review_Analysis_POS_tags['Adjective'] += 1
        elif tag.startswith('R'):
            Review_Analysis_POS_tags['Adverb'] += 1
    return Review_Analysis_POS_tags

In [148]:
# Function to run the POS analysis on each review in the DataFrame. The result records how many nouns, verbs, adjectives, and adverbs are
# present in each review
def review_analysis(df):
    df['Review_Analysis_POS_tags_based_stemming'] = df['Cleaned_Review_stemming'].apply(review_pos_tag)
    return df

# adding to a data frame and displaying first five rows
review_analysis(df)
df.head(5)

| | Title | Review | Cleaned_Review | Cleaned_Review_stemming | Cleaned_Review_Lemmentization | Review_Analysis_POS_tags_based_stemming |
|---|---|---|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | weve seen crappy version chris prats wahoo mar... | weve seen crappi version chri prat wahoo mario... | weve seen crappy version chris prat wahoo mari... | {'Noun': 28, 'Verb': 10, 'Adjective': 14, 'Adv... |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | went low expectations years watched hollywood ... | went low expect year watch hollywood cut time ... | went low expectation year watched hollywood cu... | {'Noun': 57, 'Verb': 15, 'Adjective': 27, 'Adv... |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | kid remember anticipating new super mario bros... | kid rememb anticip new super mario bro movi bo... | kid remember anticipating new super mario bros... | {'Noun': 96, 'Verb': 26, 'Adjective': 36, 'Adv... |
| 3 | For the gamers only | This movie is like a mario game coming to life... | movie like mario game coming life great time f... | movi like mario game come life great time fan ... | movie like mario game coming life great time f... | {'Noun': 36, 'Verb': 9, 'Adjective': 17, 'Adve... |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | joyless humourless plotless waste timethe plot... | joyless humourless plotless wast timeth plot w... | joyless humourless plotless waste timethe plot... | {'Noun': 33, 'Verb': 11, 'Adjective': 9, 'Adve... |
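
The question asks for the total number of nouns, verbs, adjectives, and adverbs, whereas the column above holds one dictionary per review. A minimal sketch of summing those dictionaries follows. `total_pos_counts` is a hypothetical helper, and the two sample dictionaries reuse the noun, verb, and adjective counts visible in rows 0 and 1; the adverb counts are truncated in the display and are therefore left out.

```python
from collections import Counter


def total_pos_counts(per_review_counts):
    """Sum per-review POS dictionaries into one dictionary of totals."""
    totals = Counter()
    for counts in per_review_counts:
        totals.update(counts)  # adds the values key by key
    return dict(totals)


# Counts visible in rows 0 and 1 above (adverb values are truncated there).
rows = [
    {"Noun": 28, "Verb": 10, "Adjective": 14},
    {"Noun": 57, "Verb": 15, "Adjective": 27},
]
print(total_pos_counts(rows))
```

In the notebook, the same function could be called on the whole column, for example `total_pos_counts(df['Review_Analysis_POS_tags_based_stemming'])`, since iterating over a pandas Series yields its values.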


In [149]:
# Import displacy to render the dependency parsing tree
from spacy import displacy

# The first review is used as the example text to demonstrate the dependency parsing tree. Dependency parsing captures the syntactic links between the words of a sentence
# as a tree of head-dependent relations, whereas constituency parsing examines the hierarchical structure of a sentence by breaking it down into nested constituents (phrases), represented in a parse tree.
text = "We've all seen the crappy version of Chris Prats 'wahoo' mario attempt.However it was used as comic relief which was very much appreciated.Bowsers character held a lot of what we love about Jack Blacks personality.DK wasn't so much what we expect out of Seth Rogen, but considering it's PG, probably a good thing haha.In saying that we do get the laugh, yes you're welcome!Over all a very well rounded story line. I was personally a bit nervous about the movie being great but a shocker of an ending. Though not everyone will be happy with it, there is an opening with us wanting more. And it looks like we will get what we want. A Mario Number 2!"

# Plot the dependency graph
doc = nlp(text)
displacy.render(doc, style='dep',jupyter=True)
for tok in doc:
  print(tok.text,"-->",tok.dep_,"-->",tok.pos_)


We --> nsubj --> PRON
've --> aux --> AUX
all --> dep --> PRON
seen --> ROOT --> VERB
the --> det --> DET
crappy --> amod --> ADJ
version --> dobj --> NOUN
of --> prep --> ADP
Chris --> compound --> PROPN
Prats --> poss --> PROPN
' --> case --> PROPN
wahoo --> nmod --> PROPN
' --> punct --> PUNCT
mario --> compound --> NOUN
attempt --> pobj --> NOUN
. --> punct --> PUNCT
However --> advmod --> ADV
it --> nsubjpass --> PRON
was --> auxpass --> AUX
used --> ROOT --> VERB
as --> prep --> ADP
comic --> amod --> ADJ
relief --> pobj --> NOUN
which --> nsubj --> PRON
was --> auxpass --> AUX
very --> advmod --> ADV
much --> advmod --> ADV
appreciated --> conj --> VERB
. --> punct --> PUNCT
Bowsers --> compound --> NOUN
character --> nsubj --> NOUN
held --> ROOT --> VERB
a --> det --> DET
lot --> dobj --> NOUN
of --> prep --> ADP
what --> dobj --> PRON
we --> nsubj --> PRON
love --> pcomp --> VERB
about --> prep --> ADP
Jack --> compound --> PROPN
Blacks --> compound --> PROPN
personality --> p
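
To read the listing above as a tree, each token can be attached to the word it depends on. The sketch below does this for the first clause, "We've all seen the crappy version". The dependency labels are copied from the listing; the head of each token is not printed there, so the head assignments in `ARCS` are assumptions made for illustration, and `render_tree` is a hypothetical helper.

```python
from collections import defaultdict

# (token, dependency label, head token). Labels come from the listing above;
# the heads are assumed for illustration. None marks the root.
ARCS = [
    ("We", "nsubj", "seen"),
    ("'ve", "aux", "seen"),
    ("all", "dep", "seen"),
    ("seen", "ROOT", None),
    ("the", "det", "version"),
    ("crappy", "amod", "version"),
    ("version", "dobj", "seen"),
]


def render_tree(arcs):
    """Return the dependency tree as indented lines, root first."""
    children = defaultdict(list)
    root = None
    for token, label, head in arcs:
        if head is None:
            root = (token, label)
        else:
            children[head].append((token, label))

    lines = []

    def visit(node, depth):
        token, label = node
        lines.append("  " * depth + f"{token} ({label})")
        for child in children[token]:
            visit(child, depth + 1)

    visit(root, 0)
    return lines


print("\n".join(render_tree(ARCS)))
```

Read this way, the verb "seen" is the root, "We" is its subject (nsubj), and "version" is its direct object (dobj) with "the" and "crappy" as its modifiers. The sketch keys nodes by token text, which only works when no word repeats; with spaCy, `tok.head` and `tok.i` identify the governing token and its position unambiguously.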

In [150]:
# Function to perform Named Entity Recognition (NER) on a single review and count the entities by type.
# NER is an important NLP step that classifies spans of text as person names, locations, organizations, dates, etc.; the function returns the count for each type
def Named_Entity_Recognition(review):
    output = nlp(review)
    NER_counts = {}
    for m in output.ents:
        Type = m.label_
        Text = m.text
        if Type in NER_counts:
            NER_counts[Type].append(Text)
        else:
            NER_counts[Type] = [Text]
    NER_counts = {key: len(value) for key, value in NER_counts.items()}
    return NER_counts


In [151]:
# Function to run the NER analysis on each review in the DataFrame
def review_NER_analysis(df):
    df['NER_Counts_based_stemming'] = df['Cleaned_Review_stemming'].apply(Named_Entity_Recognition)
    return df

# Example usage
df_analysis = review_NER_analysis(df)
df.head(5)

| | Title | Review | Cleaned_Review | Cleaned_Review_stemming | Cleaned_Review_Lemmentization | Review_Analysis_POS_tags_based_stemming | NER_Counts_based_stemming |
|---|---|---|---|---|---|---|---|
| 0 | Better than anticipated | We've all seen the crappy version of Chris Pra... | weve seen crappy version chris prats wahoo mar... | weve seen crappi version chri prat wahoo mario... | weve seen crappy version chris prat wahoo mari... | {'Noun': 28, 'Verb': 10, 'Adjective': 14, 'Adv... | {'PERSON': 1} |
| 1 | Lots of Fun. | I went in with low expectations . Over the yea... | went low expectations years watched hollywood ... | went low expect year watch hollywood cut time ... | went low expectation year watched hollywood cu... | {'Noun': 57, 'Verb': 15, 'Adjective': 27, 'Adv... | {'DATE': 2, 'TIME': 1} |
| 2 | I wanted to love it but just couldn't. | When I was a kid I remember anticipating the n... | kid remember anticipating new super mario bros... | kid rememb anticip new super mario bro movi bo... | kid remember anticipating new super mario bros... | {'Noun': 96, 'Verb': 26, 'Adjective': 36, 'Adv... | {'ORG': 3, 'CARDINAL': 2, 'GPE': 1, 'PERSON': 1} |
| 3 | For the gamers only | This movie is like a mario game coming to life... | movie like mario game coming life great time f... | movi like mario game come life great time fan ... | movie like mario game coming life great time f... | {'Noun': 36, 'Verb': 9, 'Adjective': 17, 'Adve... | {'PERSON': 2} |
| 4 | Boring & joyless | A Joyless, humourless, plotless waste of time.... | joyless humourless plotless waste timethe plot... | joyless humourless plotless wast timeth plot w... | joyless humourless plotless waste timethe plot... | {'Noun': 33, 'Verb': 11, 'Adjective': 9, 'Adve... | {'ORG': 1, 'PERSON': 1, 'DATE': 1} |
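
The column above counts entities per type. The question also asks for the count of each entity, which means keeping the entity text as well. A minimal sketch follows, assuming the entities are available as (text, label) pairs; `count_entities` is a hypothetical helper and the sample pairs are invented for illustration, not taken from the model output.

```python
from collections import Counter


def count_entities(entities):
    """Count entities per label and per (label, text) pair.

    `entities` is an iterable of (text, label) pairs.
    """
    entities = list(entities)
    by_label = Counter(label for _, label in entities)
    by_entity = Counter((label, text) for text, label in entities)
    return by_label, by_entity


# Invented sample pairs (assumption, for illustration only).
sample = [
    ("Chris Pratt", "PERSON"),
    ("Jack Black", "PERSON"),
    ("Nintendo", "ORG"),
    ("Jack Black", "PERSON"),
]
by_label, by_entity = count_entities(sample)
print(dict(by_label))
print(by_entity.most_common(1))
```

With spaCy, such pairs can be built as `[(ent.text, ent.label_) for ent in nlp(review).ents]`. Because the input used above is lowercased and stemmed, the model likely has fewer cues for recognizing names, so running NER on the original `Review` column may be worth comparing.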


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also give your opinion on the time provided to complete the assignment.


When I started this assignment, the first thing I had to work out was how to scrape data efficiently from the IMDB website. I did a lot of research, analyzing various GitHub repositories and reading several articles on Medium to understand how scraping is done. I learned that there is a Python package (imdby) that can scrape data from IMDB, including the reviews and their content. But when I used it, the challenge I faced was that it loaded the reviews and their content only from the first page. This gave only 25 reviews, which did not satisfy the requirement of this assignment (loading 1000 reviews). After further research, I used another option, the ajaxurl, which helped me load more than 25 reviews.

For the preprocessing, the tasks provided in the assignment were to remove special characters, numbers, and stop words, and to perform stemming and lemmatization. I learned how to achieve these with string manipulation functions and specific libraries, primarily NLTK and the string module. The result of each preprocessing subtask was stored in a data frame, and the changes can be seen in the column Cleaned_Review alongside the original title and review for better understanding and comparison. As part of this, I had to understand how the different string manipulation functions work and how they can be combined so that they work in conjunction to achieve each preprocessing task. I also learned how stemming and lemmatization work and how they differ in theory.

Task three was to demonstrate Parts of Speech (POS) Tagging, Constituency Parsing and Dependency Parsing, and Named Entity Recognition. It took me some time to understand the concepts and to work out how to write code that derives the results. I used GeeksforGeeks and the Python documentation to understand them better. After this research, I took the approach of installing the NLTK and spaCy libraries and using their NLP capabilities to achieve the required tasks. I was able to successfully implement the POS tagging and NER counts. The in-class code demonstration helped with the DP and CP trees.
