# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of the product [2019 Dell labtop](https://www.amazon.com/Dell-Inspiron-5000-5570-Laptop/dp/B07N49F51N/ref=sr_1_11?crid=1IJ7UWF2F4GHH&keywords=dell%2Bxps%2B15&qid=1580173569&sprefix=dell%2Caps%2C181&sr=8-11&th=1) on amazon.

(2) Collect the top 100 User Reviews of the film [Joker](https://www.imdb.com/title/tt7286456/reviews?ref_=tt_urv) from IMDB.

(3) Collect the abstracts of the top 100 research papers by using the query [natural language processing](https://citeseerx.ist.psu.edu/search?q=natural+language+processing&submit.x=0&submit.y=0&sort=rlv&t=doc) from CiteSeerX.

(4) Collect the top 100 tweets by using hashtag ["#wuhancoronovirus"](https://twitter.com/hashtag/wuhancoronovirus) from Twitter. 


In [1]:
# Write your code here
#Install necessary packages

!pip install selenium
!apt-get update
!apt install chromium-chromedriver

Collecting selenium
[?25l  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
[K     |▍                               | 10kB 15.6MB/s eta 0:00:01[K     |▊                               | 20kB 21.4MB/s eta 0:00:01[K     |█                               | 30kB 15.4MB/s eta 0:00:01[K     |█▌                              | 40kB 16.4MB/s eta 0:00:01[K     |█▉                              | 51kB 14.7MB/s eta 0:00:01[K     |██▏                             | 61kB 15.4MB/s eta 0:00:01[K     |██▌                             | 71kB 14.3MB/s eta 0:00:01[K     |███                             | 81kB 13.3MB/s eta 0:00:01[K     |███▎                            | 92kB 12.1MB/s eta 0:00:01[K     |███▋                            | 102kB 12.1MB/s eta 0:00:01[K     |████                            | 112kB 12.1MB/s eta 0:00:01[K     |████▍                           | 12

In [25]:
# Collection of top 100 reviews of Joker and creation of a dataframe
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
import time
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver =webdriver.Chrome('chromedriver',options=chrome_options)
url='https://www.imdb.com/title/tt7286456/reviews?ref_=tt_urv'
wait=WebDriverWait(driver,10)
driver.get(url)

for i in range(3):
  button=driver.find_element_by_class_name('ipl-load-more__button')
  time.sleep(2)
  button.click()
  time.sleep(5)
time.sleep(10)
data=driver.page_source
driver.quit()
parser=BeautifulSoup(data,'html.parser')


names=[]
for div in parser.find_all(name='div',attrs={"class":"display-name-date"}):
  for span in div.find_all(name='span',attrs={"class":"display-name-link"}):
    names.append(span.text)

reviews=[]
for div in parser.find_all(name='div',attrs={'class':'content'}):
  for di in div.find_all(name='div',attrs={'class':'show-more__control'}):
    reviews.append(di.text)

df = pd.DataFrame({'User Name':names, 'Review': reviews})
df
       

Unnamed: 0,User Name,Review
0,MihaVrhunc,"Every once in a while a movie comes, that trul..."
1,lesterarnoldpinto,This is a movie that only those who have felt ...
2,Aman_Goyal,"Truly a masterpiece, The Best Hollywood film o..."
3,logical_guy,Joaquin Phoenix gives a tour de force performa...
4,kdagoulis26,Most of the time movies are anticipated like t...
...,...,...
95,The CyberHippie,This film tries desperately to devillify the m...
96,wongcalvin,"Wow I honestly gotta tell you, it's one of the..."
97,kernelmilkshake-67766,The whole point of this movie is to bring atte...
98,danteshamest,"While I enjoyed the film, it felt pretty short..."


In [26]:
# Copy to CSV file
pd.DataFrame(df).to_csv('joker_reviews.csv',header=True,index=None)

# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming. 

(6) Lemmatization.

In [5]:
# Write your code here
# Import and download required packages
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
words=stopwords.words('english')
from textblob import Word
from textblob import TextBlob
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import PorterStemmer
st=PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [27]:
# Clean the text data
# To lower_case
df['Clean_Review']=df['Review'].apply(lambda x:" ".join(x.lower() for x in x.split()))

# remove punctuation
df['Clean_Review']=df['Clean_Review'].apply(lambda  x: " ".join(x for x in x.split() if x not in string.punctuation))

# remove special characters
df['Clean_Review']=df['Clean_Review'].apply(lambda x:" ".join(x.replace('[#,@,&,!,$,^,*]', '') for x in x.split()))

# remove stop words
df['Clean_Review']=df['Clean_Review'].apply(lambda x:" ".join(x for x in x.split() if x not in words))

# remove numbers
df['Clean_Review']=df['Clean_Review'].apply(lambda x:" ".join(x.replace('\d+', '') for x in x.split()))

# correct spellings
df['Clean_Review']=df['Clean_Review'].apply(lambda x: str(TextBlob(x).correct()))

# tokenize
df['Clean_Review']=df['Clean_Review'].apply(lambda x: TextBlob(x).words)

# stemming
df['Clean_Review']=df['Clean_Review'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# lemmatization
df['Clean_Review']=df['Clean_Review'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

df_update = df.drop('Review', axis=1).copy()

df_update


Unnamed: 0,User Name,Clean_Review
0,MihaVrhunc,everi movi come truli make impact joaquin 's p...
1,lesterarnoldpinto,movi felt alon isol truli relat it understand ...
2,Aman_Goyal,truli masterpiec best hollywood film 2019 one ...
3,logical_guy,joaquin phoenix give tour de forc perform fear...
4,kdagoulis26,time move anticip like end fall short way shor...
...,...,...
95,The CyberHippie,film tri desper devillifi stylish charismat vi...
96,wongcalvin,now honestli gutta tell you one best move i 'v...
97,kernelmilkshake-67766,whole point movi bring attent mental ill i 'll...
98,danteshamest,enjoy film felt pretti short the end appear sc...


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes: 

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [28]:
# Write your code here

# 3-1 Tagging and Counting POS
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from collections import Counter
from nltk.tokenize import word_tokenize

df['Tokens']=df['Clean_Review'].apply(lambda x: TextBlob(x).words)

pos=[]
for i in df['Tokens']:
  pos.append(nltk.pos_tag(i))
pos

counts=[]
for i in pos:
  counts.append(Counter(tag for word,tag in i))
counts


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[Counter({'IN': 1,
          'JJ': 8,
          'JJS': 1,
          'NN': 24,
          'NNS': 1,
          'POS': 1,
          'PRP': 1,
          'RB': 3,
          'VB': 1,
          'VBD': 1,
          'VBG': 1,
          'VBN': 1,
          'VBP': 3,
          'VBZ': 1}),
 Counter({'CD': 1,
          'IN': 3,
          'JJ': 7,
          'NN': 19,
          'PRP': 1,
          'RB': 1,
          'RBR': 1,
          'VB': 1,
          'VBD': 1,
          'VBP': 4,
          'VBZ': 1}),
 Counter({'CD': 2,
          'DT': 1,
          'IN': 1,
          'JJ': 13,
          'JJR': 1,
          'JJS': 3,
          'MD': 2,
          'NN': 36,
          'NNS': 2,
          'RB': 4,
          'RP': 1,
          'VB': 4,
          'VBG': 1,
          'VBP': 7,
          'VBZ': 1,
          'WP': 1}),
 Counter({'FW': 1,
          'IN': 3,
          'JJ': 10,
          'NN': 28,
          'POS': 3,
          'PRP': 1,
          'VB': 3,
          'VBD': 3,
          'VBZ': 1}),
 Counter({'I

In [29]:
#3-2
#Dependency Parsing
import spacy
from spacy import displacy
nlp=spacy.load('en_core_web_sm')

for i in df_update['Clean_Review']:
  token=nlp(i)
  text= list(token.sents)
  displacy.render(text, style='dep', jupyter=True, options={'distance': 90})

In [30]:
# Constituency Parsing
!pip install benepar
%tensorflow_version 1.x
import benepar
benepar.download('benepar_en3')
from benepar.spacy_plugin import BeneparComponent

# Loading spaCy’s en model and adding benepar model to its pipeline
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent('benepar_en3'))


# Generating a parse tree for the text
for i in df_update['Clean_Review']:
  print(list(nlp(i).sents)[0]._.parse_string)

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!
(S (NP (FW everi) (FW movi)) (VP (VBP come)))
(S (NP (RB movi)) (VP (VBD felt) (S (JJ alon) (NP (FW isol) (FW truli) (FW relat)))))
(NP (NP (NN truli) (NN masterpiec)) (NP (NP (JJS best) (JJ hollywood) (NN film)) (CD 2019)))
(S (NP (NN joaquin) (FW phoenix)) (VBP give) (NP (NN tour) (FW de) (FW forc)) (VB perform) (NP (JJ fearless) (NN stun) (NN emot) (NN depth) (JJ physic) (JJ imposs) (NN talk)) (PP (IN without) (NN referenc) (NP (NN heath) (NNP ledger) (POS 's)) (NML (NN oscar) (HYPH -) (NN win)) (VB perform) (NML (JJ dark) (NN knight)) (JJ wide) (NN consid) (VB definit) (NML (JJ live) (HYPH -) (NN act)) (NN portray) (NN joke)) (VP (VB let) (S (NP (PRP 's)) (VP (VB talk) (NP (PRP it))))))
(NP (NN time) (VB move) (PP (RB anticip) (PP (IN like) (NP (NN end)))))
(VP (VB let) (S (S (VB start)) (S (VP (VB say) (SBAR (S (NP (NN joaquin) (NNP phoenix)) (VP (VB get) (NP 

In [31]:
# Extract entries and calculate the count

entry=[]
tag=[]
for review in df_update['Clean_Review']:
  nlp_review=nlp(review)
  for ent in nlp_review.ents:
    entry.append(ent.text)
    tag.append(ent.label_)
    print("Text :",ent.text," ","Label: ",ent.label_)



Text : joaquin   Label:  ORG
Text : scenographi brillianc   Label:  PERSON
Text : funni   Label:  ORG
Text : rollercoast sometim   Label:  ORG
Text : multipl   Label:  PERSON
Text : isol truli   Label:  PERSON
Text : truli   Label:  PERSON
Text : one   Label:  CARDINAL
Text : truli masterpiec   Label:  PERSON
Text : hollywood   Label:  GPE
Text : 2019   Label:  DATE
Text : one   Label:  CARDINAL
Text : messag societi   Label:  PERSON
Text : someth   Label:  PERSON
Text : societi   Label:  NORP
Text : messag   Label:  PERSON
Text : first   Label:  ORDINAL
Text : joaquin phoenix   Label:  PERSON
Text : de forc   Label:  PERSON
Text : exce dark knight 's   Label:  PERSON
Text : first   Label:  ORDINAL
Text : pernici violenc   Label:  PERSON
Text : crimin phillip phoenix   Label:  PERSON
Text : joaquin phoenix   Label:  PERSON
Text : phoenix amaz   Label:  PERSON
Text : phillip best   Label:  PERSON
Text : someth horribl   Label:  PERSON
Text : empti arthur 's   Label:  ORG
Text : masterpi

In [32]:
from collections import Counter

my_tags=set(tag)
occurrences = Counter(tag)
occurrences

Counter({'CARDINAL': 108,
         'DATE': 25,
         'EVENT': 1,
         'FAC': 5,
         'GPE': 43,
         'LOC': 3,
         'NORP': 27,
         'ORDINAL': 16,
         'ORG': 101,
         'PERSON': 256,
         'PRODUCT': 6,
         'QUANTITY': 2,
         'TIME': 11,
         'WORK_OF_ART': 1})

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):** 

In [None]:
'''
Dependency Parsing Tree:

Dependency parsing tree is the process of analysing the grammatical structure of a sentence based on the dependencies between the words 
in a sentence.

Example:
consider the sentence 'rainy weather' . In this sentence, the word rainy modifies the meaning of noun weather. Therefore dependency exists
from weather -> rainy.

Constituency Parsing Tree:

It is the process of analyzing the sentences by breaking down it into sub phrases. These sub phrases may belong to a category of grammar 
like NP(noun phrase) and VP(verb phrase).

Example:

(S (NP (DT every) (NN movie)) (VP (VP (VBZ comes)) (, ,) (VP (ADVP (RB truly)) (VB make) (NP (NN impact))))

The above one is the parse tree in the form of a string.

The S, NP, VP, ADVP, NP.... represents the constituents.
DT, NN, VBZ, RB, VB -> pos tags
every, movie,  comes, truly, make, impact -> words of sentences.


'''