<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 reddits by using a hashtag (you can use any hashtag) from Reddits.


In [1]:
import nltk
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests as rq
import numpy as np

import matplotlib.pyplot as plt


# Define the base URL for IMDb movie reviews and specify the movie ID
imdb_base_url = 'https://www.imdb.com/title/'
mve_id = 'tt15398776'  

# Creating a DataFrame with a single column to store the review texts.
df = pd.DataFrame(columns=['Movie Review'])

# Number of pages to scrapeDefine the total number of review pages to scrape
total_review_pages = 50  


for p in range(1, total_review_pages + 1):
    
    url = f'{imdb_base_url}{mve_id}/reviews?start={((p - 1) * 10)}'
    req = rq.get(url).text
    sp = bs(req, 'html.parser')
    review_txt = sp.find_all('div', attrs={'class': 'text show-more__control'})
    review_lst = []
    
    
    for j in range(len(review_txt)):
        review_lst.append(review_txt[j].get_text())
    
    
    df = pd.concat([df, pd.DataFrame({'Movie Reviews': review_lst})], ignore_index=True)

print(df[['Movie Reviews']])

                                          Movie Reviews
0     You'll have to have your wits about you and yo...
1     One of the most anticipated films of the year ...
2     "Oppenheimer" is a biographical thriller film ...
3     This movie is just... wow! I don't think I hav...
4     I was familiar with the Manhattan project and ...
...                                                 ...
1245  I'm a big fan of Nolan's work so was really lo...
1246  This movie is very interesting and very thrill...
1247  I think I will be in the minority here when I ...
1248  It saddens me that so many people are mistakin...
1249  0 out of 10 starsChristopher Nolan's Oppenheim...

[1250 rows x 1 columns]


In [2]:
df.to_csv('mvereview.csv')

# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [3]:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')

# Create a function that processes and refines text
def cleaningText(txt):
    # (1) Eliminate symbols and punctuation by employing regular expressions.
    txt = re.sub(r'[^\w\s]', '', txt)
    # (2) Eliminate digits using regular expressions.
    txt = re.sub(r'\d', '', txt)
    # (4) Change all the letters in the text to lowercase.
    txt = txt.lower()
    # (3) Eliminate stop words
    common_words = set(stopwords.words('english'))
    text_tokens = txt.split()
    filtered_tokens = [token for token in text_tokens if token not in common_words]

    
    # Create instances of a stemmer and a lemmatizer.
    stemming_tool = PorterStemmer()

    lemmatization_tool = WordNetLemmatizer()

    
    # (5) Employ stemming 
    # (6) lemmatization
    words = [stemming_tool.stem(w) for w in text_tokens]
    words = [lemmatization_tool.lemmatize(w) for w in text_tokens]
    
    # Reassemble the cleaned words to form a unified string.
    cleaned_text = ' '.join(words)
    return cleaned_text

# Retrieve the DataFrame containing movie reviews.
df = pd.read_csv('mvereview.csv')  

# Utilize the cleaningText function to process the content in the 'Movie Reviews' column.
df['cleanedFilm_reviews'] = df['Movie Reviews'].apply(cleaningText)

# Save the DataFrame containing the cleaned data to a new CSV file
df['cleanedFilm_reviews'].to_csv('cleanedFilm_reviews.csv', index=False)




[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mounicatamalampudi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mounicatamalampudi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [4]:
!pip install spacy



In [5]:
pip install --upgrade pydantic thinc spacy


Note: you may need to restart the kernel to use updated packages.


In [6]:
import spacy
import string

# Download the "en_core_web_sm" model
!python -m spacy download en_core_web_sm

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Now you can use the model for your text processing


Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
[K     |████████████████████████████████| 12.8 MB 6.2 MB/s eta 0:00:01
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [14]:
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tree import ParentedTree
import string


nlp = spacy.load("en_core_web_sm")


sampleText = df["cleanedFilm_reviews"].iloc[15]

# (1) Parts of Speech (POS) Tagging
doc = nlp(sampleText)
nounCnt = 0
verbCnt = 0
adjCnt = 0
advCnt = 0

for token in doc:
    if token.pos_ == 'NOUN':
        nounCnt += 1
    elif token.pos_ == 'VERB':
        verbCnt += 1
    elif token.pos_ == 'ADJ':
        adjCnt += 1
    elif token.pos_ == 'ADV':
        advCnt += 1

print(f"Number of Nouns: {nounCnt}")
print(f"Number of Verbs: {verbCnt}")
print(f"Number of Adjectives: {adjCnt}")
print(f"Number of Adverbs: {advCnt}")

# (2) Constituency Parsing
def nltkconstituency_Parsing(text):
    sentences = nltk.sent_tokenize(text)
    for s in sentences:
        word = nltk.word_tokenize(s)
        tag = nltk.pos_tag(word)
        grammar = "NP: {<DT>?<JJ>*<NN>}"
        cp = nltk.RegexpParser(grammar)
        tree= cp.parse(tag)
        print("Constituency Parsing Tree:")
        tree.pretty_print()

# Calling nltkconstituency_Parsing function 
nltkconstituency_Parsing(sampleText)

# (3) Named Entity Recognition (NER)
entity = {}
for e in doc.ents:
    entity[e.label_] = entity.get(e.label_, 0) + 1

print("\nNamed Entity Recognition (NER):")
for label, count in entity.items():
    print(f"{label}: {count}")

Number of Nouns: 40
Number of Verbs: 22
Number of Adjectives: 22
Number of Adverbs: 16
Constituency Parsing Tree:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 S                                    

In [15]:
#ALTERATIVE CODE


# "cleanedFilm_reviews" column
cleanedReview = df["cleanedFilm_reviews"]


# Define a simple grammar for constituency parsing
grammar = r"""
    NP: {<DT>?<JJ>*<NN>}  # Noun Phrase
    VP: {<VB.*><NP|PP|CLAUSE>+$}  # Verb Phrase
    PP: {<IN><NP>}  # Prepositional Phrase
    CLAUSE: {<NP><VP>}  # Clause
"""

# Function to conduct syntax and structure analysis
def syntaxstructure_Analysis(review):
    # Tokenize the review into words
    word = nltk.word_tokenize(review)

    tag = nltk.pos_tag(word)
    parser = nltk.RegexpParser(grammar)
    constituency_tree = parser.parse(tag)

    namedEntities = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(review)))

    return tags, constituency_tree, namedEntities

# Initialize counters for POS categories
count_of_nouns = 0
count_of_verbs = 0
count_of_adjectives = 0
count_of_adverbs = 0

# Initialize dictionaries to count entities
entity_counts = {
    "Person": 0,
    "Organization": 0,
    "Location": 0,
    "Product": 0,
    "Date": 0
}

# performing syntax and structure analysis for each review in the column
for idx, review in enumerate(cleaned_review[:20], start=1):
    print(f"Analysis for review {idx}:")
    tags, constituency_tree, named_entities = perform_syntax_structure_analysis(review)

    # Calculate the total number of Nouns (N), Verbs (V), Adjectives (Adj), and Adverbs (Adv) for each review
    nouns = sum(1 for word, pos in tags if pos.startswith('N'))
    verbs = sum(1 for word, pos in tags if pos.startswith('V'))
    adjectives = sum(1 for word, pos in tags if pos.startswith('J'))
    adverbs = sum(1 for word, pos in tags if pos.startswith('R'))

    count_of_nouns += nouns
    count_of_verbs += verbs
    count_of_adjectives += adjectives
    count_of_adverbs += adverbs

    # Extract entities and count them
    for entity in named_entities:
        if isinstance(entity, nltk.Tree):
            entity_label = entity.label()
            entity_text = " ".join(word for word, pos in entity.leaves())
            if entity_label in entity_counts:
                entity_counts[entity_label] += 1

    print("Tags:")
    print(tags)

    print("Constituency Parsing Tree:")
    print(Tree.fromstring(str(constituency_tree)).pretty_print())

    print("Named Entities:")
    print(named_entities)
    print("="*50)

# Print the totals
print(f"Total Nouns (N): {count_of_nouns}")
print(f"Total Verbs (V): {count_of_verbs}")
print(f"Total Adjectives (Adj): {count_of_adjectives}")
print(f"Total Adverbs (Adv): {count_of_adverbs}")

# Print entity counts
print("Entity Counts:")
for entity_type, count in entity_counts.items():
    print(f"{entity_type}: {count}")

Analysis for review 1:
Tags:
[('youll', 'RB'), ('have', 'VBP'), ('to', 'TO'), ('have', 'VB'), ('your', 'PRP$'), ('wit', 'NN'), ('about', 'IN'), ('you', 'PRP'), ('and', 'CC'), ('your', 'PRP$'), ('brain', 'NN'), ('fully', 'RB'), ('switched', 'VBN'), ('on', 'IN'), ('watching', 'VBG'), ('oppenheimer', 'RP'), ('a', 'DT'), ('it', 'PRP'), ('could', 'MD'), ('easily', 'RB'), ('get', 'VB'), ('away', 'RP'), ('from', 'IN'), ('a', 'DT'), ('nonattentive', 'JJ'), ('viewer', 'NN'), ('this', 'DT'), ('is', 'VBZ'), ('intelligent', 'JJ'), ('filmmaking', 'VBG'), ('which', 'WDT'), ('show', 'VBP'), ('it', 'PRP'), ('audience', 'NN'), ('great', 'JJ'), ('respect', 'NN'), ('it', 'PRP'), ('fire', 'VBZ'), ('dialogue', 'NN'), ('packed', 'VBD'), ('with', 'IN'), ('information', 'NN'), ('at', 'IN'), ('a', 'DT'), ('relentless', 'JJ'), ('pace', 'NN'), ('and', 'CC'), ('jump', 'NN'), ('to', 'TO'), ('very', 'RB'), ('different', 'JJ'), ('time', 'NN'), ('in', 'IN'), ('oppenheimer', 'JJ'), ('life', 'NN'), ('continuously', 'RB

Tags:
[('oppenheimer', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('biographical', 'JJ'), ('thriller', 'NN'), ('film', 'NN'), ('written', 'VBN'), ('and', 'CC'), ('directed', 'VBN'), ('by', 'IN'), ('christopher', 'NN'), ('nolan', 'IN'), ('the', 'DT'), ('dark', 'NN'), ('knight', 'VBD'), ('trilogy', 'JJ'), ('inception', 'NN'), ('interstellar', 'NN'), ('dunkirk', 'NN'), ('based', 'VBN'), ('on', 'IN'), ('the', 'DT'), ('biography', 'NN'), ('american', 'JJ'), ('prometheus', 'NN'), ('by', 'IN'), ('kai', 'JJ'), ('bird', 'NN'), ('and', 'CC'), ('martin', 'NN'), ('j', 'NN'), ('sherwin', 'NN'), ('starring', 'VBG'), ('cillian', 'JJ'), ('murphy', 'NN'), ('in', 'IN'), ('the', 'DT'), ('lead', 'JJ'), ('role', 'NN'), ('in', 'IN'), ('addition', 'NN'), ('to', 'TO'), ('matt', 'VB'), ('damon', 'NN'), ('robert', 'NN'), ('downey', 'NN'), ('jr', 'NN'), ('emily', 'RB'), ('blunt', 'NN'), ('and', 'CC'), ('florence', 'NN'), ('pugh', 'IN'), ('it', 'PRP'), ('subverts', 'VBZ'), ('the', 'DT'), ('usual', 'JJ'), ('biopic', 'NN')

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

Tags:
[('oppenheimer', 'NN'), ('might', 'MD'), ('be', 'VB'), ('the', 'DT'), ('best', 'JJS'), ('film', 'NN'), ('i', 'NN'), ('watched', 'VBD'), ('in', 'IN'), ('a', 'DT'), ('long', 'JJ'), ('long', 'JJ'), ('timevery', 'NN'), ('different', 'JJ'), ('than', 'IN'), ('nolans', 'NNS'), ('recent', 'JJ'), ('film', 'NN'), ('especially', 'RB'), ('the', 'DT'), ('scifi', 'JJ'), ('one', 'CD'), ('but', 'CC'), ('show', 'VBP'), ('that', 'IN'), ('nolan', 'NN'), ('can', 'MD'), ('master', 'VB'), ('the', 'DT'), ('biopicdrama', 'NN'), ('genre', 'NN'), ('just', 'RB'), ('a', 'DT'), ('well', 'RB'), ('a', 'DT'), ('he', 'PRP'), ('can', 'MD'), ('any', 'DT'), ('other', 'JJ'), ('genre', 'NN'), ('he', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('tackle', 'VB'), ('yetthe', 'NN'), ('film', 'NN'), ('is', 'VBZ'), ('hour', 'NN'), ('long', 'RB'), ('but', 'CC'), ('go', 'VBP'), ('through', 'IN'), ('very', 'RB'), ('quickly', 'RB'), ('and', 'CC'), ('enjoyably', 'RB'), ('without', 'IN'), ('spoiling', 'VBG'), ('anything', 'NN'), ('th

Tags:
[('master', 'NN'), ('craftsman', 'NN'), ('christopher', 'NN'), ('nolan', 'NN'), ('probably', 'RB'), ('the', 'DT'), ('best', 'JJS'), ('blockbuster', 'NN'), ('director', 'NN'), ('out', 'IN'), ('there', 'RB'), ('along', 'IN'), ('with', 'IN'), ('ridley', 'NN'), ('scott', 'NN'), ('return', 'NN'), ('to', 'TO'), ('good', 'JJ'), ('old', 'JJ'), ('fashioned', 'VBN'), ('nocgi', 'JJ'), ('drama', 'NN'), ('where', 'WRB'), ('tension', 'NN'), ('come', 'VB'), ('from', 'IN'), ('word', 'NN'), ('spoken', 'VBN'), ('and', 'CC'), ('how', 'WRB'), ('people', 'NNS'), ('react', 'VBP'), ('to', 'TO'), ('them', 'PRP'), ('there', 'EX'), ('are', 'VBP'), ('no', 'DT'), ('chase', 'JJ'), ('no', 'DT'), ('shootout', 'NN'), ('death', 'NN'), ('defying', 'VBG'), ('stunt', 'NN'), ('or', 'CC'), ('explosion', 'NN'), ('wait', 'VBP'), ('actually', 'RB'), ('there', 'EX'), ('is', 'VBZ'), ('one', 'CD'), ('explosion', 'NN'), ('i', 'NN'), ('dont', 'VBP'), ('know', 'VB'), ('how', 'WRB'), ('they', 'PRP'), ('made', 'VBD'), ('those',

Tags:
[('okay', 'JJ'), ('nolan', 'NN'), ('fan', 'NN'), ('get', 'VB'), ('your', 'PRP$'), ('finger', 'NN'), ('poised', 'VBD'), ('to', 'TO'), ('downvote', 'VB'), ('what', 'WP'), ('im', 'VB'), ('about', 'IN'), ('to', 'TO'), ('say', 'VB'), ('thats', 'NNS'), ('the', 'DT'), ('only', 'JJ'), ('way', 'NN'), ('i', 'NN'), ('can', 'MD'), ('understand', 'VB'), ('the', 'DT'), ('high', 'JJ'), ('rating', 'NN'), ('for', 'IN'), ('this', 'DT'), ('film', 'NN'), ('thousand', 'CD'), ('of', 'IN'), ('devoted', 'VBN'), ('nolan', 'NN'), ('fan', 'NN'), ('inflating', 'VBG'), ('the', 'DT'), ('score', 'NN'), ('because', 'IN'), ('if', 'IN'), ('youre', 'NN'), ('honest', 'VBP'), ('there', 'RB'), ('no', 'DT'), ('way', 'NN'), ('this', 'DT'), ('mottled', 'VBD'), ('mess', 'NN'), ('of', 'IN'), ('a', 'DT'), ('movie', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('not', 'RB'), ('in', 'IN'), ('any', 'DT'), ('sane', 'NN'), ('universeive', 'JJ'), ('seen', 'VBN'), ('all', 'DT'), ('of', 'IN'), ('nolans', 'NNS'), ('film', 'NN'), ('memento',

Tags:
[('christopher', 'NN'), ('nolan', 'NN'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('all', 'DT'), ('time', 'NN'), ('he', 'PRP'), ('ha', 'VBD'), ('four', 'CD'), ('of', 'IN'), ('my', 'PRP$'), ('top', 'JJ'), ('six', 'CD'), ('movie', 'NN'), ('eight', 'CD'), ('of', 'IN'), ('my', 'PRP$'), ('top', 'JJ'), ('twenty', 'NN'), ('it', 'PRP'), ('absurd', 'VBZ'), ('how', 'WRB'), ('much', 'JJ'), ('joy', 'NN'), ('he', 'PRP'), ('given', 'VBN'), ('me', 'PRP'), ('i', 'JJ'), ('consider', 'VBP'), ('him', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('full', 'JJ'), ('extent', 'NN'), ('of', 'IN'), ('the', 'DT'), ('word', 'NN'), ('a', 'DT'), ('geniuslet', 'NN'), ('me', 'PRP'), ('start', 'VBP'), ('by', 'IN'), ('saying', 'VBG'), ('this', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('objectively', 'RB'), ('good', 'JJ'), ('movie', 'NN'), ('the', 'DT'), ('technical', 'JJ'), ('aspect', 'NN'), ('are', 'VBP'), ('all', 'DT'), ('up', 'RB'), ('to', 'TO'), ('the', 'DT'), ('nolan', 'JJ'), (

Tags:
[('authentic', 'JJ'), ('audiovisual', 'JJ'), ('journey', 'NN'), ('to', 'TO'), ('the', 'DT'), ('era', 'NN'), ('of', 'IN'), ('the', 'DT'), ('birth', 'NN'), ('of', 'IN'), ('atomic', 'JJ'), ('genesis', 'NN'), ('which', 'WDT'), ('both', 'DT'), ('terrifies', 'NNS'), ('and', 'CC'), ('astonishes', 'NNS'), ('you', 'PRP'), ('with', 'IN'), ('it', 'PRP'), ('nonlinear', 'JJ'), ('storytelling', 'VBG'), ('thanks', 'NNS'), ('to', 'TO'), ('christopher', 'VB'), ('nolans', 'NNS'), ('masterful', 'JJ'), ('approach', 'NN'), ('to', 'TO'), ('direction', 'NN'), ('and', 'CC'), ('screenplay', 'VB'), ('the', 'DT'), ('pacing', 'NN'), ('of', 'IN'), ('the', 'DT'), ('movie', 'NN'), ('is', 'VBZ'), ('simply', 'RB'), ('brilliant', 'JJ'), ('especially', 'RB'), ('in', 'IN'), ('the', 'DT'), ('moment', 'NN'), ('where', 'WRB'), ('the', 'DT'), ('main', 'JJ'), ('theme', 'NN'), ('by', 'IN'), ('ludwig', 'JJ'), ('göransson', 'NN'), ('kick', 'NN'), ('in', 'IN'), ('it', 'PRP'), ('a', 'DT'), ('future', 'JJ'), ('classic', 'NN')

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency Parsing Tree: The grammatical structure of a phrase can be represented in a hierarchical form by making use of something that is known as a constituency parsing tree. This is accomplished by breaking down a sentence into its component phrases, which can be classified as either noun phrases (NP), verb phrases (VP), prepositional phrases (PP), or clauses. The edges illustrate how the component phrases are nested inside each other at various levels, while the nodes in the tree each represent a different phrase that may be broken down into its component parts. As you make your way down the tree, you will eventually get at the level of individual words before you get to the root node, which represents the complete phrase most of the time. As you make your way away from the root node, you will run upon phrases that are progressively shorter. Each phrase has a label that reflects the category of grammatical construction to which it belongs (for example, NP or VP), and the phrase itself may contain words or other phrases.

Dependency Parsing Tree: A dependency parsing tree is a type of tree that is utilized for the purpose of describing the syntactic relationships that exist between the words in a sentence. In this particular representation of a tree, every word is given its own node, and the connections that exist between the nodes illustrate the grammatical and syntactic relationships that exist between the words. In a dependency tree, there are often two distinct types of nodes: the root node, which represents the most significant verb or main phrase of the sentence, and dependent nodes, which indicate other words that rely on the root or on each other. The root node is always at the beginning of the tree, and it always represents the most essential part of the sentence. The root node of the tree is consistently located at the very beginning of the structure.
