# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Import necessary libraries
# For making HTTP requests to IMDb
import requests
# For parsing HTML content
from bs4 import BeautifulSoup
# For data manipulation and saving to CSV
import pandas as pd

# Define a function to get reviews from IMDb
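def get_reviews(movie_id, max_reviews):
    reviews = []
    # NOTE: the selectors and the '_ajax' endpoint below assume IMDb's legacy review
    # layout ('lister-item-content' blocks plus a 'load-more-data' div that carries the
    # key of the next page). They may need updating if the page markup differs.
    base_url = f"https://www.imdb.com/title/{movie_id}/reviews"
    headers = {'User-Agent': 'Mozilla/5.0'}  # Some sites reject requests without a User-Agent
    params = {
        'spoiler': 'hide',
        'sort': 'helpfulnessScore',
        'dir': 'desc',
        'ratingFilter': '0',
    }
    url = base_url  # The first request loads the regular reviews page

    # Continue fetching reviews until we reach the desired number
    while len(reviews) < max_reviews:
        response = requests.get(url, params=params, headers=headers, timeout=30)
        if response.status_code != 200:
            break  # Stop if the request failed or was blocked
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all review blocks on the page
        review_blocks = soup.find_all('div', class_='lister-item-content')
        if not review_blocks:
            break  # Break if no more reviews are found or if blocked

        # Extract the title and review text from each block
        for block in review_blocks:
            title_tag = block.find('a', class_='title')
            text_tag = block.select_one('div.text.show-more__control')
            if title_tag is None or text_tag is None:
                continue  # Skip blocks that do not match the expected markup
            reviews.append({'title': title_tag.text.strip(), 'review': text_tag.text.strip()})
            if len(reviews) >= max_reviews:
                break

        # Advance to the next page; without a new key the same page would be fetched again
        load_more = soup.find('div', class_='load-more-data')
        next_key = load_more.get('data-key') if load_more else None
        if not next_key:
            break  # No further pages
        url = f"{base_url}/_ajax"
        params['paginationKey'] = next_key

    return reviews[:max_reviews] # Return the collected reviews, up to the specified maximum number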

def save_reviews_to_csv(reviews, filename):
    df = pd.DataFrame(reviews)
    df.to_csv(filename, index=False)

# Example movie ID for POOR THINGS movie 
movie_id = 'tt14230458'
max_reviews = 1000 # Set the desired number of reviews to fetch
# Call the get_reviews function with the movie ID and number of reviews
reviews = get_reviews(movie_id, max_reviews)
# Save the fetched reviews to a CSV file
save_reviews_to_csv(reviews, 'imdb_User_reviews.csv')
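
A scraper that pages through results can return the same review more than once, for example when a page is requested twice. The following is a minimal sketch of order-preserving de-duplication that could run before saving. It assumes each review is a dict with the `title` and `review` keys used above; `dedupe_reviews` is a hypothetical helper, not part of the original cell.

```python
def dedupe_reviews(reviews):
    """Drop exact (title, review) duplicates, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in reviews:
        key = (item['title'], item['review'])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Stand-in data for illustration only
sample = [
    {'title': 'Great', 'review': 'Loved it.'},
    {'title': 'Great', 'review': 'Loved it.'},
    {'title': 'Odd', 'review': 'Strange but fun.'},
]
print(len(dedupe_reviews(sample)))  # 2
```

Comparing the length of the list before and after this step gives a quick check on whether pagination actually advanced.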



# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# Ensure that necessary NLTK resources are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the CSV file
df = pd.read_csv('imdb_User_reviews.csv')  # Update the file path

# Build the stopword set from NLTK's English list (downloaded above)
stop_words = set(stopwords.words('english'))

# Initialize Stemmer and Lemmatizer for text normalization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # 1- Remove special characters and punctuations
    # This regex keeps only word characters (letters, digits, underscore) and whitespace, removing punctuation.
    # Apostrophes are removed too, so "haven't" becomes "havent".
    text = re.sub(r'[^\w\s]', '', text)
    
    # 2- Remove numbers
    # This uses a regex to remove digits from the text.
    text = re.sub(r'\d+', '', text)
    
    # 4- Lowercase all texts
    # Done before stopword removal because NLTK's stopword list is lowercase; it also ensures that the same word in different cases is treated as the same word.
    text = text.lower()
    
    # 3- Remove stopwords
    # Stopwords are common words that are usually removed in the preprocessing phase because they contribute little to the overall meaning of the text.
    text = ' '.join([word for word in text.split() if word not in stop_words])
    
    # 5- Stemming
    # Stemming reduces words to their word stem or root form (e.g., "running" -> "run").
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    
    # 6- Lemmatization
    # Lemmatization reduces words to their dictionary form. Without a POS argument, WordNetLemmatizer treats each word as a noun (e.g., "films" -> "film").
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    
    return text

# The review text is in the 'review' column written by the Question 1 cell
df['cleaned_review'] = df['review'].apply(clean_text)

# Save the DataFrame to a new CSV file
df.to_csv('cleaned_imdb_User_reviews.csv', index=False)



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/narendranathreddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/narendranathreddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/narendranathreddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
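
One ordering detail in the cleaning steps is worth illustrating: NLTK's English stopword list contains contractions such as "haven't" and "you've", so stripping apostrophes before stopword removal leaves tokens like "havent" that no longer match the list. The sketch below compares the two orderings with the standard library only. `STOP_WORDS` is a tiny stand-in for the NLTK list, and both function names are hypothetical.

```python
import re

# Tiny stand-in stopword list (assumption); the cell above uses NLTK's English list
STOP_WORDS = {"if", "i", "you", "the", "a", "have", "haven't", "you've"}


def punctuation_first(text):
    """Order used in the cell above: strip punctuation, lowercase, then drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    return [w for w in text.split() if w not in STOP_WORDS]


def stopwords_first(text):
    """Alternative: drop stopwords while apostrophes are intact, then strip punctuation."""
    tokens = [w.strip('.,!?;:"()') for w in text.lower().split()]
    kept = [w for w in tokens if w not in STOP_WORDS]
    cleaned = (re.sub(r"[^\w\s]", "", w) for w in kept)
    return [w for w in cleaned if w]


sentence = "If you've seen the film, I haven't!"
print(punctuation_first(sentence))  # ['youve', 'seen', 'film', 'havent']
print(stopwords_first(sentence))    # ['seen', 'film']
```

Either ordering satisfies the listed steps; the difference is only in whether contracted stopwords survive into the cleaned column.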


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [3]:
# Stanza is the library imported by the analysis cell below
!pip install spacy stanza



In [4]:
!python -m spacy download en_core_web_sm

In [2]:
# Import necessary libraries
import stanza
from collections import Counter
import pandas as pd

# Load the cleaned text data
df = pd.read_csv('cleaned_imdb_User_reviews.csv')  # Update the file path accordingly

# Initialize the Stanza NLP pipeline for English language
# specifying the processors we want to use
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,constituency,depparse,ner')


class PosTagCounter:
    """Counts Universal POS tags of a given type in a Stanza document."""

    def __init__(self, stanza_doc):
        """Store the Stanza document whose words will be counted."""
        self.stanza_doc = stanza_doc

    def get_pos_tag_count(self, pos_tag):
        """Return how many words in the document carry the given POS tag."""
        return sum(
            1
            for sentence in self.stanza_doc.sentences
            for word in sentence.words
            if word.pos == pos_tag
        )

# Extract the first review from the dataset for demonstration
example_text = df['cleaned_review'].iloc[0]  # Taking the first cleaned text for demonstration

# Process the text with Stanza to create a document object
doc = nlp(example_text)

# POS Tagging and Counting
pos_counter = PosTagCounter(doc)
nouns_count = pos_counter.get_pos_tag_count('NOUN')
verbs_count = pos_counter.get_pos_tag_count('VERB')
adjectives_count = pos_counter.get_pos_tag_count('ADJ')
adverbs_count = pos_counter.get_pos_tag_count('ADV')

# Print counts of different parts of speech
print(f"Nouns: {nouns_count}, Verbs: {verbs_count}, Adjectives: {adjectives_count}, Adverbs: {adverbs_count}")

# Constituency Parsing
# Print the constituency parse tree of the first sentence in the document
print("Constituency Parsing Tree of the first sentence:")
print(doc.sentences[0].constituency)

# Dependency Parsing
# Print the dependency parse tree of the first sentence in the document
print("\nDependency Parsing Tree of the first sentence:")
for dep_edge in doc.sentences[0].dependencies:
    print(dep_edge)

# Named Entity Recognition
entity_types = []

# Loop through each review in the dataset, process it, and collect named entities
for review in df['cleaned_review']:
    doc = nlp(review)
    entity_types.extend([ent.type for ent in doc.ents])

# Count and print the frequency of different types of named entities in the dataset
entity_counter = Counter(entity_types)

print("\nNamed Entities and their counts:")
print(entity_counter)



2024-02-26 15:13:41 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2024-02-26 15:13:41 INFO: Downloaded file to /Users/narendranathreddy/stanza_resources/resources.json
2024-02-26 15:13:43 INFO: Loading these models for language: en (English):
| Processor    | Package                   |
--------------------------------------------
| tokenize     | combined                  |
| mwt          | combined                  |
| pos          | combined_charlm           |
| lemma        | combined_nocharlm         |
| constituency | ptb3-revised_charlm       |
| depparse     | combined_charlm           |
| ner          | ontonotes-ww-multi_charlm |

2024-02-26 15:13:43 INFO: Using device: cpu
2024-02-26 15:13:43 INFO: Loading: tokenize
2024-02-26 15:13:43 INFO: Loading: mwt
2024-02-26 15:13:43 INFO: Loading: pos
2024-02-26 15:13:43 INFO: Loading: lemma
2024-02-26 15:13:43 INFO: Loading: constituency
2024-02-26 15:13:43 INFO: Loading: depparse
2024-02-26 15:13:43 INFO: Loading: ner
2024-02-26 15:13:44 INFO: Done loading processors!


Nouns: 42, Verbs: 11, Adjectives: 13, Adverbs: 9
Constituency Parsing Tree of the first sentence:
(ROOT (NP (NP (ADJP (FW youv) (VBN seen)) (FW yorgo) (FW lathimo) (NN film) (JJ poor) (NN thing)) (SBAR (S (NP (NN everyth)) (VP (MD would) (VP (VB expect) (NP (NN hope)) (VP (ADJP (RB havent) (VBN seen)) (NP (NP (NN film) (NN buckl)) (SBAR (WDT that) (S (NP (NP (ADJP (JJ ill) (FW saypoor)) (NN thing)) (ADJP (FW thoroughli) (FW outrag) (FW romp)) (FW trippi) (FW disturb) (NP (JJ brutal) (FW funni) (FW could))) (FW summar) (NN film) (FW essenti) (FW feminist) (FW spin) (FW clockwork) (FW orang) (FW filmsnovel) (FW explor) (FW concept) (NML (FW freewil) (FW oppress) (FW societi)) (FW alex) (FW struggl) (FW repugn) (NML (FW urg) (FW ultraviol) (FW bella) (FW struggl) (FW normal) (FW primal) (FW urg) (FW sexual) (FW liber) (FW independ) (FW woman) (FW expect) (NP (FW dystopian) (FW univers)) (S (NP (ADJP (FW bella) (FW opposit) (FW much)) (FW chagrin) (FW male) (FW caretak)) (FW despit) (NP (N


Named Entities and their counts:
Counter({'ORDINAL': 360, 'CARDINAL': 320, 'DATE': 120, 'PERSON': 40, 'TIME': 40})
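
The POS counts printed above cover only the first review, while the question asks for totals over the text. A minimal sketch of aggregating the four tag counts across many documents follows. It works on plain lists of Universal POS tags so it runs without Stanza; `total_pos_counts` and the sample data are hypothetical, and the comment shows how the tag lists could be built from the pipeline above.

```python
from collections import Counter

TARGET_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def total_pos_counts(tag_sequences):
    """Sum the counts of the target UPOS tags over an iterable of tag lists."""
    totals = Counter()
    for tags in tag_sequences:
        totals.update(tag for tag in tags if tag in TARGET_TAGS)
    return {tag: totals[tag] for tag in TARGET_TAGS}


# Stand-in data (assumption). With Stanza, each inner list could be built as
# [word.upos for sent in nlp(review).sentences for word in sent.words]
sample_tags = [
    ["ADJ", "NOUN", "VERB", "ADV"],
    ["NOUN", "NOUN", "PUNCT", "VERB"],
]
print(total_pos_counts(sample_tags))  # {'NOUN': 3, 'VERB': 2, 'ADJ': 1, 'ADV': 1}
```

Because the NER loop already runs the pipeline on every review, the tag lists could be collected in that same loop rather than processing the data twice.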


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
'''
The assignment presented a mix of challenges and enjoyable moments, especially when tackling 
questions 1 and 2. Question 3, however, was a bit different from the other two, presenting more 
difficulty and requiring a distinct approach. The time allotted was sufficient to thoroughly engage 
with and complete the assignment, including the more challenging aspects.

'''