<a href="https://colab.research.google.com/github/msrahulvarma/RahulVarma_INFO5731_Fall2023/blob/main/INFO5731_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 reddits by using a hashtag (you can use any hashtag) from Reddits.


In [8]:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def get_imdb_reviews(url, num_pages=1000):
    reviews = []

    for page in range(1, num_pages + 1):
        response = requests.get(url + '?start=' + str((page-1)*50), headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        review_divs = soup.find_all('div', {'class': 'text show-more__control'})

        for review in review_divs:
            reviews.append(review.text.strip())

        # Break if there are no more reviews
        if not review_divs:
            break

    return reviews

# Example usage:
url = 'https://www.imdb.com/title/tt11858890/reviews/?ref_=tt_ql_2'
reviews = get_imdb_reviews(url, num_pages=200)  # Assuming an average of 50 reviews per page, 200 pages will give 10,000 reviews


# Print the first 500 reviews
for i, review in enumerate(reviews[:10000], 1):
    print(f"Review {i}:")
    print(review)
    print("-" * 50)


Review 1:
This is a lesson to the movie industry on how to use a budget. 80 million dollars was used splendidly. The cinematography was amazing, (Not terribly surprising because Rogue One) acting was great, and the story was decent.It wasn't without problems though. The story moves at an increasing pace and at some points you lose track of what's happening. Suspension of disbelief will be needed in some moments.The theme of the story was to make AI to be more than just robots. I think they succeeded there, but at the expense of the humans. Most of the humans in the story ended up being one faced - except for Joshua.The dynamic between Joshua and Alfie was by far the best part of the movie. The acting was great between the two.It was a good movie. Not great by any means, but I'm all for supporting a movie that is trying something new.Overall, I think Gareth Edwards should be given some more projects. AND filmmakers everywhere should learn how a budget should be used.
-------------------

In [9]:
import requests
from bs4 import BeautifulSoup
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def get_imdb_reviews(url, num_pages=200):  # Adjusted for 200 pages to attempt 10,000 reviews
    reviews = []

    for page in range(1, num_pages + 1):
        response = requests.get(url + '?start=' + str((page-1)*50), headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        review_divs = soup.find_all('div', {'class': 'text show-more__control'})

        for review in review_divs:
            reviews.append(review.text.strip())

        # Break if there are no more reviews
        if not review_divs:
            break

    return reviews

def save_to_csv(reviews, filename="reviews.csv"):
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Review Number", "Review Text"])
        for i, review in enumerate(reviews, 1):
            writer.writerow([i, review])

# Example usage:
url = 'https://www.imdb.com/title/tt11858890/reviews/?ref_=tt_ql_2'
reviews = get_imdb_reviews(url)

# Save the first 500 reviews to a CSV file
save_to_csv(reviews[:10000])

print("10000 reviews saved to reviews.csv")


10000 reviews saved to reviews.csv


# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [11]:
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download stopwords and wordnet data
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load stopwords
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # (1) Remove noise, such as special characters and punctuations.
    text = re.sub(r'[^\w\s]', '', text)

    # (2) Remove numbers.
    text = re.sub(r'\d+', '', text)

    # (4) Lowercase all texts.
    text = text.lower()

    # Tokenize the text for further processing
    words = text.split()

    # (3) Remove stopwords.
    words = [word for word in words if word not in stop_words]

    # (5) Stemming.
    words = [stemmer.stem(word) for word in words]

    # (6) Lemmatization.
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

def clean_reviews_in_csv(input_filename, output_filename):
    with open(input_filename, 'r', encoding='utf-8') as infile, open(output_filename, 'w', newline='', encoding='utf-8') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write header
        header = next(reader)
        writer.writerow(header + ['Cleaned Review'])

        for row in reader:
            review = row[1]  # Assuming review text is in the second column
            cleaned_review = clean_text(review)
            writer.writerow(row + [cleaned_review])

    print(f"Cleaned reviews saved to {output_filename}")

clean_reviews_in_csv('reviews.csv', 'cleaned_reviews.csv')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned reviews saved to cleaned_reviews.csv


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [31]:
import csv
import spacy
from collections import defaultdict

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def pos_analysis(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header

        pos_counts = defaultdict(int)

        for row in reader:
            cleaned_review = row[2]
            doc = nlp(cleaned_review)

            # (1) Parts of Speech (POS) Tagging
            for token in doc:
                if token.pos_ in ["NOUN", "VERB", "ADJ", "ADV"]:
                    pos_counts[token.pos_] += 1

        # Print POS counts
        print("Parts of Speech Counts:")
        for pos, count in pos_counts.items():
            print(f"{pos}: {count}")

pos_analysis('cleaned_reviews.csv')


Parts of Speech Counts:
ADJ: 100600
NOUN: 280400
ADV: 27400
VERB: 125400


In [33]:
import os
import pandas as pd
from nltk.parse.stanford import StanfordParser
import spacy

# Install Java in Colab
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

# Download the Stanford Parser files and unzip them
!wget https://nlp.stanford.edu/software/stanford-parser-3.9.2-models.jar
!wget https://nlp.stanford.edu/software/stanford-parser-full-2018-10-17.zip
!unzip stanford-parser-full-2018-10-17.zip

# Set the CLASSPATH environment variable
os.environ['CLASSPATH'] = "./stanford-parser-full-2018-10-17/stanford-parser.jar:./stanford-parser-3.9.2-models.jar"

# Load spaCy's English NER model
nlp = spacy.load("en_core_web_sm")

# Read the cleaned_reviews.csv
df = pd.read_csv('cleaned_reviews.csv')

# Ensure the 'clean_text' column exists
if 'Cleaned Review' not in df.columns:
    raise ValueError("The 'clean_text' column is not present in the CSV.")

# Initialize StanfordParser for Constituency Parsing
constituency_parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

# Process each review in the DataFrame
for review in df['Cleaned Review']:
    sentences = review.split('.')  # Splitting by period to get individual sentences
    for sentence in sentences:
        # Dependency Parsing using spaCy
        doc = nlp(sentence)
        print("Dependency Parsing for sentence:", sentence)
        for token in doc:
            print(f"{token.text} <--{token.dep_}-- {token.head.text}")

        # Constituency Parsing using Stanford Parser
        print("\nConstituency Parsing for sentence:", sentence)
        tree = list(constituency_parser.raw_parse(sentence))
        tree[0].pretty_print()

        # Break after one sentence for demonstration
        break
    break


--2023-10-18 17:15:18--  https://nlp.stanford.edu/software/stanford-parser-3.9.2-models.jar
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://downloads.cs.stanford.edu/nlp/software/stanford-parser-3.9.2-models.jar [following]
--2023-10-18 17:15:19--  https://downloads.cs.stanford.edu/nlp/software/stanford-parser-3.9.2-models.jar
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-10-18 17:15:19 ERROR 404: Not Found.

--2023-10-18 17:15:19--  https://nlp.stanford.edu/software/stanford-parser-full-2018-10-17.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443..

Please use [91mnltk.parse.corenlp.CoreNLPParser[0m instead.
  constituency_parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")


Dependency Parsing for sentence: lesson movi industri use budget million dollar use splendidli cinematographi amaz terribl surpris rogu one act great stori decentit wasnt without problem though stori move increas pace point lose track what happen suspens disbelief need momentsth theme stori make ai robot think succeed expens human human stori end one face except joshuath dynam joshua alfi far best part movi act great twoit good movi great mean im support movi tri someth newoveral think gareth edward given project filmmak everywher learn budget use
lesson <--compound-- movi
movi <--nsubj-- cinematographi
industri <--amod-- dollar
use <--compound-- dollar
budget <--compound-- million
million <--nummod-- dollar
dollar <--compound-- use
use <--compound-- cinematographi
splendidli <--compound-- cinematographi
cinematographi <--nsubj-- was
amaz <--advmod-- cinematographi
terribl <--compound-- rogu
surpris <--compound-- rogu
rogu <--pobj-- amaz
one <--nummod-- decentit
act <--compound-- decen

In [3]:
import spacy
import pandas as pd
from collections import defaultdict

# Load the English NER model
nlp = spacy.load("en_core_web_sm")

# Read the cleaned_reviews.csv
df = pd.read_csv('cleaned_reviews.csv')

# Ensure the 'clean_text' column exists
if 'Cleaned Review' not in df.columns:
    raise ValueError("The 'clean_text' column is not present in the CSV.")

# Dictionary to store counts of each entity
entity_counts = defaultdict(int)

# Process each review in the DataFrame
for review in df['Cleaned Review']:
    doc = nlp(str(review))  # Convert to string in case there are NaN values
    for ent in doc.ents:
        entity_counts[ent.label_] += 1

# Print the counts of each entity
for entity, count in entity_counts.items():
    print(f"{entity}: {count}")


ORG: 6600
CARDINAL: 10800
PERSON: 15800
TIME: 400
DATE: 1600
NORP: 3000
GPE: 5400
ORDINAL: 2800
FAC: 400
LOC: 800


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Dependency Parsing:
Think of dependency parsing like making a family tree. Each word in the sentence is like a family member, and we draw lines (like family connections) to show how they're related. For example, "jumps" is like the main person in the family, and we draw lines to show that it connects to other words like "fox" and "dog." These lines tell us who does what to whom in the sentence.

Constituency Parsing:
Constituency parsing is more like putting together a puzzle. We break the sentence into smaller pieces, like "The quick brown fox" and "jumps over the lazy dog." These pieces are like puzzle parts. We then break those pieces into even smaller ones until we have all the individual words. It shows us how the words fit together to make sentences.

In simple terms, dependency parsing helps us see who's doing what in a sentence, while constituency parsing helps us understand how words and phrases come together to create sentences. Both are like tools to understand sentences better.