
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform a syntactic analysis of it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film from 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv

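# Define the URL of the IMDb film reviews page (replace with your desired film's URL).
# Note: tt0111161 is The Shawshank Redemption (1994); the task asks for a 2022 or 2023 film.
url = "https://www.imdb.com/title/tt0111161/reviews"

# Send an HTTP GET request to the IMDb film reviews page.
# A browser-like User-Agent and a timeout are added as a precaution;
# some sites reject the default client or respond slowly.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=30)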

# Check if the request was successful.
if response.status_code == 200:
    # Parse the HTML content of the page.
    soup = BeautifulSoup(response.text, 'html.parser')

    # Create a list to store the reviews data.
    reviews_data = []

    # Find the container that holds the user reviews (You may need to inspect the page source to find the correct element).
    reviews_container = soup.find('div', class_='lister-list')

    if reviews_container:
        # Loop through the user reviews and extract relevant information.
        for review in reviews_container.find_all('div', class_='text show-more__control'):
            review_text = review.get_text(strip=True)
            reviews_data.append({'Review Text': review_text})

        # Save the data to a CSV file.
        with open('imdb_reviews.csv', 'w', newline='', encoding='utf-8') as csvfile:
            fieldnames = ['Review Text']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(reviews_data)

        print("Data saved to 'imdb_reviews.csv'.")
    else:
        print("Reviews container not found on the page.")

else:
    print(f"Failed to retrieve the IMDb film reviews page. Status code: {response.status_code}")


Data saved to 'imdb_reviews.csv'.
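
The cell above requests a single page, which holds only one batch of reviews, so approaching the 10000-review target would require following the site's pagination. A minimal sketch of that loop follows. It assumes a hypothetical `fetch_page(key)` callable that returns the review texts on one page plus the key of the next page (or `None` when no page is left); how that key is exposed depends on the site's current markup and is not confirmed here.

```python
import csv
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: given a pagination key (None for the first page),
# return the review texts on that page and the key of the next page.
FetchPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def collect_reviews(fetch_page: FetchPage, limit: int) -> List[str]:
    """Follow pagination keys until `limit` reviews are collected or pages run out."""
    reviews: List[str] = []
    key: Optional[str] = None
    while len(reviews) < limit:
        page_reviews, key = fetch_page(key)
        reviews.extend(page_reviews)
        if key is None or not page_reviews:
            break  # no further pages to request
    return reviews[:limit]


def save_reviews(reviews: List[str], path: str) -> None:
    """Write one review per row under the same 'Review Text' header used above."""
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Review Text"])
        writer.writerows([text] for text in reviews)
```

Passing the page fetcher in as a parameter keeps the loop independent of any particular HTML layout, so the same function can be exercised with canned pages before it is pointed at a live site.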


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import pandas as pd
import re
import spacy

# Load the IMDb reviews data from the CSV file
df = pd.read_csv('imdb_reviews.csv')

# Load the spaCy English language model
nlp = spacy.load('en_core_web_sm')

# Define functions for text cleaning and preprocessing
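def clean_text(text):
    # Remove special characters and punctuation (this pattern also strips digits)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove numbers (kept as an explicit step; no digits remain after the pattern above)
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

def remove_stopwords(text):
    # Uses spaCy's built-in English stop-word list (token.is_stop),
    # not the list linked in step (3).
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop]
    return ' '.join(tokens)

def stem_and_lemmatize(text):
    # spaCy provides lemmatization only; no separate stemming step is applied here.
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return ' '.join(lemmas)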

# Apply the cleaning and preprocessing functions to the 'Review Text' column
df['Cleaned Text'] = df['Review Text'].apply(clean_text)
df['Cleaned Text'] = df['Cleaned Text'].apply(remove_stopwords)
df['Cleaned Text'] = df['Cleaned Text'].apply(stem_and_lemmatize)

# Save the cleaned data to the CSV file
df.to_csv('imdb_reviews_cleaned.csv', index=False)
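
The assignment names a specific stop-word list and a separate stemming step, whereas the cell above relies on spaCy's built-in stop words and its lemmatizer. Below is a minimal standard-library sketch of steps (1) to (5). It assumes a small hand-picked `STOPWORDS` sample in place of the full linked list, and a deliberately naive suffix stripper, `toy_stem`, standing in for a real stemmer such as Porter's. Both names are illustrative and do not come from the notebook.

```python
import re

# Assumption: a tiny sample standing in for the full stop-word list linked in step (3).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it", "this"}


def normalize(text: str) -> str:
    """Lowercase, then replace digits and non-letters with spaces and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # step (2): remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # step (1): remove punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()


def toy_stem(word: str) -> str:
    """Naive suffix stripping for illustration only; this is not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text: str) -> str:
    """Run normalization, stop-word removal, and toy stemming in that order."""
    tokens = [t for t in normalize(text).split() if t not in STOPWORDS]
    return " ".join(toy_stem(t) for t in tokens)


print(clean("The 2 prisoners escaped in 1994!"))  # prints: prisoner escap
```

Replacing unwanted characters with a space, rather than deleting them, keeps neighbouring words from fusing when punctuation sits between them without spaces, and the final whitespace collapse leaves exactly one space between tokens.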


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
import pandas as pd
import spacy
from collections import Counter

# Load the spaCy English language model
nlp = spacy.load('en_core_web_sm')

# Example sentence for explanation
example_sentence = "The quick brown fox jumps over the lazy dog."

# Load the cleaned text from the CSV file
df = pd.read_csv('imdb_reviews_cleaned.csv')

# Initialize counters for POS tags and named entities
pos_counter = Counter()
entity_counter = Counter()

# Process each cleaned text in the DataFrame (skip rows left empty by cleaning)
for text in df['Cleaned Text'].dropna():
    doc = nlp(text)

    # 1. Parts of Speech (POS) Tagging
    pos_tags = [token.pos_ for token in doc]
    pos_counter.update(pos_tags)

    # 3. Named Entity Recognition (NER): counts are keyed by entity text
    entities = [ent.text for ent in doc.ents]
    entity_counter.update(entities)

# Display the total counts of POS tags and named entities
print("Total Counts of POS Tags:")
for tag, count in pos_counter.items():
    print(f"{tag}: {count}")

print("\nTotal Counts of Named Entities:")
for entity, count in entity_counter.items():
    print(f"{entity}: {count}")

# 2. Constituency Parsing and Dependency Parsing
# No cleaned review ever equals the example sentence, so it is parsed directly
# here instead of being matched inside the loop.
# Note: en_core_web_sm has no constituency parser; the first listing shows
# per-token POS and dependency labels, not a phrase-structure tree.
example_doc = nlp(example_sentence)
print("\nExample Sentence:", example_sentence)
print("Token-level tags (no constituency parser in this pipeline):")
for token in example_doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Dependency: {token.dep_}")

print("\nDependency Parsing Tree:")
for token in example_doc:
    print(f"Token: {token.text}, Dependency Head: {token.head.text}, Dependency Relation: {token.dep_}")





Total Counts of POS Tags:
PROPN: 703
NOUN: 1670
VERB: 575
ADJ: 547
ADV: 184
AUX: 32
SPACE: 131
PART: 58
SCONJ: 7
INTJ: 12
ADP: 32
DET: 7
NUM: 7
PRON: 3
X: 3
SYM: 1

Total Counts of Named Entities:
darabont: 7
rita hayworth: 4
tim robbin morgan freeman: 2
robbin banker: 1
lover andy: 1
redd freeman: 1
andy prison: 1
freeman: 9
spring: 1
norton: 1
hadley william sadler: 1
diamond s roger deakin lush: 1
hank williams: 1
seven: 4
andy: 6
ohio: 1
tim robbin: 8
andy dufresne: 4
red redd play morgan freeman freeman: 1
robbin: 1
andy learn: 1
brooks halten: 1
second: 3
oscar year: 1
zero: 1
american: 1
firstly: 1
today: 2
dvd tim robbin: 1
washington: 1
andy robbin: 1
zihuatanejo: 1
andy pick: 1
carefree andy: 1
bob gunton: 3
nixon: 1
tom hank forr gump freeman: 1
sean: 1
mozart aria: 1
recent year: 1
year: 8
freeman robbin: 1
james whitmore: 3
shine brilliantly: 1
tim robbin send: 2
appreciate year: 1
freeman tim robbin: 1
stephen king: 3
imprison: 1
tim robbins: 1
morgan freeman: 5
year year
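
Question 3-1 asks for totals of nouns, verbs, adjectives, and adverbs, while the output above lists every tag separately. A small sketch of that aggregation follows, using the POS counts exactly as printed above. Folding `PROPN` into nouns and `AUX` into verbs is an assumption made here for illustration; the `GROUPS` mapping is not part of the notebook.

```python
from collections import Counter

# POS totals copied from the output above (only the tags needed for the summary).
pos_counts = Counter({
    "PROPN": 703, "NOUN": 1670, "VERB": 575,
    "ADJ": 547, "ADV": 184, "AUX": 32,
})

# Assumed grouping: proper nouns count as nouns, auxiliaries as verbs.
GROUPS = {
    "N": ("NOUN", "PROPN"),
    "V": ("VERB", "AUX"),
    "Adj": ("ADJ",),
    "Adv": ("ADV",),
}

# Sum the tag counts that belong to each requested category.
summary = {name: sum(pos_counts[tag] for tag in tags) for name, tags in GROUPS.items()}
print(summary)  # {'N': 2373, 'V': 607, 'Adj': 547, 'Adv': 184}
```

Dropping `PROPN` or `AUX` from the mapping gives the stricter reading, in which only common nouns and main verbs are counted.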

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

***Constituency parsing:***
Constituency parsing, also known as phrase structure parsing, is a natural language processing (NLP) technique used to analyse and describe the grammatical structure of a sentence. Its purpose is to identify the constituents (subparts) of a sentence and show how words group together into larger units, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). The result is a hierarchy that shows how words and phrases are organised and connected inside the sentence.

Key characteristics of constituency parsing include:

1. Hierarchical structure: the sentence is broken into nested units, from the full sentence (S) down to individual words.
2. Phrases and constituents: words are grouped into labelled phrases such as NP, VP, and PP.
3. Context-free grammar: the tree can be described by rewrite rules such as S → NP VP.
4. Parse trees: the result is a tree whose internal nodes are phrase labels and whose leaves are the words.
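
To make the characteristics above concrete, here is a hand-written constituency tree for the sentence "The quick brown fox jumps over the lazy dog." stored as nested tuples. The tree is an illustrative analysis with Penn Treebank style labels, not the output of any parser, and the helper names `leaves` and `to_brackets` are hypothetical.

```python
# Hand-written analysis (not parser output): each node is (label, child, ...),
# and a pre-terminal node is (POS tag, word).
CONSTITUENCY_TREE = (
    "S",
    ("NP", ("DT", "The"), ("JJ", "quick"), ("JJ", "brown"), ("NN", "fox")),
    ("VP",
        ("VBZ", "jumps"),
        ("PP",
            ("IN", "over"),
            ("NP", ("DT", "the"), ("JJ", "lazy"), ("NN", "dog")))),
    (".", "."),
)


def is_preterminal(node) -> bool:
    """A pre-terminal holds exactly one child, and that child is a word."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node) -> list:
    """Collect the words at the bottom of the tree, left to right."""
    if is_preterminal(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def to_brackets(node) -> str:
    """Render the tree in the usual bracketed notation."""
    if is_preterminal(node):
        return f"({node[0]} {node[1]})"
    return f"({node[0]} " + " ".join(to_brackets(c) for c in node[1:]) + ")"


print(to_brackets(CONSTITUENCY_TREE))
print(leaves(CONSTITUENCY_TREE))
```

Reading the tree top down: the sentence (S) splits into a noun phrase "The quick brown fox" and a verb phrase "jumps over the lazy dog", and the verb phrase in turn contains a prepositional phrase that wraps a second noun phrase.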






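***Dependency parsing:***
Dependency parsing is a natural language processing (NLP) technique for analysing the grammatical structure of a sentence by studying the links between its words. Unlike constituency parsing, it does not group words into phrases; it determines how each word depends on or relates to another word. The syntactic structure of the sentence is described as a set of directed, labelled grammatical relationships between words.

Key characteristics of dependency parsing include:

1. Directed dependencies: each arc points from a head word to a word that depends on it.
2. Root node: one word, usually the main verb, heads the whole sentence and depends on no other word.
3. Types of dependencies: each arc carries a label such as subject (nsubj) or adjectival modifier (amod).
4. No phrases or constituents: the tree contains only the words themselves, with no intermediate phrase nodes.
5. Simplicity and elegance: with one node per word, the representation is compact.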
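
The same example sentence, "The quick brown fox jumps over the lazy dog", can be written as a list of dependency arcs. The arcs below are a hand-written analysis using spaCy-style relation labels, not actual parser output, and the helper names `root`, `children`, and `render` are hypothetical.

```python
# Hand-written analysis (not parser output): (dependent, relation, head).
# Following spaCy's convention, the root word is listed as its own head.
ARCS = [
    ("The", "det", "fox"),
    ("quick", "amod", "fox"),
    ("brown", "amod", "fox"),
    ("fox", "nsubj", "jumps"),
    ("jumps", "ROOT", "jumps"),
    ("over", "prep", "jumps"),
    ("the", "det", "dog"),
    ("lazy", "amod", "dog"),
    ("dog", "pobj", "over"),
]


def root(arcs) -> str:
    """Return the single word marked as the root of the sentence."""
    return next(dep for dep, rel, _ in arcs if rel == "ROOT")


def children(arcs, head: str) -> list:
    """Return the words that depend directly on `head`, in sentence order."""
    return [dep for dep, rel, h in arcs if h == head and rel != "ROOT"]


def render(arcs, word: str, depth: int = 0) -> list:
    """Return an indented outline of the tree rooted at `word`, one line per word."""
    relation = next(rel for dep, rel, _ in arcs if dep == word)
    lines = ["  " * depth + f"{word} ({relation})"]
    for child in children(arcs, word):
        lines.extend(render(arcs, child, depth + 1))
    return lines


print("\n".join(render(ARCS, root(ARCS))))
```

Here "jumps" is the root; "fox" attaches to it as the subject and "over" as a preposition, while the determiners and adjectives attach to the nouns they modify. Every node is a word of the sentence, in contrast to the phrase labels of a constituency tree.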








