<a href="https://colab.research.google.com/github/madhan444-s/Madhan_INFO5731_Spring2024/blob/main/Dadi_Madhan_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [8]:
import csv
import requests
from bs4 import BeautifulSoup

# Function to get movie reviews from IMDb
def getMovieReviews(movieId, numReviews=1000):
    baseUrl = f'https://www.imdb.com/title/{movieId}/reviews'

    reviewsData = []
    pageNumber = 1

    # Loop until the desired number of reviews is collected
    while len(reviewsData) < numReviews:
        url = f'{baseUrl}?spoiler=hide&sort=helpfulnessScore&dir=desc&ratingFilter=0&page={pageNumber}'
        response = requests.get(url)        # Send a request to the IMDb website

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            reviewElements = soup.find_all('div', class_='lister-item-content')

            if not reviewElements:
                break

            # Loop through each review element and extract author and review text
            for review in reviewElements:
                author = review.find('span', class_='display-name-link').text.strip()
                review = review.find('div', class_='text show-more__control').text.strip()

                reviewsData.append({
                    'Author': author,
                    'Review': review,
                })

            pageNumber += 1
        else:
            print(f"Failed to fetch reviews. Status code: {response.status_code}")
            break

    return reviewsData[:numReviews]

# Function to save reviews data to a CSV file
def saveReviewsDataToCSV(reviewsData, csvFileName='movie_reviews.csv'):
    with open(csvFileName, mode='w', encoding='utf-8', newline='') as file:
        fieldnames = ['Author', 'Review']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()

        for review_data in reviewsData:
            writer.writerow({
                'Author': review_data['Author'],
                'Review': review_data['Review'],
            })

if __name__ == "__main__":
    movieId = "tt15398776"  # IMDb ID for "Oppenheimer" - Change this to the IMDb ID of your movie
    numReviewsNeeded = 1000

    reviewsData = getMovieReviews(movieId, numReviewsNeeded)

    if reviewsData:
        saveReviewsDataToCSV(reviewsData)
        print(f"{numReviewsNeeded} reviews collected and saved to 'movie_reviews.csv'")
    else:
        print("Unable to collect reviews.")


1000 reviews collected and saved to 'movie_reviews.csv'


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [9]:
# Write code for each of the sub parts with proper comments.
# Write code for each of the sub parts with proper comments.
# !pip install nltk
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')


def cleanText(text):
    # Remove noise (special characters and punctuations)
    text = re.sub(r'[^\w\s]', '', text)

    # Remove numbers
    text = re.sub(r'\d', '', text)

    # Lowercase all texts
    text = text.lower()

    return text

def removeStopWords(text):
    stopWords = set(stopwords.words('english'))
    words = nltk.word_tokenize(text)
    filteredText = ' '.join([word for word in words if word.lower() not in stopWords])
    return filteredText

def applyStemming(text):
    ps = PorterStemmer()
    words = nltk.word_tokenize(text)
    stemmedText = ' '.join([ps.stem(word) for word in words])
    return stemmedText

def applyLemmatization(text):
    lemmatizer = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    lemmatizedText = ' '.join([lemmatizer.lemmatize(word) for word in words])
    return lemmatizedText

def cleanAndSaveCsv(inputCsv, outputCsv):
    with open(inputCsv, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        rows = list(reader)

    for row in rows:
        # Clean the 'Review' column
        cleanedText = row['Review']
        cleanedText = cleanText(cleanedText)
        cleanedText = removeStopWords(cleanedText)
        cleanedText = applyStemming(cleanedText)
        cleanedText = applyLemmatization(cleanedText)
        row['Cleaned Text'] = cleanedText

    fieldnames = reader.fieldnames + ['Cleaned Text']

    with open(outputCsv, 'w', encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    inputCsv = 'movie_reviews.csv'
    outputCsv = 'movie_reviews.csv'

    cleanAndSaveCsv(inputCsv, outputCsv)
    print(f"Cleaned data saved to '{outputCsv}'")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Cleaned data saved to 'movie_reviews.csv'


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [11]:
# Your code here
import csv
import spacy

# Download the English model for spaCy
# Make sure to run this command before executing the script for the first time:
# python -m spacy download en_core_web_sm

# Load spaCy language model
nlp = spacy.load("en_core_web_sm")

def posTagging(text):
    doc = nlp(text)
    posTags = [(token.text, token.pos_) for token in doc]
    return posTags

def constituencyParsing(text):
    doc = nlp(text)
    nounChunks = [chunk.text for chunk in doc.noun_chunks]
    return nounChunks

def dependencyParsing(text):
    doc = nlp(text)
    dependencyTree = [(token.text, token.dep_, token.head.text) for token in doc]
    return dependencyTree

def namedEntityRecognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

if __name__ == "__main__":
    # Load the cleaned text from the CSV file
    with open('movie_reviews.csv', 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        cleanedText = ' '.join([row['Cleaned Text'] for row in reader])

    # Split the text into chunks of 100,000 characters
    chunkSize = 100000
    chunks = [cleanedText[i:i+chunkSize] for i in range(0, len(cleanedText), chunkSize)]

    for idx, chunk in enumerate(chunks, start=1):
        print(f"Processing chunk {idx}/{len(chunks)}...")

        # (1) Parts of Speech (POS) Tagging
        posTags = posTagging(chunk)
        nounCount = len([word for word, pos in posTags if pos.startswith('N')])
        verbCount = len([word for word, pos in posTags if pos.startswith('V')])
        adjCount = len([word for word, pos in posTags if pos.startswith('J')])
        advCount = len([word for word, pos in posTags if pos.startswith('R')])

        print("POS Tagging Results:")
        print(f"Total Nouns: {nounCount}")
        print(f"Total Verbs: {verbCount}")
        print(f"Total Adjectives: {adjCount}")
        print(f"Total Adverbs: {advCount}\n")

        # (2) Constituency Parsing
        constituencyChunks = constituencyParsing(chunk)
        print("Constituency Parsing Tree:")
        print(constituencyChunks[0])  # Print only the first sentence for example
        print("\nExplanation:")
        print("Constituency parsing represents the sentence structure in terms of noun chunks. Each chunk is a syntactic unit.")

        # (3) Dependency Parsing
        dependencyTree = dependencyParsing(chunk)
        print("\nDependency Parsing Tree:")
        print(dependencyTree[:10])  # Print only the first 10 tokens for example
        print("\nExplanation:")
        print("Dependency parsing represents the grammatical structure of a sentence in terms of the relationships between words. Each tuple (word, dependency label, head word) describes a grammatical relationship.")

        # (4) Named Entity Recognition
        namedEntities = namedEntityRecognition(chunk)
        print("\nNamed Entity Recognition:")
        entityCounts = {}
        for entity, label in namedEntities:
            #print(f"{entity} - {label}")
            entityCounts[label] = entityCounts.get(label, 0) + 1

        print("\nEntity Counts:")
        for label, count in entityCounts.items():
            print(f"{label}: {count}")

        print("\n" + "-"*50)




Processing chunk 1/12...
POS Tagging Results:
Total Nouns: 5104
Total Verbs: 2029
Total Adjectives: 0
Total Adverbs: 0

Constituency Parsing Tree:
you

Explanation:
Constituency parsing represents the sentence structure in terms of noun chunks. Each chunk is a syntactic unit.

Dependency Parsing Tree:
[('you', 'nsubj', 'wit'), ('ll', 'aux', 'wit'), ('wit', 'nsubj', 'absolut'), ('brain', 'compound', 'switch'), ('fulli', 'compound', 'switch'), ('switch', 'compound', 'watch'), ('watch', 'dobj', 'wit'), ('oppenheim', 'nsubj', 'easili'), ('could', 'aux', 'easili'), ('easili', 'ccomp', 'wit')]

Explanation:
Dependency parsing represents the grammatical structure of a sentence in terms of the relationships between words. Each tuple (word, dependency label, head word) describes a grammatical relationship.

Named Entity Recognition:

Entity Counts:
PERSON: 476
ORG: 176
NORP: 102
GPE: 158
ORDINAL: 73
FAC: 10
LOC: 4
DATE: 57
EVENT: 11
CARDINAL: 154
WORK_OF_ART: 8
TIME: 44
PRODUCT: 13
QUANTITY: 3


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below

# Doing this assignment was kind of tough but interesting.
# I learned how to get text from the internet, clean it up, and understand what the words mean.
# Getting the text from websites or apps had some tricky parts, like dealing with changes on websites or limits on APIs.
# Cleaning the text was like removing extra stuff and making it simpler to understand.
# Figuring out what each word does in a sentence was another challenge, but it helped me see how words work together.
# The assignment was like solving real-world problems with data from the internet, which made it cool.
# I used special tools like BeautifulSoup and NLTK to make things easier.
# The time they gave us to finish the assignment was okay if you know a bit about getting data from the web and some basics about words.
# If you're new to this, some parts might take a bit longer. Overall, it was a good learning experience about words and data on the internet.