
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Your code here
# Option (2): collecting 1000 user reviews of the movie "Killers of the Flower Moon" from IMDb.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the IMDb reviews page for the movie "Killers of the Flower Moon"
# (no query string here, so the parameters below can be appended with a single "?")
url = "https://www.imdb.com/title/tt5537002/reviews/"

# Function to scrape user reviews
def scrape_imdb_reviews(url, num_reviews=1000):
    reviews = []
    page = 1

    while len(reviews) < num_reviews:
        page_url = f"{url}?sort=helpfulnessScore&dir=desc&ratingFilter=0&spoiler=hide&ref_=tt_ov_rt&page={page}"
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, "html.parser")
        user_reviews = soup.find_all("div", class_="text show-more__control")

        # Stop if a page yields no reviews; otherwise the loop would never end
        if not user_reviews:
            break

        for review in user_reviews:
            text = review.get_text(strip=True)
            reviews.append(text)

        page += 1

    return reviews[:num_reviews]

# Movie name
movie_name = "Killers of the Flower Moon"

# Scrape reviews
reviews = scrape_imdb_reviews(url)

# Creating a DataFrame
data = pd.DataFrame(reviews, columns=["User Reviews"])

# Saving the data to a CSV file
csv_movies = f"{movie_name}_user_reviews.csv"
data.to_csv(csv_movies, index=False)
print(f"Collected {len(reviews)} user reviews and saved to {csv_movies}")



# Read the CSV file into a DataFrame
data = pd.read_csv(csv_movies)

# Print the top five rows
print(data.head())

# Print the bottom five rows
print(data.tail())

Collected 1000 user reviews and saved to Killers of the Flower Moon_user_reviews.csv
                                        User Reviews
0  "Killers of the Flower Moon" is a Western crim...
1  Some films warrant long runtimes. Epics like '...
2  Martin Scorsese follows up his sloppy The Iris...
3  Obviously this isn't bad. It's from an amazing...
4  I'm not a die-hard Martin Scorsese fan. I have...
                                          User Reviews
995  This film probably is enjoyed by a certain aud...
996  I didn't mind the length. Most of the time wit...
997  This is what cinema is all about, a wonderful ...
998  Sorry to say that this was far too long for th...
999  I'll start off by saying that I wanted to be p...
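
Paginated scraping can return the same review more than once, for example if the same page is served twice, so it may be worth checking for exact duplicates before saving. A minimal sketch, assuming the reviews are held in a plain list of strings; `deduplicate_reviews` and the sample data are hypothetical and used only for illustration:

```python
def deduplicate_reviews(reviews):
    """Return reviews with exact duplicates removed, keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so the first copy wins
    return list(dict.fromkeys(reviews))


# Hypothetical sample data, not taken from the scraped file
sample = ["Great film", "Too long", "Great film", "Loved the cast"]
unique = deduplicate_reviews(sample)
print(len(sample), "collected,", len(unique), "unique")
```

If the unique count is much lower than the collected count, the pagination is probably returning repeated pages.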


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
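
To see what steps (1) to (4) do to a single review, here is a minimal sketch that uses only the standard library. The tiny stopword set, the sample sentence, and the name `clean_steps` are assumptions made for illustration; a real run would use a full stopword list such as NLTK's. Stemming and lemmatization (steps 5 and 6) are left out because they need an NLP library.

```python
import re

# Tiny illustrative stopword set (an assumption; a real list is much longer)
STOPWORDS = {"the", "is", "a", "of", "and", "it", "in"}


def clean_steps(text):
    """Apply steps (1), (2), (4), (3) in turn and return each intermediate result."""
    no_noise = re.sub(r"[^A-Za-z0-9\s]", "", text)   # (1) drop punctuation and symbols
    no_numbers = re.sub(r"\d+", "", no_noise)        # (2) drop digits
    lowered = no_numbers.lower()                     # (4) lowercase
    # (3) stopwords are removed after lowercasing so that "The" matches "the";
    # split() also collapses the extra spaces left behind by earlier steps
    kept = [word for word in lowered.split() if word not in STOPWORDS]
    return {
        "no_noise": no_noise,
        "no_numbers": no_numbers,
        "lowered": lowered,
        "no_stopwords": " ".join(kept),
    }


result = clean_steps("The film runs 206 minutes, and it is a MASTERPIECE!")
for step, value in result.items():
    print(f"{step}: {value}")
```

Lowercasing before stopword removal matters here: a lowercase stopword list would otherwise miss capitalized words at the start of a sentence.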

In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download NLTK data (stopwords and lemmatization data)
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file with user reviews
df = pd.read_csv(csv_movies)

# Define functions for text cleaning
def clean_text(text):
    # Removing special characters and punctuations
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])

    # Removing numbers
    text = ''.join([char for char in text if not char.isdigit()])

    # Lowercase the text
    text = text.lower()

    return text

def remove_stopwords(text):  # (3) stopwords
    stop_words = set(stopwords.words("english"))
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

def stem_text(text):   #(5) stemming
    stemmer = PorterStemmer()
    words = text.split()
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

def lemmatize_text(text):  #(6) lemmatization
    lemmatizer = WordNetLemmatizer()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply the cleaning functions to the "User Reviews" column
df['Cleaned Reviews'] = df['User Reviews'].apply(clean_text)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(remove_stopwords)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(stem_text)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(lemmatize_text)


# Update the existing CSV file with the new column
df.to_csv(csv_movies, index=False)
print(f"Cleaned data updated in {csv_movies}")


# Read the CSV file into a DataFrame
data = pd.read_csv(csv_movies)

# Print the top five rows
print(data.head())

# Print the bottom five rows
print(data.tail())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data updated in Killers of the Flower Moon_user_reviews.csv
                                        User Reviews  \
0  "Killers of the Flower Moon" is a Western crim...   
1  Some films warrant long runtimes. Epics like '...   
2  Martin Scorsese follows up his sloppy The Iris...   
3  Obviously this isn't bad. It's from an amazing...   
4  I'm not a die-hard Martin Scorsese fan. I have...   

                                     Cleaned Reviews  
0  killer flower moon western crime drama film co...  
1  film warrant long runtim epic like lawrenc ara...  
2  martin scorses follow sloppi irishman anoth ex...  
3  obvious isnt bad amaz director interest stori ...  
4  im diehard martin scorses fan deep appreci mov...  
                                          User Reviews  \
995  This film probably is enjoyed by a certain aud...   
996  I didn't mind the length. Most of the time wit...   
997  This is what cinema is all about, a wonderful ...   
998  Sorry to say that this was f

In [3]:
# downloading the csv file

from google.colab import files

files.download(csv_movies)


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.
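
The counting in parts (1) and (3) amounts to tallying labels. A minimal sketch with `collections.Counter`, assuming the tokens already carry coarse POS tags; the tagged pairs below are hypothetical and written by hand, not produced by a tagger:

```python
from collections import Counter

# Hypothetical (token, coarse POS tag) pairs, written by hand for illustration
tagged = [
    ("killer", "NOUN"), ("flower", "NOUN"), ("moon", "NOUN"),
    ("touch", "VERB"), ("often", "ADV"), ("american", "ADJ"),
    ("histori", "NOUN"), ("best", "ADJ"), ("upon", "ADP"),
]

# Only the four categories the question asks about; other tags are ignored
wanted = {"NOUN": "Noun", "VERB": "Verb", "ADJ": "Adjective", "ADV": "Adverb"}

counts = Counter(wanted[tag] for _, tag in tagged if tag in wanted)
print(dict(counts))
```

The same pattern works for entity counts by tallying entity labels instead of POS tags.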

In [4]:
# Your code here

import spacy

# Loading the cleaned data from the CSV file
df = pd.read_csv(csv_movies)

# Initialize spaCy
nlp = spacy.load("en_core_web_sm")

# Function to perform POS tagging and count POS categories
def pos_tagging(text):
    doc = nlp(text)
    pos_counts = {"Noun": 0, "Verb": 0, "Adjective": 0, "Adverb": 0}

    for token in doc:
        if token.pos_ == "NOUN":
            pos_counts["Noun"] += 1
        elif token.pos_ == "VERB":
            pos_counts["Verb"] += 1
        elif token.pos_ == "ADJ":
            pos_counts["Adjective"] += 1
        elif token.pos_ == "ADV":
            pos_counts["Adverb"] += 1

    return pos_counts


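# Function intended for constituency parsing.
# Note: it only pairs each token with its dependency label. The en_core_web_sm
# pipeline has no constituency parser, so the result is not a true constituency tree.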
def constituency_parsing(text):
    doc = nlp(text)
    constituency_tree = ""

    for sent in doc.sents:
        for token in sent:
            constituency_tree += f"({token.text} ({token.dep_} "
        constituency_tree += ")"

    return constituency_tree

# Function to perform dependency parsing
def dependency_parsing(text):
    doc = nlp(text)

    for sent in doc.sents:
        for token in sent:
            print(token.text, token.dep_, token.head.text)

# Function to perform Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = {}

    for ent in doc.ents:
        entity_type = ent.label_
        if entity_type in entities:
            entities[entity_type] += 1
        else:
            entities[entity_type] = 1

    return entities


# Example text for explanation (the first cleaned review, which spans many sentences)
example_sentence = df['Cleaned Reviews'][0]

# (1) Perform POS tagging and count POS categories
pos_counts = pos_tagging(example_sentence)
print("POS Tagging:", pos_counts)

# (2) Perform constituency parsing
constituency_tree = constituency_parsing(example_sentence)
print("Constituency Parsing Tree:", constituency_tree)

# (2) Perform dependency parsing
print("Dependency Parsing:")
dependency_parsing(example_sentence)

# (3) Perform Named Entity Recognition (NER)
entities = named_entity_recognition(example_sentence)
print("Named Entity Recognition:", entities)

POS Tagging: {'Noun': 277, 'Verb': 97, 'Adjective': 99, 'Adverb': 25}
Constituency Parsing Tree: (killer (nsubj (flower (nsubj (moon (compound (western (amod (crime (compound (drama (compound (film (compound (cowritten (nmod (direct (amod (martin (compound (scorses (compound (base (compound (nonfict (compound (book (compound (name (compound (david (compound (grann (compound (star (compound (leonardo (compound (dicaprio (compound (robert (compound (de (compound (niro (compound (lili (compound (gladston (compound (touch (dobj (upon (prep (often (advmod (overlook (ROOT (piec (compound (american (amod (histori (nmod (best (amod (way (nmod (possibl (compound (thank (compound (talent (compound (director (compound (castin (compound (earli (compound (discoveri (amod (oil (compound (land (nsubj (belong (relcl (nativ (nmod (american (amod (osag (amod (nation (nsubj (turn (ccomp (tribe (nmod (richest (amod (peopl (compound (world (nmod (sudden (amod (acquisit (compound (wealth (compound (attract 
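
The bracketed string printed above is built from dependency labels, so it is not a phrase-structure tree. To illustrate the difference between the two representations, here is a minimal sketch with a hand-annotated sentence. The sentence, the tree, the arcs, and the helper `bracket` are all written by hand for illustration and are not produced by a parser.

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are words
constituency = (
    "S",
    ("NP", ("DT", "the"), ("NN", "film")),
    ("VP", ("VBD", "won"), ("NP", ("NNS", "awards"))),
)

# Dependency parse as (dependent, relation, head) arcs; the root points to itself
dependencies = [
    ("the", "det", "film"),
    ("film", "nsubj", "won"),
    ("won", "ROOT", "won"),
    ("awards", "dobj", "won"),
]


def bracket(node):
    """Render a nested-tuple tree as a balanced bracketed string."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracket(child) for child in children) + ")"


print(bracket(constituency))
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

The constituency tree groups words into nested phrases (NP, VP) under a sentence node, while the dependency arcs link each word directly to its head word with a labeled relation.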

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also give your opinion on the time provided to complete the assignment.

This assignment was great; it helped me recall the topics we learned in class, and working through them in practice made them much easier to remember. At first I planned to collect reviews from Amazon, but I ended up choosing movie reviews instead. I did not face many challenges in this assignment, though I did have to work out how to handle CSV files, including how to store and download them. I really enjoyed the text cleaning part, where I inspected the results after each step to see its effect. The time provided for the assignment was sufficient. Students taking more than 9 credits may have faced some time pressure, but overall it was sufficient.
