
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract_reviews(imdb_url, max_pages, output_file, movie_title="RRR"):
    """Scrape user reviews page by page and save them to a CSV file."""
    reviews = []

    for npage in range(1, max_pages + 1):
        # Assumes the site honors a `page` query parameter; if it does not,
        # every request returns the same reviews and the CSV holds duplicates
        page_url = f"{imdb_url}&page={npage}"

        # A timeout keeps the loop from hanging on a stalled connection
        response = requests.get(page_url, timeout=30)
        # Check if the request was successful (status code 200)
        if response.status_code != 200:
            print(f"Failed to retrieve page {npage}. Exiting.")
            break
        print(f"Scraping reviews from page {npage}...")

        reviews_on_page = BeautifulSoup(response.text, 'html.parser')
        review_containers = reviews_on_page.find_all('div', class_='lister-item-content')

        for review in review_containers:
            text_tag = review.find('div', class_='text')
            user_tag = review.find('span', class_='display-name-link')
            date_tag = review.find('span', class_='review-date')
            # Skip containers that are missing any of the expected fields
            if not (text_tag and user_tag and date_tag):
                continue

            # get_text() joins text across <br> tags without a space, so the
            # last word of one paragraph can fuse with the first of the next
            review_text = text_tag.get_text()
            username = user_tag.get_text()
            review_date = date_tag.get_text()

            reviews.append([movie_title, username, review_date, review_text])

    if reviews:
        # Create a DataFrame from the collected reviews
        reviews_df = pd.DataFrame(reviews, columns=['Movie', 'User', 'Date', 'Review'])

        reviews_df.to_csv(output_file, index=False, encoding='utf-8')
        print(f"{len(reviews)} reviews scraped and saved to '{output_file}'.")
    else:
        print("No reviews found.")

movie_url = 'https://www.imdb.com/title/tt8178634/reviews?ref_=tt_urv' # IMDb URL for the movie reviews
max_pages_to_scrape = 50  # Maximum number of pages to scrape
output_file = 'movie_reviews.csv' # Output CSV file name

extract_reviews(movie_url, max_pages_to_scrape, output_file)

Scraping reviews from page 1...
Scraping reviews from page 2...
Scraping reviews from page 3...
Scraping reviews from page 4...
Scraping reviews from page 5...
Scraping reviews from page 6...
Scraping reviews from page 7...
Scraping reviews from page 8...
Scraping reviews from page 9...
Scraping reviews from page 10...
Scraping reviews from page 11...
Scraping reviews from page 12...
Scraping reviews from page 13...
Scraping reviews from page 14...
Scraping reviews from page 15...
Scraping reviews from page 16...
Scraping reviews from page 17...
Scraping reviews from page 18...
Scraping reviews from page 19...
Scraping reviews from page 20...
Scraping reviews from page 21...
Scraping reviews from page 22...
Scraping reviews from page 23...
Scraping reviews from page 24...
Scraping reviews from page 25...
Scraping reviews from page 26...
Scraping reviews from page 27...
Scraping reviews from page 28...
Scraping reviews from page 29...
Scraping reviews from page 30...
Scraping reviews fr
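
A page-numbered loop like the one above silently collects duplicates if the server ignores the page number. A minimal sketch of one guard is shown below: stop as soon as a page contributes no unseen reviews. The `fetch_page` callable and the canned review strings are hypothetical stand-ins for illustration, not part of the notebook, and no live requests are made.

```python
from typing import Callable, List


def collect_unique_reviews(fetch_page: Callable[[int], List[str]], max_pages: int) -> List[str]:
    """Collect reviews page by page, stopping when a page adds nothing new.

    `fetch_page` is a hypothetical callable that returns the review texts
    found on the given 1-based page number.
    """
    seen = set()
    unique_reviews = []
    for page_number in range(1, max_pages + 1):
        new_on_page = 0
        for text in fetch_page(page_number):
            if text not in seen:
                seen.add(text)
                unique_reviews.append(text)
                new_on_page += 1
        if new_on_page == 0:
            # Empty page, or a repeat of earlier pages: stop requesting.
            break
    return unique_reviews


# Canned stand-in for a server that ignores the page number.
def same_page_every_time(page_number: int) -> List[str]:
    return ["great film", "too long", "loved the songs"]


print(collect_unique_reviews(same_page_every_time, max_pages=50))
# prints ['great film', 'too long', 'loved the songs']
```

With real data, comparing `len(df)` against `df['Review'].nunique()` after loading the CSV gives a quick check for the same problem.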

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
# Write code for each of the sub parts with proper comments.
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load CSV file
df = pd.read_csv('/content/movie_reviews.csv')

# (1) Remove noise: keep only ASCII letters and whitespace
# astype(str) guards against missing reviews that pandas reads as NaN
df['clean_review'] = df['Review'].astype(str).apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))

# (2) Remove numbers (step 1 already strips digits; kept as an explicit step)
df['clean_review'] = df['clean_review'].apply(lambda x: re.sub(r'\d+', '', x))

# (3) Remove stopwords
# Note: the NLTK list is lowercase and this step runs before lowercasing,
# so capitalized stopwords such as "I" or "But" are not removed
stop_words = set(stopwords.words('english'))
df['clean_review'] = df['clean_review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

# (4) Lowercase
df['clean_review'] = df['clean_review'].str.lower()

# (5) Stemming
ps = PorterStemmer()
df['clean_review'] = df['clean_review'].apply(lambda x: ' '.join(ps.stem(word) for word in x.split()))

# (6) Lemmatization (applied to the already stemmed tokens)
lm = WordNetLemmatizer()
df['clean_review'] = df['clean_review'].apply(lambda x: ' '.join(lm.lemmatize(word) for word in x.split()))

# Save cleaned data to CSV
df.to_csv('reviews_cleaned.csv', index=False)

print(' cleaned data saved to reviews_cleaned.csv')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


 cleaned data saved to reviews_cleaned.csv
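
The cleaning steps can also be collected into one function. The sketch below is an illustration under stated assumptions rather than the notebook's method: it lowercases before the stopword check so that capitalized stopwords are caught, it uses a tiny hypothetical stopword set (`DEMO_STOPWORDS`) in place of the NLTK list, and the stemmer and lemmatizer are optional callables so the example runs without any downloads.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch); in the
# notebook this role is played by set(stopwords.words('english')).
DEMO_STOPWORDS = {"i", "a", "the", "but", "this", "is", "of", "and", "it"}


def clean_text(text, stop_words, stem=None, lemmatize=None):
    """Apply the cleaning steps to one string and return the cleaned string."""
    # (1) + (2) keep only letters and whitespace (drops punctuation and digits)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # (4) lowercase first, so the lowercase stopword list matches every token
    text = text.lower()
    # (3) remove stopwords
    tokens = [token for token in text.split() if token not in stop_words]
    # (5) optional stemming, e.g. PorterStemmer().stem
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    # (6) optional lemmatization, e.g. WordNetLemmatizer().lemmatize
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)


print(clean_text("But I watched this 3-hour film, and it is GREAT!", DEMO_STOPWORDS))
# prints: watched hour film great
```

With NLTK available, the same function could be applied per row, for example `df['Review'].astype(str).apply(lambda x: clean_text(x, stop_words, ps.stem, lm.lemmatize))`.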


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
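
To make the difference between the two parse types in (2) concrete, here is a small hand-annotated sketch for a single sentence. Both analyses are written by hand for illustration and are not parser output; the sentence, the tag names, and the helper functions are assumptions of this example. A constituency tree groups words into nested phrases (NP, VP), while a dependency parse links each word directly to its head word with a labeled relation.

```python
SENTENCE = "The hero fights tigers"

# Constituency tree as nested (label, child, ...) tuples; leaves are words.
CONSTITUENCY = (
    "S",
    ("NP", ("DT", "The"), ("NN", "hero")),
    ("VP", ("VBZ", "fights"), ("NP", ("NNS", "tigers"))),
)

# Dependency parse as (head, relation, dependent) triples.
# Every word has exactly one head; the main verb hangs off an artificial ROOT.
DEPENDENCIES = [
    ("hero", "det", "The"),
    ("fights", "nsubj", "hero"),
    ("ROOT", "root", "fights"),
    ("fights", "dobj", "tigers"),
]


def bracketed(node):
    """Render a constituency tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracketed(child) for child in children) + ")"


def leaves(node):
    """Return the words of a constituency tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


print(bracketed(CONSTITUENCY))
# prints: (S (NP (DT The) (NN hero)) (VP (VBZ fights) (NP (NNS tigers))))
for head, relation, dependent in DEPENDENCIES:
    print(f"{head} --{relation}--> {dependent}")
```

The constituency tree has internal phrase nodes that correspond to no single word, whereas the dependency parse has exactly one arc per word and no phrase nodes.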

In [4]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd

# Load data; empty reviews are read back as NaN, so replace them with ''
df = pd.read_csv('reviews_cleaned.csv')
df['clean_review'] = df['clean_review'].fillna('').astype(str)

# Load spacy model once and reuse it for every step
nlp = spacy.load('en_core_web_sm')

# (1) POS Tagging
# Note: the input is stemmed, lowercased text, so the tags are approximate
pos_counts = Counter()
for review in df['clean_review']:
    doc = nlp(review)
    for token in doc:
        pos_counts[token.pos_] += 1

print("Noun count:", pos_counts['NOUN'])
print("Verb count:", pos_counts['VERB'])
print("Adjective count:", pos_counts['ADJ'])
print("Adverb count:", pos_counts['ADV'])

# (2) Constituency and Dependency Parsing
# Only dependency parsing is shown here: en_core_web_sm provides a
# dependency parser but no constituency parser
sample_text = df['clean_review'].iloc[0]
print(sample_text)
sample_doc = nlp(sample_text)

# Visualize Dependency Parsing Tree using displacy
# (serve starts a local web server and blocks until it is interrupted)
displacy.serve(sample_doc, style="dep")

## Dependency Parsing Tree using SpaCy
print("\nDependency Parsing Tree:")

# Render the same tree inline in the notebook (this does not save a file)
displacy.render(sample_doc, style="dep", jupyter=True, options={'compact': True, 'color': 'blue'})

# (3) Named Entity Recognition
# Counts are per entity label (PERSON, ORG, ...), not per entity string
entities = Counter()
for review in df['clean_review']:
    doc = nlp(review)
    for ent in doc.ents:
        entities[ent.label_] += 1

print(entities)

Noun count: 59600
Verb count: 29050
Adjective count: 21250
Adverb count: 7850
i seen lot movi time made lot differ style differ genr around world ive seen everyth mainstream movi imagin experiment i cant even rememb last time i came away movi think id never seen anyth like but that i felt rrrthi movi much it much may turn almost my wife i nearli bail minut mark film top ridicul but got hook i total ride point i disappoint hour behemoth endingdo like see musclebound slickedup men fight tiger check how public flog turn music number got evil british peopl extrem evil england sue filmmak defam sure how evil british peopl maul rampag jungl anim you betcha behead yep romanc of cours homoerotic intens watch movi may turn gay hooboy let say anyth alreadi movi isnt worth anywaywatch rrr make sure whatev first movi watch one dont care much certainli feel like palest imit movi ever seen serious movi deliri bonker unafraid absolut absurd almost imposs watch pretti much movi disappoint one thank lo




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.

Dependency Parsing Tree:


Counter({'PERSON': 3150, 'CARDINAL': 2500, 'NORP': 1650, 'ORG': 1350, 'GPE': 850, 'TIME': 650, 'DATE': 600, 'ORDINAL': 550, 'LANGUAGE': 50, 'EVENT': 50})
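
The counter above tallies entity labels only. If a count per individual entity is also wanted, one possible sketch is to count `(text, label)` pairs instead. The pairs below are hypothetical sample data standing in for `[(ent.text, ent.label_) for ent in doc.ents]` gathered over all reviews; they are not results from the notebook.

```python
from collections import Counter

# Hypothetical (text, label) pairs for illustration only.
entity_pairs = [
    ("rajamouli", "PERSON"), ("india", "GPE"), ("rajamouli", "PERSON"),
    ("british", "NORP"), ("india", "GPE"), ("rajamouli", "PERSON"),
]

# Count per label (what the notebook prints) and per individual entity.
label_counts = Counter(label for _, label in entity_pairs)
entity_counts = Counter(entity_pairs)

print(label_counts.most_common())
# prints [('PERSON', 3), ('GPE', 2), ('NORP', 1)]
print(entity_counts.most_common(2))
# prints [(('rajamouli', 'PERSON'), 3), (('india', 'GPE'), 2)]
```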


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

I encountered a few challenges during the assignment, particularly when extracting data from IMDb pages. The structure of the web pages and the dynamic nature of the content made it difficult to retrieve the information efficiently. However, I found the assignment intellectually stimulating, as it required a combination of web scraping techniques and data processing skills.