<a href="https://colab.research.google.com/github/pavanibasanth/pavani_INFO5731_Fall2024/blob/main/Kommineni_Pavani_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [2]:

!pip install webdriver_manager
!apt update
!apt install chromium-chromedriver
!pip install selenium
!pip install pandas nltk

!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && apt install ./google-chrome-stable_current_amd64.deb

from google.colab import drive
drive.mount('/content/drive')



Collecting webdriver_manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting python-dotenv (from webdriver_manager)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv, webdriver_manager
Successfully installed python-dotenv-1.0.1 webdriver_manager-4.0.2
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Ign:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Hit:8 http

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
from bs4 import BeautifulSoup

Google_Drive_Path = '/content/drive/MyDrive/Top_1000_Furiosa_A_Mad_Max-Saga_IMdB_Reviews.csv'

# Define driver setup function
def driversetup():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("lang=en")
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    options.add_argument("--incognito")
    options.add_argument("--disable-blink-features=AutomationControlled")

    driver = webdriver.Chrome(options=options)
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined});")

    return driver

# Initialize the driver
driver = driversetup()

# Open the IMDb reviews page
url = "https://www.imdb.com/title/tt12037194/reviews/?ref_=tt_ov_ql_2"
driver.get(url)

# Wait for the page to load
time.sleep(3)

# Click the load-more button repeatedly to load additional reviews
load_more_id = 'load-more-trigger'
for i in range(40):
    try:
        driver.find_element(By.ID, load_more_id).click()
        time.sleep(3)  # Wait for reviews to load
    except Exception as e:
        # Stop once the button is missing or can no longer be clicked
        print(f"Error clicking load-more: {e}")
        break

# Parse page with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find all review containers
reviews = soup.find_all('div', class_='lister-item-content')

# List to store reviews
review_data = []

# Loop through the first 1000 reviews and scrape relevant details
for i in range(min(1000, len(reviews))):
    review = reviews[i]

    # Scrape rating (if available)
    rating_tag = review.find('span', class_='rating-other-user-rating')
    if rating_tag:
        rating = rating_tag.text.strip().split('/')[0]
    else:
        rating = "No rating"

    # Scrape review summary and content
    summary = review.find('a', class_='title').text.strip()
    content = review.find('div', class_='text').text.strip()

    # Store the scraped review in the list
    review_data.append({
        'Rating': rating,
        'Summary': summary,
        'Content': content
    })

# Create a DataFrame from the review data
df = pd.DataFrame(review_data)

# Save the DataFrame to a CSV file
df.to_csv(Google_Drive_Path, index=False)

# Close the browser
driver.quit()

print("Top 1000 reviews have been scraped and saved to CSV.")


Top 1000 reviews have been scraped and saved to CSV.
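
The scraper pauses with a fixed `time.sleep(3)` after each click, which wastes time when the page responds quickly and may be too short when it responds slowly. A minimal sketch of polling instead is shown here. The helper `wait_until` and the simulated `more_reviews_loaded` predicate are hypothetical names used for illustration and are not part of the notebook; with a live driver, Selenium's `WebDriverWait` serves the same purpose.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Simulated page state (assumption): each poll "loads" 25 more reviews.
state = {"reviews": 0}

def more_reviews_loaded(target=50):
    state["reviews"] += 25
    return state["reviews"] >= target

loaded = wait_until(more_reviews_loaded, timeout=2.0, interval=0.01)
print(loaded, state["reviews"])
```

In a real run, the predicate would compare the number of review containers on the page before and after the click.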


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
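
Step (5) can be sketched with NLTK's `PorterStemmer`, which needs no extra corpus download. This is a minimal illustration under that assumption; the helper name `stem_text` is made up for the example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated token in `text`."""
    return ' '.join(stemmer.stem(word) for word in text.split())

stemmed = stem_text("running reviews")
print(stemmed)
```

Stemming and lemmatization both reduce words to a base form, so writing each result to its own column keeps the two outputs comparable.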

In [4]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Ensure NLTK resources are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [11]:
# Load the CSV file
data_file = pd.read_csv(Google_Drive_Path)

# Initialize the lemmatizer
word_lemmatizer = WordNetLemmatizer()

# Build the stopword set once instead of on every call
stopwords_set = set(stopwords.words('english'))

# Define a function to clean text data
def text_cleaning(input_text):
    # Remove noise (special characters and punctuation)
    input_text = re.sub(r'[^\w\s]', '', input_text)

    # Remove numbers
    input_text = re.sub(r'\d+', '', input_text)

    # Lowercase all text
    input_text = input_text.lower()

    # Remove stopwords
    input_text = ' '.join(word for word in input_text.split() if word not in stopwords_set)

    # Lemmatize each word (note: this function applies no stemming step)
    input_text = ' '.join(word_lemmatizer.lemmatize(word) for word in input_text.split())

    return input_text

# Apply the text_cleaning function to the 'Summary' and 'Content' columns
data_file['Cleaned_Summary'] = data_file['Summary'].apply(text_cleaning)
data_file['Cleaned_Content'] = data_file['Content'].apply(text_cleaning)

# Define path to save the cleaned data
cleaned_data_filepath = '/content/drive/MyDrive/Cleaned_Top_1000_Furiosa_A_Mad_Max-Saga_IMdB_Reviews.csv'

# Save the cleaned dataset to a new CSV file
data_file.to_csv(cleaned_data_filepath, index=False)

print(f"Text cleaning complete. Cleaned data has been saved to '{cleaned_data_filepath}'.")


Text cleaning complete. Cleaned data has been saved to '/content/drive/MyDrive/Cleaned_Top_1000_Furiosa_A_Mad_Max-Saga_IMdB_Reviews.csv'.
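
Two side effects of the cleaning order are worth noting. Deleting punctuation outright fuses words that were separated only by punctuation, and stripping apostrophes before the stopword filter turns `didn't` into `didnt`, which no longer matches a stopword list containing `didn't`. The sketch below illustrates both with plain `re`. The tiny `STOPWORDS` set, the sample sentence, and the function names are assumptions for the demo, not NLTK's list or the notebook's code.

```python
import re

STOPWORDS = {"i", "it", "didn't", "but"}  # toy list for illustration only

def clean_delete(text):
    """Delete all punctuation first, then filter stopwords."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    return [w for w in text.split() if w not in STOPWORDS]

def clean_space_then_filter(text):
    """Replace punctuation (except apostrophes) with a space, filter
    stopwords while contractions are intact, then drop apostrophes."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    words = [w for w in text.split() if w not in STOPWORDS]
    return [w.replace("'", "") for w in words if w.strip("'")]

sample = "Well, sort of anyway.I didn't love it"
print(clean_delete(sample))
print(clean_space_then_filter(sample))
```

The first function leaves the fused token `anywayi` and the unmatched `didnt` in its result; the second keeps `anyway` as a separate word and removes the contraction.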


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.
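
For step (1), the per-category totals can be accumulated across every review rather than a single one. The sketch below assumes Penn Treebank tags such as those returned by `nltk.pos_tag`, and uses hand-written tagged tokens so that it runs without any model. `count_pos` and `PREFIX_TO_CATEGORY` are illustrative names.

```python
from collections import Counter

# The first two characters of a Penn Treebank tag identify the broad category.
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category:
            counts[category] += 1
    return counts

# Hand-tagged stand-ins for two reviews (assumption for the demo).
tagged_reviews = [
    [("perfect", "JJ"), ("action", "NN"), ("film", "NN"), ("every", "DT")],
    [("really", "RB"), ("wanted", "VBN"), ("love", "NN")],
]

totals = Counter()
for tagged in tagged_reviews:
    totals.update(count_pos(tagged))
print(dict(totals))
```

Matching on the tag prefix covers variants such as `NNS` or `VBD` without listing each one, and it leaves out tags like `WRB` whose first two characters differ.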

In [12]:
import nltk
import spacy
import pandas as pd
from collections import Counter
from nltk import pos_tag, word_tokenize, sent_tokenize
from nltk.tree import Tree
from nltk.chunk import ne_chunk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [13]:
# Load spaCy model
nlp_model = spacy.load("en_core_web_sm")

def text_analysis(input_text):
    # 1. Parts of Speech (POS) Tagging
    word_tokens = word_tokenize(input_text)
    pos_labels = pos_tag(word_tokens)

    pos_summary = Counter(label for word, label in pos_labels)
    print("POS Tagging Overview:")
    print(f"Nouns: {pos_summary['NN'] + pos_summary['NNS'] + pos_summary['NNP'] + pos_summary['NNPS']}")
    print(f"Verbs: {pos_summary['VB'] + pos_summary['VBD'] + pos_summary['VBG'] + pos_summary['VBN'] + pos_summary['VBP'] + pos_summary['VBZ']}")
    print(f"Adjectives: {pos_summary['JJ'] + pos_summary['JJR'] + pos_summary['JJS']}")
    print(f"Adverbs: {pos_summary['RB'] + pos_summary['RBR'] + pos_summary['RBS']}")

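    # 2. Constituency Parsing and Dependency Parsing
    # Note: ne_chunk yields a shallow chunk tree (POS-tagged tokens with any
    # named-entity chunks under a single S node), not a full constituency parse.
    print("\nConstituency Parsing Result:")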
    for sent in sent_tokenize(input_text):
        tokenized_sentence = word_tokenize(sent)
        tagged_sentence = pos_tag(tokenized_sentence)
        parse_tree = nltk.chunk.ne_chunk(tagged_sentence)
        print(parse_tree)

    print("\nDependency Parsing Result:")
    parsed_doc = nlp_model(input_text)
    for sentence in parsed_doc.sents:
        for tok in sentence:
            print(f"{tok.text} --{tok.dep_}--> {tok.head.text}")

    # 3. Named Entity Recognition (reuses the spaCy document parsed above)
    print("\nNamed Entity Recognition:")
    named_entities = [(entity.text, entity.label_) for entity in parsed_doc.ents]
    entity_summary = Counter(label for _, label in named_entities)

    print("Entity Summary:")
    for entity_label, entity_count in entity_summary.items():
        print(f"{entity_label}: {entity_count}")

    print("\nExtracted Named Entities:")
    for entity_text, entity_label in named_entities:
        print(f"{entity_text} - {entity_label}")

# Load the cleaned CSV file
data_frame = pd.read_csv(cleaned_data_filepath)

# Analyze the first row of the 'Cleaned_Content' column
sample_text_content = data_frame['Cleaned_Content'].iloc[0]
text_analysis(sample_text_content)


POS Tagging Overview:
Nouns: 168
Verbs: 60
Adjectives: 81
Adverbs: 36

Constituency Parsing Result:
(S
  george/NN
  miller/RBS
  yes/UH
  yes/RB
  well/RB
  sort/VB
  anywayi/NNS
  really/RB
  really/RB
  wanted/VBN
  love/NN
  furiosa/JJ
  end/NN
  didnt/NN
  liked/VBD
  didnt/JJ
  love/NN
  big/JJ
  big/JJ
  shoe/NN
  fill/NN
  completely/RB
  love/JJ
  fury/NN
  road/NN
  perfect/JJ
  action/NN
  film/NN
  every/DT
  way/NN
  prepared/JJ
  film/NN
  fall/NN
  shadow/NN
  furiosa/NN
  fun/NN
  sadly/RB
  frthe/VBD
  good/JJ
  news/NN
  want/VBP
  action/NN
  action/NN
  load/NN
  like/IN
  good/JJ
  mad/NN
  max/NN
  story/NN
  hold/VBP
  true/JJ
  promise/NN
  entertain/NN
  mass/NN
  spectacle/NN
  glorious/JJ
  get/VB
  hot/JJ
  rod/NN
  big/JJ
  wheel/NN
  digger/NN
  bike/IN
  shape/NN
  size/NN
  well/RB
  flying/VBG
  contraption/NN
  weaponry/NN
  galore/VBD
  holding/VBG
  back/RB
  violence/NN
  explosion/NN
  body/NN
  flying/VBG
  witness/JJ
  etc/JJ
  plenty/NN
  brutal
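
The tree printed above is flat: every token hangs directly under `S`, because `ne_chunk` only groups named entities and does not build phrases. For contrast, here is a minimal sketch of a constituency parse built from a toy hand-written grammar with `nltk.CFG` and `nltk.ChartParser`, next to hand-labeled dependency edges for the same sentence. The grammar, the example sentence, and the edge labels are assumptions made for illustration; they are not output from the notebook.

```python
from nltk import CFG, ChartParser

# Toy grammar (assumption): just enough rules to cover one sentence.
toy_grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'film' | 'audience'
V -> 'entertains'
""")

tokens = "the film entertains the audience".split()
tree = next(ChartParser(toy_grammar).parse(tokens))
print(tree)

# Dependency view: (dependent, relation, head), labeled by hand.
dependencies = [
    ("the", "det", "film"),
    ("film", "nsubj", "entertains"),
    ("entertains", "ROOT", "entertains"),
    ("the", "det", "audience"),
    ("audience", "dobj", "entertains"),
]
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

The constituency tree nests words into phrases (`NP`, `VP`) under the sentence node, while the dependency edges link each word directly to the word it modifies, with the main verb as the root.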

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [10]:
# Write your response below
# What I found challenging was the scraping of the data. Many sites have implemented anti-scraping systems that prevent large-scale data scraping, and overcoming this was a bit challenging. All the other parts of the assignment I found to be easier and more straightforward, especially the text cleaning and analysis processes. I enjoyed applying various natural language processing techniques to analyze the customer reviews, as it provided valuable insights into the sentiments expressed in the reviews.
