<a href="https://colab.research.google.com/github/pramodgangula19/5731_Spring24/blob/main/Gangula_Pramod_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **either of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
# Scrape IMDb user reviews for one title and save them to a CSV file
import requests
from bs4 import BeautifulSoup
import csv
import time

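def imdb_reviews(movie_id, no_of_reviews):
    """Collect up to no_of_reviews review texts for the given IMDb title ID."""
    reviews = []
    # Build the URL from movie_id instead of hard-coding the title
    imdb_url = f"https://www.imdb.com/title/{movie_id}/reviews"

    # Ceiling division; the code assumes 25 reviews per page
    num_pages = (no_of_reviews + 24) // 25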

    for page in range(1, num_pages + 1):
        url = f"{imdb_url}?start={((page - 1) * 25)}"
        response = requests.get(url)

        if response.status_code != 200:
            print(f"Failed to retrieve page {page}. Status code: {response.status_code}")
            continue

        soup = BeautifulSoup(response.text, 'html.parser')

        review_elements = soup.find_all(class_='text show-more__control')
        for review_element in review_elements:
            review_text = review_element.text.strip()
            reviews.append(review_text)


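            # Stop as soon as the requested number of reviews is reached
            if len(reviews) >= no_of_reviews:
                return reviews

        # Pause between requests to avoid overloading the server
        time.sleep(2)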

    return reviews


movie_id = 'tt9098872'

num_reviews_to_scrape = 1000  # this movie has only 170 reviews, so 1000 can be replaced with a smaller value

imdb_user_reviews = imdb_reviews(movie_id, num_reviews_to_scrape)

# Save the reviews to a CSV file
csv_filename = f'imdb_reviews_{movie_id}.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Review'])
    for review in imdb_user_reviews:
        writer.writerow([review])

print(f"Reviews saved to {csv_filename}")
print("\nSample Reviews:")
for i, review in enumerate(imdb_user_reviews[:1000], start=1):
    print(f"Review {i}: {review}")






Reviews saved to imdb_reviews_tt9098872.csv

Sample Reviews:
Review 1: I have watched quite a few documentaries, news segments, and Youtube videos about this cave rescue. It's a huuuuge operation so every time you watch a new video you see something totally new about the operation. There were hundreds of people directly on the scene, thousands involved, various different plans being checked out. It's just enormous. Elon Musk even promised he would build a mini submarine to rescue the kids and called one of the rescuers pedo guy for doubting him on Twitter. Of course completely insane idea because it's very narrow cave passages, but there were many such alternative plans.So all these documentaries, and movies, are not always about the same thing. Here the focus is on the rescue divers. Especially divers from UK and then one doctor/diver from Australia who gives the kids anesthesia and makes this all possible. They also bring up the Thai divers and their inexperience. And then the big po
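
The page arithmetic in `imdb_reviews` can be checked in isolation. The sketch below assumes, as the cell above does, that each page holds 25 reviews and that pages are addressed by a `start` offset. `page_offsets` is a hypothetical helper name, and whether IMDb actually honours the `start` parameter is not verified here.

```python
def page_offsets(no_of_reviews, page_size=25):
    """Return the start offsets needed to cover no_of_reviews items.

    page_size=25 is an assumption carried over from the scraping cell.
    """
    if no_of_reviews <= 0:
        return []
    # Ceiling division without floats: (n + size - 1) // size
    num_pages = (no_of_reviews + page_size - 1) // page_size
    return [page * page_size for page in range(num_pages)]


print(len(page_offsets(1000)))  # number of requests needed for 1000 reviews
print(page_offsets(170))        # offsets for a title with 170 reviews
```

Computing the offsets up front makes it easy to see how many requests a run will issue before any network call is made.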

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write your code here
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the stopwords, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

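# Clean and preprocess one review
def clean_text(text):
    # (1) and (2): keep only letters and whitespace, which drops punctuation,
    # special characters, and digits in one pass
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\d+', '', text)  # explicit number removal (a no-op after the line above)
    tokens = text.split()
    # (3) Remove stopwords, comparing in lowercase
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # (4) Lowercase
    tokens = [word.lower() for word in tokens]
    # (5) and (6): lemmatize each token, then stem the result
    tokens = [stemmer.stem(lemmatizer.lemmatize(word)) for word in tokens]
    cleaned_text = ' '.join(tokens)

    return cleaned_text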

# Read the original CSV file with uncleaned reviews
input_csv = '/content/imdb_reviews_tt9098872.csv'
output_csv = 'imdb_reviews_cleaned_tt1234567.csv'

with open(input_csv, 'r', newline='', encoding='utf-8') as infile, \
     open(output_csv, 'w', newline='', encoding='utf-8') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)
    header.append('Cleaned_Review')
    writer.writerow(header)


    for row in reader:
        review = row[0]
        cleaned_review = clean_text(review)
        row.append(cleaned_review)
        writer.writerow(row)

print(f"Cleaned reviews saved to {output_csv}")


# Function to read and print the content of the CSV
def print_csv_content(filename):
    with open(filename, 'r', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            print(', '.join(row))

# Print the content of the output CSV
print("Content of the output CSV:")
print_csv_content(output_csv)





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Cleaned reviews saved to imdb_reviews_cleaned_tt1234567.csv
Content of the output CSV:
Review, Cleaned_Review
I have watched quite a few documentaries, news segments, and Youtube videos about this cave rescue. It's a huuuuge operation so every time you watch a new video you see something totally new about the operation. There were hundreds of people directly on the scene, thousands involved, various different plans being checked out. It's just enormous. Elon Musk even promised he would build a mini submarine to rescue the kids and called one of the rescuers pedo guy for doubting him on Twitter. Of course completely insane idea because it's very narrow cave passages, but there were many such alternative plans.So all these documentaries, and movies, are not always about the same thing. Here the focus is on the rescue divers. Especially divers from UK and then one doctor/diver from Australia who gives the kids anesthesia and makes this all possible. They also bring up the Thai divers and 
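
Because the question asks for output after each part, it can help to print the intermediate result of every step. The sketch below covers steps (1) to (4) with the standard library only. `clean_steps` is a hypothetical helper, `DEMO_STOPWORDS` is a small hypothetical stand-in for NLTK's English stopword list, and stemming and lemmatization are left to the NLTK objects in the cell above.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); illustration only
DEMO_STOPWORDS = {"i", "have", "a", "and", "about", "this", "the", "in"}


def clean_steps(text, stop_words=DEMO_STOPWORDS):
    """Return the text after each of cleaning steps (1)-(4), in order."""
    steps = {}
    # (1) Drop punctuation and special characters; keep letters, digits, spaces
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    steps["no_noise"] = text
    # (2) Drop digits
    text = re.sub(r"\d+", "", text)
    steps["no_numbers"] = text
    # (3) Drop stopwords, comparing in lowercase
    tokens = [w for w in text.split() if w.lower() not in stop_words]
    steps["no_stopwords"] = " ".join(tokens)
    # (4) Lowercase
    steps["lowercased"] = " ".join(tokens).lower()
    return steps


sample = "I have watched 3 documentaries, and Youtube videos about this cave rescue."
for name, value in clean_steps(sample).items():
    print(f"{name}: {value}")
```

Keeping digits in step (1) and removing them separately in step (2) makes the effect of each step visible in the printed output.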

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
# Syntax and structure analysis with NLTK

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')

cleaned_text = "The story is amazing"

# (1) Parts of Speech (POS) Tagging
def pos_tagging(text):
    words = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(words)
    return pos_tags

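# (2) Constituency Parsing, approximated with a regex chunk grammar
# (this is shallow chunking, not a full constituency parse)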
def constituency_parsing(text):
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words)
        grammar = r"""
            NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
            PP: {<IN><NP>}      # Chunk prepositions followed by NP
            VP: {<VB.*><NP|PP>*}  # Chunk verbs and their arguments
        """
        parser = nltk.RegexpParser(grammar)
        tree = parser.parse(pos_tags)
        print("Constituency Parsing Tree:")
        print(tree)

# (3) Named Entity Recognition (using NLTK's built-in named entity recognition)
def named_entity_recognition(text):
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        ner_tags = nltk.ne_chunk(nltk.pos_tag(words))
        print("Named Entity Recognition:")
        print(ner_tags)

# Perform syntax and structure analysis on the cleaned text
print("Cleaned Text:")
print(cleaned_text)

# (1) Parts of Speech (POS) Tagging
pos_tags = pos_tagging(cleaned_text)
print("\n(1) Parts of Speech (POS) Tagging:")
print(pos_tags)

# (2) Constituency Parsing
print("\n(2) Constituency Parsing:")
constituency_parsing(cleaned_text)

# (3) Named Entity Recognition (NER)
print("\n(3) Named Entity Recognition:")
named_entity_recognition(cleaned_text)






[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Cleaned Text:
The story is amazing

(1) Parts of Speech (POS) Tagging:
[('The', 'DT'), ('story', 'NN'), ('is', 'VBZ'), ('amazing', 'VBG')]

(2) Constituency Parsing:
Constituency Parsing Tree:
(S (NP The/DT story/NN) (VP is/VBZ) (VP amazing/VBG))

(3) Named Entity Recognition:
Named Entity Recognition:
(S The/DT story/NN is/VBZ amazing/VBG)
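
Part (1) also asks for totals per category, which the cell above does not compute. A minimal sketch, assuming Penn Treebank tags as returned by `nltk.pos_tag` (nouns start with `NN`, verbs with `VB`, adjectives with `JJ`, adverbs with `RB`). `count_pos` is a hypothetical helper, and the input is the tagged list printed above.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four requested categories
POS_PREFIXES = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def count_pos(tagged_words):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    # Start every category at zero so absent categories still appear
    counts = Counter({label: 0 for label in POS_PREFIXES.values()})
    for _word, tag in tagged_words:
        label = POS_PREFIXES.get(tag[:2])
        if label is not None:
            counts[label] += 1
    return counts


# Tagged output copied from the cell above
tagged = [("The", "DT"), ("story", "NN"), ("is", "VBZ"), ("amazing", "VBG")]
print(dict(count_pos(tagged)))
```

Because the tagger labelled "amazing" as `VBG`, it is counted as a verb here even though it functions as an adjective in the sentence.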


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [None]:

# Write your response below
The assignment was quite comprehensive and covered various aspects of text data collection, cleaning, and analysis, which are essential skills in natural language processing (NLP). Here are my thoughts on the assignment:

Challenges:

*   Data Collection: Choosing a suitable source and collecting the data programmatically can be challenging, especially if the chosen platform does not provide straightforward access to the data.
*   Data Cleaning: Cleaning text data involves several steps, such as removing noise, stopwords, and numbers, which require careful implementation to ensure the integrity of the text while removing irrelevant information.
*   Understanding NLP Concepts: Implementing parts of speech tagging, constituency parsing, and named entity recognition requires a solid understanding of NLP concepts, which can be challenging for beginners.
*   Integration of Libraries: Integrating different libraries like BeautifulSoup for web scraping, NLTK for NLP tasks, and CSV handling requires some level of expertise and troubleshooting skills.

Enjoyable Aspects:

*   Problem Solving: I enjoyed the problem-solving aspect of the assignment, especially devising strategies to collect data from various sources and implementing cleaning and analysis techniques.
*   Learning Experience: The assignment provided an excellent opportunity to learn and apply different NLP techniques in a real-world context, which is always enjoyable.
*   Code Optimization: Optimizing code for efficiency and readability was satisfying, as it allowed me to improve my programming skills.
*   Exploration: Exploring different libraries and tools for web scraping, text processing, and NLP was enriching and expanded my knowledge base.

Time to Complete:

The provided time to complete the assignment seems reasonable, considering the complexity and scope of the tasks involved. However, it ultimately depends on individual proficiency in Python programming, web scraping, and NLP concepts. Some learners may find it challenging to complete within the given timeframe, especially if they encounter unforeseen difficulties or have limited prior experience with the required technologies. Overall, I believe the allotted time strikes a balance between providing a sufficient challenge and ensuring completion within a reasonable timeframe.