<a href="https://colab.research.google.com/github/gtakhil95/Akhil_INFO5731_Fall2024/blob/main/Gundampalli_Thirupalu_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [34]:
!pip install requests pandas



In [36]:
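import os
import time

import pandas as pd
import requests

# Read the Semantic Scholar API key from the environment instead of hardcoding it.
# The variable name SEMANTIC_SCHOLAR_API_KEY is an assumption; use whatever name you export.
API_KEY = os.environ.get('SEMANTIC_SCHOLAR_API_KEY', '')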

# Headers for requests including your API key
headers = {
    'x-api-key': API_KEY,
    'Accept': 'application/json'
}

# Define the queries
queries = ["machine learning", "data science", "artificial intelligence", "information extraction","robotics","neural networks","data analyst","data extraction","artificial intelligence","information extraction"]

# Set up the base URL for the Semantic Scholar API
BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

# Function to fetch abstracts from Semantic Scholar
def fetch_papers(query, limit, offset):
    params = {
        'query': query,
        'limit': limit,
        'offset': offset,
        'fields': 'title,abstract'
    }
    response = requests.get(BASE_URL, headers=headers, params=params)

    # Raise an exception if the request was not successful
    if response.status_code != 200:
        raise Exception(f"Error fetching data: {response.status_code}")

    # Return the JSON response
    return response.json()

# Function to collect all abstracts
def collect_abstracts(queries, total_papers=10000):
    all_papers = []
    # Each query is paged until the API returns no data or raises an error
    # (the output below shows a 400 once the offset reaches 1000); then the next query starts.

    for query in queries:
        offset = 0
        while len(all_papers) < total_papers:
            try:
                print(f"Fetching papers for query: {query}, Offset: {offset}")

                # Fetch the next batch of papers
                result = fetch_papers(query, limit=100, offset=offset)

                if 'data' not in result or len(result['data']) == 0:
                    print(f"No more papers found for query: {query}")
                    break

                # Extract relevant paper data (title, abstract)
                for paper in result['data']:
                    title = paper.get('title', 'No title available')
                    abstract = paper.get('abstract', 'No abstract available')
                    all_papers.append([title, abstract])

                    if len(all_papers) >= total_papers:
                        break

                # Increment offset to get the next batch of papers
                offset += 100
                time.sleep(1)  # Add delay to avoid rate limits
            except Exception as e:
                print(f"Error fetching papers: {e}")
                break

    return all_papers

# Collect the top 10,000 abstracts
abstracts_data = collect_abstracts(queries, total_papers=10000)

# Save the abstracts data to a CSV file
df = pd.DataFrame(abstracts_data, columns=['Title', 'Abstract'])
df.to_csv('top_10000_abstracts.csv', index=False)

print("Abstracts saved to 'top_10000_abstracts.csv'")


Fetching papers for query: machine learning, Offset: 0
Fetching papers for query: machine learning, Offset: 100
Fetching papers for query: machine learning, Offset: 200
Fetching papers for query: machine learning, Offset: 300
Fetching papers for query: machine learning, Offset: 400
Fetching papers for query: machine learning, Offset: 500
Fetching papers for query: machine learning, Offset: 600
Fetching papers for query: machine learning, Offset: 700
Fetching papers for query: machine learning, Offset: 800
Fetching papers for query: machine learning, Offset: 900
Fetching papers for query: machine learning, Offset: 1000
Error fetching papers: Error fetching data: 400
Fetching papers for query: data science, Offset: 0
Fetching papers for query: data science, Offset: 100
Fetching papers for query: data science, Offset: 200
Fetching papers for query: data science, Offset: 300
Fetching papers for query: data science, Offset: 400
Fetching papers for query: data science, Offset: 500
Fetching p
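
The log above ends each query with a 400 error once the offset reaches 1000, which is consistent with the relevance-search endpoint serving only roughly the first 1,000 results per query. A minimal sketch of planning the requests up front so that no out-of-range request is sent; `plan_requests`, `max_per_query`, and the 1,000 cap are assumptions for illustration, not values confirmed by the notebook. It also drops repeated queries, since the `queries` list above contains duplicates, and it assumes every request returns a full page.

```python
def plan_requests(queries, total_papers=10000, page_size=100, max_per_query=1000):
    """Return (query, offset, limit) tuples that never page past max_per_query."""
    unique_queries = list(dict.fromkeys(queries))  # drop repeats, keep original order
    plan = []
    remaining = total_papers
    for query in unique_queries:
        offset = 0
        while remaining > 0 and offset < max_per_query:
            # Shrink the last page so neither the per-query cap nor the total is exceeded.
            limit = min(page_size, max_per_query - offset, remaining)
            plan.append((query, offset, limit))
            offset += limit
            remaining -= limit
    return plan


plan = plan_requests(
    ["machine learning", "data science", "machine learning"],
    total_papers=1500,
)
print(len(plan))  # number of requests planned
print(plan[-1])   # last request in the plan
```

Each tuple could then be passed to `fetch_papers(query, limit, offset)`. With eight distinct queries and a 1,000-result cap, such a plan tops out at 8,000 papers, so reaching 10,000 would need more distinct queries or a different endpoint.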

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [38]:
# Write code for each of the sub parts with proper comments.
!pip install nltk pandas



In [39]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset (assumes you have the CSV created from the previous step)
df = pd.read_csv('top_10000_abstracts.csv')

# Preview the data
print(df.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                                               Title  \
0  Fashion-MNIST: a Novel Image Dataset for Bench...   
1  TensorFlow: A system for large-scale machine l...   
2  TensorFlow: Large-Scale Machine Learning on He...   
3  Stop explaining black box machine learning mod...   
4  Convolutional LSTM Network: A Machine Learning...   

                                            Abstract  
0  We present Fashion-MNIST, a new dataset compri...  
1  TensorFlow is a machine learning system that o...  
2  TensorFlow is an interface for expressing mach...  
3                                                NaN  
4  The goal of precipitation nowcasting is to pre...  


In [41]:
def remove_special_characters(text):
    # Check if the input is a string before applying regex
    if isinstance(text, str):
        return re.sub(r'[^a-zA-Z\s]', '', text)
    else:
        return text  # Return the original value if it's not a string

# Apply this to the 'Abstract' column
df['cleaned_abstract'] = df['Abstract'].apply(remove_special_characters)

print(df[['Abstract', 'cleaned_abstract']].head())




                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1  TensorFlow is a machine learning system that o...   
2  TensorFlow is an interface for expressing mach...   
3                                                NaN   
4  The goal of precipitation nowcasting is to pre...   

                                    cleaned_abstract  
0  We present FashionMNIST a new dataset comprisi...  
1  TensorFlow is a machine learning system that o...  
2  TensorFlow is an interface for expressing mach...  
3                                                NaN  
4  The goal of precipitation nowcasting is to pre...  


In [43]:
def remove_numbers(text):
    # Check if the input is a string before applying regex
    if isinstance(text, str):
        return re.sub(r'\d+', '', text)
    else:
        return str(text)  # Convert non-string values to string

# Apply this to the 'cleaned_abstract' column
df['cleaned_abstract'] = df['cleaned_abstract'].apply(remove_numbers)

print(df[['cleaned_abstract']].head())


                                    cleaned_abstract
0  We present FashionMNIST a new dataset comprisi...
1  TensorFlow is a machine learning system that o...
2  TensorFlow is an interface for expressing mach...
3                                                nan
4  The goal of precipitation nowcasting is to pre...
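
Row 3 above now holds the literal string `nan`: the missing abstract is a float `NaN`, and `str(text)` turns it into a three-letter word that later steps treat as real text. A minimal sketch of one alternative, assuming missing abstracts should become empty strings. `clean_text` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English list). The sketch also replaces noise with a space instead of deleting it, so a hyphenated name such as `Fashion-MNIST` is not fused into a single token.

```python
import re

# Illustrative stand-in for NLTK's English stopword list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "of", "we", "to"}


def clean_text(text):
    """Strip non-letters, lowercase, and drop stopwords; non-strings become ''."""
    if not isinstance(text, str):  # float NaN, None, or any other non-string value
        return ""
    # Replace digits, punctuation, and other non-letters with a space.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    words = text.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)


print(clean_text("We present Fashion-MNIST, a new dataset of 28x28 images."))
print(repr(clean_text(float("nan"))))
```

On the DataFrame this could be applied as `df['Abstract'].apply(clean_text)`, after which empty strings can be filtered out before tagging.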


In [44]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    return ' '.join([word for word in words if word.lower() not in stop_words])

# Apply this to the 'cleaned_abstract' column
df['cleaned_abstract'] = df['cleaned_abstract'].apply(remove_stopwords)

print(df[['cleaned_abstract']].head())


                                    cleaned_abstract
0  present FashionMNIST new dataset comprising x ...
1  TensorFlow machine learning system operates la...
2  TensorFlow interface expressing machine learni...
3                                                nan
4  goal precipitation nowcasting predict future r...


In [45]:
def to_lowercase(text):
    return text.lower()

# Apply this to the 'cleaned_abstract' column
df['cleaned_abstract'] = df['cleaned_abstract'].apply(to_lowercase)

print(df[['cleaned_abstract']].head())

                                    cleaned_abstract
0  present fashionmnist new dataset comprising x ...
1  tensorflow machine learning system operates la...
2  tensorflow interface expressing machine learni...
3                                                nan
4  goal precipitation nowcasting predict future r...


In [46]:
ps = PorterStemmer()

def stem_words(text):
    words = text.split()
    return ' '.join([ps.stem(word) for word in words])

# Apply this to the 'cleaned_abstract' column
df['cleaned_abstract'] = df['cleaned_abstract'].apply(stem_words)

print(df[['cleaned_abstract']].head())


                                    cleaned_abstract
0  present fashionmnist new dataset compris x gra...
1  tensorflow machin learn system oper larg scale...
2  tensorflow interfac express machin learn algor...
3                                                nan
4  goal precipit nowcast predict futur rainfal in...


In [47]:
lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    words = text.split()
    return ' '.join([lemmatizer.lemmatize(word) for word in words])

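# Apply this to the 'cleaned_abstract' column.
# Note: this column is already stemmed, so the lemmatizer appears to leave most tokens
# unchanged (compare the two columns in the output below).
df['cleaned_abstract_lemmatized'] = df['cleaned_abstract'].apply(lemmatize_words)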

print(df[['cleaned_abstract', 'cleaned_abstract_lemmatized']].head())


                                    cleaned_abstract  \
0  present fashionmnist new dataset compris x gra...   
1  tensorflow machin learn system oper larg scale...   
2  tensorflow interfac express machin learn algor...   
3                                                nan   
4  goal precipit nowcast predict futur rainfal in...   

                         cleaned_abstract_lemmatized  
0  present fashionmnist new dataset compris x gra...  
1  tensorflow machin learn system oper larg scale...  
2  tensorflow interfac express machin learn algor...  
3                                                nan  
4  goal precipit nowcast predict futur rainfal in...  


In [48]:
# Save the cleaned data to a new CSV file
df.to_csv('cleaned_top_10000_abstracts.csv', index=False)
print("Cleaned data saved to cleaned_top_10000_abstracts.csv")


Cleaned data saved to cleaned_top_10000_abstracts.csv


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [50]:
# Your code here
!pip install spacy nltk pandas
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 57.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [51]:
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank

# Load the cleaned data
df = pd.read_csv('cleaned_top_10000_abstracts.csv')

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Preview the data
print(df.head())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


                                               Title  \
0  Fashion-MNIST: a Novel Image Dataset for Bench...   
1  TensorFlow: A system for large-scale machine l...   
2  TensorFlow: Large-Scale Machine Learning on He...   
3  Stop explaining black box machine learning mod...   
4  Convolutional LSTM Network: A Machine Learning...   

                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1  TensorFlow is a machine learning system that o...   
2  TensorFlow is an interface for expressing mach...   
3                                                NaN   
4  The goal of precipitation nowcasting is to pre...   

                                    cleaned_abstract  \
0  present fashionmnist new dataset compris x gra...   
1  tensorflow machin learn system oper larg scale...   
2  tensorflow interfac express machin learn algor...   
3                                                NaN   
4  goal precipit nowcast predict futur rainfal

In [53]:
# Function to perform POS tagging
def pos_tagging(text):
    # Convert to string to handle non-string values
    text = str(text)  # This line ensures 'text' is always a string
    tokens = word_tokenize(text)
    return pos_tag(tokens)

# Apply POS tagging to each abstract
df['pos_tags'] = df['cleaned_abstract_lemmatized'].apply(pos_tagging)

# Function to count specific POS (Nouns, Verbs, Adjectives, Adverbs)
def count_pos_tags(pos_tags):
    pos_counts = Counter(tag for word, tag in pos_tags)
    noun_count = sum([pos_counts[tag] for tag in ['NN', 'NNS', 'NNP', 'NNPS']])
    verb_count = sum([pos_counts[tag] for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']])
    adj_count = sum([pos_counts[tag] for tag in ['JJ', 'JJR', 'JJS']])
    adv_count = sum([pos_counts[tag] for tag in ['RB', 'RBR', 'RBS']])

    return noun_count, verb_count, adj_count, adv_count

# Apply the POS counting function to the DataFrame
df['noun_count'], df['verb_count'], df['adj_count'], df['adv_count'] = zip(*df['pos_tags'].apply(count_pos_tags))

# Output POS counts for the first few rows
print(df[['cleaned_abstract_lemmatized', 'noun_count', 'verb_count', 'adj_count', 'adv_count']].head())

                         cleaned_abstract_lemmatized  noun_count  verb_count  \
0  present fashionmnist new dataset compris x gra...          27           6   
1  tensorflow machin learn system oper larg scale...          70          10   
2  tensorflow interfac express machin learn algor...          70           8   
3                                                NaN           1           0   
4  goal precipit nowcast predict futur rainfal in...          57           4   

   adj_count  adv_count  
0         12          0  
1         17          3  
2         19          0  
3          0          0  
4         11          2  
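
The counts above are per abstract, while part (1) asks for totals. A minimal sketch of summing them over a corpus, assuming each document is a list of `(word, tag)` pairs with Penn Treebank tags such as those returned by `nltk.pos_tag`. `total_pos_counts` and the sample tags are hypothetical and written by hand for illustration.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four coarse classes:
# NN* (nouns), VB* (verbs), JJ* (adjectives), RB* (adverbs).
PREFIX_TO_CLASS = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def total_pos_counts(tagged_docs):
    """Sum coarse POS classes over an iterable of tagged documents."""
    totals = Counter()
    for doc in tagged_docs:
        for _word, tag in doc:
            coarse = PREFIX_TO_CLASS.get(tag[:2])
            if coarse is not None:
                totals[coarse] += 1
    return totals


# Hand-written sample tags, not tagger output.
sample = [
    [("goal", "NN"), ("predict", "VBP"), ("future", "JJ"), ("rainfall", "NN")],
    [("models", "NNS"), ("learn", "VBP"), ("quickly", "RB")],
]
totals = total_pos_counts(sample)
print(totals)
```

While the DataFrame is still in memory, `total_pos_counts(df['pos_tags'])` would give corpus-level totals; after a CSV round trip the column holds strings and would need parsing first.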


In [59]:
from nltk import Tree

# Using an example sentence for constituency parsing
example_sentence = "Artificial intelligence is transforming many industries."

# Tokenize and POS tag the example sentence
example_tokens = nltk.word_tokenize(example_sentence)
example_pos_tags = nltk.pos_tag(example_tokens)

# Define a small hand-written CFG that covers only this example sentence.
# "is" is treated as an auxiliary so that "is transforming many industries" parses as one VP;
# without that rule the parser finds no tree. The period is a quoted terminal.
grammar = nltk.CFG.fromstring("""
  S -> NP VP '.'
  VP -> Aux V NP | V NP | V NP PP
  PP -> P NP
  Aux -> "is"
  V -> "transforming"
  NP -> "Artificial" "intelligence" | "many" "industries"
  P -> "in"
""")
parser = nltk.ChartParser(grammar)

# Print the constituency parsing tree
for tree in parser.parse(example_tokens):
    tree.pretty_print()

In [60]:
# Function to perform dependency parsing
def dependency_parsing(text):
    doc = nlp(text)
    for token in doc:
        print(f"{token.text} -> {token.dep_} -> {token.head.text}")

# Example of dependency parsing for one abstract
example_text = df['cleaned_abstract_lemmatized'].iloc[0]
print("Dependency Parsing for the first abstract:")
dependency_parsing(example_text)


Dependency Parsing for the first abstract:
present -> amod -> product
fashionmnist -> amod -> product
new -> amod -> dataset
dataset -> compound -> product
compris -> nmod -> product
x -> punct -> product
grayscal -> amod -> imag
imag -> amod -> product
fashion -> compound -> product
product -> nsubj -> set
categori -> aux -> set
imag -> acl -> categori
per -> prep -> imag
categori -> compound -> train
train -> nsubj -> set
set -> ROOT -> set
imag -> amod -> test
test -> dobj -> set
set -> dep -> set
imag -> amod -> fashionmnist
fashionmnist -> nsubj -> intend
intend -> conj -> set
serv -> nmod -> origin
direct -> amod -> origin
dropin -> nmod -> origin
replac -> compound -> origin
origin -> compound -> mnist
mnist -> compound -> machin
dataset -> compound -> benchmark
benchmark -> compound -> machin
machin -> dobj -> intend
learn -> npadvmod -> set
algorithm -> compound -> share
share -> ccomp -> learn
imag -> amod -> format
size -> compound -> format
data -> compound -> format
format
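
Each line above reads `token -> relation -> head`, and the root token points at itself (`set -> ROOT -> set`). A minimal sketch of turning such triples into an indented tree, which makes the head-and-dependents structure easier to explain. `render_dependency_tree` is hypothetical, and the sample parse of the example sentence is written by hand rather than produced by spaCy.

```python
def render_dependency_tree(tokens):
    """tokens: list of (text, dep, head_index); the root's head is its own index."""
    children = {i: [] for i in range(len(tokens))}
    root = None
    for i, (_text, _dep, head) in enumerate(tokens):
        if head == i:
            root = i
        else:
            children[head].append(i)

    lines = []

    def walk(index, depth):
        text, dep, _head = tokens[index]
        lines.append("  " * depth + f"{text} ({dep})")
        for child in children[index]:
            walk(child, depth + 1)

    if root is not None:
        walk(root, 0)
    return lines


# Hand-annotated parse of "Artificial intelligence is transforming many industries."
sample = [
    ("Artificial", "amod", 1),
    ("intelligence", "nsubj", 3),
    ("is", "aux", 3),
    ("transforming", "ROOT", 3),
    ("many", "amod", 5),
    ("industries", "dobj", 3),
    (".", "punct", 3),
]
lines = render_dependency_tree(sample)
print("\n".join(lines))
```

With a spaCy `doc`, the input could be built as `[(t.text, t.dep_, t.head.i) for t in doc]`. In the sample, the verb `transforming` is the root, `intelligence` is its subject, and `industries` is its direct object; each modifier hangs under the word it modifies. A constituency tree instead groups adjacent words into nested phrases (NP, VP).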

In [62]:
# Function to extract named entities using spaCy
def named_entity_recognition(text):
    # Convert to string to handle non-string values, including NaNs and floats
    text = str(text) if not isinstance(text, str) else text
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Apply named entity recognition to the abstracts
df['entities'] = df['cleaned_abstract_lemmatized'].apply(named_entity_recognition)

# Count each type of named entity (e.g., PERSON, ORG, GPE, DATE, PRODUCT)
def count_entities(entities):
    entity_counter = Counter([ent[1] for ent in entities])
    return entity_counter

df['entity_counts'] = df['entities'].apply(count_entities)

# Output the named entities and their counts for the first few rows
print(df[['cleaned_abstract_lemmatized', 'entities', 'entity_counts']].head())

                         cleaned_abstract_lemmatized  \
0  present fashionmnist new dataset compris x gra...   
1  tensorflow machin learn system oper larg scale...   
2  tensorflow interfac express machin learn algor...   
3                                                NaN   
4  goal precipit nowcast predict futur rainfal in...   

                                            entities  \
0                [(compris, PERSON), (freeli, NORP)]   
1  [(heterogen, ORG), (flexibl, DATE), (varieti a...   
2  [(chang wide varieti heterogen system rang mob...   
3                                                 []   
4        [(fulli, GPE), (lstm fclstm convolut, ORG)]   

                             entity_counts  
0                 {'PERSON': 1, 'NORP': 1}  
1       {'ORG': 1, 'DATE': 1, 'PERSON': 1}  
2  {'PERSON': 1, 'CARDINAL': 2, 'DATE': 1}  
3                                       {}  
4                     {'GPE': 1, 'ORG': 1}  
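
The counts above are per abstract, while part (3) asks for the count of each entity type. A minimal sketch of summing the per-row counters into one corpus-level total; `total_entity_counts` is hypothetical, and the sample rows copy the first, second, and fourth rows of the output above.

```python
from collections import Counter


def total_entity_counts(per_row_counts):
    """Sum per-row Counters of entity labels into one corpus-level Counter."""
    totals = Counter()
    for counts in per_row_counts:
        totals.update(counts)  # Counter.update adds counts rather than replacing them
    return totals


rows = [
    Counter({"PERSON": 1, "NORP": 1}),
    Counter({"ORG": 1, "DATE": 1, "PERSON": 1}),
    Counter(),  # abstract with no entities
]
totals = total_entity_counts(rows)
print(totals.most_common())
```

On the DataFrame this would be `total_entity_counts(df['entity_counts'])`. Labels such as `compris` tagged as PERSON suggest the model struggles with stemmed, lowercased text, so the totals are probably more meaningful when the entities are extracted from the original `Abstract` column.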


In [63]:
# Save the results into a new CSV file
df.to_csv('syntax_structure_analysis.csv', index=False)

print("Results saved to syntax_structure_analysis.csv")

Results saved to syntax_structure_analysis.csv


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [64]:
'I have submitted two CSV files in the comment section: "cleaned_top_10000_abstracts.csv" for Question 2 and "syntax_structure_analysis.csv" for Question 3.'

'I have submitted two CSV files in the comment section: "cleaned_top_10000_abstracts.csv" for Question 2 and "syntax_structure_analysis.csv" for Question 3.'

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [66]:
# Write your response below
'The task was a good challenge that called for a combination of real-world NLP, data processing, and software development skills. I enjoyed both the practical applications and the theoretical depth; however, managing the complexity of several NLP components while preserving their output required a lot of work.'

'The task was a good challenge that called for a combination of real-world NLP, data processing, and software development skills. I enjoyed both the practical applications and the theoretical depth; however, managing the complexity of several NLP components while preserving their output required a lot of work.'