<a href="https://colab.research.google.com/github/nagamani0604/Nagamani_INFO5731_Fall2024/blob/main/Somireddy_Nagamani_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **either of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.
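
For option (4), a paginated search endpoint may cap how deep a single query can page. The sketch below plans `(offset, limit)` pairs under such a cap, so a collector can stop before requesting a page the server would reject. The function name `plan_pages` and the cap value are illustrative assumptions: the run log in this notebook shows requests succeeding through offset 900 and failing with a 400 at offset 1000, which is what motivates a cap of 1000 here.

```python
def plan_pages(total_wanted, page_size=100, max_results=1000):
    """Return (offset, limit) pairs that never page past max_results.

    max_results is an assumed per-query cap inferred from the run log,
    not a constant taken from the API documentation.
    """
    reachable = min(total_wanted, max_results)
    pages = []
    offset = 0
    while offset < reachable:
        # Shrink the last page so offset + limit stays within the cap.
        limit = min(page_size, reachable - offset)
        pages.append((offset, limit))
        offset += limit
    return pages


pages = plan_pages(total_wanted=10000)
print(len(pages), pages[0], pages[-1])  # 10 (0, 100) (900, 100)
```

Under this assumption a single query yields at most 1000 records, so reaching 10000 abstracts would take several distinct queries rather than repeated ones.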


In [48]:

!pip install requests pandas

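import os
import time
import urllib.parse

import pandas as pd
import requests

# Read the API key from the environment instead of hard-coding a secret in the notebook.
# The variable name SEMANTIC_SCHOLAR_API_KEY is an assumed convention; set it before running.
API_KEY = os.environ.get('SEMANTIC_SCHOLAR_API_KEY')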

def fetch_paper_data(query, offset=0, limit=100):
    """
    Fetch paper data from the Semantic Scholar API.

    Args:
        query (str): The search query.
        offset (int): The offset for pagination.
        limit (int): The number of records to retrieve per request.

    Returns:
        dict: The JSON response from the API.
    """
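    # Pass the query parameters as a dict so that requests handles URL encoding.
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        'query': query,
        'fields': 'title,abstract',
        'offset': offset,
        'limit': limit,
    }
    headers = {'x-api-key': API_KEY}

    try:
        # A timeout keeps a stalled connection from hanging the whole run.
        response = requests.get(url, params=params, headers=headers, timeout=30)
    except requests.RequestException as exc:
        print(f"Error fetching data: {exc}")
        return None

    if response.status_code != 200:
        print(f"Error fetching data: {response.status_code}")
        return None

    return response.json()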

def collect_abstracts(queries, total_abstracts=10000):
    """
    Collect abstracts from Semantic Scholar based on given queries.

    Args:
        queries (list): A list of search queries.
        total_abstracts (int): The total number of abstracts to collect.

    Returns:
        list: A list of dictionaries containing the title and abstract of papers.
    """
    all_abstracts = []

    for query in queries:
        offset = 0
        while len(all_abstracts) < total_abstracts:
            print(f"Fetching {total_abstracts - len(all_abstracts)} more abstracts for query: '{query}'")
            data = fetch_paper_data(query, offset)

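            # Stop paging this query on an error response or an empty page;
            # without the empty-page check the offset would never advance.
            if data is None or not data.get('data'):
                break

            for paper in data['data']:
                # 'abstract' is requested via `fields`, so the key is present even
                # when the value is missing; such rows are kept and read back as NaN.
                if 'abstract' in paper:
                    all_abstracts.append({
                        'Title': paper['title'],
                        'Abstract': paper['abstract']
                    })

            offset += len(data['data'])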

            time.sleep(1)

        if len(all_abstracts) >= total_abstracts:
            break

    return all_abstracts[:total_abstracts]

def save_to_csv(data, filename='papers_abstracts.csv'):
    """
    Save the collected data to a CSV file.

    Args:
        data (list): The data to save.
        filename (str): The name of the output CSV file.
    """
    if not data:
        print("No data to save.")
        return

    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Saved {len(data)} abstracts to {filename}.")

if __name__ == "__main__":
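    # Each query is listed twice. collect_abstracts resets the offset to 0 for
    # every entry, so the second pass requests the same result pages again.
    queries = [
        "machine learning",
        "machine learning",
        "data science",
        "data science",
        "artificial intelligence",
        "artificial intelligence",
        "information extraction",
        "information extraction",
        "deep learning",
        "deep learning",
    ]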

    abstracts_info = collect_abstracts(queries, total_abstracts=10000)
    save_to_csv(abstracts_info)


Fetching 10000 more abstracts for query: 'machine learning'
Fetching 9900 more abstracts for query: 'machine learning'
Fetching 9800 more abstracts for query: 'machine learning'
Fetching 9700 more abstracts for query: 'machine learning'
Fetching 9600 more abstracts for query: 'machine learning'
Fetching 9500 more abstracts for query: 'machine learning'
Fetching 9400 more abstracts for query: 'machine learning'
Fetching 9300 more abstracts for query: 'machine learning'
Fetching 9200 more abstracts for query: 'machine learning'
Fetching 9100 more abstracts for query: 'machine learning'
Fetching 9000 more abstracts for query: 'machine learning'
Error fetching data: 400
Fetching 9000 more abstracts for query: 'machine learning'
Fetching 8900 more abstracts for query: 'machine learning'
Fetching 8800 more abstracts for query: 'machine learning'
Fetching 8700 more abstracts for query: 'machine learning'
Fetching 8600 more abstracts for query: 'machine learning'
Fetching 8500 more abstracts f

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.
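
The effect of steps (1) to (4) can be seen on a single illustrative sentence. This is a minimal sketch using only the standard library: the small stopword set is a stand-in for a real stopword list, and `clean_text` is a hypothetical helper name. Stemming and lemmatization are left out because they need an NLP library.

```python
import re

# Tiny illustrative stopword set; a real run would use a full stopword list.
STOPWORDS = {"a", "of", "the", "is", "we"}


def clean_text(text):
    text = re.sub(r"[^\w\s]", "", text)  # (1) drop punctuation and other non-word characters
    text = re.sub(r"\d+", "", text)      # (2) drop digits
    text = text.lower()                  # (4) lowercase before the stopword lookup
    words = [w for w in text.split() if w not in STOPWORDS]  # (3) drop stopwords
    return " ".join(words)


print(clean_text("We present Fashion-MNIST, a dataset of 70,000 images."))
# present fashionmnist dataset images
```

Removing punctuation merges hyphenated terms into one token, which is why `Fashion-MNIST` becomes `fashionmnist`.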

In [49]:

!pip install pandas nltk

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

df = pd.read_csv('papers_abstracts.csv')

if 'Abstract' not in df.columns:
    raise ValueError("The 'Abstract' column is not found in the CSV file.")

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def to_string(text):
    if isinstance(text, str):
        return text
    return ''

def remove_noise(text):
    text = to_string(text)
    return re.sub(r'[^\w\s]', '', text)

def remove_numbers(text):
    text = to_string(text)
    return re.sub(r'\d+', '', text)

def remove_stopwords(text):
    text = to_string(text)
    words = text.split()
    return ' '.join([word for word in words if word.lower() not in stop_words])

def to_lowercase(text):
    text = to_string(text)
    return text.lower()

def stem_text(text):
    text = to_string(text)
    words = text.split()
    return ' '.join([ps.stem(word) for word in words])

def lemmatize_text(text):
    text = to_string(text)
    words = text.split()
    return ' '.join([lemmatizer.lemmatize(word) for word in words])

df['Cleaned_Abstract'] = df['Abstract'].apply(remove_noise)
df['Cleaned_Abstract'] = df['Cleaned_Abstract'].apply(remove_numbers)
df['Cleaned_Abstract'] = df['Cleaned_Abstract'].apply(remove_stopwords)
df['Cleaned_Abstract'] = df['Cleaned_Abstract'].apply(to_lowercase)

df['Stemmed_Abstract'] = df['Cleaned_Abstract'].apply(stem_text)
df['Lemmatized_Abstract'] = df['Cleaned_Abstract'].apply(lemmatize_text)

df.to_csv('cleaned_papers_abstracts.csv', index=False)

print(df[['Abstract', 'Cleaned_Abstract', 'Stemmed_Abstract', 'Lemmatized_Abstract']].head())






[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1  TensorFlow is a machine learning system that o...   
2  TensorFlow is an interface for expressing mach...   
3                                                NaN   
4  The goal of precipitation nowcasting is to pre...   

                                    Cleaned_Abstract  \
0  present fashionmnist new dataset comprising x ...   
1  tensorflow machine learning system operates la...   
2  tensorflow interface expressing machine learni...   
3                                                      
4  goal precipitation nowcasting predict future r...   

                                    Stemmed_Abstract  \
0  present fashionmnist new dataset compris x gra...   
1  tensorflow machin learn system oper larg scale...   
2  tensorflow interfac express machin learn algor...   
3                                                      
4  goal precipit nowcast predict futur rainfal

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.
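
For part (1), Penn Treebank tags group naturally by prefix: noun tags start with `NN`, verb tags with `VB`, adjective tags with `JJ`, and adverb tags with `RB`. The sketch below counts the four categories from tokens that are already tagged. The helper name `count_coarse_pos` and the sample tokens are illustrative, and the tagging itself is assumed to come from a POS tagger.

```python
# Map the two-letter Penn Treebank tag prefix to a coarse category.
PREFIX_TO_CATEGORY = {"NN": "Nouns", "VB": "Verbs", "JJ": "Adjectives", "RB": "Adverbs"}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = dict.fromkeys(PREFIX_TO_CATEGORY.values(), 0)
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category is not None:  # tags such as IN or DT are ignored
            counts[category] += 1
    return counts


sample = [("new", "JJ"), ("dataset", "NN"), ("comprising", "VBG"),
          ("images", "NNS"), ("per", "IN"), ("freely", "RB")]
print(count_coarse_pos(sample))
# {'Nouns': 2, 'Verbs': 1, 'Adjectives': 1, 'Adverbs': 1}
```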

In [50]:
!pip install pandas nltk spacy

import pandas as pd
import nltk
from nltk import pos_tag, word_tokenize, ne_chunk
from collections import Counter

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

df = pd.read_csv('cleaned_papers_abstracts.csv')

df['Cleaned_Abstract'] = df['Cleaned_Abstract'].astype(str)

def pos_tagging(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    return pos_tags

def count_pos(pos_tags):
    counts = Counter(tag for word, tag in pos_tags)
    return {
        'Nouns': counts['NN'] + counts['NNS'] + counts['NNP'] + counts['NNPS'],
        'Verbs': counts['VB'] + counts['VBD'] + counts['VBG'] + counts['VBN'] + counts['VBP'] + counts['VBZ'],
        'Adjectives': counts['JJ'] + counts['JJR'] + counts['JJS'],
        'Adverbs': counts['RB'] + counts['RBR'] + counts['RBS']
    }

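def parse_sentences(text):
    # NOTE: ne_chunk is a named-entity chunker. It groups tagged tokens under a
    # flat S node and does not build a full constituency tree; no dependency
    # parse is produced in this cell.
    return ne_chunk(pos_tagging(text))

def named_entity_recognition(text):
    # The input is lowercased, stopword-free text, so capitalization cues are
    # gone; this likely explains why no entity chunks appear in the output.
    return ne_chunk(pos_tagging(text))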

pos_counts = []
constituency_parsing = []
ner_results = []

for abstract in df['Cleaned_Abstract']:
    pos_tags = pos_tagging(abstract)
    pos_counts.append(count_pos(pos_tags))

    constituency_tree = parse_sentences(abstract)
    constituency_parsing.append(constituency_tree)

    ner_entities = named_entity_recognition(abstract)
    ner_results.append(ner_entities)

pos_counts_df = pd.DataFrame(pos_counts)
df = pd.concat([df, pos_counts_df], axis=1)

print("Parts of Speech Counts:")
print(df[['Nouns', 'Verbs', 'Adjectives', 'Adverbs']].sum())
print("\nConstituency Parsing Example (First Abstract):")
print(constituency_parsing[0])
print("\nNamed Entity Recognition Results (First Abstract):")
print(ner_results[0])

df.to_csv('syntax_structure_analysis.csv', index=False)

def visualize_tree(tree):
    tree.pretty_print()

if len(constituency_parsing) > 0:
    visualize_tree(constituency_parsing[0])




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Parts of Speech Counts:
Nouns         379338
Verbs         153280
Adjectives    163794
Adverbs        34678
dtype: int64

Constituency Parsing Example (First Abstract):
(S
  present/JJ
  fashionmnist/JJ
  new/JJ
  dataset/NN
  comprising/VBG
  x/JJ
  grayscale/JJ
  images/NNS
  fashion/NN
  products/NNS
  categories/NNS
  images/NNS
  per/IN
  category/NN
  training/NN
  set/VBN
  images/NNS
  test/VBP
  set/VBN
  images/NNS
  fashionmnist/VBP
  intended/VBN
  serve/VBP
  direct/JJ
  dropin/NN
  replacement/NN
  original/JJ
  mnist/NN
  dataset/NN
  benchmarking/NN
  machine/NN
  learning/VBG
  algorithms/JJ
  shares/NNS
  image/NN
  size/NN
  data/NNS
  format/NN
  structure/NN
  training/VBG
  testing/VBG
  splits/NNS
  dataset/VBN
  freely/RB
  available/JJ
  https/NN
  url/NN)

Named Entity Recognition Results (First Abstract):
(S
  present/JJ
  fashionmnist/JJ
  new/JJ
  dataset/NN
  comprising/VBG
  x/JJ
  grayscale/JJ
  images/NNS
  fashion/NN
  products/NNS
  categories/NNS
  i

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below

Handling API authentication, data quotas, and rate limiting is difficult when using external APIs like Semantic Scholar.
Troubleshooting issues such as 400 errors, which can result from query formatting or request limits, adds to the complexity.
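
One common way to cope with rate limits is to retry a failed request after an increasing delay. The sketch below shows exponential backoff around a generic fetch function. `fetch_with_backoff`, its parameters, and the choice of which status codes to retry are assumptions for illustration, not Semantic Scholar's documented policy. A 400 is treated as non-retryable here because repeating a malformed or out-of-range request would not be expected to help.

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # assumed set of transient statuses


def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, fails permanently, or retries run out.

    fetch is any zero-argument callable returning (status_code, payload).
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if status not in RETRYABLE or attempt == max_retries:
            return None
        sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
    return None


# Simulated endpoint: rate-limited twice, then successful.
responses = iter([(429, None), (429, None), (200, {"data": []})])
delays = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)  # {'data': []} [1.0, 2.0]
```

Passing `sleep` as a parameter lets the example record the delays instead of actually waiting.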

Using Python to automate data collection and manipulation, from web scraping to storing the results in CSV files, is very powerful.
It shows how much can be done with programming in a comparatively short time.

Ten to fifteen days would be a reasonable time to complete an assignment like this, allowing for detailed testing, research into API documentation, and implementing solutions for any unexpected issues that arise.
