<a href="https://colab.research.google.com/github/nakulreddy0107/NakulReddy_INFO5731_Spring2025/blob/main/Sarasani_Nakul_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [1]:
import requests
import pandas as pd
import time
from tqdm import tqdm
import json

def fetch_papers(query, limit=1000):
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    headers = {
        # Replace with your API key if you have one
        # "x-api-key": "YOUR_API_KEY"
    }

    papers = []
    offset = 0
    batch_size = 100
    pbar = tqdm(total=min(limit, 1000), desc=f"Fetching papers for '{query}'")

    while offset < limit:
        params = {
            "query": query,
            "limit": min(batch_size, limit - offset),
            "offset": offset,
            "fields": "title,abstract,year,authors,url"
        }

        try:
            response = requests.get(base_url, headers=headers, params=params)
            response.raise_for_status()
            data = response.json()
            if not data.get('data'):
                break
            valid_papers = [p for p in data['data'] if p.get('abstract')]
            papers.extend(valid_papers)
            pbar.update(len(valid_papers))
            offset += batch_size
            time.sleep(2)

        except requests.exceptions.RequestException as e:
            print(f"Error fetching data: {e}")
            time.sleep(5)
            continue

    pbar.close()
    return papers

def save_to_csv(papers, filename):
    df = pd.DataFrame(papers)
    df['authors'] = df['authors'].apply(lambda x: ', '.join([author['name'] for author in x]) if x else '')
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Saved {len(df)} papers to {filename}")

def main():
    queries = [
        "machine learning",
        "data science",
        "artificial intelligence",
        "information extraction"
    ]
    all_papers = []

    for query in queries:
        print(f"\nProcessing query: {query}")
        papers = fetch_papers(query, limit=1000)
        all_papers.extend(papers)
        filename = f"papers_{query.replace(' ', '_')}_{time.strftime('%Y%m%d')}.csv"
        save_to_csv(papers, filename)
    print("\nSaving combined results...")
    filename = f"all_papers_{time.strftime('%Y%m%d')}.csv"
    save_to_csv(all_papers, filename)

    print("\nDone! Summary:")
    print(f"Total papers collected: {len(all_papers)}")
    print("Individual files saved for each query")
    print(f"Combined results saved to {filename}")

if __name__ == "__main__":
    main()


#The CSV file generated
"all_papers_20250219.csv"


Processing query: machine learning


Fetching papers for 'machine learning':   0%|          | 0/1000 [00:00<?, ?it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=0&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':   6%|▌         | 62/1000 [00:06<01:38,  9.52it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=100&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=100&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=100&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':  13%|█▎        | 133/1000 [00:25<02:56,  4.91it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=200&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':  21%|██        | 207/1000 [00:34<02:10,  6.10it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=300&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':  34%|███▍      | 339/1000 [00:45<01:15,  8.81it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=500&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':  49%|████▉     | 489/1000 [00:57<00:46, 10.96it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':  49%|████▉     | 489/1000 [01:09<00:46, 10.96it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=machine+learning&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'machine learning':  72%|███████▏  | 723/1000 [01:46<00:40,  6.77it/s]


Saved 723 papers to papers_machine_learning_20250219.csv

Processing query: data science


Fetching papers for 'data science':   0%|          | 0/1000 [00:00<?, ?it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=0&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'data science':  28%|██▊       | 283/1000 [00:16<00:37, 19.15it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=400&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=400&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'data science':  44%|████▍     | 440/1000 [00:34<00:45, 12.34it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'data science':  44%|████▍     | 440/1000 [00:53<00:45, 12.34it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'data science':  60%|█████▉    | 595/1000 [01:28<01:19,  5.08it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=800&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=800&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=800&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'data science':  68%|██████▊   | 680/1000 [01:47<01:05,  4.85it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=900&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=900&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=data+science&limit=100&offset=900&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'data science':  76%|███████▋  | 764/1000 [02:08<00:39,  5.94it/s]


Saved 764 papers to papers_data_science_20250219.csv

Processing query: artificial intelligence


Fetching papers for 'artificial intelligence':  10%|▉         | 95/1000 [00:05<00:52, 17.16it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=200&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=200&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=200&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=200&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  16%|█▌        | 159/1000 [00:29<03:13,  4.35it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=300&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  22%|██▎       | 225/1000 [00:37<02:22,  5.43it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=400&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=400&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  29%|██▉       | 291/1000 [00:52<02:20,  5.05it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=500&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  35%|███▌      | 354/1000 [01:00<01:54,  5.64it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  35%|███▌      | 354/1000 [01:14<01:54,  5.64it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  43%|████▎     | 431/1000 [01:42<02:57,  3.21it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  47%|████▋     | 471/1000 [01:56<02:49,  3.12it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=800&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=800&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  54%|█████▍    | 542/1000 [02:09<02:05,  3.64it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=900&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=artificial+intelligence&limit=100&offset=900&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'artificial intelligence':  61%|██████▏   | 613/1000 [02:25<01:32,  4.20it/s]


Saved 613 papers to papers_artificial_intelligence_20250219.csv

Processing query: information extraction


Fetching papers for 'information extraction':   8%|▊         | 77/1000 [00:01<00:12, 75.52it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=100&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  24%|██▍       | 245/1000 [00:12<00:40, 18.85it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=300&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=300&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=300&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  40%|████      | 401/1000 [00:34<00:57, 10.33it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=500&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=500&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=500&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  40%|████      | 401/1000 [00:48<00:57, 10.33it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=500&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  49%|████▉     | 490/1000 [00:58<01:22,  6.19it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=600&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  58%|█████▊    | 576/1000 [01:12<01:07,  6.25it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl
Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=700&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  66%|██████▌   | 661/1000 [01:26<00:54,  6.23it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=800&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  74%|███████▍  | 741/1000 [01:34<00:37,  6.98it/s]

Error fetching data: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=information+extraction&limit=100&offset=900&fields=title%2Cabstract%2Cyear%2Cauthors%2Curl


Fetching papers for 'information extraction':  82%|████████▏ | 821/1000 [01:45<00:23,  7.76it/s]


Saved 821 papers to papers_information_extraction_20250219.csv

Saving combined results...
Saved 2921 papers to all_papers_20250219.csv

Done! Summary:
Total papers collected: 2921
Individual files saved for each query
Combined results saved to all_papers_20250219.csv


# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [3]:

import nltk
nltk.download('punkt_tab')
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string
from tqdm import tqdm

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

def remove_noise(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = ' '.join(text.split())
    return text

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)

def apply_stemming(text):
    stemmer = PorterStemmer()
    word_tokens = word_tokenize(text)
    stemmed_text = [stemmer.stem(word) for word in word_tokens]
    return ' '.join(stemmed_text)

def apply_lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    word_tokens = word_tokenize(text)
    lemmatized_text = [lemmatizer.lemmatize(word) for word in word_tokens]
    return ' '.join(lemmatized_text)

def clean_text(text):
    if pd.isna(text):
        return pd.Series({
            'cleaned_no_noise': '',
            'cleaned_no_numbers': '',
            'cleaned_no_stopwords': '',
            'cleaned_lowercase': '',
            'cleaned_stemmed': '',
            'cleaned_lemmatized': ''
        })
    cleaned_no_noise = remove_noise(text)
    cleaned_no_numbers = remove_numbers(cleaned_no_noise)
    cleaned_no_stopwords = remove_stopwords(cleaned_no_numbers)
    cleaned_lowercase = cleaned_no_stopwords.lower()
    cleaned_stemmed = apply_stemming(cleaned_lowercase)
    cleaned_lemmatized = apply_lemmatization(cleaned_lowercase)

    return pd.Series({
        'cleaned_no_noise': cleaned_no_noise,
        'cleaned_no_numbers': cleaned_no_numbers,
        'cleaned_no_stopwords': cleaned_no_stopwords,
        'cleaned_lowercase': cleaned_lowercase,
        'cleaned_stemmed': cleaned_stemmed,
        'cleaned_lemmatized': cleaned_lemmatized
    })

def main():
    print("Reading CSV file...")
    df = pd.read_csv('/content/all_papers_20250219.csv')
    print("Cleaning text data...")
    tqdm.pandas()
    cleaned_data = df['abstract'].progress_apply(clean_text)
    df = pd.concat([df, cleaned_data], axis=1)
    output_file = 'cleaned_papers_20240218.csv'
    print(f"\nSaving cleaned data to {output_file}")
    df.to_csv(output_file, index=False)
    print("\nExample of cleaning steps for the first abstract:")
    example = df.iloc[0]
    print("\nOriginal text:")
    print(example['abstract'][:200] + "...")
    print("\nAfter removing noise:")
    print(example['cleaned_no_noise'][:200] + "...")
    print("\nAfter removing numbers:")
    print(example['cleaned_no_numbers'][:200] + "...")
    print("\nAfter removing stopwords:")
    print(example['cleaned_no_stopwords'][:200] + "...")
    print("\nAfter lowercasing:")
    print(example['cleaned_lowercase'][:200] + "...")
    print("\nAfter stemming:")
    print(example['cleaned_stemmed'][:200] + "...")
    print("\nAfter lemmatization:")
    print(example['cleaned_lemmatized'][:200] + "...")

if __name__ == "__main__":
    main()


#The CSV file generated
"cleaned_papers_20240218.csv"

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Reading CSV file...
Cleaning text data...


100%|██████████| 2921/2921 [00:18<00:00, 160.48it/s]



Saving cleaned data to cleaned_papers_20240218.csv

Example of cleaning steps for the first abstract:

Original text:
We present Fashion-MNIST, a new dataset comprising of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the t...

After removing noise:
We present FashionMNIST a new dataset comprising of x grayscale images of fashion products from categories with images per category The training set has images and the test set has images FashionMNIST...

After removing numbers:
We present FashionMNIST a new dataset comprising of x grayscale images of fashion products from categories with images per category The training set has images and the test set has images FashionMNIST...

After removing stopwords:
present FashionMNIST new dataset comprising x grayscale images fashion products categories images per category training set images test set images FashionMNIST intended serve direct dropin repla

# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [4]:
import pandas as pd
import spacy
from collections import Counter
from tqdm import tqdm
import json

class TextAnalyzer:
    def __init__(self):
        self.nlp = spacy.load('en_core_web_sm')

    def analyze_pos(self, doc):
        """Analyze POS tags in the document"""
        pos_counts = {
            'Nouns': len([token for token in doc if token.pos_ == 'NOUN' or token.pos_ == 'PROPN']),
            'Verbs': len([token for token in doc if token.pos_ == 'VERB']),
            'Adjectives': len([token for token in doc if token.pos_ == 'ADJ']),
            'Adverbs': len([token for token in doc if token.pos_ == 'ADV'])
        }
        pos_examples = {}
        for token in doc:
            if token.pos_ not in pos_examples:
                pos_examples[token.pos_] = []
            if len(pos_examples[token.pos_]) < 3:
                pos_examples[token.pos_].append(token.text)

        return pos_counts, pos_examples

    def create_dependency_tree_string(self, doc):
        """Create a string representation of the dependency tree"""
        tree_strings = []
        for sent in doc.sents:
            words = [(token.text, token.dep_) for token in sent]
            for i, (word, dep) in enumerate(words):
                token = sent[i]
                head_idx = token.head.i - sent.start

                if token.dep_ == "ROOT":
                    tree_str = f"ROOT --> {word}"
                else:
                    if head_idx < i:
                        head_word = words[head_idx][0]
                        tree_str = f"  {head_word} --{dep}--> {word}"
                    else:
                        continue

                tree_strings.append(tree_str)

        return "\n".join(tree_strings)

    def analyze_parsing(self, doc):

        dep_triples = []
        for token in doc:
            dep_triples.append((token.text, token.dep_, token.head.text))
        dep_tree = self.create_dependency_tree_string(doc)

        return dep_triples, dep_tree

    def analyze_entities(self, doc):
        entity_counts = Counter()
        entities_list = {}

        for ent in doc.ents:
            entity_counts[ent.label_] += 1
            if ent.label_ not in entities_list:
                entities_list[ent.label_] = []
            if len(entities_list[ent.label_]) < 5:
                entities_list[ent.label_].append(ent.text)

        return dict(entity_counts), entities_list

def analyze_text(text, analyzer):
    if pd.isna(text) or not str(text).strip():
        return None

    try:
        text = str(text).strip()
        doc = analyzer.nlp(text)
        pos_counts, pos_examples = analyzer.analyze_pos(doc)
        dep_triples, dep_tree = analyzer.analyze_parsing(doc)
        entity_counts, entities_list = analyzer.analyze_entities(doc)

        return {
            'pos_counts': pos_counts,
            'pos_examples': pos_examples,
            'dep_triples': dep_triples,
            'dep_tree': dep_tree,
            'entity_counts': entity_counts,
            'entities_list': entities_list
        }
    except Exception as e:
        print(f"Warning: Error analyzing text: {e}")
        return None

def main():
    analyzer = TextAnalyzer()

    print("Reading cleaned data...")
    df = pd.read_csv('cleaned_papers_20240218.csv')
    sample_text = None
    for text in df['cleaned_no_stopwords']:
        if pd.notna(text) and str(text).strip():
            sample_text = text
            break

    if sample_text is None:
        print("No valid sample text found in the dataset!")
        return

    print("\n=== DETAILED ANALYSIS OF SAMPLE TEXT ===")
    print("\nSample text:")
    print(sample_text[:200] + "...")

    sample_analysis = analyze_text(sample_text, analyzer)

    if sample_analysis:
        print("\n1. PARTS OF SPEECH ANALYSIS")
        print("\nPOS Counts:")
        for pos, count in sample_analysis['pos_counts'].items():
            print(f"{pos}: {count}")

        print("\nPOS Examples:")
        for pos, examples in sample_analysis['pos_examples'].items():
            print(f"{pos}: {', '.join(examples)}")

        print("\n2. DEPENDENCY PARSING")
        print("\nDependency Tree:")
        print(sample_analysis['dep_tree'])

        print("\n3. NAMED ENTITY RECOGNITION")
        print("\nEntity Counts:")
        for ent_type, count in sample_analysis['entity_counts'].items():
            print(f"{ent_type}: {count}")
            print(f"Examples: {', '.join(sample_analysis['entities_list'][ent_type])}")
    print("\n=== PROCESSING FULL DATASET ===")

    total_pos_counts = Counter()
    total_entity_counts = Counter()
    dataset_analyses = []
    for text in tqdm(df['cleaned_no_stopwords']):
        analysis = analyze_text(text, analyzer)
        if analysis:
            dataset_analyses.append(analysis)
            total_pos_counts.update(analysis['pos_counts'])
            total_entity_counts.update(analysis['entity_counts'])
    print("\nOverall Statistics:")
    print("\nTotal POS Counts:")
    for pos, count in total_pos_counts.items():
        print(f"{pos}: {count}")

    print("\nTotal Entity Counts:")
    for entity_type, count in sorted(total_entity_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"{entity_type}: {count}")
    with open('text_analysis_results.json', 'w') as f:
        json.dump(dataset_analyses, f, indent=2)

    print("\nFull analysis results saved to 'text_analysis_results.json'")

if __name__ == "__main__":
    main()

#The Json file generated
"text_analysis_results.json"

Reading cleaned data...

=== DETAILED ANALYSIS OF SAMPLE TEXT ===

Sample text:
present FashionMNIST new dataset comprising x grayscale images fashion products categories images per category training set images test set images FashionMNIST intended serve direct dropin replacement...

1. PARTS OF SPEECH ANALYSIS

POS Counts:
Nouns: 26
Verbs: 9
Adjectives: 7
Adverbs: 1

POS Examples:
ADJ: present, new, direct
NUM: FashionMNIST, FashionMNIST
NOUN: dataset, grayscale, fashion
VERB: comprising, images, set
PUNCT: x
ADP: per
PROPN: MNIST, URL
ADV: freely

2. DEPENDENCY PARSING

Dependency Tree:
  dataset --acl--> comprising
  dataset --punct--> x
  dataset --appos--> grayscale
  images --dobj--> images
  images --prep--> per
  per --pobj--> images
  images --dobj--> images
  images --acl--> FashionMNIST
  intended --xcomp--> serve
  serve --dobj--> learning
  images --conj--> splits
ROOT --> dataset
  dataset --dobj--> URL

3. NAMED ENTITY RECOGNITION

Entity Counts:
ORG: 1
Examples: Fashion

100%|██████████| 2921/2921 [01:13<00:00, 39.64it/s]



Overall Statistics:

Total POS Counts:
Nouns: 168044
Verbs: 56083
Adjectives: 46331
Adverbs: 10692

Total Entity Counts:
ORG: 7287
PERSON: 1542
CARDINAL: 1520
ORDINAL: 584
DATE: 576
NORP: 514
GPE: 416
PRODUCT: 307
LOC: 129
WORK_OF_ART: 101
LAW: 66
FAC: 40
TIME: 29
LANGUAGE: 28
EVENT: 19
MONEY: 6
PERCENT: 3
QUANTITY: 3

Full analysis results saved to 'text_analysis_results.json'


# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [5]:
import requests
from bs4 import BeautifulSoup
import csv
import time

BASE_URL = "https://github.com/marketplace?type=actions&page="
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/json',
    'Accept-Language': 'en-US,en;q=0.9',
}
OUTPUT_FILE = "github_marketplace_actions.csv"

def scrape_github_marketplace(max_pages=10, delay=2):
    data = []
    for page in range(1, max_pages + 1):
        url = BASE_URL + str(page)
        print(url)
        response = requests.get(url, headers=headers)

        if response.status_code != 200:
            print(f"Failed to retrieve page {page}: {response.status_code}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        action_cards = soup.find_all("div", class_="flex-1")

        if not action_cards:
            print("No more actions found, stopping scrape.")
            break

        for card in action_cards:
            name_tag = card.find("h3")
            desc_tag = card.find("p")
            link_tag = card.find("a")

            name = name_tag.text.strip() if name_tag else "No Name"
            description = desc_tag.text.strip() if desc_tag else "No Description"
            url = "https://github.com" + link_tag["href"] if link_tag else "No URL"

            data.append([name, description, url, page])

        print(f"Scraped page {page}")
        time.sleep(delay)

    save_to_csv(data)

def save_to_csv(data):
    with open(OUTPUT_FILE, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Description", "URL", "Page Number"])
        writer.writerows(data)
    print(f"Data saved to {OUTPUT_FILE}")

if __name__ == "__main__":
    scrape_github_marketplace(max_pages=10, delay=2)

#The CSV file generated
"github_marketplace_actions.csv"


https://github.com/marketplace?type=actions&page=1
Scraped page 1
https://github.com/marketplace?type=actions&page=2
Scraped page 2
https://github.com/marketplace?type=actions&page=3
Scraped page 3
https://github.com/marketplace?type=actions&page=4
Scraped page 4
https://github.com/marketplace?type=actions&page=5
Scraped page 5
https://github.com/marketplace?type=actions&page=6
Scraped page 6
https://github.com/marketplace?type=actions&page=7
Scraped page 7
https://github.com/marketplace?type=actions&page=8
Scraped page 8
https://github.com/marketplace?type=actions&page=9
Scraped page 9
https://github.com/marketplace?type=actions&page=10
Scraped page 10
Data saved to github_marketplace_actions.csv


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [6]:
import pandas as pd
import re
file_path = "github_marketplace_actions.csv"
df = pd.read_csv(file_path)
custom_stopwords = set([
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not",
    "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with"
])
lemmatization_dict = {
    "scans": "scan",
    "generators": "generator",
    "plugins": "plugin",
    "actions": "action",
    "tools": "tool"
}

def simple_clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', str(text))
    tokens = text.lower().split()
    tokens = [lemmatization_dict.get(word, word) for word in tokens if word not in custom_stopwords]
    return ' '.join(tokens)
df['Product Name'] = df['Product Name'].apply(simple_clean_text)
df['Description'] = df['Description'].apply(simple_clean_text)
df = df[(df['Product Name'] != 'name') & (df['Description'] != 'description')]
df = df[df['URL'].str.startswith("https://github.com/marketplace/actions/")]
df.to_csv("cleaned_github_marketplace_actions.csv", index=False)
print("Data cleaning complete. Saved to 'cleaned_github_marketplace_actions.csv'")

#The CSV file generated
"cleaned_github_marketplace_actions.csv"

Data cleaning complete. Saved to 'cleaned_github_marketplace_actions.csv'


#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [7]:
import tweepy
bearer_token = "AAAAAAAAAAAAAAAAAAAAALx4zQEAAAAA3XWhS9AlPYvu05rUtU%2B5R23OsqA%3DUFKNB6nfwGvDr1AmbVXTNjzkEeO466mxdElHntuY5zYrPSXqp6"
client = tweepy.Client(bearer_token)
query = "#MachineLearning OR #AI OR #ArtificialIntelligence -is:retweet lang:en"
tweets = client.search_recent_tweets(query=query, max_results=99)
tweets_data = []
for tweet in tweets.data:
    tweet_data = {
        'tweet_id': tweet.id,
        'username': tweet.author_id,
        'tweet_text': tweet.text
    }
    tweets_data.append(tweet_data)
import pandas as pd
df = pd.DataFrame(tweets_data)
df.to_csv('cleaned_tweets.csv', index=False)
print("Data saved to cleaned_tweets.csv")



#The CSV file generated
"cleaned_tweets.csv"




Data saved to cleaned_tweets.csv


In [None]:
# AI prompt for question 4 & 5 used ChatGPT

#4
# # (PART-1) Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

# The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

# The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

# (PART -2)

# Preprocess Data: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

# Perform Data Quality operations.

# Preprocessing: Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

# Data Quality: Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.

#5
# PART 1: Web Scrape tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.) The extracted data includes the tweet ID, username, and text.

# Part 2: Perform data cleaning procedures

# A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


#here is the link for the colab:
"https://colab.research.google.com/drive/1EVc0NllqygYk0ay4MLe_SiGeUf3ujkst?usp=sharing"

#I've uploaded all the csv files generated in the github for all thw questions.



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog