# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [1]:
# importing the data library
import requests
import time
import re
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import Counter

In [2]:
# defining the search query
query = 'machine learning'

# API URL for searching research paper
url = 'https://api.semanticscholar.org/graph/v1/paper/search/bulk'

# setting query parameters for API request
query_params = {
    'query': query,  # searching research papers related to machine learning
    'offset': 0,  # starting index for fetching results
    'limit': 100,  # fetching 100 papers at a time
    'fields': 'paperId,title,abstract'  # specifying the fields to include in the response
}

# creating an empty list to store research paper details
all_papers = []

# using loop function to fetch multipl several batches of results
for batch in range(100): # running loop for 100 times to collect more papers
    print(f'\nFetching Batch {batch + 1}...') # showing batch number

    # making the get request to API with query parameters
    response = requests.get(url, params=query_params)

    # checking if the response status is successful
    if response.status_code == 200:  # checking if the request is successful
        data = response.json()  # converting response to JSON format
        papers = data.get('data', [])  # extracting paper list from the response

        # looping through each paper and extracting required details
        for paper in papers:
            title = paper.get("title", "No Title") # getting title of the paper
            abstract = paper.get("abstract", "No Abstract") # getting abstract of the paper

            # adding paper details to the list
            all_papers.append({
                'Title': title,
                'Abstract': abstract
            })

            # printing the paper details
            print(f'Title: {title}')
            print(f'Abstract: {abstract}')
            print('-' * 40)

        # updating the offset for fetching the next batch of research papers
        query_params['offset'] += 100  # increasing offset by 100 to get results

        # adding a small delay to avoid hitting API rate limit
        time.sleep(5) # pausing run time for 5 seconds

    elif response.status_code == 429:  # if API returns too many requests error
        print('Too many requests! Waiting for 10 seconds before retrying...')
        time.sleep(10)  # waiting for 10 seconds before retrying

    else:
        print(f'Request failed with status code {response.status_code}: {response.text}')
        break  # stopping further details if there is an error

# displaying that script has been completed
print('Finished fetching papers.')

# changing data to a CSV file
df = pd.DataFrame(all_papers)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 An effective approach is proposed to evaluate the service life reliability of a multi-physics coupling structure of an insulated gate bipolar transistor (IGBT) module. The node-based smoothed finite element method with stabilization terms is firstly employed to construct an electrical-thermal-mechanical (ETM) coupling structure of the IGBT module, based on which the multi-physics responses can be accurately calculated to predict the service life of the IGBT module. By using the high-quality sample data obtained through the ETM coupling model, a Monte Carlo based active learning Kriging metamodel (AK-MCS) is developed to assess the service life reliability of the IGBT module, which can greatly reduce the computational cost needed by the surrogate model construction and reliability analysis. Numerical results show that the proposed ETM coupling structure can produce high-quality sample data of the IGBT dynamics and the AK-

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [3]:
# loading the data from the previously collected research papers
df = pd.DataFrame(all_papers) # creating a df

# downloading necessary NLTK resources
nltk.download('stopwords') # downloading stopwords dataset
nltk.download('wordnet')  # downloading wordnet for lemmatization
nltk.download('punkt') # downloading tokenizer dataset

# initializing the porterstemmer and wordnet lemmatizer
stemmer = PorterStemmer() # creating an instance of porterstemmer
lemmatizer = WordNetLemmatizer() # creating an instance of wordnet lemmatizer

# defining stopwords to remove common words that do not have any meaning
stop_words = set(stopwords.words('english'))

# defining a function to clean the text
def clean_text_remove_stopwords_stem_and_lemmatize(text):
    if text is None:
        return ''  # if the text is none, return an empty string

    # removing special characters and punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)  # Keep only letters, numbers, and spaces

    # removing numbers using regex
    text = re.sub(r'\d+', '', text)  # removing all digits

    # converting text to lowercase
    text = text.lower()

    # removing stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])

    # applying stemming to each word
    text = ' '.join([stemmer.stem(word) for word in text.split()])

    # applying lemmatization to each word
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

# applying the cleaning, stemming, and lemmatization function to title and abstract
df['Cleaned_title'] = df['Title'].apply(clean_text_remove_stopwords_stem_and_lemmatize)
df['Cleaned_Abstract'] = df['Abstract'].apply(clean_text_remove_stopwords_stem_and_lemmatize)

# showing few rows of the cleaned data
df[['Cleaned_title', 'Cleaned_Abstract']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,Cleaned_title,Cleaned_Abstract
0,insight household electr vehicl charg behavior...,era burgeon electr vehicl ev popular understan...
1,person predict respons smartphonedeliv medit t...,background medit app surg popular recent year ...
2,machin learn method quantifi role vulner hurri...,
3,abstract text summar lowresourc languag use de...,background human must abl cope huge amount inf...
4,detect ddo attack cloud comput environ use mac...,grow number cloudbas servic led rise threat di...


In [4]:
# exporting to csv
abstract_paper = df[['Cleaned_title', 'Cleaned_Abstract']]
abstract_paper.to_csv('cleaned_papers.csv', index=False)

# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [5]:
!pip install benepar

Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.6.0->benepar)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.6.0->benepar)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.6.0->benepar)
  Downloading nvidia_cublas_

In [7]:
!pip install benepar torch transformers



In [8]:
# importing data libraries
import nltk # importing NLTK for natural language processing
import os # importing os to manage file paths
import spacy  # importing spacy for NLP tasks
import benepar # importing benepar for parsing sentences
from spacy import displacy # importing displacy for visualizing NLP structures

In [9]:
# setting directory for storing NLTK data
nltk_data_path = os.path.expanduser("~") + "/nltk_data" # creating a path for NLTK data
os.makedirs(nltk_data_path, exist_ok=True) # creating the directory if it does not exist
nltk.data.path.append(nltk_data_path) # adding the directory to NLTK’s data path

# downloading necessary packages
nltk.download('punkt', download_dir=nltk_data_path) # downloading tokenizer dataset
nltk.download('averaged_perceptron_tagger', download_dir=nltk_data_path) # downloading POS tagging dataset
nltk.download('punkt_tab') # downloading additional tokenizer dataset
nltk.download('averaged_perceptron_tagger_eng') # downloading english POS tagger dataset

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [10]:
# reading cleaned data from CSV file
cleaned_dataframe = pd.read_csv('/content/cleaned_papers.csv')

In [11]:
# defining a function to perform POS tagging
def pos_tagging(text):
    if pd.isna(text) or not isinstance(text, str):  # handling missing values or non-string values
        return [], 0, 0, 0, 0 # returning empty values for missing text

    words = word_tokenize(text)  # tokenizing the text
    tagged_words = pos_tag(words)  # assigning POS tagging to words

    # counting occurrences of various POS categories
    pos_counts = Counter(tag for _, tag in tagged_words)# counting occurrences of each tag

   # defining sets of POS tags for different categories
    noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'} # noun tags
    verb_tags = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'} # verb tags
    adj_tags = {'JJ', 'JJR', 'JJS'} #adjective tags
    adv_tags = {'RB', 'RBR', 'RBS'} # adverb tags

    # counting the number of nouns, verbs, adjectives, and adverbs
    noun_count = sum(pos_counts[tag] for tag in noun_tags if tag in pos_counts)
    verb_count = sum(pos_counts[tag] for tag in verb_tags if tag in pos_counts)
    adj_count = sum(pos_counts[tag] for tag in adj_tags if tag in pos_counts)
    adv_count = sum(pos_counts[tag] for tag in adv_tags if tag in pos_counts)

    return tagged_words, noun_count, verb_count, adj_count, adv_count

#  applying POS tagging to 'cleaned_title' column
cleaned_df = cleaned_dataframe[['title_tags', 'title_nouns', 'title_verbs', 'title_adjs', 'title_advs']] = cleaned_dataframe['Cleaned_title'].apply(
    lambda text: pd.Series(pos_tagging(text))
)

# applying POS tagging to 'Cleaned_Abstract' column (Adjusted column name)
cleaned_df = cleaned_dataframe[['abstract_tags', 'abstract_nouns', 'abstract_verbs', 'abstract_adjs', 'abstract_advs']] = cleaned_dataframe['Cleaned_Abstract'].apply(
    lambda text: pd.Series(pos_tagging(text))
)

cleaned_df


Unnamed: 0,0,1,2,3,4
0,"[(era, NN), (burgeon, NN), (electr, NN), (vehi...",70,10,17,1
1,"[(background, NN), (medit, NN), (app, NN), (su...",105,22,26,4
2,[],0,0,0,0
3,"[(background, NN), (human, JJ), (must, MD), (a...",87,12,33,2
4,"[(grow, VB), (number, NN), (cloudbas, NN), (se...",52,9,12,0
...,...,...,...,...,...
99995,"[(cyber, VB), (attack, NN), (easier, JJR), (cy...",121,17,31,1
99996,[],0,0,0,0
99997,"[(era, NN), (person, NN), (abl, JJ), (determin...",43,6,3,0
99998,[],0,0,0,0


In [None]:
# downloading necessary models
nltk.download('punkt')
benepar.download('benepar_en3')

# loading spacy model with dependency parsing
nlp = spacy.load("en_core_web_sm")

# loading constituency parser
parser = benepar.Parser("benepar_en3")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
  state_dict = torch.load(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
# extracting a sample sentence from the dataset
sample_sentence = df['Cleaned_Abstract'].dropna().iloc[0]  # First non-empty abstract

print("Sample Sentence:", sample_sentence)

Sample Sentence: era burgeon electr vehicl ev popular understand pattern ev user behavior imper paper examin trend household charg session time durat energi consumpt analyz realworld residenti charg data leverag inform collect session novel framework introduc effici realtim predict import charg characterist util histor data userspecif featur machin learn model train predict connect durat charg durat charg demand time next session model enhanc understand ev user behavior provid practic tool optim ev charg infrastructur effect manag charg demand transport sector becom increasingli electrifi work aim empow stakehold insight reliabl model enabl anticip local demand contribut sustain integr electr vehicl grid


In [14]:
import spacy
import pandas as pd

# Load English NLP model with NER capabilities
nlp = spacy.load("en_core_web_sm")

# Increase the max length limit
nlp.max_length = 15000000  # Adjust this value if needed

# Extracting text data (combining title and abstract for richer entity extraction)
text_data = " ".join(df['Cleaned_title'].dropna().astype(str)) + " " + " ".join(df['Cleaned_Abstract'].dropna().astype(str))

# Function to process text in chunks
def process_text_in_chunks(text, chunk_size=1000000):
    entities = []
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        doc = nlp(chunk)
        entities.extend([(ent.text, ent.label_) for ent in doc.ents])
    return entities

# Process the text in chunks
entities = process_text_in_chunks(text_data)

# Convert to DataFrame for better visualization
entity_df = pd.DataFrame(entities, columns=['Entity', 'Category'])

# Counting occurrences of each entity category
entity_counts = entity_df['Category'].value_counts()

print(entity_counts.head())



Category
PERSON      222815
ORG         169905
CARDINAL     60300
GPE          49400
NORP         38301
Name: count, dtype: int64


# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import choice

In [16]:
# storaging for extracted data
product_data = []

# listing of different user-agentt strings to avoid detection
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]

In [17]:
# creating a session to maintain cookies & headers
session = requests.Session()

# adding extra headers to mimic a real browser
headers = {
    'User-Agent': choice(user_agents),
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://github.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
}

In [18]:
# fetching multiple pages
for i in range(1, 55):  # adjusting range for more pages
    time.sleep(3)  # delay to avoid getting blocked

    base_url = f'https://github.com/marketplace?page={i}&type=actions'
    print(f"Scraping: {base_url}")

    # retrying logic
    for attempt in range(3):  # Retrying up to 3 times
        try:
            response = session.get(base_url, headers=headers, timeout=10)

            if response.status_code == 200:
                break  # exiting loop if successful
            else:
                print(f"Attempt {attempt+1}: Failed with status {response.status_code}")
                time.sleep(2)  # waiting before retrying
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            time.sleep(2)

    if response.status_code != 200:
        print(f"Skipping page {i} due to failure")
        continue

    soup = BeautifulSoup(response.text, 'html.parser')

    # finding all marketplace items
    github_actions = soup.find_all('div', class_='position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3')

    for actions in github_actions:

        # handling exception error
        try:
            # extracting product details
            product_name_tag = actions.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf line-clamp-1')
            product_name = product_name_tag.text.strip() if product_name_tag else 'N/A'

            # extracting product URL
            url = product_name_tag['href'] if product_name_tag else 'N/A'
            if url.startswith('/'):
                url = f'https://github.com{url}'

            # extracting action description
            action_description_tag = actions.find('p', class_='mt-1 mb-0 text-small fgColor-muted line-clamp-2')
            action_description = action_description_tag.text.strip() if action_description_tag else 'N/A'

            # storing in a structured dictionary
            product_data.append({
                'Product Name': product_name,
                'URL': url,
                'Description': action_description,
                'Page Number': i
            })

        except Exception as e:
            print(f"Error extracting data on page {i}: {e}")

Scraping: https://github.com/marketplace?page=1&type=actions
Scraping: https://github.com/marketplace?page=2&type=actions
Scraping: https://github.com/marketplace?page=3&type=actions
Scraping: https://github.com/marketplace?page=4&type=actions
Scraping: https://github.com/marketplace?page=5&type=actions
Scraping: https://github.com/marketplace?page=6&type=actions
Scraping: https://github.com/marketplace?page=7&type=actions
Scraping: https://github.com/marketplace?page=8&type=actions
Scraping: https://github.com/marketplace?page=9&type=actions
Scraping: https://github.com/marketplace?page=10&type=actions
Scraping: https://github.com/marketplace?page=11&type=actions
Scraping: https://github.com/marketplace?page=12&type=actions
Scraping: https://github.com/marketplace?page=13&type=actions
Scraping: https://github.com/marketplace?page=14&type=actions
Scraping: https://github.com/marketplace?page=15&type=actions
Scraping: https://github.com/marketplace?page=16&type=actions
Scraping: https:/

In [19]:
# converting to a dataframe
df_products = pd.DataFrame(product_data)

# saving to csv
df_products.to_csv('github_marketplace_actions.csv', index=False)

print("Scraping Completed. Data saved to github_marketplace_actions.csv.")

Scraping Completed. Data saved to github_marketplace_actions.csv.


In [20]:
# loading the data from the previously collected github_action (replace with your actual dataframe)
df_action_products = pd.read_csv('/content/github_marketplace_actions.csv')

# downloading the necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')

# defining a function to clean the text
def clean_text_remove_stopword(text):
    # handling missing values (NaN) or non-string values
    if pd.isna(text) or not isinstance(text, str):
        return ''

     # removing special characters and punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)  # Keep only alphabets, numbers, and spaces

    # converting text to lowercase
    text = text.lower()

    # tokenizing the text
    tokens = word_tokenize(text)

    # removing stopwords and perform lemmatization
    cleaned_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalnum() and word not in stop_words]
    return ' '.join(cleaned_tokens)

# applying the cleaning function to 'Product Name' and 'Description' columns
df_action_products['Product Name'] = df_action_products['Product Name'].apply(clean_text_remove_stopword)
df_action_products['Description'] = df_action_products['Description'].apply(clean_text_remove_stopword)

# showing the first few rows of the cleaned data (output)
df_action_products[['Product Name', 'Description', 'URL', 'Page Number']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,Product Name,Description,URL,Page Number
0,trufflehog os,scan github action trufflehog,https://github.com/marketplace/actions/truffle...,1
1,metric embed,infographics generator 40 plugins 300 option d...,https://github.com/marketplace/actions/metrics...,1
2,yq portable yaml processor,create read update delete merge validate yaml,https://github.com/marketplace/actions/yq-port...,1
3,superlinter,superlinter readytorun collection linters code...,https://github.com/marketplace/actions/super-l...,1
4,gosec security checker,run gosec security checker,https://github.com/marketplace/actions/gosec-s...,1


In [21]:
# droping missing value
df_action_products = df_action_products.dropna(subset=['Description'])

In [22]:
# selecting specifice column
df_actions = df_action_products[['Product Name', 'Description', 'URL', 'Page Number']]

In [23]:
# storing data to csv
df_actions.to_csv('cleaned_github_actions_data.csv', index=False)

#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [25]:
# installing tweepy for twitter scraping
!pip install tweepy



In [26]:
# importing tweepy for twitter api
import tweepy

In [27]:
# twitter API credentials
API_KEY_SECRET = "****0u4kku"
ACCESS_TOKEN = "764834168291201028-5sCwhkbhEIjWBaUzlnPfTmd1lYx8ECc"
ACCESS_TOKEN_SECRET = "5zOsd3Ge94bDtF3NsliM9gMsT56D0pC9QeFtV1h74DI6a"
BEARER_TOKEN = "AAAAAAAAAAAAAAAAAAAAAFtkzQEAAAAAUlrwWEz4%2Bp9GMK8GnHMogUWozvQ%3DhIFbaccUYhoovuO8MMINeFrxE1XSbIK8FSDVgkiDz5WpqKbXOQ"

In [28]:
# authenticating using OAuth2
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# defining search query and parameters
query = "(#MachineLearning OR #AI) -is:retweet lang:en"
tweets = client.search_recent_tweets(query=query, tweet_fields=["id", "text", "author_id"], max_results=100)

# extracting relevant data
data = []
if tweets.data:
    for tweet in tweets.data:
        data.append({
            "Tweet ID": tweet.id,
            "Username": tweet.author_id,
            "Text": tweet.text
        })

# converting to dataframe and display
twitter_df = pd.DataFrame(data)
twitter_df.head()

Unnamed: 0,Tweet ID,Username,Text
0,1891724106434854973,940823377765371905,Alice: The Last return to Wonderland\nhttps://...
1,1891724100957061308,1878701904462831616,"@lmarena_ai @xai Wow, Grok-3 crushing it! 🏆 1..."
2,1891724092471980087,1891306202753355776,💡 58% of customers ghost businesses after ONE ...
3,1891724090056245288,2476684130,"🚀🔍 Say hello to ""Pearl"", the AI search engine ..."
4,1891724086704734611,1871092295854116864,♬ Buffering... Please Wait... https://t.co/sc4...


In [29]:
# storing twitter_df to csv
twitter_df.to_csv('twitter_data.csv', index=False)

In [30]:
# performing data quality checks
missing_values = twitter_df.isnull().sum()
duplicate_rows = twitter_df.duplicated().sum()

# printing data quality report
print("Missing Values:\n", missing_values)
print("\nDuplicate Rows:", duplicate_rows)

# removing duplicates (if any)
df_twitter_cleaned = twitter_df.drop_duplicates()

# saving the cleaned data to a new CSV file
df_twitter_cleaned.to_csv('twitter_cleaned_data.csv', index=False)

Missing Values:
 Tweet ID    0
Username    0
Text        0
dtype: int64

Duplicate Rows: 0


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

I find this assignment really challenging and difficult at the same time. While running 10000 data for the question number1, it took me forever to run the code and get the output which was really frustating.I even got google colab pro and tried running code for several times but still couldn't get the whole result beacause of large datasets. I tried scrapping the data first from IMDB but coundn't scrap more than 25 data and had to change to another one.It took me whole 4 days to work on this assignment though I took help from Canvas guidelines and Chatgpt.But at the same time, I learned about lots of data libraries, toolkits and webscraping from different websites.

**CSV_files**

https://1drv.ms/f/c/b7ab9e17013cc096/EuUHJWPPWjxDk654ltSrfRkBCR_hBQIP4I3gw_Gyc95coQ?e=z9wb1r

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog