# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software products from G2 and Capterra.


In [None]:
import requests
import csv
from bs4 import BeautifulSoup

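def get_total_reviews(url):
    """Return the review count shown in the page header."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    header = soup.find('div', {'class': 'header'})
    if header is None:
        # Fail with a clear message instead of an AttributeError on None
        raise ValueError("Review count header not found; the page layout may have changed.")
    total_reviews_str = header.get_text().split()[0]
    return int(total_reviews_str.replace(',', ''))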

def scrape_reviews(url, csv_filename):
    total_reviews = get_total_reviews(url)

    with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['Review'])

        for page in range(1, min(1001, total_reviews // 10 + 2)):
            page_url = f'{url}&start={10*(page-1)}'
            page_response = requests.get(page_url)
            page_soup = BeautifulSoup(page_response.content, 'html.parser')
            reviews = page_soup.find_all('div', {'class': 'text show-more__control'})

            for review in reviews:
                review_text = review.get_text().strip()
                csv_writer.writerow([review_text])

if __name__ == "__main__":
    imdb_url = 'https://www.imdb.com/title/tt10954600/reviews?ref_=tt_urv'
    csv_filename = 'Antman_reviews.csv'
    scrape_reviews(imdb_url, csv_filename)

In [None]:
# Import the necessary library
import pandas as pd

# Load the CSV file data into a DataFrame
df = pd.read_csv('Antman_reviews.csv')

# Display the first 100 rows of the DataFrame
print(df.head(100))

    Unnamed: 0                                             Review
0            1  a huge fan first one almost big fan second one...
1            2  after entri phase pas without much set next bi...
2            3  well happen the mcu run ga the last mcu film l...
3            4  well ill start say wasnt bad movi it wasnt gre...
4            5  i enjoy watch quantumania it mostli solid fair...
..         ...                                                ...
95          96  thi film unspeak badit actual wors etern becau...
96          97  a fun onei terrif time watch antman wasp quant...
97          98  a mani other point far heyday peak mcu movi an...
98          99  the mcu current state absolut mess up endgam w...
99         100  so im go say great marvel film howev i also wo...

[100 rows x 2 columns]


# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
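
As a rough illustration of steps (1) to (4), the sketch below cleans a single string using only the standard library. The `basic_clean` helper and the tiny `STOPWORDS` set are assumptions made for this example, not part of the assignment; a real solution would use the NLTK stopword list and add stemming and lemmatization afterwards. Lowercasing is applied before the stopword check so that capitalised stopwords are caught.

```python
import re

# Tiny stand-in stopword list (an assumption for this sketch); in the
# notebook, nltk.corpus.stopwords.words('english') would be used instead.
STOPWORDS = {"a", "an", "the", "it", "is", "was", "i", "of", "and", "to"}

def basic_clean(text):
    """Apply steps (1)-(4): strip noise, strip digits, lowercase, drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text)   # (1) punctuation and special characters
    text = text.replace("_", " ")          # \w also matches "_", so remove it explicitly
    text = re.sub(r"\d+", " ", text)       # (2) numbers
    text = text.lower()                    # (4) lowercase before the stopword check
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]  # (3) stopwords
    return " ".join(tokens)

print(basic_clean("It was GREAT!!! I watched it 2 times; 10/10."))
# prints: great watched times
```

Replacing noise with a space rather than an empty string keeps words such as "times;10" from being fused together.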

In [None]:
# Your code here

!pip install textblob



In [None]:
# Write your code here
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
import re
import pandas as pd

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Noise Removal
# Replace every run of non-alphanumeric characters with a space.
# regex=True is explicit because the default became regex=False in pandas 2.0.
df['Reviews after Noise Removal'] = df['Review'].str.replace(r'[^a-zA-Z0-9]+', ' ', regex=True)
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal,Lower Case,After Stemming,After Lemmatization,cleaned_text
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,entri phase pas without much set next big bad ...,entri phase pas without much set next big bad ...,entri phase pa without much set next big bad m...,entri phase pa without much set next big bad m...,entri phase pa without much set next big bad m...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...


In [None]:
# Remove Digits
df['After digits removal'] = df['Reviews after Noise Removal'].apply(lambda y: ''.join([i for i in y if not i.isdigit()]))
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...


In [None]:
# Stopwords Removal
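stop_words = set(stopwords.words('english'))
# Compare lowercased tokens so capitalised stopwords ("The", "It") are removed too
df['Stopwords Removal'] = df['After digits removal'].apply(lambda text: " ".join(word for word in text.split() if word.lower() not in stop_words))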
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,huge fan first one almost big fan second one d...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,entri phase pas without much set next big bad ...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen mcu run ga last mcu film lacklust ...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi wasnt great ...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,enjoy watch quantumania mostli solid fairli en...


In [None]:
# Convert to Lower Case
df['Lower Case'] = df['Stopwords Removal'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal,Lower Case
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,entri phase pas without much set next big bad ...,entri phase pas without much set next big bad ...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...


In [None]:
# Stemming
stemmer = PorterStemmer()
df['After Stemming'] = df['Lower Case'].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()]))
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal,Lower Case,After Stemming
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,entri phase pas without much set next big bad ...,entri phase pas without much set next big bad ...,entri phase pa without much set next big bad m...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...


In [None]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
# Lemmatization
df['After Lemmatization'] = df['After Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal,Lower Case,After Stemming,After Lemmatization
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,entri phase pas without much set next big bad ...,entri phase pas without much set next big bad ...,entri phase pa without much set next big bad m...,entri phase pa without much set next big bad m...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...


In [None]:
df['cleaned_text']= df['After Lemmatization']

In [None]:
# Save the cleaned data to a new column and CSV file
df.to_csv('Antman_reviews_cleaned.csv', index=False)

# Display the first few rows of the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal,Lower Case,After Stemming,After Lemmatization,cleaned_text
0,1,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,a huge fan first one almost big fan second one...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...,huge fan first one almost big fan second one d...
1,2,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,after entri phase pas without much set next bi...,entri phase pas without much set next big bad ...,entri phase pas without much set next big bad ...,entri phase pa without much set next big bad m...,entri phase pa without much set next big bad m...,entri phase pa without much set next big bad m...
2,3,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen the mcu run ga the last mcu film l...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...,well happen mcu run ga last mcu film lacklust ...
3,4,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi it wasnt gre...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...,well ill start say wasnt bad movi wasnt great ...
4,5,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,i enjoy watch quantumania it mostli solid fair...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...,enjoy watch quantumania mostli solid fairli en...


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
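
For part (1), the counting step can be sketched independently of the tagger. The example below assumes the input is a list of `(word, tag)` pairs with Penn Treebank tags, which is the shape `nltk.pos_tag` returns; the `tally_pos` helper and the hand-tagged sample are hypothetical. Note that matching on the first letter also counts `RP` (particle) under adverbs.

```python
from collections import Counter

# Penn Treebank tags begin with N (nouns), V (verbs), J (adjectives), R (adverbs).
PREFIX_TO_CATEGORY = {"N": "Noun", "V": "Verb", "J": "Adjective", "R": "Adverb"}

def tally_pos(tagged_tokens):
    """Count coarse POS categories from (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:1])
        if category:
            counts[category] += 1
    return counts

# Hand-tagged sample (hypothetical), in the shape nltk.pos_tag returns.
sample = [("marvel", "NN"), ("movie", "NN"), ("really", "RB"),
          ("enjoy", "VBP"), ("solid", "JJ"), ("the", "DT")]
print(tally_pos(sample))
```

A single pass with a `Counter` avoids scanning the tagged list once per category.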

In [None]:
# Write code for each of the sub parts with proper comments.
!pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [None]:
!pip install nltk
import nltk
nltk.download('words')



[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
import spacy
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tree import Tree
from collections import Counter
import pandas as pd

# Load the spaCy English model for dependency parsing and named entity recognition.
nlp = spacy.load("en_core_web_sm")

# Function to print a shallow chunk tree using NLTK's ne_chunk.
# Note: ne_chunk only groups named entities; it approximates, but is not, a full constituency parse.
def print_constituency_tree(text):
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        chunked = ne_chunk(tagged)
        for subtree in chunked:
            if isinstance(subtree, Tree):
                print(subtree.label(), " ".join(word for word, pos in subtree.leaves()))
            else:
                print(subtree[0], subtree[1])

# Function to print dependency parsing tree using spaCy.
def print_dependency_tree(text):
    doc = nlp(text)
    for token in doc:
        print(token.text, token.dep_, token.head.text, [child for child in token.children])

# Function to extract named entities and count their occurrences.
def extract_named_entities(text):
    doc = nlp(text)
    entity_counter = Counter()
    for ent in doc.ents:
        entity_counter[ent.label_] += 1
    return entity_counter

# Read only the first 100 rows from the CSV file to avoid run time issues
df = pd.read_csv('Antman_reviews_cleaned.csv', nrows=100)

# Combine the 'cleaned_text' column values into a single text
text = ' '.join(df['cleaned_text'].astype(str))


# (1) Parts of Speech (POS) Tagging
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Count the number of nouns, verbs, adjectives, and adverbs in the text.
noun_count = len([word for word, pos in pos_tags if pos.startswith('N')])
verb_count = len([word for word, pos in pos_tags if pos.startswith('V')])
adj_count = len([word for word, pos in pos_tags if pos.startswith('J')])
adv_count = len([word for word, pos in pos_tags if pos.startswith('R')])

print(f"Total Nouns: {noun_count}")
print(f"Total Verbs: {verb_count}")
print(f"Total Adjectives: {adj_count}")
print(f"Total Adverbs: {adv_count}")

# (2) Constituency Parsing and Dependency Parsing
print("Constituency Parsing Trees:")
print_constituency_tree(text)
print("\nDependency Parsing Tree:")
print_dependency_tree(text)

# (3) Named Entity Recognition
entity_counter = extract_named_entities(text)
print("Named Entities:")
for entity, count in entity_counter.items():
    print(f"{entity}: {count}")

Streaming output truncated to the last 5000 lines.
edit compound time []
time nsubj use [cgi, rough, place, edit]
could aux use []
use conj m [time, could, develop, marvel]
extra amod month []
month npadvmod develop [extra]
develop xcomp use [month, modok]
modok dobj develop []
atroci advmod marvel []
executionthat det marvel []
marvel advcl use [atroci, executionthat]
late amod quantiti []
quantiti nsubj say [late, thing]
qualiti nmod thing []
unfortunatelyanoth amod thing []
thing appos quantiti [qualiti, unfortunatelyanoth]
say ROOT say [quantiti, wish]
wish xcomp say [take, come, was]
marvel compound movi []
movi nsubj take [marvel]
would aux take []
take ccomp wish [movi, would, humor, get]
seriou compound humor []
humor dobj take [seriou]
get conj take [watch]
repetit compound enjoy []
enjoy nsubj watch [repetit]
watch ccomp get [enjoy, feel]
quantumania compound mcu []
mostli nmod mcu []
solid amod movi []
fairli compound movi []
entertain compound movi []
movi com
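
Each line of the dependency output above has the form `token relation head [children]`. One way to read it is to regroup the lines by head and indent each dependent under its head. The sketch below does this for a few lines copied from the output; the `render_tree` helper is hypothetical, and it assumes token texts are unique within the fragment (real code would key on the token index, for example spaCy's `token.i`).

```python
from collections import defaultdict

# (token, relation, head) triples copied from the output above.
triples = [
    ("marvel", "compound", "movi"),
    ("movi", "nsubj", "take"),
    ("would", "aux", "take"),
    ("take", "ccomp", "wish"),
    ("seriou", "compound", "humor"),
    ("humor", "dobj", "take"),
]

def render_tree(triples, root, root_relation):
    """Return indented lines showing each head with its dependents beneath it."""
    children = defaultdict(list)
    for token, relation, head in triples:
        children[head].append((token, relation))

    lines = []

    def walk(token, relation, depth):
        lines.append("  " * depth + f"{token} ({relation})")
        for child, child_relation in children[token]:
            walk(child, child_relation, depth + 1)

    walk(root, root_relation, 0)
    return lines

print("\n".join(render_tree(triples, "take", "ccomp")))
```

In this fragment, "take" heads the clause: "movi" is its subject, "would" its auxiliary, and "humor" its direct object, while "marvel" and "seriou" modify the nouns they attach to.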

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products in total), handling pagination through the page number in the URL. For each product, extract the name, a short description, and the URL.

Store the extracted data in a CSV file with columns for product name, description, URL, and page number. Add a time delay between requests to avoid overloading the server. ChatGPT can assist with HTML parsing, error handling, and generating reports from the collected data.

The goal is to complete the scraping within a specified time limit while keeping the process efficient and within GitHub’s usage guidelines.
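
The pagination, delay, and CSV steps described above can be sketched without touching the network. In the example below, `page_urls`, `crawl`, and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real request-and-parse step, whose selectors depend on the current page markup and are not assumed here.

```python
import csv
import io
import time

BASE_URL = "https://github.com/marketplace?type=actions"
FIELDS = ["product_name", "description", "url", "page_number"]

def page_urls(base_url, pages):
    """Yield (page_number, url) pairs for pages 1..pages."""
    for page in range(1, pages + 1):
        yield page, f"{base_url}&page={page}"

def crawl(fetch_page, pages, out_file, delay=1.0):
    """Write one CSV row per product; fetch_page(url) returns a list of dicts."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    total = 0
    for page, url in page_urls(BASE_URL, pages):
        for product in fetch_page(url):
            writer.writerow({**product, "page_number": page})
            total += 1
        time.sleep(delay)  # pause between requests to avoid overloading the server
    return total

# Hypothetical stand-in for the real HTTP request and HTML parsing step.
def fake_fetch(url):
    return [{"product_name": "demo-action", "description": "Example", "url": url}]

buffer = io.StringIO()
print(crawl(fake_fetch, pages=2, out_file=buffer, delay=0))
print(buffer.getvalue())
```

Passing the fetch function in as an argument keeps the pagination and CSV logic testable even when the live site blocks or changes its responses.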

(PART-2)

1. **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. **Data Quality**: Perform data quality operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
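
A minimal sketch of the data quality checks described above, assuming the scraped rows are held as a list of dictionaries. The `quality_report` helper, the column names, and the sample rows are illustrative only. It drops rows with a missing required field, removes duplicates after trimming whitespace, and reports how many rows fell into each case.

```python
def quality_report(rows, required):
    """Drop incomplete and duplicate rows; return (clean_rows, report)."""
    seen = set()
    clean, missing, duplicates = [], 0, 0
    for row in rows:
        # Normalise whitespace so "  name " and "name" compare equal.
        values = tuple((row.get(col) or "").strip() for col in required)
        if not all(values):
            missing += 1
            continue
        if values in seen:
            duplicates += 1
            continue
        seen.add(values)
        clean.append(dict(zip(required, values)))
    return clean, {"kept": len(clean), "missing": missing, "duplicates": duplicates}

# Hypothetical sample rows: one valid, one duplicate, one with an empty description.
rows = [
    {"product_name": "demo", "description": "A demo action", "url": "https://example.com/a"},
    {"product_name": " demo ", "description": "A demo action", "url": "https://example.com/a"},
    {"product_name": "other", "description": "", "url": "https://example.com/b"},
]
clean, report = quality_report(rows, ["product_name", "description", "url"])
print(report)
```

Reporting the counts alongside the cleaned rows makes it easy to state in the write-up how much data each check removed.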


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_github_marketplace(pages=1, output_csv='github_actions.csv'):
    """
    Scrape GitHub Marketplace (Actions). Adjust pages as needed.
    """
    base_url = "https://github.com/marketplace?type=actions"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    with open(output_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["page_number", "product_name", "product_description", "url"])

        for page_num in range(1, pages + 1):
            url = f"{base_url}&page={page_num}"
            print(f"Scraping: {url}")

            try:
                response = requests.get(url, headers=headers, timeout=10)
                response.raise_for_status()
                print(response)

                soup = BeautifulSoup(response.text, "html.parser")

                # Adjust your selector to match the <article> elements for each action
                product_cards = soup.select("article.Box-row.marketplace-item")
                if not product_cards:
                    print(f"No items found on page {page_num}. Possibly no more pages or structure changed.")
                    break  # Stop at the first empty page; later pages are unlikely to have results

                for card in product_cards:
                    # 1) Title / Name
                    title_el = card.select_one("h3 a span[itemprop='name']")
                    product_name = title_el.get_text(strip=True) if title_el else "N/A"

                    # 2) Description
                    desc_el = card.select_one("p.color-fg-muted")
                    product_description = desc_el.get_text(strip=True) if desc_el else "N/A"

                    # 3) Link
                    link_el = card.select_one("h3 a[href]")
                    product_url = f"https://github.com{link_el['href']}" if link_el else "N/A"

                    writer.writerow([page_num, product_name, product_description, product_url])

                # Sleep briefly to respect rate limits
                time.sleep(1)

            except requests.exceptions.RequestException as e:
                print(f"Error fetching page {page_num}: {e}")
                continue

if __name__ == "__main__":
    # Scrape just 1 page to test
    scrape_github_marketplace(pages=1, output_csv='github_actions.csv')


Scraping: https://github.com/marketplace?type=actions&page=1
No items found on page 1. Possibly no more pages or structure changed.


In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the GitHub Marketplace - Actions section
url = "https://github.com/marketplace?type=actions"

# Define headers to mimic a real browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edge/91.0.864.64",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://github.com/",
    "TE": "Trailers"
}

# Send GET request with headers
response = requests.get(url, headers=headers, timeout=10)
# Stop on an HTTP error before using the body; exit() would shut down the notebook kernel
response.raise_for_status()
print(response.text)

# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find the section containing the actions
actions = soup.find_all('a', class_='btn-link')

# Extract and print the action names and links
for action in actions:
    action_name = action.get_text(strip=True)
    action_url = f"https://github.com{action.get('href')}"
    if action_name:  # to avoid printing empty names
        print(f"Action Name: {action_name}")
        print(f"Link: {action_url}\n")









<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  
  >



  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  


  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" 

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the GitHub Marketplace Actions page
url = 'https://github.com/marketplace?type=actions'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all action cards on the page
    actions = soup.find_all('div', class_='h3 lh-condensed')

    # Loop through each action and extract its name and link
    for action in actions:
        action_name = action.get_text(strip=True)
        action_link = 'https://github.com' + action.find('a')['href']
        print(f'Action Name: {action_name}')
        print(f'Action Link: {action_link}\n')
else:
    print(f'Failed to retrieve page, status code: {response.status_code}')

Failed to retrieve page, status code: 400


In [None]:
import requests
requests.post("https://github.com/marketplace?type=actions")

<Response [403]>

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# Base URL for GitHub Marketplace Actions
BASE_URL = "https://github.com/marketplace?type=actions"

# Define headers to mimic a real browser request
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edge/91.0.864.64",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Referer": "https://github.com/",
    "TE": "Trailers"
}

# Open a CSV file to store the scraped data
with open('github_marketplace_actions.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['product_name', 'description', 'url', 'page_number']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

    # Define the total number of pages you wish to scrape (adjust this based on the total product count)
    total_pages = 10  # Example: if each page has about 100 products, 10 pages would be around 1000 products

    for page in range(1, total_pages + 1):
        # Construct URL for the current page; GitHub Marketplace uses a page parameter
        page_url = f"{BASE_URL}&page={page}"
        print(f"Processing: {page_url}")

        try:
            response = requests.get(page_url, headers=HEADERS, timeout=30)
            if response.status_code != 200:
                print(f"Failed to retrieve page {page}. Status code: {response.status_code}")
                continue

            # Parse the content with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # --- Parsing the page ---
            # Note: The exact HTML structure may vary.
            # In this example, we assume each product is in a <div> (or similar container) that includes:
            #   - an <a> tag with class 'btn-link' for the product name and link,
            #   - a <p> tag containing the product description.
            #
            # Adjust the selectors based on the current page layout.

            # This is one way to search; you might need to inspect the HTML to find a more precise container.
            product_containers = soup.find_all('div', class_='flex-1')
            if not product_containers:
                print(f"No products found on page {page}. Check the HTML selectors.")
                continue

            for product in product_containers:
                # Extract product name and URL
                product_link = product.find('a', class_='btn-link')
                if product_link:
                    product_name = product_link.get_text(strip=True)
                    relative_link = product_link.get('href')
                    product_url = f"https://github.com{relative_link}" if relative_link else "N/A"
                else:
                    product_name = "N/A"
                    product_url = "N/A"

                # Extract a short description; adjust the tag/class as necessary.
                description_tag = product.find('p')
                description = description_tag.get_text(strip=True) if description_tag else "N/A"

                # Write the extracted data to the CSV file
                writer.writerow({
                    'product_name': product_name,
                    'description': description,
                    'url': product_url,
                    'page_number': page
                })

            print(f"Completed page {page}")
            # Introduce a delay to avoid overloading the server
            time.sleep(2)

        except Exception as e:
            print(f"An error occurred on page {page}: {e}")


Processing: https://github.com/marketplace?type=actions&page=1

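The page loop above moves on to the next page as soon as a request returns a non-200 status. When the failure is temporary (for example, rate limiting), one option is to retry the same page with a growing delay before giving up. The following is a minimal sketch under stated assumptions: `fetch_with_backoff`, its parameters, and the set of retryable status codes are hypothetical choices, not part of the original notebook. `fetch` is any callable that returns an object with a `status_code` attribute, such as `lambda u: requests.get(u, headers=HEADERS, timeout=30)`.

```python
import time

# Status codes treated as temporary in this sketch (an assumption; adjust as needed)
RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-retryable status or attempts run out.

    The delay doubles after each retryable response: base_delay, 2*base_delay, ...
    Returns the last response object received.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in RETRYABLE:
            return response
        # Wait before the next attempt, but not after the final one
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return response
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; in the notebook the default `time.sleep` would be used.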
# Question 5 (20 points)

Part 1:
Scrape tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
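One way to read the completeness and consistency check described above is: every row has a non-blank value for each required field, and each tweet ID appears only once. The following is a minimal sketch, assuming rows are plain dictionaries; `quality_check`, `REQUIRED_FIELDS`, and the sample rows are illustrative and not prescribed by the assignment.

```python
REQUIRED_FIELDS = ("tweet_id", "username", "text")  # illustrative field names

def _is_blank(value):
    # A field counts as missing if it is None or only whitespace
    return value is None or not str(value).strip()

def quality_check(rows, required=REQUIRED_FIELDS):
    """Return (clean_rows, report): drop incomplete rows and duplicate tweet IDs."""
    seen_ids = set()
    clean_rows = []
    report = {"incomplete": 0, "duplicates": 0}
    for row in rows:
        # Completeness: every required field must be present and non-blank
        if any(_is_blank(row.get(field)) for field in required):
            report["incomplete"] += 1
            continue
        # Consistency: keep only the first occurrence of each tweet ID
        if row["tweet_id"] in seen_ids:
            report["duplicates"] += 1
            continue
        seen_ids.add(row["tweet_id"])
        clean_rows.append(row)
    return clean_rows, report

rows = [
    {"tweet_id": 1, "username": "alice", "text": "ml is fun"},
    {"tweet_id": 1, "username": "alice", "text": "ml is fun"},  # duplicate ID
    {"tweet_id": 2, "username": "bob", "text": "   "},          # blank text
]
clean, report = quality_check(rows)
print(clean)
print(report)
```

Returning a small report alongside the cleaned rows makes it easy to state how many rows were dropped and why.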


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [None]:
############################################
# PART 1: TWITTER SCRAPING WITH TWEEPY (v2)
############################################

# 1. INSTALL AND IMPORT LIBRARIES
# (Uncomment the pip installs if needed)
# !pip install tweepy pandas

import os
import tweepy
import pandas as pd
import re

# 2. AUTHENTICATE WITH TWITTER API (v2)
# Read credentials from environment variables instead of hard-coding secrets in the notebook.
# The variable names below are this notebook's own convention; set them before running.
bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")
consumer_key = os.environ.get("TWITTER_CONSUMER_KEY")
consumer_secret = os.environ.get("TWITTER_CONSUMER_SECRET")
access_token = os.environ.get("TWITTER_ACCESS_TOKEN")
access_token_secret = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET")
# Initialize Tweepy Client for v2
client = tweepy.Client(
    bearer_token=bearer_token,
    consumer_key=consumer_key,
    consumer_secret=consumer_secret,
    access_token=access_token,
    access_token_secret=access_token_secret,
    # If you often hit rate limits, set this to True to have Tweepy sleep until reset
    wait_on_rate_limit=True
)

# 3. SEARCH TWEETS WITH SPECIFIC HASHTAGS
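# We can use Twitter’s “recent search” endpoint for v2 (tweets from last 7 days).
# The '-is:retweet' filter removes retweets.
# Use 'lang:en' to filter English, plus #machinelearning or #AI hashtags.
# The parentheses matter: a space (AND) binds tighter than OR, so without them
# the retweet and language filters would apply only to #AI.
query = "(#machinelearning OR #AI) -is:retweet lang:en"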

# For recent search, max_results must be between 10 and 100 per request.
# 10 keeps usage low on rate-limited tiers (check current Twitter rules).
response = client.search_recent_tweets(
    query=query,
    max_results=10,  # smallest value the endpoint accepts
    tweet_fields=["id", "text", "author_id", "created_at"]
)

# response.data is a list of Tweet objects (or None if no tweets found)
tweets_data = response.data if response.data else []

# 4. EXTRACT FIELDS (TWEET ID, USER ID, TEXT, ETC.)
data_list = []
for tweet in tweets_data:
    tweet_id = tweet.id
    author_id = tweet.author_id
    tweet_text = tweet.text
    created_at = tweet.created_at

    data_list.append({
        "tweet_id": tweet_id,
        "author_id": author_id,
        "text": tweet_text,
        "created_at": created_at
    })

# 5. CONVERT TO DATAFRAME
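# Passing explicit columns keeps df["text"] available even when no tweets are returned
df = pd.DataFrame(data_list, columns=["tweet_id", "author_id", "text", "created_at"])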

############################################
# PART 2: BASIC DATA CLEANING
############################################

# Simple text cleaner: remove URLs, convert to lowercase, remove special chars
def clean_text(text):
    # Remove URLs
    text_no_urls = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Remove non-alphanumeric chars except spaces, #, and @ (optional)
    text_alpha = re.sub(r"[^A-Za-z0-9\s#@]", "", text_no_urls)
    # Lowercase
    text_cleaned = text_alpha.lower()
    return text_cleaned.strip()

df["clean_text"] = df["text"].apply(clean_text)

# Data Quality
# Drop duplicates if any (by tweet_id)
df.drop_duplicates(subset="tweet_id", keep="first", inplace=True)

############################################
# PART 3: SAVE CLEANED DATA TO CSV
############################################

output_filename = "twitter_ai_ml_tweets_v2.csv"
df.to_csv(output_filename, index=False, encoding="utf-8")
print(f"Saved cleaned tweets to {output_filename}")

print("Final DataFrame:")
print(df)


Saved cleaned tweets to twitter_ai_ml_tweets_v2.csv
Final DataFrame:
              tweet_id            author_id  \
0  1892424569065029649            253682939   
1  1892424558172655855  1882134432196542465   
2  1892424555022713117   892438929525178368   
3  1892424547254882747  1682288478351159297   
4  1892424497271071077            168449286   
5  1892424495954075842  1806403041622646785   
6  1892424492548575252  1815346492410052608   
7  1892424484344189074  1857426414561087488   
8  1892424475318063167  1682288478351159297   
9  1892424443382935618  1623124126939619328   

                                                text  \
0  🔒 Privacy is a top priority! DeepSearch uses z...   
1  AI agents on blockchain offer a new paradigm f...   
2  #BE #ME #BestEngineeringcollege #Ece #Electron...   
3  ✨ Just found this stunning AI creation on #CoS...   
4  Imagine a tool that lets business leaders spea...   
5  🏃 I’ve just started earning Bytes on @despeedn...   
6  ✨ Just found this 
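
To see what the cleaning step above keeps and drops, the same `clean_text` logic can be run on a single string. This is a sketch: the sample tweet is invented for illustration and does not come from the collected data.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell: strip URLs, drop special characters, lowercase
    text_no_urls = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    text_alpha = re.sub(r"[^A-Za-z0-9\s#@]", "", text_no_urls)
    return text_alpha.lower().strip()

sample = "Check THIS out: https://example.com/post #AI @someone!"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Hashtags and mentions survive, while punctuation and emoji are removed. A double space is left where the URL used to be; collapsing runs of whitespace (for example with `re.sub(r"\s+", " ", ...)`) would be an optional extra step.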

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

Pulling information from different pages was a fascinating task for me. My biggest obstacle was the limits GitHub imposes on API requests. I found collecting the dataset and automating the process interesting, along with using the BeautifulSoup library. The time allocation was more than adequate, but fixing the mistakes was much more tedious than I had hoped. With additional time, I could have rewritten portions of the code in a way that would have reduced the number of errors produced.

# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog