# GSE GenAI Research Repo Web Scraper

**Author:** Michael Spencer

**Purpose:** Serves as an interactive environment for writing and troubleshooting a Python script that scrapes text data from the Generative AI for Education Hub Research Repository.

## Instructions

**Objective:** To evaluate your ability to effectively scrape, process, and analyze text data from the web,
focusing on the provided data sources.

**Task Overview:** You are tasked with scraping text data from the Generative AI for Education Hub -
specifically the [Research Repository](https://scale.stanford.edu/genai/repository?search_api_fulltext=&application%5B42%5D=42&benefits%5B34%5D=34&benefits%5B36%5D=36&benefits%5B35%5D=35).

**Expected Working Time:** 2 hours or less

Step-by-Step Instructions:
1. Data Scraping
- Write a Python script (or use a tool of your choice) to scrape the above web pages.
- Focus on extracting key text elements relevant to Teaching - Instructional Materials in
K12 and Impact - Randomized Controlled Trials in secondary education.
- Save the scraped data in a structured format (e.g., CSV, JSON).
2. Basic Text Analysis
- Perform a basic keyword frequency analysis on the scraped data.
- Provide a brief summary (2-3 sentences) of the key information extracted from the data.
- Present your findings in a concise report (max 1 page, tables/figures excluded) that
includes a list of all papers’ metadata (title, author(s), date, etc.).
3. Documentation and Submission
- Document your process, including the tools and methods used (not to exceed ½ page).
- Submit the following:
  - The Python script (or other tool-based approach) used for scraping.
  - The scraped data file (CSV/JSON).
  - The analysis report.

Evaluation Criteria
- Technical Proficiency: Effectiveness and efficiency of the scraping script, choice of tools, and
handling of data.
- Data Quality: Relevance, completeness, and structure of the scraped data.
- Analytical Insight: Quality and relevance of the text analysis and the summary report.
- Documentation: Clarity and thoroughness of the process documentation.

### Notes

- I checked the site's robots.txt for scraping limits and did not find any, so I went ahead and scraped the site, only to get blocked. To work around this, I added a half-second delay between requests.
- Three articles belong to both searches. As of 2025-02-02, there were 77 articles in total.
- Each research article is nested within a `<li class="col">` tag, so I can use those tags to identify articles to parse.
- The URL for each article's individual page is the first href in that element. I can use it to scrape the abstracts.
- I will scrape the data first, save it, and then perform the analysis. This separation allows for more modular code that is easier to maintain. I am also not dealing with a substantial amount of data, so more advanced streaming methods are not necessary here.
- I will use the repo's search API to query only the topics I am interested in per the task. This makes sense given the timed nature of the task.
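
The half-second delay mentioned in the notes could be applied with a small wrapper around whatever function performs the request. This is a minimal sketch rather than the code used in this notebook: `make_throttled`, `fetch`, `min_interval`, `clock`, and `sleep` are hypothetical names, and the fetch function is passed in so that `requests.get` (or any other callable) can be used.

```python
import time


def make_throttled(fetch, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Wrap `fetch` so consecutive calls start at least `min_interval` seconds apart.

    `fetch` is any callable that takes a URL (for example `requests.get`).
    All names here are illustrative assumptions, not part of the notebook.
    """
    last_call = None

    def throttled(url):
        nonlocal last_call
        if last_call is not None:
            # Sleep only for the part of the interval that has not yet elapsed.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        return fetch(url)

    return throttled
```

With this in place, `polite_get = make_throttled(requests.get)` could stand in for `requests.get` in each scraping loop.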

Potential Improvements:
- Implement parallel requests
- Combine the page gathering step and the article scraping step to minimize repeated requests and reduce code.
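
As a rough illustration of the first improvement, parallel requests could be issued with a thread pool from the standard library. This is a sketch under assumptions: `fetch_all`, `fetch`, and the default of four workers are hypothetical, and the fetch function is injected so that `requests.get` or a throttled wrapper could be supplied.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return a {url: result} dictionary.

    `fetch` is any callable that takes a URL (for example `requests.get`).
    `fetch_all` and `max_workers=4` are illustrative assumptions.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with urls.
        results = list(executor.map(fetch, urls))
    return dict(zip(urls, results))
```

Because the site blocked unthrottled requests, a small worker count combined with a delay would probably still be prudent.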

## Setup

### Libraries

%pip install requests beautifulsoup4 pandas

In [11]:
# Environment libraries
from pathlib import Path

# Analysis libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Variables

In [12]:
PROJECT_ROOT = str(Path.cwd().resolve().parent)
DATA_DIR = PROJECT_ROOT + "/data"
DATA_OUT = DATA_DIR + "/clean/test_gse_genai_articles.csv"

# Sets up the URLs for our HTML requests. Opting to use hard-coded search API URLs to save time, and so
# that we don't have to do additional parsing to gather the relevant files.
REPO_BASE_URL = "https://scale.stanford.edu"
TEACHING_K12_SEARCH_URL = REPO_BASE_URL + "/genai/repository?search_api_fulltext=&application%5B42%5D=42&benefits%5B34%5D=34&benefits%5B36%5D=36&benefits%5B35%5D=35"
# "Secondary" is defined as middle and high school here.
IMPACT_SECONDARY_SEARCH_URL = REPO_BASE_URL + "/genai/repository?search_api_fulltext=&benefits%5B36%5D=36&benefits%5B35%5D=35&study_design%5B55%5D=55"
SEARCH_URLS = [TEACHING_K12_SEARCH_URL, IMPACT_SECONDARY_SEARCH_URL]

### Functions

In [13]:
def main(search_urls):
    page_urls = identify_pages_to_scrape(search_urls)
    article_urls = identify_articles_to_scrape(page_urls)
    article_data = extract_article_data(article_urls)
    write_data_to_csv(article_data)

In [14]:
# Scrapes the search results for every "Teaching - Instructional Materials in K12" and
# "Impact - Randomized Controlled Trials in secondary education" page,
# and adds them to the set of pages in which to look for articles.
def identify_pages_to_scrape(search_urls):
    pages_to_scrape = set()

    for search_url in search_urls:
        response = requests.get(search_url)

        if response.status_code != 200:
            print(f"Warning: Failed to retrieve {search_url} (status: {response.status_code}).")
            continue
        
        parsed_response = BeautifulSoup(response.text, 'html.parser')
        pagination_data = parsed_response.select("ul.pagination li a.page-link")

        # Catches the case where there is no pagination data, and we only have one page to scrape.
        if not pagination_data:
            pages_to_scrape.add(search_url)
            continue
        
        for page in pagination_data:
            page_url = REPO_BASE_URL + "/genai/repository" + page['href']
            pages_to_scrape.add(page_url)

    return pages_to_scrape
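
The function above builds each page URL by string concatenation, which works when every pagination `href` is a bare query string. A slightly more defensive alternative is to resolve the link against the search URL with `urllib.parse.urljoin`. This is a sketch only: `build_page_url` is a hypothetical helper, and the `"?page=1"` href format is an assumption about the site's markup.

```python
from urllib.parse import urljoin


def build_page_url(search_url, href):
    """Resolve a pagination link relative to the search page it came from."""
    return urljoin(search_url, href)


# Hypothetical example values; the "?page=1" href format is an assumption.
example = build_page_url(
    "https://scale.stanford.edu/genai/repository?search_api_fulltext=",
    "?page=1",
)
print(example)
```

A query-only link replaces the query string of the search URL and keeps its path, while a root-relative or absolute link is resolved as a browser would resolve it.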

In [15]:
# Scrapes each previously identified page for the article urls and
# adds them to the set of articles to scrape.
def identify_articles_to_scrape(page_urls):
    articles_to_scrape = set()
    article_titles = set()

    for page_url in page_urls:
        response = requests.get(page_url)

        if response.status_code != 200:
            print(f"Warning: Failed to retrieve {page_url} (status: {response.status_code}).")
            continue
        
        parsed_response = BeautifulSoup(response.text, 'html.parser')
        articles_to_parse = parsed_response.select("ul.list-papers li.col")
        
        # Extracts the individual article URLs from the page HTML.
        for article in articles_to_parse:
            article_sub_url = article.select_one("div.card a[href]")
            article_url = REPO_BASE_URL + article_sub_url['href']

            # Checks if the article title is already in the set of article titles to avoid duplicates.
            article_title = article.select_one("div.card a[hreflang='en']").get_text(strip=True)
            if article_title not in article_titles:
                article_titles.add(article_title)
                articles_to_scrape.add(article_url)

    print(f"Identified {len(articles_to_scrape)} distinct articles to scrape.")

    return articles_to_scrape
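
The title-based duplicate check above depends on exact string matches, and sets do not preserve the order in which articles were found. The sketch below shows one possible variant that keeps the first occurrence of each title and preserves order. `dedupe_by_title` is a hypothetical helper, and the case-insensitive, whitespace-collapsing comparison is an assumption that goes beyond the exact match used in the notebook.

```python
def dedupe_by_title(articles):
    """Keep the first (title, url) pair seen for each normalized title.

    `articles` is an iterable of (title, url) tuples. Titles are compared
    case-insensitively with surrounding and repeated whitespace collapsed.
    """
    unique = {}
    for title, url in articles:
        key = " ".join(title.split()).casefold()
        # setdefault stores the pair only if the key has not been seen yet.
        unique.setdefault(key, (title, url))
    return list(unique.values())
```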

In [16]:
# Parses metadata for analysis from each article
def extract_article_data(articles_to_scrape):
    article_data = []

    print(f"Attempting to scrape data from {len(articles_to_scrape)} articles...")

    for article in articles_to_scrape:
        response = requests.get(article)

        if response.status_code != 200:
            print(f"Warning: Failed to retrieve {article} (status: {response.status_code}).")
            continue

        parsed_article = BeautifulSoup(response.text, 'html.parser')

        # Gathers metadata from the article
        article_metadata = {}
        title = parsed_article.select_one("h1").get_text(strip=True)
        article_metadata["Title"] = title

        # Identifies all the metadata fields within the article HTML node.
        node_content = parsed_article.select_one("div.node__content").select("div.field")

        for field in node_content:
            article_metadata = extract_metadata_field(field, article_metadata)

        article_data.append(article_metadata)

    print(f"Successfully scraped data from {len(article_data)} distinct articles.")

    return article_data

In [17]:
# Parses each field of the article and returns an updated dictionary of the relevant metadata.
def extract_metadata_field(metadata_field, article_metadata):
    field_name_html = metadata_field.select_one("div.field__label")
        
    # If the field name is not found, I assume the field is the "Abstract" based on the site structure.
    if not field_name_html:
        field_name = "Abstract"
        field_value = metadata_field.select_one("div.field__item, p").get_text(strip=True)
        article_metadata[field_name] = field_value
        return article_metadata
    
    # If the field name is found, I use the label given and extract the relevant items, however many there may be.
    field_name = field_name_html.get_text(strip=True)
    field_items = metadata_field.select("div.field__item")
    
    item_values = ""
    # If the field name is "Link", extract the href attribute.
    if field_name == "Link":
        publishing_link = metadata_field.select_one("a[href]")
        # Store the link in item_values so it is written to the dictionary below.
        item_values = publishing_link["href"]
    else:
        for item in field_items:
            # Some fields have multiple items, so I concatenate them together and strip unnecessary linebreaks
            item_value = item.get_text().strip("\n")
            item_values = item_values + item_value

    # Add the field to the dictionary.
    article_metadata[field_name] = item_values

    return article_metadata
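
Multi-valued fields are concatenated above without a separator, so two items end up run together in one string. If a delimited value is preferred, the item texts could be joined as in this sketch. `join_field_items` and the `"; "` separator are hypothetical choices, not what the notebook currently does.

```python
def join_field_items(item_texts, separator="; "):
    """Join the text of several field items into one delimited string.

    Whitespace inside each item is collapsed and empty items are dropped.
    """
    cleaned = (" ".join(text.split()) for text in item_texts)
    return separator.join(text for text in cleaned if text)
```

For example, `join_field_items(["\nMiddle School\n", "High School"])` gives `"Middle School; High School"`, which can later be split back into individual values.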

In [18]:
# Write the scraped data to a CSV
def write_data_to_csv(data):
    data_gse_genai_articles = pd.DataFrame(data)

    # Rename the "Who age?" column to "What age?" and save to the CSV file at DATA_OUT,
    # so the file written matches the path reported below.
    (data_gse_genai_articles.rename(columns={"Who age?": "What age?"})
                            .sort_values(by=["Title"])
                            .to_csv(DATA_OUT, index=False))
    
    print(f"Wrote data to CSV at data {DATA_OUT}.")
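
The task also allows JSON output. A minimal sketch of serializing the same list of article dictionaries with the standard library follows. `records_to_json` is a hypothetical helper, and sorting by `"Title"` simply mirrors the CSV step.

```python
import json


def records_to_json(data):
    """Serialize a list of article dictionaries to a JSON string, sorted by title."""
    records = sorted(data, key=lambda record: record.get("Title", ""))
    # ensure_ascii=False keeps non-ASCII characters in names and titles readable.
    return json.dumps(records, indent=2, ensure_ascii=False)
```

The resulting string could then be saved with something like `Path(out_path).write_text(records_to_json(article_data), encoding="utf-8")`, where `out_path` is whichever output location is chosen.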

In [19]:
main(search_urls = SEARCH_URLS)

Identified 75 distinct articles to scrape.
Attempting to scrape data from 75 articles...
Successfully scraped data from 75 distinct articles.
Wrote data to CSV at data /Users/michaelspencer/projects/interviews/gse_gai_hub_task/part1_webscraping/data/clean/test_gse_genai_articles.csv.


### Outline of Plan
- Gather the full URLs that I must search, based on how many pages each of the two topic searches yields
- For each URL/page:
  - Make a request
  - Gather article URLs so that abstracts can also be included. The metadata also appears easier to iterate through on the individual article pages.
  - Parse HTML to store the following metadata for each article in dictionaries, being wary of duplicate articles:
    - Title
    - Author(s)
    - Application(s)
    - Age(s)
    - Uses
    - Study Design
- Save the data to a CSV or JSON file
- Read in data file
- Conduct basic frequency analysis of words
   - Clean titles by lowercasing, removing stop words, lemmatizing, etc.
- Report on key findings
- Create report
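
The keyword frequency step in the plan could be sketched as below. This is an assumption-laden illustration, not the analysis that was run: `keyword_frequencies` is a hypothetical helper, the stop-word list is deliberately tiny, the sample titles are invented, and lemmatization is left out.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def keyword_frequencies(texts, stop_words=STOP_WORDS):
    """Count lowercase word tokens across `texts`, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase first, then keep alphabetic tokens (with optional apostrophes).
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        counts.update(token for token in tokens if token not in stop_words)
    return counts


# Hypothetical titles, used only to show the call.
sample_titles = [
    "Generative AI for Teaching in the Classroom",
    "The Impact of AI Tutoring on Teaching",
]
print(keyword_frequencies(sample_titles).most_common(2))
```

In practice the `Title` or `Abstract` column of the saved CSV would be passed in place of `sample_titles`.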

In [20]:
# Scrapes the search results for every "Teaching - Instructional Materials in K12" and "Impact - Randomized Controlled Trials
# in secondary education" page, and adds them to the set of pages in which to look for articles.
PAGES_TO_SCRAPE = set()

for search_url in [TEACHING_K12_SEARCH_URL, IMPACT_SECONDARY_SEARCH_URL]:
    response = requests.get(search_url)

    if response.status_code != 200:
        print(f"Warning: Failed to retrieve {search_url} (status: {response.status_code}).")
        continue
    
    parsed_response = BeautifulSoup(response.text, 'html.parser')
    pagination_data = parsed_response.select("ul.pagination li a.page-link")

    # Catches the case where there is no pagination data, and we only have one page to scrape.
    if not pagination_data:
        PAGES_TO_SCRAPE.add(search_url)
        continue
    
    for page in pagination_data:
        page_url = REPO_BASE_URL + "/genai/repository" + page['href']
        PAGES_TO_SCRAPE.add(page_url)

In [21]:
# Scrapes each identified page for the article urls and adds them to the set of articles to scrape.
ARTICLES_TO_SCRAPE = set()

for page_url in PAGES_TO_SCRAPE:
    response = requests.get(page_url)

    if response.status_code != 200:
        print(f"Warning: Failed to retrieve {page_url} (status: {response.status_code}).")
        continue
    
    parsed_response = BeautifulSoup(response.text, 'html.parser')
    articles_to_parse = parsed_response.select("ul.list-papers li.col")
    
    for article in articles_to_parse:
        article_sub_url = article.select_one("div.card a[href]")
        article_url = REPO_BASE_URL + article_sub_url['href']
        ARTICLES_TO_SCRAPE.add(article_url)

print(f"Identified {len(ARTICLES_TO_SCRAPE)} distinct articles to scrape.")

Identified 77 distinct articles to scrape.


In [27]:
# Parses relevant data from each article
ARTICLE_DATA = []

print(f"Attempting to scrape data from {len(ARTICLES_TO_SCRAPE)} articles...")

for article in ARTICLES_TO_SCRAPE:
    response = requests.get(article)

    if response.status_code != 200:
        print(f"Warning: Failed to retrieve {article} (status: {response.status_code}).")
        continue

    parsed_article = BeautifulSoup(response.text, 'html.parser')

    # Gathers metadata from the article
    article_metadata = {}
    title = parsed_article.select_one("h1").get_text(strip=True)
    article_metadata["Title"] = title

    node_content = parsed_article.select_one("div.node__content").select("div.field")
    for field in node_content:
        field_name_html = field.select_one("div.field__label")
        
        # If the field name is not found, I assume the field is the "Abstract" based on the site structure.
        if not field_name_html:
            field_name = "Abstract"
            field_value = field.select_one("div.field__item, p").get_text(strip=True)
            article_metadata[field_name] = field_value
            continue
        
        # If the field name is found, I use the label given and extract the relevant items, however many there may be.
        field_name = field_name_html.get_text(strip=True)
        field_items = field.select("div.field__item")

        item_values = ""
        # If the field name is "Link", extract the href attribute
        if field_name == "Link":
            publishing_link = field.select_one("a[href]")
            item_values = publishing_link["href"]
        else:
            for item in field_items:
                # Some fields have multiple items, so I concatenate them together and strip unnecessary linebreaks
                item_value = item.get_text().strip("\n")
                item_values = item_values + item_value

        article_metadata[field_name] = item_values

    ARTICLE_DATA.append(article_metadata)

print(f"Successfully scraped data from {len(ARTICLE_DATA)} distinct articles.")

Attempting to scrape data from 77 articles...
Successfully scraped data from 77 distinct articles.


In [33]:
data_gse_genai_articles = pd.DataFrame(ARTICLE_DATA)

# Rename the "Who age?" column to "What age?" and save to a CSV file.
(data_gse_genai_articles.rename(columns={"Who age?": "What age?"})
                        .sort_values(by=["Title"])
                        .to_csv(DATA_DIR + "/clean/gse_genai_articles.csv", index=False))