# Lab Assignment 5: Web Scraping
## DS 6001

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.


## Problem 0
Import the following packages:

In [41]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


## Problem 1
Large language models are very impressive. They use deep neural networks with billions of parameters. They fine-tune results with clever approaches to reinforcement learning from human feedback. Many of them employ APIs with user interfaces that allow users to chat in natural language through a simple textbox, and receive a response in only a few seconds. And the societal impact of these models is undeniably enormous, to an extent we are only beginning to understand.

But, none of that is the most impressive part of LLMs. Like anything else in machine learning, the most impressive and difficult element of the work is the data collection.

The data used to train major LLMs is something that big companies communicate very little about. [OpenAI](https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-foundation-models-are-developed) only says that GPT and other baseline models are trained on data that is "freely and openly accessible on the internet." Many commentators, such as Dr. Vered Shwartz, assistant professor in the University of British Columbia department of computer science, say that LLMs like GPT are trained on "[essentially the entire internet](https://science.ubc.ca/news/chatgpt-has-read-almost-whole-internet-hasnt-solved-its-diversity-issues)." Think about pulling the entire internet into a single model and tell me this isn't the most impressive thing about ChatGPT.

A primary training set for OpenAI's GPT and other major LLMs is the corpus compiled by [Common Crawl](https://commoncrawl.org/faq), a nonprofit organization that describes itself as "dedicated to providing a copy of the Internet to Internet researchers, companies and individuals at no cost for the purpose of research and analysis." According to the [Mozilla Foundation](https://www.mozillafoundation.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/), "[o]ver 80% of GPT-3 tokens (a representation unit of text data) stemmed from Common Crawl. Many models published by other developers likewise rely heavily on it: the study analyzed 47 LLMs published between 2019 and October 2023 that power text generators and found at least 64% of them were trained on Common Crawl."

Common Crawl is a massive web scraping endeavor. But it's not necessarily a sophisticated one. Mostly, Common Crawl is downloading the raw HTML from the webpages it scrapes and extracting text from the HTML. A task like this is exactly the kind of thing we can use `requests` and `BeautifulSoup` to do. 
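
As a rough illustration of the "extract text from the HTML" step, the sketch below runs `BeautifulSoup` on a small, made-up HTML string. The markup and variable names are assumptions for illustration only; this is not Common Crawl's actual pipeline.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML document standing in for a downloaded page.
raw_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Hello</h1>
    <p>Some <b>bold</b> text.</p>
  </body>
</html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# get_text() walks the parse tree and joins the text nodes;
# strip=True trims each node and skips whitespace-only ones.
page_text = soup.get_text(separator=" ", strip=True)
print(page_text)
```

In a real scrape, `raw_html` would come from `requests.get(...).text` rather than a literal string.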

Common Crawl is also at the center of many controversies and legal actions regarding generative AI, such as the New York Times' copyright infringement [lawsuit](https://www.npr.org/2025/03/26/nx-s1-5288157/new-york-times-openai-copyright-case-goes-forward) against OpenAI, and concerns about bias and racism in LLMs stemming from their training data.

For this problem, please examine the [Common Crawl website](https://commoncrawl.org/) and read "[Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI](https://assets.mofoprod.net/network/documents/Common_Crawl_Mozilla_Foundation_2024.pdf)" by Stefan Baack for the Mozilla Foundation.


### Part a
If you want access to the latest Common Crawl dataset, where can you get it? How many websites does the dataset contain? And outside of computational and storage costs, how much will it cost to obtain the data? (Use [Common Crawl's website](https://commoncrawl.org/) to answer this question) [8 points]

You can get the latest Common Crawl dataset by hovering over "The Data" tab at the top of the site, then clicking "Latest Crawl". The latest crawl contains 2.39 billion web pages, and accessing the data is free.

### Part b
Is Common Crawl more useful for the pre-training or fine-tuning stage of the development of an LLM? Why is the Common Crawl corpus used by so many AI efforts, and why must it usually be altered or filtered in some way? (see pages 11-13 of Baack's article) [8 points]

It's more useful for pre-training, because pre-training requires massive amounts of data, which Common Crawl provides at scale.

It's used by so many AI efforts because it's open and free, and because it's extremely large and diverse.

It has to be altered or filtered because it contains a large amount of content that is undesirable for AI training, such as hate speech, pornography, and low-quality material like "boiler-plate text like menus, error messages, or duplicate text".

### Part c
Given how Common Crawl decides which URLs to crawl, what are three reasons why the data cannot be said to be the complete internet or a representative sample of it? (see pages 17-22 of Baack's article) [8 points]

1. Crawling prioritizes domains with high harmonic centrality (frequently linked sites), so less-connected or marginalized communities’ sites are underrepresented.

2. Major sites such as Facebook and The New York Times block Common Crawl via robots.txt, so their content is excluded.

3. Because the infrastructure is U.S.-based, crawls overrepresent English-language content and underrepresent other regions.
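
To make the second point concrete, here is a minimal sketch of how a robots.txt rule can shut out one crawler while admitting others, using the standard library's `urllib.robotparser`. Common Crawl's crawler identifies itself as `CCBot`; the robots.txt content, the example URL, and the name `SomeOtherBot` are hypothetical and not copied from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site):
# it blocks the crawler named "CCBot" and allows everyone else.
robots_txt = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/articles/some-story.html"
ccbot_allowed = parser.can_fetch("CCBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)
print(ccbot_allowed, other_allowed)
```

A crawler that honors robots.txt would skip every page on such a site, so none of that site's text would reach the corpus.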

### Part d
Of the suggestions that the Mozilla Foundation makes for the future development of Common Crawl, are there any that you especially agree with or disagree with, and why? (See pages 30-31 of Baack's article) [8 points]

I agree with the recommendations on transparency and diversity. Since Common Crawl is already a cornerstone for LLM pre-training, making its biases visible and broadening its coverage would directly improve the fairness and trustworthiness of AI trained on it. Attribution would also push AI builders to be more accountable.


## Problem 2
For the following problems, you will be scraping http://books.toscrape.com/. This website is a fake book retailer, designed to mimic the design of many retail websites. It exists solely to help students practice web-scraping, so there aren’t going to be any ethical concerns with this particular exercise, and there shouldn’t be any issues with rate limits or other gates that could prevent web-scraping. Take a moment and look at this website, so that you know what you will be working with.

Your goal is to generate a dataframe with four columns: one for the title, one for the price, one for the star-rating, and one for the book cover JPEG’s URL. The dataframe will also have 1000 rows, one for each of the 1000 books listed on the 50 pages of this website.

### Part a
Pull the HTML code from http://books.toscrape.com/. Make sure you provide the correct user agent string. Then parse this HTML code and save the parsed code as a separate Python variable. [8 points]

In [63]:
url = "https://books.toscrape.com/"

useragent = f'scissorballs/0.0 (nga3rp@virginia.edu) python-requests/{requests.__version__}'
headers = {'User-Agent': useragent, 'From': 'nga3rp@virginia.edu'}

In [6]:
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

### Part b
Extract all 20 of the book titles and save them in a list. [8 points]

In [15]:
bookList = soup.find_all('a', attrs = {'title': True})
titleList = [a['title'] for a in bookList]
titleList

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]

### Part c
Extract the price of each of the 20 books and save these prices in a list. (The prices are listed in British pounds, and include the £ symbol. Remove the £ symbols: if you’ve saved the prices in a list named `prices`, then the following code should work: `prices = [s.replace('Â£', '') for s in prices]`.) [8 points]

In [18]:
priceList = soup.find_all('p', class_ = 'price_color')
prices = [p.get_text() for p in priceList]
prices = [s.replace('Â£', '') for s in prices]
prices

['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 '22.65',
 '33.34',
 '17.93',
 '22.60',
 '52.15',
 '13.99',
 '20.66',
 '17.46',
 '52.29',
 '35.02',
 '57.25',
 '23.88',
 '37.59',
 '51.33',
 '45.17']
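
The `Â£` in the replacement string is most likely an encoding artifact: the page appears to be UTF-8, but the response text was decoded as ISO-8859-1 (Latin-1). The sketch below reproduces that effect with plain strings. It is an illustration under that assumption, not a diagnosis of the server's configuration.

```python
# The pound sign is a single character, but UTF-8 stores it as two bytes.
pound_bytes = "£".encode("utf-8")

# Decoding those two bytes as Latin-1 yields two characters instead of one.
garbled = pound_bytes.decode("latin-1")

# Reversing the mistake recovers the original text.
repaired = "Â£51.77".encode("latin-1").decode("utf-8")
price = float(repaired.replace("£", ""))
print(garbled, repaired, price)
```

If that assumption holds, setting `r.encoding = 'utf-8'` before reading `r.text` would be another way to get a clean `£` to strip.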

### Part d
Extract the star level ratings for the 20 books. [Hint: for tags such as `<p class="star-rating One">` in which the class has a space, the class is actually a list in which the first item in the list is `"star-rating"` and the second item in the list is `"One"`. It's possible to search on either item in this list.][8 points]

In [28]:
starLevelList = soup.find_all('p', class_ = 'star-rating')
starLevels = [p['class'][1] for p in starLevelList]
starLevels

['Three',
 'One',
 'One',
 'Four',
 'Five',
 'One',
 'Four',
 'Three',
 'Four',
 'One',
 'Two',
 'Four',
 'Five',
 'Five',
 'Five',
 'Three',
 'One',
 'One',
 'Two',
 'Two']

In [29]:
star_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
starLevels = [star_map[val] for val in starLevels]
starLevels

[3, 1, 1, 4, 5, 1, 4, 3, 4, 1, 2, 4, 5, 5, 5, 3, 1, 1, 2, 2]
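
The hint about multi-valued classes can be checked on a small, made-up snippet. The markup below is a simplified assumption modeled on the hint, and `rating_map` is a hypothetical name. The last step shows a slightly more defensive lookup that does not rely on the rating being the second class value.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup modeled on the hint in the question.
snippet = '<p class="star-rating Three"></p><p class="star-rating One"></p>'
soup_demo = BeautifulSoup(snippet, "html.parser")

tags = soup_demo.find_all("p", class_="star-rating")

# Each tag's class attribute comes back as a list of its space-separated values.
class_lists = [tag["class"] for tag in tags]

# Searching on the second value alone also matches.
only_three = soup_demo.find_all("p", class_="Three")

# Pick whichever class value is a known rating word, regardless of position.
rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
ratings = [next(rating_map[c] for c in tag["class"] if c in rating_map) for tag in tags]
print(class_lists, len(only_three), ratings)
```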

### Part e
Extract the URLs for the JPEG thumbnail images that show the covers of the 20 books. (Maybe we want to mine the images to build models that predict the star level, literally judging books by their covers.) [8 points]

In [64]:
imgURLList = soup.find_all('img')
imgURLs = [urljoin(url, img['src']) for img in imgURLList]
imgURLs

['https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
 'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
 'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
 'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
 'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg',
 'https://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg',
 'https://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg',
 'https://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg',
 'https://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg',
 'https://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg',
 'https://books.toscrape.com/media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg',
 'https://books.toscrape.com/media/cache/10/48/1048f63
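
The `urljoin` call matters because an `img` tag's `src` may be relative to the page it sits on. A short sketch with hypothetical relative paths (the file name `cover.jpg` is made up) shows how the same image resolves from two different base pages:

```python
from urllib.parse import urljoin

front_page = "https://books.toscrape.com/"
later_page = "https://books.toscrape.com/catalogue/page-2.html"

# Relative to the site root, the path is simply appended.
from_front = urljoin(front_page, "media/cache/aa/bb/cover.jpg")

# Relative to a page inside /catalogue/, "../" climbs one directory first.
from_later = urljoin(later_page, "../media/cache/aa/bb/cover.jpg")

print(from_front)
print(from_later)
```

Passing the URL of the page actually being scraped as the base keeps the resolved links correct on every page.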

### Part f
Create a dataframe with one row for each of the 20 books, and the book titles, prices, star ratings, and cover JPEG URLs as the four columns. [8 points]

In [30]:
df = pd.DataFrame(np.column_stack([titleList, prices, starLevels, imgURLs]), columns=['Title', 'Price', 'Rating', 'Image URL'])
df

Unnamed: 0,Title,Price,Rating,Image URL
0,A Light in the Attic,51.77,3,https://books.toscrape.com/media/cache/2c/da/2...
1,Tipping the Velvet,53.74,1,https://books.toscrape.com/media/cache/26/0c/2...
2,Soumission,50.1,1,https://books.toscrape.com/media/cache/3e/ef/3...
3,Sharp Objects,47.82,4,https://books.toscrape.com/media/cache/32/51/3...
4,Sapiens: A Brief History of Humankind,54.23,5,https://books.toscrape.com/media/cache/be/a5/b...
5,The Requiem Red,22.65,1,https://books.toscrape.com/media/cache/68/33/6...
6,The Dirty Little Secrets of Getting Your Dream...,33.34,4,https://books.toscrape.com/media/cache/92/27/9...
7,The Coming Woman: A Novel Based on the Life of...,17.93,3,https://books.toscrape.com/media/cache/3d/54/3...
8,The Boys in the Boat: Nine Americans and Their...,22.6,4,https://books.toscrape.com/media/cache/66/88/6...
9,The Black Maria,52.15,1,https://books.toscrape.com/media/cache/58/46/5...
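
One thing to keep in mind: `np.column_stack` packs everything into a single NumPy array with one dtype, so the numeric ratings end up stored as text alongside the titles. A possible alternative, sketched here with short stand-in lists (the names `titles`, `price_strings`, and `ratings` are hypothetical), builds the dataframe from a dictionary so each column keeps its own type:

```python
import pandas as pd

# Small stand-in lists using the first two books from the output above.
titles = ["A Light in the Attic", "Tipping the Velvet"]
price_strings = ["51.77", "53.74"]
ratings = [3, 1]

# A dict of lists lets pandas infer a dtype per column.
books = pd.DataFrame({"Title": titles, "Price": price_strings, "Rating": ratings})

# Prices were scraped as text, so convert them explicitly.
books["Price"] = books["Price"].astype(float)
print(books.dtypes)
```

With numeric columns, operations like `books["Price"].mean()` work without further conversion.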


### Part g
Create a function that takes the URL of the webpage to scrape as an input, applies the code you wrote for parts a through e, and generates the dataframe from part f as the output. [10 points]

In [59]:
def scrape(url):

    useragent = f'scissorballs/0.0 (nga3rp@virginia.edu) python-requests/{requests.__version__}'
    headers = {'User-Agent': useragent, 'From': 'nga3rp@virginia.edu'}

    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    bookList = soup.find_all('a', attrs = {'title': True})
    titleList = [a['title'] for a in bookList]

    priceList = soup.find_all('p', class_ = 'price_color')
    prices = [p.get_text() for p in priceList]
    prices = [s.replace('Â£', '') for s in prices]

    starLevelList = soup.find_all('p', class_ = 'star-rating')
    starLevels = [p['class'][1] for p in starLevelList]
    star_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    starLevels = [star_map[val] for val in starLevels]

    imgURLList = soup.find_all('img')
    imgURLs = [urljoin(url, img['src']) for img in imgURLList]

    df = pd.DataFrame(np.column_stack([titleList, prices, starLevels, imgURLs]), columns=['Title', 'Price', 'Rating', 'Image URL'])

    return df
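
The function above builds four independent lists and relies on them lining up. An alternative design, sketched below, walks through each book's container and pulls all four fields from it, so a missing field cannot shift the rows. The HTML is a simplified, assumed version of the listing markup, and `parse_books` and `STAR_MAP` are hypothetical names. The snippet uses a clean `£`, whereas text decoded the way `r.text` is in this notebook would need `'Â£'` stripped instead.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Simplified, assumed markup for two book listings; the real page has more tags.
html = """
<article class="product_pod">
  <a href="a.html"><img src="media/a.jpg" class="thumbnail"/></a>
  <p class="star-rating Three"></p>
  <h3><a href="a.html" title="Book A">Book A</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <a href="b.html"><img src="media/b.jpg" class="thumbnail"/></a>
  <p class="star-rating Five"></p>
  <h3><a href="b.html" title="Book B">Book B</a></h3>
  <p class="price_color">£20.50</p>
</article>
"""

STAR_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_books(html_text, base_url):
    """Return one dict per <article class="product_pod"> block."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for book in soup.find_all("article", class_="product_pod"):
        price_text = book.find("p", class_="price_color").get_text()
        rows.append({
            "Title": book.find("h3").find("a")["title"],
            "Price": float(price_text.replace("£", "")),
            "Rating": STAR_MAP[book.find("p", class_="star-rating")["class"][1]],
            "Image URL": urljoin(base_url, book.find("img")["src"]),
        })
    return rows

rows = parse_books(html, "https://books.toscrape.com/")
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would give the same four columns as the dataframe from part f.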
    

In [60]:
dfTest = scrape(url)
pd.set_option('display.max_colwidth', None)
dfTest

Unnamed: 0,Title,Price,Rating,Image URL
0,A Light in the Attic,51.77,3,https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
1,Tipping the Velvet,53.74,1,https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
2,Soumission,50.1,1,https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
3,Sharp Objects,47.82,4,https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
4,Sapiens: A Brief History of Humankind,54.23,5,https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
5,The Requiem Red,22.65,1,https://books.toscrape.com/media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg
6,The Dirty Little Secrets of Getting Your Dream Job,33.34,4,https://books.toscrape.com/media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg
7,"The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull",17.93,3,https://books.toscrape.com/media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg
8,The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics,22.6,4,https://books.toscrape.com/media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg
9,The Black Maria,52.15,1,https://books.toscrape.com/media/cache/58/46/5846057e28022268153beff6d352b06c.jpg


### Part h
Notice that there are many pages to http://books.toscrape.com/. When you click on “Next” in the bottom-right corner of the screen, it takes you to http://books.toscrape.com/catalogue/page-2.html. The front page is the same as http://books.toscrape.com/catalogue/page-1.html, and there are 50 total pages.

Write a loop that uses the function you wrote in part g to scrape each of the 50 pages, and append each of these data frames together. If you write this loop correctly, your dataframe will have 1000 rows (20 books on each of the 50 pages). 

Some hints:

* Typing `new_df = pd.DataFrame()` with nothing in the parentheses will create an empty data frame on which new data can be appended.

* There are many loops you can use, but the most straightforward one is a for-values loop that counts from 1 to 50. In Python, you can initialize such a loop with `for i in range(1, 51):`, and then indent every line below it that belongs inside the loop. Inside the loop, the variable `i` is now a stand-in for the number currently being considered.

* You will need to figure out how to replace the number in URLs like http://books.toscrape.com/catalogue/page-2.html with the number currently under consideration in the loop. You might need the `str()` function, which turns numeric values into strings.

* `pd.concat()` is a function that appends dataframes together.

[10 points]
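
As a small sketch of the URL hint, the three expressions below build the same list of 50 page URLs; which one to use is a matter of taste. The variable names are hypothetical.

```python
base = "http://books.toscrape.com/catalogue/page-"

# str() turns the loop counter into text so it can be concatenated.
with_str = [base + str(i) + ".html" for i in range(1, 51)]

# str.format() fills the {} placeholder and converts the number itself.
template = "http://books.toscrape.com/catalogue/page-{}.html"
with_format = [template.format(i) for i in range(1, 51)]

# An f-string does the same thing inline.
with_fstring = [f"{base}{i}.html" for i in range(1, 51)]

print(len(with_str), with_str[0], with_str[-1])
```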

In [44]:
urlBase = "https://books.toscrape.com/catalogue/page-{}.html"

new_df = pd.DataFrame()

for i in range(1, 51):
    url = urlBase.format(i)
    page_df = scrape(url)
    new_df = pd.concat([new_df, page_df], ignore_index=True)
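
A common variation on this loop collects the per-page dataframes in a list and calls `pd.concat` once at the end, which avoids re-copying the accumulated rows on every iteration. The sketch below uses a hypothetical stand-in, `fake_scrape`, so it runs without network access; in the notebook it would be replaced by the `scrape` function from part g.

```python
import pandas as pd

def fake_scrape(page_number):
    """Hypothetical stand-in for scrape(url): returns a 2-row frame, no network needed."""
    return pd.DataFrame({
        "Title": [f"Book {page_number}a", f"Book {page_number}b"],
        "Page": [page_number, page_number],
    })

# Collect the per-page frames first, then concatenate once.
frames = [fake_scrape(i) for i in range(1, 4)]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

With `ignore_index=True`, the combined frame gets a fresh 0-based index instead of repeating each page's own row numbers.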


In [62]:
new_df

Unnamed: 0,Title,Price,Rating,Image URL
0,A Light in the Attic,51.77,3,https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
1,Tipping the Velvet,53.74,1,https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
2,Soumission,50.10,1,https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
3,Sharp Objects,47.82,4,https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
4,Sapiens: A Brief History of Humankind,54.23,5,https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
...,...,...,...,...
995,Alice in Wonderland (Alice's Adventures in Wonderland #1),55.53,1,https://books.toscrape.com/media/cache/96/ee/96ee77d71a31b7694dac6855f6affe4e.jpg
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",57.06,4,https://books.toscrape.com/media/cache/09/7c/097cb5ecc6fb3fbe1690cf0cbdea4ac5.jpg
997,A Spy's Devotion (The Regency Spies of London #1),16.97,5,https://books.toscrape.com/media/cache/1b/5f/1b5ff86f3c75e51e24c573d3f8bffd8f.jpg
998,1st to Die (Women's Murder Club #1),53.98,1,https://books.toscrape.com/media/cache/2b/41/2b4161c5b72a4ae386b644682361b34a.jpg
