## Learning objectives

use requests to fetch HTML and parse pages with BeautifulSoup

select elements via CSS selectors and extract text/attributes (.text, ['href'], ['alt'], ['src'])

clean scraped text (e.g., .strip()) and construct absolute URLs by prefixing domains

build pandas DataFrames from scraped lists (names ↔ titles, states ↔ URLs, categories ↔ URLs)

extract specific media (e.g., image src by matching an alt string)

read HTML tables directly with pandas.read_html + StringIO (no BS4 needed)

document scraping steps and decisions clearly in code comments and markdown
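As a small illustration of the text and attribute extraction listed above, the sketch below parses an inline HTML string rather than a live page. The markup, the `item` class, and the paths are made up for the example; only the BeautifulSoup calls themselves are real.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only so the example runs without a network call
sample_html = """
<a class="item" href="/page/one.htm">  First page </a>
<a class="item">No link here</a>
<img alt="Sample picture" src="/images/sample.png">
"""
soup = BeautifulSoup(sample_html, "html.parser")
links = soup.select("a.item")

# .text keeps surrounding whitespace, so .strip() cleans it up
texts = [a.text.strip() for a in links]

# a["href"] raises KeyError when the attribute is missing; .get() returns None instead
hrefs = [a.get("href") for a in links]

# Select an image by its alt text, then read its src attribute
img = soup.select_one('img[alt="Sample picture"]')
src = img["src"]

print(texts)
print(hrefs)
print(src)
```

Using `.get()` is a defensive choice for pages where some matched elements may lack the attribute; the bracket form used in the cells below is fine when every element is known to have it.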

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [169]:
myresponse = requests.get("https://datamine.purdue.edu/about/about-welcome/")

In [170]:
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

In [171]:
mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')

[<p class="purdue-home-cta-grid__card-name">Mark Daniel Ward</p>,
 <p class="purdue-home-cta-grid__card-name">Kevin Amstutz</p>,
 <p class="purdue-home-cta-grid__card-name">Ashley Arroyo</p>,
 <p class="purdue-home-cta-grid__card-name">Donald Barnes</p>,
 <p class="purdue-home-cta-grid__card-name">Brandt Barnes</p>,
 <p class="purdue-home-cta-grid__card-name">Maggie Betz</p>,
 <p class="purdue-home-cta-grid__card-name">Bryce Castle</p>,
 <p class="purdue-home-cta-grid__card-name">Cai Chen</p>,
 <p class="purdue-home-cta-grid__card-name">Doug Crabill</p>,
 <p class="purdue-home-cta-grid__card-name">Peter Dragnev</p>,
 <p class="purdue-home-cta-grid__card-name">Stacey Dunderman</p>,
 <p class="purdue-home-cta-grid__card-name">Jessica Gerlach</p>,
 <p class="purdue-home-cta-grid__card-name">Dan Hirleman</p>,
 <p class="purdue-home-cta-grid__card-name">Jessica Jud</p>,
 <p class="purdue-home-cta-grid__card-name">Kali Lacy</p>,
 <p class="purdue-home-cta-grid__card-name">Gloria Lenfestey</p

In [194]:
# Collect the staff names. This list becomes one column of the DataFrame.
NameStaff = [element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-name"]')]

In [193]:
# Same approach for the position titles, using the title-line class
Positions = [element.text for element in mysoup.select('p[class = "purdue-home-cta-grid__card-titleline"]')]

In [195]:
# Build the DataFrame Jobtitles with two columns: "Staff Names" and "Position Titles"
Jobtitles = pd.DataFrame({
        "Staff Names": NameStaff,
        "Position Titles": Positions
    })

In [178]:
print(Jobtitles)

           Staff Names                                    Position Titles
0     Mark Daniel Ward                                 Executive Director
1        Kevin Amstutz                              Senior Data Scientist
2        Ashley Arroyo                  Data Science Techincal Specialist
3        Donald Barnes                      Guest Relations Administrator
4        Brandt Barnes                                           Good Boy
5          Maggie Betz  Managing Director of The Data Mine at Indianap...
6         Bryce Castle            Corporate Partners Technical Specialist
7             Cai Chen            Corporate Partners Technical Specialist
8         Doug Crabill                              Senior Data Scientist
9        Peter Dragnev          Corporate Partners Technical Specialist  
10    Stacey Dunderman             Lead Program Administration Specialist
11     Jessica Gerlach            Corporate Partners Technical Specialist
12        Dan Hirleman  Regional Direc

Following Dr. Ward's video, I scraped the Purdue Data Mine staff page and identified which element holds each of the 22 employees' titles. I started by selecting <p> elements by class, then narrowed the selector to p[class = "purdue-home-cta-grid__card-titleline"], which let me print the job titles of all 22 staff members.
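Building the two lists separately works as long as every staff card has both a name and a title. If one card were missing a title, the lists would have different lengths and pd.DataFrame would raise an error. A possible alternative, sketched below, is to loop over each card and read both fields from it. The wrapping <div class="card"> and the placeholder names are assumptions made for the example; only the two <p> class names come from the cells above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: the <div class="card"> wrapper and the people are invented.
# The second card deliberately has no title line.
sample_html = """
<div class="card">
  <p class="purdue-home-cta-grid__card-name">Person A</p>
  <p class="purdue-home-cta-grid__card-titleline">Example Title</p>
</div>
<div class="card">
  <p class="purdue-home-cta-grid__card-name">Person B</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for card in soup.select("div.card"):
    # select_one returns None when nothing matches inside this card
    name = card.select_one("p.purdue-home-cta-grid__card-name")
    title = card.select_one("p.purdue-home-cta-grid__card-titleline")
    rows.append({
        "Staff Names": name.get_text(strip=True) if name else None,
        "Position Titles": title.get_text(strip=True) if title else None,
    })

staff = pd.DataFrame(rows)
print(staff)
```

Each row is built from a single card, so a name can never be paired with someone else's title.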

## Question 2

In [179]:
import pandas as pd
import requests

In [180]:
myresponse = requests.get('https://www.nps.gov')

In [181]:
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

In [182]:
# State names, extracted the same way as in Question 1, with .strip() to remove surrounding whitespace
myState = [element.text.strip() for element in mysoup.select('a[class = "dropdown-item dropdown-state"]')]

In [183]:
# Absolute state URLs: prefix the domain to each relative href
StateURL = ['https://www.nps.gov' + element['href'] for element in mysoup.select('a[class = "dropdown-item dropdown-state"]')]

In [196]:
# Same DataFrame pattern as Question 1, with new column names
CoolState = pd.DataFrame({
        "States": myState,
        "State URL": StateURL
    })

In [185]:
print(CoolState)

                      States                               State URL
0                    Alabama  https://www.nps.gov/state/al/index.htm
1                     Alaska  https://www.nps.gov/state/ak/index.htm
2             American Samoa  https://www.nps.gov/state/as/index.htm
3                    Arizona  https://www.nps.gov/state/az/index.htm
4                   Arkansas  https://www.nps.gov/state/ar/index.htm
5                 California  https://www.nps.gov/state/ca/index.htm
6                   Colorado  https://www.nps.gov/state/co/index.htm
7                Connecticut  https://www.nps.gov/state/ct/index.htm
8                   Delaware  https://www.nps.gov/state/de/index.htm
9       District of Columbia  https://www.nps.gov/state/dc/index.htm
10                   Florida  https://www.nps.gov/state/fl/index.htm
11                   Georgia  https://www.nps.gov/state/ga/index.htm
12                      Guam  https://www.nps.gov/state/gu/index.htm
13                    Hawaii  http
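Prefixing the domain with + works here because every href starts with a slash. A slightly more forgiving option is urljoin from the standard library, which also leaves already-absolute links unchanged. The sketch below uses an inline HTML string. The first href matches the pattern in the output above, while the second, already-absolute href is an assumption added to show that case.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.nps.gov"

# Inline sample; the absolute href on the second link is hypothetical
sample_html = """
<a class="dropdown-item dropdown-state" href="/state/al/index.htm">Alabama</a>
<a class="dropdown-item dropdown-state" href="https://www.nps.gov/state/ak/index.htm">Alaska</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# urljoin resolves relative hrefs against base_url and keeps absolute ones as they are
state_urls = [
    urljoin(base_url, a["href"])
    for a in soup.select("a.dropdown-item.dropdown-state")
]
print(state_urls)
```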


## Question 3

In [23]:
# Import pandas to build the DataFrame; requests was already imported above, so re-importing it is redundant but harmless
import pandas as pd
import requests

In [24]:
# Fetch the page HTML from the website
BookDF = requests.get('https://books.toscrape.com/')

In [25]:
mysoup = BeautifulSoup(BookDF.content, 'html.parser')

In [26]:
# Category names from the sidebar links, stripped of surrounding whitespace
CategoryNames = [element.text.strip() for element in mysoup.select('li ul li a')]

In [27]:
# Absolute category URLs: prefix the site root to each relative href
CategoryLink = ['https://books.toscrape.com/' + element['href'] for element in mysoup.select('li ul li a')]

In [28]:
# Combine the two lists into a DataFrame (BookDF is the requests response, not a pandas object)
BookData = pd.DataFrame({
        "Category": CategoryNames,
        "Links": CategoryLink
    })

In [29]:
print(BookData)

              Category                                              Links
0               Travel  https://books.toscrape.com/catalogue/category/...
1              Mystery  https://books.toscrape.com/catalogue/category/...
2   Historical Fiction  https://books.toscrape.com/catalogue/category/...
3       Sequential Art  https://books.toscrape.com/catalogue/category/...
4             Classics  https://books.toscrape.com/catalogue/category/...
5           Philosophy  https://books.toscrape.com/catalogue/category/...
6              Romance  https://books.toscrape.com/catalogue/category/...
7       Womens Fiction  https://books.toscrape.com/catalogue/category/...
8              Fiction  https://books.toscrape.com/catalogue/category/...
9            Childrens  https://books.toscrape.com/catalogue/category/...
10            Religion  https://books.toscrape.com/catalogue/category/...
11          Nonfiction  https://books.toscrape.com/catalogue/category/...
12               Music  https://books.
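The printed Links column ends in "..." because pandas shortens long cell values for display; the full URLs are still stored in the DataFrame. One way to see them, sketched below, is to lift the column-width limit temporarily or to use to_string(). The long URL here is a made-up placeholder, not a real category link.

```python
import pandas as pd

# Placeholder URL, long enough to exceed the default display width of 50 characters
long_url = "https://books.toscrape.com/" + "hypothetical/path/segment/" * 3 + "index.html"
links = pd.DataFrame({"Category": ["Example"], "Links": [long_url]})

# Default display: long cell values are shortened
print(links)

# Temporarily remove the column-width limit for this block only
with pd.option_context("display.max_colwidth", None):
    print(links)

# to_string() renders the whole frame without shortening cell values
full_text = links.to_string()
print(full_text)
```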


## Question 4

In [96]:
myResponse = requests.get("https://www.gocomics.com/peanuts/1970/06/22")

In [94]:
mySoup = BeautifulSoup(myResponse.content, 'html.parser')

In [99]:
[element['src'] for element in mySoup.select('img[alt = "Peanuts Comic Strip for June 22, 1970 "]')]

['https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47']

In [8]:
# Reused the date from Dr. Ward's example rather than picking a new one
myResponses = requests.get("https://www.gocomics.com/peanuts/2004/10/05")

In [9]:
# Parse the new response into its own soup object (mySoups)
mySoups = BeautifulSoup(myResponses.content, 'html.parser')

In [11]:
# Same selector as before, with the alt text updated for this date
[element['src'] for element in mySoups.select('img[alt = "Peanuts Comic Strip for October 05, 2004 "]')]

['https://assets.amuniversal.com/282ff900f868013014ce001dd8b71c47']

In [18]:
# Same request as the previous two, with a different date
myResponsed = requests.get("https://www.gocomics.com/peanuts/2014/08/08")

In [19]:
# Parse this response into its own soup object (mySouped)
mySouped = BeautifulSoup(myResponsed.content, 'html.parser')

In [20]:
# Same selector, with the alt text updated for this date
[element['src'] for element in mySouped.select('img[alt = "Peanuts Comic Strip for August 08, 2014 "]')]

['https://assets.amuniversal.com/b6966aa0f00f01317ff1005056a9545d']
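The three lookups above differ only in the date, so they could be folded into one helper. The sketch below is one way to do that, with two assumptions: the function name comic_image_srcs is invented here, and the selector uses a prefix match (alt^=) instead of an exact match so the trailing space in the alt text does not have to be typed exactly. It runs on an inline HTML string that mimics the first comic's <img> tag, not on the live site.

```python
from datetime import date
from bs4 import BeautifulSoup

def comic_image_srcs(soup, day):
    """Return the src of every <img> whose alt text starts with the strip label for `day`."""
    # strftime gives e.g. "June 22, 1970" (month name assumes the default English locale)
    label = f"Peanuts Comic Strip for {day.strftime('%B %d, %Y')}"
    # ^= matches on the start of the attribute, so a trailing space in alt is tolerated
    return [img["src"] for img in soup.select(f'img[alt^="{label}"]')]

# Inline sample mimicking the page; the second <img> is a hypothetical unrelated image
sample_html = """
<img alt="Peanuts Comic Strip for June 22, 1970 " src="https://assets.amuniversal.com/2181aa70f895013014ff001dd8b71c47">
<img alt="Site logo" src="/logo.png">
"""
soup = BeautifulSoup(sample_html, "html.parser")

srcs = comic_image_srcs(soup, date(1970, 6, 22))
print(srcs)
```

With a helper like this, each new date needs only a new request and one function call, and the alt string is built from the date instead of being retyped.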


## Question 5

In [119]:
import pandas as pd
from io import StringIO

In [120]:
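myresponse = requests.get("https://www.scrapethissite.com/pages/forms/?per_page=600")

In [121]:
# Not needed for read_html below; kept from the earlier questions
mysoup = BeautifulSoup(myresponse.content, 'html.parser')

In [126]:
# Following Dr. Ward's video: pandas parses the first HTML table directly from the response text
pd.read_html(StringIO(myresponse.text))[0]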

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.550,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25
...,...,...,...,...,...,...,...,...,...
577,Tampa Bay Lightning,2011,38,36,8.0,0.463,235,281,-46
578,Toronto Maple Leafs,2011,35,37,10.0,0.427,231,264,-33
579,Vancouver Canucks,2011,51,22,9.0,0.622,249,198,51
580,Washington Capitals,2011,42,32,8.0,0.512,222,230,-8
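To show what pd.read_html does without fetching the page, the sketch below feeds it a small inline table. The two rows reuse values from the output above, but the table is trimmed to four columns for the example. read_html relies on an HTML parser such as lxml being installed.

```python
from io import StringIO
import pandas as pd

# Small inline table; a trimmed-down stand-in for the table on the real page
sample_html = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>Losses</th></tr>
  <tr><td>Boston Bruins</td><td>1990</td><td>44</td><td>24</td></tr>
  <tr><td>Buffalo Sabres</td><td>1990</td><td>31</td><td>30</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> found
tables = pd.read_html(StringIO(sample_html))
teams = tables[0]
print(teams)
```

Wrapping the HTML text in StringIO passes it as a file-like object, which is the form recent pandas versions expect for literal HTML.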

