# Web Scraping
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp24&branch=main&urlpath=tree%2Fdata271_sp24%2Fdemos%2Fdata271_demo31_live.ipynb) instead. 
If you would rather follow along by just executing the cells, stay in this notebook.

In [1]:
import numpy as np
import pandas as pd

## Scraping a very basic webpage

In [2]:
# Libraries for scraping a website that does not offer an API
import requests
from bs4 import BeautifulSoup

BeautifulSoup documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


Let's look at [this](https://bethanyj0.github.io/) very simple webpage.

In [7]:
# Get the content of a website
site = requests.get('https://bethanyj0.github.io/')

In [None]:
# What did we just get?
type(site)

In [None]:
# Check the status
site.status_code
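A status code of 200 means the request succeeded; codes in the 400s and 500s mean it failed. As a minimal sketch, the request and the status check could be wrapped in one helper. The name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of this demo.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising an error if the request failed.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://bethanyj0.github.io/')` should return the same string as `site.text`, provided the page is reachable.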

In [4]:
# inspect the contents
site.text
# messy!

'<html>\n\n<head>\n<title>A very webpage</title>\n</head>\n\n<h1>This is an "h1" level header.</h1>\n\n<h2>This is a level h2 header.</h2>\n\n<h5>I can skip to a level h5 header!</h5>\n\n<p>This is a paragraph.</p>\n\n<p>Now <i>this</i> is a <b>cool</b> website!</p>\n\n<p>Here is a <a href=https://math.humboldt.edu/programs/data-science>link</a>.</p>\n\n<h2>Let\'s make a list (ordered)!</h2>\n<ol>\n  <li>Do\n  <li>Re\n  <li>Mi\n  <li>Fa\n  <li>So\n</ol>\n\n<h2>Let\'s make a list (unordered)!</h2>\n<ul>\n  <li>La\n  <li>Ti\n  <li>Do\n</ul>\n\n<h2>Nested Lists!</h2>\n<ul>\n  <li>Do:\n    <ol>\n      <li>A deer\n      <li>A female deer\n    </ol>\n  <li>Re:\n    <ol>\n      <li>A drop\n      <li>Of golden sun\n    </ol>\n</ul>\n\n</body>\n\n</html>\n'

In [8]:
# Let's beautify this and make it easier to parse
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(site.text, 'html.parser')



In [None]:
# What does our soup look like?
soup

In [None]:
# Make it even prettier
print(soup.prettify())

### Parse the html

In [None]:
# Find a level 1 header
soup.find('h1')

In [None]:
# Find a level 2 header
soup.find('h2')

In [None]:
# Find all the level 2 headers
soup.find_all('h2')

In [None]:
# Find all the level 3 headers
soup.find_all('h3')
# There were none.
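When nothing matches, `find_all` returns an empty list but `find` returns `None`, so asking for `.text` on a missing tag raises an error. The sketch below shows one way to guard against that. It uses a small made-up HTML string (the `demo_` names are placeholders) rather than the live page.

```python
from bs4 import BeautifulSoup

# A tiny made-up page with no h3 headers (an assumption for illustration)
demo_html = "<html><body><h1>Title</h1><h2>First</h2><h2>Second</h2></body></html>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

no_match_all = demo_soup.find_all("h3")  # an empty list
no_match_one = demo_soup.find("h3")      # None

# Guard against None before asking for .text
h3_text = no_match_one.text if no_match_one is not None else ""
h2_texts = [h.text for h in demo_soup.find_all("h2")]
```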

In [None]:
# Find all the paragraphs
soup.find_all('p')

In [None]:
# Find all the hyperlinks
soup.find_all('a')

In [None]:
# Get the URL from the first hyperlink
soup.find('a')['href']

In [None]:
# Get the text for the hyperlink
soup.find('a').text

In [None]:
# Get all the list items
soup.find_all('li')

In [None]:
# Specifically get the ordered lists
soup.find_all('ol')

In [None]:
# Specifically get the unordered lists
soup.find_all('ul')
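The pieces above can be combined to collect every hyperlink's text and URL in one pass. This is a sketch on a made-up HTML string; the second anchor, which has no `href`, is an assumption added to show why `.get('href')` can be safer than `['href']`.

```python
from bs4 import BeautifulSoup

# Made-up markup; the anchor without an href is there for illustration only
demo_html = """
<html><body>
<p>Here is a <a href="https://math.humboldt.edu/programs/data-science">link</a>.</p>
<p>Here is an <a>anchor without an href</a>.</p>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

links = []
for a in demo_soup.find_all("a"):
    # .get returns None instead of raising KeyError when the attribute is missing
    links.append({"text": a.text, "href": a.get("href")})
```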

## Another example

Let's check out [this](https://irar.humboldt.edu/node/552) Cal Poly Humboldt website.

In [None]:
# Get the data
cph_stats = requests.get('https://irar.humboldt.edu/node/552')
cph_stats.status_code

In [None]:
# Beautify the data
cph_soup = BeautifulSoup(cph_stats.text, 'html.parser')

In [None]:
# Pretty!
print(cph_soup.prettify())

In [None]:
# check for specific tags
cph_soup.find_all('a')

In [None]:
# refine the search by CSS class
cph_soup.find_all('a', class_ = "expanded")

In [None]:
# The same search written as a CSS selector
cph_soup.select('a.expanded')
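To see that the two searches agree, here is a small sketch on made-up menu markup. The class names and links are assumptions for illustration and are not taken from the IRAR page.

```python
from bs4 import BeautifulSoup

# Made-up menu markup; class names and hrefs are placeholders
demo_html = """
<ul>
  <li><a class="expanded" href="/a">A</a></li>
  <li><a class="expanded active" href="/b">B</a></li>
  <li><a class="leaf" href="/c">C</a></li>
</ul>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Both searches match any <a> whose class list includes "expanded"
by_class = [a.text for a in demo_soup.find_all("a", class_="expanded")]
by_selector = [a.text for a in demo_soup.select("a.expanded")]

# Selectors can also express nesting, e.g. "leaf" links directly inside list items
nested = [a["href"] for a in demo_soup.select("li > a.leaf")]
```

One reason to prefer `select` is that a single selector string can combine tag, class, and nesting conditions.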

In [None]:
# How many tables are there?
len(cph_soup.find_all('table'))

In [None]:
# Tables are labeled with h3 headers
cph_soup.find_all('h3')

In [None]:
# Let's just focus on one of the tables
student_ethnicity_table = cph_soup.find_all('table')[3]

In [None]:
# Look at the rows
student_ethnicity_table.find_all('tr')

In [None]:
# Look at a single row
student_ethnicity_table.find_all('tr')[0]

In [None]:
# Look at all the data points in that row
student_ethnicity_table.find_all('tr')[0].find_all('td')

In [None]:
# Look at one specific data point
student_ethnicity_table.find_all('tr')[0].find_all('td')[1]

In [None]:
# Get the text of that data point 
student_ethnicity_table.find_all('tr')[0].find_all('td')[1].text

### Create a Pandas DataFrame

In [None]:
# Create a nested list with the data
table_vals = []

for i in student_ethnicity_table.find_all('tr'):
    row_i = []
    for j in i.find_all('td'):
        row_i.append(j.text)
    table_vals.append(row_i)
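The loop above only collects `td` cells. If a table marks its header cells with `th`, or its cells carry extra whitespace, a variant like the following sketch may help. It runs on a made-up table, and the labels and numbers are placeholders, not IRAR data.

```python
from bs4 import BeautifulSoup

# Made-up table; labels and numbers are placeholders
demo_html = """
<table>
  <tr><th></th><th>Count</th></tr>
  <tr><td> Group A </td><td>10</td></tr>
  <tr><td> Group B </td><td>20</td></tr>
</table>
"""
demo_table = BeautifulSoup(demo_html, "html.parser").find("table")

# Collect both header (th) and data (td) cells, stripping surrounding whitespace
demo_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in demo_table.find_all("tr")
]
```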

In [None]:
# Check out the result
table_vals

In [None]:
# Make it a dataframe
df = pd.DataFrame(table_vals)
df

In [None]:
# Clean it up (reset column labels)
df.columns = df.iloc[0]
df.drop(0,inplace=True)
df

In [None]:
# Clean it up (reset row labels; the first column's label is an empty string)
df.set_index('',inplace=True)

In [None]:
df

In [None]:
# Would require further cleanup
df.dtypes
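Scraped values arrive as strings, so every column has the `object` dtype. As a sketch of the further cleanup, assuming the cells look like `"1,234"` or `"68.5%"` (the values below are made up and are not real IRAR numbers), the columns could be converted like this:

```python
import pandas as pd

# Made-up values shaped like scraped text
df_demo = pd.DataFrame(
    {"Count": ["1,234", "567"], "Percent": ["68.5%", "31.5%"]},
    index=["Group A", "Group B"],
)

# Strip thousands separators and percent signs, then convert to numbers
df_demo["Count"] = pd.to_numeric(df_demo["Count"].str.replace(",", "", regex=False))
df_demo["Percent"] = pd.to_numeric(df_demo["Percent"].str.rstrip("%"))
```

After the conversion, `df_demo.dtypes` reports numeric types, so the columns can be summed, sorted, or plotted.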

## Activity

1. From the same webpage we scraped last (Cal Poly Humboldt IRAR), put the data from the "Fall 2023 Geographic Origin of Current Students" table into a Pandas DataFrame.