In [12]:
#!pip install beautifulsoup4

# XML, HTML, and Web Scraping

JSON and XML are two different ways to represent hierarchical data. Which one is better? Many articles online discuss the similarities and differences between JSON and XML, along with their advantages and disadvantages. Both formats are still in use, so it is good to be familiar with both. However, JSON is more common, so we'll focus on working with JSON representations of hierarchical data.

The reading covered an example of using Beautiful Soup to parse XML. Rather than doing another XML example now, we'll skip straight to scraping HTML from a webpage. Both HTML and XML can be parsed in a similar way with Beautiful Soup.
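As a quick illustration of that similarity, here is a minimal sketch that runs the same `find_all` call on a tiny XML string and a tiny HTML string. Both snippets are made up for this example, and the built-in `html.parser` is used for both only to avoid extra dependencies; for real XML, a dedicated XML parser is usually the safer choice.

```python
from bs4 import BeautifulSoup

# Toy documents invented for illustration (not taken from the reading).
xml_doc = "<cities><city>New York</city><city>Los Angeles</city></cities>"
html_doc = "<ul><li>New York</li><li>Los Angeles</li></ul>"

# Parse both strings into trees with the same parser.
xml_soup = BeautifulSoup(xml_doc, "html.parser")
html_soup = BeautifulSoup(html_doc, "html.parser")

# The search API is identical: find every tag with a given name.
xml_cities = [tag.text for tag in xml_soup.find_all("city")]
html_cities = [tag.text for tag in html_soup.find_all("li")]

print(xml_cities)
print(html_cities)
```

In both cases the tree is searched by tag name and the text is pulled out of each matching tag; only the tag names differ.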

In [23]:
import pandas as pd
import requests
from requests import get
from bs4 import BeautifulSoup

## Scraping an HTML table with Beautiful Soup

Open the URL https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population and scroll down until you see a table of the cities in the U.S. with population over 100,000 (as of Jul 1, 2022). We'll use Beautiful Soup to scrape information from this table.

Read in the HTML from the URL using the `requests` library.

In [24]:
response = requests.get("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")

Use Beautiful Soup to parse the downloaded HTML into a tree called `soup`.

In [25]:
soup = BeautifulSoup(response.content, "html.parser")

To find an HTML tag corresponding to a specific element on a webpage, right-click on it and choose "Inspect element". Go to the cities table Wikipedia page and do this now.

You should find that the cities table on the Wikipedia page corresponds to the element

```
<table class="wikitable sortable jquery-tablesorter" style="text-align:center">
```

There are many `<table>` tags on the page.

In [26]:
len(soup.find_all("table"))

14

We can use attributes like `class=` and `style=` to narrow down the list.

In [27]:
len(soup.find_all("table",
                  attrs={
                      "class": "wikitable sortable",
                      "style": "text-align:center"}
                  ))

3
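To see how the `attrs` filter behaves without hitting Wikipedia, here is a minimal sketch on a made-up page with three tables. The HTML below is invented for illustration. Note that the filter uses the class string `"wikitable sortable"` rather than the `jquery-tablesorter` version shown by the browser inspector; that extra class is presumably added by JavaScript after the page loads, so it is not in the HTML that `requests` downloads.

```python
from bs4 import BeautifulSoup

# Toy page invented for illustration: three tables with different attributes.
html = """
<table class="wikitable sortable" style="text-align:center"><tr><td>A</td></tr></table>
<table class="wikitable" style="text-align:center"><tr><td>B</td></tr></table>
<table class="infobox"><tr><td>C</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Without a filter, every <table> tag is returned.
all_tables = demo_soup.find_all("table")

# With attrs, a table must match every listed attribute to be returned.
matching = demo_soup.find_all(
    "table",
    attrs={"class": "wikitable sortable", "style": "text-align:center"},
)

print(len(all_tables), len(matching))
```

Only the first toy table has both the full class string and the style, so it is the only one that survives the filter.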

At this point, you can manually inspect the tables on the webpage to find that the one we want is the first one (see `[0]` below). We'll store this as `table`.

In [29]:
table = soup.find_all("table",
                  attrs={
                      "class": "wikitable sortable",
                      "style": "text-align:center"}
                  )[0]

**Now you will write code to scrape the information in `table` to create a Pandas data frame with one row for each city and columns for: city, state, population (2022 estimate), and 2020 land area (sq mi).** Refer to the Notes/suggestions below as you write your code. A few Hints are provided further down, but try coding first before looking at the hints.

Notes/suggestions:

- Use as a guide the code from the reading that produced the data frame of Statistics faculty
- Inspect the page source as you write your code
- You will need to write a loop to get the information for all cities, but you might want to try just scraping the info for New York first
- You will need to pull the text from the tag. If `.text` returns text with "\n" at the end, try `.get_text(strip = True)` instead of `.text`
- Don't forget to convert to a Pandas Data Frame; it should have 333 rows and 4 columns
- The goal of this exercise is just to create the Data Frame. If you were going to use it (e.g., to find the population density of every city in CA), you would first need to clean the data: tidy up the strings and convert the numeric columns to numbers. (You can use Beautiful Soup to do some of the cleaning for you, but that goes beyond our scope.)

In [30]:
# Initialize an empty list to hold one dictionary per city.
rows = []

# Iterate over all rows in the cities table, skipping the header row.
for city in table.find_all("tr")[1:]:

    # Get all the cells (<td>) in the row. The rank is in a <th> tag,
    # so cells[0] is the city name.
    cells = city.find_all("td")

    # The information we need is the text between tags.
    # Note: .text keeps any trailing "\n"; .get_text(strip=True) would remove it.

    # The city name is in cells[0], usually inside an <a> tag;
    # fall back to the cell itself if there is no link.
    name_tag = cells[0].find("a") or cells[0]
    name = name_tag.text

    # The state is in cells[1], usually inside an <a> tag.
    state_tag = cells[1].find("a") or cells[1]
    state = state_tag.text

    # The 2022 population estimate is the plain text of cells[2].
    population_2022 = cells[2].text

    # The 2020 census population is the plain text of cells[3].
    population_2020 = cells[3].text

    # The 2020 land area (sq mi) is the plain text of cells[5].
    land_area = cells[5].text

    # Append this city's data.
    rows.append({
        "name": name,
        "state": state,
        "population (2022 estimate)": population_2022,
        "population (2020 census)": population_2020,
        "land area (sq.mi)": land_area
    })

In [31]:
print(pd.DataFrame(rows).head())

          name       state population (2022 estimate)  \
0     New York    New York                8,335,897\n   
1  Los Angeles  California                3,822,238\n   
2      Chicago    Illinois                2,665,039\n   
3      Houston       Texas                2,302,878\n   
4      Phoenix     Arizona                1,644,409\n   

  population (2020 census) land area (sq.mi)  
0              8,804,190\n     300.5 sq mi\n  
1              3,898,747\n     469.5 sq mi\n  
2              2,746,388\n     227.7 sq mi\n  
3              2,304,580\n     640.4 sq mi\n  
4              1,608,139\n     518.0 sq mi\n  


Hints:

- Each city is a row in the table; find all the `<tr>` tags to find all the cities
- Look for the `<td>` tag to see table entries within a row
- The rank column is represented by `<th>` tags, rather than `<td>` tags. So within a row, the first (that is, `[0]`) `<td>` tag corresponds to the city name.
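The hints above can be tried on a single row before writing the full loop. This is a minimal sketch on a made-up one-row table (the HTML is an assumption that mimics the structure described in the hints, with the rank in a `<th>` and the other entries in `<td>` tags). It also shows the difference between `.text` and `.get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Toy table invented for illustration; it mimics the structure in the hints.
html = """
<table>
<tr><th>Rank</th><th>City</th><th>State</th><th>Population</th></tr>
<tr><th>1</th><td><a href="#">New York</a></td><td><a href="#">New York</a></td><td>8,335,897
</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

# Skip the header row, then look only at the <td> cells of the first city.
first_row = demo_table.find_all("tr")[1]
cells = first_row.find_all("td")

# The rank is a <th>, so cells[0] is the city name, not the rank.
city = cells[0].find("a").text
raw_population = cells[2].text                      # keeps the trailing newline
clean_population = cells[2].get_text(strip=True)    # surrounding whitespace removed

print(repr(city), repr(raw_population), repr(clean_population))
```

Once one row works, wrapping the same steps in a loop over `find_all("tr")[1:]` gives the full list of rows.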

## Aside: Scraping an HTML table with Pandas



The Pandas command `read_html` can be used to scrape information from an HTML table on a webpage.

We can call `read_html` on the URL.

In [32]:
#pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")

However, this scrapes all the tables on the webpage, not just the one we want. As with Beautiful Soup, we can narrow the search by specifying the table attributes.

In [33]:
#pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population", attrs = {'class': 'wikitable sortable', "style": "text-align:center"})

This still returns 3 tables. As we remarked above, the table that we want is the first one (see `[0]` below).

In [34]:
#df_cities2 = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population", attrs = {'class': 'wikitable sortable', "style": "text-align:center"})[0]
#df_cities2

Wait, that seemed much easier than using Beautiful Soup: it returned a data frame, and we even got some formatting for free, like removing the commas from the populations! Why didn't we just use `read_html` in the first place? It's true that `read_html` works well when scraping information from an HTML *table*. Unfortunately, you often want to scrape information from a webpage that isn't conveniently stored in an HTML table, in which case `read_html` won't work. (It only searches for `<table>`, `<th>`, `<tr>`, and `<td>` tags, but there are many other HTML tags.) Though Beautiful Soup is not as simple as `read_html`, it is more flexible and thus more widely applicable.
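Here is a small offline sketch of the same idea, using a made-up HTML string instead of the live Wikipedia page. The table contents reuse two population figures from the output above; the second table is invented. It assumes an HTML parser that `read_html` can use is installed (`lxml`, or `beautifulsoup4` together with `html5lib`).

```python
from io import StringIO

import pandas as pd

# Toy page invented for illustration: one cities table and one unrelated table.
html = """
<table class="wikitable sortable">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>New York</td><td>8,335,897</td></tr>
  <tr><td>Los Angeles</td><td>3,822,238</td></tr>
</table>
<table class="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# read_html returns a list with one data frame per <table> it finds.
tables = pd.read_html(StringIO(html))

# attrs narrows the search to tables whose attributes match.
narrowed = pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})
cities = narrowed[0]

print(len(tables), len(narrowed))
print(cities)
```

The population column comes back as numbers rather than strings because `read_html` treats `,` as a thousands separator by default.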

## Scraping information that is NOT in a `<table>` with Beautiful Soup

The Cal Poly course catalog http://catalog.calpoly.edu/collegesandprograms/collegeofsciencemathematics/statistics/#courseinventory contains a list of courses offered by the Statistics department. **You will scrape this website to obtain a Pandas data frame with one row for each DATA or STAT course and two columns: course name and number (e.g., DATA 301. Introduction to Data Science) and term typically offered (e.g., Term Typically Offered: F, W, SP).**

Note: Pandas `read_html` is no help here since the courses are not stored in a `<table>`.

In [35]:
#pd.read_html("http://catalog.calpoly.edu/collegesandprograms/collegeofsciencemathematics/statistics/#courseinventory")


Notes/suggestions:


- Inspect the page source as you write your code
- The courses are not stored in a `<table>`. How are they stored?
- You will need to write a loop to get the information for all courses, but you might want to try just scraping the info for DATA 100 first
- What kind of tag is the course name stored in? What is the `class` of the tag?
- What kind of tag stores the quarter(s) in which the course is offered? What is the `class` of the tag? Is this the only tag of this type with that class? How will you get the one you want?
- You don't have to remove the number of units (e.g., 4 units) from the course name and number, but you can try it if you want
- You will need to pull the text from the tag. If `.text` returns text with "\n" at the end, try `get_text(strip = True)` instead of `text`
- Don't forget to convert to a Pandas Data Frame; it should have 74 rows and 2 columns
- The goal of this exercise is just to create the Data Frame. If you were going to use it then you might need to clean the data first. (You can use Beautiful Soup to do some of the cleaning for you, but that goes beyond our scope.)



In [36]:
classes = requests.get("https://catalog.calpoly.edu/collegesandprograms/collegeofsciencemathematics/statistics/#courseinventory")
soup2 = BeautifulSoup(classes.content, "html.parser")

In [37]:
class_data = soup2.find_all("div", {"class": "courseblock"})
class_data = list(class_data)
#type(class_data)
#len(class_data)
#print(class_data)

In [38]:

# Initialize an empty list to hold one dictionary per course.
rows = []

# Iterate over all the course blocks.
for course in class_data:

    # Get all the paragraphs (<p>) in the course block.
    cells = course.find_all("p")

    # The course name and number are in cells[0], usually inside a <strong> tag;
    # fall back to the whole paragraph if there is no <strong> tag.
    name_tag = cells[0].find("strong") or cells[0]
    name = name_tag.text.strip()

    # The term typically offered is the text of cells[1],
    # the paragraph right after the course title.
    offered = cells[1].text.strip()

    # Append this course's data.
    rows.append({
        "name": name,
        "offered": offered
    })

# Create a DataFrame from the list of dictionaries.
df = pd.DataFrame(rows)

print(df)


                                                 name  \
0          DATA 100. Data Science for All I.\n4 units   
1    DATA 301. Introduction to Data Science.\n4 units   
2   DATA 401. Data Science Process and Ethics.\n3 ...   
3   DATA 402. Mathematical Foundations of Data Sci...   
4   DATA 403. Data Science Projects Laboratory.\n1...   
..                                                ...   
69    STAT 551. Statistical Learning with R.\n4 units   
70  STAT 566. Graduate Consulting Practicum.\n2 units   
71     STAT 570. Selected Advanced Topics.\n1-4 units   
72  STAT 590. Graduate Seminar in Statistics.\n1 unit   
73                       STAT 599. Thesis.\n1-4 units   

                             offered  
0   Term Typically Offered: F, W, SP  
1   Term Typically Offered: F, W, SP  
2          Term Typically Offered: F  
3          Term Typically Offered: F  
4          Term Typically Offered: F  
..                               ...  
69         Term Typically Offered: F  
70 

Hints:

- Each course is represented by a `<div>` with `class=courseblock`, so you can find all the courses with `soup.find_all("div", {"class": "courseblock"})`
- The course name is in a `<p>` tag with `class=courseblocktitle`, inside a `<strong>` tag. (Though I don't think we need to find the strong tag here.)
- The term typically offered is in a `<p>` tag with `class=noindent`. However, there are several tags with this class; term typically offered is the first one.
- If you want to use Beautiful Soup to remove the course units (e.g., 4 units), find the `<span>` tag within the course name tag and `.extract()` this span tag
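The hints above can be sketched on a single made-up course block. The HTML here is an assumption modeled on the structure the hints describe (the second `noindent` paragraph is placeholder text), not a copy of the catalog page. It shows picking the first `noindent` paragraph and using `.extract()` to drop the units from the title.

```python
from bs4 import BeautifulSoup

# Toy course block invented for illustration, following the hints above.
html = """
<div class="courseblock">
<p class="courseblocktitle"><strong>DATA 301. Introduction to Data Science.
<span>4 units</span></strong></p>
<p class="noindent">Term Typically Offered: F, W, SP</p>
<p class="noindent">Placeholder for another noindent paragraph</p>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

rows = []
for block in demo_soup.find_all("div", {"class": "courseblock"}):
    title = block.find("p", class_="courseblocktitle")

    # Remove the <span> holding the units so it is not part of the name.
    units = title.find("span")
    if units is not None:
        units.extract()
    name = title.get_text(strip=True)

    # find() returns the first match, which is the term typically offered.
    offered = block.find("p", class_="noindent").get_text(strip=True)

    rows.append({"name": name, "offered": offered})

print(rows)
```

Because `.extract()` modifies the tree in place, the units are gone from `title` before its text is read.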

# Class Notes


## API

In [6]:
#!pip install beautifulsoup4

In [7]:
import os

import pandas as pd
from requests import get

url = "https://tasty.p.rapidapi.com/recipes/list"

querystring = {"from": "0", "size": "20", "q": "daikon"}

# Read the API key from an environment variable rather than hardcoding it.
# The variable name RAPIDAPI_KEY is an assumption; use whatever name you set.
headers = {
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
    "X-RapidAPI-Host": "tasty.p.rapidapi.com"
}

response = get(url, headers=headers, params=querystring)


In [10]:
daikon_recipes = pd.json_normalize(response.json(), "results")
#daikon_recipes
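To see what the second argument of `json_normalize` does without calling the API, here is a minimal sketch on a made-up dictionary. The field names other than `"results"` are hypothetical and are not the real Tasty API schema.

```python
import pandas as pd

# Toy payload invented for illustration; only the "results" key mirrors
# the call above, and the other field names are hypothetical.
payload = {
    "count": 2,
    "results": [
        {"id": 1, "name": "Recipe A", "nutrition": {"calories": 100}},
        {"id": 2, "name": "Recipe B", "nutrition": {"calories": 250}},
    ],
}

# The second argument is the record path: one row per item in payload["results"].
toy_recipes = pd.json_normalize(payload, "results")

print(toy_recipes)
```

Each record becomes a row, and nested dictionaries inside a record are flattened into dotted column names such as `nutrition.calories`.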

In [9]:
from time import sleep

for i in range(1, 5, 1):
  querystring = {"from":str(20*(i-1)),"size":"20","q":"daikon"}
  sleep(2)