# XML, HTML, and Web Scraping

JSON and XML are two different ways to represent hierarchical data. Which one is better? Many online articles compare the two formats and discuss their advantages and disadvantages. Both are still widely used, so it is good to be familiar with both. However, JSON is more common, so we'll focus on working with JSON representations of hierarchical data.

The reading covered an example of using Beautiful Soup to parse XML. Rather than doing another XML example now, we'll skip straight to scraping HTML from a webpage. Both HTML and XML can be parsed in a similar way with Beautiful Soup.

In [2]:
import pandas as pd 
import requests
from bs4 import BeautifulSoup

## Scraping an HTML table with Beautiful Soup

Open the URL https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population and scroll down until you see a table of the cities in the U.S. with population over 100,000 (as of Jul 1, 2022). We'll use Beautiful Soup to scrape information from this table.

Read in the HTML from the URL using the `requests` library.

In [3]:
URL = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(URL, headers=HEADERS)
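Before parsing, it can help to confirm that the request actually succeeded. The sketch below wraps the download in a small helper; the name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Download a page and return its HTML as a string.

    Hypothetical helper for illustration: raises requests.HTTPError
    if the server answers with a 4xx or 5xx status code.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # stop early on an error response
    return response.text
```

With this helper, `fetch_html(URL, headers=HEADERS)` would return the same string as `response.text` above, or raise an error instead of silently handing an error page to the parser.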

Use Beautiful Soup to parse this string into a tree called `soup`.

In [4]:
soup = BeautifulSoup(response.text, "html.parser")
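To see what the parsed tree gives you, here is a minimal sketch on a tiny, made-up HTML string (the city name and link are placeholders, not data from Wikipedia). It uses the same `find` and `find_all` calls that we will apply to `soup`.

```python
from bs4 import BeautifulSoup

# A small invented page, used only to illustrate navigation.
html = """
<html><body>
  <h1>Cities</h1>
  <table class="wikitable">
    <tr><th>City</th><th>State</th></tr>
    <tr><td><a href="/wiki/Example_City">Example City</a></td><td>EX</td></tr>
  </table>
</body></html>
"""

mini_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; .text gives its text content.
heading = mini_soup.find("h1").text

# find_all() returns every matching tag; attributes are read like dict keys.
links = [a["href"] for a in mini_soup.find_all("a")]

# Count the rows of the table.
n_rows = len(mini_soup.find_all("tr"))
```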

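To find an HTML tag corresponding to a specific element on a webpage, right-click on it and choose "Inspect element". Go to the Wikipedia page with the cities table and do this now.

You should find that the cities table on the Wikipedia page corresponds to an element like the one below. The inspector shows the page after JavaScript has run, so classes such as `jquery-tablesorter` may not appear in the HTML that `requests` downloads; the exact attributes can also change as the page is edited.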

```
<table class="wikitable sortable jquery-tablesorter" style="text-align:center">
```

There are many `<table>` tags on the page.

In [5]:
len(soup.find_all("table"))

10

We can use attributes like `class=` and `style=` to narrow down the list.

In [6]:
len(soup.find_all("table",
                  attrs={
                      "class": "sortable wikitable sticky-header-multi static-row-numbers sort-under col1left col2center",
                      "style": "text-align:right"}
                  ))

1
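Passing a multi-word string for `class` matches the attribute value exactly, including the order of the class names, which makes the filter fragile if the page changes. The sketch below, on made-up HTML, contrasts that with matching a single class via `class_=` and with a CSS selector via `select`, both of which ignore order.

```python
from bs4 import BeautifulSoup

# Three invented tables; the first two carry the same classes in different order.
html = """
<table class="wikitable sortable" style="text-align:right"><tr><td>A</td></tr></table>
<table class="sortable wikitable" style="text-align:right"><tr><td>B</td></tr></table>
<table class="infobox"><tr><td>C</td></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact string match on the whole class attribute: order matters.
exact = demo.find_all("table", attrs={"class": "wikitable sortable"})

# Match any table that has the single class "wikitable".
by_one_class = demo.find_all("table", class_="wikitable")

# CSS selector: tables that have both classes, in any order.
by_selector = demo.select("table.wikitable.sortable")

print(len(exact), len(by_one_class), len(by_selector))
```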

Only one table matches here, so we take the first element of the result (see `[0]` below) and store it as `table`. If several tables matched, you would inspect the webpage to decide which one you want.

In [7]:
table = soup.find_all("table",
                  attrs={
                      "class": "sortable wikitable sticky-header-multi static-row-numbers sort-under col1left col2center",
                      "style": "text-align:right"}
                  )[0]

**Now you will write code to scrape the information in `table` to create a Pandas data frame with one row for each city and columns for: city, state, population (the table's most recent estimate, 2022 when this exercise was written), and 2020 land area (sq mi).** Refer to the Notes/suggestions below as you write your code. A few Hints are provided further down, but try coding first before looking at the hints.

Notes/suggestions:

- Use as a guide the code from the reading that produced the data frame of Statistics faculty
- Inspect the page source as you write your code
- You will need to write a loop to get the information for all cities, but you might want to try just scraping the info for New York first
- You will need to pull the text from the tag. If `.text` returns text with "\n" at the end, try `.get_text(strip = True)` instead of `.text`
- Don't forget to convert to a Pandas data frame; it should have one row per city (333 when this exercise was written; the count changes as Wikipedia updates the table) and 4 columns
- The goal of this exercise is just to create the data frame. If you were going to use it (e.g., what is the population density for all cities in CA?), then you would need to clean the data first: strip stray characters from the strings and convert the numeric columns to numeric types. (You can use Beautiful Soup to do some of the cleaning for you, but that goes beyond our scope.)
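As a rough idea of what that cleaning step might look like, the sketch below converts scraped strings to numbers with pandas. The two rows are invented placeholders and the column names are assumptions chosen for the example.

```python
import pandas as pd

# Invented rows in the shape the scraper produces: every value is a string.
raw = pd.DataFrame({
    "city": ["Example City", "Sample Town"],
    "state": ["EX", "SM"],
    "population": ["1,234,567", "250,000"],
    "land_area_sq_mi": ["300.5", "1,000.0"],
})

clean = raw.copy()

# Remove thousands separators, then convert to numeric types.
clean["population"] = (
    clean["population"].str.replace(",", "", regex=False).astype(int)
)
clean["land_area_sq_mi"] = (
    clean["land_area_sq_mi"].str.replace(",", "", regex=False).astype(float)
)

# With numeric columns, derived quantities such as density are one line.
clean["density"] = clean["population"] / clean["land_area_sq_mi"]
```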

In [8]:
city = table.find_all("tr")[2]

In [9]:
cells = city.find_all("td")

In [10]:
cells[0].find("strong")

In [11]:
cells[1].find("a").text

'NY'

In [12]:
cells[3].get_text(strip=True)

'8,804,190'

In [13]:
city = table.find_all("tr")
len(city)

348

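In [14]:
# initialize an empty list
rows = []

# iterate over all rows in the cities table, skipping the two header rows
for city in table.find_all("tr")[2:]:

    # Get all the cells (<td>) in the row.
    cells = city.find_all("td")

    # Skip any row that does not have enough data cells.
    if len(cells) < 6:
        continue

    # Find the name of the city in cells[0]
    name_tag = cells[0].find("a") or cells[0]
    name = name_tag.text.strip()

    # Find the state abbreviation in cells[1]
    state_tag = cells[1].find("a") or cells[1]
    state = state_tag.text.strip()

    # Find the 2024 population estimate in cells[2]
    population = cells[2].text.strip()

    # Find the 2020 land area in square miles in cells[5]
    land_area = cells[5].text.strip()

    # Append this data.
    rows.append({
        "city": name,
        "state": state,
        "population_2024": population,
        "land_area_2020_sq_mi": land_area
    })


In [None]:
pd.DataFrame(rows)


Always compare the result with the Wikipedia table before using it. The saved output below came from reading `cells[1]`, `cells[2]`, `cells[3]`, and `cells[6]`, so every value sits one column to the left of its label: `city` holds state abbreviations, `state` holds the 2024 estimates, `population_2024` holds the 2020 census counts, and the last column holds land area in km² rather than sq mi. Rerunning the cells above yields correctly aligned columns.

|  | city | state | population_2024 | land_area_2020_sq_mi |
|---|---|---|---|---|
| 0 | NY | 8478072 | 8804190 | 778.3 |
| 1 | CA | 3878704 | 3898747 | 1216.0 |
| 2 | IL | 2721308 | 2746388 | 589.7 |
| 3 | TX | 2390125 | 2304580 | 1658.6 |
| 4 | AZ | 1673164 | 1608139 | 1341.6 |
| ... | ... | ... | ... | ... |
| 341 | FL | 100513 | 93692 | 96.6 |
| 342 | WA | 100252 | 101030 | 57.8 |
| 343 | TX | 100159 | 99893 | 154.6 |
| 344 | CA | 100136 | 93000 | 67.1 |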


Hints:

- Each city is a row in the table; find all the `<tr>` tags to find all the cities
- Look for the `<td>` tag to see table entries within a row
- The rank column is represented by `<th>` tags, rather than `<td>` tags. So within a row, the first (that is, `[0]`) `<td>` tag corresponds to the city name.
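The `<th>` versus `<td>` distinction in the last hint can be checked on a small, made-up row (the city and state are placeholders). Because `find_all("td")` ignores `<th>` cells, a header row yields an empty list and the first `<td>` of a data row is the city name.

```python
from bs4 import BeautifulSoup

# Invented two-row table: a header row of <th> cells and one data row
# whose rank is a <th> and whose remaining entries are <td> cells.
row_html = """
<table>
  <tr><th>Rank</th><th>City</th><th>ST</th></tr>
  <tr><th>1</th><td><a href="#">Example City</a><sup>[a]</sup></td><td>EX
</td></tr>
</table>
"""

rows_demo = BeautifulSoup(row_html, "html.parser").find_all("tr")

# The header row contains no <td> tags at all.
header_cells = rows_demo[0].find_all("td")

# In the data row, index 0 is the city because the rank is a <th>.
data_cells = rows_demo[1].find_all("td")

# Prefer the link text so the footnote marker "[a]" is left out.
city_name = (data_cells[0].find("a") or data_cells[0]).get_text(strip=True)

# strip=True removes the trailing newline inside the cell.
state_abbr = data_cells[1].get_text(strip=True)
```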