# 🌐 Scraping, Part 6: Table Talk

*From HTML table to data table.*

## The `<table>` element

<pre style="font-size: 0.85em; line-height: 0.85em;"> ┌───────────┐
 │  &lt;table&gt;  │
 └─┬─────────┘
   │  ┌───────────┐
   ├─►│  &lt;thead&gt;  │
   │  └─────┬─────┘
   │        │  ┌───────┐
   │        └─►│  &lt;tr&gt; │
   │           └───┬───┘
   │               │  ┌────────┐
   │               └─►│  &lt;th&gt;  │
   │                  └────────┘
   │  ┌───────────┐
   └─►│  &lt;tbody&gt;  │
      └─────┬─────┘
            │  ┌────────┐
            └─►│  &lt;tr&gt;  │
               └───┬────┘
                   │  ┌────────┐
                   └─►│  &lt;td&gt;  │
                      └────────┘
</pre>

Let's try parsing the table of giant watermelons from "`Homework 02, Part 2: The command line is fun (I promise) (optional)`".

Here's the website: http://www.bigpumpkins.com/WeighoffResultsGPC.aspx?c=W&y=2022

In [1]:
import requests
from bs4 import BeautifulSoup

Let's fetch the HTML:

In [2]:
watermelon_url = "http://www.bigpumpkins.com/WeighoffResultsGPC.aspx?c=W&y=2022"
watermelon_html = requests.get(watermelon_url).text

... and convert it to a `BeautifulSoup` object:

In [3]:
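# Name the parser explicitly so the result doesn't depend on which parsers are installed
watermelon_soup = BeautifulSoup(watermelon_html, "html.parser")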
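
The fetch step above takes the happy path: no timeout, and no check that the server actually returned the page. A slightly more defensive version might look like the sketch below. `build_results_url` and `fetch_html` are hypothetical helper names (not part of `requests`), and reading `c` as the crop category and `y` as the year is only a guess based on the URL.

```python
import requests

BASE_URL = "http://www.bigpumpkins.com/WeighoffResultsGPC.aspx"


def build_results_url(category="W", year=2022):
    """Build the results URL without sending a request.

    Assumption: `c` is the crop category and `y` is the year.
    """
    request = requests.Request("GET", BASE_URL, params={"c": category, "y": year})
    return request.prepare().url


def fetch_html(url, timeout=10):
    """Fetch a page, failing loudly on timeouts and HTTP error codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Building the URL is offline; only fetch_html() touches the network.
watermelon_url = build_results_url()
print(watermelon_url)
```

Calling `fetch_html(watermelon_url)` would then return the same kind of string as `requests.get(watermelon_url).text`, but an error page or a hung connection raises an exception instead of quietly handing bad HTML to the parser.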

Now let's select all tables:

In [4]:
tables = watermelon_soup.select("table")
len(tables)

4

There's really only one table we care about, though. Let's update our CSS selector, making it *more specific*. Looking at the HTML, how would you do that?

In [5]:
tables = watermelon_soup.select("table.ReportResults")
len(tables)

1
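
To see why the class selector narrows things down, here is a minimal sketch on made-up HTML. Only the `ReportResults` class name comes from the real page; the `Nav` table and its contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two tables, only one of which has the class we want
sample_html = """
<table class="Nav"><tr><td>Menu</td></tr></table>
<table class="ReportResults"><tr><td>325.40</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_tables = sample_soup.select("table")                    # every <table>
report_tables = sample_soup.select("table.ReportResults")   # only <table class="ReportResults">

# select_one returns the first match (or None) instead of a list
report_table = sample_soup.select_one("table.ReportResults")

print(len(all_tables), len(report_tables))
```

When you expect exactly one match, `select_one` saves the `tables[0]` step, but checking `len(...)` first, as above, is a good way to confirm the selector is as specific as you think.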

Let's look at that table:

In [6]:
watermelon_table = tables[0]
watermelon_table.text[:500]

'PlaceWeight (lbs)Grower NameCityState/ProvCountryGPC SiteSeed (Mother)Pollinator (Father)OTTEst. WeightPct. Chart1325.40Mudd, FramkVine GroveKentuckyUnited StatesAllardt Pumpkin Festival305 Mudd 16305 Mudd223.0303.007.02309.00McCaslin, NickHawesvilleKentuckyUnited StatesChillicothe Halloween Festival301.5 McCaslinSelf224.0307.001.03306.00Vial, AndrewLibertyNorth CarolinaUnited StatesNC State Fair GPC Weigh-Off341.5 Vial 19330.5 Vial B 19223.0301.002.04302.50Mudd, FrankVine GroveKentuckyUnited St'

Now let's get the row elements:

In [7]:
row_els = watermelon_table.select("tbody tr")
len(row_els)

300

Let's take a look at the first one:

In [8]:
row_els[0]

<tr><td align="right">1</td><td align="right">325.40</td><td>Mudd, Framk</td><td>Vine Grove</td><td>Kentucky</td><td>United States</td><td>Allardt Pumpkin Festival</td><td>305 Mudd 16</td><td>305 Mudd</td><td align="right">223.0</td><td align="right">303.00</td><td align="right">7.0</td></tr>

Let's turn this row into a list, where each cell is one item in the list:

In [9]:
[ cell.text for cell in row_els[0].select("td") ]

['1',
 '325.40',
 'Mudd, Framk',
 'Vine Grove',
 'Kentucky',
 'United States',
 'Allardt Pumpkin Festival',
 '305 Mudd 16',
 '305 Mudd',
 '223.0',
 '303.00',
 '7.0']
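
The cells on this page happen to be clean, but many tables pad their cells with spaces or newlines. A minimal sketch of one way to handle that, using a made-up row, is `get_text(strip=True)` in place of `.text`:

```python
from bs4 import BeautifulSoup

# Made-up row with stray whitespace around each value
row_html = "<table><tr><td> 1 </td><td>\n  325.40\n</td><td>Vine Grove </td></tr></table>"
row = BeautifulSoup(row_html, "html.parser").select_one("tr")

# .text keeps the whitespace exactly as it appears in the HTML
raw_cells = [cell.text for cell in row.select("td")]

# get_text(strip=True) trims leading and trailing whitespace
clean_cells = [cell.get_text(strip=True) for cell in row.select("td")]

print(raw_cells)
print(clean_cells)
```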

## Exercise: How would you extract data from all the rows?

In [10]:
watermelon_entries = []
for row in row_els:
    row_cells = []
    for cell in row.select("td"):
        row_cells.append(cell.text)
    watermelon_entries.append(row_cells)

watermelon_entries[:3]

[['1',
  '325.40',
  'Mudd, Framk',
  'Vine Grove',
  'Kentucky',
  'United States',
  'Allardt Pumpkin Festival',
  '305 Mudd 16',
  '305 Mudd',
  '223.0',
  '303.00',
  '7.0'],
 ['2',
  '309.00',
  'McCaslin, Nick',
  'Hawesville',
  'Kentucky',
  'United States',
  'Chillicothe Halloween Festival',
  '301.5 McCaslin',
  'Self',
  '224.0',
  '307.00',
  '1.0'],
 ['3',
  '306.00',
  'Vial, Andrew',
  'Liberty',
  'North Carolina',
  'United States',
  'NC State Fair GPC Weigh-Off',
  '341.5 Vial 19',
  '330.5 Vial B 19',
  '223.0',
  '301.00',
  '2.0']]

Here's how we'd do the same thing with a nested list comprehension:

In [11]:
watermelon_entries = [
    [ cell.text for cell in row.select("td") ]
    for row in row_els
]

watermelon_entries[:3]

[['1',
  '325.40',
  'Mudd, Framk',
  'Vine Grove',
  'Kentucky',
  'United States',
  'Allardt Pumpkin Festival',
  '305 Mudd 16',
  '305 Mudd',
  '223.0',
  '303.00',
  '7.0'],
 ['2',
  '309.00',
  'McCaslin, Nick',
  'Hawesville',
  'Kentucky',
  'United States',
  'Chillicothe Halloween Festival',
  '301.5 McCaslin',
  'Self',
  '224.0',
  '307.00',
  '1.0'],
 ['3',
  '306.00',
  'Vial, Andrew',
  'Liberty',
  'North Carolina',
  'United States',
  'NC State Fair GPC Weigh-Off',
  '341.5 Vial 19',
  '330.5 Vial B 19',
  '223.0',
  '301.00',
  '2.0']]

Now all we're missing is the header.

In [12]:
header_cells = watermelon_table.select("thead th")
watermelon_headers = [ header.text for header in header_cells ]
watermelon_headers

['Place',
 'Weight (lbs)',
 'Grower Name',
 'City',
 'State/Prov',
 'Country',
 'GPC Site',
 'Seed (Mother)',
 'Pollinator (Father)',
 'OTT',
 'Est. Weight',
 'Pct. Chart']
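
Before reaching for pandas, it can help to see how headers and rows line up in plain Python. The sketch below pairs each header with its cell using `zip`, on a shortened copy of the data above (first three columns, first two rows). The length check is an added safeguard, because `zip` silently drops extras when the lengths differ.

```python
headers = ["Place", "Weight (lbs)", "Grower Name"]
entries = [
    ["1", "325.40", "Mudd, Framk"],
    ["2", "309.00", "McCaslin, Nick"],
]

records = []
for row in entries:
    # Guard against rows that don't match the header (e.g. colspan cells)
    if len(row) != len(headers):
        raise ValueError(f"expected {len(headers)} cells, got {len(row)}")
    records.append(dict(zip(headers, row)))

print(records[0])
```

A list of dictionaries like `records` can also be passed directly to `pd.DataFrame`, which takes the column names from the keys.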

Now let's put it all together, making a `DataFrame` that uses the headers as column names:

In [13]:
import pandas as pd
watermelon_df = pd.DataFrame(watermelon_entries, columns=watermelon_headers)
watermelon_df.head()

|   | Place | Weight (lbs) | Grower Name | City | State/Prov | Country | GPC Site | Seed (Mother) | Pollinator (Father) | OTT | Est. Weight | Pct. Chart |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 325.4 | Mudd, Framk | Vine Grove | Kentucky | United States | Allardt Pumpkin Festival | 305 Mudd 16 | 305 Mudd | 223.0 | 303.0 | 7.0 |
| 1 | 2 | 309.0 | McCaslin, Nick | Hawesville | Kentucky | United States | Chillicothe Halloween Festival | 301.5 McCaslin | Self | 224.0 | 307.0 | 1.0 |
| 2 | 3 | 306.0 | Vial, Andrew | Liberty | North Carolina | United States | NC State Fair GPC Weigh-Off | 341.5 Vial 19 | 330.5 Vial B 19 | 223.0 | 301.0 | 2.0 |
| 3 | 4 | 302.5 | Mudd, Frank | Vine Grove | Kentucky | United States | Roberts Family Farms | 305 Mudd 16 | Self | 221.0 | 297.0 | 2.0 |
| 4 | 5 | 291.5 | VanBeck, Patrick | Willlow Spring | North Carolina | United States | NC State Fair GPC Weigh-Off | Carolina Cross Burpee | 305 Vial DMG | 221.0 | 297.0 | -2.0 |
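
Every cell scraped from HTML starts out as a string, so the numeric columns need converting before any arithmetic. A minimal sketch, using a shortened copy of the data and `pd.to_numeric` (one option among several; `astype` also works):

```python
import pandas as pd

# Shortened copy of the scraped data: all values are still strings
sample_df = pd.DataFrame(
    [["1", "325.40", "Mudd, Framk"], ["2", "309.00", "McCaslin, Nick"]],
    columns=["Place", "Weight (lbs)", "Grower Name"],
)

# Convert only the columns that hold numbers
numeric_columns = ["Place", "Weight (lbs)"]
sample_df[numeric_columns] = sample_df[numeric_columns].apply(pd.to_numeric)

print(sample_df.dtypes)
```

After the conversion, `Place` holds integers and `Weight (lbs)` holds floats, so sums and sorts behave numerically rather than alphabetically.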


## Q: Which grower entered the most melons?

In [14]:
watermelon_df["Grower Name"].value_counts().head()

Grower Name
Melka, Friedrich    6
Smiley, Samantha    5
Kent, Chris         5
McCaslin, Nick      5
Mudd, Frank         5
Name: count, dtype: int64

## Q: Which country grew the most melons?

In [15]:
watermelon_df["Country"].value_counts().head()

Country
United States    213
Canada            29
Germany           15
Italy             11
Austria            8
Name: count, dtype: int64

## Q: Which growers entered the most total weight in watermelons?

In [16]:
(
    watermelon_df
    .astype({ "Weight (lbs)": float })
    .groupby("Grower Name")
    ["Weight (lbs)"]
    .sum()
    .sort_values(ascending=False)
    .head()
)

Grower Name
McCaslin, Nick      1327.0
Mudd, Frank         1203.2
Kent, Chris         1163.0
Smiley, Samantha    1101.3
Houston, Hank       1044.0
Name: Weight (lbs), dtype: float64
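
A total like this favors growers who entered many melons, so it can be useful to see the entry count and the heaviest melon next to the sum. The sketch below does that with named aggregation on made-up data; the grower names and weights are invented for illustration, and the output column names (`entries`, `total_weight`, `heaviest`) are arbitrary.

```python
import pandas as pd

# Made-up entries, for illustration only
toy_df = pd.DataFrame({
    "Grower Name": ["Grower A", "Grower B", "Grower A"],
    "Weight (lbs)": ["100.5", "250.0", "120.0"],
})

summary = (
    toy_df
    .astype({"Weight (lbs)": float})   # scraped weights are strings
    .groupby("Grower Name")
    .agg(
        entries=("Weight (lbs)", "count"),
        total_weight=("Weight (lbs)", "sum"),
        heaviest=("Weight (lbs)", "max"),
    )
    .sort_values("total_weight", ascending=False)
)

print(summary)
```

Here Grower B ranks first on total weight with a single entry, while Grower A needed two melons to get close, which is the kind of detail the sum alone hides.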

---
