## Reading HTML tables

In this lecture we'll learn how to read and parse HTML tables from websites into a list of `DataFrame` objects to work with.

In [2]:
!pip install lxml



In [3]:
import pandas as pd
import requests

### **Parsing raw HTML strings**

Another useful pandas method is `read_html()`. This method will read HTML tables from a given URL, a file-like object, or a raw string containing HTML, and return a list of `DataFrame` objects.

Let's try to read the following `html_string` into a `DataFrame`.

In [4]:
html_string = """
<table>
    <thead>
      <tr>
        <th>Order date</th>
        <th>Region</th> 
        <th>Item</th>
        <th>Units</th>
        <th>Unit cost</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>1/6/2018</td>
        <td>East</td> 
        <td>Pencil</td>
        <td>95</td>
        <td>1.99</td>
      </tr>
      <tr>
        <td>1/23/2018</td>
        <td>Central</td> 
        <td>Binder</td>
        <td>50</td>
        <td>19.99</td>
      </tr>
      <tr>
        <td>2/9/2018</td>
        <td>Central</td> 
        <td>Pencil</td>
        <td>36</td>
        <td>4.99</td>
      </tr>
      <tr>
        <td>3/15/2018</td>
        <td>West</td> 
        <td>Pen</td>
        <td>27</td>
        <td>19.99</td>
      </tr>
    </tbody>
</table>
"""

In [5]:
dfs = pd.read_html(html_string)



`read_html` returned a list containing just one `DataFrame` object:

In [6]:
len(dfs)  # Check how many tables were found

1

In [7]:
dfs

[  Order date   Region    Item  Units  Unit cost
 0   1/6/2018     East  Pencil     95       1.99
 1  1/23/2018  Central  Binder     50      19.99
 2   2/9/2018  Central  Pencil     36       4.99
 3  3/15/2018     West     Pen     27      19.99]

In [8]:
df = dfs[0]  # Select the first table

In [9]:
df

,Order date,Region,Item,Units,Unit cost
0,1/6/2018,East,Pencil,95,1.99
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99
3,3/15/2018,West,Pen,27,19.99


The `DataFrame` above looks quite similar to the raw HTML table, but because it is now a `DataFrame` object, we can apply any pandas operation to it.

In [10]:
df.shape  # (rows, columns): the DataFrame has 4 rows and 5 columns

(4, 5)

In [11]:
df.loc[df["Region"] == "Central"]

,Order date,Region,Item,Units,Unit cost
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99


In [12]:
df.loc[df["Units"] > 35]

,Order date,Region,Item,Units,Unit cost
0,1/6/2018,East,Pencil,95,1.99
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99
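
Recent pandas releases (2.1 and later, to the best of our knowledge) emit a `FutureWarning` when a literal HTML string is passed to `read_html`, and suggest wrapping the string in a file-like object instead. The following is a minimal sketch of that approach; `sample_html`, `sample_tables`, and `sample_df` are hypothetical names, and the two-row table is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table, used only for illustration
sample_html = """
<table>
  <thead>
    <tr><th>Item</th><th>Units</th></tr>
  </thead>
  <tbody>
    <tr><td>Pencil</td><td>95</td></tr>
    <tr><td>Binder</td><td>50</td></tr>
  </tbody>
</table>
"""

# Wrapping the string in StringIO hands read_html a file-like object
sample_tables = pd.read_html(StringIO(sample_html))
sample_df = sample_tables[0]  # read_html always returns a list of DataFrames
print(sample_df)
```

The result is the same list of `DataFrame` objects as before; only the way the HTML is handed to pandas changes.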


---

### **Defining header**

Pandas automatically finds the header row thanks to the `<thead>` tag.

However, many real-world tables are malformed or incomplete (for example, the header cells are plain `<td>` elements in an ordinary row), so `read_html` parses them without the proper column names.

To fix this, we can use the `header` parameter.

In [13]:
html_string = """
<table>
  <tr>
    <td>Order date</td>
    <td>Region</td> 
    <td>Item</td>
    <td>Units</td>
    <td>Unit cost</td>
  </tr>
  <tr>
    <td>1/6/2018</td>
    <td>East</td> 
    <td>Pencil</td>
    <td>95</td>
    <td>1.99</td>
  </tr>
  <tr>
    <td>1/23/2018</td>
    <td>Central</td> 
    <td>Binder</td>
    <td>50</td>
    <td>19.99</td>
  </tr>
  <tr>
    <td>2/9/2018</td>
    <td>Central</td> 
    <td>Pencil</td>
    <td>36</td>
    <td>4.99</td>
  </tr>
  <tr>
    <td>3/15/2018</td>
    <td>West</td> 
    <td>Pen</td>
    <td>27</td>
    <td>19.99</td>
  </tr>
</table>
"""

In [14]:
pd.read_html(html_string)[0]



,0,1,2,3,4
0,Order date,Region,Item,Units,Unit cost
1,1/6/2018,East,Pencil,95,1.99
2,1/23/2018,Central,Binder,50,19.99
3,2/9/2018,Central,Pencil,36,4.99
4,3/15/2018,West,Pen,27,19.99


In this case, we need to tell pandas which row holds the column names by passing its row number to the `header` parameter.

In [15]:
pd.read_html(html_string, header=0)[0]



,Order date,Region,Item,Units,Unit cost
0,1/6/2018,East,Pencil,95,1.99
1,1/23/2018,Central,Binder,50,19.99
2,2/9/2018,Central,Pencil,36,4.99
3,3/15/2018,West,Pen,27,19.99


---

### **Parsing HTML tables from the web**

Now that we know how `read_html` works, let's go one step further and parse HTML tables directly from a URL.

To do that, we'll call the `read_html` method with a URL as its argument.

In [16]:
html_url = "https://www.basketball-reference.com/leagues/NBA_2019_per_game.html"

In [17]:
nba_tables = pd.read_html(html_url)

In [18]:
len(nba_tables)

2

In [19]:
nba_tables

[        Rk                 Player   Age Team  Pos     G    GS    MP    FG  \
 0      1.0           James Harden  29.0  HOU   PG  78.0  78.0  36.8  10.8   
 1      2.0            Paul George  28.0  OKC   SF  77.0  77.0  36.9   9.2   
 2      3.0  Giannis Antetokounmpo  24.0  MIL   PF  72.0  72.0  32.8  10.0   
 3      4.0            Joel Embiid  24.0  PHI    C  64.0  64.0  33.7   9.1   
 4      5.0           LeBron James  34.0  LAL   SF  55.0  55.0  35.2  10.1   
 ..     ...                    ...   ...  ...  ...   ...   ...   ...   ...   
 704  527.0            Zach Lofton  26.0  DET   SG   1.0   0.0   4.0   0.0   
 705  528.0           Kobi Simmons  21.0  CLE   PG   1.0   0.0   2.0   0.0   
 706  529.0             Tyler Ulis  23.0  CHI   PG   1.0   0.0   1.0   0.0   
 707  530.0            Okaro White  26.0  WAS   PF   3.0   0.0   2.0   0.0   
 708    NaN         League Average   NaN  NaN  NaN   NaN   NaN   NaN   NaN   
 
       FGA  ...  ORB   DRB   TRB  AST  STL  BLK  TOV   PF   PT

In [20]:
nba_first = nba_tables[0]  # Select the first table

In [21]:
nba_first.head()

,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards
0,1.0,James Harden,29.0,HOU,PG,78.0,78.0,36.8,10.8,24.5,...,0.8,5.8,6.6,7.5,2.0,0.7,5.0,3.1,36.1,"MVP-2,AS,NBA1"
1,2.0,Paul George,28.0,OKC,SF,77.0,77.0,36.9,9.2,21.0,...,1.4,6.8,8.2,4.1,2.2,0.4,2.7,2.8,28.0,"MVP-3,DPOY-3,AS,NBA1,DEF1"
2,3.0,Giannis Antetokounmpo,24.0,MIL,PF,72.0,72.0,32.8,10.0,17.3,...,2.2,10.3,12.5,5.9,1.3,1.5,3.7,3.2,27.7,"MVP-1,DPOY-2,AS,NBA1,DEF1"
3,4.0,Joel Embiid,24.0,PHI,C,64.0,64.0,33.7,9.1,18.7,...,2.5,11.1,13.6,3.7,0.7,1.9,3.5,3.3,27.5,"MVP-7,DPOY-4,AS,NBA2,DEF2"
4,5.0,LeBron James,34.0,LAL,SF,55.0,55.0,35.2,10.1,19.9,...,1.0,7.4,8.5,8.3,1.3,0.6,3.6,1.7,27.4,"MVP-11,AS,NBA3"


In [22]:
nba_second = nba_tables[1]  # Select the second table

In [23]:
nba_second.head()

,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards
0,1.0,Kevin Durant,30.0,GSW,SF,12.0,12.0,36.8,10.4,20.3,...,0.1,4.8,4.9,4.5,1.1,1.0,3.6,3.2,32.3,
1,2.0,James Harden,29.0,HOU,PG,11.0,11.0,38.5,9.9,24.0,...,0.8,6.0,6.8,6.6,2.2,0.9,4.6,3.7,31.6,
2,3.0,Kawhi Leonard,27.0,TOR,SF,24.0,24.0,39.1,10.1,20.7,...,2.3,6.8,9.1,3.9,1.7,0.7,3.1,2.3,30.5,Finals MVP-1
3,4.0,Paul George,28.0,OKC,SF,5.0,5.0,40.8,8.8,20.2,...,1.2,7.4,8.6,3.6,1.4,0.2,4.2,4.2,28.6,
4,5.0,Stephen Curry,30.0,GSW,PG,22.0,22.0,38.5,8.6,19.6,...,0.8,5.2,6.0,5.7,1.1,0.2,3.0,3.1,28.2,


We can also use the `requests` module to download the HTML from a URL and then parse it into `DataFrame` objects.

The page at the URL below contains multiple tables about The Simpsons TV show.

We want to keep the table with information about each season. To get it, we fetch the page with `requests` and pass the response text to `read_html`.

In [24]:
html_url = "https://en.wikipedia.org/wiki/The_Simpsons"

In [25]:
req = requests.get(html_url)

In [26]:
req.status_code  # Check if the request was successful

200

In [27]:
wiki_tables = pd.read_html(req.text, header=0)



In [28]:
len(wiki_tables)

49

In [37]:
simpsons_df = wiki_tables[2] # Selecting the third table found in the DOM

In [36]:
simpsons_df.head() 

,Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
0,Season,Season,No. of episodes,Season premiere,Season finale,Time slot (ET),Avg. viewers (in millions),Most watched episode,Most watched episode
1,Season,Season,No. of episodes,Season premiere,Season finale,Time slot (ET),Avg. viewers (in millions),Viewers (millions),Episode title
2,1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
3,2,1990–91,22,"October 11, 1990","July 11, 1991",Thursday 8:00 pm,24.4,33.6[179],"""Bart Gets an 'F'"""
4,3,1991–92,24,"September 19, 1991","August 27, 1992",Thursday 8:00 pm,21.8,25.5[180],"""Colonel Homer"""
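
Selecting a table by position, as in `wiki_tables[2]`, depends on the page layout, which can change over time. As an alternative sketch, the `match` parameter of `read_html` keeps only the tables that contain text matching a given string or regular expression. The HTML below is a made-up, two-table stand-in for a real page, and `two_tables_html` and `matched` are hypothetical names.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment with two tables, used only for illustration
two_tables_html = """
<table>
  <thead><tr><th>Character</th><th>Role</th></tr></thead>
  <tbody><tr><td>Homer</td><td>Father</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Season</th><th>No. of episodes</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>13</td></tr>
    <tr><td>2</td><td>22</td></tr>
  </tbody>
</table>
"""

# Only tables containing text that matches the pattern are returned
matched = pd.read_html(StringIO(two_tables_html), match="No. of episodes")
print(len(matched))
print(matched[0])
```

Because `match` is treated as a regular expression, characters such as `.` match any character; that is harmless here but worth keeping in mind for stricter patterns.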


Let's remove the extra heading rows from the `DataFrame`.

In [38]:
simpsons_df.drop([0, 1], inplace=True) # Dropping the first two rows

In [39]:
simpsons_df.head() # Checking the DataFrame after dropping the rows

,Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
2,1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
3,2,1990–91,22,"October 11, 1990","July 11, 1991",Thursday 8:00 pm,24.4,33.6[179],"""Bart Gets an 'F'"""
4,3,1991–92,24,"September 19, 1991","August 27, 1992",Thursday 8:00 pm,21.8,25.5[180],"""Colonel Homer"""
5,4,1992–93,22,"September 24, 1992","May 13, 1993",Thursday 8:00 pm,22.4,28.6[181],"""Lisa's First Word"""
6,5,1993–94,22,"September 30, 1993","May 19, 1994",Thursday 8:00 pm,18.9,24.0[182],"""Treehouse of Horror IV"""


In [40]:
simpsons_df.set_index("Season", inplace=True)

In [41]:
simpsons_df.head()

Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
2,1990–91,22,"October 11, 1990","July 11, 1991",Thursday 8:00 pm,24.4,33.6[179],"""Bart Gets an 'F'"""
3,1991–92,24,"September 19, 1991","August 27, 1992",Thursday 8:00 pm,21.8,25.5[180],"""Colonel Homer"""
4,1992–93,22,"September 24, 1992","May 13, 1993",Thursday 8:00 pm,22.4,28.6[181],"""Lisa's First Word"""
5,1993–94,22,"September 30, 1993","May 19, 1994",Thursday 8:00 pm,18.9,24.0[182],"""Treehouse of Horror IV"""


Let's find the season with the lowest number of episodes.

In [42]:
simpsons_df['No. of episodes'].unique() # checking the unique values in the column

array(['13', '22', '24', '25', '23', '21', '20', '18', 'TBA'],
      dtype=object)

In [43]:
simpsons_df = simpsons_df.loc[simpsons_df['No. of episodes'] != 'TBA'] 

In [45]:
simpsons_df

Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
2,1990–91,22,"October 11, 1990","July 11, 1991",Thursday 8:00 pm,24.4,33.6[179],"""Bart Gets an 'F'"""
3,1991–92,24,"September 19, 1991","August 27, 1992",Thursday 8:00 pm,21.8,25.5[180],"""Colonel Homer"""
4,1992–93,22,"September 24, 1992","May 13, 1993",Thursday 8:00 pm,22.4,28.6[181],"""Lisa's First Word"""
5,1993–94,22,"September 30, 1993","May 19, 1994",Thursday 8:00 pm,18.9,24.0[182],"""Treehouse of Horror IV"""
6,1994–95,25,"September 4, 1994","May 21, 1995",Sunday 8:00 pm,15.6,22.2[183],"""Treehouse of Horror V"""
7,1995–96,25,"September 17, 1995","May 19, 1996",Sunday 8:00 pm (Episodes 1–24) Sunday 8:30 pm ...,15.1,22.6[184],"""Who Shot Mr. Burns? – Part II"""
8,1996–97,25,"October 27, 1996","May 18, 1997",Sunday 8:30 pm (Episodes 1–3) Sunday 8:00 pm (...,14.5,20.41[186],"""The Springfield Files"""
9,1997–98,25,"September 21, 1997","May 17, 1998",Sunday 8:00 pm,15.3,19.80[187],"""The Two Mrs. Nahasapeemapetilons"""
10,1998–99,23,"August 23, 1998","May 16, 1999",Sunday 8:00 pm,13.5,19.11[188],"""Sunday, Cruddy Sunday"""


In [46]:
min_season = simpsons_df['No. of episodes'].min()  # Minimum value in the column (the values are strings, so the comparison is lexicographic)

In [47]:
min_season

'13'

In [48]:
min_season_df = simpsons_df.loc[simpsons_df['No. of episodes'] == min_season] # Getting the row with the minimum value

In [49]:
min_season_df

Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
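
Note that the `No. of episodes` column holds strings (`unique()` returned values such as `'13'` and `'TBA'`), so `min()` compares them lexicographically. That happens to give the right answer here because every count listed by `unique()` has two digits, but it would not in general. A minimal sketch of a safer approach, using `pd.to_numeric` on a made-up `episodes` series (all names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical episode counts stored as strings, indexed by season number
episodes = pd.Series(
    ["13", "22", "9", "TBA"], index=[1, 2, 3, 4], name="No. of episodes"
)

# String comparison is lexicographic, so "13" sorts before "9"
string_min = episodes[episodes != "TBA"].min()

# errors="coerce" turns unparseable values such as "TBA" into NaN
numeric = pd.to_numeric(episodes, errors="coerce")
numeric_min = numeric.min()          # NaN values are skipped by default
shortest_season = numeric.idxmin()   # index label of the smallest count

print(string_min, numeric_min, shortest_season)
```

With the numeric column, filtering out `'TBA'` by hand is no longer required, since the coerced `NaN` values are ignored by `min()` and `idxmin()`.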


---

### **Save to CSV file**

Let's save the `DataFrame` to a CSV file using the `to_csv` method.

In [50]:
simpsons_df.head()

Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
2,1990–91,22,"October 11, 1990","July 11, 1991",Thursday 8:00 pm,24.4,33.6[179],"""Bart Gets an 'F'"""
3,1991–92,24,"September 19, 1991","August 27, 1992",Thursday 8:00 pm,21.8,25.5[180],"""Colonel Homer"""
4,1992–93,22,"September 24, 1992","May 13, 1993",Thursday 8:00 pm,22.4,28.6[181],"""Lisa's First Word"""
5,1993–94,22,"September 30, 1993","May 19, 1994",Thursday 8:00 pm,18.9,24.0[182],"""Treehouse of Horror IV"""


In [51]:
simpsons_df.to_csv("files/simpsons.csv") # Saving the DataFrame to a CSV file

We can now read the CSV file using the `read_csv` method.

In [60]:
pd.read_csv("files/simpsons.csv", index_col="Season").head()

Season,Season.1,No. of episodes,Originally aired,Originally aired.1,Originally aired.2,Viewership,Viewership.1,Viewership.2
1,1989–90,13,"December 17, 1989","May 13, 1990",Sunday 8:30 pm,27.8,33.5[178],"""Life on the Fast Lane"""
2,1990–91,22,"October 11, 1990","July 11, 1991",Thursday 8:00 pm,24.4,33.6[179],"""Bart Gets an 'F'"""
3,1991–92,24,"September 19, 1991","August 27, 1992",Thursday 8:00 pm,21.8,25.5[180],"""Colonel Homer"""
4,1992–93,22,"September 24, 1992","May 13, 1993",Thursday 8:00 pm,22.4,28.6[181],"""Lisa's First Word"""
5,1993–94,22,"September 30, 1993","May 19, 1994",Thursday 8:00 pm,18.9,24.0[182],"""Treehouse of Horror IV"""
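
One detail worth keeping in mind is that `to_csv` does not create missing folders, so the `files/` directory has to exist before saving. The sketch below shows the full round trip on a made-up miniature table, writing into a temporary directory so nothing is left behind; `small_df`, `out_dir`, `csv_path`, and `restored` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature version of the seasons table
small_df = pd.DataFrame(
    {"Season": [1, 2], "No. of episodes": [13, 22]}
).set_index("Season")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = Path(tmp_dir) / "files"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder first

    csv_path = out_dir / "simpsons_sample.csv"
    small_df.to_csv(csv_path)  # the index is written as the first column

    # index_col restores "Season" as the index instead of a regular column
    restored = pd.read_csv(csv_path, index_col="Season")

print(restored)
```

Passing `index_col="Season"` when reading mirrors the `set_index("Season")` call made before saving, so the restored `DataFrame` has the same layout as the original.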
