# Reading HTML tables

In [1]:
import pandas as pd 

# Parsing raw HTML strings

Another useful pandas function is read_html(). It reads HTML tables from a URL, a file-like object, or a raw string containing HTML, and returns a list of DataFrame objects, one per table found. Let's try to read the following html_string into a DataFrame.

In [2]:
html_string = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Table</title>
</head>
<body>
    <h2>Employee Information</h2>
    <table border="1">
        <thead>
            <tr>
                <th>Name</th>
                <th>Position</th>
                <th>Office</th>
                <th>Age</th>
                <th>Start date</th>
                <th>Salary</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Tiger Nixon</td>
                <td>System Architect</td>
                <td>Edinburgh</td>
                <td>61</td>
                <td>2011/04/25</td>
                <td>$320,800</td>
            </tr>
            <tr>
                <td>Garrett Winters</td>
                <td>Accountant</td>
                <td>Tokyo</td>
                <td>63</td>
                <td>2011/07/25</td>
                <td>$170,750</td>
            </tr>
            <tr>
                <td>Ashton Cox</td>
                <td>Junior Technical Author</td>
                <td>San Francisco</td>
                <td>66</td>
                <td>2009/01/12</td>
                <td>$86,000</td>
            </tr>
            <tr>
                <td>Cedric Kelly</td>
                <td>Senior Javascript Developer</td>
                <td>Edinburgh</td>
                <td>22</td>
                <td>2012/03/29</td>
                <td>$433,060</td>
            </tr>
            <tr>
                <td>Airi Satou</td>
                <td>Accountant</td>
                <td>Tokyo</td>
                <td>33</td>
                <td>2008/11/28</td>
                <td>$162,700</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
"""

In [3]:
from IPython.display import display, HTML
display(HTML(html_string))

Name,Position,Office,Age,Start date,Salary
Tiger Nixon,System Architect,Edinburgh,61,2011/04/25,"$320,800"
Garrett Winters,Accountant,Tokyo,63,2011/07/25,"$170,750"
Ashton Cox,Junior Technical Author,San Francisco,66,2009/01/12,"$86,000"
Cedric Kelly,Senior Javascript Developer,Edinburgh,22,2012/03/29,"$433,060"
Airi Satou,Accountant,Tokyo,33,2008/11/28,"$162,700"


What each part does:

• from IPython.display import display, HTML
→ Imports display and HTML from IPython.display, the module that provides tools for showing rich content such as HTML, LaTeX, and images. (Importing display from IPython.core.display is deprecated and triggers a warning.)

• HTML(html_string)
→ Wraps the string in an object that the notebook renders as HTML rather than as plain text.

• display(...)
→ Renders that object in the notebook output, much like a browser renders a web page.

In [4]:
from io import StringIO
html_buffer = StringIO(html_string)

# Passing a literal HTML string straight to read_html is deprecated in recent pandas versions (2.1 and later), so we wrap the string in a StringIO object, which pandas treats as a file

In [5]:
dfs = pd.read_html(html_buffer)

# read_html returns a list of DataFrames, one per table it finds; our HTML has a single table, so the list has one element

In [6]:
len(dfs)

1
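Because read_html always returns a list, a page with several tables yields several DataFrames. The following is a minimal sketch, assuming a supported parser such as lxml is installed. The two small tables and the names two_tables_html, all_tables, and fruit_tables are made up for illustration. The match argument keeps only the tables whose text matches the given string or regular expression.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two separate tables (not taken from the notebook above)
two_tables_html = """
<table>
  <thead><tr><th>Fruit</th><th>Price</th></tr></thead>
  <tbody><tr><td>Apple</td><td>3</td></tr></tbody>
</table>
<table>
  <thead><tr><th>City</th><th>Country</th></tr></thead>
  <tbody><tr><td>Tokyo</td><td>Japan</td></tr></tbody>
</table>
"""

# One DataFrame per <table> element
all_tables = pd.read_html(StringIO(two_tables_html))

# match keeps only the tables containing text that matches the pattern
fruit_tables = pd.read_html(StringIO(two_tables_html), match="Fruit")

print(len(all_tables), len(fruit_tables))
print(fruit_tables[0])
```

Filtering with match is usually more robust than relying on a fixed position such as [0], since the order of tables on a page can change.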

In [7]:
# Take the first (and only) DataFrame from the list
df = dfs[0]
df

Unnamed: 0,Name,Position,Office,Age,Start date,Salary
0,Tiger Nixon,System Architect,Edinburgh,61,2011/04/25,"$320,800"
1,Garrett Winters,Accountant,Tokyo,63,2011/07/25,"$170,750"
2,Ashton Cox,Junior Technical Author,San Francisco,66,2009/01/12,"$86,000"
3,Cedric Kelly,Senior Javascript Developer,Edinburgh,22,2012/03/29,"$433,060"
4,Airi Satou,Accountant,Tokyo,33,2008/11/28,"$162,700"


The rendered output earlier only looked like a table; now we have a real DataFrame object, so we can apply any pandas operation to it.

In [8]:
df.shape

(5, 6)

In [9]:
df.loc[df["Office"] == "Tokyo" ]

Unnamed: 0,Name,Position,Office,Age,Start date,Salary
1,Garrett Winters,Accountant,Tokyo,63,2011/07/25,"$170,750"
4,Airi Satou,Accountant,Tokyo,33,2008/11/28,"$162,700"


In [10]:
# Step 1: Remove the dollar sign and commas, then convert to float
df["Salary_cleaned"] = df["Salary"].replace(r'[\$,]', '', regex=True).astype(float)

# Step 2: Filter rows with salary > 100000
high_salary_df = df.loc[df["Salary_cleaned"] > 100000]

# Step 3: View the result
print(high_salary_df)

              Name                     Position     Office  Age  Start date  \
0      Tiger Nixon             System Architect  Edinburgh   61  2011/04/25   
1  Garrett Winters                   Accountant      Tokyo   63  2011/07/25   
3     Cedric Kelly  Senior Javascript Developer  Edinburgh   22  2012/03/29   
4       Airi Satou                   Accountant      Tokyo   33  2008/11/28   

     Salary  Salary_cleaned  
0  $320,800        320800.0  
1  $170,750        170750.0  
3  $433,060        433060.0  
4  $162,700        162700.0  
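The same cleaning idea extends to the other text columns. As a rough sketch, the block below rebuilds two rows of the employee table by hand (the employees frame and high_salary name are illustrative, not part of the notebook), strips the currency formatting with the string accessor, and parses the Start date column into real datetimes.

```python
import pandas as pd

# A hand-built frame mirroring a few columns of the employee table above
employees = pd.DataFrame(
    {
        "Name": ["Tiger Nixon", "Ashton Cox"],
        "Start date": ["2011/04/25", "2009/01/12"],
        "Salary": ["$320,800", "$86,000"],
    }
)

# Strip "$" and "," with the string accessor, then convert to a numeric dtype
employees["Salary_cleaned"] = pd.to_numeric(
    employees["Salary"].str.replace(r"[$,]", "", regex=True)
)

# Parse the date strings using their explicit year/month/day layout
employees["Start date"] = pd.to_datetime(employees["Start date"], format="%Y/%m/%d")

# Numeric comparison now works as expected
high_salary = employees[employees["Salary_cleaned"] > 100000]
print(high_salary)
```

Keeping the original Salary column alongside Salary_cleaned, as the notebook does, makes it easy to spot-check the conversion.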


# Defining header

Pandas finds the header automatically thanks to the <thead> tag, whose <th> cells become the column names.

But many real-world tables are malformed or incomplete, so read_html may parse them without proper headers.

The header parameter lets us fix that by telling pandas which row to use for the column names.

In [11]:
html_string = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Table</title>
</head>
<body>
    <h2>Employee Information</h2>
    <table border="1">
        <thead>
            <tr>
                <th>Name</th>
                <th>Position</th>
                <th>Office</th>
                <th>Age</th>
                <th>Start date</th>
                <th>Salary</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Tiger Nixon</td>
                <td>System Architect</td>
                <td>Edinburgh</td>
                <td>61</td>
                <td>2011/04/25</td>
                <td>$320,800</td>
            </tr>
            <tr>
                <td>Garrett Winters</td>
                <td>Accountant</td>
                <td>Tokyo</td>
                <td>63</td>
                <td>2011/07/25</td>
                <td>$170,750</td>
            </tr>
            <tr>
                <td>Ashton Cox</td>
                <td>Junior Technical Author</td>
                <td>San Francisco</td>
                <td>66</td>
                <td>2009/01/12</td>
                <td>$86,000</td>
            </tr>
            <tr>
                <td>Cedric Kelly</td>
                <td>Senior Javascript Developer</td>
                <td>Edinburgh</td>
                <td>22</td>
                <td>2012/03/29</td>
                <td>$433,060</td>
            </tr>
            <tr>
                <td>Airi Satou</td>
                <td>Accountant</td>
                <td>Tokyo</td>
                <td>33</td>
                <td>2008/11/28</td>
                <td>$162,700</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
"""

In [12]:
from io import StringIO
html_buffer = StringIO(html_string)


In [13]:
pd.read_html(html_buffer)[0]

# pd.read_html(...) always returns a list of DataFrames, even if the HTML contains only one table,
# so [0] selects the first table (the first DataFrame in that list).

Unnamed: 0,Name,Position,Office,Age,Start date,Salary
0,Tiger Nixon,System Architect,Edinburgh,61,2011/04/25,"$320,800"
1,Garrett Winters,Accountant,Tokyo,63,2011/07/25,"$170,750"
2,Ashton Cox,Junior Technical Author,San Francisco,66,2009/01/12,"$86,000"
3,Cedric Kelly,Senior Javascript Developer,Edinburgh,22,2012/03/29,"$433,060"
4,Airi Satou,Accountant,Tokyo,33,2008/11/28,"$162,700"


In [14]:
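# html_buffer was already read by an earlier read_html call, so pass a fresh buffer to be safe
pd.read_html(StringIO(html_string), header=3)[0]
# header=3 uses the row at zero-based index 3 (Ashton Cox) as the header; the rows above it are dropped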

Unnamed: 0,Ashton Cox,Junior Technical Author,San Francisco,66,2009/01/12,"$86,000"
0,Cedric Kelly,Senior Javascript Developer,Edinburgh,22,2012/03/29,"$433,060"
1,Airi Satou,Accountant,Tokyo,33,2008/11/28,"$162,700"
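The header parameter is most useful when the markup gives pandas no hint about which row holds the column names. Here is a small sketch under that assumption: the table below is hypothetical and uses only <td> cells, with no <thead> or <th>, and the names no_thead_html, raw, and fixed are invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical table that uses only <td> cells, so nothing marks the header row
no_thead_html = """
<table>
  <tr><td>Name</td><td>Office</td></tr>
  <tr><td>Tiger Nixon</td><td>Edinburgh</td></tr>
  <tr><td>Airi Satou</td><td>Tokyo</td></tr>
</table>
"""

# Without header, the labels end up as an ordinary data row
raw = pd.read_html(StringIO(no_thead_html))[0]

# header=0 promotes the first row to column names
fixed = pd.read_html(StringIO(no_thead_html), header=0)[0]

print(raw.shape, fixed.shape)
print(list(fixed.columns))
```

Each call gets its own StringIO object because a buffer that has been read once is left at its end.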


# Parsing HTML tables from the web

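Now that we know how read_html works, we will parse HTML tables directly from the web.
read_html can take a URL itself, but some sites block requests that lack browser-like headers, so here we download the page with requests and pass the response text to read_html.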

In [None]:
import requests
html_url = "https://www.worldometers.info/world-population/population-by-country/"

headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Referer": "https://www.google.com"
}

# We pass headers to mimic a real browser request, which helps us avoid being blocked and gain access to the HTML tables on the webpage.

response = requests.get(html_url, headers=headers)

html_buffer = StringIO(response.text)
population_table = pd.read_html(html_buffer)[0]

In [21]:
len(population_table)

233

In [26]:
population_table.head()

Unnamed: 0,#,Country (or dependency),Population 2025,Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Median Age,Urban Pop %,World Share
0,1,India,1463865525,0.89%,12929734,492,2973190,"−495,753",1.94,28.8,37.1%,17.78%
1,2,China,1416096094,−0.23%,"−3,225,184",151,9388211,"−268,126",1.02,40.1,67.5%,17.20%
2,3,United States,347275807,0.54%,1849236,38,9147420,1230663,1.62,38.5,82.8%,4.22%
3,4,Indonesia,285721236,0.79%,2233305,158,1811570,"−39,509",2.1,30.4,59.6%,3.47%
4,5,Pakistan,255219554,1.57%,3950390,331,770880,"−1,235,336",3.5,20.6,34.4%,3.10%


Tables scraped from the web are often messy. Some repeat the header row every so many rows for human readability, and some use many row and column spans. We have to clean all of that up before the data is usable for further work.
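As an illustration of that kind of cleanup, the sketch below uses a small hand-made frame (scraped is hypothetical, not the real page) in which the header row repeats mid-table and the percentages are still text, including the Unicode minus sign that appears in the Yearly Change column above.

```python
import pandas as pd

# Hypothetical scraped frame: the header repeats mid-table and numbers are still text
scraped = pd.DataFrame(
    {
        "Country": ["India", "China", "Country", "Pakistan"],
        "Yearly Change": ["0.89%", "\u22120.23%", "Yearly Change", "1.57%"],
    }
)

# Drop rows that merely repeat the column names
is_repeated_header = scraped["Country"] == "Country"
cleaned = scraped.loc[~is_repeated_header].reset_index(drop=True)

# Normalise the Unicode minus sign, strip "%", and convert to float
cleaned["Yearly Change"] = (
    cleaned["Yearly Change"]
    .str.replace("\u2212", "-", regex=False)
    .str.rstrip("%")
    .astype(float)
)

print(cleaned)
```

Replacing the Unicode minus matters because float conversion only recognises the ASCII hyphen as a sign.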

# Save to CSV File

Finally, we will save the data to a CSV file as we did earlier.

In [27]:
population_table.head()

Unnamed: 0,#,Country (or dependency),Population 2025,Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Median Age,Urban Pop %,World Share
0,1,India,1463865525,0.89%,12929734,492,2973190,"−495,753",1.94,28.8,37.1%,17.78%
1,2,China,1416096094,−0.23%,"−3,225,184",151,9388211,"−268,126",1.02,40.1,67.5%,17.20%
2,3,United States,347275807,0.54%,1849236,38,9147420,1230663,1.62,38.5,82.8%,4.22%
3,4,Indonesia,285721236,0.79%,2233305,158,1811570,"−39,509",2.1,30.4,59.6%,3.47%
4,5,Pakistan,255219554,1.57%,3950390,331,770880,"−1,235,336",3.5,20.6,34.4%,3.10%


In [28]:
# Save the population table (not the earlier employee df) without the row index
population_table.to_csv("population-data.csv", index=False)
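A quick way to check a saved file is to read it back and compare. This is only a sketch: sample is a two-row stand-in for population_table with values copied from the head() output above, and the temporary directory and file name are arbitrary choices for the example.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for population_table (values copied from the head() output above)
sample = pd.DataFrame(
    {
        "Country (or dependency)": ["India", "China"],
        "Population 2025": [1463865525, 1416096094],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "population-sample.csv")

    # index=False leaves the 0, 1, 2, ... row labels out of the file
    sample.to_csv(csv_path, index=False)

    # Read the file back to confirm nothing was lost
    restored = pd.read_csv(csv_path)

# True when the columns, dtypes, and values all survived the round trip
print(restored.equals(sample))
```

Without index=False, the row labels would be written as an extra unnamed first column and come back as "Unnamed: 0" when the file is read.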