## Web Scraping Tables using Pandas
The pandas library in Python provides the *read_html()* function, which can be used to extract tabular information stored in the HTML tables of a web page.

Consider the following example:

Let's assume we want to extract the list of the largest banks in the world by total assets from the following link:

In [2]:
url = 'https://en.wikipedia.org/wiki/list_of_largest_banks'

We may use the *pandas.read_html()* function in Python to extract all the tables in the web page directly. It returns a list of dataframes, one per table found.

If we look at the page we can see that the required table is the first one on the page. We may execute the following lines of code to extract the required table from the web page.

In [5]:
import pandas as pd
tables = pd.read_html(url)
# tables is a list of all tables on the page so if we want the first one...
df = tables[0]
print(df)

    Rank                                Bank name  \
0      1  Industrial and Commercial Bank of China   
1      2               Agricultural Bank of China   
2      3                  China Construction Bank   
3      4                            Bank of China   
4      5                           JPMorgan Chase   
..   ...                                      ...   
95    96                         Raiffeisen Group   
96    97                            Handelsbanken   
97    98                 Industrial Bank of Korea   
98    99                                      DNB   
99   100                      Qatar National Bank   

    Total assets (2023) (US$ billion)  
0                             6303.44  
1                             5623.12  
2                             5400.28  
3                             4578.28  
4                             3875.39  
..                                ...  
95                             352.87  
96                             351.79  
97 

This will extract the required table as a dataframe *df*. The output of the print statement is as seen above.
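Selecting a table by its position can break if the page gains or loses a table. As a minimal sketch, the *match* parameter of *read_html()* can restrict the result to tables containing a given text. The HTML below is made-up placeholder data (not taken from the Wikipedia page) so that the example runs without network access, and it assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables: a one-cell legend and a small data table.
html = """
<table>
  <tr><th>Legend</th></tr>
  <tr><td>Map colours</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets (US$ billion)</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>100.5</td></tr>
  <tr><td>2</td><td>Example Bank B</td><td>90.25</td></tr>
</table>
"""

# Only tables whose text matches "Bank name" are returned.
tables = pd.read_html(StringIO(html), match="Bank name")
df = tables[0]
print(df)
```

With a live page, the same call would take the url in place of the *StringIO* object.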

Although convenient, this method has its limitations. Firstly, a web page may store content in HTML table markup even when that content does not look like a table on the rendered page.

For instance, consider the following url showing the list of countries by GDP (nominal):

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

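On this page, the images and their legends are also stored in a tabular format, so *read_html()* returns them along with the data tables. This changes the index of the table we actually want and could break a script that selects tables by position.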
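One possible guard against such non-data tables is to filter the returned list by size before picking a table. The sketch below is only a heuristic: *keep_data_tables* is a hypothetical helper, the thresholds are assumptions, and the small dataframes stand in for what *read_html()* might return.

```python
import pandas as pd


def keep_data_tables(tables, min_rows=2, min_cols=2):
    """Keep only dataframes with at least min_rows rows and min_cols columns."""
    return [t for t in tables if t.shape[0] >= min_rows and t.shape[1] >= min_cols]


# Stand-ins for the list returned by pd.read_html(url).
legend = pd.DataFrame({0: ["map caption text"]})           # 1 row x 1 column
colour_key = pd.DataFrame([["band A", "band B", "band C"]])  # 1 row x 3 columns
data = pd.DataFrame({
    "Country/Territory": ["United States", "China", "Germany"],
    "Forecast": [30337162, 19534894, 4921563],
})                                                          # 3 rows x 2 columns

candidates = keep_data_tables([legend, colour_key, data])
print(len(candidates))  # number of tables that passed the size filter
```

A size filter cannot tell two genuine data tables apart, so the result may still need a check on the column names.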

Secondly, the contents of the tables in web pages may contain elements such as hyperlink text and footnote markers, which the pandas method scrapes as part of the cell values. The data may therefore need further cleaning.

A closer look at the third table on the above url shows many examples of such hyperlinks, which the pandas function also treats as information.

In [12]:
tables = pd.read_html(url)
print(tables)
df = tables[2] # the required table will have index 2
print(df)

[                                                   0
0  Largest economies in the world by GDP (nominal...,                                                    0  \
0  > $20 trillion $10–20 trillion $5–10 trillion ...   

                                                   1  \
0  $750 billion – $1 trillion $500–750 billion $2...   

                                                   2  
0  $50–100 billion $25–50 billion $5–25 billion <...  ,     Country/Territory IMF[1][13]            World Bank[14]             \
    Country/Territory   Forecast       Year       Estimate       Year   
0               World  115494312       2025      105435540       2023   
1       United States   30337162       2025       27360935       2023   
2               China   19534894  [n 1]2025       17794782  [n 3]2023   
3             Germany    4921563       2025        4456081       2023   
4               Japan    4389326       2025        4212945       2023   
..                ...        ...        ... 

Note that the hyperlink texts, such as "[n 1]" in the Year columns and "[1][13]" in the IMF header, have also been retained in the output.
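As a minimal sketch of the cleaning step, bracketed markers can be removed with a regular expression. The small dataframe below is built by hand from a few values in the output above, and *strip_footnotes* is a hypothetical helper name. The pattern assumes that every bracketed fragment is a footnote marker, which may not hold for other pages.

```python
import re

import pandas as pd

# Matches bracketed markers such as "[n 1]" or "[13]".
FOOTNOTE = r"\[[^\]]*\]"


def strip_footnotes(text):
    """Remove bracketed markers from a single string."""
    return re.sub(FOOTNOTE, "", text).strip()


# A few rows copied by hand from the printed output above.
raw = pd.DataFrame({
    "Country/Territory": ["United States", "China", "Germany"],
    "Forecast": [30337162, 19534894, 4921563],
    "Year": ["2025", "[n 1]2025", "2025"],
})

cleaned = raw.copy()
cleaned["Year"] = (
    cleaned["Year"]
    .str.replace(FOOTNOTE, "", regex=True)  # drop the markers
    .str.strip()
    .astype(int)                            # years become integers
)

header = strip_footnotes("IMF[1][13]")
print(cleaned)
print(header)
```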

It is also worth pointing out that this method works only for tabular data. The BeautifulSoup library remains the more general choice for extracting other kinds of information from web pages.
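For contrast, here is a short sketch of pulling non-tabular content with BeautifulSoup. The HTML snippet is made up for illustration, and the example assumes the *bs4* package is installed. It uses Python's built-in "html.parser" so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up page fragment with a heading and two links.
html = """
<html><body>
  <h1>Example page</h1>
  <p>See the <a href="/wiki/Bank">bank</a> and <a href="/wiki/GDP">GDP</a> articles.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the first <h1> element.
title = soup.find("h1").get_text(strip=True)

# (link text, target) pairs for every <a> element that has an href.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

print(title)
print(links)
```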