In [2]:
import pandas as pd

# Conditions of the Method

This is a quick and easy way to scrape table data from a webpage.
Given its simplicity, the method only works properly under certain conditions.
It is not intended to replace more sophisticated web scraping techniques, but it is still very useful for the majority of simple webpages.

- The data to be scraped is in the **HTML table** format.
- The page has no protection mechanism to prevent scraping (or headless browsers). For example, it won't work on this [page](https://mrtmapsingapore.com/mrt-stations-singapore/).
- The data is not behind a login page that requires user interaction to enter a username and password.
- The data is a static table that is completely loaded when you visit the page. The method will not work on dynamic tables that load more data as you scroll the table/webpage, or that use pagination to load the next page of the table.
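
The first condition can be checked roughly before scraping. The sketch below uses only the Python standard library to count `<table>` tags in a page's HTML source. The names `TableCounter`, `count_html_tables`, and `sample_html` are made up for illustration, and the sample HTML is a stand-in for source you might copy from the browser's "View Source".

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts the <table> start tags seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method
        if tag == "table":
            self.count += 1


def count_html_tables(html):
    """Return the number of <table> elements found in the HTML string."""
    counter = TableCounter()
    counter.feed(html)
    return counter.count


# Stand-in HTML (hypothetical page source)
sample_html = (
    "<html><body>"
    "<table><tr><td>1</td></tr></table>"
    "<div>No table here</div>"
    "</body></html>"
)
print(count_html_tables(sample_html))
```

A count of zero suggests the data is not in HTML table format, or that it is loaded dynamically after the page opens.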

# 1. Get and Convert the HTML Tables into a List of DataFrames

In [13]:
url = 'https://en.wikipedia.org/wiki/List_of_unicorn_startup_companies'

In [14]:
list_of_tables = pd.read_html(url)

In [15]:
# check how many tables were retrieved
len(list_of_tables)

5
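
When several tables come back, a quick overview can help pick the right one before opening each in turn. The helper below is a minimal sketch, not part of pandas: `summarize_tables` is a hypothetical name, and `demo_tables` is made-up data standing in for the list returned by `pd.read_html(url)`.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one row per DataFrame: its list position, shape, and first column names."""
    rows = []
    for i, df in enumerate(tables):
        rows.append({
            "index": i,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "first_columns": ", ".join(str(c) for c in df.columns[:4]),
        })
    return pd.DataFrame(rows)


# Made-up stand-in for list_of_tables
demo_tables = [
    pd.DataFrame({"Company": ["ByteDance", "SpaceX"], "Country": ["China", "US"]}),
    pd.DataFrame({"Year": [2020]}),
]
summary = summarize_tables(demo_tables)
print(summary)
```

In the notebook, calling `summarize_tables(list_of_tables)` would show which positions hold the long tables.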

In [17]:
# change the integer stored in i to see the contents of each table
# 0 is the first table in the list (indexing starts at 0 in Python),
# followed by 1, 2, 3, 4, and so on.
i = 3  # <- retrieve the 4th table
list_of_tables[i]

Unnamed: 0,Company,Valuation(US$ billions),Valuation date,Industry,Country/countries,Founder(s),Unnamed: 6
0,ByteDance,140,April 2021[12],Internet,China,"Zhang Yiming, Liang Rubo",
1,SpaceX,100,October 2021[13],Aerospace,US,Elon Musk,
2,Stripe,95,March 2021[14],Financial services,US / Ireland,Patrick and John Collison,
3,Klarna,45.6,June 2021[15],Financial technology,Sweden,"Sebastian Siemiatkowski, Niklas Adalberth, Vic...",
4,Canva,40,September 2021[16],Graphic design,Australia,"Melanie Perkins, Clifford Obrecht, Cameron Adams",
...,...,...,...,...,...,...,...
601,Zenoti,1+,December 2020[541],Software company,India / US,,
602,Zhaogang.com,1+,July 2017[118],,China,,
603,Zhuanzhuan,1+,April 2017[118],,China,,
604,Zigbang,1+,July 2021[542],Real Estate,South Korea,,
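
The scraped values keep Wikipedia's citation markers, such as `[12]` in the "Valuation date" column. One possible clean-up, shown here as a sketch on a small made-up DataFrame, removes them with a regular expression. The helper name `strip_footnotes` is hypothetical.

```python
import pandas as pd


def strip_footnotes(series):
    """Remove bracketed numeric citation markers such as '[12]' from a string column."""
    return series.str.replace(r"\[\d+\]", "", regex=True).str.strip()


# Small stand-in for the scraped table
demo = pd.DataFrame({
    "Company": ["ByteDance", "SpaceX"],
    "Valuation date": ["April 2021[12]", "October 2021[13]"],
})
demo["Valuation date"] = strip_footnotes(demo["Valuation date"])
print(demo)
```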


# 2. A More Targeted Approach: Matching "Keywords"

> Viewing the Wiki page in a browser, we notice that there are 2 main tables (the long tables).
> To return a specific table, the **read_html** method accepts the input parameter "match", which specifies a keyword (a string or regular expression) that appears in the HTML table.

> In this case, assume that we are only interested in the 2nd large table, called **"Former Unicorns"**, and we notice that **Uber** is text that appears only in this table and not in the others.
> Knowing this, we can pass **"Uber"** to the input parameter **"match"** to retrieve only the tables containing the text we specify.

In [18]:
len(pd.read_html(url, match='Uber'))

1

In [22]:
pd.read_html(url, match='Uber')[0] # <- index 0 is used because there is only one table being retrieved

Unnamed: 0,Company,Last valuation (US $B),Valuation date,Exit date,Exit reason,Exit valuation (US $B),Country,Founders
0,Uber,72,August 2018[543],May 2019[544],IPO,82.40,US,"Travis Kalanick, Garett Camp"
1,DiDi,62,July 2019[12],June 2021[545],IPO,73.00,China,Cheng Wei
2,Facebook,50,January 2011,May 2012[546],IPO,104.00,US,"Mark Zuckerberg, Eduardo Saverin, Andrew McCol..."
3,Xiaomi,45,April 2015,July 2018[547],IPO,70.00,China,Lei Jun
4,Alibaba,42,June 2016,September 2014[548],IPO,238.00,China,Jack Ma
...,...,...,...,...,...,...,...,...
185,Zimi,1+,February 2015[118],March 2021[777],Acquired,0.40,China,
186,QingCloud,1+,June 2017[118],March 2021[778],IPO,0.46,China,
187,Novogene,1+,November 2016[118],April 2021,IPO,,China,
188,MissFresh,1+,December 2017[118],June 2021[779],IPO,2.50,China,
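
The effect of `match` can also be reproduced offline. The sketch below feeds `pd.read_html` a small inline HTML string (a made-up stand-in for the Wikipedia page) wrapped in `StringIO`. It assumes an HTML parser that pandas supports, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd

# Two tiny tables; only the second one contains the text "Uber"
html = """
<table>
  <tr><th>Company</th><th>Industry</th></tr>
  <tr><td>ByteDance</td><td>Internet</td></tr>
</table>
<table>
  <tr><th>Company</th><th>Exit reason</th></tr>
  <tr><td>Uber</td><td>IPO</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))                 # every table in the document
uber_tables = pd.read_html(StringIO(html), match="Uber")  # only tables containing "Uber"

print(len(all_tables), len(uber_tables))
print(uber_tables[0])
```

If no table contains the keyword, `read_html` raises a `ValueError` instead of returning an empty list, so choose a keyword you have confirmed on the page.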


# 3. Failed on Certain Websites

> Notice that the code below raises the error **"HTTP Error 403: Forbidden"**.

> There is nothing wrong with the code. The error occurs because the page has a mechanism to prevent web scraping.

In [19]:
url = 'https://mrtmapsingapore.com/mrt-stations-singapore/'
pd.read_html(url)

HTTPError: HTTP Error 403: Forbidden
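
When looping over many URLs, one blocked page should not stop the whole run. The sketch below catches the `HTTPError` and returns an empty list instead. `read_tables_or_empty` is a hypothetical helper, and its `reader` parameter exists only so the behaviour can be demonstrated without network access.

```python
from urllib.error import HTTPError

import pandas as pd


def read_tables_or_empty(url, reader=pd.read_html):
    """Return the tables found at url, or an empty list if the server rejects the request."""
    try:
        return reader(url)
    except HTTPError as err:
        # err.code is the HTTP status (e.g. 403), err.reason the accompanying message
        print(f"Could not fetch {url}: HTTP {err.code} {err.reason}")
        return []


def _blocked_reader(url):
    """Stand-in reader that simulates a site refusing the request."""
    raise HTTPError(url, 403, "Forbidden", None, None)


tables = read_tables_or_empty("https://example.com/blocked", reader=_blocked_reader)
print(len(tables))
```

This only handles the failure gracefully. It does not bypass the site's protection.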

# Wanna Go Beyond This?

Have a look at this Udemy Course on CSC LEARN [here](https://learncsc.udemy.com/course/modern-web-scraping-in-python/)


This course covers the techniques commonly used in web scraping tasks, which can overcome many of the limitations we see here with pandas' built-in method.