# Reading tabular data from the web

**DISCLAIMER**: _The legality of web scraping needs to be considered on a site-by-site basis. The site in question may display a "Terms of Use", which should be reviewed by an Oliver Wyman lawyer in advance._

## What is web scraping?

Web scraping is a technique for extracting semi-structured data from web pages and representing it in a structured, usable form. It consists of a series of steps:

* Make an HTTP request to the web page to retrieve the HTML that should contain the data we want.
* Extract the desired data from the relevant HTML tags.
* Structure the data into DataFrames and make use of it.
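
The steps above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: `fetch_html`, `TableRowParser` and `SAMPLE_HTML` are hypothetical names introduced here, and the parser is fed an inline HTML snippet instead of a live page so that the example runs offline.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Step 1: request the page and return its HTML as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "table-scraping-example"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


class TableRowParser(HTMLParser):
    """Step 2: collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Inline stand-in for a downloaded page (assumption, keeps the sketch offline)
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Player</th></tr>
  <tr><td>1992</td><td>Martin Crowe</td></tr>
  <tr><td>1996</td><td>Sanath Jayasuriya</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)  # with a live page: parser.feed(fetch_html(url))

# Step 3: turn the rows into records keyed by column name
header, *records = parser.rows
table = [dict(zip(header, record)) for record in records]
print(table)
```

A list of records like `table` can be handed to `pandas.DataFrame` directly. The rest of this tutorial uses a pandas shortcut that covers all three steps at once.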


## Our first approach
Sometimes the data is already structured as tables inside the web page. In these cases we can use the easiest method to extract the data and format it as a DataFrame.
The **pandas** library provides a simple function for this [[1]](https://towardsdatascience.com/scraping-table-data-from-websites-using-a-single-line-in-python-ba898d54e2bc).

In [1]:
import pandas as pd

# Page about Cricket World cup
URL = "https://en.wikipedia.org/wiki/Cricket_World_Cup"

### Getting tables from a website

The following code fetches the tables from a given URL (in our case, the Wikipedia article about the Cricket World Cup).

The pandas utility can load every `<table>` tag from inside the HTML.


In [7]:
tables = pd.read_html(URL)

print(f"There are : {len(tables)} tables")

There are : 19 tables


### Finding the right table

In our example, 19 tables were found inside the page's HTML. There are three ways to select exactly the table you want to use.


#### __1__) Direct indexing

The first is direct indexing. The object returned by `pd.read_html(URL)` is a list of DataFrames (one per table), so you can iterate through the tables until you find the desired one.

In [8]:
print("Take a look at table 0")
tables[0]

Take a look at table 0


Unnamed: 0,0,1
0,The World Cup Trophy,The World Cup Trophy
1,Administrator,International Cricket Council (ICC)
2,Format,One Day International
3,First edition,1975 England
4,Latest edition,2019 England & Wales
5,Next edition,2023 India
6,Tournament format,↓various
7,Number of teams,20 (all tournaments)10 (current)14 (2027 onwar...
8,Current champion,England (1st title)
9,Most successful,Australia (5 titles)
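
When a page yields many tables, printing a one-line summary of each can be quicker than displaying them one by one. The sketch below is an illustration only: it assumes `lxml` (or `bs4` with `html5lib`) is installed, and it parses a small inline snippet (`SAMPLE_HTML`, a hypothetical stand-in for the downloaded page) so that it runs offline.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the page: an infobox-like table and a data table
SAMPLE_HTML = """
<table>
  <tr><td>Administrator</td><td>International Cricket Council (ICC)</td></tr>
  <tr><td>Format</td><td>One Day International</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Player</th><th>Performance details</th></tr>
  <tr><td>1992</td><td>Martin Crowe</td><td>456 runs</td></tr>
  <tr><td>1996</td><td>Sanath Jayasuriya</td><td>221 runs and 7 wickets</td></tr>
</table>
"""

tables = pd.read_html(StringIO(SAMPLE_HTML))

# Print index, shape and column names to spot the table of interest
for index, table in enumerate(tables):
    print(index, table.shape, list(table.columns))
```

With the real page, the same loop can be run over the list returned by `pd.read_html(URL)`.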


#### __2__) Text matching

The second method is text matching. In this approach you pass the `match` parameter to the `pd.read_html` function, and only tables containing text that matches it are returned. In our case, the match string `Performance details` reduced the number of tables found __from 19 to only 2__, which makes the right one much easier to find.

P.S.: Regular expressions are supported in the `match` parameter.

In [19]:
tables = pd.read_html(URL, match="Performance details")
print(f"There are : {len(tables)} tables")

tables[0]

There are : 2 tables


Unnamed: 0,Year,Player,Performance details
0,1992,Martin Crowe,456 runs
1,1996,Sanath Jayasuriya,221 runs and 7 wickets
2,1999,Lance Klusener,281 runs and 17 wickets
3,2003,Sachin Tendulkar,673 runs and 2 wickets
4,2007,Glenn McGrath,26 wickets
5,2011,Yuvraj Singh,362 runs and 15 wickets
6,2015,Mitchell Starc,22 wickets
7,2019,Kane Williamson,578 runs and 2 wickets
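
To illustrate how `match` behaves with a plain string, with a regular expression, and with a pattern that matches nothing, here is a small offline sketch. It assumes `lxml` (or `bs4` with `html5lib`) is installed; `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page, and the pattern `[0-9]+ runs` is just an example.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the page: only the second table mentions runs
SAMPLE_HTML = """
<table>
  <tr><td>Administrator</td><td>International Cricket Council (ICC)</td></tr>
  <tr><td>Format</td><td>One Day International</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Player</th><th>Performance details</th></tr>
  <tr><td>1992</td><td>Martin Crowe</td><td>456 runs</td></tr>
  <tr><td>2003</td><td>Sachin Tendulkar</td><td>673 runs and 2 wickets</td></tr>
</table>
"""

# Plain text: keep tables that contain this string
by_text = pd.read_html(StringIO(SAMPLE_HTML), match="Performance details")

# Regular expression: keep tables with a cell such as "456 runs"
by_regex = pd.read_html(StringIO(SAMPLE_HTML), match="[0-9]+ runs")

# A pattern that matches no table raises ValueError instead of returning []
try:
    pd.read_html(StringIO(SAMPLE_HTML), match="Football")
    found_nothing = False
except ValueError:
    found_nothing = True

print(len(by_text), len(by_regex), found_nothing)
```

Note that the text is searched anywhere in the table, not only in the header row.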


#### __3__) HTML attribute matching

The third and last method of this tutorial is through HTML attribute matching.

HTML tags often carry extra information in the form of attributes.

* HTML tag __without__ attributes

    `<table> </table>`



* HTML tag __with__ attributes

    `<table class='wikitable'> </table>`
    
    `<div font='arial'> </div>`
    

To select tables by attribute matching, pass the `attrs` argument to the `pd.read_html` function.

As an example, to match the `<table class='wikitable'> </table>` the passed argument should be `{'class': 'wikitable'}`. You can check the attributes for a particular table by looking at the page's HTML code.

In [24]:
# Keep only the tables with class 'wikitable', then pick the target one by index
tables = pd.read_html(URL, attrs={'class': 'wikitable'})
tables[2]

Unnamed: 0,Year,Player,Performance details
0,1992,Martin Crowe,456 runs
1,1996,Sanath Jayasuriya,221 runs and 7 wickets
2,1999,Lance Klusener,281 runs and 17 wickets
3,2003,Sachin Tendulkar,673 runs and 2 wickets
4,2007,Glenn McGrath,26 wickets
5,2011,Yuvraj Singh,362 runs and 15 wickets
6,2015,Mitchell Starc,22 wickets
7,2019,Kane Williamson,578 runs and 2 wickets
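
Because several tables can share the same class, matching on `class` may still return more than one table, whereas a unique attribute such as `id` can narrow the result to a single one. The offline sketch below illustrates the difference. It assumes `lxml` (or `bs4` with `html5lib`) is installed; `SAMPLE_HTML` and the `id` value `awards` are hypothetical and do not come from the Wikipedia page.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the page: one plain table and two 'wikitable' tables
SAMPLE_HTML = """
<table>
  <tr><td>Administrator</td><td>International Cricket Council (ICC)</td></tr>
</table>
<table class="wikitable">
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>Current champion</td><td>England (1st title)</td></tr>
</table>
<table class="wikitable" id="awards">
  <tr><th>Year</th><th>Player</th><th>Performance details</th></tr>
  <tr><td>1992</td><td>Martin Crowe</td><td>456 runs</td></tr>
</table>
"""

# Matching on class returns every table that carries it
wikitables = pd.read_html(StringIO(SAMPLE_HTML), attrs={"class": "wikitable"})

# Matching on a unique id returns a single table
by_id = pd.read_html(StringIO(SAMPLE_HTML), attrs={"id": "awards"})

print(len(wikitables), len(by_id))
print(by_id[0])
```

The `match` and `attrs` arguments can also be combined in one call when neither is selective enough on its own.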
