# Day One: Sample data acquisition skill

I'm teaching this class from the fundamentals upwards, which means that we study data formats and how the Internet works so that we can properly learn to scrape data from the web and use web APIs. A problem with that approach is that it leaves some of the really cool stuff towards the end of the class. To mitigate this, I like to motivate the work you will do over the next few weeks with a simple example of how easy it is to collect data from a cooperative website. Most websites are uncooperative, and we will learn to deal with those during the class. For now, let's figure out how to scrape some data on the coronavirus from Wikipedia.

I'd like everyone to read through this notebook and manually type all the code into your own notebook. This will give you some idea of where we are going and how straightforward it is in many cases.

## Inspect the first table element

Using Chrome, right-click on the start of the table in the right gutter of the Wikipedia page and select <b>Inspect</b> from the drop-down menu. You should see something that looks like this:

<img src="figures/covid-inspect.png" width="70%">

That shows you the raw HTML and what it corresponds to visually.  The next step is to use a program to extract that HTML.

## Get the raw HTML from the website

In [39]:
import requests

CovidURL = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
response = requests.get(CovidURL)
print(response.text[0:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>COVID-19 pandemic by country and territory - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"81737266-c9f4-4855-9134-9ffc68ed9723","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"COVID-19_pandemic_by_country_and_territory","wgTitle":"COVID-19 pandemic by country and territory","wgCurRevisionId":981839685,"wgRevisionId":981839685,"wgArticleId":62938755,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from October 2020","All articles
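
The call above assumes the download worked. Here is a minimal sketch of a slightly more defensive version that checks the HTTP status before handing back the text. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook:

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising if the request fails."""
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of
    # silently returning the HTML of an error page
    response.raise_for_status()
    return response.text

# Example use (needs network access):
# html = fetch_html("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
# print(html[0:1000])
```

If the URL is malformed or the server answers with an error, the function raises a subclass of `requests.exceptions.RequestException` rather than returning misleading text.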

## Get specific tag using BeautifulSoup

Now let's parse the text as HTML instead of treating it as one long string. Then we can ask for a specific tag such as the title:

In [40]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
title = soup.find('title')
print("TITLE", title)

TITLE <title>COVID-19 pandemic by country and territory - Wikipedia</title>
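
To get a feel for what `find` returns, here is a small sketch on a made-up HTML snippet, so it runs without a network connection. The page content and the name `demo_soup` are invented for illustration:

```python
from bs4 import BeautifulSoup

# A tiny stand-in page (not the Wikipedia page)
html = """<html><head><title>Sample page</title></head>
<body><h1 id="top">Heading</h1>
<a href="/wiki/A">first</a> <a href="/wiki/B">second</a></body></html>"""

demo_soup = BeautifulSoup(html, "html.parser")

title_tag = demo_soup.find("title")   # a tag object, printed with its angle brackets
title_text = title_tag.text           # just the string inside the tag

# find_all returns every match; attributes are read like dictionary keys
links = [a["href"] for a in demo_soup.find_all("a")]

# find returns None when the tag is absent, so check before using the result
missing = demo_soup.find("table")

print(title_tag)
print(title_text)
print(links)
print(missing)
```

The difference between `title_tag` and `title_text` is the same difference you see in the `TITLE` output above: the tag prints with its markup, while `.text` gives only the contents.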


## Get all text elements from all HTML tags

We can also ask for all of the text on the page with the HTML tags stripped away:

In [54]:
print(soup.text[0:1000])





COVID-19 pandemic by country and territory - Wikipedia





























COVID-19 pandemic by country and territory

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
Wikimedia list article
This article is about the status of the outbreak in different locations by continent and conveyance around the world. For further information, see National responses to the COVID-19 pandemic.


COVID-19 pandemicConfirmed cases per 100,000 population as of 4 October 2020
  >3,000  1,000–3,000  300–1,000  100–300  30–100  0–30  None or no data
DiseaseCoronavirus disease 2019 (COVID-19)Virus strainSevere acute respiratory syndromecoronavirus 2 (SARS-CoV-2)SourceProbably bats, possibly via pangolins[1][2]LocationWorldwideFirst outbreakMainland China[3]Index caseWuhan, Hubei, China30°37′11″N 114°15′28″E﻿ / ﻿30.61972°N 114.25778°E﻿ / 30.61972; 114.25778Date1 December 2019 (2019-12-01)[3]–present(10 months and 4 days)Confirmed cases35,330,119[4]Active cases9,727,061[
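
The output above is full of blank lines because the whitespace between tags counts as text too. As a sketch of one way to tidy that up, BeautifulSoup's `get_text` accepts `separator` and `strip` arguments, and `stripped_strings` yields the non-empty pieces one at a time. The HTML below is a made-up stand-in so the example runs offline:

```python
from bs4 import BeautifulSoup

# A tiny stand-in page with lots of whitespace between tags
html = """<html><head><title>Demo page</title></head>
<body>


<h1>Heading</h1>


<p>Some <b>bold</b> text.</p>
</body></html>"""

demo_soup = BeautifulSoup(html, "html.parser")

raw_text = demo_soup.text                                     # keeps every newline
compact_text = demo_soup.get_text(separator=" ", strip=True)  # strips and joins the pieces
pieces = list(demo_soup.stripped_strings)                     # the same pieces as a list

print(repr(raw_text))
print(compact_text)
print(pieces)
```

Note that stripping also removes the spaces around inline tags such as `<b>`, which is why a separator is passed in to keep the words apart.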

## Find all tables

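BeautifulSoup can find every `table` tag on the page, but what it returns are HTML tag objects, not data frames. To get a pandas data frame for each table, we hand the HTML we downloaded to pandas' `read_html` function:

In [41]:
from io import StringIO
import pandas as pd

table_tags = soup.find_all('table')             # BeautifulSoup tag objects
tables = pd.read_html(StringIO(response.text))  # one data frame per table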
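
Asking BeautifulSoup for table tags gives back a list of tag objects. As a rough illustration of what sits inside one of them, this sketch walks the rows and cells of a tiny made-up table by hand. The inline HTML is a stand-in, with two rows borrowed from the death-rate table shown later:

```python
from bs4 import BeautifulSoup

# A tiny stand-in table (not the full Wikipedia page)
html = """
<table>
  <tr><th>Country</th><th>Deaths</th></tr>
  <tr><td>San Marino</td><td>42</td></tr>
  <tr><td>Peru</td><td>32609</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")
first_table = demo_soup.find_all("table")[0]

# Each <tr> is a row; its <th>/<td> children are the cells
rows = []
for tr in first_table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Every value comes back as a string, so numbers still need converting. Doing this by hand for real pages gets tedious quickly, which is why a data frame is the more convenient end product.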

The first table looks like this on the page:
 
<img src="figures/covid-table-0.png" width="40%">

and you can see we get a nice data frame for that:

In [49]:
tables[0]

| | COVID-19 pandemic | COVID-19 pandemic.1 |
|---|---|---|
| 0 | Confirmed cases per 100,000 population as of 4... | Confirmed cases per 100,000 population as of 4... |
| 1 | Disease | Coronavirus disease 2019 (COVID-19) |
| 2 | Virus strain | Severe acute respiratory syndromecoronavirus 2... |
| 3 | Source | Probably bats, possibly via pangolins[1][2] |
| 4 | Location | Worldwide |
| 5 | First outbreak | Mainland China[3] |
| 6 | Index case | Wuhan, Hubei, China30°37′11″N 114°15′28″E / ... |
| 7 | Date | 1 December 2019[3]–present(10 months and 4 days) |
| 8 | Confirmed cases | 35,330,119[4] |
| 9 | Active cases | 9,727,061[4] |


The second table looks like:

<img src="figures/covid-table-1.png" width="40%">

and we get this data frame:

In [42]:
tables[1]

| | Location[a] | Location[a] | Cases[b] | Deaths[c] | Recov.[d] | Ref. |
|---|---|---|---|---|---|---|
| | Unnamed: 0_level_1 | World[e] | 35,330,119 | 1,038,958 | 24,564,100 | [4] |
| 0 | | United States[f] | 7505022 | 213056 | 4873669 | [13][14] |
| 1 | | India | 6623815 | 102685 | 5586703 | [15] |
| 2 | | Brazil[g] | 4915289 | 146352 | 4263208 | [18][19] |
| 3 | | Russia[h] | 1225889 | 21475 | 982324 | [20] |
| 4 | | Colombia | 855052 | 26712 | 761674 | [21] |
| ... | ... | ... | ... | ... | ... | ... |
| 227 | | Anguilla | 3 | 0 | 3 | [330] |
| 228 | | Solomon Islands | 1 | 0 | 0 | [331] |
| 229 | | Tanzania[be] | No data | No data | No data | [333][334] |
| 230 | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... |


The third table looks like this:

<img src="figures/covid-table-2.png" width="40%">

and we get this data frame:

In [55]:
tables[2]

| | Country | Confirmed cases | Deaths | Case fatality rate | Deaths per 100,000 population |
|---|---|---|---|---|---|
| 0 | San Marino | 732 | 42 | 5.7% | 124.32 |
| 1 | Peru | 821564 | 32609 | 4.0% | 101.94 |
| 2 | Belgium | 130235 | 10064 | 7.7% | 88.11 |
| 3 | Bolivia | 136868 | 8101 | 5.9% | 71.35 |
| 4 | Brazil | 4915289 | 146352 | 3.0% | 69.87 |
| ... | ... | ... | ... | ... | ... |
| 163 | Papua New Guinea | 540 | 7 | 1.3% | 0.08 |
| 164 | Sri Lanka | 3402 | 13 | 0.4% | 0.06 |
| 165 | Tanzania | 509 | 21 | 4.1% | 0.04 |
| 166 | Vietnam | 1096 | 35 | 3.2% | 0.04 |


## Using Pandas to read a URL to extract tables

Pandas has a built-in mechanism to read a URL and extract all the table tags into dataframes. Extremely handy.

In [45]:
import pandas as pd

tables = pd.read_html(CovidURL)
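
`read_html` also accepts HTML you already have in hand, and its `match` argument keeps only the tables whose text matches a pattern. Here is a small offline sketch on made-up HTML. The two tables and their values are invented, and `read_html` needs an HTML parser such as `lxml` installed:

```python
from io import StringIO
import pandas as pd

# Two tiny stand-in tables (not the Wikipedia page)
html = """
<table>
  <tr><th>Country</th><th>Cases</th></tr>
  <tr><td>A</td><td>1,234</td></tr>
  <tr><td>B</td><td>56</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>x</td><td>1</td></tr>
</table>
"""

# Wrap the string in StringIO so pandas treats it as a file-like object
all_tables = pd.read_html(StringIO(html))

# Keep only the tables containing the text "Cases"
cases_tables = pd.read_html(StringIO(html), match="Cases")
cases = cases_tables[0]

print(len(all_tables), len(cases_tables))
print(cases)
```

On a page with dozens of tables, filtering with `match` can be less fragile than relying on a position such as `tables[1]`, which shifts whenever the page layout changes.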

In [46]:
tables[0]

| | COVID-19 pandemic | COVID-19 pandemic.1 |
|---|---|---|
| 0 | Confirmed cases per 100,000 population as of 4... | Confirmed cases per 100,000 population as of 4... |
| 1 | Disease | Coronavirus disease 2019 (COVID-19) |
| 2 | Virus strain | Severe acute respiratory syndromecoronavirus 2... |
| 3 | Source | Probably bats, possibly via pangolins[1][2] |
| 4 | Location | Worldwide |
| 5 | First outbreak | Mainland China[3] |
| 6 | Index case | Wuhan, Hubei, China30°37′11″N 114°15′28″E / ... |
| 7 | Date | 1 December 2019[3]–present(10 months and 4 days) |
| 8 | Confirmed cases | 35,330,119[4] |
| 9 | Active cases | 9,727,061[4] |


In [47]:
tables[1]

| | Location[a] | Location[a] | Cases[b] | Deaths[c] | Recov.[d] | Ref. |
|---|---|---|---|---|---|---|
| | Unnamed: 0_level_1 | World[e] | 35,330,119 | 1,038,958 | 24,564,100 | [4] |
| 0 | | United States[f] | 7505022 | 213056 | 4873669 | [13][14] |
| 1 | | India | 6623815 | 102685 | 5586703 | [15] |
| 2 | | Brazil[g] | 4915289 | 146352 | 4263208 | [18][19] |
| 3 | | Russia[h] | 1225889 | 21475 | 982324 | [20] |
| 4 | | Colombia | 855052 | 26712 | 761674 | [21] |
| ... | ... | ... | ... | ... | ... | ... |
| 227 | | Anguilla | 3 | 0 | 3 | [330] |
| 228 | | Solomon Islands | 1 | 0 | 0 | [331] |
| 229 | | Tanzania[be] | No data | No data | No data | [333][334] |
| 230 | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... | As of 4 October 2020 (UTC) · History of cases ... |


In [48]:
tables[2]

| | Country | Confirmed cases | Deaths | Case fatality rate | Deaths per 100,000 population |
|---|---|---|---|---|---|
| 0 | San Marino | 732 | 42 | 5.7% | 124.32 |
| 1 | Peru | 821564 | 32609 | 4.0% | 101.94 |
| 2 | Belgium | 130235 | 10064 | 7.7% | 88.11 |
| 3 | Bolivia | 136868 | 8101 | 5.9% | 71.35 |
| 4 | Brazil | 4915289 | 146352 | 3.0% | 69.87 |
| ... | ... | ... | ... | ... | ... |
| 163 | Papua New Guinea | 540 | 7 | 1.3% | 0.08 |
| 164 | Sri Lanka | 3402 | 13 | 0.4% | 0.06 |
| 165 | Tanzania | 509 | 21 | 4.1% | 0.04 |
| 166 | Vietnam | 1096 | 35 | 3.2% | 0.04 |
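
The scraped tables still carry some Wikipedia residue, such as footnote markers like `[f]` after country names and `No data` where a number should be. As a hedged sketch of a possible cleanup step, the following works on a small hand-built data frame that copies three rows from the output above. The column names `Location` and `Cases` are simplified stand-ins for the real two-level headers:

```python
import pandas as pd

# Three rows copied by hand from the scraped table, kept as strings
df = pd.DataFrame({
    "Location": ["United States[f]", "India", "Tanzania[be]"],
    "Cases": ["7505022", "6623815", "No data"],
})

# Remove bracketed footnote markers such as [f] or [be]
df["Location"] = df["Location"].str.replace(r"\[[^\]]*\]", "", regex=True)

# Convert to numbers; anything unparseable ("No data") becomes NaN
df["Cases"] = pd.to_numeric(df["Cases"], errors="coerce")

print(df)
```

Using `errors="coerce"` trades an exception for a missing value, so it is worth counting the resulting NaNs to make sure only the expected rows were affected.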
