# Day One: Sample data acquisition skill

I'm teaching this class from the fundamentals upwards, which means that we study data formats and how the Internet works so that we can properly learn to scrape data from the web and use web APIs. A problem with that approach is that it leaves some of the really cool stuff towards the end of the class. To mitigate this, I like to motivate the work you will do over the next few weeks with a simple example of how easy it is to collect data from a cooperative website. Most websites are uncooperative, so we have to learn to deal with those, which we will do during the class. For now, let's figure out how to scrape some [data on the coronavirus from Wikipedia](https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory).

I'd like everyone to read through this notebook and manually type all the code into your own notebook. This will give you some idea of where we are going and how straightforward it is in many cases.

## Inspect the first table element

Using Chrome, go to URL:

[https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory](https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory)

and then right-click on the start of the table in the right gutter of the Wikipedia page and select <b>Inspect</b> from the drop-down menu. You should see something that looks like this:

<img src="figures/covid-inspect.png" width="70%">

That shows you the raw HTML and what it corresponds to visually.  The next step is to use a program to extract that HTML.

## Get the raw HTML from the website

In [51]:
!pip install -q -U requests              # we need these libraries
!pip install -q -U beautifulsoup4

In [52]:
import requests

CovidURL = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
response = requests.get(CovidURL)
print(response.text[0:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>COVID-19 pandemic by country and territory - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"da424776-62a9-408b-9a87-610bfa93a4f4","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"COVID-19_pandemic_by_country_and_territory","wgTitle":"COVID-19 pandemic by country and territory","wgCurRevisionId":983068035,"wgRevisionId":983068035,"wgArticleId":62938755,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from October 2020","All articles
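
Real-world fetches fail in ways this one didn't: the server can be slow, return an error page, or turn away clients that don't identify themselves. Here's a minimal sketch of a more defensive fetch, assuming the `requests` library installed above. The function name `fetch_html` and the User-Agent string are made up for illustration; they are not part of `requests` or of this notebook.

```python
import requests

# Hypothetical User-Agent string; replace it with something that identifies you.
HEADERS = {"User-Agent": "data-acquisition-class-notebook/0.1"}

def fetch_html(url, timeout=10):
    """Return the HTML text at url, raising an exception on HTTP errors."""
    # timeout stops us from waiting forever on a server that never answers
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(CovidURL)` should give you the same text as `response.text` above, or raise an exception instead of silently handing back an error page.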

## Get specific tag using BeautifulSoup

Now let's treat the text as structured HTML rather than as plain text. Then we can ask for a specific tag such as the title:

In [53]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
title = soup.find('title')
print("TITLE", title)

TITLE <title>COVID-19 pandemic by country and territory - Wikipedia</title>
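
Printing the tag shows the markup along with the text. If you only want what is between `<title>` and `</title>`, ask the tag for its text. A small sketch on a hand-written one-line document, so it runs without a network connection (the `demo_` names are just for illustration and won't clobber the `soup` variable used in the rest of the notebook):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document so the example runs offline
demo_html = ("<html><head><title>COVID-19 pandemic by country and territory"
             " - Wikipedia</title></head><body></body></html>")
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_title = demo_soup.find("title")
print(demo_title.name)        # the tag's name
print(demo_title.get_text())  # only the text inside the tag, no markup
```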


## Get all text elements from all HTML tags

We can also ask for just the text of the page, that is, the text that sits between the tags with the tags themselves stripped away:

In [54]:
print(soup.text[0:1000])





COVID-19 pandemic by country and territory - Wikipedia





























COVID-19 pandemic by country and territory

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
Wikimedia list article
This article is about the status of the outbreak in different locations by continent and conveyance around the world. For further information, see National responses to the COVID-19 pandemic.


COVID-19 pandemicConfirmed cases per 100,000 population as of 12 October 2020
  >3,000  1,000–3,000  300–1,000  100–300  30–100  0–30  None or no data
DiseaseCoronavirus disease 2019 (COVID-19)Virus strainSevere acute respiratory syndromecoronavirus 2 (SARS-CoV-2)SourceProbably bats, possibly via pangolins[1][2]LocationWorldwideFirst outbreakMainland China[3]Index caseWuhan, Hubei, China30°37′11″N 114°15′28″E﻿ / ﻿30.61972°N 114.25778°E﻿ / 30.61972; 114.25778Date1 December 2019 (2019-12-01)[3]–present(10 months, 1 week and 4 days)Confirmed cases37,594,267[4]Active cases1
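
The output above contains long runs of blank lines, and words from neighboring tags get glued together (`DiseaseCoronavirus`). Both come from how the page's whitespace and tags are laid out. As a sketch of one way to tidy this up, `get_text` accepts a separator and a `strip` flag. The miniature document and the `demo_` names below are assumptions for illustration:

```python
from bs4 import BeautifulSoup

demo_html = """
<html><body>
  <h1>COVID-19 pandemic by country and territory</h1>

  <p>From Wikipedia, the free encyclopedia</p>
</body></html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# the first argument is the separator placed between the surviving fragments
compact = demo_soup.get_text("\n", strip=True)
print(compact)
```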

## Find all tables

BeautifulSoup has a mechanism to find all of the tables (HTML `table` tags) in an HTML document:

In [75]:
tables = soup.find_all('table')   # find_all is the current name; findAll is the older alias
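
The same search method (spelled `find_all` in current BeautifulSoup; `findAll` is the older name for it) returns a list, so you can count the matches, and it accepts filters such as a CSS class. Here's a small sketch on a hand-written two-table document. The `wikitable` class and the `demo_` names are assumptions for illustration:

```python
from bs4 import BeautifulSoup

demo_html = """
<table class="infobox"><tr><th>Disease</th><td>COVID-19</td></tr></table>
<table class="wikitable"><tr><th>Location</th><th>Cases</th></tr></table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

demo_tables = demo_soup.find_all("table")
print(len(demo_tables))                           # number of <table> tags found
print([tbl.get("class") for tbl in demo_tables])  # class attribute of each table

# Narrow the search to tables carrying a particular CSS class
infoboxes = demo_soup.find_all("table", class_="infobox")
print(len(infoboxes))
```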

The first table looks like this on the page:
 
<img src="figures/covid-table-0.png" width="40%">

and we can get the HTML representing each table:

In [81]:
t = str(tables[0])
print(t[0:1000])

<table class="infobox" style="width:22em"><tbody><tr><th colspan="2" style="text-align:center;font-size:125%;font-weight:bold;background:#FFCCCB">COVID-19 pandemic</th></tr><tr><td colspan="2" style="text-align:center;border-bottom:#aaa 1px solid;"><a class="image" href="/wiki/File:COVID-19_Outbreak_World_Map_per_Capita.svg"><img alt="COVID-19 Outbreak World Map per Capita.svg" data-file-height="1500" data-file-width="2921" decoding="async" height="169" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/COVID-19_Outbreak_World_Map_per_Capita.svg/330px-COVID-19_Outbreak_World_Map_per_Capita.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/COVID-19_Outbreak_World_Map_per_Capita.svg/495px-COVID-19_Outbreak_World_Map_per_Capita.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3b/COVID-19_Outbreak_World_Map_per_Capita.svg/660px-COVID-19_Outbreak_World_Map_per_Capita.svg.png 2x" width="330"/></a><div style="text-align:left;"><div class="center" style="

That is raw HTML representing the table, but we can use a Jupyter notebook trick to display that text as HTML:

In [82]:
from IPython.display import HTML # IPython is the underlying Python interpreter used by this notebook
HTML(t)                          # Render the text in t as HTML

| COVID-19 pandemic | COVID-19 pandemic.1 |
|---|---|
| Confirmed cases per 100,000 population as of 12 October 2020 >3,000 1,000–3,000 300–1,000 100–300 30–100 0–30 None or no data | Confirmed cases per 100,000 population as of 12 October 2020 >3,000 1,000–3,000 300–1,000 100–300 30–100 0–30 None or no data |
| Disease | Coronavirus disease 2019 (COVID-19) |
| Virus strain | Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) |
| Source | Probably bats, possibly via pangolins[1][2] |
| Location | Worldwide |
| First outbreak | Mainland China[3] |
| Index case | Wuhan, Hubei, China 30°37′11″N 114°15′28″E / 30.61972°N 114.25778°E |
| Date | 1 December 2019[3]–present (10 months, 1 week and 4 days) |
| Confirmed cases | 37,594,267[4] |
| Active cases | 10,399,681[4] |
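
A table is just nested tags: `tr` rows containing `th` header cells and `td` data cells. As a rough sketch of pulling the cell text out by hand, here is a loop over a hand-written miniature of the infobox. The trimmed-down HTML and the `demo_` names are for illustration only:

```python
from bs4 import BeautifulSoup

demo_html = """
<table class="infobox">
  <tr><th>Disease</th><td>Coronavirus disease 2019 (COVID-19)</td></tr>
  <tr><th>Location</th><td>Worldwide</td></tr>
  <tr><th>First outbreak</th><td>Mainland China</td></tr>
</table>
"""
demo_table = BeautifulSoup(demo_html, "html.parser").find("table")

demo_rows = []
for tr in demo_table.find_all("tr"):
    # Each row holds header (<th>) and data (<td>) cells, in document order
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    demo_rows.append(cells)

for row in demo_rows:
    print(row)
```

Each row comes back as a plain list of strings, which is already close to the rows of a data frame.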


## Using Pandas to read a URL to extract tables

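Pandas has a built-in function, `pd.read_html`, that reads a URL and extracts every `table` tag it finds into its own data frame, returning them as a list. (It needs an HTML parser such as `lxml` installed to do the work.) Extremely handy.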

In [59]:
import pandas as pd

tables = pd.read_html(CovidURL)

The first table again looks like this on the page:
 
<img src="figures/covid-table-0.png" width="40%">

And pandas can pull that into a data frame:

In [60]:
tables[0]

| | COVID-19 pandemic | COVID-19 pandemic.1 |
|---|---|---|
| 0 | Confirmed cases per 100,000 population as of 1... | Confirmed cases per 100,000 population as of 1... |
| 1 | Disease | Coronavirus disease 2019 (COVID-19) |
| 2 | Virus strain | Severe acute respiratory syndromecoronavirus 2... |
| 3 | Source | Probably bats, possibly via pangolins[1][2] |
| 4 | Location | Worldwide |
| 5 | First outbreak | Mainland China[3] |
| 6 | Index case | Wuhan, Hubei, China30°37′11″N 114°15′28″E / ... |
| 7 | Date | 1 December 2019[3]–present(10 months, 1 week a... |
| 8 | Confirmed cases | 37,594,267[4] |
| 9 | Active cases | 10,399,681[4] |


The second table looks like:

<img src="figures/covid-table-1.png" width="40%">

and we get a nice data frame from it too:

In [61]:
tables[1]

| | Location[a] | Location[a] | Cases[b] | Deaths[c] | Recov.[d] | Ref. |
|---|---|---|---|---|---|---|
| | | World[e] | 37,594,267 | 1,077,836 | 26,116,750 | [4] |
| 0 | | United States[f] | 7877192 | 218292 | 5028717 | [13][14] |
| 1 | | India | 7120538 | 109150 | 6149535 | [15] |
| 2 | | Brazil | 5103408 | 150689 | 4495269 | [16] |
| 3 | | Russia[g] | 1312310 | 22722 | 1024235 | [17] |
| 4 | | Colombia | 911316 | 27834 | 789787 | [18] |
| ... | ... | ... | ... | ... | ... | ... |
| 227 | | Anguilla | 3 | 0 | 3 | [328] |
| 228 | | Solomon Islands | 2 | 0 | 0 | [329] |
| 229 | | Tanzania[be] | No data | No data | No data | [331][332] |
| 230 | As of 12 October 2020 (UTC) · History of cases... | As of 12 October 2020 (UTC) · History of cases... | As of 12 October 2020 (UTC) · History of cases... | As of 12 October 2020 (UTC) · History of cases... | As of 12 October 2020 (UTC) · History of cases... | As of 12 October 2020 (UTC) · History of cases... |
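
Notice the footnote markers such as `[f]` stuck to the country names and the `No data` entries in the numeric columns. Scraped tables usually need a cleanup pass before analysis. A minimal sketch, using a few values copied from the output above and typed in by hand so that it runs without touching the network (the `demo_cases` frame and its simplified column names are assumptions, not what `read_html` returns):

```python
import pandas as pd

# A few rows copied from the table above, with simplified column names
demo_cases = pd.DataFrame({
    "Location": ["United States[f]", "India", "Russia[g]", "Tanzania[be]"],
    "Cases": ["7877192", "7120538", "1312310", "No data"],
})

# Remove bracketed footnote markers such as [f] or [be]
demo_cases["Location"] = demo_cases["Location"].str.replace(
    r"\[[^\]]*\]", "", regex=True)

# Turn the strings into numbers; anything unparseable ("No data") becomes NaN
demo_cases["Cases"] = pd.to_numeric(demo_cases["Cases"], errors="coerce")

print(demo_cases)
```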


The third table looks like this:

<img src="figures/covid-table-2.png" width="40%">

In [62]:
tables[2]

| | Country | Confirmed cases | Deaths | Case fatality rate | Deaths per 100,000 population |
|---|---|---|---|---|---|
| 0 | San Marino | 741 | 42 | 5.7% | 124.32 |
| 1 | Peru | 849371 | 33305 | 3.9% | 104.11 |
| 2 | Belgium | 162258 | 10191 | 6.3% | 89.22 |
| 3 | Bolivia | 138574 | 8308 | 6.0% | 73.18 |
| 4 | Brazil | 5094979 | 150488 | 3.0% | 71.84 |
| ... | ... | ... | ... | ... | ... |
| 163 | Papua New Guinea | 554 | 7 | 1.3% | 0.08 |
| 164 | Sri Lanka | 4752 | 13 | 0.3% | 0.06 |
| 165 | Tanzania | 509 | 21 | 4.1% | 0.04 |
| 166 | Vietnam | 1109 | 35 | 3.2% | 0.04 |
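
The `Case fatality rate` column shows values like `5.7%`, which pandas will treat as text rather than numbers, so you can't sort or average them meaningfully. A small sketch of the conversion, again on a hand-built frame of values copied from the output above (the name `demo_rates` and the `Recomputed rate` column are for illustration only):

```python
import pandas as pd

# Three rows copied from the table above
demo_rates = pd.DataFrame({
    "Country": ["San Marino", "Peru", "Belgium"],
    "Confirmed cases": [741, 849371, 162258],
    "Deaths": [42, 33305, 10191],
    "Case fatality rate": ["5.7%", "3.9%", "6.3%"],
})

# "5.7%" -> 5.7: drop the trailing percent sign, then convert to float
demo_rates["Case fatality rate"] = (
    demo_rates["Case fatality rate"].str.rstrip("%").astype(float))

# Recompute the rate from the raw counts as a sanity check on the scraped values
demo_rates["Recomputed rate"] = (
    100 * demo_rates["Deaths"] / demo_rates["Confirmed cases"]).round(1)

print(demo_rates.sort_values("Case fatality rate", ascending=False))
```

Recomputing a derived column from the raw counts is a cheap way to check that the scrape lined the columns up correctly.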
