
# Python fundamentals 2 (Week 3)

My (new) email address: mark.briers@test-and-trace.nhs.uk

## Module: Learning outcomes

* Describe the six stages of a data processing pipeline (using CRISP-DM)

* Demonstrate an understanding of the Python programming language through the production of an elementary data analysis program

* Analyse at least three different data sources by applying at least one Python data processing library to extract and explore pertinent features

* Be able to design a set of data requirements for a specified business problem

* Describe and apply (using the Python programming language) the main approaches to supervised learning for a given classification problem

* Understand the use cases of Big Data technology (in particular Spark)

* Produce a report including appropriate data visualisations covering the analysis of a business problem using a data science based approach

## Week 3: Learning outcome

* At the end of week 3, you will be able to use Python to read data from a file on the internet. You will be able to compute elementary statistics from these data.

***

### Recap: arrays and indexing

In [None]:
import numpy as np

A NumPy array can be sliced by specifying up to three values, separated by colons. The slice format is as follows: _inclusive_start:exclusive_end:stride_. Any of the three values may be omitted, and negative values count from the end of the array.

In [None]:
# Build array/vector:
x = np.linspace(1, 10, 10)
print(x)

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]


In [None]:
print(x[0])  # first element
print(x[2])  # third element
print(x[-1]) # last element
print(x[-2]) # second to last element
print(x[1:4])     # second to fourth element. Element 5 is not included
print(x[0:-1:2])  # every other element, stopping before the last element
print(x[:])       # print the whole vector
print(x)          # print the whole vector
print(x[-1:0:-1]) # reverse the vector (the first element, index 0, is excluded)
print(x[::-1])    # reverse the vector

1.0
3.0
10.0
9.0
[2. 3. 4.]
[1. 3. 5. 7. 9.]
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
[10.  9.  8.  7.  6.  5.  4.  3.  2.]
[10.  9.  8.  7.  6.  5.  4.  3.  2.  1.]


Taken from: https://docs.python.org/release/2.3.5/whatsnew/section-slices.html
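One detail worth keeping in mind with the slicing examples above: a basic NumPy slice is a view onto the original array rather than a copy, so writing to the slice also changes the original. A minimal sketch (the names `y`, `view` and `copy` are purely illustrative):

```python
import numpy as np

y = np.linspace(1, 10, 10)

view = y[1:4]          # basic slicing returns a view onto y
view[0] = 99.0         # ...so this assignment also changes y[1]

copy = y[1:4].copy()   # an explicit copy owns its own data
copy[1] = -1.0         # y is left untouched by this assignment

print(y[:4])
print(np.shares_memory(y, view), np.shares_memory(y, copy))
```

Calling `.copy()` is the usual way to obtain an independent array when the original must not be modified.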

## An introduction to Pandas - towards data science!

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language (from https://pandas.pydata.org/). Some of the material in this notebook follows: https://jakevdp.github.io/PythonDataScienceHandbook/index.html

We will start by importing pandas (numpy was already imported above):

In [None]:
import pandas as pd

### Pandas Series

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The Series instance above has a set of *values* and an *index*. At this stage, it is conceptually similar to a numpy array. In fact, the values are a numpy array:

In [None]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])
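As a small sketch of the two parts of a Series, the snippet below pulls out the values and the index separately. The variable names are illustrative; `to_numpy()` is used here as an alternative to the `values` attribute shown above.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0])

values = s.to_numpy()   # the underlying data as a NumPy array
index = s.index         # the row labels; a RangeIndex by default

print(type(values))
print(index)
print(list(index))
```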

We can obtain entries from the series in a similar way to a numpy array:

In [None]:
data[1]

0.5

In [None]:
print(data[2:])    # print all elements beginning at the third
print("--")
print(data[1:3])   # print the second and third element (not the fourth)
print("--")
print(data[::-1])  # print all elements, in reverse order

2    0.75
3    1.00
dtype: float64
--
1    0.50
2    0.75
dtype: float64
--
3    1.00
2    0.75
1    0.50
0    0.25
dtype: float64


So why do we need this object when we have NumPy arrays? We can use the *index* to define non-integer indexes:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['Mon', 'Tue', 'Wed', 'Thur'])
data

Mon     0.25
Tue     0.50
Wed     0.75
Thur    1.00
dtype: float64

In [None]:
data['Thur']

1.0

We can even use non-contiguous, unordered integers as index labels (which can be confusing):

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [None]:
data[7]

1.0

Now that we have an index which no longer corresponds to the row number, how do we access the individual rows (e.g. the second row)?

In [None]:
data.iloc[1]

0.5
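To make the difference between label-based and position-based access explicit, here is a short sketch using the same Series. With `loc` the number is read as an index label; with `iloc` it is read as a row position.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[3]      # the row whose index label is 3
by_position = s.iloc[3]  # the fourth row, whatever its label is

print(by_label)     # 0.75
print(by_position)  # 1.0
```

Using `loc` and `iloc` explicitly, rather than plain square brackets, avoids any ambiguity when the index contains integers.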

We can construct a Series directly from a Python *dictionary*; the keys become the index labels:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [None]:
population['New York']

19651127

We can then slice the Series by label:

In [None]:
population['Texas':]

Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
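Label-based slices behave slightly differently from the positional slices seen earlier: the end label is included. A brief sketch contrasting the two (the variable names are illustrative):

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

by_label = pop.loc['Texas':'Florida']  # label slice: 'Florida' is included
by_position = pop.iloc[1:3]            # positional slice: position 3 is excluded

print(list(by_label.index))     # ['Texas', 'New York', 'Florida']
print(list(by_position.index))  # ['Texas', 'New York']
```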

We can test to see whether rows exist within the Series:

In [None]:
'Texas' in population

True

In [None]:
'New Mexico' in population

False
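Note that `in` tests the index labels of a Series, not its values. The sketch below shows the difference; searching the values has to be requested explicitly (here through `.values`).

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193})

label_present = 'Texas' in pop                 # 'Texas' is an index label
value_as_label = 26448193 in pop               # values are not searched here
value_present = bool(26448193 in pop.values)   # search the values explicitly

print(label_present, value_as_label, value_present)  # True False True
```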

### Pandas DataFrame

One way to think of a Series is as a single column with row labels. A *DataFrame* extends this concept to a two-dimensional array, that is, a representation of a table (think spreadsheet) where each row *and* column can have a label.

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
area_dict = {'California': 423967,
             'Texas': 695662,
             'New York': 141297,
             'Florida': 170312,
             'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995
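When a DataFrame is built from several Series, pandas lines the rows up by index label rather than by position. The sketch below uses two deliberately mismatched Series (a cut-down version of the data above) to show that labels missing from one Series are filled with `NaN`.

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# Rows are matched on their labels; the result covers every label seen.
frame = pd.DataFrame({'population': pop, 'area': area})

print(frame)
print(frame.isna().sum())  # number of missing entries per column
```

This alignment is why the `population` and `area` Series above combine correctly without any explicit matching of state names.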


In [None]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [None]:
states.columns

Index(['population', 'area'], dtype='object')

In [None]:
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]])

In [None]:
type(states.values)

numpy.ndarray

In [None]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

We can access a specific cell value as follows:

In [None]:
states.at['Texas','population']

26448193

We can select a single row by its label using the *loc* indexer:

In [None]:
states.loc['California']

population    38332521
area            423967
Name: California, dtype: int64

And we can select a single row by its integer position using the *iloc* indexer:

In [None]:
states.iloc[1]

population    26448193
area            695662
Name: Texas, dtype: int64

In [None]:
states.iloc[1].population

26448193

The previous line of code extracts the second row of data and then takes the value in the column titled 'population'.

We can construct new columns from existing columns:

In [None]:
states['density'] = states['population'] / states['area']
states

Unnamed: 0,population,area,density
California,38332521,423967,90.413926
Texas,26448193,695662,38.01874
New York,19651127,141297,139.076746
Florida,19552860,170312,114.806121
Illinois,12882135,149995,85.883763
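Assigning to `states['density']` changes the DataFrame in place. As an alternative sketch, `assign` returns a new DataFrame with the extra column and leaves the original untouched. The small frame here is a two-row subset of the data above, and the names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'population': [38332521, 26448193],
                   'area': [423967, 695662]},
                  index=['California', 'Texas'])

# assign() returns a copy with the new column; df itself is unchanged.
with_density = df.assign(density=df['population'] / df['area'])

print(with_density)
print(list(df.columns))  # ['population', 'area']
```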


We can switch the order of the rows and columns (transpose the DataFrame):

In [None]:
print(states.T)
print(states.T.iloc[1])

              California         Texas  ...       Florida      Illinois
population  3.833252e+07  2.644819e+07  ...  1.955286e+07  1.288214e+07
area        4.239670e+05  6.956620e+05  ...  1.703120e+05  1.499950e+05
density     9.041393e+01  3.801874e+01  ...  1.148061e+02  8.588376e+01

[3 rows x 5 columns]
California    423967.0
Texas         695662.0
New York      141297.0
Florida       170312.0
Illinois      149995.0
Name: area, dtype: float64


We can slice the DataFrame by integer location:

In [None]:
states.iloc[:3, :2]

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297


In [None]:
states.T.iloc[:3,:2]

Unnamed: 0,California,Texas
population,38332520.0,26448190.0
area,423967.0,695662.0
density,90.41393,38.01874


We can also select rows that satisfy a condition, together with specific columns by name, using the *loc* indexer:

In [None]:
states.loc[states.density > 100, ['population', 'density']]

Unnamed: 0,population,density
New York,19651127,139.076746
Florida,19552860,114.806121
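The condition passed to `loc` is a boolean Series with one entry per row, and several conditions can be combined with `&` (and) or `|` (or), each wrapped in parentheses. A minimal sketch on the same data; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['population'] / df['area']

# Each comparison yields a boolean Series; & combines them row by row.
mask = (df['density'] > 80) & (df['population'] > 15_000_000)
selected = df.loc[mask, ['population', 'density']]

print(mask)
print(selected)
```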


We can use numpy functions to perform operations on DataFrames, which preserves the indexes:

In [None]:
logStates = np.log(states)

In [None]:
logStates

Unnamed: 0,population,area,density
California,17.461809,12.957411,4.504398
Texas,17.090698,13.452619,3.638079
New York,16.793645,11.858619,4.935026
Florida,16.788632,12.045387,4.743245
Illinois,16.371352,11.918357,4.452995


We can compute basic summary statistics for each of the columns listed above:

In [None]:
states.describe()

Unnamed: 0,population,area,density
count,5.0,5.0,5.0
mean,23373370.0,316246.6,93.639859
std,9640386.0,242437.411951,37.672251
min,12882140.0,141297.0,38.01874
25%,19552860.0,149995.0,85.883763
50%,19651130.0,170312.0,90.413926
75%,26448190.0,423967.0,114.806121
max,38332520.0,695662.0,139.076746
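The individual statistics reported by `describe()` can also be computed one at a time. The sketch below does this for the same data, using `mean`, `median` and `agg`; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

mean_area = df['area'].mean()       # a single statistic for one column
median_area = df['area'].median()   # the same value as the 50% row of describe()

# Several statistics for every column at once.
summary = df.agg(['min', 'max', 'mean'])

print(mean_area, median_area)
print(summary)
```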


## Loading data into a DataFrame

We can load data from a CSV file into a DataFrame as follows:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv')

In [None]:
df.shape

(194, 2)

In [None]:
df

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv', index_col=0)

In [None]:
df

Country,Region
Algeria,AFRICA
Angola,AFRICA
Benin,AFRICA
Botswana,AFRICA
Burkina,AFRICA
...,...
Paraguay,SOUTH AMERICA
Peru,SOUTH AMERICA
Suriname,SOUTH AMERICA
Uruguay,SOUTH AMERICA


In [None]:
df.loc['United Kingdom']

Region    EUROPE
Name: United Kingdom, dtype: object
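The same approach works without a network connection. The sketch below reads a few rows of CSV text from memory (via `io.StringIO`) and uses `index_col` with a column name rather than a position. The three rows are taken from the countries file above; everything else is illustrative.

```python
import io
import pandas as pd

csv_text = """Country,Region
Algeria,AFRICA
Peru,SOUTH AMERICA
United Kingdom,EUROPE
"""

# index_col accepts a column name as well as an integer position.
countries = pd.read_csv(io.StringIO(csv_text), index_col='Country')

region = countries.loc['United Kingdom', 'Region']
europe = countries[countries['Region'] == 'EUROPE']

print(region)
print(europe)
```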

## Case study: COVID-19 data from Johns Hopkins University

Let's install the Python package _wget_ into Colab:

In [None]:
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=dc95d71d6d3be477510d87d56645a37251ac1c136a538e9f1f86948dd476a587
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
import wget

We will grab two files from the internet, one for COVID-19 cases and one for deaths.

In [None]:
# URLs of the raw CSV datasets
urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv',
]
[wget.download(url) for url in urls]

['time_series_covid19_confirmed_global.csv',
 'time_series_covid19_deaths_global.csv']

Let's read these into two DataFrames:

In [None]:
confirmed_df = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('time_series_covid19_deaths_global.csv')

In [None]:
confirmed_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20,2/26/20,...,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20,12/26/20,12/27/20,12/28/20,12/29/20,12/30/20,12/31/20,1/1/21,1/2/21,1/3/21,1/4/21,1/5/21,1/6/21,1/7/21,1/8/21,1/9/21,1/10/21,1/11/21,1/12/21,1/13/21,1/14/21,1/15/21,1/16/21,1/17/21,1/18/21,1/19/21,1/20/21,1/21/21,1/22/21,1/23/21,1/24/21,1/25/21,1/26/21,1/27/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,...,49681,49817,50013,50190,50433,50655,50810,50886,51039,51280,51350,51405,51526,51526,51526,51526,53011,53105,53105,53207,53332,53400,53489,53538,53584,53584,53775,53831,53938,53984,54062,54141,54278,54403,54483,54559,54595,54672,54750,54854
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,52542,53003,53425,53814,54317,54827,55380,55755,56254,56572,57146,57727,58316,58316,58991,59438,59623,60283,61008,61705,62378,63033,63595,63971,64627,65334,65994,66635,67216,67690,67982,68568,69238,69916,70655,71441,72274,72812,73691,74567
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,...,94781,95203,95659,96069,96549,97007,97441,97857,98249,98631,98988,99311,99610,99897,100159,100408,100645,100873,101120,101382,101657,101913,102144,102369,102641,102860,103127,103381,103611,103833,104092,104341,104606,104852,105124,105369,105596,105854,106097,106359
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,7560,7577,7602,7633,7669,7699,7756,7806,7821,7875,7919,7983,8049,8117,8166,8192,8249,8308,8348,8348,8489,8586,8586,8586,8682,8818,8868,8946,9038,9083,9083,9194,9308,9379,9416,9499,9549,9596,9638,9716
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,16626,16644,16686,16802,16931,17029,17099,17149,17240,17296,17371,17433,17553,17568,17608,17642,17684,17756,17864,17974,18066,18156,18193,18254,18343,18425,18613,18679,18765,18875,18926,19011,19093,19177,19269,19367,19399,19476,19553,19580


Let's take a quick look at the column headings:

In [None]:
print(confirmed_df.columns)
print(deaths_df.columns)

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '1/18/21', '1/19/21', '1/20/21', '1/21/21', '1/22/21', '1/23/21',
       '1/24/21', '1/25/21', '1/26/21', '1/27/21'],
      dtype='object', length=376)
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '1/18/21', '1/19/21', '1/20/21', '1/21/21', '1/22/21', '1/23/21',
       '1/24/21', '1/25/21', '1/26/21', '1/27/21'],
      dtype='object', length=376)
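With one column per date and, for some countries, one row per province, a natural next step is to total the rows for each country and turn the date labels into real dates. The following is only a sketch of one way to do that: the data are made-up numbers in the same column layout as the files above, and the names `toy`, `by_country` and `series` are hypothetical.

```python
import pandas as pd

# Made-up numbers in the same layout as the files above (not real data).
toy = pd.DataFrame({
    'Province/State': ['A', 'B', None],
    'Country/Region': ['Xland', 'Xland', 'Yland'],
    'Lat': [10.0, 12.0, -5.0],
    'Long': [20.0, 22.0, 30.0],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 4, 1],
})

date_cols = list(toy.columns[4:])   # every column after 'Long'

# Add up the provinces belonging to each country, date by date.
by_country = toy.groupby('Country/Region')[date_cols].sum()

# Transpose so that each row is a date, then parse the labels as dates.
series = by_country.T
series.index = pd.to_datetime(series.index, format='%m/%d/%y')

print(by_country)
print(series)
```

Selecting the date columns by position assumes the first four columns are always the metadata columns, as in the headings printed above.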


Let's have a quick look at a summary of the columns:

In [None]:
confirmed_df.describe()

Unnamed: 0,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20,2/26/20,2/27/20,2/28/20,...,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20,12/26/20,12/27/20,12/28/20,12/29/20,12/30/20,12/31/20,1/1/21,1/2/21,1/3/21,1/4/21,1/5/21,1/6/21,1/7/21,1/8/21,1/9/21,1/10/21,1/11/21,1/12/21,1/13/21,1/14/21,1/15/21,1/16/21,1/17/21,1/18/21,1/19/21,1/20/21,1/21/21,1/22/21,1/23/21,1/24/21,1/25/21,1/26/21,1/27/21
count,272.0,272.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,...,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0
mean,20.639516,23.165585,2.040293,2.399267,3.446886,5.249084,7.758242,10.721612,20.432234,22.589744,30.164835,36.362637,44.095238,61.490842,72.846154,87.538462,101.25641,112.831502,125.992674,136.007326,147.106227,156.663004,164.142857,165.673993,221.179487,245.087912,252.937729,260.934066,268.388278,275.282051,277.113553,279.164835,281.468864,287.919414,289.311355,291.377289,294.501832,298.080586,303.062271,308.142857,...,279748.3,281701.4,283711.9,286089.3,288631.1,291070.6,292771.6,294651.2,296256.4,298074.7,300501.7,303284.8,305932.7,307878.9,310172.0,312126.0,314153.8,316858.6,319721.2,322864.0,325873.1,328664.9,330823.0,333088.8,335671.4,338409.5,341174.1,343976.8,346238.0,348250.0,350134.9,352362.8,354905.2,357310.7,359722.1,361800.4,363433.9,365297.4,367313.4,369472.6
std,25.18145,73.696719,26.928183,27.026232,33.646272,46.828263,65.44278,88.174227,216.374412,217.699503,300.001701,355.680061,437.683212,680.184951,822.250641,1013.386311,1194.199877,1342.439727,1514.391693,1644.405187,1797.481036,1924.323293,2023.436842,2023.775677,2919.944492,3294.720787,3406.217613,3523.133043,3632.411763,3734.796674,3755.9312,3780.771608,3794.149139,3880.108063,3880.148205,3892.498923,3922.720488,3947.162843,3972.463824,3992.676119,...,1348405.0,1359262.0,1370565.0,1382623.0,1396127.0,1408140.0,1414705.0,1427149.0,1436216.0,1446264.0,1458494.0,1472571.0,1486198.0,1495536.0,1512272.0,1524066.0,1534814.0,1548939.0,1564296.0,1581217.0,1598015.0,1614110.0,1626559.0,1639037.0,1652922.0,1667095.0,1681620.0,1696508.0,1708868.0,1719632.0,1728483.0,1739779.0,1751641.0,1763820.0,1775856.0,1786727.0,1794873.0,1804149.0,1813808.0,1823894.0
min,-51.7963,-178.1165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.105887,-20.02605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,643.0,666.0,666.0,684.0,684.0,684.0,684.0,684.0,684.0,684.0,684.0,684.0,684.0,689.0,710.0,712.0,712.0,712.0,743.0,743.0,771.0,808.0,811.0,819.0,823.0,836.0,842.0,843.0,851.0,855.0,862.0,865.0,865.0,865.0,866.0,866.0,866.0,886.0,890.0,893.0
50%,21.8051,20.97265,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8078.0,8152.0,8237.0,8300.0,8353.0,8424.0,8481.0,8540.0,8610.0,8671.0,8724.0,8778.0,8846.0,8888.0,8923.0,8964.0,9017.0,9118.0,9118.0,9225.0,9368.0,9461.0,9630.0,9740.0,9740.0,9819.0,9991.0,10569.0,10635.0,10781.0,10781.0,10852.0,10907.0,10963.0,10963.0,11035.0,11099.0,11099.0,11286.0,11286.0
75%,41.123,84.497525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,2.0,2.0,...,116212.0,117022.0,117190.0,117946.0,120094.0,121399.0,121880.0,122413.0,122864.0,123145.0,123388.0,123856.0,124630.0,125042.0,125616.0,126345.0,126935.0,128370.0,129888.0,130070.0,135884.0,138518.0,139281.0,139713.0,141587.0,144518.0,146689.0,147016.0,148370.0,148607.0,148925.0,149462.0,149973.0,150479.0,151041.0,151646.0,151980.0,152412.0,153226.0,154083.0
max,71.7069,178.065,444.0,444.0,549.0,761.0,1058.0,1423.0,3554.0,3554.0,4903.0,5806.0,7153.0,11177.0,13522.0,16678.0,19665.0,22112.0,24953.0,27100.0,29631.0,31728.0,33366.0,33366.0,48206.0,54406.0,56249.0,58182.0,59989.0,61682.0,62031.0,62442.0,62662.0,64084.0,64084.0,64287.0,64786.0,65187.0,65596.0,65914.0,...,17737480.0,17925030.0,18123640.0,18320880.0,18549640.0,18742800.0,18839530.0,19066380.0,19222060.0,19396240.0,19595120.0,19827130.0,20061050.0,20213390.0,20513680.0,20722230.0,20906020.0,21139550.0,21393460.0,21670200.0,21962250.0,22224220.0,22437500.0,22651460.0,22877700.0,23107570.0,23342550.0,23583260.0,23784020.0,23961420.0,24104030.0,24281010.0,24463590.0,24656650.0,24846680.0,25016820.0,25147890.0,25298990.0,25445580.0,25598060.0


In [None]:
deaths_df.describe()

Unnamed: 0,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20,2/26/20,2/27/20,2/28/20,...,12/19/20,12/20/20,12/21/20,12/22/20,12/23/20,12/24/20,12/25/20,12/26/20,12/27/20,12/28/20,12/29/20,12/30/20,12/31/20,1/1/21,1/2/21,1/3/21,1/4/21,1/5/21,1/6/21,1/7/21,1/8/21,1/9/21,1/10/21,1/11/21,1/12/21,1/13/21,1/14/21,1/15/21,1/16/21,1/17/21,1/18/21,1/19/21,1/20/21,1/21/21,1/22/21,1/23/21,1/24/21,1/25/21,1/26/21,1/27/21
count,272.0,272.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,...,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0,273.0
mean,20.639516,23.165585,0.062271,0.065934,0.095238,0.153846,0.205128,0.300366,0.479853,0.487179,0.626374,0.78022,0.948718,1.326007,1.56044,1.802198,2.065934,2.322344,2.6337,2.952381,3.318681,3.710623,4.076923,4.095238,5.021978,5.578755,6.102564,6.483516,6.842491,7.355311,7.776557,8.234432,8.249084,9.007326,9.047619,9.6337,9.92674,10.150183,10.307692,10.52381,...,6174.904762,6202.981685,6237.714286,6289.937729,6340.406593,6382.919414,6413.249084,6439.153846,6465.663004,6500.681319,6556.538462,6612.571429,6660.571429,6695.003663,6724.959707,6751.703297,6789.40293,6845.772894,6900.783883,6954.454212,7010.732601,7057.32967,7087.25641,7124.725275,7187.362637,7248.216117,7304.311355,7359.406593,7407.14652,7439.043956,7473.029304,7534.732601,7600.003663,7661.333333,7719.490842,7767.967033,7800.168498,7839.282051,7902.864469,7963.842491
std,25.18145,73.696719,1.028887,1.030443,1.454612,2.421972,3.148595,4.600739,7.565435,7.565937,9.805065,12.34653,15.069603,21.18181,25.055032,28.988672,33.22477,37.400733,42.302455,47.204358,52.711175,58.944395,64.633167,64.63283,79.277757,88.174471,96.586618,102.637751,108.266492,116.255084,122.790013,129.74779,129.747082,141.971025,141.969761,150.986333,155.101444,158.248581,159.824492,162.30912,...,26435.083168,26536.659898,26658.914272,26881.329158,27095.29423,27281.610943,27380.206943,27481.672412,27570.225452,27694.624264,27935.068592,28181.898965,28400.625965,28537.160714,28677.078269,28769.887061,28904.806694,29154.759463,29410.199468,29671.192252,29928.4958,30152.512034,30273.87659,30413.50489,30703.331756,30972.023229,31229.00391,31482.743629,31709.771457,31833.441864,31943.259053,32170.97482,32469.875793,32758.782922,33019.729922,33253.944374,33379.568735,33521.827953,33808.45818,34087.686744
min,-51.7963,-178.1165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.105887,-20.02605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0
50%,21.8051,20.97265,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,114.0,118.0,118.0,118.0,118.0,118.0,119.0,119.0,119.0,120.0,120.0,121.0,122.0,124.0,124.0,124.0,124.0,125.0,126.0,127.0,127.0,128.0,128.0,128.0,130.0,130.0,133.0,138.0,138.0,140.0,142.0,142.0,142.0,142.0,143.0,145.0,146.0,146.0,147.0,151.0
75%,41.123,84.497525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1777.0,1788.0,1795.0,1798.0,1803.0,1808.0,1816.0,1819.0,1825.0,1879.0,1913.0,1918.0,1923.0,1937.0,1944.0,1948.0,1950.0,1996.0,2036.0,2076.0,2119.0,2171.0,2197.0,2232.0,2272.0,2301.0,2314.0,2324.0,2336.0,2339.0,2343.0,2346.0,2354.0,2363.0,2370.0,2373.0,2378.0,2385.0,2477.0,2553.0
max,71.7069,178.065,17.0,17.0,24.0,40.0,52.0,76.0,125.0,125.0,162.0,204.0,249.0,350.0,414.0,479.0,549.0,618.0,699.0,780.0,871.0,974.0,1068.0,1068.0,1310.0,1457.0,1596.0,1696.0,1789.0,1921.0,2029.0,2144.0,2144.0,2346.0,2346.0,2495.0,2563.0,2615.0,2641.0,2682.0,...,316382.0,317877.0,319664.0,323037.0,326391.0,329258.0,330434.0,332075.0,333289.0,335130.0,338793.0,342542.0,345955.0,347982.0,350310.0,351657.0,353652.0,357402.0,361292.0,365242.0,369279.0,372530.0,374354.0,376360.0,380826.0,384790.0,388720.0,392525.0,395877.0,397628.0,399033.0,401807.0,406184.0,410387.0,414147.0,417476.0,419251.0,421168.0,425252.0,429195.0
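The `mean` row of `describe()` can be extracted as a Series indexed by date, which is easier to inspect than the wide table above. A minimal sketch on made-up numbers in the same layout follows (the names are hypothetical); it assumes, as the figures above suggest, that each date column holds a running total.

```python
import pandas as pd

# Made-up numbers in the same layout as the files above (not real data).
toy = pd.DataFrame({
    'Province/State': [None, None, None],
    'Country/Region': ['Xland', 'Yland', 'Zland'],
    'Lat': [10.0, -5.0, 40.0],
    'Long': [20.0, 30.0, -3.0],
    '1/22/20': [0, 2, 4],
    '1/23/20': [1, 2, 6],
    '1/24/20': [3, 5, 7],
})

counts = toy.iloc[:, 4:]            # the date columns only
mean_by_date = counts.mean()        # one mean per date, as in describe()
daily_change = mean_by_date.diff()  # day-to-day change in that mean

print(mean_by_date)
print(daily_change)
```

The first entry of `daily_change` is `NaN` because there is no earlier date to compare against.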


***

## Exercises

Mandatory exercises:

- [ ] Replicate all of the code in this notebook.
- [ ] Introduce COVID-19 recoveries into the analysis above. 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'. What do you notice about the mean number of recoveries as a function of time?