# Advanced Data Analysis - week 1 exercises - part 1

In the advanced data analysis course, we assume basic knowledge of Python, as could be acquired by attending the *Introduction to Programming* bridging course.

This notebook includes exercises for autonomous study related to **Week 1**. A second notebook with more exercises will follow early next week.

We will use a dataset consisting of a set of files with COVID data, obtained from this site:

[https://github.com/GCGImdea/coronasurveys](https://github.com/GCGImdea/coronasurveys)


## Preliminaries

Let's start by importing the Pandas, matplotlib and os libraries.

In [1]:
# imports pandas
import pandas as pd

# imports matplotlib
import matplotlib.pyplot as plt

import os

## Examples and exercises

File "ALL-covid.csv" has the information for COVID in all countries.

In [2]:
covidDF = pd.read_csv("data/ALL-covid.csv")
covidDF

Unnamed: 0,date,countrycode,population,cases,deaths
0,2020-01-25,MY,31949789,0,0
1,2020-01-26,MY,31949789,1,0
2,2020-01-27,MY,31949789,0,0
3,2020-01-28,MY,31949789,0,0
4,2020-01-29,MY,31949789,3,0
...,...,...,...,...,...
104946,2021-08-07,AU,25203200,277,-1
104947,2021-08-08,AU,25203200,302,1
104948,2021-08-09,AU,25203200,380,4
104949,2021-08-10,AU,25203200,367,2


Using ```covidDF```, which includes full data, print the total number of cases and deaths for a country of your choice.

In [19]:
covidDF.loc[covidDF["countrycode"] == "PT", ["cases", "deaths"]].sum()

cases     993239
deaths     17514
dtype: int64

Compute the top-3 countries with the most deaths.

In [4]:
covidDF.groupby("countrycode")["deaths"].sum().nlargest(3)

countrycode
US    618137
BR    565748
IN    429179
Name: deaths, dtype: int64

Not all countries have the same population, so it makes more sense to compute the number of deaths relative to the population.

For this, it is useful to have the number of deaths per 1M people instead of the raw number. This can be done by adding a column to the base DataFrame.


In [5]:
covidDF["Deaths_per_1M"] = covidDF.deaths / ( covidDF.population / 1000000 )

You can now compute the top-3 countries with the most deaths relative to their population.

In [6]:
covidDF.groupby("countrycode")["Deaths_per_1M"].sum().nlargest(3)

countrycode
PA    16305.422895
PE     6064.047936
SV     4259.671034
Name: Deaths_per_1M, dtype: float64
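
Summing the daily per-1M values per country, as done above, should give the same result as dividing the country's total deaths by its population, provided the population value is the same on every row of that country. The following is a minimal sketch of that equivalence on made-up toy data; the country codes ```AA``` and ```BB``` and all numbers are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): two countries, constant population per country.
toy = pd.DataFrame({
    "countrycode": ["AA", "AA", "BB", "BB"],
    "population": [2_000_000, 2_000_000, 500_000, 500_000],
    "deaths": [4, 6, 1, 2],
})

# Approach used in the notebook: per-row rate, then sum per country.
toy["Deaths_per_1M"] = toy.deaths / (toy.population / 1_000_000)
summed_rates = toy.groupby("countrycode")["Deaths_per_1M"].sum()

# Alternative: total deaths divided by the country's population.
totals = toy.groupby("countrycode").agg(
    deaths=("deaths", "sum"),
    population=("population", "first"),
)
rate_from_totals = totals.deaths / (totals.population / 1_000_000)

print(summed_rates)
print(rate_from_totals)
```

If the population of a country changed between rows, the two approaches would no longer agree, so it is worth checking that assumption on the real data.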

What were the 3 days with the most cases (worldwide)?

In [7]:
covidDF.groupby("date")["cases"].sum().nlargest(3)

date
2020-12-10    1498372
2021-04-28     905902
2021-04-23     904281
Name: cases, dtype: int64

If we wanted to compute statistics by month, it would be useful to have a column with the month, instead of having only the full date. This can be done by creating an additional column using the string functions that extract a substring of a column.

In [9]:
covidDF["month"] = covidDF.date.str[:7]
covidDF["month"]

0         2020-01
1         2020-01
2         2020-01
3         2020-01
4         2020-01
           ...   
104946    2021-08
104947    2021-08
104948    2021-08
104949    2021-08
104950    2021-08
Name: month, Length: 104951, dtype: object
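
String slicing works here because the dates are stored as text in ```YYYY-MM-DD``` format. As an alternative sketch, the dates can be parsed with ```pd.to_datetime``` and converted to monthly periods; the toy dates below are only illustrative.

```python
import pandas as pd

# Toy dates stored as strings, in the same YYYY-MM-DD format as the dataset.
dates = pd.Series(["2020-01-25", "2020-02-01", "2021-08-10"])

# String slicing, as in the notebook: keep the first 7 characters ("YYYY-MM").
month_str = dates.str[:7]

# Alternative: parse to datetime, then convert to a monthly period.
month_period = pd.to_datetime(dates).dt.to_period("M")

print(month_str.tolist())
print(month_period.astype(str).tolist())
```

The period version keeps the values as proper time objects, which can help with sorting or resampling later, while the string version is simpler when only grouping is needed.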

We can now compute the three months with the most cases.

In [10]:
covidDF.groupby("month")["cases"].sum().nlargest(3)

month
2021-04    22522676
2020-12    20189717
2021-01    19467053
Name: cases, dtype: int64

Explore the DataFrame documentation [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to find out how to compute the cumulative number of cases for a country of your choice. The goal is to have a DataFrame that includes the data for the country you selected, with two additional columns for the cumulative number of cases and deaths.

In [11]:
gerDF = covidDF[covidDF.countrycode == "DE"]
gerDF["cum_cases"] = gerDF.cases.cumsum()
gerDF["cum_deaths"] = gerDF.deaths.cumsum()

gerDF

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gerDF["cum_cases"] = gerDF.cases.cumsum()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gerDF["cum_deaths"] = gerDF.deaths.cumsum()


Unnamed: 0,date,countrycode,population,cases,deaths,Deaths_per_1M,month,cum_cases,cum_deaths
58615,2020-01-28,DE,83019213,0,0,0.000000,2020-01,0,0
58616,2020-01-29,DE,83019213,0,0,0.000000,2020-01,0,0
58617,2020-01-30,DE,83019213,0,0,0.000000,2020-01,0,0
58618,2020-01-31,DE,83019213,1,0,0.000000,2020-01,1,0
58619,2020-02-01,DE,83019213,3,0,0.000000,2020-02,4,0
...,...,...,...,...,...,...,...,...,...
59172,2021-08-07,DE,83019213,2761,4,0.048182,2021-08,3795605,91789
59173,2021-08-08,DE,83019213,2240,2,0.024091,2021-08,3797845,91791
59174,2021-08-09,DE,83019213,2220,19,0.228863,2021-08,3800065,91810
59175,2021-08-10,DE,83019213,3282,14,0.168636,2021-08,3803347,91824


You might end up getting a ```SettingWithCopyWarning``` (a warning, not an error: the code still runs). What is happening?
Although the Pandas model is that operations on a DataFrame create a new DataFrame, Pandas tries to optimize execution by not creating copies when they are not necessary. Sometimes this leads to problems: a filtered selection may be either a view of the original DataFrame or a copy of it, so Pandas cannot guarantee what an assignment to it will affect. When that happens, you can explicitly force Pandas to create a copy of the DataFrame by using the function ```copy()```.


In [12]:
gerDF = covidDF[covidDF.countrycode == "DE"].copy()
gerDF["cum_cases"] = gerDF.cases.cumsum()
gerDF["cum_deaths"] = gerDF.deaths.cumsum()

gerDF

Unnamed: 0,date,countrycode,population,cases,deaths,Deaths_per_1M,month,cum_cases,cum_deaths
58615,2020-01-28,DE,83019213,0,0,0.000000,2020-01,0,0
58616,2020-01-29,DE,83019213,0,0,0.000000,2020-01,0,0
58617,2020-01-30,DE,83019213,0,0,0.000000,2020-01,0,0
58618,2020-01-31,DE,83019213,1,0,0.000000,2020-01,1,0
58619,2020-02-01,DE,83019213,3,0,0.000000,2020-02,4,0
...,...,...,...,...,...,...,...,...,...
59172,2021-08-07,DE,83019213,2761,4,0.048182,2021-08,3795605,91789
59173,2021-08-08,DE,83019213,2240,2,0.024091,2021-08,3797845,91791
59174,2021-08-09,DE,83019213,2220,19,0.228863,2021-08,3800065,91810
59175,2021-08-10,DE,83019213,3282,14,0.168636,2021-08,3803347,91824
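
Another way to avoid the warning, sketched here on made-up toy data, is to build the extra columns with ```assign```, which returns a new DataFrame instead of modifying the filtered selection in place. The values below are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with rows for two countries.
toy = pd.DataFrame({
    "countrycode": ["DE", "PT", "DE", "DE"],
    "cases": [1, 5, 3, 2],
    "deaths": [0, 1, 1, 0],
})

# Filter, then add the cumulative columns in one chained expression.
# assign() returns a new DataFrame, so nothing is written into the selection.
ger = toy[toy.countrycode == "DE"].assign(
    cum_cases=lambda d: d.cases.cumsum(),
    cum_deaths=lambda d: d.deaths.cumsum(),
)

print(ger)
```

The original ```toy``` DataFrame is left unchanged, so the result is the same as with ```copy()``` followed by column assignment.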


File ```countries.csv``` has information about countries. Let's check.


In [14]:
countriesDF = pd.read_csv("data/countries.csv")
countriesDF

Unnamed: 0,Continent_Name,Continent_Code,Country_Name,Two_Letter_Country_Code,Three_Letter_Country_Code,Country_Number
0,Asia,AS,"Afghanistan, Islamic Republic of",AF,AFG,4.0
1,Europe,EU,"Albania, Republic of",AL,ALB,8.0
2,Antarctica,AN,Antarctica (the territory South of 60 deg S),AQ,ATA,10.0
3,Africa,AF,"Algeria, People's Democratic Republic of",DZ,DZA,12.0
4,Oceania,OC,American Samoa,AS,ASM,16.0
...,...,...,...,...,...,...
257,Africa,AF,"Zambia, Republic of",ZM,ZMB,894.0
258,Oceania,OC,Disputed Territory,XX,,
259,Asia,AS,Iraq-Saudi Arabia Neutral Zone,XE,,
260,Asia,AS,United Nations Neutral Zone,XD,,


With this information, let's compute the total number of cases and deaths by continent.

In [16]:
totalDF = countriesDF[["Continent_Name", "Country_Name", "Two_Letter_Country_Code"]].set_index("Two_Letter_Country_Code").join(covidDF.set_index("countrycode"), how = "left")
totalDF

Unnamed: 0,Continent_Name,Country_Name,date,population,cases,deaths,Deaths_per_1M,month
AD,Europe,"Andorra, Principality of",2020-03-02,78015.0,0.0,0.0,0.000000,2020-03
AD,Europe,"Andorra, Principality of",2020-03-03,78015.0,0.0,0.0,0.000000,2020-03
AD,Europe,"Andorra, Principality of",2020-03-04,78015.0,0.0,0.0,0.000000,2020-03
AD,Europe,"Andorra, Principality of",2020-03-05,78015.0,0.0,0.0,0.000000,2020-03
AD,Europe,"Andorra, Principality of",2020-03-06,78015.0,0.0,0.0,0.000000,2020-03
...,...,...,...,...,...,...,...,...
,Africa,"Namibia, Republic of",2021-08-07,2550226.0,319.0,24.0,9.410931,2021-08
,Africa,"Namibia, Republic of",2021-08-08,2550226.0,229.0,9.0,3.529099,2021-08
,Africa,"Namibia, Republic of",2021-08-09,2550226.0,160.0,5.0,1.960611,2021-08
,Africa,"Namibia, Republic of",2021-08-10,2550226.0,304.0,8.0,3.136977,2021-08
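
The blank index on the Namibia rows above is most likely a parsing effect: Namibia's two-letter code is ```NA```, and ```pd.read_csv``` treats the string ```NA``` as a missing value by default. A minimal sketch of the behaviour, using a small in-memory CSV with made-up numbers rather than the real files:

```python
import io
import pandas as pd

# Tiny CSV (hypothetical values) containing the country code "NA" (Namibia).
csv_text = "countrycode,cases\nNA,10\nPT,20\n"

# Default behaviour: the string "NA" is interpreted as a missing value (NaN).
default_read = pd.read_csv(io.StringIO(csv_text))

# Keep "NA" as text: disable the default markers and treat only empty
# fields as missing.
strict_read = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(default_read)
print(strict_read)
```

Whether this option suits the real files depends on which other missing-value markers they contain, so treat it as a starting point rather than a drop-in fix.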


In [17]:
totalDF.groupby("Continent_Name")[["cases", "deaths"]].sum()

Continent_Name,cases,deaths
Africa,7114303.0,179873.0
Antarctica,0.0,0.0
Asia,71445002.0,1114988.0
Europe,60655961.0,1220272.0
North America,43374758.0,939540.0
Oceania,105648.0,1649.0
South America,36079108.0,1106612.0


What about the number of cases and deaths per 1M population in each continent?

In [18]:
totalDF.groupby("Continent_Name")[["cases", "Deaths_per_1M"]].sum()

Continent_Name,cases,Deaths_per_1M
Africa,7114303.0,11257.670742
Antarctica,0.0,0.0
Asia,71445002.0,19711.478246
Europe,60655961.0,69568.176577
North America,43374758.0,35340.040122
Oceania,105648.0,1232.625206
South America,36079108.0,24810.22861
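
Note that summing ```Deaths_per_1M``` over a continent adds up the per-country rates, which is not the same as the rate for the continent as a whole. A possible alternative, sketched on made-up toy data, is to divide the continent's total deaths by its total population, counting each country's population only once. The continent ```X```, the countries ```A``` and ```B```, and all numbers are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): one continent, two countries, two days each.
toy = pd.DataFrame({
    "Continent_Name": ["X", "X", "X", "X"],
    "Country_Name": ["A", "A", "B", "B"],
    "population": [1_000_000, 1_000_000, 9_000_000, 9_000_000],
    "deaths": [50, 50, 0, 0],
})

# Sum of per-country daily rates, as in the cell above.
per_row_rate = toy.deaths / (toy.population / 1_000_000)
sum_of_rates = per_row_rate.groupby(toy.Continent_Name).sum()

# Continent-level rate: total deaths over total population,
# with each country's population counted once.
continent_pop = (
    toy.drop_duplicates(["Continent_Name", "Country_Name"])
    .groupby("Continent_Name")["population"]
    .sum()
)
continent_deaths = toy.groupby("Continent_Name")["deaths"].sum()
deaths_per_1M = continent_deaths / (continent_pop / 1_000_000)

print(sum_of_rates)
print(deaths_per_1M)
```

In this toy case the two numbers differ by a factor of ten, because the small country dominates the sum of rates while the large one dominates the population.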
