# Advanced Data Analysis - week 1 exercises - part 1

In the advanced data analysis course, we assume basic knowledge of Python, as could be acquired by attending the *Introduction to Programming* bridging course.

This notebook includes exercises for autonomous study related to **Week 1**. There will be a second notebook with more exercises early next week.

We will use a dataset consisting of a set of files with COVID data obtained from this site:

[https://github.com/GCGImdea/coronasurveys](https://github.com/GCGImdea/coronasurveys)


## Preliminaries

Let's start by importing the Pandas, matplotlib and os libraries.

In [48]:
# imports pandas
import pandas as pd

# imports matplotlib
import matplotlib.pyplot as plt

import os

## Examples and exercises

File ```ALL-covid.csv``` contains the COVID data for all countries.

In [49]:
# create the path for the global file
fileName = os.path.join( "data", "ALL-covid.csv")

# Read a CSV file into a DataFrame
covidDF = pd.read_csv(fileName)

print( covidDF)

              date countrycode  population  cases  deaths
0       2020-01-25          MY    31949789      0       0
1       2020-01-26          MY    31949789      1       0
2       2020-01-27          MY    31949789      0       0
3       2020-01-28          MY    31949789      0       0
4       2020-01-29          MY    31949789      3       0
...            ...         ...         ...    ...     ...
104946  2021-08-07          AU    25203200    277      -1
104947  2021-08-08          AU    25203200    302       1
104948  2021-08-09          AU    25203200    380       4
104949  2021-08-10          AU    25203200    367       2
104950  2021-08-11          AU    25203200    372       2

[104951 rows x 5 columns]


Using ```covidDF```, which includes full data, print the total number of cases and deaths for a country of your choice.

In [50]:
## TODO : complete the code
print(covidDF[covidDF["countrycode"] == "AD"][["cases", "deaths"]].sum())
print(covidDF.groupby("countrycode")[["cases", "deaths"]].sum())

cases     14890
deaths      129
dtype: int64
               cases  deaths
countrycode                 
AD             14890     129
AE            696902    1988
AF            151290    6978
AG              1371      43
AL            134485    2460
...              ...     ...
XK            111642    2274
YE              7212    1392
ZA           2554239   75774
ZM            201340    3509
ZW            117953    3991

[198 rows x 2 columns]


Compute the top-3 countries with the most deaths.

In [51]:
## TODO : complete the code

print(covidDF.groupby("countrycode")["deaths"].sum().sort_values(ascending = False).iloc[:3])

countrycode
US    618137
BR    565748
IN    429179
Name: deaths, dtype: int64
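
As an aside, the sort-then-slice pattern used above can also be written with ```Series.nlargest```. The following is only a minimal sketch on a made-up DataFrame (the country codes and numbers are invented for illustration and are not taken from ```ALL-covid.csv```).

```python
import pandas as pd

# Toy data, invented for illustration only
toyDF = pd.DataFrame({
    "countrycode": ["AA", "AA", "BB", "BB", "CC", "DD"],
    "deaths":      [5,    7,    1,    2,    20,   0],
})

# Total deaths per country
totals = toyDF.groupby("countrycode")["deaths"].sum()

# nlargest(3) keeps the 3 largest values, in descending order
top3 = totals.nlargest(3)
print(top3)
```

Both forms should give the same result when there are no ties; ```nlargest``` simply states the intent more directly.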


Countries differ widely in population, so it makes more sense to compare the number of deaths relative to the population.

To do this, it is useful to have the number of deaths per 1M people instead of the raw number. This can be done by adding a column to the base DataFrame.


In [52]:
covidDF["deathsPer1M"] = covidDF["deaths"] / (covidDF["population"] / 1000000)

print(covidDF)


              date countrycode  population  cases  deaths  deathsPer1M
0       2020-01-25          MY    31949789      0       0     0.000000
1       2020-01-26          MY    31949789      1       0     0.000000
2       2020-01-27          MY    31949789      0       0     0.000000
3       2020-01-28          MY    31949789      0       0     0.000000
4       2020-01-29          MY    31949789      3       0     0.000000
...            ...         ...         ...    ...     ...          ...
104946  2021-08-07          AU    25203200    277      -1    -0.039678
104947  2021-08-08          AU    25203200    302       1     0.039678
104948  2021-08-09          AU    25203200    380       4     0.158710
104949  2021-08-10          AU    25203200    367       2     0.079355
104950  2021-08-11          AU    25203200    372       2     0.079355

[104951 rows x 6 columns]


You can now compute the top-3 countries with the most deaths relative to their population.

In [53]:
## TODO : complete the code

print(covidDF.groupby("countrycode")["deathsPer1M"].sum().sort_values(ascending = False).iloc[:3])

countrycode
PA    16305.422895
PE     6064.047936
SV     4259.671034
Name: deathsPer1M, dtype: float64
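
Summing the daily ```deathsPer1M``` values of a country gives the same number as dividing its total deaths by its population only if ```population``` is the same in every row of that country. The sketch below checks this on a small made-up DataFrame (the codes ```AA```/```BB``` and all values are invented). If the two routes disagreed on the real data, that could point to inconsistent population values.

```python
import pandas as pd

# Toy data, invented for illustration only
toyDF = pd.DataFrame({
    "countrycode": ["AA", "AA", "BB", "BB"],
    "population":  [2_000_000, 2_000_000, 500_000, 500_000],
    "deaths":      [4, 6, 1, 2],
})
toyDF["deathsPer1M"] = toyDF["deaths"] / (toyDF["population"] / 1_000_000)

# Route 1: sum the daily per-1M values for each country
summedRates = toyDF.groupby("countrycode")["deathsPer1M"].sum()

# Route 2: total deaths divided by the (constant) population
perCountry = toyDF.groupby("countrycode").agg(
    deaths=("deaths", "sum"),
    population=("population", "first"),
)
rateFromTotals = perCountry["deaths"] / perCountry["population"] * 1_000_000

print(summedRates)
print(rateFromTotals)
```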


What were the 3 days with the most cases (worldwide)?

In [54]:
## TODO : complete the code

print(covidDF.groupby("date")["cases"].sum().sort_values(ascending = False).iloc[:3])


date
2020-12-10    1498372
2021-04-28     905902
2021-04-23     904281
Name: cases, dtype: int64


To compute statistics by month, it is useful to have a column with the month instead of only the full date. Since the dates are strings in the ```YYYY-MM-DD``` format, this can be done by creating an additional column that holds the first seven characters of each date.

In [55]:
covidDF["month"] = covidDF["date"].str[:7]

print(covidDF)


              date countrycode  population  cases  deaths  deathsPer1M  \
0       2020-01-25          MY    31949789      0       0     0.000000   
1       2020-01-26          MY    31949789      1       0     0.000000   
2       2020-01-27          MY    31949789      0       0     0.000000   
3       2020-01-28          MY    31949789      0       0     0.000000   
4       2020-01-29          MY    31949789      3       0     0.000000   
...            ...         ...         ...    ...     ...          ...   
104946  2021-08-07          AU    25203200    277      -1    -0.039678   
104947  2021-08-08          AU    25203200    302       1     0.039678   
104948  2021-08-09          AU    25203200    380       4     0.158710   
104949  2021-08-10          AU    25203200    367       2     0.079355   
104950  2021-08-11          AU    25203200    372       2     0.079355   

          month  
0       2020-01  
1       2020-01  
2       2020-01  
3       2020-01  
4       2020-01  
...

We can now compute the three months with the most cases.

In [56]:
## TODO : complete the code

print(covidDF.groupby("month")["cases"].sum().sort_values(ascending = False).iloc[:3])


month
2021-04    22522676
2020-12    20189717
2021-01    19467053
Name: cases, dtype: int64
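
Slicing the date string relies on every date following the ```YYYY-MM-DD``` layout. A possible alternative, sketched here on made-up data, is to parse the dates with ```pd.to_datetime``` and convert them to monthly periods. The column name ```monthPeriod``` is hypothetical and only used in this example.

```python
import pandas as pd

# Toy data, invented for illustration only
toyDF = pd.DataFrame({
    "date":  ["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"],
    "cases": [1, 2, 10, 20],
})

# String slicing, as in the notebook: keeps the "YYYY-MM" prefix
toyDF["month"] = toyDF["date"].str[:7]

# Alternative: parse to datetime, then convert to a monthly period
toyDF["monthPeriod"] = pd.to_datetime(toyDF["date"]).dt.to_period("M")

# Both columns group the rows in the same way
casesByMonth = toyDF.groupby("month")["cases"].sum()
casesByPeriod = toyDF.groupby("monthPeriod")["cases"].sum()
print(casesByMonth)
print(casesByPeriod)
```

The parsed version fails loudly on malformed dates, whereas string slicing would silently produce an odd month label.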


Explore the DataFrame documentation [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to find out how to compute the cumulative number of cases for a country of your choice. The goal is to have a DataFrame that includes the data for the country you selected, with two additional columns for the cumulative number of cases and deaths.

In [57]:
ptCovidDF = covidDF[covidDF["countrycode"]=="PT"]

# TODO : complete

ptCovidDF["cum_cases"] = ptCovidDF["cases"].cumsum()
print(ptCovidDF)

             date countrycode  population  cases  deaths  deathsPer1M  \
76917  2020-03-02          PT    10276617      0       0     0.000000   
76918  2020-03-03          PT    10276617      0       0     0.000000   
76919  2020-03-04          PT    10276617      3       0     0.000000   
76920  2020-03-05          PT    10276617      3       0     0.000000   
76921  2020-03-06          PT    10276617      5       0     0.000000   
...           ...         ...         ...    ...     ...          ...   
77440  2021-08-07          PT    10276617   2621      17     1.654241   
77441  2021-08-08          PT    10276617   1982      10     0.973083   
77442  2021-08-09          PT    10276617   1094      18     1.751549   
77443  2021-08-10          PT    10276617   2232      17     1.654241   
77444  2021-08-11          PT    10276617   2948      12     1.167699   

         month  cum_cases  
76917  2020-03          0  
76918  2020-03          0  
76919  2020-03          3  
76920  2020

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ptCovidDF["cum_cases"] = ptCovidDF["cases"].cumsum()


You might end up getting a ```SettingWithCopyWarning```, as shown above. It is a warning rather than an error, so the code still runs. What is happening? 
Although the model of Pandas is that operations on a DataFrame create a new DataFrame, Pandas tries to optimize execution by not creating copies of DataFrames when not necessary. As a result, Pandas cannot always tell whether the filtered DataFrame is a view of the original data or a copy, and it warns that the assignment may not do what you intended. When that happens, you can explicitly force Pandas to create a copy of the DataFrame by calling the method ```copy()```.


In [58]:
ptCovidDF = covidDF[covidDF["countrycode"]=="PT"].copy()

# TODO : complete

ptCovidDF["cum_cases"] = ptCovidDF["cases"].cumsum()
print(ptCovidDF)

             date countrycode  population  cases  deaths  deathsPer1M  \
76917  2020-03-02          PT    10276617      0       0     0.000000   
76918  2020-03-03          PT    10276617      0       0     0.000000   
76919  2020-03-04          PT    10276617      3       0     0.000000   
76920  2020-03-05          PT    10276617      3       0     0.000000   
76921  2020-03-06          PT    10276617      5       0     0.000000   
...           ...         ...         ...    ...     ...          ...   
77440  2021-08-07          PT    10276617   2621      17     1.654241   
77441  2021-08-08          PT    10276617   1982      10     0.973083   
77442  2021-08-09          PT    10276617   1094      18     1.751549   
77443  2021-08-10          PT    10276617   2232      17     1.654241   
77444  2021-08-11          PT    10276617   2948      12     1.167699   

         month  cum_cases  
76917  2020-03          0  
76918  2020-03          0  
76919  2020-03          3  
76920  2020
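
The exercise asks for cumulative columns for both cases and deaths. A minimal sketch of the full pattern, on a made-up DataFrame (the codes ```AA```/```BB``` and all values are invented, and ```cum_deaths``` is a column name chosen here for illustration), could look like this:

```python
import pandas as pd

# Toy data, invented for illustration only
toyDF = pd.DataFrame({
    "date":        ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02", "2020-03-03"],
    "countrycode": ["AA", "BB", "AA", "BB", "AA"],
    "cases":       [1, 10, 2, 20, 3],
    "deaths":      [0, 1, 1, 0, 2],
})

# .copy() gives an independent DataFrame, so adding columns is unambiguous
aaDF = toyDF[toyDF["countrycode"] == "AA"].copy()

# cumsum follows the row order, so make sure rows are sorted by date
aaDF = aaDF.sort_values("date")
aaDF["cum_cases"] = aaDF["cases"].cumsum()
aaDF["cum_deaths"] = aaDF["deaths"].cumsum()
print(aaDF)
```

The original ```toyDF``` is left untouched, since the new columns are added to the copy only.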

File ```countries.csv``` has information about countries. Let's check.


In [59]:
# create the path for the global file
countriesFileName = os.path.join( "data", "countries.csv")

# Read a CSV file into a DataFrame
countriesDF = pd.read_csv(countriesFileName)

print( countriesDF)


    Continent_Name Continent_Code  \
0             Asia             AS   
1           Europe             EU   
2       Antarctica             AN   
3           Africa             AF   
4          Oceania             OC   
..             ...            ...   
257         Africa             AF   
258        Oceania             OC   
259           Asia             AS   
260           Asia             AS   
261           Asia             AS   

                                     Country_Name Two_Letter_Country_Code  \
0                Afghanistan, Islamic Republic of                      AF   
1                            Albania, Republic of                      AL   
2    Antarctica (the territory South of 60 deg S)                      AQ   
3        Algeria, People's Democratic Republic of                      DZ   
4                                  American Samoa                      AS   
..                                            ...                     ...   
257             

With this information, let's compute the total number of cases and deaths by continent.

In [60]:
# TODO : complete

joined = covidDF.join(countriesDF.set_index("Two_Letter_Country_Code"), on = "countrycode")
print(joined)
print(joined.groupby("Continent_Name")[["cases", "deaths"]].sum())

              date countrycode  population  cases  deaths  deathsPer1M  \
0       2020-01-25          MY    31949789      0       0     0.000000   
1       2020-01-26          MY    31949789      1       0     0.000000   
2       2020-01-27          MY    31949789      0       0     0.000000   
3       2020-01-28          MY    31949789      0       0     0.000000   
4       2020-01-29          MY    31949789      3       0     0.000000   
...            ...         ...         ...    ...     ...          ...   
104946  2021-08-07          AU    25203200    277      -1    -0.039678   
104947  2021-08-08          AU    25203200    302       1     0.039678   
104948  2021-08-09          AU    25203200    380       4     0.158710   
104949  2021-08-10          AU    25203200    367       2     0.079355   
104950  2021-08-11          AU    25203200    372       2     0.079355   

          month Continent_Name Continent_Code                Country_Name  \
0       2020-01           Asia    
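
When joining against a lookup table, it is worth checking that each key appears only once in it, because a repeated key duplicates the matching rows and inflates the sums. Whether ```countries.csv``` contains repeated country codes is not visible in the output above, so this is only a check worth running. The sketch uses made-up data and ```merge``` with its ```validate``` parameter.

```python
import pandas as pd

# Toy data, invented for illustration only
covidToy = pd.DataFrame({
    "countrycode": ["AA", "AA", "BB"],
    "cases":       [1, 2, 10],
})
# "BB" is listed twice on purpose, under two continents
countriesToy = pd.DataFrame({
    "Two_Letter_Country_Code": ["AA", "BB", "BB"],
    "Continent_Name":          ["Europe", "Europe", "Asia"],
})

# Number of lookup rows whose code already appeared earlier
dupCodes = int(countriesToy["Two_Letter_Country_Code"].duplicated().sum())

# A plain left merge silently duplicates the "BB" row
joinedToy = covidToy.merge(
    countriesToy, left_on="countrycode",
    right_on="Two_Letter_Country_Code", how="left",
)

# validate="many_to_one" raises if the lookup keys are not unique
try:
    covidToy.merge(
        countriesToy, left_on="countrycode",
        right_on="Two_Letter_Country_Code", how="left",
        validate="many_to_one",
    )
    lookupIsUnique = True
except pd.errors.MergeError:
    lookupIsUnique = False

print(dupCodes, len(joinedToy), lookupIsUnique)
```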

What about the number of cases and deaths per 1M population in each continent?

In [61]:
# TODO : complete

print(joined.groupby("Continent_Name")[["cases", "deathsPer1M"]].sum())

                   cases   deathsPer1M
Continent_Name                        
Africa           7114303  11257.670742
Asia            71445002  19711.478246
Europe          60655961  69568.176577
North America   43374758  35340.040122
Oceania           105648   1232.625206
South America   36079108  24810.228610
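
Note that summing the per-country ```deathsPer1M``` values, as in the cell above, adds up rates of countries with very different sizes, and the ```cases``` column remains a raw count. One possible alternative, sketched here on made-up data, is to divide the total deaths of a continent by its total population. This assumes the population of each country is constant across its rows.

```python
import pandas as pd

# Toy data, invented for illustration only: one small and one large country
toyDF = pd.DataFrame({
    "countrycode":    ["AA", "AA", "BB", "BB"],
    "Continent_Name": ["Europe", "Europe", "Europe", "Europe"],
    "population":     [1_000_000, 1_000_000, 9_000_000, 9_000_000],
    "deaths":         [50, 50, 0, 0],
})
toyDF["deathsPer1M"] = toyDF["deaths"] / (toyDF["population"] / 1_000_000)

# Summing per-country rates ignores how large each country is
summedRates = toyDF.groupby("Continent_Name")["deathsPer1M"].sum()

# Alternative: one row per country first, then totals per continent
perCountry = toyDF.groupby(["Continent_Name", "countrycode"]).agg(
    deaths=("deaths", "sum"),
    population=("population", "first"),
)
perContinent = perCountry.groupby(level="Continent_Name").sum()
perContinent["deathsPer1M"] = (
    perContinent["deaths"] / perContinent["population"] * 1_000_000
)

print(summedRates)
print(perContinent)
```

In this toy case the summed rates give 100 deaths per 1M, while 100 deaths in a combined population of 10M is 10 per 1M.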
