Activity 2.6 -- Working with COVID-19 and World Bank Data
=========================================================

In this activity, you will explore relationships between various World
Bank indicators for countries and their corresponding COVID death rates.
First, download the COVID-19 data (see the links and instructions
below) and the selected indicators from World Bank Open Data,
available at <https://data.worldbank.org>.

**COVID data set source:** <https://coviddata.github.io/coviddata/#csvs>

**Tasks.** Use pandas and dfply to perform each of the following.

1.  Download the raw **time\_series\_covid19\_confirmed\_global.csv**
    dataset.

2.  Inspect the data and discuss the need to reshape. 

In [1]:
# Code for loading and inspecting the CSV file
import pandas as pd
import numpy as np
from dfply import *

In [4]:
covidConfirmed = pd.read_csv("./data/time_series_covid19_confirmed_global.csv")
covidConfirmed

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/3/22,9/4/22,9/5/22,9/6/22,9/7/22,9/8/22,9/9/22,9/10/22,9/11/22,9/12/22
0,,Afghanistan,33.939110,67.709953,0,0,0,0,0,0,...,193912,194163,194355,194614,195012,195298,195471,195631,195925,196182
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,330062,330193,330221,330283,330516,330687,330842,330948,331036,331053
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,270426,270443,270461,270476,270489,270507,270522,270532,270539,270551
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,46027,46027,46027,46027,46113,46113,46113,46113,46113,46113
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,102636,102636,102636,102636,102636,102636,103131,103131,103131,103131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280,,West Bank and Gaza,31.952200,35.233200,0,0,0,0,0,0,...,702332,702332,702332,702591,702591,702591,702591,702591,702591,702591
281,,Winter Olympics 2022,39.904200,116.407400,0,0,0,0,0,0,...,535,535,535,535,535,535,535,535,535,535
282,,Yemen,15.552727,48.516388,0,0,0,0,0,0,...,11926,11931,11931,11931,11931,11932,11932,11932,11932,11932
283,,Zambia,-13.133897,27.849332,0,0,0,0,0,0,...,332822,333074,333086,333124,333150,333180,333204,333220,333229,333234


> The column names contain values (dates), so we need to stack them with `gather`. The "Province/State" column shows no values in the preview above, so we drop it.

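To make the need for stacking concrete, here is a minimal pandas-only sketch on a made-up two-country frame. It uses `melt` as a stand-in for dfply's `gather`; the frame and its values are illustrative assumptions, not rows from the real file.

```python
import pandas as pd

# Toy frame in the same wide layout as the raw file (values are made up).
wide = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# Drop the empty column, then stack the date columns into
# (Date, Measurement) pairs, one row per country and date.
tidy = (
    wide.drop(columns="Province/State")
        .melt(id_vars="Country/Region", var_name="Date", value_name="Measurement")
)
print(tidy)
```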
3.  Write a single pipe that reshapes the data, sets the dtype of the date column, and extracts various date parts.
    1. To change the `dtype` of the date column, use `date = X.date.astype('datetime64[ns]')`.
    2. To extract the year and month, use the `X.date.dt.year` and `X.date.dt.month` attributes. This needs to happen in a separate `mutate`.

In [82]:

covid_cleaned = (covidConfirmed 
>> gather("Date", "Measurement", columns_from("1/22/20"), add_id=True)
>> drop("Province/State")
>> mutate(Date = X.Date.astype('datetime64[ns]'))
>> mutate(Year = X.Date.dt.year, Month = X.Date.dt.month)
)

covid_cleaned.head()

Unnamed: 0,Country/Region,Lat,Long,_ID,Date,Measurement,Year,Month
0,Afghanistan,33.93911,67.709953,0,2020-01-22,0,2020,1
1,Albania,41.1533,20.1683,1,2020-01-22,0,2020,1
2,Algeria,28.0339,1.6596,2,2020-01-22,0,2020,1
3,Andorra,42.5063,1.5218,3,2020-01-22,0,2020,1
4,Angola,-11.2027,17.8739,4,2020-01-22,0,2020,1

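An alternative to `astype` is `pd.to_datetime` with an explicit format, which makes the month/day/two-digit-year order of labels such as `1/22/20` unambiguous. The sketch below is pandas-only and uses a small made-up frame; it is one possible approach, not the activity's required solution.

```python
import pandas as pd

# Date labels in the same style as the raw column names (made-up sample).
dates = pd.DataFrame({"Date": ["1/22/20", "2/1/20", "12/31/21"]})

# Parse with an explicit month/day/two-digit-year format.
dates["Date"] = pd.to_datetime(dates["Date"], format="%m/%d/%y")

# Extract date parts through the .dt accessor.
dates["Year"] = dates["Date"].dt.year
dates["Month"] = dates["Date"].dt.month
print(dates)
```

Stating the format explicitly also avoids relying on pandas to guess whether the first number is a month or a day.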

### World Bank World Development Indicators

<https://databank.worldbank.org/source/world-development-indicators>

#### Constructing a data set.

First, construct a data set as follows:

1.  Expand the Country tab and select all.

<img src="./img/media/image1.png" width="300">

2.  Click on the Series tab, search for *Health* and select the
    following indicators. **Feel free to add additional indicators!**

<img src="img/media/image2.png" width="300">

3.  Click on the Time tab and select 2020 and 2021.

4.  Click apply changes in the floating dialog.

<img src="img/media/image3.png" width="300">

5.  Select CSV from the Download Options button and save the file to the data folder.

<img src="img/media/image4.png" width="100">

#### Tasks

Use pandas and dfply to perform each of the following.

1.  Inspect the World Bank data and discuss the need to reshape. 

**Hints:** 

* You should apply `fix_names` from `more_dfply` to clean up the column names.
* This table needs to be reshaped twice.




In [19]:
from more_dfply import *

In [58]:
# Code for loading and inspecting the CSV
worldBank = pd.read_csv("./data/WorldBankData.csv")

worldBank


Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2018 [YR2018],2019 [YR2019]
0,Afghanistan,AFG,Current health expenditure (% of GDP),SH.XPD.CHEX.GD.ZS,14.12674332,13.24220181
1,Afghanistan,AFG,Current health expenditure per capita (current...,SH.XPD.CHEX.PC.CD,69.99860382,65.80603027
2,Afghanistan,AFG,Domestic general government health expenditure...,SH.XPD.GHED.GD.ZS,0.54922014,1.08443093
3,Afghanistan,AFG,Domestic general government health expenditure...,SH.XPD.GHED.PC.CD,2.72140895,5.38899046
4,Afghanistan,AFG,Domestic private health expenditure (% of curr...,SH.XPD.PVTD.CH.ZS,76.319664,79.39915466
...,...,...,...,...,...,...
1596,,,,,,
1597,,,,,,
1598,,,,,,
1599,Data from database: World Development Indicators,,,,,


> We will stack the year columns with `gather` and then unstack the series labels into columns with `spread`.

2.  Write a single pipe that reshapes the data and cleans up the year column.
    1. You can use the `replace` method to clean up the year column. See lecture 3.1 for details.

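As a small illustration of the `replace` hint, the sketch below strips the prefix from year labels of the form `_2018_YR2018` and, as an alternative, pulls the four-digit year out of the raw `2018 [YR2018]` labels with `str.extract`. The sample labels are copied from the column names shown above; the variable names are hypothetical.

```python
import pandas as pd

# Labels as they appear after the column names have been cleaned.
fixed_labels = pd.Series(["_2018_YR2018", "_2019_YR2019"])

# Remove the leading "_2018_YR" style prefix, leaving just the year.
years = fixed_labels.replace(r"_\d*_YR", "", regex=True).astype(int)

# Alternative: take the first four-digit run from the raw labels.
raw_labels = pd.Series(["2018 [YR2018]", "2019 [YR2019]"])
years_alt = raw_labels.str.extract(r"(\d{4})", expand=False).astype(int)
print(years.tolist(), years_alt.tolist())
```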
In [80]:
cleaned_worldBank = (worldBank
>> fix_names()
>> filter_by(X.Series_Code.notna())
>> gather("year", "measurement", columns_from("_2018_YR2018"), add_id=True)
>> mutate(year = X.year.replace(r"_\d*_YR", "", regex=True).astype("int"))
>> spread(X.Series_Name, X.measurement)
>> drop(X._ID)
)

cleaned_worldBank.head()

Unnamed: 0,Country_Name,Country_Code,Series_Code,year,Current health expenditure (% of GDP),Current health expenditure per capita (current US$),Domestic general government health expenditure (% of GDP),Domestic general government health expenditure per capita (current US$),Domestic private health expenditure (% of current health expenditure),Domestic private health expenditure per capita (current US$)
0,Afghanistan,AFG,SH.XPD.CHEX.GD.ZS,2018,14.12674332,,,,,
1,Afghanistan,AFG,SH.XPD.CHEX.GD.ZS,2019,13.24220181,,,,,
2,Afghanistan,AFG,SH.XPD.CHEX.PC.CD,2018,,69.99860382,,,,
3,Afghanistan,AFG,SH.XPD.CHEX.PC.CD,2019,,65.80603027,,,,
4,Afghanistan,AFG,SH.XPD.GHED.GD.ZS,2018,,,0.54922014,,,

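In the output above, each indicator value sits on its own row and the other indicator columns are missing, which suggests that identifier columns such as `Series_Code` still distinguish the rows when the series names are spread. The following pandas-only sketch, on made-up data with hypothetical series names `s1` and `s2`, shows one way to get a single row per country and year: drop the series code before pivoting. Here `melt` and `pivot` stand in for `gather` and `spread`.

```python
import pandas as pd

# Toy World Bank-style frame (all names and values are made up).
wb = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Series Name": ["s1", "s2", "s1", "s2"],
    "Series Code": ["c1", "c2", "c1", "c2"],
    "2018": [1.0, 2.0, 3.0, 4.0],
    "2019": [5.0, 6.0, 7.0, 8.0],
})

# First reshape: stack the year columns.
stacked = wb.drop(columns="Series Code").melt(
    id_vars=["Country Name", "Series Name"],
    var_name="year",
    value_name="measurement",
)

# Second reshape: one column per series, one row per country and year.
wb_tidy = stacked.pivot(
    index=["Country Name", "year"],
    columns="Series Name",
    values="measurement",
).reset_index()
wb_tidy.columns.name = None
print(wb_tidy)
```

Because the series code carries the same information as the series name, dropping it loses nothing in this toy example; whether that holds for the real extract is worth checking first.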

### Investigate joining on country

Before we can proceed, we need to make sure that the columns used to join the data--namely the country--actually match. Do this as follows:

1. For each table, select just the country column and make sure the column names match.
2. Perform a full outer join and filter for rows that didn't match (i.e., rows with a missing value in either column).
3. Determine any transformations needed to make the entries match.
4. Transform each of the original tables as needed (column names and/or problematic entries).

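The steps above can also be sketched in plain pandas, where `merge` with `indicator=True` labels each row by which table it came from. The country names below are a few examples taken from the outputs in this notebook; the frames themselves are toy stand-ins for the real tables.

```python
import pandas as pd

# A few country names from each source (toy subsets).
covid = pd.DataFrame({"country": ["Afghanistan", "Yemen", "Winter Olympics 2022"]})
bank = pd.DataFrame({"country": ["Afghanistan", "Yemen, Rep.", "World"]})

# Full outer join; the _merge column records left_only / right_only / both.
joined = covid.merge(bank, on="country", how="outer", indicator=True)

# Keep only the entries that failed to match.
unmatched = joined[joined["_merge"] != "both"].sort_values("country")
print(unmatched)
```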
In [81]:
# Your code here
worldBank_countries = (cleaned_worldBank
>> select('Country_Name')
>> rename(country = 'Country_Name')
>> mutate(file = 'world_bank')
>> distinct
)

worldBank_countries

Unnamed: 0,country,file
0,Afghanistan,world_bank
12,Africa Eastern and Southern,world_bank
24,Africa Western and Central,world_bank
36,Albania,world_bank
48,Algeria,world_bank
...,...,...
3132,West Bank and Gaza,world_bank
3144,World,world_bank
3156,"Yemen, Rep.",world_bank
3168,Zambia,world_bank


In [85]:
covid_countries = (covid_cleaned
>> select(X['Country/Region'])
>> rename(country = 'Country/Region')
>> mutate(file = 'covid')
>> distinct
)

covid_countries

Unnamed: 0,country,file
0,Afghanistan,covid
1,Albania,covid
2,Algeria,covid
3,Andorra,covid
4,Angola,covid
...,...,...
280,West Bank and Gaza,covid
281,Winter Olympics 2022,covid
282,Yemen,covid
283,Zambia,covid


In [89]:
joined = covid_countries >> outer_join(worldBank_countries, by="country")
joined.head()

Unnamed: 0,country,file_x,file_y
0,Afghanistan,covid,world_bank
1,Albania,covid,world_bank
2,Algeria,covid,world_bank
3,Andorra,covid,world_bank
4,Angola,covid,world_bank


In [91]:
set_diff = (joined 
>> filter_by(X.country.notna())
>> filter_by(X.file_x.isna() | X.file_y.isna())
>> distinct
>> arrange(X.country)
)

set_diff

Unnamed: 0,country,file_x,file_y
199,Africa Eastern and Southern,,world_bank
200,Africa Western and Central,,world_bank
201,American Samoa,,world_bank
5,Antarctica,covid,
202,Arab World,,world_bank
...,...,...,...
294,Virgin Islands (U.S.),,world_bank
195,Winter Olympics 2022,covid,
295,World,,world_bank
196,Yemen,covid,


In [92]:
set_diff.to_csv('./data/set_diff.csv')

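Once the mismatches have been reviewed, one possible way to reconcile them is a lookup table of replacements. The sketch below assumes a hypothetical `name_fixes` mapping with a single entry based on the `Yemen` / `Yemen, Rep.` pair seen above; a real mapping would need to be built by hand from the saved `set_diff.csv`.

```python
import pandas as pd

# Hypothetical mapping from World Bank names to COVID data names;
# extend it after reviewing the unmatched entries.
name_fixes = {"Yemen, Rep.": "Yemen"}

bank = pd.DataFrame({"country": ["Afghanistan", "Yemen, Rep.", "World"]})
bank["country"] = bank["country"].replace(name_fixes)

# Aggregates such as "World" have no counterpart, so keep only
# names that also occur in the COVID data (toy set here).
covid_names = {"Afghanistan", "Yemen"}
bank_matched = bank[bank["country"].isin(covid_names)]
print(bank_matched)
```

Entries that exist in only one source, such as regional aggregates or `Winter Olympics 2022`, cannot be fixed by renaming and would simply drop out of an inner join.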
## Join and visualize 

Finally, use pandas and dfply to join these two data sets, then create some interesting visualizations using seaborn.

In [6]:
# Your code here

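As a starting point, here is a hedged pandas-only sketch of the join on made-up data. It assumes the COVID counts are cumulative, so the yearly maximum approximates the year-end total, and that the World Bank extract covers the same years as the COVID data. The column names `confirmed` and `health_exp_pct_gdp` are hypothetical.

```python
import pandas as pd

# Toy long-format COVID data: cumulative counts per country and year.
covid = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "Year": [2020, 2020, 2020, 2020],
    "Measurement": [10, 25, 5, 8],
})

# Toy World Bank data: one row per country and year.
bank = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2020, 2020],
    "health_exp_pct_gdp": [9.5, 4.2],
})

# Collapse the COVID data to one value per country and year.
yearly = (
    covid.groupby(["country", "Year"], as_index=False)["Measurement"].max()
         .rename(columns={"Year": "year", "Measurement": "confirmed"})
)

# Join on both keys; unmatched countries or years are dropped.
combined = yearly.merge(bank, on=["country", "year"], how="inner")
print(combined)
```

The resulting frame could then be passed to a seaborn plotting function such as `sns.scatterplot`.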
### Deliverables
To complete this part of the activity, you need to submit the following.

1.  A link to this notebook, including all discussion and code requested
    above.

2.  A CSV file containing your final dataset. **Hint.** You can use the
    `to_csv` method on the final data frame.

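A minimal sketch of the export step follows, using a made-up frame and a temporary directory so that it runs anywhere; the file name `final_dataset.csv` is an assumption. Passing `index=False` keeps the row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the final joined data set (values are made up).
final = pd.DataFrame({"country": ["A", "B"], "confirmed": [25, 8]})

# Write without the row index, then read back to confirm the layout.
out_path = os.path.join(tempfile.mkdtemp(), "final_dataset.csv")
final.to_csv(out_path, index=False)
round_trip = pd.read_csv(out_path)
print(round_trip)
```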
In [7]:
# Code for writing the data here