# Preparing Country-specific Data

'''
input files
------------
../data_proc/COVID_TimeSeries_by_country.csv
../data_raw/Country-info.csv

output files
-------------


'''

In [1]:
import pandas as pd

## Input: time series of confirmed cases by country

In [3]:
df_timeseries_COVID_confirmed = pd.read_csv('../data_proc/COVID_TimeSeries_by_country.csv', sep=';', index_col='date')
# sort_index returns a new DataFrame, so the result must be assigned; ISO dates sort correctly as strings
df_timeseries_COVID_confirmed = df_timeseries_COVID_confirmed.sort_index(ascending=True)
countries = df_timeseries_COVID_confirmed.columns.unique()  # unique() should be redundant, just in case

## Output item 1: the population of each country 

This can serve as an estimate of the initial number of susceptibles in the **SIR model**,
under the assumption that, regardless of the country, everybody (or a fixed fraction of the population) is initially susceptible to infection by the virus.
  
The data was retrieved from the [World Bank DataBank](https://databank.worldbank.org/reports.aspx?source=2&series=SP.POP.TOTL&country=#).

The country names had to be adjusted manually, e.g. by mapping "Myanmar" to "Burma", so that they match the names used in the Johns Hopkins University (JHU) CSV (see [this spreadsheet](/home/la/Dropbox/aca-active/Kienle-DS/repo/data_raw/Country-info.ods)).
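
The manual renaming could in principle be scripted. The following is only a minimal sketch, not the procedure actually used here: the names `world_bank`, `WB_TO_JHU`, `NOT_IN_JHU` and `to_jhu_name` are hypothetical, and the mapping lists just the pairs that are visible in this notebook.

```python
import pandas as pd

# Hypothetical excerpt of the World Bank table (names and codes as shown in this notebook).
world_bank = pd.DataFrame({
    "Country Name": ["Afghanistan", "Yemen, Rep.", "American Samoa"],
    "Country Code": ["AFG", "YEM", "ASM"],
})

# Hypothetical mapping from World Bank names to JHU names; only differing names are listed.
WB_TO_JHU = {"Yemen, Rep.": "Yemen", "Myanmar": "Burma"}

# World Bank entries without a JHU counterpart (illustrative, not exhaustive).
NOT_IN_JHU = {"American Samoa"}


def to_jhu_name(wb_name):
    """Return the JHU name for a World Bank name, or None if there is no counterpart."""
    if wb_name in NOT_IN_JHU:
        return None
    # Names that are identical in both sources fall through unchanged.
    return WB_TO_JHU.get(wb_name, wb_name)


world_bank["name_JHU"] = world_bank["Country Name"].map(to_jhu_name)
print(world_bank)
```

Keeping the mapping in code rather than in a hand-edited spreadsheet would make the renaming reproducible, at the cost of maintaining the dictionary by hand.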

In [4]:
df_country_pop_raw = pd.read_csv('../data_raw/Country-info.csv',sep=';')
df_country_pop_raw

| | Country Name | name_JHU | Country Code | Population in 2019 (see link) |
|---|---|---|---|---|
| 0 | Afghanistan | Afghanistan | AFG | 38041754 |
| 1 | Albania | Albania | ALB | 2854191 |
| 2 | Algeria | Algeria | DZA | 43053054 |
| 3 | American Samoa | NaN | ASM | 55312 |
| 4 | Andorra | Andorra | AND | 77142 |
| ... | ... | ... | ... | ... |
| 212 | Virgin Islands (U.S.) | NaN | VIR | 106631 |
| 213 | West Bank and Gaza | West Bank and Gaza | PSE | 4685306 |
| 214 | Yemen, Rep. | Yemen | YEM | 29161922 |
| 215 | Zambia | Zambia | ZMB | 17861030 |


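Even after the manual adjustment, there is *no* one-to-one correspondence between the country names in the two data sets. Here, I first drop the redundant `Country Name` column, rename the remaining columns, and use the JHU name as the index of `df_country_pop_raw`. Note that rows without a JHU name (NaN) are still present after this step.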

In [5]:
df_country_pop_raw.drop(['Country Name'],axis='columns', inplace=True)
df_country_pop_raw.rename(columns={'name_JHU':'country','Population in 2019 (see link)':'population'},inplace=True)
df_country_pop_raw.set_index('country',inplace=True)
df_country_pop_raw

| country | Country Code | population |
|---|---|---|
| Afghanistan | AFG | 38041754 |
| Albania | ALB | 2854191 |
| Algeria | DZA | 43053054 |
| NaN | ASM | 55312 |
| Andorra | AND | 77142 |
| ... | ... | ... |
| NaN | VIR | 106631 |
| West Bank and Gaza | PSE | 4685306 |
| Yemen | YEM | 29161922 |
| Zambia | ZMB | 17861030 |
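
The indexed table still contains rows whose `country` label is missing. A minimal sketch of removing them before indexing is given below; it uses a small stand-in DataFrame (`pop`, with values copied from the output above) rather than the real file, and `pop_clean` is a hypothetical name.

```python
import pandas as pd

# Small stand-in for the renamed table before indexing (values copied from the output above).
pop = pd.DataFrame({
    "country": ["Afghanistan", None, "Andorra"],
    "Country Code": ["AFG", "ASM", "AND"],
    "population": [38041754, 55312, 77142],
})

# Keep only the rows that have a JHU name, then index by that name.
pop_clean = pop.dropna(subset=["country"]).set_index("country")

# With a unique index, .loc[country, "population"] returns a single value.
assert pop_clean.index.is_unique
print(pop_clean)
```

Dropping the unnamed rows first avoids several rows sharing the same missing index label.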


However, not all countries in the JHU CSV (the global time series of confirmed cases) appear in the population dataset:

In [6]:
for country in countries:
    try:
        print("{} \t has a population of {}".format(country,df_country_pop_raw.loc[country,'population']))
    except KeyError:
        print("\n\n !!!! the population data for Country {} is unavailable.".format(country))


Afghanistan 	 has a population of 38041754
Albania 	 has a population of 2854191
Algeria 	 has a population of 43053054
Andorra 	 has a population of 77142
Angola 	 has a population of 31825295
Antigua and Barbuda 	 has a population of 97118
Argentina 	 has a population of 44938712
Armenia 	 has a population of 2957731
Australia 	 has a population of 25364307
Austria 	 has a population of 8877067
Azerbaijan 	 has a population of 10023318
Bahamas 	 has a population of 389482
Bahrain 	 has a population of 1641172
Bangladesh 	 has a population of 163046161
Barbados 	 has a population of 287025
Belarus 	 has a population of 9466856
Belgium 	 has a population of 11484055
Belize 	 has a population of 390353
Benin 	 has a population of 11801151
Bhutan 	 has a population of 763092
Bolivia 	 has a population of 11513100
Bosnia and Herzegovina 	 has a population of 3301000
Botswana 	 has a population of 2303697
Brazil 	 has a population of 211049527
Brunei 	 has a population of 433285
Bulgaria 	
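
As an alternative to printing one line per country, the unmatched names can be collected in a single step with index set operations. This is only a sketch under assumptions: `countries` and `pop` are small hypothetical stand-ins for the objects used above, and "Atlantis" is a made-up name used to trigger a mismatch.

```python
import pandas as pd

# Hypothetical stand-ins: JHU column names and a population table indexed by country.
countries = pd.Index(["Afghanistan", "Albania", "Atlantis"])  # "Atlantis" is made up
pop = pd.DataFrame(
    {"population": [38041754, 2854191]},
    index=pd.Index(["Afghanistan", "Albania"], name="country"),
)

# Countries present in the JHU data but absent from the population table.
missing = countries.difference(pop.index)

# Population rows restricted to countries that appear in both sources.
matched = pop.loc[countries.intersection(pop.index)]

print("missing population data for:", list(missing))
print(matched)
```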

## Output item 2: the date of the first confirmed case (in YYYY-mm-dd)
This is relevant for the **SIR model**, especially if we want to fit all the data using *time-invariant* parameters alpha and beta: the zeros **before** the day of the first reported case should then be filtered *out*.
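
This output item is not implemented in the notebook. One possible way to compute it is sketched below, assuming cumulative counts with one column per country and ISO date strings as the index; the frame `confirmed` and its columns `A`, `B`, `C` are hypothetical.

```python
import pandas as pd

# Hypothetical cumulative counts, indexed by ISO date strings.
confirmed = pd.DataFrame(
    {"A": [0, 0, 3, 5], "B": [1, 1, 2, 4], "C": [0, 0, 0, 0]},
    index=pd.Index(
        ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"], name="date"
    ),
)

has_cases = confirmed > 0

# idxmax returns the first index label at which the mask is True ...
first_date = has_cases.idxmax()
# ... but also the first label for a column that is all False, so blank those out.
first_date = first_date.where(has_cases.any())

# Time series of one country with the leading zeros removed.
series_A = confirmed.loc[first_date["A"]:, "A"]

print(first_date)
print(series_A)
```

Countries that never report a case (column `C` here) end up with a missing first date instead of a misleading one.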


## More country-specific (static) information?
Examples of such information that might also be of interest for the data analysis (not yet implemented):
* average temperature
* population density
* average medical facility 
* GDP
* location
* ...