# Intro

The OECD database is composed of hundreds (if not thousands) of tables, each focused on specific economic and social variables. These variables are attributable to individual countries and are generally recorded over time, making them time series.

For the purposes of this analysis, we are only interested in the latest year's data point, which would reflect the most recent characterization of each country's social and economic situation. 
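As a minimal sketch of selecting the latest data point per country, the snippet below keeps the most recent row for each country in a long-format table. The column names (`ISO`, `Year`, `Value`), the helper `latest_per_country`, and the sample values are illustrative assumptions, not taken from the OECD files.

```python
import pandas as pd

# Hypothetical long-format table: one row per country per year.
# Country codes and values are made up for illustration only.
sample = pd.DataFrame({
    "ISO": ["AAA", "AAA", "BBB", "BBB"],
    "Year": [2017, 2018, 2016, 2017],
    "Value": [10.0, 12.0, 30.0, 31.0],
})

def latest_per_country(df, key="ISO", year_col="Year"):
    """Keep only the most recent row for each country."""
    # idxmax returns the row label of the largest year within each group
    idx = df.groupby(key)[year_col].idxmax()
    return df.loc[idx].reset_index(drop=True)

latest = latest_per_country(sample)
```

Selecting the latest year per country, rather than filtering on one fixed year, would keep countries whose most recent observation is older than the rest.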

However, given that we have years of data for each country, there is also the potential to use the change in the variable as a feature of our modeling. For example, rapid improvement in immunization coverage for a country would imply effective health education and could represent a more accurate picture of how the country is better educating the public about a new pandemic versus a country with a higher, but stagnant, immunization record. 
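One possible way to turn the change over time into a feature is to subtract each country's earliest value from its latest one. The sketch below assumes a long-format table with hypothetical columns `ISO`, `Year`, and `Value`; the helper name `change_feature` and the sample numbers are illustrative only.

```python
import pandas as pd

# Hypothetical data: country AAA improves quickly, country BBB is high but flat.
sample = pd.DataFrame({
    "ISO": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Year": [2016, 2017, 2018, 2016, 2017, 2018],
    "Value": [60.0, 70.0, 80.0, 95.0, 95.0, 95.0],
})

def change_feature(df, key="ISO", year_col="Year", value_col="Value"):
    """Return the latest value and the first-to-last change for each country."""
    # Sort so that 'first' and 'last' correspond to the earliest and latest years
    ordered = df.sort_values([key, year_col])
    summary = ordered.groupby(key)[value_col].agg(["first", "last"])
    summary["Change"] = summary["last"] - summary["first"]
    summary = summary.rename(columns={"last": "Latest"}).drop(columns="first")
    return summary.reset_index()

features = change_feature(sample)
```

In this toy example, `AAA` has the lower latest value but the larger change, which is the kind of distinction the paragraph above describes.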

For now, we are combining the various OECD tables on the 'country' key so that we can merge it with COVID data.

Data transformation to follow.

# Loading and Cleaning Data

Here we load and clean CSV data taken from the OECD database. This includes datasets on the following topics:
- Education
- Government Debt
- Immunization
- Demographics / Population
- Tourism
- Wealth Distribution

Here we load and clean JSON data taken from the COVID-19 database. This includes datasets on the following topics:
- Country-level statistics (cases, deaths, and recoveries by date)
- Global statistics

For each dataset, there are multiple columns we are not interested in. Furthermore, the columns we do want often come in several versions of the same variable, so we need to filter for the one we are looking for. The tables are not standardized, so each CSV file has to be cleaned and prepared individually.

### Load Packages

In [1]:
import pandas as pd
from urllib.request import urlopen
import ssl
import json
import requests

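# Work around SSL certificate errors when fetching remote files.
# Note: this disables certificate verification for HTTPS requests made via urllib.
ssl._create_default_https_context = ssl._create_unverified_context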

## OECD
### Education

We hypothesize that a country's education level plays a key role in the success of its approach to COVID. Specifically, a more highly educated country should, in theory, deal with the virus more effectively, with a populace that better understands public health terminology and what viruses are, and that is more willing to take the pandemic seriously.

So we will focus on the share of each country's population that has a tertiary education. This is defined on Wikipedia as:

*Tertiary education, also referred to as third-level, third-stage or post-secondary education, is the educational level following the completion of secondary education. The World Bank, for example, defines tertiary education as including universities as well as trade schools and colleges.*

In [2]:
url_edu = 'https://raw.githubusercontent.com/pvai-umich/SIADS591/master/Data/OECD_Education_Statistics.csv'
dfraw_edu = pd.read_csv(url_edu)


In [3]:
columns_to_use = ['COUNTRY', 'Country', 'Gender', 'ISCED 2011 A education level', 'Reference Period', 'Measure', 'Value']

# Only select the columns we want to use
df_edu = dfraw_edu[columns_to_use]

# filter some of the columns to include the variables we want to see
df_edu = df_edu[df_edu['ISCED 2011 A education level'] == "Tertiary education"]
df_edu = df_edu[df_edu['Gender'] == "Total"]
df_edu = df_edu[df_edu['Measure'] == "Value"]
df_edu = df_edu[df_edu['Reference Period'] == 2018]

# This final dataframe contains the share of each country's population that has a tertiary education. 
df_edu.head()


Unnamed: 0,COUNTRY,Country,Gender,ISCED 2011 A education level,Reference Period,Measure,Value
447,KOR,Korea,Total,Tertiary education,2018.0,Value,49.008511
701,CAN,Canada,Total,Tertiary education,2018.0,Value,57.888363
709,JPN,Japan,Total,Tertiary education,2018.0,Value,51.928062
774,CZE,Czech Republic,Total,Tertiary education,2018.0,Value,24.262077
909,FRA,France,Total,Tertiary education,2018.0,Value,36.897491


In [4]:
# But let's reduce this table down to what we'll be combining later:
# a simple country-variable table.

df_edu = df_edu[['COUNTRY', 'Country', 'Value']]
df_edu.columns = ['ISO', 'Country', 'Tertiary_Education_Pct']
df_edu.head()

Unnamed: 0,ISO,Country,Tertiary_Education_Pct
447,KOR,Korea,49.008511
701,CAN,Canada,57.888363
709,JPN,Japan,51.928062
774,CZE,Czech Republic,24.262077
909,FRA,France,36.897491


### Debt

In [5]:
url_debt = 'https://raw.githubusercontent.com/pvai-umich/SIADS591/master/Data/OECD_Government_Debt.csv'
df_debt = pd.read_csv(url_debt)


### Immunization

In [6]:
url_imm = 'https://raw.githubusercontent.com/pvai-umich/SIADS591/master/Data/OECD_Immunization_Statistics.csv'
df_imm = pd.read_csv(url_imm)


### Population / Demographics

In [7]:
url_pop = 'https://raw.githubusercontent.com/pvai-umich/SIADS591/master/Data/OECD_Population_Statistics.csv'
df_pop = pd.read_csv(url_pop)


### Tourism

In [8]:
url_tour = 'https://raw.githubusercontent.com/pvai-umich/SIADS591/master/Data/OECD_Tourism_Statistics.csv'
df_tour = pd.read_csv(url_tour)


### Wealth Distribution

In [9]:
url_wealth = 'https://raw.githubusercontent.com/pvai-umich/SIADS591/master/Data/OECD_Wealth_Distribution_Statistics.csv'
df_wealth = pd.read_csv(url_wealth)

## COVID-19

All attempts to load the JSON from the web are throwing an SSL error, so we load it from local files instead. The two methods we tried are listed below.

Method 1 (the request fails unless we provide a User-Agent header):
api_response = requests.get('https://thevirustracker.com/timeline/map-data.json', headers={"User-Agent": "Chrome"})
covid_stats = api_response.json()

Method 2:
response = urlopen('https://thevirustracker.com/timeline/map-data.json')
json_data = response.read().decode('utf-8', 'replace')
df = json.loads(json_data)

### Country Covid-19 Stats

In [52]:
# Importing COVID-19 country-level data
country_stats = pd.read_json(r'covid_full.json')

In [47]:
lst_1=[]
for i in country_stats['data']:
    for key,value in i.items():
        pair=[key,value]
        lst_1.append(pair)

In [49]:
pd.DataFrame((lst_1), columns =['Stat', 'Value']) 

Unnamed: 0,Stat,Value
0,countrycode,AD
1,date,6/06/20
2,cases,852
3,deaths,51
4,recovered,1
5,countrycode,AD
6,date,6/05/20
7,cases,852
8,deaths,51
9,recovered,1
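The long `Stat`/`Value` listing above repeats the same five keys for every record. As a hedged alternative, and assuming each element of `country_stats['data']` is a flat dictionary with those keys, the records can be passed straight to `pd.DataFrame` to get one row per country and date. The two records below are copied from the output above; the `%m/%d/%y` date format is an assumption based on how the dates appear.

```python
import pandas as pd

# Two records shaped like the output above (values copied from that output).
records = [
    {"countrycode": "AD", "date": "6/06/20", "cases": 852, "deaths": 51, "recovered": 1},
    {"countrycode": "AD", "date": "6/05/20", "cases": 852, "deaths": 51, "recovered": 1},
]

# A list of flat dicts becomes one row per record, one column per key.
df_country = pd.DataFrame(records)

# Assumption: dates are month/day/two-digit-year.
df_country["date"] = pd.to_datetime(df_country["date"], format="%m/%d/%y")

# Keep the most recent record for each country code.
latest = (
    df_country.sort_values("date")
    .groupby("countrycode")
    .tail(1)
    .reset_index(drop=True)
)
```

With the real file, `records` would be replaced by `list(country_stats['data'])`, provided the structure matches the assumption above.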


### Global Covid-19 Stats

In [None]:
#Importing COVID19 Global Data
global_stats = pd.read_json(r'global_stats.json')

In [None]:
lst=[]
for i in global_stats['results']:
    for key,value in i.items():
        pair=[key,value]
        lst.append(pair)

In [None]:
pd.DataFrame((lst), columns =['Stat', 'Value']) 

## Combining the OECD Data

For now, we are combining the various OECD tables on the 'country' key so that we can merge it with COVID data.

Data transformation to follow.
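Until that transformation is written, here is a minimal sketch of how the combination might look, assuming each cleaned OECD table is reduced to an `ISO` column plus one variable, as was done for education. The helper `merge_on_iso`, the second table `df_other`, and its `Placeholder_Metric` values are hypothetical stand-ins; only the education values come from the output above.

```python
from functools import reduce

import pandas as pd

# Education table in its cleaned form (values taken from the output above).
df_edu = pd.DataFrame({
    "ISO": ["KOR", "CAN"],
    "Country": ["Korea", "Canada"],
    "Tertiary_Education_Pct": [49.008511, 57.888363],
})

# Hypothetical second table with placeholder values, standing in for
# another cleaned OECD dataset.
df_other = pd.DataFrame({
    "ISO": ["KOR", "JPN"],
    "Placeholder_Metric": [1.0, 2.0],
})

def merge_on_iso(frames, key="ISO"):
    """Outer-merge a list of country tables on a shared key column."""
    return reduce(
        lambda left, right: left.merge(right, on=key, how="outer"),
        frames,
    )

df_oecd = merge_on_iso([df_edu, df_other])
```

An outer join keeps countries that are missing from some tables, leaving `NaN` for the absent variables; an inner join would be the stricter choice. Note that the COVID output above uses two-letter country codes (for example `AD`) while the OECD tables use three-letter codes (for example `KOR`), so a code mapping would likely be needed before merging the two sources.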