# Exploratory Data Analysis of COVID19 in South Korea

After finding a cool dataset about COVID19 in South Korea on GitHub, https://github.com/jihoo-kim/Coronavirus-Dataset (all data source credits go to Jihoo Kim), I decided to open up my notebook and do what I know best to make sense of the uncertainty and noise in the world right now: data science! Seems like that year of learning Korean for my study abroad last summer is going to pay off! 시작합니다 (let's begin) ~!

In [1]:
#SET UP

import itertools

#Graphing
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10,10)

#Tools
import pandas as pd
import numpy as np
import seaborn as sns

# Suppress FutureWarnings so they don't clutter the output
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Useful for probability calculations
from scipy import stats
from scipy import special

#Regression Packages
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from IPython.display import HTML

In [49]:
patient_url = 'https://raw.githubusercontent.com/WinsonTruong/Coronavirus-Dataset/master/patient.csv'
patient = pd.read_csv(patient_url)


Let's start off by inspecting, cleaning, and verifying 'patient.csv'.

In [50]:
patient

Unnamed: 0,id,sex,birth_year,country,region,disease,group,infection_reason,infection_order,infected_by,contact_number,confirmed_date,released_date,deceased_date,state
0,1,female,1984.0,China,filtered at airport,,,visit to Wuhan,1.0,,45.0,2020-01-20,2020-02-06,,released
1,2,male,1964.0,Korea,filtered at airport,,,visit to Wuhan,1.0,,75.0,2020-01-24,2020-02-05,,released
2,3,male,1966.0,Korea,capital area,,,visit to Wuhan,1.0,,16.0,2020-01-26,2020-02-12,,released
3,4,male,1964.0,Korea,capital area,,,visit to Wuhan,1.0,,95.0,2020-01-27,2020-02-09,,released
4,5,male,1987.0,Korea,capital area,,,visit to Wuhan,1.0,,31.0,2020-01-30,2020-03-02,,released
5,6,male,1964.0,Korea,capital area,,,contact with patient,2.0,3.0,17.0,2020-01-30,2020-02-19,,released
6,7,male,1991.0,Korea,capital area,,,visit to Wuhan,1.0,,9.0,2020-01-30,2020-02-15,,released
7,8,female,1957.0,Korea,Jeollabuk-do,,,visit to Wuhan,1.0,,113.0,2020-01-31,2020-02-12,,released
8,9,female,1992.0,Korea,capital area,,,contact with patient,2.0,5.0,2.0,2020-01-31,2020-02-24,,released
9,10,female,1966.0,Korea,capital area,,,contact with patient,3.0,6.0,43.0,2020-01-31,2020-02-19,,released


In [51]:
patient.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7513 entries, 0 to 7512
Data columns (total 15 columns):
id                  7513 non-null int64
sex                 662 non-null object
birth_year          649 non-null float64
country             7513 non-null object
region              421 non-null object
disease             28 non-null float64
group               82 non-null object
infection_reason    144 non-null object
infection_order     35 non-null float64
infected_by         70 non-null float64
contact_number      50 non-null float64
confirmed_date      7513 non-null object
released_date       55 non-null object
deceased_date       36 non-null object
state               7513 non-null object
dtypes: float64(5), int64(1), object(9)
memory usage: 880.5+ KB


Reading this summary, a significant amount of data is missing. For example, birth_year is known for only 649 of the 7513 entries and sex for only 662. On the other hand, id, country, confirmed_date, and state are not missing any values.
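To put numbers on the missingness, here is a minimal sketch that tabulates non-null counts and the share of missing values per column. The helper name `missing_summary` and the toy frame are my own stand-ins (illustrative values, not real records); on the real data you would call it on `patient`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column non-null counts and the share of missing values."""
    summary = pd.DataFrame({
        'non_null': df.notna().sum(),
        'missing_share': df.isna().mean(),
    })
    # Most incomplete columns first
    return summary.sort_values('missing_share', ascending=False)

# Toy stand-in for patient.csv (values are illustrative, not real records)
toy = pd.DataFrame({
    'sex': ['female', 'male', None, None],
    'birth_year': [1984.0, np.nan, np.nan, np.nan],
    'state': ['released', 'isolated', 'isolated', 'deceased'],
})

summary = missing_summary(toy)
print(summary)
```

Sorting by `missing_share` makes it easy to see which columns are too sparse to lean on in later analysis.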

## Metadata Descriptions:

* id: a unique identifier per patient
* region: the region (14 unique values) in which the patient was confirmed to have COVID19
* group: an identifier for if the individual belonged to a specific religious group/cult 
* infection_reason: a description of the point of contact with the disease
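* infection_order: a scale of 1 to 6; judging from the rows above, it appears to count the patient's position in a chain of transmission (1 for patients infected at the source, such as a visit to Wuhan), though I have not confirmed this
* infected_by: apparently the id of the patient who passed on the infection (patient 6, for example, lists patient 3), though I have not confirmed this either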
* contact_number: an estimate of the number of people the individual has been in contact with
* state: patient status described as either isolated, released, or deceased
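One way to probe what infection_order and infected_by mean is to follow the infected_by links backwards and compare the length of the chain with infection_order. This is only a sketch under the assumption that infected_by holds the id of the source patient; `infection_chain` is a hypothetical helper, and the rows are copied from the table shown earlier.

```python
import numpy as np
import pandas as pd

# Rows copied from the patient table shown earlier (ids 3, 5, 6, 9, 10)
rows = pd.DataFrame({
    'id': [3, 5, 6, 9, 10],
    'infection_order': [1.0, 1.0, 2.0, 2.0, 3.0],
    'infected_by': [np.nan, np.nan, 3.0, 5.0, 6.0],
})

def infection_chain(df, patient_id):
    """Follow infected_by links back to the earliest known source."""
    source = df.set_index('id')['infected_by']
    chain = [patient_id]
    while chain[-1] in source.index and pd.notna(source[chain[-1]]):
        nxt = int(source[chain[-1]])
        if nxt in chain:  # guard against cyclic records
            break
        chain.append(nxt)
    return chain

chain = infection_chain(rows, 10)
print(chain)
```

For these few rows the chain length matches infection_order, which is consistent with the assumption but does not prove it for the full dataset.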

# Cleaning
Now that we understand what this data is, let's begin cleaning it for analysis:

Just from inspection, some of the things I want to change are:
* id: to be deleted, since pandas' default index already numbers the rows
* birth_year: float ==> int
* infection_order: float ==> int
* confirmed_date: str ==> datetime (via pd.to_datetime)
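The float ==> int conversions run into a snag: NumPy integers cannot hold NaN, so a plain cast fails on columns with missing values. A minimal sketch of one workaround, assuming a pandas version with the nullable `Int64` dtype (0.24 or later); the toy `orders` series stands in for infection_order.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the infection_order column, including a missing value
orders = pd.Series([1.0, 2.0, np.nan, 3.0], name='infection_order')

# A plain NumPy integer cast cannot represent NaN and raises
try:
    orders.astype(np.int64)
    plain_cast_ok = True
except ValueError:
    plain_cast_ok = False

# The nullable integer dtype keeps the gap as <NA>
nullable = orders.astype('Int64')

print(plain_cast_ok)
print(nullable.dtype)
print(nullable.isna().sum())
```

This avoids inventing a sentinel such as 0 for unknown values, which would otherwise leak into later calculations.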

In [52]:
patient = patient.drop(columns = ['id'])

In [53]:
#Converting the floats to pandas' nullable integer dtype,
#which keeps missing birth years as <NA> instead of forcing them to 0
patient['birth_year'] = patient['birth_year'].astype('Int64')

patient.head()

Unnamed: 0,sex,birth_year,country,region,disease,group,infection_reason,infection_order,infected_by,contact_number,confirmed_date,released_date,deceased_date,state
0,female,1984,China,filtered at airport,,,visit to Wuhan,1.0,,45.0,2020-01-20,2020-02-06,,released
1,male,1964,Korea,filtered at airport,,,visit to Wuhan,1.0,,75.0,2020-01-24,2020-02-05,,released
2,male,1966,Korea,capital area,,,visit to Wuhan,1.0,,16.0,2020-01-26,2020-02-12,,released
3,male,1964,Korea,capital area,,,visit to Wuhan,1.0,,95.0,2020-01-27,2020-02-09,,released
4,male,1987,Korea,capital area,,,visit to Wuhan,1.0,,31.0,2020-01-30,2020-03-02,,released


Given the global context of the virus, the ages in this dataset should be readable in a global context. There are two systems to consider: the international age system, which begins counting cardinally (from 0), and East Asian age reckoning, which begins counting ordinally (from 1). Note that we will be using international age for analysis.

* East Asian: (current year - year of birth) + 1, since a newborn counts as 1 and everyone gains a year at the new year (January 1 in everyday Korean usage, the Lunar New Year in the traditional system)
* International: current year - year of birth
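As a quick sketch of the two systems, assuming the everyday Korean convention of adding 1 to the year difference (the function names are my own). Note that the international figure here is an approximation too, since the dataset only has the birth year and not the birthday.

```python
def international_age(birth_year, current_year=2020):
    """Year difference; ignores whether the birthday has passed yet."""
    return current_year - birth_year

def korean_age(birth_year, current_year=2020):
    """Counted from 1 at birth, gaining a year at each new year."""
    return current_year - birth_year + 1

# First patient in the table was born in 1984
print(international_age(1984))
print(korean_age(1984))
```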



In [54]:
patient['age'] = 2020 - patient['birth_year']
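One thing to watch with this subtraction: if missing birth years were filled with 0 at any point, they turn into ages of 2020 and quietly distort any summary statistic. A small sketch on toy values (not real records) comparing the two cases:

```python
import numpy as np
import pandas as pd

# Toy birth years with one unknown value
birth_year = pd.Series([1984.0, 1964.0, np.nan])

# Unknown year stored as the sentinel 0: the "age" becomes 2020
age_sentinel = 2020 - birth_year.fillna(0)

# Unknown year left missing: the age stays missing and is skipped by mean()
age_nan = 2020 - birth_year

print(age_sentinel.mean())
print(age_nan.mean())
```

Leaving the gaps as missing keeps the average at a believable value, while the sentinel version is dominated by the bogus 2020.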

In [57]:
#The dates are stored as 'YYYY-MM-DD' strings; missing or unparseable values become NaT
patient['confirmed_date'] = pd.to_datetime(patient['confirmed_date'],
                                           format='%Y-%m-%d', errors='coerce')
patient['released_date'] = pd.to_datetime(patient['released_date'],
                                           format='%Y-%m-%d', errors='coerce')
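Once the columns really are datetimes, date arithmetic becomes available. As a hedged sketch, here is the number of days between confirmation and release; the first two rows use dates from the table above, the third is a made-up unreleased patient, and `days_to_release` is a column name of my own.

```python
import pandas as pd

dates = pd.DataFrame({
    'confirmed_date': ['2020-01-20', '2020-01-24', '2020-01-30'],
    'released_date': ['2020-02-06', '2020-02-05', None],  # None: not yet released
})

# Parse the 'YYYY-MM-DD' strings; missing values become NaT
for col in ['confirmed_date', 'released_date']:
    dates[col] = pd.to_datetime(dates[col], format='%Y-%m-%d', errors='coerce')

# Timedelta in whole days; NaN where the patient has no release date
dates['days_to_release'] = (dates['released_date'] - dates['confirmed_date']).dt.days
print(dates)
```

If the parse had silently failed and left strings behind, the subtraction would raise instead, which makes this a handy check that the conversion worked.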

In [58]:
patient.head()

Unnamed: 0,sex,birth_year,country,region,disease,group,infection_reason,infection_order,infected_by,contact_number,confirmed_date,released_date,deceased_date,state,age
0,female,1984,China,filtered at airport,,,visit to Wuhan,1.0,,45.0,2020-01-20,2020-02-06,,released,36
1,male,1964,Korea,filtered at airport,,,visit to Wuhan,1.0,,75.0,2020-01-24,2020-02-05,,released,56
2,male,1966,Korea,capital area,,,visit to Wuhan,1.0,,16.0,2020-01-26,2020-02-12,,released,54
3,male,1964,Korea,capital area,,,visit to Wuhan,1.0,,95.0,2020-01-27,2020-02-09,,released,56
4,male,1987,Korea,capital area,,,visit to Wuhan,1.0,,31.0,2020-01-30,2020-03-02,,released,33


# Resources
* https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
