# Practical Machine Learning for Data Mining Course
## HSE SPb 2020 - Master in Management and Analytics in Business
### Home Assignment 1 - Prepared by Mehmet Atakan Çavuşlu

This assignment uses a patient-level dataset on the COVID-19 viral disease, provided by _KCDC (Korea Centers for Disease Control & Prevention)_ and the South Korean Government.

The original dataset contains 11 data tables, which can be found and examined at https://github.com/jihoo-kim/Data-Science-for-COVID-19

In this assignment, only one data table, __PatientInfo__, is used.

-----------

## Data
Import __NumPy__ for array operations on the dataset and __Pandas__ for data processing and I/O.

In [1]:
import numpy as np
import pandas as pd

In [63]:
patient_info = pd.read_csv('PatientInfo.csv')

Check the data to confirm that the CSV was read successfully.

In [4]:
patient_info.head()

Unnamed: 0,patient_id,global_num,sex,birth_year,age,country,province,city,disease,infection_case,infection_order,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state
0,1000000001,2.0,male,1964.0,50s,Korea,Seoul,Gangseo-gu,,overseas inflow,1.0,,75.0,2020-01-22,2020-01-23,2020-02-05,,released
1,1000000002,5.0,male,1987.0,30s,Korea,Seoul,Jungnang-gu,,overseas inflow,1.0,,31.0,,2020-01-30,2020-03-02,,released
2,1000000003,6.0,male,1964.0,50s,Korea,Seoul,Jongno-gu,,contact with patient,2.0,2002000000.0,17.0,,2020-01-30,2020-02-19,,released
3,1000000004,7.0,male,1991.0,20s,Korea,Seoul,Mapo-gu,,overseas inflow,1.0,,9.0,2020-01-26,2020-01-30,2020-02-15,,released
4,1000000005,9.0,female,1992.0,20s,Korea,Seoul,Seongbuk-gu,,contact with patient,2.0,1000000000.0,2.0,,2020-01-31,2020-02-24,,released


## Data Processing
__This part contains data indexing, slicing, and masking examples required by the assignment. To demonstrate these operations, some redundant steps are applied. These approaches usually have shorter variants, and most of them are irrelevant to the core analysis of the data.__

Create a numpy array containing country information of all the patients.

In [23]:
patient_countries = patient_info['country'].to_numpy().astype(str)
patient_countries

array(['Korea', 'Korea', 'Korea', ..., 'Korea', 'Korea', 'Korea'],
      dtype='<U13')

Create a new 2D array containing the unique country values in the _patient_countries_ array together with their counts. Because a NumPy array holds a single dtype, the counts are stored as strings.

In [31]:
unique_countries, unique_counts = np.unique(patient_countries, return_counts=True)
country_counts = np.asarray((unique_countries, unique_counts))
country_counts

array([['Canada', 'China', 'France', 'Indonesia', 'Korea', 'Mongolia',
        'Spain', 'Switzerland', 'Thailand', 'United States', 'nan'],
       ['1', '10', '1', '1', '3017', '1', '1', '1', '2', '3', '90']],
      dtype='<U21')
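The same tally can also be sketched directly in pandas, which keeps the counts as integers instead of casting them to strings. The `demo` Series below is made-up data standing in for `patient_info['country']`; it is an illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for patient_info['country'], including one missing value.
demo = pd.Series(['Korea', 'China', 'Korea', np.nan, 'Korea', 'China'])

# dropna=False keeps missing values as their own row instead of discarding them.
counts = demo.value_counts(dropna=False)

# idxmax returns the label that has the largest count.
top_country = counts.idxmax()

print(counts)
print(top_country)
```

With this approach, finding the most frequent country does not need an `astype(int)` conversion.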

Get the index of the highest count and use that index to find the corresponding country.

In [39]:
index_max = country_counts[1, :].astype(int).argmax()
country_counts[0, index_max]

'Korea'

Use a boolean mask to count how many country labels have more than 5 patient records in the South Korean data. The result of 3 includes the 'nan' label, which marks records with a missing country.

In [48]:
np.sum(country_counts[1, :].astype(int) > 5)

3
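The count of 3 includes the `'nan'` label, which stands for missing country values rather than a country. A minimal sketch of combining two masks to exclude it, using the values copied from the `country_counts` output above:

```python
import numpy as np

# Values copied from the country_counts output shown earlier.
countries = np.array(['Canada', 'China', 'France', 'Indonesia', 'Korea', 'Mongolia',
                      'Spain', 'Switzerland', 'Thailand', 'United States', 'nan'])
counts = np.array([1, 10, 1, 1, 3017, 1, 1, 1, 2, 3, 90])

# Combine two boolean masks: count above 5 AND label is not the 'nan' placeholder.
mask = (counts > 5) & (countries != 'nan')
real_countries_over_5 = countries[mask]

print(real_countries_over_5, mask.sum())
```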

Create a new array containing only China and its count using fancy indexing.

In [59]:
row = np.array([0 , 1])
col = np.array([1])
country_counts[row, col]

array(['China', '10'], dtype='<U21')
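Fancy indexing with two index arrays pairs them element by element after broadcasting, which is why `row = [0, 1]` and `col = [1]` select the positions (0, 1) and (1, 1). The sketch below shows this on a small made-up array with the same layout as `country_counts`.

```python
import numpy as np

# Small made-up array with the same layout as country_counts:
# row 0 holds labels, row 1 holds counts stored as strings.
demo = np.array([['A', 'B', 'C'],
                 ['1', '10', '3']])

row = np.array([0, 1])
col = np.array([1])

# row (shape (2,)) and col (shape (1,)) broadcast to the pairs (0, 1) and (1, 1).
picked = demo[row, col]

# Plain slicing selects the same column without building index arrays.
sliced = demo[:, 1]

print(picked, sliced)
```

For a single column, the slice `demo[:, 1]` is the shorter variant; fancy indexing becomes useful when the selected positions are irregular.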

__Create a new column _"age_num"_ computed from birth_year, using 2020 as the reference year.__

In [64]:
patient_info['age_num'] = 2020 - patient_info['birth_year']
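As a possible consistency check, a decade label in the style of the existing `age` column can be derived from the numeric age. This is a sketch on a few `birth_year` values taken from the rows shown earlier plus one missing value; the `age_bucket` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# birth_year values from the first rows shown earlier, plus one missing value.
demo = pd.DataFrame({'birth_year': [1964.0, 1987.0, 1991.0, np.nan]})

REFERENCE_YEAR = 2020  # same reference year as the cell above
demo['age_num'] = REFERENCE_YEAR - demo['birth_year']

# Floor to the decade and format as '50s', '30s', ...; missing ages stay missing.
decade = demo['age_num'] // 10 * 10
demo['age_bucket'] = decade.map(lambda d: f'{int(d)}s', na_action='ignore')

print(demo)
```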

## Basic Exploratory Data Analysis (EDA)
__This part of the assignment covers some basic analysis of the data.__

In [65]:
patient_info.head()

Unnamed: 0,patient_id,global_num,sex,birth_year,age,country,province,city,disease,infection_case,infection_order,infected_by,contact_number,symptom_onset_date,confirmed_date,released_date,deceased_date,state,age_num
0,1000000001,2.0,male,1964.0,50s,Korea,Seoul,Gangseo-gu,,overseas inflow,1.0,,75.0,2020-01-22,2020-01-23,2020-02-05,,released,56.0
1,1000000002,5.0,male,1987.0,30s,Korea,Seoul,Jungnang-gu,,overseas inflow,1.0,,31.0,,2020-01-30,2020-03-02,,released,33.0
2,1000000003,6.0,male,1964.0,50s,Korea,Seoul,Jongno-gu,,contact with patient,2.0,2002000000.0,17.0,,2020-01-30,2020-02-19,,released,56.0
3,1000000004,7.0,male,1991.0,20s,Korea,Seoul,Mapo-gu,,overseas inflow,1.0,,9.0,2020-01-26,2020-01-30,2020-02-15,,released,29.0
4,1000000005,9.0,female,1992.0,20s,Korea,Seoul,Seongbuk-gu,,contact with patient,2.0,1000000000.0,2.0,,2020-01-31,2020-02-24,,released,28.0


Basic distribution analysis of the numerical columns age_num and contact_number (how many people the patient was in contact with). In the available data, a patient had about 18.9 contacts on average (the mean of contact_number), although the median is only 4 and contact_number is recorded for just 589 patients. This is a count of contacts, not of secondary infections, so it should not be read as an estimate of the R0 of COVID-19 in South Korea.

In [67]:
patient_info_nums = patient_info[['age_num', 'contact_number']]

In [68]:
patient_info_nums.describe()

Unnamed: 0,age_num,contact_number
count,2664.0,589.0
mean,45.543168,18.908319
std,20.250896,76.652155
min,0.0,0.0
25%,27.0,2.0
50%,46.0,4.0
75%,60.0,14.0
max,104.0,1160.0
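The `describe()` output shows a mean contact_number (18.9) far above its median (4), which is typical of a distribution with a few very large values. A minimal sketch with made-up numbers illustrates how a single large value pulls the mean away from the median:

```python
import pandas as pd

# Made-up contact counts with one extreme value, loosely mimicking the skew above.
contacts = pd.Series([0, 2, 2, 4, 4, 14, 1160])

mean_contacts = contacts.mean()      # sensitive to the extreme value
median_contacts = contacts.median()  # robust to the extreme value

print(mean_contacts, median_contacts)
```

For skewed columns like this one, the median and quartiles are usually the safer summary.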


Frequency counts of age groups in the data. The largest single group is patients in their 20s (722 patients), followed by those in their 50s (562).

In [69]:
patient_info['age'].value_counts()

20s     722
50s     562
40s     432
30s     389
60s     365
70s     189
80s     149
10s     126
90s      44
0s       43
66s       1
100s      1
Name: age, dtype: int64
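To express these counts as shares, each count can be divided by the total. The sketch below uses the counts copied from the output above; on the raw column, `value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
age_counts = pd.Series({
    '20s': 722, '50s': 562, '40s': 432, '30s': 389, '60s': 365, '70s': 189,
    '80s': 149, '10s': 126, '90s': 44, '0s': 43, '66s': 1, '100s': 1,
})

# Share of each age group among patients with a recorded age group.
shares = age_counts / age_counts.sum()
share_20s = shares['20s']

print(shares.round(3))
```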

Crosstab of age groups with the patient condition.

In [70]:
pd.crosstab(patient_info['age'], patient_info['state'])

age,deceased,isolated,released
0s,0,25,18
100s,0,1,0
10s,0,74,52
20s,0,413,309
30s,1,240,148
40s,1,247,184
50s,7,334,221
60s,10,242,113
66s,0,1,0
70s,17,122,50
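Raw counts are hard to compare across age groups of different sizes. A minimal sketch, on made-up data, of asking `pd.crosstab` for row-wise proportions instead:

```python
import pandas as pd

# Made-up records standing in for the 'age' and 'state' columns.
demo = pd.DataFrame({
    'age': ['20s', '20s', '70s', '70s', '70s', '70s'],
    'state': ['isolated', 'released', 'deceased', 'isolated', 'released', 'released'],
})

# normalize='index' divides each row by its total, so every row sums to 1.
shares = pd.crosstab(demo['age'], demo['state'], normalize='index')

print(shares)
```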


Average age for each patient condition. As expected, the mean age of deceased patients is on the higher side, around 74 in this case.

In [75]:
patient_info.groupby(['state'])['age_num'].mean()

state
deceased    74.275862
isolated    46.533004
released    42.222335
Name: age_num, dtype: float64
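A group mean is easier to interpret next to the group size and median. The sketch below, on made-up data, shows how several statistics can be requested in one `agg` call:

```python
import pandas as pd

# Made-up records standing in for the 'state' and 'age_num' columns.
demo = pd.DataFrame({
    'state': ['deceased', 'deceased', 'deceased', 'released', 'released'],
    'age_num': [70.0, 80.0, 90.0, 20.0, 30.0],
})

# One row per state, one column per requested statistic.
summary = demo.groupby('state')['age_num'].agg(['count', 'mean', 'median'])

print(summary)
```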

----

# Detailed Descriptive Statistics
This analysis uses only the shared data and does not reflect the overall situation. The sample consists of a limited number of patients, and all results are based on that limited sample.

## Patient Statistics By Age Groups

In [91]:
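# Number of patients with a recorded state
overall_n = patient_info['state'].count()
# Number of patients whose state is 'deceased'
overall_death = np.sum(patient_info['state'] == 'deceased')
# Share of deceased patients in the whole sample
overall_mortality = overall_death/overall_n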

Group the data by age group to get the same statistics.

In [172]:
stats_by_age = pd.DataFrame(patient_info.groupby(['age'])['state'].count())

In [174]:
deceased_patients = patient_info[patient_info['state'] == 'deceased']
stats_by_age_death = pd.DataFrame(deceased_patients.groupby(['age'])['state'].count())
stats_by_age = pd.concat([stats_by_age, stats_by_age_death], axis=1)

In [175]:
stats_by_age.columns = ['overall', 'death']
# Drop the '66s' and '100s' groups, which contain a single patient each
stats_by_age.drop(['66s', '100s'], inplace=True)

In [176]:
stats_by_age['fatality_rate'] = stats_by_age['death']/stats_by_age['overall']

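Age groups with no deceased patients are missing from the death counts, so the concatenation leaves NaN in their rows. Replace NaN with 0.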

In [184]:
stats_by_age.fillna(0, inplace=True)
stats_by_age.index.name = 'age'

In [185]:
stats_by_age

age,overall,death,fatality_rate
0s,43,0.0,0.0
10s,126,0.0,0.0
20s,722,0.0,0.0
30s,389,1.0,0.002571
40s,432,1.0,0.002315
50s,562,7.0,0.012456
60s,365,10.0,0.027397
70s,189,17.0,0.089947
80s,149,19.0,0.127517
90s,44,6.0,0.136364
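The count, concat, and divide steps above can also be sketched as a single grouped aggregation. The `fatality_table` helper and the `demo` data below are hypothetical; the helper assumes a DataFrame with a `state` column and a grouping column, as in `patient_info`.

```python
import pandas as pd

def fatality_table(df, by):
    """Hypothetical helper: patient count, deaths, and fatality rate per group."""
    # Keep only rows with a recorded state, matching the count() used above.
    df = df.dropna(subset=['state'])
    deceased = df['state'].eq('deceased')
    # 'count' gives the group size; 'sum' of a boolean Series counts the True values.
    out = deceased.groupby(df[by]).agg(overall='count', death='sum')
    out['fatality_rate'] = out['death'] / out['overall']
    return out

# Made-up records standing in for patient_info.
demo = pd.DataFrame({
    'age': ['20s', '20s', '70s', '70s', '70s', '70s'],
    'state': ['released', 'isolated', 'deceased', 'released', 'isolated', 'released'],
})

table = fatality_table(demo, 'age')
print(table)
```

Groups without deaths get 0 directly, so no `fillna` step is needed in this variant.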


## Patient Statistics By Genders

In [179]:
stats_by_gender = pd.DataFrame(patient_info.groupby(['sex'])['state'].count())

In [180]:
stats_by_gender_death = pd.DataFrame(deceased_patients.groupby(['sex'])['state'].count())
stats_by_gender = pd.concat([stats_by_gender, stats_by_gender_death], axis=1)

In [181]:
stats_by_gender.columns = ['overall', 'death']

In [182]:
stats_by_gender['fatality_rate'] = stats_by_gender['death']/stats_by_gender['overall']

In [183]:
stats_by_gender

sex,overall,death,fatality_rate
female,1707,20,0.011716
male,1327,41,0.030897
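As a worked check on the table above, the two fatality rates and their ratio can be recomputed from the counts shown. This is plain arithmetic on the sample, not a statistical test.

```python
# Counts copied from the stats_by_gender output above.
female_overall, female_death = 1707, 20
male_overall, male_death = 1327, 41

female_rate = female_death / female_overall
male_rate = male_death / male_overall

# Ratio of the male fatality rate to the female fatality rate in this sample.
rate_ratio = male_rate / female_rate

print(round(female_rate, 6), round(male_rate, 6), round(rate_ratio, 2))
```

In this sample, the male fatality rate is roughly 2.6 times the female rate.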


## Combining The Dataframes

In [188]:
overall_dataframe = pd.DataFrame({
    'overall': overall_n,
    'death': overall_death,
    'fatality_rate': overall_mortality,
}, index=['overall'])

In [190]:
overall_stats = pd.concat([overall_dataframe, stats_by_age, stats_by_gender], keys=['overall', 'age', 'sex'])

In [192]:
overall_stats

Unnamed: 0,Unnamed: 1,overall,death,fatality_rate
overall,overall,3128,61.0,0.019501
age,0s,43,0.0,0.0
age,10s,126,0.0,0.0
age,20s,722,0.0,0.0
age,30s,389,1.0,0.002571
age,40s,432,1.0,0.002315
age,50s,562,7.0,0.012456
age,60s,365,10.0,0.027397
age,70s,189,17.0,0.089947
age,80s,149,19.0,0.127517
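Passing `keys` to `pd.concat` produces a two-level row index, so each block can later be selected by its key. A small sketch on made-up frames with the same column layout:

```python
import pandas as pd

# Made-up summary frames standing in for stats_by_age and stats_by_gender.
by_age = pd.DataFrame({'overall': [10, 20], 'death': [0, 2]}, index=['20s', '70s'])
by_sex = pd.DataFrame({'overall': [18, 12], 'death': [1, 1]}, index=['female', 'male'])

# keys adds an outer index level that labels the source of each row.
combined = pd.concat([by_age, by_sex], keys=['age', 'sex'])

# Select a whole block by its outer key, or a single row by (outer, inner).
age_part = combined.loc['age']
male_row = combined.loc[('sex', 'male')]

print(combined)
```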


## Future Work
The same grouping and statistical analysis can also be applied, using the same approach, to prior infection condition, travel outside the country, and case dates.

In addition, since the dates of positive results and the dates of death or release are given, further analysis based on PD (Patient Days) can be conducted on the dataset.
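A possible starting point for such a patient-days analysis is the number of days between confirmation and release. The sketch below uses the `confirmed_date` and `released_date` values of the five rows shown by `head()`; the `days_to_release` column name is hypothetical.

```python
import pandas as pd

# Dates copied from the first five rows of PatientInfo shown earlier.
demo = pd.DataFrame({
    'confirmed_date': ['2020-01-23', '2020-01-30', '2020-01-30', '2020-01-30', '2020-01-31'],
    'released_date': ['2020-02-05', '2020-03-02', '2020-02-19', '2020-02-15', '2020-02-24'],
})

# Parse the strings into datetimes so they can be subtracted.
confirmed = pd.to_datetime(demo['confirmed_date'])
released = pd.to_datetime(demo['released_date'])

# The difference is a timedelta; .dt.days extracts whole days.
demo['days_to_release'] = (released - confirmed).dt.days

print(demo)
```

On the full table, rows without a release date would yield missing values and would need to be handled before averaging.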

Furthermore, the results can be analysed in more depth by adding other data tables from the same source.