# Planning 

## Reference
[Link](https://www.kaggle.com/bhadaneeraj/cardio-vascular-disease-detection) to Kaggle Project.

## The Problem Statement:
Build an application that classifies patients as healthy or suffering from cardiovascular disease, based on the given attributes.

## Features:

| Feature | Feature type | Column | Data type / values |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

# Import libraries

In [1]:
import pandas as pd
import numpy as np
from IPython.display import Image
import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.preprocessing import normalize

ModuleNotFoundError: No module named 'sklearn'

# Loading dataset

In [None]:
data_raw = pd.read_csv('dataset/cardio_train.csv', delimiter=';')

# Helper function

In [None]:
from IPython.display import HTML, display

def jupyter_settings():
    # render figures inline in the notebook
    %matplotlib inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()
    
jupyter_settings()

# Descriptive analysis

## Dimensions

In [None]:
data_raw.head()

In [None]:
data_raw.shape

## Renaming columns

In [None]:
data_raw.columns

In [None]:
data_raw.columns = ['id', 'age', 'gender', 'height', 'weight', 'sys_press', 'dia_press',
                    'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']

In [None]:
data_raw.columns

## Feature engineering

### Changing `age` from days to years

In [None]:
data_raw['age_year'] = data_raw['age'].apply(lambda x: x/365)

In [None]:
data_raw.sample(3)

In [None]:
data_raw.drop('age', axis=1,inplace=True)

In [None]:
data_raw.sample(3)

In [None]:
data_raw.rename(columns={'age_year': 'age'}, inplace=True)

In [None]:
data_raw.sample(3)

In [None]:
# round `age` values to 1 decimal
data_raw['age'] = data_raw['age'].apply( lambda x: np.round(x, 1) )

In [None]:
data_raw.sample(3)
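The conversion above (new column, drop, rename, round) can also be written as one vectorized step. This is a minimal sketch on a small made-up frame standing in for `data_raw`; the `toy` name and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for data_raw; the real values come from cardio_train.csv.
toy = pd.DataFrame({"id": [0, 1, 2], "age": [18393, 20228, 18857]})

# Convert days to years and round to 1 decimal, overwriting the column
# in place instead of creating, dropping and renaming columns.
toy["age"] = (toy["age"] / 365).round(1)
print(toy)
```

Dividing the whole column at once avoids the per-row `apply` call, which tends to be slower on large frames.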

## Checking NA 

In [None]:
data_raw.isna().sum()

## Descriptive statistics

### Numerical attributes

In [None]:
data_raw.dtypes 

In [None]:
# central tendency: mean and median
ct1 = pd.DataFrame( data_raw.apply( np.mean ) ).T
ct2 = pd.DataFrame( data_raw.apply( np.median ) ).T

# dispersion: std, min, max, range, skew, kurtosis
d1 = pd.DataFrame( data_raw.apply( np.std ) ).T
d2 = pd.DataFrame( data_raw.apply( min ) ).T
d3 = pd.DataFrame( data_raw.apply( max ) ).T
d4 = pd.DataFrame( data_raw.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( data_raw.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( data_raw.apply( lambda x: x.kurtosis() ) ).T

# concatenate all statistics into one table with one row per attribute
m = pd.concat([d2,d3,d4,ct1,ct2,d1,d5,d6]).T.reset_index()

# rename columns
m.columns = ["attributes","min","max","range","mean","median","std","skew","kurtosis"]
m

Two binary variables: aggregate sum or frequency table

In [None]:
pd.crosstab( data_raw['smoke'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )

According to these data, the `smoke` variable does not seem to be relevant for classifying individuals as healthy or as having cardiovascular disease. We will now investigate the other binary variables: `alco`, `gender` and `active`

In [None]:
pd.crosstab( data_raw['alco'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )

In [None]:
pd.crosstab( data_raw['gender'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )

In [None]:
pd.crosstab( data_raw['active'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )

In [None]:
pd.crosstab( data_raw['gluc'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )

In [None]:
pd.crosstab( data_raw['cholesterol'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )
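The row-normalized frequency tables above all follow the same pattern, so they could be wrapped in a small helper. The sketch below uses the `normalize="index"` option of `pd.crosstab`, which divides each row by its total; the helper name `cardio_rate_by` and the toy data are hypothetical.

```python
import pandas as pd

def cardio_rate_by(df, feature, target="cardio"):
    """Share of each target class within every level of `feature`."""
    return pd.crosstab(df[feature], df[target], normalize="index")

# Toy stand-in for data_raw (illustrative values only).
toy = pd.DataFrame({
    "smoke":  [0, 0, 0, 0, 1, 1],
    "cardio": [0, 1, 1, 0, 1, 0],
})
table = cardio_rate_by(toy, "smoke")
print(table)
```

Each row of the result sums to 1, so the `cardio = 1` column can be read directly as the disease rate for that level of the feature.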

## Hypothesis Mind Map
Let's create some hypotheses to investigate the phenomenon. Before doing that, we want to create a mind map of all elements (e.g. Person) and their attributes (e.g. age) so that we can use them to create the hypotheses (e.g. older people have a higher probability of having a cardio disease).

In [None]:
Image("images/coggle_mind_map.PNG")

Hypotheses:
* High cholesterol -> cardio disease
* High systolic blood pressure -> cardio disease
* High diastolic blood pressure -> cardio disease
* High glucose -> cardio disease
* Non active (0) -> cardio disease

## EDA (Exploratory data analysis) 
### Univariate analysis

In [None]:
sns.distplot( data_raw['cardio'] );

In [None]:
sns.countplot( x='cardio', data=data_raw );

### Hypothesis tests

High cholesterol -> cardio disease

True

In [None]:
data_raw.columns

In [None]:
# countplot

sns.countplot( hue='cardio', x='cholesterol', data=data_raw );

High systolic blood pressure -> cardio disease

In [None]:
# boxplot

fig, ax = plt.subplots()
sns.boxplot( data=data_raw, x='cardio', y='sys_press', ax=ax);
ax.set_ylim(50, 250)
plt.show()

High diastolic blood pressure -> cardio disease

In [None]:
# boxplot

fig, ax = plt.subplots()
sns.boxplot( data=data_raw, x='cardio', y='dia_press', ax=ax);
ax.set_ylim(1, 200)
plt.show()

In [None]:
# cholesterol and dia_press

fig, ax = plt.subplots()
sns.boxplot( data=data_raw, x='cholesterol', y='dia_press', ax=ax);
ax.set_ylim(50, 120)
plt.show()

High glucose -> cardio disease

In [None]:
# countplot

sns.countplot( hue='cardio', x='gluc', data=data_raw );

Non active (0) -> cardio disease

In [None]:
pd.crosstab( data_raw['active'], data_raw['cardio'] ).apply( lambda x: x / x.sum(), axis=1 )

In [None]:
sns.countplot( hue='cardio', x='active', data=data_raw );

## Choosing the Classification Models 
### Support Vector Machine (SVM)
#### Description and Intuition
The main ideas behind Support Vector Machines are:
1. Start with data in a relatively low dimension
2. Move the data into a higher dimension
3. Find a Support Vector Classifier that separates the higher dimensional data into two groups

![SVM intuition](images/SVM.svg)

The **polynomial kernel** computes the relationship between each pair of observations and that information is used to build the Support Vector Classifier that separates the data the best. Since different transformations are possible (d=1, d=2 (squared), d=3 (cubic), etc), the polynomial kernel is computed using different values of **d** (the degree of the polynomial), and *cross-validation* is used to choose the best value of d.

![SVM kernel](images/SVM_kernel.svg)
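To make the pairwise relationship concrete, here is a minimal NumPy sketch of a polynomial kernel of the form `(x . y + coef0) ** degree`. The parameter names `degree` and `coef0` are borrowed from scikit-learn's naming as an assumption, and the optional `gamma` scaling factor is left out for brevity; this is an illustration, not the library's implementation.

```python
import numpy as np

def polynomial_kernel(X, Y, degree=2, coef0=1.0):
    """Pairwise polynomial kernel between the rows of X and the rows of Y."""
    return (X @ Y.T + coef0) ** degree

# Two made-up observations with two features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

K = polynomial_kernel(X, X, degree=2, coef0=1.0)
print(K)
# K[0, 1] = (1*3 + 2*4 + 1) ** 2 = 144
```

Entry `K[i, j]` is the relationship between observations `i` and `j`. Trying several values of `degree` and comparing cross-validated scores is how the best `d` would be chosen.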

#### Data preprocessing

Based on [this](https://www.youtube.com/watch?v=8A7L0GsBiLQ) YouTube video, we will do the following steps for SVM data preprocessing:

1) Remove or impute missing values
    - SVMs are not optimized for high volumes of data, so we might need to downsample
        - in the example, from 29932 to 2000 (1000 of each category)

2) Split the columns into variables (X) and the data to be predicted (y)

3) SVMs support continuous data but do NOT support categorical data
    - so we need to use one-hot encoding
    - the pandas `get_dummies()` function can do that

4) Center and scale the data
    - the radial basis function that we are using in SVM assumes that the data are centered and scaled. In other words, each column should have mean=0 and std=1. So we need to do it for both training and testing datasets.
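The four steps can be sketched with pandas and NumPy alone. Everything below runs on synthetic data: the `toy` frame, the 50-rows-per-class sample size and the 80/20 split are assumptions for illustration. The scaling statistics are computed on the training rows only and then reused for the test rows, which is one common way to scale both sets consistently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-in for the cleaned dataset (values are made up).
toy = pd.DataFrame({
    "age": rng.uniform(30, 65, n).round(1),
    "sys_press": rng.normal(125, 15, n).round(),
    "cholesterol": np.resize([1, 2, 3], n),
    "cardio": np.tile([0, 1], n // 2),
})

# 1) Downsample to the same number of rows per class.
small = toy.groupby("cardio").sample(n=50, random_state=42)

# 2) Split into variables (X) and the column to predict (y).
X = small.drop(columns="cardio")
y = small["cardio"]

# 3) One-hot encode the categorical column.
X = pd.get_dummies(X, columns=["cholesterol"], dtype=float)

# 4) Center and scale with statistics from the training rows only,
#    then apply the same transformation to the test rows.
train_idx = X.sample(frac=0.8, random_state=42).index
test_idx = X.index.difference(train_idx)
mean = X.loc[train_idx].mean()
std = X.loc[train_idx].std(ddof=0)
X_train = (X.loc[train_idx] - mean) / std
X_test = (X.loc[test_idx] - mean) / std
```

After this, every column of `X_train` has mean 0 and standard deviation 1. `X_test` is only approximately centered, because it reuses the training statistics.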

### XGBoost

#### Description and Intuition
The main ideas behind XGBoost (Extreme Gradient Boosting) are:
1. It's based on multiple decision trees to make predictions (decision tree ensemble learning algorithm). 
2. First the tree picks a root (one of the features), then it generates branches until a limiting factor stops the growth of the tree.
3. Trees are added sequentially, and each new tree is trained to correct the errors of the previous ones.

#### Data preprocessing
Based on [this](https://www.youtube.com/watch?v=GrJP9FLV3FE) YouTube video, we will do the following steps for XGBoost data preprocessing:

- split data into dependent and independent variables
- one-hot encoding (we won't need it)
- convert all columns to int, float or bool (we won't need it)

#### Optimization
- scale_pos_weight helps to deal with unbalanced data (adds a penalty for misclassified minority class, i.e. the tree will try harder to classify the minority class)
- hyperparameter fine-tuning: max_depth, learning_rate (i.e. eta), gamma (parameter that encourages pruning), reg_lambda
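A common starting value for `scale_pos_weight` is the ratio of negative to positive examples. The sketch below computes that ratio in plain Python; the helper name and the label list are hypothetical, and it assumes the positive class (1) is the minority.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) examples."""
    positives = sum(1 for v in labels if v == 1)
    negatives = sum(1 for v in labels if v == 0)
    if positives == 0:
        raise ValueError("labels contain no positive examples")
    return negatives / positives

labels = [0, 0, 0, 1, 0, 0, 1, 0]  # made-up, imbalanced labels
weight = suggested_scale_pos_weight(labels)
print(weight)  # 6 negatives / 2 positives = 3.0
```

If the classes turn out to be roughly balanced, the ratio is close to 1 and the parameter has little effect.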

#### Observations
- So far, we have seen that our data needs to be numerical, i.e., we would need to transform any categorical or text data if we had it. However, all our features are numerical, so we don't need to do such a transformation. 
- XGBoost has built-in handling for missing values, so missing data would not necessarily need to be imputed. In any case, our data does not have any missing data, so we don't need to worry about it.
- We haven't seen any requirements for scaling our data.

#### Optimization backlog
- Define our data as either sparse or dense and apply the most appropriate datatypes (as defined [here](https://scikit-learn.org/stable/modules/svm.html))

### Baseline strategy
1. check the percentage of people who have cardio disease
2. use that number to choose the probability of assigning someone as having cardio disease
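The baseline could be sketched as a random classifier that predicts "cardio disease" with probability equal to the observed prevalence. The function name, the seed and the toy labels are assumptions for illustration.

```python
import numpy as np

def prevalence_baseline(y_train, n_predictions, seed=0):
    """Predict 1 with probability equal to the share of positives in y_train."""
    prevalence = float(np.mean(y_train))
    rng = np.random.default_rng(seed)
    preds = (rng.random(n_predictions) < prevalence).astype(int)
    return preds, prevalence

y_train = np.array([0, 1, 1, 0, 1, 0, 0, 1])  # made-up labels
preds, prevalence = prevalence_baseline(y_train, n_predictions=1000)
print(prevalence, preds.mean())
```

Any real model should beat the scores of this baseline. If it does not, the features are adding nothing beyond the class proportions.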

# Data Preparation

## One hot encoding for categorical variables

In [None]:
df = data_raw.copy()

In [None]:
df.columns

### Gender

In [None]:
df['gender'].unique()

In [None]:
df = pd.get_dummies( df, columns=['gender'] )

In [None]:
df.head()

### Cholesterol

In [None]:
df.columns

In [None]:
df['cholesterol'].unique()

In [None]:
df = pd.get_dummies( df, columns=['cholesterol'] )

In [None]:
df.head()

### Glucose

In [None]:
df = pd.get_dummies( df, columns=['gluc'] )

In [None]:
df.head()

## Transformation

In [None]:
sns.distplot( df['height'] );
# Candidate for normalization

In [None]:
sns.distplot( df['weight'] );
# Candidate for normalization

In [None]:
fig, ax = plt.subplots()
sns.distplot( df['sys_press'], ax=ax );
ax.set_xlim( -200,1500 )
plt.show()

In [None]:
sns.distplot( df['sys_press'] );

In [None]:
sns.boxplot( df['sys_press'] );

Let's assume that systolic pressures higher than 4000 are outliers and remove those from our dataset. According to the previous boxplot, that should remove only a few data points.

In [None]:
df = df[df['sys_press'] <= 4000]

sns.boxplot( df['sys_press'] );

In [None]:
sns.distplot( df['sys_press'] );
# Candidate for robust scaler

In [None]:
sns.distplot( df['dia_press'] );

In [None]:
sns.boxplot( df['dia_press'] );

In [None]:
df = df[df['dia_press'] <= 5000]

In [None]:
sns.distplot( df['dia_press'] )

In [None]:
sns.boxplot( df['dia_press'] )

Check which data points have very high systolic and diastolic pressures

In [None]:
df.loc[df['dia_press'] > 500, ['id', 'sys_press', 'dia_press']]

In [None]:
df.loc[df['sys_press'] > 500, ['id', 'sys_press', 'dia_press']]
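Before settling on a cutoff, it can help to count how many rows each threshold would flag. This is a small sketch on made-up values; the helper `rows_above` and the `toy` frame are hypothetical, and the limit of 500 simply mirrors the checks above.

```python
import pandas as pd

# Made-up rows standing in for df.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "sys_press": [120, 14020, 130, 907, 110],
    "dia_press": [80, 90, 8500, 70, 1000],
})

def rows_above(df, column, threshold):
    """Return the rows whose value in `column` exceeds `threshold`."""
    return df.loc[df[column] > threshold, ["id", "sys_press", "dia_press"]]

counts = {}
for col, limit in [("sys_press", 500), ("dia_press", 500)]:
    flagged = rows_above(toy, col, limit)
    counts[col] = len(flagged)
    print(f"{col} > {limit}: {len(flagged)} of {len(toy)} rows")
```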

In [None]:
sns.distplot(df['age'])

In [None]:
sns.boxplot( df['age'] )
# Candidate for min-max scaler

## Rescaling

### Normalization

In [None]:
df.columns
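The candidates noted above (min-max scaling for `age`, robust scaling for `sys_press`) can be sketched in plain pandas. The function names and the toy values are assumptions. scikit-learn's `MinMaxScaler` and `RobustScaler` provide library versions of these two scalers.

```python
import pandas as pd

def min_max_scale(s):
    """Rescale to [0, 1] using the column minimum and maximum."""
    return (s - s.min()) / (s.max() - s.min())

def robust_scale(s):
    """Center on the median and divide by the interquartile range."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return (s - s.median()) / iqr

toy = pd.DataFrame({
    "age": [40.0, 45.0, 50.0, 55.0, 60.0],
    "sys_press": [110.0, 120.0, 130.0, 140.0, 900.0],  # last value mimics an outlier
})
toy["age_scaled"] = min_max_scale(toy["age"])
toy["sys_press_scaled"] = robust_scale(toy["sys_press"])
print(toy)
```

The median and interquartile range are barely moved by the outlier, so the four typical pressure values stay within [-1, 0.5] after scaling. With a train/test split, the statistics would normally be computed on the training rows only.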

## Next steps

Solve the pyenv issue with the sklearn import.

Data transformation:
- Scale our data to make it compatible with SVM models (check if it is necessary)
    - Check the downsampling strategy for SVM

Split data:
- Split data into dependent and independent variables
- Split data into train and test
- 'Split' data for cross validation
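The three splitting steps could look like the sketch below, written with pandas and NumPy only so that it does not depend on the sklearn import. The toy frame, the 80/20 ratio and the choice of 4 folds are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for df.
toy = pd.DataFrame({
    "age": np.arange(20, dtype=float),
    "cardio": np.tile([0, 1], 10),
})

# Dependent and independent variables.
X = toy.drop(columns="cardio")
y = toy["cardio"]

# Stratified 80/20 split: take 20% of each class for the test set.
test_idx = toy.groupby("cardio").sample(frac=0.2, random_state=0).index
X_test, y_test = X.loc[test_idx], y.loc[test_idx]
X_train, y_train = X.drop(index=test_idx), y.drop(index=test_idx)

# Assign each training row to one of 4 cross-validation folds.
rng = np.random.default_rng(0)
folds = rng.permutation(len(X_train)) % 4
```

Sampling within each class keeps the class proportions the same in the train and test sets.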

Run models:
- Run SVM and XGBoost
- Check performance of both models (e.g. confusion matrix)
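A confusion matrix can be built with `pd.crosstab` once predictions exist. The labels below are made up, and no model is trained here; the sketch only shows how the matrix and two derived metrics would be computed.

```python
import pandas as pd

# Made-up true labels and predictions.
y_true = pd.Series([0, 0, 1, 1, 1, 0, 1, 0], name="actual")
y_pred = pd.Series([0, 1, 1, 1, 0, 0, 1, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(y_true, y_pred)
tn, fp = confusion.loc[0, 0], confusion.loc[0, 1]
fn, tp = confusion.loc[1, 0], confusion.loc[1, 1]

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(confusion)
print(accuracy, recall)
```

For a disease classifier, recall (the share of sick patients that are detected) is often worth tracking alongside accuracy.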

## Backlog
* Study `sns.distplot`
    * What's the meaning of the `y` axis? How can the `density` be interpreted?  
* Search for a real cardiovascular disease dataset  
    * We believe the current dataset is not real
* Evaluate renaming some binary variables
* Check information about systolic and diastolic pressure
* Check different moments of applying data split in the project