# Cardio Catch Disease Project

## Problem Description

**The Company (fictional)**
- Cardio Catch Disease is a company specialized in detecting heart disease at an early stage. Its business model is service-based: the company offers early diagnosis of cardiovascular disease for a fee.


**Business Problem**
- Today the diagnosis is made manually by a team of specialists. The current precision ranges from 55% to 65%, owing to the complexity of the diagnosis and to the fatigue of the team, which works in shifts to minimize risk. Each diagnosis costs around R$1,000.00, including the equipment and the analysts' payroll.

- The price the customer pays for a diagnosis depends on the precision achieved by the team of specialists: the customer pays R$500.00 for every 5% of accuracy above 50%.
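
The pricing rule can be turned into a small calculation. The sketch below assumes that accuracy is given in percentage points and that only complete 5-point steps above 50% are billed; the helper names `diagnosis_price` and `diagnosis_profit` are hypothetical and not part of the project.

```python
# Hypothetical helpers illustrating the pricing rule. The names and the
# "complete steps only" interpretation are assumptions.
PRICE_PER_STEP = 500.0        # R$ paid per 5 points of accuracy above 50%
STEP_SIZE = 5.0               # percentage points per billed step
BASELINE = 50.0               # accuracy (in %) at which the price is zero
COST_PER_DIAGNOSIS = 1000.0   # R$ cost stated in the problem description


def diagnosis_price(accuracy_pct: float) -> float:
    """Price charged for one diagnosis at the given accuracy (in %)."""
    steps = int(max(accuracy_pct - BASELINE, 0.0) // STEP_SIZE)
    return steps * PRICE_PER_STEP


def diagnosis_profit(accuracy_pct: float) -> float:
    """Price minus the fixed cost of one diagnosis."""
    return diagnosis_price(accuracy_pct) - COST_PER_DIAGNOSIS


# The two ends of the accuracy range reported for the manual process
for acc in (55.0, 65.0):
    print(acc, diagnosis_price(acc), diagnosis_profit(acc))
```

Under these assumptions the manual process loses R$500.00 per diagnosis at 55% accuracy and earns R$500.00 at 65%.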


**Objective**
- Build a tool that increases the precision of the diagnosis and keeps that precision stable across all diagnoses.

**Questions to Answer**
- What are the accuracy and precision of the tool?
- How much profit will the company make with the new tool?
- How reliable are the results produced by the new tool?

# 0.0. Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import xgboost as xgb

from IPython.core.display    import HTML
from matplotlib              import pyplot        as plt
from sklearn.model_selection import train_test_split
from sklearn                 import neighbors     as nh
from sklearn.metrics         import accuracy_score, f1_score, recall_score
from sklearn                 import linear_model  as lm
from sklearn                 import ensemble      as en
from lightgbm                import LGBMClassifier
from sklearn.model_selection import cross_val_score

## 0.1. Helper Functions

In [49]:
def jupyter_settings():
    # Plot style and default figure and font sizes
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24

    display( HTML( '') )

    # Show every column and row, without wrapping wide frames
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    
    
def cross_validation( model_name, model, X, y ):
    # 5-fold cross-validation, run once per metric
    cv_score = cross_val_score( model, X, y, cv=5, scoring='accuracy' )
    cv_f1score = cross_val_score( model, X, y, cv=5, scoring='f1' )
    cv_recall = cross_val_score( model, X, y, cv=5, scoring='recall' )

    # Mean and standard deviation across the folds
    score_mean = round( np.mean( cv_score ), 3 )
    score_std = round( np.std( cv_score ), 3 )

    f1_mean = round( np.mean( cv_f1score ), 3 )
    f1_std = round( np.std( cv_f1score ), 3 )

    recall_mean = round( np.mean( cv_recall ), 3 )
    recall_std = round( np.std( cv_recall ), 3 )

    # One-row summary indexed by model name, formatted as "mean +/- std"
    return pd.DataFrame( { 'Accuracy Score': '{} +/- {}'.format( score_mean, score_std ),
                           'F1 Score': '{} +/- {}'.format( f1_mean, f1_std ),
                           'Recall Score': '{} +/- {}'.format( recall_mean, recall_std ) }, index=[model_name] )
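
The helper above runs a separate cross-validation for each metric, so every model is fitted three times per fold. As an alternative sketch, scikit-learn's `cross_validate` accepts a list of scorers and evaluates them in a single pass. The function name `cross_validation_single_pass` and the synthetic data from `make_classification` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


def cross_validation_single_pass(model_name, model, X, y, cv=5):
    """Hypothetical variant of the helper: one CV run, three metrics."""
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=['accuracy', 'f1', 'recall'])
    row = {}
    # cross_validate stores each metric under the key 'test_<scorer name>'
    for label, key in [('Accuracy Score', 'test_accuracy'),
                       ('F1 Score', 'test_f1'),
                       ('Recall Score', 'test_recall')]:
        mean = round(np.mean(scores[key]), 3)
        std = round(np.std(scores[key]), 3)
        row[label] = '{} +/- {}'.format(mean, std)
    return pd.DataFrame(row, index=[model_name])


# Synthetic binary data, used only to make the sketch runnable
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=28)
demo_cv = cross_validation_single_pass(
    'Demo LR CV', LogisticRegression(max_iter=1000), X_demo, y_demo)
print(demo_cv)
```

The returned frame has the same layout as the one produced by `cross_validation`, so the two could be concatenated in the same results table.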

In [3]:
jupyter_settings()

## Loading Dataset

In [4]:
df_raw = pd.read_csv( '../data/raw/cardio_train.csv', sep=';')

In [5]:
df_raw.head()

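|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|----|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |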


# 1.0 Data Description

In [6]:
df1 = df_raw.copy()

## Splitting Off a Test Dataset

In [7]:
df1['cardio'].value_counts(normalize=True)

0    0.5003
1    0.4997
Name: cardio, dtype: float64

In [8]:
X = df1.drop( 'cardio', axis=1 )
y = df1['cardio'].copy()

x_training, x_test, y_training, y_test = train_test_split( X, y, test_size=0.1, random_state=28 )

df1 = pd.concat( [x_training, y_training], axis=1 )
df_test = pd.concat( [x_test, y_test ], axis=1 )

In [9]:
df_test['cardio'].value_counts(normalize=True)

0    0.501143
1    0.498857
Name: cardio, dtype: float64

In [10]:
df1['cardio'].value_counts(normalize=True)

0    0.500206
1    0.499794
Name: cardio, dtype: float64
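
The class proportions stay close to 50/50 in both parts here because the dataset is large and nearly balanced to begin with. When that cannot be relied on, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. A minimal sketch on a small made-up frame (the data and the `demo` names are illustrative, not from the project):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a 70/30 class imbalance
demo = pd.DataFrame({'feature': range(100),
                     'cardio': [0] * 70 + [1] * 30})

X_demo = demo.drop('cardio', axis=1)
y_demo = demo['cardio']

# stratify=y_demo keeps the 70/30 ratio in both the training and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=28, stratify=y_demo)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```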

## 1.1. Column Names

In [11]:
df1.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

## 1.2. Data Dimensions

In [12]:
print('Number of Rows: {}'.format( df1.shape[0] ))
print('Number of Columns: {}'.format( df1.shape[1] ))

Number of Rows: 63000
Number of Columns: 13


## 1.3. Data Types

In [13]:
df1.dtypes

id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

## 1.4. Check NA

In [14]:
df1.isna().sum()

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64

## 1.5. Descriptive Statistics

In [15]:
# Descriptive statistics
# Central Tendency - mean, median
ct1 = pd.DataFrame( df1.apply( np.mean ) ).T
ct2 = pd.DataFrame( df1.apply( np.median ) ).T

# Dispersion - std, min, max, range, skew, kurtosis
d1 = pd.DataFrame( df1.apply( np.std ) ).T
d2 = pd.DataFrame( df1.apply( min ) ).T
d3 = pd.DataFrame( df1.apply( max ) ).T
d4 = pd.DataFrame( df1.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( df1.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( df1.apply( lambda x: x.kurtosis() ) ).T

# Concat
m = pd.concat( [d2, d3, d4, ct1, ct2, d1, d5, d6] ).T.reset_index()
m.columns = ['attributes', 'min', 'max', 'range', 'mean', 'median', 'std', 'skew', 'kurtosis']
m

|   | attributes | min | max | range | mean | median | std | skew | kurtosis |
|---|------------|-----|-----|-------|------|--------|-----|------|----------|
| 0 | id | 0.0 | 99999.0 | 99999.0 | 49962.531063 | 49938.5 | 28856.826649 | 0.000493 | -1.199129 |
| 1 | age | 10798.0 | 23713.0 | 12915.0 | 19469.352032 | 19703.0 | 2468.012245 | -0.307221 | -0.823181 |
| 2 | gender | 1.0 | 2.0 | 1.0 | 1.350222 | 1.0 | 0.477039 | 0.627962 | -1.605715 |
| 3 | height | 55.0 | 250.0 | 195.0 | 164.372841 | 165.0 | 8.219944 | -0.697752 | 8.629868 |
| 4 | weight | 10.0 | 200.0 | 190.0 | 74.187552 | 72.0 | 14.384827 | 1.017653 | 2.6398 |
| 5 | ap_hi | -140.0 | 16020.0 | 16160.0 | 128.778683 | 120.0 | 152.352806 | 85.856894 | 7698.28547 |
| 6 | ap_lo | -70.0 | 11000.0 | 11070.0 | 96.719048 | 80.0 | 188.535871 | 31.770352 | 1398.122439 |
| 7 | cholesterol | 1.0 | 3.0 | 2.0 | 1.367857 | 1.0 | 0.680265 | 1.582175 | 0.981197 |
| 8 | gluc | 1.0 | 3.0 | 2.0 | 1.225952 | 1.0 | 0.571583 | 2.401165 | 4.314433 |
| 9 | smoke | 0.0 | 1.0 | 1.0 | 0.087762 | 0.0 | 0.282948 | 2.913945 | 6.491281 |
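
The same kind of summary can be produced more compactly with `DataFrame.agg`. The sketch below runs on the `height` and `weight` values of the five rows shown by `df_raw.head()`, not on the full training set. Note that pandas' `std` defaults to the sample standard deviation (`ddof=1`), so it can differ slightly from a population estimate.

```python
import pandas as pd

# height and weight of the five rows displayed by df_raw.head()
demo = pd.DataFrame({'height': [168, 156, 165, 169, 156],
                     'weight': [62.0, 85.0, 64.0, 82.0, 56.0]})

# One row per statistic, then transpose so each attribute becomes a row
summary = demo.agg(['min', 'max', 'mean', 'median', 'std', 'skew', 'kurt']).T

# Add the range right after min and max, mirroring the table layout
summary.insert(2, 'range', summary['max'] - summary['min'])
summary = (summary.rename(columns={'kurt': 'kurtosis'})
                  .rename_axis('attributes')
                  .reset_index())
print(summary)
```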


# 2.0. Feature Engineering

In [16]:
df2 = df1.copy()

In [17]:
# Convert the age column from days to years
df2['age'] = round(df2['age'] / 365)
df2['age'] = df2['age'].astype(np.int64)

# 3.0. Row and Column Filtering

In [18]:
df3 = df2.copy()

## Row Filtering

In [19]:
df3 = df3.loc[ (df3['ap_hi'] > 0) & ( df3['ap_lo'] > 0 ), : ]
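
The filter above removes non-positive blood pressure readings, but the descriptive statistics also show maxima of 16020 for `ap_hi` and 11000 for `ap_lo`, and those rows pass it. One possible extension is to keep only readings inside a plausible range. The bounds below, and the extra requirement that `ap_hi` exceed `ap_lo`, are illustrative assumptions rather than clinical thresholds; `filter_blood_pressure` is a hypothetical helper.

```python
import pandas as pd

# Illustrative bounds in mmHg; assumptions, not clinically validated limits
AP_HI_RANGE = (60, 260)
AP_LO_RANGE = (30, 200)


def filter_blood_pressure(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose ap_hi/ap_lo readings fall inside the assumed bounds."""
    mask = (df['ap_hi'].between(*AP_HI_RANGE)
            & df['ap_lo'].between(*AP_LO_RANGE)
            & (df['ap_hi'] > df['ap_lo']))  # assumed: systolic above diastolic
    return df.loc[mask, :]


# Made-up rows covering each rejection case plus one valid reading
demo = pd.DataFrame({'ap_hi': [110, -140, 16020, 120, 80],
                     'ap_lo': [80, 70, 90, 11000, 120]})
filtered = filter_blood_pressure(demo)
print(filtered)
```

Any change of this kind alters the training data, so the scores reported later in this notebook would need to be recomputed.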

## Column Filtering

# 4.0. Exploratory Data Analysis

In [20]:
df4 = df3.copy()

# 5.0. Data Preparation

In [21]:
df5 = df4.copy()

In [22]:
X = df5.drop('cardio', axis=1)
y = df5['cardio'].copy()

x_train, x_validation, y_train, y_validation = train_test_split( X, y, test_size=0.20, random_state=28 )

# 6.0. Feature Selection

In [23]:
df6 = df5.copy()

# 7.0. Model Selection

In [24]:
df7 = df6.copy()

## 7.1. KNN Model

In [25]:
# Model Definition
knn_model = nh.KNeighborsClassifier( n_neighbors=8 )

# Model Training
knn_model.fit( x_train, y_train )

# Model Prediction
yhat_knn = knn_model.predict( x_validation )

# Metrics
knn_f1score = f1_score(y_validation, yhat_knn)
knn_accuracy = accuracy_score( y_validation, yhat_knn)
knn_recall = recall_score( y_validation, yhat_knn, average='binary'  )

print(knn_f1score)
print(knn_accuracy)
print( knn_recall )

0.5705440900562851
0.63652242953553
0.47664576802507835


### KNN Cross Validation

In [50]:
knn_cv = cross_validation( 'KNN CV', knn_model, X, y )

## 7.2. Logistic Regression

In [26]:
# Model Definition
lr_model = lm.LogisticRegression( random_state=28 )

# Model Training
lr_model.fit( x_train, y_train )

# Model Prediction
yhat_lr = lr_model.predict( x_validation )

lr_f1score = f1_score(y_validation, yhat_lr)
lr_accuracy = accuracy_score( y_validation, yhat_lr)
lr_recall = recall_score( y_validation, yhat_lr, average='binary'  )


print(lr_f1score)
print(lr_accuracy)
print( lr_recall )

0.6909031090639909
0.7016276300119095
0.658307210031348


### LR Cross Validation

In [51]:
lr_cv = cross_validation( 'Logistic Regression CV', lr_model, X, y )

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
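
The warning suggests either raising `max_iter` or scaling the data. A minimal sketch that combines both in a scikit-learn pipeline is shown below, on synthetic data with one deliberately large-magnitude column. The value `max_iter=1000` and the data are assumptions chosen for illustration; no result on the project data is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data; inflate one column to mimic mismatched scales
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     random_state=28)
X_demo[:, 0] *= 10_000

# The scaler is refit on the training part of every fold, so the
# held-out fold never influences the scaling parameters
lr_scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=28))

scores = cross_val_score(lr_scaled, X_demo, y_demo, cv=5, scoring='accuracy')
print(round(np.mean(scores), 3))
```

Because the pipeline is itself an estimator, it could be passed to the `cross_validation` helper in place of `lr_model`.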

## 7.3. Extra Trees

In [27]:
# Model Definition
et_model = en.ExtraTreesClassifier( n_estimators=1000, n_jobs=-1, random_state=42 )

# Model Training
et_model.fit( x_train, y_train )

# Model Prediction 
yhat_et = et_model.predict( x_validation )

et_f1score = f1_score(y_validation, yhat_et)
et_accuracy = accuracy_score( y_validation, yhat_et)
et_recall = recall_score( y_validation, yhat_et, average='binary'  )


print(et_f1score)
print(et_accuracy)
print(et_recall)

0.7115430196729083
0.7101230647082175
0.7057993730407524


### Extra Trees Cross Validation

In [57]:
et_cv = cross_validation( 'Extra Trees CV', et_model, X, y )

## 7.4. Random Forest

In [33]:
# model definition
rf_model = en.RandomForestClassifier( n_estimators=1000, n_jobs=-1, random_state=42 ) 

# model training
rf_model.fit( x_train, y_train )

# model prediction
yhat_rf = rf_model.predict( x_validation )

rf_f1score = f1_score(y_validation, yhat_rf)
rf_accuracy = accuracy_score( y_validation, yhat_rf)
rf_recall = recall_score( y_validation, yhat_rf, average='binary' ) 

print(rf_f1score)
print(rf_accuracy)
print( rf_recall )

0.718039373242266
0.7213973799126637
0.7003134796238244


### Random Forest Cross Validation

In [54]:
rf_cv = cross_validation( 'Random Forest CV', rf_model, X, y )

## 7.5. Light Gradient Boosting Machine

In [34]:
lgbm_model = LGBMClassifier()

lgbm_model.fit(x_train, y_train)

yhat_lgbm = lgbm_model.predict( x_validation )


lgbm_f1score = f1_score(y_validation, yhat_lgbm)
lgbm_accuracy = accuracy_score( y_validation, yhat_lgbm)
lgbm_recall = recall_score( y_validation, yhat_lgbm, average='binary' ) 

print(lgbm_f1score)
print(lgbm_accuracy)
print(lgbm_recall)

0.726792203501817
0.7373560936879714
0.6896551724137931


### LGBM Cross Validation

In [55]:
lgbm_cv = cross_validation( 'LGBM CV', lgbm_model, X, y )

## Results

In [35]:
results = {'Metric':['F1 Score', 'Accuracy Score', 'Recall Score'], 'KNN': [knn_f1score, knn_accuracy, knn_recall], 
           'Logistic Regression': [lr_f1score, lr_accuracy, lr_recall], 'Extra Trees': [et_f1score, et_accuracy, et_recall], 
           'Random Forest': [rf_f1score, rf_accuracy, rf_recall], 'LGBM': [lgbm_f1score, lgbm_accuracy, lgbm_recall]}
df_results = pd.DataFrame( data=results).T

In [36]:
df_results.columns = df_results.iloc[0]
df_results = df_results[1:]
df_results

| Metric | F1 Score | Accuracy Score | Recall Score |
|--------|----------|----------------|--------------|
| KNN | 0.570544 | 0.636522 | 0.476646 |
| Logistic Regression | 0.690903 | 0.701628 | 0.658307 |
| Extra Trees | 0.711543 | 0.710123 | 0.705799 |
| Random Forest | 0.718039 | 0.721397 | 0.700313 |
| LGBM | 0.726792 | 0.737356 | 0.689655 |
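
The results table can also be assembled with models as rows from the start, which avoids the transpose and the header fix-up. The sketch below uses the rounded scores from the table above; `pd.DataFrame.from_dict` with `orient='index'` is standard pandas, while the `_demo` names and the sort by accuracy are added for illustration.

```python
import pandas as pd

# Rounded validation scores per model: [F1, accuracy, recall]
scores_demo = {
    'KNN':                 [0.570544, 0.636522, 0.476646],
    'Logistic Regression': [0.690903, 0.701628, 0.658307],
    'Extra Trees':         [0.711543, 0.710123, 0.705799],
    'Random Forest':       [0.718039, 0.721397, 0.700313],
    'LGBM':                [0.726792, 0.737356, 0.689655],
}

# Each dictionary key becomes a row; the metric names become the columns
df_results_demo = pd.DataFrame.from_dict(
    scores_demo, orient='index',
    columns=['F1 Score', 'Accuracy Score', 'Recall Score'])

# Rank the models by validation accuracy, best first
df_sorted = df_results_demo.sort_values('Accuracy Score', ascending=False)
print(df_sorted)
```

Sorting by a different column changes the ranking: LGBM leads on accuracy and F1, while Extra Trees has the highest recall.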


## Cross Validation Results

In [56]:
results_cv = pd.concat( [knn_cv, lr_cv, et_cv, rf_cv, lgbm_cv] )
results_cv

|   | Accuracy Score | F1 Score | Recall Score |
|---|----------------|----------|--------------|
| KNN CV | 0.637 +/- 0.004 | 0.57 +/- 0.005 | 0.482 +/- 0.005 |
| Logistic Regression CV | 0.702 +/- 0.004 | 0.687 +/- 0.006 | 0.656 +/- 0.013 |
| Extra Trees CV | 0.722 +/- 0.003 | 0.716 +/- 0.004 | 0.702 +/- 0.005 |
| Random Forest CV | 0.722 +/- 0.003 | 0.716 +/- 0.004 | 0.702 +/- 0.005 |
| LGBM CV | 0.734 +/- 0.004 | 0.722 +/- 0.004 | 0.691 +/- 0.005 |


# 8.0. Hyper-parameter Fine Tuning

# 9.0. Testing Model

# 10.0. Business Performance