### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

#### Custom CSS style

In [1]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
    /* font-size: var(--jp-content-font-size1) !important; */
}

.dashed-box table {

}

.dashed-box tr {
    background-color: white !important;
}
        
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>Cardiovascular Disease 💔</b></span><br/>
<span style='font-size: 1.5em'>Predict cardiovascular diseases</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint #5</b></span>

<img src="./imgs/cardio.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims to solve the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline** (see the [***Machine Learning Project Checklist by xavecoding***](https://github.com/xavecoding/IFSP-CMP-D2APR-2021.2/blob/main/cheat-sheets/machine-learning-project-checklist_by_xavecoding.pdf)). <br/>
We tried to keep this notebook a literal _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Evaluate on the training set:
  + Decision Trees

---

### 0. Imports and default settings for plotting

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data
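Decision trees do not require feature scaling, and the categorical features are already stored as integer codes, so we skip both feature scaling and categorical encoding. The only preprocessing step is imputation.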
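As a rough illustration of why scaling can be skipped for tree models, the sketch below fits the same shallow tree on raw and on standardized features and compares the predictions. It uses synthetic data from `make_classification` as a stand-in (not the cardio dataset), and the names `X_demo`, `y_demo`, `tree_raw`, and `tree_scaled` are hypothetical. Since splits depend only on the ordering of values within each feature, we would expect the two trees to agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the cardio dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)

# Stretch the columns so that the features live on very different scales
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# Standardized copy of the same features
X_scaled = StandardScaler().fit_transform(X_demo)

# Same model configuration, trained on raw vs. standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_scaled, y_demo)

# Fraction of samples for which both trees predict the same class
agreement = np.mean(tree_raw.predict(X_demo) == tree_scaled.predict(X_scaled))
print("Fraction of identical predictions:", agreement)
```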

### 5.1. Load the cleaned training set
We assume the training and testing sets were already cleaned in Sprint #1.

In [3]:
cardio_train = pd.read_csv('./datasets/cardio_clean_train.csv')

In [4]:
cardio_train.head()

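|   | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21875 | 2 | 171 | 74.0 | 120 | 80 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 15302 | 1 | 162 | 66.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 2 | 18079 | 1 | 166 | 69.0 | 120 | 80 | 1 | 2 | 0 | 0 | 1 | 0 |
| 3 | 21680 | 1 | 169 | 65.0 | 120 | 80 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 14368 | 1 | 155 | 80.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |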


In [5]:
# Reminder: value counts of each categorical variable
for cat_attribute in ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']:
    print(cardio_train[cat_attribute].value_counts())
    print()

1    35641
2    19079
Name: gender, dtype: int64

1    41031
2     7401
3     6288
Name: cholesterol, dtype: int64

1    46543
3     4173
2     4004
Name: gluc, dtype: int64

0    49918
1     4802
Name: smoke, dtype: int64

0    51793
1     2927
Name: alco, dtype: int64

1    43963
0    10757
Name: active, dtype: int64



### 5.2. Separate the features and the classes (target outcome)

In [6]:
cardio_train.columns

Index(['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol',
       'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

In [7]:
# store the target outcome into a numpy array
y_train = cardio_train['cardio'].values

In [8]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [9]:
y_train.shape

(54720,)

In [10]:
# overwrite the dataframe with only the features  
cardio_train = cardio_train.drop(columns=['cardio'])

In [11]:
cardio_train.head()

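|   | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21875 | 2 | 171 | 74.0 | 120 | 80 | 1 | 1 | 0 | 0 | 0 |
| 1 | 15302 | 1 | 162 | 66.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 |
| 2 | 18079 | 1 | 166 | 69.0 | 120 | 80 | 1 | 2 | 0 | 0 | 1 |
| 3 | 21680 | 1 | 169 | 65.0 | 120 | 80 | 1 | 1 | 0 | 0 | 0 |
| 4 | 14368 | 1 | 155 | 80.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 |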


In [12]:
cardio_train.shape

(54720, 11)

### 5.3. Separate the numerical and categorical features
Since we perform different preprocessing tasks (transformations) to _numerical features_ and _categorical ones_, let's split them into two different dataframes.

In [13]:
cardio_train.columns

Index(['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol',
       'gluc', 'smoke', 'alco', 'active'],
      dtype='object')

In [14]:
# numerical variables
num_vars = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']

# categorical binary variables
bin_vars = ['gender', 'smoke', 'alco', 'active']

# categorical variables
cat_vars = ['cholesterol', 'gluc']

In [15]:
## separating the features into specific dataset according to their type
cardio_train_num = cardio_train[num_vars]
cardio_train_bin = cardio_train[bin_vars]
cardio_train_cat = cardio_train[cat_vars]

### 5.4. Creating the Preprocessing Pipeline

In [16]:
from sklearn.impute import SimpleImputer

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median'))
])

bin_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))  # as the categories are numbers, we can use the SimpleImputer
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))
])



# (name, transformer, columns)
preprocessed_pipeline = ColumnTransformer([
    ('numerical', num_pipeline, num_vars),
    ('binary', bin_pipeline, bin_vars),
    ('categorical', cat_pipeline, cat_vars)
])
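To see what a `ColumnTransformer` built from imputers does, here is a minimal sketch on a tiny hand-made dataframe. The names `toy`, `toy_transformer`, and `toy_filled` are hypothetical and the values are made up for illustration. Note two things: each missing value is filled with the statistic of its own column, and the output columns follow the order of the transformer list, not the original column order of the dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (made-up values)
toy = pd.DataFrame({
    'gluc': [1.0, np.nan, 1.0],
    'age': [50.0, np.nan, 60.0],
    'smoke': [0.0, 0.0, np.nan],
})

# (name, transformer, columns), same layout as the pipeline above
toy_transformer = ColumnTransformer([
    ('numerical', SimpleImputer(strategy='median'), ['age']),
    ('binary', SimpleImputer(strategy='most_frequent'), ['smoke']),
    ('categorical', SimpleImputer(strategy='most_frequent'), ['gluc']),
])

# The result is a NumPy array with columns ordered as: age, smoke, gluc
toy_filled = toy_transformer.fit_transform(toy)
print(toy_filled)
# row 0 -> [50, 0, 1]
# row 1 -> [55, 0, 1]  (age filled with the median of 50 and 60, gluc with the mode)
# row 2 -> [60, 0, 1]  (smoke filled with the mode)
```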

## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Getting the independent variables (features) and classes (outcome)

In [17]:
X_train = preprocessed_pipeline.fit_transform(cardio_train)
X_train.shape

(54720, 11)

In [18]:
# we already have y_train
y_train.shape

(54720,)
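One design note: calling `fit_transform` on the whole training set means the imputer statistics are computed from all rows, including those that later act as validation folds during cross-validation. Whether this matters here depends on how many values are actually missing. A possible alternative, sketched below on synthetic data with artificially injected missing values (the names `X_demo`, `y_demo`, `model`, and `fold_accs` are hypothetical), is to put the imputer and the classifier into one `Pipeline` so that the imputer is re-fit inside each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the cardio dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

# Knock out roughly 5% of the entries to simulate missing values
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputer + model in one estimator: cross_val_score fits both on each training fold
model = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('tree', DecisionTreeClassifier(max_depth=5, random_state=42)),
])

fold_accs = cross_val_score(model, X_demo, y_demo, scoring='accuracy', cv=5)
print("Mean accuracy:", fold_accs.mean())
```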

### 6.2. Training the Models

In [19]:
# printing function
def display_scores(scores):
    print("Scores:", scores)
    print("\nMean:", scores.mean())
    print("Standard deviation:", scores.std())

#### **Decision Trees - No Regularization**

##### **Accuracy**

In [20]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt_accs = cross_val_score(dt, X_train, y_train, scoring="accuracy", cv=5)

display_scores(dt_accs)

Scores: [0.6370614  0.63112208 0.63395468 0.63578216 0.63039108]

Mean: 0.6336622807017545
Standard deviation: 0.002580186735588428
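A tree without any regularization can keep splitting until it memorizes the training data, which is one plausible reason for a modest cross-validated accuracy. The sketch below illustrates the effect on synthetic, deliberately noisy data (not the cardio dataset; `X_demo`, `y_demo`, `full_tree`, `train_acc`, and `cv_acc` are hypothetical names): accuracy on the data the tree was fit on is compared with the cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels randomly reassigned (label noise)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42)

# Accuracy on the very same data used for fitting
train_acc = full_tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy estimated on held-out folds
cv_acc = cross_val_score(full_tree, X_demo, y_demo, scoring='accuracy', cv=5).mean()

print("Training accuracy:", train_acc)
print("Cross-validated accuracy:", cv_acc)
```

A large gap between the two numbers is a typical sign of overfitting.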


#### **Decision Trees - Max Depth of 10**

##### **Accuracy**

In [21]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=10, random_state=42)
dt_accs = cross_val_score(dt, X_train, y_train, scoring="accuracy", cv=5)

display_scores(dt_accs)

Scores: [0.72423246 0.73346126 0.72669956 0.72012061 0.72423246]

Mean: 0.725749269005848
Standard deviation: 0.004396840196351594
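The two experiments above differ only in `max_depth`. A simple way to explore more values is to loop over a few candidate depths and collect the mean cross-validated accuracy for each. The following is a minimal sketch on synthetic data (not the cardio dataset); the candidate depths and the names `X_demo`, `y_demo`, and `mean_acc_by_depth` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=42
)

# Candidate depths (None means the tree grows without a depth limit)
mean_acc_by_depth = {}
for depth in [2, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    accs = cross_val_score(tree, X_demo, y_demo, scoring='accuracy', cv=5)
    mean_acc_by_depth[depth] = accs.mean()

for depth, acc in mean_acc_by_depth.items():
    print(f"max_depth={depth}: mean accuracy = {acc:.4f}")
```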
