### **D2APR: Aprendizado de Máquina e Reconhecimento de Padrões** (IFSP, Campinas) <br/>
**Prof**: Samuel Martins (Samuka) <br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>. <br/><br/>

#### Custom CSS style

In [None]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
    /* font-size: var(--jp-content-font-size1) !important; */
}

.dashed-box table {

}

.dashed-box tr {
    background-color: white !important;
}
        
.alt-tab {
    background-color: black;
    color: #ffc351;
    padding: 4px;
    font-size: 1em;
    font-weight: bold;
    font-family: monospace;
}
/* add your CSS styling here */
</style>

<span style='font-size: 2.5em'><b>Cardiovascular Disease 💔</b></span><br/>
<span style='font-size: 1.5em'>Predict cardiovascular diseases</span>

<span style="background-color: #ffc351; padding: 4px; font-size: 1em;"><b>Sprint #2</b></span>

<img src="./imgs/cardio.png" width=300/>

---



## Before starting this notebook
This Jupyter notebook is designed for **experimental and teaching purposes**. <br/>
Although it is (relatively) well organized, it aims to solve the _target problem_ by evaluating (and documenting) _different solutions_ for some steps of the **machine learning pipeline**; see the [***Machine Learning Project Checklist by xavecoding***](https://github.com/xavecoding/IFSP-CMP-D2APR-2021.2/blob/main/cheat-sheets/machine-learning-project-checklist_by_xavecoding.pdf). <br/>
We tried to keep this notebook literally a _notebook_, so it contains notes, drafts, comments, etc.<br/>

For teaching purposes, some parts of the notebook may be _overcommented_. Moreover, to simulate a real development scenario, we will divide our solution and experiments into **"sprints"** in which each sprint has some goals (e.g., perform _feature selection_, train more ML models, ...). <br/>
The **sprint goal** will be stated at the beginning of the notebook.

A ***final notebook*** (or any other kind of presentation) that compiles and summarizes all sprints — the target problem, solutions, and findings — should be created later.

#### Conventions

<ul>
    <li>💡 indicates a tip. </li>
    <li> ⚠️ indicates a warning message. </li>
    <li><span class='alt-tab'>alt tab</span> indicates extra content (<i>e.g.</i>, slides) that explains a given concept.</li>
</ul>

---

## 🎯 Sprint Goals
- Preprocess the data
- Evaluate on the training set: KNN, Logistic Regression, and Polynomial Logistic Regression
---

### 0. Imports and default settings for plotting

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)

## 🛠️ 5. Prepare the Data

**Preprocessing tasks**
- Fill in missing values (imputation)
- Add new features
- Feature Scaling
- One-Hot Encoding

### 5.1. Load the cleaned training set
Let's use the training and testing sets that were already cleaned in Sprint #1.

In [None]:
cardio_train = pd.read_csv('./datasets/cardio_clean_train.csv')

In [None]:
cardio_train.head()

In [None]:
# Quick reminder of the categories (and their counts) in each categorical variable
for cat_attribute in ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']:
    print(cardio_train[cat_attribute].value_counts())
    print()

### 5.2. Separate the features and the classes (target outcome)

In [None]:
cardio_train.columns

In [None]:
# store the target outcome in a NumPy array
y_train = cardio_train['cardio'].values

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
# overwrite the dataframe with only the features  
cardio_train = cardio_train.drop(columns=['cardio'])

In [None]:
cardio_train.head()

In [None]:
cardio_train.shape

### 5.3. Separate the numerical and categorical features
Since we apply different preprocessing tasks (transformations) to _numerical features_ and _categorical ones_, let's split them into two different dataframes.

In [None]:
cardio_train.columns

In [None]:
# numerical variables
num_vars = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']

# categorical binary variables
bin_vars = ['gender', 'smoke', 'alco', 'active']

# categorical variables
cat_vars = ['cholesterol', 'gluc']

In [None]:
# separate the features into type-specific dataframes
cardio_train_num = cardio_train[num_vars]
cardio_train_bin = cardio_train[bin_vars]
cardio_train_cat = cardio_train[cat_vars]

### 5.4. Creating Preprocessing Pipelines

#### **Standard Preprocessing Pipeline**
Not suitable for polynomial-based models.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('robust_scaler', RobustScaler())
])

bin_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))  # as the categories are numbers, we can use the SimpleImputer
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # as the categories are numbers, we can use the SimpleImputer
    ('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))
])



# (name, transformer, columns)
preprocessed_pipeline = ColumnTransformer([
    ('numerical', num_pipeline, num_vars),
    ('binary', bin_pipeline, bin_vars),
    ('categorical', cat_pipeline, cat_vars)
])
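
💡 To see what the individual steps of these pipelines do, here is a minimal sketch on tiny, made-up data. The dataframes `toy_num`, `toy_cat_train`, and `toy_cat_new` (and their values) are assumptions for illustration only; they are not taken from the cardio dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# --- numerical branch: median imputation, then robust scaling ---
# hypothetical column with one missing value and one outlier
toy_num = pd.DataFrame({'weight': [1.0, 2.0, np.nan, 4.0, 100.0]})

toy_num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # NaN -> median of observed values (3.0)
    ('robust_scaler', RobustScaler())               # (x - median) / IQR, here (x - 3) / 2
])
scaled = toy_num_pipeline.fit_transform(toy_num)
print(scaled.ravel())

# --- categorical branch: most-frequent imputation, then one-hot encoding ---
toy_cat_train = pd.DataFrame({'cholesterol': [1, 2, 3, 1], 'gluc': [1, 1, 2, 3]})
toy_cat_new = pd.DataFrame({'cholesterol': [2, 4], 'gluc': [3, 1]})  # 4 was never seen in training

toy_cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))
])
encoded_train = toy_cat_pipeline.fit_transform(toy_cat_train).toarray()
encoded_new = toy_cat_pipeline.transform(toy_cat_new).toarray()

print(encoded_train.shape)  # 3 columns for cholesterol + 3 columns for gluc
print(encoded_new)          # the unseen category becomes an all-zero block
```

Note how the outlier (100) barely affects the scaling of the other values, and how `handle_unknown='ignore'` encodes an unseen category as zeros instead of raising an error.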

#### **Preprocessing Pipeline for Polynomial Logistic Regression**

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

pol_num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('poly_feat_transformer', PolynomialFeatures(include_bias=False)),  # default degree = 2
    ('robust_scaler', RobustScaler())
])

bin_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent'))  # as the categories are numbers, we can use the SimpleImputer
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # as the categories are numbers, we can use the SimpleImputer
    ('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))
])


# (name, transformer, columns)
polynomial_preprocessed_pipeline = ColumnTransformer([
    ('numerical', pol_num_pipeline, num_vars),
    ('binary', bin_pipeline, bin_vars),
    ('categorical', cat_pipeline, cat_vars)
])

## 🏋️‍♀️ 6. Train ML Algorithms

### 6.1. Getting the features (independent variables) and the classes (target outcome)

In [None]:
# standard pipeline (for KNN and Logistic Regression)


In [None]:
# preprocessing pipeline for polynomial-based methods (Polynomial Logistic Regression)


In [None]:
# we already have y_train

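One possible way to fill in the cells above is sketched here on synthetic data, so that it runs without the CSV file. The helpers `make_toy_cardio` and `build_preprocessor` are hypothetical names introduced only for this sketch; they rebuild the two column transformers of Section 5.4 with the same steps. The generated values are random placeholders, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, RobustScaler

num_vars = ['age', 'height', 'weight', 'ap_hi', 'ap_lo']
bin_vars = ['gender', 'smoke', 'alco', 'active']
cat_vars = ['cholesterol', 'gluc']


def make_toy_cardio(n_rows=60, seed=0):
    """Hypothetical stand-in for the features of cardio_train (random values)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'age': rng.normal(53, 7, n_rows),
        'height': rng.normal(165, 8, n_rows),
        'weight': rng.normal(74, 14, n_rows),
        'ap_hi': rng.normal(127, 17, n_rows),
        'ap_lo': rng.normal(81, 10, n_rows),
        'gender': rng.integers(1, 3, n_rows),
        'smoke': rng.integers(0, 2, n_rows),
        'alco': rng.integers(0, 2, n_rows),
        'active': rng.integers(0, 2, n_rows),
        # cycle through 1, 2, 3 so that every category is present
        'cholesterol': np.arange(n_rows) % 3 + 1,
        'gluc': np.arange(n_rows) % 3 + 1,
    })


def build_preprocessor(polynomial=False):
    """Rebuild the column transformer of Section 5.4 (standard or polynomial)."""
    num_steps = [('imputer', SimpleImputer(strategy='median'))]
    if polynomial:
        num_steps.append(('poly_feat_transformer', PolynomialFeatures(include_bias=False)))
    num_steps.append(('robust_scaler', RobustScaler()))

    bin_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent'))])
    cat_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('one-hot-encoding', OneHotEncoder(handle_unknown='ignore'))
    ])
    return ColumnTransformer([
        ('numerical', Pipeline(num_steps), num_vars),
        ('binary', bin_pipe, bin_vars),
        ('categorical', cat_pipe, cat_vars)
    ])


toy_features = make_toy_cardio()

# standard pipeline (for KNN and Logistic Regression)
X_train = build_preprocessor().fit_transform(toy_features)

# preprocessing pipeline for polynomial-based methods
X_train_pol = build_preprocessor(polynomial=True).fit_transform(toy_features)

print(X_train.shape)      # 5 numerical + 4 binary + 6 one-hot columns
print(X_train_pol.shape)  # 20 polynomial + 4 binary + 6 one-hot columns
```

In the notebook itself, the analogous calls would presumably be `preprocessed_pipeline.fit_transform(cardio_train)` and `polynomial_preprocessed_pipeline.fit_transform(cardio_train)`.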

### 6.2. Training the Models

In [None]:
# printing function
def display_scores(scores):
    print("Scores:", scores)
    print("\nMean:", scores.mean())
    print("Standard deviation:", scores.std())

#### **KNN**

##### **Accuracy**
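
A minimal sketch of how the accuracy could be estimated, assuming 5-fold cross-validation (the notebook does not state the evaluation protocol, but `display_scores` expects several scores per model). `X_toy` and `y_toy` are synthetic placeholders for the preprocessed features and the target, so the printed numbers say nothing about the cardio problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# ASSUMPTION: synthetic stand-in for the preprocessed X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=15,
                                   n_informative=5, random_state=42)

knn_clf = KNeighborsClassifier()  # default: n_neighbors=5

# one accuracy value per fold
knn_scores = cross_val_score(knn_clf, X_toy, y_toy, cv=5, scoring='accuracy')

print("Scores:", knn_scores)
print("Mean:", knn_scores.mean())
print("Standard deviation:", knn_scores.std())
```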

#### **Logistic Regression**

##### **Accuracy**
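
A minimal sketch for Logistic Regression, again assuming 5-fold cross-validation and using synthetic placeholders (`X_toy`, `y_toy`) instead of the real preprocessed data. The value `max_iter=1000` is an arbitrary choice made here to give the solver room to converge; it is not a setting taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# ASSUMPTION: synthetic stand-in for the preprocessed X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=15,
                                   n_informative=5, random_state=42)

log_reg = LogisticRegression(max_iter=1000)

# one accuracy value per fold
log_reg_scores = cross_val_score(log_reg, X_toy, y_toy, cv=5, scoring='accuracy')

print("Scores:", log_reg_scores)
print("Mean:", log_reg_scores.mean())
print("Standard deviation:", log_reg_scores.std())
```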

#### **Polynomial Logistic Regression (degree=2)**

##### **Accuracy**
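
A minimal sketch for the polynomial variant, under the same assumptions as before: 5-fold cross-validation and synthetic placeholder data (`X_toy`, `y_toy`) with five numerical features. Here the polynomial expansion and scaling are chained with the classifier in a single `Pipeline`, which is one possible design; the notebook may instead reuse the already transformed polynomial feature matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# ASSUMPTION: synthetic stand-in with 5 numerical features and a binary target
X_toy, y_toy = make_classification(n_samples=300, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=42)

pol_log_reg = Pipeline([
    ('poly_feat_transformer', PolynomialFeatures(degree=2, include_bias=False)),
    ('robust_scaler', RobustScaler()),
    ('log_reg', LogisticRegression(max_iter=1000))  # max_iter chosen arbitrarily for this sketch
])

# the whole pipeline is refitted on the training part of each fold
pol_scores = cross_val_score(pol_log_reg, X_toy, y_toy, cv=5, scoring='accuracy')

print("Scores:", pol_scores)
print("Mean:", pol_scores.mean())
print("Standard deviation:", pol_scores.std())
```

Keeping the transformers inside the pipeline means the scaler statistics are computed only from the training part of each fold.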