# *Clean modeling*

**Author:** [Kata Ferenc](https://github.com/ferenckata) ([k.t.ferenc@ncmm.uio.no](mailto:k.t.ferenc@ncmm.uio.no))

**Achievement:** *Using three different machine learning models to predict diabetes from health data*

## Introduction

*Using the Kaggle dataset [Diabetes Health Indicators](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset), we trained machine learning models to predict diabetes from health indicators.*

# Reproducibility and code formatting

In [1]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in jupyter lab.
%load_ext lab_black

# For automatic code formatting in jupyter notebook
%load_ext nb_black

# For better logging
%load_ext rich

# Analysis

In [2]:
# Imports
# -------

# System
import sys

# Logging
import logging

# Rich logging in jupyter
from rich.logging import RichHandler

FORMAT = "%(message)s"
logging.basicConfig(
    level="INFO", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)
log = logging.getLogger("rich")

# Other packages
# Data processing
import pandas as pd

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Custom functions from module
import workshop.model_utils as utils

# Random seed to fix the random generators
RANDOM_SEED = 42

We load the training data and select a few features that we think might be useful.

In [3]:
datapath = "../data/train/diabetes_binary_train.csv.zip"
data_in = pd.read_csv(datapath)

In [4]:
features_of_interest = ["Diabetes_binary", "Age", "BMI", "Sex", "HighBP", "HighChol"]
X_train, X_valid, y_train, y_valid = utils.prepare_data(
    data_in, features_of_interest, "Diabetes_binary", 0.2, seed=RANDOM_SEED
)
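
The helper `utils.prepare_data` lives in the workshop module and is not shown in this notebook. As a rough illustration only, a function with the same call shape could look like the sketch below. The name `prepare_data_sketch`, the stratified split, and the toy data frame are all assumptions made for this example, not the module's actual implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data_sketch(data, features, target, valid_fraction, seed=None):
    """Hypothetical stand-in for utils.prepare_data; the real helper may differ."""
    # Keep only the requested columns (the target is expected to be among them).
    subset = data[list(features)]
    # Separate the predictors from the target column.
    X = subset.drop(columns=[target])
    y = subset[target]
    # Stratifying on the target is an assumption; it keeps class ratios similar
    # in the training and validation parts.
    return train_test_split(
        X, y, test_size=valid_fraction, random_state=seed, stratify=y
    )


# Toy data with the same column names as the notebook (values are made up).
toy = pd.DataFrame(
    {
        "Diabetes_binary": [0, 1] * 10,
        "Age": range(20),
        "BMI": [20.0 + i for i in range(20)],
    }
)
X_tr, X_va, y_tr, y_va = prepare_data_sketch(
    toy, ["Diabetes_binary", "Age", "BMI"], "Diabetes_binary", 0.2, seed=42
)
print(X_tr.shape, X_va.shape)
```

Dropping the target from the predictors inside the helper is what allows `"Diabetes_binary"` to appear in the feature list without leaking the label into the model inputs.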

In [5]:
svm_model = SVC(kernel="poly", degree=5, random_state=RANDOM_SEED)
fitted_svm_model = utils.train_eval_model(svm_model, X_train, y_train)

In [6]:
utils.test_eval_model(fitted_svm_model, X_valid, y_valid)

Maybe we can make the model better by providing all features to learn from.

In [7]:
# Use every column (including the target) instead of the hand-picked subset
features_of_interest_all = data_in.columns.tolist()
X_train, X_valid, y_train, y_valid = utils.prepare_data(
    data_in, features_of_interest_all, "Diabetes_binary", 0.2, seed=RANDOM_SEED
)

In [8]:
svm_model_poly = SVC(kernel="poly", degree=5, random_state=RANDOM_SEED)
fitted_svm_model_poly = utils.train_eval_model(svm_model_poly, X_train, y_train)
utils.test_eval_model(fitted_svm_model_poly, X_valid, y_valid)

Furthermore, we can train an even more flexible model.

In [9]:
svm_model_rbf = SVC(kernel="rbf", gamma="scale", random_state=RANDOM_SEED)
fitted_svm_model_rbf = utils.train_eval_model(svm_model_rbf, X_train, y_train)

A more flexible model can overfit the training data, which may reduce validation accuracy.
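
One way to check for this effect is to compare training and validation accuracy for models of different flexibility. The sketch below uses synthetic data from `make_classification` rather than the diabetes dataset, and the helper `train_valid_gap` and the chosen `gamma` values are illustrative assumptions. A large gap between the two scores suggests the model has fit noise in the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the notebook's real data is not used here.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)


def train_valid_gap(model):
    """Fit the model and return (train accuracy, validation accuracy, gap)."""
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    valid_acc = model.score(X_va, y_va)
    return train_acc, valid_acc, train_acc - valid_acc


# A larger gamma makes the RBF kernel narrower and the model more flexible.
results = {
    "rbf, gamma='scale'": train_valid_gap(SVC(kernel="rbf", gamma="scale")),
    "rbf, gamma=10": train_valid_gap(SVC(kernel="rbf", gamma=10.0)),
}
for name, (train_acc, valid_acc, gap) in results.items():
    print(f"{name}: train={train_acc:.3f} valid={valid_acc:.3f} gap={gap:.3f}")
```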

In [10]:
utils.test_eval_model(fitted_svm_model_rbf, X_valid, y_valid)

Including more features did not make the model much better. A feature-importance analysis would show which features contribute most to the predictions.
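
One possible way to estimate which features matter is permutation importance, which measures how much the validation score drops when a single feature column is shuffled. The sketch below is a minimal example on synthetic data with placeholder feature names; it is not part of the original analysis, and applying it to the diabetes data would mean passing the fitted model with `X_valid` and `y_valid` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: with shuffle=False the informative features come first.
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
X = pd.DataFrame(X, columns=feature_names)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

model = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

# Shuffle each column several times and record the mean drop in accuracy.
result = permutation_importance(model, X_va, y_va, n_repeats=10, random_state=42)
ranking = pd.Series(result.importances_mean, index=feature_names).sort_values(
    ascending=False
)
print(ranking)
```

Permutation importance works with any fitted estimator, which makes it convenient for SVMs with non-linear kernels, where no coefficients are available to inspect directly.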

### Summary

In [11]:
model_list_names = ["SVM_all_poly", "SVM_all_rbf"]
model_list = [fitted_svm_model_poly, fitted_svm_model_rbf]
utils.compare_models(model_list_names, model_list, X_train, y_train, X_valid, y_valid)
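
The implementation of `utils.compare_models` is not shown here. As a hedged sketch of what such a comparison might do, the function below collects training and validation accuracy for each fitted model into one table. The name `compare_models_sketch`, the choice of accuracy as the metric, and the synthetic data are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def compare_models_sketch(names, models, X_train, y_train, X_valid, y_valid):
    """Hypothetical stand-in for utils.compare_models; the real helper may differ."""
    rows = []
    for name, model in zip(names, models):
        rows.append(
            {
                "model": name,
                "train_accuracy": model.score(X_train, y_train),
                "valid_accuracy": model.score(X_valid, y_valid),
            }
        )
    return pd.DataFrame(rows).set_index("model")


# Synthetic stand-in data and two already-fitted models.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
poly = SVC(kernel="poly", degree=5).fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

summary = compare_models_sketch(
    ["SVM_poly", "SVM_rbf"], [poly, rbf], X_tr, y_tr, X_va, y_va
)
print(summary)
```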

# Watermark

This should be the last section of your notebook, since it watermarks your entire environment.

When committing this notebook, remember to restart the kernel, rerun the notebook, and run this cell last to watermark the environment.

In [12]:
%watermark -gb -iv -m -v

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 7.31.1

Compiler    : GCC 10.3.0
OS          : Linux
Release     : 5.19.9-200.fc36.x86_64
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

Git hash: ee9e022c0b064cbe0274f8e31f07e2a829cffb37

Git branch: splits_analysis

logging: 0.5.1.2
pandas : 1.4.3
sys    : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0]

