# Machine Learning 101 - Classification

## How to Build and Interpret Classification Models

Author: Kris Barbier

### Overview:

This notebook outlines the steps to create and interpret different kinds of classification models using the scikit-learn library.

### Classification Overview:

- Classification is another common type of problem to solve using machine learning algorithms. Where regression models predict continuous numerical values, a classification model will try to predict a label (class) that pertains to the data present.
- In this notebook, we will explore binary classification, in which only 2 labels can be assigned to an observation. Other problems call for multi-class classification, in which an observation can take one of more than 2 labels.
- We will follow these steps in order to complete our classification models:
    - Import needed libraries and read in data set.
    - Quickly preprocess data for modeling (for an in-depth look at how to create a preprocessor, see the preprocessing notebook in the repository).
    - Build different types of classification models, including logistic regression and random forests.
    - Interpret the performance of the models using different metrics, including F1 scores, accuracy, and false positive/negative rates.

## Classification Models in Code

### Import Libraries and Read in Data

In [1]:
#Common imports for data science
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #For visualizations
import seaborn as sns #For visualizations

#Imports for machine learning 
from sklearn.model_selection import train_test_split  #For validation split

#Imports for feature transformations
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#Imports for building preprocessing object
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

#Imports for classification models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#Imports for model metrics
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

#Set sklearn output to pandas
from sklearn import set_config
set_config(transform_output = 'pandas')

#Mute warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read in data set for classification models
file_path = 'Data/stroke.csv'
df = pd.read_csv(file_path)

#Preview dataset
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,1192,Female,31,0,0,No,Govt_job,Rural,70.66,27.2,never smoked,0
1,77,Female,13,0,0,No,children,Rural,85.81,18.6,Unknown,0
2,59200,Male,18,0,0,No,Private,Urban,60.56,33.0,never smoked,0
3,24905,Female,65,0,0,Yes,Private,Urban,205.77,46.0,formerly smoked,1
4,24257,Male,4,0,0,No,children,Rural,90.42,16.2,Unknown,0


### Preprocess Data

- The first step we will take in preprocessing this data is to check for missing values, and then decide on an appropriate imputation strategy.

In [3]:
#Use .info() to check data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 1137 non-null   int64  
 1   gender             1137 non-null   object 
 2   age                1137 non-null   object 
 3   hypertension       1137 non-null   int64  
 4   heart_disease      1137 non-null   int64  
 5   ever_married       1137 non-null   object 
 6   work_type          1137 non-null   object 
 7   Residence_type     1137 non-null   object 
 8   avg_glucose_level  1137 non-null   float64
 9   bmi                1085 non-null   float64
 10  smoking_status     1137 non-null   object 
 11  stroke             1137 non-null   int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 106.7+ KB


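- The "bmi" column has missing values (1085 non-null out of 1137 rows). We can impute these numerical values with the column mean using `SimpleImputer`. Note also that "age" is stored as `object` rather than a numeric type, so the preprocessor built below will treat it as a categorical column.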
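
- The `.info()` output lists "age" as `object`, and the one-hot encoded output later in this notebook includes a column named `age_*82`, which suggests at least one malformed entry. The sketch below shows one way such a column could be coerced to numeric before modeling. It uses a small hypothetical frame (`toy`), not the stroke data, and is not a step the notebook itself performs.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; '*82' mimics a malformed entry.
toy = pd.DataFrame({"age": ["31", "13", "*82", "65"]})

# Coerce to numeric; values that cannot be parsed become NaN instead of raising.
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

print(toy["age"].dtype)         # dtype after coercion
print(toy["age"].isna().sum())  # number of entries that could not be parsed
```

- After a coercion like this, the column would be picked up by `select_dtypes('number')` and the resulting NaN values would be handled by the mean imputer, rather than each distinct age becoming its own one-hot column.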

In [4]:
#Define X and y
y = df['stroke']
X = df.drop(columns = 'stroke')

In [5]:
#Perform validation split
#Setting a random state will make this reproducible in the future
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Verify the split is correct
X_train.head()  #Note the absence of the stroke column from the X_train data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
72,23427,Female,81,0,0,Yes,Private,Rural,91.82,36.9,Unknown
1091,68171,Male,61,0,0,Yes,Self-employed,Urban,116.78,39.8,formerly smoked
381,50536,Female,62,0,1,Yes,Govt_job,Urban,124.37,28.3,never smoked
760,35999,Female,52,0,0,Yes,Private,Urban,86.85,23.8,formerly smoked
433,47427,Male,49,0,0,Yes,Self-employed,Urban,70.73,27.3,formerly smoked
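
- The stroke class is much rarer than the non-stroke class (the classification reports later in this notebook list 749 versus 103 training rows). With a target this imbalanced, a plain random split can leave the two partitions with slightly different class ratios. The sketch below, on hypothetical toy arrays (`X_toy`, `y_toy`), shows the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both partitions. The notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 88 negatives and 12 positives (illustrative numbers only).
y_toy = np.array([0] * 88 + [1] * 12)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# Share of the positive class in each partition.
print(y_tr.mean(), y_te.mean())
```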


In [6]:
##Create numeric pipeline
#Define numeric columns
num_cols = X_train.select_dtypes('number').columns

#Instantiate transformers
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

#Set numeric pipeline
num_pipe = make_pipeline(impute_mean, scaler)

#Create tuple for column transformer
num_tuple = ("Numeric", num_pipe, num_cols)

In [7]:
##Create categorical pipeline
#Define categorical columns
cat_cols = X_train.select_dtypes('object').columns

#Instantiate transformers
impute_missing = SimpleImputer(strategy='constant', fill_value='Missing')
cat_encode = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

#Set categorical pipeline
cat_pipe = make_pipeline(impute_missing, cat_encode)

#Create tuple for column transformer
cat_tuple = ("Categorical", cat_pipe, cat_cols)

In [8]:
#Finalize preprocessing object
preprocessor = ColumnTransformer([num_tuple, cat_tuple], verbose_feature_names_out=False)

In [9]:
#Fit preprocessor on training data
preprocessor.fit(X_train)

In [10]:
#Transform training and testing data
X_train_tf = preprocessor.transform(X_train)
X_test_tf = preprocessor.transform(X_test)

#View preprocessed training data
X_train_tf.head()

Unnamed: 0,id,hypertension,heart_disease,avg_glucose_level,bmi,gender_Female,gender_Male,gender_Other,age_*82,age_0,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
72,-0.597688,-0.368782,-0.284988,-0.340638,1.052941,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1091,1.509753,-0.368782,-0.284988,0.177741,1.449535,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
381,0.679145,-0.368782,3.508917,0.335373,-0.123165,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
760,-0.005548,-0.368782,-0.284988,-0.443857,-0.73857,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
433,0.532711,-0.368782,-0.284988,-0.778643,-0.259922,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


- The data is now preprocessed and ready to be modeled for classification.

## Model 1: Logistic Regression

- The first model we will run on the data is a logistic regression. Like linear regression, it computes a weighted sum of the features, but it passes that sum through the logistic (sigmoid) function to produce a probability between 0 and 1, and then assigns a class based on that probability.
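
- To make the link between the linear score and the predicted class concrete, the sketch below fits a logistic regression on hypothetical random data (`X_toy`, `y_toy`), recomputes the class-1 probability by hand from the fitted coefficients, and compares it with `predict_proba`. It assumes the binary case with scikit-learn's default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two features and a noisy linear decision rule.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Linear score: weighted sum of the features plus the intercept.
z = X_toy @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps the score to a probability for class 1.
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X_toy)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# predict() corresponds to thresholding that probability at 0.5.
labels_manual = (p_manual > 0.5).astype(int)
print(np.array_equal(labels_manual, model.predict(X_toy)))
```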

In [11]:
#Instantiate the model
log_reg = LogisticRegression()

In [12]:
#Fit model on training data
log_reg.fit(X_train_tf, y_train)

In [13]:
#Get predictions for training data
y_pred_train = log_reg.predict(X_train_tf)

#Get predictions for testing data
y_pred_test = log_reg.predict(X_test_tf)

In [18]:
#Evaluate training results
train_results = classification_report(y_train, y_pred_train)
print(train_results)

              precision    recall  f1-score   support

           0       0.89      0.99      0.94       749
           1       0.62      0.15      0.24       103

    accuracy                           0.89       852
   macro avg       0.76      0.57      0.59       852
weighted avg       0.86      0.89      0.85       852



In [19]:
#Evaluate testing results
test_results = classification_report(y_test, y_pred_test)
print(test_results)

              precision    recall  f1-score   support

           0       0.89      0.98      0.93       251
           1       0.33      0.06      0.10        34

    accuracy                           0.87       285
   macro avg       0.61      0.52      0.52       285
weighted avg       0.82      0.87      0.83       285
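
- The test report above shows high accuracy (0.87) but very low recall for the stroke class (0.06). To see how these figures relate, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are back-derived from the rounded report (34 stroke rows and 251 non-stroke rows) and are the only whole-number values consistent with it.

```python
# Test-set counts for the stroke class (label 1), back-derived from the rounded
# classification report; the notebook itself does not print a confusion matrix.
tp, fp, fn, tn = 2, 4, 32, 247

precision = tp / (tp + fp)            # of predicted strokes, how many were real
recall = tp / (tp + fn)               # of real strokes, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)
false_positive_rate = fp / (fp + tn)  # non-stroke rows flagged as stroke
false_negative_rate = fn / (fn + tp)  # stroke rows the model missed

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f} "
      f"FPR={false_positive_rate:.3f} FNR={false_negative_rate:.3f}")
```

- Under these counts the model misses 32 of the 34 stroke cases, which accuracy alone hides because the non-stroke class dominates the test set.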



## Model 2: Random Forest

- The second model we will run on the data is a random forest, an ensemble of decision trees, each trained on a bootstrap sample of the training data, whose predictions are combined into a single class prediction.

In [20]:
#Instantiate model
rf = RandomForestClassifier()

In [21]:
#Fit model on training data
rf.fit(X_train_tf, y_train)

In [22]:
#Get predictions for training data
y_pred_train = rf.predict(X_train_tf)

#Get predictions for testing data
y_pred_test = rf.predict(X_test_tf)

In [23]:
#Evaluate training results
train_results = classification_report(y_train, y_pred_train)
print(train_results)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       749
           1       1.00      1.00      1.00       103

    accuracy                           1.00       852
   macro avg       1.00      1.00      1.00       852
weighted avg       1.00      1.00      1.00       852



In [24]:
#Evaluate testing results
test_results = classification_report(y_test, y_pred_test)
print(test_results)

              precision    recall  f1-score   support

           0       0.88      0.99      0.93       251
           1       0.00      0.00      0.00        34

    accuracy                           0.87       285
   macro avg       0.44      0.50      0.47       285
weighted avg       0.77      0.87      0.82       285
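
- The reports above show perfect scores on the training data but 0.00 recall for the stroke class on the test data, a pattern consistent with overfitting on an imbalanced target. Two settings that are commonly tried in this situation are limiting tree depth (`max_depth`) and reweighting classes (`class_weight='balanced'`). The sketch below compares a default and a constrained forest on synthetic data from `make_classification`; the data, the value `max_depth=5`, and the variable names are assumptions for illustration, and nothing here shows that these settings would help on the stroke data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 12% positives); not the stroke data set.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=42
)

# Default forest: fully grown trees, all classes weighted equally.
rf_default = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Constrained forest: shallower trees and class weights inversely
# proportional to class frequency.
rf_constrained = RandomForestClassifier(
    max_depth=5, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)

# Compare positive-class recall on training and testing data.
for name, model in [("default", rf_default), ("constrained", rf_constrained)]:
    train_recall = recall_score(y_tr, model.predict(X_tr))
    test_recall = recall_score(y_te, model.predict(X_te))
    print(f"{name}: train recall={train_recall:.2f}, test recall={test_recall:.2f}")
```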

