# Machine Learning 101 - Classification

## How to Build and Interpret Classification Models

Author: Kris Barbier

### Overview:

This notebook outlines the steps to create and interpret different kinds of classification models using the scikit-learn library.

### Classification Overview:

- Classification is another common type of problem solved with machine learning algorithms. Where a regression model predicts a continuous numerical value, a classification model predicts a label (class) for each observation.
- In this notebook, we will explore binary classification, in which only 2 labels can be assigned to an observation. Other problems require multi-class classification, in which more than 2 labels are needed to classify the data.
- We will follow these steps to build our classification models:
    - Import needed libraries and read in data set.
    - Quickly preprocess data for modeling (for an in-depth look at how to create a preprocessor, see the preprocessing notebook in the repository).
    - Build different types of classification models, including logistic regression and random forests.
    - Interpret the performance of the models using different metrics, including F1 scores, accuracy, and false positive/negative rates.
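
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the stroke pipeline built later in the notebook: the data from `make_classification`, the variable names, and the settings (`random_state=42`, `max_iter=1000`, `n_estimators=100`) are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (assumption: stands in for a real, preprocessed data set)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Hold out a test set; stratify keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Two model types: a scaled logistic regression and a random forest
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

The false positive rate is the share of true negatives that the model labels positive, and the false negative rate is the share of true positives that it misses. Reading these alongside accuracy and F1 gives a fuller picture than any single number.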

## Classification Models in Code

### Import Libraries and Read in Data

In [1]:
#Common imports for data science
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #For visualizations
import seaborn as sns #For visualizations

#Imports for machine learning 
from sklearn.model_selection import train_test_split  #For validation split

#Imports for feature transformations
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#Imports for building preprocessing object
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

#Imports for classification models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

#Imports for model metrics (classification metrics, not regression metrics)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

#Set sklearn output to pandas
from sklearn import set_config
set_config(transform_output='pandas')

#Mute warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Read in data set for classification models
file_path = 'Data/stroke.csv'
df = pd.read_csv(file_path)

#Preview dataset
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1192 | Female | 31 | 0 | 0 | No | Govt_job | Rural | 70.66 | 27.2 | never smoked | 0 |
| 1 | 77 | Female | 13 | 0 | 0 | No | children | Rural | 85.81 | 18.6 | Unknown | 0 |
| 2 | 59200 | Male | 18 | 0 | 0 | No | Private | Urban | 60.56 | 33.0 | never smoked | 0 |
| 3 | 24905 | Female | 65 | 0 | 0 | Yes | Private | Urban | 205.77 | 46.0 | formerly smoked | 1 |
| 4 | 24257 | Male | 4 | 0 | 0 | No | children | Rural | 90.42 | 16.2 | Unknown | 0 |
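
Before modeling, it can help to check how the target labels are distributed, since accuracy is hard to interpret when one class dominates. The sketch below rebuilds only the `stroke` column from the five preview rows, so the counts describe the preview and not the full data set. The names `preview_stroke`, `class_counts`, and `class_shares` are made up for this example; on the real data the same calls would presumably run against `df['stroke']`.

```python
import pandas as pd

# Target values copied from the five preview rows (not the full data set)
preview_stroke = pd.Series([0, 0, 0, 1, 0], name="stroke")

# Absolute number of observations per class
class_counts = preview_stroke.value_counts()

# Share of observations per class
class_shares = preview_stroke.value_counts(normalize=True)

print(class_counts)
print(class_shares)
```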


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 1137 non-null   int64  
 1   gender             1137 non-null   object 
 2   age                1137 non-null   object 
 3   hypertension       1137 non-null   int64  
 4   heart_disease      1137 non-null   int64  
 5   ever_married       1137 non-null   object 
 6   work_type          1137 non-null   object 
 7   Residence_type     1137 non-null   object 
 8   avg_glucose_level  1137 non-null   float64
 9   bmi                1085 non-null   float64
 10  smoking_status     1137 non-null   object 
 11  stroke             1137 non-null   int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 106.7+ KB
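
The summary above shows two things worth handling before modeling: `bmi` has 1085 non-null values out of 1137 rows (52 missing), and `age` is stored as `object` rather than as a numeric type. The sketch below shows one way to inspect both issues on a small made-up frame. The frame `toy` and its values, including the `'unknown'` string in `age`, are hypothetical and only stand in for whatever causes the `object` dtype in the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (assumption: not taken from stroke.csv)
toy = pd.DataFrame({
    "age": ["31", "13", "unknown", "65"],
    "bmi": [27.2, np.nan, 33.0, 46.0],
})

# Count missing values per column before any conversion
missing_before = toy.isna().sum()

# Convert age to numeric; entries that cannot be parsed become NaN
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Count again: unparseable ages now show up as missing values
missing_after = toy.isna().sum()

print(toy.dtypes)
print(missing_after)
```

Coercing with `errors="coerce"` turns bad entries into missing values, which an imputer such as `SimpleImputer` can then fill as part of preprocessing.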


### Preprocess Data