# Question  1. Look up SMOTE oversampling
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOT
E.html .
a. Describe what it is in your own words in markdown.
b. Use this technique with the diabetes dataset. Comment on the model
performance compared to other methods


A: SMOTE is a resampling technique that performs oversampling on the minority class.  This means that the minority sample-size is increased to make the distribution between classes more even.  It creates (or synthesizes) new data points for the minority class using the K-nearest neighbor technique. 

In [5]:
#import necessary modules
import pandas as pd
import numpy as np

In [6]:
#read in diabetes_csv
diabetes_df = pd.read_csv('../week_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [9]:
X = diabetes_df.drop('Outcome',axis = 1)
y = diabetes_df['Outcome']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#Split the data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

#Resample the training data with SMOTE
from imblearn.over_sampling import SMOTE 

sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train_scaler, y_train)

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [10]:
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.7268518518518519

This method has an accuracy score of 0.726 which is slightly less accurate than using the RandomOversampler that we used in class which had an accuracy score of 0.74.  This is, however, more accurate than when we did Logistic Regression without resampling and when we used K nearest neighbor without resampling. 

In [13]:
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)

In [14]:
# Get the TN (true negatives), TP (true positives), FN (false negatives) and FP (false positives) from the matrix
TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

# Calculate accuracy
accuracy = (TN+TP)/(TN+TP+FP+FN)
print("accuracy", accuracy)

# Calculate sensitivity
sensitivity = TP/(TP+FN)
print("sensitivity", sensitivity)

# Calculate specificity
specificity = TN/(TN+FP)
print("specificity", specificity)

#Calculate precision
precision = TP/(TP + FP)
print("precision", precision)

accuracy 0.7337662337662337
sensitivity 0.7037037037037037
specificity 0.75
precision 0.6031746031746031


This has a much higher sensitivity/recall score than using Logistic Regression without resampling and than using KNeighorsClassifier.  This is really important for the Diabetes dataset. So, although accuracy improved only slightly with SMOTE, recall improved greatly (from about 0.4 to over 0.7).  

# Question 2:  Create a list of preprocessing steps you should try when working to build a model. Work with your group to come up with the most comprehensive list you can.

1. Cleaning up column names for use and make sure data types are consistent.

2. Dealing with null values. Deleting columns with too many missing values. Replacing null values with column mean or deleting rows. 

3. Deleting columns that are redundant and or highly correlated.  

4. Remove outliers. 

5. Standardize all numeric columns. 

6. Perform one-hot encoding or label encoding to turn categorical columns into numeric values (recoding the data). 

