# Week 15 In Class Work

In [14]:
# Setting dependencies, etc. 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## 1. Look up SMOTE oversampling

### https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

### a. Describe what it is in your own words in markdown.

"SMOTE"stands for Synthetic Minority Oversampling Technique. 

Many real-world data sets are imbalanced, with classification catagories of unequal proportions. Usually the minority classes contain the observations exhibiting characteristics of interest. However, in a large data set, ML algorithms may fail to detect important or interesting sets of observations due to their weak representation in the data set. The end result is creation of models with poor predictive power. 

Different resampling techniques have been devised to strengthen the signal of minority-class data. 
These techniques are applied to the training data set only in order to influence model fit.  Ultimately, the goal is to bias the model to strengthen detection of minority class data. Towards this end, there are two things one can do: increase the sensitivity of the classifier to the minority by REDUCING representation of majority class observations (undersampling); and INCREASING representation of minority class observations (oversampling). SMOTE algorithms achieve the second objective.

Oversampling techniques work by first searching the dataset and evaluating its statistics. Then, some algorithm is used to resample the training dataset, effectively transforming it into a new dataset with an increased representation of majority class observations. Finally, a model is trained on the new, "transformed" data set. 
A naive resampling technique, such as random sampling, will augment the minority class with duplicates of existing values. This may not be a problem for a small amount of duplication but when scaled up, the diminishment in variations between observations restricts the feature space. Any model trained on it will likely be over-fit to a higher number of a few specific cases.

With SMOTE, the set of observations augmenting the minority class comprisse novel artificial data points synthesized from small sets of extant observations. The actual algorithm for creating the data points interpolates points between nearest neighbors. For each minority class observation, x_i, SMOTE identifies a kNN set (typically k = 5). Then, it randomly selects one member, x_j, of that particular kNN set. A vector is created between x_i and x_j. Then, this vector is multiplied by a randomly chosen number between [0,1]. This new synthetic point is added to the original data set. 

So, while RandomOverSampler augments the data set by duplicating existing observations, thus increasing the size of the data set, SMOTE also increase the VARIETY of training observations. SMOTE is not without drawbacks, but those are outside the scope of this discussion (from what I gather, issues arise because SMOTE is naive to the topology of the data set...).

### b. Use this technique with the diabetes dataset. Comment on the model performance compared to other methods. Make sure you are clear about why you chose the performance metric you did.

In [15]:
diabetes_df = pd.read_csv('../week_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [16]:
diabetes_df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

We see that about 1/3 of the data set has an outcome of 1 (True). 

In [17]:
# Creating the training and testing sets

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42, stratify=y)

#Standardize
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.fit_transform(X_test)

#NB: we don't scale the target variable, ever.

In [18]:
#Resample training data with SMOTE

from imblearn.over_sampling import SMOTE

#instantiating the class
oversample = SMOTE(random_state=42)

#fitting and creating sythetic data sets using SMOTE
X_resampled, y_resampled = oversample.fit_resample(X_train_scaler, y_train)

In [25]:
#Checking to see how observations are distributed in synthetic data set:

y_resampled.value_counts()

0    350
1    350
Name: Outcome, dtype: int64

Notice that the total number of observations is LESS than the original data set. 

In [26]:
#train using synthetic data set

model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [28]:
# calculate accuracy score - note, this is the accuracy score that accounts for the unequal sizes among classes. If the classes are equally sized, the result would be the accuracy score.
# the model is applied to the test data, which has NOT been augmented/rebalanced.
# the whole point of oversampling is to bias the model, which occurs in the training phase.

from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)

0.7541975308641975

In [29]:
# create the classification report particular to the imbalanced data set.  

from imblearn.metrics import classification_report_imbalanced
print(classification_report_imbalanced(y_test, y_pred))

#a technique for improving recall is to resample because it gives us a 
#better understanding of the postives, thus improving our true positive rate

                   pre       rec       spe        f1       geo       iba       sup

          0       0.84      0.78      0.73      0.81      0.75      0.57       150
          1       0.64      0.73      0.78      0.68      0.75      0.57        81

avg / total       0.77      0.76      0.75      0.76      0.75      0.57       231



Let's see how this compares to the in-class example using the RandomOverSampler algorithm:

In [30]:
#Resample training data with RandomOversampler (naive approach)
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled_RO, y_resampled_RO = ros.fit_resample(X_train_scaler, y_train)

In [31]:
y_resampled_RO.value_counts()

0    350
1    350
Name: Outcome, dtype: int64

In [32]:
#train using resampled data (I re-labeled everything with a "_RO" to make the distinction clear).

model_RO = LogisticRegression(random_state=42)
model_RO.fit(X_resampled_RO, y_resampled_RO)

LogisticRegression(random_state=42)

In [33]:
# calculate accuracy

y_pred_RO = model_RO.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred_RO)

0.7575308641975309

In [34]:
# generate classification report

print(classification_report_imbalanced(y_test, y_pred_RO))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.84      0.79      0.73      0.81      0.76      0.58       150
          1       0.65      0.73      0.79      0.69      0.76      0.57        81

avg / total       0.77      0.77      0.75      0.77      0.76      0.57       231



Let's compare to the performance of the model trained on the unbalanced data set:

In [37]:
model_UB = LogisticRegression(random_state = 42)
model_UB.fit(X_train_scaler, y_train)

LogisticRegression(random_state=42)

In [38]:
y_pred_UB = model_UB.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred_UB)

0.6859259259259259

In [39]:
print(classification_report_imbalanced(y_test, y_pred_UB))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.77      0.85      0.52      0.81      0.67      0.46       150
          1       0.66      0.52      0.85      0.58      0.67      0.43        81

avg / total       0.73      0.74      0.64      0.73      0.67      0.45       231



### PERFORMANCE SUMMARY

The original, unbalanced data set haad an accuracy score of 0.686. Oversampling methods improved the accuracy score, with the naive random method slightly edging out SMOTE (RandomOverSampler: 0.757 ; SMOTE: 0.754). The precision, sensitivity, and specificity of the oversampling methods were higher compared to the unbalanced data set, however, there's little difference between the two methods when it comes to the average of each metric. 

## 2. Create a function called rec_digit_sum that takes in an integer. This function is the recursive sum of all the digits in a number.

### Given n, take the sum of all the digits in n. If the resulting value has more than one digit, continue calling the function in this way until a single-digit number is produced. The input will be a non-negative integer, and this should work for extremely large values as well as for single-digit inputs. 

#### Examples:
#### 16 --> 1 + 6 = 7
#### 942 --> 9 + 4 + 2 = 15 --> 1 + 5 = 6
#### 132189 --> 1 + 3 + 2 + 1 + 8 + 9 = 24 --> 2 + 4 = 6
#### 493193 --> 4 + 9 + 3 + 1 + 9 + 3 = 29 --> 2 + 9 = 11 --> 1 + 1 = 2

#### Applying the function to test cases:

In [5]:
rec_digit_sum(16)

7

In [6]:
rec_digit_sum(942)

6

In [7]:
rec_digit_sum(132189)

6

In [8]:
rec_digit_sum(493193)

2

#### Applying function to cases that are not allowed:

In [12]:
rec_digit_sum(-2)

ValueError: invalid literal for int() with base 10: '-'

In [13]:
rec_digit_sum(0.1)

ValueError: invalid literal for int() with base 10: '.'

In both of these cases, the function fails because the input has been cast to a string, and the symbols characterizing negative and fractional numbers cannot be converted into integers. I'm satisfied with that, instead of writing a try/except clause in the function. 