# 1.	Perform combined over and undersampling on the diabetes dataset (use SMOTEENN). Explain how combined sampling works.

In [1]:
#import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#load the diabetes dataset from csv file and save as a pandas dataframe
diabetes_df = pd.read_csv('../week_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
#import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


#set your dependent variable as outcome and all other columns as your X variables
X = diabetes_df.drop('Outcome',axis = 1)
y = diabetes_df['Outcome']

#perform test train split on the dataset using 20% of the dataset as a testing sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4, stratify = y)

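#Standardize the features: fit the scaler on the training data only, then apply the same transform to the test data
#(the scores shown below were produced with the scaler refit on the test set, so a rerun may differ slightly)
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.transform(X_test)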


In [4]:
#import the combined overunder sampler SMOTEENN which is a combination of SMOTE and edited Nearest Neighbors
from imblearn.combine import SMOTEENN

smoteenn = SMOTEENN(random_state=4)
X_resampled, y_resampled = smoteenn.fit_resample(X_train_scaler, y_train)


In [5]:
#train using the resampled data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_resampled, y_resampled)

LogisticRegression(random_state=42)

In [6]:
#calculate the balanced accuracy score (the mean of the per-class recalls)
from sklearn.metrics import balanced_accuracy_score
y_pred = model.predict(X_test_scaler)
balanced_accuracy_score(y_test, y_pred)
#SMOTEENN raised the score to 0.77, up from 0.74 with RandomOverSampler. Without resampling the score was 0.60, and undersampling with ClusterCentroids gave 0.69.

0.7701851851851852

Combined sampling applies an oversampling technique followed by an undersampling technique. Here the oversampler is SMOTE (Synthetic Minority Oversampling Technique) and the undersampler is Edited Nearest Neighbors (ENN). SMOTE increases the size of the minority class so that the class distribution is more even. Rather than duplicating existing rows, it synthesizes new minority points: for a minority sample it picks one of that sample's k nearest minority-class neighbors and places a new point somewhere on the line segment between the two. ENN then cleans the resampled data: it looks at each sample's nearest neighbors and removes the sample when its class disagrees with the classes of those neighbors. This removes noisy and overlapping points near the class boundary, including any synthetic points that SMOTE placed among the other class.
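The two steps can be illustrated with a small pure-Python sketch. This is not the imbalanced-learn implementation: the helper names (`smote_point`, `enn_filter`), the toy points, and the majority-vote rule used for ENN are all assumptions made for illustration, and the library's defaults may differ.

```python
import math
import random

def nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]

def smote_point(sample, minority, k, rng):
    """Synthesize one point between sample and one of its k nearest minority neighbors."""
    others = [p for p in minority if p != sample]
    neighbor = rng.choice(nearest(sample, others, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(s + gap * (n - s) for s, n in zip(sample, neighbor))

def enn_filter(points, labels, k=3):
    """Keep samples whose label matches the majority of their k nearest neighbors."""
    kept = []
    for i, (p, label) in enumerate(zip(points, labels)):
        rest = [(q, l) for j, (q, l) in enumerate(zip(points, labels)) if j != i]
        neighbors = sorted(rest, key=lambda ql: math.dist(p, ql[0]))[:k]
        votes = sum(1 for _, l in neighbors if l == label)
        if votes * 2 > k:
            kept.append((p, label))
    return kept

# Toy data (hypothetical): a majority cluster near the origin, plus one
# majority point at (2.1, 2.0) that sits inside the minority cluster.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
            (0.2, 0.4), (0.4, 0.1), (2.1, 2.0)]
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.3)]

rng = random.Random(4)
# Oversample: create three synthetic minority points.
synthetic = [smote_point(rng.choice(minority), minority, k=2, rng=rng)
             for _ in range(3)]

points = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

# Undersample: drop samples that disagree with their neighborhood.
cleaned = enn_filter(points, labels, k=3)
print(len(points), "->", len(cleaned))
```

In this toy example the only point removed is the majority point surrounded by minority samples, which is the kind of boundary noise the ENN step is meant to clean up.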

# 2.	Comment on the performance of combined sampling vs the other approaches we have used for the diabetes dataset.

SMOTEENN gave a balanced accuracy of 0.77, the highest score we have seen so far on the diabetes dataset. For comparison, RandomOverSampler scored 0.74, SMOTE alone 0.73, undersampling with ClusterCentroids 0.69, and no resampling 0.60. Adding Edited Nearest Neighbors to SMOTE therefore improved the score by about 0.04 over SMOTE alone. The SMOTEENN result is from a single train/test split, so small differences between methods should be read with caution.
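The score computed above with `balanced_accuracy_score` is the mean of the per-class recalls, which is not the same as plain accuracy when the classes are imbalanced. The sketch below uses made-up labels (not the diabetes data) and hand-written helper functions to show the difference.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Made-up labels: 8 negatives, 2 positives, and a model that always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A model that ignores the minority class still reaches 0.8 accuracy here, but its balanced accuracy is 0.5, which is why balanced accuracy is the more informative score for comparing resampling methods.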

# 3. What is outlier detection? Why is it useful? What methods can you use for outlier detection?

Outlier detection identifies data points that differ significantly from the rest of the dataset. It is useful because outliers, which often come from measurement or data-entry errors, can distort summary statistics and pull a model's fit away from the bulk of the data; once identified, they can be inspected and, if appropriate, removed. One method is Elliptic Envelope, a machine learning technique for data that is approximately normally distributed. It estimates the center and covariance of the data and draws an ellipse around the main body of points; any point outside the ellipse is considered an outlier. You must set a contamination hyperparameter, the proportion of points you expect to be outliers. The drawback is that this proportion is usually unknown, so it has to be estimated or guessed.
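The idea behind the ellipse can be sketched with the Mahalanobis distance, which measures how far a point is from the center of the data relative to the data's covariance. The code below is a simplified two-feature illustration with made-up points. It uses the ordinary sample covariance, whereas scikit-learn's `EllipticEnvelope` uses a robust covariance estimate, so treat the function name and details as assumptions rather than the library's method.

```python
def mahalanobis_outliers(points, contamination=0.1):
    """Flag the given fraction of 2-D points farthest from the center."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance of the two features.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    def dist2(p):
        # Squared Mahalanobis distance using the inverse of the 2x2 covariance.
        dx, dy = p[0] - mx, p[1] - my
        return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

    n_out = max(1, round(contamination * n))
    return sorted(points, key=dist2, reverse=True)[:n_out]

# Made-up data: nine points near (1, 1) and one point far away.
points = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (1.0, 0.8),
          (0.8, 1.1), (1.3, 1.0), (0.9, 0.9), (1.1, 1.0), (4.0, -2.0)]

flagged = mahalanobis_outliers(points, contamination=0.1)
print(flagged)  # [(4.0, -2.0)]
```

Note how the contamination value directly decides how many points are flagged: with 0.1 and ten points, exactly one point is reported whether or not it is a genuine outlier.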

The interquartile range (IQR) is a statistical method for outlier detection that is applied to one feature at a time. You calculate the first quartile (Q1) and third quartile (Q3) of the feature; their difference, Q3 - Q1, is the interquartile range. The lower bound is Q1 - 1.5*IQR and the upper bound is Q3 + 1.5*IQR. Any value below the lower bound or above the upper bound is considered an outlier.
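As a small worked example of the IQR rule, the sketch below uses Python's `statistics.quantiles` on made-up values for a single feature. The helper name `iqr_bounds` is hypothetical, and other quantile conventions can give slightly different bounds.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds for one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values for one feature; 40 is far from the rest.
values = [10, 12, 12, 13, 14, 15, 15, 16, 18, 40]

lower, upper = iqr_bounds(values)
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper)  # 7.0 21.0
print(outliers)      # [40]
```

Here Q1 is 12.25 and Q3 is 15.75, so the IQR is 3.5 and the bounds are 12.25 - 5.25 = 7.0 and 15.75 + 5.25 = 21.0.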

# 4.	Perform a linear SVM to predict credit approval (last column) using this dataset: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29 . 

Make sure you look at the accompanying document that describes the data in the dat file. You will need to either convert this data to another file type or import the dat file to python. 
You can use this code, but otherwise you follow standard practices we have already used many times: 
from sklearn.svm import SVC
classifier = SVC(kernel='linear')


In [20]:
import csv

#convert the space-delimited .dat file to a csv file with a header row
with open("australian.dat") as infile, open("australian.csv", "w", newline="") as outfile:
    csv_writer = csv.writer(outfile)
    csv_writer.writerow(['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15'])
    for line in infile:
        row = line.split()  #split on any whitespace
        if row:  #skip blank lines
            csv_writer.writerow(row)

In [21]:
#load the credit dataset from csv file and save as a pandas dataframe
credit_df = pd.read_csv('./australian.csv')
credit_df.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15
0,1,22.08,11.46,2,4,4,1.585,0,0,0,1,2,100,1213,0
1,0,22.67,7.0,2,8,4,0.165,0,0,0,0,2,160,1,0
2,0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
3,0,21.67,11.5,1,5,3,0.0,1,1,11,1,2,0,1,1
4,1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1


In [27]:
#set the dependent variable to the approval column A15 and use all other columns as the X variables
X = credit_df.drop('A15',axis = 1)
y = credit_df['A15']

In [28]:
from sklearn.model_selection import train_test_split

#split the dataset into testing and training portions, with the testing portion making up 20% of the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)

In [34]:
from sklearn.preprocessing import StandardScaler

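#Standardize the features: fit the scaler on the training data only, then apply the same transform to the test data
#(the scores shown below were produced with the scaler refit on the test set, so a rerun may differ slightly)
sc = StandardScaler()
X_train_scaler = sc.fit_transform(X_train)
X_test_scaler = sc.transform(X_test)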

In [36]:
from sklearn.svm import SVC
classifier = SVC(kernel='linear')

#fit the model on the scaled training data
classifier.fit(X_train_scaler, y_train)
#print the accuracy on the training data, then on the test data
print(classifier.score(X_train_scaler,y_train))
print(classifier.score(X_test_scaler,y_test))

#make predictions using the scaled test dataset
y_pred = classifier.predict(X_test_scaler)


0.8623188405797102
0.8333333333333334


# 5.	How did the SVM model perform? Use a classification report. 

In [37]:
from imblearn.metrics import classification_report_imbalanced

#get a classification report for the model.
print(classification_report_imbalanced(y_test, y_pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.95      0.73      0.95      0.83      0.84      0.68        75
          1       0.75      0.95      0.73      0.84      0.84      0.71        63

avg / total       0.86      0.83      0.85      0.83      0.84      0.70       138



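The model performed well. Accuracy was 0.86 on the training data and 0.83 on the test data, so there is little sign of overfitting. On the test data the average precision was 0.86 and the average recall was 0.83. The two classes behave differently: class 0 has high precision (0.95) but lower recall (0.73), while class 1 has lower precision (0.75) but high recall (0.95), so most of the errors are class 0 rows predicted as class 1. These scores are higher than the ones we obtained when predicting Outcome from the diabetes dataset.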
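To make the report easier to read, the sketch below recomputes the main figures from confusion-matrix counts. The counts are not printed anywhere in the notebook; they are inferred from the reported recall and support values (for example, 0.73 x 75 is about 55), so treat them as an assumption that happens to reproduce the reported numbers.

```python
# Counts inferred from the report (recall x support, rounded); an assumption,
# not notebook output.
true0_pred0, true0_pred1 = 55, 20   # 75 rows whose true class is 0
true1_pred0, true1_pred1 = 3, 60    # 63 rows whose true class is 1

total = true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1

# Accuracy: correct predictions over all predictions.
accuracy = (true0_pred0 + true1_pred1) / total

# Precision: of the rows predicted as a class, how many truly belong to it.
precision_0 = true0_pred0 / (true0_pred0 + true1_pred0)
precision_1 = true1_pred1 / (true1_pred1 + true0_pred1)

# Recall: of the rows that truly belong to a class, how many were found.
recall_0 = true0_pred0 / (true0_pred0 + true0_pred1)
recall_1 = true1_pred1 / (true1_pred1 + true1_pred0)

print(round(accuracy, 2))                            # 0.83
print(round(precision_0, 2), round(recall_0, 2))     # 0.95 0.73
print(round(precision_1, 2), round(recall_1, 2))     # 0.75 0.95
```

With these counts, 20 of the 23 test errors are class 0 rows predicted as class 1, which is what drives the lower recall for class 0 and the lower precision for class 1.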

# 6.	What kinds of jobs in data are you most interested in? Do some research on what is out there. Write about your thoughts in under 400 words. 

I would love to find a job related to climate change and the environment. My education focused on climate change and its effect on agriculture, and I am interested in continuing to work in that area or a related one. Several companies in the St. Louis area focus on agriculture, and I have looked into jobs with them in the past (Bayer, Climate Corporation, Danforth). I am also very interested in sustainability and renewable energy and would welcome projects in those areas. This is an important area for future research, and several companies in the St. Louis area appear to be expanding their research efforts in it.