# Classification Applications

### Contents: <a class="anchor" id="top"></a>
### [Credit Fraud Feature Engineering](#p1)
### [Credit Fraud Modeling](#p2) 
### [Diabetes Prediction Modeling](#p3) 
---


In [98]:
import numpy as np
import pandas as pd

from datetime import datetime, date 

# PART A. Credit Fraud

In [99]:
df = pd.read_csv('../data/fraud_credit_card.csv')

## Feature Engineering <a class="anchor" id="p1"></a>
[back to top](#top)  
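Run the following code chunks in order, each exactly once. Several of them drop or overwrite columns in place, so re-running a chunk will either raise an error or alter the features a second time.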

In [100]:
df.head()

,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,21/06/2020 12:14,2291160000000000.0,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,...,33.9659,-80.9355,333497,Mechanical engineer,19/03/1968,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,1,21/06/2020 12:14,3573030000000000.0,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,...,40.3207,-110.436,302,"Sales professional, IT",17/01/1990,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,2,21/06/2020 12:14,3598220000000000.0,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,...,40.6729,-73.5365,34496,"Librarian, public",21/10/1970,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,3,21/06/2020 12:15,3591920000000000.0,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,...,28.5697,-80.8191,54767,Set designer,25/07/1987,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,4,21/06/2020 12:15,3526830000000000.0,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,...,44.2529,-85.017,1126,Furniture designer,06/07/1955,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


In [101]:
df.columns

Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

In [102]:
# drop unnecessary or redundant columns
df = df.drop(columns=['Unnamed: 0', 'cc_num', 'first', 'last', 'trans_num', 'unix_time'])

# The card number ('cc_num'), cardholder name ('first', 'last'), and transaction number
# ('trans_num') should not influence whether a transaction is fraudulent. The Unix timestamp
# 'unix_time' holds the same information as 'trans_date_trans_time', so it is redundant.
# 'Unnamed: 0' appears to be a leftover row index.

In [103]:
df.head()

,trans_date_trans_time,merchant,category,amt,gender,street,city,state,zip,lat,long,city_pop,job,dob,merch_lat,merch_long,is_fraud
0,21/06/2020 12:14,fraud_Kirlin and Sons,personal_care,2.86,M,351 Darlene Green,Columbia,SC,29209,33.9659,-80.9355,333497,Mechanical engineer,19/03/1968,33.986391,-81.200714,0
1,21/06/2020 12:14,fraud_Sporer-Keebler,personal_care,29.84,F,3638 Marsh Union,Altonah,UT,84002,40.3207,-110.436,302,"Sales professional, IT",17/01/1990,39.450498,-109.960431,0
2,21/06/2020 12:14,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,9333 Valentine Point,Bellmore,NY,11710,40.6729,-73.5365,34496,"Librarian, public",21/10/1970,40.49581,-74.196111,0
3,21/06/2020 12:15,fraud_Haley Group,misc_pos,60.05,M,32941 Krystal Mill Apt. 552,Titusville,FL,32780,28.5697,-80.8191,54767,Set designer,25/07/1987,28.812398,-80.883061,0
4,21/06/2020 12:15,fraud_Johnston-Casper,travel,3.19,M,5783 Evan Roads Apt. 465,Falmouth,MI,49632,44.2529,-85.017,1126,Furniture designer,06/07/1955,44.959148,-85.884734,0


In [104]:
# create distinct time and date features (timestamps are written day-first, e.g. 21/06/2020 12:14)
trans_datetime_series = pd.to_datetime(df['trans_date_trans_time'], dayfirst=True)
df['trans_date_trans_time'] = trans_datetime_series
df['date'] = df['trans_date_trans_time'].dt.date
df['time'] = df['trans_date_trans_time'].dt.time

df = df.drop('trans_date_trans_time', axis=1)
df.head()

,merchant,category,amt,gender,street,city,state,zip,lat,long,city_pop,job,dob,merch_lat,merch_long,is_fraud,date,time
0,fraud_Kirlin and Sons,personal_care,2.86,M,351 Darlene Green,Columbia,SC,29209,33.9659,-80.9355,333497,Mechanical engineer,19/03/1968,33.986391,-81.200714,0,2020-06-21,12:14:00
1,fraud_Sporer-Keebler,personal_care,29.84,F,3638 Marsh Union,Altonah,UT,84002,40.3207,-110.436,302,"Sales professional, IT",17/01/1990,39.450498,-109.960431,0,2020-06-21,12:14:00
2,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,9333 Valentine Point,Bellmore,NY,11710,40.6729,-73.5365,34496,"Librarian, public",21/10/1970,40.49581,-74.196111,0,2020-06-21,12:14:00
3,fraud_Haley Group,misc_pos,60.05,M,32941 Krystal Mill Apt. 552,Titusville,FL,32780,28.5697,-80.8191,54767,Set designer,25/07/1987,28.812398,-80.883061,0,2020-06-21,12:15:00
4,fraud_Johnston-Casper,travel,3.19,M,5783 Evan Roads Apt. 465,Falmouth,MI,49632,44.2529,-85.017,1126,Furniture designer,06/07/1955,44.959148,-85.884734,0,2020-06-21,12:15:00


In [105]:
# check that every merchant name starts with 'fraud' (val stays 1 if so), then strip the 6-character 'fraud_' prefix
val = 1
for m in df.merchant:
    if m[:5] != 'fraud':
        val *= 0
print('val =', val)

df['merchant'] = df['merchant'].apply(lambda x: x[6:])

val = 1
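
The check above tests the first five characters (`fraud`), while the strip removes six (`fraud_`). A minimal sketch of a stricter variant, using a hypothetical helper `strip_prefix` and three merchant names copied from the table above, tests for the full prefix and strips it only when it is present:

```python
PREFIX = 'fraud_'  # assumed prefix, taken from the merchant names shown above

def strip_prefix(name, prefix=PREFIX):
    """Return name without the leading prefix; leave other names unchanged."""
    return name[len(prefix):] if name.startswith(prefix) else name

merchants = ['fraud_Kirlin and Sons', 'fraud_Sporer-Keebler', 'fraud_Haley Group']

# True only if every name carries the full prefix, underscore included
all_prefixed = all(m.startswith(PREFIX) for m in merchants)
cleaned = [strip_prefix(m) for m in merchants]
print(all_prefixed, cleaned)
```

With this form, a name that lacks the prefix passes through untouched instead of losing its first six characters.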


In [106]:
df.head()

,merchant,category,amt,gender,street,city,state,zip,lat,long,city_pop,job,dob,merch_lat,merch_long,is_fraud,date,time
0,Kirlin and Sons,personal_care,2.86,M,351 Darlene Green,Columbia,SC,29209,33.9659,-80.9355,333497,Mechanical engineer,19/03/1968,33.986391,-81.200714,0,2020-06-21,12:14:00
1,Sporer-Keebler,personal_care,29.84,F,3638 Marsh Union,Altonah,UT,84002,40.3207,-110.436,302,"Sales professional, IT",17/01/1990,39.450498,-109.960431,0,2020-06-21,12:14:00
2,"Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,9333 Valentine Point,Bellmore,NY,11710,40.6729,-73.5365,34496,"Librarian, public",21/10/1970,40.49581,-74.196111,0,2020-06-21,12:14:00
3,Haley Group,misc_pos,60.05,M,32941 Krystal Mill Apt. 552,Titusville,FL,32780,28.5697,-80.8191,54767,Set designer,25/07/1987,28.812398,-80.883061,0,2020-06-21,12:15:00
4,Johnston-Casper,travel,3.19,M,5783 Evan Roads Apt. 465,Falmouth,MI,49632,44.2529,-85.017,1126,Furniture designer,06/07/1955,44.959148,-85.884734,0,2020-06-21,12:15:00


In [107]:
# difference in latitude between cardholder and merchant
df['lat_diff'] = np.abs(df['lat'] - df['merch_lat'])

# difference in longitude between cardholder and merchant
df['long_diff'] = np.abs(df['long'] - df['merch_long'])

# compute the Manhattan distance (in degrees) between cardholder and merchant
df['dist'] = df['lat_diff'] + df['long_diff']


# remove the lat and long variables since we are using that information in the distance feature
# remove the street and zipcode variable since that is too fine grained
df = df.drop(['lat', 'long', 'street', 'zip', 'lat_diff', 'long_diff', 'merch_lat', 'merch_long'], axis=1)
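
The `dist` feature adds latitude and longitude differences in raw degrees, so a degree of longitude counts the same as a degree of latitude regardless of how far north the cardholder is. As a possible alternative (not what this notebook uses), here is a minimal sketch of the great-circle (haversine) distance in kilometres. The helper name `haversine_km` and the Earth radius constant are assumptions made for illustration:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# cardholder and merchant coordinates from the first row of the earlier df.head() output
d = haversine_km(33.9659, -80.9355, 33.986391, -81.200714)
print(round(d, 1), 'km')
```

Whether this helps the classifiers is an open question; the sketch only shows how the two notions of distance differ.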

In [108]:
df.head()

,merchant,category,amt,gender,city,state,city_pop,job,dob,is_fraud,date,time,dist
0,Kirlin and Sons,personal_care,2.86,M,Columbia,SC,333497,Mechanical engineer,19/03/1968,0,2020-06-21,12:14:00,0.285705
1,Sporer-Keebler,personal_care,29.84,F,Altonah,UT,302,"Sales professional, IT",17/01/1990,0,2020-06-21,12:14:00,1.345771
2,"Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,Bellmore,NY,34496,"Librarian, public",21/10/1970,0,2020-06-21,12:14:00,0.836701
3,Haley Group,misc_pos,60.05,M,Titusville,FL,54767,Set designer,25/07/1987,0,2020-06-21,12:15:00,0.306659
4,Johnston-Casper,travel,3.19,M,Falmouth,MI,1126,Furniture designer,06/07/1955,0,2020-06-21,12:15:00,1.573982


In [109]:
# combine the cardholder's city and state into a single cardholder_loc variable
df['cardholder_loc'] = df['city'] + ' ' + df['state']

# some cities in different states share the same name, so city alone is ambiguous
print('distinct city,state locations: ', df['cardholder_loc'].unique().size)
print('distinct cities: ', df['city'].unique().size)
print('thus combining the city and state features in this way was justified')

# now drop these columns
df = df.drop(['city', 'state'], axis=1)

distinct city,state locations:  880
distinct cities:  849
thus combining the city and state features in this way was justified


In [110]:
df.head()

,merchant,category,amt,gender,city_pop,job,dob,is_fraud,date,time,dist,cardholder_loc
0,Kirlin and Sons,personal_care,2.86,M,333497,Mechanical engineer,19/03/1968,0,2020-06-21,12:14:00,0.285705,Columbia SC
1,Sporer-Keebler,personal_care,29.84,F,302,"Sales professional, IT",17/01/1990,0,2020-06-21,12:14:00,1.345771,Altonah UT
2,"Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,34496,"Librarian, public",21/10/1970,0,2020-06-21,12:14:00,0.836701,Bellmore NY
3,Haley Group,misc_pos,60.05,M,54767,Set designer,25/07/1987,0,2020-06-21,12:15:00,0.306659,Titusville FL
4,Johnston-Casper,travel,3.19,M,1126,Furniture designer,06/07/1955,0,2020-06-21,12:15:00,1.573982,Falmouth MI


In [111]:
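# create an age feature from 'dob' then drop 'dob'

# helper function
def age(born):
    '''Convert DOB to age in whole years as of today (the value depends on the run date).'''
    today = date.today()
    # subtract one year if this year's birthday has not happened yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

# dates of birth are written day-first (e.g. 19/03/1968)
dob_series = pd.to_datetime(df['dob'], dayfirst=True)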
df['dob'] = dob_series
df['cardholder_age'] = df['dob'].apply(age)
df = df.drop(['dob'], axis=1)
df.head()

,merchant,category,amt,gender,city_pop,job,is_fraud,date,time,dist,cardholder_loc,cardholder_age
0,Kirlin and Sons,personal_care,2.86,M,333497,Mechanical engineer,0,2020-06-21,12:14:00,0.285705,Columbia SC,56
1,Sporer-Keebler,personal_care,29.84,F,302,"Sales professional, IT",0,2020-06-21,12:14:00,1.345771,Altonah UT,34
2,"Swaniawski, Nitzsche and Welch",health_fitness,41.28,F,34496,"Librarian, public",0,2020-06-21,12:14:00,0.836701,Bellmore NY,53
3,Haley Group,misc_pos,60.05,M,54767,Set designer,0,2020-06-21,12:15:00,0.306659,Titusville FL,36
4,Johnston-Casper,travel,3.19,M,1126,Furniture designer,0,2020-06-21,12:15:00,1.573982,Falmouth MI,68
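
The `age` helper measures age against `date.today()`, so `cardholder_age` shifts with the day the notebook is run. A minimal sketch of one alternative, assuming a hypothetical helper `age_on` that measures age on the transaction date instead:

```python
from datetime import date

def age_on(born, on):
    """Age in whole years of someone born on `born`, as of the date `on`."""
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)

# first row above: born 19/03/1968, transaction dated 2020-06-21
print(age_on(date(1968, 3, 19), date(2020, 6, 21)))
```

Applied row by row with each transaction's `date`, this would make the feature reproducible across runs; it is a design option, not what the notebook does.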


In [112]:
# round the time feature to the nearest hour of the day (minutes above 30 round up; 23:31 and later wrap to 0), then replace it

# helper function
def hour_round(dt):
    h = dt.hour
    m = dt.minute
    if m>30:
        if h==23:
            return 0
        else:
            return h+1
    return h

df['hour_of_day'] = df['time'].apply(hour_round)
df = df.drop(['time'], axis=1)
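
A quick check of the rounding rule on a few hand-picked times (the helper is repeated in compact form so the snippet runs on its own; the sample times are made up):

```python
from datetime import time

def hour_round(dt):
    """Round a time to the nearest hour; minutes above 30 round up, 23:31+ wraps to 0."""
    if dt.minute > 30:
        return 0 if dt.hour == 23 else dt.hour + 1
    return dt.hour

samples = [time(12, 14), time(12, 30), time(12, 31), time(23, 45)]
rounded = [hour_round(t) for t in samples]
print(rounded)
```

Two consequences follow from the rule as written: a time of exactly hh:30 rounds down, and a transaction at 23:45 is assigned hour 0 while its `date` value stays on the same day.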

In [113]:
df.tail()

,merchant,category,amt,gender,city_pop,job,is_fraud,date,dist,cardholder_loc,cardholder_age,hour_of_day
555714,Reilly and Sons,health_fitness,43.77,M,519,Town planner,0,2020-12-31,1.104132,Luray MO,58,0
555715,Hoppe-Parisian,kids_pets,111.84,M,28739,Futures trader,0,2020-12-31,1.368282,Lake Jackson TX,24,0
555716,Rau-Robel,kids_pets,86.88,F,3684,Musician,0,2020-12-31,1.275094,Burbank WA,42,0
555717,Breitenberg LLC,travel,7.99,M,129,Cartographer,0,2020-12-31,0.786563,Mesa ID,58,0
555718,Dare-Marvin,entertainment,38.13,M,116001,Media buyer,0,2020-12-31,0.987025,Edmond OK,30,0


In [114]:
# one-hot encode the merchant, category, gender, job, date, and cardholder_loc features
df = pd.get_dummies(df, columns=['merchant', 'category', 'gender', 'job', 'date', 'cardholder_loc'])
df.head()

,amt,city_pop,is_fraud,dist,cardholder_age,hour_of_day,merchant_Abbott-Rogahn,merchant_Abbott-Steuber,merchant_Abernathy and Sons,merchant_Abshire PLC,...,cardholder_loc_Winger MN,cardholder_loc_Winslow AR,cardholder_loc_Winter WI,cardholder_loc_Winthrop ME,cardholder_loc_Wittenberg WI,cardholder_loc_Woods Cross UT,cardholder_loc_Woodville AL,cardholder_loc_Yellowstone National Park WY,cardholder_loc_Zaleski OH,cardholder_loc_Zavalla TX
0,2.86,333497,0,0.285705,56,12,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,29.84,302,0,1.345771,34,12,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,41.28,34496,0,0.836701,53,12,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,60.05,54767,0,0.306659,36,12,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3.19,1126,0,1.573982,68,12,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Modeling <a class="anchor" id="p2"></a>
[back to top](#top)  
Only the first 10,000 rows and the first 20 columns of the dataframe are used, because the code chunks otherwise take too long to run.

In [115]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [116]:
# select only the first 10,000 rows and the first 20 columns
# (the target, five numeric features, and the first 14 merchant dummy columns)
df_subset = df.iloc[:10000,:20]
y = df_subset['is_fraud']
X = df_subset.drop(['is_fraud'], axis=1)

In [117]:
X.shape

(10000, 19)

In [157]:
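# compare the fraud rate in the subset with the full data: the subset rate (about 0.22%) is lower than
# the full-data rate (about 0.39%), so taking the first 10,000 rows does not preserve the class ratio exactly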
print('original ratio:', sum(df['is_fraud']==1)/df['is_fraud'].size)
print('subsetted ratio:', sum(y==1)/y.size)

original ratio: 0.0038598644278853163
subsetted ratio: 0.0022
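
Taking the first rows of a dataframe gives no guarantee about the class ratio. A minimal sketch of stratified subsampling, which draws from each class in proportion to its size, is shown here on made-up labels; the helper `stratified_sample` is hypothetical and is not part of this notebook. (scikit-learn's `train_test_split` offers a `stratify` argument that serves a similar purpose when splitting.)

```python
import random
from collections import defaultdict

def stratified_sample(labels, n, seed=0):
    """Pick about n indices so that each class keeps roughly its original share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    picked = []
    for y, idx in by_class.items():
        k = max(1, round(n * len(idx) / len(labels)))  # at least one row per class
        picked.extend(rng.sample(idx, min(k, len(idx))))
    return sorted(picked)

# toy labels with a 0.4% positive rate (made-up data, not the fraud dataset)
labels = [1] * 40 + [0] * 9960
subset = stratified_sample(labels, 1000)
rate = sum(labels[i] for i in subset) / len(subset)
print(len(subset), rate)
```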


In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=2001)

### Decision Tree

In [120]:
# create a decision tree that uses entropy as the split criterion
DT_clf = tree.DecisionTreeClassifier(criterion = 'entropy')

In [121]:
#Training the tree
DT_clf = DT_clf.fit(X_train, y_train)

In [122]:
y_pred = DT_clf.predict(X_test)

In [123]:
print('Decision Tree Accuracy:', sum(y_pred == y_test)/len(y_pred))

Decision Tree Accuracy: 0.9963333333333333


In [124]:
c_mat = confusion_matrix(y_test, y_pred)
c_mat

array([[2988,    6],
       [   5,    1]])

In [125]:
F1_score = f1_score(y_test, y_pred)
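# NOTE: confusion_matrix puts true classes on rows and predicted classes on columns, so c_mat[0,0]
# counts true negatives; the two values below are the precision and recall of class 0 (non-fraud)
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])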

print('DT F1 Score:', F1_score)
print('DT Precision:', precision)
print('DT Recall:', recall)

DT F1 Score: 0.15384615384615383
DT Precision: 0.998329435349148
DT Recall: 0.9979959919839679
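
The precision and recall printed above are computed from the first row and column of the confusion matrix, which belong to the non-fraud class. A minimal sketch of the same quantities for the fraud class, using only the decision tree matrix shown above (the variable names are chosen for illustration):

```python
# confusion matrix from the decision tree above; rows = true class, columns = predicted class
c_mat = [[2988, 6],
         [5, 1]]

tn, fp = c_mat[0]
fn, tp = c_mat[1]

precision_pos = tp / (tp + fp)   # of the transactions flagged as fraud, the share that were fraud
recall_pos = tp / (tp + fn)      # of the actual frauds, the share that were flagged
f1_pos = 2 * tp / (2 * tp + fp + fn)

print('precision (fraud):', round(precision_pos, 4))
print('recall (fraud):', round(recall_pos, 4))
print('F1 (fraud):', round(f1_pos, 4))
```

The F1 value agrees with the `f1_score` output above (about 0.154), since `f1_score` scores the positive class by default.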


### Logistic Regression

In [147]:
LR_clf = LogisticRegression(random_state=0).fit(X_train, y_train)

In [148]:
y_pred = LR_clf.predict(X_test)

In [149]:
print('Logistic Reg. Accuracy:', sum(y_pred == y_test)/len(y_pred))

Logistic Reg. Accuracy: 0.9973333333333333


In [150]:
c_mat = confusion_matrix(y_test, y_pred)
c_mat

array([[2992,    2],
       [   6,    0]])

In [151]:
F1_score = f1_score(y_test, y_pred)
# c_mat[0,0] is the true-negative count, so these are the class-0 (non-fraud) precision and recall
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])

print('LR F1 Score:', F1_score)
print('LR Precision:', precision)
print('LR Recall:', recall)

LR F1 Score: 0.0
LR Precision: 0.9979986657771848
LR Recall: 0.9993319973279893


### KNN

In [142]:
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)

In [143]:
predictions = neigh.predict(X_test)

In [144]:
#print(predictions)
acc = np.sum(predictions == y_test) / len(y_test)
print('KNN accuracy:', acc)

KNN accuracy: 0.998


In [145]:
c_mat = confusion_matrix(y_test, predictions)
c_mat

array([[2989,    5],
       [   1,    5]])

In [146]:
F1_score = f1_score(y_test, predictions)
# c_mat[0,0] is the true-negative count, so these are the class-0 (non-fraud) precision and recall
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])

print('KNN F1 Score:', F1_score)
print('KNN Precision:', precision)
print('KNN Recall:', recall)

KNN F1 Score: 0.625
KNN Precision: 0.9996655518394649
KNN Recall: 0.9983299933199733


### SVM

In [136]:
svm_clf = svm.SVC(kernel='linear').fit(X_train,y_train)

In [137]:
predictions = svm_clf.predict(X_test)

In [138]:
acc = np.sum(predictions == y_test) / len(y_test)
print('SVM accuracy:', acc)

SVM accuracy: 0.9946666666666667


In [139]:
c_mat = confusion_matrix(y_test, predictions)
c_mat

array([[2984,   10],
       [   6,    0]])

In [141]:
F1_score = f1_score(y_test, predictions)
# c_mat[0,0] is the true-negative count, so these are the class-0 (non-fraud) precision and recall
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])

print('SVM F1 Score:', F1_score)
print('SVM Precision:', precision)
print('SVM Recall:', recall)

SVM F1 Score: 0.0
SVM Precision: 0.9979933110367893
SVM Recall: 0.9966599866399466


### Comparison
In the context of predicting credit fraud, the most important metric to consider is the F1 score. We want high sensitivity (recall), because missed fraud can result in losses for cardholders. We also want few false positives, i.e. high precision: classifying a legitimate transaction as fraudulent denies the cardholder access to their funds for no good reason, which can be a real detriment when they need the card. The F1 score is the harmonic mean of precision and recall, so it takes both into account, whereas accuracy is dominated by the non-fraudulent majority (only 6 of the 3,000 test transactions are fraudulent). By this measure the KNN model performs best, with an F1 score of 0.625, against 0.154 for the decision tree and 0.0 for logistic regression and the SVM.
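
To see why accuracy alone says little on this test set, a minimal sketch using only the counts in the confusion matrices above compares the four models with a hypothetical baseline that labels every transaction as non-fraudulent:

```python
# test-set class counts, read off the rows of the confusion matrices above
n_legit, n_fraud = 2994, 6

# a baseline that predicts "not fraud" for everything is right on every legitimate transaction
baseline_correct = n_legit
baseline_accuracy = baseline_correct / (n_legit + n_fraud)

# correct predictions per model = true negatives + true positives
correct = {'DT': 2988 + 1, 'LR': 2992 + 0, 'KNN': 2989 + 5, 'SVM': 2984 + 0}
beats_baseline = [name for name, c in correct.items() if c > baseline_correct]

print('baseline accuracy:', baseline_accuracy)
print('models above baseline:', beats_baseline)
```

Under these counts the baseline is correct on 2,994 of 3,000 transactions, which no model exceeds (KNN ties it), while catching none of the 6 frauds.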

One drawback I noticed was that the SVM took a very long time to train; all the other models trained and predicted relatively quickly by comparison. Originally, the KNN model also took a very long time to produce predictions, but after subsetting the data and reducing the number of neighbors used in the prediction, its runtime dropped.

# PART B. Diabetes <a class="anchor" id="p3"></a>
[back to top](#top)

In [67]:
dfb = pd.read_csv('../data/diabetes.csv')

In [68]:
dfb.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [69]:
dfb.shape

(768, 9)

In [70]:
y = dfb['Outcome']
X = dfb.drop(['Outcome'], axis=1)

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=2001)

### Decision Tree

In [72]:
# create a decision tree that uses entropy as the split criterion
DT_clf = tree.DecisionTreeClassifier(criterion = 'entropy')

In [73]:
#Training the tree
DT_clf = DT_clf.fit(X_train, y_train)

In [74]:
y_pred = DT_clf.predict(X_test)

In [75]:
print('Decision Tree Accuracy:', sum(y_pred == y_test)/len(y_pred))

Decision Tree Accuracy: 0.70995670995671


In [76]:
c_mat = confusion_matrix(y_test, y_pred)
c_mat

array([[127,  28],
       [ 39,  37]])

In [77]:
F1_score = f1_score(y_test, y_pred)
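# NOTE: confusion_matrix puts true classes on rows and predicted classes on columns, so c_mat[0,0]
# counts true negatives; the two values below are the precision and recall of class 0 (no diabetes)
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])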

print('DT F1 Score:', F1_score)
print('DT Precision:', precision)
print('DT Recall:', recall)

DT F1 Score: 0.524822695035461
DT Precision: 0.7650602409638554
DT Recall: 0.8193548387096774


### Logistic Regression

In [78]:
LR_clf = LogisticRegression(random_state=0).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
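
The warning above suggests either raising `max_iter` or scaling the data. The columns in this dataset sit on very different scales (Glucose in the hundreds, DiabetesPedigreeFunction mostly below 1). A minimal sketch of standardization, the rescaling to zero mean and unit standard deviation that scikit-learn's `StandardScaler` applies by default, on the five rows shown in `dfb.head()`; the helper `standardize` is hypothetical:

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# first five rows of two columns from dfb.head() above
glucose = [148, 85, 183, 89, 137]
pedigree = [0.627, 0.351, 0.672, 0.167, 2.288]

glucose_z = standardize(glucose)
pedigree_z = standardize(pedigree)
print([round(z, 2) for z in glucose_z])
print([round(z, 2) for z in pedigree_z])
```

In a real pipeline the mean and standard deviation would be estimated on the training split only and then applied to the test split. Scaling would also affect the distance-based KNN model, not only logistic regression.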


In [79]:
y_pred = LR_clf.predict(X_test)

In [80]:
print('Logistic Reg. Accuracy:', sum(y_pred == y_test)/len(y_pred))

Logistic Reg. Accuracy: 0.7835497835497836


In [81]:
c_mat = confusion_matrix(y_test, y_pred)
c_mat

array([[145,  10],
       [ 40,  36]])

In [83]:
F1_score = f1_score(y_test, y_pred)
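# c_mat[0,0] is the true-negative count, so these are the class-0 (no diabetes) precision and recall
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])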

print('LR F1 Score:', F1_score)
print('LR Precision:', precision)
print('LR Recall:', recall)

LR F1 Score: 0.5901639344262294
LR Precision: 0.7837837837837838
LR Recall: 0.9354838709677419


### KNN

In [84]:
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)

KNeighborsClassifier()

In [85]:
predictions = neigh.predict(X_test)

In [86]:
#print(predictions)
acc = np.sum(predictions == y_test) / len(y_test)
print('KNN accuracy:', acc)

KNN accuracy: 0.7186147186147186


In [87]:
c_mat = confusion_matrix(y_test, predictions)
c_mat

array([[133,  22],
       [ 43,  33]])

In [88]:
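# use the KNN predictions here; y_pred still holds the logistic regression output
F1_score = f1_score(y_test, predictions)
# c_mat[0,0] is the true-negative count, so these are the class-0 (no diabetes) precision and recall
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])

# F1 follows from the KNN confusion matrix: 2*33 / (2*33 + 22 + 43); rounded for display
print('KNN F1 Score:', round(F1_score, 4))
print('KNN Precision:', precision)
print('KNN Recall:', recall)

KNN F1 Score: 0.5038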
KNN Precision: 0.7556818181818182
KNN Recall: 0.8580645161290322


### SVM

In [89]:
svm_clf = svm.SVC(kernel='linear').fit(X_train,y_train)

In [90]:
predictions = svm_clf.predict(X_test)

In [91]:
acc = np.sum(predictions == y_test) / len(y_test)
print('SVM accuracy:', acc)

SVM accuracy: 0.7878787878787878


In [92]:
c_mat = confusion_matrix(y_test, predictions)
c_mat

array([[146,   9],
       [ 40,  36]])

In [93]:
F1_score = f1_score(y_test, predictions)
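# c_mat[0,0] is the true-negative count, so these are the class-0 (no diabetes) precision and recall
precision = c_mat[0,0] / (c_mat[0,0]+c_mat[1,0])
recall = c_mat[0,0] / (c_mat[0,0]+c_mat[0,1])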

print('SVM F1 Score:', F1_score)
print('SVM Precision:', precision)
print('SVM Recall:', recall)

SVM F1 Score: 0.5950413223140496
SVM Precision: 0.7849462365591398
SVM Recall: 0.9419354838709677


### Comparison
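We can see above that the model with the highest accuracy (0.788), as well as the strongest F1 score (0.595), is the SVM. Because it is very important for someone to know whether or not they have diabetes, for this problem we would prefer the model with the smallest false negative rate, in other words the largest recall on the diabetic class. The value printed as "Recall" above is computed from the first row of the confusion matrix, so it is the recall of the non-diabetic class (specificity); the SVM's 0.942 is the largest of these. Read directly from the second row of each confusion matrix, recall on the diabetic class is 37/76 ≈ 0.487 for the decision tree, 36/76 ≈ 0.474 for both logistic regression and the SVM, and 33/76 ≈ 0.434 for KNN, so the models are close on this criterion and every one of them misses more than half of the diabetic cases. Taking accuracy, F1 score, and specificity together, the SVM remains the most appropriate model in this context, with the caveat that its false negative rate (about 0.526) is still high.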
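
The "Recall" values printed in this part are computed from the first row of each confusion matrix. A minimal sketch, using only the four matrices printed above and two hypothetical helpers, separates recall on the diabetic class from specificity:

```python
# confusion matrices from the cells above; rows = true class, columns = predicted class
matrices = {
    'DT':  [[127, 28], [39, 37]],
    'LR':  [[145, 10], [40, 36]],
    'KNN': [[133, 22], [43, 33]],
    'SVM': [[146, 9], [40, 36]],
}

def recall_pos(m):
    """Share of true diabetic cases (row 1) that the model predicted as diabetic."""
    (tn, fp), (fn, tp) = m
    return tp / (tp + fn)

def specificity(m):
    """Share of true non-diabetic cases (row 0) that the model predicted as non-diabetic."""
    (tn, fp), (fn, tp) = m
    return tn / (tn + fp)

for name, m in matrices.items():
    print(name, 'recall:', round(recall_pos(m), 3), 'specificity:', round(specificity(m), 3))

best_recall = max(matrices, key=lambda k: recall_pos(matrices[k]))
best_specificity = max(matrices, key=lambda k: specificity(matrices[k]))
print('highest recall:', best_recall, '| highest specificity:', best_specificity)
```

On these matrices the SVM has the highest specificity (the 0.942 reported above), while the decision tree has marginally the highest recall on the diabetic class.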

With this relatively small data set (768 rows), all of the models were able to train and make predictions in a reasonable amount of time.