SVM Hyperparameter Tuning - Diabetes Data
Description
You have already built a linear SVM model on the Pima Indians Diabetes dataset, which contains observations on various health-related attributes, such as plasma glucose concentration and body mass index (BMI).

Recall that you used C=1 when building that model.

In this question, you will find the optimal value of the hyperparameter C using GridSearchCV(), and then build a linear SVM model using the optimal value of C to predict whether a given patient has diabetes.

To find the optimal value of C, you can plot training and test accuracy versus C using matplotlib (the code is already written - you will see the plot displayed below the coding console).
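
As a rough illustration of the search described above, the sketch below runs GridSearchCV over a few values of C on synthetic data and prints the mean training and validation accuracy for each value. The synthetic data from make_classification, the shortened grid, and the variable names are assumptions made for this example; none of them come from the exercise data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features (an assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=7, n_informative=4,
                           n_redundant=0, random_state=0)

# Candidate values of C; each one is scored with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, scoring="accuracy",
                      cv=5, return_train_score=True)
search.fit(X, y)

# Mean cross-validated accuracy per C, in the same order as param_grid["C"].
for c, train_acc, val_acc in zip(param_grid["C"],
                                 search.cv_results_["mean_train_score"],
                                 search.cv_results_["mean_test_score"]):
    print(f"C={c}: train={train_acc:.3f}, validation={val_acc:.3f}")

# The C with the highest mean validation accuracy.
best_C = search.best_params_["C"]
print("best C:", best_C)
```

The value reported in best_params_ is the one with the highest mean validation accuracy; a training accuracy that keeps rising while validation accuracy stalls or drops would suggest that C is too large.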

A sample of the training data is shown below:

 No_Times_Pregnant  Plasma_Glucose  Diastolic_BP  Triceps  Insulin   BMI  Age  Diabetes  id
0                  1              89            66       23       94  28.1   21         0   1
1                  0             137            40       35      168  43.1   33         1   2
2                  3              78            50       32       88  31.0   26         1   3
3                  2             197            70       45      543  30.5   53         1   4
4                  1             189            60       23      846  30.1   59         1   5

The training data is provided here:
/data/training/diabetes_train.csv

After you train the model, use the test data to make predictions. The test data can be accessed here:
/data/test/diabetes_test.csv

Write the predictions to the file
/code/output/diabetes_predictions.csv

in the following format (note the column names carefully):
    Diabetes_Predicted   id
0                   0  314
1                   0   85
2                   0  183
3                   0  371
4                   0  152
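
One way to check the format above before submitting is to write a small frame and read it back. The sketch below is a minimal example under stated assumptions: the predictions and ids are made-up placeholder values, and the file is written to a temporary directory rather than to /code/output/.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions and ids, standing in for the real model output.
predictions = np.array([0, 0, 1])
ids = pd.Series([314, 85, 183])

# Passing `columns` fixes the column order regardless of the pandas version.
out = pd.DataFrame({"Diabetes_Predicted": predictions, "id": ids},
                   columns=["Diabetes_Predicted", "id"])

# Write to a temporary path (the real target is /code/output/diabetes_predictions.csv).
path = os.path.join(tempfile.mkdtemp(), "diabetes_predictions.csv")
out.to_csv(path, sep=",")

# Read the file back; index_col=0 restores the row index written by to_csv.
check = pd.read_csv(path, index_col=0)
print(list(check.columns))
```

Reading the file back confirms that the column names and their order survive the round trip.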

Your model's accuracy will be evaluated on an unseen test dataset.
Execution Time Limit
15 seconds


In [None]:
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt


# read training data
diabetes_train = pd.read_csv("/data/training/diabetes_train.csv")

# read test data
diabetes_test = pd.read_csv("/data/test/diabetes_test.csv")

print(diabetes_train.head())
print(diabetes_test.head())


# create x_train (the 7 feature columns) and y_train (the Diabetes column)
x_train = diabetes_train.iloc[:, :7]
y_train = diabetes_train.iloc[:, 7]

# tune the model
n_folds = 5

# specify range of parameters (C) as a list
params = {"C": [0.1, 1, 10, 100, 1000]}

# create a linear-kernel SVC object (SVC defaults to the RBF kernel)
linear_model = SVC(kernel="linear")

# set up grid search scheme
model_cv = GridSearchCV(estimator=linear_model,
                        param_grid=params,
                        scoring='accuracy',
                        cv=n_folds,
                        verbose=1,
                        return_train_score=True)



# fit
model_cv.fit(x_train, y_train)

# results
cv_results = model_cv.cv_results_
print(pd.DataFrame(model_cv.cv_results_))


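# plot C versus train and test scores
# (use the numeric list params["C"] for the x-axis; cv_results['param_C'] may be
# an object-dtype masked array, and the scores are in the same order as the grid)
plt.figure(figsize=(8, 6))
plt.plot(params["C"], cv_results['mean_test_score'])
plt.plot(params["C"], cv_results['mean_train_score'])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
# save before show(): once the figure has been shown, savefig may write a blank image
plt.savefig('hyperparam_c.png')
plt.show()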

# best hyperparameters
best_score = model_cv.best_score_
best_C = model_cv.best_params_['C']

print(best_score)
print(best_C)

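# choose the best C found by the grid search instead of hard-coding it
C = best_C

# model with best C: GridSearchCV (refit=True by default) has already refit
# the best estimator on the full training set
linear_model = model_cv.best_estimator_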


# make predictions
print(diabetes_test.head())
predictions = linear_model.predict(diabetes_test.iloc[:, :7])
print(predictions[:5])

# YOUR CODE ENDS HERE

# collect predictions and ids in the required column order (Diabetes_Predicted, id)
d = pd.DataFrame({'Diabetes_Predicted': predictions, 'id': diabetes_test['id']},
                 columns=['Diabetes_Predicted', 'id'])
print("\n", "d", "\n", d.head())

# write the output
d.to_csv('/code/output/diabetes_predictions.csv', sep=",")
