You are given a dataset called ‘Diabetes.csv’.
- Load the dataset
- Print the first ten observations on the screen
- Check the shape of the dataset
- Print the column names of the dataset
- Split the dataset into feature set (X) and the target variable (y). The target variable is ‘OnDiab’, indicating onset of diabetes within five years.
- Find the number of unique classes for the target variable.
- Split the dataset into training and test datasets
- Train a KNearestNeighbor classifier on your train dataset and print the score on the test dataset. Set number of neighbors to 5.
- Import GridSearchCV from `sklearn.model_selection`
- Split your data into train and test datasets
- For neighbors=1 to 30, compute GridSearchCV for train dataset with kfold=10.
- Print the best cross-validation score
- Print the best parameter
- Print the test score

In [4]:
import pandas as pd


# First attempt: by default, read_csv treats the first row as the header.
df = pd.read_csv('diabetes.csv')
df.head(10)

Unnamed: 0,6,148,72,35,0,33.6,0.627,50,1
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0
5,3,78,50,32,88,31.0,0.248,26,1
6,10,115,0,0,0,35.3,0.134,29,0
7,2,197,70,45,543,30.5,0.158,53,1
8,8,125,96,0,0,0.0,0.232,54,1
9,4,110,92,0,0,37.6,0.191,30,0


In [5]:
df.shape

(767, 9)

In [6]:
df.columns

Index(['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'], dtype='object')

In [7]:
column_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", 
    "BMI", "DiabetesPedigreeFunction", "Age", "OnDiab"
]

In [8]:
# The file has no header row, so reload it with header=None and explicit column names.
df = pd.read_csv('diabetes.csv', names=column_names, header=None)

In [9]:
df.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,OnDiab
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1
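
The reload with `header=None` matters because `read_csv` uses the first row as column labels by default, which is why the first attempt showed data values as column names and a shape of (767, 9). A minimal sketch of the effect follows. It uses a small in-memory CSV as a stand-in for the real file; the names `raw`, `df_default`, and `df_named` are illustrative, and the three rows are copied from the output above.

```python
import io

import pandas as pd

# Illustrative stand-in for a headerless file: the first three rows shown above.
raw = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "OnDiab",
]

# Default behaviour: the first data row is consumed as the header.
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= supplies the labels.
df_named = pd.read_csv(io.StringIO(raw), names=names, header=None)

print(df_default.shape)          # one row fewer than the text has lines
print(df_named.shape)            # every line kept as a data row
print(list(df_default.columns))  # data values used as column labels
```

Applied to the real file, re-running `df.shape` after the second `read_csv` call should report one more row than the (767, 9) shown earlier.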


In [10]:
X = df.drop(columns=["OnDiab"])
y = df["OnDiab"]

In [11]:
unique_classes = y.nunique()

unique_classes

2

In [12]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

In [15]:
knn.score(X_test, y_test)

0.6623376623376623

In [16]:
param_grid = {'n_neighbors': range(1, 31)}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid_search.fit(X_train, y_train)

In [17]:
grid_search.best_score_

np.float64(0.7444738233738764)

In [18]:
grid_search.best_params_

{'n_neighbors': 12}

In [19]:
# With the default refit=True, best_estimator_ is already refit on the full training set.
best_knn = grid_search.best_estimator_

In [20]:
best_knn.score(X_test, y_test)

0.7792207792207793
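
To make the grid search less of a black box, here is a rough sketch of what `GridSearchCV` with `cv=10` does for `n_neighbors` from 1 to 30: it scores each candidate by mean 10-fold cross-validation accuracy on the training set and keeps the best one. The sketch runs on synthetic data from `make_classification` rather than the diabetes file, so its numbers are not comparable to the results above. The names `X_demo`, `y_demo`, `mean_scores`, and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Manual search: mean 10-fold CV accuracy for each candidate k.
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=10).mean()
    for k in range(1, 31)
}
best_k = max(mean_scores, key=mean_scores.get)

# The same search expressed with GridSearchCV.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 31)}, cv=10)
search.fit(X_tr, y_tr)

print(best_k, mean_scores[best_k])
print(search.best_params_, search.best_score_)

# Scoring the fitted search object delegates to the refit best estimator.
print(search.score(X_te, y_te), search.best_estimator_.score(X_te, y_te))
```

The best cross-validation score is computed from the training folds only, so it can differ from the test score in either direction, as it does in the results above (about 0.744 versus 0.779).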