You are given a dataset consisting of the medical records of Pima Indian women. The task is to predict whether a new person, whose medical data is given as input, is likely to develop diabetes within the next 5 years. This is indicated by 1 or 0 in the last column of the dataset. Use classification models such as KNN and Naïve Bayes and determine which of the two is better for the given dataset. Use Python to prepare the data, build the models, and cross-validate them. Use appropriate performance measures, error values, bias, and variance to decide which model is best.

**Load Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

**Read CSV file into a DataFrame**

In [None]:
data = pd.read_csv("../input/pima indian diabetes.csv", 
                 names=["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
                       "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"])

**Display the first five rows of the data**

In [None]:
data.head()

**Summary of the data**

In [None]:
data.info()

**Convert the DataFrame columns to integer/float**

In [None]:
features = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
                       "BMI", "DiabetesPedigreeFunction", "Age"]

for feature in features:
    data[feature] = pd.to_numeric(data[feature], errors='coerce')

# print the data summary again
data.info()

**Detect any missing values in the data**

In [None]:
data.isnull().sum()

**Replace missing values with the mean of each column/feature**

In [None]:
for feature in features:
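    # Assign the result back; calling fillna(inplace=True) on a column selection is deprecated in recent pandas
    data[feature] = data[feature].fillna(data[feature].mean())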

# check if there are still any missing values
data.isnull().sum()
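
The check above only catches values stored as NaN. In some copies of this dataset, missing measurements are reportedly recorded as 0 instead, which `isnull()` does not flag. The following is a minimal sketch of how such zeros could be handled, assuming that applies here; the toy frame and the `zero_as_missing` list are hypothetical and not taken from the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not rows from the actual dataset.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 8, 1],  # zero is a legitimate value here
})

# Assumption: columns where a zero is physiologically implausible.
zero_as_missing = ["Glucose", "BMI"]

# Mark the implausible zeros as NaN, then fill them with the column mean.
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)
toy[zero_as_missing] = toy[zero_as_missing].fillna(toy[zero_as_missing].mean())

print(toy)
```

Whether to treat zeros this way is a judgment call that depends on how the file in use was prepared.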

**Visualizing the distribution of features**

In [None]:
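plt.figure(figsize=(20, 20))

# One panel per feature, laid out on a 4 x 2 grid
for idx, feature in enumerate(features):
    plt.subplot(4, 2, idx + 1)
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(data[feature], kde=True, stat="density")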

**The following scatter plot shows the association between BMI and skin thickness**

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x='BMI', y='SkinThickness', data=data)

**The following scatter plot shows the association between insulin and blood glucose**

In [None]:
# Recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x='Insulin', y='Glucose', data=data)

**Diabetic (class 1) & Non-Diabetic (class 0) counts**

In [None]:
# Pass the column by name so the class labels end up on the x-axis
sns.countplot(x='Outcome', data=data)

**Define the input and output of the model**

In [None]:
# Input Features
x = data[features].values
# Output
y = data['Outcome'].values 

**Transform features by scaling each feature to a range (0,1)**

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
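
Fitting the scaler on the full dataset, as above, lets the test rows influence the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on hypothetical stand-in data; the names `x_demo`, `y_demo`, `x_tr`, and `x_te` are not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: 100 samples, 3 features on very different scales.
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=[100.0, 30.0, 0.5], scale=[30.0, 8.0, 0.3], size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows stay unseen during scaling.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

scaler = MinMaxScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn min/max from training rows only
x_te_scaled = scaler.transform(x_te)      # reuse the training min/max

print(x_tr_scaled.min(axis=0), x_tr_scaled.max(axis=0))
```

With this order, scaled test values can fall slightly outside the 0 to 1 range, which is expected.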

**Split the data into random train and test subsets**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)

print('Training data samples : {}'.format(x_train.shape[0]))
print('Testing data samples  : {}'.format(x_test.shape[0]))

**KNeighbors Classifier**

In [None]:
# Knn with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
# Fit Knn on training dataset
knn.fit(x_train, y_train)
# Perform classification on test dataset
y_pred = knn.predict(x_test)
# Print classification report
target_names = ['Class 0', 'Class 1']
print(classification_report(y_test, y_pred, target_names=target_names))
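
The choice of 5 neighbors above is fixed by hand. One way to pick `n_neighbors` more systematically is to compare cross-validated accuracy over a few candidate values. The sketch below does this on synthetic stand-in data from `make_classification`, so the scores it prints say nothing about the diabetes data; the candidate list and the names `x_demo`, `y_demo`, `cv_scores`, and `best_k` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the eight-feature diabetes data.
x_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

cv_scores = {}
for k in (1, 3, 5, 7, 9, 11, 15):
    # The pipeline refits the scaler inside every fold.
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, x_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[k] = scores.mean()

# Candidate with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best k:", best_k)
```

To try this on the notebook's data, `x` and `y` could be passed in place of the stand-in arrays.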

**Let's try Gaussian Naive Bayes**

In [None]:
# Gaussian Naive Bayes 
gnb = GaussianNB()
# Fit Gaussian Naive Bayes on training dataset
gnb.fit(x_train, y_train)
# Perform classification on test dataset
y_pred = gnb.predict(x_test)
# Print classification report
print(classification_report(y_test, y_pred, target_names=target_names))
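
Both reports above come from a single 80/20 split, so the comparison depends on which rows landed in the test set. The task statement also asks for cross-validation, which could be sketched as follows. The example uses synthetic stand-in data, so its numbers are not results for the diabetes dataset; the fold count and the names `models` and `summary` are assumptions. The gap between mean training and validation accuracy gives a rough, informal read on overfitting (variance), while low scores on both suggest underfitting (bias). This is a heuristic, not a formal bias-variance decomposition.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the eight-feature diabetes data.
x_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

models = {
    "KNN (k=5)": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Gaussian NB": make_pipeline(MinMaxScaler(), GaussianNB()),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

summary = {}
for name, model in models.items():
    res = cross_validate(
        model, x_demo, y_demo, cv=cv, scoring="accuracy", return_train_score=True
    )
    summary[name] = {
        "train_mean": res["train_score"].mean(),
        "test_mean": res["test_score"].mean(),
        "test_std": res["test_score"].std(),
    }
    print(
        f"{name}: train={summary[name]['train_mean']:.3f}, "
        f"validation={summary[name]['test_mean']:.3f} "
        f"(+/- {summary[name]['test_std']:.3f})"
    )
```

Passing the notebook's unscaled `x` and `y` instead of the stand-in arrays would give the corresponding comparison for the real data.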

**Naive Bayes classifiers have worked quite well on this diabetes classification task, even though they make the “naive” assumption that every pair of features is conditionally independent given the value of the class variable.**
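
For reference, the independence assumption can be written out as follows. Here $y$ is the class, $x_1, \dots, x_n$ are the features, and $\mu_{y,i}$ and $\sigma_{y,i}^{2}$ denote the per-class mean and variance of feature $i$; these symbols are introduced for illustration and do not appear elsewhere in this notebook. The Gaussian likelihood on the right is the form assumed by Gaussian Naive Bayes.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)
```

The predicted class is the $y$ that maximizes the left-hand side.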