Importing important dependencies

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import warnings    # used to suppress any warnings that we may receive
warnings.filterwarnings('ignore')

In [2]:
def display_all(df):
    '''
    input: dataframe
    description: displays a dataframe, showing at most the specified number of rows and columns on screen
    '''
    with pd.option_context("display.max_rows",10,"display.max_columns",9):  #you might want to change these numbers.
        display(df)

In [3]:
df=pd.read_csv('../input/diabetes.csv')
df.shape

(768, 9)

In [4]:
display_all(df)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Defining a function that returns a well-formatted table with the number and percentage of missing values in each column of the dataframe

In [5]:
def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    return mis_val_table_ren_columns

**Checking Missing values**

In [6]:
missing_values_table(df)

Your selected dataframe has 9 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values


The table reports no missing values in our dataset. However, there is one thing it does not catch: several features, namely **BMI**, **Insulin**, **BloodPressure**, **SkinThickness** and **Glucose**, cannot realistically have a value of zero. Rows in which these features are zero most likely reflect unavailable data, so those zeros can be treated as missing values.

In [7]:
features_with_missing_values=['BMI','SkinThickness','BloodPressure','Insulin','Glucose']
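
To see how many values are affected, one option is to count the zeros in the suspect columns, or to mark them as `NaN` so that `isnull()`-based tools pick them up. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 1, 0, 1],  # zero is a valid value in this column
})

# Columns where a zero is assumed to mean "not recorded".
suspect_columns = ["Glucose", "Insulin"]

# Count the zeros in each suspect column.
zero_counts = (toy[suspect_columns] == 0).sum()

# Mark those zeros as NaN so that isnull() can detect them.
marked = toy.copy()
marked[suspect_columns] = marked[suspect_columns].replace(0, np.nan)
missing_counts = marked.isnull().sum()

print(zero_counts)
print(missing_counts)
```

Columns such as `Pregnancies` are left out on purpose, because a zero there is a legitimate value.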


**Why do we use the median rather than the mean to replace the zeros in these columns?**
Outliers (data points that lie far from the rest) can pull the mean strongly towards them, so the mean becomes biased by a few extreme values. The median is far less sensitive to outliers, which makes it the safer choice here. To read more on this topic: [https://medium.com/@pswaldia1/statistics-for-data-science-why-it-is-important-e30c60c5018d](https://medium.com/@pswaldia1/statistics-for-data-science-why-it-is-important-e30c60c5018d)
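
A quick way to see the effect is to add one extreme value to a small sample and compare how the two statistics move. The numbers below are invented purely for illustration:

```python
import numpy as np

# Made-up sample without outliers.
values = np.array([20, 22, 24, 26, 28])

# The same sample with a single extreme value appended.
with_outlier = np.append(values, 500)

mean_before, median_before = values.mean(), np.median(values)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(mean_before, median_before)  # both are 24.0
print(mean_after, median_after)    # mean jumps above 100, median is 25.0
```

One outlier moves the mean from 24 to more than 100, while the median only shifts from 24 to 25.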

In [8]:
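for i in features_with_missing_values:
    # Replace zeros with the column median (note: this median is computed with the zeros still included)
    df[i]=df[i].replace(0,np.median(df[i].values))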
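
Because the zeros are still in the column when the median is taken, they pull the replacement value down, noticeably so when a column contains many zeros. A possible alternative, shown here only as a sketch on a made-up column and not as the method used in this notebook, is to mark the zeros as `NaN` first so that the median is taken over the recorded values only:

```python
import numpy as np
import pandas as pd

# Hypothetical column in which half of the entries are zero placeholders.
toy = pd.DataFrame({"Insulin": [0, 0, 0, 94, 168, 120]})

# Median with the zeros included, as in the loop above.
median_with_zeros = toy["Insulin"].median()

# Mark zeros as NaN; Series.median() skips NaN by default.
nonzero = toy["Insulin"].replace(0, np.nan)
median_nonzero = nonzero.median()

# Fill the gaps with the median of the recorded values.
imputed = nonzero.fillna(median_nonzero)

print(median_with_zeros, median_nonzero)
print(imputed.tolist())
```

Switching to this variant would change the imputed values, so the accuracy reported at the end of the notebook would not necessarily be reproduced.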

**Separating the target column from the rest of the dataset**

In [9]:
target=df['Outcome'].values
df.drop(['Outcome'],inplace=True,axis=1)

**Now we need to standardise the dataset, because the features vary widely in magnitude; in a distance-based model such as k-nearest neighbours, features with large values would otherwise dominate and make training harder**

In [10]:
# import StandardScaler from sklearn; it rescales each column of the dataframe to zero mean and unit variance
from sklearn.preprocessing import StandardScaler                                              
sta=StandardScaler()
input=sta.fit_transform(df)    #will give numpy array as output
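
As a sanity check on what standardisation does, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand on a small made-up array (`X` here is hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand, column by column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates a distance computation simply because of its units.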

**Splitting the dataset into train and test set**

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(input,target,test_size=0.1,random_state=0)
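
Note that the scaler above was fitted on the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below illustrates this on synthetic data (`X`, `y` and `model` are placeholders, not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 8 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split first, so the test rows stay unseen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# The pipeline fits the scaler on X_train only, then applies the
# same transformation to X_test when scoring.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(accuracy)
```

On a dataset of this size the difference is usually small, but the pipeline keeps the test-set estimate free of information from the test rows.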

**Using the k-nearest neighbours classifier**

In [12]:
from sklearn.neighbors import KNeighborsClassifier

In [13]:
knn=KNeighborsClassifier(n_neighbors=7)

**Training model on train set**

In [14]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=7, p=2,
           weights='uniform')

**Checking accuracy on test set**

In [15]:
knn.score(X_test,y_test)

0.8181818181818182
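
For a classifier, `score` returns the mean accuracy, that is, the fraction of test rows whose predicted label matches the true label. The sketch below checks this on synthetic data (the arrays are made up and stand in for the notebook's split):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 100 rows, 2 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Train on the first 80 rows, evaluate on the remaining 20.
knn = KNeighborsClassifier(n_neighbors=7).fit(X[:80], y[:80])

predictions = knn.predict(X[80:])
manual_accuracy = np.mean(predictions == y[80:])
score = knn.score(X[80:], y[80:])

print(manual_accuracy, score)
```

With `test_size=0.1` the test set here holds 77 of the 768 rows, and the value above is consistent with 63 correct predictions out of 77. With so few test rows, a handful of predictions can shift the score by several percentage points.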