2. Describe the content of the data set in your own words

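The data set contains diagnostic measurements for 768 women, together with a label recording whether each woman tested positive for diabetes. It can be used to build a model that predicts the chance of a positive diagnosis from these attributes.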

3. Describe the features and formulate a hypothesis on which may be relevant in predicting diabetes

The features are:
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   
The column 'Class variable (0 or 1)' takes the value 1 or 0, where 1 means the woman tested positive for diabetes and 0 means she tested negative.

Hypothesis: 


In [155]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [156]:
# Read the csv file into a data frame
df = pd.read_csv('pima-indians-diabetes.csv')
df.head(2)

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable (0 or 1)
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [157]:
# Rename the columns to shorter, easier-to-use names
df.columns = ['pregnant', 'glucose', 'bp', 'skthickness', 'insulin', 'bmi', 'pedigree', 'age', 'diabetes']
df.head(2)

Unnamed: 0,pregnant,glucose,bp,skthickness,insulin,bmi,pedigree,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [161]:
# Replace 0s with NaN in every column except 'diabetes', treating 0 as a missing measurement.
# Caveat: 0 is a legitimate value for 'pregnant', so this also marks women with no pregnancies as missing.
feature_cols = ['pregnant', 'glucose', 'bp', 'skthickness', 'insulin', 'bmi', 'pedigree', 'age']
df[feature_cols] = df[feature_cols].replace(0, np.nan)
df.head()

Unnamed: 0,pregnant,glucose,bp,skthickness,insulin,bmi,pedigree,age,diabetes
0,6.0,148,72,35.0,,33.6,0.627,50,1
1,1.0,85,66,29.0,,26.6,0.351,31,0
2,8.0,183,64,,,23.3,0.672,32,1
3,1.0,89,66,23.0,94.0,28.1,0.167,21,0
4,,137,40,35.0,168.0,43.1,2.288,33,1
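
A zero is not equally suspicious in every column: a glucose or insulin reading of 0 almost certainly means the value was not recorded, while 0 pregnancies is a plausible count. A minimal sketch of a more selective replacement follows. The toy frame and the `zero_means_missing` list are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the renamed data frame (values are illustrative only)
toy = pd.DataFrame({
    'pregnant': [6, 0, 8],
    'glucose': [148, 0, 183],
    'insulin': [0, 94, 168],
    'diabetes': [1, 0, 1],
})

# Assumption: a zero cannot be a real measurement in these columns,
# so it is treated as a missing value
zero_means_missing = ['glucose', 'insulin']

cleaned = toy.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

# 'pregnant' keeps its zero; only the listed columns gain NaNs
print(cleaned.isnull().sum())
```

Which columns belong in the list is a judgment call about the measurements, so it is worth stating explicitly in the notebook.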


In [162]:
# Shows the non-null count and dtype of each column; the number of nulls is 768 minus the non-null count
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnant       657 non-null float64
glucose        763 non-null float64
bp             733 non-null float64
skthickness    541 non-null float64
insulin        394 non-null float64
bmi            757 non-null float64
pedigree       768 non-null float64
age            768 non-null int64
diabetes       768 non-null int64
dtypes: float64(7), int64(2)
memory usage: 60.0 KB


The columns pregnant, glucose, bp, skthickness, insulin and bmi now contain nulls.

I want to replace these nulls with the mean of the respective column.

In [163]:
# To get the mean of each column
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
pregnant,657,4.494673,3.217291,1.0,2.0,4.0,7.0,17.0
glucose,763,121.686763,30.535641,44.0,99.0,117.0,141.0,199.0
bp,733,72.405184,12.382158,24.0,64.0,72.0,80.0,122.0
skthickness,541,29.15342,10.476982,7.0,22.0,29.0,36.0,99.0
insulin,394,155.548223,118.775855,14.0,76.25,125.0,190.0,846.0
bmi,757,32.457464,6.924988,18.2,27.5,32.3,36.6,67.1
pedigree,768,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
age,768,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
diabetes,768,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


In [None]:
# Replace all the null values in each column with the mean of the respective columns


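The summary below was computed on the raw data, with the original column names. Every column has a count of 768, so there are no explicit null values; missing measurements show up as zeros instead (see the min column).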

In [61]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Number of times pregnant,768,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Plasma glucose concentration a 2 hours in an oral glucose tolerance test,768,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
Diastolic blood pressure (mm Hg),768,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
Triceps skin fold thickness (mm),768,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
2-Hour serum insulin (mu U/ml),768,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
Body mass index (weight in kg/(height in m)^2),768,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
Diabetes pedigree function,768,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age (years),768,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Class variable (0 or 1),768,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


In [62]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [63]:
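# Split the data into training and test sets.
# Note: the slice :-2 drops the last two columns, so 'Age (years)' is excluded along with the class variable.
# .iloc replaces .ix, which was removed in pandas 1.0.
X = df.iloc[:, :-2].values
y = df['Class variable (0 or 1)'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)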
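
The effect of the `:-2` slice is easy to miss, so here is a small sketch of the difference between positional slices and selecting by name. The toy frame and its column names are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Toy frame: three features followed by the label column
toy = pd.DataFrame(
    [[6, 148, 50, 1], [1, 85, 31, 0]],
    columns=['pregnant', 'glucose', 'age', 'label'],
)

drops_two = toy.iloc[:, :-2]            # excludes 'age' as well as 'label'
drops_one = toy.iloc[:, :-1]            # excludes only 'label'
by_name = toy.drop(columns=['label'])   # same result, without counting positions

print(list(drops_two.columns))
print(list(drops_one.columns))
print(list(by_name.columns))
```

Dropping the label by name keeps working if columns are later added or reordered, which makes it the less fragile choice.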

In [64]:
# Train a k-nearest-neighbours classifier (k = 3) on the training data
myknn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

In [65]:
# Predicting on the test data
myknn.predict(X_test)

array([1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [66]:
# Accuracy: the fraction of test labels the model predicted correctly
myknn.score(X_test, y_test)

0.73376623376623373
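
For a classifier, `score` returns plain accuracy, that is, the share of predictions that match the true labels. The sketch below checks this by hand on synthetic data from `make_classification`. The data is a stand-in, so the numbers it prints say nothing about the diabetes data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem with 8 features (stand-in data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predictions = knn.predict(X_test)

# Accuracy computed by hand: fraction of predictions equal to the true label
manual_accuracy = np.mean(predictions == y_test)

print(manual_accuracy, knn.score(X_test, y_test))
```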

In [67]:
to_predict = 'Class variable (0 or 1)'  # the column name, not the column itself
features = ['Number of times pregnant',
            'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
            'Diastolic blood pressure (mm Hg)',
            'Triceps skin fold thickness (mm)',
            '2-Hour serum insulin (mu U/ml)',
            'Body mass index (weight in kg/(height in m)^2)',
            'Diabetes pedigree function',
            'Age (years)']
data = df[features]
label = df[to_predict]
folds = 5

In [68]:
data.head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years)
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [71]:
# The original call failed with "ValueError: multiclass-multioutput is not supported"
# because label had been built as df[df['Class variable (0 or 1)']], which is not a 1-D label vector.
# With to_predict set to the column name, label is a single column and the call is valid.
cross_val_score(myknn, data, label, cv=folds)
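
The multiclass-multioutput error mentioned above comes down to the shape of the label: `cross_val_score` expects a one-dimensional vector with one label per row. A minimal sketch on synthetic data follows. The frame, the column names `f1` to `f4` and `label` are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with four features and a binary label
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
toy = pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4'])
toy['label'] = y

feature_names = ['f1', 'f2', 'f3', 'f4']
label_name = 'label'          # a column name (string), not the column itself

data = toy[feature_names]     # 2-D: one row per sample, one column per feature
label = toy[label_name]       # 1-D: one label per sample

# Five-fold cross-validation returns one accuracy score per fold
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), data, label, cv=5)

print(scores)
print(scores.mean())
```

Reporting the mean of the fold scores gives a steadier estimate than a single train/test split, since every row is used for testing exactly once.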

In [72]:
df[df['Class variable (0 or 1)']==0].head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable (0 or 1)
1,1,85,66,29,0,26.6,0.351,31,0
3,1,89,66,23,94,28.1,0.167,21,0
5,5,116,74,0,0,25.6,0.201,30,0
7,10,115,0,0,0,35.3,0.134,29,0
10,4,110,92,0,0,37.6,0.191,30,0


In [73]:
df[df['Class variable (0 or 1)']==1].head()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable (0 or 1)
0,6,148,72,35,0,33.6,0.627,50,1
2,8,183,64,0,0,23.3,0.672,32,1
4,0,137,40,35,168,43.1,2.288,33,1
6,3,78,50,32,88,31.0,0.248,26,1
8,2,197,70,45,543,30.5,0.158,53,1


In [74]:
df['Class variable (0 or 1)'].value_counts()

0    500
1    268
dtype: int64
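
The class counts above give a useful reference point: a model that always predicts the majority class (0) is already right for 500 of the 768 rows. The arithmetic can be sketched as follows, using only the counts printed above.

```python
# Class counts taken from the value_counts() output
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))  # 0 0.651
```

The KNN test accuracy of about 0.73 reported earlier should be read against this baseline of about 0.65, keeping in mind that the baseline here is computed on the full data set rather than on the test split.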