## About the data

This is a diabetes dataset taken from a hospital in Frankfurt, Germany.
The objective is to predict, from the diagnostic measurements included in the dataset, whether or not a patient has diabetes. Several constraints were placed on the selection of these instances from a larger database.

Dataset link - https://www.kaggle.com/johndasilva/diabetes


## Content

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

## Properties

Pregnancies - Number of times pregnant

Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure - Diastolic blood pressure (mm Hg)

SkinThickness - Triceps skin fold thickness (mm)

Insulin - 2-Hour serum insulin (mu U/ml)

BMI - Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction - Diabetes pedigree function

Age - Age (years)

Outcome - Class variable (0 or 1) 


The dataset consists of 2000 rows.

In [1]:
# Importing libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading data
df = pd.read_csv("../input/diabetes/diabetes.csv")

In [3]:
# determining dataset size
df.shape

In [4]:
# printing the first 5 rows
df.head()

In [5]:
# determining types of data
df.info()

There are no missing values, and all columns contain numeric data.

In [6]:
# basic statistics
df.describe()

In [7]:
# checking whether the dataset is balanced
diabetes_true_count = len(df.loc[df['Outcome'] == 1])
diabetes_false_count = len(df.loc[df['Outcome'] == 0])

In [8]:
(diabetes_true_count,diabetes_false_count)

The dataset is somewhat imbalanced: the two outcome classes are not equally represented.
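
The raw counts above can also be expressed as class shares, which makes the degree of imbalance easier to read. The following is a minimal sketch on a made-up `toy` frame (the values are illustrative and not taken from the dataset); `value_counts(normalize=True)` is the only pandas call it relies on.

```python
import pandas as pd

# Toy stand-in for the real Outcome column (values are made up)
toy = pd.DataFrame({"Outcome": [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]})

# Absolute number of rows per class
counts = toy["Outcome"].value_counts()

# Share of each class; a large gap between the shares indicates imbalance
shares = toy["Outcome"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same two calls on `df['Outcome']` would report the counts and proportions directly.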

In [9]:
# plotting the number of rows in each outcome class
sns.countplot(x = 'Outcome',data = df)

In [10]:
# plotting a histogram of each property
df.hist(figsize = (30,30))

In [11]:
df.isnull().sum()

No null values are present in the data.

### Correlation & heatmap generation

In [12]:
# pairwise correlation matrix of all columns
corrmat = df.corr()
plt.figure(figsize=(20,20))
# plot heat map (corrmat already holds the correlations, so it is passed directly)
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")
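
A heatmap of every pair can be hard to scan, so it may help to also rank the predictors by the strength of their correlation with the target. The sketch below does this on a made-up `toy` frame (the numbers are illustrative only), taking the `Outcome` column of the correlation matrix and sorting by absolute value.

```python
import pandas as pd

# Toy data with two predictors and the target (values are made up)
toy = pd.DataFrame({
    "Glucose": [85, 90, 120, 150, 160, 170],
    "Age": [50, 21, 33, 45, 29, 60],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Correlation of every predictor with the target, without the target itself
corr_with_target = toy.corr()["Outcome"].drop("Outcome")

# Rank by absolute correlation, strongest first
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear association, so a low rank here does not by itself mean a feature is useless to a tree-based model.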

In [13]:
df.columns

Checking whether the data contains 0 values

In [14]:
# count the rows holding a 0 in each predictor column
for column in ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
               'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']:
    print("{0}: {1}".format(column, len(df.loc[df[column] == 0])))
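
Not every zero necessarily means the same thing: a value of 0 for `Pregnancies` is plausible, whereas a 0 for a measurement such as `Glucose` or `BMI` presumably stands for a missing reading. The sketch below, on a made-up `toy` frame, counts zeros per column in one step and then marks zeros as `NaN` only in the columns listed in `zero_as_missing`. That list is an assumption made for illustration, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 0, 5],
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Assumption: zero is not a plausible measurement in these columns
zero_as_missing = ["Glucose", "BMI"]

# Zeros per column, computed for all columns at once
zero_counts = (toy == 0).sum()

# Replace zeros with NaN only where zero is treated as "missing"
marked = toy.copy()
for column in zero_as_missing:
    marked[column] = marked[column].replace(0, np.nan)

missing_counts = marked.isnull().sum()
print(zero_counts)
print(missing_counts)
```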

### Preparing the data for model building


In [15]:
from sklearn.model_selection import train_test_split
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure','SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
predicted_class = ['Outcome']

Splitting the dataset into train & test sets


In [16]:
X = df[feature_columns]
y = df[predicted_class]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=10)
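
Because the two classes are not equally represented, a purely random split can leave the train and test sets with slightly different class proportions. One common option is the `stratify` argument of `train_test_split`, sketched here on made-up arrays (`X_toy` and `y_toy` are illustrative names, not part of the notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 30 of them positive (made-up numbers)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=10, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results noticeably on the real data would need to be checked; with 2000 rows the effect may be small.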

In [17]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

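Replacing the 0 values with the mean of that particular property. The means are learned from the training set only and then applied to the test set.


In [18]:
from sklearn.impute import SimpleImputer

# Note: missing_values=0 treats every 0 as missing, including legitimate
# zeros such as 0 pregnancies
fill_values = SimpleImputer(missing_values=0, strategy="mean")

# fit on the training set, then reuse the same means for the test set
X_train = fill_values.fit_transform(X_train)
X_test = fill_values.transform(X_test)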
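
To avoid leaking test-set information into preprocessing, an imputer is typically fitted on the training split and only applied to the test split. The sketch below shows this on small made-up arrays (`train` and `test` are illustrative); the learned column means are exposed through the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up arrays; 0.0 plays the role of a missing measurement
train = np.array([[100.0, 30.0], [0.0, 20.0], [140.0, 0.0]])
test = np.array([[0.0, 25.0], [120.0, 0.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")

# Column means are computed from the non-zero training values only
train_filled = imputer.fit_transform(train)

# The test set is filled with the training means, not its own
test_filled = imputer.transform(test)

print(imputer.statistics_)
print(test_filled)
```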

##### Fitting a RandomForest classifier on the training data

In [19]:
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)

# scikit-learn expects a 1-D target, so the single-column y_train is flattened
model = random_forest_model.fit(X_train, np.ravel(y_train))

##### Predicting on the test set & computing the accuracy


In [20]:
# predictions are made on the held-out test set
predict_test_data = model.predict(X_test)

from sklearn import metrics

print("Accuracy = {0:.3f}".format(metrics.accuracy_score(y_test, predict_test_data)))

##### Confusion matrix for the predictions (TN, FP, FN, TP)

In [21]:
from sklearn.metrics import confusion_matrix
# rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, predict_test_data)
cm
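
On an imbalanced dataset, accuracy alone can be misleading, so it can be useful to unpack the confusion matrix into its four cells and derive precision and recall. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative); for binary labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so `ravel()` yields the cells in that order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Flatten the 2x2 matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)

# Recall: share of actual positives that the model found
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)
```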

#### Saving the model

In [22]:
import joblib
joblib.dump(model, "./random_forest.joblib")
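
A saved model can later be restored with `joblib.load`. The following round-trip sketch trains a small classifier on made-up data and writes it to a temporary directory, so the toy arrays, the temporary path, and the `n_estimators=10` setting are all assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data for illustration
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_toy = np.array([0, 0, 1, 1] * 5)
clf = RandomForestClassifier(n_estimators=10, random_state=10).fit(X_toy, y_toy)

# Save to a temporary location and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(clf.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

If the restored model is to be used on new raw data, the fitted imputer would presumably need to be saved and reapplied as well, since the model was trained on imputed values.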