# Problem Statement
The goal is to predict whether a Pima Indian woman shows signs of diabetes. We use a dataset collected by the 
National Institute of Diabetes and Digestive and Kidney Diseases, which contains several attributes that help us 
make this prediction.

Constraints on data collection:
All patients in the dataset are females of Pima Indian heritage who are at least 21 years old.

# Dataset:
https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

# 1. Import Libraries and load dataset

In [None]:
#Import all the necessary modules
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import random
import matplotlib.pyplot as plt
import numpy as np

num_bins = 10

In [None]:
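# The file has no header row, so the column names are supplied explicitly
colnames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pima_df = pd.read_csv('pima-indians-diabetes.csv', header=None, names=colnames)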

It is good practice to eyeball the raw data to get a feel for the structure of the file, the number and types of 
attributes, and the likely challenges in the dataset. You will notice that it is a comma-separated file with no 
column names. Check the accompanying files to find the name of each attribute and what other information is 
available about the data.
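
When a file has no header row, the names have to be supplied while reading it. The following is a minimal sketch, assuming the file really has no header. The two sample rows are inlined through `io.StringIO` purely for illustration and stand in for the real CSV; `sample_csv` and `sample_df` are names made up for this example.

```python
import io
import pandas as pd

colnames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Two illustrative rows standing in for the real file (assumption: no header row)
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# header=None tells pandas that the first line is data; names= supplies the labels
sample_df = pd.read_csv(sample_csv, header=None, names=colnames)
print(sample_df)
```

Without `header=None` or `names=`, pandas would treat the first data row as the header and silently drop one patient from the data.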

# 2. Print 10 samples from the dataset

In [None]:
pima_df.head(10)
# Many of the 0s are likely placeholders for missing values

# 3. Print the datatypes of each column and the shape of the dataset

In [None]:
pima_df.shape

In [None]:
pima_df.dtypes

There are 0s in the data. Are they valid values, or do they stand for missing values? Plasma glucose, blood pressure, 
skin thickness, etc. cannot be 0. Examine each column logically to decide.
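
One quick way to see how widespread the problem is would be to count the zeros in every column. This is a minimal sketch on a small made-up frame (`demo_df` and its values are invented for illustration and only borrow a few of the notebook's column names).

```python
import pandas as pd

# Small made-up frame using a few of the notebook's column names
demo_df = pd.DataFrame({
    'preg': [0, 2, 5, 1],
    'plas': [148, 0, 183, 89],
    'pres': [72, 66, 0, 0],
    'mass': [33.6, 26.6, 23.3, 0.0],
})

# Comparing with 0 gives a boolean frame; summing it counts the zeros per column
zero_counts = (demo_df == 0).sum()
print(zero_counts)
```

Note that a zero is not suspicious in every column: zero pregnancies is a perfectly plausible value, whereas a blood pressure of zero is not.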

# 4. Replace the 0s in each affected column with the median of that column

In [None]:
# Columns in which 0 is not a valid measurement (names as defined in colnames)
zero_cols = ['plas', 'pres', 'skin', 'test', 'mass']
# Cast to float first so that a fractional median can be stored without a dtype conflict
pima_df[zero_cols] = pima_df[zero_cols].astype(float)
for col in zero_cols:
    pima_df.loc[pima_df[col] == 0, col] = pima_df[col].median()
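
The median computed in the cell above still includes the zeros, which pulls it down. A possible variation, shown here as a sketch on made-up values rather than as the notebook's method, is to mark the zeros as missing first so that only observed values contribute to the median.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({'skin': [0.0, 20.0, 40.0, 60.0]})  # made-up values

# Median with the zero included, as in the cell above
median_with_zeros = demo_df['skin'].median()

# Mark zeros as missing first; median() skips NaN by default
skin_nan = demo_df['skin'].replace(0, np.nan)
median_without_zeros = skin_nan.median()

# Fill the gaps with the median of the observed values only
demo_df['skin'] = skin_nan.fillna(median_without_zeros)
print(median_with_zeros, median_without_zeros)
print(demo_df)
```

The more zeros a column contains, the larger the gap between the two medians becomes.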

# 5. Print the descriptive statistics of each & every column using describe() function

In [None]:
pima_df.describe().transpose()

# 6. See the distribution of the 'class' variable and plot it using an appropriate graph

In [None]:
pima_df['class'].value_counts().plot(kind='bar')  # bar chart of the class counts
pima_df.groupby("class").agg({'class': 'count'})
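
Raw counts are easier to judge next to their proportions, since class imbalance affects how accuracy should be read later on. A minimal sketch on made-up labels (the `labels` series is invented for illustration):

```python
import pandas as pd

labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name='class')  # made-up labels

counts = labels.value_counts()                 # rows per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
print(pd.DataFrame({'count': counts, 'share': shares}))
```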

# 7. Use pairplots and correlation method to observe the relationship between different variables and state your insights.
Hint: Use seaborn plot and check the relationship between different variables

In [None]:
import seaborn as sns
sns.pairplot(pima_df, hue="class", palette="husl")

In [None]:
pima_df.corr()

Check for pairs of variables whose correlation exceeds 0.8.
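
Scanning a full correlation matrix by eye is error-prone, so the check can be automated. The helper below is a sketch, not part of the original notebook: `strong_pairs` is a hypothetical name, and the demo frame is made up so that exactly one pair is strongly correlated.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8):
    """Return the variable pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is ignored
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Made-up data: b is an exact multiple of a, c is only weakly related to both
demo_df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})
result = strong_pairs(demo_df)
print(result)
```

With the notebook's data the call would be `strong_pairs(pima_df.drop(columns='class'))`; an empty result means no pair crosses the threshold.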

Observations:

The diagonal plots have already been discussed in Observations I of the Univariate Analysis.
There are no linear relationships between any two variables.
There is no strong correlation between any two variables.
There is no strong correlation between any independent variable and class variable.

Use the plot to infer the relationships between the variables.

# 8. Split the pima_df into training and test set in the ratio of 70:30 (Training:Test).

In [None]:
# Split the data into a training set (first 70% of rows) and a test set (remaining 30%)
n_train = int(round(len(pima_df) * 0.7))
train_set = pima_df.iloc[:n_train].copy()  # rows up to the end of the training portion
test_set = pima_df.iloc[n_train:].copy()   # all remaining rows, so no row is shared or skipped

# capture the target column ("class") into separate vectors for training set and test set
train_labels = train_set.pop("class")
test_labels = test_set.pop("class")
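
A split by row position depends on the order of the file. A common alternative, shown here as a sketch on made-up data rather than as the notebook's method, is scikit-learn's `train_test_split`, which shuffles the rows and can keep the class proportions equal in both parts through `stratify`. The feature names `f1` to `f3` and the `random_state` value are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40 of them in class 1
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
y = pd.Series([1] * 40 + [0] * 60, name='class')

# test_size=0.3 gives a 70:30 split; stratify=y preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```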

# 9. Create a decision tree model using the “entropy” criterion and fit it to the training data.

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion = 'entropy' )
dt_model.fit(train_set, train_labels)

# 10. Print the accuracy of the model & print the confusion matrix

In [None]:
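from sklearn.metrics import confusion_matrix
test_pred = dt_model.predict(test_set)
print(dt_model.score(test_set, test_labels))     # accuracy on the test set
print(confusion_matrix(test_labels, test_pred))  # rows: true class, columns: predicted class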
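
The layout of a confusion matrix is easy to misread. The sketch below uses made-up labels to show how scikit-learn arranges the four cells for a binary problem; with the notebook's variables the equivalent call would be `confusion_matrix(test_labels, test_pred)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # made-up true labels
y_pred = [0, 1, 1, 0, 1, 0]  # made-up predictions

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))
```

Accuracy is the sum of the diagonal divided by the total, here (2 + 2) / 6.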

In [None]:
# Print the feature importances of the decision tree model
print(pd.DataFrame(dt_model.feature_importances_, columns=["Imp"], index=train_set.columns))

# 11. Apply the Random forest model and print the accuracy of Random forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(criterion = 'entropy', class_weight={0:.5,1:.5}, max_depth = 5, min_samples_leaf=5)
rfcl = rfcl.fit(train_set, train_labels)
test_pred = rfcl.predict(test_set)
rfcl.score(test_set , test_labels)

# 12. Apply the AdaBoost ensemble algorithm to the same data and print the accuracy.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
#abcl = AdaBoostClassifier(base_estimator=dt_model, n_estimators=50)
abcl = AdaBoostClassifier( n_estimators= 20)
abcl = abcl.fit(train_set, train_labels)

test_pred = abcl.predict(test_set)
abcl.score(test_set , test_labels)
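
The commented-out line above passes a custom base learner by keyword. That keyword was renamed from `base_estimator` to `estimator` in scikit-learn 1.2, so the sketch below passes the learner positionally, which should work under either name. It runs on synthetic data from `make_classification`, and the depth-1 tree (`stump`) is an assumption chosen for illustration, not the notebook's setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (8 features, binary target)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A shallow tree is a typical weak learner for boosting
stump = DecisionTreeClassifier(criterion='entropy', max_depth=1)
demo_abcl = AdaBoostClassifier(stump, n_estimators=20, random_state=0)
demo_abcl.fit(X_train, y_train)

demo_acc = demo_abcl.score(X_test, y_test)
print(demo_acc)
```

A fully grown tree such as `dt_model` usually fits the training data perfectly, which leaves little for later boosting rounds to correct.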

# 13. Apply Bagging Classifier Algorithm and print the accuracy.

In [None]:
from sklearn.ensemble import BaggingClassifier

bgcl = BaggingClassifier(n_estimators=10, max_samples= .7, bootstrap=True)
bgcl = bgcl.fit(train_set, train_labels)

In [None]:
test_pred = bgcl.predict(test_set)
bgcl.score(test_set , test_labels)

# 14. Apply GradientBoost Classifier Algorithm for the same data and print the accuracy

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.05)
gbcl = gbcl.fit(train_set, train_labels)


In [None]:
test_pred = gbcl.predict(test_set)
gbcl.score(test_set , test_labels)
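
To compare the four ensembles side by side, the fit-and-score steps can be collected in one loop. The following is a sketch on synthetic data from `make_classification`, so the printed accuracies say nothing about the diabetes dataset. The hyperparameters mirror the cells above, while the dictionary keys and `random_state` values are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (8 features, binary target)
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    'random_forest': RandomForestClassifier(criterion='entropy', max_depth=5,
                                            min_samples_leaf=5, random_state=1),
    'adaboost': AdaBoostClassifier(n_estimators=20, random_state=1),
    'bagging': BaggingClassifier(n_estimators=10, max_samples=0.7,
                                 bootstrap=True, random_state=1),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=50,
                                                    learning_rate=0.05, random_state=1),
}

# Fit every model on the same training data and score it on the same test data
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Fixing `random_state` makes the comparison repeatable; without it, the ranking of models with similar accuracy can change from run to run.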