In [1]:
"""
Decision tree (DT): a supervised learning algorithm (the target variable is predefined)
Can be used for classification and regression
Mostly used for classification
It splits the population or sample into two or more homogeneous sets on the basis of the most significant splitter
Suitable for a categorical variable with a large number of classes


Types of Decision Trees 
-Categorical variable DT
-Continuous variable DT

Terminology
-Root node: represents the entire population. Starting point.
-Decision node: a sub-node which splits further into sub-nodes
-Terminal/Leaf node: a node that does not split any further
-Splitting: the process of dividing a node
-Pruning: removing sub-nodes or restricting the tree size to prevent over-fitting
-Branch/sub-tree
-Parent and child node

Advantages:
-easy to understand
-useful in data exploration
-less data cleaning required
-data type is not a constraint (can handle both numerical and categorical variables)
-non-parametric method (DTs make no assumptions about the distribution of the data)

Disadvantages:
-Over-fitting
-Less suited to continuous variables (each split reduces a continuous value to a threshold test, so some information is lost)

Model building
1) when the dependent variable is categorical it is a classification tree,
whereas when the dependent variable is continuous it is a regression tree
2) In a classification tree, the value of a terminal node is the mode of the training observations falling in that region,
so an unseen observation that falls in that region is predicted with the mode value.
3) In a regression tree, the value of a terminal node is the mean of the training observations falling in that region,
so an unseen observation that falls in that region is predicted with the mean value.
4) It is a greedy algorithm: it optimises the current split without considering future splits


##Criteria for Classification DT:

1)Gini Index (criterion for a classifier DT)
-works with a categorical target variable
-performs binary splits
-scikit-learn's criterion='gini' is the Gini impurity, 1 - sum(p_i^2): the lower the value, the higher the homogeneity
 (some texts score sum(p_i^2) instead, for which a higher value means higher homogeneity)
-CART (Classification And Regression Trees) uses the Gini method to create binary splits

2)Chi square
-Another technique that can be used for a classification DT
-the higher the chi-sq value, the more statistically significant the difference between the sub-nodes and the parent node

3)Information Gain
-when the outcome is known with some certainty the entropy is low, whereas if the outcome is uncertain the 
entropy is high
-a split whose child nodes have low entropy gives higher homogeneity
Information gain = entropy(parent) - weighted average of entropy(children)


##Criteria for Regressor DT:
1)Reduction in variance: it chooses the split whose child nodes have the lowest weighted variance.



Pruning the DT
-- setting constraints on DT
1) min samples for a node split (min_samples_split, default value = 2)
2) min samples in a terminal node (min_samples_leaf, default value = 1)
3) max depth of the tree (max_depth)
4) max number of terminal nodes (max_leaf_nodes)
5) max features to consider for a split (max_features)

-- ensemble modelling


"""

In [5]:
import pandas as pd
import numpy as np
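
The classification criteria described in the notes above can be made concrete with a small calculation. This is only an illustrative sketch: the helper names `gini_impurity`, `entropy` and `information_gain` are made up for this example (they are not scikit-learn functions), and the labels are toy values rather than rows of cars.csv.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means a perfectly homogeneous node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; 0 means a perfectly homogeneous node."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["acc", "acc", "unacc", "unacc"]
pure_split = [["acc", "acc"], ["unacc", "unacc"]]        # children are homogeneous
useless_split = [["acc", "unacc"], ["acc", "unacc"]]     # children look like the parent

print(gini_impurity(parent))
print(entropy(parent))
print(information_gain(parent, pure_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves the class mix unchanged gains nothing.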

In [6]:
cars_data=pd.read_csv('cars.csv', header=None)
cars_data.head()

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [7]:
cars_data.shape

(1728, 7)

In [8]:
cars_data.columns=['buying','maint','doors','persons','lug_boot','safety','classes'] ##assigning header for the dataset

In [9]:
cars_data.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,classes
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


## Preprocessing Dataset 

In [10]:
cars_data.isnull().sum() ##no missing values
#outlier treatment is not required because every variable is categorical

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
classes     0
dtype: int64

In [11]:
cars_df=cars_data.copy() #working copy, so that the original data set stays unchanged

## Converting Categorical data into numerical data 

In [12]:
colname=cars_df.columns
colname

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'classes'], dtype='object')

In [13]:
#all variables are categorical, so label encoding is applied to every column (the X variables and the target)
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

for x in colname:
    cars_df[x]=le.fit_transform(cars_df[x]) #fit and transform are two steps of label encoding
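
Because the loop above refits the same `le` object for every column, only the mapping of the last column is still available afterwards. The sketch below shows how a mapping can be read from a fitted encoder; the list `classes` is a toy stand-in for the `classes` column, not the real data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'classes' column of cars.csv.
classes = ["unacc", "acc", "vgood", "good", "unacc"]

encoder = LabelEncoder()
codes = encoder.fit_transform(classes)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Keeping one encoder per column (for example in a dictionary keyed by column name) would make every mapping recoverable later.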
    

In [14]:
cars_df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,classes
0,3,3,0,0,2,1,2
1,3,3,0,0,2,2,2
2,3,3,0,0,2,0,2
3,3,3,0,0,1,1,2
4,3,3,0,0,1,2,2


In [15]:
cars_data.classes.value_counts()
"""
Label encoding of the target (LabelEncoder assigns the codes in sorted order of the class names):
acc==>0
good==>1
unacc==>2
vgood==>3
"""

## Creating X and Y 

In [16]:
X=cars_df.values[:,:-1] #all rows and all columns except last one
Y=cars_df.values[:,-1] #all rows and last column


## Scaling the dataset 

In [17]:
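#StandardScaler rescales each column to zero mean and unit variance; it is not an outlier treatment
#(decision trees do not need scaling, but the SVM and logistic regression models below benefit from it)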
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X)

X=scaler.transform(X)
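
What the scaler does to a label-encoded column can be reproduced by hand. The sketch below assumes that the categories in a column are equally frequent, which the values printed for `X_train` further down are consistent with; `standardise` is a hypothetical helper written for this example.

```python
from statistics import mean, pstdev

def standardise(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

four_levels = standardise([0, 1, 2, 3])   # a column with four equally frequent codes
three_levels = standardise([0, 1, 2])     # a column with three equally frequent codes

print([round(v, 8) for v in four_levels])
print([round(v, 8) for v in three_levels])
```

In a stricter workflow the scaler would be fitted on the training split only and then applied to the test split, so that no information from the test rows reaches the model.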




## Model Building - Decision Tree

### Train and Test split

In [18]:
from sklearn.model_selection import train_test_split

#rule of thumb used here: fewer than about 1800 rows => 70:30 split, otherwise 80:20

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3, random_state=10) #70:30 split 
print(X_train)

[[-1.34164079 -1.34164079 -0.4472136   1.22474487 -1.22474487 -1.22474487]
 [-0.4472136   1.34164079 -1.34164079  1.22474487  0.          0.        ]
 [ 0.4472136   1.34164079  0.4472136   1.22474487  1.22474487  0.        ]
 ...
 [-1.34164079  1.34164079  1.34164079  0.          0.         -1.22474487]
 [ 0.4472136   0.4472136   0.4472136   0.         -1.22474487  0.        ]
 [ 0.4472136  -0.4472136   1.34164079  1.22474487  1.22474487 -1.22474487]]


### Running Decision Tree Model 

In [19]:
#predicting using the Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

model_DecisionTree = DecisionTreeClassifier(random_state=10)

#model_DecisionTree = DecisionTreeClassifier(random_state=10, min_samples_leaf=3, max_depth=10) #manually pruning the model


#fit the model on the training data

model_DecisionTree.fit(X_train,Y_train)

"""
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=10,
            splitter='best')
            
The split criterion used here is the Gini index
max_depth=None -- the depth of the tree is not limited
max_features=None -- all features are considered at every split
max_leaf_nodes=None -- the number of leaf nodes is not limited
splitter='best' -- at each node, choose the split point that gives the maximum homogeneity

"""

In [20]:
Y_pred=model_DecisionTree.predict(X_test) #predicting Y values for X test data
#print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [23]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

# An unpruned tree is prone to overfitting; the test accuracy alone cannot show it, so compare it with the training accuracy

[[101   0   1   0]
 [  2  19   0   0]
 [  0   0 371   0]
 [  1   0   0  24]]


0.9922928709055877


              precision    recall  f1-score   support

           0       0.97      0.99      0.98       102
           1       1.00      0.90      0.95        21
           2       1.00      1.00      1.00       371
           3       1.00      0.96      0.98        25

   micro avg       0.99      0.99      0.99       519
   macro avg       0.99      0.96      0.98       519
weighted avg       0.99      0.99      0.99       519
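
One way to check for overfitting is to compare training and test accuracy, with and without the pruning constraints. The sketch below is a hedged illustration on synthetic data from `make_classification` (not on cars.csv); the pruning values `min_samples_leaf=3` and `max_depth=10` are the ones from the commented-out line in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the cars data set is not used here.
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     n_informative=4, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=10)

candidates = {
    "unpruned": DecisionTreeClassifier(random_state=10),
    "pruned": DecisionTreeClassifier(random_state=10, min_samples_leaf=3,
                                     max_depth=10),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    # A large gap between the two accuracies is the usual sign of overfitting.
    scores[name] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(name, scores[name])
```

The unpruned tree memorises the training rows, so its training accuracy says little; the gap to the test accuracy is the number to watch.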



In [22]:
#After training the model, check the variable importances: a higher value means the feature contributed more to the splits
print(list(zip(colname,model_DecisionTree.feature_importances_)))

[('buying', 0.1510848831946676), ('maint', 0.2506508516803624), ('doors', 0.060026331736828115), ('persons', 0.19355707150872045), ('lug_boot', 0.09892620952419463), ('safety', 0.2457546523552268)]
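
The printed importances are easier to read when sorted. The sketch below simply reuses the numbers from the output above; in the notebook the same list would come from `zip(colname, model_DecisionTree.feature_importances_)`.

```python
# Importances copied from the output above.
importances = [('buying', 0.1510848831946676), ('maint', 0.2506508516803624),
               ('doors', 0.060026331736828115), ('persons', 0.19355707150872045),
               ('lug_boot', 0.09892620952419463), ('safety', 0.2457546523552268)]

# Sort from most to least important feature.
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<10}{score:.3f}")

# The importances of a fitted tree are normalised, so they add up to 1.
total = sum(score for _, score in importances)
print(round(total, 6))
```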


In [25]:
#write the tree in Graphviz DOT format to a text file, then paste the file contents into webgraphviz.com to plot the decision tree
from sklearn import tree
with open("model_DecisionTree.txt", "w") as f: #the .txt file is created in the current working directory (here C:\Users\Manoj Nahak)
    #open() opens the text file; "w" opens it in write mode
    tree.export_graphviz(model_DecisionTree, feature_names=list(colname[:-1]), out_file=f)
    #export_graphviz writes out every split of the fitted decision tree
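
If pasting the file into a website is inconvenient, the rules of a fitted tree can also be printed as plain text. The sketch below assumes a scikit-learn release that provides `sklearn.tree.export_text` (it may be newer than the version that produced the outputs in this notebook), and it uses a tiny made-up data set with a hypothetical feature name `feature_a`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up data set: one feature, two classes.
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=10).fit(X_toy, y_toy)

# export_text returns the fitted rules as an indented plain-text report.
report = export_text(toy_tree, feature_names=["feature_a"])
print(report)
```

For the notebook's model the call would take `model_DecisionTree` and `list(colname[:-1])` in place of the toy objects.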

## Model Building using SVM & Logistic Regression

In [42]:
#building the SVM model (tuned values: gamma=0.1, C=70)
# 4 kernel types
# 1) linear = straight-line (hyperplane) decision boundary
# 2) rbf = radial basis function (the default)
# 3) poly = polynomial
# 4) sigmoid = very rarely used
from sklearn import svm
classifier=svm.SVC(kernel="rbf", gamma=0.1, C=70) #C is the cost of misclassification; gamma controls how far the influence of a single training point reaches

#fitting training data to the model
classifier.fit(X_train,Y_train)

Y_pred=classifier.predict(X_test) #applying the model
#print(list(zip(Y_test,Y_pred)))

In [43]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print(accuracy_score(Y_test,Y_pred))
print()
print(classification_report(Y_test,Y_pred))

# Here the accuracy of the base SVM model (cost C=1.0) is 85.54%, and with C=70 the accuracy is 99.42%.
# Tuning the model improves the accuracy


[[ 99   2   1   0]
 [  0  21   0   0]
 [  0   0 371   0]
 [  0   0   0  25]]

0.9942196531791907

              precision    recall  f1-score   support

           0       1.00      0.97      0.99       102
           1       0.91      1.00      0.95        21
           2       1.00      1.00      1.00       371
           3       1.00      1.00      1.00        25

   micro avg       0.99      0.99      0.99       519
   macro avg       0.98      0.99      0.98       519
weighted avg       0.99      0.99      0.99       519
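
The values of `C` and `gamma` can be searched systematically instead of being tried by hand. The sketch below is only an illustration on synthetic data: the grid is hypothetical apart from `C=1`, `C=70` and `gamma=0.1`, which are the values mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; the cars data set is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, random_state=10)

# Hypothetical grid around the values used in this notebook.
param_grid = {"C": [1, 10, 70], "gamma": [0.01, 0.1, 1.0]}

# 5-fold cross-validation over every (C, gamma) combination.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

In the notebook the search would be fitted on `X_train` and `Y_train`, leaving the test split for the final check.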



# LogisticRegression

In [35]:
from sklearn.linear_model import LogisticRegression
#create a model
classifier=LogisticRegression()
#fitting training data to the model
classifier.fit(X_train,Y_train)

Y_pred=classifier.predict(X_test)
#print(list(zip(Y_test,Y_pred))) #O/P will be [(2, 2), (2, 2), (2, 2), (2, 2), (1, 2), (2, 2), (0, 2), (0, 2),......



In [36]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

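# Here the accuracy of the base Logistic Regression model is 69.94%
# LogisticRegression in scikit-learn does handle a multi-class target; the weak result more likely comes from feeding label-encoded categories to a linear model as if they were ordered numbers
# Here it only ever predicts classes 0 and 2; classes 1 and 3 are never predicted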

[[ 22   0  80   0]
 [  3   0  18   0]
 [ 30   0 341   0]
 [ 11   0  14   0]]


0.6994219653179191


              precision    recall  f1-score   support

           0       0.33      0.22      0.26       102
           1       0.00      0.00      0.00        21
           2       0.75      0.92      0.83       371
           3       0.00      0.00      0.00        25

   micro avg       0.70      0.70      0.70       519
   macro avg       0.27      0.28      0.27       519
weighted avg       0.60      0.70      0.64       519
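
A possible reason for the weak logistic regression result is the encoding rather than the number of classes. The sketch below is an assumption-laden toy example (made-up data, not cars.csv): the target is 1 only for the category whose integer code sits in the middle, which a linear model on integer codes cannot separate but a one-hot encoded model can.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up data: sorted order gives the codes high=0, low=1, med=2.
levels = ["high", "low", "med"]
X_cat = [[level] for level in levels * 30]
y_cat = [1 if row[0] == "low" else 0 for row in X_cat]

# Same classifier, two different encodings of the single categorical column.
ordinal_model = make_pipeline(OrdinalEncoder(), LogisticRegression())
onehot_model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                             LogisticRegression())

ordinal_acc = ordinal_model.fit(X_cat, y_cat).score(X_cat, y_cat)
onehot_acc = onehot_model.fit(X_cat, y_cat).score(X_cat, y_cat)

print(ordinal_acc, onehot_acc)
```

Whether one-hot encoding would help on the cars data has not been checked here; the example only shows why integer codes can limit a linear model.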





## Ensemble Model Techniques

### Bagging Approach

In [25]:
"""
A single algorithm is rarely relied on; multiple algorithms are usually run on the dataset

Based on the example above, the accuracies are as follows: 
-DT --> 99.22%
-SVC --> 85.58%
-Tuned SVC --> 99.42%
-lr --> 69.94%

Two approaches are followed in the industry:
1) Running individual models --> selecting the best model 
2) Running a combination of models --> averaging the output

Ensemble model---
Bagging : Parallel approach -- used to improve the accuracy score
-Bags of data are created simultaneously, one bag of training data for each DT you wish to create.
-For each bag the observations are selected by random sampling with replacement. 
-Each observation has an equal probability of being selected for every bag.
-Observations are therefore repeated within a bag. On average about 63% of the distinct training observations appear in each bag.
-One DT is trained on each bag.
-For classification, every trained model predicts each observation in the test data and 
then the mode of all the model outcomes is taken as the final class assigned to the observation   
-For regression, the only difference is that the mean (average) of all the model outcomes 
is taken as the final value assigned to the observation
-For a binary target, an odd number of bags/models avoids tied votes


Tree ensemble algorithms used below:
ExtraTreesClassifier()  (by default it does not bootstrap the rows; it randomises the split thresholds instead)
RandomForestClassifier()

Random Forest
The major difference between plain bagging and random forest is that a random forest samples the variables 
as well as the observations.
Each tree is built on randomly sampled observations and considers a random subset of the variables
(in scikit-learn the subset is redrawn at every split).
Multiple DTs are created
It is more time consuming and costly than a single DT

"""
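
The bootstrap step and the final vote described in the notes above can be simulated in a few lines. This is a minimal sketch with made-up row ids and predictions; `majority_vote` is a hypothetical helper, not a library function.

```python
import random
from collections import Counter

rng = random.Random(10)
n_rows = 10_000
row_ids = range(n_rows)

# One bag: draw n_rows row ids at random, with replacement.
bag = [rng.choice(row_ids) for _ in range(n_rows)]

# Share of distinct training rows that made it into the bag (about 1 - 1/e).
unique_fraction = len(set(bag)) / n_rows
print(round(unique_fraction, 3))

def majority_vote(predictions):
    """Return the most common class among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three models vote on one test observation.
print(majority_vote(["acc", "unacc", "acc"]))
```

The rows that never land in a given bag are the "out-of-bag" rows for that tree.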

### ExtraTreesClassifier()

In [25]:
#predicting using the ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier

model=ExtraTreesClassifier(n_estimators=50,random_state=10)  ##n_estimators is the number of trees to build
#fit the model on the data and predict the values
model=model.fit(X_train,Y_train)

Y_pred=model.predict(X_test)

In [26]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

# Here the model accuracy for ExtraTreesClassifier is 97.88%


[[100   0   2   0]
 [  3  18   0   0]
 [  3   0 368   0]
 [  1   2   0  22]]


0.9788053949903661


              precision    recall  f1-score   support

           0       0.93      0.98      0.96       102
           1       0.90      0.86      0.88        21
           2       0.99      0.99      0.99       371
           3       1.00      0.88      0.94        25

   micro avg       0.98      0.98      0.98       519
   macro avg       0.96      0.93      0.94       519
weighted avg       0.98      0.98      0.98       519



### RandomForestClassifier()

In [27]:
#predicting using the Random_Forest_Classifier
from sklearn.ensemble import RandomForestClassifier

model=RandomForestClassifier(n_estimators=100,random_state=10)  ##n_estimators is the number of trees (one bootstrap bag per tree)
#fit the model on the data and predict the values
model=model.fit(X_train,Y_train)

Y_pred=model.predict(X_test)

In [28]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

# Here the model accuracy for RandomForestClassifier is 98.27%


[[ 99   3   0   0]
 [  1  19   0   1]
 [  2   0 369   0]
 [  2   0   0  23]]


0.9826589595375722


              precision    recall  f1-score   support

           0       0.95      0.97      0.96       102
           1       0.86      0.90      0.88        21
           2       1.00      0.99      1.00       371
           3       0.96      0.92      0.94        25

   micro avg       0.98      0.98      0.98       519
   macro avg       0.94      0.95      0.95       519
weighted avg       0.98      0.98      0.98       519



# Boosting - Sequential

In [29]:
#here the models are built one after another, each one correcting the errors of the previous ones, and the final combined model is then applied to the testing data
#whereas in case of bagging the models are built independently and a majority vote selects the output

In [30]:
"""
Three commonly used boosting algorithms
AdaBoostClassifier() - works with any base model (increases the weight of misclassified observations and decreases the weight of correctly classified ones)
GradientBoostingClassifier() - built on decision trees; each new tree is fitted to the errors (the gradient of the loss) left by the previous trees
XGBoost - extreme gradient boosting (a separate library, not part of scikit-learn)

"""
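
The re-weighting idea behind AdaBoost can be shown for a single boosting round. The sketch below follows the classic two-class formulation with made-up results for ten rows; the implementation inside scikit-learn differs in its details, so treat this as an illustration of the principle only.

```python
from math import exp, log

# Toy round: 1 = the weak learner classified the row correctly, 0 = it did not.
correct = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
weights = [1 / len(correct)] * len(correct)

# Weighted error of the weak learner and its say (alpha) in the final vote.
error = sum(w for w, ok in zip(weights, correct) if not ok)
alpha = 0.5 * log((1 - error) / error)

# Raise the weights of the misclassified rows, lower the rest, then normalise.
updated = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
updated = [w / total for w in updated]

print(round(error, 2), round(alpha, 4))
print([round(w, 4) for w in updated])
```

After the update the misclassified rows carry half of the total weight, so the next weak learner is pushed to get them right.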

# Running AdaBoost classifier

In [32]:
#predicting using the Adaboost classifier
from sklearn.ensemble import AdaBoostClassifier

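#the base model is passed positionally because its keyword was renamed from base_estimator to estimator in newer scikit-learn releases
model_AdaBoost = AdaBoostClassifier(DecisionTreeClassifier(), 
                                    n_estimators=50, random_state=10)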

#fit the model on the data and predict the values
model_AdaBoost.fit(X_train,Y_train)

Y_pred=model_AdaBoost.predict(X_test)

In [33]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

[[ 99   2   1   0]
 [  4  17   0   0]
 [  0   0 371   0]
 [  1   0   0  24]]


0.9845857418111753


              precision    recall  f1-score   support

           0       0.95      0.97      0.96       102
           1       0.89      0.81      0.85        21
           2       1.00      1.00      1.00       371
           3       1.00      0.96      0.98        25

   micro avg       0.98      0.98      0.98       519
   macro avg       0.96      0.94      0.95       519
weighted avg       0.98      0.98      0.98       519



# Running Gradient Boosting Classifier

In [34]:
#predicting using the Gradient boosting classifier

from sklearn.ensemble import GradientBoostingClassifier

model_GradientBoosting=GradientBoostingClassifier(n_estimators=200,
                                                 min_samples_leaf=3,
                                                 random_state=10)

#fit the model on the data and predict the values
model_GradientBoosting.fit(X_train, Y_train)

Y_pred= model_GradientBoosting.predict(X_test)

In [35]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

[[101   1   0   0]
 [  0  21   0   0]
 [  0   0 371   0]
 [  0   0   0  25]]


0.9980732177263969


              precision    recall  f1-score   support

           0       1.00      0.99      1.00       102
           1       0.95      1.00      0.98        21
           2       1.00      1.00      1.00       371
           3       1.00      1.00      1.00        25

   micro avg       1.00      1.00      1.00       519
   macro avg       0.99      1.00      0.99       519
weighted avg       1.00      1.00      1.00       519



In [36]:
"""
Bagging/Boosting : the same type of classifier trained on different samples (or re-weightings) of the data

Voting ensemble : different classifiers trained on the same data
"""

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [38]:
#create the sub models
estimators = []
#model1 = LogisticRegression()
#estimators.append(('log',model1))

model2 = DecisionTreeClassifier(random_state=10)
estimators.append(('cart',model2))

model3 = SVC(kernel='rbf',gamma=0.1,C=70)
estimators.append(('svm',model3))

#print(estimators)

In [39]:
#create the ensemble model (default voting='hard': each sub-model casts one vote per observation)
ensemble = VotingClassifier(estimators)
ensemble.fit(X_train, Y_train)
Y_pred=ensemble.predict(X_test)
#print(Y_pred)

In [40]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print(confusion_matrix(Y_test,Y_pred))
print()
print()
print(accuracy_score(Y_test,Y_pred))
print()
print()
print(classification_report(Y_test,Y_pred))

[[102   0   0   0]
 [  2  19   0   0]
 [  0   0 371   0]
 [  1   0   0  24]]


0.9942196531791907


              precision    recall  f1-score   support

           0       0.97      1.00      0.99       102
           1       1.00      0.90      0.95        21
           2       1.00      1.00      1.00       371
           3       1.00      0.96      0.98        25

   micro avg       0.99      0.99      0.99       519
   macro avg       0.99      0.97      0.98       519
weighted avg       0.99      0.99      0.99       519
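
With only two sub-models, every disagreement between them is a tied vote. The sketch below adds a third sub-model so that most disagreements have a majority; it runs on synthetic data rather than cars.csv, and the name `ensemble_demo` is introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the cars data set is not used here.
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     n_informative=4, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=10)

estimators = [
    ("log", LogisticRegression()),
    ("cart", DecisionTreeClassifier(random_state=10)),
    ("svm", SVC(kernel="rbf", gamma=0.1, C=70)),
]

# Hard voting: the class predicted by most sub-models wins.
ensemble_demo = VotingClassifier(estimators, voting="hard").fit(X_tr, y_tr)
ensemble_acc = ensemble_demo.score(X_te, y_te)
print("ensemble", ensemble_acc)

# Accuracy of each fitted sub-model, for comparison with the ensemble.
for name, fitted in ensemble_demo.named_estimators_.items():
    print(name, fitted.score(X_te, y_te))
```

Comparing the ensemble score with the individual scores shows whether voting adds anything over the best single model.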

