# Linear Classifiers: Using Different Classification Models like Decision Trees

<img src='Data/3.PNG'>

Classification and Regression Trees (**CART**) refers to **Decision Tree algorithms** that can be used for classification or regression predictive modeling problems.

The representation of the CART model is a **binary tree**.

- A **node** represents a single input variable *(X)* and a split point on that variable, assuming the variable is numeric. 
- The **leaf nodes** (also called terminal nodes) of the tree contain an output variable *(Y)* which is used to make a prediction.
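
The node and leaf structure described above can be sketched in a few lines of Python. This is only an illustration: the `Node` and `Leaf` classes, the `predict` helper, and the hand-built `toy_tree` (including its thresholds) are hypothetical and are not how scikit-learn stores a fitted tree.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal node: holds the output value (Y) used as the prediction."""
    prediction: str


@dataclass
class Node:
    """Internal node: one input variable (X) and a split point on it."""
    feature: str        # name of the input variable tested at this node
    threshold: float    # split point; rows with value <= threshold go left
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's prediction."""
    while isinstance(tree, Node):
        tree = tree.left if row[tree.feature] <= tree.threshold else tree.right
    return tree.prediction


# Hypothetical hand-built tree, for illustration only (not fitted to any data).
toy_tree = Node(
    feature="Left_Weight", threshold=2.5,
    left=Leaf("R"),
    right=Node(feature="Right_Weight", threshold=3.5,
               left=Leaf("L"), right=Leaf("R")),
)

print(predict(toy_tree, {"Left_Weight": 1, "Right_Weight": 1}))  # R
print(predict(toy_tree, {"Left_Weight": 4, "Right_Weight": 2}))  # L
```

Because every internal node has exactly two children, a prediction is a single root-to-leaf walk with one comparison per level.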

## BUILDING A DECISION TREE CLASSIFIER

### Data Set

<img src='Data/4.PNG'>

## Imports

In [43]:
import numpy as np
import pandas as pd

import sklearn
# sklearn.cross_validation was removed from scikit-learn; its helpers now live in sklearn.model_selection
from sklearn import metrics, model_selection
from sklearn.model_selection import train_test_split, cross_val_predict

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

## Loading Dataset

- Dataset = Balance Scale data from UCI

In [44]:
# Forward slash works on every OS and avoids accidental backslash escapes
data = pd.read_csv('Data/data2_balance_scale_data.csv')

In [45]:
data.head()

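|   | Class | Left_Weight | Left_Distance | Right_Weight | Right_Distance |
|---|-------|-------------|---------------|--------------|----------------|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |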


In [46]:
data.shape

(625, 5)

In [47]:
X = data.iloc[:,1:5]
Y = data.Class

In [75]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, random_state=100)

    criterion = "gini" selects Gini impurity; criterion = "entropy" selects information gain (entropy). Both criteria work on the numeric features used here.
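
To make the two criteria concrete, here is a minimal sketch that computes Gini impurity, entropy, and the information gain of a split by hand. The helper names `gini`, `entropy`, and `information_gain` and the toy label lists are assumptions made for illustration; scikit-learn computes these quantities internally.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: sum(p_k * log2(1 / p_k)) over the class proportions."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


pure = ["L", "L", "L", "L"]    # toy node containing a single class
mixed = ["L", "L", "R", "R"]   # toy node with an even two-class mix

print(gini(pure), entropy(pure))      # 0.0 0.0
print(gini(mixed), entropy(mixed))    # 0.5 1.0
print(information_gain(mixed, ["L", "L"], ["R", "R"]))  # 1.0
```

Both measures are zero for a pure node and largest for an even class mix, so a tree built with either criterion prefers splits that produce purer children.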

#### Classifier 1: Model Using Gini Index

In [49]:
model_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)
model_gini.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

#### Classifier 2: Model using Information Gain

In [50]:
model_info = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)
model_info.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

#### Predictions

In [51]:
Y_hat_gini = model_gini.predict(X_test)
Y_hat_info = model_info.predict(X_test)

#### Final Data set

In [52]:
test_data = X_test.copy()
test_data['Y'] = Y_test
test_data['Y_Hat_Gini'] = Y_hat_gini
test_data['Y_Hat_Info'] = Y_hat_info
test_data.head()

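|     | Left_Weight | Left_Distance | Right_Weight | Right_Distance | Y | Y_Hat_Gini | Y_Hat_Info |
|-----|-------------|---------------|--------------|----------------|---|------------|------------|
| 462 | 4 | 4 | 3 | 3 | L | R | R |
| 200 | 2 | 4 | 1 | 1 | L | L | L |
| 397 | 4 | 1 | 5 | 3 | R | R | R |
| 75  | 1 | 4 | 1 | 1 | L | L | L |
| 259 | 3 | 1 | 2 | 5 | R | L | L |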


#### Accuracy

In [71]:
# Accuracy for classification: percentage of test rows whose predicted class matches the true class
print("Accuracy using Gini Index = ", metrics.accuracy_score(Y_test, Y_hat_gini) * 100)
print("Accuracy using Information Gain = ", metrics.accuracy_score(Y_test, Y_hat_info) * 100)

Accuracy using Gini Index =  70.4
Accuracy using Information Gain =  71.2
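
Accuracy is simply the share of rows where the prediction equals the true label. As a rough sketch, the calculation can be reproduced by hand on the five test rows displayed in the "Final Data set" table; the `accuracy_percent` helper is a hypothetical stand-in for `metrics.accuracy_score`, and five rows are far too few to say anything about the model itself.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction matches the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * matches / len(y_true)


# The five rows shown by test_data.head(), in display order.
y_true = ["L", "L", "R", "L", "R"]
y_hat_gini = ["R", "L", "R", "L", "L"]
y_hat_info = ["R", "L", "R", "L", "L"]

print(accuracy_percent(y_true, y_hat_gini))  # 60.0
print(accuracy_percent(y_true, y_hat_info))  # 60.0
```

On these five rows both models miss the same two cases (rows 462 and 259), which is why they score identically here even though their accuracies on the full test set differ slightly.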


### Using Cross Validation instead of train-test splits

In [76]:
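# Alternative: K-fold cross validation.
# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted here.
K = model_selection.KFold(n_splits=3)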

In [73]:
model = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)

Y_hat_cv = model_selection.cross_val_predict(model, X,Y, cv=K)
metrics.accuracy_score(Y, Y_hat_cv)*100

63.519999999999996

In [74]:
model = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5)

Y_hat_cv = model_selection.cross_val_predict(model, X,Y, cv=K)
metrics.accuracy_score(Y, Y_hat_cv)*100

66.719999999999999
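
To show what the 3-fold split does with the 625 rows, the sketch below generates contiguous, unshuffled folds in plain Python. The `kfold_indices` helper is hypothetical and is not scikit-learn's implementation; it only mirrors the documented behaviour of an unshuffled `KFold`, where the first `n_samples % n_splits` folds receive one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds are one sample larger than the rest.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(625, 3))
for fold, (train_idx, test_idx) in enumerate(folds):
    print(fold, len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

Each row lands in exactly one test fold, so `cross_val_predict` returns one out-of-fold prediction per row and the accuracy above is computed over all 625 rows. Because the folds are contiguous blocks, the row order of the file affects which rows each model is trained on.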

<img src='Data/5.PNG'>