# What is a Decision Tree?

<div  style="color:blue;font-family:Candara,arial,helvetica;line-height:20px"><strong>

## Decision trees are a type of supervised machine learning (that is, the training data specifies both the input and the corresponding output) in which the data is repeatedly split according to the value of a chosen feature. A tree consists of two kinds of entities: decision nodes and leaves. The decision nodes are where the data is split, and the leaves are the decisions, or final outcomes.
    
<img src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1545934190/1_r5ikdb.png" alt="drawing" width="600" height="300"/>     
    
<img src="https://www.xoriant.com/blog/wp-content/uploads/2017/08/Decision-Trees-modified-1.png" alt="drawing" width="600" height="300"/>
   
    
</strong></div>
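
To make the idea of decision nodes and leaves concrete, here is a minimal sketch that fits a deliberately shallow tree and prints its rules as text. The choice of the Iris dataset, `max_depth=2`, and the variable names `small_tree` and `rules` are assumptions made for illustration; they are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small, well-known dataset (illustrative choice)
iris = load_iris()

# Limit the depth so the printed tree stays readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=1234)
small_tree.fit(iris.data, iris.target)

# Lines containing a feature threshold are decision nodes;
# lines starting with "class:" are leaves (final outcomes)
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line with a comparison such as `<=` is a decision node, and each `class:` line is a leaf.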

In [22]:
from IPython.display import IFrame

IFrame(src='https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html', width=1000, height=1000)

# Impurity Criterion

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTUu8H6L79w3fZK0CLt9FMgqOQp1jBQmWLYp18tOPZmtPEj37sgpg&s" alt="drawing" width="600" height="300"/>
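
As a rough illustration of what an impurity criterion measures, the sketch below computes the two criteria most commonly used for classification trees, Gini impurity and entropy, for a list of class labels. The helper names `gini` and `entropy` and the toy label lists are hypothetical and are not taken from the notebook. In scikit-learn, `DecisionTreeClassifier` selects between them through its `criterion` parameter (`"gini"` by default, or `"entropy"`).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k)) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy nodes (hypothetical): one perfectly mixed, one pure
mixed_node = [0, 0, 1, 1]
pure_node = [1, 1, 1, 1]

print(gini(mixed_node), entropy(mixed_node))
print(gini(pure_node), entropy(pure_node))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a tree prefers splits that lower them.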



# Run and Evaluate Model

## Import Libraries and split data to test and train

In [1]:
# Import libraries
import pandas as pd

# Read dataset
data = pd.read_csv('04 - decisiontreeAdultIncome.csv')

# Check for null values in each column
print(data.isnull().sum(axis=0))

# Inspect the column types, then create dummy variables for the categorical columns
print(data.dtypes)
data_prep = pd.get_dummies(data, drop_first=True)


# Create X and Y Variables
X = data_prep.iloc[:, :-1]
Y = data_prep.iloc[:, -1]


# Split the X and Y dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1234, stratify=Y)

## Train the model

In [2]:


# Import and train classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=1234)
dtc.fit(X_train, Y_train)


# Test the model
Y_predict = dtc.predict(X_test)

# Evaluate the model
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_predict)
score = dtc.score(X_test, Y_test)

print(score)
print(cm)



0.7710965133906014
[[3814  559]
 [ 800  764]]
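
The accuracy alone hides how the errors are distributed. As a small worked example, the sketch below recomputes accuracy from the confusion matrix printed above and derives precision and recall for label 1. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels)
cm = np.array([[3814, 559],
               [800, 764]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted 1s, how many are truly 1
recall = tp / (tp + fn)           # of the true 1s, how many were found

print(accuracy, precision, recall)
```

The recomputed accuracy agrees with the value returned by `dtc.score`, while recall for label 1 comes out noticeably lower than the overall accuracy.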


# Solve the Iris Dataset Problem

In [3]:
# import and load the Iris Dataset
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target


# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1234, stratify=Y)

In [4]:
# Import and train classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=1234)
dtc.fit(X_train, Y_train)


# Test the model
Y_predict = dtc.predict(X_test)

# Evaluate the model
from sklearn.metrics import confusion_matrix
cm_iris = confusion_matrix(Y_test, Y_predict)
score_iris = dtc.score(X_test, Y_test)

print(score_iris)
print(cm_iris)

0.9333333333333333
[[15  0  0]
 [ 0 13  2]
 [ 0  1 14]]


# Ensemble Learning: Bagging and Boosting

<div  style="color:blue;font-family:Candara,arial,helvetica;line-height:20px"><strong>


## Ensemble learning generates multiple models, such as classifiers or experts, and combines their predictions to solve a single computational intelligence problem. It is used mainly to improve a model's performance (in classification, prediction, function approximation, and so on) and to reduce the risk of selecting a single poor model.

## Bagging (bootstrap aggregating) reduces the variance of predictions by training each model on a different bootstrap sample, that is, a sample drawn from the original dataset with replacement, and then combining the models' outputs. Boosting is an iterative technique that trains models in sequence and adjusts the weight of each observation based on the previous classification, typically increasing the weight of observations that were misclassified.
    
    
<img src="https://miro.medium.com/max/1169/1*_pfQ7Xf-BAwfQXtaBbNTEg.png" alt="drawing" width="600" height="300"/>     
    
<img src="https://miro.medium.com/max/850/1*DwvwMlOcT1T9hZwIJvMfng.png" alt="drawing" width="600" height="300"/>     
   

</strong></div>
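
The bagging idea described above can be sketched by hand: draw bootstrap samples with replacement, train one decision tree per sample, and combine the trees by majority vote. This is a minimal illustration under stated assumptions, not how scikit-learn implements its ensembles; the number of trees (`n_trees = 25`), the seeds, and all variable names are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234, stratify=y)

rng = np.random.default_rng(1234)
n_trees = 25  # illustrative ensemble size
trees = []
for _ in range(n_trees):
    # Bootstrap sample: row indices drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect each tree's predictions: shape (n_trees, n_test_samples)
votes = np.stack([t.predict(X_test) for t in trees])

# Majority vote per test sample (Iris has 3 classes: 0, 1, 2)
bagged_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in votes.T])

bagged_accuracy = (bagged_pred == y_test).mean()
print(bagged_accuracy)
```

A random forest follows the same pattern and additionally restricts each split to a random subset of the features, which further decorrelates the trees.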

# Evaluate the Adult Income Dataset Using Random Forests

## Import Library and Split into Test/Train

In [13]:
# Import libraries
import pandas as pd

# Read dataset
data = pd.read_csv('04 - decisiontreeAdultIncome.csv')

# Check for null values in each column
print(data.isnull().sum(axis=0))

# Inspect the column types, then create dummy variables for the categorical columns
print(data.dtypes)
data_prep = pd.get_dummies(data, drop_first=True)


# Create X and Y Variables
X = data_prep.iloc[:, :-1]
Y = data_prep.iloc[:, -1]


# Split the X and Y dataset into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = \
train_test_split(X, Y, test_size = 0.3, random_state = 1234, stratify=Y)

## Evaluate the model

In [18]:

# Import and train Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=1234)
rfc.fit(X_train, Y_train)


# Test the RFC model
Y_predict = rfc.predict(X_test)

# Evaluate the RFC model
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(Y_test, Y_predict)
score2 = rfc.score(X_test, Y_test)

print(cm2)
print(score2)

[[3882  491]
 [ 712  852]]
0.7973724103082365




# Solve the Iris Dataset Problem Using Random Forest

In [20]:
# import and load the Iris Dataset
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target


# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1234, stratify=Y)

In [21]:
# Import and train Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=1234)
rfc.fit(X_train, Y_train)


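# Test the model (use the random forest, not the earlier decision tree)
Y_predict = rfc.predict(X_test)

# Evaluate the model
# Note: the output recorded below matches the decision tree's result exactly
# because it came from a run that called dtc here; re-run for the forest's numbers.
from sklearn.metrics import confusion_matrix
cm_iris = confusion_matrix(Y_test, Y_predict)
score_iris = rfc.score(X_test, Y_test)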

print(score_iris)
print(cm_iris)

0.9333333333333333
[[15  0  0]
 [ 0 13  2]
 [ 0  1 14]]


