In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [17]:
# Load the iris Dataset
iris = datasets.load_iris()
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


In [19]:
df['target'].value_counts()
# There are 50 samples of each class: 0.0 is setosa, 1.0 is versicolor and 2.0 is virginica.
# The species are encoded as numbers because most ML models cannot work with categorical labels directly; they must be encoded to numeric values first.

target
0.0    50
1.0    50
2.0    50
Name: count, dtype: int64

In [None]:
# Visualize the data first to get an idea of which model might suit it.
# Here we plot only two features: petal length and petal width.
sns.FacetGrid(df, hue="target", height=6).map(plt.scatter, "petal length (cm)", "petal width (cm)").add_legend()

# From the plot below, type 0.0 (setosa, blue) can be classified with 100% accuracy: a single straight line separates it from the other two classes (it is linearly separable), and every point below that line is setosa.
# Types 1.0 and 2.0 overlap, so classification between versicolor and virginica may not be 100% accurate.
# All of these observations come purely from analyzing the plot, which is why visualizing the dataset is a very important step.
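
The linear-separability observation can be checked directly. This is a minimal sketch, assuming a cutoff of 2.5 cm on petal length; that value is chosen here by eye from the scatter plot and is not taken from the notebook.

```python
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]      # column 2 is petal length (cm)
is_setosa = iris.target == 0        # class 0 is setosa

# Hypothetical cutoff read off the scatter plot: setosa sits below it.
CUTOFF_CM = 2.5
predicted_setosa = petal_length < CUTOFF_CM

# Fraction of samples for which the one-line rule agrees with the true label.
rule_accuracy = (predicted_setosa == is_setosa).mean()
print(rule_accuracy)
```

A rule this simple only works for setosa; no single threshold on one feature separates versicolor from virginica, which is why a tree with several splits is used next.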

### Apply on Iris Dataset

In [37]:
# Fit a CART (Classification and Regression Tree) model to the data.
# criterion="entropy" splits on entropy instead of the default Gini index, and max_depth=3 cuts the tree off after level 3 to avoid overfitting.
# The default, unrestricted tree would be: model = DecisionTreeClassifier()
model = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# fit() trains the model on the data. For simplicity we skip one crucial step here: splitting the dataset into training and test sets.
# On a real dataset, split the data so that accuracy is measured on samples the model has not seen; scoring on the training data gives an optimistically biased estimate.
model.fit(iris.data, iris.target)
print(model)

DecisionTreeClassifier(criterion='entropy', max_depth=3)
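
The train/test split that this cell skips could look like the following. It is a sketch rather than the notebook's own method: the 30% test size, the `stratify` argument and `random_state=42` are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

# Hold out 30% of the samples; stratify keeps the three classes balanced in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=42
)

# Same settings as the model above, but trained on the training part only.
split_model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
split_model.fit(X_train, y_train)

train_accuracy = split_model.score(X_train, y_train)
test_accuracy = split_model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The test accuracy is the number to report, since it is measured on samples the tree never saw during training.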


In [38]:
# A score of 1 would mean the model (decision tree classifier) classifies 100% of this dataset correctly. Note that this is accuracy on the training data.
model.score(iris.data, iris.target)

0.9733333333333334

### Make predictions

In [39]:
# expected holds the true labels that the model's predictions should match.
expected = iris.target
# predicted holds the labels the trained model assigns to the same samples.
predicted = model.predict(iris.data)

### Summarize the fit of the model

In [40]:
# Precision, recall, f1-score and the confusion matrix show how accurate the model is, giving a thorough per-class summary of the trained model's performance on the dataset.
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.98      0.94      0.96        50
           2       0.94      0.98      0.96        50

    accuracy                           0.97       150
   macro avg       0.97      0.97      0.97       150
weighted avg       0.97      0.97      0.97       150

[[50  0  0]
 [ 0 47  3]
 [ 0  1 49]]
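
The figures in the report can be derived by hand from the confusion matrix. The sketch below copies the matrix printed above into NumPy and recomputes each metric; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
cm = np.array([[50, 0, 0],
               [0, 47, 3],
               [0, 1, 49]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct / everything predicted as the class
recall = true_positives / cm.sum(axis=1)      # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(accuracy, 4))
```

For example, class 1 has 47 correct predictions out of 48 samples predicted as class 1 (precision 0.98) and out of 50 true class 1 samples (recall 0.94), and the overall accuracy is 146/150.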


##### One disadvantage of decision trees is their tendency to overfit. If the tree is grown fully, it keeps splitting until every leaf is a pure node, which typically gives 100% accuracy on the training data but may generalize poorly. To avoid overfitting, we limit the depth of the tree with max_depth.
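
The effect of `max_depth` can be seen by comparing a fully grown tree with a depth-limited one on held-out data. This is a sketch under assumptions not in the notebook: a 30% stratified test split and `random_state=0`.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

results = {}
for depth in (None, 3):   # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = {
        "tree_depth": tree.get_depth(),
        "train_accuracy": tree.score(X_train, y_train),
        "test_accuracy": tree.score(X_test, y_test),
    }
    print(depth, results[depth])
```

A gap between training and test accuracy is the sign of overfitting. On a small, clean dataset such as iris the gap may be slight; it tends to be more visible on noisier data.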