### Decision Tree Classification on the Iris Dataset

This notebook applies a Decision Tree classifier to the Iris dataset, which contains measurements of 150 iris flowers from three species (50 of each). The classifier predicts the species from four features: sepal length, sepal width, petal length, and petal width. Two trees are trained, one that splits on entropy and one that splits on Gini impurity.

#### Steps and Concepts:

1. **Splitting the Dataset**: The dataset is split into training and test sets using `train_test_split()` with `test_size=0.3` and `random_state=100`, so 105 samples (70%) are used for training and 45 samples (30%) for testing. The shapes of the resulting arrays are printed to confirm the split.

2. **Training the Decision Tree**: Two Decision Tree models are trained with the same hyperparameters (`max_depth=3`, `min_samples_leaf=4`, `random_state=100`) and differ only in the splitting criterion:
   - **Using Entropy**: A tree (`clf_entropy`) is trained using entropy as the criterion for splitting. Entropy measures the impurity of a node and is given by \(-\sum_i p_i \log_2 p_i\), where \(p_i\) is the proportion of the samples at that node that belong to class \(i\).
   - **Using Gini Impurity**: Another tree (`clf_gini`) is trained using the Gini impurity as the criterion, calculated as \(1 - \sum_i p_i^2\). Gini impurity is the probability that a randomly chosen sample from the node would be mislabeled if it were labeled at random according to the class distribution at that node.

3. **Making Predictions and Evaluating the Model**: Each model makes predictions on the test set. The predicted class label (0, 1, or 2, one per species) for each test sample is printed.

4. **Accuracy and Metrics**: The accuracy of the model, along with a confusion matrix and a classification report (which includes precision, recall, and F1-score), is printed for each classifier. This provides a detailed view of the model's performance:
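   - **Confusion Matrix**: Shows the correct and incorrect predictions for each class. Rows correspond to the true class and columns to the predicted class, so correct predictions lie on the diagonal.
   - **Accuracy**: Gives the percentage of test samples that were predicted correctly.
   - **Precision and Recall**: Both are computed per class. Precision is the fraction of samples predicted as a class that actually belong to it. Recall is the fraction of samples that actually belong to a class that were predicted as that class.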
   - **F1-Score**: The harmonic mean of precision and recall, providing a balance between them.
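
The two impurity formulas in step 2 can be checked by hand on a small example. The sketch below is not part of the original notebook; the helper names `entropy` and `gini` and the class counts are illustrative assumptions. It evaluates both measures for a perfectly mixed node and for a pure node.

```python
from math import log2


def entropy(counts):
    """Entropy of a node, given the number of samples in each class."""
    total = sum(counts)
    # p * log2(1 / p) is the same as -p * log2(p); empty classes contribute 0.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical nodes: an evenly mixed three-class node and a pure node.
mixed_node = [50, 50, 50]
pure_node = [50, 0, 0]

print("mixed node:", entropy(mixed_node), gini(mixed_node))
print("pure node: ", entropy(pure_node), gini(pure_node))
```

For the evenly mixed node the entropy is \(\log_2 3\) and the Gini impurity is \(2/3\); for the pure node both measures are zero. A split is chosen to reduce whichever measure the tree was configured with.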

#### Summary:
The Decision Tree classifier distinguishes between the three iris species based on their physical measurements. Training one tree with entropy and one with Gini impurity allows the two splitting criteria to be compared directly. On this split both trees reach 95.56% accuracy on the 45 test samples and produce identical predictions, so here the choice of criterion makes no difference. The comparison is useful in educational settings or in practical scenarios where the trade-offs between splitting criteria are being evaluated.


In [1]:
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

In [2]:
X, y = iris.data, iris.target

In [3]:
def train_using_gini(X_train, y_train):
    clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=4)
    clf_gini.fit(X_train, y_train)
    return clf_gini

In [4]:
def train_using_entropy(X_train, y_train):
    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=4)
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

In [5]:
def prediction(X_test, clf_object):
    # Predict class labels for the test set and print them
    y_pred = clf_object.predict(X_test)
    print("Predicted values:", y_pred)
    return y_pred

In [6]:
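def cal_accuracy(y_test, y_pred):
    # Rows of the confusion matrix are true classes, columns are predicted classes
    print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
    # accuracy_score returns a fraction; multiply by 100 to report a percentage
    print("Accuracy:", accuracy_score(y_test, y_pred) * 100)
    print("Report :", classification_report(y_test, y_pred))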

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
print("Dimensions for training data", X_train.shape)
print("Dimensions for testing data", X_test.shape)

Dimensions for training data (105, 4)
Dimensions for testing data (45, 4)


In [8]:
clf_gini = train_using_gini(X_train, y_train)
print("Result:")
y_pred_gini = prediction(X_test, clf_gini)
cal_accuracy(y_test, y_pred_gini)

Result:
Predicted values: [2 0 2 0 2 2 0 0 2 0 0 2 0 0 2 1 1 2 2 2 2 0 2 0 1 2 1 0 1 2 1 1 1 0 0 1 0
 1 2 2 0 1 2 2 0]
Confusion Matrix:  [[16  0  0]
 [ 0 10  1]
 [ 0  1 17]]
Accuracy: 95.55555555555556
Report :               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.91      0.91      0.91        11
           2       0.94      0.94      0.94        18

    accuracy                           0.96        45
   macro avg       0.95      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
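
The figures in the report above can be recovered directly from the confusion matrix. The sketch below is not part of the original notebook: it copies the matrix printed above into a plain list and uses an illustrative helper, `per_class_metrics`, to recompute accuracy, precision, recall, and F1-score without scikit-learn. It assumes every class is predicted at least once, so no division by zero is handled.

```python
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [16, 0, 0],
    [0, 10, 1],
    [0, 1, 17],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all test samples.
accuracy = sum(cm[k][k] for k in range(n_classes)) / total


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class k."""
    true_positives = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum
    precision = true_positives / predicted_as_k
    recall = true_positives / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(f"accuracy: {accuracy * 100:.2f}")
for k in range(n_classes):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Class 1, for example, has 10 correct predictions out of 11 samples predicted as class 1 and out of 11 samples that truly are class 1, which gives the 0.91 precision and recall shown in the report.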



In [9]:
clf_entropy = train_using_entropy(X_train, y_train)
y_pred_entropy = prediction(X_test, clf_entropy)
cal_accuracy(y_test, y_pred_entropy)

Predicted values: [2 0 2 0 2 2 0 0 2 0 0 2 0 0 2 1 1 2 2 2 2 0 2 0 1 2 1 0 1 2 1 1 1 0 0 1 0
 1 2 2 0 1 2 2 0]
Confusion Matrix:  [[16  0  0]
 [ 0 10  1]
 [ 0  1 17]]
Accuracy: 95.55555555555556
Report :               precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.91      0.91      0.91        11
           2       0.94      0.94      0.94        18

    accuracy                           0.96        45
   macro avg       0.95      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
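
Both trees produce the same predictions on this split. One way to see whether they also learned the same rules is to print each fitted tree as text. The sketch below is not part of the original notebook; it assumes scikit-learn is installed, retrains both models with the notebook's settings so that it runs on its own, and uses `export_text` from `sklearn.tree`. The `rules` dictionary is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=100
)

# Train one tree per criterion with the same settings as the notebook
# and keep a text rendering of each tree's decision rules.
rules = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(
        criterion=criterion, random_state=100, max_depth=3, min_samples_leaf=4
    )
    clf.fit(X_train, y_train)
    rules[criterion] = export_text(clf, feature_names=list(iris.feature_names))
    print(f"--- {criterion} ---")
    print(rules[criterion])
```

Comparing the two printouts shows which features and thresholds each criterion selected, which is more informative than comparing accuracy alone when the scores are equal.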

