Decision Tree ID3
Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree.

Loads the Iris dataset: The dataset contains 150 samples with four features and three classes.
Splits the dataset: The data is split into training (70%) and testing (30%) sets.
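Initializes the classifier: The DecisionTreeClassifier is initialized with criterion='entropy', so splits are chosen by information gain, as in ID3. Note that scikit-learn's tree is an optimized CART implementation that makes binary threshold splits on numeric features, so this approximates ID3 rather than reproducing it exactly.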
Trains the classifier: The classifier is trained using the training data.
Makes predictions: The trained model predicts the labels for the test set.
Calculates accuracy: The fraction of test samples classified correctly is computed with accuracy_score.
Prints the classification report: This report includes precision, recall, f1-score, and support for each class.
Prints actual vs predicted labels: The first 10 actual and predicted labels are printed for comparison.
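
To make the entropy criterion in the steps above concrete, here is a minimal sketch of how the information gain of one binary threshold split can be computed. The helper names (entropy, threshold_gain) and the toy data are assumptions made for illustration; this is not scikit-learn's internal code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def threshold_gain(values, labels, threshold):
    """Information gain of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder

# Hypothetical toy data: one numeric feature, three balanced classes.
values = [1.0, 1.2, 1.4, 4.0, 4.5, 4.7, 5.5, 5.8, 6.0]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

gain = threshold_gain(values, labels, threshold=2.0)
print(f"Information gain at threshold 2.0: {gain:.3f}")
```

The split isolates class 0 completely, so the left branch has zero entropy and only the mixed right branch contributes to the remainder.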

In [8]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Decision Tree Classifier with entropy criterion
clf = DecisionTreeClassifier(criterion='entropy')

# Train the classifier
clf.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = clf.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Print the first 10 actual vs predicted labels
print("\nActual vs Predicted labels for the first 10 samples:")
for actual, predicted in zip(y_test[:10], y_pred[:10]):
    print(f"Actual: {actual}, Predicted: {predicted}")


Accuracy: 97.78%

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.93      1.00      0.96        13
           2       1.00      0.92      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45


Actual vs Predicted labels for the first 10 samples:
Actual: 1, Predicted: 1
Actual: 0, Predicted: 0
Actual: 2, Predicted: 2
Actual: 1, Predicted: 1
Actual: 1, Predicted: 1
Actual: 0, Predicted: 0
Actual: 1, Predicted: 1
Actual: 2, Predicted: 2
Actual: 1, Predicted: 1
Actual: 1, Predicted: 1
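
The accuracy figures above do not show which splits the tree actually learned. One way to inspect them is scikit-learn's export_text helper, sketched below. The random_state passed to the classifier is an assumption added here for repeatability (the original cell does not set it), so the printed rules may differ from the tree that produced the output above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# random_state here is an assumption for reproducibility, not part of the original cell.
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

# Render the learned tree as indented if/else rules on the feature thresholds.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```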


Decision Tree ID3
Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree (apply this knowledge to classify a new sample).

In [1]:
import math
from collections import Counter

def entropy(data):
    """ Calculate the entropy (in bits) of a dataset; each item stores its class under the 'label' key """
    labels = [item['label'] for item in data]
    label_counts = Counter(labels)
    total = len(data)
    result = 0.0  # named 'result' so it does not shadow this function's own name
    for count in label_counts.values():
        probability = count / total
        result -= probability * math.log2(probability)
    return result

def information_gain(data, attribute):
    """ Calculate the information gain of an attribute in the dataset """
    values = set([item[attribute] for item in data])
    remainder = 0
    total = len(data)
    
    for value in values:
        subset = [item for item in data if item[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
        
    return entropy(data) - remainder

def id3(data, features, target_attribute):
    """ ID3 algorithm to build a decision tree """
    labels = [item[target_attribute] for item in data]
    if len(set(labels)) == 1:  # If all labels are the same, return a leaf node with that label
        return labels[0]
    
    if len(features) == 0:  # If no more features to split on, return the most common label
        return Counter(labels).most_common(1)[0][0]
    
    # Choose the best attribute to split on
    best_attribute = max(features, key=lambda attribute: information_gain(data, attribute))
    tree = {best_attribute: {}}
    remaining_features = [f for f in features if f != best_attribute]
    
    # Split the dataset based on the best attribute
    for value in set([item[best_attribute] for item in data]):
        subset = [item for item in data if item[best_attribute] == value]
        subtree = id3(subset, remaining_features, target_attribute)
        tree[best_attribute][value] = subtree
        
    return tree

# Example usage
if __name__ == "__main__":
    # Example dataset (weather outlook and whether to play golf)
    data = [
        {'outlook': 'sunny', 'temperature': 'hot', 'humidity': 'high', 'windy': False, 'label': 'no'},
        {'outlook': 'sunny', 'temperature': 'hot', 'humidity': 'high', 'windy': True, 'label': 'no'},
        {'outlook': 'overcast', 'temperature': 'hot', 'humidity': 'high', 'windy': False, 'label': 'yes'},
        {'outlook': 'rainy', 'temperature': 'mild', 'humidity': 'high', 'windy': False, 'label': 'yes'},
        {'outlook': 'rainy', 'temperature': 'cool', 'humidity': 'normal', 'windy': False, 'label': 'yes'},
        {'outlook': 'rainy', 'temperature': 'cool', 'humidity': 'normal', 'windy': True, 'label': 'no'},
        {'outlook': 'overcast', 'temperature': 'cool', 'humidity': 'normal', 'windy': True, 'label': 'yes'},
        {'outlook': 'sunny', 'temperature': 'mild', 'humidity': 'high', 'windy': False, 'label': 'no'},
        {'outlook': 'sunny', 'temperature': 'cool', 'humidity': 'normal', 'windy': False, 'label': 'yes'},
        {'outlook': 'rainy', 'temperature': 'mild', 'humidity': 'normal', 'windy': False, 'label': 'yes'},
        {'outlook': 'sunny', 'temperature': 'mild', 'humidity': 'normal', 'windy': True, 'label': 'yes'},
        {'outlook': 'overcast', 'temperature': 'mild', 'humidity': 'high', 'windy': True, 'label': 'yes'},
        {'outlook': 'overcast', 'temperature': 'hot', 'humidity': 'normal', 'windy': False, 'label': 'yes'},
        {'outlook': 'rainy', 'temperature': 'mild', 'humidity': 'high', 'windy': True, 'label': 'no'}
    ]

    features = ['outlook', 'temperature', 'humidity', 'windy']
    target_attribute = 'label'

    decision_tree = id3(data, features, target_attribute)
    print("Decision Tree:")
    print(decision_tree)


Decision Tree:
{'outlook': {'overcast': 'yes', 'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}}, 'rainy': {'windy': {False: 'yes', True: 'no'}}}}
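
The task statement also asks for the tree to be applied to a new sample, which the cell above does not do. A minimal sketch of one way to do it follows: walk the nested dictionary until a leaf label is reached. The classify function, its default argument, and the sample values are assumptions introduced for illustration; the tree literal is copied from the output above.

```python
def classify(tree, sample, default=None):
    """Follow the branches of a nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))      # the attribute tested at this node
        branches = tree[attribute]
        value = sample.get(attribute)
        if value not in branches:         # attribute value never seen in training
            return default
        tree = branches[value]
    return tree

# Tree copied from the printed output above.
decision_tree = {'outlook': {'overcast': 'yes',
                             'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}},
                             'rainy': {'windy': {False: 'yes', True: 'no'}}}}

# Hypothetical new sample to classify.
new_sample = {'outlook': 'sunny', 'temperature': 'cool', 'humidity': 'high', 'windy': True}
prediction = classify(decision_tree, new_sample)
print("Predicted label:", prediction)
```

Returning a default for unseen attribute values is a design choice; falling back to the majority label of the training data would be another reasonable option.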


In [22]:
import pandas as pd

# Load the play-tennis dataset (columns: outlook, temp, humidity, wind, play?)
df_tennis = pd.read_csv('E:/2a)Even sem/AIML 6th Sem/1)6th Sem AIML/4)Module 4 AIML/3.csv')

In [23]:
import math
def entropy(probs):
    return sum([-i * math.log(i, 2) for i in probs])

In [24]:
from collections import Counter
def entropy_of_list(a_list):
    # Count how often each class label occurs in the column
    cnt = Counter(a_list)
    print('No and yes classes are:', a_list.name, cnt)
    num_instances = len(a_list)
    # Convert the counts to class probabilities
    probs = [x / num_instances for x in cnt.values()]
    return entropy(probs)

total_entropy = entropy_of_list(df_tennis['play?'])
print("Entropy of given play tennis dataset:", total_entropy)

No and yes classes are: play? Counter({'yes': 9, 'no': 5})
Entropy of given play tennis dataset: 0.9402859586706309
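
As a check on the value printed above, the entropy can be worked out by hand from the class counts (9 yes and 5 no out of 14). The symbol H(S) for the entropy of the full dataset is notation introduced here; the original code does not use it.

```latex
H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}
     \approx 0.410 + 0.530 = 0.940
```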


In [25]:
def information_gain(df, split_attribute_name, target_attribute_name,trace = 0):
    print("Information gain calculation of", split_attribute_name)
    df_split = df.groupby(split_attribute_name)
    for name,group in df_split:
        print('Name:',name)
        print('Group:',group)
        
    nobs = len(df.index) * 1.0
    df_agg_ent = df_split.agg({target_attribute_name:[entropy_of_list, lambda x:len(x)/ nobs]})[target_attribute_name]
    df_agg_ent.columns = ['entropy', 'propobservations']
    new_entropy = sum(df_agg_ent['entropy'] * df_agg_ent['propobservations'])
    old_entropy = entropy_of_list(df[target_attribute_name])
    return old_entropy - new_entropy

# The original referenced an undefined name 'dataset'; the DataFrame loaded above is df_tennis
print("Entropy of given play tennis dataset:", entropy_of_list(df_tennis['play?']))
print('Information gain for outlook is: ' + str(information_gain(df_tennis, 'outlook', 'play?')), "\n")
print('Information gain for humidity is: ' + str(information_gain(df_tennis, 'humidity', 'play?')), "\n")
print('Information gain for wind is: ' + str(information_gain(df_tennis, 'wind', 'play?')), "\n")
print('Information gain for temperature is: ' + str(information_gain(df_tennis, 'temp', 'play?')), "\n")

No and yes classes are: play? Counter({'yes': 9, 'no': 5})
Entropy of given play tennis dataset: 0.9402859586706309
Information gain calculation of outlook
Name: overcast
Group:      outlook  temp humidity    wind play?
2   overcast   hot     high    weak   yes
6   overcast  cool   normal  strong   yes
11  overcast  mild     high  strong   yes
12  overcast   hot   normal    weak   yes
Name: rain
Group:    outlook  temp humidity    wind play?
3     rain  mild     high    weak   yes
4     rain  cool   normal    weak   yes
5     rain  cool   normal  strong    no
9     rain  mild   normal    weak   yes
13    rain  mild     high  strong    no
Name: sunny
Group:    outlook  temp humidity    wind play?
0    sunny   hot     high    weak    no
1    sunny   hot     high  strong    no
7    sunny  mild     high    weak    no
8    sunny  cool   normal    weak   yes
10   sunny  mild   normal  strong   yes
No and yes classes are: play? Counter({'yes': 4})
No and yes classes are: play? Counter({'yes':