The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used in machine learning for classification tasks. It was developed by Ross Quinlan in the 1980s. ID3 builds the tree greedily: at each node it selects the attribute with the highest information gain, partitions the data on that attribute's values, and recurses on each resulting subset. Once a split has been made, it is never revisited.
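
For reference, the two quantities the algorithm relies on can be written using their standard definitions. The symbol names are chosen here for illustration: $S$ is the set of examples at a node, $p_c$ is the fraction of examples in $S$ that belong to class $c$, and $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$.

$$
H(S) = -\sum_{c} p_c \log_2 p_c,
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
$$

A pure node has $H(S) = 0$, and a two-class node split evenly has $H(S) = 1$, so the gain measures how much a split reduces class uncertainty on average.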

## ID3 Algorithm Steps:

1. **Start**:
   - Begin with the entire dataset as the root node.

2. **Entropy Calculation**:
   - Calculate the entropy of the target variable for the dataset.

3. **Information Gain**:
   - For each attribute, calculate the information gain.
   - Select the attribute with the highest information gain as the decision attribute for the node.

4. **Tree Construction**:
   - For each unique value of the decision attribute, create one branch that is followed when the attribute equals that value.
   - Example: If the decision attribute is "Color" with values "Red," "Green," and "Blue," create three branches:
     - If Color = Red, then follow the Red branch.
     - If Color = Green, then follow the Green branch.
     - If Color = Blue, then follow the Blue branch.

5. **Recursive Splitting**:
   - Split the dataset into subsets based on the attribute values.
   - Apply the ID3 algorithm recursively to each subset.

6. **Termination**:
   - Stop if all instances in a subset belong to the same class (pure).
   - Stop if no attributes remain but the subset still contains more than one class; label the leaf with the majority class.
   - Stop if a subset is empty; such a leaf is typically labeled with the majority class of the parent node.

7. **Pruning (Optional)**:
   - Prune the tree to reduce overfitting if necessary. Pruning is not part of the original ID3 procedure; it was introduced in successors such as C4.5.

8. **Final Decision Tree**:
   - Use the constructed (and possibly pruned) tree to make predictions.
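
Steps 2 and 3 above can be sketched on a tiny categorical dataset. The rows, the `Color` and `Size` attributes, and the helper names `label_entropy` and `gain` are hypothetical and chosen only for illustration. This is a minimal sketch of the standard entropy and information-gain calculation, not a full ID3 implementation.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy data: (attribute values, class label) pairs.
rows = [
    ({"Color": "Red", "Size": "Small"}, "Yes"),
    ({"Color": "Red", "Size": "Large"}, "Yes"),
    ({"Color": "Green", "Size": "Small"}, "No"),
    ({"Color": "Green", "Size": "Large"}, "No"),
    ({"Color": "Blue", "Size": "Small"}, "Yes"),
    ({"Color": "Blue", "Size": "Large"}, "No"),
]

def label_entropy(labels):
    """Entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, attribute):
    """Information gain obtained by splitting rows on one attribute."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[features[attribute]].append(label)
    all_labels = [label for _, label in rows]
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(rows) * label_entropy(g) for g in groups.values())
    return label_entropy(all_labels) - remainder

gains = {a: gain(rows, a) for a in ("Color", "Size")}
best = max(gains, key=gains.get)  # attribute with the highest gain
print(gains, best)
```

On this data the parent entropy is 1 bit (three "Yes", three "No"). Splitting on `Color` leaves only the `Blue` subset mixed, giving a gain of about 0.667, while splitting on `Size` leaves both subsets mixed, so `Color` would be chosen as the decision attribute.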


In [20]:
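# word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download("punkt")
# (recent NLTK releases may ask for "punkt_tab" instead).
import nltk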
from nltk.tokenize import word_tokenize
from collections import defaultdict
from math import log2

# Sample dataset
# Each entry in the dataset is a tuple (text, label)
dataset = [
    ("This is a good day", "Positive"),
    ("I feel sad about the news", "Negative"),
    ("I am excited to see you", "Positive"),
    ("I am afraid of the dark", "Negative")
]

# Tokenize and preprocess the text
def preprocess(text):
    return word_tokenize(text.lower())

# Calculate entropy
def entropy(subset):
    label_counts = defaultdict(int)
    for _, label in subset:
        label_counts[label] += 1
    total = len(subset)
    # Every count is at least 1, so each probability is strictly positive.
    return -sum((count / total) * log2(count / total) for count in label_counts.values())

# Calculate information gain
def information_gain(dataset, partitions):
    total = len(dataset)
    dataset_entropy = entropy(dataset)
    weighted_entropy = sum((len(partition)/total) * entropy(partition) for partition in partitions)
    return dataset_entropy - weighted_entropy

# ID3 algorithm
def id3(dataset, attributes):
    labels = [label for _, label in dataset]
    if len(set(labels)) == 1:
        return labels[0]

    if not attributes:
        return max(set(labels), key=labels.count)

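    # Start below zero so that an attribute is always chosen, even when no split
    # improves on the parent (gain of 0); otherwise best_partitions would stay None.
    max_gain = -1.0
    best_attribute = None
    best_partitions = None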

    for attribute in attributes:
        partitions = defaultdict(list)
        for text, label in dataset:
            # An attribute is a token position; its value is the token at that position.
            key = preprocess(text)[attribute]
            partitions[key].append((text, label))
        gain = information_gain(dataset, partitions.values())
        if gain > max_gain:
            max_gain = gain
            best_attribute = attribute
            best_partitions = partitions  # Update best_partitions with the best current partitions

    tree = {best_attribute: {}}
    for attribute_value, subset in best_partitions.items():
        subtree = id3(subset, [a for a in attributes if a != best_attribute])
        tree[best_attribute][attribute_value] = subtree

    return tree

# Example usage
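# Attributes are token positions; use only positions present in every text
# so that indexing by position cannot go out of range.
attributes = list(range(min(len(preprocess(text)) for text, _ in dataset)))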
decision_tree = id3(dataset, attributes)
print(decision_tree)


{2: {'a': 'Positive', 'sad': 'Negative', 'excited': 'Positive', 'afraid': 'Negative'}}
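
Step 8 (using the tree for prediction) is not covered by the cell above. A minimal sketch of how the nested-dictionary tree could be traversed follows. The `predict` function and its `default` parameter are hypothetical additions, and plain whitespace splitting stands in for `word_tokenize` so the block runs without NLTK data. The tree literal is copied from the printed output.

```python
# Tree copied from the printed output: the key is a token position, and each
# branch maps the token at that position to a label (or to a subtree).
decision_tree = {2: {"a": "Positive", "sad": "Negative",
                     "excited": "Positive", "afraid": "Negative"}}

def predict(tree, text, default=None):
    """Walk the tree until a leaf label is reached (hypothetical helper)."""
    tokens = text.lower().split()  # assumption: simple split instead of word_tokenize
    node = tree
    while isinstance(node, dict):
        # Each internal node holds exactly one decision attribute.
        position, branches = next(iter(node.items()))
        token = tokens[position] if position < len(tokens) else None
        if token not in branches:
            return default  # token at this position was never seen in training
        node = branches[token]
    return node

print(predict(decision_tree, "I feel sad about the news"))  # Negative
print(predict(decision_tree, "We are happy today"))         # None
```

Because each branch is keyed on the exact token at one position, any sentence whose token at that position did not occur in the training data falls through to `default`. On a dataset this small, that is the common case, which suggests that token positions are a weak choice of attribute for text.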
