# C4.5 Decision Tree Algorithm - Employee Database Analysis

## Problem Statement
Given the employee database training data, we need to determine which attribute should be selected to split the records in the first iteration using the C4.5 decision tree algorithm with Gain Ratio as the uncertainty measure.

## Dataset
The training data consists of 11 records with attributes:
- **department**: sales, systems, marketing, secretary
- **status**: senior, junior  
- **age**: 21-30, 31-40, 41-50
- **salary** (class attribute): Low, Medium, High


In [26]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import math

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


In [27]:
# Create the employee dataset
data = {
    'department': ['sales', 'sales', 'sales', 'systems', 'systems', 'systems', 'systems', 'marketing', 'marketing', 'secretary', 'secretary'],
    'status': ['senior', 'junior', 'junior', 'junior', 'senior', 'junior', 'senior', 'senior', 'junior', 'senior', 'junior'],
    'age': ['31-40', '21-30', '31-40', '21-30', '31-40', '21-30', '41-50', '31-40', '31-40', '41-50', '21-30'],
    'salary': ['Medium', 'Low', 'Low', 'Medium', 'High', 'Medium', 'High', 'Medium', 'Medium', 'Medium', 'Low']
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)
print(f"\nDataset shape: {df.shape}")
print(f"Total records: {len(df)}")


   department  status    age  salary
0       sales  senior  31-40  Medium
1       sales  junior  21-30     Low
2       sales  junior  31-40     Low
3     systems  junior  21-30  Medium
4     systems  senior  31-40    High
5     systems  junior  21-30  Medium
6     systems  senior  41-50    High
7   marketing  senior  31-40  Medium
8   marketing  junior  31-40  Medium
9   secretary  senior  41-50  Medium
10  secretary  junior  21-30     Low

Dataset shape: (11, 4)
Total records: 11


In [28]:
# Step 1: Calculate entropy of the target attribute (salary)
def calculate_entropy(data, target_column):
    """
    Calculate entropy of the target attribute
    Entropy = -Σ(p_i * log2(p_i)) where p_i is the proportion of class i
    """
    # Count frequency of each class
    class_counts = Counter(data[target_column])
    total_samples = len(data)
    
    entropy = 0
    for count in class_counts.values():
        probability = count / total_samples
        if probability > 0:  # Avoid log(0)
            entropy -= probability * math.log2(probability)
    
    return entropy, class_counts

# Calculate entropy of salary (target attribute)
salary_entropy, salary_counts = calculate_entropy(df, 'salary')
print("=== STEP 1: Calculate Entropy of Target Attribute (Salary) ===")
print(f"Salary class distribution: {salary_counts}")
print(f"Total samples: {len(df)}")

# Display the calculation step by step
print("\nDetailed calculation:")
for salary_class, count in salary_counts.items():
    probability = count / len(df)
    log_prob = math.log2(probability) if probability > 0 else 0
    contribution = -probability * log_prob
    print(f"P({salary_class}) = {count}/{len(df)} = {probability:.4f}")
    print(f"  -{probability:.4f} * log2({probability:.4f}) = {contribution:.4f}")

print(f"\nTotal Entropy = {salary_entropy:.4f}")


=== STEP 1: Calculate Entropy of Target Attribute (Salary) ===
Salary class distribution: Counter({'Medium': 6, 'Low': 3, 'High': 2})
Total samples: 11

Detailed calculation:
P(Medium) = 6/11 = 0.5455
  -0.5455 * log2(0.5455) = 0.4770
P(Low) = 3/11 = 0.2727
  -0.2727 * log2(0.2727) = 0.5112
P(High) = 2/11 = 0.1818
  -0.1818 * log2(0.1818) = 0.4472

Total Entropy = 1.4354


In [None]:
# Step 2: Calculate Information Gain for each attribute
def calculate_information_gain(data, attribute, target_column):
    """
    Calculate information gain for a given attribute
    Information Gain = Entropy(S) - Σ(|Sv|/|S|) * Entropy(Sv)
    where Sv is the subset of S for which attribute A has value v
    """
    # Get unique values of the attribute
    attribute_values = data[attribute].unique()
    total_samples = len(data)
    
    # Calculate weighted entropy for each attribute value
    weighted_entropy = 0
    print(f"\n--- Information Gain for {attribute} ---")
    
    for value in attribute_values:
        # Get subset of data where attribute = value
        subset = data[data[attribute] == value]
        subset_size = len(subset)
        
        # Calculate entropy for this subset
        subset_entropy, subset_counts = calculate_entropy(subset, target_column)
        
        # Weight by proportion of samples
        weight = subset_size / total_samples
        weighted_entropy += weight * subset_entropy
        
        print(f"  {attribute} = {value}: {subset_size}/{total_samples} samples")
        print(f"    Subset entropy: {subset_entropy:.4f}")
        print(f"    Weight: {weight:.4f}")
        print(f"    Weighted contribution: {weight * subset_entropy:.4f}")
        print(f"    Class distribution: {dict(subset_counts)}")
    
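    # Information Gain = Entropy(S) - weighted entropy of the subsets.
    # Entropy(S) is computed from `data` itself rather than read from a global.
    total_entropy, _ = calculate_entropy(data, target_column)
    information_gain = total_entropy - weighted_entropy
    print(f"  Weighted entropy: {weighted_entropy:.4f}")
    print(f"  Information Gain: {total_entropy:.4f} - {weighted_entropy:.4f} = {information_gain:.4f}")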
    
    return information_gain, weighted_entropy

attributes = ['department', 'status', 'age']
information_gains = {}
weighted_entropies = {}

print("=== STEP 2: Calculate Information Gain for Each Attribute ===")
for attr in attributes:
    ig, we = calculate_information_gain(df, attr, 'salary')
    information_gains[attr] = ig
    weighted_entropies[attr] = we

print(f"\nSummary of Information Gains:")
for attr, ig in information_gains.items():
    print(f"  {attr}: {ig:.4f}")


=== STEP 2: Calculate Information Gain for Each Attribute ===

--- Information Gain for department ---
  department = sales: 3/11 samples
    Subset entropy: 0.9183
    Weight: 0.2727
    Weighted contribution: 0.2504
    Class distribution: {'Medium': 1, 'Low': 2}
  department = systems: 4/11 samples
    Subset entropy: 1.0000
    Weight: 0.3636
    Weighted contribution: 0.3636
    Class distribution: {'Medium': 2, 'High': 2}
  department = marketing: 2/11 samples
    Subset entropy: 0.0000
    Weight: 0.1818
    Weighted contribution: 0.0000
    Class distribution: {'Medium': 2}
  department = secretary: 2/11 samples
    Subset entropy: 1.0000
    Weight: 0.1818
    Weighted contribution: 0.1818
    Class distribution: {'Medium': 1, 'Low': 1}
  Weighted entropy: 0.7959
  Information Gain: 1.4354 - 0.7959 = 0.6395

--- Information Gain for status ---
  status = senior: 5/11 samples
    Subset entropy: 0.9710
    Weight: 0.4545
    Weighted contribution: 0.4413
    Class distribution:

In [30]:
# Step 3: Calculate Split Information for each attribute
def calculate_split_information(data, attribute):
    """
    Calculate split information for a given attribute
    Split Information = -Σ(|Si|/|S|) * log2(|Si|/|S|)
    where Si is the subset of S for which attribute A has value i
    """
    # Get unique values of the attribute
    attribute_values = data[attribute].unique()
    total_samples = len(data)
    
    split_info = 0
    print(f"\n--- Split Information for {attribute} ---")
    
    for value in attribute_values:
        # Count samples for this attribute value
        subset_size = len(data[data[attribute] == value])
        proportion = subset_size / total_samples
        
        if proportion > 0:  # Avoid log(0)
            split_info -= proportion * math.log2(proportion)
            print(f"  {attribute} = {value}: {subset_size}/{total_samples} = {proportion:.4f}")
            print(f"    -{proportion:.4f} * log2({proportion:.4f}) = {-proportion * math.log2(proportion):.4f}")
    
    print(f"  Split Information: {split_info:.4f}")
    return split_info

# Calculate split information for each attribute
split_informations = {}

print("=== STEP 3: Calculate Split Information for Each Attribute ===")
for attr in attributes:
    si = calculate_split_information(df, attr)
    split_informations[attr] = si

print(f"\nSummary of Split Information:")
for attr, si in split_informations.items():
    print(f"  {attr}: {si:.4f}")


=== STEP 3: Calculate Split Information for Each Attribute ===

--- Split Information for department ---
  department = sales: 3/11 = 0.2727
    -0.2727 * log2(0.2727) = 0.5112
  department = systems: 4/11 = 0.3636
    -0.3636 * log2(0.3636) = 0.5307
  department = marketing: 2/11 = 0.1818
    -0.1818 * log2(0.1818) = 0.4472
  department = secretary: 2/11 = 0.1818
    -0.1818 * log2(0.1818) = 0.4472
  Split Information: 1.9363

--- Split Information for status ---
  status = senior: 5/11 = 0.4545
    -0.4545 * log2(0.4545) = 0.5170
  status = junior: 6/11 = 0.5455
    -0.5455 * log2(0.5455) = 0.4770
  Split Information: 0.9940

--- Split Information for age ---
  age = 31-40: 5/11 = 0.4545
    -0.4545 * log2(0.4545) = 0.5170
  age = 21-30: 4/11 = 0.3636
    -0.3636 * log2(0.3636) = 0.5307
  age = 41-50: 2/11 = 0.1818
    -0.1818 * log2(0.1818) = 0.4472
  Split Information: 1.4949

Summary of Split Information:
  department: 1.9363
  status: 0.9940
  age: 1.4949


In [31]:
# Step 4: Calculate Gain Ratio for each attribute
def calculate_gain_ratio(information_gain, split_information):
    """
    Calculate gain ratio for an attribute
    Gain Ratio = Information Gain / Split Information
    """
    if split_information == 0:
        return 0  # Avoid division by zero
    return information_gain / split_information

# Calculate gain ratio for each attribute
gain_ratios = {}

print("=== STEP 4: Calculate Gain Ratio for Each Attribute ===")
print("Gain Ratio = Information Gain / Split Information")
print()

for attr in attributes:
    ig = information_gains[attr]
    si = split_informations[attr]
    gr = calculate_gain_ratio(ig, si)
    gain_ratios[attr] = gr
    
    print(f"{attr}:")
    print(f"  Information Gain: {ig:.4f}")
    print(f"  Split Information: {si:.4f}")
    print(f"  Gain Ratio: {ig:.4f} / {si:.4f} = {gr:.4f}")
    print()

print("Summary of Gain Ratios:")
for attr, gr in gain_ratios.items():
    print(f"  {attr}: {gr:.4f}")

# Find the attribute with the highest gain ratio
best_attribute = max(gain_ratios, key=gain_ratios.get)
best_gain_ratio = gain_ratios[best_attribute]
print()
print("RESULTS:")
print("=" * 50)
print(f"Attribute with highest Gain Ratio: {best_attribute}")
print(f"Gain Ratio: {best_gain_ratio:.4f}")
print(f"Attribute selected to split the records in the first iteration: {best_attribute}")

=== STEP 4: Calculate Gain Ratio for Each Attribute ===
Gain Ratio = Information Gain / Split Information

department:
  Information Gain: 0.6395
  Split Information: 1.9363
  Gain Ratio: 0.6395 / 1.9363 = 0.3303

status:
  Information Gain: 0.4486
  Split Information: 0.9940
  Gain Ratio: 0.4486 / 0.9940 = 0.4513

age:
  Information Gain: 0.2668
  Split Information: 1.4949
  Gain Ratio: 0.2668 / 1.4949 = 0.1784

Summary of Gain Ratios:
  department: 0.3303
  status: 0.4513
  age: 0.1784

RESULTS:
==================================================
Attribute with highest Gain Ratio: status
Gain Ratio: 0.4513
Attribute selected to split the records in the first iteration: status


## Summary and Conclusion

### C4.5 Decision Tree Algorithm Analysis

The C4.5 algorithm uses **Gain Ratio** as the splitting criterion to overcome the bias of Information Gain toward attributes with many values. The Gain Ratio is calculated as:

**Gain Ratio = Information Gain / Split Information**

### Final Answer:
The attribute **status** should be selected for the first split in the C4.5 decision tree algorithm: its Gain Ratio (0.4513) is higher than that of department (0.3303) and age (0.1784).
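
As a cross-check, the whole first-iteration computation can be condensed into a few lines of standard-library Python. This is a minimal sketch rather than the notebook's own implementation: the names `ROWS`, `COLUMNS`, `entropy`, and `gain_ratio` are introduced here for illustration, and the rows are copied from the employee table above.

```python
from collections import Counter
from math import log2

# Rows copied from the employee table: (department, status, age, salary)
ROWS = [
    ("sales", "senior", "31-40", "Medium"),
    ("sales", "junior", "21-30", "Low"),
    ("sales", "junior", "31-40", "Low"),
    ("systems", "junior", "21-30", "Medium"),
    ("systems", "senior", "31-40", "High"),
    ("systems", "junior", "21-30", "Medium"),
    ("systems", "senior", "41-50", "High"),
    ("marketing", "senior", "31-40", "Medium"),
    ("marketing", "junior", "31-40", "Medium"),
    ("secretary", "senior", "41-50", "Medium"),
    ("secretary", "junior", "21-30", "Low"),
]
COLUMNS = {"department": 0, "status": 1, "age": 2}
TARGET = 3  # index of the salary (class) column


def entropy(values):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, col, target=TARGET):
    """Gain ratio obtained by splitting `rows` on column index `col`."""
    labels = [r[target] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[target])
    # Weighted entropy of the class labels after the split
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution
    split_info = entropy([r[col] for r in rows])
    return gain / split_info if split_info > 0 else 0.0


ratios = {name: gain_ratio(ROWS, idx) for name, idx in COLUMNS.items()}
for name, gr in ratios.items():
    print(f"{name}: {gr:.4f}")
print("best:", max(ratios, key=ratios.get))
```

Note that split information is simply the entropy of the attribute's value distribution, which is why one `entropy` helper is enough for all three quantities. The printed ratios should agree with the Step 4 summary to four decimal places.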


# 2. Naive Bayesian Classifier for Tennis Playing Decision

## Problem Statement
Given the training dataset for tennis playing decisions based on weather conditions, we need to:
1. Build a Naive Bayesian Classifier (NBC) based on the training data
2. Use the classifier to predict whether to play tennis when Outlook=sunny, Temperature=cool, Humidity=high, and Windy=True

## Training Dataset
The dataset contains 14 records with attributes:
- **Outlook**: sunny, overcast, rainy
- **Temperature**: hot, mild, cool  
- **Humidity**: high, normal
- **Windy**: TRUE, FALSE
- **Play** (class label): yes, no


In [18]:
# Import necessary libraries for Naive Bayesian Classifier
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import math

# Set up the tennis playing dataset
tennis_data = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy'],
    'Temperature': ['hot', 'hot', 'hot', 'mild', 'cool', 'cool', 'cool', 
                    'mild', 'cool', 'mild', 'mild', 'mild', 'hot', 'mild'],
    'Humidity': ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal', 
                 'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high'],
    'Windy': [False, True, False, False, False, True, True, 
              False, False, False, True, True, False, True],
    'Play': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 
             'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
}

# Create DataFrame
df_tennis = pd.DataFrame(tennis_data)
print("=== TENNIS PLAYING DATASET ===")
print(df_tennis)
print(f"\nDataset shape: {df_tennis.shape}")
print(f"Total records: {len(df_tennis)}")

# Display class distribution
play_counts = Counter(df_tennis['Play'])
print(f"\nClass distribution:")
for play_class, count in play_counts.items():
    print(f"  {play_class}: {count} ({count/len(df_tennis)*100:.1f}%)")


=== TENNIS PLAYING DATASET ===
     Outlook Temperature Humidity  Windy Play
0      sunny         hot     high  False   no
1      sunny         hot     high   True   no
2   overcast         hot     high  False  yes
3      rainy        mild     high  False  yes
4      rainy        cool   normal  False  yes
5      rainy        cool   normal   True   no
6   overcast        cool   normal   True  yes
7      sunny        mild     high  False   no
8      sunny        cool   normal  False  yes
9      rainy        mild   normal  False  yes
10     sunny        mild   normal   True  yes
11  overcast        mild     high   True  yes
12  overcast         hot   normal  False  yes
13     rainy        mild     high   True   no

Dataset shape: (14, 5)
Total records: 14

Class distribution:
  no: 5 (35.7%)
  yes: 9 (64.3%)


In [19]:
# Part (a): Build Naive Bayesian Classifier
# Calculate all probabilities required by the classifier

def calculate_prior_probabilities(data, target_column):
    """
    Calculate prior probabilities P(C) for each class
    """
    class_counts = Counter(data[target_column])
    total_samples = len(data)
    
    prior_probs = {}
    print("=== PRIOR PROBABILITIES P(Play) ===")
    for class_name, count in class_counts.items():
        prob = count / total_samples
        prior_probs[class_name] = prob
        print(f"P(Play = {class_name}) = {count}/{total_samples} = {prob:.4f}")
    
    return prior_probs

def calculate_likelihood_probabilities(data, target_column, attributes):
    """
    Calculate likelihood probabilities P(Ai|C) for each attribute given each class
    """
    likelihood_probs = {}
    class_counts = Counter(data[target_column])
    
    print("\n=== LIKELIHOOD PROBABILITIES P(Attribute|Class) ===")
    
    for attr in attributes:
        likelihood_probs[attr] = {}
        print(f"\n--- {attr} ---")
        
        # Get unique values for this attribute
        attr_values = data[attr].unique()
        
        for class_name in class_counts.keys():
            likelihood_probs[attr][class_name] = {}
            
            # Filter data for this class
            class_data = data[data[target_column] == class_name]
            class_size = len(class_data)
            
            print(f"\n  Given Play = {class_name} (n = {class_size}):")
            
            for attr_value in attr_values:
                # Count occurrences of this attribute value in this class
                count = len(class_data[class_data[attr] == attr_value])
                prob = count / class_size if class_size > 0 else 0
                likelihood_probs[attr][class_name][attr_value] = prob
                
                print(f"    P({attr} = {attr_value} | Play = {class_name}) = {count}/{class_size} = {prob:.4f}")
    
    return likelihood_probs

# Calculate all probabilities
target_column = 'Play'
attributes = ['Outlook', 'Temperature', 'Humidity', 'Windy']

# Calculate prior probabilities
prior_probs = calculate_prior_probabilities(df_tennis, target_column)

# Calculate likelihood probabilities
likelihood_probs = calculate_likelihood_probabilities(df_tennis, target_column, attributes)

print(f"\n=== SUMMARY OF ALL PROBABILITIES ===")
print(f"Prior probabilities: {prior_probs}")
print(f"Likelihood probabilities calculated for attributes: {attributes}")


=== PRIOR PROBABILITIES P(Play) ===
P(Play = no) = 5/14 = 0.3571
P(Play = yes) = 9/14 = 0.6429

=== LIKELIHOOD PROBABILITIES P(Attribute|Class) ===

--- Outlook ---

  Given Play = no (n = 5):
    P(Outlook = sunny | Play = no) = 3/5 = 0.6000
    P(Outlook = overcast | Play = no) = 0/5 = 0.0000
    P(Outlook = rainy | Play = no) = 2/5 = 0.4000

  Given Play = yes (n = 9):
    P(Outlook = sunny | Play = yes) = 2/9 = 0.2222
    P(Outlook = overcast | Play = yes) = 4/9 = 0.4444
    P(Outlook = rainy | Play = yes) = 3/9 = 0.3333

--- Temperature ---

  Given Play = no (n = 5):
    P(Temperature = hot | Play = no) = 2/5 = 0.4000
    P(Temperature = mild | Play = no) = 2/5 = 0.4000
    P(Temperature = cool | Play = no) = 1/5 = 0.2000

  Given Play = yes (n = 9):
    P(Temperature = hot | Play = yes) = 2/9 = 0.2222
    P(Temperature = mild | Play = yes) = 4/9 = 0.4444
    P(Temperature = cool | Play = yes) = 3/9 = 0.3333

--- Humidity ---

  Given Play = no (n = 5):
    P(Humidity = high | Pl

In [20]:
# Part (b): Use the classifier to make predictions
# Test case: Outlook=sunny, Temperature=cool, Humidity=high, Windy=True

def naive_bayes_predict(instance, prior_probs, likelihood_probs, attributes):
    """
    Make prediction using Naive Bayesian Classifier
    P(C|X) ∝ P(C) * ∏ P(Ai|C)
    """
    print("=== NAIVE BAYESIAN PREDICTION ===")
    print(f"Test instance: {instance}")
    print()
    
    # Calculate posterior probabilities for each class
    posterior_probs = {}
    
    for class_name in prior_probs.keys():
        print(f"--- Calculating P(Play = {class_name} | X) ---")
        
        # Start with prior probability
        posterior = prior_probs[class_name]
        print(f"P(Play = {class_name}) = {posterior:.4f}")
        
        # Multiply by likelihood for each attribute
        for attr in attributes:
            attr_value = instance[attr]
            likelihood = likelihood_probs[attr][class_name][attr_value]
            posterior *= likelihood
            
            print(f"P({attr} = {attr_value} | Play = {class_name}) = {likelihood:.4f}")
            print(f"  Updated posterior: {posterior:.6f}")
        
        posterior_probs[class_name] = posterior
        print(f"Final P(Play = {class_name} | X) = {posterior:.6f}")
        print()
    
    # Find the class with highest posterior probability
    predicted_class = max(posterior_probs, key=posterior_probs.get)
    
    print("=== PREDICTION RESULT ===")
    print(f"Posterior probabilities:")
    for class_name, prob in posterior_probs.items():
        print(f"  P(Play = {class_name} | X) = {prob:.6f}")
    
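    # The raw products are only proportional to the posteriors; normalize them
    # so the reported confidence is an actual probability.
    total_score = sum(posterior_probs.values())
    confidence = posterior_probs[predicted_class] / total_score if total_score > 0 else 0.0

    print(f"\nPredicted class: {predicted_class}")
    print(f"Confidence (normalized posterior): {confidence:.4f}")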
    
    return predicted_class, posterior_probs

# Test instance: Outlook=sunny, Temperature=cool, Humidity=high, Windy=True
test_instance = {
    'Outlook': 'sunny',
    'Temperature': 'cool', 
    'Humidity': 'high',
    'Windy': True
}

# Make prediction
predicted_class, posterior_probs = naive_bayes_predict(test_instance, prior_probs, likelihood_probs, attributes)


=== NAIVE BAYESIAN PREDICTION ===
Test instance: {'Outlook': 'sunny', 'Temperature': 'cool', 'Humidity': 'high', 'Windy': True}

--- Calculating P(Play = no | X) ---
P(Play = no) = 0.3571
P(Outlook = sunny | Play = no) = 0.6000
  Updated posterior: 0.214286
P(Temperature = cool | Play = no) = 0.2000
  Updated posterior: 0.042857
P(Humidity = high | Play = no) = 0.8000
  Updated posterior: 0.034286
P(Windy = True | Play = no) = 0.6000
  Updated posterior: 0.020571
Final P(Play = no | X) = 0.020571

--- Calculating P(Play = yes | X) ---
P(Play = yes) = 0.6429
P(Outlook = sunny | Play = yes) = 0.2222
  Updated posterior: 0.142857
P(Temperature = cool | Play = yes) = 0.3333
  Updated posterior: 0.047619
P(Humidity = high | Play = yes) = 0.3333
  Updated posterior: 0.015873
P(Windy = True | Play = yes) = 0.3333
  Updated posterior: 0.005291
Final P(Play = yes | X) = 0.005291

=== PREDICTION RESULT ===
Posterior probabilities:
  P(Play = no | X) = 0.020571
  P(Play = yes | X) = 0.005291

Pre

## Summary and Conclusion

### Naive Bayesian Classifier Implementation

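For the test instance (Outlook=sunny, Temperature=cool, Humidity=high, Windy=True), the classifier computes an unnormalized score of 0.020571 for Play = no and 0.005291 for Play = yes. Since the score for Play = no is larger, the **predicted class is No**.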
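
The values labelled `P(Play = ... | X)` in the prediction trace are products of the prior and the likelihoods, so they are only proportional to the true posteriors. The sketch below, which is not part of the original notebook, normalizes them using exact fractions. The fractions are the prior and likelihood values from the trace above; the names `prior`, `likelihoods`, `scores`, and `posterior` are illustrative.

```python
from fractions import Fraction

# Priors and per-attribute likelihoods for the test instance
# (Outlook=sunny, Temperature=cool, Humidity=high, Windy=True),
# written as exact fractions of the counts in the training data.
prior = {"no": Fraction(5, 14), "yes": Fraction(9, 14)}
likelihoods = {
    "no": [Fraction(3, 5), Fraction(1, 5), Fraction(4, 5), Fraction(3, 5)],
    "yes": [Fraction(2, 9), Fraction(3, 9), Fraction(3, 9), Fraction(3, 9)],
}

# Unnormalized score: P(C) * product of P(A_i | C)
scores = {}
for cls in prior:
    score = prior[cls]
    for p in likelihoods[cls]:
        score *= p
    scores[cls] = score

# Dividing by the sum of the scores (the evidence) gives posteriors that sum to 1
evidence = sum(scores.values())
posterior = {cls: s / evidence for cls, s in scores.items()}

for cls in scores:
    print(f"{cls}: score={float(scores[cls]):.6f}, posterior={float(posterior[cls]):.4f}")
```

Normalization does not change which class wins, but it turns the scores into probabilities that can be compared across test instances.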




# Independence Analysis: Humidity and Windy Attributes

## Problem Statement
We need to analyze the independence relationships between Humidity and Windy attributes:
- **(c) Marginal Independence**: Are Humidity and Windy independent?
- **(d) Conditional Independence**: Are Humidity and Windy conditionally independent given the class label Play?

## Statistical Framework
For two attributes A and B to be independent:
- **Marginal Independence**: P(A, B) = P(A) × P(B)
- **Conditional Independence**: P(A, B | C) = P(A | C) × P(B | C)

We check these identities directly by comparing the empirical joint probabilities with the products of the corresponding marginal (or class-conditional) probabilities.


In [36]:
# Part (c): Marginal Independence Analysis
# Are Humidity and Windy independent?

# Marginal probabilities
p_humidity = df_tennis["Humidity"].value_counts(normalize=True)
p_windy = df_tennis["Windy"].value_counts(normalize=True)

# Joint probabilities
p_joint = df_tennis.groupby(["Humidity","Windy"]).size() / len(df_tennis)

# Compare expected vs actual
results = []
for h in p_humidity.index:
    for w in p_windy.index:
        expected = p_humidity[h] * p_windy[w]
        actual = p_joint.loc[h, w]
        results.append({
            "Humidity": h,
            "Windy": w,
            "Expected": round(expected, 3),
            "Actual": round(actual, 3),
            "Match?": abs(expected - actual) < 1e-6
        })

pd.DataFrame(results)

  Humidity  Windy  Expected  Actual  Match?
0     high  False     0.286   0.286    True
1     high   True     0.214   0.214    True
2   normal  False     0.286   0.286    True
3   normal   True     0.214   0.214    True


## Answer
Since the actual joint probability equals the product of the marginals for all four value combinations, Humidity and Windy are (marginally) independent in this dataset.

In [37]:
# Part (d): Conditional Independence Analysis
# Are Humidity and Windy conditionally independent given Play class?

# Function to check conditional independence
def check_conditional_independence(df, attr1, attr2, given):
    results = {}
    for val in df[given].unique():
        subset = df[df[given] == val]
        total = len(subset)
        
        # Marginal probabilities
        p_attr1 = subset[attr1].value_counts(normalize=True).to_dict()
        p_attr2 = subset[attr2].value_counts(normalize=True).to_dict()
        
        # Joint probabilities
        joint = subset.groupby([attr1, attr2]).size() / total
        
        # Compare P(attr1, attr2 | given) vs P(attr1|given) * P(attr2|given)
        comparison = {}
        for (a1, a2), p_joint in joint.items():
            p_expected = p_attr1.get(a1, 0) * p_attr2.get(a2, 0)
            comparison[(a1, a2)] = {
                "P_joint": round(p_joint, 3),
                "P_expected": round(p_expected, 3),
                "Match": abs(p_joint - p_expected) < 1e-6
            }
        results[val] = comparison
    return results

results = check_conditional_independence(df_tennis, "Humidity", "Windy", "Play")

# Print results
for play_val, comp in results.items():
    print(f"\nPlay = {play_val}")
    for cond, vals in comp.items():
        print(f"{cond}: {vals}")


Play = no
('high', False): {'P_joint': 0.4, 'P_expected': 0.32, 'Match': False}
('high', True): {'P_joint': 0.4, 'P_expected': 0.48, 'Match': False}
('normal', True): {'P_joint': 0.2, 'P_expected': 0.12, 'Match': False}

Play = yes
('high', False): {'P_joint': 0.222, 'P_expected': 0.222, 'Match': True}
('high', True): {'P_joint': 0.111, 'P_expected': 0.111, 'Match': True}
('normal', False): {'P_joint': 0.444, 'P_expected': 0.444, 'Match': True}
('normal', True): {'P_joint': 0.222, 'P_expected': 0.222, 'Match': True}


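## Answer:
- For Play = yes, every comparison has Match = True, so Humidity and Windy are independent within this class.
- For Play = no, every listed comparison has Match = False, so they are not independent within this class. (The unobserved combination (normal, False) is missing from the output; its joint probability is 0 against an expected 0.2 × 0.4 = 0.08, which is also a mismatch.)

Conditional independence given Play requires the identity to hold for every value of Play. Since it fails for Play = no, Humidity and Windy are **not** conditionally independent given Play.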
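
The exact probability comparison is what answers parts (c) and (d). As a complementary view, a Pearson chi-square statistic can be computed from the same counts. The sketch below is an illustration, not part of the original analysis: `chi_square_2x2` is a hypothetical helper, it applies no continuity correction, and it assumes every row and column total is non-zero. With only 5 records for Play = no, the chi-square approximation is unreliable, so the p-value there is indicative at best.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom,
    no continuity correction) for a 2x2 table of observed counts.
    Assumes all row and column totals are non-zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    return stat, erfc(sqrt(stat / 2))


# Rows: Humidity = high, normal; columns: Windy = False, True
all_records = [[4, 3], [4, 3]]  # counts over all 14 records
play_no = [[2, 2], [0, 1]]      # counts over the 5 records with Play = no

print("all records:", chi_square_2x2(all_records))
print("Play = no:  ", chi_square_2x2(play_no))
```

A statistic of zero for the full table mirrors the exact match found in part (c), while the non-zero statistic for Play = no reflects the mismatches found in part (d).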

# 3. Efficient Modification for Generalized Data Records

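When each generalized record carries a `count` attribute (the number of original records it represents), C4.5 can work on the counts directly instead of expanding the data.

**Key Modifications:**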

1. **Weighted Entropy Calculation**: Replace simple counts with weighted sums
   - `H(S) = -Σ(w_i/W) * log2(w_i/W)` where `w_i` is the weight (count) of class i and `W` is total weight

2. **Weighted Information Gain**: Use weighted proportions in entropy calculations
   - `IG(S,A) = H(S) - Σ(W_v/W) * H(S_v)` where `W_v` is total weight for attribute value v

3. **Weighted Split Information**: Calculate using weighted counts
   - `SI(S,A) = -Σ(W_i/W) * log2(W_i/W)` where `W_i` is total weight for attribute value i

**Efficiency Benefits:**
- **Space Complexity**: O(n) instead of O(Σcount_i), where n is the number of unique (generalized) records
- **Memory Usage**: 93.3% fewer records to store (165 expanded records → 11 unique records)
- **Time Complexity**: O(n) instead of O(Σcount_i) for each counting pass over the data

**Implementation Strategy:**
- Preserve count attribute throughout all calculations
- Use weighted sums instead of simple counts
- Maintain mathematical correctness with weighted proportions
- Validate results by comparing with expanded unweighted data
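
The weighted formulas and the validation step can be sketched as follows. This is a minimal illustration under stated assumptions: the records and their counts in `RECORDS` are made up for the example and are not the original data, and `weighted_entropy` and `weighted_gain_ratio` are hypothetical helper names.

```python
from collections import defaultdict
from math import log2

# Hypothetical generalized records: (department, status, salary, count).
# The counts are invented for illustration only.
RECORDS = [
    ("sales", "senior", "Medium", 3),
    ("sales", "junior", "Low", 5),
    ("systems", "junior", "Medium", 2),
    ("systems", "senior", "High", 4),
]


def weighted_entropy(weights):
    """H = -sum(w_i / W * log2(w_i / W)) over the positive weights."""
    total = sum(weights)
    return -sum(w / total * log2(w / total) for w in weights if w > 0)


def weighted_gain_ratio(records, col, target=2, weight=3):
    """Gain ratio for column `col`, treating field `weight` as a record count."""
    class_w = defaultdict(float)                       # total weight per class
    value_w = defaultdict(float)                       # W_v per attribute value
    joint_w = defaultdict(lambda: defaultdict(float))  # weight per (value, class)
    for r in records:
        class_w[r[target]] += r[weight]
        value_w[r[col]] += r[weight]
        joint_w[r[col]][r[target]] += r[weight]
    total = sum(class_w.values())
    remainder = sum(
        value_w[v] / total * weighted_entropy(joint_w[v].values()) for v in value_w
    )
    gain = weighted_entropy(class_w.values()) - remainder
    split_info = weighted_entropy(value_w.values())
    return gain / split_info if split_info > 0 else 0.0


# Validation: repeating each record `count` times with weight 1 must agree
expanded = [r[:3] + (1,) for r in RECORDS for _ in range(r[3])]
for col, name in [(0, "department"), (1, "status")]:
    compact = weighted_gain_ratio(RECORDS, col)
    full = weighted_gain_ratio(expanded, col)
    print(f"{name}: weighted={compact:.4f}, expanded={full:.4f}")
```

Because the function only ever sums weights, the same code handles both the compact and the expanded representation; the expanded data is just the special case where every weight is 1.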

**Example Efficiency:**
- Original approach: Would need to expand 165 records (sum of all counts)
- Weighted approach: Only processes 11 unique records
- **Result**: Identical entropy, information gain, and gain ratio values, because each formula depends on the data only through summed counts, while processing 11 records instead of 165

This approach maintains the mathematical integrity of the C4.5 algorithm while providing significant performance improvements for generalized data records.