If the pandas library is not already installed, uncomment and run the following line to install it.

In [42]:
# !pip install pandas

### We aim to build a decision tree classifier that can predict whether to play tennis based on various weather conditions.

**Reading Data**:
- The script reads the data from a CSV file named "PlayTennis.csv" with the `pd.read_csv()` function, which loads it into a pandas DataFrame. Each row records one set of weather conditions (Outlook, Temperature, Humidity, Wind) and whether tennis was played.

In [43]:
import pandas as pd
df = pd.read_csv("PlayTennis.csv")
print(df)

     Outlook Temperature Humidity    Wind Play Tennis
0      Sunny         Hot     High    Weak          No
1      Sunny         Hot     High  Strong          No
2   Overcast         Hot     High    Weak         Yes
3       Rain        Mild     High    Weak         Yes
4       Rain        Cool   Normal    Weak         Yes
5       Rain        Cool   Normal  Strong          No
6   Overcast        Cool   Normal  Strong         Yes
7      Sunny        Mild     High    Weak          No
8      Sunny        Cool   Normal    Weak         Yes
9       Rain        Mild   Normal    Weak         Yes
10     Sunny        Mild   Normal  Strong         Yes
11  Overcast        Mild     High  Strong         Yes
12  Overcast         Hot   Normal    Weak         Yes
13      Rain        Mild     High  Strong          No


**Calculating Entropy**:
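- Entropy is a measure of impurity or uncertainty in the dataset. The script defines two functions: `entropy`, which computes entropy from a list of probabilities, and `entropy_of_list`, which computes the entropy of a list of values.
- Mathematically, the formula for entropy is:
- $H = -\sum_{i=1}^n p(X_i) \log_2 p(X_i)$, where $p(X_i)$ is the proportion of examples in class $X_i$ and $n$ is the number of classes.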
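As a worked example, the `Play Tennis` column in the table above contains 9 `Yes` and 5 `No` labels out of 14 rows, so the formula gives:

$$H = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.410 + 0.530 = 0.940$$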

In [44]:
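# Calculate Entropy
import math

def entropy(probs):
    # Sum -p * log2(p) over the probabilities; p == 0 is skipped because it
    # contributes nothing to the sum and log(0) is undefined
    return sum(-prob*math.log(prob, 2) for prob in probs if prob > 0)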

- With the entropy function defined, we can calculate the entropy of a column of labels.
- For that, we use the `entropy_of_list` function. This function takes a list as input and returns the entropy of that list.
- It first counts how many times each distinct value occurs in the list. Then it divides each count by the total number of elements to get the probability of that value.
- Finally, it passes these probabilities to the `entropy` function.

In [45]:
# Calculating the probability of each class (e.g. positive and negative examples)
from collections import Counter

def entropy_of_list(a_list):
    cnt = Counter(a_list)  # Count how often each distinct value occurs
    num_instances = len(a_list)
    probs = [x / num_instances for x in cnt.values()]  # Convert the counts to probabilities
    return entropy(probs)
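A quick sanity check can make the behaviour concrete. The sketch below repeats the two functions so that it runs on its own; the three example lists are made up for illustration and are not part of the PlayTennis data.

```python
import math
from collections import Counter

def entropy(probs):
    return sum(-p * math.log(p, 2) for p in probs if p > 0)

def entropy_of_list(a_list):
    cnt = Counter(a_list)
    n = len(a_list)
    return entropy([c / n for c in cnt.values()])

# Illustrative label lists (hypothetical, not taken from the dataset)
pure = entropy_of_list(["Yes", "Yes", "Yes", "Yes"])    # a single class: no uncertainty
even = entropy_of_list(["Yes", "No", "Yes", "No"])      # 50/50 split: maximum uncertainty
skewed = entropy_of_list(["Yes", "Yes", "Yes", "No"])   # 75/25 split: somewhere in between

print(pure, even, round(skewed, 3))
```

A pure list has entropy 0, an even two-class split has entropy 1, and any other two-class mix falls between the two.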

- We pass the `Play Tennis` column (the target labels) to the `entropy_of_list` function. This gives the entropy of the whole dataset with respect to the target, before any split.

In [46]:
total_entropy = entropy_of_list(df['Play Tennis'])
print(total_entropy)

0.9402859586706309


**Calculating Information Gain**:
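- Information gain measures how effective it is to split a dataset on a particular attribute: it is the entropy of the target before the split minus the weighted average entropy of the subsets produced by the split. The script defines a function to calculate the information gain of a given attribute.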
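Written as a formula (the symbols $S$, $A$ and $S_v$ are notation introduced here, not names used by the script), with $S$ the dataset, $A$ the attribute and $S_v$ the subset of rows where $A$ takes the value $v$:

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

For example, splitting the table above on `Outlook` gives 4 `Overcast` rows (all `Yes`, entropy 0), 5 `Rain` rows (3 `Yes`, 2 `No`, entropy about 0.971) and 5 `Sunny` rows (2 `Yes`, 3 `No`, entropy about 0.971), so:

$$\mathrm{Gain}(S, \text{Outlook}) \approx 0.9403 - \left(\frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971 + \frac{5}{14}\cdot 0.971\right) \approx 0.9403 - 0.6935 \approx 0.247$$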

In [47]:
# Calculating Information Gain Function:
def information_gain(df, split_attribute_name, target_attribute_name, trace=0):
    # Split the DataFrame into groups, one per value of the split attribute, e.g.:
    # Sunny, Rain, Overcast ==> Outlook
    # Hot, Mild, Cool ==> Temperature
    # High, Normal ==> Humidity
    # Weak, Strong ==> Wind
    df_split = df.groupby(split_attribute_name)

    # Total number of observations in the DataFrame being split
    nobs = len(df.index) * 1.0

    # df_agg_ent is short for "df aggregated entropy". agg() applies each function
    # to the target column of every group, giving one row per attribute value with
    # the group's entropy and its share of the observations. For Outlook:
    #           entropy_of_list  <lambda_0>
    # Overcast         0.000000    0.285714
    # Rain             0.970951    0.357143
    # Sunny            0.970951    0.357143
    df_agg_ent = df_split.agg({target_attribute_name: [entropy_of_list, lambda x: len(x) / nobs]})[target_attribute_name]
    if trace:  # Optionally print the per-group table
        print(df_agg_ent)

    # Weighted average entropy of the groups after the split
    avg_info = sum(df_agg_ent['entropy_of_list'] * df_agg_ent['<lambda_0>'])

    # Entropy of the target before the split
    old_entropy = entropy_of_list(df[target_attribute_name])

    # Information gain = entropy before the split - weighted entropy after it
    return old_entropy - avg_info
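To see which attribute wins at the root, the gains can be compared side by side. The following is a minimal sketch in plain Python (no pandas) that hard-codes the 14 rows of the table above; the names `ROWS`, `COLUMNS`, `entropy_of_labels` and `gain` are hypothetical helpers introduced for this illustration and are not part of the notebook.

```python
import math
from collections import Counter, defaultdict

# The 14 rows of the PlayTennis table: (Outlook, Temperature, Humidity, Wind, Play Tennis)
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy_of_labels(labels):
    n = len(labels)
    return sum(-(c / n) * math.log(c / n, 2) for c in Counter(labels).values())

def gain(rows, col_index):
    # Group the target labels by the value of the chosen attribute
    groups = defaultdict(list)
    for row in rows:
        groups[row[col_index]].append(row[-1])
    # Weighted average entropy of the groups after the split
    weighted = sum(len(g) / len(rows) * entropy_of_labels(g) for g in groups.values())
    # Entropy before the split minus weighted entropy after it
    return entropy_of_labels([row[-1] for row in rows]) - weighted

gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS)}
for name, g in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {g:.3f}")

best = max(gains, key=gains.get)
print("Best attribute:", best)
```

On this data the gains come out at roughly 0.247 for Outlook, 0.152 for Humidity, 0.048 for Wind and 0.029 for Temperature, so Outlook is the attribute with the highest gain over the full dataset.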

In [48]:
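def id3DT(df, target_attribute_name, attribute_names, default_class=None):
  from collections import Counter
  # Count how many examples of each class this subset contains
  cnt = Counter(x for x in df[target_attribute_name])
  if len(cnt)==1:
    # Pure subset: return its single class as a leaf
    return next(iter(cnt))
  elif df.empty or (not attribute_names):
    # Nothing left to split on: return the default class passed down by the caller
    return default_class
  else:
    # Majority class of this subset, used as the default for its subtrees
    # (max(cnt.keys()) would pick the alphabetically last label, not the most common one)
    default_class = cnt.most_common(1)[0][0]
    # Choose the attribute with the highest information gain
    gainz = [information_gain(df,attr, target_attribute_name) for attr in attribute_names]
    index_of_max = gainz.index(max(gainz))
    best_attr = attribute_names[index_of_max]
    # The tree is a nested dict: {attribute: {value: subtree or class label}}
    tree = {best_attr:{}}
    remaining_attributes_names = [i for i in attribute_names if i != best_attr]
    # Build one subtree per value of the chosen attribute
    for attr_val, data_subset in df.groupby(best_attr):
      subtree = id3DT(data_subset,target_attribute_name,remaining_attributes_names,default_class)
      tree[best_attr][attr_val]=subtree
    return tree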

attribute_names = list(df.columns)
attribute_names.remove('Play Tennis')

In [49]:
from pprint import pprint
tree = id3DT(df,'Play Tennis',attribute_names)
print("The Resultant Decision Tree is ")
pprint(tree)
attribute = next(iter(tree))
print("Best Attribute: \n", attribute)
print("Tree Keys\n ", tree[attribute].keys())

          entropy_of_list  <lambda_0>
Outlook                              
Overcast         0.000000    0.285714
Rain             0.970951    0.357143
Sunny            0.970951    0.357143
             entropy_of_list  <lambda_0>
Temperature                             
Cool                0.811278    0.285714
Hot                 1.000000    0.285714
Mild                0.918296    0.428571
          entropy_of_list  <lambda_0>
Humidity                             
High             0.985228         0.5
Normal           0.591673         0.5
        entropy_of_list  <lambda_0>
Wind                               
Strong         1.000000    0.428571
Weak           0.811278    0.571429
             entropy_of_list  <lambda_0>
Temperature                             
Cool                1.000000         0.4
Mild                0.918296         0.6
          entropy_of_list  <lambda_0>
Humidity                             
High             1.000000         0.4
Normal           0.918296      

In [50]:
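# Classifying new sample

def classify(instance, tree, default=None):
  attribute=next(iter(tree))  # Attribute tested at this node
  print("Key:",tree.keys())
  print("Attribute",attribute)
  if instance[attribute] in tree[attribute].keys():
    result=tree[attribute][instance[attribute]]
    print("Instance Attribute:",instance[attribute], "TreeKeys:",tree[attribute].keys())
    if isinstance(result,dict):
      # Subtree: keep descending, passing the default along
      return classify(instance,result,default)
    else:
      # Leaf: this is the predicted class
      return result
  else:
    # Value not seen at this node during training: fall back to the default
    return default

# Note: the training data uses 'Rain' and 'Wind'. 'Rainy' therefore matches no branch,
# so the first row gets the default 'No'; the 'Windy' column is never looked up here.
new_samples={'Outlook':['Rainy','Sunny'],'Temperature':['Mild','Hot'],'Humidity':['Normal','High'],'Windy':['Weak','Strong']}
df2=pd.DataFrame(new_samples)
df2['Predicted']=df2.apply(classify,axis=1, args=(tree,'No'))
print(df2)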


Key: dict_keys(['Outlook'])
Attribute Outlook
Key: dict_keys(['Outlook'])
Attribute Outlook
Instance Attribute: Sunny TreeKeys: dict_keys(['Overcast', 'Rain', 'Sunny'])
Key: dict_keys(['Humidity'])
Attribute Humidity
Instance Attribute: High TreeKeys: dict_keys(['High', 'Normal'])
  Outlook Temperature Humidity   Windy Predicted
0   Rainy        Mild   Normal    Weak        No
1   Sunny         Hot     High  Strong        No
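The first prediction above is the fallback value rather than a real decision, because `Rainy` is not a value the tree was trained on (the data uses `Rain`). The sketch below illustrates the difference. It is self-contained and rests on two assumptions: the tree literal is written by hand as the structure ID3 would be expected to produce for the 14 rows above, and `classify_quiet` is a hypothetical variant of `classify` without the debug prints.

```python
# Hand-written tree (an assumption for illustration, not output copied from the notebook)
tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    }
}

def classify_quiet(instance, tree, default=None):
    attribute = next(iter(tree))        # Attribute tested at this node
    branches = tree[attribute]
    value = instance.get(attribute)     # None if the instance lacks this attribute
    if value not in branches:
        return default                  # Unseen value or missing column
    result = branches[value]
    if isinstance(result, dict):
        return classify_quiet(instance, result, default)
    return result

samples = [
    # Uses the training vocabulary: 'Rain' and 'Wind'
    {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak"},
    # Uses 'Rainy' and 'Windy', which the tree has never seen
    {"Outlook": "Rainy", "Temperature": "Mild", "Humidity": "Normal", "Windy": "Weak"},
    {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"},
]
predictions = [classify_quiet(s, tree, default="No") for s in samples]
print(predictions)  # ['Yes', 'No', 'No']
```

With matching labels the rainy, weak-wind sample is classified as `Yes`; with `Rainy` it silently falls back to the default, so new samples should use the same column names and values as the training data.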
