Write a program for the following task: A dataset collected in a cosmetics
shop records details of customers and whether or not they responded to a
special offer to buy a new lipstick, as shown in the table below. (Implement
it step by step using commands; do not use a library.) Use this dataset to
build a decision tree, with Buys as the target variable, to help predict
lipstick purchases in the future. Find the root node of the decision tree.

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
df = pd.read_csv('cosmetics_shop_data.csv', header=0)
df.head()

Unnamed: 0,Age,Income,Gender,Marital Status,Buys
0,19,Medium,Other,Single,Yes
1,35,High,Other,Married,No
2,27,Low,Male,Married,Yes
3,45,Low,Male,Single,Yes
4,29,High,Other,Married,No


In [3]:
# Bin the continuous 'Age' attribute into discrete categories.
# pd.cut uses right-closed intervals by default:
# Young (0, 30], Middle (30, 40], Old (40, 100]
bins = [0, 30, 40, 100]
labels = ["Young", "Middle", "Old"]  # one label per interval
df["Age"] = pd.cut(df["Age"], bins=bins, labels=labels)
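
Because the exercise asks for a step-by-step implementation, the same binning can also be sketched without pandas. The helper `bin_age` below is hypothetical (it is not part of the notebook) and assumes the right-closed intervals that `pd.cut` applies by default.

```python
# Hypothetical helper: assign an age to a right-closed interval (lo, hi],
# mirroring the default behaviour of pd.cut(..., right=True).
def bin_age(age, bins=(0, 30, 40, 100), labels=("Young", "Middle", "Old")):
    for lo, hi, label in zip(bins[:-1], bins[1:], labels):
        if lo < age <= hi:
            return label
    return None  # age falls outside every bin (pd.cut would give NaN)

# Ages taken from the first five rows shown by df.head()
ages = [19, 35, 27, 45, 29]
print([bin_age(a) for a in ages])
# ['Young', 'Middle', 'Young', 'Old', 'Young']
```

Note that an age of exactly 30 lands in "Young" and exactly 40 lands in "Middle", since the upper edge of each interval is inclusive.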

In [4]:
df.head()

Unnamed: 0,Age,Income,Gender,Marital Status,Buys
0,Young,Medium,Other,Single,Yes
1,Middle,High,Other,Married,No
2,Young,Low,Male,Married,Yes
3,Old,Low,Male,Single,Yes
4,Young,High,Other,Married,No


In [5]:
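# Helper function: Shannon entropy (in bits) of the target column
def entropy(data, target_attr):
    # Relative frequency of each class in the target column
    probs = data[target_attr].value_counts(normalize=True)
    # Drop zero-probability classes so that log2(0) is never evaluated
    probs = probs[probs > 0]
    return -sum(probs * probs.apply(math.log2))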
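
For reference, the helper above computes the Shannon entropy H = -sum(p_i * log2(p_i)), where p_i is the fraction of rows belonging to class i. A library-free sketch of the same calculation follows; the function name `entropy_of_labels` and the label lists are illustrative assumptions, not taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical pure-Python counterpart of entropy(): works on a plain list
def entropy_of_labels(labels):
    n = len(labels)
    counts = Counter(labels)  # class -> number of occurrences
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Made-up labels: an even split is maximally uncertain for two classes
print(entropy_of_labels(["Yes", "Yes", "No", "No"]))  # 1.0

# Made-up labels: a 3-to-1 split is less uncertain
print(round(entropy_of_labels(["Yes", "Yes", "Yes", "No"]), 4))  # 0.8113
```

A list containing only one class has entropy zero, which is why pure subsets make good leaves in a decision tree.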


In [6]:
# Helper function: rows where the given attribute equals the given value
def split_data(data, attribute, value):
    return data[data[attribute] == value]

In [7]:
# Helper function: information gain obtained by splitting on an attribute
def information_gain(data, attribute, target_attr):
    total_entropy = entropy(data, target_attr)
    values = data[attribute].unique()
    subset_entropy = 0

    # Weighted average of the entropy of each subset after the split
    for value in values:
        subset = split_data(data, attribute, value)
        subset_entropy += len(subset) / len(data) * entropy(subset, target_attr)

    return total_entropy - subset_entropy
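
The function above implements Gain(S, A) = H(S) - sum over values v of A of (|S_v| / |S|) * H(S_v). As a sanity check, the calculation can be sketched in plain Python on a tiny made-up table; the rows and the names `entropy_of_labels` and `info_gain` are hypothetical and only serve to show the two extreme cases.

```python
import math
from collections import Counter, defaultdict

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attribute, target):
    # Group the target labels by the value of the splitting attribute
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # Weighted entropy remaining after the split
    weighted = sum(len(g) / len(rows) * entropy_of_labels(g)
                   for g in groups.values())
    return entropy_of_labels([r[target] for r in rows]) - weighted

# Made-up rows: Income separates the classes perfectly, Gender not at all
toy_rows = [
    {"Income": "High", "Gender": "Male", "Buys": "No"},
    {"Income": "High", "Gender": "Other", "Buys": "No"},
    {"Income": "Low", "Gender": "Male", "Buys": "Yes"},
    {"Income": "Low", "Gender": "Other", "Buys": "Yes"},
]
print(info_gain(toy_rows, "Income", "Buys"))  # 1.0
print(info_gain(toy_rows, "Gender", "Buys"))  # 0.0
```

A gain equal to the full entropy means the attribute removes all uncertainty; a gain of zero means the split tells us nothing about the target.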

In [8]:
# Find the root node: the feature with the highest information gain
target_attribute = 'Buys'
features = ['Age', 'Income', 'Gender', 'Marital Status']
root_node = None
max_gain = -1

print('Information gain for each feature:')
for feature in features:
    gain = information_gain(df, feature, target_attribute)
    print(f'{feature}: {gain:.4f}')
    if gain > max_gain:
        max_gain = gain
        root_node = feature

print(f'root node is {root_node}')

Information gain for each feature:
Age: 0.0035
Income: 0.0040
Gender: 0.0004
Marital Status: 0.0000
root node is Income
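
Once the root is known, the same selection can be repeated on each branch to grow the rest of the tree. The following is a minimal ID3-style sketch under stated assumptions: `build_tree` and the toy rows are hypothetical, rows are plain dictionaries rather than the notebook's DataFrame, ties go to the first attribute with the maximum gain, and no pruning is attempted.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attribute, target):
    total = entropy_of_labels([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy_of_labels(subset)
    return total - remainder

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Leaf: all rows agree, or no attributes remain -> return majority class
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Internal node: split on the attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree

# Made-up rows for illustration only (not the cosmetics shop data)
toy_rows = [
    {"Income": "High", "Gender": "Male", "Buys": "No"},
    {"Income": "High", "Gender": "Other", "Buys": "No"},
    {"Income": "Low", "Gender": "Male", "Buys": "Yes"},
    {"Income": "Low", "Gender": "Other", "Buys": "Yes"},
]
print(build_tree(toy_rows, ["Income", "Gender"], "Buys"))
# {'Income': {'High': 'No', 'Low': 'Yes'}}
```

The nested dictionary maps each chosen attribute to its values, and each value either to a class label (a leaf) or to another sub-tree.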
