# Question 1.

In Question 1, we have to build a decision tree of depth 1 using the given dataset, with '**Annual_Income**' as the target feature.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('A2_Q1.csv')
df1 = df.copy()
df.sample(5)

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income
3,4,30,bachelors,married,professional,mid
4,5,37,high school,married,agriculture,mid
10,11,36,high school,never married,transport,mid
18,19,39,high school,divorced,professional,high
17,18,37,bachelors,married,professional,mid


## Part A

In Part A of Question 1, we have to compute the impurity of the target feature, '**Annual_Income**', using the Gini index.

The gini impurity index is defined as follows:
$$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$

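Here $\ell$ is the number of levels of the target feature $t$, and $P(t=i)$ is the proportion of instances whose target level is $i$.

The Gini index is the impurity measure used here to compute the Information Gain of a feature. 
The more mixed the target levels in a set of instances are, the higher the Gini index; a set containing a single level has a Gini index of 0. 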
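
As a quick sanity check of the definition, the sketch below computes the Gini index for two small hand-made label lists. The function name `gini_index` and the toy labels are illustrative assumptions; they are not part of the assignment data.

```python
from collections import Counter

def gini_index(labels):
    """Return 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Toy labels (hypothetical, not taken from A2_Q1.csv).
pure = ["mid", "mid", "mid", "mid"]
mixed = ["low", "mid", "mid", "high"]

print(gini_index(pure))   # 0.0: a single level is perfectly pure
print(gini_index(mixed))  # 0.625 = 1 - (0.25**2 + 0.5**2 + 0.25**2)
```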

Below, we calculate the Gini impurity of the target feature using a custom method called '**compute_GiniImpurity**'.

In [3]:
## Method to compute Impurity by given feature using Gini Index.
def compute_GiniImpurity(feature):
    probs = feature.value_counts(normalize=True)
    impurity = 1 - np.sum(np.square(probs))    
    return(round(impurity, 3))

In [4]:
## Step 1 : Calculate Target Impurity using Gini Index.
Target_Impurity = compute_GiniImpurity(df['Annual_Income'])
Target_Impurity

0.555

## Part B

In Part B of Question 1, we have to compute the Impurity Remainder and Information Gain of each descriptive feature in the given data set using the Gini index, and then choose the feature with the highest Information Gain as the optimal feature for the split.

The Impurity Remainder of a feature is calculated by partitioning the data set by the levels of that feature, computing the Gini impurity of the target within each partition, and summing these impurities weighted by the fraction of instances that fall into each partition.

Information Gain is then calculated by subtracting the Impurity Remainder of a feature from the Target Impurity calculated in Part A.
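
To make the two definitions concrete, here is a small sketch on a hypothetical six-row data set. The helper names (`gini_index`, `remainder`, `information_gain`) and the toy rows are assumptions made for illustration, and intermediate values are only rounded when printed.

```python
from collections import Counter, defaultdict

def gini_index(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def remainder(feature_values, labels):
    """Weighted sum of partition impurities after splitting on a feature."""
    partitions = defaultdict(list)
    for value, label in zip(feature_values, labels):
        partitions[value].append(label)
    n = len(labels)
    return sum(len(part) / n * gini_index(part) for part in partitions.values())

def information_gain(feature_values, labels):
    """Impurity of the whole set minus the remainder after the split."""
    return gini_index(labels) - remainder(feature_values, labels)

# Hypothetical toy data (not taken from A2_Q1.csv).
married = ["yes", "yes", "yes", "no", "no", "no"]
income = ["mid", "mid", "high", "low", "low", "mid"]

print(round(remainder(married, income), 3))         # 0.444
print(round(information_gain(married, income), 3))  # 0.167
```

Each partition here holds half of the rows and has a Gini impurity of 4/9, so the remainder is 4/9, and the gain is 22/36 - 16/36 = 1/6.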

To compute the Information Gain of a given feature, we have created a method called 'compute_Information_Gain'.

In [5]:
## Method to compute Information Gain based on given (DataFrame, Target Feature, Descriptive Feature).
## Note: it reads the global Target_Impurity and appends one row to the global df_split.

def compute_Information_Gain(df, target, Desc_feature):
    
    print("Descriptive Feature: ", Desc_feature)
    print("Target Feature: ", target)
    Remaining_Impurity_list = list()
    
    for level in df[Desc_feature].unique():
        df_feature_level = df[df[Desc_feature] == level]
        ## Weight of this partition: the share of all instances that fall into it.
        weight = len(df_feature_level) / len(df)
        Remaining_Impurity_list.append(round(compute_GiniImpurity(df_feature_level[target]) * weight, 3))
        
    feature_remaining_impurity = round(np.sum(Remaining_Impurity_list), 3)
    IG = round(Target_Impurity - feature_remaining_impurity, 3)
    
    print("Remaining Impurity of the feature is: ", feature_remaining_impurity)
    print("Information Gain of the feature is: ", IG)
    df_split.loc[len(df_split)] = [Desc_feature, feature_remaining_impurity, IG, False]
    print("==========================================")
    
## Defining the final Data Frame df_split    
df_split = pd.DataFrame(columns = ['Split', 'Remainder', 'Information_Gain', 'Is_Optimal'])

## Calculating IG of each descriptive feature, excluding:
## ID, which is not a descriptive feature;
## Annual_Income, which is the target feature;
## Age, which is continuous and is converted to binary threshold features in the following cells.
for feature in df.drop(columns = ['ID', 'Age', 'Annual_Income']).columns:
    compute_Information_Gain(df, 'Annual_Income', feature)


Descriptive Feature:  Education
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.537
Information Gain of the feature is:  0.018
Descriptive Feature:  Marital_Status
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.468
Information Gain of the feature is:  0.087
Descriptive Feature:  Occupation
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.433
Information Gain of the feature is:  0.122


After calculating the Impurity Remainder and Information Gain, we added those values to the results data frame 'df_split'. Because the Information Gain of the 'Age' feature has not been calculated yet, 'Is_Optimal' is set to 'False' as a default value for now.

In [6]:
## Contents of df_split so far.
## The IG of the Age thresholds is not available yet, so 'Is_Optimal' cannot be filled in correctly.
## Hence, every row currently holds the default value 'False' in 'Is_Optimal'.
df_split

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Education,0.537,0.018,False
1,Marital_Status,0.468,0.087,False
2,Occupation,0.433,0.122,False


After calculating the Information Gain of each categorical feature, we have to calculate the Information Gain of the 'Age' feature, which is continuous.

The easiest way to handle a continuous descriptive feature in a decision tree is to define a threshold within the range of values that the feature can take, and to use this threshold to partition the instances based on whether their values are above or below it. 

To find the best threshold, we use the following 4 steps:

1. We sort the data set in ascending order of the 'Age' feature.
2. We look for adjacent instances in the sorted data set that have different 'Annual_Income' values.
3. We take the average of the 'Age' values of each such pair as a candidate threshold.
4. Each threshold defines a boolean feature (whether 'Age' is at or above the threshold), which then competes with the categorical features on Impurity Remainder and Information Gain.
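
The four steps can be sketched in plain Python as follows. The function name `candidate_thresholds` and the five toy rows are hypothetical and serve only to illustrate the procedure.

```python
def candidate_thresholds(ages, targets):
    """Midpoints between adjacent (age-sorted) rows whose target differs."""
    # Step 1: sort the rows by age.
    rows = sorted(zip(ages, targets), key=lambda row: row[0])
    thresholds = []
    # Steps 2 and 3: average the ages of adjacent rows with different targets.
    for (age_a, target_a), (age_b, target_b) in zip(rows, rows[1:]):
        if target_a != target_b:
            thresholds.append((age_a + age_b) / 2)
    return thresholds

# Hypothetical toy data (not taken from A2_Q1.csv).
ages = [18, 23, 25, 30, 41]
targets = ["low", "low", "high", "mid", "mid"]

thresholds = candidate_thresholds(ages, targets)
print(thresholds)  # [24.0, 27.5]

# Step 4: each threshold becomes a boolean feature.
age_at_least_24 = [age >= 24.0 for age in ages]
print(age_at_least_24)  # [False, False, True, True, True]
```

Note that in this sketch, rows with equal ages but different targets would yield a threshold equal to that age, and the same threshold can appear more than once; a fuller implementation might deduplicate the list.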

Step 1:

In [7]:
## To work with 'Age' continuous feature, First we sort the data frame in Ascending Order of Age Values
df = df.sort_values(by = 'Age').reset_index(drop = True)
df.head(5)

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income
0,3,18,high school,never married,agriculture,low
1,6,23,high school,never married,agriculture,low
2,13,23,bachelors,never married,agriculture,low
3,20,25,bachelors,married,transport,high
4,14,25,high school,married,professional,high


Step 2 and Step 3:

In [8]:
## In this cell, we find the Age values at which the target value changes between adjacent rows.
## Agegroup_list contains the midpoints of those rows' ages, which are the candidate split thresholds.
Agegroup_list = list()
for i in range(0,len(df)-1):
    if(df.loc[i]['Annual_Income']!= df.loc[i+1]['Annual_Income']):
        Agegroup_list.append((df.loc[i]['Age'] + df.loc[i+1]['Age'])/2)
Agegroup_list

[24.0, 27.0, 38.0, 42.0]

In [9]:
## Create one binary split feature per threshold ('Age_24', 'Age_27', 'Age_38', 'Age_42').
## The value is the string 'True' when Age >= threshold and 'False' otherwise.

for threshold in Agegroup_list:
    df['Age_' + str(int(threshold))] = np.where(df['Age'] >= threshold, 'True', 'False')

df.sample(5)

Unnamed: 0,ID,Age,Education,Marital_Status,Occupation,Annual_Income,Age_24,Age_27,Age_38,Age_42
11,18,37,bachelors,married,professional,mid,True,True,False,False
1,6,23,high school,never married,agriculture,low,False,False,False,False
0,3,18,high school,never married,agriculture,low,False,False,False,False
14,8,40,doctorate,married,professional,high,True,True,True,False
10,5,37,high school,married,agriculture,mid,True,True,False,False


Step 4: 

In [10]:
## Computing the Information Gain of each Age threshold feature by calling the method 'compute_Information_Gain'.
Agefeature_list = ['Age_24', 'Age_27', 'Age_38', 'Age_42']
for feature in Agefeature_list:
    compute_Information_Gain(df, 'Annual_Income', feature)

Descriptive Feature:  Age_24
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.353
Information Gain of the feature is:  0.202
Descriptive Feature:  Age_27
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.36
Information Gain of the feature is:  0.195
Descriptive Feature:  Age_38
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.529
Information Gain of the feature is:  0.026
Descriptive Feature:  Age_42
Target Feature:  Annual_Income
Remaining Impurity of the feature is:  0.473
Information Gain of the feature is:  0.082


After computing the Information Gain of every feature in the data set, we have to choose the optimal feature for the first split of the decision tree. To do so, we assign 'True' to the 'Is_Optimal' column of the split with the highest Information Gain.

The feature '**Age_24**' has the highest Information Gain of all the features, 0.202, so its 'Is_Optimal' value is set to 'True'.
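
The selection step itself can be sketched without pandas. The Information Gain values below are copied from the outputs reported above; the names `gains`, `best_split`, and `is_optimal` are illustrative assumptions.

```python
# Information Gain per split, as reported in the outputs above.
gains = {
    "Education": 0.018,
    "Marital_Status": 0.087,
    "Occupation": 0.122,
    "Age_24": 0.202,
    "Age_27": 0.195,
    "Age_38": 0.026,
    "Age_42": 0.082,
}

# The optimal split is the one with the largest Information Gain.
best_split = max(gains, key=gains.get)
is_optimal = {split: split == best_split for split in gains}

print(best_split)  # Age_24
```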

In [11]:
## Set 'Is_Optimal' to True for the split with the highest Information Gain.
## A boolean True is used so that the column keeps a single type alongside the default False values.
df_split.loc[df_split['Information_Gain'] == df_split['Information_Gain'].max(), 'Is_Optimal'] = True

In [12]:
df_split

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Education,0.537,0.018,False
1,Marital_Status,0.468,0.087,False
2,Occupation,0.433,0.122,False
3,Age_24,0.353,0.202,True
4,Age_27,0.36,0.195,False
5,Age_38,0.529,0.026,False
6,Age_42,0.473,0.082,False


## Part C

In this part of Question 1, we assume that the 'Education' descriptive feature is at the root node of the decision tree.

We have to calculate the probability of each target feature level within each Education leaf. The prediction of each leaf is then the target level with the highest probability in that leaf.
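
As a minimal sketch of this calculation for a single leaf, the snippet below derives the level probabilities and the majority prediction from a list of target values. The function name `leaf_summary` and the four-row leaf are hypothetical.

```python
from collections import Counter

def leaf_summary(labels, levels=("low", "mid", "high")):
    """Level probabilities in one leaf plus the most probable level."""
    counts = Counter(labels)
    n = len(labels)
    probs = {level: counts[level] / n for level in levels}
    # On a tie, max() returns the level listed first in `levels`.
    prediction = max(probs, key=probs.get)
    return probs, prediction

# Hypothetical leaf contents (not taken from A2_Q1.csv).
leaf = ["mid", "mid", "mid", "high"]
probs, prediction = leaf_summary(leaf)

print(probs)       # {'low': 0.0, 'mid': 0.75, 'high': 0.25}
print(prediction)  # mid
```

Keying the dictionary by level rather than by probability keeps levels with equal probabilities distinct.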

First, we look at the leaves created by splitting on the 'Education' feature. There are 3 leaf conditions, one per level: 'high school', 'bachelors', and 'doctorate'.

In [13]:
df1['Education'].value_counts()

high school    8
bachelors      8
doctorate      4
Name: Education, dtype: int64

We have created a method called 'prediction' that calculates the probability of each target level for each leaf condition, computes the prediction for each leaf, and stores the results in the data frame 'df_predicted', as shown below.

In [14]:
## Method to build the df_predicted data frame as required.
## Note: it appends one row per 'Education' level to the global df_predicted.
def prediction(df, target):
    
    for level in df['Education'].unique():
        Leaf_Condition = "Education == " + level
        df_Education_level = df[df['Education'] == level]
        ## Proportion of each target level among the instances in this leaf.
        Low_Income_Prob = round(len(df_Education_level[df_Education_level[target] == 'low']) / len(df_Education_level), 3)
        Mid_Income_Prob = round(len(df_Education_level[df_Education_level[target] == 'mid']) / len(df_Education_level), 3)
        High_Income_Prob = round(len(df_Education_level[df_Education_level[target] == 'high']) / len(df_Education_level), 3)
        
        ## Keys are target levels, so levels with equal probabilities do not collide.
        probs = {"low": Low_Income_Prob, "mid": Mid_Income_Prob, "high": High_Income_Prob}
        Leaf_prediction = max(probs, key=probs.get)
        
        df_predicted.loc[len(df_predicted)] = [Leaf_Condition, Low_Income_Prob, Mid_Income_Prob, High_Income_Prob, Leaf_prediction]

In [15]:
df_predicted = pd.DataFrame(columns = ['Leaf_Condition', 'Low_Income_Prob', 'Mid_Income_Prob', 'High_Income_Prob','Leaf_prediction'])       
prediction(df1,"Annual_Income")

Question 1 wrap-up.
1. df_split
2. df_predicted

In [16]:
##Final df_split data frame
df_split

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Education,0.537,0.018,False
1,Marital_Status,0.468,0.087,False
2,Occupation,0.433,0.122,False
3,Age_24,0.353,0.202,True
4,Age_27,0.36,0.195,False
5,Age_38,0.529,0.026,False
6,Age_42,0.473,0.082,False


In [17]:
## Final df_predicted data frame
df_predicted

Unnamed: 0,Leaf_Condition,Low_Income_Prob,Mid_Income_Prob,High_Income_Prob,Leaf_prediction
0,Education == bachelors,0.125,0.625,0.25,mid
1,Education == doctorate,0.0,0.75,0.25,mid
2,Education == high school,0.25,0.5,0.25,mid


---