## I solemnly swear that I have not discussed my assignment solutions with anyone in any way and the solutions I am submitting are my own personal work.

## Full Name: Karan Singh

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np

Load the dataset and drop the `ID` column, because it carries no useful information for predicting the target.

In [2]:
pd.set_option('display.max_columns', None) 

df = pd.read_csv('A2.csv') #Loading the Data Set
df = df.drop("ID", axis=1) 
df.head()

Unnamed: 0,Age,Education,Marital_Status,Occupation,Annual_Income
0,39,bachelors,never married,professional,high
1,50,doctorate,married,professional,mid
2,18,high school,never married,agriculture,low
3,30,bachelors,married,professional,mid
4,37,high school,married,agriculture,mid


## Part A Solution

For Part A, we define a function that computes the Gini index or the entropy of a feature, and then apply it to the target feature.

In [3]:
def compute_impurity(feature, impurity_criterion):
    """
    This function calculates impurity of a feature.
    Supported impurity criteria: 'entropy', 'gini'
    input: feature (this needs to be a Pandas series)
    output: feature impurity
    """
    probs = feature.value_counts(normalize=True)
    
    if impurity_criterion == 'entropy':
        impurity = -1 * np.sum(np.log2(probs) * probs)
    elif impurity_criterion == 'gini':
        impurity = 1 - np.sum(np.square(probs))
    else:
        raise ValueError('Unknown impurity criterion')
        
    return(round(impurity, 3))

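The Gini impurity index is defined as follows:
$$\mathrm{Gini}(x) := 1 - \sum_{i=1}^{\ell} P(t=i)^2$$
where $\ell$ is the number of levels of the target feature $t$ and $P(t=i)$ is the proportion of instances with level $i$.

The idea behind the Gini index is the same as for entropy: the more heterogeneous and impure a feature is, the higher its Gini index.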
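
As a quick sanity check of the definition, the sketch below evaluates both impurity measures on small made-up label series. The `labels` and `pure` values are illustrative assumptions and are not taken from `A2.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels (not from A2.csv): two 'low' and two 'high'
labels = pd.Series(['low', 'low', 'high', 'high'])

# Class proportions P(t = i)
probs = labels.value_counts(normalize=True)

gini = 1 - np.sum(np.square(probs))        # 1 - (0.5^2 + 0.5^2) = 0.5
entropy = -np.sum(probs * np.log2(probs))  # -(2 * 0.5 * log2(0.5)) = 1.0

# A pure series (a single class) has zero Gini impurity
pure = pd.Series(['mid', 'mid', 'mid']).value_counts(normalize=True)
gini_pure = 1 - np.sum(np.square(pure))    # 1 - 1^2 = 0.0
```

A perfectly balanced two-class series reaches the two-class maximum of each measure (0.5 for Gini, 1 bit for entropy), while a pure series scores zero.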

In [4]:
split_criterion = 'gini'
target_impurity = compute_impurity(df['Annual_Income'], split_criterion)
target_impurity

0.555

Part A answer: the Gini impurity of the target feature `Annual_Income` is 0.555.

## Part B Solution

Let's compute the information gain for splitting based on a descriptive feature to figure out the best feature to split on. For this task, we do the following:

1. Compute the impurity of the target feature (using either entropy or the Gini index).
2. Partition the dataset based on the unique values of the descriptive feature.
3. Compute the impurity of each partition.
4. Compute the remaining impurity as the weighted sum of the impurities of the partitions.
5. Compute the information gain as the difference between the impurity of the target feature and the remaining impurity.

We will define another function to achieve this, called comp_feature_information_gain().
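
Before the full function, the procedure above can be sketched compactly with `groupby`. This is a minimal illustration for categorical features and the Gini criterion only; the helper names `gini` and `information_gain` and the `toy` data are assumptions made for this example, and unlike the notebook's function it does not round intermediate values.

```python
import numpy as np
import pandas as pd

def gini(series):
    """Gini index of a pandas Series of class labels."""
    probs = series.value_counts(normalize=True)
    return 1 - np.sum(np.square(probs))

def information_gain(df, target, feature):
    """Impurity of the target minus the weighted impurity of the partitions."""
    # Partition on the feature and compute the impurity of each partition
    partition_impurity = df.groupby(feature)[target].apply(gini)
    # Relative size of each partition, indexed by feature level
    weights = df[feature].value_counts(normalize=True)
    # Weighted sum; pandas aligns the two Series on the feature levels
    remainder = (partition_impurity * weights).sum()
    # Information gain = target impurity - remaining impurity
    return gini(df[target]) - remainder

# Hypothetical toy data (not from A2.csv)
toy = pd.DataFrame({
    'Occupation': ['a', 'a', 'b', 'b'],
    'Annual_Income': ['low', 'low', 'high', 'mid'],
})
gain = information_gain(toy, 'Annual_Income', 'Occupation')
# target Gini = 0.625, remainder = 0.5 * 0.0 + 0.5 * 0.5 = 0.25, gain = 0.375
```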

In [5]:
def comp_feature_information_gain(df, target, descriptive_feature, split_criterion, con_feature_cut=None):
    """
    This function calculates information gain for splitting on 
    a particular descriptive feature for a given dataset
    and a given impurity criteria.
    Supported split criterion: 'entropy', 'gini'
    """
    
    print('target feature:', target)
    print('descriptive_feature:', descriptive_feature if con_feature_cut is None else '{}_{}'.format(descriptive_feature, con_feature_cut))
    print('split criterion:', split_criterion)
            
    target_entropy = compute_impurity(df[target], split_criterion)

    # we define two lists below:
    # entropy_list to store the entropy of each partition
    # weight_list to store the relative number of observations in each partition
    entropy_list = list()
    weight_list = list()
    
    if con_feature_cut is not None:
        # For less than con_feature_cut
        df_feature_level = df[df[descriptive_feature] < con_feature_cut]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))
        # For greater than or equal to con_feature_cut (so rows equal to the cut are not dropped)
        df_feature_level = df[df[descriptive_feature] >= con_feature_cut]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))
    else:
        # loop over each level of the descriptive feature
        # to partition the dataset with respect to that level
        # and compute the entropy and the weight of the level's partition
        for level in df[descriptive_feature].unique():
            df_feature_level = df[df[descriptive_feature] == level]
            entropy_level = compute_impurity(df_feature_level[target], split_criterion)
            entropy_list.append(round(entropy_level, 3))
            weight_level = len(df_feature_level) / len(df)
            weight_list.append(round(weight_level, 3))

    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)

    feature_remaining_impurity = round(np.sum(np.array(entropy_list) * np.array(weight_list)), 4)
    print('remaining impurity:', feature_remaining_impurity)
    
    information_gain = target_entropy - feature_remaining_impurity
    print('information gain:', round(information_gain, 4))
    
    print('====================')

    return round(information_gain, 4)

In [6]:
df.sort_values(by='Age', ascending=True)

Unnamed: 0,Age,Education,Marital_Status,Occupation,Annual_Income
2,18,high school,never married,agriculture,low
5,23,high school,never married,agriculture,low
12,23,bachelors,never married,agriculture,low
19,25,bachelors,married,transport,high
13,25,high school,married,professional,high
15,29,bachelors,never married,agriculture,mid
3,30,bachelors,married,professional,mid
9,33,high school,married,transport,mid
14,35,bachelors,married,agriculture,mid
10,36,high school,never married,transport,mid


Because Age is a continuous feature, we sort the rows by age and take as candidate thresholds the mid-points between adjacent ages where the target feature value changes.

This strategy gives the following cuts: 24, 27, 38, 42
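
The mid-point strategy can also be automated. The sketch below uses a hypothetical helper, `candidate_cuts`, and only the ten rows displayed in the sorted output above, so it recovers just the first two cuts. It assumes that rows sharing the same age also share the same target value; ties with mixed labels would need extra handling.

```python
import pandas as pd

def candidate_cuts(df, feature, target):
    """Mid-points between adjacent distinct feature values whose target labels differ."""
    ordered = df.sort_values(by=feature)
    values = ordered[feature].tolist()
    labels = ordered[target].tolist()
    cuts = set()
    for i in range(len(values) - 1):
        # A cut is only useful where the label changes and the values differ
        if labels[i] != labels[i + 1] and values[i] != values[i + 1]:
            cuts.add((values[i] + values[i + 1]) / 2)
    return sorted(cuts)

# The ten rows shown in the sorted output (Age and Annual_Income only)
shown = pd.DataFrame({
    'Age': [18, 23, 23, 25, 25, 29, 30, 33, 35, 36],
    'Annual_Income': ['low', 'low', 'low', 'high', 'high',
                      'mid', 'mid', 'mid', 'mid', 'mid'],
})
cuts = candidate_cuts(shown, 'Age', 'Annual_Income')  # [24.0, 27.0]
```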

In [7]:
for cut in [24, 27, 38, 42]:
    comp_feature_information_gain(df, 'Annual_Income', 'Age', split_criterion, con_feature_cut=cut)

target feature: Annual_Income
descriptive_feature: Age_24
split criterion: gini
impurity of partitions: [0.0, 0.415]
weights of partitions: [0.15, 0.85]
remaining impurity: 0.3527
information gain: 0.2023
target feature: Annual_Income
descriptive_feature: Age_27
split criterion: gini
impurity of partitions: [0.48, 0.32]
weights of partitions: [0.25, 0.75]
remaining impurity: 0.36
information gain: 0.195
target feature: Annual_Income
descriptive_feature: Age_38
split criterion: gini
impurity of partitions: [0.569, 0.469]
weights of partitions: [0.6, 0.4]
remaining impurity: 0.529
information gain: 0.026
target feature: Annual_Income
descriptive_feature: Age_42
split criterion: gini
impurity of partitions: [0.631, 0.0]
weights of partitions: [0.75, 0.25]
remaining impurity: 0.4732
information gain: 0.0818


In [8]:
for feature in df.drop(['Age', 'Annual_Income'],axis=1).columns:
    feature_info_gain = comp_feature_information_gain(df, 'Annual_Income', feature, split_criterion)

target feature: Annual_Income
descriptive_feature: Education
split criterion: gini
impurity of partitions: [0.531, 0.375, 0.625]
weights of partitions: [0.4, 0.2, 0.4]
remaining impurity: 0.5374
information gain: 0.0176
target feature: Annual_Income
descriptive_feature: Marital_Status
split criterion: gini
impurity of partitions: [0.611, 0.42, 0.375]
weights of partitions: [0.3, 0.5, 0.2]
remaining impurity: 0.4683
information gain: 0.0867
target feature: Annual_Income
descriptive_feature: Occupation
split criterion: gini
impurity of partitions: [0.5, 0.5, 0.278]
weights of partitions: [0.4, 0.3, 0.3]
remaining impurity: 0.4334
information gain: 0.1216


In [9]:
df_splits = pd.DataFrame(columns=['Split', 'Remainder', 'Information_Gain', 'Is_Optimal'])
df_splits.loc[len(df_splits)] = ['Age_24', 0.3527, 0.2023, True]
df_splits.loc[len(df_splits)] = ['Age_27', 0.36, 0.195, False]
df_splits.loc[len(df_splits)] = ['Age_38', 0.529, 0.026, False]
df_splits.loc[len(df_splits)] = ['Age_42', 0.4732, 0.0818, False]
df_splits.loc[len(df_splits)] = ['Education', 0.5374, 0.0176, False]
df_splits.loc[len(df_splits)] = ['Marital_Status', 0.4683, 0.0867, False]
df_splits.loc[len(df_splits)] = ['Occupation', 0.4334, 0.1216, False]
df_splits

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Age_24,0.3527,0.2023,True
1,Age_27,0.36,0.195,False
2,Age_38,0.529,0.026,False
3,Age_42,0.4732,0.0818,False
4,Education,0.5374,0.0176,False
5,Marital_Status,0.4683,0.0867,False
6,Occupation,0.4334,0.1216,False
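
The `Is_Optimal` flag above is entered by hand. As a hedged alternative, it could be derived from the `Information_Gain` column so that it cannot drift out of sync with the numbers. The sketch below rebuilds a reduced copy of the table (named `splits` here for illustration) from the values reported above.

```python
import pandas as pd

# Information gains copied from the outputs above
splits = pd.DataFrame({
    'Split': ['Age_24', 'Age_27', 'Age_38', 'Age_42',
              'Education', 'Marital_Status', 'Occupation'],
    'Information_Gain': [0.2023, 0.195, 0.026, 0.0818, 0.0176, 0.0867, 0.1216],
})

# Flag the row(s) with the highest information gain
splits['Is_Optimal'] = splits['Information_Gain'] == splits['Information_Gain'].max()

# Name of the best split
best_split = splits.loc[splits['Information_Gain'].idxmax(), 'Split']  # 'Age_24'
```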


## Part C Solution

In this part, we take the 'Education' descriptive feature as the root node and make a prediction at each leaf.

In [10]:
for degree in df['Education'].unique():
    print("Education level: ", degree)
    for income_level in df['Annual_Income'].unique():
        print("Income: ", income_level)
        result_df = df.loc[(df['Education'] == degree) & (df['Annual_Income'] == income_level)]
        print("probability: ", round(len(result_df)/ len(df), 3))
    print("=============================")
        

Education level:  bachelors
Income:  high
probability:  0.1
Income:  mid
probability:  0.25
Income:  low
probability:  0.05
Education level:  doctorate
Income:  high
probability:  0.05
Income:  mid
probability:  0.15
Income:  low
probability:  0.0
Education level:  high school
Income:  high
probability:  0.1
Income:  mid
probability:  0.2
Income:  low
probability:  0.1
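
The loop above divides by `len(df)`, so it reports joint proportions over the whole dataset. The majority class of each leaf is the same either way, but a leaf's class distribution is often reported conditional on its partition. A minimal sketch with `pd.crosstab`, using made-up rows (the `toy` frame is an assumption, not the assignment data):

```python
import pandas as pd

# Hypothetical toy rows (not from A2.csv)
toy = pd.DataFrame({
    'Education': ['bachelors', 'bachelors', 'bachelors', 'bachelors',
                  'doctorate', 'doctorate', 'doctorate'],
    'Annual_Income': ['mid', 'mid', 'mid', 'high', 'mid', 'mid', 'high'],
})

# Joint proportions: every cell is divided by the total number of rows
joint = pd.crosstab(toy['Education'], toy['Annual_Income'], normalize='all')

# Within-leaf proportions: every row (leaf) sums to 1
within_leaf = pd.crosstab(toy['Education'], toy['Annual_Income'], normalize='index')

# Majority class per leaf
leaf_prediction = within_leaf.idxmax(axis=1)
```

Here `within_leaf.loc['bachelors', 'mid']` is 0.75 (3 of the 4 bachelors rows), whereas the joint proportion for the same cell is 3/7.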


In [11]:
df_prediction = pd.DataFrame(columns=['Leaf_Condition', 'Low_Income_Prob', 'Mid_Income_Prob', 'High_Income_Prob', 'Leaf_Prediction'])
df_prediction.loc[len(df_prediction)] = ['bachelors', 0.05, 0.25, 0.1, 'mid']
df_prediction.loc[len(df_prediction)] = ['doctorate', 0, 0.15, 0.05, 'mid']
df_prediction.loc[len(df_prediction)] = ['high school', 0.1, 0.2, 0.1, 'mid']
df_prediction

Unnamed: 0,Leaf_Condition,Low_Income_Prob,Mid_Income_Prob,High_Income_Prob,Leaf_Prediction
0,bachelors,0.05,0.25,0.1,mid
1,doctorate,0.0,0.15,0.05,mid
2,high school,0.1,0.2,0.1,mid


The probabilities are proportions of the full dataset, taken from the output of the previous cell. Every leaf predicts 'mid' because it is the most frequent income level within each Education partition. Education gives the lowest information gain of all candidate splits (0.0176), so splitting on it barely changes the class distribution relative to the full dataset.

In [12]:
df_splits

Unnamed: 0,Split,Remainder,Information_Gain,Is_Optimal
0,Age_24,0.3527,0.2023,True
1,Age_27,0.36,0.195,False
2,Age_38,0.529,0.026,False
3,Age_42,0.4732,0.0818,False
4,Education,0.5374,0.0176,False
5,Marital_Status,0.4683,0.0867,False
6,Occupation,0.4334,0.1216,False


In [13]:
df_prediction

Unnamed: 0,Leaf_Condition,Low_Income_Prob,Mid_Income_Prob,High_Income_Prob,Leaf_Prediction
0,bachelors,0.05,0.25,0.1,mid
1,doctorate,0.0,0.15,0.05,mid
2,high school,0.1,0.2,0.1,mid
