## AdaBoost Tutorial

## Install requirements

In [1]:
!pip install pandas
!pip install numpy



## Load data set

Dataset source:

* https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

Dataset attributes:

* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [3]:
import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                 names = ["age", "workclass", "fnlwgt", "education", "education-num", 
                          "marital-status", "occupation", "relationship",
                         "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"], index_col=False)

In [4]:
df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [6]:
print(f"Length of the whole dataset: {len(df)}")

Length of the whole dataset: 32561


## Simplify the dataset

The following code creates a simplified version of the dataset to make the explanations below easier to follow. It keeps only the first 10 rows and reduces them to three binary attributes and a binary target.

Attributes:

* Is the person older than 50? - Yes/No
* Is the person male? - Yes/No
* Does the person work more than 40 hours per week? - Yes/No

Target variable:

* Does the person earn more than 50,000 dollars? - Yes/No

In [7]:
import numpy as np

# The raw CSV values may carry leading whitespace (e.g. ' Male'), so strip before comparing
df['male'] = np.where(df['sex'].str.strip() == 'Male', 'Yes', 'No')
df['>40 hours'] = np.where(df['hours-per-week'] > 40, 'Yes', 'No')
df['>50 years'] = np.where(df['age'] > 50, 'Yes', 'No')

# Target
df['>50k income'] = np.where(df['income'].str.strip() == '>50K', 'Yes', 'No')

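# Keep the first 10 rows; .copy() gives an independent frame, so columns can be added later without a SettingWithCopyWarning
df_simpl = df[['male', '>40 hours', '>50 years', '>50k income']].head(10).copy()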
df_simpl

Unnamed: 0,male,>40 hours,>50 years,>50k income
0,Yes,No,No,No
1,Yes,No,No,No
2,Yes,No,No,No
3,Yes,No,Yes,No
4,No,No,No,No
5,No,No,No,No
6,No,No,No,No
7,Yes,Yes,Yes,Yes
8,No,Yes,No,Yes
9,Yes,No,No,Yes


## Find the first Decision Stump

* In the first step, we calculate the weighted Gini index for each attribute.
* We then select the attribute with the lowest Gini index (highest Gini gain) as the root node of our first decision stump.

![title](img/images.svg)
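
To make the selection criterion concrete, here is a minimal sketch that computes the weighted Gini index for the `>40 hours` attribute by hand, using counts read off the simplified 10-row table above. The helper names `gini_impurity` and `weighted_gini` are hypothetical and only serve this illustration; treating an empty leaf as having zero impurity is an assumption.

```python
def gini_impurity(n_yes, n_no):
    """Gini impurity of a leaf holding n_yes positive and n_no negative samples."""
    total = n_yes + n_no
    if total == 0:
        return 0.0  # assumption: an empty leaf contributes no impurity
    p_yes = n_yes / total
    p_no = n_no / total
    return 1 - p_yes ** 2 - p_no ** 2


def weighted_gini(leaf_counts):
    """Weighted Gini index of a split; leaf_counts is a list of (n_yes, n_no) tuples."""
    n = sum(n_yes + n_no for n_yes, n_no in leaf_counts)
    return sum((n_yes + n_no) / n * gini_impurity(n_yes, n_no)
               for n_yes, n_no in leaf_counts)


# Counts for '>40 hours' in the 10-row table:
# 'Yes' leaf: 2 people earn >50k, 0 do not
# 'No' leaf:  1 person earns >50k, 7 do not
gini_hours = weighted_gini([(2, 0), (1, 7)])
print(round(gini_hours, 3))  # 0.175
```

The `Yes` leaf is pure (impurity 0), while the `No` leaf has impurity 1 - (1/8)² - (7/8)² ≈ 0.219; weighting by leaf size gives 0.8 · 0.219 ≈ 0.175.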

In [8]:
def calc_weighted_gini_index(attribute, df_simpl):
    '''
    Args:
        df_simpl: the simplified data set stored in a data frame
        attribute: the chosen attribute for the root node of the tree
    Return:
        Gini_attribute: the gini index for the chosen attribute
    '''
    d_node = df_simpl[[attribute, '>50k income']]
    
    n = len(d_node)
    n_1 = len(d_node[d_node[attribute] == 'Yes'])
    n_2 = len(d_node[d_node[attribute] == 'No'])

    n_1_yes = len(d_node[(d_node[attribute] == 'Yes') & (d_node[">50k income"] == 'Yes')])
    n_1_no = len(d_node[(d_node[attribute] == 'Yes') & (d_node[">50k income"] == 'No')])
    n_2_yes = len(d_node[(d_node[attribute] == 'No') & (d_node[">50k income"] == 'Yes')])
    n_2_no = len(d_node[(d_node[attribute] == 'No') & (d_node[">50k income"] == 'No')])

    Gini_1 = 1-(n_1_yes/(n_1_yes + n_1_no)) ** 2-(n_1_no/(n_1_yes + n_1_no)) ** 2
    Gini_2 = 1-(n_2_yes/(n_2_yes + n_2_no)) ** 2-(n_2_no/(n_2_yes + n_2_no)) ** 2

    # weighted Gini impurity for the selected attribute
    Gini_attribute = (n_1/n) * Gini_1 + (n_2/n) * Gini_2
    Gini_attribute = round(Gini_attribute, 3)
    
    #print(f'Gini_{attribute} = {Gini_attribute}')
    
    return Gini_attribute

attributes = []
gini_indexes = []

# calculate gini index for each attribute in the data set and store them in a list
for attribute in df_simpl.columns[:-1]:
    gini_index = calc_weighted_gini_index(attribute, df_simpl)
    
    attributes.append(attribute)
    gini_indexes.append(gini_index)
    
# create a data frame using the just defined lists for the calculated gini_indexes
print("Calculated Gini indexes for each attribute of the data set:")
d_calculated_indexes = {'attribute':attributes,'gini_index':gini_indexes}
d_indexes_df = pd.DataFrame(d_calculated_indexes)
display(d_indexes_df)

print("Find the attribute for the first stump: the attribute with the lowest Gini index and thus the highest Gini gain")
# idxmin() locates the row with the smallest gini_index; DataFrame.min() would take the minimum of each column independently
attribute_stump_1 = d_indexes_df.loc[d_indexes_df["gini_index"].idxmin(), "attribute"]
print(f"Attribute for the root node of the first stump: {attribute_stump_1}")

Calculated Gini indexes for each attribute of the data set:


Unnamed: 0,attribute,gini_index
0,male,0.417
1,>40 hours,0.175
2,>50 years,0.4


Find the attribute for the first stump: the attribute with the lowest Gini index and thus the highest Gini gain
Attribute for the root node of the first stump: >40 hours


## Calculate weight 

In [17]:
df_simpl

Unnamed: 0,male,>40 hours,>50 years,>50k income,sample_weight,chosen_stump_incorrect,error
0,Yes,No,No,No,0.1,0,0.0
1,Yes,No,No,No,0.1,0,0.0
2,Yes,No,No,No,0.1,0,0.0
3,Yes,No,Yes,No,0.1,0,0.0
4,No,No,No,No,0.1,0,0.0
5,No,No,No,No,0.1,0,0.0
6,No,No,No,No,0.1,0,0.0
7,Yes,Yes,Yes,Yes,0.1,0,0.0
8,No,Yes,No,Yes,0.1,0,0.0
9,Yes,No,No,Yes,0.1,1,0.1


In [34]:
def calculate_error_for_chosen_stump(df_simpl, attribute_stump_1): 
    '''
    Args:
        df_simpl: training data set
        attribute_stump_1: name of the column used for the root node of the stump
    
    Return:
        df_simpl_extended: df_simpl extended by the calculated weights and error
        error: calculated error for the stump - sum of the weights of all samples that were misclassified by the decision stump
        amount_of_say: weight of the stump in the final vote, calculated from its error
    '''
    # add a column for the sample weight; in the first step it is simply 1/n, so the weights sum to 1
    df_simpl["sample_weight"] = 1/len(df_simpl)

    # in binary classification there are two ways to build the stump: (1) attribute and target show the same value, or (2) attribute and target show opposite values
    # we choose the one that makes fewer errors

    df_simpl["stump_1_incorrect_v1"] = np.where(((df_simpl[attribute_stump_1] == "Yes") & (df_simpl[">50k income"] == "Yes")) |
                                             ((df_simpl[attribute_stump_1] == "No") & (df_simpl[">50k income"] == "No")),0,1)
    df_simpl["stump_1_incorrect_v2"] = np.where(((df_simpl[attribute_stump_1] == "Yes") & (df_simpl[">50k income"] == "No")) |
                                             ((df_simpl[attribute_stump_1] == "No") & (df_simpl[">50k income"] == "Yes")),0,1)

    # select the stump with fewer samples misclassified
    if sum(df_simpl['stump_1_incorrect_v1']) <= sum(df_simpl["stump_1_incorrect_v2"]): 
        df_simpl["chosen_stump_incorrect"] = df_simpl['stump_1_incorrect_v1']
    else:
        df_simpl["chosen_stump_incorrect"] = df_simpl['stump_1_incorrect_v2']

    # drop the columns for the two versions of the tree
    df_simpl = df_simpl.drop(['stump_1_incorrect_v1', 'stump_1_incorrect_v2'], axis=1)

    # calculate the error by multiplying sample weight and chosen_stump
    df_simpl["error"] = df_simpl["sample_weight"] * df_simpl["chosen_stump_incorrect"]
    error = sum(df_simpl["error"])
    
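    # calculate the amount of say: the weight of this weak classifier in the final vote, derived from its weighted error
    # note: an error of exactly 0 or 1 makes this expression undefined
    amount_of_say = 1/2 * np.log((1-error)/error)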

    df_simpl_extended = df_simpl
    
    return df_simpl_extended, error, amount_of_say

# call function to calculate error and data frame
df_simpl_extended_1, error, amount_of_say = calculate_error_for_chosen_stump(df_simpl, attribute_stump_1)

# display the extended data frame returned by the function
display(df_simpl_extended_1[[attribute_stump_1,">50k income","sample_weight", "chosen_stump_incorrect", "error"]])

print(f'Error of the stump [{attribute_stump_1}] = {error}')
print(f'Amount of say of the stump = {round(amount_of_say,3)}')

Unnamed: 0,>40 hours,>50k income,sample_weight,chosen_stump_incorrect,error
0,No,No,0.1,0,0.0
1,No,No,0.1,0,0.0
2,No,No,0.1,0,0.0
3,No,No,0.1,0,0.0
4,No,No,0.1,0,0.0
5,No,No,0.1,0,0.0
6,No,No,0.1,0,0.0
7,Yes,Yes,0.1,0,0.0
8,Yes,Yes,0.1,0,0.0
9,No,Yes,0.1,1,0.1


Error of the stump [>40 hours] = 0.1
Amount of say of the stump = 1.099
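
The amount of say printed above follows directly from the stump's error via 1/2 · ln((1 - error) / error). The sketch below reproduces the value for an error of 0.1. The function name `compute_amount_of_say` and the `eps` clamp are assumptions added for illustration; the clamp only keeps the logarithm finite when a stump classifies every sample correctly or incorrectly, and is not part of the notebook code.

```python
import math


def compute_amount_of_say(error, eps=1e-10):
    """Return 1/2 * ln((1 - error) / error) for a weighted error in [0, 1]."""
    # eps is a hypothetical safeguard: it keeps error away from exactly 0 and 1
    error = min(max(error, eps), 1 - eps)
    return 0.5 * math.log((1 - error) / error)


alpha = compute_amount_of_say(0.1)
print(round(alpha, 3))  # 1.099
```

A stump with an error of 0.5 (no better than guessing) gets an amount of say of 0, and the value grows as the error approaches 0.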


## Next Steps?

- Choose the next stump
- Calculate its weight
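
Before the next stump is chosen, standard AdaBoost updates the sample weights so that misclassified samples count more. The notebook does not implement this step yet, so the following is only a sketch under that assumption: `update_sample_weights` is a hypothetical helper, and the inputs are the uniform weights and the `chosen_stump_incorrect` flags from the first stump.

```python
import math


def update_sample_weights(weights, incorrect, alpha):
    """Scale misclassified samples by e^(+alpha), correct ones by e^(-alpha), then normalize."""
    scaled = [w * math.exp(alpha if wrong else -alpha)
              for w, wrong in zip(weights, incorrect)]
    total = sum(scaled)
    return [s / total for s in scaled]  # normalized so the weights sum to 1 again


weights = [0.1] * 10                        # uniform weights of the first round
incorrect = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # only sample 9 was misclassified
alpha = 0.5 * math.log((1 - 0.1) / 0.1)     # amount of say of the first stump

new_weights = update_sample_weights(weights, incorrect, alpha)
print([round(w, 3) for w in new_weights])
```

With these inputs the single misclassified sample ends up with a weight of 0.5 and each of the nine correctly classified samples with about 0.056, so the next stump is pushed towards getting sample 9 right.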

