## AdaBoost

## Load data set

* Source: UCI Machine Learning Repository, Adult data set: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

Attributes:

* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [1]:
pip install pandas


In [2]:
# import pandas as pd

# df = pd.read_csv("./data/adult.data",
#                  names = ["age", "workclass", "fnlwgt", "education", "education-num", 
#                           "marital-status", "occupation", "relationship",
#                          "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"], index_col=False)

In [20]:
import pandas as pd

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                 names = ["age", "workclass", "fnlwgt", "education", "education-num", 
                          "marital-status", "occupation", "relationship",
                         "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"], index_col=False)

In [21]:
df.head(10)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


In [4]:
len(df)

32561

## Simplify the data set

In [17]:
import numpy as np

def male_or_not(row):
    if row['sex'].lstrip() == 'Male':
        val = 'Yes'
    else:
        val = 'No'
    return val

def income_over_50(row):
    if row['income'].lstrip() == '>50K':
        val = 'Yes'
    else:
        val = 'No'
    return val

df['male'] = df.apply(male_or_not, axis=1)
df['>40 hours'] = np.where(df['hours-per-week']>40, 'Yes', 'No')
df['>50 years'] = np.where(df['age']>50, 'Yes', 'No')

# Target
df['>50k income'] = df.apply(income_over_50, axis=1)

df_simpl = df[['male', '>40 hours','>50 years','>50k income']]
df_simpl = df_simpl.head(10)
df_simpl

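| | male | >40 hours | >50 years | >50k income |
|---|---|---|---|---|
| 0 | Yes | No | No | No |
| 1 | Yes | No | No | No |
| 2 | Yes | No | No | No |
| 3 | Yes | No | Yes | No |
| 4 | No | No | No | No |
| 5 | No | No | No | No |
| 6 | No | No | No | No |
| 7 | Yes | Yes | Yes | Yes |
| 8 | No | Yes | No | Yes |
| 9 | Yes | No | No | Yes |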


## Find the first Decision Stump

- Calculate Gini Index

![title](img/images.svg)
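
For reference, the quantity computed by the function in the next cell can be written out as follows. This assumes the usual two-class Gini impurity; the symbols $n_1$, $n_2$, $n$ mirror the variable names `n_1`, `n_2`, `n` in the code, and $p_{\text{yes}}$, $p_{\text{no}}$ are the class proportions inside one branch of the stump.

$$
\mathrm{Gini}_{\text{branch}} = 1 - p_{\text{yes}}^2 - p_{\text{no}}^2,
\qquad
p_{\text{yes}} = \frac{n_{\text{yes}}}{n_{\text{yes}} + n_{\text{no}}},
\qquad
p_{\text{no}} = \frac{n_{\text{no}}}{n_{\text{yes}} + n_{\text{no}}}
$$

$$
\mathrm{Gini}_{\text{attribute}} = \frac{n_1}{n}\,\mathrm{Gini}_1 + \frac{n_2}{n}\,\mathrm{Gini}_2
$$
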

In [19]:
def calc_weighted_gini_index(attribute, df_simpl):
    '''
    Calculate the weighted Gini index of a stump that splits on one attribute.

    Args:
        attribute: the chosen attribute for the root node of the tree
        df_simpl: the simplified data set stored in a data frame
    Return:
        Gini_attribute: the weighted Gini index for the chosen attribute
    '''
    d_node = df_simpl[[attribute, '>50k income']]
    
    n = len(d_node)
    n_1 = len(d_node[d_node[attribute] == 'Yes'])
    n_2 = len(d_node[d_node[attribute] == 'No'])

    n_1_yes = len(d_node[(d_node[attribute] == 'Yes') & (d_node[">50k income"] == 'Yes')])
    n_1_no = len(d_node[(d_node[attribute] == 'Yes') & (d_node[">50k income"] == 'No')])
    n_2_yes = len(d_node[(d_node[attribute] == 'No') & (d_node[">50k income"] == 'Yes')])
    n_2_no = len(d_node[(d_node[attribute] == 'No') & (d_node[">50k income"] == 'No')])

    Gini_1 = 1-(n_1_yes/(n_1_yes + n_1_no)) ** 2-(n_1_no/(n_1_yes + n_1_no)) ** 2
    Gini_2 = 1-(n_2_yes/(n_2_yes + n_2_no)) ** 2-(n_2_no/(n_2_yes + n_2_no)) ** 2

    # weighted Gini impurity for the selected attribute
    Gini_attribute = (n_1/n) * Gini_1 + (n_2/n) * Gini_2
    Gini_attribute = round(Gini_attribute, 3)
    print(f'Gini_{attribute} = {Gini_attribute}')
    
    return Gini_attribute

for attribute in df_simpl.columns[:-1]:
    calc_weighted_gini_index(attribute, df_simpl)

Gini_male = 0.417
Gini_>40 hours = 0.175
Gini_>50 years = 0.4
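
The values above can be cross-checked with a more compact, self-contained sketch. It re-creates the ten rows of `df_simpl` by hand and uses `groupby` instead of explicit counts, so a branch with no rows cannot cause a division by zero. The names `df_check`, `weighted_gini`, `gini_scores` and `best_attribute` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The ten rows of df_simpl shown above, re-created so the sketch runs on its own.
df_check = pd.DataFrame({
    "male":        ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"],
    ">40 hours":   ["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "No"],
    ">50 years":   ["No", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No"],
    ">50k income": ["No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"],
})


def weighted_gini(df, attribute, target=">50k income"):
    """Weighted Gini impurity of a stump that splits on `attribute` (illustrative helper)."""
    total = 0.0
    for _, branch in df.groupby(attribute):
        # Class proportions inside this branch; groupby only yields non-empty branches.
        p = branch[target].value_counts(normalize=True)
        gini_branch = 1.0 - (p ** 2).sum()
        # Weight each branch by its share of the rows.
        total += len(branch) / len(df) * gini_branch
    return round(total, 3)


# Score every feature column (all columns except the target in the last position).
gini_scores = {col: weighted_gini(df_check, col) for col in df_check.columns[:-1]}

# The stump with the lowest weighted Gini index separates the classes best.
best_attribute = min(gini_scores, key=gini_scores.get)

print(gini_scores)
print("best stump:", best_attribute)
```

On these ten rows the sketch reproduces the three values printed above and selects `>40 hours`, the attribute with the lowest weighted Gini index.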


## Calculate weight 

## Next Steps?

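- Choose the stump: the attribute with the lowest weighted Gini index (here `>40 hours` with 0.175)
- Calculate its weight (the "amount of say" of the stump)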
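
These two steps could look roughly like the sketch below. The notebook does not yet specify a formula, so the sketch assumes the common discrete AdaBoost form, where the stump weight is `alpha = 0.5 * ln((1 - error) / error)` and the sample weights are rescaled by `exp(±alpha)` and renormalised. It also assumes the `>40 hours` stump predicts `>50k income = Yes` exactly when `>40 hours = Yes`. All variable names are illustrative.

```python
import numpy as np

# Ten toy rows from df_simpl: the '>40 hours' feature and the '>50k income' target.
over_40_hours = np.array(["No"] * 7 + ["Yes", "Yes", "No"])
income_over_50k = np.array(["No"] * 7 + ["Yes", "Yes", "Yes"])

# Assumption: the stump predicts 'Yes' for the target exactly when '>40 hours' is 'Yes'.
predictions = over_40_hours
misclassified = predictions != income_over_50k

# Every sample starts with the same weight 1/n.
n = len(income_over_50k)
weights = np.full(n, 1.0 / n)

# Weighted error of the stump, clipped so that log and division stay finite.
eps = 1e-10
error = np.clip(weights[misclassified].sum(), eps, 1 - eps)

# Amount of say (stump weight), assuming the common discrete AdaBoost formula.
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weights of misclassified samples, lower the others, then renormalise.
new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
new_weights /= new_weights.sum()

print(f"error = {error:.3f}, alpha = {alpha:.3f}")
print(np.round(new_weights, 3))
```

With this stump only row 9 is misclassified, so the weighted error is 0.1 and the stump receives a fairly large amount of say. After the update, the misclassified row carries half of the total weight, which is what pushes the next stump to focus on it.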

