Using the ADULT dataset (https://archive-beta.ics.uci.edu/ml/datasets/adult). The goal is to build a classification model that predicts whether a person's income is above 50K USD.

Variables:

income (target): >50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Some variables need to be one-hot encoded, but we need to be careful about how we proceed with this.

education: OrdinalEncoder, since someone who has received higher education is likely to have a higher earning potential in the employment market than someone who has not.
sex, race, native-country, marital-status, relationship, occupation, workclass: OneHotEncoder, since these are nominal categories with no natural order. These variables can be quite complex, as they play a huge role in someone's earning potential, especially for women and once intersectionality is taken into consideration.
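
The difference between the two encodings can be sketched on a made-up sample. The toy frame, the three-level ranking and the variable names below are illustrative assumptions, not the notebook's actual configuration. One detail worth knowing: when no category order is passed, OrdinalEncoder sorts the levels alphabetically, so the ranking has to be supplied explicitly if the codes are meant to reflect education level.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy frame; the real columns hold many more levels.
toy = pd.DataFrame(
    {
        "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
        "sex": ["Male", "Female", "Female", "Male"],
    }
)

# Assumed ranking, lowest to highest. Without `categories`, the encoder
# would sort alphabetically and place Bachelors before HS-grad.
education_order = ["HS-grad", "Bachelors", "Masters"]
ordinal = OrdinalEncoder(categories=[education_order])
education_codes = ordinal.fit_transform(toy[["education"]]).ravel()

# Nominal column: one indicator column per level, no implied order.
onehot = OneHotEncoder()
sex_indicators = onehot.fit_transform(toy[["sex"]]).toarray()

print(education_codes)
print(sex_indicators)
```

In the raw files each value carries a leading space (the missing-value marker is ' ?', for example), so a category list for the real data would have to match the strings exactly as they are read.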

In [525]:
! pip install pandas numpy scikit-learn pandas-profiling



In [526]:

import os

from typing import List

import numpy as np
import pandas as pd


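# "education-nom" and "education-occupation" hold the variables listed above as
# education-num and occupation; the cells below refer to them by these names.
colnames: List[str] = [
    "age", "workclass", "fnlwgt", "education", "education-nom",
    "marital-status", "education-occupation", "relationship", "race",
    "sex", "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "income",
]

# The raw files have no header row, so the column names are passed explicitly.
data_dir = os.path.join(os.getcwd(), "data")
train = pd.read_csv(os.path.join(data_dir, "adult.data"), names=colnames, header=None)
test = pd.read_csv(os.path.join(data_dir, "adult.test"), names=colnames, header=None)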

print(train.shape, test.shape)

(32561, 15) (16282, 15)
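
As an aside, the leading spaces and the '?' placeholders in these files can also be handled while reading. The sketch below uses made-up rows and only four of the columns, so it is an illustration of two read_csv options rather than a drop-in replacement for the cell above.

```python
import io

import pandas as pd

# Made-up rows in the same comma-plus-space layout as the raw files.
raw = io.StringIO(
    "39, State-gov, United-States, <=50K\n"
    "54, ?, United-States, >50K\n"
    "32, Private, ?, <=50K\n"
)

sample = pd.read_csv(
    raw,
    names=["age", "workclass", "native-country", "income"],
    header=None,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat the '?' placeholder as missing
)

# Number of missing values per column.
print(sample.isna().sum())
```

Read this way, the values arrive without the leading space, so later comparisons would use "?" and "Private" rather than " ?" and " Private".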


In [527]:
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States 

In [528]:
# Drop the first row of adult.test, which is not a data record.
test = test.iloc[1:]
print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
5  18           ?  103497.0   Some-college           10.0   

        marital-status education-occupation relationship    race      sex  \
1        Never-married    Machine-op-inspct    Own-child   Black     Male   
2   Married-civ-spouse      Farming-fishing      Husband   White     Male   
3   Married-civ-spouse      Protective-serv      Husband   White     Male   
4   Married-civ-spouse    Machine-op-inspct      Husband   Black     Male   
5        Never-married                    ?    Own-child   White   Female   

   capital-gain  capital-loss  hours-per-week  native-country   income  
1           0.0           0.0            40.0   United-States   <=50K.  
2           0.0         

In [529]:
train = train.drop_duplicates()
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0              40   United-States 

In [530]:
print(train.isin([" ?"]).any())

age                     False
workclass                True
fnlwgt                  False
education               False
education-nom           False
marital-status          False
education-occupation     True
relationship            False
race                    False
sex                     False
capital-gain            False
capital-loss            False
hours-per-week          False
native-country           True
income                  False
dtype: bool


In [531]:
# Mark the ' ?' placeholder as missing (np.nan; the np.NaN alias was removed in NumPy 2.0).
train = train.replace(' ?', np.nan)

In [532]:
print(train.isin([" ?"]).any())

age                     False
workclass               False
fnlwgt                  False
education               False
education-nom           False
marital-status          False
education-occupation    False
relationship            False
race                    False
sex                     False
capital-gain            False
capital-loss            False
hours-per-week          False
native-country          False
income                  False
dtype: bool


In [533]:
print(train.shape)

(32537, 15)


In [534]:
na_train = train[train.isnull().any(axis=1)]
train = train.dropna()

In [535]:
train.reset_index(drop=True, inplace=True)
y_train = train["income"]

del train["income"]

In [536]:
from pandas_profiling import ProfileReport

report: ProfileReport = ProfileReport(train, explorative=True)
report.to_notebook_iframe()


In [537]:
print(train.shape, y_train.shape)

(30139, 14) (30139,)


In [538]:
### Continuous variables: age, fnlwgt, education-nom, capital-gain, capital-loss and hours-per-week
### Categorical variables: workclass, marital-status, education-occupation, relationship, race, sex, native-country
### Ordinal variables: education

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

numeric_features: List[str] = ["age", "fnlwgt", "education-nom", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features: List[str] = ["workclass", "marital-status", "education-occupation", "relationship", "race", "sex", "native-country"]
ordinal_features: List[str] = ["education"]

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown="error")
ordinal_transformer = OrdinalEncoder(handle_unknown="error")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("ord", ordinal_transformer, ordinal_features),
    ]
)

In [539]:
from sklearn.linear_model import LogisticRegression

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

In [540]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

y_train = lb.fit_transform(y_train)


In [541]:
print(y_train)

[[0]
 [0]
 [0]
 ...
 [0]
 [0]
 [1]]
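
The printed result is a column vector with one row per label. Scikit-learn classifiers expect a one-dimensional target, and the column_or_1d warning in the outputs below is most likely a consequence of this shape. A minimal sketch on made-up labels (the names binarizer, column and flat are illustrative) shows how ravel() flattens it:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up labels with the same leading space as the raw training file.
labels = np.array([" <=50K", " >50K", " <=50K", " <=50K"])

binarizer = LabelBinarizer()
column = binarizer.fit_transform(labels)  # shape (4, 1)
flat = column.ravel()                     # shape (4,)

print(column.shape, flat.shape)
print(binarizer.classes_, flat)
```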


In [542]:
print(train.head())

   age          workclass  fnlwgt   education  education-nom  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status education-occupation    relationship    race      sex  \
0        Never-married         Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse      Exec-managerial         Husband   White     Male   
2             Divorced    Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse    Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse       Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  
0          2174             0              40   United-States  
1     

In [543]:

clf.fit(train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
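
The message above suggests raising max_iter. A minimal sketch of that option follows, on synthetic data standing in for the preprocessed training set; the value 1000 is an arbitrary assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        # max_iter defaults to 100; 1000 is an arbitrary, larger budget.
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
model.fit(X, y)

# Number of solver iterations actually used.
iterations_used = int(model.named_steps["classifier"].n_iter_[0])
print(iterations_used)
```

Comparing n_iter_ with max_iter after fitting is a quick way to check whether the solver stopped on its own or ran out of budget.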


In [544]:

print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
5  18           ?  103497.0   Some-college           10.0   

        marital-status education-occupation relationship    race      sex  \
1        Never-married    Machine-op-inspct    Own-child   Black     Male   
2   Married-civ-spouse      Farming-fishing      Husband   White     Male   
3   Married-civ-spouse      Protective-serv      Husband   White     Male   
4   Married-civ-spouse    Machine-op-inspct      Husband   Black     Male   
5        Never-married                    ?    Own-child   White   Female   

   capital-gain  capital-loss  hours-per-week  native-country   income  
1           0.0           0.0            40.0   United-States   <=50K.  
2           0.0         

In [545]:
# Mark the ' ?' placeholder as missing (np.nan; the np.NaN alias was removed in NumPy 2.0).
test = test.replace(' ?', np.nan)

na_test = test[test.isnull().any(axis=1)]
test = test.dropna()

print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
6  34     Private  198693.0           10th            6.0   

        marital-status education-occupation    relationship    race    sex  \
1        Never-married    Machine-op-inspct       Own-child   Black   Male   
2   Married-civ-spouse      Farming-fishing         Husband   White   Male   
3   Married-civ-spouse      Protective-serv         Husband   White   Male   
4   Married-civ-spouse    Machine-op-inspct         Husband   Black   Male   
6        Never-married        Other-service   Not-in-family   White   Male   

   capital-gain  capital-loss  hours-per-week  native-country   income  
1           0.0           0.0            40.0   United-States   <=50K.  
2           0.0   

In [546]:
test["income"] = lb.transform(test["income"])

print(test.head())


  age   workclass    fnlwgt      education  education-nom  \
1  25     Private  226802.0           11th            7.0   
2  38     Private   89814.0        HS-grad            9.0   
3  28   Local-gov  336951.0     Assoc-acdm           12.0   
4  44     Private  160323.0   Some-college           10.0   
6  34     Private  198693.0           10th            6.0   

        marital-status education-occupation    relationship    race    sex  \
1        Never-married    Machine-op-inspct       Own-child   Black   Male   
2   Married-civ-spouse      Farming-fishing         Husband   White   Male   
3   Married-civ-spouse      Protective-serv         Husband   White   Male   
4   Married-civ-spouse    Machine-op-inspct         Husband   Black   Male   
6        Never-married        Other-service   Not-in-family   White   Male   

   capital-gain  capital-loss  hours-per-week  native-country  income  
1           0.0           0.0            40.0   United-States       0  
2           0.0     
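
One thing to check at this step: the test labels printed earlier end with a period (<=50K.), whereas the binarizer was fitted on training labels without one. Labels the binarizer has not seen cannot be mapped to the right class, which would make the scores further down unreliable. The sketch below, on made-up labels, shows one way to detect the mismatch and normalise the test labels before transforming; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Made-up labels mimicking the two files: only the test labels end in '.'.
train_labels = pd.Series([" <=50K", " >50K", " <=50K"])
test_labels = pd.Series([" <=50K.", " >50K.", " >50K."])

binarizer = LabelBinarizer().fit(train_labels)

# Labels the binarizer was not fitted on.
unseen = set(test_labels) - set(binarizer.classes_)

# Strip the trailing period so both files share one label vocabulary.
cleaned = test_labels.str.rstrip(".")
still_unseen = set(cleaned) - set(binarizer.classes_)
encoded = binarizer.transform(cleaned).ravel()

print(unseen, still_unseen, encoded)
```

Applied to this notebook, the equivalent step would be to strip the period from test["income"] before calling lb.transform and then re-run the scoring cells.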

In [547]:
y_test = test["income"]
del test["income"]

train.reset_index(drop=True, inplace=True)

In [548]:
print("model score: %.3f" % clf.score(test, y_test))

model score: 0.794


In [549]:
### Now let's try several classifiers

from typing import Dict

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier

seed: int = 42
scores: Dict = {}

models: Dict = {
    "svc": SVC(gamma='auto', random_state=seed),
    "tree": DecisionTreeClassifier(random_state=seed),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(random_state=seed),
    "gb": GradientBoostingClassifier(random_state=seed),
    "ada": AdaBoostClassifier(random_state=seed),
    "gbm": GradientBoostingClassifier(random_state=seed),
    "logistic": LogisticRegression(random_state=seed),
    "sgd": SGDClassifier(random_state=seed),
}

for key, value in models.items():
    classifier = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", value)])
    models[key] = classifier.fit(train, y_train)
    scores[key] = models[key].score(test, y_test)
 

  y = column_or_1d(y, warn=True)
  return self._fit(X, y)
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


In [550]:
for key, value in scores.items():
    print(f"{key} - model score: {value: .5f}")

svc - model score:  0.81235
tree - model score:  0.74495
knn - model score:  0.77663
rf - model score:  0.79004
gb - model score:  0.81089
ada - model score:  0.79482
gbm - model score:  0.81089
logistic - model score:  0.79449
sgd - model score:  0.72357
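
Accuracy figures like these are easier to read next to a majority-class baseline, since the two income classes are not equally frequent. A minimal sketch with DummyClassifier follows; the labels are made up and the 3:1 ratio is an assumption for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with three negatives for every positive.
y = np.array([0, 0, 0, 1] * 25)
X = np.zeros((len(y), 1))  # the baseline ignores the features

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

A model is only informative to the extent that it beats this number on the same data.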


In [551]:
test.reset_index(drop=True, inplace=True)
print(test.head())

  age   workclass    fnlwgt      education  education-nom  \
0  25     Private  226802.0           11th            7.0   
1  38     Private   89814.0        HS-grad            9.0   
2  28   Local-gov  336951.0     Assoc-acdm           12.0   
3  44     Private  160323.0   Some-college           10.0   
4  34     Private  198693.0           10th            6.0   

        marital-status education-occupation    relationship    race    sex  \
0        Never-married    Machine-op-inspct       Own-child   Black   Male   
1   Married-civ-spouse      Farming-fishing         Husband   White   Male   
2   Married-civ-spouse      Protective-serv         Husband   White   Male   
3   Married-civ-spouse    Machine-op-inspct         Husband   Black   Male   
4        Never-married        Other-service   Not-in-family   White   Male   

   capital-gain  capital-loss  hours-per-week  native-country  
0           0.0           0.0            40.0   United-States  
1           0.0           0.0       

In [556]:
test = test.to_numpy()
y_test = y_test.to_numpy()

# print(test[:10])

print(type(test), type(y_test))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [562]:
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import precision_score, recall_score, f1_score, balanced_accuracy_score, roc_auc_score, accuracy_score
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

skfolds: StratifiedKFold = StratifiedKFold(n_splits=10, random_state=seed, shuffle=True)

# Feature column names, needed to rebuild a DataFrame from the NumPy array.
columnnames: List[str] = [
    "age", "workclass", "fnlwgt", "education", "education-nom",
    "marital-status", "education-occupation", "relationship", "race",
    "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country",
]

# split() yields (remaining indices, held-out indices) for each fold. Features
# and labels must be selected with the same index, here the held-out one.
for _, fold_index in skfolds.split(test, y_test):
    tmp_test = pd.DataFrame(test[fold_index], columns=columnnames)
    y_true = y_test[fold_index]

    for key, value in models.items():
        y_pred = value.predict(tmp_test)

        # With micro averaging on single-label data, precision, recall and F1 all equal accuracy.
        precision = precision_score(y_true, y_pred, average="micro")
        recall = recall_score(y_true, y_pred, average="micro")
        f1 = f1_score(y_true, y_pred, average="micro")
        accuracy = accuracy_score(y_true, y_pred)

        print(f"Model: {key} - Precision: {precision: .3f} - Recall: {recall: .3f} - F1 Score: {f1: .3f} - Accuracy: {accuracy: .3f}")





KeyboardInterrupt: 
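
For reference, StratifiedKFold.split returns two index arrays per fold: the rows outside the fold and the rows held out. A small sketch on made-up data (20 rows, 5 folds, names chosen for illustration) shows how the two arrays relate and that features and labels for a fold are selected with the same index.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 20 rows, two balanced classes.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for train_index, test_index in folds.split(X, y):
    # The two index arrays are disjoint and together cover every row.
    assert set(train_index).isdisjoint(test_index)
    fold_sizes.append((len(train_index), len(test_index)))

    # Features and labels for a fold are selected with the same index.
    X_fold, y_fold = X[test_index], y[test_index]
    assert len(X_fold) == len(y_fold)

print(fold_sizes)
```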