# Adult UCI dataset

Classification with preprocessing and pipeline

In this example we will use the [Adult UCI dataset](https://archive.ics.uci.edu/dataset/2/adult).

The target is the `income` column, which specifies whether the individual's income is *low* (at most 50K) or *high* (above 50K).

In this example we will simply train a classifier and evaluate it on a hold-out test set, with no hyperparameter optimisation. Students are invited to extend the exercise with a `GridSearchCV` optimisation and several classifiers.

## Technical aspects
In `sklearn` a `Pipeline` is a sequence of operations such that the output of one operation is the input to the next. Each operation except the last is a transformation of the data; the last one can be an estimator.

Examples of those operations are:
- column transformations:
  - encodings:
    - `OneHotEncoder` to transform a categorical column into a set of `0-1` columns
    - `OrdinalEncoder` to transform an ordinal column into a numeric one
  - imputation of null values, such as `SimpleImputer`, which requires a **strategy**: for example, nulls in numeric columns can be filled with the `median`, and nulls in categorical columns can be filled with a constant value such as `unknown`
  - numeric transformations, such as:
    - `MinMaxScaler`
    - `StandardScaler`
- estimators
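
As a minimal sketch of how operations are chained, the snippet below imputes and then scales a single toy column. The data and the names `ages` and `toy_pipeline` are illustrative assumptions and are not part of the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric column with one missing value (illustrative data, not from the dataset)
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

toy_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),  # the NaN becomes the median, 30
        ("scaler", StandardScaler()),  # then the column is centred and scaled
    ]
)

# The output of the imputer is passed to the scaler automatically
scaled = toy_pipeline.fit_transform(ages)
print(scaled.ravel())
```

The imputed value equals the column median, so after scaling it sits at zero while the column as a whole has zero mean.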

In `sklearn` the `ColumnTransformer` is available to transform groups of columns in a uniform way. It requires, for each group, a transformation pipeline and the list of columns to be transformed with that pipeline.

We will then build a final pipeline composed of the preprocessing step, done with the `ColumnTransformer`, and the classifier.

## Workflow
1. load the file `adults.csv` into a dataframe and explore it
1. use as target the column `income`, separate the predicting columns and the target into `X` and `y`
1. show the percentage of null values in each column
1. prepare the variable `categorical_features` containing all the names of the categorical features of `X` excluding `education` (it will be ignored) and `sex` (it will be transformed separately);
1. prepare the variable `numeric_features` containing all the names of the numeric features excluding `fnlwgt` (this is the *importance* of each observation, it can be used in the final classification report as `sample_weight`)
1. the target should be binary; inspecting the unique values you will see that there are four distinct values due to typing errors, so reduce the target to binary with an appropriate mapping (hint: you can use the `.map()` function of Pandas)
1. split into train and test
1. prepare the column transformers for:
  1. numeric: simple imputation with the median and standard scaling
  1. categorical: simple imputation with nulls substituted by `unknown`, followed by one-hot encoding
  1. boolean (the `sex` column): ordinal encoding
1. prepare a pipeline with:
  1. the column transformer as preprocessor
  1. the `DecisionTreeClassifier`
1. fit the pipeline to the train part
1. predict the test and produce a classification report

In [2]:
import pandas as pd

df = pd.read_csv('../data/adult/adults.csv', sep=',', index_col=False)
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [3]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [4]:
categorical_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

for column in categorical_columns:
    df[column] = df[column].astype('category')

In [5]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

random_state = 346
np.random.seed(random_state)

In [6]:
# show the percentage of missing values in each column
df.isna().sum() / len(df)

age               0.000000
workclass         0.019717
fnlwgt            0.000000
education         0.000000
education-num     0.000000
marital-status    0.000000
occupation        0.019778
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    0.005610
income            0.000000
dtype: float64

In [7]:
df['workclass'] = df['workclass'].fillna(df['workclass'].value_counts().index[0])
df['occupation'] = df['occupation'].fillna(df['occupation'].value_counts().index[0])
df['native-country'] = df['native-country'].fillna(df['native-country'].value_counts().index[0])

In [8]:
target = 'income'

X = df.drop(target, axis=1)
y = pd.Series(df[target])

display(X.head())
display(y.head())
display(X.isna().sum())
display(y.isna().sum())

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: category
Categories (4, object): ['<=50K', '<=50K.', '>50K', '>50K.']

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
dtype: int64

0

In [9]:
categorical_features = X.select_dtypes(exclude=["number","bool_","object_"]).columns.drop(['education', 'sex'])
categorical_features

Index(['workclass', 'marital-status', 'occupation', 'relationship', 'race',
       'native-country'],
      dtype='object')

In [10]:
numeric_features = X.select_dtypes(include=np.number).columns.drop(['fnlwgt'])
numeric_features

Index(['age', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')

In [11]:
display(y.value_counts())
y = y.map({"<=50K": 0, "<=50K.": 0, ">50K": 1, ">50K.": 1})
y.value_counts()

income
<=50K     24720
<=50K.    12435
>50K       7841
>50K.      3846
Name: count, dtype: int64

income
0    37155
1    11687
Name: count, dtype: int64
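
An alternative to the explicit mapping, shown here only as a sketch on a toy series, is to strip the trailing period and compare against the *high* label. The series `raw_income` is a hand-made assumption that mimics the four values observed above.

```python
import pandas as pd

# Hand-made series with the four spellings found in the target
raw_income = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."], name="income")

# Remove the trailing "." so that only two distinct labels remain
cleaned = raw_income.str.rstrip(".")

# 1 for the high-income class, 0 otherwise
binary = (cleaned == ">50K").astype(int)
print(binary.tolist())  # [0, 0, 1, 1]
```

The explicit `.map()` has the advantage of turning any unexpected label into a null, which makes further typing errors easy to spot.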

### Here you should prepare the numeric, boolean and categorical transformers

`your_transformer = Pipeline(
  steps = [
              ("step name", transformer )
            , ("step name", transformer)
          ]
)`

1. prepare the column transformers for:
  1. numeric: simple imputation with the median and standard scaling
  1. categorical: simple imputation with nulls substituted by `unknown`, followed by one-hot encoding
  1. boolean (the `sex` column): ordinal encoding

In [12]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

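categorical_transformer = Pipeline(
    steps=[
        # nulls are replaced by the constant `unknown`, as stated in the workflow
        ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
        # categories unseen during fit are encoded as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)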

boolean_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("ordinal", OrdinalEncoder()),
    ]
)
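
As a rough illustration of what the ordinal step does to a two-valued column such as `sex`, the sketch below encodes a toy array. The array `sex_column` is an assumption for illustration; `OrdinalEncoder` sorts the categories, so `Female` becomes 0 and `Male` becomes 1.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy two-valued column (illustrative values only)
sex_column = np.array([["Male"], ["Female"], ["Male"]])

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(sex_column)

print(encoder.categories_)  # categories in sorted order
print(encoded.ravel())  # one numeric code per row
```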

### Here you collect the transformers in the `ColumnTransformer`

In [13]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        # the boolean transformer is applied to the `sex` column, not to the target
        ("bool", boolean_transformer, ["sex"]),
    ],
    # always return a dense array rather than a sparse matrix
    sparse_threshold=0,
)

### Final operations

- prepare the final pipeline composed by the preprocessor and the classifier
- do the train/test split
- fit the pipeline to the train part
- predict the test part
- produce the classification report

In [14]:
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", DecisionTreeClassifier(random_state=random_state)),
    ]
)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=random_state, stratify=y)

display(X_train.shape)
display(X_test.shape)

(32724, 14)

(16118, 14)
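
The effect of the `stratify` argument can be checked with a small sketch on synthetic labels. The arrays `labels` and `features` are assumptions made up for illustration, with 25% of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 75 negatives and 25 positives
labels = np.array([0] * 75 + [1] * 25)
features = np.arange(100).reshape(-1, 1)

_, _, lab_train, lab_test = train_test_split(
    features, labels, train_size=0.6, random_state=0, stratify=labels
)

# With stratification the share of positives is preserved in both parts
print(lab_train.mean(), lab_test.mean())
```

Without `stratify` the two proportions would generally differ, which matters for an unbalanced target such as `income`.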

In [16]:
fitted_pipeline = clf.fit(X_train, y_train)

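In the classification report use the parameter `sample_weight`, passing the column `fnlwgt` of the test set.

In [None]:
from sklearn.metrics import classification_report

# predict the test part, then weight each test observation by its `fnlwgt`
y_pred = fitted_pipeline.predict(X_test)
print(classification_report(y_test, y_pred, sample_weight=X_test["fnlwgt"]))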
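
The introduction suggests extending the exercise with `GridSearchCV`. A possible starting point is sketched below on synthetic data; the dataset from `make_classification`, the parameter grid and the names `demo_pipeline` and `search` are all assumptions chosen for illustration, not tuned values for the Adult data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary classification dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

demo_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", DecisionTreeClassifier(random_state=0)),
    ]
)

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {
    "classifier__max_depth": [2, 4, 8],
    "classifier__criterion": ["gini", "entropy"],
}

# 5-fold cross-validation on the training part only
search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the refitted best model on the test part
```

In the exercise the same pattern would apply to `clf`, with the grid keys prefixed by `classifier__` and the search fitted on `X_train` and `y_train`.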