# Adult UCI dataset

Classification with preprocessing and pipeline

In this example we will use the [Adult UCI dataset](https://archive.ics.uci.edu/dataset/2/adult).

The target is the binary `income` column, which specifies whether an individual's income is *low* or *high*.

In this example we will simply perform the classification with a hold-out test set and no hyperparameter optimisation. Students are invited to extend the exercise with a `GridSearchCV` optimisation and several classifiers.

## Technical aspects
In `sklearn` a `Pipeline` is a sequence of operations in which the output of each operation is the input to the next. Every operation except the last is a transformation of the data, while the last one can be an estimator.

Examples of those operations are:
- column transformations:
  - encodings:
    - `OneHotEncoder` to transform a categorical column into a set of `0-1` columns
    - `OrdinalEncoder` to transform an ordinal column into a numeric one
  - imputation of null values, such as `SimpleImputer`, which requires a **strategy**: for example, nulls in numeric columns can be filled with the `median`, while nulls in categorical columns can be filled with a constant value such as `unknown`
  - numeric transformations, such as:
    - `MinMaxScaler`
    - `StandardScaler`
- estimators
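
The operations listed above can be tried one at a time on a small table. The following is a minimal sketch on hypothetical toy data (the column names `colour`, `size` and `height` are assumptions made for illustration and do not belong to the Adult dataset).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    MinMaxScaler, OneHotEncoder, OrdinalEncoder, StandardScaler
)

# Hypothetical toy data, not taken from the Adult dataset
toy = pd.DataFrame({
    "colour": pd.Series(["red", "green", np.nan, "red"], dtype=object),
    "size": pd.Series(["small", "large", "medium", "small"], dtype=object),
    "height": [1.0, np.nan, 3.0, 5.0],
})

# Imputation: the median for numbers, a constant label for categoricals
height = SimpleImputer(strategy="median").fit_transform(toy[["height"]])
colour = SimpleImputer(strategy="constant", fill_value="unknown").fit_transform(
    toy[["colour"]]
)

# One-hot encoding: one 0-1 column for each distinct category
one_hot = OneHotEncoder().fit_transform(colour).toarray()

# Ordinal encoding: the order of the categories is given explicitly
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(
    toy[["size"]]
)

# Numeric transformations applied to the imputed column
min_max = MinMaxScaler().fit_transform(height)     # rescaled to [0, 1]
standard = StandardScaler().fit_transform(height)  # zero mean, unit variance

print(one_hot)
print(ordinal.ravel())
print(min_max.ravel())
```

Each transformer follows the same `fit` / `transform` interface, which is what allows them to be chained in a `Pipeline`.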

`sklearn` provides the `ColumnTransformer` to transform groups of columns in a uniform way. For each group it requires a transformation pipeline and the list of columns to be transformed with that pipeline.

We will then build a final pipeline composed of the preprocessing step, implemented with the `ColumnTransformer`, and the classifier.
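
As a rough illustration of this structure, here is a minimal sketch that combines a `ColumnTransformer` and a classifier in a single pipeline. The toy columns `hours` and `job`, the labels, and the variable names are assumptions made for the example. The one-hot encoding step is also an assumption, added because the decision tree needs numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with one numeric and one categorical column
toy_X = pd.DataFrame({
    "hours": [40.0, 50.0, np.nan, 20.0, 60.0, 35.0],
    "job": pd.Series(
        ["clerk", "manager", "clerk", np.nan, "manager", "clerk"], dtype=object
    ),
})
toy_y = pd.Series(["low", "high", "low", "low", "high", "low"])

# One small pipeline for each group of columns
numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# The ColumnTransformer routes each list of columns to its own pipeline
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["hours"]),
    ("cat", categorical_steps, ["job"]),
])

# Final pipeline: preprocessing followed by the classifier
toy_model = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
toy_model.fit(toy_X, toy_y)
toy_predictions = toy_model.predict(toy_X)
print(toy_predictions)
```

Because the preprocessing lives inside the pipeline, it is fitted only on the data passed to `fit`, and the same fitted transformations are reused by `predict`.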

## Workflow
1. load the file `adults.csv` into a dataframe and explore it
1. use the column `income` as target and separate the predicting columns and the target into `X` and `y`
1. show the percentage of null values in each column
1. prepare the variable `categorical_features` containing the names of all the categorical features of `X`, excluding `education` (it will be ignored) and `sex` (it will be transformed separately)
1. prepare the variable `numeric_features` containing the names of all the numeric features, excluding `fnlwgt` (this is the *importance* of each observation; it can be used in the final classification report as `sample_weight`)
1. the target should be binary, but inspecting its unique values shows four distinct values due to typing errors; reduce the target to binary with an appropriate mapping (hint: you can use the `.map()` function of Pandas)
1. split into train and test
1. prepare the column transformers for:
   1. numeric: simple imputation with the median and standard scaling
   1. categorical: simple imputation with nulls substituted by `unknown`
   1. boolean: ordinal encoding
1. prepare a pipeline with:
   1. the column transformer as preprocessor
   1. the `DecisionTreeClassifier`
1. fit the pipeline to the train part
1. predict the test and produce a classification report

In [1]:
import pandas as pd
df_url = 'adults.csv'
df = pd.read_csv(df_url)

In [2]:
df.describe(include='all')

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48842.0 | 47879 | 48842.0 | 48842 | 48842.0 | 48842 | 47876 | 48842 | 48842 | 48842 | 48842.0 | 48842.0 | 48842.0 | 48568 | 48842 |
| unique | | 9 | | 16 | | 7 | 15 | 6 | 5 | 2 | | | | 42 | 4 |
| top | | Private | | HS-grad | | Married-civ-spouse | Prof-specialty | Husband | White | Male | | | | United-States | <=50K |
| freq | | 33906 | | 15784 | | 22379 | 6172 | 19716 | 41762 | 32650 | | | | 43832 | 24720 |
| mean | 38.643585 | | 189664.1 | | 10.078089 | | | | | | 1079.067626 | 87.502314 | 40.422382 | | |
| std | 13.71051 | | 105604.0 | | 2.570973 | | | | | | 7452.019058 | 403.004552 | 12.391444 | | |
| min | 17.0 | | 12285.0 | | 1.0 | | | | | | 0.0 | 0.0 | 1.0 | | |
| 25% | 28.0 | | 117550.5 | | 9.0 | | | | | | 0.0 | 0.0 | 40.0 | | |
| 50% | 37.0 | | 178144.5 | | 10.0 | | | | | | 0.0 | 0.0 | 40.0 | | |
| 75% | 48.0 | | 237642.0 | | 12.0 | | | | | | 0.0 | 0.0 | 45.0 | | |


In [3]:
target = "income"
X = df.drop(target, axis = 1)
y = df[target]

In [4]:
X.isna().sum()/X.shape[0]*100

age               0.000000
workclass         1.971664
fnlwgt            0.000000
education         0.000000
education-num     0.000000
marital-status    0.000000
occupation        1.977806
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    0.560993
dtype: float64

In [5]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder

random_state = 346
np.random.seed(random_state)

In [6]:
categorical_features = X.select_dtypes(include='object').columns.tolist()
categorical_features.remove('sex')
categorical_features.remove('education')
categorical_features

['workclass',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'native-country']

In [7]:
numeric_features = X.select_dtypes(include='number').columns.tolist()
numeric_features.remove('fnlwgt')
numeric_features

['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

In [8]:
income_map = {'<=50K.':'<=50K', '>50K.': '>50K', '<=50K':'<=50K', '>50K': '>50K'}
y = y.map(income_map)
y.value_counts()

<=50K    37155
>50K     11687
Name: income, dtype: int64
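
One caveat of `.map()` with a dictionary is that any value missing from the dictionary silently becomes `NaN`. The sketch below, on a hypothetical toy series reproducing the four spellings, shows a quick check for that and an alternative normalisation based on stripping the trailing dot (the names `toy_income`, `cleaned` and `mapped` are assumptions made for the example).

```python
import pandas as pd

# Hypothetical toy target reproducing the four spellings found in `income`
toy_income = pd.Series(["<=50K", ">50K.", "<=50K.", ">50K", "<=50K"], dtype=object)

# Alternative: stripping a trailing dot collapses four spellings into two
cleaned = toy_income.str.rstrip(".")

# Dictionary mapping, as in the cell above
income_map = {"<=50K.": "<=50K", ">50K.": ">50K", "<=50K": "<=50K", ">50K": ">50K"}
mapped = toy_income.map(income_map)

# Values absent from the dictionary would become NaN, so count them
print("unmapped values:", mapped.isna().sum())
print(sorted(cleaned.unique()))
```

On this toy series the two approaches give the same result. The dictionary is more explicit, while the string method needs no update if other columns or files use the same convention.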

### Here you should prepare the numeric, boolean and categorical transformers

`your_transformer = Pipeline(
  steps = [
              ("step name", transformer )
            , ("step name", transformer)
          ]
)`

In [9]:
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median"))
        , ("scaler", StandardScaler())]
)

boolean_transformer = OrdinalEncoder()

categorical_transformer = ..........
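
The placeholder above is left for the reader to fill in. One possible completion is sketched here, as an assumption rather than the intended solution: it follows the workflow (constant imputation with `unknown`) and adds a one-hot encoding step, which is not listed in the workflow but is assumed here because the decision tree cannot handle string values. The toy frame and the variable `encoded` are hypothetical and only check that the transformer runs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Possible categorical transformer (an assumption, see the text above)
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="unknown"))
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros
        , ("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

# Hypothetical toy column, used only to check that the transformer runs
toy = pd.DataFrame(
    {"workclass": pd.Series(["Private", np.nan, "State-gov"], dtype=object)}
)
encoded = categorical_transformer.fit_transform(toy).toarray()
print(encoded)
```

The null value gets its own `unknown` category, so it ends up with a dedicated 0-1 column like any other value.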

### Here you collect the transformers in the `ColumnTransformer`

In [10]:
# preprocessor = ColumnTransformer(
#     transformers=[
#           ("transformer name", transformer, list_of_features)
#         , ("transformer name", transformer, list_of_features)
#         , ("transformer name", transformer, list_of_features)
#     ]
#     , sparse_threshold=0 # this forces a dense output instead of a sparse matrix
#                          # (the sparse representation is meant to speed up operations)
# )

In [11]:
preprocessor = ColumnTransformer(
    transformers=[
          ("num", numeric_transformer, numeric_features)
        , ("cat", categorical_transformer, categorical_features)
        , ("bool", boolean_transformer, ['sex'])
    ]
    , sparse_threshold=0
)
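
The effect of `sparse_threshold=0` can be seen on a small example. This is a minimal sketch on a hypothetical toy column with many distinct categories, so that the one-hot output is mostly zeros (the names `toy`, `default_ct` and `dense_ct` are assumptions made for the example).

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with ten distinct categories
toy = pd.DataFrame({"country": [f"country_{i}" for i in range(10)]}, dtype=object)

# Default threshold: a low-density output is kept as a sparse matrix
default_ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["country"])]
)
# sparse_threshold=0: the output is always a dense NumPy array
dense_ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(), ["country"])]
    , sparse_threshold=0
)

default_out = default_ct.fit_transform(toy)
dense_out = dense_ct.fit_transform(toy)

print(sparse.issparse(default_out), type(dense_out).__name__)
```

A dense array is easier to inspect, at the cost of more memory when there are many one-hot columns.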

### Final operations

- prepare the final pipeline composed by the preprocessor and the classifier
- do the train/test split
- fit the pipeline to the train part
- predict the test part
- produce the classification report

In [12]:
# clf = Pipeline(
#     steps=[
#           ("step name", component)
#         , ("step name", component)
#         ]
# )

In [15]:
clf = Pipeline(
    steps=[
          ("preprocessor", preprocessor)
        , ("classifier", DecisionTreeClassifier(random_state=random_state))
        ]
)
# from sklearn.pipeline import make_pipeline
# clf = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=random_state))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

clf.fit(X_train, y_train);

In the classification report, use the parameter `sample_weight`, passing the column `fnlwgt`.
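
To see what `sample_weight` changes, here is a minimal sketch on four hypothetical observations, where the last one counts seven times as much as each of the others (labels and weights are invented for the example).

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels and predictions: only the second prediction is wrong
y_true = np.array(["<=50K", "<=50K", ">50K", ">50K"])
y_pred = np.array(["<=50K", ">50K", ">50K", ">50K"])

# Hypothetical weights: the last observation represents seven individuals
weights = np.array([1.0, 1.0, 1.0, 7.0])

# Unweighted: 3 correct rows out of 4
plain_accuracy = accuracy_score(y_true, y_pred)
# Weighted: correct weight 1 + 1 + 7 = 9 out of a total weight of 10
weighted_accuracy = accuracy_score(y_true, y_pred, sample_weight=weights)

print(plain_accuracy, weighted_accuracy)
print(classification_report(y_true, y_pred, sample_weight=weights))
```

With weights, the `support` column of the report contains the sum of the weights of each class instead of the number of rows.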

In [14]:
from sklearn.metrics import classification_report

y_test_p = clf.predict(X_test)
# weight each test observation by its `fnlwgt` value
print(classification_report(y_test, y_test_p, sample_weight=X_test['fnlwgt']))

              precision    recall  f1-score   support

       <=50K       0.88      0.89      0.89 1404639602.0
        >50K       0.64      0.60      0.62 443124458.0

    accuracy                           0.82 1847764060.0
   macro avg       0.76      0.75      0.75 1847764060.0
weighted avg       0.82      0.82      0.82 1847764060.0
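
The extension suggested in the introduction, a `GridSearchCV` optimisation, could be sketched as follows. To keep the example self-contained it uses synthetic data from `make_classification` and a simplified pipeline. The parameter grid, the scoring choice and all the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed Adult features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "classifier__max_depth": [2, 4, 8],
    "classifier__min_samples_leaf": [1, 5, 20],
}

# 5-fold cross validation on the train part only
search = GridSearchCV(demo_clf, param_grid, cv=5, scoring="f1_macro")
search.fit(X_tr, y_tr)

print(search.best_params_)
# Score of the best pipeline, refitted on the whole train part
test_score = search.score(X_te, y_te)
print(test_score)
```

With the pipeline `clf` of this notebook the same naming applies, for example `classifier__max_depth`, since the last step is named `classifier`.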

