# About

Welcome to this notebook where we will explore the Census Adult Income dataset, a rich source of socio-economic information derived from the 1994 U.S. Census Bureau database. Our focus in this analysis is on employing a robust Python data discretization pipeline: **Discretizer**. This versatile tool is able to discretize various types of data, both quantitative and qualitative, making it an ideal companion for preprocessing tasks.

Data discretization is a crucial step in preparing datasets for machine learning models. It involves transforming continuous or categorical variables into discrete bins, allowing for improved interpretability, handling non-linearity, and addressing potential outliers. The **Discretizer** we will employ is designed to handle a diverse range of data types, including quantitative variables such as age and education level, as well as qualitative variables like marital status and occupation.
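As a minimal illustration of the idea, using plain pandas rather than the **Discretizer** itself and a made-up ``ages`` series, the sketch below bins a continuous variable into equal-width and equal-frequency buckets:

```python
import pandas as pd

# hypothetical toy data, not taken from the Census dataset
ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67], name="age")

# equal-width bins: each interval spans the same range of values
equal_width = pd.cut(ages, bins=4)

# equal-frequency bins (quantiles): each interval holds about the same number of rows
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```

Equal-frequency binning is the variant closest to the quantile discretization described later in this notebook.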

Throughout this notebook, we will delve into the intricacies of the Census Adult Income dataset, exploring the distribution of features and employing the Discretizer to discretize the data effectively. By the end of this preprocessing journey, we aim to create a dataset that is well-suited for subsequent machine learning tasks, enabling the development of robust models for predicting income levels based on socio-economic attributes.

Let's embark on this exploration and witness the power of **Discretizer** in preparing our data for insightful analysis and accurate modeling.

# Setting things up
## Installation

In [None]:
%pip install autocarver

## Census Data

In this example notebook, we will use the Census dataset.

The Census Adult Income dataset, commonly referred to as the "Adult" dataset, is a well-known dataset in the realm of machine learning and data analysis. It is frequently used for tasks related to classification and predictive modeling. The dataset is derived from the 1994 U.S. Census Bureau database and contains a diverse set of features that aim to predict whether an individual earns more than $50,000 annually, making it a binary classification problem.

The features in the Adult dataset include demographic information such as age, education level, marital status, occupation, and work-related details like hours worked per week. The primary objective when working with this dataset is typically to build a predictive model capable of discerning between individuals with annual incomes above and below the $50,000 threshold.

In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
adult_data = adult.data.features 
adult_data = adult_data.join(adult.data.targets)
  
# Display the first few rows of the dataset
adult_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Target type

In [5]:
target = "income"

# cleaning target
adult_data[target] = adult_data[target].apply(lambda u: u.replace(".", ""))

adult_data[target].value_counts(dropna=False)

income
<=50K    37155
>50K     11687
Name: count, dtype: int64

The target ``"income"`` is a binary target of type ``str`` used in a classification task.
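Because the target is stored as text, any step that expects a numeric binary target needs it encoded as 0/1. A minimal sketch on made-up labels (the ``toy_income`` series is hypothetical), combining the period cleanup used above with that encoding:

```python
import pandas as pd

# hypothetical labels mimicking the raw target, including trailing periods
toy_income = pd.Series(["<=50K", ">50K.", "<=50K.", ">50K"], name="income")

# remove the trailing period (regex=False treats "." as a literal character)
cleaned = toy_income.str.replace(".", "", regex=False)

# encode the positive class (">50K") as 1 and the other class as 0
binary_target = (cleaned == ">50K").astype(int)
print(binary_target.tolist())
```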

## Data Sampling

In [13]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(adult_data, test_size=0.20, random_state=42, stratify=adult_data[target])

# checking target rate per dataset
train_set[target].value_counts(normalize=True), dev_set[target].value_counts(normalize=True)


(income
 <=50K    0.76073
 >50K     0.23927
 Name: proportion, dtype: float64,
 income
 <=50K    0.760672
 >50K     0.239328
 Name: proportion, dtype: float64)

## Picking columns to discretize

In [14]:
train_set.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
34495,37,Private,193106,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,30,United-States,<=50K
18591,56,Self-emp-inc,216636,12th,8,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1651,40,United-States,<=50K
12562,53,Private,126977,HS-grad,9,Separated,Craft-repair,Not-in-family,White,Male,0,0,35,United-States,<=50K
552,72,Private,205343,11th,7,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
3479,46,State-gov,106705,Masters,14,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,38,United-States,<=50K


In [23]:
n = ""
for f in train_set.groupby("education-num")["education"].apply(lambda u: f'"{u.unique()[0]}", '):
    n += f
print(n)

"Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Prof-school", "Doctorate", 


In [16]:
# column data types
train_set.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [19]:
n = ''
for f in train_set:
    n += f'"{f}", '
print(n)

"age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income", 


* ``"education"`` is the only qualitative ordinal feature. It will be added to the list of ``ordinal_features``, and its ordering has to be set in ``values_orders``.

* ``"sex"``, ``"marital-status"``, ``"occupation"``, ``"relationship"``, ``"race"``, ``"native-country"`` and ``"workclass"`` are qualitative categorical features. These features will be added to the list of ``qualitative_features``.

* ``"capital-gain"`` and ``"capital-loss"`` are quantitative continuous features, whilst ``"education-num"``, ``"age"`` and ``"hours-per-week"`` can be considered quantitative discrete features. These features will be added to the list of ``quantitative_features``.

* ``"fnlwgt"`` is the weighting column. It is not currently usable in **AutoCarver**.
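The column dtypes can give a first draft of these lists. The sketch below uses a made-up three-row frame (``toy`` is hypothetical) and only proposes candidates. Dtypes cannot tell an ordinal feature from a categorical one, nor spot a weighting column, so the final lists still need a manual review:

```python
import pandas as pd

# hypothetical three-row frame reusing a few of the Census column names
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "capital-gain": [2174, 0, 0],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "income": ["<=50K", "<=50K", ">50K"],
})
toy_target = "income"

# numeric dtypes are candidate quantitative features
numeric_candidates = toy.select_dtypes(include="number").columns.tolist()

# non-numeric dtypes are candidate qualitative features (the target is set aside)
text_candidates = [
    column
    for column in toy.select_dtypes(exclude="number").columns
    if column != toy_target
]

print(numeric_candidates)
print(text_candidates)
```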

In [24]:
# lists of features per data type
quantitative_features = ["capital-gain", "capital-loss", "education-num", "hours-per-week", "age"]
qualitative_features = ["sex", "workclass", "marital-status", "occupation", "relationship", "race", "native-country", "income"]
ordinal_features = ["education"]
weighting = ["fnlwgt"]

# user-specified ordering for ordinal features
values_orders = {
    "education": ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Prof-school", "Doctorate"]
}
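Before fitting, it can help to check that a user-specified ordering covers every value actually observed. The sketch below does this on a made-up column with a shortened ordering (both ``education_order`` and ``toy_education`` are hypothetical) and shows how an ordered pandas categorical makes the ranking explicit:

```python
import pandas as pd

# hypothetical subset of the ordering and a toy column
education_order = ["HS-grad", "Some-college", "Bachelors", "Masters"]
toy_education = pd.Series(["Bachelors", "HS-grad", "Masters", "HS-grad"])

# any observed value missing from the ordering would need to be added to it
missing = set(toy_education.unique()) - set(education_order)
print(missing)

# an ordered categorical exposes each value's rank in the ordering
ranked = pd.Categorical(toy_education, categories=education_order, ordered=True)
print(ranked.codes.tolist())
```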

# Using Discretizer
## Discretizer Settings
### Representativeness of modalities

The attribute ``min_freq`` sets the minimum frequency of each basic modality. It is used by **Discretizers**:

- For quantitative features, it defines the number of quantiles used to initially discretize the features.

- For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality.
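To give a feel for the quantitative case, the sketch below assumes that the number of quantiles is roughly ``1 / min_freq``, so that each initial bin holds about ``min_freq`` of the rows. That relationship is an assumption made for illustration, not a statement about the library's internals, and the ``values`` series is synthetic:

```python
import numpy as np
import pandas as pd

min_freq = 0.02

# assumption: about 1 / min_freq quantiles, i.e. each bin holds ~min_freq of the rows
n_quantiles = int(round(1 / min_freq))
print(n_quantiles)

# hypothetical continuous feature
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=10_000))

# quantile discretization with plain pandas
binned = pd.qcut(values, q=n_quantiles)
frequencies = binned.value_counts(normalize=True)
print(frequencies.min(), frequencies.max())
```

Under this assumption, a smaller ``min_freq`` yields more and narrower starting bins.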

In [25]:
min_freq = 0.02

**Tip:** ``min_freq`` should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust).

## Fitting Discretizer

* First, all qualitative features are discretized:
    1. Using ``StringDiscretizer`` to convert them to ``str`` if not already the case
    2. For qualitative ordinal features: using ``OrdinalDiscretizer`` to group each under-represented value (less frequent than ``min_freq=0.02``) with its closest modality
    3. For qualitative categorical features: using ``CategoricalDiscretizer`` to group under-represented values (less frequent than ``min_freq=0.02``) into a default value (``str_default="__OTHER__"``)

* Second, all quantitative features are discretized:
    1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq=0.02``)
    2. Using ``OrdinalDiscretizer`` to group any remaining under-represented values (less frequent than ``min_freq/2=0.01``) with their closest modality
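The grouping of rare categorical values can be pictured with plain pandas. This is a sketch of the idea only, not the ``CategoricalDiscretizer`` implementation. The ``toy`` series is made up, and ``min_freq`` is set deliberately high so the effect shows on a tiny sample:

```python
import pandas as pd

min_freq = 0.2  # deliberately large so the effect is visible on 11 rows
str_default = "__OTHER__"

# hypothetical categorical feature with two rare values
toy = pd.Series(["Private"] * 6 + ["State-gov"] * 3 + ["Never-worked", "Without-pay"])

# relative frequency of each modality
frequencies = toy.value_counts(normalize=True)

# modalities observed less often than min_freq
rare = frequencies[frequencies < min_freq].index

# keep frequent modalities, replace rare ones with the default value
grouped = toy.where(~toy.isin(rare), str_default)
print(grouped.value_counts().to_dict())
```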

In [37]:
from AutoCarver import BinaryCarver

# initiating AutoCarver
auto_carver = BinaryCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    sort_by="tschuprowt",
    min_freq=min_freq,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, (train_set[target] == ">50K").astype(int))

------
[Discretizer] Fit Qualitative Features
---
 - [OrdinalDiscretizer] Fit ['education']
 - [CategoricalDiscretizer] Fit ['sex', 'race', 'workclass', 'occupation', 'relationship', 'marital-status', 'native-country', 'income']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['age', 'hours-per-week', 'education-num', 'capital-loss', 'capital-gain']
 - [OrdinalDiscretizer] Fit ['age', 'hours-per-week', 'education-num', 'capital-loss', 'capital-gain']
------


------
[AutoCarver] Fit education (1/14)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
7th-8th to 9th,0.0527,0.053
10th,0.0578,0.0284
"12th, 11th",0.056,0.0508
HS-grad,0.1582,0.322
Some-college,0.1919,0.2237
Assoc-voc,0.2468,0.0424
Assoc-acdm,0.2602,0.0333
Bachelors,0.4135,0.1633
Masters,0.5504,0.0545
"Doctorate, Prof-school",0.7478,0.0287


Grouping modalities   : 100%|██████████| 255/255 [00:00<00:00, 9360.40it/s]
Computing associations: 100%|██████████| 255/255 [00:00<00:00, 3893.47it/s]
Testing robustness    :   0%|          | 0/255 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
7th-8th to 10th,0.1596,0.7535
Bachelors to Masters,0.4828,0.2465


------


------
[AutoCarver] Fit age (2/14)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.800e+01,0.0,0.0285
1.800e+01 < x <= 1.900e+01,0.0023,0.022
1.900e+01 < x <= 2.000e+01,0.0011,0.0229
2.000e+01 < x <= 2.100e+01,0.0056,0.0229
2.100e+01 < x <= 2.200e+01,0.0118,0.0239
2.200e+01 < x <= 2.300e+01,0.0142,0.027
2.300e+01 < x <= 2.400e+01,0.0386,0.0245
2.400e+01 < x <= 2.500e+01,0.0665,0.0242
2.500e+01 < x <= 2.600e+01,0.0802,0.0239
2.600e+01 < x <= 2.700e+01,0.0947,0.0249


Grouping modalities   :  46%|████▌     | 56793/124313 [00:08<00:08, 7560.64it/s]

KeyboardInterrupt: 
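The value ``"tschuprowt"`` passed to ``sort_by`` names Tschuprow's T, a chi-square-based measure of association between two categorical variables that ranges from 0 (independence) to 1. How **AutoCarver** computes and uses it internally is not shown in this notebook. The sketch below only illustrates the statistic itself, on made-up data, with a hypothetical helper ``tschuprow_t``:

```python
import numpy as np
import pandas as pd


def tschuprow_t(x: pd.Series, y: pd.Series) -> float:
    """Tschuprow's T computed from the chi-square statistic of the x-by-y table."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # expected counts under independence of x and y
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_rows, n_cols = observed.shape
    return float(np.sqrt(chi2 / (n * np.sqrt((n_rows - 1) * (n_cols - 1)))))


# hypothetical grouped feature that separates the binary target perfectly
toy_feature = pd.Series(["low", "low", "low", "high", "high", "high"])
toy_target = pd.Series([0, 0, 0, 1, 1, 1])
print(tschuprow_t(toy_feature, toy_target))

# hypothetical feature that carries no information about the target
flat_feature = pd.Series(["a", "a", "b", "b"])
flat_target = pd.Series([0, 1, 0, 1])
print(tschuprow_t(flat_feature, flat_target))
```

A grouping of modalities whose target rates differ sharply scores closer to 1 than one whose groups look alike.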

In [39]:
from AutoCarver.discretizers import Discretizer

# initiating Discretizer
discretizer = Discretizer(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample
train_set_processed = discretizer.fit_transform(train_set, (train_set[target] == ">50K").astype(int))

------
[Discretizer] Fit Qualitative Features
---
 - [OrdinalDiscretizer] Fit ['education']
 - [CategoricalDiscretizer] Fit ['marital-status', 'sex', 'workclass', 'occupation', 'relationship', 'race', 'native-country', 'income']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['age', 'hours-per-week', 'education-num', 'capital-loss', 'capital-gain']
 - [OrdinalDiscretizer] Fit ['age', 'hours-per-week', 'education-num', 'capital-loss', 'capital-gain']
------

