# About

Welcome to this notebook where we will explore the Census Adult Income dataset, a rich source of socio-economic information derived from the 1994 U.S. Census Bureau database. Our focus in this analysis is on employing a robust Python data discretization pipeline: **Discretizer**. This versatile tool is able to discretize various types of data, both quantitative and qualitative, making it an ideal companion for preprocessing tasks.

Data discretization is a crucial step in preparing datasets for machine learning models. It involves transforming continuous or categorical variables into discrete bins, allowing for improved interpretability, handling non-linearity, and addressing potential outliers. The **Discretizer** we will employ is designed to handle a diverse range of data types, including quantitative variables such as age and education level, as well as qualitative variables like marital status and occupation.
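As a quick illustration of what "transforming into discrete bins" means, here is a minimal sketch using plain pandas rather than **Discretizer**. The ages are toy values invented for the example, and `pd.cut` / `pd.qcut` only stand in for the general idea of equal-width and equal-frequency binning.

```python
import pandas as pd

# toy ages, purely illustrative (not taken from the Census data)
ages = pd.Series([17, 22, 25, 31, 38, 44, 52, 61, 70, 90], name="age")

# equal-width bins: each bin spans the same range of values
equal_width = pd.cut(ages, bins=3)

# equal-frequency bins (quantiles): each bin holds roughly the same number of rows
equal_freq = pd.qcut(ages, q=5)

print(equal_width.value_counts(sort=False))
print(equal_freq.value_counts(sort=False))
```

Quantile bins adapt to where the data actually lies, which is why quantile-based discretization tends to be less sensitive to outliers than equal-width binning.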

Throughout this notebook, we will delve into the intricacies of the Census Adult Income dataset, exploring the distribution of features and employing the Discretizer to discretize the data effectively. By the end of this preprocessing journey, we aim to create a dataset that is well-suited for subsequent machine learning tasks, enabling the development of robust models for predicting income levels based on socio-economic attributes.

Let's embark on this exploration and witness the power of **Discretizer** in preparing our data for insightful analysis and accurate modeling.

# Setting things up
## Installation

In [1]:
%pip install autocarver

## Census Data

In this example notebook, we will use the Census dataset.

The Census Adult Income dataset, commonly referred to as the "Adult" dataset, is a well-known dataset in the realm of machine learning and data analysis. It is frequently used for tasks related to classification and predictive modeling. The dataset is derived from the 1994 U.S. Census Bureau database and contains a diverse set of features that aim to predict whether an individual earns more than $50,000 annually, making it a binary classification problem.

The features in the Adult dataset include demographic information such as age, education level, marital status, occupation, and work-related details like hours worked per week. The primary objective when working with this dataset is typically to build a predictive model capable of discerning between individuals with annual incomes above and below the $50,000 threshold.

In [3]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
adult_data = adult.data.features 
adult_data = adult_data.join(adult.data.targets)
  
# Display the first few rows of the dataset
adult_data.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Target type

In [4]:
target = "income"

# cleaning target
adult_data[target] = adult_data[target].apply(lambda u: u.replace(".", ""))

# conversion to 0/1
adult_data[target] = (adult_data[target] == ">50K").astype(int)

# target rate
adult_data[target].value_counts(dropna=False)

income
0    37155
1    11687
Name: count, dtype: int64

The target ``"income"`` is a binary target used in a classification task.
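The cleaning step above can be checked on a toy series. This sketch assumes the raw labels come in two spellings, with and without a trailing period, which is why the period is removed before the comparison; the four values below are illustrative, not read from the dataset.

```python
import pandas as pd

# toy labels, assuming both spellings occur in the raw data
raw = pd.Series(["<=50K", ">50K", "<=50K.", ">50K."], name="income")

# vectorized equivalent of the .apply(lambda u: u.replace(".", "")) call
cleaned = raw.str.replace(".", "", regex=False)

# 1 for incomes above $50,000, 0 otherwise
binary = (cleaned == ">50K").astype(int)
print(binary.tolist())  # [0, 1, 0, 1]
```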

## Data Sampling

In [5]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(adult_data, test_size=0.20, random_state=42, stratify=adult_data[target])

# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()


(0.23927008420136667, 0.23932848807452145)

## Picking up columns to Discretize

In [6]:
train_set.head()

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 34495 | 37 | Private | 193106 | Bachelors | 13 | Never-married | Sales | Not-in-family | White | Female | 0 | 0 | 30 | United-States | 0 |
| 18591 | 56 | Self-emp-inc | 216636 | 12th | 8 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1651 | 40 | United-States | 0 |
| 12562 | 53 | Private | 126977 | HS-grad | 9 | Separated | Craft-repair | Not-in-family | White | Male | 0 | 0 | 35 | United-States | 0 |
| 552 | 72 | Private | 205343 | 11th | 7 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | 0 |
| 3479 | 46 | State-gov | 106705 | Masters | 14 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 38 | United-States | 0 |


In [7]:
# column data types
train_set.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income             int32
dtype: object

* ``"education"`` is the only qualitative ordinal feature. It will be added to the list of ``ordinal_features``, and its ordering has to be set in ``values_orders``.

* ``"sex"``, ``"marital-status"``, ``"occupation"``, ``"relationship"``, ``"race"``, ``"native-country"`` and ``"workclass"`` are qualitative categorical features. They will be added to the list of ``qualitative_features``.

* ``"capital-gain"`` and ``"capital-loss"`` are quantitative continuous features, whilst ``"education-num"``, ``"age"`` and ``"hours-per-week"`` can be considered quantitative discrete features. They will be added to the list of ``quantitative_features``.

* ``"fnlwgt"`` is the weighting column. It is not currently usable in **AutoCarver**.

In [8]:
# lists of features per data type
quantitative_features = ["capital-gain", "capital-loss", "education-num", "hours-per-week", "age"]
qualitative_features = ["sex", "workclass", "marital-status", "occupation", "relationship", "race", "native-country"]
ordinal_features = ["education"]
weighting = ["fnlwgt"]

# user-specified ordering for ordinal features
values_orders = {
    "education": ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Prof-school", "Doctorate"]
}

# Using Discretizer
## Discretizer Settings
### Representativeness of modalities

The attribute ``min_freq`` sets the minimum frequency per basic modality. It is used by **Discretizers** as follows:

- For quantitative features, it defines the number of quantiles with which the features are initially discretized.

- For qualitative features, it defines the threshold below which a modality is grouped with either a default value or its closest modality.
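To make the link between ``min_freq`` and the number of quantiles concrete, here is a minimal sketch in plain pandas. It assumes the number of quantiles is roughly ``1 / min_freq``; this is an illustration on random data, not a description of AutoCarver's internals.

```python
import numpy as np
import pandas as pd

min_freq = 0.02

# assumption for this sketch: one quantile per min_freq share of the data
n_quantiles = round(1 / min_freq)

# random continuous values, purely illustrative
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=10_000))

# quantile discretization: every bin holds about min_freq of the rows
binned = pd.qcut(values, q=n_quantiles)
freqs = binned.value_counts(normalize=True)
print(n_quantiles, freqs.min(), freqs.max())
```

A smaller ``min_freq`` therefore means more, narrower initial bins, while a larger one means fewer, wider bins.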

In [9]:
min_freq = 0.02

**Tip:** ``min_freq`` should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust).

## Fitting Discretizer

* First, all qualitative features are discretized using ``QualitativeDiscretizer``:
    1. (Optionally) Using ``StringDiscretizer`` to convert them to ``str`` if that is not already the case
    2. For qualitative ordinal features: using ``OrdinalDiscretizer`` so that under-represented values (less frequent than ``min_freq=0.02``) are grouped with their closest modality
    3. For qualitative categorical features: using ``CategoricalDiscretizer`` so that under-represented values (less frequent than ``min_freq=0.02``) are grouped with a default value (``str_default="__OTHER__"``)

* Second, all quantitative features are discretized using ``QuantitativeDiscretizer``:
    1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq=0.02``)
    2. Using ``OrdinalDiscretizer`` so that any remaining under-represented values (less frequent than ``min_freq/2=0.01``) are grouped with their closest modality

In [10]:
from AutoCarver.discretizers import Discretizer

# initiating Discretizer
discretizer = Discretizer(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample
train_set_processed = discretizer.fit_transform(train_set, train_set[target])

------
[Discretizer] Fit Qualitative Features
---
 - [OrdinalDiscretizer] Fit ['education']


 - [CategoricalDiscretizer] Fit ['workclass', 'occupation', 'marital-status', 'race', 'native-country', 'relationship', 'sex']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['capital-loss', 'hours-per-week', 'capital-gain', 'education-num', 'age']
 - [OrdinalDiscretizer] Fit ['capital-loss', 'hours-per-week', 'capital-gain', 'education-num', 'age']
------



## Discretizer Analysis
### Quantitative Continuous Feature

In [11]:
# Discretization Summary
feature = "capital-gain"
discretizer.summary(feature)

| dtype | label | content |
| --- | --- | --- |
| float | 0.000e+00 < x <= 3.411e+03 | [0.000e+00 < x <= 3.411e+03] |
| float | 1.355e+04 < x | [1.355e+04 < x] |
| float | 3.411e+03 < x <= 7.298e+03 | [3.411e+03 < x <= 7.298e+03] |
| float | 7.298e+03 < x <= 1.355e+04 | [7.298e+03 < x <= 1.355e+04] |
| float | x <= 0.000e+00 | [x <= 0.000e+00] |


In [12]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])

Over-represented values of capital-gain:
 capital-gain
0    0.918358
Name: proportion, dtype: float64


In [13]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of capital-gain:
 capital-gain
x <= 0.000e+00                0.918358
3.411e+03 < x <= 7.298e+03    0.026847
0.000e+00 < x <= 3.411e+03    0.020474
1.355e+04 < x                 0.019835
7.298e+03 < x <= 1.355e+04    0.014486
Name: proportion, dtype: float64


For quantitative continuous feature ``capital-gain``:

* An over-represented value has been identified and kept by itself: value ``0`` represents 91.8% of observed data (more than ``min_freq=0.02``)
* The remaining 8.2% of values have been discretized into quantiles that each hold about 2% of observed data (as specified with ``min_freq=0.02``)
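The behaviour described above (keep an over-represented value on its own, then cut the rest into quantiles) can be sketched with plain pandas. This is a simplified illustration on synthetic data, not AutoCarver's implementation; the ``x == 0`` label format and the way the number of bins is derived are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

min_freq = 0.02
rng = np.random.default_rng(42)

# synthetic feature: about 90% zeros plus a skewed positive tail (illustrative only)
is_zero = rng.random(5_000) < 0.9
values = pd.Series(np.where(is_zero, 0.0, rng.lognormal(8, 1, 5_000)))

# 1. values frequent enough to form a bin on their own
freqs = values.value_counts(normalize=True)
is_frequent = values.isin(freqs[freqs >= min_freq].index)

# 2. quantile edges computed on the remaining values only
rest = values[~is_frequent]
n_bins = max(int(len(rest) / len(values) / min_freq), 1)
edges = rest.quantile(np.linspace(0, 1, n_bins + 1)).unique()

# 3. frequent values keep their own label, the rest get a quantile bin
labels = pd.cut(values, bins=edges, include_lowest=True).astype(str)
labels[is_frequent] = values[is_frequent].map(lambda v: f"x == {v:g}")
print(labels.value_counts(normalize=True))
```

Computing the quantiles on the remaining values only is what prevents the dominant value from swallowing most of the bin edges.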

### Quantitative Discrete Feature

In [14]:
# Discretization Summary
feature = "hours-per-week"
discretizer.summary(feature)

| dtype | label | content |
| --- | --- | --- |
| float | 1.000e+01 < x <= 1.500e+01 | [1.000e+01 < x <= 1.500e+01] |
| float | 1.500e+01 < x <= 2.000e+01 | [1.500e+01 < x <= 2.000e+01] |
| float | 2.000e+01 < x <= 2.500e+01 | [2.000e+01 < x <= 2.500e+01] |
| float | 2.500e+01 < x <= 3.000e+01 | [2.500e+01 < x <= 3.000e+01] |
| float | 3.000e+01 < x <= 3.400e+01 | [3.000e+01 < x <= 3.400e+01] |
| float | 3.400e+01 < x <= 3.500e+01 | [3.400e+01 < x <= 3.500e+01] |
| float | 3.500e+01 < x <= 3.900e+01 | [3.500e+01 < x <= 3.900e+01] |
| float | 3.900e+01 < x <= 4.000e+01 | [3.900e+01 < x <= 4.000e+01] |
| float | 4.000e+01 < x <= 4.400e+01 | [4.000e+01 < x <= 4.400e+01] |
| float | 4.400e+01 < x <= 4.500e+01 | [4.400e+01 < x <= 4.500e+01] |


In [15]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])

Over-represented values of hours-per-week:
 hours-per-week
40    0.464976
50    0.086965
45    0.056970
60    0.044993
35    0.039567
20    0.037084
30    0.034730
55    0.021959
Name: proportion, dtype: float64


In [16]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of hours-per-week:
 hours-per-week
3.900e+01 < x <= 4.000e+01    0.464976
4.900e+01 < x <= 5.400e+01    0.093901
4.400e+01 < x <= 4.500e+01    0.056970
5.500e+01 < x <= 6.000e+01    0.049267
1.500e+01 < x <= 2.000e+01    0.046835
2.500e+01 < x <= 3.000e+01    0.039669
3.400e+01 < x <= 3.500e+01    0.039567
2.000e+01 < x <= 2.500e+01    0.029816
3.500e+01 < x <= 3.900e+01    0.027948
x <= 1.000e+01                0.023367
5.400e+01 < x <= 5.500e+01    0.021959
4.500e+01 < x <= 4.900e+01    0.020884
1.000e+01 < x <= 1.500e+01    0.019911
4.000e+01 < x <= 4.400e+01    0.019169
6.000e+01 < x <= 7.000e+01    0.018504
7.000e+01 < x                 0.016072
3.000e+01 < x <= 3.400e+01    0.011184
Name: proportion, dtype: float64


In [17]:
# values between 50 and 55 hours per week are under-represented
print(f"Observed data for {feature} values strictly between 50 and 55: {train_set.query(f'50<`{feature}`<55').shape[0] / len(train_set):.2%}")

Observed data for hours-per-week values strictly between 50 and 55: 0.69%


In [18]:
print(f"Target rate for {feature} values equal to 50: {train_set.query(f'`{feature}`==50')[target].mean():.2%}")
print(f"Target rate for {feature} values strictly between 50 and 55: {train_set.query(f'50<`{feature}`<55')[target].mean():.2%}")
print(f"Target rate for {feature} values equal to 55: {train_set.query(f'`{feature}`==55')[target].mean():.2%}")

Target rate for hours-per-week values equal to 50: 44.29%
Target rate for hours-per-week values strictly between 50 and 55: 32.10%
Target rate for hours-per-week values equal to 55: 47.32%


For quantitative discrete feature ``hours-per-week``:

* Some over-represented values have been identified:
    * values ``20``, ``30``, ``35``, ``40``, ``45``, ``50``, ``55`` and ``60`` each represent more than 2.0% of observed data (more than ``min_freq=0.02``)
    * they are kept as their own modality
* Some under-represented values have been identified:
    * values strictly between ``50`` and ``55`` represent only 0.7% of observed data, which is not enough to make a whole quantile (at least ``min_freq/2=0.01`` is required)
    * hence they are grouped with their closest modality in terms of target rate, ``50`` (32.1% is closer to 44.3% than to 47.3%)
* Remaining values have been discretized into quantiles that each hold about 2% of observed data (as specified with ``min_freq=0.02``)
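The "closest modality in terms of target rate" rule can be sketched as follows, reusing the frequencies and target rates printed above. The helper ``closest_neighbor`` is hypothetical and written for this illustration only; it is not part of AutoCarver.

```python
import pandas as pd

# frequencies and target rates reported above for three consecutive groups
bins = pd.DataFrame(
    {
        "freq": [0.086965, 0.0069, 0.021959],
        "target_rate": [0.4429, 0.3210, 0.4732],
    },
    index=["50", "51-54", "55"],
)


def closest_neighbor(bins: pd.DataFrame, label: str) -> str:
    """Return the adjacent bin whose target rate is closest to that of `label`."""
    position = bins.index.get_loc(label)
    # only the bins directly before and after are candidates
    neighbors = [bins.index[i] for i in (position - 1, position + 1) if 0 <= i < len(bins)]
    rate = bins.loc[label, "target_rate"]
    gaps = {n: abs(bins.loc[n, "target_rate"] - rate) for n in neighbors}
    return min(gaps, key=gaps.get)


print(closest_neighbor(bins, "51-54"))  # 50
```

Restricting candidates to adjacent bins keeps the resulting groups contiguous, so the ordering of the feature is preserved.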

### Qualitative Categorical Feature

In [19]:
# Discretization Summary
feature = "workclass"
discretizer.summary(feature)

| dtype | label | content |
| --- | --- | --- |
| str | ? | [?] |
| str | Federal-gov | [Federal-gov] |
| str | Local-gov | [Local-gov] |
| str | Private | [Private] |
| str | Self-emp-inc | [Self-emp-inc] |
| str | Self-emp-not-inc | [Self-emp-not-inc] |
| str | State-gov | [State-gov] |
| str | `__NAN__` | [`__NAN__`] |
| str | `__OTHER__` | [Never-worked, Without-pay] |


In [20]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
print(f"\nUnder-represented values of {feature}:\n", stats[stats < min_freq])

Over-represented values of workclass:
 workclass
Private             0.708568
Self-emp-not-inc    0.080096
Local-gov           0.066181
State-gov           0.040622
?                   0.038221
Self-emp-inc        0.035584
Federal-gov         0.030023
Name: proportion, dtype: float64

Under-represented values of workclass:
 workclass
Without-pay     0.000496
Never-worked    0.000209
Name: proportion, dtype: float64


In [21]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of workclass:
 workclass
Private             0.694623
Self-emp-not-inc    0.078520
Local-gov           0.064879
State-gov           0.039823
?                   0.037468
Self-emp-inc        0.034883
Federal-gov         0.029432
__NAN__             0.019681
__OTHER__           0.000691
Name: proportion, dtype: float64


For qualitative categorical feature ``workclass``:

* Some over-represented categories have been identified:
    * categories ``"Private"``, ``"Self-emp-not-inc"``, ``"Local-gov"``, ``"State-gov"``, ``"?"``, ``"Self-emp-inc"`` and ``"Federal-gov"``,  each represent more than 2.0% of observed data (more than ``min_freq=0.02``)
    * they are kept as their own modality
* Some under-represented categories have been identified: 
    * categories ``"Never-worked"`` and ``"Without-pay"``,  each represent less than 2.0% of observed data (less than ``min_freq=0.02``)
    * they are grouped in the default value ``str_default="__OTHER__"``
* Missing values are always kept as a modality of their own (nan value ``str_nan="__NAN__"``)
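The grouping of rare categories can be reproduced in a few lines of pandas. This is a minimal sketch on a toy column with made-up proportions; it assumes frequencies are computed on non-missing values and only mirrors the ``str_default`` and ``str_nan`` names quoted above.

```python
import numpy as np
import pandas as pd

min_freq = 0.02
str_default, str_nan = "__OTHER__", "__NAN__"  # names mirror the settings quoted above

# toy column (illustrative proportions, not the Census data)
workclass = pd.Series(["Private"] * 90 + ["Local-gov"] * 7 + ["Without-pay"] + [np.nan] * 2)

# categories observed less often than min_freq
freqs = workclass.value_counts(normalize=True, dropna=True)
rare = freqs[freqs < min_freq].index

# rare categories go to the default value, missing values get their own label
grouped = workclass.where(~workclass.isin(rare), str_default).fillna(str_nan)
print(grouped.value_counts())
```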

### Qualitative Ordinal Feature

In [22]:
# Discretization Summary
feature = "education"
discretizer.summary(feature)

| dtype | label | content |
| --- | --- | --- |
| str | 10th | [10th] |
| str | 11th | [11th, 12th] |
| str | 7th-8th | [1st-4th, 5th-6th, 7th-8th, 9th, Preschool] |
| str | Assoc-acdm | [Assoc-acdm] |
| str | Assoc-voc | [Assoc-voc] |
| str | Bachelors | [Bachelors] |
| str | HS-grad | [HS-grad] |
| str | Masters | [Masters] |
| str | Prof-school | [Doctorate, Prof-school] |
| str | Some-college | [Some-college] |


In [23]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
print(f"\nUnder-represented values of {feature}:\n", stats[stats < min_freq])

Over-represented values of education:
 education
HS-grad         0.322013
Some-college    0.223658
Bachelors       0.163259
Masters         0.054539
Assoc-voc       0.042408
11th            0.037136
Assoc-acdm      0.033348
10th            0.028357
Name: proportion, dtype: float64

Under-represented values of education:
 education
7th-8th        0.019348
Prof-school    0.016738
9th            0.015637
12th           0.013616
Doctorate      0.011978
5th-6th        0.010928
1st-4th        0.005451
Preschool      0.001587
Name: proportion, dtype: float64


In [24]:
discretizer.values_orders[feature].content

{'7th-8th': ['9th', '5th-6th', '1st-4th', 'Preschool', '7th-8th'],
 '10th': ['10th'],
 '11th': ['12th', '11th'],
 'HS-grad': ['HS-grad'],
 'Some-college': ['Some-college'],
 'Assoc-voc': ['Assoc-voc'],
 'Assoc-acdm': ['Assoc-acdm'],
 'Bachelors': ['Bachelors'],
 'Masters': ['Masters'],
 'Prof-school': ['Doctorate', 'Prof-school']}

In [25]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of education:
 education
HS-grad         0.322013
Some-college    0.223658
Bachelors       0.163259
Masters         0.054539
7th-8th         0.052952
11th            0.050751
Assoc-voc       0.042408
Assoc-acdm      0.033348
Prof-school     0.028715
10th            0.028357
Name: proportion, dtype: float64


In [26]:
print("Provided ordering:", values_orders[feature])

Provided ordering: ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']


For qualitative ordinal feature ``education``:

* Some over-represented categories have been identified:
    * categories ``"10th"``, ``"11th"``, ``"HS-grad"``, ``"Some-college"``, ``"Assoc-voc"``, ``"Assoc-acdm"``, ``"Bachelors"`` and ``"Masters"`` each represent more than 2.0% of observed data (more than ``min_freq=0.02``)
    * they are kept as their own modality
* Some under-represented categories have been identified:
    * categories ``"Preschool"``, ``"1st-4th"``, ``"5th-6th"``, ``"7th-8th"``, ``"9th"``, ``"12th"``, ``"Prof-school"`` and ``"Doctorate"``  each represent less than 2.0% of observed data (less than ``min_freq=0.02``)
    * starting from the least represented category, they are grouped with their respective closest modality: 
        - ``"Preschool"`` is grouped with ``"1st-4th"`` as it is its only adjacent modality in the specified order (see definition of ``values_orders``)
        - In the same manner, the resulting group is then successively grouped with ``"5th-6th"``, ``"7th-8th"`` and ``"9th"``
        - Likewise, ``"12th"`` is grouped with ``"11th"``, and ``"Prof-school"`` and ``"Doctorate"`` are grouped together
* Missing values are left by themselves (nan value ``str_nan="__NAN__"``)
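The successive grouping of an ordinal feature can be sketched in plain Python using the training frequencies printed above. This is a simplified illustration, not AutoCarver's algorithm: when two adjacent groups are available, the sketch merges with the less frequent one, which is an assumption made for brevity.

```python
min_freq = 0.02

# training frequencies reported above, listed in the order given by values_orders
freqs = {
    "Preschool": 0.001587, "1st-4th": 0.005451, "5th-6th": 0.010928, "7th-8th": 0.019348,
    "9th": 0.015637, "10th": 0.028357, "11th": 0.037136, "12th": 0.013616,
    "HS-grad": 0.322013, "Some-college": 0.223658, "Assoc-voc": 0.042408,
    "Assoc-acdm": 0.033348, "Bachelors": 0.163259, "Masters": 0.054539,
    "Prof-school": 0.016738, "Doctorate": 0.011978,
}

# each group is [members, frequency]; groups stay sorted in the provided order
groups = [[[name], freq] for name, freq in freqs.items()]

while len(groups) > 1:
    # pick the least frequent group and stop once every group is frequent enough
    idx = min(range(len(groups)), key=lambda i: groups[i][1])
    if groups[idx][1] >= min_freq:
        break
    # only adjacent groups are candidates, so the ordering is preserved
    neighbors = [i for i in (idx - 1, idx + 1) if 0 <= i < len(groups)]
    # ASSUMPTION of this sketch: merge with the less frequent neighbor
    target = min(neighbors, key=lambda i: groups[i][1])
    low, high = sorted((idx, target))
    groups[low] = [groups[low][0] + groups[high][0], groups[low][1] + groups[high][1]]
    del groups[high]

for members, freq in groups:
    print(f"{freq:.4f}", members)
```

On these frequencies the sketch ends with the same ten groups as the summary above, although the real grouping criterion may differ.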

## Saving and Loading Discretizer
### Saving

All **Discretizers** can safely be stored as a .json file.

In [27]:
import json

# storing as json file
with open('discretizer.json', 'w') as my_discretizer_json:
    json.dump(discretizer.to_json(), my_discretizer_json)

### Loading

**Discretizers** can safely be loaded from a .json file.

In [28]:
import json

from AutoCarver.discretizers import load_discretizer

# loading json file
with open('discretizer.json', 'r') as my_discretizer_json:
    discretizer = load_discretizer(json.load(my_discretizer_json))

## Applying Discretizer

In [30]:
dev_set_processed = discretizer.transform(dev_set)

## What's next?

* Thanks to **Discretizers**, all of your features are now quantitative ordinal features with sufficiently represented modalities!
* **Discretizers** are directly integrated in **Carvers**
* **Carvers** build on this discretization step to find the combination of consecutive modalities that is most associated with the target, so make sure to check out [Carvers Examples](https://autocarver.readthedocs.io/en/latest/carvers_examples.html)!

## Well done!

You have successfully navigated the intricacies of feature discretization using the AutoCarver package, creating a dataset with finely tuned discrete representations that enhance the interpretability and effectiveness of your features.

Your meticulous approach to discretizing both quantitative and qualitative attributes, utilizing the QuantitativeDiscretizer and QualitativeDiscretizer components, showcases your commitment to preparing data for meaningful analysis and modeling.

We appreciate your trust in the AutoCarver package, and we hope that the discretized features contribute to the success of your machine learning endeavors. As you move forward with your analyses and predictive modeling tasks, may the insights gained from this well-crafted dataset lead to impactful and informed decisions.

Thank you for choosing AutoCarver as your companion in the data preprocessing journey. Your dedication to refining and optimizing your datasets reflects a commitment to excellence in data science. We look forward to being part of your future data adventures and wish you continued success in your endeavors.