# Setting things up

## Installation

In [1]:
# !pip install AutoCarver[jupyter]

In [None]:
import sys
import os

print(os.listdir('../../../../../AutoCarver'))

sys.path.append('../../../../../AutoCarver')
sys.path.append('../../../../../AutoCarver/AutoCarver')
sys.path.append('../../../../../AutoCarver/AutoCarver/discretizers')
sys.path.append('../../../../../AutoCarver/AutoCarver/discretizers/utils')
import AutoCarver


## Titanic Data

In this example notebook, we will use the Titanic dataset.

The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.

The dataset includes various features such as passengers' names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).

In [3]:
import pandas as pd

# URL to a CSV copy of the Titanic dataset (hosted by the Stanford CS109 course archive)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)

# Display the first few rows of the dataset
titanic_data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


## Target type and Carver selection

In [4]:
target = "Survived"

titanic_data[target].value_counts(dropna=False)

0    545
1    342
Name: Survived, dtype: int64

The target ``"Survived"`` is a binary target of type ``int64`` used for a classification task. Hence we will use ``AutoCarver.BinaryCarver`` and ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks.

## Data Sampling

In [5]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])

In [6]:
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()

(0.38552188552188554, 0.3856655290102389)

# Picking up columns to Carve

In [7]:
train_set.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
617,0,3,Mr. Antoni Yasbeck,male,27.0,1,0,14.4542
489,0,1,Mr. Harry Markland Molson,male,55.0,0,0,30.5
871,1,3,Miss. Adele Kiamie Najib,female,15.0,0,0,7.225
654,0,3,Mrs. John (Catherine) Bourke,female,32.0,1,1,15.5
653,0,3,Mr. Alexander Radeff,male,27.0,0,0,7.8958


In [8]:
# column data types
train_set.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

In [9]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()

0    438
1     87
2     60
3      3
5      3
4      2
6      1
Name: Parents/Children Aboard, dtype: int64

In [10]:
# values taken by Pclass
train_set["Pclass"].value_counts()

3    326
1    142
2    126
Name: Pclass, dtype: int64

The feature ``"Pclass"`` is of type ``"int64"``, but it is better treated as a qualitative ordinal feature than as a quantitative discrete feature (it ranks named passenger classes). Thus we will add it to the list of ``ordinal_features`` and set the ordering of its values, given as strings, in ``values_orders``.

``"Sex"`` is the only qualitative categorical feature; it is added to the list of ``qualitative_features``.

``"Age"`` and ``"Fare"`` are quantitative continuous features, whilst ``"Siblings/Spouses Aboard"``, ``"Parents/Children Aboard"`` can be considered as quantitative discrete features. Those four features will be added to the list of ``quantitative_features``.

In [11]:
# lists of features per data type
quantitative_features = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
qualitative_features = ["Sex"]
ordinal_features = ["Pclass"]

# user-specified ordering for ordinal features
values_orders = {
    "Pclass": ["1", "2", "3"]
}
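
As an illustration of what the user-specified ordering conveys, the sketch below maps raw ``Pclass`` values to their rank in ``values_orders``. It assumes integer values are cast to strings before the lookup, which seems consistent with the ``StringDiscretizer`` step that appears in the fitting log of this notebook. The helper ``ordinal_rank`` is hypothetical and not part of AutoCarver.

```python
# Hypothetical helper (not part of AutoCarver): rank of a value in a user-defined order.
values_orders = {"Pclass": ["1", "2", "3"]}


def ordinal_rank(value, order):
    """Return the position of `value` in `order`, comparing values as strings."""
    return order.index(str(value))


raw_pclass = [3, 1, 2, 3]  # integer values, as stored in the dataset
ranks = [ordinal_rank(v, values_orders["Pclass"]) for v in raw_pclass]
print(ranks)  # [2, 0, 1, 2]
```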

# Using AutoCarver

## AutoCarver settings

### Representativeness of modalities

The attribute ``min_freq`` allows one to choose the minimum frequency per basic modality. It is used by **Discretizers**:

- For quantitative features, it defines the number of quantiles with which the features are initially discretized.

- For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.

In [12]:
min_freq = 0.02

**Tip:** ``min_freq`` should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust).
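
To make the role of ``min_freq`` more concrete, here is a rough sketch in plain Python. It assumes that the initial number of quantiles is about ``1 / min_freq`` (an assumption, not a documented formula) and flags the values of ``"Parents/Children Aboard"`` that are rarer than ``min_freq`` in ``train_set``. Both helpers are hypothetical and only illustrate the idea.

```python
from collections import Counter


def initial_quantile_count(min_freq):
    # Assumption: about 1 / min_freq quantiles, so each bin holds roughly min_freq of the rows.
    return round(1 / min_freq)


def rare_modalities(counts, min_freq):
    """Modalities whose observed frequency falls below min_freq."""
    total = sum(counts.values())
    return sorted(m for m, c in counts.items() if c / total < min_freq)


# Value counts of "Parents/Children Aboard" in train_set, copied from the output above.
parch_counts = Counter({0: 438, 1: 87, 2: 60, 3: 3, 5: 3, 4: 2, 6: 1})

n_quantiles = initial_quantile_count(0.02)
rare = rare_modalities(parch_counts, 0.02)
print(n_quantiles)  # 50
print(rare)  # [3, 4, 5, 6]
```

With ``min_freq = 0.02`` and 594 training rows, a value needs at least 12 occurrences to stand on its own, so the values 3 to 6 are candidates for grouping with a neighbouring modality.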

### Desired number of modalities

The attribute ``max_n_mod`` allows one to choose the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities per consecutive combination of modalities.

In [13]:
max_n_mod = 5

**Tip:** ``max_n_mod`` should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust).
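
The cost of a larger ``max_n_mod`` can be estimated by counting consecutive groupings. Splitting ``n`` ordered modalities into ``k`` consecutive groups can be done in ``C(n-1, k-1)`` ways, and the sketch below sums this for ``k`` from 2 to ``max_n_mod``. The totals agree with the progress-bar counts in the fitting output of this notebook. For ``Age`` and ``Fare`` this holds if they have 31 and 30 raw modalities, which is inferred here rather than read from the truncated tables. The function name is hypothetical.

```python
from math import comb


def n_consecutive_combinations(n_modalities, max_n_mod):
    """Number of ways to split ordered modalities into 2..max_n_mod consecutive groups."""
    upper = min(max_n_mod, n_modalities)
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, upper + 1))


print(n_consecutive_combinations(2, 5))   # 1      (Sex)
print(n_consecutive_combinations(3, 5))   # 3      (Pclass, Parents/Children Aboard)
print(n_consecutive_combinations(4, 5))   # 7      (Siblings/Spouses Aboard)
print(n_consecutive_combinations(30, 5))  # 27840  (Fare, assuming 30 raw modalities)
print(n_consecutive_combinations(31, 5))  # 31930  (Age, assuming 31 raw modalities)
```

The count grows quickly with both the number of raw modalities (driven by ``min_freq``) and ``max_n_mod``, which is why lower values of ``min_freq`` and higher values of ``max_n_mod`` are slower.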

### Association metric

The attribute ``sort_by`` allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric, and the best-ranked viable combination is returned by **Carvers**.

In [14]:
# For BinaryCarver, to be chosen amongst ["tschuprowt", "cramerv"]
sort_by = "tschuprowt"  # "cramerv"

**Tip:** use ``"tschuprowt"`` for more robust results with fewer output modalities; use ``"cramerv"`` for more output modalities.
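
The difference between the two metrics can be seen from their textbook definitions, sketched below from a chi-squared statistic. The numbers are hypothetical and only show that, for a binary target, Tschuprow's T shrinks as the number of modalities grows while Cramér's V does not. This is not AutoCarver's own code.

```python
from math import sqrt


def tschuprow_t(chi2, n, n_rows, n_cols):
    """Tschuprow's T: normalises chi2 by sqrt((rows - 1) * (cols - 1))."""
    return sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: normalises chi2 by min(rows - 1, cols - 1)."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical chi2 statistic and sample size, binary target (2 columns).
chi2, n = 60.0, 600

t2, v2 = tschuprow_t(chi2, n, 2, 2), cramers_v(chi2, n, 2, 2)
t5, v5 = tschuprow_t(chi2, n, 5, 2), cramers_v(chi2, n, 5, 2)
print(round(t2, 4), round(v2, 4))  # 0.3162 0.3162
print(round(t5, 4), round(v5, 4))  # 0.2236 0.3162
```

For the same chi-squared value, a five-modality feature scores lower on T than a two-modality one, so ranking by T favours combinations with fewer modalities.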

### Grouping NaNs

The attribute ``dropna`` allows one to choose whether or not ``numpy.nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-NaN values, and then test out all possible combinations with ``numpy.nan``.

In [15]:
dropna = False  # anyway, there are no numpy.nan in this dataset

### Optional attributes

#### Minimal frequency per carved modality

The attribute ``min_freq_mod`` allows one to choose the minimum frequency per output modality. It is used by **Carvers** in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to ``min_freq/2``.

In [16]:
min_freq_mod = None  # None defaults to min_freq/2; e.g. 0.05 requires at least 5 % of observations per output modality in train and dev sets
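
The frequency part of the viability test can be sketched as follows. The helpers are hypothetical and cover only the frequency criterion described above, using the carved ``Age`` frequencies reported in the fitting output further down in this notebook (first table read as train, second as dev).

```python
def effective_min_freq_mod(min_freq, min_freq_mod=None):
    """Default described above: min_freq / 2 when min_freq_mod is None."""
    return min_freq / 2 if min_freq_mod is None else min_freq_mod


def frequent_enough(frequencies, threshold):
    """True when every output modality reaches the minimum frequency."""
    return all(f >= threshold for f in frequencies)


# Frequencies of the carved "Age" modalities (train, then dev).
age_train = [0.067, 0.933]
age_dev = [0.058, 0.942]

threshold = effective_min_freq_mod(min_freq=0.02)
viable = frequent_enough(age_train, threshold) and frequent_enough(age_dev, threshold)
viable_strict = frequent_enough(age_train, 0.06) and frequent_enough(age_dev, 0.06)
print(threshold, viable, viable_strict)  # 0.01 True False
```

With a stricter ``min_freq_mod`` of ``0.06``, the same combination would be set aside because the youngest group covers only 5.8 % of the dev set.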

#### Type of output carved features

The attribute ``output_dtype`` allows one to choose the output type:

* Use ``"float"`` for numeric labels, i.e. the ranks of modalities (default)
* Use ``"str"`` for string output

In [17]:
output_dtype = "float"  # "str"


## Fitting AutoCarver

In [18]:
from AutoCarver import BinaryCarver

# initiating AutoCarver
auto_carver = BinaryCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    min_freq_mod=min_freq_mod,
    max_n_mod=max_n_mod,
    dropna=dropna,
    sort_by=sort_by,
    output_dtype=output_dtype,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

------
[Discretizer] Fit Qualitative Features
---
 - [StringDiscretizer] Fit ['Pclass']
 - [OrdinalDiscretizer] Fit ['Pclass']
 - [CategoricalDiscretizer] Fit ['Sex']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['Fare', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
 - [OrdinalDiscretizer] Fit ['Fare', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']




------


------
[AutoCarver] Fit Age (1/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 2.000e+00,0.75,0.027
2.000e+00 < x <= 7.000e+00,0.625,0.04
7.000e+00 < x <= 1.100e+01,0.167,0.02
1.100e+01 < x <= 1.600e+01,0.44,0.042
1.600e+01 < x <= 1.800e+01,0.323,0.052
1.800e+01 < x <= 1.900e+01,0.391,0.039
1.900e+01 < x <= 2.050e+01,0.111,0.03
2.050e+01 < x <= 2.100e+01,0.19,0.035
2.100e+01 < x <= 2.350e+01,0.419,0.072
2.350e+01 < x <= 2.400e+01,0.542,0.04

Unnamed: 0,target_rate,frequency
x <= 2.000e+00,0.444,0.031
2.000e+00 < x <= 7.000e+00,0.75,0.027
7.000e+00 < x <= 1.100e+01,0.375,0.027
1.100e+01 < x <= 1.600e+01,0.5,0.041
1.600e+01 < x <= 1.800e+01,0.429,0.072
1.800e+01 < x <= 1.900e+01,0.2,0.034
1.900e+01 < x <= 2.050e+01,0.333,0.02
2.050e+01 < x <= 2.100e+01,0.154,0.044
2.100e+01 < x <= 2.350e+01,0.182,0.075
2.350e+01 < x <= 2.400e+01,0.5,0.034


Grouping modalities   : 100%|██████████| 31930/31930 [00:05<00:00, 6285.88it/s]
Computing associations: 100%|██████████| 31930/31930 [00:07<00:00, 4514.91it/s]
Testing robustness    :   0%|          | 1/31930 [00:04<43:25:44,  4.90s/it]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 7.000e+00,0.675,0.067
7.000e+00 < x,0.365,0.933

Unnamed: 0,target_rate,frequency
x <= 7.000e+00,0.588,0.058
7.000e+00 < x,0.373,0.942


------


------
[AutoCarver] Fit Siblings/Spouses Aboard (2/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.361,0.68
0.000e+00 < x <= 1.000e+00,0.5,0.232
1.000e+00 < x <= 2.000e+00,0.55,0.034
2.000e+00 < x,0.094,0.054

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.32,0.683
0.000e+00 < x <= 1.000e+00,0.606,0.242
1.000e+00 < x <= 2.000e+00,0.25,0.027
2.000e+00 < x,0.286,0.048


Grouping modalities   : 100%|██████████| 7/7 [00:00<00:00, 3508.62it/s]
Computing associations: 100%|██████████| 7/7 [00:00<00:00, 3508.62it/s]
Testing robustness    :   0%|          | 0/7 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.361,0.68
0.000e+00 < x <= 2.000e+00,0.506,0.266
2.000e+00 < x,0.094,0.054

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.32,0.683
0.000e+00 < x <= 2.000e+00,0.57,0.27
2.000e+00 < x,0.286,0.048


------


------
[AutoCarver] Fit Pclass (3/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
"1, 1",0.62,0.239
"2, 2",0.468,0.212
"3, 3",0.252,0.549

Unnamed: 0,target_rate,frequency
"1, 1",0.649,0.253
"2, 2",0.483,0.198
"3, 3",0.23,0.549


Grouping modalities   : 100%|██████████| 3/3 [00:00<00:00, 2976.09it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3007.39it/s]
Testing robustness    :   0%|          | 0/3 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
1 to 2,0.549,0.451
"3, 3",0.252,0.549

Unnamed: 0,target_rate,frequency
1 to 2,0.576,0.451
"3, 3",0.23,0.549


------


------
[AutoCarver] Fit Sex (4/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
male,0.188,0.636
female,0.731,0.364

Unnamed: 0,target_rate,frequency
male,0.195,0.666
female,0.765,0.334


Grouping modalities   : 100%|██████████| 1/1 [00:00<00:00, 1003.18it/s]
Computing associations: 100%|██████████| 1/1 [00:00<?, ?it/s]
Testing robustness    :   0%|          | 0/1 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
male,0.188,0.636
female,0.731,0.364

Unnamed: 0,target_rate,frequency
male,0.195,0.666
female,0.765,0.334


------


------
[AutoCarver] Fit Fare (5/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 6.496e+00,0.0,0.022
6.496e+00 < x <= 7.054e+00,0.077,0.022
7.054e+00 < x <= 7.225e+00,0.231,0.022
7.225e+00 < x <= 7.250e+00,0.182,0.037
7.250e+00 < x <= 7.750e+00,0.35,0.067
7.750e+00 < x <= 7.854e+00,0.333,0.04
7.854e+00 < x <= 7.896e+00,0.143,0.047
7.896e+00 < x <= 8.029e+00,0.5,0.027
8.029e+00 < x <= 8.050e+00,0.097,0.052
8.050e+00 < x <= 8.662e+00,0.083,0.02

Unnamed: 0,target_rate,frequency
x <= 6.496e+00,0.111,0.031
6.496e+00 < x <= 7.054e+00,0.0,0.01
7.054e+00 < x <= 7.225e+00,0.25,0.014
7.225e+00 < x <= 7.250e+00,0.167,0.02
7.250e+00 < x <= 7.750e+00,0.25,0.055
7.750e+00 < x <= 7.854e+00,0.133,0.051
7.854e+00 < x <= 7.896e+00,0.071,0.048
7.896e+00 < x <= 8.029e+00,0.333,0.01
8.029e+00 < x <= 8.050e+00,0.167,0.041
8.050e+00 < x <= 8.662e+00,0.182,0.038


Grouping modalities   : 100%|██████████| 27840/27840 [00:04<00:00, 6843.63it/s]
Computing associations: 100%|██████████| 27840/27840 [00:06<00:00, 4015.39it/s]
Testing robustness    :   0%|          | 0/27840 [00:04<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 9.500e+00,0.222,0.379
9.500e+00 < x <= 7.729e+01,0.426,0.522
7.729e+01 < x,0.797,0.099

Unnamed: 0,target_rate,frequency
x <= 9.500e+00,0.149,0.345
9.500e+00 < x <= 7.729e+01,0.472,0.549
7.729e+01 < x,0.71,0.106


------


------
[AutoCarver] Fit Parents/Children Aboard (6/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.345,0.737
0.000e+00 < x <= 1.000e+00,0.506,0.146
1.000e+00 < x,0.493,0.116

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.347,0.805
0.000e+00 < x <= 1.000e+00,0.677,0.106
1.000e+00 < x,0.385,0.089


Grouping modalities   : 100%|██████████| 3/3 [00:00<?, ?it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 1497.61it/s]
Testing robustness    :   0%|          | 0/3 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.345,0.737
0.000e+00 < x,0.5,0.263

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.347,0.805
0.000e+00 < x,0.544,0.195


------



## AutoCarver analysis

### Carving Summary

In [19]:
auto_carver.summary()

feature,dtype,label,content
Age,float,0,[x <= 7.000e+00]
Age,float,1,[7.000e+00 < x]
Fare,float,0,[x <= 9.500e+00]
Fare,float,1,[9.500e+00 < x <= 7.729e+01]
Fare,float,2,[7.729e+01 < x]
Parents/Children Aboard,float,0,[x <= 0.000e+00]
Parents/Children Aboard,float,1,[0.000e+00 < x]
Siblings/Spouses Aboard,float,0,[x <= 0.000e+00]
Siblings/Spouses Aboard,float,1,[0.000e+00 < x <= 2.000e+00]
Siblings/Spouses Aboard,float,2,[2.000e+00 < x]


* As requested with ``output_dtype="float"``, output labels are the numeric ranks of the modalities

* For quantitative feature ``Age``, the selected combination of modalities groups ages as follows:
    * modality ``0``: 7 years old or younger (``content==["x <= 7.000e+00"]``)
    * modality ``1``: older than 7 years (``content==["7.000e+00 < x"]``)

* For qualitative categorical feature ``Sex``, the selected combination of modalities has left modalities ``content=["male"]`` in modality ``0`` and ``content=["female"]`` in modality ``1`` (no combination possible)

* For qualitative ordinal feature ``Pclass``, the selected combination of modalities groups classes 1 and 2 in modality ``0`` (``content==[1, 2]``) and class 3 in modality ``1`` (``content==[3]``). The user-provided ordering of modalities has been preserved.
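
To illustrate how the labels in the summary relate to raw values, the sketch below assigns a label from the carved thresholds with ``bisect_left``, which matches the ``x <= threshold`` convention of the ``content`` column. The thresholds are copied from the summary (``7.729e+01`` is a rounded display value), and the helper ``carved_label`` is hypothetical. In practice ``auto_carver.transform`` does this for you.

```python
from bisect import bisect_left

# Upper bounds of the carved modalities, copied from the summary above.
carved_thresholds = {"Age": [7.0], "Fare": [9.5, 77.29]}


def carved_label(value, thresholds):
    """Index of the first interval `x <= threshold` that contains `value`."""
    return bisect_left(thresholds, value)


print(carved_label(7.0, carved_thresholds["Age"]))       # 0  (x <= 7)
print(carved_label(22.0, carved_thresholds["Age"]))      # 1  (7 < x)
print(carved_label(7.25, carved_thresholds["Fare"]))     # 0  (x <= 9.5)
print(carved_label(71.2833, carved_thresholds["Fare"]))  # 1  (9.5 < x <= 77.29)
print(carved_label(100.0, carved_thresholds["Fare"]))    # 2  (77.29 < x)
```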

### Detailed overview of tested combinations

In [20]:
auto_carver.history(feature="Pclass")

Unnamed: 0,combination,tschuprowt,viability,viability_message,grouping_nan
0,"[[1, 1], [2, 2], [3, 3]]",0.269965,,[Raw X distribution],False
1,"[[1, 1, 2, 2], [3, 3]]",0.300144,True,[Combination robust between X and X_dev],False
2,"[[1, 1], [2, 2], [3, 3]]",0.269965,,[Not checked],False
3,"[[1, 1], [2, 2, 3, 3]]",0.265643,,[Not checked],False


* The most associated combination (the first tested out, where ``viability_message!=["Raw X distribution"]``) groups ``Pclass==1`` with ``Pclass==2`` and leaves ``Pclass==3`` as its own modality

* Tschuprow's T for this combination is ``0.300144`` (greater than the raw distribution, with ``0.269965``)

* This combination has been tested as viable: ``viability_message==["Combination robust between X and X_dev"]``

* The following combinations (less associated with the target) were not tested: ``viability_message==["Not checked"]``

* For all combinations, ``grouping_nan==False`` means that NaNs are not grouped with other modalities in that combination (as requested with ``dropna=False``)
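
The two Tschuprow's T values above can be approximately reproduced by hand. The sketch below rebuilds the train contingency tables for ``Pclass`` from the value counts and rounded target rates shown earlier (so the counts are reconstructed, not read from the library), then computes a Pearson chi-squared statistic. Applying a Yates continuity correction on the 2x2 table is an assumption; it is the default in ``scipy.stats.chi2_contingency``, and the reported values are consistent with it.

```python
from math import sqrt


def chi2_statistic(table, yates=False):
    """Pearson chi-squared statistic of a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:  # continuity correction, assumed for 2x2 tables
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    return chi2


def tschuprow_t(table):
    n = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    chi2 = chi2_statistic(table, yates=(n_rows == 2 and n_cols == 2))
    return sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))


# Rows: Pclass modalities; columns: (survived, did not survive), reconstructed counts.
raw = [[88, 54], [59, 67], [82, 244]]
carved = [[88 + 59, 54 + 67], [82, 244]]

t_raw, t_carved = tschuprow_t(raw), tschuprow_t(carved)
print(round(t_raw, 3), round(t_carved, 3))  # 0.27 0.3
```

Grouping classes 1 and 2 loses a little chi-squared but removes one modality, and the net effect on T is positive.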

## Saving and Loading AutoCarver

### Saving

All **Carvers** can safely be stored as a .json file.

In [21]:
import json

# storing as json file
with open('binary_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)

### Loading

**Carvers** can safely be loaded from a .json file.

In [22]:
import json

from AutoCarver import load_carver

# loading json file
with open('binary_carver.json', 'r') as my_carver_json:
    auto_carver = load_carver(json.load(my_carver_json))

## Applying AutoCarver

In [23]:
dev_set_processed = auto_carver.transform(dev_set)

In [24]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

Unnamed: 0,Age,Siblings/Spouses Aboard,Pclass,Sex,Fare,Parents/Children Aboard
0.0,0.05802,0.682594,0.450512,0.665529,0.34471,0.805461
1.0,0.94198,0.269625,0.549488,0.334471,0.549488,0.194539
2.0,,0.047782,,,0.105802,


# Feature Selection

## Selectors settings

### Features to select from

Here all features have been carved using ``BinaryCarver``, hence all features are qualitative.

In [25]:
features = qualitative_features + quantitative_features + ordinal_features


### Number of features to select

The attribute ``n_best`` allows one to choose the number of features to be selected per data type (quantitative and qualitative).

In [26]:
n_best = 6  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics

## Using Selectors

In [27]:
from AutoCarver.selectors import ClassificationSelector

# select the most target associated qualitative features
feature_selector = ClassificationSelector(
    qualitative_features=features,
    n_best=n_best,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])

------
[Selector] Selecting from qualitative features: ['Age', 'Siblings/Spouses Aboard', 'Pclass', 'Sex', 'Fare', 'Parents/Children Aboard']
---

 - [Selector] Association between X and y


Unnamed: 0,dtype,pct_nan,pct_mode,mode,chi2_statistic,tschuprowt_measure
Sex,int64,0.0,0.636364,0.0,169.204709,0.533719
Pclass,int64,0.0,0.548822,1.0,53.511433,0.300144
Fare,float64,0.0,0.521886,1.0,69.540284,0.287718
Siblings/Spouses Aboard,int64,0.0,0.680135,0.0,22.226931,0.162663
Age,float64,0.0,0.93266,1.0,13.889037,0.152912
Parents/Children Aboard,int64,0.0,0.737374,0.0,11.057602,0.136439



 - [Selector] Association between X and y, filtered for inter-feature assocation


Unnamed: 0,dtype,pct_nan,pct_mode,mode,chi2_statistic,tschuprowt_measure
Sex,int64,0.0,0.636364,0.0,169.204709,0.533719
Pclass,int64,0.0,0.548822,1.0,53.511433,0.300144
Fare,float64,0.0,0.521886,1.0,69.540284,0.287718
Siblings/Spouses Aboard,int64,0.0,0.680135,0.0,22.226931,0.162663
Age,float64,0.0,0.93266,1.0,13.889037,0.152912
Parents/Children Aboard,int64,0.0,0.737374,0.0,11.057602,0.136439



 - [Selector] Selected qualitative features: ['Sex', 'Pclass', 'Fare', 'Siblings/Spouses Aboard', 'Age', 'Parents/Children Aboard']
------



* Feature ``Sex`` is the most associated with the target ``Survived``. Tschuprow's T value is ``tschuprowt_measure=0.533719``

* This feature has 0 % of NaNs (``pct_nan=0.0``) and its mode, ``0``, represents 64 % of observed data (``pct_mode=0.636364``)

* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
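
As a simplified picture of what ``n_best`` controls, the sketch below ranks the features by the Tschuprow's T values printed above and keeps the top ``n_best``. It deliberately ignores the inter-feature association filtering that the selector also reports, and the helper ``top_n`` is hypothetical.

```python
# Tschuprow's T per feature, copied from the selector output above.
association = {
    "Sex": 0.533719,
    "Pclass": 0.300144,
    "Fare": 0.287718,
    "Siblings/Spouses Aboard": 0.162663,
    "Age": 0.152912,
    "Parents/Children Aboard": 0.136439,
}


def top_n(measures, n_best):
    """Names of the n_best features, most associated with the target first."""
    ranked = sorted(measures, key=measures.get, reverse=True)
    return ranked[:n_best]


print(top_n(association, 3))  # ['Sex', 'Pclass', 'Fare']
```

With ``n_best = 6`` and only six candidate features, every feature is kept, which is why the selector is used here mainly for its statistics.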