# Setting things up

## Installation

In [1]:
# %pip install AutoCarver[jupyter]

In [2]:
import sys
import os

print(os.listdir('../../../../../AutoCarver'))

sys.path.append('../../../../../AutoCarver')
sys.path.append('../../../../../AutoCarver/AutoCarver')
sys.path.append('../../../../../AutoCarver/AutoCarver/discretizers')
sys.path.append('../../../../../AutoCarver/AutoCarver/discretizers/utils')
import AutoCarver


['.coverage', '.git', '.github', '.gitignore', '.ipynb_checkpoints', '.pytest_cache', '.readthedocs.yaml', 'AutoCarver', 'AutoCarver.egg-info', 'dist', 'docs', 'LICENSE', 'pyproject.toml', 'README.md', 'requirements.txt', 'setup.cfg', 'setup.py', 'tests', 'test_package.ipynb']


## Iris Data

In this example notebook, we will use the Iris dataset.

The Iris dataset is a classic and widely used dataset in the field of machine learning and pattern recognition. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936 and has since become a benchmark dataset for various classification and clustering tasks.

The dataset consists of measurements from 150 iris flowers, belonging to three different species: setosa, versicolor, and virginica. Four features are included for each flower: sepal length, sepal width, petal length, and petal width, all measured in centimeters.

The primary objective of the Iris dataset is typically to classify iris flowers into one of the three species based on these four features (multiclass classification).

In [3]:
import pandas as pd

from sklearn import datasets

# Load dataset directly from sklearn
iris = datasets.load_iris(as_frame=True)

# conversion to pandas
iris_data = iris["data"]
iris_data["target"] = list(map(lambda u: iris["target_names"][u], iris["target"]))

# Display the first few rows of the dataset
iris_data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Target type and Carver selection

In [4]:
target = "target"

iris_data[target].value_counts(dropna=False)

target
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

The target ``"target"`` is a multiclass target of type ``str`` used in a classification task. Hence we will use ``AutoCarver.MulticlassCarver`` and ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks.
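A quick way to double-check the kind of target before picking a Carver is to look at its dtype and its number of distinct values. The helper below, ``describe_target``, is a hypothetical illustration (it is not part of AutoCarver) and its rules are simplifying assumptions that only cover basic cases.

```python
import pandas as pd


def describe_target(y: pd.Series) -> str:
    """Rough guess of the target type (hypothetical helper, not an AutoCarver function)."""
    n_values = y.nunique(dropna=True)
    if n_values == 2:
        return "binary"
    if pd.api.types.is_float_dtype(y):
        # assumption: a float target with more than two values is treated as continuous
        return "continuous"
    return "multiclass"


# toy target mimicking the Iris species column (three string classes)
toy_target = pd.Series(["setosa", "versicolor", "virginica"] * 50, name="target")
target_type = describe_target(toy_target)
print(target_type)
```

Applied to ``iris_data[target]``, such a check would be expected to report a multiclass target, which is consistent with the choice of ``MulticlassCarver``.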

## Data Sampling

In [5]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(iris_data, test_size=0.33, random_state=42, stratify=iris_data[target])



In [6]:
# checking target rate per dataset
train_set[target].value_counts(dropna=False, normalize=True), dev_set[target].value_counts(dropna=False, normalize=True)

(target
 setosa        0.34
 virginica     0.33
 versicolor    0.33
 Name: proportion, dtype: float64,
 target
 virginica     0.34
 versicolor    0.34
 setosa        0.32
 Name: proportion, dtype: float64)

# Picking up columns to Carve

In [7]:
train_set.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
136,6.3,3.4,5.6,2.4,virginica
17,5.1,3.5,1.4,0.3,setosa
142,5.8,2.7,5.1,1.9,virginica
59,5.2,2.7,3.9,1.4,versicolor
6,4.6,3.4,1.4,0.3,setosa


In [8]:
# column data types
train_set.dtypes

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
target                object
dtype: object

In [9]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


All features are quantitative continuous features. Those features will be added to the list of ``quantitative_features``.

In [10]:
# lists of features per data type
quantitative_features = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
qualitative_features = []
ordinal_features = []

# user-specified ordering for ordinal features
values_orders = {}

# Using AutoCarver

## AutoCarver settings

### Representativeness of modalities

The attribute ``min_freq`` allows one to choose the minimum frequency per basic modality. It is used by **Discretizers**:

- For quantitative features, it defines the number of quantiles used to initially discretize the features.

- For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.

In [11]:
min_freq = 0.1

**Tip:** should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust)
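As a rough illustration of how ``min_freq`` translates into quantiles, the sketch below uses plain ``pandas.qcut`` on synthetic values. It assumes that the number of quantiles is ``1 / min_freq``; it is not AutoCarver's ``ContinuousDiscretizer``, which additionally keeps track of over-represented values.

```python
import pandas as pd

min_freq = 0.1

# assumption: min_freq=0.1 corresponds to 1 / 0.1 = 10 quantile buckets
n_quantiles = int(round(1 / min_freq))

# synthetic, evenly spread values (not the Iris data)
values = pd.Series(range(100), dtype=float)

# quantile discretization: each bucket should hold about min_freq of the rows
buckets = pd.qcut(values, q=n_quantiles, duplicates="drop")
frequencies = buckets.value_counts(normalize=True).sort_index()
print(frequencies)
```

With real data, ties can make some buckets larger or smaller than ``min_freq``, which is why ``duplicates="drop"`` is passed here.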

### Desired number of modalities

The attribute ``max_n_mod`` allows one to choose the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities per consecutive combination of modalities.

In [12]:
max_n_mod = 5

**Tip:** should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust)
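To get a feel for how ``max_n_mod`` drives the number of candidate combinations, one can count the ways of grouping ``n`` ordered modalities into 2 to ``max_n_mod`` consecutive groups by choosing cut points between neighbouring modalities. This is a sketch under that assumption, not AutoCarver's code; the function name is hypothetical. It does reproduce the totals shown by the progress bars in the fitting logs of this notebook (255, 162 and 56 combinations for 10, 9 and 7 raw modalities with ``max_n_mod=5``).

```python
from math import comb


def count_consecutive_groupings(n_modalities: int, max_n_mod: int) -> int:
    """Number of ways to split ordered modalities into 2..max_n_mod consecutive groups."""
    # k consecutive groups <=> k - 1 cut points chosen among the n - 1 gaps
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, max_n_mod + 1))


for n_modalities in (10, 9, 7):
    print(n_modalities, count_consecutive_groupings(n_modalities, max_n_mod=5))
```

The count grows quickly with both the number of raw modalities (driven by ``min_freq``) and ``max_n_mod``, which is why lower values of ``max_n_mod`` are faster.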

### Association metric

The attribute ``sort_by`` allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric and the best ranked viable combination is returned by **Carvers**.

In [13]:
# For MulticlassCarver, to be chosen among ["tschuprowt", "cramerv"]
sort_by = "cramerv"  # "tschuprowt"

**Tip:** use ``"tschuprowt"`` for more robust results and fewer output modalities, use ``"cramerv"`` for more output modalities.
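The difference between the two metrics can be seen on a small contingency table. The sketch below uses the textbook definitions of Cramér's V and Tschuprow's T computed from the chi-squared statistic; the counts are made up for illustration, and whether AutoCarver computes the metrics in exactly this way is an assumption.

```python
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-squared statistic and total count of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramerv(table):
    chi2, n = chi2_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


def tschuprowt(table):
    chi2, n = chi2_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))


# rows: modalities of a carved feature, columns: [target == 0, target == 1] (made-up counts)
fine = [[27, 3], [6, 4], [8, 13], [26, 13]]
# same data with modalities merged two by two
coarse = [[33, 7], [34, 26]]

for name, table in (("fine", fine), ("coarse", coarse)):
    print(name, round(cramerv(table), 4), round(tschuprowt(table), 4))
```

For a binary target, Tschuprow's T divides by the square root of the number of modalities minus one, so it penalizes combinations with many modalities, whereas Cramér's V does not. This matches the tip above.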

### Grouping NaNs

The attribute ``dropna`` allows one to choose whether or not ``numpy.nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-NaN values, and then test out all possible combinations with ``numpy.nan``.

In [14]:
dropna = False  # anyway, there are no numpy.nan in this dataset

### Optional attributes

#### Minimal frequency per carved modality

The attribute ``min_freq_mod`` allows one to choose the minimum frequency per output modality. It is used by **Carvers** in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to ``min_freq/2``.

In [15]:
min_freq_mod = None  # if set to 0.05,  at least 5 % of observations per output modality in train and dev sets 

#### Type of output carved features

The attribute ``output_dtype`` allows one to choose the output type:

* Use ``"float"`` for numeric output, where each modality is labeled by its integer rank (default)
* Use ``"str"`` for string output

In [16]:
output_dtype = "float"  # "str"

## Fitting AutoCarver

* First, all quantitative features are discretized:
    1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq=0.1``)
    2. Using ``OrdinalDiscretizer`` for any remaining under-represented values (less frequent than ``min_freq/2=0.05``) to be grouped with its closest modality

* Second, all features are carved following this recipe, for all classes of ``train_set[target]`` (except one):
    1. The raw distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the discretization step
    2. Grouping modalities: all consecutive combinations of modalities are applied to ``train_set``
    3. Computing associations: the association metric (``sort_by="cramerv"``) is computed with the provided target ``train_set[target]``
    4. Combinations are sorted in descending order by association value
    5. Testing robustness: finds the first combination that checks the following:
        - Representativeness of modalities on ``train_set`` and ``dev_set`` (all should be more frequent than ``min_freq_mod``)
        - Distinct target rates per consecutive modalities on ``train_set`` and ``dev_set`` 
        - No inversion of target rates between ``train_set`` and ``dev_set`` (same ordering of modalities by target rate)
    6. (Optional) If requested via ``dropna=True``, and if any, all combinations of modalities with ``numpy.nan`` are applied to ``train_set`` and steps 3. and 4. are run
    7. The carved distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the carving step
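The three robustness checks of step 5 can be sketched in a few lines. The function below is a simplified, hypothetical illustration (not AutoCarver's implementation): it takes, for each dataset, the ordered list of ``(target_rate, frequency)`` pairs per modality. The numbers are copied from the fitting logs of this notebook for ``sepal length (cm)`` with ``y=versicolor``.

```python
def check_viability(train, dev, min_freq_mod):
    """Return (is_viable, messages) for ordered lists of (target_rate, frequency)."""
    messages = []
    for name, distribution in (("X", train), ("X_dev", dev)):
        rates = [rate for rate, _ in distribution]
        frequencies = [frequency for _, frequency in distribution]
        # 1. every modality should be frequent enough
        if any(frequency < min_freq_mod for frequency in frequencies):
            messages.append(f"{name}: non-representative modality")
        # 2. consecutive modalities should have distinct target rates
        if any(left == right for left, right in zip(rates, rates[1:])):
            messages.append(f"{name}: non-distinct target rates per consecutive modalities")

    # 3. modalities should be ordered identically by target rate in both datasets
    def ranking(distribution):
        return sorted(range(len(distribution)), key=lambda i: distribution[i][0])

    if ranking(train) != ranking(dev):
        messages.append("X_dev: inversion of target rates per modality")
    return not messages, messages


# carved distribution of sepal length (cm) for y=versicolor (from the logs)
carved_train = [(0.1, 0.3), (0.4, 0.1), (0.619, 0.21), (0.3333, 0.39)]
carved_dev = [(0.1333, 0.3), (0.5, 0.08), (0.6667, 0.3), (0.1875, 0.32)]

# raw distribution of the same feature (from the logs)
raw_train = [(0.0, 0.11), (0.1, 0.1), (0.2222, 0.09), (0.4, 0.1), (0.5, 0.12),
             (0.7778, 0.09), (0.4, 0.1), (0.5, 0.12), (0.4286, 0.07), (0.0, 0.1)]
raw_dev = [(0.0, 0.1), (0.3333, 0.12), (0.0, 0.08), (0.5, 0.08), (0.7778, 0.18),
           (0.5, 0.12), (0.3333, 0.06), (0.2, 0.2), (0.0, 0.02), (0.0, 0.04)]

viable_carved, messages_carved = check_viability(carved_train, carved_dev, min_freq_mod=0.05)
viable_raw, messages_raw = check_viability(raw_train, raw_dev, min_freq_mod=0.05)
print(viable_carved, messages_carved)
print(viable_raw, messages_raw)
```

With these inputs the carved distribution passes all three checks, while the raw distribution fails on the dev set, where some modalities cover less than 5 % of observations.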

In [17]:
from AutoCarver import MulticlassCarver

# initiating AutoCarver
auto_carver = MulticlassCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    min_freq_mod=min_freq_mod,
    max_n_mod=max_n_mod,
    dropna=dropna,
    sort_by=sort_by,
    output_dtype=output_dtype,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])


---------
[MulticlassCarver] Fit y=versicolor (1/2)
------
------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['sepal length (cm)', 'petal width (cm)', 'petal length (cm)', 'sepal width (cm)']
 - [OrdinalDiscretizer] Fit ['sepal length (cm)', 'petal width (cm)', 'petal length (cm)', 'sepal width (cm)']
------


------
[AutoCarver] Fit sepal length (cm) (1/4)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 4.800e+00,0.0,0.11
4.800e+00 < x <= 5.000e+00,0.1,0.1
5.000e+00 < x <= 5.200e+00,0.2222,0.09
5.200e+00 < x <= 5.500e+00,0.4,0.1
5.500e+00 < x <= 5.800e+00,0.5,0.12
5.800e+00 < x <= 6.100e+00,0.7778,0.09
6.100e+00 < x <= 6.300e+00,0.4,0.1
6.300e+00 < x <= 6.700e+00,0.5,0.12
6.700e+00 < x <= 7.000e+00,0.4286,0.07
7.000e+00 < x,0.0,0.1

target_rate,frequency
0.0,0.1
0.3333,0.12
0.0,0.08
0.5,0.08
0.7778,0.18
0.5,0.12
0.3333,0.06
0.2,0.2
0.0,0.02
0.0,0.04


Grouping modalities   : 100%|██████████| 255/255 [00:00<00:00, 7056.09it/s]
Computing associations: 100%|██████████| 255/255 [00:00<00:00, 3215.65it/s]


Testing robustness    :  51%|█████     | 129/255 [00:00<00:00, 244.23it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 5.200e+00,0.1,0.3
5.200e+00 < x <= 5.500e+00,0.4,0.1
5.500e+00 < x <= 6.100e+00,0.619,0.21
6.100e+00 < x,0.3333,0.39

target_rate,frequency
0.1333,0.3
0.5,0.08
0.6667,0.3
0.1875,0.32


------


------
[AutoCarver] Fit petal width (cm) (2/4)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.000e-01,0.0,0.05
1.000e-01 < x <= 2.000e-01,0.0,0.17
2.000e-01 < x <= 4.000e-01,0.0,0.11
4.000e-01 < x <= 1.200e+00,0.9167,0.12
1.200e+00 < x <= 1.300e+00,1.0,0.08
1.300e+00 < x <= 1.500e+00,0.9231,0.13
1.500e+00 < x <= 1.800e+00,0.2222,0.09
1.800e+00 < x <= 2.000e+00,0.0,0.08
2.000e+00 < x <= 2.200e+00,0.0,0.07
2.200e+00 < x,0.0,0.1

target_rate,frequency
,0.0
0.0,0.24
0.0,0.06
0.8,0.1
1.0,0.1
0.7143,0.14
0.3333,0.18
0.0,0.06
0.0,0.04
0.0,0.08


Grouping modalities   : 100%|██████████| 255/255 [00:00<00:00, 8193.51it/s]
Computing associations: 100%|██████████| 255/255 [00:00<00:00, 2954.44it/s]
Testing robustness    : 100%|██████████| 255/255 [00:00<00:00, 278.19it/s]

------


------
[AutoCarver] Fit petal length (cm) (3/4)
---

 - [AutoCarver] Raw distribution





Unnamed: 0,target_rate,frequency
x <= 1.400e+00,0.0,0.15
1.400e+00 < x <= 1.600e+00,0.0,0.16
1.600e+00 < x <= 3.500e+00,0.5,0.06
3.500e+00 < x <= 4.200e+00,1.0,0.12
4.200e+00 < x <= 4.600e+00,1.0,0.1
4.600e+00 < x <= 4.900e+00,0.7,0.1
4.900e+00 < x <= 5.300e+00,0.1,0.1
5.300e+00 < x <= 5.800e+00,0.0,0.1
5.800e+00 < x,0.0,0.11

target_rate,frequency
0.0,0.18
0.0,0.08
0.4,0.1
1.0,0.12
0.8571,0.14
0.5,0.08
0.1667,0.12
0.0,0.14
0.0,0.04


Grouping modalities   : 100%|██████████| 162/162 [00:00<00:00, 6366.32it/s]
Computing associations: 100%|██████████| 162/162 [00:00<00:00, 3909.76it/s]
Testing robustness    :   0%|          | 0/162 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 1.600e+00,0.0,0.31
1.600e+00 < x <= 3.500e+00,0.5,0.06
3.500e+00 < x <= 4.600e+00,1.0,0.22
4.600e+00 < x <= 4.900e+00,0.7,0.1
4.900e+00 < x,0.0323,0.31

target_rate,frequency
0.0,0.26
0.4,0.1
0.9231,0.26
0.5,0.08
0.0667,0.3


------


------
[AutoCarver] Fit sepal width (cm) (4/4)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 2.500e+00,0.75,0.12
2.500e+00 < x <= 2.700e+00,0.6364,0.11
2.700e+00 < x <= 2.800e+00,0.4444,0.09
2.800e+00 < x <= 3.000e+00,0.4,0.2
3.000e+00 < x <= 3.200e+00,0.2778,0.18
3.200e+00 < x <= 3.500e+00,0.0,0.16
3.500e+00 < x,0.0,0.14

target_rate,frequency
0.5714,0.14
0.3333,0.06
0.4,0.1
0.4375,0.32
0.1667,0.12
0.25,0.16
0.0,0.1


Grouping modalities   : 100%|██████████| 56/56 [00:00<00:00, 8616.33it/s]
Computing associations: 100%|██████████| 56/56 [00:00<00:00, 3120.02it/s]
Testing robustness    :   4%|▎         | 2/56 [00:00<00:00, 98.06it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 2.700e+00,0.6957,0.23
2.700e+00 < x <= 3.000e+00,0.4138,0.29
3.000e+00 < x <= 3.200e+00,0.2778,0.18
3.200e+00 < x,0.0,0.3

target_rate,frequency
0.5,0.2
0.4286,0.42
0.1667,0.12
0.1538,0.26


------

---------


---------
[MulticlassCarver] Fit y=virginica (2/2)
------
------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['sepal length (cm)', 'petal width (cm)', 'petal length (cm)', 'sepal width (cm)']
 - [OrdinalDiscretizer] Fit ['sepal length (cm)', 'petal width (cm)', 'petal length (cm)', 'sepal width (cm)']
------


------
[AutoCarver] Fit sepal length (cm) (1/4)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 4.800e+00,0.0,0.11
4.800e+00 < x <= 5.000e+00,0.0,0.1
5.000e+00 < x <= 5.200e+00,0.0,0.09
5.200e+00 < x <= 5.500e+00,0.0,0.1
5.500e+00 < x <= 5.800e+00,0.4167,0.12
5.800e+00 < x <= 6.100e+00,0.2222,0.09
6.100e+00 < x <= 6.300e+00,0.6,0.1
6.300e+00 < x <= 6.700e+00,0.5,0.12
6.700e+00 < x <= 7.000e+00,0.5714,0.07
7.000e+00 < x,1.0,0.1

target_rate,frequency
0.0,0.1
0.1667,0.12
0.0,0.08
0.0,0.08
0.0,0.18
0.5,0.12
0.6667,0.06
0.8,0.2
1.0,0.02
1.0,0.04


Grouping modalities   : 100%|██████████| 255/255 [00:00<00:00, 8650.71it/s]
Computing associations: 100%|██████████| 255/255 [00:00<00:00, 4039.96it/s]
Testing robustness    :  15%|█▍        | 38/255 [00:00<00:00, 235.55it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 5.000e+00,0.0,0.21
5.000e+00 < x <= 5.500e+00,0.0,0.19
5.500e+00 < x <= 6.100e+00,0.3333,0.21
6.100e+00 < x <= 6.700e+00,0.5455,0.22
6.700e+00 < x,0.8235,0.17

target_rate,frequency
0.0909,0.22
0.0,0.16
0.2,0.3
0.7692,0.26
1.0,0.06


------


------
[AutoCarver] Fit petal width (cm) (2/4)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.000e-01,0.0,0.05
1.000e-01 < x <= 2.000e-01,0.0,0.17
2.000e-01 < x <= 4.000e-01,0.0,0.11
4.000e-01 < x <= 1.200e+00,0.0,0.12
1.200e+00 < x <= 1.300e+00,0.0,0.08
1.300e+00 < x <= 1.500e+00,0.0769,0.13
1.500e+00 < x <= 1.800e+00,0.7778,0.09
1.800e+00 < x <= 2.000e+00,1.0,0.08
2.000e+00 < x <= 2.200e+00,1.0,0.07
2.200e+00 < x,1.0,0.1

target_rate,frequency
,0.0
0.0,0.24
0.0,0.06
0.0,0.1
0.0,0.1
0.2857,0.14
0.6667,0.18
1.0,0.06
1.0,0.04
1.0,0.08


Grouping modalities   : 100%|██████████| 255/255 [00:00<00:00, 3629.98it/s]
Computing associations: 100%|██████████| 255/255 [00:00<00:00, 4093.48it/s]
Testing robustness    : 100%|██████████| 255/255 [00:00<00:00, 359.30it/s]

------


------
[AutoCarver] Fit petal length (cm) (3/4)
---

 - [AutoCarver] Raw distribution





Unnamed: 0,target_rate,frequency
x <= 1.400e+00,0.0,0.15
1.400e+00 < x <= 1.600e+00,0.0,0.16
1.600e+00 < x <= 3.500e+00,0.0,0.06
3.500e+00 < x <= 4.200e+00,0.0,0.12
4.200e+00 < x <= 4.600e+00,0.0,0.1
4.600e+00 < x <= 4.900e+00,0.3,0.1
4.900e+00 < x <= 5.300e+00,0.9,0.1
5.300e+00 < x <= 5.800e+00,1.0,0.1
5.800e+00 < x,1.0,0.11

target_rate,frequency
0.0,0.18
0.0,0.08
0.0,0.1
0.0,0.12
0.1429,0.14
0.5,0.08
0.8333,0.12
1.0,0.14
1.0,0.04


Grouping modalities   : 100%|██████████| 162/162 [00:00<00:00, 10466.22it/s]
Computing associations: 100%|██████████| 162/162 [00:00<00:00, 4161.88it/s]
Testing robustness    :   1%|          | 1/162 [00:00<00:04, 35.16it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 4.600e+00,0.0,0.59
4.600e+00 < x <= 4.900e+00,0.3,0.1
4.900e+00 < x <= 5.300e+00,0.9,0.1
5.300e+00 < x,1.0,0.21

target_rate,frequency
0.0323,0.62
0.5,0.08
0.8333,0.12
1.0,0.18


------


------
[AutoCarver] Fit sepal width (cm) (4/4)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 2.500e+00,0.25,0.12
2.500e+00 < x <= 2.700e+00,0.3636,0.11
2.700e+00 < x <= 2.800e+00,0.5556,0.09
2.800e+00 < x <= 3.000e+00,0.35,0.2
3.000e+00 < x <= 3.300e+00,0.4545,0.22
3.300e+00 < x <= 3.500e+00,0.0833,0.12
3.500e+00 < x,0.2143,0.14

target_rate,frequency
0.2857,0.14
0.6667,0.06
0.6,0.1
0.4375,0.32
0.25,0.16
0.1667,0.12
0.0,0.1


Grouping modalities   : 100%|██████████| 56/56 [00:00<00:00, 7242.04it/s]
Computing associations: 100%|██████████| 56/56 [00:00<00:00, 3424.72it/s]
Testing robustness    :  25%|██▌       | 14/56 [00:00<00:00, 234.64it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 2.500e+00,0.25,0.12
2.500e+00 < x <= 2.800e+00,0.45,0.2
2.800e+00 < x <= 3.300e+00,0.4048,0.42
3.300e+00 < x,0.1538,0.26

target_rate,frequency
0.2857,0.14
0.625,0.16
0.375,0.48
0.0909,0.22


------

---------



## AutoCarver analysis

### Carving Summary

In [18]:
auto_carver.summary()

feature,dtype,label,content
petal length (cm)_versicolor,float,0,[x <= 1.600e+00]
petal length (cm)_versicolor,float,1,[1.600e+00 < x <= 3.500e+00]
petal length (cm)_versicolor,float,2,[3.500e+00 < x <= 4.600e+00]
petal length (cm)_versicolor,float,3,[4.600e+00 < x <= 4.900e+00]
petal length (cm)_versicolor,float,4,[4.900e+00 < x]
petal length (cm)_virginica,float,0,[x <= 4.600e+00]
petal length (cm)_virginica,float,1,[4.600e+00 < x <= 4.900e+00]
petal length (cm)_virginica,float,2,[4.900e+00 < x <= 5.300e+00]
petal length (cm)_virginica,float,3,[5.300e+00 < x]
sepal length (cm)_versicolor,float,0,[x <= 5.200e+00]


* As requested with ``output_dtype="float"``, output labels are the integer ranks of modalities

* For ``y==versicolor``, for quantitative feature ``petal length (cm)``, the selected combination of modalities groups petal lengths as follows:
    * modality ``0``: lower or equal to 1.6cm (``content==["x <= 1.600e+00"]``)
    * modality ``1``: greater than 1.6cm and lower or equal to 3.5cm (``content==["1.600e+00 < x <= 3.500e+00"]``)
    * modality ``2``: greater than 3.5cm and lower or equal to 4.6cm (``content==["3.500e+00 < x <= 4.600e+00"]``)
    * modality ``3``: greater than 4.6cm and lower or equal to 4.9cm (``content==["4.600e+00 < x <= 4.900e+00"]``)
    * modality ``4``: greater than 4.9cm (``content==["4.900e+00 < x"]``)

* For ``y==virginica``, for quantitative feature ``petal length (cm)``, the selected combination of modalities groups petal lengths as follows:
    * modality ``0``: lower or equal to 4.6cm (``content==["x <= 4.600e+00"]``)
    * modality ``1``: greater than 4.6cm and lower or equal to 4.9cm (``content==["4.600e+00 < x <= 4.900e+00"]``)
    * modality ``2``: greater than 4.9cm and lower or equal to 5.3cm (``content==["4.900e+00 < x <= 5.300e+00"]``)
    * modality ``3``: greater than 5.3cm (``content==["5.300e+00 < x"]``)
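To make the mapping between raw values and output labels concrete, the sketch below reproduces it by hand for ``petal length (cm)``, using the upper bounds listed in the summary table. The helper ``to_modality`` is hypothetical and only illustrates how right-closed intervals map to ranks; in practice ``auto_carver.transform`` does this work.

```python
from bisect import bisect_left

# upper bounds of the carved intervals, read from the summary table
versicolor_bounds = [1.6, 3.5, 4.6, 4.9]  # petal length (cm)_versicolor
virginica_bounds = [4.6, 4.9, 5.3]        # petal length (cm)_virginica


def to_modality(value: float, bounds: list) -> int:
    """Rank of the right-closed interval containing value (hypothetical helper)."""
    # bisect_left counts the bounds strictly lower than value,
    # so a value equal to a bound stays in the lower interval
    return bisect_left(bounds, value)


for petal_length in (1.4, 4.6, 4.7, 5.6):
    print(
        petal_length,
        to_modality(petal_length, versicolor_bounds),
        to_modality(petal_length, virginica_bounds),
    )
```

The same raw value can thus receive different labels in the two carved features, since each one-vs-rest target gets its own grouping.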

### Detailed overview of tested combinations

In [19]:
auto_carver.history("petal width (cm)_virginica").head()

Unnamed: 0,combination,cramerv,viability,viability_message,grouping_nan,removed
0,"[[x <= 1.000e-01], [1.000e-01 < x <= 2.000e-01...",0.942282,,[Raw X distribution],False,
1,"[[x <= 1.000e-01], [1.000e-01 < x <= 2.000e-01...",0.942282,False,[X_dev: inversion of target rates per modality...,False,
2,"[[x <= 1.000e-01, 1.000e-01 < x <= 2.000e-01],...",0.942282,False,[X_dev: inversion of target rates per modality...,False,
3,"[[x <= 1.000e-01, 1.000e-01 < x <= 2.000e-01, ...",0.942282,False,[X_dev: inversion of target rates per modality...,False,
4,"[[x <= 1.000e-01, 1.000e-01 < x <= 2.000e-01, ...",0.942282,False,[X_dev: inversion of target rates per modality...,False,


In [20]:
auto_carver.history("petal width (cm)_virginica")["viability_message"][2]

['X_dev: inversion of target rates per modality',
 'X_dev: non-representative modality (min_freq_mod=5.00%)',
 'X: non-distinct target rates per consecutive modalities']

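* The most associated combinations (the first ones tested, where ``viability_message!=["Raw X distribution"]``) did not pass the viability tests. Looking at ``viability_message`` for the combination at index ``2``:
    * target rates per modality are inverted on ``X_dev``
    * at least one modality is not representative enough on ``X_dev`` (less frequent than ``min_freq_mod=5.00%``)
    * target rates of consecutive modalities are not distinct on ``X``

* Cramér's V for the combinations displayed above is ``0.942282``, the same value as for the raw distribution

* For ``petal width (cm)``, the fitting logs show that all 255 combinations were tested and that no carved distribution was printed, which suggests that no viable combination was found for this feature

* A combination that passes the viability tests is flagged with ``viability_message==["Combination robust between X and X_dev"]``

* Combinations ranked after the first viable one (less associated with the target) are not tested: ``viability_message==["Not checked"]``

* For all combinations, ``grouping_nan==False`` means that it is not a combination in which NaNs are grouped with other modalities (as requested with ``dropna=False``)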
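When many combinations fail, it can help to count how often each reason appears. The sketch below does so with plain pandas on a toy DataFrame that only mimics the ``viability`` and ``viability_message`` columns shown above; the rows are made up for illustration.

```python
import pandas as pd

# toy history with the same column names as the output above (made-up rows)
history = pd.DataFrame(
    {
        "combination": ["raw", "A", "B", "C"],
        "viability": [None, False, False, True],
        "viability_message": [
            ["Raw X distribution"],
            [
                "X_dev: inversion of target rates per modality",
                "X_dev: non-representative modality (min_freq_mod=5.00%)",
            ],
            ["X_dev: inversion of target rates per modality"],
            ["Combination robust between X and X_dev"],
        ],
    }
)

# keep combinations that were tested and rejected
failed = history[history["viability"].eq(False)]

# one row per message, then count each reason
reasons = failed["viability_message"].explode().value_counts()
print(reasons)
```

The same two lines should apply to the DataFrame returned by ``auto_carver.history(...)``, assuming its ``viability_message`` column holds lists of strings as in the output above.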

## Saving and Loading AutoCarver

### Saving

All **Carvers** can safely be stored as a .json file.

In [21]:
import json

# storing as json file
with open('multiclass_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)

### Loading

**Carvers** can safely be loaded from a .json file.

In [22]:
import json

from AutoCarver import load_carver

# loading json file
with open('multiclass_carver.json', 'r') as my_carver_json:
    auto_carver = load_carver(json.load(my_carver_json))

## Applying AutoCarver

In [23]:
dev_set_processed = auto_carver.transform(dev_set)

In [24]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

Unnamed: 0,sepal length (cm)_versicolor,petal length (cm)_virginica,petal length (cm)_versicolor,sepal length (cm)_virginica,sepal width (cm)_virginica,sepal width (cm)_versicolor
0.0,0.3,0.62,0.26,0.22,0.14,0.2
1.0,0.08,0.08,0.1,0.16,0.16,0.42
2.0,0.3,0.12,0.26,0.3,0.48,0.12
3.0,0.32,0.18,0.08,0.26,0.22,0.26
4.0,,,0.3,0.06,,


# Feature Selection
## Selectors settings

### Features to select from

Here all features have been carved using ``MulticlassCarver``, hence all features are qualitative.

In [25]:
features = auto_carver.features[:]


### Number of features to select

The attribute ``n_best`` allows one to choose the number of features to be selected per data type (quantitative and qualitative).

In [26]:
n_best = 4  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics

## Using Selectors

In [27]:
from AutoCarver.selectors import ClassificationSelector

# select the most target associated qualitative features
feature_selector = ClassificationSelector(
    qualitative_features=features,
    n_best=n_best,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])

------
[Selector] Selecting from qualitative features: ['sepal length (cm)_versicolor', 'petal length (cm)_virginica', 'petal length (cm)_versicolor', 'sepal length (cm)_virginica', 'sepal width (cm)_virginica', 'sepal width (cm)_versicolor']
---

 - [Selector] Association between X and y


Unnamed: 0,dtype,pct_nan,pct_mode,mode,chi2_statistic,tschuprowt_measure
petal length (cm)_versicolor,float64,0.0,0.31,0.0,172.450405,0.780836
petal length (cm)_virginica,float64,0.0,0.59,0.0,95.788392,0.625343
sepal length (cm)_versicolor,float64,0.0,0.39,3.0,85.070452,0.589321
sepal length (cm)_virginica,float64,0.0,0.22,3.0,89.817519,0.563518
sepal width (cm)_virginica,float64,0.0,0.42,2.0,53.078033,0.4655
sepal width (cm)_versicolor,float64,0.0,0.3,3.0,47.9665,0.442518



 - [Selector] Association between X and y, filtered for inter-feature assocation


Unnamed: 0,dtype,pct_nan,pct_mode,mode,chi2_statistic,tschuprowt_measure
petal length (cm)_versicolor,float64,0.0,0.31,0.0,172.450405,0.780836
petal length (cm)_virginica,float64,0.0,0.59,0.0,95.788392,0.625343
sepal length (cm)_versicolor,float64,0.0,0.39,3.0,85.070452,0.589321
sepal length (cm)_virginica,float64,0.0,0.22,3.0,89.817519,0.563518



 - [Selector] Selected qualitative features: ['petal length (cm)_versicolor', 'petal length (cm)_virginica', 'sepal length (cm)_versicolor', 'sepal length (cm)_virginica']
------



* Feature ``petal length (cm)_versicolor`` is the most associated with the target. Tschuprow's T value is ``tschuprowt_measure=0.780836``

* This feature has 0 % of NaNs (``pct_nan=0.0``) and its mode, ``0``, represents 31 % of observed data (``pct_mode=0.31``)

* The four most associated features were selected (``n_best=4``)

* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
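As a sanity check, the reported ``tschuprowt_measure`` can be recomputed from ``chi2_statistic`` with the textbook formula of Tschuprow's T. The sketch below assumes 100 training rows (150 rows split with ``test_size=0.33``) and a contingency table of carved modalities by the three target classes; both are assumptions inferred from the outputs above, and the function name is hypothetical.

```python
from math import sqrt


def tschuprowt_from_chi2(chi2: float, n_obs: int, n_rows: int, n_cols: int) -> float:
    """Tschuprow's T from a chi-squared statistic and the contingency table shape."""
    return sqrt(chi2 / (n_obs * sqrt((n_rows - 1) * (n_cols - 1))))


n_train = 100  # assumption: size of train_set
n_classes = 3  # setosa, versicolor, virginica

# petal length (cm)_versicolor: 5 carved modalities, chi2_statistic=172.450405
t_versicolor = tschuprowt_from_chi2(172.450405, n_train, 5, n_classes)
# petal length (cm)_virginica: 4 carved modalities, chi2_statistic=95.788392
t_virginica = tschuprowt_from_chi2(95.788392, n_train, 4, n_classes)

print(round(t_versicolor, 4), round(t_virginica, 4))
```

Under these assumptions the recomputed values agree with the ``tschuprowt_measure`` column to about four decimal places.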