# Setting things up

## About this notebook

In this notebook, we embark on a journey to refine the Iris Dataset for optimal performance in multiclass classification tasks, leveraging the capabilities of the ``MulticlassCarver`` pipeline. Recognized for its association-maximizing discretization, ``MulticlassCarver`` is a versatile Python tool that gracefully handles diverse data types—be they quantitative or qualitative. Our specific objective is to prepare the dataset for multiclass classification, illuminating the distinctive characteristics of Iris flower species.

The Iris Dataset, a classic in the realm of machine learning, presents sepal and petal dimensions for three different Iris species. All four features are quantitative, so ``MulticlassCarver`` is used here to discretize quantitative features, tailoring them for effective representation in our multiclass classification models.

Throughout this notebook, we'll unravel the intricacies of ``MulticlassCarver``'s discretization pipeline. Although the pipeline adapts to various data types, it is applied here to the four measurements (such as petal length), with the species serving as the classification target.

Join us in this exploration as we harness the power of ``MulticlassCarver`` to preprocess the Iris Dataset. Through effective feature engineering and discretization, our aim is to create a dataset that not only distinguishes between Iris species but also sets the stage for the development of accurate and impactful multiclass classification models.

Let's dive in and uncover the potential of ``MulticlassCarver`` in transforming the Iris Dataset for optimal predictive modeling.


## Installation

In [1]:
# %pip install AutoCarver[jupyter]

## Iris Data

In this example notebook, we will use the Iris dataset.

The Iris dataset is a classic and widely used dataset in the field of machine learning and pattern recognition. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936 and has since become a benchmark dataset for various classification and clustering tasks.

The dataset consists of measurements from 150 iris flowers, belonging to three different species: setosa, versicolor, and virginica. Four features are included for each flower: sepal length, sepal width, petal length, and petal width, all measured in centimeters.

The primary objective of the Iris dataset is typically to classify iris flowers into one of the three species based on these four features (multiclass classification).

In [1]:
from sklearn import datasets

# Load dataset directly from sklearn
iris = datasets.load_iris(as_frame=True)

# features are already a pandas DataFrame (as_frame=True); add the species name as a column
iris_data = iris["data"]
iris_data["iris_type"] = list(map(lambda u: iris["target_names"][u], iris["target"]))

# Display the first few rows of the dataset
iris_data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),iris_type
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Target type and Carver selection

In [2]:
target = "iris_type"

iris_data[target].value_counts(dropna=False)

iris_type
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

The target ``"iris_type"`` is a multiclass target of type ``str`` used in a classification task. Hence we will use ``AutoCarver.MulticlassCarver`` and ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks.

## Data Sampling

In [3]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(iris_data, test_size=0.33, random_state=42, stratify=iris_data[target])

# checking target rate per dataset
train_set[target].value_counts(dropna=False, normalize=True), dev_set[target].value_counts(dropna=False, normalize=True)

(iris_type
 setosa        0.34
 virginica     0.33
 versicolor    0.33
 Name: proportion, dtype: float64,
 iris_type
 virginica     0.34
 versicolor    0.34
 setosa        0.32
 Name: proportion, dtype: float64)

## Picking up columns to Carve

In [4]:
train_set.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),iris_type
136,6.3,3.4,5.6,2.4,virginica
17,5.1,3.5,1.4,0.3,setosa
142,5.8,2.7,5.1,1.9,virginica
59,5.2,2.7,3.9,1.4,versicolor
6,4.6,3.4,1.4,0.3,setosa


In [5]:
# column data types
train_set.dtypes

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
iris_type             object
dtype: object

In [6]:
print(iris["feature_names"])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


All features are quantitative continuous features. They are passed to ``Features`` through its ``quantitatives`` argument.

In [13]:
from AutoCarver import Features

# lists of features per data type
features = Features(quantitatives=["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"])

# Using AutoCarver

## AutoCarver settings

### Representativeness of modalities

The attribute ``min_freq`` allows one to choose the minimum frequency per basic modality. It is used by **Discretizers**:

- For quantitative features, it defines the number of quantiles to initially discretize the features with.

- For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.

In [8]:
min_freq = 0.05

**Tip:** should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust)
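To make the role of ``min_freq`` for quantitative features concrete, here is a minimal, dependency-free sketch of quantile discretization. It assumes that the number of initial quantiles is ``round(1 / min_freq)``. The helper ``quantile_edges`` is hypothetical and is not part of AutoCarver, whose ``ContinuousDiscretizer`` may handle ties and over-represented values differently.

```python
import math


def quantile_edges(values, min_freq):
    """Upper edges of roughly equal-frequency bins (assumed: round(1 / min_freq) bins)."""
    n_bins = max(1, round(1 / min_freq))
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        # index of the last observation falling in the k-th bin
        idx = math.ceil(k * len(ordered) / n_bins) - 1
        edge = ordered[idx]
        # skip duplicated edges caused by tied values
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


values = [i / 10 for i in range(1, 101)]  # 0.1, 0.2, ..., 10.0
edges = quantile_edges(values, min_freq=0.25)
print(edges)
```

With ``min_freq=0.05``, as set in this notebook, the same helper would target 20 bins, each holding about 5 % of the observations.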

### Desired number of modalities

The attribute ``max_n_mod`` allows one to choose the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities for each consecutive combination of modalities.

In [9]:
max_n_mod = 4

**Tip:** should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust)
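A "consecutive combination" can be pictured as a way of cutting an ordered list of modalities into a few adjacent groups. The sketch below enumerates every grouping with 2 to ``max_n_mod`` groups. The helper ``consecutive_groupings`` is hypothetical and only illustrates the idea; it is not AutoCarver's implementation.

```python
from itertools import combinations


def consecutive_groupings(modalities, max_n_mod):
    """Yield every way to merge ordered modalities into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, max_n_mod + 1):
        # choose n_groups - 1 cut positions among the n - 1 gaps between modalities
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


example = list(consecutive_groupings(["a", "b", "c", "d"], max_n_mod=3))
for grouping in example:
    print(grouping)

# number of groupings for 15 ordered modalities and max_n_mod=4
n_for_15 = sum(1 for _ in consecutive_groupings(list(range(15)), max_n_mod=4))
print(n_for_15)
```

For 15 raw modalities and ``max_n_mod=4`` this count is 469, which agrees with the total shown in the progress bars for ``sepal width (cm)`` in the fitting log of this notebook. That suggests, without proving it, that the carver explores a comparable set of combinations.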

### Association metric

The attribute ``sort_by`` allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric and the best ranked viable combination is returned by **Carvers**.

In [13]:
# For MulticlassCarver, to be chosen among ["tschuprowt", "cramerv"]
# sort_by = "cramerv"  # "tschuprowt"

**Tip:** use ``"tschuprowt"`` for more robust results or fewer output modalities, use ``"cramerv"`` for more output modalities.
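The difference between the two metrics can be seen from their textbook definitions, both derived from the chi-squared statistic of a contingency table. The sketch below is a plain-Python illustration using those standard formulas; the function names and the counts are hypothetical and this is not AutoCarver's own code.

```python
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-squared statistic and sample size of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramerv(table):
    """Cramér's V: chi-squared normalised by min(rows - 1, columns - 1)."""
    chi2, n = chi2_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


def tschuprowt(table):
    """Tschuprow's T: chi-squared normalised by sqrt((rows - 1) * (columns - 1))."""
    chi2, n = chi2_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))


# hypothetical counts: rows are carved modalities, columns are (class absent, class present)
two_modalities = [[31, 3], [36, 30]]
three_modalities = [[31, 0], [6, 32], [30, 1]]

print(cramerv(two_modalities), tschuprowt(two_modalities))      # identical for a 2x2 table
print(cramerv(three_modalities), tschuprowt(three_modalities))  # T is smaller with 3 modalities
```

With a binary target, Tschuprow's T shrinks as the number of modalities grows while Cramér's V does not, which is why ``"tschuprowt"`` tends to favour fewer output modalities. The carving summary in this notebook shows the same gap for three-modality features (for instance ``0.853058`` versus ``0.717333``).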

### Grouping NaNs

The attribute ``dropna`` allows one to choose whether or not ``numpy.nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-NaN values, and then test out all possible combinations with ``numpy.nan``.

In [10]:
dropna = False  # anyway, there are no numpy.nan in this dataset
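As an illustration of what "all possible combinations with ``numpy.nan``" could look like, the sketch below takes an already carved grouping and builds one candidate per group, with the NaN modality merged into that group. The helper ``nan_groupings`` and the ``"NaN"`` label are hypothetical; whether AutoCarver also considers other candidates is not covered here.

```python
def nan_groupings(groups, nan_label="NaN"):
    """Return one candidate per group, with the NaN modality merged into that group."""
    candidates = []
    for target in range(len(groups)):
        candidate = [list(group) for group in groups]  # copy, the input stays untouched
        candidate[target].append(nan_label)
        candidates.append(candidate)
    return candidates


carved = [["x <= 1.6"], ["1.6 < x <= 4.9"], ["4.9 < x"]]
candidates = nan_groupings(carved)
for candidate in candidates:
    print(candidate)
```

Each candidate would then be ranked by association and checked for robustness in the same way as the non-NaN combinations.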

### Optional attributes

#### Minimal frequency per carved modality

The attribute ``min_freq_mod`` allows one to choose the minimum frequency per output modality. It is used by **Carvers** in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to ``min_freq/2``.

In [15]:
# min_freq_mod = None  # if set to 0.05,  at least 5 % of observations per output modality in train and dev sets 

#### Type of output carved features

The attribute ``ordinal_encoding`` allows one to choose the output type:

* Use ``True`` for ordinal output, where each modality is replaced by its rank (as set below)
* Use ``False`` for string output

In [11]:
ordinal_encoding = True
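The difference between integer-coded and string-labelled output can be sketched with a carved feature that has two cut points. The edges ``1.6`` and ``4.9`` are the rounded values displayed for ``petal length (cm)__y=versicolor`` in the fitting log; the helpers ``ordinal_code`` and ``interval_label`` are hypothetical and only mimic the label format shown in this notebook.

```python
from bisect import bisect_left


def ordinal_code(value, edges):
    """Rank of the interval containing value, intervals being closed on the right."""
    return bisect_left(edges, value)


def interval_label(code, edges):
    """String label of the interval with the given rank."""
    if code == 0:
        return f"x <= {edges[0]}"
    if code == len(edges):
        return f"{edges[-1]} < x"
    return f"{edges[code - 1]} < x <= {edges[code]}"


edges = [1.6, 4.9]
petal_lengths = [1.4, 1.6, 3.9, 4.9, 5.6]

codes = [ordinal_code(value, edges) for value in petal_lengths]
labels = [interval_label(code, edges) for code in codes]
print(codes)
print(labels)
```

Integer codes keep the ordering of the modalities, which is convenient for models that expect numeric input, while string labels are easier to read in reports.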

## Fitting AutoCarver

* First, all quantitative features are discretized:
    1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq=0.05``)
    2. Using ``OrdinalDiscretizer`` for any remaining under-represented values (less frequent than ``min_freq/2=0.025``) to be grouped with their closest modality

* Second, all features are carved following this recipe, for all classes of ``train_set[target]`` (except one):
    1. The raw distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the discretization step
    2. Grouping modalities: all consecutive combinations of modalities are applied to ``train_set``
    3. Computing associations: the association metric (``sort_by``, left at its default here) is computed with the provided target ``train_set[target]``
    4. Combinations are sorted in descending order by association value
    5. Testing robustness: finds the first combination that checks the following:
        - Representativeness of modalities on ``train_set`` and ``dev_set`` (all should be more frequent than ``min_freq_mod``)
        - Distinct target rates per consecutive modalities on ``train_set`` and ``dev_set`` 
        - No inversion of target rates between ``train_set`` and ``dev_set`` (same ordering of modalities by target rate)
    6. (Optional) If requested via ``dropna=True``, and if any, all combinations of modalities with ``numpy.nan`` are applied to ``train_set`` and steps 3. and 4. are run
    7. The carved distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the carving step
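The three robustness checks of step 5 can be sketched as a single function working on per-modality statistics. This is a simplified illustration under stated assumptions: ``is_viable`` and ``target_rate_ranking`` are hypothetical helpers, "frequent enough" is taken as ``frequency >= min_freq_mod``, and "distinct" is taken as strict inequality of consecutive target rates. AutoCarver's actual tests may differ.

```python
def target_rate_ranking(stats):
    """Indices of modalities sorted by increasing target rate."""
    return sorted(range(len(stats)), key=lambda i: stats[i][0])


def is_viable(train_stats, dev_stats, min_freq_mod):
    """Each stats argument is a list of (target_rate, frequency), one pair per modality."""
    for stats in (train_stats, dev_stats):
        # 1. every modality must be frequent enough
        if any(frequency < min_freq_mod for _, frequency in stats):
            return False
        # 2. consecutive modalities must have distinct target rates
        if any(a[0] == b[0] for a, b in zip(stats, stats[1:])):
            return False
    # 3. modalities must be ordered the same way by target rate in both samples
    return target_rate_ranking(train_stats) == target_rate_ranking(dev_stats)


min_freq_mod = 0.05 / 2  # default of min_freq / 2 with min_freq = 0.05

# carved distribution printed in the log for 'sepal length (cm)__y=versicolor'
train_stats = [(0.0882, 0.34), (0.4545, 0.66)]
dev_stats = [(0.1667, 0.36), (0.4375, 0.64)]
print(is_viable(train_stats, dev_stats, min_freq_mod))

# hypothetical dev sample where the ordering of target rates is inverted
inverted_dev_stats = [(0.70, 0.50), (0.30, 0.50)]
print(is_viable(train_stats, inverted_dev_stats, min_freq_mod))
```

Because combinations are tested in descending order of association, the first one for which such a check succeeds is the one kept.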

In [14]:
from AutoCarver import MulticlassCarver

# initiating AutoCarver
auto_carver = MulticlassCarver(
    features=features,
    ordinal_encoding=ordinal_encoding,
    min_freq=min_freq,
    max_n_mod=max_n_mod,
    dropna=dropna,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])


---------
[MulticlassCarver] Fit y=versicolor (1/2)
------
------
--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
 - [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
 - [OrdinalDiscretizer] Fit Features(['sepal width (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
------

---------
------ [BinaryCarver] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=versicolor') (1/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 4.70e+00,0.0,0.06
4.70e+00 < x <= 4.80e+00,0.0,0.05
4.80e+00 < x <= 4.90e+00,0.0,0.03
4.90e+00 < x <= 5.00e+00,0.1429,0.07
5.00e+00 < x <= 5.10e+00,0.1667,0.06
5.10e+00 < x <= 5.40e+00,0.1429,0.07
5.40e+00 < x <= 5.50e+00,0.6667,0.06
5.50e+00 < x <= 5.70e+00,0.5714,0.07
5.70e+00 < x <= 5.80e+00,0.4,0.05
5.80e+00 < x <= 6.00e+00,0.6667,0.06

target_rate,frequency
0.0,0.1
,0.0
0.3333,0.06
0.3333,0.06
0.0,0.06
0.25,0.08
1.0,0.02
0.8571,0.14
0.5,0.04
0.6667,0.06


Grouping modalities   : 100%|█████████▉| 695/696 [00:00<00:00, 8883.50it/s]
Computing associations: 100%|██████████| 696/696 [00:00<00:00, 4223.39it/s]
Testing robustness    :   8%|▊         | 55/696 [00:00<00:01, 352.10it/s]



 [BinaryCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 5.4e+00,0.0882,0.34
5.4e+00 < x,0.4545,0.66

target_rate,frequency
0.1667,0.36
0.4375,0.64


--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=versicolor') (2/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 2.40e+00,0.8571,0.07
2.40e+00 < x <= 2.50e+00,0.6,0.05
2.50e+00 < x <= 2.60e+00,0.75,0.04
2.60e+00 < x <= 2.70e+00,0.5714,0.07
2.70e+00 < x <= 2.80e+00,0.4444,0.09
2.80e+00 < x <= 2.90e+00,0.6667,0.06
2.90e+00 < x <= 3.00e+00,0.2857,0.14
3.00e+00 < x <= 3.10e+00,0.3333,0.09
3.10e+00 < x <= 3.20e+00,0.2222,0.09
3.20e+00 < x <= 3.30e+00,0.0,0.04

target_rate,frequency
0.75,0.08
0.3333,0.06
0.0,0.02
0.5,0.04
0.4,0.1
0.75,0.08
0.3333,0.24
0.0,0.04
0.25,0.08
0.5,0.04


Grouping modalities   : 100%|█████████▉| 468/469 [00:00<00:00, 8871.18it/s]
Computing associations: 100%|██████████| 469/469 [00:00<00:00, 4179.52it/s]
Testing robustness    :   0%|          | 0/469 [00:00<?, ?it/s]



 [BinaryCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 2.9e+00,0.6316,0.38
2.9e+00 < x,0.1452,0.62

target_rate,frequency
0.5263,0.38
0.2258,0.62


--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=versicolor') (3/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.30e+00,0.0,0.04
1.30e+00 < x <= 1.40e+00,0.0,0.11
1.40e+00 < x <= 1.50e+00,0.0,0.09
1.50e+00 < x <= 1.60e+00,0.0,0.07
1.60e+00 < x <= 3.50e+00,0.5,0.06
3.50e+00 < x <= 3.90e+00,1.0,0.05
3.90e+00 < x <= 4.20e+00,1.0,0.07
4.20e+00 < x <= 4.40e+00,1.0,0.06
4.40e+00 < x <= 4.60e+00,1.0,0.04
4.60e+00 < x <= 4.70e+00,1.0,0.03

target_rate,frequency
0.0,0.14
0.0,0.04
0.0,0.08
,0.0
0.4,0.1
1.0,0.02
1.0,0.1
,0.0
0.8571,0.14
1.0,0.04


Grouping modalities   : 100%|█████████▉| 574/575 [00:00<00:00, 5419.80it/s]
Computing associations: 100%|██████████| 575/575 [00:00<00:00, 3752.47it/s]
Testing robustness    :   0%|          | 0/575 [00:00<?, ?it/s]




 [BinaryCarver] Carved distribution


Unnamed: 0,target_rate,frequency
x <= 1.6e+00,0.0,0.31
1.6e+00 < x <= 4.9e+00,0.8421,0.38
4.9e+00 < x,0.0323,0.31

target_rate,frequency
0.0,0.26
0.7273,0.44
0.0667,0.3


--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=versicolor') (4/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.00e-01,0.0,0.05
1.00e-01 < x <= 2.00e-01,0.0,0.17
2.00e-01 < x <= 3.00e-01,0.0,0.05
3.00e-01 < x <= 4.00e-01,0.0,0.06
4.00e-01 < x <= 1.10e+00,0.8571,0.07
1.10e+00 < x <= 1.20e+00,1.0,0.05
1.20e+00 < x <= 1.30e+00,1.0,0.08
1.30e+00 < x <= 1.40e+00,1.0,0.06
1.40e+00 < x <= 1.60e+00,0.7778,0.09
1.60e+00 < x <= 1.80e+00,0.1429,0.07

target_rate,frequency
,0.0
0.0,0.24
0.0,0.04
0.0,0.02
0.8,0.1
,0.0
1.0,0.1
0.5,0.04
0.8571,0.14
0.1429,0.14


Grouping modalities   : 100%|█████████▉| 376/377 [00:00<00:00, 10249.22it/s]
Computing associations: 100%|██████████| 377/377 [00:00<00:00, 4246.36it/s]
Testing robustness    :   0%|          | 0/377 [00:00<?, ?it/s]



 [BinaryCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 4.0e-01,0.0,0.33
4.0e-01 < x <= 1.6e+00,0.9143,0.35
1.6e+00 < x,0.0312,0.32

target_rate,frequency
0.0,0.3
0.8421,0.38
0.0625,0.32


---------


---------
[MulticlassCarver] Fit y=virginica (2/2)
------
------
--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
 - [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
 - [OrdinalDiscretizer] Fit Features(['sepal width (cm)__y=virginica', 'petal width (cm)__y=virginica'])
------

---------
------ [BinaryCarver] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=virginica') (1/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 4.70e+00,0.0,0.06
4.70e+00 < x <= 4.80e+00,0.0,0.05
4.80e+00 < x <= 4.90e+00,0.0,0.03
4.90e+00 < x <= 5.00e+00,0.0,0.07
5.00e+00 < x <= 5.10e+00,0.0,0.06
5.10e+00 < x <= 5.40e+00,0.0,0.07
5.40e+00 < x <= 5.50e+00,0.0,0.06
5.50e+00 < x <= 5.70e+00,0.2857,0.07
5.70e+00 < x <= 5.80e+00,0.6,0.05
5.80e+00 < x <= 6.00e+00,0.3333,0.06

target_rate,frequency
0.0,0.1
,0.0
0.3333,0.06
0.0,0.06
0.0,0.06
0.0,0.08
0.0,0.02
0.0,0.14
0.0,0.04
0.3333,0.06


Grouping modalities   : 100%|█████████▉| 695/696 [00:00<00:00, 9688.13it/s]
Computing associations: 100%|██████████| 696/696 [00:00<00:00, 4330.79it/s]
Testing robustness    :   0%|          | 1/696 [00:00<00:06, 104.32it/s]




 [BinaryCarver] Carved distribution


Unnamed: 0,target_rate,frequency
x <= 5.7e+00,0.0426,0.47
5.7e+00 < x,0.5849,0.53

target_rate,frequency
0.0385,0.52
0.6667,0.48


--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=virginica') (2/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 2.40e+00,0.1429,0.07
2.40e+00 < x <= 2.50e+00,0.4,0.05
2.50e+00 < x <= 2.60e+00,0.25,0.04
2.60e+00 < x <= 2.70e+00,0.4286,0.07
2.70e+00 < x <= 2.80e+00,0.5556,0.09
2.80e+00 < x <= 2.90e+00,0.1667,0.06
2.90e+00 < x <= 3.00e+00,0.4286,0.14
3.00e+00 < x <= 3.10e+00,0.2222,0.09
3.10e+00 < x <= 3.20e+00,0.5556,0.09
3.20e+00 < x <= 3.30e+00,0.75,0.04

target_rate,frequency
0.0,0.08
0.6667,0.06
1.0,0.02
0.5,0.04
0.6,0.1
0.25,0.08
0.5,0.24
1.0,0.04
0.0,0.08
0.0,0.04


Grouping modalities   : 100%|█████████▉| 468/469 [00:00<00:00, 9538.30it/s]
Computing associations: 100%|██████████| 469/469 [00:00<00:00, 4816.13it/s]
Testing robustness    :   4%|▍         | 18/469 [00:00<00:01, 364.59it/s]



 [BinaryCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 2.6e+00,0.25,0.16
2.6e+00 < x <= 3.3e+00,0.431,0.58
3.3e+00 < x,0.1538,0.26

target_rate,frequency
0.375,0.16
0.4194,0.62
0.0909,0.22


--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=virginica') (3/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.30e+00,0.0,0.04
1.30e+00 < x <= 1.40e+00,0.0,0.11
1.40e+00 < x <= 1.50e+00,0.0,0.09
1.50e+00 < x <= 1.60e+00,0.0,0.07
1.60e+00 < x <= 3.50e+00,0.0,0.06
3.50e+00 < x <= 3.90e+00,0.0,0.05
3.90e+00 < x <= 4.20e+00,0.0,0.07
4.20e+00 < x <= 4.40e+00,0.0,0.06
4.40e+00 < x <= 4.60e+00,0.0,0.04
4.60e+00 < x <= 4.70e+00,0.0,0.03

target_rate,frequency
0.0,0.14
0.0,0.04
0.0,0.08
,0.0
0.0,0.1
0.0,0.02
0.0,0.1
,0.0
0.1429,0.14
0.0,0.04


Grouping modalities   : 100%|█████████▉| 574/575 [00:00<00:00, 8610.53it/s]
Computing associations: 100%|██████████| 575/575 [00:00<00:00, 4029.74it/s]
Testing robustness    :   0%|          | 0/575 [00:00<?, ?it/s]



 [BinaryCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 4.9e+00,0.0435,0.69
4.9e+00 < x,0.9677,0.31

target_rate,frequency
0.0857,0.7
0.9333,0.3


--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=virginica') (4/4)
 [BinaryCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.00e-01,0.0,0.05
1.00e-01 < x <= 2.00e-01,0.0,0.17
2.00e-01 < x <= 3.00e-01,0.0,0.05
3.00e-01 < x <= 4.00e-01,0.0,0.06
4.00e-01 < x <= 1.10e+00,0.0,0.07
1.10e+00 < x <= 1.20e+00,0.0,0.05
1.20e+00 < x <= 1.30e+00,0.0,0.08
1.30e+00 < x <= 1.40e+00,0.0,0.06
1.40e+00 < x <= 1.50e+00,0.1429,0.07
1.50e+00 < x <= 1.80e+00,0.7778,0.09

target_rate,frequency
,0.0
0.0,0.24
0.0,0.04
0.0,0.02
0.0,0.1
,0.0
0.0,0.1
0.5,0.04
0.2,0.1
0.6667,0.18


Grouping modalities   : 100%|█████████▉| 376/377 [00:00<00:00, 8145.04it/s]
Computing associations: 100%|██████████| 377/377 [00:00<00:00, 4131.34it/s]
Testing robustness    :   0%|          | 0/377 [00:00<?, ?it/s]








 [BinaryCarver] Carved distribution


Unnamed: 0,target_rate,frequency
x <= 1.5e+00,0.0152,0.66
1.5e+00 < x,0.9412,0.34

target_rate,frequency
0.0625,0.64
0.8333,0.36


---------
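The log above shows one ``BinaryCarver`` fit per class (``y=versicolor`` and ``y=virginica``), each on its own copy of the features. A minimal sketch of this one-vs-rest decomposition follows. The helper ``one_vs_rest_targets`` is hypothetical, and skipping the first class in sorted order is an assumption that merely reproduces the two classes seen in the log.

```python
def one_vs_rest_targets(labels):
    """Binary indicator per class, skipping the first class in sorted order (assumption)."""
    classes = sorted(set(labels))
    return {
        f"y={cls}": [int(label == cls) for label in labels]
        for cls in classes[1:]
    }


labels = ["setosa", "versicolor", "virginica", "versicolor"]
binary_targets = one_vs_rest_targets(labels)
for name, indicator in binary_targets.items():
    print(name, indicator)
```

Each binary indicator then plays the role of the target when a feature such as ``sepal length (cm)__y=versicolor`` is carved, which is why every input feature yields one carved feature per retained class.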



## AutoCarver analysis

### Carving Summary

In [15]:
auto_carver.summary()

feature,cramerv,tschuprowt,n_mod,label,content,target_rate,frequency
Quantitative('sepal length (cm)__y=versicolor'),0.346586,0.346586,2,0,x <= 5.4e+00,0.088235,0.34
Quantitative('sepal length (cm)__y=versicolor'),0.346586,0.346586,2,1,5.4e+00 < x,0.454545,0.66
Quantitative('sepal width (cm)__y=versicolor'),0.480207,0.480207,2,0,x <= 2.9e+00,0.631579,0.38
Quantitative('sepal width (cm)__y=versicolor'),0.480207,0.480207,2,1,2.9e+00 < x,0.145161,0.62
Quantitative('petal length (cm)__y=versicolor'),0.853058,0.717333,3,0,x <= 1.6e+00,0.0,0.31
Quantitative('petal length (cm)__y=versicolor'),0.853058,0.717333,3,1,1.6e+00 < x <= 4.9e+00,0.842105,0.38
Quantitative('petal length (cm)__y=versicolor'),0.853058,0.717333,3,2,4.9e+00 < x,0.032258,0.31
Quantitative('petal width (cm)__y=versicolor'),0.912212,0.767075,3,0,x <= 4.0e-01,0.0,0.33
Quantitative('petal width (cm)__y=versicolor'),0.912212,0.767075,3,1,4.0e-01 < x <= 1.6e+00,0.914286,0.35
Quantitative('petal width (cm)__y=versicolor'),0.912212,0.767075,3,2,1.6e+00 < x,0.03125,0.32


### Detailed overview of tested combinations

In [19]:
features["sepal width (cm)__y=virginica"].history.head(20)

Unnamed: 0,info,cramerv,tschuprowt,combination,n_mod,dropna,train,viable,dev
0,Raw distribution (n_mod=15>max_n_mod=4),0.414206,0.214133,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",15,False,,,
1,Not viable,0.291441,0.245071,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",3,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Inversion of target..."
2,Not viable,0.315349,0.239613,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': False, 'info': 'Non-representative ...",False,{'viable': None}
3,Not viable,0.306029,0.232532,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Inversion of target..."
4,Not viable,0.303669,0.230739,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': False, 'info': 'Non-representative ...",False,{'viable': None}
5,Not viable,0.303397,0.230532,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': False, 'info': 'Non-representative ...",False,{'viable': None}
6,Not viable,0.302596,0.229923,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': False, 'info': 'Non-representative ...",False,{'viable': None}
7,Not viable,0.301564,0.229139,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Inversion of target..."
8,Not viable,0.271729,0.228496,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",3,False,"{'viable': False, 'info': 'Non-representative ...",False,{'viable': None}
9,Not viable,0.299915,0.227886,"{'x <= 2.40e+00': 'x <= 2.40e+00', '2.40e+00 <...",4,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Inversion of target..."


In [21]:
features["sepal width (cm)__y=virginica"].history.dev[1]

{'viable': False, 'info': 'Inversion of target rates per modality'}

* The most associated combination of feature ``sepal width (cm)__y=virginica`` (the first one tested, i.e. the first row whose ``info`` is not the raw distribution) did not pass the viability tests. Its ``dev`` entry explains why:
    * ``'Inversion of target rates per modality'``: target rates (mean values of ``iris_type=="virginica"`` per grouped modality) are not ranked the same between ``train_set`` and ``dev_set``

* For feature ``sepal width (cm)__y=virginica``, the 15th combination is the first to pass the tests:
    - It is the first row with ``viable==True``
    - Cramér's V with ``iris_type`` is ``0.252203`` for this combination
    - The following combinations (less associated with the target) were not tested

* For all combinations, ``dropna==False`` means that NaNs are not grouped with other modalities in that combination (as requested with ``dropna=False``)
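Reading the history amounts to scanning the rows in order and stopping at the first viable one. The sketch below does this on plain dictionaries that copy a few columns of the rows displayed above; ``first_viable`` is a hypothetical helper, and the extra row appended at the end is invented purely to show a positive result.

```python
def first_viable(history_rows):
    """Position and content of the first row flagged as viable, or None if there is none."""
    for position, row in enumerate(history_rows):
        if row["viable"] is True:
            return position, row
    return None


# first rows of the history displayed above (subset of columns)
shown_rows = [
    {"info": "Raw distribution", "tschuprowt": 0.214133, "n_mod": 15, "viable": None},
    {"info": "Not viable", "tschuprowt": 0.245071, "n_mod": 3, "viable": False},
    {"info": "Not viable", "tschuprowt": 0.239613, "n_mod": 4, "viable": False},
    {"info": "Not viable", "tschuprowt": 0.232532, "n_mod": 4, "viable": False},
]
print(first_viable(shown_rows))

# hypothetical row, not taken from the notebook output
hypothetical_row = {"info": "hypothetical viable row", "tschuprowt": 0.2, "n_mod": 3, "viable": True}
print(first_viable(shown_rows + [hypothetical_row]))
```

None of the rows displayed above is viable, which is consistent with the first robust combination appearing further down the history.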

## Saving and Loading AutoCarver

### Saving

All **Carvers** can safely be stored as a .json file.

In [22]:
auto_carver.save("multiclass_carver.json")

### Loading

**Carvers** can safely be loaded from a .json file.

In [24]:
from AutoCarver import MulticlassCarver

auto_carver = MulticlassCarver.load("multiclass_carver.json")

## Applying AutoCarver

In [25]:
dev_set_processed = auto_carver.transform(dev_set)

In [26]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

Unnamed: 0,sepal length (cm)__y=versicolor,sepal width (cm)__y=versicolor,petal length (cm)__y=versicolor,petal width (cm)__y=versicolor,sepal length (cm)__y=virginica,sepal width (cm)__y=virginica,petal length (cm)__y=virginica,petal width (cm)__y=virginica
0.0,0.36,0.38,0.26,0.3,0.52,0.16,0.7,0.64
1.0,0.64,0.62,0.44,0.38,0.48,0.62,0.3,0.36
2.0,,,0.3,0.32,,0.22,,


# Feature Selection
## Selectors settings

### Features to select from

Here all features have been carved using ``MulticlassCarver``, hence all features are qualitative.


### Number of features to select

The attribute ``n_best_per_type`` allows one to choose the number of features to be selected per data type (quantitative and qualitative).

In [27]:
n_best_per_type = 4  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics
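Keeping the best features boils down to ranking them by association with the target and truncating the list. The sketch below applies this to the Tschuprow's T values reported in the carving summary for the ``versicolor`` features. ``select_n_best`` is a hypothetical helper; the actual ``ClassificationSelector`` works per data type and can apply additional filters.

```python
def select_n_best(associations, n_best):
    """Names of the n_best features with the highest association value."""
    ranked = sorted(associations.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n_best]]


# Tschuprow's T values taken from the carving summary above
tschuprowt_by_feature = {
    "sepal length (cm)__y=versicolor": 0.346586,
    "sepal width (cm)__y=versicolor": 0.480207,
    "petal length (cm)__y=versicolor": 0.717333,
    "petal width (cm)__y=versicolor": 0.767075,
}

top_two = select_n_best(tschuprowt_by_feature, n_best=2)
print(top_two)
```

When the requested number is at least the number of available features, as is the case for the quantitative features here, every feature is kept and the ranking only serves as a diagnostic.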

## Using Selectors

In [29]:
from AutoCarver.selectors import ClassificationSelector

# select the most target associated qualitative features
feature_selector = ClassificationSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
best_features

[Quantitative('sepal width (cm)__y=versicolor'),
 Quantitative('sepal width (cm)__y=virginica'),
 Quantitative('sepal length (cm)__y=virginica'),
 Quantitative('sepal length (cm)__y=versicolor')]

* Feature ``petal width (cm)__y=versicolor`` is the most associated with the target ``iris_type``:
    - Tschuprow's T value is ``tschuprowt_measure=0.7808``
    - It has 0 % of NaNs (``pct_nan=0.0``)
    - Its mode, ``0``, represents 31 % of observed data (``pct_mode=0.3100``)

* The four best (most associated) features were selected (``n_best_per_type=4``)

* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)

## What's next?

* Thanks to **Carvers** all of your features are now optimally processed for your classification task!
* As a final step towards your model, **Selectors** can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out [Selectors Examples](https://autocarver.readthedocs.io/en/latest/selectors_examples.html)!

## Well done!

Your commitment to achieving optimal results in multiclass classification tasks shines through in your meticulous use of **AutoCarver**'s ``MulticlassCarver`` for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The ``MulticlassCarver`` has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing **AutoCarver** as your companion in the data preprocessing journey. Your use of **AutoCarver** demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in multiclass classification tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We're excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting **AutoCarver**, and we wish you continued success in your data-driven ventures.