# Setting things up

## About this notebook

This notebook shows how to preprocess the California Housing Prices dataset with the ``ContinuousCarver`` pipeline. ``ContinuousCarver`` is a Python tool that performs association-maximizing discretization of both quantitative and qualitative features. Our specific goal is to prepare the dataset for a continuous regression task: predicting housing prices.

The California Housing Prices dataset describes housing districts through features such as median income, housing median age, average rooms, average bedrooms, population and location. With ``ContinuousCarver``, we discretize the quantitative features so that each one is represented by a small number of modalities that are well associated with the target.

Throughout this notebook, we walk through each step of ``ContinuousCarver``'s discretization pipeline: choosing its settings, fitting it, analysing the carved features, and applying it to new data.

We then select the carved features that are most associated with the target and fit a regression model on them.


## Installation

In [1]:
# %pip install AutoCarver[jupyter]

## California Housing Prices Data

In this example notebook, we will use the California Housing Prices dataset.

The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.

Comprising housing-related metrics for various districts in California, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more, the California Housing Prices dataset is a valuable resource for exploring the relationships between different features and predicting the median house values (continuous regression).

In [2]:
from sklearn import datasets

# Load dataset directly from sklearn
housing = datasets.fetch_california_housing(as_frame=True)

# features are already a pandas DataFrame (as_frame=True); add the target as a column
housing_data = housing["data"]
housing_data[housing["target_names"][0]] = housing["target"]

# Display the first few rows of the dataset
housing_data.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


## Target type and Carver selection

In [3]:
target = "MedHouseVal"

housing_data[target].describe()

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target ``"MedHouseVal"`` is a continuous target of type ``float64`` used in a regression task. Hence we will use ``AutoCarver.ContinuousCarver`` and ``AutoCarver.selectors.RegressionSelector`` in the following code blocks.

## Data Sampling

In [4]:
from sklearn.model_selection import train_test_split

# random train/dev split (no stratification: the target is continuous)
train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)

# checking mean target per dataset
train_set[target].mean(), dev_set[target].mean()

(np.float64(2.0666362048018514), np.float64(2.072459655020552))

## Picking up columns to Carve

In [5]:
train_set.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
5088,0.9809,19.0,3.187726,1.129964,726.0,2.620939,33.98,-118.28,1.214
17096,4.2232,33.0,6.189696,1.086651,1015.0,2.377049,37.46,-122.23,3.637
5617,3.5488,42.0,4.821577,1.095436,1044.0,4.33195,33.79,-118.26,2.056
20060,1.6469,24.0,4.274194,1.048387,1686.0,4.532258,35.87,-119.26,0.476
895,3.9909,14.0,4.608303,1.08935,2738.0,2.471119,37.54,-121.96,2.36


In [6]:
# column data types
train_set.dtypes

MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object

All features are continuous quantitative features, with the exception of ``Latitude`` and ``Longitude``, which are geographical features (not supported by ``AutoCarver`` as is). All other features are passed to ``Features`` as ``quantitatives``.

In [7]:
from AutoCarver import Features

# lists of features per data type
features = Features(quantitatives=["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"])



# Using AutoCarver

## AutoCarver settings

### Representativeness of modalities

The attribute ``min_freq`` allows one to choose the minimum frequency per basic modality. It is used:

- For quantitative features, to define the number of quantiles to initially discretize the features with.

- For qualitative features, to define the threshold under which a modality is grouped with either a default value or its closest modality.

In [8]:
min_freq = 0.1

**Tip:** should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust)
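As a rough illustration of the quantitative case, the sketch below cuts a synthetic feature into quantiles. It assumes that each initial bin holds about ``min_freq / 2`` of the observations, which is inferred from the raw distributions printed later in this notebook (bins of frequency 0.05 for ``min_freq=0.1``). It is not the code of ``ContinuousDiscretizer``; the synthetic data and all variable names are illustrative.

```python
import random
import statistics
from bisect import bisect_left
from collections import Counter

min_freq = 0.1

# Assumption inferred from this notebook's logs: initial bins of frequency min_freq / 2
n_quantiles = round(1 / (min_freq / 2))

# Synthetic, skewed feature (illustrative only, not the housing data)
rng = random.Random(42)
values = [rng.lognormvariate(1.0, 0.5) for _ in range(10_000)]

# statistics.quantiles returns the n - 1 cut points that split the data into n groups
cut_points = statistics.quantiles(values, n=n_quantiles)

# bisect_left gives the index of the first cut point >= x, i.e. bins of the form "a < x <= b"
bin_counts = Counter(bisect_left(cut_points, x) for x in values)
frequencies = [bin_counts[i] / len(values) for i in range(n_quantiles)]

print(n_quantiles, min(frequencies), max(frequencies))
```

With a lower ``min_freq`` there are more, smaller initial bins, which is why the tip above describes low values as slower and less robust.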

### Desired number of modalities

The attribute ``max_n_mod`` allows one to choose the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities per consecutive combination of modalities.

In [9]:
max_n_mod = 4

**Tip:** should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust)
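To get a feel for how ``max_n_mod`` drives the search space, the sketch below enumerates every way of splitting ordered modalities into 2 to ``max_n_mod`` consecutive groups. This is a counting illustration under the assumption that a "consecutive combination" is such a split; the function name is hypothetical and this is not AutoCarver's implementation. The resulting counts agree with the totals shown by the progress bars in the fitting log of this notebook (1159, and 987 for ``HouseAge``, which would correspond to 19 modalities).

```python
from itertools import combinations
from math import comb


def consecutive_groupings(n_modalities, max_n_mod):
    """Yield every split of n ordered modalities into 2..max_n_mod consecutive groups."""
    for n_groups in range(2, max_n_mod + 1):
        # a split into n_groups groups is a choice of n_groups - 1 cuts among the gaps
        for cuts in combinations(range(1, n_modalities), n_groups - 1):
            bounds = (0, *cuts, n_modalities)
            yield [list(range(start, stop)) for start, stop in zip(bounds, bounds[1:])]


n_for_20 = sum(1 for _ in consecutive_groupings(20, 4))
n_for_19 = sum(1 for _ in consecutive_groupings(19, 4))
print(n_for_20, n_for_19)  # 1159 987

# same count in closed form: sum of binomial coefficients over the number of cuts
closed_form = sum(comb(19, n_cuts) for n_cuts in range(1, 4))
```

The count grows quickly with ``max_n_mod``, which is consistent with the tip above: higher values are slower.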

### Grouping NaNs

The attribute ``dropna`` allows one to choose whether or not ``nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-``nan`` values, and then test out all possible combinations with ``nan``.

In [10]:
dropna = False  # anyway, there are no nan in this dataset

### Type of output carved features

The attribute ``ordinal_encoding`` allows one to choose the output type:

* Use ``True`` for integer output of ranked modalities (default)
* Use ``False`` for string output of modalities

In [11]:
ordinal_encoding = True
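The two output types can be pictured with the sketch below, which maps a raw value either to the rank of its modality or to a string label. The thresholds are the rounded bounds of the carved ``MedInc`` distribution printed further down; since the fitted bounds are not rounded, values close to a bound may be labelled differently by the fitted carver. The helper names are hypothetical and the label format simply imitates the printed output.

```python
from bisect import bisect_left

# Rounded bounds copied from the carved distribution of MedInc printed in this notebook
thresholds = [2.6, 4.0, 5.5]


def string_labels(thresholds):
    """Build labels such as 'x <= 2.6e+00' or '2.6e+00 < x <= 4.0e+00'."""
    bounds = [f"{t:.1e}" for t in thresholds]
    middle = [f"{low} < x <= {high}" for low, high in zip(bounds, bounds[1:])]
    return [f"x <= {bounds[0]}", *middle, f"{bounds[-1]} < x"]


def carve(value, thresholds, ordinal_encoding=True):
    """Return the rank of the modality (int) or its string label."""
    rank = bisect_left(thresholds, value)  # bins of the form "a < x <= b"
    return rank if ordinal_encoding else string_labels(thresholds)[rank]


print(carve(0.9809, thresholds))                          # 0
print(carve(4.2232, thresholds))                          # 2
print(carve(4.2232, thresholds, ordinal_encoding=False))  # 4.0e+00 < x <= 5.5e+00
```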

## Fitting AutoCarver

* First, all quantitative features are discretized:
    1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq``)
    2. Using ``OrdinalDiscretizer`` for any remaining under-represented values (less frequent than ``min_freq/2``) to be grouped with its closest modality

* Second, all features are carved following this recipe:
    1. The raw distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the discretization step
    2. Grouping modalities: all consecutive combinations of modalities are applied to ``train_set``
    3. Computing associations: the association metric (Kruskal-Wallis' statistic, by default) is computed with the provided ``train_set[target]``
    4. Combinations are sorted in descending order by association value
    5. Testing robustness: finds the first combination that checks the following:
        - Representativeness of modalities on ``train_set`` and ``dev_set`` (all should be more frequent than ``min_freq/2``)
        - Distinct target rates per consecutive modalities on ``train_set`` and ``dev_set``
        - No inversion of target rates between ``train_set`` and ``dev_set`` (same ordering of modalities by target rate)
    6. (Optional) If requested via ``dropna=True``, and if there are any ``nan``, all combinations of modalities with ``nan`` are applied to ``train_set`` and steps 3. and 4. are run
    7. The carved distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the carving step
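Steps 3 and 5 of the recipe above can be sketched in plain Python as follows. ``kruskal_h`` computes the Kruskal-Wallis H statistic without tie correction, and ``is_robust`` is a hypothetical viability test that mirrors the three checks listed above (with "distinct" simplified to "not exactly equal"). Both are illustrations on toy data, not AutoCarver's code.

```python
from statistics import mean


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted(value for group in groups for value in group)
    # tied values share the average of the positions they occupy (ranks start at 1)
    positions = {}
    for position, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(position)
    rank = {value: mean(places) for value, places in positions.items()}
    n_total = len(pooled)
    rank_term = sum(sum(rank[v] for v in group) ** 2 / len(group) for group in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


def is_robust(train_groups, dev_groups, min_freq):
    """Hypothetical viability test: target values are grouped per modality."""

    def ordering(groups):
        return sorted(range(len(groups)), key=lambda i: mean(groups[i]))

    for groups in (train_groups, dev_groups):
        n_total = sum(len(group) for group in groups)
        # representativeness: every modality must be frequent enough
        if any(len(group) / n_total < min_freq / 2 for group in groups):
            return False
        # distinct mean target between consecutive modalities (simplified check)
        means = [mean(group) for group in groups]
        if any(a == b for a, b in zip(means, means[1:])):
            return False
    # no inversion: same ordering of modalities by mean target on both samples
    return ordering(train_groups) == ordering(dev_groups)


# toy target values per modality (illustrative only)
train = [[1.0, 1.2, 1.1], [1.8, 2.0, 1.9], [3.4, 3.6, 3.5]]
dev_same_order = [[1.1, 1.3], [1.7, 2.1], [3.3, 3.7]]
dev_inverted = [[2.5, 2.7], [1.7, 2.1], [3.3, 3.7]]

print(kruskal_h(train))
print(is_robust(train, dev_same_order, min_freq=0.1))  # True
print(is_robust(train, dev_inverted, min_freq=0.1))    # False
```

A higher H means the modalities separate the target better, which is why combinations are tested in descending order of association.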

In [12]:
from AutoCarver import ContinuousCarver

# initiating AutoCarver
auto_carver = ContinuousCarver(
    features=features,
    min_freq=min_freq,
    max_n_mod=max_n_mod,
    dropna=dropna,
    ordinal_encoding=ordinal_encoding,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

------
--- [QuantitativeDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
 - [ContinuousDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
 - [OrdinalDiscretizer] Fit Features(['HouseAge'])
------

---------
------ [ContinuousCarver] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
--- [ContinuousCarver] Fit Quantitative('MedInc') (1/6)
 [ContinuousCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.6e+00,1.1102,0.05
1.6e+00 < x <= 1.9e+00,1.1285,0.05
1.9e+00 < x <= 2.2e+00,1.2198,0.05
2.2e+00 < x <= 2.4e+00,1.3171,0.05
2.4e+00 < x <= 2.6e+00,1.3817,0.05
2.6e+00 < x <= 2.7e+00,1.5409,0.05
2.7e+00 < x <= 3.0e+00,1.6159,0.05
3.0e+00 < x <= 3.1e+00,1.6906,0.0499
3.1e+00 < x <= 3.3e+00,1.8232,0.05
3.3e+00 < x <= 3.5e+00,1.9059,0.05

target_rate,frequency
1.1017,0.0509
1.041,0.0502
1.2407,0.0501
1.2919,0.0506
1.4676,0.0536
1.5605,0.0417
1.628,0.0584
1.7519,0.0471
1.8443,0.0504
1.85,0.0498


Grouping modalities   : 100%|█████████▉| 1158/1159 [00:00<00:00, 1464.55it/s]
Computing associations: 100%|██████████| 1159/1159 [00:02<00:00, 476.19it/s]
Testing robustness    :   0%|          | 0/1159 [00:00<?, ?it/s]




 [ContinuousCarver] Carved distribution


Unnamed: 0,target_rate,frequency
x <= 2.6e+00,1.2314,0.25
2.6e+00 < x <= 4.0e+00,1.8016,0.35
4.0e+00 < x <= 5.5e+00,2.3587,0.2499
5.5e+00 < x,3.59,0.1501

target_rate,frequency
1.2315,0.2554
1.8222,0.3509
2.3953,0.2446
3.5721,0.1491


--- [ContinuousCarver] Fit Quantitative('HouseAge') (2/6)
 [ContinuousCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 8.0e+00,2.1158,0.0537
8.0e+00 < x <= 1.2e+01,1.822,0.0477
1.2e+01 < x <= 1.5e+01,1.859,0.0613
1.5e+01 < x <= 1.6e+01,2.0358,0.0393
1.6e+01 < x <= 1.8e+01,1.9013,0.0596
1.8e+01 < x <= 2.0e+01,1.9399,0.0468
2.0e+01 < x <= 2.2e+01,2.0134,0.0404
2.2e+01 < x <= 2.5e+01,2.1055,0.0705
2.5e+01 < x <= 2.6e+01,2.0977,0.03
2.6e+01 < x <= 2.8e+01,2.0218,0.0475

target_rate,frequency
2.0205,0.0526
1.7827,0.0443
1.878,0.0556
1.9208,0.0335
1.9484,0.0652
1.9517,0.047
2.1141,0.0421
2.1179,0.0759
2.0888,0.0299
2.2138,0.0443


Grouping modalities   : 100%|█████████▉| 986/987 [00:00<00:00, 1431.77it/s]
Computing associations: 100%|██████████| 987/987 [00:02<00:00, 480.34it/s]
Testing robustness    :   1%|          | 6/987 [00:00<00:02, 430.04it/s]



 [ContinuousCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 2.2e+01,1.9494,0.3486
2.2e+01 < x <= 2.6e+01,2.1032,0.1005
2.6e+01 < x <= 4.5e+01,2.0509,0.4437
4.5e+01 < x,2.4785,0.1072

target_rate,frequency
1.9447,0.3403
2.1097,0.1058
2.067,0.447
2.4651,0.1069


--- [ContinuousCarver] Fit Quantitative('AveRooms') (3/6)
 [ContinuousCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 3.4e+00,1.9126,0.05
3.4e+00 < x <= 3.8e+00,1.8286,0.05
3.8e+00 < x <= 4.1e+00,1.8169,0.05
4.1e+00 < x <= 4.3e+00,1.8418,0.05
4.3e+00 < x <= 4.5e+00,1.7529,0.05
4.5e+00 < x <= 4.6e+00,1.7915,0.05
4.6e+00 < x <= 4.8e+00,1.8214,0.05
4.8e+00 < x <= 4.9e+00,1.7685,0.05
4.9e+00 < x <= 5.1e+00,1.7466,0.05
5.1e+00 < x <= 5.2e+00,1.7717,0.05

target_rate,frequency
1.8659,0.0518
1.8728,0.0505
1.7627,0.0524
1.802,0.0543
1.7223,0.0552
1.6802,0.0452
1.7707,0.053
1.803,0.0443
1.8209,0.0523
1.8326,0.0437


Grouping modalities   : 100%|█████████▉| 1158/1159 [00:00<00:00, 1524.83it/s]
Computing associations: 100%|██████████| 1159/1159 [00:02<00:00, 468.35it/s]
Testing robustness    :   1%|          | 7/1159 [00:00<00:03, 350.70it/s]



 [ContinuousCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 5.2e+00,1.8053,0.5
5.2e+00 < x <= 5.9e+00,1.9061,0.2
5.9e+00 < x <= 6.5e+00,2.2275,0.15
6.5e+00 < x,2.9907,0.1501

target_rate,frequency
1.7933,0.5028
1.9208,0.2033
2.2521,0.144
3.042,0.1499


--- [ContinuousCarver] Fit Quantitative('AveBedrms') (4/6)
 [ContinuousCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 9.400e-01,2.0684,0.05
9.400e-01 < x <= 9.672e-01,2.0735,0.05
9.672e-01 < x <= 9.832e-01,2.2167,0.0501
9.832e-01 < x <= 9.958e-01,2.1706,0.0499
9.958e-01 < x <= 1.007e+00,2.131,0.05
1.007e+00 < x <= 1.015e+00,2.2358,0.05
1.015e+00 < x <= 1.025e+00,2.1668,0.05
1.025e+00 < x <= 1.033e+00,2.2102,0.05
1.033e+00 < x <= 1.041e+00,2.1295,0.05
1.041e+00 < x <= 1.050e+00,2.1548,0.05

target_rate,frequency
2.0416,0.0539
2.2043,0.0527
2.0997,0.0482
2.1835,0.0487
2.2628,0.0552
2.1619,0.048
2.2295,0.0567
2.169,0.0493
2.1581,0.0528
2.1202,0.0476


Grouping modalities   : 100%|█████████▉| 1158/1159 [00:00<00:00, 1399.89it/s]
Computing associations: 100%|██████████| 1159/1159 [00:02<00:00, 466.61it/s]
Testing robustness    :   3%|▎         | 35/1159 [00:00<00:01, 562.72it/s]




 [ContinuousCarver] Carved distribution


Unnamed: 0,target_rate,frequency
x <= 9.67e-01,2.0709,0.1
9.67e-01 < x <= 1.06e+00,2.171,0.45
1.06e+00 < x <= 1.14e+00,2.0475,0.2999
1.14e+00 < x,1.7888,0.1501

target_rate,frequency
2.1221,0.1066
2.1685,0.4517
2.039,0.2955
1.8072,0.1462


--- [ContinuousCarver] Fit Quantitative('Population') (5/6)
 [ContinuousCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 3.5e+02,1.9859,0.0501
3.5e+02 < x <= 5.1e+02,2.1616,0.0501
5.1e+02 < x <= 6.3e+02,2.1117,0.0501
6.3e+02 < x <= 7.2e+02,2.2819,0.0497
7.2e+02 < x <= 7.9e+02,2.0335,0.0509
7.9e+02 < x <= 8.6e+02,2.2113,0.0492
8.6e+02 < x <= 9.4e+02,2.0772,0.0498
9.4e+02 < x <= 1.0e+03,2.1386,0.05
1.0e+03 < x <= 1.1e+03,2.043,0.0503
1.1e+03 < x <= 1.2e+03,2.0506,0.0496

target_rate,frequency
1.9012,0.053
2.1915,0.052
2.1706,0.0523
2.1062,0.0514
2.2019,0.0531
2.1765,0.049
2.2025,0.0506
2.1329,0.0553
2.1744,0.0437
2.1319,0.048


Grouping modalities   : 100%|█████████▉| 1158/1159 [00:00<00:00, 1521.46it/s]
Computing associations: 100%|██████████| 1159/1159 [00:02<00:00, 482.31it/s]
Testing robustness    :  16%|█▋        | 191/1159 [00:00<00:01, 721.78it/s]




 [ContinuousCarver] Carved distribution


Unnamed: 0,target_rate,frequency
x <= 6.3e+02,2.0864,0.1503
6.3e+02 < x <= 8.6e+02,2.1743,0.1498
8.6e+02 < x <= 2.2e+03,2.0433,0.5498
2.2e+03 < x,2.025,0.1501

target_rate,frequency
2.0867,0.1572
2.1618,0.1536
2.0607,0.539
2.0084,0.1502


--- [ContinuousCarver] Fit Quantitative('AveOccup') (6/6)
 [ContinuousCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 1.87e+00,2.7122,0.05
1.87e+00 < x <= 2.07e+00,2.6633,0.05
2.07e+00 < x <= 2.22e+00,2.3373,0.05
2.22e+00 < x <= 2.34e+00,2.308,0.05
2.34e+00 < x <= 2.43e+00,2.1976,0.05
2.43e+00 < x <= 2.51e+00,2.2064,0.05
2.51e+00 < x <= 2.60e+00,2.1736,0.05
2.60e+00 < x <= 2.67e+00,2.1862,0.05
2.67e+00 < x <= 2.74e+00,2.1378,0.05
2.74e+00 < x <= 2.82e+00,2.1902,0.05

target_rate,frequency
2.7684,0.0484
2.5334,0.0435
2.3989,0.0542
2.3641,0.0533
2.2272,0.0546
2.2969,0.0489
2.3179,0.0508
2.0793,0.0467
2.1847,0.0521
2.1752,0.0504


Grouping modalities   : 100%|█████████▉| 1158/1159 [00:00<00:00, 1394.69it/s]
Computing associations: 100%|██████████| 1159/1159 [00:02<00:00, 405.66it/s]
Testing robustness    :   0%|          | 3/1159 [00:00<00:07, 161.16it/s]



 [ContinuousCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 2.2e+00,2.5709,0.1501
2.2e+00 < x <= 3.1e+00,2.1681,0.5001
3.1e+00 < x <= 3.6e+00,1.8729,0.1998
3.6e+00 < x,1.4822,0.1501

target_rate,frequency
2.5615,0.1461
2.1836,0.5129
1.8527,0.1869
1.5056,0.1541


## AutoCarver analysis

### Carving Summary

In [13]:
auto_carver.summary()

feature,kruskal,n_mod,label,content,target_rate,frequency
Quantitative('MedInc'),6037.182135,4,0,x <= 2.6e+00,1.231421,0.25
Quantitative('MedInc'),6037.182135,4,1,2.6e+00 < x <= 4.0e+00,1.801562,0.350014
Quantitative('MedInc'),6037.182135,4,2,4.0e+00 < x <= 5.5e+00,2.35866,0.249928
Quantitative('MedInc'),6037.182135,4,3,5.5e+00 < x,3.59004,0.150058
Quantitative('HouseAge'),163.527841,4,0,x <= 2.2e+01,1.949361,0.348568
Quantitative('HouseAge'),163.527841,4,1,2.2e+01 < x <= 2.6e+01,2.103173,0.100521
Quantitative('HouseAge'),163.527841,4,2,2.6e+01 < x <= 4.5e+01,2.050927,0.443665
Quantitative('HouseAge'),163.527841,4,3,4.5e+01 < x,2.478542,0.107246
Quantitative('AveRooms'),1391.586489,4,0,x <= 5.2e+00,1.805255,0.5
Quantitative('AveRooms'),1391.586489,4,1,5.2e+00 < x <= 5.9e+00,1.906098,0.199957


* As requested with ``ordinal_encoding=True``, output labels are the integer ranks of modalities

* For quantitative feature ``Population``, the selected combination of modalities groups populations as follows:
    * label ``0``: less than or equal to 630 people (``content="x <= 6.3e+02"``)
    * label ``1``: greater than 630 people and less than or equal to 860 people (``content="6.3e+02 < x <= 8.6e+02"``)
    * label ``2``: greater than 860 people and less than or equal to 2200 people (``content="8.6e+02 < x <= 2.2e+03"``)
    * label ``3``: greater than 2200 people (``content="2.2e+03 < x"``)

### Detailed overview of tested combinations

In [14]:
features["AveOccup"].history.head(7)

Unnamed: 0,info,kruskal,combination,n_mod,dropna,train,viable,dev
0,Raw distribution (n_mod=20>max_n_mod=4),1062.072498,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",20,False,,,
1,Not viable,994.51441,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",4,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Non-representative ..."
2,Not viable,994.504665,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",4,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Non-representative ..."
3,Not viable,991.504255,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",4,False,"{'viable': True, 'info': ''}",False,"{'viable': False, 'info': 'Non-representative ..."
4,Best for kruskal and max_n_mod=4,991.408301,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",4,False,"{'viable': True, 'info': ''}",True,"{'viable': True, 'info': ''}"
5,Not checked,991.308986,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",4,False,,,
6,Not checked,988.666983,"{'x <= 1.87e+00': 'x <= 1.87e+00', '1.87e+00 <...",4,False,,,


In [15]:
features["AveOccup"].history.dev[1]

{'viable': False, 'info': 'Non-representative modality for min_freq=10.00%'}

* The most associated combination of feature ``AveOccup`` (the first one tested, where ``info!="Raw distribution"``) did not pass the viability tests. When looking in ``history.dev``:
    * ``"Non-representative modality for min_freq=10.00%"``: tells us that at least one modality is not frequent enough on ``dev_set``

* For feature ``AveOccup``, the 4th combination is the first to pass the tests:
    - ``viable=True``
    - ``info="Best for kruskal and max_n_mod=4"``
    - Kruskal-Wallis' H with ``MedHouseVal`` is ``991.408301`` for this combination
    - The following combinations (less associated with the target) were not tested: ``info="Not checked"``

* For all combinations, ``dropna=False`` means that ``nan``s are not grouped with other modalities in that combination (as requested with ``dropna=False``)
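The way the history reads can be reproduced with a small sketch: combinations are ranked by association, tested in that order, and testing stops at the first viable one. The ``kruskal`` values and viability flags below are copied from rows 1 to 6 of the history above; the function name and the shortened ``info`` strings are hypothetical.

```python
# (kruskal, viable) copied from rows 1 to 6 of the history above;
# None means the notebook output gives no viability information for that row
candidates = [
    {"kruskal": 994.514410, "viable": False},
    {"kruskal": 994.504665, "viable": False},
    {"kruskal": 991.504255, "viable": False},
    {"kruskal": 991.408301, "viable": True},
    {"kruskal": 991.308986, "viable": None},
    {"kruskal": 988.666983, "viable": None},
]


def pick_first_viable(candidates):
    """Test combinations by descending association and stop at the first viable one."""
    ranked = sorted(candidates, key=lambda c: c["kruskal"], reverse=True)
    best, report = None, []
    for candidate in ranked:
        if best is not None:
            info = "Not checked"  # a better-ranked viable combination already exists
        elif candidate["viable"]:
            best, info = candidate, "Best"
        else:
            info = "Not viable"
        report.append((candidate["kruskal"], info))
    return best, report


best, report = pick_first_viable(candidates)
print(best["kruskal"])  # 991.408301
for kruskal, info in report:
    print(kruskal, info)
```

Stopping early is what keeps the "Testing robustness" progress bars in the fitting log at a few percent.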

## Saving and Loading AutoCarver

### Saving

All **Carvers** can safely be stored as a .json file.

In [16]:
auto_carver.save("continuous_carver.json")

### Loading

**Carvers** can safely be loaded from a .json file.

In [17]:
from AutoCarver import ContinuousCarver

# loading json file
auto_carver = ContinuousCarver.load('continuous_carver.json')

## Applying AutoCarver

In [18]:
dev_set_processed = auto_carver.transform(dev_set)

In [19]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
0.0,0.255432,0.340282,0.502789,0.106577,0.157223,0.146066
1.0,0.350851,0.105843,0.203318,0.451703,0.153553,0.512918
2.0,0.244568,0.447005,0.144011,0.295508,0.539049,0.186876
3.0,0.149149,0.10687,0.149883,0.146213,0.150176,0.15414


# Feature Selection
## Selectors settings
### Features to select from

Here all features have been carved using ``ContinuousCarver``, hence all features are now discretized into ordinal modalities.

### Number of features to select

The attribute ``n_best_per_type`` allows one to choose the number of features to be selected per data type (quantitative and qualitative).

In [20]:
n_best_per_type = 6 

## Using Selectors

In [21]:
from AutoCarver.selectors import RegressionSelector

# select the features most associated with the target
feature_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
best_features

 [RegressionSelector] Selected Features


Unnamed: 0,feature,NanMeasure,ModeMeasure,KruskalMeasure,KruskalRank,TschuprowtFilter,TschuprowtWith
0,Quantitative('MedInc'),0.0,0.35,6037.1821,0,0.0,itself
2,Quantitative('AveRooms'),0.0,0.5,1391.5865,1,0.4015,MedInc
5,Quantitative('AveOccup'),0.0,0.5001,991.4083,2,0.1864,AveRooms
3,Quantitative('AveBedrms'),0.0,0.45,315.7944,3,0.1392,MedInc
1,Quantitative('HouseAge'),0.0,0.4437,163.5278,4,0.1362,AveRooms
4,Quantitative('Population'),0.0,0.5498,16.1097,5,0.1517,AveBedrms


Features(['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population'])

In [22]:
train_set_processed[best_features].head()

Unnamed: 0,MedInc,AveRooms,AveOccup,AveBedrms,HouseAge,Population
5088,0.0,0.0,1.0,2.0,0.0,1.0
17096,2.0,2.0,1.0,2.0,2.0,2.0
5617,1.0,0.0,3.0,2.0,2.0,2.0
20060,0.0,0.0,3.0,1.0,1.0,2.0
895,2.0,0.0,1.0,2.0,0.0,3.0


* Feature ``MedInc`` is the most associated with the target ``MedHouseVal``:
    - Kruskal-Wallis' H value is ``KruskalMeasure=6037.1821``
    - It has 0 % of NaNs (``NanMeasure=0.0000``) 
    - Its mode represents 35 % of observed data (``ModeMeasure=0.3500``)

* Feature ``AveRooms`` is strongly associated to feature ``MedInc``:
    - Tschuprow's T value is ``TschuprowtFilter=0.4015`` for ``TschuprowtWith=MedInc``

* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
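For readers unfamiliar with Tschuprow's T, the sketch below computes it from a contingency table of two discretized features, using the textbook definition T = sqrt(chi2 / (n * sqrt((r - 1) * (c - 1)))). It is a plain-Python illustration on toy tables (assuming no empty row or column), not the code behind ``TschuprowtFilter``.

```python
from math import sqrt


def tschuprow_t(table):
    """Tschuprow's T for a contingency table given as a list of rows of counts."""
    n_total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n_total
            chi2 += (observed - expected) ** 2 / expected
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n_total * sqrt((n_rows - 1) * (n_cols - 1))))


# toy tables: rows are modalities of one feature, columns those of another
perfectly_associated = [[10, 0], [0, 10]]
independent = [[5, 5], [5, 5]]

print(tschuprow_t(perfectly_associated))  # 1.0
print(tschuprow_t(independent))           # 0.0
```

T ranges from 0 (independence) to 1 (perfect association between two features with the same number of modalities), so ``0.4015`` between ``AveRooms`` and ``MedInc`` is the largest inter-feature association in the table above.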

# Modeling
Fitting model on train data

In [23]:
from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(train_set_processed[best_features], train_set_processed[target])

AttributeError: 'super' object has no attribute '__sklearn_tags__'


XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)

Saving model

In [24]:
model.save_model("regression_xgboost.json")

Prediction on dev dataset and performance

In [25]:
from sklearn.metrics import root_mean_squared_error

dev_pred = model.predict(dev_set_processed[best_features])
root_mean_squared_error(dev_set_processed[target], dev_pred)

0.7773564029114313
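For reference, the root mean squared error is the square root of the mean of squared prediction errors, expressed in the unit of the target. A minimal plain-Python equivalent on toy values (the ``rmse`` helper is illustrative, not part of scikit-learn):

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# toy values: errors of 1.0 and 0.0 give sqrt(0.5)
print(rmse([3.0, 2.0], [2.0, 2.0]))
```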

## What's next?

* Thanks to **Carvers**, all of your features are now processed for your regression task!
* As a final step towards your model, **Selectors** are handy tools for target-driven feature pre-selection, so make sure to check out [Selectors Examples](https://autocarver.readthedocs.io/en/latest/selectors_examples.html)!

## Well done!

Your commitment to achieving optimal results in continuous regression tasks shines through in your meticulous use of **AutoCarver**'s ``ContinuousCarver`` for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The ``ContinuousCarver`` has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing **AutoCarver** as your companion in the data preprocessing journey. Your use of **AutoCarver** demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in continuous regression tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We're excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting **AutoCarver**, and we wish you continued success in your data-driven ventures.