# Setting things up

## Installation

In [1]:
!pip install AutoCarver[jupyter]



## Titanic Data

In this example notebook, we will use the Titanic dataset.

The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.

The dataset includes various features such as passengers' names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).

In [2]:
import pandas as pd

# URL to the Titanic dataset (hosted on the Stanford CS109 course site)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)

# Display the first few rows of the dataset
titanic_data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


## Target type and Carver selection

In [3]:
target = "Survived"

titanic_data[target].value_counts(dropna=False)

0    545
1    342
Name: Survived, dtype: int64

The target ``"Survived"`` is a binary target of type ``int64``, used for a classification task. Hence we will use ``AutoCarver.BinaryCarver`` and ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks.

## Data Sampling

In [4]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])

In [5]:
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()

(0.38552188552188554, 0.3856655290102389)
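
To make the effect of ``stratify`` concrete, here is a minimal sketch of a stratified split in plain Python. It is not scikit-learn's implementation: the helper ``stratified_split`` is hypothetical and simply sends the same share of each class to the dev sample. The class counts are the ones from the ``value_counts()`` output above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Split row indices so that each class keeps roughly the same share in both parts.

    Hypothetical helper for illustration, not scikit-learn's algorithm.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, dev_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_dev = round(test_size * len(indices))  # per-class share sent to the dev sample
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)


# Same class counts as in the notebook: 545 zeros and 342 ones
labels = [0] * 545 + [1] * 342
train_idx, dev_idx = stratified_split(labels, test_size=0.33)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
dev_rate = sum(labels[i] for i in dev_idx) / len(dev_idx)
print(train_rate, dev_rate)
```

Because each class is split separately, the two target rates can only differ through rounding, which is why the rates checked above are nearly identical.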

# Picking columns to carve

In [6]:
train_set.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
617,0,3,Mr. Antoni Yasbeck,male,27.0,1,0,14.4542
489,0,1,Mr. Harry Markland Molson,male,55.0,0,0,30.5
871,1,3,Miss. Adele Kiamie Najib,female,15.0,0,0,7.225
654,0,3,Mrs. John (Catherine) Bourke,female,32.0,1,1,15.5
653,0,3,Mr. Alexander Radeff,male,27.0,0,0,7.8958


In [7]:
# column data types
train_set.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

In [8]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()

0    438
1     87
2     60
3      3
5      3
4      2
6      1
Name: Parents/Children Aboard, dtype: int64

In [9]:
# values taken by Pclass
train_set["Pclass"].value_counts()

3    326
1    142
2    126
Name: Pclass, dtype: int64

The feature ``"Pclass"`` is of type ``"int64"``, but it can be considered a qualitative ordinal feature rather than a quantitative discrete feature (ranking of named passenger classes). Thus we will add it to the list of ``ordinal_features`` and set the ordering of its values in ``values_orders`` (string values). 

``"Sex"`` is the only qualitative categorical feature; it is added to the list of ``qualitative_features``.

``"Age"`` and ``"Fare"`` are quantitative continuous features, whilst ``"Siblings/Spouses Aboard"`` and ``"Parents/Children Aboard"`` can be considered quantitative discrete features. These four features will be added to the list of ``quantitative_features``.

In [10]:
quantitative_features = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
qualitative_features = ["Sex"]
ordinal_features = ["Pclass"]

values_orders = {
    "Pclass": ["1", "2", "3"]
}
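
Since ``"Pclass"`` holds integers while ``values_orders`` lists strings, a quick sanity check can confirm that every observed value appears in the declared ordering. The sketch below assumes values are matched through their string representation; the helper ``missing_from_order`` and the sample values are illustrative and not part of AutoCarver.

```python
def missing_from_order(observed_values, order):
    """Return the observed values (as strings) that are absent from the declared order."""
    declared = set(order)
    return sorted({str(value) for value in observed_values} - declared)


values_orders = {"Pclass": ["1", "2", "3"]}

# Illustrative sample of int64 values, as found in the "Pclass" column
observed_pclass = [3, 1, 2, 3, 3, 1]
print(missing_from_order(observed_pclass, values_orders["Pclass"]))

# A value that was forgotten in the ordering is reported
print(missing_from_order([1, 2, 3, 4], values_orders["Pclass"]))
```

An empty list means the ordering covers every value seen in the data.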

# Using AutoCarver

## AutoCarver settings

### Representativeness of modalities

The attribute ``min_freq`` sets the minimum frequency of each basic modality. It is used by **Discretizers**:

- For quantitative features, it defines the number of quantiles used to initially discretize the features.

- For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality.

In [11]:
min_freq = 0.02

**Tip:** ``min_freq`` should be set between ``0.02`` (slower, more precise, less robust) and ``0.05`` (faster, more robust).
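
The two roles of ``min_freq`` can be sketched in plain Python. This is an illustration under stated assumptions, not AutoCarver's code: it assumes the number of initial quantiles is ``1 / min_freq`` and that a modality is rare when its frequency falls below ``min_freq``. Both helper functions are hypothetical. The counts are those of ``"Parents/Children Aboard"`` shown earlier.

```python
def n_quantiles(min_freq):
    """Number of equal-frequency bins implied by a minimum frequency (assumed rule)."""
    return round(1 / min_freq)


def rare_modalities(counts, min_freq):
    """Modalities whose observed frequency is below min_freq."""
    total = sum(counts.values())
    return sorted(
        modality for modality, count in counts.items() if count / total < min_freq
    )


min_freq = 0.02

# Counts of "Parents/Children Aboard" in the training sample (594 rows)
parents_children_counts = {0: 438, 1: 87, 2: 60, 3: 3, 5: 3, 4: 2, 6: 1}

print(n_quantiles(min_freq))
print(rare_modalities(parents_children_counts, min_freq))
```

With ``min_freq = 0.02``, the values 3 to 6 are each too rare to stand alone, which fits the raw distribution printed during fitting, where everything above 1 is reported as a single modality.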

### Desired number of modalities

The attribute ``max_n_mod`` sets the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities in each consecutive combination of modalities.

In [12]:
max_n_mod = 5

**Tip:** ``max_n_mod`` should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust).
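
To get a feel for how ``max_n_mod`` drives the amount of work, the sketch below enumerates every way to cut an ordered list of modalities into 2 to ``max_n_mod`` consecutive groups. The counting rule is an assumption rather than a documented part of AutoCarver, and the function names are hypothetical. It does reproduce the progress-bar totals in the fitting log further down (3 combinations for 3 modalities, 7 for 4, 27840 for 30).

```python
from itertools import combinations
from math import comb


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, min(n, max_n_mod) + 1):
        # choose n_groups - 1 cut points among the n - 1 gaps between neighbours
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[start:end] for start, end in zip(bounds, bounds[1:])]


def count_consecutive_groupings(n_modalities, max_n_mod):
    """Closed-form count of the groupings enumerated above."""
    return sum(
        comb(n_modalities - 1, n_groups - 1)
        for n_groups in range(2, min(n_modalities, max_n_mod) + 1)
    )


max_n_mod = 5

for grouping in consecutive_groupings(["1", "2", "3"], max_n_mod):
    print(grouping)

print(count_consecutive_groupings(4, max_n_mod))
print(count_consecutive_groupings(30, max_n_mod))
```

The count grows quickly with both the number of basic modalities and ``max_n_mod``, which is one way to read the speed trade-off mentioned in the tip.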

### Association metric

The attribute ``sort_by`` sets the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric, and the best-ranked viable combination is returned by **Carvers**.

In [13]:
# For BinaryCarver, to be chosen from ["tschuprowt", "cramerv"]
sort_by = "tschuprowt"  # "cramerv"

**Tip:** use ``"tschuprowt"`` for more robust results with fewer output modalities; use ``"cramerv"`` for more output modalities.
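
For reference, both metrics derive from the chi-square statistic of the modality-by-target contingency table. The sketch below uses the textbook definitions of Cramér's V and Tschuprow's T in plain Python. It is not AutoCarver's implementation, and the contingency table is made up for illustration.

```python
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-square of a contingency table (rows and columns must be non-empty)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramerv(table):
    """Cramér's V: chi-square normalised by n * min(rows - 1, columns - 1)."""
    n = sum(map(sum, table))
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2_statistic(table) / (n * min(n_rows - 1, n_cols - 1)))


def tschuprowt(table):
    """Tschuprow's T: chi-square normalised by n * sqrt((rows - 1) * (columns - 1))."""
    n = sum(map(sum, table))
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2_statistic(table) / (n * sqrt((n_rows - 1) * (n_cols - 1))))


# Illustrative table: one row per modality, columns = (target 0, target 1)
three_modalities = [[50, 90], [65, 60], [240, 85]]

print(cramerv(three_modalities), tschuprowt(three_modalities))
```

With a binary target, Cramér's V does not depend on the number of modalities, whereas Tschuprow's T divides by a term that grows with it. This is consistent with the tip above: ``"tschuprowt"`` tends to favour combinations with fewer modalities.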


## Fitting AutoCarver

In [14]:
from AutoCarver import BinaryCarver

# initiating AutoCarver
auto_carver = BinaryCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    max_n_mod=max_n_mod,
    sort_by=sort_by,
    verbose=True,  # showing statistics
    copy=True,
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

------
[Discretizer] Fit Qualitative Features
---
 - [StringDiscretizer] Fit ['Pclass']
 - [OrdinalDiscretizer] Fit ['Pclass']
 - [CategoricalDiscretizer] Fit ['Sex']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Age']
 - [OrdinalDiscretizer] Fit ['Fare', 'Siblings/Spouses Aboard', 'Age', 'Parents/Children Aboard']




------


------
[AutoCarver] Fit Fare (1/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 6.496e+00,0.0,0.022
6.496e+00 < x <= 7.054e+00,0.077,0.022
7.054e+00 < x <= 7.225e+00,0.231,0.022
7.225e+00 < x <= 7.250e+00,0.182,0.037
7.250e+00 < x <= 7.750e+00,0.35,0.067
7.750e+00 < x <= 7.854e+00,0.333,0.04
7.854e+00 < x <= 7.896e+00,0.143,0.047
7.896e+00 < x <= 8.029e+00,0.5,0.027
8.029e+00 < x <= 8.050e+00,0.097,0.052
8.050e+00 < x <= 8.662e+00,0.083,0.02

Unnamed: 0,target_rate,frequency
x <= 6.496e+00,0.111,0.031
6.496e+00 < x <= 7.054e+00,0.0,0.01
7.054e+00 < x <= 7.225e+00,0.25,0.014
7.225e+00 < x <= 7.250e+00,0.167,0.02
7.250e+00 < x <= 7.750e+00,0.25,0.055
7.750e+00 < x <= 7.854e+00,0.133,0.051
7.854e+00 < x <= 7.896e+00,0.071,0.048
7.896e+00 < x <= 8.029e+00,0.333,0.01
8.029e+00 < x <= 8.050e+00,0.167,0.041
8.050e+00 < x <= 8.662e+00,0.182,0.038


Grouping modalities   : 100%|██████████| 27840/27840 [00:04<00:00, 6414.49it/s]
Computing associations: 100%|██████████| 27840/27840 [00:06<00:00, 4380.26it/s]
Testing robustness    :   0%|          | 0/27840 [00:04<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 9.500e+00,0.222,0.379
9.500e+00 < x <= 7.729e+01,0.426,0.522
7.729e+01 < x,0.797,0.099

Unnamed: 0,target_rate,frequency
x <= 9.500e+00,0.149,0.345
9.500e+00 < x <= 7.729e+01,0.472,0.549
7.729e+01 < x,0.71,0.106


------


------
[AutoCarver] Fit Siblings/Spouses Aboard (2/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.361,0.68
0.000e+00 < x <= 1.000e+00,0.5,0.232
1.000e+00 < x <= 2.000e+00,0.55,0.034
2.000e+00 < x,0.094,0.054

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.32,0.683
0.000e+00 < x <= 1.000e+00,0.606,0.242
1.000e+00 < x <= 2.000e+00,0.25,0.027
2.000e+00 < x,0.286,0.048


Grouping modalities   : 100%|██████████| 7/7 [00:00<00:00, 3511.98it/s]
Computing associations: 100%|██████████| 7/7 [00:00<00:00, 3449.26it/s]
Testing robustness    :   0%|          | 0/7 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.361,0.68
0.000e+00 < x <= 2.000e+00,0.506,0.266
2.000e+00 < x,0.094,0.054

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.32,0.683
0.000e+00 < x <= 2.000e+00,0.57,0.27
2.000e+00 < x,0.286,0.048


------


------
[AutoCarver] Fit Age (3/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 2.000e+00,0.75,0.027
2.000e+00 < x <= 7.000e+00,0.625,0.04
7.000e+00 < x <= 1.100e+01,0.167,0.02
1.100e+01 < x <= 1.600e+01,0.44,0.042
1.600e+01 < x <= 1.800e+01,0.323,0.052
1.800e+01 < x <= 1.900e+01,0.391,0.039
1.900e+01 < x <= 2.050e+01,0.111,0.03
2.050e+01 < x <= 2.100e+01,0.19,0.035
2.100e+01 < x <= 2.350e+01,0.419,0.072
2.350e+01 < x <= 2.400e+01,0.542,0.04

Unnamed: 0,target_rate,frequency
x <= 2.000e+00,0.444,0.031
2.000e+00 < x <= 7.000e+00,0.75,0.027
7.000e+00 < x <= 1.100e+01,0.375,0.027
1.100e+01 < x <= 1.600e+01,0.5,0.041
1.600e+01 < x <= 1.800e+01,0.429,0.072
1.800e+01 < x <= 1.900e+01,0.2,0.034
1.900e+01 < x <= 2.050e+01,0.333,0.02
2.050e+01 < x <= 2.100e+01,0.154,0.044
2.100e+01 < x <= 2.350e+01,0.182,0.075
2.350e+01 < x <= 2.400e+01,0.5,0.034


Grouping modalities   : 100%|██████████| 31930/31930 [00:05<00:00, 6361.64it/s]
Computing associations: 100%|██████████| 31930/31930 [00:07<00:00, 4450.87it/s]
Testing robustness    :   0%|          | 1/31930 [00:05<45:05:40,  5.08s/it]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 7.000e+00,0.675,0.067
7.000e+00 < x,0.365,0.933

Unnamed: 0,target_rate,frequency
x <= 7.000e+00,0.588,0.058
7.000e+00 < x,0.373,0.942


------


------
[AutoCarver] Fit Pclass (4/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
"1, 1",0.62,0.239
"2, 2",0.468,0.212
"3, 3",0.252,0.549

Unnamed: 0,target_rate,frequency
"1, 1",0.649,0.253
"2, 2",0.483,0.198
"3, 3",0.23,0.549


Grouping modalities   : 100%|██████████| 3/3 [00:00<00:00, 3006.67it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 2804.30it/s]
Testing robustness    :   0%|          | 0/3 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
1 to 2,0.549,0.451
"3, 3",0.252,0.549

Unnamed: 0,target_rate,frequency
1 to 2,0.576,0.451
"3, 3",0.23,0.549


------


------
[AutoCarver] Fit Parents/Children Aboard (5/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.345,0.737
0.000e+00 < x <= 1.000e+00,0.506,0.146
1.000e+00 < x,0.493,0.116

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.347,0.805
0.000e+00 < x <= 1.000e+00,0.677,0.106
1.000e+00 < x,0.385,0.089


Grouping modalities   : 100%|██████████| 3/3 [00:00<00:00, 3006.67it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3008.83it/s]
Testing robustness    :   0%|          | 0/3 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.345,0.737
0.000e+00 < x,0.5,0.263

Unnamed: 0,target_rate,frequency
x <= 0.000e+00,0.347,0.805
0.000e+00 < x,0.544,0.195


------


------
[AutoCarver] Fit Sex (6/6)
---

 - [AutoCarver] Raw distribution


Unnamed: 0,target_rate,frequency
male,0.188,0.636
female,0.731,0.364

Unnamed: 0,target_rate,frequency
male,0.195,0.666
female,0.765,0.334


Grouping modalities   : 100%|██████████| 1/1 [00:00<00:00, 1003.42it/s]
Computing associations: 100%|██████████| 1/1 [00:00<00:00, 1001.27it/s]
Testing robustness    :   0%|          | 0/1 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution





Unnamed: 0,target_rate,frequency
male,0.188,0.636
female,0.731,0.364

Unnamed: 0,target_rate,frequency
male,0.195,0.666
female,0.765,0.334


------
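
One way to read the paired tables in the log above (taken here to be the training and dev samples) is to check whether modalities are ranked the same way by target rate in both. The sketch below applies that check to the ``"Siblings/Spouses Aboard"`` figures. Treating rank consistency as the robustness criterion is an assumption, and ``same_ranking`` is a hypothetical helper, not an AutoCarver function.

```python
def same_ranking(first_rates, second_rates):
    """True when modalities are ordered identically by target rate in both samples."""

    def ranking(rates):
        return sorted(range(len(rates)), key=rates.__getitem__)

    return ranking(first_rates) == ranking(second_rates)


# "Siblings/Spouses Aboard", raw distribution (first and second table)
raw_first = [0.361, 0.5, 0.55, 0.094]
raw_second = [0.32, 0.606, 0.25, 0.286]

# "Siblings/Spouses Aboard", carved distribution (first and second table)
carved_first = [0.361, 0.506, 0.094]
carved_second = [0.32, 0.57, 0.286]

print(same_ranking(raw_first, raw_second))        # False
print(same_ranking(carved_first, carved_second))  # True
```

The raw modalities are ranked differently across the two samples, while the carved ones keep the same order after merging the two middle modalities.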



In [3]:
# from ucimlrepo import fetch_ucirepo 
  
# # fetch dataset 
# breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# # data (as pandas dataframes) 
# X = breast_cancer_wisconsin_diagnostic.data.features 
# y = breast_cancer_wisconsin_diagnostic.data.targets 
  
# # metadata 
# print(breast_cancer_wisconsin_diagnostic.metadata) 
  
# # variable information 
# print(breast_cancer_wisconsin_diagnostic.variables) 

{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'published_in': 'Electronic imaging', 'year': 1993, 'url': 'https://www.semanticscholar.org/paper/53