# Auto-ablation with Howso Engine

> Note: This feature is experimental and is not universally recommended for production deployment.

## Overview

This notebook shows how to use auto-ablation during training to reduce the number of cases a `Trainee` stores _as they are trained_.

In [1]:
import pandas as pd
from pmlb import fetch_data

from howso.engine import Trainee
from howso.utilities import infer_feature_attributes

## Step 1: Load Data & Feature Mapping

Our example dataset for this recipe is the well-known `Adult` dataset. This dataset works well because it has over 48,000 cases. In general, datasets containing over 25,000 rows work well. The default minimum size is 1,000 cases, below which no cases will be ablated.
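
As a quick sanity check before enabling auto-ablation, the size guidance above can be written as a small helper. This is only a sketch: `describe_ablation_fit` and its constants are hypothetical names used for illustration and are not part of Howso Engine; the thresholds are the ones quoted in this recipe.

```python
# Hypothetical helper (not part of Howso Engine): classify a dataset by the
# size guidance given in this recipe.
DEFAULT_MIN_NUM_CASES = 1_000  # below this, no cases are ablated (per this recipe)
RECOMMENDED_ROWS = 25_000      # datasets above this size tend to work well (per this recipe)


def describe_ablation_fit(num_rows: int, min_num_cases: int = DEFAULT_MIN_NUM_CASES) -> str:
    """Return a short description of how well a dataset size suits auto-ablation."""
    if num_rows < min_num_cases:
        return "too small: no cases will be ablated"
    if num_rows <= RECOMMENDED_ROWS:
        return "ablation possible, but below the recommended size"
    return "good candidate for auto-ablation"


# The Adult dataset used here has 48,842 rows.
print(describe_ablation_fit(48_842))
```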

This part of the process is identical to other recipes.

In [2]:
df = fetch_data("adult", local_cache_dir="../../../data")
features = infer_feature_attributes(df)

display(df)
display(features.to_dataframe())

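| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 7 | 77516.0 | 9 | 13.0 | 4 | 1 | 1 | 4 | 1 | 2174.0 | 0.0 | 40.0 | 39 | 1 |
| 1 | 50.0 | 6 | 83311.0 | 9 | 13.0 | 2 | 4 | 0 | 4 | 1 | 0.0 | 0.0 | 13.0 | 39 | 1 |
| 2 | 38.0 | 4 | 215646.0 | 11 | 9.0 | 0 | 6 | 1 | 4 | 1 | 0.0 | 0.0 | 40.0 | 39 | 1 |
| 3 | 53.0 | 4 | 234721.0 | 1 | 7.0 | 2 | 6 | 0 | 2 | 1 | 0.0 | 0.0 | 40.0 | 39 | 1 |
| 4 | 28.0 | 4 | 338409.0 | 9 | 13.0 | 2 | 10 | 5 | 2 | 0 | 0.0 | 0.0 | 40.0 | 5 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 39.0 | 4 | 215419.0 | 9 | 13.0 | 0 | 10 | 1 | 4 | 0 | 0.0 | 0.0 | 36.0 | 39 | 1 |
| 48838 | 64.0 | 0 | 321403.0 | 11 | 9.0 | 6 | 0 | 2 | 2 | 1 | 0.0 | 0.0 | 40.0 | 39 | 1 |
| 48839 | 38.0 | 4 | 374983.0 | 9 | 13.0 | 2 | 10 | 0 | 4 | 1 | 0.0 | 0.0 | 50.0 | 39 | 1 |
| 48840 | 44.0 | 4 | 83891.0 | 9 | 13.0 | 0 | 1 | 3 | 1 | 1 | 5455.0 | 0.0 | 40.0 | 39 | 1 |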


| | type | decimal_places | bounds.min | bounds.max | bounds.allow_null | bounds.observed_min | bounds.observed_max | data_type | original_type.data_type | original_type.size |
|---|---|---|---|---|---|---|---|---|---|---|
| age | continuous | 0 | 0.0 | 137.0 | True | 17.0 | 90.0 | number | numeric | 8 |
| workclass | nominal | 0 | | | False | | | number | integer | 8 |
| fnlwgt | continuous | 0 | 0.0 | 2449285.0 | True | 12285.0 | 1490400.0 | number | numeric | 8 |
| education | nominal | 0 | | | False | | | number | integer | 8 |
| education-num | continuous | 0 | 0.0 | 26.0 | True | 1.0 | 16.0 | number | numeric | 8 |
| marital-status | nominal | 0 | | | False | | | number | integer | 8 |
| occupation | nominal | 0 | | | False | | | number | integer | 8 |
| relationship | nominal | 0 | | | False | | | number | integer | 8 |
| race | nominal | 0 | | | False | | | number | integer | 8 |
| sex | nominal | 0 | | | False | | | number | integer | 8 |


## Step 3: Create Trainee and Set Parameters

The process of creating the `Trainee` is the same as in the other recipes. However, after the `Trainee` is created, we set auto-analyze and auto-ablation parameters so that the `Trainee` stays analyzed and retains information from the cases it removes during training.

In [3]:
t = Trainee(features=features)

# Set auto-analyze parameters. Note that use_case_weights=True since case-weights are how
# information from ablated cases is retained in the Trainee.
t.set_auto_analyze_params(auto_analyze_enabled=True, use_case_weights=True)

# Set auto-ablation parameters.
t.set_auto_ablation_params(
    auto_ablation_enabled=True,
    influence_weight_entropy_threshold=0.6,
    min_num_cases=2_500,
    max_num_cases=10_000,
    delta_threshold_map={
        "accuracy": {"target": 0.1}
    }
)

This recipe uses the following parameters for auto-ablation:

- `auto_ablation_enabled`: When this parameter is `False`, auto-ablation is not performed.
- `influence_weight_entropy_threshold`: Determines which cases are ablated.
  If the entropy of a case's influence weights would be above this quantile of the existing cases' influence
  weight entropies, the case is ablated. This recipe uses $0.6$ (the 60th percentile), which is the default value.
- `min_num_cases`: The minimum number of cases allowed in the Trainee. No cases are ablated while the Trainee
  holds fewer than this many cases. This recipe uses 2,500 rather than the default of 1,000.
- `max_num_cases`: The maximum number of cases allowed in the Trainee. This recipe uses 10,000.
- `delta_threshold_map`: Defines thresholds that determine when ablation should stop.
  Here, the threshold map instructs auto-ablation to stop if the accuracy of the `"target"` feature falls by
  $0.1$ or more relative to its previous value. For example, ablation stops if accuracy
  for that feature drops from $0.8$ to $0.7$. More formally, ablation stops if
  $\text{acc}_\text{old} - \text{acc}_\text{new} \geq 0.1$.
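
The two decision rules described in the list above can be sketched in plain Python. This illustrates the rules as they are worded in this recipe and is not Howso Engine's internal implementation: all function names are hypothetical, Shannon entropy over normalized weights is an assumed definition of influence weight entropy, and the quantile uses simple linear interpolation.

```python
import math


def influence_entropy(weights):
    """Shannon entropy (in nats) of influence weights normalized to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def quantile(values, q):
    """q-quantile of values, linearly interpolated between sorted points."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def would_ablate(case_weights, existing_entropies, threshold=0.6):
    """Assumed rule: ablate when the case's entropy exceeds the threshold quantile."""
    return influence_entropy(case_weights) > quantile(existing_entropies, threshold)


def should_stop(acc_old, acc_new, delta=0.1):
    """Stop rule from the text: acc_old - acc_new >= delta."""
    return acc_old - acc_new >= delta


# A case influenced evenly by four neighbors has high entropy (log 4),
# while a case dominated by a single neighbor has entropy 0.
existing = [0.0, 0.5, 1.0]
print(would_ablate([1, 1, 1, 1], existing))  # evenly spread influence
print(would_ablate([1], existing))           # single dominant influence
print(should_stop(0.8, 0.7))                 # the example from the text
```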

## Step 4: Training

Now that the parameters are set, the `Trainee` can be trained as if this were a normal workflow.

In [4]:
t.train(df)

Notably, only a subset of the cases have been trained (the rest have been ablated).

In [5]:
num_trained_cases = t.get_num_training_cases()
num_total_cases = len(df)
num_trained_percent = round((num_trained_cases / num_total_cases) * 100, 3)

print(f"Trained {num_trained_cases:,} cases out of {num_total_cases:,} ({num_trained_percent}%).")

Trained 2,946 cases out of 48,842 (6.032%).
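
The printed percentage and the configured bounds can be double-checked with plain arithmetic. The numbers below are copied from this notebook's output and parameters; the variable names `retained_percent`, `reduction_factor`, and `within_bounds` are only illustrative.

```python
num_trained_cases = 2_946   # from the training output above
num_total_cases = 48_842    # rows in the Adult dataset
min_num_cases, max_num_cases = 2_500, 10_000  # values passed to set_auto_ablation_params

retained_percent = round(num_trained_cases / num_total_cases * 100, 3)
reduction_factor = num_total_cases / num_trained_cases

# Check whether the retained case count lies inside the configured bounds.
within_bounds = min_num_cases <= num_trained_cases <= max_num_cases

print(f"Retained {retained_percent}% of cases, a {reduction_factor:.1f}x reduction.")
print(f"Within [min_num_cases, max_num_cases]: {within_bounds}")
```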


## Step 5: Results

An ablated `Trainee` can be used in much the same way as other `Trainee`s, as long as `use_case_weights=True`. Here, we will investigate the prediction stats for all features.

In [6]:
stats = t.get_prediction_stats(
    action_feature="target",
    details={
        "prediction_stats": True,
        "selected_prediction_stats": ["all"],
    },
)

In [7]:
stats.loc[stats.index != "confusion_matrix", :]

| | fnlwgt | age | education-num | capital-loss | capital-gain | hours-per-week | occupation | relationship | target | education | native-country | sex | race | marital-status | workclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| adjusted_smape | 56.690903 | 25.178745 | 4.129842 | 106.63314 | 104.061808 | 29.858209 | | | | | | | | | |
| recall | | | | | | | 0.227443 | 0.464548 | 0.80129 | 0.930873 | 0.078264 | 0.80912 | 0.272753 | 0.372112 | 0.347871 |
| missing_value_accuracy | | | | | | | | | | | | | | | |
| accuracy | | | | | | | 0.302 | 0.69 | 0.84 | 0.966 | 0.865927 | 0.831 | 0.854 | 0.752 | 0.697 |
| spearman_coeff | 0.063918 | 0.488301 | 0.941597 | 0.383378 | 0.420499 | 0.318865 | | | | | | | | | |
| smape | 56.691077 | 25.502498 | 4.508189 | 108.042764 | 105.318053 | 30.337268 | | | | | | | | | |
| mae | 112416.305047 | 10.582308 | 0.293207 | 333.896078 | 3175.826992 | 11.134636 | 0.715739 | 0.334505 | 0.173946 | 0.041623 | 0.157321 | 0.181778 | 0.169707 | 0.261366 | 0.345619 |
| mcc | | | | | | | 0.220056 | 0.571071 | 0.592682 | 0.961087 | 0.226766 | 0.627179 | 0.260268 | 0.632135 | 0.395565 |
| precision | | | | | | | 0.231938 | 0.694472 | 0.791473 | 0.968499 | 0.105683 | 0.818124 | 0.411596 | 0.401198 | 0.436838 |
| r2 | -0.200978 | 0.149978 | 0.868959 | -0.202335 | -0.010409 | 0.034246 | | | | | | | | | |


In [8]:
# The confusion matrix can be retrieved
print("Howso Prediction Results - Confusion Matrix for 'target'")
matrix = pd.DataFrame(stats["target"]["confusion_matrix"]["matrix"])
matrix.index.name = "Predicted"
matrix.columns.name = "Actual"
display(matrix)

Howso Prediction Results - Confusion Matrix for 'target'


| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 188 | 87 |
| 1 | 73 | 652 |
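
As a cross-check, the summary statistics for `target` can be recomputed by hand from this confusion matrix. The sketch below assumes the orientation displayed above (rows are predicted labels, columns are actual labels) and uses unweighted per-class averages for precision and recall; that averaging is an assumption, chosen because it reproduces the values reported in the prediction stats.

```python
import math

# Rows are predicted labels, columns are actual labels (as displayed above).
matrix = [[188, 87],
          [73, 652]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Per-class precision divides by the row (predicted) total;
# per-class recall divides by the column (actual) total.
precisions = [matrix[i][i] / sum(matrix[i]) for i in range(2)]
recalls = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(2)]
macro_precision = sum(precisions) / 2
macro_recall = sum(recalls) / 2

# Matthews correlation coefficient for the binary case.
a, b = matrix[0]
c, d = matrix[1]
mcc = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"accuracy={accuracy:.3f} precision={macro_precision:.6f} "
      f"recall={macro_recall:.6f} mcc={mcc:.6f}")
```

These agree with the `accuracy` (0.84), `precision` (0.791473), `recall` (0.80129), and `mcc` (0.592682) entries reported for `target`. The matrix sums to 1,000, which suggests the statistics were computed on a sample of 1,000 cases.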


## Conclusion

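In this recipe, Howso Engine with auto-ablation retained 2,946 of 48,842 cases (about 6%) and reported an accuracy of 0.84 on `target`. Auto-ablation can compress data to a fraction of its original size while keeping accuracy (and other measures) within an acceptable range, without compromising the other capabilities of the `Trainee`.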