# Auditing and Editing
## Overview 

Howso Engine's instance-based machine learning approach enables unique capabilities beyond interpretability, which we covered in the other recipes. There, we detected possible anomalies and investigated Influential Cases and features that may be of concern. Because Howso Engine gives us dynamic control over our Trainee, we can act on that information without incurring the expense of recreating the model. This is in contrast to most machine learning models which, once trained, are difficult to update without retraining the entire model.

In this notebook we demonstrate the editability of a Howso Engine Trainee, taking advantage of the Trainee and data diagnostic results shown in the `engine-insights.ipynb` recipe.

This can be done on a small scale, where a single case is edited or removed to modify the behavior of the Trainee, or on a larger scale: Howso Engine sessions let us add or remove entire batches of training data. This can be very useful if we are continuously adding data to our Trainee and discover that certain batches are undesirable.

### Sessions

A Trainee Session is associated with each modification to a Trainee, which is useful for auditability. A session consists of the following information:  

- Unique identifier  

- The user for which the Session was created  

- Date the Session was created 

- Name, given by the user (Optional) 

- Metadata for the user to store information (Optional) 

When working with Trainees, a default session will be automatically started for you unless you explicitly start (or create) your own. This session will be used for all interactions with the Trainee, unless a new session is explicitly started, for as long as your client is running. Additionally, each instance of a Howso Client will use its own unique active session. Starting a new session explicitly is useful if you want to give it a name and/or metadata for your own reference later, or if you wish to use separate sessions for different modifications of the Trainee. For example, using a unique session each time you train would allow you to later reference the specific cases that were trained by a certain session. 
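To make the audit trail concrete, the fields above can be pictured as a small record attached to every trained case. The sketch below is a hypothetical, in-memory stand-in: `SessionRecord` and `train_under` are illustrative names, not part of the Howso API, and numbering cases from 0 within each session is an assumption. It only shows why tagging cases with a session identifier makes it possible to look them up later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4


@dataclass
class SessionRecord:
    """Hypothetical stand-in mirroring the session fields listed above."""
    user: str
    name: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid4()))
    created_date: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def train_under(session, rows, store):
    """Append rows to the store, tagging each with the session id and
    its position within that session (assumed to start at 0)."""
    for position, row in enumerate(rows):
        store.append(
            {**row, ".session": session.id, ".session_training_index": position}
        )


store = []
first = SessionRecord(user="analyst", name="train_session_1",
                      metadata={"data": "original data"})
second = SessionRecord(user="analyst", name="train_session_2")
train_under(first, [{"target": 1}, {"target": 0}], store)
train_under(second, [{"target": 1}], store)

# Every case trained by the second session can be recovered from its tag
from_second = [case for case in store if case[".session"] == second.id]
print(len(from_second))
```

Because each case carries its session tag, a later audit can select or discard everything a given session contributed.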


## Recipe Goals:

This notebook will show how to edit cases in a Howso Engine Trainee, either individually or in batches through the use of Sessions. This will allow the user to take actions on cases they deem necessary through use of the interpretability and auditing tools shown in other recipes.

In [1]:
import pandas as pd
from pmlb import fetch_data

from howso import engine
from howso.utilities import infer_feature_attributes

# Section 1: Train, Analyze, and Evaluate

For questions about the specific steps of this section, please see the [basic workflow guide](https://docs.howso.com/user_guide/basics/basic_workflow.html).

### Step 1: Load Data

Our example dataset for this recipe continues to be the well-known `Adult` dataset. This dataset consists of 14 Context Features and 1 Action Feature. The Action Feature in this version of the `Adult` dataset has been renamed to `target`, and it takes the form of a binary indicator for whether a person in the data makes over $50,000/year (*target*=1) or less (*target*=0).

In [2]:
df = fetch_data('adult', local_cache_dir="../../data/adult")

# subsample the data to ensure the example runs quickly
df = df.sample(1001, random_state=0)

df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
38113,41.0,4,151856.0,11,9.0,2,11,0,4,1,0.0,0.0,40.0,39,1
39214,57.0,6,87584.0,10,16.0,0,10,1,4,0,0.0,0.0,25.0,39,1
44248,31.0,2,220669.0,9,13.0,4,10,3,4,0,6849.0,0.0,40.0,39,1
10283,55.0,4,171355.0,8,11.0,2,7,0,4,1,0.0,0.0,20.0,39,1
26724,59.0,6,148626.0,0,6.0,2,5,0,4,1,0.0,0.0,40.0,39,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4721,60.0,0,204486.0,9,13.0,2,0,0,4,1,0.0,0.0,8.0,39,0
40113,48.0,2,93449.0,14,15.0,2,10,0,1,1,99999.0,0.0,40.0,30,0
17827,25.0,4,114838.0,14,15.0,4,10,1,4,0,0.0,0.0,8.0,22,1
35120,22.0,4,202125.0,11,9.0,2,12,0,4,1,0.0,0.0,50.0,39,1


### Step 2: Train Trainee utilizing Sessions

In this section we will perform all of the steps needed to train Howso Engine's Trainee.

In [3]:
# Infer feature attributes
features = infer_feature_attributes(df)

# Specify Context and Action Features
action_features = ['target']
context_features = features.get_names(without=action_features)

# We extract one row for demonstrative purposes later
test_case = df.iloc[0]
df = df.iloc[1:]

# Split the data into Context Features (X) and Action Feature (y)
dfX = df[context_features]
dfy = df[action_features]

To demonstrate how to edit cases, we will break the training into two sessions. 

1. The first session is for half of the original dataset 
2. The second session is a modified version of the remaining half of the original dataset containing a target feature that is flipped from the true value

In [4]:
# Split the row index into two halves, one per training session
half = dfX.shape[0] // 2
ind_session_1 = dfX.index[:half]
ind_session_2 = dfX.index[half:]

X_train_1 = dfX.loc[ind_session_1]
y_train_1 = dfy.loc[ind_session_1]

X_train_2 = dfX.loc[ind_session_2]

# Flip the target value for the second set of target feature values
y_train = dfy['target']
y_train_2 = pd.Series([int(not x) for x in y_train.loc[ind_session_2]], name=action_features[0], index=ind_session_2)

In [5]:
# Create the Trainee
t = engine.Trainee(
    features=features,
    overwrite_existing=True
)

session = engine.Session('train_session_1', metadata={'data': 'original data'})
t.train(X_train_1.join(y_train_1))

session = engine.Session('train_session_2', metadata={'data': 'modified data (flipped target values)'})
t.train(X_train_2.join(y_train_2))

# Analyze the Trainee
t.analyze()

### Step 3: Inspect Results

In [6]:
accuracy = t.react_aggregate(
    details={
        "prediction_stats": True,
        "selected_prediction_stats": ['accuracy'],
    }
)['target'].iloc[0]

print("Test set prediction accuracy: {acc}".format(acc=accuracy))

Test set prediction accuracy: 0.491


As expected, flipping the target feature's values for half of the data greatly reduces the accuracy compared to the expected accuracy shown in recipe `engine-intro.ipynb`.

While it is unrealistic to know this ahead of time in a real world setting, we will use this stark result to clearly demonstrate the effect of Trainee editing.
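A back-of-the-envelope check shows why a score near 0.5 is plausible here. The sketch below is not how Howso Engine computes accuracy. It assumes a hypothetical fixed predictor that matches the true binary label with probability `true_accuracy`, scored against labels of which a fraction has been inverted; `accuracy_with_flipped_labels` is an illustrative name.

```python
def accuracy_with_flipped_labels(true_accuracy, flipped_fraction=0.5):
    """Expected accuracy when a fraction of binary labels is inverted.

    On the untouched labels the predictor is right with probability
    `true_accuracy`; on the inverted labels every correct prediction
    becomes wrong, so it is right with probability 1 - true_accuracy.
    """
    kept_fraction = 1.0 - flipped_fraction
    return (kept_fraction * true_accuracy
            + flipped_fraction * (1.0 - true_accuracy))


for a in (0.6, 0.8, 0.95):
    print(a, accuracy_with_flipped_labels(a))
```

Under this simplified model, inverting exactly half of the labels yields an expected accuracy of 0.5 regardless of how good the predictor is, which is in line with the 0.491 reported above.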


# Section 2: Audit & Edit a Trainee

There are many reasons to audit and edit a Trainee. In other recipes, we highlighted several training cases that may be anomalous and are therefore candidates for removal. In this recipe, we have an entire chunk of training data that is incorrect. Howso Engine can edit data at both of these scales.

What sets Howso Engine apart from other machine learning models is that there is no need for retraining. For example, if we use `Scikit-Learn`'s logistic regression and discover that our training data contains cases we would like to remove, we have to go back to the beginning of the workflow, remove the problematic cases from the training data, and completely retrain the model.

In Howso Engine, this is unnecessary unless a very large portion of the training data is altered. If this is the case, re-analyzing the Trainee may be appropriate, although it is not strictly necessary.

### Step 1: Editing Individual Cases to Tune the Trainee

Here we demonstrate how to edit a Trainee one case at a time. Editing a case allows the user to modify or "fix" the behavior of the Trainee by the targeted editing of one or more cases. The user has complete control over all data in the Trainee, making it dynamic and quickly adjustable. Users do not need to worry about minor mistakes as the Trainee can be fine-tuned with this method after training.

In our use case, the anomalous cases identified for the `Adult` dataset in the `anomaly_detection.ipynb` recipe represent possible cases we want to edit. We noticed certain cases with unusual values for `capital-gain`, like 99999, that look like placeholder values standing in for something else, such as blanks. Editing cases allows us to easily correct these minor issues in an otherwise valid case after training.
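One simple way to surface such placeholder-like values before deciding whether to edit them is to flag numbers written entirely with the digit 9. The sketch below uses only the `capital-gain` values visible in the table printed in Section 1; `looks_like_sentinel` and its all-nines heuristic are assumptions for illustration, not part of Howso Engine.

```python
def looks_like_sentinel(value, min_digits=4):
    """Heuristic: whole numbers written only with the digit 9, e.g. 99999."""
    if value != int(value):
        return False
    digits = str(int(value))
    return len(digits) >= min_digits and set(digits) == {"9"}


# capital-gain values from the rows shown in the Step 1 table
capital_gain = [0.0, 0.0, 6849.0, 0.0, 0.0, 0.0, 99999.0, 0.0, 0.0]

# Keep the row position alongside each suspicious value
suspects = [
    (row, value)
    for row, value in enumerate(capital_gain)
    if looks_like_sentinel(value)
]
print(suspects)
```

A flagged value is only a candidate for review; whether 99999 really stands in for a missing value has to be confirmed against the dataset's documentation.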

If we believe that a case is entirely invalid and warrants removal, Howso Engine can also remove it entirely.

To demonstrate this ability, we `react` to a single case to compare the predictions before and after an edit.

In [7]:
test_case_X = test_case[context_features]
test_case_y = test_case[action_features]

details = {
    'influential_cases':True,
}

new_result = t.react(
    [test_case_X.values.tolist()],
    context_features=context_features,
    action_features=action_features,
    details=details
)

In [8]:
print('prediction: {}'.format(int(new_result['action']['target'].iloc[0])))
print('actual: {}'.format(int(test_case_y.iloc[0])))

prediction: 0
actual: 1


#### Results 

We can see that the predicted value is incorrect. If we want to artificially correct this prediction using our Trainee, we can edit its Influential Cases. This is for demonstrative purposes only; we do not recommend editing Influential Cases without fully investigating them.

### Step 2: Identify Influential Cases

To determine which cases we want to edit, we identify the Influential Cases.

In [9]:
influence_df = pd.DataFrame(new_result['details']['influential_cases'][0])
influence_df

Unnamed: 0,age,education,marital-status,race,capital-gain,capital-loss,hours-per-week,target,sex,occupation,.session_training_index,workclass,native-country,.session,fnlwgt,education-num,relationship,.influence_weight
0,42,11,2,4,0,0,40,0,1,6,410,4,39,0e5bfb65-e733-43b9-985d-6e174cd77f80,171351,9,0,0.126047
1,40,11,2,4,0,0,40,0,1,7,341,4,39,0e5bfb65-e733-43b9-985d-6e174cd77f80,114157,9,0,0.125624
2,35,11,2,4,0,0,40,0,1,1,185,4,39,0e5bfb65-e733-43b9-985d-6e174cd77f80,138441,9,0,0.125217
3,36,11,2,4,0,0,40,0,1,3,347,4,39,0e5bfb65-e733-43b9-985d-6e174cd77f80,183279,9,0,0.12517
4,35,11,2,4,0,0,40,0,1,7,203,4,39,0e5bfb65-e733-43b9-985d-6e174cd77f80,113152,9,0,0.124688
5,46,11,2,4,0,0,45,0,1,12,113,4,39,ad4070de-d22b-4e92-bf16-96398c47c5a4,132912,9,0,0.124567
6,36,11,2,4,0,0,40,1,1,4,473,4,39,ad4070de-d22b-4e92-bf16-96398c47c5a4,98360,9,0,0.124434
7,36,11,2,4,0,0,40,1,1,7,403,4,39,ad4070de-d22b-4e92-bf16-96398c47c5a4,209629,9,0,0.124253


We can see that many of the Influential Cases have the incorrect target value.
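A quick tally makes this concrete. The sketch below copies the `target` and `.influence_weight` columns from the table above and sums the weights per target value. Treating the prediction as an influence-weighted vote is a simplifying assumption made here for illustration; it is not a description of Howso Engine's internal computation.

```python
from collections import defaultdict

# (target, .influence_weight) pairs copied from the Influential Cases table
influential = [
    (0, 0.126047), (0, 0.125624), (0, 0.125217), (0, 0.12517),
    (0, 0.124688), (0, 0.124567), (1, 0.124434), (1, 0.124253),
]

# Accumulate the influence weight behind each target value
weight_by_target = defaultdict(float)
for target, weight in influential:
    weight_by_target[target] += weight

for target in sorted(weight_by_target):
    print(target, round(weight_by_target[target], 6))
```

Roughly three quarters of the total influence comes from cases with `target` 0, which is consistent with the incorrect prediction of 0.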

### Step 3: Edit Cases

We will modify two of the Influential Cases whose target value differs from the one we want to predict by flipping their target values. Having more Influential Cases with the correct target value increases the chance that the test case is predicted correctly.

In [10]:
# Modify case 1
session_id = influence_df.iloc[0]['.session']
session_training_index = influence_df.iloc[0]['.session_training_index']

# Flip the target in the original case
cases = t.get_cases(session=session_id, features=['.session_training_index', 'target'])
orig_target = cases.set_index('.session_training_index').loc[session_training_index].iloc[0]

# Flip the target
if str(orig_target) == '0':
    flipped = 1
else:
    flipped = 0

t.edit_cases(feature_values=[flipped],
             case_indices=[(session_id, session_training_index.item())],
             features=['target'])

print("Modifying training index {ind} of Session {session_id} target value to {tar}".format(ind=session_training_index, session_id=session_id, tar=flipped))

Modifying training index 410 of Session 0e5bfb65-e733-43b9-985d-6e174cd77f80 target value to 1


In [11]:
# Modify case 2
session_id = influence_df.iloc[1]['.session']
session_training_index = influence_df.iloc[1]['.session_training_index']

# Flip the target in the original case
cases = t.get_cases(session=session_id, features=['.session_training_index', 'target'])
orig_target = cases.set_index('.session_training_index').loc[session_training_index].iloc[0]

# Flip the target
if str(orig_target) == '0':
    flipped = 1
else:
    flipped = 0

t.edit_cases(feature_values=[flipped],
             case_indices=[(session_id, session_training_index.item())],
             features=['target'])

print("Modifying training index {ind} of Session {session_id} target value to {tar}".format(ind=session_training_index, session_id=session_id, tar=flipped))

Modifying training index 341 of Session 0e5bfb65-e733-43b9-985d-6e174cd77f80 target value to 1
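The two edit cells repeat the same steps for different rows of `influence_df`, so they can be collected into one function. The sketch below uses the same `get_cases` and `edit_cases` keyword arguments as the cells above; `flip_binary_target` is a hypothetical helper name, and `StandInTrainee` is an in-memory stand-in (not Howso code) included only so the sketch runs without a real Trainee.

```python
import pandas as pd


def flip_binary_target(trainee, session_id, training_index, target="target"):
    """Flip a 0/1 target for one case identified by (session, training index)."""
    cases = trainee.get_cases(
        session=session_id,
        features=[".session_training_index", target],
    )
    original = cases.set_index(".session_training_index").loc[training_index, target]
    flipped = 0 if int(original) == 1 else 1
    trainee.edit_cases(
        feature_values=[flipped],
        case_indices=[(session_id, int(training_index))],
        features=[target],
    )
    return flipped


class StandInTrainee:
    """In-memory stand-in exposing only the two calls used above."""

    def __init__(self, cases):
        self._cases = cases

    def get_cases(self, session, features):
        rows = self._cases[self._cases[".session"] == session]
        return rows[features].reset_index(drop=True)

    def edit_cases(self, feature_values, case_indices, features):
        for session_id, index in case_indices:
            match = (self._cases[".session"] == session_id) & (
                self._cases[".session_training_index"] == index
            )
            for feature, value in zip(features, feature_values):
                self._cases.loc[match, feature] = value


# Toy data: two cases in a made-up session "s2", both with target 0
demo = StandInTrainee(pd.DataFrame({
    ".session": ["s2", "s2"],
    ".session_training_index": [410, 341],
    "target": [0, 0],
}))
new_value = flip_binary_target(demo, "s2", 410)
remaining = demo.get_cases(session="s2", features=["target"])["target"].tolist()
print(new_value, remaining)
```

With a real Trainee, the same function would be called once per Influential Case to edit, passing the `.session` and `.session_training_index` values read from `influence_df`.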


### Step 4: Verify the Edit and Check the Case Audit

We can audit one of the updated cases to confirm that it has been edited and to demonstrate how to retrieve its edit history. The case edit history provides another layer of auditability and accountability to the Trainee.

In [12]:
updated_case = t.get_cases(
    case_indices=[(session_id, session_training_index.item())],
    features=df.columns.tolist() + ['.case_edit_history']
)

# audit edit history
updated_case.loc[0, '.case_edit_history']

{'0e5bfb65-e733-43b9-985d-6e174cd77f80': [{'value': 1,
   'feature': 'target',
   'type': 'edit',
   'previous_value': 0}]}
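The history is a plain dictionary keyed by session ID, so it can be flattened into readable audit lines. The sketch below works on the literal value printed above; `summarize_edit_history` is a hypothetical helper name, and the structure is assumed to match this output.

```python
def summarize_edit_history(history):
    """Turn a `.case_edit_history` dictionary into readable audit lines."""
    lines = []
    for session_id, events in history.items():
        for event in events:
            lines.append(
                "session {sid}: {kind} {feat} {old!r} -> {new!r}".format(
                    sid=session_id,
                    kind=event["type"],
                    feat=event["feature"],
                    old=event.get("previous_value"),
                    new=event.get("value"),
                )
            )
    return lines


# Literal value copied from the output above
history = {
    "0e5bfb65-e733-43b9-985d-6e174cd77f80": [
        {"value": 1, "feature": "target", "type": "edit", "previous_value": 0}
    ]
}
for line in summarize_edit_history(history):
    print(line)
```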

### Step 5: Predict Again

We will re-run the prediction to see if the target value is correct now.

In [13]:
new_result = t.react(
    [test_case_X.values.tolist()],
    context_features=context_features,
    action_features=action_features,
    details=details
)

print('prediction: {}'.format(new_result['action']['target'].iloc[0]))
print('actual: {}'.format(test_case_y.iloc[0]))

prediction: 1
actual: 1.0


We can see that by editing those two cases, we flipped the prediction for our original test case without re-training or re-analyzing our Trainee. If done correctly, this provides a user with a surgical tool for Trainee corrections.
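A simplified, influence-weighted tally suggests why two edits were enough. This is an assumption for illustration, not Howso Engine's documented internals: the sketch reuses the weights from the Influential Cases table and assumes the set of Influential Cases and their weights stay the same after the edit.

```python
def weighted_vote(cases):
    """Sum influence weights per target value and return the totals."""
    totals = {0: 0.0, 1: 0.0}
    for target, weight in cases:
        totals[target] += weight
    return totals


# (target, .influence_weight) pairs from the Influential Cases table
before = [
    (0, 0.126047), (0, 0.125624), (0, 0.125217), (0, 0.12517),
    (0, 0.124688), (0, 0.124567), (1, 0.124434), (1, 0.124253),
]

# Flip the target of the two most influential cases, as in Step 3
after = [
    (1 - target, weight) if row < 2 else (target, weight)
    for row, (target, weight) in enumerate(before)
]

votes_before = weighted_vote(before)
votes_after = weighted_vote(after)
print(votes_before)
print(votes_after)
```

Under these assumptions the weight behind `target` 1 rises from about 0.249 to about 0.500, which is only a very narrow majority. A margin this thin is a reminder that such an edit nudges one prediction and should not be read as a robust correction.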

### Step 6: Delete a Case

In addition to editing a case, Howso Engine can also delete a case, removing it from the model and any further predictions. This workflow is the same as the edit example in the section above, except we use `remove_cases` instead of `edit_cases`. 

In [14]:
# remove cases using ".session_training_index"
t.remove_cases(num_cases=1, case_indices=[(session_id, session_training_index.item())])

1

The dynamic editing and deleting of individual Cases allows the user to perform targeted modification of the Trainee and provides the user with fine-grained control over their data and Trainee. This should not be done lightly, and we recommend that all Cases be investigated before performing this action.

# Section 3: Editing Sessions

At the beginning of this notebook, we trained the data in two sessions. The first session used a normal sample of training data, whereas the second session used artificially flipped target values. This reduced the performance of our Trainee by introducing a large portion of incorrect data.

Howso Engine has the capability to add or remove entire sessions. In this situation, if we discovered that one of our sessions had very poor quality data, like our example, we can easily remove that entire session's data without having to individually alter cases.

### Step 1: View Sessions

Let's first see how many sessions are in this Trainee along with some details of each session.

In [15]:
sessions = t.get_sessions()
sessions

[{'id': 'ad4070de-d22b-4e92-bf16-96398c47c5a4', 'name': 'train_session_1'},
 {'id': '0e5bfb65-e733-43b9-985d-6e174cd77f80', 'name': 'train_session_2'}]

In [16]:
display(engine.get_session(sessions[0]['id']))
display(engine.get_session(sessions[1]['id']))

{'id': 'ad4070de-d22b-4e92-bf16-96398c47c5a4',
 'name': 'train_session_1',
 'metadata': {'data': 'original data',
              'trainee_id': '8b6ba317-0c5d-40d7-9653-367bd8a52ec5'},
 'created_date': datetime.datetime(2024, 9, 4, 22, 26, 33, 289797, tzinfo=tzlocal()),
 'modified_date': datetime.datetime(2024, 9, 4, 22, 26, 33, 289799, tzinfo=tzlocal())}

{'id': '0e5bfb65-e733-43b9-985d-6e174cd77f80',
 'name': 'train_session_2',
 'metadata': {'data': 'modified data (flipped target values)'},
 'created_date': datetime.datetime(2024, 9, 4, 22, 26, 33, 315927, tzinfo=datetime.timezone.utc),
 'modified_date': datetime.datetime(2024, 9, 4, 22, 26, 33, 315929, tzinfo=datetime.timezone.utc)}

We can see the two different sessions we used when we trained earlier.

### Step 2: Delete a Session

Deleting an entire session is performed in one easy step once we retrieve the session ID of the session we want to delete.
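Selecting a session by its position in the list relies on the order in which sessions are returned. A slightly more defensive sketch looks the ID up by name, using the list printed in Step 1; `session_id_by_name` is a hypothetical helper, not a Howso API call.

```python
def session_id_by_name(sessions, name):
    """Return the ID of the single session with the given name."""
    matches = [s["id"] for s in sessions if s.get("name") == name]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one session named {!r}, found {}".format(
                name, len(matches)
            )
        )
    return matches[0]


# Literal value copied from the `t.get_sessions()` output in Step 1
sessions = [
    {"id": "ad4070de-d22b-4e92-bf16-96398c47c5a4", "name": "train_session_1"},
    {"id": "0e5bfb65-e733-43b9-985d-6e174cd77f80", "name": "train_session_2"},
]
faulty_session_id = session_id_by_name(sessions, "train_session_2")
print(faulty_session_id)
```

Raising an error when the name is missing or ambiguous avoids deleting the wrong batch of training data.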

In [17]:
# Delete a session
session_id = sessions[1]['id']
t.delete_session(session_id)

# Re-analyze the Trainee
t.analyze()

### Step 3: Recompute Accuracy and Inspect

We then use `react_aggregate` to compute the accuracy metric.

In [18]:
accuracy_new = t.react_aggregate(
    details={
        "prediction_stats": True,
        "selected_prediction_stats": ['accuracy'],
    }
)['target'].iloc[0]

print("Original accuracy: {acc}".format(acc=accuracy))
print("New accuracy: {acc}".format(acc=accuracy_new))

Original accuracy: 0.491
New accuracy: 0.812


In [19]:
# Check to make sure there is only 1 session
t.get_sessions()

[{'id': 'ad4070de-d22b-4e92-bf16-96398c47c5a4', 'name': 'train_session_1'}]

We can clearly see the difference in accuracy results once the faulty session data is removed.

Additionally, we see only one session remains in the Trainee.

# Conclusion:

We can see that by getting rid of the session with the faulty data, our Trainee performance improved dramatically, as expected. This capability gives the user a very efficient way to maintain control over a continuously evolving Trainee when training data is constantly being added.

The tools shown in this and other recipes allow the user to find, diagnose, and act on problematic data with ease and precision, without retraining the entire model. This provides a flexible platform that can adjust to a wide range of machine learning needs.