# Auditing and Editing
## Overview 

Howso Engine's instance-based machine learning approach enables unique capabilities beyond the interpretability we explored in the other recipes. There, we detected possible anomalies and investigated Influential Cases and features that may be of concern. Combining that information with the dynamic control Howso Engine gives us over our Trainee, we can take meaningful action without incurring the expense of recreating the model. This is in contrast to most machine learning models, which are difficult to update once trained without retraining the entire model.

In this notebook we demonstrate the editability of a Howso Engine Trainee, taking advantage of the Trainee and data diagnostic results shown in the `engine-insights.ipynb` recipe.

This can be done on a small scale, where a single case is edited or removed to modify the behavior of the Trainee, or on a larger scale: a Howso Engine session allows us to toggle entire batches of training data, adding or removing large chunks at once. This is very useful if we are continuously adding data to our Trainee and discover that certain batches are undesirable.

### Sessions

A Trainee Session is associated with each modification to a Trainee, which is useful for auditability. A session consists of the following information:  

- Unique identifier  

- The user for which the Session was created  

- Date the Session was created 

- Name, given by the user (Optional) 

- Metadata for the user to store information (Optional) 

When working with Trainees, a default session will be automatically started for you unless you explicitly start (or create) your own. This session will be used for all interactions with the Trainee, unless a new session is explicitly started, for as long as your client is running. Additionally, each instance of a Howso Client will use its own unique active session. Starting a new session explicitly is useful if you want to give it a name and/or metadata for your own reference later, or if you wish to use separate sessions for different modifications of the Trainee. For example, using a unique session each time you train would allow you to later reference the specific cases that were trained by a certain session. 
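
To make the session concept concrete, here is a minimal sketch in plain Python of what a session record holds and how tagging each trained case with a session ID lets a whole batch be located or removed later. The `SessionRecord` and `CaseStore` names are hypothetical and are not part of the Howso Engine API; they only illustrate the bookkeeping described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4


@dataclass
class SessionRecord:
    """Hypothetical stand-in for the session information listed above."""
    user: str
    name: Optional[str] = None                     # optional, given by the user
    metadata: dict = field(default_factory=dict)   # optional, free-form
    id: str = field(default_factory=lambda: str(uuid4()))
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class CaseStore:
    """Hypothetical case store that tags every case with its session ID."""

    def __init__(self):
        self.cases = []

    def train(self, rows, session):
        # Record which session trained each case, and in what order.
        for i, row in enumerate(rows):
            self.cases.append(
                {**row, ".session": session.id, ".session_training_index": i}
            )

    def remove_session(self, session):
        # Drop every case trained under the given session; return the count.
        before = len(self.cases)
        self.cases = [c for c in self.cases if c[".session"] != session.id]
        return before - len(self.cases)


store = CaseStore()
good = SessionRecord(user="alice", name="initial load")
bad = SessionRecord(user="alice", name="suspect batch", metadata={"source": "B"})
store.train([{"x": 1.0, "target": 0}, {"x": 2.0, "target": 0}], good)
store.train([{"x": 1.1, "target": 1}], bad)
removed = store.remove_session(bad)
print(removed, len(store.cases))
```

Because every case carries the ID of the session that trained it, removing a suspect batch is a single filter over that ID rather than a rebuild of the whole store.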


## Recipe Goals:

This notebook will show how to edit cases in a Howso Engine Trainee, either individually or in batches through the use of Sessions. This will allow the user to take actions on cases they deem necessary through use of the interpretability and auditing tools shown in other recipes.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pmlb import fetch_data

from howso import engine
from howso.utilities import infer_feature_attributes

# Section 1: Train, Analyze, and Evaluate

For questions about the specific steps of this section, please see the [basic workflow guide](https://docs.howso.com/user_guide/basics/basic_workflow.html).

### Step 1: Load Data

Our example dataset for this recipe is the well-known `iris` dataset, loaded through `pmlb`. After we drop `petal-width`, it consists of 3 Context Features (`sepal-length`, `sepal-width`, and `petal-length`) and 1 Action Feature, `target`, which encodes the class of each case as 0, 1, or 2.

In [2]:
df = fetch_data('iris', local_cache_dir="../../data/iris")

# We remove petal-width so this is a 3d dataset, easier for visualization
df = df.drop(columns=['petal-width'])
df

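| | sepal-length | sepal-width | petal-length | target |
|---|---|---|---|---|
| 0 | 6.7 | 3.0 | 5.2 | 2 |
| 1 | 6.0 | 2.2 | 5.0 | 2 |
| 2 | 6.2 | 2.8 | 4.8 | 2 |
| 3 | 7.7 | 3.8 | 6.7 | 2 |
| 4 | 7.2 | 3.0 | 5.8 | 2 |
| ... | ... | ... | ... | ... |
| 145 | 5.0 | 3.5 | 1.6 | 0 |
| 146 | 5.4 | 3.9 | 1.7 | 0 |
| 147 | 5.1 | 3.4 | 1.5 | 0 |
| 148 | 5.0 | 3.6 | 1.4 | 0 |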


### Step 2: Train Trainee utilizing Sessions

In this section we will perform all of the steps needed to train Howso Engine's Trainee.

In [3]:
# Infer feature attributes
features = infer_feature_attributes(df)

# Specify Context and Action Features
action_features = ['target']
context_features = features.get_names(without=action_features)

# We extract one row for demonstrative purposes later.
# Copy it so that editing it does not trigger pandas' SettingWithCopyWarning.
test_case = df.iloc[0].copy()
# Change the test case target value for demonstrative purposes
test_case['target'] = 1.0
df = df.iloc[1:]


In [4]:
# Create the Trainee
t = engine.Trainee(
    features=features,
    overwrite_existing=True
)

t.train(df)

# Analyze the Trainee
t.analyze()

The following parameters from configuration file will override the Amalgam parameters set in the code: {'trace'}


### Step 3: Inspect Results

# Section 2: Audit & Edit a Trainee

There are many reasons to audit and edit a Trainee. In other recipes, we highlighted several training cases that may be anomalous and are therefore candidates for removal. In this recipe, we have an entire chunk of training data that is incorrect. Howso Engine has the ability to edit data at different scales.

What sets Howso Engine apart from other machine learning models is that there is no need for retraining. For example, if we use `scikit-learn`'s logistic regression and discover that our training data contains cases we would like to remove, we have to go back to the beginning of the workflow, remove the problematic cases from the training data, and completely retrain the model.

In Howso Engine, this is unnecessary. If a very large portion of the training data is altered, re-analyzing the Trainee may be appropriate, although it is not strictly necessary.
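
The idea that an instance-based model can be corrected by editing its stored cases can be illustrated with a toy k-nearest-neighbors classifier. This is a generic sketch with made-up data, not Howso Engine's algorithm: the `predict` function and the `cases` list are hypothetical, and the only point is that the prediction changes as soon as the stored labels change, with no fitting step in between.

```python
from collections import Counter
from math import dist


def predict(cases, query, k=3):
    """Majority vote over the k stored cases closest to the query."""
    nearest = sorted(cases, key=lambda c: dist(c["x"], query))[:k]
    votes = Counter(c["y"] for c in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: features "x" and label "y".
cases = [
    {"x": (1.0, 1.0), "y": "a"},
    {"x": (1.1, 0.9), "y": "b"},  # suspected mislabel
    {"x": (0.9, 1.2), "y": "b"},  # suspected mislabel
    {"x": (5.0, 5.0), "y": "b"},
    {"x": (5.2, 4.8), "y": "b"},
]

query = (1.0, 1.1)
before = predict(cases, query)

# Edit the two suspect cases in place; nothing is refit.
cases[1]["y"] = "a"
cases[2]["y"] = "a"
after = predict(cases, query)

print(before, after)  # b a
```

A parametric model such as logistic regression would need its coefficients re-estimated after the same correction, which is the retraining step described above.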

### Step 1: Editing Individual Cases to Tune the Trainee

Here we demonstrate how to edit a Trainee one case at a time. Editing a case allows the user to modify or "fix" the behavior of the Trainee by the targeted editing of one or more cases. The user has complete control over all data in the Trainee, making it dynamic and quickly adjustable. Users do not need to worry about minor mistakes as the Trainee can be fine-tuned with this method after training.

For example, the anomalous cases identified for the `Adult` dataset in the `anomaly_detection.ipynb` recipe represent possible cases we may want to edit. There, we noticed certain cases with unusual values for `capital-gains`, such as 99999, that look like placeholder values standing in for something else, such as blanks. Editing cases allows us to easily correct these minor issues in an otherwise valid case after training.

If we believe that a case is entirely invalid and warrants removal, Howso Engine can also remove it entirely.

To demonstrate this ability, we `react` to a single case to compare the predictions before and after an edit.

In [5]:
test_case_X = test_case[context_features]
test_case_y = test_case[action_features]

details = {
    'influential_cases':True,
}

new_result = t.react(
    [test_case_X.values.tolist()],
    context_features=context_features,
    action_features=action_features,
    details=details
)

# Note, we purposely changed the test case's true value to another value for demonstrative purposes.

In [6]:
print('prediction: {}'.format(int(new_result['action']['target'].iloc[0])))
print('actual: {}'.format(int(test_case_y.iloc[0])))

prediction: 2
actual: 1


#### Results 

We can see that the predicted value is incorrect. If we want to artificially correct this prediction using our Trainee, we can edit its Influential Cases. This is for demonstrative purposes only, and we do not recommend editing Influential Cases without fully investigating them.

### Step 2: Identify Influential Cases

To determine which cases we want to edit, we identify the Influential Cases.

In [7]:
# influence_df = pd.DataFrame(new_result['details']['influential_cases'][0]).drop(columns=['.session', '.session_training_index', '.influence_weight'])
influence_df = pd.DataFrame(new_result['details']['influential_cases'][0])

session_id = influence_df.iloc[0]['.session']


We can see that many of the Influential Cases have the incorrect target value.
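
One simple way to check this is to tally the target values among the Influential Cases, weighted by influence, and list the cases that disagree with the value we expect. The sketch below uses plain Python and made-up records; only the `target` and `.influence_weight` field names come from this recipe, and the weighted tally is an illustrative summary, not a description of how Howso Engine computes its prediction.

```python
from collections import defaultdict

# Made-up records shaped like rows of influence_df (values are illustrative).
influential_cases = [
    {"target": 2, ".influence_weight": 0.30},
    {"target": 2, ".influence_weight": 0.25},
    {"target": 1, ".influence_weight": 0.20},
    {"target": 2, ".influence_weight": 0.25},
]
expected_target = 1

# Sum the influence weight attached to each target value.
weight_by_target = defaultdict(float)
for case in influential_cases:
    weight_by_target[case["target"]] += case[".influence_weight"]

# Normalize to shares and collect the positions of disagreeing cases.
total = sum(weight_by_target.values())
shares = {t: w / total for t, w in weight_by_target.items()}
disagreeing = [
    i for i, c in enumerate(influential_cases) if c["target"] != expected_target
]

print(shares)
print(disagreeing)
```

The positions in `disagreeing` are the candidates to investigate before deciding whether an edit is justified.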

### Step 3: Edit Cases

We will modify the target values of those Influential Cases to the "correct" value. Having more Influential Cases with the correct target value increases the chance that the test case is predicted with the correct target value. In a real-world situation, this would generally only be done when the data was incorrectly labeled, and we do not recommend editing cases solely to improve accuracy.

In [8]:
edited_indices = []
for index, row in influence_df.iterrows():
    # Each case is addressed by (session ID, training index within that session)
    t.edit_cases(
        feature_values=[1],
        case_indices=[(session_id, row[".session_training_index"])],
        features=['target']
    )
    edited_indices.append(row[".session_training_index"])

### Step 4: Verify the Edit and Check the Case Audit

We can audit one of the updated cases to confirm that it has been edited and to demonstrate how to retrieve its edit history. The case edit history provides another layer of auditability and accountability for the Trainee.

In [9]:
updated_case = t.get_cases(
    case_indices=[(session_id, edited_indices[0])],
    features=df.columns.tolist() + ['.case_edit_history']
)

# Audit the edit history
updated_case.loc[0, '.case_edit_history']

{'109d8464-533d-471b-882e-492edc5b869c': [{'previous_value': 1,
   'value': 1,
   'feature': 'target',
   'type': 'edit'}]}
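
The history above maps a session ID to a list of edit records. A minimal sketch of how such a record could be kept is shown below; the `AuditedCase` class is hypothetical and only mirrors the shape of the output, not Howso Engine's internal implementation.

```python
from collections import defaultdict


class AuditedCase:
    """Hypothetical case that records every edit, keyed by session ID."""

    def __init__(self, values):
        self.values = dict(values)
        self.edit_history = defaultdict(list)

    def edit(self, session_id, feature, value):
        # Store the old and new value before applying the change.
        self.edit_history[session_id].append({
            "previous_value": self.values.get(feature),
            "value": value,
            "feature": feature,
            "type": "edit",
        })
        self.values[feature] = value


case = AuditedCase({"sepal-length": 6.7, "target": 2})
case.edit("session-1", "target", 1)
history = dict(case.edit_history)
print(history)
```

Keeping the previous value alongside the new one means an edit can be reviewed, attributed to a session, and reversed later if it turns out to be wrong.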

### Step 5: Predict Again

We will re-run the prediction to see if the target value is correct now.

In [10]:
new_result = t.react(
    [test_case_X.values.tolist()],
    context_features=context_features,
    action_features=action_features,
    details=details
)

print('prediction: {}'.format(new_result['action']['target'].iloc[0]))
print('actual: {}'.format(test_case_y.iloc[0]))

prediction: 1
actual: 1.0


We can see that by editing its Influential Cases, we flipped the prediction for our original test case without re-training or re-analyzing our Trainee. If done correctly, this provides a user with a surgical tool for Trainee corrections.

In [11]:


influence_df_orig = influence_df.drop(columns=['.session', '.session_training_index', '.influence_weight'])
# Append the test case as the last row
influence_df_orig.loc[len(influence_df_orig)] = test_case
influence_df_edit = influence_df_orig.copy()
influence_df_edit['target'] = 1
influence_df_edit

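| | sepal-width | petal-length | target | sepal-length |
|---|---|---|---|---|
| 0 | 3.0 | 5.0 | 1 | 6.7 |
| 1 | 3.0 | 5.2 | 1 | 6.5 |
| 2 | 3.0 | 5.5 | 1 | 6.8 |
| 3 | 3.1 | 5.1 | 1 | 6.9 |
| 4 | 3.1 | 5.4 | 1 | 6.9 |
| 5 | 3.0 | 5.5 | 1 | 6.5 |
| 6 | 3.1 | 5.6 | 1 | 6.7 |
| 7 | 3.1 | 4.9 | 1 | 6.9 |
| 8 | 3.0 | 5.2 | 1 | 6.7 |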


In [12]:

color_list_orig = influence_df_orig['target'].astype(str).tolist()
color_list_orig[-1] = 'test case'  # Label the last row as "test case"

# Create a custom color map
color_discrete_map_orig = {'1.0': 'green', '2.0': 'red', 'test case': 'blue'}

# Create a 3D scatter plot
fig1 = px.scatter_3d(
    influence_df_orig,
    x='sepal-width',
    y='petal-length',
    z='sepal-length',
    color=color_list_orig,
    color_discrete_map=color_discrete_map_orig,
)


color_list_edit = influence_df_edit['target'].astype(str).tolist()
color_list_edit[-1] = 'test case'  # Label the last row as "test case"

# Create a custom color map
color_discrete_map_edit = {'1': 'green', '2': 'red', 'test case': 'blue'}

# Create a 3D scatter plot
fig2 = px.scatter_3d(
    influence_df_edit,
    x='sepal-width',
    y='petal-length',
    z='sepal-length',
    color=color_list_edit,
    color_discrete_map=color_discrete_map_edit,
)


# Create subplots
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}]],
    subplot_titles=("Original Dataset", "Edited Dataset")
    )

# Add the scatter plots to the subplots
for trace in fig1.data:
    fig.add_trace(trace, row=1, col=1)

for trace in fig2.data:
    trace.showlegend = False
    fig.add_trace(trace, row=1, col=2)

# Show the plot
fig.show()

The dynamic editing and deleting of individual Cases allows the user to perform targeted modifications of the Trainee and gives the user fine-grained control over their data and Trainee. This should not be done lightly, and we recommend that all Cases be investigated before performing this action.

# Conclusion:

We can see that by editing the Influential Cases with the faulty target values, we corrected the Trainee's prediction without re-training or re-analyzing it. Combined with Sessions, which let us act on entire batches of training data, this capability provides the user with a very efficient way to maintain control over a continuously evolving Trainee when training data is constantly being added.

The tools shown in this and other recipes allow the user to find, diagnose, and act on problem cases with a level of ease and precision that is difficult to achieve with models that must be retrained. This gives the user a flexible platform that can adjust to a wide range of machine learning needs.