## Lab 6: MLInspect

### Overview of the example from the paper


![paper_example_image](paper_example_image.png)

Example of an ML pipeline that predicts which patients are at a higher risk of serious complications, under the requirement to achieve comparable false negative rates across intersectional groups by age and race. The pipeline is implemented using native constructs from the popular pandas and scikit-learn libraries. On the left, we highlight potential issues identified by mlinspect. On the right, we show the corresponding dataflow graph extracted by mlinspect to instrument the code and pinpoint issues.

## Task
Operators like joins, selections, and missing value imputers can cause data distribution issues, which can heavily impact the performance of our model for specific demographic groups. mlinspect helps identify such issues by offering a check that calculates histograms for sensitive groups in the data and verifies whether the change between histograms is significant enough to alert the user. Thanks to its annotation propagation, mlinspect can deal with complex code involving nested sklearn pipelines and group memberships that are removed from the training data using projections.
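To make the idea of a per-group histogram concrete, here is a minimal sketch in plain Python. It is not how mlinspect computes its histograms; the helper `group_ratios` and the toy rows are hypothetical and only illustrate how a selection can shift the share of a sensitive group.

```python
from collections import Counter


def group_ratios(rows, column):
    """Return each group's share of the rows for the given column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Hypothetical toy data, not the Adult income dataset.
before = [
    {"sex": "Female", "age": 25},
    {"sex": "Female", "age": 41},
    {"sex": "Male", "age": 52},
    {"sex": "Male", "age": 38},
]

# A selection that happens to remove one of the two women.
after = [row for row in before if row["age"] > 30]

ratios_before = group_ratios(before, "sex")
ratios_after = group_ratios(after, "sex")
print("before:", ratios_before)
print("after: ", ratios_after)
```

The filter never mentions `sex`, yet the share of women drops from one half to one third. This is the kind of change the check is meant to surface.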

We want to find out whether preprocessing operations in the pipeline introduce bias and, if so, which groups are affected. The pipeline we want to analyse in this task can be found at the path `os.path.join(str(get_project_root()), "experiments", "user_interviews", "adult_simple_modified.py")`. The sensitive attributes we want to look at are `race` and `sex`.

The pipeline uses a benchmark dataset frequently used in the algorithmic fairness literature. The Adult income dataset contains information about 33,000 individuals from the 1994 U.S. census, with the sensitive attributes gender and race. The corresponding task is to predict whether the annual income of an individual exceeds $50,000. We took this existing dataset and modified it only slightly by introducing an artificial issue, which we will now try to find using mlinspect.

The code of the pipeline:

> ```python
> """
> Adult income pipeline
> """
> import os
> import pandas as pd
> from sklearn import compose, preprocessing, tree, pipeline
> 
> from mlinspect.utils import get_project_root
> 
> print('pipeline start')
> 
> train_file_a = os.path.join(str(get_project_root()), "experiments", "user_interviews", "adult_simple_train_a.csv")
> raw_data_a = pd.read_csv(train_file_a, na_values='?', index_col=0)
> 
> train_file_b = os.path.join(str(get_project_root()), "experiments", "user_interviews", "adult_simple_train_b.csv")
> raw_data_b = pd.read_csv(train_file_b, na_values='?', index_col=0)
> 
> merged_raw_data = raw_data_a.merge(raw_data_b, on="id")
> 
> data = merged_raw_data.dropna()
> 
> labels = preprocessing.label_binarize(data['income-per-year'], classes=['>50K', '<=50K'])
> 
> column_transformer = compose.ColumnTransformer(transformers=[
>     ('categorical', preprocessing.OneHotEncoder(handle_unknown='ignore'), ['education', 'workclass']),
>     ('numeric', preprocessing.StandardScaler(), ['age', 'hours-per-week'])
> ])
> adult_income_pipeline = pipeline.Pipeline([
>     ('features', column_transformer),
>     ('classifier', tree.DecisionTreeClassifier())])
> 
> adult_income_pipeline.fit(data, labels)
> print('pipeline finished')
> ```

# Step 1/4: Add check and execute the pipeline

The central entry point of mlinspect is the `PipelineInspector`. To use mlinspect, we pass it the path to a runnable version of the example pipeline. Here, the example pipeline is in a `.py` file.

First, we define the checks we want mlinspect to run. In this example, we use `NoBiasIntroducedFor(["col1", "col2", ...])` to automatically check for significant changes in the distribution of sensitive demographic groups and compute the histograms. The code below also adds the `NoIllegalFeatures` check and the `MaterializeFirstOutputRows` inspection, which we come back to in Step 4.

Then, we execute the pipeline. mlinspect returns an `InspectorResult`, which, among other information, contains the output of our checks.

```python
import os
from mlinspect.utils import get_project_root

from mlinspect import PipelineInspector
from mlinspect.checks import NoBiasIntroducedFor, NoIllegalFeatures
from mlinspect.inspections import MaterializeFirstOutputRows

ADULT_MOD_FILE_PY = os.path.join(str(get_project_root()), "experiments",
                                 "user_interviews", "adult_simple_modified.py")

inspector_result = PipelineInspector \
    .on_pipeline_from_py_file(ADULT_MOD_FILE_PY) \
    .add_check(NoBiasIntroducedFor(["race", "sex"])) \
    .add_check(NoIllegalFeatures()) \
    .add_required_inspection(MaterializeFirstOutputRows(5)) \
    .execute()

extracted_dag = inspector_result.dag
check_results = inspector_result.check_to_check_results
inspection_results = inspector_result.inspection_to_annotations
```


# Step 2/4: Overview of the check results
## Did our check find issues?

Let us look at `check_results` to see whether any check failed. We do this using the mlinspect utility function `check_results_as_data_frame(...)`. We see that an issue was found, so we have to investigate it.

```python
from IPython.display import display
import pandas as pd
pd.set_option('display.max_colwidth', None)

check_result_df = PipelineInspector.check_results_as_data_frame(check_results)
display(check_result_df)
```


A negative `min_relative_ratio_change` means that, for at least one group, its share of the data after the operator (here, the join) is lower than its share before it.

https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/mlinspect/checks/_no_bias_introduced_for.py
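As a worked example of this metric, the sketch below computes a relative change per group and takes the minimum. It is an illustration under assumptions, not the library's code: the function name mirrors the column name for readability, the ratios are made up, and the threshold `-0.3` is an assumed value chosen for the example.

```python
def min_relative_ratio_change(ratios_before, ratios_after):
    """Return the smallest relative change in group share, plus all changes."""
    changes = {}
    for group, before in ratios_before.items():
        # A group that disappears entirely has a share of 0 afterwards.
        after = ratios_after.get(group, 0.0)
        changes[group] = (after - before) / before
    return min(changes.values()), changes


# Made-up shares of each group before and after an operator.
ratios_before = {"Female": 0.50, "Male": 0.50}
ratios_after = {"Female": 0.25, "Male": 0.75}

THRESHOLD = -0.3  # assumed value for this illustration

worst_change, per_group = min_relative_ratio_change(ratios_before, ratios_after)
acceptable = worst_change >= THRESHOLD

print("per group:", per_group)
print("min relative ratio change:", worst_change)
print("acceptable:", acceptable)
```

Here the share of women is halved, so the minimum relative change is -0.5. With the assumed threshold, the change would be reported as not acceptable.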

# Step 3/4: List of operations that could change the distribution 

As stated before, only some operations like selections, joins and missing value imputation can change the distribution. Our check already filtered all operators that can cause data distribution issues. We can use the mlinspect utility function `get_distribution_changes_overview_as_df(...)` to get an overview. The overview already tells us that mlinspect detected a potential issue caused by a JOIN involving the gender attribute. Note that the automatic issue detection from mlinspect is only as good as its configuration and should not be completely relied upon.

```python
no_bias_check_result = check_results[NoBiasIntroducedFor(["race", "sex"])]

distribution_changes_overview_df = NoBiasIntroducedFor.get_distribution_changes_overview_as_df(no_bias_check_result)
display(distribution_changes_overview_df)

dag_node_distribution_changes_list = list(no_bias_check_result.bias_distribution_change.items())
```
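Why a join can change a distribution at all may not be obvious, so here is a small sketch in plain Python. The two tables and their contents are hypothetical and unrelated to the lab's CSV files; the dictionary comprehension stands in for an inner join on `id`.

```python
from collections import Counter

# Hypothetical toy tables keyed by "id".
table_a = {
    1: {"sex": "Female"},
    2: {"sex": "Female"},
    3: {"sex": "Male"},
    4: {"sex": "Male"},
}
table_b = {
    1: {"education": "HS-grad"},
    3: {"education": "Bachelors"},
    4: {"education": "Masters"},
}

# Inner join: only ids present in both tables survive.
shared_ids = table_a.keys() & table_b.keys()
joined = {key: {**table_a[key], **table_b[key]} for key in shared_ids}

counts_before = Counter(row["sex"] for row in table_a.values())
counts_after = Counter(row["sex"] for row in joined.values())
dropped_ids = sorted(table_a.keys() - table_b.keys())

print("before:", dict(counts_before))
print("after: ", dict(counts_after))
print("ids lost in the join:", dropped_ids)
```

The join silently drops every row without a match on the other side. If the missing matches are concentrated in one group, that group's share shrinks even though no line of code filters on it.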


# Step 4/4: Detailed investigation

### NoBiasIntroducedFor
Now that we know of the potential issue, we will take a look at the histograms before and after the JOIN. We can use `distribution_change.before_and_after_df` to look at the data in the form of a `pandas.DataFrame`, or use the mlinspect utility function `plot_distribution_change_histograms(...)` to plot the histograms.

```python
# Select the DagNode we want to look at by index
dag_node, node_distribution_changes = dag_node_distribution_changes_list[0]

# Investigate the changes
print("\033[1m{}: {}\033[0m".format(dag_node.operator_type, dag_node.source_code))
for column, distribution_change in node_distribution_changes.items():
    print("")
    print("\033[1m Column '{}'\033[0m, acceptable change: {}, min_relative_ratio_change: {}".format(
        column, distribution_change.acceptable_change, distribution_change.min_relative_ratio_change))
    display(distribution_change.before_and_after_df)
    NoBiasIntroducedFor.plot_distribution_change_histograms(distribution_change)
```


### Use of illegal features

https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/mlinspect/checks/_no_illegal_features.py

```python
feature_check_result = check_results[NoIllegalFeatures()]
print("Used illegal features: {}".format(feature_check_result.illegal_features))
```
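Conceptually, a check like this compares the columns that reach the model against a list of column names that must not be used as features. The sketch below shows that comparison in plain Python. The block list `ILLEGAL_FEATURES` and the helper `find_illegal_features` are hypothetical; the list that mlinspect actually uses may differ. The feature columns are the ones passed to the `ColumnTransformer` in the pipeline above.

```python
# Hypothetical block list for illustration; not necessarily mlinspect's list.
ILLEGAL_FEATURES = {"race", "gender", "age"}


def find_illegal_features(feature_columns, illegal=ILLEGAL_FEATURES):
    """Return the feature columns that appear in the block list, sorted."""
    return sorted(set(feature_columns) & set(illegal))


# Columns fed to the ColumnTransformer in the lab pipeline.
feature_columns = ["education", "workclass", "age", "hours-per-week"]

used_illegal = find_illegal_features(feature_columns)
print("Used illegal features:", used_illegal)
```

Under this assumed block list, `age` would be flagged because the pipeline scales it and passes it to the classifier.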


### MaterializeFirstOutputRows
For each operator, the `MaterializeFirstOutputRows` inspection materializes the first 5 output rows. For scikit-learn pipelines in particular, a user who just wants to look at some intermediate results normally has to write custom debugging code ([see this example Stack Overflow post](https://stackoverflow.com/questions/34802465/sklearn-is-there-any-way-to-debug-pipelines)). With mlinspect, this becomes easy. We can look at the input and output of arbitrary featurizers like OneHotEncoders or Word2Vec models.

Here, we use this functionality to look at the output of the OneHotEncoders for the `education` and `workclass` columns. For this, we only need to look at the inspection result for the corresponding DAG nodes.

```python
from IPython.display import display

first_rows_inspection_result = inspection_results[MaterializeFirstOutputRows(5)]

relevant_nodes = [node for node in extracted_dag.nodes if node.description in {
    "Categorical Encoder (OneHotEncoder), Column: 'education'",
    "Categorical Encoder (OneHotEncoder), Column: 'workclass'"}]

for dag_node in relevant_nodes:
    if dag_node in first_rows_inspection_result and first_rows_inspection_result[dag_node] is not None:
        print("\n\033[1m{} ({})\033[0m\n{}\n{}".format(
            dag_node.operator_type, dag_node.description, dag_node.source_code, dag_node.code_reference))
        display(first_rows_inspection_result[dag_node])
```
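The idea behind materializing the first rows of every operator can be sketched without any library: run the steps in order and keep a small sample of each step's output. This is a simplified illustration, not mlinspect's implementation; `run_with_first_rows`, the step names, and the category list are all hypothetical.

```python
def run_with_first_rows(rows, steps, n=5):
    """Apply each named step in order and keep the first n output rows of each."""
    first_rows = {}
    for name, step in steps:
        rows = [step(row) for row in rows]
        first_rows[name] = rows[:n]
    return rows, first_rows


# Hypothetical category order for a toy one-hot encoding.
CATEGORIES = ["Bachelors", "HS-grad", "Masters"]

steps = [
    # Projection: keep only the column we want to encode.
    ("project", lambda row: {"education": row["education"]}),
    # One-hot encoding: one 0/1 entry per known category.
    ("one_hot", lambda row: [int(row["education"] == c) for c in CATEGORIES]),
]

rows = [
    {"education": "HS-grad", "age": 39},
    {"education": "Masters", "age": 50},
    {"education": "Bachelors", "age": 28},
]

output, first_rows = run_with_first_rows(rows, steps, n=2)
for name, sample in first_rows.items():
    print(name, sample)
```

Keeping a sample per step lets us inspect what an encoder produced without adding print statements inside the pipeline.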


### Question: Did we find operators that introduce bias? How did the distribution of demographic groups change? 
**Write down your answer here:** 

### Question: Can missing value imputation cause bias? If so, how?
**Write down your answer here:**