# Decoupled Classifiers

This notebook is a tutorial on how to use the ``DecoupledClass`` estimator, provided in the `raimitigations.cohort` package.
This class is based on the work presented in the paper [Decoupled classifiers for group-fair and efficient machine learning](https://www.microsoft.com/en-us/research/publication/decoupled-classifiers-for-group-fair-and-efficient-machine-learning/). The ``DecoupledClass`` estimator builds a separate estimator for each cohort. The cohorts are defined by a set of class parameters, similar to how the ``CohortManager`` class creates its cohorts. Both ``DecoupledClass`` and ``CohortManager`` inherit from the same abstract class, ``CohortHandler``, which implements the core functionality for handling cohorts. The two classes differ in their goals. The ``CohortManager`` provides an interface for creating a variety of pipelines that are executed over each cohort separately (a pipeline with an estimator, a different pipeline for each cohort, and so on). The ``DecoupledClass`` functions as an estimator, which means that it always fits a model over each cohort separately. It can also apply transformations to each cohort, but the transformations must be the same for all cohorts (unlike the ``CohortManager``, it doesn't allow a different pre-processing pipeline for each cohort).

In this notebook, we'll show the different ways we can instantiate and use the ``DecoupledClass``. Let's start off by opening a dataset. Here, we'll use the UCI Breast Cancer dataset.

In [1]:
import pandas as pd
import numpy as np
import uci_dataset as database

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import DecoupledClass, fetch_cohort_results

df = database.load_breast_cancer()
label_col = "Class"
df[label_col] = df[label_col].replace({"recurrence-events": 1, "no-recurrence-events": 0})
df

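| | Class | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | 0 | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | 0 | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | 0 | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | 0 | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | 1 | 30-39 | premeno | 30-34 | 0-2 | no | 2 | left | left_up | no |
| 282 | 1 | 30-39 | premeno | 20-24 | 0-2 | no | 3 | left | left_up | yes |
| 283 | 1 | 60-69 | ge40 | 20-24 | 0-2 | no | 1 | right | left_up | no |
| 284 | 1 | 40-49 | ge40 | 30-34 | 3-5 | no | 3 | left | left_low | no |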


In [2]:
X_train, X_test, y_train, y_test = split_data(df, label="Class", test_size=0.2)

## Basic Scenario

Let's consider the following scenario: suppose that we want to train a different model for each cohort comprised of the different values in the ``irradiat`` column. To do this, we can call the ``DecoupledClass`` using the following parameters:

In [3]:
preprocessing = [dp.BasicImputer(verbose=False), dp.EncoderOrdinal(verbose=False)]

dec_class = DecoupledClass(
					cohort_col=["irradiat"], 
					transform_pipe=preprocessing
				)
dec_class.fit(X_train, y_train)

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
	Size: 175
	Query:
		(`irradiat` == "no")
	Value Counts:
		0: 134 (76.57%)
		1: 41 (23.43%)
	Invalid: False


cohort_1:
	Size: 53
	Query:
		(`irradiat` == "yes")
	Value Counts:
		1: 27 (50.94%)
		0: 26 (49.06%)
	Invalid: False




The ``cohort_col`` parameter works similarly to the same parameter in the ``CohortManager`` class: it creates a different cohort for each combination of values found in the columns specified in the ``cohort_col`` list (check this [notebook for more details](./cohort_manager.ipynb)). Therefore, since ``cohort_col = ["irradiat"]``, we'll create one cohort for all instances where the `irradiat` column is "no", and another cohort where its value is "yes". We then train two models: one for each cohort. Since no estimator was provided, each cohort uses a copy of the baseline estimator of the ``DecoupledClass``: a ``sklearn.tree.DecisionTreeClassifier`` for classification problems, or a ``sklearn.tree.DecisionTreeRegressor`` for regression.

Note that we also provided a pre-processing pipeline through the ``transform_pipe`` parameter. Each cohort gets its own copy of this pipeline, and before the model is fitted, each cohort's dataset (a subset of the original dataset) goes through this pipeline. In this case, each cohort imputes the missing values and then encodes the categorical features before fitting its model. Unlike the ``CohortManager`` class, the ``DecoupledClass`` doesn't allow a different pipeline for each cohort: all cohorts use separate copies of the same pipeline.

After fitting the ``DecoupledClass`` object, we can print some information about each cohort created. To do this, we use the ``print_cohorts()`` method.
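To make the idea of "one copy of the estimator per cohort" more concrete, here is a minimal, library-free sketch. It is not how ``DecoupledClass`` is implemented: ``MajorityClassifier``, ``fit_per_cohort`` and ``predict_per_cohort`` are hypothetical names introduced only for this illustration, and the toy estimator simply predicts the most common label of its cohort.

```python
from collections import Counter, defaultdict
from copy import deepcopy


class MajorityClassifier:
    """Toy stand-in for a real estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_per_cohort(rows, labels, cohort_col, base_estimator):
    """Fit an independent copy of base_estimator on each cohort's subset."""
    subsets = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        X_cohort, y_cohort = subsets[row[cohort_col]]
        X_cohort.append(row)
        y_cohort.append(label)
    # deepcopy guarantees that no fitted state is shared between cohorts
    return {
        value: deepcopy(base_estimator).fit(X_cohort, y_cohort)
        for value, (X_cohort, y_cohort) in subsets.items()
    }


def predict_per_cohort(rows, cohort_col, models):
    """Route each row to the model of its cohort, preserving the input order."""
    return [models[row[cohort_col]].predict([row])[0] for row in rows]


# Tiny hypothetical dataset with a single cohort column
rows = [{"irradiat": "no"}, {"irradiat": "no"}, {"irradiat": "no"},
        {"irradiat": "yes"}, {"irradiat": "yes"}, {"irradiat": "yes"}]
labels = [0, 0, 1, 1, 1, 0]

models = fit_per_cohort(rows, labels, "irradiat", MajorityClassifier())
predictions = predict_per_cohort(rows, "irradiat", models)
```

Each cohort ends up with its own fitted object, so a change in one cohort's data never affects the model of another cohort.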

## Merging Invalid Cohorts

When creating multiple cohorts, we might end up with a few cohorts with a skewed label distribution, or very small cohorts. In these cases, we might want to fix these cohorts before proceeding. One approach is to use data rebalancing and create new instances for only a few cohorts, and we can do this using the [dataprocessing.Synthesizer](../dataprocessing/module_tests/rebalance_sdv.ipynb) class or using the [Rebalance](../dataprocessing/module_tests/rebalance_imbl.ipynb) class together with the ``CohortManager()``. Apart from these solutions, the ``DecoupledClass`` also offers some new solutions. The first solution, which we'll explore in this section, is to greedily merge invalid cohorts until they become valid. 

When merging cohorts, we first need to define what a valid cohort is. Here, we consider that invalid cohorts are those that fall into at least one of the following conditions:

1. **Small Cohorts:** cohorts with a size below a certain threshold
2. **Skewed Cohorts:** cohorts with a label column with a skewed distribution.

After a cohort is deemed invalid, we need to decide which cohort it will be merged into. We simply choose the smallest cohort different from the invalid cohort and then merge them (this is why we mentioned that this is a greedy approach for merging cohorts).
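The greedy strategy described above can be sketched as follows. This is only an illustration under stated assumptions: ``greedy_merge`` is a hypothetical helper, the validity rule is passed in as a callable, and the choices of which invalid cohort is handled first and which name the merged cohort keeps are arbitrary here (the library may do this differently).

```python
def greedy_merge(cohorts, is_valid):
    """Merge invalid cohorts into the smallest other cohort until all are valid.

    cohorts: dict mapping a cohort name to its list of labels.
    is_valid: callable that receives a list of labels and returns a bool.
    """
    merged = {name: list(labels) for name, labels in cohorts.items()}
    while len(merged) > 1:
        invalid = [name for name, labels in merged.items() if not is_valid(labels)]
        if not invalid:
            break
        target = invalid[0]  # assumption: handle invalid cohorts in insertion order
        smallest = min(
            (name for name in merged if name != target),
            key=lambda name: len(merged[name]),
        )
        # assumption: the merged cohort keeps the name of the invalid cohort
        merged[target] = merged[target] + merged.pop(smallest)
    return merged


# Hypothetical cohorts; a cohort is valid here if it has at least 8 instances
cohorts = {
    "a": [0, 1, 0],
    "b": [0, 0, 1, 1, 0],
    "c": [0] * 14 + [1] * 6,
}
result = greedy_merge(cohorts, is_valid=lambda labels: len(labels) >= 8)
```

In this toy run, cohort ``a`` is merged with ``b`` (the smallest other cohort), which is enough to make every remaining cohort valid, so ``c`` is left untouched.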

There are a few parameters used to control these validity checks:

* ``min_cohort_size``: the minimum size a cohort is allowed to have to be considered valid
* ``min_cohort_pct``: a value in the range [0, 1] that determines the minimum size allowed for a cohort as a fraction of the dataset. This minimum size is given by the size of the full dataset (``df.shape[0]``) multiplied by ``min_cohort_pct``. The maximum between ``min_cohort_size`` and (``df.shape[0]`` * ``min_cohort_pct``) is used as the minimum size allowed for a cohort
* ``minority_min_rate``: the minimum occurrence rate for the minority class (from the label column) that a cohort is allowed to have. If the minority class of the cohort has an occurrence rate lower than ``minority_min_rate``, the cohort is considered invalid.
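The way these three parameters interact can be sketched in a few lines of plain Python. ``check_cohort`` is a hypothetical helper written for this explanation, not part of ``raimitigations``, and the ``min_cohort_size=10`` used in the example is an arbitrary value chosen for illustration (it is not claimed to be the library default).

```python
from collections import Counter


def check_cohort(labels, classes, dataset_size,
                 min_cohort_size, min_cohort_pct, minority_min_rate):
    """Return which validity conditions a cohort violates (hypothetical helper)."""
    # The effective minimum size is the larger of the two size constraints
    min_size = max(min_cohort_size, dataset_size * min_cohort_pct)
    too_small = len(labels) < min_size

    # Occurrence rate of the rarest class; a class absent from the cohort has rate 0
    counts = Counter(labels)
    if labels:
        minority_rate = min(counts.get(c, 0) for c in classes) / len(labels)
    else:
        minority_rate = 0.0
    skewed = minority_rate < minority_min_rate

    return {"too_small": too_small, "skewed": skewed, "valid": not (too_small or skewed)}


# A cohort with 17 negative and 4 positive instances out of a 286-row dataset
small = check_cohort([0] * 17 + [1] * 4, classes=[0, 1], dataset_size=286,
                     min_cohort_size=10, min_cohort_pct=0.2, minority_min_rate=0.15)

# A cohort with 75 negative and 36 positive instances from the same dataset
large = check_cohort([0] * 75 + [1] * 36, classes=[0, 1], dataset_size=286,
                     min_cohort_size=10, min_cohort_pct=0.2, minority_min_rate=0.15)
```

With ``min_cohort_pct=0.2`` and 286 rows, the size threshold is 57.2 instances, so the first cohort (21 instances) is too small even though its minority rate (about 19%) is above ``minority_min_rate``. The second cohort passes both checks.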

In the next cell, we'll create a set of cohorts based on the joint values of the columns ["age", "menopause"]. We'll also specify values for the parameters ``min_cohort_pct`` and ``minority_min_rate``. Note that the resulting cohorts are not what we might initially expect, that is, one cohort for each combination of unique values found in the columns ["age", "menopause"]. Instead, we end up with only three cohorts. While ``cohort_4`` is defined by a simple filter over these two columns, the other two cohorts use a complex combination of filters. This is because they are the result of merged cohorts, and when two cohorts are merged, so are their filters.

In [4]:
preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
					cohort_col=["age", "menopause"], 
					min_cohort_pct=0.2,
					minority_min_rate=0.15,
					transform_pipe=preprocessing
				)
dec_class.fit(df=df, label_col="Class")

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
	Size: 91
	Query:
		((((((((`age` == "20-29") and (`menopause` == "premeno")) or ((`age` == "30-39") and (`menopause` == "lt40"))) or ((`age` == "60-69") and (`menopause` == "lt40"))) or ((`age` == "50-59") and (`menopause` == "lt40"))) or ((`age` == "70-79") and (`menopause` == "ge40"))) or ((`age` == "40-49") and (`menopause` == "ge40"))) or ((`age` == "50-59") and (`menopause` == "premeno"))) or ((`age` == "30-39") and (`menopause` == "premeno"))
	Value Counts:
		0: 59 (64.84%)
		1: 32 (35.16%)
	Invalid: False


cohort_4:
	Size: 81
	Query:
		(`age` == "40-49") and (`menopause` == "premeno")
	Value Counts:
		0: 58 (71.60%)
		1: 23 (28.40%)
	Invalid: False


cohort_8:
	Size: 114
	Query:
		((`age` == "60-69") and (`menopause` == "ge40")) or ((`age` == "50-59") and (`menopause` == "ge40"))
	Value Counts:
		0: 84 (73.68%)
		1: 30 (26.32%)
	Invalid: False




## Specify the Cohorts

Just like the ``CohortManager`` class, the ``DecoupledClass`` also allows users to specify the exact filters they want when creating the cohorts. In the following example, we'll create three cohorts: 2 of them with specific filters, and the last one will be created to include all instances that don't belong to any other cohort. **NOTE:** when specifying the exact cohorts using the ``cohort_def`` parameter, invalid cohorts won't be merged. Instead, an error will be raised indicating that one of the cohorts is invalid. However, invalid cohorts can still be used if Transfer Learning is used. More details on that in the following sections.

In [5]:
cohorts = {
    "cohort_1": [['age', '==', '40-49'], 'and', ['menopause', '==', 'premeno']],
	"cohort_2": [
            [['age', '==', '60-69'], 'and', ['menopause', '==', 'ge40']], 'or',
            [['age', '==', '30-39'], 'and', ['menopause', '==', 'premeno']],
        ],
	"cohort_3": None
}

preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
					cohort_def=cohorts, 
					min_cohort_pct=0.2,
					minority_min_rate=0.15,
					transform_pipe=preprocessing
				)
dec_class.fit(df=df, label_col="Class")

dec_class.print_cohorts()

FINAL COHORTS
cohort_1:
	Size: 81
	Query:
		(`age` == "40-49") and (`menopause` == "premeno")
	Value Counts:
		0: 58 (71.60%)
		1: 23 (28.40%)
	Invalid: False


cohort_2:
	Size: 90
	Query:
		((`age` == "60-69") and (`menopause` == "ge40")) or ((`age` == "30-39") and (`menopause` == "premeno"))
	Value Counts:
		0: 58 (64.44%)
		1: 32 (35.56%)
	Invalid: False


cohort_3:
	Size: 115
	Query:
		Remaining instances
	Value Counts:
		0: 85 (73.91%)
		1: 30 (26.09%)
	Invalid: False




## Specifying the estimator

The ``DecoupledClass`` class has multiple parameters that allow for a wide range of customizations. For example, we can also choose the estimator used by the decoupled classifier. By default, we'll use a simple ``DecisionTreeClassifier`` (for classification problems) or ``DecisionTreeRegressor`` (for regression problems), both from `sklearn`. However, if the user wants to use a more powerful estimator or the same estimator, but tweak certain parameters of it, they can specify the estimator when creating the ``DecoupledClass`` object. To do this, they just need to instantiate the estimator (don't call their `fit()` method yet), and pass it through the ``estimator`` parameter. When doing this, the Decoupled Classifier will create a copy of this estimator for each cohort. This way, each estimator will be fitted using a different dataset (the cohort's subset).

In [6]:
import xgboost as xgb

model = xgb.XGBClassifier(
            objective="binary:logistic",
            learning_rate=0.1,
            n_estimators=30,
            max_depth=10,
            colsample_bytree=0.7,
            alpha=0.0,
            reg_lambda=10.0,
            n_jobs=4,
            verbosity=0,
            use_label_encoder=False,
        )

preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
                    cohort_col=["age", "menopause"], 
                    min_cohort_pct=0.2,
                    minority_min_rate=0.15,
                    estimator=model,
                    transform_pipe=preprocessing
                )
dec_class.fit(df=df, label_col="Class")



<raimitigations.cohort.decoupled_class.decoupled_classifier.DecoupledClass at 0x7f22f465c1c0>

## Calling the predict() and predict_proba() methods

The Decoupled Classifier implements the same interface as other `sklearn` estimators: the ``predict()`` and ``predict_proba()`` methods. It also follows the same conventions: the ``predict()`` method returns the predicted classes, while ``predict_proba()`` returns the probability of each instance belonging to each class. Note that ``predict_proba()`` will only work if the estimator being used has a ``predict_proba()`` method.

In [7]:
X = df.drop(columns=[label_col])

y_pred = dec_class.predict(X)

print(f"y_pred size = {y_pred.shape}")
print(f"{y_pred[:6]} ... {y_pred[-6:]}")

y_pred size = (286,)
[0 0 0 0 0 0] ... [1 0 1 0 1 1]


In [8]:
y_pred = dec_class.predict_proba(X)

print(f"y_pred size = {y_pred.shape}")
print(f"{y_pred[:6]} ... {y_pred[-6:]}")

y_pred size = (286, 2)
[[0.53159475 0.46840525]
 [0.78541195 0.21458806]
 [0.7623236  0.23767635]
 [0.8384228  0.1615772 ]
 [0.78541195 0.21458806]
 [0.8384228  0.1615772 ]] ... [[0.3676687  0.6323313 ]
 [0.72908753 0.27091247]
 [0.44569016 0.55430984]
 [0.81318545 0.18681458]
 [0.42785716 0.57214284]
 [0.45488244 0.54511756]]


## Using Transfer Learning for invalid cohorts

What sets the ``DecoupledClass`` apart from the ``CohortManager`` is its ability to deal with invalid cohorts. We already showed how to use the Decoupled Classifier to merge invalid cohorts using a greedy approach. In this section, we'll explore a second approach for dealing with invalid cohorts, called here the transfer learning approach. Instead of merging an invalid cohort into another cohort, we keep all the cohorts, but when fitting the model of an invalid cohort, we also use data from other cohorts (called here the *out-data*). The instances of the out-data are weighted down by a factor $\theta$ compared to the instances that belong to the cohort (called here the *in-data*). Note that the cohort that lends its data to an invalid cohort (the one from which the out-data was fetched) still uses only its own data when fitting its model (unless it is also an invalid cohort).

We are now left with two questions: (i) which cohorts should be used as the out-data for an invalid cohort, and (ii) how to set the value of $\theta$.

### Selecting the out-data for an invalid cohort

When selecting the out-data for an invalid cohort, that is, when selecting which cohorts will lend their data to the invalid cohort, we must be careful not to use cohorts with a very different label distribution (compared to the invalid cohort that needs extra data). Otherwise, the external data can be more harmful than useful. To check if two cohorts have a similar label distribution (be it for a classification problem, where the label column is a set of encoded classes, or for a regression problem, where the label is an array of real values), we compute the Jensen-Shannon distance between the label distributions of the two cohorts. If the distance is below a predefined threshold (controlled by the ``cohort_dist_th`` parameter), the distributions are considered similar.
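For intuition, the Jensen-Shannon distance between two discrete label distributions can be computed with a few lines of plain Python. This sketch covers only the classification case (a regression label would first need to be turned into a histogram), uses base-2 logarithms so that the distance lies in [0, 1], and is not necessarily how ``raimitigations`` computes it internally. ``label_distribution`` and ``js_distance`` are hypothetical helper names.

```python
import math


def label_distribution(labels, classes):
    """Relative frequency of each class, in the order given by `classes`."""
    return [labels.count(c) / len(labels) for c in classes]


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


# Label distributions of two cohorts: 17 vs. 4 and 75 vs. 36 instances per class
p = label_distribution([0] * 17 + [1] * 4, classes=[0, 1])
q = label_distribution([0] * 75 + [1] * 36, classes=[0, 1])
distance = js_distance(p, q)
```

A cohort would then qualify as out-data when ``distance`` is below the threshold set through ``cohort_dist_th``. Identical distributions give a distance of 0, and completely disjoint ones give 1 with this choice of logarithm base.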

Here is a summary of the transfer learning process covered so far:

* When using Transfer Learning, first check if there are any invalid cohorts. Differently from the greedy approach of merging cohorts, when using transfer learning, cohorts deemed invalid due to skewed distributions are not allowed. If a cohort is deemed invalid due to a skewed distribution, an error will be raised. For each invalid cohort `i`, we'll do the following steps:

    1. Search all other cohorts `j` $\neq$ `i` (including other invalid cohorts) and find those that have a similar label distribution
    2. Create a new dataset (visible only to cohort `i`) called *out-data* that will include the subset of all other cohorts with a similar label distribution
    3. Train the estimator of cohort `i` using its own subset (*in-data*) + *out-data*, where the instances from the out-data have a smaller weight $\theta$ (we'll discuss how to set this value in the remainder of this section). Note that the estimator used when using transfer learning must allow for setting a different weight for each instance.

* All valid cohorts will be trained using only their *in-data*, even if the subset of these cohorts is used as *out-data* for invalid cohorts.
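Step 3 above boils down to concatenating the in-data and the out-data and building a per-instance weight vector. The sketch below illustrates this with a hypothetical helper, ``build_transfer_training_set``. It assumes that in-data instances get a weight of 1.0, which is a natural reading of the description but is not confirmed here.

```python
def build_transfer_training_set(in_X, in_y, out_X, out_y, theta):
    """Combine in-data and out-data, weighting every out-data instance by theta."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be in the range [0, 1]")
    X = list(in_X) + list(out_X)
    y = list(in_y) + list(out_y)
    # assumption: in-data instances keep full weight (1.0)
    weights = [1.0] * len(in_X) + [theta] * len(out_X)
    return X, y, weights


# Hypothetical invalid cohort with 3 instances borrowing 4 instances from other cohorts
in_X, in_y = [[1], [2], [3]], [0, 1, 0]
out_X, out_y = [[4], [5], [6], [7]], [0, 0, 1, 1]

X, y, weights = build_transfer_training_set(in_X, in_y, out_X, out_y, theta=0.3)

# With an estimator that supports per-instance weights, the fit call would
# typically look like: estimator.fit(X, y, sample_weight=weights)
```

This also shows why the estimator must accept instance weights: without them, the borrowed instances would count as much as the cohort's own data.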

### Setting the value of $\theta$

We are now going to focus on how to set the value of the $\theta$ parameter. We'll cover the different approaches for setting this value.

#### Using a fixed $\theta$ value

The most straightforward approach for setting the value of $\theta$ is to provide a specific value for it directly. This can be done using the ``theta`` parameter when creating the ``DecoupledClass`` object. When passing a float value between [0, 1] to this parameter, this will be the value used for $\theta$ for all transfer learning operations.

In the following cell, we'll set $\theta$ = 0.3. Note that this time we first impute the missing values in the dataset. We do this because when the ``cohort_col`` parameter is used to define the cohorts, missing values in the listed columns are treated as a value of their own when creating the cohorts. In this specific case, the ``breast-quad`` column has very few missing values, so creating the cohorts without handling them would produce a cohort (the one that holds all instances where this column is NaN) with a skewed label distribution, which, as previously mentioned, is not allowed when using transfer learning. Therefore, we fill in the missing values (using the most frequent value of each column) prior to creating the cohorts.

In [9]:
preprocessing = [dp.EncoderOrdinal(verbose=False)]

imputer = dp.BasicImputer(categorical={'missing_values':np.nan, 
                                        'strategy':'most_frequent', 
                                        'fill_value':None },
                            verbose=False)
imputer.fit(df)
df_nomiss = imputer.transform(df)

dec_class = DecoupledClass(
                    cohort_col=["breast-quad"], 
                    theta=0.3,
                    min_cohort_pct=0.2,
                    minority_min_rate=0.15,
                    transform_pipe=preprocessing
                )
dec_class.fit(df=df_nomiss, label_col="Class")

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
	Size: 21
	Query:
		(`breast-quad` == "central")
	Value Counts:
		0: 17 (80.95%)
		1: 4 (19.05%)
	Invalid: True
		Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
		Theta = 0.3


cohort_1:
	Size: 111
	Query:
		(`breast-quad` == "left_low")
	Value Counts:
		0: 75 (67.57%)
		1: 36 (32.43%)
	Invalid: False


cohort_2:
	Size: 97
	Query:
		(`breast-quad` == "left_up")
	Value Counts:
		0: 71 (73.20%)
		1: 26 (26.80%)
	Invalid: False


cohort_3:
	Size: 24
	Query:
		(`breast-quad` == "right_low")
	Value Counts:
		0: 18 (75.00%)
		1: 6 (25.00%)
	Invalid: True
		Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
		Theta = 0.3


cohort_4:
	Size: 33
	Query:
		(`breast-quad` == "right_up")
	Value Counts:
		0: 20 (60.61%)
		1: 13 (39.39%)
	Invalid: True
		Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3']
		Theta = 0.3




Note that when using transfer learning and calling the ``print_cohorts()`` method, the "Invalid" key of the invalid cohorts will be set to True, and in that case, it will also inform which cohorts were used as out-data and the $\theta$ value used.

#### Finding the best $\theta$ parameter using Cross-Validation

Instead of using a fixed $\theta$ value, we can also find the best value using Cross-Validation (CV). When a cohort uses transfer learning, CV is used with the cohort data (in-data) plus the out-data using different values of $\theta$ (obtained from a list of $\theta$ values, called here *$\theta$ list*), and the final $\theta$ is selected as being the one associated with the highest performance in the CV process. The CV here splits the in-data into K folds (the best K value is identified according to the possible values specified in the ``valid_k_folds_theta`` parameter), and then proceeds to use one of the folds as the test set, and the remaining folds plus the out-data as the train set. A model is fitted for the train set and then evaluated in the test set. The ROC AUC metric is obtained for each CV run until all folds have been used as a test set. We then compute the average ROC AUC score for the K runs and that gives the CV score for a given $\theta$ value. This is repeated for all possible $\theta$ values (the $\theta$ list), and the $\theta$ with the best score is selected for that cohort. This process is repeated for each cohort that requires transfer learning, which means that some invalid cohorts might end up using different values of $\theta$.
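The selection loop described above can be sketched as follows. To keep the example self-contained, the model fitting and scoring are delegated to a callable, ``fit_and_score``, which is a hypothetical hook: in a real setting it would fit an estimator with per-instance weights and return the ROC AUC on the held-out fold. The fold construction (interleaved, no shuffling, no stratification) is also an assumption made for brevity.

```python
from statistics import mean


def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k interleaved folds (assumption: no shuffling)."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def select_theta(in_data, out_data, thetas, k, fit_and_score):
    """Return the theta with the best average CV score, along with that score.

    fit_and_score(train, weights, test) -> float is supplied by the caller.
    """
    folds = k_fold_indices(len(in_data), k)
    best_theta, best_score = None, float("-inf")
    for theta in thetas:
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            test = [in_data[i] for i in test_idx]
            train_in = [row for i, row in enumerate(in_data) if i not in held_out]
            # Train on the remaining in-data folds plus ALL of the out-data
            train = train_in + list(out_data)
            weights = [1.0] * len(train_in) + [theta] * len(out_data)
            scores.append(fit_and_score(train, weights, test))
        cv_score = mean(scores)
        if cv_score > best_score:
            best_theta, best_score = theta, cv_score
    return best_theta, best_score


def toy_fit_and_score(train, weights, test):
    """Toy scorer that pretends an out-data weight of 0.4 works best."""
    return 1.0 - abs(weights[-1] - 0.4)


best_theta, best_score = select_theta(
    in_data=list(range(10)),
    out_data=list(range(100, 105)),
    thetas=[0.2, 0.4, 0.6, 0.8],
    k=5,
    fit_and_score=toy_fit_and_score,
)
```

Note that only the in-data is split into folds: the out-data is never used for evaluation, since the goal is to measure performance on the cohort itself.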

There is a set of parameters used for controlling the CV process: ``default_theta``, ``min_fold_size_theta``, and ``valid_k_folds_theta``. We recommend looking through the API documentation of these parameters to better understand this process.

In the following cells, we'll check two ways to specify the **$\theta$ list**, that is, the list of possible $\theta$ values to be tested during the CV phase.

##### Using a specific list of possible $\theta$ values

We can specify a list of possible $\theta$ values. This way, when running the CV process mentioned above, we'll do it for all the $\theta$ values contained in the list passed as a parameter. This list is passed to the same ``theta`` parameter mentioned in the previous cell.

In [10]:
dec_class = DecoupledClass(
					cohort_col=["breast-quad"], 
					theta=[0.2, 0.4, 0.6, 0.8],
					min_fold_size_theta=5,
					min_cohort_pct=0.2,
					minority_min_rate=0.15,
					transform_pipe=preprocessing
				)
dec_class.fit(df=df_nomiss, label_col="Class")

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
	Size: 21
	Query:
		(`breast-quad` == "central")
	Value Counts:
		0: 17 (80.95%)
		1: 4 (19.05%)
	Invalid: True
		Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
		Theta = 0.6


cohort_1:
	Size: 111
	Query:
		(`breast-quad` == "left_low")
	Value Counts:
		0: 75 (67.57%)
		1: 36 (32.43%)
	Invalid: False


cohort_2:
	Size: 97
	Query:
		(`breast-quad` == "left_up")
	Value Counts:
		0: 71 (73.20%)
		1: 26 (26.80%)
	Invalid: False


cohort_3:
	Size: 24
	Query:
		(`breast-quad` == "right_low")
	Value Counts:
		0: 18 (75.00%)
		1: 6 (25.00%)
	Invalid: True
		Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
		Theta = 0.8


cohort_4:
	Size: 33
	Query:
		(`breast-quad` == "right_up")
	Value Counts:
		0: 20 (60.61%)
		1: 13 (39.39%)
	Invalid: True
		Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3']
		Theta = 0.4




##### Using a default list of possible $\theta$ values

Instead of providing a list of $\theta$ values, we could also use a default $\theta$ list. To do this, we only need to set the ``theta`` parameter to ``True``. This way, the ``DecoupledClass`` understands that transfer learning must be used, and that the best $\theta$ value must be identified using a default $\theta$ list.

In [11]:
dec_class = DecoupledClass(
					cohort_col=["breast-quad"], 
					theta=True,
					min_fold_size_theta=5,
					min_cohort_pct=0.2,
					minority_min_rate=0.15,
					transform_pipe=preprocessing
				)
dec_class.fit(df=df_nomiss, label_col="Class")

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
	Size: 21
	Query:
		(`breast-quad` == "central")
	Value Counts:
		0: 17 (80.95%)
		1: 4 (19.05%)
	Invalid: True
		Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
		Theta = 0.6


cohort_1:
	Size: 111
	Query:
		(`breast-quad` == "left_low")
	Value Counts:
		0: 75 (67.57%)
		1: 36 (32.43%)
	Invalid: False


cohort_2:
	Size: 97
	Query:
		(`breast-quad` == "left_up")
	Value Counts:
		0: 71 (73.20%)
		1: 26 (26.80%)
	Invalid: False


cohort_3:
	Size: 24
	Query:
		(`breast-quad` == "right_low")
	Value Counts:
		0: 18 (75.00%)
		1: 6 (25.00%)
	Invalid: True
		Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
		Theta = 0.2


cohort_4:
	Size: 33
	Query:
		(`breast-quad` == "right_up")
	Value Counts:
		0: 20 (60.61%)
		1: 13 (39.39%)
	Invalid: True
		Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3']
		Theta = 0.8




## Getting the list of conditions from each cohort

Sometimes, it is useful to access the list of conditions of a given cohort, and not only the ``pandas`` query of that cohort. The list of conditions is defined in a format specific to ``raimitigations``, which is the format used for defining a cohort. The query, on the other hand, is the string used in the ``df.query()`` function, and it is obtained by translating the list of conditions of a cohort. Usually, we are only interested in the query of a cohort, but if we want to create a new ``DecoupledClass`` or ``CohortManager`` object, or call the ``raimitigations.cohort.fetch_cohort_results()`` function using the same conditions, we should use the list of conditions, not the query. In the following cells, we'll show how to access the list of conditions of each cohort.
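To illustrate the relationship between the two representations, here is a simplified translator from a list of conditions to a query string. It only handles the subset of the format used in this notebook (``[column, operator, value]`` triples combined with ``'and'`` / ``'or'``) and is not the library's actual implementation. ``conditions_to_query`` is a hypothetical name.

```python
def conditions_to_query(conditions, _nested=False):
    """Translate a (simplified) list of conditions into a pandas query string."""
    # A leaf condition is a [column, operator, value] triple whose first item is a string
    if isinstance(conditions[0], str):
        column, operator, value = conditions
        value_repr = f'"{value}"' if isinstance(value, str) else repr(value)
        return f"(`{column}` {operator} {value_repr})"

    # Otherwise, it is a sequence of sub-conditions joined by 'and' / 'or'
    parts = [
        item if isinstance(item, str) else conditions_to_query(item, _nested=True)
        for item in conditions
    ]
    query = " ".join(parts)
    return f"({query})" if _nested else query


simple = [['age', '==', '40-49'], 'and', ['menopause', '==', 'premeno']]
nested = [
    [['age', '==', '60-69'], 'and', ['menopause', '==', 'ge40']], 'or',
    [['age', '==', '30-39'], 'and', ['menopause', '==', 'premeno']],
]

simple_query = conditions_to_query(simple)
nested_query = conditions_to_query(nested)
```

For these two inputs, the resulting strings have the same shape as the queries printed by ``print_cohorts()`` in the "Specify the Cohorts" section. The translation is one-way, which is why the list of conditions is the representation to keep when a cohort needs to be recreated.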

We'll recreate the Decoupled Classifier by specifying the cohorts through a set of conditions (we'll repeat the same conditions used in a previous experiment).

In [13]:
cohorts = {
    "cohort_1": [['age', '==', '40-49'], 'and', ['menopause', '==', 'premeno']],
	"cohort_2": [
            [['age', '==', '60-69'], 'and', ['menopause', '==', 'ge40']], 'or',
            [['age', '==', '30-39'], 'and', ['menopause', '==', 'premeno']],
        ],
	"cohort_3": None
}

preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
					cohort_def=cohorts, 
					min_cohort_pct=0.2,
					minority_min_rate=0.15,
					transform_pipe=preprocessing
				)
dec_class.fit(X_train, y_train)

<raimitigations.cohort.decoupled_class.decoupled_classifier.DecoupledClass at 0x7f22082d5a90>

Now, we'll compute the performance metrics of the Decoupled Classifier for each cohort separately using the ``raimitigations.cohort.fetch_cohort_results()`` function (check the [Cohort Case 1](./case_study/case_1.ipynb) notebook for an in-depth look at this function). To do that, we need to provide the list of conditions of each cohort. Suppose that we don't have access to the list of conditions in the ``cohorts`` variable defined above (for whatever reasons), but we still have access to the ``dec_class`` variable. We can access the list of conditions of cohort ``i`` from the ``dec_class`` object using:

```python
dec_class.cohorts[i].conditions
```

``dec_class.cohorts`` is a list of cohort objects, which inherit from the ``CohortDefinition`` class (presented in the [Defining a Cohort](./cohort_definition.ipynb) notebook). We select the cohort we want by its index and then access its ``.conditions`` attribute, which holds the list of conditions of that cohort.

In [14]:
# Collect the list of conditions of every cohort, in order
cohort_def = [cohort.conditions for cohort in dec_class.cohorts]

pred_train = dec_class.predict_proba(X_train)
pred_test = dec_class.predict_proba(X_test)

_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_def=cohort_def, return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_test, cohort_def=cohort_def, fixed_th=th_dict)

| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.538737 | 0.542324 | 0.542324 | 0.542324 | 0.62069 | 0.5 | 58 |
| 1 | cohort_0 | (`age` == "40-49") and (`menopause` == "premeno") | 0.663636 | 0.65 | 0.663636 | 0.65368 | 0.6875 | 0.5 | 16 |
| 2 | cohort_1 | ((`age` == "60-69") and (`menopause` == "ge40"... | 0.55 | 0.541667 | 0.55 | 0.533333 | 0.571429 | 0.5 | 21 |
| 3 | cohort_2 | ((`age` != "40-49") or (`menopause` != "premen... | 0.433333 | 0.342105 | 0.433333 | 0.382353 | 0.619048 | 1.0 | 21 |
