# Convert Processed Conditions to General Example

This notebook shows how the Convert Processed Conditions to General module converts rule conditions that use processed features (either imputed values or one hot encoded (OHE) values) into rule conditions that use the original, unprocessed features. The converted rules can then be uploaded directly to a Simility environment.

You need this module whenever you upload rules generated by one of the rule generator modules to a Simility environment. These generators require their input features to be imputed and, in the case of categorical variables, one hot encoded. The resulting rules cannot be uploaded to a Simility environment as is, because their conditions reference these processed features.

The module resolves this by rewriting each such condition in terms of the original, unprocessed feature, so that the converted rules match fields that exist in the Simility environment.

## Requirements

To run this notebook, you'll need the following:

* The Rules package installed - see the readme for more information.
* A rule set (stored in the standard ARGO string format) that contains processed features.

----

## Import packages

In [2]:
from rules.convert_processed_conditions_to_general import ConvertProcessedConditionsToGeneral, ReturnMappings

import pandas as pd
import numpy as np

---

## Read in dataset

Let's first read in some datasets - *X* represents the raw pipeline output, while *y* represents the updated fraud labels:

In [3]:
X = pd.read_csv('dummy_data/X.csv', index_col='eid')
y = pd.read_csv('dummy_data/y.csv', index_col='eid').squeeze()

## Processing the data

Now we'll apply the standard data cleaning steps that must be carried out before feeding the data into one of the rule generator modules - **namely, imputing nulls and one hot encoding the categorical columns:**

In [4]:
imputed_values = {
    'num_items': -1,
    'country': 'missing'
}
X_processed = X.fillna(imputed_values)
X_processed = pd.get_dummies(X_processed)

In [5]:
X_processed.head()

| eid | num_items | country_FR | country_GB | country_US | country_missing |
|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 1 | 0 | 0 |
| 1 | 2.0 | 0 | 0 | 1 | 0 |
| 2 | -1.0 | 1 | 0 | 0 | 0 |
| 3 | 3.0 | 0 | 0 | 0 | 1 |
| 4 | 1.0 | 0 | 1 | 0 | 0 |


---

## Generating rules

Now let's say we ran one of the ARGO rule generators on the processed dataset and generated the following rules:

In [6]:
rule_strings = {
    'Rule1': "(X['num_items']<2)",
    'Rule2': "(X['country_missing']==True)",
    'Rule3': "(X['country_US']==True)",
    'Rule4': "(X['num_items']<0)&(X['country_missing']==True)"
}

These rule conditions all contain processed features - they have either been imputed or one hot encoded. So, if we tried to convert them to the system-ready format and then create them directly in the Simility instance, it would either:

- Create inaccurate representations of the rules if they use only imputed numeric values (since the rule conditions may include the imputed value, but this wouldn't be accounted for in the system).
- Cause the request to create the rules to fail, since the one hot encoded variables don't exist in the Simility system.

Hence, we need to convert the conditions which leverage processed features into conditions which use the original, unprocessed features.
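To make the first failure mode concrete, the short sketch below evaluates the *Rule1* condition on a toy *num_items* column before and after imputation. The values are assumptions chosen to mirror the processed table shown earlier; they are not read from *dummy_data*.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw 'num_items' column (assumed values; row 2 is null).
raw = pd.DataFrame({'num_items': [1.0, 2.0, np.nan, 3.0, 1.0]})

# The same column after imputing nulls with -1, as in the processing step.
processed = raw.fillna({'num_items': -1})

# Rule1's condition evaluated on each version of the data.
flags_processed = (processed['num_items'] < 2).tolist()
flags_raw = (raw['num_items'] < 2).tolist()

print(flags_processed)  # row 2 is flagged, because the imputed -1 is below 2
print(flags_raw)        # row 2 is not flagged, because NaN < 2 evaluates to False
```

The two results disagree on the null row, which is the gap the conversion has to close.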

---

## Converting rule conditions

First, let's instantiate the *ConvertProcessedConditionsToGeneral* class. To do this, we need to provide the imputed values and the mapping of OHE columns to categories. For small datasets this is relatively straightforward; for larger datasets, where multiple imputed values have been used or a large number of columns have been one hot encoded, doing this manually is time-consuming. Instead, we can use the *ReturnMappings* class to calculate this information:

In [7]:
rm = ReturnMappings()

Let's first return the imputed values used for each field:

In [8]:
imputed_values_mapping = rm.return_imputed_values_mapping([['num_items'], -1], [['country'], 'missing'])

Now let's return the category that relates to each OHE'd column:

In [9]:
ohe_categories_mapping = rm.return_ohe_categories_mapping(pre_ohe_cols=X.columns, 
                                                          post_ohe_cols=X_processed.columns, 
                                                          pre_ohe_dtypes=X.dtypes)
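
This notebook does not display the returned mapping, so as a rough intuition only, the sketch below shows one way a mapping from OHE columns back to their original column and category could be derived from column names alone. The helper `sketch_ohe_mapping` and the shape of its output are hypothetical and are not the *ReturnMappings* API; the sketch assumes OHE columns are named `<original column>_<category>`, which is the default naming used by `pd.get_dummies`.

```python
def sketch_ohe_mapping(pre_ohe_cols, post_ohe_cols):
    """Map each OHE column name to its (original column, category) pair.

    Hypothetical helper for illustration only. Assumes OHE columns are
    named '<original column>_<category>'.
    """
    pre = list(pre_ohe_cols)
    mapping = {}
    for col in post_ohe_cols:
        if col in pre:
            continue  # this column was not one hot encoded
        # Prefer the longest matching original name, so that 'country_code'
        # wins over 'country' for an OHE column such as 'country_code_GB'.
        candidates = [p for p in pre if col.startswith(p + '_')]
        if candidates:
            original = max(candidates, key=len)
            mapping[col] = (original, col[len(original) + 1:])
    return mapping


mapping = sketch_ohe_mapping(
    ['num_items', 'country'],
    ['num_items', 'country_FR', 'country_GB', 'country_US', 'country_missing'],
)
print(mapping)
```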

Once we have these mappings, we can instantiate the *ConvertProcessedConditionsToGeneral* class:

In [10]:
c = ConvertProcessedConditionsToGeneral(imputed_values=imputed_values_mapping, ohe_categories=ohe_categories_mapping)

Now we can run the *.convert()* method to convert the conditions in the rules generated above from using the processed features to using the original, unprocessed features:

In [11]:
general_rule_strings = c.convert(rule_strings=rule_strings, X=X_processed)

In [12]:
general_rule_strings

{'Rule1': "((X['num_items']<2)|(X['num_items'].isna()))",
 'Rule2': "(X['country'].isna())",
 'Rule3': "(X['country']=='US')",
 'Rule4': "(X['num_items'].isna())&(X['country'].isna())"}
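
One way to sanity-check a conversion like this is to evaluate the original rules on the processed data and the converted rules on the raw data, then confirm that both flag the same rows. The sketch below does this on a toy dataset. The data values are assumptions that mirror the processed table shown earlier, and Python's built-in `eval` is used purely for illustration rather than as the package's own way of applying rules.

```python
import numpy as np
import pandas as pd

# Toy raw data (assumed values mirroring the processed table shown earlier).
X_raw = pd.DataFrame({
    'num_items': [1.0, 2.0, np.nan, 3.0, 1.0],
    'country': ['GB', 'US', 'FR', np.nan, 'GB'],
})
# Same processing as in the notebook: impute nulls, then one hot encode.
X_proc = pd.get_dummies(X_raw.fillna({'num_items': -1, 'country': 'missing'}))

processed_rules = {
    'Rule1': "(X['num_items']<2)",
    'Rule2': "(X['country_missing']==True)",
    'Rule3': "(X['country_US']==True)",
    'Rule4': "(X['num_items']<0)&(X['country_missing']==True)",
}
general_rules = {
    'Rule1': "((X['num_items']<2)|(X['num_items'].isna()))",
    'Rule2': "(X['country'].isna())",
    'Rule3': "(X['country']=='US')",
    'Rule4': "(X['num_items'].isna())&(X['country'].isna())",
}


def flags(rule, X):
    # Evaluate a rule string with the DataFrame bound to the name X.
    return eval(rule, {'X': X}).astype(bool).tolist()


# True for a rule when both versions flag exactly the same rows.
results = {
    name: flags(processed_rules[name], X_proc) == flags(general_rules[name], X_raw)
    for name in processed_rules
}
print(results)
```

Any rule that maps to `False` here would indicate that the converted condition does not reproduce the behaviour of the processed one on this data.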

### Outputs

The *.convert()* method returns a dictionary containing the set of rules which account for imputed/OHE variables, defined using the standard ARGO string format (values) and their names (keys). 

**Note the following:**

- If a numeric rule condition initially had a threshold such that the imputed null values were included in the condition, the converted condition has an additional condition to check whether the feature is also null. 
    - E.g. *Rule1* was initially *(X['num_items']<2)*, which included the imputed value of -1. The converted rule is now *((X['num_items']<2)|(X['num_items'].isna()))*, with an additional condition to check for nulls.
- If a categorical rule condition checks whether the value is the imputed null category, the converted condition is such that it will explicitly check for null values. 
    - E.g. *Rule2* was initially *(X['country_missing']==True)*. The converted rule is now *(X['country'].isna())*, such that it explicitly checks for null values.
- For categorical rule conditions, the converted condition is such that it will explicitly check for the category. 
    - E.g. *Rule3* was initially *(X['country_US']==True)*. The converted rule is now *(X['country']=='US')*, such that it explicitly checks whether the 'country' column is equal to the 'US' category.
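
The categorical cases in the notes above can be sketched with a small regular-expression rewrite. This is a hypothetical illustration of the idea, not the module's implementation: the function `sketch_convert_ohe` and the `(column, category)` mapping shape are assumptions, and the numeric case (comparing a threshold against the imputed value) is deliberately left out.

```python
import re


def sketch_convert_ohe(rule, ohe_mapping, imputed_values):
    """Rewrite X['<ohe col>']==True/False conditions onto the original column.

    Hypothetical illustration only; numeric conditions are left untouched.
    """
    pattern = re.compile(r"X\['([^']+)'\]==(True|False)")

    def replace(match):
        ohe_col = match.group(1)
        is_true = match.group(2) == 'True'
        if ohe_col not in ohe_mapping:
            return match.group(0)  # not an OHE column, keep as is
        column, category = ohe_mapping[ohe_col]
        if imputed_values.get(column) == category:
            # The category is the imputed placeholder, so test for nulls.
            return f"X['{column}'].isna()" if is_true else f"X['{column}'].notna()"
        operator = '==' if is_true else '!='
        return f"X['{column}']{operator}'{category}'"

    return pattern.sub(replace, rule)


# Assumed mapping shapes, for illustration only.
ohe_mapping = {
    'country_US': ('country', 'US'),
    'country_missing': ('country', 'missing'),
}
imputed = {'num_items': -1, 'country': 'missing'}

converted_rule2 = sketch_convert_ohe("(X['country_missing']==True)", ohe_mapping, imputed)
converted_rule3 = sketch_convert_ohe("(X['country_US']==True)", ohe_mapping, imputed)
print(converted_rule2)
print(converted_rule3)
```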

A useful attribute created by running the *.convert()* method is:

* rules: Class containing the rules stored in the standard ARGO string format. Methods from this class can be used to convert the rules into the standard ARGO dictionary or lambda expression representations. See the *rules* module for more information.

### Creating the rules in the system

The generalised rules created above can now be converted to the system-ready format using the *.as_system_dicts()* method from the Rules class. 

**Note that you need to provide the Cassandra datatypes and field names for each pipeline output field name to convert the rules into the system-ready format. In this example they're defined manually, but in practice you can use the *ReturnPipelineOutputDatatypes* and *ReturnCassandraPipelineOutputMapping* classes from the *cassandra_requests* module in the *simility_requests* sub-package:**

In [14]:
field_datatypes = {
    'num_items': 'INT',
    'country': 'TEXT'
}
cassandra_field_names = {
    'num_items': 'num_items',
    'country': 'country'
}

In [15]:
system_conditions = c.rules.as_system_dicts(field_datatypes=field_datatypes, cassandra_field_names=cassandra_field_names)

These system-ready conditions can be used to generate system-ready rule configurations using the *system_config_generation* sub-package. Once these have been created, the *simility_requests* sub-package can be used to create the rules in the system. See the documentation of these sub-packages for more information.

----

## The End

That's it folks - if you have any queries or suggestions please put them in the *#sim-datatools-help* Slack channel or email James directly.