In [1]:
%load_ext autoreload
%autoreload 2

# BivariateScreener Example

The `BivariateScreener` class screens for the interaction effects between two predictor variables and a target variable.  

Because every pair of predictors is screened, `n` predictor variables produce `n * (n - 1) / 2` combinations.  
For this reason, the runtime of this screener is substantially longer than that of `UnivariateScreener`.
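
As a quick sanity check of the pair count, the snippet below enumerates the combinations with the standard library only. The column names `x_0` ... `x_49` are illustrative and simply mirror the naming that appears in the results further down.

```python
from itertools import combinations
from math import comb


def count_pairs(n_predictors):
    """Number of unordered 2-variable combinations: n * (n - 1) / 2."""
    return n_predictors * (n_predictors - 1) // 2


# Illustrative column names; any list of 50 predictors gives the same count.
names = [f"x_{i}" for i in range(50)]
pairs = list(combinations(names, 2))

print(len(pairs), count_pairs(50), comb(50, 2))  # 1225 1225 1225
```

Doubling the number of predictors roughly quadruples the number of pairs, which is why runtime grows quickly.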

Items to note:  
* Unlike `UnivariateScreener`, **only discretized** information measures are supported.
* Similar to `UnivariateScreener`, predictor and target variables must be numeric data types (not categorical).

The same data pre-processing recommendations are also provided:  
* For any categorical predictor variables, it is recommended to utilize the `Nominal2OrdinalEncoder`, which will convert the categorical variables to numeric values.
* For a target variable with two values (binary classification), a conversion of target variable values to `0` and `1` is recommended. A discretized information measure is recommended (mutual information, uncertainty reduction, etc.) with the `nbins_target` parameter value set to `2`.
* For a multi-class target variable, a [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) conversion is recommended along with a discretized information measure. The `nbins_target` parameter value should be set to the number of classes present within the target variable. Please be aware that the screener may struggle to produce meaningful results when the target variable has very high cardinality and also exhibits a moderate amount of class imbalance.
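
The target conversions described above can be sketched without any third-party dependency. The helper `encode_labels` is hypothetical (it is not part of `narrowgate`); it mimics the integer coding that scikit-learn's `LabelEncoder` produces, where classes are sorted and numbered from `0`.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted class order.

    Hypothetical helper for illustration only.
    """
    classes = sorted(set(labels))
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels], classes


# Binary target: the two values become 0 and 1, so nbins_target would be 2.
y_binary, binary_classes = encode_labels(["no", "yes", "yes", "no"])

# Multi-class target: nbins_target would be the number of distinct classes.
y_multi, multi_classes = encode_labels(["red", "green", "blue", "green"])
nbins_target = len(multi_classes)

print(y_binary)               # [0, 1, 1, 0]
print(y_multi, nbins_target)  # [2, 1, 0, 1] 3
```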

Please consult the following texts for additional explanations/information on this screener:
* [Data Mining Algorithms in C++](https://www.amazon.com/gp/product/B078H79QGK/)
* [VarScreen User's manual](http://www.timothymasters.info/varscreen.html)

### Generate Synthetic data

The same synthetic data set used for demonstrating `UnivariateScreener` will also be used here.  

In this example, we should see 5 strong variable combinations out of the set of combinations.
* 2 variables actually interact to comprise the target variable (1 combo)
* each of these two variables is itself an interaction of 2 variables (4 combos)

Also, the 50 predictor variables yield 1,225 two-variable combinations.  
To properly account for this, the number of Monte Carlo permutation test (MCPT) reps needs to increase to 1000 (or more).
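
To give a feel for why the number of reps matters, here is a minimal sketch of a generic Monte Carlo permutation test. It is not `narrowgate`'s implementation. It assumes the usual definitions: the "solo" p-value compares each candidate against its own permuted scores, the "unbiased" p-value compares it against the best permuted score across all candidates, and the unpermuted data counts as one rep. For brevity the candidates are single columns rather than pairs, and absolute covariance stands in for an information measure. All function and variable names are hypothetical.

```python
import random


def abs_covariance(x, y):
    """Stand-in relationship measure: absolute (unnormalized) covariance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)))


def mcpt_p_values(candidates, target, criterion, reps=200, seed=0):
    """Return (solo, unbiased) p-value lists, one entry per candidate."""
    rng = random.Random(seed)
    original = [criterion(c, target) for c in candidates]
    # The unpermuted data counts as the first replication.
    solo_count = [1] * len(candidates)
    unbiased_count = [1] * len(candidates)
    shuffled = list(target)
    for _ in range(reps - 1):
        rng.shuffle(shuffled)  # complete permutation of the target
        permuted = [criterion(c, shuffled) for c in candidates]
        best = max(permuted)
        for i, score in enumerate(original):
            solo_count[i] += permuted[i] >= score
            unbiased_count[i] += best >= score
    solo = [c / reps for c in solo_count]
    unbiased = [c / reps for c in unbiased_count]
    return solo, unbiased


# One informative column plus nine pure-noise columns.
rng = random.Random(42)
n_obs = 100
signal = [rng.gauss(0, 1) for _ in range(n_obs)]
noise_columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(9)]
target = [s + 0.1 * rng.gauss(0, 1) for s in signal]

solo, unbiased = mcpt_p_values([signal] + noise_columns, target, abs_covariance)
print(solo[0], unbiased[0])
```

Under these assumptions the smallest attainable p-value is `1 / reps`, so the resolution of the test is tied directly to the number of reps. Because the unbiased p-value competes against the best of all candidates in every rep, it can never be smaller than the solo p-value.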

The preferred information criterion for use with this screener is `uncertainty_reduction`.
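
For intuition, uncertainty reduction is commonly defined as the fraction of the target's entropy that is removed once the predictors are known: `(H(Y) - H(Y | X1, X2)) / H(Y)`. The sketch below assumes that definition and already-binned inputs; `narrowgate`'s own computation may differ in detail, and the function names are hypothetical. The XOR-style data shows a pure interaction: neither variable is informative alone, yet the pair determines the target completely.

```python
from collections import Counter, defaultdict
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def uncertainty_reduction(x1_bins, x2_bins, y_bins):
    """Fraction of H(Y) explained by the joint bins of (X1, X2)."""
    n = len(y_bins)
    groups = defaultdict(list)
    for pair, y in zip(zip(x1_bins, x2_bins), y_bins):
        groups[pair].append(y)
    # Conditional entropy H(Y | X1, X2): weighted entropy within each joint bin.
    h_y_given_x = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    h_y = entropy(y_bins)
    return (h_y - h_y_given_x) / h_y


# XOR-style interaction on binary bins.
x1 = [0, 0, 1, 1] * 25
x2 = [0, 1, 0, 1] * 25
y = [a ^ b for a, b in zip(x1, x2)]

constant = [0] * len(y)  # placeholder second variable, to look at x1 alone
ur_pair = uncertainty_reduction(x1, x2, y)
ur_x1_alone = uncertainty_reduction(x1, constant, y)
print(ur_pair, ur_x1_alone)
```

Here the pair explains all of the target's uncertainty while `x1` on its own explains none of it. A univariate screen cannot see this kind of relationship.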

In [2]:
from narrowgate.utils.test import make_masters_sample_regression_data

In [3]:
X, y = make_masters_sample_regression_data(
    n_obs=10000,
    n_cols=50,
)

In [4]:
from narrowgate.screeners import BivariateScreener

In [5]:
bi_screen = BivariateScreener(
    nbins_predictors=5,
    nbins_target=5,
    information_criterion='uncertainty_reduction',
    mcpt_type='complete',
    mcpt_reps=1000,
)

bi_screen.screen(X, y)

PROCESSING 50 variables ... 
TOTAL COLUMNS PROCESSED: 50
- Processing time: 3.968823 s
TOTAL # OF 2-VARIABLE COMBINATIONS: 1225
Begin MCPT reps ...



In [6]:
bi_screen_results = bi_screen.results
bi_screen_results.head(10)

BIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 50
Target variable name: target
Number of MCPT reps: 1000
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Uncertainty Reduction


| | Variable 1 | Variable 2 | Uncertainty Reduction | Solo p-value | Unbiased p-value |
|---|---|---|---|---|---|
| 0 | x_0 | x_1 | 0.666827 | 0.001 | 0.001 |
| 1 | x_1 | x_2 | 0.334572 | 0.001 | 0.001 |
| 2 | x_1 | x_3 | 0.327528 | 0.001 | 0.001 |
| 3 | x_0 | x_5 | 0.326688 | 0.001 | 0.001 |
| 4 | x_0 | x_4 | 0.325071 | 0.001 | 0.001 |
| 5 | x_4 | x_5 | 0.181637 | 0.001 | 0.001 |
| 6 | x_1 | x_4 | 0.180486 | 0.001 | 0.001 |
| 7 | x_1 | x_5 | 0.180013 | 0.001 | 0.001 |
| 8 | x_1 | x_45 | 0.178399 | 0.001 | 0.001 |
| 9 | x_1 | x_26 | 0.178227 | 0.001 | 0.001 |


Sure enough, the screening run surfaces 5 strong variable combinations: the top five rows all have an uncertainty reduction of at least 0.325, while the rows that follow drop to about 0.18.