In [1]:
%load_ext autoreload
%autoreload 2

# UnivariateScreener Example

The `UnivariateScreener` class from this package assists in screening for relationships between individual predictor variables and a target variable.  

No interaction effects between predictor variables are taken into account.  

As a result, this screener produces results that differ from the 'feature importance' methods available for predictive models (namely tree-based methods), as it only accounts for a predictor variable's direct relationship with the target variable. Still, this screening method is often sufficient for an initial round of filtering on high-dimensional data. 
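To make this limitation concrete, here is a small dependency-free sketch. It is not part of this package, and `mutual_information` is a hypothetical helper written only for illustration. The target is the XOR of two binary predictors, so each predictor alone carries no information about the target even though the pair determines it exactly. A univariate screen would be expected to score both predictors near zero.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


# All four combinations of two binary predictors, equally likely.
x1 = [0, 0, 1, 1] * 250
x2 = [0, 1, 0, 1] * 250
target = [a ^ b for a, b in zip(x1, x2)]  # pure interaction effect

mi_x1 = mutual_information(x1, target)
mi_x2 = mutual_information(x2, target)
mi_pair = mutual_information(list(zip(x1, x2)), target)

print(f"x1 alone:         {mi_x1:.4f}")
print(f"x2 alone:         {mi_x2:.4f}")
print(f"(x1, x2) jointly: {mi_pair:.4f}")
```

Predictors of this kind are the ones a univariate screen can discard by mistake, which is why the screener is best treated as a first filter rather than a final feature selector.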

This screening method, along with the others in this package, requires that all data be numeric (including the target variable).  
* For any categorical predictor variables, it is recommended to utilize the `Nominal2OrdinalEncoder`, which will convert the categorical variables to numeric values. 
* For a target variable with two values (binary classification), a conversion of target variable values to `0` and `1` is recommended. A discretized information measure is recommended (mutual information, uncertainty reduction, etc.) with the `nbins_target` parameter value set to `2`.
* For a multi-class target variable, a [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) conversion is recommended along with a discretized information measure. The `nbins_target` parameter value should be set to the number of classes present within the target variable. Please be aware that the screener may struggle to produce meaningful results when the target variable has very high cardinality and also exhibits a moderate amount of class imbalance.
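
As a rough illustration of the target-encoding advice above, the sketch below maps class labels to integer codes in sorted label order, which is also how scikit-learn's `LabelEncoder` orders its classes. The helper `encode_labels` and the sample labels are hypothetical. In practice `LabelEncoder` itself would normally be used.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes


# Hypothetical multi-class target.
raw_target = ["setosa", "virginica", "versicolor", "setosa", "virginica"]
encoded_target, classes = encode_labels(raw_target)

# One bin per class is what the text recommends for `nbins_target`.
nbins_target = len(classes)

print(encoded_target)
print(classes)
print(nbins_target)
```

For a binary target the same mapping yields the codes `0` and `1`, with `nbins_target` equal to `2`.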

The following functionality will be demonstrated in this notebook :
* Discretized screening methods
* Continuous screening methods
* 'Complete' vs. 'Cyclic' permutation methods

As recommended in the documentation, for more detailed information on utilizing `UnivariateScreener`, please consult the following texts:
* [Data Mining Algorithms in C++](https://www.amazon.com/gp/product/B078H79QGK/)
* [VarScreen User's manual](http://www.timothymasters.info/varscreen.html)

DISCLAIMER: This library/repository is in no way affiliated with Dr. Masters and receives no remuneration.

### Example 1 : Discretized Screening methods

As mentioned above, this class only supports numeric inputs, for both the predictor and target variables.  

While it is often advised NOT to discretize continuous variables during predictive modeling due to potential information loss, this screener uses discretization extensively as a method for detecting non-linear relationships between predictor and target variables. The number of bins does not have to be large in order to detect such a relationship. Often, just 3 to 5 bins are sufficient. 

Relying strictly on correlation to assess the relationship between a continuous predictor and the target variable may eliminate some variables from consideration due to its inability to properly assess non-linear relationships. For an example of this inability, please see [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet). Yet variables with non-linear relationships are exactly the ones we want to retain for powerful predictive models whose algorithms can take advantage of such relationships during model construction.
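
The sketch below illustrates the point with plain Python. It does not use this package. The helpers are hypothetical and equal-frequency binning is an assumption, since the screener's own binning may differ. A target that is the square of a symmetric predictor has essentially zero linear correlation with it, yet a 5-bin mutual information score is clearly positive.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def equal_frequency_bins(values, nbins):
    """Assign each value to one of `nbins` bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * nbins // len(values)
    return bins


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


# Symmetric predictor and a purely quadratic target.
x = [i - 500 for i in range(1001)]
y = [v * v for v in x]

r = pearson(x, y)
mi = mutual_information(equal_frequency_bins(x, 5), equal_frequency_bins(y, 5))

print(f"Pearson correlation: {r:.6f}")
print(f"Mutual information (5 x 5 bins): {mi:.4f} nats")
```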

A simple example given in the above-referenced texts is replicated here using the synthetic data-generating function `make_masters_sample_regression_data` present within the package. This synthetic data set contains 6 variables which are informationally related to the target variable. 2 of those 6 variables are combinations of the other 4 variables.  

Therefore, we should see the following: 
* 2 variables strongly related to the target variable
* 4 variables moderately related to the target variable
* remaining variables with a random relationship with the target variable

####  Generate Synthetic Data

In [2]:
from narrowgate.utils.test import make_masters_sample_regression_data

In [3]:
X, y = \
    make_masters_sample_regression_data(
        n_obs = 10000,
        n_cols = 50)

In [4]:
from narrowgate.screeners import UnivariateScreener

In [5]:
uni_screen = \
    UnivariateScreener(
        nbins_predictors=5, 
        nbins_target=5, 
        information_criterion='mutual_information', 
        mcpt_reps=100, 
        cscv_subsets=6)

uni_screen.screen(X, y)

PROCESSING 50 variables ... 
TOTAL COLUMNS PROCESSED: 50
- Processing time: 3.911739 s
Begin MCPT reps ...


  0%|          | 0/100 [00:00<?, ?it/s]

Begin CSCV reps ...


  0%|          | 0/20 [00:00<?, ?it/s]

<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f51f165b1c0>

In [6]:
uni_screen_results = uni_screen.results
uni_screen_results.head(10)

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 50
Target variable name: target
Number of MCPT reps: 100
Number of CSCV subsets: 6
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Mutual Information


| | Variable | mutual_information | Solo p-value | Unbiased p-value | P(<=median) |
|---|---|---|---|---|---|
| 0 | x_1 | 0.2817 | 0.01 | 0.01 | 0.0 |
| 1 | x_0 | 0.2595 | 0.01 | 0.01 | 0.0 |
| 2 | x_4 | 0.1242 | 0.01 | 0.01 | 0.0 |
| 3 | x_5 | 0.1234 | 0.01 | 0.01 | 0.0 |
| 4 | x_2 | 0.1134 | 0.01 | 0.01 | 0.0 |
| 5 | x_3 | 0.1099 | 0.01 | 0.01 | 0.0 |
| 6 | x_6 | 0.0013 | 0.06 | 0.98 | 0.35 |
| 7 | x_46 | 0.0012 | 0.14 | 1.0 | 0.25 |
| 8 | x_8 | 0.0011 | 0.16 | 1.0 | 0.35 |
| 9 | x_45 | 0.0011 | 0.15 | 1.0 | 0.35 |


The results above are exactly what we would have expected. 

`x_0` and `x_1` are the two variables derived from variables `x_2` through `x_5` and are more strongly related to the target.

`x_6` and greater are purely random and have no relationship with the target whatsoever.

#### Brief explanation of results

Once again, at risk of sounding like a broken record, the practitioner is strongly encouraged to consult the referenced texts by Timothy Masters for a more detailed review of the results metrics.  

However, the following information is freely available from the [VarScreen user's manual](http://www.timothymasters.info/varscreen.html) and summarized here for convenience.  

The `results` DataFrame consists of the following:

* `Variable`  
    - the name of the predictor variable  

* `Solo p-value`  
    - the probability that a predictor variable with a strictly random relationship to the target could have attained, by sheer luck, an information score at least as high as the one observed.  

* `Unbiased p-value`  
    - the probability that the best performing predictor variable could have attained its superior level of performance by sheer luck if all candidates were truly worthless.  
    - This value is an upper bound for the true unbiased p-value of the candidate, and a very small value is an indication that the predictor variable has true predictive power.  

* `P(<=median)`  
    - an estimation of superior quality of a predictor variable relative to other predictor variables present in the data set. **measures relative power of predictor variables to one another, not absolute power**  
    - only meaningful if the set of predictor variables being evaluated are known to have a non-random relationship to the target variable.  
    - the estimated probability (via cross-validation) that the out-of-sample information value of the best predictor variable will be less than or equal to the median out-of-sample performance of all other predictor variables.   
    - This value is controlled via the `cscv_subsets` parameter and refers to the number of folds to use in Combinatorially Symmetric Cross Validation. For more information about this method, please refer to the linked texts above.
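
As a rough sketch of what `cscv_subsets` controls, the code below enumerates every way of choosing half of the subsets as in-sample, with the remaining half serving as out-of-sample. It assumes the screener evaluates all such half/half combinations. That assumption is consistent with the `0/20` and `0/70` totals on the CSCV progress bars in this notebook (6 and 8 subsets respectively), but the function `cscv_splits` is hypothetical and not the package's implementation.

```python
from itertools import combinations
from math import comb


def cscv_splits(n_subsets):
    """Yield (in_sample, out_of_sample) tuples of subset indices."""
    if n_subsets % 2:
        raise ValueError("n_subsets must be even")
    all_subsets = set(range(n_subsets))
    for in_sample in combinations(range(n_subsets), n_subsets // 2):
        out_of_sample = tuple(sorted(all_subsets - set(in_sample)))
        yield in_sample, out_of_sample


splits_6 = list(cscv_splits(6))
splits_8 = list(cscv_splits(8))

print(len(splits_6), comb(6, 3))
print(len(splits_8), comb(8, 4))
print(splits_6[0])
```

Because the number of combinations grows quickly with the number of subsets, larger values of `cscv_subsets` increase run time substantially.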

**Unbiased p-value Alert**  
Because the calculation of this value is overly conservative, variables with large values near the bottom of the list may actually be useful predictors whose true p-values have been over-estimated. A more rigorous and accurate stepwise method has been implemented by Masters within VarScreen but has not been implemented here, as it requires the practitioner to provide a predefined p-value (alpha) at runtime.
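
The following sketch shows one way the two p-values can be computed with a Monte Carlo permutation test. It is an illustration under stated assumptions, not this package's code. The unpermuted run is counted as the first replication, which is consistent with the smallest reported p-value being 0.01 at 100 reps. Absolute Pearson correlation stands in for the information criterion, and `mcpt_pvalues` is a hypothetical name.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in criterion."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5


def mcpt_pvalues(predictors, target, score, reps=100, seed=0):
    """Return (solo, unbiased) p-value dicts from a complete-permutation test."""
    rng = random.Random(seed)
    original = {name: score(col, target) for name, col in predictors.items()}
    # The unpermuted run counts as replication 1, so every count starts at 1.
    solo = {name: 1 for name in predictors}
    unbiased = {name: 1 for name in predictors}
    shuffled = list(target)
    for _ in range(reps - 1):
        rng.shuffle(shuffled)  # destroy any predictor/target relationship
        permuted = {name: score(col, shuffled) for name, col in predictors.items()}
        best = max(permuted.values())
        for name in predictors:
            if permuted[name] >= original[name]:
                solo[name] += 1  # this variable's own luck
            if best >= original[name]:
                unbiased[name] += 1  # luck of the best of all candidates
    return (
        {name: count / reps for name, count in solo.items()},
        {name: count / reps for name, count in unbiased.items()},
    )


rng = random.Random(42)
n_obs = 300
predictors = {
    "signal": [rng.gauss(0, 1) for _ in range(n_obs)],
    "noise_a": [rng.gauss(0, 1) for _ in range(n_obs)],
    "noise_b": [rng.gauss(0, 1) for _ in range(n_obs)],
}
target = [s + 0.5 * rng.gauss(0, 1) for s in predictors["signal"]]

solo_p, unbiased_p = mcpt_pvalues(predictors, target, abs_corr, reps=100)
for name in predictors:
    print(name, solo_p[name], unbiased_p[name])
```

Every variable is compared against the best permuted score across all candidates, so its unbiased p-value can never be smaller than its solo p-value. This is the source of the conservatism described above for variables further down the list.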

### Example 2 : Continuous Screening methods

The same data set used in the previous example will be used again in the call to demonstrate screening using a continuous information measure.  

The screening metric being used will be Spearman's rank correlation. Because the data set is known to contain no non-linearities between the predictor and target variables, the results should be roughly the same as in the discretized example.

Note that the parameters `nbins_predictors` and `nbins_target` are both set to `None` in order to invoke the rank correlation calculation.  
In addition, the `information_criterion` parameter must be changed as well.
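
For readers unfamiliar with the measure, Spearman's rho is the Pearson correlation of the ranks of the two variables. The dependency-free sketch below uses hypothetical helpers and is not the package's implementation. It shows that any strictly monotone relationship, linear or not, yields a rho of 1 (or -1 when decreasing).

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_monotone = [v ** 3 for v in x]  # non-linear but strictly increasing
y_reversed = [-v for v in x]      # strictly decreasing

print(spearman_rho(x, y_monotone))
print(spearman_rho(x, y_reversed))
print(average_ranks([10, 20, 20, 30]))
```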

In [7]:
uni_screen_continuous = \
    UnivariateScreener(
        nbins_predictors=None, 
        nbins_target=None, 
        information_criterion='spearman_rho', 
        mcpt_reps=100, 
        cscv_subsets=6)

uni_screen_continuous.screen(X, y)

PROCESSING 50 variables ... 
TOTAL COLUMNS PROCESSED: 50
- Processing time: 0.000716 s
Begin MCPT reps ...


  0%|          | 0/100 [00:00<?, ?it/s]

Begin CSCV reps ...


  0%|          | 0/20 [00:00<?, ?it/s]

<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f51e59de7f0>

In [8]:
uni_screen_continuous_results = uni_screen_continuous.results
uni_screen_continuous_results.head(10)

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 50
Target variable name: target
Number of MCPT reps: 100
Number of CSCV subsets: 6
MCPT permutation method: complete
Information Criterion: Spearman Rho


| | Variable | spearman_rho | Solo p-value | Unbiased p-value | P(<=median) |
|---|---|---|---|---|---|
| 0 | x_1 | 0.6955 | 0.01 | 0.01 | 0.0 |
| 1 | x_0 | 0.6823 | 0.01 | 0.01 | 0.0 |
| 2 | x_4 | 0.4955 | 0.01 | 0.01 | 0.0 |
| 3 | x_5 | 0.4931 | 0.01 | 0.01 | 0.0 |
| 4 | x_2 | 0.4779 | 0.01 | 0.01 | 0.0 |
| 5 | x_3 | 0.4738 | 0.01 | 0.01 | 0.0 |
| 6 | x_26 | 0.0222 | 0.03 | 0.45 | 0.0 |
| 7 | x_44 | 0.0169 | 0.03 | 0.92 | 0.0 |
| 8 | x_31 | 0.0158 | 0.06 | 0.97 | 0.25 |
| 9 | x_38 | 0.0129 | 0.1 | 1.0 | 0.3 |


Sure enough, the six informative variables appear in the same order as in the discretized example.

### Example 3: Serially-correlated variables

In this example, a new data set of several serially correlated variables is going to be created. 

A rolling 10-period average is going to be calculated from random variables, `x_6` to `x_15`. 
These original variables have no relation with one another.  
The new variables are only associated by the length of the moving average window.

The newly derived `x_6` is then going to be removed and used as the target variable while the other serially-correlated random variables will be used as predictor variables.
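
To see why a rolling average introduces serial correlation, the plain-Python sketch below compares the lag-1 autocorrelation of white noise with that of its 10-period trailing mean. The helpers are hypothetical, and `rolling_mean` mimics a trailing window that starts averaging from the first observation. Consecutive 10-period averages share 9 of their 10 observations, so the smoothed series should show a lag-1 autocorrelation close to 0.9.

```python
import random


def rolling_mean(values, window):
    """Trailing mean over up to `window` values, starting from the first one."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def lag1_autocorrelation(values):
    """Sample autocorrelation between each value and its successor."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(5000)]
smoothed = rolling_mean(noise, 10)

raw_ac = lag1_autocorrelation(noise)
smooth_ac = lag1_autocorrelation(smoothed)

print(f"white noise lag-1 autocorrelation:   {raw_ac:.3f}")
print(f"rolling mean lag-1 autocorrelation:  {smooth_ac:.3f}")
```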

If serial correlation exists in either the predictor or target variables and `complete` is chosen for the `mcpt_type` parameter, the screener may report relationships between the predictors and the target even though none exist.  

Therefore, be very careful to choose `cyclic` as the `mcpt_type` if even a hint of serial correlation may be present within any variables being examined.
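
The difference between the two permutation types can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes that cyclic permutation means rotating a series by a random offset with wraparound, as described in the referenced texts. A complete permutation scrambles the time order entirely. A cyclic permutation keeps neighbouring observations next to each other, so the serial correlation is preserved in each permuted replication.

```python
import random


def complete_permutation(values, rng):
    """Shuffle all values, destroying any serial structure."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def cyclic_permutation(values, rng):
    """Rotate the series by a random non-zero offset, wrapping around the end."""
    offset = rng.randrange(1, len(values))
    return list(values[offset:]) + list(values[:offset])


rng = random.Random(0)
series = list(range(10))  # stand-in for a time-ordered variable

complete = complete_permutation(series, rng)
cyclic = cyclic_permutation(series, rng)

print("original:", series)
print("complete:", complete)
print("cyclic:  ", cyclic)
```

Under cyclic permutation the permuted scores come from series with the same serial structure as the original, which is what makes them a fair comparison for serially correlated data.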

In [9]:
# Create a new data set with serial correlation 
# Use random variables with no relationship with one another
# Create serially correlated variables by calculating a rolling average of the series.
import pandas as pd

dep_list = []

for icol in range(6,16):
    col = f'x_{icol}'
    dep_list.append(X[col].rolling(10, min_periods=1).mean())
    
X_dep = pd.concat(dep_list, axis=1)

y_dep = X_dep['x_6']
X_dep.drop(columns='x_6', inplace=True)

Perform a univariate screening with the serially-correlated data.  

`complete` is selected as `mcpt_type`

In [10]:
uni_screen_complete = \
    UnivariateScreener(
        nbins_predictors=5, 
        nbins_target=5, 
        information_criterion='mutual_information', 
        mcpt_type = 'complete',
        mcpt_reps=100, 
        cscv_subsets=8)

uni_screen_complete.screen(X_dep, y_dep)

PROCESSING 9 variables ... 
TOTAL COLUMNS PROCESSED: 9
- Processing time: 0.007171 s
Begin MCPT reps ...


  0%|          | 0/100 [00:00<?, ?it/s]

Begin CSCV reps ...


  0%|          | 0/70 [00:00<?, ?it/s]

<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f51e5a78370>

In [11]:
uni_screen_complete_results = uni_screen_complete.results
uni_screen_complete_results

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 9
Target variable name: x_6
Number of MCPT reps: 100
Number of CSCV subsets: 8
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Mutual Information


| | Variable | mutual_information | Solo p-value | Unbiased p-value | P(<=median) |
|---|---|---|---|---|---|
| 0 | x_11 | 0.002 | 0.02 | 0.04 | 0.371429 |
| 1 | x_8 | 0.0017 | 0.02 | 0.06 | 0.442857 |
| 2 | x_13 | 0.0015 | 0.02 | 0.17 | 0.371429 |
| 3 | x_7 | 0.0014 | 0.03 | 0.25 | 0.4 |
| 4 | x_12 | 0.0013 | 0.03 | 0.3 | 0.571429 |
| 5 | x_9 | 0.0012 | 0.12 | 0.59 | 0.514286 |
| 6 | x_15 | 0.0011 | 0.14 | 0.73 | 0.442857 |
| 7 | x_10 | 0.001 | 0.23 | 0.9 | 0.585714 |
| 8 | x_14 | 0.0009 | 0.4 | 0.96 | 0.671429 |


Notice the low mutual information values, but also the low solo and unbiased p-values.  

Because `complete` permutation does not account for the serial correlation, the screener has detected relationships between the predictor and target variables when, in fact, none exist. The only thing the variables have in common is the window length of the moving average.

Re-run the same screener but use `cyclic` instead.

In [12]:
uni_screen_cyclic = \
    UnivariateScreener(
        nbins_predictors=5, 
        nbins_target=5, 
        information_criterion='mutual_information', 
        mcpt_type='cyclic',
        mcpt_reps=100, 
        cscv_subsets=8)

uni_screen_cyclic.screen(X_dep, y_dep)

PROCESSING 9 variables ... 
TOTAL COLUMNS PROCESSED: 9
- Processing time: 0.006906 s
Begin MCPT reps ...


  0%|          | 0/100 [00:00<?, ?it/s]

Begin CSCV reps ...


  0%|          | 0/70 [00:00<?, ?it/s]

<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f51e5a78040>

In [13]:
uni_screen_cyclic_results = uni_screen_cyclic.results
uni_screen_cyclic_results

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 9
Target variable name: x_6
Number of MCPT reps: 100
Number of CSCV subsets: 8
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: cyclic
Information Criterion: Mutual Information


| | Variable | mutual_information | Solo p-value | Unbiased p-value | P(<=median) |
|---|---|---|---|---|---|
| 0 | x_11 | 0.002 | 0.21 | 0.9 | 0.371429 |
| 1 | x_8 | 0.0017 | 0.38 | 0.97 | 0.442857 |
| 2 | x_13 | 0.0015 | 0.38 | 0.99 | 0.371429 |
| 3 | x_7 | 0.0014 | 0.48 | 1.0 | 0.4 |
| 4 | x_12 | 0.0013 | 0.48 | 1.0 | 0.571429 |
| 5 | x_9 | 0.0012 | 0.64 | 1.0 | 0.514286 |
| 6 | x_15 | 0.0011 | 0.66 | 1.0 | 0.442857 |
| 7 | x_10 | 0.001 | 0.71 | 1.0 | 0.585714 |
| 8 | x_14 | 0.0009 | 0.85 | 1.0 | 0.671429 |


This is much better.  

While some of the variables still have somewhat low solo p-values, all of the unbiased p-values are quite high, therefore indicating no predictive power among the predictor variables.

### Conclusion

The `UnivariateScreener` is an excellent first-use tool for exploratory data analysis.  

Monte Carlo permutation testing helps assess the robustness of predictor variables. 

Additional examples with real-world data sets