# Nominal2OrdinalEncoder Example

Synthetic data sets are used to demonstrate the `Nominal2OrdinalEncoder` class. While scikit-learn provides utility functions such as `make_regression` and `make_classification`, neither supports the inclusion of nominal/categorical predictor variables.

As a result, a synthetic data-generating function demonstrated at 


The `UnivariateScreener` class from this package will also be used to demonstrate information-based measures on both the original numeric predictor variables and the Nominal2Ordinal-encoded features.

In [1]:
%load_ext autoreload
%autoreload 2

## Classification Example

### Generate synthetic dataset

In [2]:
from sklearn.datasets import make_classification
import pandas as pd

n_obs = 10000
n_features = 100

X, y = make_classification(
            n_samples=n_obs, 
            n_features=n_features, 
            n_informative=10, 
            n_redundant=0, 
            n_classes=2, 
            flip_y=0.2, 
            random_state=42)

X = pd.DataFrame(X, columns=[f'x{i}' for i in range(n_features)])
y = pd.Series(y, name='target')

### Examine univariate relationship with target

In [4]:
from narrowgate.screeners import UnivariateScreener

In [5]:
uni_screen = \
    UnivariateScreener(
        nbins_predictors=5, 
        nbins_target=5, 
        information_criterion='mutual_information', 
        mcpt_reps=100, 
        cscv_subsets=6)

uni_screen.screen(X, y)

PROCESSING 100 variables ... 
TOTAL COLUMNS PROCESSED: 100
- Processing time: 3.987596 s
Begin MCPT reps ...



Begin CSCV reps ...



<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f762e980790>

In [6]:
uni_screen.results.head(15)

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 100
Target variable name: target
Number of MCPT reps: 100
Number of CSCV subsets: 6
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Mutual Information


| | Variable | mutual_information | Solo p-value | Unbiased p-value | P(<=median) |
|---|---|---|---|---|---|
| 0 | x91 | 0.0873 | 0.01 | 0.01 | 0.0 |
| 1 | x13 | 0.0421 | 0.01 | 0.01 | 0.0 |
| 2 | x95 | 0.0243 | 0.01 | 0.01 | 0.0 |
| 3 | x54 | 0.019 | 0.01 | 0.01 | 0.0 |
| 4 | x57 | 0.0183 | 0.01 | 0.01 | 0.0 |
| 5 | x48 | 0.0172 | 0.01 | 0.01 | 0.0 |
| 6 | x87 | 0.0149 | 0.01 | 0.01 | 0.0 |
| 7 | x71 | 0.0009 | 0.01 | 0.18 | 0.05 |
| 8 | x96 | 0.0007 | 0.01 | 0.52 | 0.05 |
| 9 | x11 | 0.0006 | 0.08 | 0.87 | 0.25 |


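The dataset was generated with `n_informative=10`, and the ten rows shown above are the highest-ranked variables in the results DataFrame. The first seven (`x91` through `x87`) have an unbiased p-value of 0.01 and a P(<=median) of 0.0, which indicates a non-random relationship with the binary target variable. The remaining three (`x71`, `x96`, `x11`) have mutual information below 0.001 and unbiased p-values of 0.18 or higher, so this univariate screen does not distinguish their relationship with the target from chance.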
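To make the mutual information and p-value columns more concrete, the sketch below computes mutual information between a binned predictor and a binary target and estimates a solo p-value with a Monte Carlo permutation test. It is an illustration only: the helper names `binned_mutual_information` and `permutation_p_value`, the equal-frequency binning, and the convention of counting the unpermuted data as one replication are assumptions, and `UnivariateScreener` may differ in each of these details.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score


def binned_mutual_information(x, y, nbins=5):
    """Mutual information (in nats) between an equal-frequency-binned x and a discrete y."""
    # labels=False returns integer bin codes instead of Interval labels
    x_binned = pd.qcut(x, q=nbins, labels=False, duplicates="drop")
    return mutual_info_score(y, x_binned)


def permutation_p_value(x, y, nbins=5, reps=100, seed=0):
    """Share of replications (unpermuted data included) scoring >= the observed value."""
    rng = np.random.default_rng(seed)
    observed = binned_mutual_information(x, y, nbins)
    count = 1  # the unpermuted data counts as one replication
    for _ in range(reps - 1):
        # Shuffling y breaks any real relationship between x and y
        if binned_mutual_information(x, rng.permutation(y), nbins) >= observed:
            count += 1
    return observed, count / reps


# Hypothetical demo data: one predictor tied to the target, one pure noise
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=2000)
signal = y_demo + rng.normal(scale=1.0, size=2000)
noise = rng.normal(size=2000)

mi_signal, p_signal = permutation_p_value(signal, y_demo)
mi_noise, p_noise = permutation_p_value(noise, y_demo)
print(f"signal: MI={mi_signal:.4f}, p={p_signal:.2f}")
print(f"noise:  MI={mi_noise:.4f}, p={p_noise:.2f}")
```

Under this counting convention, 100 replications cannot produce a p-value below 0.01, which is the smallest value that appears in the table above.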

To test the Nominal2Ordinal encoding capability, the second variable in the results list (`x13`) will be discretized to create a categorical variable, which should also have a non-random relationship with the target variable.

After the variable is discretized, Nominal2Ordinal encoding will be performed on the categorical variable in the modified data set and the modified data set will be screened again.

### Discretize 2nd most informative variable

In [21]:
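# Name of the second-ranked variable in the screening results
var_to_discretize = uni_screen.results.Variable[1]

new_var = f'{var_to_discretize}_binned'

# Cut into 8 equal-width intervals, stored as a categorical column
X[new_var] = pd.cut(X[var_to_discretize], bins=8)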
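For readers unfamiliar with `pd.cut`, the short sketch below shows what the discretization step produces, using a hypothetical stand-in series `x_demo` rather than the notebook's data. Passing an integer to `bins` yields equal-width intervals, and the result is an ordered categorical whose labels are `Interval` objects. The string-labelled variant is only an illustration of how the column could be made to look like a plain nominal variable; whether the encoder needs this is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a continuous predictor
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=1000), name="x_demo")

# Eight equal-width intervals spanning the observed range
binned = pd.cut(values, bins=8)
print(binned.dtype)
print(binned.value_counts(sort=False))  # counts listed in interval order

# Same bins with string labels instead of Interval objects
binned_labels = pd.cut(values, bins=8, labels=[f"bin_{i}" for i in range(8)])
print(binned_labels.head())
```

Because the bins are equal-width rather than equal-frequency, the outer intervals of a roughly bell-shaped variable hold far fewer observations than the central ones.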



### Perform Nominal2Ordinal encoding

In [20]:
X.columns

Index(['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9',
       ...
       'x91', 'x92', 'x93', 'x94', 'x95', 'x96', 'x97', 'x98', 'x99',
       'x13_binned'],
      dtype='object', length=101)
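With `x13_binned` now present in `X`, the general idea of mapping a nominal variable onto an ordinal scale can be sketched without the package. One common approach ranks the categories by their mean target value and replaces each category with its rank. The helpers `fit_ordinal_mapping` and `apply_ordinal_mapping` below are hypothetical, and this is not necessarily the algorithm that `Nominal2OrdinalEncoder` implements.

```python
import pandas as pd


def fit_ordinal_mapping(categories: pd.Series, target: pd.Series) -> dict:
    """Rank categories by their mean target value (0 = lowest mean)."""
    # astype(str) also turns Interval labels from pd.cut into plain strings
    means = target.groupby(categories.astype(str)).mean().sort_values()
    return {category: rank for rank, category in enumerate(means.index)}


def apply_ordinal_mapping(categories: pd.Series, mapping: dict) -> pd.Series:
    """Replace each category with its rank; unseen categories become NaN."""
    return categories.astype(str).map(mapping)


# Hypothetical toy data with a nominal predictor and a binary target
demo = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue", "red", "green", "blue", "red"],
    "target": [1, 0, 1, 0, 1, 0, 0, 1],
})

mapping = fit_ordinal_mapping(demo["colour"], demo["target"])
demo["colour_ordinal"] = apply_ordinal_mapping(demo["colour"], mapping)
print(mapping)
print(demo)
```

Since the mapping is derived from the target, it should be fitted on training data only and then applied to held-out data. Fitting and evaluating on the same rows would overstate the encoded feature's relationship with the target.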