# Nominal2OrdinalEncoder Example

Synthetic data sets will be used to demonstrate the usage of the Nominal2OrdinalEncoder class.  

While scikit-learn provides utility functions such as `make_regression` and `make_classification`, neither supports the inclusion of nominal/categorical predictor variables.

As a result, a synthetic data-generating function from [this blog post](https://brendanhasz.github.io/2019/03/04/target-encoding.html#data) by Brendan Hasz will be used. This incredibly useful function produces categorical predictor variables linked to a target variable by means of a beta-binomial probability distribution. The function has been updated within this package to use numpy's random number generator so that results are reproducible.
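
To make the idea concrete, here is a minimal sketch of how a seeded numpy `Generator` can produce a categorical predictor tied to a numeric target. It is not the package's or the blog post's implementation: the function name `make_toy_categorical`, its parameters, and the way the beta-binomial draw is used to pick category indices are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def make_toy_categorical(n_obs, n_cats, seed, a=2.0, b=5.0, noise=1.0):
    """Toy generator (hypothetical): one categorical predictor tied to a target."""
    rng = np.random.default_rng(seed)  # seeded Generator => reproducible draws

    # Beta-binomial draw: a per-row probability from Beta(a, b), then a
    # Binomial(n_cats - 1, p) count that is used as the category index.
    p = rng.beta(a, b, size=n_obs)
    cat_idx = rng.binomial(n_cats - 1, p)

    # Each category gets its own (assumed) additive effect on the target.
    cat_effect = rng.normal(size=n_cats)
    y = cat_effect[cat_idx] + noise * rng.normal(size=n_obs)

    labels = np.array([f"cat_{i}" for i in range(n_cats)])
    X = pd.DataFrame({"categorical_0": labels[cat_idx]})
    return X, pd.Series(y, name="target")


# The same seed yields the same data set on every call.
X_a, y_a = make_toy_categorical(n_obs=1000, n_cats=20, seed=32)
X_b, y_b = make_toy_categorical(n_obs=1000, n_cats=20, seed=32)
```

Because the beta-binomial draw concentrates mass on some indices, the category frequencies come out imbalanced, which is the kind of property the `imbalance` argument is meant to control in the real function.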

The `UnivariateScreener` class from this package will also be used to demonstrate the use of information-based measures on both the original numeric predictor variables and the Nominal2Ordinal-encoded features.

In [1]:
%load_ext autoreload
%autoreload 2

## Regression Example

### Generate synthetic dataset

In [2]:
import pandas as pd

from narrowgate.utils.test import make_regression_with_categoricals

### Declare the variables used to generate the synthetic dataset

In [3]:
# Number of rows/observations
n_obs = 10000

# Number of categorical columns (total)
n_cat_vars = 10

# Number of categorical columns that will be informative in relation to the target variable
n_cat_informative = 5

# Number of categories within each categorical variable
n_cats_per_var = 20

# The imbalance between the category occurrences within each categorical variable 
imbalance = 0.15
    
# Noise to add to the target variable 
noise = 1.0
    
# Number of continuous columns in the data set
n_cont_vars = 10
    
# Weight of the effect of the continuous variables on the target variable
cont_weight = 0.5
    
# Proportion of variance associated with interaction effects between the categorical variables
# Does not include interaction effects with continuous variables
cat_interactions = 0.0

### Generate Synthetic Data

In [4]:
X, y = \
    make_regression_with_categoricals(
        n_obs = n_obs,
        n_cat_vars = n_cat_vars,
        n_cat_vars_informative = n_cat_informative,
        n_cats_per_var = n_cats_per_var,
        imbalance = imbalance,
        noise = noise,
        n_cont_vars = n_cont_vars,
        cont_weight = cont_weight,
        cat_interactions = cat_interactions,
        random_state = 32)

Let's take a quick, high-level look at `X` to make sure the desired data has been produced.

In [5]:
X.head()

Unnamed: 0,categorical_0,categorical_1,categorical_2,categorical_3,categorical_4,categorical_5,categorical_6,categorical_7,categorical_8,categorical_9,continuous_0,continuous_1,continuous_2,continuous_3,continuous_4,continuous_5,continuous_6,continuous_7,continuous_8,continuous_9
0,c,c,bh,h,e,bf,e,b,g,bf,0.706137,0.509071,-0.201694,-0.362294,-0.463444,0.673694,0.078033,1.187407,-0.364939,-0.050849
1,ba,f,h,g,be,bf,be,b,i,g,-0.02351,-0.291555,0.125132,-0.701162,0.639746,1.011278,0.287411,-0.223269,0.283496,0.022197
2,g,b,bd,bg,bh,ba,d,bj,e,c,-0.308413,-0.296885,0.20002,-0.16445,0.177224,0.06343,-1.177529,0.445588,-0.805963,0.021641
3,f,b,g,bb,f,h,bh,bc,bc,be,0.543808,0.237249,0.262364,0.473875,0.317089,0.319631,0.448239,0.653434,0.044292,-0.027299
4,bc,bd,j,c,bf,e,d,b,bb,be,1.134608,0.485454,0.457492,-0.402468,0.061813,0.466507,0.363072,-0.081751,-0.149163,-0.411838


It does look like we have 10 categorical and 10 continuous variables. However, we don't know the relationship of these variables to the target variable.

Let's use the `Nominal2OrdinalEncoder` class to encode the categorical variables as numeric values and then use `UnivariateScreener` to see which of the categorical variables are informationally related to the target variable. There should be only 5.

We are going to have the encoder `auto`-select the non-numeric columns to transform.

In [6]:
from narrowgate.encoders import Nominal2OrdinalEncoder

nom_ord_encoder = Nominal2OrdinalEncoder()

nom_ord_encoder.fit(X, y, columns='auto')

X_numeric = nom_ord_encoder.transform(X)
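
As a rough illustration of the two ideas in this cell, the sketch below selects non-numeric columns automatically and then replaces each category with an ordinal rank. The ranking rule used here (ordering categories by their mean target value) is an assumption chosen for illustration; it is not claimed to be what `Nominal2OrdinalEncoder` does internally, and the toy column names are hypothetical.

```python
import pandas as pd

# Tiny hypothetical frame: one nominal column and one numeric column.
X_toy = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
y_toy = pd.Series([10.0, 1.0, 5.0, 3.0, 12.0, 7.0])

# "auto"-style selection: every column that is not numeric.
nominal_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

X_encoded = X_toy.copy()
for col in nominal_cols:
    # Rank categories by mean target: lowest mean -> 0, next -> 1, and so on.
    means = y_toy.groupby(X_toy[col]).mean().sort_values()
    mapping = {category: rank for rank, category in enumerate(means.index)}
    X_encoded[col] = X_toy[col].map(mapping)
```

Numeric columns such as `size` pass through unchanged, so the encoded frame can be handed to any routine that expects purely numeric input.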

### Examine univariate relationship with target

Finally, we are going to apply a `UnivariateScreener` to the newly-transformed categorical data along with the existing continuous variables.  

If all goes well, we should only see 5 out of the 10 categorical variables show up as being informative.  We should also see the continuous variables have a slight bit of information related to the target variable but not a significant amount.
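
For readers unfamiliar with the measure, the following sketch shows one way to compute mutual information between a predictor and a target after discretizing each into five bins. The equal-frequency binning and the helper name `binned_mutual_information` are assumptions for illustration; the screener's own binning and estimation details may differ.

```python
import numpy as np
import pandas as pd


def binned_mutual_information(x, y, nbins=5):
    """Mutual information (in nats) between two equal-frequency-binned series."""
    # Rank first so that ties cannot produce duplicate bin edges in qcut.
    x_bins = pd.qcut(pd.Series(x).rank(method="first"), nbins, labels=False)
    y_bins = pd.qcut(pd.Series(y).rank(method="first"), nbins, labels=False)

    # Joint distribution of bin memberships and its two marginals.
    joint = pd.crosstab(x_bins, y_bins).to_numpy(dtype=float)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over the non-empty cells.
    nonzero = joint > 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))


rng = np.random.default_rng(0)
signal = rng.normal(size=5000)
target = signal + rng.normal(size=5000)  # related to `signal`
unrelated = rng.normal(size=5000)        # independent of `target`

mi_signal = binned_mutual_information(signal, target)
mi_unrelated = binned_mutual_information(unrelated, target)
```

A predictor that carries information about the target yields a clearly larger value than an unrelated one, although the unrelated value is rarely exactly zero in a finite sample. That residual is why a permutation test is useful alongside the raw score.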

In [7]:
from narrowgate.screeners import UnivariateScreener

In [8]:
uni_screen = \
    UnivariateScreener(
        nbins_predictors=5, 
        nbins_target=5, 
        information_criterion='mutual_information', 
        mcpt_reps=100, 
        cscv_subsets=6)

uni_screen.screen(X_numeric, y)

PROCESSING 20 variables ... 
TOTAL COLUMNS PROCESSED: 20
- Processing time: 3.852264 s
Begin MCPT reps ...


  0%|          | 0/100 [00:00<?, ?it/s]

Begin CSCV reps ...


  0%|          | 0/20 [00:00<?, ?it/s]

<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f6d58161ca0>

In [9]:
uni_screen.results

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 20
Target variable name: continuous_0
Number of MCPT reps: 100
Number of CSCV subsets: 6
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Mutual Information


Unnamed: 0,Variable,mutual_information,Solo p-value,Unbiased p-value,P(<=median)
0,categorical_6,0.0762,0.01,0.01,0.0
1,continuous_7,0.0582,0.01,0.01,0.0
2,categorical_8,0.0531,0.01,0.01,0.0
3,categorical_9,0.0453,0.01,0.01,0.0
4,categorical_5,0.0412,0.01,0.01,0.0
5,continuous_4,0.0371,0.01,0.01,0.0
6,continuous_6,0.0358,0.01,0.01,0.0
7,categorical_7,0.0317,0.01,0.01,0.0
8,continuous_2,0.0069,0.01,0.01,0.25
9,continuous_3,0.0054,0.01,0.01,0.5


The above univariate screening results indicate that there do indeed appear to be 5 categorical variables transformed by `Nominal2OrdinalEncoder` with a non-random relationship to the target variable. As expected from the `cont_weight` parameter, there are also several continuous variables which exhibit an informational relationship with the target variable.
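
The p-values in the table come from a Monte Carlo permutation test (MCPT). The sketch below shows the general idea under stated assumptions: the helper `permutation_p_value` is hypothetical, the unpermuted data is counted as one of the replications, and absolute correlation stands in for mutual information to keep the example short. It is not the screener's implementation.

```python
import numpy as np


def permutation_p_value(x, y, statistic, n_reps=100, seed=0):
    """Share of replications whose score is at least the observed score."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    count = 1  # the unpermuted data counts as one replication
    for _ in range(n_reps - 1):
        # Shuffling the target destroys any real relationship with x.
        if statistic(x, rng.permutation(y)) >= observed:
            count += 1
    return count / n_reps


def abs_corr(x, y):
    """Stand-in statistic: absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(1)
x = rng.normal(size=500)
y_related = x + rng.normal(size=500)
y_noise = rng.normal(size=500)

p_related = permutation_p_value(x, y_related, abs_corr)
p_noise = permutation_p_value(x, y_noise, abs_corr)
```

Under this counting convention the smallest attainable p-value with 100 replications is 1/100 = 0.01, which is consistent with the values reported in the table above.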

In [21]:
var_to_discretize = uni_screen.results.Variable[1]

new_var = f'{var_to_discretize}_binned'

X[new_var] = pd.cut(X[var_to_discretize], bins=8)

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 10000
Number of predictors: 100
Target variable name: target
Number of MCPT reps: 100
Number of CSCV subsets: 6
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Mutual Information


## Extreme Example

While the results from the above example are representative of quite a few data sets, it is also worthwhile to test an extreme example.

The following example demonstrates a synthetic data set containing 100 categorical variables where only 3 are informationally related to the target variable. In addition, the continuous variables are given a significant weight and the number of observations is increased. We are attempting to find a needle in a haystack among the categorical variables.

How well will Nominal2Ordinal encoding perform?

In [10]:
# Number of rows/observations
n_obs = 100000

# Number of categorical columns (total)
n_cat_vars = 100

# Number of categorical columns that will be informative in relation to the target variable
n_cat_informative = 3

# Number of categories within each categorical variable
n_cats_per_var = 20

# The imbalance between the category occurrences within each categorical variable 
imbalance = 0.25
    
# Noise to add to the target variable 
noise = 1.0
    
# Number of continuous columns in the data set
n_cont_vars = 50
    
# Weight of the effect of the continuous variables on the target variable
cont_weight = 0.8
    
# Proportion of variance associated with interaction effects between the categorical variables
# Does not include interaction effects with continuous variables
cat_interactions = 0.0

In [11]:
X_extreme, y_extreme = \
    make_regression_with_categoricals(
        n_obs = n_obs,
        n_cat_vars = n_cat_vars,
        n_cat_vars_informative = n_cat_informative,
        n_cats_per_var = n_cats_per_var,
        imbalance = imbalance,
        noise = noise,
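        # NOTE: n_cat_vars (100) is passed here rather than n_cont_vars (50),
        # which is why the output below reports 100 continuous columns (200 predictors)
        n_cont_vars = n_cat_vars,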
        cont_weight = cont_weight,
        cat_interactions = cat_interactions,
        random_state = 32)

In [14]:
nom_ord_encoder_extreme = Nominal2OrdinalEncoder()
nom_ord_encoder_extreme.fit(X_extreme, y_extreme, columns='auto')
X_numeric_extreme = nom_ord_encoder_extreme.transform(X_extreme)

In [15]:
uni_screen_extreme = \
    UnivariateScreener(
        nbins_predictors=5, 
        nbins_target=5, 
        information_criterion='mutual_information', 
        mcpt_reps=1000, 
        cscv_subsets=6)

uni_screen_extreme.screen(X_numeric_extreme, y_extreme)

PROCESSING 200 variables ... 
TOTAL COLUMNS PROCESSED: 200
- Processing time: 0.519417 s
Begin MCPT reps ...


  0%|          | 0/1000 [00:00<?, ?it/s]

Begin CSCV reps ...


  0%|          | 0/20 [00:00<?, ?it/s]

<narrowgate.screeners.univariate.base.UnivariateScreener at 0x7f6d56aa4940>

In [16]:
uni_screen_extreme_results = uni_screen_extreme.results
uni_screen_extreme_results.head(50)

UNIVARIATE SCREENER RESULTS
---------------------------
Number of observations: 100000
Number of predictors: 200
Target variable name: continuous_0
Number of MCPT reps: 1000
Number of CSCV subsets: 6
Number of bins (predictors): 5
Number of bins (target): 5
MCPT permutation method: complete
Information Criterion: Mutual Information


Unnamed: 0,Variable,mutual_information,Solo p-value,Unbiased p-value,P(<=median)
0,continuous_29,0.0325,0.001,0.001,0.0
1,continuous_30,0.0267,0.001,0.001,0.0
2,continuous_79,0.0169,0.001,0.001,0.0
3,continuous_50,0.0164,0.001,0.001,0.0
4,continuous_78,0.016,0.001,0.001,0.0
5,continuous_38,0.0145,0.001,0.001,0.0
6,continuous_52,0.0134,0.001,0.001,0.0
7,continuous_92,0.0121,0.001,0.001,0.0
8,continuous_73,0.0117,0.001,0.001,0.0
9,continuous_71,0.0109,0.001,0.001,0.0


Even though the continuous variables as a group were weighted heavily toward the target variable (`0.8`), the screener still recognized three categorical variables as being informationally related to the target variable. Can you find them in the above results?

In [17]:
categorical_mask = ['categorical' in var_name for var_name in uni_screen_extreme_results.Variable]
uni_screen_extreme_results.loc[categorical_mask].head(10)

Unnamed: 0,Variable,mutual_information,Solo p-value,Unbiased p-value,P(<=median)
16,categorical_98,0.0074,0.001,0.001,0.0
22,categorical_99,0.0054,0.001,0.001,0.0
32,categorical_97,0.0043,0.001,0.001,0.0
88,categorical_37,0.0002,0.001,0.018,0.6
89,categorical_36,0.0002,0.001,0.025,0.45
90,categorical_88,0.0002,0.001,0.031,0.55
93,categorical_68,0.0002,0.001,0.047,0.5
95,categorical_3,0.0002,0.001,0.121,0.7
98,categorical_13,0.0002,0.002,0.13,0.8
99,categorical_8,0.0002,0.002,0.13,0.85


The index of the resulting DataFrame displays the importance ordering of the variables in relation to the target variable in the synthetic data set. As you can see, the top three categorical variables, while not having particularly high mutual information values, do have much higher values than the rest of their categorical peers.
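
The point about the index can be seen on a small hypothetical results frame: filtering rows with a boolean mask keeps the original index, so each surviving row still shows its overall rank. The frame and its values below are made up for illustration, and the vectorized string accessor is used as an alternative to a list comprehension.

```python
import pandas as pd

# Hypothetical results frame, already sorted by mutual information.
results = pd.DataFrame({
    "Variable": ["continuous_1", "categorical_4", "continuous_0", "categorical_2"],
    "mutual_information": [0.09, 0.05, 0.03, 0.01],
})

# Boolean mask built with the vectorized string accessor.
is_categorical = results["Variable"].str.startswith("categorical")

# .loc keeps the original index labels, i.e. the overall ranking positions.
categorical_only = results.loc[is_categorical]
```

Here `categorical_only` retains index labels 1 and 3, the positions those variables held in the full ranking.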

## Conclusion

The `Nominal2OrdinalEncoder` appears to be a viable method for numerically encoding categorical variables for use in exploratory data analysis and predictive modeling activities.