In [1]:
%load_ext autoreload
%autoreload 2

# FSCA Examples

FSCA stands for 'Forward Selection Component Analysis'.

This algorithm is based on the following: 
- Find the smallest subset of the variables in a data set that 'explains' as much of the total observed variability as possible.
- A variable explains variability well if its value informs us about the values of the other variables in the data set.
- The best variable (or subset of variables) is the one that best predicts the values of the other variables.
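
The idea can be sketched with NumPy. This is an illustration under stated assumptions, not the `FSCA` implementation: a candidate subset is scored by the sum of R² values obtained when every standardized variable is regressed on the subset, and the variable that raises this score the most is added at each step. The helper names `total_r2` and `forward_select` and the synthetic data are hypothetical.

```python
import numpy as np

def total_r2(Xs, subset):
    """Sum of R^2 over all standardized columns when regressed on `subset`."""
    Z = Xs[:, subset]
    coef, *_ = np.linalg.lstsq(Z, Xs, rcond=None)
    resid = Xs - Z @ coef
    # Columns of Xs have unit variance, so R^2 = 1 - residual variance.
    return float(np.sum(1.0 - resid.var(axis=0)))

def forward_select(X, k):
    """Greedily add the column that most increases the total R^2."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    selected, history = [], []
    for _ in range(k):
        candidates = [j for j in range(Xs.shape[1]) if j not in selected]
        scores = [total_r2(Xs, selected + [j]) for j in candidates]
        best = int(np.argmax(scores))
        selected.append(candidates[best])
        history.append(scores[best])
    return selected, history

# Synthetic data (assumed uniform noise): six independent columns, three derived.
rng = np.random.default_rng(0)
x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, 2000))
x0, x1 = x2 + x3, x4 + x5
X = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, x0 + x1])
names = ["x_0", "x_1", "x_2", "x_3", "x_4", "x_5", "x_6", "x_7", "target"]

selected, history = forward_select(X, k=2)
print([names[j] for j in selected], np.round(history, 2))
```

Because `target` is correlated with six other columns, it is the strongest single predictor of the rest and is picked first in this sketch.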

### Methods
The `FSCA` class offers three execution methods via the `method` parameter:
- `pca` - Traditional Principal Component Analysis where all components are retained
- `ordered` - Ordered FSCA where the greedy, strictly forward selection algorithm is used to select the variables
- `refined` - Refined FSCA where all variables are re-evaluated for performance at each forward selection step and optimal variables are retained.

The `refined` method is the preferred method over `ordered` as it:
- allows for continual refinement of the set of selected variables
- tests the set of currently selected variables at each step
- removes underperforming variables and replaces them with better-performing candidate variables.

While the `refined` method sacrifices the strict ordering of the variables determined during the forward selection process, it produces a more optimal final subset of variables which is often preferred during feature selection activities.
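
A refinement pass of this kind can be sketched as a swap search: each selected variable is tentatively exchanged for each unselected one, and the exchange is kept when the score improves. The code below is a minimal sketch, not the `FSCA` implementation. The scoring function (sum of R² over all standardized variables), the helper names `total_r2` and `refine`, the tolerance `tol`, and the synthetic data are all assumptions.

```python
import numpy as np

def total_r2(Xs, subset):
    """Sum of R^2 over all standardized columns when regressed on `subset`."""
    Z = Xs[:, subset]
    coef, *_ = np.linalg.lstsq(Z, Xs, rcond=None)
    resid = Xs - Z @ coef
    return float(np.sum(1.0 - resid.var(axis=0)))

def refine(Xs, selected, tol=1e-9):
    """Swap a selected column for an unselected one whenever the score rises."""
    selected = list(selected)
    best = total_r2(Xs, selected)
    improved = True
    while improved:
        improved = False
        for pos in range(len(selected)):
            for cand in range(Xs.shape[1]):
                if cand in selected:
                    continue
                trial = selected.copy()
                trial[pos] = cand
                score = total_r2(Xs, trial)
                if score > best + tol:  # ignore gains at floating-point noise level
                    selected, best, improved = trial, score, True
    return selected, best

# Synthetic data (assumed uniform noise): six independent columns, three derived.
rng = np.random.default_rng(0)
x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, 2000))
x0, x1 = x2 + x3, x4 + x5
X = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, x0 + x1])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

start = [6, 7]  # deliberately poor start: the two pure-noise columns
start_score = total_r2(Xs, start)
refined, refined_score = refine(Xs, start)
print(start, round(start_score, 2), "->", refined, round(refined_score, 2))
```

The tolerance is a design choice of this sketch: without it, two subsets spanning the same space could trade places on rounding error alone.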

The `pca` method is included so that additional operations, such as clustering, can be performed on the features. Clustering as implemented in this class helps the user identify highly collinear variables in the data set.

### Number of Components
The `FSCA` class also allows two methods for selecting the number of components/features:
* the user can manually select the number of components/features
* the number of components/features is automatically selected using either Horn's or Minka's method
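
For orientation, Horn's method (parallel analysis) keeps the leading components whose eigenvalues exceed those expected from pure noise of the same shape. The sketch below is a generic NumPy version under stated assumptions. The function name `horn_n_components`, the 95th-percentile threshold, the number of simulations, and the synthetic data are hypothetical choices, and how `FSCA` implements or exposes this option is not shown here. Minka's method is not sketched.

```python
import numpy as np

def horn_n_components(X, n_iter=50, percentile=95, seed=0):
    """Count leading eigenvalues that beat the noise eigenvalues (parallel analysis)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.standard_normal((n, p))
        random_eigs[i] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    n_keep = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:  # stop at the first component that noise could explain
            break
        n_keep += 1
    return n_keep

# Synthetic data (assumed uniform noise): six independent columns, three derived.
rng = np.random.default_rng(1)
x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, 2000))
x0, x1 = x2 + x3, x4 + x5
X = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, x0 + x1])

n_keep = horn_n_components(X)
print(n_keep)
```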

### Scikit-learn Compatibility
In addition to the base `FSCA` class, the `FSCASelector` and `FSCATransformer` classes extend it for use within the `scikit-learn` ecosystem. Both extended classes are `Mixin` classes that can be used within a scikit-learn `Pipeline`, either to select a subset of variables or to transform the data set into its subset of principal components determined during the forward selection/refinement process.

`FSCA` takes an unsupervised approach to feature selection.  
At no time is a target variable used in the selection process.  
The driving principle behind feature selection within `FSCA` is to maximize the explained variance of the entire dataset with a subset of the fewest number of variables. Therefore, the major caveat with `FSCA` is that there is no guarantee that the selected subset of variables (or transformed components) will, in any way, be related to the target variable.

Please consult the following references for more information about the algorithm and its potential uses:
* [Modern Data Mining Algorithms in C++ and CUDA C](https://www.amazon.com/gp/product/B089R5SYVS/)
* [Forward Selection Component Analysis: Algorithms and Applications](https://doi.org/10.1109/TPAMI.2017.2648792)
* [VarScreen user's guide](http://www.timothymasters.info/varscreen.html)

## Example 1: Ordered selection

The data set being examined will be generated by the `make_masters_sample_regression_data` function.  
The generated target variable, `y`, will be added to the data set for demonstration purposes.  
A total of 9 variables will be present in the data set, with the following characteristics:
* `target = x_0 + x_1`
* `x_0 = x_2 + x_3`
* `x_1 = x_4 + x_5`
* all other variables are random
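
For readers without `narrowgate` installed, the structure described above can be imitated with plain NumPy. This is only a sketch: the helper name `make_structured_data` and the uniform(0, 1) distribution are assumptions (the latter guessed from the value ranges in the sample rows), not the actual implementation of `make_masters_sample_regression_data`.

```python
import numpy as np

def make_structured_data(n_obs=10000, seed=0):
    """Hypothetical stand-in: six independent columns plus three derived ones."""
    rng = np.random.default_rng(seed)
    x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, n_obs))
    x0 = x2 + x3      # x_0 = x_2 + x_3
    x1 = x4 + x5      # x_1 = x_4 + x_5
    target = x0 + x1  # target = x_0 + x_1
    names = ["x_0", "x_1", "x_2", "x_3", "x_4", "x_5", "x_6", "x_7", "target"]
    data = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, target])
    return names, data

names, data = make_structured_data()
print(data.shape)  # (10000, 9)
```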

In [2]:
import pandas as pd

from narrowgate.utils.test import make_masters_sample_regression_data

In [3]:
X, y = make_masters_sample_regression_data(n_obs=10000, n_cols=8)

In [4]:
X = pd.concat([X,y], axis=1)

In [5]:
X.head()

| | x_0 | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.700071 | 1.658846 | 0.377136 | 0.322935 | 0.686614 | 0.972232 | 0.96683 | 0.671298 | 2.358917 |
| 1 | 0.859634 | 1.514251 | 0.317078 | 0.542556 | 0.601223 | 0.913027 | 0.408457 | 0.550916 | 2.373884 |
| 2 | 0.273725 | 1.512612 | 0.040542 | 0.233183 | 0.607784 | 0.904828 | 0.008684 | 0.459207 | 1.786337 |
| 3 | 0.227864 | 0.476151 | 0.161198 | 0.066666 | 0.149306 | 0.326845 | 0.365441 | 0.591441 | 0.704014 |
| 4 | 1.199534 | 1.418188 | 0.663864 | 0.53567 | 0.987275 | 0.430913 | 0.298618 | 0.748917 | 2.617722 |


Run the `FSCA` algorithm using the `ordered` method

In [6]:
from narrowgate.screeners import FSCA

In [7]:
fsca_ordered = FSCA(n_components = len(X.columns), method='ordered', standardize=True)

In [8]:
fsca_ordered.fit(X)

Commencing stepwise construction with target
 - Added x_0 for criterion=4.98918936245303
 - Added x_4 for criterion=5.99914441359842
 - Added x_2 for criterion=7.000178841459978
 - Added x_7 for criterion=8.000191632059934
 - Added x_6 for criterion=8.999995312303337


FSCA(n_components=6)

There are several attributes available for examination once `fit` completes:
* `n_unique_eigval` - Number of unique (non-redundant) sources of variation in the data set
* `eigvals[:n_unique_eigval]` - Eigenvalues
* `eigval_cum[:n_unique_eigval]` - Cumulative eigenvalue percent
* `pc_factors` - Principal component factor structure
* `kept_column_names` - the subset of variables selected

In [9]:
# Number of unique (non-redundant) sources of variation in the data set
fsca_ordered.n_unique_eigval

6

Even though there are 9 variables, 3 of the variables are combinations of the other variables.  
Therefore, 6 unique sources of variation in the data is reasonable.
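
The same count can be reproduced independently by counting the non-zero eigenvalues of the correlation matrix. The sketch below uses synthetic data with the same structure. The uniform noise and the `1e-8` threshold are assumptions, and how `FSCA` decides which eigenvalues are redundant is not shown here.

```python
import numpy as np

# Synthetic data (assumed uniform noise): six independent columns, three derived.
rng = np.random.default_rng(0)
x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, 10000))
x0, x1 = x2 + x3, x4 + x5
data = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, x0 + x1])

# Exact linear combinations contribute eigenvalues that are zero up to rounding.
eigvals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
n_unique = int(np.sum(eigvals > 1e-8))
print(n_unique)  # 6
```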

In [11]:
# Eigenvalues
fsca_ordered.eigvals[:fsca_ordered.n_unique_eigval]

array([2.99337567, 1.99597909, 1.02028378, 1.0093568 , 0.99856091,
       0.98243907])

In [12]:
# Cumulative eigenvalue percent
fsca_ordered.eigval_cum[:fsca_ordered.n_unique_eigval]

array([ 33.25974694,  55.43730394,  66.77379628,  77.98887767,
        89.08400469, 100.        ])
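
As a worked check, the cumulative percentages can be recomputed from the six printed eigenvalues by taking the running sum and dividing by the total. That the normalization uses the sum of the unique eigenvalues is an inference from the printed numbers, not something stated by the library.

```python
import numpy as np

# Eigenvalues copied from the printed output.
eigvals = np.array([2.99337567, 1.99597909, 1.02028378,
                    1.0093568, 0.99856091, 0.98243907])

# Running share of the total variance, expressed as a percentage.
eigval_cum = 100.0 * np.cumsum(eigvals) / eigvals.sum()
print(np.round(eigval_cum, 4))
```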

In [13]:
fsca_ordered.pc_factors

| | PC0 | PC1 | PC2 | PC3 | PC4 | PC5 |
|---|---|---|---|---|---|---|
| x_0 | 0.712142 | -0.701995 | -0.000938 | 0.000648 | 0.003236 | 0.006724 |
| x_1 | 0.70155 | 0.712562 | 0.004637 | 0.001532 | 0.002461 | -0.007246 |
| x_2 | 0.510745 | -0.48386 | -0.434364 | -0.043885 | -0.374712 | 0.417144 |
| x_3 | 0.495778 | -0.508249 | 0.431446 | 0.044638 | 0.377902 | -0.406127 |
| x_4 | 0.495195 | 0.503246 | -0.53273 | -0.154526 | 0.377959 | -0.225814 |
| x_5 | 0.492077 | 0.499521 | 0.542586 | 0.157649 | -0.376833 | 0.216988 |
| x_6 | -0.003232 | 0.010372 | 0.255255 | -0.682598 | 0.405534 | 0.551659 |
| x_7 | -0.008642 | 0.009858 | -0.045613 | 0.700544 | 0.515746 | 0.490906 |
| target | 0.999978 | 0.004241 | 0.002604 | 0.00154 | 0.004032 | -0.000338 |


The following interpretations can be made given the previous three cells:
- The first eigenvector/component (PC0) accounts for over 33% of the total variance in the data set (`eigval_cum[0]`)
- The first eigenvector/component (PC0) is very highly correlated with the target variable. 
- The second eigenvector/component (PC1) is just the contrast between `x_0` and `x_1` and their 'constituent' variables. These two variables comprise the target variable.
- Together, the first 2 components comprise over 55% of the total variance in the data set (`eigval_cum[1]`)
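
A factor structure of this kind is conventionally obtained by scaling each eigenvector of the correlation matrix by the square root of its eigenvalue, so that every entry is the correlation between a variable and a component. The sketch below applies that convention to synthetic data with the same structure. Whether `FSCA` computes `pc_factors` in exactly this way is an assumption, as is the uniform noise.

```python
import numpy as np

# Synthetic data (assumed uniform noise): six independent columns, three derived.
rng = np.random.default_rng(0)
x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, 2000))
x0, x1 = x2 + x3, x4 + x5
data = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, x0 + x1])

corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 6  # number of non-redundant components
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])

# Squared loadings in each column add up to that component's eigenvalue.
print(np.round((loadings ** 2).sum(axis=0), 4))
print(np.round(eigvals[:k], 4))
```

The sign of each column is arbitrary, so only the magnitude of a loading is meaningful when comparing runs.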

The mean squared correlation of each variable with all the others is also calculated during `fit`:

In [14]:
fsca_ordered.corr

x_0       0.187718
x_1       0.185849
x_2       0.094220
x_3       0.093492
x_4       0.093241
x_5       0.092046
x_6       0.000029
x_7       0.000033
target    0.249163
Name: mean_sq_corr, dtype: float64
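
These figures can be approximated from a plain correlation matrix. The sketch below squares the correlations and averages each row over the other eight variables. Excluding the diagonal is an assumption, chosen because it matches the value of about 0.25 reported for `target`. The synthetic uniform data is also an assumption.

```python
import numpy as np

# Synthetic data (assumed uniform noise): six independent columns, three derived.
rng = np.random.default_rng(0)
x2, x3, x4, x5, x6, x7 = rng.uniform(size=(6, 10000))
x0, x1 = x2 + x3, x4 + x5
data = np.column_stack([x0, x1, x2, x3, x4, x5, x6, x7, x0 + x1])
names = ["x_0", "x_1", "x_2", "x_3", "x_4", "x_5", "x_6", "x_7", "target"]

corr = np.corrcoef(data, rowvar=False)
p = corr.shape[0]
# Subtract the self-correlation (always 1) before averaging over the others.
mean_sq_corr = ((corr ** 2).sum(axis=1) - 1.0) / (p - 1)

for name, value in zip(names, mean_sq_corr):
    print(f"{name:<8}{value:.6f}")
```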

In [15]:
fsca_ordered.kept_column_names

['target', 'x_0', 'x_4', 'x_2', 'x_7', 'x_6']

These column names correspond to the results present in the VarScreen user guide example:
* `target` is selected
* one of the two variables which comprise the target is selected (`x_0`)
* a single constituent of each of `x_0` and `x_1` is selected (`x_2` -> `x_0` and `x_4` -> `x_1`)
* the two random variables having nothing to do with the target are selected (`x_6` and `x_7`)

Finally, we observe the `weights` needed in order to transform the selected variables into the components.  
Notice the first component is almost entirely the variable `target`

In [16]:
fsca_ordered.weights

| | FSCA_ORD_C0 | FSCA_ORD_C1 | FSCA_ORD_C2 | FSCA_ORD_C3 | FSCA_ORD_C4 | FSCA_ORD_C5 |
|---|---|---|---|---|---|---|
| target | 0.99985 | -5e-05 | -3.568858e-05 | -3.584938e-05 | 5.704381e-07 | 2.124531e-07 |
| x_0 | 0.709017 | 0.704979 | -6.080654e-07 | -4.999462e-05 | 7.554854e-07 | 5.241246e-07 |
| x_4 | 0.497218 | -0.500909 | 0.708318 | -8.308572e-07 | -7.879571e-08 | -1.818968e-07 |
| x_2 | 0.505705 | 0.49159 | 0.009200891 | 0.7087064 | 8.498469e-07 | 9.422453e-07 |
| x_7 | -0.005725 | -0.005029 | 0.001599923 | -0.004566615 | 0.9999082 | 9.735426e-07 |
| x_6 | -0.002125 | -0.005297 | 0.0003134868 | -0.00810782 | -0.009812165 | 0.9998518 |


## Example 2: Refined selection

Using the exact same data set, perform selection with backward refinement.

In [17]:
fsca_refined = FSCA(n_components = len(X.columns), method='refined', standardize=True)

In [18]:
fsca_refined.fit(X)

Commencing stepwise construction with target
 - Added x_0 for criterion=4.98918936245303
   - no refinement needed
 - Added x_4 for criterion=5.99914441359842
   - Replaced x_0 with x_5 to get criterion = 5.999144413609411
 - Added x_2 for criterion=7.00017884145514
   - Replaced target with x_3 to get criterion = 7.000178841475345
 - Added x_7 for criterion=8.00019163206343
   - Replaced x_2 with target to get criterion = 8.000191632093797
 - Added x_6 for criterion=8.999995312303328
   - Replaced target with x_2 to get criterion = 8.999995312303405


FSCA(method='refined', n_components=6)

In [19]:
fsca_refined.kept_column_names

['x_3', 'x_5', 'x_4', 'x_2', 'x_7', 'x_6']

The set of selected variables differs from the one chosen by the `ordered` method, yet it is more intuitive: only the independent, randomly generated variables have been selected, and none of the derived ones.  
All explained variance within the data set should be represented by these variables rather than a conglomerate of derived variables.

Note that the refined eigenvalues are all roughly equal.  
This is reasonable since the final selected set of variables were all randomly generated and independent of one another.

In [20]:
fsca_refined.eigvals_refined

array([1.02177928, 1.01014834, 1.0053963 , 0.99933155, 0.98696931,
       0.97637055])

In [21]:
fsca_refined.eigval_cum_refined

array([17.02965462, 33.86546022, 50.6220653 , 67.27759118, 83.72707971,
       99.99992227])

The principal component factor structure below shows the correlation of each variable to the component in the given column. 

In [22]:
fsca_refined.component_correlations_refined

| | FSCA_REF_PC0 | FSCA_REF_PC1 | FSCA_REF_PC2 | FSCA_REF_PC3 | FSCA_REF_PC4 | FSCA_REF_PC5 |
|---|---|---|---|---|---|---|
| x_3 | 0.431174 | 0.112738 | 0.363384 | 0.611956 | -0.484943 | 0.244276 |
| x_5 | 0.347606 | 0.057338 | 0.284203 | -0.790015 | -0.371138 | 0.182332 |
| x_4 | -0.651381 | 0.115078 | -0.256392 | -0.014727 | -0.475428 | 0.520071 |
| x_2 | -0.49349 | 0.229833 | 0.604741 | -0.007041 | -0.211284 | -0.541517 |
| x_7 | 0.007102 | -0.771407 | -0.206122 | 0.018011 | -0.481825 | -0.360436 |
| x_6 | 0.21716 | 0.577076 | -0.564484 | -0.011239 | -0.333509 | -0.43569 |


The selected variables can be transformed into the corresponding components by producing a `weights` matrix.  
This `weights` matrix is produced by dividing the above correlations for each component by the eigenvalue corresponding to the respective component (`eigvals_refined`).  
The `weights` matrix is calculated during `fit` and available as a class member.

In [23]:
fsca_refined.weights

| | FSCA_REF_PC0 | FSCA_REF_PC1 | FSCA_REF_PC2 | FSCA_REF_PC3 | FSCA_REF_PC4 | FSCA_REF_PC5 |
|---|---|---|---|---|---|---|
| x_3 | 0.421983 | 0.111605 | 0.361434 | 0.612365 | -0.491346 | 0.250188 |
| x_5 | 0.340197 | 0.056762 | 0.282678 | -0.790544 | -0.376038 | 0.186745 |
| x_4 | -0.637497 | 0.113922 | -0.255016 | -0.014737 | -0.481705 | 0.532657 |
| x_2 | -0.482972 | 0.227524 | 0.601495 | -0.007046 | -0.214073 | -0.554622 |
| x_7 | 0.006951 | -0.763657 | -0.205015 | 0.018023 | -0.488186 | -0.369159 |
| x_6 | 0.212532 | 0.571279 | -0.561455 | -0.011247 | -0.337912 | -0.446235 |
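
As a worked check of the relationship between the two matrices, the `x_3` row of the weights can be recomputed from the printed `x_3` correlations and the printed `eigvals_refined`. The snippet uses only numbers shown in this notebook, so agreement is limited by the six-decimal rounding of the printed values.

```python
import numpy as np

# Values copied from the printed outputs (row `x_3` of the correlations).
corr_x3 = np.array([0.431174, 0.112738, 0.363384, 0.611956, -0.484943, 0.244276])
eigvals_refined = np.array([1.02177928, 1.01014834, 1.0053963,
                            0.99933155, 0.98696931, 0.97637055])

# Divide each component's correlation by that component's eigenvalue.
weights_x3 = corr_x3 / eigvals_refined
print(np.round(weights_x3, 6))
```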


## Example 3: FSCASelector

Through `FSCASelector`, the `FSCA` base class can be used as a feature selector within the scikit-learn ecosystem.

In [24]:
from narrowgate.screeners.fsca import FSCASelector

In [25]:
fsca_select = FSCASelector(n_components = len(X.columns), method='refined', standardize=True)
fsca_select.fit(X)

Commencing stepwise construction with target
 - Added x_0 for criterion=4.98918936245303
   - no refinement needed
 - Added x_4 for criterion=5.99914441359842
   - Replaced x_0 with x_5 to get criterion = 5.999144413609411
 - Added x_2 for criterion=7.00017884145514
   - Replaced target with x_3 to get criterion = 7.000178841475345
 - Added x_7 for criterion=8.00019163206343
   - Replaced x_2 with target to get criterion = 8.000191632093797
 - Added x_6 for criterion=8.999995312303328
   - Replaced target with x_2 to get criterion = 8.999995312303405


FSCASelector()

In [26]:
X_selected = fsca_select.transform(X)

In [27]:
X_selected.head()

| | x_3 | x_5 | x_4 | x_2 | x_7 | x_6 |
|---|---|---|---|---|---|---|
| 0 | 0.322935 | 0.972232 | 0.686614 | 0.377136 | 0.671298 | 0.96683 |
| 1 | 0.542556 | 0.913027 | 0.601223 | 0.317078 | 0.550916 | 0.408457 |
| 2 | 0.233183 | 0.904828 | 0.607784 | 0.040542 | 0.459207 | 0.008684 |
| 3 | 0.066666 | 0.326845 | 0.149306 | 0.161198 | 0.591441 | 0.365441 |
| 4 | 0.53567 | 0.430913 | 0.987275 | 0.663864 | 0.748917 | 0.298618 |


In [28]:
X.columns

Index(['x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'target'], dtype='object')

In [29]:
X_selected.columns

Index(['x_3', 'x_5', 'x_4', 'x_2', 'x_7', 'x_6'], dtype='object')

## Example 4: FSCATransformer

Through `FSCATransformer`, the `FSCA` base class can be used as a feature transformer within the scikit-learn ecosystem, similar to `PCA` or `KernelPCA`.

In [30]:
from narrowgate.screeners.fsca.transformer import FSCATransformer

In [31]:
fsca_transformer = FSCATransformer(n_components = len(X.columns), method='refined', standardize=True)
X_transform = fsca_transformer.fit_transform(X)

Commencing stepwise construction with target
 - Added x_0 for criterion=4.98918936245303
   - no refinement needed
 - Added x_4 for criterion=5.99914441359842
   - Replaced x_0 with x_5 to get criterion = 5.999144413609411
 - Added x_2 for criterion=7.00017884145514
   - Replaced target with x_3 to get criterion = 7.000178841475345
 - Added x_7 for criterion=8.00019163206343
   - Replaced x_2 with target to get criterion = 8.000191632093797
 - Added x_6 for criterion=8.999995312303328
   - Replaced target with x_2 to get criterion = 8.999995312303405


In [32]:
X_transform.head()

| | FSCA_REF_PC0 | FSCA_REF_PC1 | FSCA_REF_PC2 | FSCA_REF_PC3 | FSCA_REF_PC4 | FSCA_REF_PC5 |
|---|---|---|---|---|---|---|
| 0 | 0.434852 | 0.468056 | -1.207275 | -1.674851 | -1.372146 | -0.223502 |
| 1 | 0.555851 | -0.331491 | 0.138235 | -1.028446 | -0.616981 | 0.91027 |
| 2 | 0.245196 | -1.219311 | 0.011825 | -1.642865 | 0.739478 | 1.921474 |
| 3 | 0.39483 | -1.124441 | -0.901812 | -0.392992 | 1.803166 | -0.397314 |
| 4 | -1.52941 | -0.751453 | 0.111797 | 0.269276 | -1.082696 | 0.551341 |


## Example 5: Variable Clustering

Coming soon!