# Demo
> **The purpose of this notebook is to demonstrate**
> 1. How to conduct a feature space decomposition
> 2. How to visualize a decomposition
> 3. How to qualitatively validate whether the algorithm performs as expected

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

from decomposition.var_clus import VarClus

### Demo 1: Instantiate the VarClus class

> **The parameters used in the constructor are**  

> **`n_split`**: Number of sub-clusters each cluster is split into at every split. Default 2  
**`max_eigenvalue`**: Eigenvalue threshold below which the decomposition stops. Note that the dataframe is scaled during the process, so every feature has variance == 1. Default 1   
**`max_tries`**: Maximum number of tries before the algorithm gives up. Default 3

> Besides the aforementioned parameters, a key property is called Cluster. A Cluster object holds the following information
1. What features are in the cluster
2. What the parent clusters are, if any
3. What the child clusters are, if any
4. The cluster's dataframe
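
> To illustrate the kind of information listed above, here is a minimal sketch of a tree node. It is a hypothetical stand-in, not the library's actual `Cluster` class: the class name `ClusterSketch`, the `add_child` helper, and the example cluster and feature names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(eq=False)
class ClusterSketch:
    """Hypothetical stand-in for a Cluster; not the library's real class."""
    name: str
    features: List[str]                      # 1. features in the cluster
    parents: List["ClusterSketch"] = field(default_factory=list)   # 2. parents, if any
    children: List["ClusterSketch"] = field(default_factory=list)  # 3. children, if any
    dataframe: Optional[Any] = None          # 4. data restricted to `features`

    def add_child(self, child: "ClusterSketch") -> "ClusterSketch":
        """Link a child to this cluster in both directions."""
        child.parents.append(self)
        self.children.append(child)
        return child


# Illustrative names only
root = ClusterSketch("cluster-0", ["feature_0", "feature_1", "feature_2"])
left = root.add_child(ClusterSketch("cluster-0-0", ["feature_0", "feature_1"]))
right = root.add_child(ClusterSketch("cluster-0-1", ["feature_2"]))
```

> In this sketch the features of each child are a subset of the features of its parent, which mirrors the hierarchical structure of the decomposition.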

In [2]:
demo1 = VarClus()

# A larger max_eigenvalue usually results in fewer, larger child clusters
demo1 = VarClus(max_eigenvalue=5)
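
> To make the `max_eigenvalue` threshold more concrete, the sketch below uses plain NumPy on made-up toy data and does not call `VarClus`. It standardizes the features and checks that the PCA eigenvalues then sum to the number of features, so an eigenvalue of 1 corresponds to the variance of a single standardized feature. The toy data and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 rows, 4 features on very different scales.
# Features 0 and 1 are strongly correlated; features 2 and 3 are independent noise.
base = rng.normal(size=1000)
X = np.column_stack([
    base * 10.0,
    base * 0.5 + rng.normal(scale=0.1, size=1000),
    rng.normal(scale=100.0, size=1000),
    rng.normal(size=1000),
])

# Standardize so that every feature has mean 0 and variance 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# The covariance matrix of standardized data is the correlation matrix;
# its eigenvalues are the variances explained by the principal components.
corr = np.cov(X_scaled, rowvar=False, ddof=0)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted from high to low

total_variance = eigenvalues.sum()  # equals the number of features
largest = eigenvalues[0]            # compare this against a threshold
print(eigenvalues, total_variance)
```

> Because the two correlated features load on the same component, the largest eigenvalue here is well above 1, which is the kind of cluster a threshold near 1 would still try to split.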

### Demo 2: Test on an arbitrary dataset

> Let's create a simple dataset to play with  
> We can leverage `make_classification` to generate it

In [3]:
n_features = 25
n_rows = 10_000

# All features are informative; no redundant features are generated
raw_df, _ = make_classification(n_samples=n_rows,
                                n_features=n_features,
                                n_informative=n_features,
                                n_redundant=0)

columns = ['feature_{}'.format(i) for i in range(n_features)]
demo2_df = pd.DataFrame(raw_df, columns=columns)

demo2 = VarClus(max_eigenvalue=1.1)
demo2.decompose(demo2_df)

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature feature_10
current EV is 3.7924900743028194, new EV is 3.793246884573982
Feature feature_10 was re-assigned
child_clusters[i] has 11 features and child_clusters[j] has 14 features
assessing feature feature_14
current EV is 3.793246884573982, new EV is 3.619450861908802
assessing feature feature_15
current EV is 3.793246884573982, new EV is 3.7180395740211196
assessing feature feature_2
current EV is 3.793246884573982, new EV is 3.7555131463259404
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-0
phase #1: NCS
phase #2: Search
assessing feature feature_14
current EV is 2.991324113673583, new EV is 2.8614019129000443
assessing feature feature_15
current EV is 2.991324113673583, new EV is 2.973809988914784
assessing feature feature_2
current EV is 2.991324113673583, new EV is 2.96300695940678
Number of max tries has been reached. Returning current result...
decomp

AttributeError: 'NoneType' object has no attribute 'explained_variance_'

> **The decomposition has a hierarchical structure, meaning the features of a child cluster are a subset of the features of its parent cluster.
The whole algorithm can be described as follows**

>> 1. Conducts PCA on the current feature space. If the max eigenvalue is smaller than the threshold,
    stops the decomposition.
2. Calculates the first N PCA components and assigns features to these components based on
    absolute correlation, from high to low. These components are the initial centroids of
    the child clusters.
3. After the initial assignment, the algorithm conducts an iterative assignment called Nearest
    Component Sorting (NCS). The centroid vectors are re-computed as the first
    components of the child clusters, and the algorithm re-assigns each feature
    based on the same correlation rule.
4. After NCS, the algorithm tries to increase the total variance explained by the first
    PCA component of each child cluster by re-assigning features across clusters.
In [None]:
# Check out the root cluster
root_cluster = demo2.cluster

# Direct children of the root_cluster
child_clusters = root_cluster.children

# Direct parent of the root_cluster, if any
parent_clusters = root_cluster.parents

# root_cluster contains the original dataframe
root_cluster.dataframe.shape

In [None]:
# Print out the structure of the decomposition
demo2.print_cluster_structure()
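
> The hierarchy can also be walked manually through the `children` attribute used above. The sketch below is an assumption-laden illustration: the helper `iter_clusters`, the `name` attribute, and the `SimpleNamespace` stand-in objects are hypothetical and only mimic a cluster tree; the output format of `print_cluster_structure()` may differ.

```python
from types import SimpleNamespace


def iter_clusters(cluster, depth=0):
    """Yield (depth, cluster) pairs in depth-first order, following `children`."""
    yield depth, cluster
    for child in cluster.children or []:
        yield from iter_clusters(child, depth + 1)


# Stand-in tree with hypothetical names; a real run would start from demo2.cluster
leaf_a = SimpleNamespace(name="a", children=[])
leaf_b = SimpleNamespace(name="b", children=[])
mid = SimpleNamespace(name="mid", children=[leaf_a])
root = SimpleNamespace(name="root", children=[mid, leaf_b])

# Indent each cluster name by its depth in the hierarchy
lines = ["  " * depth + c.name for depth, c in iter_clusters(root)]
print("\n".join(lines))
```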