# Demo
> **The purpose of this notebook is to demonstrate**
> 1. How to conduct a feature space decomposition
> 2. How to visualize a decomposition
> 3. How to qualitatively validate that the algorithm performs as expected

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

from decomposition.var_clus import VarClus

### Demo 1: Instantiate the VarClus class

> **The parameters used in the constructor are**  

> **`n_split`**: Number of sub-clusters a cluster is divided into each time it is split. Default 2  
**`max_eigenvalue`**: Eigenvalue threshold below which the decomposition stops. Note that the dataframe is scaled during the process, so every feature has variance 1. Default 1   
**`max_tries`**: Maximum number of tries before the algorithm gives up. Default 3
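
To get a feel for the scale of `max_eigenvalue`, the sketch below shows how eigenvalues behave once every feature has variance 1. It uses only NumPy on made-up toy data and is not taken from `VarClus`; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 strongly correlated columns plus 2 independent ones.
base = rng.normal(size=(1000, 1))
correlated = base + 0.1 * rng.normal(size=(1000, 3))
independent = rng.normal(size=(1000, 2))
X = np.hstack([correlated, independent])

# Scale every feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(X_scaled, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

print(eigenvalues.round(3))
print(eigenvalues.sum())  # equals the number of features (5), up to rounding
```

Because the eigenvalues always sum to the number of features, an eigenvalue well above 1 indicates a group of correlated features sharing one direction, which is why a threshold around 1 is a natural reference point.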

> Besides the aforementioned parameters, a key property is called Cluster. A Cluster object holds the following information
1. What features are in the cluster
2. What the parent clusters are, if any
3. What the child clusters are, if any
4. The dataframe
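
As a rough mental model of that information, here is a minimal stand-in written with a dataclass. `ClusterSketch` and `add_child` are hypothetical names invented for this illustration; the real `Cluster` class in `decomposition.var_clus` may be organized differently. The subset check reflects the hierarchical structure of the decomposition, where a child cluster only contains features of its parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ClusterSketch:
    """Hypothetical stand-in for a Cluster; not the library's class."""
    name: str
    features: List[str]
    parents: List["ClusterSketch"] = field(default_factory=list)
    children: List["ClusterSketch"] = field(default_factory=list)
    dataframe: Optional[object] = None  # would hold a pandas DataFrame

    def add_child(self, child: "ClusterSketch") -> None:
        # A child's features must be a subset of its parent's features.
        if not set(child.features) <= set(self.features):
            raise ValueError("child features must be a subset of the parent's")
        child.parents.append(self)
        self.children.append(child)


root = ClusterSketch("cluster-0", ["a", "b", "c"])
left = ClusterSketch("cluster-0-0", ["a", "b"])
right = ClusterSketch("cluster-0-1", ["c"])
root.add_child(left)
root.add_child(right)

print([c.name for c in root.children])  # ['cluster-0-0', 'cluster-0-1']
print(left.parents[0].name)             # cluster-0
```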

In [2]:
demo1 = VarClus()

# Larger max_eigenvalue usually results in bigger and fewer child clusters
demo1 = VarClus(max_eigenvalue=5)

### Demo 2: Test on an arbitrary dataset

> Let's create a simple dataset to play with  
> We can leverage `make_classification` to generate it

In [15]:
n_features = 25
n_rows = 1e4

raw_df, _ = make_classification(n_samples=int(n_rows), 
                                n_features=n_features, 
                                n_informative=n_features,
                                n_redundant=0)

columns = ['feature_{}'.format(i) for i in range(n_features)]
demo2_df = pd.DataFrame(raw_df, columns=columns)

demo2 = VarClus(max_eigenvalue=1.35)
demo2.decompose(demo2_df)

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature feature_11
current EV is 3.5880316988365246, new EV is 3.4879502809369054
assessing feature feature_13
current EV is 3.5880316988365246, new EV is 3.599534018288427
Feature feature_13 was re-assigned
child_clusters[i] has 13 features and child_clusters[j] has 12 features
assessing feature feature_14
current EV is 3.599534018288427, new EV is 3.6002317969817415
Feature feature_14 was re-assigned
child_clusters[i] has 12 features and child_clusters[j] has 13 features
assessing feature feature_15
current EV is 3.6002317969817415, new EV is 3.553214100843
assessing feature feature_17
current EV is 3.6002317969817415, new EV is 3.6243734137092245
Feature feature_17 was re-assigned
child_clusters[i] has 11 features and child_clusters[j] has 14 features
assessing feature feature_19
current EV is 3.6243734137092245, new EV is 3.6178614531136417
Number of max tries has been reached. Returning current result...
decomp

<decomposition.var_clus.Cluster at 0x22f2ef089e8>

> **The decomposition has a hierarchical structure, meaning the features of a child cluster are a subset of its parent cluster's features.
The whole algorithm can be described as below**

>> 1. Conduct PCA on the current feature space. If the max eigenvalue is smaller than the threshold,
    stop the decomposition.
2. Calculate the first N principal components and assign features to these components based on
    absolute correlation, from high to low. These components are the initial centroids of
    the child clusters.
3. After the initial assignment, the algorithm conducts an iterative assignment called Nearest
    Component Sorting (NCS). Basically, the centroid vectors are re-computed as the first
    principal components of the child clusters, and the algorithm re-assigns each feature
    based on the same correlation rule.
4. After NCS, the algorithm tries to increase the total variance explained by the first
    principal component of each child cluster by re-assigning features across clusters.
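
The steps above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the `VarClus` implementation: all function names are hypothetical, the search in step 4 is reduced to its objective (the summed largest eigenvalue of each group, which is an assumption about what "total variance explained" means), and edge cases such as empty groups are handled only minimally.

```python
import numpy as np


def largest_eigenvalue(X):
    """Step 1: largest eigenvalue of the feature correlation matrix."""
    return np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[-1]


def first_component_scores(X):
    """Scores of the rows of X on its first principal component."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def assign_by_correlation(X, centroids):
    """Assign each column to the centroid with the highest absolute correlation."""
    corr = np.array([
        [abs(np.corrcoef(X[:, j], c)[0, 1]) for c in centroids]
        for j in range(X.shape[1])
    ])
    return corr.argmax(axis=1)


def ncs_split(X, n_split=2, max_iter=10):
    """Steps 2 and 3: initial assignment, then Nearest Component Sorting."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # unit variance per feature
    # Step 2: the first n_split principal components are the initial centroids.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    centroids = [X @ vt[k] for k in range(n_split)]
    labels = assign_by_correlation(X, centroids)
    # Step 3: recompute each centroid from its own group, then re-assign.
    for _ in range(max_iter):
        centroids = [
            first_component_scores(X[:, labels == k]) if np.any(labels == k)
            else centroids[k]  # keep the old centroid for an empty group
            for k in range(n_split)
        ]
        new_labels = assign_by_correlation(X, centroids)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


def total_explained_variance(X, labels):
    """Step 4 objective (assumed): sum of each group's largest eigenvalue."""
    total = 0.0
    for k in np.unique(labels):
        cols = X[:, labels == k]
        # A single standardized feature contributes a variance of 1.
        total += 1.0 if cols.shape[1] == 1 else largest_eigenvalue(cols)
    return total


# Toy data: four columns driven by one latent factor, two by another.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500, 1))
X = np.hstack([a + 0.1 * rng.normal(size=(500, 4)),
               b + 0.1 * rng.normal(size=(500, 2))])

print(largest_eigenvalue(X))
labels = ncs_split(X)
print(labels)
print(total_explained_variance(X, labels))
```

On this toy data the first four columns should land in one group and the last two in the other. A search phase like step 4 would then try moving individual features between groups and keep a move only if `total_explained_variance` increases.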

In [16]:
# Check out the root_cluster
root_cluster = demo2.cluster

# Direct children of the root_cluster
child_clusters = root_cluster.children

# Direct parents of the root_cluster, if any
parent_clusters = root_cluster.parents

# root_cluster contains the original dataframe
root_cluster.dataframe.shape

(10000, 25)

In [17]:
# Print out the structure of the decomposition
demo2.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |-----feature_2
|     |-----feature_3
|     |-----feature_6
|     |-----feature_8
|     |-----feature_9
|     |-----feature_11
|     |-----feature_15
|     |-----feature_19
|     |-----feature_20
|     |-----feature_22
|     |-----feature_24
|
|-----cluster-0-1
      |
      |-----cluster-0-1-0
      |     |-----feature_1
      |     |-----feature_10
      |     |-----feature_14
      |     |-----feature_17
      |     |-----feature_18
      |     |-----feature_21
      |     |-----feature_23
      |     |-----feature_4
      |
      |-----cluster-0-1-1
            |-----feature_0
            |-----feature_12
            |-----feature_13
            |-----feature_16
            |-----feature_5
            |-----feature_7
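
For readers curious how a layout like the one above can be produced, here is a small recursive renderer over a nested structure. It is a sketch only: the `render` function and the dict/list representation are assumptions made for this example and do not reflect how `print_cluster_structure` is implemented.

```python
def render(node, prefix=""):
    """Return the text lines below a cluster.

    `node` is either a list of feature names (a leaf cluster) or a dict
    mapping child-cluster names to their own nodes.
    """
    if isinstance(node, list):
        return [prefix + "|-----" + feature for feature in node]
    lines = []
    children = list(node.items())
    for i, (name, child) in enumerate(children):
        is_last = i == len(children) - 1
        lines.append(prefix + "|")
        lines.append(prefix + "|-----" + name)
        # Keep the vertical bar only while more siblings follow.
        lines.extend(render(child, prefix + ("      " if is_last else "|     ")))
    return lines


# Hypothetical decomposition with made-up feature names.
tree = {
    "cluster-0-0": ["a", "b"],
    "cluster-0-1": {
        "cluster-0-1-0": ["c"],
        "cluster-0-1-1": ["d"],
    },
}

lines = ["cluster-0"] + render(tree)
print("\n".join(lines))
```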


### Demo 3: Test on a real dataset
> **UCI Machine Learning Repository - Wine Quality Data Set**  
https://archive.ics.uci.edu/ml/datasets/Wine+Quality


* **White wine first**

In [23]:
demo3_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')

In [24]:
# Looks like quality is the dependent variable
demo3_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [50]:
demo3 = VarClus()
demo3.decompose(demo3_df.drop('quality', axis=1))

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature chlorides
current EV is 4.829692811123503, new EV is 4.754858095309795
assessing feature citric acid
current EV is 4.829692811123503, new EV is 4.2179634366191925
assessing feature density
current EV is 4.829692811123503, new EV is 4.632351454755385
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-0
phase #1: NCS
phase #2: Search
assessing feature citric acid
current EV is 3.82295660241381, new EV is 3.7913608187467966
assessing feature density
current EV is 3.82295660241381, new EV is 3.60035764718484
assessing feature fixed acidity
current EV is 3.82295660241381, new EV is 3.251178168514337
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-1
phase #1: NCS
phase #2: Search
assessing feature free sulfur dioxide
Number of features is smaller than n_split, reducing n_split temporarily
current EV is 2.711437596005953, n

ValueError: Found array with 0 feature(s) (shape=(1599, 0)) while a minimum of 1 is required.

In [49]:
demo3.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |
|     |-----cluster-0-0-0
|     |     |-----citric acid
|     |     |-----density
|     |     |-----fixed acidity
|     |     |-----pH
|     |     |-----sulphates
|     |
|     |-----cluster-0-0-1
|           |-----chlorides
|           |-----volatile acidity
|
|-----cluster-0-1
      |-----alcohol
      |-----free sulfur dioxide
      |-----residual sugar
      |-----total sulfur dioxide


* **Red wine**

In [39]:
demo3_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

In [40]:
# Looks like quality is the dependent variable
demo3_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [41]:
demo3 = VarClus(max_eigenvalue=1.25)
demo3.decompose(demo3_df.drop('quality', axis=1))

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature chlorides
current EV is 4.8296928111235005, new EV is 4.754858095309797
assessing feature citric acid
current EV is 4.8296928111235005, new EV is 4.217963436619194
assessing feature density
current EV is 4.8296928111235005, new EV is 4.63235145475538
Number of max tries has been reached. Returning current result...


<decomposition.var_clus.Cluster at 0x22f30166cc0>

In [42]:
# Similar structure
demo3.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |-----chlorides
|     |-----citric acid
|     |-----density
|     |-----fixed acidity
|     |-----pH
|     |-----sulphates
|     |-----volatile acidity
|
|-----cluster-0-1
      |-----alcohol
      |-----free sulfur dioxide
      |-----residual sugar
      |-----total sulfur dioxide
