# Demo
> **The purpose of this notebook is to demonstrate**
> 1. How to conduct a feature space decomposition
> 2. How to visualize a decomposition
> 3. How to qualitatively validate that the algorithm performs as expected

**This decomposition only works on numerical values**

### Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

from decomposition.var_clus import VarClus

### Demo 1: Instantiate the VarClus class

> **The parameters used in the constructor are**  

> **`n_split`**: Number of sub-clusters a cluster is divided into each time it is split. Default 2  
**`max_eigenvalue`**: Eigenvalue threshold below which the decomposition stops. Note that the dataframe is scaled during the process so that every feature has variance 1. Default 1  
**`max_tries`**: Maximum number of tries before the algorithm gives up. Default 3
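
To make the scaling remark concrete, here is a minimal sketch in plain NumPy on hypothetical toy data (it is not the library's code). Once every feature has unit variance, the eigenvalues of the correlation matrix sum to the number of features, so an eigenvalue threshold is expressed in units of "one feature's worth of variance". Which eigenvalue `VarClus` compares against `max_eigenvalue` internally is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two strongly related columns plus one independent column
x = rng.normal(size=1000)
data = np.column_stack([
    x,
    x + 0.1 * rng.normal(size=1000),
    rng.normal(size=1000),
])

# Scale every feature to zero mean and unit variance
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

# For scaled data the covariance matrix is the correlation matrix
corr = np.corrcoef(scaled, rowvar=False)

# Eigenvalues sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

print(eigenvalues)
print(eigenvalues.sum())  # trace of the correlation matrix, i.e. the number of features
```

The two related columns share one dominant direction, which is why the largest eigenvalue sits well above 1 while the smallest is close to 0.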

> Besides the aforementioned parameters, a key property is the cluster (a `Cluster` object), which holds the following information
1. Which features are in the cluster
2. What the parent clusters are, if any
3. What the child clusters are, if any
4. The dataframe

In [2]:
demo1 = VarClus()

# Larger max_eigenvalue usually results in bigger and fewer child clusters
demo1 = VarClus(max_eigenvalue=5)

### Demo 2: Test on an arbitrary dataset

> Let's create a simple dataset to play with  
> We can leverage `make_classification` to make the dataset

In [3]:
n_features = 25
n_rows = 10_000

# All features are informative, so none is a linear combination of the others
raw_df, _ = make_classification(n_samples=n_rows,
                                n_features=n_features,
                                n_informative=n_features,
                                n_redundant=0)

columns = ['feature_{}'.format(i) for i in range(n_features)]
demo2_df = pd.DataFrame(raw_df, columns=columns)

demo2 = VarClus(max_eigenvalue=1.35)
demo2.decompose(demo2_df)

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature feature_1
current EV is 3.980527640878324, new EV is 3.853590719855645
assessing feature feature_10
current EV is 3.980527640878324, new EV is 3.8500825653102986
assessing feature feature_11
current EV is 3.980527640878324, new EV is 3.988622736094064
Feature feature_11 was re-assigned
cluster-0-0 has 14 features and name_1 has 11 features
assessing feature feature_12
current EV is 3.988622736094064, new EV is 3.9279310137223034
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-0
phase #1: NCS
phase #2: Search
assessing feature feature_1
current EV is 3.132842884446023, new EV is 3.1518142315232285
Feature feature_1 was re-assigned
cluster-0-0-0 has 7 features and name_1 has 7 features
assessing feature feature_10
current EV is 3.1518142315232285, new EV is 3.087154182417183
assessing feature feature_12
current EV is 3.1518142315232285, new EV is 3.16748623418848

<decomposition.var_clus.Cluster at 0x26457a979b0>

> **The decomposition has a hierarchical structure, meaning the features of a child cluster are a subset of those of its parent cluster.
The whole algorithm can be described as follows**

>> 1. Conduct PCA on the current feature space. If the max eigenvalue is smaller than the threshold,
    stop the decomposition.
2. Calculate the first N PCA components and assign each feature to one of these components based on
    absolute correlation, from high to low. These components are the initial centroids of
    the child clusters.
3. After the initial assignment, conduct an iterative re-assignment called Nearest
    Component Sorting (NCS): the centroid vectors are re-computed as the first
    components of the child clusters, and each feature is re-assigned
    based on the same correlation rule.
4. After NCS, try to increase the total variance explained by the first
    PCA component of each child cluster by re-assigning features across clusters.
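
Steps 2 and 3 can be sketched in a few lines of NumPy. This is an illustrative approximation on hypothetical toy data, not the code in `decomposition.var_clus`; the helper names (`first_component`, `abs_corr`, `ncs_split`, `total_first_eigenvalue`) and the `max_iter` cap are assumptions made for the example. The last helper computes the quantity that step 4 tries to increase.

```python
import numpy as np

def first_component(block):
    """Largest eigenvalue and first-component scores of a standardized block."""
    corr = np.atleast_2d(np.corrcoef(block, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    return eigvals[-1], block @ eigvecs[:, -1]

def abs_corr(features, components):
    """Absolute correlation of every feature column with every component column."""
    p = features.shape[1]
    full = np.corrcoef(features, components, rowvar=False)
    return np.abs(full[:p, p:])

def ncs_split(scaled, n_split=2, max_iter=20):
    # Step 2: the first n_split components of the whole block are the initial centroids
    _, eigvecs = np.linalg.eigh(np.corrcoef(scaled, rowvar=False))
    components = scaled @ eigvecs[:, ::-1][:, :n_split]
    labels = abs_corr(scaled, components).argmax(axis=1)

    # Step 3: recompute each centroid as its cluster's first component, then re-assign
    for _ in range(max_iter):
        centroids = []
        for k in range(n_split):
            members = scaled[:, labels == k]
            if members.shape[1] == 0:
                centroids.append(components[:, k])  # keep the old centroid if a cluster is empty
            else:
                centroids.append(first_component(members)[1])
        components = np.column_stack(centroids)
        new_labels = abs_corr(scaled, components).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def total_first_eigenvalue(scaled, labels):
    """Sum over child clusters of the variance explained by each first component."""
    return sum(first_component(scaled[:, labels == k])[0] for k in np.unique(labels))

# Hypothetical toy data: four features driven by f1, two features driven by f2
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 2000))
data = np.column_stack([f1, f1, f1, f1, f2, f2]) + 0.3 * rng.normal(size=(2000, 6))
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

labels = ncs_split(scaled)
print(labels)
print(total_first_eigenvalue(scaled, labels))
```

The search phase in step 4 would then move one feature at a time to another cluster and keep the move only if `total_first_eigenvalue` goes up, which matches the "current EV / new EV" lines in the log output.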

In [4]:
# Check out the root cluster
root_cluster = demo2.cluster

# Direct children of the root_cluster
child_clusters = root_cluster.children

# Direct parent of the root_cluster, if any
parent_clusters = root_cluster.parents

# root_cluster contains the original dataframe
root_cluster.dataframe.shape

(10000, 25)

In [5]:
# Print out the structure of the decomposition
demo2.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |
|     |-----cluster-0-0-0
|     |     |-----feature_10
|     |     |-----feature_13
|     |     |-----feature_15
|     |     |-----feature_19
|     |     |-----feature_21
|     |     |-----feature_22
|     |
|     |-----cluster-0-0-1
|           |-----feature_3
|           |-----feature_4
|           |-----feature_9
|           |-----feature_16
|           |-----feature_18
|           |-----feature_8
|           |-----feature_1
|           |-----feature_12
|
|-----cluster-0-1
      |-----feature_0
      |-----feature_2
      |-----feature_5
      |-----feature_6
      |-----feature_7
      |-----feature_14
      |-----feature_17
      |-----feature_20
      |-----feature_23
      |-----feature_24
      |-----feature_11
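
The same structure can also be collected programmatically by walking the tree. The sketch below relies on the `children` and `dataframe` attributes used earlier and assumes, hypothetically, that each cluster also exposes a `name` attribute and that a cluster's `dataframe` contains only that cluster's features. Stand-in objects are used so the example runs without the library.

```python
from types import SimpleNamespace

import pandas as pd

def leaf_features(cluster):
    """Map each leaf cluster's name to the column names of its dataframe."""
    children = getattr(cluster, "children", None) or []
    if not children:
        return {cluster.name: list(cluster.dataframe.columns)}
    leaves = {}
    for child in children:
        leaves.update(leaf_features(child))
    return leaves

def make_cluster(name, columns, children=()):
    # Stand-in for a Cluster object; `name` is a hypothetical attribute
    return SimpleNamespace(name=name,
                           dataframe=pd.DataFrame(columns=columns),
                           children=list(children))

leaf_a = make_cluster("cluster-0-0", ["a", "b"])
leaf_b = make_cluster("cluster-0-1", ["c"])
root = make_cluster("cluster-0", ["a", "b", "c"], [leaf_a, leaf_b])

print(leaf_features(root))
```

If the assumptions hold for the real class, `leaf_features(demo2.cluster)` would return the leaf groupings shown in the printed tree.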


### Demo 3: Test on a real dataset
> **UCI Machine Learning Repository - Wine Quality Data Set**  
https://archive.ics.uci.edu/ml/datasets/Wine+Quality


* **White wine first**

In [6]:
demo3_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')

In [7]:
# Looks like quality is the dependent variable
demo3_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [8]:
demo3 = VarClus()
demo3.decompose(demo3_df.drop('quality', axis=1))

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature alcohol
current EV is 4.723934521410765, new EV is 4.245067665175402
assessing feature chlorides
current EV is 4.723934521410765, new EV is 4.643244380951357
assessing feature density
current EV is 4.723934521410765, new EV is 4.079544622431163
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-0
phase #1: NCS
phase #2: Search
assessing feature alcohol
current EV is 3.8595240255765804, new EV is 3.691734857563689
assessing feature density
current EV is 3.8595240255765804, new EV is 3.3128130916847462
assessing feature residual sugar
current EV is 3.8595240255765804, new EV is 3.5363030396950004
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-1
phase #1: NCS
phase #2: Search
assessing feature citric acid
current EV is 2.6355135121918782, new EV is 2.6032183593434493
assessing feature fixed acidity
current EV is 2.6355

<decomposition.var_clus.Cluster at 0x26457f56550>

In [9]:
demo3.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |
|     |-----cluster-0-0-0
|     |     |-----alcohol
|     |     |-----density
|     |     |-----residual sugar
|     |     |-----total sulfur dioxide
|     |
|     |-----cluster-0-0-1
|           |-----chlorides
|           |-----free sulfur dioxide
|
|-----cluster-0-1
      |
      |-----cluster-0-1-0
      |     |-----citric acid
      |     |-----fixed acidity
      |     |-----pH
      |
      |-----cluster-0-1-1
            |-----sulphates
            |-----volatile acidity
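
One way to sanity-check a grouping like the one printed above is to compare the average absolute correlation inside a cluster with the average across two clusters; a sensible decomposition should give a clearly higher value inside. The helper below is a hypothetical illustration on toy data, not part of `VarClus`.

```python
import numpy as np
import pandas as pd

def mean_abs_corr(df, group_a, group_b=None):
    """Mean |correlation| within group_a (needs at least two features),
    or between group_a and group_b when group_b is given."""
    corr = df.corr().abs()
    if group_b is None:
        block = corr.loc[group_a, group_a].to_numpy()
        off_diagonal = ~np.eye(len(group_a), dtype=bool)  # ignore self-correlations
        return block[off_diagonal].mean()
    return corr.loc[group_a, group_b].to_numpy().mean()

# Hypothetical toy data: a1 and a2 share one driver, b1 and b2 share another
rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 1000))
toy = pd.DataFrame({
    "a1": x + 0.3 * rng.normal(size=1000),
    "a2": x + 0.3 * rng.normal(size=1000),
    "b1": y + 0.3 * rng.normal(size=1000),
    "b2": y + 0.3 * rng.normal(size=1000),
})

within = mean_abs_corr(toy, ["a1", "a2"])
between = mean_abs_corr(toy, ["a1", "a2"], ["b1", "b2"])
print(within, between)
```

With the white wine dataframe, for instance, `group_a` could be `['alcohol', 'density', 'residual sugar', 'total sulfur dioxide']` and `group_b` could be `['citric acid', 'fixed acidity', 'pH']`, both taken from the printed structure.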


* **Red wine**

In [10]:
demo3_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

In [11]:
# Looks like quality is the dependent variable
demo3_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [12]:
demo3 = VarClus()
demo3.decompose(demo3_df.drop('quality', axis=1))

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature chlorides
current EV is 4.829692811123508, new EV is 4.754858095309794
assessing feature citric acid
current EV is 4.829692811123508, new EV is 4.217963436619193
assessing feature density
current EV is 4.829692811123508, new EV is 4.63235145475538
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-0
phase #1: NCS
phase #2: Search
assessing feature citric acid
current EV is 3.82295660241381, new EV is 3.7913608187467975
assessing feature density
current EV is 3.82295660241381, new EV is 3.6003576471848393
assessing feature fixed acidity
current EV is 3.82295660241381, new EV is 3.251178168514338
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-1
phase #1: NCS
phase #2: Search
assessing feature free sulfur dioxide
Number of features is smaller than n_split, reducing n_split temporarily
current EV is 2.711437596005953, n

<decomposition.var_clus.Cluster at 0x2645a3dc828>

In [13]:
# Similar structure
demo3.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |
|     |-----cluster-0-0-0
|     |     |-----citric acid
|     |     |-----density
|     |     |-----fixed acidity
|     |     |-----pH
|     |     |-----sulphates
|     |
|     |-----cluster-0-0-1
|           |-----chlorides
|           |-----volatile acidity
|
|-----cluster-0-1
      |
      |-----cluster-0-1-0
      |     |-----free sulfur dioxide
      |     |-----total sulfur dioxide
      |     |-----alcohol
      |
      |-----cluster-0-1-1
            |-----residual sugar


### Demo 4: Test on another real dataset
> **UCI Machine Learning Repository - Credit Card Default Data Set**  
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients


* **Credit card default data**

In [22]:
demo4_df = pd.read_excel(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls',
    skiprows=[0],
    index_col=0
)

In [23]:
# Looks like 'default payment next month' is the dependent variable
demo4_df.head()

| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |


In [24]:
demo4 = VarClus()
demo4.decompose(demo4_df.drop('default payment next month', axis=1))

decomposing cluster cluster-0
phase #1: NCS
phase #2: Search
assessing feature BILL_AMT1
current EV is 10.239832492160017, new EV is 9.498678044145846
assessing feature BILL_AMT2
current EV is 10.239832492160017, new EV is 9.451074936281742
assessing feature BILL_AMT3
current EV is 10.239832492160017, new EV is 9.42575617810797
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-0
phase #1: NCS
phase #2: Search
assessing feature BILL_AMT1
current EV is 7.391294359358845, new EV is 6.664518437010028
assessing feature BILL_AMT2
current EV is 7.391294359358845, new EV is 6.648267907919297
assessing feature BILL_AMT3
current EV is 7.391294359358845, new EV is 6.695504389761458
Number of max tries has been reached. Returning current result...
decomposing cluster cluster-0-1
phase #1: NCS
phase #2: Search
assessing feature LIMIT_BAL
current EV is 5.939294085291814, new EV is 5.828003796021549
assessing feature PAY_0
current EV is 5.939294085291814,

<decomposition.var_clus.Cluster at 0x264591e5898>

In [25]:
demo4.print_cluster_structure()

cluster-0
|
|-----cluster-0-0
|     |
|     |-----cluster-0-0-0
|     |     |-----BILL_AMT1
|     |     |-----BILL_AMT2
|     |     |-----BILL_AMT3
|     |     |-----BILL_AMT4
|     |     |-----BILL_AMT5
|     |     |-----BILL_AMT6
|     |
|     |-----cluster-0-0-1
|           |-----PAY_AMT1
|           |-----PAY_AMT2
|           |-----PAY_AMT3
|           |-----PAY_AMT4
|           |-----PAY_AMT5
|           |-----PAY_AMT6
|
|-----cluster-0-1
      |
      |-----cluster-0-1-0
      |     |-----LIMIT_BAL
      |     |-----PAY_0
      |     |-----PAY_2
      |     |-----PAY_3
      |     |-----PAY_4
      |     |-----PAY_5
      |     |-----PAY_6
      |
      |-----cluster-0-1-1
            |
            |-----cluster-0-1-1-0
            |     |-----AGE
            |     |-----EDUCATION
            |     |-----MARRIAGE
            |
            |-----cluster-0-1-1-1
                  |-----SEX
