# Iterative Feature Pruning Workflow

## Recursive Hierarchical Clustering
### Dependencies
Python packages `numpy` and `scipy`.

**In the folder**
- calculateDistance.py
- calcaulateDistanceInt.py
- mutual_info.py

**Main file: recursiveHierarchicalClusteringFast.py**


### 1. Run the .py script
This step assumes that the data has already been converted to the input format described below.

**Arguments:**
1. inputPath: the path to a file that contains the users to be clustered. Each line represents one user.

>user_id \t A(1)G(10) \
> where A and G are actions, and 1 and 10 are the frequencies of each action. The user_id runs from 1 to the total number of users.

2. outputPath: The directory to place all temporary files as well as the final result.

3. sizeThreshold (optional): the minimum size of a cluster that will be divided further. For example, 0.05 means that clusters containing less than 5% of the total instances are not split further.
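
As a minimal sketch of the input format described for inputPath, the snippet below parses one line into a user id and an ordered list of (action, frequency) pairs. The function name `parse_user_line` and the regular expression are assumptions made for illustration and are not part of the provided scripts; the sketch also assumes that action names contain no parentheses or whitespace.

```python
import re
from typing import List, Tuple

# Assumed token shape: an action name followed by its frequency in
# parentheses, e.g. "A(1)". This pattern is an illustration only.
_ACTION_RE = re.compile(r"([^()\s]+)\((\d+)\)")


def parse_user_line(line: str) -> Tuple[int, List[Tuple[str, int]]]:
    """Split 'user_id<TAB>A(1)G(10)' into (user_id, [(action, frequency), ...])."""
    user_id, _, actions = line.rstrip("\n").partition("\t")
    # Keep the pairs in the order they appear on the line.
    pairs = [(name, int(freq)) for name, freq in _ACTION_RE.findall(actions)]
    return int(user_id), pairs


if __name__ == "__main__":
    print(parse_user_line("1\tA(1)G(10)"))
```

A helper like this can be handy for checking a few lines of the input file before starting a long clustering run.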

In [1]:
!python3 recursiveHierarchicalClusteringFast.py seq_input.txt output/ 0.05

[LOG]: total users 10000
[LOG]: starting in localhost for tmp_1717573444root
[LOG]: 2024-06-05 15:44:04.435067 computing matrix for output/tmp_1717573444root
[LOG]: 2024-06-05 15:44:04.719748 preprocessing takes 0.2842s
[LOG]: start new thread 1
[LOG]: start new thread 2
[LOG]: start new thread 3
[LOG]: thread 1 finished after 1
[LOG]: start new thread 4
[LOG]: thread 2 finished after 1
[LOG]: start new thread 5
[LOG]: thread 3 finished after 1
[LOG]: thread 4 finished after 0
[LOG]: thread 5 finished after 0
[LOG]: 2024-06-05 15:44:08.328126 merge started for output/tmp_1717573444root
[LOG]: 2024-06-05 15:44:08.361417 merge finished for output/tmp_1717573444root
[LOG]: 2024-06-05 15:44:08.366286 matrix computation finished for output/tmp_1717573444root
[LOG]: first matrixTime 3.980624
[LOG]: finished calculating modularityBasics
[LOG]: sweetSpot is 31, modularity 0.021828
  result = np.linalg.lstsq(A, y)
[LOG]: finished calculating modularityBasics
[LOG]: sweetSpot is 2, modularity 0.

#### Output: 
- `output/matrix.dat`: the distance matrix for the root level, stored to avoid repeated calculation. If the file is available, the script reads the matrix instead of calculating it again. The file contains an N*N distance matrix scaled to integers in the range 0-100.

- `output/result.json`: stores the clustering result, in which each node has the form ['t', sub-cluster list, cluster info] or ['l', user list, cluster info].
  - `Node type`: either t or l. l (leaf) means that the cluster is not split further; t (tree) means that the cluster is split further.
  - `Sub-cluster list`: the list of clusters that result from splitting the current cluster.
  - `User list`: a list of user ids representing the users in the given cluster.
  - `Cluster info`: a dictionary containing metadata for the cluster.
    - `gini`: Gini coefficient of the chi-square score distribution; it measures the skewness of the feature importance distribution.
    - `sweetspot`: the modularity for the best k picked when further splitting this cluster.
    - `exclusions`: a ranked list of the top features that help to distinguish the cluster from others.
    - `exclusionsScore`: the chi-square scores corresponding to the top features listed in exclusions.
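
To illustrate the node layout described for `output/result.json`, the sketch below walks the tree recursively and collects every leaf cluster. The function name `collect_leaves` is hypothetical, and the code assumes that every node is exactly a three-element list as described; it is not taken from the provided scripts.

```python
import json
from typing import Any, Dict, List


def collect_leaves(node: List[Any]) -> List[Dict[str, Any]]:
    """Return every leaf cluster under `node` as {'users': [...], 'info': {...}}."""
    # Assumed layout: [node type, sub-cluster list or user list, cluster info].
    node_type, payload, info = node
    if node_type == "l":
        # Leaf: the payload is the list of user ids.
        return [{"users": payload, "info": info}]
    # Tree node: the payload is a list of child nodes.
    leaves: List[Dict[str, Any]] = []
    for child in payload:
        leaves.extend(collect_leaves(child))
    return leaves


if __name__ == "__main__":
    # Hand-made example tree; the values are made up for illustration.
    sample = json.loads('["t", [["l", [1, 2], {}], ["l", [3], {}]], {}]')
    for leaf in collect_leaves(sample):
        print(len(leaf["users"]), leaf["users"])
```

For a real run, the same function could be applied to the object returned by `json.load` on `output/result.json`.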


### 2. Visualisation Interface

In [2]:
!python3 visulization.py output/result.json  seq_input.txt vis/vis.json

#### Open the visualisation interface on localhost

In [3]:
import os
import subprocess
import webbrowser

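# Serve the contents of the vis/ directory.
os.chdir('vis')
# Start a local HTTP server on port 8000 in a background process.
subprocess.Popen(['python3', '-m', 'http.server', '8000'])
# Open the visualisation page and point it at the generated vis.json.
webbrowser.open_new_tab('http://localhost:8000/multi_color.html?json=vis.json')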

Serving HTTP on :: port 8000 (http://[::]:8000/) ...


True

::1 - - [05/Jun/2024 15:45:37] "GET /vis.json HTTP/1.1" 200 -
::1 - - [05/Jun/2024 15:45:37] "GET /vis.json HTTP/1.1" 304 -
::1 - - [05/Jun/2024 15:45:37] "GET /vis.json HTTP/1.1" 304 -
::1 - - [05/Jun/2024 15:45:37] "GET /vis.json HTTP/1.1" 304 -
