# Behavioral Profile Stratification via Unsupervised Learning

> Behavioral data embeddings for the stratification of individuals
with neurodevelopmental conditions.

> Designed for observational measurements of cognition and behavior of individuals with 
Autism Spectrum Conditions (ASCs).

* `dataset.py`: Connects to the database and dumps the data.
* `features.py`: Returns the vocabulary and dictionary of behavioral *EHRs* for each of the 4 possible depth levels.
It also returns a dataset with quantitative scores for level 4 features.
* `pt_embedding.py`: Computes TF-IDF patient embeddings; GloVe word embeddings, which are averaged into
subject embeddings; and Word2vec word embeddings, which are likewise averaged into individual representations.
* `clustering.py`: Performs hierarchical clustering or k-means on the embeddings and on the quantitative level 4 features.
* `visualization.py`: Visualizes results (e.g. _scatterplot & dendrogram_) for sub-cluster visualization;
_heatmap_ for inspection of quantitative scores between sub-clusters.

---
*Run the cell below to display logging output in the notebook. Otherwise, log messages are written to the `pipeline.log` file in the `./log` folder.*

In [None]:
from importlib import reload  # Not needed in Python 2
import logging
reload(logging)
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', 
                    level=logging.INFO, datefmt='%I:%M:%S')

---

## Step 1: Data Loading

> The `dataset` module accesses the database and dumps all the available tables. The information needed to access the data should be provided in the `utils.py` file. Subjects (e.g., adults) and tables (e.g., ADOS-2 Module 4) that need to be excluded are then filtered out, and dictionaries of subject demographics and encounter information are returned and saved to _.csv_ files.
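
The filtering step can be illustrated with a minimal sketch. It is not the code of `data_wrangling`: each table is a plain list of row dictionaries standing in for a pandas dataframe, and the `id_subj` column, the `vineland` table, the age cut-off, and the exclusion list are all hypothetical names and values chosen for the example.

```python
# Hypothetical stand-in: each table is a list of row dicts rather than a
# pandas DataFrame, so the sketch runs without any dependency.
EXCLUDED_TABLES = {"ados-2modulo4"}  # assumed exclusion list
MAX_AGE_YEARS = 18                   # assumed cut-off for "adults"


def filter_tables(tables, ages, excluded=EXCLUDED_TABLES, max_age=MAX_AGE_YEARS):
    """Drop excluded tables, then drop rows of subjects aged max_age or more."""
    kept = {}
    for name, rows in tables.items():
        if name in excluded:
            continue  # the whole table is not required
        kept[name] = [r for r in rows if ages[r["id_subj"]] < max_age]
    return kept


# Toy data (invented for illustration only).
toy_tables = {
    "ados-2modulo4": [{"id_subj": "s1", "score": 3}],
    "vineland": [{"id_subj": "s1", "score": 70}, {"id_subj": "s2", "score": 85}],
}
toy_ages = {"s1": 25.0, "s2": 7.5}

filtered = filter_tables(toy_tables, toy_ages)
print(filtered)
```

Here the excluded table disappears entirely, and the adult subject `s1` is removed from the remaining table.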

In [None]:
from dataset import access_db, data_wrangling, cohort_info

In [None]:
# # it returns a dictionary of pandas dataframes storing tables from the db
tables = access_db()

In [None]:
# # reduced dictionary (it excludes tables and subjects that are not required, e.g., ados-2modulo4, eas)
rid_tables = data_wrangling(tables)

In [None]:
# # it returns dictionary of subjects info and encounters
pinfo, penc = cohort_info(rid_tables)

## Step 2: Feature Processing

> The `DataFeatures` class is initialized with the desired depth level. The depth level ranges from 1 to 4: levels 1-3 are systematically derived from the instrument item structures, and level 4 is empirically derived in agreement with clinical experts. Depending on the level, _behavioral EHRs_ (bEHRs) and a vocabulary of terms are created. For each subject, each item score $N$ is treated as a word of the form `instrument_name::item::N`, and the chronologically ordered sequence of these "words" becomes the bEHR of that individual. All the behavioral terms obtained are also collected into a vocabulary.
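
As a rough illustration of the token format, the sketch below builds bEHR sequences and a vocabulary from a handful of invented records. The record layout, the helper names `build_behr` and `build_vocabulary`, and the zero-based indexing are assumptions made for the example; they are not taken from `features.py`.

```python
from datetime import date


def build_behr(records):
    """Turn item scores into chronologically ordered token sequences.

    `records` is a list of (subject_id, date, instrument, item, score) tuples.
    """
    behr = {}
    # Sort by subject, then by assessment date, so tokens come out in time order.
    for subj, _when, instrument, item, score in sorted(records, key=lambda r: (r[0], r[1])):
        behr.setdefault(subj, []).append(f"{instrument}::{item}::{score}")
    return behr


def build_vocabulary(behr):
    """Collect every distinct behavioral term and index it (zero-based here)."""
    terms = sorted({tok for seq in behr.values() for tok in seq})
    bt_to_idx = {term: i for i, term in enumerate(terms)}
    idx_to_bt = {i: term for term, i in bt_to_idx.items()}
    return bt_to_idx, idx_to_bt


# Invented records, deliberately out of chronological order.
toy_records = [
    ("s1", date(2015, 3, 1), "ados", "a1", 2),
    ("s1", date(2014, 1, 10), "vineland", "comm", 80),
    ("s2", date(2016, 5, 5), "ados", "a1", 0),
]

toy_behr = build_behr(toy_records)
toy_bt_to_idx, toy_idx_to_bt = build_vocabulary(toy_behr)
print(toy_behr)
print(toy_bt_to_idx)
```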

> The `create_level_features` method is only available for level 4, due to noise and missingness in the data. It represents each subject as a vector of quantitative test scores ordered according to 5 clinically selected timeframes (F1-F5). Missing values are imputed with the mean.
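
Mean imputation can be sketched as follows. The column-wise mean and the z-score scaling are assumptions: the text states only that missing values are imputed with the mean and that a scaled dataset is returned, so the scaling used by `create_level_features` may differ. The helper name and toy scores are invented.

```python
import math


def impute_and_scale(rows):
    """Column-wise mean imputation followed by z-score scaling.

    `rows` maps subject id -> list of scores, with None marking a missing value.
    Assumes every column has at least one observed value.
    """
    ids = list(rows)
    n_cols = len(rows[ids[0]])
    imputed = {s: list(rows[s]) for s in ids}
    scaled = {s: [0.0] * n_cols for s in ids}
    for j in range(n_cols):
        observed = [rows[s][j] for s in ids if rows[s][j] is not None]
        mean = sum(observed) / len(observed)
        for s in ids:
            if imputed[s][j] is None:
                imputed[s][j] = mean  # fill the gap with the column mean
        column = [imputed[s][j] for s in ids]
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        for s in ids:
            # A constant column has zero spread; map it to 0.0 instead of dividing.
            scaled[s][j] = (imputed[s][j] - mean) / std if std > 0 else 0.0
    return imputed, scaled


toy_scores = {"s1": [10.0, None], "s2": [20.0, 4.0], "s3": [None, 8.0]}
toy_imputed, toy_scaled = impute_and_scale(toy_scores)
print(toy_imputed)
print(toy_scaled)
```

An imputed entry always lands at zero after scaling, because it sits exactly on the column mean.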

In [None]:
from features import DataFeatures

In [None]:
datafeatures = DataFeatures(level=4, df_dict=rid_tables)

In [None]:
behr, (bt_to_idx, idx_to_bt) = datafeatures.create_level_tokens()

In [None]:
feat_df, feat_df_scaled = datafeatures.create_level_features()

## Step 3: Embeddings

> The `Pembeddings` class consists of three methods: `tfidf`, which outputs patient embeddings from an SVD transform of word co-occurrence counts; `word2vec_emb`, which computes word embeddings for each behavioral term, learned via the _continuous Skip-gram model_ (Mikolov et al., 2013), and outputs patient representations by averaging the behavioral terms of each sequence; and `glove_pemb`, which learns word embeddings via the GloVe algorithm (Pennington et al., 2014) and averages the behavioral terms to return patient encodings.
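
Two of the ideas above can be sketched in a few lines: TF-IDF weighting of a bEHR, and averaging word vectors into one patient vector. This is an illustrative sketch, not the `Pembeddings` implementation. It uses the unsmoothed textbook TF-IDF formula, omits the SVD step, and takes made-up word vectors in place of trained GloVe or Word2vec embeddings. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(behr, bt_to_idx):
    """Weight each term by term frequency times log(N / document frequency)."""
    n_docs = len(behr)
    doc_freq = Counter(tok for seq in behr.values() for tok in set(seq))
    vectors = {}
    for subj, seq in behr.items():
        vec = [0.0] * len(bt_to_idx)
        for tok, count in Counter(seq).items():
            idf = math.log(n_docs / doc_freq[tok])
            vec[bt_to_idx[tok]] = (count / len(seq)) * idf
        vectors[subj] = vec
    return vectors


def mean_pool(behr, word_vectors):
    """Average the word vectors of a (non-empty) sequence into one patient vector."""
    pooled = {}
    for subj, seq in behr.items():
        dim = len(word_vectors[seq[0]])
        pooled[subj] = [sum(word_vectors[t][d] for t in seq) / len(seq) for d in range(dim)]
    return pooled


# Invented toy inputs.
demo_behr = {"s1": ["a::1::0", "a::2::1", "a::1::0"], "s2": ["a::2::1"]}
demo_vocab = {"a::1::0": 0, "a::2::1": 1}
demo_word_vectors = {"a::1::0": [1.0, 0.0], "a::2::1": [0.0, 3.0]}

demo_tfidf = tfidf_vectors(demo_behr, demo_vocab)
demo_pooled = mean_pool(demo_behr, demo_word_vectors)
print(demo_tfidf)
print(demo_pooled)
```

In the toy data, the term shared by both subjects receives a weight of zero, which is how TF-IDF downweights terms that do not discriminate between individuals.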

In [None]:
from pt_embedding import Pembeddings

In [None]:
model = Pembeddings(behr, bt_to_idx)

In [None]:
svd_pid_list, svd_mtx = model.tfidf()

In [None]:
glove_pid_list, glove_emb, word_emb = model.glove_pemb()

In [None]:
w2v_pid_list, w2v_emb, w2v_word_emb, _ = model.word2vec_emb()

## Step 4: Clustering

> This module performs _hierarchical clustering_ or _k-means clustering_ on either subject embeddings or feature data. The number of clusters is chosen via the Elbow Method.
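
One common way to locate the elbow is to pick the point of the score curve that lies farthest from the straight line joining its first and last points. The sketch below shows that heuristic on invented within-cluster scores. Whether `elbow_method` in `clustering.py` uses this rule is not stated here, so the function `elbow_point` is hypothetical and only illustrates the idea.

```python
import math


def elbow_point(ks, scores):
    """Return the k whose (k, score) point is farthest from the first-to-last chord."""
    x0, y0 = ks[0], scores[0]
    x1, y1 = ks[-1], scores[-1]
    chord_length = math.hypot(x1 - x0, y1 - y0)
    best_k, best_dist = ks[0], -1.0
    for k, score in zip(ks, scores):
        # Perpendicular distance from the point to the chord.
        dist = abs((y1 - y0) * (k - x0) - (x1 - x0) * (score - y0)) / chord_length
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Invented scores that drop quickly and then flatten.
candidate_ks = [2, 3, 4, 5, 6, 7]
toy_inertia = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]

chosen_k = elbow_point(candidate_ks, toy_inertia)
print(chosen_k)
```

The selected k does not change when the scores are multiplied by a constant, since every distance is rescaled by the same factor.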

In [None]:
from clustering import HclustEmbeddings, HclustFeatures, KMeansEmbeddings, KMeansFeatures, compare_clustering
import utils as ut

In [None]:
hclust_emb = HclustEmbeddings(min_cl=ut.min_cl, max_cl=ut.max_cl, 
                              affinity='euclidean', linkage='ward')

kmclust_emb = KMeansEmbeddings(min_cl=ut.min_cl, max_cl=ut.max_cl)

### `TF-IDF` Embedding

In [None]:
# TFIDF EMBEDDING
# tfidf_best_cl = hclust_emb.find_best_nclu(svd_mtx, n_iter=ut.n_iter, 
#                                           subsampl=ut.subsampl)
tfidf_best_hccl = hclust_emb.elbow_method(svd_mtx)
tfidf_hcsubc = hclust_emb.fit(svd_mtx, svd_pid_list, tfidf_best_hccl)

In [None]:
# # KMeans clustering
tfidf_best_kmcl = kmclust_emb.elbow_method(svd_mtx)
tfidf_kmsubc = kmclust_emb.fit(svd_mtx, svd_pid_list, tfidf_best_kmcl)

### `GloVe` Embedding

In [None]:
# GLOVE EMBEDDING
# glv_best_cl = hclust_emb.find_best_nclu(glove_emb, n_iter=ut.n_iter, subsampl=ut.subsampl)
glv_best_hccl = hclust_emb.elbow_method(glove_emb)
glv_hcsubc = hclust_emb.fit(glove_emb, glove_pid_list, glv_best_hccl)

In [None]:
glv_best_kmcl = kmclust_emb.elbow_method(glove_emb)
glv_kmsubc = kmclust_emb.fit(glove_emb, glove_pid_list, glv_best_kmcl)

### `Word2Vec` Embedding

In [None]:
w2v_best_hccl = hclust_emb.elbow_method(w2v_emb)
w2v_hcsubc = hclust_emb.fit(w2v_emb, w2v_pid_list, w2v_best_hccl)

In [None]:
w2v_best_kmcl = kmclust_emb.elbow_method(w2v_emb)
w2v_kmsubc = kmclust_emb.fit(w2v_emb, w2v_pid_list, w2v_best_kmcl)

### Feature clustering

In [None]:
hclust_feat = HclustFeatures(min_cl=ut.min_cl, max_cl=ut.max_cl, 
                             affinity='euclidean', linkage='ward')
kmclust_feat = KMeansFeatures(min_cl=ut.min_cl, max_cl=ut.max_cl)

In [None]:
# FEATURES REPRESENTATION
# feat_best_cl = hclust_feat.find_best_nclu(feat_df_scaled, n_iter=ut.n_iter, subsampl=ut.subsampl)
feat_best_hccl = hclust_feat.elbow_method(feat_df_scaled)
feat_hcsubc = hclust_feat.fit(feat_df_scaled, feat_best_hccl)

In [None]:
feat_best_kmcl = kmclust_feat.elbow_method(feat_df_scaled)
feat_kmsubc = kmclust_feat.fit(feat_df_scaled, feat_best_kmcl)

## Step 5: Clustering II (Visualization) 

> The second clustering module (`visualization`) plots the dendrogram and the Elbow Method curve used to select the number of clusters. It also displays the identified subtypes as scatterplots (UMAP projections) and as heatmaps for phenotyping, in which the quantitative scores of selected items are highlighted. All these plots are available for both patient embeddings and feature data.
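
A phenotyping heatmap is typically read as one row (or block of rows) per sub-cluster and one column per item. The sketch below computes the kind of per-cluster summary that such a heatmap can display: the mean of each quantitative score within each sub-cluster. It is an assumption about how the data might be summarized, not the code of `data_heatmap_feat`; the helper name and toy values are invented.

```python
from collections import defaultdict


def cluster_mean_profiles(scores, labels):
    """Average each feature within each sub-cluster.

    `scores` maps subject id -> list of floats; `labels` maps subject id -> cluster id.
    """
    grouped = defaultdict(list)
    for subj, vec in scores.items():
        grouped[labels[subj]].append(vec)
    return {
        cluster: [sum(col) / len(col) for col in zip(*vectors)]
        for cluster, vectors in sorted(grouped.items())
    }


# Invented scores and cluster assignments.
demo_scores = {"s1": [1.0, 4.0], "s2": [3.0, 6.0], "s3": [10.0, 0.0]}
demo_labels = {"s1": 0, "s2": 0, "s3": 1}

profiles = cluster_mean_profiles(demo_scores, demo_labels)
print(profiles)
```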

In [None]:
from visualization import Visualization

In [None]:
viz = Visualization(pinfo, ut.col_dict, ut.c_out)

### `TF-IDF`

In [None]:
# # Example of visualization for tfidf embeddings
# # Prepare data for umap and dendrogram
# umap_mtx, pid_subc_list = viz.data_scatter_dendrogram(svd_mtx, tfidf_hcsubc, svd_pid_list, random_state=42,
#                                                       n_neighbors = 100,
#                                                       min_dist=0.0)

In [None]:
# viz.scatterplot_dendrogram(svd_mtx, umap_mtx, pid_subc_list, 15, 10)

In [None]:
# # Prepare data for heatmap
# emb_scaled = viz.data_heatmap_emb(behr, bt_to_idx, tfidf_hcsubc, 
#                                   save_df='df_tfidfemb_level4.csv')

In [None]:
# viz.heatmap_emb(emb_scaled, 500, 2000, save_html='tfidf_heatmap_level-4')

### `GloVe`

In [None]:
# Visualization for GloVe embeddings
# Prepare data for umap and dendrogram
umap_mtx, pid_subc_list = viz.data_scatter_dendrogram(glove_emb, glv_hcsubc, glove_pid_list, random_state=42,
                                                      n_neighbors = 5,
                                                      min_dist=0.0)

In [None]:
viz.scatterplot_dendrogram(glove_emb, umap_mtx, pid_subc_list, 15, 10, save_fig=None)

In [None]:
# Plot UMAP projection of word embeddings via GloVe
viz.plot_word_embedding(word_emb, idx_to_bt, 800, 
                        800,
                        n_neighbors = 10,
                        min_dist=0.0)

In [None]:
# Prepare data for heatmap
emb_scaled = viz.data_heatmap_emb(behr, bt_to_idx, glv_hcsubc, 
                                  save_df=None)

In [None]:
viz.heatmap_emb(emb_scaled, 500, 1800, save_html=None)

### `Word2vec`

In [None]:
# Scatterplot and dendrogram of UMAP projections
umap_mtx, pid_subc_list = viz.data_scatter_dendrogram(w2v_emb, w2v_hcsubc, w2v_pid_list, random_state=42,
                                                      n_neighbors = 5,
                                                      min_dist=0.0)
viz.scatterplot_dendrogram(w2v_emb, umap_mtx, pid_subc_list, 15, 10)

In [None]:
# Plot UMAP projection of word embeddings via Word2Vec
viz.plot_word_embedding(w2v_word_emb.transpose(), 
                        idx_to_bt, 
                        800, 
                        800,
                        n_neighbors =10,
                        min_dist=0.0)

In [None]:
# Prepare data for heatmap
emb_scaled = viz.data_heatmap_emb(behr, bt_to_idx, w2v_hcsubc, 
                                  save_df=None)
viz.heatmap_emb(emb_scaled, 500, 1800)

### Features

In [None]:
# Feature data visualization
# Prepare data for umap and dendrogram
umap_mtx, pid_subc_list = viz.data_scatter_dendrogram(feat_df_scaled, feat_hcsubc, random_state=42,
                                                      n_neighbors = 10,
                                                      min_dist=0.0)

In [None]:
viz.scatterplot_dendrogram(feat_df_scaled, umap_mtx, pid_subc_list, 15, 10, save_fig=None)

In [None]:
# Prepare data for heatmap
emb_scaled = viz.data_heatmap_feat(feat_df, feat_df_scaled, feat_hcsubc,
                                   save_df=None)

In [None]:
viz.heatmap_feat(emb_scaled, 1000, 2000, save_html=None)

---
