# Behavioral profile stratification via unsupervised learning

In [1]:
from dataset import access_db, demographics, data_path
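from features import (create_tokens, behr_level1, behr_level2, behr_level3,
                      behr_level4, create_vocabulary, create_behr,
                      create_features_data)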

Created directory /odf-data: 2019-07-08-22-19-46



### Package `dataset`

**Access odf-lab database** (`access_db`)

:return:
- `list` of Patient class objects
- `dictionary` of Pencounters objects per patient
- `dictionary` of pandas dataframes, one per table in the DB

**Compute standard demographic statistics** (`demographics`)

In [2]:
subj_list, p_enc, df_dict = access_db()

In [3]:
demographics(subj_list, p_enc)

Period span: 01/05/2013 -- 31/10/2018

N of subjects: 205

Average number of assessments: 5.522
Median number of assessments: 5.0
Maximum number of assessments: 24
Minimum number of assessments: 1

Average number of encounters: 1.439
Median number of encounters: 1.0
Maximum number of encounters: 5
Minimum number of encounters: 1

Instrument list:
griffithsmentaldevelopmentscales
ados-2modulo1
wppsi-iiifascia40-73
ados-2modulotoddler
leiterinternationalperformancescale-revised
wisc-iv
ados-2modulo2
srs
ados-2modulo3
wppsi-iiifascia26-311
psi-sf
wppsi
wisc-iii
wais-iv
vineland-ii
emotionalavailabilityscales
N of selected instruments: 16


Mean age of the subjects: 11.054212729030509 -- Standard deviation: 5.097775697784719
N Female: 34 -- N Male: 171



### Package `features`

**Create raw behavioral EHRs** (`create_tokens`)

:parameters:

`dictionary` of dataframes, one per table in the DB

:return:

`dictionary` of lists [Pinfo, behavioral tokens ordered by date of assessment] per subject

**Filter tokens from raw behavioral EHRs according to depth level** (`behr_level1` to `behr_level4`)

:parameters:

`dictionary` of lists [Pinfo, behavioral tokens] per subject

:return:

`dictionary` of lists [Pinfo, tokens filtered at the given level] per subject

**Create BEHRs and vocabulary**

* `create_vocabulary`

    :parameters:

    `dictionary` list of [Pinfo, tokens] per subject
    
    `int` level

    :return:

    `dictionaries` term to idx, idx to term
    
* `create_behr`

    :parameters:
    
    `dictionary` list of [Pinfo, tokens] per subject
    
    `dictionary` term to idx
    
    `int` level
    
    :return:
    
    `dictionary` of lists of tuples (date of assessment (DOA), [instrument tokens]) per subject
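
The vocabulary and BEHR steps described above can be illustrated with a minimal sketch. It is not the code of the `features` package: the helper names `build_vocabulary` and `encode_behr`, the simplified input layout (subject id mapped to a list of `(date, tokens)` tuples, without the `Pinfo` object), the token strings, and the choice to start indices at 0 are all assumptions made for illustration.

```python
def build_vocabulary(token_seqs):
    """Map each distinct token to an integer index, and back.

    token_seqs: dict of subject id -> list of (date, [tokens]) tuples
    (hypothetical, simplified layout).
    """
    terms = sorted({tok
                    for visits in token_seqs.values()
                    for _, toks in visits
                    for tok in toks})
    term_to_idx = {term: idx for idx, term in enumerate(terms)}
    idx_to_term = {idx: term for term, idx in term_to_idx.items()}
    return term_to_idx, idx_to_term


def encode_behr(token_seqs, term_to_idx):
    """Replace tokens with their indices, ordering visits by date."""
    return {
        pid: [(doa, [term_to_idx[t] for t in toks])
              for doa, toks in sorted(visits, key=lambda v: v[0])]
        for pid, visits in token_seqs.items()
    }


# Toy input: subject ids, dates and token strings are invented.
toy = {
    "s01": [("2015-03-01", ["srs::total", "wisc-iv::fsiq"]),
            ("2014-01-10", ["srs::total"])],
    "s02": [("2016-06-20", ["wisc-iv::fsiq"])],
}
term_to_idx, idx_to_term = build_vocabulary(toy)
behr = encode_behr(toy, term_to_idx)
print("Vocabulary size:", len(term_to_idx))
print(behr["s01"])
```

Sorting the terms before indexing keeps the mapping reproducible across runs; whether the actual package does this, or reserves an index for padding, is not stated in this notebook.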
    
**Create feature data (quantitative scores)**

* `create_features_data`

    :parameters:
    
    `dictionary` output of `behr_level4`
    
    :return:
    
    mean-imputed dataframe
    
    normalized (column-wise) dataframe
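
As a rough illustration of the two outputs listed above, the sketch below mean-imputes missing scores and then rescales each column. It uses plain Python dictionaries of columns instead of pandas dataframes so that it runs without dependencies; in pandas the imputation step would typically be `df.fillna(df.mean())`. The helper names, the column names, and the choice of z-score standardization are assumptions, since the notebook only says "normalized (column-wise)".

```python
from statistics import mean, pstdev


def mean_impute(columns):
    """Replace None in each column with the mean of its observed values.

    Assumes every column has at least one observed value.
    """
    imputed = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        fill = mean(observed)
        imputed[name] = [fill if v is None else v for v in values]
    return imputed


def zscore(columns):
    """Standardize each column to zero mean and unit population std.

    Constant columns (std of 0) are mapped to all zeros.
    """
    scaled = {}
    for name, values in columns.items():
        mu, sigma = mean(values), pstdev(values)
        scaled[name] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return scaled


# Toy feature table: one list per column, None marks a missing score.
toy = {
    "srs_total": [60.0, None, 80.0],
    "fsiq": [100.0, 90.0, None],
}
feat = mean_impute(toy)
feat_scaled = zscore(feat)
print(feat["srs_total"])
```

Imputing before scaling means the filled-in values sit exactly at zero after standardization, so they do not pull a subject toward either extreme of a score.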

In [4]:
raw_behr = create_tokens(df_dict)

Average length of behavioral sequences: 5.522



### Behavioral EHRs Level-1

In [5]:
lev1 = behr_level1(raw_behr)

In [6]:
lab1_to_idx, idx_to_lab1 = create_vocabulary(lev1, level=1)
out_behr_lev1 = create_behr(lev1, lab1_to_idx, level=1)

Vocabulary size:1349



### Behavioral EHRs Level-2

In [7]:
lev2 = behr_level2(raw_behr)

In [8]:
lab2_to_idx, idx_to_lab2 = create_vocabulary(lev2, level=2)
out_behr_lev2 = create_behr(lev2, lab2_to_idx, level=2)

Vocabulary size:1198



### Behavioral EHRs Level-3

In [9]:
lev3 = behr_level3(raw_behr)

In [10]:
lab3_to_idx, idx_to_lab3 = create_vocabulary(lev3, level=3)
out_behr_lev3 = create_behr(lev3, lab3_to_idx, level=3)

Vocabulary size:514



### Behavioral EHRs Level-4

In [11]:
lev4 = behr_level4(raw_behr)

In [12]:
lab4_to_idx, idx_to_lab4 = create_vocabulary(lev4, level=4)
out_behr_lev4 = create_behr(lev4, lab4_to_idx, level=4)

Vocabulary size:1167



### Feature data Level-4

In [13]:
feat_df, feat_scaled_df = create_features_data(lev4)

# Patient embeddings

In [14]:
from pt_embedding import Pembeddings

In [15]:
model = Pembeddings(out_behr_lev4, idx_to_lab4)

In [16]:
svd_pid_list, svd_mtx = model.tfidf()
glove_pid_list, glove_emb = model.glove_pemb()

Performing SVD on the TF-IDF matrix...
epoch 0, error 0.011
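
The printed message indicates that `model.tfidf()` builds a TF-IDF matrix over the subjects' token sequences and then reduces it with SVD. The sketch below illustrates only the TF-IDF weighting, on toy data and in plain Python. It is not the `Pembeddings` implementation: the function name, the input layout (subject id mapped to a flat list of token indices), and the particular variant (relative term frequency times `log(N / df)`) are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights per subject.

    docs: dict of subject id -> flat list of token indices.
    Returns dict of subject id -> {token index: weight}.
    """
    n_docs = len(docs)
    # Document frequency: number of subjects in which each token occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    weights = {}
    for pid, tokens in docs.items():
        counts = Counter(tokens)
        weights[pid] = {
            term: (c / len(tokens)) * math.log(n_docs / df[term])
            for term, c in counts.items()
        }
    return weights


# Toy sequences: token 1 occurs in every subject, so its weight is 0.
toy = {"s01": [0, 0, 1], "s02": [1, 2]}
w = tfidf(toy)
print(w["s01"])
```

Tokens shared by all subjects receive zero weight, which is why TF-IDF emphasizes the assessments that distinguish one behavioral profile from another. A dimensionality reduction such as SVD would then be applied to the resulting subject-by-token matrix.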
