# CS598 Deep Learning for Healthcare
### Mitch & Sathish

## Reproducibility Summary
This project reproduces the paper [SurfCon: Synonym Discovery on Privacy-Aware Clinical Data](https://arxiv.org/pdf/1906.09285.pdf). The paper uses aggregated co-frequency data for medical terms, extracted from clinical notes, as an indicator of relationships between those terms. The authors demonstrate that this data can identify synonymous terms, where two terms count as synonyms when they are aligned under the same UMLS concept. Our team reproduced positive results comparable to those reported by the SurfCon authors. We reproduced two of the authors' experiments with a slight variation in datasets: the character-based pre-trained embeddings used by the original authors were unavailable, so we identified and used different subword embeddings instead.

## Description & Context


While the authors made [code](https://github.com/zhenwang9102/SurfCon) available for the models and training, we had to modify or add the following key elements:
- Data loading
- Data pre-processing
    - Transform data structure for inputs
    - Implement PPMI algorithm to convert frequency to PPMI
    - Implement subsampling algorithm
    - Map & Create synonym graph for labels
- Update outdated packages

Additionally, because the character-based (surface form) pre-trained embedding was not available, we worked with the authors to find a suitable alternative (see below) and wrote code to pre-process it, since the structure of the two datasets differed.

The referenced code is therefore a combination of the authors' original code, modifications made by our team, and new code written by our team.

## Data Loading

The main dataset utilized for this research is a co-frequency graph built from clinical narrative data in this paper: [Building the graph of medicine from millions of clinical narratives](https://datadryad.org/resource/doi:10.5061/dryad.jp917). The graph data is available in the paper link above.

The pretrained embeddings utilized in the paper can be downloaded here: 
- Word Embeddings -> [GloVe](http://nlp.stanford.edu/data/glove.6B.zip)
- Node Embeddings -> [Node Embeddings](https://drive.google.com/file/d/1nKXDppoSsT6uHCl0yG_zlrC4QFyCyu41/view)

The pretrained embeddings that our team used to replace the CharNGram embeddings from the paper:
- Subword Embeddings -> [fastText pretrained subword embeddings](https://fasttext.cc/docs/en/english-vectors.html)

1. Download co-frequency graph, unzip and store in the `mappings` folder
2. Download embeddings, unzip, and store in the `embeddings` folder
3. To replicate the medical terms that the original authors used, download [these](https://drive.google.com/file/d/1RN0x45dnMAkRKQWAwIqoz2qNL_3hfsQv/view) `pkl` files and store them as follows:
    - `all_iv_terms_perBin_1.pkl` in the `sym_data` folder
    - `term_string_mapping.pkl` in the `mappings` folder


Alternatively, if you are unable to replicate the data folder structure, you can pull down the [data folder](https://drive.google.com/drive/folders/1WWg-rEqJl1A-5IM3nQHT93hIfpXaSG4-?usp=sharing) from our development environment, which includes all of the datasets in the correct locations.
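
Before running the cells that follow, it can help to confirm that the files landed where the code expects them. The sketch below is not part of the project code: the function name `find_missing_files` is ours, the relative paths are collected from the cells and command-line arguments in this notebook, and the `../data` root is an assumption based on the paths used in the pre-processing cell.

```python
from pathlib import Path

# Relative paths gathered from this notebook; adjust if your layout differs.
EXPECTED_FILES = [
    "mappings/1_Cofrequencies/cofreqs_terms_perBin_1d.txt",
    "mappings/2_Singleton_Frequency_Counts/singlets_terms_perBin_1d.txt",
    "mappings/3_ID_Mappings/1_term_ID_to_string.txt",
    "mappings/3_ID_Mappings/2a_concept_ID_to_string.txt",
    "mappings/3_ID_Mappings/3_term_ID_to_concept_ID.txt",
    "mappings/term_string_mapping.pkl",
    "sym_data/all_iv_terms_perBin_1.pkl",
    "embeddings/glove.6B.100d.txt",
    "embeddings/wiki-news-300d-1M-subword.vec",
    "embeddings/line2nd_ttcooc_embedding.txt",
]


def find_missing_files(data_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not files under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Assumed data root, relative to the `src` folder used later in the notebook.
for rel_path in find_missing_files("../data"):
    print("Missing:", rel_path)
```

An empty result means every file in the list was found; anything printed points at a download or unzip step to revisit.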

In [None]:
cd /content/drive/MyDrive/cs598_dlh_project/

/content/drive/MyDrive/cs598_dlh_project


In [None]:
# Install Python packages
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.24.2
  Downloading numpy-1.24.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting tqdm==4.28.1
  Downloading tqdm-4.28.1-py2.py3-none-any.whl (45 kB)
Collecting jellyfish==0.7.1
  Downloading jellyfish-0.7.1.tar.gz (131 kB)
  Preparing metadata (setup.py) ... done
Collecting Distance==0.1.3
  Downloading Distance-0.1.3.tar.gz (180 kB)
  Prepa

In [None]:
cd src

/content/drive/MyDrive/cs598_dlh_project/src


## Data Pre-processing
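
The cell below converts raw co-frequency counts into edge weights with `cofreq_to_ppmi`. As implemented there, the positive pointwise mutual information (PPMI) of two terms $a$ and $b$ can be written as follows (the symbols are ours, chosen for this summary):

$$
\mathrm{PPMI}(a, b) = \max\left(0,\ \log_2 \frac{c_{ab}/N}{(c_a/N)\,(c_b/N)}\right) = \max\left(0,\ \log_2 \frac{c_{ab}\,N}{c_a\,c_b}\right)
$$

Here $c_{ab}$ is the co-frequency count of the pair (`count`), $c_a$ and $c_b$ are the singleton counts of the two terms (`count_1`, `count_2`), and $N$ is `wordCount`, which the code sets to the number of terms remaining after subsampling.

As a worked example with made-up numbers, take $c_{ab} = 4$, $c_a = 4$, $c_b = 8$ and $N = 16$:

$$
\log_2 \frac{4 \cdot 16}{4 \cdot 8} = \log_2 2 = 1, \qquad \mathrm{PPMI}(a, b) = 1
$$

With $c_{ab} = 1$ and $c_a = c_b = 8$, the logarithm is $\log_2 (16/64) = -2$, so the PPMI is clipped to $0$ and the edge is removed by the `ppmi > 0` filter.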

In [None]:
import pandas as pd
import numpy as np
import pickle
import networkx as nx

examplePPMIRow = None
exampleSynonym = None

# To re-write the processed datasets, set this value to True
writeDatasets = False

# ======================================
# Convert Co-frequency count graph to PPMI
# ======================================

def cofreq_to_ppmi(cofreq_graph, singleton_freqs):
    # Add ppmi column to graph after calculating PPMI from term frequency
    wordCount = len(singleton_freqs.index)    
    singleton_freqs = singleton_freqs.rename(columns={"node": "node1", "count": "count_1"}) 
    cofreqIncludeTermCounts = cofreq_graph.merge(singleton_freqs, how='inner', on=["node1"])
    
    singleton_freqs = singleton_freqs.rename(columns={"node1": "node2", "count_1": "count_2"}) 
    cofreqIncludeTermCounts = cofreqIncludeTermCounts.merge(singleton_freqs, how='inner', on=["node2"])     

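    # PPMI = max(0, log2(p(a, b) / (p(a) * p(b)))), with counts normalised by wordCount
    pairProb = cofreqIncludeTermCounts["count"] / wordCount
    prob1 = cofreqIncludeTermCounts["count_1"] / wordCount
    prob2 = cofreqIncludeTermCounts["count_2"] / wordCount
    cofreqIncludeTermCounts["ppmi"] = np.maximum(0, np.log2(pairProb / (prob1 * prob2)))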

    cofreq_graph = cofreqIncludeTermCounts[["node1", "node2", "count", "ppmi"]]

    return cofreq_graph



def generateDatasetPkl():
    global examplePPMIRow

    print("Getting subsampled terms")
    singleton_freqs = subsampleTerms()
    print("Finished retrieving subsampled terms")    
    print('        ')

    print('Subsampling and generating PPMI cofrequency graph')
    cofreq_graph = pd.read_csv('../data/mappings/1_Cofrequencies/cofreqs_terms_perBin_1d.txt', delim_whitespace=True)
    cofreq_graph.columns = ["node1", "node2", "count"]
    
    print('Graph length before subsampling: {0}'.format(len(cofreq_graph.index)))
    cofreq_graph = cofreq_graph[cofreq_graph["node1"] != cofreq_graph["node2"]]
    cofreq_graph = cofreq_to_ppmi(cofreq_graph, singleton_freqs)
    print('Graph length after subsampling: {0}'.format(len(cofreq_graph.index)))

    # Remove all co-frequency edges where PPMI is 0
    cofreq_graph = cofreq_graph[cofreq_graph["ppmi"] > 0]
    print('      ')
    print('Final Graph length filtered to Postive PPMI: {0}'.format(len(cofreq_graph.index)))
    print('PPMI Max: {0}, Mean: {1}'.format(np.max(cofreq_graph["ppmi"]), np.mean(cofreq_graph["ppmi"])))
    print('      ')

    datasetDict = {}    
    
    node1 = cofreq_graph['node1'].to_numpy()
    node2 = cofreq_graph['node2'].to_numpy()
    ppmi_vals = cofreq_graph['ppmi'].to_numpy()

    for index in range(len(cofreq_graph.index)):
        id1 = node1[index]
        id2 = node2[index]
        ppmi = ppmi_vals[index]

        if id1 in datasetDict:
            datasetDict[id1].append((id2, ppmi))
        else:
            datasetDict[id1] = [(id2, ppmi)]
        if id2 in datasetDict:
            datasetDict[id2].append((id1, ppmi))
        else:
            datasetDict[id2] = [(id1, ppmi)]

    if writeDatasets:
        print("Writing co-frequency dataset")
        # Use a context manager so the file handle is closed after writing
        with open('../data/sym_data/sub_neighbors_dict_ppmi_perBin_1.pkl', "wb") as f:
            pickle.dump(datasetDict, f, protocol=-1)
    else:
        print("Skipping write of co-frequency dataset")


    # Store sample to visualize
    for term, adjList in datasetDict.items():
        examplePPMIRow = [term, adjList]
        break
    print("File written")

    # Split train test 90/10
    totalTermsShuffled = singleton_freqs.sample(frac=1)
    split = round(len(totalTermsShuffled.index) * 0.9)
    # iloc keeps the "node"/"count" columns; the test slice starts at `split` so no term is skipped
    trainTerms = totalTermsShuffled.iloc[:split].reset_index(drop=True)
    testTerms = totalTermsShuffled.iloc[split:].reset_index(drop=True)

    # Use train/test terms to build synonyms data - the labels for the final model
    createSynonymGraphs(trainTerms, testTerms)


def subsampleTerms():
    '''
    Subsample the terms list using the approach from the reference paper:
      Distributed Representations of Words and Phrases and their Compositionality
    '''
    
    singleton_freqs = pd.read_csv('../data/mappings/2_Singleton_Frequency_Counts/singlets_terms_perBin_1d.txt', delim_whitespace=True)
    singleton_freqs.columns = ["node", "count"]
    
    # Calculate term frequency
    singleton_freqs["freq"] = singleton_freqs["count"] / np.sum(singleton_freqs["count"])
    
    print('Terms before subsampling: {0}'.format(len(singleton_freqs.index)))
    
    # Perform subsampling: a term is removed with probability 1 - sqrt(t / freq)
    # Note that 10e-5 evaluates to 1e-4
    t = 10e-5
    singleton_freqs["prob"] = 1 - np.sqrt( t / singleton_freqs["freq"] )
    singleton_freqs["rand"] = np.random.rand(len(singleton_freqs.index))
    singleton_freqs["remove"] = singleton_freqs["prob"] >= singleton_freqs["rand"]
    singleton_freqs = singleton_freqs[~singleton_freqs["remove"]][["node", "count"]]

    print('Terms after subsampling: {0}'.format(len(singleton_freqs.index)))

    return singleton_freqs


# ======================================
# Build synonym graphs using Term to Concept mapping
# ======================================
def createSynonymGraphs(train_terms, test_terms):
    global exampleSynonym

    print("Building train/test synonym graphs")
    synonyms = pd.read_csv('../data/mappings/3_ID_Mappings/3_term_ID_to_concept_ID.txt', delim_whitespace=True)
    synonyms.columns = ["termID", "conceptID"]

    train_terms = train_terms.rename(columns={"node": "termID"})
    trainSyns = synonyms.merge(train_terms["termID"].to_frame(), how='inner', on='termID')

    test_terms = test_terms.rename(columns={"node": "termID"})
    testSyns = synonyms.merge(test_terms["termID"].to_frame(), how='inner', on='termID')

    # Build synonym edges for train
    termID = trainSyns['termID'].to_numpy()
    conceptID = trainSyns['conceptID'].to_numpy()

    synonymDict = {}
    # Group every term ID under its concept ID (all rows, including the last)
    for index in range(len(trainSyns.index)):
        synonymDict.setdefault(conceptID[index], []).append(termID[index])

    # Store sample to visualize
    synEdgesTrain = []
    for c, t in synonymDict.items():
        exampleSynonym = [c, t]
        break
    #

    for concept, termList in synonymDict.items():
        for term in termList:
            for adjTerm in termList:
                if term != adjTerm:
                    synEdgesTrain.append([term, adjTerm])


    # Build synonym edges for test
    termID = testSyns['termID'].to_numpy()
    conceptID = testSyns['conceptID'].to_numpy()

    synonymDict = {}
    # Group every term ID under its concept ID (all rows, including the last)
    for index in range(len(testSyns.index)):
        synonymDict.setdefault(conceptID[index], []).append(termID[index])

    synEdgesTest = []
    for concept, termList in synonymDict.items():
        for term in termList:
            for adjTerm in termList:
                if term != adjTerm:
                    synEdgesTest.append([term, adjTerm])
    

    trainDF = pd.DataFrame(synEdgesTrain, columns = ['term', 'adjterm'])
    testDF = pd.DataFrame(synEdgesTest, columns = ['term', 'adjterm'])

    trainG = nx.from_pandas_edgelist(trainDF, source="term", target="adjterm")
    testG = nx.from_pandas_edgelist(testDF, source="term", target="adjterm")
    
    if writeDatasets:
        print("Graphs are built, storing.")
        # Context managers close the file handles once each graph is written
        with open("../data/sym_data/train_graph_nx_perBin_1.pkl", 'wb') as f:
            pickle.dump(trainG, f, protocol=-1)
        with open("../data/sym_data/test_graph_nx_perBin_1.pkl", 'wb') as f:
            pickle.dump(testG, f, protocol=-1)
        print("Graphs Stored")
    else:
        print("Skipping storage of graphs.")

generateDatasetPkl()



Getting subsampled terms
Terms before subsampling: 56594
Terms after subsampling: 55985
Finished retrieving subsampled terms
        
Subsampling and generating PPMI cofrequency graph
Graph length before subsampling: 61824936
Graph length after subsampling: 46388854
      
Final Graph length filtered to Postive PPMI: 419062
PPMI Max: 9.072313000642843, Mean: 1.249204815728219
      
Skipping write of co-frequency dataset
File written
Building train/test synonym graphs
Skipping storage of graphs.
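

The log above shows subsampling removing only a small share of terms (56594 before, 55985 after). A minimal sketch of the removal rule from `subsampleTerms` suggests why: under this rule, only terms whose relative frequency exceeds the threshold `t` can be removed at all. The function name `removal_probability` and the frequencies are made up for illustration, and the clamp to zero is our addition (a negative probability never exceeds a random draw in [0, 1), so it does not change which terms are removed).

```python
import math


def removal_probability(freq, t=10e-5):
    """Probability of dropping a term whose relative frequency is `freq`.

    Mirrors 1 - sqrt(t / freq) from the pre-processing cell; negative values
    are clamped to 0 to make the sketch easier to read.
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))


# Made-up relative frequencies, from very common to rare.
for freq in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"freq={freq:.0e}  P(remove)={removal_probability(freq):.3f}")
```

Very frequent terms are removed most of the time, while any term at or below the threshold is always kept.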


## Examples of Processed Data
Below are example entries (mapped to the term/concept strings) of the generated datasets:
1. Subsampled PPMI Edges built from co-frequency graph
2. Synonym Labels (all terms under the same concept are deemed synonymous)
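
The synonym labels follow from grouping term IDs by concept ID and linking every pair of distinct terms within a group. The sketch below restates that idea on a toy mapping; the function name `synonym_edges` and the IDs are hypothetical, and it is a simplified illustration rather than the code used to build the stored graphs.

```python
from collections import defaultdict
from itertools import permutations


def synonym_edges(term_concept_pairs):
    """Group term IDs by concept ID and return ordered pairs of distinct terms."""
    groups = defaultdict(list)
    for term_id, concept_id in term_concept_pairs:
        groups[concept_id].append(term_id)

    edges = []
    for terms in groups.values():
        # Every ordered pair within a concept, skipping self-pairs
        edges.extend((a, b) for a, b in permutations(terms, 2) if a != b)
    return edges


# Hypothetical term ID -> concept ID rows: terms 1, 2 and 4 share a concept.
toy_mapping = [(1, "C1"), (2, "C1"), (3, "C2"), (4, "C1")]
print(synonym_edges(toy_mapping))
```

Term 3 is the only member of its concept, so it produces no edges and would not appear in the synonym graph.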

In [None]:
# Load TermID -> Term String mapping
termStringMapping = {}
with open('../data/mappings/3_ID_Mappings/1_term_ID_to_string.txt', "r", encoding = "ISO-8859-1") as file:
    for line in file:
        lineVals = line.split()
        termStringMapping[lineVals[0]] = ' '.join(lineVals[1:])


# Display Example of Term PPMIs
print('Term: {0}'.format(termStringMapping[str(examplePPMIRow[0])]))
print('Cofrequent Terms:')
for adj in examplePPMIRow[1]:
    print('     Term: {0}'.format(termStringMapping[str(adj[0])]))
    print('          ->: {0}'.format(adj[1]))


Term: isosorbide mononitrate
Cofrequent Terms:
     Term: extended release tablet
          ->: 0.7239638154357573
     Term: extended release
          ->: 0.01802295225513008
     Term: apresoline
          ->: 0.5778422400713806
     Term: isosorbide
          ->: 1.5449863771375318
     Term: imdur
          ->: 1.1324074241194695
     Term: isosorbide mononitrate 30 mg
          ->: 2.5297275357897813
     Term: ranolazine
          ->: 0.8105997967439126
     Term: subclavian steal syndrome
          ->: 0.3386792427817149
     Term: isosorbide mononitrate 60 mg
          ->: 2.5297275357897813


In [None]:
# Load ConceptID -> String mapping
conceptStringMapping = {}
with open('../data/mappings/3_ID_Mappings/2a_concept_ID_to_string.txt', "r", encoding = "ISO-8859-1") as file:
    for line in file:
        lineVals = line.split()
        conceptStringMapping[lineVals[0]] = ' '.join(lineVals[1:])

# Display Example of Concept Synonyms
print('Concept: {0}'.format(conceptStringMapping[str(exampleSynonym[0])]))
print('Synonym Terms:')
for syn in exampleSynonym[1]:
    print('     {0}'.format(termStringMapping[str(syn)]))

Concept: diagnosis
Synonym Terms:
     diagnosis
     diagnosed


## Training Phase 1
The first phase of training is context prediction. It uses the pre-trained embeddings (word and subword) to predict a term's global context. In this problem, co-frequency represents a term's global context, because it captures how the term is used alongside all other terms in the vocabulary. Predicting the context, rather than reading it from the co-frequency graph, allows SurfCon to handle query terms that are not in the vocabulary.

In [None]:
#Uncomment to train context prediction model
# Requires 
#!python -u main_pretrain.py --batch_size=10000 --learning_rate=0.001 --save_interval=200 --save_dir='./saved_models/saved_pretrained/' --ngram_embed_path='../data/embeddings/wiki-news-300d-1M-subword.vec'

args:  Namespace(embed_filename='../data/embeddings/glove.6B.100d.txt', ngram_embed_path='../data/embeddings/wiki-news-300d-1M-subword.vec', per='Bin', days='1', ngram_embed_dim=100, n_grams='2, 3, 4', node_embed_dim=128, word_hidden_dim=100, word_embed_dim=100, num_epochs=201, batch_size=10000, random_seed=43, dropout=0.5, log_interval=100, test_interval=1, early_stop_epochs=1000, learning_rate=0.001, save_best=True, save_dir='./saved_models/saved_pretrained/', save_interval=200, neg_sampling=False, num_negs=5)
Total number of candidates:  54060
Find 102937 grams with pretrain ratio: 0.37512264783314064
Find 84276 words with pretrain ratio: 0.7794745835113199
Model Parameters Stored!
ContextPredictionWordNGram(
  (ngrams_embeddings): Embedding(102938, 100)
  (w2v_embeddings): Embedding(84277, 100)
  (fc_out): Linear(in_features=200, out_features=128, bias=True)
  (context_out): Linear(in_features=128, out_features=54060, bias=False)
  (out): LogSoftmax(dim=1)
)
Training terms: 54060
0

## Training Phase 2
After training the context prediction model, the next step is to train the ranking model. It combines the inputs from a query term and a candidate term and predicts whether the two terms are synonyms. The surface-form inputs are combined directly, whereas the global contexts are combined through a dynamic matching algorithm. This algorithm generates a semantic vector for each term (query and candidate) and outputs a score based on their similarity.

In [None]:
#Uncomment to train ranking model
# Update --restore_model_path to the saved model epoch with best performance
# Required X RAM / GPU
#!python -u main_dym.py --use_context=True --restore_model_path='./saved_models/saved_pretrained/snapshot_epoch_200.pt' --min_epochs=2 --neg_sampling=True --num_contexts=50 --ngram_embed_path='../data/embeddings/wiki-news-300d-1M-subword.vec'

args:  Namespace(per='Bin', days='1', random_seed=42, num_oov=2000, re_sample_test=False, train_neg_num=50, test_neg_num=100, num_contexts=50, max_contexts=1000, context_gamma=0.3, ngram_embed_dim=100, n_grams='2, 3, 4', word_embed_dim=100, node_embed_dim=128, dropout=0, bi_out_dim=50, use_context=True, do_ctx_interact=True, num_epochs=3, log_interval=2000, test_interval=1, early_stop_epochs=10, metric='map', learning_rate=0.0001, min_epochs=2, clip_grad=5.0, lr_decay=0.05, embed_filename='../data/embeddings/glove.6B.100d.txt', node_embed_path='../data/embeddings/line2nd_ttcooc_embedding.txt', ngram_embed_path='../data/embeddings/wiki-news-300d-1M-subword.vec', restore_model_path='./saved_models/saved_pretrained/snapshot_epoch_200.pt', restore_idx_data='', logging=False, log_name='empty.txt', restore_model_epoch=600, save_best=True, save_dir='./saved_models', save_interval=5, random_test=True, neg_sampling=True, num_negs=5, rank_model_path=None)
********Key parameters:******
Use GPU? T
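

The arguments above include `metric='map'`. Assuming this stands for mean average precision over ranked candidate lists, the sketch below shows how such a score can be computed. The function names and the example rankings are ours, and the project's evaluation code may differ in details such as how each query is normalised.

```python
def average_precision(ranked_terms, true_synonyms):
    """Average precision of one ranked candidate list against a set of true synonyms."""
    if not true_synonyms:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, term in enumerate(ranked_terms, start=1):
        if term in true_synonyms:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Dividing by the number of true synonyms is one common convention (assumed here)
    return precision_sum / len(true_synonyms)


def mean_average_precision(queries):
    """Mean of the per-query average precision; `queries` holds (ranking, truth) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, t) for r, t in queries) / len(queries)


# Made-up rankings for two queries, for illustration only.
example = [
    (["imdur", "aspirin", "isosorbide", "statin"], {"imdur", "isosorbide"}),
    (["diagnosed", "diagnosis"], {"diagnosis"}),
]
print(round(mean_average_precision(example), 3))
```

A ranking that places all true synonyms ahead of every negative candidate scores 1.0, so the metric rewards pushing synonyms toward the top of the list.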

## Testing
After both phases of training are complete, a simple interface to the model takes a `query` term as input and predicts the 10 most likely synonym terms.
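
Conceptually, returning the most likely synonyms amounts to scoring every candidate term against the query and keeping the highest-scoring ones. The sketch below shows only that final selection step; `top_k_candidates` and the scores are hypothetical and do not reflect the interface of `main_testing.py`.

```python
import heapq


def top_k_candidates(scores, k=10):
    """Return the k (term, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical model scores for one query term.
candidate_scores = {
    "imdur": 0.91,
    "isosorbide": 0.87,
    "ranolazine": 0.42,
    "apresoline": 0.18,
}
for term, score in top_k_candidates(candidate_scores, k=3):
    print(f"{term}: {score:.2f}")
```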

In [None]:
# Uncomment to Query against model for synonyms
# Note when querying: results may be delayed by ~20 seconds
#!python -u main_testing.py --restore_model_path='./saved_models/saved_pretrained_fastext_nonzero/snapshot_epoch_5000.pt' --rank_model_path='./saved_models/rank_model_perBin_1/best_epoch_10.pt' --num_results=10 --cand_terms_path='' --use_context=True --ngram_embed_path='wiki-news-300d-1M-subword.vec' --neg_sampling=True

## Results

Below is the comparison of results between the work of original authors and our team. Note the below terms/shorthand:
- `InV`: Terms that are In Vocabulary. This means that the Query Term was a term present in the Co-Frequency graph dataset
- `Dissim`: A subset of synonymous terms that appear dissimilar. These form a harder subset of synonyms to identify because their surface forms are visually quite different.
- `Context`: The `predicted context` portion of the model architecture (`Training Phase 1`), which attempts to generate a Query Term's top co-occurring terms.

Interestingly, and unlike the original work, our results indicated slightly better model performance without the `Context Predictions` portion of the model. This may not hold for OOV terms, however, as we did not execute that portion of the experiment.

![Results Summary](../results/ResultsSummary.png "Results Summary")

## References
1. S. G. Finlayson, P. LePendu, and N. H. Shah. 2014. Building the graph of medicine from millions of clinical narratives. Scientific data 1 (2014), 140032.
2. H. J. Lowe, T. A. Ferris, P. M. Hernandez, and S. C. Weber. 2009. STRIDE–An integrated standards-based translational research informatics platform. In AMIA.
3. K. Hashimoto, Y. Tsuruoka, R. Socher, et al. 2017. A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks. In EMNLP.
4. J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In EMNLP.
5. P. Neculoiu, M. Versteegh, and M. Rotaru. 2016. Learning text similarity with siamese recurrent networks. In Workshop on Representation Learning for NLP.
6. M. Qu, X. Ren, and J. Han. 2017. Automatic synonym discovery with knowledge bases. In KDD.
7. SurfCon: Synonym Discovery on Privacy-Aware Clinical Data, https://dl.acm.org/doi/pdf/10.1145/3292500.3330894
8. W. Hamilton, Z. Ying, and J. Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS.
9. P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. 2018. Graph attention networks. In ICLR.
10. fastText embedding, http://christopher5106.github.io/deep/learning/2020/04/02/fasttext_pretrained_embeddings_subword_word_representations.html
11. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In NeurIPS.
