nlpie/modeldeid

Model De-ID

De-identification for co-occurrence models

This is a simple Java implementation of the distributional model de-identification strategy proposed in the 2016 AMIA Annual Symposium podium abstract "Automated De-Identification of Distributional Semantic Models" by Finley, Pakhomov, and Melton. See the abstract for a short description.

Current capabilities include building an allowed- or forbidden-words list and applying that as a filter to a word2vec model and its vocabulary. See the shell script for details on how to invoke model deidentification.
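As a rough illustration of what applying an allowed-words list to a model can look like, here is a minimal sketch. It assumes the model is stored in the plain-text word2vec format (a header line with vocabulary size and dimensionality, then one word and its vector per line). The class and method names are hypothetical and are not taken from this repository, whose own classes may handle the model differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not one of this repository's classes. */
final class ModelFilterSketch {

    /**
     * Keeps only the vectors whose word is in allowedWords.
     * Assumes the text word2vec format: a header "<vocabSize> <dimensions>"
     * followed by one "<word> <v1> ... <vd>" line per word.
     */
    static List<String> filterModel(List<String> modelLines, Set<String> allowedWords) {
        String[] header = modelLines.get(0).trim().split("\\s+");
        String dimensions = header[1];

        List<String> kept = new ArrayList<>();
        for (String line : modelLines.subList(1, modelLines.size())) {
            int firstSpace = line.indexOf(' ');
            if (firstSpace < 0) {
                continue; // skip empty or malformed lines
            }
            String word = line.substring(0, firstSpace);
            if (allowedWords.contains(word)) {
                kept.add(line);
            }
        }

        // Rewrite the header so the vocabulary size matches the filtered model.
        List<String> result = new ArrayList<>();
        result.add(kept.size() + " " + dimensions);
        result.addAll(kept);
        return result;
    }

    public static void main(String[] args) {
        List<String> model = List.of(
                "3 2",
                "fracture 0.1 0.2",
                "smith 0.3 0.4",
                "white 0.5 0.6");
        Set<String> allowed = Set.of("fracture", "white");
        filterModel(model, allowed).forEach(System.out::println);
    }
}
```

A forbidden-words list can be handled the same way by inverting the membership test.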

Given a word2vec model trained on clinical notes, the algorithm removes words that are not part of the SPECIALIST Lexicon (treated as potential PHI) and any words that are part of the "patient info database" (names and addresses of patients associated with the notes used to train the word2vec model). Our research indicates that model performance is minimally impacted by allowing exceptions for the 2,000 most common words in the patient database. Retaining these words in the model tends to account for homonyms like ‘white’, which can be a name as well as an ordinary word. This top-n inclusion parameter is configurable.

Expressed as pseudocode:

keep-word = top-n-corpus-word || ( specialist-lexicon-word && !phi-word )
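The rule above can be sketched in Java as a single predicate over three word sets. This is only an illustration under stated assumptions: the class name, method name, and the use of in-memory `Set<String>` lookups are hypothetical and are not taken from this repository's source.

```java
import java.util.Set;

/** Hypothetical sketch of the keep-word rule; not one of this repository's classes. */
final class KeepWordSketch {

    /**
     * Returns true if the word should stay in the model:
     * it is one of the top-n exception words, or it is in the
     * SPECIALIST Lexicon and not in the patient info database.
     */
    static boolean keepWord(String word,
                            Set<String> topNWords,
                            Set<String> specialistLexicon,
                            Set<String> phiWords) {
        return topNWords.contains(word)
                || (specialistLexicon.contains(word) && !phiWords.contains(word));
    }

    public static void main(String[] args) {
        // Toy word sets, invented purely for illustration.
        Set<String> topN = Set.of("white");
        Set<String> lexicon = Set.of("white", "fracture", "smith");
        Set<String> phi = Set.of("white", "smith");

        for (String w : new String[] {"white", "fracture", "smith", "xyzzy"}) {
            System.out.println(w + " -> " + keepWord(w, topN, lexicon, phi));
        }
    }
}
```

With these toy sets, "white" is kept because of the top-n exception, "fracture" is kept as a lexicon word that is not PHI, "smith" is dropped because it appears in the patient list, and "xyzzy" is dropped because it is not in the lexicon.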

Javadoc

You can find the API documentation for this project here.

Invoking the model scrubber

The bash script assumes that the Java source code has been compiled and that the paths to the compiled classes, the raw word2vec model, the patient names and addresses file, and the output model have been set in the shell script.

./deidentify_model.sh

Contact and Support

For issues or enhancement requests, please open an issue on the Issues tab on GitHub.

About Us

BioMedICUS is developed by the University of Minnesota Institute for Health Informatics NLP/IE Group.

Credits

Code in this repository was originally written by Greg Finley as part of his post-doctoral work at the University of Minnesota.
