Model De-ID

De-identification for co-occurrence models

This is a simple Java implementation of the distributional model de-identification strategy proposed in the 2016 AMIA Annual Symposium podium abstract "Automated De-Identification of Distributional Semantic Models" by Finley, Pakhomov, and Melton. See the abstract for a short description.

Current capabilities include building an allowed-words or forbidden-words list and applying it as a filter to a word2vec model and its vocabulary. See the shell script (deidentify_model.sh) for details on how to invoke model de-identification.
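As a rough illustration of the filtering step, the sketch below applies a keep/drop decision to a model stored in word2vec's plain-text format (a header line with the vocabulary size and vector dimensionality, then one word and its vector per line). The class name Word2VecTextFilter and its filter method are hypothetical and are not taken from this repository; the project's own classes may read a different format, such as the binary one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; not a class from this repository. */
final class Word2VecTextFilter {

    /**
     * Returns the lines of a filtered text-format model.
     * Assumes line 0 is the header "vocabSize dimensions" and every other
     * line is "word v1 v2 ...", separated by single spaces.
     */
    static List<String> filter(List<String> modelLines, Predicate<String> keepWord) {
        if (modelLines.isEmpty()) {
            throw new IllegalArgumentException("model has no header line");
        }
        String[] header = modelLines.get(0).trim().split("\\s+");
        if (header.length != 2) {
            throw new IllegalArgumentException("unexpected header: " + modelLines.get(0));
        }
        String dimensions = header[1];

        List<String> kept = new ArrayList<>();
        for (String line : modelLines.subList(1, modelLines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            int space = line.indexOf(' ');
            String word = space < 0 ? line : line.substring(0, space);
            if (keepWord.test(word)) {
                kept.add(line); // the vector is copied unchanged
            }
        }

        // The header must be rewritten to reflect the reduced vocabulary size.
        List<String> result = new ArrayList<>(kept.size() + 1);
        result.add(kept.size() + " " + dimensions);
        result.addAll(kept);
        return result;
    }
}
```

The predicate can wrap either kind of list: `allowed::contains` for an allowed-words list, or `w -> !forbidden.contains(w)` for a forbidden-words list.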

Given a word2vec model trained on clinical notes, the algorithm removes every word that is not part of the SPECIALIST Lexicon, as well as every word that appears in the "patient info database" (the names and addresses of patients associated with the notes used to train the word2vec model). Words in that database are treated as PHI. Our research indicates that model performance is minimally impacted by allowing exceptions for the 2,000 most common words in the patient database. Retaining these words in the model tends to account for homonyms like ‘white’, which is both an ordinary word and a surname. This top-n inclusion parameter is configurable.

Expressed as pseudocode:

def keep-word = top-n-corpus-word || ( specialist-lexicon-word && !phi-word )
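The same rule as a minimal Java sketch, including one way to build the top-n exception set. The class KeepWordRule, its method names, and the lowercase normalization are assumptions made for illustration, not the project's actual API; the three sets are assumed to hold lowercase words, and the sample words are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the keep-word rule; not a class from this repository. */
final class KeepWordRule {
    private final Set<String> topWords;          // top-n exceptions, always kept
    private final Set<String> specialistLexicon; // words in the SPECIALIST Lexicon
    private final Set<String> phiWords;          // words from the patient info database

    KeepWordRule(Set<String> topWords, Set<String> specialistLexicon, Set<String> phiWords) {
        this.topWords = topWords;
        this.specialistLexicon = specialistLexicon;
        this.phiWords = phiWords;
    }

    /** Returns the n most frequent tokens, lowercased; ties are broken alphabetically. */
    static Set<String> mostCommon(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Set<String> top = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (top.size() >= n) {
                break;
            }
            top.add(entry.getKey());
        }
        return top;
    }

    /** keep = top-n word OR (lexicon word AND NOT PHI word). */
    boolean keep(String word) {
        String w = word.toLowerCase(Locale.ROOT); // assumption: case-insensitive matching
        return topWords.contains(w)
                || (specialistLexicon.contains(w) && !phiWords.contains(w));
    }

    public static void main(String[] args) {
        Set<String> top = mostCommon(List.of("white", "white", "baker", "smith"), 1);
        KeepWordRule rule = new KeepWordRule(
                top,
                Set.of("white", "fever", "baker"),
                Set.of("white", "baker", "smith"));
        for (String w : List.of("white", "fever", "baker", "smith")) {
            System.out.println(w + " -> " + (rule.keep(w) ? "keep" : "remove"));
        }
    }
}
```

Passing n to mostCommon is where the top-n inclusion parameter would be set in this sketch; with n = 0 the exception set is empty and every PHI word is removed.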

Javadoc

You can find the API documentation for this project here.

Invoking the model scrubber

The bash script assumes that the Java source code has already been compiled and that the paths to the compiled classes, the raw word2vec model, the patient names and addresses file, and the output model have been set inside the script.

./deidentify_model.sh

Contact and Support

To report issues or request enhancements, use the Issues tab on GitHub.

About Us

BioMedICUS is developed by the University of Minnesota Institute for Health Informatics NLP/IE Group.

Credits

Code in this repository was originally written by Greg Finley as part of his post-doctoral work at the University of Minnesota.
