Skip to content

kflasch/term-predict

Repository files navigation

Term Prediction on Text Notes

This project is an assortment of tools used for term prediction in text notes. It was originally used specifically for term prediction on clinical notes. It is mainly based around scikit-learn.

Dependencies

python3, nltk, numpy, pandas, scikit-learn, empath, spaCy, scispaCy, liac-arff, matplotlib, wordcloud

Optional Dependencies

Weka, Java, python-weka-wrapper3

Installation

First, clone the repository:

git clone https://github.com/kflasch/term-predict.git

Next, install the dependencies. It is suggested to setup a venv (Python virtual environment) or equivalent:

python -m venv venv/term-predict
source venv/term-predict/bin/activate

and then use pip to install them. You can install the dependencies via the provided requirements.txt file:

pip install -r requirements.txt

Or you can install the dependencies individually:

pip install nltk
pip install numpy
pip install pandas
pip install scikit-learn
pip install liac-arff
pip install matplotlib
pip install wordcloud
pip install empath
pip install spacy
pip install scispacy

# Optionally, for Weka (requires Weka and Java to be installed separately):
pip install javabridge
pip install python-weka-wrapper3

Download data for NLTK as described at Installing NLTK Data. An easy way to do this (which typically downloads the data to ~/nltk_data) is:

python -m nltk.downloader all

Finally, install at least one of the scispaCy models for NER. You may want to confirm these URLs, find updates to them, or find additional models at: https://github.com/allenai/scispacy

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_scibert-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz

Usage

Most programs can be run with -h to see their options.

term_predict.py

The main program to run prediction and cross-validations across different classifiers.

Cross-Validation

Performs 10-fold cross-validation on defaults or given dataset.

Run cross-validation on default classifier (SVC) and default files:

./term_predict.py -c

Run cross-validation on specific classifier and default files:

./term_predict.py -c dtree

Run cross-validation on specific classifier and specific file:

./term_predict.py -c dtree data/sarcopenia_chunk_5_train.pkl

Prediction

Performs prediction using training and test datasets. If ratings exist (for chunk 5), these will be used to compute performance.

Run prediction on default classifier (SVC) and default files:

./term_predict.py -p

Run prediction on specific classifier and default files:

./term_predict.py -p dtree

Run prediction on specific classifier and specific file:

./term_predict.py -c dtree data/sarcopenia_chunk_5_train.pkl

create_datafiles.py

Creates pandas dataframes (as pkl files) and arff files form CSV notes to be used by term_predict.py. Also creates templates to be used for manual ratings. Files created are chunked by the setting chunk_sizes in config.py.

Create dataframe pkl files for use with term_predict.py using defaults:

./create_datafiles.py -p

Create dataframe pkl files for use with term_predict.py using specific csv file and data type:

./create_datafiles.py -p data/sarcopenia_training_data.csv train

Create arff files for use with weka_runner.py using defaults:

./create_datafiles.py -a

Create rating template file from test data:

./create_datafiles.py -r

notes_analyzer.py

Show information about CSV notes, ratings information, generate wordclouds.

Analyze training notes:

./notes_analyzer.py -t

Analyze test notes:

./notes_analyzer.py -s

Generate Wordcloud images:

./notes_analyzer.py -w

Show ratings for sentence chunks:

./notes_analyzer.py -r

empath_helper.py

Create categories to be used by Empath, and analyze notes with Empath. Categories must be created initially to use Empath category features, by running ./empath_helper.py -c

weka_runner.py

Used to run Weka on datafiles with python-weka-wrapper3.

misc

The misc directory contains a few unused files that may be of some interest, relating to using other libraries.

Configuration

Configuration settings are in config.py. Comments in the config file should help explain what each option does. Some of these options must be set before running, such as specifying data locations. Some of the options are listed below.

OptionDescriptionExample Value
termThe term used for prediction“sarcopenia”
maskReplaces term in training data””
chunk_sizesHow many sentences per chunk[5, 7, 9]
data_dirLocation of data files“data/”
train_dataCSV file of training data“data/training.csv”
test_dataCSV file of test data“data/test.csv”
anatomy_terms_fileList of anatomy terms to match“data/anatomy.txt”
ratings_dirLocation of ratings data“data/ratings/”
empath_categoriesEmpath categories used for feature[“fracture”, “frail”]
empath_cat_wordsWords used to build Empath categories{“fracture”: [“fracture”], “frail”: [“frail”, “frailty”]}
ner_transformFlag to create NER transformed textTrue
spacy_modelWhich spaCy model to use in NER“en_core_sci_sm”
oversample_amountHow many times to oversample training set0
use_feature_notetypeEnable note type featureTrue
use_feature_textlenEnable text length featureFalse
use_feature_empathEnable Empath featureFalse
use_feature_anatomy_termsEnable anatomy term matching featureTrue
plot_dtreeSave plot of decision treeTrue

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.


Kevin Flasch | kflasch.net

About

Term Prediction on Text Notes

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages