Term Prediction on Text Notes

This project is a collection of tools for term prediction in text notes. It was originally used for term prediction on clinical notes and is built mainly on scikit-learn.

Dependencies

python3, nltk, numpy, pandas, scikit-learn, empath, spaCy, scispaCy, liac-arff, matplotlib, wordcloud

Optional Dependencies

Weka, Java, python-weka-wrapper3

Installation

First, clone the repository:

git clone https://github.com/kflasch/term-predict.git

Next, install the dependencies. Setting up a venv (Python virtual environment) or equivalent is recommended:

python -m venv venv/term-predict
source venv/term-predict/bin/activate

Then install the dependencies with pip, using the provided requirements.txt file:

pip install -r requirements.txt

Or you can install the dependencies individually:

pip install nltk
pip install numpy
pip install pandas
pip install scikit-learn
pip install liac-arff
pip install matplotlib
pip install wordcloud
pip install empath
pip install spacy
pip install scispacy

# Optionally, for Weka (requires Weka and Java to be installed separately):
pip install javabridge
pip install python-weka-wrapper3

Download the NLTK data as described in the NLTK documentation ("Installing NLTK Data"). An easy way to do this (which typically downloads the data to ~/nltk_data) is:

python -m nltk.downloader all

Finally, install at least one of the scispaCy models for NER. You may want to confirm these URLs, find updates to them, or find additional models at: https://github.com/allenai/scispacy

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_scibert-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz

Usage

Most programs can be run with -h to see their options.

term_predict.py

The main program, which runs prediction and cross-validation across different classifiers.

Cross-Validation

Performs 10-fold cross-validation on the default dataset or on a given dataset.

Run cross-validation on default classifier (SVC) and default files:

./term_predict.py -c

Run cross-validation on specific classifier and default files:

./term_predict.py -c dtree

Run cross-validation on specific classifier and specific file:

./term_predict.py -c dtree data/sarcopenia_chunk_5_train.pkl
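
As a rough illustration of what 10-fold cross-validation with an SVC looks like in scikit-learn, here is a minimal sketch. It is not taken from term_predict.py: the toy notes, the labels, and the TF-IDF pipeline are assumptions made purely for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data (assumption): 20 notes per class, so every fold can hold both classes.
positive = [f"patient {i} shows muscle loss and reduced grip strength" for i in range(20)]
negative = [f"patient {i} reports seasonal allergies and a mild cough" for i in range(20)]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Vectorize the text and classify it in one pipeline, so the vectorizer
# is re-fitted on the training portion of each fold.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# cv=10 requests 10-fold cross-validation; one accuracy score per fold.
scores = cross_val_score(model, texts, labels, cv=10)
print(f"folds: {len(scores)}, mean accuracy: {scores.mean():.3f}")
```

Wrapping the vectorizer in the pipeline keeps vocabulary from the held-out fold from leaking into training.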

Prediction

Performs prediction using training and test datasets. If ratings exist (for chunk 5), these will be used to compute performance.

Run prediction on default classifier (SVC) and default files:

./term_predict.py -p

Run prediction on specific classifier and default files:

./term_predict.py -p dtree

Run prediction on specific classifier and specific file:

./term_predict.py -p dtree data/sarcopenia_chunk_5_train.pkl
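
The sketch below illustrates the idea of computing performance only when ratings are available. The function name score_predictions and the 0/1 label encoding are hypothetical and are not taken from term_predict.py.

```python
def score_predictions(predictions, ratings=None):
    """Return precision, recall and F1 for binary labels, or None without ratings."""
    if not ratings:
        return None  # no manual ratings available, nothing to score against
    if len(predictions) != len(ratings):
        raise ValueError("predictions and ratings must have the same length")

    pairs = list(zip(predictions, ratings))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


predicted = [1, 0, 1, 1, 0]  # toy classifier output
rated = [1, 0, 0, 1, 1]      # toy manual ratings

print(score_predictions(predicted, rated))  # metrics, since ratings exist
print(score_predictions(predicted))         # None, since no ratings were given
```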

create_datafiles.py

Creates pandas dataframes (as pkl files) and arff files from CSV notes to be used by term_predict.py. Also creates templates to be used for manual ratings. The created files are chunked according to the chunk_sizes setting in config.py.

Create dataframe pkl files for use with term_predict.py using defaults:

./create_datafiles.py -p

Create dataframe pkl files for use with term_predict.py using specific csv file and data type:

./create_datafiles.py -p data/sarcopenia_training_data.csv train

Create arff files for use with weka_runner.py using defaults:

./create_datafiles.py -a

Create rating template file from test data:

./create_datafiles.py -r
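
To give a feel for what chunking by chunk_sizes means, here is a small sketch that groups sentences into chunks of a fixed number of sentences. It assumes consecutive, non-overlapping chunks; the helper chunk_sentences is hypothetical and the chunking in create_datafiles.py may work differently.

```python
def chunk_sentences(sentences, chunk_size):
    """Group sentences into consecutive, non-overlapping chunks of chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


chunk_sizes = [5, 7, 9]  # example value listed under Configuration
sentences = [f"Sentence {n}." for n in range(1, 13)]  # 12 toy sentences

for size in chunk_sizes:
    chunks = chunk_sentences(sentences, size)
    # The last chunk may be shorter when the sentences do not divide evenly.
    print(f"chunk size {size}: {len(chunks)} chunks")
```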

notes_analyzer.py

Shows information about CSV notes and ratings, and generates word clouds.

Analyze training notes:

./notes_analyzer.py -t

Analyze test notes:

./notes_analyzer.py -s

Generate Wordcloud images:

./notes_analyzer.py -w

Show ratings for sentence chunks:

./notes_analyzer.py -r

empath_helper.py

Creates the categories used by Empath and analyzes notes with Empath. Before Empath category features can be used, the categories must be created once by running: ./empath_helper.py -c

weka_runner.py

Runs Weka on the data files using python-weka-wrapper3.

misc

The misc directory contains a few unused files that may be of some interest, relating to using other libraries.

Configuration

Configuration settings are in config.py. Comments in the config file should help explain what each option does. Some of these options must be set before running, such as specifying data locations. Some of the options are listed below.

Option	Description	Example Value
term	The term used for prediction	"sarcopenia"
mask	Replaces term in training data	""
chunk_sizes	How many sentences per chunk	[5, 7, 9]
data_dir	Location of data files	"data/"
train_data	CSV file of training data	"data/training.csv"
test_data	CSV file of test data	"data/test.csv"
anatomy_terms_file	List of anatomy terms to match	"data/anatomy.txt"
ratings_dir	Location of ratings data	"data/ratings/"
empath_categories	Empath categories used for feature	["fracture", "frail"]
empath_cat_words	Words used to build Empath categories	{"fracture": ["fracture"], "frail": ["frail", "frailty"]}
ner_transform	Flag to create NER transformed text	True
spacy_model	Which spaCy model to use in NER	"en_core_sci_sm"
oversample_amount	How many times to oversample training set	0
use_feature_notetype	Enable note type feature	True
use_feature_textlen	Enable text length feature	False
use_feature_empath	Enable Empath feature	False
use_feature_anatomy_terms	Enable anatomy term matching feature	True
plot_dtree	Save plot of decision tree	True
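
To make the table concrete, here is a sketch of how a few of these options could look as Python assignments, using the example values above. The actual layout of config.py may differ. The mask_term helper is hypothetical: it only illustrates one plausible reading of the mask option, replacing the term case-insensitively.

```python
import re

# Example values copied from the table above; adjust the paths to your data.
term = "sarcopenia"
mask = ""
chunk_sizes = [5, 7, 9]
data_dir = "data/"
train_data = "data/training.csv"
test_data = "data/test.csv"
spacy_model = "en_core_sci_sm"
oversample_amount = 0
use_feature_notetype = True
use_feature_empath = False


def mask_term(text, term, mask):
    """Hypothetical helper: replace every occurrence of term with mask."""
    return re.sub(re.escape(term), mask, text, flags=re.IGNORECASE)


# With an empty mask, the term is simply removed from the text.
print(mask_term("Findings consistent with Sarcopenia.", term, mask))
```

Masking the term in the training text keeps a classifier from learning the trivial rule that the term itself predicts the label.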

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.

Kevin Flasch | kflasch.net
