This project is a collection of tools for term prediction in text notes. It was originally used for term prediction on clinical notes and is built mainly on scikit-learn.
Required: python3, nltk, numpy, pandas, scikit-learn, empath, spaCy, scispaCy, liac-arff, matplotlib, wordcloud
Optional (for Weka support): Weka, Java, python-weka-wrapper3
First, clone the repository:
git clone https://github.com/kflasch/term-predict.git
Next, install the dependencies. Setting up a venv (Python virtual environment) or equivalent is suggested:
python -m venv venv/term-predict
source venv/term-predict/bin/activate
Then install the dependencies with pip, either from the provided requirements.txt file:
pip install -r requirements.txt
or individually:
pip install nltk
pip install numpy
pip install pandas
pip install scikit-learn
pip install liac-arff
pip install matplotlib
pip install wordcloud
pip install empath
pip install spacy
pip install scispacy
# Optionally, for Weka (requires Weka and Java to be installed separately):
pip install javabridge
pip install python-weka-wrapper3
Download data for NLTK as described at Installing NLTK Data. An easy way to do this (which typically downloads the data to ~/nltk_data) is:
python -m nltk.downloader all
Finally, install at least one of the scispaCy models for NER. You may want to confirm these URLs, find updates to them, or find additional models at: https://github.com/allenai/scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_sm-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_lg-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_core_sci_scibert-0.4.0.tar.gz
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz
Most programs can be run with -h to see their options.
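Once installation is complete, a quick check that the required Python packages are importable can save some debugging later. The following is a minimal sketch using only the standard library, not part of the project itself; the helper name `missing_modules` is hypothetical, and the module list simply mirrors the required dependencies named earlier (note that scikit-learn imports as `sklearn` and liac-arff as `arff`).

```python
import importlib.util

# Import names for the required packages. Some differ from the pip
# package name: scikit-learn imports as "sklearn", liac-arff as "arff".
REQUIRED_MODULES = [
    "nltk", "numpy", "pandas", "sklearn", "arff",
    "matplotlib", "wordcloud", "empath", "spacy", "scispacy",
]


def missing_modules(names):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only confirms that each package can be located; it does not check versions or whether the NLTK data and scispaCy models have been downloaded.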
term_predict.py is the main program. It runs prediction and cross-validation across different classifiers.
With -c, it performs 10-fold cross-validation on the default dataset or on a given dataset.
./term_predict.py -c
./term_predict.py -c dtree
./term_predict.py -c dtree data/sarcopenia_chunk_5_train.pkl
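For readers unfamiliar with the technique, 10-fold cross-validation with scikit-learn can be sketched as below. This is an illustration under assumptions, not the code in term_predict.py: it assumes `dtree` refers to a decision tree classifier, uses synthetic data in place of the project's pkl files, and picks F1 as the score arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the project builds its features
# from the pkl files produced by create_datafiles.py.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = DecisionTreeClassifier(random_state=0)

# With an integer cv and a classifier, scikit-learn uses stratified k-fold
# splitting, so each of the 10 folds keeps roughly the same class balance.
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")

print("F1 per fold:", scores.round(3))
print("Mean F1:", round(scores.mean(), 3))
```

Each sample is used for testing exactly once, so the mean score is a less optimistic estimate than evaluating on the training data.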
With -p, it performs prediction using training and test datasets. If ratings exist (for chunk 5), they are used to compute performance.
./term_predict.py -p
./term_predict.py -p dtree
./term_predict.py -p dtree data/sarcopenia_chunk_5_train.pkl
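How performance might be computed from ratings can be sketched as follows. This is an assumption-laden illustration: it treats both manual ratings and predictions as binary labels (1 = term applies), and the function name `binary_metrics` and the sample values are hypothetical. The project may report different metrics.

```python
def binary_metrics(ratings, predictions):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(ratings, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical manual ratings and classifier predictions for six test chunks.
ratings = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(ratings, predictions)
print(metrics)
```

The guards against division by zero matter in practice, since a small rated sample may contain no positive ratings or no positive predictions.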
create_datafiles.py creates pandas DataFrames (as pkl files) and ARFF files from CSV notes,
for use by term_predict.py. It also creates templates for
manual ratings. The files created are chunked according to the chunk_sizes setting
in config.py.
./create_datafiles.py -p
./create_datafiles.py -p data/sarcopenia_training_data.csv train
./create_datafiles.py -a
./create_datafiles.py -r
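The idea of chunking notes by sentence count can be sketched as below. This is not the project's implementation: it assumes chunks are consecutive and non-overlapping, and it swaps in a naive regex sentence splitter so the example has no external dependencies (the project may well use NLTK's tokenizer instead). The function names and the sample note are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, chunk_size):
    """Group sentences into consecutive, non-overlapping chunks."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


# Hypothetical note text containing seven short sentences.
note = ("Pt seen today. Reports weakness. Gait is slow. No falls. "
        "Diet reviewed. Follow up in 3 months. Labs ordered.")
chunk_sizes = [5, 7, 9]  # example value of chunk_sizes from config.py

for size in chunk_sizes:
    chunks = chunk_sentences(split_sentences(note), size)
    print(size, len(chunks))
```

The last chunk of a note may be shorter than the requested size; whether to keep or drop such partial chunks is a design choice this sketch resolves by keeping them.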
notes_analyzer.py shows information about CSV notes and ratings, and generates word clouds.
./notes_analyzer.py -t
./notes_analyzer.py -s
./notes_analyzer.py -w
./notes_analyzer.py -r
empath_helper.py creates the categories used by Empath and analyzes notes with
Empath. To use Empath category features, the categories must first be
created by running ./empath_helper.py -c
Used to run Weka on datafiles with python-weka-wrapper3.
The misc directory contains a few unused files that may be of some interest, relating to using other libraries.
Configuration settings are in config.py. Comments in the config file should help explain what each option does. Some of these options must be set before running, such as specifying data locations. Some of the options are listed below.
| Option | Description | Example Value |
|---|---|---|
| term | The term used for prediction | "sarcopenia" |
| mask | Replaces term in training data | "" |
| chunk_sizes | How many sentences per chunk | [5, 7, 9] |
| data_dir | Location of data files | "data/" |
| train_data | CSV file of training data | "data/training.csv" |
| test_data | CSV file of test data | "data/test.csv" |
| anatomy_terms_file | List of anatomy terms to match | "data/anatomy.txt" |
| ratings_dir | Location of ratings data | "data/ratings/" |
| empath_categories | Empath categories used for feature | ["fracture", "frail"] |
| empath_cat_words | Words used to build Empath categories | {"fracture": ["fracture"], "frail": ["frail", "frailty"]} |
| ner_transform | Flag to create NER transformed text | True |
| spacy_model | Which spaCy model to use in NER | "en_core_sci_sm" |
| oversample_amount | How many times to oversample training set | 0 |
| use_feature_notetype | Enable note type feature | True |
| use_feature_textlen | Enable text length feature | False |
| use_feature_empath | Enable Empath feature | False |
| use_feature_anatomy_terms | Enable anatomy term matching feature | True |
| plot_dtree | Save plot of decision tree | True |
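As an illustration, a few of these options might look like this in config.py. The fragment assumes the options are plain module-level assignments, which the table suggests but does not confirm; the values are the example values from the table, not recommended settings.

```python
# Illustrative config.py fragment (layout is an assumption).
term = "sarcopenia"              # the term to predict
chunk_sizes = [5, 7, 9]          # sentences per chunk
data_dir = "data/"
train_data = "data/training.csv"
test_data = "data/test.csv"
ner_transform = True             # create NER-transformed text
spacy_model = "en_core_sci_sm"   # must match an installed scispaCy model
oversample_amount = 0            # 0 disables oversampling of the training set
use_feature_notetype = True
use_feature_empath = False
plot_dtree = True                # save a plot of the decision tree
```

The data locations in particular need to be set before running any of the programs.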
Distributed under the GNU General Public License v3.0. See LICENSE for more information.
Kevin Flasch | kflasch.net