Outbreak.info Resource library topic classifier

These scripts classify publications in outbreak.info into broad topicCategories. They work with Python versions 3.6 through 3.8.5.
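A quick way to check whether the current interpreter falls in that range is sketched below. This is not part of the repository; the helper name `is_supported` and the constants are assumptions made for illustration, and the range is taken as inclusive on both ends.

```python
import sys

# Supported range as stated in this README (assumed inclusive on both ends).
MIN_VERSION = (3, 6, 0)
MAX_VERSION = (3, 8, 5)


def is_supported(version=None):
    """Return True if a (major, minor, micro) tuple lies in the supported range.

    Hypothetical helper: defaults to the running interpreter's version.
    """
    if version is None:
        version = sys.version_info[:3]
    return MIN_VERSION <= tuple(version[:3]) <= MAX_VERSION


if __name__ == "__main__":
    if not is_supported():
        # Warn rather than exit: other versions are untested, not known to fail.
        print("Warning: untested Python version", sys.version.split()[0])
```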

refresh_annotations.py - refreshes all the topicCategory annotations, i.e. runs all the scripts needed to get the topicCategory for all relevant records. A full run took ~1.5 hrs as of 2022.09.07 (~250K records). This should be run on a regular basis.

update_annotations.py - checks for new records and generates topicCategory annotations for them. DO NOT USE: the refresh function has been made more efficient, so this script no longer saves much time.

update_training_data.py - pulls LitCovid categories and other training data generated from a Clinical Trials mapping (/src/clin_mapping.py), keyword searches (/data/keywords, /data/subtopics/keywords), and curator-labeled data (/data/subtopics/curated_training_df.pickle). Run this before updating a model.
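To illustrate the keyword-search idea only, here is a minimal sketch of labeling records by keyword match. It is not the repository's implementation: the function name `label_by_keywords`, the record fields (`_id`, `name`, `abstract`), and the category names are all assumptions.

```python
def label_by_keywords(records, keywords_by_category):
    """Assign each record every category whose keywords appear in its text.

    records: list of dicts with hypothetical "_id", "name", "abstract" keys.
    keywords_by_category: dict mapping a category name to a list of keywords.
    Returns a dict mapping _id -> sorted list of matched categories.
    """
    labels = {}
    for record in records:
        # Case-insensitive substring search over title and abstract.
        text = f"{record.get('name', '')} {record.get('abstract', '')}".lower()
        matched = sorted(
            category
            for category, keywords in keywords_by_category.items()
            if any(keyword.lower() in text for keyword in keywords)
        )
        if matched:
            labels[record["_id"]] = matched
    return labels
```

Substring matching like this is crude (it ignores word boundaries), which is one reason keyword-derived labels are usually combined with curated data.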

update_models.py - retrains the model(s) on the updated training data. This is slow and computationally expensive, so do it only when changes in the training data are expected to improve the model or when there are new categories. Default test methods are included, and the test results can be found at /results/in_depth_classifier_test.tsv.

mutvariant_extraction.py - has regex methods for extracting suspected mutations and variants, but the resulting content (/results/lineages.tsv, /results/mutations.tsv, /results/polymorphisms.tsv) has not been sufficiently tested for merging/inclusion into the resources API.
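As a rough idea of what regex-based extraction can look like, the sketch below matches amino-acid substitutions (e.g. D614G) and Pango-style lineage names (e.g. B.1.1.7). The patterns and the `extract_candidates` function are assumptions for illustration and are not the expressions used in the script; they will miss many real-world notations.

```python
import re

# Illustrative patterns only; the repository's own expressions may differ.
# Amino-acid substitution: reference residue, position, new residue (e.g. D614G).
AA = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b([{AA}])(\d{{1,4}})([{AA}])\b")
# Pango-style lineage: one or two capital letters, then dotted numbers (e.g. B.1.1.7).
LINEAGE_RE = re.compile(r"\b([A-Z]{1,2}(?:\.\d+){1,3})\b")


def extract_candidates(text):
    """Return sorted, de-duplicated suspected mutations and lineages found in text."""
    mutations = sorted({m.group(0) for m in MUTATION_RE.finditer(text)})
    lineages = sorted({m.group(0) for m in LINEAGE_RE.finditer(text)})
    return {"mutations": mutations, "lineages": lineages}
```

Matches from patterns like these are only candidates: a string such as a catalog number can have the same shape as a substitution, so the output needs validation before it is merged anywhere.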

The topicCategories are generally not parsed from any original source, so there is no risk of accidentally overwriting existing topicCategory data. Hence, the classifier results may be added via the BioThings SDK's default merging tools, using a dummy plugin for the generated data.
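The merge itself amounts to attaching one new field to each record by ID. The sketch below shows that idea in plain Python; it does not use or reproduce the BioThings SDK API, and the function name `merge_topic_categories` and the record shape are assumptions.

```python
def merge_topic_categories(records, annotations):
    """Attach classifier output to records by _id without touching other fields.

    records: list of dicts, each with an "_id" key (hypothetical shape).
    annotations: dict mapping _id -> list of topicCategory strings.
    Returns new dicts; the input records are not mutated.
    """
    merged = []
    for record in records:
        updated = dict(record)  # shallow copy so the input is left as-is
        categories = annotations.get(record["_id"])
        # Only add the field when it is absent, so existing data is never overwritten.
        if categories and "topicCategory" not in updated:
            updated["topicCategory"] = list(categories)
        merged.append(updated)
    return merged
```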
