
itayair/hierarchybuilder


Hierarchy Builder package

The Hierarchy Builder Python package organizes and visualizes a large collection of related textual strings as a hierarchical directed acyclic graph (DAG) for exploratory search. The package currently targets biomedical data; support for other domains is planned, by allowing users to add their own dictionaries and taxonomic relations datasets for a specific domain.

Showcase Demonstration

Video Illustration

For a visual demonstration, please refer to the following video: https://vimeo.com/801853889

Online Demonstration

To experience the system under various conditions, an online demonstration is accessible at: http://35.204.179.34:3001/

Installation

To install the Hierarchy Builder package, please follow the steps below:

  1. Install the package via pip:
pip install hierarchybuilder
  2. Install the following dependency graph parser:
pip install https://storage.googleapis.com/en_ud_model/en_ud_model_sm-2.0.0.tar.gz

Optional Tools

You can download the UMLS data, which is a valuable resource for biomedical text analysis:

  • MRCONSO.RRF - synonyms dictionary
  • MRREL.RRF - taxonomic relations dataset

Usage

To use the Hierarchy Builder package, follow these steps:

UMLS

Optionally, you can use UMLS to improve results in the biomedical domain. To do so, start the UMLS server first, as follows (loading the data takes a while):

from hierarchybuilder.UMLS import umls_services
# The default values are:
# host="127.0.0.1", port=5000, umls_relations_file_path='../UMLS_data/MRREL.RRF', 
# umls_synonymous_file_path='../UMLS_data/MRCONSO.RRF'
umls_services.create_umls_server()
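
Because the server takes a while to load, it can help to confirm that the two UMLS files are where the defaults expect them before starting it. The following is a minimal sketch, not part of the package: the helper name missing_umls_files is hypothetical, and the default paths are simply the ones quoted in the comment above.

```python
from pathlib import Path
from typing import List

# Hypothetical helper (not part of hierarchybuilder): report which UMLS
# files are absent. The defaults mirror the paths quoted in the README.
def missing_umls_files(
    relations_path: str = "../UMLS_data/MRREL.RRF",
    synonyms_path: str = "../UMLS_data/MRCONSO.RRF",
) -> List[str]:
    """Return the given paths that do not point to an existing file."""
    return [p for p in (relations_path, synonyms_path) if not Path(p).is_file()]

missing = missing_umls_files()
if missing:
    print("Missing UMLS files:", ", ".join(missing))
else:
    print("Both UMLS files found; the server can be started.")
```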

Run hierarchy builder

import hierarchybuilder.hierarchy_builder as hierarchy_builder
# Each example is a (sentence, span) tuple.
examples = [("sentence1", "span in sentence1"), ("sentence2", "span in the sentence2"), ...]
# The default values are:
# entries_number=50, ignore_words=None, device="", umls_host="127.0.0.1", usml_port=5000, has_umls_server=False
json_output = hierarchy_builder.hierarchy_builder(examples=examples, entries_number=50)

The hierarchy_builder function takes a list of (sentence, span) tuples through the examples parameter and uses them to build a DAG that organizes the spans in a hierarchical form. The entries_number parameter sets the number of entries displayed at the top level of the resulting DAG. The ignore_words parameter is a list of words to exclude from the top-level entries, usually words that appear in the query. The has_umls_server parameter indicates whether a UMLS server is in use, and the umls_host and usml_port parameters give the address of that server.
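
As an illustration of preparing these inputs, the sketch below builds the examples list and an ignore_words list in plain Python. It rests on assumptions that the source does not state: that each span should occur verbatim in its sentence, and that splitting the query on whitespace is a reasonable way to obtain ignore_words. The helper names and the sample sentences are made up for the example.

```python
from typing import Iterable, List, Tuple

Example = Tuple[str, str]  # (sentence, span)

# Hypothetical helper: keep only pairs whose span occurs verbatim in the
# sentence (an assumption about what a valid example looks like).
def build_examples(pairs: Iterable[Example]) -> List[Example]:
    examples = []
    for sentence, span in pairs:
        span = span.strip()
        if span and span in sentence:
            examples.append((sentence, span))
    return examples

# Hypothetical helper: lowercase query words, duplicates removed, order kept.
def ignore_words_from_query(query: str) -> List[str]:
    words: List[str] = []
    for word in query.lower().split():
        if word not in words:
            words.append(word)
    return words

# Made-up sample data, for illustration only.
raw_pairs = [
    ("Aspirin is a common treatment for headache.", "a common treatment"),
    ("Rest may also help.", "ibuprofen"),  # span not in sentence, dropped
]
examples = build_examples(raw_pairs)
ignore_words = ignore_words_from_query("treatment for headache")

# With the package installed, the call would then look like:
# json_output = hierarchy_builder.hierarchy_builder(
#     examples=examples, entries_number=50, ignore_words=ignore_words)
```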

Output

The output of the hierarchy_builder function is a JSON file that represents the hierarchical structure of the input sentences and spans. The structure is defined as a Directed Acyclic Graph (DAG) where each node represents a concept. Each node in the DAG is defined by the following properties:

  • label - the concept that the node represents (the key of the node's entry)
  • aliases - aliases of the concept
  • sources_number - the total number of input spans that are associated with the node and its children
  • aliases_sources_number - the number of input spans that are fully represented by one of the aliases of the node
  • sentences - a list of the input sentences whose spans are fully represented by one of the aliases of the node
  • children - nodes that represent more specific concepts

For example, a node in the JSON file might look like this:

{
    "drug": 
            {
                "aliases": ["drug", "agent", "drugs", "these agents", "the other agents", "this drug", "these drugs", "the drug"],
                "sources_number": 32,
                "aliases_sources_number": 8,
                "sentences": ["..."],
                "children": {"label_child_1":"..."}
            }
}

In this example, the label of the node is drug, and its aliases include agent, drugs, these agents, and the other agents. The total number of input spans associated with drug and its children is 32, and the number of spans fully represented by one of its aliases is 8.
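
To show how this structure can be consumed, here is a minimal traversal sketch. It assumes the output is either a JSON string or an already-parsed dict shaped like the node above; the function names are hypothetical, and the child node and its counts in the sample are invented for illustration. Because the structure is a DAG, a node reachable from several parents would be visited once per parent.

```python
import json
from typing import Any, Dict, Iterator, Tuple, Union

# Hypothetical helper: accept either a JSON string or a parsed dict.
def load_hierarchy(output: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    return json.loads(output) if isinstance(output, str) else output

# Hypothetical helper: depth-first walk yielding (depth, label, node).
def walk(nodes: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, str, Dict[str, Any]]]:
    for label, node in nodes.items():
        if not isinstance(node, dict):
            continue  # skip placeholders such as "..."
        yield depth, label, node
        children = node.get("children")
        if isinstance(children, dict):
            yield from walk(children, depth + 1)

# Sample modeled on the node above; the child is invented for illustration.
sample = json.dumps({
    "drug": {
        "aliases": ["drug", "agent", "drugs"],
        "sources_number": 32,
        "aliases_sources_number": 8,
        "sentences": ["..."],
        "children": {
            "antiviral drug": {
                "aliases": ["antiviral drug"],
                "sources_number": 5,
                "aliases_sources_number": 5,
                "sentences": ["..."],
                "children": {},
            }
        },
    }
})

for depth, label, node in walk(load_hierarchy(sample)):
    print("  " * depth + f"{label} ({node.get('sources_number')} sources)")
```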

Conclusion

The Hierarchy Builder package is a tool for organizing and exploring large collections of related textual strings. It provides a single function that generates a DAG organizing the strings in a hierarchical structure. The package is especially useful for exploring biomedical data, and it is intended to be extended to other domains by allowing users to add their own dictionaries and taxonomic relations datasets.
