This repository has been archived by the owner on Jul 23, 2022. It is now read-only.

Term Occurrence

Jennifer Vendetti edited this page Feb 28, 2015 · 19 revisions

The BioPortal software generates a dictionary of terms from preferred labels and synonyms for all ontologies in the BioPortal application. The software then creates a set of data files containing the number of times dictionary terms occur, both as singletons and in pairs, for each resource in the NCBO Resource Index. This page describes the content of these files.

For more information about how this data could be used in a research setting, refer to "Building the graph of medicine from millions of clinical narratives" by Samuel G. Finlayson, Paea LePendu, & Nigam H. Shah.

Data files

There are three data files associated with each resource in the Resource Index.

Sampling rate file

The sampling rate file is a simple text file that contains:

  • resource acronym
  • total number of documents in the resource
  • sampling rate used to create term singleton and co-frequency counts

The following is an example of the sampling rate file for the ArrayExpress resource:

 Acronym: AE
 Total documents: 48565  
 Sampling rate: 1

A sampling rate of 1 means that every document in the resource is visited during calculation of term singleton and co-frequency counts. Sampling rates larger than 1 indicate that counts are calculated from a subset of documents. For example, a sampling rate of 100 means that the system only visits one document out of every 100.
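As a minimal sketch of how the sampling rate file might be read, the Python below parses the three fields and estimates how many documents were visited. The "Key: value" layout is inferred from the ArrayExpress example above, the function names are hypothetical, and the integer division is only an approximation of "one document out of every N".

```python
def parse_sampling_rate(text):
    """Parse 'Key: value' lines into acronym, total documents, and sampling rate.

    Assumes the layout shown in the ArrayExpress example; other layouts
    are not handled.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        key, value = line.split(":", 1)
        fields[key.strip().lower()] = value.strip()
    return {
        "acronym": fields["acronym"],
        "total_documents": int(fields["total documents"]),
        "sampling_rate": int(fields["sampling rate"]),
    }


def approx_documents_visited(info):
    """Estimate the number of documents visited (assumption: 1 in every N)."""
    return info["total_documents"] // info["sampling_rate"]


example = """ Acronym: AE
 Total documents: 48565
 Sampling rate: 1
"""

info = parse_sampling_rate(example)
print(info)
print(approx_documents_visited(info))
```

With a sampling rate of 1 the estimate equals the total document count; with a rate of 100 it would be roughly one hundredth of it.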

Singleton frequency counts file

The singleton frequency data file contains the total number of times dictionary terms occur in the documents for a particular resource. In the example below, the term "above knee amputation" occurs once across the set of documents, "abscess" occurs 6 times, and so on.

 1  above knee amputation
 1  abrasion
 6  abscess
 4  absence
 3  absence of
 4  absent
 1  absolute
11  absorb
 5  absorbing

File format: tab-delimited; column 1 = frequency count, column 2 = term. Files are compressed with gzip. To decompress a file, use the gunzip command: gunzip filename.tsv.gz.
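The singleton file can also be read without decompressing it first. The following is a sketch only: it assumes UTF-8 text and the two-column, tab-delimited layout described above, and the function names are hypothetical.

```python
import gzip


def parse_singleton_lines(lines):
    """Turn 'count<TAB>term' lines into a {term: count} dictionary."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        count, term = line.rstrip("\r\n").split("\t", 1)
        # int() tolerates the leading padding seen in the example rows
        counts[term.strip()] = int(count)
    return counts


def read_singleton_counts(path, encoding="utf-8"):
    """Read a gzip-compressed singleton file (encoding is an assumption)."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return parse_singleton_lines(handle)


# Rows taken from the example above, with explicit tab separators
sample = [" 1\tabove knee amputation\n", " 6\tabscess\n", "11\tabsorb\n"]
print(parse_singleton_lines(sample))
```

A call such as read_singleton_counts("filename.tsv.gz") would return the same kind of dictionary for a whole file. If a term were listed more than once, this sketch would keep only the last count.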

Co-frequency counts file

The co-frequency data file contains the total number of pair-wise occurrences of dictionary terms in the documents for a particular resource. In the example below, the pair of terms "anaphylaxis" and "activity" occurs in 3 documents of the resource, the pair "anaphylaxis" and "allergic" occurs in 18 documents, and so on.

 1  anaphylaxis  activities
 3  anaphylaxis  activity
 1  anaphylaxis  actual
 3  anaphylaxis  additional
 1  anaphylaxis  affect
 8  anaphylaxis  after
13  anaphylaxis  against
18  anaphylaxis  allergic

File format: tab-delimited; column 1 = frequency count, columns 2 & 3 = terms. Files are compressed with gzip. To decompress a file, use the gunzip command: gunzip filename.tsv.gz.
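A co-frequency file could be loaded in much the same way. This sketch assumes UTF-8 text and exactly three tab-delimited columns per row. The function names are hypothetical, and the partners helper makes no assumption about which order the two terms of a pair are stored in.

```python
import gzip


def parse_cofrequency_lines(lines):
    """Turn 'count<TAB>term<TAB>term' lines into {(term_a, term_b): count}."""
    pairs = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # Raises ValueError if a row does not have exactly three columns
        count, term_a, term_b = line.rstrip("\r\n").split("\t")
        pairs[(term_a.strip(), term_b.strip())] = int(count)
    return pairs


def read_cofrequency_counts(path, encoding="utf-8"):
    """Read a gzip-compressed co-frequency file (encoding is an assumption)."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return parse_cofrequency_lines(handle)


def partners(pairs, term):
    """Return {other_term: count} for every pair that contains `term`."""
    found = {}
    for (term_a, term_b), count in pairs.items():
        if term_a == term:
            found[term_b] = count
        elif term_b == term:
            found[term_a] = count
    return found


# Rows taken from the example above, with explicit tab separators
sample = [" 3\tanaphylaxis\tactivity\n", "18\tanaphylaxis\tallergic\n"]
pairs = parse_cofrequency_lines(sample)
print(partners(pairs, "anaphylaxis"))
```

Keying the dictionary on a tuple keeps the two terms distinct, which avoids ambiguity for multi-word terms such as "absence of".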

Resource Index data file directories

Adverse Event Reporting System Data
AgingGenesDB (via NIF)
Antibody Registry (via NIF)
ArrayExpress
ARRS GoldMiner
AutDB (via NIF)
BioGRID (via NIF)
BioModels
Biositemaps
caArray
caNanoLab
Cell Centered Database (via NIF)
CellImageLibrary (via NIF)
ClinicalTrials.gov
Conserved Domain Database
Coriell Cell Repository (via NIF)
CTD ChemDisease (via NIF)
CTD ChemGene (via NIF)
CTD DiseasePathway (via NIF)
Database of Genotypes and Phenotypes
Drug Related Gene Database (via NIF)
DrugBank
Entrez Gene (via NIF)
GEMMA (via NIF)
Gene Expression Omnibus DataSets
GeneNetwork (via NIF)
Integrated Disease View (via NIF)
Integrated Videos (via NIF)
InterNano Process Database
MICAD
ModelDB (via NIF)
NIH RePORTER (via NIF)
Online Mendelian Inheritance in Man
Pathway Commons
PDSP Ki database (via NIF)
PharmGKB [Disease]
PharmGKB [Drug]
PharmGKB [Gene]
PubChem
PubMed
PubMedHealth Drugs (via NIF)
PubMedHealth Tests (via NIF)
Reactome
ResearchCrossroads
Stanford Microarray Database
ToxinDB (via NIF)
UniProt KB
WikiPathways

Converting terms to classes

In addition to the singleton and co-frequency counts of dictionary terms provided in the data files above, researchers may wish to calculate counts for occurrences of ontology classes. To assist in this calculation, the BioPortal software generates a file that maps dictionary terms to their corresponding classes.

The mapping file contains an alphabetically sorted list of all terms (preferred labels and synonyms) in the BioPortal application. Each row contains a term, the ontology in which the term appears, and the corresponding class ID for the term.

In the following excerpt from the mapping file, the term CELL appears in the BIRNLEX ontology with class ID http://bioontology.org/projects/ontologies/birnlex#birnlex_12. CELL also appears in the RXNORM ontology with class ID http://purl.bioontology.org/ontology/STY/T025, and so on.

 CELL  BIRNLEX  http://bioontology.org/projects/ontologies/birnlex#birnlex_12
 CELL  RXNORM   http://purl.bioontology.org/ontology/STY/T025
 CELL  SAO      http://ccdb.ucsd.edu/SAO/1.2#sao1813327414
 CELL  SCTSPA   http://purl.bioontology.org/ontology/STY/T025
 CELL  SIO      http://semanticscience.org/resource/SIO_010001
 CELL  SNMI     http://purl.bioontology.org/ontology/STY/T025
 CELL  SNOMEDCT http://purl.bioontology.org/ontology/SNOMEDCT/362837007

File format: tab-delimited; column 1 = term, column 2 = ontology acronym, column 3 = class ID. The file is compressed with gzip. To decompress it, use the gunzip command: gunzip labels_to_classes.tsv.gz.
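One possible way to roll term counts up to class counts is sketched below. Several points are assumptions rather than documented behavior: terms are matched case-insensitively (the mapping excerpt is uppercase while the count files are lowercase), a term that maps to several classes contributes its full count to each of them, and the function names and the count of 7 are purely illustrative.

```python
from collections import defaultdict


def parse_mapping_lines(lines):
    """Turn 'term<TAB>ontology<TAB>class ID' lines into {term: [(ontology, class_id)]}."""
    term_to_classes = defaultdict(list)
    for line in lines:
        parts = [part.strip() for part in line.strip().split("\t")]
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        term, ontology, class_id = parts
        # Assumption: lowercase the term so it can match the count files
        term_to_classes[term.lower()].append((ontology, class_id))
    return term_to_classes


def class_counts(term_counts, term_to_classes):
    """Sum term counts per (ontology, class ID) pair."""
    totals = defaultdict(int)
    for term, count in term_counts.items():
        for ontology, class_id in term_to_classes.get(term.lower(), []):
            totals[(ontology, class_id)] += count
    return dict(totals)


# Two rows from the excerpt above, with explicit tab separators
mapping = parse_mapping_lines([
    "CELL\tBIRNLEX\thttp://bioontology.org/projects/ontologies/birnlex#birnlex_12\n",
    "CELL\tSIO\thttp://semanticscience.org/resource/SIO_010001\n",
])

# Hypothetical singleton count, not taken from any data file
totals = class_counts({"cell": 7}, mapping)
for key, count in sorted(totals.items()):
    print(key, count)
```

Whether synonyms of the same class should be summed, and how to treat terms shared by many ontologies, depends on the research question, so the aggregation rule here should be adjusted as needed.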

Download mapping file