# Simple Example of Multiple Biological Data Source Acquisition

This notebook is intended as a starting point for other researchers and domain experts to explore and experiment with various data sources, and to see how those sources can be used to build pipelines that support a Blackboard architecture for addressing important scientific questions. Our initial, modest goals are to focus on the Translator *competency questions* and to begin incorporating and integrating the data sources we anticipate being useful.

## Typical Structure

This notebook, and those cloned from it, follow this structure:

- Background
    - Relevant Competency Question(s) or Research Problem
    - Current Status and remaining work (just to give the reader context about how finished the notebook is)
- Data Sources
    - Descriptions and references, including the API documentation links and a brief description of their scope and content
- Transformation and Integration
    - Simple Data Access examples to illustrate the API usage and the type/shape of the data
    - More sophisticated examples to examine sources and experiment/demonstrate integration possibilities
    - Visualization and Summarization
- Develop Prototype Pipelines (optional)
    - Where possible, prototype a reusable set of code illustrating a desired solution or capability, with an eye towards extracting and modularizing it for presentation via BioLink or integration into other workflows.

---

## Background

### Current Status

- Accesses ChEBI data via the BioLink API
- Accesses ginas data via the ginas API
- Tries to *join* information about 'acetylsalicylic acid' from both data sources, as a toy problem to get started.
- Looks up 'acetylsalicylic acid' rather than 'aspirin', because that term is common to all of the sources right now and I'm not sure that the Monarch BioLink API I'm using has the term 'aspirin' yet.


### Next Steps

- Explore [Pharos](https://pharos.nih.gov/idg/index) API and data sources
- Use [mybinder](http://mybinder.org) badge (or a similar hosted Jupyter mechanism) to simplify invocation and editing from GitHub
- Try to extract drug-to-condition and condition-to-drug relations from the sources, using competency questions to guide this integration.
- Consider WikiData as a source
- Accommodate the anticipated Broad probability models, possibly by developing a mock API.


---

## Data Sources

### ChEBI Data

Monarch ingests [Chemical Entities of Biological Interest (ChEBI)](https://www.ebi.ac.uk/chebi/) data and makes it available via SciGraph, the Monarch API, and the new BioLink API.

For reference, here is the link to ChEBI's entry for 'acetylsalicylic acid' (aka 'aspirin'):

[CHEBI:15365 acetylsalicylic acid](https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15365)


### BioLink substance data from ChEBI via Monarch

Monarch has ingested ChEBI data, and BioLink exposes a `/bioentity/substance/{id}/participant_in/` endpoint that seems to return some data:

https://api.monarchinitiative.org/api/bioentity/substance/CHEBI:40036/participant_in/

However, the basic `/bioentity/substance/{id}` endpoint returns no useful data, so we'll have to use the above link until BioLink has a fleshed-out `/substance` endpoint.
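
The `participant_in` URL can also be assembled from a CURIE instead of being hardcoded. The following is a minimal sketch using only the standard library. `participant_in_url` and its default arguments are hypothetical names introduced here, and the `rows` and `fetch_objects` query parameters are taken from the request used later in this notebook rather than from the API documentation.

```python
from urllib.parse import quote, urlencode

# Base path copied from the example URL in this notebook.
BIOLINK_BASE = "https://api.monarchinitiative.org/api/bioentity/substance/"


def participant_in_url(curie, rows=20, fetch_objects=True):
    """Build a participant_in URL for a CURIE such as 'CHEBI:15365'.

    Hypothetical helper: the parameter names mirror the query string
    used elsewhere in this notebook and are otherwise unverified.
    """
    # safe="" makes quote() percent-encode the colon in the CURIE.
    path = quote(curie, safe="") + "/participant_in/"
    params = urlencode({
        "rows": rows,
        "fetch_objects": str(fetch_objects).lower(),
    })
    return BIOLINK_BASE + path + "?" + params


url = participant_in_url("CHEBI:15365")
print(url)
```

Percent-encoding the colon keeps the identifier intact as a single path segment, which is how the request later in this notebook writes it (`CHEBI%3A15365`).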



### Substance data from the ginas API

The [Global Ingredient Archival System (ginas)](https://tripod.nih.gov/ginas/#/) provides a common identifier for all of the substances used in medicinal products, utilizing a consistent definition of substances globally, including active substances under clinical investigation. More info at [NCATS ginas](https://ncats.nih.gov/expertise/preclinical/ginas).

#### Examples

- [ginas Aspirin](https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name:"^ASPIRIN$")

- [ginas acetylsalicylic acid](https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name:"^acetylsalicylic acid$")





---

## Transformation and Integration

I'm going to start out by ensuring that I can obtain useful data from each of the above sources. In this case, I am focusing on a single substance, **aspirin** or **acetylsalicylic acid** (CHEBI:15365).


### Reading BioLink's `/substance` endpoint for ChEBI data

In [1]:
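import pandas as pd
from urllib.parse import urlencode  # used later to build the ginas query

# Widen the display so the nested dictionaries are not truncated
pd.set_option('display.max_colwidth', 3800)
pd.set_option('display.expand_frame_repr', False)

# Association records for CHEBI:15365; the colon is percent-encoded as %3A
biolinkURL = "https://api.monarchinitiative.org/api/bioentity/substance/CHEBI%3A15365/participant_in/?rows=20&fetch_objects=true"
df = pd.read_json(biolinkURL, typ="frame", orient="records")
df.head(3)  # Show the first 3 rows only
# df  # uncomment to show every row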


Unnamed: 0,evidence_graph,evidence_types,id,object,object_extension,provided_by,publications,qualifiers,relation,slim,subject,subject_extension,type
0,"{'nodes': None, 'edges': None}",,,"{'xrefs': None, 'categories': None, 'id': 'CHEBI:22723', 'taxon': {'label': None, 'id': None}, 'synonyms': None, 'description': None, 'label': 'benzoic acids', 'types': None}",,,,,"{'categories': None, 'id': None, 'synonyms': None, 'description': None, 'label': None, 'types': None}",,"{'xrefs': None, 'categories': None, 'id': 'CHEBI:15365', 'taxon': {'label': None, 'id': None}, 'synonyms': None, 'description': None, 'label': 'acetylsalicylic acid', 'types': None}",,
1,"{'nodes': None, 'edges': None}",,,"{'xrefs': None, 'categories': None, 'id': 'CHEBI:47622', 'taxon': {'label': None, 'id': None}, 'synonyms': None, 'description': None, 'label': 'acetate ester', 'types': None}",,,,,"{'categories': None, 'id': None, 'synonyms': None, 'description': None, 'label': None, 'types': None}",,"{'xrefs': None, 'categories': None, 'id': 'CHEBI:15365', 'taxon': {'label': None, 'id': None}, 'synonyms': None, 'description': None, 'label': 'acetylsalicylic acid', 'types': None}",,
2,"{'nodes': None, 'edges': None}",,,"{'xrefs': None, 'categories': None, 'id': 'CHEBI:50630', 'taxon': {'label': None, 'id': None}, 'synonyms': None, 'description': None, 'label': 'cyclooxygenase 1 inhibitor', 'types': None}",,,,,"{'categories': None, 'id': None, 'synonyms': None, 'description': None, 'label': None, 'types': None}",,"{'xrefs': None, 'categories': None, 'id': 'CHEBI:15365', 'taxon': {'label': None, 'id': None}, 'synonyms': None, 'description': None, 'label': 'acetylsalicylic acid', 'types': None}",,


Now that we see the data frame from our source, we can use ordinary Python (and the pandas library) to access different parts. For example, let's grab the first row's `object` value.

In [2]:
df.object[0]

{'categories': None,
 'description': None,
 'id': 'CHEBI:22723',
 'label': 'benzoic acids',
 'synonyms': None,
 'taxon': {'id': None, 'label': None},
 'types': None,
 'xrefs': None}

In [3]:
df.object[0]['label']

'benzoic acids'
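
Picking out one cell at a time gets tedious once there are more than a few rows. As a rough sketch, the nested `subject` and `object` dictionaries can be pulled apart into plain columns. The sample rows below are trimmed copies of the output shown above (most fields omitted), and `flatten_associations` is a hypothetical helper name, not part of pandas or BioLink.

```python
import pandas as pd

# Trimmed sample shaped like the BioLink rows above; only id and label kept.
sample = [
    {"subject": {"id": "CHEBI:15365", "label": "acetylsalicylic acid"},
     "object": {"id": "CHEBI:22723", "label": "benzoic acids"}},
    {"subject": {"id": "CHEBI:15365", "label": "acetylsalicylic acid"},
     "object": {"id": "CHEBI:47622", "label": "acetate ester"}},
    {"subject": {"id": "CHEBI:15365", "label": "acetylsalicylic acid"},
     "object": {"id": "CHEBI:50630", "label": "cyclooxygenase 1 inhibitor"}},
]
sample_df = pd.DataFrame(sample)


def flatten_associations(frame):
    """Return one row per association with plain id/label columns."""
    return pd.DataFrame({
        "subject_id": frame["subject"].apply(lambda s: s["id"]),
        "subject_label": frame["subject"].apply(lambda s: s["label"]),
        "object_id": frame["object"].apply(lambda o: o["id"]),
        "object_label": frame["object"].apply(lambda o: o["label"]),
    })


flat = flatten_associations(sample_df)
print(flat[["object_id", "object_label"]])
```

The same function should work on the `df` loaded earlier, assuming every row carries `subject` and `object` dictionaries with `id` and `label` keys, as the first three rows do.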

### Reading ginas data

In [4]:
ginasBase = "https://tripod.nih.gov/ginas/app/api/v1/substances/search?"
ginasParams = {'q': "root_names_name:\"^acetylsalicylic acid$\""}

ginasPath = urlencode(ginasParams)
ginasURL = ginasBase + ginasPath

ginasURL

'https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name%3A%22%5Eacetylsalicylic+acid%24%22'

One problem I encountered is that the ginas API response is not in a format that the `pandas` `read_json` method can parse directly. We need to either preprocess the response to make it pandas-compatible or work with the JSON directly, although the latter gives up some of the experimentation and visualization that data frames afford.

In [5]:
import json

import requests

r = requests.get(ginasURL)
c = r.json()
# print(json.dumps(c, indent=2))
c['content']


[{'_approvalIDDisplay': 'R16CO5Y76E',
  '_codes': {'count': 51,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/codes'},
  '_moieties': {'count': 1,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/moieties'},
  '_name': 'ASPIRIN',
  '_names': {'count': 76,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/names'},
  '_references': {'count': 73,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/references'},
  '_relationships': {'count': 27,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/relationships'},
  '_self': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)?view=full',
  'access': [],
  'approvalID': 'R16CO5Y76E',
  'approved': 1470433417000,
  'approvedBy': 'FDA_SRS',
  'created': 1471037

#### Work in Progress

The ginas JSON result cannot be loaded directly into a pandas data frame, so we will have to either manipulate it some other way or convert it into a frame-compatible shape first.
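
As an illustration of working with the JSON directly, the sketch below summarizes one record with plain Python. The sample is a trimmed copy of the first entry of `c['content']` shown above, and `link_counts` is a hypothetical helper; it assumes that linked collections always appear as dictionaries carrying a `count` key, which holds for the fields visible in that output.

```python
# Trimmed copy of the first record in c['content'] (values from the output above).
sample_record = {
    "_name": "ASPIRIN",
    "approvalID": "R16CO5Y76E",
    "_codes": {"count": 51},
    "_moieties": {"count": 1},
    "_names": {"count": 76},
    "_references": {"count": 73},
    "_relationships": {"count": 27},
    "access": [],
}


def link_counts(record):
    """Map each linked collection (e.g. 'names') to its reported count."""
    return {
        key.lstrip("_"): value["count"]
        for key, value in record.items()
        if isinstance(value, dict) and "count" in value
    }


counts = link_counts(sample_record)
print(sample_record["_name"], counts)
```
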

**The following code breaks**

In [6]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)


df = pd.read_json(ginasURL, typ='frame', orient="index")
# df.head(5)
df

ValueError: arrays must all be same length
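
One possible way around this `ValueError` is to skip `read_json` and hand only the `content` list to `pandas.json_normalize` (available as `pd.json_normalize` in pandas 1.0 and later), which flattens nested dictionaries into dotted column names. This is a minimal sketch on a trimmed sample record with values copied from the output above; the real records carry many more fields, and this has not been checked against the live API.

```python
import pandas as pd

# Trimmed sample shaped like c['content'] above (one record, a few fields).
sample_content = [
    {
        "_name": "ASPIRIN",
        "approvalID": "R16CO5Y76E",
        "approvedBy": "FDA_SRS",
        "_names": {"count": 76},
        "_codes": {"count": 51},
        "access": [],
    }
]

# Nested dicts become columns such as '_names.count'; lists stay in one cell.
ginas_df = pd.json_normalize(sample_content)
print(ginas_df[["_name", "approvalID", "_names.count", "_codes.count"]])
```

With the live response, the equivalent call would presumably be `pd.json_normalize(c['content'])`, using the `c` object obtained from `requests` earlier.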

---

## Prototype Pipelines

None available yet
