# Simple Example of Acquiring Data from Multiple Biological Sources

This notebook is a starting point for researchers and domain experts who want to explore various data sources and experiment with building pipelines that support a Blackboard architecture for addressing scientific questions. Our initial, modest goal is to focus on the Translator *competency questions* and begin to integrate the data sources we anticipate being useful.

## Typical Structure

This notebook, and any notebook cloned from it, follows this structure:

- Background
    - Relevant Competency Question(s) or Research Problem
    - Current Status and remaining work (just to give the reader context about how finished the notebook is)
- Data Sources
    - Descriptions and references, including API documentation links and a brief summary of each source's scope and content
- Transformation and Integration
    - Simple Data Access examples to illustrate the API usage and the type/shape of the data
    - More sophisticated examples to examine sources and experiment/demonstrate integration possibilities
    - Visualization and Summarization
- Develop Prototype Pipelines (optional)
    - Where possible, prototype a reusable set of code illustrating a desired solution or capability, with an eye towards extracting and modularizing it for presentation via BioLink or integration into other workflows.

---

## Background

### Current Status

- Accesses CHEBI data via the BioLink API
- Accesses ginas data via the ginas API
- Attempts to *join* information about 'acetylsalicylic acid' from both data sources, as a toy problem to get started.
- Looks up 'acetylsalicylic acid' rather than 'aspirin', because that term currently appears in all of the sources and I'm not sure that the Monarch BioLink API I'm using has the term 'aspirin' yet.


### Next Steps

- Explore [Pharos](https://pharos.nih.gov/idg/index) API and data sources
- Use [mybinder](http://mybinder.org) badge (or a similar hosted Jupyter mechanism) to simplify invocation and editing from GitHub
- Try to extract Drug-to-conditions and Condition-to-Drugs relations from sources. Use competency questions to guide this integration.
- Consider WikiData as a source
- Accommodate anticipated BROAD probability models, possibly by developing a mock API.


---

## Data Sources

### CHEBI Data

Monarch ingests [Chemical Entities of Biological Interest (ChEBI)](https://www.ebi.ac.uk/chebi/) data and makes it available via SciGraph, the Monarch API, and the new BioLink API.

For reference, here is the link to CHEBI's entry for 'acetylsalicylic acid' (aka 'Aspirin'):

[CHEBI:15365 acetylsalicylic acid](https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15365)


### BioLink substance data from CHEBI via Monarch

Monarch has ingested CHEBI data, and we have a `/biolink/substance/{id}/participant_in/` endpoint that seems to return some data:

https://api.monarchinitiative.org/api/bioentity/substance/CHEBI:40036/participant_in/

However, the basic `/biolink/substance/{id}` endpoint returns no useful data, so we'll have to use the above link until BioLink has a fleshed-out `/substance` endpoint.
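As a minimal sketch of how such a `participant_in` URL can be assembled for any substance identifier, the helper below percent-encodes the CURIE and appends query parameters. The function name `participant_in_url` and the constant `BIOLINK_BASE` are illustrative names, not part of any API; the `rows` and `fetch_objects` parameters are the ones this notebook uses when it queries the endpoint.

```python
from urllib.parse import quote, urlencode

# Base path taken from the endpoint URL shown above
BIOLINK_BASE = "https://api.monarchinitiative.org/api/bioentity/substance"


def participant_in_url(curie, rows=20, fetch_objects=True):
    """Build the participant_in URL for a CURIE such as 'CHEBI:15365'.

    Hypothetical helper for illustration only.
    """
    # safe="" forces the colon in the CURIE to be percent-encoded
    path = "{}/{}/participant_in/".format(BIOLINK_BASE, quote(curie, safe=""))
    # The API expects lowercase 'true'/'false' rather than Python's 'True'/'False'
    params = {"rows": rows, "fetch_objects": str(fetch_objects).lower()}
    return path + "?" + urlencode(params)


print(participant_in_url("CHEBI:15365"))
# https://api.monarchinitiative.org/api/bioentity/substance/CHEBI%3A15365/participant_in/?rows=20&fetch_objects=true
```

Wrapping the URL construction in a function makes it easy to swap in another identifier without hand-editing the encoded string.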



### Substance data from the ginas API

The [Global Ingredient Archival System (ginas)](https://tripod.nih.gov/ginas/#/) provides a common identifier for all of the substances used in medicinal products, utilizing a consistent definition of substances globally, including active substances under clinical investigation. More info at [NCATS ginas](https://ncats.nih.gov/expertise/preclinical/ginas).

#### Examples

- [ginas Aspirin](https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name:"^ASPIRIN$")

- [ginas acetylsalicylic acid](https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name:"^acetylsalicylic acid$")





---

## Transformation and Integration

I'm going to start out by ensuring that I can obtain useful data from each of the above sources. In this case, I am focusing on a single substance, **aspirin** or **acetylsalicylic acid** (CHEBI:15365).


### Reading BioLink's `/substance` endpoint for CHEBI data

In [8]:
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions expose it in pandas.io.json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Widen the display so nested values are not truncated
pd.set_option('display.max_colwidth', 3800)
pd.set_option('display.expand_frame_repr', False)

# participant_in associations for acetylsalicylic acid (CHEBI:15365)
biolinkURL = "https://api.monarchinitiative.org/api/bioentity/substance/CHEBI%3A15365/participant_in/?rows=20&fetch_objects=true"
df = pd.read_json(biolinkURL, typ="frame", orient="records")
df.head(3)  # Show the first 3 rows only


Unnamed: 0,evidence_graph,evidence_types,id,object,object_extension,provided_by,publications,qualifiers,relation,slim,subject,subject_extension,type
0,"{'edges': None, 'nodes': None}",,,"{'label': 'benzoic acids', 'xrefs': None, 'description': None, 'deprecated': None, 'id': 'CHEBI:22723', 'consider': None, 'synonyms': None, 'taxon': {'label': None, 'id': None}, 'replaced_by': None, 'categories': None, 'types': None}",,,,,"{'label': None, 'description': None, 'deprecated': None, 'id': None, 'consider': None, 'synonyms': None, 'replaced_by': None, 'categories': None, 'types': None}",,"{'label': 'acetylsalicylic acid', 'xrefs': None, 'description': None, 'deprecated': None, 'id': 'CHEBI:15365', 'consider': None, 'synonyms': None, 'taxon': {'label': None, 'id': None}, 'replaced_by': None, 'categories': None, 'types': None}",,
1,"{'edges': None, 'nodes': None}",,,"{'label': 'antipyretic', 'xrefs': None, 'description': None, 'deprecated': None, 'id': 'CHEBI:35493', 'consider': None, 'synonyms': None, 'taxon': {'label': None, 'id': None}, 'replaced_by': None, 'categories': None, 'types': None}",,,,,"{'label': None, 'description': None, 'deprecated': None, 'id': None, 'consider': None, 'synonyms': None, 'replaced_by': None, 'categories': None, 'types': None}",,"{'label': 'acetylsalicylic acid', 'xrefs': None, 'description': None, 'deprecated': None, 'id': 'CHEBI:15365', 'consider': None, 'synonyms': None, 'taxon': {'label': None, 'id': None}, 'replaced_by': None, 'categories': None, 'types': None}",,
2,"{'edges': None, 'nodes': None}",,,"{'label': 'non-narcotic analgesic', 'xrefs': None, 'description': None, 'deprecated': None, 'id': 'CHEBI:35481', 'consider': None, 'synonyms': None, 'taxon': {'label': None, 'id': None}, 'replaced_by': None, 'categories': None, 'types': None}",,,,,"{'label': None, 'description': None, 'deprecated': None, 'id': None, 'consider': None, 'synonyms': None, 'replaced_by': None, 'categories': None, 'types': None}",,"{'label': 'acetylsalicylic acid', 'xrefs': None, 'description': None, 'deprecated': None, 'id': 'CHEBI:15365', 'consider': None, 'synonyms': None, 'taxon': {'label': None, 'id': None}, 'replaced_by': None, 'categories': None, 'types': None}",,


Now that we see the data frame from our source, we can use ordinary Python (and the pandas library) to access different parts. For example, let's grab the first row's `object` value.

In [9]:
df.object[0]

{'categories': None,
 'consider': None,
 'deprecated': None,
 'description': None,
 'id': 'CHEBI:22723',
 'label': 'benzoic acids',
 'replaced_by': None,
 'synonyms': None,
 'taxon': {'id': None, 'label': None},
 'types': None,
 'xrefs': None}

In [3]:
df.object[0]['label']

'benzoic acids'
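Picking values out one row at a time works, but the nested `subject` and `object` dictionaries can also be expanded into ordinary columns in one step. The sketch below assumes pandas 1.0 or later (for `pd.json_normalize`) and uses two records abbreviated by hand from the output above, so it runs without calling the API.

```python
import pandas as pd

# Two association records abbreviated from the BioLink output shown above
associations = [
    {
        "subject": {"id": "CHEBI:15365", "label": "acetylsalicylic acid"},
        "object": {"id": "CHEBI:22723", "label": "benzoic acids"},
    },
    {
        "subject": {"id": "CHEBI:15365", "label": "acetylsalicylic acid"},
        "object": {"id": "CHEBI:35493", "label": "antipyretic"},
    },
]

# json_normalize expands nested dicts into dotted column names
flat = pd.json_normalize(associations)

# Keep just the identifier/label pairs for each side of the association
pairs = flat[["subject.id", "subject.label", "object.id", "object.label"]]
print(pairs)
```

The flattened frame is easier to filter and merge than a column of dictionaries, which matters once data from a second source has to be lined up against it.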

### Reading ginas data

In [10]:
ginasBase = "https://tripod.nih.gov/ginas/app/api/v1/substances/search?"
ginasParams = {'q': "root_names_name:\"^acetylsalicylic acid$\""}

ginasPath = urlencode(ginasParams)
ginasURL = ginasBase + ginasPath

ginasURL

'https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name%3A%22%5Eacetylsalicylic+acid%24%22'

One problem I encountered is that the ginas API response is not in a format that the `pandas` `read_json` method can read directly. We need to either preprocess the response to make it pandas-compatible or work with the JSON directly, although skipping pandas would give up some of the experimentation and visualization that data frames afford.

In [5]:
import json
import requests

r = requests.get(ginasURL)
c = r.json()
print(json.dumps(c, indent=2))
c['content']


{
  "query": "q=root_names_name:\"^acetylsalicylic acid$\"",
  "facets": [
    {
      "_self": "http://tripod.nih.gov/ginas/app/api/v1/substances/search/@facets?q=root_names_name%3A%22%5Eacetylsalicylic+acid%24%22&field=Code+System",
      "name": "Code System",
      "values": [
        {
          "label": "EMA ASSESSMENT REPORTS",
          "count": 1
        },
        {
          "label": "EPA PESTICIDE CODE",
          "count": 1
        },
        {
          "label": "IUPHAR",
          "count": 1
        },
        {
          "label": "LIVERTOX",
          "count": 1
        },
        {
          "label": "NDF-RT",
          "count": 1
        },
        {
          "label": "RXCUI",
          "count": 1
        },
        {
          "label": "WHO INTERNATIONAL PHARMACPOEIA",
          "count": 1
        },
        {
          "label": "WHO-ATC",
          "count": 1
        },
        {
          "label": "WHO-ESSENTIAL MEDICINES LIST",
          "count": 1
        },
   

[{'_approvalIDDisplay': 'R16CO5Y76E',
  '_codes': {'count': 51,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/codes'},
  '_moieties': {'count': 1,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/moieties'},
  '_name': 'ASPIRIN',
  '_names': {'count': 76,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/names'},
  '_references': {'count': 73,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/references'},
  '_relationships': {'count': 27,
   'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/relationships'},
  '_self': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)?view=full',
  'access': [],
  'approvalID': 'R16CO5Y76E',
  'approved': 1470433417000,
  'approvedBy': 'FDA_SRS',
  'created': 1471037

#### ginas JSON to Data Frame

In [6]:
# Show wide frames in full
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)

# Fetch the search results; the substance records are the list under 'content'
r = requests.get(ginasURL)
c = r.json()
df = pd.DataFrame(c['content'])

# Alternative: json_normalize(c['content']) flattens the nested dicts into columns
df

Unnamed: 0,_approvalIDDisplay,_codes,_moieties,_name,_names,_references,_relationships,_self,access,approvalID,approved,approvedBy,created,createdBy,definitionLevel,definitionType,deprecated,lastEdited,lastEditedBy,status,structure,substanceClass,uuid,version
0,R16CO5Y76E,"{'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/codes', 'count': 51}","{'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/moieties', 'count': 1}",ASPIRIN,"{'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/names', 'count': 76}","{'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/references', 'count': 73}","{'href': 'http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)/relationships', 'count': 27}",http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)?view=full,[],R16CO5Y76E,1470433417000,FDA_SRS,1471037433000,admin,COMPLETE,PRIMARY,False,1471037433000,admin,approved,"{'molfile': '  Symyx 06151614272D 1 1.00000 0.00000 0  13 13 0 0 0 999 V2000  4.6125 -1.9917 0.0000 C 0 0 0 0 0 0 0 0 0  4.6125 -0.6792 0.0000 C 0 0 0 0 0 0 0 0 0  3.4542 -2.6792 0.0000 C 0 0 0 0 0 0 0 0 0  2.3167 -1.9917 0.0000 O 0 0 0 0 0 0 0 0 0  1.1542 -2.6375 0.0000 C 0 0 0 0 0 0 0 0 0  5.7417 -0.0250 0.0000 O 0 0 0 0 0 0 0 0 0  0.0000 -1.9917 0.0000 O 0 0 0 0 0 0 0 0 0  3.4792 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0  5.7750 -2.6792 0.0000 C 0 0 0 0 0 0 0 0 0  3.4542 -3.9667 0.0000 C 0 0 0 0 0 0 0 0 0  1.1542 -3.9667 0.0000 C 0 0 0 0 0 0 0 0 0  5.7750 -4.0042 0.0000 C 0 0 0 0 0 0 0 0 0  4.5792 -4.6917 0.0000 C 0 0 0 0 0 0 0 0 0  2 1 1 0 0 0  3 1 2 0 0 0  4 3 1 0 0 0  5 4 1 0 0 0  6 2 2 0 0 0  7 5 2 0 0 0  8 2 1 0 0 0  9 1 1 0 0 0  10 3 1 0 0 0  11 5 1 0 0 0  12 9 2 0 0 0  13 12 1 0 0 0  13 10 2 0 0 0 M END ', 'smiles': 'CC(=O)OC1=CC=CC=C1C(O)=O', 'lastEdited': 1471037433000, 'atropisomerism': 'No', 'stereoCenters': 0, 'stereoComments': '', 'charge': 0, 'self': 'http://tripod.nih.gov/ginas/app/api/v1/structures(3d311a51-d4f2-4878-ab53-45e6d9640ae3)?view=full', 'definedStereo': 0, 'digest': '8f13a762c2818d1407a79edf0fe0218dc88f7a9f', 
'stereochemistry': 'ACHIRAL', 'createdBy': 'admin', 'hash': 'NNQ793F142LD', 'deprecated': False, 'mwt': 180.1574, 'id': '3d311a51-d4f2-4878-ab53-45e6d9640ae3', 'formula': 'C9H8O4', '_properties': {'href': 'http://tripod.nih.gov/ginas/app/api/v1/structures(3d311a51-d4f2-4878-ab53-45e6d9640ae3)/properties', 'count': 5}, 'lastEditedBy': 'admin', 'ezCenters': 0, 'count': 1, 'references': ['e794afad-a478-4c69-b6bd-1c7a8fe46baf', '25e72485-091c-47f8-9a00-7dd47689bf58'], 'opticalActivity': 'UNSPECIFIED', 'access': [], 'created': 1471037433000}",chemical,8911c794-5da3-4934-a683-16d98d93db97,1
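Several columns in this frame (`_codes`, `_names`, `structure`, and so on) still hold dictionaries. One possible way to make them usable is to flatten the records before building the frame. The sketch below assumes pandas 1.0 or later and uses a single record abbreviated by hand from the response above; only a few of the fields are kept, so it is not the full ginas schema.

```python
import pandas as pd

# One substance record, abbreviated from the ginas response shown above
content = [
    {
        "_name": "ASPIRIN",
        "substanceClass": "chemical",
        "_codes": {"count": 51},
        "_names": {"count": 76},
        "structure": {
            "formula": "C9H8O4",
            "smiles": "CC(=O)OC1=CC=CC=C1C(O)=O",
            "mwt": 180.1574,
        },
    }
]

# Nested keys become dotted column names such as 'structure.formula'
flat = pd.json_normalize(content)

summary = flat[["_name", "structure.formula", "structure.mwt", "_codes.count", "_names.count"]]
print(summary)
```

With the nested values promoted to columns, the structure fields can be sorted, filtered, or plotted like any other column.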


In [7]:
# Rebuild the frame from the parsed response
df = pd.DataFrame(c['content'])

# Keep only the columns of interest; .copy() avoids modifying a view of df
newdf = df[['_name', 'substanceClass', '_self', 'created', 'approved']].copy()

# ginas timestamps are milliseconds since the Unix epoch
newdf['created'] = pd.to_datetime(newdf['created'], unit='ms')
newdf['approved'] = pd.to_datetime(newdf['approved'], unit='ms')

newdf

Unnamed: 0,_name,substanceClass,_self,created,approved
0,ASPIRIN,chemical,http://tripod.nih.gov/ginas/app/api/v1/substances(8911c794-5da3-4934-a683-16d98d93db97)?view=full,2016-08-12 21:30:33,2016-08-05 21:43:37
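With both sources readable, the toy *join* mentioned under Current Status could look something like the sketch below. It is only an illustration under stated assumptions: the rows are copied by hand from the outputs above, the column names (`chebi_id`, `related_label`, `ginas_name`, `query_name`, `join_key`) are made up for this example, and the join key is simply the lowercased name that was used to query both sources. Note that ginas reports the record under the name 'ASPIRIN', so the ginas display name alone would not match the CHEBI label.

```python
import pandas as pd

QUERY_NAME = "acetylsalicylic acid"  # the term used to search both sources

# CHEBI side: subject/object pairs abbreviated from the BioLink output
chebi = pd.DataFrame({
    "chebi_id": ["CHEBI:15365", "CHEBI:15365"],
    "chebi_label": ["acetylsalicylic acid", "acetylsalicylic acid"],
    "related_label": ["benzoic acids", "antipyretic"],
})

# ginas side: the single record returned by the exact-name search,
# tagged with the name that was searched for
ginas = pd.DataFrame({
    "ginas_name": ["ASPIRIN"],
    "approvalID": ["R16CO5Y76E"],
    "query_name": [QUERY_NAME],
})

# Hypothetical join key: the lowercased substance name
chebi["join_key"] = chebi["chebi_label"].str.lower()
ginas["join_key"] = ginas["query_name"].str.lower()

# Inner join keeps only substances found in both sources
joined = chebi.merge(ginas, on="join_key", how="inner")
print(joined[["chebi_id", "related_label", "ginas_name", "approvalID"]])
```

Joining on a name is fragile; a shared identifier, if one of the ginas code systems turns out to carry CHEBI IDs, would be a sturdier key, but that has not been checked here.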


---

## Prototype Pipelines

None available yet
