Provenance for some CbP relationships is lost during condense() operation #28

Open
cthoyt opened this issue Feb 10, 2020 · 3 comments

cthoyt commented Feb 10, 2020

I'm going through metadata in the compound-binds-gene relationships, and taking a specific look at the actions lists. In many examples, there are several actions, such as with drugbank:DB00502 binds ncbigene:1813. In the JSON GZ export, there are two actions listed: ['antagonist', 'inverse agonist']. I made a query to the Neo4j instance to confirm this is also true there:

MATCH p=(s:Compound)-[r:BINDS_CbG]->(t:Gene)
WHERE s.identifier = 'DB00502' and t.identifier = 1813
RETURN p
LIMIT 25

However, on DrugBank I could only find the antagonist label. Is it the case that the DrugBank source data that gets parsed and converted in Hetionet contains extra information that doesn't make it to the web page I linked? If so, do you have any idea on how they pick which of many gets displayed?

Member

dhimmel commented Feb 14, 2020

> Is it the case that the DrugBank source data that gets parsed and converted in Hetionet contains extra information that doesn't make it to the web page I linked?

Hetionet uses DrugBank version 4.2 as processed in dhimmel/drugbank. In the past when https://www.drugbank.ca was displaying data version 4.2, I think what you would see there would be the same as what we extract from the corresponding drugbank.xml.

In the case of Haloperidol-binds-DRD2, I think these actions are coming from ChEMBL not DrugBank. Notice the following edge property:

If you go to either of the ChEMBL URLs above you'll see the following table, which does contain "inverse agonist" (link)

[image: screenshot of the ChEMBL table that lists "inverse agonist"]

For reference, we combined multiple sources of Compound-binds-Gene relationships in the CbG-binding.ipynb notebook.

Author

cthoyt commented Feb 23, 2020

Okay, so I will interpret an edge with many actions as all of them being separately true, even if there are some conflicts. I was looking through the Jupyter notebook, and it seems that the actions and sources lists are generated by the following code block:

def condense(df):
    """Combine gene-compound relationships"""
    row = pandas.Series()
    row['sources'] = set(itertools.chain.from_iterable(df.sources))
    row['pubmed_ids'] = set(itertools.chain.from_iterable(df.pubmed_ids))
    row['actions'] = set(itertools.chain.from_iterable(df.actions))
    row['affinity_nM'] = df.affinity_nM.mean(skipna=True)
    row['license'] = get_license(row['sources'])
    row['urls'] = set(itertools.chain.from_iterable(df.urls))
    return row

so the information about which action comes from which source is not maintained. As far as I know, the neo4j schema is a bit limiting to having JSON/dictionary objects as the values, but it would be nice to be able to figure out from the final data what the provenance for each relationship was. Maybe a data structure that would be appropriate would be parallel lists, at the cost of being a bit repetitive.
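
To make the parallel-lists idea concrete, here is a minimal sketch, not the notebook's actual code. It works on plain dictionaries instead of a pandas DataFrame so that it is self-contained. The function name `condense_with_provenance`, the `action_sources` key, and the toy rows are all hypothetical. It also assumes that when a row lists several sources, every action in that row is attributed to each of them.

```python
def condense_with_provenance(rows):
    """Combine rows for one compound-gene pair, keeping (source, action) pairs.

    Each row is a mapping with 'sources' and 'actions' lists (hypothetical
    layout modeled on the columns used by condense()).
    """
    pairs = set()
    for row in rows:
        # Assumption: every action in a row is supported by every source in it.
        for source in row["sources"]:
            for action in row["actions"]:
                pairs.add((source, action))
    ordered = sorted(pairs)  # deterministic order for the parallel lists
    return {
        # Index i of 'action_sources' is the source of index i of 'actions'.
        "action_sources": [source for source, _ in ordered],
        "actions": [action for _, action in ordered],
    }


# Toy rows for illustration only, not real Hetionet data.
rows = [
    {"sources": ["DrugBank"], "actions": ["antagonist"]},
    {"sources": ["ChEMBL"], "actions": ["inverse agonist"]},
]
condensed = condense_with_provenance(rows)
```

Both lists hold only strings, so they would fit Neo4j's list-of-primitives property type, at the cost of repeating a source name once per action it supports.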

@cthoyt cthoyt changed the title Can't find provenance for some CbP relationships Provenance for some CbP relationships is lost during condense() operation Feb 23, 2020
Member

dhimmel commented Mar 2, 2020

> the neo4j schema is a bit limiting to having JSON/dictionary objects as the values

Referencing https://stackoverflow.com/a/38026494/4651668. I don't remember whether this limitation influenced how I encoded these properties. "parallel lists" or json-encoded text could be a sufficient workaround.
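
For illustration, the json-encoded text workaround might look roughly like the sketch below. The helper names `encode_provenance` and `decode_provenance` and the example mapping are hypothetical. The idea is only that a nested mapping is serialized to a single string, which could then be stored as an ordinary string property.

```python
import json


def encode_provenance(action_to_sources):
    """Serialize an action -> sources mapping to one JSON string."""
    # Sets are not JSON-serializable, so convert each to a sorted list.
    normalized = {
        action: sorted(sources) for action, sources in action_to_sources.items()
    }
    return json.dumps(normalized, sort_keys=True)


def decode_provenance(text):
    """Recover the action -> sources mapping from the stored string."""
    return json.loads(text)


# Toy mapping for illustration only, not real Hetionet data.
encoded = encode_provenance(
    {"inverse agonist": {"ChEMBL"}, "antagonist": {"DrugBank"}}
)
decoded = decode_provenance(encoded)
```

The trade-off is that the database sees only an opaque string, so any filtering on provenance would have to happen after decoding on the client side.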

> it would be nice to be able to figure out from the final data what the provenance for each relationship was

Yeah. Good lesson for the future. If need be, we could potentially create some sort of mapping from neo4j relationship id to full provenance info for CbP edges. Not as good as having it in the database, but hopefully an acceptable workaround?
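
One possible shape for such a mapping is a small tab-separated side table with one row per relationship, source, and action. This is a sketch under stated assumptions, not an existing Hetionet file: the column names, the helper functions, and the relationship id `101` are all hypothetical.

```python
import csv
import io


def write_provenance_table(records, handle):
    """Write (relationship_id, source, action) triples as TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["relationship_id", "source", "action"])
    for rel_id, source, action in sorted(records):
        writer.writerow([rel_id, source, action])


def load_provenance_table(handle):
    """Read the TSV back into {relationship_id: [(source, action), ...]}."""
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = int(row["relationship_id"])
        lookup.setdefault(key, []).append((row["source"], row["action"]))
    return lookup


# Toy records for illustration only; 101 is a made-up relationship id.
records = [
    (101, "DrugBank", "antagonist"),
    (101, "ChEMBL", "inverse agonist"),
]
buffer = io.StringIO()  # stands in for a file on disk
write_provenance_table(records, buffer)
buffer.seek(0)
lookup = load_provenance_table(buffer)
```

If the internal relationship ids turn out not to be stable across database builds, keying the table on the compound and gene identifiers instead might be a safer choice.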
