Are there any good aggregated queries on the database? #7

Closed
wuyinjun-1993 opened this issue May 14, 2018 · 11 comments
Comments

@wuyinjun-1993

wuyinjun-1993 commented May 14, 2018

Could you please provide some example queries, such as an aggregate query over the paths from a target to a gene? If you could share a Python notebook with the queries, that would be fantastic. Thanks!

@alawinia

Daniel, it was great catching up with you. Would you also please let us know if you come across a database with aggregated queries or views? Your help with our project is highly appreciated.
Thank you!

@dhimmel
Member

dhimmel commented May 16, 2018

Here is a query you can run on the Hetionet online browser at https://neo4j.het.io/browser/.

The query investigates which biological processes the drug Topiramate may affect. It looks for paths where Topiramate binds a gene that participates in a biological process. Each path receives a weight based on its specificity, called a PDP (path-degree product). The BINDS_CbG relationship has pubmed_ids metadata, so you could assign each path to zero or more source studies based on these pubmed_ids. Here's the query:

// Search for CbGpBP paths starting with Topiramate
MATCH path = (n0:Compound)-[e1:BINDS_CbG]-(n1)-[:PARTICIPATES_GpBP]-(n2:BiologicalProcess)
WHERE n0.name = 'Topiramate'
// Gather the node degrees along each path, used to down-weight high-degree nodes
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpBP]-()),
  size(()-[:PARTICIPATES_GpBP]-(n2))
] AS degrees, e1, path, n1, n2
RETURN
  // Return the gene symbol, biological process name, and supporting PubMed IDs
  n1.name AS gene_symbol,
  n2.name AS biological_process,
  e1.pubmed_ids AS pubmed_ids,
  // Compute the path-degree product
  reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4) AS PDP,
  // Count the number of genes in the GO Process
  size((n2)-[:PARTICIPATES_GpBP]-()) AS n_genes
ORDER BY PDP DESC
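
To make the PDP arithmetic concrete, here is a minimal Python sketch of what the `reduce(...)` expression computes for a single path. The function name and the example degrees are hypothetical and not taken from Hetionet; only the exponent of -0.4 comes from the query above.

```python
from math import prod


def path_degree_product(degrees, damping=0.4):
    """Multiply each degree raised to -damping, mirroring the reduce() in the query."""
    return prod(d ** -damping for d in degrees)


# Hypothetical degrees for one Compound-binds-Gene-participates-BiologicalProcess path.
# The four values correspond to the four size(...) terms in the query.
specific = path_degree_product([2, 3, 5, 10])
promiscuous = path_degree_product([20, 30, 50, 100])

# A path through low-degree nodes gets a larger PDP than one through high-degree nodes.
print(specific > promiscuous)
```

Since every degree is raised to a negative power, paths through highly connected nodes are down-weighted, which is the sense in which the PDP reflects specificity.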

We usually aggregate all PDPs for the same target node (in this case, the biological process). Summing the PDPs gives the DWPC (degree-weighted path count).

Here's a query for the top five DWPCs:

// Search for CbGpBP paths starting with Topiramate
MATCH path = (n0:Compound)-[e1:BINDS_CbG]-(n1)-[:PARTICIPATES_GpBP]-(n2:BiologicalProcess)
WHERE n0.name = 'Topiramate'
// Implement the DWPC to adjust for node degree along paths
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpBP]-()),
  size(()-[:PARTICIPATES_GpBP]-(n2))
] AS degrees, e1, path, n2
WITH
  // Return the GO Process ID and name
  n2.identifier AS go_id,
  n2.name AS go_name,
  count(path) AS PC,
  collect(e1.pubmed_ids) AS pubmed_ids,
  // Compute the DWPC
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC,
  // Count the number of genes in the GO Process
  size((n2)-[:PARTICIPATES_GpBP]-()) AS n_genes
  WHERE n_genes >= 5 AND PC >= 2
RETURN
  go_id, go_name, pubmed_ids, PC, DWPC, n_genes
ORDER BY DWPC DESC
LIMIT 5
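
For readers who would rather aggregate client-side, the grouping step of the query above can be sketched in Python roughly as follows. This is an illustration under assumptions: the function name, the `(target, pdp)` row layout, and the example values are made up, and the `n_genes >= 5` filter is left out because it needs the node degree from the database.

```python
from collections import defaultdict


def aggregate_dwpc(path_rows, min_paths=2, limit=5):
    """Group per-path PDPs by target, returning (target, PC, DWPC) sorted by DWPC."""
    totals = defaultdict(lambda: {"PC": 0, "DWPC": 0.0})
    for target, pdp in path_rows:
        totals[target]["PC"] += 1       # path count, like count(path)
        totals[target]["DWPC"] += pdp   # sum of PDPs, like sum(reduce(...))
    kept = [
        (target, agg["PC"], agg["DWPC"])
        for target, agg in totals.items()
        if agg["PC"] >= min_paths       # mirrors the PC >= 2 condition
    ]
    kept.sort(key=lambda row: row[2], reverse=True)
    return kept[:limit]


# Hypothetical per-path rows: (biological process, PDP)
rows = [("A", 0.1), ("A", 0.2), ("B", 0.5), ("C", 0.05), ("C", 0.05)]
top = aggregate_dwpc(rows)
```

Here process "B" is dropped even though its single path has the largest PDP, because it is supported by only one path.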

If you want to see the paths that get aggregated to compute DWPCs for these top five biological processes, you can run the following query:

// Search for CbGpBP paths starting with Topiramate
MATCH path = (n0:Compound)-[e1:BINDS_CbG]-(n1)-[:PARTICIPATES_GpBP]-(n2:BiologicalProcess)
WHERE n0.name = 'Topiramate'
// Implement the DWPC to adjust for node degree along paths
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpBP]-()),
  size(()-[:PARTICIPATES_GpBP]-(n2))
] AS degrees, e1, path, n2
WITH
  n2.name AS biological_process,
  count(path) AS PC,
  // Compute the DWPC
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC,
  // Collects paths
  collect(path) as paths,
  // Count the number of genes in the GO Process
  size((n2)-[:PARTICIPATES_GpBP]-()) AS n_genes
  WHERE n_genes >= 5 AND PC >= 2
RETURN
  paths
ORDER BY DWPC DESC
LIMIT 5

The result looks like:

[Image: topiramate]

Each path is based on different pubmed_ids for its BINDS_CbG. You could then assign each source the weight of the PDP divided by the total number of pubmed_ids for that path.
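
A rough Python sketch of that weighting scheme, assuming each path is available as a `(pdp, pubmed_ids)` pair. The function name, the PubMed IDs, and the PDP values are placeholders, and summing a source's shares across paths is one possible reading of the suggestion rather than a settled method.

```python
from collections import defaultdict


def source_weights(paths):
    """Split each path's PDP evenly among its PubMed IDs and total the shares per ID."""
    weights = defaultdict(float)
    for pdp, pubmed_ids in paths:
        if not pubmed_ids:
            continue  # a path with no recorded source contributes to no study
        share = pdp / len(pubmed_ids)
        for pmid in pubmed_ids:
            weights[pmid] += share
    return dict(weights)


# Placeholder data: (PDP, PubMed IDs on the BINDS_CbG relationship of that path)
example_paths = [
    (0.4, ["111", "222"]),
    (0.1, ["111"]),
    (0.3, []),
]
weights = source_weights(example_paths)
```

With this reading, the weights over all sources add up to the total PDP of the paths that have at least one PubMed ID.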

@wuyinjun-1993
Author

Thanks very much. We really appreciate it.

@wuyinjun-1993
Author

Hello, Dr. Daniel Himmelstein! Sorry to disturb you again. Thanks again for the information you provided last time. It is very helpful!

Currently we are using the data and query that you provided, which will be very important for our experiments. Our goal is to make our work more convincing. Would it be possible for you to provide more aggregate user queries against this database, or to point us to any other databases that you know of or work with where aggregated queries exist?

Thanks in advance for your help!

@wuyinjun-1993 wuyinjun-1993 reopened this Jun 1, 2018
@dhimmel
Member

dhimmel commented Jun 5, 2018

@thuwuyinjun to make sure I'm spending my time providing actually useful examples, can you be more specific about exactly what you would like? What characteristics would you like these aggregation queries to have?

@wuyinjun-1993
Author

Thanks for the quick response! The aggregate queries that we want should be very similar to the one you provided last time. They should have one important characteristic: the query result should be curated data or be linked to citation information such as DOIs.

We simply want more aggregate queries so that we can convince readers of the applicability of our techniques. The information you provide will be very helpful for that.

Thanks in advance for your help.

@dhimmel
Member

dhimmel commented Jun 8, 2018

Aggregated GWAS associations

This file named gene-associations.tsv contains gene-disease associations from GWAS. GWAS measures disease associations with SNPs. This file aggregates SNP associations to genes. Some disease-gene associations have multiple GWAS studies reporting significant p-values, which are given in the pubmed_ids column. Perhaps it would be nice to weight the contribution of each study by its p-value... you'd have to play with the source code to output that information. The algorithm to aggregate these associations is rather complex.

@wuyinjun-1993
Author

Cool, thanks very much! I will figure it out.

@dhimmel
Member

dhimmel commented Jun 8, 2018

You could do something similar with Drug-binds-Protein relationships from BindingDB. See this dataset named bindings-drugbank-collapsed.tsv.

@dhimmel
Member

dhimmel commented Oct 15, 2018

I got a Google Scholar notification about "ProvCite: Provenance-based Data Citation", which I assume is related to this, but it appears to have been crawled from the academic social network that shall not be named and is no longer available there.

Assuming this will become available from elsewhere at some point in the future? Looking forward to reading and possibly presenting it at the Greene Lab journal club.

@wuyinjun-1993
Author

wuyinjun-1993 commented Oct 15, 2018

Oh, yes, that is our recent paper, which is still under review. I attached the paper here in case you want to read it early: vldb-2019-conference.pdf.

Thanks!

@dhimmel dhimmel closed this as completed Mar 26, 2020

3 participants