Are there any good aggregated queries on the database? #7

wuyinjun-1993 · 2018-05-14T17:57:50Z

Could you please provide some example queries, such as an aggregated query over the paths from a target to a gene? If you could share a Python notebook with the queries, that would be fantastic. Thanks!

alawinia · 2018-05-14T18:21:09Z

Daniel, it was great catching up with you. Would you also please let us know if you come across a database with aggregated queries or views? Your help with our project is highly appreciated.
Thank you!

dhimmel · 2018-05-16T17:34:57Z

Here is a query you can run on the Hetionet online browser at https://neo4j.het.io/browser/.

The query investigates which biological processes the drug Topiramate may affect. It looks for paths where Topiramate binds a gene that participates in a biological process. Each path receives a different weight, called a PDP or path-degree product, based on its specificity. The BINDS_CbG relationship has pubmed_ids metadata, so you could assign each path to zero or more source studies based on these pubmed_ids. Here's the query.

// Search for CbGpBP paths starting with Topiramate
MATCH path = (n0:Compound)-[e1:BINDS_CbG]-(n1)-[:PARTICIPATES_GpBP]-(n2:BiologicalProcess)
WHERE n0.name = 'Topiramate'
// Implement the DWPC to adjust for node degree along paths
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpBP]-()),
  size(()-[:PARTICIPATES_GpBP]-(n2))
] AS degrees, e1, path, n1, n2
RETURN
  // Return the gene symbol and biological process name
  n1.name AS gene_symbol,
  n2.name AS biological_process,
  e1.pubmed_ids AS pubmed_ids,
  // Compute the path-degree product
  reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4) AS PDP,
  // Count the number of genes in the GO Process
  size((n2)-[:PARTICIPATES_GpBP]-()) AS n_genes
ORDER BY PDP DESC

What we usually do is to aggregate all PDPs for the same target node (in this case biological process). We sum PDPs to compute DWPCs (degree-weighted path counts).
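As a rough illustration of the same arithmetic outside Neo4j, here is a minimal Python sketch of the PDP and DWPC computations. The function names, process labels, and degree values are made up for the example; only the damping exponent of 0.4 comes from the query above.

```python
from collections import defaultdict
from math import prod

DAMPING = 0.4  # exponent used in the Cypher query (d ^ -0.4)


def path_degree_product(degrees, damping=DAMPING):
    """Multiply each degree raised to -damping, mirroring the Cypher reduce()."""
    return prod(d ** -damping for d in degrees)


def degree_weighted_path_counts(paths, damping=DAMPING):
    """Sum PDPs per target node; `paths` is an iterable of (target, degrees)."""
    dwpc = defaultdict(float)
    for target, degrees in paths:
        dwpc[target] += path_degree_product(degrees, damping)
    return dict(dwpc)


# Hypothetical paths: (biological process, [four degrees along the path])
example_paths = [
    ("process A", [4, 10, 20, 50]),
    ("process A", [4, 3, 5, 50]),
    ("process B", [4, 10, 20, 200]),
]
dwpcs = degree_weighted_path_counts(example_paths)
for name, value in sorted(dwpcs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.4f}")
```

Paths through high-degree nodes get smaller PDPs, so a process reached only through promiscuous genes ends up with a lower DWPC than one reached through specific genes.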

Here's a query for the top five DWPCs:

// Search for CbGpBP paths starting with Topiramate
MATCH path = (n0:Compound)-[e1:BINDS_CbG]-(n1)-[:PARTICIPATES_GpBP]-(n2:BiologicalProcess)
WHERE n0.name = 'Topiramate'
// Implement the DWPC to adjust for node degree along paths
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpBP]-()),
  size(()-[:PARTICIPATES_GpBP]-(n2))
] AS degrees, e1, path, n2
WITH
  // Return the GO Process ID and name
  n2.identifier AS go_id,
  n2.name AS go_name,
  count(path) AS PC,
  collect(e1.pubmed_ids) AS pubmed_ids,
  // Compute the DWPC
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC,
  // Count the number of genes in the GO Process
  size((n2)-[:PARTICIPATES_GpBP]-()) AS n_genes
  WHERE n_genes >= 5 AND PC >= 2
RETURN
  go_id, go_name, pubmed_ids, PC, DWPC, n_genes
ORDER BY DWPC DESC
LIMIT 5

If you want to see the paths that get aggregated to compute DWPCs for these top five biological processes, you can run the following query:

// Search for CbGpBP paths starting with Topiramate
MATCH path = (n0:Compound)-[e1:BINDS_CbG]-(n1)-[:PARTICIPATES_GpBP]-(n2:BiologicalProcess)
WHERE n0.name = 'Topiramate'
// Implement the DWPC to adjust for node degree along paths
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpBP]-()),
  size(()-[:PARTICIPATES_GpBP]-(n2))
] AS degrees, e1, path, n2
WITH
  n2.name AS biological_process,
  count(path) AS PC,
  // Compute the DWPC
  sum(reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4)) AS DWPC,
  // Collects paths
  collect(path) as paths,
  // Count the number of genes in the GO Process
  size((n2)-[:PARTICIPATES_GpBP]-()) AS n_genes
  WHERE n_genes >= 5 AND PC >= 2
RETURN
  paths
ORDER BY DWPC DESC
LIMIT 5

The result looks like:

Each path is based on different pubmed_ids for its BINDS_CbG. You could then assign each source the weight of the PDP divided by the total number of pubmed_ids for that path.
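A minimal Python sketch of that weighting, assuming each path has already been reduced to its PDP and its list of pubmed_ids. The function name and the example values are hypothetical, and paths without any pubmed_ids are simply skipped here, which is one possible choice.

```python
from collections import defaultdict


def source_weights(paths):
    """Split each path's PDP equally among its PubMed IDs and sum per source.

    `paths` is an iterable of (pdp, pubmed_ids) pairs. A path with no
    PubMed IDs contributes to no source.
    """
    weights = defaultdict(float)
    for pdp, pubmed_ids in paths:
        for pmid in pubmed_ids:
            weights[pmid] += pdp / len(pubmed_ids)
    return dict(weights)


# Hypothetical PDPs and PubMed IDs, for illustration only
example = [
    (0.02, [111, 222]),  # PDP shared equally by two studies
    (0.05, [222]),       # PDP credited to a single study
    (0.01, []),          # no source study, so no weight is assigned
]
weights = source_weights(example)
for pmid, weight in sorted(weights.items()):
    print(pmid, round(weight, 4))
```

Summing these per-source weights over the paths to one biological process gives each study's share of that process's DWPC, excluding any paths that have no pubmed_ids.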

wuyinjun-1993 · 2018-05-16T23:20:20Z

Thanks very much. We really appreciate it

wuyinjun-1993 · 2018-06-01T01:41:19Z

Hello, Dr. Daniel Himmelstein! Sorry to bother you again. Thanks again for the information you provided last time. It is very helpful!

We are currently using the data and query that you provided, which will be very important for our experiments. Our goal is to make our work more convincing. Would it be possible for you to provide more aggregate user queries against this database, or to point us to any other databases you know of or work with where aggregated queries exist?

Thanks in advance for your help!

dhimmel · 2018-06-05T13:01:45Z

@thuwuyinjun to make sure I'm spending my time providing actually useful examples, can you be more specific about what exactly you would like? What characteristics would you like these aggregation queries to have?

wuyinjun-1993 · 2018-06-05T16:51:42Z

Thanks for the quick response! I think the aggregate queries that we want should be very similar to the one that you provided last time. They should have one important characteristic: the query result should be curated data or be linked to citation information such as DOIs.

Our goal is simply to have more aggregate queries so that we can convince readers of the applicability of our techniques. I think the information you provide will be very helpful for that.

Thanks in advance for your help.

dhimmel · 2018-06-08T16:58:55Z

Aggregated GWAS associations

This file named gene-associations.tsv contains gene-disease associations from GWAS. GWAS measures disease associations with SNPs. This file aggregates SNP associations to genes. Some disease-gene associations have multiple GWAS studies reporting significant p-values, which are given in the pubmed_ids column. Perhaps it would be nice to weight the contribution of each study by its p-value... you'd have to play with the source code to output that information. The algorithm to aggregate these associations is rather complex.

wuyinjun-1993 · 2018-06-08T17:00:54Z

Cool, thanks very much! I will figure it out.

dhimmel · 2018-06-08T17:02:56Z

You could do something similar with Drug-binds-Protein relationships from BindingDB. See this dataset named bindings-drugbank-collapsed.tsv.

dhimmel · 2018-10-15T15:01:02Z

I got a Google Scholar notification about "ProvCite: Provenance-based Data Citation", which I assume is related to this, but it appears to have been crawled from the academic social network that shall not be named and is no longer available there.

I assume this will become available elsewhere at some point in the future? Looking forward to reading it and possibly presenting it at the Greene Lab journal club.

wuyinjun-1993 · 2018-10-15T15:37:30Z

Oh, yes, that is our recent paper, which is still under review. I attached the paper here just in case you want to read it early: vldb-2019-conference.pdf.

Thanks!

wuyinjun-1993 closed this as completed May 16, 2018

wuyinjun-1993 reopened this Jun 1, 2018

dhimmel closed this as completed Mar 26, 2020
