Investigate differences in `evidenceCount` field in associations datasets #1819

andrewhercules · 2021-10-14T19:26:40Z

Describe the bug

Based on a Community post, it appears that the evidenceCount values in our associationByDatasource and associationByDatatype datasets are incorrect. Both the direct and indirect versions are affected by this issue.

Observed behaviour

When parsing and querying the associationByDatasourceDirect and associationByDatasourceIndirect datasets, the association between CFTR (ENSG00000001626) and cystic fibrosis (Orphanet_586) returns an evidence count of 2188 for the eva datasource. However, the CFTR and cystic fibrosis evidence page shows 2409 evidence strings from eva, so 221 records are missing from the evidenceCount field.

When parsing and querying the associationByDatatypeDirect and associationByDatatypeIndirect datasets, the same association returns an evidence count of 2289 for the genetic_associations datatype. However, this should be 2510, so it appears the same 221 eva records are also missing from the evidenceCount field in the datatype datasets.
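As a quick sanity check on the numbers above, the gap can be computed at both levels. This is only arithmetic on the counts quoted in this report; the variable names are made up for illustration and nothing here reads the actual datasets.

```python
# Counts quoted in this report for CFTR (ENSG00000001626) and
# cystic fibrosis (Orphanet_586). Variable names are illustrative only.
eva_dataset, eva_displayed = 2188, 2409          # eva datasource
genetic_dataset, genetic_displayed = 2289, 2510  # genetic_associations datatype

# Gap between what the web interface shows and what the dataset reports.
eva_gap = eva_displayed - eva_dataset
genetic_gap = genetic_displayed - genetic_dataset

# Part of the datatype count that does not come from eva.
non_eva_dataset = genetic_dataset - eva_dataset
non_eva_displayed = genetic_displayed - eva_displayed

print(eva_gap, genetic_gap)                # 221 221
print(non_eva_dataset, non_eva_displayed)  # 101 101
```

The non-eva part of the datatype count is the same on both sides, which is consistent with the whole discrepancy coming from eva records.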

Expected behaviour

The evidenceCount values for a given datasource or datatype should match what is displayed in the web interface.

Additional context

The CFTR and cystic fibrosis association is noteworthy because cystic fibrosis does not have any child terms and so we should expect that the direct and indirect counts are the same.


ireneisdoomed · 2021-10-15T10:43:01Z

This is not a systemic error that affects all evidence counts.
I have gone through ≈40 associations analogous to the one described above (each involving a disease without descendants) and found 6 examples where the evidence count in the associations set does not match the one displayed in the front end (FE). All of them come from EVA.

| diseaseId | datasourceId | targetId | associationsEvidenceCount | displayedEvidenceCount |
| --- | --- | --- | --- | --- |
| EFO_0008263 | eva | ENSG00000135218 | 101 | 102 |
| EFO_0009299 | eva | ENSG00000133392 | 1146 | 1147 |
| EFO_0009299 | eva | ENSG00000163513 | 467 | 468 |
| EFO_0009299 | eva | ENSG00000166147 | 3488 | 3494 |
| EFO_0009299 | eva | ENSG00000168542 | 656 | 657 |
| EFO_0009300 | eva | ENSG00000176884 | 270 | 271 |
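To make the size of each mismatch explicit, the gaps in the table can be computed directly. A minimal sketch: the tuples below are copied from the table by hand, and the `rows` and `gaps` names are only illustrative.

```python
# (diseaseId, targetId, associationsEvidenceCount, displayedEvidenceCount),
# copied by hand from the table above; all rows are for the eva datasource.
rows = [
    ("EFO_0008263", "ENSG00000135218", 101, 102),
    ("EFO_0009299", "ENSG00000133392", 1146, 1147),
    ("EFO_0009299", "ENSG00000163513", 467, 468),
    ("EFO_0009299", "ENSG00000166147", 3488, 3494),
    ("EFO_0009299", "ENSG00000168542", 656, 657),
    ("EFO_0009300", "ENSG00000176884", 270, 271),
]

# Number of evidence strings shown in the FE but not counted in the associations set.
gaps = {
    (disease, target): displayed - associations
    for disease, target, associations, displayed in rows
}

for (disease, target), gap in gaps.items():
    print(disease, target, gap)

print(sorted(gaps.values()))  # [1, 1, 1, 1, 1, 6]
```

In every one of these examples the associations set undercounts and never overcounts, usually by a single evidence string.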

This does not seem to be a data problem, since the displayed evidence is correct and matches the observed evidence count from the evidence dataset. For the CFTR and cystic fibrosis association:

    # Assumes: from pyspark.sql import functions as F; evd is the evidence DataFrame
    (evd.filter(F.col('targetId') == 'ENSG00000001626').filter(F.col('diseaseId') == 'Orphanet_586')
        .filter(F.col('datasourceId') == 'eva').distinct().count())
    >>> 2409

I think the problem here is in the algorithm that computes the associations.
@opentargets/be-team can you think of any test to debug where the incorrect evidence count is coming from for one example? It's hard for me to follow the train of logic of the code.

mkarmona · 2021-10-15T12:45:11Z

The thing is, some pieces of evidence have a score of 0, and when the evidence is prepared for the associations they are filtered out so as not to waste resources.

    val dfs = IoHelpers.readFrom(mappedInputs)
    val evidences = dfs("evidences").data
      .selectExpr(evidenceColumns: _*)
      .where(col(evScore) > 0D)

EVA contains mutations that are not pathogenic or whose clinical significance is unknown, which implies a score of 0.

@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586").count 
res11: Long = 2409L

@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" > 0).count 
res13: Long = 2188L

evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" === 0).selectExpr("targetId", "diseaseId", "score", "explode(clinicalSignificances) as clinicalSig").show 
+---------------+------------+-----+------------+
|       targetId|   diseaseId|score| clinicalSig|
+---------------+------------+-----+------------+
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
+---------------+------------+-----+------------+
only showing top 20 rows

Looking closer at which clinical significances are the offending ones:

@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" === 0).selectExpr("targetId", "diseaseId", "score", "explode(clinicalSignificances) as clinicalSig").select("clinicalSig").distinct.show 
+------------+
| clinicalSig|
+------------+
|      benign|
|not provided|
+------------+

Therefore, the counts sound correct to me.

andrewhercules added the Bug (Something isn't working) and Data (Relates to Open Targets data team) labels Oct 14, 2021

andrewhercules assigned ireneisdoomed Oct 14, 2021

cmalangone assigned cmalangone and JarrodBaker Oct 15, 2021

mkarmona assigned mkarmona and unassigned cmalangone, JarrodBaker and ireneisdoomed Oct 15, 2021

mkarmona closed this as completed Oct 15, 2021
