Investigate differences in evidenceCount field in associations datasets #1819

Closed
andrewhercules opened this issue Oct 14, 2021 · 2 comments
Labels: Bug (Something isn't working), Data (Relates to Open Targets data team)

Comments

@andrewhercules
Contributor

Describe the bug

Based on a Community post, it appears that the evidenceCount values in our associationByDatasource and associationByDatatype datasets are incorrect. Both the direct and indirect versions are affected by this issue.

Observed behaviour

When parsing and querying the associationByDatasourceDirect and associationByDatasourceIndirect datasets, the association between CFTR (ENSG00000001626) and cystic fibrosis (Orphanet_586) returns an evidence count of 2188 for the eva datasource. However, the CFTR and cystic fibrosis evidence page lists 2409 evidence strings from eva, so 221 records are missing from the evidenceCount field.

When parsing and querying the associationByDatatypeDirect and associationByDatatypeIndirect datasets, the same association returns an evidence count of 2289 for the genetic_associations datatype, whereas it should be 2510. The same 221 eva records therefore appear to be missing from the evidenceCount field in the datatype datasets as well.
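The arithmetic behind both observations can be checked directly. This is a minimal sketch that uses only the counts quoted in this issue; the variable names are illustrative and not part of any Open Targets code.

```python
# Counts quoted in this issue for CFTR (ENSG00000001626) and cystic fibrosis (Orphanet_586).
eva_displayed = 2409            # eva evidence strings listed on the evidence page
eva_in_associations = 2188      # evidenceCount for eva in the datasource datasets
genetic_expected = 2510         # expected count for genetic_associations
genetic_in_associations = 2289  # evidenceCount in the datatype datasets

datasource_gap = eva_displayed - eva_in_associations
datatype_gap = genetic_expected - genetic_in_associations

print(datasource_gap, datatype_gap)  # 221 221
```

Both gaps are identical, which is consistent with the same eva records being absent from both the datasource and the datatype counts.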

Expected behaviour

The evidenceCount values for a given datasource or datatype should match what is displayed in the web interface.

Additional context

The CFTR and cystic fibrosis association is noteworthy because cystic fibrosis does not have any child terms and so we should expect that the direct and indirect counts are the same.
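To illustrate why a disease without child terms should have equal direct and indirect counts, here is a rough sketch. It assumes that an indirect count aggregates the evidence annotated to a disease together with the evidence annotated to its descendant terms; the ontology, the term names, and the counts are hypothetical toy values, not Open Targets data or code.

```python
# Hypothetical toy ontology and per-disease direct counts (not real data).
children = {"parent_disease": ["leaf_disease"], "leaf_disease": []}
direct_counts = {"parent_disease": 3, "leaf_disease": 5}


def descendants(term):
    """Return all descendant terms of `term` in the toy ontology."""
    found = []
    for child in children.get(term, []):
        found.append(child)
        found.extend(descendants(child))
    return found


def indirect_count(term):
    """Direct count of `term` plus the direct counts of all its descendants."""
    return direct_counts[term] + sum(direct_counts[d] for d in descendants(term))


print(indirect_count("leaf_disease"), indirect_count("parent_disease"))  # 5 8
```

Under that assumption, a leaf term such as `leaf_disease` contributes nothing beyond its own evidence, so its direct and indirect counts coincide.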

andrewhercules added the Bug and Data labels Oct 14, 2021
@ireneisdoomed

ireneisdoomed commented Oct 15, 2021

This is not a systemic error affecting all evidence counts.
I have gone through ≈40 associations analogous to the one described above (each involving a disease without descendants) and found 6 examples where the evidence count in the associations dataset does not match the one displayed in the front end (FE). All of them come from EVA.

| diseaseId | datasourceId | targetId | associationsEvidenceCount | displayedEvidenceCount |
|---|---|---|---|---|
| EFO_0008263 | eva | ENSG00000135218 | 101 | 102 |
| EFO_0009299 | eva | ENSG00000133392 | 1146 | 1147 |
| EFO_0009299 | eva | ENSG00000163513 | 467 | 468 |
| EFO_0009299 | eva | ENSG00000166147 | 3488 | 3494 |
| EFO_0009299 | eva | ENSG00000168542 | 656 | 657 |
| EFO_0009300 | eva | ENSG00000176884 | 270 | 271 |
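The size of each mismatch can be read off the table above. A small sketch, with the rows copied from the table and the variable names chosen for illustration:

```python
# (diseaseId, targetId, associationsEvidenceCount, displayedEvidenceCount),
# copied from the table above; all rows are for the eva datasource.
rows = [
    ("EFO_0008263", "ENSG00000135218", 101, 102),
    ("EFO_0009299", "ENSG00000133392", 1146, 1147),
    ("EFO_0009299", "ENSG00000163513", 467, 468),
    ("EFO_0009299", "ENSG00000166147", 3488, 3494),
    ("EFO_0009299", "ENSG00000168542", 656, 657),
    ("EFO_0009300", "ENSG00000176884", 270, 271),
]

# Number of evidence strings displayed but not counted in the associations dataset.
gaps = [displayed - assoc for _, _, assoc, displayed in rows]

print(gaps)  # [1, 1, 1, 6, 1, 1]
```

In every case the associations count is the lower of the two, which suggests records being dropped on the associations side rather than added on the display side.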

This does not seem to be a data problem, since the displayed evidence is correct and matches the observed evidence count from the evidence dataset. For the CFTR and cystic fibrosis association:

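import pyspark.sql.functions as F

# Count distinct eva evidence strings for CFTR and cystic fibrosis; evd is the evidence DataFrame.
(evd.filter(F.col('targetId') == 'ENSG00000001626').filter(F.col('diseaseId') == 'Orphanet_586')
.filter(F.col('datasourceId') == 'eva').distinct().count())
>>> 2409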

I think the problem here is in the algorithm that computes the associations.
@opentargets/be-team can you think of any test to debug where the incorrect evidence count is coming from for one example? It's hard for me to follow the logic of the code.

@mkarmona
Contributor

The thing is, some pieces of evidence have a score of 0, and when the evidence is prepared for the associations step they are filtered out to avoid wasting resources.

    val dfs = IoHelpers.readFrom(mappedInputs)
    val evidences = dfs("evidences").data
      .selectExpr(evidenceColumns: _*)
      .where(col(evScore) > 0D)

EVA contains mutations that are not pathogenic or whose clinical significance is unknown, and those get a score of 0.
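The effect of that filter on the counts can be sketched in plain Python. The records, scores, and clinical significance values below are invented for illustration, and the field names are assumptions; only the `score > 0` condition mirrors the Scala snippet above.

```python
# Toy evidence records; values are invented for illustration only.
evidence = [
    {"datasourceId": "eva", "score": 0.9, "clinicalSignificances": ["pathogenic"]},
    {"datasourceId": "eva", "score": 0.0, "clinicalSignificances": ["benign"]},
    {"datasourceId": "eva", "score": 0.0, "clinicalSignificances": ["not provided"]},
    {"datasourceId": "eva", "score": 0.5, "clinicalSignificances": ["likely pathogenic"]},
]

# Every record would be listed on the evidence page.
displayed_count = len(evidence)

# Mirrors .where(col(evScore) > 0D): zero-score records never reach the associations step.
scored = [e for e in evidence if e["score"] > 0]
evidence_count = len(scored)

# Clinical significances carried by the dropped, zero-score records.
dropped = sorted(
    {sig for e in evidence if e["score"] == 0 for sig in e["clinicalSignificances"]}
)

print(displayed_count, evidence_count, dropped)  # 4 2 ['benign', 'not provided']
```

The gap between the two counts is exactly the number of zero-score records, which is the pattern the queries on the real evidence set show for CFTR and cystic fibrosis.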

@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586").count 
res11: Long = 2409L

@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" > 0).count 
res13: Long = 2188L

evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" === 0).selectExpr("targetId", "diseaseId", "score", "explode(clinicalSignificances) as clinicalSig").show 
+---------------+------------+-----+------------+
|       targetId|   diseaseId|score| clinicalSig|
+---------------+------------+-----+------------+
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
|ENSG00000001626|Orphanet_586|  0.0|not provided|
+---------------+------------+-----+------------+
only showing top 20 rows

Looking closer at which clinical significances are the offending ones:

@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" === 0).selectExpr("targetId", "diseaseId", "score", "explode(clinicalSignificances) as clinicalSig").select("clinicalSig").distinct.show 
+------------+
| clinicalSig|
+------------+
|      benign|
|not provided|
+------------+

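Therefore, the counts look correct to me: the 221 records missing from evidenceCount (2409 vs. 2188) are the zero-score eva evidence strings that are filtered out before the associations are computed.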
