Investigate differences in evidenceCount field in associations datasets #1819
This is not a systemic error that affects all evidence counts.
This does not seem to be a data problem, since the displayed evidence is correct and matches the observed evidence count from the evidence dataset. For the CFTR and cystic fibrosis association:
I think the problem here is in the algorithm that computes the associations.
The thing is, there are pieces of evidence with a score of 0. When the evidence is prepared for the associations, those pieces are filtered out so as not to waste resources:
val dfs = IoHelpers.readFrom(mappedInputs)
val evidences = dfs("evidences").data
  .selectExpr(evidenceColumns: _*)
  .where(col(evScore) > 0D)
In the case of EVA, the evidence contains mutations that are not pathogenic or whose clinical significance is unknown, which implies a score of 0:
@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586").count
res11: Long = 2409L
@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" > 0).count
res13: Long = 2188L
evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" === 0).selectExpr("targetId", "diseaseId", "score", "explode(clinicalSignificances) as clinicalSig").show
+---------------+------------+-----+------------+
| targetId| diseaseId|score| clinicalSig|
+---------------+------------+-----+------------+
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
|ENSG00000001626|Orphanet_586| 0.0|not provided|
+---------------+------------+-----+------------+
only showing top 20 rows
Looking more closely at which clinical significances are the offending ones:
@ evs.filter($"sourceId" === "eva" and $"targetId" === "ENSG00000001626" and $"diseaseId" === "Orphanet_586" and $"score" === 0).selectExpr("targetId", "diseaseId", "score", "explode(clinicalSignificances) as clinicalSig").select("clinicalSig").distinct.show
+------------+
| clinicalSig|
+------------+
| benign|
|not provided|
+------------+
Therefore, the counts look correct to me.
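The effect of the score filter on per-datasource counts can be illustrated with a minimal sketch in plain Scala collections (no Spark). The `Evidence` case class, the `EvidenceCountSketch` object, the `other` source and the sample scores are all hypothetical and exist only for illustration; this is not the pipeline's actual code, only an analogue of the `score > 0` step quoted above.

```scala
// Hypothetical, heavily simplified evidence record; the real schema has many more fields.
final case class Evidence(sourceId: String, score: Double)

object EvidenceCountSketch {
  // Analogue of `.where(col(evScore) > 0D)`: keep only strictly positive scores.
  def scored(evs: Seq[Evidence]): Seq[Evidence] = evs.filter(_.score > 0d)

  // Number of evidence records per datasource.
  def countBySource(evs: Seq[Evidence]): Map[String, Int] =
    evs.groupBy(_.sourceId).map { case (source, rows) => source -> rows.size }

  // Made-up sample: three "eva" records (one scored 0) and one zero-scored record
  // from a hypothetical source called "other".
  val sample: Seq[Evidence] = Seq(
    Evidence("eva", 0.9),
    Evidence("eva", 0.0),
    Evidence("eva", 0.5),
    Evidence("other", 0.0)
  )

  def main(args: Array[String]): Unit = {
    println(countBySource(sample))         // counts over all evidence
    println(countBySource(scored(sample))) // counts after dropping score 0
  }
}
```

In this toy data the `eva` count drops from 3 to 2 once zero-scored records are removed, and a source whose evidence is all scored 0 disappears from the counts entirely. That is the same kind of gap as the one between the evidence page and `evidenceCount`.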
Describe the bug
Based on a Community post, it appears that the evidenceCount values in our associationByDatasource and associationByDatatype datasets are incorrect. Both the direct and indirect versions are affected by this issue.
Observed behaviour
When parsing and querying the associationByDatasourceDirect and associationByDatasourceIndirect datasets, the association for CFTR (ENSG00000001626) and cystic fibrosis (Orphanet_586) returns an evidence count of 2188 for the eva datasource. However, on the CFTR and cystic fibrosis evidence page, there are 2409 evidence strings from eva, so 221 records are missing from the evidenceCount field.
When parsing and querying the associationByDatatypeDirect and associationByDatatypeIndirect datasets, the same association returns an evidence count of 2289 for the genetic_associations datatype. However, this should be 2510, so it appears the same 221 eva records are also missing from the evidenceCount field in the datatype datasets.
Expected behaviour
The evidenceCount values for a given datasource or datatype should match what is displayed in the web interface.
Additional context
The CFTR and cystic fibrosis association is noteworthy because cystic fibrosis does not have any child terms and so we should expect that the direct and indirect counts are the same.
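As a quick cross-check, the counts quoted under Observed behaviour can be subtracted to confirm that both gaps have the same size. The sketch below only restates the arithmetic on the numbers from this issue; the object and value names are made up for illustration.

```scala
object CountGapCheck {
  // Counts quoted in this issue for CFTR (ENSG00000001626) and cystic fibrosis (Orphanet_586).
  val evaDisplayed = 2409      // eva evidence strings on the evidence page
  val evaInDataset = 2188      // evidenceCount for eva in the datasource datasets
  val geneticDisplayed = 2510  // expected count for genetic_associations
  val geneticInDataset = 2289  // evidenceCount for genetic_associations in the datatype datasets

  val evaMissing: Int = evaDisplayed - evaInDataset
  val geneticMissing: Int = geneticDisplayed - geneticInDataset

  def main(args: Array[String]): Unit = {
    println(s"eva missing: $evaMissing")                         // prints "eva missing: 221"
    println(s"genetic_associations missing: $geneticMissing")    // prints "genetic_associations missing: 221"
  }
}
```

Both differences come out at 221, which is consistent with a single set of eva records accounting for the gap at both the datasource and the datatype level.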