S3 dataset for associationsByDatatypeIndirect does not correspond with FTP file #2661

d0choa · 2022-07-07T09:29:14Z

AWS pulls our data from FTP to serve it as a public dataset in https://registry.opendata.aws/opentargets/

A user in the community site has reported that for 22.06, the data in

s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/

does not correspond with the data in:

ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.06/output/etl/parquet/associationByDatatypeIndirect

The data on our FTP seems to have the right format, but the copy on AWS does not. Although this is outside our codebase, it would be good to resolve it so we can ensure the data in S3 is correct.

I believe all code to pull our data should be publicly available here:
https://github.com/aws-samples/data-lake-as-code/
From what I see, there are several branches that might be relevant.

d0choa · 2022-07-07T13:12:57Z

I believe this is their process:
https://github.com/aws-samples/data-lake-as-code/blob/roda/scripts/ssmdoc.import.opentargets.latest.json

DSuveges · 2022-07-08T14:19:59Z

That's really interesting.. I think they are making a mistake here:

"aws s3 sync opentargets/sourceExports/open-targets-data-releases/latest/ s3://{{openTargetsSourceFileTargetBucketLocation}}/opentargets/sourceExports/latest/"

The AWS documentation says that sync does not delete files in the destination by default; it only copies new files and overwrites changed ones. Because the partitioned datasets have a unique hash in their file names, files from multiple releases accumulate side by side instead of overwriting each other. So the user's observation was correct: there might be multiple entries with the same disease id in the disease dataset, coming from multiple releases. Unless they do something about it, from September there will be a fifth identical identifier.
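To make the difference concrete, here is a minimal sketch in plain Python. It is a toy model of a one-way sync, not the AWS CLI itself: the `sync` function, the dictionaries standing in for the release folders and the bucket, and the shortened file names are all hypothetical.

```python
from typing import Dict


def sync(source: Dict[str, bytes], dest: Dict[str, bytes], delete: bool = False) -> None:
    """Toy one-way sync: copy or overwrite every source key into dest.

    With delete=True, keys that exist only in dest are removed as well,
    which turns the copy into a mirror.
    """
    for key, body in source.items():
        dest[key] = body
    if delete:
        for key in list(dest):
            if key not in source:
                del dest[key]


# Hypothetical, shortened file names modelled on the listing in this thread:
# same partition number, different hash per release.
older_release = {"part-00199-ad8db45e-c000.snappy.parquet": b"older rows"}
newer_release = {"part-00199-bd041124-c000.snappy.parquet": b"newer rows"}

# Without delete: the older file is never overwritten, so both survive.
aggregated: Dict[str, bytes] = {}
sync(older_release, aggregated)
sync(newer_release, aggregated)
print(sorted(aggregated))

# With delete: the destination only ever holds the latest release.
mirrored: Dict[str, bytes] = {}
sync(older_release, mirrored, delete=True)
sync(newer_release, mirrored, delete=True)
print(sorted(mirrored))
```

In the first case the destination ends up with two files for partition 00199, which is the duplication the user reported; in the second case only the newer file remains.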

Demonstrating the mistake:

-rw-r--r--  1 dsuveges  EBI\Domain Users   326K 14 Dec  2021 part-00199-ad8db45e-239a-4036-88a1-012033909e5a-c000.snappy.parquet
-rw-r--r--  1 dsuveges  EBI\Domain Users   335K  6 Jul 08:20 part-00199-bd041124-4c12-491a-80e1-2c06a071d275-c000.snappy.parquet

The same partition from December and July. When the dataset is read, the data appears duplicated. ~~I am not sure if anyone reads the issues on the AWS repo, but I opened a ticket there.~~ Nah, they have 4k repositories, and unanswered issues from 2020.
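A quick way to spot this kind of mix-up from a file listing alone is sketched below. It assumes the hash in each file name is shared by all files of one release write, so a partition number that appears with more than one hash signals files from several releases. The regular expression and the helper name `write_ids_by_partition` are illustrative, not part of any existing tooling.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumed naming scheme: part-<partition number>-<UUID>-c000.snappy.parquet
PART_RE = re.compile(
    r"^part-(\d+)-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def write_ids_by_partition(filenames: Iterable[str]) -> Dict[str, Set[str]]:
    """Group the UUIDs found in part-file names by partition number."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for name in filenames:
        match = PART_RE.match(name)
        if match:
            groups[match.group(1)].add(match.group(2))
    return groups


# The two file names from the listing above.
listing = [
    "part-00199-ad8db45e-239a-4036-88a1-012033909e5a-c000.snappy.parquet",
    "part-00199-bd041124-4c12-491a-80e1-2c06a071d275-c000.snappy.parquet",
]

# Partitions that carry more than one UUID are likely a mix of releases.
mixed = {
    partition: ids
    for partition, ids in write_ids_by_partition(listing).items()
    if len(ids) > 1
}
print(mixed)
```

For the listing above, partition 00199 is flagged with two distinct hashes.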

mbdebian · 2022-07-29T09:14:54Z

I've reviewed their repo. As far as I can see, although they start with a clean local folder, as shown here, the sync of the download to the Open Targets source bucket always goes to the same destination. As pointed out by @DSuveges, the data is not mirrored but aggregated, because no --delete parameter is used in the synchronization process.
I have submitted the following PR, aws-samples/data-lake-as-code#25, explaining the issue and implementing the operation as a mirroring process instead of an aggregation.

d0choa added the Bug (Something isn't working), Data (Relates to Open Targets data team), Backend (Relates to Open Targets backend team), and Platform (Issues related to Open Targets Platform) labels Jul 7, 2022

JarrodBaker assigned mbdebian Jul 7, 2022

DSuveges mentioned this issue Jul 11, 2022

OpenTargets dataset update in the S3 buckets aws-samples/data-lake-as-code#24

Closed

mbdebian closed this as completed Jul 29, 2022
