d0choa added the Bug, Data, Backend, and Platform labels on Jul 7, 2022
The AWS documentation says sync doesn't delete files; it only copies and overwrites them. Because the partitioned datasets have a unique hash in the file names, files from multiple releases can be copied into the same location without overwriting each other. So the user's observation was correct: there might be multiple entries with the same disease id in the disease dataset, coming from multiple releases. Unless they do something about it, from September there will be a fifth identical identifier.
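A local sketch of why this accumulates: here plain cp stands in for aws s3 sync without --delete, and the hashed part-file names are made up for illustration.

```shell
# Each release writes part files with a unique hash in the name
# (hypothetical names; the real files are parquet parts).
mkdir -p release_2112 release_2206 bucket
touch release_2112/part-00000-aaaa1111.parquet
touch release_2206/part-00000-bbbb2222.parquet

# "aws s3 sync" without --delete only copies and overwrites; it never
# removes files already in the destination. cp -n models that: syncing
# each release in turn just accumulates the uniquely named part files.
cp -n release_2112/* bucket/
cp -n release_2206/* bucket/

ls bucket/   # part files from both releases now sit side by side
```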
The same partition is present from both December and July, so when the dataset is read, the data appears duplicated. I'm not sure anyone reads the issues on the AWS repo, but I've opened a ticket there. Nah, they have 4k repositories and unanswered issues from 2020.
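To see what a reader of the aggregated bucket observes, here is a toy version: two releases' part files containing the same disease ids (made-up EFO ids, CSV instead of parquet), read by globbing the whole folder.

```shell
# Both releases carry the same disease ids; once their part files live
# side by side, any reader that globs the folder sees every id twice.
mkdir -p aggregated
printf 'EFO_0000094\nEFO_0000095\n' > aggregated/part-0-dec-1a2b.csv
printf 'EFO_0000094\nEFO_0000095\n' > aggregated/part-0-jul-3c4d.csv

# Reading "the dataset" amounts to concatenating every part file;
# uniq -d lists each id that appears more than once.
cat aggregated/part-*.csv | sort | uniq -d
```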
I've reviewed their repo and, as far as I can see, although they start with a clean local folder (as you can see here), when it comes to syncing the download to the Open Targets source bucket it always goes to the same destination, and, as @DSuveges pointed out, the data is not mirrored but aggregated, since no --delete parameter is used in the synchronization process.
I have submitted the following PR, aws-samples/data-lake-as-code#25, explaining the issue and implementing the operation as a mirroring process instead of an aggregation.
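Mirroring means copying new and changed files and also deleting destination files that no longer exist in the source, which is what aws s3 sync does when the --delete flag is passed. A local sketch of that semantics (directory names are hypothetical):

```shell
# Destination still holds a stale part file from an earlier release;
# the source holds only the current release's file.
mkdir -p release_2209 mirror_bucket
touch mirror_bucket/part-00000-stale-old1.parquet
touch release_2209/part-00000-cccc3333.parquet

# Mirror step 1: delete anything in the destination that is not in the
# source (this is what --delete adds to a plain sync).
for f in mirror_bucket/*; do
  [ -e "release_2209/$(basename "$f")" ] || rm "$f"
done
# Mirror step 2: copy the source files over.
cp release_2209/* mirror_bucket/

ls mirror_bucket/   # only the current release's files remain
```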
AWS pulls our data from FTP to serve it as a public dataset at https://registry.opendata.aws/opentargets/
A user on the community site has reported that for 22.06, the data in
s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/
does not correspond with the data in:
ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.06/output/etl/parquet/associationByDatatypeIndirect
The data on our FTP seems to be correct, but the copy on AWS does not. Although this is beyond our codebase, it would be good to try to resolve it so we can ensure the data in S3 is correct.
I believe all code to pull our data should be publicly available here:
https://github.com/aws-samples/data-lake-as-code/
From what I can see, there are several branches that might be relevant.