Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding external data on S3 fails #3341

Closed
shcheklein opened this issue Feb 16, 2020 · 15 comments
Closed

adding external data on S3 fails #3341

shcheklein opened this issue Feb 16, 2020 · 15 comments
Labels
bug Did we break something? enhancement Enhances DVC p2-medium Medium priority, should be done, but less important research

Comments

@shcheklein
Copy link
Member

To keep track of the problem:

https://discordapp.com/channels/485586884165107732/485596304961962003/678284389230182430

the data set i'm trying to add is an LDC data set
in s3://BUCKET/LDC/LDC96T17/ there's only one directory
callhome_spanish_trans_970711
adding s3://BUCKET/LDC/LDC96T17/ fails
adding s3://BUCKET/LDC/LDC96T17/callhome_spanish_trans_970711/ work

more context here:

https://discordapp.com/channels/485586884165107732/485596304961962003/678261668723032075

ERROR: s3://<BUCKET>/dvc_cache/da/8ae919cf1e2400af35eae297eacd67 does not exist: An error occurred (404) when calling the HeadObject operation: Not Found
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 16, 2020
@efiop
Copy link
Member

efiop commented Feb 16, 2020

@shcheklein It is probably some kind of acl issue, I don't think the problem is on dvc side.

@efiop
Copy link
Member

efiop commented Feb 16, 2020

Closing for now, as we've suggested an alternative approach and there is not enough info to pinpoint the issue more than we've done already.

@efiop efiop closed this as completed Feb 16, 2020
@efiop
Copy link
Member

efiop commented Feb 16, 2020

Missed some details, reopening.

@efiop efiop reopened this Feb 16, 2020
@shcheklein
Copy link
Member Author

Was able to reproduce the part with "Removing ...":

(.env) [ivan@ivan ~/Projects/test-ext-dep]$ aws s3 ls s3://dvc-temp/cache/
                           PRE b9/

then:

(.env) [ivan@ivan ~/Projects/test-ext-dep]$ dvc add -v s3://dvc-temp/ivan/configure.sh
2020-02-15 18:24:20,283 DEBUG: PRAGMA user_version;
2020-02-15 18:24:20,285 DEBUG: fetched: [(3,)]
2020-02-15 18:24:20,285 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-02-15 18:24:20,286 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-02-15 18:24:20,286 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-02-15 18:24:20,286 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-02-15 18:24:20,287 DEBUG: PRAGMA user_version = 3;
2020-02-15 18:24:20,361 DEBUG: Path .dvc/cache/27/5dc5b7c7d03ded4193f55a8836e316 inode 60745619
2020-02-15 18:24:20,361 DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
2020-02-15 18:24:20,361 DEBUG: fetched: [('1578506993901407232', '3187', '275dc5b7c7d03ded4193f55a8836e316', '1581810849255110912')]
2020-02-15 18:24:20,362 DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
2020-02-15 18:24:20,362 DEBUG: cache '.dvc/cache/27/5dc5b7c7d03ded4193f55a8836e316' expected '275dc5b7c7d03ded4193f55a8836e316' actual '275dc5b7c7d03ded4193f55a8836e316'
2020-02-15 18:24:20,363 DEBUG: Path configure.sh inode 60752984
2020-02-15 18:24:20,363 DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
2020-02-15 18:24:20,363 DEBUG: fetched: [('1578506993901407232', '3187', '275dc5b7c7d03ded4193f55a8836e316', '1581810849254287872')]
2020-02-15 18:24:20,364 DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
2020-02-15 18:24:21,253 DEBUG: {'s3://dvc-temp/ivan/configure.sh': 'modified'}
2020-02-15 18:24:21,362 DEBUG: Computed stage 'configure.sh.dvc' md5: 'b74bc306405e05917ac37442dfeeef4e'
2020-02-15 18:24:21,368 DEBUG: Saving 's3://dvc-temp/ivan/configure.sh' to 's3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316'.
2020-02-15 18:24:21,698 DEBUG: cache 's3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316' expected '275dc5b7c7d03ded4193f55a8836e316' actual 'None'
2020-02-15 18:24:21,891 DEBUG: Removing s3://dvc-temp/ivan/configure.sh
2020-02-15 18:24:22,379 DEBUG: Created 'copy': s3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316 -> s3://dvc-temp/ivan/configure.sh
2020-02-15 18:24:22,381 DEBUG: Saving information to 'configure.sh.dvc'.
100% Add|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:02,  2.10s/file]

To track the changes with git, run:

	git add configure.sh.dvc
2020-02-15 18:24:22,389 DEBUG: SELECT count from state_info WHERE rowid=?
2020-02-15 18:24:22,389 DEBUG: fetched: [(7,)]
2020-02-15 18:24:22,389 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?

cache after that:

(.env) [ivan@ivan ~/Projects/test-ext-dep]$ aws s3 ls s3://dvc-temp/cache/
                           PRE 27/
                           PRE b9/

When you run it second time:

2020-02-15 18:28:16,533 DEBUG: PRAGMA user_version;
2020-02-15 18:28:16,535 DEBUG: fetched: [(3,)]
2020-02-15 18:28:16,535 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-02-15 18:28:16,537 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-02-15 18:28:16,537 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-02-15 18:28:16,537 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-02-15 18:28:16,538 DEBUG: PRAGMA user_version = 3;
2020-02-15 18:28:16,994 DEBUG: cache 's3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316' expected '275dc5b7c7d03ded4193f55a8836e316' actual '275dc5b7c7d03ded4193f55a8836e316'
2020-02-15 18:28:17,349 DEBUG: Saving 's3://dvc-temp/ivan/configure.sh' to 's3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316'.
2020-02-15 18:28:17,575 DEBUG: cache 's3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316' expected '275dc5b7c7d03ded4193f55a8836e316' actual '275dc5b7c7d03ded4193f55a8836e316'
2020-02-15 18:28:17,575 DEBUG: Removing s3://dvc-temp/ivan/configure.sh
2020-02-15 18:28:17,995 DEBUG: Created 'copy': s3://dvc-temp/cache/27/5dc5b7c7d03ded4193f55a8836e316 -> s3://dvc-temp/ivan/configure.sh
Stage is cached, skipping.
  0% Add|                                                                                                                                                                          |0/0 [00:01,     ?file/s]
2020-02-15 18:28:18,357 DEBUG: SELECT count from state_info WHERE rowid=?
2020-02-15 18:28:18,358 DEBUG: fetched: [(7,)]
2020-02-15 18:28:18,358 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?

We need to check why Removing s3://dvc-temp/ivan/configure.sh is in the logs - both times. We need to be sure that we don't create a copy two times.

@shcheklein
Copy link
Member Author

Don't see a reasonable explanation for these problems so far, would need way more info about config

An error occurred (404) when calling the HeadObject operation: Not Found (considering that aws ls works and considering that it happens on some object in cache and it data)

and

2020-02-15 16:57:52,715 DEBUG: cache 's3://<BUCKET>/dvc-cache/c9/863800000b3f4f2d52e5f964d3cd09' expected 'c9863800000b3f4f2d52e5f964d3cd09' actual 'None'                                                                                                                                                                   
2020-02-15 16:57:53,532 DEBUG: Removing s3://<BUCKET>/DATA_PATH/transcrp/devtest/sp_1930.txt                                                                                                                                                                                                
2020-02-15 16:57:55,275 DEBUG: Cache type 'copy' is not supported: s3://<BUCKET>/DATA_PATH/transcrp/devtest/sp_1930.txt does not exist                                                                                                                                                      
Adding...          

@efiop efiop added bug Did we break something? research labels Feb 18, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 18, 2020
@efiop efiop added this to To do in DVC Sprint 11 Feb - 25 Feb 2020 via automation Feb 18, 2020
@efiop efiop added this to To do in DVC 25 Feb - 10 March 2020 via automation Feb 18, 2020
@efiop efiop removed this from To do in DVC 25 Feb - 10 March 2020 Feb 18, 2020
@efiop efiop added the p1-important Important, aka current backlog of things to do label Feb 18, 2020
@efiop efiop added this to To do in DVC 25 Feb - 10 March 2020 via automation Feb 18, 2020
@efiop efiop added this to To do in DVC 10 - 24 March 2020 via automation Mar 10, 2020
@skshetry skshetry self-assigned this Mar 12, 2020
@skshetry
Copy link
Member

@shcheklein, @efiop, Do we have more information regarding the original issue?

@efiop
Copy link
Member

efiop commented Mar 12, 2020

@skshetry That is pretty much all we have. If it doesn't make sense/not reproducible, feel free to close it.

@shcheklein
Copy link
Member Author

There are two problems, @skshetry . I kept them here together in case they are related. I was able to reproduce one of those, like I mentioned (at least I don't see a reasonable explanation why is it "Removing" stuff). This part:

We need to be sure that we don't create a copy two times.

might be p1 if it's confirmed.

As for the HEAD operation fail and a few other problems - we definitely need to do more research and try similar scenarios. Is this part p1? - I don't know - would need to dig for an hour or two to say something. If go that path I would try versioned buckets first and would contact the guy on Discord.

@skshetry skshetry moved this from To do to In progress in DVC 10 - 24 March 2020 Mar 13, 2020
@skshetry
Copy link
Member

skshetry commented Mar 13, 2020

We do move the file from storage to the cache (i.e. copy and then remove the file from storage), and then relinked again from the cache to storage.

dvc/dvc/remote/base.py

Lines 426 to 428 in 682275d

if self.changed_cache(checksum):
self.move(path_info, cache_info)
self.link(cache_info, path_info)

As this happens on RemoteBase and ._save_file(), this likely affects all the remote storages we support (not confimed except for s3) and file & dir uploads as well respectively.

You can see at the following frames (top - bottom depth, see last two lines):

<module>, __main__.py:7
main, main.py:50
run, add.py:20
wrapper, __init__.py:27
run, scm_context.py:4
add, add.py:86
wrapper, decorators.py:39
rwlocked, stage.py:161
__call__, decorators.py:60
commit, stage.py:823
commit, base.py:241
save, base.py:520
_save, base.py:526
_save_dir, base.py:479
_save_file, base.py:427
move, base.py:660
remove, s3.py:186

It can also be verified with a quick aws s3 ls s3://<path> (check timestamp).

Log for single file upload

➜ dvc add s3://dvc-temp/saugat/storage/foo -v
2020-03-13 15:55:39,505 DEBUG: PRAGMA user_version;
2020-03-13 15:55:39,505 DEBUG: fetched: [(3,)]
2020-03-13 15:55:39,505 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-03-13 15:55:39,506 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-03-13 15:55:39,506 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-03-13 15:55:39,506 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-03-13 15:55:39,506 DEBUG: PRAGMA user_version = 3;
2020-03-13 15:56:03,337 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00' expected 'd3b07384d113edec49eaa6238ad5ff00' actual 'd3b07384d113edec49eaa6238ad5ff00'
2020-03-13 15:56:18,814 DEBUG: Saving 's3://dvc-temp/saugat/storage/foo' to 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00'.
2020-03-13 15:56:22,760 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00' expected 'd3b07384d113edec49eaa6238ad5ff00' actual 'd3b07384d113edec49eaa6238ad5ff00'
2020-03-13 15:56:22,761 DEBUG: Removing s3://dvc-temp/saugat/storage/foo
2020-03-13 15:56:30,530 DEBUG: Created 'copy': s3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00 -> s3://dvc-temp/saugat/storage/foo
Stage is cached, skipping.
  0% Add|                                                                                                                                       |0/0 [00:51,     ?file/s]
2020-03-13 15:56:30,953 DEBUG: SELECT count from state_info WHERE rowid=?
2020-03-13 15:56:30,953 DEBUG: fetched: [(0,)]
2020-03-13 15:56:30,953 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?

Log for folder uploads

➜ dvc add s3://dvc-temp/saugat/storage/data -v
2020-03-13 15:52:45,052 DEBUG: PRAGMA user_version;
2020-03-13 15:52:45,052 DEBUG: fetched: [(3,)]
2020-03-13 15:52:45,053 DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
2020-03-13 15:52:45,053 DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
2020-03-13 15:52:45,053 DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
2020-03-13 15:52:45,054 DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
2020-03-13 15:52:45,054 DEBUG: PRAGMA user_version = 3;
2020-03-13 15:52:51,487 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir' expected '634763d85ae67510ffd8dfad284fa239.dir' actual 'None'
2020-03-13 15:52:51,488 WARNING: Output 's3://dvc-temp/saugat/storage/data' of 'data.dvc' changed because it is 'not in cache'
2020-03-13 15:53:04,810 DEBUG: Uploading '../../../../tmp/tmpj44owyz_' to 's3://dvc-temp/saugat/tmp-1/cache/.VPNaHQGJEdCKTyPxbbZ4gp.tmp'
2020-03-13 15:53:08,628 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir' expected '634763d85ae67510ffd8dfad284fa239.dir' actual 'None'
2020-03-13 15:53:12,151 DEBUG: Removing s3://dvc-temp/saugat/tmp-1/cache/.VPNaHQGJEdCKTyPxbbZ4gp.tmp
2020-03-13 15:53:13,452 DEBUG: {'s3://dvc-temp/saugat/storage/data': 'modified'}
2020-03-13 15:53:18,024 DEBUG: Uploading '../../../../tmp/tmp2i8sptd7' to 's3://dvc-temp/saugat/tmp-1/cache/.BhpWSzEfnS49dm9AcrjzDg.tmp'
2020-03-13 15:53:20,826 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir' expected '634763d85ae67510ffd8dfad284fa239.dir' actual '634763d85ae67510ffd8dfad284fa239'
2020-03-13 15:53:20,827 DEBUG: Computed stage 'data.dvc' md5: 'd09e4e34bb5c2ad7858adb8e2436bff2'
2020-03-13 15:53:21,186 DEBUG: Saving 's3://dvc-temp/saugat/storage/data' to 's3://dvc-temp/saugat/tmp-1/cache/63/4763d85ae67510ffd8dfad284fa239.dir'.
2020-03-13 15:53:24,127 DEBUG: cache 's3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00' expected 'd3b07384d113edec49eaa6238ad5ff00' actual 'None'
2020-03-13 15:53:25,864 DEBUG: Removing s3://dvc-temp/saugat/storage/data/foo
2020-03-13 15:53:30,758 DEBUG: Created 'copy': s3://dvc-temp/saugat/tmp-1/cache/d3/b07384d113edec49eaa6238ad5ff00 -> s3://dvc-temp/saugat/storage/data/foo
2020-03-13 15:53:31,008 DEBUG: Saving information to 'data.dvc'.
100% Add|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:45, 45.97s/file]

To track the changes with git, run:

	git add data.dvc
2020-03-13 15:53:31,026 DEBUG: SELECT count from state_info WHERE rowid=?
2020-03-13 15:53:31,026 DEBUG: fetched: [(0,)]
2020-03-13 15:53:31,027 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?

@skshetry skshetry added the enhancement Enhances DVC label Mar 13, 2020
@shcheklein
Copy link
Member Author

Thanks @skshetry , great research!!

@dmpetrov
Copy link
Member

@skshetry thank you for nailing this down! I’d appreciate if you could provide more details on a possible fix and the complexity?

@skshetry skshetry moved this from In progress to To do in DVC 10 - 24 March 2020 Mar 19, 2020
@efiop efiop moved this from To do to In progress in DVC 10 - 24 March 2020 Mar 19, 2020
@skshetry skshetry moved this from In progress to To do in DVC 10 - 24 March 2020 Mar 19, 2020
@efiop efiop moved this from To do to Done in DVC 10 - 24 March 2020 Mar 24, 2020
@efiop efiop added this to To do in DVC 24 March - 7 April 2020 via automation Mar 24, 2020
@efiop efiop moved this from To do to In progress in DVC 24 March - 7 April 2020 Mar 31, 2020
@efiop efiop added this to To do in DVC 7 - 21 April 2020 via automation Apr 7, 2020
@efiop efiop removed this from In progress in DVC 24 March - 7 April 2020 Apr 7, 2020
@efiop efiop removed this from To do in DVC 7 - 21 April 2020 Apr 21, 2020
@efiop efiop added this to To do in DVC Sprint 21 Apr - 5 May 2019 via automation Apr 21, 2020
@efiop efiop added this to To do in DVC Sprint 5 - 19 May 2019 via automation May 5, 2020
@efiop efiop removed this from To do in DVC Sprint 21 Apr - 5 May 2019 May 5, 2020
@pared pared added this to To do in DVC Sprint 19 May - 2 June 2019 via automation May 12, 2020
@pared pared removed this from To do in DVC Sprint 5 - 19 May 2019 May 12, 2020
@efiop efiop added this to To do in DVC 2 - 16 June 2020 via automation May 25, 2020
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels May 26, 2020
@efiop efiop removed this from To do in DVC 2 - 16 June 2020 May 26, 2020
@shcheklein
Copy link
Member Author

Similar problem - https://discordapp.com/channels/485586884165107732/485596304961962003/733450578138759268 ... and we do extra copy which is not nice

@skshetry
Copy link
Member

skshetry commented Jul 17, 2020

@shcheklein Yes, I think eventual consistency is to blame for that too. Our S3 tests have been flakey for quite a while now (probably not these days as we migrated to moto).

Another thing to note that the following is not guaranteed in S3 (emphasized), for which I think, we do quite often for external outputs:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.

@shcheklein
Copy link
Member Author

If we can confirm that, then it means that wrapping at least certain operation into @Retry should help. (And also remove that extra copy after all! :))

@efiop efiop removed the c8-full-day label Mar 28, 2023
@efiop
Copy link
Member

efiop commented Dec 8, 2023

external outputs are no longer supported.

@efiop efiop closed this as completed Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? enhancement Enhances DVC p2-medium Medium priority, should be done, but less important research
Projects
No open projects
Development

No branches or pull requests

4 participants