Unexpected Error [Errno36] File name too long #3423

Closed
rsomani95 opened this issue Feb 29, 2020 · 5 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@rsomani95

rsomani95 commented Feb 29, 2020

DVC version:

DVC version: 0.86.5
Python version: 3.7.4
Platform: Linux-5.3.0-40-generic-x86_64-with-debian-buster-sid
Binary: False
Package: pip
Cache: reflink - not supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('ext4', '/dev/nvme0n1p2')
Filesystem type (workspace): ('ext4', '/dev/nvme0n1p2')

OS: Ubuntu 18.04

When trying to dvc pull from a GCS bucket, I get the following error:

ERROR: unexpected error - [Errno 36] File name too long: '/home/rahul/github_projects/CinemaNet-Dataset/data/film_grab_and_google/train/shot_location/shot_location_exterior_nature_wetlands/677px-%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0.jpg.uqi3uTLbYsUU6FJNjzm9EC.tmp'

At first, I thought this was an issue with the Linux filesystem (I have ext4), because of its limit on the maximum length of a file name. This limit can be checked with

getconf NAME_MAX / # (== 255 on my system)

(more details here). I tried changing this manually but to no avail.
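For reference, the same limit can be read from Python as well. This is a minimal sketch for POSIX systems only; the helper name name_max is made up for illustration. Note that the limit applies to a single path component and is counted in bytes, not characters.

```python
import os


def name_max(path="/"):
    """Return the maximum length, in bytes, of one file name under `path`."""
    # os.pathconf is only available on Unix-like systems.
    return os.pathconf(path, "PC_NAME_MAX")


# Same information as `getconf NAME_MAX /` (255 was reported above for ext4).
print(name_max("/"))
```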


However, the following scenario makes me think that there's more going on:

I have a local copy of the data I was trying to pull from the remote. And when I try to add that folder in a new folder initialised using dvc init --no-scm, I can do so with some persistence.

When first running dvc add shot_location (data-dir = shot_location), I get the same error, but for a different file. Every time I re-run dvc add ..., I get the same error, but for a different file, until finally I'm able to add the file successfully. Here's what that process looks like:

# 1
dvc add shot_location/

Adding...                                                                                                
ERROR: unexpected error - [Errno 36] File name too long: '/home/rahul/datasets/tmp/shot_location/shot_location_exterior_nature_wetlands/677px-%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0.jpg.AVhMoVH67pezi4Dqc33LAe.tmp'

# 2
dvc add shot_location/

Adding...                                                                                                
ERROR: unexpected error - [Errno 36] File name too long: '/home/rahul/datasets/tmp/shot_location/shot_location_exterior_structure_ruins/51rVoriXTXL._SR600%2C315_PIWhiteStrip%2CBottomLeft%2C0%2C35_PIAmznPrime%2CBottomLeft%2C0%2C-5_PIStarRatingFOURANDHALF%2CBottomLeft%2C360%2C-6_SR600%2C315_ZA-8%20Reviews-%2C445%2C291%2C400%2C400%2Carial%2C12%2C4%2C0%2C0%2C5_SCLZZZZZZZ_.jpg.moMJP7BBQcmiAsigDJQ36R.tmp'

....
....

# 6
dvc add shot_location/

Adding...                                                                                                
ERROR: unexpected error - [Errno 36] File name too long: '/home/rahul/datasets/tmp/shot_location/shot_location_interior_structure_building_stage/09-may-2018-portugal-lisbon-maltas-christabelle-standing-on-the-stage-during-the-second-dress-rehearsal-of-the-second-semi-final-at-the-eurovision-song-contest-the-final-takes-place-on-the-12th-may-2018-photo-jrg-carstensendpa-mm7hb4.jpg.BsQCnKzVGHypo4MWtSTpYg.tmp'

# 7
dvc add shot_location/

100% Add|███████████████████████████████████████████████████████████████████████|1/1 [00:14, 14.95s/file]

Repeatedly trying dvc checkout data.dvc from the GCS remote did not behave this way; it kept throwing the same error over and over again.

Now after this, I tested whether I could checkout this data after deleting it

rm -r shot_location
dvc checkout shot_location.dvc

And I was able to do so successfully, with the caveat that none of the files that threw this unexpected error exist after the checkout.

NOTE
The filenames that throw these unexpected errors aren't fully accurate. They have the format {long_filename}.jpg.{garbled_string}.tmp (# characters > 255) whereas on disk, the actual names are just {long_filename}.jpg (# characters < 255)
Also, I haven't modified the config files to include symlinks or hardlinks, both on remote and local
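To see up front which files would hit the limit once a temporary suffix is appended, something like the following could be used. It is only a sketch: the 27-byte suffix length is an assumption taken from the error messages above ("." + 22 characters + ".tmp"), and the function name is hypothetical.

```python
import os

# Assumption: the temporary suffix looks like ".<22 chars>.tmp" (27 bytes),
# as seen in the error messages in this thread.
TMP_SUFFIX_LEN = len(".uqi3uTLbYsUU6FJNjzm9EC.tmp")


def names_too_long_with_suffix(root, suffix_len=TMP_SUFFIX_LEN):
    """Yield paths under `root` whose name plus the suffix exceeds NAME_MAX."""
    for dirpath, _dirnames, filenames in os.walk(root):
        limit = os.pathconf(dirpath, "PC_NAME_MAX")
        for name in filenames:
            # The limit is counted in bytes, so measure the encoded name.
            if len(os.fsencode(name)) + suffix_len > limit:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in names_too_long_with_suffix("shot_location"):
        print(path)
```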

@efiop
Contributor

efiop commented Mar 2, 2020

Hi @rsomani95 !

Sorry for the delay. Could you please provide a verbose log? I.e. run your command with -v added.

But the issue here is that your filename is percent-encoded. E.g.

%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0

aka

Белоручейская_УЖД_в_районе_посёлка_Саменжа

has a length of 227 chars when percent-encoded, but only 79 bytes (42 characters) when decoded. And so when our tmp suffix is added (we need it to provide atomicity), the name goes over the 255-byte NAME_MAX and we get that error.
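The lengths involved can be checked with a few lines of Python. This is just an illustration of the arithmetic; the temporary suffix is copied from the first error message in this thread, and the decoded name is 42 characters, which take 79 bytes in UTF-8.

```python
from urllib.parse import unquote

# Percent-encoded part of the file name, copied from the error message.
encoded = (
    "%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F"
    "_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5"
    "_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0"
)
decoded = unquote(encoded)

on_disk = "677px-" + encoded + ".jpg"          # name as stored on disk
tmp_name = on_disk + ".uqi3uTLbYsUU6FJNjzm9EC.tmp"  # name with the temp suffix

print(len(encoded))                                # 227
print(len(decoded), len(decoded.encode("utf-8")))  # 42 79
print(len(on_disk), len(tmp_name))                 # 237 264
```

So the name on disk (237 bytes) fits under the 255-byte limit, while the temporary name (264 bytes) does not.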

Now, the question here is what is using those percent-encoded names. I have a few ideas:

  1. you have your files named that way in percent-encoded form. If that is the case, then the solution would be to shorten our tmp suffix or reconsider tmpfile naming in general. But you should also reconsider using percent-encoded names in your projects like that; it might be error-prone not only with dvc (we have seen some similar issues in other components too).

  2. your filesystem tries to handle names that way (I do see that it is ext4, so this is a pretty unlikely cause, unless it is misconfigured somehow)

  3. dvc not decoding things correctly when reading things from GS (e.g. when reading .dir cache file through the stream). This is a bit unlikely, but who knows...

So in order to debug further, please show us the verbose log as requested above 🙂 Thanks for reporting this issue!

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Mar 2, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Mar 2, 2020
@rsomani95
Author

Hi @efiop! Thanks a ton for the detailed response, appreciate it!

Now, the question here is what is using those percent-encoded names. I have a few ideas:

  1. is correct. That's how the file names are stored.

For files like this...

'09-may-2018-portugal-lisbon-maltas-christabelle-standing-on-the-stage-during-the-second-dress-rehearsal-of-the-second-semi-final-at-the-eurovision-song-contest-the-final-takes-place-on-the-12th-may-2018-photo-jrg-carstensendpa-mm7hb4.jpg.BsQCnKzVGHypo4MWtSTpYg.tmp'

... it isn't an encoding-decoding issue, but the filename is just too long, correct?

Verbose Error Message (Truncated):

2020-03-02 17:58:43,075 DEBUG: checking if 'data/film_grab_and_google/train/shot_location/shot_location_exterior_nature_wetlands/677px-%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0.jpg'('{'md5': 'b8261e689bc14d83cc90091d06b89715'}') has changed.
2020-03-02 17:58:43,076 DEBUG: 'data/film_grab_and_google/train/shot_location/shot_location_exterior_nature_wetlands/677px-%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0.jpg' doesn't exist.
2020-03-02 17:58:43,076 DEBUG: Cache type 'reflink' is not supported: reflink is not supported                                      
2020-03-02 17:58:43,081 DEBUG: SELECT count from state_info WHERE rowid=?                                                           
2020-03-02 17:58:43,081 DEBUG: fetched: [(93025,)]
2020-03-02 17:58:43,081 DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
2020-03-02 17:58:43,094 ERROR: unexpected error - [Errno 36] File name too long: '/home/rahul/github_projects/CinemaNet-Dataset/data/film_grab_and_google/train/shot_location/shot_location_exterior_nature_wetlands/677px-%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0.jpg.Am5WZHo2cnELtXmLYDroQd.tmp'
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/main.py", line 49, in main
    ret = cmd.run()
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/command/data_sync.py", line 30, in run
    recursive=self.args.recursive,
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/repo/__init__.py", line 28, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/repo/pull.py", line 31, in pull
    targets=targets, with_deps=with_deps, force=force, recursive=recursive
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/repo/checkout.py", line 67, in _checkout
    filter_info=filter_info,
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/funcy/decorators.py", line 39, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/stage.py", line 161, in rwlocked
    return call()
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/funcy/decorators.py", line 60, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/stage.py", line 992, in checkout
    filter_info=filter_info,
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/output/base.py", line 301, in checkout
    filter_info=filter_info,
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 974, in checkout
    path_info, checksum, force, progress_callback, relink, filter_info
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 991, in _checkout
    path_info, checksum, force, progress_callback, relink, filter_info
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 903, in _checkout_dir
    self.link(entry_cache_info, entry_info)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 368, in link
    self._link(from_info, to_info, self.cache_types)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 375, in _link
    self._try_links(from_info, to_info, link_types)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/slow_link_detection.py", line 38, in wrapper
    result = f(remote, *args, **kwargs)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 393, in _try_links
    self._do_link(from_info, to_info, link_method)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/base.py", line 409, in _do_link
    link_method(from_info, to_info)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/remote/local.py", line 155, in copy
    System.copy(from_info, tmp_info)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/dvc/system.py", line 38, in copy
    return shutil.copyfile(src, dest)
  File "/home/rahul/anaconda3/lib/python3.7/site-packages/pyfastcopy/__init__.py", line 77, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
OSError: [Errno 36] File name too long: '/home/rahul/github_projects/CinemaNet-Dataset/data/film_grab_and_google/train/shot_location/shot_location_exterior_nature_wetlands/677px-%D0%91%D0%B5%D0%BB%D0%BE%D1%80%D1%83%D1%87%D0%B5%D0%B9%D1%81%D0%BA%D0%B0%D1%8F_%D0%A3%D0%96%D0%94_%D0%B2_%D1%80%D0%B0%D0%B9%D0%BE%D0%BD%D0%B5_%D0%BF%D0%BE%D1%81%D1%91%D0%BB%D0%BA%D0%B0_%D0%A1%D0%B0%D0%BC%D0%B5%D0%BD%D0%B6%D0%B0.jpg.Am5WZHo2cnELtXmLYDroQd.tmp'
------------------------------------------------------------
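The last frames of the traceback show the file being copied to a temporary path first (System.copy(from_info, tmp_info)). For readers unfamiliar with that pattern, here is a minimal sketch of copy-then-rename in general. It is not DVC's actual code: the function name and the temp-name scheme are hypothetical, chosen only to resemble the names in the log.

```python
import os
import shutil
import uuid


def atomic_copy(src, dst):
    """Copy `src` to `dst` so that `dst` never appears half-written."""
    # Hypothetical temp-name scheme: "<dst>.<22 random chars>.tmp", placed in
    # the same directory so the final rename stays on one filesystem.
    tmp = f"{dst}.{uuid.uuid4().hex[:22]}.tmp"
    try:
        # Fails with ENAMETOOLONG if the temp name exceeds NAME_MAX,
        # even when `dst` itself would fit.
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)  # atomic on POSIX within one filesystem
    finally:
        # Clean up the temp file if the copy or the rename failed.
        if os.path.exists(tmp):
            os.remove(tmp)
```

The trade-off is that the temporary name is always longer than the final one, which is why a file name that is valid on disk can still fail during the copy.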

Everything you've mentioned so far suggests to me that cleaning up the file-names on my end would be the cleanest approach.

Thanks again for all the help :)

@efiop
Contributor

efiop commented Mar 9, 2020

Sorry for the delay, @rsomani95 , had a conf run this week, so didn't check notifications that much.

... it isn't an encoding-decoding issue, but the filename is just too long, correct?

Yes, correct 🙁 Looks like you are trying to store a lot of info in your filenames; maybe consider creating a meta file alongside or something. It is hard for me to tell if you really have strong reasons to organise the storage that way, but personally it feels suboptimal. And, well, dvc's suffix is the final nail there. We could shorten the suffix to something like 3-4 chars (because when you dvc add long-name, you will be limited by long-name.dvc anyway), but even that will break at some point. So I'm a bit hesitant to shorten our suffix right away, but it will now live in the back of my head, and if we get new reports or if you really need it, we'll do that, for sure. Please let me know what you think.
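As a rough illustration of the "meta file alongside" idea, the sketch below renames over-long files to short hash-based names and records the original names in a JSON file. Everything here is an assumption made for the example: the function name, the 100-byte threshold, the use of an MD5 digest of the name, and the original_names.json file name.

```python
import hashlib
import json
import os


def shorten_names(directory, max_len=100, mapping_file="original_names.json"):
    """Rename over-long files to hash-based names; keep the originals in JSON."""
    meta_path = os.path.join(directory, mapping_file)
    mapping = {}
    # Merge with an existing mapping so repeated runs do not lose entries.
    if os.path.exists(meta_path):
        with open(meta_path, encoding="utf-8") as f:
            mapping = json.load(f)

    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Lengths are measured in bytes, matching how NAME_MAX is counted.
        if not os.path.isfile(path) or len(os.fsencode(name)) <= max_len:
            continue
        ext = os.path.splitext(name)[1]
        short = hashlib.md5(os.fsencode(name)).hexdigest() + ext
        os.rename(path, os.path.join(directory, short))
        mapping[short] = name

    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False, indent=2)
    return mapping
```

Calling shorten_names("shot_location/shot_location_exterior_nature_wetlands") would process one class directory; the function does not recurse into subdirectories.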

@rsomani95
Author

Hey @efiop, that's alright.

The filenames were based on Google searches, so we retained a lot of info in them. Anyway, I ended up just shortening the filenames to get around this issue. Thanks for your feedback.

From my end, this issue can be closed.

@efiop
Contributor

efiop commented Mar 9, 2020

Ok, closing for now then. @rsomani95 Thank you so much for the feedback! 🙏

@efiop efiop closed this as completed Mar 9, 2020