MacOs and Windows lowercases filenames #3318

Open
pared opened this issue Feb 13, 2020 · 14 comments

Comments

@pared
Member

@pared pared commented Feb 13, 2020

Context: https://discordapp.com/channels/485586884165107732/485596304961962003/677178126916124702
The issue:
A user created a data dir containing files whose names differed only in case (some uppercase, some lowercase).
Another user cloned the repository on macOS and could not get the images with uppercase letters in their names
when those names overlapped with the (otherwise identical) lowercase names of other images.
Pulling on Linux was fine.

Reproduction script:

#!/bin/bash

# Start from a clean slate.
rm -rf repo storage git_repo erepo
mkdir erepo git_repo storage

MAIN=$(pwd)

# Bare git repository that acts as the git remote.
pushd git_repo
git init --bare --quiet
popd
set -ex
# Source repository with two files whose names differ only by case.
pushd erepo
git init --quiet
dvc init -q

mkdir data_dir
echo data1>>data_dir/file
echo data2>>data_dir/File

dvc add data_dir
dvc remote add -d str "$MAIN/storage"
git add -A
git commit -am "initial"
git remote add origin "$MAIN/git_repo"

dvc push
git push origin master
popd

# Fresh clone: pull the data the way another user would.
git clone git_repo repo

pushd repo
dvc pull

After the pull, the target repo should contain both file and File, which does not seem to happen on macOS.

@triage-new-issues triage-new-issues bot added the triage label Feb 13, 2020
@pared pared added the bug label Feb 13, 2020
@triage-new-issues triage-new-issues bot removed the triage label Feb 13, 2020
@pared pared changed the title MacOs lowercases files names MacOs lowercases filenames Feb 13, 2020
@pared pared mentioned this issue Feb 13, 2020
3 of 3 tasks complete
@pared

Member Author

@pared pared commented Feb 13, 2020

So, it turns out that by default, Windows and macOS do not support case-sensitive filenames.
Some resources:
https://superuser.com/questions/165975/are-all-versions-of-windows-case-insensitive
https://apple.stackexchange.com/questions/22297/is-bash-in-osx-case-insensitive

It seems that this is a filesystem setting, so one should be able to configure case-sensitive support.
It doesn't seem to me that we can do anything about it at the dvc level.
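
Since case sensitivity is a property of the filesystem rather than of the operating system, it can be probed per directory. The following is a minimal sketch, not something dvc is known to ship: the helper name is_case_insensitive is hypothetical, and the probe simply creates a temporary file and asks whether the uppercased name resolves to it.

```python
import os
import tempfile


def is_case_insensitive(directory: str) -> bool:
    """Return True if `directory` appears to be on a case-insensitive filesystem.

    Hypothetical helper for illustration only.
    """
    # The prefix is lowercase, so the uppercased name always differs from
    # the name that was actually created.
    with tempfile.NamedTemporaryFile(prefix="case_probe_", dir=directory) as probe:
        head, name = os.path.split(probe.name)
        # On a case-insensitive filesystem the uppercased name resolves to
        # the probe file; on a case-sensitive one it does not exist.
        return os.path.exists(os.path.join(head, name.upper()))
```

Calling is_case_insensitive(".") inside a workspace would typically return True on a default macOS or Windows volume and False on a typical Linux filesystem, but the result depends on how the volume was formatted.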

@pared pared changed the title MacOs lowercases filenames MacOs and Windows lowercases filenames Feb 13, 2020
@pared

Member Author

@pared pared commented Feb 13, 2020

We could probably warn the user about this behavior, though the detection of such use cases could be resource consuming.

@pared

Member Author

@pared pared commented Feb 13, 2020

Situations in which this might happen:

  1. Pulling a repo that was created on Linux.
  2. I could imagine some data processing step that creates processed data and names it automatically, so verification only on data sync operations is not enough.
@pared

Member Author

@pared pared commented Feb 13, 2020

Some research shows that we are not the only ones affected by this.
Git users on macOS might also be hit by this:
https://stackoverflow.com/questions/8904327/case-sensitivity-in-git

@efiop

Member

@efiop efiop commented Feb 13, 2020

If there is no good way to solve it, we should at least document it. A separate article about ensuring cross-platform compatibility would probably be nice, and it could then also be linked from the troubleshooting guide.

But we need to look into git first. @pared, the link that you've shared seems to note that it now works fine in git. I might be missing something.

@pared

Member Author

@pared pared commented Feb 14, 2020

Yes, it seems so, though the situation described there is about renaming files, not about coexisting names that are the same from a case-insensitive perspective. I will check how git behaves when we check out a repository with such "same" filenames.

@efiop efiop added this to To do in DVC Sprint 11 Feb - 25 Feb 2020 via automation Feb 14, 2020
@pared

Member Author

@pared pared commented Feb 14, 2020

Ok, so after a discussion with @efiop, here is a summary of what happened, to better pinpoint the problem.

  1. The user created a data repository on a Linux (case-sensitive) system.
    A simple tree of the data dir looks like this:
dir
├── file
└── FILE
  2. Another user tried to clone the repo and pull the dir. The output they received was
    Could not create '{md5}.dir.unpacked' (originating from here)
    Besides that, the pull was successful.

The problem was that the number of data files did not match: on macOS with default settings, file and FILE are the same path, so checkout unintentionally overwrote one with the other.
Why did the unpacked dir creation fail? Our linking method fails if the link already exists, and from the filesystem's point of view it did (the check for whether FILE exists succeeded because file had been created earlier). But a failure to create the unpacked dir does not cripple dvc critically, so we just log a warning.

We could improve the detection of such cases by including filenames in the exception that is thrown upon linking. In this case the warning could have hinted at the problem, but it was too generic. We should probably prepare our own error that includes both filenames and print it upon unpacked creation failure, even if it is only a warning.
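
To make the idea of a dedicated error concrete, here is a rough sketch of an exception that carries both filenames. The class name LinkCollisionError and its message wording are assumptions made for illustration; they are not taken from the dvc codebase.

```python
class LinkCollisionError(Exception):
    """Hypothetical error: the link target already exists under a
    differently cased name."""

    def __init__(self, requested: str, existing: str):
        # Keep both names as attributes so callers can log or inspect them.
        self.requested = requested
        self.existing = existing
        super().__init__(
            f"Cannot create link '{requested}': '{existing}' already exists "
            "and is the same path on a case-insensitive filesystem."
        )
```

Code that only wants to warn could catch this error and log str(error), which would name both colliding paths instead of a generic failure.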

@efiop

Member

@efiop efiop commented Feb 14, 2020

Btw, in case of dvc add file + dvc add FILE + dvc checkout, the issue won't be reproduced, because our checkout logic will see that file exists and will remove it before checking out FILE. Just a side note for myself.

@pared

Member Author

@pared pared commented Feb 17, 2020

@efiop, but in case of a case-insensitive system, you cannot create file and FILE.

@efiop

Member

@efiop efiop commented Feb 17, 2020

@pared Did you close it intentionally?

@pared pared reopened this Feb 18, 2020
DVC Sprint 11 Feb - 25 Feb 2020 automation moved this from Done to In progress Feb 18, 2020
@pared

Member Author

@pared pared commented Feb 18, 2020

I don't recall closing this; I don't know what happened, sorry.

@efiop efiop moved this from In progress to To do in DVC Sprint 11 Feb - 25 Feb 2020 Feb 18, 2020
@pared

Member Author

@pared pared commented Feb 19, 2020

Ok, so it seems that git handles this problem by warning the user about the collision.
Here is a sample output:

Cloning into 'test_repo'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'FILE'
  'file'

I guess it would be reasonable for us to do the same, though we should think about how to do it without slowing down checkout, because naively comparing every filename against every other one on checkout is O(n^2).
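
The quadratic cost applies to pairwise comparison; grouping names by a case-folded key needs only one pass over the list. Below is a minimal sketch under two assumptions: the function name find_case_collisions is hypothetical, and str.casefold() is used as an approximation of how a case-insensitive filesystem compares names (real filesystems apply their own case and Unicode normalization rules).

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of paths that differ only by letter case.

    Hypothetical helper: one dictionary insertion per path, so the cost
    grows linearly with the number of paths rather than quadratically.
    """
    groups = defaultdict(list)
    for path in paths:
        # Paths with the same case-folded form land in the same bucket.
        groups[path.casefold()].append(path)
    # Only buckets holding more than one spelling are collisions.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For the directory from this issue, find_case_collisions(["dir/file", "dir/FILE"]) returns a single group containing both paths, which is the same information git prints in its warning.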

@efiop

Member

@efiop efiop commented Feb 20, 2020

@pared Amazing research! Thank you so much! 🙏 As to a reasonable solution, I suppose that git detects that collision while performing the checkout, right? From the wording, it seems like the ordering is completely random, so git doesn't care about it too much after all. We could do the same thing in dvc, for example, by playing around with that defensive if self.exists() in _do_link, which tipped us off about this issue in the first place. Though I wonder if os.listdir returns case-sensitive filenames 🙂 , if it does then the implementation might be even simpler.

@pared

Member Author

@pared pared commented Feb 20, 2020

@efiop

Though I wonder if os.listdir returns case-sensitive filenames, if it does then the implementation might be even simpler.

I suppose it depends on the system. As far as my research goes so far, the default settings of Windows and macOS make them "case insensitive, but case aware", so
os.listdir will return paths with their original case, but the filesystem will treat FILE as if it were file.
I believe that implementing this logic in _do_link is the best way to go now because we do the check there anyway, so improving the message sounds like a zero-impact solution. What worries me here is that in the original issue, I did not see the message Link '{}' already exists. I will investigate why that happened.
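
If os.listdir does preserve the original case, an existence check could also report which differently cased entry is already on disk. This is only a sketch of that idea: find_case_variant is a hypothetical helper, not part of _do_link, and it again uses str.casefold() as an approximation of the filesystem's comparison rules.

```python
import os


def find_case_variant(path: str):
    """Return the existing sibling whose name differs from `path` only by
    case, or None if there is no such entry. Hypothetical helper."""
    directory, name = os.path.split(path)
    # os.listdir reports names as they were created, even on filesystems
    # that compare names case-insensitively.
    for entry in os.listdir(directory or "."):
        if entry != name and entry.casefold() == name.casefold():
            return os.path.join(directory, entry)
    return None
```

A check like this costs one directory listing per call, so it would probably make sense only on the failure path, after the existing defensive check has already found that the link exists.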

@pared pared self-assigned this Feb 20, 2020
2 participants