Skip to content
This repository has been archived by the owner on Nov 9, 2017. It is now read-only.

Git does not see that file in working directory differs from HEAD #312

Closed
j66st opened this issue Feb 5, 2015 · 65 comments
Closed

Git does not see that file in working directory differs from HEAD #312

j66st opened this issue Feb 5, 2015 · 65 comments

Comments

@j66st
Copy link

j66st commented Feb 5, 2015

I found a strange problem in msysGit, I am wondering if it's a bug.
I already discussed it here:
https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc
You can download my Demo.zip attachment there.
It seems not to happen in Git for Linux or OS-X.
The problem occurs in msysGit 1.9.0 and 1.9.5 (running x86 version on a Windows 7 x64 system).

Essentially, the problem seems that Git for Windows assumes that two files are the same when both the timestamp and the file size match. Obviously the file contents is not inspected nor the hash recalculated.

As mentioned I created a minimal demo package to easily prove the issue. Simply download the zip file, put it in a clean directory and run the enclosed script from bash. Below is a transcript of what happens when I run the script in my situation.

The issue is a real show-stopper in my automated migration from Visual SourceSafe to Git (in a test transferring 20,000 source files the problem caused loss of roughly 0.1% of the files from the history), so I hope for a quick solution.

Welcome to Git (version 1.9.5-preview20141217)


Run 'git help git' to display the help index.
Run 'git help <command>' to display help for specific commands.

joostadm@MUPC20 ~
$ cd /d/proj/try/git

joostadm@MUPC20 /d/proj/try/git
$ ls -l
total 15
-rw-r--r--    1 joostadm Administ    26968 Jan 23 18:43 DixiLink2.zip
-rwxr-xr-x    1 joostadm Administ     1848 Feb  5 14:05 weird-git-demo

joostadm@MUPC20 /d/proj/try/git
$ weird-git-demo
Note: No argument supplied, using DemoRepo by default.
+ mkdir DemoRepo
+ cd DemoRepo
+ unzip ../DixiLink2.zip
Archive:  ../DixiLink2.zip
   creating: .git/
 extracting: .git/COMMIT_EDITMSG
  inflating: .git/config
  inflating: .git/description
  inflating: .git/gitk.cache
 extracting: .git/HEAD
   creating: .git/hooks/
  inflating: .git/hooks/applypatch-msg.sample
  inflating: .git/hooks/commit-msg.sample
  inflating: .git/hooks/post-update.sample
  inflating: .git/hooks/pre-applypatch.sample
  inflating: .git/hooks/pre-commit.sample
  inflating: .git/hooks/pre-push.sample
  inflating: .git/hooks/pre-rebase.sample
  inflating: .git/hooks/prepare-commit-msg.sample
  inflating: .git/hooks/update.sample
  inflating: .git/index
   creating: .git/info/
  inflating: .git/info/exclude
   creating: .git/logs/
  inflating: .git/logs/HEAD
   creating: .git/logs/refs/
   creating: .git/logs/refs/heads/
  inflating: .git/logs/refs/heads/master
   creating: .git/objects/
   creating: .git/objects/29/
 extracting: .git/objects/29/c51bbb9ada43dbe98cbd5dbedbab56586b24c3
   creating: .git/objects/68/
 extracting: .git/objects/68/b1f129eb06b91fc6c9a3885fc0d24ad0cdaa50
   creating: .git/objects/7e/
 extracting: .git/objects/7e/8d94849a38370141b3ca2aab5c3cadc27934da
   creating: .git/objects/cf/
 extracting: .git/objects/cf/13dbb766d4aff04ad51d3cdac84fda67dc6f50
   creating: .git/objects/da/
 extracting: .git/objects/da/eca6b50a218a78e4cca5566965381e2d384b7f
   creating: .git/objects/de/
 extracting: .git/objects/de/3643596c73be4f6112d027616c8df31acd1b09
   creating: .git/objects/ec/
 extracting: .git/objects/ec/88fb155c5e9ebd8b3ead39345618517c0af6cf
   creating: .git/objects/info/
   creating: .git/objects/pack/
   creating: .git/refs/
   creating: .git/refs/heads/
 extracting: .git/refs/heads/master
   creating: .git/refs/tags/
 extracting: .git/refs/tags/2011-11-24
  inflating: dixilinkerr.h
+ echo 'This repository now contains version 1 and 2 of dixilinkerr.h:'
This repository now contains version 1 and 2 of dixilinkerr.h:
+ git log
commit cf13dbb766d4aff04ad51d3cdac84fda67dc6f50
Author: Joost <joost@localhost>
Date:   Tue Sep 11 09:57:32 2007 +0000

    @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h

commit de3643596c73be4f6112d027616c8df31acd1b09
Author: Joost <joost@localhost>
Date:   Fri Jul 6 12:27:57 2007 +0000

    @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
+ echo 'Version 3 is in our working directory:'
Version 3 is in our working directory:
+ head -n 12 dixilinkerr.h
// DixiLinkErr.h

#pragma once

#ifndef EXCPCAT_DIXILINK
#define EXCPCAT_DIXILINK 2000
#endif

#ifndef __DixiLinkErr_H_INCLUDED__
#define __DixiLinkErr_H_INCLUDED__

#ifndef IDL_ENUM
+ echo 'The file in our HEAD'
The file in our HEAD
+ git show HEAD:dixilinkerr.h
+ head -n 12
// DixiLinkErr.h

#pragma once

#ifndef CAT_DIXILINK_ERR
#define CAT_DIXILINK_ERR 2000
#endif

#ifndef __DixiLinkErr_H_INCLUDED__
#define __DixiLinkErr_H_INCLUDED__

#ifndef IDL_ENUM
+ echo 'You see? Working copy differs from HEAD.'
You see? Working copy differs from HEAD.
+ echo 'So the working directory is dirty, right? Ask Git:'
So the working directory is dirty, right? Ask Git:
+ git status
On branch master
nothing to commit, working directory clean
+ git diff
+ echo 'In my situation, Git sees no difference here, I THINK THIS IS WRONG!'
In my situation, Git sees no difference here, I THINK THIS IS WRONG!
+ echo 'So we are unable to add version 3 of our file. Let'\''s try it once again:'
So we are unable to add version 3 of our file. Let's try it once again:
+ git add dixilinkerr.h
+ git status
On branch master
nothing to commit, working directory clean
+ echo 'In my situation at this point there is nothing to add or commit.'
In my situation at this point there is nothing to add or commit.
+ echo 'End of demo'
End of demo

joostadm@MUPC20 /d/proj/try/git
$ 
@sschuberth
Copy link

I just manually unzipped your DixiLink2.zip and opened a Git Bash in the created directory:

Welcome to Git (version 1.9.5-preview20141217)


Run 'git help git' to display the help index.
Run 'git help <command>' to display help for specific commands.

~/Downloads/Demo/DixiLink2 master
$ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   dixilinkerr.h

no changes added to commit (use "git add" and/or "git commit -a")

~/Downloads/Demo/DixiLink2 master
$ git diff
diff --git a/dixilinkerr.h b/dixilinkerr.h
index daeca6b..b5edf0e 100644
--- a/dixilinkerr.h
+++ b/dixilinkerr.h
@@ -2,8 +2,8 @@

 #pragma once

-#ifndef CAT_DIXILINK_ERR
-#define CAT_DIXILINK_ERR 2000
+#ifndef EXCPCAT_DIXILINK
+#define EXCPCAT_DIXILINK 2000
 #endif

 #ifndef __DixiLinkErr_H_INCLUDED__
@@ -16,7 +16,7 @@
 #ifndef BASE_DIXILINK_ERR
        #define BASE_DIXILINK_ERR 0
        #ifdef USE_DEFAULT_CAT
-               #define BASE_DIXILINK_ERR CAT_DIXILINK_ERR
+               #define BASE_DIXILINK_ERR EXCPCAT_DIXILINK
        #endif
 #endif

As you can see, the file shows up as modified for me.

EDIT: But interestingly, I can reproduce the issue with your script. I've compared the full directory contents from manually unzipping DixiLink2.zip vs. your script, and although they look equal, the Git statuses differ. Indeed odd.

@j66st
Copy link
Author

j66st commented Feb 5, 2015

I think it depends on which unzip tool you use. If possible, please try again unpacking the zip file with Info-ZIP v3.0. What's relevant here is that the timestamps of the files are preserved.
We could reproduce the problem on more than one PC. The problem disappears when touching the unpacked file.

@sschuberth
Copy link

I noticed there is a off-by-one difference in the seconds of the creation time when the archive is unpacked via "unzip" vs "WinRAR". That is one thing.

On the other hand, the issue disappears after deleting .git/index and git add dixilinkerr.h. I'm not sure about the index file format, but if it contains the timestamp of dirty files, and "unzip" restores the wrong time, Git might get confused.

The fact that the issue disappears when using "WinRAR" instead of "unzip" makes me believe that the bug is rather in Unzip than in Git.

Did this issue already show up in the original repo before zipping / unzipping it?

@j66st
Copy link
Author

j66st commented Feb 5, 2015

Yes the issue showed up in the original repo. (And of course also when pulling the repo directory over to another PC over the LAN.)
I am migrating Visual SourceSafe repositories using the tool vss2git (http://code.google.com/p/vss2git/, it is a .NET tool, running on Windows, it uses git.exe to add files to the git repo, so no black magic).
In a test moving 20,000 source files I lost 0.1% of the files. When inspecting the log I found out that the files missing were those of which the last version has the same file size as the previous version. In those cases the git add command failed.
EDIT: So no ZIP involved here.

@j66st
Copy link
Author

j66st commented Feb 5, 2015

I think Git should not cache file timestamps, or at least there should be an option to turn it off.

@dscho
Copy link
Member

dscho commented Feb 5, 2015

I think Git should not cache file timestamps, or at least there should be an option to turn it off.

Git is very conservative when looking at timestamps (it does not even trust the index if it has the very same timestamp as the file it is looking at), so there must be something else going wrong.

@sschuberth
Copy link

I also always thought that Git is only tracking file contents, and nothing else. But meanwhile Hannes replied to the corresponding mailing list thread saying something different ...

@j66st
Copy link
Author

j66st commented Feb 5, 2015

Now the fog is clearing. In my case the vss2git visits every file in VSS, rewinds the deltas until the origin, then starts replaying the history and adds and commits every version to git. It will do so very quickly. Knowing that the precision of timestamps on some Windows filesystems (FAT for example) is 1 or 2 seconds, and knowing that some files (like icon bitmaps) don't change size in their history it is hard to avoid clashes without inspecting file contents. You can expect other problems because of the different ways timezones and daylight saving time are handled across filesystems.

See my latest post on the https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc thread.

So please, dear developers, please let Git to only track file contents (reliably tracked by SHA-1 hashes) and nothing else. If there is a performance penalty, so be it. Any shortcut that comes with a reliability penalty should have a switch to turn it off!

I am not familiar with the internals of the Git index to really understand what is hidden there, and why the file content/hash inspection has been bypassed. But I know the basics of both the Windows and Unix filesystems, so I will follow any discussion here about the details with interest. I am willing to test any fix; the problem is fully reproducible here.

@YueLinHo
Copy link

YueLinHo commented Feb 6, 2015

Looks like the old "racy git" problem???

Failed workaround: (as this said)
debug3

Is it possible that implement inode via Windows MFT?

@t-b
Copy link

t-b commented Feb 6, 2015

The screen short from @YueLinHo shows read-cache.c for all you don't know git's code by heart.
@j66st I'm not sure that is_racy_timestamp is really the problem here. But you can try that happens if you change the function to always return 1?

Regarding with the nanoseconds timestamps. On Windows we have timestamps with at most 100ns resolution according to https://msdn.microsoft.com/en-us/library/windows/desktop/ms724290%28v=vs.85%29.aspx.

If I understand the code correctly we also do not use USE_NSEC in msysgit/git-for-windows-sdk.

@YueLinHo
Copy link

YueLinHo commented Feb 6, 2015

@j66st I suppose there is a workaround for your vss2git that insert a command git reset --mixed after checkout from VSS and before git commit.

@t-b I guess the root cause is that Git for Windows lacks some file system features.
And the workaround for racy git problem is failed, too. so issue here.

See basic logic match_stat_data() first.

Git for Windows has its own lstat() and st_ino, st_gid, st_uid are always 0. (see do_lstat())
(So, I associate the st_ino with MFT file. (but it is NTFS specific.) :P )
And the USE_NSEC is disabled by default.
I suppose it is a good idea to enable USE_NSEC. (maybe need some fix about it) ^_^

And I also recall the issue #229, not sure if it is the same root cause.

BTW, I guess such problems will come about often, because of SSD.

@YueLinHo
Copy link

YueLinHo commented Feb 6, 2015

if is_racy_timestamp() always return 1, then ce_modified_check_fs() might be called.
(see here)
And it check file context and calculate the sha-1 when the ce_compare_data() is called.
(see here and here)

@kblees
Copy link

kblees commented Feb 6, 2015

In your Demo.zip, the dixilinkerr.h file has a modification time of 2009-09-17, and the .git/index file has a modification time of 2015-01-23. Yet you say that the file was changed after committing version 2. So it seems your VSS conversion tool messes up modification timestamps. Make sure it doesn't do that and everything should work as expected.

Btw. I don't think this is a Windows-specific problem (or even Git-specific). Playing around with file times will fool Git on Linux as well (as outlined in racy-git.txt) as well as most backup software.

@YueLinHo
Copy link

YueLinHo commented Feb 6, 2015

@kblees Thank you. That makes me think more deeply.

@j66st
Copy link
Author

j66st commented Feb 6, 2015

@t-b: Re 100 ns timestamp resolution: that's only in theory. The clock ticks at a slower pace and a file system like FAT can only store seconds. Re changing code: I don't have an environment right now to rebuild git for Windows, nor the time to dive into the code, so I leave that to others.
@kblees Indeed I guess that the problem could be caused by extracting files from VSS will restore their original mtime (but don't call that "messing up"). Restore of a backup would do so too. But Git seems not prepared to that: the file in the working directory will be older than Git's index file, probably that's the reason Git looks no further and assumes that the file is stale. I wish there would be an option to turn off this reasoning; I want a command switch (and a config setting) to bypass this caching and force Git to not use the mtime as a factor in determining file change.

I really wished that Git could preserve timestamps too! I'm so used to seeing a file's age in the directory listing and being able to sort out which files were changed this week, etc.. Everyone moving from VSS to Git (or TFS, for that matter) is complaining about the loss of this information. I changed the vss2git tool to keep this information at least in the commit message.
So I'm not "playing around" with mtimes, that's what Git is doing. I consider them valuable. Backup software ought to keep an index for itself which files it saved last time in order to do incrementals. Otherwise files copied from another medium would never be backed up.

@YueLinHo Using git reset --mixed after extracting the file and before the git add is not a workaround. I tried it, Git still does not see the file as changed. But just touching the file
would be sufficient, I guess. So that the working copy is never older than Git's index.

@sschuberth
Copy link

I don't have an environment right now to rebuild git for Windows, nor the time to dive into the code, so I leave that to others.

At least the rebuilding part we have made pretty easy by now. Simply download the latest snapshot of the Git for Windows SDK, run the "Git Development Environment" from the Start Menu, and do cd git && make.

I really wished that Git could preserve timestamps too!

For reference, here's the FAQ entry why it does not. However, I agree there could be an option to still do so.

@YueLinHo
Copy link

YueLinHo commented Feb 6, 2015

@j66st
Can vss2git keep the changed file list for one commit?
If it does not, once lots files in working tree, touching all files is not that smart.
So, could you please try the following 2 ways after vss checkout?

  1. git rm --cached -r && git reset --mixed
  2. delete the .git/index, then git reset --mixed

@dscho
Copy link
Member

dscho commented Feb 6, 2015

I don't have an environment right now to rebuild git for Windows, nor the time to dive into the code, so I leave that to others.

I hope you understand that a statement like this seems very arrogant because you expect others to spend time helping you but do not see fit to set aside time yourself.

I really wished that Git could preserve timestamps too!

The mtime of a file is the time when it was last modified in this particular spot on the file system. That is why a copy of a file dated January 1st 1980 will get a new time stamp, even if the content technically is really that old.

That is the way Git treats timestamps.

You probably can convince VSS to update files' mtime to the appropriate value, too, i.e. reflecting the time when it wrote the file (as opposed to the time when the file with those contents entered the repository) and that should fix your woes.

In any case, it is well established by now that the problem is not a bug in Git for Windows, it does exactly what it is designed to do. Therefore I close this ticket.

@dscho dscho closed this as completed Feb 6, 2015
@j66st
Copy link
Author

j66st commented Feb 6, 2015

@dscho:
Well, I am just getting my feet wet as a Git user, I think it is not fair to expect me to dive into this huge codebase and start tinkering with the heart of the program. At least it would cost me days, if not weeks to understand the code that contains so mush wisdom of many smart people.
I already spent days modding and debugging vss2git to find out the reason for this strange behaviour and spent another day setting up a reproducible testcase. After discussing this issue I was supposing this was a serious flaw. Even long-time Git users are surprised.
To quote the reaction of your co-"owner" sschuberth on the google group:

Wow, that was totally new to be and sort of destroys my world picture of Git ... I always thought Git is tracking file contents, and nothing else.
(Pasted from https://groups.google.com/forum/#!topic/msysgit/6XLoSPH26kc )

These words also reflect my opinion. So I thought this was something to be fixed urgently and that it would be wise to humbly leave this work to the developers who are fully familiar with the codebase and who must be able to fix this in hours, rather than weeks. I am willing to discuss the subject, to assist with my knowledge of filesystems and VSS, and to assist in testing, that's all that would make sense for now, I guess.

Anyway, glad that I did not yet invest more time in coding a solution. I don't know about the hierarchy of the people contributing in this project, but you must be the authority here if you can decide on your own that this strange behavior is by design and does not need further action. So if I would develop a fix it wouldn't get your approval anyway.

Now I know enough to find a suitable workaround to redo my migration project. But I think the least one could do then is adding a section to the FAQ and to the manpages to explain and warn for this surprising behavior as shown in my test script. Maybe I should spend time to do that and add my demo-case as an example.

I truly respect people investing so much time, energy and knowledge in open source projects like this, and I apologize if anyone feels insulted by my posts, it was never my intention. It is up to the readers following this discussion to judge for themselves who seems arrogant here.

@j66st
Copy link
Author

j66st commented Feb 6, 2015

@sschuberth :

At least the rebuilding part we have made pretty easy by now. Simply download the latest snapshot of the Git for Windows SDK, run the "Git Development Environment" from the Start Menu, and do cd git && make.

Thanks for your good work! I am MS Visual Studio developer, I would have to set up a Gnu C build environment first, and get used to it, at this moment I don't have time for that, already lost too much time with this vss2git migration, and the show must go on. I hope, some day ...

I really wished that Git could preserve timestamps too!

For reference, here's the FAQ entry why it does not. However, I agree there could be an option to still do so.

I know the reasons. To accommodate those people who are too ignorant or too lazy to add two characters to their make command to force a rebuild-all whenever they import other people's code :-) At the cost of losing valuable file timestamps. Fine with me, we can agree to disagree here. I spent considerable time to have vss2git save the timestamps in the commit messages, I think it was worth it.

@sschuberth
Copy link

I am MS Visual Studio developer, I would have to set up a Gnu C build environment first

No you don't. That's exactly what the SDK installer does for you, it's self-contained.

@j66st
Copy link
Author

j66st commented Feb 6, 2015

@YueLinHo :

Can vss2git keep the changed file list for one commit?
If it does not, once lots files in working tree, touching all files is not that smart.

Why? Touching the files will only refresh the mtime in the directory, so not an expensive operation. Vss2git bundles files having mtimes within a certain time interval like 2 minutes, so normally only a dozen files or so per commit. And vss2git is a batch process, it can run overnight, I don't care about a few minutes more.

So, could you please try the following 2 ways after vss checkout?

  1. git rm --cached -r && git reset --mixed

Yes, this works (just tried it with my Demo-setup, you can also try it yourself if you like):

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
nothing to commit, working directory clean

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git rm --cached -r dixilinkerr.h
rm 'dixilinkerr.h'

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ ls -l
total 3
-rw-r--r--    1 joostadm Administ     5986 Sep 17  2009 dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        deleted:    dixilinkerr.h

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        dixilinkerr.h


joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git reset --mixed
Unstaged changes after reset:
M       dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   dixilinkerr.h

no changes added to commit (use "git add" and/or "git commit -a")

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$
  1. delete the .git/index, then git reset --mixed

Yes, that is also a viable solution (didn't know that it is fine to delete the index file).

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
nothing to commit, working directory clean

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ ls -l
total 3
-rw-r--r--    1 joostadm Administ     5986 Sep 17  2009 dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ rm .git/index

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        deleted:    dixilinkerr.h

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        dixilinkerr.h


joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git reset --mixed
Unstaged changes after reset:
M       dixilinkerr.h

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$ git st
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   dixilinkerr.h

no changes added to commit (use "git add" and/or "git commit -a")

joostadm@W701 /c/proj/try/git/clonefault/DemoRepo (master)
$

Your workarounds will also cost some time (more than touch, I guess). Anyway, I only need to apply this workaround in those cases that git add fails, so no problem.

Thanks for your ideas!

@YueLinHo
Copy link

YueLinHo commented Feb 7, 2015

Can vss2git keep the changed file list for one commit?
If it does not, once lots files in working tree, touching all files is not that smart.

Why?

May I take it back? :)

Touching the files will only refresh the mtime in the directory, so not an expensive operation. Vss2git bundles files having mtimes within a certain time interval like 2 minutes, so normally only a dozen files or so per commit. And vss2git is a batch process, it can run overnight, I don't care about a few minutes more.

Your workarounds will also cost some time (more than touch, I guess).

Agree that.

I only need to apply this workaround in those cases that git add fails

What if there are 2 changed files in one commit, A and B, and git status only show B is modified?
Then, git add will be not failed, and A is not commit.
I think the better workaround is using the "Set File Time" of VSS's checkout to "Current" (Not "last file modification", nor "last file check-in".), IIRC.

Thanks for your ideas!

Yor're welcome!
And thanks for your report, because I got another chance to study index file format and relevent code that I want to do about 2 months ago. ^_^

BTW, I am a git user, too.

@dscho
Copy link
Member

dscho commented Feb 7, 2015

Even long-time Git users are surprised.

I fear you misunderstood.

Git – as Visual Studio, or make – needs to figure out which files have changed since the last time it saw them. The reason for Git – as for Visual Studio, or make – is that it would be prohibitively costly to just handle all the files again. Visual Studio would have to make a full build otherwise, even if all you did was to fix a typo. And Git would have to re-checksum all the files again – a really wasteful way to handle your files.

That is why Visual Studio, and make, and Git, and everybody else relies on the mtime as a means to say "this is when the file was last modified here".

By the way, that is exactly the same as with backup systems, because backup systems are not intended to synchronize data between two locations. They are intended to reconstruct the exact state of a given working directory as on that machine. That is why they preserve mtime, so that tools like Visual Studio, make – or Git – can say: "Oh, I see, git.c is older than git.o, I guess I do not have to rebuild it".

It is important to realize, and I encourage you to wrap your head around that concept, too, that the time when a given file revision entered the repository is a very different beast from the mtime of the file of the reconstructed revision.

The puzzlement you saw was of some developers who thought that Git would do something different than what the tools expect, namely to play games with the mtime (i.e. change it to something that does not reflect when that file, in that location, was last modified).

But that puzzlement went away when it was clarified that Git does exactly what it is expected to: record source code revisions (and attach a timestamp to the entire revision instead of recording individual mtimes which it is not supposed to).

Can vss2git keep the changed file list for one commit?
If it does not, once lots files in working tree, touching all files is not that smart.

Why? Touching the files will only refresh the mtime in the directory, so not an expensive operation.

No, touching is not. But the consequences are. You basically add a lot of churn for the tools that are now fooled into believing that they did not see the most recent contents of the files. You will cause complete builds – which can easily last several hours for complex C++ projects, something that you should not want to cause – or in the case of Git a full reindexing – which can still last several minutes if you have a large project, and doing so for every single revision you want to import will unduly slow down the entire process.

Well, I am just getting my feet wet as a Git user, I think it is not fair to expect me to dive into this huge codebase and start tinkering with the heart of the program. At least it would cost me days, if not weeks to understand the code that contains so mush wisdom of many smart people.

Well, I, in contrast, think it unrealistic to expect others who do not share your particular problem to solve it for you, in particular when the faulty bit – the wrong mtime modification of VSS – has been identified already and you have everything you need in hand to solve the problem yourself.

And I need to protect the time of the Git for Windows developers: we have a couple of challenges on hand that we need to solve to make Git for Windows a better user experience (and fixing vss2git is not our concern, so it is also a bit unfair to distract us from our very own problems, unless you help us with our problems in return, something that you indicated to be unlikely because of your lack of knowledge of the code base that you did not intend to change). There has been substantial help you received from developers who could not spend that time on Git for Windows as a consequence (and the solution to this problem was outlined as much as people not using VSS could), therefore I closed the issue.

I encourage you to heed the advice – ask VSS to give you a list of files it updated, then touch that set of files, in vss2git – and solve your problem. I am sure you will agree that no change in Git, let alone Git for Windows, is required. If you disagree, you should take the discussion to the Git project proper: git@vger.kernel.org.

In closing: you are welcome for all the help you received.

@j66st
Copy link
Author

j66st commented Feb 8, 2015

unrealistic to expect others who do not share your particular problem to solve it for you

fixing vss2git is not our concern

If I only came here to have a problem with vss2git fixed, you would be right. But I think you are missing my point.

To summarize: I discovered failures of vss2git to transfer certain files to Git. That boiled down to git refusing to add a file that was visibly different from the latest revision in the repository. I was surprised to see this happen and after seeing that this only happens if the file size matches, I suspected some kind of bypassing reality for performance reasons. I could not believe this, since Git is designed to be tracking contents and nothing else. So I decided to file an issue here.
Apparently Git is making assumptions that are not always valid, especially in a Windows environment. It draws a conclusion that a file is unchanged while it is different. That's bad. The excuse might be performance gain, but the price is reliability loss. Every git user should have the option to choose his preference for reliability over speed.
I was stunned to see this issue being interpreted as a vss2git-only problem, and being closed by the product owner without further discussion.

Git developers tend to think from a Unix/POSIX/Linux perspective, that's why the index reflects the Unix file attributes, of which most components do not even exist on Windows. They treated mtimes the way they are commonly used in Unix. But this differs from the Windows environment, where the modification time of a file has traditionally a different meaning. That is easily shown by looking at the basic copy commands: Unix' cp touches the mdate; Windows' COPY or XCOPY does not! This behavior is reflected by many file management tools in each of these environments (file managers, archiving tools, version control systems, backup programs etc.). Both approaches have their advantages. (I think that historically the meaning of ctime and mtime has been confused, but that's another discussion.) You cannot deny this different approach, and you cannot force your Unix view onto the Windows world.
So when porting Git to Windows I think you should accommodate this different mindset (the same way that the slash/backslash and LF/CRLF differences must be properly handled).

I encourage you to wrap your head around that concept

I don't need an argumentation how mdates affect build tooling. I am an experienced C++ developer, I grew up in Unix before Visual Studio even existed. There are different workflows possible, and in large projects you always will have to decide intelligently whether to force a full rebuild.
Many Windows developers consider the modification time of a file as an important attribute to see the age of the code file. (Yes, I know this information would be better stored in the Creation time field, but it is not. And I also know enough arguments why the mtime often won't reflect file age. But many developers use it that way, you can't deny that.) In our practice, the combination of filename+mtime+size is a pretty accurate way to assume that a file is unchanged without checking its contents. But not a guarantee! Similar to the inode that is used in Unix.

So, again, this is not a vss2git-only problem! The assumption that equal filesize and mtime means that a file is unaltered is going to cause problems under Windows (more likely than under Unix). I cannot estimate how often it will occur in the future, but the hard fact is that Git sometimes ignore diffs and that this may cause loss of files.

I am having a hard time "selling" Git to my team who are used to VSS, because (a) from now on they will lose their valuable mtime information, and (b) since yesterday they don't trust that Git will take good care of their data!

So I think that there should be a command line option (and a config setting) to opt for reliability over speed, which can force file inspection and rehashing over mtime-based assumptions. I don't think I am the only one I would welcome a discussion about this among Windows-hosted developers.

I think it does not cost much time to realize this. It may be as simple as just forcing a function to return TRUE when told so by a config flag. If I could complete it myself within a day I would do so and offer it to the community. (I am also willing to offer a day of my time in exchange for someone else doing it.)

And Git would have to re-checksum all the files again – a really wasteful way to handle your files.

Git would need to do so ONLY if size and mtime match, just to be sure.

Even if a general git diff or git status would not see the changes, these commands should have a simple option to look more thoroughly.
And git add, when mentioning a specific path, should never take shortcuts and always rehash the files specified, IMO.

Touching the files will only refresh the mtime in the directory, so not an expensive operation.

No, touching is not.

I think an efficiënt touch tool under Windows should do just that.

In closing: don't bother about my vss2git problem, it has been solved, thanks to all who contributed here. Instead, care about 100% reliability of Git in a Windows environment, in workflows that Windows developers are used to. (Where in the Git manuals is specified that git diff cannot be trusted under circumstances? Under Windows this is not just a "racy" problem!)

@YueLinHo
Copy link

YueLinHo commented Feb 9, 2015

^_^
Share my note of git index format:
git_index_format

@j66st
I got another tricky workaround: attack mtime of .git/index (to a very old date)
I suppose it will trig the racy condition and let git to check file content only for the file which the mtime, ctime, file size are all identical.
It comes from the code in is_racy_timestamp()

        istate->timestamp.sec <= ce->ce_stat_data.sd_mtime.sec

IOW, making istate->timestamp.sec(mtime of .git/index file) always small than that file's mtime.
(but can not be 0)

So, no need to touch files. Just attack one.


Assume git want to support the option, maybe "--no-trust-mtime-of-index-file", I guess the modification to the above code would be

        (no_trust_mtime_of_index_file || istate->timestamp.sec <= ce->ce_stat_data.sd_mtime.sec)

(These code is base on git 1.9.0.
The code has been changed in pu branch,
but I suppose the logic is not changed after a quick review.)

@j66st
Copy link
Author

j66st commented Feb 9, 2015

@YueLinHo
Thanks for the explanation and the good ideas. I hope you did not have to reverse-engineer everything. You can find a clear description of the index file structure in msysgit/git/Documentation/technical/index-format.txt
And I assume you also have read msysgit/git/Documentation/technical/racy-git.txt.

Indeed, zeroing the mtime in the index is a way to force considering the file racy without changing git itself. Normally the index will contain more entries, we would have to walk the list and check every file's size and clear the mtime only if the size is equal.

The code change you propose does indeed hit the right spot, IMO. You would also have to include the USE_NSEC case:

static int is_racy_timestamp(const struct index_state *istate,
                 const struct cache_entry *ce)
{
    return (!S_ISGITLINK(ce->ce_mode) &&
        istate->timestamp.sec &&
        (ignore_cached_mtime ||
#ifdef USE_NSEC
         /* nanosecond timestamped files can also be racy! */
        (istate->timestamp.sec < ce->ce_stat_data.sd_mtime.sec ||
         (istate->timestamp.sec == ce->ce_stat_data.sd_mtime.sec &&
          istate->timestamp.nsec <= ce->ce_stat_data.sd_mtime.nsec))
#else
        istate->timestamp.sec <= ce->ce_stat_data.sd_mtime.sec
#endif
                  )
         );
}

What is left to do then is to wire this option flag to a command-line switch for all relevant commands (I don't know which of the many non-porcelain commands are involved) and to a config setting.

@YueLinHo
Copy link

YueLinHo commented Feb 9, 2015

Normally the index will contain more entries, we would have to walk the list and check every file's size and clear the mtime only if the size is equal.

There are 3 kind of mtime.
One is for a file in working tree.
Another one is for a entry data in index file.
Last one is for index file itself.
istate->timestamp.sec stands for mtime of index file.
(not the mtime value saved in index file.)

So, the "attack" is not walk the all entries data.
the attack means just modify the mtime of index file itself.

I hope you did not have to reverse-engineer everything. You can find a clear description of the index file structure in msysgit/git/Documentation/technical/index-format.txt
And I assume you also have read msysgit/git/Documentation/technical/racy-git.txt.

Actually, I read them first,
then study source code by MSVC debugger (based on TortoiseGit project),
then have that index format note.

@j66st
Copy link
Author

j66st commented Feb 9, 2015

Aha, you mean not altering the index file contents, but just altering the mtime in the .git directory, something like:

touch -m --date=01/01/1980 .git/index

?
So it is a touch, but only a single one, instead of touching all the files in the work dir.
You call this 'attack', is that a command to do so?

@kblees
Copy link

kblees commented Feb 10, 2015

Indeed I guess that the problem could be caused by extracting files from VSS will restore their original mtime (but don't call that "messing up")

The file modification time is updated by the operating system on a successful write() (POSIX) or WriteFile() (Windows) system call. Both POSIX and Win32 documentation are quite clear on this.

You are obviously looking for a time stamp representing the last logical content change by some user (or VSS check-in time or whatever). However, there is no field with such a meaning in file system stat data (neither on Windows nor POSIX/UNIX).

Just that VSS abuses the modification time for this doesn't make it correct or even smart to do so, as it may interfere with a variety of software that expects the documented behaviour.

So please kindly allow me to use terms like "mess up" or "abuse" for this rogue behaviour of VSS.

I really wished that Git could preserve timestamps too! ... I changed the vss2git tool to keep this information at least in the commit message.

Why not use the commit date (git commit --date=...) rather than the commit message?

I want a command switch (and a config setting) to bypass this caching and force Git to not use the mtime as a factor in determining file change.

Clearing cached work tree stat data is as simple as git read-tree HEAD, this forces Git to re-check the content of all files.

However, it naturally comes with a huge performance penalty (> factor 100), e.g. with the WebKit repo (~200k files, ~2GB data):

$ # cold file system cache, without cached stat data:
$ git read-tree HEAD
$ time git status -s -uno

real    13m32.234s
user    0m0.015s
sys     0m0.000s

$ # hot file system cache, without cached stat data:
$ git read-tree HEAD
$ time git status -s -uno

real    1m35.323s
user    0m0.015s
sys     0m0.000s

$ # hot file system cache, with cached stat data:
$ time git status -s -uno

real    0m0.796s
user    0m0.000s
sys     0m0.000s

From these numbers, its pretty obvious that its just not feasible to have Git look at the content of files that haven't changed.

Adding a command line switch or config option to ignore cached stat data would have to be discussed in the upstream mailing list, but given previous discussions about persisting / restoring mtime, I doubt that such a change would be accepted.

@sschuberth
Copy link

@j66st I've triggered a new snapshot build on our CI, which according to the installer test works fine (modulo the mentioned wget error which you can still ignore). You can still use this link to download the new installer as that always points to the latest snapshot build.

@kblees
Copy link

kblees commented Feb 10, 2015

@t-b

The solution outlined by Hannes on the mailing list does sound tempting.

As Git did not detect a change in create date either, it seems that VSS just overwrites the file (rather than delete + recreate). So tracking the FileIndex as inode number wouldn't have helped.

AFAIK, only NTFS and ReFS have stable FileIndexes, it makes no sense to track the (generated) FileIndex of FAT volumes or network file systems.

...and _ino_t is just 16 bit on Windows, so we cannot store 64 bit (or 128 bit for ReFS) anyway.

@j66st
Copy link
Author

j66st commented Feb 10, 2015

@kblees

The file modification time is updated by the operating system on a successful write() (POSIX) or WriteFile() (Windows) system call. Both POSIX and Win32 documentation are quite clear on this.

Yes, but the CopyFile Windows system call preserves file attributes, including mtime. Clearly documented. Most file managers (Windows Explorer, Total Commander, etc.), unzippers, backup-restore tools will do so. If you consider VSS a tool to restore an old version, it is correct to also restore the file attributes. But this is configurable. Vss2git, though, does not use VSS, it replays history itself by directly accessing the VSS repository at the file level.

But forget about VSS and vss2git. That problem has been solved. I am worried that this problem might occur in daily use of Git in a Windows context, where copying files from flash drives, network drives, ZIP archives always preserves/restores the original mtime. I want an option to bypass the Git-assumption that it thinks to "know" files which are older than the index. This option may then be off by default, so no overhead at all for users who want to accept the small risk.

I really wished that Git could preserve timestamps too! ... I changed the vss2git tool to keep this information at least in the commit message.

Why not use the commit date (git commit --date=...) rather than the commit message?

I do so. But the commit date may differ considerably from the file's mtime, and the mtime differs between files. So I let vss2git in addition save a listing of file mtimes in the commit message. For the future (as soon as all team members want to use Git) the commit timestamps may be indeed sufficient, although loss of the real mtimes is still a pain.

I want a command switch (and a config setting) to bypass this caching and force Git to not use the mtime as a factor in determining file change.

Clearing cached work tree stat data is as simple as git read-tree HEAD, this forces Git to re-check the content of all files. However, it naturally comes with a huge performance penalty

I proposed to check only files of which the size has not been changed. This is normally only a small percentage of the files, and thus the performance overhead would hardly be noticeable.

given previous discussions about persisting / restoring mtime

I would appreciate if someone can point me to relevant discussions about mtimes and Git in a Windows environment, to enlighten me; apart from my vss2git problems I still can't estimate how big a risk ignoring this issue would be for the future.

@kusma
Copy link
Member

kusma commented Feb 10, 2015

Clearing cached work tree stat data is as simple as git read-tree HEAD, this forces Git to re-check the content of all files. However, it naturally comes with a huge performance penalty

I proposed to check only files of which the size has not been changed. This is normally only a small percentage of the files, and thus the performance overhead would hardly be noticeable.

Huh, are you saying the files who have changed their size is the common case...?!

@j66st
Copy link
Author

j66st commented Feb 10, 2015

Huh, are you saying the files who have changed their size is the common case...?!

Source code (text) files which are edited most of the time will get a different size. Binary files like icon bitmaps normally keep the same size.
So by calling "git add" for a file the caller indicates that the file is added or modified; Git should then first compare the file size, only if the size is unchanged the index should not be trusted (because of Git's caching habits) and the file must be inspected.

@kblees
Copy link

kblees commented Feb 10, 2015

I proposed to check only files of which the size has not been changed. This is normally only a small percentage of the files, and thus the performance overhead would hardly be noticeable.

You typically only work on a small portion of a project, and the majority of files are unchanged (i.e. same size, modification time, creation time and content). Always looking at the content of those unchanged files has quite noticeable performance impact, as measured above (0.7s vs. 95s).

Git should then first compare the file size, only if different the index should not be trusted and the file must be inspected.

This is what Git already does: if the size is different, Git knows for sure that the content changed (without even looking).

But it seems you're contradicting yourself here? Do you want to check files if the size has not been changed, or if the size is different?

@j66st
Copy link
Author

j66st commented Feb 10, 2015

@kblees Sorry for the confusion! I edited a mistake in my message a minute after posting, but you apparently got the text before that change. Please read my last paragraph again:

So by calling "git add" for a file the caller indicates that the file is added or modified; Git should then first compare the file size, only if the size is unchanged the index should not be trusted (because of Git's caching habits) and the file must be inspected.

So only added files have to be inspected, all "sleeping" files are left alone.

@t-b
Copy link

t-b commented Feb 10, 2015

@kbless thanks for the pointers regarding FileIndexes.
I would be okay if the solution does not enhance behaviour on FAT.

@t-b
Copy link

t-b commented Feb 10, 2015

Can we agree on a minimal test case which shows the "unwanted" behaviour?

@j66st Do you think the following should result in git saying that file.txt was modified?

mkdir myrepo
cd myrepo
git init
echo "file contents" > file.txt
touch -t 201501011212 file.txt
git add file.txt
git commit -m "changes"
ed -s file.txt <<< $' 1,$s/file/abcd/g\nw'
touch -t 201501011212 file.txt
git status

The fancy ed line is just a way to change the file without changing its inode number.
On my linux box with git 2.1.0 git does not recognize that file.txt has different contents.

It gives:

Initialized empty Git repository in /home/wg/thomas/devel/myrepo/.git/
1261690 file.txt
[master (root-commit) 61398a3] changes
 1 file changed, 1 insertion(+)
 create mode 100644 file.txt
1261690 file.txt
On branch master
nothing to commit, working directory clean

@j66st
Copy link
Author

j66st commented Feb 11, 2015

The discussion now is branching, but I hope you keep focus on the user's POV. So please step back for a minute and listen to this user story:

" I ran into the following problem with Git: I wanted to update a file of which a version is already in the repository. So I run the command "git add myfile". Looks OK, no error or warning returned. After adding some more files I do a "git commit". I get the normal success response. A few days later a build behaves strange: it turns out that myfile was never updated!
Reconstructing the situation made me to reproduce the problem. I also found out that the problem only occurs if the size of the updated file matches the size of the file in the repository. "git status" and "git diff" see no difference. But with a simple "git show HEAD:myfile" I can easily prove that myfile is different.

This smells like Git is taking a shortcut and does not actually read myfile. After discussing this with a long-time Git user I decided to make a simple repro case and submit a bug report. Got suprised reactions even from some Git developers. The fog started clearing as the "racy git problem" was brought in.

Then the Product Master came in. From his throne in Linux Heaven he looked down to me, humble noob in Windows Hell. He said that such things only happen in Windows where sinners think they can turn back time. But it would be unlikely to occur anyway, and we have to live with it. After preaching extensively about how mtimes affect build tools He concluded with something like "Thou Shalt Not Play With Mtimes" and "This Is How Git Is Designed", arrogantly made clear that I already wasted too much of his and his disciples' valuable time . He pointed His Thumb down and closed the issue.

Then after a short silence slowly the mumbling began in the forum. Some good ideas bubbled to the surface. Several people offered help.

Am I, humble Git noob, asking too much? If I ask "git add myfile" isn't it clear what I want? Is it unlikely that the file has been modified? Is there ANY excuse (apart from a hash collision) for silently NOT updating my file if it is different?
Some disciples started arguing about caching, filesize, inodes, mtimes (some of which is irrelevant in Windows hell). Others mention a big performance penalty that would be incurred. Heck, even if the repository is huge and my working tree is huge: I am asking to update only ONE file. Is it too much trouble for Git to even look at the file? Yes. Instead Git tries to be smart by looking at the directory information and draws the wrong conclusion.

So I think "git add" should always read the file(s) that I want to add before concluding that they are unchanged.
I can accept that "git status" (which must be fast and has to observe my complete working tree) takes a shortcut and use cached stats and occasionally misses a change. Even a global "git diff" might not notice the change. But also "git diff myfile" should not be lazy and actually read the file. Git may try to be smarter than it is right now and update the cache with new wisdom obtained.

So if I am alone with my "woes", I will have to find my own solution. Git is open source, so I can make my private fix. If other Git users feel affected, then maybe we can join forces to find a suitable solution (like an option --i-confess-i-am-a-sinner-so-please-ignore-mtime) in a way that does not itch too much in heaven and makes hell still a place worth living in. "

OK guys, thanks for your time, back to work.

P.S.: Yes, I know I can go for a full refund.

@j66st
Copy link
Author

j66st commented Feb 11, 2015

@t-b

Can we agree on a minimal test case which shows the "unwanted" behaviour?

@j66st Do you think the following should result in git saying that file.txt was modified?

mkdir myrepo
cd myrepo
git init
echo "file contents" > file.txt
touch -t 201501011212 file.txt
git add file.txt
git commit -m "changes"
ed -s file.txt <<< $' 1,$s/file/abcd/g\nw'
touch -t 201501011212 file.txt
git status

I cannot test it with the ed line (ed is not included in msysgit) but if I replace that line with

echo "abcd contents" > file.txt

then I get similar output under Windows.
So this is a good testcase.

@t-b
Copy link

t-b commented Feb 11, 2015

@j66st If my test case shows in your eyes unwanted behvaiour this is a upstream's git behaviour.

This behaviour is a bigger problem on windows as we have only time stamps with second precision and don't include the inode comparison check. Both things could be changed in windows if someone steps volunteering to do or pay.

@YueLinHo
Copy link

So please step back for a minute and listen to this user story:
" I ran into...

^_^ Nice story. @j66st you are a good story teller.

@kblees
Copy link

kblees commented Feb 11, 2015

Yes, but the CopyFile Windows system call preserves file attributes,

Which is not true, CopyFile always sets the archive attribute.

including mtime. Clearly documented.

The MSDN documentation doesn't mention file times at all (apart from community additions), so it may change at any time.

The current behaviour of copying just mtime (not ctime/atime) is completely broken, as it creates files that are last modified before they were created, which makes no sense at all.

AFAICT, the behaviour is also file system specific (e.g. last accessed time behaves differently on FAT and NTFS if NtfsDisableLastAccessUpdate=1 in the registry).

I am worried that this problem might occur in daily use of Git in a Windows context, where copying files from flash drives, network drives, ZIP archives always preserves/restores the original mtime.

Even if CopyFile preserves mtime, you still need a file of same size and mtime but different content. Which IMO is very unlikely to happen in a normal development workflow. E.g. if you copy a file to the network, edit it there and copy it back, the mtime will have changed.

Apart from vss2git, the only currently known way to reproduce the problem requires a special Unix tool (touch -t), for which there isn't even an equivalent on Windows. If there was a use case that triggered the problem (without resetting mtime on purpose), I would agree that it is worth fixing...otherwise its more of a theoretical exercise.

Am I, humble Git noob, asking too much? If I ask "git add myfile" isn't it clear what I want? Is it unlikely that the file has been modified? Is there ANY excuse (apart from a hash collision) for silently NOT updating my file if it is different?

You're forgetting that most git commands support wildcards / pathspecs. So there is good reason to check stat data first, otherwise git add . would suffer the same performance penalties as git status.

So, dear humble Git noob, if you took such great care to produce a file with same size, mtime and ctime, yet different content, is it too much to ask to just touch myfile && git add myfile???

@kblees
Copy link

kblees commented Feb 11, 2015

This behaviour is a bigger problem on windows as we have only time stamps with second precision

The one second granularity is due to a former POSIX limitation of the stat structure (changed in 2013 to include nanoseconds). NTFS time stamp resolution is 100ns.

Tracking file times with nanosecond precision would make the "same mtime by chance" case even more unlikely than it already is. It won't fix the "reset mtime on purpose" case, though.

and don't include the inode comparison check.

Tracking the inode number just helps detecting criss-cross renames. AFAICT it won't affect any of the problems discussed here.

That being said, implementing nanosecond precision is reasonably trivial and won't hurt performance, so I gave it a shot.

@j66st
Copy link
Author

j66st commented Feb 12, 2015

@kblees

The MSDN documentation doesn't mention file times at all (apart from community additions), so it may change at any time.

The current behaviour of copying just mtime (not ctime/atime) is completely broken, as it creates files that are last modified before they were created, which makes no sense at all.

I know, I agree it has always been a mess. But it has been this way since the DOS days, I don't expect MS to change it. Ctimes under Windows are hardly used by anyone, I guess. It would be useful to separately keep content modification timestamp and last write timestamps. Fact is that many Windows developers have a workflow where mtime is used as an always visible content modification time. Moving to Git will change that because Git (from their POV) messes up the mtimes. Major pain was, after the vss2git misbehavior, to find out which files were lost, because we couldn't use mtime anymore to identify if the working copy of a file was the most recent. The problem typically occurred with icon bitmaps, where only a few pixels were edited, so a simple diff is useless.

You're forgetting that most git commands support wildcards / pathspecs. So there is good reason to check stat data first, otherwise git add . would suffer the same performance penalties as git status.

I know. But if I ask to "git add ." I will accept some delay because then I am asking for inspecting every file.

if you took such great care to produce a file with same size, mtime and ctime, yet different content, is it too much to ask to just touch myfile && git add myfile???

I'm still not convinced that in my vss2git case there ever existed two files with same mtime and different content. I only would expect such thing to happen in a "racy" case with mtimes that are current, but then the index in my sample would contain a fresh timestamp. My VSS repository contained 3 versions of the file, with clearly distinct mtimes in the past, so different from the current time. So I still can't understand the content of the Git index in my Demo.zip sample. To find out, I will have to do a vss2git run from the debugger, break before every "git add" and take snapshots of the working directory and git repo. I'm too busy right now, but I will try that soon.

I agree, doing a "touch myfile && git add myfile" would be OK.

@YueLinHo
Copy link

My VSS repository contained 3 versions of the file, with clearly distinct mtimes in the past, so different from the current time. So I still can't understand the content of the Git index in my Demo.zip sample.

Assume:
(1) vss2git checkout file with current time
(2) vss2git converts 2 revisions within 1 second
Then, it should be captured by final racy-git check (that said, index file mtime <= file mtime)
But, in your Demo.zip, the file mtime is 2009-09-17. Obviously, not current time.

Assume vss2git checkout file with modification time or checkin-time
And you say: clearly distinct mtimes in the past
Then, it should captured by the check of different mtime between the file and related index entry data.
(But, In your Demo.zip, mtime are the same.)

Something wrong... :-/
@kblees Do I miss something? :)

@j66st few questions:
What kinds of file mtime does vss2git checkout? current time? modification? or ... ?
Is the Demo.zip exactly the situation you got?
How do you make that Demo.zip?

@j66st
Copy link
Author

j66st commented Feb 12, 2015

@YueLinHo : Here is the export section of the vss2git log file of the session where I built the Demo.zip from. Demo.zip simply contains the resulting directory. You can see every step, all Git commands and their response have been logged.

Initializing Git repository
Executing: C:\Program Files (x86)\Git\cmd\git.exe init
>Initialized empty Git repository in c:/proj/try/git/clonefault/DixiLink2/.git/

Replaying changeset 1 from 08/23/2010 22:25:07
*# c:\proj\try\git\clonefault\DixiLink2: Create SuniLink

Replaying changeset 2 from 07/06/2007 14:27:57
------------------------------------------------------------
Preprocessing shared file mappings

*+ $SuniLink(RKDAAAAA): Share dixilinkerr.h(ZJCAAAAA)
------------------------------------------------------------
*#File Create dixilinkerr.h(ZJCAAAAA)@1 by Joost at 2007-07-06T14:27:57
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Create revision 1
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 2 from 07/06/2007 14:27:57
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpAD3.tmp
>[master (root-commit) de36435] @VSS 01-06-2007 01:54:53 [Create] dixilinkerr.h
> 1 file changed, 139 insertions(+)
> create mode 100644 dixilinkerr.h

Replaying changeset 3 from 09/11/2007 11:57:32
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Edit revision 2
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 3 from 09/11/2007 11:57:32
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpA32C.tmp
>[master cf13dbb] @VSS 11-09-2007 11:08:35 [Edit] dixilinkerr.h
> 1 file changed, 1 deletion(-)

Replaying changeset 4 from 09/17/2009 11:43:39
c:\proj\try\git\clonefault\DixiLink2\dixilinkerr.h: Edit revision 3
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -- dixilinkerr.h
Committing changeset 4 from 09/17/2009 11:43:39
Executing: C:\Program Files (x86)\Git\cmd\git.exe add -A
Generating temp file for comment: @VSS 17-09-2009 11:43:17 [Edit] dixilinkerr.h
Executing: C:\Program Files (x86)\Git\cmd\git.exe commit -F C:\Users\joostadm\AppData\Local\Temp\tmpF207.tmp
>On branch master
>nothing to commit, working directory clean

Replaying changeset 5 from 08/23/2010 22:25:07
*# c:\proj\try\git\clonefault\DixiLink2: Share dixilinkerr.h

Replaying changeset 6 from 11/24/2011 13:28:48
Creating tag 2011-11-24
Generating temp file for comment [...]
Executing: C:\Program Files (x86)\Git\cmd\git.exe tag -F C:\Users\joostadm\AppData\Local\Temp\tmpEDF.tmp -- 2011-11-24
------------------------------------------------------------
Git export complete in 00:35:45
Replay time: 00:00:00
Git time: 00:00:01

The actual timestamps are not logged. I will do a new run to find out.

@kblees
Copy link

kblees commented Feb 12, 2015

In your .git/index file, both ctime and mtime are 0x4ab204b5 = 2009-09-17T09:43:17 (see YueLinHo's screenshot above), which seems to be the mtime of V3.

From the vss2git sources (https://code.google.com/p/vss2git/source/browse/Vss2Git/GitExporter.cs#689), at least the ctime should have been that of V1 (2007-06-01... ~= 0x465f604d).

This also means that when git saw the content of V2, the file already had mtime/ctime of V3, so something is seriously messing up your file times here.

The timestamps you dump to the commit message seem to be correct, where do you get these from (file system or vss2git classes)? Is it possible that your 'dump file times' patch screwed things up?

@t-b
Copy link

t-b commented Feb 12, 2015

@kbess said:

Tracking the inode number just helps detecting criss-cross renames. AFAICT it won't affect any of the > problems discussed here.

Hmm yes. Thanks for the nanoseconds patch btw.

@YueLinHo
Copy link

The logic I realized from vanilla git code:

for each entry data in index and file stat in working tree
{
    if (is mtime different)
        changed |= MTIME_CHANGED;
    if (is ctime different)
        changed |= CTIME_CHANGED;

    if (is mtime nanosec different)
        changed |= MTIME_CHANGED;
    if (is ctime nanosec different)
        changed |= CTIME_CHANGED;

    if (is uid gid different)
        changed |= OWNER_CHANGED;
    if (is ino different)
        changed |= INODE_CHANGED;
    if (is dev different)
        changed |= INODE_CHANGED;

    if (is size different)
        changed |= DATA_CHANGED;

    // the 1st checking for racy-git problem
    if (entry data file size is zero)
        if (!is_empty_blob_sha1(entry data sha1))
            changed |= DATA_CHANGED;

    // the 2nd checking for racy-git problem
    if (!changed && index_file_mtime <= entry_data_mtime)
        changed |= ce_modified_check_fs();  // this DOSE check file content
}

@j66st
Copy link
Author

j66st commented Feb 13, 2015

@YueLinHo

The logic I realized from vanilla git code ...

You picture it very clear in this block of pseudo-code!

Yes, this is also the logic I found out, it took me quite some time before I understood the role of the cache entries, the timestamp of the index file itself in relation to the working tree.

@kblees

Is it possible that your 'dump file times' patch screwed things up?

Yes, it turned out that my patched version used a field that contains the archived mtime of the last version of the file (even during replay of older versions) to set the mtime of the reconstructed file. My intention was to have the working directory reflect the mtimes we are used to. And it did. I did not check mtimes of the intermediary reconstucted versions as I did not care because they are not stored by Git anyway. I now understand that this caused the "racy" condition which confuses Git. Since I discovered that Git will mess up the mtimes in my working directory anyway with every checkout of a different branch, it makes no sense to preserve mtimes any longer. So I now changed vss2git back so that the intermediary mtimes will reflect the changeset time (also used as the commit date in Git). This is the safest bet to avoid any racy conditions, because vss2git's changesets are by definition distinct in time.

vss2git is a complicated program, basically it imitates the full retrieval logic of VSS. I did not study its data structures in full depth. My major goal to patch vss2git was to better handle Shared and Branched files in the VSS repo (note these words have a different meaning in Git) because otherwise most of the history would not be transferred. And the second goal was to keep the original mtimes in the commit message, because some team members are not yet ready to change their workflow and want to see the last real modification time.

To conclude:

  1. First I was stunned to see file changes to be silently ignored by Git. And to see the problem being waved aside by the community.
  2. Since my files had the proper mtimes, I did not expect vss2git was misbehaving, but now I finally understand the logic of the cache and the racy problem, I know that I unconsciously caused a racy condition.
  3. Now I better understand what's going on, I don't think it is likely for such thing to happen in daily use. The nanosecond fix even further reduces the likelihood for certain filesystems. So I think indeed it is not worth the time to further fix it. Although it might make sense to mention it briefly in the manual that mtimes/filesize may have unexpected effects on Git's caching. The pseudologic of @YueLinHo 's post is a concise and clear explanation. The racy-git.txt is a good, detailed document for insiders, but new Git users seeing the symptoms won't immediately associate the problems they run into with the word "racy" and find this document.
  4. I still think that some extension to Git that would preserve mtimes would be welcomed by many Windows users.

Everyone who contributed, thanks a lot for your help and patience!

@YueLinHo
Copy link

I learned what I want and enjoy these conversation from you top guys, so thank all of you here. ^o^d
善哉!

@webmaster33
Copy link

I have a similar case, which may belong here.
I wrote a search & replace perl script, which recursively searches
files and replaces text in them. After replace, it restores original
modification time (mtime) of file.

Interesting, that git status doesn't show replaced changes, if the
mtime is same as original.

Is there a way to force git status to show changes, even if the file
dates are the same?

I tried to set core to:
trustctime = false
checkStat = minimal
Unfortunately the change is still not detected :(
It seems isn't a way to force fallback to file checking and completely
ignore file modification date :(

The following solutions makes the changed files appear again as changed:
a) touch -m --date=01/01/1980 .git/index
So it is a touch, but only a single one, instead of touching all the
files in the work dir.

b) git read-tree HEAD
Also working well.

But these solutions are just workarounds, not the real permanent solutions.

@dscho
Copy link
Member

dscho commented Jul 23, 2015

I wrote a search & replace perl script, which recursively searches
files and replaces text in them. After replace, it restores original
modification time (mtime) of file.

By this "restoring" of the original modification time you broke the contract: the mtime should reflect the time of the latest change. You replaced something, i.e. changed the file contents. Git expects the mtime to be adjusted in that case. By painstakingly faking it back to its original value you essentially told Git: don't worry, this file has not changed since you last saw it.

There is nothing Git can do to outguess you when you go out of your way to break the most fundamental promise of the mtime value.

@PhilipOakley
Copy link

@webmaster33
The issue, as I see it, is that currently Git does not expect this behaviour, and as such has no special mechanism for a forced add of a filename, with an identical mtime (etc) stat but different content, to the staging area, such that when the next commit is made, the new sha1 hash, and content, is used.

I would not expect that Git would ever want to try and detect a content change, without mtime (etc) stat change, and try to show it as 'changed', i.e. unstaged. I'd hope it could allow a forced add (but someone [you?] would have to code it) to allow these special cases where you/your code already knows that the file has changed and the update can be forced so that the 'unstaged' indication would never show itself. (note there is a separation / split of concerns, so the suggested solution changed)

The main use case here appears to be to shift data from one version control system to another, and retain a coarse mtime value held by the old system (probably as a text field, as git does not itself record it) when recreating revisions in the git system, and what is wanted is a "this is what it is, write it, blinkers-on" approach to copying the data across.

[As I type this I realise there maybe some low level blob writing plumbing action that I can't remember the name of that does this (e.g. git-hash-object etc.), the manual is quite a Full manual, so worth a read]

@YueLinHo
Copy link

[As I type this I realise there maybe some low level blob writing plumbing action that I can't remember the name of that does this (e.g. git-hash-object etc.), the manual is quite a Full manual, so worth a read]

My memo:
​git hash-object -w​
​git update-index --add --cacheinfo​
git write-tree
​git commit-tree​

From: Chapter 10, Pro Git 2 Book

@PhilipOakley
Copy link

[As I type this I realise there maybe some low level blob writing plumbing action that I can't remember the name >> of that does this (e.g. git-hash-object etc.), the manual is quite a Full manual, so worth a read]
My memo:
​git hash-object -w​
​git update-index --add --cacheinfo​
git write-tree
​git commit-tree​

From: Chapter 10 of Pro Git 2 Book

Thanks, that's useful.

Lets hope it helps the OP with the macro/script for the transfer from VSS... yeah.

Philip

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants