Problem with smudge/clean filters #74

Closed
fd opened this Issue Mar 18, 2013 · 14 comments

Comments

Projects
None yet
6 participants
@fd

fd commented Mar 18, 2013

Smudge/clean filters are not applied by git show which causes GitGutter to detect ghost-changes.

@jisaacks

This comment has been minimized.

Show comment Hide comment
@jisaacks

jisaacks Mar 23, 2013

Owner

Can you give some more details please. I personally have never used that feature of git.

Owner

jisaacks commented Mar 23, 2013

Can you give some more details please. I personally have never used that feature of git.

@kofalt

This comment has been minimized.

Show comment Hide comment
@kofalt

kofalt Feb 17, 2016

Hello @jisaacks! I just ran into this today and I can help lend some context.

Git filters allow you to modify files when checking out or committing. You can do a couple cool things with this: store & version an encrypted file alongside your normal source code, for example. I think the Windows Git client uses filters to solve compatibility problems with line endings as well.

Docs are here but the pictures are pretty great:

smudge

clean

Here's a popular python library, edited locally as tabs, without messing up the diff:

# Grab an example repo
git clone https://github.com/kennethreitz/requests.git && cd requests

# Create a filter called "tabspace"
git config filter.tabspace.smudge 'unexpand --tabs=4 --first-only'
git config filter.tabspace.clean  'expand   --tabs=4 --initial'

# Tell git to run the filter on all python files
echo '*.py  filter=tabspace' >> .git/info/attributes

# Bit silly: delete all tracked files & checkout again to trigger the filter
git rm -rf -q .
git reset -q .
git checkout .

# Have a look!
sublime_text requests/cookies.py

You'll see that the smudge filter has been applied and the file is indented with tabs, unlike its upstream variant. Meanwhile, git diff is empty, because the clean filter applies before diffing or comitting.

You'll also see that GitGutter thinks each line changed by a filter has a diff status:

gutter

I think this is what @fd was getting at - that GitGutter's method for checking the diff (here?) doesn't account for git filters, which leads to erroneous reporting.

I'm not sure of the best approach for a fix - all the information is stored in the standard git config files, so it's possible to smudge manually, but ideally there's a way to ask git to handle it for you.

kofalt commented Feb 17, 2016

Hello @jisaacks! I just ran into this today and I can help lend some context.

Git filters allow you to modify files when checking out or committing. You can do a couple cool things with this: store & version an encrypted file alongside your normal source code, for example. I think the Windows Git client uses filters to solve compatibility problems with line endings as well.

Docs are here but the pictures are pretty great:

smudge

clean

Here's a popular python library, edited locally as tabs, without messing up the diff:

# Grab an example repo
git clone https://github.com/kennethreitz/requests.git && cd requests

# Create a filter called "tabspace"
git config filter.tabspace.smudge 'unexpand --tabs=4 --first-only'
git config filter.tabspace.clean  'expand   --tabs=4 --initial'

# Tell git to run the filter on all python files
echo '*.py  filter=tabspace' >> .git/info/attributes

# Bit silly: delete all tracked files & checkout again to trigger the filter
git rm -rf -q .
git reset -q .
git checkout .

# Have a look!
sublime_text requests/cookies.py

You'll see that the smudge filter has been applied and the file is indented with tabs, unlike its upstream variant. Meanwhile, git diff is empty, because the clean filter applies before diffing or comitting.

You'll also see that GitGutter thinks each line changed by a filter has a diff status:

gutter

I think this is what @fd was getting at - that GitGutter's method for checking the diff (here?) doesn't account for git filters, which leads to erroneous reporting.

I'm not sure of the best approach for a fix - all the information is stored in the standard git config files, so it's possible to smudge manually, but ideally there's a way to ask git to handle it for you.

@jisaacks

This comment has been minimized.

Show comment Hide comment
@jisaacks

jisaacks Feb 18, 2016

Owner

@kofalt that makes sense. Thanks for explaining. You are correct that is where the diff is happening but you may not have noticed that its not actually diff'ing the file or the repo. There is a function called update_buf_file that creates a temp file with the contents of what is in sublime text. There is another function called update_git_file that creates a temp file with the contents of what git has for that file. Then it asks git to diff those 2 temp files. So I am wondering if maybe in the update_git_file function, it can ask for the contents to be smudged, since the contents of the sublime file are already smudged (if I am understanding correctly.)

Owner

jisaacks commented Feb 18, 2016

@kofalt that makes sense. Thanks for explaining. You are correct that is where the diff is happening but you may not have noticed that its not actually diff'ing the file or the repo. There is a function called update_buf_file that creates a temp file with the contents of what is in sublime text. There is another function called update_git_file that creates a temp file with the contents of what git has for that file. Then it asks git to diff those 2 temp files. So I am wondering if maybe in the update_git_file function, it can ask for the contents to be smudged, since the contents of the sublime file are already smudged (if I am understanding correctly.)

@kofalt

This comment has been minimized.

Show comment Hide comment
@kofalt

kofalt Feb 18, 2016

That sounds right to me.
Am I reading update_git_file correctly in that it is basically running git show HEAD:[file]?

I was surprised to find no obvious ways to make git show apply a smudge filter. Considering asking the git maintainers for a flag. Or maybe I'm just missing something.

I did find two cheeky ways to apply filters without messing with the repo:

# Not smudged
$ git show HEAD:requests/cookies.py
...
    import threading

# Smudged! (renders in github as spaces; works IRL)
$ git archive HEAD requests/cookies.py | tar x requests/cookies.py -O
...
    import threading

# Also smudged! Doesn't use archives, but does touch disk
$ mkdir -p ../oy
$ git --work-tree=../oy checkout requests/cookies.py
$ cat ../oy/requests/cookies.py
...
    import threading

If you've got a more clever approach, that'd clearly be preferable :)

kofalt commented Feb 18, 2016

That sounds right to me.
Am I reading update_git_file correctly in that it is basically running git show HEAD:[file]?

I was surprised to find no obvious ways to make git show apply a smudge filter. Considering asking the git maintainers for a flag. Or maybe I'm just missing something.

I did find two cheeky ways to apply filters without messing with the repo:

# Not smudged
$ git show HEAD:requests/cookies.py
...
    import threading

# Smudged! (renders in github as spaces; works IRL)
$ git archive HEAD requests/cookies.py | tar x requests/cookies.py -O
...
    import threading

# Also smudged! Doesn't use archives, but does touch disk
$ mkdir -p ../oy
$ git --work-tree=../oy checkout requests/cookies.py
$ cat ../oy/requests/cookies.py
...
    import threading

If you've got a more clever approach, that'd clearly be preferable :)

@kofalt

This comment has been minimized.

Show comment Hide comment
@kofalt

kofalt Feb 18, 2016

Hmm. I just noticed that after running git show you're writing the result to disk anyway for git to read. Given that, the second option might be a valid drop-in replacement.

kofalt commented Feb 18, 2016

Hmm. I just noticed that after running git show you're writing the result to disk anyway for git to read. Given that, the second option might be a valid drop-in replacement.

@rchl rchl added the enhancement label Nov 16, 2016

@deathaxe

This comment has been minimized.

Show comment Hide comment
@deathaxe

deathaxe Jan 29, 2017

Collaborator

I prepared a simple fix for this issue locally by replacing the git show command by git archive.

It depends on the fix for Issue #340 which is not yet published due to pending PRs.

Collaborator

deathaxe commented Jan 29, 2017

I prepared a simple fix for this issue locally by replacing the git show command by git archive.

It depends on the fix for Issue #340 which is not yet published due to pending PRs.

@kofalt

This comment has been minimized.

Show comment Hide comment
@kofalt

kofalt Jan 30, 2017

That's awesome 👍

I still use git filters, so if you show me a way to test a new version of the plugin I will gladly do so!

kofalt commented Jan 30, 2017

That's awesome 👍

I still use git filters, so if you show me a way to test a new version of the plugin I will gladly do so!

@deathaxe

This comment has been minimized.

Show comment Hide comment
@deathaxe

deathaxe Feb 13, 2017

Collaborator

@kofalt: You could checkout the commits from my PR and help reviewing and bug hunting.

Collaborator

deathaxe commented Feb 13, 2017

@kofalt: You could checkout the commits from my PR and help reviewing and bug hunting.

@kofalt

This comment has been minimized.

Show comment Hide comment
@kofalt

kofalt Feb 13, 2017

I've got good news & bad news 😄

Using my example above, on the tip of GitGutter master:

Using 173e502, the current tip of #372:

So that works!

The bad news is that it looks like it's easily confused by symlinks.
So if you use my example above, but:

# Leave the directory we cloned into
cd ..

# Alias the directory
ln -s requests/ link

# Have a look via the symlink 
sublime_text link/requests/cookies.py

You'll see this:

It looks like I can reproduce that error without filters though. If you like I can file that as a separate ticket!

kofalt commented Feb 13, 2017

I've got good news & bad news 😄

Using my example above, on the tip of GitGutter master:

Using 173e502, the current tip of #372:

So that works!

The bad news is that it looks like it's easily confused by symlinks.
So if you use my example above, but:

# Leave the directory we cloned into
cd ..

# Alias the directory
ln -s requests/ link

# Have a look via the symlink 
sublime_text link/requests/cookies.py

You'll see this:

It looks like I can reproduce that error without filters though. If you like I can file that as a separate ticket!

@deathaxe

This comment has been minimized.

Show comment Hide comment
@deathaxe

deathaxe Feb 13, 2017

Collaborator

Good news for the bad news: I found the reason. This is not related to the smudge stuff but a more common issue. Will fix it.

Collaborator

deathaxe commented Feb 13, 2017

Good news for the bad news: I found the reason. This is not related to the smudge stuff but a more common issue. Will fix it.

@deathaxe

This comment has been minimized.

Show comment Hide comment
@deathaxe

deathaxe Feb 13, 2017

Collaborator

Pushed the fix to master and the PR branch. Should work now.

Collaborator

deathaxe commented Feb 13, 2017

Pushed the fix to master and the PR branch. Should work now.

@zertrin

This comment has been minimized.

Show comment Hide comment
@zertrin

zertrin Mar 14, 2017

Hi, I just found this issue because after setting up a .gitattribute filter to strip output data and metadata from ipython notebooks (https://github.com/kynan/nbstripout) and making sure that it is working in git bash when diffing, adding and commiting, GitGutter still shows me differences for these chunks (see screenshot below).

gitgutter_iss74_filter

git status and git diff correctly report no changes to that file (even if the working copy differs from the index).

I'm using GitGutter v1.5.0 with ST3

My .gitattribute file contains:

*.ipynb filter=nbstripout

and the filter is set up as the following:

filter.nbstripout.clean="c:/users/xxxxxxxxxx/appdata/local/continuum/anaconda3/python.exe" "c:/users/xxxxxxxxxx/appdata/local/continuum/anaconda3/lib/site-packages/nbstripout.py"
filter.nbstripout.smudge=cat
filter.nbstripout.required=true

I am suspecting that the problem comes from the fact that the smudge filter in my case would result in a different file than the one currently in the working copy (the commited file contains no outputs, while the one in the working copy still has the outputs that were subsequently filtered by the clean filter while adding and committing).

So I guess my question is: what files does GitGutter compares exactly to make a decision? Does GitGutter uses the working copy file at all? IMHO it shouldn't use the one in the working copy (that may not have been smudged yet). The comparison should be done between the "working copy file after cleaning" and the file in the index.

However that raises the question of how can gitgutter associate the lines between the "working copy" and the "working copy after applying the clean filter" in order to show the diff on the "working copy" file...

.... the more I write this comment, the more I see that this isn't easily solvable, and that assuming that the working copy file has to be already smudged may be the only solution.

zertrin commented Mar 14, 2017

Hi, I just found this issue because after setting up a .gitattribute filter to strip output data and metadata from ipython notebooks (https://github.com/kynan/nbstripout) and making sure that it is working in git bash when diffing, adding and commiting, GitGutter still shows me differences for these chunks (see screenshot below).

gitgutter_iss74_filter

git status and git diff correctly report no changes to that file (even if the working copy differs from the index).

I'm using GitGutter v1.5.0 with ST3

My .gitattribute file contains:

*.ipynb filter=nbstripout

and the filter is set up as the following:

filter.nbstripout.clean="c:/users/xxxxxxxxxx/appdata/local/continuum/anaconda3/python.exe" "c:/users/xxxxxxxxxx/appdata/local/continuum/anaconda3/lib/site-packages/nbstripout.py"
filter.nbstripout.smudge=cat
filter.nbstripout.required=true

I am suspecting that the problem comes from the fact that the smudge filter in my case would result in a different file than the one currently in the working copy (the commited file contains no outputs, while the one in the working copy still has the outputs that were subsequently filtered by the clean filter while adding and committing).

So I guess my question is: what files does GitGutter compares exactly to make a decision? Does GitGutter uses the working copy file at all? IMHO it shouldn't use the one in the working copy (that may not have been smudged yet). The comparison should be done between the "working copy file after cleaning" and the file in the index.

However that raises the question of how can gitgutter associate the lines between the "working copy" and the "working copy after applying the clean filter" in order to show the diff on the "working copy" file...

.... the more I write this comment, the more I see that this isn't easily solvable, and that assuming that the working copy file has to be already smudged may be the only solution.

@deathaxe

This comment has been minimized.

Show comment Hide comment
@deathaxe

deathaxe Mar 14, 2017

Collaborator

How GitGutter works

GitGutter basically creates 2 files in your temp folder (git_gutter_...) which are compared against each other using git diff --no-index ... <file1> <file2>.

The <file1> contains a copy of Sublime Text's view.
The <file2> contains the unzipped content of git archive --format=zip <hash> <file_name>

The <hash> is the current commit hash of your compare target (HEAD, branch, tag). <file2> is updated only, if this <hash> changes. The <file_name> is the relative path of the file within the index.

As descriped in some posts above smudge filters are applied to <file2> by git archive.

Conclusion

This means you can use smudge filters, which will manipulate the file when it is checked out, but you can't use smudge filters, which manipulate the working file before commit.

There is no way to apply any smudge filters on <file1>!

Diff

As you stated correctly, we can't call git diff against the working file directly, as it may differ from what the view shows.

Calling git diff against the temporary <file1> using the --git-dir and --work-tree arguments doesn't work reliably and would slow down the diff process, if the working tree was located on a remote machine.

GitSavvy as an example uses git hash-object -w to create dummy commits for the local file, but this creates dangling objects for each modification in the database and also may slow down diffing on remote drives. This is no option!

Collaborator

deathaxe commented Mar 14, 2017

How GitGutter works

GitGutter basically creates 2 files in your temp folder (git_gutter_...) which are compared against each other using git diff --no-index ... <file1> <file2>.

The <file1> contains a copy of Sublime Text's view.
The <file2> contains the unzipped content of git archive --format=zip <hash> <file_name>

The <hash> is the current commit hash of your compare target (HEAD, branch, tag). <file2> is updated only, if this <hash> changes. The <file_name> is the relative path of the file within the index.

As descriped in some posts above smudge filters are applied to <file2> by git archive.

Conclusion

This means you can use smudge filters, which will manipulate the file when it is checked out, but you can't use smudge filters, which manipulate the working file before commit.

There is no way to apply any smudge filters on <file1>!

Diff

As you stated correctly, we can't call git diff against the working file directly, as it may differ from what the view shows.

Calling git diff against the temporary <file1> using the --git-dir and --work-tree arguments doesn't work reliably and would slow down the diff process, if the working tree was located on a remote machine.

GitSavvy as an example uses git hash-object -w to create dummy commits for the local file, but this creates dangling objects for each modification in the database and also may slow down diffing on remote drives. This is no option!

@zertrin

This comment has been minimized.

Show comment Hide comment
@zertrin

zertrin Mar 15, 2017

Thanks for the detailed explanation! I'll live with it 😉

zertrin commented Mar 15, 2017

Thanks for the detailed explanation! I'll live with it 😉

deathaxe added a commit that referenced this issue May 13, 2017

Fix: All export-ignored files are marked as new files (Issue #409)
Issue:

Users repetitively report files being marked as new even though they aren't.

Since commit dce471e GitGutter uses `git archive` to read the content of a file
from the tree. This change was made to fix issue #74 and add some basic support
for smudge/clean filters.

But `git archive` doesn't find files which match the `export-ignore` patterns
in the `.gitattributes` or `.git/info/attributes` files. This results in empty
output for a file which is expected to be there and thus will be displayed as
added in the gutter area.

Solution:

The plumping command `git cat-file --filters` reads the file content with
smudge filters applied as well. Thus it is used to replace `git archive` to
fix this issue without breaking smudge/clean filters support.

deathaxe added a commit that referenced this issue May 29, 2017

Fix: Use git --filters argument only if git 2.11.0+ is available (Issue
#416)

Commit 4a95320 introduced `git cat-file --filters` to read content from a tree object and apply smudge filters. The reason for that change was `git archive` ignoring files which are filtered from export in a .gitattributes file.

With release of GitGutter 1.7.0 it turned out that `--filters` argument is supported by git 2.11.0 and above only, which may not be available on some conservative Linux distributions. As a result of the failed execution GitGutter sees an empty file and marks all lines added.

The only solution right now is to use `-p` instead of `--filters` if older git versions are detected. This will reintroduce issue #74 for all users of legacy git setups but there is no other solution available.

Note: smudge filters require git 2.11.0+ beginning with this commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment