New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't match files based on shebang language #257

Closed
chriskuehl opened this Issue Jul 31, 2015 · 11 comments

Comments

Projects
None yet
3 participants
@chriskuehl
Member

chriskuehl commented Jul 31, 2015

Often we have scripts written in Python with no extension. Hooks usually miss these since they match on extension.

Would be nice if pre-commit could somehow read shebangs. Maybe something to think about when trying to solve #220 as well. Not sure how to best approach this yet.

@chriskuehl

This comment has been minimized.

Show comment
Hide comment
@chriskuehl

chriskuehl Aug 6, 2015

Member

I spent a little time trying to think of a good way to handle this, as well as #254 (ability for hooks to target types of objects, e.g. file/symlink/submodule) and #220 (ability for hooks to target "text" files and avoid list of "text" extensions).

Here's my suggestion:

Files get automatically tagged

Each file gets tagged with a list of tags describing it. For example, an executable file something.py might be classified ['file', 'text', 'python', 'executable'].

Hooks target file types

Hooks can then specify types (which defaults to file):

# hooks.yaml
-   id: trailing-whitespace
    types: [text]
    entry: trailing-whitespace-fixer

-   id: symlinks-point-inside-repo
    types: [symlink]

-   id: executables-have-shebang
    types: [executable]

-   id: double-quote-string-fixer
    types: [python, ruby, javascript]  # if this hook actually worked on all of these

There's an added benefit in that we no longer need to duplicate the hook extensions in repos if we want to change the path.

# .pre-commit-config.yaml
-   id: puppet-lint
    files: ^modules/ocf

Currently we have to do exclude: ^modules/(?!ocf).*$ instead which will get really complex as soon as we need to start excluding anything else. Now instead of hooks targeting files by extension, the suggested way is targeting them by type. (You can still do it by extension, of course.)

types is an OR operation; a file with any of the tags will match.

I would suggest not including an all types alias, since I think it risks people copy-pasting it without realizing it includes symlinks and submodules. [file, symlink, submodule] is not very cumbersome for the very few hooks that will actually want it (the only existing one I can think of is check-case-conflict).

Implementation

Making this happen is fairly easy (and I'd volunteer to do the work, since I really want to check files with shebangs):

  • Write a file classifier. A very functional classifier is easy to write (check if text, then check extension, fallback to reading shebang) and covers almost all cases.
  • A little plumbing

It's pretty much backwards compatible. Nothing changes about files or exclude. If omitted, types defaults to ['file'] (technically a breaking change, but I think it's a much better default (many of the current hooks will follow symlinks outside of the repo and do nasty things). I can't see any cases where we would break existing stuff.

The main issue is if people upgrade hooks but don't upgrade pre-commit. I don't know a good solution here, unfortunately.

Member

chriskuehl commented Aug 6, 2015

I spent a little time trying to think of a good way to handle this, as well as #254 (ability for hooks to target types of objects, e.g. file/symlink/submodule) and #220 (ability for hooks to target "text" files and avoid list of "text" extensions).

Here's my suggestion:

Files get automatically tagged

Each file gets tagged with a list of tags describing it. For example, an executable file something.py might be classified ['file', 'text', 'python', 'executable'].

Hooks target file types

Hooks can then specify types (which defaults to file):

# hooks.yaml
-   id: trailing-whitespace
    types: [text]
    entry: trailing-whitespace-fixer

-   id: symlinks-point-inside-repo
    types: [symlink]

-   id: executables-have-shebang
    types: [executable]

-   id: double-quote-string-fixer
    types: [python, ruby, javascript]  # if this hook actually worked on all of these

There's an added benefit in that we no longer need to duplicate the hook extensions in repos if we want to change the path.

# .pre-commit-config.yaml
-   id: puppet-lint
    files: ^modules/ocf

Currently we have to do exclude: ^modules/(?!ocf).*$ instead which will get really complex as soon as we need to start excluding anything else. Now instead of hooks targeting files by extension, the suggested way is targeting them by type. (You can still do it by extension, of course.)

types is an OR operation; a file with any of the tags will match.

I would suggest not including an all types alias, since I think it risks people copy-pasting it without realizing it includes symlinks and submodules. [file, symlink, submodule] is not very cumbersome for the very few hooks that will actually want it (the only existing one I can think of is check-case-conflict).

Implementation

Making this happen is fairly easy (and I'd volunteer to do the work, since I really want to check files with shebangs):

  • Write a file classifier. A very functional classifier is easy to write (check if text, then check extension, fallback to reading shebang) and covers almost all cases.
  • A little plumbing

It's pretty much backwards compatible. Nothing changes about files or exclude. If omitted, types defaults to ['file'] (technically a breaking change, but I think it's a much better default (many of the current hooks will follow symlinks outside of the repo and do nasty things). I can't see any cases where we would break existing stuff.

The main issue is if people upgrade hooks but don't upgrade pre-commit. I don't know a good solution here, unfortunately.

@asottile

This comment has been minimized.

Show comment
Hide comment
@asottile

asottile Aug 6, 2015

Member

Seems reasonable to me, under which conditions would files and exclude take effect? or would they be anded with the types?

Member

asottile commented Aug 6, 2015

Seems reasonable to me, under which conditions would files and exclude take effect? or would they be anded with the types?

@chriskuehl

This comment has been minimized.

Show comment
Hide comment
@chriskuehl

chriskuehl Aug 6, 2015

Member

under which conditions would files and exclude take effect? or would they be anded with the types?

My thinking is the AND them. The list of objects to consider running hooks on is built as it is currently (no change; all objects that match the files regex but don't match the exclude regex), then we filter this list to only contain objects that have at least one tag in types.

In this way, hooks are entirely backwards compatible (except for symlinks/submodules, assuming the types default is ['file']).

I think this is also the most obvious from a developer's perspective. For files they specify whatever paths they want hooks to run under (e.g. files: ^src/), for exclude they specify paths to exclude, and they can let the hook worry about specific extensions/types. Currently to only check some specific paths they have to either list all others in excludes one-by-one, duplicate the logic in the hooks.yaml, or use negative lookaheads in exclude like above (which most, including me, don't understand).

Member

chriskuehl commented Aug 6, 2015

under which conditions would files and exclude take effect? or would they be anded with the types?

My thinking is the AND them. The list of objects to consider running hooks on is built as it is currently (no change; all objects that match the files regex but don't match the exclude regex), then we filter this list to only contain objects that have at least one tag in types.

In this way, hooks are entirely backwards compatible (except for symlinks/submodules, assuming the types default is ['file']).

I think this is also the most obvious from a developer's perspective. For files they specify whatever paths they want hooks to run under (e.g. files: ^src/), for exclude they specify paths to exclude, and they can let the hook worry about specific extensions/types. Currently to only check some specific paths they have to either list all others in excludes one-by-one, duplicate the logic in the hooks.yaml, or use negative lookaheads in exclude like above (which most, including me, don't understand).

@asottile

This comment has been minimized.

Show comment
Hide comment
@asottile

asottile Aug 6, 2015

Member

You can use files in .pre-commit-config.yaml without needing to mess with exclude, other than that sounds accurate / reasonable. I'm +1 for breaking backwards compatibility with symlinks / submodules since they've only caused headaches for us and afaik there aren't any existing hooks that actively target symlinks / submodules (though I have some ideas for them).

Overall: let's do it!

Member

asottile commented Aug 6, 2015

You can use files in .pre-commit-config.yaml without needing to mess with exclude, other than that sounds accurate / reasonable. I'm +1 for breaking backwards compatibility with symlinks / submodules since they've only caused headaches for us and afaik there aren't any existing hooks that actively target symlinks / submodules (though I have some ideas for them).

Overall: let's do it!

@chriskuehl

This comment has been minimized.

Show comment
Hide comment
@chriskuehl

chriskuehl Aug 6, 2015

Member

You can use files in .pre-commit-config.yaml without needing to mess with exclude

The problem with that is that you have to go check what the hook uses for files and duplicate it with some modifications, e.g.

files: ^modules/ocf.*\.(html|erb|slim|haml|ejs|jade|js|coffee|json|rb|md|py|css|scss|less|sh|tmpl|txt|yaml|yml)$

Agreed it's not a huge deal in practice but it does feel kinda yucky.

Just one clarification before I start coding. I'm envisioning updating a hook like

-   id: end-of-file-fixer
    name: Fix End of Files
    description: Ensures that a file is either empty, or ends with one newline.
    entry: end-of-file-fixer
    language: python
    files: \.(html|erb|slim|haml|ejs|jade|js|coffee|json|rb|md|py|css|scss|less|sh|tmpl|txt|yaml|yml|pp)$

to

-   id: end-of-file-fixer
    name: Fix End of Files
    description: Ensures that a file is either empty, or ends with one newline.
    entry: end-of-file-fixer
    language: python
    types: [text]

Basically, the idea would be to omit files from the hooks.yaml entry in almost all cases.

This will break the case where people do pre-commit autoupdate but don't upgrade pre-commit. Fortunately it will break in an obvious way (rather than just silently targeting different files): https://gist.github.com/chriskuehl/7b12d2a8aee95c7f4a6a

Is that acceptable? The good news is it only breaks if people autoupdate since we pin SHAs, and it is usually easy to update pre-commit.

The new version of pre-commit would still work with old repos.

Member

chriskuehl commented Aug 6, 2015

You can use files in .pre-commit-config.yaml without needing to mess with exclude

The problem with that is that you have to go check what the hook uses for files and duplicate it with some modifications, e.g.

files: ^modules/ocf.*\.(html|erb|slim|haml|ejs|jade|js|coffee|json|rb|md|py|css|scss|less|sh|tmpl|txt|yaml|yml)$

Agreed it's not a huge deal in practice but it does feel kinda yucky.

Just one clarification before I start coding. I'm envisioning updating a hook like

-   id: end-of-file-fixer
    name: Fix End of Files
    description: Ensures that a file is either empty, or ends with one newline.
    entry: end-of-file-fixer
    language: python
    files: \.(html|erb|slim|haml|ejs|jade|js|coffee|json|rb|md|py|css|scss|less|sh|tmpl|txt|yaml|yml|pp)$

to

-   id: end-of-file-fixer
    name: Fix End of Files
    description: Ensures that a file is either empty, or ends with one newline.
    entry: end-of-file-fixer
    language: python
    types: [text]

Basically, the idea would be to omit files from the hooks.yaml entry in almost all cases.

This will break the case where people do pre-commit autoupdate but don't upgrade pre-commit. Fortunately it will break in an obvious way (rather than just silently targeting different files): https://gist.github.com/chriskuehl/7b12d2a8aee95c7f4a6a

Is that acceptable? The good news is it only breaks if people autoupdate since we pin SHAs, and it is usually easy to update pre-commit.

The new version of pre-commit would still work with old repos.

@asottile

This comment has been minimized.

Show comment
Hide comment
@asottile

asottile Aug 6, 2015

Member

I think we should require files OR types (and allow both). This would still require hooks to be explicit about the things they target.

My concern about removing files in most cases is we'll have to build a mapping of file extension to language name inside pre-commit which I'd ideally like to avoid, though I don't know a way around it. Actually, I think there's a stdlib mimetype module or something of the sort that might solve that problem... hmm...

Member

asottile commented Aug 6, 2015

I think we should require files OR types (and allow both). This would still require hooks to be explicit about the things they target.

My concern about removing files in most cases is we'll have to build a mapping of file extension to language name inside pre-commit which I'd ideally like to avoid, though I don't know a way around it. Actually, I think there's a stdlib mimetype module or something of the sort that might solve that problem... hmm...

@asottile

This comment has been minimized.

Show comment
Hide comment
@asottile

asottile Aug 6, 2015

Member

And yeah, that transformation looks correct

Member

asottile commented Aug 6, 2015

And yeah, that transformation looks correct

@chriskuehl

This comment has been minimized.

Show comment
Hide comment
@chriskuehl

chriskuehl Aug 6, 2015

Member

Agreed that we should still require either files or type. The main reason I'd like to switch to primarily relying on type for language detection is that we have a lot of files with shebangs and no extension, and I can't think of a better way to check those.

Unfortunately the stdlib mimetype module looks at the file extension (path name) and nothing else.

I can understand if pre-commit isn't interested in handling files without extensions as it does make things more complex. I think it's probably a fairly common use-case, though.

My thinking is that we can write a fairly simple classifier that just reads file extensions and shebangs, and it will work great in almost every case. Here is my first attempt:
https://gist.github.com/chriskuehl/dc4a8232ac77e80f4c7c (seems to work in py26 through py34)

The tags look like:

/etc/passwd: {'nonexecutable', 'text'}
/bin/bash: {'binary', 'executable'}
/dev/urandom: {'binary', 'nonexecutable'}
/dev/null: {'nonexecutable', 'text'}
/usr/lib/libdb-4.6.so: {'binary', 'nonexecutable'}
/usr/share/ca-certificates/mozilla/GeoTrust_Global_CA.crt: {'nonexecutable', 'text'}
/usr/bin/reportbug: {'python', 'text', 'executable'}
/usr/lib/python3.4/contextlib.py: {'python', 'nonexecutable', 'text'}
/home/c/ck/ckuehl/ocf-proj/puppet/modules/apache: {'submodule'}
/usr/bin/java: {'symlink'}

The code is probably not entirely correct, but this is the spirit of what I'd like to do.

As you mentioned, it does require maintaining a mapping of extension to file type plus interpreter to file type. Maybe maintaining this in pre-commit doesn't make sense, and it would be better to package it? (I don't mind doing that.) Especially when I think about writing tests for this, packaging it seems to make some sense.

Member

chriskuehl commented Aug 6, 2015

Agreed that we should still require either files or type. The main reason I'd like to switch to primarily relying on type for language detection is that we have a lot of files with shebangs and no extension, and I can't think of a better way to check those.

Unfortunately the stdlib mimetype module looks at the file extension (path name) and nothing else.

I can understand if pre-commit isn't interested in handling files without extensions as it does make things more complex. I think it's probably a fairly common use-case, though.

My thinking is that we can write a fairly simple classifier that just reads file extensions and shebangs, and it will work great in almost every case. Here is my first attempt:
https://gist.github.com/chriskuehl/dc4a8232ac77e80f4c7c (seems to work in py26 through py34)

The tags look like:

/etc/passwd: {'nonexecutable', 'text'}
/bin/bash: {'binary', 'executable'}
/dev/urandom: {'binary', 'nonexecutable'}
/dev/null: {'nonexecutable', 'text'}
/usr/lib/libdb-4.6.so: {'binary', 'nonexecutable'}
/usr/share/ca-certificates/mozilla/GeoTrust_Global_CA.crt: {'nonexecutable', 'text'}
/usr/bin/reportbug: {'python', 'text', 'executable'}
/usr/lib/python3.4/contextlib.py: {'python', 'nonexecutable', 'text'}
/home/c/ck/ckuehl/ocf-proj/puppet/modules/apache: {'submodule'}
/usr/bin/java: {'symlink'}

The code is probably not entirely correct, but this is the spirit of what I'd like to do.

As you mentioned, it does require maintaining a mapping of extension to file type plus interpreter to file type. Maybe maintaining this in pre-commit doesn't make sense, and it would be better to package it? (I don't mind doing that.) Especially when I think about writing tests for this, packaging it seems to make some sense.

@asottile

This comment has been minimized.

Show comment
Hide comment
@asottile

asottile Aug 7, 2015

Member

Looks great, I'm looking forward to this feature :D

Member

asottile commented Aug 7, 2015

Looks great, I'm looking forward to this feature :D

@Lucas-C

This comment has been minimized.

Show comment
Hide comment
@Lucas-C

Lucas-C Aug 17, 2015

Contributor

+1 I'd love this too !

Contributor

Lucas-C commented Aug 17, 2015

+1 I'd love this too !

@asottile asottile added the feature label Apr 4, 2016

@asottile asottile referenced this issue Jul 2, 2017

Merged

Types #551

@asottile

This comment has been minimized.

Show comment
Hide comment
@asottile

asottile Jul 3, 2017

Member

This is now available in pre-commit 0.15.0 via identify and types: [...]

Member

asottile commented Jul 3, 2017

This is now available in pre-commit 0.15.0 via identify and types: [...]

@asottile asottile closed this Jul 3, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment