Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isort behavior depends on which files are available in the sandbox #15069

Open
kaos opened this issue Apr 8, 2022 · 39 comments · May be fixed by #19869
Open

isort behavior depends on which files are available in the sandbox #15069

kaos opened this issue Apr 8, 2022 · 39 comments · May be fixed by #19869
Assignees
Labels
backend: Python Python backend-related issues bug

Comments

@kaos
Copy link
Member

kaos commented Apr 8, 2022

Describe the bug

When running isort on a single source file, the result is not always consistent compared to when running on all sources.

Pants version
2.11.0rc1, 2.12.0.dev1

OS
Mac

Additional info

Test case demonstrating this issue in #15002

It seems there are cases even in fmt that requires transitive deps to be present, not only for check.
See related comment: #14186 (comment)

I've run across this a couple of times now at work, where I have a pre-commit hook that runs on a subset of files (exactly the subset of changed files only), and fails, where as when I run by hand I usually use globs like a hammer and includes a whole lot more, which exposes this issue, due to isort not being stable for the two ways it is being invoked.

@kaos kaos added the bug label Apr 8, 2022
@stuhood stuhood changed the title isort may require transient sources isort may require transitive sources Apr 8, 2022
@stuhood
Copy link
Sponsor Member

stuhood commented Apr 8, 2022

One thing to check is whether it needs transitive dependencies ("everything recursively") or just direct dependencies (things literally mentioned in the imports or explicitly in BUILD files): it seems like given that it is sorting the imports, there is a distinct possibility that it only requires direct dependencies.

But yea: we should definitely fix this. Accuracy is more important than performance.

@kaos
Copy link
Member Author

kaos commented Apr 8, 2022

Ah, yes. Direct dependencies should be enough.

@thejcannon
Copy link
Member

thejcannon commented Apr 8, 2022

Adding my 2 cents: we should instead seed isort's config from source_roots. I'd prefer if formatters don't require dependencies

@kaos
Copy link
Member Author

kaos commented Apr 9, 2022

That would be better. I've not found how to configure my way out of this, so any help with that is appreciated.

Not urgent, so whenever time permits works for me.

@benjyw
Copy link
Sponsor Contributor

benjyw commented May 31, 2022

I've dug into this. It's not that it needs dependencies, it actually may need the entire source root. The reason is that isort has subtle logic to deal with namespace packages, presumably so that if foo.bar and foo.qux are published from two separate repos, it can treat foo.qux as third-party in foo.bar's repo and vice versa.

So if we only have foo/bar/bar.py in the sandbox, and that file imports from foo.qux, isort will classify foo.qux as third-party, because its namespace package detection heuristic sees no .py files in foo, and assumes that foo is a namespace package.

But if some foo/baz.py happens to also be in the sandbox, then isort sees foo as a non-namespace package, and classifies it as first-party (which is what you want).

Note that foo/baz.py may not be a transitive dep of foo/bar/bar.py, so bringing in all deps does not help here.

So we can either:

A) Bring in the entire source root
B) Special-case this, and bring into the sandbox a representative .py file from every parent dir, if there is one.

A seems like massive overkill. So, B I guess? I slightly fear unintended consequences, but it's probably fine since those are real files in the repo, that could have been brought in some other way anyway.

@kaos
Copy link
Member Author

kaos commented May 31, 2022

Thanks @benjyw!

B) Special-case this, and bring into the sandbox a representative .py file from every parent dir, if there is one.

+1 to go with option B. I guess the __init__.py could be that representative? And as such, in most cases should be very light weight as well.

@thejcannon
Copy link
Member

@benjyw just curious, have you looked into the config route?

@thejcannon
Copy link
Member

Nevermind just saw your comment about how the config is wrong for namespace packages

@Eric-Arellano
Copy link
Contributor

Thanks for digging into this Benjy, and reporting on it Andreas. For context, this was known when isort 5 first came out and our docs call attention to it: https://www.pantsbuild.org/docs/python-linters-and-formatters#isort-possible-issues-with-its-import-classifier-algorithm

But our hand-wavey workaround of config files isn't robust, it sounds like. It's also a total gotcha.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Jun 1, 2022

I suspect that that only worked because isort's heuristics respond to the presence of pyproject.toml, regardless of its content.

@thejcannon thejcannon added the backend: Python Python backend-related issues label Jun 7, 2022
@huonw
Copy link
Contributor

huonw commented Jul 4, 2022

For isort specifically, would it be feasible to specify first party modules via the --project arg ("known first party"), rather than simulate the repo structure on disk? This would effectively be using pants' understanding of what's first party rather than relying on isort to guess it.

In our project we were having similar issues with import foo and import foo.bar being randomly split across different sections and found that we could acceptably work around it by setting default_section = "FIRSTPARTY" to just lump all first party and third party packages into a section together.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Jul 5, 2022

@huonw Well, you can certainly set that in an isort config file, and Pants will use that config file. But it sounds like you're suggesting generating config?

@huonw
Copy link
Contributor

huonw commented Jul 5, 2022

Yeah, I'm wondering if, for instance, the following function could be expanded to include something like args.extend(f"-p={module}" for module in magically_list_all_python_modules()) (I definitely have no idea what's possible within Pants, and that may make too-long argument vectors).

def generate_argv(
source_files: tuple[str, ...], isort: Isort, *, is_isort5: bool
) -> Tuple[str, ...]:
args = [*isort.args]
if is_isort5 and len(isort.config) == 1:
explicitly_configured_config_args = [
arg
for arg in isort.args
if (
arg.startswith("--sp")
or arg.startswith("--settings-path")
or arg.startswith("--settings-file")
or arg.startswith("--settings")
)
]
# TODO: Deprecate manually setting this option, but wait until we deprecate
# `[isort].config` to be a string rather than list[str] option.
if not explicitly_configured_config_args:
args.append(f"--settings={isort.config[0]}")
args.extend(source_files)
return tuple(args)

@kaos
Copy link
Member Author

kaos commented Jul 5, 2022

This hit me again in another project.. ;)

And yea, the "known first party" config option seems like a reasonable way to resolve this. Putting it in a generated isort config rather than as args on the command line could help avoid issues with too long command lines as well.

@Eric-Arellano
Copy link
Contributor

I thought we were saying the config approach is not viable for namespace packages? @kaos

If config is viable, then I strongly prefer we auto-setup config for you rather than including all the projects in the sandbox. The latter would be substantially slower, and I think not properly tracked with --changed-since; the output of fmt depends on what other files are in the sandbox.

@kaos
Copy link
Member Author

kaos commented Jul 5, 2022

I'm not sure if I tried with the "known first party" config option. I only tried the "src paths" one, from what I can read in my issue/PR.

The suggested approach by Benjy I think was to add empty __init__.py files (or similar) as placeholders to aid the isort heuristics into coming to correct conclusions, so wouldn't need all the projects in the sandbox, fwict.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Jul 5, 2022

If I understand correctly, doing this automatically might break the foo.bar / foo.qux case I mentioned above (where foo.qux is third-party in foo.bar's repo and foo is a namespace package). We don't know what other repos are out there, so we can't assume that everything under foo is first-party.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Jul 5, 2022

We may have to start by documenting this and offering the workaround of setting the firstparty package explicitly in isort config.

@thejcannon
Copy link
Member

where foo.qux is third-party in foo.bar's repo and foo is a namespace package

To me, we're starting to get a little bit in the weeds on third-vs-first party, but point taken. Perhaps there's an implementation which uses the most-specific (i.e longest path) "superset" of modules?

@benjyw benjyw changed the title isort may require transitive sources isort behavior depends on which files are available in the sandbox Aug 4, 2022
@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 6, 2022

To me, we're starting to get a little bit in the weeds on third-vs-first party, but point taken. Perhaps there's an implementation which uses the most-specific (i.e longest path) "superset" of modules?

This is the crux of the issue, as I understand it. So I think it's the right amount of weeds...

@thejcannon
Copy link
Member

Well specifically I mean if "foo.bar" is "third party" when we talk about "foo.baz". They share an ancestor, so presumably they're in the same party, albeit different packages and codebases

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 7, 2022

True, but Pants doesn't care about whether the same people wrote it or own the IP or whatever. It cares about how code is consumed. The salient point here is that foo.bar is consumed as a requirement. Perhaps "external" and "internal" would be better terminology in general, but keep in mind that isort itself uses the firstparty/thirdparty terminology in its config etc, so in this case it is actually the right term to use.

@thejcannon
Copy link
Member

Right, my point, outside of Pants, is "are packages outside of this one, but still in your namespace considered first party or third-party?" And then slap an isort lens on the question as well

@thejcannon
Copy link
Member

I'm reading over this again, and I'm struggling to grasp the namespace package issue as described above.

@benjyw can you dump a reproduction of the isort issue with namespaces (not using Pants) so we can test a Pants change against this corner case?

@thejcannon
Copy link
Member

OK, so good news and bad news:

Good news: This is easily fixable in a not-nasty way. Essentially specify every import we're going to provide as a firstparty module. It makes sense, and works around the namespace issue. This works purely on the argument level.

Bad news: It requires a bugfix to isort: PyCQA/isort#2168

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 22, 2023

Re specify every import we're going to provide as a firstparty module - we'd have to know it is in fact a firstparty module. I guess we do know this because dep inference knows this?

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 22, 2023

Setting up a test case repo is a good idea though, I will try and do that asap

@thejcannon
Copy link
Member

we'd have to know it is in fact a firstparty module. I guess we do know this because dep inference knows this?

Not dep inference, but Pants wholistically. Whether it's inferred or explicit, if we're providing a Python module via a file on disk, that must be firstparty. That's what I like about this. It's super simple.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 23, 2023

We literally already have this logic in dep inference though. Seems like we can consume that mapping and save ourselves some work.

@thejcannon
Copy link
Member

We don't need the mapping though, we just need the dependencies were about to provide 🙂 (which is a product of the mapping)

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 23, 2023

As I mentioned in #15069 (comment) - looking at all deps (even transitively) is not sufficient. So can you explain what "specify every import we're going to provide as a firstparty module" means?

@thejcannon
Copy link
Member

thejcannon commented Aug 23, 2023

Basically, if an import is being provided by Pants, that's first party, everything else if third party

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 23, 2023

ELI5? I don't know what "import id" means here.

@thejcannon
Copy link
Member

Sorry fat fingered it on mobile: "import is".

Basically we just tell isort what is first party. And we know that because we know what first party (direct) dependencies we provide.

I think it's probably worth just demonstrating via a reproduction/test, especially since namespace packages are in the mix

@benjyw
Copy link
Sponsor Contributor

benjyw commented Aug 24, 2023

Yeah, I fear it's a little more subtle than that, but it's on me to make a repro of the problem.

@thejcannon
Copy link
Member

OK, I wanted this off my stack, so I decided to jump in:

josh@cephandrius:~/work/splash$ tree .
.
├── bad
│   ├── foo
│   │   └── bar
│   │       ├── bar.py
│   │       └── __init__.py
│   └── run_isort.sh
└── good
    ├── foo
    │   ├── bar
    │   │   ├── bar.py
    │   │   └── __init__.py
    │   └── baz.py
    └── run_isort.sh

6 directories, 7 files

Files are identical across tree, with:

josh@cephandrius:~/work/splash$ cat bad/foo/bar/bar.py 
import foo.qux
import requests
josh@cephandrius:~/work/splash$ cat bad/run_isort.sh 
#!/usr/bin/env bash

cd "$(dirname "$0")"
pipx run isort foo/bar/bar.py "$@"

Then running each:

josh@cephandrius:~/work/splash$ ./bad/run_isort.sh 
josh@cephandrius:~/work/splash$ ./good/run_isort.sh 
Fixing /home/josh/work/splash/good/foo/bar/bar.py

and for contents:

josh@cephandrius:~/work/splash$ diff -u bad/foo/bar/bar.py good/foo/bar/bar.py 
--- bad/foo/bar/bar.py  2023-09-08 09:52:26.240101161 -0500
+++ good/foo/bar/bar.py 2023-09-08 09:53:32.650789329 -0500
@@ -1,2 +1,3 @@
-import foo.qux
 import requests
+
+import foo.qux

So @benjyw your theory holds.


Now, let's reset everything and try again with my proposed fix (which is "if Pants could provide the direct dependency as a file, we pass -p=<modname>")

josh@cephandrius:~/work/splash$ ./bad/run_isort.sh -p=foo.qux
Fixing /home/josh/work/splash/bad/foo/bar/bar.py
josh@cephandrius:~/work/splash$ ./good/run_isort.sh -p=foo.qux
Fixing /home/josh/work/splash/good/foo/bar/bar.py
josh@cephandrius:~/work/splash$ diff -u bad/foo/bar/bar.py good/foo/bar/bar.py 

So, as expected. We know what's firstparty because we're the all-seeing-eye of firstparty/thirdparty

@huonw
Copy link
Contributor

huonw commented Sep 9, 2023

Reproducer in script form, for easier reproduction on other machines:

cd $(mktemp -d)

# isort setup
python -m venv venv
. venv/bin/activate
pip install isort==5.12.0

# configure the directory structure
mkdir -p foo/bar/
touch foo/bar/__init__.py
cat > foo/bar/bar.py <<EOF
import foo.qux
import requests
EOF

echo '"BUG": foo.qux is treated as third-party'
python -m isort foo/bar/bar.py
cat foo/bar/bar.py

echo 'Proposed fix: explicit -p arg'
python -m isort foo/bar/bar.py -p=foo.qux
cat foo/bar/bar.py

echo 'Proposed fix matches behaviour when the file actually exists'
touch foo/qux.py
python -m isort foo/bar/bar.py
cat foo/bar/bar.py

Now, let's reset everything and try again with my proposed fix (which is "if Pants could provide the direct dependency as a file, we pass -p=")

(FWIW, this is a independent reinvention of a suggestion slightly earlier in the thread, e.g. #15069 (comment))

@thejcannon
Copy link
Member

Yup I saw that comment after having the same thought. Great minds think alike?

I have the code in a branch. I need to turn this reproduction into a test on main so that we can enshrine the quirk, and the fix, forever

@benjyw
Copy link
Sponsor Contributor

benjyw commented Sep 13, 2023

So I've made a repro of the subtle problem, and I think I've convinced myself that it's not subtle after all, and that pulling direct deps into the sandbox will be sufficient to solve the problem without mucking around with config:

https://github.com/benjyw/isort-issue

I got tangled up in the idea that a file outside your deps could change the classification, which is true. But it's also true that it can only change it in the same direction as a direct dep would. So...

thejcannon added a commit that referenced this issue Sep 15, 2023
#17403 (comment)
describes a bit of a split between `lint` and `fix` in terms of
batching.

This change fixes #19634 by switching `AbstractFixRequest.files` to
report `self.snapshot.files` which is already de-duplicated. (Note that
the elements in the case of `lint` were the filepaths directly
translated from addresses. Maybe it's worth de-duplicating at a higher
layer, but 🤷‍♂️ )

This change also fixes #17403 (which really is a duplicate of #15069
manifest in new and exciting ways) by de-duplicating the `fix`-specific
batching strategy. It didn't sit right to me that in the reproduction
repo we did 1 batches of 2 files for `fix`, but 1 batch of 4 files for
`lint`.
@thejcannon thejcannon linked a pull request Sep 19, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: Python Python backend-related issues bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants