
refactor(ci): move to poetry #2937

Merged: 1 commit merged into ibis-project:master from the poetry branch on Oct 15, 2021

Conversation

@cpcloud (Member) commented Sep 6, 2021

This PR is an extraction from #2924
of just the poetry/ci bits.

The piece of code here with the most changes is the main CI workflow
(.github/workflows/main.yml); that file is the main thing to focus on for review.

@pep8speaks commented Sep 6, 2021

Hello @cpcloud! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 131:80: E501 line too long (2082 > 79 characters)

Comment last updated at 2021-10-15 18:02:14 UTC

@datapythonista (Contributor) left a comment

I don't think we want to add poetry.lock to the repo. It will freeze versions, so we won't be aware if installing Ibis with the latest dependencies is broken, and it will also force us to regenerate the file. Or does linguist-generated mean some magic I'm unaware of?

@cpcloud (Member, Author) commented Sep 6, 2021

linguist-generated is how you tell GitHub not to display generated files in a PR diff (e.g., a `poetry.lock linguist-generated=true` entry in .gitattributes collapses the file in the diff view).

@cpcloud (Member, Author) commented Sep 6, 2021

> This will freeze versions

Yes, that is their entire reason for existence :)

> we won't be aware if installing Ibis with the latest dependencies is broken.

we can check this in CI with another job

> and it will also force us to regenerate the file.

Yes, that's a good thing. It prevents two people from having different sets of dependencies.

Additionally, we can use dependabot to handle the boring process of upgrading existing dependencies on a regular basis.

Lock files are a really great way of preventing a large swath of "work on my machine" problems.
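
For context, a minimal sketch of what such a Dependabot setup could look like (illustrative only, not part of this PR; Dependabot's pip ecosystem also covers poetry-managed pyproject.toml/poetry.lock files):

    # .github/dependabot.yml (illustrative sketch; not included in this PR)
    version: 2
    updates:
      - package-ecosystem: "pip"     # the pip ecosystem also updates poetry.lock for poetry projects
        directory: "/"               # directory containing pyproject.toml
        schedule:
          interval: "weekly"
        open-pull-requests-limit: 5  # cap the number of open upgrade PRs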

@cpcloud (Member, Author) commented Sep 6, 2021

On the other hand, poetry.lock isn't necessary at the moment, so if it'll make the code easier to review I'll remove it.

@datapythonista (Contributor)

I fail to see your point. The way I see things, there are two different kinds of projects:

  • Something like a website that is going to be deployed on a server, where freezing dependencies and making sure everybody (developers, production servers...) uses the exact same environment is a good thing
  • A library used by lots of diverse users, on different operating systems and with different versions of the dependencies, where freezing dependencies is good for the project developers but bad for the users, who can just conda install ibis-project and get a broken environment that maintainers were not aware of

I guess I'm missing something: poetry.lock seems great for the former case, but I fail to see how it's helpful for the latter. Do you want all devs to test on an arbitrarily frozen environment, and have jobs in the CI testing the latest and minimum dependencies? Or is there something else? How easy will it be to run with the latest dependencies locally?

@cpcloud (Member, Author) commented Sep 6, 2021

> Where freezing dependencies is good for the project developers but bad for the users, who can just conda install ibis-project and get a broken environment that maintainers were not aware of

I think this is a misunderstanding of how lock files work. Dependencies are constrained by pyproject.toml, which doesn't force users into anything except the version specifications in that file. Right now, those are fairly liberally constrained, for exactly the kind of scenario you are talking about: diverse sets of users.

With that in mind, how does checking in poetry.lock actually prevent users from doing anything? Users are never exposed to lockfiles.

@cpcloud (Member, Author) commented Sep 6, 2021

> How easy will it be to run with the latest dependencies locally?

As easy as poetry update. If you want to check in any changes made with those new dependencies, then you'd have to run it through CI which would use the locked versions, because that's what you tested with.
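
To connect this with the earlier point about checking unpinned dependencies in another CI job, here is a minimal sketch of such a job, assuming pytest is declared as a poetry dev dependency (the job name, Python version, and test path are illustrative, not taken from this PR):

    # Illustrative sketch, not part of this PR: a job that ignores the locked
    # versions and resolves the newest versions allowed by pyproject.toml, so
    # breakage with the latest dependencies still surfaces in CI.
    test_latest_deps:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: actions/setup-python@v2
          with:
            python-version: "3.9"
        - name: install poetry
          run: pip install poetry
        - name: install the newest allowed dependency versions
          run: |
            poetry update    # re-resolve to the latest versions permitted by pyproject.toml
            poetry install   # install ibis itself into the same environment
        - name: run the backend-independent tests
          run: poetry run pytest ibis/tests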

Resolved review threads on docs/web/config.yml, ibis/__init__.py, pyproject.toml, and setup.py (three threads).
@cpcloud (Member, Author) commented Sep 6, 2021

@jreback setup.py is automatically generated, so it's not that interesting to review.

@jreback (Contributor) commented Sep 6, 2021

My main concern here is to have a lower bound of deps tested in CI, i.e. the oldest supported versions. We don't have a reason to bump pyarrow, for example, from 1.* I think (which is where we had it). ibis likely just works on the oldest versions, so we should stay as liberal as possible. In a future version of course we should bump, but I don't see why we would do that now.

@cpcloud (Member, Author) commented Sep 6, 2021

@jreback I've added some minimum version testing jobs, let's see what shakes out.

@jreback added the ci (Continuous Integration issues or PRs) label on Sep 6, 2021
@datapythonista (Contributor) left a comment

I also don't understand the new division of builds/backends. Maybe I've just got used to what we have, but those only_backends, linux_only... builds are not very clear to me. Do you mind explaining a bit why these new builds are better than what we've got now, please?

Comment on lines 208 to 223
    python-version:
      - "3.7"
      - "3.9"
    pyspark:
      - "2.4.3"
      - "3"
    pyarrow:
      - "0.12.1"
      - "5"
    exclude:
      - python-version: "3.7"
        pyspark: "3"
        pyarrow: "0.12.1"
      - python-version: "3.9"
        pyspark: "3"
        pyarrow: "0.12.1"
A contributor commented:

I personally find this a step backwards compared to what we've got now. If these versions can be obtained from the pyproject.toml or poetry files, my preference would be to keep what we've got now (files in ci/deps/...).

@cpcloud (Member, Author) commented Sep 7, 2021

@datapythonista

> I also don't understand the new division of builds/backends. Maybe I've just got used to what we have, but those only_backends, linux_only... builds are not very clear to me. Do you mind explaining a bit why these new builds are better than what we've got now, please?

There are two very different things in question here, let's discuss them individually:

  1. The way the workflow jobs are named. Happy to rename them to something clearer for you. Keep in mind that I did actually try to follow a structural pattern similar to what was already there. The way they are currently named is:
    • tests_no_backends: tests that don't require a backend, basically whatever's in ibis/tests.
    • tests_simple_backends: tests that require a "simple" backend: csv, dask, hdf5, pandas, parquet, and sqlite.
    • tests_linux_only_backends: tests that require a backend that can only be tested on a Linux runner due to the use of services (mysql, postgresql) and pyspark. These backends likely work on Windows too, but GitHub does not offer services on Windows runners.
    • tests_impala_clickhouse: tests against the impala and clickhouse backends.

There are a handful of other jobs whose role is hopefully clear from their name, e.g., benchmarks.

  2. Why is this an improvement on the existing workflows? A few reasons:
    • There's much less ad-hoc shell scripting. I think putting scripts in CI hurts the ability to understand what is happening in CI. Maintainers should basically never have to open up a shell script to understand a particular job. Encapsulation should be done by reusing GitHub Actions elements, not with ad-hoc shell scripts (a sketch of this idea follows after this list). Any shell scripting is done inline now, because it's very short. Since no one runs these scripts outside of CI, there's no real reason IMO to move any of the logic to a script.
    • There's much more (re)use of existing GitHub actions. In the current workflow, environments are set up in a totally ad-hoc way. With this workflow, environments are created using conda-incubator/setup-miniconda and don't require a bunch of shell scripting. As another example, docs aren't pushed over SSH anymore; they use a GitHub action that pushes to the docs repo using a personal access token (which any maintainer can create).
    • Fewer environment variables. Minimizing the use of ad-hoc global state is basically always a good thing IMO.
    • There are more concurrent jobs with just a tiny increase in CI minutes. Dask, for example, is now being tested on Windows.
    • Jobs do one thing and one thing only. There's no clear reason why linting and benchmarking (for example) are run together. These things are unrelated and should run as separate tasks. In the new workflow, linting is outsourced to pre-commit.ci, and benchmarking is its own job.
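
As a sketch of the "encapsulate with GitHub Actions elements" point above (the action path, input name, and pytest flags here are hypothetical, not taken from this PR), the shared pytest invocation could live in a local composite action that every job reuses:

    # .github/actions/run-tests/action.yml (hypothetical local composite action)
    name: run ibis tests
    description: shared pytest invocation so every job calls pytest the same way
    inputs:
      pytest-args:
        description: arguments forwarded to pytest, e.g. the backend test paths
        required: true
    runs:
      using: composite
      steps:
        - name: run pytest
          shell: bash
          run: pytest ${{ inputs.pytest-args }} -ra

    # usage from a workflow job (after actions/checkout):
    #   - uses: ./.github/actions/run-tests
    #     with:
    #       pytest-args: ibis/tests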

@datapythonista (Contributor)

Thanks for all the information, and for all the work on this. While I agree with your points, and there are clearly some advantages to this approach, there is a trade-off, and we're losing several things:

  • The reason to have the (very simple) shell scripts is that to change a pytest parameter, we can change it in just one place and have it updated in all jobs. It also makes sure pytest parameters are consistent among jobs
  • That's also one of the reasons we don't use GitHub actions to set up conda. Instead of having to repeat all the parameters in each job, we just have these two simple scripts to call, and we keep the GitHub config very simple. You're adding 700+ lines of code to the config to remove fewer than 100 lines in two scripts with minimal complexity
  • There is also the addition of the poetry Python script in your implementation, which makes things even more complex
  • Having dask tested on Windows, or each job doing a single thing, is something we can easily implement with the current approach. The reason we don't is that we optimize for the time the CI takes and for the number of jobs, which, if too big, makes it more difficult to find failing jobs

We had lots of problems with the CI in the past; I spent months making it simple, robust, and fast. This PR to me seems to be adding a lot of complexity, and slower builds, for no gain. I think this PR is making several independent changes (I may be wrong, but I don't think removing the shell scripts in favor of GitHub actions, or the new job division, is related to poetry). Is any of this needed for the automation of packaging?

@cpcloud (Member, Author) commented Sep 7, 2021

> The reason to have the (very simple) shell scripts is that to change a pytest parameter, we can change it in just one place and have it updated in all jobs. It also makes sure pytest parameters are consistent among jobs

We can keep run_tests.sh if you want; that's not the script that I think is problematic. Again, scripts are useful if something is changing all the time. Are pytest commands in CI changing so often that a script is warranted?

> Instead of having to repeat all the parameters in each job, we just have these two simple scripts to call, and we keep the GitHub config very simple.

Can you spell out the logic of ci/setup_env.sh here? It might help others validate the claim that it's a simple script.

Generally this script seems to be doing a huge number of unnecessarily manual operations, despite it being only 77 lines. This is abuse of Bash in the name of concision.

To my eye this is what's happening:

  1. A bunch of totally unnecessary path munging on linux and windows (useless in general AFAICT, this PR demonstrates it is entirely unnecessary)
  2. Update the base conda environment (unnecessary when using the setup-miniconda action)
  3. install mamba (unnecessary when using the setup-miniconda action)
  4. dynamically construct an environment.yml based on a bunch of files in some location, arbitrarily deciding that the minimum supported version of Python implies the minimum supported version of all the other dependencies (why?). This appears to be utterly untested in some cases and is in fact broken in the case of pyspark=2.4.3 (see this PR's CI, which catches it); see this part of the pyspark docs for details. Searching the ibis repo I see nothing about this. Here we have complexity in the form of file generation hiding a missed testing scenario
  5. showing the environment file (unnecessary when using the setup-miniconda action)
  6. updating the base environment again? (unnecessary when using the setup-miniconda action)
  7. installing ibis
  8. loading data for backends depends on the value of two global variables. (totally unnecessary by just inlining the main loop in CI)
  9. showing the packages that were installed (unnecessary when using the setup-miniconda action)

> You're adding 700+ lines of code to the config to remove fewer than 100 lines in two scripts with minimal complexity

By my calculation I'm adding 611 (line count of main.yml in this PR) + 135 (poetry2recipe.py) + 56 (poetry2setup.py) = 802 lines, and removing 336 (existing main.yml) + 77 (setup_env.sh) + 21 (run_tests.sh) + 36 (the line count of the deps files) = 470 lines, for a grand total of 332 net lines of code if my arithmetic is correct.

How are you calculating 700+ additional lines?

> There is also the addition of the poetry Python script in your implementation, which makes things even more complex

Which one are you talking about? There are two.

> Is any of this needed for the automation of packaging?

Ultimately yes.

I want to move to a model where we reuse github actions, even if that means repeating some of the configuration. That configuration hardly ever changes. Why should we continue to roll our own scripts that are clearly generating broken CI jobs?

@datapythonista (Contributor) commented Sep 7, 2021

If we had a single job, I'd surely prefer your approach to setting up the conda environment. You don't like ci/setup_env.sh, but it solves the problem of having to repeat this code for each job (we would have 5 copies of it with the current CI):

      - name: download env file
        uses: actions/download-artifact@v2
        with:
          name: conda-env-file
          path: tmp

      - uses: conda-incubator/setup-miniconda@v2
        with:
          mamba-version: "*"
          miniforge-variant: Mambaforge
          miniforge-version: latest
          channel-priority: strict
          activate-environment: ibis
          python-version: ${{ matrix.python-version }}
          environment-file: tmp/environment.yaml
          condarc-file: ci/condarc

      - name: install boost
        run: conda install -n ibis boost

      - name: install ibis
        run: pip install .

      - name: download backend data
        run: python ci/datamgr.py download

      - name: set backends
        id: set_backends
        run: echo '::set-output name=backends::["impala", "clickhouse"]'

      - name: install backend data
        run: |
          set -euo pipefail
          python ci/impalamgr.py load --data
          python ci/datamgr.py clickhouse

Instead, we have that script implemented once, and in each job we simply have:

    - name: set up environment
      run: ./ci/setup_env.sh "${{ matrix.python_version }}" "$BACKENDS"

I'm open to your approach, even if I don't think it adds value and it creates a lot of duplication. But I think this discussion will be much more efficient if you propose each of the changes in a separate PR. There are many unrelated things being mixed, and it's difficult to know what each of them involves in this huge PR.

Would you mind opening separate PRs for:

  • Change the jobs to the new ones you propose
  • Moving the code of ci/setup_env.sh to actions in the github yaml
  • Moving the call in ci/run_tests.sh to the yaml
  • Changes to the website
  • Generation of the dependencies with poetry
  • Replacing versioneer by importlib.metadata
  • ...

I think discussions will be much more focussed, and reviewing much more efficient if we don't mix things.

@cpcloud (Member, Author) commented Sep 7, 2021

I will open separate PRs.

Just to set expectations:

  • Change the jobs to the new ones you propose

ok

  • Moving the code of ci/setup_env.sh to actions in the github yaml

ok

  • Moving the call in ci/run_tests.sh to the yaml

sounds good

  • Changes to the website

sure

  • Generation of the dependencies with poetry
  • Replacing versioneer by importlib.metadata

These need to be done together, because poetry doesn't know anything about versioneer, and importlib.metadata replaces the creation of the __version__ attribute.

@cpcloud (Member, Author) commented Sep 7, 2021

Small modification: the first two items (the YAML refactoring and setup_env.sh) don't make sense to split, because of the minimum-deps issue.

@jreback added this to the Next release milestone on Sep 7, 2021
Resolved review threads on .github/workflows/main.yml and ci/pytest_errors.sh.
    from ruamel.yaml import YAML

    def main(
@kszucs (Member) commented Sep 9, 2021

@cpcloud could you consolidate the two poetry scripts into a single one?

Eventually I think we should have a Python library containing all the required functionality for development, including the poetry conversion scripts, datamgr, impalamgr, and possibly other utilities in the future. It would make testing the utility scripts easier as well.

@cpcloud (Member, Author):

Maybe, but these are pretty different tools, so I'm not sure putting them in one place makes sense. I think making poetry2recipe a separate package outside of ibis would be useful.

@cpcloud (Member, Author):

After talking with @mariusvniekerk I think we can remove poetry2recipe altogether after the next release of ibis, and depend on conda-forge machinery to discover dependencies.

Four more resolved review threads on .github/workflows/main.yml.
    PYTEST_BACKENDS: ${{ join(fromJSON(steps.set_backends.outputs.backends), ' ') }}
    run: |
      pytest \
        ibis/backends/{${{ join(fromJSON(steps.set_backends.outputs.backends), ',') }}} \
A member commented:

Why not pytest ibis/backends/?

I assume PYTEST_BACKENDS should automatically handle skipping the backends that weren't requested; otherwise we should tune that.

@cpcloud (Member, Author):

I will open up a new issue; I don't think it currently works that way, unfortunately.

@jreback added this to the 2.x milestone on Sep 22, 2021
@cpcloud force-pushed the poetry branch 2 times, most recently from 15065c9 to 2f7f248 on September 25, 2021 09:55
@cpcloud added the dependencies (Issues or PRs related to dependencies) and developer-tools (Tools related to ibis development) labels on Sep 28, 2021
@cpcloud force-pushed the poetry branch 2 times, most recently from 52d111b to d7fbf13 on October 3, 2021 15:33
@cpcloud force-pushed the poetry branch 2 times, most recently from 96099c2 to 52b4b6f on October 7, 2021 15:57
@cpcloud force-pushed the poetry branch 3 times, most recently from 54d0e4b to d3a4cb1 on October 14, 2021 14:02
Resolved review thread on pyproject.toml.
@jreback (Contributor) left a comment

a question

Resolved review thread on .github/workflows/main.yml.
@cpcloud force-pushed the poetry branch 2 times, most recently from cf56e5d to f09a204 on October 15, 2021 17:01
@jreback merged commit 2a0159e into ibis-project:master on Oct 15, 2021
@jreback (Contributor) commented Oct 15, 2021

thanks @cpcloud

Followups, if you can: verify that we have sufficient documentation around how to use this (e.g. where / how to update the min versions of things).

@cpcloud (Member, Author) commented Oct 15, 2021

Yup, there's some more automation to implement as well, for example to avoid manually updating minor/patch versions of dependencies.

@ibis-project-bot (Contributor) commented:

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
