
refactor(ci): move to poetry #2937

Merged: 1 commit merged into ibis-project:master from the poetry branch on Oct 15, 2021

Conversation

@cpcloud (Member) commented Sep 6, 2021

This PR is an extraction from #2924
of just the poetry/ci bits.

The piece of code here with the most changes is the main CI workflow
(.github/workflows/main.yml); that file is the main thing to focus on for review.

@pep8speaks commented Sep 6, 2021

Hello @cpcloud! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 131:80: E501 line too long (2082 > 79 characters)

Comment last updated at 2021-10-15 18:02:14 UTC

@datapythonista (Contributor) left a comment

I don't think we want to add poetry.lock to the repo. It will freeze versions, so we won't be aware if installing Ibis with the latest dependencies is broken, and it will also force us to regenerate the file. Or does linguist-generated mean some magic I'm unaware of?

@cpcloud (Member, Author) commented Sep 6, 2021

linguist-generated is how you tell GitHub not to display generated files in a PR diff (e.g., a `poetry.lock linguist-generated=true` entry in .gitattributes collapses the file in the diff view).

@cpcloud (Member, Author) commented Sep 6, 2021

> This will freeze versions

Yes, that is their entire reason for existence :)

> we won't be aware if installing Ibis with the latest dependencies is broken.

we can check this in CI with another job

> and it will also force us to regenerate the file.

Yes, that's a good thing. It prevents two people from having different sets of dependencies.

Additionally, we can use dependabot to handle the boring process of upgrading existing dependencies on a regular basis.

Lock files are a really great way of preventing a large swath of "work on my machine" problems.
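
For context, a minimal sketch of what such a Dependabot setup could look like (illustrative only, not part of this PR; Dependabot's pip ecosystem also covers poetry-managed pyproject.toml/poetry.lock files):

    # .github/dependabot.yml (illustrative sketch; not included in this PR)
    version: 2
    updates:
      - package-ecosystem: "pip"     # the pip ecosystem also updates poetry.lock for poetry projects
        directory: "/"               # directory containing pyproject.toml
        schedule:
          interval: "weekly"
        open-pull-requests-limit: 5  # cap the number of open upgrade PRs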

@cpcloud (Member, Author) commented Sep 6, 2021

On the other hand, poetry.lock isn't necessary at the moment, so if it'll make the code easier to review I'll remove it.

@datapythonista (Contributor)

I fail to see your point. The way I see things, there are two different kinds of projects:

  • Something like a website that is going to be deployed on a server, where freezing dependencies and making sure everybody (developers, production servers...) uses the exact same environment is a good thing
  • A library used by lots of diverse users, on different operating systems and with different versions of the dependencies, where freezing dependencies is good for the project developers but bad for the users, who can just conda install ibis-project and get a broken environment that maintainers were not aware of

I guess I'm missing something: poetry.lock seems great for the former case, but I fail to see how it's helpful for the latter. Do you want all devs to test on an arbitrarily frozen environment, and have jobs in the CI testing the latest and minimum dependencies? Or is there something else? How easy will it be to run with the latest dependencies locally?

@cpcloud (Member, Author) commented Sep 6, 2021

> Where freezing dependencies is good for the project developers but bad for the users, who can just conda install ibis-project and get a broken environment that maintainers were not aware of

I think this is a misunderstanding of how lock files work. Dependencies are constrained by pyproject.toml, which doesn't force users into anything except the version specifications in that file. Right now, those are fairly liberally constrained, for exactly the kind of scenario you are talking about: diverse sets of users.

With that in mind, how does checking in poetry.lock actually prevent users from doing anything? Users are never exposed to lockfiles.

@cpcloud (Member, Author) commented Sep 6, 2021

> How easy will it be to run with the latest dependencies locally?

As easy as poetry update. If you want to check in any changes made with those new dependencies, then you'd have to run it through CI which would use the locked versions, because that's what you tested with.
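
To connect this with the earlier point about checking unpinned dependencies in another CI job, here is a minimal sketch of such a job, assuming pytest is declared as a poetry dev dependency (the job name, Python version, and test path are illustrative, not taken from this PR):

    # Illustrative sketch, not part of this PR: a job that ignores the locked
    # versions and resolves the newest versions allowed by pyproject.toml, so
    # breakage with the latest dependencies still surfaces in CI.
    test_latest_deps:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: actions/setup-python@v2
          with:
            python-version: "3.9"
        - name: install poetry
          run: pip install poetry
        - name: install the newest allowed dependency versions
          run: |
            poetry update    # re-resolve to the latest versions permitted by pyproject.toml
            poetry install   # install ibis itself into the same environment
        - name: run the backend-independent tests
          run: poetry run pytest ibis/tests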

Resolved review threads on docs/web/config.yml, ibis/__init__.py, pyproject.toml, and setup.py (three threads).
@cpcloud (Member, Author) commented Sep 6, 2021

@jreback setup.py is automatically generated, so it's not that interesting to review.

@jreback (Contributor) commented Sep 6, 2021

My main concern here is to have a lower bound of deps tested in CI, i.e. the oldest supported versions. We don't have a reason to bump pyarrow, for example, from 1.* I think (which is where we had it). ibis likely just works on the oldest versions, so we should stay as liberal as possible. In a future version of course we should bump, but I don't see why we would do that now.

@cpcloud (Member, Author) commented Sep 6, 2021

@jreback I've added some minimum version testing jobs, let's see what shakes out.

@jreback added the ci (Continuous Integration issues or PRs) label on Sep 6, 2021
@datapythonista (Contributor) left a comment

I also don't understand the new division of builds/backends. Maybe I've just got used to what we have, but those only_backends, linux_only... builds are not very clear to me. Do you mind explaining a bit why these new builds are better than what we've got now, please?

Comment on lines 208 to 223
    python-version:
      - "3.7"
      - "3.9"
    pyspark:
      - "2.4.3"
      - "3"
    pyarrow:
      - "0.12.1"
      - "5"
    exclude:
      - python-version: "3.7"
        pyspark: "3"
        pyarrow: "0.12.1"
      - python-version: "3.9"
        pyspark: "3"
        pyarrow: "0.12.1"
A contributor commented:

I personally find this a step backwards compared to what we've got now. If these versions can be obtained from the pyproject.toml or poetry files, my preference would be to keep what we've got now (files in ci/deps/...).

@cpcloud (Member, Author) commented Sep 7, 2021

@datapythonista

> I also don't understand the new division of builds/backends. Maybe I've just got used to what we have, but those only_backends, linux_only... builds are not very clear to me. Do you mind explaining a bit why these new builds are better than what we've got now, please?

There are two very different things in question here, let's discuss them individually:

  1. The way the workflow jobs are named. Happy to rename them to something clearer for you. Keep in mind that I did actually try to follow a structural pattern similar to what was already there. The way they are currently named is:
    • tests_no_backends: tests that don't require a backend, basically whatever's in ibis/tests.
    • tests_simple_backends: tests that require a "simple" backend: csv, dask, hdf5, pandas, parquet, and sqlite.
    • tests_linux_only_backends: tests that require a backend that can only be tested on a Linux runner due to the use of services (mysql, postgresql) and pyspark. These backends likely work on Windows too, but GitHub does not offer services on Windows runners.
    • tests_impala_clickhouse: tests against the impala and clickhouse backends.

There are a handful of other jobs whose role is hopefully clear from their name, e.g., benchmarks.

  2. Why is this an improvement on the existing workflows? A few reasons:
    • There's much less ad-hoc shell scripting. I think putting scripts in CI hurts the ability to understand what is happening in CI. Maintainers should basically never have to open up a shell script to understand a particular job. Encapsulation should be done by reusing GitHub Actions elements, not with ad-hoc shell scripts (a sketch of this idea follows after this list). Any shell scripting is done inline now, because it's very short. Since no one runs these scripts outside of CI, there's no real reason IMO to move any of the logic to a script.
    • There's much more (re)use of existing GitHub actions. In the current workflow, environments are set up in a totally ad-hoc way. With this workflow, environments are created using conda-incubator/setup-miniconda and don't require a bunch of shell scripting. As another example, docs aren't pushed over SSH anymore; they use a GitHub action that pushes to the docs repo using a personal access token (which any maintainer can create).
    • Fewer environment variables. Minimizing the use of ad-hoc global state is basically always a good thing IMO.
    • There are more concurrent jobs with just a tiny increase in CI minutes. Dask, for example, is now being tested on Windows.
    • Jobs do one thing and one thing only. There's no clear reason why linting and benchmarking (for example) are run together. These things are unrelated and should run as separate tasks. In the new workflow, linting is outsourced to pre-commit.ci, and benchmarking is its own job.
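
As a sketch of the "encapsulate with GitHub Actions elements" point above (the action path, input name, and pytest flags here are hypothetical, not taken from this PR), the shared pytest invocation could live in a local composite action that every job reuses:

    # .github/actions/run-tests/action.yml (hypothetical local composite action)
    name: run ibis tests
    description: shared pytest invocation so every job calls pytest the same way
    inputs:
      pytest-args:
        description: arguments forwarded to pytest, e.g. the backend test paths
        required: true
    runs:
      using: composite
      steps:
        - name: run pytest
          shell: bash
          run: pytest ${{ inputs.pytest-args }} -ra

    # usage from a workflow job (after actions/checkout):
    #   - uses: ./.github/actions/run-tests
    #     with:
    #       pytest-args: ibis/tests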

@datapythonista (Contributor)

Thanks for all the information, and for all the work on this. While I agree with your points, and there are clearly some advantages to this approach, there is a trade-off, and we're losing several things:

  • The reason to have the (very simple) shell scripts is that to change a pytest parameter, we can change it in just one place and have it updated in all jobs. It also makes sure pytest parameters are consistent among jobs
  • That's also one of the reasons we don't use GitHub actions to set up conda. Instead of having to repeat all the parameters in each job, we just have these two simple scripts to call, and we keep the GitHub config very simple. You're adding 700+ lines of code to the config to remove fewer than 100 lines in two scripts with minimal complexity
  • There is also the addition of the poetry Python script in your implementation, which makes things even more complex
  • Having dask tested on Windows, or each job doing a single thing, is something we can easily implement with the current approach. The reason we don't is that we optimize for the time the CI takes and for the number of jobs, which, if too big, makes it more difficult to find failing jobs

We had lots of problems with the CI in the past; I spent months making it simple, robust, and fast. This PR to me seems to be adding a lot of complexity, and slower builds, for no gain. I think this PR is making several independent changes (I may be wrong, but I don't think removing the shell scripts in favor of GitHub actions, or the new job division, is related to poetry). Is any of this needed for the automation of packaging?

@cpcloud (Member, Author) commented Sep 7, 2021

> The reason to have the (very simple) shell scripts is that to change a pytest parameter, we can change it in just one place and have it updated in all jobs. It also makes sure pytest parameters are consistent among jobs

We can keep run_tests.sh if you want; that's not the script that I think is problematic. Again, scripts are useful if something is changing all the time. Are pytest commands in CI changing so often that a script is warranted?

> Instead of having to repeat all the parameters in each job, we just have these two simple scripts to call, and we keep the GitHub config very simple.

Can you spell out the logic of ci/setup_env.sh here? It might help others validate the claim that it's a simple script.

Generally this script seems to be doing a huge number of unnecessarily manual operations, despite it being only 77 lines. This is abuse of Bash in the name of concision.

To my eye this is what's happening:

  1. A bunch of totally unnecessary path munging on linux and windows (useless in general AFAICT, this PR demonstrates it is entirely unnecessary)
  2. Update the base conda environment (unnecessary when using the setup-miniconda action)
  3. install mamba (unnecessary when using the setup-miniconda action)
  4. dynamically construct an environment.yml based on a bunch of files in some location, arbitrarily deciding that the minimum supported version of Python implies the minimum supported version of all the other dependencies (why?). This appears to be utterly untested in some cases and is in fact broken in the case of pyspark=2.4.3 (see this PR's CI, which catches it); see this part of the pyspark docs for details. Searching the ibis repo I see nothing about this. Here we have complexity in the form of file generation hiding a missed testing scenario
  5. showing the environment file (unnecessary when using the setup-miniconda action)
  6. updating the base environment again? (unnecessary when using the setup-miniconda action)
  7. installing ibis
  8. loading data for backends depends on the value of two global variables. (totally unnecessary by just inlining the main loop in CI)
  9. showing the packages that were installed (unnecessary when using the setup-miniconda action)

> You're adding 700+ lines of code to the config to remove fewer than 100 lines in two scripts with minimal complexity

By my calculation I'm adding 611 (line count of main.yml in this PR) + 135 (poetry2recipe.py) + 56 (poetry2setup.py) = 802 lines, and removing 336 (existing main.yml) + 77 (setup_env.sh) + 21 (run_tests.sh) + 36 (the line count of the deps files) = 470 lines, for a grand total of 332 net lines of code if my arithmetic is correct.

How are you calculating 700+ additional lines?

> There is also the addition of the poetry Python script in your implementation, which makes things even more complex

Which one are you talking about? There are two.

> Is any of this needed for the automation of packaging?

Ultimately yes.

I want to move to a model where we reuse github actions, even if that means repeating some of the configuration. That configuration hardly ever changes. Why should we continue to roll our own scripts that are clearly generating broken CI jobs?

@datapythonista (Contributor) commented Sep 7, 2021

If we had a single job, I'd surely prefer your approach to setting up the conda environment. You don't like ci/setup_env.sh, but it solves the problem of having to repeat this code for each job (we would have 5 copies of it with the current CI):

      - name: download env file
        uses: actions/download-artifact@v2
        with:
          name: conda-env-file
          path: tmp

      - uses: conda-incubator/setup-miniconda@v2
        with:
          mamba-version: "*"
          miniforge-variant: Mambaforge
          miniforge-version: latest
          channel-priority: strict
          activate-environment: ibis
          python-version: ${{ matrix.python-version }}
          environment-file: tmp/environment.yaml
          condarc-file: ci/condarc

      - name: install boost
        run: conda install -n ibis boost

      - name: install ibis
        run: pip install .

      - name: download backend data
        run: python ci/datamgr.py download

      - name: set backends
        id: set_backends
        run: echo '::set-output name=backends::["impala", "clickhouse"]'

      - name: install backend data
        run: |
          set -euo pipefail
          python ci/impalamgr.py load --data
          python ci/datamgr.py clickhouse

Instead, we have that script implemented once, and in each job we simply have:

    - name: set up environment
      run: ./ci/setup_env.sh "${{ matrix.python_version }}" "$BACKENDS"

I'm open to your approach, even if I don't think it adds value and it creates a lot of duplication. But I think this discussion will be much more efficient if you propose each of the changes in a separate PR. There are many unrelated things being mixed, and it's difficult to know what each of them involves in this huge PR.

Would you mind opening separate PRs for:

  • Change the jobs to the new ones you propose
  • Moving the code of ci/setup_env.sh to actions in the github yaml
  • Moving the call in ci/run_tests.sh to the yaml
  • Changes to the website
  • Generation of the dependencies with poetry
  • Replacing versioneer by importlib.metadata
  • ...

I think discussions will be much more focussed, and reviewing much more efficient if we don't mix things.

@cpcloud (Member, Author) commented Sep 7, 2021

I will open separate PRs.

Just to set expectations:

  • Change the jobs to the new ones you propose

ok

  • Moving the code of ci/setup_env.sh to actions in the github yaml

ok

  • Moving the call in ci/run_tests.sh to the yaml

sounds good

  • Changes to the website

sure

  • Generation of the dependencies with poetry
  • Replacing versioneer by importlib.metadata

These need to be done together, because poetry doesn't know anything about versioneer, and importlib.metadata replaces the creation of the __version__ attribute.

@cpcloud (Member, Author) commented Sep 7, 2021

Small modification: the first two items (the YAML refactoring and setup_env.sh) don't make sense to split, because of the minimum-deps issue.

@jreback added this to the Next release milestone on Sep 7, 2021
Resolved review threads on .github/workflows/main.yml and ci/pytest_errors.sh.
    from ruamel.yaml import YAML

    def main(
@kszucs (Member) commented Sep 9, 2021

@cpcloud could you consolidate the two poetry scripts into a single one?

Eventually I think we should have a Python library containing all the required functionality for development, including the poetry conversion scripts, datamgr, impalamgr, and possibly other utilities in the future. It would make testing the utility scripts easier as well.

@cpcloud (Member, Author):

Maybe, but these are pretty different tools, so I'm not sure putting them in one place makes sense. I think making poetry2recipe a separate package outside of ibis would be useful.

@cpcloud (Member, Author):

After talking with @mariusvniekerk I think we can remove poetry2recipe altogether after the next release of ibis, and depend on conda-forge machinery to discover dependencies.

Four more resolved review threads on .github/workflows/main.yml.
    PYTEST_BACKENDS: ${{ join(fromJSON(steps.set_backends.outputs.backends), ' ') }}
    run: |
      pytest \
        ibis/backends/{${{ join(fromJSON(steps.set_backends.outputs.backends), ',') }}} \
A member commented:

Why not pytest ibis/backends/?

I assume PYTEST_BACKENDS should automatically handle skipping the backends that weren't requested; otherwise we should tune that.

@cpcloud (Member, Author):

I will open up a new issue; I don't think it currently works that way, unfortunately.

@jreback added this to the 2.x milestone on Sep 22, 2021
@cpcloud force-pushed the poetry branch 2 times, most recently from 15065c9 to 2f7f248 on September 25, 2021 09:55
@cpcloud added the dependencies (Issues or PRs related to dependencies) and developer-tools (Tools related to ibis development) labels on Sep 28, 2021
@cpcloud force-pushed the poetry branch 2 times, most recently from 52d111b to d7fbf13 on October 3, 2021 15:33
@cpcloud force-pushed the poetry branch 2 times, most recently from 96099c2 to 52b4b6f on October 7, 2021 15:57
@cpcloud force-pushed the poetry branch 3 times, most recently from 54d0e4b to d3a4cb1 on October 14, 2021 14:02
Resolved review thread on pyproject.toml.
@jreback (Contributor) left a comment

a question

Resolved review thread on .github/workflows/main.yml.
@cpcloud force-pushed the poetry branch 2 times, most recently from cf56e5d to f09a204 on October 15, 2021 17:01
@jreback merged commit 2a0159e into ibis-project:master on Oct 15, 2021
@jreback (Contributor) commented Oct 15, 2021

thanks @cpcloud

Followups, if you can: verify that we have sufficient documentation around how to use this (e.g. where / how to update the min versions of things).

@cpcloud (Member, Author) commented Oct 15, 2021

Yup, there's some more automation to implement as well, for example to avoid manually updating minor/patch versions of dependencies.

@ibis-project-bot (Contributor) commented:

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
