Manage with nix #727

Draft: wants to merge 40 commits into base: master

jacg
Collaborator

@jacg jacg commented Jun 21, 2020

Edit: the original text that appeared here is not appropriate for use in the
merge commit message (as our practice dictates). Here is what the merge commit
should say; the original is preserved, lower down.


Merge commit text

Replace the fragile, error-prone, user-unfriendly and high-maintenance
manage.sh/Conda approach to installing dependencies and managing their
versions, with Nix and direnv.

The disadvantage of the new system is that the user must ensure that Nix and
direnv are installed: this cannot be automated to the extent that installation
of conda was automated in the old system. I have tried to provide tools and
documentation that streamline this process as much as possible, in the doc/nix
directory.

However, this needs to be done only once on any machine. Thereafter, the huge
advantages are:

  1. Simply cding into the IC directory ensures that all the dependencies are
    installed and made available.

  2. Checking out a different commit automatically switches to the
    corresponding versions of the dependencies, if necessary.

  3. Nix provides a far greater set of packages than Conda. Consequently we can
    provide--for example--debugging, profiling, benchmarking, visualization,
    development, etc. tools (perhaps only on specific branches where they are
    useful) without the user having to make any effort to install them.

  4. Nix is far more robust, and can provide stronger guarantees about various
    packages working together correctly, than Conda can.

For most IC contributors, the first two points should be the most visible in
day-to-day work.


Original text:

Why?

Our manage.sh environment management system is fragile, error-prone and rather user-unfriendly.

I would like to explore the possibility of replacing it with Nix.

There is much to say about Nix (lots of it good, some not so good), but from a very high-level perspective the most important points are that, if we can get this to work properly:

  • Simply cding into the IC directory would automatically initialize the appropriate development and execution environment.

  • Checking out a different commit should automatically update the environment to the one matching the checked out code.
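
The automatic switching described in these two points is the kind of thing direnv provides when paired with Nix. A minimal sketch, assuming the repository has a shell.nix at its top level and that direnv's standard `use nix` directive is used. The helper name `setup_envrc` is hypothetical, and the setup actually shipped on this branch may differ:

```shell
# Hypothetical helper: create a one-line .envrc so that direnv loads the
# Nix environment whenever you cd into the repository.
# Assumption: a shell.nix (or default.nix) exists in the same directory.
setup_envrc() {
    dir=${1:-.}
    envrc="$dir/.envrc"
    if [ -e "$envrc" ]; then
        echo "$envrc already exists; leaving it untouched" >&2
        return 1
    fi
    # 'use nix' is a direnv stdlib directive that loads the Nix shell
    printf 'use nix\n' > "$envrc"
}

# Typical usage (direnv refuses to load an .envrc until it is approved):
#   setup_envrc /path/to/IC && (cd /path/to/IC && direnv allow)
```

Because the environment is described by a file tracked in the repository, checking out another commit changes that file, which is what lets the environment follow the code.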

The major downsides are:

  • Nix assumes that you have root privileges, in order to install it. (But once you have installed Nix, using it to install packages does not require admin rights.) Installing Nix without root privileges requires some extra fiddling: We have to check that we can install Nix on any machines we might want to use for production.

  • Nix is a VERY complex beast (but if we get it right, most of you will not see any of that complexity).

In summary ...

Setting this up properly will take considerable effort, but once that is done, the result should be very pleasant indeed for anyone working in IC.

Please help

In my initial attempts, lots of tests are failing. Some of this is likely to be because the package versions specified in manage.sh are ancient, while I'm using up-to-date ones in the new config. As I have been out of touch with IC for quite some time, most of these failures are completely meaningless to me, so I'd appreciate if someone who is more in touch could have a look to see if there are any obvious solutions to any of the problems.

How can I help?

  1. Install Nix on your machine. The process is described here. It's approximately:

    • sudo curl -L https://nixos.org/nix/install | sh
    • Add . $HOME/.nix-profile/etc/profile.d/nix.sh to your shell configuration.
  2. cd /path/to/IC

  3. nix-shell

  4. pytest

  5. See if you can understand any of the test failures.
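
Steps 2 to 4 can also be run non-interactively, which is handy for collecting test output to paste here. A minimal sketch, assuming Nix is already installed and nix.sh has been sourced; the function names `have` and `run_ic_tests` are made up for illustration:

```shell
# Report whether a command is available on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Run the IC test suite inside the Nix shell, or explain what is missing.
# $1 is the path to the IC checkout (hypothetical argument).
run_ic_tests() {
    repo=$1
    if ! have nix-shell; then
        echo "nix-shell not found: install Nix and source nix.sh first" >&2
        return 127
    fi
    # --run executes one command inside the environment, then exits
    (cd "$repo" && nix-shell --run pytest)
}

# Usage:
#   run_ic_tests /path/to/IC 2>&1 | tee ic-test.log
```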

Looking forward

If we can get this to work, I would propose to maintain this branch alongside master to see how it deals with evolution of requirements, pinning of versions etc. for quite some time. If, after a while, it proves to work reliably, we could use it to replace the manage.sh abomination.

@jjgomezcadenas
Collaborator

jjgomezcadenas commented Jun 21, 2020 via email

@gonzaponte
Collaborator

Simply cding into the IC directory would automatically initialize the appropriate development and execution environment.

Do you know if this works neatly in computer grids?

There is currently no (sensible) way to install Nix without sudo.

Does travis like that?
I'm afraid this might also be a problem if we want to install and run in systems that we do not manage directly (computer grids from various institutions, essentially).

@jacg
Collaborator Author

jacg commented Jun 21, 2020

Simply cding into the IC directory would automatically initialize the appropriate development and execution environment.

Do you know if this works neatly in computer grids?

What exactly?

Do you mean the automagical environment selection? If you manage to get nix on there somehow, I struggle to see what could stop the automagic from working ...but ... well ... There are more things in heaven, earth and (above all) computer systems, Horatio, Than are dreamt of in my philosophy, so something might throw a spanner in the works.

Besides, the automagic is very convenient for developers, but in a production environment we'd probably use an explicit nix-shell, nix-build, nix run or something along those lines, so the automagic doesn't really matter on grids.

Or do you mean getting Nix onto your grid machines in the first place? That I don't know.

There is currently no (sensible) way to install Nix without sudo.

Does travis like that?

  1. Travis actually provides a language: nix environment, so, in theory, you don't even need to install it on Travis. In practice ... well "Nix support for Travis CI is community maintained." and it has been broken for the last few days, so ...

  2. You can always tell travis that sudo: true and then you can go ahead and install it in the usual way ... which is currently failing for me for a reason that I do not understand, but in principle there should be viable solutions.

  3. There are ways of getting it installed without sudo, but I hope we won't need to go there.

@gonzaponte
Collaborator

Do you mean the automagical environment selection

Yes, I meant that

but in a production environment we'd probably use an explicit nix-shell, nix-build, nix run or something along those lines so the automagic doesn't really matter on grids.

Good. These would be a major concern otherwise.

Or do you mean getting Nix onto your grid machines in the first place? That I don't know.

Yeah, this is another concern, because I'm not sure we can ask nix to be installed in certain places.

@gonzaponte
Collaborator

gonzaponte commented Jun 21, 2020

Did you get this error?

ERROR: Could not find a version that satisfies the requirement toml (from autopep8==1.5.3) (from versions: none)
ERROR: No matching distribution found for toml (from autopep8==1.5.3)

this is while doing nix-shell

@jacg
Collaborator Author

jacg commented Jun 21, 2020

Yes, that's what I'm getting on Travis.

On my machine I'm actually running NixOS as my main OS, and there nix-shell enters fine, and allows me to run the IC tests, many of which fail miserably.

In principle we should be able to get reproducible builds, so that we get identical behaviour across machines, but I haven't bothered pinning any versions yet, so that might explain the difference. I'll try to reproduce that error locally.

@jacg
Collaborator Author

jacg commented Jun 21, 2020

I've pinned the version of nixpkgs, and it now installs and runs the tests on Travis. You can see the test failures here.

After a fetch of the latest commit on the branch, you should be able to reproduce the same results on your machines ... although I apparently can't (presumably because I'm running NixOS rather than just the Nix package manager on some other OS, and somehow the configurations aren't equivalent): while Travis fails 38 tests, my machine fails 227 tests.

@gonzaponte
Collaborator

gonzaponte commented Jun 21, 2020

38 failed + 2 errored.
Some of them are, I think, because it is taking the most up-to-date version of each package (see #723). Probably not all of them. Can we fix them to the current spec to differentiate between them?

@jacg
Collaborator Author

jacg commented Jun 22, 2020

I've rebased on top of #723 and all the tests pass on Travis.

Can we fix them to the current spec to differentiate between them?

I'd prefer to start with a clean slate, and try to explore different approaches to specific package pinning from there, rather than starting off with some random pinning scheme just to get the tests to pass with a random set of outdated package versions.

What is holding up #723?

I propose that we continue this work on top of #723, even if it isn't merged.

@gonzaponte
Collaborator

What is holding up #723?

Some tests fail due to a change in behavior in scipy.interpolate, I just want to make sure it has no real impact on data and that it is something minor. As soon as I have a handle on that it will be merged.

@jacg
Collaborator Author

jacg commented Jun 22, 2020

On macOS Travis generates the same 218 + 81 test failures + errors as I now get on my local NixOS.

Can someone on a Mac reproduce these?

@jacg
Collaborator Author

jacg commented Jun 22, 2020

As soon as I have a handle on that it will be merged.

Any idea of timescale? Hours, days, months?

@gonzaponte
Collaborator

somewhere between hours and days

@gonzaponte
Collaborator

On macOS Travis generates the same 218 + 81 test failures + errors as I now get on my local NixOS.

I just remembered that the OSX build has been failing since we migrated to LFS. Can this be it?

@jacg
Collaborator Author

jacg commented Jun 22, 2020

the OSX build has been failing since we migrated to LFS.

And we've just left it like that?

Can this be it?

Looking at an OSX Travis log, the errors do look very similar, though the huge volume of output makes comparison difficult.

Most of the test failures seem to be related to databases or HDF files, which seems to agree with the LFS-failure-is-responsible hypothesis.

I presume there are still people developing IC on macOS ... so I presume that someone can quickly confirm that the standard IC tests have been passing on macOS development machines since we started using Git LFS. Please ... pretty please!

Could someone with a Mac for whom the IC tests pass and who is aware of successfully using LFS please check what this branch does on an LFS-capable development Mac? This amounts to

  1. sudo curl -L https://nixos.org/nix/install | sh
  2. . $HOME/.nix-profile/etc/profile.d/nix.sh
  3. Check out this branch
  4. cd /path/to/repo/where/you/checked/out/the/branch
  5. nix-shell
  6. pytest

So that's no more than 2 minutes of typing, if you type veery slowly, and then getting on with your life while waiting for the tests to run (and waiting for the nix installation to complete, which will interrupt your typing, but I don't recall that it takes terribly long).

@gonzaponte
Collaborator

Let's draw their attention
@carmenromo @paolafer @andLaing @jmalbos
do any of you volunteer to do @jacg's request in the last comment?

@carmenromo
Collaborator

Yes, I can

@jacg
Collaborator Author

jacg commented Jun 22, 2020

Where can I see what LFS artefacts I'm expected to have, if everything is working ok? I've logged into the LFS server, but I'm not having much luck finding anything there.

@gonzaponte
Collaborator

Where can I see what LFS artefacts I'm expected to have, if everything is working ok?

I don't understand what you mean

@carmenromo
Collaborator

I have tried the first line but nothing is downloaded. I have read on the Internet that nix will not be writable on macOS Catalina

@jacg
Collaborator Author

jacg commented Jun 22, 2020

Where can I see what LFS artefacts I'm expected to have, if everything is working ok?

I don't understand what you mean

A successful clone or checkout of IC should result in some files in the tree being there because Git LFS put them there. I would like to have some idea of what those files are. But don't worry, I have found solid proof that something is wrong with my Git LFS setup:

$ find . -name "*sqlite*" -exec file {} \;
./invisible_cities/database/localdb.NEXT100DB.sqlite3: ASCII text
./invisible_cities/database/localdb.DEMOPPDB.sqlite3: ASCII text
./invisible_cities/database/localdb.NEWDB.sqlite3: ASCII text

$ find . -name "*sqlite*" -exec cat {} \;
version https://git-lfs.github.com/spec/v1
oid sha256:dc08bc015f6dff0d5103957527dd95974178fa110090d68005b3b9abfce3ac12
size 34603008
version https://git-lfs.github.com/spec/v1
oid sha256:543c744f37ad6c61fefea492dfb7f99117bd74b7cfab2f706cf34283de714ef1
size 5324800
version https://git-lfs.github.com/spec/v1
oid sha256:05dca8bbd4a17c3bf4118ec9ccd4dacf559352085abf55409ccf74162946af29
size 382033920
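
The same check can be scripted, so that anyone can tell quickly whether their checkout holds real databases or only LFS pointers. A sketch, assuming that a pointer file always begins with the `version https://git-lfs.github.com/spec/v1` line seen in the output above; the function names are made up for illustration:

```shell
# Succeeds if the file is still a Git LFS pointer rather than real content.
is_lfs_pointer() {
    # Only the first line matters: a pointer starts with the spec URL line.
    head -n 1 "$1" 2>/dev/null |
        grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# Print every *sqlite* file under a directory that is still a pointer.
list_lfs_pointers() {
    find "${1:-.}" -type f -name '*sqlite*' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage:
#   list_lfs_pointers invisible_cities/database
```

If this prints anything, the LFS content has not been fetched; with git-lfs installed, `git lfs pull` would normally be the way to fetch it.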

@jacg
Collaborator Author

jacg commented Jun 22, 2020

I have tried the first line but nothing is downloaded. I have read on the Internet that nix will not be writable on macOS Catalina

Hmm.

What exact output do you get?

How about

sh <(curl https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume

?

@carmenromo
Collaborator

With the first one I have the output:

$ sudo curl -L https://nixos.org/nix/install | sh
Password:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  2490  100  2490    0     0   3401      0 --:--:-- --:--:-- --:--:--  3401
downloading Nix 2.3.6 binary tarball for x86_64-darwin from 'https://releases.nixos.org/nix/nix-2.3.6/nix-2.3.6-x86_64-darwin.tar.xz' to '/var/folders/b3/vxk4fw4n7mdc3wv84vbzbnzr0000gn/T/nix-binary-tarball-unpack.XXXXXXXXXX.YVDrKtxb'...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.4M  100 26.4M    0     0  10.2M      0  0:00:02  0:00:02 --:--:-- 10.2M
Note: a multi-user installation is possible. See https://nixos.org/nix/manual/#sect-multi-user-installation

Installing on macOS >=10.15 requires relocating the store to an apfs volume.
Use sh <(curl https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume or run the preparation steps manually.
See https://nixos.org/nix/manual/#sect-macos-installation

And with the second one:

$ sh <(curl https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
$

@jacg
Collaborator Author

jacg commented Jun 22, 2020

Right, so the first one did download stuff, and even gave you the instructions for what to do on Catalina (i.e. the --darwin-use-unencrypted-nix-store-volume stuff) ... buried among all the noise.

The zeros on the second one are a bit suspicious, but maybe there's some caching going on somewhere, after all, you are downloading, once again, exactly the same file as you downloaded earlier.

What happens if you try the rest of the instructions ... i.e. continue with . $HOME/.nix-profile/etc/profile.d/nix.sh ?

@carmenromo
Collaborator

Yes, I tried it too just in case but:

 $ . $HOME/.nix-profile/etc/profile.d/nix.sh
.: no such file or directory: /Users/carmenromoluque/.nix-profile/etc/profile.d/nix.sh

@mmkekic gave me this link https://nixos.org/nix/manual/#sect-macos-installation
there are problems with Catalina apparently hmm

@jacg
Collaborator Author

jacg commented Jun 22, 2020

Yes, there are problems on Catalina, and sh <(curl https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume is the recommended solution to those problems ... which is also confirmed in the link that @mmkekic gave you.

I have one more idea: usually these nix installation instructions show a line of code that includes some variation on the theme of downloading the installer and piping it to sh (just like the recommended solution above). This one-liner invariably does not mention sudo itself, but the need to use sudo is mentioned elsewhere. So, chances are that you are supposed to do

sudo sh <(curl https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume

in other words, exactly what you did the last time, with sudo prepended.

@carmenromo
Collaborator

Do I have to delete something before?
Without it the result is bad again:

$ sudo sh <(curl https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Password:

sh: /dev/fd/11: Bad file descriptor
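
One possible reading of the `Bad file descriptor` error above: `<(...)` hands the script to `sh` through a descriptor such as /dev/fd/11, and sudo by default closes inherited descriptors beyond stdin, stdout and stderr, so that path no longer exists by the time `sh` starts. Separately, the attempts that printed only zeros called curl without `-L`, and curl does not follow redirects unless `-L` is given; whether that explains the empty download here is a guess. A sketch that sidesteps both by saving the installer to a file first (the helper name `download_and_run` is hypothetical, and whether the installer should be run under sudo at all is not settled by this):

```shell
# Download a script to a temporary file, then run it from disk,
# passing any remaining arguments through to the script.
download_and_run() {
    url=$1
    shift
    tmp=$(mktemp) || return 1
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    if ! curl -fsSL "$url" -o "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    rc=0
    sh "$tmp" "$@" || rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage:
#   download_and_run https://nixos.org/nix/install --darwin-use-unencrypted-nix-store-volume
```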

@jacg
Collaborator Author

jacg commented Jun 23, 2020

The differences in behaviour were, indeed, down to git-lfs. I have activated git-lfs on my machine and on Travis OSX and we now get essentially identical behaviour on

  • Travis nix-on-linux
  • Travis nix-on-osx
  • My local NixOS
  • @gonzaponte's nix-on-linux (I'm assuming it's identical)

(We still don't have a reliable way to install nix on Catalina ... I'm looking into it.)

The test output is currently very noisy. Does anyone have any obvious ideas how to silence this?

@jacg jacg force-pushed the manage-with-nix branch 4 times, most recently from 8826203 to afb8b4c (February 13, 2021 14:53)
@gondiaz
Collaborator

gondiaz commented Nov 15, 2021

Probably an outdated question, but have you considered using docker instead?

@jacg
Collaborator Author

jacg commented Nov 15, 2021

have you considered using docker instead?

Can't speak for anyone else, but I have.

I much prefer Nix. It's been a while since I've touched Docker, so I don't have an eloquent explanation of the many reasons for this on the tip of my tongue any more.

As a pragmatic point, you are unlikely to get administrators of HPC systems to install Docker, because of the massive security implications. Nix can be installed without admin rights, though that approach doesn't work universally and is sub-optimal.

@gonzaponte
Collaborator

Following this comment and re-reading what's been going on in this PR, it seems that there was an attempt to try nix on different machines, but I guess not on all of them. Did it work in the ones we've tried? Do we have any showstoppers? Do we understand any of the issues that can be dealt with?

Also, I recall from a different conversation that there was going to be a major version change in nix (or something like that) that would make the client's life much much easier. Are we there yet?

@jacg
Collaborator Author

jacg commented Nov 16, 2021

Following this comment and re-reading what's been going on in this PR, it seems that there was an attempt to try nix on different machines, but I guess not on all of them. Did it work in the ones we've tried? Do we have any showstoppers? Do we understand any of the issues that can be dealt with?

I think that there are 2 (and a bit) main obstacles:

  1. Ancient OSes and kernels
  2. Less obvious OS/architecture combinations
  3. Installation without admin rights

In slightly more detail:

  1. a) It just doesn't work on CentOS 6 (I think).
    b) Nix 2.3 requires a linux kernel > 3.10.0-693 but it starts working somewhere <= 3.10.0-1160. Don't know about the new Nix 2.4.

  2. Each combination of operating system / CPU architecture has a different set of Nix packages.

    Nix is used most (and therefore tested most and has the most complete set of packages) on the Linux / x86_64 combination. (For example, Geant4 is not available in any nixpkgs set for macOS.)

    OTOH, I am aware that support for macOS with Apple's new arm-based M1 chip was officially achieved a couple of months ago. The installation went very smoothly for @carmenromo and the basic things seemed to work out of the box.

    Argudell has PowerPC processors. I've no idea what the nixpkgs support for that is. Not great, I imagine. But most of the HPC machines we use are Linux/x86_64.

  3. Any x86_64/linux with decent support for user namespaces should allow a viable installation without root privileges. This is inferior to an admin-installed multi-user Nix, but it's viable, if necessary.
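
For point 1b, a candidate machine can be screened before anyone attempts an installation. A rough sketch, assuming `sort -V` (version sort, as in GNU coreutils) is available. The default threshold 3.10.0-1160 is only the known-good kernel mentioned above, not a documented Nix requirement, and the function names are made up:

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumption: sort supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds if the running kernel is at least the given version.
# Default: 3.10.0-1160, the release reported above as working with Nix 2.3.
kernel_ok() {
    version_ge "$(uname -r)" "${1:-3.10.0-1160}"
}

# Usage:
#   kernel_ok || echo "kernel $(uname -r) may be too old for Nix 2.3" >&2
```

A kernel between 3.10.0-693 and 3.10.0-1160 falls in the range where the point at which Nix starts working is not known, so a failure of this check means "needs testing", not "will not work".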

We also tried to install Nix in Singularity as that does tend to be available on many HPC systems. We didn't succeed, but I forget what the problems were. @jmbenlloch did the work. Along these lines, Nix has strong support for building Docker containers from Nix specifications. It also has (not as mature and almost completely undocumented, last time I looked) similar tools for producing Singularity containers. This might be a better alternative on such systems.

Also, I recall from a different conversation that there was going to be a major version change in nix (or something like that) that would make the client's life much much easier.

You are probably referring to Nix Flakes.

Are we there yet?

A new version of Nix (2.4) was released about 2 weeks ago. IIUC (I've had zero time to look into it), it has Flakes enabled by default, so you don't have to fiddle around with switching on experimental features (or maybe even installing a pre-release version of Nix) in order to be able to use Flakes. The new Nix version also includes a much more feature-complete version of the new Nix CLI. Both Flakes and the new CLI are almost certainly the future of Nix. But Flakes are still marked as experimental, even though they haven't changed significantly for something like 2 years.

In brief, are we there yet? We seem to be approaching there asymptotically.
