idea for accelerating 'opam update' #3050
Nice! This would be for the HTTPS backend, I guess?

Well, to compare any repository in a fast way.

Ok, some context to explain my question: currently, update proceeds this way: ① fetch the new version → ② generate a diff → ③ validate the diff (for signed repos) → ④ apply the diff. When using git, it should already take care of making ① efficient, and we use it to obtain ② for free too. HTTP, which is still the default, is on the other hand quite inefficient, so your work would be very helpful there. Opam 1 first downloaded an index file to check for changes, but I found this was quite complex and actually didn't gain anything, so it now just downloads a full tar.gz of the repo. Another operation that is very time-consuming (at least on my machine, which is not SSD-based) is building the index of packages, which means reading the whole repository tree. This is done only after …
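For the HTTP backend, step ② boils down to comparing two checkouts of the repository tree. As a minimal sketch of that diff step, here is a hypothetical Python helper (an illustration only; opam's actual implementation is in OCaml and works differently):

```python
import filecmp
import os

def repo_diff(old_root, new_root):
    """Walk two repository checkouts and classify every path as
    added, removed, or changed (a rough stand-in for step 2)."""
    added, removed, changed = [], [], []

    def walk(cmp, prefix=""):
        for name in cmp.right_only:          # only in the new checkout
            added.append(os.path.join(prefix, name))
        for name in cmp.left_only:           # only in the old checkout
            removed.append(os.path.join(prefix, name))
        for name in cmp.diff_files:          # present in both, contents differ
            changed.append(os.path.join(prefix, name))
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(old_root, new_root))
    return added, removed, changed
```

Note that this walks both trees in full; the whole point of the proposals below is to avoid exactly that cost.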
I'd like to ping this issue.

@aantron - using 2.0.8 or 2.1.0~beta4?

I'm no longer working on this prototype, though I think the idea is quite nice for accelerating diffing against a remote repository.
@dra27 Trying just now, opam 2.0.8 took 3 minutes 48 seconds to do …

How many opam remotes / pins do you have?

I ran the commands with … EDIT: each command with a separate root. In case it can still interfere, I have one remote in the default …

Could you post the log of …
Just now, I ran

```
rm -rf ./opam21
fish -c "date; and ./opam-2.1.0-beta4-x86_64-linux init -na --root ./opam21 --disable-sandboxing --bare --debug -vvv; and date" > log-init 2>&1
```

which took 1 minute 58 seconds and produced this file. Then I ran

```
fish -c "date; and ./opam-2.1.0-beta4-x86_64-linux update --root ./opam21 --debug -vvv; and date" > log-update 2>&1
```

which took 2 minutes 46 seconds and produced this file. My system is …
It is WSL 1. The hardware is a 2020-model ultrabook (MSI GS66) with a fast SSD, etc.

On the whole, this is consistent with my experience of … EDIT: and I should add, across all versions of opam I have used.
Hm, a while ago we implemented an optimisation where we store the repositories as tar archives and decompress them to /tmp before using them; it is normally much faster, because we only have to read one sequential 4 MB file instead of scanning 10,000 small files across 3 layers of directories. That is, of course, assuming that the OS is doing caching, and that the lifetime of the directory below /tmp means it can be read fully from RAM. It seems that's the part that doesn't hold in your case, so the optimisation doesn't work (and even makes things slightly worse). From a very quick search, this is what I found about WSL 1 (https://news.ycombinator.com/item?id=25154300):
This makes me doubt what the comment is referring to. There are two filesystems visible from WSL 1: some kind of Linux filesystem that is not (at least, originally was not) visible from Windows, and NTFS, which you can access under some slightly awkward paths. It's not clear from the comment whether the person is referring to the latter (NTFS), or still referring to the former. At least as of two years ago, there was a huge difference in performance between WSL 1's Linux filesystem and WSL accessing NTFS. If the commenter is indeed referring to the latter, then the comment is irrelevant for this issue, because both my home directory and …

The underlying storage for the Linux filesystem is undoubtedly somewhere in NTFS ultimately, but there was, at least, a clear difference between how WSL treated those files and the files treated as being directly in NTFS. I assume that WSL made assumptions about the files it considered part of the Linux filesystem that allowed this massive relative increase in performance; again, it's not clear if the comment is referring to these specific assumptions. Is the comment claiming that even that increased performance was poor, and making technical statements about it, or commenting on poor performance when accessing NTFS as NTFS? So, I'm really not sure what is being said in this comment.

Since then, WSL files have become visible from Windows under some obscure paths, along with a network mount. The performance of both filesystems has increased, especially massively so for accessing the "real" NTFS.

The impression I get from many of the commenters in that thread is that they tried WSL 1 when it was slow for their use case, switched away from it, and retain their impression of WSL from that time. Likewise, the technical claims may be out of date, since there has clearly been a lot of optimization over the years. This is based on some commenters saying they only tried again with WSL 2; since that came out around last summer, it means they had a gap in WSL 1 experience before the summer, and their last experience with WSL 1 would have been well before that. It seems that a lot of the thread consists of people responding to these stale impressions.

In summary, I don't think much can be learned from the thread without interacting with it and asking for clarification. Nowadays, I routinely work on WSL 1, in NTFS, using both opam and npm, and I have no complaints about the performance of those workloads.
I agree; that's the first link I found, but a random comment on HN is not a reliable source of information ^^ Whatever the OS, etc., when I have been trying to optimise this, the major cost was from scanning a large number of files across a hierarchy, an area where filesystems themselves expose big differences in performance (ext4 is not very good at it). Here are some random benchmarks on what I have at hand, for scanning a full repo (…

Basically, with an SSD, the optimisation that uses … In any case, is it possible on WSL 1 to try setting /tmp as an in-RAM fs? That would be worth trying...
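The /tmp optimisation discussed above (keep the repository as a single tarball, unpack it into a temporary directory before use, so the cold read is one sequential file rather than thousands of small ones) can be sketched with Python's stdlib, as a rough stand-in for what opam does, not its actual code:

```python
import tarfile
import tempfile

def unpack_repo(archive_path):
    """Extract a repository tarball under a fresh temp directory
    (normally on /tmp, ideally an in-RAM fs) and return the
    directory from which the repository can then be read."""
    workdir = tempfile.mkdtemp(prefix="opam-repo-")
    with tarfile.open(archive_path) as tar:
        # One sequential read of the archive, instead of a scan
        # of ~10,000 small files spread across the disk.
        tar.extractall(workdir)
    return workdir
```

Whether this wins depends entirely on the cost of the subsequent small-file reads under /tmp, which is exactly what differs between ext4, tmpfs, and WSL 1's filesystems.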
VolFs (… WSL 2, on the other hand, on the same machine takes 11 seconds to do init and update in … However, this has nothing to do with the diffing method; the simple fact is that there are too many files. There are clear reasons for using WSL 1 vs WSL 2, but for opam you want to be using 2, or (another) VM. The long-term solution will be (optional) integration of ocaml-tar, so that the tarballs are never extracted.
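The "never extracted" approach can be illustrated with Python's stdlib tarfile module: individual files are read straight out of the archive, so no tree of small files ever touches the filesystem. This is only a sketch of the idea, not the planned ocaml-tar integration:

```python
import tarfile

def read_member(archive_path, member_name):
    """Return the contents of a single file stored inside a
    repository tarball, without extracting anything to disk.
    Note: tar has no index, so the reader still scans archive
    headers sequentially until it finds the member."""
    with tarfile.open(archive_path) as tar:
        f = tar.extractfile(member_name)
        if f is None:
            raise KeyError(member_name)
        return f.read()
```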
There is a small risk that using ocaml-tar might make things slower on Linux machines.

Here is the …
Has it already been considered to use .zip instead of .tar.gz? Zip files permit random access, which could avoid much of the …
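For comparison with the tar case: a zip archive keeps a central directory at its end, so a reader can seek straight to one member rather than scanning header-by-header. With Python's stdlib (illustrative member names, not opam's layout):

```python
import zipfile

def read_zip_member(archive_path, member_name):
    """Read one file out of a .zip archive. The central directory
    lets ZipFile seek directly to the member's offset, so the cost
    is independent of where the member sits in the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(member_name)
```

The trade-off is that zip compresses members individually, so a repository of many tiny opam files compresses worse than a solid .tar.gz stream.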
Are you running with softupdates on your OpenBSD laptop @mndrix? I find it makes a huge difference to opam performance, due to all the small files.
I wasn't. Thanks for suggesting it. For anyone who finds this later: mounting /tmp with … The slowest commands on my final test run after …
From time to time I get the idea that maybe we should slice our main opam repository to contain only packages not older than two years. It should improve the performance of opam update (I didn't measure it, but I strongly believe so)....
@Kakadu this would be pretty lame if we could only install packages that are less than two years old.

Date is an absurd pruning criterion. Some software is mostly "done" and stable. Could we dispel the myth that only software with PR churn and an active issue tracker is worth using? Some people do work well :-) To give an idea, a package like …
Indeed, there are plenty of things in …
Hello,
I have a prototype there: https://github.com/UnixJunkie/opmer
Be careful, it is not yet super fast and it might very well contain bugs.
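The rough idea, as I understand it, is a Merkle-style tree hash: every file gets a digest, and every directory gets a digest of its children's digests, so two checkouts can be compared top-down and identical subtrees skipped after a single hash comparison. A generic Python illustration of that scheme (not opmer's actual code, which is OCaml):

```python
import hashlib
import os

def tree_hash(root):
    """Merkle-style digest of a path: a file hashes to the digest
    of its bytes; a directory hashes the sorted (name, child-digest)
    pairs, so structurally equal subtrees get equal digests."""
    if os.path.isfile(root):
        h = hashlib.sha256()
        with open(root, "rb") as f:
            h.update(f.read())
        return h.hexdigest()
    h = hashlib.sha256()
    for name in sorted(os.listdir(root)):
        child = tree_hash(os.path.join(root, name))
        h.update(name.encode() + b"\0" + child.encode())
    return h.hexdigest()
```

Two checkouts with equal root digests need no further comparison at all; when digests differ, only the differing subtrees have to be descended into.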
Here is a usage example with two opam-repository checkouts: