Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] libsolv for conda / number of rules #284

Open
wolfv opened this issue Oct 11, 2018 · 37 comments
Open

[Question] libsolv for conda / number of rules #284

wolfv opened this issue Oct 11, 2018 · 37 comments
Labels

Comments

@wolfv
Copy link
Contributor

wolfv commented Oct 11, 2018

Hi,

I am investigating wether it would be worth to use the libsolv solver in the conda package manager.

For this I am wondering how many packages / rules you can expect to solve with libsolv in a reasonable timeframe? The current SAT solver of conda has the problem of being slow – but conda also produces a lot of SAT rules to be solved. If I am not mistaken, the last number I've seen was around 60'000...

Sorry if this channel is inappropriate to ask such a question.

Cheers,

Wolf

@ignatenkobrain
Copy link
Collaborator

Thanks for starting this conversation!

When you are saying "slow", could you provide more specific numbers?

@wolfv
Copy link
Contributor Author

wolfv commented Oct 11, 2018

Sure -- in the worst case times (for the entire installation process) of 2 minutes have been observed.
However, not all of this is going to be the solver, but a big chunk.

conda/conda#7239

@mlschroe
Copy link
Member

I can't tell you much about the speed of libsolv. However, if I try to upgrade all packages on my system the solver will create 68611 rules and solving takes 74 ms. The actual solving takes 5 ms, the rest is rule setup (i.e. provides lookup and the like).

added 73606 pkg rules for installed solvables
added 2626 pkg rules for updaters of installed solvables
added 0 pkg rules for packages involved in a job
added 1791 pkg rules because of weak dependencies
added 255 pkg rules because of linked packages
9984 of 60336 installable solvables considered for solving
pruned rules from 78279 to 63453
  binary: 12443
  normal: 51009, 127325 literals
pkg rule memory used: 1487 K
pkg rule creation took 53 ms
choice rule creation took 11 ms
63452 pkg rules, 2 * 2568 update rules, 0 job rules, 0 infarch rules, 0 dup rules, 23 choice rules, 0 best rules, 0 yumobs rules
overall rule memory used: 1608 K
solving...
resolving job rules
resolving installed packages
deciding orphaned packages
solver statistics: 0 learned rules, 0 unsolvable, 0 minimization steps
done solving.

solver took 5 ms
final solver statistics: 0 problems, 0 learned rules, 0 unsolvable
solver_solve took 74 ms

@Conan-Kudo
Copy link
Member

@wolfv I have a feeling that libsolv would work just fine for your case. :)

@wolfv
Copy link
Contributor Author

wolfv commented Oct 11, 2018

Hi guys, thanks for all the responses. It's encouraging to see that there is a community around this project!

I will need to look into how to get started. It would be great to swap out the solver of conda, but I need to figure out how well the concepts map to each other ... I've pinged the conda core developers on the gitter channel (https://gitter.im/conda/conda if anyone is interested).

Is there a good how-to-get-started doc for libsolv? Should one take a look at the dnf sources?

If I am understanding correctly (from looking at the doc/ folder in this repository) then the basic idea would be to create a tool that turns a conda package file specification into a .solv file which I could then probably hand over to libsolv to get a nice resolution?

@mlschroe
Copy link
Member

The .solv files are mainly used for caching, libsolv can read them very quickly. Dnf/libzypp uses them because getting the package data from the repository xml files takes a lot of time. But you can use libsolv easily without .solv files.

@Conan-Kudo
Copy link
Member

@wolfv The libdnf sources are probably a good reference.

@mlschroe
Copy link
Member

mlschroe commented Oct 11, 2018

Here's a simple example so that you can see how libsolv works:

import sys
import solv

pool = solv.Pool()

# what is installed in the system
installed = pool.add_repo('installed')
pool.installed = installed

# 
channel = pool.add_repo('channel1')

s = channel.add_solvable()
s.name = "A"
s.evr = "1-1"
s.arch = "noarch"
s.add_deparray(solv.SOLVABLE_PROVIDES, pool.rel2id(pool.str2id("A"), pool.str2id("1-1"), solv.REL_EQ))
s.add_deparray(solv.SOLVABLE_REQUIRES, pool.str2id("B"))

s = channel.add_solvable()
s.name = "A"
s.evr = "2-1"
s.arch = "noarch"
s.add_deparray(solv.SOLVABLE_PROVIDES, pool.rel2id(pool.str2id("A"), pool.str2id("2-1"), solv.REL_EQ))
s.add_deparray(solv.SOLVABLE_REQUIRES, pool.str2id("B"))

s = channel.add_solvable()
s.name = "B"
s.evr = "1-1"
s.arch = "noarch"
s.add_deparray(solv.SOLVABLE_PROVIDES, pool.rel2id(pool.str2id("B"), pool.str2id("1-1"), solv.REL_EQ))

s = channel.add_solvable()
s.name = "C"
s.evr = "1-1"
s.arch = "noarch"
s.add_deparray(solv.SOLVABLE_PROVIDES, pool.rel2id(pool.str2id("C"), pool.str2id("1-1"), solv.REL_EQ))
s.add_deparray(solv.SOLVABLE_CONFLICTS, pool.str2id("B"))

channel.internalize()

pool.createwhatprovides()

solver = pool.Solver()
jobs = []
jobs.append(pool.Job(solv.Job.SOLVER_INSTALL | solv.Job.SOLVER_SOLVABLE_NAME, pool.str2id("A")))

problems = solver.solve(jobs)
if problems:
    for p in problems:
        print "problem:", p
    sys.exit(1)
transaction = solver.transaction()

for p in transaction.newsolvables():
    print "install ", p

@mlschroe
Copy link
Member

Also have a look at the libsolv-bindings manpage:

https://github.com/openSUSE/libsolv/blob/master/doc/libsolv-bindings.txt

@mlschroe
Copy link
Member

And also the python implementation of the demo solver:

https://github.com/openSUSE/libsolv/blob/master/examples/pysolv

Looks like conda uses its own way of matching dependencies, we would need to implement that in libsolv (we currently have rpm, deb, haiku).

@msarahan
Copy link

@mlschroe @Conan-Kudo thanks for your thoughts here. I'm one of the conda maintainers, and I want to explore this a bit more. My current concerns are:

  1. conda already has its own repository index format. For example, see https://repo.continuum.io/pkgs/main/linux-64/repodata.json. We certainly could adapt to some new format over time, but it would be nicer for any new tool to be able to use the old repodata format. Is that feasible with Hawkey or any of the other libsolv tools?

  2. conda has a notion of package "build string" which is used to differentiate what I'll call "variants" of a package. These are related to versions, but not the same thing as versions. For example, you may see variants of a package for different library dependency versions (e.g. a python library built for different versions of python). In the RPM/deb worlds, it seems like they handle this more by changing the package names. Is something like build string supportable with libsolv, where you'd have potentially many packages with a similar version, but different dependencies? With conda, such packages may be specified explicitly by specifying the build string as part of a "MatchSpec" solver input, or they can be determined by specifying combinations of the underlying dependencies they represent (give me somepythonlib-2.0 and python-3.6). When specs are underdetermined, build strings themselves are not optimized, but the dependencies they represent are (https://github.com/conda/conda/blob/master/conda/resolve.py#L1020)

A large question for me is how much work we'd have to do to customize conda for libsolv specifically, vs. how much conda can be written to formulate problems for arbitrary solvers, and thus be less rigid in the future. Another conda user recommended that we look at the clingo solver, which uses the CUDF spec for formulating problems to clingo. conda/conda#7808 (comment)

How much is that approach possible with libsolv, either via CUDF, or via some similar idea, where conda would translate its current repodata and specs to some libsolv format, for feeding into libsolv behind some more generic interface in conda?

@ignatenkobrain
Copy link
Collaborator

conda already has its own repository index format. For example, see https://repo.continuum.io/pkgs/main/linux-64/repodata.json. We certainly could adapt to some new format over time, but it would be nicer for any new tool to be able to use the old repodata format. Is that feasible with Hawkey or any of the other libsolv tools?

Actually @mlschroe even created small json parser inside libsolv so you can easily implement support for conda repositories. Btw, hawkey is deprecated and has nothing to do with your problem ;)

conda has a notion of package "build string" which is used to differentiate what I'll call "variants" of a package. These are related to versions, but not the same thing as versions. For example, you may see variants of a package for different library dependency versions (e.g. a python library built for different versions of python). In the RPM/deb worlds, it seems like they handle this more by changing the package names. Is something like build string supportable with libsolv, where you'd have potentially many packages with a similar version, but different dependencies? With conda, such packages may be specified explicitly by specifying the build string as part of a "MatchSpec" solver input, or they can be determined by specifying combinations of the underlying dependencies they represent (give me somepythonlib-2.0 and python-3.6). When specs are underdetermined, build strings themselves are not optimized, but the dependencies they represent are (https://github.com/conda/conda/blob/master/conda/resolve.py#L1020)

That's something what I described in #286.

A large question for me is how much work we'd have to do to customize conda for libsolv specifically, vs. how much conda can be written to formulate problems for arbitrary solvers, and thus be less rigid in the future. Another conda user recommended that we look at the clingo solver, which uses the CUDF spec for formulating problems to clingo. conda/conda#7808 (comment)

I don't think you would need to do much, you would need to implement code in libsolv to read repository metadata and populate pool with solvables from it. Then it's mostly like job install name foo and libsolv tells you what other packages to install.

@msarahan
Copy link

@ignatenkobrain awesome, thanks!

Actually @mlschroe even created small json parser inside libsolv so you can easily implement support for conda repositories. Btw, hawkey is deprecated and has nothing to do with your problem ;)

Glad to hear - reducing the number of things for me to research is helpful!

That's something what I described in #286.

Yes, that looks like the crux of the problem. As you've pointed out in that issue, the problem is of arbitrary complexity, so simply making it all part of the name is not really viable. In conda-land, we represent that as a hash of a dictionary of variables that have affected the recipe somehow. It's less readily interpretable, but we have some tools to spit out what the hash represents.

I don't think you would need to do much, you would need to implement code in libsolv to read repository metadata and populate pool with solvables from it. Then it's mostly like job install name foo and libsolv tells you what other packages to install.

If I understand you correctly, I need to look into the JSON parser in libsolv, or perhaps somehow write something that translates conda metadata into .solv files? What's the best reference on the .solv format?

@mlschroe
Copy link
Member

You don't need to use solv files at all, they are basically a caching mechanism so that you don't need to "translate" the repository metadata each time you call the package manager.
You also do not need to use anything from the "libsolvext" library. I.e. you don't need to use some libsolv function to put the repository metadata into libsolv's pool object.

The next step forward would be me implementing conda dependency matching with a new REL_CONDA dependency type. This basically means implementing what's in conda's models/match_spec.py file.

@wolfv
Copy link
Contributor Author

wolfv commented Mar 5, 2019

Hi all,

finally I've gotten around to play with the libsolv Python bindings, and successfully created a small script that downloads conda repodata.json and then calculates the required packages using libsolv. Right now, the JSON parsing process is definitely the bottleneck.

Regarding build numbers and variants: currently this is indeed an issue. I think I'll implement a lookup-preprocessing step that calculates a new version number by appending the build number to the version string. That should overlay an ordering by build number. In the first step, the max build number will be searched, and all other build numbers will be prepended with 0 to match the max build number length. Then append.

I am just realizing that this will also require to modify all dependency relations ... so this is definitely not a nice solution. I don't know how hard it is to add this feature to libsolv?

For the build variants, the preprocessing step will check if a matching build variant is available, and if yes, remove all non-matching ones.

Anyways, here is the current state of the code.
https://gist.github.com/wolfv/f06a8a73c83920387d4dd6e67961c7f2
If you want to experience the build number issue, then one way to do it is to solve for r-rcpp and it will figure r-rbase==3.4.2 which is illegal, 3.4.1 is the correct number. This happens because r-rcpp is added multiple times with differnet build numbers and >=3.4.1,<=3.4.2 and >=3.4.2,<=3.5.0.

Besides this, the experience with libsolv is great! I would be very happy if you leave some comments on how to do this better.
I am also thinking to reimplement the parsing part with C++ or Rust, and I was looking at the Rust bindings of libsolv. But a big PR was just unmerged, so I'd be curious to know wether there is any active development?

Cheers!

@Conan-Kudo
Copy link
Member

>=3.4.1,<=3.4.2 and >=3.4.2,<=3.5.0

Stanzas like these can be done with REL_WITH, so it'd look like r-rcpp >= 3.4.1 + r-rcpp <= 3.4.2 to ensure the same package satisfies the condition.

@wolfv
Copy link
Contributor Author

wolfv commented Mar 6, 2019

Hi @Conan-Kudo,

thanks for the heads-up. I got this stuff to work, the problem is that conda has a notion of build_number.
So we have the r-rcpp 1.0.0 package multiple times with different build numbers (0, 1, 1000) and different range specifications. E.g.

r-rcpp 1.0.0
     0: r-base >=3.4.1,<=3.4.2
     1: r-base >=3.4.1,<=3.4.2
  1000: r-base >=3.5.0,<=3.6.0

So the way it's implemented right now in my version, I think libsolv ends up thinking the requirement is >=3.4.1, <=3.6.0

I'll solve this by preprocessing.

(Btw I just add multiple conditions manually in the add_deparray, is there a better way to do it? I don't know how to use REL_WITH ... also can libsolv parse stuff like >=3.3,<=3.4 ?)

@Conan-Kudo
Copy link
Member

(Btw I just add multiple conditions manually in the add_deparray, is there a better way to do it? I don't know how to use REL_WITH ... also can libsolv parse stuff like >=3.3,<=3.4 ?)

It does not, it needs to be constructed in a way as I mentioned earlier.

That said, if you're trying to retrieve a single solvable, you can do something like this to binding it to a single object by treating each condition with the REL_WITH condition.

Details about REL_WITH: https://github.com/openSUSE/libsolv/blob/master/doc/libsolv-pool.txt#boolean-dependencies

@Conan-Kudo
Copy link
Member

@wolfv Also note, libsolv needs to be built with -DENABLE_COMPLEX_DEPS=1 for this to work.

@wolfv
Copy link
Contributor Author

wolfv commented Mar 6, 2019

@Conan-Kudo great. Got it to work (apparently the pip build ships with COMPLEX_DEPS enabled :)

I managed to produce a installable environment file (meaning, if I pass this to conda to create a new environment, the current conda solver does not complain about conflicts). So this is already a pretty cool thing!

What can version strings look like with libsolv? I had one version string that looks like 8.0.0a0 indicating 8.0.0-alpha which should test false for <8.0.0 but I think it tests positive with libsolv. Also can version numbers have 4 "dots"? Could this be solved with converting it to 8.0.0.a?

Another question: for debugging, is there a way to figure out what the dependency tree looks like? E.g. if package D is required, can I figure out which package that is getting installed required that package? Like a reverse lookup? That would help a lot!

@Conan-Kudo
Copy link
Member

Conan-Kudo commented Mar 6, 2019

What can version strings look like with libsolv? I had one version string that looks like 8.0.0a0 indicating 8.0.0-alpha which should test false for <8.0.0 but I think it tests positive with libsolv. Also can version numbers have 4 "dots"? Could this be solved with converting it to 8.0.0.a?

Change the -alpha to ~alpha, and it'll resolve lower. Change the -post to ^post, and it'll resolve higher, always. The versioning constructs are based on what rpm supports, which is a superset of most system's versioning rules. You should only really need to worry about forcing a version lower, since Python ecosystem doesn't usually have issues with post-releases.

@Conan-Kudo
Copy link
Member

I am also thinking to reimplement the parsing part with C++ or Rust

libsolv has an implementation of JSON metadata in libsolvext: fae06e5

You could use that to speed things up?

@mlschroe
Copy link
Member

mlschroe commented Mar 6, 2019

wait wait wait, you really need REL_CONDA support to make libsolv's dependency matching to the right thing.

@Conan-Kudo
Copy link
Member

@mlschroe But there's no CONDA type...? Also, wouldn't it be the more generic PYTHON type, since conda would (hopefully) be compliant with the schemes and behaviors described in PEP 440.

@mlschroe
Copy link
Member

mlschroe commented Mar 6, 2019

Yes, REL_CONDA needs to be implemented first. I'll need to cross-check the specification with PEP440 to see how far it matches.

@msarahan
Copy link

msarahan commented Mar 6, 2019

Conda's version ordering is PEP440 for the most part, but not completely. Here's the rules: https://github.com/conda/conda/blob/master/conda/models/version.py#L47-L155

@wolfv
Copy link
Contributor Author

wolfv commented Mar 6, 2019

@mlschroe how much work is it to implement this and could I be of any help there?

@wolfv
Copy link
Contributor Author

wolfv commented Mar 12, 2019

For people who are still curious: I've gotten pretty far with this, over here: https://github.com/wolfv/mamba

This is using libsolv from C++ which I found quite easy to do, after some adjusting to the new concepts. It solves environments almost like the current conda solver. Obviously there are some corner cases here and there (but I think I am also not setting the repository priority correctly for the default channel etc).

In it's current state it can then hand over the packages to install & remove to conda, and conda does the installing.

I am using simd-json to parse the conda repository files, and then create all the packages in libsolv. This is pretty much sufficiently fast for my liking :)

I've also started building libsolv on conda forge: conda-forge/staged-recipes#7969
I am having issues with Windows. Have people already attempted to build libsolv in Windows? Is it at all possible? Would you be interested to make it work with Windows, upstream?

I am also doing some hacking around to make the version numbers look like something that (seems) to work fine with libsolv. I guess, the correct way would be to add a CONDA version comparison in evr.c?

Cheers,

Wolf

@ignatenkobrain
Copy link
Collaborator

I am having issues with Windows. Have people already attempted to build libsolv in Windows? Is it at all possible? Would you be interested to make it work with Windows, upstream?

cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-g3'
cl : Command line warning D9002 : ignoring unknown option '-O0'

I think @mlschroe won't be against having few conditions in CMakeLists.txt for windows stuff.

@mlschroe
Copy link
Member

(You probably already noticed, but conda's version comparison is already implemented. You need to enable CONDA support when configuring libsolv and then set the disttype to DISTTYPE_CONDA.)

@wolfv
Copy link
Contributor Author

wolfv commented Mar 26, 2019

Actually, I just noticed an hour ago :) the first protoype is released under the name "mamba" here: https://anaconda.org/conda-forge/mamba

I'll switch over to your version comparison as soon as possible! Thanks a lot!

@wolfv
Copy link
Contributor Author

wolfv commented Mar 28, 2019

@mlschroe just FYI the source code for mamba is here: https://github.com/QuantStack/mamba

I saw that you're still working on the conda support in libsolv. Just let me know if there is anything I can do to support you!

@mlschroe
Copy link
Member

Libsolv now supports a REL_CONDA relation type. The "name" part of the relation is the package name, the "evr" part is of the form "version" or "version build". The build part is currently ignored, but this will be easy to add. There's also currently no way to match the other components, i.e. "build_number", "track_features" etc.

@wolfv
Copy link
Contributor Author

wolfv commented Mar 29, 2019

This is great!
The build part seems to be more and more important, as lot's of packages e.g. in conda-forge specify it to pin against a specific version.

For the track_features: I am not sure how this is supposed to work (to be honest) but I was able to replicate the ability to prefer a certain feature by adding all packages with a feature into a new ("fake") repo and give it a high priority if the feature was requested. This seems to do exactly what's necessary.

For the build number: In my version I am currently adding it to the version, and normalize the version number, so that python 3.4 with build_number 12 becomes 3.4.0.12. That also seems to work reasonably well in the prototype.

One thing that I haven't been able to solve (and not sure if this will be solvable easily) is wildcards in build strings. E.g. on conda-forge you can find: "h5py * mpi_*" (which will match any version with the build string mpi_mpich_h1032f3f_1001 as well as mpi_openmpi_py37hc1ffa35_1002).

@mlschroe
Copy link
Member

With REL_CONDA it'll be easy to support wildcards (they already work for the version comparison)

Do you have an example of a package with dependencies that contain other component matches than 'name', 'version' and 'build'? I.e. a package that uses the '[component=xxx]' syntax in a dependency?

@wolfv
Copy link
Contributor Author

wolfv commented Mar 29, 2019

I don't think I've ever seen that, I am not even sure it's legal. Maybe @msarahan knows more?
I've looked through conda-forge and the anaconda channel and couldn't find an example.

@msarahan
Copy link

We don't currently support the bracket syntax, but it would be a nice addition. We think of this syntax in terms of conditional dependencies - "install this if some condition is true". Maybe this would be used to add in package sets for optional functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants