
[nix-local-build] Garbage collecting the store #3333

Open
ezyang opened this Issue Apr 13, 2016 · 11 comments

6 participants
ezyang (Contributor) commented Apr 13, 2016

This is a bit of an interesting problem. On the one hand, it's intractable to determine the GC roots, because dist-newstyle "roots" may be scattered at arbitrary locations in the file system. On the other hand, so long as a library is not depended upon by an executable which is being "used" (i.e., part of a profile, see #3332), it is recoverable if we accidentally delete it: the next time someone runs new-build on the project, it will just get rebuilt. But this is not an entirely safe assumption; for example, I have a symlink to a binary in a dist-newstyle directory (since I'm dogfooding Cabal); if I accidentally GC away a dynamic library it depends on, I'll have to go and rebuild it.

dcoutts (Member) commented May 11, 2016

We should pick up the GC work that was done by the GSoC student (Vishal Agrawal) last summer. The approach to tracking roots they came up with was essentially to register them centrally. I don't recall the exact details, but essentially a central directory with symlinks to the locations of local build trees (to ghc environment files specifying the required libs). Then on GC we scan those roots, and ignore (and delete) any stale, dangling symlinks.
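The scan-and-drop step described above could look roughly like this (a minimal, hypothetical sketch, not the GSoC code; the `pruneRoots` name and the roots-directory layout are assumptions):

```haskell
import Control.Monad (forM)
import System.Directory
  ( doesPathExist, getSymbolicLinkTarget, listDirectory
  , pathIsSymbolicLink, removeFile )
import System.FilePath ((</>), isAbsolute)

-- | Remove dangling symlinks from the central roots directory and
-- return the surviving root targets. Relative link targets are
-- resolved against the roots directory itself.
pruneRoots :: FilePath -> IO [FilePath]
pruneRoots rootsDir = do
  names <- listDirectory rootsDir
  results <- forM names $ \name -> do
    let link = rootsDir </> name
    isLink <- pathIsSymbolicLink link
    if not isLink
      then pure []                          -- ignore stray files
      else do
        target <- getSymbolicLinkTarget link
        let resolved | isAbsolute target = target
                     | otherwise         = rootsDir </> target
        alive <- doesPathExist resolved
        if alive
          then pure [resolved]
          else removeFile link >> pure []   -- stale root: drop the link
  pure (concat results)
```

The live targets returned would then seed the GC's reachability pass; stale links simply disappear from the registry.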

hsenag (Member) commented Dec 31, 2016

Just wondering what the current status of this is - I've been using new-build for a few weeks and already have about 5GB worth of store. I guess I'll have to resort to just wiping it out and rebuilding what I need until there's a "proper" GC command.

ezyang (Contributor) commented Dec 31, 2016

I'm not aware of anyone who is working on it, so yes, wipe for now.

@fgaz fgaz moved this from planned to stretch goals in Last Mile for `cabal new-build` (HSOC2017) Aug 7, 2017

@fgaz fgaz moved this from stretch goals to planned in Last Mile for `cabal new-build` (HSOC2017) Aug 21, 2017

fgaz (Collaborator) commented Aug 25, 2017

@hvr and I had a chat about this some days ago. Here's what we came up with.

This is the representation of the pinned packages:

type PinnedPackages = Map UnitId PinState

data PinState =
    UsedBy [PinUse] -- ^ Some use of the package prevents it from being GCed. The list may become a Set instead
  | Explicit -- ^ the user explicitly ran something like `cabal new-gc pin pkgid`

instance Monoid PinState where
  -- the lists are concatenated, and 'Explicit' is the absorbing element

data PinUse =
    Project FilePath -- ^ A project/package somewhere in the filesystem requires this package.
                     -- The pinning is done when new-* is invoked in the project and cabal solves for a plan.
  | Installed FilePath -- ^ the exe/lib was new-installed

PinnedPackages is stored in per-package files, each containing a PinState (or not existing if the package is not pinned). If we use a path like ${HOME}/.cabal/store/<compiler>/<unitid>/gc-info we also get locking for free and no issues if the store is nuked.

As suggested above, a run of cabal new-gc will treat those packages as roots and keep them along with any of their dependencies.

If new-gc is run with a --delete-stale flag, it will also check whether the projects/installed packages were removed and, in that case, unpin them.

Finally:

<hvr> pin-cache could actually additionally be a cabal.project or ~/.cabal/config setting
<hvr> or it could be a --cache-mode={root,pin,off}
<hvr> thing, as it effectively describes a kind of gc-policy
<hvr> which makes sense to configure at global, per-project, ad-hoc levels
<hvr> a "gc-retention-policy"

@hvr, please just edit this comment if I left anything out
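For concreteness, the Monoid sketched above could be completed as follows (a stand-alone sketch with placeholder types; the real UnitId/PinUse would live in Cabal's own modules):

```haskell
-- Hypothetical stand-alone version of the PinState design discussed
-- in this thread: 'Explicit' absorbs everything, and otherwise the
-- use-lists are concatenated.
data PinUse
  = Project FilePath    -- a project on disk requires this package
  | Installed FilePath  -- the exe/lib was new-installed to this path
  deriving (Eq, Show)

data PinState
  = UsedBy [PinUse]  -- uses that keep the package alive
  | Explicit         -- user pinned it by hand
  deriving (Eq, Show)

instance Semigroup PinState where
  Explicit  <> _         = Explicit
  _         <> Explicit  = Explicit
  UsedBy xs <> UsedBy ys = UsedBy (xs ++ ys)

instance Monoid PinState where
  mempty = UsedBy []  -- "no uses": the identity for concatenation
```

With `UsedBy []` as the identity, merging per-source pin information is a plain `mconcat`, and an explicit pin can never be lost by combining it with use-based pins.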

hvr (Member) commented Aug 25, 2017

@fgaz IMO the Installed constructor makes little sense w/o additional meta-data; w/o such meta-data it seems to me it's more or less effectively the same as Explicit. IOW, PinUse should contain enough information to figure out if the reference is still "live".

fgaz (Collaborator) commented Aug 25, 2017

Yes, if the installation directory can be chosen (I just discovered there's an option to do it on old install) we need a path there too. Edited.

hvr (Member) commented Aug 25, 2017

@fgaz well, for executables you also need to take into account that an install can replace a previous version with a newer/different version of the same tool. And it can also happen that multiple packages clash over the same executable name.

For libraries explicitly installed via "new-install" it's not so clear to me what to use as the retainer-entity (unless we install into a "package environment", but that's only supported w/ GHC 8.0.2 and later)
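One conceivable liveness check for an `Installed` root, under the open questions above (everything here is an assumption: the store layout, the symlink-based install scheme, and the `installedRootLive` helper are hypothetical, not cabal's actual behavior): treat the root as live only while the installed path is still a symlink into the pinned unit's store directory, so a replaced or clashing executable no longer retains the old unit.

```haskell
import Data.List (isPrefixOf)
import System.Directory
  ( doesPathExist, getSymbolicLinkTarget, pathIsSymbolicLink )
import System.FilePath ((</>), addTrailingPathSeparator)

type UnitId = String  -- stand-in for cabal's real UnitId

-- | Is the file installed at @exePath@ still backed by @unit@'s store
-- entry? Returns False if the exe was deleted, replaced by a regular
-- file, or re-pointed at a different unit's store directory.
installedRootLive :: FilePath -> UnitId -> FilePath -> IO Bool
installedRootLive storeDir unit exePath = do
  exists <- doesPathExist exePath
  if not exists
    then pure False                 -- exe deleted: root is dead
    else do
      isLink <- pathIsSymbolicLink exePath
      if not isLink
        then pure False             -- overwritten by a real file: dead
        else do
          target <- getSymbolicLinkTarget exePath
          let unitDir = addTrailingPathSeparator (storeDir </> unit)
          pure (unitDir `isPrefixOf` target)
```

This sidesteps timestamps entirely, but it only works for symlink-style installs; a copy-style install would need the extra meta-data hvr mentions.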

lspitzner (Collaborator) commented Jan 6, 2018

I have started writing a little cabal-independent GC tool (source not uploaded yet) that so far only does some analysis and constructs the dependency graph of packages in a package-db. I also taught it to read in a list of local plan.jsons via cabal-plan, and to calculate the coverage when using them as roots for a potential GC.

I'm now considering better ways of determining roots. First question: why not

data PinUse
  = Project FilePath
  | Installed FilePath
  | Explicit

? But that is only a rather superficial change anyway, I guess.

Also, how do we determine from an Installed path whether this root is still alive? It seems we need at least a timestamp too (well, probably more than just a timestamp?).

More importantly: what about the case where the same project is tested with multiple GHC versions? At the moment, when switching compiler, the old plan.json is overwritten, correct? But that would mean we lose the root information for anything but the last-built compiler - we risk deleting the dependencies of our package for compiler version 2 just because we compiled with compiler version 4. But we may want to switch back! It seems we either need to rename to plan-$compiler.json or switch to something like

type RootSources = Map FilePath (Map CompilerVersion [UnitId])

or something in a similar direction. Though this requires that projects actively update the central registry not just once to register, but almost every time they create a new plan.

Also, I ran into haskell-hvr/cabal-plan#15, which raises a concern about the backwards-compatibility of plan.json (it only applies to Cabal < 2, so this is no real issue).
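The coverage computation described above amounts to a reachability walk over the dependency graph; a minimal sketch (with `UnitId` as a plain String stand-in, not cabal's real type):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type UnitId   = String
type DepGraph = Map.Map UnitId [UnitId]  -- unit -> direct dependencies

-- | All units reachable from the roots by following dependencies.
reachable :: DepGraph -> [UnitId] -> Set.Set UnitId
reachable g = go Set.empty
  where
    go seen [] = seen
    go seen (u:us)
      | u `Set.member` seen = go seen us
      | otherwise =
          go (Set.insert u seen) (Map.findWithDefault [] u g ++ us)

-- | Units a GC pass would collect: everything not reachable.
garbage :: DepGraph -> [UnitId] -> Set.Set UnitId
garbage g roots = Map.keysSet g `Set.difference` reachable g roots
```

Coverage in this sense is just `reachable` applied to the union of all registered roots, which is why losing a compiler's plan.json silently shrinks the live set.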

ezyang (Contributor) commented Jan 8, 2018

Yes, you make all good points. We already have a concept of having different directories for different configurations, e.g., with and without optimization. It might be good to cache plans separately for each configuration as well. See also #3343.

I'd be happy to take any patch that makes your life easier on this front.

lspitzner (Collaborator) commented Jan 8, 2018

See https://github.com/lspitzner/pkgdbgc

Implemented:

  • A global registry containing Map (path, compiler) [unitid] and a list of known plan locations
  • Printing some stats about package-dbs
  • In theory: unregistering packages/units that are not reachable from the set of registered roots

Not (yet) implemented:

  • Deleting the actual package contents after unregistering
  • ghc-pkg does not seem to support unregistering specific unit ids, which blocks this whole thing. See GHC trac #14648
  • Any tracking of configurations beyond the compiler: profiling, optimization, flags
  • Proper help/docs for the utility and its subcommands

Sorry, but I won't be making PRs against cabal. That codebase intimidates me too much.

lspitzner (Collaborator) commented Jan 16, 2018

I have resolved most of the issues, although an important bit remains: pkgdbgc still does not track profiling, optimization level, or other flags, so there is a risk of e.g. garbage-collecting the profiling-enabled dependencies just because your last compile had profiling disabled.

I am not entirely convinced that having multiple plan.jsons is a good idea. It seems rather likely that the user ends up accumulating several plans for various combinations of flags, which in turn effectively requires garbage-collecting outdated plans too. A lazier approach is to support specifying the build directory and pass the responsibility to the user. But it is indeed rather lazy.
