Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ci cache for windows #2507

Merged
merged 4 commits into from Dec 21, 2021
Merged

Fix ci cache for windows #2507

merged 4 commits into from Dec 21, 2021

Conversation

jneira
Copy link
Member

@jneira jneira commented Dec 18, 2021

  • The cache in windows is not working from some time ago. I did not find any job where the build times are below 40 min since several days ago.
  • They are downloading all the packages (so the source cache is not working) and building all the packages (so the cabal store is not working)
  • I have an hypothesis: both caches are failing due to different reasons
    • Source cache is failing cause we were using the default path for CABAL_DIR: \\AppData\\cabal in windows and ~/.cabal in the other os
    • cabal store cache is failing due to a bug in the haskell setup action:
      • we are using wisely steps.HaskEnvSetup.outputs.cabal-store but as you can see in this run the effective cache action is:
Run actions/cache@v2
  with:
    path: C:\sr
    key: compiled-deps-Windows-9.0.1-2021-11-29T12-30-07Z-0c0e51e7952f2bb6f6a68b93790dcd1c9321c830a6ec4745f9abf8bda0b5994c
    restore-keys: compiled-deps-Windows-9.0.1-2021-11-29T12-30-07Z-
  compiled-deps-Windows-9.0.1-
  compiled-deps-Windows-

C:\sr???? afaik that is the global cache dir for stack 🤦. The default cabal store dir in windows is ~.\AppData\cabal\store (well before the CABAL_DIR change)
I have to confirm it though


This pr embrace the use of CABAL_DIR (setting it to a default value common for all os's if it is not set) for set the cache paths
We can restore the use of steps.HaskEnvSetup.outputs.cabal-store when the value is fixed upstream.

@Anton-Latukha
Copy link
Collaborator

Yep.

I've seen C:\sr even during creation. Even remember looking into some sources (of action probably). Assumed C:\sr is "I guess it just a caching directory on Windows".

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Dec 18, 2021

Before workflows were configured:

- if: runner.os == 'Windows'
name: (Windows) Platform config
run: |
echo "CABAL_STORE_DIR=$SYSTEMDRIVE\\SR" >> $GITHUB_ENV
echo "CABAL_PKGS_DIR=~\\AppData\\cabal\\packages" >> $GITHUB_ENV

So the configuration of the cabal store directory has not changed.

@jneira
Copy link
Member Author

jneira commented Dec 18, 2021

Ok i was wrong, C:\sr is also used by chocolatey to set the cabal store:

https://github.com/Mistuke/CabalChoco/blob/9e68bad284fe604b51bae94c30329d83d7318a33/3.6.2.0/cabal/tools/chocolateyInstall.ps1#L343-L344

So steps.HaskEnvSetup.outputs.cabal-store is good and my hypothesis is not

The C:\sr thing of stack:

imagen

So not sure if using C:\sr for cabal is a good idea but 🤷

@Anton-Latukha thanks for investigating it

@jneira
Copy link
Member Author

jneira commented Dec 18, 2021

So maybe the change of CABAL_DIR has affected the store dir?

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Dec 19, 2021

You are right about Windows build, store is not configured right, the C:/sr setting does not work.

Windows builds indeed ran 40m-1h longer than other builds, I guessed that they just compile & run tests that long. (& I am biased to look at Linux builds & at macOS as soon they would have fast hardware).

https://github.com/haskell/haskell-language-server/runs/4571121606?check_suite_focus=true

Shows that it both gets the relevant cache hit & then builds the store. So I agree. That means that C:/sr probably not works, but %APPDATA%\cabal\store would.

Would look into setup action code.

@Anton-Latukha
Copy link
Collaborator

I also seem like rember setup action switching or considering ghcup for Windows.

@jneira
Copy link
Member Author

jneira commented Dec 19, 2021

I also seem like rember setup action switching or considering ghcup for Windows.

yeah there is an issue about but no action afaik

Would look into setup action code.

Afair it is still using chocolatey to install ghc and cabal. So maybe it worth to check the code of the chocolatey install here: https://github.com/Mistuke/CabalChoco/blob/master/3.6.2.0/cabal/tools/chocolateyInstall.ps1

The code i linked in the previous comment is:

  # If running on Github actions, configure the package to pick things up
  if (($null -ne $Env:GITHUB_ACTIONS) -and ("" -ne $Env:GITHUB_ACTIONS)) {
    # Update the path on github actions as without so it won't be able to find
    # cabal.
    echo "$cabal_path" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append

    # We probably don't need this since choco itself is already on the PATH
    # But it won't hurt to make sure.
    $choco_bin = Join-Path $env:ChocolateyInstall "bin"
    echo "$choco_bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append

    # New GHC Packages will add themselves to the PATH, but older ones don't.
    # So let's find which one the user installed and add them to the pathh.
    $files = get-childitem $binRoot -include ghc.exe -recurse

    foreach ($file in $files) {
      $fileDir = Split-Path "$file"
      echo "$fileDir" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
    }

    # Also set a global SR.
    UpdateCabal-Config "store-dir" "$($env:SystemDrive)\SR"
  }

So it tries to change the cabal store dir adding in the global cabal config

store-dir: C:\sr

@Anton-Latukha
Copy link
Collaborator

@Anton-Latukha
Copy link
Collaborator

Ok, sorry, it is 02:37 at my location. I would look at the situation anew tomorrow.

@jneira jneira force-pushed the fix-windows-cache branch 2 times, most recently from c854db2 to c3c710c Compare December 19, 2021 14:32
@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Dec 19, 2021

Also note that the sources get redownloaded.

It is while the setting to cache them was:

      - if: runner.os == 'Windows'
        name: (Windows) Platform config
        run: |
          echo "CABAL_PKGS_DIR=~\\AppData\\cabal\\packages" >> $GITHUB_ENV

(it is seen for example in https://github.com/haskell/haskell-language-server/runs/4574676443?check_suite_focus=true)

So we try to cache C:\sr - caching is not working properly & ~\AppData\cabal\packages sources cache also does not work. It may/probably two different problems thou.

Also during work, to restart the caching anew - you can advance the Hackage index (cabal in any case rolls back to the last nearby fixed state), that both would load old caches, but workflows would make sure new information is saved in the new cache.

As for example, with Windows, the case may be that not full cache was saved somehow & gets reused (it can happen during initial cache runs in PR, for example, test or bench seizing the cache key save slot before caching & so saving test/bench cache there. But: 1. merge to master regenerated the caches, 2. what we see is not test or bench cache.


So all which are mentioned above - are details. But they are refuted (the is enough implication to mostly refute them in my logic system).


It seems to be some Windows setup & GitHub CI-related thing. For example, does CI allows to override directories in the root of a disk, because overriding root dirs is a security concern. Maybe it saved stack files, but cabal files are elsewhere. Maybe chocolatey package changed something & we keep supplying old cache & its file structure does not match the new one.

@Anton-Latukha Anton-Latukha marked this pull request as draft December 19, 2021 22:26
Copy link
Collaborator

@Anton-Latukha Anton-Latukha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reset the review status so I can be called again when it would be suitable to review.

Please, request me to look into the code again, when we would have what to dialog about in it.

So far I have no idea why caching does not work for Windows.

(I plan to try running choco in VM also, to understand what it does myself, then I would have the ability of a deeper look into the code of things)

@jneira
Copy link
Member Author

jneira commented Dec 20, 2021

I am investigating the cache issue in windows in this pr in my fork: jneira#58, branch https://github.com/jneira/haskell-language-server/tree/fix-windows-cache2

Some things i discovered:

  • The C:\cabal thing is effective in windows and also the store-dir is C:\sr\
  • The compiled deps cache is correctly downloaded in C:\sr
  • After changing the sources cache to use C:\cabal\packages, the sources are correctly restored from cache and cabal does not download them
  • But the build installs new entries in the cache instead using the exisiting ones (see https://github.com/jneira/haskell-language-server/runs/4581174687?check_suite_focus=true). In the log output for cabal build -v3 for StateVar i see
    • The store-dir existing dir:
 reading package config: C:\sr\ghc-9.0.1\package.db\StateVar-1.2.2-ce901b4ca958e681fb5e3348adb2896320608537.conf
  • But the build uses another hash:
In order, the following will be built:
............
 - StateVar-1.2.2-987753584b04a1cf66bb17e1120e295276ba27af (lib) (requires build)
.........

It seems to me that it is a cabal bug in windows only triggered in ci. Locally it works fine and dont rebuild deps.

@jneira
Copy link
Member Author

jneira commented Dec 20, 2021

I am gonna trace the keys used to compute the hash

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Dec 20, 2021

Yes. It can be as simple as constant cache inheritance.

During merge I've looked through cache builds that they started & we're saving cache.

In master only Nix & Caching builds are run. So Caching should built & saved caches. & We essentially use caches from the last cache save of last project file hash change.

Just inserting a space somewhere or advance Hackage index. It may fix it.

@jneira
Copy link
Member Author

jneira commented Dec 20, 2021

Yes. It can be as simple as constant cache inheritance.

During merge I've looked through cache builds that they started & we're saving cache.

In master only Nix & Caching builds are run. So Caching should built & saved caches. & We essentially use caches from the last cache save of last project file hash change.

Just inserting a space somewhere or advance Hackage index. It may fix it.

Yeah that will be the next step, invalidate cache keys prefixing them with v2- (the definitive ones could be using the hackage index) and see how behave the cache eviction triggering more builds in my fork

Otoh triggering two sucessive builds in the same pr should trigger the cache hit with the good lib hashes. However in the main repo succesive builds of the same pr seems to not work. Will use a controlled pr in my fork to confirm or discard that.

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Dec 20, 2021

Nota bene: GitHub in the pipeline has a feature to manage the caching storage, on release of it maintainers would be able to navigate the caching storage. So we would be able to go, remove the problematic block & it would regenerate by itself or by just rerun of master workflow. It would be actions/cache#632.

@jneira
Copy link
Member Author

jneira commented Dec 20, 2021

Ok i have the evidence of the direct cause of the issue. I've traced the package hash keys of the existing package hash and the new one for one depedency StateVar
After the build we have:

Trace keys for computing package StateVar hash

Existing entry in the store StateVar-1.2.2-ce901b4ca958e681fb5e3348adb2896320608537:

pkgid: StateVar-1.2.2
component: ComponentLib
src: 5e4b39da395656a59827b0280508aafdc70335798b50e5d6fd52596026251825
pkg-config-deps: 
deps: base-4.15.0.0, stm-2.5.0.0, transformers-0.5.6.2
compilerid: ghc-9.0.1
platform: x86_64-windows
stripped-exe: False
debug-info: 0
ghc-options: -haddock

Rebuilt StateVar-1.2.2-987753584b04a1cf66bb17e1120e295276ba27af:

pkgid: StateVar-1.2.2
component: ComponentLib
src: 5e4b39da395656a59827b0280508aafdc70335798b50e5d6fd52596026251825
pkg-config-deps: 
deps: base-4.15.0.0, stm-2.5.0.0, transformers-0.5.6.2
compilerid: ghc-9.0.1
platform: x86_64-windows
stripped-exe: False
debug-info: 0
extra-lib-dirs: C:\msys64\mingw64\lib
extra-include-dirs: C:\msys64\mingw64\include
ghc-options: -haddock

I think the $CABAL_DIR change also changed the $CABAL_DIR/config adding this entries

 extra-include-dirs: C:\msys64\mingw64\include
-- deterministic:
-- cid:
extra-lib-dirs: C:\msys64\mingw64\lib
-- extra-framework-dirs:
extra-prog-path: C:\ghcup\bin,
                 C:\cabal\bin,
                 C:\msys64\usr\bin,
                 C:\msys64\mingw64\bin

(Observe the C:\ghcup\bin entry)


Ok so invalidating the cache should fix it.

what could we do? To consider:

  • this changes are pretty rare and a normal cache invalidation should fix it
  • as we can see only 5 fields from the global config can change for now: stripped-exe, debug-info, extra-include-dirs, extra-lib-dirs, ghc-options.
    • the more likely to be changed are extra-include-dirs, extra-lib-dirs
  • cabal freeze does not take in account global config files (i just checked it)

@jneira
Copy link
Member Author

jneira commented Dec 20, 2021

I think the $CABAL_DIR change also changed the $CABAL_DIR/config adding this entries

However i dont see anything in the pr which could cause that change:

https://github.com/actions/virtual-environments/pull/4264/files#diff-9028c63a6998459164e88e67823c4141a409a4632c4ecc604fc9b5f1a8c3f0f5R41

should have the opposite effect as th 5th parameter of the ps script which changes from false to true is

    # Skip adjusting cabal.config with mingw paths
    [switch]$NoAdjustCabalConfig,

🤔

@jneira
Copy link
Member Author

jneira commented Dec 20, 2021

Well in any case those little changes to adapt the source cache to the new $CABAL_DIR are good imo

@jneira jneira marked this pull request as ready for review December 20, 2021 22:36
@Anton-Latukha Anton-Latukha added the merge me Label to trigger pull request merge label Dec 21, 2021
@Anton-Latukha
Copy link
Collaborator

This rebase check would run a while, caching update is still ongoing in master.

But be ready for the merge 🧑‍🔧

@mergify mergify bot merged commit 070f16f into haskell:master Dec 21, 2021
@jneira
Copy link
Member Author

jneira commented Dec 21, 2021

the build still took 38 min and the packages were downloded 🤔

https://github.com/haskell/haskell-language-server/runs/4591267873?check_suite_focus=true

@Anton-Latukha
Copy link
Collaborator

Anton-Latukha commented Dec 21, 2021

It is because rebase was done before the cache was built in master.

So PR used old caches & run the build in parallel & saved the caches in parallel. But thankfully all that happens localized to the PR scope.

That is why I restarted builds in #2506 after caching was done in master.

Further - the caching rebuilds would be faster (the deps already cached properly, so it would run at speed of project build (also ~50 minutes faster then first run we had)) & after the cache is proper - waits of master rebuilts are not that important, further, an old caches are almost as good as new ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge me Label to trigger pull request merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants