Status() is slow with a large number of untracked files #181

Open
raviqqe opened this issue Oct 12, 2020 · 16 comments
Labels
help wanted Extra attention is needed performance

Comments

@raviqqe

raviqqe commented Oct 12, 2020

func (*Worktree) Status() is slow when there are a large number of untracked files even if they are ignored by .gitignore. It is much slower than git status.

I imported this issue from src-d/go-git#844. This seems to be still an issue. Any update?

@akshaybabloo

This is slow for even four files

@bruth

bruth commented Jan 27, 2021

I was curious if anyone was thinking about this or working on it? Any ideas on what the path forward would be?

@jsattler

jsattler commented Mar 6, 2021

I have the same issue at the moment. Running worktree.Status() within a large worktree is quite slow. I tracked this down to the function func (w *Worktree) diffStagingWithWorktree(reverse bool) (merkletrie.Changes, error), where a call to merkletrie.DiffTree() calculates the list of changes between two merkletries. I have not traced it any further, but it looks like the bottleneck is somewhere within this calculation.

Update:
I did some profiling and found a few bottlenecks while calling Status():

(pprof) top10
Showing nodes accounting for 3320ms, 80.19% of 4140ms total
Dropped 120 nodes (cum <= 20.70ms)
Showing top 10 nodes out of 126
      flat  flat%   sum%        cum   cum%
    1350ms 32.61% 32.61%     1350ms 32.61%  crypto/sha1.blockAVX2
     470ms 11.35% 43.96%      810ms 19.57%  github.com/go-git/go-git/v5/plumbing/format/gitignore.(*pattern).Match
     390ms  9.42% 53.38%      740ms 17.87%  runtime.scanobject
     350ms  8.45% 61.84%      360ms  8.70%  syscall.Syscall
     230ms  5.56% 67.39%      270ms  6.52%  runtime.findObject
     220ms  5.31% 72.71%      220ms  5.31%  syscall.Syscall6
     130ms  3.14% 75.85%      130ms  3.14%  memeqbody
      60ms  1.45% 77.29%      870ms 21.01%  github.com/go-git/go-git/v5/plumbing/format/gitignore.(*matcher).Match
      60ms  1.45% 78.74%      150ms  3.62%  path/filepath.Match
      60ms  1.45% 80.19%       60ms  1.45%  path/filepath.scanChunk

[Image: profile001]

@steffakasid

steffakasid commented Mar 27, 2021

I experience the same. What I would basically like to do is: git --git-dir=/home/sid/.dof --work-tree=/home/sid status -uno. Is there any way to do that? Otherwise, all files in my home directory are listed as untracked (which also takes a really long time). By the way, status.showuntrackedfiles=no is not taken into account either.

@tcolgate

This has been outstanding for quite a while, but I just wanted to reiterate how painful it is. We have to avoid go-git for checking status because our internal tools became unusable in our nodejs projects thanks to node_modules.
I've investigated options for fixing this, but some guidance on how the maintainers would like the problem approached would be helpful. I'm willing and able to implement a correct fix if people can suggest one.

@jwalton

jwalton commented May 30, 2021

On a relatively small node.js project, a single call to Status() will take 7-8 seconds because of node_modules. On the moderate-sized front end for the project I'm working on now, it takes 46 seconds!

trotterdylan added a commit to bitcomplete/plz-cli that referenced this issue Mar 23, 2022
Getting the worktree status using go-git is extremely slow and is a long
standing known issue: go-git/go-git#181

Fall back to using command line git for this simple operation.

plz-review-url: https://plz.review/review/4863
trotterdylan added a commit to bitcomplete/plz-cli that referenced this issue Mar 23, 2022
PLZ-807 Use command line git for getting worktree status

mikelorant added a commit to mikelorant/committed that referenced this issue Feb 7, 2023
Large worktrees were causing significant delays in displaying the user
interface. This was due to calculating the hash of files to determine
the overall status of the worktree.

Go has poor performance with SHA1 hashing. Too many files were
unnecessarily hashed as well. These combinations caused some
repositories to take well over 10 seconds to display the user
interface.

This is a known problem in worktree status and an issue already exists.
go-git/go-git#181

Shelling out to call "git status" allowed for significant performance
increases, often in the sub-second range. A modified implementation was
used based on: gitleaks/gitleaks#463

The variation tries to use "git status" and if it fails falls back to
the original go-git implementation.

To help us keep things tidy and focus on the active tasks, we've introduced a stale bot to spot issues/PRs that haven't had any activity in a while.

This particular issue hasn't had any updates or activity in the past 90 days, so it's been labeled as 'stale'. If it remains inactive for the next 30 days, it'll be automatically closed.

We understand everyone's busy, but if this issue is still important to you, please feel free to add a comment or make an update to keep it active.

Thanks for your understanding and cooperation!

@github-actions github-actions bot added the stale Issues/PRs that are marked for closure due to inactivity label Dec 13, 2023
@tcolgate

This still renders the Status() method unusably slow on repos with large numbers of ignored files (like nodejs/npm working directories).

@github-actions github-actions bot removed the stale Issues/PRs that are marked for closure due to inactivity label Dec 14, 2023
@pjbgf pjbgf added help wanted Extra attention is needed performance labels Dec 15, 2023
@codablock
Contributor

I started to analyse what causes slowdowns in the project I'm working on and found that this issue is the root cause.

I also delved into go-git to figure out why it's slower than a regular git status. The reason is that git status does not traverse into directories if the whole directory is untracked, while go-git does. It even goes as far as calculating hashes for all untracked files.

I'm not sure what the best approach is to fix this. My current best guess is to change the Hasher interface so that the Hash function has the signature Hash() ([]byte, error), which would let us implement lazy hash calculation. For untracked files, the hash calculation would then never be triggered, which should speed up the process a lot.

My question would now be: Is the Hasher interface considered public API? Can we simply change it? @pjbgf maybe you can help here?

@pjbgf
Member

pjbgf commented Jan 2, 2024

@codablock yes, unfortunately that is part of the public API and is currently one of the "blockers" for sha256 - so changes to it would target v6.

Please note that #825 introduces some performance improvements to this area, but is pending some additional comments/documentation before we can merge it.

@codablock
Contributor

@pjbgf Thanks. Yeah that's what I assumed already. And I was not aware of #825, which will clearly also fix the underlying performance issue, but without the incompatible API change. I'll then wait for it to get merged/released instead of providing my own PR (I got it working locally already, so ping me if you still want to see it).


github-actions bot commented Apr 6, 2024

To help us keep things tidy and focus on the active tasks, we've introduced a stale bot to spot issues/PRs that haven't had any activity in a while.

This particular issue hasn't had any updates or activity in the past 90 days, so it's been labeled as 'stale'. If it remains inactive for the next 30 days, it'll be automatically closed.

We understand everyone's busy, but if this issue is still important to you, please feel free to add a comment or make an update to keep it active.

Thanks for your understanding and cooperation!

@github-actions github-actions bot added the stale Issues/PRs that are marked for closure due to inactivity label Apr 6, 2024
@akshaybabloo

Commenting to keep this thread active.

@pjbgf
Member

pjbgf commented Apr 6, 2024

@akshaybabloo Can you provide steps to reproduce what you are experiencing? Did you try a version that includes #825 (e.g. v5.12.0)?

@michaelangeloio

IMO, this is a good test of the performance of a large codebase: https://gitlab.com/gitlab-org/gitlab

Repro steps: check it out and point the following code at it:

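// dt.Repo is assumed to be the caller's already-opened *git.Repository.
wt, err := dt.Repo.Worktree()
if err != nil {
	return err
}

// Status() walks the whole worktree; this is the slow call.
status, err := wt.Status()
if err != nil {
	return err
}

for filename := range status {
	if status.IsUntracked(filename) {
		fmt.Printf("Untracked file: %s\n", filename)
		// Add to a list if you'd like
	}
}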

It's pretty slow (at least a few seconds) and seems to pick up ignored files (some in node_modules).

Can we not add an API to skip directories that are ignored? Is that existing functionality?

@akshaybabloo

akshaybabloo commented Apr 7, 2024

@pjbgf Here is my example code, based on @michaelangeloio's snippet, run against https://gitlab.com/gitlab-org/gitlab

package main

import (
	"fmt"
	"os"

	git5 "github.com/go-git/go-git/v5"
)

func main() {
	var path string
	var err error

	if len(os.Args) > 1 {
		path = os.Args[1]
	} else {
		path, err = os.Getwd()
		if err != nil {
			panic(err)
		}
	}

	fmt.Printf("Checking for untracked files in %s\n", path)

	repo, err := git5.PlainOpen(path)
	if err != nil {
		panic(err)
	}

	wt, err := repo.Worktree()
	if err != nil {
		panic(err)
	}

	status, err := wt.Status()
	if err != nil {
		panic(err)
	}

	for file := range status {
		if status.IsUntracked(file) {
			fmt.Printf("Untracked file: %s\n", file)
		}
	}
}

This shows (built it with go build -o main2 -ldflags="-s -w" main.go):

time ./main2 /home/akshay/code/personal/gitlab
Checking for untracked files in /home/akshay/code/personal/gitlab
Untracked file: node_modules/tailwindcss/stubs/tailwind.config.js
Untracked file: node_modules/tailwindcss/stubs/tailwind.config.ts
Untracked file: node_modules/tailwindcss/stubs/.gitignore
Untracked file: node_modules/tailwindcss/stubs/config.full.js
Untracked file: node_modules/tailwindcss/stubs/postcss.config.cjs
Untracked file: node_modules/tailwindcss/stubs/tailwind.config.cjs
Untracked file: node_modules/tailwindcss/stubs/.prettierrc.json
Untracked file: node_modules/tailwindcss/stubs/config.simple.js
Untracked file: node_modules/tailwindcss/stubs/postcss.config.js

real    0m4.495s
user    0m3.579s
sys     0m1.159s

Using git

time git status -sb --show-stash
## master...origin/master

real	0m0.160s
user	0m0.077s
sys	0m0.158s

Update

Did git gc on the repo and got

real    0m6.759s
user    0m3.741s
sys     0m1.505s

@github-actions github-actions bot removed the stale Issues/PRs that are marked for closure due to inactivity label May 7, 2024