-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse-checkout builtin: upstreamable version #180
Sparse-checkout builtin: upstreamable version #180
Conversation
The sparse-checkout feature is mostly hidden to users, as its only documentation is supplementary information in the docs for 'git read-tree'. In addition, users need to know how to edit the .git/info/sparse-checkout file with the right patterns, then run the appropriate 'git read-tree -mu HEAD' command. Keeping the working directory in sync with the sparse-checkout file requires care. Begin an effort to make the sparse-checkout feature a porcelain feature by creating a new 'git sparse-checkout' builtin. This builtin will be the preferred mechanism for manipulating the sparse-checkout file and syncing the working directory. For now, create the basics of the builtin. Includes a single subcommand, "git sparse-checkout list", that lists the patterns currently in the sparse-checkout file. Test that these patterns are parsed and written correctly to the output. The documentation provided is adapted from the "git read-tree" documentation with a few edits for clarity in the new context. Extra sections are added to hint toward a future change to a moer restricted pattern set. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Getting started with a sparse-checkout file can be daunting. Help users start their sparse enlistment using 'git sparse-checkout init'. This will set 'core.sparseCheckout=true' in their config, write an initial set of patterns to the sparse-checkout file, and update their working directory. Using 'git read-tree' to clear directories does not work cleanly on Windows, so manually delete directories that are tracked by Git before running read-tree. The use of running another process for 'git read-tree' is likely suboptimal, but that can be improved in a later change, if valuable. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When someone wants to clone a large repository, but plans to work using a sparse-checkout file, they either need to do a full checkout first and then reduce the patterns they included, or clone with --no-checkout, set up their patterns, and then run a checkout manually. This requires knowing a lot about the repo shape and how sparse-checkout works. Add a new '--sparse' option to 'git clone' that initializes the sparse-checkout file to include the following patterns: /* !/*/* These patterns include every file in the root directory, but no directories. This allows a repo to include files like a README or a bootstrapping script to grow enlistments from that point. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The 'git sparse-checkout add' subcommand takes a list of patterns over stdin and writes them to the sparse-checkout file. Then, it updates the working directory using 'git read-tree -mu HEAD'. Note: if a user adds a negative pattern that would lead to the removal of a non-empty directory, then Git may not delete that directory (on Windows). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
/azp run Microsoft.git |
Pull request contains merge conflicts. |
The instructions for disabling a sparse-checkout to a full working directory are complicated and non-intuitive. Add a subcommand, 'git sparse-checkout disable', to perform those steps for the user. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The clear_ce_flags_1 method is used by many types of calls to unpack_trees(). Add trace2 regions around the method, including some flag information, so we can get granular performance data during experiments. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The sparse-checkout feature can have quadratic performance as the number of patterns and number of entries in the index grow. If there are 1,000 patterns and 1,000,000 entries, this time can be very significant. Create a new 'cone' mode for the core.sparseCheckout config option, and adjust the parser to set an appropriate enum value. While adjusting the type of this variable, rename it from core_apply_sparse_checkout to core_sparse_checkout. This will help avoid parallel changes from hitting type issues, and we can guarantee that all uses now consider the enum values instead of the int value. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The parent and recursive patterns allowed by the "cone mode" option in sparse-checkout are restrictive enough that we can avoid using the regex parsing. Everything is based on prefix matches, so we can use hashsets to store the prefixes from the sparse-checkout file. When checking a path, we can strip path entries from the path and check the hashset for an exact match. As a test, I created a cone-mode sparse-checkout file for the Linux repository that actually includes every file. This was constructed by taking every folder in the Linux repo and creating the pattern pairs here: /$folder/* !/$folder/*/* This resulted in a sparse-checkout file sith 8,296 patterns. Running 'git read-tree -mu HEAD' on this file had the following performance: core.sparseCheckout=false: 0.21 s (0.00 s) core.sparseCheckout=true: 3.75 s (3.50 s) core.sparseCheckout=cone: 0.23 s (0.01 s) The times in parentheses above correspond to the time spent in the first clear_ce_flags() call, according to the trace2 performance traces. While this example is contrived, it demonstrates how these patterns can slow the sparse-checkout feature. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
To make the cone pattern set easy to use, update the behavior of 'git sparse-checkout [init|add]'. Add '--cone' flag to 'git sparse-checkout init' to set the config option 'core.sparseCheckout=cone'. When running 'git sparse-checkout add' in cone mode, a user only needs to supply a list of recursive folder matches. Git will automatically add the necessary parent matches for the leading directories. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
9937fc0
to
876024d
Compare
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
/azp run Microsoft.git |
Azure Pipelines could not run because the pipeline triggers exclude this branch/path. |
/azp run Microsoft.git |
Azure Pipelines successfully started running 1 pipeline(s). |
876024d
to
181fb3b
Compare
/azp run Microsoft.git |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few naive questions but overall this looks good to me.
`git sparse-checkout disable` command. | ||
|
||
Sparse checkout support in 'git read-tree' and similar commands is | ||
disabled by default. You need to set `core.sparseCheckout` to `true` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to set
core.sparseCheckout
totrue
in order to have sparse checkout support
Does this mean when core.sparseCheckout=cone
that git read-tree
does the wrong thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope! I just didn't update this documentation in the commits adding the cone
setting. I should fix that.
{ | ||
struct argv_array argv = ARGV_ARRAY_INIT; | ||
int result = 0; | ||
argv_array_pushl(&argv, "read-tree", "-m", "-u", "HEAD", NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this pass through any progress on stdout? Should it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git read-tree
does not have progress indicators. I could add some. (It would go through stderr
if so.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaulting to progress and maybe having a quiet mode seems good from an experience perspective. I could see a user adding a very large cone and being frustrated if they don't get feedback on what's happening like checkout provides.
We can address this in a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (line.buf[0] == '/') | ||
strbuf_remove(&line, 0, 1); | ||
|
||
if (!line.len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extract strbuf_trim_leading_dir_sep
so we can be a bit more explicit about our intent?
if (!(x->flags & EXC_FLAG_NEGATIVE)) { | ||
/* Not a cone pattern. */ | ||
el->use_cone_patterns = 0; | ||
warning(_("unrecognized pattern: '%s'"), x->pattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this user-facing? If it is, should we indicate something like unrecognized sparse-checkout pattern
so they know where to look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe also indicate that it might be valid for sparse=true but not sparse=cone.
return; | ||
} | ||
|
||
if (x->patternlen >= 2 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a couple of similar patterns here of trying to, I think, effectively ignore the recursive wildcard(s) on the end of the entries. Should we extract something to make that easier to parse for a first-time reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method could likely be split into several methods and be better for it.
|
||
Then it compares the new skip-worktree value with the previous one. If | ||
skip-worktree turns from set to unset, it will add the corresponding | ||
file back. If it turns from unset to set, that file will be removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 3 paragraphs are a bit muddy. But I don't want to bog us down wordsmith-ing right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, many of them are copied directory from Documentation/git-read-tree.txt
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was afraid of that....
@@ -15,7 +15,7 @@ SYNOPSIS | |||
[--dissociate] [--separate-git-dir <git dir>] | |||
[--depth <depth>] [--[no-]single-branch] [--no-tags] | |||
[--recurse-submodules[=<pathspec>]] [--[no-]shallow-submodules] | |||
[--[no-]remote-submodules] [--jobs <n>] [--] <repository> | |||
[--[no-]remote-submodules] [--jobs <n>] [--sparse] [--] <repository> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I can't comment on the commit message, I'll park this here.
On the "clone: add --sparse mode" commit:
Where you mention "README" you might also mention files like .gitignore
and friends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And .gitattributes
!
Enable "sparse checkout" feature. If "false", then sparse-checkout | ||
is disabled. If "true", then sparse-checkout is enabled with the full | ||
.gitignore pattern set. If "cone", then sparse-checkout is enabled with | ||
a restricted pattern set. See linkgit:git-sparse-checkout[1] for more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the term "... pattern set" threw me a bit here. I guess I'm wondering if "set" is a noun or verb here.
Would it be better to say something about the pattern matching mode. As in: If "true", then sparse-checkout is enabled using the rules for .gitignore-style pattern matching. If "cone, then it is enabled using the rules for cone-style matching.
It seems like this config setting is just changing matching algorithm.
Whereas, saying "... pattern set" implies that we have 2 cached "sets" and are deciding which to install.
@@ -86,6 +86,56 @@ negate patterns. For example, to remove the file `unwanted`: | |||
---------------- | |||
|
|||
|
|||
## CONE PATTERN SET | |||
|
|||
The full pattern set allows for arbitrary pattern matches and complicated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thrown by the intro for the "cone" method talking about the "full" set.
Perhaps a L2 header for "Full/.gitignore algorithm" and then a L2 for the "Cone algorithm".
Merging without responding to comments because they may dramatically change based on community feedback. After feedback from the community, I will revert these commits and apply the new version to the feature branch as we go. It is important to unblock work. |
In #183, I reverted a change to builtin/checkout.c that slowed down 'git checkout -b'. That change introduced a new instance of the core_apply_sparse_checkout global value that was not in the previous version. I then merged #180 without regenerating a build. My bad. This fixes that remaining core_apply_sparse_checkout instance. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In #183, I reverted a change to builtin/checkout.c that slowed down 'git checkout -b'. That change introduced a new instance of the core_apply_sparse_checkout global value that was not in the previous version. I then merged #180 without regenerating a build. My bad. This fixes that remaining core_apply_sparse_checkout instance.
Uses the code currently at microsoft/git#180. Covers the basics of #8. * The default `scalar clone` runs `git sparse-checkout init` so the working directory only has files at root. * Run `git sparse-checkout add <folders.txt` to pipe in a list of folders, and it will expand those files. Example workflow: ```sh $ scalar clone https://dev.azure.com/gvfs/ci/_git/ForTests Clone parameters: Repo URL: https://dev.azure.com/gvfs/ci/_git/ForTests Branch: Default Cache Server: Default Local Cache: C:\.scalarCache Destination: C:\_git\test2\ForTests Authenticating...Succeeded Querying remote for config...Succeeded Using cache server: None (https://dev.azure.com/gvfs/ci/_git/ForTests) WARNING: Unable to validate your Scalar version Server not configured to provide supported Scalar versions Cloning...Succeeded Fetching commits and trees from origin (no cache server)...Succeeded Validating repo...Succeeded Mounting...Succeeded $ cd ForTests/src/ $ ls AuthoringTests.md GvFlt_EULA.md GVFS.sln License.md nuget.config Protocol.md Readme.md Settings.StyleCop $ echo GVFS/GVFS.Common >>../folders.txt $ echo GVFS/GVFS.UnitTests >>../folders.txt $ echo GitHooksLoader >>../folders.txt $ git sparse-checkout add <../folders.txt $ ls AuthoringTests.md GitHooksLoader/ GvFlt_EULA.md GVFS/ GVFS.sln License.md nuget.config Protocol.md Readme.md Settings.StyleCop $ echo GVFS/GVFS >>../folders2.txt $ echo GVFS/GVFS.FunctionalTests >>../folders2.txt $ git sparse-checkout add <../folders2.txt $ ls AuthoringTests.md GitHooksLoader/ GvFlt_EULA.md GVFS/ GVFS.sln License.md nuget.config Protocol.md Readme.md Settings.StyleCop $ ls GVFS GVFS/ GVFS.Common/ GVFS.FunctionalTests/ GVFS.UnitTests/ LibGit2Sharp.NativeBinaries.props ProjectedFSLib.NativeBinaries.props $ git sparse-checkout list / /GVFS/ /GVFS/GVFS/* /GVFS/GVFS.Common/* /GVFS/GVFS.FunctionalTests/* /GVFS/GVFS.UnitTests/* /GitHooksLoader/* ``` I spun up a few remaining issues for follow-up work: #76, #77, #78.
I'm creating this PR for a fourth time (see #163, #171, and #178 for earlier versions). This version is tracking my progress to create something that can be sent as an RFC upstream, but also can be used to start the sparse feature branch. This is based on v2.23.0.
Note: I currently have conflicts with the virtual filesystem feature, and I'll resolve those with a merge commit when I'm ready. I'm just creating this for tracking progress at the moment, but can also be a place for early feedback.
TODOs:
git sparse-checkout disable
to disable the sparse-checkout feature and return to a full checkout.git sparse-checkout init --cone
, to initialize in cone mode.git sparse-checkout add
whencore.sparseCheckout=cone
.This includes performance improvements with the hashset-based matching algorithm, but I'll leave further enhancements as smaller steps on top. Here are a few things I want to try: