Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

A real struct for indicating progress #890

Closed
wants to merge 3 commits into from

6 participants

@ben

git_indexer_stats is soon to get more and more things that have nothing to do with reporting progress. This PR introduces a new struct git_progress.

@vmg
Owner

Hm. It looks like git_indexer_stats didn't really get removed.

@travisbot

This pull request passes (merged 2f6c4c35 into 697665c).

@ben

As of now, the only code that makes use of the new struct is in the checkout API. Everywhere else bottoms out in the indexer API, which means using git_indexer_stats akshully does make sense. I'd like to see how people feel about hiding the fact that the indexer is in play for clone and friends.

@carlosmn
Owner

Hm. It looks like git_indexer_stats didn't really get removed.

It won't get removed, it will simply get more fields that are only sensible for the indexer, where total, current, processed and data_finished make sense. Anywhere else, it doesn't.

@vmg
Owner

zzzzzzzz @ more structs in the external API. Is there any way we can unify these?

@vmg
Owner

I'm thinking, for the general case, wouldn't a float *progress have the same effect (linear increase between 0.0 and 1.0?)

@nulltoken
Collaborator

I'm thinking, for the general case, wouldn't a float *progress have the same effect (linear increase between 0.0 and 1.0?)

Are there any case when we might need multiple steps of progress? A la Clone?

@scunz

shouldn't we be able to produce at least the same output as git.git does? like:
n of m objects ( x of y MiB ) percent_done% speed MiB/s

maybe a const char* is more suitable for that (remember that this is in fact the format we get from side-band in pull/push)

@vmg
Owner

const char * sounds like stuff of nightmares. I'd rather keep the struct and let the user build the string by hand.

@scunz

Indeed - but how should we tell the user when the side-band said: "Unpacking local objects ( 3200 / 6400 )\r" ?

@vmg
Owner

If I recall properly, we decided to have a string-progress-callback. We discussed this in Berlin, correct? @carlosmn @schu

@scunz

nice, then :)

@carlosmn
Owner

When does the server say it's unpacking objects?

When the server says it's counting and compressing objects, that gets sent over the side-band a bunch of strings, and that's how we pass it up (including the \r that it sends because it assumes that we're giving the output straight to printf for the console). Or we will pass it up, the code was ready but then I rewrote half of the network stack.

All the data needed for the download progress output has been in the library for some time (and the example fetch command shows it, except for the speed, but that's beside the point of the command atm). The delta-resolving could be a bit tricky to show right now, and that's part of the reason why I want to add additional fields to the struct.

@scunz

Indeed it doesn't say "unpacking", but counting, compressing and a summary. git push doesn't show anything the server did with the data (It's delta compression is all local and the server just 'saves' the pack)

@ben

I dislike hard-coded strings. Now the caller has to parse the string for the numbers to show them as a progress bar? What if the caller is a Turkish-language GUI client?

If the caller really wants the actual text sent back from the server, great, but I think the actual user almost never wants to see that. I'd like to make it easy to show a progress bar. If that means something other than what we have here, so be it, let's figure it out.

@ben

Here's an idea for a multi-stage progress report. It doesn't scream "thread-safe", but if stages are never removed once an operation is in motion, it'll probably be okay.

#define GIT_PROGRESS_STAGE_FETCH "Fetch"
// ...

typedef struct {
    size_t total;
    size_t current;
    const char *name;
} git_progress_stage;

typedef struct {
    size_t stage_count;
    git_progress_stage **stages;
} git_progress_multistage;

This way the name of the stage is communicated in a way that won't prevent clients from localizing ("Fetch" will always be "Fetch", and won't change when the server upgrades to the newest Git).

In order to prevent Bad Things From Happening, whatever code is filling in the git_progress_multistage struct will need to know how many stages there are and preallocate all the git_progress_stage structs before setting stage_count (and never reallocate), so a watching thread won't dereference something it shouldn't.

What do you guys think?

@vmg
Owner

Ewwwww. I don't really like this. As long as the Git server keeps returning strings, I think we should handle those verbatim. Any way to make this more pretty needs to start on the server, not on us.

@scunz

Going back one step here; first putting up a summary on what we really need here:

There are 2 different types of status reports that we should give to the user:

  • A remote does something:
    In this case we have nothing more than a const char* containing the information.
    If I'm not mistaken here, the remote git would have ran that const char* probably through gettext, thus it may be in any language and probably is UTF-8.
    I'll leave this one for now - we could get away with a simple int git_remote_set_status_callback(git_remote*, int (cb*)(git_remote*, const char*))

  • We do something.
    Here we have detailed information on how far the progress is and can express that in numbers (and possible attributes).

    • Most of the time, git.git reports 2 orthogonal progress counters at once: 1. a counter (n/m objects) and 2. a size (x of y KiB). It also reports percentage done and speed, but we can leave the calculation thereof for the advanced user.
    • Result of the deltification-progress (uses 4 Integers)
    • Current status of 'count objects' ( 1 Integer + 1 bool for the "done." message part)
    • Number of threads that will be used ( obviously 1 integer)

Though, we would still have to decide how much of that we want to report. But all in all I have not found a status report yet that cannot be expressed by 4 integer values. If someone knows of a place where 4 intergers ain't sufficient, please let me know.

The next challenge is that we have to report different progresses from inside one function. But the good thing here is: We can't parallelize things; our tasks progress linearly: i.e. first fetch, then checkout. As a conclusion: It can express with an enum what we're currently about to do.

So far, we can put up this struct as a sole carrier for progress information:

typedef struct {
    git_progress_t type;
    int payload[4];
};
@vmg
Owner

Sorry, I'm not quite following the approach towards our own progress reports... What's the payload[4] supposed to carry?

@scunz

Damn it. Editing text in web-browsers is a thing you have to get used to. Sorry if there were several mails.

@scunz

This thing is, that it has different meaning, depending on what we report. In meta code it might look like:

// This is from a git-gc run:

//Counting objects: 145054, done.
git__set_progress( GIT_PROGRESS_COUNT_OBJECTS, 145054, 0, 0, 0 );

//Delta compression using up to 4 threads.
// should not report this, hard to catch from a "watching"-thread

//Compressing objects: 100% (35523/35523), done.
git__set_progress( GIT_PROGRESS_COMPRESS_OBJECTS, 35523, 35523, 0, 0 );

//Writing objects: 100% (145054/145054), done.
git__set_progress( GIT_PROGRESS_WRITE_OBJECTS, 145054, 145054, 0, 0 );

//Total 145054 (delta 109044), reused 142871 (delta 106862)
// should not report this either, unless it's the last report, it's hard to catch from a watching thread

We could of course name the fields and set them atomic without the git__set_progress. (But that's not my point here)

@scunz

Now, let's consider an API for progress. There are 2 approaches: The first is like we do with options now:
Let the user create a container and give libgit2 a pointer to it.

git_progress progress;
do_in_thread( git_lengthy_operation( repo, &progress ) );
while( thread_running )
{
    printf( <something about progress> );
    sleep(n);
}

This requires us to publish a struct as container for progress information. We could as well do without, if we let the user peek the information:

do_in_thread( git_lengthy_operation( repo, &progress ) );
while( thread_running )
{
    git_progress_type type;
    int i1, i2, i3, i4;
    if (git_repository_peek_progress(repo, &type, &i1, &i2, &i3, &i4) < 0)
    {
        printf( "No progress available");
        break;
    }
    switch (type) {
    case GIT_PROGRESS_COUNT_OBJECTS:
        printf("Counting objects %i\r", i1 );
        break;
    case GIT_PROGRESS_COMPRESS_OBJECTS:
        printf("Compressing objects %i/%i\r", i1, i2 );
        break;
    // ...
    }
    sleep(n);
}

As the watching thread will probably be sleeping most of the time, I don't see a problem in copying these few bytes every other second.

Further, this gives us the advantage that we can atomically set a flag that the progress struct is about to be updated. And do a lock-free spin inside git_repository_peek_progress to ensure it never contains strange data because the worker thread was about to update something.
This might be neccessary because: When we first report "n of m objects" and then switch to a progress where m might be zero, the watching thread could be trying to divide n by m, where m is already zero in the cycle that the watcher reads it.

P.S.: Sorry again, this proposal wasn't meant to come as cluttered as it is, but my browser fooled me a bit.

@ben

@scunz nice!

Let me see if I can summarize all of this.

How complex does this need to be?

There are (depending on how you define them) between 3 and 5 stages to a clone:

Cloning into 'foo'...
remote: Counting objects: 2929, done.
remote: Compressing objects: 100% (870/870), done.
remote: Total 2929 (delta 1964), reused 2907 (delta 1953)
Receiving objects: 100% (2929/2929), 720.69 KiB | 218 KiB/s, done.
Resolving deltas: 100% (1964/1964), done.

...each of which is handled by a different and modular part of libgit2. These pieces don't (and shouldn't be forced to) talk to each other, so it seems desirable to have them each report their own progress, and provide all that information to the caller.

So the caller needs a collection of current/total pairs. This is C, so they need to know how many of them are valid. We may want to indicate somehow what the counters are counting.

How many counters?

If we allow parallel operations, how many counters do we allow? The existing idiom says "one per operation." The git_repository_progress_peek() idea says "one per repository." Is this limiting? Do we allow (say) parallel fetches, or anything else that might require reporting progress?

Who owns the buffer?

The current idiom has the caller owning the struct, and the library pokes values into it. The git_progress_multistage proposal has the caller owning a shell struct, but the library owning the array of stage structs (which, now that I think about it, is kind of a bad idea). The "peek" idea has the library owning the counter, and values are copied out of it on demand.

THE RIGHT ANSWERS

@arrbee, I know you were only half serious, but this solution is kind of growing on me.

typedef struct git_rational {
    size_t numerator;
    size_t denominator;
} git_rational;

#define GIT_PROGRESS_MAX_STAGES 8
typedef struct git_progress {
    size_t stage_count;
    git_rational stages[GIT_PROGRESS_MAX_STAGES];
    int stage_types[GIT_PROGRESS_MAX_STAGES];
} git_progress;

#define GIT_PROGRESS_COUNT_OBJECTS 1
//...

And:

  • Replace the two separate progress indicators on git_clone with one
  • Replace the current and total fields on git_indexer_stats with a git_rational*, so the indexer can report directly into a multi-stage progress struct

Also...

Are the server-side messages well-defined? If so, can we just parse them to make things easier for our users? They don't have to worry about the network binary protocol, why should they worry about the progress protocol?

@scunz

Interesting:

$ export LANG=de_DE.UTF-8 && git init && git remote add foo ../src && git fetch foo
Initialisierte leeres Git-Projektarchiv in /work/mgv/foo/.git/
remote: Counting objects: 3669, done.
remote: Compressing objects: 100% (1461/1461), done.
remote: Total 3669 (delta 2702), reused 2891 (delta 2172)
Empfange Objekte: 100% (3669/3669), 509.25 KiB, done.
Löse Unterschiede auf: 100% (2702/2702), done.
$ git --version
git version 1.7.12

As you can see, with current git, the messages that originate in the plumbing command are not translated. Does that mean they are stable? Might we conclude they will stay untranslated forever? Or is it just that the l10n process of core-git hasn't reached the plumbing commands yet?

If we can assume that these are stable, I'd give a definitive vote on getting the bare numbers out of them with a regular expression.

@scunz

A few topics where we might later want to extent this:

  • We might support more than one fetchurl / pushurl for git_remote.
    => In this case, GIT_PROGRESS_FETCH should be amended with a pointer to the url.
    => If we wanted to process them in parallel, how should we find a limit for the number of fetches that can be done in parallel (and thus need progress report)

  • We might implement recursive fetching and/or initializing for submodules inside libgit2.
    => Just multiplies the first case by the number of submodules, even if we don't fetch in parallel.

  • We might implement fetching submodules directly from git_clone_xxx
    => If not parallelized, just 1 more stage we'd have to account for. If parallelized, same as above.

What other cases might occur?

@ben

This is kind of a separate discussion, but I don't think git_clone() should support recursive cloning. Too much happening behind one API call, and a client can easily do a git_submodules_foreach. Though parallel clones should definitely be possible under that scenario.

Oh, and the engilsh text you're seeing? That's because the strings labeled "remote:" are piped straight from the wire to printf, including \r characters. :tongue:

@scunz

But my above example uses a local remote and spawns the "remote git" locally, using the same LANG (I'm not even sure that it actually is a separate process, but I won't check that right now :-) )

@carlosmn
Owner

It did spawn a new git-upload-pack because that's where the output comes from, but that doesn't mean that your LANG was carried into it or that the strings won't change.

Multiple fetch urls for the same remote in parallel would only make sense in a very narrow subset of cases, and that should be handled by the user. Multiple push urls are useful more generally, but it's still only a few cases where that's useful.

The alternatives for that that I would be: either let the user specify which of the urls they want to use, and make it thread-safe to run two in parallel, or build a complicated structure to describe a lot of work that's happening behind the scenes. The only one that makes sense to me is the first.

@scunz

Ah, I see, git-upload-pack is not a builtin but rather a executable on its own. Was curious about that, since I've seen dozens of places in git.git where they just prepare an argv and go to their command dispatcher instead of launching a separate process. But you're right here. As for the LANG not carried into it... I can't see "LANG" in context of a setenv call in git.git, so one might assume it will just get inherited; However I'm not sure if there isn't any mystical spell in between that might unset it :-)

As for the fetch-urls - i won't even try to disagree here, since you're more than obviously right. I'm undecided wrt to push urls.

Ben Straub and others added some commits
Ben Straub Introduce git_progress.
This is a more specialized (and better-named)
structure than git_indexer_stats, and is currently
used mainly for checkout operations.
040a7e8
@ben ben Progress: new model
Added git_rational and a multi-stage
progress struct on top of it.
3c92918
@travisbot

This pull request passes (merged 3c92918 into 2b175ca).

@ben

Here's a spike that produces nice output from the network example (note the pause while git_index_read_tree is working):

Progress: ---% / ---% / ---%  ==> (  0%)
Progress: ---% / ---% / ---%  ==> (  0%)
Progress:   1% /   1% / ---%  ==> (  0%)
Progress:   3% /   3% / ---%  ==> (  2%)
# ...
Progress:  82% /  30% / ---%  ==> ( 37%)
Progress:  92% /  31% / ---%  ==> ( 41%)
Progress: 100% /  32% / ---%  ==> ( 44%)
Progress: 100% /  37% / ---%  ==> ( 45%)
Progress: 100% /  41% / ---%  ==> ( 47%)
# ...
Progress: 100% /  94% / ---%  ==> ( 64%)
Progress: 100% /  98% / ---%  ==> ( 66%)
Progress: 100% / 100% / ---%  ==> ( 66%)
Progress: 100% / 100% / ---%  ==> ( 66%)
Progress: 100% / 100% / ---%  ==> ( 66%)
Progress: 100% / 100% / ---%  ==> ( 66%)
Progress: 100% / 100% /   8%  ==> ( 69%)
Progress: 100% / 100% /  41%  ==> ( 80%)
Progress: 100% / 100% /  84%  ==> ( 94%)
Progress: 100% / 100% / 100%  ==> (100%)

The part I'm most nervous about is adding the members to git_indexer_stats. They seem a bit clumsy to work with, and I don't like having them in addition to the three members that are already there. As of now, you get received/processed/total and (optionally) git_rationals with the same information. Is this really necessary? If this is the right approach, the three unsigned int members should probably go away.

@vmg
Owner

Where the fuck is this git_rational thing coming from? There's nothing rational about Git. ;/

Jokes aside, it has the exact same structure as git_progress, so one of those needs to go. I vote for killing git_rational

@ben

Yeah, I'm having a hard time thinking of another place we'd be using rationals, so you're probably right.

@travisbot

This pull request passes (merged ccea465 into 2b175ca).

This was referenced
@ben

Abandoning this in favor of callbacks (#990).

@ben ben closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Aug 26, 2012
  1. @ben

    Introduce git_progress.

    Ben Straub authored ben committed
    This is a more specialized (and better-named)
    structure than git_indexer_stats, and is currently
    used mainly for checkout operations.
  2. @ben

    Progress: new model

    ben authored
    Added git_rational and a multi-stage
    progress struct on top of it.
  3. @ben

    Progress: remove git_rational

    ben authored
Something went wrong with that request. Please try again.