In [None]:
#!/bin/bash

begin_notebook=$SECONDS

# Project Sizes

How do projects grow over time?

We have looked at commits over time, but what about the project sizes themselves?

There are many ways to measure the size of a project. Let's list a few.

1. Number of files
1. Total size of all the files
1. Size of all the files after compression
1. Size of the Git repository

The first one's easy enough. In fact, we wrote a function for it in the earlier notebook.

In [None]:
# count the files in the named commit without checking them out
files() { git ls-tree -r --full-tree --name-only ${1:-HEAD}; }
nfiles() { files $1 | wc -l; }

`files()` lists all the files in the named revision. The name can be any tree-ish: the SHA1 of a tree object, or anything Git can resolve to a tree, such as the SHA1 of a commit, a branch name like `master`, or a relative reference like `HEAD~2`.
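As a quick check of the tree-ish idea, here is a minimal sketch that builds a throwaway two-commit repository and counts its files under several different names. The scratch repository, the file names, and the `demo_commit` helper are all made up for illustration, and the sketch assumes `git` and `mktemp` are installed.

```shell
# same definitions as above, with the argument quoted
files()  { git ls-tree -r --full-tree --name-only "${1:-HEAD}"; }
nfiles() { files "$1" | wc -l; }

# hypothetical helper: commit with a throwaway identity, so no git config is needed
demo_commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

demo_repo=$(mktemp -d)   # scratch directory, not part of any real project
(
    cd "$demo_repo" || exit 1
    git init -q
    echo one > a.txt; git add a.txt; demo_commit "first"
    echo two > b.txt; git add b.txt; demo_commit "second"

    echo "HEAD        $(nfiles HEAD)"                           # a symbolic reference
    echo "HEAD~1      $(nfiles HEAD~1)"                         # the parent commit
    echo "commit SHA1 $(nfiles "$(git rev-parse HEAD)")"        # a commit's SHA1
    echo "tree SHA1   $(nfiles "$(git rev-parse 'HEAD^{tree}')")"  # the tree object itself
)
```

The `HEAD`, commit-SHA1, and tree-SHA1 lines should all report the same count, because all three names resolve to the same tree, while `HEAD~1` reports the count from one commit earlier.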

## Sampling commits

We'll also re-use the functions we built to sample commits through the history of the project.
We need

* the number of commits in the project
* how many commits to skip over to get the number of commits we want to sample
* the SHA1s of the sampled commits
* something that loops through those sampled commits, asking the question we want to ask of each

In [None]:
# number of commits
ncommits() { git rev-list --first-parent HEAD | wc -l; }

# interval between sampled commits
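mod() { # how many commits do I skip to get $1 points?
    local ncmts=$(ncommits)
    local npoints=${1:-$ncmts}  # default to every commit
    local interval=$(( ncmts/npoints ))
    echo $(( interval > 0 ? interval : 1 ))  # at least 1, so only_every never divides by zero
}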
# SHA1s of the sample commits
only_every() { awk "(NR-1)%$1 == 0"; }
sample_revs() {
    git rev-list --first-parent --abbrev-commit --reverse HEAD |  # listed from first to last
        only_every $(mod $1)
}
# loop through sample revisions, calling a function for each
run_on_samples() {
    local npoints=$(ncommits)  # by default, sample every commit
    if [ $# -eq 2 ]; then
        npoints=$1
        shift # discard first argument
    fi
    local func=${1:-true}  # do nothing, i.e., only report the commit
    local commit
    for commit in $(sample_revs $npoints); do
        echo $commit $($func $commit)
    done
}

When we're at the top level of a repo, this lets us say things like `run_on_samples 1000 nfiles` to generate lists like this:
```
e83c516331 11
f864ba7448 19
...
9ae84d2e7f 4393
```
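To see how the sampling interval behaves, here is a small sketch that runs `only_every` on plain numbers instead of commits. The `interval` function is a made-up stand-in for `mod()` that takes the commit count as an argument rather than asking Git, so it can run outside a repository.

```shell
only_every() { awk "(NR-1)%$1 == 0"; }   # same filter as above

# hypothetical stand-in for mod(): $1 = number of commits, $2 = requested points
interval() { echo $(( $1 / $2 )); }

echo "10 commits, 3 points requested:"
seq 1 10 | only_every "$(interval 10 3)"   # interval 3 keeps lines 1 4 7 10

echo "10 commits, 4 points requested:"
seq 1 10 | only_every "$(interval 10 4)"   # interval 2 keeps lines 1 3 5 7 9
```

Because the interval comes from integer division, the number of samples is only approximate: asking for 3 points yields 4 here, and asking for 4 yields 5. The first commit is always included.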

If we're looking for the rates, we really want to know *when* those commits happened.
Sure, in projects like Git, for which commits are remarkably evenly spaced over time, we don't really need to ask.
For other repositories, however, we really want the first column to be a timestamp: the number of seconds after the very first commit.

We can re-use some more functions from our earlier work.

In [None]:
# an absolute timestamp, in seconds-from-the-epoch
commit_time() { git log -1 --format=%ct $1; }
# seconds elapsed from the first named commit ($1) to the second ($2)
seconds_between() { echo $(($(commit_time $2) - $(commit_time $1))); }
# and finally, the function we need: seconds from the first commit.
find_first_commit() {
    FIRST_COMMIT=$(git rev-list --first-parent --reverse HEAD | head -1)  # initialize before first use
}
timestamp() {
    seconds_between $FIRST_COMMIT $1
}

# turn that into weeks
(( SPW = 60*60*24*7 ))  # save seconds-per-week as a shell constant
spw() { echo "scale=2; $1/$SPW" | bc; }
timestamp_in_weeks() {
    spw $(timestamp $1)
}

# loop through sample revisions, calling a function for each,
# separate timestamp (in weeks) and result with a comma
run_on_timestamped_samples() {
    local npoints=$(ncommits)  # by default, sample every commit
    if [ $# -eq 2 ]; then
        npoints=$1
        shift # discard first argument
    fi
    local func=${1:-true}  # do nothing, i.e., only report the timestamp
    local commit
    for commit in $(sample_revs $npoints); do
        echo $(timestamp_in_weeks $commit),$($func $commit)
    done
}

Here, we've modified `run_on_samples` slightly to emit CSV output directly, in the format `timestamp,result`, with the timestamp in weeks since the first commit.
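One detail of `spw()` worth knowing: with `scale=2`, `bc` truncates rather than rounds, and it typically prints values below one without a leading zero (`.50`). If that matters to whatever reads the CSV, one possible alternative is to format with `awk` instead. This is only a sketch: the name `spw_awk` is made up here, and `printf` rounds where `bc` truncates, so the last digit can differ.

```shell
SPW=$(( 60*60*24*7 ))  # seconds per week, as above

# hypothetical alternative to spw(): format with awk's printf instead of bc
spw_awk() { awk -v s="$1" -v w="$SPW" 'BEGIN { printf "%.2f\n", s/w }'; }

spw_awk 302400    # half a week
spw_awk 1209600   # two weeks
```

With the default C locale this prints `0.50` and `2.00`.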

We'll need to call `find_first_commit()` whenever we enter a new repo.
The function illustrates something important: the default scope of shell variables is global. If you call a function that sets a shell variable, that variable stays set for the rest of the script (or notebook) unless the function declares it `local`.

This seems like a bizarre choice, but it reflects the age of the language. The shell predates many of the higher-level languages in common use today,
and some of its syntax and semantics seem odd simply because newer languages learned from its mistakes.

If it helps, think "bash is to Python as Basque is to English."
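To see the global-by-default rule in action, here is a minimal sketch. The function and variable names are made up for illustration.

```shell
set_global() { GREETING="hello"; }           # plain assignment: visible everywhere afterwards
set_local()  { local farewell="goodbye"; }   # 'local' confines the variable to the function

set_global
set_local
echo "GREETING=${GREETING:-<unset>}"   # still set after set_global returns
echo "farewell=${farewell:-<unset>}"   # gone once set_local returns
```

This is why `find_first_commit` can set `FIRST_COMMIT` for `timestamp` to read later, and why the looping functions declare `npoints` and `func` with `local` to avoid leaking them into the rest of the notebook.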

Let's try our code. As before, let's use the git repo as our guinea pig.

In [None]:
[ -d git ] || git clone https://github.com/git/git.git # clone Git's source-code repo if it's not already there
pushd git >/dev/null # and dive in
find_first_commit
run_on_timestamped_samples 10 nfiles
popd >/dev/null

In [None]:

(( elapsed_seconds = SECONDS - begin_notebook ))
(( minutes = elapsed_seconds / 60 ))
seconds=$((elapsed_seconds - minutes*60))
printf "Total elapsed time %02d:%02d\n" $minutes $seconds