In [None]:
#!/bin/bash

begin_notebook=$SECONDS

# Project Sizes

How do projects grow over time?

We have looked at commits over time, but what about the project sizes themselves?

There are many ways to measure the size of a project. Let's list a few.

1. Number of files
1. Total size of all the files
1. Size of all files after compression
1. Size of the Git repository

## Sampling commits

We'll want to re-use functions we built to sample commits through the history of the project.

We need

* the number of commits in the project
* how many commits to skip over to get the number of commits we want to sample
* the SHA1s of the sampled commits
* something that loops through those sampled commits, asking the question we want to ask of each

In [None]:
# number of commits
ncommits() { git rev-list --first-parent HEAD | wc -l; }

# interval between sampled commits
mod() { # how many commits do I skip to get $1 points?
    local npoints=${1:-1}  # default to a single point
    local ncmts=$(ncommits)
    local interval=$(( ncmts/npoints ))
    echo $(( interval > 0 ? interval : 1 ))  # at least 1, so awk never divides by zero
}
# SHA1s of the sample commits
only_every() { awk "(NR-1)%$1 == 0"; }
sample_revs() {
    git rev-list --first-parent --abbrev-commit --reverse HEAD |  # listed from first to last
        only_every $(mod $1)
}
# loop through sample revisions, calling a function for each
run_on_samples() {
    local npoints=1 # by default, sample a single point (the first commit)
    if [ $# -eq 2 ]; then
        local npoints=$1
        shift # discard first argument
    fi
    local func=${1:-true}  # do nothing, i.e., only report the commit
    for commit in $(sample_revs $npoints); do
        echo $commit $($func $commit)
    done
}

When at the top level of a repo, this lets us say things like `run_on_samples 1000 nfiles` to generate lists like this:
```
e83c516331 11
f864ba7448 19
...
9ae84d2e7f 4393
```
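
The sampling filter can be tried without a repository. The sketch below redefines `only_every` so the block runs on its own, and uses `seq` as a stand-in for the list of commits that `git rev-list` would produce.

```shell
# Same filter as in the cell above, repeated so this block is self-contained.
only_every() { awk "(NR-1)%$1 == 0"; }

# seq stands in for the commit list: keep every 3rd line, starting with the first.
seq 1 10 | only_every 3
# prints 1, 4, 7 and 10, one per line
```

Because the filter always keeps line 1, the first commit is always part of the sample.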

If we're looking for growth rates, we need to know *when* those commits happened.
Sure, in projects like Git, where commits are remarkably evenly spaced over time, we can get away without asking.
For other repositories, however, we want the first column to be a timestamp: the number of seconds after the very first commit.

We can re-use some more functions from our earlier notebook,
though we'll rearrange and clean some of them up a bit.

For example, we'll begin with a function that sets some global constants.

In [None]:
# constants used elsewhere
set_globals() {
    FIRST_COMMIT=$(git rev-list --first-parent --reverse HEAD | head -1)  # initialize per repo
    SPW=$(( 60*60*24*7 ))  # calculate and save seconds-per-week as a shell constant
}

This illustrates something important: by default, shell variables are globals, even if you define them inside a function. If you call a function that sets a shell variable, by default, that variable is set for the entire rest of the script (or notebook).
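
A minimal sketch of the difference, using two throwaway functions whose names (`set_counter`, `set_private`) are made up for this illustration:

```shell
set_counter() { counter=42; }        # plain assignment: visible everywhere afterwards
set_private() { local hidden=42; }   # 'local' confines the variable to the function

set_counter
set_private
echo "counter=${counter:-unset} hidden=${hidden:-unset}"
# prints: counter=42 hidden=unset
```

This is why the helper functions in this notebook declare their scratch variables with `local`, while `set_globals` deliberately does not.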

Today, this seems like a bizarre choice. The Unix shell was created before most other common languages, so some of its syntax and semantics are odd because it was born in the early days of higher-level language design, before there was as much general agreement on what languages should look like.

It's from an early stage in programming-language development. If it helps, think, "bash is to Python (or C) as Basque is to Ogden's Basic English."

And if "Ogden's Basic English" doesn't ring a bell, Google it. :-)

The second assignment, to SPW, is also interesting, because it shows how to do integer arithmetic: just enclose it in double-parens.
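
As a quick check of the double-paren syntax, the sketch below recomputes the constant and shows one consequence worth keeping in mind: shell arithmetic is integer-only, so division truncates. The figure of 1,000,000 seconds is an arbitrary value chosen for illustration.

```shell
SPW=$(( 60*60*24*7 ))       # seconds per week
echo "$SPW"                 # prints 604800

elapsed=1000000             # arbitrary number of seconds, for illustration only
echo $(( elapsed / SPW ))   # prints 1: the fractional part (about 0.65) is dropped
```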

In [None]:
# an absolute timestamp, in seconds-from-the-epoch
commit_time() { git log -1 --format=%ct $1; }
# seconds from the first named commit ($1) to the second ($2)
seconds_between() { echo $(($(commit_time $2) - $(commit_time $1))); }

# and finally, the function we need: seconds from the first commit.
timestamp() {
    seconds_between $FIRST_COMMIT $1
}
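
To check the arithmetic on something predictable, here is a sketch that builds a scratch repository with two empty commits exactly one week apart. The helper `commit_at`, the scratch directory, and the dates are all invented for this illustration; the two timing functions are repeated (pointed at the scratch repository with `git -C`) so the block runs on its own.

```shell
# Scratch repository; nothing here touches the repo we are analyzing.
demo=$(mktemp -d)
git init -q "$demo"

# Hypothetical helper: make an empty commit with a fixed date ($1) and message ($2).
commit_at() {
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" \
        git -C "$demo" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$2"
}
commit_at "1577836800 +0000" first    # 2020-01-01 00:00:00 UTC
commit_at "1578441600 +0000" second   # exactly 604800 seconds later

commit_time() { git -C "$demo" log -1 --format=%ct "$1"; }
seconds_between() { echo $(( $(commit_time "$2") - $(commit_time "$1") )); }

seconds_between HEAD~1 HEAD   # prints 604800, i.e. one week
```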

We'll usually want our timestamps in a coarser unit than seconds, like weeks,
so let's do the conversion by dividing the times in seconds by the `SPW` constant we defined earlier.

In [None]:
spw() { echo "scale=2; $1/$SPW" | bc; }
timestamp_in_weeks() {
    spw $(timestamp $1)
}

We work around the shell's refusal to do floating-point arithmetic by calling a separate utility, `bc`, an arbitrary-precision calculator.
`bc` takes instructions -- little programs -- so we construct the program, `scale=2; $1/$SPW`, and feed it to `bc` on `stdin`!

This is typical shell programming: find a utility that lets you tell it how to do what you want, then use other tools -- in this case, `echo` -- to construct the instructions.
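
Here is the same pattern in isolation, assuming `bc` is installed. The input values are arbitrary. Two details show up in the output: `bc` truncates rather than rounds at the requested scale, and it omits the leading zero for results below 1.

```shell
SPW=$(( 60*60*24*7 ))
spw() { echo "scale=2; $1/$SPW" | bc; }   # build a tiny bc program and pipe it in

spw 1000000   # prints 1.65 (1.6534... truncated to two places)
spw 302400    # prints .50 (half a week, no leading zero)
```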

In [None]:
# loop through sample revisions, calling a function for each,
# separate timestamp and result with a comma
run_on_timestamped_samples() {
    local npoints=1 # by default, sample a single point (the first commit)
    if [ $# -eq 2 ]; then
        local npoints=$1
        shift # discard first argument
    fi
    local func=${1:-true}  # do nothing, i.e., only report the timestamp
    for commit in $(sample_revs $npoints); do
        echo $(timestamp_in_weeks $commit),$($func $commit)  # no space before the comma
    done
}

We've modified our earlier function slightly to get CSV output in the format `timestamp,result` directly.

Enough prep work. Let's try our code. 

As before, we'll use the git repo as our guinea pig.

## 1. Number of Files

We can begin by just counting the number of files for a set of sample commits.
For this, we only need a function that returns the number of files.
We figured this out in the earlier notebook, too.

In [None]:
# count the files in the named commit without checking them out
files() { git ls-tree -r --full-tree --name-only ${1:-HEAD}; }
nfiles() { files $1 | wc -l; }

`files()` lists all the files in the named revision, where the name is any tree-ish: the SHA1 of a tree object, or anything Git can resolve to one, such as the SHA1 of a commit or a reference like `master` or `HEAD~2`.
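
To see several kinds of tree-ish in action, the sketch below builds a two-commit scratch repository and counts files in each. The wrapper `g`, the scratch directory, and the file names are invented for this illustration; `files` and `nfiles` are repeated so the block stands alone.

```shell
scratch=$(mktemp -d)
# Hypothetical wrapper: run git inside the scratch repo with a throwaway identity.
g() { git -C "$scratch" -c user.name=demo -c user.email=demo@example.com "$@"; }

g init -q
echo one > "$scratch/a.txt"; g add a.txt; g commit -q -m "one file"
echo two > "$scratch/b.txt"; g add b.txt; g commit -q -m "two files"

files()  { g ls-tree -r --full-tree --name-only "${1:-HEAD}"; }
nfiles() { files "$1" | wc -l; }

nfiles HEAD~1          # 1: the first commit, named relative to HEAD
nfiles HEAD            # 2: the latest commit
nfiles "HEAD^{tree}"   # 2: the tree object of the latest commit, named directly
```

Nothing is checked out along the way; `git ls-tree` reads the tree objects straight from the repository.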

In [None]:
[ -d git ] || git clone https://github.com/git/git.git # clone Git's source-code repo if it's not already there
cd git >/dev/null # and dive in
set_globals
run_on_timestamped_samples 10 nfiles

That was easy. Excelsior!

## 2. Total Size of All Files

This is harder than #1, for several reasons.

First, the easiest way to get the sizes is to list the files,
then run `wc` on each file, and sum the sizes to get a total.
Unfortunately, filenames change over time. We actually have to check out each commit we're interested in, then add up the sizes of the files in that snapshot's working tree.

Second, we have to sum the output of `wc`. For small repositories, we can just count on `wc` to give us totals, but when the number of files gets large, `xargs` will batch up the filenames into very long argument lists and feed each long list to `wc`, which then prints a separate total for each batch. For example, `linux`, a large repository, produces 17 different totals.

Third, we have to remember to ignore the files under `.git/`, because these are the repository itself and are not part of the working tree.
Even if we check out the first commit, the `.git/` folder will contain all the objects for every version.
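
The batching problem from the second point is easy to reproduce on a small scale. In this sketch, `xargs -n 2` forces tiny batches to imitate what happens when an argument list overflows; the temporary directory and its five files are made up for the demonstration.

```shell
tmp=$(mktemp -d)
for i in 1 2 3 4 5; do printf 'abc\n' > "$tmp/f$i"; done   # five 4-byte files

# Batches of two: wc prints a "total" line for every batch with more than one file.
find "$tmp" -type f | xargs -n 2 wc -c

# Summing the per-file counts ourselves, and skipping wc's totals, gives one number.
find "$tmp" -type f | xargs -n 2 wc -c |
    awk '$2 != "total" {sum += $1} END {print sum}'
# prints 20
```

Doing the sum in `awk` works the same whether `xargs` produces one batch or many.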

Working around all these problems will slow down our code.

Let's see what we can do.

In [None]:
# lines and bytes in the working tree of the named commit, as "lines,bytes"
lines-and-characters() {
    git checkout -q ${1:-HEAD}
    git ls-files | grep -v ' ' | xargs -L 1 wc |
        awk '{lines+=$1; chars+=$3} END {print lines "," chars}'
}

`git checkout` gives us a working tree for the commit of interest. `git ls-files` should then list the files in that commit. The rest of the pipeline runs `wc` on the names, one file at a time, and hands those results to `awk` to give us totals, both for numbers of lines and numbers of characters (strictly, bytes: that is what the third column of `wc` reports).

The `grep -v ' '` filters out filenames with embedded blanks. We could write more complicated code to include these, but there are so few that we'll just ignore them. We're favoring simplicity over precision.
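
For the curious, one possible way to include such files is to pass NUL-separated names, using `git ls-files -z` with `xargs -0`. This is only a sketch on a made-up scratch repository, not the approach used in the rest of this notebook; it also lets `xargs` batch the names, so it has to skip the `total` lines that `wc` adds.

```shell
scratch=$(mktemp -d)
git -C "$scratch" init -q
printf 'one\ntwo\n' > "$scratch/plain.txt"        # 2 lines, 8 bytes
printf 'three\n'    > "$scratch/with space.txt"   # 1 line, 6 bytes
git -C "$scratch" add .

# NUL-separated names survive embedded blanks; the last field is "total"
# only on wc's summary lines (or for a file literally named "total").
(cd "$scratch" && git ls-files -z | xargs -0 wc) |
    awk '$NF != "total" {lines += $1; chars += $3} END {print lines "," chars}'
# prints 3,14
```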

In [None]:
time run_on_timestamped_samples 10 lines-and-characters 2>/dev/null

It works, though it's slow and complains about occasional files that it doesn't know how to handle. Still, running it on 1000 samples remains a real possibility.

## 3. Size of Compressed Revision

Where did this come from? Why would we care about the compressed size?

At some level, we'd like to measure how the total amount of information in a project grows over time.
It's hard to measure information in a program or a project, but removing redundancy moves us in that direction,
and compression is an approach to removing redundancy.
(If you want to read more about this, look up "Kolmogorov Complexity.")

We might not bother to try this if it were hard to implement, but it's not:
let `tar` collect and compress all the files, excluding the `.git/` directory, send that compressed stream to `stdout`, and count the bytes.

In [None]:
compressed-size() { git checkout -q ${1:-HEAD}; tar --exclude-vcs -Jcf - . | wc -c; }
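
To get a feel for how compression squeezes out redundancy, here is a small sketch on a deliberately repetitive file. The file and its contents are invented for the demonstration, and it uses gzip (`-z`) instead of xz (`-J`) only because gzip is more widely installed; the idea is the same.

```shell
tmp=$(mktemp -d)
# 10,000 copies of the same 20-byte line: lots of bytes, very little information.
awk 'BEGIN { for (i = 0; i < 10000; i++) print "the same line again" }' > "$tmp/redundant.txt"

raw=$(( $(wc -c < "$tmp/redundant.txt") ))       # 200000 bytes on disk
packed=$(( $(tar -C "$tmp" -czf - . | wc -c) ))  # bytes in the compressed archive
echo "raw=$raw compressed=$packed"
```

The compressed figure comes out far smaller than the raw one, which is the sense in which it tracks information rather than bulk.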

In [None]:
time run_on_timestamped_samples 10 compressed-size

That's even slower, about half the speed of the total file size calculation, but still usable.

In [None]:
(( elapsed_seconds = SECONDS - begin_notebook ))
(( minutes = elapsed_seconds / 60 ))
seconds=$((elapsed_seconds - minutes*60))
printf "Total elapsed time %02d:%02d\n" $minutes $seconds