In [None]:
begin_notebook=$SECONDS

# Introduction

I made this notebook for four reasons. To wit:

1. If you don't program in bash yet, this notebook should encourage you to try.
1. If you do program in bash, you'll learn some things you don't know.
   One of them is how good an environment Jupyter Notebooks provide for bash development.
1. If you're teaching people how to program in bash, here's another resource.
1. If you've never done software paleontology -- exploring the large-scale structure of software project histories -- this will get you thinking about it.

Let's explore repositories a bit.
We'll begin with a toy repository, `toy`.

# Exploring Repositories

In [None]:
[ -d toy ] || git clone https://github.com/jsh/toy  # make a local clone, unless it's already there.
pushd toy

We can start by listing all commits and how they relate to one another.
The command `git log --all --decorate --oneline --graph` does this for us with an ASCII graph.

There are GUI tools that provide such information, too, but Git will give it to you right from the command line. 
You don't have to install any apps, and it works on terminals connected to remote servers, 
where you may not even be able to install applications and may not have access to anything but out-of-the-box Git.

Unfortunately, running this command requires remembering four options, which you'll probably have to look up. 
(I've used the mnemonic "Log-A-DOG"). Plus, it's a lot of typing.

You can save wear and tear on your fingers and your brain by defining a Git alias, like this:

In [None]:
git config --global alias.lol "log --all --decorate --oneline --graph"

Once it's defined, you can just type `git lol`.

In [None]:
git lol

When you forget how to define this alias, an internet search for `git lol` will tell you in a trice.
It's in common use.

As you can see, it's a tiny repo, with only a few commits.  How many?

In [None]:
git rev-list --all | wc -l  # a pipeline that counts the number of lines

And who did them?

In [None]:
git shortlog -ns --all

Aha. Three authors. Nine commits.
To concentrate on the current line of development, just change `--all` to `HEAD`.
Usually, we'll be interested in the default, `master` branch. 
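
Git can also do the counting itself: `git rev-list --count` prints the number of commits instead of listing them. The sketch below is only an illustration. It builds a throwaway repository (the temporary directory, the `g` wrapper, and the branch name `side` are all made up here) to show how the counts for `HEAD` and `--all` can differ.

```shell
demo=$(mktemp -d)                       # scratch directory for a throwaway repo
git init -q "$demo"
# g: hypothetical wrapper that runs git in the scratch repo with a dummy identity
g() { git -C "$demo" -c user.name=demo -c user.email=demo@example.com "$@"; }

g commit -q --allow-empty -m "first"
g commit -q --allow-empty -m "second"
g branch side                           # side points at "second"
g commit -q --allow-empty -m "third"    # the original branch moves ahead
g checkout -q side                      # HEAD is now the shorter branch

echo "HEAD:  $(g rev-list --count HEAD)"    # HEAD:  2
echo "--all: $(g rev-list --count --all)"   # --all: 3
```

With `--all`, Git walks every branch, so the commit that only the other branch holds is counted too.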

Some projects, in a paroxysm of political correctness, have named these `main`, insisting that `master` can only be a reference to either chattel slavery or to men in "the patriarchy." Accordingly, we'll define a function to return the name of the current branch.

In [None]:
current_branch() { git symbolic-ref --short HEAD; }

We'll make the output more readable by using abbreviated commits, just long enough to be unique.
It's also convenient to reverse the order, starting our commit list with the very first commit.

Git can supply all this, but requires a command-name and some options that are annoying to have to look up.
Bash lets you wrap such things into functions, both to ease your typing burden and so you don't have to keep all those options in your head.


## Shell Functions Save Your Brain and Your Fingers

Here's our first function.

In [None]:
commits() { local top=${1:-HEAD}; git rev-list --abbrev-commit --reverse $top; } # define a function
commits # and run it

`commits()` reports the list of revisions leading up to the commit specified by its first argument.
Call it with too many arguments? bash doesn't care. `commits()` only looks at the first.
Call it with too few?
The shell idiom `${1:-HEAD}` means "use the first argument, or `HEAD` if there isn't one."
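
Here's the same default-argument idiom in a function that doesn't need a repository. The function name `greet` and its default value are invented for this sketch.

```shell
greet() {
    local name=${1:-world}   # first argument, or "world" if it is missing or empty
    echo "hello, $name"
}

greet            # hello, world
greet bash       # hello, bash
greet bash zsh   # hello, bash   (extra arguments are simply ignored)
```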

Notice that we can still even pass in `--all`!

In [None]:
echo commits in current branch: $(commits $(current_branch) | wc -l)
echo all commits: $(commits --all | wc -l)

The expression `$(some command or pipeline)` says, "Run the command or pipeline, and substitute the result here before trying to run the command line it's part of."

In other words, the first of these runs the pipeline `commits $(current_branch) | wc -l`, which generates an integer -- the number of commits in the current branch -- and then does "echo commits in current branch: 9".

The left side of that pipeline, `commits $(current_branch)`, runs `current_branch` to get the branch name, then makes that name the argument of `commits()`.
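
To watch nested command substitution work without a repository, here's a small sketch. The helpers `favorite` and `repeat_word` are made-up stand-ins for `current_branch` and `commits`.

```shell
favorite()    { echo bash; }                       # stand-in for current_branch
repeat_word() { printf '%s\n' "$1" "$1" "$1"; }    # stand-in for commits: prints its argument three times

# The inner $(favorite) runs first and becomes the argument of repeat_word;
# the pipeline's output then replaces the outer $(...).
echo $(repeat_word $(favorite) | wc -l) copies of $(favorite)   # 3 copies of bash
```

Leaving the substitution unquoted lets the shell discard any padding that `wc -l` puts around the number.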

If for nothing other than practice, let's turn counting the number of commits into a function.

In [None]:
ncommits() { commits $1 | wc -l; }
echo There have been $(ncommits) commits

How about the number of committers?

In [None]:
committers() { git shortlog -ns ${1:-HEAD}; }
ncommitters() { committers $1 | wc -l; }
echo There have been $(ncommitters) committers

To list the files in a tree, we could do `find . -type f | grep -vF /.git/`, which lists all the current files,
skipping Git's database, which is under the hidden, `.git` directory.  (You can skip over the database with special arguments to `find`, but they're harder to type and remember than this simple pipeline.)
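
For the curious, one form of those special arguments is `-prune`. This sketch fakes a tree in a temporary directory (the file names are invented) so it runs anywhere.

```shell
demo=$(mktemp -d)                                   # scratch tree with a fake .git directory
mkdir -p "$demo/.git/objects" "$demo/src"
touch "$demo/.git/config" "$demo/.git/objects/abc" "$demo/README" "$demo/src/main.sh"

# Don't descend into anything named .git; print every other regular file.
(cd "$demo" && find . -name .git -prune -o -type f -print | sort)
# ./README
# ./src/main.sh
```

It saves a process, but you can see why the `grep` pipeline is easier to remember.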

Let's also make a shell function for that.

For this quick-and-dirty listing, we'll also ignore any files that someone's named with embedded blanks -- filenames such as "this name needed more thought." You won't see these very often, but we threw in a `grep -v ' '` filter to ignore them after we stumbled over a dozen while doing this notebook. Aaaargh.

In [None]:
files() { find . -type f | grep -vF /.git/ | grep -v ' '; }
files
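
If you'd rather handle blank-ridden names than ignore them, one option is to separate names with NUL characters instead of newlines. This is only a sketch, with invented file names. `-print0` and `xargs -0` are extensions, though both GNU and BSD tools provide them.

```shell
demo=$(mktemp -d)
printf 'a\nb\n' > "$demo/plain.txt"
printf 'c\n'    > "$demo/this name needed more thought.txt"

# NUL-delimited names survive blanks; plain "find | xargs" would split the second name apart.
echo $(find "$demo" -type f -print0 | xargs -0 cat | wc -l) lines in all   # 3 lines in all
```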

A drawback of this definition is that it only reports on the current, checked-out commit.
For any other commit, you'd need to do a `git checkout`, perform the find, and then restore the original commit.
All those take time.

Luckily, there's a `git` subcommand that will do the trick: `git ls-tree`.

In [None]:
# count the files in the named commit without checking them out
files() { git ls-tree -r --full-tree --name-only ${1:-HEAD}; }
nfiles() { files $1 | wc -l; }
echo there are currently $(nfiles) files
first_revision=$(git rev-list --all | tail -1)
echo the first revision had $(nfiles $first_revision) files

Well, that was easy! Now let's count the number of files in each version.

In [None]:
for commit in $(commits); do nfiles $commit; done

We could even label each of these with its SHA1, like this:

In [None]:
for commit in $(commits); do echo $commit $(nfiles $commit); done

#### Sidebar: How can two commits have the same number of files?
Commits 4087aa7 and ece03c0 have the same number of files. Can that be right? Certainly.
Maybe ece03c0 just edited one of the files. Or maybe it deleted a file and added another.
In fact, you can see that a commit could even have fewer files than its parent.

In [None]:
git log -1 ece03c0

## More Little Functions

How big is each version? Time for some other little functions.

In [None]:
size() { files $2 | grep -vF ' ' | xargs wc $1 | grep total | sed 's/total//'; }
nlines() { size -l $1; }
nchars() { size -c $1; }
echo This commit has $(nlines) lines and $(nchars) characters.

In [None]:
for commit in $(commits); do echo $commit $(size -l $commit); done

A lot of bang for the buck here. Plenty of information without much work at all -- just a few, one-line shell functions.

#### Sidebar: Why Doesn't the First Commit Have a Size?

"Wait!" you're probably saying, "Where's the size of the first commit?"
In `size()`, we're looking for the word `total`, which labels the sum over all the files.
Well. If there's only one file, `wc` doesn't bother to total anything. The size of the lone file is right there in the output,
and there's nothing to sum.

There's another edge condition that we haven't yet encountered. For big enough lists of files, `xargs` splits the list across several `wc` invocations, so `wc` will report more than one total.

We could make `size()` handle these conditions, but at the expense of more complex code.
Bash's sweet spot is simplicity. It makes exploring easy. After you discover something exciting,
go back and make industrial-strength code to make your results precise and bullet-proof.
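
As a taste of that extra complexity, here's one possible way to cover both edge conditions: skip `wc`'s own totals and let `awk` add up the per-file counts. The helper name `sum_lines` is made up, and the sketch would still miscount a file that is literally named `total`.

```shell
# sum_lines: hypothetical helper; reads file names on stdin, prints their combined line count
sum_lines() {
    xargs wc -l | awk '$2 != "total" { sum += $1 } END { print sum + 0 }'
}

demo=$(mktemp -d)
printf 'one\ntwo\n' > "$demo/a.txt"
printf 'three\n'    > "$demo/b.txt"

echo "$demo/a.txt" | sum_lines                          # 2  (one file, so wc prints no total)
printf '%s\n' "$demo/a.txt" "$demo/b.txt" | sum_lines   # 3
```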

Other languages may be better suited to this later phase. Bash is more the tool of the explorers than the settlers, the Marines not the Army.

## Looping, Simplified

Looping over every commit is a lot of typing, so let's make a function for that, too -- this time, a function that takes another function for its argument!

Spreading the definition out over a few lines leaves something easier to read.

In [None]:
every_commit() { 
    local func=$1 
    for commit in $(commits); do      # for every commit, first to last
        echo $commit $($func $commit) # run the function, and report the result with the commit's SHA1
    done
}

every_commit ncommitters

Hey, that works!

It might present problems, though, if we're working with a repository with thousands or millions of commits.
Let's improve on that by only looking at every *N*th commit.

First, let's make a simple counter.

In [None]:
only_every() { awk "(NR-1)%$1 == 0"; }
for i in {0..10}; do echo $i; done | only_every 3

If we want to sample *N* equally-spaced commits, how far apart do they need to be?

In [None]:
mod() { # how many commits do I skip to get $1 points?
    local npoints=${1:-1}  # default to every commit
    local ncmts=$(ncommits)
    echo $(( ncmts/npoints ))
}

mod 2
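
One caveat: ask for more points than there are commits and the integer division yields 0, which `awk` will refuse to use as a modulus. A guarded variant might look like the sketch below. The name `safe_step` is invented, and it takes the commit count as an argument so that it runs without a repository.

```shell
# safe_step TOTAL [NPOINTS]: hypothetical variant of mod() that never returns less than 1
safe_step() {
    local total=$1 npoints=${2:-1}
    (( npoints < 1 )) && npoints=1      # avoid dividing by zero
    local step=$(( total / npoints ))
    (( step < 1 )) && step=1            # more points than commits: take every commit
    echo "$step"
}

safe_step 9 2     # 4
safe_step 9 20    # 1
```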

Next, we'll use these two to identify *N* revisions, equally spaced.

In [None]:
sample_revs() {
    git rev-list --reverse --abbrev-commit HEAD |  # listed from first to last
        only_every $(mod $1)
}
sample_revs 3

And, finally, here's a function that runs another function on each of those sample commits.

In [None]:
sample_commits() {
    local npoints=1 # by default, do every commit
    if [ $# -eq 2 ]; then
        local npoints=$1
        shift # discard first argument
    fi
    local func=${1:-true}  # do nothing, i.e., only report the commit
    for commit in $(sample_revs $npoints); do
        echo $commit $($func $commit)
    done
}
sample_commits 3 'echo ", doing something with" '
sample_commits 3 ncommitters

Notice here that we're requiring two things:
1. functions take a commit argument (though they may ignore it or default to `HEAD`), and
1. if a checkout is needed, the function must perform it.

The shell is a toolkit for quick, command-line exploratory data analysis (EDA), with few peers.

We could make each of these functions more flexible and robust, and could even improve their performance by re-writing them in a language like Python, but what we've written is enough to begin pawing through repos. We're making what marketing would call an MVP: a Minimum Viable Product.

Paraphrasing Tom Christiansen, "What's the difference in speed between a program in bash and a program in Java? About two weeks."

## Commits Over Time

One straightforward question to ask is how much time elapses between commits.
For this, we need the date of a commit.

We'll ask for it in seconds since the Unix epoch (1970/01/01 00:00:00 UTC).

In [None]:
commit_time() { git log -1 --format=%ct $1; }
commit_time HEAD

Why use this huge, unintuitive number? We'll often want to know elapsed time: the difference between two dates. The shell only does integer arithmetic, and seconds-since-the-epoch gives us integers that we can subtract from one another.
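
Here's that arithmetic in isolation. The two timestamps are made up for illustration; they aren't taken from any repository.

```shell
earlier=1700000000   # made-up timestamps, in seconds since the epoch
later=1700604800

elapsed=$(( later - earlier ))
echo "$elapsed seconds"               # 604800 seconds
echo "$(( elapsed / 86400 )) days"    # 7 days  (86400 = 60*60*24 seconds per day)
echo "$(( 7 / 2 ))"                   # 3  (integer division throws away the remainder)
```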

Here's a function to report the difference in seconds between two commits. Notice, order matters!

In [None]:
seconds_between (){ echo $(($(commit_time $2) - $(commit_time $1))); }
seconds_between HEAD~2 HEAD; seconds_between HEAD HEAD~2

And here's `relative_date`: the number of seconds since the first commit.


In [None]:
relative_date() { 
    first_commit=$(commits | head -1)   # looked up on every call; the SHA1 is saved in a global
    seconds_between $first_commit $1; 
}

In [None]:
relative_date HEAD~2; relative_date HEAD; seconds_between HEAD~2 HEAD

Now we can look at the commit dates for every commit in a repo with a single function call.

In [None]:
sample_commits $(ncommits) relative_date

Or of a sample of 3, evenly-spaced.

In [None]:
timestamps_of_sample_commits() {
    sample_size=$1
    sample_commits $sample_size relative_date |
        nl -v 0 -i $(mod $sample_size)   # number the lines with the commit numbers
}

timestamps_of_sample_commits 3

Writing tiny functions gives us flexibility. Pipelines let us massage the data into any shape.

For example, suppose we don't want the SHA1s, and we want the dates first, so the columns are in the order *date commit-number*.

In [None]:
timestamps_of_sample_commits 3 | awk '{print $3, $1}'

The shell can only do integer division, so the number of points reported is sometimes a few more or a few less than we asked for. Such is life. For quick-and-dirty exploration, we can settle for that.
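
A worked example, with made-up numbers, shows where the extra points come from. Suppose there are 9 commits and we ask for 4 points.

```shell
ncmts=9      # pretend the repo has 9 commits
npoints=4    # and we ask for 4 sample points

step=$(( ncmts / npoints ))    # integer division: 9/4 gives 2, not 2.25
echo step $step                # step 2

# Taking every 2nd line of 9 picks lines 1, 3, 5, 7 and 9: five points, not four.
echo sampled $(printf '%s\n' {1..9} | awk -v step="$step" '(NR-1) % step == 0' | wc -l) points
# sampled 5 points
```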

## Performance Tuning: `timestamps_of_sample_commits()`

Every time we call `relative_date`, we need to re-calculate the SHA1 of the first commit.
For huge repos, this could be a performance hit.

Here's a variant of `relative_date`, named `timestamp`, that uses another shell idiom to get around that cost.

In [None]:

timestamp() {
    : ${first_commit:=$(git rev-list $(current_branch) --reverse | head -1)}  # this is magic
    seconds_between $first_commit $1
}
unset first_commit
timestamp $(current_branch)
timestamp $first_commit

You can skip this explanation of the idiom if you want. It's a performance optimization.

You've already seen that `${foo:-whatever}` means "The value of `$foo` if it's set, with a default of `whatever` if it's not."

`${foo:=whatever}` means almost the same thing, but if `$foo` is *not* already set, also set foo to `whatever`.

Take, for example, `x=${a:-69}`.  If `a` is already set to `12`, this sets `x` to `12`, and `a` stays `12`.
If `a` is still unset, this sets `x` to `69`, but `a` is still unset.

The statement `x=${a:=69}` does almost the same thing. If `a` is already set to `12`, this sets `x` to `12`, and `a` stays `12`.
But if `a` is still unset, this sets `x` to `69`, and sets `a` to `69`, too.
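
You can check this for yourself. The sketch uses throwaway variables and prints `<unset>` wherever a variable has no value.

```shell
unset a b

x=${a:-69}; echo "x=$x a=${a:-<unset>}"   # x=69 a=<unset>   (:- leaves a alone)
y=${b:=69}; echo "y=$y b=${b:-<unset>}"   # y=69 b=69        (:= assigns to b as well)

a=12
x=${a:-69}; echo "x=$x a=$a"              # x=12 a=12        (a was set, so no default is needed)
```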

We're almost there. One more trick to explain -- what's that colon doing at the beginning of the line?

Colon, `:`, is a command. The shell parses the arguments it's called with, but then throws them away, and does nothing.
Wait. What?

It's a no-op whose *only* action is whatever comes about because the shell is parsing its arguments. It's the antithesis of functional programming -- we're only calling colon because of its side effects!

So, here we are. This command:

    : ${first_commit:=$(git rev-list $(current_branch) --reverse | head -1)}  # this is magic

looks to see whether `first_commit` is already set. If not, it sets `first_commit` to the SHA1 of the repo's first commit. From then on, it does nothing at all. This means we can run it every time we call `timestamp()`, but it only sets `$first_commit` the first time it's called.
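
To convince yourself that the expensive part really runs only once, here's a sketch that swaps the `git` pipeline for a made-up function, `slow_lookup`, which logs each call to a scratch file. All the names here are invented for illustration.

```shell
calls=$(mktemp)    # scratch file: gains one line per call to slow_lookup

slow_lookup() { echo called >> "$calls"; echo deadbeef; }   # stand-in for the git pipeline

cached() {
    : ${answer:=$(slow_lookup)}    # runs slow_lookup only while $answer is unset or empty
    echo "$answer"
}

unset answer
cached; cached; cached                              # prints deadbeef three times
echo slow_lookup ran $(wc -l < "$calls") time       # slow_lookup ran 1 time
```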

Cool, huh? Let's see it at work.

In [None]:
sample_commits 3 timestamp

## A Real Repo: Git Itself.

The first Git repo was the repo for the Git source code.
It was created as soon as Git could host its own source code, three days after Linus announced the project, and has been kept there ever since. It has every committed version of Git, ever, and is the Pre-Cambrian shale of Git version control.

Let's take a peek.

In [None]:
popd  # get back out of toy repo
[ -d git ] || git clone https://github.com/git/git.git # clone Git's source-code repo if it's not already there
pushd git # and dive in
unset first_commit

In [None]:
echo Git has had 
echo $(ncommits) commits
echo $(ncommitters) committers
echo Git has 
echo $(nfiles) files
echo $(nlines) lines

***Question:** sha1collisiondetection is an empty directory. Git normally ignores these, so how did it get committed?*

How about something more complex, like `timestamps_of_sample_commits()`?

In [None]:
time timestamps_of_sample_commits 3

This is no speed demon, so more performance tuning is a future goal, but it's usable. But what might we do with such data?
One obvious thing to try is to see how one varies with the other. Are commits fast and furious at first, slowing down as time goes on?
Do they start gradually, then pick up the pace?
Do they race faster and faster at first, but then plateau as Git matures?

A first step might be to do a least-squares fit of commit number against the timestamp.

The ranges of the numbers aren't very comparable. At this point, Git has had about 1000 weeks of history, so printing the timestamps in weeks instead of seconds might make more sense. Let's pipe the output of `timestamps_of_sample_commits()` through a filter to scale it.

We can turn seconds into weeks by dividing the timestamp by 60 * 60 * 24 * 7 (seconds/minute * minutes/hour * hours/day * days/week = seconds/week).

In [None]:
timestamps_of_sample_commits 10 | awk 'BEGIN {spw = 60*60*24*7} {print $1, $3/spw}'

The next step is to do curve fitting and measure goodness-of-fit to whatever curve we fit.

This is not a job for the shell.

## Fitting Curves to Data

One easy option for exploring curve-fitting is spreadsheets. 
It's easy to import columnar data into Google Sheets, graph it, then use various models to get best-fits of curves like polynomials,
logs and exponentials to some or all of your data to see what models you like best.

For the moment, let's do something simpler: find the least-squares fit of a line to our data, and use some standard goodness-of-fit measure to see how wildly the commit data varies from that straight line.

Even this is not a job for bash. Python seems like a reasonable choice, but we don't want this notebook to turn into a course on Python,
so we'll just ask ChatGPT to write a program for us.  It offers up the program `lsfit.py`, which we include in our repository.
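
The program itself isn't reproduced in this notebook. For readers curious about what a least-squares line fit computes, here's a sketch of the textbook formulas, leaning on `awk` for the floating-point arithmetic the shell lacks. It is an illustration only, not `lsfit.py`'s code, and the name `linefit` and its output format are invented.

```shell
# linefit: hypothetical helper; reads "x y" pairs on stdin, prints slope, intercept and R-squared
linefit() {
    awk '{ n++; sx += $1; sy += $2; sxy += $1*$2; sxx += $1*$1; syy += $2*$2 }
         END {
             dx = n*sxx - sx*sx      # n^2 times the variance of x
             dy = n*syy - sy*sy      # n^2 times the variance of y
             c  = n*sxy - sx*sy      # n^2 times the covariance of x and y
             if (n < 2 || dx == 0 || dy == 0) {
                 print "linefit: need points that vary in both x and y" > "/dev/stderr"
                 exit 1
             }
             slope = c / dx
             printf "slope=%.3f intercept=%.3f r2=%.3f\n", slope, (sy - slope*sx)/n, c*c/(dx*dy)
         }'
}

# Four points lying exactly on y = 2x + 1
printf '%s\n' '0 1' '1 3' '2 5' '3 7' | linefit   # slope=2.000 intercept=1.000 r2=1.000
```

An `r2` close to 1 says the points hug the line; values near 0 say a line explains little.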

To help make the result more sensible, let's also swap the columns of output we feed to `lsfit.py`. This makes time be the X-axis and commit-number the Y-axis. Instead of asking how time grows over commits, we'll see how commits accumulate over time.

`lsfit.py` requires `numpy`, so we need to make sure it's installed.

In [None]:
pip install numpy

In [None]:
time timestamps_of_sample_commits 1000 | awk 'BEGIN {spw = 60*60*24*7} {print $3/spw, $1}' | ../lsfit.py

Goodness!  The fit to a straight line is nearly perfect.

Think for a second about what this means:

1. Since its inception, the mean commit rate to the Git repository hasn't varied. Today, with over 2000 committers, without a central manager or corporation dictating who will do what and when, the rate is the same as it was in the beginning, when there was only Linus.
1. Predicting how many commits there will be a week from now or calculating how many there were a year ago is simple arithmetic.
1. Because commit number and date are linearly related, we can treat the commit number as just another timestamp. It differs from seconds or days or years only by a multiplicative constant.
To ask how any other property, like the number of committers or the number of files, varies over time, we can just ask how it varies with commit number.

Yes, it is a little slow. "Make it work, then make it fast."

In [None]:
(( elapsed_seconds = SECONDS - begin_notebook ))
(( minutes = elapsed_seconds / 60 ))
seconds=$((elapsed_seconds - minutes*60))
printf "Elapsed time running notebook %02d:%02d\n" $minutes $seconds

## Sources

This notebook lives in the repository
`https://github.com/jsh/paleontology-notebook.git`