# Exploring repositories.

Let's explore repositories a bit.
We'll begin with a toy repository, `toy`.

In [None]:
[ -d toy ] || git clone https://github.com/jsh/toy  # make a local clone, unless it's already there.
pushd toy

We can start by listing all commits and how they relate to one another.
The command `git log --all --decorate --oneline --graph` does this for us with an ASCII graph.

There are GUI tools that provide such information, too, but Git will give it to you right from the command line. 
You don't have to install any apps, and it works on terminals connected to remote servers, 
where you may not even be able to install applications and may not have access to anything but out-of-the-box Git.

Unfortunately, running this command requires remembering four options, which you'll probably have to look up. 
(I've used the mnemonic "Log-A-DOG".) Plus, it's a lot of typing.

You can save wear and tear on your fingers and your brain by defining a Git alias, like this:

In [None]:
git config --global alias.lol "log --all --decorate --oneline --graph"

Once defined, you can just type `git lol`.

In [None]:
git lol

When you forget how to define this alias, an internet search for `git lol` will tell you in a trice.
It's in common use.

As you can see, it's a tiny repo, with only a few commits.  How many?

In [None]:
git rev-list --all | wc -l

And who did them?

In [None]:
git shortlog -ns --all

Aha. Three authors. Nine commits.
To concentrate on the current line of development, just change `--all` to `HEAD`.
Usually, we'll be interested in `master`.

We'll make the output more readable by using abbreviated commits, just long enough to be unique.
It's also convenient to reverse the order, starting our commit list with the very first commit.

Git can supply all this, but requires a command-name and some options that are annoying to have to look up.
Bash lets you wrap such things into functions, both to ease your typing burden and so you don't have to keep all those options in your head.

Here's our function:

In [None]:
commits() { local top=${1:-HEAD}; git rev-list --abbrev-commit --reverse $top; }
commits

Here, we're reporting the list of revisions leading up to the commit specified by the first argument to `commits()`.

The shell idiom `${1:-HEAD}` means "use `HEAD` as the first argument to `commits()` if none is specified."


Notice that we can still even pass in `--all`!

In [None]:
echo commits in master branch: $(commits | wc -l)
echo all commits: $(commits --all | wc -l)
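
The `${1:-HEAD}` default can be tried in isolation. Here is a minimal sketch; `greet` is a made-up function used only to illustrate the idiom, and is not part of this notebook's toolkit.

```shell
# Hypothetical example: fall back to "world" when no argument is given.
greet() { local name=${1:-world}; echo "hello, $name"; }

greet          # no argument, so the default is used
greet toy      # an explicit argument overrides the default
```

The same expansion also substitutes the default when the argument is present but empty, which is what the `:` in `:-` adds.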

If for nothing other than practice, let's turn counting commits into a function.

In [None]:
ncommits() { commits $1 | wc -l; }
ncommits

In [None]:
ncommitters() { git shortlog -ns --all | wc -l; }
ncommitters

To list the files in a tree, we can do `find . -type f | grep -vF /.git/`, which lists all the current files,
skipping Git's database, which is under the hidden `.git` directory.  (You can skip over the database with special arguments to `find`, but they're harder to type and remember than this simple pipeline.)
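
For comparison, here is a sketch of the `find`-only alternative, using `-prune` to stop `find` from descending into `.git` at all. The name `files_pruned` and the throwaway demo directory are assumptions made for illustration.

```shell
# Hypothetical variant: prune any directory named .git, print every other regular file.
files_pruned() { find . -name .git -prune -o -type f -print; }

# Demo in a scratch directory, so no real repository is touched.
demo=$(mktemp -d)
mkdir -p "$demo/.git" "$demo/src"
touch "$demo/.git/config" "$demo/README" "$demo/src/main.c"
(cd "$demo" && files_pruned | sort)   # lists README and src/main.c, nothing under .git
rm -rf "$demo"
```

The pipeline is shorter to type, which is why it's the one used here.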

Let's also make a shell function for that.

For this quick-and-dirty listing, we'll also ignore any files that someone's named with embedded blanks -- filenames such as "this name needed more thought." You won't see these very often, but we threw in a `grep -v ' '` filter to ignore them after we stumbled over a dozen while doing this notebook. Aaaargh.

In [None]:
files() { find . -type f | grep -vF /.git/ | grep -v ' '; }
files

Well, that was easy. Now let's count the number of files in each version.

In [None]:
for commit in $(commits); do git checkout --quiet $commit; files | wc -l; done

We could make this slightly prettier with a fancier function,

In [None]:
nfiles() { files | wc -l; }
nfiles

and a prettier loop.

In [None]:
for commit in $(commits); do git checkout --quiet $commit; echo $(nfiles) $commit; done

How many total lines in each of those versions? Time for another little function.

In [None]:
nlines() { files | xargs cat | wc -l; }  # concatenate every file and count the lines; works even with a single file
nlines

In [None]:
for commit in $(commits); do git checkout --quiet $commit; echo $(nlines) $commit; done

And, finally, how many authors?

In [None]:
nauthors() { git shortlog -ns ${1:-HEAD} | wc -l; }
nauthors

In [None]:
for commit in $(commits); do git checkout --quiet $commit; echo $(nauthors) $commit; done

A lot of bang for the buck here. Plenty of information without much work at all -- just a few, one-line shell functions.

Looping over every commit is a lot of typing, so let's make a function for that, too -- this time, a function that takes another function for its argument!
Also, after checking out the last version, we'll be in "detached HEAD" state, which is annoying.
For now, we'll finish up by going back to `master`.
 
Spreading the definition out over a few lines leaves something easier to read.

In [None]:
every_commit() { 
    local func=$1 
    for commit in $(commits); do      # for every commit, first to last
        git checkout --quiet $commit  # check out the commit
        echo $($func) $commit         # run the function, and report the result with the commit's SHA1
    done
    git checkout --quiet master       # get back to where you once belonged
}

every_commit nauthors
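
Bash has no function values; passing a function "as an argument" works because a function's name is just a word, and the shell runs it when it is expanded in command position. A minimal sketch of the same pattern without Git follows. Both `double` and `apply_to_each` are hypothetical helpers invented for this illustration.

```shell
# Hypothetical helpers, for illustration only.
double() { echo $(( $1 * 2 )); }

apply_to_each() {
    local func=$1; shift          # first argument: the name of a function
    local item
    for item in "$@"; do          # remaining arguments: things to apply it to
        echo "$($func "$item") $item"
    done
}

apply_to_each double 1 2 3        # prints the result, then the input, one pair per line
```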

The shell is a toolkit for quick, command-line exploratory data analysis (EDA), with few peers.

We could make each of these functions more flexible and robust, and could even improve their performance by re-writing them in a language like Python, but what we've written is enough to begin pawing through repos. We're making what marketing would call an MVP: a Minimum Viable Product.

Paraphrasing Tom Christiansen, "What's the difference in speed between a program in bash and a program in Java? About two weeks."

## Commits Over Time

One straightforward question to ask is how much time elapses between commits.
For this, we need the date of a commit.

We'll ask for it in seconds since the Unix epoch (1970/01/01 00:00:00 UTC).

In [None]:

commit_date() { git log -1 --format=%ct $1; }
commit_date HEAD

Why use this huge, unintuitive number? We'll often want to know elapsed time: the difference between two dates. The shell only does integer arithmetic, and seconds-since-the-epoch gives us integers that we can subtract from one another.
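
As a quick illustration of that arithmetic, here is a sketch with two made-up timestamps. The values and the `elapsed` helper are assumptions for the example, not data from any repository.

```shell
# Made-up timestamps, exactly one week apart.
earlier=1700000000
later=1700604800

elapsed() { echo $(( $2 - $1 )); }   # hypothetical helper: seconds from $1 to $2

elapsed $earlier $later   # 604800 seconds, i.e. 60*60*24*7
echo $(( 7 / 2 ))         # integer division truncates: prints 3
```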

Here's a function to report the difference in seconds between two commits. Notice, order matters!

In [None]:
seconds_between() { echo $(($(commit_date $2) - $(commit_date $1))); }
seconds_between HEAD~2 HEAD; seconds_between HEAD HEAD~2

And here's `relative_date`: the number of seconds since the first commit.


In [None]:
first_commit=$(commits | head -1)   # find this once, and save the SHA1 in a global
relative_date() { seconds_between $first_commit $1; }

In [None]:
relative_date HEAD~2; relative_date HEAD; seconds_between HEAD~2 HEAD

Now we can look at the commit dates for every commit in a repo with a single function call.

In [None]:
every_commit relative_date

For a big repo, this could be a tad slow and generate more data than we really want.
How much? Well, the Git repo, https://github.com/git/git, has over 70,000 commits, and the Linux repo, https://github.com/torvalds/linux, has over 1.2 million.

Let's filter the list of commits by only taking every *Nth* commit.

In [None]:
only_every() { awk "(NR-1)%$1 == 0"; }
for i in {0..10}; do echo $i; done | only_every 3


This lets us tweak `every_commit()`.

In [None]:
commits_by() { 
    local incr=1  # report every commit by default
    if [ $# -eq 2 ]; then incr=$1; shift; fi  # an optional first argument sets the increment
    local func=$1
    for commit in $(commits | only_every $incr); do  # only report every $incr commit 
        git checkout --quiet $commit
        echo $($func) $commit
    done
    git checkout --quiet master 
}

commits_by 10 relative_date

And if we want a table of the date of every 3rd commit?

In [None]:
dates_of_commit_numbers() { 
    local incr=$1 
    commits_by $incr relative_date |
        nl -v 0 -i $incr   # number the lines, starting at 0, by increments of $incr
}

dates_of_commit_numbers 3
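
To see what `nl -v 0 -i $incr` does on its own, here is a small sketch on made-up input. `number_by` is a hypothetical wrapper, not one of the notebook's functions.

```shell
# number_by is a hypothetical wrapper around the same nl options:
#   -v 0   start numbering at 0
#   -i N   add N to the number for each subsequent line
number_by() { nl -v 0 -i "$1"; }

printf 'first\nfourth\nseventh\n' | number_by 3   # lines are numbered 0, 3, 6
```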

Writing tiny functions gives us flexibility.

For example, suppose we don't want the SHA1s, and we want the second column first, so the columns are in the order *date commit-number*.  Oh, and instead of specifying the increment, let's specify the number of evenly-spaced data points that we want, and *calculate* the increment.

In [None]:
commit_number_by_date() {
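    local ncmts=$(ncommits)          # repo-specific
    local npoints=${1:-$ncmts}       # by default, do every commit
    local incr=$(( ncmts/npoints ))
    (( incr > 0 )) || incr=1         # asking for more points than commits means every commit
    local first_commit=$(commits | head -1)   # find this once per call; relative_date sees it through bash's dynamic scoping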
    commits_by $incr relative_date |
        nl -v 0 -i $incr |   # number the lines, starting at 0, by increments of $incr
        awk '{print $2, $1}'
}

commit_number_by_date 5

The shell can only do integer division, so the increment is rounded down and the number of points reported can differ from the number requested, often by one. Such is life. For quick-and-dirty exploration, we can settle for that.
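
To see how far the count can drift, here is a sketch that repeats the increment arithmetic without touching a repository. `points_reported` is a hypothetical helper, and it assumes the number of points requested is no larger than the number of commits.

```shell
# Hypothetical helper: how many points come back when npoints are requested
# from a history of ncmts commits (assumes npoints <= ncmts).
points_reported() {
    local ncmts=$1 npoints=$2
    local incr=$(( ncmts / npoints ))        # integer division rounds down
    echo $(( (ncmts + incr - 1) / incr ))    # commits 0, incr, 2*incr, ... below ncmts
}

points_reported 100 10   # divides evenly: 10
points_reported 9 4      # incr is 2, so 5 points
points_reported 9 5      # incr is 1, so all 9 commits
```

The drift is largest on tiny repositories; when the history is much longer than the number of points requested, the count stays close to the request.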

## A Real Repo: Git Itself.

The first Git repo was the repo for the Git source code.
It was created as soon as Git could host its own source code, three days after Linus announced the project, and has been kept there ever since. It has every committed version of Git, ever, and is the Pre-Cambrian shale of Git version control.

Let's take a peek.

In [None]:
popd  # get back out of toy
[ -d git ] || git clone https://github.com/git/git.git # clone Git's source-code repo if it's not already there
cd git

In [None]:
echo Git has had 
echo $(ncommits) commits
echo $(ncommitters) committers
echo Git has 
echo $(nfiles) files
echo $(nlines) lines

Let's try something complex, like `commit_number_by_date()`.

In [None]:
start=$SECONDS
commit_number_by_date 100 | tail
echo finding timestamps of 100 equally-spaced commits from git 
echo takes $(( SECONDS - start )) seconds

This is no speed demon, so performance tuning is a future goal, but it's usable. But what might we do with such data?
One obvious thing to try is to see how one varies with the other. Are commits fast and furious at first, slowing down as time goes on?
Do they start gradually, then pick up the pace?
Do they race faster and faster at first, but then plateau as Git matures?

A first step might be to do a least-squares fit of commit number against the timestamp.

The scales of the numbers aren't very comparable, so let's start by piping the output of `commit_number_by_date()` through a filter to scale it. 

Right now, the timestamps are in seconds since the first commit. We could turn seconds into weeks by dividing the timestamp by 60 * 60 * 24 * 7 (seconds/minute * minutes/hour * hours/day * days/week = seconds/week).

In [None]:
commit_number_by_date 10 | awk 'BEGIN {spw = 60*60*24*7} {print $1/spw, $2}'
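
As a sanity check on the conversion, the same filter can be run on made-up input where the answer is easy to verify by hand. The name `to_weeks` and the sample rows are assumptions for this sketch.

```shell
# to_weeks is a hypothetical name for the same awk filter.
to_weeks() { awk 'BEGIN {spw = 60*60*24*7} {print $1/spw, $2}'; }

# Made-up input: "seconds-since-first-commit commit-number"
printf '0 0\n604800 10\n1512000 20\n' | to_weeks   # weeks 0, 1, and 2.5
```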

Git now has about 1000 weeks of history, so we could get a weekly history with `commit_number_by_date 1000`,
but we still need to do curve fitting and measure goodness-of-fit to whatever curve we fit.

This is not a job for the shell.

## Fitting Curves to Data

One easy option for exploring curve-fitting is spreadsheets. 
It's easy to import columnar data into Google Sheets, graph it, then use various models to get best-fits of curves like polynomials,
logs and exponentials to some or all of your data to see what models you like best.

For the moment, let's do something simpler: find the least-squares fit of a line to our data, and use some standard goodness-of-fit measure to see how wildly the commit data varies from that straight line.
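
For reference, the textbook formulas for a least-squares line $y = mx + b$ through $n$ points $(x_i, y_i)$, with $R^2$ as one common goodness-of-fit measure, are given here. Whether `lsfit.py` reports exactly these quantities is an assumption, since the program itself isn't shown.

$$
m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - m\sum x_i}{n},
\qquad
R^2 = 1 - \frac{\sum \left(y_i - (m x_i + b)\right)^2}{\sum \left(y_i - \bar{y}\right)^2}
$$

In our case $x_i$ would be the time in weeks and $y_i$ the commit number; an $R^2$ near 1 means the points lie close to the line.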

Even this is not a job for bash. Python seems like a reasonable choice, but we don't want this notebook to turn into a course on Python,
so we'll just ask ChatGPT to write a program for us.  It offers up the program `lsfit.py`, which we include in our repository.

Let's try it.

In [None]:
commit_number_by_date 1000 | awk 'BEGIN {spw = 60*60*24*7} {print $1/spw, $2}' | ../lsfit.py

Goodness!  The fit to a straight line is nearly perfect.

Think for a second about what this means:

1. Since its inception, the commit rate to the Git repository hasn't varied.
1. Predicting how many commits there will be a week from now or calculating how many there were a year ago is simple arithmetic.
1. Because commit number and date are linearly related, we can use the commit number as a time stamp.
To see how some other quantity, like the number of committers or the number of files, varies over time, we can just ask how it varies with commit number.
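
The "simple arithmetic" in the second point can be sketched in the shell itself. The slope and intercept here are placeholders chosen for round numbers, not output from `lsfit.py`, and `commits_at_week` is a hypothetical helper.

```shell
# Hypothetical fit: slope in commits per week and intercept in commits.
# These numbers are placeholders, not results from any real fit.
slope=70
intercept=0

# Predicted commit number at a given week, from y = slope * week + intercept.
commits_at_week() { echo $(( slope * $1 + intercept )); }

commits_at_week 1000                                             # commit number at week 1000
echo $(( $(commits_at_week 1001) - $(commits_at_week 1000) ))    # commits added in one week: the slope
```

With a real fit, the slope and intercept would usually be fractional, so `awk` would be a better calculator than the shell's integer arithmetic.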