# How to retrieve a file in git by its object hash

In certain circumstances, you may want to retrieve a file in [git](https://en.wikipedia.org/wiki/Git) by some short identifier.

For example, you might have a file that is needed to reproduce a computation, say for debugging or regression testing. Or, for compliance, you might want to record and easily retrieve which version of a contract template a user saw, with a version field that changes only when the _contents_ of the HTML file change (and not whenever unrelated files change, as a commit hash would).

One simple way to achieve this is to use the git _object hash_. This is a hash derived from the file's contents, and it is the key under which git stores the file.

In [1]:
# hide
import os
import tempfile
dir = tempfile.mkdtemp()
os.chdir(dir)
!pwd

/private/var/folders/0x/v2lbxd814bv46ngy5zhtsf700000gn/T/tmpc9qdxxjl


Now let's create an empty git repository and commit two files: `foo` with contents `foo\n`, and `bar` with contents `bar\n`:

In [2]:
!git init

Initialized empty Git repository in /private/var/folders/0x/v2lbxd814bv46ngy5zhtsf700000gn/T/tmpc9qdxxjl/.git/


In [3]:
!echo foo > foo
!echo bar > bar
!git add .
!git commit -m 'init'

[master (root-commit) 465eb8f] init
 2 files changed, 2 insertions(+)
 create mode 100644 bar
 create mode 100644 foo


How are these files stored internally in git? Each file is content-addressable by its object hash. We can compute the object hash using the [`git-hash-object` command](https://git-scm.com/docs/git-hash-object):

In [4]:
!cat foo | git hash-object -w --stdin

257cc5642cb1a054f08cc83f2d943e56fd3ebe99


Given only that hash, we can then retrieve the contents with a single `git cat-file` command:

In [5]:
!git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99

foo


The object hash is the SHA-1 hash of a particular byte string constructed from the file's contents: `blob `, followed by the content length in bytes as a decimal integer, followed by a zero byte, followed by the file contents.

This can be computed in Python as follows:

In [6]:
from hashlib import sha1
from typing import Union
def git_hash_object(s: Union[str, bytes]) -> str:
    if isinstance(s, str):
        b = s.encode('utf-8')
    else:
        b = s
    return sha1(b'blob %d\0%s' % (len(b), b)).hexdigest()

git_hash_object('foo\n')

'257cc5642cb1a054f08cc83f2d943e56fd3ebe99'
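
For large files, the same hash can be computed without reading the whole file into memory, since the header only needs the byte length up front. The following is a minimal sketch; `git_hash_file` and its `chunk_size` parameter are illustrative names, not part of git or of the code above, and it assumes the file is not modified while it is being read.

```python
import os
from hashlib import sha1


def git_hash_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the git blob hash of a file on disk, reading it in chunks.

    Hypothetical helper for illustration. It hashes the raw bytes of the
    file, so it agrees with git only when git applies no content filters
    (such as line-ending conversion) to that path.
    """
    size = os.path.getsize(path)
    h = sha1(b'blob %d\0' % size)  # header: "blob <byte length>\0"
    with open(path, 'rb') as f:
        # Feed the file contents into the hash one chunk at a time.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

Called on the `foo` file created earlier, `git_hash_file('foo')` should give the same hash as `git_hash_object('foo\n')`.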

The loading code can simply call [`git-cat-file`](https://git-scm.com/docs/git-cat-file) as a subprocess, or use a [Python implementation](https://gist.github.com/leonidessaguisagjr/594cd8fbbc9b18a1dde5084d981b8028).

In [7]:
import subprocess
def load_git_object_hash(object_hash: str) -> bytes:
    result = subprocess.run(["git", "cat-file", "-p", object_hash], capture_output=True)
    result.check_returncode()
    return result.stdout

load_git_object_hash(git_hash_object('foo\n'))

b'foo\n'
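
As a rough illustration of what a pure-Python loader involves: a loose object lives at `.git/objects/<first two hex digits>/<remaining 38>` and holds the zlib-compressed form of the `blob <length>\0<contents>` string described above. The sketch below reads such a file. `read_loose_blob` and its `git_dir` parameter are hypothetical names, and the function does not handle objects that git has moved into packfiles (for example after `git gc`), so calling `git cat-file` remains the more robust option.

```python
import os
import zlib


def read_loose_blob(object_hash: str, git_dir: str = '.git') -> bytes:
    """Read a blob stored as a loose object (hypothetical helper).

    Assumes a full 40-character SHA-1 hash and an object that has not
    been packed; raises FileNotFoundError otherwise.
    """
    path = os.path.join(git_dir, 'objects', object_hash[:2], object_hash[2:])
    with open(path, 'rb') as f:
        raw = zlib.decompress(f.read())
    # Split "blob <size>\0<content>" into its header and payload.
    header, _, content = raw.partition(b'\0')
    kind, _, size = header.partition(b' ')
    if kind != b'blob' or int(size) != len(content):
        raise ValueError('not a well-formed blob: %r' % header)
    return content
```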

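Now, in our contract template example, we might have some code like the following. Note that `git_hash_object` only computes the hash and stores nothing, so this assumes `contract-template.html` is committed to the repository: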

In [8]:
def agree_to_contract(user_id):
    # Hash the raw bytes: text mode may translate newlines and change the hash.
    with open('contract-template.html', 'rb') as f:
        template_bytes = f.read()
    terms = get_specific_terms_for_users(user_id)
    record_contract(user_id, git_hash_object(template_bytes), terms)
    return render(template_bytes.decode('utf-8'), terms)

In [9]:
def display_exact_contract(user_id):
    object_hash, terms = get_contract_data(user_id)
    template = load_git_object_hash(object_hash).decode()
    return render(template, terms)

This allows us to easily group templates by version, retrieve old versions, and avoid wasting space by storing the same template version again and again. You could apply the same principle to lockfiles, configuration files, Dockerfiles, self-contained algorithms, etc.
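
One caveat is that `load_git_object_hash` can only find objects that exist in the repository where it runs. If a file might be recorded before it has been committed, one option is to let git compute the hash and store the blob in one step with `git hash-object -w --stdin`, the same command used earlier. This is a sketch under that assumption; `store_git_object` is a hypothetical helper name. Objects written this way are not referenced by any commit, so git's garbage collection may eventually prune them unless the file is also committed.

```python
import subprocess


def store_git_object(content: bytes) -> str:
    """Write content into the git object database and return its hash.

    Hypothetical helper; it must be run from inside a git repository.
    """
    result = subprocess.run(
        ["git", "hash-object", "-w", "--stdin"],
        input=content,
        capture_output=True,
        check=True,  # raise CalledProcessError if git fails
    )
    return result.stdout.decode().strip()
```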

For auditing or debugging, this approach has some advantages over recording the commit hash: the hash alone identifies the file, so you can retrieve it with a single lookup, without also tracking its path and without checking out an old commit or writing anything to disk. It has the disadvantage that you need to record every file of interest to completely reproduce the relevant computation. In any case, it's a useful technique in some circumstances, and hopefully it helps you learn a bit about how git operates.
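
The two approaches can also be bridged. If you have already recorded commit hashes, git's `<rev>:<path>` syntax lets `git rev-parse` report the object hash of a file as of a given commit. The helper below is a minimal sketch; `blob_hash_at_commit` is a hypothetical name, and it assumes it is run from inside the repository.

```python
import subprocess


def blob_hash_at_commit(commit: str, path: str) -> str:
    """Return the object hash of `path` as of `commit` (hypothetical helper)."""
    result = subprocess.run(
        ["git", "rev-parse", f"{commit}:{path}"],
        capture_output=True,
        check=True,  # raise CalledProcessError if the commit or path is unknown
    )
    return result.stdout.decode().strip()
```

In the repository created above, `blob_hash_at_commit('HEAD', 'foo')` should return the same hash that `git hash-object` printed for `foo`.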

In [10]:
# hide
import shutil
shutil.rmtree(dir)