
# Version control with Git  <span class="tocSkip"></span> <a name="chap:git"></a>

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#What-is-Version-Control?" data-toc-modified-id="What-is-Version-Control?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>What is Version Control?</a></span></li><li><span><a href="#Why-Version-Control?" data-toc-modified-id="Why-Version-Control?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Why Version Control?</a></span></li><li><span><a href="#git" data-toc-modified-id="git-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>git</a></span><ul class="toc-item"><li><span><a href="#git-workflow" data-toc-modified-id="git-workflow-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span><code>git</code> workflow</a></span></li><li><span><a href="#Basic-git-commands" data-toc-modified-id="Basic-git-commands-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Basic git commands</a></span></li></ul></li><li><span><a href="#Your-first-repository" data-toc-modified-id="Your-first-repository-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Your first repository</a></span></li><li><span><a href="#Ignoring-Files" data-toc-modified-id="Ignoring-Files-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Ignoring Files</a></span><ul class="toc-item"><li><span><a href="#Dealing-with-binary-files" data-toc-modified-id="Dealing-with-binary-files-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Dealing with binary files</a></span></li><li><span><a href="#Dealing-with-large-files" data-toc-modified-id="Dealing-with-large-files-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Dealing with large files</a></span></li></ul></li><li><span><a href="#Removing-files" data-toc-modified-id="Removing-files-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Removing files</a></span><ul class="toc-item"><li><span><a href="#Un-tracking-files" data-toc-modified-id="Un-tracking-files-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Un-tracking files</a></span></li></ul></li><li><span><a 
href="#Accessing-history-of-the-repository" data-toc-modified-id="Accessing-history-of-the-repository-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Accessing history of the repository</a></span></li><li><span><a href="#Reverting-to-a-previous-version" data-toc-modified-id="Reverting-to-a-previous-version-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Reverting to a previous version</a></span></li><li><span><a href="#Branching" data-toc-modified-id="Branching-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Branching</a></span></li><li><span><a href="#Running-git-commands-on-a-different-directory" data-toc-modified-id="Running-git-commands-on-a-different-directory-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Running git commands on a different directory</a></span><ul class="toc-item"><li><span><a href="#Running-git-commands-on-multiple-repositories-at-once" data-toc-modified-id="Running-git-commands-on-multiple-repositories-at-once-10.1"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>Running git commands on multiple repositories at once</a></span></li></ul></li><li><span><a href="#Using-git-through-a-GUI" data-toc-modified-id="Using-git-through-a-GUI-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Using git through a GUI</a></span><ul class="toc-item"><li><span><a href="#Practicals" data-toc-modified-id="Practicals-11.1"><span class="toc-item-num">11.1&nbsp;&nbsp;</span>Practicals</a></span></li></ul></li><li><span><a href="#Readings-&amp;-Resources" data-toc-modified-id="Readings-&amp;-Resources-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Readings &amp; Resources</a></span></li></ul></div>

## What is Version Control?

Version control, also known as revision control or source control, is the management and tracking of changes to computer code and certain other types of data in an automated way.

Any project (collections of files in directories) under version control has changes and additions/deletions to its files and directories recorded and archived over time so that you can recall specific versions later. 

Version control is in fact the technology embedded in the versioning of various word processor and spreadsheet applications (e.g., Google Docs, Overleaf).

<figure>
<img src="./graphics/VC.png" alt="Version control" style="width:40%">
<small> <center> <figcaption> An overview of how Version control works.</figcaption></center></small>
</figure>

## Why Version Control?

With version control of biological computing projects, you can:

* record all changes made to a set of files and directories, including text (usually ASCII) data files, so that you can access any previous version of the files

* "roll back" data, code, and documents that are in plain text format (other file formats can also be versioned; see the [section on binary files](#Dealing-with-binary-files) below)

* collaborate more easily with others on developing new code or writing documents

* branch (and merge) projects

* back up your project (but git is not backup software; see the [sections on binary and large files](#Dealing-with-binary-files) below)

<figure>
<img src="./graphics/cvs.png" alt="Version control" style="width:35%">
<small> <center>Source: [maktoons.blogspot.com/2009/06/if-dont-use-version-control-system.html](maktoons.blogspot.com/2009/06/if-dont-use-version-control-system.html) <figcaption> Don't do this.</figcaption></center></small>
</figure>

Or here's [another one](http://www.phdcomics.com/comics/archive/phd101212s.gif)

## git

We will use git, developed by Linus Torvalds, the "Linu" in Linux. This is currently the most popular tool for version control. 

In git, each user stores a complete local copy of the project, including the history and all versions. So you do not rely as much on a centralized (remote) server. First, install and configure `git`:

In [None]:
sudo apt-get install git
git config --global user.name "Your Name"
git config --global user.email "your.login@imperial.ac.uk"
git config --list

### `git` workflow

Here is a graphical outline of the git workflow and command structure: 

![image](./graphics/git.png)

Note that you only need an internet connection when you `push` or `fetch`, because before that you are only archiving in a local repository (which sits in a hidden `.git` directory within your project).
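
The local part of this workflow can be tried out without any remote server. Here is a minimal sketch, assuming a throwaway directory created with `mktemp` and a hypothetical file called `notes.txt` (neither is part of the coursework setup); it follows one file from the working directory, through the staging area, into the local repository:

```bash
# Scratch repository with hypothetical file names, so real coursework is untouched
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "first draft" > notes.txt
git status --short      # "?? notes.txt": untracked, exists only in the working directory

git add notes.txt
git status --short      # "A  notes.txt": staged, waiting to be committed

git commit -q -m "Add notes file"
git status --short      # no output: nothing left to stage or commit

echo "second thoughts" >> notes.txt
git diff                # working directory vs staging area
git add notes.txt
git diff --staged       # staging area vs last commit
git commit -q -m "Extend notes"

git log --oneline       # both commits are stored locally in .git
```

None of these commands contacts a server; everything is recorded inside the `.git` directory of the scratch repository.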

### Basic git commands

Here are some basic git commands:

| Command | What it does |
| :------------- |:-------------| 
|`git init`|           Initialize a new repository|
|`git clone`|          Download a repository from a remote server|
|`git status`|         Show the current status|
|`git diff`|           Show differences between commits, or between your working files and the last commit|
|`git blame`|          Show who last changed each line of a file (blame somebody for the changes!)|
|`git log`|            Show commit history|
|`git commit`|         Commit changes to current branch|
|`git branch`|         Show branches|
|`git branch name`|    Create new branch|
|`git checkout name`|  Switch to a different commit/branch called `name`|
|`git pull`|           Download (fetch and merge) changes from remote repository|
|`git push`|           Send changes to remote repository|


## Your first repository

Time to bring your computing coursework directory under version control. For example, for CMEE Masters students: 

In [10]:
cd CMEECourseWork

In [11]:
git init

Initialised empty Git repository in /home/mhasoba/Documents/CMEECourseWork/.git/


In [12]:
echo "My CMEE Coursework Repository" > README.txt

In [13]:
git config --list

user.email=mhasoba@gmail.com
user.name=Samraat Pawar
push.default=simple
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true


In [14]:
ls -al

total 20
drwxrwxr-x  4 mhasoba mhasoba 4096 Oct  3 10:18 .
drwxr-xr-x 33 mhasoba mhasoba 4096 Sep 30 09:23 ..
drwxrwxr-x  7 mhasoba mhasoba 4096 Oct  3 10:18 .git
-rw-rw-r--  1 mhasoba mhasoba   30 Oct  3 10:18 README.txt
drwxrwxr-x  5 mhasoba mhasoba 4096 Oct  2 15:07 Week1


In [15]:
git add README.txt #Staging for commit

In [16]:
git status

On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   README.txt

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	Week1/



In [17]:
git commit -m "Added README file." #you can use -am too

[master (root-commit) 95f7d0b] Added README file.
 1 file changed, 1 insertion(+)
 create mode 100644 README.txt


In [18]:
git status #what does it say now?

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	Week1/

nothing added to commit but untracked files present (use "git add" to track)


In [19]:
git add -A

In [21]:
git status

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	new file:   Week1/ListRootDir.txt
	new file:   Week1/Sandbox/testfile1.txt
	new file:   Week1/Sandbox/testfile2.txt
	new file:   Week1/test.txt



Next you can commit the rest of these files inside `Week1` with another message, just like you did for `README.txt` above.

---

> **Make meaningful Comments**: Please don't neglect to make each commit message meaningful. And use this mantra: "commit often, comment always". 
<figure>
<img src="./graphics/git_xkcd.png" alt="Git mantra" style="width:45%">
<small> <center>Source: [https://m.xkcd.com/1296/](https://m.xkcd.com/1296/) <figcaption> Don't succumb to this.</figcaption></center></small>
</figure>
> Here are some tips for good practices: https://chris.beams.io/posts/git-commit/

---
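
One way to make a commit message more informative is to give it a short subject line plus a longer body. The sketch below assumes a scratch repository and made-up message text; it relies on the fact that `git commit` accepts several `-m` options, each of which becomes its own paragraph:

```bash
# Scratch repository; the message text is only an illustration
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "My CMEE Coursework Repository" > README.txt
git add README.txt

# The first -m is the subject line; each further -m adds a paragraph to the body
git commit -q -m "Add README describing the repository" \
              -m "Explain what the coursework directory contains so that an assessor can find their way around."

git log -1 --format=%s      # prints the subject line only
git log -1 --format=%b      # prints the body only
```

A short subject keeps `git log --oneline` readable, while the body has room for the reasoning behind the change.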

Nothing has been sent to the remote server yet (see the [`git` workflow section above](#git-workflow) for how git uses the remote server).

So let's go to your git service (e.g., bitbucket or github) and set up an account. Note that bitbucket and github both give you unlimited free private repositories if you register with an academic email. This is not a big deal if you will not be writing private code, but handy if you are (*can you think of examples when you would need to write private code?*).

So let's proceed with connecting your local git repository to your remote server:  

* Login to your bitbucket or github account

* Set up your `ssh`-based access. SSH (`S`ecure `Sh`ell) is a protocol that allows you to connect to and interact with remote servers securely.

Here are two sets of guidelines (you can use whichever seems easier to you, irrespective of whether you are using github or bitbucket):

[bitbucket](https://confluence.atlassian.com/bitbucket/set-up-ssh-for-git-728138079.html) (most of you will want the "Set up SSH on macOS/Linux" option)

[github](https://help.github.com/articles/connecting-to-github-with-ssh)

* Next, create a new repository on your remote service with the same name as your local project (e.g., `CMEECourseWork`), and push your new project to this newly created remote git repository.

Instructions for this step are here:

[bitbucket](https://confluence.atlassian.com/bitbucket/set-up-a-repository-877174034.html) (choose the "I have existing files on my local system." option)

[github](https://help.github.com/articles/adding-an-existing-project-to-github-using-the-command-line)

Note that you have already done the `git init` step, so no need to repeat those bits.  

You are done. Now you can really start to use git!

The first step after having created your remote (say, github or bitbucket) repository and added your ssh key to it, is to link the remote to your local repo (as the instructions in web pages linked above will already have told you):  

In [30]:
git remote add origin git@github.com:mhasoba/CMEECourseWork.git

In [31]:
git remote -v

origin	git@github.com:mhasoba/CMEECourseWork.git (fetch)
origin	git@github.com:mhasoba/CMEECourseWork.git (push)


Now you can `git push` all your local commits:

In [32]:
git push origin master

Counting objects: 3, done.
Writing objects: 100% (3/3), 253 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
remote: 
remote: Create a pull request for 'master' on GitHub by visiting:
remote:      https://github.com/mhasoba/CMEECourseWork/pull/new/master
remote: 
To git@github.com:mhasoba/CMEECourseWork.git
 * [new branch]      master -> master


This pushes the (committed) changes in your local repository up to the remote repository you specified as the `origin`. Note that `master` refers to the branch (you currently only have one). More on branching below.
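
To practise pushing without touching a real account, a local bare repository can stand in for the remote server. This is only a sketch under that assumption: the `remote.git` directory and the scratch paths are hypothetical, and the branch name is read from git rather than assumed to be `master`:

```bash
scratch=$(mktemp -d) && cd "$scratch"
git init -q --bare remote.git               # hypothetical stand-in for the remote server
git init -q CMEECourseWork && cd CMEECourseWork
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "My CMEE Coursework Repository" > README.txt
git add README.txt && git commit -q -m "Added README file."

branch=$(git symbolic-ref --short HEAD)     # master or main, depending on your git setup
git remote add origin "$scratch/remote.git"
git push -q -u origin "$branch"             # -u remembers origin as this branch's upstream

git status --short --branch                 # shows the local branch tracking origin
git log --oneline "origin/$branch"          # the remote now has the same commit
```

With a real service the only difference is the URL given to `git remote add`.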

You can rename your remote `origin` to a more meaningful name (e.g., `github_CMEECourseWork`) using the `git remote rename` command. [See this](https://help.github.com/articles/renaming-a-remote).
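
A minimal sketch of the renaming step, run in a scratch repository; the URL uses a hypothetical `username`, and neither adding nor renaming a remote contacts the server:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q

# Hypothetical remote URL, used only to have something to rename
git remote add origin git@github.com:username/CMEECourseWork.git
git remote rename origin github_CMEECourseWork
git remote -v               # both the fetch and push lines now show the new name
```

After renaming, use the new name wherever you previously typed `origin`, for example `git push github_CMEECourseWork master`.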

## Ignoring Files

You will have some files you don't want to track (log files, temporary files, executables, etc). You can ignore entire classes of files with `.gitignore`. 

$\star$ Let's try it (make sure you are in your coursework directory, e.g., `CMEECourseWork`!):

In [22]:
echo -e "*~ \n*.tmp" > .gitignore

In [23]:
cat .gitignore

*~ 
*.tmp


In [24]:
git add .gitignore

In [25]:
touch temporary.tmp

In [26]:
git add *

The following paths are ignored by one of your .gitignore files:
temporary.tmp
Use -f if you really want to add them.



In [27]:
git add -A

In [28]:
git status

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	new file:   .gitignore
	new file:   Week1/ListRootDir.txt
	new file:   Week1/Sandbox/testfile1.txt
	new file:   Week1/Sandbox/testfile2.txt
	new file:   Week1/test.txt



You can also create a global gitignore file that lists rules for files to be ignored [in every Git repository on your computer](https://help.github.com/articles/ignoring-files). 
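
Whether a rule comes from a repository's own `.gitignore` or from a global one, `git check-ignore -v` reports which pattern is responsible for a file being ignored. A small sketch, assuming a scratch repository and the hypothetical files `temporary.tmp` and `analysis.R`:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
printf '*~\n*.tmp\n' > .gitignore           # the same two rules as above
touch temporary.tmp analysis.R              # hypothetical files

git check-ignore -v temporary.tmp           # prints the file, line number and pattern that matched
git check-ignore -v analysis.R || echo "analysis.R is not ignored"
git status --short --ignored                # ignored files are listed with "!!"
```

This is handy when a file unexpectedly refuses to be added, or when a file you meant to ignore keeps showing up.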

---
> **Standard templates for .gitignore**: You can find standard `.gitignore` templates online. For example, Google ".gitignore templates".

### Dealing with binary files

A binary file is computer-readable but not human-readable; that is, it cannot be read by opening it in a text viewer. Examples of binary files include compiled executables, zip files, images, Word documents and videos. In contrast, text files are stored in a form (usually ASCII) that is human-readable when opened in a text editor (e.g., gedit). Without some git extensions and configurations (coming up next), binary files cannot be properly version-controlled because each version of the entire file is saved *as is* in a hidden directory in the repository (`.git`).

However, with some more effort, git can be made to work for binary formats like `*.docx` or image formats such as `*.jpeg`, but it is harder to compare versions; have a look at [this](https://git-scm.com/docs/gitattributes) and [this](https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes)<sup>[1](#git:word)</sup>, and also, [this](https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7)
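
As a first taste of the attributes mechanism described in those links, the sketch below marks all PNG files as binary in a `.gitattributes` file and asks git how it will treat them. The file names are hypothetical, and this is only one of many possible attribute settings:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q

# Hypothetical rule: treat every PNG as binary
# (no text diff, no text merge, no line-ending conversion)
echo '*.png binary' > .gitattributes

git check-attr binary -- figure.png         # figure.png: binary: set
git check-attr diff -- figure.png           # figure.png: diff: unset
git check-attr diff -- script.R             # script.R: diff: unspecified
```

`git check-attr` works on path names, so the files do not need to exist yet for you to test a rule.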

### Dealing with large files

Git was designed for version control of workflows and software projects, *not* large files (say, &gt;100 MB), whether plain-text or binary. Binary files are particularly problematic because each version of the file is saved *as is* in `.git`. If you have a large number of versions, there are the same number of copies of the binary file in the hidden directory (for example, 100 $\times$ &gt;100 MB files!).

*So please do not keep large files (especially binary files) under version control*<sup>[2](#git:largefiles)</sup>. For example, if you are doing GIS work, you may have to handle large raster image files. Do not bring such files under version control. We suggest that you include files larger than some size in your `.gitignore`. For example, you can use a bash command that finds all files above a size threshold such as `100M` and appends them to `.gitignore`.
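
A minimal sketch of such a command, assuming GNU `find` and `sed`, and demonstrated in a scratch repository with a hypothetical 101M file so that it is safe to run. File names containing unusual characters (spaces at the end, `*`, `[`) would need extra care:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
truncate -s 101M big_raster.tif     # hypothetical large file (usually sparse, so it uses little real disk space)
echo "small" > small.txt

# List regular files larger than 100M, skipping git's own .git directory,
# and append them to .gitignore. sed turns "./path" into "/path" so that
# each pattern is anchored at the top of the repository.
find . -type f -size +100M -not -path './.git/*' | sed 's|^\./|/|' >> .gitignore

cat .gitignore                      # /big_raster.tif
git status --short                  # the large file is no longer listed as untracked
```

In a real project only the `find ... >> .gitignore` line is needed, run from the top directory of the repository.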

Here `100M` means 100 MB; you can change it to whatever threshold you want.

*Then what about code that needs large files?* For this, the best approach is to write code that scales up with data size. If it works on a 1 MB file, it should also work on a 1000 MB file! If you have written such code, then you can include a smaller file as an MWE (minimal working example).

*And how do you back up your large data files?* Remember, version control software like git is not meant for backing up data. The solution is to back up separately, either to an external hard drive or a cloud service. `rsync` is a great Linux utility for making such backups. Google it!

You may also explore alternatives such as `git-annex` (e.g., see <https://git-annex.branchable.com/>), and `git-lfs` (e.g., see <https://www.atlassian.com/git/tutorials/git-lfs>).

## Removing files

To remove a file (i.e., delete it and stop version controlling it) use `git rm`:

In [None]:
echo "Text in a file to remove" > FileToRem.txt

git add FileToRem.txt

git commit -am "added a new file that we'll remove later"

git rm FileToRem.txt

git commit -am "removed the file"

I typically just make all my changes and then just use `git add -A` for the whole directory (and its subdirectories; `-A` is recursive).
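
The sketch below illustrates this habit under assumed, hypothetical file names: a file is deleted with plain `rm` rather than `git rm`, a new file is created, and a single `git add -A` stages both changes:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "keep" > keep.txt; echo "drop" > drop.txt
git add -A && git commit -q -m "Add two files"

rm drop.txt                 # plain rm: git notices, but the deletion is not staged yet
echo "new" > new.txt
git status --short          # " D drop.txt" and "?? new.txt"

git add -A                  # stages the deletion and the new file in one go
git status --short          # "D  drop.txt" and "A  new.txt"
git commit -q -m "Remove drop.txt and add new.txt"
git ls-files                # keep.txt and new.txt remain tracked
```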

### Un-tracking files

`.gitignore` will prevent untracked files from being added to the set of files tracked by git. However, git will continue to track any files that are already being tracked. To stop tracking a file you need to remove it from the index. This can be achieved with this command.

```bash
git rm --cached <file>
```
The removal of the file from the head revision will happen on the next commit.
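
Here is a sketch of the whole sequence in a scratch repository, assuming a hypothetical `output.log` that was committed by mistake before an ignore rule existed:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "results" > output.log
git add output.log && git commit -q -m "Add log file by mistake"
echo "*.log" > .gitignore
git add .gitignore && git commit -q -m "Ignore log files"

echo "more results" >> output.log
git status --short              # " M output.log": still tracked despite the ignore rule

git rm --cached -q output.log   # remove from the index only, not from disk
git commit -q -m "Stop tracking output.log"

ls output.log                   # the file itself is still there
git status --short              # no output: the file is now untracked and ignored
```

Note that earlier versions of the file remain in the repository history; only future changes stop being recorded.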

## Accessing history of the repository

To see particular changes introduced, read the repo’s log:

In [None]:
git log

For a more detailed version, add `-p` at the end.
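
A few variations on `git log` that are often useful, sketched in a scratch repository with a hypothetical `notes.txt`:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "one" > notes.txt && git add notes.txt && git commit -q -m "Add notes"
echo "two" >> notes.txt && git commit -q -am "Extend notes"

git log --oneline               # one line per commit: short hash and subject
git log -p -1                   # full patch, for the most recent commit only
git log --stat -- notes.txt     # summary of changes, for commits touching one file
```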

## Reverting to a previous version

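If things go horribly wrong with new changes that you have not yet committed, you can revert to the
previous, “pristine” state (that is, the last commit):

In [None]:
git reset --hard #discards ALL uncommitted changes to tracked files
git status #check that tracked files are back at the last commit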

If instead you want to move back in time (temporarily), first find the
“hash” for the commit you want to revert to, and then check-out:

In [None]:
git status

In [None]:
git log

Then, you can check out that commit:

```bash
git checkout c79782
```
(`c79782` is an example).

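Now you can play around. However, you are now in a "detached HEAD" state: if you commit changes here, they do not belong to any branch, so create a new branch (e.g., `git checkout -b name`) if you want to keep them (git plays safe!). To go back to the future, type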

```bash
git checkout master
```
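
The round trip can be rehearsed safely in a scratch repository. In this sketch the hash is looked up with `git rev-parse` instead of being copied from the log, and the branch name is stored in a variable because it may be `master` or `main` depending on the git setup; the file name is hypothetical:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "version 1" > notes.txt && git add notes.txt && git commit -q -m "Version 1"
echo "version 2" > notes.txt && git commit -q -am "Version 2"

branch=$(git symbolic-ref --short HEAD)     # remember the current branch name
old=$(git rev-parse --short HEAD~1)         # short hash of the previous commit

git checkout -q "$old"                      # move back in time (detached HEAD)
cat notes.txt                               # version 1
git checkout -q "$branch"                   # back to the future
cat notes.txt                               # version 2
```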

## Branching

Imagine you want to try something out, but you are not sure it will work well. For example, say you want to rewrite the Introduction of your paper, using a different angle, or you want to see whether switching to a library for a piece of code improves speed. What you then need is branching, which creates a project copy in which you can experiment:

In [None]:
git branch anexperiment

In [None]:
git branch

In [None]:
git checkout anexperiment 

In [None]:
git branch

In [None]:
echo "Do I like this better?" >> README.txt

In [None]:
git commit -am "Testing experimental branch"

If you decide to merge the new branch after modifying it:

In [None]:
git checkout master

In [None]:
git merge anexperiment

In [None]:
cat README.txt 

If there are no conflicts (conflicts arise when some files that you changed also
changed in the master in the meantime), you are done, and you can delete
the branch:

In [None]:
git branch -d anexperiment

If instead you are not satisfied with the result, and you want to
abandon the branch:

In [None]:
git branch -D anexperiment

When you want to test something out, always branch! Reverting changes, especially in code, is typically painful. Merging can be tricky, especially if multiple people have simultaneously worked on a particular document. In the worst-case scenario, you may want to delete the local copy and re-clone the remote repository.
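
To see what a conflict looks like before meeting one in real work, the following sketch deliberately creates one in a scratch repository and then backs out with `git merge --abort`. The file contents are made up, and the branch name is read from git rather than assumed:

```bash
scratch=$(mktemp -d) && cd "$scratch" && git init -q
git config user.name "Example User"         # identity for this repository only
git config user.email "user@example.com"

echo "Original line" > README.txt && git add README.txt && git commit -q -m "Add README"
main_branch=$(git symbolic-ref --short HEAD)    # master or main

git checkout -q -b anexperiment
echo "Experimental line" > README.txt && git commit -q -am "Rewrite on branch"

git checkout -q "$main_branch"
echo "Mainline line" > README.txt && git commit -q -am "Rewrite on main branch"

# Both branches changed the same line, so the merge stops with a conflict
if ! git merge anexperiment; then
    git status --short      # "UU README.txt": unmerged, both sides modified
    git merge --abort       # give up and restore the pre-merge state
fi
cat README.txt              # Mainline line
```

Instead of aborting, you can also edit the conflicted file by hand, `git add` it, and commit to complete the merge.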

---
<figure>
<img src="./graphics/git_xkcd_1.png" alt="xkcd on workflow" style="width:25%">
<small> <center>Source: [https://xkcd.com/1597/](https://xkcd.com/1597/) <figcaption> Try not to do this, but we all will have, at some point!
</figcaption></center></small>
</figure>


## Running git commands on a different directory

Since `git` version 1.8.5, you can run git directly on a different directory than the current one using absolute or relative paths. For example, using a relative path, you can do:

```bash 
git -C ../SomeDir/ status
```
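
The same idea with an absolute path, sketched against a hypothetical repository created in a temporary directory:

```bash
scratch=$(mktemp -d)
git init -q "$scratch/SomeDir"              # hypothetical repository elsewhere on disk
touch "$scratch/SomeDir/notes.txt"

# Neither command changes the current directory
git -C "$scratch/SomeDir" status --short    # "?? notes.txt"
git -C "$scratch/SomeDir" rev-parse --show-toplevel
```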

### Running git commands on multiple repositories at once

For git pulling in multiple subdirectories (each a separate repository), here is an example:

In [None]:
find . -mindepth 1 -maxdepth 1 -type d -print -exec git -C {} pull \;

Breaking down this command piece by piece:

`find .` searches the current directory

`-mindepth 1` sets the minimum search depth to one sub-directory (so the current directory itself is skipped)

`-maxdepth 1` sets the maximum search depth to one sub-directory

`-type d` finds directories, not files

`-print` prints the name of each directory found

`-exec git -C {} pull \;` runs a git command (here, `pull`) in every directory found
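
The `find` one-liner runs git in every sub-directory, whether or not it is a repository. A possible variant, sketched here as a plain loop over a hypothetical directory layout, first checks for a `.git` directory; it uses `status` so that it runs without a network connection:

```bash
scratch=$(mktemp -d) && cd "$scratch"
git init -q RepoA && git init -q RepoB && mkdir NotARepo    # hypothetical layout

for dir in */; do
    if [ -d "$dir/.git" ]; then                 # only act on git repositories
        echo "== $dir"
        git -C "$dir" status --short --branch   # swap in `pull` to update each repository
    else
        echo "-- skipping $dir (not a git repository)"
    fi
done
```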

## Using git through a GUI 

There are many nice git GUIs out there. For example, [GitKraken](https://www.gitkraken.com/). Or if you are using a code editor like VS Code, there are nice extensions that will give you considerable GUI functionality.

### Practicals

* The only practical submission for git is pushing your coursework git repository, `.gitignore` and `readme` files included. Make sure your `.gitignore` has meaningful exclusions, and your `readme` has useful information. Google "readme good practices" or something like that to find online tips.

* Also, invite your assessor to your coursework repository (e.g., `CMEECourseWork`) with *write privileges*. The current assessor is s.pawar@imperial.ac.uk (or "mhasoba" on both bitbucket and github).

Also, remember, you can clone TheMulQuaBio (see the [Intro Chapter](00-Intro.ipynb)). You can then `cp` files from this to your own CMEECourseWork as and when needed. Please don't work in the master repo, as you will lose your work when I next update it!

You can thereafter `git pull` at the start of every new session (to get the updates) inside your local copy of `TheMulQuaBio` (but always do `git status` first).

---
> **Checking git status**: Always do a `git status` in a repository before pulling from or pushing to a remote repository!


## Readings & Resources

There is a wealth of information on git out there - just google it!

* Excellent book on Git: <http://git-scm.com/book>
* Also, <https://www.atlassian.com/git/>
* A git tutorial: <https://try.github.io>

**Footnotes**

<a name="git:word">1</a>: There you will find the following phrase: "...one of the most annoying problems known to humanity: version-controlling Microsoft Word documents.". LOL!

<a name="git:largefiles">2</a>: None of the computing weeks' assessments will require you to use such large files anyway.