# ``git`` and ``GitHub``: version control and social open-source development

All code should be under version control to keep track of changes over time, and when it comes to version control, ``git`` is the dominant system. A 2018 Stack Overflow survey of developers' version control use found that [close to 90% of developers use git](https://insights.stackoverflow.com/survey/2018#work-_-version-control), with the second most popular version control system being the older [Subversion](https://subversion.apache.org/), likely used mostly for legacy code that still lives in Subversion repositories. ``git`` is also the system most widely supported by code hosting services, with [GitHub](https://github.com/) only hosting ``git`` repositories and [Bitbucket](https://bitbucket.org/) [dropping support](https://bitbucket.org/blog/sunsetting-mercurial-support-in-bitbucket) for the main ``git`` alternative ``mercurial`` in 2020. Basically, ``git`` is now the only game in town.

## Version control

Version control is a system for tracking and making changes to code as the code develops. Version control stores code in a "repository". Any changes to the code are logged through a code "commit" that lists the files changed and provides a brief description of the changes; version control software then generates a "diff" with respect to the previous version that gets stored, such that the full history of changes is available for use in the future.

Opinions differ on how many changes to include in a single commit (which can consist of changes to multiple files at once), but typically it is best to keep commits as "atomic" as possible, that is, to create a commit for the smallest change that is reasonable to call a change or improvement to the code. Commits are often as small as changing a single line, perhaps improving the documentation or fixing a small bug. When making changes to existing code, it is always best to keep commits at the level of small changes, so that any issue can later easily be pinned down to a specific change; such commits should typically consist of a few lines to a few dozen lines of edited code at most. When first implementing a new feature, it may make sense to wait to commit until a draft version of the feature is working, which can lead to a larger commit, but even then it is best to first implement as bare a skeleton of the new feature as possible and then edit it with small changes until the feature is fully implemented.
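As a rough illustration of these concepts, the following sketch creates a throwaway repository in a temporary directory and records two small, atomic commits. The file name ``cube.py``, the commit messages, and the placeholder author identity are made up for this example.

```shell
# Create a throwaway repository in a temporary directory
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity, set only for this example repository
git config user.name "Example Author"
git config user.email "author@example.com"

# First atomic commit: a bare-bones version of a new function
printf 'def cube(x):\n    return x*x*x\n' > cube.py
git add cube.py
git commit -q -m "Add cube function"

# Second atomic commit: a one-line improvement
printf 'def cube(x):\n    return x**3\n' > cube.py
git commit -q -m "Use power operator in cube" cube.py

# The history now lists both commits, newest first
git log --oneline
```

Because each commit contains a single logical change, a later problem with ``cube`` can be traced back to exactly one of them.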

Early version control systems kept the history of changes stored in a central location while each developer only had a copy of the current version of the code; thus, every code commit and every query of the code's history required interaction with the central location (often remote, requiring an internet connection). Because one could not commit while offline, and because even when online the sometimes slow response time of the central location impeded quick progress, this often led to bloated commits. One of the great improvements in ``git`` is that each copy of the code's repository contains the full history, leading to a decentralized system where there is no need for a central location. Different copies of a ``git`` repository are called "clones" and with ``git``, clones can communicate among themselves without needing to go through a central place. Of course, the current reality is that most ``git`` repositories have "main" copies that are stored in online services like GitHub or Bitbucket, with most of the communication between different clones happening through the main hosted copy of the code. Nevertheless, the fact that each clone contains the entire history means that you can easily commit code and investigate the history without requiring interaction with a centralized repository, which is therefore much faster and more robust against network interruptions.
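To make the "every clone has the full history" point concrete, here is a small sketch that clones a local repository and then inspects the history and commits without touching the original. A local directory stands in for the hosted main copy; all names and the author identity are placeholders.

```shell
# Build a small repository to play the role of the "main" copy
main_copy=$(mktemp -d)
cd "$main_copy"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second line" >> notes.txt
git commit -q -m "Extend notes" notes.txt

# Clone it: the clone receives the entire history, not just the latest files
clone_dir=$(mktemp -d)
git clone -q "$main_copy" "$clone_dir/clone"
cd "$clone_dir/clone"
git config user.name "Example Author"
git config user.email "author@example.com"

# Both of these work without contacting the main copy
git log --oneline
echo "third line" >> notes.txt
git commit -q -m "Add a third line while offline" notes.txt
```

The new commit lives only in the clone until it is pushed; the main copy still has two commits.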

In this chapter, I provide a brief overview of the basic ``git`` features and commands and discuss how to use GitHub to host your scientific code package. Note that this is not supposed to be an exhaustive guide to using ``git``; many such guides already exist.

## Basic ``git`` use

The most basic cycle of ``git`` use is a cycle of ``git pull``, ``git commit``, ``git push``. These commands, respectively, pull in the latest changes from the remote main version of the code repository (e.g., hosted on GitHub), commit new changes made to the code, and push the new commit(s) back to the remote main version. If you add ``git diff`` for looking at the current set of not-yet-committed changes, ``git status`` for interrogating the status of the repository, and ``git log`` for looking at the history of the code, you have well over 95% of my typical usage of ``git``.

As discussed above, you should make a commit as soon as enough changes to the code have accrued to make up a reasonable change to the code (again, this could be as simple as fixing a typo in the code, a single character). Simply running
```
git commit
```
will open a text editor (set by your ``git`` defaults) that allows you to write a message describing the change; this will perform a single commit for all changes that you have staged with ``git add`` (discussed below), while ``git commit -a`` performs a single commit for *all* current changes to *all* changed files that ``git`` already tracks. You can avoid the use of the text editor by directly specifying the message as
```
git commit -m "A message describing the atomic change made"
```
which I personally prefer as a fan of the command line (and of speed!). You can list specific files to only commit changes to those files by adding them to the command as
```
git commit -m "A message describing the atomic change made" file1.py file2.py
```
It is good practice to *always specify the files you are committing changes for* rather than not specifying any files or specifying a folder (which would commit changes to all files in that folder). This way, you don't end up accidentally committing changes made to other files that are unrelated to the current commit (we will see [below](#Some-useful-advanced-git-features) how you can even split up changes in a single file into different commits).
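A minimal sketch of what listing files explicitly buys you: two files are edited, but only one of them goes into the commit. The repository lives in a temporary directory, and the file contents and identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "a = 1" > file1.py
echo "b = 1" > file2.py
git add file1.py file2.py
git commit -q -m "Add file1 and file2"

# Edit both files, but commit only the change to file1.py
echo "a = 2" > file1.py
echo "b = 2" > file2.py
git commit -q -m "Set a to 2" file1.py

# file2.py is still listed as modified and can go into its own commit
git status --short
```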

Before you can start committing changes to files, you need to tell ``git`` about the existence of the file in the first place (typically soon after you create the file, in preparation for your first commit of the file). This is done using ``git add`` which you call with
```
git add file1.py
```
and you can also list multiple files. Even though you can specify a directory or use wildcards, it is again good practice to always explicitly list all files that you are adding; when adding an entire folder or more, you will invariably end up adding files that you did not want to add, and removing them again can be difficult.
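The effect of adding files one by one can be seen in a short sketch like the following, where a hypothetical scratch file sits next to the code and is deliberately left out.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "x = 1" > file1.py
echo "scratch" > scratch.txt   # hypothetical file that should stay out of the repository

git status --short             # both files are listed as untracked ("??")
git add file1.py               # declare only file1.py to git
git status --short             # file1.py is now "A" (added), scratch.txt is still "??"
git commit -q -m "Add file1"
```

Had the whole directory been added instead, ``scratch.txt`` would have ended up in the commit as well.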

When your code is centrally hosted (as it should be!), each coding session should start with a ``git pull`` to pull in changes in the remote main repository that have not yet been added to your clone of the repository. If you are the sole developer of a code, this may seem silly, but it is again good practice to always do this such that it becomes muscle memory and because even if you are the sole developer, you are likely to be developing the code on multiple machines (a personal laptop, a desktop at work, a remote server for running large jobs, ...) and this keeps the code in sync. When you have cloned the code from GitHub and are working in the master branch, a simple
```
git pull
```
will suffice to pull in remote changes, but in general you can specify both the location of the remote repository and the branch. For example, typically the simple ``git pull`` will be equivalent to
```
git pull origin master
```
which tells ``git`` to pull changes from the remote repository referenced as "origin" and to pull changes from the master branch.

After you have made one or more commits, you will want to push these commits back to the remote main repository. Before you do that, it is good practice to again first do ``git pull`` to pull in any changes to the remote repository that may have occurred while you were coding, so you can resolve any conflicts before pushing your own changes (``git`` will reject a push if the remote repository contains commits that you have not yet pulled in). Once you have done this, you push your commits with
```
git push
```
which is again typically a shortcut for the full
```
git push origin master
```
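The whole pull, commit, push cycle can be tried out without a GitHub account. In the sketch below, a local "bare" repository (``hub.git``) stands in for the hosted main copy, and two clones named ``laptop`` and ``desktop`` play the role of two machines; these names and the author identity are assumptions made for the example.

```shell
work=$(mktemp -d)
cd "$work"

# Seed repository with one commit, then a bare copy standing in for GitHub
git init -q seed
cd seed
git config user.name "Example Author"
git config user.email "author@example.com"
echo "v1" > code.txt
git add code.txt
git commit -q -m "Initial version"
# Name of the default branch: master or main, depending on the git configuration
branch=$(git symbolic-ref --short HEAD)
cd "$work"
git clone -q --bare seed hub.git

# Two clones of the hosted copy, e.g., on a laptop and on a desktop
git clone -q hub.git laptop
git clone -q hub.git desktop

# Laptop session: pull, commit, pull again, push
cd "$work/laptop"
git config user.name "Example Author"
git config user.email "author@example.com"
git pull -q origin "$branch"
echo "v2" > code.txt
git commit -q -m "Update to v2" code.txt
git pull -q origin "$branch"
git push -q origin "$branch"

# Desktop session: starting with a pull brings in the laptop's commit
cd "$work/desktop"
git pull -q origin "$branch"
cat code.txt
```

After the final pull, the desktop clone contains the commit that was made on the laptop.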

With just these four ``git`` commands you can get most of the basic functionality of ``git`` version control. Further useful basic commands are ``git diff``, ``git status``, and ``git log``. The ``git diff`` command provides a "diff" showing the difference between your clone's current state and the last commit; thus, it shows the changes you have made since the last commit. Depending on your setup, this diff will simply have '-' lines and '+' lines to show lines that were removed and added (a change on a single line is shown as both a removal of the old line and an addition of the edited new line), or these lines may additionally be colored red and green. Running
```
git diff
```
without any additional arguments goes through all files that were changed, but you can look at changes in a single file or in a set of files by specifying them in the call, e.g., as
```
git diff file1.py
```
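As a small sketch of what the diff looks like, the following changes one line of a committed file and then asks for the diff of that file. File name and contents are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
printf 'def cube(x):\n    return x*x*x\n' > file1.py
git add file1.py
git commit -q -m "Add cube function"

# Edit a single line, then look at the uncommitted change
printf 'def cube(x):\n    return x**3\n' > file1.py
git diff file1.py
# Among some header lines, the output contains the removed and added line:
# -    return x*x*x
# +    return x**3
```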

Running ``git status`` gives a brief summary of the current version of your clone. It prints the branch you are on and whether you are up to date with the remote repository's same branch or how many commits ahead of the remote repository you are. It also prints files that have changed since the last commit and files contained in your clone that have not been declared to ``git`` (for example, new files before running ``git add`` will show up in that list; after running ``git add`` they will be listed as newly added). I use ``git status`` *a lot* to remind myself of what I have been doing since the last commit.

``git log`` prints a log of the history of changes to the code. Run without any options, it will provide a moderately verbose list of all commits, listing the commit hash (the unique identifier of every commit), the commit's author and date, and the summary that you provided when running ``git commit``. But ``git log``'s output can be highly customized. To get a very succinct listing do
```
git log --oneline
```
which will list each commit in an abbreviated manner on a single line. Or use the ``--pretty=`` option to get less or more information, e.g.,
```
git log --pretty=short
```
which is similar to the basic output, but does not include the date. 
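Beyond the named formats, ``--pretty`` also accepts a custom format string. The sketch below prints one line per commit with the abbreviated hash (``%h``), author name (``%an``), date (``%ad``), and summary (``%s``); the particular layout is just one possible choice.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "x = 1" > file1.py
git add file1.py
git commit -q -m "Add file1"
echo "x = 2" > file1.py
git commit -q -m "Set x to 2" file1.py

# One commit per line: abbreviated hash, author, date (YYYY-MM-DD), summary
# (tformat: works like format:, but ends every entry with a newline)
git log --date=short --pretty=tformat:"%h  %an  %ad  %s"

# Restrict the log to the most recent commit that touched a given file
git log -n 1 --oneline -- file1.py
```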

## Branches

A feature of most version control systems and one that is especially easy to use with ``git`` is the ability to *branch* off the main development branch of your code to focus on developing a single feature, fixing a single bug, etc. After you are satisfied with the changes on the branch, these changes are *merged* back into the main development commit history. A crucial part of the implementation of the ``git`` software is fast and intelligent algorithms to perform such merges automatically, even when the differences between the feature branch and the main branch are substantial. When ``git`` is unable to automatically merge branches, the repository goes into a suspended state until the user manually resolves any conflicts that cannot be resolved automatically.

Branches are an incredibly useful feature of ``git``, especially when combined with ``forks`` discussed [below](#Using-GitHub-to-build-a-community-for-your-code), and you should make liberal use of them. Branches allow you to split off things like implementing new features, while still keeping the ability to fix bugs in the main branch without that fix having to wait for the new feature to be ready to go "live". Branches also allow you to develop new features in the incremental way that you should implement all of your code (with many commits), without necessarily having to worry at first that the new feature is entirely compatible with the existing code or that it passes all existing tests.

The main branch is traditionally called ``master`` (repositories created on GitHub since late 2020 call it ``main`` by default; this chapter uses ``master`` throughout). It is good practice to keep this branch as clean as possible, that is, avoid having it be in a state where it contains partially implemented features or bug fixes. The ``master`` branch should always contain a fully working version of your code. Any significant changes to your code should therefore be done in other branches.

To create a new branch, do
```
git checkout -b NEWFEATURE
```
which creates a branch called ``NEWFEATURE`` (which should be a very brief string describing the new feature, e.g., "add_cube" if you are adding a function to compute the cube) and switches the state of the repository to this branch. In detail, this command is a shorthand for the following two commands
```
git branch NEWFEATURE
git checkout NEWFEATURE
```
where the first ``git branch`` command creates the branch, while staying on the current branch (e.g., ``master``) and the ``git checkout`` command switches ("checks out") the state of the repository to the new branch. After running this, ``git status`` will report that you are now on the ``NEWFEATURE`` branch. Any commit that you make now is logged in the commit history of the branch, which is the same as that of the branch it branched off from up until the branching point and then starts containing additional commits. Running
```
git branch
```
without any further arguments will show a list of all branches that exist in the local clone of the repository (this is not necessarily the same as the branches that exist in the centrally-hosted repository if those branches haven't been checked out in the local repository). To switch between branches, run
```
git checkout SWITCH_TO_BRANCH
```
where ``SWITCH_TO_BRANCH`` is the name of the branch you want to switch to (e.g., ``git checkout master`` to go back to master). This keeps the branch that you are leaving intact; it simply switches the working state of the repository to another branch. This is useful if you are working on a new feature in one branch, but want to fix a bug in another branch. Make sure to commit all changes that you made in a branch before switching to another branch, otherwise there is a good chance that you will accidentally commit a change you meant to commit in the feature branch in the wrong branch!
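The following sketch walks through creating a branch, committing on it, and switching back and forth. It uses the ``add_cube`` branch name from the example above; everything else (files, messages, identity) is made up.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "x = 1" > file.py
git add file.py
git commit -q -m "Add file"
# Name of the default branch: master or main, depending on the git configuration
base=$(git symbolic-ref --short HEAD)

# Create the feature branch and commit a new file on it
git checkout -q -b add_cube
printf 'def cube(x):\n    return x**3\n' > cube.py
git add cube.py
git commit -q -m "Add cube function"

git branch                 # lists both branches, the current one marked with "*"
git checkout -q "$base"    # back on the main branch: cube.py is not present here
ls
git checkout -q add_cube   # back on the feature branch: cube.py reappears
ls
```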

Once you are ready to merge the changes in your branch back into the ``master`` branch, you switch back to ``master`` and run the merge command
```
git checkout master
git merge NEWFEATURE
```
``git merge`` will attempt to perform the merge automatically; if it succeeds, you have to do nothing except okay the commit that performs the merge. If the automatic merge fails, you will get a message like
```
Auto-merging file.py
CONFLICT (content): Merge conflict in file.py
Automatic merge failed; fix conflicts and then commit the result.
```
notifying you that the merge has failed and that you have to resolve conflicts between the branches yourself. This is an annoying situation, but it will happen. The failed merge process will leave your files in a state where they record the attempted merge and why it failed; your ``file.py`` in this case will have a section that looks like
```
<<<<<<< HEAD:file.py
def cube(x):
   return x**3
=======
def newcube(x):
   return x**3.
>>>>>>> NEWFEATURE:file.py
```
You can resolve these manually by editing the file and then marking it as resolved with ``git add file.py``, but it is typically easier to use a tool for this, which you can bring up with
```
git mergetool
```
This command will ask you which tool to use (e.g., ``opendiff``) and will then open the files with conflicts in sequence in the merge-tool to allow you to resolve the changes, with typical output
```
This message is displayed because 'merge.tool' is not configured.
See 'git mergetool --tool-help' or 'git help config' for more details.
'git mergetool' will now attempt to use one of the following tools:
opendiff kdiff3 tkdiff xxdiff meld tortoisemerge gvimdiff diffuse diffmerge ecmerge p4merge araxis bc3 codecompare vimdiff emerge
Merging:
file.py

Normal merge conflict for 'file.py':
  {local}: modified file
  {remote}: modified file
Hit return to start merge resolution tool (opendiff):
```
Typically, these tools will show the two versions of the file, labelling all sections that need to be merged and indicating which merges cannot be performed automatically, alongside the merged version of the file, which you can edit to resolve the conflicts (either through an option such as "choose master" or "choose NEWFEATURE", or by manually editing the merged file). Once you have resolved the conflicts, you need to perform a simple
```
git commit
```
without any other arguments (i.e., don't specify any files) to commit the merge.
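To see a conflict from start to finish without a merge tool, the sketch below deliberately edits the same line on two branches, attempts the merge, and then resolves the conflict by hand. The choice to keep the main branch's version is arbitrary, and ``-m`` is used on the final commit only to keep the sketch non-interactive.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
printf 'def cube(x):\n    return x*x*x\n' > file.py
git add file.py
git commit -q -m "Add cube function"
# Name of the default branch: master or main, depending on the git configuration
base=$(git symbolic-ref --short HEAD)

# Change the same line differently on the feature branch and the main branch
git checkout -q -b NEWFEATURE
printf 'def cube(x):\n    return x**3.\n' > file.py
git commit -q -m "Return a float from cube" file.py
git checkout -q "$base"
printf 'def cube(x):\n    return x**3\n' > file.py
git commit -q -m "Use power operator in cube" file.py

# The automatic merge fails and leaves conflict markers in file.py
git merge NEWFEATURE || echo "merge stopped with conflicts"
git status --short             # "UU file.py" marks the unresolved file

# Resolve by writing the desired final content, mark as resolved, and commit
printf 'def cube(x):\n    return x**3\n' > file.py
git add file.py
git commit -q -m "Merge NEWFEATURE, keeping the integer version of cube"
```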

Once you have merged a branch's changes back into the ``master`` branch, you can delete the branch by running
```
git branch -d NEWFEATURE
```
If you have performed the merge elsewhere (e.g., on GitHub), this command might complain that the ``NEWFEATURE`` branch contains changes that have not been merged yet, but if you are sure that all is okay, you can force-delete the branch by switching to an uppercase "D"
```
git branch -D NEWFEATURE
```
Be careful with this though, because if you accidentally delete a branch that you still need, it will be *very difficult* to get it back (although, because it's ``git``, not necessarily impossible...).
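One possible recovery route, sketched below, relies on the fact that deleting a branch does not immediately delete its commits: ``git reflog`` lists the commits that ``HEAD`` recently pointed to, and a branch can be recreated at any of them. This only works as long as ``git`` has not yet cleaned up the unreferenced commits, and the ``HEAD@{1}`` reference is specific to the sequence of steps in this example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "x = 1" > file.py
git add file.py
git commit -q -m "Add file"
base=$(git symbolic-ref --short HEAD)

git checkout -q -b NEWFEATURE
echo "x = 2" > file.py
git commit -q -m "Set x to 2" file.py
git checkout -q "$base"

# -d refuses because NEWFEATURE has a commit that was never merged
git branch -d NEWFEATURE || echo "refused: branch is not fully merged"
# -D deletes it anyway and prints the hash that the branch pointed to
git branch -D NEWFEATURE

# The reflog still lists the deleted branch's last commit
git reflog -n 3
# Here, HEAD@{1} is where HEAD was before the last checkout: the old branch tip
git branch NEWFEATURE "HEAD@{1}"
```

In a real situation you would read the hash from the ``git reflog`` output (or from the message printed by ``git branch -D``) and pass that to ``git branch`` instead.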

## Some useful advanced ``git`` features

The ``git`` features discussed above will allow you to do most of your day-to-day work with ``git`` version control, but ``git`` has many advanced features. This is not supposed to be an exhaustive guide to all ``git`` features, but in this section I briefly discuss some of the more advanced ``git`` features that I use on a semi-regular basis.

Above, we have used ``git checkout`` to switch branches, but ``git checkout`` can do much more. One often-used invocation is
```
git checkout -- file.py
```
which discards all changes in ``file.py`` since the previous commit (you can also run it on the entire repository). This is useful when you've made a big mess and the easiest way out is to just give up and start over (this happens to me *a lot*). Again, be careful with this command, because once you discard the changes, it is impossible to get them back.
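A minimal sketch of discarding an unwanted edit, using a placeholder file in a temporary repository:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "x = 1" > file.py
git add file.py
git commit -q -m "Add file"

echo "a big mess" > file.py    # an edit we regret
git status --short             # file.py is listed as modified
git checkout -- file.py        # discard the edit; this cannot be undone
cat file.py                    # the committed content is back
```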

Besides checking-out branches, ``git checkout`` can also check-out a previous commit, by specifying the commit's hash as
```
git checkout COMMITHASH
```
where ``COMMITHASH`` is the commit's hash (a hexadecimal string like ``625123ab491088d6714809648d8a13ae435b7cf8`` that you can get from ``git log`` or elsewhere). This will leave the repository in a "detached HEAD" state, which doesn't sound good and indeed isn't all that good (if you want to actually start making changes, you will have to create a new branch starting from this commit), but it allows you to switch back to an earlier state of the repository and see what it looked like or run tests etc. for the earlier state. That's often useful when you are trying to figure out where in the commit history *something went wrong*.
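The sketch below checks out an earlier commit by hash, looks at the old state, and returns to the main branch. The branch name ``inspect_old`` is a hypothetical name that shows how to start a branch from the old commit if changes are needed.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "v1" > code.txt
git add code.txt
git commit -q -m "Version 1"
echo "v2" > code.txt
git commit -q -m "Version 2" code.txt
base=$(git symbolic-ref --short HEAD)

# Look up the hash of the commit before the latest one
old=$(git rev-parse HEAD~1)

git checkout -q "$old"           # "detached HEAD" at the first commit
cat code.txt                     # shows the old content
git checkout -q -b inspect_old   # only needed if you want to commit changes from here
git checkout -q "$base"          # return to the latest state
cat code.txt
```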

If you are working in a branch and have uncommitted changes and you want to switch to another branch (briefly, say) and you *really* don't want to commit the uncommitted changes before the switch, you can "stash" them away for future use. For this run
```
git stash
```
which stashes all uncommitted changes and reverts the repository back to the previous commit. Then you can switch to another branch without carrying over the uncommitted change. Once you are ready to start work on the uncommitted changes again, switch back to their branch and do
```
git stash pop
```
to bring back the uncommitted changes. You can stash multiple sets of uncommitted changes and there is support for listing them etc., but in practice that becomes ugly very quickly, so it is best to use ``git stash`` very sparingly and only for very brief periods of time (e.g., you are in the middle of working on a new feature, someone reports a bug that will just take two minutes to fix, so you switch to a branch to fix the bug before coming back five minutes later to take up the new feature's implementation again).
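The two-minute bug fix scenario could look roughly as follows. The file names and the "bug fix" itself are placeholders; the point is only the order of ``git stash``, the branch switches, and ``git stash pop``.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"
echo "x = 1" > file.py
git add file.py
git commit -q -m "Add file"
base=$(git symbolic-ref --short HEAD)

# Half-finished, uncommitted work on a feature branch
git checkout -q -b NEWFEATURE
echo "x = 2  # half-finished work" > file.py

# Stash it, fix the bug on the main branch, and come back
git stash -q
git checkout -q "$base"
echo "y = 1" > fix.py
git add fix.py
git commit -q -m "Quick bug fix"
git checkout -q NEWFEATURE
git stash pop -q

git status --short             # file.py is modified again, as before the stash
```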

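Finally, ``git add -p`` allows you to split up the changes in a single file into different commits. Running ``git add -p file.py`` goes through the changes in ``file.py`` one "hunk" at a time and asks for each hunk whether you want to stage it for the next commit. A subsequent ``git commit`` without any file names then commits only the hunks that you selected, leaving the other changes uncommitted in your working copy.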

## Using GitHub to build a community for your code

Need to explain how to sync branches with GitHub