### What this tutorial will teach you

 1. What is version control and what problems does it solve?
 2. What are the benefits of using version control?
 3. How to use Git in managing a project.

 # What is version control & what problems does it solve?

<br/>

Have you ever encountered a situation like this? If so, _what stands out_ and _how do you think this can be improved_?

<figure>
    <img alt="Thesis Version Control" src="https://www.riffomonas.org/reproducible_research/assets/images/phd-story-told-in-filenames.gif" align="center">
  <figcaption><b>Figure 1: </b><i>Data management without version control</i>. Source: <a href="http://www.phdcomics.com">http://www.phdcomics.com</a></figcaption>
</figure>

<br/>

## What is version control?

<br/>

Abstractly, version control is a system for managing changes (i.e. versions) to a file or set of files over time, such that you can access specific versions later. Version control is implemented through specialized tools called version control systems (VCS) that allow you to track, modify, or even _undo_ changes to such files. Although traditionally geared towards source code management (SCM), a VCS can be used to maintain data - i.e. _pretty much anything_ - that requires a historical record of the changes made.

Although we'll be focusing on Git, there are a number of different VCS available (e.g. [Concurrent Versions System / CVS](https://en.wikipedia.org/wiki/Concurrent_Versions_System), [Subversion / SVN](https://en.wikipedia.org/wiki/Apache_Subversion), [Mercurial](https://en.wikipedia.org/wiki/Mercurial), [Bazaar](https://en.wikipedia.org/wiki/GNU_Bazaar), etc.), since each system is built around different philosophies and use-cases. Git has two important features that make it appealing as a VCS:

1. **It keeps track of all changes made to a repository** (i.e. the collection of files under version control).
2. **It is a distributed VCS**, which means that every person working with the repository can easily make changes locally, and does not require constant communication with a centralized server whenever a change is made. It's only when the local changes are synchronized with the remote repository that those changes are reconciled with everyone else's.

These two traits make Git very fast and well suited to the maintenance and development of projects that involve an arbitrary number of people, as local changes are cheap and easy to merge. These changes are often made in _feature branches_ so that the _master branch_ always contains production quality code / data.

<figure>
  <img width="50%" height="50%" src="https://www.atlassian.com/dam/jcr:fcad863b-e0da-4a55-92ee-7caf4988e34e/02.svg" align="center">
  <figcaption><b>Figure 2: </b><i>Git branching</i>. Source: <a href="https://www.atlassian.com/git/tutorials">https://www.atlassian.com/git/tutorials</a></figcaption>
</figure>

In addition to these reasons, Git has risen to prominence due to: 1) its creator, Linus Torvalds (the creator of the Linux kernel); and, 2) the rise of GitHub, which has popularized Git in the open-source community.

**This makes Git appealing for projects where the change history is _critical_ to the project's purpose and use - e.g. security applications, application source code, one's thesis, etc.**

It must also be stressed that _although software projects can be maintained without version control, doing so carries **a huge risk**, as recovery is limited by the number and recency of the copies of the project that happen to exist._ In other words, it's better to ask _which VCS to use_ rather than _whether or not to use one_.

<br/>

REFERENCES & FURTHER READING

1. What is version control: https://www.atlassian.com/git/tutorials/what-is-version-control
2. About version control: https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

# Common Git tasks


## Setting up a repository

There are two ways of setting up a repository:
   1. `git init <directory>` - this creates a `.git` subdirectory in `<directory>`, or in the current working directory if no `directory` is given. This subdirectory contains all the necessary Git metadata for objects, references, and template files. A `HEAD` file is also created, which points to the currently checked-out branch or commit. The other files in the project's root directory are left unaltered. Notably, subdirectories of the project do not need (or have) their own `.git` subdirectory.
   2. `git clone` - this targets an _existing_ repository and creates a clone / copy of it. The source can be a remote location or a local directory. Cloning can be shallow (i.e. only recent commits), or deep (the full commit history). Git also supports various URL protocols: e.g. `http[s]`, `git` (no authentication), and `ssh` (authentication required).
     1. Remote example: `git clone https://github.com/jojker/PML_Workshops.git` (note _the_ trailing `.git`!)
     2. Local example: `git clone /example/git/repository` (note _the lack_ of the trailing `.git`!)
   
In both cases, there are more esoteric variations of these commands for scenarios involving templates and shared resources. Although we won't cover them in this tutorial, they are worth investigating.
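
To see what `git init` leaves behind, the following is a minimal sketch that initializes a throwaway repository and inspects its `.git` subdirectory. It assumes the `git` executable is on your `PATH` (it returns `None` otherwise); the helper name `init_demo` is made up for this illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def init_demo():
    """Run `git init` in a temporary directory and return the contents of HEAD."""
    if shutil.which("git") is None:  # Git is not installed: nothing to show
        return None
    with tempfile.TemporaryDirectory() as project_dir:
        subprocess.run(["git", "init", "--quiet", project_dir], check=True)
        git_dir = Path(project_dir) / ".git"
        # List the metadata entries that Git created
        print(sorted(entry.name for entry in git_dir.iterdir()))
        return (git_dir / "HEAD").read_text().strip()


head = init_demo()
print(head)
```

In a fresh repository, `HEAD` is a one-line text file containing a reference to the default branch, whose name depends on your Git version and settings.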

## Configuring the repository for modification

When a user commits (i.e. saves) a change to a file (or files) in a project, metadata about that user is recorded in the commit. To make this process easier, one can define settings at three levels: local (specific to one repository), global (applied to all repositories of the current user), or system (applied to all users and repositories on the machine).

There are two settings that are most important to us:

1. The user's name: `git config user.name "Firstname Lastname"`
2. The user's email: `git config user.email "user@example.com"`

Note that if the setting level isn't defined, the configuration is saved _locally_. These settings can then be checked by looking at the config for the given setting level - e.g. local: `git config --local user.name`.

To list all the settings at a given level, one can run `git config --<scope> -l`, where `<scope>` is `local`, `global`, or `system`.
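
When the same setting is defined at several levels, the most specific level wins: local overrides global, which overrides system. The sketch below is a toy model of that lookup order using plain dictionaries. It is not how Git stores its configuration, and all the values are made up for illustration.

```python
from collections import ChainMap

# Hypothetical settings at each level (illustrative values only)
system_cfg = {"core.editor": "vi"}
global_cfg = {"user.name": "Firstname Lastname", "user.email": "user@example.com"}
local_cfg = {"user.email": "project@example.com"}

# ChainMap searches its mappings left to right, so the most specific level wins
effective = ChainMap(local_cfg, global_cfg, system_cfg)

print(effective["user.email"])   # the local value overrides the global one
print(effective["user.name"])    # falls back to the global value
print(effective["core.editor"])  # falls back to the system value
```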

## Saving changes

Once a repository has been configured, we can add code to it by using the `git add <file|directory>` command. This command only _stages_ the changes: it doesn't affect the repository's history _until_ `git commit` is called, which records the staged changes as a new commit on top of the last one.

One can then check the status of the files under version control by running `git status` in the project root. This will indicate changes Git has detected based on the files that it's permitted to track.

Should one make a bad addition before a commit, it can be undone by running `git reset`. This "resets" the staging area (the files themselves are left untouched), so one can modify what is staged before a commit is made.

When one is ready to commit the changes after `git add`, this can be achieved by running `git commit`. By default, Git opens an editor and prompts you for a commit message as a way of documenting the changes made. Since all commits are tracked, good messages allow for better project management and make it easier to identify and fix software bugs. A commit message can also be supplied directly with the `-m` flag. If no message is provided, Git will abort the commit.
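
The add / reset / commit cycle can be pictured with a small toy model. The sketch below is a deliberate simplification (the class `ToyRepo` and its methods are invented for this illustration, and real Git stores far more than this): changes are collected in a staging area, can be unstaged again, and only become part of the history when committed with a message.

```python
class ToyRepo:
    """A toy model of staging and committing; not how Git works internally."""

    def __init__(self):
        self.staged = {}    # changes waiting to be committed
        self.commits = []   # history, oldest first

    def add(self, path, content):
        """Stage a change, like `git add <file>`."""
        self.staged[path] = content

    def reset(self):
        """Unstage everything, like `git reset` before a commit."""
        self.staged.clear()

    def commit(self, message):
        """Record the staged changes, like `git commit -m <message>`."""
        if not message:
            raise ValueError("aborting commit due to empty commit message")
        if not self.staged:
            raise ValueError("nothing to commit")
        self.commits.append((message, dict(self.staged)))
        self.staged.clear()


repo = ToyRepo()
repo.add("joke.py", "version 1")
repo.add("notes.tmp", "scratch")   # a bad addition...
repo.reset()                       # ...so unstage everything
repo.add("joke.py", "version 1")   # stage only what we want
repo.commit("Add joke.py")
print(repo.commits)
```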

## Saving intermediate work

`git stash` temporarily shelves (or _stashes_) changes you've made to your working copy so you can work on something else, and then come back and re-apply them later on. Stashing is handy if you need to quickly switch context and work on something else, but you're mid-way through a code change and aren't quite ready to commit.

The `git stash` command takes your uncommitted changes (both staged and unstaged), saves them away for later use, and then reverts them from your working copy. At this point you're free to make changes, create new commits, switch branches, and perform any other Git operations; then come back and re-apply your stash when you're ready.

Note that the stash is local to your Git repository; stashes are not transferred to the server when you push. And, by default Git _won't_ stash changes made to untracked or ignored files.

When you're ready to continue working on your previous work, this can be done by using `git stash pop` (`pop` refers to the operation that's performed in a stack, i.e. a data structure that is LIFO or "Last-In-First-Out"). _Popping_ your stash removes the changes from your stash and reapplies them to your working copy.

Alternatively, you can _reapply_ the changes to your working copy _and_ keep them in your stash with `git stash apply`.
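
The difference between `pop` and `apply` can be sketched with a Python list used as a stack. This is only an analogy under simplifying assumptions (changes are modeled as a dictionary, and the function names are invented for the illustration), not a description of how Git stores stashes.

```python
stash = []  # a list used as a stack: the last entry pushed is the first one popped


def stash_push(working_changes):
    """Shelve a copy of the current changes and clean the working copy."""
    stash.append(dict(working_changes))
    working_changes.clear()


def stash_apply(working_changes):
    """Reapply the most recent entry but keep it in the stash."""
    working_changes.update(stash[-1])


def stash_pop(working_changes):
    """Reapply the most recent entry and remove it from the stash."""
    working_changes.update(stash.pop())


working = {"joke.py": "half-finished edit"}
stash_push(working)
print(working, len(stash))   # the working copy is clean, one entry is stashed
stash_apply(working)
print(working, len(stash))   # changes are back, the entry is still stashed
stash_pop(working)
print(working, len(stash))   # changes are back, the stash is now empty
```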

## Ignoring certain files

Git sees every file in your working copy as one of three things:

1. tracked - a file which has been previously staged or committed
2. untracked - a file which has not been staged or committed
3. ignored - a file which Git has been explicitly told to ignore

Ignored files are usually build artifacts and machine generated files that can be derived from your repository source or should otherwise not be committed. Some common examples are:

- dependency caches, such as the contents of /node_modules or /packages
- compiled code, such as .o, .pyc, and .class files
- build output directories, such as /bin, /out, or /target
- files generated at runtime, such as .log, .lock, or .tmp
- hidden system files, such as .DS_Store or Thumbs.db
- personal IDE config files, such as .idea/workspace.xml

The patterns for ignored files are listed in a special file named `.gitignore` that is checked in at the root of your repository. There is no explicit `git ignore` command: instead, the `.gitignore` file must be edited and committed by hand when you have new files that you wish to ignore. `.gitignore` files contain patterns that are matched against file names in your repository to determine whether or not they should be ignored.

To define your own patterns in `.gitignore`, you can consult http://linux.die.net/man/7/glob as Git uses globbing patterns to match against filenames.
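
To get a feel for globbing, the sketch below uses Python's standard `fnmatch` module to test a few file names against simple patterns. This only approximates `.gitignore` matching: Git adds rules of its own (for example for directories and negated patterns) that are not modeled here, and the file names are made up for the example.

```python
from fnmatch import fnmatchcase

# Simple glob patterns of the kind one might put in a .gitignore file
patterns = ["*.pyc", "*.log", ".DS_Store"]
filenames = ["joke.py", "joke.pyc", "debug.log", ".DS_Store", "README.md"]


def is_ignored(filename, patterns):
    """Return True if the file name matches any of the glob patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


ignored = [name for name in filenames if is_ignored(name, patterns)]
print(ignored)  # ['joke.pyc', 'debug.log', '.DS_Store']
```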

## Undoing changes

In this section, we will discuss the available 'undo' Git strategies and commands. It is first important to note that Git does not have a traditional 'undo' system like those found in a word processing application. It will be beneficial to refrain from mapping Git operations to any traditional 'undo' mental model. Additionally, Git has its own nomenclature for 'undo' operations that it is best to leverage in a discussion. This nomenclature includes terms like reset, revert, checkout, clean, and more.

First, we'll need to find the ID of the revision we want to see. This can be achieved by the command: `git log --oneline`. Let's say your project history looks like this:

```
b7119f2 Continue doing crazy things
872fa7e Try something crazy
a1e8fb5 Make some important changes to hello.txt
435b61d Create hello.txt
9773e52 Initial import
```

We can use git checkout to view the commit as follows: `git checkout a1e8fb5`. This makes your working directory match the exact state of the `a1e8fb5` commit. You can look at files, compile the project, run tests, and even edit files without worrying about losing the current state of the project. Nothing you do in here will be saved in your repository. To continue developing, you need to get back to the “current” state of your project: `git checkout master` (assuming you're developing on the default `master` branch). Once back in the `master` branch, you can use either `git revert` or `git reset` to undo any undesired changes.


# Syncing

The `git remote` command is one piece of the broader system which is responsible for syncing changes. Records registered through the `git remote` command are used in conjunction with the `git fetch`, `git push`, and `git pull` commands, each of which has its own syncing responsibilities.

The `git remote` command lets you create, view, and delete connections to other repositories. Remote connections are more like bookmarks rather than direct links into other repositories. Instead of providing real-time access to another repository, they serve as convenient names that can be used to reference a not-so-convenient URL.

For example, the following diagram shows two remote connections from your repo into the central repo and another developer’s repo. Instead of referencing them by their full URLs, you can pass the origin and john shortcuts to other Git commands.

<figure>
  <img src="https://www.atlassian.com/dam/jcr:df13d351-6189-4f0b-94f0-21d3fcd66038/01.svg" align="center"/>
  <figcaption><b>Figure 3:</b> Git remote connections</figcaption>
</figure>

We can view Git remote configurations by using: `git remote`. To create a new connection to a remote repository, we do: `git remote add <name> <url>`. To remove a remote link, one can do: `git remote rm <name>` (`rm` is meant to evoke the similar functionality of the UNIX program `rm` which is used to delete files and directories). Lastly, we can _rename_ a remote repository by running `git remote rename <old-name> <new-name>`.

When you clone a repository with `git clone`, it automatically creates a remote connection called origin pointing back to the cloned repository. This is useful for developers creating a local copy of a central repository, since it provides an easy way to pull upstream changes or publish local commits. This behavior is also why most Git-based projects call their central repository origin.

## Git fetch & pull

The `git fetch` command downloads commits, files, and refs from a remote repository into your local repo. Fetching is what you do when you want to see what everybody else has been working on.

It’s similar to `svn update` in that it lets you see how the central history has progressed, but it doesn’t force you to actually merge the changes into your repository. Git isolates fetched content from existing local content, so it has absolutely no effect on your local development work. Fetched content has to be explicitly checked out using the `git checkout` command. This makes fetching a safe way to review commits before integrating them with your local repository.

When downloading content from a remote repo, the `git pull` and `git fetch` commands are both available to accomplish the task. You can consider `git fetch` the 'safe' version of the two commands. It will download the remote content but not update your local repo's working state, leaving your current work intact. `git pull` is the more aggressive alternative: it will download the remote content for the active local branch and immediately execute `git merge` to create a merge commit for the new remote content. If you have pending changes in progress, this can cause conflicts and kick off the merge conflict resolution flow.
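
The contrast between the two commands can be sketched with a toy model in which each branch is a list of commit IDs. Everything here is a simplifying assumption made for illustration: the commit IDs are invented, and the "merge" is a trivial append, whereas real merges are considerably more involved.

```python
# Hypothetical commit IDs; each list is a branch's history, oldest first
remote_branch = ["c1", "c2", "c3"]   # the branch as it exists on the remote
tracking_branch = ["c1"]             # your local record of the remote branch
local_branch = ["c1"]                # the branch you are working on


def fetch():
    """Download new commits; only the local record of the remote changes."""
    tracking_branch[:] = remote_branch


def pull():
    """Fetch, then merge the fetched commits into the local branch."""
    fetch()
    for commit in tracking_branch:
        if commit not in local_branch:
            local_branch.append(commit)


fetch()
after_fetch = list(local_branch)
print(after_fetch)    # the local branch is untouched by the fetch
pull()
print(local_branch)   # the local branch now includes the remote commits
```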


## Uploading local changes

The `git push` command is used to upload local repository content to a remote repository. Pushing is how you transfer commits from your local repository to a remote repo. It's the counterpart to `git fetch`, but whereas fetching imports commits to local branches, pushing exports commits to remote branches. Remote branches are configured using the `git remote` command. Because pushing has the potential to overwrite changes, caution should be taken when pushing.

# Exercise: Creating a Git repository

We're going to create a simple repository for a project that tells jokes. The code entries are meant to be executed in the terminal, and hence cannot be run from within Jupyter. It's strongly encouraged that you follow along in your own terminal with the presenter, or on your own later when you can.

1. First create a project directory, move into it, and initialize it as a Git repository:
`mkdir <project-dir>`
`cd <project-dir>`
`git init`

2. Set up the project with a config (see relevant section)
`git config --local user.name "Firstname Lastname"`
`git config --local user.email "user@example.com"`

3. Create our first file, `joke.py` and make the file executable:
`touch joke.py`
`chmod +x joke.py`

4. Add the following code which tells a simple joke when run:

```python
#!/usr/bin/env python3
#
# Description: this program implements a joke-telling machine.
def simple_joke():
    # adjacent string literals are joined into a single string
    print("X: Knock, knock.\n"
          "Y: Who's there?\n"
          "X: Boo.\n"
          "Y: Boo who?\n"
          "X: Don't cry - it's just a joke!")

def main():
    simple_joke()


if __name__ == '__main__':
    main()
```

When this is run with `./joke.py`, we obtain the following output:

```
X: Knock, knock.
Y: Who's there?
X: Boo.
Y: Boo who?
X: Don't cry - it's just a joke!
```

5. If we check the status of the project (`git status`), we'll see that `joke.py` is listed as an untracked file. Let's add and commit it: `git add --all` and `git commit -m "Initial commit"`.

6. After saving our project, we realize that we can make it better by adding an element of randomness to it. To begin, let's create a new branch: `git branch <newbranch>`.

7. Switch to the new branch: `git checkout <newbranch>`

8. Add the ability to select a random joke (copy the text below into `joke.py`):

```python
import os
import csv
import random
import urllib.request


# These are elements of the URL that are built up
GITHUB_URL = "https://raw.githubusercontent.com"
GITHUB_USER = "amoudgl"
GITHUB_REPO = "short-jokes-dataset"
GITHUB_BRANCH = "master"
GITHUB_FILENAME = "data/reddit-cleanjokes.csv"
JOKES_URL = "{url}/{user}/{repo}/{branch}/{filename}".format(url=GITHUB_URL,
                                                             user=GITHUB_USER,
                                                             repo=GITHUB_REPO,
                                                             branch=GITHUB_BRANCH,
                                                             filename=GITHUB_FILENAME)


# this is the full URL:
# https://raw.githubusercontent.com/amoudgl/short-jokes-dataset/master/data/reddit-cleanjokes.csv


def simple_joke():
    # adjacent string literals are joined into a single string
    print("X: Knock, knock.\n"
          "Y: Who's there?\n"
          "X: Boo.\n"
          "Y: Boo who?\n"
          "X: Don't cry - it's just a joke!")


def random_joke(url, skip_lines=0):
    # download the data only if it hasn't been downloaded before
    filepath = os.path.basename(url)
    if not os.path.exists(filepath):
        response = urllib.request.urlopen(url)
        data = response.read().decode('utf-8')
        with open(filepath, 'w', encoding='utf-8') as output:
            output.write(data)

    # read the jokes from the second column of the local copy
    jokes = list()
    with open(filepath, 'r', encoding='utf-8', newline='') as csv_file:
        csv_reader = csv.reader(csv_file)
        # skip leading rows (e.g. a header) if requested
        for _ in range(skip_lines):
            next(csv_reader, None)
        for row in csv_reader:
            joke = row[1]
            jokes.append(joke)
    print(random.choice(jokes))


def main():
    random_joke(JOKES_URL)


if __name__ == '__main__':
    main()
```

9. Add and commit these changes.

10. Exercise: create a new branch where you modify the file to select other jokes from the internet. Add and commit those changes. Examine the branches and commits using `git branch`, `git log`, and `git blame <file>`.

# Exercise: the "laff box"

So far we've created a nice joke-telling machine that can keep us chuckling for some time.

<figure>
    <img src="https://d1yn1kh78jj1rr.cloudfront.net/image/thumbnail/r9s_LF4-givf6659i/storyblocks-stand-up-comedy-cartoon-theme-vector-art-illustration_rjbbNDiqN_thumb.jpg" align="center">
  <figcaption><b>Figure 4: </b><i>If a comedian tells a joke and no one is around to hear it, is it funny?</i></figcaption>
</figure>

Although people aren't trees and we laugh at our own jokes, it can be argued that we need an audience for a joke to be considered funny - i.e. no joke is truly funny unless it elicits an external response. For this exercise, we'll do exactly that. We'll follow the footsteps of Charles Douglass, the creator of the laugh track, and build our own. First, some history.

Pick any sitcom on TV today. Chances are that it has a laugh track accompanying the show. Before the invention of the laugh track, early sitcoms in the 1950s were filmed live in front of a studio audience so that their laughter and reactions could be recorded for the public at home. However, this was difficult to do reliably for a variety of reasons - distance of audience members to the stage, acoustics of the studio, etc. - and audiences often did not laugh at the appropriate time.

[Charles Douglass](https://en.wikipedia.org/wiki/Charles_Douglass), an enterprising sound engineer at CBS Radio, noticed this problem and came up with a solution: _insert laughter when needed to engender the response to a joke_. He did this by devising an analog machine that could simulate various pitches and lengths of human laughter. By the early 1960s, when live studio audiences had become unpopular, Douglass's machine was used to simulate audience laughter in a wide variety of comedies. Today, his original device is no longer in use, but his idea is very much alive in the device's digital analogue - a laptop computer that contains hundreds of human sounds.

For this exercise, we'll create a laugh track machine. Before we do, let's look at laughter as a human activity. Written out as text, there are a few forms of writing laughter (we're not going to use LOL, LMAO, and other variants as they tersely describe a humorous response):

1. "Ha ha ha..."
2. "He he he ..."
3. "Hi hi..." (uncommon)
4. "Huh huh" (uncommon)
5. "Ho ho ho" (rare, unless you're Santa)
6. "Hyuck Hyuck" (rare, unless you're Goofy)

Notably, most of the base responses ("ha", "he", "hi", "ho", etc.) tend to be repeated at least twice. A single one is often considered sarcastic. We also note a few more observations:

1. Longer jokes tend to elicit a longer and more intense response. We can describe intense responses by using uppercase.
2. Shorter jokes (often those involving puns) tend to evoke shorter responses.
3. Laughter written out as text can sometimes have no spaces in-between them, to indicate the shortness in time between responses.
4. Humor by definition is not universal as it's open to interpretation. As a result, there is a probability that our joke will generate groans in addition to laughs.

For this exercise we'll do items 1, 2, and 3. Item 4 is left for the reader.
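
As a starting point, the observations about written laughter can be turned into a small formatting helper. This is one possible sketch rather than the intended solution: the function name `render_laugh` and its parameters are invented here, and how to map a joke's length to `repeats` and `intense` is left open.

```python
def render_laugh(base, repeats=2, intense=False, spaced=True):
    """Write out laughter such as 'Ha ha ha', 'HA HA HA' or 'Hahaha'."""
    repeats = max(2, repeats)          # a single 'ha' reads as sarcastic
    separator = " " if spaced else ""  # no spaces suggests rapid laughter
    text = separator.join([base] * repeats)
    # uppercase marks an intense response
    return text.upper() if intense else text.capitalize()


print(render_laugh("ha", 3))                 # Ha ha ha
print(render_laugh("he", 1, spaced=False))   # Hehe
print(render_laugh("ho", 4, intense=True))   # HO HO HO HO
```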

Here are some useful bits of code that we can use for this process:

```python
import random
SEED = 42  # choose your number here. To obtain reproducible results, use a fixed seed
random.seed(SEED)
```

If you want your pseudo-random results to differ on each run, set the seed to the current UNIX timestamp. Note: for scientific purposes this kind of random number generation [can be problematic as the OS can limit which source is used for randomness](https://stackless.readthedocs.io/en/3.6-slp/library/random.html#random.seed)! For our purposes, this is sufficient.

```python
# alternatively, seed with `datetime.now().timestamp()` - requires `from datetime import datetime`
import random
import time
random.seed(time.time())
```
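
The effect of a fixed seed can be checked directly: two generators created with the same seed produce the same sequence of choices. The sketch below uses a separate `random.Random` instance so that it doesn't disturb the module-level generator; the helper name `draw_laughs` is made up for the example.

```python
import random


def draw_laughs(seed, count=5):
    """Draw `count` laughs from a generator initialized with `seed`."""
    rng = random.Random(seed)
    laughs = ['ha', 'he', 'hi', 'huh', 'ho', 'hyuck']
    return [rng.choice(laughs) for _ in range(count)]


first = draw_laughs(42)
second = draw_laughs(42)
print(first == second)  # True: the same seed gives the same sequence
```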




# Code examples

```python
import random
import numpy as np


# To randomly select from a sequence we can use `random.choice`:
laughs = 'ha,he,hi,huh,ho,hyuck'.split(',')
random_laugh = random.choice(laughs)
print('Random laugh: ', random_laugh)


# Since we want some of the laughter to be selected from a distribution, we can
# sample a Gaussian and use the three-sigma rule to define boundaries for the
# selections: the further the sample falls from the mean, the rarer the laugh.
# Note: NumPy has its own generator, so seed it with `np.random.seed(SEED)`
# if you need reproducible draws.
def laughs_from_distribution(mu=0.0, sigma=1.0):
    # distance of a Gaussian sample from the mean
    deviation = abs(np.random.normal(mu, sigma) - mu)
    laughs = 'ha,he'.split(',')          # within two sigma: common
    if 2.0 * sigma <= deviation < 3.0 * sigma:
        laughs = 'hi,huh'.split(',')     # between two and three sigma: uncommon
    elif deviation >= 3.0 * sigma:
        laughs = 'hyuck,ho'.split(',')   # beyond three sigma: rare
    return laughs


print('Laughs from distribution: ', laughs_from_distribution())


# Exercise: the following is just a suggestion - i.e. you are not limited
# by it. Adapt / modify as you see fit.
def laugh_machine():
    # Possible steps:
    # 1) select random joke
    # 2) determine joke length
    # 3) generate laughter to match / exceed the joke length
    pass
```

Sample output (the draws are random, so yours will differ):

```
Random laugh:  hyuck
Laughs from distribution:  ['hi', 'huh']
```
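
To see how often each group of laughs is selected by `laughs_from_distribution`, the probabilities behind the three-sigma rule can be computed from the error function. This is a short sketch for a standard normal distribution; the helper name `prob_within` is invented for the example.

```python
import math


def prob_within(k):
    """Probability that a standard normal sample lies within k sigma of the mean."""
    return math.erf(k / math.sqrt(2.0))


p_common = prob_within(2.0)                       # 'ha' / 'he'
p_uncommon = prob_within(3.0) - prob_within(2.0)  # 'hi' / 'huh'
p_rare = 1.0 - prob_within(3.0)                   # 'hyuck' / 'ho'

print(f"common: {p_common:.4f}, uncommon: {p_uncommon:.4f}, rare: {p_rare:.4f}")
# common: 0.9545, uncommon: 0.0428, rare: 0.0027
```

In other words, roughly 95% of the draws land in the common group, so the rarer laughs only show up occasionally.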


# References

1. https://www.atlassian.com/git/tutorials
2. https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
