# DS107 Big Data : Lesson Nine Companion Notebook

### Table of Contents <a class="anchor" id="DS107L9_toc"></a>

* [Table of Contents](#DS107L9_toc)
    * [Page 1 - Overview of this Module](#DS107L9_page_1)
    * [Page 2 - Git Commits](#DS107L9_page_2)
    * [Page 3 - Creating and Cloning A Repository](#DS107L9_page_3)
    * [Page 4 - Using your Repo](#DS107L9_page_4)
    * [Page 5 - Local vs Remote](#DS107L9_page_5)
    * [Page 6 - Making Commits](#DS107L9_page_6)
    * [Page 7 - What is a Branch?](#DS107L9_page_7)
    * [Page 8 - Pulling](#DS107L9_page_8)
    * [Page 9 - Key Terms](#DS107L9_page_9)
    * [Page 10 - Lesson 9 Practice Hands-On](#DS107L9_page_10)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS107L9_page_1"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Source Control using Git
VimeoVideo('388353003', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO110L02overview.zip)**.

# Introduction

Have you ever tried to share files on a project with someone and found it a frustrating experience? You may have had multiple copies of nearly identical documents and couldn't figure out which one was the most recent. Maybe you've lost files in transit over email. Maybe you lost tons of time communicating about every little detail or working in lockstep, doing all the work together. Whatever the exact situation, wouldn't it be nice to avoid that whole rigmarole? Well, guess what: you can! *Git* is a program for version control, and its purposes are manifold. Git can help you:

* Share files back and forth with your team seamlessly
* Back up your files and easily revert to a previous version if needed
* Showcase your work to potential employers

By the end of this lesson, you should be able to:
* Understand why using Git is an advantage
* Set up a GitHub account
* Create your own repository
* Commit to that repository
* Understand branching

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/465837537">recorded live workshop</a> on the concepts in this lesson.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Git Commits<a class="anchor" id="DS107L9_page_2"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Git Commits
VimeoVideo('294206354', width=720, height=480)


# Source Control - Git

One of the most significant problems in working with a team is coordinating the changes made by its members. As a result, almost all projects use some form of *source control management (SCM)* system. An SCM system coordinates the changes made by multiple people within a project so that no work is lost.

One of the most popular SCM systems is *Git*. Git is used extensively throughout the software industry and is increasingly used in data science as well. You will need Git to share files with your teammates and to post your final project so that potential employers can see your portfolio, so getting familiar with Git now is a great idea.

---

## GitHub

Although there are many, many ways to interact with Git, including through the command line, you are going to be using two components. The first is *GitHub*, which is a website that houses all of your work and allows you to share your project with the world. The second is *GitKraken*, which is a GUI that allows you to perform the actual work of file sharing. You will need both in this lesson, but it all starts with GitHub. Please take the time now to create and register for your **<a href="https://github.com" target="_blank">GitHub account</a>**.  

---

## Creating a Repository

To start learning Git, you will need a *repository*. A repository, or repo for short, is a shared area where your files for a particular project live. On your GitHub home page, there is a button to create a repository. You can basically think of this as creating a new project. 

![Create repo button](Media/github-create-button.png)

Give the repo a name, and leave it public. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>GitKraken, the software you'll be using with GitHub, is only free if you make your repository public. So even if you're a little nervous about having your work out there in the world, you'll need the public setting to follow along with the lessons.</p>
    </div>
</div>

Next, check the box that says `Initialize this repository with a README`. A *readme* is a file that contains information about the repository and project, and is ideally read first before going through any of the other work in the repo - hence the name "read me." 

Checking this box will add an initial commit to the repo. In this case, the commit contains a single file named `README.md`. A *commit* is a snapshot of a collection of files at a point in time. Making a new commit does not alter the commits that came before it; each one is saved along the way. Think of saving a Word document on your computer. Every time you press the little floppy disk icon to save the file, it writes over the previous version. But what if you could preserve the point in time when you pressed the save button, every single time? That is the essence and magic of Git: it does exactly that.
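
To make the snapshot idea concrete, here is a minimal Python sketch. It is a toy model and not how Git stores data internally: the `SnapshotHistory` class, its method names, and the file contents are all made up for illustration.

```python
import copy


class SnapshotHistory:
    """Toy model: every commit keeps a full snapshot instead of overwriting."""

    def __init__(self):
        self.commits = []  # oldest first

    def commit(self, files, message):
        # Store an independent copy so later edits cannot alter this snapshot.
        self.commits.append({"message": message, "files": copy.deepcopy(files)})
        return len(self.commits) - 1  # the index acts as a simple commit id

    def checkout(self, commit_id):
        # Return the files exactly as they were at that commit.
        return copy.deepcopy(self.commits[commit_id]["files"])


files = {"README.md": "# My project\n"}
history = SnapshotHistory()
first = history.commit(files, "Initial commit")

files["README.md"] += "More details.\n"
second = history.commit(files, "Expand the README")

# The earlier snapshot is untouched by the later change.
print(history.checkout(first)["README.md"])
print(history.checkout(second)["README.md"])
```

Unlike pressing save in a word processor, the second commit does not replace the first; both remain available.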

Although a repo can technically be empty, many tools do not handle this situation well, so it is generally a good idea to start off with an initial commit.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Creating and Cloning A Repository<a class="anchor" id="DS107L9_page_3"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Creating and Cloning A Repository
VimeoVideo('294249243', width=720, height=480)


# Using GitKraken

Now that you have a repository set up, it's time to do something with it! And for that to happen, you need additional software - GitKraken. 

While the command line interface is very powerful, a graphical client has many advantages. It lets you visualize the repo, which makes Git much easier to learn and use.

You will be using the *GitKraken* client, which is available for Windows, macOS, and Linux for free, as long as you keep your repositories public. There are many graphical clients, but the visualization of repos in GitKraken makes it ideally suited for learning how various Git operations affect repos.

Go ahead and download GitKraken here: **<a href="https://gitkraken.com" target="_blank">GitKraken Download</a>**

You will need to register, but you can use your GitHub account information you created earlier. This will connect GitKraken to your GitHub account, which will make interacting with GitHub much easier.

Once this is done, go ahead and open up the GitKraken program on your computer. You should not be in your web browser anymore. 

---

## Cloning your Repository

The process of copying the contents of a repo from one location to another is called *cloning*. When you clone a repo, you get a full copy of it on your computer. You can clone from one directory to another or across different computers; to Git it's all the same. However, most of the time you will clone from GitHub to your local computer, using GitKraken.

With GitKraken open, first click on `File` and then `Clone Repo`: 

![Cloning in GitKraken](Media/Clone1.png)

This should bring you to a menu like this:

![Cloning in GitKraken](Media/Clone2.png)

You want to clone from `GitHub.com`. `Where to clone to` is the location on your computer where you want the repository to live. You should be able to select the repository you created on GitHub easily in the `Repository to clone` section. Once you have selected a repository to clone, you will see that the `Clone the repo!` button turns green. Go ahead and push it! Who doesn't want to push buttons?

You'll see a box in the far left corner of GitKraken pop up to show your progress in cloning. Since you started a new repository, and it's basically empty, this window shouldn't flash for more than a few seconds. If you're cloning an existing repository when being added to a project, this step can take longer.

This process is done when you see a banner along the top of GitKraken, telling you that you have `Successfully cloned repo`. Go ahead and click that green `Open Now` button so you can learn and explore. 

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Using your Repo<a class="anchor" id="DS107L9_page_4"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [4]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Using your Repo
VimeoVideo('294256528', width=720, height=480)


# Using your Repo

When prompted, choose to open the cloned repo. You will see the repo with your initial commit. This repo has a single branch named `master`.

![New repo](Media/gk-new-repo.png)

---

## Anatomy of a Commit

Each row in the GitKraken graph is a *commit*. Again, a commit represents a change to the files in your repo. You'll now learn a little about the different elements of a commit before diving in and making a commit yourself.

![Commit line](Media/gk-commit.png)

The check mark indicates that the files on your hard drive are from the `master` branch. Currently, there is only one branch. It means all your files are in the same place. You'll learn about branching soon, which can add additional places to work on files. 

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Local vs Remote<a class="anchor" id="DS107L9_page_5"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [5]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Source Control using Git
VimeoVideo('294256528', width=720, height=480)


# Local vs. Remote

See how there are two icons next to `master`? One's a laptop, and the other is a picture. Yours might have just an icon, but it's still meant to be a representation of your GitHub portfolio in some way. The little laptop icon represents what is called your *local* information. It's what's on your computer currently at this time. The other icon represents something called *remote*. You can think of this as the cloud, where your data is stored and connected to GitHub. 

You can see the same icons on the left-hand side:

![Branch list panel](Media/gk-branches.png)

Your goal, always, is to get your work on your local computer up to the great remote data place in the sky. When the icons are on the same line, eureka, you've done it. When they are on separate lines, your latest work is not yet what others will see if they look on GitHub or at their own copy of the repo.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Making Commits<a class="anchor" id="DS107L9_page_6"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Making Commits
VimeoVideo('294220774', width=720, height=480)

# Making Commits

OK, now start making some additions to your repo and see what happens. First, create a new file in the directory where you cloned the repo. You'll start with something really simple: a text file. But know that you can create any kind of file in the folder on your computer where your repo lives. It could be an RStudio script, a Python `.ipynb` notebook, or even a PowerPoint.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You can also drag and drop or copy existing files into your repo, just like any other folder on your computer.</p>
    </div>
</div>

In this case, you'll create a text file named `newFile.txt`. While you're here, change something in the `README.md` file that was part of the initial commit.

Once you've done that on your computer, head back to GitKraken, and it will show on the right in blue that you have made changes to your repo, like the image below:

![Work in progress section](Media/gk-wip.png)

When you have pending changes that have not been committed to the repo yet, you'll see a dotted circle at the top. If you click on this, you can see all the changes you have made. In this case, there is a file that has been added:

![New file](Media/gk-new-file.png)

And a file that has been modified:

![Modified file](Media/gk-modified.png)

You can click on either file to see the text that has been added (green) or deleted (red). This works well for simple `.txt` files and for code saved as an RStudio script or as a `.py` file in Visual Studio Code. As a heads-up, though, it won't look pretty if you are viewing an `.ipynb` file in GitKraken, or files from the Microsoft Office suite such as Excel, PowerPoint, or Word. Don't worry! They will still update and work just fine, and others will be able to see and use them without a problem.
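
If you are curious what an added/deleted view is built from, Python's standard `difflib` module can produce a similar line-by-line comparison. The sketch below is only an illustration: the before and after contents of `README.md` are made up, and GitKraken's own diff view may be computed differently.

```python
import difflib

# Hypothetical before/after contents of README.md, one list item per line.
before = ["# My project", "A short description."]
after = [
    "# My project",
    "A longer, clearer description.",
    "Contact: see the team page.",
]

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(
    difflib.unified_diff(
        before,
        after,
        fromfile="README.md (before)",
        tofile="README.md (after)",
        lineterm="",
    )
)
for line in diff:
    print(line)

# Separate the changes, skipping the "---" and "+++" file header lines.
added = [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]
removed = [line[1:] for line in diff if line.startswith("-") and not line.startswith("---")]
```

Lines in `added` correspond to what GitKraken would show in green, and lines in `removed` to what it would show in red.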

These two files have been changed on your computer but are not yet part of the repository. They are in limbo. To add them to your repository and save those changes, the files must be staged and committed. You can *stage* files one at a time or use the `Stage all changes` button. *Staging* a file just gets it ready to go: it moves the file from the `Unstaged Files` section to the `Staged Files` area. Add a commit message and then click the `Commit changes to 2 files` button. Doing this will create a new commit on the local `master` branch.
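
The stage-then-commit flow can be pictured with a small Python sketch. This is a toy model under simplifying assumptions, not Git's real staging area: the `StagingArea` class and its methods are invented for illustration and only track file names.

```python
class StagingArea:
    """Toy model of the Unstaged Files -> Staged Files -> commit flow."""

    def __init__(self, changed_files):
        self.unstaged = set(changed_files)  # changed on disk, not yet staged
        self.staged = set()                 # ready to go into the next commit
        self.commits = []                   # list of (message, file names)

    def stage(self, name):
        # Move one file from unstaged to staged.
        self.unstaged.remove(name)
        self.staged.add(name)

    def stage_all(self):
        # Move every remaining unstaged file to staged.
        self.staged |= self.unstaged
        self.unstaged.clear()

    def commit(self, message):
        if not self.staged:
            raise ValueError("nothing staged to commit")
        self.commits.append((message, sorted(self.staged)))
        self.staged.clear()


area = StagingArea(["newFile.txt", "README.md"])
area.stage("newFile.txt")   # stage one file at a time...
area.stage_all()            # ...or everything that is left
area.commit("Add newFile.txt and update README")
print(area.commits)
```

Only staged files end up in the commit, which is why a file left in the unstaged list would stay in limbo.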

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Your commit messages will be publicly visible on GitHub, so make sure you keep them clean and professional.</p>
    </div>
</div>

You can now see that the local master is ahead of the origin (aka GitHub) master:

![Two commits on master branch](Media/gk-two-commits.png)

If you wanted to share these changes with team members or wanted to access them from another machine, you could now send the new commit back to your repo on GitHub. To do this, click the Push button at the top.

![GK push button](Media/gk-push.png)

Once this completes, you'll notice that the icon for the remote has been updated and now points to the same commit as on your local machine.

![After pushing changes](Media/gk-after-push.png)

After you `commit` to your local repository, you will want to *push* that commit to the shared repository on GitHub so that the other developers can access your changes. Pushing makes a change available to everyone with access to the repo. However, others won't get your latest updates until they perform a *pull*, which grabs your changes and downloads them to their local computer. You'll need to pull to see other people's changes as well.
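
Here is a small Python sketch of how clone, push, and pull relate. It is a deliberately simplified toy: each repo is just a list of commit messages, the function names are made up, and it assumes nobody's history has diverged.

```python
import copy

# Toy model: a "repo" is a list of commit messages, oldest first.
github = ["Initial commit"]  # the remote copy on GitHub (origin)


def clone(remote):
    # Cloning gives each person a full, independent copy of the history.
    return copy.deepcopy(remote)


def push(local, remote):
    # Send the commits the remote does not have yet.
    remote.extend(local[len(remote):])


def pull(local, remote):
    # Download the commits the local copy does not have yet.
    local.extend(remote[len(local):])


yours = clone(github)
teammate = clone(github)

yours.append("Add newFile.txt")  # a local commit: GitHub is unchanged so far
print(github)

push(yours, github)              # now GitHub has your commit
print(teammate)                  # your teammate still has the old history

pull(teammate, github)           # your teammate downloads the new commit
print(teammate)
```

The key point is that committing, pushing, and pulling are three separate steps, and your teammate only sees your work after all three have happened.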

Congratulations! You have completed the basic process of working with source control! Now, it is time to look at some more advanced scenarios involving branches and merging.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - What is a Branch?<a class="anchor" id="DS107L9_page_7"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [7]:
from IPython.display import VimeoVideo
# Tutorial Video Name: What is a Branch?
VimeoVideo('294228371', width=720, height=480)


# What is a Branch?

When you have multiple commits in a repo, keeping track of where you want to make new commits can be confusing. Git solves this with something called a *branch*. Branches are named labels that point to a particular commit.

When you make a new commit, you do so on one of the branches. By default, there is a `master` branch, as you have so far utilized, but you can make up any branches that you like. When the new commit is made, the current branch is automatically updated to point at the new commit.
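
The idea that a branch is just a label pointing at a commit can be sketched in a few lines of Python. This is a toy model, not Git's actual data structures: the `TinyRepo` class, the `c1`-style commit ids, and the second branch name are assumptions made for illustration.

```python
class TinyRepo:
    """Toy model: commits know their parent; branches are names pointing at commits."""

    def __init__(self):
        self.parents = {}                 # commit id -> parent commit id
        self.branches = {"master": None}  # branch name -> commit id
        self.current = "master"           # the checked-out branch
        self._next_id = 1

    def commit(self):
        commit_id = f"c{self._next_id}"
        self._next_id += 1
        self.parents[commit_id] = self.branches[self.current]
        self.branches[self.current] = commit_id  # only the current branch moves
        return commit_id

    def create_branch(self, name, at):
        # A new branch is just a new label on an existing commit.
        self.branches[name] = at

    def checkout(self, name):
        self.current = name


repo = TinyRepo()
first = repo.commit()    # c1 on master
second = repo.commit()   # c2 on master
repo.create_branch("dev", at=first)
repo.checkout("dev")
third = repo.commit()    # c3 on dev, with c1 as its parent
print(repo.branches)
```

Notice that committing on `dev` leaves `master` pointing where it was; each label moves only when you commit while that branch is checked out.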

While you won't really need to make use of branches in order to work with a final project team, it's a good thing to learn about them, in case your new workplace uses them. A particularly common Git workflow is to create a new branch for each feature, or large section, that is added to a project. This new branch allows you to keep track of the changes made for the new feature without impacting the main branch that other data scientists are using to make small modifications and bug fixes. You'll need to understand how to switch branches to make edits and then merge those edits back into a "mainline" or "trunk" branch. The main branch is usually named `master` as the default.

---

## Branching in GitKraken

You should not be hesitant to make a new branch when it is helpful. To accomplish this, right-click on any commit and select `Create branch here`. You can create a branch at any commit.  In the example below, the branch named `dev` is added at the initial commit.

![New branch](Media/gk-two-branches.png)

Since the current branch, `dev`, points at the initial commit, the `README.md` file in the working directory will have its original contents, and `newFile.txt` will not exist. Those changes are safely recorded in the commit that `master` points to, but the working directory reflects whichever branch is currently checked out, in this case `dev`.

You can experiment with this.  Open your file browser to the directory where your repo is located and look at the files. Switch branches and you will see the contents change.

Switch back to the `dev` branch and make some changes to the `README.md`. This time, instead of changing the existing line, add a new line. Once that is done, commit the changes.

![After commit on the second branch](Media/gk-two-branches2.png)

Great! You have some new work done on an independent branch. If you were working in a team, this might be work you are putting into a new project that is not yet ready to be run by everyone. At some point, though, you are going to want to share what you have done with the rest of the team. This is done by merging your branch into the main working branch of your project. In your case, you will merge `dev` into `master`. You can do this by dragging `dev` onto `master` and choosing `merge dev into master`. When you do this, you will get a message about needing to switch branches. In Git, you must be on the branch that is being merged into, so you need to check out `master` first.

![Merge conflict](Media/gk-merge-conflict.png)

GitKraken lets you know that you have a conflict and thus the merge has not taken place yet. This is one of the most stress-inducing parts of working with Git. But don't worry, once you resolve a few merges they will be far less intimidating. Click on the note at the top to see what the problem is.

![GK merge tool](Media/gk-mergetool.png)

It looks like the changes on `master` and the changes on `dev` were not automatically merged. Git is interpreting the `dev` branch as having touched line 1 since a newline was inserted at the end. This forces Git to ask, "what did you mean to have happened here?" 

To resolve this, you will need to include the first line from `master` and the second line from `dev`. First, select the box in the `master` column; this will add the first line to the output window. Next, select just the second line from `dev` by hovering over that line and clicking the green plus icon.

![GK merge tool](Media/gk-mergetool2.png)

The result of this should be that the output window shows 2 lines, one coming from `A` which in this case is `master` and the other coming from `B` which in this case is `dev`.

![Merged output](Media/gk-output-window.png)
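
To see why a merge can stop and ask for help, here is a rough Python sketch of a three-way comparison. It compares lines by position only, which is far simpler than what Git really does, and the function name and file contents are hypothetical.

```python
from itertools import zip_longest


def three_way_merge(base, ours, theirs):
    """Compare three versions line by line; return (merged_lines, conflicts)."""
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip_longest(base, ours, theirs), start=1):
        if o == t:
            chosen = o                        # both branches agree
        elif o == b:
            chosen = t                        # only their branch changed this line
        elif t == b:
            chosen = o                        # only our branch changed this line
        else:
            conflicts.append((number, o, t))  # both changed it differently
            chosen = None
        if chosen is not None:
            merged.append(chosen)
    return merged, conflicts


# Hypothetical README contents: the common starting point and each branch.
base = ["Original first line"]
master = ["First line edited on master"]
dev = ["First line edited on dev", "Second line added on dev"]

merged, conflicts = three_way_merge(base, master, dev)
print(conflicts)

# Resolve by hand: keep master's version of line 1, then dev's new line.
resolved = [conflicts[0][1]] + merged
print(resolved)
```

The second line merges cleanly because only `dev` touched it, while the first line needs a human decision because both branches changed it.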

With this complete, click the green `Save` button in the upper-right corner. The `README.md` has moved into the Staged files area, and you can now click `Commit and Merge`. This will create a new commit that looks a bit different and represents a merge.

![After merge](Media/gk-after-merge.png)

Congratulations! You have successfully resolved your first merge conflict! The last step would be to push the merged `master` branch up to GitHub so that other people can access it. Generally, your feature branch would be pushed up to GitHub too.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Pulling<a class="anchor" id="DS107L9_page_8"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Pulling

So far all the changes have been made on your local computer and pushed up to GitHub. When working with others though, they will be pushing their modifications as well. So, what happens if someone else makes a change to master while you're making your own changes? You can think of this situation like the one you just encountered. Merging `origin/master` into `master` is not any different than merging `dev` into `master`. This is the process that often happens when you pull changes from GitHub. To pull down your teammates' changes, all you'll need to do is hit this `pull` button:

![GK pull button](Media/gk-pull.png)

Getting into a regular habit of pulling before you push can be nice, but if you forget, don't worry! GitKraken has your back, and if you don't have the most recent changes, it will let you know and ask if you want to `pull` or `force push`. You absolutely DO NOT want to *force push*, as this will overwrite the other person's work with your own, and their work will be lost. It is not permanently lost, as there are ways to revert (Git is a version control program, after all), but it will be tricky and you'll cause yourself undue stress. So remember, force pushing is off limits except in very special circumstances!

If there are no conflicts between your work and your teammates' work, the pull will happen seamlessly. Still, it is best to commit any changes in your working directory before pulling, which simplifies the overall process.
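
The difference between a normal push, a rejected push, and a force push can be sketched in Python. This is a toy model under stated assumptions: histories are plain lists of commit messages, the `push` function is invented for illustration, and a real pull would merge the two histories rather than simply append your commit.

```python
def push(local, remote, force=False):
    """Toy push: histories are lists of commit messages, oldest first."""
    remote_is_behind = local[:len(remote)] == remote
    if remote_is_behind:
        remote[:] = local   # normal push: only adds new commits
        return "pushed"
    if force:
        remote[:] = local   # force push: replaces the remote history
        return "force pushed"
    return "rejected: pull first"


github = ["Initial commit", "Teammate's analysis"]
yours = ["Initial commit", "Your chart"]

# GitHub has a commit you do not have, so a normal push is refused.
result = push(yours, github)
print(result)

# Safe route: bring in the teammate's commit first, then push.
yours_after_pull = github + ["Your chart"]
safe_remote = list(github)
safe_result = push(yours_after_pull, safe_remote)
print(safe_remote)

# Risky route: a force push drops the teammate's commit from the remote.
forced_remote = list(github)
forced_result = push(yours, forced_remote, force=True)
print(forced_remote)
```

In the risky route, the teammate's commit no longer appears in the remote history, which is exactly the situation the warning above is about.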

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Key Terms<a class="anchor" id="DS107L9_page_9"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list with short descriptions of the central keywords you have learned in this lesson. Please read through it and go back to review any concepts you don't fully understand. Great work!


<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Source Control Management System (SCM)</td>
        <td>A program to help manage different versions of code done by different people.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>GitHub</td>
        <td>A central repository where you can share your work publicly.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>GitKraken</td>
        <td>A GUI to make source control easy!</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Commit</td>
        <td>A snapshot of the contents of one or more files.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Repository (repo)</td>
        <td>A collection of commits.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Readme</td>
        <td>File that contains information about a project and should be read first.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Cloning</td>
        <td>The process of making a copy of your GitHub repository.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Branch</td>
        <td>A marker that points to a particular commit.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Remote</td>
        <td>All the locations that contain copies of the repo that your local repo knows about.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Push</td>
        <td>Make your changes available to the public and other team members.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pull</td>
        <td>Download the most recent changes to a repository to your local machine.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Merge</td>
        <td>Combining the contents of two branches together.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Origin</td>
        <td>The location you cloned the repo from, typically GitHub or another server. It is one specific remote.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Local</td>
        <td>The machine you are performing the Git operation on.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Force Push</td>
        <td>Overwriting the remote repository with your changes, which discards someone else's work there.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Lesson 9 Practice Hands-On<a class="anchor" id="DS107L9_page_10"></a>

[Back to Top](#DS107L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Practice Hands-On

For this Practice Hands-On, you are going to create a repository and push it to GitHub. You will also practice cloning a repo from GitHub. Be sure to complete both parts! This Hands-On will **not** be graded, but you are encouraged to complete it. You will not need to submit this project, as there will not be much to submit. However, it is essential to practice these GitHub skills. There will also be no solution, so if you have questions, please reach out!

----

## Part 1

* Create a new repository on your computer using GitKraken.
* Make a new project on GitHub, **_but don't initialize it yet_**.
* Add a new remote named `origin` to your repository using the SSH URL for the repo provided by GitHub.
* Make a commit to your local repo.
* Push the changes and the `master` branch to `origin`.
* Confirm you can see the commit on GitHub.
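
If you ever want to compare the GitKraken steps with the command line, the sketch below prints roughly equivalent Git commands without running anything. Treat it as an assumption-laden illustration: the SSH URL is a placeholder, the helper name `part_one_commands` is made up, and it assumes your folder already contains at least one file and that your local branch is named `master`.

```python
import shlex


def part_one_commands(ssh_url, branch="master"):
    """Return the Git commands for Part 1 as argument lists (nothing is run)."""
    return [
        ["git", "init"],                              # create the local repo
        ["git", "remote", "add", "origin", ssh_url],  # add the remote named origin
        ["git", "add", "."],                          # stage every file in the folder
        ["git", "commit", "-m", "Initial commit"],    # commit locally
        ["git", "push", "-u", "origin", branch],      # push the branch to origin
    ]


# Hypothetical SSH URL: replace it with the one GitHub shows for your repo.
commands = part_one_commands("git@github.com:your-username/your-repo.git")
for args in commands:
    print(shlex.join(args))
```

Printing the commands first, rather than executing them, lets you check each step against what you did in GitKraken.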

---

## Part 2

* Clone the repo at: **<a href="https://github.com/woz-u/merge-conflict-example" target="_blank"><b>Merge conflict example on GitHub</b></a>**

* Examine the changes made to both the `master` and `creative-interpretation` branches.
  * Merge `creative-interpretation` into `master`.
  * Resolve all the conflicts.
* When you are done, your `Scene 2` file should match **<a href="https://shakespeare.mit.edu/macbeth/macbeth.1.2.html" target="_blank"><b>this Scene 2</b></a>** with the addition of the lines:

    ```text
    (c) 1623 William Shakespeare
    Macbeth source provided by http://shakespeare.mit.edu/macbeth/index.html
    ```

3. Establish a GitHub Organization and Repository.