# Git and GitHub


* Other resources for developing Python coding skills
* Backing up your computer
* Moving your files in and out of the cloud/HPC
* Difference between cloud and HPC?





* NCBI - GenBank, GEO, Popsets, Structure
* DOE's IMG database
* Dryad
* Figshare
* EDI, DataOne
* Open Access publications
* bioRxiv https://www.biorxiv.org/
    Use as an example Aylward's recent paper



* GitHub


* Git on RStudio Cloud using command line
* Git on RStudio using RStudio
* Git on Unity using Jupyter
* Git on Unity or MGHPCC using command line
* Git on your local computer using the command line
* Git on your local computer using RStudio




Using GitHub on the command line
Using GitHub through RStudio

Unity - https://unity.rc.umass.edu/panel/
Has git in the menu

Creating a file backup system
Creating a file archival system




The concept of a project and a folder that contains that project. This is your local (or cloud) repository.
You can have many git folders (repositories) on your computer.


HPC describes a specific set of hardware and software properties to support certain kinds of workloads. Cloud means nothing more than “renting dynamic access to a pool of resources”.

HPC *is* cloud computing.

All the HPC facilities I know of are undeniably “cloud” facilities, at least in the general sense. Their point is to have a large, remote resource pool, that users can dynamically allocate according to their needs. Yes, clouds are usually treated as multi-tenant commercial undertakings. Yes, cloud has tended to focus on non-compute-intensive applications with little east-west communication. Yes, cloud tends to imply virtualization.

You can certainly design a cloud facility to be HPC-friendly - that’s mainly about interconnect. That would be quite a bit of money that would not benefit “normal” cloud loads, though.


The origins of HPC/Grid exist within the academic community where needs arose to crunch large data sets very early on.  






## Version control and Collaboration using Git and GitHub

First, just what are `git` and GitHub?

- __git__: version control software used to track files in a folder (a repository)
    - git creates the versioned history of a repository
- __GitHub__: web site that allows users to store their git repositories and share them with others

### The Git lifecycle

As a git user, you'll need to understand the basic concepts associated with versioned sets of changes, and how they are stored and moved across repositories.  Any given git repository can be cloned so that it exists both locally and remotely.  But each of these cloned repositories is simply a copy of all of the files and change history for those files, stored in git's particular format.  For our purposes, we can consider a git repository just a folder with a bunch of additional version-related metadata.

In a local git-enabled folder, the folder contains a workspace containing the current version of all files in the repository. These working files are linked to a hidden folder (named `.git`) containing the 'Local repository', which contains all of the other changes made to the files, along with the version metadata.
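As a minimal sketch of this layout, the commands below create a throwaway repository in a temporary folder and list its contents. The folder location and the `README.md` file are placeholders chosen for illustration.

```sh
# Create a throwaway folder and turn it into a git repository.
repo="$(mktemp -d)"
git init -q "$repo"

# Add one working file to the workspace.
echo "# Demo project" > "$repo/README.md"

# List everything, including hidden entries: README.md is a working file,
# and .git is the hidden folder that holds the local repository.
ls -A "$repo"
```

Deleting the `.git` folder would leave the working files in place but discard the version history, which is why it is best left alone.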

So, when working with files using git, you can use git commands to indicate specifically which changes to the local working files should be staged for versioning (using the `git add` command), and when to record those changes as a version in the local repository (using the command `git commit`).
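The staging and committing steps might look like the following on the command line. This is a sketch in a temporary repository; the file name, commit message, and the `Example User` identity are placeholders.

```sh
repo="$(mktemp -d)"
git init -q "$repo"
# Identity for this throwaway repository only (placeholder values).
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.org"

echo "first draft" > "$repo/notes.txt"

# Stage the change, then record it as a version in the local repository.
git -C "$repo" add notes.txt
git -C "$repo" commit -q -m "Added first draft of notes"

# Count the versions recorded so far.
n_commits="$(git -C "$repo" rev-list --count HEAD)"
echo "commits: $n_commits"
```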

The remaining concepts are involved in synchronizing the changes in your local repository with changes in a remote repository.  The `git push` command is used to send local changes up to a remote repository (possibly on GitHub), and the `git pull` command is used to fetch changes from a remote repository and merge them into the local repository.

![](images/git-flowchart.png)

- `git clone`: to copy a whole remote repository to local
- `git add` (stage): notify git to track particular changes
- `git commit`: store those changes as a version
- `git pull`: merge changes from a remote repository to our local repository
- `git push`: copy changes from our local repository to a remote repository
- `git status`: determine the state of all files in the local repository
- `git log`: print the history of changes in a repository

Those seven commands are the majority of what you need to successfully use git. But this is all super abstract, so let's explore with some real examples.
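Before moving to GitHub, here is a sketch that runs all seven commands without a network connection. It assumes a bare repository on disk as a stand-in for the remote; the paths, file name, and identity are placeholders.

```sh
# A bare repository on disk stands in for the remote (e.g. GitHub).
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"

# git clone: copy the remote repository to a local folder.
# (git warns that the cloned repository is empty, which is expected here.)
git clone -q "$work/remote.git" "$work/local"
git -C "$work/local" config user.name "Example User"
git -C "$work/local" config user.email "user@example.org"

# git add + git commit: stage a change, then record it as a version.
echo "sample_id,reads" > "$work/local/data.csv"
git -C "$work/local" add data.csv
git -C "$work/local" commit -q -m "Added empty sample table"

# git push: copy the new commit to the remote (-u remembers the remote branch).
git -C "$work/local" push -q -u origin HEAD

# git pull: fetch and merge remote changes (nothing new here, so nothing changes).
git -C "$work/local" pull -q

# git status and git log: inspect the current state and the history.
git -C "$work/local" status --short
git -C "$work/local" log --oneline
```

With a real GitHub repository, the only difference is that the path given to `git clone` is replaced by the repository's URL.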

### Setting up git on your computer (you do not need to do this if you are using R Cloud)

* Install [git](https://github.com/git-guides/install-git).
* Go to [github.com](http://github.com) and create an account. 


Before using git, you need to tell it who you are, also known as setting the global options. The most direct way to do this is through the command line. Newer versions of RStudio have a nice feature where you can open a terminal window in your RStudio session. Do this by selecting Tools -> Terminal -> New Terminal.

A terminal tab should now be open where your console usually is. 

To see if you already have your name and email options set, use this command from the terminal:

```{sh git-config, eval=FALSE}
git config --global --list | grep user
```

* Note: Unless you have previously set up git, this step will print nothing or an error message

To set the global options, type the following into the command prompt, replacing the example with your actual name, and press enter:

```{sh git-name, eval=FALSE}
git config --global user.name "Matt Jones"
```

Next, enter the following line, with the email address you used when you created your account on github.com:

```{sh git-email, eval=FALSE}
git config --global user.email "gitcode@magisa.org"
```

Note that these lines need to be run one at a time.

Finally, check to make sure everything looks correct by entering this command, which will return the options that you have set.

```{sh git-list, eval=FALSE}
git config --global --list | grep user
```
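As an alternative check, each option can be read on its own with `git config --get`. The sketch below prints a "not set" notice (wording chosen here for illustration) instead of an error when an option is missing.

```sh
# Read each option directly; fall back to a notice when it is not set.
name="$(git config --global --get user.name || echo "not set")"
email="$(git config --global --get user.email || echo "not set")"

echo "user.name:  $name"
echo "user.email: $email"
```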

### Note for Windows Users

If you get "command not found" (or similar) when you try these steps through the RStudio terminal tab, you may need to set the type of terminal that gets launched by RStudio. Under some git install scenarios, the git executable may not be available to the default terminal type. Follow the instructions on the RStudio site for [Windows specific terminal options](https://support.rstudio.com/hc/en-us/articles/115010737148-Using-the-RStudio-Terminal#appendixe). In particular, you should choose "New Terminals open with Git Bash" in the Terminal options (`Tools->Global Options->Terminal`).

In addition, some versions of Windows have difficulties with the command line if you are using an account name with spaces in it (such as "Matt Jones", rather than something like "mbjones").  You may need to use an account name without spaces.

### Create a remote repository on GitHub

Let's start by creating a repository on GitHub, then we'll edit some files.

- Log into [GitHub](https://github.com)
- Click the New repository button
- Name it `genomics-course` or something similar
- Create a README.md
- Set the LICENSE to Apache 2.0
- Add a .gitignore file for `R`

You've now created your first repository! It has three files that GitHub created
for you: the README.md file, the LICENSE file, and the .gitignore file.


For simple changes to text files, you can make edits right in the GitHub web interface.  For example,
navigate to the `README.md` file in the file listing, and edit it by clicking on the *pencil* icon.
This is a regular Markdown file, so you can just add text, and when done, add a commit message, and 
hit the `Commit changes` button.  


Congratulations, you've now authored your first versioned commit.  If you navigate back to the GitHub page for the repository, you'll see your commit listed there, as well as the
rendered README.md file.

Let's point out a few things about this window.  It represents a view of the repository that you created, showing all of the files in the repository so far.  For each file, it shows when the file was last modified, and the commit message that was used to last change each file.  This is why it is important to write good, descriptive commit messages.  In addition, the blue header above the file listing shows the most recent commit, along with its commit message, and its SHA identifier.  That SHA identifier is the key to this set of versioned changes.  If you click on the SHA identifier (for example, *810f314*), it will display the set of changes made in that particular commit.
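The same identifier can be looked up from the command line. The sketch below makes one commit in a temporary repository and then shows its SHA and the changes recorded under it; the file contents and identity are placeholders, and the SHA you see will differ from the one above.

```sh
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.org"
echo "# genomics-course" > "$repo/README.md"
git -C "$repo" add README.md
git -C "$repo" commit -q -m "Added README"

# Full SHA of the most recent commit, and the short form that GitHub displays.
sha="$(git -C "$repo" rev-parse HEAD)"
short_sha="$(git -C "$repo" rev-parse --short HEAD)"
echo "$short_sha"

# Show the commit message and the changes recorded under that identifier.
git --no-pager -C "$repo" show "$short_sha"
```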

In the next section we'll use the GitHub URL for the GitHub repository you created to `clone` the repository onto your local machine so that you can edit the files in RStudio.  To do so, start by copying the GitHub URL, which represents the repository location:

![](images/Get_GitHub_link.png)

### Working locally with Git via RStudio (note separate directions for syncing your GitHub repository with R Cloud below)

RStudio knows how to work with files under version control with Git, but only if you are working within an RStudio project folder.  In this next section, we will clone the repository that you created on GitHub into a local repository as an RStudio project.  Here's what we're going to do:

- Create the new project
- Inspect the Git tab and version history
- Commit a change to the README.md file
- Commit the changes that RStudio made
- Inspect the version history
- Add and commit an Rmd file
- Push these changes to GitHub
- View the change history on GitHub

__Create a New Project.__ Start by creating a *New Project...* in RStudio, select the *Version Control* option, and paste the GitHub URL that you copied into the field for the remote repository *Repository URL*.  While you can name the local copy of the repository anything, it's typical to use the same name as the GitHub repository to maintain the correspondence.  You can choose any folder for your local copy; in my case I used the `git-evogeno` folder.
Once you hit `Create Project`, a new RStudio window will open with all of the files from the remote repository copied locally.  Depending on how your version of RStudio is configured, the location and size of the panes may differ, but they should all be present, including a *Git* tab and the normal *Files* tab listing the files that had been created in the remote repository.

You'll note that there is one new file `genomics-course.Rproj`, and three files that we
created earlier on GitHub (`.gitignore`, `LICENSE`, and `README.md`).

### Pulling from and pushing to your GitHub repository with R Cloud

The process of pulling from and pushing to your GitHub repository with R Cloud is similar to using a local installation. The only difference is that when you create a new project, you select the option to create it from a Git repo:

![](images/New_Project_GitHub.png)

Then paste in the GitHub URL for your repository that you copied above.

### Inspecting your new repository

In the *Git* tab, you'll note that two files are listed.  This is the status pane that shows the current modification status of all of the files in the repository. In this case, the `.gitignore` file is listed as *M* for Modified, and `genomics-course.Rproj`  is listed with a *? ?* to indicate that the file is untracked.  This means that git has not stored any versions of this file, and knows nothing about the file. As you make version control decisions in RStudio, these icons will change to reflect the current version status of each of the files.
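The Git tab displays the same information that `git status` reports on the command line. The sketch below reproduces the two states described above in a temporary repository; the file contents and identity are placeholders.

```sh
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.org"

# Commit a .gitignore so that git has a stored version of it.
echo ".Rhistory" > "$repo/.gitignore"
git -C "$repo" add .gitignore
git -C "$repo" commit -q -m "Added .gitignore"

# Modify the tracked file and create a new, untracked one.
echo ".Rproj.user" >> "$repo/.gitignore"
echo "Version: 1.0" > "$repo/genomics-course.Rproj"

# Short status: " M" marks a modified file, "??" an untracked file.
status="$(git -C "$repo" status --short)"
echo "$status"
```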

__Move your old course files into this new directory.__ Put your .html files into the main directory and delete the html folder.

__Add, Commit and Push the changes (new files) to the GitHub repo.__ First check the files you want to add. Then click Commit. Write a message to describe the changes (see below on good commit messages). Then Push the changes to the GitHub repo. Examine the changes in the repo.

### On good commit messages

Clearly, good documentation of what you've done is critical to making the version history of your repository meaningful and helpful.  It's tempting to skip the commit message altogether, or to add some stock blurb like 'Updates'.  It's better to use messages that will be helpful to your future self in deducing not just what you did, but why you did it.  Also, commit messages are best understood if they follow the active verb convention.  For example, you can see that my commit messages all started with a past tense verb, and then explained what was changed.

While some of the changes we illustrated here were simple and so easily explained in a short phrase, for more complex changes, it's best to provide a more complete message.  The convention, however, is to always have a short, terse first sentence, followed by a more verbose explanation of the details and rationale for the change. This keeps the high level details readable in the version log.  I can't count the number of times I've looked at the commit log from 2, 3, or 10 years prior and been so grateful for the diligence of my past self and collaborators.
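On the command line, one way to follow this convention is to pass `-m` twice: the first becomes the short summary and the second the detailed explanation. The sketch below uses a made-up parameter file and rationale purely for illustration.

```sh
repo="$(mktemp -d)"
git init -q "$repo"
git -C "$repo" config user.name "Example User"
git -C "$repo" config user.email "user@example.org"
echo "min_quality: 30" > "$repo/params.yml"
git -C "$repo" add params.yml

# The first -m is the short summary; the second -m becomes the detailed body.
git -C "$repo" commit -q \
  -m "Raised minimum read quality to 30" \
  -m "Reads below Q30 produced spurious variant calls in the test samples, so the filter is now stricter."

# Asking for the summary alone keeps the history easy to scan.
subject="$(git -C "$repo" log -1 --format=%s)"
echo "$subject"

# The full message, including the rationale, is still available on request.
git --no-pager -C "$repo" log -1 --format=%B
```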

### GitHub web pages

You can enable GitHub Pages to create a web presence for your project.

- Go to the repository you just created
- Click on Settings
- Scroll down to GitHub Pages
- Select the Master branch (named `main` in newer repositories)
- Click on Save
  + Do not choose a theme today. We are going to create a simple page using R Markdown. You have the option of choosing a theme later.
- It will create a GitHub page (e.g. https://jeffreyblanchard.github.io/jeffblanchard/)
- Copy the link to your GitHub page
- Go to the main page for your web repository (e.g. jeffblanchard).
- In the About section add the URL to your GitHub page

Under the Settings tab enable GitHub Pages.  It can take about 10 minutes for the web site to appear. The default web page is the README.md file, but if you create and upload an index.html page (from an index.Rmd file) this will be your new default. This provides a way to see the html files in your browser as you intended them to appear (not just the html code).

* Note: It is critical that you use a small `i` in `index.Rmd` and `index.html` and not a capital `I`

### More on creating documents and web pages using R Markdown

- [Creating Pretty Documents from R Markdown](https://prettydoc.statr.me/)
- [How to create a simple website with RMarkdown](https://nceas.github.io/training-rmarkdown-website/tutorial.html)
- [R Markdown: The definitive guide](https://bookdown.org/yihui/rmarkdown/)
- [More on R Markdown web site](https://garrettgman.github.io/rmarkdown/rmarkdown_websites.html)

Try adding an image to your web page.

### GitHub project management

You can keep track of ideas, todos and fixes by creating a wiki or using the Projects feature.

![Managing projects on GitHub](images/Project_Acidos.png)



### Collaboration and conflict-free workflows (we will talk more about this later in the class)

Up to now, we have been focused on using Git and GitHub for yourself, which is a great use. But equally powerful is to share a GitHub repository with other researchers so that you can work on code, analyses, and models together.  When working together, you will need to pay careful attention to the state of the remote repository to avoid and handle merge conflicts.  A *merge conflict* occurs when two collaborators make two separate commits that change the same lines of the same file.  When this happens, git can't merge the changes together automatically, and will give you back an error asking you to resolve the conflict. Don't be afraid of merge conflicts: they are pretty easy to handle, and there are some
[great](https://help.github.com/articles/resolving-a-merge-conflict-using-the-command-line/) [guides](https://stackoverflow.com/questions/161813/how-to-resolve-merge-conflicts-in-git).

That said, it's truly painless if you can avoid merge conflicts in the first place. You can minimize conflicts by:

- Pulling down changes just before you commit
  + Ensures that you have the most recent changes
  + But you may have to fix your code if a conflict would have occurred
- Coordinating with your collaborators on who is touching which files
  + You still need to communicate to collaborate
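To see what a merge conflict looks like in practice, the sketch below simulates two collaborators working against one shared repository on disk. The names `alice` and `bob`, the `settings.txt` file, and the `make_clone` helper are all hypothetical.

```sh
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"

# Helper: clone the shared repository for one collaborator (placeholder names).
make_clone() {
  git clone -q "$work/remote.git" "$work/$1"
  git -C "$work/$1" config user.name "$1"
  git -C "$work/$1" config user.email "$1@example.org"
}

# The first collaborator creates a file and pushes it.
# (git warns that the first clone is empty, which is expected here.)
make_clone alice
echo "threshold = 10" > "$work/alice/settings.txt"
git -C "$work/alice" add settings.txt
git -C "$work/alice" commit -q -m "Added settings file"
git -C "$work/alice" push -q -u origin HEAD

# The second collaborator clones, then both change the same line.
make_clone bob
echo "threshold = 20" > "$work/alice/settings.txt"
git -C "$work/alice" commit -q -am "Raised threshold to 20"
git -C "$work/alice" push -q

echo "threshold = 5" > "$work/bob/settings.txt"
git -C "$work/bob" commit -q -am "Lowered threshold to 5"

# Pulling now fails to merge automatically, because both commits edit one line.
git -C "$work/bob" pull -q --no-rebase || echo "merge conflict reported"

# The file now holds both versions between conflict markers, to resolve by hand.
cat "$work/bob/settings.txt"
```

Had the second collaborator pulled before editing and committing, the pull would have brought in the new value first and no conflict would have occurred.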

### More with git

There's a lot we haven't covered in this brief tutorial.  There are some good longer tutorials that cover additional topics:

- [Happy Git and GitHub for the useR](https://happygitwithr.com/)
- [Try Git](https://try.github.io) a great interactive tutorial
- Software Carpentry [Version Control with Git](http://swcarpentry.github.io/git-novice/)



## Backing up your computer

Using Git will help you back up and share selected files that are used in data analysis. However, it is not designed to back up all the files on your computer. You can use an external backup drive, but most people do this only occasionally. I recommend using a cloud-based resource that syncs your files as you work on them. Here are some popular ones:

- [Google Backup and Sync](https://www.google.com/drive/download/0)
- [Box](https://www.box.com/cloud-backup)
- [Dropbox](https://www.dropbox.com/features/cloud-storage/file-backup)
- [Windows OneDrive](https://www.microsoft.com/en-us/microsoft-365/onedrive/pc-cloud-backup)
- [Apple iCloud Drive](https://support.apple.com/guide/mac-help/store-files-in-icloud-drive-mchle5a61431/10.15/mac/10.15)