# Lecture 3 - Code and Data Storage

## Reproducible research (intro)

The ability of others to understand and repeat a study (*i.e.* replicability) is one of the core pillars of the scientific method, a critical check that the methods used for inference are sound and that the findings are general enough to constitute new knowledge. Historically the ability of others to understand what was done (and therefore to do the same research again) was typically provided by the Methods section in a publication. With this description in hand, it was thought, others would be able to spin up their own study and see how closely their results matched those originally reported.

Although in practice this doesn't always happen, it is exactly how the most important scientific research gets used. Discovered in 2012, the gene-editing technology [CRISPR-Cas9](https://en.wikipedia.org/wiki/CRISPR) is already one of the most important discoveries in the history of Biology (and will certainly lead to a Nobel Prize for its discoverers, Jennifer Doudna and Emmanuelle Charpentier), making it possible to edit specific genetic code to add or remove information. The rapid adoption and use of CRISPR-Cas9 to edit genes is entirely due to its being presented in a reproducible way, with a [detailed supplemental section](http://science.sciencemag.org/highwire/filestream/593726/field_highwire_adjunct_files/0/1225829.Jinek.SM.revision.1.pdf) in the main paper that provided all the information needed to try it at home. Belief in a scientific finding comes initially from the plausibility of the work based on the described methods (tentative belief), then fundamentally from others being able to successfully and independently repeat the study, achieving similar answers.

We will discuss it more in subsequent lectures, but suffice to say here that in contemporary biology reproducible research has come to mean that all necessary information to repeat a study is made available, including a description of the methods used, the data collected, and the computer code used for analysis and production of published graphics.

In this lecture we'll discuss contemporary ways to store code and data, so that others (including your future self) can figure out what was done.

## GitHub

[GitHub GitHub GitHub](https://github.com/) - GitHub is fantastic. It is a free online platform for storing and sharing code, built on the open-source version control system git. Most critically it simplifies the job of [**version control**](https://en.wikipedia.org/wiki/Version_control) - keeping track of changes made to code through a series of **commits**. Each time you commit a new or edited piece of code, git records the new version, as well as the differences between each revision. This saves people from having to keep track of their code by hand and means we can always go back to previous versions of what we have done. It is an excellent collaboration tool (that hopefully [Microsoft doesn't ruin](https://www.theverge.com/2018/6/4/17422788/microsoft-github-acquisition-official-deal)).

![xkcd](git.png)

**ALSO**: the easy shareability of code on GitHub means that our scientific analyses can be readily reproduced by others (including **your future self**), from **data manipulation** all the way to **final figures**. This is a **BIG DEAL** and a crucial part of contemporary science.

## Set up your account

The first step to using GitHub is to [set up your account](https://github.com/) - pick a username and submit your email and password and you're off. You should also sign up for the [student pack](https://education.github.com/pack), which gives you extra stuff for free (such as free private repos).

---
# Task 1
---
 - Once you have these things in place, log in to your account and click the green `New repository` button on the right hand side. Here you'll be asked for a repository name (`Hello_world` is a classic choice) and a short description (`This is my first repo`). Keep it public and click on initialize with a `README`, then on to `create repository`. This will bring you to a front page for your new repository. On the upper right side, under the green **Clone or download** button, click on the copy link to clipboard button, then paste the URL into the box below.

In [1]:
# YOUR CODE HERE


## Version control

A critical feature of GitHub is the automation of **version control**, whereby each update to your code is kept track of with corresponding text as to the changes made. 



In GitHub, version control happens by design, with each update requiring a mini-description of what was done, along with the opportunity to make longer comments. Why does this matter? Typically (*i.e.* almost every time) when we conduct an analysis we build up code iteratively, starting with a simple model, simulation, or figure and building up to something more complex. But often (*i.e.* almost every time) we hit some dead ends along the way. And without version control our fancy-new-now-failing-analysis-file will have overwritten the old version back at the fork in the road. Trying to remember previous code = failure.

---
# Task 2
---
 - Next click on create new file on the repo front page and create a new R file (use the .R file extension in the name) with a little function of your choice in it. Describe and comment on it, then commit to the **Master** branch. Click on the link to your new file and copy and paste its URL into the box below.

In [2]:
# YOUR CODE HERE


## Branching

One of the most important principles in GitHub is the notion of branching, whereby people (possibly you) can pull down the current version of code, work on it in a sandbox (where your remarkable insights are quarantined), test it thoroughly, *then* merge your cleverness with the main, master branch of code.

Rather than re-inventing the wheel here, we're going to work through GitHub's own Try Git tutorial - this will be the command-line version of git that clever people use to commit their code changes so it might be something your future self gets into.

Click on -> [Try Git](https://try.github.io/) and 'Learn Git by Branching'

################### THINGS HAPPEN ####################

Now that we've learned about how to use git, you can use [git - the simple guide](https://rogerdudler.github.io/git-guide/) as a reference for creating new repos and pushing your files.

---
# Task 3
---
 - Find your recently created .R function file and click edit on the right hand side. Edit your file in some ironically-clever way and then describe and comment on it as usual. This time click on **Create new branch** and then on the green **Create pull request** button on the following page, and finally on **Merge pull request** on the next page. Once again, click on the link to your new file and copy and paste its URL into the box below.

In [3]:
# YOUR CODE HERE


### Gists

GitHub also has a super-useful feature where you can post code for others to see and copy, but without all the branching, merging, and commits of regular GitHub. This is particularly useful for storing Jupyter notebooks of your analysis for a scientific paper (e.g. Python code for [Graham et al. 2018](https://gist.github.com/mamacneil/f55d48278a8ba4073983b7e2b4ab0e06)) or for sharing with colleagues who might want to know how you did some analysis. Simply cut and paste the raw text from your Jupyter notebook (the file on your computer ending in `.ipynb`) and make sure to name the file with the `.ipynb` extension when you save your Gist.

If you want to dive more deeply (*i.e.* if you're a student who is coding right now), have a look at Jenny Bryan's [Happy Git with R page](http://happygitwithr.com/)

---

# Data storage

---

> Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
    - Hadley Wickham

Among the most important tasks in any research endeavour is the proper recording, handling, and storage of data. Data is the core unit of science - the object that is analyzed and from which inferences are made about how the world works. Debates about objectivity in science abound, however the careful recording and storage of data is the dividing line between competent and spurious science. Frequently it is stored data - or lack thereof - that is used to convict or absolve scientists accused of faking research.

A recent, but already classic, case is that of Oona Lönnstedt, a Swedish researcher who faked data saying larval fish preferentially ate microplastics in order to get the [resulting paper into Science](http://science.sciencemag.org/content/352/6290/1213). A key piece of evidence against Lönnstedt was that the data were supposedly stored only on her laptop, which just-so-happened-to-have-been-stolen once other researchers went asking for it (you can read more about the case [here](https://www.acsh.org/news/2017/05/05/science-finally-retracts-absolute-mess-paper-11234)). Keeping the only copy of your data on one machine is contrary to the code of scientific conduct in many countries. Dalhousie is behind the times and currently working on a code of conduct, but data storage practices typically specify storage for at least 10 years. With cloud computing and the cheap cost of data warehousing, storage for your entire scientific career isn't unreasonable. One never knows when they might have to issue a correction...like on their first ever publication from 2002...

## Naming conventions

It may seem like a trivial thing, but how you name your files and the directories you put them in is a **big deal**. Programming languages need to be able to read file names in easily, and in naming something you're telling your dumb future self what you're looking at. It is remarkable how quickly that super-important file becomes a distant memory when time passes and priorities change.

#### Bad names

A couple of things to avoid include:
    - non-specific names (unless in a clearly defined directory): e.g. abstract.docx, figure1.jpg etc.
    - S P A C E S: e.g. Figure 1.jpg, Data for BIO 1000.csv
    - commas: e.g. test_data,trial1.csv
    - punctuation of any kind: e.g. big_result!.csv, awesomess:).jpg

#### Good names

A few things to include:
    
    - dates: e.g. trout_draft_2017-05-05.doc, unicorn_data_2016-05-05.csv
    - detailed descriptors: e.g. figure_1_trout_draft2.png, unicorn_meristics.csv

The most important issues for machine readability are for **file names**, **column headings**, and **objects**.

## File names

### Lifted entirely from Jenny [Bryan's](https://t.co/99waX8liuQ) three principles for file names

    1. Machine readable
    2. Human readable
    3. Plays well with default ordering


### Machine readability

Programs search and process text using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), a standardized syntax for describing patterns in strings. Without pattern matching of this kind, searching your computer or the internet would fail miserably. In the context of naming, file names are easiest to match with regular expressions when they avoid spaces, punctuation, and accented characters, and when they never differ from one another only by case.
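
As a rough illustration of what machine readable means in practice, the sketch below uses base R's `grepl()` to flag file names that contain anything other than letters, digits, underscores, hyphens, and periods. The helper name `is_machine_readable` and the exact pattern are assumptions made for this example, not an official rule.

```r
# Hypothetical helper: TRUE when a file name uses only letters, digits,
# underscores, hyphens, and periods (no spaces, commas, or other punctuation)
is_machine_readable <- function(file_names) {
  grepl("^[A-Za-z0-9._-]+$", file_names)
}

candidates <- c(
  "2012-07-07_FINPRINT_Aruba-LionCay-T1.csv",
  "Data for BIO 1000.csv",
  "test_data,trial1.csv",
  "big_result!.csv"
)

# Pair each name with the result of the check
data.frame(file = candidates, ok = is_machine_readable(candidates))
```

Only the first name passes; the others trip on a space, a comma, and an exclamation mark respectively.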

Great file names look like they're oversharing:

```
2012-07-07_FINPRINT_Aruba-LionCay-T1.csv
2012-07-07_FINPRINT_Aruba-LionCay-T2.csv
2012-07-07_FINPRINT_Aruba-LionCay-T3.csv
2012-07-07_FINPRINT_Aruba-TigerCay-T1.csv
2012-07-07_FINPRINT_Aruba-TigerCay-T2.csv
2012-07-07_FINPRINT_Aruba-TigerCay-T3.csv
2012-07-07_FINPRINT_Aruba-OcelotCay-T1.csv
2012-07-07_FINPRINT_Aruba-OcelotCay-T2.csv
2012-07-07_FINPRINT_Aruba-OcelotCay-T3.csv
```


Awful file names are coy: 

```
transect1.csv
transect2.csv
Transect2.csv
partialreef.csv
quickone.csv
```

From the first list, we know exactly how and when the data were collected and we can use built-in R functions to do some computing for us. The second is close to useless.

For example, we can look at all the data files in a well-stored directory called (cleverly) `Filez`:

In [1]:
list.files(path='Filez')

And we can use internal R functions to find a specific pattern in those filez:

In [2]:
list.files(path='Filez','FINPRINT')

Now, with all those filez we need to do some intelligent stuff, like figuring out just how many countries we went to in the [FinPrint](https://globalfinprint.org) Project:

In [3]:
# Store list of files from FINPRINT project
flist = list.files(path='Filez','FINPRINT')
head(flist)

In [4]:
# Remove the .csv extension from labels (the dot is escaped so it matches a literal period)
flist = sub('\\.csv$', '', flist)
head(flist)

In [5]:
# Load the stringr library
library(stringr)

In [6]:
# Create data frame of file name information
poo = str_split_fixed(flist,'_',3)
head(poo)

0,1,2
2011-10-29,FINPRINT,Indonesia-GloversReefWest
2011-3-1,FINPRINT,USAPacific-StEustatiusWest
2011-3-10,FINPRINT,Qatar-HoutmanAbrolhosNorthOpen
2011-3-19,FINPRINT,Tonga-NorthSulawesiLembehIsland
2011-4-3,FINPRINT,BritishWestIndies-FloridaUpperKeys2
2011-5-29,FINPRINT,CookIslands-NingalooWestExmouthGulf


In [7]:
# Give columns sensible names
colnames(poo) = c('date','project','info')
head(poo)

date,project,info
2011-10-29,FINPRINT,Indonesia-GloversReefWest
2011-3-1,FINPRINT,USAPacific-StEustatiusWest
2011-3-10,FINPRINT,Qatar-HoutmanAbrolhosNorthOpen
2011-3-19,FINPRINT,Tonga-NorthSulawesiLembehIsland
2011-4-3,FINPRINT,BritishWestIndies-FloridaUpperKeys2
2011-5-29,FINPRINT,CookIslands-NingalooWestExmouthGulf


In [8]:
# Split info column into useful bits
out = str_split_fixed(poo[,3],'-',2)
out

0,1
Indonesia,GloversReefWest
USAPacific,StEustatiusWest
Qatar,HoutmanAbrolhosNorthOpen
Tonga,NorthSulawesiLembehIsland
BritishWestIndies,FloridaUpperKeys2
CookIslands,NingalooWestExmouthGulf
USAPacific,PearlandHermesAtollPearlandHermesAtoll
Qatar,NingalooCloatesOpen
Vietnam,ZairaAreaOpen
Belize,SouthernKenyaMpungutiReserve


In [9]:
# How many unique countries?
length(unique(out[,1]))

Machine readable means it is:

    1. Easy to search for files later
    2. Easy to narrow down list of file names
    3. Easy to extract information by splitting


---
# Task 4
---

Using the code above and your intuition, figure out how many unique reefs were sampled in the `SERF` project and store the number in an object called `nReefs`.

In [None]:
# YOUR CODE HERE


### Human readability


By naming things well, your ability to find a specific file later goes way, way up.

For example, if we want to search through a load of files from years ago, there is a big difference between encountering:

```
first_attempt.r
goodone.r
meh.r
goodone41.r
goodone42.r
goodone43.r
goodone424.r
```

versus

```
B1_basic_analysis.r
B2_added_hierarchy.r
B3_added_hierarchy_and_covariates.r
B4_full_Bayesian.r
B5_full_Frequentist.r
```

Naming things well means it is **easy to figure out what something is based on its name**.

### Plays well with default ordering

Default ordering is the order in which files appear when you list a directory. In the file browser shown below, names starting with an underscore come first, then names starting with a number, then everything else alphabetically. So if you want a specific file to always be at the top of your directory, you can use an underscore:

![](fileorder.png)

Clearly gift ideas are problematic for me, so I have prioritized them.

Bryan suggests a few key points:


    1. Put something numeric first
    2. Use YYYY-MM-DD for dates
    3. Use leading zeros


1 & 2 help keep things in logical order, either by date or the order you want them in. 3 matters because file names are sorted as text, character by character, so single-digit numbers will not fall in the correct order alongside 10 and above unless they have a leading zero. For example

```
10_final_figures.R
1_initial_data_wrangling.R
2_model_fitting.R
...

```

Isn't the behaviour we expected. Far more logical is

```
01_initial_data_wrangling.R
02_model_fitting.R
...
10_final_figures.R

```
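
The same behaviour can be reproduced in R. This is a small sketch using made-up step numbers and labels; `sprintf()` with the `%02d` format pads numbers to two digits, and `sort(..., method = "radix")` is used so the character-by-character ordering does not depend on your computer's locale.

```r
# Step numbers and labels for a hypothetical analysis pipeline
steps  <- c(1L, 2L, 10L)
labels <- c("initial_data_wrangling", "model_fitting", "final_figures")

# Without padding, "10_" sorts ahead of "1_" and "2_"
unpadded <- sprintf("%d_%s.R", steps, labels)
sort(unpadded, method = "radix")

# "%02d" pads each number to two digits with a leading zero
padded <- sprintf("%02d_%s.R", steps, labels)
sort(padded, method = "radix")
```

If a project might grow past 99 scripts, `%03d` would be the safer choice.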


## R-object names

Naming conventions for R objects are much the same as for other files, and there are many potential styles to use. The choice is really up to you, but a few common choices [from Bååth](https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf) are:

**alllowercase** - All letters are lower case and no separator is used in names consisting of multiple words as in searchpaths or srcfilecopy. This naming convention is common in MATLAB. Note that a single lowercase name, such as mean, conforms to all conventions but UpperCamelCase.

**period.separated** - All letters are lower case and multiple words are separated by a period. This naming convention *is unique to R* (so avoid it) and used in many core functions such as as.numeric or read.table. 

**underscore_separated** - All letters are lower case and multiple words are separated by an underscore as in seq_along or package_version. This naming convention is used for function and variable names in many languages including C++, Perl and Ruby.

**lowerCamelCase** - Single word names consist of lower case letters and in names consisting of more than one word all, except the first word, are capitalized as in colMeans or suppressPackageStartupMessages. This naming convention is used, for example, for method names in Java and JavaScript.

**UpperCamelCase** - All words are capitalized both when the name consists of a single word, as in Vectorize, or multiple words, as in NextMethod. This naming convention is used for class names in many languages including Java, Python and JavaScript.

Take your pick, but keep it informative.

# CAPITalIZation!!!

**If you learn one thing in this course, know that uppercase and lowercase letters are not the same thing to a computer.**
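
A few lines of R make the point. The object names and values below are invented for illustration; the file names are the coy ones from earlier.

```r
# Two file names that differ only by case are different strings
"Transect2.csv" == "transect2.csv"

# Object names are case sensitive too: these are two separate objects
nFish <- 12
nfish <- 7
nFish == nfish

# Lower-casing before comparing is one way to catch near-duplicate names
files <- c("transect1.csv", "transect2.csv", "Transect2.csv")
files[duplicated(tolower(files))]
```

Both comparisons return `FALSE`, and the last line picks out `Transect2.csv` as a name that collides with another once case is ignored.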

---
# Excel from hell
---

Among the most significant and worst inventions in personal computing has been the spreadsheet - a computer-displayed version of a traditional ledger that has become the fundamental unit of data storage for most working scientists. The most popular version of this program is Microsoft Excel, a bloated whale of a program that, due to the ease with which errors can propagate through formula-ridden files, has possibly created more problems than it solves. An example from the [European Spreadsheet Risks Interest Group](http://www.eusprig.org/horror-stories.htm) illustrates the problem:

> A \$331,074 Excel spreadsheet calculation error was made on a sheriff’s operations salary worksheet during the annual budget cycle last fall, commissioners learned Monday. A \$163,083 spreadsheet error was made on a sheriff’s jail salary worksheet, officials said. “We did some forensics,” Benedict said Wednesday. “The spreadsheets were emailed back and forth between us and [County Administrator] Jim Jones’ office. Because of some cutting and pasting, not all the formulas were pasted correctly. It was an unintended error.”

This kind of thing is so serious that Richard McElreath's department in Leipzig does this: https://twitter.com/rlmcelreath/status/1174240592253075457?s=20

This reflects a fundamental problem with Excel, which is that raw data and calculations on the data live in the same file - **without recording how changes are made**. This means that once one person alters the contents of a file, perhaps innocently, the original data can never be recovered. Therefore a key tenet of scientific research is that **raw data must be stored in a database or flat file** that can be subsequently used for analysis, independent from the analysis itself. In other words, [**Rule 3: Keep Raw Data Raw**](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097#sec004).
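
One way to keep raw data raw is to let a script do all the calculations and write its results to a separate file. The sketch below is only an illustration under made-up assumptions: the trout measurements, file names, and the temporary folder are all invented, and a real project would read an existing raw file rather than create one.

```r
# Made-up raw measurements, written once to a raw file in a temporary folder
project_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(project_dir, showWarnings = FALSE)
raw_file <- file.path(project_dir, "2019-09-18_trout_lengths_raw.csv")
write.csv(data.frame(fish_id = 1:3, length_mm = c(212, 198, 240)),
          raw_file, row.names = FALSE)

# Analysis step: read the raw file and derive new columns in code
raw <- read.csv(raw_file)
derived <- transform(raw, length_cm = length_mm / 10)

# Save derived data under a different name; the raw file is never rewritten
derived_file <- file.path(project_dir, "2019-09-18_trout_lengths_derived.csv")
write.csv(derived, derived_file, row.names = FALSE)

# The raw file still holds only the original columns
names(read.csv(raw_file))
```

Because the conversion to centimetres lives in the script, anyone can see exactly how the derived file was produced and rerun it from the untouched raw data.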

Excel is so awful for data storage that University of Winnipeg researchers identified [9 Circles of Excel Hell](https://www.uwinnipeg.ca/w3/docs/nine-circles-of-excel-hell.pdf), the most important of which are:

1. **Oops** - it's too easy for people to delete rows and cells and break formulas and links
2. **Crush** - it is far too difficult to reconcile spreadsheets originating from different people
3. **Brain drain** - consolidating spreadsheets is purgatory for data analysts
8. **Whodunit** - it is impossible to track changes and know who has done what

Truly frightening is that most of the Excel mistakes that have been found were caught because they were financial in nature - meaning people outside of those handling the data were highly motivated for it to be done right. Unless we've cured cancer, almost no one will look at our biological data once it is published - including ourselves - making it all the more likely that careless or accidental errors will never be caught!

## BUT

Excel is also the most common way in which biologists enter and store data, so it is important to become familiar with how it works.

---
# Tidyverse
---

Ideas around reproducible research have greatly increased the rate at which we can learn (i.e. copy) from what others have done and apply new methods to our own research. While publications convey a clean narrative about what has been done in terms of data collection and analysis, the reality of doing data-focused research is that **most of the time is spent cleaning data**. This ugly (and tedious) reality is a difficult pill to swallow if you're not a patient person who will continue to work at a tedious task until it is done (in which case, science might not be for you...).

While many biologists learned through painful effort the importance of storing data in a clean way for use in programming languages, R statistician and all-round genius [Hadley Wickham](http://hadley.nz) formalized these lessons in a 2014 paper, [Tidy Data](http://www.jstatsoft.org/v59/i10/paper). Here he outlined the principles of tidy data, to provide a standard way to organize data values within a dataset:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

These seem simple, but as you will see in your first assignment, biologists frequently fail to do this, causing no end of pain. So what do they mean?


NB2019: Ordering doesn't and shouldn't matter!

### 1. Each variable forms a column.

In collecting data there are two distinct classes of observation, the response - which is the thing(s) you set out to measure, your variable(s) of interest - and the covariates - which are the associated measurements that will either explain, or potentially bias, your results. *Response variables* are also called observations, dependent variables, and y-values (a lack of common nomenclature is one of modelling's greatest difficulties). *Covariates* are also called factors, independent variables, x-values etc, adding to the confusion.

To have tidy data, each measured thing - covariates and responses - gets its own column.

### 2. Each observation forms a row. (!!!!!!!)

In statistics, [an observation is the value, at a particular period, of a particular variable](https://stats.oecd.org/glossary/detail.asp?ID=6132), and this definition holds for most scientific data. The tidy data definition also implies that an observation includes not only the value observed, but also the values of covariates that relate to the observation at the time of the observation. In storing data it is critical that each unique observation gets its own row because failure to do so introduces additional, unnecessary steps in an analysis that can (and do) lead to frequent errors.

For example, storing data like this:

| Row        | a | b | c |
| ------------- |:-------------:|:-------------:|:-------------:|
| A | 1 | 4 | 7 |
| B | 2 | 5 | 8 |
| C | 3 | 6 | 9 |

embeds the observations (the numbers) within a table that makes them very hard to use. One would have to row/column index everything to make it useable in a programming language. Alternatively,

| Row        | Column | Value |
| ------------- |:-------------:|:-------------:|
| A | a | 1 |
| B | a | 2 |
| C | a | 3 |
| A | b | 4 |
| B | b | 5 |
| C | b | 6 |
| A | c | 7 |
| B | c | 8 |
| C | c | 9 |

provides the data in an easily-readable way, one that can immediately be put into a data frame for analysis.
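
Moving from the first layout to the second can be done in a few lines of base R. The sketch below rebuilds the small example table by hand (the object names `wide` and `long` are just labels for this example) and stacks its value columns so that each number gets its own row.

```r
# The wide table: one row per Row label, one column per letter
wide <- data.frame(Row = c("A", "B", "C"), a = 1:3, b = 4:6, c = 7:9,
                   stringsAsFactors = FALSE)

value_cols <- c("a", "b", "c")

# Stack the value columns so that each observation gets its own row
long <- data.frame(
  Row    = rep(wide$Row, times = length(value_cols)),
  Column = rep(value_cols, each = nrow(wide)),
  Value  = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```

The `tidyr` package's `pivot_longer()` does the same job with less typing once you are comfortable with the idea.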


### 3. Each type of observational unit forms a table*.

Often data are collected at multiple scales, on different kinds of observational units - tidy data promotes storing each fact only in one place. For example:

| Track        | Artist | Date | Rank |
| ------------- |:-------------:|:-------------:|:-------------:|
| Baby don't cry | 2 Pac | 2000-02-26 | 87 |
| Baby don't cry | 2 Pac | 2000-03-04 | 82 |
| Baby don't cry | 2 Pac | 2000-03-11 | 72 |
| Baby don't cry | 2 Pac | 2000-03-25 | 77 |
| Everything in its right place | Radiohead | 2000-04-01 | 8 |
| Everything in its right place | Radiohead | 2000-04-08 | 9 |
| Everything in its right place | Radiohead | 2000-04-15 | 7 |
| Everything in its right place | Radiohead | 2000-04-22 | 11 |
| Everything in its right place | Radiohead | 2000-04-29 | 15 |

repeats a lot of information. The *tidy* version of this would be two tables:

| ID | Artist | Track |
| ------------- |:-------------:|:-------------:|
| 1 | 2 Pac | Baby don't cry |
| 2 | Radiohead | Everything in its right place |

| ID | Date | Rank |
| ------------- |:-------------:|:-------------:|
| 1 | 2000-02-26 | 87 |
| 1 | 2000-03-04 | 82 |
| 1 | 2000-03-11 | 72 |
| 1 | 2000-03-25 | 77 |
| 2 | 2000-04-01 | 8 |
| 2 | 2000-04-08 | 9 |
| 2 | 2000-04-15 | 7 |
| 2 | 2000-04-22 | 11 |
| 2 | 2000-04-29 | 15 |

*However, in practice we typically want to create the repetitive table out of the other kinds of tables, so there may be times to store data as a single flat file.
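
Rebuilding the repetitive table from the two tidy ones is a single join. This sketch types the two example tables in by hand (the object names `tracks`, `ranks`, and `flat` are made up for illustration) and uses base R's `merge()` to link them on `ID`.

```r
# One table per observational unit, linked by a shared ID
tracks <- data.frame(
  ID     = c(1, 2),
  Artist = c("2 Pac", "Radiohead"),
  Track  = c("Baby don't cry", "Everything in its right place"),
  stringsAsFactors = FALSE
)

ranks <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  Date = as.Date(c("2000-02-26", "2000-03-04", "2000-03-11", "2000-03-25",
                   "2000-04-01", "2000-04-08", "2000-04-15", "2000-04-22",
                   "2000-04-29")),
  Rank = c(87, 82, 72, 77, 8, 9, 7, 11, 15)
)

# Join on ID to rebuild the flat, repetitive table when an analysis needs it
flat <- merge(tracks, ranks, by = "ID")
flat[, c("Track", "Artist", "Date", "Rank")]
```

Storing the tables separately means an artist or track name only has to be corrected in one place, while the flat version can be regenerated whenever it is needed.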


---
# Task 5
---

Using markdown code (double-click the example cells to see what that looks like), make tidy data out of the following information:

| Player        | Team | Game 1 | Game 2 | Game 3 | Game 4 |
| ------------- |:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| Giancarlo Stanton | MIA | 3/4 | 1/4 | 2/5 | 3/4 |
| Aaron Judge | NYY | 2/4 | 3/6 | 1/4 | 4/4 |
| J.D. Martinez | TOT | 0/4 | 3/4 | 3/6 | 4/4 |
| Khris Davis | OAK | 4/4 | 2/4 | 4/5 | 2/5 |
| Joey Gallo | TEX | 3/4 | 1/5 | 2/4 | 3/4 |
| Nelson Cruz | SEA | 3/4 | 1/4 | 3/4 | 3/4 |
| Edwin Encarnacion | CLE | 2/5 | 1/6 | 3/5 | 3/3 |
| Logan Morrison | TBR | 3/4 | 1/4 | 2/5 | 1/4 |
| Justin Smoak | TOR | 2/4 | 1/4 | 3/3 | 1/4 |

In [None]:
# Your code here


# What have you learned and what's next?

The point of today's lab was to figure out where and how to store code and data.

**You should at this point be comfortable:**
 1. Uploading code to GitHub
 2. Naming things
 3. Not using Excel
 4. Keeping things tidy

Next week we will delve into data wrangling, the grist upon which analyses are made.

## ALSO

Wickham has subsequently written an [R for Data Science](http://r4ds.had.co.nz) book that you should delve into if you're serious about using R effectively.

You should also keep on hand the Hart *et al.* paper [Ten Simple Rules for Data Storage](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097) when you start to go about storing data from a project.

---
# ** A bientôt ** !