In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## Data import

In [None]:
# read the data directly from GitHub
df = read_csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/ca.csv")

# let's take a look at it
glimpse(df)

In [5]:
# column 'genotype'
# paste() recycles the prefix over 1:10, giving "wt-1" ... "wt-10"
wildtype = paste("wt", 1:10, sep = "-")
hairy = paste("hairy", 1:10, sep = "-")
bald = paste("bald", 1:10, sep = "-")
genotype = c(wildtype,hairy,bald)

# column with petra and peter
scientist = c(
  rep("petra",5),
  rep("peter",5),
  rep("petra",5),
  rep("peter",5),
  rep("petra",5),
  rep("peter",5)
)

# one column for leaf bottom side and one for leaf upper side
## upper side
u_wt = rnorm(n = 10,mean = 100,sd = 10)
u_hairy = rnorm(n = 10,mean = 500,sd = 30)
u_bald = rnorm(n = 10,mean = 20,sd = 5)
upper = c(u_wt,u_hairy,u_bald)
## lower side  
d_wt = u_wt + runif(n = 10,min = -10,max = +10)
d_hairy = u_hairy + runif(n = 10,min = -50,max = +50)
d_bald = u_bald + runif(n = 10,min = 0,max = +10)
lower = c(d_wt,d_hairy,d_bald)

df = data.frame(
  genotype = genotype,
  scientist = scientist,
  upper = upper,
  lower = lower)
head(df)

genotype,scientist,upper,lower
wt-1,petra,90.47113,86.66365
wt-2,petra,104.10747,110.01898
wt-3,petra,102.94514,107.31042
wt-4,petra,110.70072,117.51176
wt-5,petra,95.10736,89.52991
wt-6,peter,105.25195,102.78445


**Question**: How would you calculate the mean and standard deviation of the trichome counts per person? Is it easy to do? 

**Question**: What other potential problems does this dataset have? Which columns need to be changed?

**Question**: What would the corresponding tidy dataset look like?

# From messy to tidy

First, let's separate the `genotype` column into a genotype column and a biological replicate column:

In [10]:
df.parsed = separate(data = df,col = genotype,into = c("genotype","replicate"))
head(df.parsed)

genotype,replicate,scientist,upper,lower
wt,1,petra,90.47113,86.66365
wt,2,petra,104.10747,110.01898
wt,3,petra,102.94514,107.31042
wt,4,petra,110.70072,117.51176
wt,5,petra,95.10736,89.52991
wt,6,peter,105.25195,102.78445
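
The call above works without a `sep` argument because `separate()` splits, by default, on any run of non-alphanumeric characters. As a minimal sketch, the same split can be written with the separator spelled out and with `convert = TRUE`, so that the replicate number is stored as an integer rather than as text. The `toy` data frame below is a hypothetical three-row stand-in for `df`, not part of the original data.

```r
library(tidyr)

# Hypothetical three-row stand-in for the `df` built above
toy <- data.frame(
  genotype = c("wt-1", "hairy-2", "bald-10"),
  upper = c(100, 500, 20),
  stringsAsFactors = FALSE
)

# sep = "-" splits on the hyphen only;
# convert = TRUE turns the replicate strings "1", "2", "10" into integers
toy_parsed <- separate(
  data = toy,
  col = genotype,
  into = c("genotype", "replicate"),
  sep = "-",
  convert = TRUE
)

str(toy_parsed)
```

Storing the replicate as an integer keeps it sorted numerically (1, 2, 10) instead of alphabetically ("1", "10", "2"), which may matter for later ordering or plotting.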


Then let's reshape the dataframe so that the `upper` and `lower` columns are transformed into rows. They are indeed experimental variables and should