# R `tidyverse` exercise

We will be working with the output files from the STAR aligner for this exercise. These files have four columns:

```
column 1: gene ID 
column 2: counts for unstranded RNA-seq 
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes) 
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse) 
```

For an explanation, see [STAR quantMode geneCounts values](https://www.biostars.org/p/218995/).

Based on the protocol we are using, column 4 contains the sense-strand read counts and column 3 the antisense read counts, so we will work with columns 1 and 4.

In [14]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


We will work with the following data from the 4 lanes.

```
1_MA_J_S18_L001_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L002_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L003_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L004_R1_001ReadsPerGene.out.tab
```

found in the directory

```
/data/hts/2018/foot
```

## Setting up variables

**1**. Use `file.path` to create a path to the data directory and save it as the variable `data_dir`.

In [9]:
# Start from '/data': file.path('/', 'data') would produce '//data'
data_dir <- file.path('/data', 'hts', '2018', 'foot')

**2**. Save the filenames in a variable `filenames`.

In [10]:
filenames <- c('1_MA_J_S18_L001_R1_001ReadsPerGene.out.tab',
               '1_MA_J_S18_L002_R1_001ReadsPerGene.out.tab',
               '1_MA_J_S18_L003_R1_001ReadsPerGene.out.tab',
               '1_MA_J_S18_L004_R1_001ReadsPerGene.out.tab')
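
As a side note, `file.path` is vectorized, so all four full paths can be built in one call. The sketch below also generates the filenames with `sprintf` instead of typing them out; the names `lane_files` and `paths` are illustrative and are not used elsewhere in this exercise.

```r
# Build the data directory and the four lane filenames
data_dir <- file.path('/data', 'hts', '2018', 'foot')

# '%03d' pads the lane number with zeros: 001, 002, 003, 004
lane_files <- sprintf('1_MA_J_S18_L%03d_R1_001ReadsPerGene.out.tab', 1:4)

# file.path recycles data_dir across all filenames
paths <- file.path(data_dir, lane_files)
print(paths)
```

This only constructs path strings; nothing is read from disk, so it runs even where the data directory does not exist.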

## Explore one data file

**3**. Read the data from the first file into a `data.frame` or `tibble` called `df`. Note that the file does not have a header row. Name the columns `id`, `us`, `fs` and `rs`.

In [20]:
f1 <- file.path(data_dir, filenames[[1]])
f1

In [22]:
df <- read_tsv(f1, col_names = c('id', 'us', 'fs', 'rs'))

Parsed with column specification:
cols(
  id = col_character(),
  us = col_integer(),
  fs = col_integer(),
  rs = col_integer()
)
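
The column specification message above is `readr` reporting the types it guessed. A minimal sketch of declaring the types explicitly, which also silences the message, is shown below. It assumes a small temporary file with made-up rows in the same four-column layout, not the real data.

```r
library(readr)

# Write a tiny header-less, tab-separated file (toy rows, hypothetical values)
tmp <- tempfile(fileext = '.tab')
writeLines(c('N_unmapped\t12683\t12683\t12683',
             'gene0\t0\t0\t0',
             'gene2\t8\t0\t8'), tmp)

# Declare column types up front: id is text, every other column is an integer
demo <- read_tsv(
  tmp,
  col_names = c('id', 'us', 'fs', 'rs'),
  col_types = cols(id = col_character(), .default = col_integer())
)
print(demo)
```

Declaring the types means a malformed file is flagged as a parsing problem rather than silently changing a column's type.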


**4**. View the first and last 10 lines of `df`.

In [23]:
df %>% head(10)

id,us,fs,rs
N_unmapped,12683,12683,12683
N_multimapping,48837,48837,48837
N_noFeature,9605,2207680,18570
N_ambiguous,169971,1593,409
gene0,0,0,0
gene1,0,0,0
gene2,8,0,8
gene3,1,0,1
gene4,0,0,0
gene5,66,0,66


In [33]:
df %>% tail(10)

id,us,fs,rs
gene8319,0,0,0
gene8323,0,0,0
gene8324,1,0,1
gene8326,0,0,0
gene8327,0,0,0
gene8328,0,0,0
gene8329,0,0,0
gene8330,0,0,0
gene8333,2,0,2
gene8334,0,0,0


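**5**. The first four rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`) are summary statistics rather than genes. Save the rows from row 5 onwards into a new `data.frame` called `df_genes`.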

In [26]:
df_genes <- df %>% slice(-(1:4))

In [29]:
df_genes %>% head(3)

id,us,fs,rs
gene0,0,0,0
gene1,0,0,0
gene2,8,0,8
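
Dropping rows by position works because the summary rows always come first in these files. A possible alternative, sketched here on a toy table, is to drop rows by name instead. It assumes that no real gene ID starts with `N_`, which holds for the IDs shown above but is not guaranteed in general.

```r
library(dplyr)

# Toy table: the four summary rows followed by two genes
toy <- tibble(
  id = c('N_unmapped', 'N_multimapping', 'N_noFeature', 'N_ambiguous',
         'gene0', 'gene1'),
  rs = c(12683L, 48837L, 18570L, 409L, 0L, 0L)
)

# Keep only rows whose id does not start with 'N_'
genes_only <- toy %>% filter(!startsWith(id, 'N_'))
print(genes_only)
```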


**6**. Create a new `data.frame` from `df_genes` containing only the 1st and 4th columns and save it as a new variable `df_final`.

In [53]:
df_final <- df_genes %>% select(id, rs)
df_final %>% head(3)

id,rs
gene0,0
gene1,0
gene2,8


**7**. Now do steps 3, 5 and 6 for the other 3 files using a loop, and combine them with `df_final` using `full_join` on the `id` column to end up with a data.frame with 5 columns (id and 4 count columns).

In [54]:
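for (i in 2:4) {
    # Read lane i; suppressMessages() hides the column specification message
    path <- file.path(data_dir, filenames[i])
    df <- suppressMessages(read_tsv(path, col_names = c('id', 'us', 'fs', 'rs')))
    # Drop the four summary rows and keep the gene ID and column 4 counts
    df <- df %>% slice(-(1:4)) %>% select(id, rs)
    # Add this lane's counts as a new column, matching rows on gene ID
    df_final <- full_join(df_final, df, by = 'id')
}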
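
The same combine step can also be written without an explicit loop. The sketch below is one possible approach using `purrr::imap` and `purrr::reduce` on in-memory toy tables (hypothetical values) instead of the real files. It names each count column before joining, and it shows that `full_join` fills in `NA` when a gene is missing from one lane.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for the per-lane tables; gene1 is missing from the third
lanes <- list(
  tibble(id = c('gene0', 'gene1', 'gene2'), rs = c(0L, 0L, 8L)),
  tibble(id = c('gene0', 'gene1', 'gene2'), rs = c(0L, 0L, 4L)),
  tibble(id = c('gene0', 'gene2'), rs = c(0L, 10L))
)

combined <- lanes %>%
  # Rename the count column to lane1, lane2, ... using the list position
  imap(function(df, i) setNames(df, c('id', paste0('lane', i)))) %>%
  # Join the tables pairwise, left to right, on the gene ID
  reduce(function(x, y) full_join(x, y, by = 'id'))
print(combined)
```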

**8**. Rename the count columns as `lane1`, `lane2`, `lane3` and `lane4`. At this point you should have a `data.frame` that looks like this:

| id | lane1 | lane2 | lane3 | lane4 |
| - | - | - | - | - |
| gene0 | 0 | 0 | 0 | 1 |
| gene1 | 0 | 0 | 0 | 0 |
| gene2 | 8 | 4 | 10 | 3 |



In [56]:
names(df_final) <- c('id', 'lane1', 'lane2', 'lane3', 'lane4')

In [78]:
head(df_final, 3)

id,lane1,lane2,lane3,lane4
gene0,0,0,0,1
gene1,0,0,0,0
gene2,8,4,10,3


**9**. Create a new column called `counts` containing the sum of lanes 1-4 and save the result as `df_with_counts`.

In [110]:
df_with_counts <- df_final %>% 
    mutate(counts = lane1 + lane2 + lane3 + lane4)
head(df_with_counts)

id,lane1,lane2,lane3,lane4,counts
gene0,0,0,0,1,1
gene1,0,0,0,0,0
gene2,8,4,10,3,25
gene3,1,0,1,1,3
gene4,0,1,0,0,1
gene5,66,58,72,65,261


**Note on more fancy R**.

If there are too many columns to add by hand, you can use `Reduce`. The `.` is a placeholder for the object being piped in.

`Reduce` combines the elements of its second argument from left to right using the binary function given as its first argument. An optional third argument supplies an initial value; if it is omitted, the first element is used as the starting value.

For example:

`Reduce('+', 1:4)` gives the same result as `Reduce('+', 1:4, 0)`, which evaluates `((((0 + 1) + 2) + 3) + 4)` and returns 10, i.e. the same result as `sum(1:4)`.
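
To make the left-to-right accumulation visible, here is a small base R sketch. It also illustrates why `Reduce` can sum columns: a data frame is a list of columns, so `+` is applied to whole columns element-wise. The toy values and the names `running`, `toy` and `row_totals` are illustrative only.

```r
# accumulate = TRUE keeps every intermediate result (a running total)
running <- Reduce(`+`, 1:4, accumulate = TRUE)
print(running)

# Adding the columns of a data frame gives one total per row
toy <- data.frame(lane1 = c(0, 8), lane2 = c(0, 4), lane3 = c(1, 10))
row_totals <- Reduce(`+`, toy)
print(row_totals)
```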

In [95]:
df_final %>% 
    mutate(counts = Reduce('+', .[2:5])) %>% 
    head(5)

id,lane1,lane2,lane3,lane4,counts
gene0,0,0,0,1,1
gene1,0,0,0,0,0
gene2,8,4,10,3,25
gene3,1,0,1,1,3
gene4,0,1,0,0,1
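
Another option for summing many columns is `rowSums`, which can also skip missing values. This matters only if a gene is absent from one of the lanes, in which case `full_join` leaves an `NA` and both `+` and `Reduce('+', ...)` would return `NA` for that row. The sketch below uses a toy table with one deliberately missing value; it is an illustration, not part of the exercise solution.

```r
library(dplyr)

# Toy table with one NA, as a full_join could produce (hypothetical values)
toy <- tibble(
  id    = c('gene0', 'gene1', 'gene2'),
  lane1 = c(0L, 0L, 8L),
  lane2 = c(0L, NA, 4L),
  lane3 = c(0L, 0L, 10L),
  lane4 = c(1L, 0L, 3L)
)

# Sum the lane columns row by row, treating NA as 0
toy_counts <- toy %>%
  mutate(counts = rowSums(select(., lane1:lane4), na.rm = TRUE))
print(toy_counts)
```

Whether treating a missing lane as zero is appropriate depends on why the value is missing, so check that before using `na.rm = TRUE` on real data.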


**10**. Keep only the `id` and `counts` columns, remove rows where the gene count is 0, and save the result as `df_counts`.

- How many genes with non-zero counts are there?
- What is the gene(s) with the highest count?
- What are the 10 most common count values, i.e. the count values shared by the largest number of genes?

In [113]:
df_counts <- df_with_counts %>% 
    select(id, counts) %>%
    filter(counts != 0)

In [117]:
df_counts %>% summarize(n())

n()
7852


In [121]:
df_counts %>% filter(counts == max(counts))

id,counts
gene7418,210788


In [131]:
df_counts %>% 
    group_by(counts) %>% 
    summarize(num_genes = n()) %>%
    arrange(desc(num_genes)) %>%
    head(10)

counts,num_genes
1,199
2,132
3,115
5,87
4,82
6,80
7,78
9,62
11,55
8,54
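
The `group_by` / `summarize` / `arrange` chain above can be shortened with `dplyr::count`. The sketch below uses a toy table with hypothetical values. Note that the `name` argument requires dplyr 0.8.0 or later, which is newer than the 0.7.5 version shown at the top of this notebook.

```r
library(dplyr)

# Toy per-gene totals (hypothetical values)
toy <- tibble(
  id     = paste0('gene', 0:5),
  counts = c(1L, 2L, 1L, 7L, 2L, 1L)
)

# Tally how many genes share each count value, most common first
top <- toy %>% count(counts, sort = TRUE, name = 'num_genes')
print(top)
```

With either version, `head(10)` cuts ties at the tenth position arbitrarily, so count values with equal `num_genes` may be dropped.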
