# R `tidyverse` exercise

We will be working with the output files from the STAR aligner for this exercise. These files have four columns:

```
column 1: gene ID 
column 2: counts for unstranded RNA-seq 
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes) 
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse) 
```

For an explanation, see [STAR quantMode geneCounts values](https://www.biostars.org/p/218995/).

Based on the protocol we are using, column 4 holds the sense-strand read counts and column 3 holds the anti-sense read counts, so we will be working with columns 1 and 4.

In [7]:
library(tidyverse)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


We will work with the following data from the 4 lanes.

```
1_MA_J_S18_L001_ReadsPerGene.out.tab
1_MA_J_S18_L002_ReadsPerGene.out.tab
1_MA_J_S18_L003_ReadsPerGene.out.tab
1_MA_J_S18_L004_ReadsPerGene.out.tab
```

found in the directory

```
/data/hts2018_pilot/star_counts
```

## Setting up variables

**1**. Use `file.path` to create a path to the data directory and save it as the variable `data_dir`.

In [15]:
data_dir <- "/data/hts2018_pilot/star_counts"

In [16]:
data_dir <-  file.path('/', 'data', 'hts2018_pilot', 'star_counts')

**2**. Save the filenames in a variable `filenames`

In [17]:
filenames <- c('1_MA_J_S18_L001_ReadsPerGene.out.tab',
'1_MA_J_S18_L002_ReadsPerGene.out.tab',
'1_MA_J_S18_L003_ReadsPerGene.out.tab',
'1_MA_J_S18_L004_ReadsPerGene.out.tab')
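
Since `file.path` is vectorized, the two variables can be combined in a single call. The following is a minimal sketch rather than part of the exercise solution; the name `paths` is introduced here only for illustration.

```r
# Values copied from the cells above.
data_dir <- "/data/hts2018_pilot/star_counts"
filenames <- c('1_MA_J_S18_L001_ReadsPerGene.out.tab',
               '1_MA_J_S18_L002_ReadsPerGene.out.tab',
               '1_MA_J_S18_L003_ReadsPerGene.out.tab',
               '1_MA_J_S18_L004_ReadsPerGene.out.tab')

# file.path() recycles data_dir across the filenames,
# so one call builds all four full paths.
paths <- file.path(data_dir, filenames)  # `paths` is a hypothetical name
paths
```

Building all the paths up front makes it easy to check them (for example with `file.exists(paths)`) before reading anything.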

## Explore one data file

**3**. Read the data from the first file into a `data.frame` or `tibble` called `df`. Note that the file does not have a header row. Name the columns `id`, `us`, `fs` and `rs`.

In [18]:
f1 <- file.path(data_dir, filenames[[1]])
f1

In [19]:
df <- read_tsv(f1, col_names = c('id', 'us', 'fs', 'rs'))

Parsed with column specification:
cols(
  id = col_character(),
  us = col_integer(),
  fs = col_integer(),
  rs = col_integer()
)
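
The message above shows the column types that `read_tsv` guessed. One way to make the types explicit, and to silence the message without `suppressMessages`, is to pass `col_types`. The sketch below assumes a small temporary file as a stand-in for the real data; the file and the name `toy` are hypothetical, while the three rows are copied from the first file's output.

```r
library(readr)

# Hypothetical three-row stand-in for a ReadsPerGene.out.tab file.
tmp <- tempfile(fileext = ".tab")
writeLines(c("N_unmapped\t2690\t2690\t2690",
             "CNAG_07304\t8\t0\t8",
             "CNAG_00002\t66\t0\t66"), tmp)

# "ciii" = one character column followed by three integer columns.
toy <- read_tsv(tmp,
                col_names = c('id', 'us', 'fs', 'rs'),
                col_types = "ciii")
toy
```

With explicit types, a file whose columns do not parse as integers produces a parsing warning instead of silently becoming a different type.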


**4**. View the first and last 10 lines of `df`

In [20]:
df %>% head(10)

id,us,fs,rs
N_unmapped,2690,2690,2690
N_multimapping,66100,66100,66100
N_noFeature,10626,2238382,20347
N_ambiguous,173170,1622,647
CNAG_04548,0,0,0
CNAG_07303,0,0,0
CNAG_07304,8,0,8
CNAG_00001,0,0,0
CNAG_07305,0,0,0
CNAG_00002,66,0,66


In [21]:
df %>% tail(10)

id,us,fs,rs
CNAG_09008,1,0,1
CNAG_09009,9,0,9
CNAG_11015,0,0,0
ENSRNA049545623,0,0,0
CNAG_11016,0,0,0
ENSRNA049545680,0,0,0
CNAG_09010,0,0,0
CNAG_09011,0,0,0
ENSRNA049545749,0,0,0
CNAG_09012,0,0,0


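**5**. Save the rows from 5 onwards into a new `data.frame` called `df_genes`. The first four rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`) are alignment summary counts, not genes.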

In [22]:
df_genes <- df %>% slice(-(1:4))

In [23]:
df_genes %>% head(3)

id,us,fs,rs
CNAG_04548,0,0,0
CNAG_07303,0,0,0
CNAG_07304,8,0,8
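
Dropping rows by position works because the summary rows always come first. An alternative sketch filters on the `id` prefix instead; it assumes that no real gene ID starts with `N_`, which holds for the IDs shown here but is not guaranteed in general. The names `toy` and `toy_genes` are hypothetical.

```r
library(dplyr)

# Toy rows copied from the head() output of the first file.
toy <- tibble(
  id = c("N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous",
         "CNAG_04548", "CNAG_07303"),
  rs = c(2690L, 66100L, 20347L, 647L, 0L, 0L)
)

# Keep only rows whose id does NOT start with the summary prefix "N_".
# Assumption: gene IDs never begin with "N_".
toy_genes <- toy %>% filter(!startsWith(id, "N_"))
toy_genes
```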


**6**. Create a new `data.frame` from `df_genes` containing only the 1st and 4th columns and save it as a new variable `df_final`.

In [24]:
df_final <- df_genes %>% select(id, rs)
df_final %>% head(3)

id,rs
CNAG_04548,0
CNAG_07303,0
CNAG_07304,8


**7**. Now do steps 3, 5 and 6 for the other 3 files using a loop, and combine them with `df_final` using `full_join` on the `id` column to end up with a data.frame with 5 columns (id and 4 count columns).

In [25]:
for (i in 2:4) {
    filename <- filenames[i]
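    # Build the full path to the i-th file
    path <- file.path(data_dir, filename)
    # Read without printing the column specification message
    df <- suppressMessages(read_tsv(path, col_names = c('id', 'us', 'fs', 'rs')))
    # Drop the 4 summary rows and keep the gene ID and sense-strand counts
    df <- df %>% slice(-(1:4)) %>% select(id, rs)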
    df_final <- full_join(df_final, df, by = 'id') 
}
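
The same combine step can also be written without an explicit loop. The sketch below uses `purrr::reduce` to apply `full_join` across a list of tables; it is an alternative to the loop, not the exercise's solution. The in-memory list `lanes` is a hypothetical stand-in for the four files after steps 3, 5 and 6, with two rows of counts taken from the outputs in this notebook.

```r
library(dplyr)
library(purrr)

# Hypothetical per-lane tables (id plus sense-strand counts).
ids <- c("CNAG_04548", "CNAG_07304")
lanes <- list(
  tibble(id = ids, rs = c(0L, 8L)),
  tibble(id = ids, rs = c(0L, 7L)),
  tibble(id = ids, rs = c(0L, 10L)),
  tibble(id = ids, rs = c(1L, 9L))
)

# Name each count column after its lane before joining,
# so the joins never have to disambiguate repeated `rs` columns.
lanes <- map2(lanes, seq_along(lanes),
              function(df, i) setNames(df, c("id", paste0("lane", i))))

# Fold the list into one table: ((lane1 join lane2) join lane3) join lane4.
combined <- reduce(lanes, function(x, y) full_join(x, y, by = "id"))
combined
```

Naming the columns first means the result already has the `lane1` to `lane4` names, at the cost of one extra step before the join.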

**8**. Rename the counts columns as `lane1`, `lane2`, `lane3` and `lane4`. At this point you should have a `data.frame` that looks like this

| id | lane1 | lane2 | lane3 | lane4 |
| - | - | - | - | - |
| gene0 | 0 | 0 | 0 | 1 |
| gene1 | 0 | 0 | 0 | 0 |
| gene2 | 8 | 4 | 10 | 3 |



In [26]:
names(df_final) <- c('id', 'lane1', 'lane2', 'lane3', 'lane4')

In [27]:
head(df_final, 3)

id,lane1,lane2,lane3,lane4
CNAG_04548,0,0,0,1
CNAG_07303,0,0,0,0
CNAG_07304,8,7,10,9
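
Because `full_join` keeps every `id` that appears in either table, an ID missing from one lane would get `NA` in that lane's column, and any later sum involving it would also be `NA`. A quick sanity check is sketched below on hypothetical toy tables (`a`, `b` and the gene names `g1` to `g3` are made up for illustration).

```r
library(dplyr)

# Two toy lanes that only partly share their ids.
a <- tibble(id = c("g1", "g2"), lane1 = c(0L, 8L))
b <- tibble(id = c("g2", "g3"), lane2 = c(7L, 1L))

joined <- full_join(a, b, by = "id")

# TRUE if any cell is missing after the join.
has_missing <- anyNA(joined)

# Number of missing values per column.
missing_per_column <- colSums(is.na(joined))
missing_per_column
```

Running `anyNA(df_final)` on the real table would show whether all four files list the same set of IDs.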


**9**. Create a new column called `counts` containing the sum of lanes 1-4 and save the result as `df_with_counts`.

In [28]:
df_with_counts <- df_final %>% 
mutate(counts = lane1 + lane2+ lane3 + lane4) 
head(df_with_counts)

id,lane1,lane2,lane3,lane4,counts
CNAG_04548,0,0,0,1,1
CNAG_07303,0,0,0,0,0
CNAG_07304,8,7,10,9,34
CNAG_00001,0,0,0,0,0
CNAG_07305,0,1,0,0,1
CNAG_00002,66,59,74,66,265


**Note on more fancy R**.

If there are too many columns to add by hand, you can use `Reduce`. The funny `.` notation is a placeholder for the object being piped in.

`Reduce` takes the binary operation given as its first argument and applies it cumulatively, from left to right, to the elements of its second argument. An optional third argument supplies an initial value; without it, the first element is used as the starting point. Applied to `.[2:5]`, the elements are the four lane columns, so `+` adds them element-wise.

For example:

`Reduce('+', 1:4)` evaluates `(((1 + 2) + 3) + 4)` and `Reduce('+', 1:4, 0)` evaluates `((((0 + 1) + 2) + 3) + 4)`. Both return 10, i.e. the same result as `sum`.

In [35]:
df_final %>% 
mutate(counts = Reduce('+', .[2:5])) %>% 
head(5)

id,lane1,lane2,lane3,lane4,counts
CNAG_04548,0,0,0,1,1
CNAG_07303,0,0,0,0,0
CNAG_07304,8,7,10,9,34
CNAG_00001,0,0,0,0,0
CNAG_07305,0,1,0,0,1
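
To see the intermediate steps of the fold, `Reduce` accepts `accumulate = TRUE`. The sketch below also shows `rowSums` as another way to add many columns, selecting them by name rather than by position. The table `toy` is a hypothetical two-row extract of `df_final`.

```r
library(dplyr)

# Running totals: each element is the result after one more addition.
running <- Reduce('+', 1:4, accumulate = TRUE)
running  # 1 3 6 10

# Two rows copied from df_final above.
toy <- tibble(
  id    = c("CNAG_04548", "CNAG_07304"),
  lane1 = c(0L, 8L),
  lane2 = c(0L, 7L),
  lane3 = c(0L, 10L),
  lane4 = c(1L, 9L)
)

# rowSums() adds across the selected columns for each row.
# Note that it returns doubles, not integers.
toy_counts <- toy %>%
  mutate(counts = rowSums(select(., lane1:lane4)))
toy_counts
```

Selecting `lane1:lane4` by name keeps working if columns are later added before `lane1`, whereas `.[2:5]` depends on column position.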


**10**. Keep only the `id` and `counts` columns, remove the rows where the gene count is 0, and save the result as `df_counts`.

- How many genes with non-zero counts are there?
- What is the gene(s) with the highest count?
- What are the 10 most common count values, i.e. the counts shared by the largest number of genes?

In [36]:
df_with_counts

id,lane1,lane2,lane3,lane4,counts
CNAG_04548,0,0,0,1,1
CNAG_07303,0,0,0,0,0
CNAG_07304,8,7,10,9,34
CNAG_00001,0,0,0,0,0
CNAG_07305,0,1,0,0,1
CNAG_00002,66,59,74,66,265
CNAG_00003,38,25,27,22,112
CNAG_00004,74,79,79,69,301
CNAG_00005,33,25,32,24,114
CNAG_12000,31,45,39,36,151


In [37]:
df_counts <- df_with_counts %>% 
select(id, counts) %>%
filter(counts != 0)

df_counts

id,counts
CNAG_04548,1
CNAG_07304,34
CNAG_07305,1
CNAG_00002,265
CNAG_00003,112
CNAG_00004,301
CNAG_00005,114
CNAG_12000,151
CNAG_12001,13
CNAG_00006,1904


In [39]:
df_counts %>% summarize(total = n())

total
7857


In [40]:
df_counts %>% filter(counts == max(counts))

id,counts
CNAG_06125,213686


In [41]:
df_counts %>% 
group_by(counts) %>% 
summarize(num_genes = n())  %>%
arrange(desc(num_genes)) %>%
head(10)

counts,num_genes
1,183
2,133
3,107
5,90
4,88
6,81
7,76
8,56
9,56
11,55
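
The `group_by` / `summarize` / `arrange` chain above is a common enough pattern that `dplyr` offers `count` as a shorthand. The sketch below assumes a `dplyr` version in which `count` supports the `name` argument; `toy` and its gene names are hypothetical.

```r
library(dplyr)

# Hypothetical counts table: three genes with count 1, two with 2, one with 34.
toy <- tibble(
  id     = paste0("g", 1:6),
  counts = c(1L, 1L, 1L, 2L, 2L, 34L)
)

# count() groups by `counts`, tallies the rows per group,
# names the tally column `num_genes`, and sorts it in descending order.
top <- toy %>%
  count(counts, name = "num_genes", sort = TRUE) %>%
  head(10)
top
```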
