In [None]:
# Limit the number of output lines shown for a code chunk via the chunk option output.lines (see https://community.rstudio.com/t/showing-only-the-first-few-lines-of-the-results-of-a-code-chunk/6963/2)
library(knitr)
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options))  # pass to default hook
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "..."
  if (length(lines)==1) {        # first n lines
    if (length(x) > lines) {
      # truncate the output, but add ....
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(more, x[lines], more)
  }
  # paste these lines together
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})

We will review basic commands in R.  
We will take a look at the results from a differential expression analysis between males and females with colon cancer.

## Reading data from files
First, set the directory that you are working in.  
If you don't want to type out the full file path, in RStudio you can use the menu option:  
Session > Set Working Directory > Choose Directory

In [None]:
# To check the working directory
getwd()

In [None]:
# To set the working directory use setwd()
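
As a minimal sketch, `setwd()` takes the path of the target directory as a string. The example below uses `tempdir()` as a stand-in for your project folder (an assumption for illustration; substitute your own path) and restores the original directory afterwards.

```r
# Remember the current working directory so it can be restored
old_wd <- getwd()

# tempdir() stands in for a real project folder (illustrative assumption)
setwd(tempdir())
new_wd <- getwd()
print(new_wd)

# Go back to where we started
setwd(old_wd)
```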

Now, load the data and assign it to a variable.

In [None]:
dge <- read.delim("DEresults_colonCancer.txt", stringsAsFactors = FALSE)
annotations <- read.delim("gene_annotations.txt", stringsAsFactors = FALSE) 

In RStudio you can load the data using the menu option: File > Import Dataset > From Text  
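
To see what `read.delim()` does without the course files, here is a small sketch that writes a made-up table to a temporary file and reads it back. The gene names and values are invented for illustration. Whether the course files use exactly this layout is an assumption, but it shows one way gene names can end up as row names: when the header line has one fewer field than the data lines, the first column is used for the row names.

```r
# Write a tiny tab-delimited table to a temporary file (made-up values)
toy <- data.frame(logFC = c(2.5, -1.2), adj.P.Val = c(0.01, 0.40),
                  row.names = c("geneA", "geneB"))
toy_file <- tempfile(fileext = ".txt")
write.table(toy, toy_file, sep = "\t", quote = FALSE)

# read.delim() expects tab-separated columns and a header line
toy_in <- read.delim(toy_file, stringsAsFactors = FALSE)
str(toy_in)
rownames(toy_in)
```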

Let's look at the first six rows of each data frame.

In [None]:
head(dge)
head(annotations)

#### Access data by index, by name or by logical vector  
Select the first column

In [None]:
dge[1]
dge[,1]
dge[,"logFC"]
dge$logFC
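
These four forms do not all return the same kind of object. The sketch below uses a made-up data frame (`toy`, not part of the course data) to show the difference.

```r
# Made-up data frame standing in for dge
toy <- data.frame(logFC = c(2.5, -1.2, 0.3), AveExpr = c(5.1, 7.4, 6.0),
                  row.names = c("geneA", "geneB", "geneC"))

class(toy[1])        # one index in single brackets keeps a data frame
class(toy[, 1])      # [row, column] indexing of one column drops to a vector
class(toy$logFC)     # $ also returns the column as a vector
identical(toy[, 1], toy$logFC)
```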

Select the first row

In [None]:
dge[1,]
dge["DDX3Y",]

Select the element in the first row and second column

In [None]:
dge[1,2]

Select the first three columns

In [None]:
dge[,1:3]

Select all columns except the first

In [None]:
dge[2:ncol(dge)]
dge[-1]

## Logical test operators
Operator  | Details
--------- | ----------------------------------------------------------------------
== | equal to    
!= |  not equal to  
& |  and   
\| |  or  
< |  less than  
>  | greater than    
<= |  less than or equal to  
>= |  greater than or equal to  

#### Logical tests
A logical test returns a logical vector, with TRUE or FALSE for each element tested.

In [None]:
x <- 1:10
x == 5
x < 5
x >= 5
table(x == 5)

Now we can use the logical vector to pull out the values that pass the test.  
Return the values in x that are smaller than 5

In [None]:
x[x < 5]
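
Tests can be combined with the `&` and `|` operators from the table above. A short sketch on the same kind of vector:

```r
x <- 1:10

# & keeps elements where both tests are TRUE
x[x > 3 & x < 8]

# | keeps elements where at least one test is TRUE
x[x < 3 | x > 8]

# TRUE counts as 1, so sum() counts the elements that pass a test
sum(x >= 5)
```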

#### Conditional statements and filtering
We can extend the same principle to data frames.  
Let's extract only the rows for genes with logFC greater than 2.

In [None]:
dge[dge$logFC > 2,]

Select the rows for genes with logFC greater than 2 or less than -2

In [None]:
dge[dge$logFC > 2 | dge$logFC < -2,]
dge[abs(dge$logFC) > 2,]

Select genes that are on chromosome Y

In [None]:
is_chrY <- annotations$chrom == "chrY"
is_chrY
annotations[is_chrY,]
# You can invert the result using a logical not operation
annotations[!(is_chrY),]

Select genes that are in a defined group of lncRNAs

In [None]:
lncRNAs <- c("LINC01128", "LINC00115", "LINC01342", "LINC00982", "LINC00278", "LINC00280", "LINC00279", "LINC00265-2P", "LINC00266-2P", "LINC00266-4P", "LINC00265-3P")
is_lncRNAs <- annotations$name %in% lncRNAs
annotations[is_lncRNAs,]
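
The `%in%` operator returns one TRUE or FALSE per element on its left-hand side, which is what makes it suitable for row filtering. A small sketch with a made-up name vector (the names below are for illustration only, and `LINC99999` is invented):

```r
gene_names <- c("geneA", "LINC01128", "geneB", "LINC00115")
wanted <- c("LINC00115", "LINC01128", "LINC99999")  # last name is invented and absent

# One TRUE/FALSE per element of gene_names, regardless of the order in wanted
is_wanted <- gene_names %in% wanted
is_wanted
gene_names[is_wanted]
```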

Combine multiple criteria with AND (&) operation

In [None]:
selection_criteria <- is_lncRNAs & is_chrY
annotations[selection_criteria, ]

#### Sorting and ordering

In [None]:
sort(dge$logFC)
sort(dge$logFC, decreasing = TRUE)
dge[order(dge$logFC, decreasing=TRUE),]
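
`sort()` returns the sorted values, while `order()` returns the positions that would sort them, which is why `order()` is the one used to rearrange rows. A sketch on a made-up data frame:

```r
# Made-up data frame standing in for dge
toy <- data.frame(logFC = c(0.3, 2.5, -1.2),
                  row.names = c("geneA", "geneB", "geneC"))

sort(toy$logFC)                             # the values themselves, smallest first
idx <- order(toy$logFC, decreasing = TRUE)  # positions of the values, largest first
idx
toy[idx, , drop = FALSE]                    # drop = FALSE keeps a one-column data frame
rownames(toy)[idx]
```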

## Text manipulation
Combining text vectors together

In [None]:
paste(annotations$chrom)
paste(annotations$chrom, ":", annotations$start, "-", annotations$end)
paste(annotations$chrom, ":", annotations$start, "-", annotations$end, sep="")
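
The difference between these calls is the separator. A sketch with made-up coordinates, which also shows `paste0()` (a shorthand for `sep = ""`) and the `collapse` argument:

```r
# Made-up coordinates for illustration
chrom <- c("chr1", "chrY")
start <- c(100, 2500)
end   <- c(900, 4000)

paste(chrom, start)                    # default separator is a single space
paste0(chrom, ":", start, "-", end)    # paste0() is paste(..., sep = "")
paste(chrom, collapse = ",")           # collapse joins all elements into one string
```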

Searching and replacing

In [None]:
grep("MIR", annotations$name)
grep("MIR", annotations$name, value=TRUE)
# You can also invert
grep("MIR", annotations$name, value=TRUE, invert=TRUE)

# Create a vector with the genomic coordinates
genome <- paste(annotations$chrom, ":", annotations$start, "-", annotations$end, sep="")
head(genome)
# Replace "chr" by "chromosome"
gsub("chr", "chromosome", genome)
# Now remove the text "chr" (you can just replace "chr" by blank)
gsub("chr", "", genome)
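
The same functions on a small toy vector make the return values easier to see. `grepl()` and the `^` anchor are not used above; they are included here as closely related tools.

```r
# Toy vector of gene symbols for illustration
gene_names <- c("MIR21", "DDX3Y", "MIR155HG", "KDM5D")

grep("MIR", gene_names)                 # positions of the matches
grep("MIR", gene_names, value = TRUE)   # the matching strings
grepl("MIR", gene_names)                # logical vector, handy for filtering

# "^" anchors the pattern to the start of each string
stripped <- gsub("^chr", "", c("chr1:100-900", "chrY:2500-4000"))
stripped
```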

Find the chromosome location for each gene in the dge data by matching the gene names in dge and annotations

In [None]:
match(rownames(dge), annotations$name)
chrom <- annotations[match(rownames(dge), annotations$name), "chrom"]
head(chrom)
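
`match(a, b)` returns, for each element of `a`, its position in `b`, or `NA` when there is no match. A sketch with made-up names (`geneX` deliberately has no annotation):

```r
# Made-up gene names and annotation table
dge_genes <- c("geneC", "geneA", "geneX")   # geneX is absent from the annotation
annot <- data.frame(name  = c("geneA", "geneB", "geneC"),
                    chrom = c("chr1", "chr2", "chrY"),
                    stringsAsFactors = FALSE)

idx <- match(dge_genes, annot$name)
idx                 # position of each gene in annot$name, NA if missing
annot$chrom[idx]    # chromosomes in the order of dge_genes
```

Because the result follows the order of the first argument, it can be attached directly to the table the names came from.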

## Looping
The “for loop”

In [None]:
my_vector <- c(10,11,12,13) 
for (item in my_vector){
  print(item)
  #do actual operations here
}

`apply` performs an operation on each row (second argument 1) or each column (second argument 2) of a data frame

In [None]:
?apply
apply(dge, 1, mean)
apply(dge, 2, mean)
colMeans(dge)
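
A sketch on a small made-up data frame shows what the second argument does. Note that `apply()` converts a data frame to a matrix first, so it only gives numeric results when every column is numeric.

```r
# Made-up numeric data frame
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

row_means <- apply(toy, 1, mean)   # MARGIN = 1: one value per row
col_means <- apply(toy, 2, mean)   # MARGIN = 2: one value per column
row_means
col_means

# Built-in shortcuts for the same calculations
rowMeans(toy)
colMeans(toy)
```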

`tapply` applies a function to subsets of your data:  
tapply(vector of values, vector of categories, function to apply)  
For example, what are the smallest and largest gene start positions on each chromosome? 

In [None]:
?tapply
tapply(annotations$start, annotations$chrom, min)
tapply(annotations$start, annotations$chrom, max)
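
The grouping is easier to follow on a few made-up values: each start position is assigned to the group named at the same position in the category vector, and the function is applied once per group.

```r
# Made-up start positions and their chromosomes
start <- c(100, 5000, 250, 900)
chrom <- c("chr1", "chr1", "chrY", "chrY")

first_start <- tapply(start, chrom, min)   # smallest start per chromosome
last_start  <- tapply(start, chrom, max)   # largest start per chromosome
first_start
last_start
```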

You can also loop over lists and vectors with `lapply` and `sapply`.
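
A minimal sketch of both functions, using a made-up list: `lapply()` always returns a list, while `sapply()` simplifies the result to a vector when it can.

```r
# Made-up list of numeric vectors
gene_sets <- list(setA = c(1.5, 2.5, 3.5), setB = c(-1, 0, 1, 2))

lapply(gene_sets, mean)                   # a list with one mean per element
set_means <- sapply(gene_sets, mean)      # simplified to a named numeric vector
set_sizes <- sapply(gene_sets, length)
set_means
set_sizes
```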

## Final questions
1. Add the chromosome information (chromosome, start and end position) to the differential gene expression results (dge data).
2. Calculate the gene length.
3. What is the mean gene length?
4. Subset the dge data to include only statistically significant genes with a cutoff of FDR < 0.05 and absolute logFC greater than 2.
5. Get the gene names for the statistically significant genes located on the Y chromosome.

#### Solutions

In [None]:
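# 1. Add chromosome, start and end position to the dge data
chr <- annotations[match(rownames(dge), annotations$name), c("chrom", "start", "end")]
dge <- cbind(dge, chr)
dim(dge)
# 2. Calculate the gene length
dge$gene_length <- dge$end - dge$start
# 3. Mean gene length (two equivalent ways); genes without a match have NA length
mean(dge$gene_length, na.rm=TRUE)
apply(dge["gene_length"],2, function(x) mean(x,na.rm=TRUE))
# 4. Significant genes: FDR < 0.05 and absolute logFC greater than 2
sig <- dge[abs(dge$logFC) > 2 & dge$adj.P.Val < 0.05,]
# 5. Significant genes on chromosome Y; which() drops genes whose chrom is NA
rownames(sig)[which(sig$chrom == "chrY")]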