# Brown Univ. Introduction to Bioconductor 2018, Period 1

## Motivations and core values

### Why R?

Bioconductor is based on R. Three key reasons for this are:

- R is used by many statisticians and biostatisticians to create algorithms that advance our ability to understand complex experimental data.

- R is highly interoperable, and fosters reuse of software components written in other languages.

- R is portable to the key operating systems running on commodity computing equipment (Linux, macOS, Windows) and can be used immediately by beginners with access to any of these platforms.

Other languages are starting to share these features. However, the large software ecosystems of R and Bioconductor will continue to play a role even as new languages and environments for genome-scale analysis take shape.

### What is R?

We'll see more clearly what R is as we work with it.  Two features that merit attention are its approach to *functional* and *object-oriented* programming.

Before we get into these programming concepts, let's get clear on the approach we are taking to working with R.

- We are using R *interactively* in the Jupyter notebook system for scientific computing
- Our interaction with R is defined in notebook "cells"
- We can put some code in a cell and ask the notebook server to execute the code
- If there's an error or we want to modify the cell for some reason, we just change the content of the cell and request a new execution

The next two cells introduce a simple R function and then pose some questions that you can answer by modifying the second cell.


In [None]:
# functional programming example
cube = function(x) x^3
cube(4)

In [None]:
# Two exercises:
# 1) what is the cube of 7?
# 2) given the cube function, what is a concise way of defining a function that computes
# the ninth power of its argument, without using the exponentiation operator (^) directly?

In [None]:
# towards object-oriented programming -- here we'll just use the concept

In [None]:
library(Homo.sapiens)
methods(class=class(Homo.sapiens))
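
The call to `methods()` above lists the functions that have behavior specific to the class of `Homo.sapiens`. The same idea can be sketched with base R alone. The objects below are illustrative stand-ins, not part of the course data, and the sketch assumes only a standard R installation.

```r
# Two objects of different classes (illustrative values only)
heights <- c(1.62, 1.75, 1.80)
blood_type <- factor(c("A", "O", "A"))

class(heights)
class(blood_type)

# summary() is a generic function: what it does depends on the class
# of its argument
summary(heights)      # quantiles and mean for a numeric vector
summary(blood_type)   # counts per level for a factor

# List the methods that are defined for a given class
factor_methods <- methods(class = "factor")
factor_methods
```

Loading a Bioconductor package extends this mechanism with classes and methods that carry biological meaning.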

In [None]:
promoters(Homo.sapiens)

In [None]:
# Exercise: Think of the biological definition of 'promoter'.  Does this suggest additional 
# parameters to the 'promoters' method?

### Upshots

We've seen two important facets of the language/ecosystem:

- improvised software creation through user-defined functions
- acquisition of complex, biologically meaningful 'objects' and 'methods' with library()

Calling library() is essential for gaining access to the functions and documentation of a package.  This can be a stumbling block if you remember the name of a function of interest, but not the package in which it is defined.
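
When the package behind a function name has slipped your mind, base R offers a few ways to look it up. This is a minimal sketch using `median` as the example; it assumes the default set of packages that R attaches at startup.

```r
# find() reports where a name is visible among the *attached* packages
find("median")

# The enclosing environment of a function names its home package
environmentName(environment(median))

# getAnywhere() also looks in loaded namespaces that are not attached
getAnywhere("median")$where

# For packages that are installed but not loaded, search the help index:
# ??median    # shorthand for help.search("median")
```

Note that `find()` and `getAnywhere()` only see packages that are already attached or loaded, so `??` is the tool of last resort for packages you have not yet called `library()` on.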
    
Another feature worth bearing in mind: all computations in R proceed by evaluation of functions.  You may write scripts, but they will be sequences of function calls.
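
A short base R sketch may make this concrete: operators, subsetting, and even control flow can be written as explicit function calls, and functions themselves can be passed around as values. The variable names are chosen for illustration only.

```r
cube <- function(x) x^3   # same definition as in the earlier cell

# Infix operators are functions that can be called by name
sum_call <- `+`(1, 2)

# Subsetting is a function call too
third <- `[`(c(10, 20, 30), 3)

# Functions are values: they can be passed to other functions
cubes <- sapply(1:3, cube)

# Even control flow is expressed as a function call
chosen <- `if`(TRUE, "yes", "no")

sum_call
third
cubes
chosen
```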

Finally: There are many ways of using software in R.  

- write scripts and execute them at a command line using Rscript or Unix-like pipes
- use an integrated development environment (IDE) like RStudio
- use R as a command-line interpreter
- use Jupyter notebooks
- use R through online "apps", often composed using the shiny package
    
We'll explore some of these alternate approaches as we proceed.

## Putting it all together

Bioconductor’s core developer group works hard to develop data structures that allow users to work conveniently with genomes and genome-scale data. Structures are devised to support the main phases of experimentation in genome-scale biology:

- Parse large-scale assay data as produced by microarray or sequencer flow-cell scanners.
- Preprocess the (relatively) raw data to support reliable statistical interpretation.
- Combine assay quantifications with sample-level data to test hypotheses about relationships between molecular processes and organism-level characteristics such as growth or disease state.

In this course we will review the objects and functions that you can use to perform these and related tasks in your own research.

## Bioconductor installation and documentation

### Installation support

Once you have R you can obtain a utility for adding Bioconductor packages with

```
source("http://www.bioconductor.org/biocLite.R")
```

The script is sensitive to the version of R that you are using: it selects the Bioconductor release that matches your R version.  It installs and loads the BiocInstaller package.  Let's illustrate its use.

In [None]:
library(BiocInstaller)
biocLite("genomicsclass/GSE5859Subset")  # this will be used later
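
Starting with Bioconductor 3.8 (released in late 2018), BiocInstaller and `biocLite()` were replaced by the BiocManager package from CRAN. For readers on a more recent R, the following is a hedged sketch of the equivalent step. The helper name `install_bioc` is hypothetical and introduced only for illustration; the call needs network access, so it is left commented out.

```r
# Sketch only: not part of the original 2018 notebook.
# install_bioc is a hypothetical convenience wrapper.
install_bioc <- function(pkgs) {
    # BiocManager itself is distributed through CRAN
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    # GitHub-style "user/repo" names additionally need the remotes package
    BiocManager::install(pkgs)
}

# install_bioc("genomicsclass/GSE5859Subset")  # uncomment to install
```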

### Documentation

For novices, large-scale help resources are always important.  

- Bioconductor's __[main portal](http://www.bioconductor.org)__ has sections devoted to installation, learning, using, and developing.  There is a Twitter feed for those who want to keep up with developments.
- The __[support site](http://support.bioconductor.org)__ is active and friendly.
- R has extensive help resources at __[CRAN](http://cran.r-project.org)__ and within any R session.

In [None]:
# help.start() -- may not work with notebooks but very useful in RStudio or at command line

Documentation on any base or Bioconductor function can be found using the `?` operator:

In [None]:
?mean # use ?? to find documentation on functions in all installed packages

All good manual pages include executable examples:

In [None]:
example(lm)

We can get 'package level' help -- a concise description and list of documented functions.

In [None]:
help(package="genefilter", help_type="html") # generates embedded view of DESCRIPTION and function index

Large-scale documentation that narrates and illustrates package (as opposed to function) capabilities is provided in vignettes.

In [None]:
vignette()

In [None]:
vignette("create_objects", package="pasilla")

In summary: documentation for Bioconductor and R utilities is diverse but discovery is supported in many ways.  RTFM.

## Data structure and management for genome-scale experiments

Data management is often regarded as a specialized and tedious dimension of scientific research. 

- Because failures of data management are extremely costly in terms of resources and reputation, highly reliable and efficient methods are essential. 
- The customary lab-science practice of maintaining data in spreadsheets is regarded as risky. We want to add value to data by making it easier to follow reliable data management practices.

In Bioconductor, the principles that guide software development are also applied to data management strategy. 

- High value accrues to data structures that are modular and extensible. 
- Packaging and version control protocols apply to data class definitions. 
- We will motivate and illustrate these ideas by 
    - giving examples of transforming spreadsheets to semantically rich objects, 
    - working with the NCBI GEO archive, 
    - dealing with families of BAM and BED files, and (optionally)
    - using external storage to foster coherent interfaces to large multiomic archives like TCGA.

### Coordinating information from multiple tables

With the GSE5859Subset package, we illustrate a "natural" approach to collecting microarray data and its annotation.

In [None]:
library(GSE5859Subset)
data(GSE5859Subset) # will 'create' geneExpression, sampleInfo, geneAnnotation

In [None]:
dim(geneExpression)

In [None]:
head(geneExpression[,1:5])

In [None]:
head(sampleInfo)

In [None]:
head(geneAnnotation)

Here we have three objects in R that are conceptually linked.  We notice that `sampleInfo` has an `ethnicity` field and that the column names for the `geneExpression` table are similar in format to the `filename` field of `sampleInfo`.  Let's check that they in fact agree:

In [None]:
all(sampleInfo$filename == colnames(geneExpression))
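
The comparison with `==` assumes that both tables list the samples in the same order. Below is a minimal sketch of how `match()` can realign a sample table whose rows are in a different order. The objects `expr` and `info` are small made-up stand-ins, not the course data.

```r
# Toy stand-ins for geneExpression and sampleInfo (hypothetical values)
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("s1.CEL", "s2.CEL", "s3.CEL")))
info <- data.frame(filename = c("s3.CEL", "s1.CEL", "s2.CEL"),
                   group = c(1, 0, 0),
                   stringsAsFactors = FALSE)

# The alignment check fails for this shuffled sample table
aligned <- all(info$filename == colnames(expr))

# match() gives, for each expression column, the row of info describing it
idx <- match(colnames(expr), info$filename)
stopifnot(!anyNA(idx))   # every column must have a sample record

# Reorder the sample table so that row i describes column i
info <- info[idx, ]
identical(info$filename, colnames(expr))
```

Keeping separate tables in step by hand like this is error-prone, which is part of the motivation for the unified data structures discussed in this section.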

In [None]:
table(sampleInfo$group)

In [None]:
# What is the distribution of ethnicity in this dataset?