
# Introduction to R


R is an open-source programming language for statistical computing [1]. It is derived from another programming language, S, which was developed in 1976 at the world-renowned Bell Laboratories.

R differs from other statistical programmes in that it allows users to write their own functions, or suites of functions known as packages. This is what lets it implement a wide and constantly expanding catalogue of statistical models.

Packages can be submitted to the core team, who check the integrity and functionality of user-submitted packages. Those that pass this quality-control stage are stored in the Comprehensive R Archive Network (CRAN) repository, a global network of FTP mirror sites totalling 104 sites in 49 countries.

In the United Kingdom, five academic institutions host CRAN mirrors:

1.	School of Mathematics, University of Bristol
2.	European Bioinformatics Institute, Wellcome Trust Genome Campus
3.	Department of Mathematics, Imperial College London
4.	Computer Science Department, Middlesex University London
5.	School of Physics and Astronomy, University of St Andrews

There are currently 6,367 packages available in CRAN. In addition, various other repositories have been created for specific applications; for example, the Bioconductor project [2] contains 934 packages for the analysis of genomic data. R's popularity in academia has extended into large-scale corporate business and government agencies because it provides the latest tools available for data analysis at zero cost.

Why use it?  
* Free  
* Well developed and effective programming language  
* Powerful suite of tools that is well-documented  
* Good online support  
* Available for all major operating systems, and can be run on desktops and high-performance computing clusters


## *Basic Operations*


At the most basic level you can use it as a calculator:

In [2]:
1 + 2

It allows you to assign a value to a named data structure (object):

In [3]:
a <- 1
b <- 2
a + b

You can assign a number of values (a vector) to an object:

In [4]:
a <- c(1,2,3,4,5)
b <- c(1, 1e1, 1e2, 1e3, 1e4)
a * b

Instead of numbers, you can supply characters to objects:

In [5]:
first.name <- c("Homer", "Peter", "Francine")
last.name <- c("Simpson", "Griffin", "Smith")
paste(first.name, last.name, sep = " ")


## *Moving Around*


An important concept to grasp if you are new to programming is the working directory. As its name suggests, it is the directory (think folder) where you store the files to be analysed and where you can save objects you create in R.


Our first example used the paste function to combine two character vectors (first and last names). 
We can also use this function to build folder paths from within R. This becomes especially useful 
if you want your script to write its output to particular folders, e.g. by gender or 
ethnicity.
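As a minimal sketch of this idea, the following builds one output folder per group and writes a small file into each. The group names, the `results` folder and the file name `summary.csv` are hypothetical. `file.path` is used here in place of `paste` because it inserts the path separator for you, and everything is written under a temporary directory so none of your own folders are touched.

```r
# Hypothetical grouping variable; replace with your own categories
groups <- c("female", "male")

# Work under a temporary directory so the example is safe to run anywhere
base.dir <- file.path(tempdir(), "results")

# Build one folder path per group
# (equivalent to paste(base.dir, groups, sep = "/"))
out.dirs <- file.path(base.dir, groups)

for (i in seq_along(groups)) {
  # Create the folder, including any missing parent folders
  dir.create(out.dirs[i], recursive = TRUE, showWarnings = FALSE)
  # Write a small placeholder table into the group's folder
  write.csv(data.frame(group = groups[i], n = i),
            file = file.path(out.dirs[i], "summary.csv"),
            row.names = FALSE)
}

# List the files that were written, relative to base.dir
list.files(base.dir, recursive = TRUE)
```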

When you first open R the default working directory is normally the folder where the R library is saved. You can find out by typing:

In [7]:
getwd()

Normally the files you want to analyse are not in the default folder. Keeping them elsewhere is good practice for organising your data, and you may also find that the data is in a restricted location, e.g. a network drive, and cannot be moved for security reasons. You can set the working directory at any time during your R session; simply tell R where you want the working directory to be:

In [9]:
setwd("C:/Users/ucbtkdi/temp")

In [10]:
getwd()
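If a script changes the working directory only temporarily, it can be convenient to switch back afterwards. The sketch below relies on the fact that `setwd` invisibly returns the previous working directory; the name `old.wd` is just an illustrative choice, and `tempdir()` stands in for whatever folder you actually need.

```r
# setwd() invisibly returns the previous working directory, so keep it
old.wd <- setwd(tempdir())

# ... read or write files relative to the new working directory here ...
getwd()

# Restore the original working directory when you are done
setwd(old.wd)
```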


## *Data Storage Types*


R stores data in the form that it judges most appropriate. There are several types it can select from:  


* logical - TRUE/FALSE  
* numeric
    + integer - 1L, 2L, 3L (whole numbers; the L suffix requests integer storage)  
    + double - 64-bit double precision, the default for numbers such as 1 or 2.5  
    + complex - 1 + 0i
* character - "hello"
* symbol - the name of an object
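
To check which storage type R has chosen for a value, you can ask it directly. The following is a small illustration using the base function `typeof`; the values are arbitrary examples.

```r
typeof(TRUE)        # "logical"
typeof(1L)          # "integer" (the L suffix asks for an integer)
typeof(1)           # "double"
typeof(1 + 0i)      # "complex"
typeof("hello")     # "character"
typeof(quote(a))    # "symbol"
```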


## *Getting Data into R*


You will normally get data into R from external files or locations, for example,


In [None]:
data1 <- read.table("test.csv", header = TRUE, sep = ",")

The function "read.table" is one of several related functions in R that allow you to read in data. Others are read.csv and read.delim, for comma- and tab-separated files respectively. All accept a predefined list of options that you can specify. To understand more about a function, you can use R's in-built help system:

In [15]:
help("read.table")

One of the options with read.table is to select the value that separates columns in your dataset. If you are reading in comma-separated value (csv) files, use the read.csv function, as it saves you some typing.
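
To make the relationship between the two functions concrete, here is a small sketch that writes a made-up table to a temporary csv file and reads it back both ways. The column names and values are invented for illustration only.

```r
# Write a small, made-up table to a temporary csv file
tmp.csv <- tempfile(fileext = ".csv")
write.csv(data.frame(patid = 1:3, height = c(160.5, 172.0, 181.2)),
          file = tmp.csv, row.names = FALSE)

# read.table needs to be told about the header row and the separator
via.table <- read.table(tmp.csv, header = TRUE, sep = ",")

# read.csv sets header = TRUE and sep = "," by default
via.csv <- read.csv(tmp.csv)

# Compare the two results
all.equal(via.table, via.csv)
```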

If you are given data in an Excel workbook (.xls/.xlsx), you will need to call a function from a package (called "xlsx").

In [None]:
library("xlsx")
data.xlsx <- read.xlsx(file = "test_excelfile.xlsx", sheetIndex = 1)

To read in large files efficiently, you can call another specialised function,
fread, this time from the "data.table" package.

In [None]:
data <- data.table::fread("big_file.csv")


## *Data Structures*


R normally reads data into structures called data.frames. A data.frame is a very useful object in that it allows you to mix vectors of different types (numeric, character, etc.). It is an organised list of vectors: each column of your dataset is stored as an individual vector. The data.frame is displayed as a 2-dimensional array with rows and columns and can be indexed:

In [None]:
head(data) # look at the first 6 rows
data.xlsx[1:6,]

A related structure is a matrix, again a 2 dimensional array that can be indexed, but all columns must be of the same type. These are particularly useful when storing the output of a statistical analysis. 

In [18]:
the.matrix <- matrix(c(1:9), nrow = 3, ncol = 3) # put numbers 1-9 into a 3 by 3 matrix
the.matrix
now.dataframe <- as.data.frame(the.matrix)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


You can always convert a matrix into a data.frame. To go the other way, the columns of the data.frame should all be of the same type; otherwise the values are coerced to a common type.
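
The conversion in both directions can be sketched as follows. The object names (`m`, `df`, `back`, `mixed`) are illustrative only; the second half shows what happens when the columns of a data.frame do not share a type.

```r
# A matrix converts cleanly to a data.frame; columns are named V1, V2, V3
m <- matrix(1:9, nrow = 3, ncol = 3)
df <- as.data.frame(m)
str(df)

# Going back works well when every column has the same type
back <- as.matrix(df)
typeof(back)   # "integer"

# With mixed column types, as.matrix coerces everything to character
mixed <- data.frame(id = 1:3, name = c("a", "b", "c"))
typeof(as.matrix(mixed))   # "character"
```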


A list is an ordered collection of components. These components can be anything, e.g. numbers, strings, data.frames, or other lists(!).

In [19]:
new.list <- list(Forename = first.name, 
                 Surname = last.name, 
                 'Date of Birth' = c("1950-01-30", "1960-05-20", "1955-09-10"), 
                 PracID = c(100,NA,450))
new.list
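
Once a list exists, its components can be pulled out by name or position. The sketch below uses a smaller, made-up list called `patients` so that it runs on its own; the same operators apply to any list.

```r
# A small list with made-up values, mirroring the structure above
patients <- list(Forename = c("Homer", "Peter", "Francine"),
                 PracID = c(100, NA, 450))

patients$Forename        # extract a component by name
patients[["PracID"]]     # the same idea with double brackets
patients["PracID"]       # single brackets return a list of length 1

# Components can be added or replaced by assignment
patients$Surname <- c("Simpson", "Griffin", "Smith")
names(patients)
```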

Lists are a good way of storing many objects without cluttering your workspace. For example, say you had 1,000 csv files in a folder and wanted to read them all into R:

In [20]:
# for(i in 1:1e3){
# write.csv(data.frame(patid = sample(1e3, 10), height = sample(150:200, 10), weight = sample(60:120,10)), file = paste0("file_", i, ".csv"))
# }

In [None]:
list.myfilenames <- list.files(getwd(), pattern = "\\.csv$") # get the names of the files you want
list.myfiles <- lapply(setNames(list.myfilenames, make.names(gsub("\\.csv$", "", list.myfilenames))), read.csv) # read them in, named by file name without the .csv extension
lapply(list.myfiles[1:10], head) # check a couple
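
A common next step is to stack the files into a single data.frame. The following self-contained sketch assumes all files share the same columns; it creates three small made-up csv files in a temporary folder (the folder name `csv_demo` and the column values are invented), reads them into a named list, and combines them with `do.call(rbind, ...)`.

```r
# Create three small csv files in a temporary folder (made-up data)
demo.dir <- file.path(tempdir(), "csv_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(patid = (1:2) + 10 * i, height = c(150, 160) + i),
            file = file.path(demo.dir, paste0("file_", i, ".csv")),
            row.names = FALSE)
}

# Read every csv in the folder into a list named by file name
demo.names <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)
demo.files <- lapply(setNames(demo.names, basename(demo.names)), read.csv)

# Stack the list into a single data.frame
all.data <- do.call(rbind, demo.files)
dim(all.data)
```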


## *Getting Help*


In [21]:
?read.csv
help("read.csv")

This will do one of several things depending on the operating system and internet connection status; it will open your browser on the GUI version, open a new window within your session if you're not connected to the internet, or if you're working on a server it will print the help file to the command line. The help files usually give an example of the function for you to see how it should work. 

To see an example of the function in action if it exists, you can use 'example':

In [22]:
example(summary)


summry> summary(attenu, digits = 4) #-> summary.data.frame(...), default precision
     event            mag           station         dist       
 Min.   : 1.00   Min.   :5.000   117    :  5   Min.   :  0.50  
 1st Qu.: 9.00   1st Qu.:5.300   1028   :  4   1st Qu.: 11.32  
 Median :18.00   Median :6.100   113    :  4   Median : 23.40  
 Mean   :14.74   Mean   :6.084   112    :  3   Mean   : 45.60  
 3rd Qu.:20.00   3rd Qu.:6.600   135    :  3   3rd Qu.: 47.55  
 Max.   :23.00   Max.   :7.700   (Other):147   Max.   :370.00  
                                 NA's   : 16                   
     accel        
 Min.   :0.00300  
 1st Qu.:0.04425  
 Median :0.11300  
 Mean   :0.15422  
 3rd Qu.:0.21925  
 Max.   :0.81000  
                  

summry> summary(attenu $ station, maxsum = 20) #-> summary.factor(...)
    117    1028     113     112     135     475    1030    1083    1093    1095 
      5       4       4       3       3       3       2       2       2       2 
    111     116   
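
If you do not know the exact name of a function, a couple of other base tools can help. This is a brief sketch: `apropos` searches the names of objects currently available in your session, and `args` prints a function's arguments without opening its help page. The search pattern is only an example.

```r
# Find objects on the search path whose names start with "read."
apropos("^read\\.")

# Show the arguments a function accepts without opening its help page
args(read.csv)

# Search the installed help pages by topic (equivalent to ??csv);
# left commented out because it opens a results viewer
# help.search("csv")
```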


## *Libraries*


As we did above when reading in files, functions from packages must first be loaded into the R session with library(). Thankfully, you don't have to install every package each time you start R. To see which packages you already have, type:

In [1]:
installed.packages()

Unnamed: 0,Package,LibPath,Version,Priority,Depends,Imports,LinkingTo,Suggests,Enhances,License,License_is_FOSS,License_restricts_use,OS_type,MD5sum,NeedsCompilation,Built
colorspace,colorspace,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,1.3-2,,"R (>= 2.13.0), methods","graphics, grDevices",,"datasets, stats, utils, KernSmooth, MASS, kernlab, mvtnorm, vcd, dichromat, tcltk, shiny, shinyjs",,BSD_3_clause + file LICENSE,,,,,yes,3.2.3
crayon,crayon,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,1.3.4,,,"grDevices, methods, utils",,"mockery, rstudioapi, testthat, withr",,MIT + file LICENSE,,,,,no,3.2.3
curl,curl,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,3.0,,R (>= 3.0.0),,,"testthat (>= 1.0.0), knitr, jsonlite, rmarkdown, magrittr, httpuv, webutils",,MIT + file LICENSE,,,,,yes,3.2.3
devtools,devtools,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,1.13.3,,R (>= 3.0.2),"httr (>= 0.4), utils, tools, methods, memoise (>= 1.0.0), whisker, digest, rstudioapi (>= 0.2.0), jsonlite, stats, git2r (>= 0.11.0), withr",,"curl (>= 0.9), crayon, testthat (>= 1.0.2), BiocInstaller, Rcpp (>= 0.10.0), MASS, rmarkdown, knitr, hunspell (>= 2.0), lintr (>= 0.2.1), bitops, roxygen2 (>= 5.0.0), evaluate, rversions, covr, gmailr (> 0.7.0)",,GPL (>= 2),,,,,no,3.2.3
dichromat,dichromat,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,2.0-0,,"R (>= 2.10), stats",,,,,GPL-2,,,,,,3.2.3
digest,digest,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,0.6.12,,R (>= 2.4.1),,,"knitr, rmarkdown",,GPL (>= 2),,,,,yes,3.2.3
evaluate,evaluate,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,0.10.1,,R (>= 3.0.2),"methods, stringr (>= 0.6.2)",,"testthat, lattice, ggplot2",,MIT + file LICENSE,,,,,no,3.2.3
gdata,gdata,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,2.18.0,,R (>= 2.3.0),"gtools, stats, methods, utils",,RUnit,,GPL-2,,,,,no,3.2.3
ggplot2,ggplot2,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,2.2.1,,R (>= 3.1),"digest, grid, gtable (>= 0.1.1), MASS, plyr (>= 1.7.1), reshape2, scales (>= 0.4.1), stats, tibble, lazyeval",,"covr, ggplot2movies, hexbin, Hmisc, lattice, mapproj, maps, maptools, mgcv, multcomp, nlme, testthat (>= 0.11.0), quantreg, knitr, rpart, rmarkdown, svglite",sp,GPL-2 | file LICENSE,,,,,no,3.2.3
git2r,git2r,/home/kenan/R/x86_64-pc-linux-gnu-library/3.2,0.19.0,,"R (>= 3.0.2), methods","graphics, utils",,getPass,,GPL-2,,,,,yes,3.2.3


If you need a package that is not installed on your system, you can install it:

In [None]:
install.packages("ggplot2")

Similarly, to remove a package:

In [None]:
remove.packages("ggplot2")
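
In scripts it is often useful to install a package only when it is missing. The sketch below shows one possible way to do that with `requireNamespace`; the helper name `is.installed` and the package name `notARealPackage123` are hypothetical, and the install step is commented out because it needs an internet connection.

```r
# Returns TRUE if the package can be loaded, without attaching it
is.installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is.installed("stats")                 # a package that ships with R
is.installed("notARealPackage123")    # hypothetical name, assumed absent

# Install only when the package is missing:
# if (!is.installed("ggplot2")) install.packages("ggplot2")
```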

R has many built-in functions; you can see them all in one place:

In [7]:
builtins()