The Unix shell is an extremely powerful environment that features many handy tools, each doing one simple thing, which can be piped together with \texttt{|}.
- \texttt{wc}, \texttt{grep}, \texttt{cut}, \texttt{tr}, \texttt{sed}, \texttt{awk}, …
- The shell can also be used for scripting.
- All exploit regular expressions. See ItDT book (later).
- \texttt{grep}: find matching lines.
- \texttt{sed}: stream editor. Incredibly handy for one-liners: http://sed.sourceforge.net/sed1line.txt

sed 's/foo/bar/g'            # replace ALL instances in a line
sed -n '/Iowa/,/Montana/p'   # print section of file between two regular expressions (case sensitive)
- awk: flexible pattern matching / processing of text files.
http://www.pement.org/awk/awk1line.txt
# print the sum of the fields of every line
awk '{s=0; for (i=1; i<=NF; i++) s=s+$i; print s}'
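To illustrate piping these tools together, here is a small sketch (the file name and contents are made up for the example):

```shell
# Create a small example file, then count word frequencies with a pipeline:
printf 'foo bar\nfoo baz\nfoo bar\n' > words.txt
# split into one word per line, sort, count duplicates, most frequent first
tr -s ' ' '\n' < words.txt | sort | uniq -c | sort -rn
```

Each tool does one simple job; the pipe composes them into a word-frequency counter.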
- diff shows the differences between version1 and version2.
diff nextsteps/version1.dat nextsteps/version2.dat
- patch: new file = old file + diff
- patches are efficient ways of sending updates. Useful for syncing and version control.
diff version1.dat version2.dat > p
patch < p
diff version1.dat version2.dat   # no output: files are now identical
- Most unix tools used to be limited by the length of lines. Perl removed those restrictions, combining features of awk, sh and C.
- ‘duct tape’ programming language.
- Useful in computational biology. See http://www.bioperl.org
- Excellent Ensembl API, http://www.ensembl.org/info/data/api.html
- G. Valiente. Combinatorial Pattern Matching Algorithms in Computational Biology using Perl and \R. Taylor & Francis/CRC Press (2009).
- Verdict: yucky, but probably [essential | good to know].
- Bidirectional \R/Perl interfaces http://www.omegahat.org/RSPerl/
- \Rfunction{grep}, \Rfunction{sub}, \Rfunction{gsub}, \Rfunction{strsplit}, \Rfunction{nchar}, \Rfunction{substr}, …
- also the \Rpackage{stringr} package
- for sequence data storage and manipulation, the \Rpackage{Biostrings} package
- Modern programming language; less compact than perl:
\footnotesize
# Perl
while (<>) {
    print if /perl/i;
}

# Python
import sys
for line in sys.stdin.readlines():
    if line.lower().find("perl") > -1:
        print line,
http://www.sabren.net/articles/againstperl.php3
\normalsize
- Clean syntax
- Properly object-oriented.
- Not as much support in computational biology (yet). See http://www.biopython.org
- Verdict: More general programming language than \R; lacking (perhaps?) in core numerics and graphics – see NumPy and RPy(2).
- Bidirectional \R/Python interface http://www.omegahat.org/RSPython/
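For reference, the Python 2 filter shown above can be written in modern Python 3; the helper function below is illustrative, not from the original slides:

```python
import sys


def matching_lines(lines, pattern="perl"):
    """Return the lines that contain `pattern`, case-insensitively
    (a Python 3 version of the Perl one-liner shown earlier)."""
    return [line for line in lines if pattern.lower() in line.lower()]


if __name__ == "__main__":
    # behaves like: perl -ne 'print if /perl/i'
    for line in matching_lines(sys.stdin):
        print(line, end="")
```

Factoring the filter into a function makes it testable, at some cost in the one-liner compactness Perl is known for.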
- Open-source project from MIT, “high-level, high-performance dynamic programming language for technical computing” http://julialang.org
- Syntax similar to Matlab.
- C-like performance. Proper macros, influenced by Lisp. What’s not to like?
- No version 1.0 yet. Still rapidly changing.
- No consensus yet on default plotting engine (but Plots.jl is semi-unifying interface).
- Small (compared to CRAN) list of packages, but growing. e.g. BioJulia is quite small https://github.com/BioJulia/Bio.jl currently.
- BUT my hope is Julia will soon (1–2 years) be seen as more attractive than Matlab.
- Low-level programming language
- Very fast, but takes a long time to write code.
- You have to worry about memory allocation yourself.
- All variables have predefined type.
- Critical for numerically-intensive work. (Fortran is less popular these days.)
- \R has built-in \texttt{C} interfaces
- Better know how to program in \texttt{C}.
- Documentation is not always easy to follow: R-Ext, R Internals, as well as \R and other packages’ code.
- \Rfunction{.C}: arguments and return values must be \textit{primitive} (vectors of doubles or integers).
- \Rfunction{.Call}: accepts \R data structures as arguments and return values (\texttt{SEXP} and friends); no type checking is done, though.
- Memory management: memory allocated for \R objects is garbage collected. Thus \R objects in \texttt{C} code must be explicitly \texttt{PROTECT}ed to avoid being \texttt{gc()}ed, and subsequently \texttt{UNPROTECT}ed.
\tiny
- Create a shared library:
R CMD SHLIB gccount.c
- Load the shared object: \Rfunction{dyn.load("gccount.so")}
- Create an \R function that uses it: \Rfunction{gccount <- function(inseq) .Call("gccount", inseq)}
- Use the \texttt{C} code: \Rfunction{gccount("GACAGCATCA")}
s <- "GACTACGA"
gccount
gccount(s)
table(strsplit(s, ""))
system.time(replicate(10000, gccount(s)))
system.time(replicate(10000, table(strsplit(s, ""))))
- \Rpackage{Rcpp} is a great package for writing both \texttt{C} and \texttt{C++} code:
- It comes with loads of documentation and examples.
- No need to worry about garbage collection.
- All basic \R types are implemented as \texttt{C++} classes.
- Easy to interface \texttt{C++} classes (via \texttt{modules}).
- With the \Rpackage{inline} package, code can easily be compiled in \R.
\small
library(Rcpp)
library(inline)
cppCode <- '
Rcpp::NumericVector cx(x);
Rcpp::NumericVector ret(1);
ret[0] = cx[0] * cx[0];
return(ret);
'
squareOne <- cxxfunction(signature(x="numeric"),
plugin="Rcpp", body=cppCode)
squareOne(10)
See following files: counting.cpp and counting.R
\footnotesize
- http://adv-r.had.co.nz/Rcpp.html (check whole book)
- Hadley Wickham has now generated a whole universe of alternative packages for tidying things up.
- Over time, things in R have got quite messy.
- https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.
- Was called the “Hadleyverse”, now called “Tidyverse”
- e.g. devtools (essential), ggplot2, readr (fast), readxl.
- How do you keep two directories in synchrony, e.g. your home directory on laptop and desktop?
- \texttt{sftp}, \texttt{ssh}, \texttt{rsync}
- Unison gets Stephen’s vote since 2003 – http://www.damtp.cam.ac.uk/internal/computing/unison/
- Modern services like Dropbox are useful and build upon these unix tools.
{{{FIGURE(nextsteps/unison-conflict.png,width=8cm)}}}
- How to keep backup copies over time?
- Just copy files, e.g. mycode.jan1.R, mycode.jan2.R, …
- Leads to many large copies, with no trace of what you did over time.
- A more principled way is to use version control: every time you make significant changes, you commit a new version with a succinct log message saying what you changed.
- RCS: going since 1982… old and simple but stable. Typically single-user.
- More modern approaches: cvs, svn, git, …
- Github, google code, bitbucket, …
- R-forge: svn and build system
Got 15 minutes and want to learn Git? http://try.github.com
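A minimal git session, as a sketch (the file name and commit messages are made up; the \texttt{git config} lines are only needed on a machine where git has not been set up yet):

```shell
mkdir repo && cd repo
git init -q
git config user.email "you@example.com"   # identify yourself to git
git config user.name  "You"
echo 'x <- rnorm(10)' > mycode.R
git add mycode.R
git commit -q -m "Initial version"
echo 'mean(x)' >> mycode.R
git commit -q -am "Compute the mean"
git log --oneline    # succinct history of what changed, and when
```

Compare with the mycode.jan1.R approach: one file, full history, and a log message for every change.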
- Computational Biology requires access to large data files.
- Reading them all into memory is difficult when files are very large (> 1 GB).
- Some approaches:
- Compress files.
- Selectively use scan or connections.
- Use a database.
- This typically produces ×2 compression:

Rscript -e 'write(rnorm(99999), file="largefile.dat")'
ls -lh largefile.dat
gzip largefile.dat
ls -lh largefile.dat.gz
gunzip largefile.dat
- \R can read in compressed files natively.
x <- scan('largefile.dat.gz')
- Other compression options also recognised: xz, bzip2
- scan() is very flexible; e.g. read just 2nd column:
\footnotesize
scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",
quote = if(identical(sep, "\n")) "" else "'\"", dec = ".",
skip = 0, nlines = 0, na.strings = "NA",
flush = FALSE, fill = FALSE, strip.white = FALSE,
quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE,
comment.char = "", allowEscapes = FALSE,
fileEncoding = "", encoding = "unknown")
x <- scan(file, what=list(NULL,"",NULL), skip=2, sep='\t')
\normalsize
- connections allow you to maintain state between accesses to a file.
\footnotesize
- Relational database: data stored in tables, very similar in nature to \R’s data.frames.
- Databases allow for multiple accesses and locks to restrict changes, and are very scalable.
- Many databases available: Oracle, Postgres, Access, MySQL.
- SQL – Structured Query Language: language to interrogate databases.
- Most databases run on remote server; SQLite is embedded into your program.
- Embedding the database simplifies setup of server, but means your databases are not shared in the same way that others are. (You have to share the .sql files.)
- Incredibly small (1/4 MB) and useful. Widely used (e.g. macOS, iOS, Firefox, Android). Not as fast as e.g. Oracle.
- You compile your SQLite within your program.
- All handled for you by \R, care of the RSQLite package. (e.g. Bioconductor uses it for data files.)
- package DBI interfaces to all database platforms.
library(RSQLite)
m = dbDriver("SQLite")
## Create a new database from an R data frame.
con = dbConnect(m, dbname = "arrest.db")
data(USArrests)
dbWriteTable(con, "USArrests", USArrests, overwrite=TRUE)
dbListTables(con)
## Later, query the database.
rs = dbSendQuery(con, "select * from USArrests")
d1 = fetch(rs, n=5) ## get first five
print(d1)
d1 = fetch(rs, n=-1)
dbDisconnect(con)
- \Rpackage{sqldf} performs SQL selects on \R data frames.
- Supports the SQLite backend database (by default), the H2 Java db, PostgreSQL and MySQL.
- Avoid read.csv entirely: http://code.google.com/p/sqldf/
“See ?read.csv.sql in sqldf. It uses RSQLite and SQLite to read the file into an sqlite database (which it sets up for you) completely bypassing \R and from there grabs it into \R removing the database it created at the end.” (G. Grothendieck, r-help mailing list).
- Good book on ^((HT|X)M|SQ)L|R$: Introduction to Data Technologies (Paul Murrell).
- The ff package stores objects on disk, but makes them look like they are in memory.
- “back to the future”: S used to store objects on disk.
- Sorting a single column of 81e6 entries; time taken in seconds.
Oct 2010 results from http://tolstoy.newcastle.edu.au/R/packages/10/0697.html
|     | ruinteger | rinteger | rusingle | rsingle | rudouble | rdouble | rfactor | rchar |
|-----|-----------|----------|----------|---------|----------|---------|---------|-------|
| ram | 5.58      | 3.23     | NA       | NA      | NA       | NA      | 0.49    | NA    |
| ff  | 10.70     | 8.54     | 51.35    | 28.98   | 70.20    | 44.13   | 7.91    | NA    |
| R   | OOM       | OOM      | OOM      | OOM     | OOM      | OOM     | OOM     | OOM   |
| SAS | 61.45     | 44.94    | NA       | NA      | 63.14    | 46.56   | NA      | OOD   |
(ram=in-memory, optimized for speed, not ram; ff=on disk).
\note{see text in nextsteps/ff-oct2010.txt}
- The \Rpackage{bigmemory} package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
- netCDF data files: \Rpackage{ncdf} and \Rpackage{RNetCDF} packages.
- hdf5 format: \Rpackage{rhdf5} package.
- \Rpackage{XML} package to parse XML data.
- \Rpackage{Rhadoop} package for very large files.
- https://rpubs.com/msundar/large_data_analysis
- Applicable when repeating \textit{independent} computations a certain number of times; results are combined after parallel executions are done.
- A cluster of nodes: generate multiple workers listening to the master; these workers are new processes that can run on the current machine or a similar one with an identical R installation. Should work on all \R platforms (as in package \Rpackage{snow}).
- The \R process is \textit{forked} to create new \R processes by taking a complete copy of the masters process, including workspace (pioneered by package \Rpackage{multicore}). Does not work on Windows.
- Grid computing.
- Package \Rpackage{parallel}, first included in \R 2.14.0 builds on CRAN packages \Rpackage{multicore} and \Rpackage{snow}.
mclapply(X, FUN, ...)      ## adapted from multicore
parLapply(cl, X, FUN, ...) ## adapted from snow
- Package \Rpackage{foreach}, introducing a new looping construct supporting parallel execution. Natural choice to parallelise a \texttt{for} loop.
library(doMC)
library(foreach)
registerDoMC(2)
foreach(i = ll) %dopar% f(i)
foreach(i = ll) %do% f(i)   ## serial version
library(plyr)
llply(ll, f, .parallel = TRUE)
- Find information about managing and chunking big data:
- High performance computing CRAN task view
- http://cran.r-project.org/web/views/HighPerformanceComputing.html
- pass by value is the default in \R
- pass by reference using S4 ReferenceClasses (OO)
- one can emulate pass by reference using an \texttt{environment}:
e <- new.env()
e$x <- 1
f <- function(myenv) myenv$x <- 2
f(e)
e$x
m <- matrix(rnorm(1e6), ncol=100)
Rprof("rprof")
res <- apply(m,1,mean,trim=.3)
Rprof(NULL)
summaryRprof("rprof")
m <- matrix(rnorm(1e6), ncol=100)
f1 <- function(x, t = 0.3) {
    xx <- numeric(0)
    for (i in 1:nrow(x)) {
        xx <- c(xx, sum(x[i, ]))  ## growing a vector each iteration: slow
    }
    mean(xx, trim = t)
}
f2 <- function(x, t = 0.3) mean(rowSums(x), trim = t)
library(rbenchmark)
benchmark(f1(m), f2(m),
columns=c("test", "replications",
"elapsed", "relative"),
order = "relative", replications = 10)
Describe simple models of the population dynamics of species competing for some common resource. When the two species are not interacting, their populations evolve according to the logistic equations: the rate of reproduction is proportional to both the existing population and the amount of available resources,
\begin{align*}
\deriv{x}{t} &= r_1 x \left(1 - \frac{x}{k_1}\right) \\
\deriv{y}{t} &= r_2 y \left(1 - \frac{y}{k_2}\right)
\end{align*}
where the constant $r_i$ defines the growth rate and $k_i$ is the carrying capacity of the environment.
When competing for the same resource, the animals have a negative influence on their competitors’ growth:
\begin{align*}
\deriv{x}{t} &= r_1 x \left(1 - \frac{x}{k_1}\right) - a x y \\
\deriv{y}{t} &= r_2 y \left(1 - \frac{y}{k_2}\right) - b x y
\end{align*}
Here is an example with $r_1 = 3$, $k_1 = 3$, $a = 2$, $r_2 = 2$, $k_2 = 2$ and $b = 1$,
\begin{align*}
\deriv{r}{t} &= r (3 - r - 2s) \\
\deriv{s}{t} &= s (2 - r - s)
\end{align*}
i.e. use numerical integration, here with the fourth-order Runge–Kutta solver \Rfunction{rk4} from \Rpackage{deSolve}:
library(deSolve)
Sheep <- function(t, y, parms) {
r=y[1]; s=y[2]
drdt = r * (3 - r - (2*s))
dsdt = s * (2 - r - s)
list(c(drdt, dsdt))
}
x0 <- c(1, 1.2)
times <- seq(0, 30, by=0.2)
parms <- 0
out <- rk4(x0, times, Sheep, parms)
head(out)
- \Rpackage{deSolve} package
- phase planes and nullclines (DMBpplane.r from DMB site, modified from Daniel Kaplan)
- \Rfunction{integrate()} – quadrature
- \Rfunction{D()} – symbolic differentiation
- \Rfunction{optimize()} (1-d) and \Rfunction{optim()} (n-d)
- Steven Strogatz. Nonlinear Dynamics and Chaos.
- NR: William Press et al. Numerical Recipes in C/C++
- More slides about DE and phase plane – \url{de.pdf}
- Looking for packages
- CRAN Task Views http://cran.r-project.org/web/views/
- Bioconductor biocViews http://bioconductor.org/packages/release/BiocViews.html
- Reproducibility is crucial
- Have several tools at hand
- editor, programming languages, shell, …
- Practice to keep learning
- Have fun! ☺