Skip to content

Commit

Permalink
README.ms, Slides and Case Study major changes
Browse files Browse the repository at this point in the history
  • Loading branch information
onertipaday committed Dec 2, 2012
1 parent 6ee6427 commit 8509afc
Show file tree
Hide file tree
Showing 3 changed files with 57 additions and 27 deletions.
70 changes: 50 additions & 20 deletions CaseStudy/CaseStudy.Rnw
Expand Up @@ -42,8 +42,7 @@ library("xtable") # tables
\textbf{Findings:} We decided to replicate the Class Prediction Analysis depicted in the manuscript. The authors taking advantage of the nearest shrunken centroid method (PAM) \cite{pamr} wanted to identify the gene expressions patterns (\emph{signatures}) of the contrasts \emph{adenoma vs normal biopsy samples} and \emph{colorectal cancer (CRC) vs normal biopsy samples}.
On the basis of the information contained in the manuscript's Material and Methods, we were not capable to reproduce the \emph{exact} original results.

\textbf{Conclusions:} This document combines the Literate Programming paradigm and the power of R and Bioconductor in providing an example of
Reproducible Research in the analysis of high-throughput biological data.
\textbf{Conclusions:} This document combines the Literate Programming paradigm and the power of R and Bioconductor in providing an example of Reproducible Research in the analysis of high-throughput biological data.

\end{abstract}

Expand All @@ -63,9 +62,11 @@ We focused our attention on the classification tasks and, following the original
\section{Loading microarray data into Bioconductor}
The data used in this example can be downloaded from the ArrayExpress database and imported into R using the \cc{ArrayExpress} package by typing:

<<eval=TRUE, hide=FALSE, cache=TRUE, message=FALSE>>=
<<eval=TRUE, cache=TRUE, message=FALSE>>=
library("ArrayExpress")
EGEOD4183.affybatch = ArrayExpress(accession = "E-GEOD-4183", save=TRUE)
fns <- list.celfiles(path="../../DATA/",full.names=TRUE)
EGEOD4183.affybatch <- read.affyBatch(fns)
@

\noindent A brief description of the \cc{EGEOD4183.affybatch} object can be obtained by using the
Expand All @@ -86,6 +87,11 @@ EGEOD4183.affybatch = ArrayExpress(accession = "E-GEOD-4183", save=TRUE)
\noindent If the Affymetrix microarray data sets have been downloaded into a single directory,
then the \cc{.cel} files can be loaded into R using the \cc{ReadAffy} command.

<<eval=FALSE, message=FALSE>>=
library("affy")
fns <- list.celfiles(path=".",full.names=TRUE)
EGEOD4183.affybatch <- read.affybatch(fns)
@
%------------------------------------------------------------
\section{Pre-processing}

Expand All @@ -107,7 +113,7 @@ Following the Material and Methods of the original paper we used the GeneChip RM

<<eval=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
library("gcrma")
eset.rma = gcrma(EGEOD4183.affybatch)
eset.rma = rma(EGEOD4183.affybatch)
@

\noindent The aim of the normalisation is to make the distribution of probe intensities for each array in a set of arrays the same.
Expand All @@ -129,9 +135,9 @@ Gene Expressions patterns (\emph{signatures}) of the contrast between adenoma an
shrunken centroid classification algorithm (PAM).

Using the \cc{pamr} package between adenoma and healthy biopsy samples, 20 classifiers were identified
(sensitivity:100\%, specificity: 100\%). See Table~\ref{avsn} for the identified classifiers.
(sensitivity:100\%, specificity: 100\%). See Table~\ref{avsn} and Fig~\ref{fig:plotcen01} for the identified classifiers.

<<eval=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
<<eval=TRUE, hide=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
x.avsn <- exprs(eset.rma)[ , c(grep("adenoma", pData(eset.rma)$"Description"),grep("healthy", pData(eset.rma)$"Description"))]
mydata.avsn <- list( x=x.avsn,
y=factor(c(rep("adenoma", length(grep("adenoma", pData(eset.rma)$"Description"))), rep("normal", length(grep("healthy", pData(eset.rma)$"Description"))))),
Expand All @@ -142,27 +148,42 @@ set.seed(123) # for reproducibility
mytrain.avsn <- pamr.train(mydata.avsn)
mycv.avsn <- pamr.cv(mytrain.avsn, mydata.avsn)
pamr.confusion(mycv.avsn,threshold=6.8)
pdf("pamr_centroids_adenoma_vs_normal.pdf", width=15)
pamr.plotcen(mytrain.avsn, mydata.avsn, threshold=6.8)
dev.off()
avsn.signature <- data.frame(pamr.listgenes(mytrain.avsn, mydata.avsn, threshold=6.8, genenames=T))
write.table(avsn.signature, file="adenomavsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
library('hgu133plus2.db') # chip annotations
mp.entrez <- mappedkeys(hgu133plus2ENTREZID)
mp.entrez.lst <- as.list(hgu133plus2ENTREZID[mp.entrez])
mp.symbol <- mappedkeys(hgu133plus2SYMBOL)
mp.symbol.lst <- as.list(hgu133plus2SYMBOL[mp.symbol])
avsn.df <- data.frame(Affymetrix_id= as.character(avsn.signature$id),
ENTREZID=unlist(mp.entrez.lst)[as.character(avsn.signature$id)],
SYMBOL=unlist(mp.symbol.lst)[as.character(avsn.signature$id)], avsn.signature[,3:4] )
write.table(avsn.df, file="adenomavsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
@

\begin{figure}[hp]
\centering
<<plotcen01, eval=TRUE, cache=TRUE>>=
pamr.plotcen(mytrain.avsn, mydata.avsn, threshold=6.8)
@
\begin{table}[h]
\caption{Centroids of the \Sexpr{dim(avsn.df)[[1]]} features in the contras: adenoma vs normal}
\label{fig:plotcen01}
\end{figure}

\begin{table}[h!]
\caption{Classificatory genes identified by pam - adenoma vs normal}\label{avsn}
<<tab1, message=FALSE, results='asis', echo=FALSE>>=
library("xtable")
tab1 <- xtable(avsn.signature)
tab1 <- xtable(avsn.df)
print(tab1, floating=FALSE, include.rownames=FALSE)
@
\end{table}
%------------------------------------------------------------
\clearpage
%------------------------------------------------------------
Normal and CRC biopsy samples could be distinguished using 15 discriminatory genes (sensitivity:86\%, specificity: 100\%).
See Table~\ref{crcvsn} for the identified classifiers.
See Table~\ref{crcvsn} and Fig~\ref{fig:plotcen02} for the identified classifiers.

<<eval=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
<<eval=TRUE, cache=TRUE, results=FALSE, include=TRUE, message=FALSE>>=
x.crcvsn <- exprs(eset.rma)[ , c(grep("colorectal cancer", pData(eset.rma)$"Description"),grep("healthy", pData(eset.rma)$"Description"))]
mydata.crcvsn <- list( x=x.crcvsn,
y=factor(c(rep("CRC", length(grep("colorectal cancer", pData(eset.rma)$"Description"))), rep("normal", length(grep("healthy", pData(eset.rma)$"Description"))))),
Expand All @@ -172,18 +193,27 @@ set.seed(123)
mytrain.crcvsn <- pamr.train(mydata.crcvsn)
mycv.crcvsn <- pamr.cv(mytrain.crcvsn, mydata.crcvsn)
pamr.confusion(mycv.crcvsn,threshold=4.45)
pdf("pamr_centroids_CRC_vs_normal.pdf", width=15)
pamr.plotcen(mytrain.crcvsn, mydata.crcvsn, threshold=4.45)
dev.off()
crcvsn.signature <- data.frame(pamr.listgenes(mytrain.crcvsn, mydata.crcvsn, threshold=4.45, genenames=T))
write.table(crcvsn.signature, file="crcvsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
crcvsn.df <- data.frame(Affymetrix_id= as.character(crcvsn.signature$id),
ENTREZID=unlist(mp.entrez.lst)[as.character(crcvsn.signature$id)],
SYMBOL=unlist(mp.symbol.lst)[as.character(crcvsn.signature$id)], crcvsn.signature[,3:4] )
write.table(crcvsn.df, file="crcvsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
@

\begin{figure}[hp]
\centering
<<plotcen02, eval=TRUE, cache=TRUE>>=
pamr.plotcen(mytrain.crcvsn, mydata.crcvsn, threshold=4.45)
@
\caption{Centroids of the \Sexpr{dim(crcvsn.df)[[1]]} features in the contras: normal vs colorectal cancer}
\label{fig:plotcen02}
\end{figure}

\begin{table}[h!]
\caption{Classificatory genes identified by pam - normal vs colorectal cancer}\label{crcvsn}
<<tab2, message=FALSE, results='asis', echo=FALSE>>=
library("xtable")
tab2 <- xtable(crcvsn.signature)
tab2 <- xtable(crcvsn.df)
print(tab2, floating=FALSE, include.rownames=FALSE)
@
\end{table}
Expand All @@ -194,7 +224,7 @@ print(tab2, floating=FALSE, include.rownames=FALSE)

Following the incomplete information depicted in the paper we were capable of reproducing
only part of the original results.
We think that the main reasons for this outcome are:
We think that the main reasons of this outcome are:
\begin{itemize}
\item No source code for the analyses was included
\item The versions of the different packages were not specified
Expand Down
6 changes: 3 additions & 3 deletions README.md
Expand Up @@ -9,10 +9,10 @@ Abstract of talk
===
Reproducible Research in High-Throughput Biology: A Case Study

The talk will present an introduction to Reproducible Research in the Analysis of High-Throughput data for Biology. It will depict a Case Study produced completely in R taking advantage of the knitr package for the Literate Programming and various Bioconductor packages for the proper analysis.
The talk presents an introduction to Reproducible Research in the Analysis of High-Throughput data for Biology. It depicts a Case Study produced completely in R taking advantage of the [knitr](http://yihui.name/knitr/) package for the Literate Programming and various [Bioconductor](bioconductor.org) packages for the proper analysis.

Paolo Sonego has spent the last four years as a bioinformatician analyzing '-omics' data for a company in Trieste, Italy.
He has been using R for the last 9 years and occasionally blogs about the power of R at [onertipaday.blogspot.com](onertipaday.blogspot.com)
Paolo Sonego has spent the last four years as a bioinformatician analyzing _omics_ data for a company in Trieste, Italy.
He has been using R for the last 9 years and occasionally blogs about the power of R at [onertipaday.blogspot.com](onertipaday.blogspot.com)

**License**
This work is licensed under the Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Expand Down
8 changes: 4 additions & 4 deletions Slides/Slides.Rmd
Expand Up @@ -231,8 +231,6 @@ Versioning and Version Control Systems

Subversion exists to be universally recognized and adopted as an open-source, centralized version control system characterized by its reliability as a safe haven for valuable data; the simplicity of its model and usage; and its ability to support the needs of a wide variety of users and projects, from individuals to large-scale enterprise operations.

![](images/r-forge.png)

svn "Hello World"
===
svnadmin create SVNrep
Expand Down Expand Up @@ -275,7 +273,7 @@ R and Bioconductor
source("wordcloud.r")
```

Bioconductor
Bioconductor [^4]
===
* Bioconductor is an open source project to provide tools for the analysis and comprehension of high-throughput genomic data, based primarily on the R programming language.
* The broad goals of the Bioconductor project are:
Expand All @@ -285,6 +283,8 @@ Bioconductor
* To further scientific understanding by producing high-quality documentation and reproducible research.
* To train researchers on computational and statistical methods for the analysis of genomic data.

[^3]: [Bioconductor web site](bioconductor.org)

Why Bioconductor is the way to follow for Reproducible Research
===
* Same programming environment with tools operating and cooperating between them in coherent ways
Expand All @@ -306,7 +306,7 @@ Summary
* Use a Version Control System for recording your source code changes
* Create pipelines producing documents including both the analysis and the working code (_Literate Programming_)
* Use Bioconductor and include the version of your packages and the session info in the document
* One last thing:
* One more thing:
* The Bioconductor team has developed an Amazon Machine Image ([AMI](http://www.bioconductor.org/help/bioconductor-cloud-ami/)) that is optimized for running Bioconductor in the Amazon Elastic Compute Cloud (or EC2) for sequencing tasks.
* storing data, code, packages in a frozen Virtual Machine allows the complete reproducibility of your analysis and make easy to share the results

Expand Down

0 comments on commit 8509afc

Please sign in to comment.