README.ms, Slides and Case Study major changes

onertipaday · Dec 2, 2012 · 8509afc · 8509afc
1 parent 6ee6427
commit 8509afc
Show file tree

Hide file tree

Showing 3 changed files with 57 additions and 27 deletions.
diff --git a/CaseStudy/CaseStudy.Rnw b/CaseStudy/CaseStudy.Rnw
@@ -42,8 +42,7 @@ library("xtable") # tables
 \textbf{Findings:} We decided to replicate the Class Prediction Analysis depicted in the manuscript. The authors taking advantage of the nearest shrunken centroid method (PAM) \cite{pamr} wanted to identify the gene expressions patterns (\emph{signatures}) of the contrasts \emph{adenoma vs normal biopsy samples} and \emph{colorectal cancer (CRC) vs normal biopsy samples}.
 On the basis of the information contained in the manuscript's Material and Methods, we were not capable to reproduce the \emph{exact} original results.
 
-\textbf{Conclusions:} This document combines the Literate Programming paradigm and the power of R and Bioconductor in providing an example of
-Reproducible Research in the analysis of high-throughput biological data.
+\textbf{Conclusions:} This document combines the Literate Programming paradigm and the power of R and Bioconductor in providing an example of Reproducible Research in the analysis of high-throughput biological data.
 
 \end{abstract}
 
@@ -63,9 +62,11 @@ We focused our attention on the classification tasks and, following the original
 \section{Loading microarray data into Bioconductor}
 The data used in this example can be downloaded from the ArrayExpress database and imported into R using the \cc{ArrayExpress} package by typing:
 
-<<eval=TRUE, hide=FALSE, cache=TRUE, message=FALSE>>=
+<<eval=TRUE, cache=TRUE, message=FALSE>>=
 library("ArrayExpress")
 EGEOD4183.affybatch = ArrayExpress(accession = "E-GEOD-4183", save=TRUE)
+fns <- list.celfiles(path="../../DATA/",full.names=TRUE)
+EGEOD4183.affybatch <- read.affyBatch(fns)
 @
 
 \noindent A brief description of the \cc{EGEOD4183.affybatch} object can be obtained by using the 
@@ -86,6 +87,11 @@ EGEOD4183.affybatch = ArrayExpress(accession = "E-GEOD-4183", save=TRUE)
 \noindent If the Affymetrix microarray data sets have been downloaded into a single directory, 
 then the \cc{.cel} files can be loaded into R using the \cc{ReadAffy} command.
 
+<<eval=FALSE, message=FALSE>>=
+library("affy")
+fns <- list.celfiles(path=".",full.names=TRUE)
+EGEOD4183.affybatch <- read.affybatch(fns)
+@
 %------------------------------------------------------------
 \section{Pre-processing}
 
@@ -107,7 +113,7 @@ Following the Material and Methods of the original paper we used the GeneChip RM
 
 <<eval=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
 library("gcrma")
-eset.rma = gcrma(EGEOD4183.affybatch)
+eset.rma = rma(EGEOD4183.affybatch)
 @
 
 \noindent The aim of the normalisation is to make the distribution of probe intensities for each array in a set of arrays the same. 
@@ -129,9 +135,9 @@ Gene Expressions patterns (\emph{signatures}) of the contrast between adenoma an
 shrunken centroid classification algorithm (PAM). 
 
 Using the \cc{pamr} package between adenoma and healthy biopsy samples, 20 classifiers were identified
-(sensitivity:100\%, specificity: 100\%). See Table~\ref{avsn} for the identified classifiers.
+(sensitivity:100\%, specificity: 100\%). See Table~\ref{avsn} and Fig~\ref{fig:plotcen01} for the identified classifiers.
 
-<<eval=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
+<<eval=TRUE, hide=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
 x.avsn <-  exprs(eset.rma)[ , c(grep("adenoma", pData(eset.rma)$"Description"),grep("healthy", pData(eset.rma)$"Description"))]
 mydata.avsn <- list( x=x.avsn, 
 	y=factor(c(rep("adenoma", length(grep("adenoma", pData(eset.rma)$"Description"))), rep("normal", length(grep("healthy", pData(eset.rma)$"Description"))))), 
@@ -142,27 +148,42 @@ set.seed(123) # for reproducibility
 mytrain.avsn <- pamr.train(mydata.avsn)
 mycv.avsn <- pamr.cv(mytrain.avsn, mydata.avsn)
 pamr.confusion(mycv.avsn,threshold=6.8)
-pdf("pamr_centroids_adenoma_vs_normal.pdf", width=15)
-pamr.plotcen(mytrain.avsn, mydata.avsn, threshold=6.8)
-dev.off()
 avsn.signature <- data.frame(pamr.listgenes(mytrain.avsn, mydata.avsn, threshold=6.8, genenames=T))
-write.table(avsn.signature, file="adenomavsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
+library('hgu133plus2.db') # chip annotations
+mp.entrez <- mappedkeys(hgu133plus2ENTREZID)
+mp.entrez.lst <- as.list(hgu133plus2ENTREZID[mp.entrez])
+mp.symbol <- mappedkeys(hgu133plus2SYMBOL)
+mp.symbol.lst <- as.list(hgu133plus2SYMBOL[mp.symbol])
+avsn.df <- data.frame(Affymetrix_id= as.character(avsn.signature$id), 
+ENTREZID=unlist(mp.entrez.lst)[as.character(avsn.signature$id)], 
+SYMBOL=unlist(mp.symbol.lst)[as.character(avsn.signature$id)], avsn.signature[,3:4] )
+write.table(avsn.df, file="adenomavsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
+@
+
+\begin{figure}[hp]
+\centering
+<<plotcen01, eval=TRUE, cache=TRUE>>=
+pamr.plotcen(mytrain.avsn, mydata.avsn, threshold=6.8)
 @
-\begin{table}[h]
+\caption{Centroids of the \Sexpr{dim(avsn.df)[[1]]} features in the contras: adenoma vs normal}
+\label{fig:plotcen01}
+\end{figure}
+
+\begin{table}[h!]
   \caption{Classificatory genes identified by pam - adenoma vs normal}\label{avsn}
 <<tab1, message=FALSE, results='asis', echo=FALSE>>= 
 library("xtable")
-tab1 <- xtable(avsn.signature)
+tab1 <- xtable(avsn.df)
 print(tab1, floating=FALSE, include.rownames=FALSE)
 @
 \end{table}
 %------------------------------------------------------------
 \clearpage
 %------------------------------------------------------------
 Normal and CRC biopsy samples could be distinguished using 15 discriminatory genes (sensitivity:86\%, specificity: 100\%).
-See Table~\ref{crcvsn} for the identified classifiers.
+See Table~\ref{crcvsn} and Fig~\ref{fig:plotcen02} for the identified classifiers.
 
-<<eval=TRUE, cache=TRUE, results=FALSE, message=FALSE>>=
+<<eval=TRUE, cache=TRUE, results=FALSE, include=TRUE, message=FALSE>>=
 x.crcvsn <- exprs(eset.rma)[ , c(grep("colorectal cancer", pData(eset.rma)$"Description"),grep("healthy", pData(eset.rma)$"Description"))]
 mydata.crcvsn <- list( x=x.crcvsn, 
 	y=factor(c(rep("CRC", length(grep("colorectal cancer", pData(eset.rma)$"Description"))), rep("normal", length(grep("healthy", pData(eset.rma)$"Description"))))), 
@@ -172,18 +193,27 @@ set.seed(123)
 mytrain.crcvsn <- pamr.train(mydata.crcvsn)
 mycv.crcvsn <- pamr.cv(mytrain.crcvsn, mydata.crcvsn)
 pamr.confusion(mycv.crcvsn,threshold=4.45)
-pdf("pamr_centroids_CRC_vs_normal.pdf", width=15)
-pamr.plotcen(mytrain.crcvsn, mydata.crcvsn, threshold=4.45)
-dev.off()
 crcvsn.signature <- data.frame(pamr.listgenes(mytrain.crcvsn, mydata.crcvsn, threshold=4.45, genenames=T))
-write.table(crcvsn.signature, file="crcvsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
+crcvsn.df <- data.frame(Affymetrix_id= as.character(crcvsn.signature$id), 
+ENTREZID=unlist(mp.entrez.lst)[as.character(crcvsn.signature$id)], 
+SYMBOL=unlist(mp.symbol.lst)[as.character(crcvsn.signature$id)], crcvsn.signature[,3:4] )
+write.table(crcvsn.df, file="crcvsnormal_signature_pam.txt",sep="\t", quote=F, row.names=F)
+@
+
+\begin{figure}[hp]
+\centering
+<<plotcen02, eval=TRUE, cache=TRUE>>=
+pamr.plotcen(mytrain.crcvsn, mydata.crcvsn, threshold=4.45)
 @
+\caption{Centroids of the \Sexpr{dim(crcvsn.df)[[1]]} features in the contras: normal vs colorectal cancer}
+\label{fig:plotcen02}
+\end{figure}
 
 \begin{table}[h!]
   \caption{Classificatory genes identified by pam - normal vs colorectal cancer}\label{crcvsn}
 <<tab2, message=FALSE, results='asis', echo=FALSE>>= 
 library("xtable")
-tab2 <- xtable(crcvsn.signature)
+tab2 <- xtable(crcvsn.df)
 print(tab2, floating=FALSE, include.rownames=FALSE)
 @
 \end{table}
@@ -194,7 +224,7 @@ print(tab2, floating=FALSE, include.rownames=FALSE)
 
 Following the incomplete information depicted in the paper we were capable of reproducing
 only part of the original results.
-We think that the main reasons for this outcome are:
+We think that the main reasons of this outcome are:
 \begin{itemize}
 \item No source code for the analyses was included
 \item The versions of the different packages were not specified

diff --git a/README.md b/README.md
@@ -9,10 +9,10 @@ Abstract of talk
 ===
 Reproducible Research in High-Throughput Biology: A Case Study
 
-The talk will present an introduction to Reproducible Research in the Analysis of High-Throughput data for Biology. It will depict a Case Study produced completely in R taking advantage of the knitr package for the Literate Programming and various Bioconductor packages for the proper analysis.
+The talk presents an introduction to Reproducible Research in the Analysis of High-Throughput data for Biology. It depicts a Case Study produced completely in R taking advantage of the [knitr](http://yihui.name/knitr/) package for the Literate Programming and various [Bioconductor](bioconductor.org) packages for the proper analysis.
 
-Paolo Sonego has spent the last four years as a bioinformatician analyzing '-omics' data for a company in Trieste, Italy.
-He has been using R for the last 9 years and occasionally  blogs about the power of R at [onertipaday.blogspot.com](onertipaday.blogspot.com)
+Paolo Sonego has spent the last four years as a bioinformatician analyzing _omics_ data for a company in Trieste, Italy.
+He has been using R for the last 9 years and occasionally blogs about the power of R at [onertipaday.blogspot.com](onertipaday.blogspot.com)
 
 **License**   
 This work is licensed under the Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.

diff --git a/Slides/Slides.Rmd b/Slides/Slides.Rmd
@@ -231,8 +231,6 @@ Versioning and Version Control Systems
 
 Subversion exists to be universally recognized and adopted as an open-source, centralized version control system characterized by its reliability as a safe haven for valuable data; the simplicity of its model and usage; and its ability to support the needs of a wide variety of users and projects, from individuals to large-scale enterprise operations.
 
-![](images/r-forge.png)
-
 svn "Hello World"
 ===
     svnadmin create SVNrep
@@ -275,7 +273,7 @@ R and Bioconductor
 source("wordcloud.r")
 ```
 
-Bioconductor
+Bioconductor [^4]
 ===
 * Bioconductor is an open source project to provide tools for the analysis and comprehension of high-throughput genomic data, based primarily on the R programming language.
 * The broad goals of the Bioconductor project are:
@@ -285,6 +283,8 @@ Bioconductor
     * To further scientific understanding by producing high-quality documentation and reproducible research.
     * To train researchers on computational and statistical methods for the analysis of genomic data.
 
+[^3]: [Bioconductor web site](bioconductor.org)
+
 Why Bioconductor is the way to follow for Reproducible Research
 ===
 * Same programming environment with tools operating and cooperating between them in coherent ways
@@ -306,7 +306,7 @@ Summary
 * Use a Version Control System for recording your source code changes
 * Create pipelines producing documents including both the analysis and the working code (_Literate Programming_)
 * Use Bioconductor and include the version of your packages and the session info in the document
-* One last thing:
+* One more thing:
     * The Bioconductor team has developed an Amazon Machine Image ([AMI](http://www.bioconductor.org/help/bioconductor-cloud-ami/)) that is optimized for running Bioconductor in the Amazon Elastic Compute Cloud (or EC2) for sequencing tasks.
     * storing data, code, packages in a frozen Virtual Machine allows the complete reproducibility of your analysis and make easy to share the results