Skip to content

Commit

Permalink
initial version of NYC/DR talk
Browse files Browse the repository at this point in the history
  • Loading branch information
topepo committed Apr 2, 2015
1 parent 9c4e9a5 commit d3b6bfd
Show file tree
Hide file tree
Showing 8 changed files with 2,065 additions and 0 deletions.
385 changes: 385 additions & 0 deletions release_process/talk/caret.Rnw
Original file line number Diff line number Diff line change
@@ -0,0 +1,385 @@
\documentclass[12 pt]{beamer}

\usepackage{beamerthemesplit}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\mode<presentation> {
\usetheme{Boadilla}
\usecolortheme{whale}
\setbeamercovered{transparent}
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Custom definitions

\definecolor{darkgreen}{rgb}{0,0.3,0}
\definecolor{darkred}{rgb}{0.6,0.0,0}

\newcommand{\mxnum}[1]{\texttt{\hlnum{#1}}}%
\newcommand{\mxkwc}[1]{\texttt{\hlkwc{#1}}}%
\newcommand{\mxkwd}[1]{\texttt{\hlkwd{#1}}}%

\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
\renewcommand{\pkg}[1]{{\color{darkgreen}\textsf{#1}}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%

<<setup, results='hide', echo=FALSE,message=FALSE, warning=FALSE>>=
library(knitr)
library(caret)
opts_chunk$set(comment=NA, digits = 3, size = 'scriptsize',
prompt = TRUE, background = 'white',
message=FALSE, warning=FALSE)
hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
if (is.character(x)) highr::hi_latex(x) else hook_inline(x)
})
options(width = 80)
knit_theme$set("bclear")
theme_set(theme_bw())
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{Nobody Knows What It's Like To Be the Bad Man}
\author{Max Kuhn, Ph.D}
\institute{Pfizer Global R$\&$D \linebreak Groton, CT\linebreak max.kuhn@pfizer.com }
\date{}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

\begin{frame}[plain]
\maketitle
\end{frame}

\title{caret}
\author{Max Kuhn}
\institute{Pfizer}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}
\frametitle{Outline}

\begin{itemize}
\item What is the \pkg{caret} package?
\item What makes it different
\item Version control
\item The CRAN release process
\item Testing
\item Documentation
\end{itemize}

\end{frame}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Model Function Consistency}

Since there are many modeling packages in R written by different people,
there are some inconsistencies in how models are specified and
predictions are created.

\vspace{.15in}

For example, many models have only one method of specifying the model
(e.g. formula method only)


\end{frame}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Generating Class Probabilities Using Different Packages}

\begin{footnotesize}
\begin{center}
\begin{tabular}{rcl}
{\bf Function} && {\bf {\tt predict} Function Syntax} \\
\hline
\href{http://cran.r-project.org/web/packages/MASS/index.html}{\pkg{MASS}}\texttt{:::lda} && {\tt \Sexpr{'predict(obj)'}} (no options needed)\\
\pkg{stats}\texttt{:::glm} && {\tt \Sexpr{'predict(obj, type = "response")'}} \\
\href{http://cran.r-project.org/web/packages/gbm/index.html}{\pkg{gbm}}\texttt{:::gbm} && {\tt \Sexpr{'predict(obj, type = "response", n.trees)'}} \\
\href{http://cran.r-project.org/web/packages/mda/index.html}{\pkg{mda}}\texttt{:::mda} && {\tt \Sexpr{'predict(obj, type = "posterior")'}} \\
\href{http://cran.r-project.org/web/packages/rpart/index.html}{\pkg{rpart}}\texttt{:::rpart} && {\tt \Sexpr{'predict(obj, type = "prob")'}} \\
\href{http://cran.r-project.org/web/packages/RWeka/index.html}{\pkg{RWeka}}\texttt{:::Weka} && {\tt \Sexpr{'predict(obj, type = "probability")'}} \\
\href{http://cran.r-project.org/web/packages/caTools/index.html}{\pkg{caTools}}\texttt{:::LogitBoost} && {\tt \Sexpr{'predict(obj, type = "raw", nIter)'}} \\
\hline \\
\\
\end{tabular}
\end{center}
\end{footnotesize}
\end{frame}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{The \pkg{caret} Package}

The \href{http://cran.r-project.org/web/packages/caret/index.html}{\pkg{caret}} package was developed to:
\begin{itemize}
\item create a unified interface for modeling and prediction
(interfaces to \Sexpr{length(table(modelLookup()$model))} models)
\item streamline model tuning using resampling
\item provide a variety of ``helper'' functions and classes for day--to--day model building tasks
\item increase computational efficiency using parallel processing
\end{itemize}

\vspace{.07in}

First commits within Pfizer: 6/2005, First version on CRAN: 10/2007

\vspace{.06in}

Website: \href{http://topepo.github.io/caret/}{http://topepo.github.io/caret/}

\vspace{.06in}

JSS Paper: \href{http://www.jstatsoft.org/v28/i05/paper}{http://www.jstatsoft.org/v28/i05/paper}

\vspace{.06in}

Model List: \href{http://topepo.github.io/caret/bytag.html}{http://topepo.github.io/caret/bytag.html}

\vspace{.06in}

Many computing sections in APM

\end{frame}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Package Dependencies}

<<mod_pkgs,echo=FALSE>>=
mods <- getModelInfo()
mod_pkgs <- unique(unlist(lapply(mods, function(x) x$library)))
mod_pkgs <- mod_pkgs[mod_pkgs != "caret"]
@

One thing that makes \pkg{caret} different from most other packages is that is uses code from an abnormally large number (> 80) of other packages.

\vspace{.15in}

Briefly, these were in the \texttt{Depends} field of the \textt{DESCRIPTION} file which cause all of them to be loaded with \pkg{caret}.

\vspace{.15in}

For many years, they were moved to \texttt{Suggests}, which solved that issue,


\vspace{.15in}

However, their formal dependency in the \textt{DESCRIPTION} file required CRAN to install hundreds of other packages to check \pkg{caret}. They were not pleased.

\end{frame}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{The Basic Release Process}

\begin{enumerate}
\item create a few dynamic man pages
\item use {\color{darkred} \texttt{R CMD check --as-cran}} to ensure passing CRAN tests and {\color{darkred} unit tests}
\item update all packages (and R)
\item run {\color{darkred} regression tests} and evaluate results
\item send to CRAN
\item repeat
\item repeat
\item install passed \pkg{caret} version
\item generate {\color{darkred} HTML documentation} and sync github io branch
\item profit!
\end{enumerate}

\end{frame}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Required ``Optimizations''}


For example, there is one check that produces a large number of false positive warnings. For example:

<<subset, eval = FALSE>>=
bwplot.diff.resamples <- function (x, data, metric = x$metric, ...) {
## some code
plotData <- subset(plotData, Metric %in% metric)
## more code
}
@

will trigger a wanring that ``{\tt \small bwplot.diff.resamples: no visible binding for global variable 'Metric'}''.

\vspace{.15in}

The ``solution'' is to have a file that is sourced first in the package (e.g. \texttt{aaa.R}) with the line

<<subset_fix, eval = FALSE>>=
Metric <- NULL
@

\end{frame}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Severity of Problems}

It's hard to tell which warnings should be ignored and which should not. There is also the issue of inconsistencies related to who is ``on duty'' when you submit your package.

\vspace{.1in}

For example, I recently updated the \pkg{desirability} package and received this warning:

\vspace{.1in}



\end{frame}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Package Dependencies}

This problem was somewhat alleviated at the end of 2013 when {\em custom methods} were introduced into the package.

\vspace{.15in}

Although this functionality had already existed in the package for some time, it was refactored to be more user freindly.

\vspace{.15in}

In the process, much of the modeling code was moved out of \pkg{caret}'s R files and into R objects, eliminating the formal dependencies.

\vspace{.15in}

Right now, the {\em total} number of dependencies is much smaller (2 \texttt{Depends}, 7 \texttt{Imports}, and 25 \texttt{Suggests}).

\vspace{.15in}

This still affects testing though (described later)

\end{frame}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Regression Testing}

Prior to CRAN release (or whenever required), a comprehensive set of regression tests are conducted.

\vspace{.1in}

All modeling packages are updated to their current CRAN versions.

\vspace{.1in}

For each model accessed by \mxkwd{train}, \mxkwd{rfe}, and/or \mxkwd{sbf}, a set of test cases are computed with the production version of \pkg{caret} and the devle version.

\vspace{.1in}

First, test cases are evaluated to make sure that nothing has been broken by updated versions of the consistuent packages.

\vspace{.1in}

Diffs of the model results are computed to assess any differences in \pkg{caret} versions.

\vspace{.1in}

This process takes approximately 10hrs to complete using \texttt{make -j 12} on a Mac Pro.

\end{frame}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Documentation}

\pkg{caret} originally contained four package vignettes with in--depth descriptions of functionality with examples.

\vspace{.15in}

Although this functionality had already existed in the package for some time, it was refactored to be more user freindly.

\vspace{.15in}

However, this added time to \texttt{R CMD check} and was a general pain for CRAN.

\vspace{.15in}

Efforts to make the vingettes more computationally efficient (e.g. reducing the number of examples, resamples, etc.) diminished the effectiveness of the documentation.


\end{frame}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}[fragile]
\frametitle{Documentation}

The documentation was moved out of the package and to the github IO page.

\vspace{.15in}


These pages are built using \pkg{knitr} whenever a new version is sent to CRAN. Some advantages are:

\begin{itemize}
\item longer and more relavant examples are availible
\item update scheduile is under my control
\item dynamic documentation (e.g. D3 network graphs, JS tables)
\item better formatting
\end{itemize}

\vspace{.1in}

It currently takes about 4hr to create these (using parallel processing when possible).

\end{frame}







\end{document}













Loading

0 comments on commit d3b6bfd

Please sign in to comment.