paper/paper.tex

% can be 9pt
\documentclass{sigplanconf}
\usepackage{url}
\usepackage{tikz}
\usepackage{graphicx}
\usepackage{framed}
\usepackage{color}
\usepackage{url}
\usepackage{alltt}
\usepackage{tabularx}
\usepackage{subfigure}
\usepackage{fancyhdr}
\usepackage{paralist}
\usepackage{multirow}
\usepackage{pgf}

\usepackage{listings}
\lstset{ %
  language=Haskell,                 % the language of the code
 %%basicstyle= \scriptsize,           % the size of the fonts that are used
  %% for the code
  %%firstnumber=0, % index to start numbering at
  %%numbers=left,                      % where to put the line-numbers
  %% numberstyle=\tiny\color{gray},  % the style that is used for the
  %% line-numbers
  %% stepnumber=5,                   % the step between two line-numbers. If it's
                                  % 1, each line
  %%                              % will be numbered
  %% numbersep=5pt,                  % how far the line-numbers are from the code
  %% xleftmargin=0ex,
  %% backgroundcolor=\color{white},      % choose the background color. You must
  %% add \usepackage{color}
  %% showspaces=false,               % show spaces adding particular underscores
  %% showstringspaces=false,         % underline spaces within strings
  %% showtabs=false,                 % show tabs within strings adding
  %% particular underscores
  %% frame=single,                   % adds a frame around the code
  %% rulecolor=\color{black},        % if not set, the frame-color may be
  %% changed on line-breaks within not-black text (e.g. commens (green here))
  %% tabsize=2,                      % sets default tabsize to 2 spaces
  %% captionpos=b,                   % sets the caption-position to bottom
  %% breaklines=true,                % sets automatic line breaking
  %% breakatwhitespace=false,        % sets if automatic breaks should only
  %% happen at whitespace
  %% title=\lstname,                   % show the filename of files included
  %% with \lstinputlisting;
  %%                                 % also try caption instead of title
  %% keywordstyle=\color{blue},          % keyword style
  %% commentstyle=\color{dkgreen},       % comment style
  %% stringstyle=\color{mauve},          % string literal style
  escapeinside={@*}{*@},            % if you want to add a comment within your
                                    % code
  %% morekeywords={*,...}           % if you want to add more keywords to
  %% the set
}


%% \usepackage{tikz}
%% \usetikzlibrary{arrows,automata}

%% \usepackage{amsmath}
%% \usepackage{amsthm}

%% \newtheoremstyle{lst}
%%   {\topsep}   % ABOVESPACE
%%   {\topsep}   % BELOWSPACE
%%   {\itshape}  % BODYFONT
%%   {0pt}       % INDENT (empty value is the same as 0pt)
%%   {} % HEADFONT
%%   {}         % HEADPUNCT
%%   {5pt plus 1pt minus 1pt} % HEADSPACE
%%   {(\thmnumber{#2})}          % CUSTOM-HEAD-SPEC
%% \theoremstyle{lst}
%% \newtheorem{codenum}{lst}

\usepackage{float}
\floatstyle{boxed}
\restylefloat{figure}

%% \include{math}

\newenvironment{code}{\begin{alltt}\footnotesize}{\end{alltt}}
%% \newenvironment{cols}{\begin{tabular}{m{3.6cm}|m{3.6cm}}{Haskell} &

\newcommand{\ttp}[1]{\texttt{#1}}

\usepackage{ifthen}
\newboolean{submission}  %set to true for the submission version
\setboolean{submission}{false}
%\setboolean{submission}{true}
\ifthenelse
{\boolean{submission}}
{ \newcommand{\todo}[1]{ } } %hide todo
{ \newcommand{\todo}[1]{ {\color{blue}$<<$#1$>>$}
 }}
%\usepackage{fancyhdr}

\newcommand{\sub}[1]{\(\sb{#1}\)}

\newcommand{\undef}[1]{{\bf #1}}

\newcommand{\sci}[2]{\ensuremath{#1 \times 10^{#2}}}
\begin{document}

%% \conferenceinfo{ICFP'12,} {September 9--15, 2012, Copenhagen, Denmark.}
%% \CopyrightYear{2012}
%% \copyrightdata{978-1-4503-1054-3/12/09}

%\titlebanner{banner above paper title}        % These are ignored unless
%\preprintfooter{}   % 'preprint' option specified.

\title{SmartCheck: Automatic and Efficient Counterexample Reduction and Generalization}
\subtitle{UNPUBLISHED DRAFT}

\authorinfo{Lee Pike}
           {Galois, Inc.}
           {leepike@galois.com}
\maketitle

\begin{abstract}
QuickCheck, developed by Claessen and Hughes, is a powerful library for
automatic test-case generation.  Because QuickCheck performs random testing,
some of the counterexamples discovered are very large.  QuickCheck provides an
interface for the user to write shrink functions to attempt to reduce the size
of counterexamples.  Hand-written implementations of shrink can be complex,
inefficient, and consist of significant boilerplate code.  Furthermore,
shrinking is only one aspect in debugging: counterexample generalization is the
process of extrapolating from individual counterexamples to a class of
counterexamples, often requiring a flash of insight from the programmer.  To
improve counterexample reduction and generalization, we introduce
\emph{SmartCheck}.  SmartCheck is a debugging tool that reduces algebraic data
using generic search heuristics to efficiently find smaller counterexamples.  In
addition to shrinking, SmartCheck also automatically generalizes counterexamples
to formulas representing classes of counterexamples.  SmartCheck has been implemented for Haskell and is freely available.
\end{abstract}

%% \category{D.2.4}{Software/Program Verification}{Reliability}

%% \terms
%% Languages, Verification

%% \keywords
%% embedded domain-specific language, compiler, verification

\section{Introduction}\label{sec:intro}
The QuickCheck testing framework was a revolutionary step-forward in
type-directed random testing~\cite{qc,monadic}.  Originally designed for Haskell,
QuickCheck has been ported to other languages and is a now a widely-used testing
tool.  Because QuickCheck generates random values for testing, counterexamples
it finds may be substantially larger than a minimal counterexample.  In their
original QuickCheck paper~\cite{qc}, the authors report the following user
experience by Andy Gill:
%
\begin{quote}
Sometimes the counterexamples found are very large and it is difficult to go
back to the property and understand why it is a counterexample.
\end{quote}

That is, automated testing helps find a counterexample, but it may be a very
large counterexample.  We are left facing the \emph{shrinking problem}: assume
there exist a set of inputs $i_0, \ i_1, \ldots, \i_n$, such that for program
$P$ (modeled as a function mapping inputs to output) and property $\phi$,
$\phi(P(i_0, \ i_1, \ldots, \i_n))$ is false.  The shrinking problem is to find
the smallest set of inputs $i'_0, \ i'_1, \ldots, \i'_n$, according to some
measure, such that $\phi(P(i\_0, \ i\_1, \ldots, \i\_n))$ is also false.

In general, the problem too expensive to solve, since it requires showing that
for a set of failing inputs, there exist no strictly smaller inputs that also
cause the property to fail.  Such a demonstration requires exhaustive testing or
proof.  Solutions to the shrinking problem are therefore approximated to
efficiently discover sufficiently small inputs causing the input to fail.
Usually, the shrunk inputs are derived algorithmically (but perhaps
nondeterministically) from a known larger set of inputs resulting in failure,
and the inputs may not be minimal.

To address the shrinking problem, QuickCheck defines a type class
\ttp{Arbitrary} that presents a method \ttp{arbitrary} for generating random
values of a given type.  Gill added another method to the type class:
%
\begin{code}
smaller :: a -> [a]
\end{code}
%
\noindent
The purpose of \ttp{smaller} is to generate strictly smaller values, according
to some measure, from a given counterexample.  These new values are then tested
to attempt to find a smaller counterexample.  Today, \ttp{smaller} is called
\ttp{shrink}.

In industrial uses, shrinking is essential.  In describing commercial
applications of QuickCheck, Hughes has noted that ``without it [shrinking],
randomly generated failing cases would often be so large as to be almost
useless.''~\cite{qcjh}.  Hughes~\emph{et al.} also give an extended example in
which shrinking is essential in debugging telecom software~\cite{telecom}.

Defining an efficient and effective shrink method requires a good understanding
of how shrinking in QuickCheck works and the semantics of the property and
program being evaluated.  Bad definitions can be so slow or so ineffective at
shrinking that they are unusable.

In addition, shrinking is one side of the coin when it comes to making
counterexamples more understandable: the other side is extrapolation from
individual counterexamples to a class of counterexamples.  This leap of
abstraction is often implicitly made by the programmer in determining the reason
why counterexamples fail the property.  For example, the following is a
relatively small counterexample returned when QuickChecking a property in (a
bug-injected version of) Xmonad, a popular X11 window manager written in
Haskell~\cite{xmonad}:
%
\medskip%
\begin{sloppypar}
%\footnotesize
\noindent%
\ttp{StackSet \{current = Screen \{workspace = Workspace \{tag = NonNegative
      \{getNonNegative = 0\}, layout = -1, stack = Just (Stack \{focus = 'S', up =
        "", down = ""\})\}, screen = 1, screenDetail = 1\}, visible = [Screen
    \{workspace = Workspace \{tag = NonNegative \{getNonNegative = 2\}, layout = -1,
        stack = Nothing\}, screen = 2, screenDetail = -1\},Screen \{workspace =
      Workspace \{tag = NonNegative \{getNonNegative = 3\}, layout = -1, stack =
        Nothing\}, screen = 0, screenDetail = -1\}], hidden = [Workspace \{tag =
      NonNegative \{getNonNegative = 1\}, layout = -1, stack = Just (Stack \{focus
        = `\char`\\NUL', up = "", down = ""\})\},Workspace \{tag = NonNegative
      \{getNonNegative = 4\}, layout = -1, stack = Just (Stack \{focus = 'I', up =
        "", down = ""\})\}], floating = fromList []\} }
\end{sloppypar}
\medskip%
\noindent
(This counterexample uses Haskell's default \ttp{Show} instances, which uses
record syntax.)  Programmers may be familiar with having to debug
a ``wall of text'' as shown above.  What if instead a
formula like the following were returned, stating that for any well-typed
values $x_0$, $x_1$, $x_2$, and $x_3$, tested, a counterexample is found?

%
\begin{samepage}
\begin{code}
forall values x0 x1 x2 x3:
  StackSet
    (Screen (Workspace x0 (-1) (Just x1)) 1 1)
    x2 x3 (fromList [])
\end{code}
\end{samepage}
%
\noindent
The formula quantifies away all the irrelevant portions of the data structure
with respect to the property, so that the user can focus on the heart of the
problem in a class of counterexamples.

\paragraph{SmartCheck}
Motivated by the problems of reducing and generalizing large counterexamples, we
developed \emph{SmartCheck}.  SmartCheck takes a counterexample produced by some
oracle and generically minimizes and generalizes the counterexample.

As shown in the motivating example in Section~\ref{sec:example}, SmartCheck
can shrink more than an order-of-magnitude faster than even the most efficient hand-written
shrink implementations for some properties and programs.  As well, the time
required to shrink scales linearly with the size of the counterexample---other
common shrink implementations often scale in polynomial time or worse.
(Shrinking time is important in real-world uses; see
Section~\ref{sec:experiments} for examples.)

After presenting some preliminary definitions in
Section~\ref{sec:preliminaries}, in Section~\ref{sec:shrinking}, we describe the
generic shrinking algorithm implemented.  The algorithm is particularly novel in
that it generates new random sub-values during the reduction phase (see related
work in Section~\ref{sec:related}).


%%  than the tuple or list-truncation approaches
%% described earlier, and the size of resulting counterexamples are two-five times
%% smaller (more detailed performance numbers are given in
%% Section~\ref{sec:performance}).  Furthermore, other shrink approaches tend to
%% have ``long tails,'' meaning that for some original counterexamples, that are
%% not shrunk much or take a very long time to shrink.
%% Motivated by these limitations of QuickCheck and SmallCheck, we developed
%% \emph{SmartCheck}.

SmartCheck implements three novel approaches to automatically generalize
counterexamples, which are described in Section~\ref{sec:generalization}.  The
first algorithm universally quantifies sub-values that always fail in tests.
The second algorithm existentially quantifies sub-values for types in which
every possible variant fails the property.  For example, finding counterexamples
(\ttp{Left 2}) and (\ttp{Right True}) for the type
%
\begin{code}
\ttp{Either Int Bool}
\end{code}
%
\noindent
means there exists a counterexample regardless of the variant chosen.
Existential generalization is useful for large sum types, as found in
abstract syntax tree (AST) definitions, for example.

The third algorithm automatically strengthens properties by omitting ``similar''
counterexamples to the ones previously observed.  The algorithm is motivated by
noting that there are often multiple ways in which a property may fail; for
example, a property stating that pretty-printing an AST and then parsing it
results in the original AST may fail due to multiple bugs.  During testing, it
is useful to discover counterexamples arising from each bug in one go.  In
practice, the problem is solved by discovering a counterexample \ttp{cex},
abstracting it, and then adding a new precondition to the property that
informally says ``omit counterexamples of form \ttp{cex}.''  Adding
preconditions manually is laborious and may cause the programmer to make
premature fixes to the program, if she believes she has isolated the error
before she actually does.

We describe our implementation based on generic programming in
Section~\ref{sec:implementation}; the implementation is open-source.  In
Section~\ref{sec:experiments}, we discuss some of our experiences with using
SmartCheck, including checking properties from XMonad and a natural language
processing library.

\section{A Motivating Example}\label{sec:example}

\begin{figure}[ht]
\begin{code}
type I   = [Int16]
data T   = T I I I I I

toList :: T -> [[Int16]]
toList (T i0 i1 i2 i3 i4) =
  (map . map) fromIntegral [i0, i1, i2, i3, i4]

pre :: T -> Bool
pre t = all ((>) 256 . sum) (toList t)

post :: T -> Bool
post t = (sum . concat) (toList t) < 5 * 256

prop :: T -> Property
prop t = pre t ==> post t
\end{code}
  \caption{Example program and property.}
  \label{fig:initial}
\end{figure}

In this section, we motivate in more detail the challenges in shrinking
counterexamples by comparing manual approaches using QuickCheck to SmartCheck.
(We focus on shrinking rather than generalization here since counterexample
generalization is unique to SmartCheck.)  We will show how a small data type and
simple property can result in large counterexamples without any shrinking.  Then
we show the difficulty in designing an efficient shrink implementation manually,
which is required when using QuickCheck; we will walk through a couple of poor
designs before arriving at a ``canonical'' manual solution.  However, even the
canonical solution produces significantly larger counterexamples and takes
significantly longer to produce them than SmartCheck, even though SmartCheck's
shrinking strategy is completely automatic and generic.

Consider the example in Figure~\ref{fig:initial}.\footnote{All examples and
  algorithms in this paper are presented in Haskell 2010~\cite{haskell2010} plus
  the language extensions of existential type quantification and scoped type
  variables.}  Data type \ttp{T} is a product type containing five lists of
signed 8-bit integers.  For our purposes, it is not important what \ttp{T}
models, but for concreteness, suppose it models the values in five linked lists.

Now suppose we are modeling some program that serializes values of type \ttp{T}.
The input to the program satisfies the invariant \ttp{pre}, that the sum of
values in each list of \ttp{Int16}s is less than or equal to 256.  Assuming this,
we want to show \ttp{post} holds, that the sum of all the values from \ttp{T} is
less than $5 * 256$, where five is the number of fields in \ttp{T}.  At first
glance, the property seems reasonable.  But we have forgotten about underflow;
for example, the value
%
\begin{code}
T [-20000] [-20000] [] [] []
\end{code}
%
\noindent
satisfies \ttp{pre} but fails \ttp{post} (the \ttp{==>} operator in the figure
is implication from the QuickCheck library).

Despite the simplicity of the example, a typical counterexample returned by
QuickCheck can be large.  With standard settings and no shrinking, the median
counterexample discovered contains just under 70 \ttp{Int16} values, and
counterexamples can contain over 100 values.  Thus, it pays to define
\ttp{shrink}!

We might first naively try to shrink counterexamples for a data type like
\ttp{T} by taking the cross-product of shrunk values over the arguments to the
constructor \ttp{T}.  This can be expressed using Haskell's list-comprehension
notation:
%
\begin{code}
shrink (T w i0 i1 i2 i3) =
  [ T a b c d e | a <- shrink w,  b <- shrink i0
                , c <- shrink i1, d <- shrink i2
                , e <- shrink i3 ]
\end{code}
%
\noindent
While the above definition appears reasonable on first glance, there are two
problems with it.  First, the result of (\ttp{shrink t}) is null if any list
contained in \ttp{t} is null, in which case \ttp{t} will not be shrunk.  More
troublesome is that with QuickCheck's default \ttp{Arbitrary} instances, the
length of potential counterexamples returned by \ttp{shrink} can easily exceed
$10^{10}$, which is an intractable number of tests for QuickCheck to analyze.
The reason for the blowup is that shrinking a list \ttp{[a]} produces a list
\ttp{[[a]]}, a list of lists; each list is a new shrinking of the original list.
Then, we take the cross-product of the generated lists.  For the example above,
with some counterexamples, the shrinking stage may appear to execute for hours,
consuming ever more memory, without returning a result.

The first way we will control the complexity is focusing on structurally
shrinking the data type rather than shrinking the base types.  We prevent
QuickCheck from using its default shrink instance for \ttp{Int16} by defining a
newtype:
%
\begin{code}
newtype T = T { getInt :: Int16 }
\end{code}
%
\noindent
Without shrinking the \ttp{Int16}s, the performance is dramatically improved.

Next, a programmer might try to control the complexity by truncating lists using
truncation where
%
\begin{code}
take n ls
\end{code}
%
returns the first $n$ elements \ttp{ls}.  The trade-off is quicker shrinking
with a lower probability of finding a smaller counterexample.  For example, we
might redefine shrink as follows:
%
\begin{samepage}
\begin{code}
shrink (T w i0 i1 i2 i3) =
  [ T a b c d e | a <- tk w
                , b <- tk i0, c <- tk i1
                , d <- tk i2, e <- tk i3 ]
  where tk x = take 10 (shrink x)
\end{code}
\end{samepage}
%
\noindent
Call this version ``truncated shrink''.  Truncation controls the blowup of the
input-space, but the downside is that potentially smaller counterexamples may be
omitted.

Someone who understands the semantics of QuickCheck's \ttp{shrink}
implementation defines a \ttp{shrink} instance as follows:\footnote{The
  following approach was suggested to the author by John Hughes.}
%
\begin{code}
shrink (T w i0 i1 i2 i3) = map go xs
  where xs = shrink (w, i0, i1, i2, i3)
        go (w', i0', i1', i2', i3')
           = T w' i0' i1' i2' i3'
\end{code}
%
\noindent
This ``tuple shrink'' definition does not suffer the same shortcomings: it
shrinks a value even if it contains an empty list, and the combinatorial blowup
of shrink candidates is avoided, since a pair \ttp{(a,b)} is shrunk by
attempting to shrink \ttp{a} while holding \ttp{b} constant, then attempting to
shrink \ttp{b} while holding \ttp{a} constant (and is generalized from pairs to
arbitrary tuples).  Thus, each argument to \ttp{T} is independently shrunk.
Still, using this definition, from some initial counterexamples, shrinking may
result in a final counterexample containing just over 100 \ttp{Int16} values.

SmallCheck is another testing framework for Haskell for which shrinking is
irrelevant: SmallCheck is guaranteed to return a smallest counterexample, if
one exists~\cite{sc}.  SmallCheck does this by enumerating all possible inputs,
ordered from smallest to largest, up to some user-defined bound.  While
SmallCheck is effective for testing many programs and properties (in accordance
with the \emph{small scope hypothesis}~\cite{jackson}), counterexamples to even
relatively simple properties may be infeasible to discover due to input-space
explosion.

%% \begin{figure}
%%   \begin{code}
%% instance Serial Int16 where
%%   series d = drawnFrom [(-d')..d']
%%     where d' = fromIntegral d

%% instance Serial A where series = cons1 A
%% instance Serial B where series = cons4 B
%%   \end{code}
%%   \caption{SmallCheck instances for the data types in
%%     Figure~\ref{fig:initial}.}
%%   \label{fig:smallshrink}
%% \end{figure}

For a product type, SmallCheck must check values produced by taking the
cross-product of the type's fields at a given depth.  Unfortunately, using
default instances, SmallCheck fails long before it reaches a depth required to
discover a counterexample.  (A related library named Feat combines some aspects
of SmallCheck and QuickCheck~\cite{feat}; we discuss it in
Section~\ref{sec:related}).


%% is unable to verify the property from Figure~\ref{fig:initial}, for much the
%% same reason that \ttp{shrink} is inefficient

%% For the property \ttp{prop} defined in Figure~\ref{fig:initial} (suitably
%% redefined to use SmallCheck's types).

%% to depth 16 to find the first counterexample.  Unfortunately, after a couple
%% hours of testing, Lazy SmallCheck is still checking values at depth six, and
%% the number of tests scales with respect to the depth.\footnote{All tests in
%% this paper are performed on a four-core Pentium~i7 running at 2.7GHz with 8GB
%% RAM on Fedora~16.  This test, and others unless noted, are performed using
%% interpreted Haskell under GHC 7.4.2.\todo{redo tests?}}

%% For example, for type \ttp{Int}, the depth $d$ is defined as the integers in
%% the list enumerating from $-d$ to $d$ (i.e., \ttp{[(-d)..d]}).
%% Figure~\ref{fig:smallshrink} contains functions for generating Lazy
%% SmallCheck (a variant of SmallCheck) tests by defining instances for the
%% \ttp{Serial} class.

%% Finally, even if we could overcome the problems described above with QuickCheck
%% and SmallCheck, we are left with two others.
%


%% Second, during testing, it is sometimes helpful to cover the input space
%%     by discovering different counterexamples to \ttp{prop}.  In the example in
%%     Figure~\ref{fig:initial}, one counterexample is fairly representative of
%%     others, but this is not true in general (e.g., consider properties about
%%     abstract syntax trees).  Ideally, the testing framework would automatically
%%     add a predicate characterizing a counterexample as the precondition to
%%     \ttp{prop} that strengthens the property.  That is, for a specific
%%     counterexample \ttp{ex}, we have a predicate \ttp{predEx} that holds if its
%%     argument is another counterexample that is similar to (i.e., some
%%     generalization of) \ttp{ex}.  Then we test against a new property
%% %
%% \begin{code}
%% prop' a = not (cex a) ==> prop a
%% \end{code}
%
%% \noindent
%% where \ttp{==>} is QuickCheck's implication operator that discards tests that
%% violate the precondition.
%\end{itemize}

\begin{figure}[ht]
\includegraphics[scale=0.6]{../regression/PaperExample1/out/data}
%% \includegraphics[scale=0.6]{../regression/PaperExample1/out/time-big}
%% \includegraphics[scale=0.6]{../regression/PaperExample1/out/time-small}
  \caption{Results for 1000 tests of the program in Figure~\ref{fig:initial}.}
  \label{fig:graphs}
\end{figure}

\begin{table}[ht]
\footnotesize
  \begin{center}
    \begin{tabular}{|r||c|c|c|c|}
\hline
           & QC none     & QC trunc     & QC tuple    & SmartCheck\\
\hline \hline
Mean       & 81, 0.013s  & 40, 0.231s   & 13, 0.228s  & 6, 0.022s\\
\hline
Std. dev.  & 21, 0.001   & 21, 1.742    & 7,  0.962   & 4, 0.001s\\
% Median     & 0.009s, 67  & 0.17s, 33  & 1.81s, 13  & 0.10s, 6\\
%% \hline
%% MAD        & 0.02s, 13  & 0.09s, 8   & 0.62s, 3   & 0.02s, 2\\
\hline
95\%       & 116, 0.025s  & 84, 0.283s  & 26, 0.501s  & 13, 0.039s\\
\hline
    \end{tabular}
  \end{center}
  \caption{Summarizing data for the graph in Figure~\ref{fig:graphs}. Entries
    contain execution time (in seconds) and counterexample sizes (counting
    \ttp{Int16} values).}
  \label{table:results}
\end{table}

To motivate the benefits of SmartCheck reduction algorithm, consider
Table~\ref{table:results} and the corresponding graphs in
Figure~\ref{fig:graphs}, giving the results for using QuickCheck and SmartCheck
on the program shown in Figure~\ref{fig:initial} (we omit SmallCheck, since as
noted, it is not even feasible for the example with default
instances).\footnote{All results reported in the paper are with a
  GHC-7.6.3-compiled program, using \ttp{-O2}, on a 2.8 GHz 4 core i7 with 16GB
  memory under OSX v. 10.9.2.}  We graph the final counterexample size and
execution time.

The experiments show three approaches to shrinking with QuickCheck: no shrinking
(QC none), to provide a baseline, truncated shrinking (QC trunc), as described
above, in which each list is truncated to 10 elements and \ttp{Int16} values are
not shrunk, and tuple shrinking (QC tuple), also described above, and
\ttp{Int16} values are also not shrunk.

With each shrinking strategy, we generate 1000 runs in which a random
counterexample is discovered (by QuickCheck), and then shrunk.  We plot the
number of runs resulting in a shrunk counterexample of a given size.

In Table~\ref{table:results}, we summarize the graphed values and provide
execution time in seconds as well.  For both execution time and final value
size, we summarize the mean, median, standard deviation, and the results at the
95th percentile.  (While we provide the standard deviations, note that the plots
are not necessarily Gaussian.)

The execution time includes the time to
discover an initial counterexample as well as time to shrink it.  

SmartCheck produces smaller counterexamples than the other approaches, with a
very small tail.


More dramatic, however, are the differences in execution time.  The execution
times for SmartCheck include the initial counterexample generation time required
by QuickCheck (QC none); thus the time required by SmartCheck to shrink a
counterexample is on average just two hundredths of a second (0.10 - 0.08).  As
can be seen in the graphs, SmartCheck's execution time scales linearly (like
QuickCheck without shrinking).  Truncation and tuple shrinking produce very long
tails; the longest execution time recorded for truncation shrinking and tuple
shrinking are 88 seconds and 110 seconds, respectively, although these outliers
can vary wildly.  SmartCheck is largely insensitive to the size of the original
counterexample provided.

%% In regards to generalization, after shrinking the counterexample, SmartCheck
%% will also generalize it.  Note that for any \ttp{Word8} value in the first
%% argument of \ttp{T}, the counterexample still holds.  For example, SmartCheck
%% might return a counterexample of the form
%% %
%% \begin{code}
%% forall x . T x [-20000] [-20000] [] []
%% \end{code}

Finally, these performance results for SmartCheck are fairly generalizable to
other types and properties, since the algorithm is not dependent on the specific
program or property being checked.  However, the results are particularly
pronounced for counterexamples that are particularly ``deep'' or ``broad'' in
terms of their tree structure.  %% For programs and properties for which small
%% counterexamples are easily found, QuickCheck's shrinking is perfectly
%% acceptable.  Furthermore, user-defined shrink methods may vary significantly
%% based on the property and counterexample type, so it is not always clear which
%% implementation to compare against.
We now describe in detail SmartCheck's algorithm for generic shrinking.


%% \begin{itemize}

%% \item an efficient counterexample reduction strategy for algebraic
%%   data types (Section~\ref{sec:shrinking});

%\item an approach to counterexample generalization, to present the user with a
%%   formula describing a set of counterexamples to a property
%%   (Section~\ref{sec:generalization});

%% \item an approach to generically strengthen a property with a precondition that
%%   characterizes an already-discovered counterexample for the purpose of finding
%%   new, dissimilar counterexamples (Section~\ref{sec:generalization});

%\item

%% \item More generally, this paper is the first to explore the idea of
%%   efficient, generic counterexample reduction and generalization for
%%   functional program testing.
% \end{itemize}


%% \noindent Moreover, the inability to quickly uncover understandable
%% counterexamples is a real-world problem with QuickCheck (or other functional
%% languages).  It works great when your tests pass,

%% \paragraph{Outline.}
%% The paper precedes as follows.  We define a type class and some definitions in
%% Section~\ref{sec:preliminaries} used in the remainder of the paper.  In
%% Section~\ref{sec:shrinking}, we describe SmartCheck's reduction algorithm for
%% efficiently shrinking large counterexamples.  In
%% Section~\ref{sec:generalization}, we turn our attention to the problem of
%% counterexample generalization.  Here, we present two algorithms for generalizing
%% counterexamples and constructing formulas for describing them.  We also discuss
%% a method for property strengthening to discover new counterexamples.  We
%% describe implementation-specific details in Section~\ref{sec:implementation},
%% and the results of some experiments in using our tool are presented in
%% Section~\ref{sec:experiments}.  Related work is given in
%% Section~\ref{sec:related}.  Finally, concluding remarks and future work are
%% presented in Section~\ref{sec:conclusions}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preliminaries}\label{sec:preliminaries}

SmartCheck focuses on algebraic data types represented as ``sums of products''
including recursive and mutually recursive types.  We sometimes refer to the
elements of a sum type as a \emph{variant} that is \emph{tagged} by a
constructor.  For example, the type
%
\begin{code}
Maybe a = Just a | Nothing
\end{code}
%
\noindent
contains two variants, tagged by the constructors \ttp{Just} and \ttp{Nothing},
respectively.

To present the algorithms in the following sections, we provide some definitions
in the form of methods of a \ttp{SubTypes} type class.  (SmartCheck requires
instances to be defined for data being analyzed; in
Section~\ref{sec:implementation}, we describe how instances for the type class
are derived automatically.)
%
\begin{code}
type Idx    = Int
type Size   = Int
data SubVal = forall a. SubTypes a => SubVal a

class Arbitrary a => SubTypes a where
  size    :: a -> Size
  index   :: a -> Idx -> Maybe SubVal
  replace :: a -> Idx -> SubVal -> a
  constr  :: a -> String
  constrs :: a -> [String]
  opaque  :: a -> Bool
  subVals :: a -> Tree SubVal
\end{code}
%
\noindent
The \ttp{SubTypes} type class requires QuickCheck's \ttp{Arbitrary} as a
super-class.  \ttp{SubTypes} has the following methods:
\begin{itemize}
\item \ttp{size} returns the size of a value---intuitively, the number of
  constructors contained within it,
\item \ttp{index} returns a \emph{sub-value} at a breadth-first index in a
  value,
\item \ttp{replace} replaces a sub-value at a particular focus (returning the
  original value if the index is out-of-bounds),
\item \ttp{constr} returns a string representation of the constructor tagging
  the value,
\item \ttp{constrs} returns the list of all possible constructors from the
  value's type,
\item \ttp{opaque} is false when the type of the value is an ``interesting
  type''; informally, this is a type other than a primitive type like \ttp{Int}
  or \ttp{Char}.  See Section~\ref{sec:base} for a full discussion.
\item \ttp{subVals} returns a tree of all non opaque-type sub-values.  A tree is a
  value with the type
%
  \begin{code}
data Tree a = Node \{ rootLabel :: a
                   , subForest :: [Tree a] \}
  \end{code}
%
\end{itemize}

\noindent
To illustrate typical evaluations of the methods, consider a binary tree type:
%
\begin{code}
data T = L | B T T
\end{code}
%
\noindent
and the value \ttp{tree}, labeled with indexes in a breadth-first order:
%
\begin{code}
tree = B\sub{0} (B\sub{1} L\sub{3}
             (B\sub{4} L\sub{6} L\sub{8}))
          (B\sub{2} L\sub{5} L\sub{7})
\end{code}
%
\noindent
Here are example applications of \ttp{SubTypes} methods; in the following, we
show the indexes with respect to the value \ttp{tree}:
%
\begin{code}
size tree = 9

index tree 4  = (Just . SubVal) (B\sub{4} L\sub{6} L\sub{8})
index tree 12 = Nothing

replace tree 2 (SubVal L) =
  B\sub{0} (B\sub{1} L\sub{3}
        (B\sub{4} L\sub{6} L\sub{8}))
     L

constr  tree = ["B"]
constrs tree = ["B", "L"]
constrs L    = ["B", "L"]

opaque (3 :: Int) = True
opaque tree       = False
opaque L          = False

subVals (B (B L\sub{0} L\sub{1}) L\sub{2}) =
  Node (B L\sub{0} L\sub{1}) [Node L\sub{0} [], Node L\sub{1} []]
       (Node L\sub{1} [])
\end{code}
%
\noindent
The \ttp{SubVal} type is an existential data type, used as a generic container
for sub-values from a counterexample.  We will sometimes refer to the unwrapped
value returned by \ttp{index a i} as the \emph{$i$th sub-value of $a$}, so for
example, (\ttp{B\sub{4} L\sub{6} L\sub{8}}) is the 4th sub-value of \ttp{tree}.
An invariant of \ttp{index} is that for any value \ttp{a}, and for the smallest
$i \geq 0$ such that
%
\begin{code}
index a i == Nothing
\end{code}
%
\noindent
then for all $0 \leq j < i$,
%
\begin{code}
index a j /= Nothing
\end{code}
\noindent
We use this invariant as a termination case in recursive algorithms over the
sub-values of a value.  (Rather than indexes into a data-structure, an
alternative representation is to use a zipper data structure~\cite{zipper} to
traverse data.  We have chosen explicit indexes to write simple tail-recursive
algorithms that can easily be transcribed to imperative languages.)

In our implementation, \ttp{constr} and \ttp{constrs} depends on GHC
Generics~\cite{generics}, which we describe in Section~\ref{sec:implementation}.
For simplicity, we omit here Generics-specific super-class constraints on the
\ttp{SubTypes} class.  Moreover, our presentation simplifies the implementation
(Section~\ref{sec:implementation}) somewhat to improve the presentation.

%% Finally, \ttp{opaque} is true for our tree type but false for a
%% structureless type, such as \ttp{Int}.

% ------------------------------------------------------------------------------

\section{Shrinking Data}\label{sec:shrinking}
In this section, we describe how to efficiently and generically shrink algebraic
data values.  Recall the basic idea behind the \ttp{shrink}
method of the \ttp{Arbitrary} class: generate a list of values, each of which is
smaller than the current counterexample.  Each of the new values generated
may not bear any relationship to the original counterexample other than being
smaller.  SmartCheck pursues an approach that searches for smaller but
structurally similar counterexamples, as we make precise below.

We describe the algorithm in Section~\ref{sec:reduct} and then describe
algorithmic details in Section~\ref{sec:details}.  Some optimizations to the
reduction algorithm are described in Section~\ref{sec:optimizations}.

%% Consider the problem of finding a small counterexample as a graph search
%% problem.  The set of vertexes are all possible well-typed values.  Let $<$ be
%% a partial ordering on values relating them by some size measure.  Let there
%% be an edge from vertex $v$ to $v'$ if $v < v'$ and there exists no $v''$ such
%% that $v < v'' < v'$.  The goal is to traverse the graph, starting from a
%% counterexample, to find a vertex from the set of smallest counterexamples in
%% the graph.

%% The approach taken by typical \ttp{shrink} implementations is the most naive
%% graph search algorithm: it simply generates a random set of potential vertexes
%% that are strictly smaller according to the given measure to check the property
%% against.  However, it does not exploit that the original counterexample
%% \ttp{cex} is already in a neighborhood of other likely counterexamples.  So
%% rather than choose random vertexes, it may be more efficient to test other
%% values that are closely related to \ttp{cex}.  By restricting the search to
%% \ttp{cex}'s neighborhood, a smaller counterexample might be missed, and only a
%% local minimum is returned.  However, with a large set of values, finding a
%% global minimum by random sampling vertexes in the graph is possible but too
%% inefficient.

\subsection{Reduction Algorithm Overview}\label{sec:reduct}
The algorithm we present for efficiently searching for new counterexamples is an
instance of greedy breadth-first search over a tree structure that represents a
value.  At each node, during the traversal, we generate arbitrary
structurally smaller sub-values and build a new value from that, leaving
the remainder of the tree unchanged.  Intuitively, by a \emph{structurally
  smaller value}, we mean one with fewer constructors.  We continue until we
reach a fixed-point.


\begin{figure}[ht]
  \lstset{caption={}
%         ,xleftmargin=5ex,firstnumber=0,numbers=left
         }
  \begin{code}
getSize :: SubVal -> Size
getSize (SubVal a) = size a

newVals :: Size -> Int -> SubVal -> IO [SubVal]
newVals sz tries (SubVal a) =
  replicateM tries s where
  s = liftM SubVal (\undef{sizedArbitrary} sz a)

reduce :: SubTypes a
  => ScArgs -> (a -> Property) -> a -> IO a
reduce args prop cex = reduce' 1
  where
  reduce' idx
    | Just v <- index cex idx
    = do vs <- newVals (getSize v)
                 (scMaxReduce args) v
         maybe (reduce' (idx+1)) (reduce args prop)
               (test cex idx vs prop)
    | otherwise = return cex

test :: SubTypes a => a -> Idx -> [SubVal]
     -> (a -> Property) -> Maybe a
test cex idx vs prop = go vs
  where
  go []      = Nothing
  go (v:vs') =
    let cex' = replace cex idx v in
    if \undef{pass} prop cex' then go vs'
      else Just cex'
  \end{code}
  \caption{Counterexample reduction algorithm.\label{fig:reduction}}
\end{figure}

Figure~\ref{fig:reduction} shows the reduction algorithm.  In this algorithm and
subsequent algorithms in the paper, functions in {\bf bold font} are left
undefined but their implementation is described in the text.  The function
\ttp{reduce} takes flags to customize the algorithm's behavior, a counterexample
\ttp{cex}, and the property \ttp{prop}.  The reduction begins at the first
proper sub-value of \ttp{cex}; call it \ttp{v}.  When the index \ttp{idx}
becomes out-of-bounds and returns \ttp{Nothing}, the algorithm terminates.
Otherwise, a list of new random values are generated.
%
\begin{code}
sizedArbitrary :: SubTypes a => Size -> a -> IO a
\end{code}
%
\noindent
generates a new value \ttp{v'} having the same type as \ttp{v} and that is
strictly smaller (with respect to the \ttp{size} method) than \ttp{v}.  Just
like QuickCheck's \ttp{arbitrary} method, \ttp{sizedArbitrary} generates
successively larger counterexamples when generating new values with which to
replace a sub-value.

The flag \ttp{scMaxReduce} is the maximum number of tries to discover a new
counterexample by replacing \ttp{v} in \ttp{cex} and testing it.  The result of
\ttp{pass prop cex'} for
%
\begin{code}
pass :: (a -> Property) -> a -> Bool
\end{code}
%
\noindent
holds if \ttp{cex'} satisfies the property \ttp{prop}.  The property may be a
conditional, in which case the value must pass the precondition as well as the
consequent for \ttp{pass} to return \ttp{True}.  If no failure is found, we move
to the next sub-value of \ttp{cex} and continue.  However, if a new smaller
counterexample \ttp{cex'} is found, we start a new breadth-first traversal of
\ttp{cex'}, attempting to shrink it further.

The algorithm is guaranteed to terminate: informally, the measure for the
function is that either the index increases or the size of the counterexample
being evaluated decreases.  The algorithm's complexity is
$\mathcal{O}(n^2)$, where $n$ is the number of constructors in the
counterexample, assuming that generating new sub-values and testing them is done
in constant time.

\subsection{Reduction Algorithm Details}\label{sec:details}
Having described the reduction algorithm, there are two important details about
its design we describe below.

\subsubsection{Variant Counterexample Hypothesis}
A motivation for the design of the reduction algorithm is something we call the
\emph{variant counterexample hypothesis}: in the search space of possible values
from a given type \ttp{T}, if a known counterexample \ttp{cex} is a variant
\ttp{v} of \ttp{T}, then it is most probable that other counterexamples are also
from variant \ttp{v}.  As an example supporting the hypothesis, consider a
property about unintended variable capture over a language's parse tree
represented by a sum type with constructors for module imports, function
definitions, and global-variable assignments, respectively.  A function
definition counterexample can only be reduced to smaller function definition
counterexamples, the only construct in which variable capture is possible.

Recall that the algorithm begins at the \emph{first} sub-value of the
counterexample rather than the 0th sub-value.  The 0th sub-value of \ttp{cex} is
\ttp{cex} itself.  No invariant of the algorithm would be violated by beginning
with the 0th sub-value, and in particular, the algorithm would still terminate.
However, we begin with a strict sub-value because of the hypothesis.

The hypothesis may be incorrect for some properties, in which case SmartCheck
may potentially fail to discover a smaller counterexample.  However, in
Sections~\ref{sec:existential} and~\ref{sec:precondition}, we describe an
approaches to generalize counterexamples based on discovering new counterexample
variants.  These generalization techniques are executed in an (optional)
generalization phase, run after the reduction phase, in which this hypothesis is
implemented.

%% Another way to motivate the hypothesis is that searching the space of
%% possible counterexamples by generating \emph{any} well-typed value that is
%% strictly smaller than the original counterexample is something we have seen
%% before: it is precisely the naive approach to implementing \ttp{shrink}!

%% \subsubsection{Conditional Properties}
%% A QuickCheck property may be a conditional of the form
%% %
%% \begin{code}
%% prop a = pre a ==> prop' a
%% \end{code}
%% %
%% \noindent
%% such that \ttp{prop} is true if the precondition (\ttp{pre a}) is false or
%% (\ttp{prop' a}) holds.  In practice, conditionals are used to restrict the set
%% of values tested; values that fail the precondition are discarded.  Thus, the
%% call to \ttp{pass} in the reduction algorithm only considers a potential
%% counterexample \ttp{cex} to have failed if it passes the property's precondition
%% and fails the consequent.

\subsubsection{Opaque Types}\label{sec:base}

SmartCheck focuses on efficiently shrinking and generalizing large data
structures.  It is not intended as a general replacement for QuickCheck's
\ttp{shrink} method.  Consequently, SmartCheck ignores ``primitive'' types
without value constructors, such as \ttp{Char}, \ttp{Int}, and \ttp{Word16}.
Our experience is that for the kinds of properties with counterexamples that
contain massive data structures, shrinking primitive types does not
significantly help in understanding them.  Furthermore, by ignoring these types
by fiat, shrinking time is dependent only on the size of a data structure as
measured by the number of constructors.

We generalize the idea of ignoring primitive types by introducing the concept of
\emph{opaque types}.  If the reduction algorithm encounters a opaque type, it is
ignored.  Opaque types include the primitive types mentioned above, but in the
user can declare any substructure in a data type to be a opaque type by
providing custom instances.  Doing so effectively treats values from that type
as ``black boxes'', making SmartCheck more efficient if the user knows that some
portion of the structure cannot be shrunk or is irrelevant to the property.

Opaque types can be conditional.  For example, the user may want lists to be
shrunk in general, unless the element of the list are opaque themselves.  Such a
definition is possible.

Opaque types are defined by providing the method for \ttp{opaque} in the
\ttp{SubTypes} type class.

\todo{reference into with QC, no int shrinking}


%% The number of constructors in an algebraic data value primarily determines how
%% large it is.  In some cases, the size of fields that come from primitive
%% types, like \ttp{Int} or \ttp{Char}, have less of a role in understanding a
%% counterexample, so in general, we want to prevent our reduction algorithm from
%% spending time shrinking values from these types.


%% By default, this includes ``primitive'' types (e.g., \ttp{Int}, \ttp{Word},
%% \ttp{Float}, \ttp{Bool}, and \ttp{Char}).  We also treat recursive container
%% types like lists specially: lists of opaque types are considered to be opaque
%% types themselves.  This prevents the algorithm from analyzing, for example,
%% strings.  A list of non-opaque-types, however, is shrunk by the algorithm.

%% Conversely, if the user wishes to shrink types that are treated as opaque by
%% default, she can sometimes do so before handing the counterexample to
%% SmartCheck, say by using QuickCheck's shrink instances to shrink base types
%% while leaving the remaining structure for SmartCheck to shrink.  In the future,
%% depending on user experiences, default instances for types can be removed so
%% that SmartCheck's reduction algorithm does not ignore values from those types.

\subsection{Reduction Algorithm Optimizations}\label{sec:optimizations}
The reduction algorithm description above omits some details and optimizations
we describe here.

\subsubsection{Sub-value Counterexample Hypothesis}\label{sec:subval}
Sometimes, a counterexample fails a property due to a sub-value nested deep
inside the counterexample.  The rest of the value is irrelevant.  We call this
the \emph{sub-value counterexample hypothesis.}  Thus, one way to efficiently
search the space of potential counterexamples is to test a counterexample's
(well-typed) sub-values.

For example, consider a simple calculator language containing constants,
addition, and division, together with an evaluator that checks if the divisor is
0 and returning \ttp{Nothing} in that case:
%
\begin{code}
data Exp = C Int
         | Add Exp Exp
         | Div Exp Exp

eval :: Exp -> Maybe Int
eval (C i) = Just i
eval (Add e0 e1) =
  liftM2 (+) (eval e0) (eval e1)
eval (Div e0 e1) =
  let e = eval e1 in
  if e == Just 0 then Nothing
    else liftM2 div (eval e0) e
\end{code}
%
\noindent
Now consider the property \ttp{prop\_div}, claiming that if \ttp{divSubTerms}
holds on an expression, then the evaluator returns \ttp{Just} a value:
%
\begin{code}
divSubTerms :: Exp -> Bool
divSubTerms (C _)         = True
odivSubTerms (Div _ (C 0)) = False
divSubTerms (Add e0 e1)   =  divSubTerms e0
                          && divSubTerms e1
divSubTerms (Div e0 e1)   =  divSubTerms e0
                          && divSubTerms e1

prop_div e = divSubTerms e ==> eval e /= Nothing
\end{code}
%
Testing \ttp{prop\_div}, we might have a counterexample like the following:
%
\medskip%
\begin{sloppypar}
\small
\noindent%
\ttp{Add (Div (C 5) (C (-12))) (Add (Add (C 2) (C 4)) (Add (C 7) (Div (Add (C 7)
  (C 3)) (Add (C (-5)) (C 5)))))}
\end{sloppypar}
\medskip%
%
\noindent
The cause is that \ttp{divSubTerms} fails to check whether the
divisor evaluates to zero.  In the counterexample, the culprit is a buried
sub-value:
%
\begin{code}
Div (Add (C 7) (C 3)) (Add (C (-5)) (C 5))
\end{code}
%
\noindent
Thus, when attempting to shrink an \ttp{Exp} value, it pays to test whether a
sub-value itself fails the property.

%% and consider an interpreter for the language that returns \ttp{Nothing} if
%% division-by-zero is encountered and (\ttp{Just x}), where \ttp{x} is the result
%% of the computation, otherwise:
%% %
%% \begin{code}
%% eval :: Exp -> Maybe Int
%% eval (C i) = Just i
%% eval (Add e0 e1) =
%%   liftM2 (+) (eval e0) (eval e1)
%% eval (Div e0 e1) =
%%   let e = eval e1 in
%%   if e == Just 0 then Nothing
%%     else liftM2 div (eval e0) e
%% \end{code}
%% %
%% \noindent
%% Suppose we restrict the set of programs with the following predicate:
%% %
%% \begin{code}
%% divSubTerms :: Exp -> Bool
%% divSubTerms (C _)         = True
%% divSubTerms (Div _ (C 0)) = False
%% divSubTerms (Add e0 e1)   =    divSubTerms e0
%%                             && divSubTerms e1
%% divSubTerms (Div e0 e1)   =    divSubTerms e0
%%                             && divSubTerms e1
%% \end{code}
%% %
%% and then test the property
%% %
%% \begin{code}
%% prop_div e = divSubTerms e ==> eval e /= Nothing
%% \end{code}
%% %
%% \noindent
%% The property \ttp{prop\_div} turns out to be false; for example,
%% %
%% \begin{code}
%% Div (C 1) (Div (C 0) (C 1))
%% \end{code}
%% %
%% \noindent
%% is a counterexample.  In the predicate case
%% %
%% \begin{code}
%% divSubTerms (Div _ (C 0)) = False
%% \end{code}
%% %
%% \noindent
%% we forgot to recur on the divisor.

%% %% \begin{figure}[ht!]
%% %%   \caption{Typical counterexample from QuickCheck with no user-defined
%% %%     \ttp{shrink} function.}
%% %%   \label{fig:div-cex}
%% %% \end{figure}

%% A typical counterexample returned by QuickCheck, with a \ttp{shrink} function
%% implemented, is as follows:%
%% %
%% \medskip


%% \begin{sloppypar}
%% %\makebox[\textwidth]{
%% \noindent%
%% \ttp{Div (Add (Div (Div (C (-12)) (C 22)) (Add (C 10) (C (-5)))) (Div (Add (C
%%   30) (C 12)) (C (-18)))) (Div (Div (C (-20)) (C 32)) (Div (C 10) (C (-8))))}
%% \end{sloppypar}
%% \medskip%

%% \noindent
%% It turns out that the culprit is the sub-value
%% %
%% \begin{code}
%% Div (Div (C (-20)) (C 32)) (Div (C 10) (C (-8)))
%% \end{code}
%% %
%% \noindent
%% Thus, while analyzing the counterexample above

\begin{figure}
\begin{code}
reduceOpt :: forall a . SubTypes a
  => ScArgs -> (a -> Property) -> a -> IO a
reduceOpt args prop cex = reduce' 1
  where
  reduce' idx
    | Just v <- index cex idx
    = maybe (test' v idx) (reduceOpt args prop)
            (testHole v)
    | otherwise = return cex

  test' v idx = do
    vs <- newVals (getSize v) (scMaxReduce args) v
    maybe (reduce' (idx+1)) (reduceOpt args prop)
          (test cex idx vs prop)

  testHole (SubVal a) = do
    a' <- \undef{cast} a :: Maybe a
    if \undef{pass} prop a' then Nothing else Just a'
\end{code}
  \caption{Reduction algorithm with the sub-value counterexample optimization.}
  \label{fig:reduce0}
\end{figure}

Generalizing the scenario, during the reduction algorithm's breadth-first
search through a counterexample \ttp{cex}'s sub-values, we may happen upon a
sub-value \ttp{cex'} that has the same type as \ttp{cex} and fails the property
(while passing any preconditions).  In this case, we can return \ttp{cex'}
directly, and rerun the reduction algorithm on \ttp{cex'}.  In
Figure~\ref{fig:reduce0}, we show an updated reduction algorithm, \ttp{reduceOpt},
that implements this optimization.  The function \ttp{testHole} tests the
current sub-value and if it fails the property, the function tests the sub-value
directly.


\subsubsection{Bounding Counterexample Exploration}

SmartCheck's implementation contains flags to allow the user to customize its
behavior.  Three flags that are relevant to the reduction algorithm are the
following:
%
\begin{code}
scMaxReduce :: Int
scMaxSize   :: Int
scMaxDepth  :: Maybe Int
\end{code}
%
\noindent
The \ttp{scMaxReduce} flag controls the number of values generated by the
reduction algorithm for each sub-value analyzed.  \ttp{scMaxSize} controls the
maximum size of values generated to replace sub-values by the reduction
algorithm.  Thus, new sub-values must be strictly smaller than the minimum of
\ttp{scMaxSize} and the size of the sub-value being replaced.  Finally,
\ttp{scMaxDepth} determines the maximum depth in the counterexample the
reduction algorithm should analyze.  A value of \ttp{Nothing} means that the
counterexample should be exhaustively reduced, as the algorithm is presented in
Figure~\ref{fig:reduction}.  For example, the depth of a list is determined by
its length, and the depth of the binary tree defined in Section~\ref{sec:reduct}
is determined by the function \ttp{depth}:
%
\begin{code}
depth L         = 0
depth (B t0 t1) = 1 + max (depth t0) (depth t1)
\end{code}
%
\noindent
Of the flags, \ttp{scMaxDepth} is the most important for controlling
efficiency, particularly for large product types with significant ``fan out''.
The number of sub-values of a product type value can grow exponentially with
respect to the depth.  Furthermore, note that as the reduction algorithm
descends further, there is less chance to reduce the size of the value overall,
since smaller and smaller sub-values are replaced.


\section{Counterexample Generalization}\label{sec:generalization}

Small counterexamples make debugging easier, but they are just half the battle.
To go from a specific counterexample to the required fix in a program, the
programmer must have a flash of insight in which she generalizes the
counterexample to a set of counterexamples for which the program and property
fails.  The generalization step is an important yet under-appreciated step in
the debugging process.  A characterizing formula reduces the noise in favor of
the signal by abstracting away portions of large counterexample that are
irrelevant to why it violates the property.

The characterization of counterexamples that most helps the programmer should
strike a middle ground.  A single counterexample is too specific.  On the other
hand, the property itself is a formula that over-approximates the failing
inputs.  In this section, we describe two kinds of formula that fall between
these two extremes that we call universal and existential sub-value
generalization, respectively.  We then describe a third approach to
generalization to automatically strengthen a property's precondition to obtain
new counterexamples.

\subsection{Universal Sub-Value Generalization}\label{sec:universal}
Consider again the calculator language from Section~\ref{sec:subval}.  The
property \ttp{prop\_div} is violated for any numerator, so we might generalize a
counterexample like
%
\begin{code}
Div (Add (C 7) (C 3)) (Add (C (-5)) (C 5))
\end{code}
%
\noindent
by the formula
%
\begin{code}
forall x . Div x (Add (C (-5)) (C 5))
\end{code}
%
\noindent
since any dividend results in divide-by-zero for the given divisor.  Not only do
the generalizations assist the programmer's insight, but they reduce the sheer
size of the counterexample.  We call the kind of formula just shown
\emph{universal sub-value generalization} and it is implemented in SmartCheck.

An extrapolation algorithm performs universal sub-value generalization.  The
basic idea is as follows: for a counterexample \ttp{cex} and a property
\ttp{prop}, a breadth-first search over the sub-values of the \ttp{cex} is
performed.  For each sub-value, the algorithm generates new sub-values and
replaces them in \ttp{cex} to create a list of new potential counterexamples.
If no new value satisfies the property, then we extrapolate, claiming that for
\emph{any} new value replacing that sub-value in \ttp{cex}, the property will
fail.

\begin{figure}
  \begin{code}
subTrees :: SubTypes a => a -> Idx -> [Idx] -> Bool
subTrees cex idx = any (\undef{subTree} cex idx)

extrapolate :: SubTypes a
  => ScArgs -> a -> (a -> Property) -> IO [Idx]
extrapolate args cex prop = extrapolate' 1 []
  where
  extrapolate' idx idxs
    | subTrees cex idx idxs
    = extrapolate' (idx+1) idxs
    | Just v <- index cex idx = mkNewVals v
    | otherwise = return idxs
    where
    mkNewVals v = do
      vs <- newVals (scMaxSize args)
                    (scMaxForall args) v
      extrapolate' (idx+1)
        (if allFail args cex idx vs prop
           then idx:idxs else idxs)

allFail :: SubTypes a => ScArgs -> a -> Idx
  -> [SubVal] -> (a -> Property) -> Bool
allFail args cex idx vs prop =
  length res >= scMinForall args && and res
  where
  res  = mapMaybe go vs
  go   = \undef{fail} prop . replace cex idx
  \end{code}
  \caption{Universal sub-value generation algorithm.}
  \label{fig:universal}
\end{figure}

The extrapolation algorithm is shown in Figure~\ref{fig:universal}; let sketch
its specification.  The algorithm is similar to the reduction algorithm in
Figure~\ref{fig:reduction} (and in the implementation, the algorithms are
generalized and combined).  The function \ttp{extrapolate} returns a list of
indexes to be generalized in the original counterexample.  In the recursive
function \ttp{extrapolate'}, there is a function guard with a call
%
\begin{code}
subTree cex idx0 idx1
\end{code}
%
\noindent
where \ttp{subTree} has the type
%
\begin{code}
subTree :: SubTypes a => a -> Idx -> Idx -> Bool
\end{code}
%
\noindent
The value
%
\begin{code}
subTree cex idx0 idx1
\end{code}
%
is true if in \ttp{cex}, the value at index \ttp{idx0} is a child of index
\ttp{idx1} in a tree representation of \ttp{cex} (i.e., \ttp{subVals cex}).  The
\ttp{subTrees} guard prevents the algorithm from trying to generalize sub-values
that are abstracted away already since their parents have been generalized.
New sub-values are generated by \ttp{newVals}, shown in
Figure~\ref{fig:reduction}.

The function \ttp{allFail} takes a counterexample \ttp{cex}, an index into
\ttp{cex}, a list of new sub-values, and a property.  It returns true if no new
values satisfy the property.  The function
%
\begin{code}
fail :: (a -> Property) -> a -> Maybe Bool
\end{code}
%
\noindent
is roughly the dual of \ttp{pass} in the reduction algorithm: (\ttp{fail
  prop cex}) returns (\ttp{Just True}) if \ttp{cex} passes \ttp{prop}'s
precondition but fails the property; (\ttp{Just False}) if \ttp{cex}
non-trivially satisfies \ttp{prop}, and \ttp{Nothing} if \ttp{cex} fails
\ttp{prop}'s precondition.

Like in the reduction algorithm user-specified flags bound the behavior of the
algorithm.  We bound the size of values to generate by the flag \ttp{scMaxSize},
which is independent of the size of the particular sub-value.  The flag
\ttp{scMaxForall} is the analogue of the \ttp{scMaxReduce} flag, determining the
number of values generated in trying to generalize a value.  The flag
\ttp{scMinForall} is the minimum number of \ttp{Just False} results required
from \ttp{fail} to extrapolate from failed tests to a universal claim.  So, for
example, if \ttp{scMaxForall} is set to 30 and \ttp{scMinForall} is set to 20,
we generate 30 new values, 20 of which must pass the precondition but fail the
property to claim the counterexample can be generalized.

The algorithm's complexity is $\mathcal{O}(n)$, where $n$ is the number of
constructors in the counterexample.  Again, we assume that the cost for each
generating random values and testing them at each index is constant.

\paragraph{Soundness}
The extrapolation algorithm is unsound in two ways.  First, it extrapolates from
a set of counterexamples to a universal claim, similar to QuickSpec or
Daikon~\cite{qs,daikon}.  By tuning the parameters, the risk of an unsound
generalization is reduced by requiring more or larger values to fail the
property.

In some cases, a formula may be returned that is overly general.  For example,
consider the counterexample in which both arguments of the outermost \ttp{Add}
constructor contain values causing the failure:
%
\begin{code}
Add (Div (C 1) (Add (C (-2)) (C 2)))
    (Div (C 0) (Add (C (-1)) (C 1)))
\end{code}
%
\noindent
Since no matter what random value the first field of \ttp{Add} is replaced with,
the property fails by the second field and vice versa for the second.  The
universal generalization algorithm might return the formula
\noindent
%
\begin{code}
forall values x0 x1 . Add x0 x1
\end{code}
%
\noindent
An over-generalized formula is only a problem when generalizing without
minimizing the counterexample first, as pointed out to the author by John
Hughes, so in practice with SmartCheck, this form of unsoundness does not arise.
However, if one were to generalize without shrinking first, the reader should
read a universally quantified formula as shorthand for quantifying each variable
independently and taking the conjunction of formulas.  For example, instead of
%
\begin{code}
forall values x0 x1 . Add x0 x1
\end{code}
%
\noindent
one should read
%
%
\begin{code}
      forall values x0 .
        Add x0 (Div (C 0) (C (-1)) (C 1))
  and forall values x1 .
        Add (Div (C 1) (C (-2)) (C 2)) x1
\end{code}


\subsection{Existential Sub-Value Generalization}\label{sec:existential}

Sum types denote choice in a data type.  Sometimes, a property over a sum type
fails because there is a bug for some of the variants but not others.  For
example, recall again the calculator language from Section~\ref{sec:subval}.  The
no-division-by-zero property fails only for values that contain a variant tagged
with the \ttp{Div} constructor.  Recall again the generalized counterexample
from Section~\ref{sec:generalization}:
%
\begin{code}
forall x . Div x (Add (C (-5)) (C 5))
\end{code}
%
\noindent
Because the divisor does not generalize, we know there is something special
about it that causes failure.  But we might wonder if there is something special
about variants tagged by the \ttp{Add} constructor, or might we finding failing
sub-values with the other variants.

We therefore introduce another kind of generalization we call \emph{existential
  sub-value generalization}.  In this generalization, if there is a
counterexample containing every possible variant as a sub-value, then we
abstract it.  For example, suppose that \ttp{divSubTerms} had no equation
%
\begin{code}
divSubTerms (Div _ (C 0)) = False
\end{code}
%
\noindent
Then the following would all counterexamples:
%
\begin{code}
Div (Add (C 3) (C 2)) (C 0)
Div (C 7) (Add (C (-5)) (C 5))
Div (C (-2)) (Div (C 0) (C 1))
\end{code}
%
\noindent
Because there are only three variants in the type, we have found counterexamples
built from each of them in the divisor.  We therefore can claim the following
formula holds:
%
\begin{code}
forall values x .
  forall constructors c .
    there exist arguments \(\stackrel{\rightarrow}{y}\) .
      such that Div x (c\(\stackrel{\rightarrow}{y}\))
\end{code}
%
\noindent
We therefore present an existential sub-value generalization algorithm that
performs constructor generalization.  Like with the other algorithms, this
algorithm also performs a breadth-first search over a counterexample.

\begin{figure}
  \begin{code}
subConstr :: SubVal -> String
subConstr (SubVal a) = constr a

subConstrs :: SubVal -> [String]
subConstrs (SubVal a) = constrs a

sumTest :: SubTypes a => ScArgs -> a
  -> (a -> Property) -> [Idx] -> IO [Idx]
sumTest args cex prop exIdxs = sumTest' 1 []
  where
  sumTest' idx idxs
    | subTrees cex idx (exIdxs ++ idxs)
    = sumTest' (idx+1) idxs
    | Just v <- index cex idx = fromSumTest v
    | otherwise = return idxs
    where
    fromSumTest v = do
      vs <- newVals (scMaxSize args)
              (scMaxExists args) v
      sumTest' (idx+1)
        (if constrFail cex idx vs prop
           (subConstr v) (subConstrs v)
           then idx:idxs else idxs)

constrFail :: SubTypes a => a -> Idx -> [SubVal]
  -> (a -> Property) -> String -> [String] -> Bool
constrFail cex idx vs prop con allCons =
  constrFail' [con] vs
  where
  constrFail' cons vs'
    | length cons == length allCons = True
    | null vs'                      = False
    | go v == Just True
    = constrFail' (c:cons) (tail vs')
    | otherwise
    = constrFail' cons (tail vs')
    where
    v  = head vs'
    c  = subConstr v
    go = fail prop' . replace cex idx
    prop' a = c `notElem` cons ==> prop a
  \end{code}
  \caption{Universal sub-value generation algorithm.}
  \label{fig:constrs}
\end{figure}

We show the algorithm in Figure~\ref{fig:constrs}.  The function \ttp{sumTest}
takes a set of flags, a counterexample, a property, and a list of indexes that
have already been generalized---perhaps by the extrapolation algorithm in
Figure~\ref{fig:universal}.  The list of course may be empty if no sub-values
have been previously extrapolated.  In a call to \ttp{subTrees}, discussed in
Section~\ref{sec:universal}, the guard prevents constructor generalization if
the current index is a sub-value of a previously generalized value.  Otherwise,
a list of well-typed new values are generated by a call to \ttp{newVals}, as
shown in Figure~\ref{fig:reduction}.  In the arguments to \ttp{newVals}, we
bound the size of values generated with \ttp{scMaxSize} as before, and bound the
number of values generated with the flag \ttp{scMaxExists}.  Because values are
randomly generated, for ``wide'' sum-types (i.e., with a large number of
constructors), \ttp{scMaxExists} should be large enough to ensure with high
probability that each variant is generated.

The function \ttp{constrFail} returns true if we replace the sub-value at index
\ttp{idx} in counterexample \ttp{cex} with every possible variant given the type
and construct a counterexample to the property.  There are four guards to the
recursive function \ttp{constrFail'}: the first guard holds if the list of
constructors tagging variants in which a counterexample is found is equal in
size to the list of all possible constructors for the type.  The second guard
tests whether the set of test values is null; if so (and if the first guard
fails), then we have exhausted test values before finding all possible failing
variants.  Third, for a specific sub-value \ttp{v}, we test whether it fails the
property.  If so, we add its constructor to the list of constructors.
Otherwise, we simply recurse.  Note in the definition of \ttp{prop'}, we add an
additional precondition that the current constructor is not an element of
constructors already seen.  Thus, (\ttp{go v}) returns
%
\begin{code}
\ttp{Just True}
\end{code}
%
\noindent
if
%
\begin{code}
\ttp{replace cex idx v}
\end{code}
%
passes this precondition (and any other preconditions of \ttp{prop}), but fails
the property.

Unlike universal sub-value generalization, existential sub-value generalization
is sound.  The existential claim is only that for each variant, there exists at
least one counterexample, so inductive reasoning is not required.

This algorithm's complexity is also $\mathcal{O}(n)$, where $n$ is the number of
constructors in the counterexample.

\subsection{Automated Precondition Strengthening}\label{sec:precondition}
The universal and existential generalization algorithms generalize a
counterexample, but in the ``neighborhood'' of the original counterexample.  In
particular, all generalizations are from the same variant as the original
counterexample.  To help the programmer in the generalization step, we would
also like a way to test the property again, ensuring we get counterexamples (if
they exist) outside of the neighborhood of the original one.

\begin{figure}[ht!]
  \begin{center}
    \includegraphics[scale=0.49]{Figs/cex-gen}
   \end{center}
  \caption{Counterexample generalization.}
  \label{fig:cex-gen}
\end{figure}

Figure~\ref{fig:cex-gen} illustrates a property of the form (\ttp{pre ==>
  post}).  Points are specific counterexamples that satisfy the precondition but
fail the post-condition, and the enclosing oval represents the generalization of
counterexamples resulting from either universal or existential generalization.
Our goal is to find additional counterexamples in the shaded region.  As new
counterexamples are discovered in the shaded region (and generalized), the
counterexample space becomes covered until no more classes of counterexamples
exist or it becomes too difficult for the testing framework to discover them.

\begin{figure}
  \begin{code}
matchesShapes :: SubTypes a
  => a -> [(a,[Idx])] -> Bool
matchesShapes d = any (matchesShape d)

matchesShape :: SubTypes a
  => a -> (a, [Idx]) -> Bool
matchesShape a (b, idxs)
  | constr a /= constr b = False
  | Just a' <- aRepl
  = let x = subForest (subVals a') in
    let y = subForest (subVals b)  in
    all foldEqConstrs (zip x y)
  | otherwise = False
  where
  updateA idx d =
    fmap (replace d idx) (index b idx)
  aRepl = foldl go (Just a) idxs where
    go ma idx | Just x <- ma = updateA idx x
              | otherwise    = Nothing
  foldEqConstrs ( Node (SubVal l0) sts0
                , Node (SubVal l1) sts1 )
    | constr l0 == constr l1 =
      all foldEqConstrs (zip sts0 sts1)
    | otherwise              = False
  \end{code}
  \caption{Shape matching algorithm.}
  \label{fig:matches}
\end{figure}

Figure~\ref{fig:matches} shows the shape-matching algorithm used for precondition
strengthening.  The function takes a candidate counterexample and a list of
previous counterexamples, together with the indexes at which they have been
generalized.  The basic idea of the algorithm is to determine whether two values
have the same ``shape''.  Intuitively, we consider two values to have the same
shape if in their respective tree representations, their constructors at each
node in the tree match, ignoring all opaque types (Section~\ref{sec:base}).
That is, for values \ttp{a} and \ttp{b},
%
\begin{code}
subVals a == subVals b
\end{code}
%
\noindent
Furthermore, indexes that have been universally generalized (and all their
children) match any value, since a universally generalized index indicates that
counterexamples have been found for any possible constructor.  In our
implementation, we also consider existentially generalized indexes to match any
value.  Doing so is a design choice that is more aggressive about covering the
space of counterexamples at the risk of omitting some.

To understand the shape matching algorithm, consider a few values of type
\ttp{Exp}, defined in Section~\ref{sec:subval}:
%
\begin{code}
e0 = Div (C 1) (C 2)
e1 = Div (C 1) (C 3)
e2 = Div (Add (C 1) (C 2)) (C 7)
e3 = Div (Div (C 8) (C 2)) (C 7)
\end{code}
%
\noindent
Then the following hold:
%
\begin{code}
matchesShape e0 (e1, [])  == True
matchesShape e1 (e2, [])  == False
matchesShape e1 (e2, [0]) == True
matchesShape e3 (e2, [0]) == True
\end{code}
%
The first equation holds because we ignore opaque types, so the arguments to
\ttp{C} are considered equal.  The second equation fails because at index 0,
\ttp{e1} contains the constructor \ttp{C} and \ttp{e2} contains the constructor
\ttp{Add}.  However, index 0 has been generalized, then we ignore the sub-value
at index 0, and they match (the third equation).  The same reasoning holds for
the fourth equation.


\section{Implementation and Usage}\label{sec:implementation}

The implementation of SmartCheck is written in Haskell and is designed to test
Haskell programs.  The source code is licensed BSD3 and is freely
available.\footnote{\url{https://github.com/leepike/SmartCheck.git}}

SmartCheck generically operates over arbitrary algebraic data and so uses a
generics library to encode generic traversals.  Specifically, SmartCheck uses
``GHC Generics'', a recently-developed generics library for
Haskell~\cite{generics}.  The generics library allows algebraic data types to
automatically derive the type class \ttp{SubTypes} presented in
Section~\ref{sec:preliminaries}.  One limitation with GHC Generics is that it
does not (currently) support generalized algebraic data types~\cite{gadts}.

The data type to be tested must derive the \ttp{Typeable}, and \ttp{Generic}
type classes.  Deriving \ttp{Typeable} and \ttp{Generic} require using the GHC
language extensions \ttp{DeriveDataTypeable} and \ttp{DeriveGeneric},
respectively.  \ttp{Typeable} is used for dynamic typing in defining the
\ttp{replace} method of \ttp{SubTypes} since it is unknown at compile-time
whether the value to be replaced and replacing value have the same types.
However, through SmartCheck's interface, run-time failures due to
type-mismatches will not occur.  Additionally, like with QuickCheck, deriving
\ttp{Show} is required to print counterexamples discovered.

Then, the user simply declares, for a data type \ttp{D},
%
\begin{code}
instance Subtypes D
\end{code}
%
Default instances are provided for common Prelude types and some additional
ones, including all types for which QuickCheck provides instances.

SmartCheck does not implement a counterexample discovery algorithm or an
\ttp{arbitrary} method.  Any initial counterexample can be passed in or
generated, and the algorithm depends on QuickCheck's \ttp{arbitrary} to generate new
well-typed values.

The kinds of programs SmartCheck is specialized for are ones that operate over a
large data structure together with smaller inputs.  Therefore, properties
provided to SmartCheck are expected to be of the form
%
\begin{code}
Testable prop => a -> prop
\end{code}
%
where \ttp{a} is the type of the value for SmartCheck to analyze, and \ttp{prop}
is a testable property, as defined by QuickCheck; morally, these are functions
(or degenerately, values) that evaluate to a Boolean value.

If QuickCheck is used to discover a counterexample, all arguments except the
first are shrunk, if their types have \ttp{shrink} methods defined for them.
The first argument is returned to SmartCheck to be shrunk or generalized
according to the algorithms described earlier.

%% \todo{is still subject to QC's maxsize constraints}
%% During all stages of SmartCheck's execution, the
%% remaining arguments are held constant.  For example, for a property of type
%% %
%% \begin{code}
%% prop :: Foo -> Int -> Bool
%% \end{code}
%% %
%% \noindent
%% and a pair of arguments for which \ttp{prop} returns \ttp{False}
%% %
%% \begin{code}
%% (SomeFoo ...) 3
%% \end{code}
%% %
%% \noindent
%% returned by QuickCheck, SmartCheck reduces and generalizes \ttp{SomeFoo} with
%% respect to the property
%% %
%% \begin{code}
%% prop' :: Foo -> Bool
%% prop' foo = prop foo 3
%% \end{code}

A read-eval-print loop is presented to the user, allowing her to iterate shrink
and generalize counterexamples, and then generate new counterexamples after
strengthening the property's precondition as described in
Figure~\ref{sec:precondition}.

SmartCheck is executed using
%
\begin{code}
> smartCheck args prop
\end{code}
%
\noindent
where \ttp{args} (the arguments) are passed in, and \ttp{prop} is the property
being tested.

The interface types and functions for SmartCheck with analogous behavior to
QuickCheck's are prefixed with an \ttp{sc} to avoid name space collisions with
QuickCheck.  Others are specialized for SmartCheck; e.g., enabling or disabling
universal or existential extrapolated, number of extrapolation rounds, and
limits on the depth and size of the values to generate.

Counterexamples can be optionally shown in a tree format by setting the
\ttp{format} field of the arguments to be equal to \ttp{PrintTree}.  for
example, the tree format shows a counterexample like
%
\begin{code}
Div (C 1) (Add (C 0) (C 2))
\end{code}
%
\noindent
as
%
%\scriptsize
\begin{samepage}
\begin{code}
Div
|
+- C 1
|
`- Add
   |
   +- C 0
   |
   `- C 2
\end{code}
\end{samepage}
%
\noindent
We find that for very large and nested data structures, a tree representation
aids in visually parsing the value.

\section{Experiments}\label{sec:experiments}
We describe two experiments using SmartCheck, including an XMonad property and a
property about a natural language processing library.

%% \subsection{Lambda Calculus}
%% Included in QuickCheck's test suite is an untyped lambda calculus module.  One
%% of the properties, \ttp{prop\_EvalProducesValue}, claims that a normal form
%% results from repeatedly applying the $\beta$-reduction rule to a lambda term.
%% But the property is just the negation of the theorem that the untyped lambda
%% calculus is not (weakly) normalizing, so we should expect failure.  In testing,
%% though, we do not know how many $\beta$-reduction steps might be required for a
%% given term, so a finite limit on reduction is introduced, either through a
%% timeout or after a number of steps.  If we limit $\beta$-reduction to 10,000
%% steps, a counterexample is returned by QuickCheck:
%% %
%% \medskip%
%% \begin{sloppypar}
%% \scriptsize
%% \noindent%
%% \ttp{Closed (App (App (App (Lam (MkVar "l") (App (Lam (MkVar "u") (App (Var (MkVar "l")) (Var (MkVar "u")))) (App (Lam (MkVar "u") (Var (MkVar "l"))) (Var (MkVar "l"))))) (Lam (MkVar "s") (App (App (App (Var (MkVar "s")) (Var (MkVar "s"))) (App (Var (MkVar "s")) (Var (MkVar "s")))) (App (Lam (MkVar "a") (Var (MkVar "a"))) (Lam (MkVar "w") (Var (MkVar "s"))))))) (App (Lam (MkVar "y") (App (Var (MkVar "y")) (Var (MkVar "y")))) (Lam (MkVar "q") (App (App (Lam (MkVar "a") (Var (MkVar "q"))) (App (Var (MkVar "q")) (Var (MkVar "q")))) (App (Var (MkVar "q")) (App (Con (MkCon "H")) (Var (MkVar "q")))))))) (App (App (App (App (Lam (MkVar "s") (Var (MkVar "s"))) (App (Lam (MkVar "b") (Con (MkCon "J"))) (Lam (MkVar "v") (Var (MkVar "v"))))) (Lam (MkVar "s") (Lam (MkVar "z") (Var (MkVar "z"))))) (App (Con (MkCon "Z")) (App (App (Lam (MkVar "l") (Con (MkCon "V"))) (Lam (MkVar "e") (Var (MkVar "e")))) (App (Lam (MkVar "u") (Var (MkVar "u"))) (Lam (MkVar "x") (Var (MkVar "x"))))))) (Lam (MkVar "a") (App (App (Lam (MkVar "i") (App (Var (MkVar "i")) (Var (MkVar "i")))) (Var (MkVar "a"))) (App (Lam (MkVar "t") (Lam (MkVar "u") (Var (MkVar "a")))) (App (App (Con (MkCon "P")) (Con (MkCon "M"))) (Var (MkVar "a"))))))))}
%% \end{sloppypar}
%% \medskip%
%% \noindent
%% SmartCheck reduces the counterexample automatically to a more manageable size:
%% %
%% \medskip%
%% \begin{sloppypar}
%% \scriptsize
%% \noindent%
%% \ttp{Closed (App (App (App (Lam (MkVar "l") (App (Var (MkVar "l")) (App (Lam (MkVar "u") (Var (MkVar "l"))) (Var (MkVar "l"))))) (Lam (MkVar "s") (App (App (Con (MkCon "D")) (App (Var (MkVar "s")) (Var (MkVar "s")))) (Var (MkVar "c"))))) (Con (MkCon "D"))) (Var (MkVar "b")))}
%% \end{sloppypar}
%% \medskip%
%% \noindent
%% and then gne:eralizes it to the following formula:
%% %
%% \begin{code}
%% forall x0 x1 x2: Closed (App (App x0 x1) x2)
%% \end{code}
%% %


\subsection{XMonad}
Recall from the introduction the XMonad example.  The XMonad window manager is a
large software project with many contributors, so naturally, a QuickCheck
test harness is included to help ensure new commits do not introduce bugs.  At
the heart of XMonad is a \ttp{StackSet} data type that encodes the relationship
between windows, work spaces, and which window has the focus.  XMonad contains
properties to ensure the correct manipulation of \ttp{StackSet}s.  Due to having
one large data-structure that is essential to the entire program, XMonad is a
perfect candidate for SmartCheck.

XMonad passes all of its QuickCheck tests, but let us see what might happen to a
new contributor if things go awry.  Suppose a developer defines a deletion
function to delete a window, if it exists.  There is a deletion function that
exists in XMonad, which is quite complex, given the amount of state that is managed
by \ttp{StackSet}.  However, one function used in deletion is to filter the
stack of windows associated with each workspace defined:
%
\begin{code}
removeFromWorkspace ws =
  ws \{ stack = stack ws >>= filter (/= w) \}
\end{code}
%
\noindent
Now, suppose the programmer makes a simple typo and instead writes
%
\begin{code}
removeFromWorkspace ws =
  ws \{ stack = stack ws >>= filter (== w) \}
\end{code}
%
\noindent
When testing the property \ttp{prop\_delete}, which says that deleting the
focused window of the current stack removes it from the \ttp{StackSet} \ttp{x}.
%
\begin{code}
prop_delete x =
    case peek x of
        Nothing -> True
        Just i  -> not (member i (delete i x))
\end{code}
%
\noindent
QuickCheck returns the large value shown in the introduction.  That value is a
relatively small counterexample, but even the smallest \ttp{StackSet} values are
somewhat visually overwhelming due to the number of fields within it.  Recall
the value returned by SmartCheck after generalization:
%
\begin{code}
forall values x0 x1 x2 x3:
  StackSet
    (Screen (Workspace x0 (-1) (Just x1)) 1 1)
    x2 x3 (fromList [])
\end{code}
%
\noindent
Let us examine what was generalized.  In our test run, we chose to treat data
maps as opaque, so the forth element of \ttp{StackSet} is not generalized, but
is simply the empty map, so looks uninteresting.  The second and third fields of
\ttp{StackSet} are generalized, but the first one is not.  So there is something
particular about it.  So the culprit is one of the small constants (1 and -1) or
having a \ttp{Just} value rather than a \ttp{Nothing}: it turns out that what
matters is having a \ttp{Just} value, which is the stack field that deletion
works on!


%% n insertion
%% function that inserts a new focus into the \ttp{StackSet}, except she forgets to
%% place the value inside.  The correct insertion function from XMonad is
%% %
%% \begin{code}
%% insertUp a s = if member a s then s else insert
%%   where
%%   insert = modify (Just $ Stack a [] [])
%%                   (\textbackslash(Stack t l r) ->
%%                       Just $ Stack a l (t:r)) s
%% \end{code}
%% %
%% \noindent
%% So suppose the developer forgets to cons \ttp{t} onto \ttp{r}.  Fortunately,
%% there are QuickCheck tests that depend on \ttp{insertUp}.  One such test is
%% \ttp{prop\_shift\_reversible}, which tests if the \ttp{shift} function, which
%% checks a property about shifting focus back and forth between windows.

%% With the incorrect version of \ttp{insertUp}, an example returned by QuickCheck
%% would typically include a \ttp{StackSet} value as follows:

%% \medskip%
%% \begin{sloppypar}
%% \scriptsize
%% \noindent%
%% \ttp{StackSet \{current = Screen \{workspace = Workspace \{tag = NonNegative
%%   \{getNonNegative = 5\}, layout = -2, stack = Just (Stack \{focus = 'I', up =
%%   ", down = "\})\}, screen = 5, screenDetail = 0\}, visible = [Screen
%%     \{workspace = Workspace \{tag = NonNegative \{getNonNegative = 0\}, layout =
%%     -2, stack = Just (Stack \{focus = `\char`\\161', up = ", down =
%%     "\})\}, screen = 1, screenDetail = -2\},Screen \{workspace = Workspace
%%     \{tag = NonNegative \{getNonNegative = 2\}, layout = -2, stack = Just (Stack
%%     \{focus = '3', up = ", down = "\})\}, screen = 2, screenDetail =
%%     -1\},Screen \{workspace = Workspace \{tag = NonNegative \{getNonNegative =
%%     3\}, layout = -2, stack = Just (Stack \{focus = 'T', up = ", down =
%%     "\})\}, screen = 3, screenDetail ) = 2\},Screen \{workspace = Workspace
%%     \{tag = NonNegative \{getNonNegative = 4\}, layout = -2, stack = Nothing\},
%%     screen = 4, screenDetail = -1\},Screen \{workspace = Workspace \{tag =
%%     NonNegative \{getNonNegative = 6\}, layout = -2, stack = Just (Stack \{focus
%%     = 'm', up = ", down = "\})\}, screen = 6, screenDetail = 2\},Screen
%%     \{workspace = Workspace \{tag = NonNegative \{getNonNegative = 7\}, layout =
%%     -2, stack = Nothing\}, screen = 0, screenDetail = 0\}], hidden = [Workspace
%%     \{tag = NonNegative \{getNonNegative = 1\}, layout = -2, stack = Just (Stack
%%     \{focus = `\char`\\DC3', up = ", down = "\})\},Workspace \{tag = NonNegative \{getNonNegative = 8\}, layout = -2, stack = Nothing\},Workspace \{tag = NonNegative \{getNonNegative = 9\}, layout = -2, stack = Just (Stack \{focus = `\char`\\ACK', up = ", down = "\})\}], floating = fromList []\} }
%% \end{sloppypar}
%% \medskip%
%% \noindent
%% This is a wall of text, which provides little useful information about what went
%% wrong.  SmartCheck, however, takes the value and returns the following:
%% %
%% \begin{code}
%% Extrapolated value:
%% forall values x0 x1 x2:
%%   StackSet x0 x1 x2 (fromList [])
%% \end{code}
%% %
%% \noindent
%% (The last field of \ttp{StackSet} is a map, and we treat it as a base type.)
%% The programmer immediately sees that failing instances were found when
%% substituting arbitrary well-typed values into the first three fields of
%% \ttp{StackSet}.  Therefore, there is likely nothing specific in the
%% counterexample causing the failures.

%% the first three fields in the record are
%% irrelevant.  The upshot is that any \ttp{StackSet} fails because insertion is
%% ill-defined.  Such an insight, while obvious in hind-sight, can take hours to
%% arrive at as the programmer manually tries to write her own shrink functions or
%% manually deconstruct a large record.

\subsection{Natural Language Processing} \label{sec:nlp}
In 2012, a question was posted on the programming message board
Stack Overflow asking about how to shrink large nested data
types.\footnote{\url{http://stackoverflow.com/questions/8788542/how-do-i-get-good-small-shrinks-out-of-quickcheck}}
The poster writes:
\begin{quote}
$\ldots$ I tend to get an incomprehensible page full of output. $\ldots$
  Implementing the shrink function for each of my types seems to help a little,
  but not as much as I'd like.  $\ldots$ If I try to tune my shrink
  implementations, I also find that QC starts taking a very long time.
\end{quote}

\todo{how long unanswered?}

\noindent
The question relates to the Geni natural language processing (NLP) package
implemented in Haskell~\cite{kow}.  Specifically, counterexamples to a property
attempting to show that a macro expansion function is its own inverse are
enormous, requiring 200-300 80-character lines to print.

Using SmartCheck, we are able to reduce counterexamples to around 25
80-character lines of output.  Most of the savings in the counterexample size
were due to universal generalization, like in the XMonad case: entire record
fields are abstracted away.  From that, we (syntactically) shrunk the
counterexample by-hand further by naming common sub-expressions.

We were able to send a substantially reduced and generalized counterexample to
the message poster, making the cause of the bug more obvious.  The author responded:
\begin{quote}
$\ldots$While your improved shrinking may not have gone `all'
the way to the bottom, it got me a \emph{huge} chunk of the way there!
\end{quote}

\noindent
Recall that through the entire process, we never had to learn how GenI works,
what the property meant, or how to write a custom shrink method!

``Retro-fitting'' SmartCheck to GenI, which is a large software system, required
some boilerplate work.  Fortunately, because the program already uses
QuickCheck, \ttp{arbitrary} instances were already defined, but we had to hunt
down the data types used to write \ttp{instance Subtypes D}, for each data type
\ttp{D} (indeed, the job could have largely been done with Sed and Awk).

%-------------------------------------------------------------------------------

\subsection{Benchmarks}

No set of testing benchmarks exists over which to compare different test-case
generation and minimization approaches.  Therefore, we have collected a small
number of benchmarks, in addition to the more involved case-studies described
earlier in this section.

The benchmarks presented, in addition to the motivating example presented in
Section~\ref{sec:example}, are

\begin{itemize}
  \item  \emph{Reverse}, with the false property
\begin{code}
prop_rev :: [a] -> Bool
prop_rev ls = ls == reverse ls
\end{code}
\noindent
(the example appears in the original QuickCheck documentation);

  \item \emph{Div0}, a division-by-zero property for a simple calculator
    language (introduced in Section~\ref{sec:subval});

  \item \emph{Heap}, an example from the QuickCheck test suite, in which an
    incorrect ``to sorted list'' function is checked.

  \item \emph{Parser}, a parser/pretty-printer for a toy imperative language
    containing a parser bug that switches the arguments of an binary operator.
\end{itemize}

\noindent
All benchmarks can be found at
\url{https://github.com/leepike/SmartCheck/tree/master/regression}.  Benchmarks
are against
QuickCheck~2.7.3.\footnote{\url{http://hackage.haskell.org/package/QuickCheck-2.7.3}}
The benchmarks only compare the size of shrunk values and the time required to
generate them using standard arguments to QuickCheck and SmartCheck.  We do not
benchmark the generalization algorithms, since there is no analog in QuickCheck
or other testing frameworks.

\todo{include std dev.}
\todo{fill in benchmarks!}

\begin{table}[ht]
\footnotesize
  \begin{center}
    \begin{tabular}{|r|r||c|c|}
\hline
\multicolumn{2}{|c||}{} & QuickCheck & SmartCheck \\
\hline \hline


\multirow{4}{*}{Reverse}
  & Mean       & 2, 0.002s  & 2, \sci{4.8}{4}s  \\
  \cline{2-4}
  & Std. dev.  & 0, 0.002   & 0, \sci{8.1}{4}  \\
  \cline{2-4}
  & 95\%       & 2, 0.007s  & 2, \sci{5.8}{4}s  \\
  \hline

\multirow{4}{*}{Div0}
  & Mean       & 10, 0.001s  & 5, 0.001s  \\
  \cline{2-4}
  & Std. dev.  & 6, 0.002    & 0, 0.001   \\
  \cline{2-4}
  & 95\%       & 23, 0.002s  & 5, 0.002s \\
  \hline

\multirow{4}{*}{Heap}
  & Mean       & 15, 0.016s  & 7, 0.007s  \\
  \cline{2-4}
  & Std. dev.  & 1, 0.025    & 2, 0.003   \\
  \cline{2-4}
  & 95\%       & 19, 0.045s  & 12, 0.015s  \\
  \hline

\multirow{4}{*}{Parser}
  & Mean       & 191, 0.013s  & 16, 0.241s  \\
  \cline{2-4}
  & Std. dev.  & 195, 0.018   & 75, 0.800   \\
  \cline{2-4}
  & 95\%       & 577, 0.044s  & 14, 0.521s  \\
  \hline

    \end{tabular}

  \end{center}
  \caption{Summarizing data for the graphs in Figure~\ref{fig:graphs}. Entries
    contain execution time (in seconds) and counterexample sizes (counting
    constructors).}
  \label{table:results}
\end{table}

\todo{note that 95 is smaller than average for heap}

The \emph{Reverse} benchmark essentially provides a lower-bound on the benefit
of using SmartCheck to discover small counterexamples, since a list of length
two falsifies the property (SmartCheck is still useful in generalizing the
counterexample, however!).  Perhaps surprisingly, SmartCheck still slightly
outperforms QuickCheck.  The other two benchmarks are slightly larger examples,
where SmartCheck's benefits are more pronounced.  But again, we emphasize the
most pronounced benefit comes in much larger examples, like described in
Section~\ref{sec:nlp}.


%-------------------------------------------------------------------------------
\section{Related Work}\label{sec:related}

Other researchers have also investigated counterexample shrinking.  Zeller
and Hildebrandt describe an application of greedy search to shrink
counterexamples they call ``delta-debugging'' (DD)~\cite{dd}.  The authors apply
their work to shrinking HTML inputs to crash Mozilla and shrinking C programs to
trigger a bug in GCC.  Subsequent generalizations are reported by Misherghi and
Su in which they perform greedy search on tree-structured data; they call their
approach hierarchical delta-debugging (HDD)~\cite{hdd}.

HDD is most similar to SmartCheck's reduction algorithm, with an important
difference: HDD (and DD) is deterministic, so the algorithm only succeeds in
reducing the counterexample only if a new counterexample can be constructed from
the original one.  Our approach combines the speed of delta debugging with the
power of QuickCheck to \emph{randomly} discover structurally smaller
counterexamples; in short, this is the first treatment we know of that
integrates counterexample generation with counterexample reduction.
Furthermore, neither paper explores the idea of counterexample generalization.

Within the functional programming community, one of the few treatments of
generic shrinking is as a motivation for generic programming in Haskell's
``Scrap your boilerplate'' generic programming library~\cite{syb}.  There, the
motivation was not to design new approaches to counterexample reduction, but
simply to derive instances for the \ttp{shrink} method.

Besides QuickCheck and SmallCheck, another testing framework related to
SmartCheck is the recent Haskell library Feat~\cite{feat}.  Feat provides
automated enumerations of algebraic data types in Haskell, allowing for fast
access to very large indexes.  For example, from the enumeration of
(\ttp{[Bool]})
%
\begin{code}
[[],[False],[True],[False,False],[False,True] ...
\end{code}
%
\noindent
Accessing the $10^{1000}$th element takes under $0.1$ seconds in interpreted
Haskell.  Feat combines some advantages of SmallCheck and QuickCheck, since the
user can choose to exhaustively test an enumeration up to some depth, like with
SmallCheck, or she can create a uniform distribution of test cases up to some
depth.

Feat finds small counterexamples effectively, but a limitation compared to
QuickCheck is that value generation is applicative rather than monadic.
Consequently, defining custom generators (i.e., instances for \ttp{arbitrary})
is constrained.  On the other hand, generators can be derived
automatically with Feat.

We have benchmarked Feat as well on the example in the introduction
(Figure~\ref{fig:initial}).  It performs quite competitively to SmartCheck
in terms the size of counterexamples discovered: the mean and median number of
\ttp{N} constructors in the counterexamples returned by Feat are both seven. \todo{check}
However, Feat performs less well with respect to how long counterexample
discovery takes: the mean time taken by Feat is 5.60 seconds, with a long tail:
five percent of runs take over 14 seconds (the mean time taken by
SmartCheck is 0.12 seconds, and 99\% of runs take less than 0.30 seconds).

Finally, SmartCheck bears some similarity to QuickSpec, a testing-based library
that infers equational properties about programs~\cite{qs} insofar as they both
attempt to generalize counterexamples based on specific inputs.  QuickSpec
attempts to infer equational properties of programs through random testing.
Similarly, Daikon infers assertions for C, C++, Java, and Perl by observing
relationships between variables in executions of a program~\cite{daikon}.
SmartCheck does not attempt to infer properties like these tools do, so it is in
one sense less general.  However, SmartCheck's three kinds of counterexample
generalization are novel.

%% \todo{avoid complexities \url{http://stackoverflow.com/questions/14006005/idiomatic-way-to-shrink-a-record-in-quickcheck}%
%% %
%% * Also check out \url{http://www.whyprogramsfail.com/}%
%% %
%% \url{http://www.haskell.org/haskellwiki/Research_papers/Testing_and_correctness}%}
%% \todo{shrink paper rehger pointed to ``Automatic isolation of compiler errors''
%%   from 1994}

%% \todo{Not relevant really: GenCheck
%%   \url{https://github.com/JacquesCarette/GenCheck}%
%% \url{http://blog.regehr.org/archives/697} --- check out C-reduce (Regehr)
%% \todo{\url{http://www.cs.utah.edu/~regehr/papers/pldi12-preprint.pdf}}
%% }

%-------------------------------------------------------------------------------
\section{Conclusions and Future Work}\label{sec:conclusions}

We have presented new approaches for generically shrinking and generalizing
counterexamples over algebraic data.  SmartCheck automates the laborious
task of shrinking, and extrapolating from counterexamples, and in our
experience, performs better and faster than hand-written shrink functions.

We envision a number of potential extensions and improvements to SmartCheck.
First, we have considered only the simplest kind of data, algebraic data types.
As noted in Section~\ref{sec:implementation}, SmartCheck does not work with
GADTs currently, due to limitations with GHC Generics.  It would be interesting
to see if the approaches described here could be extended to function types as
well---we are particularly motivated by Claessen's recent work in shrinking and
showing functions~\cite{claessen}.

Lazy SmallCheck can test partially-defined inputs by detecting the evaluation of
undefined values~\cite{sc}.  This capability is useful in shrinking, too.  For
example, the universal sub-value generalization algorithm
(Section~\ref{sec:universal}) could be extended to shortcut testing and
generalize a sub-value if it is not evaluated in testing the property.  Not only
does this shortcut the generalization phase, but it gives a \emph{proof} that the
sub-value can be generalized.

SmartCheck displays (generalized) counterexamples in a form similar to default
\ttp{Show} instances or in a tree form, which can be helpful to parse the
components of the value.  Better approaches for showing large data types are
needed.  In particular, an interactive web-based viewer with hyperlinks to close
or expand sub-values would be particularly useful.

Another aspect of displaying large counterexamples that we have not explored is
to exploit sharing.  Constructs might be repeated that can be abstracted out.
For example, instead of a counterexample like
%
\begin{code}
Add (Div (C 1) (Add (C (-2)) (C 2)))
    (Div (C 1) (Add (C (-1)) (C 1)))
\end{code}
%
\noindent
we might instead return
%
\begin{code}
Add (div (-2) 2) (div (-1) 1)
  where div x y = Div (C 1) (Add (C x) (C y))
\end{code}
%
\noindent
Discovering and exploiting sharing automatically is future work.

%% We believe there are interesting connections between counterexample
%% discovery and probabilistic reasoning.  Often, when a bug in a program is fixed,
%% the fix is not correct, or it introduces new bugs.  Such knowledge might be
%% useful in guiding the distribution of test cases or shrinking.

%% Regarding our implementation, we have not profiled the performance of our
%% implementation of SmartCheck, so additional performance gains might be possible.
%% In particular, we perform significant computations over tree-structured data,
%% and a generic zipper~\cite{zipper}, which we do not currently implement, could
%% improve performance.

Debugging is a difficult task.  Functional programming has been at the forefront
of testing research, with tools like QuickCheck and SmallCheck.  We were
motivated to build a tool like SmartCheck just because of how effective
QuickCheck is at discovering counterexamples automatically---there would be no
such problem of having very large counterexamples if inputs were written by
hand.  We hope SmartCheck and the ideas in this paper continue the tradition of
highly-automated testing and debugging in the functional programming community,
and beyond!

\todo{mention memoization (generalized tries)}

\todo{tail-recursive algorithm is fast and easy to port to imperative languages}

\section*{Acknowledgments}
I thank Andy Gill, Joe Hurd, John Launchbury, Simon Winwood, the anonymous
reviewers of ICFP'13 (from which an early draft of this paper was rejected), and
especially John Hughes for reading an earlier draft of this paper.  Their deep
insights and bug catches dramatically improved both the form and content.

\bibliographystyle{abbrvnat}
\bibliography{paper}

\balancecolumns


\end{document}