Skip to content
J-E Dazard, PhD edited this page Aug 31, 2021 · 227 revisions

Welcome to the Wiki webpage of PRIMsrc! This Wiki webpage introduces what "Bump Hunting" is about and describes its current implementation in the software PRIMsrc. It lays out the roadmap of the project, shows the current development status, provides support (for usage and development), documents the software with manual and examples and illustrates usage with publications and case studies.

The Wiki page is currently under initial construction. If you are interested in improving PRIMsrc right now, please start editing it as you see fit (If you don't see an [Edit] button on the top right, you need to Sign in/up to Github).

The Wiki page can be edited in any format supported by GitHub Markup including Markdown (default). See the "Edit mode" drop-down menu in the editor. You can use the Wiki toolbar to create and include content on a page. To add a new page simply reference it within brackets, e.g.: [SamplePage]. Use the Wiki GitHub Guide for help. Thanks for your contribution! ;-)

Description of Bump Hunting

The general problem in "Bump Hunting" (BH) is to identify, characterize and predict hidden structures in the data that are informative and significant. In practice, "Bump Hunting" refers to the task of mapping out a local region of the multi-dimensional input space where a target function of interest, usually unknown, assumes smaller or larger values than its average over the entire space. In general, the region could be any smooth shape (e.g. a convex hull) possibly disjoint. The input space to perform the "Bump Hunting" search may be any low or high-dimensional space where inputs may be any variables such as attributes, features, predictors, etc. The target function may be any function of interest (see assumptions below for details).

"Bump Hunting" applies to mathematical / statistical problems such as:

  • Mode(s) Hunting
  • Local/Global Extremum(a) Finding
  • Subgroup(s) Identification
  • Outlier(s) Detection


The multivariate inputs are assumed to be real-valued, discrete or continuous variables and the target function is assumed to be a univariate real-valued bounded function, that can take on discrete, continuous or censored time-to event values.

It is intended to handle low and high-dimensional multivariate datasets, including the paradigm where the number of covariates (p) exceeds or dominates that of samples (n): p > n or p >> n.

Formal Goal

Consider a supervised setting, where the unknown target function of interest is a response vector of a multivariate vector of input explanatory variables (attributes/features/predictors).

Denote by Rp a p-dimensional Euclidean space and f(x) the target function of input explanatory variables x=(x1, x2, … , xp) of support S \subseteq Rp.

Consider a set of observations of (x,f(x)). Bump Hunting focuses on the “bump support(s)” of the target function f(x). Its goal is to:

  • Characterize a region R \subseteq S in the input space over which the target function f(x) is expected to have a local and/or global maximum (minimum). This is broader than finding the modal (anti-modal) regions, because f(x) can be e.g. a regression function, that is, not necessarily a probability distribution function. The region R could be any smooth shape (e.g. a convex hull), possibly disjointed (not necessarily contiguous).
  • Determine which of the input variables (x1, x2, … , xp) are important/relevant in the above task.
  • Identify corresponding sub-groups of samples falling into this region R

Examples below in the one-dimensional case of a regression function and two-dimensional case of a probability distribution function: Goal Left: Schematic representation of a target function f(x) that is a regression function (line) in p = 1. The solution region R (shown in red) corresponds to an interval where f(x) assumes larger average values (\bar fR) than over the entire support (\bar fS). Right: Example where the target function is a joint density probability function pdf(x1, x2) (surface) simulated from a mixture of bivariate normal distributions in p = 2. In higher-dimensional spaces, it is not uncommon to find a solution region R that can be complex, possibly disjointed.


We (see authors) chose to achieve the above goals simultaneously by a non-parametric supervised approach. Our PRIMsrc implements a unified treatment of the "Bump Hunting" task in Survival, Regression and Classification (SRC) settings in high-dimensional space. It uses a generic rule-induction algorithm by recursive peelings derived from the Patient Rule Induction Method (PRIM), initially introduced by Fisher & Friedman in 1999 [[(see publications) | 5) Publications]].

PRIMsrc estimates the target region R of interest by generating decision rules delineating a region (possibly disjointed) in the multi-dimensional input space where the target function is locally larger (or smaller) than its average over the entire space. In PRIMsrc, the approximation of the region R is shaped as low-dimensional hyper-rectangle(s), parallel to the coordinate axes. Each of these is called a "box" and is defined by conjunctions ('AND'-rules) involving a usually restricted set of inputs, such as:

Rule of box (sub-region) 1: t11- < x1 < t11+ t21- < x2 < t21+ ... tp1- < xp < tp1+

Rule of box (sub-region) 2: t12- < x1 < t12+ t22- < x2 < t22+ ... tp2- < xp < tp2+


Please note that the current version is a development release, that only implements the case of a survival outcome. At this point, this version of PRIMsrc is also restricted to a directed peeling search of the first box covered by the recursive coverage (outer) loop of our PRSP or PRGSP algorithm (see details below). New features will be added as soon as available.

In a direct application, "Bump Hunting" (BH) can identify subgroups of observations for which their outcome is as extreme as possible. Similarly to this traditional goal of subgroup finding, PRIMsrc also implements the goal of mapping out a region (possibly disjointed) of the input space where the outcome \emph{difference} between existing (fixed) groups of observations is as extreme as possible. We refer to the later goal as "Group Bump Hunting" (GBH).

In the case of a time-to event outcome, possibly censored (as in survival or risk analysis), "Survival Bump Hunting" (SBH) is done by our Patient Recursive Survival Peeling (PRSP) algorithm. See Dazard and Rao (2014, 2015, 2016, 2021a) for details, as well as Dazard et al. (2021c) for an application in Patient Survival Subtyping. Alternatively, "Group Survival Bump Hunting" (GSBH) is done by using a derivation of PRSP with specifc peeling and cross-validation criterion, called Patient Recursive Group Survival Peeling (PRGSP). See Dazard and Rao (2021b) for details, as well as Rao et al. (2021d) for an application in Survival Disparity Subtyping.

Strengths and Weaknesses

  • Minimal assumptions: PRIMsrc can handle continuous and categorical variables, along with missing values and high-dimensional situations.
  • Supervised: PRIMsrc can target desired responses including discrete (Classification), continuous (Regression) or a censored time-to event (Survival).
  • Patient: PRIMsrc is less greedy than other rule-induction methods such as partitioning methods (e.g. CART).
  • Interpretable: PRIMsrc generates simple conjunctive rules.


  • If too patient, the number of rules defining boxes may have many conditions making them less interpretable.
  • Final disjunction (unions) of rules covering wider regions can be more tedious to interpret.
  • If a single rule is used, it could discover a local optimum and get stuck in a sub-optimal region.

Why use PRIMsrc?

The fact that the method (i) makes minimal assumptions about the date, (ii) gives easily interpretable rules with estimated variance and (iii) can target for any desired responses (being supervised for Survival, Regression and Classification (SRC) settings), makes it highly attractive to the user.

Unlike classical regression, classification and clustering problems, "Bump Hunting" is interested in:

  • Understanding and characterizing newly identified sub-groups of samples and homogeneous sub-populations
  • Discovering and describing sub-groups of samples and sub-populations with extreme behaviors
  • Identifying and predicting future sub-groups of samples and sub-populations with extreme behaviors
  • Customizing and/or targeting sub-groups of samples and sub-populations with extreme behaviors

Multiple applications exist in an increasing range of problems spanning from medical, engineering, marketing, business analysis and materials research:

  • subgroup finding
  • disparity subtyping
  • personalized medicine (improved accuracy of diagnostication and/or prognostication)
  • alternative drug/treatment indication (re-purposing)
  • economical medicine (hot spotting)
  • system reliability analysis in engineering
  • duration analysis/modeling in economics
  • event history analysis in sociology
  • financial securities return
  • risk management in insurance and finance



Project funded in part by the National Institute of Health - National Cancer Institute, Grant: R01-CA160593 awarded to J.Sunil Rao/J-E. Dazard (co-PIs). This work was also made possible thanks to the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. Thanks also to Wiki contributors of the ggplo2 package by Hadley Wickham.