Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Welcome to the Wiki webpage of PRIMsrc! This Wiki webpage introduces what "Bump Hunting" is about and describes its current
implementation in the software
PRIMsrc. It lays out the roadmap of the project, shows the
current development status, provides support (for usage and development), documents the software with manual and examples and
illustrates usage with publications and case studies.
The Wiki page is currently under initial construction. If you are interested in improving PRIMsrc right now, please start editing it as you see fit (If you don't see an [Edit] button on the top right, you need to Sign in/up to Github).
The Wiki page can be edited in any format supported by GitHub Markup including Markdown (default). See the "Edit mode" drop-down menu in the editor. You can use the Wiki toolbar to create and include content on a page. To add a new page simply reference it within brackets, e.g.: [SamplePage]. Use the Wiki GitHub Guide for help.
Thanks for your contribution! ;-)
Description of Bump Hunting
"Bump Hunting" refers to the procedure of mapping out local regions of the input space where a target function of interest, usually unknown, assumes larger (or smaller) values than its average over the entire space. The input space to perform the "Bump Hunting" search may be any low or high-dimensional space where inputs may be any variables such as attributes, features, predictors, etc ... The target function may be any function of interest (see assumptions below for details).
The problem of "Bump Hunting" covers mathematical / statistical tasks such as:
- Mode(s) Hunting
- Local/Global Extremum(a) Finding
- Subgroup(s) Identification
- Outlier(s) Detection
Consider a supervised setting, where the unknown target function of interest is a response vector of a multivariate vector of input explanatory variables (attributes/features/predictors).
Denote by Rp a p-dimensional Euclidean space and f(x) the target function of input explanatory variables x=(x1, x2, … , xp) of support S \subseteq Rp.
Consider a set of observations of (x,f(x)). Bump Hunting focuses on the “bump support(s)” of the target function f(x). Its goal is to:
- Characterize a region R \subseteq S in the input space over which the target function f(x) is expected to have a local and/or global maximum (minimum). This is broader than finding the modal (anti-modal) regions, because f(x) can be e.g. a regression function, that is, not necessarily a probability distribution function. The region R could be any smooth shape (e.g. a convex hull), possibly disjointed (not necessarily contiguous).
- Determine which of the input variables (x1, x2, … , xp) are important/relevant in the above task.
- Identify corresponding sub-groups of samples falling into this region R
Examples below in the one-dimensional case of a regression function and two-dimensional case of a probability distribution function: Left: Schematic representation of a target function f(x) that is a regression function (line) in p = 1. The solution region R (shown in red) corresponds to an interval where f(x) assumes larger average values (\bar fR) than over the entire support (\bar fS). Right: Example where the target function is a joint density probability function pdf(x1, x2) (surface) simulated from a mixture of bivariate normal distributions in p = 2. In higher-dimensional spaces, it is not uncommon to find a solution region R that can be complex, possibly disjointed.
We (see authors) chose to achieve the above goals simultaneously by a non-parametric supervised approach. Our
PRIMsrc implements a unified treatment of the "Bump Hunting" task in Survival, Regression
and Classification (SRC) settings in high-dimensional space. It uses a generic rule-induction algorithm by recursive peelings
derived from the Patient Rule Induction Method (PRIM), initially introduced by Fisher & Friedman in 1999 [[(see publications) | 5)
PRIMsrc estimates the target region R of interest by generating decision rules delineating a region (possibly disjointed) in the multi-dimensional input space where the target function is locally larger (or smaller) than its average over the entire space. In PRIMsrc, the approximation of the region R is shaped as low-dimensional hyper-rectangle(s), parallel to the coordinate axes. Each of these is called a "box" and is defined by conjunctions ('AND'-rules) involving a usually restricted set of inputs, such as:
Rule of box (sub-region) 1: t11- < x1 < t11+ t21- < x2 < t21+ ... tp1- < xp < tp1+
Rule of box (sub-region) 2: t12- < x1 < t12+ t22- < x2 < t22+ ... tp2- < xp < tp2+
The multivariate inputs are assumed to be real-valued, discrete or continuous variables and the target function is assumed to be a univariate real-valued bounded function, that can take on discrete, continuous or censored time-to event values.
It is intended to handle low and high-dimensional multivariate datasets, including the paradigm where the number of covariates (p) exceeds or dominates that of samples (n): p > n or p >> n.
Strengths and Weaknesses
- Minimal assumptions: PRIMsrc can handle continuous and categorical variables, along with missing values and high-dimensional situations.
- Supervised: PRIMsrc can target desired responses including discrete (Classification), continuous (Regression) or a censored time-to event (Survival).
- Patient: PRIMsrc is less greedy than other rule-induction methods such as partitioning methods (e.g. CART).
- Interpretable: PRIMsrc generates simple conjunctive rules.
- If too patient, the number of rules defining boxes may have many conditions making them less interpretable.
- Final disjunction (unions) of rules covering wider regions can be more tedious to interpret.
- If a single rule is used, it could discover a local optimum and get stuck in a sub-optimal region.
Why Use PRIMsrc?
The fact that the method (i) makes minimal assumptions about the date, (ii) gives easily interpretable rules with estimated variance and (iii) can target for any desired responses (being supervised for Survival, Regression and Classification (SRC) settings), makes it highly attractive to the user.
Unlike classical regression, classification and clustering problems, "Bump Hunting" is interested in:
- Understanding and characterizing newly identified sub-groups of samples and homogeneous sub-populations
- Discovering and describing sub-groups of samples and sub-populations with extreme behaviors
- Identifying and predicting future sub-groups of samples and sub-populations with extreme behaviors
- Customizing and/or targeting sub-groups of samples and sub-populations with extreme behaviors
Multiple applications exist in an increasing range of problems spanning from medical, engineering, marketing, business analysis and materials research:
- personalized medicine (improved accuracy of diagnostication and/or prognostication)
- economical medicine (hot spotting)
- alternative drug/treatment indication (re-purposing)
- system reliability analysis in engineering
- duration analysis/modeling in economics
- event history analysis in sociology
- financial securities return
- risk management in insurance and finance
- 1) Authors
- 2) Roadmap
- 3) Documentation and Manual
- 4) Examples
- 5) Publications
- 6) Case Studies
- 7) FAQ
- 8) Improving Documentation
- 9) Developing Code
Project funded in part by the National Institute of Health - National Cancer Institute, Grant:
awarded to J.Sunil Rao/J-E. Dazard (co-PIs). This work was also made possible thanks to the High Performance Computing
Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. Thanks also to Wiki contributors of the
ggplo2 package by Hadley Wickham.