# MARS (Multivariate Adaptive Regression Splines)

## Model Specification

Using the terminology of splines, we first define the candidate **basis functions**:

\begin{align}
C = \{ (X_p-t)_+, (t-X_p)_+\}_{t\in\{ x_1, \dots, x_{N_p}\}, \;\;p=1,\dots, P}.
\end{align}

The $x$'s are the values of variable $X_p$ in the training sample. Note that in the above, each basis function is considered a multivariate function, though it only depends on $X_p$.
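As a rough illustration of the candidate set $C$, the sketch below evaluates one reflected pair and enumerates the (variable, knot) combinations for a tiny made-up training set. The function names `hinge_pair` and `candidate_knots` are hypothetical, and deduplicating repeated values of a variable is an assumption made here for convenience.

```python
def hinge_pair(x, t):
    """Return the reflected pair ((x - t)_+, (t - x)_+) for a scalar x and knot t."""
    return max(x - t, 0.0), max(t - x, 0.0)


def candidate_knots(X):
    """List (p, t) for every variable index p and every distinct observed value t.

    X is a list of rows; each (p, t) indexes one reflected pair in C.
    Repeated values are collapsed (an assumption of this sketch).
    """
    n_features = len(X[0])
    return [(p, t) for p in range(n_features) for t in sorted({row[p] for row in X})]


# Made-up training inputs: 3 observations, 2 variables.
X_train = [[1.0, 10.0], [2.0, 10.0], [4.0, 30.0]]
knots = candidate_knots(X_train)
print(len(knots))             # 5 (variable, knot) combinations, i.e. 10 basis functions
print(hinge_pair(3.0, 2.0))   # (1.0, 0.0)
```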

MARS then builds a linear additive model from the above set of hinge functions (sometimes also including linear functions) together with a constant function, in what is called the **forward pass**:
\begin{align}
f(X) = \beta_0+\sum_{m=1}^M \beta_mh_m(X),
\end{align}
where each $h_m(X)$ is a function in $C$, or a product of two or more such functions. Given a choice for the $h_m$, the coefficients $\beta_m$ are estimated by minimizing the residual sum of squares. What remains to be explained is how the $h_m$ are chosen. The model set $\mathcal{M}$ starts with only the constant function $h_0(X) = 1$. In each iteration, we consider as a new basis function pair all products of a function $h_l$ in the model set $\mathcal{M}$ with one of the *reflected pairs* in $C$. We add to $\mathcal{M}$ the term of the form
\begin{align}
\hat{\beta}_{M+1}h_{l}(X)\cdot(X_j - t)_+ +\hat{\beta}_{M+2}h_l(X)\cdot(t - X_j)_+,\;\;h_l \in \mathcal{M},
\end{align}
that produces the largest decrease in training error; again, $\hat{\beta}_{M+1}$ and $\hat{\beta}_{M+2}$ are estimated by least squares, along with all the other coefficients in the model. The forward pass continues in this way until the maximum number of terms allowed is reached.
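One iteration of the forward pass can be sketched as a brute-force search over parent terms, variables and knots. This is only a minimal illustration: it refits the whole model by least squares for every candidate (real implementations use much faster updating schemes), it does not enforce any restriction on which variables may multiply a parent term, and the name `forward_step` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def forward_step(B, X, y):
    """Search one forward-pass step.

    B: (N, M) matrix of the current basis functions evaluated on the training
       data (its first column is the constant 1).
    Returns (rss, parent index l, variable j, knot t) for the best reflected pair.
    """
    best = None
    for l in range(B.shape[1]):                # parent term h_l already in the model
        for j in range(X.shape[1]):            # candidate variable X_j
            for t in np.unique(X[:, j]):       # candidate knot at an observed value
                new = np.column_stack([
                    B[:, l] * np.maximum(X[:, j] - t, 0.0),
                    B[:, l] * np.maximum(t - X[:, j], 0.0),
                ])
                design = np.hstack([B, new])
                # Refit all coefficients by least squares with the new pair added.
                beta, *_ = np.linalg.lstsq(design, y, rcond=None)
                rss = float(np.sum((y - design @ beta) ** 2))
                if best is None or rss < best[0]:
                    best = (rss, l, j, float(t))
    return best


# Synthetic data: the response has a single kink in the second variable.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(60, 2))
y = np.abs(X[:, 1] - 1.0) + 0.1 * rng.normal(size=60)
B = np.ones((60, 1))                           # start from the constant term only
rss, l, j, t = forward_step(B, X, y)
print(l, j, round(t, 2))
```

On this synthetic data the search should pick the second variable with a knot close to the true kink at 1.0, since no pair built on the first variable can explain the response.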

After the forward pass concludes, MARS performs a **backward trim** (pruning pass): at each iteration, the term whose removal causes the smallest increase in residual squared error is deleted from the model. Instead of running cross-validation to estimate the error of each resulting model, one typically relies on the following generalized cross-validation (GCV) statistic as an approximation:
\begin{align}
GCV(\lambda)=\frac{\sum_{n=1}^N(y_n-\hat{f}_{\lambda}(x_n))^2}{(1-M(\lambda)/N)^2},
\end{align}
where $\lambda$ indicates the stage of the model. The value $M(\lambda)$ is the effective number of parameters in the model: it accounts both for the number of terms in the model and for the number of parameters used in selecting the optimal positions of the knots. Some mathematical and simulation results suggest that one should pay a price of three parameters for selecting a knot in a piecewise linear regression. Thus, if there are $r$ linearly independent basis functions in the model and $K$ knots were selected in the forward pass, the formula is $M(\lambda) = r+cK$, where $c = 3$. (When the model is restricted to be additive, a penalty of $c = 2$ is used.) Using this, we choose the model along the backward sequence that minimizes $GCV(\lambda)$.
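The arithmetic behind the GCV criterion is simple enough to sketch directly. The helper names `effective_params` and `gcv` are hypothetical, and the residual sums of squares, term counts and knot counts in the example sequence are made-up numbers chosen only to show the trade-off between fit and complexity.

```python
def effective_params(r, K, c=3.0):
    """M(lambda) = r + c*K: r independent basis functions, K knots, penalty c per knot."""
    return r + c * K


def gcv(rss, N, r, K, c=3.0):
    """GCV as written above: RSS / (1 - M(lambda)/N)^2."""
    M = effective_params(r, K, c)
    if M >= N:
        raise ValueError("effective number of parameters must be below N")
    return rss / (1.0 - M / N) ** 2


# Hypothetical backward sequence: (RSS, r, K) for models of decreasing size.
candidates = [(40.0, 9, 4), (42.0, 7, 3), (55.0, 5, 2), (90.0, 3, 1)]
N = 100
scores = [gcv(rss, N, r, K) for rss, r, K in candidates]
best = min(range(len(candidates)), key=scores.__getitem__)
print(best, [round(s, 2) for s in scores])
```

In this made-up sequence the second model wins: it has a slightly larger residual sum of squares than the largest model, but the smaller penalty from $M(\lambda)$ more than compensates.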

The forward modeling strategy in MARS is hierarchical, in the sense that multi-way products are built up from products involving terms already in the model. For example, a four-way product can only be added to the model if one of its three-way components is already in the model. The philosophy here is that a high-order interaction will likely only exist if some of its lower-order “footprints” exist as well. This need not be true, but is a reasonable working assumption and avoids the search over an exponentially growing space of alternatives.

There is one restriction put on the formation of model terms: each input can appear at most once in a product. This prevents the formation of higher-order powers of an input, which increase or decrease too sharply near the boundaries of the feature space. Such powers can be approximated in a more stable way with piecewise linear functions.
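The restriction that each input appears at most once in a product amounts to a filter on which variables may multiply a given parent term. The sketch below is a hypothetical helper, not part of any library; it also accepts an optional cap on the interaction order, which is an extra assumption of the example.

```python
def eligible_variables(parent_vars, n_features, max_degree=None):
    """Variables that may multiply a parent term.

    parent_vars: set of variable indices already used in the parent product
                 (empty set for the constant term).
    """
    if max_degree is not None and len(parent_vars) >= max_degree:
        return []                               # product is already at the degree cap
    return [j for j in range(n_features) if j not in parent_vars]


print(eligible_variables(set(), 4))                  # [0, 1, 2, 3]
print(eligible_variables({0, 2}, 4))                 # [1, 3]
print(eligible_variables({0, 2}, 4, max_degree=2))   # []
```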

### Variants and Generalizations

Stone et al. (1997) developed a hybrid of MARS called PolyMARS, specifically designed to handle classification problems. It uses the multiple logistic framework: class probabilities are given by a softmax, and the linear predictors inside the softmax are replaced by MARS-style basis expansions. It grows the model in a forward stagewise fashion like MARS, but at each stage uses a quadratic approximation to the multinomial log-likelihood to search for the next basis-function pair, which avoids a full maximum-likelihood refit for every candidate pair. Once found, the enlarged model is fit by maximum likelihood, and the process is repeated.

## Theoretical Properties

### Advantages 

- Being a simple additive model, MARS produces regression-type coefficients, so feature importance is straightforward to assess.
- The linear additive nature of the MARS model also lends it ease in model interpretation.
- By considering hinge functions and interaction terms, MARS can handle both linear and non-linear structures at the same time.
- The regression surface of MARS is built up parsimoniously, using nonzero components locally - only when they are needed. This is important, since one should 'spend' parameters carefully in high dimensions, as they can run out quickly.
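- MARS can handle “mixed” predictors (quantitative and qualitative) in a natural way, much like CART does. For a qualitative predictor, MARS considers all possible binary partitions of its categories into two groups; each partition generates a pair of piecewise constant basis functions, namely the indicator functions of the two sets of categories.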
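For a qualitative predictor, ESL describes the candidates as binary partitions of the categories into two groups, each giving a pair of indicator basis functions. A minimal sketch of that enumeration follows; the names `binary_partitions` and `indicator_pair` are hypothetical and the category labels are made up.

```python
from itertools import combinations


def binary_partitions(categories):
    """Yield each split of the categories into two non-empty groups exactly once."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    # Fixing the first category on the left side avoids counting mirrored splits twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(cats) - left
            yield left, right


def indicator_pair(value, left):
    """Piecewise constant basis pair: indicators of the left group and of the other group."""
    return (1.0, 0.0) if value in left else (0.0, 1.0)


splits = list(binary_partitions(["a", "b", "c"]))
print(len(splits))   # 3, i.e. 2**(3-1) - 1
```

The number of partitions grows as $2^{k-1}-1$ for $k$ categories, so this exhaustive enumeration is only practical for predictors with few levels.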

### Disadvantages

### Relation to Other Models

- MARS can be viewed as a generalization of stepwise linear regression or a modification of the CART method to improve the latter’s performance in the regression setting. MARS is better in handling additive structures than CART.
- The way that MARS adaptively chooses the new basis function can be compared to what is done in [boosting](boosting.ipynb) as well.

## Empirical Performance

### Advantages 

- MARS is said to perform well in high-dimensional problems.

### Disadvantages

## Implementation Details and Practical Tricks

**Setting upper limit in interaction terms**

A useful option in the MARS procedure is to set an upper limit on the order of interaction. For example, one can set a limit of two, allowing pairwise products of piecewise linear functions, but not three- or higher-way products. This can aid in the interpretation of the final model. An upper limit of one results in an additive model.

**`py-earth`**

`sklearn` itself does not include MARS, but the open-source library `py-earth` (imported as `pyearth`), hosted under scikit-learn-contrib, provides an implementation whose interface is very similar to that of `sklearn` estimators.

```python
import numpy
from pyearth import Earth
from matplotlib import pyplot

# Create some fake data: 1000 samples, 10 features, response depends on feature 6 only
numpy.random.seed(0)
m = 1000
n = 10
X = 80*numpy.random.uniform(size=(m,n)) - 40
y = numpy.abs(X[:,6] - 4.0) + 1*numpy.random.normal(size=m)

# Fit an Earth model (all arguments shown here are left at their default values)
model = Earth(max_terms=None, max_degree=None, penalty=None, allow_linear=None, enable_pruning=True, feature_importance_type=None, verbose=0)
model.fit(X,y)

# Print the forward/pruning trace and the fitted model summary
print(model.trace())
print(model.summary())
```

**Some commonly used inputs**:

- **`max_terms`**: The maximum number of terms generated by the forward pass. All memory is allocated at the beginning of the forward pass, so setting max_terms to a very high number on a system with insufficient memory may cause a `MemoryError` at the start of the forward pass.

- **`max_degree`**: The maximum degree of terms generated by the forward pass.

- **`penalty`**: A smoothing parameter used to calculate GCV and GRSQ. Used during the pruning pass and to determine whether to add a hinge or linear basis function during the forward pass. Put simply, it is $c$ in $M(\lambda)$ above.

- **`allow_linear`**: If `True`, the forward pass will check the GCV of each new pair of terms and, if it’s not an improvement on a single term with no knot (called a linear term, although it may actually be a product of a linear term with some other parent term), then only that single, knotless term will be used. If `False`, that behavior is disabled and all terms will have knots except those with variables specified by the linvars argument (see the fit method).

- **`enable_pruning`** : bool, optional(default=True) If False, the pruning pass will be skipped.

- **`feature_importance_type`**: string or list of strings, optional (default=`None`). Specifies which feature importance criteria to compute. Currently three criteria are supported: `'gcv'`, `'rss'` and `'nb_subsets'`. By default (when it is `None`), no feature importance is computed. Feature importance is a measure of the effect of the features on the outputs. For each feature, the values range from 0 to 1 and sum to 1 across features. A high value means the feature has, on average (over the population), a large effect on the outputs.

- **`verbose`** : int, optional(default=0) If `verbose >= 1`, print out progress information during fitting. If `verbose >= 2`, also print out information on numerical difficulties if encountered during fitting. If `verbose >= 3`, print even more information that is probably only useful to the developers of `py-earth`.



## Use Cases

MARS in its original form is mainly used in the regression setting. For a two-class problem, however, the classes can be coded as $0$ and $1$ and the problem treated as a regression, so MARS can be used in the classification setting as well.
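The 0/1 coding idea can be sketched without a full MARS fit by regressing the coded labels on a single reflected pair with a fixed knot. Everything here is an illustrative assumption: the helper names, the synthetic data, the fixed knot at 0.5, and the rule of predicting class 1 when the fitted value is at least 0.5.

```python
import numpy as np


def hinge_design(x, t):
    """Design matrix [1, (x - t)_+, (t - x)_+] for a fixed knot t."""
    return np.column_stack([np.ones_like(x), np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)])


def fit_hinge_regression(x, y, t):
    """Least-squares fit of y on the hinge design matrix."""
    beta, *_ = np.linalg.lstsq(hinge_design(x, t), y, rcond=None)
    return beta


def predict_class(x, beta, t, threshold=0.5):
    """Predict class 1 where the fitted regression value reaches the threshold."""
    return (hinge_design(x, t) @ beta >= threshold).astype(int)


rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
labels = (x > 0.5).astype(int)          # two classes coded as 0 and 1
beta = fit_hinge_regression(x, labels.astype(float), t=0.5)
accuracy = float(np.mean(predict_class(x, beta, t=0.5) == labels))
print(round(accuracy, 3))
```

Because a continuous piecewise linear fit cannot reproduce a step exactly, a few points near the class boundary are typically misclassified even on this noiseless toy data.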

## Results Interpretation, Metrics and Visualization

## References 

- ESL, Section 9.4
- [py-earth Documentation](https://contrib.scikit-learn.org/py-earth/content.html#a-simple-earth-example)

### Further Reading

## Misc.