# Friedman Function

To train and benchmark each MLP, we use a Friedman function, specifically the one proposed in this [1979 paper on nonparametric regression](https://www.slac.stanford.edu/pubs/slacpubs/2250/slac-pub-2336.pdf).
The inputs $\mathbf{x} \in [0, 1]^6$ are drawn uniformly from the unit hypercube, and the response noise $\epsilon \sim \mathcal{N}(0, 1)$ follows a standard normal distribution.
Note that $x_6$ is a pure nuisance input: its coefficient is zero, so it has no effect on the response.
The response is defined as

$$
y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + 0 x_6 + \epsilon,
$$

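and is then scaled by a factor of $0.2$, which brings its variance to approximately one.
A unit-variance target helps with convergence and makes the mean squared error easier to interpret, since it can be read roughly as the fraction of response variance left unexplained.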
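
To see where the factor of $0.2$ comes from, the variance of each additive term can be estimated by Monte Carlo. The sketch below is illustrative only and not part of the original notebook: the helper name `friedman_term_variances`, the sample count, and the seed are assumptions. Because the inputs and the noise are independent, the per-term variances add up to the variance of the unscaled response.

```python
import numpy as np


# Hypothetical helper (not from the original notebook): estimate the variance
# contributed by each additive term of the unscaled response.
def friedman_term_variances(samples: int = 200_000, seed: int = 0) -> dict[str, float]:
    rng = np.random.default_rng(seed)
    x = rng.random((6, samples))  # uniform on [0, 1), one row per input
    terms = {
        "10 sin(pi x1 x2)": 10 * np.sin(np.pi * x[0] * x[1]),
        "20 (x3 - 0.5)^2": 20 * (x[2] - 0.5) ** 2,
        "10 x4": 10 * x[3],
        "5 x5": 5 * x[4],
        "0 x6": 0 * x[5],  # zero coefficient, so zero variance
        "noise": rng.standard_normal(samples),
    }
    return {name: float(values.var(ddof=1)) for name, values in terms.items()}


variances = friedman_term_variances()
total = sum(variances.values())  # independent terms: variances add
for name, value in variances.items():
    print(f"{name:>18}: {value:.3f}")
print(f"unscaled total: {total:.3f}")
print(f"scaled by 0.2^2: {0.04 * total:.3f}")
```

The linear terms have the closed-form variances $100/12$ and $25/12$, and the unscaled total should come out at roughly 25, so multiplying the response by $0.2$ (and hence its variance by $0.04$) lands close to one.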

In [1]:
import numpy as np
import pandas as pd

In [2]:
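def generate_friedman(samples: int, path: str) -> None:
    """Sample the scaled Friedman response and save it as a CSV file at `path`."""
    # One row per input feature, drawn uniformly from [0, 1).
    X = np.random.rand(6, samples)
    # x_6 (X[5]) is deliberately unused: its coefficient is zero.
    y = (
        10 * np.sin(np.pi * X[0] * X[1])
        + 20 * (X[2] - 0.5) ** 2
        + 10 * X[3]
        + 5 * X[4]
        + np.random.randn(samples)  # standard normal response noise
    ).reshape(-1, 1)
    y *= 0.2  # rescale to approximately unit variance

    print(f"{path.split('.')[0]} response variance: {y.var(ddof=1):.6f}")

    # Store one sample per row: columns x_1 ... x_6 followed by y.
    data = np.concatenate([X.T, y], axis=-1)
    columns = [f"x_{i + 1}" for i in range(6)] + ["y"]
    pd.DataFrame(data, columns=columns).to_csv(path, index=False)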

In [3]:
np.random.seed(42)
generate_friedman(samples=8_000, path="train.csv")
generate_friedman(samples=2_000, path="test.csv")

train response variance: 1.017834
test response variance: 0.925737
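
Since the response variance is close to one, two reference points are useful when reading a model's mean squared error on this data. The sketch below is a hedged illustration rather than part of the original notebook: it regenerates data in memory instead of reading the CSV files, and the names `signal`, `mean_mse`, and `oracle_mse`, as well as the seed, are assumptions. A predictor that always outputs the mean scores about the response variance, while an oracle that knows the noise-free function scores the variance of the scaled noise, $0.2^2 \cdot 1 = 0.04$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
n = 100_000
x = rng.random((6, n))  # uniform inputs, one row per feature

# Noise-free part of the scaled response (x[5] has a zero coefficient).
signal = 0.2 * (
    10 * np.sin(np.pi * x[0] * x[1])
    + 20 * (x[2] - 0.5) ** 2
    + 10 * x[3]
    + 5 * x[4]
)
y = signal + 0.2 * rng.standard_normal(n)  # scaled standard normal noise

mean_mse = float(np.mean((y - y.mean()) ** 2))  # constant mean predictor
oracle_mse = float(np.mean((y - signal) ** 2))  # irreducible noise floor

print(f"mean-predictor MSE: {mean_mse:.4f}")
print(f"oracle MSE:         {oracle_mse:.4f}")
print(f"best attainable R^2: {1 - oracle_mse / mean_mse:.4f}")
```

Under these assumptions, a trained MLP's test MSE should fall between the two values, and the closer it gets to 0.04, the less there is left to learn.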
