# Fitting

There are several ways to fit a model (a simple line, a polynomial of degree n, or a machine learning model) to your data, e.g. a [least-squares fit](https://en.wikipedia.org/wiki/Least_squares) or maximizing the [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function). In general, we call this step **fitting** or **regression**, and it is always a kind of minimization task. Python offers several modules for this purpose (see also [SciPy optimize](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.least_squares.html) or [NumPy polyfit](https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html)).
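
As a rough illustration of the two modules linked above, the following sketch fits the same noisy line once with `numpy.polyfit` and once with `scipy.optimize.least_squares`. The data, the random seed and the helper `residuals` are assumptions made up for this example; both approaches should arrive at practically the same parameters.

```python
import numpy as np
from scipy.optimize import least_squares

# Made-up example data: a line y = 1 + 3x with Gaussian noise
rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 50)
y = 1 + 3 * x + rng.normal(0, 1, size=x.size)

# NumPy: polyfit returns the coefficients with the highest power first
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# SciPy: least_squares minimizes the sum of squares of the returned residuals
def residuals(params, x, y):
    intercept, slope = params
    return intercept + slope * x - y

result = least_squares(residuals, x0=[0.0, 0.0], args=(x, y))
intercept_sp, slope_sp = result.x

print(f'polyfit:       intercept={intercept_np:.3f}, slope={slope_np:.3f}')
print(f'least_squares: intercept={intercept_sp:.3f}, slope={slope_sp:.3f}')
```

`polyfit` is convenient for polynomials, while `least_squares` accepts any residual function and is therefore the more general tool.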

A key step is to choose the right set of parameters: for a polynomial this is its degree (the number of free coefficients), for a machine learning task it is the set of model parameters.

In [None]:
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt

import scipy.optimize as opt

plt.rcParams['figure.figsize'] = (16.0, 8.0)
plt.style.use('ggplot')

## Methods

### Example: Least-squares fit

A [least-squares fit](https://en.wikipedia.org/wiki/Least_squares) is a standard approach to obtain a reasonable fit, especially in technical use cases where the functional relation between the input and output parameters is known. The simplest case is fitting a line to a linearly dependent sample.

The idea is simple: Find the set of parameters of a given function so the sum of the squared **residuals** is minimized.
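
To make the idea concrete, here is a minimal sketch that computes the sum of squared residuals by hand. The helper `sum_squared_residuals`, the example data and the seed are assumptions for illustration only. The best-fit parameters are obtained with `numpy.linalg.lstsq`, and any shift away from them increases the sum.

```python
import numpy as np

def linear(x, intercept, slope):
    return intercept + slope * x

def sum_squared_residuals(params, x, y):
    # A residual is the difference between a data point and the model
    residuals = y - linear(x, *params)
    return np.sum(residuals**2)

# Made-up example data around y = 1 + 3x
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1 + 3 * x + rng.normal(0, 1, size=x.size)

# Solve the linear least-squares problem directly
design = np.column_stack([np.ones_like(x), x])
best, *_ = np.linalg.lstsq(design, y, rcond=None)

ssr_best = sum_squared_residuals(best, x, y)
ssr_shifted = sum_squared_residuals(best + np.array([0.1, -0.1]), x, y)
print(f'best fit: {ssr_best:.2f}, shifted parameters: {ssr_shifted:.2f}')
```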

We use some generated data points. They follow the linear relationship $y = 1 + 3x$, to which we add random Gaussian noise.

In [None]:
# Model or function we want to fit
def linear(x, intercept, slope):
    return intercept + slope * x

In [None]:
n_sample = 50
err = 1

# Generate example data
x = np.linspace(-2, 2, n_sample)
y = 1 + 3 * x
yerr = np.random.normal(0, err, size=n_sample)
y = y + yerr

# Linear fit
# `sigma` carries the uncertainty of each data point;
# `absolute_sigma=True` tells curve_fit to treat it as an absolute error
params, cov = opt.curve_fit(linear, x, y, sigma=np.full(n_sample, err), absolute_sigma=True)

# Show results
print(f'intercept={params[0]:.2f}, slope={params[1]:.2f}')
plt.errorbar(x, y, err, marker='+', linewidth=0, elinewidth=0, color='dodgerblue', label='Data points');
plt.plot(x, linear(x, *params), label='Linear fit')
plt.legend();

#### Task

Try out different values for `err`. What happens? What happens when you do not provide the uncertainty of each data point to the fit (remove `sigma=...` and `absolute_sigma=True`)? Compare `cov` as well as `params`.
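
When working on the task it helps to turn the covariance matrix into parameter uncertainties. The following is a minimal sketch under the assumption that all data points share the same uncertainty `err`; the name `perr` and the seed are chosen for this example only.

```python
import numpy as np
import scipy.optimize as opt

def linear(x, intercept, slope):
    return intercept + slope * x

n_sample = 50
err = 1.0

# Made-up example data around y = 1 + 3x
rng = np.random.default_rng(3)
x = np.linspace(-2, 2, n_sample)
y = 1 + 3 * x + rng.normal(0, err, size=n_sample)

params, cov = opt.curve_fit(
    linear, x, y, sigma=np.full(n_sample, err), absolute_sigma=True
)

# The diagonal of the covariance matrix holds the parameter variances
perr = np.sqrt(np.diag(cov))
print(f'intercept = {params[0]:.2f} +/- {perr[0]:.2f}')
print(f'slope     = {params[1]:.2f} +/- {perr[1]:.2f}')
```

Because the `x` values are symmetric around zero, the uncertainty of the intercept should come out close to `err / np.sqrt(n_sample)`.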

## Overfitting & Underfitting (bias–variance dilemma)

In principle, regardless of the fitting method, the choice of the model has the largest impact. There are two extremes when choosing a model or function:

- **Underfitting** (bias)
- **Overfitting** (variance)

### Underfitting

Underfitting occurs when the model has too few parameters to describe the behavior of the data points, i.e. it does not have enough degrees of freedom to represent the data. In the following example the data come from a quadratic function, while we use a linear function as the fit model.

In [None]:
n_sample = 50
err = 1

# Generate example data (quadratic)
x = np.linspace(-2, 2, n_sample)
y = 2 * x**2
yerr = np.random.normal(0, err, size=n_sample)
y = y + yerr

# Linear fit
params, cov = opt.curve_fit(linear, x, y, sigma=np.full(n_sample, err), absolute_sigma=True)
print(f'intercept={params[0]:.2f}, slope={params[1]:.2f}')

# Show results
plt.errorbar(x, y, err, marker='+', linewidth=0, elinewidth=0, color='dodgerblue', label='Data points');
plt.plot(x, linear(x, *params), label='Linear fit')
plt.legend();

### Correct fitting

Switching to a quadratic model, which matches the way the data were generated, gives a valid fit.

In [None]:
def quadratic(x, a, b):
    return a + b * x**2

In [None]:
# Quadratic fit to the data
params, cov = opt.curve_fit(quadratic, x, y, sigma=np.full(n_sample, err), absolute_sigma=True)

# Show results
print(f'a={params[0]:.2f}, b={params[1]:.2f}')
plt.errorbar(x, y, err, marker='+', linewidth=0, elinewidth=0, color='dodgerblue', label='Data points');
plt.plot(x, quadratic(x, *params), label='Quadratic fit')
plt.legend();
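
One possible way to put a number on the difference between the underfitting linear model and the quadratic model is the reduced chi-square: the sum of squared residuals in units of the uncertainty, divided by the number of degrees of freedom. Values near 1 indicate a model that is compatible with the data, values far above 1 hint at underfitting. The sketch below is an illustration with made-up data; it uses `numpy.polyfit`, so the degree-2 model has three coefficients (including a linear term), unlike the two-parameter `quadratic` function above, and the helper `reduced_chi2` is hypothetical.

```python
import numpy as np

n_sample = 50
err = 1.0

# Made-up quadratic example data
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, n_sample)
y = 2 * x**2 + rng.normal(0, err, size=n_sample)

def reduced_chi2(degree):
    # Fit a polynomial of the given degree and compare it to the data
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    dof = n_sample - (degree + 1)  # data points minus fitted parameters
    return np.sum((residuals / err) ** 2) / dof

chi2_linear = reduced_chi2(1)
chi2_quadratic = reduced_chi2(2)
print(f'linear: {chi2_linear:.2f}, quadratic: {chi2_quadratic:.2f}')
```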

### Overfitting

The opposite problem is called overfitting: the model describes the data points "too well", i.e. it describes the random noise instead of simplifying and capturing the general relationship in the data. Let's use fewer data points that follow a linear relationship and fit them with both a linear function and a polynomial of degree six.

In [None]:
def poly6(x, a, b, c, d, e, f, g):
    return a + b * x + c * x**2 + d * x**3 + e * x**4 + f * x**5 + g * x**6

In [None]:
# Settings
n_sample = 7
err = 3

# Generate example data (linear)
x = np.linspace(-2, 2, n_sample)
y = 1 + 3 * x  # Linear function!
yerr = np.random.normal(0, err, size=n_sample)
y = y + yerr

In [None]:
# Plot data
plt.errorbar(x, y, err, marker='+', linewidth=0, elinewidth=0, color='dodgerblue', label='Data points');

# Linear fit
if False:
    params1, cov1 = opt.curve_fit(linear, x, y, sigma=np.full(n_sample, err), absolute_sigma=True)
    plt.plot(x, linear(x, *params1), label='Linear fit')

# Polynomial fit
if True:
    params2, cov2 = opt.curve_fit(poly6, x, y, sigma=np.full(n_sample, err), absolute_sigma=True)
    x_res = np.linspace(-2.2, 2.15, n_sample * 100)
    plt.plot(x_res, poly6(x_res, *params2), label='Polynomial fit')

# Add new data points
if False:
    x_new = np.array([-2.5, -1, -0.5, 0, 1])
    y_new = 1 + 3 * x_new
    yerr_new = np.random.normal(0, err, size=5)
    y_new = y_new + yerr_new
    plt.scatter(x_new, y_new, marker='*', s=150, color='orange', label='New data');

plt.legend();

The polynomial passes through each data point exactly (seven parameters for seven points) but does not capture the underlying linear relationship. At the edges of the x range it even diverges. New data points generated from the same linear relationship are not represented well by the polynomial.
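
A common way to detect overfitting is to compare the error on the points used for the fit with the error on new points. The following sketch does this for polynomials of degree one and six and averages over many random data sets. It is an illustration under assumptions: `numpy.polyfit` replaces `curve_fit`, and the number of trials, the seed and the helper names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sample, err, n_trials = 7, 3.0, 200

x = np.linspace(-2, 2, n_sample)              # points used for the fit
x_new = np.array([-2.5, -1.0, -0.5, 0.0, 1.0])  # new, unseen points

def truth(x):
    return 1 + 3 * x  # Linear function!

def mse(coeffs, x, y):
    # Mean squared difference between data and polynomial
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

errors = {1: {'train': [], 'test': []}, 6: {'train': [], 'test': []}}
for _ in range(n_trials):
    y = truth(x) + rng.normal(0, err, size=x.size)
    y_new = truth(x_new) + rng.normal(0, err, size=x_new.size)
    for degree in errors:
        coeffs = np.polyfit(x, y, deg=degree)
        errors[degree]['train'].append(mse(coeffs, x, y))
        errors[degree]['test'].append(mse(coeffs, x_new, y_new))

mean_errors = {
    degree: {key: float(np.mean(values)) for key, values in result.items()}
    for degree, result in errors.items()
}
for degree, result in mean_errors.items():
    print(f"degree {degree}: train={result['train']:.3g}, test={result['test']:.3g}")
```

The degree-six polynomial reaches a training error of practically zero, yet its error on the new points is much larger than that of the line, mainly because of the point outside the fitted range.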

---

_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_