# First project: automated fitting (30 minutes)

In this project, you will write an algorithm in Python to optimize a linear fit.

Form groups of 2 or 3 and work together!

<br><br><br>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<br><br><br>

You'll start with the penguin body mass versus flipper length data:

In [None]:
penguins = pd.read_csv("data/penguins.csv")
penguins[["flipper_length_mm", "body_mass_g"]]

In [None]:
measurements = penguins[["flipper_length_mm", "body_mass_g"]].dropna().values

<br><br><br>

You'll use a linear model to predict `body_mass` as a function of `flipper_length`:

$$\mbox{\tt body\_mass} = a \times \mbox{\tt flipper\_length} + b$$

In [None]:
def body_mass(flipper_length, a, b):
    return a * flipper_length + b

<br><br><br>

And you'll use the squared difference between measured and predicted `body_mass`, summed over all measurements, as the **badness of fit** criterion:

In [None]:
def badness_of_fit(a, b, measurements):
    badness = 0

    for measured_length, measured_mass in measurements:
        # add the squared difference between predicted and measured mass
        badness += (body_mass(measured_length, a, b) - measured_mass)**2
    
    return badness
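
The loop above can also be written with NumPy array operations, which may be handy if you end up calling it many times. This is only a sketch: `badness_of_fit_vectorized` and `toy_measurements` are names invented for this example, and the three rows are made-up stand-ins, not real penguin data.

```python
import numpy as np

def body_mass(flipper_length, a, b):
    return a * flipper_length + b

def badness_of_fit_vectorized(a, b, measurements):
    # column 0 holds flipper lengths, column 1 holds body masses
    lengths = measurements[:, 0]
    masses = measurements[:, 1]
    # body_mass works on a whole array at once
    residuals = body_mass(lengths, a, b) - masses
    return np.sum(residuals**2)

# made-up rows of (flipper_length_mm, body_mass_g), for illustration only
toy_measurements = np.array([[180.0, 3500.0], [200.0, 4200.0], [220.0, 5400.0]])

toy_badness = badness_of_fit_vectorized(30, -3000, toy_measurements)
print(f"badness = {toy_badness:.2e}")
```

It computes the same sum of squared differences as the loop version, so either can be used inside `better_fit`.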

<br><br><br>

Your algorithm doesn't have to find the best fit all at once.

Given a bad model with a particular `a` and `b`, it just has to find a _better_ `a` and `b`, and then do that over and over until the model is good.

Your code replaces the `print("???")`.

Call the `badness_of_fit` function as many times as you need to, varying `a` and `b` however you want.

In [None]:
def better_fit(i, a, b, measurements):
    
    print("???")

    return a, b

<br><br><br>

**Suggestion:** Try tuning it by hand before writing an algorithm. The algorithm formalizes what you would do intuitively.
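
A minimal sketch of what tuning by hand might look like: hold `b` fixed, try a few values of `a`, and see which one gives the smallest badness. The candidate values and `toy_measurements` (three made-up rows, not real penguin data) are assumptions for illustration; in the notebook you would pass the real `measurements`.

```python
import numpy as np

def body_mass(flipper_length, a, b):
    return a * flipper_length + b

def badness_of_fit(a, b, measurements):
    badness = 0
    for measured_length, measured_mass in measurements:
        badness += (body_mass(measured_length, a, b) - measured_mass)**2
    return badness

# made-up rows of (flipper_length_mm, body_mass_g), for illustration only
toy_measurements = np.array([[180.0, 3500.0], [200.0, 4200.0], [220.0, 5400.0]])

b = -3000
trials = {}
for a in [30, 35, 40]:
    # record the badness for each hand-picked slope
    trials[a] = badness_of_fit(a, b, toy_measurements)
    print(f"a = {a}, b = {b}, badness = {trials[a]:.2e}")

# the slope with the smallest badness is the best of the ones tried
best_a = min(trials, key=trials.get)
print(f"best a tried: {best_a}")
```

Notice what you do next after seeing those numbers: that decision is the part your algorithm has to automate.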

<br><br><br>

**To test:** Run the cell below to initialize `a` and `b` to some bad values:

In [None]:
i = 0   # iteration number
a = 30
b = -3000

And run this cell over and over (control-enter) to see if your iterative algorithm is improving the fit.

In [None]:
a, b = better_fit(i, a, b, measurements)
i += 1

fig, ax = plt.subplots()

ax.scatter(measurements[:, 0], measurements[:, 1], marker=".")

x = np.linspace(165, 240, 10)
y = body_mass(x, a, b)
ax.plot(x, y, color="orange")

badness = badness_of_fit(a, b, measurements)

ax.legend([], [], title=f"i = {i}\na = {a:.2f}\nb = {b:.0f}\nbadness = {badness:.2e}", loc="upper left")

None  # keeps the notebook from printing the legend object returned by the line above

<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>

**How good is good enough?**

<img src="img/04-my-best-fit.svg" width="600"><img src="img/04-best-fit.svg" width="600">

On the left is the best I managed to do with a simple algorithm. It's good enough for this project.

On the right is a state-of-the-art best fit.
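
If you want a reference to compare your own result against, one standard option is an ordinary least-squares fit, which minimizes the same sum of squared differences. The sketch below uses `np.polyfit` for that; whether the right-hand plot was produced this way is an assumption, and `toy_measurements` is made-up data, not the penguin dataset.

```python
import numpy as np

# made-up rows of (flipper_length_mm, body_mass_g), for illustration only
toy_measurements = np.array([[180.0, 3500.0], [200.0, 4200.0], [220.0, 5400.0]])

lengths = toy_measurements[:, 0]
masses = toy_measurements[:, 1]

# degree-1 polynomial fit; coefficients come back highest power first,
# so the slope is first and the intercept second
a_ls, b_ls = np.polyfit(lengths, masses, 1)

print(f"a = {a_ls:.2f}, b = {b_ls:.0f}")
```

With the real `measurements` array in place of the toy data, the resulting `a` and `b` give you a target to aim for.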

<br><br><br>

**If it was too easy:**

Extra credit: What if you fit to a quadratic function instead of a linear function?

```python
def body_mass(flipper_length, a, b, c):
    return a * flipper_length + b + c * flipper_length**2
```

Note that you have to propagate this new argument, `c`, into all of the functions above. "Restart Kernel and Run All Cells" under the "Kernel" menu can be helpful.
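
As a sketch of that propagation, here is how `badness_of_fit` might look with the extra parameter. The position of `c` in the argument list is an assumption (any consistent order works), and `toy_measurements` is made-up data for illustration.

```python
import numpy as np

def body_mass(flipper_length, a, b, c):
    return a * flipper_length + b + c * flipper_length**2

def badness_of_fit(a, b, c, measurements):
    badness = 0
    for measured_length, measured_mass in measurements:
        # c is passed through to the model unchanged
        badness += (body_mass(measured_length, a, b, c) - measured_mass)**2
    return badness

# made-up rows of (flipper_length_mm, body_mass_g), for illustration only
toy_measurements = np.array([[180.0, 3500.0], [200.0, 4200.0], [220.0, 5400.0]])

# with c = 0 the quadratic model reduces to the linear one
linear_like = badness_of_fit(30, -3000, 0, toy_measurements)
print(f"badness with c = 0: {linear_like:.2e}")
```

`better_fit`, the initialization cell, and the plotting cell would need the same treatment. Starting from `c = 0` is a convenient choice because it reproduces the linear fit you already have.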

<br><br><br>

**If everyone is done early:**

Let me know! If we finish early, we'll have more time to do machine learning.