# Prelab 06: A review of residuals and fitting

In [None]:
%reset -f
%matplotlib widget
import data_entry2
import numpy as np
import matplotlib.pyplot as plt
import fit_plot

This prelab focuses on revisiting and reinforcing concepts from Lab 05. We start with a review of residual plots.

## Part 1: Review of residuals

Answer the questions below on your own before reading the answers.

**Your turn #1a:** Given experimental data and a model for that data, what is a residual?

**Your turn #1b:** What are the properties of a residual plot that inform us that the model is a good fit to the data?

##### **Answer for #1a:**

A residual is defined as the difference between experimental data and a model, i.e., $r_i = y_i - f(x_i)$.
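As a minimal sketch of this definition, the snippet below computes the residuals for a small made-up dataset and an assumed linear model $f(x) = mx + b$. The numbers and the names `x`, `y`, `m`, and `b` are illustrative assumptions and are not taken from the prelab data.

```python
import numpy as np

# Made-up measurements (illustrative only, not the prelab data)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# Assumed model parameters for f(x) = m*x + b
m = 2.0
b = 1.0

# Residual for each data point: r_i = y_i - f(x_i)
residuals = y - (m * x + b)
print(residuals)
```

A positive residual means the data point lies above the model, and a negative residual means it lies below.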

##### **Answer for #1b:**

A model that is a good fit to the data will have a residual plot with:
- No obvious trend
- A roughly equal scatter of points above and below the horizontal line $r = 0$
- And, if the uncertainties are well characterized, we will also see
    + Roughly 68% of error bars crossing the $r = 0$ line
    + Nearly all (~95%) of doubled error bars crossing the $r = 0$ line
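The last two criteria can also be checked numerically. The sketch below uses made-up residuals and uncertainties (the arrays `residuals` and `dy` are assumptions for illustration) and counts how many error bars, and how many doubled error bars, reach the $r = 0$ line.

```python
import numpy as np

# Made-up residuals and uncertainties (illustrative only, not the prelab data)
residuals = np.array([0.05, -0.12, 0.08, -0.03, 0.21,
                      -0.07, 0.02, -0.15, 0.09, -0.01])
dy = np.full(residuals.shape, 0.1)

# An error bar crosses r = 0 when |r_i| <= dy_i
crosses_single = np.abs(residuals) <= dy
# A doubled error bar crosses r = 0 when |r_i| <= 2 * dy_i
crosses_double = np.abs(residuals) <= 2 * dy

# The mean of a boolean array is the fraction of True entries
frac_single = np.mean(crosses_single)
frac_double = np.mean(crosses_double)
print(f"{frac_single:.0%} of error bars cross r = 0")
print(f"{frac_double:.0%} of doubled error bars cross r = 0")
```

With only a handful of points, these fractions will fluctuate around 68% and 95%, so treat them as a rough guide rather than a strict test.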

## Part 2: Using residuals to diagnose fits

Now that we have reminded ourselves of what to look for in a residual plot, let's use it to diagnose some fits. Below, we have three data sets to which we have tried to fit a straight-line model with an intercept, $y = mx + b$. For each, we provide an image of an initial fit with some issues and ask you to diagnose what needs to be changed to improve the quality of the fit.

In [None]:
# Run me to load our three data sets
# Make sure to hit the "Generate Vectors" button!

de1 = data_entry2.sheet("lab06_prelab_data.csv")

### Your turn #2 (Dataset 1)
Consider the data and fit shown in the image below. First, answer the questions 2a and 2b and then, for question 2c, update the fitting parameters to get a better fit. 

![dataset 1](https://i.ibb.co/r7ShPLk/data1.png)

#### **Your turn #2a:** 
Which feature(s) of the **residual** plot above indicates a problem with the fit of the model to the data?

##### **Answer #2a**

We see a trend in the residuals where all of the residuals are negative, lying below the residuals = 0 line.

#### **Your turn #2b:** 
What should be changed about the model to improve the fit?

##### **Answer #2b**

The $y$-intercept of the model is too large, so we should decrease it.
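To see why an intercept error shows up this way, here is a minimal sketch using made-up, noise-free data (not the actual dataset 1). If the model has the right slope but an intercept that is too large, every residual is shifted down by the same amount.

```python
import numpy as np

# Made-up, noise-free data that follow y = 1*x + 1 exactly (illustrative only)
x = np.linspace(0, 10, 6)
y = 1.0 * x + 1.0

# Model with the correct slope but an intercept that is too large by 0.1
m_model = 1.0
b_model = 1.1
residuals = y - (m_model * x + b_model)

# Every residual equals (data intercept - model intercept) = -0.1
print(residuals)
```

With real data the residuals would also scatter because of noise, but they would still sit mostly below the $r = 0$ line.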

#### **Your turn #2c:** 
Run the `fit_plot.line()` interactive fitting tool below and update the fitting parameters to `slope = 1` and `intercept = 1.1` using the text entry boxes in the widget. This will reproduce the image shown above. Then, use the interactive fitting widget to update the fit to show a good fit in the scatter plot and residuals. **Write your best $y$-intercept in the cell below.**

In [None]:
# Use the text boxes to manually update the values to
# slope = 1, intercept = 1.1

fit_plot.line("dataset 1", xVec, y1Vec, dy1Vec)

##### **Answer #2c:**

Changing the intercept from 1.1 to 1 in the code improves the model in the desired fashion.

### Your turn #3 (Dataset 2)

Answer the questions below related to the data and fit shown below and then for part c, update the fitting parameters to get a better fit. 

![dataset 2](https://i.ibb.co/5sLwRZh/data2.png)

#### **Your turn #3a:** 
Which feature(s) of the residual plot above indicates a problem with the fit of the model to the data?

##### **Answer #3a**

We see an (upwards) linear trend in the residuals plot.

#### **Your turn #3b:** 
What should be changed about the model to improve the fit?

##### **Answer #3b**

The slope of the model is too small, so we should increase it.
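The link between a slope error and a linear trend can be sketched with made-up, noise-free data (not the actual dataset 2). When the model slope is too small, the residuals grow in proportion to $x$. Here `np.polyfit` is used only as a convenient way to measure that trend; it is not part of the prelab's fitting tools.

```python
import numpy as np

# Made-up, noise-free data that follow y = 2*x + 3 exactly (illustrative only)
x = np.linspace(0, 10, 6)
y = 2.0 * x + 3.0

# Model with the correct intercept but a slope that is too small by 0.02
m_model = 1.98
b_model = 3.0
residuals = y - (m_model * x + b_model)

# The residuals are (2 - 1.98) * x: a straight line through the origin.
# Fit a line to the residuals to read off the slope of that trend.
trend_slope, trend_intercept = np.polyfit(x, residuals, deg=1)
print(residuals)
print(trend_slope, trend_intercept)
```

The slope of the trend in the residuals is the amount by which the model slope needs to increase.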

#### **Your turn #3c:** 
Run the `fit_plot.line()` interactive fitting tool below and update the fitting parameters to `slope = 1.98` and `intercept = 3` using the text entry boxes in the widget. This will reproduce the image shown above. Use the interactive fitting widget to update the fit to show a good fit in the scatter plot and residuals. **Write your best slope in the cell below.**

In [None]:
# Use the text boxes to manually update the values to
# slope = 1.98, intercept = 3

fit_plot.line("dataset 2", xVec, y2Vec, dy2Vec)

##### **Answer #3c**

Changing the slope from 1.98 to 2 in the code improves the model in the desired fashion.

### Your turn #4 (Dataset 3)

Answer the following questions related to the data and fit shown in the image below.

**Note:** For this dataset, the issue with the model is a bit harder to fix compared to the other two examples. Think carefully about whether our current linear model is the right one for these data.

![dataset 3](https://i.ibb.co/r0n2b5T/data3.png)

#### **Your turn #4a:** 
Which feature(s) of the **residual** plot above indicates a problem with the fit of the model to the data?

##### **Answer #4a**

We notice from the residuals plot that there is a concave-up trend (that is potentially parabolic or some other non-linear function).

#### **Your turn #4b:** 
What should be changed about the model to improve the fit?

##### **Answer #4b**

This non-linear trend tells us that the current linear model is not a good fit for the data. This means our current model does not fully describe the physics of the phenomenon we are observing, and we likely need to change it to a new one.
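As a rough illustration, the sketch below builds made-up, noise-free data with a small quadratic term (not the actual dataset 3) and fits the best possible straight line to it. `np.polyfit` is an assumption of convenience here, standing in for the interactive widget. Even the best line leaves a concave-up pattern in the residuals.

```python
import numpy as np

# Made-up, noise-free data with a quadratic term (illustrative only)
x = np.linspace(0, 10, 11)
y = 0.05 * x**2 + 2.0 * x + 1.0

# Best straight line in the least-squares sense
m, b = np.polyfit(x, y, deg=1)
residuals = y - (m * x + b)

# For evenly spaced x, positive second differences mean the residuals
# curve upwards (concave up) everywhere
second_diff = np.diff(residuals, n=2)
print(np.round(residuals, 3))
print(np.round(second_diff, 3))
```

The residuals are positive at both ends and negative in the middle, which no choice of slope and intercept can remove.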

#### **Your turn #4c (optional):** 
Play around with the interactive fitting tool below to convince yourself that no matter which combination of slope and intercept you try, a linear model will not result in a good fit.

In [None]:
# Use the text boxes to manually update the values to
# slope = 1, intercept = 3

fit_plot.line("dataset 3", xVec, y3Vec, dy3Vec)

## Part 3: Beyond linear models

Frequently, data that we expected to be linear will be more complicated than that. Often this happens when the assumption of linearity in our model has overlooked other important effects. As such, it is good to know how to modify our modelling and plotting code to include non-linear functions. 

One example is the **quadratic model**, which is represented by a quadratic function, usually written in standard form as:

$$y = f(x) = ax^2 + bx + c.$$

Note that $b$ here is **not** the same parameter as in the linear model: in the quadratic model, $b$ multiplies $x$ and $c$ plays the role of the intercept.

**Warning:** This is just one example of a non-linear model, and there are many more functions that could fit non-linear data, such as higher-order polynomials, exponentials, sinusoids, etc. **In general, you should never fit your data with an arbitrary non-linear function just because it makes your residuals look better.** You need to have some reason, grounded in a physics model, to use a particular function in your fits. For example, consider the fact that for $N$ data points with distinct $x$ values you can always find a polynomial (of degree at most $N - 1$) that goes through every single point of the dataset (this is called "polynomial interpolation"). You would then have all residuals exactly equal to zero, but would gain no understanding of the physics that actually describes your data.
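The polynomial interpolation point can be demonstrated with a short sketch. The five data points below are made up and follow no physical law, yet a degree-4 polynomial found with `np.polyfit` passes through all of them.

```python
import numpy as np

# Made-up data with no underlying physical model (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# A polynomial of degree (number of points - 1) can pass through every point
degree = len(x) - 1
coeffs = np.polyfit(x, y, deg=degree)

# Residuals at the data points are zero to within rounding error
residuals = y - np.polyval(coeffs, x)
print(np.round(residuals, 6))

# Between the data points, the curve is not constrained by any physics
print(np.polyval(coeffs, 3.5))
```

The residuals look "perfect", but the five fitted coefficients carry no physical meaning, which is why small residuals alone do not justify a model.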


## Part 4: Practice with plots

Recall the process of fitting in the lab:

1. Use the `fit_plot.line()` function as an interactive fitting tool to find your best fit parameters.
2. Make a final properly labelled plot of your data and best fit along with the residuals.

Let's get some practice with that second step, since plotting with Python is a new skill for many of you. In this part of the prelab, we adapt the plotting code from Part 5 in Prelab 05 and use the third dataset from this prelab (that you just considered in Your Turn #4) to get some practice with plotting. The code is a bit nicer than what you had previously, since we will introduce the Matplotlib function `plt.subplot()`, which enables you to divide one figure into several subplots (one for the data, and another for the residuals).

#### Plotting your best fit from a linear model

Read the comments carefully and then run the cell below.  **You might want to save this code and use it for future labs when creating your final plots after you have found your best fit from a linear model.**

In [None]:
########################################################################################
# PLOTTING A LINEAR MODEL
########################################################################################

########################################################################################
# Model - Step 1: find the range of x values from the experimental data.
########################################################################################

x_data = xVec
y_data = y3Vec
dx_data = dxVec
dy_data = dy3Vec
x_min = np.min(x_data)  # find the smallest x value
x_max = np.max(x_data)  # find the largest x value

########################################################################################
# Model - Step 2: generate an array of model x values between x_min and x_max
# for which we want to plot the model y values.
########################################################################################

x_model = np.linspace(start=x_min, stop=x_max, num=100)  # return 100 evenly spaced values

########################################################################################
# Model - Step 3: calculate the model y values at each of the model x values.
# Choose best-fitting values for the linear model.
########################################################################################

m = 1.07
b = 2.94
y_model = m * x_model + b

########################################################################################
# Model - Step 4: plot the model on the graph of the experimental data in the first subplot
########################################################################################

# Create a new figure for our two plots
plt.figure()

# Divide that figure into a 2x1 grid, and choose the first position to plot in first
plt.subplot(2, 1, 1)

# Plot the experimental data
data_label = "dataset 3"
graph_title = "Scatter plot of the experimental data with a linear model"
x_label = "$x$ (s)"
y_label = "$y$ (m)"
plt.errorbar(x=x_data, y=y_data, yerr=dy_data, fmt='bo', markersize=3, label=data_label)  # plot experimental data
plt.title(graph_title)
plt.xlabel(x_label)
plt.ylabel(y_label)

#Plot the model on top of the data
model_label = "model ($y = mx + b$)"
plt.plot(x_model, y_model, "r-", label=model_label)  # plot model data
plt.legend(loc='upper left')   # add a legend (you can change the location as needed)

########################################################################################
# Residuals – Step 1: calculate the model predictions y_prediction for each of
# the measured x_data values.
########################################################################################

y_prediction = m * x_data + b

########################################################################################
# Residuals – step 2: calculate the residuals.
########################################################################################

residuals = y_data - y_prediction

########################################################################################
# Residuals – step 3: plot the residuals against the measured x_data values 
# in the second subplot
########################################################################################

# From 2x1 grid in figure, now choose the second position to plot residuals in
plt.subplot(2, 1, 2)

# Plot the residuals
residual_graph_title = "Residual plot"
residual_y_label = "residual = data - model (m)"
plt.errorbar(x=x_data, y=residuals, yerr=dy_data, fmt='bo', markersize=3, label=data_label)
plt.title(residual_graph_title)
plt.xlabel(x_label)  # reuse the x-label from the scatter plot
plt.ylabel(residual_y_label)

########################################################################################
# Residuals – step 4: add a horizontal line at r=0 to the plot.
########################################################################################

#plot a horizontal line from x_min to x_max where r=0
plt.hlines(y=0, xmin=x_min, xmax=x_max, color='k', label="$r = 0$")

# Add a legend (you can change the location as needed)
plt.legend(loc='lower left', fontsize='small')

# adjust the padding between and around subplots so there's no overlapping
plt.tight_layout() 
plt.show()

💡 Notice that we used a new Matplotlib function: `plt.subplot(n, m, index)`. This function divides the figure into a grid with `n` rows and `m` columns in which you can create subplots. The argument `index` specifies which of the subplots you want to plot in, counting from 1. For example, `plt.subplot(1, 2, 2)` means that the next plot(s) will be displayed in the second column of the 1x2 grid in our figure.
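To make the `index` numbering concrete: Matplotlib counts subplot positions starting from 1 in the upper-left corner, moving left to right along each row before going down to the next row. The small helper below, `subplot_position`, is a hypothetical function written only for this illustration (it is not part of Matplotlib); it converts an index into a row and column.

```python
def subplot_position(nrows, ncols, index):
    """Return the 1-based (row, column) for a subplot index.

    Hypothetical helper for illustration only; not a Matplotlib function.
    """
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    row = (index - 1) // ncols + 1
    col = (index - 1) % ncols + 1
    return row, col


# 2x1 grid (two rows, one column): index 2 is the bottom plot
print(subplot_position(2, 1, 2))
# 1x2 grid (one row, two columns): index 2 is the right-hand plot
print(subplot_position(1, 2, 2))
```

In the plotting cell above, `plt.subplot(2, 1, 1)` therefore selects the top plot (the data) and `plt.subplot(2, 1, 2)` selects the bottom plot (the residuals).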

💡 Simple LaTeX equations are supported by Matplotlib: the equation needs to be placed inside a pair of dollar signs `$`. Note that LaTeX symbols containing a backslash need to be written with two backslashes. For example: `$\alpha$` (LaTeX in a Markdown cell) --> `"$\\alpha$"` (Matplotlib).
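The reason for the doubled backslash is that Python treats a backslash in an ordinary string as the start of an escape sequence. The sketch below compares three ways of writing the same label; the variable names are illustrative. A raw string (prefix `r`) is a common alternative to doubling the backslash.

```python
# In an ordinary string, "\a" is an escape sequence (the bell character),
# so the backslash and the "a" collapse into one invisible character
plain = "$\alpha$"

# Doubling the backslash keeps a literal backslash in the string
escaped = "$\\alpha$"

# A raw string leaves backslashes untouched
raw = r"$\alpha$"

print(len(plain), len(escaped), len(raw))
print(escaped == raw)
```

Either `escaped` or `raw` can be passed to functions such as `plt.xlabel()`, while `plain` would not display as intended.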


#### Plotting your best fit from a non-linear model

Notice that, in the previous plot, we reproduced the clear concave-up trend in the residuals that we considered in Your Turn #4. This highlights the need for a non-linear model! Let's assume that you have good reason from your physics class to believe that the data are described by a quadratic model (as described above). Through trial and error, you estimate that the best fit parameters are: $a = 0.01$, $b = 1$ and $c = 3$.

**Your turn #5:**

Copy the code in the previous cell below and update it so that it instead plots a quadratic model with the best fit parameters listed above. Try this on your own first before looking at the answer. Make sure to update labels and titles.

**Note that there are lots of functions that can account for a "curvy" shape in your residuals!** Here, we are using a quadratic function just for the sake of an example and to illustrate how you could modify your code to account for a non-linear function. Do not make the assumption that any vaguely curvy shape in your data is necessarily quadratic! Remember that your model is not just some arbitrary function; it must be grounded in the physics of the phenomenon you are observing.

In [None]:
########################################################################################
# PLOTTING A QUADRATIC MODEL
########################################################################################


# insert your code here

#### Answer:

In [None]:
########################################################################################
# PLOTTING A QUADRATIC MODEL
########################################################################################

########################################################################################
# Model - Step 1: find the range of x values from the experimental data.
########################################################################################

x_data = xVec
y_data = y3Vec
dx_data = dxVec
dy_data = dy3Vec
x_min = np.min(x_data)  # find the smallest x value
x_max = np.max(x_data)  # find the largest x value

########################################################################################
# Model - Step 2: generate an array of model x values between x_min and x_max
# for which we want to plot the model y values.
########################################################################################

x_model = np.linspace(start=x_min, stop=x_max, num=100)  # return 100 evenly spaced values

########################################################################################
# Model - Step 3: calculate the model y values at each of the model x values.
# Choose best-fitting values for the quadratic model.
########################################################################################

a = 0.01
b = 1
c = 3
y_model = a * x_model**2 + b * x_model + c

########################################################################################
# Model - Step 4: plot the model on the graph of the experimental data in the first subplot
########################################################################################

# Create a new figure for our two plots
plt.figure()

# Divide that figure into a 2x1 grid, and choose the first position to plot in first
plt.subplot(2, 1, 1)

# Plot the experimental data
data_label = "dataset 3"
graph_title = "Scatter plot of the experimental data with a quadratic model"
x_label = "$x$ (s)"
y_label = "$y$ (m)"
plt.errorbar(x=x_data, y=y_data, yerr=dy_data, fmt='bo', markersize=3, label=data_label)  # plot experimental data
plt.title(graph_title)
plt.xlabel(x_label)
plt.ylabel(y_label)

#Plot the model on top of the data
model_label = "model ($y = ax^2 + bx + c$)"
plt.plot(x_model, y_model, "r-", label=model_label)  # plot model data
plt.legend(loc='upper left')   # add a legend (you can change the location as needed)

########################################################################################
# Residuals – Step 1: calculate the model predictions y_prediction for each of
# the measured x_data values.
########################################################################################

y_prediction = a * x_data**2 + b * x_data + c

########################################################################################
# Residuals – step 2: calculate the residuals.
########################################################################################

residuals = y_data - y_prediction

########################################################################################
# Residuals – step 3: plot the residuals against the measured x_data values 
# in the second subplot
########################################################################################

# From 2x1 grid in figure, now choose the second position to plot residuals in
plt.subplot(2, 1, 2)

# Plot the residuals
residual_graph_title = "Residual plot"
residual_y_label = "residual = data - model (m)"
plt.errorbar(x=x_data, y=residuals, yerr=dy_data, fmt='bo', markersize=3, label=data_label)
plt.title(residual_graph_title)
plt.xlabel(x_label)  # reuse the x-label from the scatter plot
plt.ylabel(residual_y_label)

########################################################################################
# Residuals – step 4: add a horizontal line at r=0 to the plot.
########################################################################################

#plot a horizontal line from x_min to x_max where r=0
plt.hlines(y=0, xmin=x_min, xmax=x_max, color='k', label="$r = 0$")

# Add a legend (you can change the location as needed)
plt.legend(loc='lower left', fontsize='small')

# adjust the padding between and around subplots so there's no overlapping
plt.tight_layout() 
plt.show()

## Your turn #6: Preparing your Lab 06 notebook
In this final set of tasks, you will prepare your Lab 06 notebook for data collection and analysis.

1. Open the Lab 06 Instructions on Canvas and take a few minutes to read through them so that you have a sense of how you will be spending your time during the lab.
2. In Part B, you will notice that we will be reanalyzing the data from Lab 05. Add some code to your Part B notebook to read your Lab 05 data into the Lab 06 notebook and launch the `fit_plot.line()` interactive fitting widget. Update the fit parameters to match your best fit from Lab 05. We will be learning a new tool to help us improve our fits even further, and we will want to use your previous best fit as a starting point.
3. Also in Part B (Step 3), you will use your Matplotlib graphing skills to create nice, well-labeled plots, as you did in Part 4 of this prelab. Copy and paste the code you need to create these plots.

You should now be ready for this lab.

# Submit

Steps for submission:

1. Click: Run => Run_All_Cells
2. Read through the notebook to ensure all the cells executed correctly and without error.
3. File => Save_and_Export_Notebook_As->HTML
4. Inspect your downloaded html document
5. Upload the HTML document to the lab submission assignment on Canvas.

In [None]:
# The following function will display tables based on the data currently
# stored in your data_entry2 spreadsheets. Please do not modify this cell.
display_sheets()