## Week 5 - Practice Quiz: Linear regression

<br/>**1.**
In the previous video you saw how to fit a line $y = mx+c$ to linear data. In this quiz you will practice identifying data that is appropriate for linear regression, and then make some fits yourself.

Which of the following figures looks like it contains sensible data for a linear fit?

**![picture alt](https://i.ibb.co/8zpq9Dh/34-H0-TABm-Eeioigp-Ac-EC6-QA-4a1fa4cd20b65752f54fd71d7301792c-Lin-Or-Not-1aa.png)**

<br/>**2.**
Which of the following figures looks like it contains sensible data for a linear fit?

**![picture alt](https://i.ibb.co/VqgR4X6/34-H0-TABm-Eeioigp-Ac-EC6-QA-4a1fa4cd20b65752f54fd71d7301792c-Lin-Or-Not-1aa.png)**

<br/>**3.**
Now that we've identified candidates for linear regression, we can do some linear fitting ourselves. The code block below plots some predefined data points and a linear regression with the values $[m,c]$, where $m$ is the gradient and $c$ is the $y$-intercept. It also gives the $\chi^2$ value discussed in the previous video, which measures how good the fit is: the smaller $\chi^2$, the closer the line lies to the data.

Play with the values of $m$ and $c$ to get a sense for how different linear fits affect $\chi^2$, then try to find the best possible fit to the data.

In [None]:
# See what m and c do to the fit
m = 1.2
c = 0.1
p = [m,c]
line(p)

The minimum $\chi^2$ value is $0.03819$ to 4 significant figures. Try to find a fit with $\chi^2 \leq 0.04$ and then input these values into the following code block:

In [2]:
# Replace m and c with values that minimise χ^2.
p = [-0.27, 0.79]
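
To make the quantity being minimised concrete, here is a minimal sketch of how $\chi^2$ could be computed for a candidate $[m, c]$, assuming the unweighted sum of squared residuals used in this course. The helper name `chi_squared` and the data arrays are hypothetical and chosen only for illustration; they are not the quiz data, and the quiz's own `line()` function may compute its value differently.

```python
import numpy as np

def chi_squared(p, xdat, ydat):
    """Sum of squared residuals for the line y = m*x + c, with p = [m, c]."""
    m, c = p
    residuals = ydat - (m * xdat + c)
    return np.sum(residuals**2)

# Hypothetical data for illustration only (not the quiz data).
# These points lie exactly on the line y = 2x + 1.
xdat = np.array([0.0, 1.0, 2.0, 3.0])
ydat = np.array([1.0, 3.0, 5.0, 7.0])

print(chi_squared([2.0, 1.0], xdat, ydat))  # the exact line
print(chi_squared([1.5, 1.0], xdat, ydat))  # a shallower line, so a larger value
```

Changing either entry of `p` away from the best values increases the result, which is the behaviour you are exploring by hand in this question.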

<br/>**4.**
Fitting by eye is not that easy even for small sets of data. Let's make some linear fits using the maths discussed in the previous video.

The following is a figure with 5 data points, labeled with their $(x,y)$ coordinates:

![picture alt](https://i.ibb.co/9NN0ThV/34-H0-TABm-Eeioigp-Ac-EC6-QA-4a1fa4cd20b65752f54fd71d7301792c-Lin-Or-Not-1aa.png)

Let's fit a linear regression to this small sample by hand. Recall that we can use $\chi^2$ to measure how good our fit is, defined by

$\chi^2=\sum_i (y_i - mx_i - c)^2,$

and that we can find the minimum of $\chi^2$ by differentiating it with respect to $m$ and $c$ and setting both derivatives to zero. This leads us to the equations for $m$ and $c$,

$m=\frac{\sum_i (x_i - \bar{x})y_i}{\sum_i (x_i - \bar{x})^2},\quad c=\bar{y} - m\bar{x},$

which minimize $\chi^2$.
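
For reference, the step from $\chi^2$ to these two formulas can be sketched as follows, assuming every sum runs over the $n$ data points and that $\bar{x}$ and $\bar{y}$ are the sample means:

$$
\begin{aligned}
\frac{\partial \chi^2}{\partial c} &= -2\sum_i (y_i - m x_i - c) = 0
  \quad\Rightarrow\quad c = \bar{y} - m\bar{x},\\
\frac{\partial \chi^2}{\partial m} &= -2\sum_i x_i\,(y_i - m x_i - c) = 0
  \quad\Rightarrow\quad \sum_i x_i\left[(y_i - \bar{y}) - m\,(x_i - \bar{x})\right] = 0,\\
m &= \frac{\sum_i x_i\,(y_i - \bar{y})}{\sum_i x_i\,(x_i - \bar{x})}
   = \frac{\sum_i (x_i - \bar{x})\,y_i}{\sum_i (x_i - \bar{x})^2}.
\end{aligned}
$$

The second line substitutes $c = \bar{y} - m\bar{x}$ from the first. The final equality holds because both numerators equal $\sum_i x_i y_i - n\bar{x}\bar{y}$ and both denominators equal $\sum_i x_i^2 - n\bar{x}^2$.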

Use these equations to calculate the $m$ and $c$ which minimize $\chi^2$ for the 5 data points given above and select the correct values below:

**$m=2, c=-0.7$**

<br/>**5.**
As you have seen it can be quite a lot of effort to fit even 5 data points when doing the maths by hand. Often it's necessary to work with much larger data sets, so let's consider a new example with 50 data points. Instead of doing it by hand we'll implement a function to do the maths for us.

Run the following code block first to see the data without any kind of linear fit. The function _linfit_ is defined inside the code block. Your task is to edit the definition so that _linfit_ takes the array of x data, _xdat_, and the array of y data, _ydat_, and returns the correct $m$ and $c$ to create a linear fit which minimizes $\chi^2$.

The calculations for $\bar{x}$, _xbar_, and $\bar{y}$, _ybar_, are already given. Note that _numpy_ has already been imported as _np_.

In [None]:
# Here the function is defined
def linfit(xdat,ydat):
  # Here xbar and ybar are calculated
  xbar = np.sum(xdat)/len(xdat)
  ybar = np.sum(ydat)/len(ydat)

  # Insert calculation of m and c here. If nothing is here the data will be plotted with no linear fit

  # Return your values as [m, c]
  return [m, c]

# Produce the plot - don't put this in the next code block
line()

Use the above code block to test your code. When you are confident that you have correctly defined the function, put it into the next code block and run it, being careful not to include _line()_ in your answer.

In [None]:
# Here the function is defined
def linfit(xdat,ydat):
  # Here xbar and ybar are calculated
  xbar = np.sum(xdat)/len(xdat)
  ybar = np.sum(ydat)/len(ydat)

  # Insert calculation of m and c below
  m = np.sum((xdat - xbar) * ydat) / np.sum((xdat - xbar)**2)
  c = ybar - m * xbar

  # Return your values as [m, c]
  return [m, c]
  
# Don't include line() in this answer box
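
One way to check a function like this outside the quiz environment is to run it on data with a known slope and intercept and compare it with an independent least-squares routine. The sketch below repeats the _linfit_ definition so that it is self-contained. The synthetic data (`x_demo`, `y_demo`), the true values 0.5 and 2.0, and the noise level are all assumptions made for illustration and are not the quiz's 50 data points.

```python
import numpy as np

def linfit(xdat, ydat):
    # Means of the x and y data
    xbar = np.sum(xdat) / len(xdat)
    ybar = np.sum(ydat) / len(ydat)
    # Closed-form least-squares gradient and intercept
    m = np.sum((xdat - xbar) * ydat) / np.sum((xdat - xbar)**2)
    c = ybar - m * xbar
    return [m, c]

# Synthetic data: a line with gradient 0.5 and intercept 2.0 plus Gaussian noise
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 0.5 * x_demo + 2.0 + rng.normal(scale=0.3, size=x_demo.size)

m_fit, c_fit = linfit(x_demo, y_demo)

# np.polyfit with degree 1 returns [gradient, intercept]
m_ref, c_ref = np.polyfit(x_demo, y_demo, 1)

print(m_fit, c_fit)
print(np.isclose(m_fit, m_ref), np.isclose(c_fit, c_ref))
```

Both routines minimise the same sum of squared residuals, so they should agree to within floating-point precision, while the fitted values differ slightly from 0.5 and 2.0 because of the noise.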

<br/>**6.**
While it is informative to write the code ourselves, as in the previous question, in practice functions which do various types of regression are implemented in lots of programming languages. There are several of these in Python.

One such example is the _scipy.stats.linregress()_ function, which takes arrays of x data and y data in exactly the same way as the _linfit()_ function you defined in the previous question. As an output it gives the slope $m$ and intercept $c$ as well as a few useful statistical measures, such as the correlation coefficient, the p-value and the standard error of the slope.

In the following code block, the x data is again stored in the _xdat_ array, and the y data in the _ydat_ array. Call the function _stats.linregress()_ with the data arguments, and then pass the output to _line()_ to plot the regression.

In [None]:
from scipy import stats

# Use the stats.linregress() function to evaluate regression
regression = 

line(regression)

Hopefully it is clear that _linregress()_ does everything _linfit()_ did and more, without having to write it yourself!

Once you're happy that you've implemented things correctly in the above code block, repeat the same in the following code block without the last line to complete the question.

In [None]:
from scipy import stats

# Use the stats.linregress() function to evaluate regression
regression = stats.linregress(xdat, ydat)

# Don't use line(regression) in this code box
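
As a rough illustration of what the returned object contains, the sketch below calls _stats.linregress()_ on a small hypothetical data set and reads the fitted values by attribute name. The arrays `x_demo` and `y_demo` are made up for this example and are not the quiz data.

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only (not the quiz data)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

result = stats.linregress(x_demo, y_demo)

print(result.slope, result.intercept)  # gradient m and intercept c
print(result.rvalue**2)                # coefficient of determination
print(result.stderr)                   # standard error of the gradient
```

Reading the fields by name avoids relying on the order of the returned values.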