# Linear regression notation exercise

These exercises will test your understanding of vectors and vector notation, in
the context of linear regression. Please refer to the textbook page [Linear
Regression Notation](LINK_HERE) if you get stuck at any point. Learning the
notation can be challenging, so please do not worry if you need to check the
textbook page multiple times; this is completely normal!

*Note*: for the auto-marking on this page to work correctly, it is very important that you store your answers using the variable names that you are instructed to use.

In [None]:
# please run this cell
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from jupyprint import jupyprint, arraytex

# load some hints
hint_paths = ["hint_Q3.txt", "hint_Q7.txt"] 
hints_dict = {}
for hint_path in hint_paths:
    with open(hint_path, "r") as file:
        hints_dict[hint_path] = file.read()
        
# load some names
name_paths = ["names_1.txt", "names_2.txt"] 
names_dict = {}
for name_path in name_paths:
    with open(name_path, "r") as file:
        names_dict[name_path] = file.read()
names_1 = [h.strip().strip("'") for h in  names_dict['names_1.txt'].strip("[]").split(",")]
names_2 = [h.strip().strip("'") for h in  names_dict['names_2.txt'].strip("[]").split(",")]

# load some variables for testing
string_1 = '6%c&m!'.replace('%', 'n').replace('&', 'o').replace('!', 'e').replace('6', 'i')
string_2 = '%re&ti!6'.replace('%', 'p').replace('&', 's').replace('!', 'g').replace('6', 'e')
temp = pd.read_csv("Duncan_Occupational_Prestige.csv")
tester = temp[string_1]
tester_2 = temp[string_2]
tester_3 = tester[temp['type'] == '%1'.replace('1', 'c').replace('%', 'b')]
tester_4 = tester_2[temp['type'] == '%1'.replace('1', 'c').replace('%', 'b')]
import scipy.stats as sps
tni, epo = sps.linregress(tester, tester_2).intercept, sps.linregress(tester, tester_2).slope
tni_2, epo_2 = sps.linregress(tester_3, tester_4).intercept, sps.linregress(tester_3, tester_4).slope
del sps

# this imports the machinery for marking answers to questions
from client.api.notebook import Notebook
ok = Notebook('linear_regression_notation.ok')

## Question 1

Here is a vector that contains the test scores of 8 students:

$$
\vec{x} = \begin{bmatrix}
           14 \\
           13 \\
           17 \\
           11 \\
           14 \\
           18 \\
           17 \\
           15 \\
         \end{bmatrix}
$$

Please recreate this vector in the code cell below, as a numpy array called `x`:
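
As a reminder of the general pattern, here is a minimal sketch using a made-up vector (`toy_vector` is a hypothetical name, and its values are not the ones from this question). A column of numbers like the one above can be written as a one-dimensional numpy array by passing a list to `np.array`:

```python
import numpy as np

# a hypothetical three-element vector (not the exercise data)
toy_vector = np.array([3, 1, 4])

# show the result
toy_vector
```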

In [None]:
#- make your vector/array here
x = ...
# Show the result
x

In [None]:
# run this cell to check your answer
_ = ok.grade('q1')

## Question 2

Please set the variable `my_Q2_answer` in the cell below to the number
corresponding to the correct definition of "vector" (e.g. if you think the
answer is statement 1, set `my_Q2_answer` to equal 1):

`1` - a numpy array

`2` - a sequence of values

`3` - a Python list

`4` - a single number within a sequence of numbers

`5` - it is another term for "x-axis"

In [None]:
#- set `my_Q2_answer` to your answer
my_Q2_answer = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q2')

## Question 3

The Python variable `x` contains the test scores of 8 students.

The maximum score on the test was 20. We might wish to express each student's score as a percentage of the total score.

Using the variable `x` which you defined earlier (which contains the test
scores), please convert the values in the `x` vector to be percentages of the
maximum possible score, and store your answer in a variable called
`x_as_percentage`:

*Note*: you can uncomment the `jupyprint` line in the cell below to show an
additional hint about what you need to do here.
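
To illustrate the kind of elementwise arithmetic involved, here is a small sketch on made-up data (`toy_scores` and `toy_max` are hypothetical marks out of 50, not the data from this exercise). Dividing a numpy array by a single number divides every element, and multiplying by 100 then turns each proportion into a percentage:

```python
import numpy as np

# hypothetical marks out of 50 (not the exercise data)
toy_scores = np.array([25, 40, 50, 10])
toy_max = 50

# arithmetic with a single number is applied to every element of the array
toy_percentages = toy_scores / toy_max * 100

# show the result
toy_percentages
```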

In [None]:
# Uncomment the line below to show a hint for this question,
# if you get stuck on how to calculate the percentages.
# jupyprint(hints_dict["hint_Q3.txt"])

# put your answer below
x_as_percentage = ...
# show the values in your x_as_percentage variable
x_as_percentage

In [None]:
# run this cell to check your answer
_ = ok.grade('q3')

## Question 4

Let's assume $\vec{x}$ is our predictor vector in a linear regression model. We are assessing how well the scores on the first test (i.e. the scores stored in the variable `x`) predict the students' scores on a second test.

The students' scores on the second test are shown below, in the vector $\vec{y}$:

$$
\vec{y} = \begin{bmatrix}
           40 \\
           49 \\
           59 \\
           38 \\
           43 \\
           56 \\
           52 \\
           45 \\
         \end{bmatrix}
$$

Please recreate this vector as a numpy array called `y`:

In [None]:
#- your answer here
y = ...
# Show the result
y

In [None]:
# run this cell to check your answer
_ = ok.grade('q4')

## Question 5

We now have our `x` vector and our `y` vector. These vectors contain 8 students' scores on two different tests. The first test was a maths test and the second was a geography test.

We'd like to see how well the scores in the `x` vector predict the scores in the `y` vector. Does scoring highly on the maths test mean that a student is likely to also score highly on the geography test?

Let's do some graphical inspection of the data first. Use the `plt.scatter`
function to create a scatter plot with `x` (maths test score) on the x-axis and
`y` (geography test score) on the y-axis.

In [None]:
#- your scatterplot here
...

From looking at the graph, please set the variable `my_Q5_answer` in the cell
below to the number corresponding to the statement you think is most true (e.g. if
you think the answer is statement 1, set `my_Q5_answer` to equal 1):

`1` - there is no linear trend, the points do NOT seem to be approximately
summarized by a straight line

`2` - a scatterplot is not an appropriate type of graph for this type of data

`3` - the association between `x` and `y` looks like it would be better described by a curve, rather than a straight line

`4` - there is a linear trend, the points DO seem to be approximately
summarized by a straight line

`5` - it looks like the higher the `x` score, the lower the `y` score is likely to be

In [None]:
#- set `my_Q5_answer` to your answer
my_Q5_answer = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q5')

## Question 6

Using any Python method you like, perform a linear regression with `x` as your predictor variable and `y` as your outcome variable.

Store the intercept from your regression in a variable called `c`.

Store the slope from your regression in a variable called `b`.
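
One of several possible routes is sketched below on made-up data (`toy_x` and `toy_y` are hypothetical and lie exactly on a straight line, so the result is easy to check by eye). The sketch uses `np.polyfit` with degree 1, which returns the slope first and the intercept second; `scipy.stats.linregress` or any other method you prefer would do equally well.

```python
import numpy as np

# hypothetical data lying exactly on the line y = 2x + 3
toy_x = np.array([1, 2, 3, 4])
toy_y = np.array([5, 7, 9, 11])

# fit a degree-1 polynomial (a straight line) by least squares;
# the coefficients come back highest power first: slope, then intercept
toy_slope, toy_intercept = np.polyfit(toy_x, toy_y, 1)

print(f"Slope = {toy_slope:.2f}, intercept = {toy_intercept:.2f}")
```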

In [None]:
#- Do the linear regression to find slope b and intercept
#- c in whatever way you like.
...
...
#- store your intercept value here
c = ...
#- store your slope value here
b = ...
# show the values of the intercept and slope
jupyprint(f"Intercept = {c}")
jupyprint(f"Slope = {b}")

In [None]:
# run this cell to check your answer
_ = ok.grade('q6')

## Question 6b

Please set the variable `my_Q6B_answer` in the cell below to the number
corresponding to the statement you think is most true <b> for the regression model
we have just fit </b> (e.g. if you think the answer is statement 1, set
`my_Q6B_answer` to equal 1):

`1` - the slope tells us the expected change in `x` (maths test) score if we compared two observational units which have the same `y` (geography test) score

`2` - the slope is a constant value, added to all of the `x` (maths test) scores

`3` - the slope tells us the expected difference in `y` (geography test) scores if we compared two observational units which differ by 1 point in `x` (maths test) score. In this case, two such units are likely to differ by over 2 points in `y` score

`4` - the slope tells us the expected difference in `y` (geography test) scores if we compared two observational units which differ by 1 point in `x` (maths test) score. In this case, two such units are likely to differ by over 7 points in `y` score

`5` - the slope tells us that the association between `x` and `y` is random

In [None]:
#- set `my_Q6B_answer` to your answer
my_Q6B_answer = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q6b')

## Question 7

Now we have the slope ($b$) and the intercept ($c$) from our linear regression, please generate the vector of fitted values, using $\vec{x}$, $b$ and $c$.

Store your answer in a variable called `y_hat`:

*Note*: you can uncomment the second line of code in the cell below to get a hint to help you here.

In [None]:
# uncomment the line below to show a hint for this question
# jupyprint(hints_dict["hint_Q7.txt"])

# your answer here
y_hat = ...
# show the values in your y_hat vector
y_hat

In [None]:
# run this cell to check your answer
_ = ok.grade('q7')

## Question 8

Using the `y_hat` vector which you just calculated, please calculate the error vector in the cell below. Store your answer
as a variable called `errors`.
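
As a sketch of the idea on made-up numbers (`toy_y` and `toy_y_hat` are hypothetical, not values from this exercise), and assuming the convention from the textbook page that each error is the observed value minus its fitted value, numpy can do the subtraction element by element:

```python
import numpy as np

# hypothetical observed values and fitted values (not the exercise data)
toy_y = np.array([3.0, 5.0, 8.0])
toy_y_hat = np.array([2.5, 5.5, 8.0])

# one error per observation: observed minus fitted
toy_errors = toy_y - toy_y_hat

# show the result
toy_errors
```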

In [None]:
#- your answer here
errors = ...
# show the error vector
errors

In [None]:
# run this cell to check your answer
_ = ok.grade('q8')

## Question 9

Please set the variable `my_Q9_answer` in the cell below to the number
corresponding to the statement you think is most true (e.g. if you think the answer
is statement 1, set `my_Q9_answer` to equal 1):

`1` - the `errors` vector contains the fitted values, plus the errors

`2` - the `errors` vector contains the intercept, plus the `y` values

`3` - the `errors` vector contains more values than the `y` vector and the `x` vector

`4` - the `errors` vector contains fewer values than the `y` vector and the `x` vector

`5` - the `errors` vector contains the distance between each `y` value and its corresponding fitted (`y_hat`) value

In [None]:
#- set `my_Q9_answer` to your answer
my_Q9_answer = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q9')

## Question 10

Here is the general mathematical form of a linear regression model with 8 observations (like the model we have fitted during this exercise):

$$
\begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \\ y_{6} \\ y_{7} \\ y_{8} \end{bmatrix}
= b \cdot
\begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ x_{5} \\ x_{6} \\ x_{7} \\ x_{8} \end{bmatrix}
+ c +
\begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \varepsilon_{3} \\ \varepsilon_{4} \\ \varepsilon_{5} \\ \varepsilon_{6} \\ \varepsilon_{7} \\ \varepsilon_{8} \end{bmatrix}
$$

Using this formula, write a Python function in the cell below called `recreate_y_values()` which takes `x`, `b`, `c`, and `errors` as its input arguments, and returns the `y` values as its output.

<b> DO NOT USE THE VARIABLE `y` AT ANY POINT IN THE FUNCTION.</b> It is possible to calculate the original $\vec{y}$ values from `x`, `b`, `c`, and `errors`.

*Note*: this is tricky, but straightforward once you realise what you have to do. Please check the textbook page if you get stuck.

In [None]:
# create your function here
def recreate_y_values(x, b, c, errors):
    ...
    return ...
# test out your function here
recreate_y_values(x, b, c, errors)

In [None]:
# run this cell to check your answer
_ = ok.grade('q10')

## Question 11

We'll now return to the Duncan Occupational Prestige dataset which we saw on the textbook page.

Run the cell below to import the data. As a reminder, here is the description of the dataset/variables:

Duncan (1961) combined information from the 1950 U.S. Census with data collected by the
National Opinion Research Centre (NORC). The Census data contained information
about different occupations, such as the percentage of people working in that occupation
who earned over a certain amount per year. The NORC data was from a survey which asked
participants to rate how prestigious they considered each occupation.

Here are descriptions of the variables in the dataset, which covers 45 occupations (adapted from [here](https://rdrr.io/cran/carData/man/Duncan.html)):

`name` - the name of the occupation, from the 1950 US Census

`type` - type of occupation, with the following categories: ``prof``,
professional and managerial; ``wc``, white-collar; ``bc``, blue-collar (i.e. how the
occupation was classified in the 1950 US Census)

`income` - percentage of census respondents within the occupation who
earned 3,500 dollars or more per year (about 36,000 US dollars in 2017)

`education` - percentage of census respondents within the occupation who were high school
graduates 

`prestige` - percentage of respondents in the NORC survey who rated the occupation
as “good” or better in prestige

In [None]:
#- import the Duncan_Occupational_Prestige.csv file as a data frame.
duncan_df = ...
# Show the result
duncan_df

In the cell below, plot `prestige` as a function of `income`:

In [None]:
#- your scatterplot here
...
...

After inspecting your plot, please set the variable `my_Q11_answer` in the cell
below to the number corresponding to the statement you think is most true (e.g. if
you think the answer is statement 1, set `my_Q11_answer` to equal 1):

`1` - lower `income` scores appear to predict higher `prestige` scores

`2` - `income` and `prestige` look to be randomly associated

`3` - higher `income` scores appear to predict higher `prestige` scores

`4` - all of the above are true

`5` - none of the above are true

In [None]:
#- set `my_Q11_answer` to your answer
my_Q11_answer = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q11')

## Question 12

On the textbook page, we mentioned that there are a lot of different terms for the variables in regression models.

Let's assume you're working with a data scientist who has a habit of using the terminology from multiple disciplines.

They've asked you to fit a linear regression model to the Duncan data. Please run the cell below, to generate the instruction from your colleague regarding which variables you should include in your model:

*Note*: the terminology used is correct, but the specific terms are chosen at random (out of the available terms); this is to check that you are familiar with whichever terms you might hear!

In [None]:
# generate your colleague's question
jupyprint(f"Please fit a linear regression model, use `income` as your {np.random.choice(names_1)}, and `prestige` as your {np.random.choice(names_2)}.")

Please follow your colleague's instruction in the cell below, using the data in the `duncan_df` dataframe.

The form of the model should be as follows (as there are 45 observations in `duncan_df`):

$$
\small
\begin{bmatrix}
y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \\ y_{6} \\ y_{7} \\ y_{8} \\ y_{9} \\ y_{10} \\ y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{15} \\ y_{16} \\ y_{17} \\ y_{18} \\ y_{19} \\ y_{20} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \\ y_{25} \\ y_{26} \\ y_{27} \\ y_{28} \\ y_{29} \\ y_{30} \\ y_{31} \\ y_{32} \\ y_{33} \\ y_{34} \\ y_{35} \\ y_{36} \\ y_{37} \\ y_{38} \\ y_{39} \\ y_{40} \\ y_{41} \\ y_{42} \\ y_{43} \\ y_{44} \\ y_{45} \\
\end{bmatrix}
= b \cdot
\begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ x_{5} \\ x_{6} \\ x_{7} \\ x_{8} \\ x_{9} \\ x_{10} \\ x_{11} \\ x_{12} \\ x_{13} \\ x_{14} \\ x_{15} \\ x_{16} \\ x_{17} \\ x_{18} \\ x_{19} \\ x_{20} \\ x_{21} \\ x_{22} \\ x_{23} \\ x_{24} \\ x_{25} \\ x_{26} \\ x_{27} \\ x_{28} \\ x_{29} \\ x_{30} \\ x_{31} \\ x_{32} \\ x_{33} \\ x_{34} \\ x_{35} \\ x_{36} \\ x_{37} \\ x_{38} \\ x_{39} \\ x_{40} \\ x_{41} \\ x_{42} \\ x_{43} \\ x_{44} \\ x_{45} \\
\end{bmatrix}
+ c + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \varepsilon_{3} \\ \varepsilon_{4} \\ \varepsilon_{5} \\ \varepsilon_{6} \\ \varepsilon_{7} \\ \varepsilon_{8} \\ \varepsilon_{9} \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{14} \\ \varepsilon_{15} \\ \varepsilon_{16} \\ \varepsilon_{17} \\ \varepsilon_{18} \\ \varepsilon_{19} \\ \varepsilon_{20} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \\ \varepsilon_{24} \\ \varepsilon_{25} \\ \varepsilon_{26} \\ \varepsilon_{27} \\ \varepsilon_{28} \\ \varepsilon_{29} \\ \varepsilon_{30} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{33} \\ \varepsilon_{34} \\ \varepsilon_{35} \\ \varepsilon_{36} \\ \varepsilon_{37} \\ \varepsilon_{38} \\ \varepsilon_{39} \\ \varepsilon_{40} \\ \varepsilon_{41} \\ \varepsilon_{42} \\ \varepsilon_{43} \\ \varepsilon_{44} \\ \varepsilon_{45} \\ \end{bmatrix}
$$

After performing the regression, you should have variables stored with the following names: `y_for_Q12`, `x_for_Q12`, `b_for_Q12`, `c_for_Q12`, `errors_for_Q12`, corresponding to the equation in the cell above, and to your colleague's question (so $b$ should be stored as a variable called `b_for_Q12`, where $b$ and `b_for_Q12` are the slope from a linear regression using the variables that your colleague asked you to use, and so on):

In [None]:
#- define your y vector here
y_for_Q12 = ...
#- define your x vector here
x_for_Q12 = ...
#- perform your linear regression here
duncan_mod_1 = ...
#- store your "b" variable here
b_for_Q12 = ...
#- store your "c" variable here
c_for_Q12 = ...
#- store your "errors" variable here
errors_for_Q12 = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q12')

## Question 13

The cell below contains data for some hypothetical occupations which are not present in the Duncan data, and for which the `prestige` scores are missing:

In [None]:
# some other occupations
other_occupations = pd.DataFrame({'names': ["Procrastination Technician", "Snackologist", "Meme Historian",
                                            "Nap Explorer"],
                                  'income': [18, 25, 23, 17]})

# show the dataframe
other_occupations

Using the slope and intercept which you calculated in the last question, please calculate what we would predict the `prestige` scores of these occupations to be, based on their `income` scores. Save the predictions in a variable called `other_occupations_predicted_prestige`.

*Hint:* you can do this using the same method by which you would generate the *fitted values* for observations that are in the Duncan dataset.
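
To make the hint concrete, here is a minimal sketch with made-up numbers (`toy_b`, `toy_c` and `toy_new_x` are hypothetical, not the values from this exercise). A prediction for a new observation is the slope times its predictor value, plus the intercept, and numpy applies this to every element at once:

```python
import numpy as np

# hypothetical slope and intercept (not the values from this exercise)
toy_b = 2.0
toy_c = 1.0

# hypothetical predictor values for observations not in the original data
toy_new_x = np.array([0, 1, 10])

# prediction = slope * predictor + intercept, applied elementwise
toy_predictions = toy_b * toy_new_x + toy_c

# show the result
toy_predictions
```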

In [None]:
#- generate your predicted values here
other_occupations_predicted_prestige = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q13')

## Question 14

Your colleague would like to fit a linear regression model to just the blue collar jobs in the Duncan dataset (these are the jobs for which `'type' == 'bc'`), to see if the parameter estimates are similar to those from the regression on the full dataset, using the same predictor variable and outcome variable.

First, create a dataframe called `blue_collar` which contains just the blue collar jobs:
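
One common way to select rows is Boolean indexing, sketched here on a made-up data frame (`toy_df` and its columns are hypothetical, not the Duncan data). Comparing a column to a value gives a Series of `True`/`False` values, and indexing the data frame with that Series keeps only the `True` rows:

```python
import pandas as pd

# hypothetical data frame (not the Duncan data)
toy_df = pd.DataFrame({"animal": ["cat", "dog", "cat", "owl"],
                       "weight": [4.0, 20.0, 5.0, 1.5]})

# a Boolean Series: True for the rows we want to keep
is_cat = toy_df["animal"] == "cat"

# indexing with the Boolean Series keeps only the True rows
toy_cats = toy_df[is_cat]

# show the result
toy_cats
```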

In [None]:
#- get just the blue collar jobs in their own dataframe
blue_collar = ...
# Show the result
blue_collar

Run the cell below to show the linear regression model which you should fit. 

The equation shows the actual values within each vector - you should work out from the values which columns of the `blue_collar` dataframe you need to use as each vector in your model:

In [None]:
# run this cell to show the model you should fit
jupyprint(f"${arraytex(np.atleast_2d(tester_4.values).T)} = "
          f"b * {arraytex(np.atleast_2d(tester_3.values).T)} + "
          f"c + $" + "$ \\begin{bmatrix} \\varepsilon_{1} \\\\ \\varepsilon_{2} \\\\ \\varepsilon_{3} \\\\ \\varepsilon_{4} \\\\ \\varepsilon_{5} \\\\ \\varepsilon_{6} \\\\ \\varepsilon_{7} \\\\ \\varepsilon_{8} \\\\ \\varepsilon_{9} \\\\ \\varepsilon_{10} \\\\ \\varepsilon_{11} \\\\ \\varepsilon_{12} \\\\ \\varepsilon_{13} \\\\ \\varepsilon_{14} \\\\ \\varepsilon_{15} \\\\ \\varepsilon_{16} \\\\ \\varepsilon_{17} \\\\ \\varepsilon_{18} \\\\ \\varepsilon_{19} \\\\ \\varepsilon_{20} \\\\ \\varepsilon_{21} \\\\ \\end{bmatrix}$")

Please perform the linear regression that your colleague has requested. You should use variables called `y_for_Q14`, `x_for_Q14`, `b_for_Q14`, `c_for_Q14`, `errors_for_Q14`, corresponding to the equation in the cell above, and to your colleague's question (so $b$ should be stored as a variable called `b_for_Q14`, where $b$ and `b_for_Q14` are the slope from a linear regression using the variables that your colleague asked you to use, and so on):

In [None]:
#- define your y vector here
y_for_Q14 = ...
#- define your x vector here
x_for_Q14 = ...
#- perform your linear regression here
duncan_mod_2 = ...
#- store your "b" variable here
b_for_Q14 = ...
#- store your "c" variable here
c_for_Q14 = ...
#- store your "errors" variable here
errors_for_Q14 = ...

In [None]:
# run this cell to check your answer
_ = ok.grade('q14')

## Question 15

Your variable `b_for_Q12` contains the slope ($b$) from a linear regression predicting `prestige` from `income` using the full Duncan dataset.

Your variable `b_for_Q14` contains the slope ($b$) from a linear regression predicting `prestige` from `income` using only the data from the blue collar (`bc`) jobs within the Duncan dataset.

Run the cells below to view each slope.

In [None]:
# slope from a linear regression predicting `prestige` from `income` using
# the full Duncan dataset
b_for_Q12

In [None]:
# slope from a linear regression predicting `prestige` from `income` using only
# the data from the blue collar (`bc`) jobs
b_for_Q14

In the cell below, write a short paragraph stating what you think the meaning of the difference between these slopes is.

<i> Double click this text to write your answer here. </i>

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]