# Session 1: Linear Regression (Questions 5 and 6)

**Credits:** adapted from a Jupyter Notebook by Prof. Norman.

This worksheet breaks down the steps necessary to complete the linear regression questions (**Q5 and Q6**) of the problem set.

For the moment, you can **ignore Q1-4 and Q7**, which address k-NN classification. We will address these questions on Tuesday and Thursday.

## Add the necessary import statements

Libraries you need include: 
* numpy
* sklearn: the datasets and linear_model modules
* matplotlib

In [None]:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt

## Load the breast cancer dataset 

Hint: this step was covered in the pre-activity.

In [None]:
# load the breast cancer dataset here
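
As a reminder of what the pre-activity covered, here is a minimal sketch of loading the dataset and inspecting the returned object. The variable name 'bunchobject' is only an illustrative choice (it matches the argument name used later in this worksheet).

```python
from sklearn import datasets

# load_breast_cancer() returns a Bunch: a dictionary-like object whose
# entries can also be accessed as attributes
bunchobject = datasets.load_breast_cancer()

print(bunchobject.data.shape)        # 2D array: (samples, features)
print(bunchobject.feature_names[0])  # description of feature 0
print(bunchobject.feature_names[3])  # description of feature 3
```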

## Write a general function to do a scatter plot 

Your function should accept two **arrays** of x and y values as input, plus optional strings for labelling the axes and the title. With these inputs, the function should display a properly labelled scatter plot.

(You may find your plotting code from Week 3 helpful here!)

We will test this function shortly using data from the breast cancer dataset.

In [None]:
def display_scatter(x, y, xlabel='x', ylabel='y', title_name='default'):
    pass

## Plot the data from two of the dataset's features

Using your display_scatter function, plot **feature 0** (on the x-axis) against **feature 3** (on the y-axis) from the breast cancer dataset. Remember to pass appropriate labels for the data. The pre-class activity will remind you how to extract descriptions of these features.

In [None]:
x_index = 0
y_index = 3

# extract 1D arrays from the breast cancer dataset

# extract the corresponding feature descriptions

# call your function to display the scatter plot

**Reflection:** Look at the scatter plot and ask yourself whether the features have a linear relationship. If so, we can go ahead and do a linear regression.

PS: we do not require a submission or check-off for your scatter plot. It's optional, but still worth it for the warm and fuzzy feeling I'm certain it will bring you.

## Reminder: 1D vs. 2D numpy arrays

### (1) the <font color = 'blue'>.shape </font> attribute gives you the dimensions of the array. The output tells you: 
<ul>
<li> 1D array: <font color = 'blue'> (number of elements, ) </font>
<li> 2D array: <font color = 'blue'> (number of rows, number of columns) </font>
</ul>

### (2) indexing a 2D array using an <font color = 'blue'>integer</font>  gives you a 1D array
Suppose <font color = 'blue'>a</font> is a numpy array.
<ul>
<li> Row 3 as a 1D array: <font color = 'blue'>a[3,:] </font>
<li> Column 3 as a 1D array: <font color = 'blue'>a[:,3] </font>
</ul>

### (3) indexing a 2D array using a <font color = 'blue'>list</font>  gives you a 2D array

Suppose <font color = 'blue'>a</font> is a numpy array.
<ul>
<li> Row 3 as a 2D array: <font color = 'blue'>a[[3],:] </font>
<li> Column 3 as a 2D array: <font color = 'blue'>a[:,[3]] </font>
</ul>

### A <font color = 'red'> 1D array </font> is like a list - there is no concept of rows or columns, just elements. 
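
The three reminders above can be checked on a small made-up array. This is only a sketch; the array 'a' and its 5 x 4 size are arbitrary choices for illustration.

```python
import numpy as np

a = np.arange(20).reshape(5, 4)   # 2D array with 5 rows and 4 columns

print(a.shape)            # (5, 4)

row_1d = a[3, :]          # integer index -> 1D array
col_1d = a[:, 3]
print(row_1d.shape)       # (4,)
print(col_1d.shape)       # (5,)

row_2d = a[[3], :]        # list index -> 2D array
col_2d = a[:, [3]]
print(row_2d.shape)       # (1, 4)
print(col_2d.shape)       # (5, 1)
```

The (5, 1) shape of 'col_2d' is a column vector: the same values as 'col_1d', but with an explicit rows-by-columns layout.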

## (Question 5) Import statements necessary for linear regression

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error, r2_score 

## Complete the linear regression function

Now, you should write a function to perform a linear regression. While we will be testing it using the breast cancer dataset, it should be written in a **general** way, so that it can take other datasets and other feature indices as inputs. The arguments of the function should be:

* a bunchobject, as obtained after loading a dataset
* integers representing the columns of the two features to be considered
* the proportion (size) of the data to be reserved for testing the regression
* a seed to fix the random split of the data into training and testing sets, so that the results can be repeated
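
A minimal sketch of how the last two arguments might be used, assuming 'size' and 'seed' are passed to the 'test_size' and 'random_state' parameters of 'train_test_split'. The toy arrays below are made up for illustration and are not taken from the breast cancer dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up column vectors standing in for two features
x = np.arange(10).reshape(-1, 1)
y = 2 * x + 1

size = 0.3   # proportion of the data reserved for testing
seed = 42    # any fixed integer makes the shuffle repeatable

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=size, random_state=seed)

print(x_train.shape, x_test.shape)   # (7, 1) (3, 1)

# the same seed gives exactly the same split again
x_train2, x_test2, _, _ = train_test_split(
    x, y, test_size=size, random_state=seed)
print(np.array_equal(x_test, x_test2))   # True
```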

In [None]:
def linear_regression(bunchobject, x_index, y_index, size, seed):
    
    # extract the relevant data from bunchobject as 2D arrays (i.e. column vectors)
    
    # split the data using the 'train_test_split' function
    
    # perform the linear regression on the training data
    
    # test the linear regression by predicting y values for the x values that were reserved for testing
    
    # create a 'results' dictionary containing the coefficients, intercept, MSE, and R2 score
    
    return x_train, y_train, x_test, y_pred, results
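
The fitting and scoring steps in the skeleton above could look like the following sketch. It runs on made-up data that lies exactly on the line y = 2x + 1, so the expected slope, intercept, and scores are known in advance. The dictionary keys are an assumption; use whatever keys the problem sheet specifies.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# made-up data on an exact straight line (not the breast cancer data)
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2 * x_train + 1
x_test = np.array([[4.0], [5.0]])
y_test = 2 * x_test + 1

regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)        # learn the slope and intercept
y_pred = regr.predict(x_test)     # predict y for the held-out x values

results = {
    'coefficients': regr.coef_,       # shape (1, 1) when y is a column vector
    'intercept': regr.intercept_,     # shape (1,) when y is a column vector
    'mean squared error': mean_squared_error(y_test, y_pred),
    'r2 score': r2_score(y_test, y_pred),
}
print(results)
```

On this noise-free data the slope comes out as 2, the intercept as 1, the MSE as (numerically) zero, and the R2 score as 1. Real data will give a non-zero MSE and an R2 score below 1.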
    

## Write a function to visualize your results

Optionally, create a function to plot both: (1) your x and y training data (in black); and (2) your x test data and y predicted data (in blue). Visually inspect the plot to assess how effective your linear regression was.

In [None]:
def plot_linear_regression(x1, y1, x2, y2, x_label='', y_label=''):
    pass

## Call your linear_regression and plot_linear_regression functions below

1. Extract the data 
2. Conduct the linear regression (call your 'linear_regression' function)
3. Visualize the results (call your 'plot_linear_regression' function)

Some suggested test code is available in the problem sheet PDF.

## (Question 6) Investigate whether polynomial regression does better

You will have seen that a linear regression model was not quite sufficient for the relationship between features 0 and 3 of the breast cancer dataset. In this part, we will use **polynomial regression** to learn a model of the form $y = a_0 + a_1x + a_2x^2 + a_3x^3 + \dots$. You will be extending your 'linear_regression' function into one that also accounts for these higher-order terms.

Use the space below to complete **Q6** of the problem sheet. You will need to import PolynomialFeatures and use it to generate $x^2$, $x^3$, $x^4$, ... features.
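
To see what PolynomialFeatures produces, here is a small sketch on a made-up column vector. The values 2 and 3 are chosen only so that the powers are easy to check by hand, and using 'degree' for the 'order' argument is an assumption about how you will wire up your function.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])          # made-up column vector

poly = PolynomialFeatures(degree=3)   # degree plays the role of 'order'
x_poly = poly.fit_transform(x)

# each row is [1, x, x^2, x^3]; the leading 1 is the bias column,
# which is included by default (include_bias=True)
print(x_poly)
```

Fitting an ordinary linear regression on these expanded columns then gives a model of the polynomial form shown above.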

In [None]:
from sklearn.preprocessing import PolynomialFeatures 

def multiple_linear_regression(bunchobject, x_index, y_index, order, size, seed):
    pass 