## Problem 3: Simple Linear Regression

In this question, you will implement simple linear regression from scratch using the Boston data set. More information about the data set is available here: https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

You will use the pandas library to load the CSV file into a DataFrame:



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# read the csv file and load into a pandas dataframe 
# make sure Boston.csv is in the same directory as this notebook
boston = pd.read_csv('Boston.csv')

In [None]:
# read the above link to learn more about what each column indicates
boston.head()

Simple linear regression models a linear relationship between an input variable $X$ and an output variable $Y$. We can define this linear relationship as follows:

$$Y = \beta_0 + \beta_1X$$

#### Objective: find the linear relationship between the proportion of non-retail business acres per town (indus) and the full-value property-tax rate per 10,000 dollars (tax)

So our equation will look like:

$$TAX = \beta_0 + \beta_1INDUS$$

Here, the coefficient $\beta_0$ is the intercept, and $\beta_1$ is the scale factor or slope. How do we determine the values of these coefficients? 

There are several methods for doing so, but we will focus on the Ordinary Least Squares (OLS) method. This method minimizes the sum of the squared differences between the observed values of the dependent variable and the values predicted by the linear function.

Recall that a residual is the difference between an observed value $y_i$ and the value $f(x_i)$ predicted by the regression line at the same $x_i$. When we fit a regression model, we want the sum of the squared residuals to be as small as possible, which indicates that the line is a close fit to the data.

$$\begin{aligned}
RSS &= \sum_{i=1}^{n} (y_i - f(x_i))^2 \\
&= \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2
\end{aligned}$$

This is the objective function we minimize to find $\beta_0$ and $\beta_1$. 
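To make the objective concrete, here is a small sketch on made-up numbers. The arrays `x_toy` and `y_toy` and the helper `rss_for_line` are illustrative names only and are not part of the assignment. The sketch evaluates the RSS for two candidate lines; the smaller value marks the better fit under this criterion.

```python
import numpy as np

# Toy data, invented for illustration only (not from Boston.csv)
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])  # lies exactly on y = 1 + 2x

def rss_for_line(b0, b1, x, y):
    """Sum of squared residuals for the candidate line y = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

rss_exact = rss_for_line(1.0, 2.0, x_toy, y_toy)  # line through every point
rss_off = rss_for_line(0.0, 2.0, x_toy, y_toy)    # same slope, intercept off by 1
print(rss_exact, rss_off)  # prints 0.0 4.0
```

The second line misses every point by exactly 1, so its RSS is $4 \cdot 1^2 = 4$, while the first line has no residuals at all.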

In [None]:
# set X to 'indus' and y to 'tax'
X = boston['indus']
y = boston['tax']

First, visualize the data by plotting X and y using matplotlib. Be sure to include a title and axis labels. 

In [None]:
# TODO: display plot 

# TODO: labels and title


TODO: What do you notice about the relationship between the variables? 

A:

Next, find the coefficients. The values for $\beta_0$ and $\beta_1$ are given by the following equations, where $n$ is the number of observations and $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$ and $y_i$. This derivation was done in class.


$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$


$$\beta_0 = \bar{y} - \beta_1\bar{x}$$
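As a reminder of where these formulas come from, the following is a brief sketch of the standard derivation; its presentation may differ from the one done in class. Set both partial derivatives of the RSS to zero:

$$\begin{aligned}
\frac{\partial RSS}{\partial \beta_0} &= -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1x_i) = 0 \\
\frac{\partial RSS}{\partial \beta_1} &= -2\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1x_i) = 0
\end{aligned}$$

Dividing the first equation by $-2n$ gives $\beta_0 = \bar{y} - \beta_1\bar{x}$. Substituting this into the second equation gives

$$\sum_{i=1}^{n} x_i\big((y_i - \bar{y}) - \beta_1(x_i - \bar{x})\big) = 0
\quad\Rightarrow\quad
\beta_1 = \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i(x_i - \bar{x})}$$

Because $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, replacing $x_i$ with $(x_i - \bar{x})$ in the leading factor of each sum changes neither the numerator nor the denominator, which yields the centered form of $\beta_1$ shown above.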

In [None]:
# TODO: implement function 
def get_coeffs(X, y):
    '''
    Params:
        X: the X vector
        y: the y vector
    Returns:
        a tuple (b0, b1)
    '''
    raise NotImplementedError

In [None]:
# run cell to call function and display the regression line
# the values are rounded for display convenience 
b0, b1 = get_coeffs(X, y)
print("Regression line: TAX = " + str(round(b0)) + " + " + str(round(b1)) +"*INDUS")

Plot the regression line overlaid on the observed y-values.

In [None]:
# TODO: plot y-values 


# TODO: plot regression line


# TODO: labels and title

The line appears to fit the data, but to evaluate the model quantitatively, first compute the RSS. The RSS measures the variation in the response that is not explained by the regression model. Recall that
$$RSS = \sum_{i =1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$$

In [None]:
# TODO: implement function
def get_RSS(b0, b1, X, y):
    '''
    Params: 
        b0: beta 0
        b1: beta 1
        X: X vector
        y: y vector
    Returns:
        residual sum of squares (RSS) 
    '''
    raise NotImplementedError

In [None]:
# run this cell to print RSS
print("RSS:", get_RSS(b0, b1, X, y))

We can also evaluate the model through the Root Mean Squared Error (RMSE) and the Coefficient of Determination ($R^2$ score). 
- The RMSE is closely related to the RSS, but it is expressed in the same units as the response, which makes it easier to interpret. In our case, that unit is the tax rate per 10,000 dollars.
- The $R^2$ value represents the proportion of the variance in the dependent variable that is explained by the independent variable.

Use the following equations to find the RMSE and $R^2$ score:

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2} $$

$$ R^2 = 1 - \frac{SS_r}{SS_t} $$ where

$$SS_t = \sum_{i = 1}^{n} (y_i - \bar{y})^2$$

and

$$SS_r = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
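The arithmetic behind these two metrics can be checked by hand on a tiny example. The sketch below uses invented observations `y_obs` and hypothetical predictions `y_hat` (neither comes from the Boston data, and `y_hat` is not the output of a fitted model); note that $SS_r$ is the same quantity as the RSS defined earlier.

```python
import numpy as np

# Toy values, invented for illustration only
y_obs = np.array([2.0, 4.0, 6.0, 8.0])  # observed y_i
y_hat = np.array([3.0, 3.0, 7.0, 7.0])  # hypothetical predictions

n = len(y_obs)
ss_r = np.sum((y_obs - y_hat) ** 2)         # residuals -1, 1, -1, 1: sum of squares is 4
ss_t = np.sum((y_obs - y_obs.mean()) ** 2)  # mean is 5, deviations -3, -1, 1, 3: sum of squares is 20

rmse = float(np.sqrt(ss_r / n))  # sqrt(4 / 4) = 1.0
r2 = float(1 - ss_r / ss_t)      # 1 - 4 / 20 = 0.8
print(rmse, r2)
```

Here the predictions are off by 1 on average (RMSE of 1.0, in the units of $y$), and they account for 80% of the variation around the mean.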



In [None]:
# TODO: implement function
def get_RMSE(b0, b1, X, y):
    '''
    Params: 
        b0: beta 0
        b1: beta 1
        X: X vector
        y: y vector
    Returns:
        rmse 
    '''
    raise NotImplementedError

In [None]:
# run cell to print RMSE
print("RMSE: ", get_RMSE(b0, b1, X, y))

In [None]:
# TODO: implement function
def get_R2(b0, b1, X, y):
    '''
    Params: 
        b0: beta 0
        b1: beta 1
        X: X vector
        y: y vector
    Returns:
        r2 score
    '''
    raise NotImplementedError

In [None]:
# run cell to print R2
print("R2: ", get_R2(b0, b1, X, y))

TODO: Analyze what the above $R^2$ score indicates about the model. 

A: 

Now, we will compare the above results with the results from scikit-learn, a machine learning library for Python. Read the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to learn how to use this library. Your function should return the RMSE and the $R^2$ score as a tuple, in that order.
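One detail from the documentation that is easy to miss: `LinearRegression.fit` expects the feature input as a 2-D array of shape `(n_samples, n_features)`, whereas a single column of values is 1-D. A minimal sketch of the reshaping step, using a made-up array `x_1d` rather than the Boston data:

```python
import numpy as np

# Made-up 1-D feature vector standing in for a single column
x_1d = np.array([2.0, 5.0, 8.0])

# -1 lets NumPy infer the number of rows; 1 is the single feature column
x_2d = x_1d.reshape(-1, 1)

print(x_1d.shape, x_2d.shape)  # prints (3,) (3, 1)
```

For a pandas Series, `X.to_numpy().reshape(-1, 1)` gives the same 2-D layout, as does selecting the column with double brackets (`boston[['indus']]`).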

In [None]:
# TODO: scikit learn function
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def linear_regression_SKL(X, y):
    '''
    Params:
        X: X vector
        y: y vector
    Returns:
        rmse and r2 as a tuple
    '''
    raise NotImplementedError

In [None]:
# run this cell to print results from SKL LR
linear_regression_SKL(X, y)

TODO: Analyze the results and compare the RMSE and $R^2$ to the values from your from-scratch implementation.

A: