In [None]:
from IPython.core.display import HTML
with open ("../style.css", "r") as file:
    css = file.read()
HTML(css)

# Simple Linear Regression

In this notebook we again analyse the data from the file `cars.csv`.  However, this time we try to determine the independent variable that is best for explaining the fuel consumption.  To this end, we will encapsulate the relevant code into functions so that it is easier to reuse.

We need to read our data from a <tt>csv</tt> file.  The module `csv` offers a number of functions for reading and writing a <tt>csv</tt> file.

In [None]:
import csv

Let us read the data.

In [None]:
with open('cars.csv') as handle:
    reader = csv.DictReader(handle, delimiter=',')
    Data   = [] # list of rows; every row is a dictionary mapping column names to strings
    for row in reader:
        Data.append(row)

In [None]:
Data[:5]
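The behaviour of `csv.DictReader` can be tried without the actual file.  The following sketch reads a tiny in-memory table via `io.StringIO`; the column values are made up for illustration and are not taken from `cars.csv`.  It shows that every row is a dictionary whose values are strings, which is why the values have to be converted with `float` before they are used in computations.

```python
import csv
import io

# made-up two-row table, used only to illustrate the row format
sample = io.StringIO("mpg,hp\n20.0,100\n30.0,70\n")
rows   = list(csv.DictReader(sample, delimiter=','))

first_row = rows[0]                  # dictionary keyed by the column names
mpg_value = float(first_row['mpg'])  # explicit conversion from string to float
```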

In [None]:
import numpy as np

The function `simple_linear_regression` takes two arguments:
* `X` is a numpy vector containing $n$ observations of the independent variable.
* `Y` is a numpy vector containing the corresponding values of the dependent variable; it has the same length as `X`.

The function computes the linear model
$$ Y_i = \vartheta_0 + \vartheta_1 \cdot X_i $$
such that the *residual sum of squares* 
$$ \texttt{RSS} = \sum\limits_{i=0}^{n-1} (\vartheta_0 + \vartheta_1 \cdot X_i - Y_i)^2 $$
is minimized.

It returns the *coefficient of determination* 
$$ R^2 = 1 - \frac{\texttt{RSS}}{\texttt{TSS}}  $$
where $\texttt{TSS}$ is the *total sum of squares*, which is defined as follows:
$$ \texttt{TSS} = \sum\limits_{i=0}^{n-1} (\bar{Y} - Y_i)^2 $$
Here $\bar{Y}$ denotes the mean of the values $Y_i$.
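
The coefficients that minimize $\texttt{RSS}$ can be given in closed form.  As a brief sketch of the standard derivation, we set both partial derivatives of $\texttt{RSS}$ to zero:
$$ \frac{\partial\, \texttt{RSS}}{\partial \vartheta_0} = 2 \cdot \sum\limits_{i=0}^{n-1} (\vartheta_0 + \vartheta_1 \cdot X_i - Y_i) = 0 $$
$$ \frac{\partial\, \texttt{RSS}}{\partial \vartheta_1} = 2 \cdot \sum\limits_{i=0}^{n-1} (\vartheta_0 + \vartheta_1 \cdot X_i - Y_i) \cdot X_i = 0 $$
Writing $\bar{X}$ and $\bar{Y}$ for the means of `X` and `Y`, the first equation yields
$$ \vartheta_0 = \bar{Y} - \vartheta_1 \cdot \bar{X}. $$
Substituting this into the second equation and solving for $\vartheta_1$ gives
$$ \vartheta_1 = \frac{\sum\limits_{i=0}^{n-1} (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{\sum\limits_{i=0}^{n-1} (X_i - \bar{X})^2}. $$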

In [None]:
def simple_linear_regression(X, Y):
    xMean = np.mean(X)
    yMean = np.mean(Y)
    # slope and intercept of the least-squares line
    ϑ1    = np.sum((X - xMean) * (Y - yMean)) / np.sum((X - xMean) ** 2)
    ϑ0    = yMean - ϑ1 * xMean
    TSS   = np.sum((Y - yMean) ** 2)        # total sum of squares
    RSS   = np.sum((ϑ1 * X + ϑ0 - Y) ** 2)  # residual sum of squares
    R2    = 1 - RSS/TSS
    return R2
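
As a sanity check, the coefficient of determination can also be computed in two other ways.  The sketch below fits the line with `np.polyfit` and, alternatively, uses the fact that for simple linear regression $R^2$ equals the squared Pearson correlation, which is available through `np.corrcoef`.  The helper names `r2_via_polyfit` and `r2_via_correlation` as well as the data points are invented for this illustration; they are not part of the notebook or of `cars.csv`.

```python
import numpy as np

def r2_via_polyfit(X, Y):
    # np.polyfit returns the coefficients with the highest degree first,
    # so for degree 1 the result is [slope, intercept]
    slope, intercept = np.polyfit(X, Y, 1)
    RSS = np.sum((slope * X + intercept - Y) ** 2)
    TSS = np.sum((Y - np.mean(Y)) ** 2)
    return 1 - RSS / TSS

def r2_via_correlation(X, Y):
    # entry [0, 1] of the correlation matrix is the Pearson correlation of X and Y
    return np.corrcoef(X, Y)[0, 1] ** 2

# synthetic data, made up for illustration only
X_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r2_polyfit     = r2_via_polyfit(X_demo, Y_demo)
r2_correlation = r2_via_correlation(X_demo, Y_demo)
```

Both values should agree up to rounding errors, and `simple_linear_regression(X_demo, Y_demo)` should return the same number.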

The function `coefficient_of_determination` takes two arguments:
* `name` is the name of an attribute that should serve as the independent variable.
* `Data` is a list of dictionaries containing the values of the various variables.

The function prints the *coefficient of determination* that is obtained when the given attribute is used to predict the fuel consumption.  Since `mpg` is measured in miles per gallon, the fuel consumption is computed as the reciprocal `1/mpg`.

In [None]:
def coefficient_of_determination(name, Data):
    X  = np.array([float(line[name])    for line in Data])
    Y  = np.array([1/float(line['mpg']) for line in Data])
    R2 = simple_linear_regression(X, Y)
    print(f'coefficient of determination of fuel consumption w.r.t. {name:12s}: {round(R2, 2)}')

In [None]:
IndependentVars = ['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']

In [None]:
for name in IndependentVars:
    coefficient_of_determination(name, Data)
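
To find the best explanatory variable without comparing the printed numbers by hand, the coefficients could be collected and sorted.  The following is a minimal sketch under the assumption that the rows have the same format as `Data`; the function `rank_attributes` and the rows in `toy_rows` are hypothetical and the numbers are made up, so the result says nothing about the real data set.  To keep the example self-contained, $R^2$ is computed as the squared Pearson correlation, which coincides with the coefficient of determination for simple linear regression.

```python
import numpy as np

def rank_attributes(names, rows):
    # fuel consumption is the reciprocal of miles per gallon
    Y = np.array([1 / float(row['mpg']) for row in rows])
    scores = {}
    for name in names:
        X = np.array([float(row[name]) for row in rows])
        scores[name] = np.corrcoef(X, Y)[0, 1] ** 2
    # best explanatory variable first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# made-up rows: 'weight' is exactly proportional to 1/mpg, 'acc' is not
toy_rows = [
    {'mpg': '10', 'weight': '4000', 'acc': '15'},
    {'mpg': '20', 'weight': '2000', 'acc': '11'},
    {'mpg': '25', 'weight': '1600', 'acc': '16'},
    {'mpg': '40', 'weight': '1000', 'acc': '12'},
]

ranking = rank_attributes(['acc', 'weight'], toy_rows)
```

The first entry of `ranking` is the attribute with the largest coefficient of determination.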