In [None]:
%%HTML
<style>
.container { width:100% } 
</style>

# General Linear Regression

In this case study we investigate how much of the fuel consumption of a car can be explained by its 

- number of cylinders,
- engine displacement,
- power,
- weight,
- acceleration (given as the number of seconds until the car reaches 60 miles per hour), and
- the year the car was introduced to the market.

This data is given in the <tt>CSV</tt> file `cars.csv`.  In this file, the engine displacement is given in cubic inches and the weight is given in pounds.  The fuel consumption is specified as *miles per gallon*.

The module `csv` offers a number of functions for reading and writing <tt>csv</tt> files.

In [None]:
import csv

In [None]:
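with open('cars.csv') as input_file:
    reader     = csv.reader(input_file, delimiter=',')
    line_count = 0
    mpg        = []  # miles per gallon of every car
    Causes     = []  # features of every car
    for row in reader:
        if line_count != 0:  # skip the first line of the file
            mpg.append(float(row[0]))
            # columns 1 to 6 are the features; the constant 1.0 yields the intercept
            Causes.append([float(x) for x in row[1:6+1]] + [1.0])
        line_count += 1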

The number of data pairs of the form $\langle \textbf{x}, y \rangle$ that we have read is stored in the variable `m`.

In [None]:
m = len(mpg)
m

We have to transform `Causes`, which is a list of lists, into a `NumPy` matrix.

In [None]:
import numpy as np

In [None]:
Causes

In [None]:
X = np.array(Causes)
X

Note that every row in this matrix corresponds to the data of one car.

Since *miles per gallon* is the reciprocal of the fuel consumption, the vector `y` is defined as the elementwise reciprocal of `mpg`.  Hence `y` measures the fuel consumption in gallons per mile.

In [None]:
y = np.array([1 / mpg[i] for i in range(m)])
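To get a feeling for the reciprocal, the following sketch converts a few made-up `mpg` values into gallons per mile and, additionally, into litres per 100 kilometres.  The sample values and the names `mpg_demo`, `gallons_per_mile`, and `litres_per_100km` are purely illustrative, and the conversion assumes US gallons, which the source does not state.

```python
import numpy as np

# Made-up sample values, not taken from cars.csv.
mpg_demo = np.array([18.0, 25.0, 40.0])

# Fuel consumption as the reciprocal of miles per gallon.
gallons_per_mile = 1 / mpg_demo

# Assumption: US gallons (1 gallon = 3.785411784 litres); 1 mile = 1.609344 km.
LITRES_PER_GALLON = 3.785411784
KM_PER_MILE       = 1.609344
litres_per_100km  = gallons_per_mile * LITRES_PER_GALLON / KM_PER_MILE * 100

print(gallons_per_mile)
print(litres_per_100km)
```

A car that travels more miles per gallon has a smaller reciprocal, i.e. a lower fuel consumption.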

The weight vector `w` is specified via the **normal equation**:
$$ (X^\top \cdot X) \cdot \textbf{w} = X^\top \cdot \textbf{y} $$ 
This system of linear equations can be solved for `w` using the function `np.linalg.solve`.

In [None]:
%%time
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
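As a cross-check, the solution of the normal equation should agree with the result of `np.linalg.lstsq`, which solves the same least-squares problem without forming $X^\top \cdot X$ explicitly.  The following is a minimal sketch on synthetic data; the names `X_demo`, `y_demo`, and `w_true` are made up for illustration and have nothing to do with `cars.csv`.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the cars data set).
rng      = np.random.default_rng(0)
m_demo   = 50
features = rng.normal(size=(m_demo, 3))
X_demo   = np.hstack([features, np.ones((m_demo, 1))])  # constant column for the intercept
w_true   = np.array([2.0, -1.0, 0.5, 3.0])
y_demo   = X_demo @ w_true + 0.01 * rng.normal(size=m_demo)

# Solve the normal equation directly.
w_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Solve the same least-squares problem without forming X^T X.
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

If the columns of the matrix are nearly linearly dependent, forming $X^\top \cdot X$ can amplify rounding errors, so `np.linalg.lstsq` may be the more robust choice in that situation.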

The *residual sum of squares* is given by the following sum:
$$ \texttt{RSS} = \sum\limits_{i=1}^m \bigl(\textbf{x}^{(i)} \cdot \textbf{w} - y_i\bigr)^2 $$
Here $\textbf{x}^{(i)}$ is the $i$-th row of the matrix $X$, while $y_i$ is the $i$-th component of the vector $\textbf{y}$.
The expression $\textbf{x}^{(i)} \cdot \textbf{w}$ is the predicted value of the linear model, while $y_i$ is the actual value.

In [None]:
RSS = np.sum((X @ w - y) ** 2)

We compute the average fuel consumption according to the formula:
$$ \bar{\mathbf{y}} = \frac{1}{m} \cdot \sum\limits_{i=1}^m y_i $$ 

In [None]:
yMean = np.sum(y) / m
yMean

We compute the *total sum of squares* `TSS` according to the following formula:
$$ \mathtt{TSS} := \sum\limits_{i=1}^m \bigl(y_i - \bar{\mathbf{y}}\bigr)^2 $$

In [None]:
TSS = np.sum((y - yMean) ** 2)
TSS

Now $R^2$ is calculated via the formula:
$$ R^2 = 1 - \frac{\mathtt{RSS}}{\mathtt{TSS}}$$

In [None]:
R2 = 1 - RSS/TSS
R2
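The computation of $R^2$ can be bundled into a small helper function.  The sketch below uses a hypothetical function `r_squared` and a toy data set to illustrate the two extreme cases: a perfect fit yields $R^2 = 1$, while a model that always predicts the mean yields $R^2 = 0$.

```python
import numpy as np

def r_squared(X, y, w):
    """Return 1 - RSS/TSS for the linear model X @ w."""
    rss = np.sum((X @ w - y) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1 - rss / tss

# Toy data lying exactly on the line y = 2 * x + 1; the second column is the constant 1.
X_toy = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

w_exact = np.array([2.0, 1.0])             # reproduces y_toy exactly
w_mean  = np.array([0.0, np.mean(y_toy)])  # always predicts the mean of y_toy

print(r_squared(X_toy, y_toy, w_exact))  # perfect fit
print(r_squared(X_toy, y_toy, w_mean))   # no better than predicting the mean
```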

It looks like about $88\%$ of the variance of the fuel consumption can be explained by the data given in our `CSV` file.