In [1]:
%%HTML
<style>
.container { width:100% } 
</style>

# General Linear Regression

In this case study we investigate how much of the <em style="color:blue">fuel consumption</em> of a car can be explained by the 

- number of cylinders,
- engine displacement,
- power,
- weight,
- acceleration, and
- the year the car was introduced to the market.

This data is given in the <tt>CSV</tt> file `cars.csv`.  In this file, the engine displacement is given in *cubic inches* and the weight is given in *pounds*.  The fuel consumption is specified as *miles per gallon* and the acceleration is given as the number of seconds until the car reaches 60 miles per hour.

The module `csv` offers a number of functions for reading and writing <tt>CSV</tt> files.

In [2]:
import csv

Below we read the file `cars.csv` and store the miles-per-gallon values in the list `mpg`, while the number of cylinders, the engine displacement,
the power, the weight, the acceleration, and the model year are stored as the list of lists `Features`. Note also that we append the constant feature $1$ to every list in `Features`, so that the linear model can learn an intercept.

In [3]:
with open('cars.csv') as input_file:
    reader     = csv.reader(input_file, delimiter=',')
    line_count = 0
    mpg        = []
    Features   = []
    for row in reader:
        print(row)
        if line_count != 0:  # the first row is the header
            mpg.append(float(row[0]))  # column 0: miles per gallon
            # columns 1 to 6: numeric features; the last entry is the constant feature
            Features.append([float(x) for x in row[1:6+1]] + [1.0])
        line_count += 1

['mpg', 'cyl', 'displacement', 'hp', 'weight', 'acc', 'year', 'name']
['18.0', '8', '307.0', '130.0', '3504.0', '12.0', '70', 'chevrolet chevelle malibu']
['15.0', '8', '350.0', '165.0', '3693.0', '11.5', '70', 'buick skylark 320']
['18.0', '8', '318.0', '150.0', '3436.0', '11.0', '70', 'plymouth satellite']
['16.0', '8', '304.0', '150.0', '3433.0', '12.0', '70', 'amc rebel sst']
['17.0', '8', '302.0', '140.0', '3449.0', '10.5', '70', 'ford torino']
['15.0', '8', '429.0', '198.0', '4341.0', '10.0', '70', 'ford galaxie 500']
['14.0', '8', '454.0', '220.0', '4354.0', '9.0', '70', 'chevrolet impala']
['14.0', '8', '440.0', '215.0', '4312.0', '8.5', '70', 'plymouth fury iii']
['14.0', '8', '455.0', '225.0', '4425.0', '10.0', '70', 'pontiac catalina']
['15.0', '8', '390.0', '190.0', '3850.0', '8.5', '70', 'amc ambassador dpl']
['15.0', '8', '383.0', '170.0', '3563.0', '10.0', '70', 'dodge challenger se']
['14.0', '8', '340.0', '160.0', '3609.0', '8.0', '70', "plymouth 'cuda 340"]
['15.0', '
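The parsing logic can also be tried without the file on disk. The following is a minimal sketch that feeds the header and the first two data rows shown above through `io.StringIO`. The helper name `read_cars` and the variables `SAMPLE`, `sample_mpg`, and `sample_features` are hypothetical and not part of the notebook.

```python
import csv
import io

# Header and first two data rows, copied from the output above.
SAMPLE = """mpg,cyl,displacement,hp,weight,acc,year,name
18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
"""

def read_cars(text_file):
    """Return (mpg, features) parsed from an open CSV text file."""
    reader = csv.reader(text_file, delimiter=',')
    next(reader)  # skip the header row
    mpg, features = [], []
    for row in reader:
        mpg.append(float(row[0]))
        # Columns 1 to 6 are numeric features; append the constant feature 1.0.
        features.append([float(x) for x in row[1:7]] + [1.0])
    return mpg, features

sample_mpg, sample_features = read_cars(io.StringIO(SAMPLE))
print(sample_mpg)
print(sample_features[0])
```

Skipping the header with `next(reader)` replaces the `line_count` bookkeeping; the car name in the last column is ignored in both variants.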

The number of data pairs of the form $\langle \textbf{x}, y \rangle$ that we have read is stored in the variable `m`.

In [4]:
m = len(mpg)
m

392

For efficiency reasons we transform the variable `Features`, which is a list of lists, into a `NumPy` matrix.

In [5]:
import numpy as np

In [6]:
Features

[[8.0, 307.0, 130.0, 3504.0, 12.0, 70.0, 1.0],
 [8.0, 350.0, 165.0, 3693.0, 11.5, 70.0, 1.0],
 [8.0, 318.0, 150.0, 3436.0, 11.0, 70.0, 1.0],
 [8.0, 304.0, 150.0, 3433.0, 12.0, 70.0, 1.0],
 [8.0, 302.0, 140.0, 3449.0, 10.5, 70.0, 1.0],
 [8.0, 429.0, 198.0, 4341.0, 10.0, 70.0, 1.0],
 [8.0, 454.0, 220.0, 4354.0, 9.0, 70.0, 1.0],
 [8.0, 440.0, 215.0, 4312.0, 8.5, 70.0, 1.0],
 [8.0, 455.0, 225.0, 4425.0, 10.0, 70.0, 1.0],
 [8.0, 390.0, 190.0, 3850.0, 8.5, 70.0, 1.0],
 [8.0, 383.0, 170.0, 3563.0, 10.0, 70.0, 1.0],
 [8.0, 340.0, 160.0, 3609.0, 8.0, 70.0, 1.0],
 [8.0, 400.0, 150.0, 3761.0, 9.5, 70.0, 1.0],
 [8.0, 455.0, 225.0, 3086.0, 10.0, 70.0, 1.0],
 [4.0, 113.0, 95.0, 2372.0, 15.0, 70.0, 1.0],
 [6.0, 198.0, 95.0, 2833.0, 15.5, 70.0, 1.0],
 [6.0, 199.0, 97.0, 2774.0, 15.5, 70.0, 1.0],
 [6.0, 200.0, 85.0, 2587.0, 16.0, 70.0, 1.0],
 [4.0, 97.0, 88.0, 2130.0, 14.5, 70.0, 1.0],
 [4.0, 97.0, 46.0, 1835.0, 20.5, 70.0, 1.0],
 [4.0, 110.0, 87.0, 2672.0, 17.5, 70.0, 1.0],
 [4.0, 107.0, 90.0, 2430.0,

In [7]:
X = np.array(Features)
X

array([[  8. , 307. , 130. , ...,  12. ,  70. ,   1. ],
       [  8. , 350. , 165. , ...,  11.5,  70. ,   1. ],
       [  8. , 318. , 150. , ...,  11. ,  70. ,   1. ],
       ...,
       [  4. , 135. ,  84. , ...,  11.6,  82. ,   1. ],
       [  4. , 120. ,  79. , ...,  18.6,  82. ,   1. ],
       [  4. , 119. ,  82. , ...,  19.4,  82. ,   1. ]])

Note that every row in this matrix corresponds to the data of one car.

Since <em style="color:blue">miles per gallon</em> is the inverse of the <em style="color:blue">fuel consumption</em>, the vector `y` is defined as the componentwise reciprocal of the list `mpg`.

In [8]:
y = 1 / np.array(mpg)

The weight vector `w` is specified via the <em style="color:blue">normal equation</em>:
$$ (X^\top \cdot X) \cdot \textbf{w} = X^\top \cdot \textbf{y} $$ 
This system of linear equations can be solved for `w` using the function `np.linalg.solve`.  Note that the <em style="color:blue">transpose</em> of the matrix `X` is available as `X.T`.

In [9]:
%%time
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)

[ 1.39205897e-03 -1.70283082e-05  1.13756363e-04  1.10909594e-05
  3.38800847e-04 -1.26596723e-03  8.95296095e-02]
CPU times: user 4.12 ms, sys: 2 ms, total: 6.12 ms
Wall time: 4.89 ms
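As a cross-check, the normal equation can be compared with NumPy's least-squares solver `np.linalg.lstsq`, which works on $X$ directly. Forming $X^\top \cdot X$ squares the condition number of the problem, so a solver that avoids this product may be numerically more robust when the features differ widely in scale. The sketch below uses synthetic data, not the car data; all names ending in `_demo`, as well as `w_true`, `w_normal`, and `w_lstsq`, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data only

# 50 samples with 3 random features plus the constant feature in the last column.
X_demo = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = X_demo @ w_true + 0.01 * rng.normal(size=50)

# Normal equation, solved the same way as in the notebook.
w_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver that does not form X^T X explicitly.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

On a well-conditioned problem like this one, both approaches should agree up to floating-point rounding.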


The <em style="color:blue">residual sum of squares</em> is given by the following sum:
$$ \texttt{RSS} = \sum\limits_{i=1}^m \Bigl(\bigl(\textbf{x}^{(i)}\bigr)^\top \cdot \textbf{w} - y_i\Bigr)^2 $$
Here $\textbf{x}^{(i)}$ is the $i$-th row of the matrix $X$, while $y_i$ is the $i$-th component of the vector $\textbf{y}$.
The expression $\bigl(\textbf{x}^{(i)}\bigr)^\top \cdot \textbf{w}$ is the predicted value of the linear model, while $y_i$ is the actual value.
As the feature matrix $X$ is defined as
$$ X = \left( \begin{array}{c}
              \bigl(\textbf{x}^{(1)}\bigr)^\top \\
              \vdots \\
              \bigl(\textbf{x}^{(m)}\bigr)^\top
              \end{array}
       \right)
$$
we can compute `RSS` as follows:

In [10]:
RSS = np.sum((X @ w - y) ** 2)
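To see that the vectorized expression matches the sum in the formula, the following sketch evaluates both on a tiny hand-made example. The arrays `X_demo`, `y_demo`, and `w_demo` are made up for illustration and are unrelated to the car data.

```python
import numpy as np

# Three data points with one feature plus the constant feature.
X_demo = np.array([[1.0, 1.0],
                   [2.0, 1.0],
                   [3.0, 1.0]])
y_demo = np.array([1.0, 2.0, 2.0])
w_demo = np.array([0.5, 0.5])

# Explicit sum over the rows x^(i) of X, exactly as in the formula.
rss_loop = sum((X_demo[i] @ w_demo - y_demo[i]) ** 2 for i in range(len(y_demo)))

# Vectorized form: X @ w computes all predictions at once.
rss_vec = np.sum((X_demo @ w_demo - y_demo) ** 2)

print(rss_loop, rss_vec)
```

The predictions are $1.0$, $1.5$, and $2.0$, so only the second data point contributes a residual of $-0.5$ and both computations yield $0.25$.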

We compute the <em style="color:blue">average fuel consumption</em> $\bar{\mathbf{y}}$ according to the formula:
$$ \bar{\mathbf{y}} = \frac{1}{m} \cdot \sum\limits_{i=1}^m y_i $$ 

In [11]:
yMean = np.mean(y)
yMean

0.04782242789602457

We compute the <em style="color:blue">total sum of squares</em> `TSS` according to the following formula:
$$ \mathtt{TSS} := \sum\limits_{i=1}^m \bigl(y_i - \bar{\mathbf{y}}\bigr)^2 $$

In [12]:
TSS = np.sum((y - yMean) ** 2)
TSS

0.10825652135293372

Now the <em style="color:blue">proportion of explained variance</em> $R^2$ is calculated via the formula:
$$ R^2 = 1 - \frac{\mathtt{RSS}}{\mathtt{TSS}}$$

In [13]:
R2 = 1 - RSS/TSS
R2

0.8838956283269112

We can explain about $88\%$ of the variance of the fuel consumption by the data given in our `CSV` file.  Given that our data does not contain any parameters describing the air resistance, this result seems reasonable.
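The computation of $R^2$ can be wrapped into a small helper, which makes the two boundary cases easy to check: perfect predictions give $R^2 = 1$, while always predicting the mean gives $R^2 = 0$. This is a minimal sketch; the function name `r_squared` and the array `y_demo` are hypothetical and not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of explained variance, computed as 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_pred - y_true) ** 2)           # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - rss / tss

y_demo = np.array([1.0, 2.0, 3.0, 4.0])

print(r_squared(y_demo, y_demo))                     # perfect predictions: 1.0
print(r_squared(y_demo, np.full(4, y_demo.mean())))  # always predicting the mean: 0.0
```

With the notebook's variables, the call `r_squared(y, X @ w)` would be expected to reproduce the value of `R2`.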