In [1]:
from IPython.display import HTML
HTML(open("../style.css", "r").read())

In [2]:
%load_ext nb_mypy

Version 1.0.6


# Simple Linear Regression

In this notebook we again analyse the data from the file `cars.csv`.  However, this time we try to determine the independent variable that is best for explaining the fuel consumption.  To this end, we will encapsulate the relevant code into functions so that it is easier to reuse.

We need to read our data from a <tt>csv</tt> file.  The module `csv` offers a number of functions for reading and writing a <tt>csv</tt> file.
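As a minimal sketch of how `csv.DictReader` behaves, the following reads a small in-memory sample instead of a file. The string `sample` and the names `rows` and `mpg_values` are made up for illustration and are not part of `cars.csv`. The point to notice is that every value arrives as a string, so numeric columns have to be converted explicitly.

```python
import csv
import io

# Hypothetical sample in a csv layout; the values are illustrative only.
sample = "mpg,cyl,name\n18.0,8,car a\n15.0,8,car b\n"

# DictReader uses the first line as the header and yields one dict per row.
reader = csv.DictReader(io.StringIO(sample), delimiter=',')
rows   = list(reader)

# All values are strings, so numeric columns must be converted before use.
mpg_values = [float(row['mpg']) for row in rows]

print(rows[0])
print(mpg_values)
```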

In [3]:
import csv

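The function `read_data` reads a csv file and returns its rows as a list of dictionaries that map column names to string values.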

In [4]:
def read_data(file: str) -> list[dict[str, str]]:
    with open(file) as handle:
        reader = csv.DictReader(handle, delimiter=',')
        Data   = [] # list of rows, each a dictionary mapping column names to strings
        for row in reader:
            Data.append(row)
    return Data

In [5]:
Data = read_data('cars.csv')

In [6]:
Data[:5]

[{'mpg': '18.0',
  'cyl': '8',
  'displacement': '307.0',
  'hp': '130.0',
  'weight': '3504.0',
  'acc': '12.0',
  'year': '70',
  'name': 'chevrolet chevelle malibu'},
 {'mpg': '15.0',
  'cyl': '8',
  'displacement': '350.0',
  'hp': '165.0',
  'weight': '3693.0',
  'acc': '11.5',
  'year': '70',
  'name': 'buick skylark 320'},
 {'mpg': '18.0',
  'cyl': '8',
  'displacement': '318.0',
  'hp': '150.0',
  'weight': '3436.0',
  'acc': '11.0',
  'year': '70',
  'name': 'plymouth satellite'},
 {'mpg': '16.0',
  'cyl': '8',
  'displacement': '304.0',
  'hp': '150.0',
  'weight': '3433.0',
  'acc': '12.0',
  'year': '70',
  'name': 'amc rebel sst'},
 {'mpg': '17.0',
  'cyl': '8',
  'displacement': '302.0',
  'hp': '140.0',
  'weight': '3449.0',
  'acc': '10.5',
  'year': '70',
  'name': 'ford torino'}]

In [7]:
import numpy as np
from numpy.typing import NDArray

The function `simple_linear_regression` takes two arguments:
* `X` is a numpy vector containing $n$ values of the independent variable.
* `Y` is a numpy vector containing the corresponding values of the dependent variable; it has the same length as `X`.

The function computes the linear model
$$ Y_i = \vartheta_0 + \vartheta_1 \cdot X_i $$
such that the *residual sum of squares* 
$$ \texttt{RSS} = \sum\limits_{i=0}^{n-1} (\vartheta_0 + \vartheta_1 \cdot X_i - Y_i)^2 $$
is minimized.

It returns the *coefficient of determination* 
$$ R^2 = 1 - \frac{\texttt{RSS}}{\texttt{TSS}}  $$
where $\texttt{TSS}$ is the *total sum of squares*, which is defined as follows:
$$ \texttt{TSS} = \sum\limits_{i=0}^{n-1} (\bar{Y} - Y_i)^2 $$
Here $\bar{Y}$ denotes the arithmetic mean of the values $Y_i$.
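
The implementation that follows does not minimize $\texttt{RSS}$ numerically but evaluates a closed-form solution. As a brief sketch of where that solution comes from, write $\bar{X}$ and $\bar{Y}$ for the arithmetic means of the $X_i$ and $Y_i$, and set both partial derivatives of $\texttt{RSS}$ to zero:
$$ \frac{\partial\, \texttt{RSS}}{\partial \vartheta_0} = 2 \sum\limits_{i=0}^{n-1} (\vartheta_0 + \vartheta_1 \cdot X_i - Y_i) = 0 \quad\Longrightarrow\quad \vartheta_0 = \bar{Y} - \vartheta_1 \cdot \bar{X} $$
$$ \frac{\partial\, \texttt{RSS}}{\partial \vartheta_1} = 2 \sum\limits_{i=0}^{n-1} (\vartheta_0 + \vartheta_1 \cdot X_i - Y_i) \cdot X_i = 0 $$
Substituting $\vartheta_0 = \bar{Y} - \vartheta_1 \cdot \bar{X}$ into the second equation and using $\sum_{i=0}^{n-1} (X_i - \bar{X}) = 0$ and $\sum_{i=0}^{n-1} (Y_i - \bar{Y}) = 0$ yields
$$ \vartheta_1 = \frac{\sum\limits_{i=0}^{n-1} (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{\sum\limits_{i=0}^{n-1} (X_i - \bar{X})^2} $$
This assumes that the $X_i$ are not all equal, since otherwise the denominator is zero.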

In [8]:
def simple_linear_regression(X: NDArray[np.float64], Y: NDArray[np.float64]) -> float:
    xMean = np.mean(X)
    yMean = np.mean(Y)
    # closed-form least-squares estimates of the slope ϑ1 and the intercept ϑ0
    ϑ1    = np.sum((X - xMean) * (Y - yMean)) / np.sum((X - xMean) ** 2)
    ϑ0    = yMean - ϑ1 * xMean
    TSS   = np.sum((Y - yMean) ** 2)        # total sum of squares
    RSS   = np.sum((ϑ1 * X + ϑ0 - Y) ** 2)  # residual sum of squares
    R2    = 1 - RSS/TSS
    return R2
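
One way to sanity-check this computation: for a simple linear regression with an intercept, $R^2$ equals the square of the Pearson correlation coefficient between `X` and `Y`. The sketch below repeats the closed-form computation of `simple_linear_regression` in a helper called `r_squared` (a name chosen here for illustration) and compares the result against `np.corrcoef` on synthetic data. The data and the random seed are arbitrary assumptions.

```python
import numpy as np

def r_squared(X, Y):
    """Same closed-form computation as simple_linear_regression."""
    xMean = np.mean(X)
    yMean = np.mean(Y)
    slope     = np.sum((X - xMean) * (Y - yMean)) / np.sum((X - xMean) ** 2)
    intercept = yMean - slope * xMean
    TSS = np.sum((Y - yMean) ** 2)
    RSS = np.sum((slope * X + intercept - Y) ** 2)
    return 1 - RSS / TSS

# Synthetic data: a noisy linear relationship (illustrative values only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=50)
Y = 2.0 + 0.5 * X + rng.normal(0.0, 1.0, size=50)

r2   = r_squared(X, Y)
corr = np.corrcoef(X, Y)[0, 1]  # Pearson correlation coefficient

# The two quantities should agree up to floating-point rounding.
print(np.isclose(r2, corr ** 2))
```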

The function `coefficient_of_determination` takes two arguments:
* `name` is the name of an attribute that should serve as the independent variable.
* `Data` is a list of dictionaries containing the values of the various variables as strings.

The function returns the *coefficient of determination* that results if the given attribute is used to predict the fuel consumption.  Since `mpg` stands for miles per gallon, the fuel consumption is computed as the reciprocal `1/mpg`, i.e. in gallons per mile.

In [9]:
def coefficient_of_determination(name: str, Data: list[dict[str, str]]) -> float:
    X  = np.array([float(line[name])    for line in Data])
    Y  = np.array([1/float(line['mpg']) for line in Data])
    R2 = simple_linear_regression(X, Y)
    return R2
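
To illustrate why the reciprocal of `mpg` is used as the dependent variable, here is a small sketch on made-up records. The list `toy` and the names `r2_consumption` and `r2_mpg` are hypothetical. The values are constructed so that the consumption `1/mpg` is exactly proportional to the weight. The two functions are repeated in compact form so that the example runs on its own.

```python
import numpy as np

def simple_linear_regression(X, Y):
    xMean, yMean = np.mean(X), np.mean(Y)
    slope     = np.sum((X - xMean) * (Y - yMean)) / np.sum((X - xMean) ** 2)
    intercept = yMean - slope * xMean
    TSS = np.sum((Y - yMean) ** 2)
    RSS = np.sum((slope * X + intercept - Y) ** 2)
    return 1 - RSS / TSS

def coefficient_of_determination(name, Data):
    X = np.array([float(line[name])    for line in Data])
    Y = np.array([1/float(line['mpg']) for line in Data])  # fuel consumption
    return simple_linear_regression(X, Y)

# Made-up records where 1/mpg == weight / 50000 holds exactly.
toy = [
    {'mpg': '25.0', 'weight': '2000.0'},
    {'mpg': '20.0', 'weight': '2500.0'},
    {'mpg': '12.5', 'weight': '4000.0'},
    {'mpg': '10.0', 'weight': '5000.0'},
]

# Consumption is linear in weight, so R^2 is 1 up to rounding.
r2_consumption = coefficient_of_determination('weight', toy)

# For comparison: regressing mpg itself on weight fits these records less well.
X = np.array([float(line['weight']) for line in toy])
M = np.array([float(line['mpg'])    for line in toy])
r2_mpg = simple_linear_regression(X, M)

print(r2_consumption, r2_mpg)
```

On these constructed records the linear model fits the consumption perfectly, whereas `mpg` is inversely proportional to the weight and is therefore only approximated by a straight line.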

In [10]:
# candidate independent variables; the dependent variable is the fuel consumption
IndependentVars = ['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']

In [11]:
def main() -> tuple[float, str]:
    bestVal, bestName = 0.0, ''
    for name in IndependentVars:
        R2 = coefficient_of_determination(name, Data)
        print(f'coefficient of determination of fuel consumption w.r.t. {name:12s}: {round(R2, 2)}')
        # remember the attribute that explains the fuel consumption best so far
        if R2 > bestVal:
            bestVal, bestName = R2, name
    return bestVal, bestName

In [12]:
main()

coefficient of determination of fuel consumption w.r.t. cyl         : 0.7
coefficient of determination of fuel consumption w.r.t. displacement: 0.75
coefficient of determination of fuel consumption w.r.t. hp          : 0.73
coefficient of determination of fuel consumption w.r.t. weight      : 0.78
coefficient of determination of fuel consumption w.r.t. acc         : 0.21
coefficient of determination of fuel consumption w.r.t. year        : 0.31


(np.float64(0.7833240828863838), 'weight')