In [1]:
from IPython.display import HTML
HTML(open("../style.css", "r").read())

# Linear Regression with SciKit-Learn

We import the module `pandas`.  This module implements so-called [data frames](https://www.geeksforgeeks.org/pandas/python-pandas-dataframe/) and is more convenient than the module `csv` for reading a <tt>csv</tt> file.

In [2]:
import pandas as pd

The data we want to read is contained in the <tt>csv</tt> file `'cars.csv'`.  

In [3]:
cars = pd.read_csv('cars.csv')
cars

Unnamed: 0,mpg,cyl,displacement,hp,weight,acc,year,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino
...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86.0,2790.0,15.6,82,ford mustang gl
388,44.0,4,97.0,52.0,2130.0,24.6,82,vw pickup
389,32.0,4,135.0,84.0,2295.0,11.6,82,dodge rampage
390,28.0,4,120.0,79.0,2625.0,18.6,82,ford ranger
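The column `Unnamed: 0` in the output above suggests that the file stores the row index as a first column without a header. Assuming that this is the case, the following sketch shows how the argument `index_col` of `read_csv` could be used to treat this column as the index. The variable `sample` is a hypothetical stand-in for `'cars.csv'` that contains only three rows copied from the table.

```python
import io
import pandas as pd

# Three rows copied from the table above.  The empty first header field
# is an assumption about how 'cars.csv' stores the row index.
sample = io.StringIO(
    ",mpg,cyl,displacement,hp,weight,acc,year,name\n"
    "0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu\n"
    "1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320\n"
    "2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite\n"
)

# index_col=0 uses the first column as the row index instead of
# loading it as a data column named 'Unnamed: 0'.
cars_sample = pd.read_csv(sample, index_col=0)
print(cars_sample.columns.tolist())
```

Since the regression below selects its columns by name, the extra column does no harm either way.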


We want to convert the column `mpg` into a **NumPy** array, while the remaining numerical attributes should be collected into a 
<em style="color:blue">feature matrix</em>.  

In [4]:
import numpy as np

In [5]:
X = np.array(cars[['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']])
Y = np.array(cars['mpg'])
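As a small illustration of the shapes involved, the sketch below performs the same conversion on a toy data frame using the method `to_numpy`, which is an alternative to calling `np.array`. The names `toy`, `X_toy`, and `Y_toy` are made up for this example. The values are taken from the first three rows of the table.

```python
import numpy as np
import pandas as pd

# Toy frame holding three rows and three of the columns shown above.
toy = pd.DataFrame({
    "mpg":    [18.0, 15.0, 18.0],
    "cyl":    [8, 8, 8],
    "weight": [3504.0, 3693.0, 3436.0],
})

X_toy = toy[["cyl", "weight"]].to_numpy()  # 2-D feature matrix, one row per car
Y_toy = toy["mpg"].to_numpy()              # 1-D vector of target values

print(X_toy.shape, Y_toy.shape)
```

Selecting with a list of column names yields a two-dimensional array, while selecting a single column yields a one-dimensional array.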

Let us inspect the first five rows of the matrix `X`.

In [6]:
X[:5]

array([[   8. ,  307. ,  130. , 3504. ,   12. ,   70. ],
       [   8. ,  350. ,  165. , 3693. ,   11.5,   70. ],
       [   8. ,  318. ,  150. , 3436. ,   11. ,   70. ],
       [   8. ,  304. ,  150. , 3433. ,   12. ,   70. ],
       [   8. ,  302. ,  140. , 3449. ,   10.5,   70. ]])

Since *miles per gallon* is the reciprocal of the *fuel consumption* (gallons per mile), we replace every entry of `Y` by its reciprocal.

In [7]:
Y = 1 / Y
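To make the reciprocal relation concrete, the following sketch converts miles per gallon into litres per 100 kilometres. It assumes that the data set uses US gallons, which the notebook does not state. The function name `mpg_to_litres_per_100km` is hypothetical.

```python
import numpy as np

KM_PER_MILE = 1.609344              # exact by definition
LITRES_PER_US_GALLON = 3.785411784  # assumption: the data uses US gallons

def mpg_to_litres_per_100km(mpg):
    """Convert fuel efficiency (miles per gallon) to fuel consumption."""
    gallons_per_mile = 1.0 / np.asarray(mpg, dtype=float)
    litres_per_km = gallons_per_mile * LITRES_PER_US_GALLON / KM_PER_MILE
    return litres_per_km * 100.0

print(mpg_to_litres_per_100km([18.0, 36.0]))
```

Doubling the miles per gallon halves the consumption, which is why a model that is linear in the consumption is not linear in `mpg`.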

We import the module `linear_model` from **SciKit-Learn**:

In [8]:
import sklearn.linear_model as lm

We create a *linear model* for linear regression.

In [9]:
M = lm.LinearRegression()

We train this model using the data that we have read.

In [10]:
%%time
M.fit(X, Y)

CPU times: user 2.08 ms, sys: 932 μs, total: 3.01 ms
Wall time: 4.12 ms


Parameter,Value
fit_intercept,True
copy_X,True
tol,1e-06
n_jobs,None
positive,False
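`LinearRegression` fits its coefficients by ordinary least squares. The following sketch illustrates this on synthetic data rather than on the car data: it solves the same least squares problem with `np.linalg.lstsq` and compares the result with the fitted model. The names `X_syn`, `y_syn`, `A`, `theta`, and `model` are made up for this example.

```python
import numpy as np
import sklearn.linear_model as lm

# Synthetic data: a known linear relation plus a little noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
y_syn = 2.0 + X_syn @ np.array([1.0, -0.5, 0.25]) + 0.01 * rng.normal(size=50)

# Prepend a column of ones so that the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X_syn)), X_syn])
theta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

model = lm.LinearRegression().fit(X_syn, y_syn)

# Both approaches should agree up to rounding errors.
print(np.allclose(theta[0], model.intercept_))
print(np.allclose(theta[1:], model.coef_))
```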


The model `M` represents a linear relationship between the dependent variable $1/\texttt{mpg}$ and the independent variables $\texttt{cyl}$, $\texttt{displacement}$, $\texttt{hp}$, $\texttt{weight}$, $\texttt{acc}$, and $\texttt{year}$ of the form
$$\displaystyle \frac{1}{\texttt{mpg}}
                = \vartheta_0 + \vartheta_1 \cdot \texttt{cyl} 
                + \vartheta_2 \cdot \texttt{displacement} 
               + \vartheta_3 \cdot \texttt{hp}
                + \vartheta_4 \cdot \texttt{weight}
                + \vartheta_5 \cdot \texttt{acc}
                + \vartheta_6 \cdot \texttt{year}  $$
We proceed to extract the coefficients $\vartheta_i$ for $i\in\{0,\ldots,6\}$.

In [None]:
ϑ0 = M.intercept_
ϑ0

In [None]:
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6 = M.coef_
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5, ϑ6
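The intercept and the coefficients are all that is needed to evaluate the formula above. As a minimal sketch on random stand-in data with six features (not the car data), the following code evaluates the formula by hand and compares it with the method `predict`. The names `X_syn`, `y_syn`, `model`, and `by_hand` are made up for this example.

```python
import numpy as np
import sklearn.linear_model as lm

# Random stand-in data with six features, like the feature matrix X.
rng = np.random.default_rng(1)
X_syn = rng.uniform(size=(30, 6))
y_syn = rng.uniform(size=30)

model = lm.LinearRegression().fit(X_syn, y_syn)

# theta_0 + theta_1 * x_1 + ... + theta_6 * x_6, for every row at once.
by_hand = model.intercept_ + X_syn @ model.coef_

print(np.allclose(by_hand, model.predict(X_syn)))
```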

Let us check how much of the variance is explained by our model.  This is done using the method `score`.

In [None]:
R2 = M.score(X, Y)
R2
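For a regression model, `score` returns the coefficient of determination $R^2 = 1 - \mathtt{SS}_{\mathrm{res}} / \mathtt{SS}_{\mathrm{tot}}$, where $\mathtt{SS}_{\mathrm{res}}$ is the sum of the squared residuals and $\mathtt{SS}_{\mathrm{tot}}$ is the sum of the squared deviations of the targets from their mean. The sketch below computes this quantity by hand on synthetic data and compares it with `score`. All names in it are made up for this example.

```python
import numpy as np
import sklearn.linear_model as lm

# Synthetic data: a linear relation with moderate noise.
rng = np.random.default_rng(2)
X_syn = rng.normal(size=(40, 2))
y_syn = 1.0 + X_syn @ np.array([0.5, -1.0]) + 0.1 * rng.normal(size=40)

model = lm.LinearRegression().fit(X_syn, y_syn)

residuals = y_syn - model.predict(X_syn)
ss_res = np.sum(residuals ** 2)                # variation left unexplained
ss_tot = np.sum((y_syn - y_syn.mean()) ** 2)   # total variation of the targets
r2_by_hand = 1.0 - ss_res / ss_tot

print(np.isclose(r2_by_hand, model.score(X_syn, y_syn)))
```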

The linear model explains $88\%$ of the variance of the fuel consumption $1/\texttt{mpg}$ on the data it was trained on.  In order to derive a better model, we would need the *reference area* of the car, its *drag coefficient*, and the type of fuel.