- Status: draft
- Title: Linear Regression in Scikit-learn
- Slug: python-sklearn-linear
- Date: 2020-05-23 22:22:34
- Category: Computer Science
- Tags: programming, Python, Scikit-learn, sklearn, AI, machine learning, data science
- Author: Ben Du
- Modified: 2020-05-23 22:22:34


This notebook illustrates how to do a simple linear regression using scikit-learn.

In [61]:
import numpy as np
import pandas as pd
import hvplot.pandas
from sklearn.linear_model import LinearRegression

Load the Iris dataset
and encode the species as a numeric response (1, 2, 3) so that a linear regression can be fitted.

In [2]:
iris = pd.read_csv("../../home/media/data/iris.csv")
# Map each species name to a numeric code.
iris["y"] = iris.species.map({
    "Iris-setosa": 1,
    "Iris-versicolor": 2,
    "Iris-virginica": 3,
})
iris.head()

Unnamed: 0,id,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm,species,y
0,1,5.1,3.5,1.4,0.2,Iris-setosa,1
1,2,4.9,3.0,1.4,0.2,Iris-setosa,1
2,3,4.7,3.2,1.3,0.2,Iris-setosa,1
3,4,4.6,3.1,1.5,0.2,Iris-setosa,1
4,5,5.0,3.6,1.4,0.2,Iris-setosa,1


In [3]:
x = iris[["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"]]
y = iris.y

In [52]:
reg1 = LinearRegression().fit(x, y)

In [53]:
reg1.coef_

array([-0.10974146, -0.04424045,  0.22700138,  0.60989412])

In [54]:
reg1.intercept_

1.1920839948281388
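
The coefficients and intercept above fully determine the fitted model: a prediction is the dot product of a row with `coef_`, plus `intercept_`. The sketch below checks this against `predict`. It assumes scikit-learn's bundled Iris data (`load_iris`) as a stand-in for `iris.csv`, with the target recoded to 1/2/3 to mirror the notebook. The names `X`, `model`, `manual_pred`, and `same` are illustrative, and the fitted numbers may differ slightly from the ones shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: the bundled Iris data stands in for iris.csv.
data = load_iris()
X = data.data            # shape (150, 4): sepal/petal length and width in cm
y = data.target + 1      # recode 0/1/2 as 1/2/3 to mirror the notebook

model = LinearRegression().fit(X, y)

# A linear model predicts with: y_hat = X @ coef_ + intercept_
manual_pred = X @ model.coef_ + model.intercept_

# The manual formula agrees with predict() up to floating-point error.
same = np.allclose(manual_pred, model.predict(X))
print(same)
```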

In [55]:
reg1.rank_

4

In [56]:
reg1.score(x, y)

0.9304223675331597

In [62]:
reg1.predict(x)

array([0.91734173, 0.96141024, 0.95181031, 1.01260878, 0.92389183,
       1.05680235, 1.03762592, 0.95544006, 1.02070502, 0.91869693,
       0.89827134, 1.00008849, 0.91139498, 0.89816529, 0.7730022 ,
       0.95635941, 0.9660018 , 0.97833114, 0.96731454, 0.98775914,
       0.95694375, 1.0531726 , 0.87698786, 1.17725847, 1.0681889 ,
       0.99583637, 1.10011902, 0.92906772, 0.91079163, 1.01991072,
       1.01336062, 1.0335223 , 0.84153404, 0.84247683, 0.91869693,
       0.89618773, 0.850745  , 0.91869693, 0.99358084, 0.94446591,
       0.96660515, 1.07456442, 0.98473275, 1.2176738 , 1.13954911,
       1.0333738 , 0.94946987, 0.98548459, 0.90924548, 0.93716396,
       2.20308259, 2.2845166 , 2.32487047, 2.1876208 , 2.31393877,
       2.25705298, 2.39745639, 1.90717243, 2.17656176, 2.24113634,
       1.95929474, 2.28013501, 1.95420588, 2.31512204, 2.05930184,
       2.17232866, 2.38115786, 1.97673409, 2.35070534, 2.02311961,
       2.59045598, 2.0996557 , 2.41725961, 2.19756726, 2.13040

In [60]:
1 - sum((reg1.predict(x) - y)**2) / sum((y - np.mean(y))**2)

0.9304223675331595
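
The expression above is the coefficient of determination, R² = 1 - SS_res / SS_tot, which is also what `score` returns for a regressor. As a minimal sketch, the same value can be obtained in three ways. It assumes scikit-learn's bundled Iris data in place of `iris.csv` (target recoded to 1/2/3), so the printed value may differ slightly from the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Assumption: the bundled Iris data stands in for iris.csv.
X, target = load_iris(return_X_y=True)
y = target + 1           # recode 0/1/2 as 1/2/3 to mirror the notebook

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Manual R^2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same quantity via the metrics helper and via the estimator's score().
r2_metric = r2_score(y, pred)
r2_method = model.score(X, y)
print(r2_manual, r2_metric, r2_method)
```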

In [67]:
1 - sum((np.round(reg1.predict(x)) - y)**2) / sum((y - np.mean(y))**2)

0.96
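
Rounding the predictions turns the regression into a crude classifier. With 50 flowers per species coded 1, 2, 3 (the standard Iris layout), the total sum of squares is 50 + 0 + 50 = 100, so an R² of 0.96 corresponds to a residual sum of squares of 4, for instance four rounded predictions that are off by one class. A more direct summary is the number of wrong labels. The sketch below assumes scikit-learn's bundled Iris data in place of `iris.csv`; clipping to the range 1..3 is an addition that the notebook does not perform, and the counts may differ from the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: the bundled Iris data stands in for iris.csv.
X, target = load_iris(return_X_y=True)
y = target + 1           # recode 0/1/2 as 1/2/3 to mirror the notebook

model = LinearRegression().fit(X, y)

# Round to the nearest integer; clip so every label is a valid class (1..3).
labels = np.clip(np.round(model.predict(X)), 1, 3).astype(int)

n_wrong = int(np.sum(labels != y))   # number of misclassified flowers
accuracy = np.mean(labels == y)      # fraction classified correctly
print(n_wrong, accuracy)
```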

## Train-Test Split

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

In [11]:
x_train

Unnamed: 0,sepal_length_cm,sepal_width_cm,petal_length_cm,petal_width_cm
85,6.0,3.4,4.5,1.6
30,4.8,3.1,1.6,0.2
101,5.8,2.7,5.1,1.9
94,5.6,2.7,4.2,1.3
64,5.6,2.9,3.6,1.3
...,...,...,...,...
9,4.9,3.1,1.5,0.1
103,6.3,2.9,5.6,1.8
67,5.8,2.7,4.1,1.0
117,7.7,3.8,6.7,2.2


In [12]:
y_train

85     2
30     1
101    3
94     2
64     2
      ..
9      1
103    3
67     2
117    3
47     1
Name: y, Length: 90, dtype: int64

In [13]:
reg = LinearRegression().fit(x_train, y_train)

In [14]:
reg.score(x_train, y_train)

0.9518623736306308

In [15]:
reg.score(x_test, y_test)

0.8886093167492697
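
The gap between the training and test R² can also be expressed as an error in the units of the response. A minimal sketch, assuming scikit-learn's bundled Iris data in place of `iris.csv` (so the numbers may differ from the ones above); `rmse_train` and `rmse_test` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for iris.csv.
X, target = load_iris(return_X_y=True)
y = target + 1           # recode 0/1/2 as 1/2/3 to mirror the notebook

# Same split settings as in the notebook: 40% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Root mean squared error on each part of the split.
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(rmse_train, rmse_test)
```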

## Cross Validation

In [40]:
from sklearn.model_selection import cross_val_score

lreg = LinearRegression()
scores = cross_val_score(lreg, x, y, cv=5)

In [39]:
scores

array([0.        , 0.85215955, 0.        , 0.76225759, 0.        ])

In [33]:
scores.mean()

0.3228834275997373
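
The three scores of exactly 0 come from the fold layout rather than from the model. For a regressor, `cv=5` splits the rows into contiguous, unshuffled folds, and the data appear to be sorted by species, so the first, third, and fifth folds each contain a single species. The response is constant within such a fold, the total sum of squares is zero, and scikit-learn reports an R² of 0.0 in that case. Shuffling the rows before splitting avoids this. A minimal sketch, assuming scikit-learn's bundled Iris data (also sorted by species) in place of `iris.csv`; the choice of `random_state=0` is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: the bundled Iris data stands in for iris.csv.
X, target = load_iris(return_X_y=True)   # rows are sorted by species
y = target + 1                           # recode 0/1/2 as 1/2/3

model = LinearRegression()

# cv=5 on a regressor uses contiguous, unshuffled folds.
plain_scores = cross_val_score(model, X, y, cv=5)

# Check which test folds contain only one species (constant response).
fold_is_constant = [
    len(np.unique(y[test_idx])) == 1
    for _, test_idx in KFold(n_splits=5).split(X)
]

# Shuffling mixes the species across folds.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled_cv)
print(plain_scores, fold_is_constant, shuffled_scores.mean())
```

With shuffled folds, each test fold contains a mix of species, so the per-fold R² values are meaningful and their mean is a more reasonable summary of out-of-sample fit.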

## References

https://scikit-learn.org/stable/modules/cross_validation.html