# Linear Regression

In this module, you will learn how to import and use linear regression in scikit-learn. You will also learn how to evaluate the model by splitting the data into a training set and a test set. Lastly, we will use the mean squared error to measure how far the model's predictions are from the true values.

<b>Functions and attributes in this lecture: </b>
- `sklearn.model_selection` - Submodule for functions used in selecting and evaluating models
  - `train_test_split` - Divides the data into a training set and a test set.
- `sklearn.linear_model` - Submodule for linear models
  - `LinearRegression` - The linear regression model.
  - `.fit()` - Training the model on the data.
  - `.predict()` - Predicting on new data using the model.
- `sklearn.metrics` - Submodule for metrics used to evaluate models
  - `mean_squared_error` - Computes the mean squared error.

In [1]:
# Importing Pandas and NumPy
import pandas as pd
import numpy as np

## Importing the Dataset

Let us now import the diabetes dataset and see what it represents!

In [1]:
# Import the diabetes loader
from sklearn.datasets import load_diabetes

# Load in the diabetes dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Get the description of the dataset
print(load_diabetes()["DESCR"])

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature va

## Doing Linear Regression

We will now train a linear regression machine learning model!

In [2]:
# Import the linear regression model
from sklearn.linear_model import LinearRegression

In [3]:
# Select only some of the features
X = X[["age", "bmi"]]

In [4]:
# Creating the model
lin_reg = LinearRegression()

In [5]:
# Training the model
lin_reg.fit(X, y)

LinearRegression()

In [6]:
# Creating new datapoints with the same column names used in training
new_data = pd.DataFrame([[0, 0], [0.5, 0.3]], columns=["age", "bmi"])

In [7]:
# Predicted values
lin_reg.predict(new_data)

array([152.13348416, 496.0852863 ])
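
To see where these numbers come from, the sketch below refits the same two-feature model and reproduces `.predict()` by hand as the intercept plus the coefficient-weighted sum of the features. The names `model`, `new_points`, `manual_pred`, and `sklearn_pred` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Refit the same two-feature model so this block runs on its own
X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[["age", "bmi"]]
model = LinearRegression().fit(X, y)

# The fitted parameters: one coefficient per feature, plus an intercept
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# Reproduce predict() by hand: intercept + sum(coefficient * feature value)
new_points = np.array([[0, 0], [0.5, 0.3]])
manual_pred = model.intercept_ + new_points @ model.coef_

# Compare against the library's own prediction
sklearn_pred = model.predict(pd.DataFrame(new_points, columns=X.columns))
print(manual_pred)
print(sklearn_pred)
```

Because the features in this dataset are mean-centered, the point `[0, 0]` stands for a patient with average age and average BMI, so its prediction equals the intercept. The features are also scaled to small values, so `[0.5, 0.3]` most likely lies well outside the range of the training data, and its large prediction should be read as an extrapolation.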

## Is our Model any Good?

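So far, so good. But how do we know if our machine learning model is any good? Predictions on the same data the model was trained on tell us little, so we hold back part of the data as a test set and measure the error there. Let's find out!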

In [8]:
# Importing train-test-split
from sklearn.model_selection import train_test_split

In [9]:
# Splitting into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [10]:
# Seeing the shape of the training set
X_train.shape

(331, 2)

In [11]:
# Seeing the shape of the test set
X_test.shape

(111, 2)
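
The shapes above follow from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set. The sketch below, with illustrative variable names ending in `_b`, checks that passing `test_size=0.25` explicitly with the same `random_state` gives the same split.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[["age", "bmi"]]

# Default split: no test_size given, so 25% of the rows are held out
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Explicit 25% test split with the same seed
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)

# With the same seed, both calls should select exactly the same test rows
same_split = X_test.index.equals(X_test_b.index)
print(same_split)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models: every model is then scored on the same held-out rows.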

In [12]:
# Creating a new linear model
lin_reg_mse = LinearRegression()

In [13]:
# Training the model on the training set
lin_reg_mse.fit(X_train, y_train)

LinearRegression()

In [14]:
# Importing the mean squared error
from sklearn.metrics import mean_squared_error

In [15]:
# Predicting on the test set
y_pred = lin_reg_mse.predict(X_test)

In [16]:
# Displaying the predictions
y_pred

array([153.26129968, 201.5519655 , 158.17040414, 216.41789908,
       134.82853965, 130.17151246, 312.08539485, 191.36822223,
        51.84017368, 157.04863799, 113.97588244,  97.95043803,
        79.97768556, 168.64396936,  97.12489612, 151.8748747 ,
       226.19196102, 245.96699906, 173.88704274, 215.08820263,
       177.78783812, 119.12448261,  98.66252286, 183.11920547,
       184.61268523, 172.08475718, 213.24177009, 140.88457339,
        72.88817962, 134.38111365, 244.31591524, 100.14342106,
       165.20318154, 150.3939765 , 205.18810206, 213.39297184,
       121.17884544, 120.28399344, 133.22160282, 100.24429658,
        85.57371173, 134.24249346, 137.37447547, 138.06139718,
       115.93577206, 115.94835362, 102.31124097, 109.11092496,
        68.41391961, 143.77705969, 208.20662699,  70.66363117,
       172.85357055, 116.57854679, 126.54795747, 209.17078909,
       104.59251796, 216.43048064, 130.56220992,  94.1315343 ,
       177.07575329, 184.41733649, 193.80070097, 111.92

In [17]:
# Finding the error (scikit-learn expects the true values first, then the predictions)
mse = mean_squared_error(y_test, y_pred)
mse

3782.3793166602563

In [18]:
# Taking the square root gives the root mean squared error (RMSE),
# which is in the same units as the target
np.sqrt(mse)

61.50105134597502
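
To make the metric concrete, the sketch below recomputes the mean squared error by hand as the average of the squared differences between true and predicted values, and checks it against `mean_squared_error`. It also scores a naive baseline that always predicts the mean of the training targets. The baseline is an addition for illustration and not part of the original lecture, and the names `manual_mse`, `sklearn_mse`, `baseline_pred`, and `baseline_mse` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Rebuild the same split and model so this block runs on its own
X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[["age", "bmi"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE by hand: mean of the squared differences
manual_mse = np.mean((y_test - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_test, y_pred)

# Illustrative baseline (an assumption, not from the lecture):
# always predict the average of the training targets
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)

print("manual MSE:  ", manual_mse)
print("sklearn MSE: ", sklearn_mse)
print("baseline MSE:", baseline_mse)
print("model RMSE:  ", np.sqrt(sklearn_mse))
```

An error value means little in isolation. Comparing it with a simple baseline like this one gives a rough sense of whether the two features are helping: a model that has learned something useful should have a lower MSE than the constant prediction.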