# Linear Regression

In this notebook, we will learn how to apply linear regression to predict the heating load (Y1) of buildings as a function of the building parameters (X1 to X8).

The attached dataset is taken from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency).

To run this code, you will need the following Python packages:
* numpy
* pandas
* openpyxl
* scikit-learn

In [1]:
import numpy as np
import pandas as pd

In [2]:
!pip install openpyxl



In [3]:
# First, we load the dataset using pandas
df = pd.read_excel("Energy_Efficiency.xlsx", engine = 'openpyxl')
# Remove any unnamed columns (might occur due to difference in pandas readers)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
# Remove rows in which every value is NaN (how='all' keeps partially filled rows)
df = df.dropna(how='all')
# Drop Y2 (as we only consider Y1 for regression)
df = df.drop('Y2', axis=1)

In [4]:
# Next, we split the dataframe into training and testing sets with a 70% / 30% ratio
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.3, random_state=42) # random_state is fixed for reproducibility
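
To see where the split sizes come from, here is a minimal sketch of a shuffled 70% / 30% split. It assumes the dataset has 768 rows (the size listed for the UCI Energy Efficiency dataset) and assumes the test size is rounded up, which is consistent with the 537 training rows reported by `describe()`. The seed and the `rng.permutation` shuffle are illustrative only and do not reproduce scikit-learn's actual split.

```python
import math
import numpy as np

n_samples = 768   # assumption: row count of the UCI Energy Efficiency dataset
test_size = 0.3

# Assumed rounding rule: the test set size is rounded up, the rest is training data
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Illustrative shuffle; this is NOT scikit-learn's internal random stream
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(f"train rows: {n_train}, test rows: {n_test}")
```

Fixing the seed makes the shuffle, and therefore the split, identical on every run.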

In [5]:
# Now let's display a few rows from the training data
df_train

,X1,X2,X3,X4,X5,X6,X7,X8,Y1
334,0.62,808.5,367.5,220.50,3.5,4,0.25,1,15.08
139,0.64,784.0,343.0,220.50,3.5,5,0.10,2,15.19
485,0.90,563.5,318.5,122.50,7.0,3,0.25,5,32.31
547,0.79,637.0,343.0,147.00,7.0,5,0.40,1,41.67
18,0.79,637.0,343.0,147.00,7.0,4,0.00,0,29.63
...,...,...,...,...,...,...,...,...,...
71,0.76,661.5,416.5,122.50,7.0,5,0.10,1,32.21
106,0.86,588.0,294.0,147.00,7.0,4,0.10,2,26.33
270,0.71,710.5,269.5,220.50,3.5,4,0.10,5,10.67
435,0.98,514.5,294.0,110.25,7.0,5,0.25,4,28.62


In [6]:
# Then let's view some statistics
df_train.describe()

,X1,X2,X3,X4,X5,X6,X7,X8,Y1
count,537.0,537.0,537.0,537.0,537.0,537.0,537.0,537.0,537.0
mean,0.760354,674.867784,318.636872,178.115456,5.201117,3.500931,0.23594,2.854749,22.050503
std,0.10479,87.758133,43.619254,44.839207,1.750948,1.106502,0.134118,1.544532,10.088187
min,0.62,514.5,245.0,110.25,3.5,2.0,0.0,0.0,6.01
25%,0.66,612.5,294.0,147.0,3.5,3.0,0.1,2.0,12.96
50%,0.74,686.0,318.5,220.5,3.5,3.0,0.25,3.0,17.23
75%,0.82,759.5,343.0,220.5,7.0,4.0,0.4,4.0,31.28
max,0.98,808.5,416.5,220.5,7.0,5.0,0.4,5.0,43.1


In [7]:
# Now we extract the model's inputs and targets from both the training and testing dataframes
def extract_Xy(df):
    # The target (Y1) is the last column; all preceding columns are inputs
    df_numpy = df.to_numpy()
    return df_numpy[:, :-1], df_numpy[:, -1]

X_train, y_train = extract_Xy(df_train)
X_test, y_test = extract_Xy(df_test)

## Linear Regression via Scikit-Learn

In [8]:
# Next, we fit a linear regression model using scikit-learn's implementation
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X_train, y_train)

In [9]:
# Using scikit-learn's MSE function, we can compute the training and testing error for our model
from sklearn.metrics import mean_squared_error

y_train_predict = model.predict(X_train)
training_error = mean_squared_error(y_train, y_train_predict)
print(f"Training Error: {training_error} (RMS: {training_error**0.5})")
y_test_predict = model.predict(X_test)
testing_error = mean_squared_error(y_test, y_test_predict)
print(f"Testing Error: {testing_error} (RMS: {testing_error**0.5})")

# Note: We also display the Root Mean Square error (RMS) since it is in the same units as Y1 and is easier to compare with the dataset statistics (displayed using df_train.describe())


Training Error: 8.400443932494394 (RMS: 2.8983519338573074)
Testing Error: 8.852752599680795 (RMS: 2.9753575582912375)
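
To put these errors in context, the sketch below compares the training RMS error with the spread of Y1 and derives the training R² from the numbers already shown in this notebook. The values are copied from the `describe()` output and the training error above; the conversion from sample variance to population variance is needed because `describe()` reports the standard deviation with `ddof=1`.

```python
n = 537                        # training rows (count from describe())
std_y = 10.088187              # sample std of Y1 from describe() (ddof=1)
train_mse = 8.400443932494394  # training error reported above

# R^2 = 1 - SS_res / SS_tot = 1 - MSE / population variance of y
var_y = std_y**2 * (n - 1) / n
r2 = 1.0 - train_mse / var_y

rmse = train_mse**0.5
ratio = rmse / std_y

print(f"RMS error / std of Y1: {ratio:.3f}")
print(f"Training R^2: {r2:.3f}")
```

An RMS error well below the standard deviation of Y1 indicates that the model explains most of the variation in the heating load.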


In [10]:
%%timeit
LinearRegression().fit(X_train, y_train)
# Here we are measuring the training time to compare with our implementation below

603 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Linear Regression from Scratch

In [11]:
def our_mean_square_error(true, predicted):
    # Implemented to match scikit-learn's mean_squared_error
    # Note: both true & predicted are expected to be numpy arrays of the same shape
    mse = np.mean((true - predicted) ** 2)
    return mse


In [12]:
print(f"{our_mean_square_error( np.array([  1, 0]), np.array([1,   0]) ) = }") # Should be 0
print(f"{our_mean_square_error( np.array([  0, 1]), np.array([1,   0]) ) = }") # Should be 1
print(f"{our_mean_square_error( np.array([0.5, 0]), np.array([1, 0.5]) ) = }") # Should be 0.25

our_mean_square_error( np.array([  1, 0]), np.array([1,   0]) ) = 0.0
our_mean_square_error( np.array([  0, 1]), np.array([1,   0]) ) = 1.0
our_mean_square_error( np.array([0.5, 0]), np.array([1, 0.5]) ) = 0.25


In [13]:
class OurLinearRegression:
    def _prepare_inputs(self, X):
        # Here, we add a new input with value 1 to each example. It will be multiplied by the bias weight
        ones = np.ones((X.shape[0], 1), dtype=X.dtype)
        return np.concatenate((ones, X), axis=1)

    def fit(self, X, y):
        X = self._prepare_inputs(X) # First, we prepare the inputs
        # Compute and store the model weights in self.w by solving the normal equations (X^T X) w = X^T y
        # Note: any numpy function may be used except "numpy.linalg.lstsq" and "numpy.linalg.pinv"
        # To compute a square matrix's inverse, you can use "numpy.linalg.inv".
        # A more stable option than "numpy.linalg.inv(A) @ b" is "numpy.linalg.solve(A, b)"
        self.w = np.linalg.solve(X.T @ X, X.T @ y)
        # Return self to match the behavior of Scikit-Learn's LinearRegression fit()
        return self
    
    def predict(self, X):
        X = self._prepare_inputs(X) # First, we prepare the inputs
        # Compute and return the predictions given X
        return X @ self.w
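
As a sanity check on the normal-equation approach used in `fit`, the sketch below solves `(X^T X) w = X^T y` on synthetic data and compares the result with `numpy.linalg.lstsq`. The data, the seed, and the names `true_w`, `w_normal`, and `w_lstsq` are made up for illustration; `lstsq` is used here only as a reference, not as part of the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-conditioned problem: 200 examples, 3 features
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5, 3.0])  # bias first, then one weight per feature

# Prepend a column of ones so the first weight acts as the bias
Xb = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
y = Xb @ true_w + 0.01 * rng.normal(size=X.shape[0])

# Normal equations, solved without forming an explicit inverse
w_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Reference solution from numpy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print("weights agree:", np.allclose(w_normal, w_lstsq))
print("recovered weights:", np.round(w_normal, 3))
```

On well-conditioned data both routes should give the same weights up to floating-point error.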

In [14]:
# Now, you can train your model
our_model = OurLinearRegression().fit(X_train, y_train)

In [15]:
# Using your MSE function, you can compute the training and testing error for our model
y_train_predict = our_model.predict(X_train)
training_error = our_mean_square_error(y_train, y_train_predict)
print(f"Training Error: {training_error} (RMS: {training_error**0.5})")
y_test_predict = our_model.predict(X_test)
testing_error = our_mean_square_error(y_test, y_test_predict)
print(f"Testing Error: {testing_error} (RMS: {testing_error**0.5})")

Training Error: 8.396083444368179 (RMS: 2.897599600422422)
Testing Error: 8.839515947385179 (RMS: 2.9731323460931196)
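
These errors differ slightly from the scikit-learn errors reported earlier, even though both models fit ordinary least squares. One possible explanation, offered here as a hypothesis rather than a confirmed cause, is that the inputs are linearly dependent: in the nine training rows displayed earlier, X2 equals X3 + 2 * X4 exactly. If this holds for the whole dataset, `X.T @ X` is singular or nearly so, and different solvers can return different weights. The sketch below checks the relation on those displayed rows only.

```python
import numpy as np

# X1..X4 copied from the nine training rows displayed earlier in this notebook
rows = np.array([
    [0.62, 808.5, 367.5, 220.50],
    [0.64, 784.0, 343.0, 220.50],
    [0.90, 563.5, 318.5, 122.50],
    [0.79, 637.0, 343.0, 147.00],
    [0.79, 637.0, 343.0, 147.00],
    [0.76, 661.5, 416.5, 122.50],
    [0.86, 588.0, 294.0, 147.00],
    [0.71, 710.5, 269.5, 220.50],
    [0.98, 514.5, 294.0, 110.25],
])
X2, X3, X4 = rows[:, 1], rows[:, 2], rows[:, 3]

# Does X2 = X3 + 2 * X4 hold in every displayed row?
collinear = np.allclose(X2, X3 + 2 * X4)

# Condition number of the design matrix (bias column + X1..X4);
# a very large value signals (near) linear dependence between columns
design = np.concatenate((np.ones((rows.shape[0], 1)), rows), axis=1)
cond = np.linalg.cond(design)

print("X2 == X3 + 2*X4 in all displayed rows:", collinear)
print(f"condition number of the design matrix: {cond:.3e}")
```

Running the same check on the full `X_train` would show whether the dependence holds beyond the displayed rows.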


In [16]:
%%timeit
OurLinearRegression().fit(X_train, y_train)
# Now, you can compare the time of our implementation with Scikit-Learn's. What is your conclusion?

55.3 µs ± 666 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
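
Outside a notebook, a similar comparison can be sketched with the standard `timeit` module. The example below times the normal-equation solve against `numpy.linalg.lstsq` on random data with the same shape as the training set (537 rows, 8 features plus a bias column). It does not time scikit-learn itself, and the assumption that an SVD-based least-squares solve plus input validation accounts for part of scikit-learn's extra time is not verified here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in data with the training set's shape (bias column + 8 features)
X = np.concatenate((np.ones((537, 1)), rng.normal(size=(537, 8))), axis=1)
y = rng.normal(size=537)

def fit_normal():
    # Normal equations, as in OurLinearRegression.fit
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_lstsq():
    # SVD-based least-squares solve, used here as a comparison point
    return np.linalg.lstsq(X, y, rcond=None)[0]

loops = 200
t_normal = min(timeit.repeat(fit_normal, number=loops, repeat=5)) / loops
t_lstsq = min(timeit.repeat(fit_lstsq, number=loops, repeat=5)) / loops

print(f"normal equations: {t_normal * 1e6:.1f} µs per fit")
print(f"lstsq:            {t_lstsq * 1e6:.1f} µs per fit")
```

Absolute timings depend on the machine and the BLAS/LAPACK build, so only the relative order of magnitude is meaningful.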


In [17]:
# Conclusion about my implementation's performance and training time
'''
* Performance

  - My implementation
      Training Error: 8.396083444368179 (RMS: 2.897599600422422)
      Testing Error: 8.839515947385179 (RMS: 2.9731323460931196)
  - scikit-learn
      Training Error: 8.400443932494394 (RMS: 2.8983519338573074)
      Testing Error: 8.852752599680795 (RMS: 2.9753575582912375)

  My implementation performs almost identically to scikit-learn. Its errors are
  marginally lower (by about 0.004 in training MSE and 0.013 in testing MSE).
  Since both models solve the same least-squares problem, this small gap is
  likely numerical rather than a real modelling advantage.

* Training time

  - My implementation
      55.3 µs ± 666 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
  - scikit-learn
      603 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

  My implementation trains roughly 11 times faster than scikit-learn on this dataset.
'''
