## Linear Regression using Scikit-Learn

This notebook uses [scikit-learn](https://scikit-learn.org/stable/index.html), an open-source, commercially usable machine learning toolkit. It contains implementations of many of the algorithms covered in other notebooks of this project.

In [44]:
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

The following helper function loads the house features and prices dataset.

In [45]:
def load_house_dataset():
    print(os.getcwd)
    '''
    Loads both features and target values from the raw dataset, as specified in the path below.
    Example of data in the file:

    [feature_x1, feature_x2, feature_x4, feature_x5, target_y]
    1.244000000000000000e+03,3.000000000000000000e+00,1.000000000000000000e+00,6.400000000000000000e+01,3.000000000000000000e+02
    1.947000000000000000e+03,3.000000000000000000e+00,2.000000000000000000e+00,1.700000000000000000e+01,5.098000000000000114e+02

        Returns: X matrix and y vector, numpy structures.

    '''
    data = np.loadtxt("../data/raw/houses_features_prices.txt", delimiter=',', skiprows=1)
    # Select all rows and every column except the last one: the feature matrix X.
    # The slice :-1 means "from the first column up to, but not including, the last".
    X = data[:, :-1]
    # Select all rows and only the last column (index -1): the target vector y.
    y = data[:, -1]
    return X, y
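
As a quick check of the slicing described in the helper, the sketch below parses the two sample rows quoted in the docstring from an in-memory string instead of the real file. The header line and the names `X_demo` and `y_demo` are placeholders chosen for illustration, not part of the dataset.

```python
import io
import numpy as np

# Two sample rows taken from the docstring; the header line is a made-up placeholder.
sample = io.StringIO(
    "x1,x2,x3,x4,y\n"
    "1.244e+03,3.0e+00,1.0e+00,6.4e+01,3.0e+02\n"
    "1.947e+03,3.0e+00,2.0e+00,1.7e+01,5.098e+02\n"
)

# skiprows=1 drops the header, as in the helper function.
data = np.loadtxt(sample, delimiter=",", skiprows=1)

X_demo = data[:, :-1]  # all rows, every column except the last
y_demo = data[:, -1]   # all rows, last column only

print(X_demo.shape, y_demo.shape)
```

With five columns in the file, `data[:, :-1]` and `data[:, :4]` select the same four feature columns; the negative index simply keeps working if features are added later.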

### Gradient Descent in Scikit-Learn
Scikit-learn provides a gradient descent regression model, [sklearn.linear_model.SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#examples-using-sklearn-linear-model-sgdregressor). Like the previous implementations of gradient descent in this project, the model performs best with normalized inputs. [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) performs z-score normalization, which the library refers to as the 'standard score'.

The code below loads the dataset.

In [46]:
X_train, y_train = load_house_dataset()
X_features = ['size(sqft)','bedrooms','floors','age']
print(f"First row loaded features: {X_train[0]} - Target: {y_train[0]}")

<built-in function getcwd>
First row loaded features: [1.244e+03 3.000e+00 1.000e+00 6.400e+01] - Target: 300.0


The next step is to normalize the training data, that is, to scale each feature to a comparable range so that gradient descent converges much faster.

In [47]:
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_train)
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X_train,axis=0)}")   
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")

Peak to Peak range by column in Raw        X:[2.406e+03 4.000e+00 1.000e+00 9.500e+01]
Peak to Peak range by column in Normalized X:[5.8452591  6.13529646 2.05626214 3.68533012]
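
To make the scaling step less of a black box, here is a minimal sketch that reproduces z-score normalization by hand and compares it with `StandardScaler`. It runs on synthetic data (the value ranges are invented for illustration), so the numbers will not match the house dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix; ranges are illustrative only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=[500, 1, 1, 0], high=[3000, 5, 2, 100], size=(50, 4))

# z-score: subtract the column mean, divide by the column standard deviation.
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
X_manual = (X_demo - mu) / sigma

X_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_scaled))
print(X_manual.mean(axis=0).round(6), X_manual.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the peak-to-peak ranges shrink to the same order of magnitude.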


Then we create and fit the regression model.

In [48]:
sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X_norm, y_train)
print(sgdr)
print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")

SGDRegressor()
number of iterations completed: 117, number of weight updates: 11584.0
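
The two numbers reported above are related: per the scikit-learn documentation, `t_` equals `n_iter_ * n_samples + 1`, because each epoch performs one weight update per training example. The recorded output (117 epochs, 11584 updates) would be consistent with a training set of 99 examples, since 117 × 99 + 1 = 11584. The sketch below checks the relation on synthetic data; the data, the true weights and `random_state=0` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic, already well-scaled data standing in for the normalized features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(99, 4))
true_w = np.array([3.0, -1.0, 0.5, 2.0])  # illustrative weights
y_demo = X_demo @ true_w + 10.0 + rng.normal(scale=0.1, size=99)

model = SGDRegressor(max_iter=1000, random_state=0)
model.fit(X_demo, y_demo)

n_samples = X_demo.shape[0]
print(f"epochs: {model.n_iter_}, weight updates: {model.t_}")
print(model.t_ == model.n_iter_ * n_samples + 1)
```

Training usually stops well before `max_iter` because the default stopping criterion (`tol`) halts once the loss stops improving.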


We can now inspect the learned parameters. Because the model was fit on normalized data, they apply to the normalized features, not the raw ones.

In [49]:
b_norm = sgdr.intercept_
w_norm = sgdr.coef_
print(f"model parameters:                   w: {w_norm}, b:{b_norm}")

model parameters:                   w: [110.05889707 -21.01156267 -32.42551666 -38.06763388], b:[363.15290664]
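
If parameters on the original feature scale are needed, they can be recovered from the normalized ones. Substituting $x_j^{norm} = (x_j - \mu_j)/\sigma_j$ into the model gives $w_j^{raw} = w_j / \sigma_j$ and $b^{raw} = b - \sum_j w_j \mu_j / \sigma_j$. The sketch below verifies this on synthetic data; the data and the names `w_raw` and `b_raw` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic raw features with very different scales (illustrative values only).
rng = np.random.default_rng(0)
X_raw = rng.uniform(low=[500, 1, 1, 0], high=[3000, 5, 2, 100], size=(100, 4))
y_demo = 0.2 * X_raw[:, 0] - 1.0 * X_raw[:, 3] + 50.0

scaler = StandardScaler()
X_norm = scaler.fit_transform(X_raw)
model = SGDRegressor(max_iter=1000, random_state=0).fit(X_norm, y_demo)

# Map the parameters back to the raw feature scale.
w_raw = model.coef_ / scaler.scale_
b_raw = model.intercept_[0] - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)

# Both parameterizations describe the same model, so predictions agree.
print(np.allclose(X_raw @ w_raw + b_raw, model.predict(X_norm)))
```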


### Making Predictions

Now that the model has learned the weights and bias, we can make predictions with them as follows:

In [50]:
y_pred_sgd = sgdr.predict(X_norm)
# The prediction can also be computed manually from the learned w and b using a NumPy dot product
y_pred = np.dot(X_norm, w_norm) + b_norm
# Compare with a tolerance rather than exact equality, since these are floating-point values
print(f"prediction using np.dot() and sgdr.predict match: {np.allclose(y_pred, y_pred_sgd)}")
print(f"Prediction on training set:\n{y_pred[:4]}" )
print(f"Target values \n{y_train[:4]}")

prediction using np.dot() and sgdr.predict match: True
Prediction on training set:
[295.17713373 485.87932857 389.6014757  492.04143246]
Target values 
[300.  509.8 394.  540. ]
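
The predictions above are made on the training set. For a new example, the same scaler that was fitted on the training data must be applied first, using `transform` rather than `fit_transform`, so that the training mean and standard deviation are reused. The following is a minimal sketch on synthetic data; the training data and the new house values in `x_new` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic training data (illustrative values only).
rng = np.random.default_rng(0)
X_raw = rng.uniform(low=[500, 1, 1, 0], high=[3000, 5, 2, 100], size=(100, 4))
y_demo = 0.2 * X_raw[:, 0] - 1.0 * X_raw[:, 3] + 50.0

scaler = StandardScaler()
X_norm = scaler.fit_transform(X_raw)
model = SGDRegressor(max_iter=1000, random_state=0).fit(X_norm, y_demo)

# A hypothetical new house: size, bedrooms, floors, age.
x_new = np.array([[1200.0, 3.0, 1.0, 40.0]])

# Reuse the training statistics; do not refit the scaler on new data.
x_new_norm = scaler.transform(x_new)
y_new = model.predict(x_new_norm)
print(y_new)
```

Passing raw, unscaled features to `predict` would silently produce meaningless values, because the weights were learned on the normalized scale.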
