This notebook demonstrates the implementation and application of a multivariate linear regression model for the "housing" dataset. You can download the dataset from http://lib.stat.cmu.edu/datasets/.

**Comments:** "*The file cadata.txt contains all the variables. Specifically, it contains median house value, median income, housing median age, total rooms, total bedrooms, population, households, latitude, and longitude in that order. The dependent variable is ln(median house value).*"
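
To keep the column order from the description at hand, it can be written down as a list of labels. The following is only a sketch: the label strings and the helper `column` are hypothetical names chosen for illustration and are not part of cadata.txt, and the toy array contains made-up values.

```python
import numpy

# Hypothetical column labels (not part of cadata.txt); the order follows
# the dataset description quoted above.
COLUMNS = [
    "median_house_value",
    "median_income",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "latitude",
    "longitude",
]

def column(data, name):
    """Return the column of a 2D array `data` that carries the label `name`."""
    return data[:, COLUMNS.index(name)]

# Toy array with made-up values, used only to show the indexing.
toy = numpy.arange(18, dtype=float).reshape(2, 9)
toy_income = column(toy, "median_income")
```

With such a helper, `column(data, "median_house_value")` would select the first column of the loaded array, i.e. the target.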

In [None]:
import numpy

# load data from text file (note that the initial comments have been deleted from the original file)
data = numpy.loadtxt("cadata.txt")

# the first column holds the target variable (median house value); the remaining eight columns hold the features


In [None]:
# Set a random seed for reproducibility and shuffle the data
numpy.random.seed(1)
numpy.random.shuffle(data)

# Split the shuffled data into training and testing sets
split_index = int(0.8 * len(data))  # 80% for training, 20% for testing
train_data = data[:split_index]
test_data = data[split_index:]

# Separate features and target variable:
# the first column is the target; the remaining columns are the features
X_train, y_train = train_data[:, 1:], train_data[:, 0]
X_test, y_test = test_data[:, 1:], test_data[:, 0]
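
The shuffle-and-split step can also be wrapped in a small function. This is a minimal sketch under stated assumptions: `shuffle_and_split` is a hypothetical helper, and it uses `numpy.random.default_rng` instead of the global `numpy.random.seed`, so it will generally produce a different permutation than the cell above. Unlike `numpy.random.shuffle`, it leaves the input array untouched.

```python
import numpy

def shuffle_and_split(data, train_fraction=0.8, seed=1):
    """Shuffle the rows of a copy of `data` and split it into two parts."""
    rng = numpy.random.default_rng(seed)
    # permutation() shuffles along the first axis and returns a copy,
    # so each row (target plus features) stays intact
    shuffled = rng.permutation(data)
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

# Toy array with made-up values: 10 rows, 3 columns.
toy = numpy.arange(30, dtype=float).reshape(10, 3)
toy_train, toy_test = shuffle_and_split(toy)
```

For the toy array, the first 8 rows of the shuffled copy end up in `toy_train` and the remaining 2 in `toy_test`.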

We first import the "LinearRegression" class from the "linreg" module and instantiate the "model" object. Afterwards, we call the "fit" method to fit the model to the training data (i.e., to compute the weights).

In [None]:
from linreg import LinearRegression

In [None]:
model = LinearRegression(lam=0.1, penalize_constant=False)
model.fit(X_train, y_train)
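
The source of the "linreg" module is not shown here, so the following is only a guess at what "fit" might compute. It is a sketch assuming a ridge-regularized least-squares model with the closed-form solution w = (X^T X + lam * I)^(-1) X^T y, where `penalize_constant=False` excludes the intercept from the penalty. The class `RidgeSketch` is hypothetical, and the actual implementation may differ (for example, in how `lam` is scaled).

```python
import numpy

class RidgeSketch:
    """Hypothetical stand-in for linreg.LinearRegression (an assumption,
    not the actual implementation)."""

    def __init__(self, lam=0.0, penalize_constant=False):
        self.lam = lam
        self.penalize_constant = penalize_constant

    @staticmethod
    def _add_constant(X):
        # prepend a column of ones so that the first weight is the intercept
        X = numpy.asarray(X, dtype=float)
        return numpy.hstack([numpy.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xb = self._add_constant(X)
        y = numpy.asarray(y, dtype=float)
        # regularization matrix lam * I; optionally leave the intercept unpenalized
        reg = self.lam * numpy.eye(Xb.shape[1])
        if not self.penalize_constant:
            reg[0, 0] = 0.0
        # solve the linear system (Xb^T Xb + reg) w = Xb^T y for the weights
        self.w = numpy.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)
        return self

    def predict(self, X):
        return self._add_constant(X) @ self.w

# Noise-free toy data with known weights (made-up values).
rng = numpy.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 2.0 + X_toy @ numpy.array([1.0, -3.0, 0.5])
sketch = RidgeSketch(lam=0.0).fit(X_toy, y_toy)
```

With `lam=0.0` the penalty vanishes and the sketch reduces to ordinary least squares, so on the noise-free toy data it recovers the intercept and weights used to generate `y_toy` up to numerical precision.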

Given the fitted model, we can now obtain predictions for new data points, i.e., for our test set.

In [None]:
preds = model.predict(X_test)

Finally, we assess the quality of our model by computing the root mean squared error (RMSE) on the test set and by plotting the predictions against the true labels.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

from sklearn.metrics import mean_squared_error

# compute RMSE
print("RMSE: {}".format(numpy.sqrt(mean_squared_error(y_test, preds))))

# visualize predictions vs. true labels
fig = plt.figure(figsize=(8,8))
plt.scatter(preds, y_test, color="blue", alpha=0.5)
plt.xticks(rotation=45)
plt.gca().xaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.plot([-100000,600000], [-100000, 600000], 'k--')
plt.xlabel("Predictions")
plt.ylabel("True Labels")
plt.xlim([-100000,600000])
plt.ylim([-100000,600000])
plt.title("Evaluation of Regression Model")
plt.show()
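
For reference, the RMSE is the square root of the mean of the squared differences between true labels and predictions. A minimal sketch with plain numpy follows; the helper name `rmse` is hypothetical and the toy values are made up.

```python
import numpy

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = numpy.asarray(y_true, dtype=float)
    y_pred = numpy.asarray(y_pred, dtype=float)
    return float(numpy.sqrt(numpy.mean((y_true - y_pred) ** 2)))

# Toy example: only the last prediction is off (by 2), so the RMSE is sqrt(4 / 3).
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Calling `rmse(y_test, preds)` should give the same value as the `mean_squared_error`-based expression in the cell above.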