## Question 3

In [None]:
pip install -r requirements.txt

In [12]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

In [13]:
## Load and clean data
df_train = pd.read_csv("train.csv")
df_train = df_train.drop(columns=["zipcode"])

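# Standardize each feature column to zero mean and unit variance
columns = df_train.drop(columns=["price"]).columns.tolist()
df_train[columns] = (df_train[columns] - df_train[columns].mean()) / df_train[columns].std()
# Rescale the target by a factor of 1000
df_train["price"] = df_train["price"] / 1000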

df_test = pd.read_csv("test.csv")
df_test = df_test.drop(columns=["zipcode", "id", "date"])

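# Standardize the test features (this uses the test set's own mean and standard deviation)
columns = df_test.drop(columns=["price"]).columns.tolist()
df_test[columns] = (df_test[columns] - df_test[columns].mean()) / df_test[columns].std()
# Rescale the target by a factor of 1000, matching the training set
df_test["price"] = df_test["price"] / 1000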

# print(df_train.describe().round(2))
# print(df_test.describe().round(2))
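
The cell above standardizes the test features with the test set's own mean and standard deviation. A common alternative is to reuse the training statistics, so that both sets are expressed on the scale the model was fit on. The following is a minimal sketch of that alternative, assuming both frames share the same feature columns. The helper name `standardize_with_train_stats` and the toy data are hypothetical and not part of this notebook, and switching to this approach would change the test metrics reported below.

```python
import pandas as pd


def standardize_with_train_stats(train, test, feature_cols):
    """Scale train and test features using the training mean and std only.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    mu = train[feature_cols].mean()
    sigma = train[feature_cols].std()  # pandas default: sample std (ddof=1)

    train_scaled = train.copy()
    test_scaled = test.copy()
    train_scaled[feature_cols] = (train[feature_cols] - mu) / sigma
    test_scaled[feature_cols] = (test[feature_cols] - mu) / sigma
    return train_scaled, test_scaled


# Toy data (made up) to show the behaviour
toy_train = pd.DataFrame({"sqft": [1000.0, 2000.0, 3000.0],
                          "price": [200.0, 400.0, 600.0]})
toy_test = pd.DataFrame({"sqft": [2000.0, 4000.0],
                         "price": [410.0, 790.0]})

train_s, test_s = standardize_with_train_stats(toy_train, toy_test, ["sqft"])
print(train_s)
print(test_s)
```

With this approach the test features are generally not exactly zero-mean or unit-variance, which is expected: they are measured relative to the training distribution.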

## Question 3.1

In [14]:
## Define linear regression using the closed-form solution

class LinearRegression:

    def __init__(self):
        self.beta = None

    def fit(self, X, y):
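        # Prepend a column of ones so that beta[0] is the intercept
        ones = np.ones((X.shape[0], 1))
        X_b = np.hstack((ones, X))

        # Normal equation: beta = (X^T X)^{-1} X^T y
        self.beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y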
        
    def predict(self, X):
        ones = np.ones((X.shape[0], 1))
        X_b = np.hstack((ones, X))
        return X_b @ self.beta
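
The class above computes the closed-form solution by explicitly inverting `X_b.T @ X_b`. The same coefficients can usually be obtained without forming the inverse, by solving the linear system directly, which tends to be more robust when the matrix is poorly conditioned. The following is a minimal sketch on synthetic, noise-free data; the function name `fit_normal_equation` and the demo arrays are assumptions for illustration and are not part of this notebook.

```python
import numpy as np


def fit_normal_equation(X, y):
    """Solve (X_b^T X_b) beta = X_b^T y without an explicit matrix inverse.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta[0] is the intercept
    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


# Synthetic, noise-free data with known coefficients (made up for the demo)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_beta = np.array([2.0, 1.0, -3.0, 0.5])  # intercept followed by 3 slopes
y_demo = true_beta[0] + X_demo @ true_beta[1:]

beta_hat = fit_normal_equation(X_demo, y_demo)

# Cross-check against NumPy's least-squares solver
X_demo_b = np.hstack((np.ones((X_demo.shape[0], 1)), X_demo))
beta_lstsq, *_ = np.linalg.lstsq(X_demo_b, y_demo, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Because the demo data has no noise, both solvers should recover `true_beta` up to floating-point error.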

In [15]:
## Train the model

model = LinearRegression()

X_train = df_train.drop(columns=["price"])
y_train = df_train["price"]

model.fit(X_train, y_train)

In [16]:
## Evaluate model

y_train_pred = model.predict(X_train)

train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

X_test = df_test.drop(columns=["price"])
y_test = df_test["price"]

print("Training MSE:", train_mse)
print("Training R^2:", train_r2)

y_test_pred = model.predict(X_test)

test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("Testing MSE:", test_mse)
print("Testing R^2:", test_r2)

Training MSE: 31694.527700403145
Training R^2: 0.7247237650377227
Testing MSE: 58380.59120259212
Testing R^2: 0.6498431009314046
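
For reference, the two metrics printed above can be written out directly: MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the targets. The following is a minimal NumPy sketch of those definitions; the function names `mse` and `r2` and the toy values are hypothetical and are only meant to mirror what `mean_squared_error` and `r2_score` compute in the default single-target case.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy values (made up) small enough to check by hand
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

print("MSE:", mse(y_true_demo, y_pred_demo))
print("R^2:", r2(y_true_demo, y_pred_demo))
```

Since `price` was divided by 1000 before fitting, the MSE values reported in this notebook are in squared units of that rescaled target.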


## Question 3.2

Comparing the closed-form model to the scikit-learn model, the training MSE and R^2 are extremely similar. The package model has a slightly smaller MSE and a slightly larger R^2 (31415.747 and 0.727, respectively) than the closed-form model (31694.528 and 0.725), so the package model fits the training data slightly better. On the test data we see the opposite: the closed-form model has a slightly lower MSE and a slightly higher R^2 (58380.591 and 0.650, respectively) than the package model (59887.872 and 0.640). The difference between the two models is larger on the test metrics than on the training metrics. It also follows that the gap between training and test performance is smaller for the closed-form model than for the package model, which suggests that the closed-form model may overfit slightly less.
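
The train/test gap comparison can be checked with a few lines of arithmetic. The sketch below uses the closed-form metrics printed in Question 3.1 and the package-model metrics quoted in the paragraph above; the dictionary layout and the helper name `generalization_gap` are hypothetical choices made for illustration.

```python
# Closed-form metrics as printed in Question 3.1
closed_form = {
    "train_mse": 31694.527700403145,
    "test_mse": 58380.59120259212,
    "train_r2": 0.7247237650377227,
    "test_r2": 0.6498431009314046,
}

# Package (scikit-learn) metrics as quoted in the Question 3.2 text
package = {
    "train_mse": 31415.747,
    "test_mse": 59887.872,
    "train_r2": 0.727,
    "test_r2": 0.640,
}


def generalization_gap(metrics):
    """Return (test MSE - train MSE, train R^2 - test R^2).

    Hypothetical helper; larger values mean a bigger drop from train to test.
    """
    mse_gap = metrics["test_mse"] - metrics["train_mse"]
    r2_gap = metrics["train_r2"] - metrics["test_r2"]
    return mse_gap, r2_gap


closed_mse_gap, closed_r2_gap = generalization_gap(closed_form)
package_mse_gap, package_r2_gap = generalization_gap(package)

print("Closed form: MSE gap", round(closed_mse_gap, 3), "R^2 gap", round(closed_r2_gap, 3))
print("Package:     MSE gap", round(package_mse_gap, 3), "R^2 gap", round(package_r2_gap, 3))
```

Both gaps come out smaller for the closed-form model, which is the basis for the tentative conclusion about overfitting.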