### Dataset

The dataset for this assignment is available [here](https://drive.google.com/drive/folders/1FV7qofHD_olqygrIjASLQ7k7LTyVSEGG?usp=sharing). The prediction task is to predict the price of a house (column `price`) given the other features. Please ignore the columns `id` and `date`, as well as the categorical column `zipcode`. File `kc_house_data.csv` includes all the records in the dataset. The training file `train.csv` and testing file `test.csv` each include 1000 records extracted from the dataset. Please apply the following transformations to the data before using it for this homework:

- Scale the data so that each feature has mean 0 and standard deviation 1.
- Divide the price by 1000 for all rows in the dataset, so that prices are expressed in thousands of dollars. This reduces the magnitude of the MSE.
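
The two transformations above can be sketched as follows. This is a minimal illustration on a small made-up array: the values and the names `X`, `price`, `mu`, and `sigma` are hypothetical and not taken from the dataset. In practice the mean and standard deviation would be computed on the training features and reused for the test features.

```python
import numpy as np

# Hypothetical toy data: 4 houses, 2 features (not from the real dataset).
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 3.0],
              [2500.0, 4.0]])
price = np.array([200000.0, 310000.0, 405000.0, 520000.0])

# Standardize each feature column: subtract its mean, divide by its standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
X_scaled = (X - mu) / sigma

# Express the target in thousands of dollars.
price_k = price / 1000

print(X_scaled.mean(axis=0))  # each entry is approximately 0
print(X_scaled.std(axis=0))   # each entry is approximately 1
print(price_k)
```

Because MSE is measured in squared units of the target, dividing the price by 1000 divides the MSE by 10^6; it does not change $R^2$.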

#########################################################################################################################

**PROBLEM 2**

#########################################################################################################################

In [10]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

train_df = pd.read_csv("/Users/haderie/Downloads/housing/train.csv")

test_df = pd.read_csv("/Users/haderie/Downloads/housing/test.csv")



In this problem, you will use an existing package of your choice to train and test a linear regression model on the house price prediction dataset.

1. Use an existing package to train a multiple linear regression model on the training set using all the features (except the ones excluded above).

Report the coefficients of the linear regression model and the following metrics on the training data: (1) MSE metric; (2) $R^2$ metric.
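
For reference, the two metrics can be written out directly. The sketch below uses hypothetical helper names `mse` and `r2` and a made-up four-point example. It follows the standard definitions (mean of squared residuals, and one minus the ratio of residual to total sum of squares), which are also what `mean_squared_error` and `r2_score` compute for a single target.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, chosen so the arithmetic is easy to check by hand.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

print(mse(y_true, y_pred))  # 0.75 / 4 = 0.1875
print(r2(y_true, y_pred))   # 1 - 0.75 / 20 = 0.9625
```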

In [11]:
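# "id" and "date" are not present in train_df, so only "zipcode" is dropped here.
# Note: the "Unnamed: 0" column is left in and is treated as a feature (see the coefficients below).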
train_df = train_df.drop(columns=["zipcode"])


X_train = train_df.drop(columns=["price"]) # features
y_train = train_df["price"] / 1000 # target

# standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

coefficients = pd.Series(
    model.coef_,
    index=X_train.columns
)

print(coefficients, "\n")

y_train_pred = model.predict(X_train_scaled)

train_mse_skl = mean_squared_error(y_train, y_train_pred)
train_r2_skl = r2_score(y_train, y_train_pred)

print("Training MSE:", train_mse_skl)
print("Training R^2:", train_r2_skl)


Unnamed: 0        8.456024
bedrooms        -12.807339
bathrooms        18.456913
sqft_living      57.161582
sqft_lot         11.127338
floors            8.151038
waterfront       64.230911
view             47.610288
condition        12.647609
grade            92.511076
sqft_above       48.439051
sqft_basement    27.688812
yr_built        -68.043173
yr_renovated     17.341926
lat              78.129852
long             -1.437669
sqft_living15    45.479128
sqft_lot15      -12.906560
dtype: float64 

Training MSE: 31415.747916100867
Training R^2: 0.7271450489303788


2. Evaluate the model on the testing set. Report the MSE and $R^2$ metrics on the testing set.

In [None]:
# exclude columns
test_df = test_df.drop(columns=["id", "date", "zipcode"])

X_test = test_df.drop(columns=["price"]) # features
y_test = test_df["price"] / 1000  # target

X_test_scaled = scaler.transform(X_test)
y_test_pred = model.predict(X_test_scaled)

test_mse_skl = mean_squared_error(y_test, y_test_pred)
test_r2_skl = r2_score(y_test, y_test_pred)

print("Testing MSE:", test_mse_skl)
print("Testing R^2:", test_r2_skl)


Testing MSE: 58834.67397821398
Testing R^2: 0.6471195893437873


3. Interpret the results in your own words. Which features contribute most to the linear regression model? Is the model fitting the data well? How large is the model error? How do the training and testing MSE relate?

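The features that contribute most to the linear regression model are grade, latitude, year built, waterfront status, and living area. 
These variables have the largest standardized coefficients in absolute value (`grade` ≈ 92.5, `lat` ≈ 78.1, `yr_built` ≈ -68.0, `waterfront` ≈ 64.2, `sqft_living` ≈ 57.2), so a one-standard-deviation change in any of them shifts the predicted price (in thousands of dollars) the most. The negative sign on `yr_built` means that, with the other features held fixed, newer houses are predicted to be cheaper.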
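
One way to check which coefficients dominate is to sort them by absolute value. The sketch below copies the coefficients printed by the training cell into a plain dictionary (the variable names `coefficients` and `ranked` are illustrative) and lists the five largest in magnitude.

```python
# Standardized coefficients as printed by the training cell above.
coefficients = {
    "Unnamed: 0": 8.456024,
    "bedrooms": -12.807339,
    "bathrooms": 18.456913,
    "sqft_living": 57.161582,
    "sqft_lot": 11.127338,
    "floors": 8.151038,
    "waterfront": 64.230911,
    "view": 47.610288,
    "condition": 12.647609,
    "grade": 92.511076,
    "sqft_above": 48.439051,
    "sqft_basement": 27.688812,
    "yr_built": -68.043173,
    "yr_renovated": 17.341926,
    "lat": 78.129852,
    "long": -1.437669,
    "sqft_living15": 45.479128,
    "sqft_lot15": -12.906560,
}

# Rank by magnitude: the sign gives the direction of the effect, not its size.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, value in ranked[:5]:
    print(f"{name:15s} {value:8.2f}")
```

Sorting by magnitude rather than by signed value matters here, because `yr_built` has a large negative coefficient that a signed sort would place last.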

The model fits the data reasonably well, with an $R^2$ of 0.73 on training and 0.65 on testing, indicating good but imperfect explanatory power. 

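The test RMSE is √58,835 ≈ 242.6 in units of thousands of dollars, so a typical prediction error on unseen houses is roughly $243,000.
The testing MSE (58,835) is about 1.9 times the training MSE (31,416), which suggests some overfitting. A gap in this direction is expected, and the test $R^2$ of 0.65 indicates that the model still generalizes adequately to unseen data.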
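
The error magnitudes can be reproduced from the reported MSE values. The short sketch below (variable names are illustrative) converts both MSEs to RMSE, which is in the same units as the scaled price (thousands of dollars), and computes the test-to-train MSE ratio.

```python
import math

# MSE values reported above (price is in thousands of dollars).
train_mse = 31415.747916100867
test_mse = 58834.67397821398

# RMSE is the square root of MSE, so it is in thousands of dollars.
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# Ratio of test to train MSE as a rough indicator of the generalization gap.
ratio = test_mse / train_mse

print(f"Train RMSE: {train_rmse:.1f}")
print(f"Test RMSE:  {test_rmse:.1f}")
print(f"Test/train MSE ratio: {ratio:.2f}")
```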



#########################################################################################################################

**PROBLEM 3**

#########################################################################################################################


In this problem, you will implement your own linear regression model, using the closed-form solution we derived in class. You will also
compare your model with the one trained with the package in Problem 2 on the same house price prediction dataset.

- Implement the closed-form solution for multiple linear regression using matrix operations and train a model on the training set. Write a function to predict the response on a new testing point.


In [13]:
# standardized training features (the same matrix used to fit the sklearn model)
X_train_cf = X_train_scaled

# add column of 1s
X_train_cf = np.hstack([np.ones((X_train_cf.shape[0], 1)), X_train_cf])

# pseudoinverse
theta = np.linalg.pinv(X_train_cf) @ y_train


def predict(X, theta):
    X = np.hstack((np.ones((X.shape[0], 1)), X))
    return X @ theta

y_test_pred_cf = predict(X_test_scaled, theta)
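
As a sanity check on the pseudoinverse route, the sketch below compares it with solving the normal equations $(X^T X)\,\theta = X^T y$ directly. It assumes that the closed form derived in class is the normal-equation solution, and it uses synthetic data with hypothetical names (`X_aug`, `true_theta`), not the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 50 samples, 3 features, known weights plus small noise.
X = rng.normal(size=(50, 3))
true_theta = np.array([4.0, 2.0, -1.0, 0.5])  # intercept first, then one weight per feature

# Prepend a column of ones so the first entry of theta acts as the intercept.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
y = X_aug @ true_theta + 0.01 * rng.normal(size=50)

# Normal equations: solve (X^T X) theta = X^T y without forming an explicit inverse.
theta_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Pseudoinverse route, as in the cell above.
theta_pinv = np.linalg.pinv(X_aug) @ y

print(np.allclose(theta_normal, theta_pinv))
print(theta_pinv)
```

When the design matrix has full column rank, the two routes give the same coefficients. The pseudoinverse remains well defined when $X^T X$ is singular, for instance if one feature is an exact linear combination of others, in which case it returns the minimum-norm least-squares solution.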


- Compare the models given by your implementation with those trained in Problem 2 by the Python packages. Report the MSE and $R^2$
metrics for the models you implemented on both training and testing sets and compare these metrics to the ones given by the package
implementation from Problem 2. Discuss if the results of your implementation are similar to those of the package.

In [14]:
# prediction
y_train_pred_cf = predict(X_train_scaled, theta)
y_test_pred_cf = predict(X_test_scaled, theta)

# metrics
train_mse_cf = mean_squared_error(y_train, y_train_pred_cf)
train_r2_cf  = r2_score(y_train, y_train_pred_cf)

test_mse_cf = mean_squared_error(y_test, y_test_pred_cf)
test_r2_cf  = r2_score(y_test, y_test_pred_cf)


print("Training Metrics")
print("Sklearn   - MSE:", train_mse_skl, "R^2:", train_r2_skl)
print("ClosedForm - MSE:", train_mse_cf,  "R^2:", train_r2_cf)

print("\nTesting Metrics")
print("Sklearn   - MSE:", test_mse_skl, "R^2:", test_r2_skl)
print("ClosedForm - MSE:", test_mse_cf,  "R^2:", test_r2_cf)


Training Metrics
Sklearn   - MSE: 31415.747916100867 R^2: 0.7271450489303788
ClosedForm - MSE: 31415.747916100874 R^2: 0.7271450489303787

Testing Metrics
Sklearn   - MSE: 58834.67397821398 R^2: 0.6471195893437873
ClosedForm - MSE: 58834.673978213905 R^2: 0.6471195893437878


The closed-form implementation produces numerically equivalent results to the sklearn implementation on both the training and testing sets. The MSE and $R^2$ values agree to about 15 significant digits; the remaining differences are due to floating-point rounding.
This confirms that the manual implementation correctly computes the closed-form solution for linear regression and that both approaches produce the same fitted model and predictions.
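
The agreement can be checked with a relative tolerance instead of exact equality. The sketch below takes the metric values printed above and compares each pair with `math.isclose`; the tolerance of `1e-12` is an arbitrary choice for illustration.

```python
import math

# Metrics printed above: (scikit-learn value, closed-form value).
reported = {
    "train_mse": (31415.747916100867, 31415.747916100874),
    "train_r2": (0.7271450489303788, 0.7271450489303787),
    "test_mse": (58834.67397821398, 58834.673978213905),
    "test_r2": (0.6471195893437873, 0.6471195893437878),
}

for name, (skl, cf) in reported.items():
    # Relative comparison: exact equality is too strict for floating-point results.
    same = math.isclose(skl, cf, rel_tol=1e-12)
    print(f"{name}: agree={same}, abs diff={abs(skl - cf):.2e}")
```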