In [13]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


train_df = pd.read_csv("/Users/haderie/Downloads/housing/train.csv")
train_df = train_df.drop(columns=["zipcode"])

X_train = train_df.drop(columns=["price"])  # features
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
y_train = train_df["price"] / 1000  # target: price in thousands


test_df = pd.read_csv("/Users/haderie/Downloads/housing/test.csv")
test_df = test_df.drop(columns=["id", "date", "zipcode"])

X_test = test_df.drop(columns=["price"])  # features

y_test = test_df["price"] / 1000  # target: price in thousands
X_test_scaled = scaler.transform(X_test)  # reuse the training-set mean and std



In this problem, you will implement your own gradient descent algorithm and apply it to linear regression on the same house price prediction dataset.

1. Write code for gradient descent for training linear regression using the algorithm from class.

In [14]:

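def gradient_descent(X, y, alpha, num_iters):
    """
    Batch gradient descent for linear regression with the MSE cost.
    X: (N, d) design matrix, intercept column already included
    y: (N,) target values
    alpha: learning rate
    num_iters: number of iterations
    Returns theta: (d,) parameter vector after num_iters updates
    """
    N, d = X.shape  # number of data points, number of columns (features + intercept)
    y = np.asarray(y, dtype=float)  # accept a pandas Series or a NumPy array
    theta = np.zeros(d)  # start from the zero vector

    for _ in range(num_iters):
        # gradient of (1/N) * ||X @ theta - y||^2 with respect to theta
        gradient = (2 / N) * X.T @ (X @ theta - y)
        theta = theta - alpha * gradient

    return theta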
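
As a quick sanity check (not part of the assignment), the function above can be compared against the closed-form least-squares solution on a small synthetic problem. The data, the seed, and the names `X_demo`, `y_demo`, `theta_gd`, and `theta_ls` are illustrative assumptions. The function is repeated so the snippet runs on its own.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Same update rule as above: theta <- theta - alpha * (2/N) X^T (X theta - y)."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        gradient = (2 / N) * X.T @ (X @ theta - y)
        theta = theta - alpha * gradient
    return theta

# Synthetic data (assumption, for illustration only): 200 points, 3 features.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 3))
X_demo = np.hstack([np.ones((200, 1)), features])  # prepend intercept column
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y_demo = X_demo @ true_theta + 0.1 * rng.standard_normal(200)

theta_gd = gradient_descent(X_demo, y_demo, alpha=0.1, num_iters=500)
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)  # least-squares reference

# Largest coordinate-wise gap between the two parameter vectors.
print(np.max(np.abs(theta_gd - theta_ls)))
```

If the implementation is correct and the learning rate is stable, the printed gap should be close to zero, since both methods minimize the same convex cost.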

2. Vary the value of the learning rate (at least 3 different values $\alpha \in \{0.01,0.1,0.5\}$) and report the value of the model parameter $\theta$ after different numbers of iterations (10, 50, and 100). Include in a table the MSE and $R^2$ metrics on the training and testing sets for the different numbers of iterations and different learning rates. You can choose more values of the learning rate to observe how the behavior of the algorithm changes.

In [15]:
# add an intercept column of ones; new names keep this cell safe to re-run
X_train_b = np.hstack([np.ones((X_train_scaled.shape[0], 1)), X_train_scaled])
X_test_b = np.hstack([np.ones((X_test_scaled.shape[0], 1)), X_test_scaled])


results = []

learning_rates = [0.01, 0.1, 0.5]
iterations_list = [10, 50, 100]

for alpha in learning_rates:  # for each learning rate, try each iteration count
    for num_iters in iterations_list:
        theta = gradient_descent(X_train_b, y_train, alpha, num_iters)

        y_train_pred = X_train_b @ theta
        y_test_pred = X_test_b @ theta

        results.append({
            "alpha": alpha,
            "iterations": num_iters,
            "Train MSE": mean_squared_error(y_train, y_train_pred),
            "Train R^2": r2_score(y_train, y_train_pred),
            "Test MSE": mean_squared_error(y_test, y_test_pred),
            "Test R^2": r2_score(y_test, y_test_pred)
        })

results_df = pd.DataFrame(results)
print(results_df)


   alpha  iterations      Train MSE      Train R^2       Test MSE       Test R^2
0   0.01          10   2.357311e+05  -1.047393e+00   2.828668e+05  -6.965872e-01
1   0.01          50   6.969578e+04   3.946717e-01   9.432003e+04   4.342845e-01
2   0.01         100   3.676495e+04   6.806857e-01   6.127904e+04   6.324587e-01
3   0.10          10   3.504793e+04   6.955985e-01   6.000379e+04   6.401074e-01
4   0.10          50   3.142706e+04   7.270468e-01   5.889054e+04   6.467845e-01
5   0.10         100   3.141602e+04   7.271427e-01   5.883993e+04   6.470881e-01
6   0.50          10   1.464434e+17  -1.271904e+12   1.632452e+17  -9.791172e+11
7   0.50          50   1.293867e+67  -1.123761e+62   1.442316e+67  -8.650767e+61
8   0.50         100  3.504812e+129 -3.044031e+124  3.906928e+129 -2.343309e+124


3. Write some observations about the behavior of the algorithm: How do the metrics change with different learning rates; How many iterations are needed; Does the algorithm converge to the optimal solution, etc.

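The results show that the learning rate determines whether gradient descent converges on this problem and, when it does, how quickly. Because the target was divided by 1000, the MSE values are in squared thousands of the price unit.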

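When α = 0.01, the algorithm converges slowly. After 10 iterations the model is worse than predicting the mean (train $R^2 \approx -1.05$). After 100 iterations it reaches train $R^2 \approx 0.68$ (test $\approx 0.63$), which is still below the values obtained with α = 0.1, so more than 100 iterations are needed at this rate.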

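When α = 0.1, the algorithm converges much faster: train $R^2$ is already about 0.696 after 10 iterations and 0.727 after 50. Between 50 and 100 iterations the train MSE changes only from about 31,427 to 31,416, which suggests the iterates are essentially at the minimum after roughly 50 iterations.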
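
The plateau described above suggests a simple way to decide how many iterations are needed: record the training MSE at every step and stop once the relative improvement becomes negligible. The sketch below is one possible variant, not the algorithm from class. The function name `gradient_descent_with_history`, the tolerance `tol`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent_with_history(X, y, alpha, max_iters, tol=1e-6):
    """Gradient descent that records the training MSE after every update and
    stops early when the relative improvement drops below `tol`.
    Note: the same test also stops the loop if the MSE increases (divergence)."""
    N, d = X.shape
    y = np.asarray(y, dtype=float)
    theta = np.zeros(d)
    history = [np.mean((X @ theta - y) ** 2)]  # MSE at the starting point
    for _ in range(max_iters):
        theta = theta - alpha * (2 / N) * X.T @ (X @ theta - y)
        mse = np.mean((X @ theta - y) ** 2)
        history.append(mse)
        if history[-2] - mse < tol * history[-2]:
            break
    return theta, history

# Synthetic data (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((200, 1)), rng.standard_normal((200, 3))])
y_demo = X_demo @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.standard_normal(200)

theta_demo, history = gradient_descent_with_history(X_demo, y_demo, alpha=0.1, max_iters=1000)
print("updates performed:", len(history) - 1)
print("final training MSE:", history[-1])
```

Plotting `history` for each learning rate would show the same pattern as the table: a slow decline for a small rate and a fast decline followed by a flat tail for a well-chosen one.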

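However, when α = 0.5, the algorithm diverges: the train MSE grows from about $1.5 \times 10^{17}$ after 10 iterations to about $3.5 \times 10^{129}$ after 100, and $R^2$ becomes hugely negative. The learning rate is too large for this cost surface, so each update overshoots the minimum by more than the previous error and the error grows with every iteration.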
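
The divergence has a standard explanation for quadratic costs. With the update $\theta \leftarrow \theta - \alpha \cdot \frac{2}{N} X^\top (X\theta - y)$, the Hessian of the MSE is $H = \frac{2}{N} X^\top X$, and the iterates are only guaranteed to converge when $\alpha < 2 / \lambda_{\max}(H)$. The sketch below computes that bound and checks it on synthetic, deliberately correlated features. The helper names `max_stable_learning_rate` and `run_gd_mse` and the data are assumptions for illustration, and the bound for the housing data was not computed here.

```python
import numpy as np

def max_stable_learning_rate(X):
    """Return 2 / lambda_max(H) for H = (2/N) X^T X, the Hessian of the MSE cost."""
    N = X.shape[0]
    hessian = (2 / N) * X.T @ X
    lambda_max = np.linalg.eigvalsh(hessian)[-1]  # eigvalsh sorts eigenvalues ascending
    return 2 / lambda_max

def run_gd_mse(X, y, alpha, num_iters):
    """Run the same update rule from theta = 0 and return the final training MSE."""
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        theta = theta - alpha * (2 / N) * X.T @ (X @ theta - y)
    return np.mean((X @ theta - y) ** 2)

# Synthetic, deliberately correlated features (assumption, for illustration only).
rng = np.random.default_rng(0)
base = rng.standard_normal((300, 1))
features = base + 0.3 * rng.standard_normal((300, 3))
X_demo = np.hstack([np.ones((300, 1)), features])
y_demo = X_demo @ np.array([1.0, 2.0, -3.0, 0.5]) + 0.1 * rng.standard_normal(300)

alpha_max = max_stable_learning_rate(X_demo)
mse_start = np.mean(y_demo ** 2)  # MSE at theta = 0
mse_below = run_gd_mse(X_demo, y_demo, 0.9 * alpha_max, 200)  # just under the bound
mse_above = run_gd_mse(X_demo, y_demo, 1.1 * alpha_max, 200)  # just over the bound
print(alpha_max, mse_start, mse_below, mse_above)
```

Applying `max_stable_learning_rate` to the scaled training matrix (with its intercept column) would show where 0.5 falls relative to the bound. Strongly correlated features raise $\lambda_{\max}$ and therefore lower the largest usable learning rate.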

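Overall, α = 0.1 is the best of the three rates tested: it is large enough to converge within roughly 50 iterations and small enough to remain stable. Because the MSE cost of linear regression is convex, the point it converges to is the global minimum, i.e., the least-squares solution. The lower test $R^2$ (about 0.647 versus 0.727 on the training set) is a generalization gap rather than a sign that the optimization has not converged.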