 # Linear Regression with SGD (Scikit-learn)

 ## Introduction

 This notebook fits a linear regression model to the student dataset with stochastic gradient descent (SGD), using the project's scikit-learn-based `LinearRegressionSK` wrapper.

 ## Data Loading

 Load the preprocessed student data and print the array shapes.

In [None]:
import os
import sys

import numpy as np

# Set project root directory and add it to the system path
project_root = os.path.abspath(os.path.join(os.getcwd(), "..", "..", ".."))
sys.path.append(project_root)


from src.scratch.utils.viz_utils import plot_scatter_for_regression


X_train = np.load("../../../data/processed/student_X_train.npy")
X_test = np.load("../../../data/processed/student_X_test.npy")
y_train = np.load("../../../data/processed/student_y_train.npy")
y_test = np.load("../../../data/processed/student_y_test.npy")

print("Training features shape:", X_train.shape)
print("Test features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Test target shape:", y_test.shape)

 ## Exploratory Data Analysis

 Plot the first feature (`feature_index=0`) against the target for the training set.

In [None]:
plot_scatter_for_regression(X_train, y_train, feature_index=0, title="Feature 1 vs Target", filename="feature1_vs_target_sgd_sk.png")


 ## Model Initialization

 Initialize the `LinearRegressionSK` wrapper in SGD mode with a small learning rate, a high iteration cap, and early stopping.

In [None]:
from src.sklearn_impl.linear_regression_sk import LinearRegressionSK

model = LinearRegressionSK(
    method="sgd",
    learning_rate=0.00001,
    n_iterations=1000000,
    verbose=True,
    early_stopping=True,
    n_iter_no_change=200,
)
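
 The internals of `LinearRegressionSK` are not shown in this notebook. As a rough sketch only, and assuming the wrapper forwards its arguments to scikit-learn's `SGDRegressor`, a comparable configuration might look like the following. The argument mapping, the synthetic data, and the names `X_demo`, `y_demo`, and `sgd` are illustrative assumptions, not the project's code.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic stand-in data (not the student dataset): y = 3x + 2 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)

# Assumed mapping from the wrapper's arguments to SGDRegressor's:
#   learning_rate -> eta0 with a constant schedule
#   n_iterations  -> max_iter (upper bound on the number of epochs)
sgd = SGDRegressor(
    learning_rate="constant",
    eta0=0.01,
    max_iter=1000,
    alpha=0.0,                # no regularization, i.e. plain least squares
    early_stopping=True,      # hold out part of the training data for validation
    validation_fraction=0.1,
    n_iter_no_change=5,       # stop after 5 epochs without validation improvement
    random_state=0,
)
sgd.fit(X_demo, y_demo)

print("coef:", sgd.coef_, "intercept:", sgd.intercept_, "epochs run:", sgd.n_iter_)
```

 With early stopping enabled, `max_iter` acts only as a ceiling, which may be why the notebook can afford a very large `n_iterations` alongside a very small learning rate.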

 ## Training

 Train and time the model.

In [None]:
import time

# perf_counter() is a monotonic, high-resolution clock suited to timing code
start_time = time.perf_counter()
model.fit(X_train, y_train)
training_time = time.perf_counter() - start_time
print(f"Training Time: {training_time:.4f} seconds")


 ## Evaluation

 Calculate MSE and R² on the test set.

In [None]:
from src.scratch.utils.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
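
 For reference, the two metrics follow the standard definitions MSE = mean((y - ŷ)²) and R² = 1 - SS_res / SS_tot. The sketch below implements them with NumPy; the function names `mse_sketch` and `r2_sketch` are hypothetical, and the project's own `mean_squared_error` and `r2_score` helpers may differ in details such as input validation.

```python
import numpy as np

def mse_sketch(y_true, y_hat):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y_true - y_hat) ** 2))

def r2_sketch(y_true, y_hat):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Small hand-checkable example (illustrative values, not notebook results)
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_hat_demo = [2.5, 0.0, 2.0, 8.0]
print(f"MSE: {mse_sketch(y_true_demo, y_hat_demo):.4f}")
print(f"R²:  {r2_sketch(y_true_demo, y_hat_demo):.4f}")
```

 An R² of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of `y_true`.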


 ## Visualizations

 Plot the learning curve and residual diagnostics for the test-set predictions.

In [None]:
from src.scratch.utils.viz_utils import (
    plot_actual_vs_predicted,
    plot_learning_curve,
    plot_qq_residuals,
    plot_residual_histogram,
    plot_residuals_vs_predicted,
)

plot_learning_curve(model.get_loss_history(), title="Learning Curve (SGD SK)", filename="learning_curve_sgd_sk.png")
plot_actual_vs_predicted(y_test, y_pred, title="Actual vs Predicted (SGD SK)", filename="actual_vs_predicted_sgd_sk.png")
plot_residuals_vs_predicted(y_test, y_pred, title="Residuals vs Predicted (SGD SK)", filename="residuals_vs_predicted_sgd_sk.png")
plot_qq_residuals(y_test, y_pred, title="Q-Q Plot of Residuals (SGD SK)", filename="qq_residuals_sgd_sk.png")
plot_residual_histogram(y_test, y_pred, title="Residual Histogram (SGD SK)", filename="residual_histogram_sgd_sk.png")
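
 To make the Q-Q plot easier to read, the sketch below shows one common way its points can be computed: sorted, standardized residuals are paired with standard normal quantiles. The helper `qq_points` and the synthetic residuals are hypothetical, and `plot_qq_residuals` may use different plotting positions or scaling.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Standardize so the points can be compared with the y = x reference line
    r_std = (r - r.mean()) / r.std(ddof=1)
    # Plotting positions (i - 0.5) / n for i = 1..n, strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theoretical, r_std

# Synthetic residuals for illustration only (not the notebook's residuals)
rng = np.random.default_rng(0)
demo_residuals = rng.normal(scale=2.0, size=500)
theo, samp = qq_points(demo_residuals)

# For normally distributed residuals the points lie close to a straight line
corr = np.corrcoef(theo, samp)[0, 1]
print(f"Correlation between theoretical and sample quantiles: {corr:.4f}")
```

 Systematic curvature away from the straight line, especially in the tails, suggests the residuals are skewed or heavy-tailed rather than normal.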


 ## Conclusion

 The test-set MSE and R² of the scikit-learn SGD model are printed in the Evaluation section. The saved plots (files ending in `_sgd_sk.png`) can be compared with those from the "from scratch" version.