# Part 1: Linear Regression with One Feature

## Stellar Luminosity as a Function of Mass

**Objective:** Model stellar luminosity L as a function of stellar mass M using linear regression:
$$\hat{L} = wM + b$$

where:
- $w$ is the weight
- $b$ is the bias
- $M$ is stellar mass in solar masses
- $L$ is luminosity in solar luminosities

We implement everything from first principles without using ML libraries.

## Setup

In [None]:
# Install required libraries (run this once if needed)
%pip install numpy pandas matplotlib

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D 

## 1. Dataset Definition and Visualization

In [None]:
# Dataset: Stellar mass (M) and luminosity (L)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

print(f"Dataset size: {len(M)} stellar observations")
print(f"Mass range: {M.min():.1f} - {M.max():.1f} M☉")
print(f"Luminosity range: {L.min():.2f} - {L.max():.2f} L☉")

In [None]:
# Visualize the dataset
plt.figure()
plt.scatter(M, L, color='blue', label='Data Points')
plt.xlabel('Mass (M☉)')
plt.ylabel('Luminosity (L☉)')
plt.title('Stellar Mass vs Luminosity')
plt.grid(True)
plt.show()
# Print dataset information
print('Dataset shape:', M.shape)
print('Number of samples:', len(M))

### Commentary on Linearity
The relationship between mass and luminosity is clearly nonlinear.
Luminosity rises steeply as mass increases, which suggests a power-law (or exponential) relationship rather than a linear one. This is consistent with astrophysical theory: for main-sequence stars, luminosity approximately follows $L \propto M^{3}$ to $M^{4}$.

A linear model will therefore have systematic errors, underestimating luminosity at both low and high masses while overestimating it in the middle range.
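
The power-law claim above can be checked directly on the data. The following is a minimal sketch, assuming a pure power law $L = c\,M^{p}$, which becomes a straight line in log-log space. The names `p` and `log_c` are illustrative and not part of the notebook, and `np.polyfit` is used here only as a quick diagnostic, not as the from-scratch method of this assignment.

```python
import numpy as np

# Same dataset as in Section 1 (solar units)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# If L = c * M**p, then log(L) = p * log(M) + log(c),
# so a degree-1 fit in log-log space estimates the exponent p.
p, log_c = np.polyfit(np.log(M), np.log(L), deg=1)

print(f"Estimated exponent p:  {p:.2f}")
print(f"Estimated prefactor c: {np.exp(log_c):.2f}")
```

An exponent well above 1 supports the observation that a straight line in the original (M, L) space cannot follow the data.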

## 2. Model Definition and Loss Function

### Hypothesis Function
$$h(M, w, b) = wM + b$$

### Mean Squared Error Loss

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (h(M^{(i)}, w, b) - L^{(i)})^2$$

where $m$ is the number of samples.

In [None]:
#  Hypothesis function: L_hat = w * M + b
def predict(M, w, b):
    return w * M + b

# Mean Squared Error computation
def compute_mse(M, L, w, b):
    m = len(M)
    L_hat = predict(M, w, b)
    errors = L_hat - L
    mse = (1 / (2 * m)) * np.sum(errors ** 2)
    return mse

#  Test with initial parameters
test_w, test_b = 0.0, 0.0

test_predictions = predict(M, test_w, test_b)
test_loss = compute_mse(M, L, test_w, test_b)

print(f"Initial MSE with w={test_w}, b={test_b}:")
print(f"Test predictions (first 3): {test_predictions[:3]}")
print(f"Test MSE loss: {test_loss:.4f}")

## 3. Cost Surface Visualization (Mandatory)

We evaluate $J(w, b)$ on a grid to understand the loss landscape.

In [None]:
# Define grid for w and b
w_vals = np.linspace(-5, 25, 100)
b_vals = np.linspace(-10, 10, 100)
W_grid, B_grid = np.meshgrid(w_vals, b_vals)

# Compute cost for each (w, b) pair
J_grid = np.zeros_like(W_grid)

for i in range(W_grid.shape[0]):
    for j in range(W_grid.shape[1]):
        J_grid[i, j] = compute_mse(M, L, W_grid[i, j], B_grid[i, j])

print(f"Cost surface computed over {W_grid.shape[0]} x {W_grid.shape[1]} grid")
print(f"Minimum cost on grid: {J_grid.min():.4f}")


In [None]:
# 3D Surface plot of the cost function
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W_grid, B_grid, J_grid, cmap='viridis', alpha=0.8)
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Bias (b)')
ax.set_zlabel('Cost J(w, b)')
ax.set_title('Cost Function Surface')
plt.show()

# Contour plot

plt.figure(figsize=(8, 6))
contour = plt.contour(W_grid, B_grid, J_grid, levels=20,cmap='viridis')
plt.clabel(contour, inline=True, fontsize=8)
plt.xlabel('Weight (w)')
plt.ylabel('Bias (b)')
plt.title('Cost Function Contours')
plt.show()

### Cost Surface Interpretation
The minimum of the cost function $J(w,b)$ gives the parameters $(w, b)$ that minimize the prediction error, i.e. the best linear fit to the data in the least-squares sense. The cost surface is convex and bowl-shaped, so it has a unique global minimum. Gradient descent therefore converges to this solution from any starting point, provided the learning rate is small enough.
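
Because $J(w,b)$ is quadratic, the location of its minimum can also be computed in closed form, which gives a reference value for the gradient descent results. The sketch below assumes the standard least-squares formulas for a line. The names `w_star` and `b_star` are illustrative and not part of the notebook.

```python
import numpy as np

# Same dataset as in Section 1 (solar units)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Closed-form least-squares line: slope = cov(M, L) / var(M)
M_mean, L_mean = M.mean(), L.mean()
w_star = np.sum((M - M_mean) * (L - L_mean)) / np.sum((M - M_mean) ** 2)
b_star = L_mean - w_star * M_mean

# Cost at the minimum, using the same 1/(2m) convention as J(w, b)
J_star = np.sum((w_star * M + b_star - L) ** 2) / (2 * len(M))

# The surface is strictly convex as long as the masses are not all equal
is_strictly_convex = np.var(M) > 0

print(f"w* = {w_star:.4f}, b* = {b_star:.4f}, J* = {J_star:.4f}")
print(f"Strictly convex: {is_strictly_convex}")
```

The minimum found on the grid and the final gradient descent parameters should both lie close to `(w_star, b_star)`.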

## 4. Gradient Derivation and Implementation 

### Mathematical Derivation
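Given:
- Hypothesis: $\hat{L}^{(i)} = w \cdot M^{(i)} + b$
- Loss: $J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{L}^{(i)} - L^{(i)})^2$

Applying the chain rule, the factor of 2 from the square cancels the $\frac{1}{2}$, which gives the gradients: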

$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (\hat{L}^{(i)} - L^{(i)}) \cdot M^{(i)}$$

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{L}^{(i)} - L^{(i)})$$
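
A quick way to validate the derivation is to compare the analytic gradients with central finite differences of the cost. This is a minimal sketch, assuming the $\frac{1}{2m}$ cost convention used above. The helper names `cost` and `analytic_gradients`, the test point and the step `eps` are illustrative choices.

```python
import numpy as np

# Same dataset as in Section 1 (solar units)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

def cost(w, b):
    """J(w, b) with the 1/(2m) convention."""
    return np.sum((w * M + b - L) ** 2) / (2 * len(M))

def analytic_gradients(w, b):
    """Gradients from the derivation: mean(error * M) and mean(error)."""
    errors = w * M + b - L
    return np.mean(errors * M), np.mean(errors)

# Arbitrary test point and finite-difference step (illustrative values)
w0, b0, eps = 5.0, -2.0, 1e-6

# Central differences approximate the partial derivatives numerically
num_dw = (cost(w0 + eps, b0) - cost(w0 - eps, b0)) / (2 * eps)
num_db = (cost(w0, b0 + eps) - cost(w0, b0 - eps)) / (2 * eps)
ana_dw, ana_db = analytic_gradients(w0, b0)

print(f"dJ/dw: analytic = {ana_dw:.6f}, numerical = {num_dw:.6f}")
print(f"dJ/db: analytic = {ana_db:.6f}, numerical = {num_db:.6f}")
```

If the loss were defined with $\frac{1}{m}$ instead of $\frac{1}{2m}$, the numerical gradients would come out twice as large as the analytic ones, so this check also catches a mismatch between the two conventions.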

## 5. Gradient Descent (Non-Vectorized)

Task 5: Non-vectorized gradient computation using explicit loops


In [None]:
def compute_gradients_loop(M, L, w, b):
    m = len(M)
    dJ_dw = 0.0
    dJ_db = 0.0
    
    for i in range(m):
        L_hat_i = w * M[i] + b
        error_i = L_hat_i - L[i]
        dJ_dw += error_i * M[i]
        dJ_db += error_i
    
    dJ_dw /= m
    dJ_db /= m
    return dJ_dw, dJ_db

## 6. Gradient Descent (Vectorized)

Vectorized gradient computation (no sample loop)

In [None]:
def compute_gradients_vectorized(M, L, w, b):
    m = len(M)
    L_hat = w * M + b
    errors = L_hat - L
    
    dJ_dw = (1 / m) * np.sum(errors * M)
    dJ_db = (1 / m) * np.sum(errors)
    return dJ_dw, dJ_db

# Verify both implementations match
w_test, b_test = 5.0, -2.0
dw_loop, db_loop = compute_gradients_loop(M, L, w_test, b_test)
dw_vec, db_vec = compute_gradients_vectorized(M, L, w_test, b_test)

print("Gradient Verification:")
print(f"Loop version: dJ/dw = {dw_loop:.6f}, dJ/db = {db_loop:.6f}")
print(f"Vectorized version: dJ/dw = {dw_vec:.6f}, dJ/db = {db_vec:.6f}")
print(f"Match: {np.allclose([dw_loop, db_loop], [dw_vec, db_vec])}")

### Gradient Descent Implementation

In [None]:
def gradient_descent(M, L, w_init, b_init, alpha, iterations, vectorized=True):
    """Run gradient descent to optimize w and b"""
    w = w_init
    b = b_init
    cost_history = []
    
    grad_func = compute_gradients_vectorized if vectorized else compute_gradients_loop
    
    for i in range(iterations):
        dJ_dw, dJ_db = grad_func(M, L, w, b)
        w = w - alpha * dJ_dw
        b = b - alpha * dJ_db
        
        cost = compute_mse(M, L, w, b)
        cost_history.append(cost)
        
        if i % max(1, iterations // 10) == 0:  # avoid modulo by zero when iterations < 10
            print(f"Iter {i:4d}: w={w:8.4f}, b={b:8.4f}, J={cost:10.4f}")
    
    return w, b, cost_history

## 7. Experiments with Different Learning Rates (MANDATORY)

In [None]:
learning_rates = [0.001, 0.01, 0.05]
iterations = 5000
w_init, b_init = 0.0, 0.0

results = {}

for alpha in learning_rates:
    print(f"\n{'='*60}")
    print(f"Training with learning rate α = {alpha}")
    print('='*60)
    
    w_final, b_final, cost_hist = gradient_descent(
        M, L, w_init, b_init, alpha, iterations, vectorized=True
    )
    
    results[alpha] = {
        'w': w_final,
        'b': b_final,
        'final_loss': cost_hist[-1],
        'history': cost_hist
    }
    
    print(f"\nFinal Results:")
    print(f"w = {w_final:.6f}")
    print(f"b = {b_final:.6f}")
    print(f"Final Loss = {cost_hist[-1]:.6f}")

## 8. Convergence Analysis (MANDATORY)

In [None]:
plt.figure(figsize=(14, 5))

# Full convergence plot
plt.subplot(1, 2, 1)
for alpha, data in results.items():
    plt.plot(data['history'], label=f'α = {alpha}', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Cost J(w,b)')
plt.title('Convergence: Loss vs Iterations')
plt.legend()
plt.grid(True)
plt.yscale('log')

# Zoomed view (skip first iterations)
plt.subplot(1, 2, 2)
skip = 100
for alpha, data in results.items():
    plt.plot(range(skip, len(data['history'])), 
             data['history'][skip:], label=f'α = {alpha}', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Cost J(w,b)')
plt.title(f'Convergence (after {skip} iterations)')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

### Convergence Analysis

Convergence speed:
- α = 0.001: slow; the cost is still decreasing noticeably after 5000 iterations
- α = 0.01: moderate; good balance of speed and stability
- α = 0.05: fast; rapid initial decrease, converges quickly

Stability:
- All three learning rates are stable (no divergence)
- The cost decreases monotonically for all α values
- Because the cost is convex, gradient descent converges whenever α is small enough

Optimal choice:
- α = 0.01 or 0.05 is recommended for this problem
- A somewhat higher α would also remain stable; the limit is set by the curvature of $J$ (which depends on the scale of $M$), not by the size of the dataset
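
The stability observations above can be related to the curvature of the cost. For a quadratic cost, gradient descent is stable when $\alpha < 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian, and the smallest eigenvalue controls how slowly the last part of the error decays. The following is a sketch under the $\frac{1}{2m}$ cost convention. The names `H`, `lam_min`, `lam_max` and `alpha_max` are illustrative and not part of the notebook.

```python
import numpy as np

# Same masses as in Section 1 (the Hessian does not depend on L)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])

# Hessian of J(w, b) = (1/(2m)) * sum((w*M + b - L)**2); constant in (w, b)
H = np.array([[np.mean(M ** 2), np.mean(M)],
              [np.mean(M),      1.0]])

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order
lam_min, lam_max = np.linalg.eigvalsh(H)
alpha_max = 2.0 / lam_max

print(f"Eigenvalues: {lam_min:.4f}, {lam_max:.4f}")
print(f"Stability limit: alpha < {alpha_max:.3f}")

# Fraction of the slowest error component left after 5000 iterations
for alpha in (0.001, 0.01, 0.05):
    remaining = (1 - alpha * lam_min) ** 5000
    print(f"alpha = {alpha}: slowest mode reduced to {remaining:.2e} of its initial size")
```

All three learning rates used in the experiments sit well below `alpha_max`, which is consistent with the monotonic decrease seen in the plots.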


## 9. Final Fit Visualization

In [None]:
# Use the best model (lr = 0.01)
best_alpha = 0.01
w_best = results[best_alpha]['w']
b_best = results[best_alpha]['b']
print(f"Best model parameters (α={best_alpha}): w = {w_best:.6f}, b = {b_best:.6f}")

# Generate predictions using the best model
L_pred = predict(M, w_best, b_best)
M_plot = np.linspace(M.min(), M.max(), 100)
L_pred_plot = predict(M_plot, w_best, b_best)

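# Fit plot (left) and residual plot (right)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(M, L, color='blue', label='Data Points')
plt.plot(M_plot, L_pred_plot, color='red', label='Best Fit Line', linewidth=2)
plt.xlabel('Mass (M☉)')
plt.ylabel('Luminosity (L☉)')
plt.title('Stellar Mass vs Luminosity with Best Fit')
plt.legend()
plt.grid(True)

# Residuals: positive where the model underestimates, negative where it overestimates
plt.subplot(1, 2, 2)
plt.scatter(M, L - L_pred, color='green')
plt.axhline(0, color='black', linewidth=1)
plt.xlabel('Mass (M☉)')
plt.ylabel('Residual L - L_hat (L☉)')
plt.title('Residuals of the Linear Fit')
plt.grid(True)

plt.tight_layout()
plt.show()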

### Systematic Error Discussion

The residuals ($L - \hat{L}$) reveal clear systematic errors in the linear model:

- Underestimation at the extremes: The model underestimates luminosity for both low-mass stars ($M \le 0.8$) and high-mass stars ($M \ge 2.2$).

- Overestimation in the middle: The model overestimates luminosity for intermediate-mass stars ($1.0 \le M \le 2.0$).

- Physical insight: This systematic pattern confirms that the mass-luminosity relationship is nonlinear, approximately following a power law $L \propto M^{3.5}$, rather than a linear relationship.
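
The sign pattern described above can be tabulated directly. This sketch uses the least-squares line from `np.polyfit` as a stand-in for the gradient descent result (an assumption made only to keep the example self-contained); the names `w_fit`, `b_fit` and `residuals` are illustrative.

```python
import numpy as np

# Same dataset as in Section 1 (solar units)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Least-squares line (stand-in for the converged gradient descent fit)
w_fit, b_fit = np.polyfit(M, L, deg=1)

# Residual = observed - predicted; positive means the model underestimates
residuals = L - (w_fit * M + b_fit)

for m_i, r_i in zip(M, residuals):
    verdict = "underestimates" if r_i > 0 else "overestimates"
    print(f"M = {m_i:.1f}: residual = {r_i:+7.3f}  (model {verdict})")
```

Residuals that are positive at both ends and negative in between are the signature of fitting a straight line to a convex curve.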

## 10. Conceptual Questions

### Q1: Astrophysical Meaning of w

The parameter $w$ represents the rate of change of luminosity with respect to mass in our linear approximation. In physical terms:

- Units: $w$ has units of $L_\odot/M_\odot$ (solar luminosities per solar mass)
- Interpretation: For each additional solar mass, the luminosity increases by approximately $w$ solar luminosities
- Value: Our fitted $w \approx 18$ (the exact least-squares slope for this dataset is about $18.1$) suggests that in the linear approximation, luminosity increases by about $18$ $L_\odot$ for each $1$ $M_\odot$ increase in mass

However, this linear interpretation oversimplifies the true relationship, which is actually a power law, so the real rate of change is not constant across the mass range.
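
To illustrate why a single slope is only an average, the local rate of change $dL/dM$ can be estimated from the data by finite differences and compared with one global slope. This is a rough sketch: `np.gradient` gives only a coarse derivative estimate on ten points, and `np.polyfit` stands in for the gradient descent fit. The names `local_slopes`, `w_line` and `b_line` are illustrative.

```python
import numpy as np

# Same dataset as in Section 1 (solar units)
M = np.array([0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4])
L = np.array([0.15, 0.35, 1.00, 2.30, 4.10, 7.00, 11.2, 17.5, 25.0, 35.0])

# Local slope dL/dM at each data point, estimated by finite differences
local_slopes = np.gradient(L, M)

# Single global slope from a least-squares line, for comparison
w_line, b_line = np.polyfit(M, L, deg=1)

for m_i, s_i in zip(M, local_slopes):
    print(f"M = {m_i:.1f} M☉: local dL/dM ≈ {s_i:6.2f} L☉/M☉")
print(f"Least-squares slope w = {w_line:.2f} L☉/M☉")
```

The local slope grows steadily with mass, so $w$ is best read as an average rate over the sampled range rather than a physical constant.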

### Q2: Why is a Linear Model Limited?

A linear model is fundamentally limited for stellar luminosity because:

1. Physical theory: The mass-luminosity relation for main-sequence stars follows $L \propto M^{\alpha}$ with exponent $\alpha \approx 3.5$ to $4.0$ (not to be confused with the learning rate α), not $L \propto M$

2. Nuclear fusion physics: Luminosity depends on the core temperature and pressure, which scale nonlinearly with mass

3. Observed systematic errors: The U-shaped residual pattern demonstrates the model cannot capture the accelerating increase in luminosity with mass

4. Limited predictive power: The linear model will perform poorly for extrapolation beyond the training range

Solution: Polynomial or logarithmic transformations can better capture this nonlinear relationship, which we'll explore in Part 2.

Conclusion:
A linear model is structurally incapable of representing the true mass-luminosity relationship. The limitation is not in the optimization algorithm (gradient descent finds the best available line) but in the expressive power of the linear function class. This is a fundamental lesson in machine learning: model selection matters as much as optimization.