---
title: "Week 01 — Introduction to Financial Modelling & ML Basics"
week: 1
author: "Praveen Kumar"
date: 2025-10-07
duration: "2-3 hours"
prerequisites: ["Basic Python", "High school algebra"]
tags: ["intro","linear-regression","financial-modeling"]
version: v1.0
instructor_only: true
---

# Week 01 — Introduction to Financial Modelling & ML Basics

## INSTRUCTOR NOTEBOOK: Linear Regression for Stock Return Prediction

This notebook contains the complete solutions for all exercises and additional teaching materials.

**⚠️ INSTRUCTOR ONLY - Remove solution cells before distributing to students**

In [None]:
# Parameters
SEED = 42
SAMPLE_MODE = True  # Set to True for quick runs, False for full analysis
DATA_PATH = "data/synthetic/"
DATASET = "stock_prices.csv"

print("Configuration:")
print(f"SEED: {SEED}")
print(f"SAMPLE_MODE: {SAMPLE_MODE}")
print(f"DATA_PATH: {DATA_PATH}")
print(f"DATASET: {DATASET}")

## Teaching Notes for Instructors

### Expected Results and Common Student Mistakes:

1. **Linear Regression Model Performance**:
   - Expected R² on test set: 0.01 to 0.05 (stock returns are inherently noisy)
   - MSE typically around 0.0003-0.001 for daily returns
   - Students often expect higher R² values - explain that financial data is noisy

2. **Common Student Mistakes**:
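   - Using random splits instead of chronological splits (look-ahead data leakage)
   - Expecting high R² values like in other domains
   - Not standardizing features before fitting (this matters for Ridge, whose penalty depends on feature scale; plain OLS predictions are unaffected by scaling)
   - Treating a negative test R² as a bug (it only means the model did worse than predicting the test-set mean, which is common for noisy returns)
   - Forgetting to set a random seed for reproducibility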

3. **Grading Points**:
   - Correct chronological split: 20 points
   - Proper feature engineering (lags): 25 points  
   - Model implementation and training: 25 points
   - Evaluation metrics calculation: 15 points
   - Visualizations and interpretation: 15 points
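
The chronological split named in the mistakes and grading points above can be sketched on toy data as follows. This is a minimal illustration, not the student notebook's code: the synthetic series, the three lags, and the names `returns`, `features`, `train`, and `test` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy daily return series (synthetic, illustrative only)
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=200, freq="B")
returns = pd.Series(rng.normal(0.0, 0.01, size=len(dates)), index=dates, name="Return")

# Lag features use only past values, so row t never sees the return on day t or later
features = pd.DataFrame({f"Lag_{i}": returns.shift(i) for i in range(1, 4)})
data = pd.concat([features, returns], axis=1).dropna()

# Chronological split: first 80% of rows for training, last 20% for testing
split_idx = int(0.8 * len(data))
train, test = data.iloc[:split_idx], data.iloc[split_idx:]

# Sanity check: every training date precedes every test date
print("Train ends:", train.index.max().date())
print("Test starts:", test.index.min().date())
print("No overlap:", train.index.max() < test.index.min())
```

A shuffled split would fail the final check, because rows from late in the sample would end up in the training set.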

### Sample Expected Outputs:
- Linear Regression Test R²: ~0.01-0.05
- Ridge Regression typically shows similar or slightly better R²
- Most influential lag is usually Lag_1 (a positive coefficient suggests short-term momentum, a negative one short-term reversal)
- Volatility feature may improve R² by 0.01-0.02
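
When students question low or negative R² values, a tiny worked example may help. The sketch below uses made-up numbers (not notebook data) and `sklearn.metrics.r2_score`, which computes R² = 1 - SS_res / SS_tot against the mean of the true values.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])  # mean is 2.0, SS_tot = 2.0

# Predicting the mean everywhere: SS_res = 2.0, so R² = 1 - 2/2 = 0
r2_mean = r2_score(y_true, np.full(3, 2.0))

# A constant prediction away from the mean: SS_res = 5.0, so R² = 1 - 5/2 = -1.5
r2_bad = r2_score(y_true, np.full(3, 3.0))

print(f"Mean baseline R²: {r2_mean:.2f}")
print(f"Worse-than-mean R²: {r2_bad:.2f}")
```

An R² of 0.02 therefore means the model explains about 2% more variance than the mean baseline, and a negative value means it did worse than that baseline on the evaluated sample.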

## Exercise Solutions (INSTRUCTOR ONLY)
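
The solution cells below rely on names defined earlier in the student notebook (`X_train`, `y_test`, `test_metrics`, `coefficients`, `feature_columns`, `stock_data`, `metrics`, `evaluate_model`, and others). For running a cell in isolation, a hypothetical stand-in for `evaluate_model` might look like the sketch below. Its body is an assumption; the only things taken from the solutions are the call signature and the `'MSE'` and `'R2'` keys of the returned dictionary.

```python
from sklearn import metrics


def evaluate_model(y_true, y_pred, label):
    """Hypothetical stand-in: print and return MSE and R² for one set of predictions."""
    results = {
        "MSE": metrics.mean_squared_error(y_true, y_pred),
        "R2": metrics.r2_score(y_true, y_pred),
    }
    print(f"{label}: MSE={results['MSE']:.6f}, R2={results['R2']:.6f}")
    return results


# Usage with toy values: perfect predictions give MSE 0 and R² 1
demo_metrics = evaluate_model([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], "Demo")
```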

### Exercise 1 Solution: Ridge Regression Implementation

In [None]:
# INSTRUCTOR ONLY - Exercise 1 Solution: Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create Ridge regression pipeline
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge(alpha=1.0, random_state=SEED))
])

# Train Ridge model
ridge_pipeline.fit(X_train, y_train)

# Make predictions
y_train_pred_ridge = ridge_pipeline.predict(X_train)
y_test_pred_ridge = ridge_pipeline.predict(X_test)

# Evaluate Ridge model
ridge_train_metrics = evaluate_model(y_train, y_train_pred_ridge, "Ridge Training")
ridge_test_metrics = evaluate_model(y_test, y_test_pred_ridge, "Ridge Test")

# Compare Linear vs Ridge
print(f"\n{'='*60}")
print("LINEAR REGRESSION vs RIDGE REGRESSION COMPARISON")
print(f"{'='*60}")
print(f"{'Metric':<15} {'Linear (Test)':<15} {'Ridge (Test)':<15} {'Improvement':<15}")
print(f"{'-'*60}")

for metric in ['MSE', 'R2']:
    linear_val = test_metrics[metric]
    ridge_val = ridge_test_metrics[metric]
    improvement = ridge_val - linear_val if metric == 'R2' else linear_val - ridge_val
    print(f"{metric:<15} {linear_val:<15.6f} {ridge_val:<15.6f} {improvement:<15.6f}")

# Compare coefficients
ridge_coefficients = ridge_pipeline.named_steps['regressor'].coef_
linear_coefficients = coefficients

print(f"\nCoefficient Comparison:")
coef_comparison = pd.DataFrame({
    'Feature': feature_columns,
    'Linear_Coef': linear_coefficients,
    'Ridge_Coef': ridge_coefficients,
    'Difference': np.abs(linear_coefficients - ridge_coefficients)
})
print(coef_comparison)

print("\nRidge regularization effect:")
print(f"- Mean absolute coefficient difference (Linear vs Ridge): {np.mean(coef_comparison['Difference']):.6f}")
ridge_is_smaller = np.mean(np.abs(ridge_coefficients)) < np.mean(np.abs(linear_coefficients))
print(f"- Ridge coefficients are {'smaller' if ridge_is_smaller else 'not smaller'} in average magnitude than the linear ones")
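
To show students how the Ridge penalty behaves, the following sketch fits the same pipeline for several `alpha` values on synthetic data and prints the coefficient norm. The data, the `alpha` grid, and the names `X_demo`, `y_demo`, and `coef_norms` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features with one weak signal buried in noise (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = {}
for alpha in alphas:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", Ridge(alpha=alpha)),
    ])
    pipe.fit(X_demo, y_demo)
    # L2 norm of the fitted coefficients shrinks as the penalty grows
    coef_norms[alpha] = float(np.linalg.norm(pipe.named_steps["regressor"].coef_))
    print(f"alpha={alpha:>8}: ||coef|| = {coef_norms[alpha]:.6f}")
```

With `alpha=1.0` and a few hundred standardized rows the shrinkage is small, which is one reason Ridge and plain linear regression tend to give similar test metrics in this exercise.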

### Exercise 2 Solution: Rolling Volatility Feature

In [None]:
# INSTRUCTOR ONLY - Exercise 2 Solution: Rolling Volatility Feature

def create_enhanced_features(data, price_column='Adj Close'):
    """Create features including volatility."""
    df = data.copy()
    df['Return'] = df[price_column].pct_change()
    
    # Original lag features
    for i in range(1, 6):
        df[f'Lag_{i}'] = df['Return'].shift(i)
    
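    # Add rolling volatility feature (5-day window).
    # shift(1) keeps the window on past returns only; without it the window
    # would include the current day's return, which is the prediction target.
    df['Volatility_5d'] = df['Return'].shift(1).rolling(5).std()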
    
    df = df.dropna()
    return df

# Create enhanced features
enhanced_data = create_enhanced_features(stock_data)
print(f"Enhanced feature data shape: {enhanced_data.shape}")

# Prepare enhanced feature set
enhanced_feature_columns = ['Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5', 'Volatility_5d']
X_enhanced = enhanced_data[enhanced_feature_columns]
y_enhanced = enhanced_data['Return']

# Split data chronologically
split_idx = int(0.8 * len(enhanced_data))
X_train_enh = X_enhanced.iloc[:split_idx]
X_test_enh = X_enhanced.iloc[split_idx:]
y_train_enh = y_enhanced.iloc[:split_idx]
y_test_enh = y_enhanced.iloc[split_idx:]

# Train enhanced model
enhanced_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

enhanced_pipeline.fit(X_train_enh, y_train_enh)
y_test_pred_enh = enhanced_pipeline.predict(X_test_enh)

# Evaluate enhanced model
enhanced_metrics = evaluate_model(y_test_enh, y_test_pred_enh, "Enhanced (with Volatility)")

# Compare original vs enhanced
print(f"\n{'='*60}")
print("ORIGINAL vs ENHANCED (WITH VOLATILITY) COMPARISON")
print(f"{'='*60}")
print(f"{'Metric':<15} {'Original':<15} {'Enhanced':<15} {'Improvement':<15}")
print(f"{'-'*60}")

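# Align both test sets on their most recent min_len observations for a fair comparison
min_len = min(len(y_test), len(y_test_enh))
original_r2 = metrics.r2_score(np.asarray(y_test)[-min_len:], np.asarray(y_test_pred)[-min_len:])
enhanced_r2 = metrics.r2_score(np.asarray(y_test_enh)[-min_len:], np.asarray(y_test_pred_enh)[-min_len:])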

print(f"R²{'':<13} {original_r2:<15.6f} {enhanced_r2:<15.6f} {enhanced_r2-original_r2:<15.6f}")

# Analyze volatility feature importance
enhanced_coef = enhanced_pipeline.named_steps['regressor'].coef_
vol_idx = enhanced_feature_columns.index('Volatility_5d')  # look up by name, not position
volatility_coef = enhanced_coef[vol_idx]

print("\nVolatility Feature Analysis:")
print(f"- Volatility coefficient: {volatility_coef:.6f}")
# Rank by absolute coefficient; comparable across features because they are standardized
importance_order = np.argsort(np.abs(enhanced_coef))[::-1].tolist()
print(f"- Volatility importance rank: {importance_order.index(vol_idx) + 1} out of {len(enhanced_coef)}")

# Visualize volatility over time
plt.figure(figsize=(12, 4))
plt.plot(enhanced_data.index[-100:], enhanced_data['Volatility_5d'][-100:], label='5-day Rolling Volatility')
plt.title('5-Day Rolling Volatility (Last 100 days)')
plt.xlabel('Date')
plt.ylabel('Volatility')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
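
A point worth raising when discussing this exercise: a rolling window computed directly on `Return` ends at the current day, so it contains the value being predicted unless the series is shifted by one day first. The sketch below uses a made-up return series to show the difference; the variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up daily returns (illustrative only)
returns = pd.Series([0.01, -0.02, 0.015, 0.0, -0.01, 0.02, 0.005, -0.005])

# Window ending at day t: includes the return on day t itself
vol_incl_today = returns.rolling(5).std()

# Window ending at day t-1: uses only returns known before day t
vol_past_only = returns.shift(1).rolling(5).std()

# Change only the last day's return and recompute both features
bumped = returns.copy()
bumped.iloc[-1] = 0.10
changed_incl = bool(bumped.rolling(5).std().iloc[-1] != vol_incl_today.iloc[-1])
changed_past = bool(bumped.shift(1).rolling(5).std().iloc[-1] != vol_past_only.iloc[-1])

print("Unshifted feature reacts to today's return:", changed_incl)
print("Shifted feature reacts to today's return:", changed_past)
```

If a feature for day t changes when only day t's return changes, it is leaking the target and will inflate R².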

### Exercise 3 Solution: Train/Test Split Analysis

In [None]:
# INSTRUCTOR ONLY - Exercise 3 Solution: Train/Test Split Analysis

def train_with_split(X, y, train_ratio=0.8):
    """Train model with custom split ratio."""
    split_idx = int(train_ratio * len(X))
    
    # Chronological split
    X_train_split = X.iloc[:split_idx]
    X_test_split = X.iloc[split_idx:]
    y_train_split = y.iloc[:split_idx]
    y_test_split = y.iloc[split_idx:]
    
    # Train model
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])
    
    model.fit(X_train_split, y_train_split)
    y_pred_split = model.predict(X_test_split)
    
    # Calculate metrics
    mse = metrics.mean_squared_error(y_test_split, y_pred_split)
    r2 = metrics.r2_score(y_test_split, y_pred_split)
    
    return {
        'train_size': len(X_train_split),
        'test_size': len(X_test_split),
        'mse': mse,
        'r2': r2,
        'train_ratio': train_ratio
    }

# Test different splits
split_ratios = [0.7, 0.8, 0.9]
results = {}

print("Training models with different train/test splits...")
for ratio in split_ratios:
    results[ratio] = train_with_split(X, y, ratio)
    print(f"Split {ratio:.0%}: Train={results[ratio]['train_size']}, Test={results[ratio]['test_size']}")

# Create comparison table
print(f"\n{'='*70}")
print("TRAIN/TEST SPLIT COMPARISON")
print(f"{'='*70}")
print(f"{'Split Ratio':<12} {'Train Size':<12} {'Test Size':<12} {'MSE':<12} {'R²':<12}")
print(f"{'-'*70}")

for ratio in split_ratios:
    r = results[ratio]
    split_label = f"{ratio:.0%}/{1 - ratio:.0%}"  # e.g. "80%/20%"
    print(f"{split_label:<12} {r['train_size']:<12} {r['test_size']:<12} {r['mse']:<12.6f} {r['r2']:<12.6f}")

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# MSE comparison
ax1.bar([f"{r:.0%}" for r in split_ratios], [results[r]['mse'] for r in split_ratios], color='red', alpha=0.7)
ax1.set_title('MSE by Train/Test Split Ratio')
ax1.set_xlabel('Train Ratio')
ax1.set_ylabel('MSE')
ax1.grid(True, alpha=0.3)

# R² comparison
ax2.bar([f"{r:.0%}" for r in split_ratios], [results[r]['r2'] for r in split_ratios], color='blue', alpha=0.7)
ax2.set_title('R² by Train/Test Split Ratio')
ax2.set_xlabel('Train Ratio')
ax2.set_ylabel('R²')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Analysis and insights
print("\nAnalysis of Split Ratio Effects:")
r2_gap = results[0.9]['r2'] - results[0.7]['r2']
print(f"1. With more training data (90% split), test R² was {'higher' if r2_gap > 0 else 'not higher'} than with the 70% split (difference: {r2_gap:+.6f})")
print(f"2. Smaller test sets (10%) may give {'less reliable' if results[0.9]['test_size'] < 50 else 'adequate'} performance estimates")
print("3. The 80/20 split provides a good balance between training data and test set reliability")

# Each ratio is scored on a different test window, so treat this ranking as indicative only
best_ratio = max(split_ratios, key=lambda x: results[x]['r2'])
print(f"4. Best performing split: {best_ratio:.0%} train (R² = {results[best_ratio]['r2']:.6f})")

print("\nWhy split ratio matters in time series:")
print("- More recent data in the test set may follow different patterns than older training data")
print("- Market conditions change over time (concept drift)")
print("- Smaller test sets make performance estimates noisier (higher variance)")
print("- A long training history may include outdated patterns")
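
Because a single split scores each ratio on a different test window, one optional extension for discussion is walk-forward evaluation. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic data; it is not part of the exercise, and the data and the names `X_demo`, `y_demo`, and `fold_r2` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered toy data with a weak signal (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(size=500)

fold_r2 = []
fold_is_chronological = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    # Each fold trains on an expanding window and tests on the block that follows it
    fold_is_chronological.append(bool(train_idx.max() < test_idx.min()))
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("regressor", LinearRegression()),
    ])
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_r2.append(r2_score(y_demo[test_idx], model.predict(X_demo[test_idx])))

print("Per-fold test R²:", np.round(fold_r2, 4))
print(f"Mean: {np.mean(fold_r2):.4f}, std: {np.std(fold_r2):.4f}")
```

The spread of R² across folds gives students a rough sense of how much a single-split estimate can move with the choice of test window.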