---
title: "Week 01 — Introduction to Financial Modelling & ML Basics"
week: 1
author: "Praveen Kumar"
date: 2025-10-07
duration: "2-3 hours"
prerequisites: ["Basic Python", "High school algebra"]
tags: ["intro","linear-regression","financial-modeling"]
version: v1.0
---

# Week 01 — Introduction to Financial Modelling & ML Basics

## Student Notebook: Linear Regression for Stock Return Prediction

This notebook introduces the fundamentals of financial modelling using machine learning. We'll implement a basic linear regression model to predict stock returns using historical lag features.

### Learning Objectives:
- Understand financial time series data structure
- Implement feature engineering with lag variables
- Train and evaluate a linear regression model
- Visualize model performance and interpret results

In [4]:
# Parameters
SEED = 42
SAMPLE_MODE = True  # Set to True for quick runs, False for full analysis
DATA_PATH = "data/synthetic/"
DATASET = "stock_prices.csv"

print(f"Configuration:")
print(f"SEED: {SEED}")
print(f"SAMPLE_MODE: {SAMPLE_MODE}")
print(f"DATA_PATH: {DATA_PATH}")
print(f"DATASET: {DATASET}")

Configuration:
SEED: 42
SAMPLE_MODE: True
DATA_PATH: data/synthetic/
DATASET: stock_prices.csv


In [None]:
# Install required packages (run only if needed)
import sys
import subprocess

try:
    import yfinance
    print("yfinance already installed")
except ImportError:
    print("Installing yfinance...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "yfinance", "-q"])
    print("yfinance installed successfully")

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import os
from datetime import datetime

# Machine Learning
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Financial data
try:
    import yfinance as yf
    YF_AVAILABLE = True
except ImportError:
    YF_AVAILABLE = False
    print("Warning: yfinance not available. Will use synthetic data only.")

# Set random seed for reproducibility
np.random.seed(SEED)
warnings.filterwarnings('ignore')

# Configure plotting
plt.rcParams['figure.figsize'] = (10, 6)

print("Setup Complete!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"YFinance available: {YF_AVAILABLE}")

Setup Complete!
NumPy version: 2.3.3
Pandas version: 2.3.3
YFinance available: True


## Section 1: Data Loading and Preprocessing

We'll load stock price data with a fallback strategy:
1. First, try to load from a synthetic CSV file (for offline/quick runs)
2. If it is not available, download AAPL data using yfinance
3. If both fail, generate a synthetic random-walk price series so the notebook still runs

In [3]:
# Data Loading with Fallback Logic
def load_stock_data():
    """Load stock data with fallback logic."""
    data = None
    
    # Strategy 1: Try to load synthetic data first
    synthetic_path = os.path.join(DATA_PATH, DATASET)
    if os.path.exists(synthetic_path):
        try:
            print(f"Loading synthetic data from {synthetic_path}")
            data = pd.read_csv(synthetic_path, index_col=0, parse_dates=True)
            print(f"Successfully loaded synthetic data: {data.shape}")
            return data, "synthetic"
        except Exception as e:
            print(f"Failed to load synthetic data: {e}")
    
    # Strategy 2: Download from yfinance (if available)
    if YF_AVAILABLE:
        try:
            print("Downloading AAPL data from Yahoo Finance...")
            # Use different date ranges based on SAMPLE_MODE
            if SAMPLE_MODE:
                # Small sample for quick runs
                data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
            else:
                # Larger dataset for full analysis
                data = yf.download('AAPL', start='2020-01-01', end='2024-01-01', progress=False)
            
            print(f"Successfully downloaded data: {data.shape}")
            return data, "yfinance"
        except Exception as e:
            print(f"Failed to download data: {e}")
    
    # Fallback: Create synthetic data
    print("Creating synthetic stock data...")
    dates = pd.date_range(start='2023-01-01', end='2024-01-01', freq='B')  # business days only, like a real trading calendar
    np.random.seed(SEED)
    
    # Generate synthetic stock price with random walk
    returns = np.random.normal(0.001, 0.02, len(dates))  # Mean return 0.1%, volatility 2%
    prices = 100 * np.exp(np.cumsum(returns))
    
    data = pd.DataFrame({
        'Open': prices * np.random.uniform(0.98, 1.02, len(dates)),
        'High': prices * np.random.uniform(1.00, 1.05, len(dates)),
        'Low': prices * np.random.uniform(0.95, 1.00, len(dates)),
        'Close': prices,
        'Volume': np.random.randint(1000000, 10000000, len(dates)),
        'Adj Close': prices
    }, index=dates)
    
    print(f"Created synthetic data: {data.shape}")
    return data, "synthetic_generated"

# Load the data
stock_data, data_source = load_stock_data()
print(f"\nData source: {data_source}")
print(f"Date range: {stock_data.index.min()} to {stock_data.index.max()}")
print(f"Available columns: {list(stock_data.columns)}")
stock_data.head()

Downloading AAPL data from Yahoo Finance...
Successfully downloaded data: (250, 5)

Data source: yfinance
Date range: 2023-01-03 00:00:00 to 2023-12-29 00:00:00
Available columns: [('Close', 'AAPL'), ('High', 'AAPL'), ('Low', 'AAPL'), ('Open', 'AAPL'), ('Volume', 'AAPL')]


| Date | Close (AAPL) | High (AAPL) | Low (AAPL) | Open (AAPL) | Volume (AAPL) |
|---|---|---|---|---|---|
| 2023-01-03 | 123.330658 | 129.079575 | 122.443173 | 128.468202 | 112117500 |
| 2023-01-04 | 124.602707 | 126.870724 | 123.340509 | 125.125335 | 89113600 |
| 2023-01-05 | 123.281342 | 125.993097 | 123.024963 | 125.361998 | 80962700 |
| 2023-01-06 | 127.817383 | 128.478063 | 123.153167 | 124.257594 | 87754700 |
| 2023-01-09 | 128.339981 | 131.554653 | 128.083602 | 128.655538 | 70790800 |


## Section 2: Feature Engineering for Financial Time Series

Now we'll create features for our machine learning model:
1. Calculate daily returns
2. Create lag features (previous 5 days' returns)
3. Handle missing values
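
As a small illustration of the three steps above, the following sketch builds returns and two lags from a made-up price series. The prices and the `toy` name are illustrative assumptions, not course data.

```python
import pandas as pd

# Hypothetical closing prices, chosen so the returns are easy to check by hand
toy = pd.DataFrame({"Close": [100.0, 110.0, 99.0, 108.9, 108.9]})

# Step 1: daily simple return, (P_t / P_{t-1}) - 1
toy["Return"] = toy["Close"].pct_change()

# Step 2: lag features; Lag_k at row t holds the return from k days earlier
for k in (1, 2):
    toy[f"Lag_{k}"] = toy["Return"].shift(k)

# Step 3: drop rows where the return or any lag is missing
toy_clean = toy.dropna()

print(toy.round(4))
print(f"Rows before: {len(toy)}, after dropna: {len(toy_clean)}")
```

With two lags, the first three rows are dropped: one because the first return is undefined and two more for the lags. With the five lags used in this notebook, the first six rows are dropped for the same reason.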

In [None]:
# Feature Engineering
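def create_features(data, price_column='Adj Close'):
    """Create features for ML model."""
    df = data.copy()
    
    # yfinance may return (field, ticker) MultiIndex columns, as in the output above;
    # keep only the field level so columns can be selected by name
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    
    # The download above has no 'Adj Close' column, so fall back to 'Close'
    if price_column not in df.columns:
        price_column = 'Close'
    
    # Calculate daily returns
    df['Return'] = df[price_column].pct_change()
    
    # Create lag features (previous 5 days' returns)
    for i in range(1, 6):
        df[f'Lag_{i}'] = df['Return'].shift(i)
    
    # Drop rows with missing values (the first 6 rows: 1 for the return, 5 for the lags)
    df = df.dropna()
    
    return df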

# Create features
feature_data = create_features(stock_data)

print(f"Feature engineering complete!")
print(f"Original data shape: {stock_data.shape}")
print(f"Feature data shape: {feature_data.shape}")
print(f"\nFeature columns: {[col for col in feature_data.columns if 'Lag_' in col]}")

# Display sample of engineered features
feature_data[['Return', 'Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5']].head(10)

## Section 3: Model Preparation and Training

We'll prepare our data for machine learning:
1. Split features (X) and target (y)
2. Perform chronological train/test split (80/20)
3. Train a Linear Regression model
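
The chronological split in step 2 can be sketched on a toy index. This is only an illustration: `chronological_split` is a hypothetical helper name, not a scikit-learn function, and the returns are made up.

```python
import pandas as pd

def chronological_split(frame, train_ratio=0.8):
    """Split a time-ordered frame without shuffling (hypothetical helper)."""
    split_idx = int(train_ratio * len(frame))
    return frame.iloc[:split_idx], frame.iloc[split_idx:]

# Ten business days of made-up returns
dates = pd.date_range("2023-01-02", periods=10, freq="B")
toy = pd.DataFrame({"Return": [0.01 * i for i in range(10)]}, index=dates)

train, test = chronological_split(toy, train_ratio=0.8)

# Every training date precedes every test date, so no future data reaches training
print(f"Train rows: {len(train)}, test rows: {len(test)}")
print(f"Train ends {train.index.max().date()}, test starts {test.index.min().date()}")
```

A random shuffle would instead mix later dates into the training set, which lets the model see information from the period it is later tested on.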

In [None]:
# Prepare features and target
feature_columns = ['Lag_1', 'Lag_2', 'Lag_3', 'Lag_4', 'Lag_5']
X = feature_data[feature_columns]
y = feature_data['Return']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Chronological split (important for time series!)
split_idx = int(0.8 * len(feature_data))
split_date = feature_data.index[split_idx]

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print(f"\nTrain set: {X_train.shape[0]} samples (before {split_date.date()})")
print(f"Test set: {X_test.shape[0]} samples (from {split_date.date()} onward)")
print(f"Train period: {X_train.index.min().date()} to {X_train.index.max().date()}")
print(f"Test period: {X_test.index.min().date()} to {X_test.index.max().date()}")

In [None]:
# Train Linear Regression Model
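# Using a pipeline with standardization: it does not change plain least-squares predictions,
# but it puts coefficients on a comparable scale and matters once regularization is added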
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Train the model
print("Training Linear Regression model...")
model_pipeline.fit(X_train, y_train)

# Make predictions
y_train_pred = model_pipeline.predict(X_train)
y_test_pred = model_pipeline.predict(X_test)

print("Model training complete!")
print(f"Model coefficients shape: {model_pipeline.named_steps['regressor'].coef_.shape}")

# Display model coefficients
coefficients = model_pipeline.named_steps['regressor'].coef_
intercept = model_pipeline.named_steps['regressor'].intercept_

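# Coefficients apply to the standardized lags, z(Lag_i) = (Lag_i - mean) / std, not to raw returns
print(f"\nModel Equation (standardized features):")
print(f"Return(t) = {intercept:.6f}", end="")
for i, coef in enumerate(coefficients):
    sign = "+" if coef >= 0 else ""
    print(f" {sign}{coef:.6f}*z(Lag_{i+1})", end="")
print("\n")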

# Create coefficient DataFrame for better visualization
coef_df = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': coefficients
})
print("Feature Coefficients:")
print(coef_df)
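
Because the pipeline standardizes features before fitting, the coefficients printed above apply to scaled lags rather than raw returns. The following is a minimal sketch, on made-up data, of how they could be converted back to original units. The names `X_toy`, `y_toy`, `coef_raw`, and `intercept_raw` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(0.0, 0.02, size=(200, 3))            # made-up "lagged returns"
y_toy = 0.1 * X_toy[:, 0] + rng.normal(0.0, 0.01, 200)  # made-up target

pipe = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
pipe.fit(X_toy, y_toy)

scaler = pipe.named_steps["scaler"]
reg = pipe.named_steps["regressor"]

# Undo z = (x - mean) / scale: divide each coefficient by its feature's scale,
# then shift the intercept by the contribution of the feature means
coef_raw = reg.coef_ / scaler.scale_
intercept_raw = reg.intercept_ - np.sum(reg.coef_ * scaler.mean_ / scaler.scale_)

# The raw-unit equation should reproduce the pipeline's predictions
manual_pred = intercept_raw + X_toy @ coef_raw
print("Matches pipeline predictions:", np.allclose(manual_pred, pipe.predict(X_toy)))
```

The same two lines could be applied to `model_pipeline` if an equation in raw return units is wanted.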

## Section 4: Model Evaluation

Let's evaluate our model's performance using standard regression metrics.
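
To make the metrics concrete, here is a small hand-computable sketch using NumPy only. The four actual and predicted returns are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted daily returns
y_true = np.array([0.01, -0.02, 0.03, 0.00])
y_pred = np.array([0.00, -0.01, 0.01, 0.01])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # same units as the returns
mae = np.mean(np.abs(errors))    # average absolute error

# R^2 compares the model against always predicting the mean of y_true
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.6f} RMSE={rmse:.6f} MAE={mae:.6f} R2={r2:.4f}")
```

R² becomes negative whenever `ss_res` exceeds `ss_tot`, that is, when the model's squared error is larger than that of a constant prediction at the mean.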

In [None]:
# Calculate evaluation metrics
def evaluate_model(y_true, y_pred, dataset_name=""):
    """Calculate and display evaluation metrics."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = metrics.r2_score(y_true, y_pred)
    mae = metrics.mean_absolute_error(y_true, y_pred)
    
    print(f"\n{dataset_name} Set Performance:")
    print(f"Mean Squared Error (MSE): {mse:.8f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.6f}")
    print(f"R-squared (R²): {r2:.6f}")
    print(f"Mean Absolute Error (MAE): {mae:.6f}")
    
    return {'MSE': mse, 'RMSE': rmse, 'R2': r2, 'MAE': mae}

# Evaluate on both sets
train_metrics = evaluate_model(y_train, y_train_pred, "Training")
test_metrics = evaluate_model(y_test, y_test_pred, "Test")

# Summary comparison
print(f"\n{'='*50}")
print("SUMMARY COMPARISON:")
print(f"{'='*50}")
print(f"{'Metric':<15} {'Train':<12} {'Test':<12} {'Difference':<12}")
print(f"{'-'*50}")
for metric in ['MSE', 'R2']:
    train_val = train_metrics[metric]
    test_val = test_metrics[metric]
    diff = test_val - train_val
    print(f"{metric:<15} {train_val:<12.6f} {test_val:<12.6f} {diff:<12.6f}")

# Interpretation
print(f"\nInterpretation:")
if test_metrics['R2'] > 0:
    print(f"✓ Model explains {test_metrics['R2']*100:.2f}% of variance in test returns")
else:
    print(f"✗ Model performs worse than simply predicting the mean return")

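# Note: 0.1 is an arbitrary rule-of-thumb threshold, and a small gap says little when both R² values are near zero
if abs(test_metrics['R2'] - train_metrics['R2']) < 0.1:
    print(f"✓ Train and test R² are close (little sign of overfitting)")
else:
    print(f"⚠ Model may be overfitting (large train-test performance gap)")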

## Section 5: Visualization and Results Analysis

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Actual vs Predicted Returns (Test Set)
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6, color='blue')
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Returns')
axes[0, 0].set_ylabel('Predicted Returns')
axes[0, 0].set_title(f'Actual vs Predicted Returns (Test Set)\nR² = {test_metrics["R2"]:.4f}')
axes[0, 0].grid(True, alpha=0.3)

# 2. Residual Plot
residuals = y_test - y_test_pred
axes[0, 1].scatter(y_test_pred, residuals, alpha=0.6, color='green')
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Returns')
axes[0, 1].set_ylabel('Residuals (Actual - Predicted)')
axes[0, 1].set_title('Residual Plot')
axes[0, 1].grid(True, alpha=0.3)

# 3. Model Coefficients
coef_names = [f'Lag_{i+1}' for i in range(len(coefficients))]
colors = ['red' if c < 0 else 'blue' for c in coefficients]
bars = axes[1, 0].bar(coef_names, coefficients, color=colors, alpha=0.7)
axes[1, 0].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[1, 0].set_xlabel('Lag Features')
axes[1, 0].set_ylabel('Coefficient Value')
axes[1, 0].set_title('Model Coefficients by Lag Period')
axes[1, 0].grid(True, alpha=0.3)

# Add value labels on bars
for bar, coef in zip(bars, coefficients):
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height,
                   f'{coef:.4f}', ha='center', va='bottom' if height > 0 else 'top')

# 4. Time Series of Actual vs Predicted (last 50 points for clarity)
n_display = min(50, len(y_test))
test_dates = y_test.index[-n_display:]
axes[1, 1].plot(test_dates, y_test[-n_display:], 'b-', label='Actual', linewidth=2)
axes[1, 1].plot(test_dates, y_test_pred[-n_display:], 'r--', label='Predicted', linewidth=2)
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Return')
axes[1, 1].set_title(f'Recent Predictions vs Actual (Last {n_display} days)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Additional insights
print("Model Insights:")
print(f"1. Most influential lag: Lag_{np.argmax(np.abs(coefficients)) + 1} (coefficient: {coefficients[np.argmax(np.abs(coefficients))]:.6f})")
print(f"2. Sign of the Lag_1 coefficient suggests {'momentum' if coefficients[0] > 0 else 'mean reversion'} in daily returns (not a significance test)")
print(f"3. Average absolute coefficient: {np.mean(np.abs(coefficients)):.6f}")
print(f"4. Residual standard deviation: {np.std(residuals):.6f}")

In [None]:
# Save predictions to CSV for submission
predictions_df = pd.DataFrame({
    'Date': y_test.index,
    'Actual_Return': y_test.values,
    'Predicted_Return': y_test_pred,
    'Residual': y_test.values - y_test_pred
})

# Create output directory if it doesn't exist
output_dir = '/kaggle/working' if '/kaggle' in os.getcwd() else 'output'
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, 'predictions_week01.csv')
predictions_df.to_csv(output_path, index=False)

print(f"Predictions saved to: {output_path}")
print(f"File contains {len(predictions_df)} predictions")
print("\nFirst few predictions:")
print(predictions_df.head())

## Section 6: Student Exercises

Complete the following exercises to deepen your understanding of financial ML concepts.

### Exercise 1: Implement Ridge Regression

Ridge regression adds an L2 penalty that shrinks coefficients toward zero, which can reduce overfitting. Compare its performance with our baseline Linear Regression model.

**Hint:** Use `sklearn.linear_model.Ridge` instead of `LinearRegression`.
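
Before applying Ridge to the stock data, it may help to see the shrinkage effect in isolation. This is a minimal sketch on made-up data; the value `alpha=10.0` is an arbitrary assumption, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))                        # made-up features
y_toy = 0.5 * X_toy[:, 0] + rng.normal(scale=1.0, size=60)

ols = LinearRegression().fit(X_toy, y_toy)
ridge = Ridge(alpha=10.0).fit(X_toy, y_toy)             # alpha is the L2 penalty strength

# The L2 penalty pulls the coefficient vector toward zero
print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

Larger `alpha` values shrink the coefficients further. Whether that improves test performance on the stock data is what the exercise asks you to check.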

In [None]:
# TODO: Exercise 1 - Implement Ridge Regression
# Your task:
# 1. Create a Ridge regression model using sklearn.linear_model.Ridge
# 2. Train it on the same training data (X_train, y_train)
# 3. Make predictions on the test set
# 4. Calculate MSE and R² for comparison
# 5. Compare coefficients with the Linear Regression model

# Ridge regression pipeline
ridge_pipeline = None  # TODO: Create pipeline with StandardScaler and Ridge

# Train and evaluate
# TODO: Fit the model, make predictions, and calculate metrics

# Compare results
# TODO: Print comparison of Linear vs Ridge performance

print("Exercise 1: Complete the Ridge regression implementation above!")

### Exercise 2: Add Rolling Volatility Feature

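Volatility is a widely used feature in financial models. Create a 5-day rolling volatility and assess whether it improves model performance.

**Hint:** Use `df['Return'].rolling(5).std().shift(1)`. The `shift(1)` makes the window end on the previous day, so the feature does not include the return being predicted.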
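
When a rolling statistic is used to predict the same day's return, the window should end on the previous day. Otherwise the feature already contains the target. The sketch below, on made-up returns, contrasts the two alignments; the variable names are illustrative.

```python
import pandas as pd

# Made-up daily returns
returns = pd.Series([0.01, -0.02, 0.03, 0.00, -0.01, 0.02, 0.01])

# Window ending today: row t uses returns t-4..t, so it includes the target itself
vol_incl_today = returns.rolling(5).std()

# Shifted by one day: row t uses returns t-5..t-1, all known before day t
vol_known_yesterday = vol_incl_today.shift(1)

print(pd.DataFrame({
    "Return": returns,
    "Vol_incl_today": vol_incl_today,
    "Vol_shifted": vol_known_yesterday,
}).round(4))
```

The shifted column has one more missing row at the start, which `dropna()` removes along with the rows lost to the lag features.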

In [None]:
# TODO: Exercise 2 - Add Rolling Volatility Feature
# Your task:
# 1. Calculate 5-day rolling volatility: volatility = returns.rolling(5).std().shift(1)
#    (the shift keeps the current day's return, which is the target, out of the window)
# 2. Add this as a new feature to your existing lag features
# 3. Re-train the Linear Regression model with lag features + volatility
# 4. Compare performance with the original model (lags only)
# 5. Visualize the volatility feature over time

# Create enhanced feature set
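def create_enhanced_features(data, price_column='Adj Close'):
    """Create features including volatility."""
    df = data.copy()
    # Flatten (field, ticker) MultiIndex columns and fall back to 'Close'
    # when the requested price column is missing, as in the yfinance download
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    if price_column not in df.columns:
        price_column = 'Close'
    df['Return'] = df[price_column].pct_change()
    
    # Original lag features
    for i in range(1, 6):
        df[f'Lag_{i}'] = df['Return'].shift(i)
    
    # TODO: Add rolling volatility feature (use only returns known before day t)
    # df['Volatility_5d'] = ???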
    
    df = df.dropna()
    return df

# TODO: Create enhanced features, train model, and compare results

print("Exercise 2: Complete the rolling volatility implementation above!")

### Exercise 3: Different Train/Test Split Analysis

Time series models can be sensitive to the train/test split. Experiment with different split ratios and observe the impact.

**Hint:** Try 70/30 and 90/10 splits and compare results.

In [None]:
# TODO: Exercise 3 - Different Train/Test Split Analysis
# Your task:
# 1. Test different train/test split ratios: 70/30, 90/10
# 2. Train the same Linear Regression model for each split
# 3. Compare the R² and MSE across different splits
# 4. Discuss why the split ratio might affect performance in time series

def train_with_split(X, y, train_ratio=0.8):
    """Train model with custom split ratio."""
    split_idx = int(train_ratio * len(X))
    # TODO: Implement custom split training and evaluation
    pass

# Test different splits
split_ratios = [0.7, 0.8, 0.9]
results = {}

# TODO: Loop through split ratios and collect results
# for ratio in split_ratios:
#     results[ratio] = train_with_split(X, y, ratio)

# TODO: Create comparison table/visualization

print("Exercise 3: Complete the train/test split analysis above!")

## Summary

Congratulations! You've completed Week 1 of the Financial Modelling using Machine Learning Bootcamp. 

### What you learned:
- Financial time series data structure and characteristics
- Feature engineering with lag variables for stock returns
- Linear regression implementation and evaluation
- Model interpretation through coefficients and visualizations
- Common evaluation metrics (MSE, RMSE, MAE, R²) for regression problems

### Key takeaways:
1. **Time series require chronological splits** - Never shuffle time series data randomly
2. **Feature engineering is crucial** - Raw prices are less useful than returns and lags
3. **Regularization helps** - Ridge regression can improve generalization
4. **Financial data is noisy** - Perfect predictions are unrealistic; focus on consistent patterns

### Next steps:
- Complete the exercises above
- Try the weekly assignment
- Explore different stocks and time periods
- Consider other features like technical indicators

Great work! See you in Week 2, where we'll explore more sophisticated ML algorithms.