# Module 4: Machine Learning for Finance - Introduction

This notebook introduces machine learning applications in finance, with a focus on regression models for financial prediction.

## 1. Introduction to Machine Learning in Finance

Machine learning is increasingly used in finance for tasks such as:
- **Algorithmic trading:** Predicting price movements and generating trading signals.
- **Risk management:** Assessing credit risk, market risk, and operational risk.
- **Fraud detection:** Identifying fraudulent transactions.
- **Portfolio optimization:** Constructing optimal portfolios based on risk and return objectives.

## 2. Regression Models for Financial Prediction

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Generate synthetic data (not real market data): a linear relationship
# y = 2x + 5 plus Gaussian noise with standard deviation 2
np.random.seed(42)
n_samples = 100
X = np.random.rand(n_samples, 1) * 10
y = 2 * X.squeeze() + 5 + np.random.randn(n_samples) * 2

# Create a DataFrame
data = pd.DataFrame({'Feature': X.squeeze(), 'Target': y})

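# Split the data into training and testing sets.
# A random split is acceptable here only because the synthetic samples are
# independent; time-ordered financial data should be split chronologically.
X_train, X_test, y_train, y_test = train_test_split(
    data[['Feature']], data['Target'], test_size=0.2, random_state=42
)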

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Plot the held-out test points against the fitted regression line
plt.scatter(X_test, y_test, color='black', label='Test data')
plt.plot(X_test, y_pred, color='blue', linewidth=3, label='Fitted line')
plt.title('Linear Regression for Financial Prediction')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
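
The plot gives a visual check only. As a minimal sketch, the fit can also be checked numerically by comparing the learned coefficients with the values used to generate the data (slope 2, intercept 5) and by computing error metrics on the test set. The block below rebuilds the same synthetic data so that it runs on its own; the choice of mean squared error and R² as metrics is an assumption for illustration, not something this module prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Rebuild the same synthetic data: y = 2x + 5 plus Gaussian noise (std 2)
np.random.seed(42)
n_samples = 100
X = np.random.rand(n_samples, 1) * 10
y = 2 * X.squeeze() + 5 + np.random.randn(n_samples) * 2
data = pd.DataFrame({'Feature': X.squeeze(), 'Target': y})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Feature']], data['Target'], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Learned parameters, to compare with the true slope (2) and intercept (5)
slope = model.coef_[0]
intercept = model.intercept_

# Out-of-sample error metrics on the held-out 20%
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Slope: {slope:.3f} (true value 2)")
print(f"Intercept: {intercept:.3f} (true value 5)")
print(f"Test MSE: {mse:.3f}")
print(f"Test R^2: {r2:.3f}")
```

Because the noise has variance 4, a test MSE in that neighborhood suggests the model has recovered most of the learnable structure. Real financial data is far noisier, so R² values this high should not be expected there.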

## 3. Feature Engineering

**Feature engineering** is the process of creating new features from existing data to improve model performance. In finance, common features include:
- **Moving averages:** To smooth out price data and identify trends.
- **Momentum:** To measure the rate of change in price.
- **Volatility:** To measure the magnitude of price fluctuations.

In [None]:
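# Example: a 10-period moving average of the 'Target' column from Section 2.
# The first 9 rows are NaN because a full window of 10 values is not yet available.
# The synthetic rows have no time order, so this only illustrates the rolling-window mechanics.
data['Moving_Average_10'] = data['Target'].rolling(window=10).mean()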
print(data.head(15))
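
The momentum and volatility features listed above can be sketched in the same way. The example below is a rough illustration under stated assumptions: the price series is simulated (a hypothetical random walk, not real market data), the column names are made up for this example, and the 10-day window is arbitrary. Momentum is taken here as the 10-day percentage change in price, and volatility as the rolling standard deviation of daily returns; other definitions are also in common use.

```python
import numpy as np
import pandas as pd

# Hypothetical price series: 60 simulated daily closes from a random walk
rng = np.random.default_rng(0)
daily_log_returns = rng.normal(loc=0.0005, scale=0.01, size=60)
prices = pd.Series(
    100 * np.exp(np.cumsum(daily_log_returns)),
    index=pd.date_range('2020-01-01', periods=60, freq='D'),
    name='Close',
)

features = pd.DataFrame({'Close': prices})

# Daily simple return: (P_t / P_{t-1}) - 1
features['Return'] = prices.pct_change()

# Moving average: mean of the last 10 closes, smooths short-term noise
features['MA_10'] = prices.rolling(window=10).mean()

# Momentum: percentage change in price over the last 10 days
features['Momentum_10'] = prices.pct_change(periods=10)

# Volatility: standard deviation of the last 10 daily returns
features['Volatility_10'] = features['Return'].rolling(window=10).std()

# Drop the warm-up rows where a full window is not yet available
features_clean = features.dropna()
print(features_clean.head())
```

Each feature at date t uses only data up to and including t, so it can be paired with a target from t+1 onward without leaking future information.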

## 📝 Guided Exercises with Auto-Validation

Practice ML fundamentals for finance!

### Exercise 1: Train-Test Split for Time Series (Beginner)

Split time series data chronologically so that the model is never trained on observations that come after the test period.

In [None]:
# Exercise 1: Time Series Split
import numpy as np
import pandas as pd

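# Given: 1000 daily observations of simulated returns (a stand-in for stock data)
n_samples = 1000
dates = pd.date_range('2020-01-01', periods=n_samples, freq='D')
np.random.seed(42)  # Fixed seed so the simulated returns are reproducible
data = np.random.randn(n_samples)  # Simulated returns; replaces the earlier `data` DataFrame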

# TODO: Split data 80% train, 20% test
# CRITICAL: Time series must be split chronologically (no random shuffling!)
train_size = None  # Number of samples in training set
test_size = None   # Number of samples in test set

# TODO: Get the split index
split_index = None

# TODO: Create train and test sets
train_data = None  # First 80% chronologically
test_data = None   # Last 20% chronologically

# TODO: What's the last date in training set?
last_train_date = None

# TODO: What's the first date in test set?
first_test_date = None

# ============= AUTO-VALIDATION (DO NOT MODIFY) =============
assert train_size is not None, "❌ Calculate training size!"
assert test_size is not None, "❌ Calculate test size!"
assert split_index is not None, "❌ Find split index!"
assert train_data is not None, "❌ Create train set!"
assert test_data is not None, "❌ Create test set!"
assert last_train_date is not None, "❌ Get last training date!"
assert first_test_date is not None, "❌ Get first test date!"
assert train_size == 800, f"❌ Train size should be 800, got {train_size}"
assert test_size == 200, f"❌ Test size should be 200, got {test_size}"
assert split_index == 800, f"❌ Split at index 800"
assert len(train_data) == 800, f"❌ Train data length incorrect"
assert len(test_data) == 200, f"❌ Test data length incorrect"
assert last_train_date < first_test_date, "❌ Train must come before test!"
assert last_train_date == dates[799], "❌ Last train date incorrect"
assert first_test_date == dates[800], "❌ First test date incorrect"
print("✅ Exercise 1 Complete!")
print(f"   Train size: {train_size} samples")
print(f"   Test size: {test_size} samples")
print(f"   Last train date: {last_train_date.date()}")
print(f"   First test date: {first_test_date.date()}")
print(f"   Interpretation: Time series MUST be split chronologically to avoid look-ahead bias!")
# =========================================================
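
A single chronological split yields only one out-of-sample estimate. As a hedged extension of the same idea, scikit-learn's `TimeSeriesSplit` produces several expanding-window folds in which every training set ends before its test set begins. The sketch below uses a placeholder index array rather than real returns, and the choice of `n_splits=4` is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder for 1000 time-ordered observations (positions only, no real data)
n_obs = 1000
series = np.arange(n_obs)

# Expanding-window cross-validation: each fold trains on an initial segment
# and tests on the block that immediately follows it
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(series))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {fold_number}: "
        f"train = [{train_idx.min()}, {train_idx.max()}], "
        f"test = [{test_idx.min()}, {test_idx.max()}]"
    )
```

No fold shuffles the data, so the look-ahead bias that the exercise warns about is avoided in each of them.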