# Stock Market Data Collection, Modeling, and Analysis (CRISP-DM)

This notebook is designed as an end-to-end, **educational** example for students to practice data science with real-world financial data. It follows the **CRISP-DM framework**:

1. **Business Understanding**
2. **Data Understanding**
3. **Data Preparation**
4. **Modeling**
5. **Evaluation**
6. **Deployment**

> **Disclaimer:** This notebook is for learning purposes only and **not** financial advice.

## 1. Business Understanding

**Goal:** Build a repeatable workflow to collect stock data, compute trading-relevant statistics (returns, volatility, moving averages, RSI, MACD), and train a simple predictive model for **next-day returns**.

**Key questions:**
- How can we collect open, high, low, close, and volume data, and derive engineered features from it?
- Which statistics are commonly used to inform trading decisions?
- How can we evaluate model performance for time-series data?

## 2. Data Understanding

We will use **Yahoo Finance** data via the `yfinance` package (no API key required).

> **Note:** In production, you might use paid APIs (Polygon, Alpha Vantage, IEX Cloud) for better reliability and higher rate limits. These typically require API keys.

In [None]:
# If needed in your environment, uncomment the line below to install yfinance.
# !pip -q install yfinance

# Import core libraries for data collection and analysis.
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt

# Configure plotting for readability in notebooks.
plt.style.use("seaborn-v0_8")

In [None]:
# Define the stock tickers and date range for analysis.
# You can edit this list to explore other stocks or ETFs.
tickers = ["AAPL", "MSFT", "AMZN", "GOOGL", "TSLA"]
start_date = "2018-01-01"
end_date = "2024-01-01"

# Download daily OHLCV (Open, High, Low, Close, Volume) data for all tickers.
# The result is a DataFrame with MultiIndex columns: ticker on the outer level, field on the inner level.
raw_data = yf.download(tickers, start=start_date, end=end_date, group_by="ticker", auto_adjust=False)

# Display the first few rows to understand the structure.
raw_data.head()

In [None]:
# Let's inspect the available fields for one ticker (AAPL).
# This helps students understand how the data is organized.
raw_data["AAPL"].head()
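
The column layout described above can be explored offline with a small synthetic frame. This is a minimal sketch: the tickers `AAA` and `BBB`, the level names, and all values are made up for illustration and do not come from Yahoo Finance.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a multi-ticker download: two tickers, two fields each.
dates = pd.date_range("2023-01-02", periods=3, freq="B")
columns = pd.MultiIndex.from_product(
    [["AAA", "BBB"], ["Close", "Volume"]], names=["Ticker", "Field"]
)
values = np.array(
    [
        [10.0, 100, 20.0, 200],
        [11.0, 110, 19.0, 210],
        [12.0, 120, 21.0, 220],
    ]
)
demo = pd.DataFrame(values, index=dates, columns=columns)

# Selecting on the outer level returns that ticker's fields as ordinary columns.
aaa = demo["AAA"]

# xs() takes a cross-section on the inner level: one field for every ticker.
closes = demo.xs("Close", axis=1, level="Field")

print(aaa.columns.tolist())     # ['Close', 'Volume']
print(closes.columns.tolist())  # ['AAA', 'BBB']
```

A cross-section like `closes` has the same shape as the `close_prices` frame built in the next section, so either approach should work for pulling one field across all tickers.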

## 3. Data Preparation

We will reshape the data and calculate **daily returns**, **moving averages**, **rolling volatility**, **RSI**, and **MACD**. These are common inputs to technical analysis.

In [None]:
# Extract the 'Close' prices into a clean DataFrame with tickers as columns.
close_prices = pd.DataFrame({ticker: raw_data[ticker]["Close"] for ticker in tickers})

# Calculate daily returns (percentage change) for each ticker.
returns = close_prices.pct_change().dropna()

# Preview the processed datasets.
close_prices.head(), returns.head()
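
To make the return calculation concrete, here is a small worked example on three made-up prices. It compares the simple returns produced by `pct_change()` with log returns, which are a common alternative. The prices are illustrative only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0], name="close")

# Simple return: (P_t - P_{t-1}) / P_{t-1}; the first row has no prior price.
simple = prices.pct_change()

# Log return: ln(P_t / P_{t-1}); log returns add up across time.
log_ret = np.log(prices / prices.shift(1))

# Compounding simple returns, or summing log returns, recovers the total change.
total_from_simple = (1 + simple.dropna()).prod() - 1
total_from_log = np.exp(log_ret.dropna().sum()) - 1

print(simple.round(4).tolist())  # [nan, 0.1, -0.1]
print(round(total_from_simple, 4), round(total_from_log, 4))  # -0.01 -0.01
```

Note that a 10% gain followed by a 10% loss leaves the price 1% lower, which is why simple returns must be compounded rather than summed.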

In [None]:
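# Define helper functions for RSI and MACD to keep our code clean and reusable.
def compute_rsi(series, window=14):
    """Compute the Relative Strength Index (RSI) using simple rolling averages."""
    # Day-over-day price changes (the first value is NaN).
    delta = series.diff()
    # Split changes into gains and losses. clip() keeps the leading NaN,
    # so the first window is not padded with an artificial zero.
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)
    # Average gain and loss over the lookback window.
    avg_gain = gain.rolling(window=window).mean()
    avg_loss = loss.rolling(window=window).mean()
    # Relative strength; a zero average loss becomes NaN to avoid dividing by zero.
    rs = avg_gain / avg_loss.replace(0, np.nan)
    rsi = 100 - (100 / (1 + rs))
    # By convention, RSI is 100 when the window contains gains but no losses.
    rsi = rsi.mask((avg_loss == 0) & (avg_gain > 0), 100.0)
    return rsi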


def compute_macd(series, fast=12, slow=26, signal=9):
    """Compute MACD line and signal line for a price series."""
    # Exponential moving averages for fast and slow windows
    ema_fast = series.ewm(span=fast, adjust=False).mean()
    ema_slow = series.ewm(span=slow, adjust=False).mean()
    # MACD line is the difference
    macd_line = ema_fast - ema_slow
    # Signal line is the EMA of the MACD line
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line, signal_line
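
`compute_rsi` above averages gains and losses with simple rolling means. Many charting tools instead use Wilder-style exponential smoothing. The sketch below shows one way to approximate that with `ewm(alpha=1 / window)`. The function name `compute_rsi_wilder` is hypothetical, the price series is synthetic, and the seeding differs slightly from Wilder's original recipe (which starts from a simple average), so early values may not match other tools exactly.

```python
import numpy as np
import pandas as pd


def compute_rsi_wilder(series, window=14):
    """RSI with Wilder-style exponential smoothing (alpha = 1 / window)."""
    delta = series.diff().dropna()
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)
    # adjust=False gives the recursive form:
    # avg_t = (1 - alpha) * avg_{t-1} + alpha * x_t
    avg_gain = gain.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    # Dividing by a zero average loss yields inf, which maps to an RSI of 100.
    rs = avg_gain / avg_loss
    rsi = 100 - 100 / (1 + rs)
    # Reindex so the output lines up with the input series.
    return rsi.reindex(series.index)


# Synthetic random-walk prices, used only to exercise the function.
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
rsi = compute_rsi_wilder(prices)

# A strictly rising series has no losses, so every valid RSI value is 100.
rising = pd.Series(np.arange(1.0, 41.0))
print(compute_rsi_wilder(rising).dropna().unique())  # [100.]
```

Either variant stays within 0 to 100. The smoothed version reacts more gradually because old observations fade out instead of dropping off the window abruptly.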

In [None]:
# Build a feature dataset for a single ticker to simplify modeling.
# You can loop over tickers later as a student exercise.

ticker = "AAPL"
prices = raw_data[ticker]["Close"].copy()

# Create a DataFrame for features and target.
data = pd.DataFrame({"close": prices})

# Add daily returns.
data["return"] = data["close"].pct_change()

# Add simple moving averages (SMA) for short and long windows.
data["sma_20"] = data["close"].rolling(window=20).mean()
data["sma_50"] = data["close"].rolling(window=50).mean()

# Add rolling volatility (standard deviation of returns).
data["volatility_20"] = data["return"].rolling(window=20).std()

# Add RSI and MACD indicators.
data["rsi_14"] = compute_rsi(data["close"], window=14)
macd_line, signal_line = compute_macd(data["close"], fast=12, slow=26, signal=9)
data["macd"] = macd_line
data["macd_signal"] = signal_line

# Define the prediction target: next-day return.
data["target_return"] = data["return"].shift(-1)

# Remove rows with missing values: the rolling-window warm-up period
# and the final row, which has no next-day target.
data = data.dropna()

# Inspect feature set.
data.head()
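
The `shift(-1)` step is easy to get backwards, so it can help to verify the alignment on a tiny example. The sketch below uses five made-up closing prices and checks that each row's target is the following day's return.

```python
import pandas as pd

dates = pd.date_range("2023-01-02", periods=5, freq="B")
demo = pd.DataFrame({"close": [100.0, 102.0, 101.0, 104.0, 103.0]}, index=dates)

demo["return"] = demo["close"].pct_change()
# shift(-1) pulls tomorrow's return onto today's row.
demo["target_return"] = demo["return"].shift(-1)

# The first row lacks a return and the last row lacks a target, so both are dropped.
clean = demo.dropna()
print(len(demo), len(clean))  # 5 3

# Alignment check: the target on day t equals the return on day t+1.
assert clean["target_return"].iloc[0] == demo["return"].iloc[2]
```

If the shift were positive instead, the target would be yesterday's return, and the model would be "predicting" information it already has.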

## 4. Modeling

We will train a simple model to predict **next-day return** using the engineered features.
This is **not** a production trading system, just a learning example.

In [None]:
# Import machine learning tools from scikit-learn.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Select features and target for modeling.
feature_cols = ["return", "sma_20", "sma_50", "volatility_20", "rsi_14", "macd", "macd_signal"]
X = data[feature_cols]
y = data["target_return"]

# Use a time-based split (no shuffling) for realistic evaluation.
# Here, we take the first 80% as training and the last 20% as testing.
split_index = int(len(data) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

# Train a simple linear regression model.
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set.
y_pred = model.predict(X_test)
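
A single 80/20 split gives only one estimate of out-of-sample error. One possible extension is walk-forward validation with scikit-learn's `TimeSeriesSplit`, which trains on an expanding window and always tests on later data. The sketch below runs on synthetic data; `X_demo`, `y_demo`, and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered data used only to demonstrate the mechanics.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

fold_mae = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    # Each fold trains on an expanding window and tests on the block after it.
    assert train_idx.max() < test_idx.min()
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[test_idx])
    fold_mae.append(mean_absolute_error(y_demo[test_idx], fold_pred))

print(f"MAE per fold: {np.round(fold_mae, 4)}")
```

To apply this to the notebook's data, `X_demo` and `y_demo` could be swapped for `X.to_numpy()` and `y.to_numpy()`. A large spread across folds suggests the single-split result is not stable.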

## 5. Evaluation

We evaluate model performance using **MAE**, **RMSE**, and **R²**. For time-series data, you should also visually inspect predictions and residuals.

In [None]:
# Compute evaluation metrics.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Display metrics in a readable format.
print(f"MAE:  {mae:.6f}")
print(f"RMSE: {rmse:.6f}")
print(f"R²:   {r2:.6f}")
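
Error metrics are easier to interpret next to a naive baseline. The sketch below compares a set of hypothetical predictions with an "always predict zero return" baseline and also reports directional accuracy. The helper names and all numbers are made up for illustration.

```python
import numpy as np


def directional_accuracy(y_true, y_pred):
    """Share of observations where the predicted and actual signs agree."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))


def mean_abs_error(y_true, y_pred):
    """Mean absolute error, written out so the sketch only needs NumPy."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


# Hypothetical actual and predicted returns for four days.
actual = np.array([0.01, -0.02, 0.03, -0.01])
predicted = np.array([0.02, 0.01, 0.01, -0.03])

# Naive baseline: always predict a zero return.
zero_baseline = np.zeros_like(actual)

print(directional_accuracy(actual, predicted))         # 0.75
print(round(mean_abs_error(actual, predicted), 4))      # 0.02
print(round(mean_abs_error(actual, zero_baseline), 4))  # 0.0175
```

In this toy data the zero baseline has a lower MAE even though the predictions get the direction right 75% of the time. The same comparison on `y_test` and `y_pred` shows whether the model adds anything beyond the baseline.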

In [None]:
# Plot actual vs. predicted next-day returns.
plt.figure(figsize=(12, 4))
plt.plot(y_test.index, y_test, label="Actual", alpha=0.7)
plt.plot(y_test.index, y_pred, label="Predicted", alpha=0.7)
plt.title(f"{ticker} Next-Day Return: Actual vs Predicted")
plt.xlabel("Date")
plt.ylabel("Return")
plt.legend()
plt.show()

## 6. Deployment (Educational Example)

In a real project, you might save your trained model and create a pipeline to refresh data daily.
Here, we show how to serialize the model using `joblib`.

In [None]:
# If needed, install joblib. (It is a scikit-learn dependency, so it is usually already present.)
# !pip -q install joblib

import joblib

# Save the trained model to disk.
model_path = "linear_return_model.joblib"
joblib.dump(model, model_path)

# Load the model back (to verify saving worked).
loaded_model = joblib.load(model_path)

# Quick sanity check: predictions from the loaded model should match.
loaded_pred = loaded_model.predict(X_test)
np.allclose(y_pred, loaded_pred)
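
A saved model is only usable if the caller passes the features in the order used for training. One possible approach is to store the feature list next to the model in a single artifact. This is a sketch on synthetic data: the column names `feat_a` and `feat_b`, the dictionary keys, and the file name are all placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training frame; the column names are placeholders.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(50, 2)), columns=["feat_a", "feat_b"])
target = 2.0 * train["feat_a"] - train["feat_b"]

demo_features = ["feat_a", "feat_b"]
demo_model = LinearRegression().fit(train[demo_features], target)

# Bundle the model with the metadata needed to call it correctly later.
artifact = {"model": demo_model, "feature_cols": demo_features}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_bundle.joblib")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# Reorder incoming columns to the saved order before predicting.
new_rows = pd.DataFrame({"feat_b": [1.0], "feat_a": [3.0]})
prediction = restored["model"].predict(new_rows[restored["feature_cols"]])
print(prediction.round(6))  # [5.]
```

For this notebook, the bundle could hold `model` and `feature_cols`. Only load joblib files from sources you trust, since loading can execute arbitrary code.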

## Extension Ideas (Student Exercises)

1. **Multi-ticker modeling:** Loop through `tickers` and compare metrics.
2. **Alternative models:** Try `RandomForestRegressor`, `GradientBoostingRegressor`, or XGBoost.
3. **Feature engineering:** Add volume-based indicators (OBV, VWAP).
4. **Backtesting:** Create a simple strategy based on model signals and evaluate returns.
5. **API exploration:** Replace `yfinance` with another API using authentication keys.
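
As a starting point for the backtesting exercise, the sketch below turns predictions into a long/short position and compounds the resulting returns. It assumes zero transaction costs and no slippage. The function name `backtest_sign_strategy` and the numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def backtest_sign_strategy(predicted, realized):
    """Go long when the predicted return is positive, short when negative."""
    position = np.sign(predicted)
    # The position is set from the prediction and earns the realized next-day return.
    strategy_returns = position * realized
    equity_curve = (1 + strategy_returns).cumprod()
    return strategy_returns, equity_curve


# Hypothetical predictions and realized next-day returns.
predicted = pd.Series([0.01, -0.02, 0.0])
realized = pd.Series([0.02, 0.01, -0.03])

strategy_returns, equity_curve = backtest_sign_strategy(predicted, realized)
print(equity_curve.round(4).tolist())  # [1.02, 1.0098, 1.0098]
```

With the notebook's variables, `predicted` could be `pd.Series(y_pred, index=y_test.index)` and `realized` could be `y_test`. A buy-and-hold curve, `(1 + y_test).cumprod()`, makes a natural benchmark.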

---

### Quick Discussion Questions

- Why might next-day return prediction be difficult?
- How do moving averages help smooth noisy financial data?
- What risks arise from training on historical prices?