# Predicting Volatility from Text Data: A Time Series Approach

This notebook demonstrates a time series cross-validation approach for predicting future stock volatility from text data (e.g., news headlines, articles). It uses TF-IDF for text feature extraction and XGBoost for the regression task. Evaluation follows an expanding window scheme: for each test year, the model is trained on all earlier years and evaluated on that year's data.

## 1. Setup and Library Imports

First, we import the Python libraries needed for data manipulation (`pandas`, `numpy`), text feature extraction (`TfidfVectorizer` from `sklearn.feature_extraction.text`), the regression model (`XGBRegressor` from `xgboost`), and the evaluation metric (`mean_squared_error` from `sklearn.metrics`).

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor

## 2. Data Loading and Initial Preparation

In this section, we load the dataset, perform initial text preprocessing by converting the 'cleaned_text' column to lowercase, and extract the 'year' from the 'Date' column to facilitate time-based data splitting for the cross-validation. We then display the head of the processed DataFrame.

In [2]:
# Load the dataset from the specified CSV file.
# The 'index_col=0' argument tells pandas to use the first column as the DataFrame index.
df_data = pd.read_csv('./Data/cleaned_dataset_2124.csv', index_col=0)

# Convert the 'cleaned_text' column to lowercase.
# This step standardizes the text, ensuring that words like "Volatility" and "volatility" are treated identically
# during feature extraction, which helps improve model consistency.
df_data['cleaned_text'] = df_data['cleaned_text'].str.lower()

# Extract the year from the 'Date' column and create a new 'year' column.
# The 'Date' column is first converted to datetime objects, and then the .dt.year accessor is used.
# This 'year' column is essential for implementing the time series cross-validation strategy.
df_data['year'] = pd.to_datetime(df_data['Date']).dt.year

# Display the first 5 rows of the modified DataFrame.
# This helps in quickly inspecting the data structure, confirming the lowercase conversion,
# and verifying the newly added 'year' column.
print(df_data.head(5))

        Symbol        Date    vol_3d    vol_7d   vol_15d   vol_30d   vol_60d  \
CIK                                                                            
796343    ADBE  2021-01-15  0.254187  0.163928  0.267558  0.266536  0.295177   
920760     LEN  2021-01-22  0.250670  0.222891  0.317954  0.450116  0.490832   
1067701    URI  2021-01-27  0.897888  0.592564  0.453141  0.458637  0.402286   
87347      SLB  2021-01-27  0.448659  0.459556  0.355023  0.415898  0.410442   
1326801   META  2021-01-28  0.314609  0.227651  0.184727  0.307393  0.295126   

          vol_90d                                       cleaned_text  year  
CIK                                                                         
796343   0.273107  as previously discussed, our actual results co...  2021  
920760   0.448351  the following are what we believe to be the pr...  2021  
1067701  0.363955  our business, results of operations and financ...  2021  
87347    0.424772  the following discussion of risk fa
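
Because the `year` column drives every later split, it can be worth confirming that all dates parsed and checking how many rows fall in each year. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset); it assumes `errors="coerce"` is acceptable, which turns unparseable dates into `NaT` instead of raising.

```python
import pandas as pd

# Toy stand-in for df_data; values are illustrative only.
toy = pd.DataFrame({
    "Date": ["2021-01-15", "2022-03-01", "not a date", "2022-07-09"],
    "cleaned_text": ["a", "b", "c", "d"],
})

# errors="coerce" converts unparseable entries to NaT rather than raising.
parsed = pd.to_datetime(toy["Date"], errors="coerce")
n_bad = int(parsed.isna().sum())

# Keep only rows whose date parsed, then derive an integer year.
toy = toy.loc[parsed.notna()].copy()
toy["year"] = parsed[parsed.notna()].dt.year.astype(int)

# Rows per year, in chronological order.
year_counts = toy["year"].value_counts().sort_index()
print(f"Unparseable dates: {n_bad}")
print(year_counts)
```

Applied to the real DataFrame, the per-year counts give a quick preview of the training and test set sizes that each fold will see.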

## 3. Data Description

This step produces a statistical summary of the numerical columns in the DataFrame, which helps in understanding the distribution, central tendency, and spread of the volatility target variables (`vol_3d` through `vol_90d`).

In [3]:
df_data.describe()

| Statistic | vol_3d | vol_7d | vol_15d | vol_30d | vol_60d | vol_90d | year |
|---|---|---|---|---|---|---|---|
| count | 1729.0 | 1729.0 | 1729.0 | 1729.0 | 1729.0 | 1729.0 | 1729.0 |
| mean | 0.180231 | 0.261484 | 0.285047 | 0.28221 | 0.297074 | 0.29234 | 2022.529786 |
| std | 0.214962 | 0.177482 | 0.15832 | 0.134701 | 0.130181 | 0.121974 | 1.122352 |
| min | 8e-06 | 0.023844 | 0.034914 | 0.076416 | 0.084912 | 0.086059 | 2021.0 |
| 25% | 0.059327 | 0.155061 | 0.185311 | 0.196017 | 0.211379 | 0.209553 | 2022.0 |
| 50% | 0.127947 | 0.225854 | 0.252115 | 0.251763 | 0.267466 | 0.264587 | 2023.0 |
| 75% | 0.236228 | 0.312807 | 0.340543 | 0.333483 | 0.349459 | 0.339083 | 2024.0 |
| max | 4.974497 | 3.743436 | 3.007637 | 2.186403 | 1.682482 | 1.379846 | 2024.0 |
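
In the summary above, the maximum of `vol_3d` (about 4.97) sits far above its 75th percentile (about 0.24), which suggests a heavy right tail. One way to quantify this per target is sketched below on made-up numbers; the toy values and the `max_over_median` column name are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Illustrative values only: mostly small volatilities plus one extreme reading.
toy = pd.DataFrame({
    "vol_3d": [0.05, 0.10, 0.12, 0.15, 0.20, 0.25, 4.90],
    "vol_90d": [0.20, 0.22, 0.25, 0.27, 0.30, 0.33, 1.30],
})

# Select the volatility targets by their shared column prefix.
vol_cols = [c for c in toy.columns if c.startswith("vol_")]

# Sample skewness and the ratio of the maximum to the median for each target.
summary = pd.DataFrame({
    "skew": toy[vol_cols].skew(),
    "max_over_median": toy[vol_cols].max() / toy[vol_cols].median(),
})
print(summary)
```

A strongly positive skew matters when reading MSE, since a few extreme observations can dominate a squared-error metric.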


## 4. Time Series Cross-Validation and Model Evaluation

The evaluation is organized as two nested loops:

* **Outer Loop:** Iterates through the test years (2022, 2023, 2024). In each iteration, the training data consists of all years *before* the current test year, and the test data is the current year. This simulates how a model would be retrained and deployed over time.
* **Inner Loop:** For each test year, a separate model is trained and evaluated for each prediction horizon (3, 7, 15, 30, 60, 90 days), using the corresponding `vol_Xd` column as the target.
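
The expanding-window split on its own can be sketched as a small generator. This is a minimal illustration on a toy frame, and the helper name `expanding_year_splits` is hypothetical rather than something the notebook defines.

```python
import pandas as pd

def expanding_year_splits(df, year_col="year"):
    """Yield (test_year, train_df, test_df), training on all years before test_year."""
    years = sorted(df[year_col].unique())
    # The earliest year is never a test year because it has no earlier data.
    for test_year in years[1:]:
        train = df[df[year_col] < test_year]
        test = df[df[year_col] == test_year]
        yield int(test_year), train, test

# Toy data: only the 'year' column matters for the split.
toy = pd.DataFrame({"year": [2021, 2021, 2022, 2023, 2023, 2024], "x": range(6)})

sizes = [(year, len(train), len(test)) for year, train, test in expanding_year_splits(toy)]
print(sizes)  # [(2022, 2, 1), (2023, 3, 2), (2024, 5, 1)]
```

The training set grows with every fold while each test set covers exactly one year, so no fold ever trains on data from its own test year or later.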

For each combination of test year and prediction horizon:
1.  **Data Splitting:** Data is split into training and testing sets based on the year.
2.  **TF-IDF Vectorization:** Text data is converted into numerical features using `TfidfVectorizer`. The vectorizer is fitted on the training text only (`fit_transform`) and then applied to the test text (`transform`), so that no vocabulary or IDF statistics from the test set leak into the features.
3.  **XGBoost Model Training:** An `XGBRegressor` model is trained on the vectorized training data and the corresponding volatility target.
4.  **Prediction and Evaluation:** Predictions are made on the test set, and the Mean Squared Error (MSE) is calculated and printed. All MSE results are collected in a list for final display.
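
The leakage point in step 2 can be seen on a tiny example. The sketch below uses made-up sentences (not from the dataset): the vocabulary is learned from the training texts only, so a word that appears only in the test text gets no feature column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences for illustration only.
train_texts = ["market risk may increase", "interest rate risk"]
test_texts = ["cyber risk may increase"]

# Same n-gram setting as the notebook; max_features is omitted because the toy corpus is tiny.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Vocabulary and IDF weights are learned from the training texts only.
X_train = vectorizer.fit_transform(train_texts)
# The test text is mapped onto that fixed vocabulary; unseen terms are dropped.
X_test = vectorizer.transform(test_texts)

vocab = vectorizer.vocabulary_
print(X_train.shape, X_test.shape)
print("'cyber' in vocabulary:", "cyber" in vocab)
print("'may increase' in vocabulary:", "may increase" in vocab)
```

Both matrices share the same number of columns, which is what allows a model trained on `X_train` to score `X_test`.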

In [4]:
# Initialize an empty list to store evaluation metrics for each test year.
# Each element will be a dictionary containing the year and a list of MSE scores
# for different prediction horizons within that year.
evaluation_year = []

# Outer loop: Iterate through the test years from 2022 to 2024 (exclusive of 2025).
# This loop simulates an expanding window approach for time series cross-validation.
for i in range(2022, 2025):
    # Print the current training and testing year range for clarity.
    print(f"Training on data from 2021 to {i-1} and testing on {i}")

    # Split the dataset into training and testing sets based on the current year 'i'.
    # df_data_train: Contains all rows where the 'year' is strictly less than 'i'.
    df_data_train = df_data[df_data['year'] < i]
    # df_data_test: Contains all rows where the 'year' is exactly 'i'.
    df_data_test = df_data[df_data['year'] == i]

    # Print the number of samples in the training and testing sets.
    # This helps verify the correctness of the data split for each fold.
    print(f"Number of training samples: {df_data_train.shape[0]}")
    print(f"Number of testing samples: {df_data_test.shape[0]}")

    # Initialize a list to store Mean Squared Error (MSE) scores for different prediction horizons
    # within the current test year.
    evaluation_ndays_mse = []

    # Inner loop: Iterate through different prediction horizons (in days).
    # These correspond to the 'vol_Xd' target columns in the dataset.
    for n_days in [3, 7, 15, 30, 60, 90]:
        # Print the current prediction horizon being processed.
        print(f"Training model for {n_days} days ahead prediction")

        # Dynamically construct the target column name based on the current 'n_days'.
        target_col = f'vol_{n_days}d'

        # Define the feature (X) and target (y) for the training set.
        X_train = df_data_train['cleaned_text']
        y_train = df_data_train[target_col]

        # Define the feature (X) and target (y) for the testing set.
        X_test = df_data_test['cleaned_text']
        y_test = df_data_test[target_col]

        # Initialize TfidfVectorizer.
        # max_features=5000: Limits the vocabulary size to the top 5000 most frequent terms,
        # which helps manage dimensionality and reduce noise.
        # ngram_range=(1, 2): Considers both unigrams (single words) and bigrams (two-word phrases)
        # as features, capturing more context from the text.
        vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

        # Fit the vectorizer on the training text data and transform it into TF-IDF features.
        # This step learns the vocabulary and IDF weights from the training data.
        X_train_tfidf = vectorizer.fit_transform(X_train)
        # Transform the testing text data using the *same* fitted vectorizer.
        # It's crucial to only 'transform' the test data to prevent data leakage from the test set
        # into the feature engineering process.
        X_test_tfidf = vectorizer.transform(X_test)

        # Initialize and train the XGBoost Regressor model.
        # Hyperparameters (fixed values, not tuned in this notebook):
        # n_estimators=1000: Number of boosting rounds (i.e., number of trees to build).
        # learning_rate=0.01: Step size shrinkage used in updates to prevent overfitting.
        # max_depth=6: Maximum depth of a tree.
        # min_child_weight=1: Minimum sum of instance weight (hessian) needed in a child.
        # subsample=0.8: Fraction of samples used for fitting the trees.
        # colsample_bytree=0.8: Fraction of features (columns) used when constructing each tree.
        # gamma=0: Minimum loss reduction required to make a further partition on a leaf node.
        # reg_alpha=0.1 (L1 regularization): Regularization term on weights.
        # reg_lambda=1 (L2 regularization): Regularization term on weights.
        xgb_model = XGBRegressor(n_estimators=1000,
                                 learning_rate=0.01,
                                 max_depth=6,
                                 min_child_weight=1,
                                 subsample=0.8,
                                 colsample_bytree=0.8,
                                 gamma=0,
                                 reg_alpha=0.1,
                                 reg_lambda=1
                                )
        # Train the XGBoost model using the TF-IDF features and the corresponding volatility targets.
        xgb_model.fit(X_train_tfidf, y_train)

        # Make predictions on the TF-IDF vectorized test data.
        y_pred = xgb_model.predict(X_test_tfidf)

        # Calculate Mean Squared Error (MSE) between the actual test volatility (y_test)
        # and the predicted volatility (y_pred).
        # MSE is a common metric for regression tasks, measuring the average squared difference.
        mse = mean_squared_error(y_test, y_pred)

        # Print the calculated MSE for the current prediction horizon.
        print(f'Mean Squared Error: {mse}')
        print('---------------') # Separator for readability in output

        # Append the calculated MSE score to the list for the current year.
        evaluation_ndays_mse.append(mse)

    # After iterating through all prediction horizons for the current test year,
    # append the collected MSE scores for that year to the main 'evaluation_year' list.
    evaluation_year.append({
        'year': i,
        'mse': evaluation_ndays_mse
    })

# Print the final aggregated evaluation results.
# This list contains dictionaries, each summarizing the MSE scores for all prediction horizons
# across each of the test years.
print(evaluation_year)

Training on data from 2021 to 2021 and testing on 2022
Number of training samples: 419
Number of testing samples: 429
Training model for 3 days ahead prediction
Mean Squared Error: 0.10674581435768475
---------------
Training model for 7 days ahead prediction
Mean Squared Error: 0.05721633585441949
---------------
Training model for 15 days ahead prediction
Mean Squared Error: 0.038864257855644804
---------------
Training model for 30 days ahead prediction
Mean Squared Error: 0.02109916411113449
---------------
Training model for 60 days ahead prediction
Mean Squared Error: 0.021704831856504167
---------------
Training model for 90 days ahead prediction
Mean Squared Error: 0.02285179703077452
---------------
Training on data from 2021 to 2022 and testing on 2023
Number of training samples: 848
Number of testing samples: 427
Training model for 3 days ahead prediction
Mean Squared Error: 0.06578927614347502
---------------
Training model for 7 days ahead prediction
Mean Squared Error: 0.