In [None]:
import numpy as np
from tqdm.notebook import tqdm
import random
from collections import Counter

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading Pre-Computed Embeddings

We start by loading the pre-computed ESG news embeddings from a pickled file.
These embeddings represent daily aggregated ESG news per company, date, and category, and will serve as input sequences for the LSTM model.

In [None]:
path = "/content/drive/MyDrive/MIAX/TRABAJO FINAL DE MASTER DEFINITIVO/EMBEDDINGS.pkl"
df = pd.read_pickle(path)

In [None]:
df

Unnamed: 0,ticker,date,category,event_text,embedding,embedding_reduced,cluster,embedding_combined
0,AAPL,2020-03-26,Governance,The coronavirus crisis won’t force IBM or thes...,"[-0.061546337, -0.05315133, 0.031761326, -0.03...","[0.13161144, 0.21722357, -0.20482038, 0.217550...",0,"[0.1316114366054535, 0.21722356975078583, -0.2..."
1,AAPL,2020-03-26,Social,Apple Inc. (NASDAQ: AAPL) CEO Tim Cook announc...,"[0.00027809513, -0.029144509, 0.117705405, -0....","[-0.1196747, 0.28669232, -0.041156188, 0.13470...",0,"[-0.1196746975183487, 0.2866923213005066, -0.0..."
2,AAPL,2020-03-27,Social,Apple says the app and website are meant to gi...,"[-0.015929632, 0.0119838, 0.025352007, -0.0224...","[-0.34469396, 0.13473184, -0.027780786, 0.2523...",0,"[-0.34469395875930786, 0.13473184406757355, -0..."
3,AAPL,2020-03-28,Governance,"As the U.S. economy starts to shut down, cash ...","[0.032775655, -0.025250023, -0.005527121, 0.02...","[0.108597055, -0.118303485, -0.056267083, -0.0...",0,"[0.1085970550775528, -0.11830348521471024, -0...."
4,AAPL,2020-03-30,Governance,"Apple Inc's (NASDAQ: AAPL) largest supplier, F...","[0.021084372, -0.04204493, -0.027231807, -0.00...","[0.084087715, 0.17990188, -0.20706366, -0.0217...",0,"[0.08408771455287933, 0.1799018830060959, -0.2..."
...,...,...,...,...,...,...,...,...
69650,XOM,2025-03-19,Governance,ExxonMobil filed a protest notice on Wednesday...,"[0.027937062, 0.05524979, 0.053883664, 0.00581...","[0.12804541, -0.25893778, 0.12964192, -0.17031...",1,"[0.12804540991783142, -0.2589377760887146, 0.1..."
69651,XOM,2025-03-20,Environmental,XOM questions Colonial Pipeline's plan to cut ...,"[-0.018101593, 0.07403076, 0.039944347, -0.041...","[0.25796288, -0.23406321, 0.16715088, -0.04122...",1,"[0.2579628825187683, -0.23406320810317993, 0.1..."
69652,XOM,2025-03-21,Environmental,New Wave Offshore Energy is expected to provid...,"[-0.014534758, -0.0074794795, 0.0008310969, -0...","[0.12106457, -0.14770606, 0.13917562, 0.138137...",1,"[0.12106457352638245, -0.1477060616016388, 0.1..."
69653,XOM,2025-03-22,Environmental,Exxon Mobil Corporation stock has remained res...,"[-0.01221892, -0.055477124, 0.0681576, 0.06422...","[0.42223483, 0.006553337, -0.10097705, -0.0310...",1,"[0.42223483324050903, 0.006553336977958679, -0..."


# Preparing Embedding Data

We sort the dataset by `ticker`, `category`, and `date` to ensure chronological order, which is crucial for sequential modeling.
Additionally, we convert the `embedding_combined` vectors (the column used as model input) into NumPy arrays for efficient manipulation.

In [None]:
df = df.sort_values(by=["ticker", "category", "date"]).reset_index(drop=True)
df["embedding_combined"] = df["embedding_combined"].apply(lambda x: np.array(x))
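As a quick sanity check, the sorting and conversion step can be reproduced on a toy frame. This is a minimal sketch: the `toy` frame and its values are made up, and it assumes `date` is either a datetime or an ISO-formatted string so that sorting gives chronological order.

```python
import numpy as np
import pandas as pd

# Toy frame with the same key columns as `df`; all values are invented.
toy = pd.DataFrame({
    "ticker": ["XOM", "AAPL", "AAPL", "AAPL"],
    "category": ["Social", "Social", "Governance", "Social"],
    "date": pd.to_datetime(["2021-01-05", "2021-01-07", "2021-01-06", "2021-01-04"]),
    "embedding_combined": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
})

toy = toy.sort_values(by=["ticker", "category", "date"]).reset_index(drop=True)
toy["embedding_combined"] = toy["embedding_combined"].apply(np.asarray)

# Dates must be non-decreasing inside every (ticker, category) group.
is_sorted = toy.groupby(["ticker", "category"])["date"].apply(
    lambda s: s.is_monotonic_increasing
)
all_sorted = bool(is_sorted.all())

# Every embedding should share the same shape, otherwise np.stack fails later.
dims = {e.shape for e in toy["embedding_combined"]}
print(all_sorted, dims)
```

Running the same two checks on the real `df` before window generation would catch unsorted groups or ragged embeddings early.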

# Rolling Window Generation with Padding

In this section, we define a function to generate fixed-size rolling windows of ESG news embeddings, which will serve as sequential inputs for the LSTM model.

Each window holds up to 90 consecutive daily embeddings for a specific company (`ticker`) and ESG category (`category`). Because the function slices rows rather than calendar dates, a window covers 90 days on which news exists, which can span more than 90 calendar days when coverage is sparse. Rows are ordered by date, and the most recent row defines the **target date** for which we want to predict the quarterly ESG score.

## Why use rolling windows?

LSTM networks are designed to model temporal dependencies in sequential data. To train them effectively, we need sequences of **fixed length** with **chronologically ordered inputs**. Rolling windows allow us to:
- Simulate the sequential nature of ESG-related news flows.
- Use multiple overlapping sequences to maximize training data.
- Learn from both short-term fluctuations and long-term accumulation of ESG-relevant signals.

This structure enables the model to infer whether the pattern of ESG news in a given 90-day window is predictive of changes in quarterly ESG scores.
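To illustrate how overlapping windows multiply the number of samples, here is a small sketch using NumPy's `sliding_window_view`. It is not the method used in this notebook (the function defined later loops over groups and adds padding); the array sizes are toy values chosen for readability.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_days, dim, window_size = 6, 2, 4

# Toy sequence: one row per day, `dim` features per row.
series = np.arange(n_days * dim, dtype=float).reshape(n_days, dim)

# sliding_window_view appends the window axis at the end: shape (3, 2, 4).
windows = sliding_window_view(series, window_shape=window_size, axis=0)

# Reorder to (n_windows, window_size, dim), the layout an LSTM expects.
windows = windows.transpose(0, 2, 1)

print(windows.shape)  # n_days - window_size + 1 windows
```

Consecutive windows share `window_size - 1` rows, so a sequence of `n` days yields `n - window_size + 1` training samples instead of `n // window_size` non-overlapping ones.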


## Why apply padding?

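Some windows cannot be filled with 90 real observations. In the implementation used here this happens for the earliest windows of each (`ticker`, `category`) group, before 90 news days have accumulated, and it mainly affects companies with sparse coverage. Instead of discarding these windows, we allow **zero-vector padding** at the beginning of the sequence.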

This has multiple benefits:
- Increases the number of usable training samples.
- Maintains consistent input shape for the LSTM (essential for batch training).
- Ensures fair representation of companies with irregular data flow.

However, **excessive padding** can dilute meaningful signal. To address this, we define a **maximum padding fraction** (20% in this run; the function's default is 30%) and discard windows that don’t reach the minimum number of real days (72 out of 90 at 20%).
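The threshold arithmetic and the left-padding step can be sketched in isolation. The helper names `min_real_days` and `left_pad` are hypothetical and exist only for this illustration. The sketch uses `round()` as a precaution, since a floating-point product can land just below the intended integer before `int()` truncates it.

```python
import numpy as np

def min_real_days(window_size, max_padding_fraction):
    # Hypothetical helper: minimum number of real rows a window must contain.
    # round() guards against float truncation before converting to int.
    return int(round(window_size * (1 - max_padding_fraction)))

def left_pad(embeddings, window_size):
    # Hypothetical helper: prepend zero vectors until the window is full.
    missing = window_size - embeddings.shape[0]
    if missing <= 0:
        return embeddings[-window_size:]
    padding = np.zeros((missing, embeddings.shape[1]), dtype=embeddings.dtype)
    return np.vstack([padding, embeddings])

threshold = min_real_days(90, 0.2)

# A window with exactly the minimum number of real rows (values are made up).
short = np.ones((threshold, 101))
padded = left_pad(short, 90)

print(threshold, padded.shape)
```

With a 20% limit, the most heavily padded window that survives has 18 zero rows followed by 72 real rows.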

---

## Core logic of the function

1. Group the dataset by `ticker` and `category`.
2. Sort each group by `date` to ensure chronological consistency.
3. Slide a window of up to 90 rows across the timeline:
   - If the group has fewer than `min_required_days` rows, skip it.
   - For each valid window:
     - Extract the slice of up to 90 `embedding_combined` vectors.
     - Apply **zero-padding** at the beginning if the window has fewer than 90 rows.
     - Store the resulting sequence in `X`.
     - Append metadata (ticker, category, target date) to track the context.

We also prepare placeholders for the target ESG score (`y`), which will be matched later in a separate step using official data.

---

## Function Output

The function returns:
- `X`: a NumPy array of shape (N, 90, D), where N is the number of windows and D is the embedding size (101 in this run).
- `y`: a list of `None` placeholders, one per window, to be filled in the next notebook.
- `tickers`: list of company tickers per window.
- `categories`: ESG category (`Environmental`, `Social`, or `Governance`) per window.
- `target_dates`: the date corresponding to the last day in each window (target alignment point).
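Because the five outputs are aligned only by position, it can help to verify their lengths and gather the metadata in one table. The sketch below uses stand-in values with the same structure as the function's return values. The `meta` frame is an illustrative convenience, not something the notebook itself builds.

```python
import numpy as np
import pandas as pd

# Stand-in outputs; shapes mirror the real ones, values are invented.
X = np.zeros((3, 90, 101))
y = [None, None, None]
tickers = ["AAPL", "AAPL", "XOM"]
categories = ["Social", "Social", "Governance"]
target_dates = ["2023-01-10", "2023-01-11", "2023-01-10"]

# Index i must refer to the same window in every output.
lengths = {len(X), len(y), len(tickers), len(categories), len(target_dates)}
assert len(lengths) == 1, "outputs are misaligned"

# One row per window; the row position doubles as the index into X.
meta = pd.DataFrame({
    "ticker": tickers,
    "category": categories,
    "target_date": pd.to_datetime(target_dates),
})
print(meta)
```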

---

## Design Justifications (for evaluation or academic defense)

- **Fixed window size (90 days)**: Chosen to match the quarterly cadence of the ESG scores used as targets.
- **Rolling windows**: Increase sample size while capturing sequential evolution.
- **Zero-padding**: Balanced strategy to include partial data without harming model input consistency.
- **Minimum threshold**: Ensures signal quality by enforcing a minimum number of real data points.
- **No target assignment yet**: This is done in a later step to ensure clean separation between input generation and target matching logic.

This approach allows flexibility, reproducibility, and robustness, and is directly inspired by time series modeling practices in NLP and financial deep learning literature (e.g., Bai et al., 2018; Zhang et al., 2021).
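For orientation only, the later target-matching step could look roughly like the sketch below. Everything in it is an assumption: the `scores` table, its column names, its values, and the choice of `direction="forward"` (match each window to the next score on or after its target date) are hypothetical and are not taken from the follow-up notebook.

```python
import pandas as pd

# Hypothetical window metadata (same fields the window function returns).
meta = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "XOM"],
    "category": ["Social", "Social", "Governance"],
    "target_date": pd.to_datetime(["2023-02-10", "2023-05-20", "2023-05-20"]),
})

# Hypothetical quarterly score table; names and numbers are invented.
scores = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "XOM"],
    "category": ["Social", "Social", "Governance"],
    "score_date": pd.to_datetime(["2023-03-31", "2023-06-30", "2023-06-30"]),
    "score": [61.0, 63.5, 48.0],
})

# merge_asof requires both frames to be sorted by their time key.
matched = pd.merge_asof(
    meta.sort_values("target_date"),
    scores.sort_values("score_date"),
    left_on="target_date",
    right_on="score_date",
    by=["ticker", "category"],
    direction="forward",
)
print(matched[["ticker", "category", "target_date", "score_date", "score"]])
```

Keeping this logic out of the window generator means the matching rule can change without regenerating `X`.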

In [None]:
def generate_windows_with_padding(df, window_size=90, max_padding_fraction=0.3):
    """Build left-padded rolling windows of embeddings per (ticker, category)."""
    X, y, tickers, categories, target_dates = [], [], [], [], []

    # Minimum number of real rows per window. round() avoids float truncation:
    # the product can land just below the intended integer before int() is applied.
    min_required_days = int(round(window_size * (1 - max_padding_fraction)))

    grouped = df.groupby(["ticker", "category"])

    for (ticker, category), group in tqdm(grouped, desc="Generating padded windows"):
        group = group.sort_values("date").reset_index(drop=True)

        # Skip groups that can never produce a valid window.
        if len(group) < min_required_days:
            continue

        # i is the exclusive end index of the window; its last row is the target date.
        for i in range(min_required_days, len(group) + 1):
            window = group.iloc[max(0, i - window_size):i]

            embeddings = np.stack(window["embedding_combined"].values)

            # Left-pad with zero vectors so every window has window_size rows.
            if embeddings.shape[0] < window_size:
                missing = window_size - embeddings.shape[0]
                padding = np.zeros((missing, embeddings.shape[1]))
                embeddings = np.vstack([padding, embeddings])

            X.append(embeddings)

            y.append(None)  # placeholder; targets are matched in a later step
            tickers.append(ticker)
            categories.append(category)
            target_dates.append(window.iloc[-1]["date"])

    return np.array(X), y, tickers, categories, target_dates

In [None]:
X, y, tickers, categories, target_dates = generate_windows_with_padding(df, window_size=90, max_padding_fraction=0.2)

Generating padded windows:   0%|          | 0/90 [00:00<?, ?it/s]

In [None]:
print(f"Total windows generated: {len(X)}")
print("Shape of one window (X[0]):", X[0].shape)

Total windows generated: 63401
Shape of one window (X[0]): (90, 101)


In [None]:
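# Count windows that contain at least one zero-padded row.
num_padded_windows = 0

for window in X:
    # Padding is always at the start and spans at most 18 rows
    # (20% of 90), so inspecting the first 20 rows is sufficient.
    first_rows = window[:20]
    if np.any(np.all(first_rows == 0, axis=1)):
        num_padded_windows += 1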

print(num_padded_windows)

1548
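The same count can be obtained without a Python loop, and the intermediate result doubles as a per-window padding length. This is a minimal sketch on a stand-in array `X_demo`. It assumes that a row consisting only of zeros is padding, which would misclassify a real embedding that happened to be exactly zero.

```python
import numpy as np

# Stand-in batch: 3 windows, 5 time steps, 2 features (values are made up).
X_demo = np.ones((3, 5, 2))
X_demo[1, :2] = 0.0  # window 1 has two padded rows
X_demo[2, :1] = 0.0  # window 2 has one padded row

# True where an entire time step is zero: shape (n_windows, n_steps).
is_pad_row = np.all(X_demo == 0, axis=2)

# Number of padded rows in each window.
pad_rows_per_window = is_pad_row.sum(axis=1)

# Windows with at least one padded row.
num_padded_windows_demo = int((pad_rows_per_window > 0).sum())
print(pad_rows_per_window, num_padded_windows_demo)
```

The boolean array `is_pad_row` could also serve as a mask if padded steps need to be ignored during training.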


In [None]:
i = random.randint(0, len(X) - 1)

print("Ticker:", tickers[i])
print("Category:", categories[i])
print("Target date:", target_dates[i])
print("First embedding row of the window:", X[i][0][:10])

Ticker: TSLA
Category: Social
Target date: 2023-07-20
First embedding row of the window: [-0.03310936 -0.01596753  0.08599661 -0.02037817  0.0988157  -0.08739669
 -0.19336331 -0.03057285 -0.00702647 -0.21213117]


In [None]:
print(Counter(categories))

Counter({'Social': 22866, 'Governance': 22125, 'Environmental': 18410})


In [None]:
import pickle

output_path = "/content/drive/MyDrive/MIAX/TRABAJO FINAL DE MASTER DEFINITIVO/WINDOWS.pkl"

with open(output_path, "wb") as f:
    pickle.dump({
        "X": X,
        "y": y,
        "tickers": tickers,
        "categories": categories,
        "target_dates": target_dates
    }, f)
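Before relying on the saved file in the next notebook, a round-trip check can confirm that the dictionary is restored intact. The sketch below writes a small stand-in payload to a temporary directory rather than to Drive. The file name `WINDOWS_demo.pkl` and all values are invented for the example.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in payload with the same keys as the saved dictionary.
payload = {
    "X": np.zeros((2, 90, 101)),
    "y": [None, None],
    "tickers": ["AAPL", "XOM"],
    "categories": ["Social", "Governance"],
    "target_dates": ["2023-01-10", "2023-01-11"],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "WINDOWS_demo.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(payload, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

# The array and the metadata lists should survive the round trip unchanged.
same_keys = set(restored) == set(payload)
same_X = np.array_equal(restored["X"], payload["X"])
same_meta = restored["tickers"] == payload["tickers"]
print(same_keys, same_X, same_meta)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.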