# Improving embeddings with Signal Processing

The canonical way to embed text that is longer than an embedding model's context length is to split the text into chunks, embed each chunk separately, and then average the chunk embeddings in the time domain (see the Cookbook notebook [here](https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb)). On the example task below, accuracy improves slightly if you instead transform the embeddings to the frequency domain, apply a signal processing technique such as a lowpass filter to remove noise, transform back to the time domain, and *then* average the embeddings in the time domain.

This notebook demonstrates how to do this on an example document classification task.

For more information, [read the paper here](https://jagilley.github.io/fft-embed.html).

Utilities for implementing this method more simply can be found in [this GitHub repo](https://github.com/jagilley/fft-embeddings).
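
As a point of reference, the chunk-and-average baseline described above can be sketched with stand-in vectors. This is only an illustration: `average_chunk_embeddings` is a hypothetical helper, and weighting by chunk length and renormalizing to unit length are assumptions of this sketch (the cells below use a plain `np.mean`).

```python
import numpy as np

def average_chunk_embeddings(chunk_embeddings, chunk_lengths=None):
    """Collapse per-chunk embeddings into one document embedding.

    chunk_embeddings: array-like of shape (n_chunks, embedding_dim)
    chunk_lengths: optional weights, e.g. the number of tokens in each chunk
    """
    chunks = np.asarray(chunk_embeddings, dtype=float)
    # weighted mean over the chunk axis (plain mean when no weights are given)
    avg = np.average(chunks, axis=0, weights=chunk_lengths)
    # assumption: rescale to unit length so documents stay comparable
    return avg / np.linalg.norm(avg)

# two toy 2-dimensional "chunk embeddings"; the first chunk is three times longer
toy_chunks = [[1.0, 0.0], [0.0, 1.0]]
doc_vec = average_chunk_embeddings(toy_chunks, chunk_lengths=[3, 1])
```

The longer chunk dominates the result, so `doc_vec` points mostly along the first axis.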

In [None]:
import librosa
import nltk
import numpy as np
import openai
from nltk.corpus import reuters
# openai.embeddings_utils is only available in openai<1.0
from openai.embeddings_utils import get_embeddings
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from tqdm import tqdm

# download nltk's reuters corpus
nltk.download('reuters')

trade_docs = reuters.fileids(categories='trade')
crude_docs = reuters.fileids(categories='crude')

all_docs = [reuters.raw(doc_id) for doc_id in trade_docs + crude_docs]
all_labels = ['trade' for _ in trade_docs] + ['crude' for _ in crude_docs]

# shuffle docs and labels together
np.random.seed(42)
combined = list(zip(all_docs, all_labels))
np.random.shuffle(combined)
all_docs, all_labels = zip(*combined)

Traditional text classification: embed each entire document at once, then train an `MLPClassifier` on the resulting embeddings.

In [None]:
print('Getting embeddings...')
EMBEDDINGS_ENGINE = "text-embedding-ada-002"
all_embeddings = get_embeddings(all_docs, engine=EMBEDDINGS_ENGINE)

# train/test split
X_train, X_test, y_train, y_test = train_test_split(all_embeddings, all_labels, test_size=0.2, random_state=42)

# train classifier
clf = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=1000, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)
clf.fit(X_train, y_train)

# predict on test set
y_pred = clf.predict(X_test)

# evaluate
print(accuracy_score(y_test, y_pred))

New method: split each text into overlapping segments with a sliding window, then embed the segments separately.

In [None]:
# fft classification with sliding windows

def split_text(text, segment_length=40, overlap_percent=0.5):
    # text: a string containing the corpus of text
    # segment_length: an integer indicating the number of words in each segment
    # overlap_percent: a number in [0, 1) giving the fraction of each segment shared with the next
    # returns: a list of strings containing the overlapping segments

    # check if the parameters are valid
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    if not isinstance(segment_length, int) or segment_length <= 0:
        raise ValueError("segment_length must be a positive integer")
    # an overlap of 1 would never advance the window, so it is excluded
    if not isinstance(overlap_percent, (int, float)) or not 0 <= overlap_percent < 1:
        raise ValueError("overlap_percent must be a number in [0, 1)")

    # initialize an empty list to store the segments
    segments = []

    # split the text into words by whitespace
    words = text.split()

    # calculate the number of words to skip for each segment
    # (at least 1, so that range() below never receives a zero step)
    skip = max(1, int(segment_length * (1 - overlap_percent)))

    # loop through the words with a sliding window
    for i in range(0, len(words), skip):
        # get the current segment by slicing the words
        segment = " ".join(words[i:i+segment_length])
        # append the segment to the list
        segments.append(segment)

    return segments

all_docs_paras = [split_text(doc, segment_length=40) for doc in all_docs]

# remove any empty segments
all_docs_paras = [[para for para in paras if para] for paras in all_docs_paras]

# get embeddings for each paragraph
print('Getting embeddings...')
EMBEDDINGS_ENGINE = "text-embedding-ada-002"
all_embeddings_paras = [get_embeddings(paras, engine=EMBEDDINGS_ENGINE) for paras in tqdm(all_docs_paras)]
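
To make the windowing concrete, here is a minimal sketch of the same sliding-window arithmetic on ten placeholder words. `sliding_windows` is a hypothetical helper written for this illustration; it mirrors the stepping logic of `split_text` but works on a list of words and uses a tiny window so the overlap is easy to see.

```python
def sliding_windows(words, segment_length=4, overlap=0.5):
    """Hypothetical helper: overlapping windows over a list of words."""
    # number of words the window advances each time (never less than 1)
    step = max(1, int(segment_length * (1 - overlap)))
    return [words[i:i + segment_length] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(10)]
for segment in sliding_windows(words):
    print(" ".join(segment))
# w0 w1 w2 w3
# w2 w3 w4 w5
# w4 w5 w6 w7
# w6 w7 w8 w9
# w8 w9
```

Each window shares half of its words with the next one. The final window is shorter than the rest and is fully contained in the one before it, so the end of a document is represented slightly more often than the middle. With the settings used in this notebook (40 words, 0.5 overlap), the window advances 20 words at a time.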

Apply the short-time Fourier transform (STFT), which is built on the Fast Fourier Transform (FFT), to each document's array of segment embeddings to get a frequency-domain representation. Apply a simple lowpass filter, then transform back into the time domain with the inverse STFT (ISTFT). Finally, collapse the sequence of filtered embeddings into a single embedding per document by averaging over the segments.

In [None]:
# convert to numpy arrays
all_embeddings_paras = [np.array(doc) for doc in all_embeddings_paras]

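# short-time Fourier transform (STFT) of one document's segment embeddings
def get_fft(embedding):
    # embedding: array of shape (n_segments, embedding_dim)
    # librosa applies the STFT along the last axis (2D input needs librosa >= 0.9)
    # returns: complex array of shape (n_segments, 1 + n_fft // 2, n_frames)
    return librosa.stft(embedding, n_fft=32, win_length=4)

# lowpass filter
def lowpass_filter(fft, cutoff=0.5):
    """
    Lowpass filter for STFTs: zero every frequency bin (axis 1) at or above
    `cutoff`, given as a fraction of the number of frequency bins
    """
    fft = fft.copy()
    fft[:, int(cutoff*fft.shape[1]):] = 0
    return fft

# convert back to embeddings with the inverse STFT
def fft_to_embedding(fft):
    # n_fft is inferred from the number of frequency bins; win_length must match get_fft
    return librosa.istft(fft, win_length=4)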

apply_lowpass = True

# get FFTs
print('Applying FFTs...')
all_embeddings_paras_fft = [get_fft(embedding) for embedding in tqdm(all_embeddings_paras)]

if apply_lowpass:
    # lowpass filter
    print('Lowpass filtering...')
    all_embeddings_paras_fft = [lowpass_filter(fft) for fft in tqdm(all_embeddings_paras_fft)]

# convert back to embeddings
print('Converting back to embeddings with ISTFT...')
all_embeddings_paras_lowpass = [fft_to_embedding(fft) for fft in tqdm(all_embeddings_paras_fft)]

if not apply_lowpass:
    # assert that the embeddings are the same if lowpass filtering is not applied
    assert np.allclose(all_embeddings_paras_lowpass[0], all_embeddings_paras[0])

# average embeddings
train_embeddings_lowpass_avg = [np.mean(embeddings, axis=0) for embeddings in all_embeddings_paras_lowpass]
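
The effect of the filter step is easier to see on a toy array. The sketch below is a simplified stand-in, not the method used in the cell above: it assumes a plain real FFT (`np.fft.rfft`) along the embedding dimension in place of librosa's windowed STFT/ISTFT pair, and `lowpass_rows` is a hypothetical helper name.

```python
import numpy as np

def lowpass_rows(embeddings, cutoff=0.5):
    """Zero the upper frequency bins of each row, then invert the transform.

    Assumption: a plain real FFT along the last axis stands in for the
    windowed STFT/ISTFT pair used in the notebook.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    spectrum = np.fft.rfft(embeddings, axis=-1)
    # keep the lowest `cutoff` fraction of frequency bins, zero the rest
    spectrum[..., int(cutoff * spectrum.shape[-1]):] = 0
    return np.fft.irfft(spectrum, n=embeddings.shape[-1], axis=-1)

# two toy "segment embeddings" of dimension 8: a constant offset plus a
# component that flips sign at every index (the highest possible frequency)
fast = np.array([1.0, -1.0] * 4)
toy = np.stack([2.0 + fast, -1.0 + 0.5 * fast])

filtered = lowpass_rows(toy)
doc_embedding = filtered.mean(axis=0)  # average over segments
```

The rapidly alternating component is removed and only the constant offsets (2 and -1) survive, so every entry of `doc_embedding` is 0.5. Note that the filter acts along the embedding dimension, as it does in the librosa calls above. A plain FFT lowpass along the segment axis followed by a mean over segments would change nothing, because the mean is exactly the zero-frequency component that a lowpass filter keeps.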

Train a second `MLPClassifier`, this time on the filtered and averaged embeddings, to classify the original documents.

In [None]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(train_embeddings_lowpass_avg, all_labels, test_size=0.2, random_state=42)

# train classifier
clf2 = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=1000, alpha=1e-4,
                     solver='sgd', verbose=10, tol=1e-4, random_state=1,
                     learning_rate_init=.1)
clf2.fit(X_train, y_train)

# predict on test set
y_pred = clf2.predict(X_test)

# evaluate
print(accuracy_score(y_test, y_pred))

# Results

Accuracy on the held-out 20% test split:

- Embedding the whole text at once: 97.1%
- Sliding window without lowpass filter: 96%
- Sliding window with lowpass filter (cutoff 0.5): 97.6%