# Demo: sequences <-> embeddings
This notebook demonstrates how the ts2e library can be used to convert time series into embeddings. To that end, we use a dataset containing the daily price history of Amazon stock (AMZN). All column descriptions are provided. Prices are in USD.

In [None]:
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

In [None]:
import csv
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
import seaborn as sns
import networkx as nx
import pandas as pd
from timeseries.strategies import TimeseriesToGraphStrategy, TimeseriesEdgeVisibilityConstraintsNatural, EdgeWeightingStrategyNull
from timeseries.vectors import TimeSeriesEmbedding
from sklearn.preprocessing import MinMaxScaler
from core import model 

## Loading data

We first load the dataset :)

In [None]:
amazon_data = pd.read_csv(os.path.join(os.getcwd(), "amazon", "AMZN.csv"))

To use the ‘Date’ column properly, we convert it to datetime format so that its values are recognized as dates. We then set the ‘Date’ column as the index of the DataFrame. This gives the data a time-series structure, which makes analysis and visualization over time easier.

In [None]:
amazon_data["Date"] = pd.to_datetime(amazon_data["Date"])
amazon_data.set_index("Date", inplace=True)
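
To illustrate what the datetime index makes possible, here is a minimal sketch on a toy DataFrame. The three rows and their values are made up for illustration and are not real AMZN prices; only the column names follow the dataset.

```python
import pandas as pd

# Toy data with made-up values (NOT real AMZN prices)
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-02-03"],
    "Close": [94.90, 93.75, 100.21],
})

# Same two steps as in the notebook: parse the dates, then index by them
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.set_index("Date")

# With a DatetimeIndex, rows can be selected by a partial date string
january = toy.loc["2020-01"]

# ...and aggregated per calendar month ("MS" = month start)
monthly_mean = toy["Close"].resample("MS").mean()
```

Selecting by date string and resampling are only available once the index holds datetime values, which is why the conversion comes first.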

# What does the time series look like?

In [None]:
def plot_timeseries(sequence, title, x_legend, y_legend, color):
    plt.figure(figsize=(10, 6))
    plt.plot(sequence, linestyle='-', color=color)
    
    plt.title(title)
    plt.xlabel(x_legend)
    plt.ylabel(y_legend)
    plt.grid(True)
    plt.show()

In [None]:
def plot_timeseries_sequence(df_column, title, x_legend, y_legend, color='black'):
    sequence = model.Timeseries(model.TimeseriesArrayStream(df_column)).to_sequence()
    plot_timeseries(sequence, title, x_legend, y_legend, color)

In [None]:
def sequence_to_graph(column, color):
    strategy = TimeseriesToGraphStrategy(
        visibility_constraints=[TimeseriesEdgeVisibilityConstraintsNatural()],
        graph_type="undirected",
        edge_weighting_strategy=EdgeWeightingStrategyNull(),
    )

    g = strategy.to_graph(model.TimeseriesArrayStream(column))
    pos = nx.spring_layout(g.graph, seed=1)
    nx.draw(g.graph, pos, node_size=40, node_color=color)

In [None]:
plot_timeseries_sequence(amazon_data["Close"], "Original Sequence", "Year", "Value")

Given the length of the time series, let's focus on a few sub-segments so that we can better appreciate its behavior.

In [None]:
segment_1 = amazon_data[60:260]
segment_2 = amazon_data[960:1160]
segment_3 = amazon_data[3120:3320]
segment_4 = amazon_data[4320:4520]
segment_5 = amazon_data[5640:5840]
segment_6 = amazon_data[6000:6200]

What do the plots and networks (graphs) for these segments look like?

In [None]:
plot_timeseries_sequence(segment_1["Close"], "Example 1: Segment 1 from Amazon data", "Year", "Value", 'gray')
sequence_to_graph(segment_1["Close"], 'gray')

In [None]:
plot_timeseries_sequence(segment_2["Close"], "Example 2: Segment 2 from Amazon data", "Year", "Value", 'green')
sequence_to_graph(segment_2["Close"], 'green')

In [None]:
plot_timeseries_sequence(segment_3["Close"], "Example 3: Segment 3 from Amazon data", "Year", "Value", 'blue')
sequence_to_graph(segment_3["Close"], 'blue')

In [None]:
plot_timeseries_sequence(segment_4["Close"], "Example 4: Segment 4 from Amazon data", "Year", "Value", 'red')
sequence_to_graph(segment_4["Close"], 'red')

In [None]:
plot_timeseries_sequence(segment_5["Close"], "Example 5: Segment 5 from Amazon data", "Year", "Value", 'orange')
sequence_to_graph(segment_5["Close"], 'orange')

In [None]:
plot_timeseries_sequence(segment_6["Close"], "Example 6: Segment 6 from Amazon data", "Year", "Value", 'yellow')
sequence_to_graph(segment_6["Close"], 'yellow')
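
The graphs above are built with a natural visibility constraint. As a rough sketch of the idea, and assuming the standard natural visibility criterion (whether ts2e implements it exactly this way is not confirmed here), two points are connected when the straight line between them passes above every point in between. The function name `natural_visibility_edges` is hypothetical and not part of the library.

```python
def natural_visibility_edges(values):
    """Return the edges (a, b), a < b, of a natural visibility graph.

    Hypothetical helper for illustration; not part of ts2e.
    Points a and b see each other if every intermediate point c lies
    strictly below the straight line joining (a, values[a]) and (b, values[b]).
    """
    n = len(values)
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            visible = all(
                values[c] < values[b] + (values[a] - values[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.append((a, b))
    return edges


valley = natural_visibility_edges([3, 1, 2])  # the end points see each other over the dip
peak = natural_visibility_edges([1, 3, 1])    # the middle peak blocks the view
```

Neighboring points are always connected, since there is nothing between them. This quadratic-time loop is only meant to show the criterion; it is not tuned for long series.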

Let's turn the time series into vectors!

In [None]:
def normalize_data(dataset, column):
    # Scale one column to the range [0, 1] and return it as a 1-D array
    data = dataset[column].values
    scaler = MinMaxScaler()
    return scaler.fit_transform(data.reshape(-1, 1)).flatten()

This function normalizes a single column of a dataset with Min-Max scaling, which linearly maps the values to the range [0, 1]. Normalization is a common preprocessing step in machine learning: it puts features on the same scale, which often helps algorithms converge faster and perform better.
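
For reference, Min-Max scaling with the default range computes (x - min) / (max - min) for every value. The sketch below spells this out in plain NumPy; `min_max_scale` is a hypothetical helper, and returning zeros for a constant series is a choice made for this sketch.

```python
import numpy as np


def min_max_scale(values):
    """Linearly map values to [0, 1] (hypothetical helper, for illustration)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant series has no spread; return zeros to avoid dividing by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


scaled = min_max_scale([10.0, 15.0, 20.0])
```

Because each segment is scaled on its own, the smallest value in a segment always maps to 0 and the largest to 1, regardless of the absolute price level.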

In [None]:
def create_and_train_ts_embedding(data, window_size=100, epochs=20):
    ts_embedding = TimeSeriesEmbedding(data, window_size)
    print(ts_embedding)
    print(ts_embedding.data.size)
    ts_embedding.train_lstm(epochs)
    print(ts_embedding)
    return ts_embedding

Here, the TimeSeriesEmbedding class is used to generate embeddings for time series data. Embeddings are compact representations of data in a lower-dimensional space that capture significant patterns and relationships. The function uses a window-based strategy, controlled by window_size, to derive embeddings from the time series. It also trains a Long Short-Term Memory (LSTM) neural network for the given number of epochs, so that the embeddings are learned directly from the data and reflect its temporal dynamics.
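
How TimeSeriesEmbedding cuts the series into windows internally is not shown in this notebook. One common window-based strategy, sketched here purely as an assumption, is to slide a fixed-size window over the series one step at a time. The helper `sliding_windows` is hypothetical.

```python
import numpy as np


def sliding_windows(series, window_size):
    """Stack all overlapping windows of length window_size (stride 1).

    Hypothetical helper; the actual windowing inside TimeSeriesEmbedding
    may differ.
    """
    series = np.asarray(series, dtype=float)
    if window_size > len(series):
        raise ValueError("window_size must not exceed the series length")
    count = len(series) - window_size + 1
    return np.stack([series[i:i + window_size] for i in range(count)])


windows = sliding_windows(np.arange(6), window_size=3)
```

Under this assumption, a series of length n yields n - window_size + 1 windows, so a 200-point segment with window_size=100 would give 101 windows.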

In [None]:
def print_ts_embeddings_info(ts_embedding):
    embeddings = ts_embedding.get_embeddings()
    print("Shape of embeddings:", embeddings.shape)
    print("Sample embeddings:\n", embeddings[:5])
    return embeddings

This function retrieves the embeddings produced by the TimeSeriesEmbedding model, prints their shape and the first five rows, and returns them for further analysis.

In [None]:
def calculate_embeddings_similarity(ts_embedding, similarity_threshold=0.9):
    embeddings = ts_embedding.get_embeddings()
    num_embeddings = embeddings.shape[0]
    similarity_matrix = np.zeros((num_embeddings, num_embeddings))
    similar_pairs = []
    
    for i in range(num_embeddings):
        for j in range(i + 1, num_embeddings):  # Avoid duplicate calculations
            cosine_sim = 1 - cosine(embeddings[i], embeddings[j])
            similarity_matrix[i, j] = cosine_sim
            similarity_matrix[j, i] = cosine_sim 
            
            # closer to 1, more similar
            if cosine_sim > similarity_threshold:
                similar_pairs.append((i, j, cosine_sim))
    
    print("Pairwise cosine similarity:\n", similarity_matrix)
    print("Similar pairs (above threshold):\n", similar_pairs)
    return similarity_matrix, similar_pairs

This function calculates the cosine similarity between every pair of embeddings and identifies the pairs that exceed a specified similarity threshold. It retrieves the embeddings, initializes a similarity matrix and a list of similar pairs, and then computes the pairwise cosine similarities, storing each result symmetrically in the matrix and recording the pairs whose similarity is above the threshold. The diagonal of the matrix is never filled in, so it stays at zero. The similarity matrix and the list of similar pairs, including their similarity scores, are printed and returned.
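
The double loop is easy to follow but becomes slow for many embeddings. As an alternative sketch, the same similarities can be computed in one matrix product by normalizing each row to unit length first. Both helper names are hypothetical, and the code assumes that no embedding is the zero vector (a zero row would produce NaN values).

```python
import numpy as np


def pairwise_cosine_similarity(embeddings):
    """Cosine similarity of every row with every other row (hypothetical helper)."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # assumes no all-zero rows
    return unit @ unit.T


def pairs_above(similarity, threshold=0.9):
    """List (i, j, similarity) for i < j whose similarity exceeds the threshold."""
    rows, cols = np.triu_indices(similarity.shape[0], k=1)
    keep = similarity[rows, cols] > threshold
    return [(int(i), int(j), float(similarity[i, j]))
            for i, j in zip(rows[keep], cols[keep])]


vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
similarity = pairwise_cosine_similarity(vectors)
pairs = pairs_above(similarity)
```

One difference from the loop version: here the diagonal holds the self-similarity of 1, whereas the loop leaves it at 0.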

In [None]:
def sort_similarity_matrix(similarity_matrix):
    num_embeddings = similarity_matrix.shape[0]
    
    # Flat indices of the matrix entries, ordered by decreasing cosine similarity
    sorted_indices = np.argsort(-similarity_matrix, axis=None)
    sorted_similarity_matrix = similarity_matrix.flatten()[sorted_indices].reshape(similarity_matrix.shape)
    
    # Collect sorted pairs
    sorted_pairs = []
    for index in sorted_indices:
        i, j = divmod(index, num_embeddings)
        if i < j: 
            sorted_pairs.append((i, j, similarity_matrix[i, j]))

    print("Pairwise cosine similarity (sorted):\n", sorted_similarity_matrix)
    print("Sorted pairs:\n", sorted_pairs)
    return sorted_similarity_matrix, sorted_pairs

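We sort the similarity matrix by decreasing cosine similarity so that the most similar pairs come first. The function collects the sorted pairs of indices, keeping only the upper triangular portion (i < j) to eliminate duplicate pairs. Note that the returned sorted matrix is just the sorted values reshaped to the original shape, so its row and column positions no longer correspond to embedding indices.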
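
If only the ranked pairs are needed, a shorter route is to extract the upper triangle first and sort just those values. This is a sketch with a hypothetical helper name, `sorted_upper_pairs`, and not the notebook's own function.

```python
import numpy as np


def sorted_upper_pairs(similarity_matrix):
    """Return (i, j, similarity) for i < j, most similar pair first."""
    rows, cols = np.triu_indices(similarity_matrix.shape[0], k=1)
    values = similarity_matrix[rows, cols]
    # Negate so that argsort (ascending) yields decreasing similarity
    order = np.argsort(-values, kind="stable")
    return [(int(rows[k]), int(cols[k]), float(values[k])) for k in order]


matrix = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
])
ranked = sorted_upper_pairs(matrix)
```

Sorting only the n(n-1)/2 upper-triangle values avoids visiting the diagonal and the mirrored lower half.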

In [None]:
def plot_similarity_heatmap(similarity_matrix):
    plt.figure(figsize=(10, 8))
    sns.heatmap(similarity_matrix, cmap='gray_r')
    plt.title("Heatmap of Pairwise Cosine Similarity")
    plt.xlabel("Embedding Index")
    plt.ylabel("Embedding Index")
    plt.show()

The similarity matrix is plotted with the reversed grayscale color map ('gray_r') to represent the cosine similarity. In the resulting heatmap, darker shades indicate higher similarity and lighter shades indicate lower similarity. This visualization helps to spot clusters and relationships among the embeddings.

#### Example 1: Segment 1 from Amazon data

In [None]:
data_1 = normalize_data(segment_1, 'Close')
ts_embedding_1 = create_and_train_ts_embedding(data_1)

In [None]:
embeddings_1 = print_ts_embeddings_info(ts_embedding_1)

In [None]:
similarity_matrix_1, similar_pairs_1 = calculate_embeddings_similarity(ts_embedding_1)
plot_similarity_heatmap(similarity_matrix_1)

In [None]:
sorted_matrix_1, sorted_pairs_1 = sort_similarity_matrix(similarity_matrix_1)

<a id = 'example_1_heatmap'></a>

#### Example 2: Segment 2 from Amazon data

In [None]:
data_2 = normalize_data(segment_2, 'Close')
ts_embedding_2 = create_and_train_ts_embedding(data_2)

In [None]:
embeddings_2 = print_ts_embeddings_info(ts_embedding_2)

In [None]:
similarity_matrix_2, similar_pairs_2 = calculate_embeddings_similarity(ts_embedding_2)
plot_similarity_heatmap(similarity_matrix_2)

In [None]:
sorted_matrix_2, sorted_pairs_2 = sort_similarity_matrix(similarity_matrix_2)

#### Example 3: Segment 3 from Amazon data

In [None]:
data_3 = normalize_data(segment_3, 'Close')
ts_embedding_3 = create_and_train_ts_embedding(data_3)

In [None]:
embeddings_3 = print_ts_embeddings_info(ts_embedding_3)

In [None]:
similarity_matrix_3, similar_pairs_3 = calculate_embeddings_similarity(ts_embedding_3)
plot_similarity_heatmap(similarity_matrix_3)

In [None]:
sorted_matrix_3, sorted_pairs_3 = sort_similarity_matrix(similarity_matrix_3)

#### Example 4: Segment 4 from Amazon data

In [None]:
data_4 = normalize_data(segment_4, 'Close')
ts_embedding_4 = create_and_train_ts_embedding(data_4)

In [None]:
embeddings_4 = print_ts_embeddings_info(ts_embedding_4)

In [None]:
similarity_matrix_4, similar_pairs_4 = calculate_embeddings_similarity(ts_embedding_4)
plot_similarity_heatmap(similarity_matrix_4)

In [None]:
sorted_matrix_4, sorted_pairs_4 = sort_similarity_matrix(similarity_matrix_4)

#### Example 5: Segment 5 from Amazon data

In [None]:
data_5 = normalize_data(segment_5, 'Close')
ts_embedding_5 = create_and_train_ts_embedding(data_5)

In [None]:
embeddings_5 = print_ts_embeddings_info(ts_embedding_5)

In [None]:
similarity_matrix_5, similar_pairs_5 = calculate_embeddings_similarity(ts_embedding_5)
plot_similarity_heatmap(similarity_matrix_5)

In [None]:
sorted_matrix_5, sorted_pairs_5 = sort_similarity_matrix(similarity_matrix_5)

#### Example 6: Segment 6 from Amazon data

In [None]:
data_6 = normalize_data(segment_6, 'Close')
ts_embedding_6 = create_and_train_ts_embedding(data_6)

In [None]:
embeddings_6 = print_ts_embeddings_info(ts_embedding_6)
print(embeddings_6.size)

In [None]:
similarity_matrix_6, similar_pairs_6 = calculate_embeddings_similarity(ts_embedding_6)
plot_similarity_heatmap(similarity_matrix_6)

In [None]:
sorted_matrix_6, sorted_pairs_6 = sort_similarity_matrix(similarity_matrix_6)

In all of the heatmaps above, the diagonal is white. This is because calculate_embeddings_similarity never fills in the similarity of an embedding with itself, so the diagonal entries stay at zero. Of the six examples, [Example 1: Segment 1 from Amazon data](#example-1-segment-1-from-amazon-data) and [Example 3: Segment 3 from Amazon data](#example-3-segment-3-from-amazon-data) are the most similar by their embeddings. Comparing their time series plots, we can also see that the two are quite similar: both show increasing values.