<a class="anchor" id="0"></a>
# Cryptocurrencies with Market Cap +$1B : EDA and Clustering

## Acknowledgements

* dataset [Forecasting Top Cryptocurrencies](https://www.kaggle.com/datasets/vbmokin/forecasting-top-cryptocurrencies)
* notebook [Time Series Clustering [Store Sales]](https://www.kaggle.com/code/raskoshik/time-series-clustering-store-sales)
* notebook [Introduction to Time Series Clustering](https://www.kaggle.com/code/izzettunc/introduction-to-time-series-clustering/notebook)

## Intro

### Let's cluster cryptocurrencies whose market capitalization currently exceeds $1 billion and study their price patterns over 2021.

### We take the notebook [Time Series Clustering [Store Sales]](https://www.kaggle.com/code/raskoshik/time-series-clustering-store-sales) as a basis and adapt it to our task.

## Content 
- <a href='#1'>1. Data Description</a>
- <a href='#2'>2. Dealing with Missing Data</a>
- <a href='#3'>3. Time Series Feature Extraction</a>
- <a href='#4'>4. Clustering Methods</a>
    - <a href='#4.1'>4.1 Time Series Smoothing</a>
    - <a href='#4.2'>4.2 Time Series Scaling</a>
    - <a href='#4.3'>4.3 Time Series K-Means</a>
    - <a href='#4.4'>4.4 Downsizing Feature Space</a>
        - <a href='#4.4.1'>4.4.1 t-SNE</a>
        - <a href='#4.4.2'>4.4.2 MultiDimensional Scaling (MDS)</a>
    - <a href='#4.5'>4.5 Hierarchical Agglomerative Clustering (HAC)</a>
    - <a href='#4.6'>4.6 Time Series KMeans Results</a>
- <a href='#5'>5. Cluster Series Extraction</a>
    - <a href='#5.1'>5.1 Cluster Series DBA</a>
- <a href='#6'>6. Time Series Embeddings</a>
- <a href='#7'>7. References</a>

In [None]:
# Some libraries installation
! git clone https://github.com/tejaslodaya/timeseries-clustering-vae.git
! pip install tslearn
! pip uninstall scikit-learn --yes 
! pip install scikit-learn==0.24.1

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os

from tslearn.clustering import TimeSeriesKMeans
from tslearn.barycenters import dtw_barycenter_averaging

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE, MDS
from sklearn.cluster import AgglomerativeClustering

from scipy.cluster.hierarchy import dendrogram
from tqdm.autonotebook import tqdm

warnings.filterwarnings("ignore")
sns.set_style("darkgrid")

SEED=42

In [None]:
import datetime
import requests
import pandas_datareader as web
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Set time interval of data for given cryptocurrency : 2021 - the last full year
date_start = datetime.datetime(2021, 1, 1)
# date_end = datetime.datetime.now()
date_end = datetime.datetime(2021, 12, 31)
print(f"Time interval: from {date_start} to {date_end}")

### <a id='1'>1. Data Description</a>

So in total, we have 83 cryptocurrencies with +$1B Market Cap - see dataset [Forecasting Top Cryptocurrencies](https://www.kaggle.com/datasets/vbmokin/forecasting-top-cryptocurrencies)

We cluster them. That can help to:
- Find similar patterns in/between cryptocurrencies
- Reduce the number of models to be trained (cluster models)
- ...

**We are going to find similar cryptocurrencies**

In [None]:
# Data Reading 
df_about = pd.read_csv('../input/forecasting-top-cryptocurrencies/about_top_cryptocurrencies_1B_information.csv', sep=";")
df_about.head()

In [None]:
# Get list of the code of all cryptocurrencies in this dataset
crypto_codes_list = df_about['code'].tolist()
np.array(crypto_codes_list)

In [None]:
# Data download via API
def get_data_codes(cryptocurrencies_list, col, date_start, date_end=None):
    # Get feature col for given cryptocurrency in USD from Yahoo.finance and https://coinmarketcap.com/
    # col is the 'High', 'Low', 'Open', 'Close' or 'Volume' only!
    # date_end = None means that the date_end is the current day
    
    # Check for col
    if col not in ['High', 'Low', 'Open', 'Close', 'Volume']:
        print(f"Feature {col} is absent")
        return None
    
    # Default the end date to the current moment
    # (the module is imported as `datetime`, so `dt` would raise a NameError)
    if date_end is None:
        date_end = datetime.datetime.now()
    
    # Generate the DataFrame with the list of dates
    data = pd.DataFrame()
    dates_list = []
    for i in range((date_end - date_start).days + 1):
        #dates_list.append((date_start + datetime.timedelta(i)).strftime("%Y-%m-%d"))
        dates_list.append(date_start + datetime.timedelta(i))
    data['date'] = dates_list
    
    # Get data
    for item in cryptocurrencies_list:
        
        # Download data
        try:
            df = web.DataReader(f'{item}-USD', 'yahoo', date_start, date_end)
            df = df[[col]].reset_index(drop=False)
            df.columns = ['date', item]

            # Merging data
            data = data.merge(df, on='date', how='left')
            #print(item)            
        
        except Exception:
            print(f'Cryptocurrency "{item}" has problem downloading from Yahoo')
        
    return data

In [None]:
%%time
#df = get_data_codes(['BTC', 'ICP', 'GALA'], 'Close', date_start, date_end)
df = get_data_codes(crypto_codes_list, 'Close', date_start, date_end)
df

In [None]:
print('Number of cryptocurrencies (without "ICP"): ', df.shape[1]-1)

Let's find similar cryptocurrencies. First, we have to make sure that the data is correct and has no missing values.

### <a id='2'>2. Dealing with Missing Data</a>

In [None]:
# Count missing data
df_missing = df.isna().sum().sort_values(ascending=False)
df_missing[df_missing > 0]

Only 5 cryptocurrencies have missing data. Let's remove them.

In [None]:
df_missing_list = df_missing[df_missing > 0].index.tolist()
df_missing_list

In [None]:
df = df.drop(columns = df_missing_list)
crypto_codes_list = df.columns.tolist()
crypto_codes_list.remove('date')

**TASK:** Instead of deleting the cryptocurrencies with missing data, try imputing or interpolating the gaps from neighboring values.
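
One possible way to approach this task is sketched below. It is only an illustration on a toy table: the column names `AAA` and `BBB` and their values are made up and stand in for the wide price DataFrame used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide price table (one column per coin); values are invented
dates = pd.date_range("2021-01-01", periods=6, freq="D")
prices = pd.DataFrame(
    {
        "AAA": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],  # gaps inside the series
        "BBB": [np.nan, 2.0, 2.5, np.nan, 3.5, 4.0],  # gap at the very start
    },
    index=dates,
)

# Linear interpolation fills interior gaps from the neighboring values;
# limit_direction="both" also fills gaps at the start or end of a series
filled = prices.interpolate(method="linear", limit_direction="both")
print(filled)
```

Whether interpolation is appropriate depends on how long the gaps are: a coin that was listed in the middle of 2021 has no real prices before its listing date, so filling that period would invent data.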

In [None]:
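# MinMaxScaler
def df_minmax_scaler(df):
    # Data Scaling (each column is scaled to [0, 1] independently)
    scaler = MinMaxScaler().fit(df)
    # Keep the original index so the date index is not lost after scaling
    df = pd.DataFrame(scaler.transform(df), columns=df.columns, index=df.index)
    return df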

In [None]:
#df2 = df[['date', 'BTC', 'ETH', 'USDT', 'BNB']].copy()
df2 = df.copy()
df2.index = df['date']
df2 = df2.drop(columns=['date'])
df2 = df_minmax_scaler(df2)

In [None]:
# The 5 cryptocurrencies with the biggest market cap
crypto_codes_list_biggest = crypto_codes_list[:5]
df2[crypto_codes_list_biggest].plot(figsize=(16,12))

In [None]:
# The 5 cryptocurrencies with the biggest market cap
#axs = df2.plot.area(figsize=(12, len(crypto_codes_list)), subplots=True)
axs = df2[crypto_codes_list_biggest].plot.area(figsize=(12, 5), subplots=True)

In [None]:
df3 = df.melt(id_vars = ['date'])
df3.columns = ['date', 'currency', 'value']
df3

### <a id='3'>3. Time Series Feature Extraction</a>

In general, time series clustering can be divided into 2 types:
- **Feature-Based approach**: we try to extract everything possible from the signal/time series (feature extraction)
- **Raw data-Based approach**: directly applied to time series vectors without any spatial transformations

In this notebook, we are going to use the **Raw data-Based approach**. It means that we will have a feature matrix where:
- Rows: Different Time Series
- Features: Time Observations

In this case, we will be clustering in a very high dimensional space and will most likely run into a problem known as the **Curse of Dimensionality**. As a result, obtained clusters may have sparse shapes, overlap with other clusters and so on.

To prevent this, we will need to use **dimensionality reduction methods** (t-SNE, PCA, MDS...)
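
PCA is named above but not used later in this notebook, so here is a minimal sketch of what such a reduction does, written with plain NumPy on random toy data. The matrix shape (20 series x 50 time steps) is an arbitrary assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy feature matrix: 20 series x 50 time observations (random, for illustration only)
X = rng.normal(size=(20, 50))

# PCA via SVD: center every column, then project onto the top-2 principal directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # shape (20, 2)

# Share of the total variance kept by the two components
explained = float((S[:2] ** 2).sum() / (S ** 2).sum())
print(X_2d.shape, round(explained, 3))
```

Each series is now described by 2 numbers instead of 50, at the cost of the variance that the dropped components carried.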

### <a id='4'>4. Clustering Methods</a>

We will focus on the following clustering methods:
- `K-Means/TimeSeriesKMeans (tslearn library)`
- `Hierarchical Agglomerative Clustering`

But any known clustering algorithm can be applied

**If the time series is a signal** (data from various devices), then the best way to extract features would be methods from the `signal processing` area

For example, the Fourier transform for finding dominant frequencies, spectrograms, and wavelet transforms
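
As a small illustration of the Fourier idea, the sketch below recovers the dominant period of a synthetic signal with NumPy. The signal (a 7-day sine wave plus noise) is invented for the example and is not taken from the dataset.

```python
import numpy as np

n_days = 364
t = np.arange(n_days)
rng = np.random.default_rng(0)
# Synthetic signal: a weekly cycle plus Gaussian noise
signal = np.sin(2 * np.pi * t / 7) + 0.3 * rng.normal(size=n_days)

# Magnitude spectrum of the mean-removed signal
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n_days, d=1.0)  # in cycles per day

dominant_freq = freqs[np.argmax(spectrum)]
dominant_period = 1.0 / dominant_freq
print(f"Dominant period: {dominant_period:.1f} days")
```

The dominant frequency (or the few strongest ones) can then serve as a feature of the series in a Feature-Based approach.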

**If the series is noisy, then it would be nice to smooth it first** (various smoothing methods) so as not to find false patterns

### <a id='4.1'>4.1 Time Series Smoothing</a>
We no longer have missing values, **but the series still look noisy**. Let's apply a centered moving average (window size = 7, i.e. a weekly trend)

In [None]:
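# Time Series Smoothing
smoothed_parts = []
for item in df3['currency'].unique():
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added
    current_cur = df3.query(f'currency == "{item}"').copy()
    current_cur['smoth_7'] = current_cur['value'].rolling(7, center=True).mean()
    smoothed_parts.append(current_cur[['date', 'currency', 'smoth_7']])

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
res_df = pd.concat(smoothed_parts)

df4 = res_df.dropna()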
df4

In [None]:
# Let's have a look at Bitcoin
selected_cur3 = df3[df3['currency']=='BTC'].copy()
selected_cur4 = df4[df4['currency']=='BTC']
selected_cur3['smooth'] = selected_cur4['smoth_7']
selected_cur3[['value', 'smooth']].plot(figsize=(12, 6))

After smoothing we can get more insights about the series as well as define similarities between them

Initial preprocessing has been done and we can create the main feature matrix  

In [None]:
# Feature matrix with shape (n_series x time_observations)
series_df = df4.pivot(index='currency', columns='date', values='smoth_7')
series_df = series_df.dropna(axis='columns')
series_df.head()

### <a id='4.2'>4.2 Time Series Scaling</a>
Scaling must be applied to each series independently

In [None]:
# Scaling
# First transposition - to have series in columns (allows scaling each series independently)
# Second Transposition - come back to initial feature matrix shape (n_series x time_observations)
scaler = StandardScaler()
scaled_ts = scaler.fit_transform(series_df.T).T 

### <a id='4.3'>4.3 Time Series K-Means</a>

When using `K-Means` clustering, the **Feature-Based approach** is usually the better choice: we extract a set of features from each series, hope that they describe it well, and then group the series with any clustering algorithm. Here, however, I'd like to demonstrate an **out-of-the-box solution** (the Raw data-Based approach). For the Feature-Based approach, these libraries can help:
- <a href='https://github.com/fraunhoferportugal/tsfel'>TSFEL</a>
- <a href='https://github.com/blue-yonder/tsfresh'>tsfresh</a>
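
To make the Feature-Based idea concrete without depending on either library, here is a minimal NumPy sketch. The function `basic_features` and the five features it computes are hypothetical choices for illustration, and the price series are randomly generated.

```python
import numpy as np

def basic_features(series):
    """Hypothetical hand-picked features for one price series."""
    returns = np.diff(series) / series[:-1]
    return np.array([
        series.mean(),                         # average level
        series.std(),                          # spread of the level
        returns.mean(),                        # average daily return
        returns.std(),                         # volatility of daily returns
        (series[-1] - series[0]) / series[0],  # total change over the period
    ])

rng = np.random.default_rng(0)
# 10 toy random-walk "price" series of 100 steps each
toy_series = 100 + np.cumsum(rng.normal(size=(10, 100)), axis=1)

features = np.vstack([basic_features(s) for s in toy_series])
# Standardize each feature column so that no single feature dominates the distance
features_scaled = (features - features.mean(axis=0)) / features.std(axis=0)
print(features_scaled.shape)
```

The resulting matrix (one row per series, one column per feature) can be passed to any clustering algorithm in place of the raw time observations.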

It is important how we define the similarity between observations in a feature space. When using KMeans we can use:
- `Euclidean distance` 
- `Dynamic Time Warping (DTW)`


When using <a href='https://tslearn.readthedocs.io/en/stable/user_guide/dtw.html'> Dynamic Time Warping </a> the **Feature-Based approach is not suitable**, since DTW measures the similarity of the series' shapes directly (how they overlap; the size, similarity and location of their peaks...)
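
The difference between the two metrics can be seen on a tiny example. The function `dtw_distance` below is a from-scratch sketch of the classic dynamic-programming formulation (square root of the minimal sum of squared differences along a warping path); it is not tslearn's implementation, and the two series are invented.

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(n*m) dynamic-programming DTW between two 1D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

a = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])  # same peak, one step earlier

euclid = float(np.linalg.norm(a - b))
dtw = dtw_distance(a, b)
print(f"Euclidean: {euclid:.2f}, DTW: {dtw:.2f}")
```

Euclidean distance compares the series point by point and penalizes the shift, while DTW is allowed to stretch the time axis and treats the two peaks as the same pattern.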

For `DTW` it is better to downsample the series first using `resampling` (i.e. change the frequency of the series). For example, instead of daily observations/ticks, take one observation per 5/10/15 days, keeping in mind that the main patterns (peaks, fluctuations) must still be visible at that interval. This keeps the structure of the series while making it shorter, so identifying similar series with `DTW` becomes much faster
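
A minimal sketch of such downsampling with pandas is shown below. The daily series is a toy ramp, and the 5-day mean is just one possible aggregation; this notebook itself clusters the daily series without resampling.

```python
import numpy as np
import pandas as pd

# Toy daily series: 30 days with values 0, 1, ..., 29
dates = pd.date_range("2021-01-01", periods=30, freq="D")
daily = pd.Series(np.arange(30, dtype=float), index=dates)

# Replace every block of 5 consecutive days with its mean value
five_day = daily.resample("5D").mean()
print(len(daily), "->", len(five_day))
print(five_day)
```

Since the cost of DTW grows with the product of the two series lengths, a series that is 5 times shorter makes each pairwise comparison roughly 25 times cheaper.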


First, apply the KMeans algorithm from the <a href='https://tslearn.readthedocs.io/en/stable/index.html'>tslearn library</a>

In [None]:
# Run KMeans and plot the results 
def get_kmeans_results(data, max_clusters=10, metric='euclidean', seed=23):
    """
    Runs KMeans n times (according to max_cluster range)

    data: pd.DataFrame or np.array
        Time Series Data
    max_clusters: int
        Number of different clusters for KMeans algorithm
    metric: str
        Distance metric between the observations
    seed: int
        random seed
    Returns: 
    -------
    None      
    """
    # Main metrics
    distortions = []
    silhouette = []
    clusters_range = range(1, max_clusters+1)
    
    for K in tqdm(clusters_range):
        kmeans_model = TimeSeriesKMeans(n_clusters=K, metric=metric, n_jobs=-1, max_iter=10, random_state=seed)
        kmeans_model.fit(data)
        distortions.append(kmeans_model.inertia_)
        if K > 1:
            # Note: sklearn's silhouette_score uses Euclidean distance here, whatever `metric` is
            silhouette.append(silhouette_score(data, kmeans_model.labels_))
        
    # Visualization
    plt.figure(figsize=(10,4))
    plt.plot(clusters_range, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('Elbow Method')
    
    plt.figure(figsize=(10,4))
    plt.plot(clusters_range[1:], silhouette, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Silhouette score')
    plt.title('Silhouette');

Let's try finding similar series using DTW metric

In [None]:
%%time

# Run the algorithm using the DTW metric
get_kmeans_results(data=scaled_ts, max_clusters=5, metric='dtw', seed=SEED)

The Silhouette score is hard to interpret (perhaps 4 clusters?), while the Elbow Method suggests that 2 clusters are a good choice

Let's have a look at obtained clusters 

In [None]:
# Visualization for obtained clusters   
def plot_clusters(data, cluster_model, dim_red_algo, xsize=16, ysize=10, title=""):
    """
    Plots clusters obtained by clustering model 

    data: pd.DataFrame or np.array
        Time Series Data
    cluster_model: Class
        Clustering algorithm 
    dim_red_algo: Class
        Dimensionality reduction algorithm (e.g. TSNE/PCA/MDS...) 
    Returns:
    -------
    None
    """
    cluster_labels = cluster_model.fit_predict(data)
    # tslearn stores centroids as (n_clusters, n_timestamps, 1); flatten them to 2D
    centroids = cluster_model.cluster_centers_
    centroids = centroids.reshape(centroids.shape[0], -1)
    u_labels = np.unique(cluster_labels)

    # Downsize the data into 2D. Centroids are projected together with the
    # series so that both are drawn in the same 2D space
    if data.shape[1] > 2:
        stacked_2d = dim_red_algo.fit_transform(np.vstack([data, centroids]))
        data_2d, centroids_2d = stacked_2d[:len(data)], stacked_2d[len(data):]
    else:
        data_2d, centroids_2d = data, centroids

    plt.figure(figsize=(xsize, ysize))
    for u_label in u_labels:
        cluster_points = data_2d[cluster_labels == u_label]
        plt.scatter(cluster_points[:, 0], cluster_points[:, 1], label=u_label)

    # Centroids Visualization
    plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], s=150, color='r', marker="x")

    plt.title('Clustered Data'+title)
    plt.xlabel("Feature space for the 1st feature")
    plt.ylabel("Feature space for the 2nd feature")
    plt.grid(True)
    plt.legend(title='Cluster Labels');

In [None]:
%%time

# let's look at the cluster shape with n_clusters=2 (Elbow Method)
model = TimeSeriesKMeans(n_clusters=2, metric='dtw', n_jobs=-1, max_iter=10, random_state=SEED)

plot_clusters(data=scaled_ts,
              cluster_model=model,
              dim_red_algo=TSNE(n_components=2, init='pca', random_state=SEED))

In [None]:
%%time

# let's look at the cluster shape with n_clusters=4 (Silhouette Method)
model = TimeSeriesKMeans(n_clusters=4, metric='dtw', n_jobs=-1, max_iter=10, random_state=SEED)

plot_clusters(data=scaled_ts,
              cluster_model=model,
              dim_red_algo=TSNE(n_components=2, init='pca', random_state=SEED))

The clusters overlap, and with either 2 or 4 clusters the result looks like noise

In [None]:
# let's compare with the euclidean metric
get_kmeans_results(data=scaled_ts, max_clusters=5, metric='euclidean', seed=SEED)

The results are much worse than with the `DTW` metric. Let's try reducing the dimensionality of the feature space

### <a id='4.4'>4.4 Downsizing Feature Space</a> 

Let's apply dimensionality reduction methods (t-SNE, MDS, VRAE...)

### <a id='4.4.1'>4.4.1 t-SNE</a> 

In [None]:
# Downsize the features into 2D
tsne = TSNE(n_components=2, init='pca', random_state=SEED)
data_tsne = tsne.fit_transform(scaled_ts)

get_kmeans_results(data=data_tsne, max_clusters=10, metric='euclidean', seed=SEED)

In [None]:
# let's look at the cluster shape
model = TimeSeriesKMeans(n_clusters=2, metric='euclidean', n_jobs=-1, max_iter=10, random_state=SEED)

plot_clusters(data=data_tsne,
              cluster_model=model,
              dim_red_algo=TSNE(n_components=2, init='pca', random_state=SEED))

Cluster shape is relatively good, observations don't overlap but are a bit sparse

### <a id='4.4.2'>4.4.2 MultiDimensional Scaling (MDS)</a> 

In [None]:
mds = MDS(n_components=2, n_init=3, max_iter=100, random_state=SEED)
data_mds = mds.fit_transform(scaled_ts) 

get_kmeans_results(data=data_mds, max_clusters=10, metric='euclidean', seed=SEED)

In [None]:
# let's look at the cluster shape
model = TimeSeriesKMeans(n_clusters=2, metric='euclidean', n_jobs=-1, max_iter=10, random_state=SEED)

plot_clusters(data=data_mds,
              cluster_model=model,
              dim_red_algo=TSNE(n_components=2, init='pca', random_state=SEED))

We can choose between 2 and 5 clusters

In [None]:
# let's look at the cluster shape
for i in range(4):
    model = TimeSeriesKMeans(n_clusters=i+2, metric='euclidean', n_jobs=-1, max_iter=10, random_state=SEED)

    plot_clusters(data=data_mds,
                  cluster_model=model,
                  dim_red_algo=TSNE(n_components=2, init='pca', random_state=SEED), xsize=12, ysize=4, 
                  title=f'for {i+2} clusters')

### <a id='4.5'>4.5 Hierarchical Agglomerative Clustering (HAC)</a> 

In [None]:
# HAC clustering (similar to get_kmeans_results function)
# 'euclidean' is not a valid linkage; valid values are 'ward', 'complete', 'average', 'single'
def get_hac_results(data, max_clusters=10, linkage='ward', seed=23):
    silhouette = []
    clusters_range = range(2, max_clusters+1)
    for K in tqdm(clusters_range):
        model = AgglomerativeClustering(n_clusters=K, linkage=linkage)
        model.fit(data)
        silhouette.append(silhouette_score(data, model.labels_))
        
    # Plot
    plt.figure(figsize=(10,4))
    plt.plot(clusters_range, silhouette, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Silhouette score')
    plt.title('Silhouette')
    plt.grid(True);

In [None]:
# Look at all results at a time
features_df = [scaled_ts, data_tsne, data_mds]
for features in features_df:  # a separate name avoids overwriting the price DataFrame `df`
    get_hac_results(data=features, max_clusters=10, linkage='ward', seed=SEED)

Let's choose 5 clusters with MDS features

In [None]:
def plot_dendrogram(data, model, figsize=(16,10), **kwargs):
    """
    Plots a dendrogram using HAC 

    data: pd.DataFrame or np.array
        Time Series Data
    model: Class
        Clustering Model 
    figsize: tuple
        Figure size
    Returns:
    -------
    None 
    """
    model.fit(data)
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
    
    plt.figure(figsize=figsize, dpi=200)
    dendrogram(linkage_matrix, **kwargs)
    plt.title('Dendrogram')
    plt.xlabel('Objects')
    plt.ylabel('Distance')
    plt.grid(False)
    plt.tight_layout();

In [None]:
# Dendrogram
model = AgglomerativeClustering(n_clusters=5, linkage='ward', affinity='euclidean', compute_distances=True)

plot_dendrogram(data=features_df[-1],
                model=model,
                color_threshold=60)

###  <a id='4.6'>4.6 Time Series KMeans Results</a> 
Finally, we choose TimeSeriesKMeans on the MDS-reduced features with 5 clusters. The data is quite diverse, so with 5 clusters we are likely to get groups of similar series.

In [None]:
# Train TimeSeriesKMeans with MDS
kmeans_model = TimeSeriesKMeans(n_clusters=5, metric='euclidean', n_jobs=-1, max_iter=10, random_state=SEED)
cluster_labels = kmeans_model.fit_predict(data_mds)

ts_clustered = [scaled_ts[(cluster_labels == label), :] for label in np.unique(cluster_labels)]

In [None]:
# Objects distribution in the obtained clusters 
labels = [f'Cluster_{i}' for i in range(len(ts_clustered))]
samples_in_cluster = [val.shape[0] for val in ts_clustered]

plt.figure(figsize=(16,5))
plt.bar(labels, samples_in_cluster);

In [None]:
def plot_cluster_ts(current_cluster):
    """
    Plots time series in a cluster 

    current_cluster: np.array
        Cluster with time series 
    Returns:
    -------
    None 
    """
    fig, ax = plt.subplots(
        int(np.ceil(current_cluster.shape[0]/4)),4,
        figsize=(45, 3*int(np.ceil(current_cluster.shape[0]/4)))
    )
    fig.autofmt_xdate(rotation=45)
    ax = ax.reshape(-1)
    for indx, series in enumerate(current_cluster):
        ax[indx].plot(series)
        plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show();

Let's have a look at the obtained clusters

In [None]:
for cluster in range(len(ts_clustered)):
    print(f"==========Cluster number: {cluster}==========")
    plot_cluster_ts(ts_clustered[cluster])

Most of the series within each cluster look alike, which is great. We have found that the cryptocurrencies can be grouped into 5 clusters, and the cryptocurrencies within a cluster share the same patterns

### <a id='5'>5. Cluster Series Extraction</a>
Alright, we clustered the series data, what's next? Well, it depends on the task you are dealing with. Probably, after clustering the series you will want to get a cluster series (a series that describes all the series in the cluster)

There are several options:
- Use cluster centroid 
- Take the mean of all the series in a cluster
- Take the series that has the minimum distance to the cluster centroid 
- <a href='https://tslearn.readthedocs.io/en/stable/variablelength.html#barycenter-computation'>DBA method</a>

We will cover:
- DBA
- Cluster Mean
- Closest Series to Cluster Centroid 

In [None]:
# Closest Series to Cluster Centroid
closest_clusters_indxs = [np.argmin([np.linalg.norm(cluster_center - point, ord=2) for point in data_mds]) \
                                                                        for cluster_center in kmeans_model.cluster_centers_]

closest_ts = scaled_ts[closest_clusters_indxs, :]

In [None]:
# DBA
dba_ts = [dtw_barycenter_averaging(cluster_serieses, max_iter=10, verbose=True) for cluster_serieses in ts_clustered]

Let's compare how a certain method affects a final cluster shape

Choose a cluster with a few series. This will help to see the differences between the algorithms!

In [None]:
CLUSTER_N = 2

plt.figure(figsize=(25, 5))
plt.plot(ts_clustered[CLUSTER_N].T,  alpha = 0.4) # all series in the selected cluster
plt.plot(closest_ts[CLUSTER_N], c = 'r', label='Cluster Time Series')
plt.title(f'Cluster Series - Closest to Cluster Centroid. Cluster {CLUSTER_N}')
plt.legend();

plt.figure(figsize=(25, 5))
plt.plot(ts_clustered[CLUSTER_N].T,  alpha = 0.4) 
plt.plot(np.mean(ts_clustered[CLUSTER_N], axis=0), c = 'r', label='Cluster Time Series')
plt.title(f'Cluster Series - Cluster Mean. Cluster {CLUSTER_N}')
plt.legend();

plt.figure(figsize=(25, 5))
plt.plot(ts_clustered[CLUSTER_N].T,  alpha = 0.4) 
plt.plot(dba_ts[CLUSTER_N], c = 'r', label='Cluster Time Series')
plt.title(f'Cluster Series - DBA. Cluster {CLUSTER_N}')
plt.legend();

Why not choose the first option? Well, it has a big spike and doesn't describe all series data. As a solution, smoothing can be applied (I think it's always a good idea to apply smoothing in this case because noisy series might be chosen)

Both the DBA and the Mean methods look good, so either can be chosen!

### <a id='5.1'>5.1 Cluster Series DBA</a>
All cluster series extracted by DBA

In [None]:
for indx, series in enumerate(dba_ts):
    plt.figure(figsize=(25, 5))
    plt.plot(ts_clustered[indx].T,  alpha = 0.15)
    plt.plot(series, c = 'r', label='Cluster Time Series')
    plt.title(f'Scaled values. Cluster {indx}')
    plt.legend();

### <a id='6'>6. Time Series Embeddings</a>

In this approach, we train a neural network (a recurrent auto-encoder with LSTM / GRU blocks) and obtain compressed vector representations of the series (embeddings)

The encoder and decoder are trained so that similar input series get embeddings that are close to each other, while series that differ end up far apart, according to the distance that we choose.

The model is trained in unsupervised mode. The obtained embeddings are clustered at the end 

In [None]:
os.chdir('./timeseries-clustering-vae')

from vrae.vrae import VRAE
from vrae.utils import *

import torch
import plotly
from torch.utils.data import DataLoader, TensorDataset
plotly.offline.init_notebook_mode()

In [None]:
vrae_df = scaled_ts.copy()
dload = '/content/timeseries_clustering_vae/' 

In [None]:
# Model Params
hidden_size = 50
hidden_layer_depth = 1
latent_length = 20
batch_size = 5
learning_rate = 0.005
n_epochs = 40
dropout_rate = 0.1
optimizer = 'Adam' # Adam/SGD
cuda = True # Train on GPU
print_every=30
clip = True 
max_grad_norm=5
loss = 'MSELoss' # SmoothL1Loss/MSELoss
block = 'LSTM' # LSTM/GRU

In [None]:
# We don't use test_df, create train_df using all the data we have
X_train = np.expand_dims(scaled_ts, -1)
train_dataset = TensorDataset(torch.from_numpy(X_train))

sequence_length = X_train.shape[1] 
number_of_features = X_train.shape[2] 

In [None]:
# Model Creation 
vrae = VRAE(sequence_length=sequence_length,
            number_of_features = number_of_features,
            hidden_size = hidden_size, 
            hidden_layer_depth = hidden_layer_depth,
            latent_length = latent_length,
            batch_size = batch_size,
            learning_rate = learning_rate,
            n_epochs = n_epochs,
            dropout_rate = dropout_rate,
            optimizer = optimizer, 
            cuda = cuda,
            print_every=print_every, 
            clip=clip, 
            max_grad_norm=max_grad_norm,
            loss = loss,
            block = block,
            dload = dload)

In [None]:
%%time 

vrae.fit(train_dataset)

In [None]:
# Get embeddings
embeddings = vrae.transform(train_dataset)

# Cluster the embeddings
get_kmeans_results(data=embeddings, max_clusters=10, metric='euclidean', seed=SEED)

In [None]:
model = TimeSeriesKMeans(n_clusters=6, metric='euclidean', n_jobs=-1, max_iter=10, random_state=SEED)
 
plot_clusters(data=embeddings,
              cluster_model=model,
              dim_red_algo=TSNE(n_components=2, init='pca', random_state=SEED))

I hope you find this notebook useful and enjoyable.

Your comments and feedback are most welcome.

[Go to Top](#0)