Shapelet Transform #135

Closed
adityabhandwalkar opened this issue Sep 28, 2022 · 18 comments

Comments

@adityabhandwalkar

Hello everyone,

I have a dataset like the one below, where Q0 is the feature value and TS is the timestamp, and I would like to apply the shapelet transform to this CSV file. I have written code for this, but it throws the following error:

ValueError: could not convert string to float: '2018-03-02 00:58:19.202450'
Q0 TS
0.012364804744720459, 2018-03-02 00:44:51.303082
0.012344598770141602, 2018-03-02 00:44:51.375207
0.012604951858520508, 2018-03-02 00:44:51.475198
0.012307226657867432, 2018-03-02 00:44:51.575189
0.012397348880767822, 2018-03-02 00:44:51.675180
0.013141036033630371, 2018-03-02 00:44:51.775171
0.012811839580535889, 2018-03-02 00:44:51.875162
0.012950420379638672, 2018-03-02 00:44:51.975153
0.013257980346679688, 2018-03-02 00:44:52.075144
########################################
Code:

from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from pyts.datasets import load_gunpoint
from pyts.transformation import ShapeletTransform
from datetime import time

# Toy dataset

data=pd.read_csv('dataset11.csv')
pf=data.head(10)

y=data[['Q0']]
X=data[['TS']]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)

print(X_train)

# as columns.
dataframe = pd.DataFrame(
    pf, columns=['TS', 'Q0'])

# Changing the datatype of Date, from
# Object to datetime64

#dataframe["Sample2"] = Sample2.time.strptime("%T")

# Setting the Date as index

dataframe = dataframe.set_index("TS")
dataframe

# setting figure size to 12, 6

plt.figure(figsize=(12, 6))

# Labelling the axes and setting
# a title

plt.xlabel("Time")
plt.ylabel("Values")
plt.title("Vibration")

# plotting the "Q0" column alone

plt.plot(dataframe["Q0"])
plt.legend(loc='best', fontsize=8)
plt.show()

st = ShapeletTransform(window_sizes='auto', sort=True)
X_new = st.fit_transform(X_train, y_train)

print(X_new)

# Visualize the four most discriminative shapelets

plt.figure(figsize=(6, 4))
for i, index in enumerate(st.indices_[:4]):
    idx, start, end = index
    plt.plot(X_train[idx], color='C{}'.format(i),
             label='Sample {}'.format(idx))
    plt.plot(np.arange(start, end), X_train[idx, start:end],
             lw=5, color='C{}'.format(i))

plt.xlabel('Time', fontsize=12)
plt.title('The four most discriminative shapelets', fontsize=14)
plt.legend(loc='best', fontsize=8)
plt.show()

######################################

Can anyone help me run this code and visualize the shapelet transform?
shapelet.txt

@johannfaouzi
Owner

Hi,

I don't think that the shapelet transform algorithm is suited for your dataset. From what I understood, you have a single time series (with the values corresponding to variable Q0). To use the shapelet transform algorithm, you would need:

  • a dataset of time series. For a given time series and any shapelet extracted from this time series, the distance between this shapelet and this time series is equal to 0 by definition. Thus, if you have a single time series, the distance between this time series and any shapelet extracted from this time series is equal to 0, which is not useful at all.
  • (optionally) class labels. The shapelet transform algorithm usually performs feature selection in order to keep the most relevant shapelets only. If your dataset is large (many time series and/or many time points), you will end up with many shapelets if you don't perform feature selection.

I would need more information about your data because, from what I understood, the shapelet transform algorithm is not suited to the data snippet that you gave.
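The zero-distance point above can be checked with a small sketch. It uses only NumPy, random data, and the mean squared difference as the distance, all of which are assumptions made for illustration and not necessarily what pyts computes internally; `min_distance` is a hypothetical helper.

```python
import numpy as np

def min_distance(series, shapelet):
    """Smallest mean squared difference between `shapelet` and any window of `series`."""
    windows = np.lib.stride_tricks.sliding_window_view(series, len(shapelet))
    return float(np.mean((windows - shapelet) ** 2, axis=1).min())

rng = np.random.default_rng(0)
series = rng.normal(size=50)   # the time series the shapelet comes from
other = rng.normal(size=50)    # an unrelated time series

shapelet = series[10:19]       # shapelet of length 9 extracted from `series`

d_self = min_distance(series, shapelet)    # exactly 0: the shapelet matches its own window
d_other = min_distance(other, shapelet)    # strictly positive for an unrelated series
print(d_self, d_other)
```

With a single time series, every column of the transformed matrix would look like `d_self`, which is why several series (or snippets) are needed.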

@adityabhandwalkar
Author

Thanks for your answer. I have one very long time series which I have sliced into a number of parts (data snippets), and every snippet looks like the one above. What I would like to do is shapelet discovery first and then the shapelet transform, in order to detect anomalies in the time series data. That is why I was using the pyts library. As you said earlier, should I use every snippet in the repository?

Thanks

@johannfaouzi
Owner

In this case, what you could do:

  1. Create the matrix X with shape (n_snippets, n_timestamps) such that each row corresponds to a data snippet.
  2. Compute the shapelet transform using pyts.transformation.ShapeletTransform. You will get a matrix with shape (n_snippets, n_shapelets) corresponding to the distance between each snippet and each shapelet.
  3. Identify anomalies by computing some statistics (e.g., the maximum mean distance).

Here is a minimal working example:

import numpy as np
from pyts.datasets import load_gunpoint
from pyts.transformation import ShapeletTransform

# Load a dataset
X, _, _, _ = load_gunpoint(return_X_y=True)

# Compute fake labels.
# This trick is needed because the current implementation requires labels
# to select the most discriminative shapelets, but it's irrelevant in your use case.
n_snippets = X.shape[0]
y = np.r_[np.zeros(n_snippets // 2), np.ones(n_snippets - n_snippets // 2)]

# Compute all the shapelets of length 9.
# Here we set 'n_shapelets' to a very high integer so that no selection is performed.
# This number must be higher than X.shape[0] * (X.shape[1] - window_size + 1) to avoid selection. 
clf = ShapeletTransform(n_shapelets=int(1e9), window_sizes=[9])
X_new = clf.fit_transform(X, y)

# Compute statistics to identify anomalies.
# Here we find the shapelet with the highest mean distance to all the snippets.
# X_new has shape (n_snippets, n_shapelets), so we average over the snippets (axis=0).
idx = X_new.mean(axis=0).argmax()
print(idx)
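Step 1 (building the matrix X from one long series) could be sketched as follows. This is only one possible approach, assuming equal-length, non-overlapping snippets and dropping the leftover points at the end; `make_snippets` is a hypothetical helper, not part of pyts. Only the numeric values (the Q0 column) go into X. The TS column has to be kept aside, since passing timestamp strings is what raises the ValueError from the first post.

```python
import numpy as np

def make_snippets(values, snippet_length):
    """Cut a 1-D series into equal-length, non-overlapping rows; drop the remainder."""
    values = np.asarray(values, dtype=float)
    n_snippets = len(values) // snippet_length
    if n_snippets == 0:
        raise ValueError("snippet_length is longer than the series")
    return values[:n_snippets * snippet_length].reshape(n_snippets, snippet_length)

# Toy series of 10 points cut into snippets of length 3 (the last point is dropped)
series = np.arange(10.0)
X = make_snippets(series, snippet_length=3)
print(X.shape)
print(X)
```

Because every row has the same length, the result can be passed directly as X in the example above.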

@adityabhandwalkar
Author

Hi,
Thanks for your help. Could you please tell me how I can visualize those shapelets from the snippets?

@johannfaouzi
Owner

The shapelets are saved in the shapelets_ attribute of the fitted instance of ShapeletTransform, which means that you can access them with clf.shapelets_ in the example. You can then plot them with matplotlib, but you would need to retrieve the dates from the TS column if you also want to know when a given shapelet occurred.

Here is a very minimal working example (following the previous one):

import matplotlib.pyplot as plt

plt.plot(clf.shapelets_[idx], 'o-')
plt.show()
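Retrieving when a shapelet occurred could look roughly like the sketch below. It assumes that the snippets are equal-length, non-overlapping slices of the original series, so that snippet i covers positions i * snippet_length to (i + 1) * snippet_length - 1, and that each row of clf.indices_ is (snippet index, start, end) with end exclusive, as in the pyts plotting example. The timestamps and the helper `shapelet_time_span` are hypothetical.

```python
import numpy as np

def shapelet_time_span(timestamps, snippet_length, index):
    """Return the (first, last) timestamp covered by a shapelet.

    `index` is one row of clf.indices_: (snippet index, start, end), end exclusive.
    """
    snippet_idx, start, end = (int(v) for v in index)
    offset = snippet_idx * snippet_length   # position of the snippet in the full series
    return timestamps[offset + start], timestamps[offset + end - 1]

# Hypothetical timestamps: 40 points, one every 100 ms
t0 = np.datetime64("2018-03-02T00:44:51.000")
timestamps = t0 + np.arange(40) * np.timedelta64(100, "ms")

first, last = shapelet_time_span(timestamps, snippet_length=10, index=(2, 3, 9))
print(first, last)
```

The two timestamps can then be used as axis limits or labels when plotting the shapelet.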

@adityabhandwalkar
Author

Hi, thanks for your quick reply. Here is a glimpse of my code for the shapelets. I used pretty much your implementation: I break the time series down into parts with the split function, stack them to create the matrix, and then build the y variable. I have also seen an example of the shapelet transform and its matplotlib plot, which is quite interesting:
https://pyts.readthedocs.io/en/stable/auto_examples/transformation/plot_shapelet_transform.html#sphx-glr-auto-examples-transformation-plot-shapelet-transform-py
Could you please take a look at my code and tell me what I should correct to improve it, both the code itself and the visualization part, so that it looks like the plot in the link?

Thanks


import stumpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
from nptdms import TdmsFile
from scipy.fftpack import fft
from matplotlib.patches import Rectangle
from pyts.transformation import ShapeletTransform

class matrixprof():
    def shapelet(self, path):
        file_read = TdmsFile.read(path)
        #data = file_read.groups()[0].channels()[0].data
        df = file_read.as_dataframe()
        df = df.head(11600000)

        def split(seq, num):
            # Cut seq into num consecutive chunks of (almost) equal length
            avg = len(seq) / float(num)
            out = []
            last = 0

            while last < len(seq):
                out.append(seq[int(last):int(last + avg)])
                last += avg

            return out

        split_size = input('Enter the split of the data size for shapelets?\n')
        v = split(df.iloc[:, 0], int(split_size))
        dataset = np.vstack((v))
        X = dataset
        print(X)

        n_snippets = X.shape[0]
        y = np.r_[np.zeros(n_snippets // 2), np.ones(n_snippets - n_snippets // 2)]

        # Compute all the shapelets of length 6.
        # Here we set 'n_shapelets' to a very high integer so that no selection is performed.
        # This number must be higher than X.shape[0] * (X.shape[1] - window_size + 1) to avoid selection.
        clf = ShapeletTransform(n_shapelets=int(1e9), window_sizes=[6])
        X_new = clf.fit_transform(X, y)

        # Compute statistics to identify anomalies.
        # Here we find the shapelet with the highest mean distance to all the snippets.
        idx = X_new.mean(axis=0).argmax()
        plt.plot(clf.shapelets_[idx], 'o-')
        plt.show()

        """
        for i, index in enumerate(clf.indices_[:2]):
            idx, start, end = index
            plt.plot(X[idx], color='C{}'.format(i),
                     label='Sample {}'.format(idx))
            plt.plot(np.arange(start, end), X[idx, start:end],
                     lw=5, color='C{}'.format(i))

        plt.xlabel('Time', fontsize=12)
        plt.title(' shapelets', fontsize=14)
        plt.show()
        """

        

@johannfaouzi
Owner

Your code looks fine to me.

@adityabhandwalkar
Author

adityabhandwalkar commented Oct 8, 2022

I meant that it would be great if you could help me with the visualization of the shapelets, as in the pyts example in the link above.

@johannfaouzi
Owner

You should not use the first n shapelets (clf.indices_[:2]) because they are not sorted in your case. You need to define your criterion to rank the shapelets, then sort them. In my example, I used the mean distance to the snippets (X_new.mean(axis=0)), so you would need to sort this array to rank them (np.argsort(X_new.mean(axis=0))). Then you could use the first n indices to plot the first n shapelets.

@adityabhandwalkar
Author

Thanks for your comment.
Do you think the code snippet below might be relevant for visualizing the shapelets?

for i, index in enumerate(clf.indices_[:2]):
    idx, start, end = index
    plt.plot(X[idx], color='C{}'.format(i),
             label='Sample {}'.format(idx))
    plt.plot(np.arange(start, end), X[idx, start:end],
             lw=5, color='C{}'.format(i))

plt.xlabel('Time', fontsize=12)
plt.title(' shapelets', fontsize=14)
plt.show()

@johannfaouzi
Owner

I think that it is relevant, but as I said, there is a big issue with this code. Here, you pick the first 2 shapelets (clf.indices_[:2]), but the order of the shapelets is not meaningful. This is why I proposed another criterion for which the ranks are meaningful (np.argsort(X_new.mean(axis=0))[::-1] sorts the indices of the shapelets based on the mean distance to the snippets in descending order).
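A small sketch of this ranking on a hypothetical distance matrix (rows are snippets, columns are shapelets, values made up for illustration). It averages over the snippets so that there is one value per shapelet, and it shows that the reversal has to be applied to the result of argsort, not to the values passed to it.

```python
import numpy as np

# Hypothetical distance matrix: 3 snippets (rows) x 4 shapelets (columns)
X_new = np.array([
    [0.1, 0.9, 0.4, 0.3],
    [0.3, 0.7, 0.6, 0.3],
    [0.2, 0.8, 0.5, 0.3],
])

mean_per_shapelet = X_new.mean(axis=0)          # one mean distance per shapelet

ranking = np.argsort(mean_per_shapelet)[::-1]   # shapelet indices, largest mean distance first
wrong = np.argsort(mean_per_shapelet[::-1])     # reverses the values first: indices no longer match

print(ranking.tolist())
print(wrong.tolist())
```

Here `ranking[0]` is the column of the shapelet with the largest mean distance, whereas `wrong` indexes into the reversed array and points at different shapelets.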

@adityabhandwalkar
Author

OK,
I already did that following your previous comment, using idx = np.argsort(X_new.mean(axis=0)).argmax(). So you are saying that idx = np.argsort(X_new.mean(axis=0))[::-1] might be more relevant? The signature is pyts.transformation.ShapeletTransform(n_shapelets='auto', criterion='mutual_info', window_sizes='auto', window_steps=None, remove_similar=True, sort=False, verbose=0, random_state=None, n_jobs=None), so we already have the option sort=True. Can't we use it in the first place? One more question: how do we decide the window size? Can't we set it to 'auto'? Why did you use 9 in your example?

clf = ShapeletTransform(n_shapelets=int(1e9), window_sizes=[9])
X_new = clf.fit_transform(X, y)
idx = np.argsort(X_new.mean(axis=0)).argmax()
for i, index in enumerate(clf.indices_[:4]):
    idx, start, end = index
    plt.plot(X[idx], color='C{}'.format(i),
             label='Sample {}'.format(idx))
    plt.plot(np.arange(start, end), X[idx, start:end],
             lw=5, color='C{}'.format(i))

plt.xlabel('Time', fontsize=12)
plt.title(' shapelets', fontsize=14)
plt.show()

@johannfaouzi
Owner

You don't want to use argmax if the values are already argsorted. And you want to use these indices (and not the first n ones like you are doing with [:4]).

clf = ShapeletTransform(n_shapelets=int(1e9), window_sizes=[9])
X_new = clf.fit_transform(X, y)
# Rank the shapelets by their mean distance to the snippets, in descending order
indices = np.argsort(X_new.mean(axis=0))[::-1]
for i, index in enumerate(indices[:4]):
    idx, start, end = clf.indices_[index]
    plt.plot(X[idx], color='C{}'.format(i), label='Sample {}'.format(idx))
    plt.plot(np.arange(start, end), X[idx, start:end], lw=5, color='C{}'.format(i))

plt.xlabel('Time', fontsize=12)
plt.title(' shapelets', fontsize=14)
plt.show()

@adityabhandwalkar
Author

OK! Can we use a window size other than 9? Could you please tell me how to decide that?

@johannfaouzi
Owner

Sorry for not answering this point! You can use any window size (as long as it is between 1 and the length of the snippet). You can try out several values. I would guess that the window size should depend on your use case (in terms of actual time: how many seconds / minutes / hours / days / weeks / months / years).
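One way to go from a duration to a window size is to divide by the sampling period, as in the sketch below. The helper `window_size_from_duration` is hypothetical, and the 0.1 s sampling period is an assumption read off the TS column in the first post.

```python
def window_size_from_duration(duration_s, sampling_period_s, snippet_length):
    """Convert a duration in seconds to a number of time points, clipped to [1, snippet_length]."""
    n_points = round(duration_s / sampling_period_s)
    return max(1, min(n_points, snippet_length))

# A pattern lasting about 0.9 s, sampled every 0.1 s, in snippets of 150 points
window_size = window_size_from_duration(0.9, 0.1, snippet_length=150)
print(window_size)

# Durations that are too long or too short are clipped to the valid range
too_long = window_size_from_duration(100.0, 0.1, snippet_length=150)
too_short = window_size_from_duration(0.01, 0.1, snippet_length=150)
print(too_long, too_short)
```

The resulting integer can be passed as window_sizes=[window_size]. Several candidate durations can be tried in the same way.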

@adityabhandwalkar
Author

OK, understood.
I was also looking at the learning shapelets approach, where in example 2 the shapelets are selected like this:

shapelets = np.asarray([clf.shapelets_[0, 9], clf.shapelets_[0, 12]])
# Derive the distances between the time series and the shapelets
print(shapelets)
shapelet_size = shapelets.shape[1]
X_window = windowed_view(X, window_size=shapelet_size, window_step=1)
X_dist = np.mean(
    (X_window[:, :, None] - shapelets[None, :]) ** 2, axis=3).min(axis=1)

So is this the only way to get the shapelets in the learning approach? What if I want the shapelets with the maximum and the minimum mean distance from clf?

@johannfaouzi
Owner

To learn shapelets, you need labels (i.e., each data snippet is labeled). This is supervised learning. From what I understood, you don't have labels (because you only have one time series), and you want to do anomaly detection with unsupervised learning. So the learning shapelet approach is not relevant in your case (unless you have labels for your data snippets).
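Without labels, the shapelets with the maximum and minimum mean distance can still be read off the matrix returned by ShapeletTransform. The sketch below uses a hypothetical matrix in place of the real X_new (rows are snippets, columns are shapelets) and plain NumPy.

```python
import numpy as np

# Hypothetical output of ShapeletTransform.fit_transform:
# rows are snippets, columns are shapelets
X_new = np.array([
    [0.10, 0.50, 0.90],
    [0.20, 0.40, 0.70],
    [0.00, 0.60, 0.80],
])

mean_distance = X_new.mean(axis=0)        # mean distance of each shapelet to all snippets
farthest = int(mean_distance.argmax())    # shapelet least like the snippets on average
closest = int(mean_distance.argmin())     # shapelet most like the snippets on average
print(farthest, closest)
```

With a fitted transformer, clf.shapelets_[farthest] and clf.shapelets_[closest] would then give the corresponding shapelets.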

@adityabhandwalkar
Author

Resolved
