
# Timeseries anomaly detection using an Autoencoder



## Introduction

This script demonstrates how you can use a reconstruction convolutional
autoencoder model to detect anomalies in timeseries data.

## Setup

In [None]:
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from matplotlib import pyplot as plt

## Load the data


In [None]:
# !pip install yfinance
# !pip install pytrends
# !pip install pyupbit
# !pip install schedule
# !pip install pymysql



In [None]:
import yfinance as yf
import pyupbit

In [None]:
# day/minute1/minute3/minute5/minute10/minute15/minute30/minute60/minute240/week/month
interval_upbit = "minute60"

# valid intervals: 1m,2m,5m,15m,30m,60m,90m,1h,1d,5d,1wk,1mo,3mo
interval_yf = "1h"
unit = 24  # number of candles per day at the hourly interval
train_period = 365*5
test_period = 28

In [None]:
ticker = "KRW-BTC"
df = pyupbit.get_ohlcv(ticker, interval=interval_upbit, count=unit*train_period)
df

| timestamp | open | high | low | close | volume | value |
|---|---|---|---|---|---|---|
| 2017-09-25 12:00:00+00:00 | 4201000.0 | 4244000.0 | 4191000.0 | 4202000.0 | 98.210406 | 4.142651e+08 |
| 2017-09-25 13:00:00+00:00 | 4222000.0 | 4235000.0 | 4208000.0 | 4235000.0 | 7.656490 | 3.230358e+07 |
| 2017-09-25 17:00:00+00:00 | 4235000.0 | 4235000.0 | 4235000.0 | 4235000.0 | 0.000030 | 1.270500e+02 |
| 2017-09-25 19:00:00+00:00 | 4213000.0 | 4213000.0 | 4187000.0 | 4204000.0 | 0.914286 | 3.832582e+06 |
| 2017-09-25 20:00:00+00:00 | 4204000.0 | 4204000.0 | 4175000.0 | 4191000.0 | 2.197385 | 9.211909e+06 |
| ... | ... | ... | ... | ... | ... | ... |
| 2022-06-08 11:00:00+00:00 | 39357000.0 | 39479000.0 | 39209000.0 | 39216000.0 | 122.341135 | 4.812968e+09 |
| 2022-06-08 12:00:00+00:00 | 39216000.0 | 39277000.0 | 38128000.0 | 38235000.0 | 1601.081353 | 6.166533e+10 |
| 2022-06-08 13:00:00+00:00 | 38235000.0 | 38658000.0 | 38130000.0 | 38433000.0 | 360.640393 | 1.385043e+10 |
| 2022-06-08 14:00:00+00:00 | 38433000.0 | 38850000.0 | 38406000.0 | 38705000.0 | 205.860040 | 7.955241e+09 |


## Quick look at the data

## Visualize the data
### Timeseries data without anomalies

We will use the following data for training.

In [None]:
# # Normal data (Bitcoin)
# fig, ax = plt.subplots()
# df_small_noise.plot(legend=False, ax=ax)
# plt.show()

### Timeseries data with anomalies

We will use the following data for testing and see if the sudden jump up in the
data is detected as an anomaly.

In [None]:
# # Anomalous data (Luna)
# fig, ax = plt.subplots()
# df_daily_jumpsup.plot(legend=False, ax=ax)
# plt.show()

### Bitcoin (orange) & Luna (blue)

In [None]:
# fig, ax = plt.subplots()
# df_daily_jumpsup.plot(legend=False, ax=ax)
# df_small_noise.plot(legend=False, ax=ax)
# plt.ylim([-50, 50])
# plt.show()

## Prepare training data

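Take the training time series and normalize it. The candles are hourly
(`interval_upbit = "minute60"`) and `train_period` covers five years.

-   `unit` = **24 timesteps per day**
-   `unit * train_period` = 24 * 1825 = **43,800 candles** requested; the
    number returned can be smaller, since the sample output above has gaps
    between consecutive timestamps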

In [None]:
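# Assumption: df_small_noise is not defined anywhere above, so build the
# training series as the hourly percent change of the BTC close price,
# mirroring how the Luna test series is built later in the notebook.
df["close_chg"] = df["close"].pct_change() * 100
df_small_noise = df[["close_chg"]].bfill()

# Normalize and save the mean and std we get,
# for normalizing test data.
training_mean = df_small_noise.mean()
training_std = df_small_noise.std()
df_training_value = (df_small_noise - training_mean) / training_std
print("Number of training samples:", len(df_training_value))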

### Create sequences
Create sequences combining `TIME_STEPS` contiguous data values from the
training data.

In [None]:
TIME_STEPS = unit

# Generated training sequences for use in the model.
def create_sequences(values, time_steps=TIME_STEPS):
    output = []
    for i in range(len(values) - time_steps + 1):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)


x_train = create_sequences(df_training_value.values)
print("Training input shape: ", x_train.shape)
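To make the windowing concrete, here is a small sketch on toy data. The names `make_windows` and `toy_values` are illustrative and not part of the notebook; the function restates the same sliding-window logic so the example runs on its own. With 10 values and a window of 3, the result should hold 10 - 3 + 1 = 8 overlapping sequences.

```python
import numpy as np

def make_windows(values, time_steps):
    # Same sliding-window logic as create_sequences, restated for a standalone example.
    return np.stack(
        [values[i : i + time_steps] for i in range(len(values) - time_steps + 1)]
    )

toy_values = np.arange(10).reshape(-1, 1)  # 10 rows, 1 feature
toy_windows = make_windows(toy_values, time_steps=3)

print(toy_windows.shape)        # (8, 3, 1)
print(toy_windows[0].ravel())   # [0 1 2]
print(toy_windows[-1].ravel())  # [7 8 9]
```

Consecutive windows are shifted by one step, so neighbouring samples share all but one value.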

## Build a model

We will build a convolutional reconstruction autoencoder model. The model will
take input of shape `(batch_size, sequence_length, num_features)` and return
output of the same shape. In this case, `sequence_length` is 24 (`TIME_STEPS`,
one day of hourly values) and `num_features` is 1.

In [None]:
model = keras.Sequential(
    [
        layers.Input(shape=(x_train.shape[1], x_train.shape[2])),
        layers.Conv1D(
            filters=32, kernel_size=7, padding="same", strides=2, activation="relu"
        ),
        layers.Dropout(rate=0.2),
        layers.Conv1D(
            filters=16, kernel_size=7, padding="same", strides=2, activation="relu"
        ),
        layers.Conv1DTranspose(
            filters=16, kernel_size=7, padding="same", strides=2, activation="relu"
        ),
        layers.Dropout(rate=0.2),
        layers.Conv1DTranspose(
            filters=32, kernel_size=7, padding="same", strides=2, activation="relu"
        ),
        layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),
    ]
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.summary()
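As a rough check on the shapes, the sketch below traces how the sequence length changes through the layers. It assumes the usual `padding="same"` rules: a strided `Conv1D` yields `ceil(length / stride)` steps, and a strided `Conv1DTranspose` yields `length * stride` steps. The helper names are illustrative.

```python
import math

def conv_same_len(length, stride):
    # Output length of Conv1D with padding="same".
    return math.ceil(length / stride)

def conv_transpose_same_len(length, stride):
    # Output length of Conv1DTranspose with padding="same".
    return length * stride

def trace_lengths(seq_len):
    lengths = [seq_len]
    lengths.append(conv_same_len(lengths[-1], 2))            # Conv1D, strides=2
    lengths.append(conv_same_len(lengths[-1], 2))            # Conv1D, strides=2
    lengths.append(conv_transpose_same_len(lengths[-1], 2))  # Conv1DTranspose, strides=2
    lengths.append(conv_transpose_same_len(lengths[-1], 2))  # Conv1DTranspose, strides=2
    lengths.append(conv_transpose_same_len(lengths[-1], 1))  # final Conv1DTranspose, strides=1
    return lengths

print(trace_lengths(24))  # [24, 12, 6, 12, 24, 24]
print(trace_lengths(30))  # [30, 15, 8, 16, 32, 32]
```

Under these rules the output length equals the input length only when the sequence length is divisible by 4, which holds for 24 hourly steps.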

## Train the model

Please note that we are using `x_train` as both the input and the target
since this is a reconstruction model.

In [None]:
history = model.fit(
    x_train,
    x_train,
    epochs=5,
    batch_size=128,
    validation_split=0.1,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")
    ],
)

Let's plot training and validation loss to see how the training went.

In [None]:
plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.legend()
plt.show()

## Detecting anomalies

We will detect anomalies by determining how well our model can reconstruct
the input data.


1.   Find MAE loss on training samples.
2.   Find max MAE loss value. This is the worst our model has performed trying
to reconstruct a sample. We will make this the `threshold` for anomaly
detection.
3.   If the reconstruction loss for a sample is greater than this `threshold`
value then we can infer that the model is seeing a pattern that it isn't
familiar with. We will label this sample as an `anomaly`.
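The three steps above can be sketched on made-up numbers. The arrays below are illustrative stand-ins for the model's inputs and reconstructions, not results from this notebook.

```python
import numpy as np

# Toy "inputs" and "reconstructions" with shape (samples, time_steps, features).
train_x = np.array([[[0.0], [1.0]], [[1.0], [2.0]], [[2.0], [3.0]]])
train_pred = train_x + np.array([[[0.1], [0.1]], [[0.2], [0.0]], [[0.3], [0.1]]])

# Step 1: MAE per sample, averaged over the time axis.
train_mae = np.mean(np.abs(train_pred - train_x), axis=1).reshape(-1)

# Step 2: the worst training reconstruction becomes the threshold (about 0.2 here).
toy_threshold = train_mae.max()

# Step 3: a test sample is flagged when its MAE exceeds the threshold.
test_mae = np.array([0.05, 0.5, 0.15])
flags = test_mae > toy_threshold

print(flags)  # [False  True False]
```

Because the comparison is strict and the threshold is the maximum training error, no training sample is flagged by construction.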


### Set the threshold

In [None]:
# Get train MAE loss.
x_train_pred = model.predict(x_train)
train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)

plt.hist(train_mae_loss, bins=50)
plt.xlabel("Train MAE loss")
plt.ylabel("No of samples")
plt.show()

# Get reconstruction loss threshold.
threshold = np.max(train_mae_loss)
print("Reconstruction error threshold: ", threshold)

### Compare reconstruction

Just for fun, let's see how our model has reconstructed the first sample.
This is the first 24 timesteps (one day of hourly values) of our training dataset.

In [None]:
# Checking how the first sequence is learnt
plt.plot(x_train[0])
plt.plot(x_train_pred[0])
plt.show()

### Prepare test data

#### Luna from yfinance

In [None]:
# interval = "5m"
# # valid intervals: 1m,2m,5m,15m,30m,60m,90m,1h,1d,5d,1wk,1mo,3mo

In [None]:
# Luna crashed between the 9th and the 12th
start = "2022-04-29"
end = "2022-05-13"

In [None]:
df = yf.download("LUNA1-USD", start=start, end=end, interval=interval_yf)
df["close_chg"] = (df["Close"] - df["Close"].shift(1)) / df["Close"].shift(1) * 100
luna = df[["close_chg"]]
luna

In [None]:
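# Backfill the leading NaN left by the percent-change calculation.
# bfill() replaces fillna(method="bfill"), which is deprecated in recent pandas.
luna = luna.bfill()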
luna.head()
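For reference, here is a minimal sketch of the percent-change and backfill steps on hypothetical prices, written with plain NumPy. The `close` values are made up; the arithmetic matches the formula used for `close_chg`, where the first row has no predecessor and takes the next valid value instead.

```python
import numpy as np

close = np.array([100.0, 110.0, 99.0, 99.0])  # hypothetical prices

# Percent change relative to the previous value; the first element has no predecessor.
chg = np.full(close.shape, np.nan)
chg[1:] = (close[1:] - close[:-1]) / close[:-1] * 100

# Backfill: copy the first valid value into the leading NaN.
chg[0] = chg[1]

print(chg)  # approximately [10, 10, -10, 0]
```

Note that backfilling duplicates the first real change into row 0, so the first value is a placeholder rather than an observed move.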

In [None]:
# # df = yf.download("BTC-USD", start=start, end=end, interval=interval)
# df["close_chg"] = (df["close"] - df["close"].shift(1)) / df["close"].shift(1) * 100
# btc = df[["close_chg"]]
# btc

In [None]:
# btc = btc.fillna(method = "bfill")
# btc.head()

In [None]:
# print(df_small_noise.head())

# print(df_daily_jumpsup.head())

In [None]:
# print(luna.head())
# print(btc.head())

In [None]:
# df_small_noise = luna # normal
# df_daily_jumpsup = luna # anomalous

In [None]:
# ticker = "KRW-BTC"
# df = pyupbit.get_ohlcv(ticker, interval=interval, count=unit*test_period)
# df

In [None]:
# # df = yf.download("BTC-USD", start=start, end=end, interval=interval)
# df["close_chg"] = (df["close"] - df["close"].shift(1)) / df["close"].shift(1) * 100
# btc_curr = df[["close_chg"]]
# btc_curr = btc_curr.fillna(method = "bfill")

In [None]:
df_daily_jumpsup = luna
df_daily_jumpsup

In [None]:
# ticker = "BTC-USD"
# stock_data = yf.Ticker(ticker)

# hist="1mo"
# hist_data = stock_data.history(hist, interval="5m", auto_adjust=True)

# df = hist_data[["Close"]]
# df = df[-8064:]

# df["close_chg"] = (df["Close"] - df["Close"].shift(1)) / df["Close"].shift(1) * 100
# btc_curr = df[["close_chg"]]
# btc_curr = btc_curr.fillna(method = "bfill")

In [None]:
# df_daily_jumpsup = btc_curr
# df_daily_jumpsup

In [None]:
df_test_value = (df_daily_jumpsup - training_mean) / training_std
fig, ax = plt.subplots()
df_test_value.plot(legend=False, ax=ax)
plt.ylim([-50, 50])
plt.show()

# Create sequences from test values.
x_test = create_sequences(df_test_value.values)
print("Test input shape: ", x_test.shape)

# Get test MAE loss.
x_test_pred = model.predict(x_test)
test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)
test_mae_loss = test_mae_loss.reshape((-1))

plt.hist(test_mae_loss, bins=50)
plt.xlabel("test MAE loss")
plt.ylabel("No of samples")
plt.show()

# Detect all the samples which are anomalies.
anomalies = test_mae_loss > threshold
print("Number of anomaly samples: ", np.sum(anomalies))
print("Indices of anomaly samples: ", np.where(anomalies))

## Plot anomalies

We now know the samples of the data which are anomalies. With this, we will
find the corresponding `timestamps` from the original test data. We will be
using the following method to do that:

Let's say time_steps = 3 and we have 10 training values. Our `x_train` will
look like this:

- 0, 1, 2
- 1, 2, 3
- 2, 3, 4
- 3, 4, 5
- 4, 5, 6
- 5, 6, 7
- 6, 7, 8
- 7, 8, 9

Every data value except the first and the last `time_steps - 1` values appears
in exactly `time_steps` samples. So, if we know that the samples
[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are all anomalies, we can say that the data
point 5 is an anomaly.
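The toy case above can be checked directly. This is a sketch with hypothetical detector output (`sample_is_anomaly`), using plain Python lists. The slice ends at `i + 1` so that sample `i` itself is included.

```python
time_steps = 3
n_values = 10
n_samples = n_values - time_steps + 1  # 8 overlapping windows

# Hypothetical detector output: samples 3, 4 and 5 are anomalous.
sample_is_anomaly = [i in (3, 4, 5) for i in range(n_samples)]

# Data point i is covered by samples i - time_steps + 1 .. i (inclusive).
anomalous_points = [
    i
    for i in range(time_steps - 1, n_samples)
    if all(sample_is_anomaly[i - time_steps + 1 : i + 1])
]

print(anomalous_points)  # [5]
```

Requiring every covering sample to be anomalous is a conservative rule: a single well-reconstructed window is enough to clear a data point.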

In [None]:
# data i is an anomaly if samples [(i - timesteps + 1) to (i)] are anomalies
anomalous_data_indices = []
for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):
    # The slice end is exclusive, so use data_idx + 1 to include sample data_idx.
    if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx + 1]):
        anomalous_data_indices.append(data_idx)

Let's overlay the anomalies on the original test data plot.

In [None]:
df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]
fig, ax = plt.subplots()
df_daily_jumpsup.plot(legend=False, ax=ax)
if anomalous_data_indices:
    df_subset.plot(legend=False, ax=ax, color="r")
    print("Anomaly detected!")
plt.ylim([-50, 50])
plt.show()