# **Smarter anomaly detection** - Data preparation step
*Part 1 - Data preparation*

## Initialization
---
This repository is structured as follows:

```sh
. smarter-anomaly-detection
|
├── data/
|   ├── interim                          # Temporary intermediate data are stored here
|   ├── processed                        # Finalized datasets ready to be moved to Amazon S3
|   └── raw                              # Immutable original data are stored here
|
└── notebooks/
    ├── 1_data_preparation.ipynb         <<< THIS NOTEBOOK <<<
    ├── 2_model_training.ipynb
    └── 3_model_evaluation.ipynb
```

### Notebook configuration update

In [None]:
!pip install --quiet --upgrade pip
!pip install --quiet --upgrade tqdm tsia

### Imports

In [None]:
import synthetic_config as config
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import tsia
import zipfile

from matplotlib import gridspec
from tqdm import tqdm

### Parameters
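The cell below sets up the local data folders and the plotting style used throughout this notebook. Before going further, also make sure that the bucket name is defined in `synthetic_config`, that the bucket exists, and that this notebook can access it. If this notebook does not have access to the S3 bucket, you will have to update its permissions.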

In [None]:
RAW_DATA = os.path.join('..', 'data', 'raw')
TMP_DATA = os.path.join('..', 'data', 'interim')
PROCESSED_DATA = os.path.join('..', 'data', 'processed')
os.makedirs(RAW_DATA, exist_ok=True)
os.makedirs(TMP_DATA, exist_ok=True)
os.makedirs(PROCESSED_DATA, exist_ok=True)

%matplotlib inline
plt.style.use('fivethirtyeight')
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
plt.rcParams['lines.linewidth'] = 1.0
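
The bucket checks mentioned at the start of this section are not performed by the cell above. A minimal sketch of such a check is shown below, assuming a boto3 S3 client is passed in. The function name `check_bucket_access` and its return values are hypothetical and chosen for illustration only:

```python
def check_bucket_access(bucket_name, s3_client):
    """Return 'undefined', 'ok', 'missing', 'forbidden' or 'error'."""
    if not bucket_name:
        return 'undefined'

    try:
        # head_bucket succeeds only if the bucket exists and the caller can access it
        s3_client.head_bucket(Bucket=bucket_name)
    except Exception as error:
        # boto3 raises botocore.exceptions.ClientError, which exposes the
        # HTTP status code in error.response['Error']['Code']
        code = str(getattr(error, 'response', {}).get('Error', {}).get('Code', ''))
        return {'404': 'missing', '403': 'forbidden'}.get(code, 'error')

    return 'ok'

# Typical use inside the notebook (requires boto3 and valid AWS credentials):
# import boto3
# print(check_bucket_access(config.BUCKET, boto3.client('s3')))
```

A `'forbidden'` result is the case where the permissions attached to this notebook need to be updated.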

## Loading data
---

In [None]:
synth_fname = os.path.join(TMP_DATA, 'synthetic', 'sensors.csv')
synth_df = pd.read_csv(synth_fname)
synth_df['timestamp'] = pd.to_datetime(synth_df['timestamp'])
synth_df = synth_df.set_index('timestamp')
synth_df.index.name = 'Timestamp'
synth_df

## Dataset visualization
---

This dataset contains labels that mark failure times and recovery periods:

In [None]:
broken_df = synth_df[synth_df['machine_status'] == 'BROKEN'].copy()

recovering_df = pd.DataFrame(index=synth_df.index, columns=['value'])
recovering_df['value'] = 0.0
recovering_index = synth_df[synth_df['machine_status'] == 'RECOVERING'].index
recovering_df.loc[recovering_index, 'value'] = 1500.0

In [None]:
fig = plt.figure(figsize=(24,6))
plt.plot(synth_df['signal_05'], label='Signal 05')
plt.plot(synth_df['signal_04'], label='Signal 04')
plt.plot(synth_df['signal_19'], label='Signal 19')
plt.plot(synth_df['signal_07'], label='Signal 07')
plt.plot(synth_df['signal_00'], label='Signal 00')
plt.scatter(broken_df.index, broken_df['signal_03'], marker='o', color=colors[1], s=100, edgecolor='#000000', alpha=0.8, zorder=3, label='Failure time')
plt.fill_between(x=recovering_df.index, y1=recovering_df['value'], color=colors[2], alpha=0.4, label='Recovering period')

plt.legend(loc='lower center', fontsize=10, ncol=7, bbox_to_anchor=(0.5, -0.15))
plt.title('Synthetic sensor data')

plt.show()

In [None]:
tags_list = list(synth_df.columns)
num_cols = 2
num_rows = (len(tags_list) + num_cols - 1) // num_cols  # Ceiling division: no empty trailing row
fig = plt.figure(figsize=(24, 5 * num_rows))

for index, f in enumerate(tags_list):
    ax = fig.add_subplot(num_rows, num_cols, index+1)
    ax.plot(synth_df[f], color=colors[index % len(colors)])
    ax.set_title(f)
    
plt.show()

In [None]:
features = [f for f in tags_list if f not in ['machine_status']]

# Build a list of dataframes, one per feature:
df_list = []
for sensor in features:
    tag_df = synth_df[[sensor]]
    tag_df = tag_df.replace(np.nan, 0.0)
    df_list.append(tag_df)

# Discretize each signal into 3 bins:
array = tsia.markov.discretize_multivariate(df_list)

# Plot the strip chart:
tsia.plot.plot_timeseries_strip_chart(
    array, 
    signal_list=features,
    fig_width=24,
    signal_height=0.2,
    dates=df_list[0].index.to_pydatetime(),
    day_interval=10
)
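
To make the discretization step more concrete, here is a minimal sketch of what binning a single signal into 3 levels can look like. It assumes quantile-based bin edges; the strategy actually used by `tsia.markov.discretize_multivariate` may differ, and `discretize_into_bins` is a hypothetical helper, not part of `tsia`:

```python
import numpy as np

def discretize_into_bins(values, n_bins=3):
    """Map each value to a bin index in [0, n_bins - 1] using quantile edges."""
    values = np.asarray(values, dtype=float)

    # Interior quantile edges: for 3 bins, these are the 1/3 and 2/3 quantiles
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])

    # digitize returns, for each value, the index of the bin it falls into
    return np.digitize(values, edges)

signal = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
print(discretize_into_bins(signal))
# prints [0 0 0 1 1 1 2 2 2]
```

Each row of the strip chart then shows one signal as a sequence of low, medium and high levels over time.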

## Preparing the dataset for ingestion
---
Let's now prepare the data for ingestion into the Amazon Lookout for Equipment service.

We need two datasets: the **time series data** and some **label data**. Although Lookout for Equipment only uses unsupervised approaches, the labels are used to rank the models trained in the background and select the best one.

### Time series data

In [None]:
timeseries_df = synth_df[features]
timeseries_df.head()

In [None]:
TRAIN_DATA = os.path.join(PROCESSED_DATA, 'train-data', 'synthetic')
os.makedirs(TRAIN_DATA, exist_ok=True)
timeseries_fname = os.path.join(TRAIN_DATA, 'sensors.csv')
timeseries_df.to_csv(timeseries_fname)

### Label data
We need to transform the label time series into a sequence of time ranges with start time and end time:

In [None]:
label_index = synth_df[
    (synth_df['machine_status'] == 'RECOVERING') | 
    (synth_df['machine_status'] == 'BROKEN')
].index

label_df = pd.DataFrame(index=synth_df.index, columns=['value'])
label_df['value'] = 0.0
label_df.loc[label_index, 'value'] = 1.0

label_df['previous'] = label_df['value'].shift(1, fill_value=0.0)
label_df['start']    = (label_df['value'] == 1.0) & (label_df['previous'] == 0.0)
label_df['end']      = (label_df['value'] == 0.0) & (label_df['previous'] == 1.0)
label_df             = label_df[(label_df['start'] == True) | (label_df['end'] == True)]

# DataFrame.append was removed in pandas 2.0: collect the ranges in a list instead
ranges = []
start = None
for timestamp, row in label_df.iterrows():
    if row['start']:
        start = timestamp

    if row['end'] and start is not None:
        ranges.append({'start': start, 'end': timestamp})
        start = None

anomaly_ranges = pd.DataFrame(ranges, columns=['start', 'end'])
        
anomaly_ranges['start'] = anomaly_ranges['start'].dt.strftime('%Y-%m-%d %H:%M:%S')
anomaly_ranges['end'] = anomaly_ranges['end'].dt.strftime('%Y-%m-%d %H:%M:%S')
anomaly_ranges
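
The transformation from a status series to time ranges can be illustrated on a tiny toy example. The sketch below uses only the standard library; `status_to_ranges` and the sample values are hypothetical. As in the cell above, each range ends at the first normal sample following the anomaly. Unlike the cell above, this sketch also closes a range that is still open when the data ends, which is an assumption you may or may not want:

```python
from itertools import groupby

def status_to_ranges(timestamps, statuses, anomalous=('BROKEN', 'RECOVERING')):
    """Collapse consecutive anomalous samples into (start, end) pairs."""
    ranges = []
    position = 0
    for is_anomaly, run in groupby(statuses, key=lambda s: s in anomalous):
        length = len(list(run))
        if is_anomaly:
            # End at the first normal sample after the run, or at the last
            # available timestamp when the run reaches the end of the data
            end_position = min(position + length, len(timestamps) - 1)
            ranges.append((timestamps[position], timestamps[end_position]))
        position += length

    return ranges

timestamps = ['00:00', '00:01', '00:02', '00:03', '00:04', '00:05', '00:06']
statuses = ['NORMAL', 'BROKEN', 'RECOVERING', 'RECOVERING', 'NORMAL', 'NORMAL', 'BROKEN']
print(status_to_ranges(timestamps, statuses))
# prints [('00:01', '00:04'), ('00:06', '00:06')]
```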

In [None]:
LABEL_DATA = os.path.join(PROCESSED_DATA, 'label-data')
os.makedirs(LABEL_DATA, exist_ok=True)
labels_fname = os.path.join(LABEL_DATA, 'synthetic-labels.csv')
anomaly_ranges.to_csv(labels_fname, index=None, header=None)
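
Before uploading, it can be worth reading the label file back to confirm its layout: two columns, no header, and timestamps in the `%Y-%m-%d %H:%M:%S` format used above. The following is a minimal sketch using only the standard library; `validate_label_file` is a hypothetical helper and the sample rows are made up for illustration:

```python
import csv
import os
import tempfile
from datetime import datetime

def validate_label_file(path, fmt='%Y-%m-%d %H:%M:%S'):
    """Return the parsed (start, end) ranges, raising ValueError on a malformed row."""
    ranges = []
    with open(path, newline='') as handle:
        for line_number, row in enumerate(csv.reader(handle), start=1):
            if len(row) != 2:
                raise ValueError(f'line {line_number}: expected 2 columns, got {len(row)}')

            # strptime raises ValueError if a cell does not match the format
            start, end = (datetime.strptime(cell, fmt) for cell in row)
            if end < start:
                raise ValueError(f'line {line_number}: end precedes start')
            ranges.append((start, end))

    return ranges

# Self-contained demonstration on a temporary file with made-up ranges
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'labels.csv')
    with open(sample, 'w', newline='') as handle:
        csv.writer(handle).writerows([
            ['2021-01-01 00:00:00', '2021-01-01 06:00:00'],
            ['2021-02-10 12:30:00', '2021-02-11 08:00:00'],
        ])
    checked = validate_label_file(sample)

print(len(checked))
# prints 2
```

In the notebook itself, the same function could be called with `labels_fname`.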

### Uploading data to Amazon S3

In [None]:
BUCKET       = config.BUCKET
TRAIN_PREFIX = config.PREFIX_TRAINING
TRAIN_LABEL  = config.PREFIX_LABEL

s3_train_prefix = f's3://{BUCKET}/{TRAIN_PREFIX}synthetic/sensors.csv'
s3_label_prefix = f's3://{BUCKET}/{TRAIN_LABEL}labels.csv'

!aws s3 cp $timeseries_fname $s3_train_prefix
!aws s3 cp $labels_fname $s3_label_prefix
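
Note that the f-strings above concatenate the prefix directly with the rest of the key, so they assume that `config.PREFIX_TRAINING` and `config.PREFIX_LABEL` end with a `/`. If you are unsure how the prefixes are defined, a small helper can normalize the slashes. This is a sketch only; `build_s3_uri` and the bucket and prefix values below are hypothetical:

```python
def build_s3_uri(bucket, *parts):
    """Join a bucket and key fragments into an s3:// URI with single slashes."""
    # Drop empty fragments and strip leading/trailing slashes before joining
    key = '/'.join(part.strip('/') for part in parts if part and part.strip('/'))
    return f's3://{bucket}/{key}'

print(build_s3_uri('my-bucket', 'training-data/', 'synthetic/sensors.csv'))
print(build_s3_uri('my-bucket', 'training-data', 'synthetic/sensors.csv'))
# both print s3://my-bucket/training-data/synthetic/sensors.csv
```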

## Conclusion
---
In this notebook, you prepared a synthetic dataset for ingestion into Amazon Lookout for Equipment.

You also got a quick overview of the dataset through basic time series visualizations.

Finally, you uploaded the training time series data and the anomaly labels to Amazon S3. In the next notebook of this getting started series, you will use the Amazon Lookout for Equipment API to create your first dataset and train a model.