# MANUela Anomaly ML Model

Goal: Build a machine learning model that detects anomalies in sensor vibration data.

![anomalies](https://raw.githubusercontent.com/sa-mw-dach/manuela/master/docs/images/manuela-anomalies.png)

Steps:
- Wrangling sensor data 
- Save the training data
- Prepare the data for modeling, training and testing
- Train and validate models
- Select and save the best model
- Prototype class for Seldon model serving

*Note: There are many ways to address the problem. ARIMA, baselining, or forecasting with an LSTM neural network would be interesting. In this notebook we picked a rather simple approach because the focus is on real-time alerts.*



## Wrangling sensor data 

In [None]:
import pandas as pd
import numpy as np


df = pd.read_csv('raw-data.csv')
df['time'] = pd.to_datetime(df['ts'],unit='ms')
df.set_index('time', inplace=True)
df.drop(columns=['ts'], inplace=True)
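
The loading step above can be tried without the real file. The following is a minimal sketch, assuming `raw-data.csv` has the columns `id`, `ts` (epoch milliseconds), `value` and `label` that the later cells reference; the three sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real raw-data.csv may contain more columns.
sample_csv = io.StringIO(
    "id,ts,value,label\n"
    "pump-1,1577836800000,6.2,0\n"
    "pump-1,1577836801000,7.2,0\n"
    "pump-2,1577836800000,12.4,1\n"
)

sample = pd.read_csv(sample_csv)

# Interpret the integer timestamps as milliseconds since the Unix epoch
sample["time"] = pd.to_datetime(sample["ts"], unit="ms")

# Use the parsed time as the index and drop the raw timestamp column
sample = sample.set_index("time").drop(columns=["ts"])
print(sample)
```

Chaining `set_index` and `drop` is an alternative to the `inplace=True` calls and leaves the same columns in place.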

### Preview the raw data


In [None]:
df.head(20)

### Raw data over time
Vibration pump 1: Data shows a few anomalies

In [None]:
df1 = df.loc[df['id'] == 'pump-1']
df1 = df1.drop(columns=['id', 'label'])

Vibration pump 2: Data shows a few anomalies

In [None]:
df1 = df.loc[df['id'] == 'pump-2']
df1 = df1.drop(columns=['id', 'label'])

In [None]:
df1 = df.loc[df['id'] == 'pump-1']
df1 = df1.drop(columns=['id'])

In [None]:
df1.head(10)

### Labeled data over time
- Vibration pump 1.
- Label = 1 -> Anomaly
- The (manually) labeled data makes a few more anomalies visible.

In [None]:
df1 = df.loc[df['id'] == 'pump-1']
df1 = df1.drop(columns=['id'])

Vibration pump 2:
- Label = 1 -> Anomaly
- The (manually) labeled data makes a few more anomalies visible.

In [None]:
df2 = df.loc[df['id'] == 'pump-2']
df2 = df2.drop(columns=['id'])

## Data Wrangling
Goal: Convert time series data into small episodes that can be used for supervised learning.


In [None]:
#
# A few helper functions
#

# Get list with column names: F1, F2, ..., Fn, L
def get_columns(n):
    f = []
    for x in range(1, n+1):
        f.append("F"+str(x))
    f.append("L")
    return f


# Create empty data frame
def create_empty_df(n):
    d = ([0.]*n)
    d.append(0)
    dfx = pd.DataFrame([d], columns=get_columns(n))
    dfx.drop(dfx.index[0], inplace=True)
    return dfx


# Create data frame with one row
def create_df(vals: list, label: int = 0):
    if not isinstance(vals, list):
        raise TypeError
    dfx = pd.DataFrame([vals+[label]], columns=get_columns(len(vals)))
    return dfx

Create a new dataframe: each row holds the last x (episode length) values and the label of the most recent value.

```
--+-----+-----
ts value label
--+-----+-----
..  ...    0
04  6.2    0
05  7.2    0
06  3.1    0
07 12.4    1
..  ...
--+-----+-----
```

Convert to episodes with length = 3

```
---+----+----+---
F1   F2   F3   L
---+----+----+---
..
6.2  7.2  3.1  0
7.2  3.1 12.4  1
..
---+----+----+---
```
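
The conversion shown in the two tables can be sketched in a vectorized way. This is an illustration only, not the approach used in this notebook: it assumes NumPy 1.20 or later for `sliding_window_view` and uses the four example values from the table above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Example values and labels from the table above
values = np.array([6.2, 7.2, 3.1, 12.4])
labels = np.array([0, 0, 0, 1])
length = 3  # Episode length

# Every row is one window of `length` consecutive values
windows = sliding_window_view(values, length).copy()

# Each episode takes the label of its most recent value
episode_labels = labels[length - 1:]

columns = ["F" + str(i) for i in range(1, length + 1)]
episodes = pd.DataFrame(windows, columns=columns)
episodes["L"] = episode_labels
print(episodes)
```

For real data the windows would have to be built per pump, so that no episode mixes values from two sensors.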


In [None]:
length = 5  # Episode length

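episode_frames = []  # One single-row data frame per episode

for pump_id in df.id.unique():
    print("Convert data for: ", pump_id)

    df2 = df.loc[df['id'] == pump_id]

    epi = []  # Sliding window over the most recent values
    for index, row in df2.iterrows():
        # print('%6.2f, %d' % (row['value'], row['label']))
        epi.append(row['value'])
        if len(epi) == length:
            # The episode takes the label of its most recent value
            episode_frames.append(create_df(epi, row['label']))
            del epi[0]  # Slide the window forward by one value

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
if episode_frames:
    df_epis = pd.concat(episode_frames, ignore_index=True)
else:
    df_epis = create_empty_df(length)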

### Explore the new data 

In [None]:
df_epis.head(20)

In [None]:
df_epis.describe()

In [None]:
# Calculate number of episodes
n_episodes = df_epis.shape[0]

# Calculate number of features
n_features = df_epis.shape[1] - 1

# Calculate number of episodes with an anomaly
n_anomaly = df_epis[df_epis['L'] == 1].shape[0]

# Calculate number of episodes without an anomaly
n_normal = df_epis[df_epis['L'] == 0].shape[0]

# Calculate the anomaly rate in percent
anomaly_rate = n_anomaly / float(n_episodes) * 100

# Print the results
print("Total number of episodes: {}".format(n_episodes))
print("Number of features: {}".format(n_features))
print("Number of episodes with anomaly: {}".format(n_anomaly))
print("Number of episodes without anomaly: {}".format(n_normal))
print("Anomaly rate in dataset: {:.2f}%".format(anomaly_rate))
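
The same class balance can also be read off with `value_counts`. The sketch below uses a made-up label column of eight episodes rather than the notebook's data.

```python
import pandas as pd

# Hypothetical labels for eight episodes; 1 marks an anomaly
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="L")

# Absolute counts per class
counts = labels.value_counts()

# Relative share per class in percent
rates = labels.value_counts(normalize=True) * 100

print(counts)
print("Anomaly rate: {:.2f}%".format(rates.loc[1]))
```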

Let's vary the anomalies to make the model more robust

In [None]:
factor = 5  # Number of copies
dfr = df_epis.copy()
for i in range(1, factor):

    f = 0.5 + ((i - 1) * 0.5 / (factor-1))  # vary the anomaly by a factor

    dfi = df_epis.copy()
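    # F5 is the most recent value of an episode when length = 5
    dfi['F5'] = np.where(dfi['L'] == 1, dfi['F5']*f, dfi['F5'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    dfr = pd.concat([dfr, dfi])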

df_epis = dfr.copy()
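
To make the effect of this step concrete, the sketch below applies the same scaling formula to a made-up two-row frame. With `factor = 5` the anomalous value is scaled by 0.5, 0.625, 0.75 and 0.875. Because normal rows are copied as well, the data set grows to `factor` times its size while the anomaly rate stays the same.

```python
import numpy as np
import pandas as pd

# Hypothetical episodes reduced to the last feature and the label
toy = pd.DataFrame({"F5": [3.1, 12.4], "L": [0, 1]})

factor = 5  # Number of copies, including the original

# Same formula as in the cell above
scales = [0.5 + (i - 1) * 0.5 / (factor - 1) for i in range(1, factor)]

copies = [toy]
for f in scales:
    scaled = toy.copy()
    # Only anomalous rows are scaled; normal rows are copied unchanged
    scaled["F5"] = np.where(scaled["L"] == 1, scaled["F5"] * f, scaled["F5"])
    copies.append(scaled)

augmented = pd.concat(copies, ignore_index=True)

print(scales)
print(augmented)
```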

In [None]:
# Calculate number of episodes
n_episodes = df_epis.shape[0]

# Calculate number of features
n_features = df_epis.shape[1] - 1

# Calculate number of episodes with an anomaly
n_anomaly = df_epis[df_epis['L'] == 1].shape[0]

# Calculate number of episodes without an anomaly
n_normal = df_epis[df_epis['L'] == 0].shape[0]

# Calculate the anomaly rate in percent
anomaly_rate = n_anomaly / float(n_episodes) * 100

# Print the results
print("Total number of episodes: {}".format(n_episodes))
print("Number of features: {}".format(n_features))
print("Number of episodes with anomaly: {}".format(n_anomaly))
print("Number of episodes without anomaly: {}".format(n_normal))
print("Anomaly rate in dataset: {:.2f}%".format(anomaly_rate))

### Save Training data to CSV

In [None]:
df_epis.to_csv(
    'sensor-training-data.csv', index=False, header=True, float_format='%.2f'
)
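
A later step could read the file back and split it into features and labels. The following is a rough sketch with a made-up two-row frame and a temporary directory; only the `to_csv` arguments are taken from the cell above. Note that `float_format='%.2f'` rounds every float to two decimals on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical episodes with length = 3
episodes = pd.DataFrame(
    {"F1": [6.2, 7.2], "F2": [7.2, 3.1], "F3": [3.1, 12.4], "L": [0, 1]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sensor-training-data.csv")
    episodes.to_csv(path, index=False, header=True, float_format="%.2f")
    restored = pd.read_csv(path)

# Feature columns F1..Fn and the label column L
X = restored.drop(columns=["L"])
y = restored["L"]
print(X.shape, y.tolist())
```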