# MANUela Anomaly ML Model

Goal: Build a machine learning model that detects anomalies in sensor vibration data.

![anomalies](https://raw.githubusercontent.com/sa-mw-dach/manuela/master/docs/images/manuela-anomalies.png)

Steps:
- Wrangling sensor data 
- Save the training data
- Prepare the data for modeling, training and testing
- Train and validate models
- Select and save the best model
- Prototype class for Seldon model serving

*Note: There are many ways to address the problem. ARIMA, baselining, or forecasting with an LSTM neural network would be interesting. In this notebook we picked a rather simple approach, because the focus is on real-time alerts.*



## Wrangling sensor data 

In [1]:
import pandas as pd
import numpy as np


df = pd.read_csv('raw-data.csv')
df['time'] = pd.to_datetime(df['ts'],unit='ms')
df.set_index('time', inplace=True)
df.drop(columns=['ts'], inplace=True)
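
As a small illustration of the timestamp conversion above, the sketch below applies the same steps to a two-row, made-up frame. The sample rows are assumptions chosen to resemble the preview, not rows read from `raw-data.csv`, and the `ts` column is assumed to hold epoch timestamps in milliseconds.

```python
import pandas as pd

# Hypothetical sample rows; 'ts' is assumed to be epoch time in milliseconds
sample = pd.DataFrame({
    'ts': [1587640434617, 1587640437999],
    'id': ['pump-1', 'pump-2'],
    'value': [18.34, 12.70],
    'label': [0, 0],
})

# Same steps as the cell above: parse, index by time, drop the raw column
sample['time'] = pd.to_datetime(sample['ts'], unit='ms')
sample = sample.set_index('time').drop(columns=['ts'])

print(sample)
```

The first index entry corresponds to 2020-04-23 11:13:54.617, which matches the format of the preview below.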

### Preview the raw data


In [2]:
df.head(20)

time,id,value,label
2020-04-23 11:13:54.617,pump-1,18.340181,0
2020-04-23 11:13:57.999,pump-2,12.703972,0
2020-04-23 11:13:59.618,pump-1,17.647661,0
2020-04-23 11:14:02.999,pump-2,13.805114,0
2020-04-23 11:14:04.617,pump-1,16.874933,0
2020-04-23 11:14:07.999,pump-2,15.415206,0
2020-04-23 11:14:09.617,pump-1,16.180807,0
2020-04-23 11:14:12.999,pump-2,15.922729,0
2020-04-23 11:14:14.618,pump-1,15.407113,0
2020-04-23 11:14:17.999,pump-2,17.05156,0


### Raw data over time
Vibration pump 1: Data shows a few anomalies

In [3]:
df1 = df.loc[df['id'] == 'pump-1']
df1 = df1.drop(columns=['id', 'label'])

Vibration pump 2: Data shows a few anomalies

In [4]:
df1 = df.loc[df['id'] == 'pump-2']
df1 = df1.drop(columns=['id', 'label'])

In [5]:
df1 = df.loc[df['id'] == 'pump-1']
df1 = df1.drop(columns=['id'])

In [6]:
df1.head(10)

time,value,label
2020-04-23 11:13:54.617,18.340181,0
2020-04-23 11:13:59.618,17.647661,0
2020-04-23 11:14:04.617,16.874933,0
2020-04-23 11:14:09.617,16.180807,0
2020-04-23 11:14:14.618,15.407113,0
2020-04-23 11:14:19.617,15.324012,0
2020-04-23 11:14:24.617,13.470387,0
2020-04-23 11:14:29.617,11.702384,0
2020-04-23 11:14:34.617,11.176102,0
2020-04-23 11:14:39.617,10.678349,0


### Labeled data over time
- Vibration pump 1.
- Label = 1 -> Anomaly
- The (manually) labeled data makes a few more anomalies visible.

In [7]:
df1 = df.loc[df['id'] == 'pump-1']
df1 = df1.drop(columns=['id'])

Vibration pump 2:
- Label = 1 -> Anomaly
- The (manually) labeled data makes a few more anomalies visible.

In [8]:
df2 = df.loc[df['id'] == 'pump-2']
df2 = df2.drop(columns=['id'])

## Data Wrangling
Goal: Convert time series data into small episodes that can be used for supervised learning.


In [9]:
#
# A few helper functions
#

# Get list with column names: F1, F2, ..., Fn, L
def get_columns(n):
    f = []
    for x in range(1, n+1):
        f.append("F"+str(x))
    f.append("L")
    return f


# Create empty data frame
def create_empty_df(n):
    d = ([0.]*n)
    d.append(0)
    dfx = pd.DataFrame([d], columns=get_columns(n))
    dfx.drop(dfx.index[0], inplace=True)
    return dfx


# Create data frame with one row
def create_df(vals: list, label: int = 0):
    if not isinstance(vals, list):
        raise TypeError
    dfx = pd.DataFrame([vals+[label]], columns=get_columns(len(vals)))
    return dfx

Create a new dataframe: each row holds the last x (length) values and the label of the most recent value.

```
--+-----+-----
tz value label
--+-----+-----
..  ...    0
04  6.2    0
05  7.2    0
06  3.1    0
07 12.4    1
..  ...
--+-----+-----
```

Convert to episodes with length = 3

```
---+----+----+---
F1   F2   F3   L
---+----+----+---
..
6.2  7.2  3.1  0
7.2  3.1 12.4  1
..
---+----+----+---
```
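
The conversion illustrated above can also be sketched in vectorized form. This is a minimal alternative, not the notebook's own method, and it assumes NumPy 1.20 or later for `sliding_window_view`. The helper name `to_episodes` is hypothetical. It reproduces the length-3 example, giving each episode the label of its last value.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def to_episodes(values, labels, length):
    """Hypothetical helper: turn a series into overlapping episodes."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Each row is one window of `length` consecutive values
    windows = sliding_window_view(values, length).copy()
    columns = ["F" + str(i) for i in range(1, length + 1)]
    episodes = pd.DataFrame(windows, columns=columns)
    # An episode takes the label of its most recent value
    episodes["L"] = labels[length - 1:]
    return episodes


example = to_episodes([6.2, 7.2, 3.1, 12.4], [0, 0, 0, 1], length=3)
print(example)
```

Four input values yield two episodes, because a series of n values contains n - length + 1 windows.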


In [10]:
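length = 5  # Episode length

# Collect one-row frames and concatenate once at the end.
# DataFrame.append is deprecated (removed in pandas 2.0) and slow in a loop.
rows = []

for pump_id in df['id'].unique():
    print("Convert data for: ", pump_id)

    df2 = df.loc[df['id'] == pump_id]

    epi = []
    for index, row in df2.iterrows():
        # print('%6.2f, %d' % (row['value'], row['label']))
        epi.append(row['value'])
        if len(epi) == length:
            # The episode gets the label of its most recent value
            rows.append(create_df(epi, row['label']))
            del epi[0]

# Fall back to an empty frame if no full episode was found
df_epis = pd.concat(rows, ignore_index=True) if rows else create_empty_df(length)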

Convert data for:  pump-1



Convert data for:  pump-2



### Explore the new data 

In [11]:
df_epis.head(20)

,F1,F2,F3,F4,F5,L
0,18.340181,17.647661,16.874933,16.180807,15.407113,0
1,17.647661,16.874933,16.180807,15.407113,15.324012,0
2,16.874933,16.180807,15.407113,15.324012,13.470387,0
3,16.180807,15.407113,15.324012,13.470387,11.702384,0
4,15.407113,15.324012,13.470387,11.702384,11.176102,0
5,15.324012,13.470387,11.702384,11.176102,10.678349,0
6,13.470387,11.702384,11.176102,10.678349,9.831242,0
7,11.702384,11.176102,10.678349,9.831242,11.555063,0
8,11.176102,10.678349,9.831242,11.555063,13.197594,0
9,10.678349,9.831242,11.555063,13.197594,14.337077,0


In [12]:
df_epis.describe()

,F1,F2,F3,F4,F5,L
count,3014.0,3014.0,3014.0,3014.0,3014.0,3014.0
mean,14.419974,14.421438,14.430688,14.430566,14.429951,0.02787
std,4.514179,4.51439,4.544588,4.54451,4.544357,0.164627
min,8.089854,8.089854,8.089854,8.089854,8.089854,0.0
25%,11.724137,11.724137,11.724137,11.724137,11.724137,0.0
50%,13.96132,13.966907,13.971367,13.971367,13.971367,0.0
75%,16.179648,16.180626,16.180626,16.180626,16.179648,0.0
max,48.423213,48.423213,48.423213,48.423213,48.423213,1.0


In [13]:
# Calculate number of episodes
n_episodes = df_epis.shape[0]

# Calculate number of features
n_features = df_epis.shape[1] - 1

# Calculate number of episodes with an anomaly
n_anomaly = df_epis[df_epis['L'] == 1].shape[0]

# Calculate number of episodes without an anomaly
n_normal = df_epis[df_epis['L'] == 0].shape[0]

# Calculate the anomaly rate in percent
anomaly_rate = n_anomaly / float(n_episodes) * 100

# Print the results
print("Total number of episodes: {}".format(n_episodes))
print("Number of features: {}".format(n_features))
print("Number of episodes with anomaly: {}".format(n_anomaly))
print("Number of episodes without anomaly: {}".format(n_normal))
print("Anomaly rate in dataset: {:.2f}%".format(anomaly_rate))

Total number of episodes: 3014
Number of features: 5
Number of episodes with anomaly: 84
Number of episodes without anomaly: 2930
Anomaly rate in dataset: 2.79%
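
The anomaly rate printed above follows directly from the counts: 84 of 3014 episodes. As a sketch of an equivalent check, `value_counts(normalize=True)` returns the class shares in one call. The label series below is rebuilt from the printed counts as a stand-in for `df_epis['L']`.

```python
import pandas as pd

# Stand-in for df_epis['L'], rebuilt from the counts printed above
labels = pd.Series([1] * 84 + [0] * 2930, name='L')

# Share of each class in percent
shares = labels.value_counts(normalize=True) * 100

print("Anomaly rate in dataset: {:.2f}%".format(shares.loc[1]))
print("Normal rate in dataset: {:.2f}%".format(shares.loc[0]))
```

With roughly 97% normal episodes, a classifier that always predicts "normal" would already look accurate, so accuracy alone says little on this dataset.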


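Let's vary the anomalies to make the model more robust: append copies of the episodes in which the last value (F5) of each anomalous episode is scaled down by a factor between 0.5 and 0.875.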

In [14]:
factor = 5  # Total number of copies, including the original
copies = [df_epis]
for i in range(1, factor):

    f = 0.5 + ((i - 1) * 0.5 / (factor-1))  # vary the anomaly by a factor

    dfi = df_epis.copy()
    dfi['F5'] = np.where(dfi['L'] == 1, dfi['F5']*f, dfi['F5'])
    copies.append(dfi)

# DataFrame.append is deprecated (removed in pandas 2.0); concatenate once
df_epis = pd.concat(copies)

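
To make the augmentation concrete, the sketch below lists the scale factors the loop produces for `factor = 5` and applies them to a tiny made-up frame. The two sample rows are assumptions for illustration only, and the sketch uses `pd.concat` because `DataFrame.append` was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

factor = 5
# Same formula as in the cell above, evaluated for i = 1..4
scales = [0.5 + (i - 1) * 0.5 / (factor - 1) for i in range(1, factor)]
print(scales)  # [0.5, 0.625, 0.75, 0.875]

# Hypothetical toy data: one normal and one anomalous episode (F5 only)
toy = pd.DataFrame({'F5': [10.0, 40.0], 'L': [0, 1]})

copies = [toy]
for f in scales:
    dfi = toy.copy()
    # Scale F5 only where the episode is labeled as an anomaly
    dfi['F5'] = np.where(dfi['L'] == 1, dfi['F5'] * f, dfi['F5'])
    copies.append(dfi)

augmented = pd.concat(copies, ignore_index=True)
print(augmented[augmented['L'] == 1]['F5'].tolist())  # [40.0, 20.0, 25.0, 30.0, 35.0]
```

Normal episodes are duplicated unchanged, so the dataset grows fivefold while the anomaly rate stays the same.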


In [15]:
# Calculate number of episodes
n_episodes = df_epis.shape[0]

# Calculate number of features
n_features = df_epis.shape[1] - 1

# Calculate number of episodes with an anomaly
n_anomaly = df_epis[df_epis['L'] == 1].shape[0]

# Calculate number of episodes without an anomaly
n_normal = df_epis[df_epis['L'] == 0].shape[0]

# Calculate the anomaly rate in percent
anomaly_rate = n_anomaly / float(n_episodes) * 100

# Print the results
print("Total number of episodes: {}".format(n_episodes))
print("Number of features: {}".format(n_features))
print("Number of episodes with anomaly: {}".format(n_anomaly))
print("Number of episodes without anomaly: {}".format(n_normal))
print("Anomaly rate in dataset: {:.2f}%".format(anomaly_rate))

Total number of episodes: 15070
Number of features: 5
Number of episodes with anomaly: 420
Number of episodes without anomaly: 14650
Anomaly rate in dataset: 2.79%


### Save Training data to CSV

In [16]:
df_epis.to_csv(
    'sensor-training-data.csv', index=False, header=True, float_format='%.2f'
)
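
As a sketch of what `float_format='%.2f'` does to the saved data, the example below writes a one-row, made-up frame to an in-memory buffer instead of a file and reads it back. The column subset and the use of `io.StringIO` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical one-row frame with two feature columns and the label
toy = pd.DataFrame({'F1': [18.340181], 'F2': [17.647661], 'L': [0]})

# Write with the same options as above, but to an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=True, float_format='%.2f')
print(buffer.getvalue())

# Read the data back to see the rounded values
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Float columns are rounded to two decimals on disk, while the integer label column is written unchanged, so the saved training data is slightly coarser than `df_epis` in memory.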