# Isolation Forest

In this Jupyter notebook, you'll use the data you collected with your IoT Freezer Monitor to train a machine learning model that detects anomalies in the operation of your freezer. The goal of anomaly detection is to warn you before the freezer has an obvious malfunction.

## What is an Isolation Forest?

Isolation Forest is a machine learning algorithm used to identify outliers in unlabeled data. Unlabeled data means that you don't know whether your training data set contains anomalies. The data you're using is most likely unlabeled, since hopefully your freezer didn't break while you were recording the temperature.

## How does this work?

The isolation forest looks for outliers by randomly splitting the data recursively until a point is isolated from all other points. The fewer splits it takes to isolate a point, the more likely that point is an outlier.
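
To make the splitting idea concrete, here is a minimal one-dimensional sketch using only the Python standard library. The function `isolation_steps` and the synthetic readings are illustrative assumptions, not part of scikit-learn; the real `IsolationForest` also picks a random feature for each split, builds many trees on subsamples, and converts the average path length into a score.

```python
import random

def isolation_steps(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    subset = list(values)
    steps = 0
    while len(subset) > 1:
        lo, hi = min(subset), max(subset)
        if lo == hi:
            break  # only duplicates remain, so no further split is possible
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            subset = [v for v in subset if v < split]
        else:
            subset = [v for v in subset if v >= split]
        steps += 1
    return steps

rng = random.Random(0)
# Hypothetical freezer readings around -18 C, plus one obviously warm reading
readings = [rng.gauss(-18.0, 0.3) for _ in range(50)]
outlier = -5.0
typical = sorted(readings)[len(readings) // 2]
data = readings + [outlier]

def average_steps(target, trials=500):
    return sum(isolation_steps(data, target, rng) for _ in range(trials)) / trials

avg_outlier = average_steps(outlier)
avg_typical = average_steps(typical)
print("Average splits to isolate the outlier:", avg_outlier)
print("Average splits to isolate a typical reading:", avg_typical)
```

The warm reading sits far from the rest, so most random splits separate it right away, while a typical reading needs many more splits before it is alone.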

## Prepare your environment

The section below prepares the Python environment with all the libraries you need to train your anomaly detection model.

In [None]:
import numpy as np
import scipy as sp
import scipy.stats  # explicitly load the stats submodule used as sp.stats below
import pandas as pd
from os import listdir
from os.path import join
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

## Process your raw data

The following sections load your data set, trim any unwanted data, and split the data into fixed-length time windows.

In [None]:
# Helper function
def hours_to_seconds(hours):
    return hours * 3600

In [None]:
# Calculate how many samples to include per file
# The sample rate should match the code on your temperature monitor
sample_rate = 0.2     # Hz
sample_time = 0.25    # Hours
# Round to a whole number of samples (0.2 Hz * 900 s = 180 samples)
samples_per_file = int(round(sample_rate * hours_to_seconds(sample_time)))

In [None]:
# Path to the dataset from Adafruit.io
# Change <DATASET-NAME> to the file name of the Adafruit.io data you downloaded.
dataset_path = './dataset/raw/<DATASET-NAME>'

# Load the dataset using pandas
df = pd.read_csv(dataset_path, usecols=[1])

df

In [None]:
# Plot the data
plt.plot(df)
plt.show()

In [None]:
# You may want to trim the start and end of your data, especially if the data at the
# start of the set was recorded outside the freezer or while the thermocouple was cooling down.

start_time_trim = 1 # Hours
end_time_trim = 1   # Hours
# Convert the trim times to row indices: there is one row per sample,
# so the number of rows is seconds * sample rate
start_idx = int(sample_rate * hours_to_seconds(start_time_trim))
end_idx = len(df.index) - int(sample_rate * hours_to_seconds(end_time_trim))

df = df.truncate(before=start_idx, after=end_idx)

# Create one file for each group of samples
# (the ./dataset/training directory must already exist)
arr = []
for i, temp in df.iterrows():
    arr.append(temp)
    # Write a file as soon as a full group has been collected
    if len(arr) == samples_per_file:
        sample = pd.DataFrame(data=arr)
        sample.to_csv('./dataset/training/output_'+str(i), index=False, header=False)
        arr = []

## Prepare your data for training

In [None]:
# Create an array of file names in directory
samples_in_dir = listdir('./dataset/training')
# Join the path and file name for all files in the directory
samples_in_dir = [join('./dataset/training', sample) for sample in samples_in_dir]

In [None]:
# Set the size of the validation set to 20%
val_size = int(.2 * len(samples_in_dir))

# Randomize the samples
np.random.shuffle(samples_in_dir)

# Split data into training samples and validation samples
val_samples = samples_in_dir[:val_size]
train_samples = samples_in_dir[val_size:]

# Check that the data split correctly
assert len(val_samples) + len(train_samples) == len(samples_in_dir)

In [None]:
# Test that the data set loaded correctly
np.loadtxt(samples_in_dir[0])

In [None]:
def extract_features(sample):
    features = []

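    # Median absolute deviation (MAD)
    # median_absolute_deviation was deprecated in SciPy 1.5 and later removed;
    # median_abs_deviation with scale='normal' reproduces its default scaling
    mad = sp.stats.median_abs_deviation(sample, scale='normal')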

    features.append(mad)
    return np.array(features).flatten()
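
The MAD feature can also be worked out by hand, which shows what the model actually sees for each window. The sketch below uses only the standard library; the helper `mad` and the three example windows are made up for illustration. It computes the unscaled MAD, whereas SciPy may additionally multiply the result by a constant of about 1.4826, depending on the function and its `scale` argument.

```python
from statistics import median

def mad(values):
    """Unscaled median absolute deviation: median of |x - median(x)|."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)

# Hypothetical temperature windows in degrees Celsius
steady = [-18.0, -18.2, -17.9, -18.1, -18.0]    # normal operation
cycling = [-18.0, -16.5, -19.4, -15.8, -18.0]   # temperature swinging widely
one_spike = [-18.0, -18.2, -17.9, -18.1, -5.0]  # a single warm reading

print("steady:", mad(steady))
print("cycling:", mad(cycling))
print("one spike:", mad(one_spike))
```

The steady window gives a MAD of about 0.1 and the cycling window about 1.4, so sustained instability stands out clearly. The window with one spike also gives about 0.1, because the median ignores a single extreme value. If isolated spikes matter for your freezer, a second feature such as the window maximum may be worth adding in `extract_features`.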

In [None]:
# Load 1 sample to test feature extraction
sample = np.loadtxt(samples_in_dir[0])
mean = np.mean(sample)
features = extract_features(sample)
print(sample.shape)
print(features.shape)
print(features)

In [None]:
# Function: loop through filenames, creating feature sets
def create_feature_set(filenames):
    x_out = []
    for file in filenames:
        sample = np.loadtxt(file)
        features = extract_features(sample)
        x_out.append(features)
    return np.array(x_out)

In [None]:
# Extract Features for the training and validation sets
training = create_feature_set(train_samples)
val = create_feature_set(val_samples)

In [None]:
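# Train the Isolation Forest on the training features
# contamination=0.03 assumes that roughly 3% of the training samples are anomalous
clf = IsolationForest(max_samples=100, n_estimators=100, contamination=0.03)
clf.fit(training)

# predict() returns 1 for inliers and -1 for outliers
y_pred_train = clf.predict(training)
y_pred_val = clf.predict(val)

# The data is unlabeled, so these are the fractions of samples classified as normal,
# not true accuracies
print("Training inlier fraction:", np.mean(y_pred_train == 1))
print("Validation inlier fraction:", np.mean(y_pred_val == 1))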
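
Once a model is trained, new windows can be scored the same way. The sketch below is self-contained and uses synthetic MAD values as a stand-in for your real feature matrix, so the numbers in `train_features` and `new_windows` are assumptions for illustration only. It uses `decision_function`, which returns a continuous score where lower means more anomalous, alongside `predict`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical stand-in for the training features: one MAD value per window
train_features = rng.normal(loc=0.10, scale=0.02, size=(200, 1))

model = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
model.fit(train_features)

# Two new windows: one typical MAD value and one far outside the training range
new_windows = np.array([[0.10], [5.0]])
scores = model.decision_function(new_windows)  # negative scores are predicted anomalies
labels = model.predict(new_windows)            # 1 = inlier, -1 = outlier

for value, score, label in zip(new_windows[:, 0], scores, labels):
    print(f"MAD={value:.2f}  score={score:.3f}  label={label}")
```

Watching the score over time, rather than only the 1 or -1 label, may give an earlier hint that the freezer is drifting away from normal operation.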