# Unsupervised learning with a Clustering Model

This notebook walks through the steps to train and save a clustering model when you already know which parameters work best.
At the end of the notebook, you will have a trained model that the subsequent notebooks use to create an edge package deployable on AI Inference Server.

### Imports  

In [None]:
import pandas
import joblib
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

import seaborn as sns
from matplotlib import pyplot

import tsfresh.feature_extraction.feature_calculators as fc

import sys
from pathlib import Path
sys.path.insert(0, str(Path('../src').resolve()))
from si.pipeline import WindowTransformer, FeatureTransformer, FillMissingValues
from si.preprocessing import positive_sum_of_changes, negative_sum_of_changes
from si.pipeline import back_propagate_labels

%matplotlib inline

### Load data

Load the sample data set from the CSV file into a pandas DataFrame and display it.

In [None]:
df = pandas.read_csv("../data/raw/si-sample.csv")
df

The data set contains three strongly correlated signals, so we use the sum of their values instead of the individual signals.  
To calculate the sum, we introduce a Transformer class. We use it both to preprocess the data for visualizing the input and as a preprocessing step in our ML pipeline.  
The class is defined in the Python file [preprocessing.py](../src/si/preprocessing.py), so we need to import it and define its input columns.

In [None]:
from si.preprocessing import SumColumnsTransformer

input_columns = ["ph1","ph2","ph3"]

df["ph_sum"] = SumColumnsTransformer().transform(df[input_columns].values).flatten()
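
The actual `SumColumnsTransformer` is defined in `preprocessing.py` and is not shown in this notebook. As a rough sketch only, a transformer with the behavior described above might look like the following. The class name `RowSumSketch` and the `(n_samples, 1)` output shape are assumptions made for illustration, not the repository's implementation.

```python
import numpy as np


class RowSumSketch:
    """Hypothetical stand-in: sums all columns of a 2-D array row by row."""

    def fit(self, X, y=None):
        # Nothing to learn; summing is stateless.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Keep a 2-D shape (n_samples, 1) so later steps still receive a matrix.
        return X.sum(axis=1, keepdims=True)


signals = np.array([[1.0, 2.0, 3.0],
                    [0.5, 0.5, 1.0]])
summed = RowSumSketch().fit(signals).transform(signals)
print(summed.flatten())  # [6. 2.]
```

Returning a single-column matrix rather than a flat vector would explain why the cell above calls `.flatten()` before assigning the result to a DataFrame column.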

### Define the features
The features specified here are extracted window by window. You can give features different importance by assigning integer weights.  
A feature with a weight greater than 1 is passed to the subsequent steps of the ML pipeline that many times.

In [None]:
weighted_feature_list = [
    (2, [ fc.maximum, fc.minimum, fc.mean ]),
    (1, [ fc.variance, fc.standard_deviation ]),
    (1, [ fc.sum_values ]),
    (1, [ fc.absolute_sum_of_changes ]),
    (1, [ positive_sum_of_changes, negative_sum_of_changes ]),
    (1, [ fc.count_above_mean, fc.longest_strike_above_mean, fc.longest_strike_below_mean ])
]
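
To make the effect of the weights concrete, here is a minimal sketch of how a weighted list could be expanded into a feature vector. The helper names `expand_weighted_features` and `extract_features` are hypothetical, and the built-ins `max`, `min` and `statistics.mean` stand in for the tsfresh calculators. The real `FeatureTransformer` may order or repeat the columns differently.

```python
from statistics import mean


def expand_weighted_features(weighted_feature_list):
    """Repeat each feature function as many times as its group weight."""
    expanded = []
    for weight, functions in weighted_feature_list:
        for function in functions:
            expanded.extend([function] * weight)
    return expanded


def extract_features(window, weighted_feature_list):
    """Compute one feature vector for a single window of values."""
    return [f(window) for f in expand_weighted_features(weighted_feature_list)]


# Built-in stand-ins for the tsfresh calculators (an assumption of this sketch).
example_weights = [
    (2, [max, min]),
    (1, [mean]),
]
feature_vector = extract_features([1.0, 2.0, 6.0], example_weights)
print(feature_vector)  # [6.0, 6.0, 1.0, 1.0, 3.0]
```

In a method based on Euclidean distance, such as k-means, repeating a column `w` times has the same effect on distances as scaling that column by the square root of `w`.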

### Define AI/ML pipeline
Define your AI/ML pipeline as a sequence of preprocessing, feature extraction and clustering steps.

In [None]:
pipe = Pipeline([
        ('preprocessing', Pipeline([
            ('fillmissing', FillMissingValues('ffill')),
            ('summarization', SumColumnsTransformer()), # sums the input columns into a single variable
            ('windowing', WindowTransformer(window_size=300, window_step=300)),
            ('featurization', FeatureTransformer(function_list=weighted_feature_list)),
            ('scaling', MinMaxScaler(feature_range=(0, 1))),
        ])),
        ('clustering', KMeans(n_clusters=3, random_state=0)),
    ])
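
With `window_size=300` and `window_step=300`, consecutive windows do not overlap. The sketch below illustrates the general idea of windowing on a plain list. The function `make_windows` is hypothetical, and dropping an incomplete trailing window is an assumption of this sketch, since the behavior of `WindowTransformer` in that case is not shown here.

```python
def make_windows(values, window_size, window_step):
    """Split a sequence into fixed-size windows, dropping an incomplete tail."""
    last_start = len(values) - window_size
    return [
        values[start:start + window_size]
        for start in range(0, last_start + 1, window_step)
    ]


samples = list(range(10))

# Step equal to size: non-overlapping windows, as in the pipeline above.
print(make_windows(samples, window_size=4, window_step=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]

# Step smaller than size: overlapping windows.
print(make_windows(samples, window_size=4, window_step=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```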

### Train the model

In [None]:
X = df[input_columns].values  # training data: the three input signals as a NumPy array

pipe.fit(X)

### Predict with the model
Once the model has been trained, we can predict the cluster (class) of each window in any data set, or display the full training data color-coded by class.

In [None]:
x_classes = pipe.predict(X)
df = back_propagate_labels(df, pipe['preprocessing'], x_classes)
df
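
The model predicts one label per window, while the DataFrame has one row per sample, so the labels have to be mapped back to rows. The following is a minimal sketch of that idea under stated assumptions: the function `labels_per_row` is hypothetical, and marking uncovered rows with `-1` is a guess suggested by the `-1` entry in the plotting colormap, not the confirmed behavior of `back_propagate_labels`.

```python
def labels_per_row(n_rows, window_labels, window_size, window_step, unlabeled=-1):
    """Assign each row the label of the window that covers it."""
    row_labels = [unlabeled] * n_rows
    for window_index, label in enumerate(window_labels):
        start = window_index * window_step
        # With overlapping windows, a later window overwrites earlier labels.
        for row in range(start, min(start + window_size, n_rows)):
            row_labels[row] = label
    return row_labels


print(labels_per_row(n_rows=7, window_labels=[0, 2], window_size=3, window_step=3))
# [0, 0, 0, 2, 2, 2, -1]
```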

In [None]:
colormap = {-1: 'white', 0: 'red', 1: 'green', 2: 'blue', 3: 'orange', 4: 'purple', 5: 'yellow'}
fig, ax = pyplot.subplots(figsize=(24, 12))
sns.scatterplot(x=df.index, y='ph_sum', data=df, hue='class', palette=colormap, ax=ax)

#### Save model as joblib file

If you are satisfied with the result, you can save your model into a joblib file. 

In [None]:
model_path = Path("../models/clustering-model.joblib")
model_path.parent.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
joblib.dump(pipe, model_path, compress=9)  # joblib opens and writes the file itself
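
To check that a saved file can be read back, you can load it with `joblib.load`. The sketch below shows the round trip on a plain dictionary in a temporary folder so that it runs on its own. The dictionary and the file name `demo-model.joblib` are placeholders, not part of this project.

```python
import tempfile
from pathlib import Path

import joblib

# A plain dict stands in for the fitted pipeline so the example is self-contained.
stand_in_model = {"n_clusters": 3, "window_size": 300}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo-model.joblib"
    joblib.dump(stand_in_model, demo_path, compress=9)
    restored = joblib.load(demo_path)

print(restored == stand_in_model)  # True
```

Loading the real pipeline additionally requires the `si` package to be importable, because joblib, like pickle, stores references to the custom transformer classes rather than their code.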


#### Subsequent notebooks

With the saved model, you can create a Pipeline Component, which is the basic building block of a Pipeline.

Notebook [20-CreateInferenceWrapper](20-CreateInferenceWrapper.ipynb) shows how to create a Python wrapper around the model.  
Notebook [30-CreatePipelinePackage](30-CreatePipelinePackage.ipynb) demonstrates how to create the edge configuration package.