# Clustering time series data with SQL

**The data**

The data in the example cases have a `time` column and one or more columns representing `engine heat` in degrees Celsius.

An increase of 1 in `time` equals 1 hour. Each case contains 3 days of data.

This notebook is loosely inspired by an actual business need, but the data and examples are a generalization of the problem.

**Clustering**

The value in the `engine heat` column increases over time while the engine is running. Once the temperature rises to a certain level, the system automatically switches to the secondary engine.

For reporting purposes, the following is needed:
* The cluster number for each observation
* The number of engine switches (clusters) in the data

For one reason or another, the engine temperature sensor is the only available source of information. Because of system limitations, SQL is the only available analytics tool.
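As a rough illustration of how the two reporting items could be produced with SQL alone, the sketch below runs a window-function query through Python's built-in `sqlite3` module. It assumes that a switch shows up as a drop in temperature between consecutive readings, that the SQL engine supports window functions (SQLite 3.25 or later), and it uses a hypothetical table name `engine_log` with hand-made readings that are not taken from the notebook's generated data.

```python
import sqlite3

# Hypothetical readings (time, engine_heat): heat rises while an engine runs
# and falls back when the system switches to the other engine.
readings = [
    (0, 19.0), (1, 19.6), (2, 20.9),  # first run
    (3, 18.2), (4, 19.1),             # drop at time 3: assumed switch
    (5, 18.0), (6, 18.4), (7, 20.3),  # drop at time 5: assumed switch
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engine_log (time INTEGER, engine_heat REAL)")
conn.executemany("INSERT INTO engine_log VALUES (?, ?)", readings)

# Step 1: flag rows whose heat is lower than the previous row's heat.
# Step 2: a running sum of the flags gives a cluster number per row.
query = """
WITH flagged AS (
    SELECT time,
           engine_heat,
           CASE WHEN engine_heat < LAG(engine_heat) OVER (ORDER BY time)
                THEN 1 ELSE 0 END AS is_switch
    FROM engine_log
)
SELECT time,
       engine_heat,
       1 + SUM(is_switch) OVER (ORDER BY time) AS cluster_id
FROM flagged
ORDER BY time
"""
rows = conn.execute(query).fetchall()
conn.close()

cluster_ids = [row[2] for row in rows]
n_switches = max(cluster_ids) - 1  # one switch between consecutive clusters

print(cluster_ids)
print(n_switches)
```

The first row has no predecessor, so `LAG` returns `NULL` there and the `CASE` expression falls through to 0, which keeps the first observation in cluster 1.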




## Initialize

In [142]:
#Import libraries
import pandas as pd
import numpy as np

In [191]:
#Constants
days_n = 3
observations_n = days_n * 24
clusters_n = 7
start_heat_min = 18
start_heat_max = 22
col_time = "time"
col_cluster_rank = "cluster_rank"
col_cluster_id = "cluster_id"
col_start_heat = 'start_heat'

In [192]:
#Generate time series from 0 ... observations_n
def generate_initial_data():

    #Generate the time column
    time_series = np.arange(observations_n)

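    #Create cluster breakpoints
    #Note: randint can draw 0 or repeat a value, in which case fewer than
    #clusters_n distinct clusters are produced
    np.random.seed(10)
    cluster_breakpoints = np.sort(np.random.randint(observations_n, size=clusters_n-1))
    cluster_breakpoints = np.insert(cluster_breakpoints, 0, 0)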
    
    #Create cluster partitions
    cluster_id = np.repeat(0, observations_n)
    cluster_id[cluster_breakpoints] = 1
    cluster_id = np.cumsum(cluster_id)
    
    #Create starting heat
    cluster_start_heat = np.repeat(np.nan, observations_n)
    cluster_start_heat[cluster_breakpoints] = np.random.randint(low=start_heat_min, high=start_heat_max, size=clusters_n)
    
    #Create the data frame
    df = pd.DataFrame({col_time: time_series, col_cluster_id: cluster_id, col_start_heat: cluster_start_heat})
    
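    #Rank inside the cluster
    #Note: all rows of a group tie on cluster_id, so rank() (method='average') gives
    #every row the group's mean rank, truncated by astype(int): an 8-row cluster gets 4.
    #For a 1..n position inside the cluster, use df.groupby(col_cluster_id).cumcount() + 1.
    df[col_cluster_rank] = df.groupby(col_cluster_id)[col_cluster_id].rank().astype(int)
    
    #Carry each cluster's starting heat forward to the rest of its rows
    df[[col_start_heat]] = df[[col_start_heat]].ffill()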
    
    return df

def generate_engine_heat(df_arg, col_engine_heat='engine_heat'):
    
    df = df_arg.copy()
    
    col_heat_delta = 'heat_delta_temp'
    col_heat_cum = 'heat_cum_temp'
    
    df[col_heat_delta] = np.random.random(size=df.shape[0]) * df[col_cluster_rank]**0.5
    
    df[col_heat_cum] = df.groupby(col_cluster_id)[col_heat_delta].cumsum()
    
    df[col_engine_heat] = df[col_start_heat] + df[col_heat_cum]
    
    df.drop([col_heat_delta, col_heat_cum], inplace=True, axis=1)
    
    return df

def drop_columns(df_arg):
    
    df = df_arg.copy().drop([col_cluster_id, col_cluster_rank], axis=1)
    
    return df

## Generate initial data

In [193]:
#Display the data frame

df_init = generate_initial_data()
display(df_init.head(12))

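|   | time | cluster_id | start_heat | cluster_rank |
|---|------|------------|------------|--------------|
| 0 | 0 | 1 | 19.0 | 4 |
| 1 | 1 | 1 | 19.0 | 4 |
| 2 | 2 | 1 | 19.0 | 4 |
| 3 | 3 | 1 | 19.0 | 4 |
| 4 | 4 | 1 | 19.0 | 4 |
| 5 | 5 | 1 | 19.0 | 4 |
| 6 | 6 | 1 | 19.0 | 4 |
| 7 | 7 | 1 | 19.0 | 4 |
| 8 | 8 | 2 | 18.0 | 1 |
| 9 | 9 | 3 | 20.0 | 3 |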
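
The `cluster_rank` value is constant within each cluster in the output above. That appears to follow from how pandas ranks ties: every row in a group has the same `cluster_id`, so `rank()` with its default `method='average'` assigns the mean rank to all of them. The small sketch below, on a made-up frame named `toy`, contrasts this with `cumcount() + 1`, which gives the 1..n position inside each group. Whether the position was the intended value is an assumption.

```python
import pandas as pd

# Hypothetical frame: cluster 1 has 4 rows, cluster 2 has 1, cluster 3 has 2.
toy = pd.DataFrame({"cluster_id": [1, 1, 1, 1, 2, 3, 3]})

# Ranking a column against itself inside its own groups: all values tie,
# so each row receives the average rank of the group (2.5, 1.0, 1.5),
# which astype(int) truncates to 2, 1, 1.
avg_rank = toy.groupby("cluster_id")["cluster_id"].rank().astype(int)

# Position of each row inside its group, starting from 1.
position = toy.groupby("cluster_id").cumcount() + 1

print(avg_rank.tolist())
print(position.tolist())
```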


## Case 1: Clustering a single variable
In this case we assume that engine heat always increases while an engine is running.

In [195]:
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

df_case1 = df_init.copy()

df_case1 = generate_engine_heat(df_case1)

fig = go.Figure()
fig.add_scatter(
    x=df_case1['time'],
    y=df_case1['engine_heat'],
    mode='markers',
    marker={
        'size': 5,
        'color': 'red',
        'opacity': 0.6,
        'colorscale': 'Viridis'
    }
)
iplot(fig)
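
Because the generator keeps the true `cluster_id`, any labeling derived from the heat signal can be checked against it. The sketch below does this in pandas on hand-made data, assuming again that a switch is recognized as a drop in heat. The column names `true_cluster_id` and `detected_cluster_id` are hypothetical, and the readings are chosen so that one switch starts warmer than the previous reading and is therefore missed.

```python
import pandas as pd

# Hypothetical toy data; true_cluster_id plays the role of the generator's
# cluster_id column. The switch at time 5 starts warmer than the reading at
# time 4, so it produces no drop in heat.
toy = pd.DataFrame({
    "time": range(8),
    "engine_heat": [19.0, 19.6, 20.9, 18.2, 18.9, 20.1, 20.8, 18.5],
    "true_cluster_id": [1, 1, 1, 2, 2, 3, 3, 4],
})

# A drop relative to the previous reading is treated as a switch.
# diff() is NaN on the first row, and NaN < 0 evaluates to False.
detected_switch = toy["engine_heat"].diff() < 0
toy["detected_cluster_id"] = detected_switch.cumsum() + 1

# True switches are the rows where the known cluster id increases.
true_switch = toy["true_cluster_id"].diff() > 0

# Switches that exist in the data but leave no drop in the heat signal.
missed_times = toy.loc[true_switch & ~detected_switch, "time"].tolist()

print(toy["detected_cluster_id"].tolist())
print(missed_times)
```

A check like this suggests that the drop rule can undercount switches whenever a new run starts at or above the last reading of the previous one, which is possible here because starting heats are drawn from a narrow range.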

## Case 2: Clustering a single variable with more variation

In [None]:
pass

## Case 3: Clustering combination of variables

In [None]:
pass