# Statistical analysis of Caru's sensor data:

*Forecasting with Prophet was tried first. This notebook tries a statistical approach instead.*

### Goal of the script
**Project goals:** Find patterns in the sensor signals that correlate with a person's activity.
Examples are the times someone goes to bed or wakes up in the morning, or nightly bathroom breaks. Patterns may also include unknown activities that nevertheless occur regularly across nights and persons. The aim is to create an activity report after each night.

**Milestone 1:** Normalize the data, identify patterns, and detect certain activities.

**Milestone 2:** Real-time activity reporting every 15 to 30 minutes: detect an activity (and possibly its type) and send a notification.

### Structure of the script

### Run the script

### Further thoughts and improvements:

> Questions:
> Contact Guillaume Azarias at guillaume.azarias@hotmail.com

### Import the relevant libraries

In [32]:
import pandas as pd
import numpy as np
import time
import re

import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.dates import DateFormatter

# Note: the interactive plot may work only in Jupyter Notebook, not in JupyterLab (JavaScript conflict)
%matplotlib widget
# %matplotlib inline

from datetime import datetime, timedelta
from pytz import timezone

In [33]:
from sklearn.model_selection import ParameterGrid
from sklearn.mixture import GaussianMixture

In [34]:
import fbprophet
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics
from fbprophet.plot import plot_cross_validation_metric

In [35]:
# Import the functions from the helper.py
from helper import df_dev_formater, find_index, df_generator, prophet_fit, prophet_plot, get_outliers, execute_cross_validation_and_performance_loop

### Load data from the Amazon S3 bucket:

[Link to the caru bucket on Amazon](https://s3.console.aws.amazon.com/s3/buckets/carudata/?region=eu-north-1&tab=overview) *(credentials required)*

### Load local file

In [4]:
device_nb = '13'  # device number as a two-digit string
device, df_dev = df_dev_formater(device_nb)

assert device.shape[0]==1, 'No, or several devices in the df'

# Check report:
print('Check report:\n##############################################')
print('Device contained in the dataset: ' + device)
print('Tenant using the device: ' + df_dev['tenant'].unique())
print('\nThere are ' + str(df_dev.shape[0]) + ' lines.')
last = df_dev.shape[0] - 1
print('Full dataset: {:%Y-%m-%d} to the {:%Y-%m-%d}.'
          .format(df_dev['ds'][0], df_dev['ds'][last]))
print('\nData types:')
print(df_dev.dtypes)

Check report:
##############################################
['Device contained in the dataset: device13']
['Tenant using the device: tenant01']

There are 967297 lines.
Full dataset: 2019-07-18 to the 2020-03-26.

Data types:
device                                object
tenant                                object
ds             datetime64[ns, Europe/Zurich]
light                                float64
temperature                          float64
humidity                             float64
co2                                  float64
dtype: object


## Data visualisation

In [41]:
df, fig, predict_n, today_index, lookback_n = df_generator(df_dev, device, 'co2', '2019-07-18', '2020-03-26',  '1T', 0.08)
# plt.show()

Full dataset: 2019-07-18 to the 2020-03-26. Analysed data the 2019-07-18 to the 2020-03-26.


In [8]:
find_index(df, '2019-09-08', '20:30')

1439


In [37]:
plt.close()
df, fig, predict_n, today_index, lookback_n = df_generator(df_dev, device, 'co2', '2019-12-01', '2019-12-10',  '5T', 0.08)

Full dataset: 2019-07-18 to the 2020-03-26. Analysed data the 2019-12-01 to the 2019-12-10.


In [40]:
# Configure the model
model = Prophet(interval_width=0.6, # width of the uncertainty interval, used as the anomaly threshold
                yearly_seasonality=False, weekly_seasonality=False, daily_seasonality=False,
                changepoint_prior_scale=0.01) # Trend flexibility: lower values give a stiffer trend (toward underfit), higher values a more flexible one (toward overfit)
model.add_seasonality(name='daily', period=1, fourier_order=12) # prior_scale is left at its default
# model.add_seasonality(name='half_day', period=0.5, fourier_order=10)

# Fit the model, flag outliers, and visualize
assert today_index>lookback_n, 'Not enough data for prediction (need today_index > lookback_n)'
fig, forecast, model = prophet_fit(df, model, today_index, '5T', 0.08, lookback_days=lookback_n, predict_days=predict_n)
outliers, df_pred = get_outliers(df, forecast, today_index, predict_days=predict_n)
prophet_plot(df, fig, today_index, predict_days=predict_n, outliers=outliers)
plt.show()
param_grid = {'model' : [model],
              'initial' : ['3 days'], # Training period before the first cutoff. If not provided, 3 * horizon is used. Same units as horizon
              'period'  : ['0.5 days'], # Spacing between cutoff dates. If not provided, 0.5 * horizon is used.
              'horizon' : ['1 days']} # A forecast is made for every observed point between cutoff and cutoff + horizon
execute_cross_validation_and_performance_loop(list(ParameterGrid(param_grid)), metric = 'mape')

o Trained on the data from the 2019-12-01 to the 2019-12-10 (41 days).
o Predict from the 2019-12-10 to the 2019-12-10 (1 days).


INFO:fbprophet:Making 10 forecasts with cutoffs between 2019-12-05 01:15:42.030000 and 2019-12-09 13:15:42.030000


| | initial | horizon | period | mse | rmse | mae | mape | coverage |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 days | 1 days | 0.5 days | 13216.762866 | 114.964181 | 87.34375 | 0.129939 | 0.427401 |
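
To make the columns of this table easier to read, here is a minimal sketch of how such metrics can be derived from a cross-validation frame. It assumes the columns `y`, `yhat`, `yhat_lower` and `yhat_upper` that Prophet's `cross_validation` returns. The function name `summarize_cv` and the choice to average over all rows at once are assumptions made for illustration; the helper used above may aggregate differently.

```python
import numpy as np
import pandas as pd

def summarize_cv(df_cv):
    """Aggregate error metrics over a whole cross-validation frame.

    Hypothetical helper: assumes columns y, yhat, yhat_lower, yhat_upper.
    """
    err = df_cv["y"] - df_cv["yhat"]
    # An observation is "covered" when it falls inside the uncertainty interval
    inside = (df_cv["y"] >= df_cv["yhat_lower"]) & (df_cv["y"] <= df_cv["yhat_upper"])
    return {
        "mse": float(np.mean(err ** 2)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / df_cv["y"]))),
        "coverage": float(inside.mean()),
    }

# Tiny synthetic frame, for illustration only (not sensor data)
toy_cv = pd.DataFrame({
    "y":          [100.0, 200.0, 400.0, 500.0],
    "yhat":       [110.0, 180.0, 400.0, 450.0],
    "yhat_lower": [ 95.0, 185.0, 380.0, 400.0],
    "yhat_upper": [120.0, 210.0, 420.0, 480.0],
})
metrics = summarize_cv(toy_cv)
print(metrics)
```

Coverage is the share of observations that fall inside the interval. With `interval_width=0.6`, a well-calibrated model would be expected to land near 0.6, so the 0.43 reported above suggests the intervals are somewhat too narrow for this signal.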


In [None]:
plt.close()
df, fig, predict_n, today_index, lookback_n = df_generator(df_dev, device, 'light', '2019-11-28', '2019-12-08',  '5T', 0.08)

In [None]:
plt.close()
df, fig, predict_n, today_index, lookback_n = df_generator(df_dev, device, 'co2', '2019-11-07', '2019-11-08',  '5T', 0.08)


In [None]:
plt.close()
df, fig, predict_n, today_index, lookback_n = df_generator(df_dev, device, 'co2', '2019-11-08', '2019-11-09',  '5T', 0.08)


In [7]:
plt.close()
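y = df.iloc[:,1]  # second column: sensor values
sns.distplot(y, bins=50, kde=False, rug=True)
plt.show()

# Separate day and night by fitting a two-component Gaussian mixture
y_np = y.to_numpy().reshape(-1, 1)
mixture = GaussianMixture(n_components=2).fit(y_np)
# Per-component mean, weight and standard deviation (component order is not guaranteed)
means_hat = mixture.means_.flatten()
weights_hat = mixture.weights_.flatten()
sds_hat = np.sqrt(mixture.covariances_).flatten()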

print(mixture.converged_)
print(means_hat)
print(sds_hat)
print(weights_hat)

True
[532.47990335 792.64881147]
[48.99686696 14.65875853]
[0.59764097 0.40235903]
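
scikit-learn does not guarantee the order of the mixture components, while later cells rely on index 0. A minimal sketch of one way to make the indices stable is to sort the components by their mean. The function name `fit_two_levels`, the fixed `random_state` and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_two_levels(values, seed=0):
    """Fit a 2-component 1-D Gaussian mixture and return (means, sds, weights)
    sorted by increasing mean, so index 0 is always the lower level."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=seed).fit(x)
    order = np.argsort(gm.means_.ravel())
    means = gm.means_.ravel()[order]
    # With one feature, each covariance matrix is 1x1, so ravel gives the variances
    sds = np.sqrt(gm.covariances_.ravel())[order]
    weights = gm.weights_[order]
    return means, sds, weights

# Synthetic two-level signal, for illustration only (not sensor data)
rng = np.random.default_rng(0)
toy = np.concatenate([rng.normal(530, 50, 600), rng.normal(790, 15, 400)])
means, sds, weights = fit_two_levels(toy)
print(means, sds, weights)
```

Which of the two levels corresponds to night still has to be decided from the data; sorting only makes the choice reproducible between runs.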


In [74]:
df.shape

(1439, 2)

In [10]:
df_index = df.reset_index(drop=True)
df_index.head(2)

| | ds | y |
|---|---|---|
| 0 | 2019-11-08 21:00:53.857000+01:00 | 789.379944 |
| 1 | 2019-11-08 21:01:53.957000+01:00 | 787.441162 |


### Moving average
https://medium.com/schkn/why-use-k-means-for-time-series-data-part-one-a8f19964f538

Taking advantage of the GaussianMixture result:
![First quartile](https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fwww.mathematicsdictionary.com%2Fenglish%2Fvmd%2Fimages%2Fi%2Finterquartilerange.gif&f=1&nofb=1)

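GaussianMixture estimates the mean and covariance of each component, not the median, so the component mean stands in for the median here. The intended threshold was *mean - 2(sd)* instead of *median - 2(sd)*; note that the cell below subtracts a single standard deviation.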

In [14]:
# Component 0 is taken to be the night level; both statistics come from that component
mean_of_night = means_hat[0]
sd_of_night = sds_hat[0]

threshold_night = mean_of_night - sd_of_night
threshold_night

483.4830363932725
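
This section is titled "Moving average", but no cell computes one. As a rough sketch only, a moving average could be combined with the threshold as follows: smooth the signal, then report each stretch where the smoothed value stays below the threshold. The function name `below_threshold_runs`, the window of 15 samples and the synthetic signal are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def below_threshold_runs(y, threshold, window=15):
    """Smooth `y` with a centred moving average and return (start, end)
    positions of each run where the smoothed signal stays below `threshold`.
    `end` is exclusive. The window length is an arbitrary choice."""
    smooth = pd.Series(y).rolling(window, center=True, min_periods=1).mean()
    below = (smooth < threshold).to_numpy()
    # Pad with False so every run has a detectable start and end
    edges = np.diff(np.concatenate([[False], below, [False]]).astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))

# Synthetic signal: high level, a dip of 100 samples, high level again
toy = np.concatenate([np.full(200, 790.0), np.full(100, 450.0), np.full(200, 790.0)])
runs = below_threshold_runs(toy, threshold=483.48, window=15)
print(runs)
```

Smoothing delays the detected start and brings the detected end forward by a few samples, because the window has to be mostly inside the dip before its average crosses the threshold. A shorter window reduces this lag at the cost of more sensitivity to noise.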

### t-test based detection of day/night change

*If one group of points spanning 6 hours is statistically different from the following group of points spanning 6 hours,
and the first group is not statistically different from night,
and the second group is not statistically different from day,
then the boundary between the two groups is the time when the person woke up.*

The cells below test only the first condition, on windows of 12 samples (3 hours at the 15-minute resolution) and of 50 samples.

**Does not work.**

In [131]:
from scipy import stats

df, fig, predict_n, today_index, lookback_n = df_generator(df_dev, device, 'co2', '2019-11-08', '2019-11-09',  '15T', 0.08)
df_index = df.reset_index(drop=True)

period = 12  # samples per window (3 hours at '15T')
for i in range(0, df_index.shape[0], period):
    if i+2*period<df_index.shape[0]-1:
        print(df_index.iloc[i, 0])
        print(df_index.iloc[i+period, 0])
        # iloc excludes the end position, so each window holds period-1 points
        a = df_index.iloc[i:i+period-1, 1]
        b = df_index.iloc[i+period:i+2*period-1, 1]
        # ttest_rel returns (statistic, p-value); the printed value is the p-value
        t_stat, p_value = stats.ttest_rel(a,b)
        print("{:.2f}".format(p_value))

Full dataset: 2019-07-18 to the 2020-03-26. Analysed data the 2019-11-08 to the 2019-11-09.
2019-11-08 21:14:55.283000+01:00
2019-11-09 00:14:53.671000+01:00
0.52
2019-11-09 00:14:53.671000+01:00
2019-11-09 03:14:52.013000+01:00
0.00
2019-11-09 03:14:52.013000+01:00
2019-11-09 06:14:50.234000+01:00
0.00
2019-11-09 06:14:50.234000+01:00
2019-11-09 09:14:48.475000+01:00
0.03
2019-11-09 09:14:48.475000+01:00
2019-11-09 12:14:46.845000+01:00
0.83
2019-11-09 12:14:46.845000+01:00
2019-11-09 15:14:45.228000+01:00
0.09
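
The first condition of the rule can also be sketched as a standalone function. This version swaps in Welch's unpaired t-test (`scipy.stats.ttest_ind` with `equal_var=False`) in place of the paired `ttest_rel` used above, since the two windows are not paired measurements, and it uses full windows of `period` points. The function name `find_level_changes`, the `alpha` level and the synthetic data are assumptions for illustration; the night and day conditions of the rule are not implemented.

```python
import numpy as np
from scipy import stats

def find_level_changes(y, period, alpha=0.01):
    """Compare each window of `period` samples with the next one using
    Welch's t-test and return the start position of the second window
    whenever the two differ at level `alpha`."""
    y = np.asarray(y, dtype=float)
    changes = []
    for i in range(0, len(y) - 2 * period + 1, period):
        a = y[i:i + period]                # slice end is exclusive: `period` points
        b = y[i + period:i + 2 * period]
        t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
        if p_value < alpha:
            changes.append(i + period)
    return changes

# Synthetic signal with one level change at position 36 (not sensor data)
rng = np.random.default_rng(1)
toy = np.concatenate([rng.normal(790, 15, 36), rng.normal(530, 50, 36)])
changes = find_level_changes(toy, period=12)
print(changes)
```

Consecutive sensor readings are usually autocorrelated, so the p-values from either test may be overconfident on real data, and a change is only located to within one window length.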


In [123]:
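from scipy import stats

period = 50  # samples per window
# With only 47 rows (see the next cell), the condition below is never met and nothing is printed
for i in range(0, df_index.shape[0], period):
    if i+2*period<df_index.shape[0]-1:
        print(df_index.iloc[i, 0])
        print(df_index.iloc[i+period, 0])
        # iloc excludes the end position, so each window holds period-1 points
        a = df_index.iloc[i:i+period-1, 1]
        b = df_index.iloc[i+period:i+2*period-1, 1]
        # ttest_rel returns (statistic, p-value); the printed value is the p-value
        t_stat, p_value = stats.ttest_rel(a,b)
        print("{:.2E}".format(p_value))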

In [124]:
df_index.shape[0]

47