# Timeseries Insights API - Anomaly Detection

This notebook first prepares the data and then uses it to test the Timeseries Insights API. We use the API to create a dataset, list all datasets, fetch a time series from the dataset we created, query that dataset for anomalies, and finally demonstrate how to append events in a streaming fashion with appendEvents.

## Preparation and visualization of data

### Installing libraries and setting variables

In [None]:
%load_ext autoreload
%autoreload 2
!pip3 install --upgrade oauth2client 
!pip3 install pandas_bokeh

In [None]:
from oauth2client.client import GoogleCredentials
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
import json
import requests
import matplotlib.pyplot as plt
import pandas_bokeh
import datetime
import time

In [None]:
client = bigquery.Client()
PROJECT_ID="<project-id>"

#Key file for API access and TSI API endpoint
key_file = '/home/jupyter/TSI-API/keyfile/key.json' # JSON file has key of service account for Vertex AI
ts_endpoint =  f'https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT_ID}/datasets'


### Extract full data from BigQuery and visualize with Pandas for reference when calling API

In [None]:
# telling bokeh library where to place output of visualization

pd.set_option('display.max_colwidth', None)
pandas_bokeh.output_file("chart.html")
pandas_bokeh.output_notebook()

In [None]:
#Extract full dataset from BigQuery (200,000+ rows)

sql = """
  select FARM_FINGERPRINT(CONCAT(time, temp, Humidity, Light, h2_raw)) groupId, 
            FORMAT_TIMESTAMP("%Y-%m-%dT%X%Ez", time, "UTC") eventTime, 
            temp, 
            Humidity, 
            Light, 
            h2_raw,
            'LTHH' as measure
        from `<project-id>.<dataset-id>.full_ts_data`
"""
df = client.query(sql).to_dataframe()
# df = df.melt(id_vars=['groupId','eventTime'])
df = df.sort_values("eventTime" , ascending = True)

In [None]:
# data frame with min-max normalized values for temp, Light, Humidity, and h2_raw

normFrame = df.copy(deep = True)
cols_to_norm = ['temp','Light', 'Humidity', 'h2_raw']
normFrame[cols_to_norm] = normFrame[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

In [None]:
#Display and save chart

startTime = pd.to_datetime("06/14/21", infer_datetime_format=True) 
endTime =  pd.to_datetime("08/09/21", infer_datetime_format=True) 
size = (1400,500)
normFrame.plot_bokeh(x = 'eventTime' , 
               xlabel = 'time',
               y = [ 'temp', "h2_raw", "Humidity", "Light"], 
               kind = 'line', 
               figsize = size,
               xlim = [startTime, endTime],
               title="both (normalized)"
              )

pandas_bokeh.output_file("chart.html")
df.plot_bokeh(x = 'eventTime' , 
               xlabel = 'time',
               y = ["Light"], 
               kind = 'line', 
               figsize = size,
               xlim = [startTime, endTime],
               title="light"
              )

### Read data from BQ and visualize head to show the data converted to JSON for the API
The data was converted into this format so that it could easily be written to a JSON file for dataset upload. The output shows how the data looks after a transformation that is not done in this notebook: an event timestamp (every 23 seconds), a hashed groupId (as required by the API), and the dimension data.
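As a rough illustration of the target shape, here is a minimal sketch of how one wide sensor reading could be turned into a single event. The helper name `row_to_event` and the sample values are hypothetical; the `name` and `doubleVal` field names follow the `STRUCT` used in the query, and treating `groupId` as a string is an assumption based on how it is handled later in this notebook.

```python
def row_to_event(group_id, event_time, readings):
    """Build one event dict from a single timestamped sensor reading.

    `readings` maps a dimension name (e.g. "temp") to a numeric value.
    """
    dimensions = [
        {"name": name, "doubleVal": float(value)}
        for name, value in sorted(readings.items())
    ]
    # groupId is kept as a string here (assumption, see lead-in)
    return {"groupId": str(group_id), "eventTime": event_time, "dimensions": dimensions}


# Hypothetical sample values, for illustration only
event = row_to_event(
    group_id=1234567890,
    event_time="2021-06-14T00:00:04+00:00",
    readings={"temp": 21.5, "Humidity": 40.0, "Light": 300.0, "h2_raw": 512.0},
)
print(event)
```

Each numeric column becomes one entry in `dimensions`, so a reading with four measurements yields an event with four dimension entries.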

In [None]:
# Read melted data from BQ table and prepare JSON for the Timeseries Insights API dataset

sql_out = """
with data as
    (
        select groupId, eventTime, STRUCT(variable as name, value as doubleVal) as dimensions 
        from (
                select * from `<project-id>.<dataset-id>.anomaly_data`
                order by eventTime, variable
             )
    )

    SELECT eventTime, groupId, ARRAY_AGG(dimensions) AS dimensions FROM data GROUP BY eventTime, groupId 
"""
df_out = client.query(sql_out).to_dataframe()
df_out.head()

### Information on the JSON file used to upload to API
In the examples below, we use a file in a Cloud Storage bucket to upload our data to the sample dataset. This file contains a subset of our data, specifically the data from 06/14 to 06/17. The data was converted from BQ to this consolidated JSON format outside of this notebook.

### A note on our slicing
Due to the nature of the API, the dimension that is "sliced" on must be a string. Since our data is theoretically from one "sensor", we don't have multiple string-valued dimensions to slice on. As a result, in our JSON data, we have added a 5th dimension called "measure" that contains the same string for all events, simply as a filler so that we can query the time series API.
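A minimal sketch of how such a filler dimension could be attached to an event. The helper name `add_filler_dimension` is hypothetical, and the value `"LTHH"` is taken from the BigQuery query earlier in this notebook; whatever string is used must match the one actually present in the uploaded data.

```python
def add_filler_dimension(event, name="measure", value="LTHH"):
    """Return a copy of `event` with one constant string dimension appended."""
    updated = dict(event)
    # Build a new list so the caller's event is left untouched
    updated["dimensions"] = list(event.get("dimensions", [])) + [
        {"name": name, "stringVal": value}
    ]
    return updated


# Hypothetical event with a single numeric dimension
original = {
    "groupId": "1234567890",
    "eventTime": "2021-06-14T00:00:04+00:00",
    "dimensions": [{"name": "temp", "doubleVal": 21.5}],
}
sliceable = add_filler_dimension(original)
print(sliceable["dimensions"])
```

Because every event carries the same string, slicing on "measure" produces exactly one slice that contains the whole series.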


## Testing the API
Below we will establish some helper functions and an authorization token. We will then test creating a dataset, listing existing datasets, fetching a time series, querying a time series for anomalies, and then prep a request for streaming data.

### Establish helper functions and extract authentication from Service Account key file

In [None]:
# reads json file and returns request body

def read_json_file(path):
    with open(path) as json_file:
        query = json.load(json_file)
        
    return query

In [None]:
# Function to interact with time series API

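def query_ts(method, endpoint, data, auth_token):
    # Serialize the body as JSON; str() on a dict yields a Python repr (single quotes), which is not valid JSON
    data = json.dumps(data)
    headers = {'Content-type': 'application/json', "Authorization": f"Bearer {auth_token}"}
    
    if method == "GET":
        resp = requests.get(endpoint, headers=headers)
    elif method == "POST":
        resp = requests.post(endpoint, data=data, headers=headers)
    elif method == "DELETE":
        resp = requests.delete(endpoint, headers=headers)
    else:
        # Fail early instead of hitting an unbound `resp` below
        raise ValueError(f"Unsupported HTTP method: {method}")
    #print(resp.content)
    return resp.json()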

In [None]:
!gcloud auth activate-service-account --key-file {key_file}
!gcloud auth print-access-token
token_array = !gcloud auth print-access-token 
token = token_array[0]
token

### 1. Create dataset

In [None]:
# Create dataset using API

file_data = {
    "name": "data_test",
    "dataNames": [
        "measure",
        "Humidity",
        "Light",
        "h2_raw",
        "temp",
    ],
    "dataSources": [
        {"uri": "gs://demo-ts-data/jbl-short-ts2.json"} #sample of data in Cloud Storage JSON file
    ]
} 
res = query_ts(method="POST", endpoint=ts_endpoint, data=file_data, auth_token=token)
res

### 2. List datasets
There are four other datasets we had created when testing, but here we should see the "data_test" dataset created above.

In [None]:
listdata = query_ts(method="GET", endpoint=ts_endpoint, data="", auth_token=token)
listdata

### 3. Fetch Timeseries

In [None]:
file_data = {
    "pinnedDimensions": [
        {
            "name":"measure",
            "stringVal":"LTTH"
        }
        ],
      "timeInterval": {
        "startTime": "2021-06-14T00:00:00Z",
        "length": "864000s"
      },
       "granularity": "3600s",
       "metric": "temp"
}

# Fetch timeseries for inspection
dataset_name = "data_test"
fetch_ds_endpt = f'https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT_ID}/datasets/{dataset_name}:fetchTimeseries'
res = query_ts(method="POST", endpoint=fetch_ds_endpt, data=file_data, auth_token=token)

### 4. Querying the Timeseries

In [None]:
request_body = {
    "detectionTime": "2021-06-17T16:40:00Z",
    "slicingParams": {
        "dimensionNames": ["measure"]
        },
    "timeseriesParams": {
        "forecastHistory": "43200s",
        "granularity": "450s",
        "metric": "temp"
        },
    "forecastParams": {
        "sensitivity": 0.9,
        "noiseThreshold": 12.0,
        "seasonalityHint": "DAILY"
        },
   
    "returnNonAnomalies": "true",
    "returnTimeseries": "true"
}


# get forecast
dataset_name = "data_test"
query_ds_endpt = f'https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT_ID}/datasets/{dataset_name}:query'
res = query_ts(method="POST", endpoint=query_ds_endpt, data=request_body, auth_token=token)
res

### 5. Testing the appendEvents functionality

#### Preparing the data
Because appended streaming data has to be recent, we need to shift the timestamps of our data forward. The dataset we created contains 06/14 - 06/17, so we take the data immediately following it (06/18 and 06/19) from a JSON file that extends beyond that range, and move that subset to the two days preceding the current date. We can then loop through the shifted events and attempt to append them to the dataset.
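The shift can also be computed relative to the current date instead of being hard-coded. The following is a minimal sketch, assuming event timestamps are ISO-8601 strings with a UTC offset (as in the filter used in this section); the helper name `shift_to_recent` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def shift_to_recent(event_time, source_start, now=None):
    """Shift an ISO-8601 timestamp so that `source_start` lands at midnight two days before `now`."""
    now = now or datetime.now(timezone.utc)
    target_start = (now - timedelta(days=2)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    offset = target_start - source_start
    return (datetime.fromisoformat(event_time) + offset).isoformat()


# First day of the subset we want to move (06/18)
source_start = datetime(2021, 6, 18, tzinfo=timezone.utc)
# A fixed "now" is passed here only to make the example reproducible
fixed_now = datetime(2022, 2, 23, 12, 0, tzinfo=timezone.utc)
shifted = shift_to_recent("2021-06-18T00:00:04+00:00", source_start, now=fixed_now)
print(shifted)
```

Applying one constant offset to every event keeps the spacing between events intact, which plain string replacement of the date only does as long as no event crosses a month boundary.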

In [None]:
# getting all data from the JSON file (one event per line), in order to iterate over it
events = []
count = 0

# reading events from the JSON file, and appending to the events list
with open('data/jbl-full-ts2.json', 'r') as myfile:
    for line in myfile:
        line = line.strip()
        if not line:
            continue  # skip blank lines; readline() returned '' at end of file, which made json.loads fail
        json_load = json.loads(line)
        if "2021-06-18T00:00:04+00:00" < json_load['eventTime'] < "2021-06-20T00:00:04+00:00":
            events.append(json_load)
            count += 1

print("Events loaded: " + str(count))  # count for reference later

# replace the extracted dates with more recent ones (hard-coded here; adjust to the two days preceding today's date)

for event in events:
    event['eventTime'] = event['eventTime'].replace('2021-06-18', '2022-02-21')
    event['eventTime'] = event['eventTime'].replace('2021-06-19', '2022-02-22')
    event['groupId'] = event['groupId'][:-1]
events[0]['eventTime'] # Check to make sure dates have been updated in event array

#### Send requests

In [None]:
url_endpoint = f'https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT_ID}/datasets/{dataset_name}:appendEvents'

#body containing streamed events
request_body = {
   "events":[]
}

#choose number of appends - any number greater than 0 appends the first 100 events:
appends = 1 # use this variable to toggle between 0 and >0 for uploading one or several events

#TODO sleep 3 sec
if appends > 0:
    # iterating through the first 100 events and appending each to the dataset
    for event in events[0:100]: 
        request_body['events'] = [event]
        res = query_ts('POST', url_endpoint, request_body, token)
        time.sleep(1) # sleep to make sure not too many requests are sent to the API at once
else:
    #testing one append
    request_body['events'] = [events[0]] # "events" must be a list, even for a single event
    res = query_ts('POST', url_endpoint, request_body, token)
        
# printing result of last api call       
print(json.dumps(res, indent=4))
print(json.dumps(request_body))

#### Comment on append response
As seen above in the output, the response to a "successful" event append is currently just an empty response. We think it would be helpful for debugging purposes **if the response returned some sort of "success" message**.
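Until then, the response can be interpreted on the client side. Below is a minimal sketch, assuming that a successful append yields an empty JSON object and that failures carry an `error` object with `code`, `status`, and `message` fields, as in the error output shown at the end of this notebook. The helper name `summarize_append_response` is hypothetical.

```python
def summarize_append_response(response):
    """Turn an appendEvents response dict into a short, readable status string."""
    if not isinstance(response, dict):
        return "unexpected response"
    error = response.get("error")
    if error is None:
        # No error payload; an empty body is treated as success (assumption)
        return "success"
    return f"error {error.get('code')} ({error.get('status')}): {error.get('message')}"


print(summarize_append_response({}))
print(summarize_append_response({
    "error": {
        "code": 400,
        "message": "Request contains an invalid argument.",
        "status": "INVALID_ARGUMENT",
    }
}))
```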

#### Check if event(s) were appended
This can be done by listing the datasets and checking whether the num_items_examined value on the "data_test" dataset is greater than 6968, which is what it was when we initially created it.

In [None]:
listdata = query_ts(method="GET", endpoint=ts_endpoint, data="", auth_token=token)
listdata

#### Querying range of appended events
As another test of appendEvents, we run a time series query on the range of data we attempted to append. (The most recently run test appended data with pseudo-dates of 02/20 and 02/21.) If we are able to run a successful query on that timeframe, then we know the data may have been appended, even though it isn't showing up when the dataset is listed.
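To see which time range such a query actually reads, the history window can be worked out from the request parameters. This is a minimal sketch, assuming the window ends at `detectionTime` and reaches `forecastHistory` seconds into the past; that is our reading of the parameters, not something verified against the API. The helper name `history_window` is hypothetical.

```python
from datetime import datetime, timedelta


def history_window(detection_time, forecast_history):
    """Return (start, end) of the assumed history window as ISO-8601 'Z' strings."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    end = datetime.strptime(detection_time, fmt)
    seconds = int(forecast_history.rstrip("s"))  # e.g. "100000s" -> 100000
    start = end - timedelta(seconds=seconds)
    return start.strftime(fmt), end.strftime(fmt)


start, end = history_window("2022-02-20T16:40:00Z", "100000s")
print(start, end)
```

With a detection time of 2022-02-20T16:40:00Z and a history of 100000s (1 day, 3 hours, 46 minutes, 40 seconds), the window starts at 2022-02-19T12:53:20Z, so appended events dated before that point would not contribute to the query.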

In [None]:
request_body = {
    "detectionTime": "2022-02-20T16:40:00Z",
    "slicingParams": {
        "dimensionNames": ["measure"]
        },
    "timeseriesParams": {
        "forecastHistory": "100000s",
        "granularity": "450s",
        "metric": "temp",
        },
    "forecastParams": {
        "sensitivity": 0.9,
        "noiseThreshold": 12.0,
        "seasonalityHint": "DAILY"
        },
   
    "returnNonAnomalies": "true",
    "returnTimeseries": "true"
}


# get forecast
dataset_name = "data_test"
query_ds_endpt = f'https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT_ID}/datasets/{dataset_name}:query'
res = query_ts(method="POST", endpoint=query_ds_endpt, data=request_body, auth_token=token)
res

In [101]:
eval_request_body = {
    "detectionTime": "2021-06-18T00:59:43Z",
    "pinnedDimensions": [{"name": "measure",  "stringVal": "LTTH"}],
    "timeseriesParams": {
        "forecastHistory": "315400s",
        "granularity": "3601s",
        "metric": "temp",
        "metricAggregationMethod": "AVERAGE",
        },
    "forecastParams": {
        "sensitivity": 0.9,
        "noiseThreshold": 12.0,
        "seasonalityHint": "DAILY"
        }
}


# evaluate a single slice
dataset_name = "data_test"
query_ds_endpt = f'https://timeseriesinsights.googleapis.com/v1/projects/{PROJECT_ID}/datasets/{dataset_name}:evaluateSlice'
res = query_ts(method="POST", endpoint=query_ds_endpt, data=eval_request_body, auth_token=token)
res

{'error': {'code': 400,
  'message': 'Request contains an invalid argument.',
  'status': 'INVALID_ARGUMENT'}}