# Pre/Post-process data in a production environment

During model development, `TSDataset` is used to preprocess the raw data (feature engineering, data sampling, scaling, ...) and to postprocess the predicted result (mainly unscaling). This guide shows how to replay that preprocessing and postprocessing in a production environment (e.g. model serving).

In this guide, we will:
1. Train a `TCNForecaster` on the nyc_taxi dataset, then export the model in ONNX format together with the fitted scaler.
2. Show how to replay the preprocessing and postprocessing in a production environment.
3. Evaluate the performance of the preprocessing and postprocessing.
4. Share more tips on this topic.

## Forecaster development

First let's prepare the data. We will manually download the data to show the details.

In [30]:
# run the following command to download the dataset
!wget https://raw.githubusercontent.com/numenta/NAB/v1.0/data/realKnownCause/nyc_taxi.csv

--2022-10-15 17:16:51--  https://raw.githubusercontent.com/numenta/NAB/v1.0/data/realKnownCause/nyc_taxi.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 265771 (260K) [text/plain]
Saving to: ‘nyc_taxi.csv’


2022-10-15 17:16:53 (577 KB/s) - ‘nyc_taxi.csv’ saved [265771/265771]



Next, we load the data into a pandas DataFrame and preprocess it with `TSDataset`.

In [36]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from bigdl.chronos.data import TSDataset

# load the data into a pandas dataframe
df = pd.read_csv("nyc_taxi.csv", parse_dates=["timestamp"])

# use nyc_taxi public dataset
train_data, _, test_data = TSDataset.from_pandas(df,
                                                 dt_col="timestamp",
                                                 target_col="value",
                                                 repair=False,
                                                 with_split=True,
                                                 test_ratio=0.1)

# create a scaler for data scaling
scaler = StandardScaler()

# preprocess (generate datetime features, scale and roll sampling)
# the scaler is fitted on the training data only and reused for the test data
for data in [train_data, test_data]:
    data.gen_dt_feature(features=["WEEKDAY", "HOUR", "MINUTES"])\
        .scale(scaler, fit=(data is train_data))\
        .roll(lookback=48, horizon=24)
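
To make the datetime step more concrete, here is a minimal sketch of the kind of calendar features that `WEEKDAY`, `HOUR` and `MINUTES` describe, written with the standard library only. The helper `datetime_features` is hypothetical and is not how `gen_dt_feature` is implemented; the Monday-is-0 weekday convention and reading `MINUTES` as the minute of the hour are assumptions, and the timestamps are illustrative.

```python
from datetime import datetime


def datetime_features(timestamps):
    """Return one dict of calendar features per timestamp (hypothetical helper)."""
    return [
        {
            "WEEKDAY": ts.weekday(),  # assumption: Monday == 0 ... Sunday == 6
            "HOUR": ts.hour,
            "MINUTES": ts.minute,     # assumption: minute of the hour
        }
        for ts in timestamps
    ]


# two illustrative timestamps, 30 minutes apart
stamps = [datetime(2014, 7, 1, 0, 0), datetime(2014, 7, 1, 0, 30)]
features = datetime_features(stamps)
```

Because these features depend only on the timestamp itself, they can be regenerated at serving time without any state saved from training.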



In [37]:
from bigdl.chronos.forecaster import TCNForecaster  # TCN is algorithm name

# create a forecaster
forecaster = TCNForecaster.from_tsdataset(train_data)

# train the forecaster
forecaster.fit(train_data)

Global seed set to 3551947761
Global seed set to 3551947761
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name  | Type            | Params
------------------------------------------
0 | model | TemporalConvNet | 5.6 K 
1 | loss  | MSELoss         | 0     
------------------------------------------
5.6 K     Trainable params
0         Non-trainable params
5.6 K     Total params
0.022     Total estimated model params size (MB)



In [2]:
# export the forecaster in ONNX format
forecaster.export_onnx_file(dirname="nyc_tax_onnx_model", quantized_dirname=None)



In [3]:
import pickle

# save the fitted scaler
# there are many ways to do this; we use pickle here
with open('scaler.pkl','wb') as f:
    pickle.dump(scaler, f)
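
Pickle is convenient, but a pickle file should only be loaded from a trusted source. As one possible alternative, the sketch below stores only the numbers a standard scaler needs, in JSON. It assumes the scaler exposes per-column `mean_` and `scale_` values, as scikit-learn's `StandardScaler` does after fitting; the helper names and the sample numbers are hypothetical.

```python
import json


def dump_scaler_params(mean, scale):
    """Serialize per-column mean and scale values to a JSON string."""
    return json.dumps({
        "mean": [float(v) for v in mean],
        "scale": [float(v) for v in scale],
    })


def load_scaler_params(text):
    """Read back the (mean, scale) lists written by dump_scaler_params."""
    params = json.loads(text)
    return params["mean"], params["scale"]


# with a fitted StandardScaler one would pass scaler.mean_ and scaler.scale_;
# the numbers below are made up for illustration
text = dump_scaler_params([15137.5], [6939.2])
mean, scale = load_scaler_params(text)
```

With this approach the serving side has to rebuild the scaling arithmetic itself, so it trades convenience for a smaller and more transparent artifact.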

## In a production environment

In [8]:
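from bigdl.chronos.data.repo_dataset import get_public_dataset

# generate data to predict in a local csv file
# the last 48 rows match the lookback window used in training
_, _, test_data = get_public_dataset("nyc_taxi")
test_data.df[-48:].to_csv("inference_data.csv")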

In [9]:
import pickle
import pandas as pd

# load the scaler that was fitted during development
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
df = pd.read_csv("inference_data.csv", parse_dates=["timestamp"])

In [13]:
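from bigdl.chronos.data import TSDataset

def preprocess_during_deployment(df, scaler):
    # replay the training-time preprocessing with the already fitted scaler
    tsdata = TSDataset.from_pandas(df,
                                   dt_col="timestamp",
                                   target_col="value",
                                   repair=False)
    # fit=False reuses the training statistics; is_predict=True rolls inputs only
    tsdata.gen_dt_feature(features=["WEEKDAY", "HOUR", "MINUTES"])\
          .scale(scaler, fit=False)\
          .roll(lookback=48, horizon=24, is_predict=True)
    data = tsdata.to_numpy()
    return tsdata, data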
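
The inference file holds 48 rows because a lookback of 48 needs exactly that many rows to form one input window. The sketch below illustrates the sliding-window idea in plain Python; `roll_for_predict` is a hypothetical helper, not BigDL's implementation, and it ignores the feature dimension.

```python
def roll_for_predict(values, lookback):
    """Slice a series into overlapping input windows of length `lookback`.

    No target windows are produced, which mirrors prediction-time rolling
    where the future values are not known yet.
    """
    if len(values) < lookback:
        raise ValueError("need at least `lookback` rows")
    return [values[i:i + lookback] for i in range(len(values) - lookback + 1)]


series = list(range(50))  # stand-in for 50 scaled readings
windows = roll_for_predict(series, lookback=48)
single = roll_for_predict(series[-48:], lookback=48)  # exactly one window
```

Under this scheme, passing more than `lookback` rows simply yields more windows, one prediction per window.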

In [11]:
def postprocess_during_deployment(data, tsdata):
    # map the scaled prediction back to the original value range
    return tsdata.unscale_numpy(data)
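
Unscaling is the inverse of the standardization applied during preprocessing, and it must reuse the mean and standard deviation fitted on the training data. A minimal sketch of that round trip, using only the standard library, could look like this. The function names and numbers are illustrative, and this is not how `unscale_numpy` is implemented internally.

```python
from statistics import fmean, pstdev


def fit_standard(values):
    """Return (mean, std) using the population standard deviation (ddof=0)."""
    return fmean(values), pstdev(values)


def scale(values, mean, std):
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    return [v * std + mean for v in values]


train = [10.0, 20.0, 30.0, 40.0]
mean, std = fit_standard(train)          # fitted once, on training data only
scaled = scale([25.0, 35.0], mean, std)  # what the model would see
restored = unscale(scaled, mean, std)    # back in the original units
```

If the serving side used statistics fitted on different data, the restored values would be shifted or stretched, which is why the fitted scaler is exported alongside the model.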

In [22]:
import onnxruntime
session = onnxruntime.InferenceSession("nyc_tax_onnx_model/onnx_saved_model.onnx")

In [28]:
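# preprocess the raw dataframe into model input
tsdata, data = preprocess_during_deployment(df, scaler)
# run the ONNX model; 'x' is the input name and [0] selects the first output
data = session.run(None, {'x': data})[0]
# unscale the prediction back to the original value range
processed_data = postprocess_during_deployment(data, tsdata)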

In [18]:
from bigdl.chronos.metric.forecast_metrics import Evaluator
print(Evaluator.get_latency(preprocess_during_deployment, df, scaler))

{'p50': 3.75, 'p90': 4.097, 'p95': 4.228, 'p99': 5.582}
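
The figures above are latency percentiles. As a rough sketch of how such numbers can be collected without extra dependencies, the following times repeated calls and reports percentiles. `measure_latency` is a hypothetical helper; its nearest-rank percentile rule and its choice of milliseconds are assumptions, and it is not a reimplementation of `Evaluator.get_latency`.

```python
import math
import time


def measure_latency(func, *args, num_running=100):
    """Call func(*args) repeatedly and return latency percentiles in ms."""
    samples = []
    for _ in range(num_running):
        start = time.perf_counter()
        func(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()

    def percentile(p):
        # nearest-rank percentile on the sorted samples
        rank = max(1, math.ceil(p * len(samples) / 100))
        return round(samples[rank - 1], 3)

    return {f"p{p}": percentile(p) for p in (50, 90, 95, 99)}


# time a cheap stand-in function; in practice this would be the preprocessing
stats = measure_latency(sum, range(1000), num_running=50)
```

Reporting high percentiles such as p99 alongside the median helps expose occasional slow calls that an average would hide.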
