### Time Series GAN - Data Synthesis

**Author:** Noah Perry

**Overview:** This notebook contains code to generate a synthetic version of the Tetouan City power consumption dataset.

**Data description:** The Tetouan City power consumption dataset contains weather and power consumption data recorded every 10 minutes from January 1, 2017 to December 30, 2017. There are 52,416 observations and 9 variables.

I obtained this dataset from the UCI Machine Learning Repository.

Link to data source: https://archive.ics.uci.edu/dataset/849/power+consumption+of+tetouan+city

For this analysis, I aggregated the data by day as follows:

| Variable | Aggregation Function |
| :-: | :-: |
| Date | N/A |
| Temperature | Mean  |
| Humidity | Mean |
| Wind Speed | Mean  |
| General Diffuse Flows | Sum |
| Diffuse Flows | Sum |
| Zone 1 Power Consumption | Sum  |
| Zone 2 Power Consumption | Sum  |
| Zone 3 Power Consumption | Sum |

The aggregated dataset contains 364 observations and 9 variables.

**Method:** A Time Series Generative Adversarial Network (TimeGAN) is a neural network architecture designed for generating synthetic time-series data. I use YData's `TimeGAN` implementation from their `ydata-synthetic` Python package.

Link to paper on time series GANs: https://papers.nips.cc/paper/2019/file/c9efe5f26cd17ba6216bbe2a7d26d490-Paper.pdf

Link to `ydata-synthetic` GitHub repo: https://github.com/ydataai/ydata-synthetic

### Setup

In [1]:
#pip install ydata-synthetic

In [2]:
# Packages
import os
import pandas as pd
import pickle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from ydata_synthetic.synthesizers import ModelParameters
from ydata_synthetic.synthesizers.timeseries.timegan.model import TimeGAN

### Import Raw Data

In [3]:
# Read in data
tc_pwr_10m = pd.read_csv("Tetuan City power consumption.csv", 
                     header = 0, 
                     names = ["datetime", "temp", "humidity", "wind_speed", "gen_diff_flows", "diff_flows", "z1_pwr", "z2_pwr", "z3_pwr"], 
                     parse_dates = [0])

### Data Exploration

In [4]:
tc_pwr_10m.head()

Unnamed: 0,datetime,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
0,2017-01-01 00:00:00,6.559,73.8,0.083,0.051,0.119,34055.6962,16128.87538,20240.96386
1,2017-01-01 00:10:00,6.414,74.5,0.083,0.07,0.085,29814.68354,19375.07599,20131.08434
2,2017-01-01 00:20:00,6.313,74.5,0.08,0.062,0.1,29128.10127,19006.68693,19668.43373
3,2017-01-01 00:30:00,6.121,75.0,0.083,0.091,0.096,28228.86076,18361.09422,18899.27711
4,2017-01-01 00:40:00,5.921,75.7,0.081,0.048,0.085,27335.6962,17872.34043,18442.40964


In [5]:
tc_pwr_10m.tail()

Unnamed: 0,datetime,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
52411,2017-12-30 23:10:00,7.01,72.4,0.08,0.04,0.096,31160.45627,26857.3182,14780.31212
52412,2017-12-30 23:20:00,6.947,72.6,0.082,0.051,0.093,30430.41825,26124.57809,14428.81152
52413,2017-12-30 23:30:00,6.9,72.8,0.086,0.084,0.074,29590.87452,25277.69254,13806.48259
52414,2017-12-30 23:40:00,6.758,73.0,0.08,0.066,0.089,28958.1749,24692.23688,13512.60504
52415,2017-12-30 23:50:00,6.58,74.1,0.081,0.062,0.111,28349.80989,24055.23167,13345.4982


In [6]:
tc_pwr_10m.dtypes

datetime          datetime64[ns]
temp                     float64
humidity                 float64
wind_speed               float64
gen_diff_flows           float64
diff_flows               float64
z1_pwr                   float64
z2_pwr                   float64
z3_pwr                   float64
dtype: object

In [7]:
tc_pwr_10m.shape

(52416, 9)

### Data Quality Assessment

In [8]:
# Checking there are no gaps in time series
tc_pwr_10m["datetime"].diff().value_counts()

datetime
0 days 00:10:00    52415
Name: count, dtype: int64
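
To see why a single interval in the output above means the series has no gaps, here is a minimal sketch on a made-up series (the timestamps and the name `toy` are illustrative, not taken from the dataset). A missing reading shows up as a second, larger interval in the counts.

```python
import pandas as pd

# Hypothetical 10-minute series with the 00:30 reading deliberately removed
toy = pd.Series(pd.date_range("2017-01-01 00:00", periods=6, freq="10min"))
toy = toy.drop(index=3).reset_index(drop=True)

# Intervals between consecutive timestamps; the first diff is NaT and is not counted
intervals = toy.diff().value_counts()
print(intervals)

# Any interval longer than 10 minutes marks the first reading after a gap
expected = pd.Timedelta(minutes=10)
gaps = toy[toy.diff() > expected]
print(gaps)
```

In the notebook's data the same check returns only the 10-minute interval, 52,415 times, which is one fewer than the number of rows because the first row has no predecessor.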

In [9]:
# Checking for missing values
tc_pwr_10m.isna().sum()

datetime          0
temp              0
humidity          0
wind_speed        0
gen_diff_flows    0
diff_flows        0
z1_pwr            0
z2_pwr            0
z3_pwr            0
dtype: int64

In [10]:
# Checking for extreme values
tc_pwr_10m.describe()

Unnamed: 0,datetime,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
count,52416,52416.0,52416.0,52416.0,52416.0,52416.0,52416.0,52416.0,52416.0
mean,2017-07-01 23:55:00,18.810024,68.259518,1.959489,182.696614,75.028022,32344.970564,21042.509082,17835.406218
min,2017-01-01 00:00:00,3.247,11.34,0.05,0.004,0.011,13895.6962,8560.081466,5935.17407
25%,2017-04-01 23:57:30,14.41,58.31,0.078,0.062,0.122,26310.668692,16980.766032,13129.32663
50%,2017-07-01 23:55:00,18.78,69.86,0.086,5.0355,4.456,32265.92034,20823.168405,16415.11747
75%,2017-09-30 23:52:30,22.89,81.4,4.915,319.6,101.0,37309.018185,24713.71752,21624.10042
max,2017-12-30 23:50:00,40.01,94.8,6.483,1163.0,936.0,52204.39512,37408.86076,47598.32636
std,,5.815476,15.551177,2.348862,264.40096,124.210949,7130.562564,5201.465892,6622.165099


### Aggregate Data

In [11]:
# Aggregate to daily
tc_pwr_10m["date"] = pd.to_datetime(tc_pwr_10m["datetime"]).dt.date

tc_pwr_day = tc_pwr_10m.groupby(['date']).agg(
    temp = ('temp', 'mean'),
    humidity = ('humidity', 'mean'),
    wind_speed = ('wind_speed', 'mean'),
    gen_diff_flows = ('gen_diff_flows', 'sum'),
    diff_flows = ('diff_flows', 'sum'),
    z1_pwr = ('z1_pwr', 'sum'),
    z2_pwr = ('z2_pwr', 'sum'),
    z3_pwr = ('z3_pwr', 'sum')
)

tc_pwr_day = tc_pwr_day.reset_index(drop = False)

In [12]:
tc_pwr_day.head()

Unnamed: 0,date,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
0,2017-01-01,9.675299,68.519306,0.315146,17480.271,3743.125,4098993.0,2554242.0,2573107.0
1,2017-01-02,12.476875,71.456319,0.076563,17338.246,3920.747,4157207.0,2816312.0,2566190.0
2,2017-01-03,12.1,74.981667,0.076715,17378.786,4114.751,4400992.0,2888247.0,2537396.0
3,2017-01-04,10.509479,75.459792,0.082417,17706.142,4151.12,4419336.0,2894699.0,2545012.0
4,2017-01-05,10.866444,71.040486,0.083896,17099.98,4282.767,4435619.0,2884888.0,2543641.0


In [13]:
tc_pwr_day.tail()

Unnamed: 0,date,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
359,2017-12-26,11.62184,69.070903,0.083062,15384.483,7309.496,4321941.0,3565009.0,1640978.0
360,2017-12-27,15.232917,59.445903,0.082028,13808.257,6005.529,4315243.0,3608277.0,1655752.0
361,2017-12-28,13.662361,62.839375,0.081354,16217.303,4350.148,4358449.0,3540276.0,1608052.0
362,2017-12-29,12.990486,49.07875,0.078181,17599.683,3451.827,4206187.0,3543958.0,1608663.0
363,2017-12-30,11.688993,51.361667,0.078174,17829.234,3461.752,4052976.0,3486425.0,1664759.0


In [14]:
tc_pwr_day.dtypes

date               object
temp              float64
humidity          float64
wind_speed        float64
gen_diff_flows    float64
diff_flows        float64
z1_pwr            float64
z2_pwr            float64
z3_pwr            float64
dtype: object

In [15]:
tc_pwr_day.shape

(364, 9)
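
The 364 rows follow from the sampling rate: a day holds 24 × 6 = 144 ten-minute readings, and 52,416 / 144 = 364. Below is a minimal sketch of that check, using a generated timestamp grid in place of the CSV (the name `stamps` is illustrative).

```python
import pandas as pd

# Rebuild the timestamp grid described above: every 10 minutes,
# from 2017-01-01 00:00 through 2017-12-30 23:50
stamps = pd.Series(pd.date_range("2017-01-01 00:00", "2017-12-30 23:50", freq="10min"))

readings_per_day = 24 * 6                      # 144 ten-minute slots per day
per_day = stamps.groupby(stamps.dt.date).size()

print(len(stamps))                             # total 10-minute observations
print(len(per_day))                            # number of calendar days
print((per_day == readings_per_day).all())     # whether every day is complete
```

On the real data, grouping `tc_pwr_10m` by its `date` column and calling `.size()` would give the same per-day counts.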

### Prepare Aggregated Data for Synthesizer

In [16]:
# Drop the date column and convert the remaining 8 variables to a NumPy array for TimeGAN
tc_pwr_day_nodate = np.array(tc_pwr_day.iloc[:,1:9])

# Scale data to [0,1] interval
scaler = MinMaxScaler(feature_range = (0,1))
scaled_day = scaler.fit_transform(tc_pwr_day_nodate)

# Add a leading axis so the data is one sequence of shape (1, 364, 8)
scaled_day2 = scaled_day.reshape(1, *scaled_day.shape)
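
As a minimal sketch of what this cell does, here are the same steps on a made-up 4 × 2 array rather than the real data. `MinMaxScaler` maps each column to [0, 1] via (x − min) / (max − min), and the last step only adds a leading axis so that the data is a single 3-D array of shape (sequences, time steps, variables). The names `toy` and `toy_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 4 time steps, 2 variables
toy = np.array([[10.0, 1.0],
                [20.0, 3.0],
                [30.0, 5.0],
                [40.0, 9.0]])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
scaled = toy_scaler.fit_transform(toy)

# Same result as the column-wise formula (x - min) / (max - min)
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
print(np.allclose(scaled, manual))

# Add a leading axis: (4, 2) -> (1, 4, 2). When the total size is unchanged,
# np.resize and reshape produce the same array.
batched = scaled.reshape(1, *scaled.shape)
print(batched.shape)
print(np.array_equal(batched, np.resize(scaled, (1, 4, 2))))

# inverse_transform on the 2-D slice recovers the original values
print(np.allclose(toy_scaler.inverse_transform(batched[0]), toy))
```

Keeping the fitted scaler around matters because the same object is needed later to map generated values back to the original units.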

In [18]:
# GAN parameters
batch_size = 1
learning_rate = 5e-4
noise_dim = 32
dim = 24

gan_args = ModelParameters(batch_size=batch_size,
                           lr=learning_rate,
                           noise_dim=noise_dim,
                           layers_dim=dim)


# TimeGAN specific parameters
seq_len = scaled_day.shape[0] # obs in data
n_seq = scaled_day.shape[1]   # variables in data

# Troubleshooting settings: expose only GPU 0 to TensorFlow and use channels-last ordering
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
tf.keras.backend.set_image_data_format("channels_last")

In [19]:
# Reuse a saved synthesizer if one exists; otherwise train a new one and save it
if os.path.exists('synthesizer_tc_pwr_day.pkl'):
    synth = TimeGAN.load('synthesizer_tc_pwr_day.pkl')
else:
    synth = TimeGAN(model_parameters=gan_args, hidden_dim=dim, seq_len=seq_len, n_seq=n_seq, gamma=1)
    synth.train(scaled_day2, train_steps=500)
    synth.save('synthesizer_tc_pwr_day.pkl')

2023-03-21 22:01:47.816722: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-03-21 22:01:47.816808: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-172-16-81-27.us-east-2.compute.internal): /proc/driver/nvidia/version does not exist
2023-03-21 22:01:47.834833: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Emddeding network training: 100%|██████████| 500/500 [03:12<00:00,  2.60it/s]
Supervised network training: 100%|██████████| 500/500 [01:24<00:00,  5.92it/s]
Joint networks training: 100%|██████████| 500/500 [6:35:03<00:00, 47.41s

In [33]:
# Generate synthetic data (the shape check below confirms that one sequence is returned)
scaled_day2_synth = synth.sample(0)

Synthetic data generation: 100%|██████████| 1/1 [00:03<00:00,  3.52s/it]


In [34]:
scaled_day2_synth.shape

(1, 364, 8)

In [40]:
scaled_day2_synth_0 = scaled_day2_synth[0]
scaled_day2_synth_0.shape

(364, 8)

In [46]:
# Transform the synthetic data back to the original scale
unscaled_day2_synth = scaler.inverse_transform(scaled_day2_synth_0)
unscaled_day2_synth.shape

(364, 8)

In [47]:
tc_pwr_day_synth = pd.DataFrame(unscaled_day2_synth,
                               columns = ['temp','humidity','wind_speed','gen_diff_flows',
                                          'diff_flows','z1_pwr','z2_pwr','z3_pwr'])
tc_pwr_day_synth.head()

Unnamed: 0,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
0,12.130336,74.320763,0.327842,16130.302734,4384.32666,4254893.5,2771892.5,2689854.5
1,12.377903,73.317581,0.120847,14517.18457,4053.237061,4287524.0,2799297.5,2475369.25
2,12.806835,74.183884,0.108594,12998.375977,4282.602539,4347913.0,2807689.25,2435136.25
3,13.238839,75.719093,0.107075,11782.285156,4617.543457,4389643.5,2821528.75,2440698.5
4,13.681409,76.648277,0.104652,11139.827148,4989.103516,4406766.5,2831252.5,2462341.5


In [48]:
tc_pwr_day_synth.describe()
# Summary statistics are broadly similar to tc_pwr_day.describe().
# For some variables, the extreme values are more extreme in the original data;
# for others, they are more extreme in the synthetic data.

Unnamed: 0,temp,humidity,wind_speed,gen_diff_flows,diff_flows,z1_pwr,z2_pwr,z3_pwr
count,364.0,364.0,364.0,364.0,364.0,364.0,364.0,364.0
mean,18.565718,66.542961,2.078659,28293.939453,10551.811523,4620509.0,3046825.0,2538442.25
std,5.362943,5.271105,2.191741,14503.333984,4799.035156,377405.375,282110.5,556258.75
min,11.649935,57.566711,0.071221,7241.521973,4053.237061,4096796.75,2408406.0,1771869.625
25%,13.738938,62.428345,0.093693,10775.555176,6329.303223,4348889.125,2798741.0,2164600.25
50%,17.875117,65.847202,0.133754,33008.826172,9668.206055,4527158.0,3004557.0,2523455.375
75%,24.316883,70.282919,4.880797,42574.833984,14027.094482,4947001.125,3287254.0,2712859.625
max,26.889494,87.680855,4.919249,47033.1875,23992.001953,5367952.0,3557332.0,3747234.75
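
One way to make the comparison in the comments above concrete is to place the two summaries side by side. This is a minimal sketch on small made-up frames (`real` and `synth_df` are hypothetical stand-ins for `tc_pwr_day` and `tc_pwr_day_synth`), not output from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the real and synthetic daily frames
real = pd.DataFrame({"temp": [9.0, 12.0, 18.0, 27.0],
                     "humidity": [50.0, 60.0, 70.0, 80.0]})
synth_df = pd.DataFrame({"temp": [11.0, 13.0, 17.0, 25.0],
                         "humidity": [45.0, 62.0, 68.0, 85.0]})

# Put selected describe() rows next to each other, grouped by variable
stats = ["min", "mean", "max"]
comparison = pd.concat(
    {"real": real.describe().loc[stats],
     "synthetic": synth_df.describe().loc[stats]},
    axis=1,
).swaplevel(axis=1).sort_index(axis=1)
print(comparison)

# True where the synthetic range extends beyond the real range
wider = (synth_df.min() < real.min()) | (synth_df.max() > real.max())
print(wider)
```

With the notebook's frames, the date column would need to be dropped from `tc_pwr_day` first so that both frames share the same eight columns.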


In [53]:
tc_pwr_day_synth.to_csv('TC Day Synth.csv', index = False)