In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("gen_time_data.csv")
print(df.shape)

(32256, 8)


This notebook uses a manually generated dataset to simulate sets of observations for individuals with some time dependency. Each user has 12 observations (e.g. one per month). The goal here is to demonstrate how to get data like this into the right format and set up an RNN to try to learn from it.

y is positive for a user when three conditions all hold: X5 is 1 in month 11, X3 is less than 0.6 in month 12, and the sum of X5 over months 6 to 12 is greater than 3. X1 and X2 are just noise.

y is positive for about 11% of user IDs.
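
To make the rule concrete, here is a rough sketch of how it could be checked for one user. The function name `label_user` is made up for illustration, and it assumes index 0 of each array is month 1 and that "months 6 to 12" includes both ends. This is not the code that generated the dataset.

```python
import numpy as np

def label_user(x3, x5):
    """Return 1 if a 12-month history satisfies the rule above, else 0.

    x3, x5: length-12 sequences, index 0 = month 1 (an assumption of this sketch).
    """
    x3 = np.asarray(x3)
    x5 = np.asarray(x5)
    x5_is_1_in_month_11 = x5[10] == 1
    x3_low_in_month_12 = x3[11] < 0.6
    x5_sum_high = x5[5:12].sum() > 3  # months 6..12 inclusive
    return int(x5_is_1_in_month_11 and x3_low_in_month_12 and x5_sum_high)

# made-up history: X5 is 1 in months 6, 7, 9 and 11; X3 ends at 0.5
x5 = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
x3 = [0.9] * 11 + [0.5]
print(label_user(x3, x5))
```

Flipping any one of the three conditions (for example raising X3 in month 12 to 0.7) should turn the label to 0.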

In [3]:
df.head(24)

Unnamed: 0,id,time,x1,x2,x3,x4,x5,y
0,1,1,0.34522,0.82725,0.737033,1,1,1
1,1,2,0.802148,0.698895,0.528889,1,0,1
2,1,3,0.170504,0.085219,0.404662,1,0,1
3,1,4,0.427657,0.157935,0.49576,0,0,1
4,1,5,0.745276,0.231196,0.577447,1,1,1
5,1,6,0.175969,0.910015,0.444284,0,1,1
6,1,7,0.114116,0.05324,0.711894,0,0,1
7,1,8,0.999931,0.034291,0.83037,1,1,1
8,1,9,0.237785,0.816266,0.841942,1,1,1
9,1,10,0.171571,0.059315,0.829446,0,1,1


In [4]:
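# collapse to one label per user: 1 if any row for that id is positive
y = df[['id', 'y']].groupby('id').max()
df.pop('y')
# the mean of a 0/1 label is the positive rate
print(round(y['y'].mean(), 3), "rate of positive Y")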

0.11 rate of positive Y
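
A small sketch of the same aggregation on a toy frame (all values made up), plus the accuracy you would get by always predicting the majority class. With a positive rate near 0.11, that baseline would be roughly 0.89, which is worth keeping in mind when reading the accuracy metric later on.

```python
import pandas as pd

# toy long-format frame: 3 users x 2 time steps (values are made up)
toy = pd.DataFrame({
    "id":   [1, 1, 2, 2, 3, 3],
    "time": [1, 2, 1, 2, 1, 2],
    "y":    [1, 1, 0, 0, 0, 0],
})

labels = toy.groupby("id")["y"].max()  # one label per user
positive_rate = labels.mean()          # share of users with y == 1
# accuracy of a model that always predicts the majority class
majority_accuracy = max(positive_rate, 1 - positive_rate)

print(labels.tolist(), round(positive_rate, 3), round(majority_accuracy, 3))
```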


In [5]:
# create an empty 3D tensor of the size we need
n_samples = df.id.nunique()
n_timesteps = df.time.nunique()
n_features = df.shape[1] - 2  # every column except id and time

X = np.zeros((n_samples, n_timesteps, n_features)) 
print(X.shape)

(2688, 12, 5)


In [6]:
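ids = df.id.unique()

# fill one (n_timesteps, n_features) slice per user;
# assumes each id's rows are already ordered by time and that ids
# appear in ascending order, so X lines up with y (groupby sorts by id)
for i, value in enumerate(ids):
    X[i] = df.loc[df['id'] == value].drop(['id', 'time'], axis=1).values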
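
When every id has a row for every time step, the same 3D tensor can probably be built without a Python loop by sorting and reshaping. The sketch below uses a tiny made-up frame (`toy`, `X_toy` and the column values are illustrative only) and only holds under that completeness assumption.

```python
import numpy as np
import pandas as pd

# toy long-format frame: 2 users x 3 time steps x 2 features, rows deliberately shuffled
toy = pd.DataFrame({
    "id":   [2, 1, 1, 2, 1, 2],
    "time": [1, 1, 2, 2, 3, 3],
    "x1":   [0.4, 0.1, 0.2, 0.5, 0.3, 0.6],
    "x2":   [40, 10, 20, 50, 30, 60],
})

n_ids = toy["id"].nunique()
n_steps = toy["time"].nunique()
feature_cols = ["x1", "x2"]

# sorting by (id, time) makes the row order explicit instead of assumed
ordered = toy.sort_values(["id", "time"])
X_toy = ordered[feature_cols].to_numpy(dtype=float).reshape(n_ids, n_steps, len(feature_cols))

print(X_toy.shape)
```

The first axis then follows ascending id order, which matches the order of a `groupby('id')` result.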

In [7]:
from keras.models import Sequential
from keras.layers import Dense, GRU, Dropout

Using TensorFlow backend.


In [8]:
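model = Sequential()
# first GRU returns the full sequence so the second GRU can consume it
model.add(GRU(128, return_sequences=True, input_shape=(n_timesteps, n_features)))
# second GRU returns only its final state; input_shape is only needed on the first layer
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))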

In [9]:
model.compile(optimizer='adam',                             
              loss='binary_crossentropy', 
              metrics=['accuracy'])

In [10]:
model.fit(X,                                          
          y, 
          shuffle=True,
          epochs=20, 
          batch_size=32,
          validation_split=0.2)

Train on 2150 samples, validate on 538 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1206d65f8>

Now experimenting with an incomplete dataset.
Here I simulate half of the dataset's IDs containing inputs for only the most recent 3 months. These sequences are padded so that the missing values are filled with zeros.
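
As a rough illustration of what the padding does for a single user, the sketch below places 3 observed months at the end of a 12-step array and leaves zeros in front. The `observed` values are made up, and this plain NumPy version is only meant to show the resulting layout.

```python
import numpy as np

n_timesteps, n_features = 12, 5

# a user with only the 3 most recent months observed (values are made up)
observed = np.arange(15, dtype=float).reshape(3, n_features) + 1

padded = np.zeros((n_timesteps, n_features))
# real data sits at the end; zeros fill months 1 to 9
padded[-len(observed):] = observed

print(padded.shape)
print(int((padded.sum(axis=1) == 0).sum()), "all-zero rows")
```

Note that the model further down has no masking step, so it treats these zero rows as ordinary inputs.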

In [11]:
df_ = pd.read_csv("gen_time_data.csv")

In [12]:
y_ = df_[['id','y']].groupby('id').max()
df_.pop('y')

In [13]:
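# select by id rather than by row position, so that half of the IDs are truncated
half_ids = df_.id.unique()[:n_samples // 2]
is_truncated = df_['id'].isin(half_ids)

# first half of the IDs: keep only the most recent 3 months
df_1 = df_[is_truncated & df_['time'].isin([10, 11, 12])]
# second half of the IDs: keep all 12 months
df_2 = df_[~is_truncated]
df_ = pd.concat([df_1, df_2])

In [14]:
df_.shape

(20160, 7)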

In [15]:
X_ = np.zeros((n_samples, n_timesteps, n_features)) 
print(X_.shape)

(2688, 12, 5)


In [16]:
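from keras.preprocessing.sequence import pad_sequences

ids = df_.id.unique()

for i, value in enumerate(ids):
    # transpose to (n_features, observed_timesteps) so each feature is one sequence
    seq = df_.loc[df_['id'] == value].drop(['id', 'time'], axis=1).values.T
    # pad_sequences pads at the start by default, so zeros fill the missing early months
    padded_seq = pad_sequences(seq, maxlen=n_timesteps, dtype='float32').T
    X_[i] = padded_seq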

In [17]:
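model = Sequential()
# first GRU returns the full sequence so the second GRU can consume it
model.add(GRU(128, return_sequences=True, input_shape=(n_timesteps, n_features)))
# second GRU returns only its final state; input_shape is only needed on the first layer
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))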

In [18]:
model.compile(optimizer='adam',                             
              loss='binary_crossentropy', 
              metrics=['accuracy'])

In [19]:
model.fit(X_,                                          
          y_, 
          shuffle=True,
          epochs=20, 
          batch_size=32,
          validation_split=0.2)

Train on 2150 samples, validate on 538 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1232fb6a0>