<a href="https://colab.research.google.com/github/manjitullal/foursquare/blob/master/FourSquare_Temporal_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Temporal and Spatial analysis of events data using LSTM**



**Dataset:** Foursquare

**Aim:** to predict a user's future location and time, given the historical sequences of locations and times.

An analogy for the aim is predicting the next word in a sentence.


**Contents:**
***
1. Data preprocessing
2. Encoding
3. Modeling
4. Training

# **1. Data preprocessing**


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf


from google.colab import drive
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils import data
from torch.nn.utils.rnn import pad_sequence
from sklearn.preprocessing import OneHotEncoder

torch.manual_seed(1)

<torch._C.Generator at 0x7f18dddf7270>

1. **Check for GPU**
---
This code presently does not use GPU. This will later be modified to use one.


In [2]:
#check the devices available

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12330238408744325041
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 896786060585843051
physical_device_desc: "device: XLA_CPU device"
]


In [3]:
#check if the gpu is available
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available:  0


In [4]:
#the GPU may not be available at the moment

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print('GPU device not found')


GPU device not found
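
Since the model further down is written in PyTorch, the device could also be chosen on the PyTorch side. The following is a minimal sketch, assuming the standard `torch.cuda.is_available()` check; the `device` and `x` names are illustrative and are not used elsewhere in this notebook.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models) are moved explicitly with .to(device).
x = torch.zeros(2, 3).to(device)
print(device.type, x.device.type)
```

On a CPU-only runtime such as the one shown above, both values would be `cpu`.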


In [5]:
#dataset is in the google drive

drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
#my google drive path

!ls "/content/drive/My Drive/dataset/foursquare"
path = "/content/drive/My Drive/dataset/foursquare"

Checkin.txt	 UserFriends.txt    VenueRating.txt
Description.txt  VenueCategory.txt  Venue.txt


2. **Load dataset.** 
---
The dataset consists of 5 tables. The `Checkin` table holds the event details.

In [7]:
%%time
Checkin_columns = ['UserID','VenueID','Year','Month','Date','Hour']
Checkin = pd.read_csv(path+'/Checkin.txt', sep=',', skiprows=1, names=Checkin_columns)

CPU times: user 675 ms, sys: 153 ms, total: 828 ms
Wall time: 1.29 s


In [8]:
%%time

# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
Venue_columns = ['VenueID','VenueName','Latitude','Longitude','CategoryID']
Venue = pd.read_csv(path+'/Venue.txt', sep=',', on_bad_lines='skip',skiprows=1,names=Venue_columns)

VenueCategory_columns = ['CategoryID','CategoryName','ParentCategoryID']
VenueCategory = pd.read_csv(path+'/VenueCategory.txt', sep=',',on_bad_lines='skip',skiprows=1,names=VenueCategory_columns)

VenueRating_columns = ['VenueID','Rating']
VenueRating = pd.read_csv(path+'/VenueRating.txt', sep=',',on_bad_lines='skip',skiprows=1,names=VenueRating_columns)

UserFriends = pd.read_csv(path+'/UserFriends.txt', sep=',') 

all_tables = [Checkin,Venue,VenueCategory,VenueRating,UserFriends]
all_tables_string = ['Checkin','Venue','VenueCategory','VenueRating','UserFriends']

CPU times: user 923 ms, sys: 75.8 ms, total: 999 ms
Wall time: 1.86 s


In [9]:
Checkin.head()

Unnamed: 0,UserID,VenueID,Year,Month,Date,Hour
0,u1302,v47,2012,2,24,11
1,u45,v132,2012,2,24,11
2,u24844,v86,2012,2,24,11
3,u896,v248,2012,2,24,11
4,u5020,v29,2012,2,24,11


In [10]:
#stats of the data 

def _describe(data):
    print(f" Number of rows: {data.shape[0]}")
    print(f" Number of columns: {data.shape[1]}")
    print(f" Number of null values: {np.sum(data.isnull().sum())}")
    print("The columns that have null values")
    print(pd.DataFrame(data.isnull().sum()).T)
    
for index,table in enumerate(all_tables):
    print(f"Details of table {all_tables_string[index]}")
    print("")
    _describe(table)
    print("")

Details of table Checkin

 Number of rows: 1276988
 Number of columns: 6
 Number of null values: 0
The columns that have null values
   UserID  VenueID  Year  Month  Date  Hour
0       0        0     0      0     0     0

Details of table Venue

 Number of rows: 85928
 Number of columns: 5
 Number of null values: 14
The columns that have null values
   VenueID  VenueName  Latitude  Longitude  CategoryID
0        0         12         2          0           0

Details of table VenueCategory

 Number of rows: 394
 Number of columns: 3
 Number of null values: 0
The columns that have null values
   CategoryID  CategoryName  ParentCategoryID
0           0             0                 0

Details of table VenueRating

 Number of rows: 68178
 Number of columns: 2
 Number of null values: 96
The columns that have null values
   VenueID  Rating
0       96       0

Details of table UserFriends

 Number of rows: 1366388
 Number of columns: 2
 Number of null values: 0
The columns that have null values

3. **Model for one user** (for testing): we filter the data down to a single user for now; eventually this will be extended to all users.

In [11]:
#filter data for one user 

Checkin_u1205 = Checkin[Checkin.UserID == 'u1205'].copy()  # copy, so the in-place edits below do not act on a slice of Checkin
Checkin_u1205.head()

Unnamed: 0,UserID,VenueID,Year,Month,Date,Hour
2723,u1205,v73805,2012,2,25,9
3817,u1205,v9884,2012,2,25,11
4739,u1205,v3906,2012,2,25,13
5904,u1205,v10373,2012,2,25,15
6840,u1205,v9884,2012,2,25,17


In [12]:
# drop userid as that is not useful now, since there is only one user 

Checkin_u1205.drop(['UserID'], axis=1, inplace=True)

#renaming column Date to Day
Checkin_u1205.rename(columns={"Date":"Day"}, inplace=True)
Checkin_u1205.head()

Unnamed: 0,VenueID,Year,Month,Day,Hour
2723,v73805,2012,2,25,9
3817,v9884,2012,2,25,11
4739,v3906,2012,2,25,13
5904,v10373,2012,2,25,15
6840,v9884,2012,2,25,17


In [13]:
%%time
# create a new column, Datetime, to sort the events (the %%time cell magic has to be the first line of the cell)

Checkin_u1205['Datetime'] = pd.to_datetime(Checkin_u1205[['Year', 'Month', 'Day', 'Hour']])
Checkin_u1205.head()

CPU times: user 14.4 ms, sys: 2.06 ms, total: 16.5 ms
Wall time: 32.3 ms


In [14]:
# sort based on datetime
Checkin_u1205.sort_values(by='Datetime',inplace=True)

Checkin_u1205.head()

Unnamed: 0,VenueID,Year,Month,Day,Hour,Datetime
2723,v73805,2012,2,25,9,2012-02-25 09:00:00
9154,v40561,2012,2,25,9,2012-02-25 09:00:00
3817,v9884,2012,2,25,11,2012-02-25 11:00:00
10664,v1743,2012,2,25,11,2012-02-25 11:00:00
4739,v3906,2012,2,25,13,2012-02-25 13:00:00


In [15]:
# from the above, we can see that some timestamps appear more than once with different venues
# a person cannot be at two locations at the same time, so rows with duplicate timestamps are removed (the first row per timestamp is kept)
# the venue categories are hierarchical, but there appears to be no link between the venues themselves

Checkin_u1205_nodup = Checkin_u1205.drop_duplicates('Datetime')

In [16]:
print("Rows in Checkin_u1205: ", Checkin_u1205.shape[0])
print("Rows in Checkin_u1205_nodup: ", Checkin_u1205_nodup.shape[0])

Rows in Checkin_u1205:  1303
Rows in Checkin_u1205_nodup:  1227
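
`drop_duplicates('Datetime')` keeps the first row for each timestamp by default (`keep='first'`), so which venue survives depends on the row order at the time of the call. A small sketch on made-up rows (the venue IDs below are illustrative and not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "VenueID": ["vA", "vB", "vC"],
    "Datetime": pd.to_datetime(
        ["2012-02-25 09:00", "2012-02-25 09:00", "2012-02-25 11:00"]
    ),
})

# keep='first' is the default: the second 09:00 row (vB) is dropped.
deduped = toy.drop_duplicates("Datetime")
print(deduped["VenueID"].tolist())
```

Passing `keep='last'` or `keep=False` would instead keep the last row or drop every duplicated timestamp.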


In [18]:
Checkin_u1205_nodup.iloc[:20]

Unnamed: 0,VenueID,Year,Month,Day,Hour,Datetime
2723,v73805,2012,2,25,9,2012-02-25 09:00:00
3817,v9884,2012,2,25,11,2012-02-25 11:00:00
4739,v3906,2012,2,25,13,2012-02-25 13:00:00
5904,v10373,2012,2,25,15,2012-02-25 15:00:00
6840,v9884,2012,2,25,17,2012-02-25 17:00:00
18507,v9885,2012,2,26,9,2012-02-26 09:00:00
12801,v10373,2012,2,26,11,2012-02-26 11:00:00
11674,v9885,2012,2,26,13,2012-02-26 13:00:00
15263,v2927,2012,2,26,15,2012-02-26 15:00:00
20745,v6013,2012,2,27,9,2012-02-27 09:00:00


For the time being we do not use the hierarchical information about the venues, to keep the baseline model simple.

Now, we need to create a time-series of events. 

`Example:`

The user goes to the gym, the grocery store and home, in that order; or goes shopping, to the movies, to a restaurant and home, in that order.

Here, we need to create the longest time-series.


4. **Create longest sequences.** 
---
`Idea`: build the longest possible sequences from the user's events. Consecutive events less than 8 hours apart are added to the same sequence. A gap of 8 hours or more is taken to mean that the user's day has ended, so the next event starts a new sequence.


In [19]:

%%time

import datetime

def _generate_events(data):
  # data must be sorted by 'Datetime'; returns a list of sequences of [venue, hour]
  previous_time = None  # no previous event yet (replaces the far-future sentinel date)
  all_events = []
  current_events = []
  for index, row in data.iterrows():
    current_time = row['Datetime']
    event = [row['VenueID'], row['Hour']]
    # the first event, or a gap of less than 8 hours, continues the current sequence
    if previous_time is None or (current_time - previous_time).total_seconds()/60/60 < 8:
      current_events.append(event)
    else:
      # a gap of 8 hours or more closes the current sequence and starts a new one
      all_events.append(current_events)
      current_events = [event]
    previous_time = current_time
  if len(current_events)>0:
    all_events.append(current_events)
  return all_events



CPU times: user 10 µs, sys: 0 ns, total: 10 µs
Wall time: 14.8 µs
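
The same 8-hour rule can also be expressed without an explicit loop. The sketch below is an alternative, not the function used in this notebook: it assumes a frame with `Datetime`, `VenueID` and `Hour` columns, and `split_into_sessions`, `max_gap_hours` and the toy rows are hypothetical.

```python
import pandas as pd

def split_into_sessions(df, max_gap_hours=8):
    """Group rows into sessions; a gap >= max_gap_hours starts a new session."""
    df = df.sort_values("Datetime")
    gap = df["Datetime"].diff()  # NaT for the first row
    new_session = gap >= pd.Timedelta(hours=max_gap_hours)  # NaT compares as False
    session_id = new_session.cumsum().rename("session")
    return [
        list(zip(group["VenueID"], group["Hour"]))
        for _, group in df.groupby(session_id)
    ]

toy = pd.DataFrame({
    "VenueID": ["v1", "v2", "v3", "v4"],
    "Hour": [9, 11, 9, 11],
    "Datetime": pd.to_datetime([
        "2012-02-25 09:00", "2012-02-25 11:00",
        "2012-02-26 09:00", "2012-02-26 11:00",
    ]),
})
sessions = split_into_sessions(toy)
print(sessions)
```

The cumulative sum over the "gap is too large" flags gives every row a session number, which `groupby` then turns into one list per session.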


5. **Subsequences:** A sequence cannot be fed directly into a model, so one possibility is to break it into overlapping subsequences of length two.

`note: this part is not in use`

---

Example:

for a sequence (v1,t1) (v2,t2) (v3,t3) (v4,t4)

(v1,t1) (v2,t2)

(v2,t2) (v3,t3)

(v3,t3) (v4,t4)


In [20]:
%%time
# create all consecutive pairs from each sequence (maintaining the order)
# we need a length of at least 2 for the feature and the label, so sequences with fewer than 2 events are ignored

# the LSTM should learn that v2 comes after v1 and v3 after v2, so that when it sees v3 it predicts v4

def _generate_subsequence(data):
  all_sequences = []
  for sequence in data:  
    if len(sequence) >= 2:
      sequences = []
      for i in range(0,len(sequence)-1):
          sequences.append(sequence[i:i+2])          
      all_sequences.append(sequences)
  return all_sequences


#old way of creating sub sequences 
# (v1,t1) (v2,t2)
# (v1,t1) (v2,t2) (v3,t3)
# (v1,t1) (v2,t2) (v3,t3) (v4,t4)

'''
def _generate_subsequence(data):
  all_sequences = []
  for sequence in data:  
    if len(sequence) > 2:
      for i in range(0,1):
        sequences = []
        for j in range(i+2,len(sequence)+1):      
          sequences.append(sequence[i:j])          
        all_sequences.append(sequences)
  return all_sequences
'''


CPU times: user 10 µs, sys: 0 ns, total: 10 µs
Wall time: 14.8 µs
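
For reference, the pairwise split described above amounts to zipping a sequence with itself shifted by one position. A minimal sketch with placeholder events (`consecutive_pairs` is a hypothetical helper, not part of the notebook):

```python
def consecutive_pairs(sequence):
    # Pair each event with its successor; sequences shorter than 2 yield no pairs.
    return [list(pair) for pair in zip(sequence, sequence[1:])]

seq = [("v1", "t1"), ("v2", "t2"), ("v3", "t3"), ("v4", "t4")]
pairs = consecutive_pairs(seq)
for pair in pairs:
    print(pair)
```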


# **2. Encoding**


In [21]:
%time

events = _generate_events(Checkin_u1205_nodup[:30])
print("Events")
print(events)

#sequences = _generate_subsequence(events)
#print("Subsequences")
#print(sequences)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.63 µs
Events
[[['v73805', 9], ['v9884', 11], ['v3906', 13], ['v10373', 15], ['v9884', 17]], [['v9885', 9], ['v10373', 11], ['v9885', 13], ['v2927', 15]], [['v6013', 9], ['v67648', 11], ['v37724', 13], ['v13235', 14], ['v85343', 15], ['v9884', 17], ['v55736', 19], ['v18477', 20], ['v9885', 21]], [['v6013', 8], ['v342', 10], ['v69205', 11], ['v128', 12], ['v52425', 13], ['v17926', 14], ['v67615', 15], ['v19062', 16], ['v9884', 18], ['v18477', 20], ['v9885', 21]], [['v6013', 8]]]


In [22]:
%time

def _venue_only(data):
  all_sequences = []
  for sequence in data:
    events = []
    for event in sequence:      
      events.append(event[0])
    all_sequences.append(events)
  return all_sequences

v_events = _venue_only(events)
print(v_events)


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
[['v73805', 'v9884', 'v3906', 'v10373', 'v9884'], ['v9885', 'v10373', 'v9885', 'v2927'], ['v6013', 'v67648', 'v37724', 'v13235', 'v85343', 'v9884', 'v55736', 'v18477', 'v9885'], ['v6013', 'v342', 'v69205', 'v128', 'v52425', 'v17926', 'v67615', 'v19062', 'v9884', 'v18477', 'v9885'], ['v6013']]


1. **Split the data:** 

---
Before encoding, split the data into train, validation and test sets, and only then perform the encoding and padding. We do this to maintain the integrity and structure of the data.

Since the inputs have different lengths, each input has to be padded with zeros so that all inputs in a mini-batch have the same length.

The split is 60/20/20 (train, validation and test).

In [23]:
n = len(v_events)  

n_test = int( n * .2 ) 
n_val = int( n * .2 ) 
n_train = n - (n_test+n_val)

print(f"Dataset set terms: {n}")
print(f"Train set terms: {n_train}") 
print(f"Test set terms: {n_test}")
print( f"Validation set terms: {n_val}")

train_set, val_set, test_set = data.random_split(v_events, (n_train, n_val, n_test))

#the train, val and test sizes have to be whole numbers adding up to n

Dataset set terms: 5
Train set terms: 3
Test set terms: 1
Validation set terms: 1
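
Because `int()` rounds down, the validation and test sizes are floored and the training set absorbs the remainder, which is why 5 sequences split into 3/1/1. The sketch below reproduces that arithmetic with the standard library only; `split_sizes` and `shuffled_split` are hypothetical stand-ins for illustration and do not replace `data.random_split`.

```python
import random

def split_sizes(n, val_frac=0.2, test_frac=0.2):
    # Floor the validation and test sizes; training takes whatever is left.
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    return n - n_val - n_test, n_val, n_test

def shuffled_split(items, seed=1):
    # Seeded shuffle so the same split is produced on every run.
    n_train, n_val, _ = split_sizes(len(items))
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

train, val, test = shuffled_split(["s1", "s2", "s3", "s4", "s5"])
print(len(train), len(val), len(test))
```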


2. **Get Max length of each set.**

---
We need this information for padding.

In [24]:
#find max length of sequence for each set

def _get_maxlength_seq(data):
  # length of the longest sequence in the set
  return max(len(sequence) for sequence in data)

max_length_train= _get_maxlength_seq(train_set)
max_length_val= _get_maxlength_seq(val_set)
max_length_test= _get_maxlength_seq(test_set)

print(f"max_length_train: {max_length_train}")
print(f"max_length_val: {max_length_val}")
print(f"max_length_test: {max_length_test}")

max_length_train: 9
max_length_val: 11
max_length_test: 4


3. **Padding all the inputs to have same size**
---
Each sequence is padded at the front with all-zero vectors until it reaches the maximum length.

In [25]:
def _padding(data, maxlength):
  # left-pad every sequence with all-zero vectors up to maxlength;
  # sequences shorter than 2 or longer than maxlength are dropped
  changed_data = []
  for sequence in data:
    if 2 <= len(sequence) <= maxlength:
      zero_length = len(sequence[0])
      padding = [[0] * zero_length for _ in range(maxlength - len(sequence))]
      changed_data.append(padding + list(sequence))

  return np.array(changed_data)

'''
p_train_set = _padding(train_set, max_length_train)
p_val_set = _padding(val_set, max_length_val)
p_test_set = _padding(test_set, max_length_test)
'''

'\np_train_set = _padding(train_set, max_length_train)\np_val_set = _padding(val_set, max_length_val)\np_test_set = _padding(test_set, max_length_test)\n'
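
The padding step can be pictured on a toy example. The sketch below left-pads one-hot sequences with all-zero rows using NumPy; `left_pad` is a hypothetical helper, the 3-dimensional one-hot vectors are made up, and it assumes every sequence is non-empty and no longer than `max_length`.

```python
import numpy as np

def left_pad(sequences, max_length):
    # Result shape: (number of sequences, max_length, vector size).
    dim = len(sequences[0][0])
    padded = np.zeros((len(sequences), max_length, dim), dtype=np.int64)
    for row, seq in enumerate(sequences):
        # Write the real events into the last len(seq) time steps.
        padded[row, max_length - len(seq):, :] = seq
    return padded

toy = [
    [[1, 0, 0], [0, 1, 0]],  # two events
    [[0, 0, 1]],             # one event
]
padded = left_pad(toy, max_length=3)
print(padded.shape)
```

With left padding the final time step of every row is a real event, which is convenient when only the last output of a recurrent layer is read.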

In [26]:
for i in train_set:
  print(i)

['v73805', 'v9884', 'v3906', 'v10373', 'v9884']
['v6013']
['v6013', 'v67648', 'v37724', 'v13235', 'v85343', 'v9884', 'v55736', 'v18477', 'v9885']


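4. **One-hot encoding** of the venues, followed by padding.

----

PyTorch's built-in one-hot helper works on integer class indices rather than on string IDs such as `v9884`, so we write a small class that maps each venue ID to an index and to a one-hot vector.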

In [27]:

class one_hot_char_coding(object):
  def __init__(self):
    self.word2idx = {}
    self.idx2word = []
    self.length = 0

  def add_word(self, word):
    # dictionary lookup instead of scanning the idx2word list
    if word not in self.word2idx:
      self.idx2word.append(word)
      self.word2idx[word] = self.length + 1  # indices start at 1
      self.length += 1
    return self.word2idx[word]

  def __len__(self):
    return len(self.idx2word)

  def onehot_encoded(self, word):
    vec = [0] * self.length
    vec[self.word2idx[word]-1] = 1
    return vec
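
For comparison, PyTorch ships `torch.nn.functional.one_hot`, which expects integer class indices. A hedged sketch of combining it with a plain dictionary follows; the 0-based `venue_to_index` mapping is illustrative only (the class above uses 1-based indices internally), and the four venue IDs are just sample values.

```python
import torch
import torch.nn.functional as F

venues = ["v73805", "v9884", "v3906", "v9884"]

# Assign each distinct venue a 0-based index in order of first appearance.
venue_to_index = {}
for v in venues:
    venue_to_index.setdefault(v, len(venue_to_index))

indices = torch.tensor([venue_to_index[v] for v in venues])
one_hot = F.one_hot(indices, num_classes=len(venue_to_index))
print(one_hot.shape)
```

Repeated venues, such as `v9884` here, map to the same row pattern.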


In [28]:
#%time

#get codes for all the words 

coded_obj = one_hot_char_coding()

for word in Checkin_u1205_nodup.VenueID.values:
    coded_obj.add_word(word)

#print(coded_obj.word2idx)
#'v73805': 1, 'v9884': 2, 'v3906': 3, 'v10373
#print(coded_obj.onehot_encoded("v3906"))


#here we have 139 unique venues, so each one-hot encoded vector has length 139

def _one_hot_code(data):
  all_seq = []
  for sequence in data:    
    eve = []
    if(len(sequence)>1):
      for index, event in enumerate(sequence):
        eve.append(coded_obj.onehot_encoded(event))
      all_seq.append(eve)
  return all_seq

'''
e_train_set = _one_hot_code(train_set)
e_val_set = _one_hot_code(val_set)
e_test_set = _one_hot_code(test_set)
'''

#test3 = _one_hot_code(train_set)
#test3 = _padding(test3,9)
#test3
#temp2 = torch.from_numpy(test3)


'\ne_train_set = _one_hot_code(train_set)\ne_val_set = _one_hot_code(val_set)\ne_test_set = _one_hot_code(test_set)\n'

In [29]:
def _one_hot_code_target(data):
  all_seq = []
  for event in data:    
    all_seq.append(coded_obj.onehot_encoded(event))
  return np.array(all_seq)

5. **Encoding time:**
---
Time is a cyclical feature, so each hour is encoded as a pair of features: the sine and the cosine of the hour's angle on the daily cycle. This preserves the cyclic nature of the data.

Sequences with fewer than 2 events are discarded, as they cannot be used for prediction.

`note: this part is not in use`

Time will be combined with location for modeling later.

In [30]:
%time

def _assign_code(data):
  temp_row=[]
  for row in data:    
    temp_event = []
    for event in row:   
      venue_id = coded_obj.onehot_encoded(event[0])
      time_sine = np.sin(2 * np.pi * event[1]/24.0)  # hours run 0..23, so the period is 24
      time_cos = np.cos(2 * np.pi * event[1]/24.0)
      temp_event.append([venue_id, time_sine, time_cos])
    if(len(temp_event) >= 2):
      temp_row.append(temp_event)
  return temp_row
 
#train_set_e = _assign_code(train_set)
#print(f"Number of inputs less than 2 is: {len(train_set) - len(train_set_e)}") 


CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
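
To see why the choice of period matters, the sketch below places hours on a circle, assuming hours run from 0 to 23 so that one full turn is 24 hours. It checks that 23:00 and 00:00 land next to each other while 00:00 and 12:00 sit on opposite sides. `encode_hour` and `distance` are hypothetical helpers using only the standard library.

```python
import math

def encode_hour(hour, period=24):
    # Map the hour to a point on the unit circle.
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

def distance(hour_a, hour_b):
    # Euclidean distance between the two encoded points.
    sin_a, cos_a = encode_hour(hour_a)
    sin_b, cos_b = encode_hour(hour_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

print(distance(23, 0), distance(0, 6), distance(0, 12))
```

Using a single raw hour column instead would put 23 and 0 at the two extremes, even though they are one hour apart.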


6. **Features and Labels**

---
The last event of each sequence is the target, so each sequence is split into features and a label.

In [31]:
def _separate_x_y(data):
  x = data[:-1]
  y = data[-1]
  return [x,y]


In [32]:
def _feature_label(data):
  X_data, Y_data = [], []
  for row in data:
    if(len(row) >= 2):
      temp = _separate_x_y(row)
      x,y = temp[0], temp[1]
      X_data.append(x)
      Y_data.append(y)
  return [X_data, Y_data]


In [33]:
temp = _feature_label(train_set)
X_train, Y_train = temp[0], temp[1]

X_train_m = _one_hot_code(X_train)
X_train_m = _padding(X_train_m, max_length_train-1) #minus one due to label
X_train_m = torch.from_numpy(X_train_m)

Y_train_m = _one_hot_code_target(Y_train) #no padding required for target
Y_train_m = torch.from_numpy(Y_train_m)


# **3. Modeling**

In [34]:
print(X_train_m.shape)
print(Y_train_m.shape)

torch.Size([2, 8, 139])
torch.Size([2, 139])


1. **First Model**
---
Vanilla RNN
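
As a shape check, here is a minimal sketch of how `nn.RNN` with `batch_first=True` and `nn.CrossEntropyLoss` fit together for next-venue classification. The sizes mirror the tensors printed above (batch of 2, 8 time steps, 139 venues); the random inputs and the two target indices are placeholders, not data from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, steps, n_venues, hidden_dim = 2, 8, 139, 12

rnn = nn.RNN(n_venues, hidden_dim, num_layers=1, batch_first=True)
fc = nn.Linear(hidden_dim, n_venues)

x = torch.rand(batch, steps, n_venues)  # float input, as nn.RNN requires
out, h = rnn(x)                         # out: (batch, steps, hidden_dim)
logits = fc(out[:, -1, :])              # last time step only: (batch, n_venues)

# CrossEntropyLoss takes one class index per sample, not a one-hot vector.
targets = torch.tensor([3, 17])
loss = nn.CrossEntropyLoss()(logits, targets)
print(out.shape, logits.shape, loss.dim())
```

A one-hot target of shape `(batch, n_venues)` can be turned into class indices with `argmax(dim=1)`.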

In [39]:
class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)   
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
    
    def forward(self, x):
        
        batch_size = x.size(0)

        # Initializing hidden state
        hidden = self.init_hidden(batch_size)

        # Model
        out, hidden = self.rnn(x, hidden)
        
        # Keep only the last time step: the sequences are left-padded, so it holds the most recent event
        out = out[:, -1, :]
        out = self.fc(out)
        
        return out, hidden
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
        return hidden

In [40]:
# Instantiate the model
# input_size is the number of unique venue ids, 139 in our case
# output_size is also 139: one score per candidate next venue

model = Model(input_size=139, output_size=139, hidden_dim=12, n_layers=1)
# set the model to the device (default is CPU)
#model.to(device)

# hyperparameters
n_epochs = 100
lr=0.01

# Loss and optimizer
criterion = nn.CrossEntropyLoss() # predicting the next venue is a classification task, so cross entropy loss is used; a regression target would need MSE
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [41]:
# Training Run
for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad() 
    #input_seq.to(device)
    output, hidden = model(X_train_m.float()) # nn.RNN expects float inputs
    loss = criterion(output, Y_train_m.argmax(dim=1)) # class indices, not one-hot vectors
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly
    
    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

Resources used
***
https://discuss.pytorch.org/t/mini-batch-training-for-inputs-of-variable-sizes/13662

https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed

https://stackoverflow.com/questions/53532352/how-do-i-split-the-training-dataset-into-training-validation-and-test-datasets

https://learning.oreilly.com/library/view/deep-learning-with/9781788624336/e397c9a2-28a2-41fd-bcc2-a6bb49adfc44.xhtml

https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning


https://jdhao.github.io/2017/11/15/pytorch-datatype-note/

https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

https://ai.stackexchange.com/questions/3156/how-to-select-number-of-hidden-layers-and-number-of-memory-cells-in-an-lstm

https://blog.floydhub.com/a-beginners-guide-on-recurrent-neural-networks-with-pytorch/
