# GP_Kernel_Experiments for SatMobFusion

We want to see how well conventional machine learning (ML) techniques work for this kind of data (mixed discrete and continuous masses). Specifically, we establish baselines using linear regression, ridge and lasso regression, logistic regression, and bootstrap aggregation (random forest). We then implement Gaussian processes (GPs) and compare performance.

Our goal is to predict the current cluster from features such as the previously visited clusters, the leaving hour, and the day of the week.

## Data Loading & Pre-Processing

In [112]:
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split

from typing import Optional

import math
import numpy as np
import gpytorch
import torch
from torch import nn, Tensor
from torch.nn import functional as F

from gpytorch.constraints.constraints import Interval, Positive
from gpytorch.kernels.kernel import Kernel
from gpytorch.priors.prior import Prior

from gpytorch.models import ExactGP
from gpytorch.likelihoods import DirichletClassificationLikelihood
from gpytorch.means import ConstantMean
from gpytorch.kernels import ScaleKernel, RBFKernel

In [4]:
# Run this line every time in order to refresh the data
df = pd.read_csv('processed/user_0162bdca5925e11a37a48c507453734045b5d62cca0d6abc8300993dfbf8b69e.csv')

# Add additional corrections as needed
df['last_one'] = df['last_one'].fillna(-1)
df['last_two'] = df['last_two'].apply(lambda x: np.fromstring(x[1 : len(x) - 1], dtype = float, sep = ','))
df['last_three'] = df['last_three'].apply(lambda x: np.fromstring(x[1 : len(x) - 1], dtype = float, sep = ','))
df['last_three_activity_duration'] = df['last_three_activity_duration'].apply(lambda x: np.fromstring(x[1 : len(x) - 1], dtype = float, sep = ','))
df['last_start_time'] = pd.to_datetime(df['last_start_time'])
df['leaving_datetime'] = pd.to_datetime(df['leaving_datetime'])

# Add additional features
df['leaving_hour'] = df['leaving_datetime'].dt.hour.astype(int)
df['day_of_week'] = df['leaving_datetime'].dt.dayofweek # Monday = 0, Sunday = 6
df['is_weekday'] = df['day_of_week'].apply(lambda x: 1 if x < 5 else 0)

# Note that NumPy arrays are not hashable, so we need to convert them to tuples
df['last_three'] = df['last_three'].apply(lambda x: tuple(x))
df['last_two'] = df['last_two'].apply(lambda x: tuple(x))

df['third_last_one'] = df['last_three'].apply(lambda x: x[2]).fillna(-1)
df['second_last_one'] = df['last_two'].apply(lambda x: x[1]).fillna(-1)

# View the data
pd.options.display.max_columns = None
df.head()

Unnamed: 0,datetime,leaving_datetime,cluster,dist,last_one,last_two,last_three,activity_duration,last_three_activity_duration,last_three_start_time,last_three_end_time,last_three_lat,last_three_lng,last_three_dist,last_dist,second_last_dist,third_last_dist,last_start_time,second_last_start_time,third_last_start_time,last_end_time,second_last_end_time,third_last_end_time,last_activity_duration,second_last_activity_duration,third_last_activity_duration,leaving_hour,day_of_week,is_weekday,third_last_one,second_last_one
0,2019-12-31 14:49:22,2019-12-31 15:58:02,0,58813.0,-1.0,"(nan, nan)","(nan, nan, nan)",69,"[nan, nan, nan]","(NaT, NaT, NaT)","(NaT, NaT, NaT)","(nan, nan, nan)","(nan, nan, nan)","(nan, nan, nan)",,,,NaT,,,,,,,,,15,1,1,-1.0,-1.0
1,2019-12-31 17:01:04,2019-12-31 19:45:05,6,52032.0,0.0,"(0.0, nan)","(0.0, nan, nan)",164,"[69.0, nan, nan]","(Timestamp('2019-12-31 14:49:22'), NaT, NaT)","(Timestamp('2019-12-31 15:58:02'), NaT, NaT)","(47.22057343, nan, nan)","(-122.3467407, nan, nan)","(58813.0, nan, nan)",58813.0,,,2019-12-31 14:49:22,,,2019-12-31 15:58:02,,,69.0,,,19,1,1,-1.0,-1.0
2,2019-12-31 20:32:41,2020-01-01 01:35:33,3,38063.0,6.0,"(6.0, 0.0)","(6.0, 0.0, nan)",303,"[164.0, 69.0, nan]","(Timestamp('2019-12-31 17:01:04'), Timestamp('...","(Timestamp('2019-12-31 19:45:05'), Timestamp('...","(47.518939, 47.22057343, nan)","(-121.84213555, -122.3467407, nan)","(52032.0, 58813.0, nan)",52032.0,58813.0,,2019-12-31 17:01:04,2019-12-31 14:49:22,,2019-12-31 19:45:05,2019-12-31 15:58:02,,164.0,69.0,,1,2,1,-1.0,0.0
3,2020-01-01 02:41:51,2020-01-02 08:52:52,0,4705.0,3.0,"(3.0, 6.0)","(3.0, 6.0, 0.0)",1811,"[303.0, 164.0, 69.0]","(Timestamp('2019-12-31 20:32:41'), Timestamp('...","(Timestamp('2020-01-01 01:35:33'), Timestamp('...","(47.8473012, 47.518939, 47.22057343)","(-122.2763971, -121.84213555, -122.3467407)","(38063.0, 52032.0, 58813.0)",38063.0,52032.0,58813.0,2019-12-31 20:32:41,2019-12-31 17:01:04,2019-12-31 14:49:22,2020-01-01 01:35:33,2019-12-31 19:45:05,2019-12-31 15:58:02,303.0,164.0,69.0,8,3,1,0.0,6.0
4,2020-01-02 09:21:03,2020-01-02 14:45:51,1,4705.0,0.0,"(0.0, 3.0)","(0.0, 3.0, 6.0)",325,"[1811.0, 303.0, 164.0]","(Timestamp('2020-01-01 02:41:51'), Timestamp('...","(Timestamp('2020-01-02 08:52:52'), Timestamp('...","(47.22057343, 47.8473012, 47.518939)","(-122.3467407, -122.2763971, -121.84213555)","(4705.0, 38063.0, 52032.0)",4705.0,38063.0,52032.0,2020-01-01 02:41:51,2019-12-31 20:32:41,2019-12-31 17:01:04,2020-01-02 08:52:52,2020-01-01 01:35:33,2019-12-31 19:45:05,1811.0,303.0,164.0,14,3,1,6.0,3.0


## Encoding/Decoding

In [300]:
# Create a function that encodes the origin-destination combinations
def encode_combinations(row, col1, col2, combinations_dict):
    # Get the origin-destination combination
    combination = row[[col1, col2]].values.tolist()
    # Get the index of the combination
    index = list(combinations_dict.values()).index(combination)
    # Return the index
    return index

# Create a function that decodes the origin-destination combinations
def decode_combinations(row, combinations_dict):
    # Get the index
    index = row['combinations']
    # Get the origin-destination combination
    combination = combinations_dict[index]
    # Return the combination
    return combination
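
`encode_combinations` rebuilds a list of the dictionary's values and scans it once per row. A minimal sketch of an equivalent encoding that uses a reverse dictionary for constant-time lookups is shown below. The helper names `build_codebook` and `encode_pair` are hypothetical, and the sample pairs are made up for illustration.

```python
def build_codebook(pairs):
    """Map each distinct (col1, col2) pair to an integer code in first-seen order."""
    codebook = {}
    for pair in pairs:
        key = tuple(pair)
        if key not in codebook:
            codebook[key] = len(codebook)
    return codebook


def encode_pair(pair, codebook):
    """Return the integer code of a pair (raises KeyError for unseen pairs)."""
    return codebook[tuple(pair)]


# Hypothetical (last_one, leaving_hour) pairs
pairs = [(-1.0, 15), (0.0, 19), (6.0, 1), (0.0, 19)]
codebook = build_codebook(pairs)
codes = [encode_pair(p, codebook) for p in pairs]

# Decoding is the inverse mapping
decode = {code: pair for pair, code in codebook.items()}
print(codes)
```

Codes are assigned in first-seen order, which should match the order produced by `drop_duplicates()` in the cells that build `unique_combinations_dict`.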

## Hamming Kernel Code

## Random Forest Baseline

In [336]:
import scipy.stats 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# The features we wish to learn from
features = ['last_one', 'second_last_one', 'third_last_one', 'leaving_hour', 'day_of_week']

class RandomForest: 
    """
    A majority vote random forest classifier
    """
    
    def __init__(self, num_trees, max_depth=None):
        """
        Constructs a RandomForest that uses the given number of trees, each with a 
        max depth of max_depth.
        """
        self._trees = [
            DecisionTreeClassifier(max_depth=max_depth, random_state=1) 
            for i in range(num_trees)
        ]
        
    def fit(self, X):
        """
        Takes an input dataset X and trains each tree on a random subset of the data.
        Sampling is done with replacement.
        """

        for mini_tree in self._trees:
            mini_data = X.iloc[np.random.randint(0, X.shape[0], X.shape[0])]
            mini_tree.fit(mini_data[features], mini_data['cluster'])
            
    def predict(self, X):
        """
        Takes an input dataset X and returns the predictions for each example in X.
        """
        # Builds up a 2d array with n rows and T columns
        # where n is the number of points to classify and T is the number of trees
        predictions = np.zeros((len(X), len(self._trees)))
        for i, tree in enumerate(self._trees):
            # Make predictions using the current tree
            preds = tree.predict(X)
            
            # Store those predictions in ith column of the 2d array
            predictions[:, i] = preds
            
        # For each row of predictions, find the most frequent label (axis=1 means across columns)
        return scipy.stats.mode(predictions, axis=1, keepdims=False)[0]
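
As a side note on the bootstrap step in `fit`: sampling n rows with replacement leaves each tree seeing only about 63% of the distinct rows. The sketch below illustrates this with NumPy. The seed is arbitrary, and 369 is used because it is the size of this notebook's training split.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_rows = 369

# Draw row indices with replacement, as RandomForest.fit does
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct rows that appear in this bootstrap sample
unique_fraction = np.unique(bootstrap_idx).size / n_rows

# Probability that a given row is drawn at least once; approaches 1 - 1/e
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows

print(f"unique rows in this sample: {unique_fraction:.3f}")
print(f"expected fraction:          {expected_fraction:.3f}")
```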

In [337]:
# Get training and testing data
train_set, test_set = train_test_split(df, test_size = 0.2, random_state=42)

# Get x data
x_train = train_set[features].values
x_test = test_set[features].values

# Get the cluster labels as NumPy arrays
y_train = train_set['cluster'].values
y_actual = test_set['cluster'].values

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_actual:", y_actual.shape)

x_train: (369, 5)
x_test: (93, 5)
y_train: (369,)
y_actual: (93,)


In [349]:
# First calculate the accuracies for each depth
depths = list(range(1, 26, 2))
best_depth = 0
best_accuracy = 0

for i in depths:
    # Train and evaluate our RandomForest classifier with given max_depth 
    forest = RandomForest(15, max_depth=i)
    forest.fit(train_set)
    train_score = accuracy_score(forest.predict(train_set[features]), train_set['cluster'])
    observed_prediction = forest.predict(test_set[features])
    test_score = accuracy_score(observed_prediction, test_set['cluster'])
    
    if test_score > best_accuracy:
        best_accuracy = test_score
        best_depth = i

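print("Best Test Accuracy: ", best_accuracy, "\nBest Test Depth: ", best_depth, sep="")
# Note: observed_prediction holds the predictions from the last depth tried (25),
# not from best_depth, so the residuals below describe that final forest.
display(y_actual - observed_prediction)
# Sum of absolute differences between true and predicted cluster ids
print("Total Error: ", sum(abs(y_actual - observed_prediction)), sep="")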

Best Test Accuracy: 0.9354838709677419
Best Test Depth: 13


array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,  -1.,   0.,   0.,   0.,
         0.,   0.,   0.,   2.,   0.,   0.,   8.,   0.,  -1.,   0.,   0.,
         0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,  -3.,   0.,   0.,   0.,   0.,   0.,   0.,  -1.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,  -9.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,
         0.,   0.,   0.,   0., -10.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.])

Total Error: 37.0


## GP Experiments

### Simple ExactGP + RBF 

In [119]:
# Use variables to specify the columns to encode
item1 = 'last_one'
item2 = 'leaving_hour'

# Encode the data so that each (item1, item2) combination maps to a unique number
# Create a table of the unique (item1, item2) combinations
unique_combinations = df[[item1, item2]].drop_duplicates().reset_index(drop=True)

# Create a dictionary mapping each index to its (item1, item2) combination
unique_combinations_dict = dict(zip(range(0, len(unique_combinations)), unique_combinations.values.tolist()))

# Encode each row's combination as its index
df['combinations'] = df.apply(lambda x: encode_combinations(x, item1, item2, unique_combinations_dict), axis=1)

# Get training and testing data
train_set, test_set = train_test_split(df, test_size = 0.2, random_state=42)

# Get x data
x_train = torch.tensor(train_set['combinations'].values, dtype=torch.float32)
x_test = torch.tensor(test_set['combinations'].values, dtype=torch.float32)

# Convert cluster column to tensor
y_train = torch.tensor(train_set['cluster'].values, dtype=torch.float32)
y_actual = torch.tensor(test_set['cluster'].values, dtype=torch.float32)

In [120]:
# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(x_train, y_train, likelihood)
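
One thing to keep in mind with this setup: `combinations` is an arbitrary integer code, yet the RBF kernel treats numerically close codes as similar inputs, whether or not the underlying (`last_one`, `leaving_hour`) pairs are related. A small NumPy sketch of the scaled RBF kernel makes this visible. The lengthscale and outputscale values are chosen for illustration and are not the fitted ones.

```python
import numpy as np


def scaled_rbf(x1, x2, lengthscale=1.0, outputscale=1.0):
    """k(x, x') = outputscale * exp(-(x - x')^2 / (2 * lengthscale^2)) for 1-D inputs."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    diff = x1[:, None] - x2[None, :]
    return outputscale * np.exp(-0.5 * (diff / lengthscale) ** 2)


# Three hypothetical combination codes
codes = np.array([0, 1, 40])
K = scaled_rbf(codes, codes)
print(np.round(K, 4))
```

Codes 0 and 1 get a covariance of about 0.61, while codes 0 and 40 are effectively uncorrelated, purely because of how the codes happened to be numbered.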

In [121]:
# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 100

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(x_train)
    loss = -mll(output, y_train)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()

Iter 1/100 - Loss: 4.317
Iter 2/100 - Loss: 4.082
Iter 3/100 - Loss: 3.876
Iter 4/100 - Loss: 3.693
Iter 5/100 - Loss: 3.531
Iter 6/100 - Loss: 3.388
Iter 7/100 - Loss: 3.262
Iter 8/100 - Loss: 3.152
Iter 9/100 - Loss: 3.054
Iter 10/100 - Loss: 2.968
Iter 11/100 - Loss: 2.893
Iter 12/100 - Loss: 2.826
Iter 13/100 - Loss: 2.767
Iter 14/100 - Loss: 2.715
Iter 15/100 - Loss: 2.669
Iter 16/100 - Loss: 2.628
Iter 17/100 - Loss: 2.592
Iter 18/100 - Loss: 2.560
Iter 19/100 - Loss: 2.532
Iter 20/100 - Loss: 2.507
Iter 21/100 - Loss: 2.484
Iter 22/100 - Loss: 2.464
Iter 23/100 - Loss: 2.446
Iter 24/100 - Loss: 2.430
Iter 25/100 - Loss: 2.415
Iter 26/100 - Loss: 2.402
Iter 27/100 - Loss: 2.390
Iter 28/100 - Loss: 2.379
Iter 29/100 - Loss: 2.370
Iter 30/100 - Loss: 2.361
Iter 31/100 - Loss: 2.353
Iter 32/100 - Loss: 2.346
Iter 33/100 - Loss: 2.339
Iter 34/100 - Loss: 2.333
Iter 35/100 - Loss: 2.328
Iter 36/100 - Loss: 2.323
Iter 37/100 - Loss: 2.318
Iter 38/100 - Loss: 2.314
Iter 39/100 - Loss: 2

In [122]:
# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# The gpytorch.settings.fast_pred_var flag activates LOVE (for fast variances)
# See https://arxiv.org/abs/1803.06058
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # Make predictions
    observed_prediction = likelihood(model(x_test))

In [123]:
# Casting to int truncates toward zero, so any residual with magnitude below 1 shows up as 0
(y_actual - observed_prediction.mean).to(torch.int)


tensor([ 0,  0,  0,  0, -2,  0,  0, -3,  0,  0,  0,  0,  0,  0,  1,  0,  0,  7,
         0, -3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        -2,  0,  0,  0,  0,  0,  0,  0,  0, -2,  0, -2,  0,  0,  0,  0, -1,  0,
         0,  0,  0,  0,  0, -2,  0, -2, -2,  0,  0, -2, -2,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0, -3,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0], dtype=torch.int32)

In [124]:
residual = (y_actual - observed_prediction.mean).to(torch.int)
accuracy = (len(y_actual) - torch.count_nonzero(residual)) / len(y_actual) * 100
print("We predict the next location (current cluster) with an accuracy of: ", '%.2f' % accuracy, "%", sep="")

We predict the next location (current cluster) with an accuracy of: 83.87%
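
Note that this accuracy comes from casting the regression residual to an integer, which truncates toward zero, so any prediction within one unit of the true label counts as correct. The sketch below compares that criterion with rounding the predicted mean to the nearest label. The values are made up for illustration and are not taken from the model above.

```python
import numpy as np

y_true = np.array([3.0, 3.0, 3.0, 3.0])
y_mean = np.array([3.1, 3.6, 2.4, 4.2])  # hypothetical GP posterior means

residual = y_true - y_mean

# What the integer cast effectively tests: |residual| < 1
hits_truncate = np.trunc(residual) == 0

# Stricter alternative: the nearest integer label must equal the true label
hits_round = np.rint(y_mean) == y_true

acc_truncate = hits_truncate.mean()
acc_round = hits_round.mean()
print(acc_truncate, acc_round)
```

On these toy values the truncation criterion reports 75% while the nearest-label criterion reports 25%, so the two can differ substantially.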



### Dirichlet Classification Experiments

#### Predict Current Location Using Last Three Locations and Leaving Hour

In [172]:
dummies1 = pd.get_dummies(df['third_last_one']).values
dummies2 = pd.get_dummies(df['second_last_one']).values
dummies3 = pd.get_dummies(df['last_one']).values
hour_of_day = 'leaving_hour'

# Concatenate the original data frame with the new columns horizontally
new_df = pd.concat([df, pd.DataFrame(dummies1), pd.DataFrame(dummies2), pd.DataFrame(dummies3), pd.DataFrame(df['leaving_hour'])], axis=1)
train_set, test_set = train_test_split(new_df, test_size = 0.2, random_state=42)

# The total number of one-hot columns across the three location features
locations = dummies1.shape[1] + dummies2.shape[1] + dummies3.shape[1]

# x_train is last columns of train_set
x_train = torch.tensor(train_set.iloc[:, -(locations + 1):].values, dtype=torch.int64)
print("training shape:", x_train.shape)

# x_test is last columns of test_set
x_test = torch.tensor(test_set.iloc[:, -(locations + 1):].values, dtype=torch.int64)
print("testing shape:", x_test.shape)

# Convert cluster column to tensor
y_train = torch.tensor(train_set['cluster'].values, dtype=torch.int64)
y_actual = torch.tensor(test_set['cluster'].values, dtype=torch.int64)

training shape: torch.Size([369, 67])
testing shape: torch.Size([93, 67])


In [173]:
class DirichletGPModel(ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_classes):
        super(DirichletGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = ConstantMean(batch_shape=torch.Size((num_classes,)))
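        # Product of three scaled RBF kernels:
        #   1. one RBF per class over all input columns,
        #   2. an RBF on column 66 (leaving_hour),
        #   3. an ARD RBF on columns 0-65 (the one-hot location columns).
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=torch.Size((num_classes,))),
            batch_shape=torch.Size((num_classes,))
        ) * ScaleKernel(
            RBFKernel(active_dims=[66])
        ) * ScaleKernel(
            RBFKernel(active_dims=np.arange(0, 66, 1), ard_num_dims=66)
        )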

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
# we let the DirichletClassificationLikelihood compute the targets for us
likelihood = DirichletClassificationLikelihood(y_train, learn_additional_noise=True)
model = DirichletGPModel(x_train, likelihood.transformed_targets, likelihood, num_classes=likelihood.num_classes)
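
For reference, here is a NumPy sketch of the target transformation that the likelihood performs, as I understand it from the Dirichlet-based GP classification approach of Milios et al. (2018). The function name `dirichlet_targets` is hypothetical, and `alpha_epsilon=0.01` is assumed to match the GPyTorch default. This is an illustration, not the library's code.

```python
import numpy as np


def dirichlet_targets(labels, num_classes, alpha_epsilon=0.01):
    """Turn integer class labels into per-class regression targets and noise variances."""
    n = len(labels)
    # Dirichlet concentration: a small epsilon everywhere, plus 1 for the observed class
    alpha = np.full((n, num_classes), alpha_epsilon)
    alpha[np.arange(n), labels] += 1.0
    # Log-normal approximation of each marginal
    sigma2 = np.log(1.0 / alpha + 1.0)       # per-entry noise variance
    y_tilde = np.log(alpha) - 0.5 * sigma2   # regression targets
    return y_tilde, sigma2


labels = np.array([0, 2, 1])
y_tilde, sigma2 = dirichlet_targets(labels, num_classes=3)
print(np.round(y_tilde, 3))
```

The observed class gets the largest target in each row, and the unobserved classes get strongly negative targets with large noise. Each class is then fit as a separate regression problem, which is why the model uses a batch of `num_classes` means and kernels and why the prediction is the class with the largest posterior mean.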

In [174]:
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes the likelihood's parameters

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(x_train)
    # Calc loss and backprop gradients
    loss = -mll(output, likelihood.transformed_targets).sum()
    loss.backward()
    if i % 5 == 0:
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 88.043
Iter 6/50 - Loss: 64.379
Iter 11/50 - Loss: 49.961
Iter 16/50 - Loss: 43.894
Iter 21/50 - Loss: 41.922
Iter 26/50 - Loss: 41.060
Iter 31/50 - Loss: 40.500
Iter 36/50 - Loss: 40.069
Iter 41/50 - Loss: 39.728
Iter 46/50 - Loss: 39.459


In [175]:
model.covar_module.kernels[2].base_kernel.lengthscale

tensor([[3.5971, 3.1682, 2.9587, 2.9064, 3.1710, 3.0427, 2.8683, 3.2768, 2.9048,
         2.7985, 0.6931, 3.0813, 0.6931, 3.3693, 0.6931, 3.0289, 3.2960, 2.8683,
         3.3558, 3.1177, 3.2728, 3.4598, 3.2809, 3.2546, 3.0828, 3.1107, 3.2962,
         3.2736, 3.1165, 3.0321, 3.0058, 3.2286, 3.0783, 3.3856, 3.3649, 3.0627,
         3.1487, 3.2960, 3.4162, 3.0783, 3.5164, 3.2001, 3.6720, 0.6931, 0.6931,
         2.9650, 3.1007, 2.9417, 3.1120, 3.2929, 3.1360, 4.0203, 3.2561, 3.3065,
         3.2728, 3.3295, 3.3693, 3.3944, 0.6931, 3.4162, 2.7649, 3.2561, 3.3255,
         3.0705, 0.6931, 3.1165]], grad_fn=<SoftplusBackward0>)
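
Several of the lengthscales above are exactly 0.6931. The `SoftplusBackward0` in the output indicates that the lengthscale is a softplus transform of a raw parameter, and softplus(0) = ln 2 ≈ 0.6931. Assuming the raw parameters start at zero, a plausible reading is that these dimensions never moved from their initial value, for example because the corresponding one-hot column is constant in the training split and so receives no gradient. A quick check of the arithmetic:

```python
import math


def softplus(raw):
    """softplus(raw) = log(1 + exp(raw))"""
    return math.log1p(math.exp(raw))


# Value of the lengthscale when the raw parameter is still zero
untouched = softplus(0.0)
print(round(untouched, 4))
```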

In [176]:
model.covar_module.kernels[0].outputscale

tensor([2.1308, 2.1806, 2.1186, 2.0058, 2.0546, 2.0484, 2.0416, 2.0266, 2.0595,
        2.0332, 2.0461, 2.0458, 2.0369, 2.0369, 2.0520, 2.0421, 2.0321, 2.0350,
        2.0411, 2.0426, 2.0443], grad_fn=<SoftplusBackward0>)

In [177]:
# evaluate our model
model.eval()
likelihood.eval()

with gpytorch.settings.fast_pred_var(), torch.no_grad():
    test_dist = model(x_test)
    pred_means = test_dist.loc

In [178]:
# Predicted class = index of the largest posterior mean along the class dimension
observed_prediction = pred_means.max(0)[1]
(y_actual - observed_prediction).to(torch.int)


tensor([ 0,  0,  0,  0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  2,  0,  0,  7,
         0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -6,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,
         0,  1,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0], dtype=torch.int32)

In [179]:
residual = (y_actual - observed_prediction).to(torch.int)
accuracy = (len(y_actual) - torch.count_nonzero(residual)) / len(y_actual) * 100
print("We predict the next location (current cluster) with an accuracy of: ", '%.2f' % accuracy, "%", sep="")

We predict the next location (current cluster) with an accuracy of: 91.40%



### Spectral Mixture Experiments

#### Predict Next Location Using Last Location and Leaving Hour Encoding

In [10]:
# Use variables to specify the columns to encode
item1 = 'last_one'
item2 = 'leaving_hour'

# Encode the data so that each (item1, item2) combination maps to a unique number
# Create a table of the unique (item1, item2) combinations
unique_combinations = df[[item1, item2]].drop_duplicates().reset_index(drop=True)

# Create a dictionary mapping each index to its (item1, item2) combination
unique_combinations_dict = dict(zip(range(0, len(unique_combinations)), unique_combinations.values.tolist()))

# Encode each row's combination as its index
df['combinations'] = df.apply(lambda x: encode_combinations(x, item1, item2, unique_combinations_dict), axis=1)

# Get training and testing data
train_set, test_set = train_test_split(df, test_size = 0.2, random_state=42)

# Get x data
x_train = torch.tensor(train_set['combinations'].values, dtype=torch.float32)
x_test = torch.tensor(test_set['combinations'].values, dtype=torch.float32)

# Convert cluster column to tensor
y_train = torch.tensor(train_set['cluster'].values, dtype=torch.float32)
y_actual = torch.tensor(test_set['cluster'].values, dtype=torch.float32)

NameError: name 'encode_combinations' is not defined

In [135]:
class SpectralMixtureGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(SpectralMixtureGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4)
        self.covar_module.initialize_from_data(train_x, train_y)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SpectralMixtureGPModel(x_train, y_train, likelihood)
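
For intuition about what this kernel models, the one-dimensional spectral mixture kernel is a weighted sum of Gaussian envelopes multiplied by cosines, which lets it represent periodic structure such as daily patterns. A NumPy sketch of the formula follows. The function name is hypothetical, and the weights, means, and variances are made-up values, not the ones `initialize_from_data` would choose.

```python
import numpy as np


def spectral_mixture_1d(tau, weights, means, variances):
    """k(tau) = sum_q w_q * exp(-2 * pi^2 * tau^2 * v_q) * cos(2 * pi * tau * mu_q)."""
    tau = np.asarray(tau, dtype=float)[..., None]  # add an axis to broadcast over mixtures
    envelope = np.exp(-2.0 * np.pi ** 2 * tau ** 2 * variances)
    carrier = np.cos(2.0 * np.pi * tau * means)
    return np.sum(weights * envelope * carrier, axis=-1)


# Four mixture components, matching num_mixtures=4 above (values are illustrative)
weights = np.array([0.5, 0.3, 0.15, 0.05])
means = np.array([0.0, 0.1, 0.25, 0.5])
variances = np.array([0.01, 0.02, 0.05, 0.1])

k0 = spectral_mixture_1d(0.0, weights, means, variances)
k_vals = spectral_mixture_1d(np.array([0.0, 1.0, 2.0]), weights, means, variances)
print(k0, k_vals)
```

At `tau = 0` the kernel equals the sum of the mixture weights, and its magnitude never exceeds that value at any other lag.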

In [136]:
# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 100

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(x_train)
    loss = -mll(output, y_train)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()

Iter 1/100 - Loss: 3.673
Iter 2/100 - Loss: 3.413
Iter 3/100 - Loss: 3.249
Iter 4/100 - Loss: 3.099
Iter 5/100 - Loss: 2.980
Iter 6/100 - Loss: 2.880
Iter 7/100 - Loss: 2.790
Iter 8/100 - Loss: 2.705
Iter 9/100 - Loss: 2.629
Iter 10/100 - Loss: 2.568
Iter 11/100 - Loss: 2.522
Iter 12/100 - Loss: 2.483
Iter 13/100 - Loss: 2.450
Iter 14/100 - Loss: 2.419
Iter 15/100 - Loss: 2.393
Iter 16/100 - Loss: 2.369
Iter 17/100 - Loss: 2.349
Iter 18/100 - Loss: 2.331
Iter 19/100 - Loss: 2.315
Iter 20/100 - Loss: 2.301
Iter 21/100 - Loss: 2.289
Iter 22/100 - Loss: 2.277
Iter 23/100 - Loss: 2.267
Iter 24/100 - Loss: 2.258
Iter 25/100 - Loss: 2.250
Iter 26/100 - Loss: 2.242
Iter 27/100 - Loss: 2.235
Iter 28/100 - Loss: 2.229
Iter 29/100 - Loss: 2.224
Iter 30/100 - Loss: 2.219
Iter 31/100 - Loss: 2.214
Iter 32/100 - Loss: 2.210
Iter 33/100 - Loss: 2.207
Iter 34/100 - Loss: 2.204
Iter 35/100 - Loss: 2.201
Iter 36/100 - Loss: 2.198
Iter 37/100 - Loss: 2.196
Iter 38/100 - Loss: 2.194
Iter 39/100 - Loss: 2

In [137]:
# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# The gpytorch.settings.fast_pred_var flag activates LOVE (for fast variances)
# See https://arxiv.org/abs/1803.06058
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # Make predictions
    observed_prediction = likelihood(model(x_test))

In [9]:
observed_prediction.mean

NameError: name 'observed_prediction' is not defined

In [138]:
# Casting to int truncates toward zero, so any residual with magnitude below 1 shows up as 0
(y_actual - observed_prediction.mean).to(torch.int)


tensor([ 0,  0,  0,  0, -2,  0,  0, -2,  0,  0,  0,  0,  0,  0,  1,  0,  0,  7,
         0, -3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        -3,  0,  0,  0,  0,  0,  0,  0,  0, -2,  0, -2,  0,  0,  0,  0, -3,  0,
         0,  0,  0,  0,  0, -2,  0, -2, -1,  0,  0, -3, -1,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0, -3,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0], dtype=torch.int32)

In [139]:
residual = (y_actual - observed_prediction.mean).to(torch.int)
accuracy = (len(y_actual) - torch.count_nonzero(residual)) / len(y_actual) * 100
print("We predict the next location (current cluster) with an accuracy of: ", '%.2f' % accuracy, "%", sep="")

We predict the next location (current cluster) with an accuracy of: 83.87%



### Combination Kernel Experiments

#### Dirichlet Product Kernel

In [351]:
dummies1 = pd.get_dummies(df['third_last_one']).values
dummies2 = pd.get_dummies(df['second_last_one']).values
dummies3 = pd.get_dummies(df['last_one']).values

# Concatenate the original data frame with the new columns horizontally
new_df = pd.concat([df, 
                    pd.DataFrame(dummies1), 
                    pd.DataFrame(dummies2), 
                    pd.DataFrame(dummies3), 
                    pd.DataFrame(df['leaving_hour']),
                    pd.DataFrame(df['day_of_week'])], axis=1)
train_set, test_set = train_test_split(new_df, test_size = 0.2, random_state=42)

# The total number of one-hot columns across the three location features
locations = dummies1.shape[1] + dummies2.shape[1] + dummies3.shape[1]

# x_train is last columns of train_set
x_train = torch.tensor(train_set.iloc[:, -(locations + 2):].values, dtype=torch.int64)
print("training shape:", x_train.shape)

# x_test is last columns of test_set
x_test = torch.tensor(test_set.iloc[:, -(locations + 2):].values, dtype=torch.int64)
print("testing shape:", x_test.shape)

# Convert cluster column to tensor
y_train = torch.tensor(train_set['cluster'].values, dtype=torch.int64)
y_actual = torch.tensor(test_set['cluster'].values, dtype=torch.int64)

training shape: torch.Size([369, 68])
testing shape: torch.Size([93, 68])


In [352]:
class DirichletGPModel(ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_classes):
        super(DirichletGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = ConstantMean(batch_shape=torch.Size((num_classes,)))
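        # Product of three scaled RBF kernels:
        #   1. one RBF per class over all input columns,
        #   2. an RBF on columns 66-67 (leaving_hour and day_of_week),
        #   3. an ARD RBF on columns 0-65 (the one-hot location columns).
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=torch.Size((num_classes,))),
            batch_shape=torch.Size((num_classes,))
        ) * ScaleKernel(
            RBFKernel(active_dims=[66, 67])
        ) * ScaleKernel(
            RBFKernel(active_dims=np.arange(0, 66, 1), ard_num_dims=66)
        )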

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
# we let the DirichletClassificationLikelihood compute the targets for us
likelihood = DirichletClassificationLikelihood(y_train, learn_additional_noise=True)
model = DirichletGPModel(x_train, likelihood.transformed_targets, likelihood, num_classes=likelihood.num_classes)
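
To make the structure of this product kernel concrete, here is a simplified NumPy sketch with two factors: an RBF on the trailing time columns multiplied by an ARD RBF on the one-hot columns. It omits the per-class batch kernel and the output scales, uses a toy layout of 3 one-hot columns instead of 66, and all names and lengthscales are hypothetical.

```python
import numpy as np


def rbf(a, b, lengthscale):
    """RBF kernel between rows of a (n, d) and b (m, d); lengthscale may be per-dimension."""
    a = a / lengthscale
    b = b / lengthscale
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist)


def product_kernel(x1, x2, n_onehot, ls_time, ls_onehot):
    """Kernel on the time columns multiplied by an ARD kernel on the one-hot columns."""
    k_time = rbf(x1[:, n_onehot:], x2[:, n_onehot:], ls_time)
    k_loc = rbf(x1[:, :n_onehot], x2[:, :n_onehot], ls_onehot)
    return k_time * k_loc


# Toy inputs: 3 one-hot columns followed by leaving_hour and day_of_week
x = np.array([
    [1, 0, 0, 8, 1],
    [1, 0, 0, 9, 1],
    [0, 1, 0, 8, 1],
], dtype=float)

K = product_kernel(x, x, n_onehot=3, ls_time=2.0, ls_onehot=np.ones(3))
print(np.round(K, 4))
```

Because the factors are multiplied, two points are considered similar only if they are close in both the location history and the time features. Rows 0 and 1 share a location and differ by one hour, so they stay highly correlated, while rows 0 and 2 share the time but differ in location and are much less so.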

In [353]:
# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(x_train)
    loss = -mll(output, likelihood.transformed_targets).sum()
    loss.backward()
    if i % 10 == 0:
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 106.611
Iter 11/50 - Loss: 56.428
Iter 21/50 - Loss: 44.520
Iter 31/50 - Loss: 42.710
Iter 41/50 - Loss: 41.659


In [354]:
# evaluate our model
model.eval()
likelihood.eval()

with gpytorch.settings.fast_pred_var(), torch.no_grad():
    test_dist = model(x_test)
    pred_means = test_dist.loc

In [355]:
observed_prediction = pred_means.max(0)[1]
(y_actual - observed_prediction).to(torch.int)


tensor([ 0,  0,  0,  0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  2,  0,  0,  5,
         0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -6,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,
         0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0], dtype=torch.int32)

In [361]:
residual = (y_actual - observed_prediction).to(torch.int)
accuracy = (len(y_actual) - torch.count_nonzero(residual)) / len(y_actual) * 100
print("We predict the next location (current cluster) with an accuracy of: ", '%.2f' % accuracy, "%", sep="")

print("Total Error: ", '%d' % torch.sum(torch.abs(residual)), sep="")

We predict the next location (current cluster) with an accuracy of: 92.47%
Total Error: 18



## Notes

- Among the conventional ML baselines for classification, random forest does very well, with a prediction accuracy of 91.4%. (Its bootstrap sampling is unseeded, so this varies between runs; the run shown above reached 93.5% at its best depth.) In my experiments so far, I have reached 92.5% accuracy using a modified Dirichlet kernel. A marginal increase over 91.4%, but an increase nonetheless.

- I encode using only last_one, rather than last_two or last_three, because the encoding treats each tuple as a single unit rather than as a sequence. However, we have empirical evidence that predictive power improves when the similarity of trip locations is considered as a sequence. For instance, comparing [0, 1, 2] with [0, 1, 3] position by position gives a similarity of 2 out of 3, whereas a same-vs-not-same comparison gives 0. I therefore use last_one as the encoding baseline, but I want to find methods that let me break down and compare these sequences (e.g., Hamming distance).

- Consider prediction fairness (create a frequency table).
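
The position-by-position comparison described in the second note can be sketched as a normalized Hamming similarity. This is only an illustration of the idea, not a kernel implementation, and the function name is hypothetical.

```python
def hamming_similarity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if len(seq_a) == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# The example from the note: two of three positions agree
sim_partial = hamming_similarity((0, 1, 2), (0, 1, 3))

# Completely different histories
sim_none = hamming_similarity((0, 1, 2), (3, 4, 5))

print(sim_partial, sim_none)
```

One caveat: NaN placeholders never compare equal to each other, so positions with missing history would count as mismatches under this definition.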