<h1>
<center>Graph Kernel SVM Regression</center>
</h1>

<font size="3"> 
In this notebook, we train and evaluate a graph-based SVR model.
    
In more detail:   
- We use the PropagationAttr kernel to compute kernel matrices (pairwise graph similarities) from graph data.
- PropagationAttr returns a kernel matrix of shape (num_of_train_graphs, num_of_train_graphs) for the train set and of shape (num_of_test_graphs, num_of_train_graphs) for the test set.
- We use these kernel matrices as precomputed-kernel input for an SVM MultiOutputRegressor.
- The output of that model is an array with shape (num_of_graphs, num_of_sectors).
- We use that output to train and test the model with different hyper-parameters of PropagationAttr.

Note: We use the ParkingViolation and Chickenpox datasets for our experiments.
</font>
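
<font size="3">
To make the shapes above concrete, here is a minimal sketch of the precomputed-kernel workflow. It does not use PropagationAttr: a plain linear kernel on random vectors stands in for the graph kernel, and all sizes and variable names are illustrative assumptions.
</font>

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

# Illustrative sizes only (assumptions, not the real dataset sizes).
n_train, n_test, n_features, n_sectors = 20, 6, 4, 3

rng = np.random.default_rng(0)
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=(n_train, n_sectors))

# Stand-in kernel: a linear kernel replaces the graph kernel here.
K_train = X_train @ X_train.T  # shape (n_train, n_train)
K_test = X_test @ X_train.T    # shape (n_test, n_train)

# One SVR per output column, each fitted on the same precomputed kernel.
model = MultiOutputRegressor(SVR(kernel='precomputed'))
model.fit(K_train, y_train)

y_pred = model.predict(K_test)
print(K_train.shape, K_test.shape, y_pred.shape)
```

<font size="3">
The test kernel has one row per test sample and one column per train sample, which is why prediction needs the train set the kernel was fitted on.
</font>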

## Generals

<font size="3"> 
Packages import and system configurations. 
</font>

In [None]:
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from grakel.kernels import PropagationAttr
import pandas as pd
import pickle
from datetime import datetime as dt
from sklearn import metrics
import matplotlib.pyplot as plt
import multiprocessing

cores = multiprocessing.cpu_count()-2
project_path = '/Users/nickkarras/PycharmProjects/Graph_Based_SVR'

<font size="3"> 
Datasets paths. 
</font>

In [None]:
#Parking Violation
G_train_path_park = project_path + '/Data/ParkingViolationPrediction/G_Train.pkl'
G_test_path_park = project_path + '/Data/ParkingViolationPrediction/G_Test.pkl'
test_targets_path_park = project_path + '/Data/ParkingViolationPrediction/Init/Test_Targets.csv'
train_targets_path_park = project_path + '/Data/ParkingViolationPrediction/Init/Train_Targets.csv'
test_mask_path_park = project_path + '/Data/ParkingViolationPrediction/Init/Test_Mask.csv'

#Chickenpox
G_train_path_chic = project_path + '/Data/Chickenpox/G2_Train.pkl'
G_test_path_chic = project_path + '/Data/Chickenpox/G2_Test.pkl'
test_targets_path_chic = project_path + '/Data/Chickenpox/Init/Chickenpox_Test_targets.csv'
train_targets_path_chic = project_path + '/Data/Chickenpox/Init/Chickenpox_Train_targets.csv'
test_mask_path_chic = None

## Data Preprocessing 

<font size="3"> 
A function that takes the data paths as input and returns the data.
</font>

In [None]:
def data_load(G_train_path,G_test_path,test_targets_path,train_targets_path,use_test_mask,test_mask_path):
    with open(G_train_path, 'rb') as inp:
        G_train = pickle.load(inp)
    with open(G_test_path, 'rb') as inp:
        G_test = pickle.load(inp)
        
    y_train = pd.read_csv(train_targets_path,sep=',', index_col=0)
    y_test = pd.read_csv(test_targets_path,sep=',', index_col=0)
    if use_test_mask:
        test_mask = pd.read_csv(test_mask_path,index_col=0)
    else:
        test_mask = None
    return G_train,G_test,y_train,y_test,test_mask

<font size="3"> 
A function that takes the datasets and returns a subset of each one according to the given data sizes.
</font>

In [None]:
def get_subset(G_train,G_test,y_train,y_test,use_test_mask,test_mask,train_size,test_size):
    G_train = G_train[0:train_size]
    G_test = G_test[0:test_size]
    y_train = y_train.iloc[:,:train_size]
    y_test = y_test.iloc[:,:test_size]
    if use_test_mask:
        test_mask = test_mask.iloc[:,:test_size]
    else:
        test_mask = None
    return G_train,G_test,y_train,y_test,test_mask

<font size="3"> 
A function that takes a subset of the data if the given variable is True and reshapes the targets to the necessary shape.
</font>

In [None]:
def data_preprocess(G_train,G_test,y_train,y_test,use_test_mask,test_mask,subset,train_size,test_size):
    if subset:
        G_train,G_test,y_train,y_test,test_mask = get_subset(G_train,G_test,y_train,y_test,use_test_mask,test_mask,train_size,test_size)

    y_train = np.array(y_train.T)
    y_test = np.array(y_test.T)
    if use_test_mask:
        test_mask = np.array(test_mask.T)
    else:
        test_mask = None
    return G_train,G_test,y_train,y_test,test_mask
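
<font size="3">
A small sketch of the subset-and-transpose step on a toy table. It assumes the target files hold one row per sector and one column per graph, which is what the column slicing and the transpose in the code suggest; the labels s0..s2 and g0..g3 are made up for illustration.
</font>

```python
import numpy as np
import pandas as pd

# Toy targets: 3 sectors (rows) x 4 graphs (columns); labels are hypothetical.
targets = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=['s0', 's1', 's2'],
    columns=['g0', 'g1', 'g2', 'g3'],
)

# Keep the first two graphs, as get_subset does with iloc[:, :size].
subset = targets.iloc[:, :2]

# Transpose so that rows are graphs and columns are sectors.
y = np.array(subset.T)
print(y.shape)  # (2, 3)
print(y)
```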

## Model Evaluation

<font size="3"> 
A function that calculates the Mean Absolute Error (MAE) and Mean Squared Error (MSE) between predictions and actual targets for train and test sets. 

    
In the case of the Parking data, it uses a mask so that the errors are calculated only for the raw targets.
</font>

In [None]:
def calculate_metrics_on_actuals(eval_set,y_pred,y_test,use_test_mask,test_mask):
    
    if eval_set == 'test':
        pred = []
        actual = []
        if use_test_mask:
            for i in range(0,(len(y_pred))):
                for k in range (0,len(y_pred[0])):
                    if test_mask[i][k] == 1:
                        prd = y_pred[i][k]
                        pred.append(float(prd))
                        act = y_test[i][k]
                        actual.append(float(act))
        else:
            for i in range(0,(len(y_pred))):
                for k in range (0,len(y_pred[0])):
                    prd = y_pred[i][k]
                    pred.append(float(prd))
                    act = y_test[i][k]
                    actual.append(float(act))
    
    elif eval_set == 'train':
        pred = []
        actual = []
        for i in range(0,(len(y_pred))):
            for k in range (0,len(y_pred[0])):
                prd = y_pred[i][k]
                pred.append(float(prd))
                act = y_test[i][k]
                actual.append(float(act))    
        
        
    MAE = round(metrics.mean_absolute_error(actual, pred),5)
    print (f"The Mean Absolute Error (MAE) calculated for the {eval_set} set is: {MAE}")
    MSE = round(metrics.mean_squared_error(actual, pred),5)
    print (f"The Mean Squared Error (MSE) calculated for the {eval_set} set is: {MSE}")
    return MAE,MSE,pred,actual
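
<font size="3">
The masked selection can also be written with NumPy boolean indexing instead of nested loops. This is only a sketch on toy arrays, assuming the mask has the same shape as the targets and marks raw targets with 1; the values below are made up for illustration.
</font>

```python
import numpy as np
from sklearn import metrics

# Toy arrays (made-up values): 2 graphs x 3 sectors.
y_true = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y_pred = np.array([[1.5, 2.0, 0.0], [4.0, 7.0, 6.0]])
mask = np.array([[1, 1, 0], [0, 1, 1]])

# Boolean indexing keeps only the entries where the mask equals 1,
# in the same row-major order as the nested loops.
keep = mask == 1
actual = y_true[keep]
pred = y_pred[keep]

mae = metrics.mean_absolute_error(actual, pred)
mse = metrics.mean_squared_error(actual, pred)
print(mae, mse)  # 0.625 1.0625
```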

<font size="3">
The default kernel computation function (we only ran some experiments with it).
</font>

In [None]:
def _dot(x, y):
    return sum(x[k]*y[k] for k in x)

<font size="3">
Metric function for kernel computation, where x and y are the vectors produced by LSH for each graph. The convolution product is computed only at points where the signals overlap completely.
</font>

In [None]:
def _conv(x, y):
    a=list(x[k] for k in x)
    b=list(y[k] for k in x)
    c=np.convolve(a,b,'valid')
    return c[0]
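
<font size="3">
To see how the two metrics differ, the sketch below evaluates both on the same toy inputs. It assumes x and y are mappings with identical keys (plain dicts here, chosen for illustration; the actual objects passed by the kernel may differ). For two sequences of equal length, the 'valid' convolution has a single entry, which is the dot product of the first sequence with the reversed second one.
</font>

```python
import numpy as np

def _dot(x, y):
    # Plain dot product over the keys of x.
    return sum(x[k] * y[k] for k in x)

def _conv(x, y):
    # Values of x and y in the key order of x, then a 'valid' convolution.
    a = [x[k] for k in x]
    b = [y[k] for k in x]
    return np.convolve(a, b, 'valid')[0]

# Toy inputs (made-up values) sharing the same keys.
x = {0: 1.0, 1: 2.0, 2: 3.0}
y = {0: 4.0, 1: 5.0, 2: 6.0}

dot_val = _dot(x, y)    # 1*4 + 2*5 + 3*6 = 32
conv_val = _conv(x, y)  # 1*6 + 2*5 + 3*4 = 28
print(dot_val, conv_val)
```

<font size="3">
Because the convolution reverses one sequence, its value depends on the iteration order of the keys, whereas the dot product does not.
</font>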

<font size="3">
A core function that calculates the prediction errors by applying the following steps:
<ol>
<li>Take the necessary data and the given hyper-parameters.</li>
<li>Apply graph-based kernel calculations using PropagationAttr according to the given parameters.</li>
<li>Fit and transform the train graphs.</li>
<li>Transform the test graphs.</li>
<li>Create a kernel matrix (pairwise graph similarities) for the train and test graphs.</li>
<li>Train a MultiOutputRegressor SVM regressor with a precomputed kernel using these kernel matrices.</li>
<li>Calculate metrics for the train set and for the raw test targets.</li>
</ol>
</font>

In [None]:
def evaluate_model(G_train,G_test,y_train,y_test,use_test_mask,test_mask,parameters,cores):
    start = dt.now()
    graph_kernels = PropagationAttr(metric=_conv,t_max=parameters['t_max'],w=parameters['w'],M=parameters['M'],normalize=True,n_jobs=cores)
    K_train = graph_kernels.fit_transform(G_train)
    K_test = graph_kernels.transform(G_test)
    G_train = []
    G_test = []
    running_secs = (dt.now() - start).seconds
    print (f"\nPropagationAttr training has finished successfully in {running_secs} seconds")
    print (f"Parameters : t_max={parameters['t_max']}, w={parameters['w']}, M={parameters['M']}")
    SVM_Mregressor = MultiOutputRegressor(SVR(kernel='precomputed'))
    SVM_Mregressor.fit(K_train, y_train)
    
    y_train_pred = SVM_Mregressor.predict(K_train)
    train_MAE,train_MSE,train_pred,train_actual = calculate_metrics_on_actuals('train',y_train_pred,y_train,use_test_mask,test_mask)
    
    y_pred = SVM_Mregressor.predict(K_test)
    MAE,MSE,pred,actual = calculate_metrics_on_actuals('test',y_pred,y_test,use_test_mask,test_mask)
    return MAE,MSE,pred,actual

<font size="3">
A function that uses the evaluate_model function to run experiments for all of the given hyper-parameters while saving the calculated metrics, predictions and targets.
</font>

In [None]:
def parameters_tuning(G_train,G_test,y_train,y_test,use_test_mask,test_mask,parameters,cores):
    results = []
    predictions = []
    true_values = []
    for i in range (0,len(parameters)):
        MAE,MSE,pred,actual = evaluate_model(G_train,G_test,y_train,y_test,use_test_mask,test_mask,parameters[i],cores)
        results.append({'Parameters':parameters[i],'MAE':MAE,'MSE':MSE})
        predictions.append(pred)
        true_values.append(actual)
    return results,predictions,true_values
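
<font size="3">
The hyper-parameter list can also be generated rather than typed by hand. The sketch below rebuilds the twelve-entry Parking grid used in this notebook with itertools.product and adds a hypothetical helper, best_by_mae, that picks the entry with the lowest MAE from a results list in the format returned by parameters_tuning. The MAE values in the toy results are placeholders, not experiment output.
</font>

```python
from itertools import product

t_max_values = [5, 10]
w_values = [20, 25, 30]
M_values = ['L1', 'L2']

# M varies slowest and w fastest, matching the hand-written grid order.
parameters = [
    {'t_max': t_max, 'w': w, 'M': M}
    for M, t_max, w in product(M_values, t_max_values, w_values)
]

def best_by_mae(results):
    # Hypothetical helper: index of the result with the lowest MAE.
    return min(range(len(results)), key=lambda i: results[i]['MAE'])

# Placeholder results (made-up numbers) in the parameters_tuning format.
toy_results = [
    {'Parameters': parameters[0], 'MAE': 3.0, 'MSE': 12.0},
    {'Parameters': parameters[1], 'MAE': 2.5, 'MSE': 10.0},
    {'Parameters': parameters[2], 'MAE': 2.8, 'MSE': 11.0},
]
best_index = best_by_mae(toy_results)
print(len(parameters), parameters[11], best_index)
```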

## Plotting 

<font size="3">
A function that takes the predictions and the true values and creates a plot of them.
</font>

In [None]:
def plot_actuals_predictions(predicted_value,true_value,data_name):
    plt.figure(figsize=(7,7))
    plt.scatter(true_value, predicted_value, c='crimson')
    p1 = max(max(predicted_value), max(true_value))
    p2 = min(min(predicted_value), min(true_value))
    plt.title('Actual vs Predicted Values')
    plt.plot([p1, p2], [p1, p2], 'b-')
    plt.xlabel('True Values', fontsize=12)
    plt.ylabel('Predictions', fontsize=12)
    plt.axis('equal')
    #plt.savefig('Exports/Graph_Based_SVR_Predictions_' + data_name + '.pdf')
    plt.show()

## Functionality Combinations for Parking Data

In [None]:
parameters = [{"t_max":5,'w':20,'M':'L1'},{"t_max":5,'w':25,'M':'L1'},{"t_max":5,'w':30,'M':'L1'},
          {"t_max":10,'w':20,'M':'L1'},{"t_max":10,'w':25,'M':'L1'},{"t_max":10,'w':30,'M':'L1'},
            {"t_max":5,'w':20,'M':'L2'},{"t_max":5,'w':25,'M':'L2'},{"t_max":5,'w':30,'M':'L2'},
          {"t_max":10,'w':20,'M':'L2'},{"t_max":10,'w':25,'M':'L2'},{"t_max":10,'w':30,'M':'L2'}]

#parameters = [{"t_max":10,'w':20,'M':'L2'}]

G_train,G_test,y_train,y_test,test_mask = data_load(G_train_path_park,G_test_path_park,test_targets_path_park,train_targets_path_park,True,test_mask_path_park)
G_train,G_test,y_train,y_test,test_mask = data_preprocess(G_train,G_test,y_train,y_test,True,test_mask,False,20,6)
results,predictions,true_values = parameters_tuning(G_train,G_test,y_train,y_test,True,test_mask,parameters,cores)
plot_actuals_predictions(predictions[11],true_values[11],'Parking')

## Functionality Combinations for ChickenPox Data

In [None]:
parameters = [{"t_max":15,'w':20,'M':'L1'},{"t_max":15,'w':25,'M':'L1'},{"t_max":15,'w':30,'M':'L1'},
          {"t_max":25,'w':20,'M':'L1'},{"t_max":25,'w':25,'M':'L1'},{"t_max":25,'w':30,'M':'L1'},
            {"t_max":15,'w':20,'M':'L2'},{"t_max":15,'w':25,'M':'L2'},{"t_max":15,'w':30,'M':'L2'},
          {"t_max":25,'w':20,'M':'L2'},{"t_max":25,'w':25,'M':'L2'},{"t_max":25,'w':30,'M':'L2'}]

#parameters = [{"t_max":5,'w':20,'M':'L1'}]

G_train,G_test,y_train,y_test,test_mask = data_load(G_train_path_chic,G_test_path_chic,test_targets_path_chic,train_targets_path_chic,False,test_mask_path_chic)
G_train,G_test,y_train,y_test,test_mask = data_preprocess(G_train,G_test,y_train,y_test,False,test_mask,False,200,60)
results,predictions,true_values = parameters_tuning(G_train,G_test,y_train,y_test,False,test_mask,parameters,cores)
plot_actuals_predictions(predictions[2],true_values[2],'ChickenPox')