<h1>
<center> Experiments with PCA Kernels and LDA</center>
</h1>

<font size="3"> 
In this notebook, we train and test various MultiOutput Classifiers (PCA kernels + LDA) from the sklearn library.

- All experiments use non-graph-based approaches, i.e. the information on edges and edge weights is not used.
    
- The data were converted to a common dataframe format and we use only the raw data for testing.

- Our datasets pose regression problems. However, because the target values lie in the range 0-1, we could treat the problem as a classification task: we rounded the target values and handled them as classes. The main reason for using regression datasets is described in the report.
    
In summary, this notebook was created in order to compare the sklearn models with our graph kernel PCA technique.
    
     Ps: We use ParkingViolation and Chickenpox datasets for our experiments  
</font>

## Generals

<font size="3"> 
Packages import and system configurations. 
</font>

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ParameterGrid
import pandas as pd
import pickle
from tqdm import tqdm
from datetime import datetime as dt
import tensorflow as tf
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import multiprocessing
from sklearn.pipeline import Pipeline
import os

cores = multiprocessing.cpu_count()-2
project_path = os.getcwd()

<font size="3"> 
Datasets paths. 
</font>

In [None]:
#ParkingViolation
train_set_path_park = project_path + '/Data/ParkingViolationPrediction_PCA_LDA/Init/Train_Dataset_Graph.pkl'
test_set_path_park = project_path + '/Data/ParkingViolationPrediction_PCA_LDA/Init/Test_Dataset_Graph.pkl'
train_targets_path_park = project_path + '/Data/ParkingViolationPrediction_PCA_LDA/Init/Train_Targets.csv'
test_targets_path_park = project_path + '/Data/ParkingViolationPrediction_PCA_LDA/Init/Test_Targets.csv'
train_mask_path_park = project_path + '/Data/ParkingViolationPrediction_PCA_LDA/Init/Train_Mask.csv'
test_mask_path_park = project_path + '/Data/ParkingViolationPrediction_PCA_LDA/Init/Test_Mask.csv'

#Chickenpox
train_set_path_chic = project_path + '/Data/Chickenpox/Init/Chickenpox_Train_data.pkl'
test_set_path_chic = project_path + '/Data/Chickenpox/Init/Chickenpox_Test_data.pkl'
train_targets_path_chic = project_path + '/Data/Chickenpox/Init/Chickenpox_Train_targets.csv'
test_targets_path_chic = project_path + '/Data/Chickenpox/Init/Chickenpox_Test_targets.csv'
train_mask_path_chic = None
test_mask_path_chic = None

## Data Preprocessing 

<font size="3"> 
A function that takes the data paths as input and returns the data or a subset of them.
</font>

In [None]:
def data_load(train_set_path,test_set_path,train_targets_path,test_targets_path,use_mask,train_mask_path,test_mask_path,subset,train_size,test_size):
    with open(train_set_path, 'rb') as inp:
        train_set = pickle.load(inp)
    with open(test_set_path, 'rb') as inp:
        test_set = pickle.load(inp)
      
    train_targets = pd.read_csv(train_targets_path,index_col=0)
    test_targets = pd.read_csv(test_targets_path,index_col=0)
    if use_mask:
        train_mask = pd.read_csv(train_mask_path,index_col=0)
        test_mask = pd.read_csv(test_mask_path,index_col=0)
    else:
        train_mask = None
        test_mask = None 
        
    if subset:
        train_set,test_set,train_targets,test_targets,train_mask,test_mask = get_subset(train_set,test_set,
                                        train_targets,test_targets,use_mask,train_mask,test_mask,train_size,test_size)
    
    return train_set,test_set,train_targets,test_targets,train_mask,test_mask  

<font size="3"> 
A function that takes the datasets and returns a subset of each one according to the given data sizes.
</font>

In [None]:
def get_subset(train_set,test_set,train_targets,test_targets,use_mask,train_mask,test_mask,train_size,test_size):
    train_set = train_set[0:train_size]
    test_set = test_set[0:test_size]
    train_targets = train_targets.iloc[:,:train_size]
    test_targets = test_targets.iloc[:,:test_size]
    if use_mask:
        train_mask = train_mask.iloc[:,:train_size]
        test_mask = test_mask.iloc[:,:test_size]
    else:
        train_mask = None
        test_mask = None
        
    return train_set,test_set,train_targets,test_targets,train_mask,test_mask

## Main functionality

<font size="3"> 
A function that converts our target values to classes.

In more detail, it scales the target column from the range (0-1) to (0-100).
</font>

In [None]:
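def values_to_classes(df):
    df = df.round(decimals=2)
    df = df * 100
    # round again before casting: e.g. 0.29 * 100 = 28.999999999999996, which astype('int') would truncate to 28
    df = df.round().astype('int')
    return df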
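One detail worth checking in this conversion is floating-point truncation: multiplying a rounded value by 100 can land just below the intended integer, and an integer cast truncates rather than rounds. The sketch below uses made-up target values (they are not taken from our datasets) to compare a plain cast with a cast preceded by an extra `round()`.

```python
import pandas as pd

# Illustrative targets in the range 0-1 (hypothetical values).
targets = pd.Series([0.0, 0.294, 0.57, 1.0])

# Same first two steps as the conversion: round to 2 decimals, scale by 100.
scaled = targets.round(decimals=2) * 100

# A plain cast truncates, so a value such as 28.999999999999996 becomes 28.
truncated = scaled.astype("int")

# Rounding first snaps each value to the nearest integer class.
rounded = scaled.round().astype("int")

print(truncated.tolist())
print(rounded.tolist())
```

If the two printed lists differ, some targets would be assigned to the class below the intended one.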

<font size="3"> 
A function that takes a list of dataframes describing the features, a dataframe with the targets, and a mask.

It returns only the raw X and y in dataframe format.
</font>

In [None]:
def data_preprocess(data_set,y,mask):
    names = ['Date_Sin','Holidays','Capacity','temp','humidity','Week_Day_Sin','Month_Sin','Real_Time','Γενικό Νοσοκομείο Θεσσαλονίκης «Γ. Γεννηματάς»', 'Λιμάνι' ,'Δημαρχείο Θεσσαλονίκης','Λευκός Πύργος','Αγορά Καπάνι','Λαδάδικα','Πλατεία Άθωνος','Πλατεία Αριστοτέλους','Ροτόντα','Πλατεία Αγίας Σοφίας','Πλατεία Αντιγονιδών','Μουσείο Μακεδονικού Αγώνα','Πλατεία Ναυαρίνου','Πάρκο ΧΑΝΘ','Ιερός Ναός Αγίου Δημητρίου','ΔΕΘ','ΑΠΘ','Άγαλμα Ελευθερίου Βενιζέλου','Ρωμαϊκή Αγορά Θεσσαλονίκης','Predictions']
    for i in tqdm (range (0,len(data_set))):
        data_set[i] = data_set[i].sort_values("Slot_id")
        data_set[i] = data_set[i].set_index("Slot_id")
        data_set[i] = data_set[i][names]
        
        data_set[i] = data_set[i].join(y.iloc[:,i])
        # set_axis already returns a new object; the 'inplace' argument was removed in pandas 2.0
        data_set[i] = data_set[i].set_axis([*data_set[i].columns[:-1], 'Target'], axis=1)
        
        data_set[i] = data_set[i].join(mask.iloc[:,i])
        data_set[i] = data_set[i].set_axis([*data_set[i].columns[:-1], 'Mask'], axis=1)
        
        data_set[i] = data_set[i].loc[data_set[i]['Mask'] == 1]
        
    data_set = pd.concat(data_set).reset_index()
    data_set = data_set.drop(['Mask','Slot_id'], axis=1)
    
    X = data_set.drop(['Target'], axis=1)
    y = values_to_classes(data_set['Target'])
    return X,y
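As an illustration of the join-and-filter step, here is a minimal sketch on a single toy snapshot. The column values, the three slot ids and the use of `rename` (instead of `set_axis`) are assumptions made for the example; only the `Slot_id`, `Target` and `Mask` names come from the function above.

```python
import pandas as pd

# One toy snapshot with a hypothetical feature column.
snapshot = pd.DataFrame({"Slot_id": [2, 1, 3], "Capacity": [10, 20, 30]})
snapshot = snapshot.sort_values("Slot_id").set_index("Slot_id")

# Targets and masks hold one column per snapshot, indexed by slot id.
targets = pd.DataFrame({"0": [0.25, 0.5, 0.75]}, index=[1, 2, 3])
mask = pd.DataFrame({"0": [1, 0, 1]}, index=[1, 2, 3])

# Attach the snapshot's target and mask columns, then keep only observed slots.
frame = snapshot.join(targets.iloc[:, 0].rename("Target"))
frame = frame.join(mask.iloc[:, 0].rename("Mask"))
kept = frame.loc[frame["Mask"] == 1].drop(columns="Mask")

print(kept)
```

Slot 2 is dropped because its mask value is 0, so it contributes neither features nor a target.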

<font size="3"> 
A function that converts our target classes to values.

In more detail, it scales target column from range (0-100) to (0-1)  
</font>

In [None]:
def classes_to_values(y):
    y = [i / 100 for i in y]
    return y

<font size="3"> 
A function that calculates the Mean Absolute Error (MAE) and Mean Squared Error (MSE) between predictions and actual targets for train and test sets. 
    
Although we have framed our problem as a classification task, the classes are integers, so we can map them back to values and compute metrics that are normally applied to regression tasks.
</font>

In [None]:
def calculate_metrics(eval_set,y_pred,y_test):
    y_pred = classes_to_values(y_pred)
    y_test = classes_to_values(y_test)
    MAE = round(metrics.mean_absolute_error(y_test, y_pred),5)
    print (f"The Mean Absolute Error (MAE) calculated for the {eval_set} set is: {MAE}")
    MSE = round(metrics.mean_squared_error(y_test, y_pred),5)
    print (f"The Mean Squared Error (MSE) calculated for the {eval_set} set is: {MSE}")
    return MAE,MSE
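To show why the classes are scaled back before scoring, the sketch below computes the metrics on three hypothetical class labels (not real predictions). Scoring the raw classes would inflate MAE by a factor of 100 and MSE by a factor of 10,000.

```python
from sklearn import metrics

# Hypothetical integer classes in the range 0-100.
y_true_classes = [10, 50, 90]
y_pred_classes = [20, 50, 70]

# Map the classes back to the original 0-1 scale.
y_true = [c / 100 for c in y_true_classes]
y_pred = [c / 100 for c in y_pred_classes]

mae = metrics.mean_absolute_error(y_true, y_pred)
mse = metrics.mean_squared_error(y_true, y_pred)

# For comparison: the same error measured directly on the class scale.
mae_on_classes = metrics.mean_absolute_error(y_true_classes, y_pred_classes)

print(round(mae, 5), round(mse, 5), mae_on_classes)
```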

## PCA Evaluation

<font size="3"> 
A function that helps us select the number of components for KPCA by plotting the cumulative explained variance.
</font>

In [None]:
def plot_components_explain_variance(X_train,data_name,cores):
    kpca = KernelPCA()
    kpca_transform = kpca.fit_transform(X_train)
    explained_variance = np.var(kpca_transform, axis=0)
    explained_variance_ratio = explained_variance / np.sum(explained_variance)
    plt.rcParams["figure.figsize"] = (12,6)
    fig, ax = plt.subplots()
    xi = np.arange(1, explained_variance_ratio.shape[0]+1, step=1)
    y = np.cumsum(explained_variance_ratio)
    plt.ylim(0.0,1.1)
    plt.plot(xi, y, marker='o', linestyle='--', color='b')
    plt.xlabel('Number of Components')
    plt.xticks(np.arange(0, 10, step=1)) # integer ticks for the first components
    plt.ylabel('Cumulative explained variance (ratio)')
    plt.title('The number of components needed to explain variance')
    plt.axhline(y=0.95, color='r', linestyle='-')
    plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
    ax.grid(axis='x') # draw the grid before saving so that it appears in the exported file
    plt.savefig('Exports/n_componets_Kernel_PCA_' + data_name + '.pdf')
    plt.show()
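The same cut-off can also be read programmatically instead of from the plot. The following is a minimal sketch, not part of our pipeline: the helper name `components_for_threshold` and the synthetic data are assumptions, and the variance ratio is computed from the projected data in the same way as in the plotting function.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def components_for_threshold(X, threshold=0.95):
    """Smallest number of KPCA components whose cumulative variance ratio reaches the threshold."""
    projected = KernelPCA().fit_transform(X)
    variance = np.var(projected, axis=0)
    cumulative = np.cumsum(variance / np.sum(variance))
    # searchsorted gives the first index where cumulative >= threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against thresholds that are never reached because of rounding.
    return min(k, len(cumulative))

# Synthetic data in which the first feature dominates the variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] *= 100

print(components_for_threshold(X_demo))
```

On real data the returned value should agree with the point where the plotted curve crosses the red 95% line.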

<font size="3"> 
A function that applies GridSearch with 3-fold cross-validation (`cv=3` in the code) to the pipeline (KPCA + LDA) in order to find the best kernel and hyper-parameters of KPCA (it uses the 'n_components' value chosen from the plot above).
</font>

In [None]:
def kpca_random_grid_search_cv(X_train,y_train,cores,param_grid,n_comp):
    clf = Pipeline([("kpca", KernelPCA(n_components=n_comp)),("lda", LinearDiscriminantAnalysis())])
    KPCA_grid_search = GridSearchCV(clf, param_grid, cv=3, n_jobs=cores, verbose=3,scoring='neg_mean_absolute_error')
    KPCA_grid_search.fit(X_train,y_train)
    print(f"\nBest KPCA model was fitted with the following parameters: {KPCA_grid_search.best_params_}")
    return KPCA_grid_search

<font size="3">
A function that creates a dataframe with the cross-validation results and saves it to a csv file.
</font>

In [None]:
def get_results_df(KPCA_grid_search,project_path,data_name):
    results = pd.DataFrame(KPCA_grid_search.cv_results_)
    results = results.sort_values(by=['rank_test_score'])
    results = results.reset_index()
    results = results.drop(['index'], axis=1)
    results.to_csv(project_path +'/Exports/CV_results_'+ data_name +'.csv')
    print('\nAll Cross Validation Results are described by the dataframe below:')
    return results

<font size="3"> 
A function that helps us to define the best hyper-parameters of LDA model 
    
In more detail:
- Fit the Kernel PCA model using the best hyper-parameters found by GridSearch
- Use the KPCA outputs to fit a MultiOutput LDA model with the given param_grid
- Calculate the necessary metrics in order to find the best hyper-parameters
</font>

In [None]:
def evaluate_lda(X_train,X_test,y_train,y_test,KPCA_grid_search,param_grid,cores):
    start = dt.now()
    transformer = KPCA_grid_search.best_estimator_[0]
    X_train_transformed = transformer.fit_transform(X_train)
    X_test_transformed = transformer.transform(X_test)
 
    LDA = LinearDiscriminantAnalysis(**param_grid)
    LDA.fit(X_train_transformed, y_train)
    running_secs = (dt.now() - start).seconds
    print (f"LDA + PCA model was fitted successfully in {running_secs} seconds")
    
    y_train_pred = LDA.predict(X_train_transformed)
    train_MAE,train_MSE = calculate_metrics('train',y_train_pred,y_train)   
    
    y_pred = LDA.predict(X_test_transformed)
    MAE,MSE = calculate_metrics('test',y_pred,y_test)
    
    return MAE,MSE,running_secs

<font size="3"> 
A function that feeds the above pipeline with different LDA hyper-parameters and saves the evaluation results.
</font>

In [None]:
def lda_evaluation_results(X_train,X_test,y_train,y_test,KPCA_grid_search,param_grid,cores): 
    lda_results = []
    for i in range(0,len(param_grid)):
        print (f"\nParameters: {list(param_grid)[i]}")
        MAE,MSE,running_secs = evaluate_lda(X_train,X_test,y_train,y_test,KPCA_grid_search,param_grid[i],cores)
        lda_results.append({'Parameters':param_grid[i],'MAE':MAE,'MSE':MSE,'Training Time':running_secs})
    lda_results = pd.DataFrame(lda_results)
    lda_results = lda_results.sort_values(by=['MAE'])
    lda_results = lda_results.reset_index()
    lda_results = lda_results.drop(['index'], axis=1)
    return lda_results

## 1. Functionality Combinations for Parking Data

### 1.1 DATA

In [None]:
train_set,test_set,train_targets,test_targets,train_mask,test_mask =  data_load(train_set_path_park,test_set_path_park,
                                train_targets_path_park,test_targets_path_park,True,train_mask_path_park,test_mask_path_park,True,250,60)

X_train,y_train = data_preprocess(train_set,train_targets,train_mask)
X_test,y_test = data_preprocess(test_set,test_targets,test_mask)
del(train_set,train_targets,train_mask,test_set,test_targets,test_mask)

plot_components_explain_variance(X_train,'Parking',cores)

### 1.2 KPCA Evaluation

In [None]:
kpca_param_grid = [{"kpca__kernel": ['poly'],"kpca__gamma": [0.3,0.4,0.5],"kpca__degree":[3,5]},
              {"kpca__kernel": ['rbf'],"kpca__gamma": [0.3,0.4,0.5,1]},
              {"kpca__kernel": ['sigmoid'],"kpca__gamma": [0.3,0.4,0.5],"kpca__coef0":[1,2]}]
n_comp_parking = 6
KPCA_grid_search = kpca_random_grid_search_cv(X_train,y_train,cores,kpca_param_grid,n_comp_parking)
results_kpca = get_results_df(KPCA_grid_search,project_path,'Parking')
results_kpca
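The `kpca__` prefix in the grid keys routes each value to the `kpca` step of the pipeline. The sketch below rebuilds the same grid without fitting anything, to show how many candidates the search visits and how one candidate is applied. The variable names `pipe`, `candidates`, `n_candidates` and `n_fits` are introduced here for illustration only.

```python
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

pipe = Pipeline([("kpca", KernelPCA(n_components=6)),
                 ("lda", LinearDiscriminantAnalysis())])

kpca_param_grid = [
    {"kpca__kernel": ["poly"], "kpca__gamma": [0.3, 0.4, 0.5], "kpca__degree": [3, 5]},
    {"kpca__kernel": ["rbf"], "kpca__gamma": [0.3, 0.4, 0.5, 1]},
    {"kpca__kernel": ["sigmoid"], "kpca__gamma": [0.3, 0.4, 0.5], "kpca__coef0": [1, 2]},
]

# Expand the grid: 6 poly + 4 rbf + 6 sigmoid combinations.
candidates = ParameterGrid(kpca_param_grid)
n_candidates = len(candidates)

# With cv=3, every candidate is fitted once per fold (the final refit is extra).
n_fits = n_candidates * 3

# Applying one candidate: keys of the form "<step>__<param>" reach the named step.
pipe.set_params(**candidates[0])

print(n_candidates, n_fits, pipe.named_steps["kpca"].kernel)
```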

### 1.3 LDA evaluation

In [None]:
lda_parameters = [{"solver":['svd'],'store_covariance':[False,True],'tol':[0.05,0.5]},
                {"solver":['lsqr','eigen'],'shrinkage':['auto',None,0.2,0.4]}]
LDA_param_grid = ParameterGrid(lda_parameters)
results_lda = lda_evaluation_results(X_train,X_test,y_train,y_test,KPCA_grid_search,LDA_param_grid,cores)
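To make the size of this sweep concrete, the sketch below expands the same parameter list and unpacks every combination into an unfitted `LinearDiscriminantAnalysis`, which is how each dictionary is consumed during evaluation. The names `grid` and `models` are introduced for this example only.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ParameterGrid

lda_parameters = [
    {"solver": ["svd"], "store_covariance": [False, True], "tol": [0.05, 0.5]},
    {"solver": ["lsqr", "eigen"], "shrinkage": ["auto", None, 0.2, 0.4]},
]

# 4 combinations for 'svd' plus 8 for the shrinkage-capable solvers.
grid = ParameterGrid(lda_parameters)

# ParameterGrid supports integer indexing; each entry is a dict of keyword arguments.
models = [LinearDiscriminantAnalysis(**grid[i]) for i in range(len(grid))]

print(len(models))
print(grid[0])
```

Shrinkage is listed only for `lsqr` and `eigen` because the `svd` solver does not support it.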

In [None]:
results_lda

## 2. Functionality Combinations for ChickenPox Data

In [None]:
def data_preprocess_chic(data_set,y):
    for i in tqdm (range (0,len(data_set))):       
        data_set[i] = data_set[i].join(y.iloc[:,i])
        # set_axis already returns a new object; the 'inplace' argument was removed in pandas 2.0
        data_set[i] = data_set[i].set_axis([*data_set[i].columns[:-1], 'Target'], axis=1)
        
    data_set = pd.concat(data_set).reset_index()
    
    y = values_to_classes(data_set['Target'])
    X = data_set.drop(['Target','index'], axis=1)
    data_set = []
    return X,y

### 2.1 DATA

In [None]:
train_set,test_set,train_targets,test_targets,train_mask,test_mask =  data_load(train_set_path_chic,test_set_path_chic,
                                train_targets_path_chic,test_targets_path_chic,False,train_mask_path_chic,test_mask_path_chic,False,200,50)

X_train,y_train = data_preprocess_chic(train_set,train_targets)
X_test,y_test = data_preprocess_chic(test_set,test_targets)
del(train_set,train_targets,train_mask,test_set,test_targets,test_mask)
plot_components_explain_variance(X_train,'Chickenpox',cores)

### 2.2 KPCA Evaluation

In [None]:
kpca_param_grid_chic = [{"kpca__kernel": ['poly'],"kpca__gamma": [0.3,0.4,0.5],"kpca__degree":[3,5]},
              {"kpca__kernel": ['rbf'],"kpca__gamma": [0.3,0.4,0.5,1]},
              {"kpca__kernel": ['sigmoid'],"kpca__gamma": [0.3,0.4,0.5],"kpca__coef0":[1,2]}]
n_comp_chick = 3
KPCA_grid_search_chic = kpca_random_grid_search_cv(X_train,y_train,cores,kpca_param_grid_chic,n_comp_chick)
results_kpca_chic = get_results_df(KPCA_grid_search_chic,project_path,'Chickenpox')

### 2.3 LDA evaluation

In [None]:
lda_parameters_chic = [{"solver":['svd'],'store_covariance':[False,True],'tol':[0.1,0.001]},
                {"solver":['lsqr','eigen'],'shrinkage':['auto',None,0.2,0.4]}]
LDA_param_grid_chic = ParameterGrid(lda_parameters_chic)
results_lda_chic = lda_evaluation_results(X_train,X_test,y_train,y_test,KPCA_grid_search_chic,LDA_param_grid_chic,cores)

In [None]:
results_lda_chic