# Intro

This notebook summarizes the experimental results obtained with the Kedro pipeline and presents static figures as well as model scores.

It contains the following sections:
1. Data loading - loading the data from the Kedro catalog: model metrics from cross-validation runs.
2. Data inspection - visualizing the data using boxplots.
3. Statistical analysis - performing statistical analysis of the data using the Kruskal-Wallis test and post-hoc pairwise tests.
4. Test metrics analysis - comparing the test metrics of the models.

# Lib imports

In [2]:
%load_ext kedro.ipython

In [37]:
import pandas as pd
import pingouin as pg
import plotly as py
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st

import IPython.display as dd

from deep_hybrid_recommender.pipelines.experiment.nodes import perform_statistical_comparison

In [8]:
%reload_kedro

# Load data

In [9]:
metrics_to_show = ['val_MAPE', 'val_MAE', 'val_MSE']

In [10]:
colab_filt_val_metrics = catalog.load("experiment.colab_filtering_crossval_val_metrics")
deep_colab_filt_val_metrics = catalog.load("experiment.deep_colab_filtering_crossval_val_metrics")
hybrid_rec_val_metrics = catalog.load("experiment.deep_hybrid_rec_crossval_val_metrics")
gnn_rec_val_metrics = catalog.load("experiment.gnn_rec_crossval_val_metrics")

all_metrics = pd.concat([colab_filt_val_metrics, deep_colab_filt_val_metrics, hybrid_rec_val_metrics, gnn_rec_val_metrics], axis=0, ignore_index=True)
all_metrics = pd.melt(all_metrics, id_vars='model_name', var_name='metric')
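The reshaping step is easier to follow on a toy frame. The sketch below uses made-up numbers and two hypothetical models (`a` and `b`); it assumes each loaded frame has a `model_name` column plus one column per metric, which is what the `melt` call above implies.

```python
import pandas as pd

# Hypothetical per-fold metrics for two models (all values are made up).
model_a = pd.DataFrame({"model_name": ["a", "a"], "val_MAE": [0.50, 0.52], "val_MSE": [0.30, 0.31]})
model_b = pd.DataFrame({"model_name": ["b", "b"], "val_MAE": [0.20, 0.21], "val_MSE": [0.10, 0.12]})

# Stack the per-model frames row-wise and discard the original indices.
wide_df = pd.concat([model_a, model_b], axis=0, ignore_index=True)

# Wide -> long: one row per (model, metric, fold) observation.
long_df = pd.melt(wide_df, id_vars="model_name", var_name="metric")

print(long_df)
```

The long format has the columns `model_name`, `metric` and `value`, which is the shape that the faceted boxplot and the per-metric comparisons later in the notebook filter on.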

# Validation metrics inspection

In [17]:
all_metrics.pivot_table(index='model_name', columns='metric', aggfunc='mean')

model_name,train_MAE,train_MAPE,train_MSE,val_MAE,val_MAPE,val_MSE
colab filtering,3.320459,0.719094,14.875977,3.752275,0.813148,17.231289
deep colab filtering,0.437923,0.146391,0.442674,0.476319,0.15738,0.512165
gnn recommender,,,,0.136473,0.041248,0.127631
hybrid recommender,0.150604,0.051982,0.125893,0.16664,0.06205,0.206777


## Visualization of validation metrics

In [12]:
fig = px.box(
    all_metrics.loc[all_metrics.metric.isin(metrics_to_show)],
    color='model_name',
    y='value',
    facet_col='metric',
    title='Validation metrics comparison')
fig.write_html("val_metrics.html")
fig.show()

Visual inspection of the plots indicates that all three validation metrics are lowest for the GNN recommender model. Of the four approaches considered in the study, the classic collaborative filtering model scored worst on every metric.

## Statistical comparisons of metrics

Samples from repeated k-fold cross-validation are not independent: each observation is deliberately assigned to either the training or the test subset, without duplication, and the measurements are repeated k times for each of the compared models. For this reason, the Kruskal-Wallis test was selected as a non-parametric, rank-based alternative to classic one-way ANOVA.

For each metric, the following two tests are performed:

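1. The test for **overall** differences:
    1. **H0** - mean ranks of the groups are the same.
    2. **HA** - mean ranks of the groups are not the same.
2. Post-hoc test for **pairwise** differences: non-parametric, two-sided Mann-Whitney U tests with Bonferroni correction for multiple comparisons.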
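To make the overall test concrete, the following sketch computes the Kruskal-Wallis statistic by hand from rank sums. The rank sums are an assumption, not data loaded from the catalog: they describe four groups of 10 scores without ties, where the two collaborative filtering models occupy the highest ranks and the hybrid and GNN recommenders share ranks 1 to 20, which is the pattern the val_MAPE output of this notebook suggests. The closed-form p-value is valid for 3 degrees of freedom only.

```python
import math

# Assumed setup: 4 models x 10 validation scores, no ties, ranks 1..40.
n_per_group = 10
rank_sums = {
    "colab filtering": sum(range(31, 41)),       # ranks 31..40 -> 355
    "deep colab filtering": sum(range(21, 31)),  # ranks 21..30 -> 255
    "hybrid recommender": 135,                   # shares ranks 1..20 with GNN
    "gnn recommender": 75,
}

n_total = n_per_group * len(rank_sums)
# Sanity check: every rank from 1 to N is used exactly once.
assert sum(rank_sums.values()) == n_total * (n_total + 1) // 2

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
h_stat = (
    12 / (n_total * (n_total + 1))
    * sum(r ** 2 / n_per_group for r in rank_sums.values())
    - 3 * (n_total + 1)
)

# Chi-square survival function, closed form for 3 degrees of freedom.
p_value = (
    math.erfc(math.sqrt(h_stat / 2))
    + math.sqrt(2 * h_stat / math.pi) * math.exp(-h_stat / 2)
)

print(f"H = {h_stat:.6f}, p = {p_value:.3e}")
```

Under these assumptions H comes out at about 34.24 with a p-value on the order of 1e-7. In practice a library routine (for example from `scipy.stats` or `pingouin`) would be used instead of the manual formula.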

In [13]:
comparison_metrics = {metric: perform_statistical_comparison(all_metrics, metric) for metric in metrics_to_show}

In [14]:
for metric, (overall_result, pairwise_result) in comparison_metrics.items():
    dd.display(dd.Markdown(f"### {metric} analysis"))
    dd.display(dd.Markdown("#### Overall test"))
    dd.display(overall_result)
    dd.display(dd.Markdown("#### Pairwise tests"))
    dd.display(pairwise_result)

### val_MAPE analysis

#### Overall test

Unnamed: 0,Source,ddof1,H,p-unc
Kruskal,model_name,3,34.243902,1.75967e-07


#### Pairwise tests

Unnamed: 0,Contrast,A,B,Paired,Parametric,U-val,alternative,p-unc,p-corr,p-adjust,hedges
0,model_name,colab filtering,deep colab filtering,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,25.091008
1,model_name,colab filtering,gnn recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,35.621909
2,model_name,colab filtering,hybrid recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,29.253058
3,model_name,deep colab filtering,gnn recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,5.660781
4,model_name,deep colab filtering,hybrid recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,3.858011
5,model_name,gnn recommender,hybrid recommender,False,False,20.0,two-sided,0.025748,0.154488,bonferroni,-1.043938


### val_MAE analysis

#### Overall test

Unnamed: 0,Source,ddof1,H,p-unc
Kruskal,model_name,3,34.074146,1.911051e-07


#### Pairwise tests

Unnamed: 0,Contrast,A,B,Paired,Parametric,U-val,alternative,p-unc,p-corr,p-adjust,hedges
0,model_name,colab filtering,deep colab filtering,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,103.837181
1,model_name,colab filtering,gnn recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,112.831482
2,model_name,colab filtering,hybrid recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,98.647342
3,model_name,deep colab filtering,gnn recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,13.675072
4,model_name,deep colab filtering,hybrid recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,10.255474
5,model_name,gnn recommender,hybrid recommender,False,False,22.0,two-sided,0.037635,0.225812,bonferroni,-0.982145


### val_MSE analysis

#### Overall test

Unnamed: 0,Source,ddof1,H,p-unc
Kruskal,model_name,3,34.069756,1.915134e-07


#### Pairwise tests

Unnamed: 0,Contrast,A,B,Paired,Parametric,U-val,alternative,p-unc,p-corr,p-adjust,hedges
0,model_name,colab filtering,deep colab filtering,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,51.812591
1,model_name,colab filtering,gnn recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,53.683473
2,model_name,colab filtering,hybrid recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,52.878386
3,model_name,deep colab filtering,gnn recommender,False,False,100.0,two-sided,0.000183,0.001096,bonferroni,4.832032
4,model_name,deep colab filtering,hybrid recommender,False,False,99.0,two-sided,0.000246,0.001477,bonferroni,3.316019
5,model_name,gnn recommender,hybrid recommender,False,False,20.0,two-sided,0.025748,0.154488,bonferroni,-1.033609


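Analysis of the presented results indicates the following:
1. For every metric, the more advanced models (hybrid recommender and GNN recommender) differ significantly from both collaborative filtering and deep collaborative filtering (Bonferroni-corrected p < 0.002). The two collaborative filtering variants also differ significantly from each other.
2. There is no significant difference between the deep hybrid model and the GNN model after Bonferroni correction (corrected p between 0.15 and 0.23), even though the uncorrected p-values fall below 0.05.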
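The relationship between the `U-val`, `p-unc` and `p-corr` columns can be checked by hand. The sketch below is a reconstruction under two assumptions that the notebook does not state explicitly: each model contributes 10 validation scores (suggested by the maximal U of 100), and the uncorrected p-value comes from the normal approximation with continuity correction and no ties. The helper names `mwu_two_sided_p` and `bonferroni` are hypothetical.

```python
import math


def mwu_two_sided_p(u_stat: float, n1: int, n2: int) -> float:
    """Two-sided Mann-Whitney p-value via the normal approximation
    with continuity correction (assumes no ties)."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (abs(u_stat - mean_u) - 0.5) / sd_u
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))


def bonferroni(p_unc: float, n_tests: int) -> float:
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p_unc * n_tests)


n_tests = 6  # 4 models -> 4 * 3 / 2 pairwise comparisons
for u in (100, 99, 22, 20):  # U values that appear in the pairwise tables
    p_unc = mwu_two_sided_p(u, 10, 10)
    p_corr = bonferroni(p_unc, n_tests)
    print(f"U={u:>3}  p-unc={p_unc:.6f}  p-corr={p_corr:.6f}")
```

If the assumptions hold, the printed values should agree with the tables up to rounding. The comparison between the GNN and hybrid recommenders (U of 20 or 22) is the only one whose corrected p-value exceeds 0.05.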

# Test metrics analysis

In [18]:
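test_met_dicts = []
# Collect every "experiment.*" catalog entry that holds test metrics.
for catalog_name in [met for met in catalog.list() if 'experiment.' in met and 'test' in met]:
    # Strip the namespace and the "_test_metrics" suffix to get the model name.
    _, metric = catalog_name.split(".")
    model = metric.replace("_test_metrics", "")
    met_dict = catalog.load(catalog_name)
    met_dict['model'] = model

    test_met_dicts.append(met_dict)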

test_mets_df = pd.DataFrame.from_records(test_met_dicts, index=list(range(len(test_met_dicts))))
test_mets_df

Unnamed: 0,test_MSE,test_MAPE,test_MAE,model
0,17.61796,0.814397,3.798708,collaborative_filtering
1,0.527138,0.15611,0.463374,deep_collaborative_filtering
2,0.192739,0.052388,0.146859,deep_hybrid_rec
3,0.181927,0.078926,0.264913,gnn_rec


# Conclusions

For every metric, the overall comparison allows us to reject the null hypothesis.
**This means that the mean ranks of the groups are not the same.**

**The post-hoc test results show that the hybrid recommender and GNN recommender models are significantly better than the other two models in all cases.**

The best test MAPE and MAE were obtained by the hybrid recommender model, while the GNN recommender obtained a slightly lower test MSE (0.182 vs. 0.193).