# Generation of results for the simulations of semi-automated title-abstract screening

This notebook contains the code for generation of the results. <br>

#### Part I: Set-up
- **1. Import the packages and functions**
- **2. Import the intervention and prognosis review datasets** <br>

#### Part II: Simulations with original review datasets 
- **3. Retrieve and merge the output from all simulations** 
<br>The simulations based on the imported datasets were previously run with separate code on a High Performance Computer (HPC). The simulations were run for variations of: the 6 prognosis and 6 intervention review datasets (= 12 datasets), 2 feature extraction models + 3 classification models (= 5 model combinations), and 200 randomly sampled sets of initial training data. This resulted in 12\*5\*200 = 12000 simulations stored in the same number of pickle files (.p). Each of these files contains the ranking from simulating the semi-automated screening with the respective dataset and modeling methods. All these files are loaded and the output is stored for further processing.
- **4. Compute performance metrics from the retrieved simulation output**
<br>The retrieved output is then used to calculate the performance metrics (recall, precision at 95% recall, WSS at 95% recall, and nWSS at 95% recall) for each of the simulations.
- **5. Create raw tables with all performance metrics separately** 
- **6. Process raw tables into pooled tables (for results)**
- **7. Create histograms for WSS and precision (for results)** 
- **8. Create boxplots/lineplots of increasing recall during screening (for results)** <br>

#### Part III: Simulations with adapted review datasets 
- **9. Variations in number of (relevant) records**
- **10. Retrieve and merge the output from all simulations**
<br>The simulations using the manually adapted datasets were previously run with separate code on a High Performance Computer (HPC). The simulations were run for variations of: the 4 prognosis and 4 intervention datasets manually adapted to contain 2000, 1000, and 500 records of which 50 inclusions (= 24 manually adapted datasets), 5 samples (seeds) for each manually adapted dataset, 1 feature extraction model + 1 classification model (default models), and 200 randomly sampled sets of initial training data. This resulted in 24\*5\*200 = 24000 simulations stored in the same number of pickle files (.p). Each of these files contains the ranking from simulating the semi-automated screening with the respective dataset and sampling seed. All these files are loaded and the output is stored for further processing.
- **11. Compute performance metrics from the retrieved simulation output**
<br>The retrieved output is then used to calculate the performance metrics (recall, precision at 95% recall, WSS at 95% recall, and nWSS at 95% recall) for each of the simulations.
- **12. Create raw tables with all performance metrics separately**
- **13. Process raw tables into pooled tables (for results)**
- **14. Create histograms for WSS and precision (for results)**
- **15. Create lineplots of increasing recall during screening (for results)**

## Part I: Set-up

### 1. Import the packages and functions

In [1]:
import sys
sys.path.append('../../')

import numpy as np
import pandas as pd
import pickle
import os
import shutil
import math
import seaborn as sns
import matplotlib.pyplot as plt
import argparse
import collections

In [2]:
from functions import compute_metrics, compute_nwss
from functions import generate_recall_table_prop, generate_recall_table_ss, max_recall_prop, max_recall_ss
from functions import generate_wss_table, generate_results_table

In [3]:
from asreview.models.classifiers import LogisticClassifier, LSTMBaseClassifier, LSTMPoolClassifier, NaiveBayesClassifier, NN2LayerClassifier, RandomForestClassifier, SVMClassifier
from asreview.models.query import ClusterQuery, MaxQuery, MaxRandomQuery, MaxUncertaintyQuery, RandomQuery, UncertaintyQuery
from asreview.models.balance import DoubleBalance, SimpleBalance, UndersampleBalance
from asreview.models.feature_extraction import Doc2Vec, EmbeddingIdf, EmbeddingLSTM, SBERT, Tfidf
from asreview import open_state

from pathlib import Path

In [4]:
import os
os.chdir("..")

In [5]:
# TODO put results in correct folders
# TODO change data files to .csv

path_data = 'data/' 
path_results_HPC = '/Users/ispiero2/Documents/Research/Results_HPC/'
path_results = 'results/' 

### 2. Import the intervention and prognosis review datasets

The intervention and prognosis review datasets that were used for the simulations are imported. The datasets are numbered in order of author, so the numbers of the prognosis reviews do not correspond to the numbers used in the data preparation:

In [6]:
# Load all the review datasets from the dataset-containing folder into a dictionary
review_dic = {}

for file_name in os.listdir(path_data):
    if file_name.endswith('.xlsx'):
        file_path = os.path.join(path_data, file_name)
        df = pd.read_excel(file_path)
        key = os.path.splitext(file_name)[0].split("_")[0]
        review_dic[key] = df

review_dic = dict(sorted(review_dic.items()))

In [7]:
# Create separate dictionaries for intervention and prognosis review datasets
dfs_int = {key: value for key, value in review_dic.items() if key.startswith('Int')}
dfs_prog = {key: value for key, value in review_dic.items() if key.startswith('Prog')}
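
The key derivation in the loading loop keeps only the part of the file name before the first underscore. A minimal sketch of that step, using hypothetical file names (the actual file names in `data/` are not shown in this notebook) and a hypothetical helper `review_key`:

```python
import os

def review_key(file_name):
    """Return the part of the file name before the first underscore, without extension."""
    return os.path.splitext(file_name)[0].split("_")[0]

# Hypothetical file names, for illustration only
file_names = ["Prog1_example.xlsx", "Int2_example.xlsx", "Int1.xlsx"]
keys = sorted(review_key(name) for name in file_names)

# Split the keys by review type, as done for dfs_int and dfs_prog
int_keys = [k for k in keys if k.startswith("Int")]
prog_keys = [k for k in keys if k.startswith("Prog")]
print(int_keys, prog_keys)
```

Note that the sort is lexicographic, so a key such as `Int10` would be ordered before `Int2`.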

## Part II: Simulations with original review datasets

### 3. Retrieve and merge the output from all simulations

To assess the performance of the semi-automated screening tool, not only the reviews, but also the classification models, feature extraction models, and/or query models were varied in the simulations:

In [9]:
# Specify the classification, feature extraction, and query model(s) that were tested
train_models = [LogisticClassifier(), NaiveBayesClassifier(), SVMClassifier()] 
feature_models = [Tfidf(), SBERT()] 
query_models = [MaxQuery()]

# Specify the number of simulations per review-model combination  
n_simulations = 200 
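
The outline mentions 5 model combinations, although 3 classifiers and 2 feature extraction models give 6 possible pairs. The sketch below reproduces the count of 12000 files under the assumption that naive Bayes is not combined with SBERT. This assumption is inferred from the model labels used in the lineplots of step 8, which contain no naive Bayes with SBERT entry; the short model names are taken from those labels.

```python
from itertools import product

classifiers = ["logistic", "nb", "svm"]
feature_models = ["tfidf", "sbert"]

# Assumption: naive Bayes is not combined with SBERT
excluded = {("nb", "sbert")}
combinations = [c for c in product(classifiers, feature_models) if c not in excluded]

n_datasets = 12        # 6 prognosis + 6 intervention reviews
n_simulations = 200    # simulations per review-model combination
n_files = n_datasets * len(combinations) * n_simulations
print(len(combinations), n_files)
```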

All the output from the simulations of these variations (conducted on the HPC) can then be retrieved and merged as follows:

In [10]:
# Create a list of the review-model combination names
sim_list_names = []
for review in review_dic:
    for train_model in train_models:
        for feature_model in feature_models:
            for query_model in query_models:
                review_id = str(review + "_" + train_model.name + "_" + feature_model.name + "_" + query_model.name)
                sim_list_names.append(review_id)
                
# Retrieve the output from the HPC generated pickle files with each having the rankings of a single simulation
multiple_sims = []
for i in range(0, len(sim_list_names)):
    raw_output = {}
    for j in range(1,n_simulations+1):
        if Path(path_results_HPC +'sim_{review_id}_{sim}.p'.format(review_id=sim_list_names[i], sim=j)).is_file():
            with open(path_results_HPC + 'sim_{review_id}_{sim}.p'.format(review_id=sim_list_names[i], sim=j),'rb') as f:
                raw_output.update(pickle.load(f))
    if len(raw_output) > 0:
        multiple_sims.append((sim_list_names[i], len(review_dic[sim_list_names[i].split('_')[0]]), n_simulations, raw_output))
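
The loop above expects one pickle file per simulation, named `sim_{review_id}_{sim}.p`. A small sketch of how such a path can be built with `pathlib`; the helper `sim_file`, the directory name, and the review-model name are hypothetical and serve only as illustration:

```python
from pathlib import Path

def sim_file(results_dir, review_id, sim):
    """Build the path of one simulation pickle (hypothetical helper)."""
    return Path(results_dir) / f"sim_{review_id}_{sim}.p"

# Hypothetical directory and review-model name
path = sim_file("results_hpc", "Int1_nb_tfidf_max", 7)
print(path.name)

# The review key used to look up the dataset size is the part before the first underscore
review_key = "Int1_nb_tfidf_max".split("_")[0]
print(review_key)
```

Building the path once and reusing it for both the `is_file()` check and `open()` avoids formatting the same string twice.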

In [11]:
# Save (back-up) the file with the simulation results 
with open(path_results + 'multiple_sims_saved_all_final_2okt.p','wb') as f:
    pickle.dump(multiple_sims, f)

Alternatively, the output can be opened directly from the previously saved file:

In [None]:
# Open the file with the simulation results
with open(path_results + 'multiple_sims_saved_all_final_2okt.p','rb') as f:
    multiple_sims = pickle.load(f)

The output can be separated into one dictionary for the prognosis reviews and one for the intervention reviews:

In [None]:
#TODO remove this part?
# # Distinguish between the intervention and prognosis reviews by creating a separate dictionary for each
# multiple_sims_prog = multiple_sims[0:35]
# multiple_sims_int = multiple_sims[35:75]

### 4. Compute performance metrics from the retrieved simulation output

The proportions (i.e. the proportion of records screened) and sample sizes (i.e. the number of records screened) of interest are defined first. The ranking of the records is then evaluated and the performance metrics are calculated at each of these proportions/sample sizes screened. These calculations may take a while to run, so the results were saved previously and can also be opened directly.

In [None]:
proportions = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
sample_sizes = list(map(int,list(np.linspace(0, 99, 100,retstep = True)[0]))) + list(map(int,list(np.linspace(100, 12400, 124,retstep = True)[0])))
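
The `np.linspace` calls above produce steps of 1 up to 99 records, followed by steps of 100 up to 12400 records. As a sketch, the same grid can be written with `range`, which may be easier to read:

```python
import numpy as np

# Original construction: two evenly spaced grids converted to integers
original = (list(map(int, np.linspace(0, 99, 100)))
            + list(map(int, np.linspace(100, 12400, 124))))

# Equivalent construction: steps of 1 up to 99, then steps of 100 up to 12400
simplified = list(range(0, 100)) + list(range(100, 12401, 100))

print(len(simplified), simplified[:3], simplified[-3:])
print(original == simplified)
```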

Using these proportions and sizes, the following function can be used to derive the performance metrics of the simulation(s):

In [None]:
# TODO: remove hashtags from compute_metrics function, remove nwss 2

# Use the compute_metrics function to compute the metrics from the retrieved simulation output
#raw_output = compute_metrics.compute_metrics(multiple_sims, proportions, sample_sizes)
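
The implementation of `compute_metrics` is not shown in this notebook. As an illustration only, the sketch below computes WSS, nWSS, and precision at 95% recall from a single ranking, using commonly used definitions: WSS@R = (TN + FN)/N - (1 - R), nWSS as the true negative rate at that point, and precision as TP/(TP + FP). The function `metrics_at_recall` is hypothetical; it ignores the initial training data and assumes at least one relevant and one irrelevant record.

```python
import math

def metrics_at_recall(ranked_labels, recall_level=0.95):
    """Compute WSS, nWSS and precision at the point where recall_level is reached.

    ranked_labels: list of 0/1 labels in the order the records were screened.
    """
    n_total = len(ranked_labels)
    n_relevant = sum(ranked_labels)
    target = math.ceil(recall_level * n_relevant)

    # Screen records in ranked order until the target number of relevant records is found
    found = 0
    for n_screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= target:
            break

    tp = found
    fp = n_screened - tp
    fn = n_relevant - tp
    tn = n_total - n_screened - fn
    return {
        "WSS": (tn + fn) / n_total - (1 - recall_level),
        "nWSS": tn / (tn + fp),
        "Precision": tp / n_screened,
    }

# Toy ranking: 100 records of which 20 are relevant
ranking = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 75
result = metrics_at_recall(ranking)
print(result)
```

In this toy ranking, 19 of the 20 relevant records are found after screening 24 records, so 76 records remain unscreened when 95% recall is reached.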

In [None]:
# Save (back-up) a file with the computed output
# with open(path_results + 'sims_output_saved_all_final_2okt.p','wb') as f:
#     pickle.dump(raw_output, f)

The normalized work-saved-over-sampling (nWSS) metric is computed separately (TODO merge this with the compute_metrics function):

In [None]:
# TODO remove compute_nwss function if above works

# Use the compute_nwss function to derive the normalized WSS of the retrieved simulation output
#raw_output_nwss = compute_nwss.compute_nwss(multiple_sims, proportions, sample_sizes)

In [None]:
# Save (back-up) a file with the computed output
# with open(path_results + 'sims_output_saved_all_nwss_final.p','wb') as f:
#     pickle.dump(raw_output_nwss, f)

Or directly open the file containing the output (as these especially take a while to run):

In [None]:
# Open the file with the computed output
with open(path_results + 'sims_output_saved_all_final_2okt.p','rb') as f: #'sims_output_saved_all_final_extra.p','rb') as f:
    raw_output = pickle.load(f)

In [None]:
# Open the file with the computed output
# with open(path_results + 'sims_output_saved_all_nwss_final.p','rb') as f:
#     raw_output_nwss = pickle.load(f)

### 5. Create raw tables with all performance metrics separately

Filter the (for now) relevant parts of the output for the results:

In [None]:
evaluation = {}
for i in range(0, len(raw_output)):
    evaluation[raw_output[i][0]] = []
    evaluation[raw_output[i][0]].append(raw_output[i][3:9])
    
import warnings
warnings.filterwarnings('ignore')

In [None]:
# evaluation_nwss = {}
# for i in range(0, len(raw_output_nwss)):
#     evaluation_nwss[raw_output_nwss[i][0]] = []
#     evaluation_nwss[raw_output_nwss[i][0]].append(raw_output_nwss[i][3:])
    
# import warnings
# warnings.filterwarnings('ignore')

Create a raw table with the performance metrics for proportions:

In [None]:
# Use the generate_recall_table_prop function to generate a table with all recall values for all proportions
df_prop = generate_recall_table_prop.generate_recall_table_prop(evaluation, proportions, n_simulations)
df_prop

Calculate the maximum recall values that could be obtained at each of the proportions screened:

In [None]:
# Sort the dictionary
review_dic_ord = collections.OrderedDict(sorted(review_dic.items()))
# Use the max_recall_prop function to calculate the maximum achievable recall for each review
df_max_recalls = max_recall_prop.max_recall_prop(review_dic_ord, proportions)
df_max_recalls
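
The maximum achievable recall is the recall of an ideal ranking in which all relevant records are screened first. The implementation of `max_recall_prop` is not shown here; the sketch below is a hypothetical stand-in that illustrates the idea for a single review with made-up numbers.

```python
def max_recall(n_records, n_relevant, proportions):
    """Best possible recall after screening each proportion of the records."""
    recalls = []
    for p in proportions:
        n_screened = round(p * n_records)
        # An ideal ranking finds one relevant record per screened record until all are found
        recalls.append(min(n_screened, n_relevant) / n_relevant)
    return recalls

proportions = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
# Hypothetical review: 1000 records of which 250 are relevant
max_recalls = max_recall(1000, 250, proportions)
print(max_recalls)
```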

Create a raw table with the performance metrics for sample sizes:

In [None]:
# TODO remove this part:

# Use the generate_recall_table_ss function to generate a table with all recall values for all sample sizes
# df_ss = generate_recall_table_ss.generate_recall_table_ss(evaluation, sample_sizes, n_simulations)
# df_ss.head()

In [None]:
# Save (back-up) a file with the computed tables (these take almost a day to compute)
# # Computed table with sample size
# with open(path_results + 'sims_output_saved_all_ss_final.p','wb') as f:
#     pickle.dump(df_ss, f)

Or directly open the file with the sample sizes table:

In [None]:
# Open the computed table with sample size
# with open(path_results + 'sims_output_saved_all_ss_final.p','rb') as f:
#     df_ss = pickle.load(f)

In [None]:
# df_ss

Calculate the maximum recall values that could be obtained at each of the sample sizes screened:

In [None]:
# # Define the sample sizes (number of records screened) at which to calculate the maximum achievable recalls
# sample_sizes_mr = list(map(int,list(np.linspace(0, 12400, 12400,retstep = True)[0])))
# # Use the max_recall_ss function to calculate the maximum achievable recall for each review
# df_max_recalls_ss = max_recall_ss.max_recall_ss(review_dic_ord, sample_sizes_mr)
# df_max_recalls_ss

Create a raw table with the work-saved-over sampling, normalized work-saved-over sampling, workload reduction in number of records, and workload reduction in hours:

In [None]:
# Use the generate_wss_table function to create a table with the workload metrics
df_wss = generate_wss_table.generate_wss_table(evaluation, n_simulations)
df_wss
#df_wss.to_excel('Table_wss.xlsx')

Create a raw table with the precision metric:

In [None]:
df_prec = pd.DataFrame()
length = n_simulations
for key, value in evaluation.items():
    names = key.split('_')
    review = [names[0]] * length
    train_model = [names[1]] * length
    feature_model = [names[2]] * length
    query_model = [names[3]] * length
    simulations = range(1, n_simulations+1)
    precision = value[0][4]['Precision'] ###
    df_sim = pd.DataFrame(list(zip(review,train_model,feature_model,query_model,simulations, precision)),
                           columns = ['Review', 'Train model', 'Feature model', 'Query model', 'Simulation', 'Precision@95%'])
    df_prec = pd.concat([df_prec, df_sim])

df_prec = df_prec.reset_index(drop = True)

df_prec

### 6. Process raw tables into pooled tables (for results)

The generate_results_table function creates the table containing the WSS and precision values as presented in the results (df_wss_prec) and the table used to create the figures (df_wss_prec_all_values):

In [None]:
df_wss_prec_all_values, df_wss_prec = generate_results_table.generate_results_table(df_wss, df_prec)
#df_wss_prec.to_excel('Table_WSS_precision_final_extra.xlsx')
df_wss_prec_all_values

In [None]:
df_wss_prec

### 7. Create histograms for WSS and precision (for results)

Remove the output of the reviews that are not included in the results of our study/replace the numbering:

In [None]:
# TODO do this beforehand and adapt in simulation HPC

df_wss_prec_all_values_ed = df_wss_prec_all_values.copy()
df_prop_ed = df_prop.copy()
df_max_recalls_ed = df_max_recalls.copy()

# Remove prognosis review 5 (wrong study design), and change numbering
df_wss_prec_all_values_ed = df_wss_prec_all_values_ed.drop(df_wss_prec_all_values_ed[df_wss_prec_all_values_ed['Review'] == 'Prog5'].index)
df_wss_prec_all_values_ed['Review'] = df_wss_prec_all_values_ed['Review'].replace({'Prog6': 'Prog5',
                                                                                   'Prog7': 'Prog6'})
df_prop_ed = df_prop_ed.drop(df_prop_ed[df_prop_ed['Review'] == 'Prog5'].index)
df_prop_ed['Review'] = df_prop_ed['Review'].replace({'Prog6': 'Prog5',
                                                     'Prog7': 'Prog6'})
df_prop_ed['Review_full'] = df_prop_ed['Review_full'].apply(lambda x: x.replace('Prognosis 6', 'Prognosis 5') if 'Prognosis 6' in x else x)
df_prop_ed['Review_full'] = df_prop_ed['Review_full'].apply(lambda x: x.replace('Prognosis 7', 'Prognosis 6') if 'Prognosis 7' in x else x)
df_prop_ed['Simulation'] = df_prop_ed['Simulation'].apply(lambda x: x.replace('Prognosis 6', 'Prognosis 5') if 'Prognosis 6' in x else x)
df_prop_ed['Simulation'] = df_prop_ed['Simulation'].apply(lambda x: x.replace('Prognosis 7', 'Prognosis 6') if 'Prognosis 7' in x else x)
df_max_recalls_ed = df_max_recalls_ed.drop(df_max_recalls_ed[df_max_recalls_ed['Review'] == 'Prog5'].index)
df_max_recalls_ed['Review'] = df_max_recalls_ed['Review'].replace({'Prog6': 'Prog5',
                                                                   'Prog7': 'Prog6'})

# Remove intervention review 7 and 8 (decided too few relevant records for proper simulation)
df_wss_prec_all_values_ed = df_wss_prec_all_values_ed.drop(df_wss_prec_all_values_ed[df_wss_prec_all_values_ed['Review'] == 'Int7'].index)
df_wss_prec_all_values_ed = df_wss_prec_all_values_ed.drop(df_wss_prec_all_values_ed[df_wss_prec_all_values_ed['Review'] == 'Int8'].index)
df_prop_ed = df_prop_ed.drop(df_prop_ed[df_prop_ed['Review'] == 'Int7'].index)
df_prop_ed = df_prop_ed.drop(df_prop_ed[df_prop_ed['Review'] == 'Int8'].index)
df_max_recalls_ed = df_max_recalls_ed.drop(df_max_recalls_ed[df_max_recalls_ed['Review'] == 'Int7'].index)
df_max_recalls_ed = df_max_recalls_ed.drop(df_max_recalls_ed[df_max_recalls_ed['Review'] == 'Int8'].index)
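
The same drop-and-renumber steps are repeated for each of the three tables. As a sketch, they could be wrapped in one helper; `drop_and_renumber` is a hypothetical function and the toy table below contains made-up values. The `Review_full` and `Simulation` columns of `df_prop` would still need their own text replacements.

```python
import pandas as pd

def drop_and_renumber(df, drop, rename, column="Review"):
    """Return a copy without the reviews in `drop` and with `rename` applied to `column`."""
    out = df[~df[column].isin(drop)].copy()
    # Look up each value once, so 'Prog6' -> 'Prog5' and 'Prog7' -> 'Prog6' do not chain
    out[column] = out[column].map(lambda name: rename.get(name, name))
    return out

# Toy table with made-up values
toy = pd.DataFrame({"Review": ["Prog4", "Prog5", "Prog6", "Prog7", "Int7", "Int8"],
                    "Mean_WSS95": [0.5, 0.4, 0.6, 0.7, 0.2, 0.1]})
cleaned = drop_and_renumber(toy,
                            drop=["Prog5", "Int7", "Int8"],
                            rename={"Prog6": "Prog5", "Prog7": "Prog6"})
print(cleaned["Review"].tolist())
```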

Create histograms for (n-)WSS and precision for intervention and prognosis reviews separately:

In [None]:
#TODO create functions 

# Choose variables to plot
variables_to_plot = ['Mean_WSS95', 'Mean_nWSS95', 'Mean_prec95']
variable_names = ['WSS@95', 'n-WSS@95', 'Precision@95%']  # same order as variables_to_plot
y_lims = [1,1,0.5]
review_types = ['Int', 'Prog']

# def generate_histograms(df_wss_prec_all_values, variables_to_plot, variable_names, review_types, sort_by = 'Models'):
    
#     # Change dtype to numeric
#     for variable in variables_to_plot:
#         df_wss_prec_all_values[variable] = pd.to_numeric(df_wss_prec_all_values[variable])
        
#     # Create a histogram for each variable and review type
#     for i in range(0, len(variables_to_plot)):
#         for j in review_types:
            
#             sns.barplot(x='Review', y=variables_to_plot[i], hue=sort_by, 
#                         data=df_wss_prec_all_values[df_wss_prec_all_values['Review'].str.startswith(j)]).set(ylabel=variable_names[i],ylim=(0,y_lims[i]))
#             plt.legend(fontsize='10', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#             plt.show()

def generate_histograms(dataset, x, variables, y_labels, review_types, hue = 'Models'):
    
    # Change dtype to numeric
    for variable in variables:
        dataset[variable] = pd.to_numeric(dataset[variable])
        
    # Create a histogram for each variable and review type
    for i in range(0, len(variables)):
        for j in review_types:
            
            sns.barplot(x=x, y=variables[i], hue=hue, 
                        data=dataset[dataset['Review'].str.startswith(j)]).set(ylabel=y_labels[i],ylim=(0,y_lims[i]))
            plt.legend(fontsize='10', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
            plt.show()
            
generate_histograms(dataset=df_wss_prec_all_values_ed, 
                    x='Review',
                    variables=variables_to_plot, 
                    y_labels=variable_names, 
                    review_types=review_types)

### 8. Create boxplots/lineplots of increasing recall during screening (for results)

Create boxplots for the simulations of the default models (or any other subset of data/modeling methods):

In [None]:
# Optionally choose subset(s) to plot
subset_models = ['nb - tfidf']
subset_reviews = ['Prog', 'Int']

def generate_boxplots(df_prop_ed, subset_models = None, subset_reviews = None):
    
    # Select the subset from the dataframe containing all recall values for each proportion screened
    df_boxplots = df_prop_ed.copy()
    
    if subset_models is not None:
        models = '|'.join(subset_models)
        df_boxplots = df_boxplots[df_boxplots['Simulation'].str.contains(models, regex=True)]
    if subset_reviews is not None:
        reviews = '|'.join(subset_reviews)
        df_boxplots = df_boxplots[df_boxplots['Review'].str.contains(reviews, regex=True)]

    # Create a figure with boxplots for each review-model combination separately
    sns.set(style = 'ticks', font_scale = 1.5)
    p1 = sns.catplot(data = df_boxplots, x = 'percentage of records screened', y = 'recall',
                     col = 'Simulation', kind = 'box', col_wrap = 2, color = 'white', aspect = 1.3)

    axes = p1.fig.axes
    x_axis = df_boxplots['percentage of records screened'][0:11]

    for i in range(0, len(df_boxplots['Simulation'].unique())):
        review = df_boxplots.loc[df_boxplots['Simulation'] == df_boxplots['Simulation'].unique()[i], 'Review'].values[0]
        max_recalls_per_prop = df_max_recalls_ed.loc[df_max_recalls_ed['Review'] == review]['Maximum recall']    
        axes[i].plot(x_axis, max_recalls_per_prop+0.005, 'k-', linewidth = 1, color = 'black', linestyle = '--', label = "maximal recall")   
        axes[i].legend(loc="lower right")
    p1.set_titles(col_template = "{col_name}", size = 16, weight = 'bold')

    plt.show()
    
generate_boxplots(df_prop_ed, subset_models, subset_reviews)

Create lineplots for all simulations (or any other subset of data/modeling methods):

In [None]:
# Optionally choose a subset to plot
subset_reviews = ['Prog']

def generate_lineplots(df_prop_ed, df_max_recalls_ed, subset_reviews = None):
    
    # Select the subset from the dataframe containing all recall values for each proportion screened
    df_lineplots = df_prop_ed.copy()
    
    df_lineplots['Models'] = df_lineplots['Models'].astype(pd.CategoricalDtype(categories=['tfidf - nb',
                                                                  'tfidf - svm',
                                                                  'tfidf - logistic',
                                                                  'sbert - svm',
                                                                  'sbert - logistic']))
    
    if subset_reviews is not None:
        reviews = '|'.join(subset_reviews)
        df_lineplots = df_lineplots[df_lineplots['Review'].str.contains(reviews, regex=True)]

    # Create a figure with lineplots for each review-model combination separately
    p2 = sns.catplot(data = df_lineplots, kind = 'point', 
                     x = 'percentage of records screened', y = 'recall', 
                     col = 'Review_full', 
                     hue = 'Models', 
                     errorbar = 'ci',
                     col_wrap = 2, aspect = 1.4, legend = False
                     )

    axes = p2.fig.axes

    for i in range(0, len(df_lineplots['Review'].unique())):
        max_recalls_per_prop = df_max_recalls_ed.loc[df_max_recalls_ed['Review'] == df_lineplots['Review'].unique()[i]]['Maximum recall']    
        manual_screening = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
        x_axis = df_lineplots['percentage of records screened'][0:len(max_recalls_per_prop)]
        axes[i].plot(x_axis, max_recalls_per_prop + 0.005, 'k-', linewidth = 1, color = 'grey', linestyle = '--', label = "maximum recall") 
        axes[i].plot(x_axis, manual_screening, 'k-', linewidth = 1, color = 'grey', linestyle = '-', label = "manual screening") 
        axes[i].legend(loc="lower right", fontsize = 12.5)
    
    if subset_reviews is not None and subset_reviews[0] == 'Int8':
        p2.set_titles('Intervention A1', size = 16, weight = 'bold')
    else:
        p2.set_titles(col_template = "{col_name}", size = 16, weight = 'bold')
    p2.set_xlabels('Percentage of records screened', size = 16)
    p2.set_ylabels('Recall', size = 16)
    p2.set(ylim = (0, 1.01))
    plt.show()

generate_lineplots(df_prop_ed, df_max_recalls_ed, subset_reviews)

In [None]:
subset_reviews = ['Int']
generate_lineplots(df_prop_ed, df_max_recalls_ed, subset_reviews)

In [None]:
subset_reviews = ['Int8']
generate_lineplots(df_prop, df_max_recalls, subset_reviews)

## Part III: Simulations with adapted review datasets

### 9. Variations in number of (relevant) records 

Create dataframes with manually adapted numbers of (relevant) records. These were also used to simulate the semi-automated screening process on the High Performance Computer (HPC).

In [None]:
# Define a function to create a dictionary with subsets of a dataset consisting of varying numbers of records and inclusions:

def df_var_dict(review_name, df, sizes, incl_prop): 
    
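    '''
    review_name (str):        the name of the review, used as prefix for the keys of the returned dictionary
    df (pandas.DataFrame):    a dataframe of a review containing at least the columns 'abstract', 'title', 'authors', and 'label_included'
    sizes (list):             a list of integers with the dataset sizes to vary
    incl_prop (list):         a list of floats with the inclusion proportions to vary, paired element-wise with sizes
    '''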
    
    # First a list of possible size/inclusion combinations is created:
    unique_combinations = []
    for i in range(0, len(sizes)):
        unique_combinations.append([sizes[i],incl_prop[i]])

    # Then a dictionary is created with for each combination a sample of the dataset. If the sample cannot be retrieved,
    # i.e. the dataset size is too small or the inclusion proportion is not available, the respective item in the dictionary remains empty.
    df_dict = {}
    
    # For each combination of size/inclusion proportion:
    for i in range(len(unique_combinations)):
        # Check if the dataframe includes enough records to sample the respective size
        if len(df) >= unique_combinations[i][0]:
            
            
            # Check if the inclusion proportion is possible for the respective size
            if df.loc[df['label_included'] == 1].shape[0] >= int(unique_combinations[i][0] * unique_combinations[i][1]) and df.loc[df['label_included'] == 0].shape[0] >= int(unique_combinations[i][0] - (unique_combinations[i][0] * unique_combinations[i][1])):
                # If so:
                # Sample random inclusions
                incl = df.loc[df['label_included'] == 1].sample(n = int(unique_combinations[i][0] * unique_combinations[i][1]), replace = False, random_state = 1)
                # Sample random exclusions
                excl = df.loc[df['label_included'] == 0].sample(n = int(unique_combinations[i][0] - (unique_combinations[i][0] * unique_combinations[i][1])), replace = False, random_state = 1)
                # Create a dataframe of the inclusions and exclusions
                df_new = pd.concat([incl, excl]).sort_values('authors').reset_index()
                name = review_name + "_" + str(len(df_new)) + "_" + str(unique_combinations[i][1])
            # If the size/inclusion proportion is not possible, leave the dataframe empty
            # (no name is needed, because empty dataframes are not stored)
            else:
                df_new = []
        else:
            df_new = []
        
        if len(df_new) > 0:
            if df_new['label_included'].sum() <= 10:
                df_new = []
        
        # Store the dataframe in the dictionary of all retrieved dataframes
        if len(df_new) > 0:
            df_dict[name] = df_new
  
    return(df_dict)

# Apply the function:
sizes = [500, 1000, 2000]
incl_prop = [0.1, 0.05, 0.025]
ss_dict = df_var_dict(review_name = 'Int1', df = dfs_int['Int1'], sizes = sizes, incl_prop = incl_prop)                 
ss_dict.update(df_var_dict(review_name = 'Int2', df = dfs_int['Int2'], sizes = sizes, incl_prop = incl_prop))
ss_dict.update(df_var_dict(review_name = 'Int4', df = dfs_int['Int4'], sizes = sizes, incl_prop = incl_prop))
ss_dict.update(df_var_dict(review_name = 'Int6', df = dfs_int['Int6'], sizes = sizes, incl_prop = incl_prop))
ss_dict.update(df_var_dict(review_name = 'Prog3', df = dfs_prog['Prog3'], sizes = sizes, incl_prop = incl_prop))
ss_dict.update(df_var_dict(review_name = 'Prog4', df = dfs_prog['Prog4'], sizes = sizes, incl_prop = incl_prop))
ss_dict.update(df_var_dict(review_name = 'Prog6', df = dfs_prog['Prog6'], sizes = sizes, incl_prop = incl_prop))
ss_dict.update(df_var_dict(review_name = 'Prog7', df = dfs_prog['Prog7'], sizes = sizes, incl_prop = incl_prop))

# Save the keys
ss_sims = list(ss_dict.keys())
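
Each size is paired with an inclusion proportion such that every adapted dataset contains 50 relevant records (500 × 0.1, 1000 × 0.05, and 2000 × 0.025). A condensed sketch of the sampling step on a toy dataset; `sample_fixed_prevalence` is a hypothetical helper, not the function used above, and it uses `round` instead of `int` to guard against floating-point products that fall just below a whole number:

```python
import pandas as pd

def sample_fixed_prevalence(df, size, incl_prop, seed=1):
    """Sample `size` records of which a fraction `incl_prop` is relevant (sketch)."""
    n_incl = round(size * incl_prop)
    n_excl = size - n_incl
    incl = df[df["label_included"] == 1].sample(n=n_incl, random_state=seed)
    excl = df[df["label_included"] == 0].sample(n=n_excl, random_state=seed)
    return pd.concat([incl, excl]).reset_index(drop=True)

# Toy dataset: 3000 records of which 100 are relevant (made-up numbers)
toy = pd.DataFrame({"label_included": [1] * 100 + [0] * 2900})

subsets = {}
for size, prop in zip([500, 1000, 2000], [0.1, 0.05, 0.025]):
    subsets[size] = sample_fixed_prevalence(toy, size, prop)
    print(size, int(subsets[size]["label_included"].sum()))
```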

### 10. Retrieve and merge the output from all simulations

To assess the performance of the semi-automated screening tool, the classification models, feature extraction models, and/or query models were kept constant while the reviews were manually adapted to consist of an equal number of (relevant) records.

In [None]:
# TODO put results in correct folder
path_results_HPC = '/Users/ispiero2/Documents/Research/Results_HPC/new_seed/' #new_results/''

In [None]:
# Specify the classification, feature extraction, and query model(s) that were tested
train_models = [NaiveBayesClassifier()]
feature_models = [Tfidf()] 
query_models = [MaxQuery()]

# Specify the number of simulations per review-model combination  
n_simulations = 200 

# Specify the number of sampling seeds of manually adapted datasets
seeds = 3

All the output from the simulations of these variations (conducted on the HPC) can then be retrieved and merged as follows:

In [None]:
# Create a list of the review-model combination names
sim_list_names_ss = []
for review in ss_sims:
    for train_model in train_models:
        for feature_model in feature_models:
            for query_model in query_models:
                review_id = str(review + "_" + train_model.name + "_" + feature_model.name + "_" + query_model.name)
                sim_list_names_ss.append(review_id)

# Derive the results from the HPC retrieved pickle files with each having the rankings of a single simulation
multiple_sims_sizes = []
n_simulations = 200 # number of simulations per review-model combination
for i in range(0, len(sim_list_names_ss)):
    for k in range(1, seeds+1):
        raw_output_ss = {}
        for j in range(1,n_simulations+1):
            if Path(path_results_HPC +'sim_{review_id}_{sim}_{seed}.p'.format(review_id=sim_list_names_ss[i], sim=j, seed = k)).is_file():
                with open(path_results_HPC + 'sim_{review_id}_{sim}_{seed}.p'.format(review_id=sim_list_names_ss[i], sim=j, seed = k),'rb') as f:
                    raw_output_ss.update(pickle.load(f))
        if len(raw_output_ss) > 0:
            review_id = str(sim_list_names_ss[i] + '_' + str(k))
            # Use the size of the adapted dataset that belongs to this simulation (index i, not 0)
            multiple_sims_sizes.append((review_id, len(ss_dict['_'.join(sim_list_names_ss[i].split('_')[0:3])]), n_simulations, raw_output_ss))

In [None]:
# # Create a list of the review-model combination names
# sim_list_names_ss = []
# for review in ss_sims:
#     for train_model in train_models:
#         for feature_model in feature_models:
#             for query_model in query_models:
#                 review_id = str(review + "_" + train_model.name + "_" + feature_model.name + "_" + query_model.name)
#                 sim_list_names_ss.append(review_id)

# # Derive the results from the HPC retrieved pickle files with each having the rankings of a single simulation
# multiple_sims_sizes = []
# n_simulations = 200 # number of simulations per review-model combination
# for i in range(0, len(sim_list_names_ss)):
#     raw_output_ss = {}
#     for j in range(1,n_simulations+1):
#         if Path(path_results_HPC +'sim_{review_id}_{sim}.p'.format(review_id=sim_list_names_ss[i], sim=j)).is_file():
#             with open(path_results_HPC + 'sim_{review_id}_{sim}.p'.format(review_id=sim_list_names_ss[i], sim=j),'rb') as f:
#                 raw_output_ss.update(pickle.load(f))
#     if len(raw_output_ss) > 0:
#         multiple_sims_sizes.append((sim_list_names_ss[i], len(ss_dict['_'.join(sim_list_names_ss[0].split('_')[0:3])]), n_simulations, raw_output_ss))

In [None]:
# Save (back-up) the file with the simulation results
# with open(path_results + 'multiple_sims_sizes_seeds.p','wb') as f:
#     pickle.dump(multiple_sims_sizes, f)

Alternatively, the output can be opened directly from the previously saved file:

In [None]:
# Open the file with the simulation results
with open(path_results + 'multiple_sims_sizes_seeds.p','rb') as f:
    multiple_sims_sizes = pickle.load(f)
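
As a rough illustration of the retrieval step above, the sketch below merges several per-simulation pickle files into one dictionary. It assumes, as the `update` call in the loading loop suggests, that each file holds a dictionary. The function name `merge_pickled_rankings`, the `sim_demo` file prefix, and the toy rankings are made up for this example and are not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path

def merge_pickled_rankings(directory, prefix):
    """Merge all pickled dictionaries in `directory` whose name starts with `prefix`."""
    merged = {}
    for file in sorted(Path(directory).glob(f"{prefix}_*.p")):
        with open(file, "rb") as f:
            # Later files overwrite earlier ones if they share a key
            merged.update(pickle.load(f))
    return merged

# Self-contained demonstration with two made-up simulation files
with tempfile.TemporaryDirectory() as tmp:
    for sim in (1, 2):
        with open(Path(tmp) / f"sim_demo_{sim}.p", "wb") as f:
            pickle.dump({sim: [3, 1, 2]}, f)
    demo_output = merge_pickled_rankings(tmp, "sim_demo")

print(sorted(demo_output))  # keys of the merged dictionary
```

Because `dict.update` overwrites existing keys, this approach only keeps every simulation if the keys stored in the separate files are unique.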

### 11. Compute performance metrics from the retrieved simulation output

The proportions (i.e. the proportion of records screened) and sample sizes (i.e. the number of records screened) of interest are defined first. The ranking of the records is then evaluated, and the performance metrics are calculated, at each of these proportions and sample sizes.

In [None]:
proportions = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
# Sample sizes 0-99 in steps of 1, followed by 100-12400 in steps of 100
sample_sizes = list(range(0, 100)) + list(range(100, 12401, 100))
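
To make concrete what evaluating a ranking at a proportion screened means, here is a minimal sketch. It is not the `compute_metrics` function used in this notebook. The function name, the toy ranking, and the choice to round the number of screened records to the nearest integer are all assumptions made for illustration.

```python
def recall_at_proportions(ranked_labels, proportions):
    """Recall after screening each proportion of a ranked list of 0/1 labels."""
    n_records = len(ranked_labels)
    n_relevant = sum(ranked_labels)
    recalls = []
    for p in proportions:
        # Assumption: the number of screened records is rounded to the nearest integer
        n_screened = round(p * n_records)
        found = sum(ranked_labels[:n_screened])
        recalls.append(found / n_relevant)
    return recalls

# Hypothetical ranking: 1 = relevant record, 0 = irrelevant record, in screening order
toy_ranking = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
toy_recalls = recall_at_proportions(toy_ranking, [0, 0.2, 0.5, 1])
print(toy_recalls)
```

In this toy ranking, half of the relevant records are found after screening 20% of the records, and all of them after screening 80%.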

Using these proportions and sizes, the following function can be used to derive the performance metrics of the simulation(s):

In [None]:
# Use the compute_metrics function to compute the metrics from the retrieved simulation output
#raw_output_sizes = compute_metrics.compute_metrics(multiple_sims_sizes, proportions, sample_sizes)

In [None]:
# Save (back-up) a file with the computed output
# with open(path_results + 'raw_output_sizes_seeds.p','wb') as f:
#     pickle.dump(raw_output_sizes, f)

The normalized work saved over sampling (nWSS) metric is computed separately (TODO: merge this with the compute_metrics function):

In [None]:
# Use the compute_nwss function to derive the normalized WSS of the retrieved simulation output
#raw_output_nwss_sizes = compute_nwss.compute_nwss(multiple_sims_sizes, proportions, sample_sizes)
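
For readers unfamiliar with these metrics, the sketch below computes them for one toy ranking. It assumes the common definition WSS@r = (N - records screened to reach recall r) / N - (1 - r), and takes precision as the relevant records found divided by the records screened at that point. For the normalized WSS it uses the true negative rate at the recall level, which is one definition found in the literature. Whether `compute_metrics` and `compute_nwss` use exactly these definitions is not shown in this notebook, so treat all names and formulas here as assumptions.

```python
import math

def metrics_at_recall(ranked_labels, recall_level=0.95):
    """WSS, precision and an assumed nWSS (true negative rate) at `recall_level`."""
    n_records = len(ranked_labels)
    n_relevant = sum(ranked_labels)
    n_irrelevant = n_records - n_relevant
    # Number of relevant records that must be found to reach the recall level
    target = math.ceil(recall_level * n_relevant)

    found = 0
    n_screened = 0
    for label in ranked_labels:
        n_screened += 1
        found += label
        if found >= target:
            break

    wss = (n_records - n_screened) / n_records - (1 - recall_level)
    precision = found / n_screened
    # Irrelevant records that never had to be screened
    true_negatives = n_irrelevant - (n_screened - found)
    nwss = true_negatives / n_irrelevant  # assumption: nWSS taken as the true negative rate
    return {'wss': wss, 'precision': precision, 'nwss': nwss}

# Hypothetical ranking: 1 = relevant record, 0 = irrelevant record, in screening order
toy_metrics = metrics_at_recall([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])
print(toy_metrics)
```

In this toy example the last relevant record is found after screening 8 of 10 records, so 2 of the 6 irrelevant records never need to be screened.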

In [None]:
# Save (back-up) a file with the computed output
# with open(path_results + 'raw_output_nwss_sizes_seeds.p','wb') as f:
#     pickle.dump(raw_output_nwss_sizes, f)

Alternatively, load the previously computed output directly from file (computing these metrics takes a while):

In [None]:
# Open the files with the computed output
with open(path_results + 'raw_output_sizes_seeds.p','rb') as f:
    raw_output_sizes = pickle.load(f)

In [None]:
# Open the files with the computed output
with open(path_results + 'raw_output_nwss_sizes_seeds.p','rb') as f:
    raw_output_nwss_sizes = pickle.load(f)

### 12. Create raw tables with all performance metrics separately

Filter the parts of the output that are currently relevant for the results:

In [None]:
evaluation_sizes = {}
for i in range(0, len(raw_output_sizes)):
    evaluation_sizes[raw_output_sizes[i][0]] = []
    evaluation_sizes[raw_output_sizes[i][0]].append(raw_output_sizes[i][3:8])

In [None]:
evaluation_sizes_nwss = {}
for i in range(0, len(raw_output_nwss_sizes)):
    evaluation_sizes_nwss[raw_output_nwss_sizes[i][0]] = []
    evaluation_sizes_nwss[raw_output_nwss_sizes[i][0]].append(raw_output_nwss_sizes[i][3:8])

Create a raw table with the performance metrics for proportions:

In [None]:
# Use the generate_recall_table_prop function to generate a table with all recall values for all proportions
df_prop_sizes = generate_recall_table_prop.generate_recall_table_prop(evaluation_sizes, proportions, n_simulations, data_type = 'adapted')

# Adapt the review names
df_prop_sizes['Review'] = df_prop_sizes.apply(lambda row: row['Review'] + '_' + str(row['Total records']) + '_' + str(row['Relevant records']), axis=1)

df_prop_sizes.head()

Calculate the maximum recall values that could be obtained at each of the proportions screened:

In [None]:
# Sort the dictionary
ss_dict_ord = collections.OrderedDict(sorted(ss_dict.items()))
# Use the max_recall_prop function to calculate the maximum achievable recall for each review
df_max_recalls_sizes = max_recall_prop.max_recall_prop(ss_dict_ord, proportions)
df_max_recalls_sizes.head()
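
The idea behind the maximum achievable recall can be sketched as follows: if every screened record were relevant, recall would equal the number of records screened divided by the number of relevant records, capped at 1. This is an illustration, not the `max_recall_prop` function itself. The function name, the rounding rule, and the dataset of 200 records with 50 inclusions are hypothetical.

```python
def max_recall_at_proportions(n_records, n_relevant, proportions):
    """Upper bound on recall when every screened record is relevant."""
    max_recalls = []
    for p in proportions:
        # Assumption: the number of screened records is rounded to the nearest integer
        n_screened = round(p * n_records)
        max_recalls.append(min(1.0, n_screened / n_relevant))
    return max_recalls

# Hypothetical dataset with 200 records of which 50 are relevant
demo_max_recalls = max_recall_at_proportions(200, 50, [0, 0.1, 0.2, 0.3, 1])
print(demo_max_recalls)
```

With these numbers, screening 10% of the records can yield at most 20 of the 50 relevant records, so a perfect ranking cannot exceed a recall of 0.4 at that point.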

Create a raw table with the work-saved-over sampling, normalized work-saved-over sampling, workload reduction in number of records, and workload reduction in hours:

In [None]:
# Use the generate_wss_table function to create a table with the workload metrics
df_wss_sizes = generate_wss_table.generate_wss_table(evaluation_sizes, evaluation_sizes_nwss, n_simulations, data_type = 'adapted')

# Adapt the review names
df_wss_sizes['Review'] = df_wss_sizes.apply(lambda row: row['Review'] + '_' + str(row['Total records']) + '_' + str(row['Relevant records']), axis=1)

df_wss_sizes.head()
#df_wss_sizes.to_excel('Table_wss_sizes.xlsx')

Create a raw table with the precision metric:

In [None]:
# Output table for precision

# Collect one dataframe per review-seed combination and concatenate them once at the end
frames = []
length = n_simulations
for key, value in evaluation_sizes.items():
    # Keys have the form review_records_relevant_train_feature_query_seed
    names = key.split('_')
    review = [names[0]] * length
    n_records = [names[1]] * length
    rel_records = [names[2]] * length
    train_model = [names[3]] * length
    feature_model = [names[4]] * length
    query_model = [names[5]] * length
    seed = [names[6]] * length
    simulations = range(1, n_simulations+1)
    precision = value[0][4]['Precision']  # precision values of all simulations
    df_sim = pd.DataFrame(list(zip(review, seed, n_records, rel_records, train_model, feature_model, query_model, simulations, precision)),
                           columns = ['Review', 'Seed', 'Total records', 'Relevant records', 'Train model', 'Feature model', 'Query model', 'Simulation', 'Precision@95%'])
    frames.append(df_sim)

df_prec_sizes = pd.concat(frames, ignore_index = True)
    
df_prec_sizes['Review'] = df_prec_sizes.apply(lambda row: row['Review'] + '_' + str(row['Total records']) + '_' + str(row['Relevant records']), axis=1)

df_prec_sizes

### 13. Process raw tables into pooled tables (for results)

Create a table with the results of WSS/workload reduction and precision combined:

In [None]:
# TODO reset old generate results table function / add seeds
df_wss_prec_all_values_sizes, df_wss_prec_sizes = generate_results_table.generate_results_table(df_wss_sizes, df_prec_sizes, data_type = 'adapted')
#df_wss_prec_all_values_sizes[['Review', 'Total records', 'Relevant records']] = df_wss_prec_all_values_sizes['Review'].str.split('_', expand=True)

#df_wss_prec_sizes.to_excel('Table_WSS_precision_sizes.xlsx')
df_wss_prec_all_values_sizes.head()

In [None]:
df_wss_prec_sizes[['Review', 'Total records', 'Relevant records']] = df_wss_prec_sizes['Review'].str.split('_', expand=True)
df_wss_prec_sizes
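
As a loose illustration of what pooling a raw table can look like, the sketch below first averages a metric over the simulations within each seed and then summarises across seeds per adapted dataset. Whether `generate_results_table` pools in this way is not shown in this notebook. The column name `WSS@95%` and all values are invented for the example.

```python
import pandas as pd

# Hypothetical raw table: two adapted datasets, two seeds, two simulations each
raw_demo = pd.DataFrame({
    'Review': ['Int1_2000_50'] * 4 + ['Int1_500_50'] * 4,
    'Seed': [1, 1, 2, 2, 1, 1, 2, 2],
    'Simulation': [1, 2, 1, 2, 1, 2, 1, 2],
    'WSS@95%': [0.60, 0.70, 0.50, 0.60, 0.30, 0.40, 0.20, 0.30],
})

# Step 1: mean over the simulations within each review-seed combination
per_seed = raw_demo.groupby(['Review', 'Seed'])['WSS@95%'].mean().reset_index()

# Step 2: mean, minimum and maximum across seeds for each review
pooled_demo = per_seed.groupby('Review')['WSS@95%'].agg(['mean', 'min', 'max']).reset_index()
print(pooled_demo)
```

Averaging within seeds first gives each sampled dataset equal weight, which matters only if the number of simulations differs between seeds.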

### 14. Create histograms for WSS and precision (for results)

Renumber the reviews in the output, since Prognosis review 5 is not included in our study:

In [None]:
#df_wss_prec_all_values_ed = df_wss_prec_all_values.copy()
df_prop_sizes_ed = df_prop_sizes.copy()
df_max_recalls_sizes_ed = df_max_recalls_sizes.copy()
df_wss_prec_all_values_sizes_ed = df_wss_prec_all_values_sizes.copy()

# Remove prognosis review 5 (wrong study design), and change numbering
# df_wss_prec_all_values_sizes_ed['Review'] = df_wss_prec_all_values_sizes_ed['Review'].replace({'Prog6': 'Prog5',
#                                                                                                'Prog7': 'Prog6'})

df_wss_prec_all_values_sizes_ed['Review'] = df_wss_prec_all_values_sizes_ed['Review'].apply(lambda x: x.replace('Prog6', 'Prog5') if 'Prog6' in x else x)
df_wss_prec_all_values_sizes_ed['Review'] = df_wss_prec_all_values_sizes_ed['Review'].apply(lambda x: x.replace('Prog7', 'Prog6') if 'Prog7' in x else x)

df_prop_sizes_ed['Review'] = df_prop_sizes_ed['Review'].apply(lambda x: x.replace('Prog6', 'Prog5') if 'Prog6' in x else x)
df_prop_sizes_ed['Review'] = df_prop_sizes_ed['Review'].apply(lambda x: x.replace('Prog7', 'Prog6') if 'Prog7' in x else x)

df_prop_sizes_ed['Review_full'] = df_prop_sizes_ed['Review_full'].apply(lambda x: x.replace('Prognosis 6', 'Prognosis 5') if 'Prognosis 6' in x else x)
df_prop_sizes_ed['Review_full'] = df_prop_sizes_ed['Review_full'].apply(lambda x: x.replace('Prognosis 7', 'Prognosis 6') if 'Prognosis 7' in x else x)
df_prop_sizes_ed['Simulation'] = df_prop_sizes_ed['Simulation'].apply(lambda x: x.replace('Prognosis 6', 'Prognosis 5') if 'Prognosis 6' in x else x)
df_prop_sizes_ed['Simulation'] = df_prop_sizes_ed['Simulation'].apply(lambda x: x.replace('Prognosis 7', 'Prognosis 6') if 'Prognosis 7' in x else x)
df_max_recalls_sizes_ed = df_max_recalls_sizes_ed.drop(df_max_recalls_sizes_ed[df_max_recalls_sizes_ed['Review'] == 'Prog5'].index)
df_max_recalls_sizes_ed['Review'] = df_max_recalls_sizes_ed['Review'].replace({'Prog6': 'Prog5',
                                                                               'Prog7': 'Prog6'})
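
The renumbering above uses two different pandas mechanisms. With default settings, `Series.replace` with a dictionary only replaces values that match a key exactly, whereas a substring replacement also changes names that carry a suffix. The small sketch below shows the difference on made-up review names. The names are hypothetical and only mimic the `review_records_relevant` pattern used for the adapted datasets.

```python
import pandas as pd

# Hypothetical review names with a size suffix
reviews = pd.Series(['Prog6_2000_50', 'Prog7_500_50', 'Int1_1000_50'])

# Exact matching: no value equals 'Prog6' or 'Prog7', so nothing changes
exact = reviews.replace({'Prog6': 'Prog5', 'Prog7': 'Prog6'})

# Substring matching: the order matters, Prog6 must be renamed before Prog7
renumbered = (reviews.str.replace('Prog6', 'Prog5', regex=False)
                     .str.replace('Prog7', 'Prog6', regex=False))

print(exact.tolist())
print(renumbered.tolist())
```

If the two substring replacements were applied in the opposite order, both prognosis reviews would end up labelled `Prog5`.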

Create histograms for (n-)WSS and precision for intervention and prognosis reviews separately:

In [None]:
df_wss_prec_all_values_sizes_ed = df_wss_prec_all_values_sizes_ed.apply(pd.to_numeric, errors='ignore')

# Choose variables to plot
variables_to_plot = ['Mean_WSS95', 'Mean_nWSS95', 'Mean_prec95']
variable_names = ['WSS@95', 'n-WSS@95', 'Precision@95%']
y_lims = [1,1,0.5]
review_types = ['Int', 'Prog']
            
generate_histograms(dataset=df_wss_prec_all_values_sizes_ed, 
                    x='Total records', 
                    variables=variables_to_plot, 
                    y_labels=variable_names, 
                    review_types=review_types, 
                    hue='Review')

### 15. Create lineplots of increasing recall during screening (for results)

Create lineplots for all simulations (or any other subset of data/modeling methods):

In [None]:
def generate_lineplots(df_prop_ed, df_max_recalls_ed, subset_reviews = None):
    
    # Select the subset from the dataframe containing all recall values for each proportion screened
    df_lineplots = df_prop_ed.copy()
    
    df_lineplots['Models'] = df_lineplots['Models'].astype(pd.CategoricalDtype(categories=['tfidf - nb',
                                                                  'tfidf - svm',
                                                                  'tfidf - logistic',
                                                                  'sbert - svm',
                                                                  'sbert - logistic']))
    
    # Keep only the reviews whose name matches one of the requested patterns
    if subset_reviews is not None:
        reviews = '|'.join(subset_reviews)
        df_lineplots = df_lineplots[df_lineplots['Review'].str.contains(reviews, regex=True)]

    # Create a figure with lineplots for each review-model combination separately
    p2 = sns.catplot(data = df_lineplots, kind = 'point', 
                     x = 'percentage of records screened', y = 'recall', 
                     col = 'Review_full', 
                     hue = 'Seed', 
                     errorbar = 'ci',
                     col_wrap = 2, aspect = 1.4, legend = False
                     )

    axes = p2.fig.axes

    # Add the maximum recall and manual screening reference lines to each panel
    review_names = df_lineplots['Review'].unique()
    for i in range(0, len(review_names)):
        max_recalls_per_prop = df_max_recalls_ed.loc[df_max_recalls_ed['Review'] == review_names[i]]['Maximum recall']    
        manual_screening = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
        x_axis = df_lineplots['percentage of records screened'][0:len(max_recalls_per_prop)]
        # No format string here: color and linestyle are passed as keyword arguments
        axes[i].plot(x_axis, max_recalls_per_prop + 0.005, linewidth = 1, color = 'grey', linestyle = '--', label = "maximum recall") 
        axes[i].plot(x_axis, manual_screening, linewidth = 1, color = 'grey', linestyle = '-', label = "manual screening") 
        axes[i].legend(loc="lower right", fontsize = 12.5)
    
    # Guard against the default subset_reviews = None before indexing
    if subset_reviews is not None and subset_reviews[0] == 'Int8':
        p2.set_titles('Intervention A1', size = 16, weight = 'bold')
    else:
        p2.set_titles(col_template = "{col_name}", size = 16, weight = 'bold')
    p2.set_xlabels('Percentage of records screened', size = 16)
    p2.set_ylabels('Recall', size = 16)
    p2.set(ylim = (0, 1.01))
    plt.show()
    
subset_reviews = ['2000']
generate_lineplots(df_prop_sizes, df_max_recalls_sizes, subset_reviews)

In [None]:
subset_reviews = ['2000']
generate_lineplots(df_prop_sizes, df_max_recalls_sizes, subset_reviews)

In [None]:
subset_reviews = ['1000']
generate_lineplots(df_prop_sizes, df_max_recalls_sizes, subset_reviews)

In [None]:
subset_reviews = ['500']
generate_lineplots(df_prop_sizes, df_max_recalls_sizes, subset_reviews)

End