# Visualize Results: Downstream Performance - Multiclass Classification Corrupted Experiments -> Training and Test identically imputed

[Set Average Best Imputation Method Manually](#Set-Average-Best-Imputation-Method-Manually)

This notebook has been adapted -> use it for tests!

This notebook should answer the question: *Does imputation lead to better downstream performance?*

The data needs to be preprocessed with another notebook. Here we only import two CSV files: the raw results of the experiment and information about the datasets used.


In [115]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import seaborn as sns
from pandas.api.types import CategoricalDtype
from pathlib import Path

import plotly as py
import plotly.express as px
import plotly.graph_objects as go
import xarray as xr


%matplotlib inline

%load_ext autoreload
%autoreload 2



## Settings

In [116]:
sns.set(style="whitegrid")
sns.set_context('paper', font_scale=1.5)
mpl.rcParams['lines.linewidth'] = '2'

In [117]:
CLF_METRIC = "F1_macro"
REG_METRIC = "RMSE"

DOWNSTREAM_RESULT_TYPE = "downstream_performance_mean"
IMPUTE_RESULT_TYPE = "impute_performance_mean"


## Data Preparation

In [118]:
# import preprocessed data from experiments
results = pd.read_csv('../multiclass_classification_corrupted.csv')
results

Unnamed: 0,experiment,imputer,task,missing_type,missing_fraction,strategy,column,result_type,metric,train,test,baseline,corrupted,imputed
0,corrupted_multi_experiment,AutoKerasImputer,1459,MAR,0.01,single_single,V7,impute_performance_std,MAE,1.892547,0.876852,,,
1,corrupted_multi_experiment,AutoKerasImputer,1459,MAR,0.01,single_single,V7,impute_performance_std,MSE,45.273370,14.425227,,,
2,corrupted_multi_experiment,AutoKerasImputer,1459,MAR,0.01,single_single,V7,impute_performance_std,RMSE,2.402271,1.095047,,,
3,corrupted_multi_experiment,AutoKerasImputer,1459,MAR,0.10,single_single,V7,impute_performance_std,MAE,1.405203,0.063380,,,
4,corrupted_multi_experiment,AutoKerasImputer,1459,MAR,0.10,single_single,V7,impute_performance_std,MSE,33.147828,3.283238,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14551,corrupted_multi_experiment,VAEImputer,6,MNAR,0.30,single_single,x-box,downstream_performance_mean,F1_macro,,,0.727657,0.0,0.727422
14552,corrupted_multi_experiment,VAEImputer,6,MNAR,0.30,single_single,x-box,downstream_performance_mean,F1_weighted,,,0.729273,0.0,0.729067
14553,corrupted_multi_experiment,VAEImputer,6,MNAR,0.50,single_single,x-box,downstream_performance_mean,F1_micro,,,0.719000,0.0,0.718417
14554,corrupted_multi_experiment,VAEImputer,6,MNAR,0.50,single_single,x-box,downstream_performance_mean,F1_macro,,,0.718046,0.0,0.717661


In [119]:
# Filtering the relevant data for downstream analysis

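na_impute_results = results[
    (results["result_type"] == IMPUTE_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"]))
]
# drop() returns a new frame; this avoids the SettingWithCopyWarning raised by inplace=True on a slice
na_impute_results = na_impute_results.drop(["baseline", "corrupted", "imputed"], axis=1)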
na_impute_results = na_impute_results[na_impute_results.isna().any(axis=1)]
na_impute_results.shape






(3, 11)

In [120]:
# check if strategy type is correct!
STRATEGY_TYPE = "single_single"

downstream_results = results[
    (results["result_type"] == DOWNSTREAM_RESULT_TYPE) &
    (results["metric"].isin(["F1_macro", "RMSE"])) &
    (results["strategy"] == STRATEGY_TYPE)
]

# remove experiments where imputation failed
downstream_results = downstream_results.merge(
    na_impute_results,
    how = "left",
    validate = "one_to_one",
    indicator = True,
    suffixes=("", "_imp"),
    on = ["experiment", "imputer", "task", "missing_type", "missing_fraction", "strategy", "column"]
)
downstream_results = downstream_results[downstream_results["_merge"]=="left_only"]

assert len(results["strategy"].unique()) == 1
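# reassign instead of inplace=True, since downstream_results is a filtered slice at this point
downstream_results = downstream_results.drop(["experiment", "strategy", "result_type_imp", "metric_imp", "train", "test", "train_imp", "test_imp", "_merge"], axis=1)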

downstream_results = downstream_results.rename(
    {
        "imputer": "Imputation_Method",
        "task": "Task",
        "missing_type": "Missing Type",
        "missing_fraction": "Missing Fraction",
        "column": "Column",
        "baseline": "Baseline",
        "imputed": "Imputed",
        "corrupted": "Corrupted"
    },
    axis = 1
)
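
The `left_only` filter in the cell above is an anti-join: every downstream row that has a match in the table of failed imputations is dropped. A minimal sketch on toy data, where the frames `downstream` and `failed` and their values are illustrative and not taken from the experiment:

```python
import pandas as pd

# Toy stand-ins for the downstream results and the failed imputations (hypothetical values).
downstream = pd.DataFrame({
    "imputer": ["KNNImputer", "VAEImputer", "GAINImputer"],
    "task": [6, 6, 6],
    "imputed": [0.72, 0.71, 0.70],
})
failed = pd.DataFrame({"imputer": ["VAEImputer"], "task": [6]})

# indicator=True adds a "_merge" column holding "left_only", "right_only" or "both".
merged = downstream.merge(failed, how="left", on=["imputer", "task"], indicator=True)

# Keep only rows without a match in `failed`, then drop the helper column.
kept = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(kept["imputer"].tolist())
```

Rows marked `both` are the ones where imputation failed, so only `left_only` rows survive.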

In [121]:
rename_imputer_dict = {
    "ModeImputer": "Mean/Mode",
    "KNNImputer": "KNN",
    "ForestImputer": "Random Forest",
    "AutoKerasImputer": "Discriminative DL",
    "VAEImputer": "VAE",
    "GAINImputer": "GAIN"    
}

rename_metric_dict = {
    "F1_macro": CLF_METRIC,
    "RMSE": REG_METRIC
}

downstream_results = downstream_results.replace(rename_imputer_dict)
downstream_results = downstream_results.replace(rename_metric_dict)

downstream_results

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
0,Discriminative DL,1459,MAR,0.01,V7,downstream_performance_mean,F1_macro,0.289951,0.0,0.289993
1,Discriminative DL,1459,MAR,0.10,V7,downstream_performance_mean,F1_macro,0.296524,0.0,0.290257
2,Discriminative DL,1459,MAR,0.30,V7,downstream_performance_mean,F1_macro,0.282710,0.0,0.281470
3,Discriminative DL,1459,MAR,0.50,V7,downstream_performance_mean,F1_macro,0.285329,0.0,0.274868
4,Discriminative DL,1459,MCAR,0.01,V7,downstream_performance_mean,F1_macro,0.310493,0.0,0.311259
...,...,...,...,...,...,...,...,...,...,...
1208,VAE,6,MCAR,0.50,x-box,downstream_performance_mean,F1_macro,0.721662,0.0,0.722729
1209,VAE,6,MNAR,0.01,x-box,downstream_performance_mean,F1_macro,0.726015,0.0,0.725814
1210,VAE,6,MNAR,0.10,x-box,downstream_performance_mean,F1_macro,0.718090,0.0,0.719645
1211,VAE,6,MNAR,0.30,x-box,downstream_performance_mean,F1_macro,0.727657,0.0,0.727422


### Robustness: Check which Imputers Yielded `NaN` Values

In [122]:
for col in downstream_results.columns:
    na_sum = downstream_results[col].isna().sum()
    if na_sum > 0:
        print("-----" * 10)        
        print(col, na_sum)
        print("-----" * 10)        
        na_idx = downstream_results[col].isna()
        print(downstream_results.loc[na_idx, "Imputation_Method"].value_counts(dropna=False))
        print("\n")
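
The same check can be condensed into a single table of `NaN` counts per imputation method. This is a sketch on toy data, assuming only the score columns are of interest; the frame `toy` and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical results with some missing scores.
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "KNN", "VAE", "VAE"],
    "Baseline": [0.7, 0.7, 0.7, 0.7],
    "Imputed": [0.71, np.nan, np.nan, np.nan],
})

# Boolean NaN mask, grouped by method and summed: one row per method, one column per score.
value_cols = ["Baseline", "Imputed"]
nan_counts = toy[value_cols].isna().groupby(toy["Imputation_Method"]).sum()
print(nan_counts)
```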

## Adding Dataset Info, Sorting and Ranking

In [123]:
#downstream_results.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1210 entries, 0 to 1212
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Imputation_Method  1210 non-null   object 
 1   Task               1210 non-null   int64  
 2   Missing Type       1210 non-null   object 
 3   Missing Fraction   1210 non-null   float64
 4   Column             1210 non-null   object 
 5   result_type        1210 non-null   object 
 6   metric             1210 non-null   object 
 7   Baseline           1210 non-null   float64
 8   Corrupted          1210 non-null   float64
 9   Imputed            1210 non-null   float64
dtypes: float64(4), int64(1), object(5)
memory usage: 104.0+ KB


In [124]:
# Sorting of data

#adjust order to fit the processing time -> fastest first
methods_order = CategoricalDtype(['Mean/Mode', 'KNN', 'Random Forest', 'VAE',  'GAIN', 'Discriminative DL'], ordered=True)
downstream_results_full_sort = downstream_results.copy()

downstream_results_full_sort['Imputation_Method'] = downstream_results_full_sort['Imputation_Method'].astype(methods_order)
downstream_results_full_sort = downstream_results_full_sort.sort_values(
    ['Task', 'Missing Type', 'Missing Fraction', 'Imputed', 'Imputation_Method'],
    ascending=[True, True, True, True, True]
)


downstream_results_full_sort


Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed
997,Mean/Mode,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725643,0.0,0.725766
1201,VAE,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725821,0.0,0.725778
391,Random Forest,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727330,0.0,0.727075
793,KNN,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727724,0.0,0.727724
188,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828
...,...,...,...,...,...,...,...,...,...,...
176,Discriminative DL,41671,MNAR,0.50,a9,downstream_performance_mean,F1_macro,0.240118,0.0,0.240732
984,Mean/Mode,41671,MNAR,0.50,a9,downstream_performance_mean,F1_macro,0.239275,0.0,0.240780
1188,VAE,41671,MNAR,0.50,a9,downstream_performance_mean,F1_macro,0.241995,0.0,0.242421
576,GAIN,41671,MNAR,0.50,a9,downstream_performance_mean,F1_macro,0.239877,0.0,0.243434


In [125]:
#downstream_results_full_sort.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1210 entries, 997 to 780
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Imputation_Method  1210 non-null   category
 1   Task               1210 non-null   int64   
 2   Missing Type       1210 non-null   object  
 3   Missing Fraction   1210 non-null   float64 
 4   Column             1210 non-null   object  
 5   result_type        1210 non-null   object  
 6   metric             1210 non-null   object  
 7   Baseline           1210 non-null   float64 
 8   Corrupted          1210 non-null   float64 
 9   Imputed            1210 non-null   float64 
dtypes: category(1), float64(4), int64(1), object(4)
memory usage: 95.9+ KB


In [126]:
# add dataset information from other csv file

dataset_info = pd.read_csv('../datasets_information_overview.csv')
dataset_info = dataset_info.rename(columns={"did": "Task"})

downstream_results_full_sort = pd.merge(downstream_results_full_sort, dataset_info, on='Task')
#downstream_results_full_sort.head()

Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses
0,Mean/Mode,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725643,0.0,0.725766,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0
1,VAE,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725821,0.0,0.725778,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0
2,Random Forest,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.72733,0.0,0.727075,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0
3,KNN,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727724,0.0,0.727724,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0
4,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0


In [127]:
# Ranking of downstream performance per data constellation for every imputation method

EXPERIMENTAL_CONDITIONS = ["Task", "Missing Type", "Missing Fraction", "Column", "result_type"]

downstream_results_rank = downstream_results_full_sort.copy()
# select "Imputed" before ranking so that only the score column is ranked
downstream_results_rank["Downstream Performance Rank"] = downstream_results_rank.groupby(EXPERIMENTAL_CONDITIONS)["Imputed"].rank(ascending=False, na_option="bottom", method="first")


# create csv for detailed checks
downstream_results_rank.to_csv('downstream_results_multi_complete_overview.csv')
downstream_results_rank.head()


Unnamed: 0.1,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,Unnamed: 0,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank
0,Mean/Mode,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725643,0.0,0.725766,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,5.0
1,VAE,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725821,0.0,0.725778,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,4.0
2,Random Forest,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.72733,0.0,0.727075,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,3.0
3,KNN,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727724,0.0,0.727724,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,2.0
4,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828,59,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0
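
To illustrate how the rank column behaves, here is a minimal sketch of a grouped rank on toy data. The frame `toy` and its values are hypothetical, and a single grouping column stands in for `EXPERIMENTAL_CONDITIONS`:

```python
import pandas as pd

# Hypothetical scores: two tasks, with a tie in task 6.
toy = pd.DataFrame({
    "Task": [6, 6, 6, 1459, 1459],
    "Imputation_Method": ["KNN", "VAE", "GAIN", "KNN", "VAE"],
    "Imputed": [0.72, 0.72, 0.70, 0.29, 0.31],
})

# Rank within each task; a higher score is better, hence ascending=False.
# method="first" gives tied rows distinct ranks (1 and 2) instead of a shared 1.5.
toy["Rank"] = toy.groupby("Task")["Imputed"].rank(
    ascending=False, method="first", na_option="bottom"
)
print(toy)
```

Because ties receive distinct ranks, the number of rank-1 "wins" per method can depend on row order whenever two methods reach exactly the same score.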


In [128]:
# Adjust column type for Imputation_Method
downstream_results_rank['Imputation_Method'] = downstream_results_rank['Imputation_Method'].astype('object')

#downstream_results_rank.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1210 entries, 0 to 1209
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Imputation_Method            1210 non-null   object 
 1   Task                         1210 non-null   int64  
 2   Missing Type                 1210 non-null   object 
 3   Missing Fraction             1210 non-null   float64
 4   Column                       1210 non-null   object 
 5   result_type                  1210 non-null   object 
 6   metric                       1210 non-null   object 
 7   Baseline                     1210 non-null   float64
 8   Corrupted                    1210 non-null   float64
 9   Imputed                      1210 non-null   float64
 10  Unnamed: 0                   1210 non-null   int64  
 11  name                         1210 non-null   object 
 12  MajorityClassSize            1210 non-null   float64
 13  MinorityClassSize 

In [129]:
# Merge the two columns "Missing Type" and "Missing Fraction"

downstream_results_rank['Missing Type'] = downstream_results_rank['Missing Type'].astype(str)
downstream_results_rank['Missing Fraction'] = downstream_results_rank['Missing Fraction'].astype(str)
#datatype_new = downstream_results_rank.dtypes

downstream_results_rank['Data_Constellation'] = downstream_results_rank['Missing Type'] + ' - ' + downstream_results_rank['Missing Fraction']
#downstream_results_rank.to_csv('downstream_results_rank_temp.csv')
downstream_results_rank_heatmap2 = downstream_results_rank.copy()
downstream_results_rank.head()


Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
0,Mean/Mode,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725643,0.0,0.725766,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,5.0,MAR - 0.01
1,VAE,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.725821,0.0,0.725778,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,4.0,MAR - 0.01
2,Random Forest,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.72733,0.0,0.727075,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.01
3,KNN,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727724,0.0,0.727724,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,2.0,MAR - 0.01
4,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.01


## Analyzing Performance Based on Rank per Data Constellation

In [130]:
data = downstream_results_rank.copy()

# Count the number of distinct data constellations in column "Data_Constellation"
dc_unique = data.Data_Constellation.unique().size
print(dc_unique, "Data Constellations")
print("_____________________")
# Count how often each rank occurs in column "Downstream Performance Rank"
rank_count = data['Downstream Performance Rank'].value_counts()
print(rank_count)
print("_____________________")
# Filter for 1.0 Ranking -> Overview -> save as csv
rank_1 = data.loc[data['Downstream Performance Rank'] == 1.0]
rank_1.to_csv('rank_1.csv')

print("_____________________")
# Count how often each Imputation Method is present -> most "wins"
rank_wins = rank_1['Imputation_Method'].value_counts()
print(rank_wins)
print("_____________________")

# BE AWARE THAT THE AVERAGE RANK DOES NOT CONSIDER MISSING RESULTS, WHICH RESULT IN THE WORST RANK BY DEFAULT
# Take initial overview and filter for each imputation method and calculate average rank and average improvement
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
for i in methods:
    df_average_rank = data.loc[data['Imputation_Method'] == i]
    len_ar = len(df_average_rank)
    print(len_ar, "Amount of results available")
    rank_pos = df_average_rank['Downstream Performance Rank'].value_counts().sort_index(ascending=True)
    print(rank_pos)
    average_rank = df_average_rank["Downstream Performance Rank"].mean()
    print("Average Rank for", i, "is", average_rank)
    #average_improvement = df_average_rank["Improvement"].mean()
    #print("Average Improvement to baseline is", average_improvement)
    print("_____________________")



12 Data Constellations
_____________________
4.0    204
3.0    204
2.0    204
1.0    204
5.0    203
6.0    191
Name: Downstream Performance Rank, dtype: int64
_____________________
_____________________
VAE                  41
KNN                  36
Mean/Mode            35
GAIN                 35
Random Forest        34
Discriminative DL    23
Name: Imputation_Method, dtype: int64
_____________________
204 Amount of results available
1.0    34
2.0    36
3.0    46
4.0    41
5.0    22
6.0    25
Name: Downstream Performance Rank, dtype: int64
Average Rank for Random Forest is 3.2745098039215685
_____________________
204 Amount of results available
1.0    36
2.0    39
3.0    38
4.0    30
5.0    36
6.0    25
Name: Downstream Performance Rank, dtype: int64
Average Rank for KNN is 3.323529411764706
_____________________
204 Amount of results available
1.0    35
2.0    33
3.0    45
4.0    29
5.0    37
6.0    25
Name: Downstream Performance Rank, dtype: int64
Average Rank for Mean/Mode is 3.36

In [131]:
rank_1_backup = rank_1.copy()
rank_1

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
4,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.01
9,KNN,6,MAR,0.1,x-box,downstream_performance_mean,F1_macro,0.725914,0.0,0.726020,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.1
15,KNN,6,MAR,0.3,x-box,downstream_performance_mean,F1_macro,0.723328,0.0,0.724518,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.3
21,Random Forest,6,MAR,0.5,x-box,downstream_performance_mean,F1_macro,0.725631,0.0,0.726487,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.5
26,Mean/Mode,6,MCAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727026,0.0,0.726898,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MCAR - 0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1186,GAIN,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.265186,0.0,0.245338,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MCAR - 0.5
1189,KNN,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.01
1197,Discriminative DL,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240445,0.0,0.241994,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.1
1203,Mean/Mode,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.244691,0.0,0.244943,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.3


## Set Average Best Imputation Method Manually

In [132]:
# SET AVERAGE BEST IMPUTATION METHOD HERE, BASED ON THE PREVIOUS RESULTS
# Alternatively you can define a baseline method here, which will be used instead, depending on your analysis goals

AVERAGE_BEST_IMPUTATION_METHOD = "Random Forest"
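
The constant could also be derived from the ranking table rather than typed in. The sketch below assumes that "average best" means the lowest mean rank, which is one possible criterion and not necessarily the one used for the manual choice; the frame `toy` and the name `suggested_method` are hypothetical:

```python
import pandas as pd

# Hypothetical ranks for three methods over two data constellations.
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "KNN", "Random Forest", "Random Forest", "VAE", "VAE"],
    "Downstream Performance Rank": [2.0, 3.0, 1.0, 2.0, 3.0, 1.0],
})

# Mean rank per method, best (lowest) first.
mean_rank = (
    toy.groupby("Imputation_Method")["Downstream Performance Rank"]
    .mean()
    .sort_values()
)
suggested_method = mean_rank.idxmin()
print(mean_rank)
print(suggested_method)
```

As the comment in the ranking cell warns, methods with missing results contribute fewer rows, so their mean rank is not directly comparable.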

## Differences in Performance Relative to Average Best Imputation Method

In [133]:
# copy() so that the column assignments below do not trigger a SettingWithCopyWarning
av_best = data.loc[data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
av_best['Task'] = av_best['Task'].astype(str)
av_best['Data_Constellation'] = av_best['Data_Constellation'] + ' - ' + av_best['Task']

av_best = av_best[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
av_best = av_best.rename(columns={'Imputation_Method':'Imputation_Method_average', 
                               'Imputed':'Imputed_average',
                                 'Downstream Performance Rank':'Downstream Performance Rank Average'})

# rank_1 is a slice of data; work on a copy before modifying it
rank_1 = rank_1.copy()
rank_1['Task'] = rank_1['Task'].astype(str)
rank_1['Data_Constellation'] = rank_1['Data_Constellation'] + ' - ' + rank_1['Task']
rank_1 = rank_1[['Imputation_Method', 'Imputed', 'Data_Constellation', 'Downstream Performance Rank']]
rank_1 = rank_1.rename(columns={'Imputation_Method':'Imputation_Method_best', 
                               'Imputed':'Imputed_best',
                               'Downstream Performance Rank':'Downstream Performance Rank Best'})

performance_difference = pd.merge(av_best, rank_1, on='Data_Constellation')
#performance_difference.head()




In [134]:
# Calculate the difference between the best imputation method for each data constellation to the average best imputation method in F1 score

performance_difference['Performance Difference Best to Average'] = performance_difference['Imputed_best'] - performance_difference['Imputed_average']
Average_Difference = performance_difference['Performance Difference Best to Average'].mean()
print("Average Difference in Improvement from best method to average best method for F1", Average_Difference)


Average Difference in Improvement from best method to average best method for F1 0.015333953622047472


In [135]:
# Improvement by Percentage

performance_difference['Performance Difference Best to Average in Percentage'] = ((performance_difference['Imputed_best'] - performance_difference['Imputed_average'])/performance_difference['Imputed_best'])*100
Average_Difference_per = performance_difference['Performance Difference Best to Average in Percentage'].mean()

print("Based on F1 Score the Average best method is worse than the best method by this percentage", Average_Difference_per)

Based on F1 Score the Average best method is worse than the best method by this percentage 4.190877890258599
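
Note that the percentage above is taken relative to the best score, whereas the loop in the next section divides by the score of the average best method. A small worked example with made-up F1 values shows that the two conventions give slightly different numbers:

```python
# Hypothetical F1 scores for one data constellation.
best, average = 0.80, 0.76

# Relative to the best score (the convention of the cell above).
rel_to_best = (best - average) / best * 100

# Relative to the average-best score (the convention of the later loop, before scaling by 100).
rel_to_average = (best - average) / average * 100

print(round(rel_to_best, 3), round(rel_to_average, 3))
```

The second value is always at least as large as the first, because the denominator is never larger than the best score.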


In [136]:
performance_difference.to_csv('performance_difference.csv')
#performance_difference

## Analysis and Ranking based on F1 Score

In [137]:
# Relative Difference in Percent -> Best Method to Average Best Method

data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']


dc_unique = data.Data_Constellation_full.unique()

data_constellations = dc_unique.tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
average_best_frames = []


for i in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == i]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    # copy() so that adding a column below does not trigger a SettingWithCopyWarning
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
    best_score_int = best_score.iloc[0]['Imputed']
    average_best_int = average_best.iloc[0]['Imputed']
    # relative difference as a fraction of the average-best score (0.05 corresponds to 5 %)
    calc_result = ((best_score_int - average_best_int)/average_best_int)
    average_best['Performance Difference to Best to Average in Percent'] = calc_result
    average_best_frames.append(average_best)

# DataFrame.append is deprecated; concatenate the collected frames once instead
average_best_complete = pd.concat(average_best_frames)

average_best_complete

    




Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Best to Average in Percent
2,Random Forest,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727330,0.0,0.727075,...,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.01,MAR - 0.01 - 6,0.001035
7,Random Forest,6,MAR,0.1,x-box,downstream_performance_mean,F1_macro,0.724750,0.0,0.725023,...,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.1,MAR - 0.1 - 6,0.001374
13,Random Forest,6,MAR,0.3,x-box,downstream_performance_mean,F1_macro,0.722413,0.0,0.722508,...,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.3,MAR - 0.3 - 6,0.002782
21,Random Forest,6,MAR,0.5,x-box,downstream_performance_mean,F1_macro,0.725631,0.0,0.726487,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.5,MAR - 0.5 - 6,0.000000
25,Random Forest,6,MCAR,0.01,x-box,downstream_performance_mean,F1_macro,0.726720,0.0,0.726623,...,734.0,17.0,20000.0,16.0,1.0,26.0,2.0,MCAR - 0.01,MCAR - 0.01 - 6,0.000378
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1184,Random Forest,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240411,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MCAR - 0.5,MCAR - 0.5 - 41671,0.020493
1190,Random Forest,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.01,MNAR - 0.01 - 41671,0.000000
1196,Random Forest,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.1,MNAR - 0.1 - 41671,0.005711
1201,Random Forest,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.3,MNAR - 0.3 - 41671,0.017970


In [138]:
# Mean relative difference; the column holds fractions, so 0.049 corresponds to 4.9 %
average_difference = average_best_complete['Performance Difference to Best to Average in Percent'].mean()
print(average_difference, "average difference in Percent")

0.0492004646260591 average difference in Percent
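
The per-constellation loop above can also be written without explicit iteration. This is a minimal sketch on toy data, assuming that the rank-1 score equals the maximum `Imputed` value within each constellation (which holds when no results are missing); the frame `toy`, the column `rel_diff` and the name `reference_method` are illustrative:

```python
import pandas as pd

# Hypothetical scores for two data constellations and three methods.
toy = pd.DataFrame({
    "Data_Constellation_full": ["MAR - 0.1 - 6"] * 3 + ["MCAR - 0.5 - 6"] * 3,
    "Imputation_Method": ["KNN", "Random Forest", "VAE"] * 2,
    "Imputed": [0.80, 0.76, 0.70, 0.50, 0.60, 0.55],
})
reference_method = "Random Forest"  # stands in for AVERAGE_BEST_IMPUTATION_METHOD

# Best score per constellation, broadcast back to every row of that constellation.
best = toy.groupby("Data_Constellation_full")["Imputed"].transform("max")

# Relative gap of each row to the best score, as a fraction of the row's own score.
toy["rel_diff"] = (best - toy["Imputed"]) / toy["Imputed"]

# Keep only the reference method and average over constellations.
reference_rows = toy[toy["Imputation_Method"] == reference_method]
mean_rel_diff = reference_rows["rel_diff"].mean()
print(mean_rel_diff)
```

A gap of zero means the reference method was itself the best method for that constellation.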


In [139]:
# Difference in absolute values (F1 score): best method vs. average best method

data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

data_constellations = data['Data_Constellation_full'].unique().tolist()
average_best_rows = []  # one frame per data constellation, concatenated after the loop

for constellation in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == constellation]
    best_score = data_constel.loc[data_constel['Downstream Performance Rank'] == 1.0]
    # explicit copy avoids the SettingWithCopyWarning on the assignment below
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD].copy()
    best_score_value = best_score.iloc[0]['Imputed']
    average_best_value = average_best.iloc[0]['Imputed']

    average_best['Performance Difference to Best to Average in absolute'] = best_score_value - average_best_value
    average_best_rows.append(average_best)

# DataFrame.append is deprecated; build the result with a single pd.concat
average_best_total = pd.concat(average_best_rows)
average_best_total




Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Best to Average in absolute
2,Random Forest,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727330,0.0,0.727075,...,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.01,MAR - 0.01 - 6,0.000753
7,Random Forest,6,MAR,0.1,x-box,downstream_performance_mean,F1_macro,0.724750,0.0,0.725023,...,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.1,MAR - 0.1 - 6,0.000996
13,Random Forest,6,MAR,0.3,x-box,downstream_performance_mean,F1_macro,0.722413,0.0,0.722508,...,734.0,17.0,20000.0,16.0,1.0,26.0,3.0,MAR - 0.3,MAR - 0.3 - 6,0.002010
21,Random Forest,6,MAR,0.5,x-box,downstream_performance_mean,F1_macro,0.725631,0.0,0.726487,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.5,MAR - 0.5 - 6,0.000000
25,Random Forest,6,MCAR,0.01,x-box,downstream_performance_mean,F1_macro,0.726720,0.0,0.726623,...,734.0,17.0,20000.0,16.0,1.0,26.0,2.0,MCAR - 0.01,MCAR - 0.01 - 6,0.000274
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1184,Random Forest,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240411,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MCAR - 0.5,MCAR - 0.5 - 41671,0.004927
1190,Random Forest,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.01,MNAR - 0.01 - 41671,0.000000
1196,Random Forest,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,2.0,MNAR - 0.1,MNAR - 0.1 - 41671,0.001374
1201,Random Forest,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,3.0,MNAR - 0.3,MNAR - 0.3 - 41671,0.004324


In [140]:
average_difference = average_best_total['Performance Difference to Best to Average in absolute'].mean()
print(average_difference, "average difference in absolute")

0.015333953622047472 average difference in absolute
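
The loop-based calculation above can also be written without an explicit loop. The following is a minimal sketch on made-up toy data, not the notebook's actual results. It assumes that rank 1 always corresponds to the highest `Imputed` score within a data constellation; `toy`, `reference_method`, `best_score_per_group` and `gap_to_best` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy long-format results; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Data_Constellation_full": ["MAR - 0.1 - 6"] * 3 + ["MCAR - 0.3 - 6"] * 3,
    "Imputation_Method": ["Random Forest", "KNN", "Mean/Mode"] * 2,
    "Imputed": [0.70, 0.72, 0.68, 0.66, 0.64, 0.65],
})
reference_method = "Random Forest"  # stand-in for AVERAGE_BEST_IMPUTATION_METHOD

# Best score per constellation, broadcast back to every row of that constellation.
# Assumption: the rank-1 method is the one with the highest "Imputed" score.
toy["best_score_per_group"] = (
    toy.groupby("Data_Constellation_full")["Imputed"].transform("max")
)

# Keep only the reference method and measure its gap to the best score.
gap = toy.loc[toy["Imputation_Method"] == reference_method].copy()
gap["gap_to_best"] = gap["best_score_per_group"] - gap["Imputed"]

mean_gap = gap["gap_to_best"].mean()
print(gap[["Data_Constellation_full", "gap_to_best"]])
print(mean_gap, "average difference in absolute")
```

Because `transform` returns a column aligned with the original rows, no per-constellation frames need to be collected or concatenated, which also avoids the chained-assignment warnings.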


## Heatmap to Show Detailed Performance of Each Imputation Method for Each Data Constellation

In [141]:
df_heat = downstream_results_rank.copy()
# keep only the columns needed for the heatmaps
df_heat = df_heat.drop(columns=["Missing Type", "Missing Fraction", "Column", "result_type", "metric", "Baseline", "Corrupted", "Unnamed: 0", "name", "NumberOfClasses", "MajorityClassSize", "MinorityClassSize"])
df_heat

Unnamed: 0,Imputation_Method,Task,Imputed,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,Downstream Performance Rank,Data_Constellation
0,Mean/Mode,6,0.725766,17.0,20000.0,16.0,1.0,5.0,MAR - 0.01
1,VAE,6,0.725778,17.0,20000.0,16.0,1.0,4.0,MAR - 0.01
2,Random Forest,6,0.727075,17.0,20000.0,16.0,1.0,3.0,MAR - 0.01
3,KNN,6,0.727724,17.0,20000.0,16.0,1.0,2.0,MAR - 0.01
4,Discriminative DL,6,0.727828,17.0,20000.0,16.0,1.0,1.0,MAR - 0.01
...,...,...,...,...,...,...,...,...,...
1205,Discriminative DL,41671,0.240732,21.0,20000.0,20.0,1.0,5.0,MNAR - 0.5
1206,Mean/Mode,41671,0.240780,21.0,20000.0,20.0,1.0,4.0,MNAR - 0.5
1207,VAE,41671,0.242421,21.0,20000.0,20.0,1.0,3.0,MNAR - 0.5
1208,GAIN,41671,0.243434,21.0,20000.0,20.0,1.0,2.0,MNAR - 0.5


In [142]:
# Heatmap for total F1 score for each data constellation for each method

df_heat = df_heat.astype({"Task":"string"})

data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']


for i in data_constellations:
    data_constel = df_heat.loc[df_heat['Data_Constellation'] == i]

    ### uncomment the sort order you want to investigate

    ## sort by number of data points (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by number of features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by number of data points and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by number of categorical features and data points (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by number of numerical features and data points (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])
    
    dataset_number = data_constel["Task"]
    imputation_method = data_constel["Imputation_Method"]
    f1_score = data_constel["Imputed"]

    # one heatmap per data constellation: datasets on x, imputation methods on y, F1 score as color
    trace = go.Heatmap(
        z=f1_score,
        x=dataset_number,
        y=imputation_method,
        autocolorscale=False,
        colorscale='Reds',
        zmin=0,
    )
    fig = go.Figure(data=[trace])
    fig.update_layout(title=i, xaxis_nticks=36)
    fig.show()
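
To inspect the numbers behind one of these heatmaps, the long-format rows can be reshaped into a method-by-dataset matrix. This is a minimal sketch on made-up toy data; the task IDs `"101"` and `"102"`, all scores, and the names `toy`, `matrix` and `task_order` are hypothetical. Combinations without a result show up as `NaN`.

```python
import pandas as pd

# Toy rows for a single data constellation; all values are made up.
toy = pd.DataFrame({
    "Task": ["101", "101", "102", "102", "102"],
    "Imputation_Method": ["KNN", "VAE", "KNN", "VAE", "GAIN"],
    "Imputed": [0.73, 0.72, 0.24, 0.25, 0.26],
    "NumberOfInstances": [20000.0, 20000.0, 500.0, 500.0, 500.0],
})

# Rows: imputation methods, columns: datasets, cells: F1 score.
matrix = toy.pivot_table(index="Imputation_Method", columns="Task", values="Imputed")

# Order the dataset columns by dataset size (ascending), as in the sort above.
task_order = (
    toy.drop_duplicates("Task")
       .sort_values("NumberOfInstances")["Task"]
       .tolist()
)
matrix = matrix[task_order]
print(matrix)
```

A matrix like this could also be handed to `go.Heatmap` as a two-dimensional `z`, with `matrix.columns` as `x` and `matrix.index` as `y`, which makes missing method and dataset combinations explicit.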



In [143]:
df_heat_dif = downstream_results_rank_heatmap2.copy()


In [144]:
# Calculate the difference of every imputation method to the average best imputation method per data constellation
# Differences are absolute F1 scores (not percentages)

data = downstream_results_rank.copy()
data['Task'] = data['Task'].astype(str)
data['Data_Constellation_full'] = data['Data_Constellation'] + ' - ' + data['Task']

data_constellations = data['Data_Constellation_full'].unique().tolist()
methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
difference_rows = []  # one frame per method and constellation, concatenated after the loop

for constellation in data_constellations:
    data_constel = data.loc[data['Data_Constellation_full'] == constellation]
    average_best = data_constel.loc[data_constel['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD]
    average_best_score = average_best.iloc[0]['Imputed']
    for method in methods:
        # explicit copy avoids the SettingWithCopyWarning on the assignment below
        current_score_row = data_constel.loc[data_constel['Imputation_Method'] == method].copy()
        if current_score_row.empty:
            print("Imputation Method not here ---------------------")
            continue
        current_score = current_score_row.iloc[0]['Imputed']
        current_score_row['Performance Difference to Average Best'] = current_score - average_best_score
        difference_rows.append(current_score_row)

# DataFrame.append is deprecated; build the result with a single pd.concat
heatmap_data_difference = pd.concat(difference_rows)

heatmap_data_difference['Missing Type'] = heatmap_data_difference['Missing Type'].astype(str)
heatmap_data_difference['Missing Fraction'] = heatmap_data_difference['Missing Fraction'].astype(str)




Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------




Imputation Method not here ---------------------
Imputation Method not here ---------------------




Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------




Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------
Imputation Method not here ---------------------




Imputation Method not here ---------------------
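
The two warnings in the output above are pandas' `SettingWithCopyWarning` and the deprecation notice for `DataFrame.append`. The following is a minimal sketch of the usual remedies, not the notebook's actual ranking code: the frame `toy`, its values, and the `Rank` column are invented for illustration. It takes an explicit `.copy()` before assigning to a filtered frame, and collects partial frames in a list that is combined once with `pd.concat`.

```python
import pandas as pd

# Toy stand-in for the results frame; values are invented for illustration.
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "VAE", "KNN", "GAIN"],
    "Imputed": [0.71, 0.69, 0.73, 0.70],
})

# 1) Take an explicit copy of the filtered rows before adding a column.
#    Assigning to a plain slice is what triggers SettingWithCopyWarning.
knn_rows = toy.loc[toy["Imputation_Method"] == "KNN"].copy()
knn_rows["Rank"] = knn_rows["Imputed"].rank(ascending=False)

# 2) Collect partial frames in a list and concatenate once,
#    instead of calling the deprecated DataFrame.append in a loop.
parts = []
for method, group in toy.groupby("Imputation_Method"):
    group = group.copy()
    group["Rank"] = group["Imputed"].rank(ascending=False)
    parts.append(group)
ranked = pd.concat(parts, ignore_index=True)
```

Because of the explicit copies, `toy` itself is left unchanged and only the derived frames carry the new column.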




In [145]:
# Heatmap of F1 score differences relative to the average best imputation method:
# one figure per data constellation (x: task, y: imputation method)

heatmap_data_difference = heatmap_data_difference.astype({"Task":"string"})
data_constellations = ['MAR - 0.01', 'MAR - 0.1', 'MAR - 0.3', 'MAR - 0.5', 'MCAR - 0.01', 'MCAR - 0.1', 'MCAR - 0.3', 'MCAR - 0.5', 'MNAR - 0.01', 'MNAR - 0.1', 'MNAR - 0.3', 'MNAR - 0.5']

for i in data_constellations:
    # filter on the frame's own column so that mask and data share the same index
    data_constel = heatmap_data_difference.loc[heatmap_data_difference['Data_Constellation'] == i]

    ### uncomment whichever sort order you want to investigate

    ## sort by number of data points (ascending)
    data_constel = data_constel.sort_values(by=['NumberOfInstances'])

    ## sort by number of features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfFeatures'])

    ## sort by number of data points and features (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfInstances', 'NumberOfFeatures'])

    ## sort by number of categorical features and data points (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfCategoricalFeatures', 'NumberOfInstances'])

    ## sort by number of numerical features and data points (ascending)
    #data_constel = data_constel.sort_values(by=['NumberOfNumericFeatures', 'NumberOfInstances'])

    Dataset_number = data_constel["Task"]
    Imputation_Method = data_constel["Imputation_Method"]
    Improvement = data_constel["Performance Difference to Average Best"]

    # fixed, symmetric color range so that all figures are comparable;
    # values beyond +/-0.14 are drawn in the end colors of the scale
    trace = go.Heatmap(
        z=Improvement,
        x=Dataset_number,
        y=Imputation_Method,
        autocolorscale=False,
        colorscale='RdBu_r',
        zmid=0,
        zmin=-0.14,
        zmax=0.14,
    )
    fig = go.Figure(data=[trace])
    fig.update_layout(
        title=i,
        xaxis_nticks=36)
    fig.show()
    fig.write_image("multi_heatmap_f1_score_improvement_to_avbest%s.pdf" % i)

In [146]:
heatmap_data_difference.agg(['min', 'max'])

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Average Best
min,Discriminative DL,1459,MAR,0.01,A2,downstream_performance_mean,F1_macro,0.094034,0.0,0.094081,...,1.0,5.0,3200.0,0.0,1.0,3.0,1.0,MAR - 0.01,MAR - 0.01 - 1459,-0.158444
max,VAE,6,MNAR,0.5,x-box,downstream_performance_mean,F1_macro,0.933351,0.0,0.937845,...,4335.0,25.0,58000.0,21.0,25.0,102.0,6.0,MNAR - 0.5,MNAR - 0.5 - 6,0.114316


In [147]:
heatmap_data_difference
heatmap_data_difference.to_csv('multi_imputed_full_info.csv', index=False)

## Improvement Proportions for All Data Constellations and Methods Relative to Average Best Method

In [148]:
# data preprocessing here
df_quantiles = heatmap_data_difference.copy()
df_quantiles = df_quantiles.drop(df_quantiles[df_quantiles["Imputation_Method"] == AVERAGE_BEST_IMPUTATION_METHOD].index)

df_10 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] > (-0.09))].index)
df_09 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (-0.09)) | (df_quantiles["Performance Difference to Average Best"] > (-0.07))].index)
df_07 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (-0.07)) | (df_quantiles["Performance Difference to Average Best"] > (-0.05))].index)
df_05 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (-0.05)) | (df_quantiles["Performance Difference to Average Best"] > (-0.03))].index)
df_03 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (-0.03)) | (df_quantiles["Performance Difference to Average Best"] > (-0.01))].index)
df_01 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (-0.01)) | (df_quantiles["Performance Difference to Average Best"] > (0.01))].index)
df01 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (0.01)) | (df_quantiles["Performance Difference to Average Best"] > (0.03))].index)
df03 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (0.03)) | (df_quantiles["Performance Difference to Average Best"] > (0.05))].index)
df05 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (0.05)) | (df_quantiles["Performance Difference to Average Best"] > (0.07))].index)
df07 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (0.07)) | (df_quantiles["Performance Difference to Average Best"] > (0.09))].index)
df09 = df_quantiles.drop(df_quantiles[(df_quantiles["Performance Difference to Average Best"] <= (0.09))].index)

#df_quantiles
#df_quantiles.dtypes

In [149]:
len_df_10 = len(df_10.index)
len_df_09 = len(df_09.index)
len_df_07 = len(df_07.index)
len_df_05 = len(df_05.index)
len_df_03 = len(df_03.index)
len_df_01 = len(df_01.index)
len_df01 = len(df01.index)
len_df03 = len(df03.index)
len_df05 = len(df05.index)
len_df07 = len(df07.index)
len_df09 = len(df09.index)

quantile_freq = []

quantile_freq.extend((len_df_10, len_df_09, len_df_07, len_df_05, len_df_03, len_df_01, len_df01, len_df03, len_df05, len_df07, len_df09))
print(quantile_freq)


quantiles = []
quantiles.extend(['less than -0.09', '-0.09 to -0.07', '-0.07 to -0.05', '-0.05 to -0.03','-0.03 to -0.01', '-0.01 to 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09'])
print(quantiles)

improvement_quantiles = pd.DataFrame(
    {'Improvement to Average Best': quantiles,
     'Amount': quantile_freq,
    })


[9, 11, 16, 52, 126, 598, 137, 30, 18, 6, 3]
['less than -0.09', '-0.09 to -0.07', '-0.07 to -0.05', '-0.05 to -0.03', '-0.03 to -0.01', '-0.01 to 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']
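
The eleven frames `df_10` ... `df09` above are fixed-width bins rather than quantiles in the statistical sense: each row falls into exactly one interval of the form (low, high]. As a hedged alternative sketch, the same counts could be obtained in one step with `pd.cut`. The series `diffs` below is invented example data, not a result of the experiment, and `edges` is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Bin edges matching the thresholds used above; intervals are (low, high].
edges = [-np.inf, -0.09, -0.07, -0.05, -0.03, -0.01,
         0.01, 0.03, 0.05, 0.07, 0.09, np.inf]
labels = ['less than -0.09', '-0.09 to -0.07', '-0.07 to -0.05',
          '-0.05 to -0.03', '-0.03 to -0.01', '-0.01 to 0.01',
          '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07',
          '0.07 to 0.09', 'more than 0.09']

# Invented example differences (assumption, not experiment output).
diffs = pd.Series([-0.12, -0.09, -0.02, 0.0, 0.01, 0.02, 0.10])

# right=True (the default) puts a value equal to an edge into the lower bin,
# which mirrors the "<=" / ">" comparisons used in the drop-based cells.
binned = pd.cut(diffs, bins=edges, labels=labels)

# Count rows per bin and restore the label order, filling empty bins with 0.
counts = binned.astype(str).value_counts().reindex(labels, fill_value=0)
```

With this layout, `counts.tolist()` plays the role of `quantile_freq`, and the bin column can be reused for every later breakdown instead of rebuilding eleven frames.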


In [150]:
fig = px.bar(improvement_quantiles, x='Improvement to Average Best', y='Amount')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl.pdf")

In [151]:
# split barchart stacks into methods

quantile_datasets = [df_10, df_09, df_07, df_05, df_03, df_01, df01, df03, df05, df07, df09]

methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
methods.remove(AVERAGE_BEST_IMPUTATION_METHOD)
print(methods)

forest_freq = []
knn_freq = []
mode_freq = []
dl_freq = []
vae_freq = []
gain_freq = []
#print(quantile_datasets)

for i in methods:
    for j in quantile_datasets:

        df_temp = j.copy()
        df_temp = df_temp[df_temp['Imputation_Method'].str.contains(i)]

        df_temp_len = len(df_temp.index)
        if (i == 'Random Forest'):
            forest_freq.append(df_temp_len)
        elif (i == 'KNN'):
            knn_freq.append(df_temp_len)                                       
        elif (i == 'Mean/Mode'):
            mode_freq.append(df_temp_len)                                                 
        elif (i == 'Discriminative DL'):
            dl_freq.append(df_temp_len)                                       
        elif (i == 'VAE'):
            vae_freq.append(df_temp_len)                                         
        elif (i == 'GAIN'):
            gain_freq.append(df_temp_len)                                          
                                       
print(forest_freq)
print(knn_freq)
print(mode_freq)
print(dl_freq)
print(vae_freq)
print(gain_freq)

['KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
[]
[1, 2, 2, 10, 22, 134, 24, 5, 3, 1, 0]
[2, 3, 3, 8, 22, 113, 40, 7, 5, 1, 0]
[3, 1, 4, 11, 20, 137, 17, 4, 2, 0, 0]
[1, 4, 2, 11, 28, 111, 33, 5, 7, 1, 1]
[2, 1, 5, 12, 34, 103, 23, 9, 1, 3, 2]
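
The nested loop with the `if`/`elif` chain above counts rows per (bin, method) pair. A compact sketch of the same idea uses `pd.crosstab` on a bin column. The rows in `toy`, the reduced three-bin layout, and the helper column `bin` are assumptions made for this illustration; only the two column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Invented rows; only the two column names are taken from the notebook.
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "KNN", "VAE", "GAIN", "VAE"],
    "Performance Difference to Average Best": [-0.02, 0.0, 0.02, 0.0, 0.005],
})

# Reduced set of bins to keep the example short.
edges = [-np.inf, -0.01, 0.01, np.inf]
labels = ["less than -0.01", "-0.01 to 0.01", "more than 0.01"]
toy["bin"] = pd.cut(
    toy["Performance Difference to Average Best"], bins=edges, labels=labels
).astype(str)

# Rows: bins, columns: imputation methods, cells: number of experiments.
table = pd.crosstab(toy["bin"], toy["Imputation_Method"])
table = table.reindex(index=labels, fill_value=0)

# One list per method, in bin order, ready to pass as y= to go.Bar.
knn_freq = table["KNN"].tolist()
vae_freq = table["VAE"].tolist()
```

A method that was removed beforehand (such as the average best method) simply has no column in `table`, which avoids plotting an empty list for it.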


In [152]:
quantiles = ['less than -0.09', '-0.09 to -0.07', '-0.07 to -0.05', '-0.05 to -0.03','-0.03 to -0.01', '-0.01 to 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']

fig = go.Figure(data=[
    go.Bar(name='Random Forest', x=quantiles, y=forest_freq),
    go.Bar(name='KNN', x=quantiles, y=knn_freq),
    go.Bar(name='Mean/Mode', x=quantiles, y=mode_freq),
    go.Bar(name='Discriminative DL', x=quantiles, y=dl_freq),
    go.Bar(name='VAE', x=quantiles, y=vae_freq),
    go.Bar(name='GAIN', x=quantiles, y=gain_freq)
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_per_method.pdf")

In [153]:
# split barchart stacks into missingness fractions

quantile_datasets = [df_10, df_09, df_07, df_05, df_03, df_01, df01, df03, df05, df07, df09]

fractions = ['0.01', '0.1', '0.3', '0.5']
#print(fractions)

freq_001 = []
freq_01 = []
freq_03 = []
freq_05 = []
#print(quantile_datasets)

for i in fractions:
    for j in quantile_datasets:
        df_temp = j.copy()
        # regex=False: match the fraction literally ('.' would otherwise match any character)
        df_temp = df_temp[df_temp['Missing Fraction'].str.contains(i, regex=False)]
        df_temp_len = len(df_temp.index)
        if (i == '0.01'):
            freq_001.append(df_temp_len)
        elif (i == '0.1'):
            freq_01.append(df_temp_len)                                       
        elif (i == '0.3'):
            freq_03.append(df_temp_len)                                                 
        elif (i == '0.5'):
            freq_05.append(df_temp_len)                                       
                                        
                                       
print(freq_001)
print(freq_01)
print(freq_03)
print(freq_05)

[0, 0, 1, 6, 14, 197, 21, 2, 3, 1, 0]
[2, 4, 5, 16, 29, 144, 31, 14, 3, 3, 1]
[3, 5, 3, 13, 39, 137, 44, 7, 2, 1, 0]
[4, 2, 7, 17, 44, 120, 41, 7, 10, 1, 2]


In [154]:
quantiles = ['less than -0.09', '-0.09 to -0.07', '-0.07 to -0.05', '-0.05 to -0.03','-0.03 to -0.01', '-0.01 to 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']


fig = go.Figure(data=[
    go.Bar(name='1% Missing Data', x=quantiles, y=freq_001, marker_color='#FD3216'),
    go.Bar(name='10% Missing Data', x=quantiles, y=freq_01, marker_color='#00FE35'),
    go.Bar(name='30% Missing Data', x=quantiles, y=freq_03, marker_color='#511CFB'),
    go.Bar(name='50% Missing Data', x=quantiles, y=freq_05, marker_color='#FF7F0E'),
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_per_frac.pdf")

In [155]:
# split barchart stacks into missingness patterns
# note: freq_001 / freq_01 / freq_03 are reused here for the MCAR / MAR / MNAR counts

quantile_datasets = [df_10, df_09, df_07, df_05, df_03, df_01, df01, df03, df05, df07, df09]

fractions = ['MCAR', 'MAR', 'MNAR']
print(fractions)
#print(df_10)

freq_001 = []
freq_01 = []
freq_03 = []
#print(quantile_datasets)

for i in fractions:
    for j in quantile_datasets:
        df_temp = j.copy()
        df_temp = df_temp[df_temp['Missing Type'].str.contains(i)]
        df_temp_len = len(df_temp.index)
        if (i == 'MCAR'):
            freq_001.append(df_temp_len)
        elif (i == 'MAR'):
            freq_01.append(df_temp_len)                                       
        elif (i == 'MNAR'):
            freq_03.append(df_temp_len)                                                                                     
                                        
                                       
print(freq_001)
print(freq_01)
print(freq_03)

['MCAR', 'MAR', 'MNAR']
[2, 2, 7, 19, 33, 201, 50, 10, 8, 3, 1]
[4, 3, 7, 15, 49, 196, 44, 7, 6, 2, 2]
[3, 6, 2, 18, 44, 201, 43, 13, 4, 1, 0]


In [156]:
quantiles = ['less than -0.09', '-0.09 to -0.07', '-0.07 to -0.05', '-0.05 to -0.03','-0.03 to -0.01', '-0.01 to 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']


fig = go.Figure(data=[
    go.Bar(name='MCAR', x=quantiles, y=freq_001, marker_color='#222A2A'),
    go.Bar(name='MAR', x=quantiles, y=freq_01, marker_color='#B68100'),
    go.Bar(name='MNAR', x=quantiles, y=freq_03, marker_color='#750D86'),
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_per_patt.pdf")

## Improvement Proportions for the Best Imputation Method per Data Constellation Relative to Average Best Method

In [157]:
improv_to_av_bar = heatmap_data_difference.copy()

improv_to_av_bar = improv_to_av_bar.drop(improv_to_av_bar[improv_to_av_bar["Downstream Performance Rank"] != 1.0].index)

df_01 = improv_to_av_bar.drop(improv_to_av_bar[(improv_to_av_bar["Performance Difference to Average Best"] <= (-0.01)) | (improv_to_av_bar["Performance Difference to Average Best"] > (0.01))].index)
df01 = improv_to_av_bar.drop(improv_to_av_bar[(improv_to_av_bar["Performance Difference to Average Best"] <= (0.01)) | (improv_to_av_bar["Performance Difference to Average Best"] > (0.03))].index)
df03 = improv_to_av_bar.drop(improv_to_av_bar[(improv_to_av_bar["Performance Difference to Average Best"] <= (0.03)) | (improv_to_av_bar["Performance Difference to Average Best"] > (0.05))].index)
df05 = improv_to_av_bar.drop(improv_to_av_bar[(improv_to_av_bar["Performance Difference to Average Best"] <= (0.05)) | (improv_to_av_bar["Performance Difference to Average Best"] > (0.07))].index)
df07 = improv_to_av_bar.drop(improv_to_av_bar[(improv_to_av_bar["Performance Difference to Average Best"] <= (0.07)) | (improv_to_av_bar["Performance Difference to Average Best"] > (0.09))].index)
df09 = improv_to_av_bar.drop(improv_to_av_bar[(improv_to_av_bar["Performance Difference to Average Best"] <= (0.09))].index)

improv_to_av_bar

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation,Data_Constellation_full,Performance Difference to Average Best
4,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.01,MAR - 0.01 - 6,0.000753
9,KNN,6,MAR,0.1,x-box,downstream_performance_mean,F1_macro,0.725914,0.0,0.726020,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.1,MAR - 0.1 - 6,0.000996
15,KNN,6,MAR,0.3,x-box,downstream_performance_mean,F1_macro,0.723328,0.0,0.724518,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.3,MAR - 0.3 - 6,0.002010
21,Random Forest,6,MAR,0.5,x-box,downstream_performance_mean,F1_macro,0.725631,0.0,0.726487,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.5,MAR - 0.5 - 6,0.000000
26,Mean/Mode,6,MCAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727026,0.0,0.726898,...,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MCAR - 0.01,MCAR - 0.01 - 6,0.000274
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1186,GAIN,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.265186,0.0,0.245338,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MCAR - 0.5,MCAR - 0.5 - 41671,0.004927
1189,KNN,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.01,MNAR - 0.01 - 41671,0.000000
1197,Discriminative DL,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240445,0.0,0.241994,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.1,MNAR - 0.1 - 41671,0.001374
1203,Mean/Mode,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.244691,0.0,0.244943,...,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.3,MNAR - 0.3 - 41671,0.004324


In [158]:
len_df_01 = len(df_01.index)
len_df01 = len(df01.index)
len_df03 = len(df03.index)
len_df05 = len(df05.index)
len_df07 = len(df07.index)
len_df09 = len(df09.index)

quantile_freq = []
quantile_freq.extend((len_df_01, len_df01, len_df03, len_df05, len_df07, len_df09))
print(quantile_freq)


quantiles = []
quantiles.extend(['less than 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09'])
print(quantiles)

improvement_quantiles = pd.DataFrame(
    {'Improvement to Average Best': quantiles,
     'Amount': quantile_freq,
    })

fig = px.bar(improvement_quantiles, x='Improvement to Average Best', y='Amount')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_only_best.pdf")

[111, 63, 15, 7, 5, 3]
['less than 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']


In [159]:
# split barchart stacks into methods

quantile_datasets = [df_01, df01, df03, df05, df07, df09]

methods = ['Random Forest', 'KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
methods.remove(AVERAGE_BEST_IMPUTATION_METHOD)
print(methods)

forest_freq = []
knn_freq = []
mode_freq = []
dl_freq = []
vae_freq = []
gain_freq = []
#print(quantile_datasets)

for i in methods:
    for j in quantile_datasets:
        df_temp = j.copy()
        df_temp = df_temp[df_temp['Imputation_Method'].str.contains(i)]
        df_temp_len = len(df_temp.index)
        if (i == 'Random Forest'):
            forest_freq.append(df_temp_len)
        elif (i == 'KNN'):
            knn_freq.append(df_temp_len)                                       
        elif (i == 'Mean/Mode'):
            mode_freq.append(df_temp_len)                                                 
        elif (i == 'Discriminative DL'):
            dl_freq.append(df_temp_len)                                       
        elif (i == 'VAE'):
            vae_freq.append(df_temp_len)                                         
        elif (i == 'GAIN'):
            gain_freq.append(df_temp_len)                                          
                                       
print(forest_freq)
print(knn_freq)
print(mode_freq)
print(dl_freq)
print(vae_freq)
print(gain_freq)

['KNN', 'Mean/Mode', 'VAE', 'GAIN', 'Discriminative DL']
[]
[23, 10, 3, 0, 0, 0]
[8, 23, 2, 1, 1, 0]
[17, 6, 0, 0, 0, 0]
[15, 16, 3, 5, 1, 1]
[14, 8, 7, 1, 3, 2]


In [160]:
quantiles = ['less than 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']


fig = go.Figure(data=[
    go.Bar(name='Random Forest', x=quantiles, y=forest_freq),
    go.Bar(name='KNN', x=quantiles, y=knn_freq),
    go.Bar(name='Mean/Mode', x=quantiles, y=mode_freq),
    go.Bar(name='Discriminative DL', x=quantiles, y=dl_freq),
    go.Bar(name='VAE', x=quantiles, y=vae_freq),
    go.Bar(name='GAIN', x=quantiles, y=gain_freq)
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_only_best_per_method.pdf")

In [161]:
# split barchart stacks into missingness fractions

quantile_datasets = [df_01, df01, df03, df05, df07, df09]

fractions = ['0.01', '0.1', '0.3', '0.5']
print(fractions)


freq_001 = []
freq_01 = []
freq_03 = []
freq_05 = []
#print(quantile_datasets)

for i in fractions:
    for j in quantile_datasets:
        df_temp = j.copy()
        # regex=False: match the fraction literally ('.' would otherwise match any character)
        df_temp = df_temp[df_temp['Missing Fraction'].str.contains(i, regex=False)]
        df_temp_len = len(df_temp.index)
        if (i == '0.01'):
            freq_001.append(df_temp_len)
        elif (i == '0.1'):
            freq_01.append(df_temp_len)                                       
        elif (i == '0.3'):
            freq_03.append(df_temp_len)                                                 
        elif (i == '0.5'):
            freq_05.append(df_temp_len)                                       
                                        
                                       
print(freq_001)
print(freq_01)
print(freq_03)
print(freq_05)

['0.01', '0.1', '0.3', '0.5']
[35, 11, 2, 2, 1, 0]
[30, 10, 7, 0, 3, 1]
[22, 23, 4, 1, 1, 0]
[24, 19, 2, 4, 0, 2]


In [162]:
quantiles = ['less than 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']


fig = go.Figure(data=[
    go.Bar(name='1% Missing Data', x=quantiles, y=freq_001, marker_color='#FD3216'),
    go.Bar(name='10% Missing Data', x=quantiles, y=freq_01, marker_color='#00FE35'),
    go.Bar(name='30% Missing Data', x=quantiles, y=freq_03, marker_color='#511CFB'),
    go.Bar(name='50% Missing Data', x=quantiles, y=freq_05, marker_color='#FF7F0E'),
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_only_best_per_frac.pdf")

In [163]:
# split barchart stacks into missingness patterns
# note: freq_001 / freq_01 / freq_03 are reused here for the MCAR / MAR / MNAR counts

quantile_datasets = [df_01, df01, df03, df05, df07, df09]

fractions = ['MCAR', 'MAR', 'MNAR']
print(fractions)


freq_001 = []
freq_01 = []
freq_03 = []
#print(quantile_datasets)

for i in fractions:
    for j in quantile_datasets:
        df_temp = j.copy()
        df_temp = df_temp[df_temp['Missing Type'].str.contains(i)]
        df_temp_len = len(df_temp.index)
        if (i == 'MCAR'):
            freq_001.append(df_temp_len)
        elif (i == 'MAR'):
            freq_01.append(df_temp_len)                                       
        elif (i == 'MNAR'):
            freq_03.append(df_temp_len)                                                                                     
                                        
                                       
print(freq_001)
print(freq_01)
print(freq_03)

['MCAR', 'MAR', 'MNAR']
[37, 20, 5, 3, 2, 1]
[37, 22, 4, 1, 2, 2]
[37, 21, 6, 3, 1, 0]


In [164]:
quantiles = ['less than 0.01', '0.01 to 0.03', '0.03 to 0.05', '0.05 to 0.07', '0.07 to 0.09', 'more than 0.09']


fig = go.Figure(data=[
    go.Bar(name='MCAR', x=quantiles, y=freq_001, marker_color='#222A2A'),
    go.Bar(name='MAR', x=quantiles, y=freq_01, marker_color='#B68100'),
    go.Bar(name='MNAR', x=quantiles, y=freq_03, marker_color='#750D86'),
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()
fig.write_image("improv_rel_to_av_all_DC_no_av_incl_only_best_per_patt.pdf")

## Extract datasets for Automated Imputation Method Selection

To do: explore whether the average best method should replace the best method for a data constellation when the best method's improvement over it is below 1%.

### Potential Features:
Missingness Pattern (Missing Type)  
Missing Fraction (Missing Fraction)  
Datapoints (NumberOfInstances)  
Features in total (NumberOfFeatures)  
Numeric Features (NumberOfNumericFeatures)  
Categorical Features (NumberOfCategoricalFeatures)  
Downstream Task Type -> Classification/Regression (metric)
  
    
      
Label: Best Imputation Method (Imputation_Method)
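
As a rough sketch of how the features and the label listed above could be separated for a downstream classifier: the two rows below follow the shape of the result tables in this notebook, and the one-hot encoding of the categorical columns via `pd.get_dummies` is an assumption, not something the notebook prescribes. The names `X` and `y` are introduced only for this illustration.

```python
import pandas as pd

# Two example rows in the shape of the rank-1 result table.
toy = pd.DataFrame({
    "Imputation_Method": ["KNN", "GAIN"],
    "Missing Type": ["MAR", "MCAR"],
    "Missing Fraction": [0.1, 0.5],
    "NumberOfInstances": [20000.0, 20000.0],
    "NumberOfFeatures": [17.0, 21.0],
    "NumberOfNumericFeatures": [16.0, 20.0],
    "NumberOfCategoricalFeatures": [1.0, 1.0],
    "metric": ["F1_macro", "F1_macro"],
})

# Label: the best imputation method for the data constellation.
y = toy["Imputation_Method"]

# Features: everything else; categorical columns are one-hot encoded
# (assumed encoding, chosen only to make the frame purely numeric).
X = pd.get_dummies(
    toy.drop(columns=["Imputation_Method"]),
    columns=["Missing Type", "metric"],
)
```

The resulting `X` has one indicator column per observed missingness pattern and metric, next to the numeric dataset properties.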

In [165]:
# Use dataset with only the best method for each data constellation
rank_1_backup.to_csv('rank_1_backup.csv')
rank_1_backup

Unnamed: 0,Imputation_Method,Task,Missing Type,Missing Fraction,Column,result_type,metric,Baseline,Corrupted,Imputed,...,name,MajorityClassSize,MinorityClassSize,NumberOfFeatures,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures,NumberOfClasses,Downstream Performance Rank,Data_Constellation
4,Discriminative DL,6,MAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727469,0.0,0.727828,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.01
9,KNN,6,MAR,0.1,x-box,downstream_performance_mean,F1_macro,0.725914,0.0,0.726020,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.1
15,KNN,6,MAR,0.3,x-box,downstream_performance_mean,F1_macro,0.723328,0.0,0.724518,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.3
21,Random Forest,6,MAR,0.5,x-box,downstream_performance_mean,F1_macro,0.725631,0.0,0.726487,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MAR - 0.5
26,Mean/Mode,6,MCAR,0.01,x-box,downstream_performance_mean,F1_macro,0.727026,0.0,0.726898,...,letter,813.0,734.0,17.0,20000.0,16.0,1.0,26.0,1.0,MCAR - 0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1186,GAIN,41671,MCAR,0.5,a9,downstream_performance_mean,F1_macro,0.265186,0.0,0.245338,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MCAR - 0.5
1189,KNN,41671,MNAR,0.01,a9,downstream_performance_mean,F1_macro,0.240619,0.0,0.240619,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.01
1197,Discriminative DL,41671,MNAR,0.1,a9,downstream_performance_mean,F1_macro,0.240445,0.0,0.241994,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.1
1203,Mean/Mode,41671,MNAR,0.3,a9,downstream_performance_mean,F1_macro,0.244691,0.0,0.244943,...,microaggregation2,11162.0,743.0,21.0,20000.0,20.0,1.0,5.0,1.0,MNAR - 0.3


In [166]:
# Dataset for Training 
properties_train_dataset_8 = rank_1_backup.copy()
properties_train_dataset_8 = properties_train_dataset_8[['Imputation_Method','Missing Type','Missing Fraction',
                                                         'NumberOfInstances','NumberOfFeatures','NumberOfNumericFeatures',
                                                         'NumberOfCategoricalFeatures','metric']]

properties_train_dataset_8


Unnamed: 0,Imputation_Method,Missing Type,Missing Fraction,NumberOfInstances,NumberOfFeatures,NumberOfNumericFeatures,NumberOfCategoricalFeatures,metric
4,Discriminative DL,MAR,0.01,20000.0,17.0,16.0,1.0,F1_macro
9,KNN,MAR,0.1,20000.0,17.0,16.0,1.0,F1_macro
15,KNN,MAR,0.3,20000.0,17.0,16.0,1.0,F1_macro
21,Random Forest,MAR,0.5,20000.0,17.0,16.0,1.0,F1_macro
26,Mean/Mode,MCAR,0.01,20000.0,17.0,16.0,1.0,F1_macro
...,...,...,...,...,...,...,...,...
1186,GAIN,MCAR,0.5,20000.0,21.0,20.0,1.0,F1_macro
1189,KNN,MNAR,0.01,20000.0,21.0,20.0,1.0,F1_macro
1197,Discriminative DL,MNAR,0.1,20000.0,21.0,20.0,1.0,F1_macro
1203,Mean/Mode,MNAR,0.3,20000.0,21.0,20.0,1.0,F1_macro


In [167]:
# Dataset for Training 
properties_train_dataset_original = rank_1_backup.copy()
properties_train_dataset_original = properties_train_dataset_original[['Imputation_Method','Missing Type','Missing Fraction',
                                                         'NumberOfInstances','NumberOfNumericFeatures',
                                                         'NumberOfCategoricalFeatures']]

properties_train_dataset_original
properties_train_dataset_original.to_csv('multi_properties_train_dataset_original.csv', index=False)

In [168]:
properties_train_dataset_original

Unnamed: 0,Imputation_Method,Missing Type,Missing Fraction,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures
4,Discriminative DL,MAR,0.01,20000.0,16.0,1.0
9,KNN,MAR,0.1,20000.0,16.0,1.0
15,KNN,MAR,0.3,20000.0,16.0,1.0
21,Random Forest,MAR,0.5,20000.0,16.0,1.0
26,Mean/Mode,MCAR,0.01,20000.0,16.0,1.0
...,...,...,...,...,...,...
1186,GAIN,MCAR,0.5,20000.0,20.0,1.0
1189,KNN,MNAR,0.01,20000.0,20.0,1.0
1197,Discriminative DL,MNAR,0.1,20000.0,20.0,1.0
1203,Mean/Mode,MNAR,0.3,20000.0,20.0,1.0


In [169]:
# Dataset for training -> replace the best method with the average best if the improvement is below the threshold (here 3%)

alternate_data = heatmap_data_difference.copy()

df_temp = alternate_data.loc[(alternate_data['Downstream Performance Rank'] == 1.0) | (alternate_data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD)]


dc_unique = alternate_data.Data_Constellation_full.unique()
data_constellations = dc_unique.tolist()


for i in data_constellations:

    # define the threshold here!
    # rank 9.0 = best method gains less than the threshold (dropped below), rank 11.0 = gain of at least the threshold (kept)
    
    df_temp['Downstream Performance Rank'] = np.where((df_temp['Data_Constellation_full'] == i) & (df_temp['Performance Difference to Average Best'] < 0.03) & (df_temp['Imputation_Method'] != AVERAGE_BEST_IMPUTATION_METHOD), 9.0, df_temp['Downstream Performance Rank'])
    df_temp['Downstream Performance Rank'] = np.where((df_temp['Data_Constellation_full'] == i) & (df_temp['Performance Difference to Average Best'] >= 0.03) & (df_temp['Imputation_Method'] != AVERAGE_BEST_IMPUTATION_METHOD), 11.0, df_temp['Downstream Performance Rank'])
   
df_temp = df_temp.drop(df_temp[df_temp['Downstream Performance Rank'] == 9.0].index) 

# Sorting of data

#adjust order to fit the processing time -> fastest first
methods_order = CategoricalDtype(['Random Forest', 'Mean/Mode', 'KNN', 'VAE', 'GAIN', 'Discriminative DL'], ordered=True)


df_temp['Imputation_Method'] = df_temp['Imputation_Method'].astype(methods_order)

df_temp = df_temp.sort_values(['Data_Constellation_full','Imputation_Method'], ascending=[True, True])
df_temp = df_temp.drop_duplicates(subset=["Data_Constellation_full"], keep='last')

df_temp = df_temp[['Imputation_Method','Missing Type','Missing Fraction',
                                                         'NumberOfInstances','NumberOfNumericFeatures',
                                                         'NumberOfCategoricalFeatures']]


df_temp.to_csv('multi_properties_train_dataset_3_percent.csv')
df_temp






A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






Unnamed: 0,Imputation_Method,Missing Type,Missing Fraction,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures
504,Random Forest,MAR,0.01,10218.0,7.0,1.0
573,Random Forest,MAR,0.01,28056.0,3.0,4.0
645,Random Forest,MAR,0.01,5456.0,4.0,1.0
284,Random Forest,MAR,0.01,4177.0,7.0,2.0
360,Random Forest,MAR,0.01,28056.0,0.0,7.0
...,...,...,...,...,...,...
1067,Random Forest,MNAR,0.5,58000.0,9.0,1.0
1140,Random Forest,MNAR,0.5,44819.0,6.0,1.0
1204,Random Forest,MNAR,0.5,20000.0,20.0,1.0
784,Random Forest,MNAR,0.5,5665.0,2.0,15.0
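
The warning printed above is pandas' hint that values are being assigned to a frame that was obtained by selecting rows from another frame. A minimal sketch of the usual remedy, taking an explicit `.copy()` of the selection before assigning, is shown below. The toy frame and its values are invented for illustration and are not part of the experiment data.

```python
import warnings

import pandas as pd

# Toy data (hypothetical), reusing two column names from the notebook
frame = pd.DataFrame({
    'Imputation_Method': ['Random Forest', 'KNN', 'VAE'],
    'Downstream Performance Rank': [2.0, 1.0, 3.0],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # .copy() turns the row selection into an independent frame
    subset = frame.loc[frame['Downstream Performance Rank'] <= 2.0].copy()
    # assigning to the independent frame does not touch `frame`
    subset['Downstream Performance Rank'] = 11.0

# collect any copy-related warnings raised inside the block
copy_warnings = [w for w in caught if 'SettingWithCopy' in w.category.__name__]
print(len(copy_warnings))
print(frame['Downstream Performance Rank'].tolist())
```

With the explicit copy, the original frame keeps its ranks and only `subset` is modified.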


In [170]:
alternate_data = heatmap_data_difference.copy()

# define the threshold here!
THRESHOLD = 0.02

# keep each data constellation's best-ranked method plus the average-best method;
# the explicit .copy() makes df_temp an independent frame, so the assignments below do not act on a slice
df_temp = alternate_data.loc[(alternate_data['Downstream Performance Rank'] == 1.0) | (alternate_data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD)].copy()

performance_difference = df_temp['Performance Difference to Average Best']
not_average_best = df_temp['Imputation_Method'] != AVERAGE_BEST_IMPUTATION_METHOD

# rank 9.0: difference to the average-best method is below the threshold (dropped below)
# rank 11.0: difference is at or above the threshold (kept)
df_temp.loc[not_average_best & (performance_difference < THRESHOLD), 'Downstream Performance Rank'] = 9.0
df_temp.loc[not_average_best & (performance_difference >= THRESHOLD), 'Downstream Performance Rank'] = 11.0

df_temp = df_temp.drop(df_temp[df_temp['Downstream Performance Rank'] == 9.0].index)

# Sorting of data

# order the imputation methods by processing time (fastest first)
methods_order = CategoricalDtype(['Random Forest', 'Mean/Mode', 'KNN', 'VAE', 'GAIN', 'Discriminative DL'], ordered=True)

df_temp['Imputation_Method'] = df_temp['Imputation_Method'].astype(methods_order)

# one row per data constellation: after sorting, keep='last' retains the method that comes last in methods_order
df_temp = df_temp.sort_values(['Data_Constellation_full', 'Imputation_Method'], ascending=[True, True])
df_temp = df_temp.drop_duplicates(subset=["Data_Constellation_full"], keep='last')

df_temp = df_temp[['Imputation_Method', 'Missing Type', 'Missing Fraction',
                   'NumberOfInstances', 'NumberOfNumericFeatures',
                   'NumberOfCategoricalFeatures']]


df_temp.to_csv('multi_properties_train_dataset_2_percent.csv')
df_temp








Unnamed: 0,Imputation_Method,Missing Type,Missing Fraction,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures
504,Random Forest,MAR,0.01,10218.0,7.0,1.0
573,Random Forest,MAR,0.01,28056.0,3.0,4.0
645,Random Forest,MAR,0.01,5456.0,4.0,1.0
284,Random Forest,MAR,0.01,4177.0,7.0,2.0
360,Random Forest,MAR,0.01,28056.0,0.0,7.0
...,...,...,...,...,...,...
1067,Random Forest,MNAR,0.5,58000.0,9.0,1.0
1140,Random Forest,MNAR,0.5,44819.0,6.0,1.0
1209,KNN,MNAR,0.5,20000.0,20.0,1.0
786,Mean/Mode,MNAR,0.5,5665.0,2.0,15.0
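
The cells in this part of the notebook differ only in the threshold value (0.03, 0.02, 0.01). As a sketch, the shared logic could be collected in one function that takes the threshold as a parameter. The function name `select_method_per_constellation` and the toy data are hypothetical; the column names are the ones used in the cells above. Like those cells, it relies on the average-best method sorting first in the method order, which is assumed here to be 'Random Forest'.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

METHODS_ORDER = CategoricalDtype(
    ['Random Forest', 'Mean/Mode', 'KNN', 'VAE', 'GAIN', 'Discriminative DL'],
    ordered=True,
)


def select_method_per_constellation(data, average_best, threshold):
    """Hypothetical helper: one selected imputation method per data constellation."""
    is_rank_one = data['Downstream Performance Rank'] == 1.0
    is_average_best = data['Imputation_Method'] == average_best
    candidates = data.loc[is_rank_one | is_average_best].copy()

    # Drop best-ranked rows of other methods whose difference stays below the threshold
    below = (
        (candidates['Imputation_Method'] != average_best)
        & (candidates['Performance Difference to Average Best'] < threshold)
    )
    candidates = candidates.loc[~below].copy()

    # Sort by the ordered method category and keep the last row per constellation
    candidates['Imputation_Method'] = candidates['Imputation_Method'].astype(METHODS_ORDER)
    candidates = candidates.sort_values(['Data_Constellation_full', 'Imputation_Method'])
    return candidates.drop_duplicates(subset=['Data_Constellation_full'], keep='last')


# Toy data (invented for illustration, not taken from the experiments)
toy = pd.DataFrame({
    'Data_Constellation_full': ['A', 'A', 'B', 'B'],
    'Imputation_Method': ['Random Forest', 'KNN', 'Random Forest', 'VAE'],
    'Downstream Performance Rank': [2.0, 1.0, 2.0, 1.0],
    'Performance Difference to Average Best': [0.0, 0.025, 0.0, 0.005],
})

for threshold in (0.03, 0.02, 0.01):
    selected = select_method_per_constellation(toy, 'Random Forest', threshold)
    print(threshold, dict(zip(selected['Data_Constellation_full'], selected['Imputation_Method'])))
```

In the toy data, constellation A switches from 'Random Forest' to 'KNN' once the threshold drops to 0.02, while constellation B stays with 'Random Forest' for all three thresholds because its difference of 0.005 never reaches the threshold.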


In [171]:
alternate_data = heatmap_data_difference.copy()

# define the threshold here!
THRESHOLD = 0.01

# keep each data constellation's best-ranked method plus the average-best method;
# the explicit .copy() makes df_temp an independent frame, so the assignments below do not act on a slice
df_temp = alternate_data.loc[(alternate_data['Downstream Performance Rank'] == 1.0) | (alternate_data['Imputation_Method'] == AVERAGE_BEST_IMPUTATION_METHOD)].copy()

performance_difference = df_temp['Performance Difference to Average Best']
not_average_best = df_temp['Imputation_Method'] != AVERAGE_BEST_IMPUTATION_METHOD

# rank 9.0: difference to the average-best method is below the threshold (dropped below)
# rank 11.0: difference is at or above the threshold (kept)
df_temp.loc[not_average_best & (performance_difference < THRESHOLD), 'Downstream Performance Rank'] = 9.0
df_temp.loc[not_average_best & (performance_difference >= THRESHOLD), 'Downstream Performance Rank'] = 11.0

df_temp = df_temp.drop(df_temp[df_temp['Downstream Performance Rank'] == 9.0].index)

# Sorting of data

# order the imputation methods by processing time (fastest first)
methods_order = CategoricalDtype(['Random Forest', 'Mean/Mode', 'KNN', 'VAE', 'GAIN', 'Discriminative DL'], ordered=True)

df_temp['Imputation_Method'] = df_temp['Imputation_Method'].astype(methods_order)

# one row per data constellation: after sorting, keep='last' retains the method that comes last in methods_order
df_temp = df_temp.sort_values(['Data_Constellation_full', 'Imputation_Method'], ascending=[True, True])
df_temp = df_temp.drop_duplicates(subset=["Data_Constellation_full"], keep='last')

df_temp = df_temp[['Imputation_Method', 'Missing Type', 'Missing Fraction',
                   'NumberOfInstances', 'NumberOfNumericFeatures',
                   'NumberOfCategoricalFeatures']]


df_temp.to_csv('multi_properties_train_dataset_1_percent.csv')
df_temp








Unnamed: 0,Imputation_Method,Missing Type,Missing Fraction,NumberOfInstances,NumberOfNumericFeatures,NumberOfCategoricalFeatures
504,Random Forest,MAR,0.01,10218.0,7.0,1.0
573,Random Forest,MAR,0.01,28056.0,3.0,4.0
645,Random Forest,MAR,0.01,5456.0,4.0,1.0
284,Random Forest,MAR,0.01,4177.0,7.0,2.0
360,Random Forest,MAR,0.01,28056.0,0.0,7.0
...,...,...,...,...,...,...
1070,VAE,MNAR,0.5,58000.0,9.0,1.0
1140,Random Forest,MNAR,0.5,44819.0,6.0,1.0
1209,KNN,MNAR,0.5,20000.0,20.0,1.0
786,Mean/Mode,MNAR,0.5,5665.0,2.0,15.0
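
To see how the selected method changes with the threshold, the three results can be aligned on the dataset properties. The sketch below rebuilds only the last four rows printed in each of the three outputs above. It assumes that the five property columns identify a data constellation, which holds for these rows but is not guaranteed for the full tables. The helper `tail_rows` is hypothetical.

```python
import pandas as pd

KEYS = ['Missing Type', 'Missing Fraction', 'NumberOfInstances',
        'NumberOfNumericFeatures', 'NumberOfCategoricalFeatures']


def tail_rows(methods):
    """Hypothetical helper: the last four printed rows for one threshold."""
    return pd.DataFrame({
        'Imputation_Method': methods,
        'Missing Type': ['MNAR'] * 4,
        'Missing Fraction': [0.5] * 4,
        'NumberOfInstances': [58000.0, 44819.0, 20000.0, 5665.0],
        'NumberOfNumericFeatures': [9.0, 6.0, 20.0, 2.0],
        'NumberOfCategoricalFeatures': [1.0, 1.0, 1.0, 15.0],
    })


selections = {
    '3%': tail_rows(['Random Forest', 'Random Forest', 'Random Forest', 'Random Forest']),
    '2%': tail_rows(['Random Forest', 'Random Forest', 'KNN', 'Mean/Mode']),
    '1%': tail_rows(['VAE', 'Random Forest', 'KNN', 'Mean/Mode']),
}

# One column per threshold, aligned on the dataset properties
comparison = pd.concat(
    {label: frame.set_index(KEYS)['Imputation_Method'] for label, frame in selections.items()},
    axis=1,
)

# Rows where the selected method is not the same for all thresholds
changed = comparison[comparison.nunique(axis=1) > 1]
print(changed)
```

For the full tables, the same comparison could start from the exported files, for example `pd.read_csv('multi_properties_train_dataset_3_percent.csv', index_col=0)`, instead of the hand-built frames.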
