# Analyse full screening answers

This notebook
- cleans and prepares the data exported from the MS Forms sheet
- visualizes and aggregates the answers to the research questions
- analyses the dataset further

In [84]:
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re
from scipy.stats import chi2_contingency

In [47]:
# show all rows, all columns and full cell contents when printing
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

In [48]:
# read in dataframe (raw string so that the backslashes in the Windows path are not treated as escapes)
df = pd.read_excel(r'..\data\Full Screening Questions (1-621)_reviewed.xlsx')

# strip strings in object columns
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# keep only the first answer if the same paper was submitted more than once
df = df.drop_duplicates(subset=['Number - Author Year'], keep='first')

print('df shape has ',df.shape)

df shape has  (610, 12)


## Clean and Prepare

In [49]:
# remove tabs and line breaks from strings (both the literal text "\n" and real control characters)
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)

## How many papers have to be excluded for which reason?

In [50]:
df[df['Concrete medical supervised Machine Learning usecase?']=='No -> Exclude from review and submit form'].shape[0]

87

In [51]:
df[df['Which XAI method is used?']=='None -> Exclude from review and submit form;'].shape[0]

64

In [52]:
df[df['Tabular or Image data as input?']=='Text-> Exclude from review and submit form'].shape[0]

13

In [53]:
df[df['Tabular or Image data as input?']=='Audio-> Exclude from review and submit form'].shape[0]

0

In [54]:
# drop papers that have been excluded due to at least one of the following reasons
# - No concrete supervised medical machine learning use case
df = df[df['Concrete medical supervised Machine Learning usecase?']!='No -> Exclude from review and submit form']
# - No XAI method provided
df = df[df['Which XAI method is used?']!='None -> Exclude from review and submit form;']
# - No image or tabular data as input
df = df[df['Tabular or Image data as input?']!='Text-> Exclude from review and submit form']
df = df[df['Tabular or Image data as input?']!='Audio-> Exclude from review and submit form']
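
The exclusion step above can also be expressed as one boolean column per reason, which gives the count per reason and the remaining papers in a single pass. This is a minimal sketch on a toy frame; the short column names (`usecase`, `xai`, `input`) and the shortened answer texts are hypothetical stand-ins for the real form columns.

```python
import pandas as pd

# Toy stand-in for the screening sheet; short column names are hypothetical.
toy = pd.DataFrame({
    "usecase": ["Yes", "No -> Exclude", "Yes", "Yes"],
    "xai": ["SHAP;", "LIME;", "None -> Exclude;", "SHAP;LIME;"],
    "input": ["Tabular", "Tabular", "Image", "Text-> Exclude"],
})

# One boolean column per exclusion reason.
reasons = pd.DataFrame({
    "no_usecase": toy["usecase"].str.startswith("No"),
    "no_xai": toy["xai"].str.startswith("None"),
    "wrong_input": toy["input"].str.contains("Exclude", regex=False),
})

excluded_per_reason = reasons.sum()   # number of papers hit by each reason
kept = toy[~reasons.any(axis=1)]      # papers without any exclusion reason
```

Because a paper can match several reasons, the per-reason counts may add up to more than the number of dropped papers.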

In [55]:
# check for missing values: list all papers that have at least one empty field
authors_list = df.loc[df.isnull().any(axis=1), 'Number - Author Year'].unique().tolist()
sorted(authors_list)

[]

In [56]:
print('After dropping excluded papers, df shape is ',df.shape)

After dropping excluded papers, df shape is  (450, 12)


No missing values are left. Next, split the multi-select columns into one entry per answer ("unroll" the columns).

In [57]:
def used_methods_value_count(df):
    """Counts how often each XAI method is used. Excludes self-developed methods."""
    col = 'Which XAI method is used?'
    
    # drop all rows with substring "Own method developed";
    # copy so that the assignment below does not raise a SettingWithCopyWarning
    df = df[~df[col].str.contains('Own method developed')].copy()
    
    # remove leading and trailing spaces
    df[col] = df[col].str.strip()
    
    # split the multi-select answers into one entry per method
    ser = df[col].str.split(";", expand=True).stack()
    
    # merge spelling variants of saliency maps
    ser = ser.replace(r'(?i) ?Saliency Maps? ?', 'Saliency Map', regex=True)
    
    # merge attention mechanism(s) and attention weight(s)
    ser = ser.replace(r'(?i)attention mechanisms? ?|attention weights? ?', 'Attention Weight', regex=True)
    
    ser = ser.value_counts()
    
    # drop the count of empty strings left over from the trailing ";"
    ser = ser.drop(labels=[''], errors='ignore')
    
    # shorten the index label for better readability
    ser = ser.rename(index={'Model is intrinsic interpretable (i.e., decision tree or linear regression)':'Intrinsic interpretable'})
    
    return ser

ser = used_methods_value_count(df)

print('Top 10 most used XAI methods')
ser[:10]

Top 10 most used XAI methods



SHAP                                                     119
Intrinsic interpretable                                   94
Class Activation Mapping or related (i. e., Grad-CAM)     92
Random Forest Feature Importance                          46
LIME                                                      40
Partial Dependence Plots                                  14
Layer-Wise Relevance Propagation                           9
Attention Weight                                           8
Saliency Map                                               8
Permutation Importance                                     5
dtype: int64
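
The counting above relies on splitting the `;`-separated multi-select answers into one entry per method. A minimal sketch of that idea with `Series.explode` on made-up answers (not taken from the dataset) could look as follows; it is an alternative to `split(expand=True).stack()`, not the function used in this notebook.

```python
import pandas as pd

# Toy multi-select answers, ';'-separated with a trailing ';' as in the form export.
answers = pd.Series(["SHAP;LIME;", "SHAP;", "saliency maps;SHAP;", "Saliency Map ;"])

methods = (
    answers.str.split(";")   # list of methods per paper
    .explode()               # one row per (paper, method)
    .str.strip()
)
methods = methods[methods != ""]   # drop the empty piece after the trailing ';'

# Merge spelling variants before counting (case-insensitive full match).
methods = methods.str.replace(r"(?i)^saliency maps?$", "Saliency Map", regex=True)

counts = methods.value_counts()
```

Filtering the empty strings before counting avoids having to drop an empty label from the result afterwards.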

In [58]:
# Total number of method uses, excluding self-developed methods
ser.sum()

490

In [59]:
# How many self developed methods?
ser['Own method developed'] = df['Which XAI method is used?'].str.contains('Own method developed').sum()
ser = ser.sort_values(ascending=False)
ser['Own method developed']

45

In [60]:
# How many explanation methods per paper on average?
ser.sum()/df.shape[0]

1.1888888888888889

## Which XAI methods have been used the most?

In [61]:
# Explanation methods by data input
df_tab = df[df['Tabular or Image data as input?']=='Tabular (includes EEG, ECG, time-series data)']
df_img = df[df['Tabular or Image data as input?']=='Image data (includes video data)']

ser_tab = used_methods_value_count(df_tab).rename('tabular')
ser_img = used_methods_value_count(df_img).rename('image')

# the series names ('tabular', 'image') become the column names
df_tab_img = pd.concat([ser_tab, ser_img], axis='columns')

# replace NaNs
df_tab_img=df_tab_img.fillna(0)

# drop methods that have been used fewer than 3 times for both input types
df_tab_img = df_tab_img[(df_tab_img.tabular>2) | (df_tab_img.image>2)]

df_tab_img

Unnamed: 0,tabular,image
SHAP,108.0,11.0
Intrinsic interpretable,90.0,4.0
Random Forest Feature Importance,41.0,5.0
LIME,27.0,13.0
Partial Dependence Plots,14.0,0.0
"Class Activation Mapping or related (i. e., Grad-CAM)",8.0,84.0
Attention Weight,7.0,1.0
Permutation Importance,5.0,0.0
Layer-Wise Relevance Propagation,3.0,6.0
Saliency Map,2.0,6.0


![image-2.png](attachment:image-2.png)

## Which data input type is more common?

In [62]:
df['Tabular or Image data as input?'].value_counts()

Tabular (includes EEG, ECG, time-series data)    307
Image data (includes video data)                 143
Name: Tabular or Image data as input?, dtype: int64

In [63]:
df['Tabular or Image data as input?'].value_counts(normalize=True)

Tabular (includes EEG, ECG, time-series data)    0.682222
Image data (includes video data)                 0.317778
Name: Tabular or Image data as input?, dtype: float64

Tabular data remains the most common input format: at 68 % of the papers, it is used roughly twice as often as image data (32 %).

In [64]:
# Prepare for big result_df
res_df=pd.DataFrame()

## What is the distribution of the publication years?

In [65]:
df['pub_year']=df['Number - Author Year'].str.split(' ', expand=True).loc[:, 3]
res = df['pub_year'].value_counts().sort_index()
res.loc['2008-2019']=res[(res.index >= '2008') & (res.index <= '2019')].sum()
res = res.drop(['2008', '2009', '2010', '2011', '2012', '2014', '2015', '2016', '2017',
       '2018', '2019'], axis='index')
res = res.reindex(['2008-2019', '2020', '2021', '2022'])
res_df['publications_count'] = res
res_df

Unnamed: 0,publications_count
2008-2019,79
2020,108
2021,200
2022,63


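The number of publications grows steeply: 108 papers in 2020 and 200 in 2021, compared with 79 for the whole period 2008-2019. The lower count for 2022 (63) presumably covers only part of the year.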
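
The year extraction above takes the fourth space-separated token of the label, which assumes a single-word author name. As a hedged alternative, the sketch below reads the trailing four digits instead and pools the early years into one bucket; the labels are invented examples in the `Number - Author Year` style.

```python
import pandas as pd

# Hypothetical labels in the "Number - Author Year" style.
labels = pd.Series([
    "12 - Smith 2017",
    "40 - Lee 2020",
    "41 - Van Der Berg 2021",
    "77 - Kim 2021",
])

# Take the trailing four digits instead of a fixed split position,
# so multi-word author names do not shift the year.
year = labels.str.extract(r"(\d{4})\s*$")[0].astype(int)

# Pool everything before 2020 into one bucket.
period = year.astype(str).where(year >= 2020, "2008-2019")
counts = period.value_counts().reindex(
    ["2008-2019", "2020", "2021", "2022"], fill_value=0
)
```

With `fill_value=0`, periods that do not occur in a subset show up as 0 rather than NaN.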

## Do the ML pipeline descriptions improve over time?

In [66]:
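# select the rating column before averaging so that non-numeric columns are not aggregated
rating_col = 'How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)'
res = df.groupby('pub_year')[rating_col].mean()
# note: the pooled value is the unweighted mean of the yearly means, not the mean over all papers of that period
res.loc['2008-2019']=res[(res.index >= '2008') & (res.index <= '2019')].mean()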
res = res.drop(['2008', '2009', '2010', '2011', '2012', '2014', '2015', '2016', '2017',
       '2018', '2019'], axis='index')
res = res.reindex(['2008-2019', '2020', '2021', '2022'])
res_df['Mean of ML pipeline evaluation'] = res
res_df

Unnamed: 0,publications_count,Mean of ML pipeline evaluation
2008-2019,79,1.869501
2020,108,2.148148
2021,200,2.155
2022,63,2.301587
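
The pooled 2008-2019 value is computed as the mean of the yearly means, so a year with one paper weighs as much as a year with many. The toy example below (invented ratings, hypothetical column name `rating`) shows how this can differ from the mean over all papers of the period.

```python
import pandas as pd

# Toy ratings: 2018 has one paper, 2019 has three.
toy = pd.DataFrame({
    "pub_year": ["2018", "2019", "2019", "2019"],
    "rating": [1, 3, 3, 3],
})

# Mean of the yearly means: every year counts equally, (1 + 3) / 2.
mean_of_means = toy.groupby("pub_year")["rating"].mean().mean()

# Pooled mean: every paper counts equally, (1 + 3 + 3 + 3) / 4.
pooled_mean = toy["rating"].mean()
```

Which of the two is appropriate depends on whether the period should be summarized per year or per paper.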


In [67]:
print('Mean ML pipeline ', 
      df['How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)'].mean())
print('Std  ML pipeline ',
      df['How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)'].std())

Mean ML pipeline  2.151111111111111
Std  ML pipeline  0.600156706418699


## How does the `code` sharing ratio change over time?

In [68]:
def grp_by_year_and_count(df, col):
    """
    Groups dataframe by publication year and counts instances of a column.
    Returns the absolute and the row-normalized counts as (res, res_norm).
    """
    
    res = df.groupby(['pub_year', col]).size().unstack(fill_value=0)
    # pool all years up to 2019 into one row
    res.loc['2008-2019', :]=res[(res.index >= '2008') & (res.index <= '2019')].sum()
    for year in ['2008', '2009', '2010', '2011', '2012', '2014', '2015', '2016', '2017',
           '2018', '2019']:
        try:
            res = res.drop([year], axis='index')
        except KeyError:
            # the year does not occur in this subset
            print(f'{year} not in axis!')
    
    res = res.reindex(['2008-2019', '2020', '2021', '2022'])
    # share of each answer within a period
    res_norm = res.div(res.sum(axis=1), axis=0)
    
    return res, res_norm
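
For comparison, absolute and row-normalized counts per period can also be obtained with `pd.crosstab`. The sketch below uses a toy frame with hypothetical column names (`period`, `code`) and assumes the years have already been pooled into periods; it is not the helper used in this notebook.

```python
import pandas as pd

# Toy data: one row per paper, years already pooled into periods.
toy = pd.DataFrame({
    "period": ["2008-2019", "2008-2019", "2020", "2020", "2020", "2021"],
    "code": ["No", "Yes", "No", "No", "Yes", "Upon request"],
})

# Absolute counts per period and answer.
abs_counts = pd.crosstab(toy["period"], toy["code"])

# Shares within each period (every row sums to 1).
row_shares = pd.crosstab(toy["period"], toy["code"], normalize="index")
```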

In [69]:
res, res_norm = grp_by_year_and_count(df, 'Source Code provided?')
res_df = pd.concat([res_df, 
                    res.add_prefix('code_abs_')], 
                   axis=1)
res_df = pd.concat([res_df, 
                    res_norm.add_prefix('code_norm_')], 
                   axis=1)

In [70]:
df['Source Code provided?'].value_counts(normalize=True)

No              0.755556
Yes             0.204444
Upon request    0.040000
Name: Source Code provided?, dtype: float64

## How does the `data` sharing ratio change over time?

In [71]:
res, res_norm = grp_by_year_and_count(df, 'Data available?')
res_df = pd.concat([res_df, 
                    res.add_prefix('data_abs_')], 
                   axis=1)
res_df = pd.concat([res_df, 
                    res_norm.add_prefix('data_norm_')], 
                   axis=1)

In [72]:
res_df.to_csv('..\\plots\\tables\\pub_count_ml_pipeline_code_data_sharing_eval.csv',
             index_label='pub_year')

In [77]:
res_df

Unnamed: 0,publications_count,Mean of ML pipeline evaluation,code_abs_No,code_abs_Upon request,code_abs_Yes,code_norm_No,code_norm_Upon request,code_norm_Yes,data_abs_No,data_abs_Upon request,data_abs_Yes,data_norm_No,data_norm_Upon request,data_norm_Yes
2008-2019,79,1.869501,57.0,4.0,18.0,0.721519,0.050633,0.227848,46.0,12.0,21.0,0.582278,0.151899,0.265823
2020,108,2.148148,74.0,4.0,30.0,0.685185,0.037037,0.277778,59.0,20.0,29.0,0.546296,0.185185,0.268519
2021,200,2.155,161.0,8.0,31.0,0.805,0.04,0.155,104.0,44.0,52.0,0.52,0.22,0.26
2022,63,2.301587,48.0,2.0,13.0,0.761905,0.031746,0.206349,36.0,8.0,19.0,0.571429,0.126984,0.301587


![image.png](attachment:image.png)

In [95]:
# TODO Report result of Chi² in the paper
# use the absolute counts for both tables
mc = res_df[['code_abs_No',
       'code_abs_Upon request', 'code_abs_Yes']]
md = res_df[['data_abs_No',
       'data_abs_Upon request', 'data_abs_Yes']]

def check_chi2_contingency(df1, df2):
    """
    Compares the distributions of two count tables with a chi² contingency test.
    Returns the p-value and the degrees of freedom.
    """
    # flatten values
    arr1 = np.asarray(df1).ravel()
    arr2 = np.asarray(df2).ravel()
    
    obs = np.array([arr1,arr2])
    
    # test null hypothesis, see: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html
    chi2, p, dof, ex = chi2_contingency(obs, correction=False)
    
    return (p, dof)

p, dof = check_chi2_contingency(mc, md)

print(f'p\t{p}\ndof\t{dof}')


The test is only informative if `arr2` is built from `df2` and both tables hold absolute counts; comparing a table with itself always yields p = 1.0. For the counts in `res_df`, the statistic works out to roughly χ² ≈ 66.7 with 11 degrees of freedom, far above the 0.1 % critical value of about 31.3, so the code-sharing and data-sharing distributions differ significantly.
#TODO: Re-run the cell and report the exact p-value in the paper
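
To make the mechanics of the chi² statistic explicit, the sketch below computes it by hand with NumPy for the overall code-sharing and data-sharing counts (No, Upon request, Yes), summed from the per-input-type tables in this notebook. It only illustrates the calculation: both rows describe the same 450 papers, so the independence assumption of the test is debatable, and `scipy.stats.chi2_contingency` additionally returns the p-value.

```python
import numpy as np

# Overall counts (No, Upon request, Yes), summed from the per-input-type tables.
code_counts = np.array([340, 18, 92])
data_counts = np.array([245, 84, 121])

observed = np.vstack([code_counts, data_counts])   # 2 x 3 contingency table
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)

# Counts expected if the answer distribution were the same in both rows.
expected = row_tot * col_tot / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
```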

### Overall ML pipeline mean, code sharing, and data sharing

In [29]:
# mean rating of the ML pipeline description (overall)
df['How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)'].mean()

2.151111111111111

In [30]:
#percentage code provided overall
df['Source Code provided?'].value_counts(normalize=True)

No              0.755556
Yes             0.204444
Upon request    0.040000
Name: Source Code provided?, dtype: float64

In [31]:
#percentage data shared overall
df['Data available?'].value_counts(normalize=True)

No              0.544444
Yes             0.268889
Upon request    0.186667
Name: Data available?, dtype: float64

![2022-06-23%2011_16_13-pub_count_ml_pipeline_code_data_sharing_eval_1.pdf%20-%20Adobe%20Acrobat%20Pro%20DC.png](attachment:2022-06-23%2011_16_13-pub_count_ml_pipeline_code_data_sharing_eval_1.pdf%20-%20Adobe%20Acrobat%20Pro%20DC.png)

## Does the input type change over time?

In [32]:
res, res_norm = grp_by_year_and_count(df, 'Tabular or Image data as input?')
res_norm

Tabular or Image data as input?,Image data (includes video data),"Tabular (includes EEG, ECG, time-series data)"
pub_year,Unnamed: 1_level_1,Unnamed: 2_level_1
2008-2019,0.202532,0.797468
2020,0.324074,0.675926
2021,0.315,0.685
2022,0.460317,0.539683


The share of image data rises from 20 % (2008-2019) to 46 % (2022). Increased computational power and better availability of image data are plausible drivers, but this analysis does not test them.

## What is the code and data sharing ratio for own developed methods?

In [33]:
filt = df['Which XAI method is used?'].str.contains('Own method developed')
res, res_norm = grp_by_year_and_count(df[filt], 'Source Code provided?')

2008 not in axis!
2010 not in axis!


In [34]:
res_norm

Source Code provided?,No,Upon request,Yes
pub_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2008-2019,0.933333,0.0,0.066667
2020,0.533333,0.133333,0.333333
2021,0.545455,0.090909,0.363636
2022,0.5,0.25,0.25


In [35]:
res, res_norm = grp_by_year_and_count(df[~filt], 'Source Code provided?')

In [36]:
res_norm

Source Code provided?,No,Upon request,Yes
pub_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2008-2019,0.671875,0.0625,0.265625
2020,0.709677,0.021505,0.268817
2021,0.820106,0.037037,0.142857
2022,0.779661,0.016949,0.20339


Source code is shared 4.4 percentage points more often for self-developed explanation methods.

## Who is able to understand the explanation method? To whom are the explanations of the AI system addressed?

In [37]:
# split the multi-select answers and compute the share of papers per audience
ser = df['Who is potentially able to understand the XAI method?'].str.split(';').explode()
ser[ser!=''].value_counts()/len(df)

Developers (Data Scientists or ML Engineers)    1.000000
Domain experts                                  0.904444
Patiensts                                       0.164444
Other stakeholders not mentioned above          0.148889
dtype: float64

In [38]:
# TODO: Merge the groups "patients" and "other stakeholders" and mention
# that a patient can potentially be an ML expert, but we assume that he or she is not
# Domain experts == medical professionals

# Strong bias here regarding what physicians understand
# The correct question would be: Would a physician be able to interpret the output of the XAI method correctly?
# TODO: Discuss the 90.4 % in the discussion section (medical interpretation vs. interpretation of the output of the explanation method)

### Which explanation methods are understood by patients?

In [39]:
# filter patients
filt = df['Who is potentially able to understand the XAI method?'].str.contains('Patiensts')
df_pat = df[filt]
df_npat = df[~filt]

ser_pat = used_methods_value_count(df_pat).rename('Understood by patients')
ser_npat = used_methods_value_count(df_npat).rename('Not understood by patients') 

df_pat_npat = pd.concat([ser_pat, ser_npat], 
                       names=['Understood by patients', 
                              'Not understood by patients'], 
                       axis='columns')

# replace NaNs
df_pat_npat=df_pat_npat.fillna(0)

# drop methods that have been used less than 3 times
filt2=(df_pat_npat['Understood by patients']>2) | (df_pat_npat['Not understood by patients']>2)
df_pat_npat = df_pat_npat[filt2]


In [40]:
df_pat_npat

Unnamed: 0,Understood by patients,Not understood by patients
"Class Activation Mapping or related (i. e., Grad-CAM)",30.0,62.0
Intrinsic interpretable,21.0,73.0
SHAP,11.0,108.0
Random Forest Feature Importance,8.0,38.0
LIME,5.0,35.0
Saliency Map,4.0,4.0
Partial Dependence Plots,2.0,12.0
Attention Weight,2.0,6.0
Layer-Wise Relevance Propagation,1.0,8.0
Permutation Importance,1.0,4.0


![image-3.png](attachment:image-3.png)

## Differences regarding tabular and image data

### Do patients understand the XAI method?

In [41]:
# Understood by patients
res = df_pat.groupby(['Tabular or Image data as input?']).size().reset_index(name='Understood by patients')
res = res.set_index('Tabular or Image data as input?')

# Not understood by patients
res2 = df_npat.groupby(['Tabular or Image data as input?']).size().reset_index(name='Not understood by patients')
res2 = res2.set_index('Tabular or Image data as input?')

# Final results
res = pd.concat([res,res2], axis=1)
res

Unnamed: 0_level_0,Understood by patients,Not understood by patients
Tabular or Image data as input?,Unnamed: 1_level_1,Unnamed: 2_level_1
Image data (includes video data),36,107
"Tabular (includes EEG, ECG, time-series data)",38,269


![image.png](attachment:image.png)

Patients are considered able to understand the XAI method about twice as often in image data projects (36 of 143, 25.2 %) as in tabular data projects (38 of 307, 12.4 %).
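
The shares behind this comparison follow from dividing each row of the count table by its row total. A minimal sketch using the counts from the table above, with shortened index labels:

```python
import pandas as pd

# Counts copied from the table above; index labels shortened.
counts = pd.DataFrame(
    {"Understood by patients": [36, 38], "Not understood by patients": [107, 269]},
    index=["Image", "Tabular"],
)

# Divide each row by its total to get the share per input type.
shares = counts.div(counts.sum(axis=1), axis=0)
```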

### Differences in ML description regarding tabular and image data?

In [42]:
df.groupby(['Tabular or Image data as input?',
           'How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)']).size().reset_index(name='counts')

Unnamed: 0,Tabular or Image data as input?,"How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)",counts
0,Image data (includes video data),1.0,7
1,Image data (includes video data),2.0,92
2,Image data (includes video data),3.0,44
3,"Tabular (includes EEG, ECG, time-series data)",1.0,45
4,"Tabular (includes EEG, ECG, time-series data)",2.0,186
5,"Tabular (includes EEG, ECG, time-series data)",3.0,76


![2022-06-23%2014_58_41-ml_description_by_data_type.xlsx%20-%20Excel.png](attachment:2022-06-23%2014_58_41-ml_description_by_data_type.xlsx%20-%20Excel.png)

The ML pipeline is described slightly better in image data projects (mean rating 2.26) than in tabular data projects (mean rating 2.10).
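
The mean rating per input type can be recovered from the count table as a weighted average of the ratings. A small sketch with the counts from the table above; the short column names are chosen for illustration only.

```python
import pandas as pd

# Counts copied from the table above; column names are illustrative.
counts = pd.DataFrame({
    "input": ["Image"] * 3 + ["Tabular"] * 3,
    "rating": [1, 2, 3, 1, 2, 3],
    "counts": [7, 92, 44, 45, 186, 76],
})

# Weighted average: sum(rating * count) / sum(count) per input type.
weighted_sum = (counts["rating"] * counts["counts"]).groupby(counts["input"]).sum()
n_papers = counts.groupby("input")["counts"].sum()
mean_rating = weighted_sum / n_papers
```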

### Difference in code and data availability regarding tabular and image data?

In [43]:
df.groupby(['Tabular or Image data as input?', 'Source Code provided?']).size().reset_index(name='counts')

Unnamed: 0,Tabular or Image data as input?,Source Code provided?,counts
0,Image data (includes video data),No,106
1,Image data (includes video data),Upon request,9
2,Image data (includes video data),Yes,28
3,"Tabular (includes EEG, ECG, time-series data)",No,234
4,"Tabular (includes EEG, ECG, time-series data)",Upon request,9
5,"Tabular (includes EEG, ECG, time-series data)",Yes,64


In [44]:
df.groupby(['Tabular or Image data as input?', 'Data available?']).size().reset_index(name='counts')

Unnamed: 0,Tabular or Image data as input?,Data available?,counts
0,Image data (includes video data),No,69
1,Image data (includes video data),Upon request,23
2,Image data (includes video data),Yes,51
3,"Tabular (includes EEG, ECG, time-series data)",No,176
4,"Tabular (includes EEG, ECG, time-series data)",Upon request,61
5,"Tabular (includes EEG, ECG, time-series data)",Yes,70


![2022-06-27%2010_18_53-code_data_availability_by_data_type.xlsx%20-%20Excel.png](attachment:2022-06-27%2010_18_53-code_data_availability_by_data_type.xlsx%20-%20Excel.png)

Source code is provided at a similar rate for both input types (19.6 % for image vs. 20.8 % for tabular data), but the data is available more often in image data projects (35.7 % vs. 22.8 %, about 13 percentage points higher).

In [45]:
df_tab['How well is the Machine Learning pipeline described? (1 = Not Described, 3 = Elaborately described)'].mean()

2.1009771986970684

## Differences in the description of the ML pipeline based on the XAI method?

## Differences in code/data availability based on the XAI method used?

## Which XAI methods are mostly used in specific years? Distribution of XAI methods over time