# DIRAC Analysis of Arivale Proteomics — Permutation Test for Chronological Age-stratified Group

***by Kengo Watanabe***  

The differential rank conservation (DIRAC; Eddy, J.A. et al. PLoS Comput. Biol. 2010) analysis was performed on the preprocessed Arivale proteomics dataset (analytes detected in ≥90% of participants; random forest imputation of missing values; sample-based robust Z-score followed by analyte-based robust Z-score) using the retrieved a priori module set (Gene Ontology (Biological Process) retrieved via the EMBL-EBI QuickGO API; ≥4 analytes and ≥30% coverage).  
–> In this notebook, to assess the statistical significance of tightly regulated modules between groups, the DIRAC analysis is repeated on permuted data in which the sample–phenotype relationship is randomized.  

Input:  
* Preprocessed analyte data: 220530_Arivale-DIRAC-proteomics-ver5-3_Preprocessing_protDF-robustZscore.tsv  
* Module–analyte metadata: 220530_Arivale-DIRAC-proteomics-ver5-3_Preprocessing_module-metadata_QuickGO-GOBP-min-n4-cov30.tsv  
* Cleaned module metadata: 220531_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group_module-metadata.tsv  
* Sample–participant metadata: 220530_Arivale-DIRAC-proteomics-ver5-3_DataCleaning_participant-metadata.tsv  
* Intermediate DIRAC tables (network ranking dataframe): 220531_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group_NetworkRanking-[digit].tsv  
* Tables of DIRAC RCIs: 220531_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group_RankConservationIndex-[digit].tsv  

Output:  
* Supplementary Figure 4b  

Original notebook (memo for my future tracing):  
* wenceslaus:[JupyterLab HOME]/210216_Arivale-DIRAC/220616_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group-permutation.ipynb  

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

from itertools import combinations
import math
import random
from multiprocessing import Pool
from statsmodels.stats import multitest as multi
from decimal import Decimal, ROUND_HALF_UP

!conda list

## 0. DIRAC code

> The original DIRAC code was written in MATLAB; it is rewritten in Python 3 here.  
> <– Computational speed is not a priority here; rather, the code follows the procedure described in the original paper.  
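
As a quick orientation before the full functions, the following is a minimal NumPy sketch of the DIRAC quantities for a single module on toy data. The expression values, the group labels, and the helper name `toy_rci` are made up for illustration and are not part of the original analysis; the tie rule (true rate = 0.5 is assigned to 0) follows the functions defined in this notebook.

```python
import numpy as np
from itertools import combinations

# Toy expression matrix: 3 genes (rows) x 4 samples (columns); values are made up
X = np.array([[1.0, 2.0, 5.0, 6.0],
              [2.0, 3.0, 4.0, 1.0],
              [3.0, 1.0, 6.0, 2.0]])
phenotype = np.array(['A', 'A', 'B', 'B'])  # hypothetical group labels

# Network ranking: 1 if X_i < X_j for each gene pair (i, j), otherwise 0
pairs = list(combinations(range(X.shape[0]), 2))
ranking = np.array([(X[i] < X[j]).astype(int) for i, j in pairs])  # pairs x samples

def toy_rci(ranking, phenotype, k):
    """Mean rank matching score of phenotype k against its own rank template."""
    sub = ranking[:, phenotype == k]
    template = (sub.mean(axis=1) > 0.5).astype(int)  # majority vote; ties go to 0
    rms = (sub == template[:, None]).mean(axis=0)  # per-sample rate of matching pairs
    return rms.mean()

rci_A = toy_rci(ranking, phenotype, 'A')
rci_B = toy_rci(ranking, phenotype, 'B')
print(rci_A, rci_B)
```

A higher value means that the samples of a group agree more closely with their own consensus ordering, i.e., the module is more tightly regulated in that group.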

In [None]:
def network_ranking(DF, networkS):
    # This function calculates the pairwise ordering of network genes (i.e., network ranking).
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    ## from itertools import combinations: confirmed with Python 3.7.6 and 3.9.6
    # Input:
    ## DF: pd.DataFrame containing expression values (X_gn) with gene (g; 1-G) indices and sample (n; 1-N) columns
    ## networkS: pd.Series containing genes (g; 1-G) with network (m; 1-M) indices (i.e., 1-on-1 long-format)
    # Output:
    ## pd.DataFrame containing binary values of the network ranking comparison for each sample
    ## -> row: comparison_id (indicated by network (m) and ordering (g_i < g_j) columns)
    ## -> column: sample (n) (with exception of the NetworkID and Ordering columns)
    # Note:
    ## If items in network m and gene g contain ' : ' or ' < ', this code will raise an error or produce unexpected output.
    ## If an item in sample n is 'NetworkID' or 'Ordering', this code will raise an error or produce unexpected output.
    
    #Calculate binary values of the network ranking comparison for each sample
    rankDF = pd.DataFrame()
    sampleL = DF.columns.tolist()
    networkL = networkS.index.unique()
    for n in sampleL:
        rankDF_n = pd.DataFrame()
        for m in networkL:
            #Pairs of genes (g_i, g_j) in the network m
            networkS_m = networkS.loc[m]
            pairL_m = list(combinations(range(0, len(networkS_m)), 2))
            pairL_m_i, pairL_m_j = [[pair[x] for pair in pairL_m] for x in (0,1)]
            rankDF_m = pd.DataFrame({'g_i':networkS_m.iloc[pairL_m_i],
                                     'g_j':networkS_m.iloc[pairL_m_j]})#Hold network (m; 1-M) index
            #Compare the expression values (X_gn) between pairwise genes (g_i vs. g_j)
            tempL = []
            for pair_i in range(0, len(rankDF_m)):
                g_i = rankDF_m.iloc[pair_i, 0]
                g_j = rankDF_m.iloc[pair_i, 1]
                X_i = DF.loc[g_i, n]
                X_j = DF.loc[g_j, n]
                #If X_i < X_j is true, add 1; otherwise (X_i >= X_j), add 0
                if X_i < X_j:
                    tempL.append(1)
                else:
                    tempL.append(0)
            rankDF_m['X_i<X_j'] = tempL
            #Update the network ranking dataframe of sample n
            rankDF_m.index.set_names('NetworkID', inplace=True)#Set/reset index name
            rankDF_m = rankDF_m.reset_index()
            rankDF_n = pd.concat([rankDF_n, rankDF_m], axis=0)
        #Update the network ranking dataframe of all samples
        rankDF_n['Sample'] = n
        rankDF = pd.concat([rankDF, rankDF_n], axis=0)
    ##Prepare dummy index and clean dataframe
    rankDF['ComparisonID'] = rankDF['NetworkID'] + ' : ' + rankDF['g_i'] + ' < ' + rankDF['g_j']
    rankDF = rankDF.pivot(index='ComparisonID', columns='Sample', values='X_i<X_j')#Sorted by index during this
    rankDF = rankDF.reset_index()#Index becomes row number here
    tempDF = rankDF['ComparisonID'].str.split(pat=' : ', expand=True)
    tempDF = tempDF.rename(columns={0:'NetworkID', 1:'Ordering'})
    rankDF = pd.concat([tempDF, rankDF], axis=1)#Dropping columns name 'Sample' during this
    rankDF = rankDF.drop(columns='ComparisonID')
    return rankDF

def rank_template(rankDF, phenotypeS):
    # This function generates the rank template (T) presenting the expected network ranking in a phenotype.
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    # Input:
    ## rankDF: pd.DataFrame obtained from the above network_ranking() function
    ## phenotypeS: pd.Series containing phenotypes (k; 1-K) with sample (n; 1-N) indices (i.e., 1-on-1 long-format)
    # Output:
    ## pd.DataFrame containing the expected binary values of network ranking comparison for each phenotype (T_mk)
    ## -> row: comparison_id (indicated by network (m) and ordering (g_i < g_j) columns)
    ## -> column: phenotype (k) (with exception of the NetworkID and Ordering columns)
    # Note:
    ## True rate = 0.5 is assigned to 0 in this code.
    ## If an item in phenotype k is 'NetworkID' or 'Ordering', this code will raise an error or produce unexpected output.
    
    #Calculate the expected binary values of network ranking comparison for each phenotype (T_mk)
    templateDF = rankDF[['NetworkID', 'Ordering']].copy()#Copy so that new columns are not assigned to a slice of rankDF
    phenotypeL = phenotypeS.unique().tolist()
    for k in phenotypeL:
        sampleL_k = phenotypeS.loc[phenotypeS==k].index.tolist()
        tempDF = rankDF[sampleL_k]
        tempS = tempDF.mean(axis=1)#True (=1) rate
        tempS = (tempS>0.5).astype('int64')#If true rate > 0.5, 1; otherwise (<= 0.5), 0
        templateDF[k] = tempS
    return templateDF

def rank_matching_score(rankDF, templateDF):
    # This function calculates the rank matching score (R) of a sample.
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    # Input:
    ## rankDF: pd.DataFrame obtained from the above network_ranking() function
    ## templateDF: pd.DataFrame obtained from the above rank_template() function
    # Output:
    ## pd.DataFrame containing the rates of gene pairs matching to a rank template in a sample (R_mkn)
    ## -> row: rank template (T_mk) (indicated by network (m) and template phenotype (k) columns)
    ## -> column: sample (n) (with exception of the NetworkID and Template columns)
    # Note:
    ## True rate = 0.5 was assigned to 0 in the above rank_template() function.
    ## -> 'Match (1)' and 'Mismatch (0)' are evenly assigned to samples in the phenotype.
    ##    (i.e., 'Match' for (X_i < X_j) = 0 and 'Mismatch' for (X_i < X_j) = 1 in the tie case)
    ## If items in network m and phenotype k contain ' : ', this code will raise an error or produce unexpected output.
    
    #Calculate the rates of gene pairs matching to a rank template in a sample (R_mkn)
    scoreDF = pd.DataFrame()
    sampleL = rankDF.drop(columns=['NetworkID', 'Ordering']).columns.tolist()
    phenotypeL = templateDF.drop(columns=['NetworkID', 'Ordering']).columns.tolist()
    for n in sampleL:
        scoreDF_n = pd.DataFrame()
        for k_template in phenotypeL:
            tempDF = rankDF[['NetworkID', 'Ordering']].copy()#Copy so that 'Match' is not assigned to a slice of rankDF
            tempS = (rankDF[n]==templateDF[k_template]).astype('int64')#If matching, 1; otherwise, 0
            tempDF['Match'] = tempS
            #Calculate the rank matching score (mean of 'Match' only; 'Ordering' is non-numeric)
            scoreDF_k = tempDF.groupby(by='NetworkID', as_index=False, sort=False)['Match'].mean()
            #Update the rank matching score dataframe of sample n
            scoreDF_k['Template'] = k_template
            scoreDF_k = scoreDF_k[['NetworkID', 'Template', 'Match']]
            scoreDF_n = pd.concat([scoreDF_n, scoreDF_k], axis=0)
        #Update the rank matching score dataframe of all samples
        scoreDF_n['Sample'] = n
        scoreDF = pd.concat([scoreDF, scoreDF_n], axis=0)
    ##Prepare dummy index and clean dataframe
    scoreDF['RMSmkID'] = scoreDF['NetworkID'].str.cat(scoreDF['Template'], sep=' : ')
    scoreDF = scoreDF.pivot(index='RMSmkID', columns='Sample', values='Match')#Sorted by index during this
    scoreDF = scoreDF.reset_index()#Index becomes row number here
    tempDF = scoreDF['RMSmkID'].str.split(pat=' : ', expand=True)
    tempDF = tempDF.rename(columns={0:'NetworkID', 1:'Template'})
    scoreDF = pd.concat([tempDF, scoreDF], axis=1)#Dropping columns name 'Sample' during this
    scoreDF = scoreDF.drop(columns='RMSmkID')
    return scoreDF

def rank_conservation_index(scoreDF, phenotypeS):
    # This function calculates the rank conservation index (muR) of a phenotype.
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    # Input:
    ## scoreDF: pd.DataFrame obtained from the above rank_matching_score() function
    ## phenotypeS: pd.Series containing phenotypes (k; 1-K) with sample (n; 1-N) indices (i.e., 1-on-1 long-format)
    # Output:
    ## pd.DataFrame containing the mean values of rank matching scores (RMSs) in a phenotype (muR_mkk)
    ## -> row: rank template (T_mk) (indicated by network (m) and template phenotype (k) columns)
    ## -> column: phenotype (k) (with exception of the NetworkID and Template columns)
    # Note:
    ## Rank conservation index (RCI) is used in a broad sense here; strictly speaking,
    ## the term RCI refers to the mean of RMSs only when the template phenotype is the same as the sample phenotype.
    ## If an item in phenotype k is 'NetworkID' or 'Template', this code will raise an error or produce unexpected output.
    
    #Calculate the mean values of RMSs in a phenotype (muR_mkk)
    conservationDF = scoreDF[['NetworkID', 'Template']].copy()#Copy so that new columns are not assigned to a slice of scoreDF
    phenotypeL = phenotypeS.unique().tolist()
    for k in phenotypeL:
        sampleL_k = phenotypeS.loc[phenotypeS==k].index.tolist()
        tempS = scoreDF[sampleL_k].mean(axis=1)
        conservationDF[k] = tempS
    return conservationDF

## 1. Prepare dataset and metadata

### 1-1. Analyte data

In [None]:
#Import analyte data
fileDir = './ImportData/'
ipynbName = '220530_Arivale-DIRAC-proteomics-ver5-3_Preprocessing_'
fileName = 'protDF-robustZscore.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('UniProtID')
display(tempDF)

analyteDF = tempDF

### 1-2. Module–analyte metadata

In [None]:
#Import module-analyte metadata
fileDir = './ImportData/'
ipynbName = '220530_Arivale-DIRAC-proteomics-ver5-3_Preprocessing_'
fileName = 'module-metadata_QuickGO-GOBP-min-n4-cov30.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('ModuleID')
print(' - Unique analytes with module:', len(tempDF['UniProtID'].unique()))
print(' - Unique modules with analytes:', len(tempDF.index.unique()))

#Prepare moduleS
moduleS = tempDF['UniProtID']
display(moduleS)

#Import the cleaned module metadata in another notebook
fileDir = './ImportData/'
ipynbName = '220531_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group_'
fileName = 'module-metadata.tsv'
moduleDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('ModuleID')
display(moduleDF)
display(moduleDF.describe(include='all'))

### 1-3. Sample metadata

In [None]:
#Import participant metadata
fileDir = './ImportData/'
ipynbName = '220530_Arivale-DIRAC-proteomics-ver5-3_DataCleaning_'
fileName = 'participant-metadata.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.rename(columns={'public_client_id':'SampleID'})
tempDF = tempDF.loc[tempDF['SampleID'].isin(analyteDF.columns)]
tempDF = tempDF.set_index('SampleID')

#Prepare phenotypeS
tempDF['Phenotype'] = tempDF['CA10Group']
display(tempDF)
display(tempDF['Phenotype'].value_counts())

sampleDF = tempDF

## 2. Permute the DIRAC calculation by randomizing the sample labels

> Under the null hypothesis, the sample labels, not the analyte labels, should be randomized.  
> ***–> Because calculating rankingDF is the time-consuming step, the previously calculated rankingDF is reused and the randomization is applied to phenotypeS.***  
> * Although JupyterLab seems to impose no upper limit on the random seed, the random seeds are left unfixed just in case.  
> * Only the rank conservation index dataframes are needed to build the null-hypothesis distribution; hence, only they are saved. (cf. Saving the rank matching scores for later use would also require saving the shuffled phenotypeS.)  
> * To simplify the procedure and outputs, the multiprocessing is split per shuffled phenotypeS (not per subset of moduleS).  
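
The label randomization itself can be illustrated on toy data. The sketch below is not what was run in this notebook (the seeds were deliberately left unfixed here); it assumes that reproducible shuffles are wanted and uses a hypothetical helper, `shuffle_labels`, with a per-iteration seed. The sample IDs and labels are made up.

```python
import random
import pandas as pd

# Hypothetical phenotype labels for six samples (not the Arivale data)
phenotypeS = pd.Series(['F-Q1', 'F-Q1', 'F-Q2', 'F-Q2', 'M-Q1', 'M-Q1'],
                       index=['s1', 's2', 's3', 's4', 's5', 's6'], name='Phenotype')

def shuffle_labels(phenotypeS, seed):
    """Return phenotypeS with its labels permuted across samples (same result for the same seed)."""
    rng = random.Random(seed)  # local generator; the global random state is left untouched
    labels = phenotypeS.tolist()
    shuffled = rng.sample(labels, len(labels))  # permutation, i.e., sampling without replacement
    return pd.Series(shuffled, index=phenotypeS.index, name=phenotypeS.name)

perm1 = shuffle_labels(phenotypeS, seed=1)
perm1_again = shuffle_labels(phenotypeS, seed=1)
print(perm1)
```

Because the labels are permuted rather than resampled, every shuffled series keeps the original group sizes, which is what the permutation null hypothesis requires.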

In [None]:
#Import rankingDF
fileDir = './ImportData/'
ipynbName = '220531_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group_'
nSub = 5
tempDF = pd.DataFrame()
for list_i in range(nSub):
    fileName = 'NetworkRanking-'+str(list_i+1).zfill(3)+'.tsv'
    tempDF1 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    tempDF = pd.concat([tempDF, tempDF1], axis=0, ignore_index=True)

print('• Network ranking dataframe:')
display(tempDF)

rankingDF = tempDF

In [None]:
start = 1
niterations = 20000
nprocessors = 75
fileDir = './ExportData/'
ipynbName = '220616_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group-permutation_'

#Wrap as a single function
def random_dirac(iter_i):
    #Randomize the sample labels
    tempL = sampleDF['Phenotype'].tolist()
    tempL = random.sample(tempL, len(tempL))
    tempS = pd.Series(tempL, index=sampleDF.index, name='Phenotype')
    
    #Calculate the pairwise ordering of network genes (i.e., network ranking)
    #-> Skip this and reuse the previously calculated one
    
    #Generate the rank template (T) presenting the expected network ranking in a phenotype
    templateDF = rank_template(rankingDF, tempS)
    
    #Calculate the rank matching score (R) of a sample
    rmsDF = rank_matching_score(rankingDF, templateDF)
    
    #Calculate the rank conservation index (muR) of a phenotype
    rciDF = rank_conservation_index(rmsDF, tempS)
    
    #Save
    fileName = 'RankConservationIndex-'+str(start+iter_i).zfill(5)+'.tsv'
    rciDF.to_csv(fileDir+ipynbName+fileName, index=False, sep='\t')

#Parallel computing
if __name__=='__main__':
    t_start = time.time()
    with Pool(nprocessors) as p:#Release the worker processes once the mapping has finished
        p.map(random_dirac, range(niterations))
    t_finish = time.time()

In [None]:
#Record as reference
print(niterations, 'iterations with', nprocessors, 'processors')
print(' - Start:', time.ctime(t_start))
print(' - Finish:', time.ctime(t_finish))
t_elapsed = (t_finish - t_start)
print(' - Elapsed time:', int(t_elapsed//(60*60*24)), 'day',
      time.strftime('%H h %M min %S.{} sec'.format(str(t_elapsed%1)[2:]), time.gmtime(t_elapsed)))
t_elapsed = (t_finish - t_start) * nprocessors
print(' - Total elapsed time without parallel computing:', int(t_elapsed//(60*60*24)), 'day',
      time.strftime('%H h %M min %S.{} sec'.format(str(t_elapsed%1)[2:]), time.gmtime(t_elapsed)))
t_elapsed = (t_finish - t_start) / niterations
print(' - Mean apparent elapsed time per iteration:', int(t_elapsed//(60*60*24)), 'day',
      time.strftime('%H h %M min %S.{} sec'.format(str(t_elapsed%1)[2:]), time.gmtime(t_elapsed)))

## 3. Rank conservation index: general pattern across groups

### 3-1. Extract RCI (the mean of RMSs against each phenotype's own consensus template)
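
To make the extraction concrete, here is a small sketch on a made-up RCI table with two modules and two phenotypes. It uses `melt` and `pivot` instead of the merge loop in the cells below, purely as an illustrative alternative; the names `rciDF_toy` and `rci_kk_toy` and all values are hypothetical.

```python
import pandas as pd

# Toy RCI table: one row per (module, template phenotype); values are made up
rciDF_toy = pd.DataFrame({
    'ModuleID': ['m1', 'm1', 'm2', 'm2'],
    'Template': ['A', 'B', 'A', 'B'],
    'A': [0.9, 0.6, 0.8, 0.5],
    'B': [0.55, 0.85, 0.6, 0.7],
})

# For each phenotype column k, keep only the rows whose template is also k
longDF = rciDF_toy.melt(id_vars=['ModuleID', 'Template'],
                        var_name='Phenotype', value_name='RCI')
ownDF = longDF.loc[longDF['Template'] == longDF['Phenotype']]
rci_kk_toy = ownDF.pivot(index='ModuleID', columns='Phenotype', values='RCI')
print(rci_kk_toy)
```

Each cell of the result is the mean RMS of a phenotype against its own template, which is the strict definition of the RCI.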

In [None]:
#Import true (not-randomized) rciDF
fileDir = './ImportData/'
ipynbName = '220531_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group_'
nSub = 5
tempDF = pd.DataFrame()
for list_i in range(nSub):
    fileName = 'RankConservationIndex-'+str(list_i+1).zfill(3)+'.tsv'
    tempDF1 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    tempDF1 = tempDF1.rename(columns={'NetworkID':'ModuleID'})
    tempDF = pd.concat([tempDF, tempDF1], axis=0, ignore_index=True)

print('• Rank conservation index dataframe:')
display(tempDF)

rciDF = tempDF

In [None]:
#Extract RCI whose template phenotype corresponds to the own phenotype
phenotypeL = rciDF.drop(columns=['ModuleID', 'Template']).columns.tolist()
rciDF_kk = pd.DataFrame(index=pd.Index(rciDF['ModuleID'].unique(), name='ModuleID'))
tempDF = rciDF.set_index('ModuleID')
for k in phenotypeL:
    tempS = tempDF[k].loc[tempDF['Template']==k]
    rciDF_kk = pd.merge(rciDF_kk, tempS, left_index=True, right_index=True, how='left')

#Order and re-label
nQs = {'F':10, 'M':10}
tempL = [sex+'-Q'+str(i+1) for sex in nQs.keys() for i in range(nQs[sex])]
rciDF_kk = rciDF_kk[tempL]
display(rciDF_kk)
display(rciDF_kk.describe())

### 3-2. Null-hypothesis distribution of RCI median

In [None]:
#Prepare the null-hypothesis distribution
start = 1
niterations = 20000
fileDir = './ExportData/'
ipynbName = '220616_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group-permutation_'

nullDF = pd.DataFrame(columns=rciDF_kk.columns)
t_start = time.time()
for iter_i in range(niterations):
    #Import rciDF of the iteration
    fileName = 'RankConservationIndex-'+str(start+iter_i).zfill(5)+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    tempDF = tempDF.rename(columns={'NetworkID':'ModuleID'})
    
    #Extract RCI whose template phenotype corresponds to the own phenotype
    phenotypeL = tempDF.drop(columns=['ModuleID', 'Template']).columns.tolist()
    tempDF1 = pd.DataFrame(index=pd.Index(tempDF['ModuleID'].unique(), name='ModuleID'))
    tempDF = tempDF.set_index('ModuleID')
    for k in phenotypeL:
        tempS = tempDF[k].loc[tempDF['Template']==k]
        tempDF1 = pd.merge(tempDF1, tempS, left_index=True, right_index=True, how='left')
    
    #Calculate the median of RCIs in modules
    tempS = tempDF1.median(axis=0)
    nullDF.loc['Iteration_'+str(start+iter_i).zfill(5)] = tempS
t_finish = time.time()
t_elapsed = t_finish - t_start
print('Elapsed time for', niterations, 'iterations:',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
display(nullDF.describe())

### 3-3. Visualization: boxplot
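
The cell below also derives a two-sided empirical P-value for each group by comparing the absolute deviation of the observed RCI median from the null median against the same deviations in the null distribution. The following is a minimal vectorized sketch of that idea on made-up numbers. The helper name `empirical_two_sided_p` and the normal toy null distribution are assumptions for illustration, and the sketch does not special-case an observed value outside the null range, which the cell below reports as 0.0.

```python
import numpy as np

def empirical_two_sided_p(observed, null_values):
    """Fraction of null values at least as far from the null median as the observed value."""
    null_values = np.asarray(null_values, dtype=float)
    center = np.median(null_values)
    obs_dev = abs(observed - center)
    null_dev = np.abs(null_values - center)
    return float(np.mean(null_dev >= obs_dev))

rng = np.random.default_rng(0)  # fixed seed only to make this toy example repeatable
null_toy = rng.normal(loc=0.55, scale=0.01, size=20000)  # made-up null medians

p_center = empirical_two_sided_p(np.median(null_toy), null_toy)  # observed sits at the null median
p_far = empirical_two_sided_p(0.65, null_toy)  # observed lies far outside the null range
p_small = empirical_two_sided_p(3, [0, 1, 2, 3, 4])  # tiny hand-checkable case
print(p_center, p_far, p_small)
```

With this estimator, the smallest nonzero P-value is 1 divided by the number of iterations, so the resolution of the test is set by how many permutations were run.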

In [None]:
#Prepare plot DF
##Median of RCIs
tempDF1 = rciDF_kk.describe().T.reset_index()
tempDF = tempDF1['index'].str.split(pat='-', expand=True)
tempDF = tempDF.rename(columns={0:'Sex', 1:'Label'})
tempDF1 = pd.concat([tempDF1, tempDF], axis=1)
##Generate 95% CI of null-hypothesis distribution
tempDF2 = nullDF.quantile(q=[0.025, 0.5, 0.975], axis=0)
tempDF2.loc['low_diff'] = tempDF2.loc[0.5] - tempDF2.loc[0.025]
tempDF2.loc['high_diff'] = tempDF2.loc[0.975] - tempDF2.loc[0.5]
tempDF2 = tempDF2.T.reset_index()
tempDF = tempDF2['index'].str.split(pat='-', expand=True)
tempDF = tempDF.rename(columns={0:'Sex', 1:'Label'})
tempDF2 = pd.concat([tempDF2, tempDF], axis=1)

#Label and color
tempD1 = {'F':'Female', 'M':'Male'}
tempD2 = {'F':plt.get_cmap('RdBu')(0.0), 'M':plt.get_cmap('RdBu')(1.0)}

#Prepare significance labels (Of note, no claim can be made about P < 0.001, which would require at least 100,000 iterations)
##Calculate two-sided P-value
tempL = []
for row_i in range(len(tempDF1)):
    group = tempDF1['index'].iloc[row_i]
    rci_median = tempDF1['50%'].iloc[row_i]
    tempS = nullDF[group]
    if (rci_median<tempS.min())|(rci_median>tempS.max()):
        tempL.append(0.0)#Not exactly 0.0 but just beyond the iteration range
    else:
        rci_median_absdev = abs(rci_median - tempS.median())
        tempS = abs(tempS - tempS.median())
        tempS = tempS.sort_values(ascending=False)
        count = 0#Initialize
        while rci_median_absdev<=tempS.iloc[count]:
            count += 1
            if count==len(tempS):#To surely stop even if RCI median == median of the null distribution
                break
        tempL.append(count/len(tempS))
tempDF3 = tempDF1[['index', 'Sex', 'Label']].set_index('index')
tempDF3['Pval'] = tempL
##P-value adjustment by using Holm–Bonferroni method
tempDF3['AdjPval'] = multi.multipletests(tempDF3['Pval'], alpha=0.05, method='holm',
                                         is_sorted=False, returnsorted=False)[1]
##Convert p-value to label
tempL = []
for row_i in range(len(tempDF3)):
    pval = tempDF3['AdjPval'].iloc[row_i]
    if pval<0.01:
        tempL.append('**')
    elif pval<0.05:
        tempL.append('*')
    else:
        pval_text = Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
        tempL.append(r'$P$ = '+str(pval_text))
tempDF3['SignifLabel'] = tempL
display(tempDF3)

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4.5), sharex=False, sharey=True)
ymax = 0.65
ymin = 0.5
yinter = 0.05
ymargin_t = 0.01
ymargin_b = 0.01
xoff = 0.005
yoff = 0.005
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD1.keys())[ax_i]
    tempL = ['Q'+str(i+1) for i in range(nQs[sex])]
    #Pointplot
    tempDF = tempDF1.loc[tempDF1['Sex']==sex]
    #sns.pointplot(data=tempDF, x='Label', y='50%', order=tempL, palette='PRGn_r',
    #              ci=None, dodge=False, join=False, ax=ax, markers='o', mec='0.3', mew=2)#Edgecolor doesn't work...
    sns.scatterplot(data=tempDF, x='Label', y='50%', hue='Label', hue_order=tempL,
                    palette='PRGn_r', edgecolor='k', s=100, legend=None, ax=ax)
    #Errorbars
    tempDF = tempDF2.loc[tempDF2['Sex']==sex]
    ax.errorbar(x=tempDF['Label'].tolist(), y=tempDF.loc[:, 0.5].tolist(),
                yerr=[tempDF['low_diff'].tolist(), tempDF['high_diff'].tolist()],
                marker='_', ls='none', color='0.3', ms=10, mew=2, capsize=8, zorder=0.5)
    #ax.margins(x=0, tight=True)
    plt.setp(ax, xlim=(-0.5, len(tempL)-0.5))
    plt.setp(ax.get_xticklabels(), rotation=70, horizontalalignment='right',
             verticalalignment='center', rotation_mode='anchor')
    #Add significance labels
    tempDF = tempDF1.loc[tempDF1['Sex']==sex]
    for row_i in range(len(tempDF)):
        xcoord = tempL.index(tempDF['Label'].iloc[row_i])
        ycoord = tempDF['50%'].iloc[row_i] + yinter/10
        group = tempDF['index'].iloc[row_i]
        label = tempDF3.loc[group, 'SignifLabel']
        if label in ['**', '*']:
            text_offset = yinter/25
            text_size = 'medium'
            text_rotation = 0
        else:
            text_offset = yinter/5
            text_size = 'x-small'
            text_rotation = 90
        ax.annotate(label, xy=(xcoord, ycoord+text_offset),
                    rotation=text_rotation, rotation_mode='default',
                    horizontalalignment='center', verticalalignment='bottom',
                    fontsize=text_size, color='k')
    #Facet
    ax.set_title(tempD1[sex], {'fontsize':'medium', 'color':'w'})
    rect = plt.Rectangle((0+xoff, 1+yoff), 1-xoff, 0.1,#Manual adjustment
                         transform=ax.transAxes, facecolor=tempD2[sex],
                         clip_on=False, linewidth=0, zorder=0)
    ax.add_patch(rect)
    if ax_i==0:
        plt.setp(ax, xlabel='', ylabel='Median of module RCIs')
        ax0_pos = ax.get_position().bounds#Save position to generate legend later
    else:
        plt.setp(ax.get_yticklabels(), visible=False)
        plt.setp(ax, xlabel='', ylabel='')
        ax1_pos = ax.get_position().bounds#Save position to generate legend later
sns.despine()
plt.setp(axes, ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax+yinter/10, yinter))
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]*1.25))/2, y=ax0_pos[1]-ax0_pos[3]*0.1,#Minor manual adjustment
         s='Chronological age quantile-based group', fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
fig.tight_layout()
##Save
fileDir = './ExportFigures/'
ipynbName = '220616_Arivale-DIRAC-proteomics-ver5-3_DIRAC-GOBP-CA10Group-permutation_'
fileName = 'RCImedian-boxplot.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

# — End of notebook —