# DIRAC Analysis of Arivale Metabolomics — Chronological Age-stratified Group

***by Kengo Watanabe***  

Tomasz Wilmanski performed weighted gene correlation network analysis (WGCNA; Langfelder, P. & Horvath, S. BMC Bioinform. 2008) on the Arivale metabolomics.  
–> In this notebook, differential rank conservation (DIRAC; Eddy, J.A. et al. PLoS Comput. Biol. 2010) analysis is performed on the preprocessed Arivale metabolomics dataset (analytes detected in ≥90% of participants; random forest imputation of missing values; sample-based robust Z-score followed by analyte-based robust Z-score) using the data-driven module set (modules identified by WGCNA; ≥4 analytes and ≥50% coverage).  
**–> Only the chronological age (CAge)-stratified group version is included here**; the other consensus version is included in another notebook.  

Input:  
* Preprocessed analyte data: 220603_Arivale-DIRAC-metabolomics-ver2_Preprocessing_metDF-robustZscore.tsv  
* Module–analyte metadata: 220603_Arivale-DIRAC-metabolomics-ver2_Preprocessing_module-metadata_WGCNA-min-n4-cov50.tsv  
* Analyte metadata: 220603_Arivale-DIRAC-metabolomics-ver2_Preprocessing_analyte-metadata_Arivale.tsv  
* Sample–participant metadata: 220603_Arivale-DIRAC-metabolomics-ver2_DataCleaning_participant-metadata.tsv  

Output:  
* Supplementary Figure 5c–f  
* Supplementary Data 9 (ModuleMetadata, CA10Group_[sheet name suffix])  

Original notebook (memo for my future tracing):  
* dalek:[JupyterLab HOME]/210324_Arivale-DIRAC-metabolomics/220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group.ipynb  

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time

from itertools import combinations
import math
from multiprocessing import Pool
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf
from statsmodels.stats import multitest as multi
from decimal import Decimal, ROUND_HALF_UP
import re
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.offsetbox import AnchoredText
#!pip install venn
from venn import venn

!conda list

## 0. DIRAC code

> The original DIRAC code was written in MATLAB; it is re-written in Python 3 here.  
> <– Computational speed is not a priority here; rather, the code adheres to the procedure described in the original paper.  
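
> As a rough orientation, the four DIRAC steps (network ranking, rank template, rank matching score, rank conservation index) can be sketched on a toy example in plain Python. The analyte names, sample names, phenotype labels, and helper names (`template_for`, `rms`, `rci`) are hypothetical and are not the pandas implementation in this notebook; only the logic (1 if X_i < X_j, majority vote with a true rate of 0.5 assigned to 0) follows the function notes.

```python
from itertools import combinations

# Toy data (hypothetical): one module of three analytes, four samples, two phenotypes
values = {
    's1': {'a': 1.0, 'b': 2.0, 'c': 3.0},
    's2': {'a': 1.5, 'b': 2.5, 'c': 2.0},
    's3': {'a': 3.0, 'b': 2.0, 'c': 1.0},
    's4': {'a': 2.5, 'b': 3.0, 'c': 1.0},
}
phenotype = {'s1': 'young', 's2': 'young', 's3': 'old', 's4': 'old'}
pairs = list(combinations(['a', 'b', 'c'], 2))  # (a, b), (a, c), (b, c)

# 1) Network ranking: 1 if X_i < X_j, otherwise 0, for every pair in every sample
ranking = {n: [int(x[i] < x[j]) for i, j in pairs] for n, x in values.items()}

# 2) Rank template: majority vote within a phenotype (true rate of exactly 0.5 -> 0)
def template_for(k):
    members = [n for n in ranking if phenotype[n] == k]
    rates = [sum(ranking[n][p] for n in members) / len(members)
             for p in range(len(pairs))]
    return [int(rate > 0.5) for rate in rates]

templates = {k: template_for(k) for k in sorted(set(phenotype.values()))}

# 3) Rank matching score (RMS): fraction of pairs in a sample that match a template
def rms(n, k_template):
    matches = [int(r == t) for r, t in zip(ranking[n], templates[k_template])]
    return sum(matches) / len(pairs)

# 4) Rank conservation index (RCI): mean RMS of a phenotype's samples under a template
def rci(k_samples, k_template):
    members = [n for n in ranking if phenotype[n] == k_samples]
    return sum(rms(n, k_template) for n in members) / len(members)

print(templates)
print(rci('young', 'young'), rci('young', 'old'))
```

> With these toy values, the young samples have a mean RMS of 5/6 under their own template and 1/6 under the old template.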

In [None]:
def network_ranking(DF, networkS):
    # This function calculates the pairwise ordering of network genes (i.e., network ranking).
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    ## from itertools import combinations: confirmed with Python 3.7.6 and 3.9.6
    # Input:
    ## DF: pd.DataFrame containing expression values (X_gn) with gene (g; 1-G) indices and sample (n; 1-N) columns
    ## networkS: pd.Series containing genes (g; 1-G) with network (m; 1-M) indices (i.e., 1-on-1 long-format)
    # Output:
    ## pd.DataFrame containing binary values of the network ranking comparison for each sample
    ## -> row: comparison_id (indicated by network (m) and ordering (g_i < g_j) columns)
    ## -> column: sample (n) (with exception of the NetworkID and Ordering columns)
    # Note:
    ## If items in network m and gene g contain ' : ' or ' < ', this code would produce error or unexpected output.
    ## If an item in sample n is 'NetworkID' or 'Ordering', this code would produce error or unexpected output.
    
    #Calculate binary values of the network ranking comparison for each sample
    rankDF = pd.DataFrame()
    sampleL = DF.columns.tolist()
    networkL = networkS.index.unique()
    for n in sampleL:
        rankDF_n = pd.DataFrame()
        for m in networkL:
            #Pairs of genes (g_i, g_j) in the network m
            networkS_m = networkS.loc[m]
            pairL_m = list(combinations(range(0, len(networkS_m)), 2))
            pairL_m_i, pairL_m_j = [[pair[x] for pair in pairL_m] for x in (0,1)]
            rankDF_m = pd.DataFrame({'g_i':networkS_m.iloc[pairL_m_i],
                                     'g_j':networkS_m.iloc[pairL_m_j]})#Hold network (m; 1-M) index
            #Compare the expression values (X_gn) between pairwise genes (g_i vs. g_j)
            tempL = []
            for pair_i in range(0, len(rankDF_m)):
                g_i = rankDF_m.iloc[pair_i, 0]
                g_j = rankDF_m.iloc[pair_i, 1]
                X_i = DF.loc[g_i, n]
                X_j = DF.loc[g_j, n]
                #If X_i < X_j is true, add 1; otherwise (X_i >= X_j), add 0
                if X_i < X_j:
                    tempL.append(1)
                else:
                    tempL.append(0)
            rankDF_m['X_i<X_j'] = tempL
            #Update the network ranking dataframe of sample n
            rankDF_m.index.set_names('NetworkID', inplace=True)#Set/reset index name
            rankDF_m = rankDF_m.reset_index()
            rankDF_n = pd.concat([rankDF_n, rankDF_m], axis=0)
        #Update the network ranking dataframe of all samples
        rankDF_n['Sample'] = n
        rankDF = pd.concat([rankDF, rankDF_n], axis=0)
    ##Prepare dummy index and clean dataframe
    rankDF['ComparisonID'] = rankDF['NetworkID'] + ' : ' + rankDF['g_i'] + ' < ' + rankDF['g_j']
    rankDF = rankDF.pivot(index='ComparisonID', columns='Sample', values='X_i<X_j')#Sorted by index during this
    rankDF = rankDF.reset_index()#Index becomes row number here
    tempDF = rankDF['ComparisonID'].str.split(pat=' : ', expand=True)
    tempDF = tempDF.rename(columns={0:'NetworkID', 1:'Ordering'})
    rankDF = pd.concat([tempDF, rankDF], axis=1)#Dropping columns name 'Sample' during this
    rankDF = rankDF.drop(columns='ComparisonID')
    return rankDF

def rank_template(rankDF, phenotypeS):
    # This function generates the rank template (T) presenting the expected network ranking in a phenotype.
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    # Input:
    ## rankDF: pd.DataFrame obtained from the above network_ranking() function
    ## phenotypeS: pd.Series containing phenotypes (k; 1-K) with sample (n; 1-N) indices (i.e., 1-on-1 long-format)
    # Output:
    ## pd.DataFrame containing the expected binary values of network ranking comparison for each phenotype (T_mk)
    ## -> row: comparison_id (indicated by network (m) and ordering (g_i < g_j) columns)
    ## -> column: phenotype (k) (with exception of the NetworkID and Ordering columns)
    # Note:
    ## True rate = 0.5 is assigned to 0 in this code.
    ## If an item in phenotype k is 'NetworkID' or 'Ordering', this code would produce error or unexpected output.
    
    #Calculate the expected binary values of network ranking comparison for each phenotype (T_mk)
    templateDF = rankDF[['NetworkID', 'Ordering']].copy()#Explicit copy because columns are added below
    phenotypeL = phenotypeS.unique().tolist()
    for k in phenotypeL:
        sampleL_k = phenotypeS.loc[phenotypeS==k].index.tolist()
        tempDF = rankDF[sampleL_k]
        tempS = tempDF.mean(axis=1)#True (=1) rate
        tempS = (tempS>0.5).astype('int64')#If true rate > 0.5, 1; otherwise (<= 0.5), 0
        templateDF[k] = tempS
    return templateDF

def rank_matching_score(rankDF, templateDF):
    # This function calculates the rank matching score (R) of a sample.
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    # Input:
    ## rankDF: pd.DataFrame obtained from the above network_ranking() function
    ## templateDF: pd.DataFrame obtained from the above rank_template() function
    # Output:
    ## pd.DataFrame containing the rates of gene pairs matching to a rank template in a sample (R_mkn)
    ## -> row: rank template (T_mk) (indicated by network (m) and template phenotype (k) columns)
    ## -> column: sample (n) (with exception of the NetworkID and Template columns)
    # Note:
    ## True rate = 0.5 was assigned to 0 in the above rank_template() function.
    ## -> 'Match (1)' and 'Mismatch (0)' are evenly assigned to samples in the phenotype.
    ##    (i.e., 'Match' for (X_i < X_j) = 0 and 'Mismatch' for (X_i < X_j) = 1 in the tie case)
    ## If items in network m and phenotype k contain ' : ', this code would produce error or unexpected output.
    
    #Calculate the rates of gene pairs matching to a rank template in a sample (R_mkn)
    scoreDF = pd.DataFrame()
    sampleL = rankDF.drop(columns=['NetworkID', 'Ordering']).columns.tolist()
    phenotypeL = templateDF.drop(columns=['NetworkID', 'Ordering']).columns.tolist()
    for n in sampleL:
        scoreDF_n = pd.DataFrame()
        for k_template in phenotypeL:
            tempDF = rankDF[['NetworkID', 'Ordering']].copy()#Explicit copy because a column is added below
            tempS = (rankDF[n]==templateDF[k_template]).astype('int64')#If matching, 1; otherwise, 0
            tempDF['Match'] = tempS
            #Calculate the rank matching score
            scoreDF_k = tempDF.groupby(by='NetworkID', as_index=False, sort=False).mean()
            #Update the rank matching score dataframe of sample n
            scoreDF_k['Template'] = k_template
            scoreDF_k = scoreDF_k[['NetworkID', 'Template', 'Match']]
            scoreDF_n = pd.concat([scoreDF_n, scoreDF_k], axis=0)
        #Update the rank matching score dataframe of all samples
        scoreDF_n['Sample'] = n
        scoreDF = pd.concat([scoreDF, scoreDF_n], axis=0)
    ##Prepare dummy index and clean dataframe
    scoreDF['RMSmkID'] = scoreDF['NetworkID'].str.cat(scoreDF['Template'], sep=' : ')
    scoreDF = scoreDF.pivot(index='RMSmkID', columns='Sample', values='Match')#Sorted by index during this
    scoreDF = scoreDF.reset_index()#Index becomes row number here
    tempDF = scoreDF['RMSmkID'].str.split(pat=' : ', expand=True)
    tempDF = tempDF.rename(columns={0:'NetworkID', 1:'Template'})
    scoreDF = pd.concat([tempDF, scoreDF], axis=1)#Dropping columns name 'Sample' during this
    scoreDF = scoreDF.drop(columns='RMSmkID')
    return scoreDF

def rank_conservation_index(scoreDF, phenotypeS):
    # This function calculates the rank conservation index (muR) of a phenotype.
    ## Ref. Eddy, J. A. et al. PLoS Comput. Biol. 2010 (Figure 1 at a glance)
    # Requirements:
    ## import numpy as np: confirmed with versions 1.17.5 and 1.21.1
    ## import pandas as pd: confirmed with versions 0.25.3 and 1.3.1
    # Input:
    ## scoreDF: pd.DataFrame obtained from the above rank_matching_score() function
    ## phenotypeS: pd.Series containing phenotypes (k; 1-K) with sample (n; 1-N) indices (i.e., 1-on-1 long-format)
    # Output:
    ## pd.DataFrame containing the mean values of rank matching scores (RMSs) in a phenotype (muR_mkk)
    ## -> row: rank template (T_mk) (indicated by network (m) and template phenotype (k) columns)
    ## -> column: phenotype (k) (with exception of the NetworkID and Template columns)
    # Note:
    ## The term rank conservation index (RCI) is used in a broad sense here; under a strict interpretation,
    ## RCI refers to the mean of RMSs only when the template phenotype is the same as the sample phenotype.
    ## If an item in phenotype k is 'NetworkID' or 'Template', this code would produce error or unexpected output.
    
    #Calculate the mean values of RMSs in a phenotype (muR_mkk)
    conservationDF = scoreDF[['NetworkID', 'Template']].copy()#Explicit copy because columns are added below
    phenotypeL = phenotypeS.unique().tolist()
    for k in phenotypeL:
        sampleL_k = phenotypeS.loc[phenotypeS==k].index.tolist()
        tempS = scoreDF[sampleL_k].mean(axis=1)
        conservationDF[k] = tempS
    return conservationDF

## 1. Prepare dataset and metadata

### 1-1. Analyte data

In [None]:
#Import analyte data
fileDir = './ExportData/'
ipynbName = '220603_Arivale-DIRAC-metabolomics-ver2_Preprocessing_'
fileName = 'metDF-robustZscore.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('BiochemicalName')
display(tempDF)

analyteDF = tempDF

### 1-2. Module–analyte metadata

In [None]:
#Import module-analyte metadata
fileDir = './ExportData/'
ipynbName = '220603_Arivale-DIRAC-metabolomics-ver2_Preprocessing_'
fileName = 'module-metadata_WGCNA-min-n4-cov50.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('ModuleID')
print(' - Unique analytes with module:', len(tempDF['BiochemicalName'].unique()))
print(' - Unique modules with analytes:', len(tempDF.index.unique()))

#Prepare moduleS
moduleS = tempDF['BiochemicalName']
display(moduleS)

#Retrieve module metadata
tempDF = tempDF[['ModuleName', 'nAnalytes', 'nBackgrounds', 'Coverage']]
moduleDF = tempDF.reset_index().drop_duplicates(keep='first').set_index('ModuleID')
display(moduleDF)
display(moduleDF.describe(include='all'))

> –> Add the mapped analytes to module metadata. (In metabolomics, analyte label is practically used as analyte ID, but the original chemical ID is named as analyte ID in the module metadata.)  

In [None]:
#Import analyte metadata
fileDir = './ExportData/'
ipynbName = '220603_Arivale-DIRAC-metabolomics-ver2_Preprocessing_'
fileName = 'analyte-metadata_Arivale.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'ChemicalID': str})
tempDF = pd.merge(moduleS.reset_index(), tempDF, on='BiochemicalName', how='left').set_index('ModuleID')
display(tempDF)
print(' - Unique analytes:', len(tempDF['ChemicalID'].unique()))
print(' - Unique labels:', len(tempDF['BiochemicalName'].unique()))

#Concatenate labels
t_start = time.time()
moduleDF['MappedAnalyteIDs'] = ''#Initialize
moduleDF['MappedAnalyteLabels'] = ''#Initialize
for module in moduleDF.index.tolist():
    tempS = tempDF['BiochemicalName'].loc[tempDF.index.isin([module])]#Retrieve as pd.Series
    label = tempS.str.cat(sep=';')
    moduleDF.loc[module, 'MappedAnalyteLabels'] = label
    tempS = tempDF['ChemicalID'].loc[tempDF.index.isin([module])]#Retrieve as pd.Series
    label = tempS.str.cat(sep=';')
    moduleDF.loc[module, 'MappedAnalyteIDs'] = label
t_elapsed = time.time() - t_start
print('Elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Clean
moduleDF['ModuleType'] = 'Data-driven'
moduleDF['Source'] = 'WGCNA'
moduleDF = moduleDF[['ModuleName', 'ModuleType', 'MappedAnalyteIDs', 'MappedAnalyteLabels',
                     'nAnalytes', 'nBackgrounds', 'Coverage', 'Source']]
moduleDF = moduleDF.sort_index(ascending=True)

#Save
fileDir = './ExportData/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = 'module-metadata.tsv'
moduleDF.to_csv(fileDir+ipynbName+fileName, index=True, sep='\t')

display(moduleDF)

### 1-3. Sample metadata

In [None]:
#Import participant metadata
fileDir = './ExportData/'
ipynbName = '220603_Arivale-DIRAC-metabolomics-ver2_DataCleaning_'
fileName = 'participant-metadata.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t', dtype={'public_client_id': str})
tempDF = tempDF.rename(columns={'public_client_id':'SampleID'})
tempDF = tempDF.loc[tempDF['SampleID'].isin(analyteDF.columns)]
tempDF = tempDF.set_index('SampleID')

#Prepare phenotypeS
tempDF['Phenotype'] = tempDF['CA10Group']
display(tempDF)
display(tempDF['Phenotype'].value_counts())

sampleDF = tempDF

## 2. Perform DIRAC using rank consensus stratified with sex and CAge

> When applied to the whole dataset at once, the DIRAC code often stops running due to memory limits (e.g., with transcriptomics).  
> –> Divide moduleS into subsets while considering the number of comparisons, and compute DIRAC in parallel.  
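
> A module with n analytes yields n(n-1)/2 pairwise comparisons per sample, so the cost grows quadratically with module size. The grouping idea can be sketched in plain Python as follows; the function name `split_modules`, the module IDs, and the sizes are hypothetical, and the logic mirrors the greedy rule used in this section (modules above the cutoff run alone; smaller modules are pooled, smallest first, until the pooled analyte count exceeds the cutoff).

```python
def split_modules(n_analytes, cutoff=50):
    """Greedily group modules (smallest first) by cumulative analyte count."""
    sublists, current, count = [], [], 0
    for module, n in sorted(n_analytes.items(), key=lambda item: item[1]):
        if n > cutoff:
            # A large module forms its own subset
            sublists.append([module])
            continue
        current.append(module)
        count += n
        if count > cutoff:
            # Close the pooled subset once it exceeds the cutoff
            sublists.append(current)
            current, count = [], 0
    if current:
        # Remaining small modules that never exceeded the cutoff
        sublists.append(current)
    return sublists

# Hypothetical module sizes (number of analytes per module)
sizes = {'M1': 4, 'M2': 10, 'M3': 30, 'M4': 45, 'M5': 48, 'M6': 120}
subsets = split_modules(sizes, cutoff=50)

# Number of pairwise comparisons per module and sample: n * (n - 1) / 2
n_pairs = {module: n * (n - 1) // 2 for module, n in sizes.items()}

print(subsets)
print(n_pairs)
```

> Note that a pooled subset can exceed the cutoff (here, M1 to M4 together hold 89 analytes) because the check happens after a module is added.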

In [None]:
#Divide moduleS
cutoff = 50#The maximum number of analytes per module
tempDF = moduleDF.sort_values(by='nAnalytes', ascending=True)
module_subL = []
tempL = []#Initialize
count = 0#Initialize
for module in tempDF.index.tolist():
    nanalytes = tempDF.loc[module, 'nAnalytes']
    if nanalytes>cutoff:
        module_subL.append([module])
    else:
        tempL.append(module)
        count += nanalytes
        if count>cutoff:
            module_subL.append(tempL)
            tempL = []#Initialize
            count = 0#Initialize
if len(tempL)>0:#The last one but still count<=cutoff
    module_subL.append(tempL)
nSub = len(module_subL)
print('nSublists: ', nSub)
print('nModules per sublist:', [len(sublist) for sublist in module_subL])

#Check nAnalytes distribution
tempS = pd.Series(name='Subset', dtype='object')#Explicit dtype for the empty Series
for list_i in range(nSub):
    tempL = moduleS.loc[module_subL[list_i]].index.unique().tolist()
    tempS1 = pd.Series(np.repeat('Subset '+str(list_i+1).zfill(3), len(tempL)), index=tempL, name='Subset')
    tempS = pd.concat([tempS, tempS1], axis=0)
tempS.index.set_names('ModuleID', inplace=True)
tempDF = pd.merge(moduleDF, tempS, left_index=True, right_index=True, how='left')
tempDF = tempDF.sort_values(by='nAnalytes', ascending=False)
display(tempDF)
tempDF = tempDF.groupby('Subset').agg({'nAnalytes':[len, sum, max, np.median]})
tempDF = tempDF.sort_values(by=('nAnalytes', 'sum'), ascending=False)
print('Subset summary:')
display(tempDF.describe())
print(' -> Check subset which will need high computational cost:')
display(tempDF.loc[tempDF[('nAnalytes', 'max')]>cutoff])

In [None]:
nprocessors = 7
fileDir = './ExportData/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'

#Wrap as a single function
def parallel_dirac(list_i):
    tempS = moduleS.loc[module_subL[list_i]]
    
    #Calculate the pairwise ordering of network genes (i.e., network ranking)
    t_start1 = time.time()
    rankingDF = network_ranking(analyteDF, tempS)
    t_elapsed1 = time.time() - t_start1
    
    #Generate the rank template (T) presenting the expected network ranking in a phenotype
    t_start2 = time.time()
    templateDF = rank_template(rankingDF, sampleDF['Phenotype'])
    t_elapsed2 = time.time() - t_start2
    
    #Calculate the rank matching score (R) of a sample
    t_start3 = time.time()
    rmsDF = rank_matching_score(rankingDF, templateDF)
    t_elapsed3 = time.time() - t_start3
    
    #Calculate the rank conservation index (muR) of a phenotype
    t_start4 = time.time()
    rciDF = rank_conservation_index(rmsDF, sampleDF['Phenotype'])
    t_elapsed4 = time.time() - t_start4
    
    #Save
    fileName = 'NetworkRanking-'+str(list_i+1).zfill(3)+'.tsv'
    rankingDF.to_csv(fileDir+ipynbName+fileName, index=False, sep='\t')
    fileName = 'RankTemplate-'+str(list_i+1).zfill(3)+'.tsv'
    templateDF.to_csv(fileDir+ipynbName+fileName, index=False, sep='\t')
    fileName = 'RankMatchingScore-'+str(list_i+1).zfill(3)+'.tsv'
    rmsDF.to_csv(fileDir+ipynbName+fileName, index=False, sep='\t')
    fileName = 'RankConservationIndex-'+str(list_i+1).zfill(3)+'.tsv'
    rciDF.to_csv(fileDir+ipynbName+fileName, index=False, sep='\t')
    
    #Check results
    print('Subset '+str(list_i+1).zfill(3)+':',
          len(tempS.index.unique()), 'modules,', len(tempS.unique()), 'analytes')
    print('• Network ranking dataframe:')
    print('  - DF shape:', rankingDF.shape)
    print('  - Elapsed time:', round(t_elapsed1//60), 'min', round(t_elapsed1%60, 1), 'sec')
    print('• Rank template dataframe:')
    print('  - DF shape:', templateDF.shape)
    print('  - Elapsed time:', round(t_elapsed2//60), 'min', round(t_elapsed2%60, 1), 'sec')
    print('• Rank matching score dataframe:')
    print('  - DF shape:', rmsDF.shape)
    print('  - Elapsed time:', round(t_elapsed3//60), 'min', round(t_elapsed3%60, 1), 'sec')
    print('• Rank conservation index dataframe:')
    print('  - DF shape:', rciDF.shape)
    print('  - Elapsed time:', round(t_elapsed4//60), 'min', round(t_elapsed4%60, 1), 'sec')
    print('')

#Parallel computing
if __name__=='__main__':
    t_start = time.time()
    with Pool(nprocessors) as p:#Release the worker processes after the blocking map() returns
        p.map(parallel_dirac, range(nSub))
    t_finish = time.time()

In [None]:
#Record as reference
print(nSub, 'sublists with', nprocessors, 'processors')
print(' - Start:', time.ctime(t_start))
print(' - Finish:', time.ctime(t_finish))
t_elapsed = (t_finish - t_start)
print(' - Elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')
t_elapsed = (t_finish - t_start) * nprocessors
print(' - Total (approximate) elapsed time:', round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

In [None]:
#Combine each result
rmsDF = pd.DataFrame()
rciDF = pd.DataFrame()
for list_i in range(nSub):
    fileName = 'RankMatchingScore-'+str(list_i+1).zfill(3)+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    tempDF = tempDF.rename(columns={'NetworkID':'ModuleID'})
    rmsDF = pd.concat([rmsDF, tempDF], axis=0, ignore_index=True)
    
    fileName = 'RankConservationIndex-'+str(list_i+1).zfill(3)+'.tsv'
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    tempDF = tempDF.rename(columns={'NetworkID':'ModuleID'})
    rciDF = pd.concat([rciDF, tempDF], axis=0, ignore_index=True)

print('• Rank matching score dataframe:')
display(rmsDF)
print('')
print('• Rank conservation index dataframe:')
display(rciDF)

## 3. Rank conservation index: general pattern across group

### 3-1. Extract RCI (the mean of RMSs under the own phenotype consensus)

In [None]:
#Extract RCI whose template phenotype corresponds to the own phenotype
phenotypeL = rciDF.drop(columns=['ModuleID', 'Template']).columns.tolist()
rciDF_kk = pd.DataFrame(index=pd.Index(rciDF['ModuleID'].unique(), name='ModuleID'))
tempDF = rciDF.set_index('ModuleID')
for k in phenotypeL:
    tempS = tempDF[k].loc[tempDF['Template']==k]
    rciDF_kk = pd.merge(rciDF_kk, tempS, left_index=True, right_index=True, how='left')

#Order and re-label
nQs = {'F':10, 'M':10}
tempL = [sex+'-Q'+str(i+1) for sex in nQs.keys() for i in range(nQs[sex])]
rciDF_kk = rciDF_kk[tempL]
display(rciDF_kk)
display(rciDF_kk.describe())

### 3-2. Visualization: boxplot

In [None]:
#Prepare plot DF
tempDF = rciDF_kk.reset_index().melt(var_name='Group', value_name='RCI', id_vars='ModuleID')
tempDF1 = tempDF['Group'].str.split(pat='-', expand=True)
tempDF1 = tempDF1.rename(columns={0:'Sex', 1:'Label'})
tempDF = pd.concat([tempDF, tempDF1], axis=1)

#Label and color
tempD1 = {'F':'Female', 'M':'Male'}
tempD2 = {'F':plt.get_cmap('RdBu')(0.0), 'M':plt.get_cmap('RdBu')(1.0)}

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4.5), sharex=False, sharey=True)
ymax = 0.65
ymin = 0.55
yinter = 0.05
ymargin_t = 0.03
ymargin_b = 0.02
xoff = 0.005
yoff = 0.005
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD1.keys())[ax_i]
    tempL = ['Q'+str(i+1) for i in range(nQs[sex])]
    tempDF1 = tempDF.loc[tempDF['Sex']==sex]
    #Boxplot
    sns.boxplot(data=tempDF1, y='RCI', x='Label', order=tempL, palette='PRGn_r', dodge=False,
                showfliers=False, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=False, ax=ax)
    sns.stripplot(data=tempDF1, y='RCI', x='Label', order=tempL, palette='PRGn_r', dodge=False,
                  jitter=0.15, size=5, edgecolor='k', linewidth=1.5, marker='o', alpha=0.5, ax=ax)
    plt.setp(ax, xlim=(-0.5, len(tempL)-0.5))
    plt.setp(ax.get_xticklabels(), rotation=70, horizontalalignment='right',
             verticalalignment='center', rotation_mode='anchor')
    #Facet
    ax.set_title(tempD1[sex], {'fontsize':'medium', 'color':'w'})
    rect = plt.Rectangle((0+xoff, 1+yoff), 1-xoff, 0.1,#Manual adjustment
                         transform=ax.transAxes, facecolor=tempD2[sex],
                         clip_on=False, linewidth=0, zorder=0)
    ax.add_patch(rect)
    if ax_i==0:
        plt.setp(ax, xlabel='', ylabel='Module RCI')
        ax0_pos = ax.get_position().bounds#Save position to generate legend later
    else:
        plt.setp(ax.get_yticklabels(), visible=False)
        plt.setp(ax, xlabel='', ylabel='')
        ax1_pos = ax.get_position().bounds#Save position to generate legend later
sns.despine()
plt.setp(axes, ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax+yinter/10, yinter))
fig.text(x=(ax0_pos[0]+(ax1_pos[0]+ax1_pos[2]*1.25))/2, y=ax0_pos[1]-ax0_pos[3]*0.1,#Minor manual adjustment
         s='Chronological age quantile-based group', fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
fig.tight_layout()
##Save
fileDir = './ExportFigures/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = 'RCI-boxplot.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

## 4. Rank matching score under a fixed consensus: general pattern across group

> * Test the relationship between the mean values of RMSs and the CAge-stratified groups using Spearman's correlation per sex.  
>
> Because the stratified groups are ordered categorical variables without an absolute control group, Spearman's correlation is used rather than repeated t-tests or Pearson's correlation. The P-values are adjusted across the sexes (within each consensus).  
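
> For reference, the Holm–Bonferroni step-down adjustment can be sketched in plain Python as follows. The function name `holm_adjust` and the example P-values are hypothetical; this is an illustrative re-implementation, not the `statsmodels` call used in this section.

```python
def holm_adjust(pvals):
    """Return Holm-Bonferroni adjusted P-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # Indices from smallest to largest P
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest P-value by (m - k + 1), cap at 1,
        # and enforce monotonicity with the running maximum
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical nominal P-values for the two sexes under one consensus
adj = holm_adjust([0.04, 0.01])
print(adj)
```

> With only two tests per consensus, the smaller P-value is doubled and the larger one is left unchanged unless the doubled value exceeds it.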

### 4-1. Spearman's correlation

In [None]:
#Prepare DF for the mean of RMSs
tempDF = rciDF.melt(var_name='Group', value_name='RMSmean', id_vars=['ModuleID', 'Template'])
tempDF1 = tempDF['Group'].str.split(pat='-', expand=True)
tempDF1 = tempDF1.rename(columns={0:'Sex', 1:'CAgroup'})
tempDF1['CAgroup_i'] = tempDF1['CAgroup'].str.replace('Q', '').astype('int64')
tempDF = pd.concat([tempDF, tempDF1], axis=1)

#Correlation analysis
tempD1 = {'Q1':'Q1', 'Q10':'Q10'}
tempD2 = {'F':'Female', 'M':'Male'}
tempDF1 = pd.DataFrame()
for consensus in tempD1.keys():
    tempDF2 = pd.DataFrame(columns=['Consensus', 'Sex',
                                    'N', 'nGroups', 'nModules', 'Rho', 'Pval'])
    count = 0
    for sex in tempD2.keys():
        tempDF3 = tempDF.loc[tempDF['Sex']==sex]
        template = sex+'-'+consensus
        tempDF3 = tempDF3.loc[tempDF3['Template']==template]
        #Calculate Spearman's correlation
        tempS1 = tempDF3['CAgroup_i']
        tempS2 = tempDF3['RMSmean']
        rho, pval = stats.spearmanr(tempS1, tempS2)
        #Clean
        size = len(tempDF3)
        ngroups = len(tempDF3['CAgroup'].unique())
        nmodules = len(tempDF3['ModuleID'].unique())
        tempDF2.loc[str(count)] = [tempD1[consensus], tempD2[sex],
                                   size, ngroups, nmodules, rho, pval]
        count += 1
    #P-value adjustment by using Holm–Bonferroni method
    tempDF2['AdjPval'] = multi.multipletests(tempDF2['Pval'], alpha=0.05, method='holm',
                                             is_sorted=False, returnsorted=False)[1]
    #Merge
    tempDF1 = pd.concat([tempDF1, tempDF2], axis=0)
display(tempDF1)#For easier checking

#Clean
tempDF = pd.DataFrame(index=pd.Index(tempD1.values(), name='Consensus'))
for sex in tempD2.values():
    tempDF2 = tempDF1.loc[tempDF1['Sex']==sex]
    tempDF2 = tempDF2.drop(columns=['Sex', 'nGroups', 'nModules'])
    tempDF2 = tempDF2.set_index('Consensus')
    tempDF2.columns = sex+'_'+tempDF2.columns
    tempDF = pd.merge(tempDF, tempDF2, left_index=True, right_index=True, how='left')
display(tempDF)

#Save
fileDir = './ExportData/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = 'RMSmean-correlation.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, index=True, sep='\t')

statDF = tempDF

### 4-2. Visualization: boxplot

In [None]:
#Prepare DF for plot
tempDF = rciDF.melt(var_name='Group', value_name='RMSmean', id_vars=['ModuleID', 'Template'])
tempDF1 = tempDF['Group'].str.split(pat='-', expand=True)
tempDF1 = tempDF1.rename(columns={0:'Sex', 1:'Label'})
tempDF = pd.concat([tempDF, tempDF1], axis=1)

#Label and color
tempD1 = {'F':'Female', 'M':'Male'}
tempD2 = {'F':plt.get_cmap('RdBu')(0.0), 'M':plt.get_cmap('RdBu')(1.0)}
tempD3 = {'Q1':plt.get_cmap('PRGn_r')(0.0), 'Q10':plt.get_cmap('PRGn_r')(1.0)}

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 7.5), sharex=False, sharey=True)
ymax = 0.7
ymin = 0.4
yinter = 0.1
ymargin_t = 0.025
ymargin_b = 0.025
xoff = 0.005
yoff = 0.005
for ax_i, ax in enumerate(axes.flat):
    #Select sex and template
    if ax_i%2==0:
        sex = list(tempD2.keys())[0]
    else:
        sex = list(tempD2.keys())[1]
    if ax_i//2==0:
        consensus = list(tempD3.keys())[0]
    else:
        consensus = list(tempD3.keys())[1]
    template = sex+'-'+consensus
    tempL = ['Q'+str(i+1) for i in range(nQs[sex])]
    tempDF1 = tempDF.loc[tempDF['Sex']==sex]
    tempDF1 = tempDF1.loc[tempDF1['Template']==template]
    #Boxplot
    sns.boxplot(data=tempDF1, y='RMSmean', x='Label', order=tempL, palette='PRGn_r', dodge=False,
                showfliers=False, flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4},
                showcaps=True, notch=False, ax=ax)
    sns.stripplot(data=tempDF1, y='RMSmean', x='Label', order=tempL, palette='PRGn_r', dodge=False,
                  jitter=0.15, size=5, edgecolor='k', linewidth=1.5, marker='o', alpha=0.5, ax=ax)
    if ax_i//2==0:
        plt.setp(ax.get_xticklabels(), visible=False)
        #Top facet
        ax.set_title(tempD1[sex], {'fontsize':'medium', 'color':'w'})
        rect = plt.Rectangle((0+xoff, 1+yoff), 1-xoff, 0.1,#Manual adjustment
                             transform=ax.transAxes, facecolor=tempD2[sex],
                             clip_on=False, linewidth=0, zorder=0)
        ax.add_patch(rect)
    else:
        plt.setp(ax.get_xticklabels(), rotation=70, horizontalalignment='right',
                 verticalalignment='center', rotation_mode='anchor')
    if ax_i%2==1:
        plt.setp(ax.get_yticklabels(), visible=False)
        #Right facet
        at = AnchoredText('Consensus: '+consensus, loc='center left',
                          bbox_to_anchor=(0.95, 0.51), bbox_transform=ax.transAxes,#Minor manual adjustment
                          frameon=False, prop={'size':'medium', 'color':'w', 'rotation':'vertical'})
        ax.add_artist(at)
        rect = plt.Rectangle((1+xoff, 0+yoff), 0.1, 1-yoff,#Manual adjustment
                             transform=ax.transAxes, facecolor=tempD3[consensus],
                             clip_on=False, linewidth=0, zorder=0)
        ax.add_patch(rect)
    #Add annotation
    rho = statDF.loc[consensus, tempD1[sex]+'_Rho']
    rho_text = Decimal(str(rho)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
    rho_text = str(rho_text)
    pval = statDF.loc[consensus, tempD1[sex]+'_AdjPval']
    pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
    if pval_text=='1.000E+0':
        pval_text = '1.0'
    else:
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = r'${0} \times 10^{{-{1}}}$'.format(significand, exponent)
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    if ax_i//2==0:
        text_ycoord = 0.99
        text_valign = 'top'
    else:
        text_ycoord = 0.01
        text_valign = 'bottom'
    text = 'Spearman\'s '+r'$\rho$ = '+rho_text+'\nAdjusted '+r'$P$ = '+pval_text
    ax.annotate(text, xy=(0.99, text_ycoord), xycoords='axes fraction',
                horizontalalignment='right', verticalalignment=text_valign,
                multialignment='left', fontsize='x-small', color='k')
    #Save position to generate legend later
    if ax_i ==0:
        ax0_pos = ax.get_position().bounds
    elif ax_i==2:
        ax2_pos = ax.get_position().bounds
    elif ax_i==3:
        ax3_pos = ax.get_position().bounds
sns.despine()
plt.setp(axes, ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax+yinter/10, yinter))
plt.setp(axes, xlim=(-0.5, len(tempL)-0.5))#Otherwise, ax1 and ax3 are extended...
plt.setp(axes, xlabel='', ylabel='')
fig.text(x=(ax2_pos[0]+(ax3_pos[0]+ax3_pos[2]*0.95))/2, y=ax2_pos[1]-ax0_pos[3]*0.275,#Minor manual adjustment
         s='Chronological age quantile-based group', fontsize='medium',
         verticalalignment='top', horizontalalignment='center', rotation='horizontal')
fig.text(x=ax0_pos[0]-ax0_pos[2]*0.275, y=(ax2_pos[1]+(ax0_pos[1]+ax0_pos[3]*1.15))/2,#Minor manual adjustment
         s='Module mean of sample RMSs', fontsize='medium',
         verticalalignment='center', horizontalalignment='right', rotation='vertical')
fig.tight_layout()
fileDir = './ExportFigures/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = 'RMSmean-boxplot.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

## 5. Rank matching score under a fixed consensus: module association with CAge

> * Test the relationship between RMS and CAge using regression analysis per sex for each module.  
>
> Each model includes RMS as the dependent variable, CAge as a continuous independent variable, and BMI and ancestry PCs as covariates. Instead of adding sex as a covariate, sex-stratified models are generated because the rank consensus differs between sexes. Although debatable, the P-value adjustment is performed across all models (sex x modules) under the assumption that modules are independent; this would be more conservative, and less likely to raise referees' eyebrows, than using a nominal P-value cutoff. For visualization, the RMS value adjusted for the covariates is used.  
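
The pooled adjustment described above can be illustrated with a minimal sketch. It is not the notebook's implementation (the cells below call `multipletests` with `method='fdr_bh'`); `bh_adjust`, the module names, and the nominal P-values are hypothetical and only show that the Benjamini–Hochberg step runs once over all sex x module models rather than once per sex.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical nominal P-values: three modules in each sex-stratified model set
nominal = {
    ('Female', 'M1'): 0.001, ('Female', 'M2'): 0.020, ('Female', 'M3'): 0.300,
    ('Male', 'M1'): 0.004, ('Male', 'M2'): 0.040, ('Male', 'M3'): 0.900,
}

# Pool all six models before adjusting, then map back to (sex, module)
keys = list(nominal)
pooled = dict(zip(keys, bh_adjust([nominal[k] for k in keys])))
for key in keys:
    print(key, round(pooled[key], 4))
```

With these made-up inputs, three of the six models pass the 0.05 cutoff after pooling.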

### 5-1. RMS under Q1 consensus

In [None]:
consensus = 'Q1'

#### 5-1-1. Extract RMS under the fixed consensus

In [None]:
#Extract RMS whose template phenotype corresponds to a sex-matched target consensus
rmsDF_kn = pd.DataFrame(index=pd.Index(rmsDF['ModuleID'].unique(), name='ModuleID'))
tempDF = rmsDF.set_index('ModuleID')
for sex in ['F', 'M']:
    tempL = sampleDF.loc[sampleDF['Sex']==sex].index.tolist()
    template = sex+'-'+consensus
    tempDF1 = tempDF[tempL].loc[tempDF['Template']==template]
    rmsDF_kn = pd.merge(rmsDF_kn, tempDF1, left_index=True, right_index=True, how='left')

display(rmsDF_kn)

#### 5-1-2. OLS regression

In [None]:
#Prepare DF
tempDF = rmsDF_kn.reset_index().melt(var_name='SampleID', value_name='RMS', id_vars='ModuleID')
tempDF = pd.merge(tempDF, sampleDF.reset_index(), on='SampleID', how='left')

#Regression analysis
tempD = {'F':'Female', 'M':'Male'}
tempL = ['RMS', 'BaseCAge', 'Sex', 'log_BaseBMI', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
variable = 'BaseCAge'
formula1 = 'RMS ~ Variable'#Univariate model
formula2 = 'RMS ~ Variable + log_BaseBMI + PC1 + PC2 + PC3 + PC4 + PC5'#Full model
statDF = pd.DataFrame()
for sex in tempD.keys():
    t_start = time.time()
    tempL1 = []#For N
    tempL2 = []#For R2 in univariate model
    tempL3 = []#For R2
    tempL4 = []#For beta-coefficient
    tempL5 = []#For SE of beta-coefficient
    tempL6 = []#For t-statistic
    tempL7 = []#For P-value
    for module in moduleDF.index.tolist():
        tempDF1 = tempDF.loc[(tempDF['Sex']==sex)&(tempDF['ModuleID']==module)]
        tempDF1 = tempDF1.set_index('SampleID')
        tempDF1 = tempDF1[tempL].rename(columns={variable:'Variable'})
        
        #Standardization of continuous variables
        ##Select continuous variables
        tempDF2 = tempDF1.loc[:, tempDF1.dtypes=='float64']
        ##Z-score transformation
        scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
        tempA = scaler.fit_transform(tempDF2)#Column direction
        tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
        ##Recover categorical variables
        tempDF3 = tempDF1.loc[:, tempDF1.dtypes!='float64']
        tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
        
        #One-hot encoding for categorical covariates
        #-> In statsmodels, categorical variables are automatically recognized!
        #Add a constant for the intercept
        #-> Similar to R, smf automatically adds a constant
        
        #Sort to consistent level for categorical variables
        tempDF1 = tempDF1.sort_values(by=tempDF1.loc[:, tempDF1.dtypes!='float64'].columns.tolist())
        tempL1.append(len(tempDF1))
        
        #Univariate model
        model = smf.ols(formula1, data=tempDF1).fit()
        tempL2.append(model.rsquared*100)#R2 [%]
        
        #Full model
        model = smf.ols(formula2, data=tempDF1).fit()
        tempL3.append(model.rsquared*100)#R2 [%]
        tempL4.append(model.params['Variable'])#Beta-coefficient
        tempL5.append(model.bse['Variable'])#SE of beta-coefficient
        tempL6.append(model.tvalues['Variable'])#t-statistic
        tempL7.append(model.pvalues['Variable'])#P-value
    #Clean the results
    tempDF1 = pd.DataFrame({'ModuleName':moduleDF['ModuleName'], 'Sex':tempD[sex],
                            'N':tempL1, 'R2':tempL3, variable+'_UnivarR2':tempL2,
                            variable+'_Coef':tempL4, variable+'_CoefSE':tempL5,
                            variable+'_tStat':tempL6, variable+'_Pval':tempL7},
                           index=moduleDF.index)
    statDF = pd.concat([statDF, tempDF1], axis=0)
    t_elapsed = time.time() - t_start
    print(tempD[sex], len(moduleDF), 'OLS regressions:',
          round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#P-value adjustment by using Benjamini–Hochberg method
statDF[variable+'_AdjPval'] = multi.multipletests(statDF[variable+'_Pval'],
                                                  alpha=0.05, method='fdr_bh',
                                                  is_sorted=False, returnsorted=False)[1]
statDF = statDF.sort_values(by=variable+'_AdjPval', ascending=True)

display(statDF)
tempDF = statDF.loc[statDF[variable+'_AdjPval']<0.05]
print('Adjusted P-value < 0.05:', len(tempDF))
print(' -> Female:', len(tempDF.loc[tempDF['Sex']=='Female']))
print(' -> Male:', len(tempDF.loc[tempDF['Sex']=='Male']))

#Clean and save
tempDF = moduleDF['ModuleName']#pd.Series() here
for sex in tempD.keys():
    tempDF1 = statDF.loc[statDF['Sex']==tempD[sex]]
    tempDF1 = tempDF1.drop(columns=['ModuleName', 'Sex'])
    tempDF1.columns = tempD[sex]+'_'+tempDF1.columns
    tempDF = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='left')
tempL = [tempD[sex]+'_'+variable+'_AdjPval' for sex in tempD.keys()]
tempDF = tempDF.sort_values(by=tempL, ascending=True)
display(tempDF)
fileDir = './ExportData/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = consensus+'-fixed-RMS-regression.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, index=True, sep='\t')

statDF = tempDF

### 5-2. RMS under Q10 consensus

In [None]:
consensus = 'Q10'

#### 5-2-1. Extract RMS under the fixed consensus

In [None]:
#Extract RMS whose template phenotype corresponds to a sex-matched target consensus
rmsDF_kn = pd.DataFrame(index=pd.Index(rmsDF['ModuleID'].unique(), name='ModuleID'))
tempDF = rmsDF.set_index('ModuleID')
for sex in ['F', 'M']:
    tempL = sampleDF.loc[sampleDF['Sex']==sex].index.tolist()
    template = sex+'-'+consensus
    tempDF1 = tempDF[tempL].loc[tempDF['Template']==template]
    rmsDF_kn = pd.merge(rmsDF_kn, tempDF1, left_index=True, right_index=True, how='left')

display(rmsDF_kn)

#### 5-2-2. OLS regression

In [None]:
#Prepare DF
tempDF = rmsDF_kn.reset_index().melt(var_name='SampleID', value_name='RMS', id_vars='ModuleID')
tempDF = pd.merge(tempDF, sampleDF.reset_index(), on='SampleID', how='left')

#Regression analysis
tempD = {'F':'Female', 'M':'Male'}
tempL = ['RMS', 'BaseCAge', 'Sex', 'log_BaseBMI', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
variable = 'BaseCAge'
formula1 = 'RMS ~ Variable'#Univariate model
formula2 = 'RMS ~ Variable + log_BaseBMI + PC1 + PC2 + PC3 + PC4 + PC5'#Full model
statDF = pd.DataFrame()
for sex in tempD.keys():
    t_start = time.time()
    tempL1 = []#For N
    tempL2 = []#For R2 in univariate model
    tempL3 = []#For R2
    tempL4 = []#For beta-coefficient
    tempL5 = []#For SE of beta-coefficient
    tempL6 = []#For t-statistic
    tempL7 = []#For P-value
    for module in moduleDF.index.tolist():
        tempDF1 = tempDF.loc[(tempDF['Sex']==sex)&(tempDF['ModuleID']==module)]
        tempDF1 = tempDF1.set_index('SampleID')
        tempDF1 = tempDF1[tempL].rename(columns={variable:'Variable'})
        
        #Standardization of continuous variables
        ##Select continuous variables
        tempDF2 = tempDF1.loc[:, tempDF1.dtypes=='float64']
        ##Z-score transformation
        scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
        tempA = scaler.fit_transform(tempDF2)#Column direction
        tempDF2 = pd.DataFrame(data=tempA, index=tempDF2.index, columns=tempDF2.columns)
        ##Recover categorical variables
        tempDF3 = tempDF1.loc[:, tempDF1.dtypes!='float64']
        tempDF1 = pd.merge(tempDF2, tempDF3, left_index=True, right_index=True, how='left')
        
        #One-hot encoding for categorical covariates
        #-> In statsmodels, categorical variables are automatically recognized!
        #Add a constant for the intercept
        #-> Similar to R, smf automatically adds a constant
        
        #Sort to consistent level for categorical variables
        tempDF1 = tempDF1.sort_values(by=tempDF1.loc[:, tempDF1.dtypes!='float64'].columns.tolist())
        tempL1.append(len(tempDF1))
        
        #Univariate model
        model = smf.ols(formula1, data=tempDF1).fit()
        tempL2.append(model.rsquared*100)#R2 [%]
        
        #Full model
        model = smf.ols(formula2, data=tempDF1).fit()
        tempL3.append(model.rsquared*100)#R2 [%]
        tempL4.append(model.params['Variable'])#Beta-coefficient
        tempL5.append(model.bse['Variable'])#SE of beta-coefficient
        tempL6.append(model.tvalues['Variable'])#t-statistic
        tempL7.append(model.pvalues['Variable'])#P-value
    #Clean the results
    tempDF1 = pd.DataFrame({'ModuleName':moduleDF['ModuleName'], 'Sex':tempD[sex],
                            'N':tempL1, 'R2':tempL3, variable+'_UnivarR2':tempL2,
                            variable+'_Coef':tempL4, variable+'_CoefSE':tempL5,
                            variable+'_tStat':tempL6, variable+'_Pval':tempL7},
                           index=moduleDF.index)
    statDF = pd.concat([statDF, tempDF1], axis=0)
    t_elapsed = time.time() - t_start
    print(tempD[sex], len(moduleDF), 'OLS regressions:',
          round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#P-value adjustment by using Benjamini–Hochberg method
statDF[variable+'_AdjPval'] = multi.multipletests(statDF[variable+'_Pval'],
                                                  alpha=0.05, method='fdr_bh',
                                                  is_sorted=False, returnsorted=False)[1]
statDF = statDF.sort_values(by=variable+'_AdjPval', ascending=True)

display(statDF)
tempDF = statDF.loc[statDF[variable+'_AdjPval']<0.05]
print('Adjusted P-value < 0.05:', len(tempDF))
print(' -> Female:', len(tempDF.loc[tempDF['Sex']=='Female']))
print(' -> Male:', len(tempDF.loc[tempDF['Sex']=='Male']))

#Clean and save
tempDF = moduleDF['ModuleName']#pd.Series() here
for sex in tempD.keys():
    tempDF1 = statDF.loc[statDF['Sex']==tempD[sex]]
    tempDF1 = tempDF1.drop(columns=['ModuleName', 'Sex'])
    tempDF1.columns = tempD[sex]+'_'+tempDF1.columns
    tempDF = pd.merge(tempDF, tempDF1, left_index=True, right_index=True, how='left')
tempL = [tempD[sex]+'_'+variable+'_AdjPval' for sex in tempD.keys()]
tempDF = tempDF.sort_values(by=tempL, ascending=True)
display(tempDF)
fileDir = './ExportData/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = consensus+'-fixed-RMS-regression.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, index=True, sep='\t')

statDF = tempDF

### 5-3. Association between module similarity and CAge

#### 5-3-1. Associated modules

In [None]:
#Prepare the adjusted P-values
pvalDF = moduleDF['ModuleName']#pd.Series() here
for consensus in ['Q1', 'Q10']:
    fileDir = './ExportData/'
    ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
    fileName = consensus+'-fixed-RMS-regression.tsv'
    variable = 'BaseCAge'
    tempDF1 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('ModuleID')
    tempDF1 = tempDF1.loc[:, tempDF1.columns.str.contains('AdjPval')]
    tempDF1.columns = consensus+'-fixed_'+tempDF1.columns.str.replace('_'+variable+'_AdjPval', '')
    pvalDF = pd.merge(pvalDF, tempDF1, left_index=True, right_index=True, how='left')
display(pvalDF)

print('Adjusted P-value < 0.05:')
tempDF = pvalDF.loc[:, pvalDF.columns.str.contains('-fixed')]#Adjusted P-value
for col_n in tempDF.columns.tolist():
    tempS = tempDF[col_n]
    tempS = tempS.loc[tempS<0.05]
    print(' - '+col_n, len(tempS))

#### 5-3-2. Visualization: Venn diagram

> Ignore direction in this case.  
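
As a rough sketch of what ignoring direction means here: each contrast contributes the set of modules whose adjusted P-value is below 0.05, regardless of the sign of the coefficient, and the Venn diagram counts the overlaps of those sets. The module IDs and P-values below are hypothetical; only the label pattern (`Q1-fixed_Female`, etc.) follows the notebook.

```python
# Hypothetical adjusted P-values per module for the four sex x consensus contrasts
adj_pvals = {
    'Q1-fixed_Female':  {'M1': 0.001, 'M2': 0.20, 'M3': 0.03, 'M4': 0.70},
    'Q1-fixed_Male':    {'M1': 0.010, 'M2': 0.04, 'M3': 0.50, 'M4': 0.90},
    'Q10-fixed_Female': {'M1': 0.002, 'M2': 0.60, 'M3': 0.01, 'M4': 0.30},
    'Q10-fixed_Male':   {'M1': 0.030, 'M2': 0.08, 'M3': 0.40, 'M4': 0.95},
}

# One set of significant modules per contrast; the coefficient sign is not consulted
sig_sets = {label: {module for module, p in pvals.items() if p < 0.05}
            for label, pvals in adj_pvals.items()}

# Modules significant in every contrast, and modules significant in none
all_modules = set().union(*(pvals.keys() for pvals in adj_pvals.values()))
shared = set.intersection(*sig_sets.values())
no_assoc = all_modules - set().union(*sig_sets.values())

for label, modules in sig_sets.items():
    print(label, sorted(modules))
print('Significant in all contrasts:', sorted(shared))
print('No association in any:', sorted(no_assoc))
```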

In [None]:
#Prepare label and color
tempD1 = {'Q10 consensus\nFemale':'tab:blue', 'Q1 consensus\nFemale':'tab:orange',
          'Q1 consensus\nMale':'tab:green', 'Q10 consensus\nMale':'tab:red'}

#Prepare module sets
tempD2 = {}
tempDF = pvalDF.loc[:, pvalDF.columns.str.contains('-fixed')]#Adjusted P-value
for col_n in tempDF.columns.tolist():
    tempS = tempDF[col_n]
    tempS = tempS.loc[tempS<0.05]
    consensus, sex = col_n.split(sep='-fixed_')
    label = consensus+' consensus\n'+sex
    tempD2[label] = set(tempS.index.tolist())
##Sort to make consistent order in manual legend generation
tempD = {}
for label in tempD1.keys():
    tempD[label] = tempD2[label]

#Not significant in any contrast
tempDF = pvalDF.copy()
for moduleS in tempD.values():
    tempDF = tempDF.loc[~tempDF.index.isin(moduleS)]
count = len(tempDF)
print('No association in any:', count)

#Venn diagram
sns.set(style='ticks', font='Arial', context='talk')
fig, ax = plt.subplots(figsize=(4, 4))
venn(tempD, fmt='{size:,}', cmap=list(tempD1.values()), legend_loc=None, ax=ax)
plt.setp(ax, ylim=(0.1, 0.95))#Otherwise, weird space...
##Add legend annotation
x_coord = [0.1, 0.1, 0.9, 0.9]
y_coord = [0.25, 0.7, 0.7, 0.25]
h_align = ['right', 'right', 'left', 'left']
v_align = ['top', 'bottom', 'bottom', 'top']
for i in range(len(tempD1)):
    key = list(tempD1.keys())[i]
    total = f'{len(tempD[key]):,}'
    ax.text(x_coord[i], y_coord[i], key+'\n('+total+' modules)',
            fontsize='small', multialignment='center',
            horizontalalignment=h_align[i], verticalalignment=v_align[i],
            bbox={'boxstyle':'round', 'facecolor':tempD1[key], 'pad':0.2, 'alpha':0.5})
title = 'Modules associated with chronological age'
ax.set_title(title, fontsize='medium')
##Save
fileDir = './ExportFigures/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = 'fixed-RMS-vs-CAge_venn.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

#### 5-3-3. Visualization: correlogram-like bubble plot

In [None]:
#Prepare DF for plot
tempDF = pvalDF.drop(columns='ModuleName')
tempDF = tempDF.sort_values(by=tempDF.columns.tolist(), ascending=True)
tempDF = -np.log10(tempDF)
##Sort for inverse y-axis
tempL = tempDF.columns.tolist()
tempL.reverse()
tempDF = tempDF[tempL]
tempDF = tempDF.reset_index().melt(var_name='FullLabel', value_name='NegLog10AdjPval', id_vars='ModuleID')
tempDF1 = tempDF['FullLabel'].str.split(pat='-fixed_', expand=True)
tempDF1 = tempDF1.rename(columns={0:'Consensus', 1:'Group'})
tempDF = pd.concat([tempDF, tempDF1], axis=1)

#Label and color
tempD1 = {'Q1':plt.get_cmap('PRGn_r')(0.0), 'Q10':plt.get_cmap('PRGn_r')(1.0)}
tempD2 = {'Positive association':plt.get_cmap('RdBu')(0.0),
          'No association':'gray',
          'Negative association':plt.get_cmap('RdBu')(1.0)}
marker_size = (25, tempDF['NegLog10AdjPval'].max()*5)
##Prepare effect size for direction color
diffDF = moduleDF['ModuleName']#pd.Series() here
for consensus in ['Q1', 'Q10']:
    fileDir = './ExportData/'
    ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
    fileName = consensus+'-fixed-RMS-regression.tsv'
    variable = 'BaseCAge'
    tempDF1 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('ModuleID')
    tempDF1 = tempDF1.loc[:, tempDF1.columns.str.contains('Coef$')]
    tempDF1.columns = consensus+'-fixed_'+tempDF1.columns.str.replace('_'+variable+'_Coef', '')
    diffDF = pd.merge(diffDF, tempDF1, left_index=True, right_index=True, how='left')
tempL = []
for row_i in range(len(tempDF)):
    logadjpval = tempDF['NegLog10AdjPval'].iloc[row_i]
    module = tempDF['ModuleID'].iloc[row_i]
    fulllabel = tempDF['FullLabel'].iloc[row_i]
    bcoef = diffDF.loc[module, fulllabel]
    if logadjpval > -np.log10(0.05):
        if bcoef > 0:
            tempL.append(list(tempD2.keys())[0])
        else:
            tempL.append(list(tempD2.keys())[2])
    else:
        tempL.append(list(tempD2.keys())[1])
tempDF['Association'] = tempL

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(3, 3), sharex=True, sharey=False,
                         gridspec_kw={'hspace':0.1})
for ax_i, ax in enumerate(axes.flat):
    consensus = list(tempD1.keys())[ax_i]
    tempDF2 = tempDF.loc[tempDF['Consensus']==consensus]
    
    #Scatterplot
    if ax_i==0:
        add_legend = False
    else:
        add_legend = True
    sns.scatterplot(data=tempDF2, x='ModuleID', y='Group',
                    hue='Association', hue_order=list(tempD2.keys()), palette=tempD2,
                    size='NegLog10AdjPval', sizes=marker_size, marker='o',
                    legend=add_legend, ax=ax)
    ax.grid(axis='x', linestyle='-', color='lightgray', zorder=0)
    ax.grid(axis='y', linestyle='-', color='lightgray', zorder=0)
    ax.spines.top.set_visible(False)#sns.despine() overrides the below bottom spine setting
    ax.spines.right.set_visible(False)#sns.despine() overrides the below bottom spine setting
    if ax_i==0:
        plt.setp(ax.get_xticklabels(), visible=False)
        #plt.setp(ax.get_xaxis(), visible=False)
        ax.tick_params(axis='x', colors='w')#Add as dummy for grid lines
        ax.spines.bottom.set_visible(False)
    else:
        plt.setp(ax.get_xticklabels(), rotation=70, horizontalalignment='right',
                 verticalalignment='center', rotation_mode='anchor')
    #Add facet
    ax.set_ylabel('Consen-\nsus: '+consensus, color='w')
    rect = plt.Rectangle((-0.675, 0), 0.25, 1,#Manual adjustment
                         transform=ax.transAxes, facecolor=tempD1[consensus],
                         clip_on=False, linewidth=0, zorder=0)
    ax.add_patch(rect)
    if add_legend==True:
        handles, labels = ax.get_legend_handles_labels()#To modify later
        ax.legend_.remove()
plt.setp(axes, ylim=(-0.5, len(tempDF['Group'].unique())-0.5))
plt.setp(axes, xlim=(-0.75, len(tempDF['ModuleID'].unique())-0.25))
plt.setp(axes, xlabel='')
#Modify seaborn legend
legend = plt.legend(handles[1:(len(tempD2)+1)], labels[1:(len(tempD2)+1)], fontsize='small',
                    title='Module similarity\nvs. chronological age', title_fontsize='medium',
                    ncol=1, labelspacing=0.25, handletextpad=0.0,
                    bbox_to_anchor=(1.0, 2.0), loc='upper left', borderaxespad=0.0, frameon=False)
plt.gca().add_artist(legend)
legend = plt.legend(handles[(len(tempD2)+2):], labels[(len(tempD2)+2):], fontsize='small',
                    title=r'$-\log_{10}($'+'adjusted '+r'$P)$', title_fontsize='medium',
                    ncol=2, columnspacing=0.75, labelspacing=0.75, handletextpad=0.0,
                    bbox_to_anchor=(1.0, 0.5), loc='upper left', borderaxespad=0.0, frameon=False)
plt.gca().add_artist(legend)
#Save
fileDir = './ExportFigures/'
ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
fileName = 'fixed-RMS-vs-CAge_bubble.tif'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                  pil_kwargs={'compression':'tiff_lzw'})
plt.show()

#### 5-3-4. Visualization: regplot
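
The adjusted RMS plotted in this subsection is the residual of RMS regressed on the covariates, shifted back by the mean RMS. The sketch below illustrates that idea under simplifying assumptions: it uses a single hypothetical covariate and a hand-written simple regression, whereas the notebook cell fits the six covariates (log BMI and PC1–PC5) with a formula-based OLS model on standardized values. The function name and the numbers are made up for illustration.

```python
from statistics import mean

def adjust_for_covariate(y, x):
    """Remove the linear effect of x from y: OLS residuals plus the mean of y."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Adding the mean back keeps the adjusted values on the original RMS scale
    return [r + y_bar for r in residuals]

# Hypothetical sample RMS values and one covariate (e.g., log-transformed BMI)
rms = [0.50, 0.55, 0.62, 0.70, 0.73]
log_bmi = [3.0, 3.1, 3.2, 3.3, 3.4]

adj_rms = adjust_for_covariate(rms, log_bmi)
print([round(v, 3) for v in adj_rms])
```

By construction, the adjusted values keep the mean of the original RMS and are linearly uncorrelated with the covariate, so the remaining trend against chronological age is not attributable to that covariate's linear effect.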

In [None]:
#Select representatives
topX = np.min([30, len(pvalDF)])
topX_plot = np.min([3, len(pvalDF)])
sorted_by = pvalDF.columns.tolist()[1]
pvalDF = pvalDF.sort_values(by=sorted_by, ascending=True)
print('Top', topX, 'modules (sort by '+sorted_by+'):')
display(pvalDF.iloc[:topX])
plotL = pvalDF.index.tolist()[:topX_plot]

#Prepare DF for plot (raw RMS)
tempDF = rmsDF.melt(var_name='SampleID', value_name='RMS', id_vars=['ModuleID', 'Template'])
tempDF = pd.merge(tempDF, sampleDF.reset_index(), on='SampleID', how='left')

#Label and color
tempD1 = {'F':'Female', 'M':'Male'}
tempD2 = {'F':plt.get_cmap('RdBu')(0.0), 'M':plt.get_cmap('RdBu')(1.0)}
tempD3 = {'Q1':plt.get_cmap('PRGn_r')(0.0), 'Q10':plt.get_cmap('PRGn_r')(1.0)}
tempD4 = {'F':'tab:red', 'M':'tab:blue'}

#Visualize each representative
for rank_i in range(len(plotL)):
    print(' - Rank '+str(rank_i+1)+' (sort by '+sorted_by+'):')
    module = plotL[rank_i]
    #Check module summary
    tempDF1 = pd.DataFrame(moduleDF.loc[module]).T
    display(tempDF1)
    
    #Visualization
    sns.set(style='ticks', font='Arial', context='talk')
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 7.5), sharex=True, sharey=True)
    xmax = 90
    xmin = 20
    xinter = 10
    xmargin_l = 7.5
    xmargin_r = 2.5
    ymax = 1.0
    ymin = 0.0
    yinter = 0.2
    ymargin_t = 0.05
    ymargin_b = 0.05
    xoff = 0.005
    yoff = 0.005
    for ax_i, ax in enumerate(axes.flat):
        #Select sex and template
        if ax_i%2==0:
            sex = list(tempD2.keys())[0]
        else:
            sex = list(tempD2.keys())[1]
        if ax_i//2==0:
            consensus = list(tempD3.keys())[0]
        else:
            consensus = list(tempD3.keys())[1]
        template = sex+'-'+consensus
        tempDF1 = tempDF.loc[(tempDF['ModuleID']==module)&
                             (tempDF['Sex']==sex)&
                             (tempDF['Template']==template)]
        
        #Prepare the RMS values adjusted with covariates
        tempL = ['RMS', 'log_BaseBMI', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5']
        formula = 'RMS ~ log_BaseBMI + PC1 + PC2 + PC3 + PC4 + PC5'
        tempDF2 = tempDF1.set_index('SampleID')[tempL]
        ##Standardization of continuous variables
        ###Select continuous variables
        tempDF3 = tempDF2.loc[:, tempDF2.dtypes=='float64']
        ##Z-score transformation
        scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
        tempA = scaler.fit_transform(tempDF3)#Column direction
        tempDF3 = pd.DataFrame(data=tempA, index=tempDF3.index, columns=tempDF3.columns)
        ##Recover categorical variables
        tempDF4 = tempDF2.loc[:, tempDF2.dtypes!='float64']
        tempDF2 = pd.merge(tempDF3, tempDF4, left_index=True, right_index=True, how='left')
        ##One-hot encoding for categorical covariates
        #-> In statsmodels, categorical variables are automatically recognized!
        ##Add a constant for the intercept
        #-> Similar to R, smf automatically adds a constant
        ##Sort to consistent level for categorical variables
        tempDF2 = tempDF2.sort_values(by=tempDF2.loc[:, tempDF2.dtypes!='float64'].columns.tolist())
        ##OLS regression model
        model = smf.ols(formula, data=tempDF2).fit()
        print('   - '+consensus+'-fixed_'+tempD1[sex]+' OLS regression for adjusted RMS: N =',
              len(tempDF2), ', R2 = ', model.rsquared*100, '[%]')
        ##Calculate adjusted values
        tempDF2['RMS'] = model.resid + tempDF2['RMS'].mean()
        ##Scale back to the original RMS
        tempDF3 = tempDF2.loc[:, tempDF2.dtypes=='float64']
        tempA = scaler.inverse_transform(tempDF3)#Column direction
        tempDF3 = pd.DataFrame(data=tempA, index=tempDF3.index, columns=tempDF3.columns)
        tempDF1['AdjRMS'] = tempDF3['RMS'].tolist()
        
        #Scatterplot with regression line
        sns.regplot(data=tempDF1, x='BaseCAge', y='AdjRMS', color=tempD4[sex],
                    scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                    scatter_kws={'alpha':0.6, 'edgecolor':'k', 's':15}, ax=ax)
        #Draw consensus range
        tempS = tempDF1['BaseCAge'].loc[tempDF1['Phenotype']==tempDF1['Template']]
        ax.axvspan(tempS.min(), tempS.max(), facecolor=tempD3[consensus], alpha=0.3, zorder=0)
        if ax_i//2==0:
            plt.setp(ax.get_xticklabels(), visible=False)
            #Top facet
            ax.set_title(tempD1[sex], {'fontsize':'medium', 'color':'w'})
            rect = plt.Rectangle((0+xoff, 1+yoff), 1-xoff, 0.1,#Manual adjustment
                                 transform=ax.transAxes, facecolor=tempD2[sex],
                                 clip_on=False, linewidth=0, zorder=0)
            ax.add_patch(rect)
        if ax_i%2==1:
            plt.setp(ax.get_yticklabels(), visible=False)
            #Right facet
            at = AnchoredText('Consensus: '+consensus, loc='center left',
                              bbox_to_anchor=(0.95, 0.51), bbox_transform=ax.transAxes,#Minor manual adjustment
                              frameon=False, prop={'size':'medium', 'color':'w', 'rotation':'vertical'})
            ax.add_artist(at)
            rect = plt.Rectangle((1+xoff, 0+yoff), 0.1, 1-yoff,#Manual adjustment
                                 transform=ax.transAxes, facecolor=tempD3[consensus],
                                 clip_on=False, linewidth=0, zorder=0)
            ax.add_patch(rect)
        #Add annotation
        pval = pvalDF.loc[module, consensus+'-fixed_'+tempD1[sex]]
        pval_text = f'{Decimal(str(pval)):.3E}'#Take more digits because rounding is bad here
        if pval_text=='1.000E+0':
            pval_text = '1.0'
        else:
            significand, exponent = pval_text.split(sep='E-')
            significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
            if significand=='10.0':
                significand = '1.0'
                exponent = str(int(exponent)-1)
            if int(exponent)>2:
                pval_text = r'${0} \times 10^{{-{1}}}$'.format(significand, exponent)
            elif int(exponent)>0:
                pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
            else:
                pval_text = significand
        if ax_i//2==0:
            text_ycoord = 0.99
            text_valign = 'top'
        else:
            text_ycoord = 0.01
            text_valign = 'bottom'
        text = 'Adjusted '+r'$P$ = '+pval_text+'\n'+r'$n$ = '+f'{len(tempDF1):,}'
        ax.annotate(text, xy=(0.99, text_ycoord), xycoords='axes fraction',
                    horizontalalignment='right', verticalalignment=text_valign,
                    multialignment='right', fontsize='x-small', color='k')
        #Save position to generate legend later
        if ax_i ==0:
            ax0_pos = ax.get_position().bounds
        elif ax_i==2:
            ax2_pos = ax.get_position().bounds
        elif ax_i==3:
            ax3_pos = ax.get_position().bounds
    sns.despine()
    plt.setp(axes, xlim=(xmin-xmargin_l, xmax+xmargin_r), xticks=np.arange(xmin, xmax+xinter/10, xinter))
    plt.setp(axes, ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax+yinter/10, yinter))
    plt.setp(axes, xlabel='', ylabel='')
    fig.text(x=(ax2_pos[0]+(ax3_pos[0]+ax3_pos[2]*0.95))/2, y=ax2_pos[1]-ax0_pos[3]*0.275,#Minor manual adjustment
             s='Chronological age [years]', fontsize='medium',
             verticalalignment='top', horizontalalignment='center', rotation='horizontal')
    fig.text(x=ax0_pos[0]-ax0_pos[2]*0.275, y=(ax2_pos[1]+(ax0_pos[1]+ax0_pos[3])*1.02)/2,#Minor manual adjustment
             s='Sample RMS (adjusted)', fontsize='medium',
             verticalalignment='center', horizontalalignment='right', rotation='vertical')
    fig.tight_layout()
    #Set title
    modulename = moduleDF.loc[module, 'ModuleName']
    initial = modulename[0].capitalize()
    title = re.sub('^.', initial, modulename)
    fig.suptitle(title, size='medium',
                 verticalalignment='bottom', horizontalalignment='center', wrap=True, y=0.975)
    fileDir = './ExportFigures/'
    ipynbName = '220604_Arivale-DIRAC-metabolomics-ver2_DIRAC-WGCNA-CA10Group_'
    fileName = 'fixed-RMS-vs-CAge_adjusted-RMS-lmplot-'+module+'.tif'
    plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04,
                      pil_kwargs={'compression':'tiff_lzw'})
    plt.show()

# — End of notebook —