# DIRAC Analysis of LC M001 Liver Proteomics — DIRAC with GOBP Modules in Sex-Stratified Groups

***by Kengo Watanabe***  

This Jupyter Notebook (with Python 3 kernel) assessed the overall distribution of the module rank conservation index (RCI) with sex stratification. The differential rank conservation (DIRAC; Eddy, J.A. et al. PLoS Comput. Biol. 2010) metrics were re-calculated in advance using the same code as Code01_LC-M001-proteomics_DIRAC-GOBP.ipynb, except that the unadjusted data (i.e., the data before the potential effects of sex were regressed out) and sex-stratified sample groups were used.  
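
For orientation, the rank matching score (RMS) and RCI can be illustrated on a toy module. This is only a minimal sketch of the definitions as I understand them from Eddy et al. (2010), not the code of Code01; the helper names (`pairwise_orderings`, `rank_template`, `rank_matching_score`) and all values are made up for illustration, and ties in the majority vote are simply treated as "not less than".

```python
import itertools
import pandas as pd

def pairwise_orderings(exprDF):
    """Rows: analyte pairs 'a<b'; columns: samples; True if a < b in that sample."""
    tempD = {}
    for a, b in itertools.combinations(exprDF.index, 2):
        tempD[a + '<' + b] = exprDF.loc[a] < exprDF.loc[b]
    return pd.DataFrame(tempD).T

def rank_template(orderDF):
    """Consensus ordering of a phenotype: majority vote across its samples."""
    return orderDF.mean(axis=1) > 0.5

def rank_matching_score(orderDF, template):
    """Per-sample fraction of pairwise orderings that agree with the template."""
    return orderDF.eq(template, axis=0).mean(axis=0)

#Toy module of 3 analytes in 4 samples of one phenotype (made-up values)
toyDF = pd.DataFrame({'s1': [1.0, 2.0, 3.0],
                      's2': [1.0, 3.0, 2.0],
                      's3': [0.5, 1.5, 2.5],
                      's4': [2.0, 1.0, 3.0]},
                     index=['A', 'B', 'C'])
orderDF = pairwise_orderings(toyDF)
template = rank_template(orderDF)
rmsS = rank_matching_score(orderDF, template)
rci = rmsS.mean()#RCI: mean RMS of the samples under their own phenotype's template
print(rmsS)
print('RCI:', rci)
```

Here two samples match the consensus in all three pairs and two samples match in two of three pairs, so the toy RCI is 5/6.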

Input files:  
- Preprocessed analyte data: 210126_LCprotomics-M001-DIRAC-ver6_preprocessing_cleaned-robustZscored-data.tsv  
- Module–analyte metadata: 210126_LCprotomics-M001-DIRAC-ver6_preprocessing_QuickGO-GOBP_min-n4-cov50.tsv  
- Sample–mouse metadata: 210126_LCprotomics-M001-DIRAC-ver6_preprocessing_metadata-sample.tsv  
- DIRAC RMS data: 210127_LCproteomics-M001-DIRAC-ver6_DIRAC-GOBP_QuickGO-GOBP_min-n4-cov50_RankMatchingScore-FM.tsv  
- DIRAC RCI data: 210127_LCproteomics-M001-DIRAC-ver6_DIRAC-GOBP_QuickGO-GOBP_min-n4-cov50_RankConservationIndex-FM.tsv  

Output figures and tables:  
- Supplementary Figure 1a  

Original notebook (memo for my future tracing):  
- dalek:\[JupyterLab HOME\]/230628_LC-M001-CommunBiol-2ndRevision/230628_LC-M001-CommunBiol-2ndRevision_Sex-stratified-M001-DIRAC-GOBP.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time
#For exporting .pdf file with editable text
import matplotlib
matplotlib.rcParams['pdf.fonttype']=42
matplotlib.rcParams['ps.fonttype']=42

import re
from statsmodels.stats import multitest as multi
from decimal import Decimal, ROUND_HALF_UP

!conda list

# packages in environment at /opt/conda/envs/arivale-py3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
analytics                 0.1                      pypi_0    pypi
argon2-cffi               21.1.0           py39h3811e60_0    conda-forge
arivale-data-interface    0.1.0                    pypi_0    pypi
async_generator           1.10                       py_0    conda-forge
atk-1.0                   2.36.0               h3371d22_4    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
biopython                 1.79             py39h3811e60_0    conda-forge
bleach 

## 1. Prepare dataset and metadata

### 1-1. Analyte data

In [None]:
#Import analyte data
fileDir = '../210126_LCproteomics-M001-DIRAC-ver6/ExportData/'
ipynbName = '210126_LCprotomics-M001-DIRAC-ver6_preprocessing_'
fileName = 'cleaned-robustZscored-data.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('UniProtID')
display(tempDF)

analyteDF = tempDF

### 1-2. Module–analyte metadata

In [None]:
#Import module-analyte metadata
fileDir = '../210126_LCproteomics-M001-DIRAC-ver6/ExportData/'
ipynbName = '210126_LCprotomics-M001-DIRAC-ver6_preprocessing_'
fileName = 'QuickGO-GOBP_min-n4-cov50.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
tempD = {'NetworkID':'ModuleID', 'NetworkName':'ModuleName', 'nAnalyte':'nAnalytes', 'nBackground':'nBackgrounds'}
tempDF = tempDF.rename(columns=tempD).set_index('ModuleID')
print(' - Unique analytes with module:', len(tempDF['UniProtID'].unique()))
print(' - Unique modules with analytes:', len(tempDF.index.unique()))

#Prepare moduleS
moduleS = tempDF['UniProtID']
display(moduleS)

#Retrieve module metadata
tempDF = tempDF[['ModuleName', 'nAnalytes', 'nBackgrounds', 'Coverage']]
moduleDF = tempDF.reset_index().drop_duplicates(keep='first').set_index('ModuleID')
display(moduleDF)
display(moduleDF.describe(include='all'))

### 1-3. Sample–mouse metadata

In [None]:
#Import sample-mouse metadata
fileDir = '../210126_LCproteomics-M001-DIRAC-ver6/ExportData/'
ipynbName = '210126_LCprotomics-M001-DIRAC-ver6_preprocessing_'
fileName = 'metadata-sample.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
tempDF['SampleID'] = tempDF['MouseID']
tempDF = tempDF.loc[tempDF['SampleID'].isin(analyteDF.columns)]
tempDF = tempDF.set_index('SampleID')

#Prepare phenotypeS
tempDF['Phenotype'] = tempDF['Group']#Sex-stratified, matching labels with DIRAC table

#Clean to use the recent code
tempD = {'Control':'Ctrl', 'Acarbose':'Aca', 'Estradiol':'17aE2', 'Rapamycin':'Rapa'}
tempDF['Intervention'] = tempDF['Treatment'].map(tempD)
tempDF = tempDF.drop(columns=['Treatment'])
tempDF['Group'] = tempDF['Intervention']+'_'+tempDF['Sex']

display(tempDF)
display(tempDF['Phenotype'].value_counts())

sampleDF = tempDF

### 1-4. DIRAC metrics

In [None]:
#Import the combined DIRAC results
fileDir = '../210126_LCproteomics-M001-DIRAC-ver6/ExportData/'
ipynbName = '210127_LCproteomics-M001-DIRAC-ver6_DIRAC-GOBP_'
tempD1 = {'RMS':'QuickGO-GOBP_min-n4-cov50_RankMatchingScore-FM.tsv',
          'RCI':'QuickGO-GOBP_min-n4-cov50_RankConservationIndex-FM.tsv'}
tempD2 = {}
tempDF = sampleDF.reset_index()[['Phenotype', 'Group']].drop_duplicates(keep='first')
tempS = tempDF.set_index('Phenotype')['Group']
for metric in tempD1.keys():
    fileName = tempD1[metric]
    tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
    #Clean
    tempDF = tempDF.rename(columns={'NetworkID':'ModuleID'})
    tempDF['Template'] = tempDF['Template'].map(tempS)
    if metric=='RCI':
        tempDF = tempDF.rename(columns=tempS)
    tempD2[metric] = tempDF
    
    print(metric+' dataframe:', tempDF.shape)
    print(' - Unique modules:', len(tempDF['ModuleID'].unique()))
    print(' - Unique templates:', len(tempDF['Template'].unique()))
    display(tempDF)
    print('')
##Update
rmsDF = tempD2['RMS']
rciDF = tempD2['RCI']
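
The relabeling in the cell above uses one pandas Series as a lookup table twice: `Series.map` for the values of the Template column and `DataFrame.rename(columns=...)` for the phenotype column names. A minimal sketch on a toy table follows; the original labels ('Control_F', 'Acarbose_F') and all values are hypothetical, because the actual labels in the DIRAC tables are not shown here.

```python
import pandas as pd

#Hypothetical lookup: label used in the DIRAC table -> cleaned group label
labelS = pd.Series({'Control_F': 'Ctrl_F', 'Acarbose_F': 'Aca_F'}, name='Group')

toyDF = pd.DataFrame({'ModuleID': ['GO:1', 'GO:1'],
                      'Template': ['Control_F', 'Acarbose_F'],
                      'Control_F': [0.90, 0.80],
                      'Acarbose_F': [0.70, 0.95]})
#Map the values of the Template column
toyDF['Template'] = toyDF['Template'].map(labelS)
#Rename the phenotype columns; columns absent from the Series index are kept as-is
toyDF = toyDF.rename(columns=labelS)
print(toyDF)
```

Note that `map` returns NaN for labels that are missing from the lookup, whereas `rename` silently leaves unknown column names unchanged, so checking the unique templates afterwards (as done above) is a useful safeguard.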

## 2. Rank conservation index: general pattern

### 2-1. Extract RCI (the mean of RMSs under the own phenotype consensus)

In [None]:
#Extract RCI whose template phenotype corresponds to the own phenotype
phenotypeL = rciDF.drop(columns=['ModuleID', 'Template']).columns.tolist()
rciDF_kk = pd.DataFrame(index=pd.Index(rciDF['ModuleID'].unique(), name='ModuleID'))
tempDF = rciDF.set_index('ModuleID')
for k in phenotypeL:
    tempS = tempDF[k].loc[tempDF['Template']==k]
    rciDF_kk = pd.merge(rciDF_kk, tempS, left_index=True, right_index=True, how='left')
##Sort
tempL1 = ['Ctrl', 'Aca', '17aE2', 'Rapa']
tempL2 = ['F', 'M']
tempL = [intervention+'_'+sex for sex in tempL2 for intervention in tempL1]
rciDF_kk = rciDF_kk[tempL]

display(rciDF_kk)
display(rciDF_kk.describe())
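
The extraction above keeps, for each module, only the RCI values whose template phenotype equals the phenotype of the column. An equivalent formulation with melt and pivot is sketched below on a toy table (made-up values; `toy_rciDF` and `toy_kkDF` are illustrative names, not objects of this notebook).

```python
import pandas as pd

#Toy RCI table: one row per (module, template), one column per phenotype
toy_rciDF = pd.DataFrame({'ModuleID': ['GO:1', 'GO:1', 'GO:2', 'GO:2'],
                          'Template': ['Ctrl_F', 'Aca_F', 'Ctrl_F', 'Aca_F'],
                          'Ctrl_F': [0.90, 0.80, 0.85, 0.70],
                          'Aca_F': [0.75, 0.95, 0.65, 0.88]})

#Long format: one row per (module, template, phenotype)
longDF = toy_rciDF.melt(id_vars=['ModuleID', 'Template'], var_name='Phenotype', value_name='RCI')
#Keep only the rows where the template is the phenotype's own consensus
ownDF = longDF.loc[longDF['Template']==longDF['Phenotype']]
#Back to wide format: modules x phenotypes
toy_kkDF = ownDF.pivot(index='ModuleID', columns='Phenotype', values='RCI')
print(toy_kkDF)
```

The pivoted columns come out in alphabetical order, so an explicit column selection is still needed afterwards to obtain the desired group order.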

### 2-2. Mann–Whitney U-test

> Note that the scipy API (scipy.stats.mannwhitneyu) is used, because only the one-sided test seems to be implemented in the current statsmodels API (statsmodels.stats.nonparametric.rank_compare_2indep). The output objects are the same between the two APIs, in contrast to the case of the t-test (where the degrees of freedom are not reported by the scipy API).  
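
As a quick check of the argument order, the sketch below runs the two-sided test in both directions on simulated values (random numbers, not the RCI data). It assumes SciPy 1.7 or later, where the returned U-statistic corresponds to the first sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.80, scale=0.05, size=30)#Plays the role of the contrast group
y = rng.normal(loc=0.75, scale=0.05, size=40)#Plays the role of the control group

u_xy, p_xy = stats.mannwhitneyu(x, y, use_continuity=True, alternative='two-sided')
u_yx, p_yx = stats.mannwhitneyu(y, x, use_continuity=True, alternative='two-sided')
print('U (x vs. y):', u_xy, '; U (y vs. x):', u_yx, '; n1*n2:', len(x)*len(y))
print('P (x vs. y):', p_xy, '; P (y vs. x):', p_yx)
```

The two U-statistics sum to n1*n2 and the two-sided P-value does not depend on the order, so swapping the arguments only changes which group the reported U refers to.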

In [None]:
tempDF = rciDF_kk
tempL1 = ['Ctrl', 'Aca', '17aE2', 'Rapa']
tempL2 = ['F', 'M']
tempL = [intervention+'_'+sex for sex in tempL2 for intervention in tempL1]
control_intervention = 'Ctrl'

#Statistical tests
tempDF1 = pd.DataFrame(columns=['Ustat', 'Pval'])
for contrast in tempL:
    sex = re.sub('^.+_', '', contrast)
    control = control_intervention+'_'+sex
    if control!=contrast:
        tempS1 = tempDF[control]
        tempS2 = tempDF[contrast]
        #Two-sided Mann–Whitney U-test
        ustat, pval = stats.mannwhitneyu(tempS2, tempS1,#U-statistic corresponds to the contrast
                                         use_continuity=True, alternative='two-sided', method='auto')
        tempDF1.loc[contrast+'-vs-'+control] = [ustat, pval]
##P-value adjustment across all comparisons by using Benjamini–Hochberg method
tempDF1['AdjPval'] = multi.multipletests(tempDF1['Pval'], alpha=0.05, method='fdr_bh',
                                         is_sorted=False, returnsorted=False)[1]
tempDF1.index.rename('ComparisonLabel', inplace=True)
display(tempDF1)

#Calculate general statistics
tempDF2 = pd.DataFrame(columns=['N', 'RCImedian', 'RCImad'])
for group in tempL:
    tempS = tempDF[group]
    size = len(tempS)
    median = tempS.median()
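    #Scaled MAD; stats.median_absolute_deviation (default scale=1.4826) is deprecated since SciPy 1.5 and removed in later releases
    #Cf. pd.Series.mad() is not median absolute deviation but mean absolute deviation
    mad = stats.median_abs_deviation(tempS, scale='normal')#scale='normal' multiplies the raw MAD by approximately 1.4826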
    tempDF2.loc[group] = [size, median, mad]
tempDF2.index.rename('GroupLabel', inplace=True)
display(tempDF2)

#Clean
##Reformat while renaming column names
tempD1 = {'Comparison':tempDF1, 'Group':tempDF2}
tempD2 = {}
for target in tempD1.keys():
    tempDF3 = tempD1[target]
    tempL1 = tempDF3.index.tolist()#For sorting later
    tempL2 = tempDF3.columns.tolist()#For sorting later
    tempDF3 = tempDF3.reset_index().melt(var_name='Variable', value_name='Value', id_vars=target+'Label')
    tempDF3['Variable'] = tempDF3[target+'Label']+'_'+tempDF3['Variable']
    tempDF3['ModuleID'] = 'All'#Dummy
    tempDF3 = tempDF3.pivot(index='ModuleID', columns='Variable', values='Value')
    tempDF3.columns.name = None#Erase 'Variable'
    tempL3 = [label+'_'+variable for label in tempL1 for variable in tempL2]
    tempDF3 = tempDF3[tempL3]#Sort
    tempD2[target] = tempDF3
##Merge
tempDF3 = pd.merge(tempD2['Group'], tempD2['Comparison'], left_index=True, right_index=True, how='inner')
##Convert data type
for col_n in tempDF3.columns.tolist():
    if re.search('_N$', col_n):
        tempDF3[col_n] = tempDF3[col_n].astype(int)
#Add dummy module name
tempL1 = tempDF3.columns.tolist()#For sorting later
tempDF3['ModuleName'] = 'General pattern'
tempL2 = [col_n for sublist in [['ModuleName'], tempL1] for col_n in sublist]
tempDF3 = tempDF3[tempL2]
display(tempDF3)

#Save
fileDir = './ExportData/'
ipynbName = '230628_LC-M001-CommunBiol-2ndRevision_Sex-stratified-M001-DIRAC-GOBP_'
fileName = 'RCI-general-pattern.tsv'
tempDF3.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

statDF = tempDF3
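
For reference, the Benjamini–Hochberg adjustment applied above across all comparisons can be written out by hand. The function below (`bh_adjust`, a hypothetical helper) is a minimal NumPy sketch with made-up P-values; it is expected to agree with `multi.multipletests(..., method='fdr_bh')[1]`, but is not the statsmodels implementation itself.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    #P-value * (number of tests) / (rank in ascending order)
    ranked = pvals[order] * n / np.arange(1, n+1)
    #Enforce monotonicity from the largest P-value downwards, then cap at 1
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    ranked = np.minimum(ranked, 1.0)
    adjusted = np.empty(n)
    adjusted[order] = ranked
    return adjusted

toy_pvals = [0.001, 0.04, 0.03, 0.20, 0.50, 0.012]#Made-up values for six comparisons
toy_adj = bh_adjust(toy_pvals)
print(toy_adj)
```

Because the adjustment depends on the number of tests, pooling the female and male comparisons into one family (as done above) gives larger adjusted P-values than adjusting within each sex separately.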

### 2-3. Visualization: boxplot

In [None]:
#Prepare DF for plot
tempDF1 = rciDF_kk.reset_index().melt(var_name='Group', value_name='RCI', id_vars='ModuleID')
tempDF2 = sampleDF.reset_index()[['Group', 'Intervention', 'Sex']].drop_duplicates(keep='first')
tempDF1 = pd.merge(tempDF1, tempDF2, on='Group', how='left')

#Prepare label and color
tempD = {'Ctrl':'Control', 'Aca':'Acarbose',
         '17aE2':'17'+r'$\alpha$'+'-Estradiol', 'Rapa':'Rapamycin'}
tempDF1['Intervention'] = tempDF1['Intervention'].map(tempD)
tempD1 = {'Control':'tab:blue', 'Acarbose':'tab:red',
          '17'+r'$\alpha$'+'-Estradiol':'tab:green', 'Rapamycin':'tab:purple'}
tempD2 = {'F':'Female', 'M':'Male'}

#Prepare significance labels
##Retrieve statistical significance
module = 'All'
tempS = statDF.loc[module, statDF.columns.str.contains('AdjPval')]
tempS.index = tempS.index.str.replace('_AdjPval', '')
tempS.name = 'AdjPval'
##Clean
tempDF2 = tempS.index.to_series().str.split(pat='-vs-', expand=True)
tempDF2 = tempDF2.rename(columns={0:'Contrast', 1:'Control'})
tempDF2['Sex'] = tempDF2['Control'].str.replace('^.+_', '', regex=True)
tempDF2['Control'] = tempDF2['Control'].str.replace('_.$', '', regex=True).map(tempD)
tempDF2['Contrast'] = tempDF2['Contrast'].str.replace('_.$', '', regex=True).map(tempD)
tempDF2 = pd.merge(tempDF2, tempS, left_index=True, right_index=True, how='left')
##Convert p-value to label
tempL = []
for row_i in range(len(tempDF2)):
    pval = tempDF2['AdjPval'].iloc[row_i]
    if pval<0.001:
        tempL.append('***')
    elif pval<0.01:
        tempL.append('**')
    elif pval<0.05:
        tempL.append('*')
    else:
        pval_text = Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
        tempL.append(r'$P$='+str(pval_text))#Remove white spaces around "=" due to space limitation
tempDF2['SignifLabel'] = tempL
display(tempDF2)

#Visualization
ymax = 1.0
ymin = 0.5
yinter = 0.1
ymargin_t = 0.11
ymargin_b = 0.01
aline_ymin = 0.95
aline_ymargin = 0.05
sns.set(style='ticks', font='Arial', context='talk')
fig, axes = plt.subplots(nrows=1, ncols=len(tempD2), figsize=(0.5+2*len(tempD2), 4),
                         sharex=True, sharey=True, gridspec_kw={'wspace':0.2})
plt.setp(axes, ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax + yinter/10, yinter))
for ax_i, ax in enumerate(axes.flat):
    sex = list(tempD2.keys())[ax_i]
    tempDF = tempDF1.loc[tempDF1['Sex']==sex]
    sns.boxplot(data=tempDF, x='Intervention', y='RCI', order=tempD1.keys(), palette=tempD1,
                dodge=False, showcaps=True, notch=True, showfliers=True,
                flierprops={'marker':'o', 'markerfacecolor':'gray', 'alpha':0.4}, ax=ax)
    #Set axis
    sns.despine()
    plt.setp(ax.get_xticklabels(), rotation=70, horizontalalignment='right',
             verticalalignment='center', rotation_mode='anchor')
    if ax_i==0:
        plt.setp(ax, xlabel='', ylabel='Module RCI')
    else:
        plt.setp(ax.get_yticklabels(), visible=False)
        plt.setp(ax, xlabel='', ylabel='')
    #Add significance labels
    lines = ax.get_lines()#Line2D: [[Q1, Q1-1.5IQR], [Q3, Q3+1.5IQR], [Q1, Q1], [Q3, Q3], [Med, Med], [flier]]
    lines_unit = 5 + int(True)#showfliers=True
    tempDF = tempDF2.loc[tempDF2['Sex']==sex]
    for row_i in range(len(tempDF)):
        #Control
        group_0 = tempDF['Control'].iloc[row_i]
        index_0 = list(tempD1.keys()).index(group_0)
        whisker_0 = lines[index_0*lines_unit + 1]
        xcoord_0 = whisker_0.get_xdata(orig=False)[1]#Q3+1.5IQR
        #ycoord_0 = whisker_0._y[1]#Q3+1.5IQR
        #Contrast
        group_1 = tempDF['Contrast'].iloc[row_i]
        index_1 = list(tempD1.keys()).index(group_1)
        whisker_1 = lines[index_1*lines_unit + 1]
        xcoord_1 = whisker_1.get_xdata(orig=False)[1]#Q3+1.5IQR
        #ycoord_1 = whisker_1._y[1]#Q3+1.5IQR
        #Standard point of marker
        xcoord = (xcoord_0+xcoord_1)/2
        #ycoord = max(ycoord_0, ycoord_1)
        ycoord = aline_ymin + aline_ymargin*row_i
        label = tempDF['SignifLabel'].iloc[row_i]
        #Add annotation lines
        aline_offset = yinter/10
        aline_length = yinter/10 + aline_offset
        ax.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
                [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
                lw=1.5, c='k')
        #Add annotation text
        if label in ['***', '**', '*']:
            text_offset = yinter/25
            ax.annotate(label, xy=(xcoord, ycoord+text_offset),
                        horizontalalignment='center', verticalalignment='bottom',
                        fontsize='medium', color='k')
        else:
            text_offset = yinter/5
            ax.annotate(label, xy=(xcoord, ycoord+text_offset),
                        horizontalalignment='center', verticalalignment='bottom',
                        fontsize='x-small', color='k')
    #Add facet title
    ax.set_title(tempD2[sex], {'fontsize':'medium'})
fig.tight_layout()
##Save
fileDir = './ExportFigures/'
ipynbName = '230628_LC-M001-CommunBiol-2ndRevision_Sex-stratified-M001-DIRAC-GOBP_'
fileName = 'RCI-general-pattern_boxplot.pdf'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04, transparent=True)
plt.show()
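
The conversion from adjusted P-value to significance label used in the cell above can be isolated as a small function. This is a sketch with the same thresholds and rounding; the function name `pval_to_label` is hypothetical and the P-values are made up.

```python
from decimal import Decimal, ROUND_HALF_UP

def pval_to_label(pval):
    """Return asterisks for significant P-values, otherwise 'P=...' with two decimals."""
    if pval<0.001:
        return '***'
    elif pval<0.01:
        return '**'
    elif pval<0.05:
        return '*'
    #Round half up via Decimal; the built-in round() rounds halves to the even digit
    pval_text = Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
    return r'$P$='+str(pval_text)

labels = [pval_to_label(pval) for pval in [0.0004, 0.004, 0.04, 0.125]]
print(labels)
```

The last value shows why Decimal is used: 0.125 becomes '0.13' with ROUND_HALF_UP, whereas `round(0.125, 2)` returns 0.12.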

## 3. Reproduce the plot DF for Source Data

### 3-1. RCI boxplot (Supplementary Fig. 1a)

> The statistical test summary is forcibly merged (concatenated side by side with the plot data) so that both are included in the same sheet.  

In [None]:
#Prepare RCI DF for plot
tempDF1 = rciDF_kk.sort_index(ascending=True)#Just for appearance purpose
tempDF1 = tempDF1.reset_index().melt(var_name='Group', value_name='RCI', id_vars='ModuleID')
display(tempDF1)

#Reproduce the statistical test summary without general statistics
tempDF = statDF.loc[:, statDF.columns.str.contains('-vs-.+_AdjPval$', regex=True)]
tempL1 = tempDF.columns.str.replace('_AdjPval', '')
module = 'All'
tempL2 = []
for comparison in tempL1:
    tempS = statDF.loc[module, statDF.columns.str.contains(comparison)]
    tempS.index = tempS.index.str.replace(comparison+'_', '')
    tempS.name = comparison
    tempL2.append(tempS)
tempDF2 = pd.concat(tempL2, axis=1)
tempDF2 = tempDF2.T
tempDF2.index.rename('ComparisonLabel', inplace=True)
tempDF2 = tempDF2.reset_index()
display(tempDF2)

#Forcibly merge
tempDF1[''] = np.nan#Add a dummy column for spacing
tempDF = pd.concat([tempDF1, tempDF2], axis=1)
display(tempDF)
display(tempDF.describe(include='all'))
display(tempDF['Group'].value_counts())

#Save
fileDir = './ExportData/'
ipynbName = '230628_LC-M001-CommunBiol-2ndRevision_Sex-stratified-M001-DIRAC-GOBP_'
fileName = 'FigS1a.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=False)
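
The side-by-side merge above relies on index alignment in `pd.concat(..., axis=1)`: both tables carry a default integer index, so the shorter summary table is padded with NaN below its last row. A minimal sketch with made-up values (`plotDF`, `testDF`, and `sheetDF` are illustrative names):

```python
import numpy as np
import pandas as pd

#Toy plot data (long format) and toy test summary; all numbers are made up
plotDF = pd.DataFrame({'ModuleID': ['GO:1', 'GO:2', 'GO:1', 'GO:2'],
                       'Group': ['Ctrl_F', 'Ctrl_F', 'Aca_F', 'Aca_F'],
                       'RCI': [0.90, 0.85, 0.95, 0.88]})
testDF = pd.DataFrame({'ComparisonLabel': ['Aca_F-vs-Ctrl_F'],
                       'Ustat': [3.0], 'Pval': [0.67], 'AdjPval': [0.67]})

plotDF[''] = np.nan#Empty-named dummy column as a visual spacer
sheetDF = pd.concat([plotDF, testDF], axis=1)
print(sheetDF)
```

This only works as intended when both tables have the same kind of index (here a fresh RangeIndex after reset_index); otherwise the rows of the summary would be placed by label rather than at the top of the sheet.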

# — End of notebook —