# Weighted Gene Co-expression Network Analysis of LC M001 Liver Proteomics — Analysis of WGCNA Modules

***by Kengo Watanabe***  

This Jupyter Notebook (with Python 3 kernel) performs the module analysis part of the weighted gene co-expression network analysis (WGCNA; Langfelder, P. & Horvath, S. BMC Bioinform. 2008) on the preprocessed Longevity Consortium (LC) M001 proteomics dataset (adjusted for sex and age; analytes with 50% or less missingness).  

Input files:  
- Preprocessed analyte data: 230303_LC-M001-proteomics-WGCNA-ver2_Annotation_processed-data-for-WGCNA.tsv  
- Module eigengene data: 230303_LC-M001-proteomics-WGCNA-ver2_WGCNA-module_module-eigengenes.tsv  
- Module–analyte metadata: 230303_LC-M001-proteomics-WGCNA-ver2_WGCNA-module_module-assignments.tsv  
- Sample–mouse metadata: 230213_LC-M001-proteomics-DIRAC-ver7-2_Preprocessing_sample-metadata.tsv  
- Analyte metadata: 230303_LC-M001-proteomics-WGCNA-ver2_Annotation_analyte-metadata_UniProt.tsv  

Output figures and tables:  
- Figure 3b, c, e  
- Supplementary Figure 2a, b, d  
- Supplementary Data 2  

Original notebook (memo for my future tracing):  
- dalek:\[JupyterLab HOME\]/230303_LC-M001-proteomics-WGCNA-ver2/230306_LC-M001-proteomics-WGCNA-ver2_Analysis.ipynb  

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#For Arial font
#!conda install -c conda-forge -y mscorefonts
##-> The below was also needed in matplotlib 3.4.2
#import shutil
#import matplotlib
#shutil.rmtree(matplotlib.get_cachedir())
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display
import time
#For exporting .pdf file with editable text
import matplotlib
matplotlib.rcParams['pdf.fonttype']=42
matplotlib.rcParams['ps.fonttype']=42

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats import weightstats
from statsmodels.stats import multitest as multi
import sys
from decimal import Decimal, ROUND_HALF_UP

!conda list


In [None]:
#Custom function for the P-value conversion
def convert_pval(pval):
    #This function converts P-value to its annotation for a figure.
    #Of note, this function also handles a P-value of exactly zero, which occurs when the true value underflows the floating-point range.
    #Input: P-value derived from a statistical test (float)
    #Output: P-value annotation (string)
    
    #import sys
    #from decimal import Decimal, ROUND_HALF_UP
    
    #Check the input
    if isinstance(pval, str):
        text = 'Check the input object type!'
        print(text)
        return text
    if pval<0 or pval>1:
        text = 'Check whether the input value was P-value!'
        print(text)
        return text
    
    below_limit = 0#Initialize
    if pval==1.0:
        pval_text = '1.0'
    else:
        if pval==0.0:#Due to smaller than the system float minimum
            pval = sys.float_info.min
            below_limit = 1#Update
        pval_text = f'{Decimal(str(pval)):.3E}'#Keep extra digits here; half-up rounding to one decimal is applied below
        significand, exponent = pval_text.split(sep='E-')
        significand = str(Decimal(significand).quantize(Decimal('0.1'), rounding=ROUND_HALF_UP))
        if significand=='10.0':
            significand = '1.0'
            exponent = str(int(exponent)-1)
        if int(exponent)>2:
            pval_text = significand+r'$\times$'+'10'+r'$^{{-{0}}}$'.format(exponent)##Font is different in r'$ $' but \mathrm{\mathsf{{0}} doesn't work...
        elif int(exponent)>0:
            pval_text = '0.'+'0'*(int(exponent)-1)+significand.replace('.', '')
        else:
            pval_text = significand
    if below_limit==1:
        text = '<'+pval_text
    else:
        text = '='+pval_text
    
    return text
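
The function above relies on `Decimal` with `ROUND_HALF_UP` rather than the built-in `round()`. As a minimal sketch of why (the helper `round_half_up` is hypothetical and only for illustration), the two approaches disagree on exact ties:

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, digits='0.1'):
    """Round half-up to the precision given by `digits` (hypothetical helper)."""
    return Decimal(str(value)).quantize(Decimal(digits), rounding=ROUND_HALF_UP)

# Built-in round() sends an exact tie to the nearest even digit
builtin_result = round(0.25, 1)
# Decimal with ROUND_HALF_UP sends the same tie upward
half_up_result = round_half_up(0.25)
print(builtin_result, half_up_result)  # 0.2 0.3
```

Half-up rounding matches the convention most readers expect when a P-value is printed on a figure.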

## 1. Prepare dataset and metadata

### 1-1. Analyte data

In [None]:
#Import analyte data
fileDir = './ExportData/'
ipynbName = '230303_LC-M001-proteomics-WGCNA-ver2_Annotation_'
fileName = 'processed-data-for-WGCNA.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('UniProtID')
display(tempDF)

dataDF = tempDF

### 1-2. Module eigengene

In [None]:
#Import module eigengene values
fileDir = './ExportData/'
ipynbName = '230303_LC-M001-proteomics-WGCNA-ver2_WGCNA-module_'
fileName = 'module-eigengenes.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('SampleID')
tempDF = tempDF.T
tempDF.index.name = 'ModuleID'
print('Original:', tempDF.shape)

#Eliminate the dummy module
tempDF = tempDF.loc[tempDF.index!='Grey']
print('Cleaned:', tempDF.shape)

display(tempDF)

eigenDF = tempDF

### 1-3. Module–analyte metadata, including intramodular connectivity

In [None]:
#Import module-analyte metadata
fileDir = './ExportData/'
ipynbName = '230303_LC-M001-proteomics-WGCNA-ver2_WGCNA-module_'
fileName = 'module-assignments.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('UniProtID')
print('Original:', tempDF.shape)
print(' - Unique analytes with module:', len(tempDF.index.unique()))
print(' - Unique modules with analytes:', len(tempDF['ModuleID'].unique()))

#Eliminate the dummy module
tempDF = tempDF.loc[tempDF['ModuleID']!='Grey']
print('Cleaned:', tempDF.shape)
print(' - Unique analytes with module:', len(tempDF.index.unique()))
print(' - Unique modules with analytes:', len(tempDF['ModuleID'].unique()))
display(tempDF)

#Generate module metadata
tempS = tempDF['ModuleID'].value_counts()
tempS.name = 'nAnalytes'
tempS.index.name = 'ModuleID'
tempDF1 = tempDF.reset_index()[['ModuleID', 'ModuleNumber']].drop_duplicates(keep='first')
tempDF1['ModuleName'] = tempDF1['ModuleID']+' WGCNA-module'
tempDF1 = tempDF1.set_index('ModuleID')
tempDF1 = pd.merge(tempDF1, tempS, left_index=True, right_index=True, how='right')#Order by nAnalytes
display(tempDF1)

analyteDF = tempDF
moduleDF = tempDF1
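
The module metadata step above can be checked on a toy table. This is a sketch with hypothetical analyte IDs and module sizes (not real data); it shows that the right merge keeps the size-sorted order produced by `value_counts()`:

```python
import pandas as pd

# Toy module assignments (hypothetical analyte IDs; not real data)
toyDF = pd.DataFrame({
    'UniProtID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'ModuleID': ['Brown', 'Brown', 'Blue', 'Blue', 'Blue', 'Grey'],
    'ModuleNumber': [2, 2, 1, 1, 1, 0],
}).set_index('UniProtID')

# Drop the Grey (unassigned) module
toyDF = toyDF.loc[toyDF['ModuleID'] != 'Grey']

# Module sizes, sorted from largest to smallest by value_counts()
sizeS = toyDF['ModuleID'].value_counts()
sizeS.name = 'nAnalytes'
sizeS.index.name = 'ModuleID'

# One row per module; the right merge follows the order of sizeS
toy_moduleDF = toyDF.reset_index()[['ModuleID', 'ModuleNumber']].drop_duplicates(keep='first')
toy_moduleDF = toy_moduleDF.set_index('ModuleID')
toy_moduleDF = pd.merge(toy_moduleDF, sizeS, left_index=True, right_index=True, how='right')
print(toy_moduleDF)
```

Although Brown appears first in the toy input, Blue comes first in the result because it contains more analytes.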

### 1-4. Sample–mouse metadata

In [None]:
#Import sample-mouse metadata
fileDir = '../230206_LC-M001-proteomics-DIRAC-ver7/ExportData/'
ipynbName = '230213_LC-M001-proteomics-DIRAC-ver7-2_Preprocessing_'
fileName = 'sample-metadata.tsv'
tempDF = pd.read_csv(fileDir+ipynbName+fileName, sep='\t')
tempDF = tempDF.loc[tempDF['SampleID'].isin(dataDF.columns)]
tempDF = tempDF.set_index('SampleID')

#Prepare phenotype labels
tempDF['Phenotype'] = tempDF['Intervention']#Sex-pooled
display(tempDF)
display(tempDF['Phenotype'].value_counts())

sampleDF = tempDF

## 2. Module eigengene comparison

> Test specific hypothesis: control ME == intervention ME per module (i.e., inter-group module comparison).  
> 1. Testing the main effect of intervention on module eigengenes (MEs) for each module using ANOVA model  
> 2. Then, performing post-hoc comparisons of MEs between control vs. each intervention using Welch's t-tests  
>  
> Sex and its interaction term are NOT included in the ANOVA model because the dataset was already adjusted for sex and age. Because the post-hoc comparisons (2) address the effect of each intervention within a specific module, their P-values are adjusted across interventions within that module only (not across modules).  
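
The two-step procedure can be sketched for a single module as follows. The eigengene values are synthetic (drawn at random, with only the Rapa group shifted), and names such as `toyDF` and `posthocDF` are hypothetical; this illustrates the workflow and is not the analysis itself:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats import weightstats
from statsmodels.stats import multitest as multi

# Synthetic eigengene values for one hypothetical module (not real data)
rng = np.random.default_rng(0)
groups = ['Ctrl', 'Aca', '17aE2', 'Rapa']
shifts = {'Ctrl': 0.0, 'Aca': 0.0, '17aE2': 0.0, 'Rapa': 5.0}  # only Rapa is shifted
toyDF = pd.DataFrame({
    'Intervention': np.repeat(groups, 10),
    'ME': np.concatenate([rng.normal(shifts[g], 1.0, 10) for g in groups]),
})

# Step 1: one-way ANOVA for the main effect of intervention
model = smf.ols('ME ~ C(Intervention)', data=toyDF).fit()
anovaDF = anova_lm(model, typ=2)
anova_pval = anovaDF.at['C(Intervention)', 'PR(>F)']

# Step 2: two-sided Welch's t-test for each intervention vs. control
ctrl = toyDF.loc[toyDF['Intervention'] == 'Ctrl', 'ME']
rows = {}
for contrast in groups[1:]:
    treated = toyDF.loc[toyDF['Intervention'] == contrast, 'ME']
    tstat, pval, dof = weightstats.ttest_ind(treated, ctrl,
                                             alternative='two-sided', usevar='unequal')
    rows[contrast + '-vs-Ctrl'] = {'DoF': dof, 'tStat': tstat, 'Pval': pval}
posthocDF = pd.DataFrame.from_dict(rows, orient='index')

# Benjamini-Hochberg adjustment across the three comparisons of this module only
posthocDF['AdjPval'] = multi.multipletests(posthocDF['Pval'], alpha=0.05,
                                           method='fdr_bh')[1]
print('ANOVA P =', anova_pval)
print(posthocDF)
```

Passing the intervention group first and the control second makes a positive t-statistic mean "higher than control".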

### 2-1. ANOVA test (ME ~ Intervention), followed by Welch's t-tests (Intervention)

#### 2-1-1. Perform all statistical tests

In [None]:
tempL1 = ['Ctrl', 'Aca', '17aE2', 'Rapa']#Target sample groups to be assessed
control = 'Ctrl'#For post-hoc comparisons
tempDF1 = eigenDF
tempDF2 = sampleDF.loc[sampleDF['Phenotype'].isin(tempL1)]
tempI = moduleDF.index
formula = 'ME ~ C(Intervention)'
tempL2 = ['C(Intervention)']#For variables of interest in ANOVA

#Statistical tests per module
t_start = time.time()
tempL3 = []#For ANOVA table
tempL4 = []#For post-hoc test table
for module in tempI.tolist():
    #Select the target module
    tempS = tempDF1.loc[module]
    tempS.name = 'ME'
    #Add metadata while selecting the target samples
    tempDF = pd.merge(tempS, tempDF2, left_index=True, right_index=True, how='inner')
    
    #ANOVA
    model = smf.ols(formula, data=tempDF).fit()
    anovaDF = anova_lm(model, typ=2)#ANOVA type doesn't matter for a single-factor model
    ##Take the results per variable
    tempDF3 = pd.DataFrame(columns=['DoF', 'Fstat', 'Pval'])
    for variable in tempL2:
        dof1 = int(anovaDF.at[variable, 'df'])#Between-groups
        dof2 = int(anovaDF.at['Residual', 'df'])#Within-groups
        dof = (dof1, dof2)
        fstat = anovaDF.at[variable, 'F']
        pval = anovaDF.at[variable, 'PR(>F)']
        tempDF3.loc[variable] = [dof, fstat, pval]
    tempDF3['AdjPval'] = 1.0#Add dummy column for now
    ##Convert to wide-format
    tempS = pd.Series(len(tempDF), index=['N'], name=module)
    tempL = [tempS]
    for variable in tempDF3.index.tolist():
        tempS = tempDF3.loc[variable]
        tempS.index = variable+'_'+tempS.index
        tempS.name = module
        tempL.append(tempS)
    tempS = pd.concat(tempL, axis=0)
    tempL3.append(tempS)
    
    #Post-hoc tests per control vs. contrast
    tempDF4 = pd.DataFrame(columns=['DoF', 'tStat', 'Pval'])
    for contrast in tempL1:
        if control!=contrast:
            tempS1 = tempDF['ME'].loc[tempDF['Phenotype']==control]
            tempS2 = tempDF['ME'].loc[tempDF['Phenotype']==contrast]
            #Two-sided Welch's t-test
            tstat, pval, dof = weightstats.ttest_ind(tempS2, tempS1,#t-statistic reflects direction from the baseline
                                                     alternative='two-sided', usevar='unequal')
            tempDF4.loc[contrast+'-vs-'+control] = [dof, tstat, pval]
    ##P-value adjustment across all comparisons per module by using Benjamini–Hochberg method
    tempDF4['AdjPval'] = multi.multipletests(tempDF4['Pval'], alpha=0.05, method='fdr_bh',
                                             is_sorted=False, returnsorted=False)[1]
    ##Convert to wide-format
    tempL = []
    for comparison in tempDF4.index.tolist():
        tempS = tempDF4.loc[comparison]
        tempS.index = comparison+'_'+tempS.index
        tempS.name = module
        tempL.append(tempS)
    tempS = pd.concat(tempL, axis=0)
    tempL4.append(tempS)
t_elapsed = time.time() - t_start
print('Elapsed time for', len(tempI), 'ANOVA and',
      (len(tempL1)-1)*len(tempI), 'post-hoc tests (',
      len(tempL1)-1, 'comparisons x', len(tempI), 'modules):',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Generate ANOVA table
tempDF3 = pd.concat(tempL3, axis=1).T
tempDF3.index.name = tempI.name
##P-value adjustment across all tests by using Benjamini–Hochberg method
for variable in tempL2:
    #Overwrite the dummy values
    tempDF3[variable+'_AdjPval'] = multi.multipletests(tempDF3[variable+'_Pval'], alpha=0.05, method='fdr_bh',
                                                       is_sorted=False, returnsorted=False)[1]
##Convert back dtypes (due to the forced change during wide-format)
for col_n in tempDF3.columns.tolist():
    if 'N'==col_n:
        tempDF3[col_n] = tempDF3[col_n].astype(int)
    elif 'DoF' in col_n:
        tempDF3[col_n] = tempDF3[col_n].astype(str)
    else:
        tempDF3[col_n] = tempDF3[col_n].astype(float)
##Rename columns (because only one variable in this case)
tempDF3.columns = 'ANOVA_'+tempDF3.columns.str.replace('^.*_', '', regex=True)
display(tempDF3)

#Generate post-hoc test table
tempDF4 = pd.concat(tempL4, axis=1).T
tempDF4.index.name = tempI.name
display(tempDF4)

statDF1 = tempDF3
statDF2 = tempDF4
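
Both adjustment scopes used above (across modules for the ANOVA P-values, within a module for the post-hoc P-values) rely on the Benjamini-Hochberg procedure. As a minimal sketch with made-up P-values, the hand-written helper `bh_adjust` (hypothetical, for illustration only) reproduces what `multipletests(..., method='fdr_bh')` returns:

```python
import numpy as np
from statsmodels.stats import multitest as multi

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values (hypothetical helper for illustration)."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    # Scale each sorted P-value by n / rank
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

pvals = [0.01, 0.04, 0.03]  # made-up values
manual = bh_adjust(pvals)
reference = multi.multipletests(pvals, alpha=0.05, method='fdr_bh')[1]
print(manual)
print(reference)
```

For these three values, the adjusted P-values are 0.03, 0.04, and 0.04 in the original order.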

In [None]:
tempL1 = ['Ctrl', 'Aca', '17aE2', 'Rapa']#Target sample groups to be summarized
tempDF1 = eigenDF
tempDF2 = sampleDF.loc[sampleDF['Phenotype'].isin(tempL1)]

#Calculate general statistics per intervention group
tempL2 = []
for phenotype in tempL1:
    #Select the target samples
    tempL = tempDF2.loc[tempDF2['Phenotype']==phenotype].index.tolist()
    tempDF = tempDF1[tempL]
    #Calculate general statistics
    tempS1 = len(tempL) - tempDF.isnull().sum(axis=1)
    tempS1.name = phenotype+'_N'
    tempS2 = tempDF.mean(axis=1)
    tempS2.name = phenotype+'_MEmean'
    tempS3 = tempDF.sem(axis=1, ddof=1)
    tempS3.name = phenotype+'_MEsem'
    #Merge
    tempDF = pd.concat([tempS1, tempS2, tempS3], axis=1)
    tempL2.append(tempDF)
tempDF = pd.concat(tempL2, axis=1)
display(tempDF)

#Merge all the tables
print('General statistics table:', tempDF.shape)
print('ANOVA table:', statDF1.shape)
print('Post-hoc test table:', statDF2.shape)
tempDF = pd.concat([moduleDF['ModuleName'], tempDF, statDF1, statDF2], axis=1)

#Sort
tempDF = tempDF.sort_values(by='ANOVA_Pval', ascending=True)
display(tempDF)

#Save
fileDir = './ExportData/'
ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
fileName = 'inter-group-comparison_ME.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

statDF = tempDF
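
The per-group summary columns can be verified by hand on a toy table. The sketch below uses hypothetical sample IDs and values (not real data) and includes one missing value to show that N counts only the non-missing samples:

```python
import numpy as np
import pandas as pd

# Toy table: rows are modules, columns are samples (hypothetical; not real data)
toy_eigenDF = pd.DataFrame({'S1': [1.0, 0.1], 'S2': [3.0, 0.3], 'S3': [np.nan, 0.5]},
                           index=pd.Index(['Blue', 'Brown'], name='ModuleID'))
samples = ['S1', 'S2', 'S3']  # samples of one group, e.g. Ctrl

subDF = toy_eigenDF[samples]
summaryDF = pd.concat([
    (len(samples) - subDF.isnull().sum(axis=1)).rename('Ctrl_N'),  # non-missing count
    subDF.mean(axis=1).rename('Ctrl_MEmean'),                       # NaN is skipped
    subDF.sem(axis=1, ddof=1).rename('Ctrl_MEsem'),                 # SD / sqrt(N)
], axis=1)
print(summaryDF)
```

For the Blue row, the two observed values 1.0 and 3.0 give N = 2, mean = 2.0, and SEM = 1.0.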

#### 2-1-2. Changed modules (ANOVA)

In [None]:
#Prepare variables in the model
#-> In this case, only the one variable (intervention) was included.
variableL = ['ANOVA']

#Changed modules
for variable in variableL:
    tempDF = statDF.loc[statDF[variable+'_AdjPval']<0.05]
    tempDF = tempDF.sort_values(by=variable+'_AdjPval', ascending=True)
    tempL1 = tempDF.loc[:, tempDF.columns.str.contains('_MEmean')].columns.tolist()
    tempL2 = tempDF.loc[:, tempDF.columns.str.contains('^'+variable+'_')].columns.tolist()
    tempDF = tempDF[[col_n for subL in [['ModuleName'], tempL1, tempL2] for col_n in subL]]
    print(variable+' (adjusted P < 0.05):', len(tempDF))
    display(tempDF)

#### 2-1-3. Changed modules by each intervention (Welch's t-test)

In [None]:
#Extract only the changed modules
variable = 'ANOVA'
tempDF = statDF.loc[statDF[variable+'_AdjPval']<0.05]
tempDF = tempDF.sort_values(by=variable+'_AdjPval', ascending=True)
print(variable+' (adjusted P < 0.05):', len(tempDF))

#Take adjusted P-value
tempDF1 = tempDF.loc[:, tempDF.columns.str.contains('-vs-.*_AdjPval$')]
tempDF1.columns = tempDF1.columns.str.replace('_AdjPval$', '', regex=True)
tempDF1 = pd.merge(tempDF[['ModuleName', variable+'_AdjPval']], tempDF1,
                   left_index=True, right_index=True, how='right')
print('Adjusted P-value:')
display(tempDF1)
display(tempDF1.describe())

#Take t-statistic (for direction)
tempDF2 = tempDF.loc[:, tempDF.columns.str.contains('-vs-.*_tStat$')]
tempDF2.columns = tempDF2.columns.str.replace('_tStat$', '', regex=True)
tempDF2 = pd.merge(tempDF[['ModuleName', variable+'_AdjPval']], tempDF2,
                   left_index=True, right_index=True, how='right')
print('Changed direction (t-statistic):')
display(tempDF2)
display(tempDF2.describe())

pvalDF = tempDF1
diffDF = tempDF2

### 2-2. Visualization: pointplot

#### 2-2-1. Modules changed by any intervention

In [None]:
target = 'ANOVA_AdjPval'
nPlots = 10
sort_var = target

#Prepare the target module set
tempL = statDF.loc[statDF[target]<0.05].index.tolist()
print(len(tempL), 'significantly changed modules based on '+target)

#Select representatives and sort the target modules
tempDF = statDF.loc[:, statDF.columns.str.contains('Pval$')]
tempDF = pd.merge(statDF['ModuleName'], tempDF, left_index=True, right_index=True, how='left')
tempDF = tempDF.loc[tempL].sort_values(by=sort_var, ascending=True)
topX = np.min([30, len(tempL)])
print('Top', topX, 'modules (sort by '+sort_var+'):')
display(tempDF.iloc[:topX])
plotL = tempDF.index.tolist()[:np.min([nPlots, len(tempL)])]

#Prepare ME DF for plot
tempDF1 = eigenDF.reset_index().melt(var_name='SampleID', value_name='ME', id_vars='ModuleID')
tempDF = sampleDF.reset_index()[['SampleID', 'Phenotype']]
tempDF1 = pd.merge(tempDF1, tempDF, on='SampleID', how='left')

#Prepare label and color
tempD0 = {'Ctrl':'Control', 'Aca':'Acarbose',
          '17aE2':'17'+r'$\alpha$'+'-Estradiol', 'Rapa':'Rapamycin'}
tempDF1['Group'] = tempDF1['Phenotype'].map(tempD0)
tempD1 = {'Control':'tab:blue', 'Acarbose':'tab:red',
          '17'+r'$\alpha$'+'-Estradiol':'tab:green', 'Rapamycin':'tab:purple'}

#Prepare P-value DF for plot
tempDF2 = statDF.loc[:, statDF.columns.str.contains('-vs-.*_AdjPval$')]
tempDF2.columns = tempDF2.columns.str.replace('_AdjPval$', '', regex=True)

#Visualize each module
for rank_i in range(len(plotL)):
    print(' - Rank '+str(rank_i+1)+' (sort by '+sort_var+'):')
    module = plotL[rank_i]
    #Check module summary
    tempDF = pd.DataFrame(moduleDF.loc[module]).T
    display(tempDF)
    
    #Select ME
    tempDF3 = tempDF1.loc[tempDF1['ModuleID']==module]
    
    #Check ME summary
    tempDF = tempDF3.groupby(['Group'])['ME'].agg(['count', 'mean', 'std'])
    tempL1 = []
    tempL2 = []
    for row_n in tempDF.index.tolist():
        count, mean, std = tempDF.loc[row_n]
        tempL1.append(mean - 1.96*std/np.sqrt(count))
        tempL2.append(mean + 1.96*std/np.sqrt(count))
    tempDF['0.025'] = tempL1
    tempDF['0.975'] = tempL2
    tempDF = tempDF.loc[list(tempD1.keys())]#Sort
    display(tempDF)
    
    #Prepare significance labels
    tempS = tempDF2.loc[module]
    tempS.name = 'AdjPval'
    ##Clean
    tempDF = tempS.index.to_series().str.split(pat='-vs-', expand=True)
    tempDF = tempDF.rename(columns={0:'Contrast', 1:'Baseline'})
    tempDF4 = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')
    tempDF4['Contrast'] = tempDF4['Contrast'].map(tempD0)
    tempDF4['Baseline'] = tempDF4['Baseline'].map(tempD0)
    ##Convert p-value to label
    tempL = []
    for row_i in range(len(tempDF4)):
        pval = tempDF4['AdjPval'].iloc[row_i]
        if pval<0.001:
            tempL.append('***')
        elif pval<0.01:
            tempL.append('**')
        elif pval<0.05:
            tempL.append('*')
        else:
            pval_text = Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
            tempL.append(r'$P$ = '+str(pval_text))
    tempDF4['SignifLabel'] = tempL
    display(tempDF4)
    
    #Visualization
    ymax = 0.4
    ymin = -0.4
    yinter = 0.2
    ymargin_t = 0.25
    ymargin_b = 0.05
    aline_ymin = 0.3
    aline_ymargin = 0.1
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(2, 4))
    sns.pointplot(data=tempDF3, x='Group', y='ME', order=list(tempD1.keys()), palette=tempD1,
                  markers='o', dodge=False, join=False, capsize=0.6, estimator=np.mean, ci=95)
    p = sns.stripplot(data=tempDF3, x='Group', y='ME',
                      order=list(tempD1.keys()), palette=tempD1, dodge=False, jitter=0.15,
                      size=5, edgecolor='black', linewidth=1, **{'marker':'o', 'alpha':0.5})
    ##Set axis
    sns.despine()
    p.set(ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax + yinter/10, yinter))
    plt.setp(p.get_xticklabels(), rotation=70, horizontalalignment='right',
             verticalalignment='center', rotation_mode='anchor')
    ##Add significance labels
    for row_i in range(len(tempDF4)):
        #Baseline
        group_0 = tempDF4['Baseline'].iloc[row_i]
        index_0 = list(tempD1.keys()).index(group_0)
        xcoord_0 = index_0
        #Contrast
        group_1 = tempDF4['Contrast'].iloc[row_i]
        index_1 = list(tempD1.keys()).index(group_1)
        xcoord_1 = index_1
        #Standard point of marker
        xcoord = (xcoord_0+xcoord_1)/2
        ycoord = aline_ymin + aline_ymargin*row_i
        label = tempDF4['SignifLabel'].iloc[row_i]
        #Add annotation lines
        aline_offset = yinter/5
        aline_length = yinter/5 + aline_offset/2
        plt.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
                 [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
                 lw=1.5, c='k')
        #Add annotation text
        if label in ['***', '**', '*']:
            text_offset = yinter/7
            p.annotate(label, xy=(xcoord, ycoord+text_offset),
                       horizontalalignment='center', verticalalignment='bottom',
                       fontsize='medium', color='k')
        else:
            text_offset = yinter/3.5
            p.annotate(label, xy=(xcoord, ycoord+text_offset),
                       horizontalalignment='center', verticalalignment='bottom',
                       fontsize='x-small', color='k')
    ##Set axis label and title
    plt.setp(p, xlabel='', ylabel='Module eigengene')
    p.set_title(module, {'fontsize':'medium'})
    ##Save
    fileDir = './ExportFigures/'
    ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
    fileName = 'ME-pointplot-'+module+'.pdf'
    plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04, transparent=True)
    plt.show()
    print('')
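
The summary table in the loop above uses the normal approximation (mean ± 1.96 × SEM), whereas `sns.pointplot` with `ci=95` draws a bootstrap interval, so the printed and plotted intervals can differ slightly. As a further point of comparison, the sketch below contrasts the normal approximation with a Student's t interval on synthetic values (not real data); the t interval is an assumed alternative, not something the notebook computes:

```python
import numpy as np
from scipy import stats

# Synthetic ME values for one hypothetical group (not real data)
rng = np.random.default_rng(1)
values = rng.normal(loc=0.0, scale=0.1, size=10)

n = len(values)
mean = values.mean()
sem = values.std(ddof=1) / np.sqrt(n)

# Normal approximation, as in the summary table
normal_ci = (mean - 1.96*sem, mean + 1.96*sem)
# Student's t interval with n-1 degrees of freedom (assumed alternative)
t_crit = stats.t.ppf(0.975, df=n-1)
t_ci = (mean - t_crit*sem, mean + t_crit*sem)
print('Normal approximation:', normal_ci)
print('t-based interval:', t_ci)
```

With ten samples per group the t critical value is about 2.26, so the t interval is roughly 15% wider than the normal approximation.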

#### 2-2-2. Unchanged modules

In [None]:
target = 'ANOVA_AdjPval'
nPlots = 10
sort_var = target

#Prepare the target module set
tempL = statDF.loc[statDF[target]>=0.05].index.tolist()
print(len(tempL), 'unchanged modules based on '+target)

#Select representatives and sort the target modules
tempDF = statDF.loc[:, statDF.columns.str.contains('Pval$')]
tempDF = pd.merge(statDF['ModuleName'], tempDF, left_index=True, right_index=True, how='left')
tempDF = tempDF.loc[tempL].sort_values(by=sort_var, ascending=True)
topX = np.min([30, len(tempL)])
print('Top', topX, 'modules (sort by '+sort_var+'):')
display(tempDF.iloc[:topX])
plotL = tempDF.index.tolist()[:np.min([nPlots, len(tempL)])]

#Prepare ME DF for plot
tempDF1 = eigenDF.reset_index().melt(var_name='SampleID', value_name='ME', id_vars='ModuleID')
tempDF = sampleDF.reset_index()[['SampleID', 'Phenotype']]
tempDF1 = pd.merge(tempDF1, tempDF, on='SampleID', how='left')

#Prepare label and color
tempD0 = {'Ctrl':'Control', 'Aca':'Acarbose',
          '17aE2':'17'+r'$\alpha$'+'-Estradiol', 'Rapa':'Rapamycin'}
tempDF1['Group'] = tempDF1['Phenotype'].map(tempD0)
tempD1 = {'Control':'tab:blue', 'Acarbose':'tab:red',
          '17'+r'$\alpha$'+'-Estradiol':'tab:green', 'Rapamycin':'tab:purple'}

#Prepare P-value DF for plot
tempDF2 = statDF.loc[:, statDF.columns.str.contains('-vs-.*_AdjPval$')]
tempDF2.columns = tempDF2.columns.str.replace('_AdjPval$', '', regex=True)

#Visualize each module
for rank_i in range(len(plotL)):
    print(' - Rank '+str(rank_i+1)+' (sort by '+sort_var+'):')
    module = plotL[rank_i]
    #Check module summary
    tempDF = pd.DataFrame(moduleDF.loc[module]).T
    display(tempDF)
    
    #Select ME
    tempDF3 = tempDF1.loc[tempDF1['ModuleID']==module]
    
    #Check ME summary
    tempDF = tempDF3.groupby(['Group'])['ME'].agg(['count', 'mean', 'std'])
    tempL1 = []
    tempL2 = []
    for row_n in tempDF.index.tolist():
        count, mean, std = tempDF.loc[row_n]
        tempL1.append(mean - 1.96*std/np.sqrt(count))
        tempL2.append(mean + 1.96*std/np.sqrt(count))
    tempDF['0.025'] = tempL1
    tempDF['0.975'] = tempL2
    tempDF = tempDF.loc[list(tempD1.keys())]#Sort
    display(tempDF)
    
    #Prepare significance labels
    tempS = tempDF2.loc[module]
    tempS.name = 'AdjPval'
    ##Clean
    tempDF = tempS.index.to_series().str.split(pat='-vs-', expand=True)
    tempDF = tempDF.rename(columns={0:'Contrast', 1:'Baseline'})
    tempDF4 = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')
    tempDF4['Contrast'] = tempDF4['Contrast'].map(tempD0)
    tempDF4['Baseline'] = tempDF4['Baseline'].map(tempD0)
    ##Convert p-value to label
    tempL = []
    for row_i in range(len(tempDF4)):
        pval = tempDF4['AdjPval'].iloc[row_i]
        if pval<0.001:
            tempL.append('***')
        elif pval<0.01:
            tempL.append('**')
        elif pval<0.05:
            tempL.append('*')
        else:
            pval_text = Decimal(str(pval)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
            tempL.append(r'$P$ = '+str(pval_text))
    tempDF4['SignifLabel'] = tempL
    display(tempDF4)
    
    #Visualization
    ymax = 0.4
    ymin = -0.4
    yinter = 0.2
    ymargin_t = 0.25
    ymargin_b = 0.05
    aline_ymin = 0.3
    aline_ymargin = 0.1
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(2, 4))
    sns.pointplot(data=tempDF3, x='Group', y='ME', order=list(tempD1.keys()), palette=tempD1,
                  markers='o', dodge=False, join=False, capsize=0.6, estimator=np.mean, ci=95)
    p = sns.stripplot(data=tempDF3, x='Group', y='ME',
                      order=list(tempD1.keys()), palette=tempD1, dodge=False, jitter=0.15,
                      size=5, edgecolor='black', linewidth=1, **{'marker':'o', 'alpha':0.5})
    ##Set axis
    sns.despine()
    p.set(ylim=(ymin-ymargin_b, ymax+ymargin_t), yticks=np.arange(ymin, ymax + yinter/10, yinter))
    plt.setp(p.get_xticklabels(), rotation=70, horizontalalignment='right',
             verticalalignment='center', rotation_mode='anchor')
    ##Add significance labels
    for row_i in range(len(tempDF4)):
        #Baseline
        group_0 = tempDF4['Baseline'].iloc[row_i]
        index_0 = list(tempD1.keys()).index(group_0)
        xcoord_0 = index_0
        #Contrast
        group_1 = tempDF4['Contrast'].iloc[row_i]
        index_1 = list(tempD1.keys()).index(group_1)
        xcoord_1 = index_1
        #Standard point of marker
        xcoord = (xcoord_0+xcoord_1)/2
        ycoord = aline_ymin + aline_ymargin*row_i
        label = tempDF4['SignifLabel'].iloc[row_i]
        #Add annotation lines
        aline_offset = yinter/5
        aline_length = yinter/5 + aline_offset/2
        plt.plot([xcoord_0, xcoord_0, xcoord_1, xcoord_1],
                 [ycoord+aline_offset, ycoord+aline_length, ycoord+aline_length, ycoord+aline_offset],
                 lw=1.5, c='k')
        #Add annotation text
        if label in ['***', '**', '*']:
            text_offset = yinter/7
            p.annotate(label, xy=(xcoord, ycoord+text_offset),
                       horizontalalignment='center', verticalalignment='bottom',
                       fontsize='medium', color='k')
        else:
            text_offset = yinter/3.5
            p.annotate(label, xy=(xcoord, ycoord+text_offset),
                       horizontalalignment='center', verticalalignment='bottom',
                       fontsize='x-small', color='k')
    ##Set axis label and title
    plt.setp(p, xlabel='', ylabel='Module eigengene')
    p.set_title(module, {'fontsize':'medium'})
    ##Save
    fileDir = './ExportFigures/'
    ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
    fileName = 'ME-pointplot-'+module+'.pdf'
    plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04, transparent=True)
    plt.show()
    print('')

## 3. Intramodular connectivity vs. intervention effect

> Test correlation between intramodular connectivity and intervention effect per module.  
> 1. Calculating the main effect of intervention on each protein of a module using ANOVA model  
> 2. Then, testing correlation between intramodular connectivity vs. the intervention effect per module using Spearman's correlation test  
>  
> Sex and its interaction term are NOT included in the ANOVA model because the dataset was already adjusted for sex and age. As in the previous version, the nominal P-value of the main effect is used as the intervention effect. Missing values are ignored in this version (i.e., each analyte is tested on its non-missing samples only).  
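
The correlation step can be sketched for one module as follows. The values are synthetic, the column name `kWithin` is a hypothetical stand-in for the intramodular connectivity column, and the −log10 transform of the nominal P-value is an assumption made for readability; because Spearman's test is rank-based, using the raw P-value only flips the sign of the coefficient:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic values for one hypothetical module (not real data)
rng = np.random.default_rng(0)
toyDF = pd.DataFrame({'kWithin': rng.uniform(1, 20, size=30)})  # hypothetical column
# Make better-connected proteins tend to have smaller nominal P-values
exponent = 0.2*toyDF['kWithin'] + rng.uniform(0, 2, size=30)
toyDF['ANOVA_Pval'] = 10.0 ** (-exponent)
toyDF['negLog10P'] = -np.log10(toyDF['ANOVA_Pval'])

# Spearman's correlation between connectivity and intervention effect
rho, pval = stats.spearmanr(toyDF['kWithin'], toyDF['negLog10P'])
# Same test on the raw P-value: identical magnitude, opposite sign
rho_raw, _ = stats.spearmanr(toyDF['kWithin'], toyDF['ANOVA_Pval'])
print('rho (-log10 P):', rho, 'P =', pval)
print('rho (raw P):', rho_raw)
```

A positive coefficient on the −log10 scale would mean that more highly connected proteins tend to show a stronger intervention effect.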

### 3-1. ANOVA test (Protein ~ Intervention)

> Although unnecessary for the current analysis, the P-value adjustment and the post-hoc comparisons are also included and the significantly changed proteins are checked.  

#### 3-1-1. Perform all statistical tests

In [None]:
tempL1 = ['Ctrl', 'Aca', '17aE2', 'Rapa']#Target sample groups to be assessed
control = 'Ctrl'#For post-hoc comparisons
tempDF1 = dataDF
tempDF2 = sampleDF.loc[sampleDF['Phenotype'].isin(tempL1)]
tempI = dataDF.index#Including the Grey module
formula = 'Analyte ~ C(Intervention)'
tempL2 = ['C(Intervention)']#For variables of interest in ANOVA

#Statistical tests per analyte
t_start = time.time()
tempL3 = []#For ANOVA table
tempL4 = []#For post-hoc test table
for analyte in tempI.tolist():
    #Select the target analyte
    tempS = tempDF1.loc[analyte]
    tempS.name = 'Analyte'
    tempS = tempS.dropna()#Eliminate missingness
    #Add metadata while selecting the target samples
    tempDF = pd.merge(tempS, tempDF2, left_index=True, right_index=True, how='inner')
    
    #ANOVA
    model = smf.ols(formula, data=tempDF).fit()
    anovaDF = anova_lm(model, typ=2)#ANOVA type doesn't matter for a single-factor model
    ##Take the results per variable
    tempDF3 = pd.DataFrame(columns=['DoF', 'Fstat', 'Pval'])
    for variable in tempL2:
        dof1 = int(anovaDF.at[variable, 'df'])#Between-groups
        dof2 = int(anovaDF.at['Residual', 'df'])#Within-groups
        dof = (dof1, dof2)
        fstat = anovaDF.at[variable, 'F']
        pval = anovaDF.at[variable, 'PR(>F)']
        tempDF3.loc[variable] = [dof, fstat, pval]
    tempDF3['AdjPval'] = 1.0#Add dummy column for now
    ##Convert to wide-format
    tempS = pd.Series(len(tempDF), index=['N'], name=analyte)
    tempL = [tempS]
    for variable in tempDF3.index.tolist():
        tempS = tempDF3.loc[variable]
        tempS.index = variable+'_'+tempS.index
        tempS.name = analyte
        tempL.append(tempS)
    tempS = pd.concat(tempL, axis=0)
    tempL3.append(tempS)
    
    #Post-hoc tests per control vs. contrast
    tempDF4 = pd.DataFrame(columns=['DoF', 'tStat', 'Pval'])
    for contrast in tempL1:
        if control!=contrast:
            tempS1 = tempDF['Analyte'].loc[tempDF['Phenotype']==control]
            tempS2 = tempDF['Analyte'].loc[tempDF['Phenotype']==contrast]
            #Two-sided Welch's t-test
            tstat, pval, dof = weightstats.ttest_ind(tempS2, tempS1,#t-statistic reflects direction from the baseline
                                                     alternative='two-sided', usevar='unequal')
            tempDF4.loc[contrast+'-vs-'+control] = [dof, tstat, pval]
    ##P-value adjustment across all comparisons per analyte by using Benjamini–Hochberg method
    tempDF4['AdjPval'] = multi.multipletests(tempDF4['Pval'], alpha=0.05, method='fdr_bh',
                                             is_sorted=False, returnsorted=False)[1]
    ##Convert to wide-format
    tempL = []
    for comparison in tempDF4.index.tolist():
        tempS = tempDF4.loc[comparison]
        tempS.index = comparison+'_'+tempS.index
        tempS.name = analyte
        tempL.append(tempS)
    tempS = pd.concat(tempL, axis=0)
    tempL4.append(tempS)
t_elapsed = time.time() - t_start
print('Elapsed time for', len(tempI), 'ANOVA and',
      (len(tempL1)-1)*len(tempI), 'post-hoc tests (',
      len(tempL1)-1, 'comparisons x', len(tempI), 'analytes):',
      round(t_elapsed//60), 'min', round(t_elapsed%60, 1), 'sec')

#Generate ANOVA table
tempDF3 = pd.concat(tempL3, axis=1).T
tempDF3.index.name = tempI.name
##P-value adjustment across all tests by using Benjamini–Hochberg method
for variable in tempL2:
    #Overwrite the dummy values
    tempDF3[variable+'_AdjPval'] = multi.multipletests(tempDF3[variable+'_Pval'], alpha=0.05, method='fdr_bh',
                                                       is_sorted=False, returnsorted=False)[1]
##Convert back dtypes (due to the forced change during wide-format)
for col_n in tempDF3.columns.tolist():
    if 'N'==col_n:
        tempDF3[col_n] = tempDF3[col_n].astype(int)
    elif 'DoF' in col_n:
        tempDF3[col_n] = tempDF3[col_n].astype(str)
    else:
        tempDF3[col_n] = tempDF3[col_n].astype(float)
##Rename columns (because only one variable in this case)
tempDF3.columns = 'ANOVA_'+tempDF3.columns.str.replace('^.*_', '', regex=True)
display(tempDF3)

#Generate post-hoc test table
tempDF4 = pd.concat(tempL4, axis=1).T
tempDF4.index.name = tempI.name
display(tempDF4)

statDF1 = tempDF3
statDF2 = tempDF4
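
The post-hoc step in the cell above can be sketched on synthetic data: a two-sided Welch's t-test of each intervention group against the control, followed by the Benjamini–Hochberg adjustment across the comparisons for a single analyte. The group labels mirror those used in this notebook, but the values, the random seed, and the helper name `welch_vs_control` are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats import weightstats
from statsmodels.stats import multitest as multi

def welch_vs_control(df, control, value_col='Analyte', group_col='Phenotype'):
    """Hypothetical helper: Welch's t-test of each group vs. control, then BH adjustment."""
    rows = {}
    x_control = df.loc[df[group_col]==control, value_col].dropna()
    for contrast in df[group_col].unique():
        if contrast==control:
            continue
        x_contrast = df.loc[df[group_col]==contrast, value_col].dropna()
        #Contrast first, so that a positive t-statistic means higher than the control
        tstat, pval, dof = weightstats.ttest_ind(x_contrast, x_control,
                                                 alternative='two-sided', usevar='unequal')
        rows[contrast+'-vs-'+control] = {'DoF':dof, 'tStat':tstat, 'Pval':pval}
    resDF = pd.DataFrame.from_dict(rows, orient='index')
    #Benjamini–Hochberg adjustment across the comparisons
    resDF['AdjPval'] = multi.multipletests(resDF['Pval'], alpha=0.05, method='fdr_bh')[1]
    return resDF

#Synthetic data (assumption): only 'Rapa' is shifted from the control
rng = np.random.default_rng(0)
demoDF = pd.DataFrame({
    'Phenotype':np.repeat(['Ctrl', 'Aca', '17aE2', 'Rapa'], 12),
    'Analyte':np.concatenate([rng.normal(0.0, 1.0, 12), rng.normal(0.0, 1.0, 12),
                              rng.normal(0.0, 1.0, 12), rng.normal(3.0, 1.0, 12)])})
resDF = welch_vs_control(demoDF, control='Ctrl')
print(resDF)
```

Because the contrast group is passed first, the sign of `tStat` gives the direction of change relative to the control, and the adjusted P-value is never smaller than the nominal one.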

In [None]:
tempL1 = ['Ctrl', 'Aca', '17aE2', 'Rapa']#Target sample groups to be summarized
tempDF1 = dataDF
tempDF2 = sampleDF.loc[sampleDF['Phenotype'].isin(tempL1)]

#Calculate general statistics per intervention group
tempL2 = []
for phenotype in tempL1:
    #Select the target samples
    tempL = tempDF2.loc[tempDF2['Phenotype']==phenotype].index.tolist()
    tempDF = tempDF1[tempL]
    #Calculate general statistics
    tempS1 = len(tempL) - tempDF.isnull().sum(axis=1)
    tempS1.name = phenotype+'_N'
    tempS2 = tempDF.mean(axis=1)
    tempS2.name = phenotype+'_Mean'
    tempS3 = tempDF.sem(axis=1, ddof=1)
    tempS3.name = phenotype+'_SEM'
    #Merge
    tempDF = pd.concat([tempS1, tempS2, tempS3], axis=1)
    tempL2.append(tempDF)
tempDF = pd.concat(tempL2, axis=1)
display(tempDF)

#Merge all the tables
print('General statistics table:', tempDF.shape)
print('ANOVA table:', statDF1.shape)
print('Post-hoc test table:', statDF2.shape)
##Re-prepare module-analyte metadata (including the Grey module)
fileDir = './ExportData/'
ipynbName = '230303_LC-M001-proteomics-WGCNA-ver2_WGCNA-module_'
fileName = 'module-assignments.tsv'
tempDF3 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('UniProtID')
tempDF = pd.concat([tempDF3[['GeneSymbol', 'ModuleID']], tempDF, statDF1, statDF2], axis=1)

#Sort
tempDF = tempDF.sort_values(by='ANOVA_Pval', ascending=True)
display(tempDF)

#Save
fileDir = './ExportData/'
ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
fileName = 'inter-group-comparison_analyte.tsv'
tempDF.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

statDF = tempDF
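
A small sketch of the general statistics calculated above, on a made-up analyte-by-sample matrix (the values and labels are assumptions). It illustrates that pandas skips missing values by default, so the SEM corresponds to the sample standard deviation divided by the square root of the number of non-missing samples.

```python
import numpy as np
import pandas as pd

#Synthetic analyte-by-sample matrix (assumption): 2 analytes x 4 samples with one missing value
demoDF = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                       [2.0, np.nan, 4.0, 6.0]],
                      index=['P1', 'P2'], columns=['s1', 's2', 's3', 's4'])

nS = demoDF.shape[1] - demoDF.isnull().sum(axis=1)#Number of non-missing samples
meanS = demoDF.mean(axis=1)#NaN is skipped by default
semS = demoDF.sem(axis=1, ddof=1)#NaN is skipped by default

#SEM equals the sample standard deviation divided by sqrt(N of non-missing samples)
manualS = demoDF.std(axis=1, ddof=1) / np.sqrt(nS)
print(pd.concat([nS.rename('N'), meanS.rename('Mean'), semS.rename('SEM')], axis=1))
```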

#### 3-1-2. Changed proteins (ANOVA)

In [None]:
#Prepare variables in the model
#-> In this case, only one variable (intervention) was included.
variableL = ['ANOVA']

#Prepare the number of members (including Grey module)
tempS1 = statDF['ModuleID'].value_counts()
tempS1.name = 'nAnalytes'
tempS1.index.name = 'ModuleID'

#Changed analytes
for variable in variableL:
    tempDF = statDF.loc[statDF[variable+'_AdjPval']<0.05]
    tempDF = tempDF.sort_values(by=variable+'_AdjPval', ascending=True)
    tempL1 = tempDF.loc[:, tempDF.columns.str.contains('_Mean')].columns.tolist()
    tempL2 = tempDF.loc[:, tempDF.columns.str.contains('^'+variable+'_')].columns.tolist()
    tempDF = tempDF[[col_n for subL in [['GeneSymbol', 'ModuleID'], tempL1, tempL2] for col_n in subL]]
    print(variable+' (adjusted P < 0.05):', len(tempDF))
    
    #Summarize the number of changed analytes per module
    tempS2 = tempDF['ModuleID'].value_counts()
    tempS2.name = 'nChanged'
    tempS2.index.name = 'ModuleID'
    tempDF1 = pd.merge(tempS1, tempS2, left_index=True, right_index=True, how='left')
    tempDF1['nChanged[%]'] = tempDF1['nChanged'] / tempDF1['nAnalytes'] * 100
    display(tempDF1)
    
    for module in tempDF1.index:
        tempDF2 = tempDF.loc[tempDF['ModuleID']==module]
        print(module+' (top 15)')
        display(tempDF2.iloc[:15])
        print('')
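
In the summary table of the cell above, a module without any changed analyte shows NaN in `nChanged` because of the left join. The sketch below reproduces the counting on made-up data and fills such modules with 0; the filling is an optional variation, not what this notebook does, and all values are assumptions.

```python
import pandas as pd

#Hypothetical module assignments and adjusted P-values (not from this dataset)
demoDF = pd.DataFrame({'ModuleID':['Blue', 'Blue', 'Blue', 'Pink', 'Pink', 'Grey'],
                       'ANOVA_AdjPval':[0.001, 0.20, 0.03, 0.04, 0.60, 0.90]},
                      index=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])

nAllS = demoDF['ModuleID'].value_counts().rename('nAnalytes')
nChangedS = demoDF.loc[demoDF['ANOVA_AdjPval']<0.05, 'ModuleID'].value_counts().rename('nChanged')

#Aligning on ModuleID keeps every module; a module without changed analytes gets NaN
countDF = pd.concat([nAllS, nChangedS], axis=1)
#Optional variation: fill the missing counts with 0
countDF['nChanged'] = countDF['nChanged'].fillna(0).astype(int)
countDF['nChanged[%]'] = countDF['nChanged'] / countDF['nAnalytes'] * 100
print(countDF)
```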

#### 3-1-3. Changed proteins by each intervention (Welch's t-test)

In [None]:
#Extract only the changed analytes
variable = 'ANOVA'
tempDF = statDF.loc[statDF[variable+'_AdjPval']<0.05]
tempDF = tempDF.sort_values(by=variable+'_AdjPval', ascending=True)
print(variable+' (adjusted P < 0.05):', len(tempDF))

#Take adjusted P-value
tempDF1 = tempDF.loc[:, tempDF.columns.str.contains('-vs-.*_AdjPval$')]
tempDF1.columns = tempDF1.columns.str.replace('_AdjPval$', '', regex=True)

#Count the changed analytes
tempDF2 = (tempDF1<0.05)
tempS = tempDF2.sum(axis=0)
display(tempS)
##Count the changed analytes per module
tempDF2 = pd.merge(tempDF[['ModuleID', variable+'_AdjPval']], tempDF2,
                   left_index=True, right_index=True, how='right')
tempDF2[variable+'_AdjPval'] = (tempDF2[variable+'_AdjPval']<0.05)
tempDF2 = tempDF2.groupby('ModuleID').sum()
display(tempDF2)

#Check
tempDF2 = pd.merge(tempDF[['GeneSymbol', 'ModuleID', variable+'_AdjPval']], tempDF1,
                   left_index=True, right_index=True, how='right')
for comparison in tempDF1.columns:
    tempDF3 = tempDF2.sort_values(by=comparison, ascending=True)
    print(comparison+' (top 15)')
    display(tempDF3.iloc[:15])
    print('')

### 3-2. Spearman's correlation

> Although unnecessary for the current analysis, the P-value adjustment is also included.  
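
A note on the test used in this section: Spearman's correlation is rank-based, so the log-transformation of the ANOVA P-value affects only the sign of ρ, not its magnitude or its P-value. A minimal sketch on synthetic data (the values, the sample size, and the random seed are assumptions, not taken from this dataset):

```python
import numpy as np
import scipy.stats as stats

#Synthetic data (assumption): connectivity and ANOVA P-values for one hypothetical module
rng = np.random.default_rng(1)
connectivity = rng.uniform(0, 10, size=50)
pvals = rng.uniform(1e-6, 1, size=50)

#Spearman's correlation with the raw and the log-transformed P-values
rho_raw, pval_raw = stats.spearmanr(connectivity, pvals)
rho_log, pval_log = stats.spearmanr(connectivity, -np.log10(pvals))
dof = len(connectivity) - 2

#Rank-based: a decreasing monotonic transformation only flips the sign of rho
print(rho_raw, rho_log, pval_log, dof)
```

With the −log10 scale, a positive ρ means that more connected analytes tend to have smaller ANOVA P-values.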

In [None]:
#Prepare X and Y variables
xvar = 'IntramodularConnectivity'
tempDF1 = analyteDF#w/o the Grey module
yvar = 'ANOVA_Pval'
tempS = statDF[yvar]#Nominal P-value as the intervention effect
tempS = np.log10(tempS) * (-1)#Log-transformation
tempDF1 = pd.merge(tempDF1, tempS, left_index=True, right_index=True, how='left')
tempI = moduleDF.index#w/o the Grey module
topQ = 0.9#Top hubs used in the enrichment analysis

#Statistical test per module
tempDF2 = pd.DataFrame(columns=['N', 'DoF', 'Rho', 'Pval'])
for module in tempI.tolist():
    #Select the target module
    tempDF = tempDF1.loc[tempDF1['ModuleID']==module]
    #Spearman's correlation
    rho, pval = stats.spearmanr(tempDF[xvar], tempDF[yvar])
    size = len(tempDF)
    dof = size - 2
    tempDF2.loc[module] = [size, dof, rho, pval]
##P-value adjustment by using Benjamini–Hochberg method
tempDF2['AdjPval'] = multi.multipletests(tempDF2['Pval'], alpha=0.05, method='fdr_bh',
                                         is_sorted=False, returnsorted=False)[1]
tempDF2.index.name = tempI.name
tempDF2['N'] = tempDF2['N'].astype('int64')#Otherwise, float64!
tempDF2['DoF'] = tempDF2['DoF'].astype('int64')#Otherwise, float64!
display(tempDF2)
##Save
fileDir = './ExportData/'
ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
fileName = 'connectivity-vs-intervention.tsv'
tempDF2.to_csv(fileDir+ipynbName+fileName, sep='\t', index=True)

#Visualization per module
for module in tempI.tolist():
    #Select the target module
    tempDF = tempDF1.loc[tempDF1['ModuleID']==module]
    
    #Scatterplot with regression line
    sns.set(style='ticks', font='Arial', context='talk')
    plt.figure(figsize=(3.5, 3.5))
    p = sns.regplot(data=tempDF, x=xvar, y=yvar, color='k',
                    scatter=True, fit_reg=True, ci=95, truncate=False, marker='o',
                    scatter_kws={'alpha':0.2, 'edgecolor':'k', 's':15})
    #Highlight top hubs
    p.axvspan(xmin=tempDF[xvar].quantile(q=topQ), xmax=tempDF[xvar].max(),
              facecolor='orange', alpha=0.2, zorder=0)
    #Annotate Spearman's correlation
    rho = tempDF2.loc[module, 'Rho']
    rho_text = str(Decimal(str(rho)).quantize(Decimal('0.001'), rounding=ROUND_HALF_UP))
    pval = tempDF2.loc[module, 'Pval']#Nominal P-value
    pval_text = convert_pval(pval).replace('=', '= ').replace('<', '< ')#Add a white space
    text = 'Spearman\'s '+r'$\rho$'+' = '+rho_text+'\n'+r'$P$'+' '+pval_text
    p.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction',
               horizontalalignment='left', verticalalignment='top',
               multialignment='left', fontsize='small', color='k')
    #Axis setting and title
    sns.despine()
    plt.setp(p, xlabel='Intramodular connectivity',
             ylabel='Intervention effect\n'+r'$(-\log_{10} P_{\mathrm{\mathsf{ANOVA}}})$')
    p.set_title(module+' WGCNA-module', {'fontsize':'medium'})
    ##Save
    fileDir = './ExportFigures/'
    ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
    fileName = 'connectivity-vs-intervention_'+module+'.pdf'
    plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04, transparent=True)
    plt.show()
    print('')

## 4. Visualization of top hub proteins

### 4-1. Prechecking for visual categorization

In [None]:
tempDF1 = analyteDF#w/o the Grey module
tempDF1 = tempDF1[['GeneSymbol', 'ModuleID', 'IntramodularConnectivity']]

#Import analyte metadata
fileDir = './ExportData/'
ipynbName = '230303_LC-M001-proteomics-WGCNA-ver2_Annotation_'
fileName = 'analyte-metadata_UniProt.tsv'
tempDF2 = pd.read_csv(fileDir+ipynbName+fileName, sep='\t').set_index('UniProtID')
tempDF2 = tempDF2[['ProteinNames', 'GOBP', 'GOMF', 'GOCC']]

#Merge
tempDF = pd.merge(tempDF1, tempDF2, left_index=True, right_index=True, how='left')

display(tempDF)
display(tempDF.describe(include='all'))

analyteDF_uniprot = tempDF

In [None]:
#Check
tempDF1 = analyteDF_uniprot
#tempI = moduleDF.index#w/o the Grey module
tempL = ['Blue', 'Pink']
topX = 30#Top hubs

for module in tempL:
    #Select the target module
    tempDF = tempDF1.loc[tempDF1['ModuleID']==module]
    #Select the top X hubs
    tempDF = tempDF.sort_values(by='IntramodularConnectivity', ascending=False)
    tempDF = tempDF.iloc[:topX]
    
    print(module)
    display(tempDF)
    display(tempDF['GOCC'].tolist())

> –> In the Blue module, many of the top hub proteins are mitochondrial. Hence, subcellular localization would be good for visual categorization. Note that cytoplasm \[GO:0005737\] includes all subcellular structures other than the plasma membrane and nucleus, and mitochondrion \[GO:0005739\] is its child! A child of cytoplasm, cytosol \[GO:0005829\], would be suitable for cytosolic localization. In summary, 'mitochondrion \[GO:0005739\]', 'nucleus \[GO:0005634\]', 'endosome \[GO:0005768\]', 'peroxisome \[GO:0005777\]', and 'cytosol \[GO:0005829\]' are enough for the top 30 in the Blue module, and other compartments can be skipped.  
> –> In the Pink module, most of the top hub proteins are ribosomal. Hence, ribosomal proteins versus others would be good for visual categorization.  
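
The categorization used in the following cells relies on exact matching of the '; '-separated GOCC terms. A minimal sketch with made-up annotations (the protein IDs and their terms are hypothetical) shows that exact matching keeps cytoplasm \[GO:0005737\] from being counted as any of the target compartments:

```python
import pandas as pd

#Hypothetical GOCC annotations in the '; '-separated style (not from this dataset)
goccS = pd.Series({'P1':'mitochondrion [GO:0005739]; cytoplasm [GO:0005737]',
                   'P2':'cytosol [GO:0005829]; nucleus [GO:0005634]',
                   'P3':'plasma membrane [GO:0005886]'}, name='GOCC')
targetL = ['mitochondrion [GO:0005739]', 'nucleus [GO:0005634]', 'cytosol [GO:0005829]']

#One column per term position; compare each whole term with the target term
splitDF = goccS.str.split(pat='; ', expand=True)
flagDF = pd.concat([(splitDF==gocc).any(axis=1).rename(gocc) for gocc in targetL], axis=1)

#Join the matched terms per protein; proteins without any target term become 'Other'
categoryS = flagDF.apply(lambda row: ', '.join(row[row].index) if row.any() else 'Other', axis=1)
print(categoryS)
```

The joined labels follow the order of `targetL`, not the order in the annotation string, which keeps the category names consistent across proteins.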

### 4-2. Blue module

In [None]:
tempDF1 = analyteDF_uniprot
module = 'Blue'
topX = 30#Top hubs
tempL1 = ['mitochondrion [GO:0005739]', 'nucleus [GO:0005634]', 'endosome [GO:0005768]',
          'peroxisome [GO:0005777]', 'cytosol [GO:0005829]']#For the top 30 of the Blue module

#Select the target module
tempDF = tempDF1.loc[tempDF1['ModuleID']==module]
#Select the top X hubs
tempDF = tempDF.sort_values(by='IntramodularConnectivity', ascending=False)
tempDF = tempDF.iloc[:topX]

#Prepare subcellular localization
tempDF1 = tempDF['GOCC'].str.split(pat='; ', expand=True)
tempL = [tempDF1]#Initialize for checking
for gocc in tempL1:
    tempS = (tempDF1==gocc).sum(axis=1)
    tempS.name = gocc
    tempL.append(tempS)
tempDF1 = pd.concat(tempL, axis=1)
display(tempDF1)

#Clean as label
tempDF1 = tempDF1[tempL1]
tempDF1.columns = tempDF1.columns.str.capitalize()
tempDF1.columns = tempDF1.columns.str.replace('go:', 'GO:')
tempL = []
for analyte in tempDF1.index:
    tempS = tempDF1.loc[analyte]
    tempS = tempS.loc[tempS==1]
    if len(tempS)>0:
        tempL.append(tempS.index.str.cat(sep=', '))
    else:
        tempL.append('Other')
tempS = pd.Series(tempL, index=tempDF1.index, name='Category')
display(tempS.value_counts())

#Merge minor categories
tempS1 = tempS.value_counts()
tempS1 = tempS1.loc[tempS1<2]
tempS.loc[tempS.isin(tempS1.index)] = 'Other'
print('\nAfter merging minor categories:')
display(tempS.value_counts())

#Prepare DF for plot
tempDF = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')
display(tempDF)

hubDF = tempDF

In [None]:
tempDF = hubDF.reset_index()
print('Top', len(tempDF), 'hubs:')
tempD = {'Mitochondrion [GO:0005739]':'blue',
         'Mitochondrion [GO:0005739], Peroxisome [GO:0005777]':'dodgerblue',
         'Mitochondrion [GO:0005739], Cytosol [GO:0005829]':'darkcyan',
         'Mitochondrion [GO:0005739], Nucleus [GO:0005634]':'blueviolet',
         'Nucleus [GO:0005634], Cytosol [GO:0005829]':'darkgray',
         'Other':'white'}#Prepared manually based on the above output

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(10, 3))
p = sns.barplot(data=tempDF, y='IntramodularConnectivity', x='GeneSymbol',
                hue='Category', hue_order=tempD.keys(), palette=tempD, dodge=False, edgecolor='k')
p.grid(axis='y', linestyle='--', color='black')
sns.despine()
plt.xticks(rotation=70, horizontalalignment='right', verticalalignment='center', rotation_mode='anchor')
plt.xlabel('')
plt.ylabel('Intramodular connectivity')
plt.legend(title='Subcellular localization (GOCC)',
           bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1, ncol=1)
##Save
fileDir = './ExportFigures/'
ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
fileName = 'hub-proteins_'+module+'.pdf'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04, transparent=True)
plt.show()

### 4-3. Pink module

In [None]:
tempDF1 = analyteDF_uniprot
module = 'Pink'
topX = 30#Top hubs
tempL1 = ['cytosolic ribosome [GO:0022626]', 'endoplasmic reticulum [GO:0005783]']#For the top 30 of the Pink module

#Select the target module
tempDF = tempDF1.loc[tempDF1['ModuleID']==module]
#Select the top X hubs
tempDF = tempDF.sort_values(by='IntramodularConnectivity', ascending=False)
tempDF = tempDF.iloc[:topX]

#Prepare subcellular localization
tempDF1 = tempDF['GOCC'].str.split(pat='; ', expand=True)
tempL = [tempDF1]#Initialize for checking
for gocc in tempL1:
    tempS = (tempDF1==gocc).sum(axis=1)
    tempS.name = gocc
    tempL.append(tempS)
tempDF1 = pd.concat(tempL, axis=1)
display(tempDF1)

#Clean as label
tempDF1 = tempDF1[tempL1]
tempDF1.columns = tempDF1.columns.str.capitalize()
tempDF1.columns = tempDF1.columns.str.replace('go:', 'GO:')
tempL = []
for analyte in tempDF1.index:
    tempS = tempDF1.loc[analyte]
    tempS = tempS.loc[tempS==1]
    if len(tempS)>0:
        tempL.append(tempS.index.str.cat(sep=', '))
    else:
        tempL.append('Other')
tempS = pd.Series(tempL, index=tempDF1.index, name='Category')
display(tempS.value_counts())

#Merge minor categories
tempS1 = tempS.value_counts()
tempS1 = tempS1.loc[tempS1<2]
tempS.loc[tempS.isin(tempS1.index)] = 'Other'
print('\nAfter merging minor categories:')
display(tempS.value_counts())

#Prepare DF for plot
tempDF = pd.merge(tempDF, tempS, left_index=True, right_index=True, how='left')
#display(tempDF)

hubDF = tempDF

In [None]:
tempDF = hubDF.reset_index()
print('Top', len(tempDF), 'hubs:')
tempD = {'Cytosolic ribosome [GO:0022626]':'hotpink',
         'Cytosolic ribosome [GO:0022626], Endoplasmic reticulum [GO:0005783]':'red',
         'Endoplasmic reticulum [GO:0005783]':'orangered',
         'Other':'white'}#Prepared manually based on the above output

#Visualization
sns.set(style='ticks', font='Arial', context='talk')
plt.figure(figsize=(10, 3))
p = sns.barplot(data=tempDF, y='IntramodularConnectivity', x='GeneSymbol',
                hue='Category', hue_order=tempD.keys(), palette=tempD, dodge=False, edgecolor='k')
p.grid(axis='y', linestyle='--', color='black')
sns.despine()
plt.xticks(rotation=70, horizontalalignment='right', verticalalignment='center', rotation_mode='anchor')
plt.xlabel('')
plt.ylabel('Intramodular connectivity')
plt.legend(title='Subcellular localization (GOCC)',
           bbox_to_anchor=(1, 0.5), loc='center left', borderaxespad=1, ncol=1)
##Save
fileDir = './ExportFigures/'
ipynbName = '230306_LC-M001-proteomics-WGCNA-ver2_Analysis_'
fileName = 'hub-proteins_'+module+'.pdf'
plt.gcf().savefig(fileDir+ipynbName+fileName, dpi=300, bbox_inches='tight', pad_inches=0.04, transparent=True)
plt.show()

# — End of notebook —