This notebook calculates the observed test statistics before the permutation tests are run on the HPC.

In [None]:
#Load libraries 
import pandas as pd
from sklearn.preprocessing import StandardScaler
import sspa
import scipy
import numpy as np 

We build two networks, one for mild COVID cases (WHO status 1-2) and one for severe COVID cases (WHO status 3-7). <br>
The WHO status 0 group is not used because it has only 18 samples in common between the metabolomic and proteomic datasets.

| WHO status | Common samples | Metabolomic samples | Proteomic samples |
|---|---|---|---|
| 0 | 18 | 133 | 123 |
| 1-2 | 45 | 45 | 48 |
| 3-4 | 56 | 57 | 59 |
| 5-7 | 27 | 28 | 28 |

146 common samples overall, of which 128 are cases: 45 samples (WHO 1-2) vs 83 samples (WHO 3-7).

In [None]:
df = pd.read_csv('../Data/Su_COVID_metabolomics_processed_commoncases.csv', index_col=0)
#df = pd.read_csv('../Data/Su_COVID_proteomics_processed_commoncases.csv', index_col=0)
#df = pd.read_csv("../Data/Su_integrated_data.csv", index_col=0)

In [None]:
df_mild = (df[df["WHO_status"] == '1-2']).iloc[:,:-2] #45 samples, remove the metadata
df_severe = (df[(df["WHO_status"] == '3-4') | (df["WHO_status"] == '5-7')]).iloc[:,:-2] #83 samples

### Scale data after subsetting

In [None]:
df_mild = pd.DataFrame(StandardScaler().fit_transform(df_mild),columns=df_mild.columns, index=df_mild.index)
df_severe = pd.DataFrame(StandardScaler().fit_transform(df_severe),columns=df_severe.columns, index=df_severe.index)

Check that data is centred at zero (mean=zero and standard deviation=1 for each molecule):

In [None]:
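def check_centred(df):
    #Parameter renamed from `type` so it no longer shadows the Python builtin
    print(df.max().max()) #largest scaled value
    print(df.min().min()) #smallest scaled value
    print(df.mean(axis = 0)) #mean of 0 (up to floating-point error)
    print(df.std(axis = 0, ddof = 0)) #sd of 1; ddof=0 matches the population SD used by StandardScaler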

In [None]:
check_centred(df_mild)

In [None]:
check_centred(df_severe)
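A note on the standard deviation check: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below uses made-up data and hypothetical column names, and reproduces the scaler's arithmetic by hand, to show that the pandas default reports `sqrt(n / (n - 1))` rather than exactly 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 45
# Made-up data; "m1".."m3" are hypothetical molecule names
toy = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(n_samples, 3)),
                   columns=["m1", "m2", "m3"])

# Same arithmetic as StandardScaler: subtract the column mean,
# then divide by the population standard deviation (ddof=0)
scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=0)

sd_population = scaled.std(axis=0, ddof=0)  # 1 up to floating-point error
sd_sample = scaled.std(axis=0)              # pandas default ddof=1

print(sd_population.round(6).tolist())
print(sd_sample.round(6).tolist())
```

With 45 samples the difference is small (about 1%), so a value slightly above 1 from the pandas default does not indicate a scaling problem.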

In [None]:
#Load the Reactome pathways from a local file
reactome_pathways = sspa.process_gmt("../Data/Reactome_Homo_sapiens_pathways_compounds_R84.gmt")
#reactome_pathways = sspa.process_reactome('Homo sapiens', infile = '../Data/UniProt2Reactome_all_Levels_ver84.txt', download_latest = False, filepath = None)
#reactome_pathways = pd.read_csv("../Data/Reactome_multi_omics_ChEBI_Uniprot.csv", index_col=0,dtype="str") #Dtype warning because in some columns, some values are in string format whereas some are in integer format, that's why I specify dtype="str"


#Load the root pathways from a local file
root_path = pd.read_excel('../Data/Root_pathways.xlsx', header=None)
root_pathway_dict = dict(zip(root_path[0], root_path[1])) #map column 0 to column 1
root_pathway_names = list(root_pathway_dict.keys())



### Step 1: Determine observed test-statistics

Calculating observed test statistics at the pathway level:

In [None]:
#Function to calculate the squared Spearman correlation matrix 

def squared_spearman_corr(data):
    kpca_scores = sspa.sspa_kpca(data, reactome_pathways)   
    kpca_scores = kpca_scores.drop(columns = list(set(root_pathway_names) & set(kpca_scores.columns))) #using Sara's code to drop root pathways

    spearman_results = scipy.stats.spearmanr(kpca_scores)
    squared_spearman_coef = np.square(spearman_results[0]) #correlation coefficients (spearman_results[1] gives the p-values)

    return squared_spearman_coef,list(kpca_scores.columns)




#Function to calculate the difference between two matrices 

def observed_tstat(data1,data2,edgelist):
    #Signed difference between the two squared correlation matrices (not the absolute difference)
    delta_squared = (np.array(data1) - np.array(data2))

    #Keep only the strict lower triangle: this removes comparisons of a pathway with itself (the diagonal) and duplicate pairs
    mask = np.tril(np.ones(delta_squared.shape, dtype=bool), k=-1)
    non_dup_delta_squared = pd.DataFrame(delta_squared, columns = edgelist, index = edgelist)
    non_dup_delta_squared = non_dup_delta_squared.where(mask) #Replace all masked-out values with NaN

    #Flatten to one row per pathway pair and drop the masked (NaN) entries explicitly
    squared_list = non_dup_delta_squared.stack().dropna().reset_index()
    squared_list['level_0'] = squared_list["level_0"].astype(str) + ", " + squared_list['level_1'].astype(str) #Join the pathway pairs together to form an edge
    squared_list.columns = ["Edges","na","Observed_tstat"]
    squared_list.index = squared_list["Edges"]
    squared_list = squared_list.drop(columns = ["Edges","na"])

    return squared_list

Note: For the unshuffled (real) data, I keep the pathway edges as the index of the delta squared correlation values. Because the edge order is the same in every permutation, the edges do not need to be stored again for each permutation.

Note: I take the signed difference (not the absolute difference) between the mild and severe matrices because the direction of the change in association matters, and because the distribution may be skewed.
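To illustrate the two functions above on something small, here is a minimal sketch using random data and four hypothetical pathway names. It skips the kPCA step, which is specific to `sspa`, and only shows the squared Spearman matrices, the signed difference, and the extraction of one value per unordered pair.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
names = ["P1", "P2", "P3", "P4"]  # hypothetical pathway names
group_a = pd.DataFrame(rng.normal(size=(30, 4)), columns=names)  # stand-in for mild
group_b = pd.DataFrame(rng.normal(size=(40, 4)), columns=names)  # stand-in for severe

# Squared Spearman correlation matrix for each group (4 x 4)
rho2_a = np.square(spearmanr(group_a)[0])
rho2_b = np.square(spearmanr(group_b)[0])

# Signed difference: positive means the association is stronger in group_a
delta = pd.DataFrame(rho2_a - rho2_b, index=names, columns=names)

# Strict lower triangle: drops the diagonal and the duplicated upper half
mask = np.tril(np.ones(delta.shape, dtype=bool), k=-1)
edges = delta.where(mask).stack().dropna()
edges.index = [f"{a}, {b}" for a, b in edges.index]

print(edges)
```

Four pathways give 4 * 3 / 2 = 6 edges, each labelled by its pair of pathway names.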

In [None]:
spearman_mild,edgelist = squared_spearman_corr(df_mild)
spearman_severe,edgelist = squared_spearman_corr(df_severe)

output = observed_tstat(spearman_mild,spearman_severe,edgelist)

Note: No Spearman correlation value is exactly zero. Some observed test statistics are nevertheless zero because the squared correlation is one in both groups, so the difference cancels out.
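A small made-up example of how a zero test statistic can arise: any perfectly monotonic pair has a Spearman correlation of 1 or -1, and squaring removes the sign, so the difference of squared correlations is zero. The variable names are for illustration only and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import spearmanr

x = np.arange(10.0)

# Perfectly monotonic relationships, one increasing and one decreasing
rho_mild = spearmanr(x, x ** 3)[0]        # increasing: rho = 1
rho_severe = spearmanr(x, -np.exp(x))[0]  # decreasing: rho = -1

# Squaring discards the sign, so both squared values are 1
delta = rho_mild ** 2 - rho_severe ** 2
print(rho_mild, rho_severe, delta)
```

This also shows that a zero in the output does not by itself mean the two groups have the same direction of association.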

In [None]:
test_df = pd.DataFrame(spearman_mild, columns = edgelist, index = edgelist)
display(test_df.iloc[:25,:25])

In [None]:
output[:20]

In [None]:
#output.to_csv("../Data/permutation_test_files_metabolomics/observed_tstats.csv")
#output.to_csv("../Data/permutation_test_files_proteomics/observed_tstats.csv")
#output.to_csv("../Data/permutation_test_files_integrated/observed_tstats.csv")

Calculating observed test statistics at the molecular level without pathway analysis:

In [None]:
#Function to calculate the squared Spearman correlation matrix 

def squared_spearman_corr(data):
    spearman_results = scipy.stats.spearmanr(data)
    squared_spearman_coef = np.square(spearman_results[0]) #correlation coefficients (spearman_results[1] gives the p-values)

    return squared_spearman_coef,list(data.columns)




#Function to calculate the difference between two matrices 

def observed_tstat(data1,data2,edgelist):
    #Signed difference between the two squared correlation matrices (not the absolute difference)
    delta_squared = (np.array(data1) - np.array(data2))

    #Keep only the strict lower triangle: this removes comparisons of a molecule with itself (the diagonal) and duplicate pairs
    mask = np.tril(np.ones(delta_squared.shape, dtype=bool), k=-1)
    non_dup_delta_squared = pd.DataFrame(delta_squared, columns = edgelist, index = edgelist)
    non_dup_delta_squared = non_dup_delta_squared.where(mask) #Replace all masked-out values with NaN

    #Flatten to one row per molecule pair and drop the masked (NaN) entries explicitly
    squared_list = non_dup_delta_squared.stack().dropna().reset_index()
    squared_list['level_0'] = squared_list["level_0"].astype(str) + ", " + squared_list['level_1'].astype(str) #Join the molecule pairs together to form an edge
    squared_list.columns = ["Edges","na","Observed_tstat"]
    squared_list.index = squared_list["Edges"]
    squared_list = squared_list.drop(columns = ["Edges","na"])

    return squared_list

In [None]:
spearman_mild,edgelist = squared_spearman_corr(df_mild)
spearman_severe,edgelist = squared_spearman_corr(df_severe)

output = observed_tstat(spearman_mild,spearman_severe,edgelist)

In [None]:
output

In [None]:
#output.to_csv("../Data/permutation_test_files_integrated_withoutPA/observed_tstats.csv")

### Step 2: Shuffle the labels

The sample labels are shuffled, rather than reassigning the samples to two new groups, because the 1-2 class and the 3-7 class are not the same size. Shuffling the labels keeps the original group sizes in every permutation.
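A minimal sketch of one label shuffle, assuming the group sizes listed earlier (45 and 83). The label strings and the random seed are illustrative, and the actual HPC script may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Original labels: 45 mild (WHO 1-2) and 83 severe (WHO 3-7) samples
labels = np.array(["mild"] * 45 + ["severe"] * 83)

# Permuting the label vector keeps the group sizes at 45 and 83
shuffled = rng.permutation(labels)

# Row positions that would form each permuted group
mild_idx = np.flatnonzero(shuffled == "mild")
severe_idx = np.flatnonzero(shuffled == "severe")

print(len(mild_idx), len(severe_idx))
```

The two index arrays could then be passed to `.iloc` on the combined case data to form the permuted groups before recomputing the test statistic.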

### Step 3: Read in the permutation files 

On the HPC, 10 files each store 10k permutations. 10 array jobs read in all 10k values and output how many are above the observed test statistic. See permutation_distribution.ipynb for more information.

I also take all the permutation values for some randomly selected pathways to check the distribution.
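The counting step can be sketched as follows for a single edge. Everything here is an assumption made for illustration: the null values are simulated instead of read from the permutation files, the comparison is two-sided on absolute values, and the `+1` correction is a common convention for empirical p-values that this notebook does not state.

```python
import numpy as np

rng = np.random.default_rng(7)

observed = 0.35                    # hypothetical observed statistic for one edge
n_chunks, chunk_size = 10, 10_000  # assumed layout: 10 files of 10k permutations

exceed = 0  # permuted statistics at least as extreme as the observed one
total = 0   # permuted statistics seen so far

for _ in range(n_chunks):
    # Simulated stand-in for the permuted statistics stored in one file
    null_chunk = rng.normal(loc=0.0, scale=0.15, size=chunk_size)
    exceed += int(np.sum(np.abs(null_chunk) >= abs(observed)))
    total += null_chunk.size

# Empirical p-value; the +1 keeps it strictly above zero
p_value = (exceed + 1) / (total + 1)
print(exceed, total, p_value)
```

Only the running counts are kept, so each chunk can be processed by a separate job and the counts summed afterwards.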

### Step 4: Compare the difference in edges with other networks