# Generate gene-gene network modules
This notebook generates gene-gene network modules for an exploratory analysis.

The objective is____

In [1]:
%load_ext rpy2.ipython
import pandas as pd
import os
import argparse
from functions import process_data

In [2]:
base_dir = os.path.abspath(os.path.join(os.getcwd(),"../"))

In [3]:
# Input files
normalized_data_file = os.path.join(
    base_dir,
    "pilot_experiment",
    "data",
    "input",
    "train_set_normalized.pcl")

metadata_file = os.path.join(
    base_dir,
    "pilot_experiment",
    "data",
    "annotations",
    "sample_annotations.tsv")

# Gene annotation file
# Maps PAO1 gene IDs to their PA14 homologs
gene_mapping_file = os.path.join(
    base_dir,
    "pilot_experiment",
    "data",
    "annotations",
    "PAO1_ID_PA14_ID.csv")

In [4]:
# Select specific experiments
lst_experiments = ["E-GEOD-8083",
                   "E-GEOD-29789",
                   "E-GEOD-48982",
                   "E-GEOD-24038",
                   "E-GEOD-29879",
                   "E-GEOD-49759"]

In [5]:
# Output files
selected_data_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "input",
        "selected_normalized_data.tsv")

shuffled_selected_data_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "input",
        "shuffled_selected_normalized_data.tsv")

gene_annot_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "annotations",
        "selected_gene_annotations.txt")

# Select subset of samples

Select experiments that contain either PAO1 or PA14 strains. As a first pass we use only these two strains because they are the most common and best-studied *P. aeruginosa* strains, so the resulting gene-gene interactions can be checked against those reported in the literature.
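
The selection step can be illustrated with a minimal sketch. This is not the code in `process_data.select_expression_data`; the metadata column names `sample` and `experiment`, the toy sample names, and the helper `select_samples` are all assumptions made for the example.

```python
import pandas as pd

# Toy expression matrix: samples as rows, genes as columns (values are made up)
expression = pd.DataFrame(
    {"PA0001": [0.60, 0.57, 0.52, 0.50], "PA0002": [0.62, 0.63, 0.43, 0.48]},
    index=["sample_a", "sample_b", "sample_c", "sample_d"],
)

# Toy metadata: the column names "sample" and "experiment" are hypothetical
metadata = pd.DataFrame(
    {
        "sample": ["sample_a", "sample_b", "sample_c", "sample_d"],
        "experiment": ["E-GEOD-8083", "E-GEOD-8083", "E-GEOD-29789", "E-OTHER"],
    }
)


def select_samples(expression, metadata, experiments):
    """Keep only the rows of `expression` whose sample belongs to `experiments`."""
    keep = metadata.loc[metadata["experiment"].isin(experiments), "sample"]
    return expression.loc[expression.index.isin(keep)]


selected = select_samples(expression, metadata, ["E-GEOD-8083", "E-GEOD-29789"])
print(selected.shape)
```

Filtering on the metadata table rather than on sample names keeps the experiment list as the single place where the subset is defined.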

In [6]:
process_data.select_expression_data(normalized_data_file,
                                   metadata_file,
                                   lst_experiments,
                                   selected_data_file)



(39, 5549)
Gene_symbol      PA0001    PA0002    PA0003    PA0004    PA0005    PA0006  \
GSM199982.CEL  0.602146  0.619571  0.453851  0.616299  0.296061  0.193150   
GSM199983.CEL  0.565232  0.634272  0.459963  0.639738  0.266436  0.186819   
GSM199984.CEL  0.507166  0.679804  0.457609  0.714618  0.335181  0.338886   
GSM199985.CEL  0.528545  0.669231  0.462730  0.657953  0.305657  0.282472   
GSM199986.CEL  0.542822  0.704964  0.447164  0.649575  0.267066  0.321489   
GSM738261.CEL  0.519756  0.427725  0.276953  0.427328  0.276952  0.329655   
GSM738262.CEL  0.502303  0.478588  0.191828  0.389341  0.254869  0.234301   
GSM738263.CEL  0.490648  0.460936  0.237385  0.404926  0.286027  0.280460   
GSM738264.CEL  0.554336  0.529259  0.250871  0.441458  0.305741  0.274767   
GSM738265.CEL  0.519577  0.531612  0.263087  0.439124  0.365923  0.340706   

Gene_symbol      PA0007    PA0008    PA0009    PA0010  ...    PA5561  \
GSM199982.CEL  0.596934  0.423352  0.500333  0.111480  ...  0.143319 

# Permute selected expression data
This permuted version will serve as a baseline for our analysis.
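
One plausible way to build such a baseline is sketched below: shuffle the values within each sample across genes, which keeps each sample's value distribution but breaks gene-gene structure. Whether `process_data.permute_expression_data` permutes this way is an assumption; the helper `permute_within_samples` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def permute_within_samples(expression, seed=0):
    """Shuffle each sample's values across genes (assumed permutation scheme)."""
    rng = np.random.default_rng(seed)
    shuffled = expression.to_numpy(copy=True)
    for row in shuffled:
        rng.shuffle(row)  # each row is a view, so this shuffles in place
    return pd.DataFrame(
        shuffled, index=expression.index, columns=expression.columns
    )


# Toy matrix: samples as rows, genes as columns (values are made up)
expression = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]],
    index=["sample_a", "sample_b", "sample_c"],
    columns=["PA0001", "PA0002", "PA0003", "PA0004"],
)

shuffled = permute_within_samples(expression)
print(shuffled)
```

Because every row keeps the same set of values, any module structure found in the permuted data cannot come from coordinated expression between genes.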

In [7]:
process_data.permute_expression_data(selected_data_file,
                                     shuffled_selected_data_file)



                 PA0001    PA0002    PA0003    PA0004    PA0005    PA0006  \
GSM199982.CEL  0.254324  0.300978  0.597989  0.544134  0.233893  0.412026   
GSM199983.CEL  0.301997  0.300047  0.513272  0.189616  0.241617  0.315204   
GSM199984.CEL  0.454081  0.071528  0.665741  0.626161  0.544547  0.918408   
GSM199985.CEL  0.206609  0.176365  0.223115  0.465326  0.476631  0.280493   
GSM199986.CEL  0.298055  0.194905  0.283896  0.524215  0.722790  0.661516   
GSM738261.CEL  0.160034  0.264213  0.605888  0.422431  0.479334  0.167845   
GSM738262.CEL  0.745840  0.231143  0.468362  0.285103  0.359606  0.562178   
GSM738263.CEL  0.697154  0.492544  0.300052  0.369154  0.301793  0.758621   
GSM738264.CEL  0.234895  0.662232  0.277930  0.267817  0.493642  0.263918   
GSM738265.CEL  0.537777  0.249338  0.552190  0.297733  0.141178  0.300772   

                 PA0007    PA0008    PA0009    PA0010  ...    PA5561  \
GSM199982.CEL  0.137871  0.558695  0.158522  0.767242  ...  0.601635   
GSM19998

# Annotate genes as core and accessory

Annotate each PAO1 gene as **core** if it has a homolog in PA14, or **accessory** if no homolog exists.

These homology mappings are based on the [Bactome database](https://bactome.helmholtz-hzi.de/cgi-bin/h-pange.cgi?STAT=1&Gene=PA0135).
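
The core/accessory rule can be sketched as follows. This is an illustration rather than the body of `process_data.annotate_genes`: it assumes that a missing homolog is encoded as a missing `PA14_ID`, the gene ID `PA9991` is invented for the example, and the helper `annotate_core_accessory` is hypothetical.

```python
import pandas as pd

# Toy mapping table; "PA9991" is a made-up gene with no PA14 homolog
mapping = pd.DataFrame(
    {
        "PAO1_ID": ["PA0001", "PA0002", "PA9991"],
        "PA14_ID": ["PA14_00010", "PA14_00020", None],
    }
)


def annotate_core_accessory(gene_ids, mapping):
    """Label a gene 'core' if it has a PA14 homolog, otherwise 'accessory'."""
    # Assumption: a missing PA14_ID means no homolog was found
    has_homolog = mapping.set_index("PAO1_ID")["PA14_ID"].notna()
    labels = [
        "core" if has_homolog.get(gene, False) else "accessory"
        for gene in gene_ids
    ]
    return pd.DataFrame({"PAO1_gene_id": list(gene_ids), "annotation": labels})


annotations = annotate_core_accessory(["PA0001", "PA0002", "PA9991"], mapping)
print(annotations)
```

Genes absent from the mapping table are also labeled accessory here, which is a design choice of the sketch and not something the notebook states.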

In [8]:
process_data.annotate_genes(selected_data_file,
                            gene_mapping_file,
                            gene_annot_file)



  PAO1_ID  Name                                    Product.Name     PA14_ID
0  PA0001  dnaA  chromosomal replication initiator protein DnaA  PA14_00010
1  PA0002  dnaN                  DNA polymerase III, beta chain  PA14_00020
2  PA0003  recF                                    RecF protein  PA14_00030
3  PA0004  gyrB                            DNA gyrase subunit B  PA14_00050
4  PA0005  lptA     lysophosphatidic acid acyltransferase, LptA  PA14_00060
5  PA0006   NaN                  conserved hypothetical protein  PA14_00070
6  PA0007   NaN                            hypothetical protein  PA14_00080
7  PA0008  glyS               glycyl-tRNA synthetase beta chain  PA14_00090
8  PA0009  glyQ              glycyl-tRNA synthetase alpha chain  PA14_00100
9  PA0010   tag               DNA-3-methyladenine glycosidase I  PA14_00110
No. of PAO1 only genes: 201
  PAO1_gene_id annotation
0       PA0001       core
1       PA0002       core
2       PA0003       core
3       PA0004       core
4     

# Network construction and module detection

Networks provide a straightforward representation of interactions between nodes. Here, each node corresponds to the expression profile of one gene, and two nodes are connected if their expression profiles are significantly associated across the environmental perturbations (cell or tissue samples). A standard co-expression measure is the absolute value of the Pearson correlation coefficient.
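
As a small illustration of this co-expression measure, the sketch below computes the absolute Pearson correlation between every pair of genes with pandas. The gene names and the simulated values are made up; the notebook itself delegates this step to the R functions in `functions/network_utils.R`.

```python
import numpy as np
import pandas as pd

# Simulated expression: samples as rows, genes as columns (values are made up)
rng = np.random.default_rng(0)
base = rng.normal(size=20)
expression = pd.DataFrame(
    {
        "gene_a": base,
        "gene_b": -base + rng.normal(scale=0.1, size=20),  # anti-correlated with gene_a
        "gene_c": rng.normal(size=20),  # unrelated to gene_a
    }
)

# Gene-by-gene similarity: absolute value of the Pearson correlation
similarity = expression.corr(method="pearson").abs()
print(similarity.round(2))
```

Taking the absolute value treats strongly anti-correlated genes, such as `gene_a` and `gene_b` here, as strongly associated.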

## Get parameters for network generation

**Question:** How do we pick a threshold to determine what Pearson correlation score is sufficient to say that 2 nodes are associated?

A 'hard threshold' may lead to a loss of information and sensitivity.
Instead, 'soft thresholding' is proposed: each connection is assigned a continuous weight in [0, 1] rather than being classified as present or absent.

**Reference:**
* http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.471.9599&rep=rep1&type=pdf
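
The difference between the two approaches can be sketched on a toy similarity matrix. The soft version below uses the power form, raising each similarity to an exponent `beta`, which is the kind of parameter that `pickSoftThreshold` scans over in the next cells. The matrix values, the cutoff `tau`, and both helper functions are assumptions made for the example.

```python
import numpy as np

# Toy similarity matrix (absolute correlations, values are made up)
similarity = np.array(
    [
        [1.0, 0.9, 0.5],
        [0.9, 1.0, 0.2],
        [0.5, 0.2, 1.0],
    ]
)


def hard_threshold(similarity, tau):
    """Binary adjacency: 1 if the similarity reaches tau, otherwise 0."""
    return (similarity >= tau).astype(float)


def soft_threshold(similarity, beta):
    """Weighted adjacency: similarity raised to the power beta."""
    return similarity ** beta


hard = hard_threshold(similarity, tau=0.6)
soft = soft_threshold(similarity, beta=8)
print(hard)
print(soft.round(4))
```

With the hard cutoff, the 0.5 similarity is dropped entirely; with the soft version it is kept as a small weight, while the 0.9 similarity remains comparatively large.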

In [9]:
%%R -i selected_data_file
# Get threshold param for true gene expression data
source("functions/network_utils.R")

get_threshold(selected_data_file)



Attaching package: ‘fastcluster’



    hclust



Attaching package: ‘WGCNA’



    cor


Attaching package: ‘flashClust’



    hclust





pickSoftThreshold: will use block size 5549.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 5549 of 5549
   Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
1      1  0.52600  2.000          0.961 1860.00   1860.00 2780.0
2      2  0.00944  0.105          0.857  890.00    848.00 1780.0
3      3  0.30300 -0.572          0.847  500.00    449.00 1260.0
4      4  0.56000 -0.942          0.856  309.00    258.00  938.0
5      5  0.68800 -1.120          0.907  204.00    157.00  727.0
6      6  0.74200 -1.250          0.920  142.00     99.80  579.0
7      7  0.76000 -1.340          0.921  102.00     65.60  470.0
8      8  0.76900 -1.390          0.924   75.20     44.40  387.0
9      9  0.77800 -1.450          0.930   56.90     31.20  324.0
10    10  0.77300 -1.490          0.926   44.00     22.20  273.0
11    12  0.80800 -1.510          0.949   27.60     11.60  200.0
12    14  0.82800 -1.520          0.960   18.20      6.46  151.0


In [10]:
%%R -i shuffled_selected_data_file
# Get threshold param for permuted gene expression data
source("functions/network_utils.R")

get_threshold(shuffled_selected_data_file)




pickSoftThreshold: will use block size 5549.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 5549 of 5549
   Power SFT.R.sq  slope truncated.R.sq  mean.k. median.k.   max.k.
1      1 0.000397  -1.01          0.992 7.29e+02  7.29e+02 7.78e+02
2      2 0.040100  -5.33          0.989 1.49e+02  1.48e+02 1.70e+02
3      3 0.089600  -5.44          0.986 3.80e+01  3.80e+01 4.64e+01
4      4 0.036900  -2.55          0.977 1.13e+01  1.13e+01 1.47e+01
5      5 0.127000  -4.05          0.978 3.78e+00  3.77e+00 5.41e+00
6      6 0.272000  -5.31          0.864 1.38e+00  1.37e+00 2.25e+00
7      7 0.336000 -10.10          0.397 5.39e-01  5.33e-01 1.03e+00
8      8 0.404000  -9.15          0.436 2.24e-01  2.21e-01 5.11e-01
9      9 0.458000  -8.24          0.464 9.84e-02  9.57e-02 2.73e-01
10    10 0.494000  -7.44          0.474 4.51e-02  4.32e-02 1.54e-01
11    12 0.941000  -4.08          0.958 1.07e-02  9.77e-03 5.61e-02
12    14 0.488000  -4.95     

In [11]:
# Prompt user to choose threshold params
power_param_true = int(input("Threshold for true data:"))

Threshold for true data:8


In [12]:
# Prompt user to choose threshold params
power_param_shuffled = int(input("Threshold for permuted data:"))

Threshold for permuted data:20


## Generate network modules

In [13]:
# Output module files
gene_modules_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "networks",
        "selected_modules.tsv")

shuffled_gene_modules_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "networks",
        "shuffled_selected_modules.tsv")

In [14]:
%%R -i power_param_true -i selected_data_file -i gene_modules_file
# Generate network modules using threshold params selected for true expression data
source("functions/network_utils.R")

generate_network_modules(power_param_true,
                         selected_data_file,
                         gene_modules_file)







..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
 ..cutHeight not given, setting it to 0.997  ===>  99% of the (truncated) height range in dendro.
 ..done.
 mergeCloseModules: Merging modules whose distance is less than 0.25
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 37 module eigengenes in given set.
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 22 module eigengenes in given set.
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 20 module eigengenes in given set.
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 19 module eigengenes in given set.
   Calculating new MEs...
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 19 module eigengenes in given set.


In [15]:
%%R -i power_param_shuffled -i shuffled_selected_data_file -i shuffled_gene_modules_file
# Generate network modules using threshold params selected for permuted expression data
source("functions/network_utils.R")

generate_network_modules(power_param_shuffled,
                         shuffled_selected_data_file,
                         shuffled_gene_modules_file)




..connectivity..
..matrix multiplication (system BLAS)..
..normalization..
..done.
 ..cutHeight not given, setting it to 1  ===>  99% of the (truncated) height range in dendro.
 ..done.
 mergeCloseModules: Merging modules whose distance is less than 0.25
   multiSetMEs: Calculating module MEs.
     Working on set 1 ...
     moduleEigengenes: Calculating 1 module eigengenes in given set.
Error in mergeCloseModules(expression_data, dynamicColors, cutHeight = MEDissThres,  : 
  Error in moduleEigengenes(expr = exprData[[set]]$data, colors = setColors,  : 
  Color levels are empty. Possible reason: the only color is grey and grey module is excluded from the calculation.






