# Process data
This notebook processes the expression data used in this pilot analysis. Specifically, it performs the following steps:

1. Selects a small subset of expression data and outputs this dataset to file
2. Permutes the subsetted data to use as a control and outputs this dataset to file
3. Generates a mapping from each *P. aeruginosa* gene ID (PA####) to a core or accessory label

In [1]:
%load_ext rpy2.ipython
import pandas as pd
import os
import argparse
from functions import process_data

In [None]:
base_dir = os.path.abspath(os.path.join(os.getcwd(),"../"))

### About input data
The normalized expression data come from the *P. aeruginosa* compendium of [Tan et al.](https://msystems.asm.org/content/1/1/e00025-15). The dataset can be found in the associated [ADAGE GitHub repository](https://github.com/greenelab/adage/blob/master/Data_collection_processing/Pa_compendium_02.22.2014.pcl).

The corresponding metadata was downloaded from the [ADAGE website](https://adage.greenelab.com/#/download).

In [3]:
# Input files
normalized_data_file = os.path.join(
    base_dir,
    "pilot_experiment",
    "data",
    "input",
    "train_set_normalized.pcl")

metadata_file = os.path.join(
    base_dir,
    "pilot_experiment",
    "data",
    "annotations",
    "sample_annotations.tsv")

# Gene annotation file
# Maps each PAO1 gene ID to its PA14 homolog, where one exists
gene_mapping_file = os.path.join(
    base_dir,
    "pilot_experiment",
    "data",
    "annotations",
    "PAO1_ID_PA14_ID.csv")

In [4]:
# Select specific experiments
# Here we select 6 experiments: 3 that contain PAO1 strains and 3 that contain PA14 strains.
# We use only PAO1 and PA14 as a first pass because they are the most common and best-studied
# P. aeruginosa strains, so we can verify the resulting gene-gene interactions against those
# reported in the literature.
lst_experiments = ["E-GEOD-8083",
                   "E-GEOD-29789",
                   "E-GEOD-48982",
                   "E-GEOD-24038",
                   "E-GEOD-29879",
                   "E-GEOD-49759"]

In [5]:
# Output files
selected_data_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "input",
        "selected_normalized_data.tsv")

shuffled_selected_data_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "input",
        "shuffled_selected_normalized_data.tsv")

gene_annot_file = os.path.join(
        base_dir,
        "pilot_experiment",
        "data",
        "annotations",
        "selected_gene_annotations.txt")

# Select subset of samples

Select the samples that belong to the chosen experiments and write them to file.
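The implementation of `process_data.select_expression_data` is not shown in this notebook. As a rough illustration of the idea only, the sketch below filters a samples-by-genes table down to the samples whose experiment appears in a given list. The helper name `select_by_experiment`, the metadata column names `experiment` and `sample`, and the toy sample IDs are all assumptions made for this example, not names taken from the real metadata file.

```python
import pandas as pd


def select_by_experiment(expression, metadata, experiment_ids,
                         experiment_col="experiment", sample_col="sample"):
    """Keep rows of `expression` (samples x genes) whose sample belongs
    to one of `experiment_ids`. Column names are hypothetical."""
    # Samples annotated with one of the requested experiments
    in_experiments = metadata[experiment_col].isin(experiment_ids)
    sample_ids = metadata.loc[in_experiments, sample_col]
    # Boolean mask keeps the original row order of the expression table
    return expression.loc[expression.index.isin(sample_ids)]


# Toy data: three made-up samples, two genes
expression = pd.DataFrame(
    {"PA0001": [0.60, 0.52, 0.41], "PA0002": [0.62, 0.43, 0.55]},
    index=["sample_a", "sample_b", "sample_c"],
)
metadata = pd.DataFrame({
    "experiment": ["E-GEOD-8083", "E-GEOD-29789", "E-TOY-0000"],
    "sample": ["sample_a", "sample_b", "sample_c"],
})

selected = select_by_experiment(
    expression, metadata, ["E-GEOD-8083", "E-GEOD-29789"]
)
print(selected.shape)
```

Filtering with a boolean mask rather than a join leaves the gene columns untouched and silently ignores metadata samples that are missing from the expression table.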

In [6]:
process_data.select_expression_data(normalized_data_file,
                                   metadata_file,
                                   lst_experiments,
                                   selected_data_file)

(39, 5549)
Gene_symbol      PA0001    PA0002    PA0003    PA0004    PA0005    PA0006  \
GSM199982.CEL  0.602146  0.619571  0.453851  0.616299  0.296061  0.193150   
GSM199983.CEL  0.565232  0.634272  0.459963  0.639738  0.266436  0.186819   
GSM199984.CEL  0.507166  0.679804  0.457609  0.714618  0.335181  0.338886   
GSM199985.CEL  0.528545  0.669231  0.462730  0.657953  0.305657  0.282472   
GSM199986.CEL  0.542822  0.704964  0.447164  0.649575  0.267066  0.321489   
GSM738261.CEL  0.519756  0.427725  0.276953  0.427328  0.276952  0.329655   
GSM738262.CEL  0.502303  0.478588  0.191828  0.389341  0.254869  0.234301   
GSM738263.CEL  0.490648  0.460936  0.237385  0.404926  0.286027  0.280460   
GSM738264.CEL  0.554336  0.529259  0.250871  0.441458  0.305741  0.274767   
GSM738265.CEL  0.519577  0.531612  0.263087  0.439124  0.365923  0.340706   

Gene_symbol      PA0007    PA0008    PA0009    PA0010  ...    PA5561  \
GSM199982.CEL  0.596934  0.423352  0.500333  0.111480  ...  0.143319 

# Permute selected expression data
This permuted version will serve as a baseline (control) for our analysis.
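How `process_data.permute_expression_data` shuffles the data is not shown here. One common way to build such a control, sketched below as an assumption rather than as the function's actual behavior, is to shuffle the values within each sample independently, which keeps each sample's distribution of values but breaks the association between a value and its gene. The helper name `shuffle_within_samples`, the `seed` parameter, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def shuffle_within_samples(expression, seed=0):
    """Shuffle the values within each row (sample) independently.
    Row and column labels are kept; only the values move."""
    rng = np.random.default_rng(seed)
    shuffled_rows = [rng.permutation(row) for row in expression.to_numpy()]
    return pd.DataFrame(
        shuffled_rows, index=expression.index, columns=expression.columns
    )


# Toy data: two made-up samples, four genes
expression = pd.DataFrame(
    [[0.10, 0.20, 0.30, 0.40],
     [0.50, 0.60, 0.70, 0.80]],
    index=["sample_a", "sample_b"],
    columns=["PA0001", "PA0002", "PA0003", "PA0004"],
)

shuffled = shuffle_within_samples(expression, seed=42)
print(shuffled)
```

Fixing the seed makes the control reproducible across notebook runs.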

In [7]:
process_data.permute_expression_data(selected_data_file,
                                     shuffled_selected_data_file)

                 PA0001    PA0002    PA0003    PA0004    PA0005    PA0006  \
GSM199982.CEL  0.254324  0.300978  0.597989  0.544134  0.233893  0.412026   
GSM199983.CEL  0.301997  0.300047  0.513272  0.189616  0.241617  0.315204   
GSM199984.CEL  0.454081  0.071528  0.665741  0.626161  0.544547  0.918408   
GSM199985.CEL  0.206609  0.176365  0.223115  0.465326  0.476631  0.280493   
GSM199986.CEL  0.298055  0.194905  0.283896  0.524215  0.722790  0.661516   
GSM738261.CEL  0.160034  0.264213  0.605888  0.422431  0.479334  0.167845   
GSM738262.CEL  0.745840  0.231143  0.468362  0.285103  0.359606  0.562178   
GSM738263.CEL  0.697154  0.492544  0.300052  0.369154  0.301793  0.758621   
GSM738264.CEL  0.234895  0.662232  0.277930  0.267817  0.493642  0.263918   
GSM738265.CEL  0.537777  0.249338  0.552190  0.297733  0.141178  0.300772   

                 PA0007    PA0008    PA0009    PA0010  ...    PA5561  \
GSM199982.CEL  0.137871  0.558695  0.158522  0.767242  ...  0.601635   
GSM19998

# Annotate genes as core and accessory

Annotate each gene as **core** if the PAO1 gene has a homolog in PA14, or **accessory** if no homolog exists.

These homology mappings are based on the [Bactome database](https://bactome.helmholtz-hzi.de/cgi-bin/h-pange.cgi?STAT=1&Gene=PA0135).
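The labeling rule can be sketched as follows. This is a minimal illustration, not the code in `process_data.annotate_genes`: it assumes a mapping table with `PAO1_ID` and `PA14_ID` columns and treats a gene as accessory when its `PA14_ID` is missing or when the gene does not appear in the mapping at all. The helper name `label_core_accessory` and the gene ID `PA_toy` are made up for the example.

```python
import pandas as pd


def label_core_accessory(gene_ids, mapping,
                         pao1_col="PAO1_ID", pa14_col="PA14_ID"):
    """Label each gene 'core' if the mapping lists a PA14 homolog for it,
    otherwise 'accessory'. Assumes a missing homolog is stored as NaN."""
    has_homolog = set(mapping.loc[mapping[pa14_col].notna(), pao1_col])
    labels = ["core" if g in has_homolog else "accessory" for g in gene_ids]
    return pd.DataFrame({"PAO1_gene_id": list(gene_ids), "annotation": labels})


# Toy mapping: two genes with a homolog, one hypothetical gene without
mapping = pd.DataFrame({
    "PAO1_ID": ["PA0001", "PA0002", "PA_toy"],
    "PA14_ID": ["PA14_00010", "PA14_00020", None],
})

annotations = label_core_accessory(["PA0001", "PA0002", "PA_toy"], mapping)
print(annotations)
```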

In [8]:
process_data.annotate_genes(selected_data_file,
                            gene_mapping_file,
                            gene_annot_file)

  PAO1_ID  Name                                    Product.Name     PA14_ID
0  PA0001  dnaA  chromosomal replication initiator protein DnaA  PA14_00010
1  PA0002  dnaN                  DNA polymerase III, beta chain  PA14_00020
2  PA0003  recF                                    RecF protein  PA14_00030
3  PA0004  gyrB                            DNA gyrase subunit B  PA14_00050
4  PA0005  lptA     lysophosphatidic acid acyltransferase, LptA  PA14_00060
5  PA0006   NaN                  conserved hypothetical protein  PA14_00070
6  PA0007   NaN                            hypothetical protein  PA14_00080
7  PA0008  glyS               glycyl-tRNA synthetase beta chain  PA14_00090
8  PA0009  glyQ              glycyl-tRNA synthetase alpha chain  PA14_00100
9  PA0010   tag               DNA-3-methyladenine glycosidase I  PA14_00110
No. of PAO1 only genes: 201
  PAO1_gene_id annotation
0       PA0001       core
1       PA0002       core
2       PA0003       core
3       PA0004       core
4     