# Prepare data for G-G network

This notebook prepares the input data for a gene-gene co-expression network built from correlations among eADAGE latent variables. We use the [visualize_gene_network function from ADAGEpath](https://rdrr.io/github/greenelab/ADAGEpath/man/visualize_gene_network.html) to create and plot this network.

The goal is to create a gene-gene network that highlights the generic genes identified by SOPHIE. To highlight these genes, we need to provide a dataframe that maps each gene to a label (generic vs. not generic). This notebook creates the `gene_color_value` argument that is passed to the `visualize_gene_network` function in [make_GiG_network.R](../gene-expression-modules/make_GiG_network.R).

In [1]:
%load_ext autoreload
%load_ext rpy2.ipython
%autoreload 2

import os
import pandas as pd

from ponyo import utils, simulate_expression_data

Using TensorFlow backend.


In [2]:
# Read in config variables
base_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))

config_filename = os.path.abspath(
    os.path.join(base_dir, "configs", "config_pseudomonas_33245.tsv")
)
params = utils.read_config(config_filename)

In [3]:
# Load params
normalized_compendium_filename = params["normalized_compendium_filename"]

generic_genes_filename = os.path.join("data", "SOPHIE_GAPE_generic.tsv")

Create a dataframe with two columns: the gene id and a label that is 1 if the gene is generic and 0 otherwise.

In [4]:
# Read expression data and transpose it so that genes are rows and samples are columns
expression_data = pd.read_csv(
    normalized_compendium_filename, sep="\t", index_col=0, header=0
).T
expression_data.head()

Unnamed: 0,05_PA14000-4-2_5-10-07_S2.CEL,54375-4-05.CEL,AKGlu_plus_nt_7-8-09_s1.CEL,anaerobic_NO3_1.CEL,anaerobic_NO3_2.CEL,control1aerobic_Pae_G1a.CEL,control1_anaerobic_Pae_G1a.CEL,control2aerobic_Pae_G1a.CEL,control2_anaerobic_Pae_G1a.CEL,control3aerobic_Pae_G1a.CEL,...,Van_Delden_Kohler_0311_BAL6+_1.CEL,Van_Delden_Kohler_0311_BAL6_2.CEL,Van_Delden_Kohler_0311_BAL6+_2.CEL,Van_Delden_Kohler_0311_BAL6_3.CEL,Van_Delden_Kohler_0311_BAL6+_3.CEL,Van_Delden_Kohler_0311_PT5_1.CEL,Van_Delden_Kohler_0311_PT5_2.CEL,Van_Delden_Kohler_0311_PT5_3.CEL,WT12935-18-05.CEL,WT12935-4-05.CEL
PA0001,0.853,0.779,0.789,0.716,0.658,0.366,0.689,0.353,0.674,0.399,...,0.46,0.49,0.383,0.359,0.417,0.362,0.543,0.516,0.747,0.763
PA0002,0.725,0.768,0.73,0.585,0.592,0.573,0.723,0.581,0.681,0.654,...,0.661,0.647,0.648,0.635,0.655,0.676,0.731,0.748,0.706,0.753
PA0003,0.641,0.615,0.726,0.39,0.41,0.418,0.51,0.303,0.515,0.329,...,0.434,0.465,0.419,0.389,0.397,0.498,0.467,0.493,0.528,0.577
PA0004,0.811,0.908,0.719,0.193,0.246,0.663,0.802,0.64,0.747,0.693,...,0.696,0.531,0.645,0.552,0.636,0.614,0.615,0.677,0.8,0.945
PA0005,0.694,0.399,0.53,0.279,0.312,0.425,0.619,0.282,0.657,0.482,...,0.246,0.368,0.241,0.229,0.21,0.322,0.341,0.485,0.461,0.41


In [5]:
# Read generic gene data
generic_genes_data = pd.read_csv(
    generic_genes_filename, sep="\t", index_col=0, header=0
)
generic_gene_ids = list(generic_genes_data["gene id"])

In [6]:
# Map generic genes
expression_data["label"] = 0
expression_data.loc[generic_gene_ids, "label"] = 1
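The `.loc` assignment above assumes that every id in `generic_gene_ids` is present in the index of `expression_data`; if an id is missing, pandas may raise a `KeyError`. A minimal sketch of a more tolerant alternative, using `Index.isin` on toy data, is shown below. The names `toy_expression`, `toy_generic_ids` and `unmatched_ids`, and the gene id `PA9999`, are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the transposed compendium: genes as rows, samples as columns.
toy_expression = pd.DataFrame(
    {"sample_A": [0.85, 0.72, 0.64], "sample_B": [0.78, 0.77, 0.62]},
    index=["PA0001", "PA0002", "PA0003"],
)

# Hypothetical generic gene list; "PA9999" is deliberately absent from the index.
toy_generic_ids = ["PA0002", "PA9999"]

# Membership test on the index, cast to 1/0; ids not in the index are ignored.
toy_expression["label"] = toy_expression.index.isin(toy_generic_ids).astype(int)

# Collect any generic ids that could not be matched, so they can be inspected.
unmatched_ids = sorted(set(toy_generic_ids) - set(toy_expression.index))
print(toy_expression["label"])
print(unmatched_ids)
```

Whether unmatched ids should be silently skipped or treated as an error depends on how the generic gene list was generated, so inspecting `unmatched_ids` before continuing is probably worthwhile.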

In [8]:
# Keep only the label column, indexed by gene id
annot_df = expression_data["label"].to_frame()
annot_df.head()

Unnamed: 0,label
PA0001,0
PA0002,0
PA0003,0
PA0004,0
PA0005,0


In [9]:
# Save
annot_df.to_csv("annot_df.tsv", sep="\t")
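
A quick round-trip check can confirm that the saved file keeps the gene ids as the index and the labels as integers. The sketch below uses toy data and a temporary directory rather than the real `annot_df.tsv`; the names `toy_annot`, `toy_path` and `n_generic` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy annotation table with the same layout as annot_df: gene id index, one label column.
toy_annot = pd.DataFrame(
    {"label": [0, 1, 0]}, index=["PA0001", "PA0002", "PA0003"]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_annot_df.tsv")
    # Write with the same separator used in the notebook.
    toy_annot.to_csv(toy_path, sep="\t")
    # Read back, treating the first column as the gene id index.
    reloaded = pd.read_csv(toy_path, sep="\t", index_col=0, header=0)

# Count how many genes are flagged as generic after the round trip.
n_generic = int(reloaded["label"].sum())
print(reloaded)
print(n_generic)
```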