## Tb TF KO expression data wrangling

Created by Emanuel Flores and Adrian Jinich. 

In [142]:
import pandas as pd 

# black magic for code style 
%load_ext blackcellmagic

The blackcellmagic extension is already loaded. To reload it, use:
  %reload_ext blackcellmagic


Welcome! In this notebook we're going to do some simple data selection to build a dataset that we can work with later on. The data we'll be working with is fold-change (FC) data from single transcription factor (TF) overexpression experiments. That is, gene expression across the whole transcriptome was measured under normal conditions and compared with expression when each TF in the network was overexpressed, and the authors computed $\log \left(\frac{g_{i, KO}}{g_i} \right)$, where $g_i$ is the relative expression level of gene $i$ under normal conditions and $g_{i, KO}$ is its level in the perturbed strain. We're going to merge this dataset with the oxidoreductase annotation data. 
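As a quick illustration of the quantity described above, here is a minimal sketch that computes a log fold change for three made-up genes. The gene names, the expression values, and the choice of base 2 are all assumptions for illustration; the supplementary table does not state the base of the logarithm here.

```python
import numpy as np
import pandas as pd

# Hypothetical relative expression levels for three genes (made-up numbers).
expr = pd.DataFrame(
    {"baseline": [10.0, 8.0, 5.0], "perturbed": [20.0, 8.0, 2.5]},
    index=["gene_a", "gene_b", "gene_c"],
)

# Log fold change: positive means higher expression in the perturbed strain,
# negative means lower, zero means no change.
# Base 2 is an assumption; the notebook does not state the base.
log_fc = np.log2(expr["perturbed"] / expr["baseline"])
```

A doubling gives +1, a halving gives -1, so up- and down-regulation of the same magnitude are symmetric around zero.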

First, we're going to load the fold-change data. 

In [2]:
# Using pandas to load data.
fc_xl = pd.read_excel('~/Downloads/fold_changes_SI_Table_2.xlsx')

In [3]:
# take a look
fc_xl.head()

Unnamed: 0,ID,Name,Function,Count,FC:Rv0019c,FC:Rv0020c,FC:Rv0022c,FC:Rv0023,FC:Rv0038,FC:Rv0042c,...,Rv3681c,Rv3736,Rv3744,Rv3765c,Rv3830c,Rv3833,Rv3849,Rv3855,Rv3862c,Rv3911
0,Rv0001,dnaA,Chromosomal replication initiator protein DnaA,1,0.142,0.021,-0.734,-1.683,0.102,0.272,...,,,,,,,,,,
1,Rv0002,dnaN,DNA polymerase III beta subunit (EC 2.7.7.7),0,0.053,-0.022,-0.181,0.63,0.159,0.04,...,,,,,,,,,,
2,Rv0003,recF,DNA recombination and repair protein RecF,0,0.081,0.188,-0.203,-0.686,0.222,0.074,...,,,,,,,,,,
3,Rv0004,,"Zn-ribbon-containing, possibly RNA-binding pro...",1,0.21,0.163,0.169,-1.077,0.112,0.292,...,,,,,,,,,,
4,Rv0005,gyrB,DNA gyrase subunit B (EC 5.99.1.3),0,0.053,0.091,-0.001,-0.689,0.114,0.165,...,,,,,,,,,,


In [39]:
fc_xl.shape

(4026, 560)

We can see that some columns contain NaN values. Let's keep only the FC data. 

## Filter only fold-change values 

Because the dataset also contains p-values and ChIP-seq binding information, we'll filter it to get only the FC data. 

In [4]:
# Getting column names for fold-change data. 
fc_columns = [col for col in fc_xl.columns.to_list() if 'FC' in col]

We know that there are 206 TFs, so let's make sure we got all of the data. 

In [41]:
# Make sure we got 206 columns 
len(fc_columns)

206
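The selection-and-check step can be sketched on a toy frame. This is only an illustration: the values and the `pval:` column prefix are made up, and matching on the full `FC:` prefix (rather than the substring `FC`) is a slightly stricter variant of the filter used above, not the notebook's own code.

```python
import pandas as pd

# Toy frame with a column layout similar to the supplementary table
# (values and the "pval:" prefix are made up for illustration).
toy = pd.DataFrame(
    {
        "ID": ["Rv0001", "Rv0002"],
        "Name": ["dnaA", "dnaN"],
        "FC:Rv0019c": [0.142, 0.053],
        "FC:Rv0020c": [0.021, -0.022],
        "pval:Rv0019c": [0.5, 0.9],
    }
)

# Match on the full "FC:" prefix rather than the substring "FC".
toy_fc_columns = [col for col in toy.columns if col.startswith("FC:")]

# Fail loudly if the count is not what we expect (206 in the real dataset).
expected_n = 2
assert len(toy_fc_columns) == expected_n, f"got {len(toy_fc_columns)} FC columns"
```

Turning the visual check into an `assert` means the notebook stops immediately if a future version of the spreadsheet has a different number of TF columns.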

In [5]:
# Filtering annotation data 
annot = fc_xl.iloc[:, :4]

annot.head()

In [8]:
# Combine the annotation and FC column names into one list
cols = annot.columns.to_list() + fc_columns

In [9]:
# Selection of fold-change data using fancy indexing on columns 
fc_data = fc_xl[cols]

In [42]:
# take a look 
fc_data.head(3)

Unnamed: 0,Rv_ID,Name,Function,Count,FC:Rv0019c,FC:Rv0020c,FC:Rv0022c,FC:Rv0023,FC:Rv0038,FC:Rv0042c,...,FC:Rv3744,FC:Rv3765c,FC:Rv3830c,FC:Rv3833,FC:Rv3840,FC:Rv3849,FC:Rv3852,FC:Rv3855,FC:Rv3862c,FC:Rv3911
0,Rv0001,dnaA,Chromosomal replication initiator protein DnaA,1,0.142,0.021,-0.734,-1.683,0.102,0.272,...,-0.227,0.264,0.082,0.082,-0.039,-0.634,-0.353,0.163,-0.06,-0.669
1,Rv0002,dnaN,DNA polymerase III beta subunit (EC 2.7.7.7),0,0.053,-0.022,-0.181,0.63,0.159,0.04,...,0.004,0.1,0.053,0.274,0.179,-0.917,-0.004,0.211,-0.118,-0.178
2,Rv0003,recF,DNA recombination and repair protein RecF,0,0.081,0.188,-0.203,-0.686,0.222,0.074,...,-0.295,-0.065,0.009,-0.035,0.1,-0.16,-0.251,0.035,0.121,-0.448


In [11]:
fc_data.shape

(4026, 210)

Nice, now we can save the dataset. 

In [None]:
#fc_data.to_csv('../data/fold_change_tf_ko.csv', index = False)

### Load redox annotation data

Now, we want to integrate the fold-change dataset with the redox annotation. 

In [121]:
redox = pd.read_excel('~/Downloads/redox_uniprot_myco.xlsx')

In [122]:
redox['redox_enzyme'] = 1

Let's join the two dataframes, keeping every gene in the FC data. 

In order to do that, we'll first have to rename the ID column of the FC data to Rv_ID, so that we can join on that column as an anchor. 

In [123]:
# Rename the locus tag column to Rv_ID so it can serve as the join anchor 
fc_data = fc_data.rename(columns={"ID": "Rv_ID"})

# Change redox table column names for consistency 
redox.rename(
    columns={
        "Gene names": "alternative_gene_names",
        "Name": "redox_gene_name",
        "Function": "function_redox_",
    },
    inplace=True,
)

# Delete organism column because we know we're working with TB 
del redox["Organism"]

In [124]:
# Confirm the only anchor column is the locus tag 
common_cols = list(set(redox.columns.tolist()).intersection(fc_data.columns.tolist()))

common_cols

All right, we're ready to do the merge. We'll do a left join, which keeps every gene in the FC dataset. We know (after some analysis) that a few genes in the redox dataset are not in the fold-change dataset; the left join drops them. 

In [136]:
# Get redox genes not in FC dataset
set(redox.Rv_ID) - set(fc_data.Rv_ID)

{'Rv1508', 'Rv1990', 'Rv2250', 'Rv2970'}
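The same check can also be done with the `indicator` option of `pd.merge`, which labels where each row came from. The sketch below uses toy stand-ins for `fc_data` and `redox`; the frames and locus tags in it are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for fc_data and redox (made-up rows).
fc_toy = pd.DataFrame({"Rv_ID": ["Rv0001", "Rv0002", "Rv0003"]})
redox_toy = pd.DataFrame({"Rv_ID": ["Rv0002", "Rv9999"], "redox_enzyme": [1, 1]})

# indicator=True adds a "_merge" column with the values
# "both", "left_only" or "right_only".
check = pd.merge(fc_toy, redox_toy, on="Rv_ID", how="outer", indicator=True)

# Keys present only in the redox table would be dropped by a left join.
redox_only = set(check.loc[check["_merge"] == "right_only", "Rv_ID"])
```

Counting the `_merge` values with `check["_merge"].value_counts()` gives a quick summary of how many genes fall in each group before committing to a join type.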

In [137]:
df = pd.merge(fc_data, redox, on = 'Rv_ID', how = 'left')

In [138]:
df.shape

(4031, 231)
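Note that the merged frame has five more rows than `fc_data` (4031 versus 4026). A left join only adds rows when a key matches more than one row in the right-hand table, so some locus tags must appear more than once in the redox table; the output above does not show which ones. A minimal sketch of how one might find such keys, using made-up toy frames rather than the real data:

```python
import pandas as pd

fc_toy = pd.DataFrame({"Rv_ID": ["Rv0001", "Rv0002"], "FC:Rv0019c": [0.1, -0.2]})
# The annotation table lists Rv0002 twice (made-up example).
redox_toy = pd.DataFrame(
    {"Rv_ID": ["Rv0002", "Rv0002"], "Annotation": ["a", "b"]}
)

# The left join repeats the Rv0002 row once per match on the right.
merged = pd.merge(fc_toy, redox_toy, on="Rv_ID", how="left")

# List every right-hand row whose key is repeated.
dupes = redox_toy[redox_toy["Rv_ID"].duplicated(keep=False)]

# validate= makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(fc_toy, redox_toy, on="Rv_ID", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

Whether the repeated rows should be kept, collapsed, or removed depends on what the duplicate annotation entries mean, so this only surfaces them.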

Finally, we only need to populate some annotation entries with zeros. This will help later for classification and visualization purposes. 

In [139]:
# Fill the entries of the non-redox annotated proteins with 0s
# (assigning the result avoids chained in-place fills on a column)
df = df.fillna(
    {"redox_enzyme": 0, "UK_score_4": 0, "Annotation": "None", "Annotation_int": 0}
)

In [140]:
# Save dataset 
df.to_csv('../data/fold_change_tf_ko_plus_redox_annot.csv', index = False)

In [None]:
%load_ext watermark

%watermark -v -p pandas