# Process transporter data for use with DESeq2

Because statistical analysis of metagenomes can be unreliable for genes with low abundance ([Jonsson et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727335/)), we filter out transporters with an average read count <100. This is a tradeoff between producing trustworthy results and producing any results at all, because filtering at higher average read counts would remove too many transporters. In addition, the statistical analysis is performed on representative protein families for each transporter cluster. Representative families are selected by sorting on mean abundance across the samples.
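As an illustration of the representative selection described above, here is a minimal sketch on toy data. It assumes that the family with the highest mean abundance in each cluster is kept; the table layout, the `transporter` column and all values are hypothetical and are not taken from the files used in this notebook.

```python
import pandas as pd

# Toy counts: rows are protein families, columns are samples (hypothetical data).
counts = pd.DataFrame(
    {"sample1": [10, 200, 5, 50], "sample2": [30, 100, 15, 70]},
    index=["fam1", "fam2", "fam3", "fam4"],
)

# Hypothetical mapping of each protein family to its transporter cluster.
clusters = pd.Series(["T1", "T1", "T2", "T2"], index=counts.index, name="transporter")

# Attach the mean abundance across samples and sort families by it, highest first.
ranked = clusters.to_frame().assign(mean_abundance=counts.mean(axis=1))
ranked = ranked.sort_values("mean_abundance", ascending=False)

# Assumption: the first (most abundant) family per cluster is the representative.
reps = ranked.drop_duplicates("transporter", keep="first")

# Raw counts for the representative families only.
rep_counts = counts.loc[reps.index]
print(rep_counts)
```

Sorting once and dropping duplicates keeps the ranking explicit, which makes it easy to inspect which families lost out in each cluster.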

In [1]:
import pandas as pd

Read selected transporters.

In [2]:
transinfo = pd.read_table("results/selected_transporters_classified.tab", header=0, sep="\t", index_col=0)

Read raw counts for transporters (calculated from representative protein families).

In [3]:
mg_trans_reps = pd.read_table("results/mg/rep_trans.raw_counts.tsv", header=0, sep="\t", index_col=0)
mt_trans_reps = pd.read_table("results/mt/rep_trans.raw_counts.tsv", header=0, sep="\t", index_col=0)

Intersect with the selected transporters.

In [4]:
mg_select_trans_reps = mg_trans_reps.reindex(transinfo.index)
mt_select_trans_reps = mt_trans_reps.reindex(transinfo.index)
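Note that `reindex` keeps every label in `transinfo.index` and fills rows that are absent from the count table with NaN, so it is not a strict intersection. The sketch below uses toy data with hypothetical transporter names to show the difference, and why the missing rows still disappear in a later mean-based filter.

```python
import pandas as pd

# Toy count table with two transporters (hypothetical data).
counts = pd.DataFrame({"s1": [120, 80], "s2": [150, 60]}, index=["T1", "T2"])

# Hypothetical selection; T3 has no row in the count table.
selected = pd.Index(["T1", "T3"])

# reindex keeps T3 as an all-NaN row.
reindexed = counts.reindex(selected)

# A strict intersection keeps only labels present in both.
intersected = counts.loc[counts.index.intersection(selected)]

# The mean of an all-NaN row is NaN, and NaN >= 100 evaluates to False,
# so such rows are removed by a mean-based threshold filter.
filtered = reindexed.loc[reindexed.mean(axis=1) >= 100]

print(reindexed)
print(intersected)
print(filtered)
```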

## Filter out transporters with low coverage

In [5]:
threshold = 100

In [6]:
mg_select_trans_reps_filt = mg_select_trans_reps.loc[mg_select_trans_reps.mean(axis=1) >= threshold]
mg_select_trans_reps_filt.to_csv("results/mg/rep_trans_filt.raw_counts.tsv", sep="\t")
print("{} transporters remaining after filtering".format(len(mg_select_trans_reps_filt)))

37 transporters remaining after filtering


In [7]:
mt_select_trans_reps_filt = mt_select_trans_reps.loc[mt_select_trans_reps.mean(axis=1) >= threshold]
mt_select_trans_reps_filt.to_csv("results/mt/rep_trans_filt.raw_counts.tsv", sep="\t")
print("{} transporters remaining after filtering".format(len(mt_select_trans_reps_filt)))

22 transporters remaining after filtering
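The metagenome and metatranscriptome cells apply the same filter, so the step could be wrapped in a small helper. This is only a sketch: the function name `filter_by_mean` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def filter_by_mean(counts: pd.DataFrame, threshold: float = 100) -> pd.DataFrame:
    """Keep rows whose mean count across samples is at least `threshold`."""
    return counts.loc[counts.mean(axis=1) >= threshold]


# Toy data (hypothetical): row means are 120, 15 and 100.
toy = pd.DataFrame(
    {"s1": [150, 20, 100], "s2": [90, 10, 100]},
    index=["T1", "T2", "T3"],
)

kept = filter_by_mean(toy)
print("{} transporters remaining after filtering".format(len(kept)))
```

Because the comparison is `>=`, a transporter with a mean of exactly 100 is kept, which matches the stated rule of removing those with an average read count <100.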
