# Processing the transporter data

## Overview

This notebook fetches transporter cluster information from the [transporters](https://github.com/johnne/transporters) GitHub repository, and metaomic gene abundances and annotations from [figshare](https://figshare.com/s/6e05aa0ea8353098a503).

In [None]:
import pandas as pd
import os
import numpy as np
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
import hashlib

In [None]:
def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()

In [None]:
def file2df(f, drop=None, axis=1, rename=None):
    df = pd.read_csv(f, index_col=0, sep="\t", header=0)
    if drop:
        df.drop(drop, axis=axis, inplace=True)
    if rename:
        if axis == 1:
            df.rename(columns=rename, inplace=True)
        elif axis == 0:
            df.rename(index=rename, inplace=True)
    return df

## Set up the metaomic data

### Download data from figshare

Download the abundance data of ORFs in the co-assembly, as well as tables containing taxonomic information.

In [None]:
# Define data files
data_files = {'data/mg/all_genes.raw_counts.taxonomy.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168053', 'sha256': '700b83f864791ba801a5912f2673d2e3c09f0e70cf8a0ee685489f705fa75dbc'},
              'data/mg/all_genes.raw_counts.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168047', 'sha256': '4d532c1f2126028cef6be531fb39802d1d31a27a5e5abba480782607eb419f4f'},
              'data/mg/all_genes.tpm.taxonomy.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168017', 'sha256': '9f4b29218009c75969d312b0d62243459699fde548efc03545d30e2f262f19e1'},
              'data/mg/all_genes.tpm.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168011', 'sha256': 'f38aedf2151277d88f6f9fe92af64503baa9ed146ad607b050fb48a488a9a8d8'},
              'data/mt/all_genes.tpm.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168020', 'sha256': '881a73bcc670f74b567b3875c65c64fd0097ba8b18eabc1ddf159ae39699ac86'},
              'data/mt/all_genes.tpm.taxonomy.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168023', 'sha256': '89bb9ab8fd34e29df486961147875d3f2aa613b993c88e13694863322475af71'},
              'data/mt/all_genes.raw_counts.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168026', 'sha256': 'c32ca173e359369f5558b695e1d3108105e59dcc6b3478c7669e32ea3d93825a'},
              'data/mt/all_genes.raw_counts.taxonomy.tsv.gz': {'url': 'https://ndownloader.figshare.com/files/15168035', 'sha256': '77efa5d4a1cd22cbd367b41981553ec489a85598ee7328f778c5497976278a04'}}

In [None]:
for f, d in data_files.items():
    os.makedirs(os.path.dirname(f), exist_ok=True)
    download = False
    if os.path.exists(f):
        if sha256sum(f) == d['sha256']:
            print("File {} exists".format(f))
            continue
        else:
            print("File {} has wrong hash. Re-downloading".format(f))
            download = True
    else:
        download = True
    if download:
        url = d['url']
        print("Downloading file {} from {}".format(f, url))
        urllib.request.urlretrieve(url, f)
        if sha256sum(f) == d['sha256']:
            print("{} OK".format(f))
        else:
            print("{} FAILED. Please try re-downloading.".format(f))
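
The existence-and-checksum logic of the loop above can be illustrated in isolation. This is a minimal sketch and not part of the original notebook: `sha256_of` and `needs_download` are hypothetical helper names, and the file content is made up for the demonstration.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=128 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_download(path, expected_sha256):
    """True if the file is missing or its checksum differs from the expected one."""
    return not os.path.exists(path) or sha256_of(path) != expected_sha256

# Toy check on a temporary file (hypothetical content, not one of the figshare files)
with tempfile.TemporaryDirectory() as tmp:
    toy = os.path.join(tmp, "toy.tsv")
    content = b"gene_id\tsample\n"
    with open(toy, "wb") as handle:
        handle.write(content)
    expected = hashlib.sha256(content).hexdigest()
    missing_file = needs_download(os.path.join(tmp, "absent.tsv"), expected)
    matching_file = needs_download(toy, expected)
    wrong_hash = needs_download(toy, "0" * 64)
```

Only a file that exists and matches its checksum is skipped; a missing file or a mismatching checksum both trigger a download.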

Download the environmental data.

In [None]:
urllib.request.urlretrieve("https://ndownloader.figshare.com/files/15175808", "data/LMO.time.series.metadata.csv")

Download TIGRFAM annotations for ORFs. This is done directly from the [Alneberg et al 2018](https://doi.org/10.6084/m9.figshare.c.3831631.v1) collection.

In [None]:
os.makedirs("data/annotations", exist_ok=True)
urllib.request.urlretrieve("https://ndownloader.figshare.com/files/9448027", "data/annotations/all.TIGRFAM.standardized.tsv.gz")

### Retrieve transporter information

Protein families associated with transporter functions have been identified using the https://github.com/johnne/transporters repository. Transporter protein families are clustered using cross-referencing of reviewed entries in the UniProt database (see the GitHub transporter [wiki](https://github.com/johnne/transporters/wiki) for details). Here we use transporter clustering created using the `2017_12` UniProt version.

In [None]:
uniprot_ver = "2017_12"

In [None]:
transdef = pd.read_csv("https://raw.githubusercontent.com/johnne/transporters/master/results/transport-clusters.{}.tab".format(uniprot_ver), 
                       header=None, sep="\t", names=["transporter","fam"])
print("{} transporters, {} protein families".format(len(transdef.transporter.unique()), len(transdef.fam.unique())))

We limit transporters to the ones with at least one TIGRFAM entry.

In [None]:
transdef = transdef.loc[transdef.fam.str.contains("TIGR")]
print("{} remaining transporters, {} TIGRFAMs".format(len(transdef.transporter.unique()), len(transdef.fam.unique())))
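
To make the effect of the filter concrete, here is a small sketch on made-up data. The transporter names and family accessions are hypothetical; only the column names `transporter` and `fam` are taken from the notebook.

```python
import pandas as pd

# Hypothetical transporter definitions: one row per (transporter, family) pair
toy_transdef = pd.DataFrame({
    "transporter": ["T1", "T1", "T2", "T3"],
    "fam": ["TIGR00001", "PF00005", "PF00010", "TIGR00002"],
})

# Keep only rows whose family identifier is a TIGRFAM accession
toy_tigr = toy_transdef.loc[toy_transdef.fam.str.contains("TIGR")]

n_transporters = toy_tigr.transporter.nunique()
n_families = toy_tigr.fam.nunique()
print("{} remaining transporters, {} TIGRFAMs".format(n_transporters, n_families))
```

In this toy table `T2` has no TIGRFAM member and disappears entirely, while `T1` is kept but loses its non-TIGRFAM family.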

### TIGRFAM annotations

Here we load the TIGRFAM annotations for ORFs in the metagenomic co-assembly.

In [None]:
tigrfams = pd.read_csv("data/annotations/all.TIGRFAM.standardized.tsv.gz", usecols=[0,1],names=["gene_id","fam"],header=0,sep="\t")
tigrfams.head(10)

### Merge with annotations

The annotation table is then merged with the transporter definitions.

In [None]:
gene_trans = pd.merge(tigrfams, transdef, left_on="fam", right_on="fam")
print("{} open reading frames, {} transporters, {} TIGRFAMs".format(len(gene_trans.gene_id.unique()), len(gene_trans.transporter.unique()), len(gene_trans.fam.unique())))
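
Because `pd.merge` performs an inner join by default, ORFs annotated with a TIGRFAM that belongs to no transporter cluster drop out of the result. The sketch below shows this on made-up data; the gene identifiers and accessions are hypothetical.

```python
import pandas as pd

# Hypothetical ORF annotations
toy_tigrfams = pd.DataFrame({
    "gene_id": ["k99_1_1", "k99_2_1", "k99_3_1"],
    "fam": ["TIGR00001", "TIGR00002", "TIGR09999"],
})
# Hypothetical transporter definitions
toy_transdef = pd.DataFrame({
    "transporter": ["T1", "T3"],
    "fam": ["TIGR00001", "TIGR00002"],
})

# Inner join on the shared family column (pd.merge defaults to how="inner")
toy_gene_trans = pd.merge(toy_tigrfams, toy_transdef, on="fam")
print(toy_gene_trans)
```

The ORF annotated with `TIGR09999` is absent from the merged table because that family has no transporter assignment.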

In [None]:
gene_trans.sample(10)

In [None]:
gene_trans.set_index("gene_id", inplace=True)

### Merge with abundances

#### Metagenomes

The metagenomic time-series has some dubious samples that may have been mislabeled. These samples are dropped when the abundance tables are read below.

In [None]:
dubious = ["120507","120521","120910","121123"]

Read abundance tables for metagenomic samples.

In [None]:
mg_cov = file2df("data/mg/all_genes.tpm.tsv.gz", drop=dubious+["gene_length"])
mg_raw = file2df("data/mg/all_genes.raw_counts.tsv.gz", drop=dubious+["gene_length"])

Read abundance tables with taxonomic info.

In [None]:
mg_taxcov = file2df("data/mg/all_genes.tpm.taxonomy.tsv.gz")
mg_taxraw = file2df("data/mg/all_genes.raw_counts.taxonomy.tsv.gz")

Merge with transporters table.

In [None]:
mg_transcov = pd.merge(gene_trans, mg_taxcov, left_index=True, right_index=True)
mg_transraw = pd.merge(gene_trans, mg_taxraw, left_index=True, right_index=True)

Store total raw counts per sample.

In [None]:
os.makedirs("results/mg", exist_ok=True)
mg_raw_tot = mg_raw.loc[mg_raw.index.str.match("^k.+")].sum()
mg_raw_tot = pd.DataFrame(mg_raw_tot,columns=["total_counts"])
mg_raw_tot.to_csv("results/mg/all_genes.total_counts.tsv", sep="\t")
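
The per-sample totals above only include rows whose index starts with `k`. A minimal sketch of that step on made-up data follows; the row labels and counts are hypothetical, and `other_row` simply stands in for any row that the `^k.+` pattern excludes.

```python
import pandas as pd

# Hypothetical raw count table: two ORFs plus one row that is not an ORF
toy_raw = pd.DataFrame(
    {"120516": [10, 5, 100], "120613": [0, 7, 50]},
    index=["k99_1_1", "k99_2_1", "other_row"],
)

# Sum counts per sample over rows whose index matches ^k.+
toy_tot = toy_raw.loc[toy_raw.index.str.match("^k.+")].sum()
# Series.to_frame is used here as an alternative to wrapping in pd.DataFrame
toy_tot = toy_tot.to_frame("total_counts")
print(toy_tot)
```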

#### Metatranscriptomes

The metatranscriptomic time-series needs to have the sample_ids renamed to sample dates.

In [None]:
mt_sample_names = {"P1456_101":"120516", "P1456_102":"120613", "P1456_103":"120712", 
                   "P1456_104":"120813", "P1456_105":"120927", "P1456_106":"121024", 
                   "P1456_107":"121220", "P1456_108":"130123", "P1456_109":"130226", 
                   "P1456_110":"130403", "P1456_111":"130416", "P1456_112":"130422", 
                   "P3764_101":"130507", "P3764_102":"130605", "P3764_103":"130705", 
                   "P3764_104":"130815", "P3764_105":"130905", "P3764_106":"131003", 
                   "P3764_112":"140408", "P3764_113":"140506", "P3764_114":"140604", 
                   "P3764_115":"140709", "P3764_116":"140820", "P3764_117":"140916", 
                   "P3764_118":"141013"}

In [None]:
mt_cov = file2df("data/mt/all_genes.tpm.tsv.gz", drop="gene_length", rename=mt_sample_names)
mt_raw = file2df("data/mt/all_genes.raw_counts.tsv.gz", drop="gene_length", rename=mt_sample_names)
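
The drop-and-rename step performed by `file2df` can be sketched on a small made-up table. The two mapping entries are taken from `mt_sample_names` above; the gene identifiers and values are hypothetical.

```python
import pandas as pd

# Two entries from the sample name mapping above
toy_names = {"P1456_101": "120516", "P1456_102": "120613"}

# Hypothetical abundance table with sample ids as column names
toy_mt = pd.DataFrame(
    {"gene_length": [300, 450], "P1456_101": [1.5, 0.0], "P1456_102": [2.0, 3.5]},
    index=["k99_1_1", "k99_2_1"],
)

# Drop the length column, then rename sample ids to sample dates
toy_mt = toy_mt.drop(columns="gene_length").rename(columns=toy_names)

# Columns missing from the mapping keep their old name, so it is worth checking
unmapped = [c for c in toy_mt.columns if c not in toy_names.values()]
```

`DataFrame.rename` silently leaves unmapped labels unchanged, which is why the sketch ends with a check for leftover sample ids.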

Read the files with taxonomic annotations as well.

In [None]:
mt_taxcov = file2df("data/mt/all_genes.tpm.taxonomy.tsv.gz")
mt_taxraw = file2df("data/mt/all_genes.raw_counts.taxonomy.tsv.gz")

Merge with transporters table.

In [None]:
mt_transcov = pd.merge(gene_trans, mt_taxcov, left_index=True, right_index=True)
mt_transraw = pd.merge(gene_trans, mt_taxraw, left_index=True, right_index=True)

Store total raw counts per sample.

In [None]:
os.makedirs("results/mt", exist_ok=True)
mt_raw_tot = mt_raw.loc[mt_raw.index.str.match("^k.+")].sum()
mt_raw_tot = pd.DataFrame(mt_raw_tot,columns=["total_counts"])
mt_raw_tot.to_csv("results/mt/all_genes.total_counts.tsv", sep="\t")

## Calculate total transporter abundance

Transporter abundances are calculated from the normalized TPM values. The DESeq2 package, however, requires raw counts, so for that purpose raw counts are summed for one representative protein family per transporter cluster.

In [None]:
def get_representatives(df):
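    '''Finds the representative family for each transporter: the family with the highest mean abundance across samples'''
    # numeric_only=True keeps non-numeric (e.g. taxonomy) columns out of the sum
    df_mean = df.groupby(["fam","transporter"]).sum(numeric_only=True).mean(axis=1).reset_index()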
    df_mean.sort_values(0,ascending=False,inplace=True)
    df_mean.index = list(range(0,len(df_mean)))
    reps = {}
    for i in df_mean.index:
        fam = df_mean.loc[i,"fam"]
        t = df_mean.loc[i,"transporter"]
        if t in reps.keys():
            continue
        reps[t] = fam
    return reps
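
The selection rule in `get_representatives` (the family with the highest mean abundance wins) can also be expressed without an explicit loop. The sketch below is an alternative formulation on made-up data, not the notebook's own code; all values are hypothetical.

```python
import pandas as pd

# Hypothetical family-level abundances, shaped like the summed tables used below
toy_fam_sum = pd.DataFrame({
    "fam": ["TIGR00001", "TIGR00002", "TIGR00003"],
    "transporter": ["T1", "T1", "T2"],
    "120516": [10.0, 30.0, 5.0],
    "120613": [20.0, 40.0, 15.0],
})

# Mean abundance of each family across samples
fam_mean = toy_fam_sum.set_index(["fam", "transporter"]).mean(axis=1)
# Sort descending, then keep the first (most abundant) family per transporter
ranked = fam_mean.sort_values(ascending=False).reset_index()
toy_reps = (
    ranked.drop_duplicates("transporter")
    .set_index("transporter")["fam"]
    .to_dict()
)
print(toy_reps)
```

For `T1` the family `TIGR00002` (mean 35) is chosen over `TIGR00001` (mean 15).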

Sum to protein family.

In [None]:
mg_fam_sum = mg_transcov.groupby(["fam","transporter"]).sum(numeric_only=True).reset_index()
# Get representative families for each transporter cluster (for use with DESeq2)
mg_reps = get_representatives(mg_fam_sum)
mg_reps = pd.DataFrame(data=mg_reps,index=["fam"]).T

In [None]:
mt_fam_sum = mt_transcov.groupby(["fam","transporter"]).sum(numeric_only=True).reset_index()
# Get representative families for each transporter cluster (for use with DESeq2)
mt_reps = get_representatives(mt_fam_sum)
mt_reps = pd.DataFrame(data=mt_reps,index=["fam"]).T

Group by transporter and calculate means.

In [None]:
mg_trans = mg_fam_sum.groupby("transporter").mean(numeric_only=True)
mg_trans_percent = mg_trans.div(mg_trans.sum())*100
mg_trans.to_csv("results/mg/all_trans.tpm.tsv", sep="\t")
mg_trans_percent.to_csv("results/mg/all_trans.tpm.percent.tsv", sep="\t")

In [None]:
mt_trans = mt_fam_sum.groupby("transporter").mean(numeric_only=True)
mt_trans_percent = mt_trans.div(mt_trans.sum())*100
mt_trans.to_csv("results/mt/all_trans.tpm.tsv", sep="\t")
mt_trans_percent.to_csv("results/mt/all_trans.tpm.percent.tsv", sep="\t")
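
The two-level aggregation (sum of genes per family, then mean of families per transporter, then percent of the transporter total) is easiest to follow on a tiny example. The data below are made up; only the column names `fam` and `transporter` come from the notebook.

```python
import pandas as pd

# Hypothetical per-gene TPM values for one sample
toy_genes = pd.DataFrame({
    "fam": ["TIGR00001", "TIGR00001", "TIGR00002", "TIGR00003"],
    "transporter": ["T1", "T1", "T1", "T2"],
    "120516": [1.0, 3.0, 6.0, 5.0],
}, index=["k99_1_1", "k99_2_1", "k99_3_1", "k99_4_1"])

# Step 1: sum genes within each protein family
fam_sum = toy_genes.groupby(["fam", "transporter"]).sum(numeric_only=True).reset_index()
# Step 2: average the family sums within each transporter
trans_mean = fam_sum.groupby("transporter").mean(numeric_only=True)
# Step 3: express each transporter as a percentage of the per-sample total
trans_percent = trans_mean.div(trans_mean.sum()) * 100
print(trans_percent)
```

Here `T1` has family sums of 4 and 6, giving a mean of 5, which equals the single family sum of `T2`, so each transporter accounts for 50% of the total.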

Calculate transporter maximum (in % of total transporters) across all samples.

In [None]:
mg_trans_percent_max = mg_trans_percent.max(axis=1)
mt_trans_percent_max = mt_trans_percent.max(axis=1)

Output the number of transporters that pass the maximum abundance threshold used for filtering.

In [None]:
print("{} transporters with max% >= 0.5 in the mg-samples".format(len(mg_trans_percent_max.loc[mg_trans_percent_max>=0.5])))

In [None]:
print("{} transporters with max% >= 0.5 in the mt-samples".format(len(mt_trans_percent_max.loc[mt_trans_percent_max>=0.5])))
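
The threshold applies to each transporter's maximum across samples, so a transporter passes if it reaches 0.5% in any single sample. A small sketch with made-up percentages:

```python
import pandas as pd

# Hypothetical percentages of total transporter abundance per sample
toy_percent = pd.DataFrame(
    {"120516": [0.2, 0.6, 99.2], "120613": [0.4, 0.1, 99.5]},
    index=["T1", "T2", "T3"],
)

# Maximum across samples for each transporter
toy_max = toy_percent.max(axis=1)
# Keep transporters whose maximum is at least 0.5%
kept = toy_max.loc[toy_max >= 0.5].index.tolist()
print(kept)
```

`T2` is kept although it is below the threshold in the second sample, while `T1` never reaches 0.5% and is filtered out.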

Write raw counts for representative protein families.

In [None]:
mg_reps_raw = pd.merge(mg_reps,mg_transraw,left_on="fam",right_on="fam")
mg_reps_raw_sum = mg_reps_raw.groupby("transporter").sum(numeric_only=True)
mg_reps_raw_sum.to_csv("results/mg/rep_trans.raw_counts.tsv", sep="\t")

In [None]:
mt_reps_raw = pd.merge(mt_reps,mt_transraw,left_on="fam",right_on="fam")
mt_reps_raw_sum = mt_reps_raw.groupby("transporter").sum(numeric_only=True)
mt_reps_raw_sum.to_csv("results/mt/rep_trans.raw_counts.tsv", sep="\t")

### Calculate transporter abundances for bacteria

Metagenome

In [None]:
# Get genes classified as bacteria but not cyanobacteria
mg_transcov_bac = mg_transcov.loc[(mg_transcov.superkingdom=="Bacteria")&(mg_transcov.phylum!="Cyanobacteria")]
# Calculate sum of protein families
mg_transcov_bac_fam = mg_transcov_bac.groupby(["fam","transporter"]).sum(numeric_only=True).reset_index()
# Calculate mean of transporters
mg_trans_bac = mg_transcov_bac_fam.groupby("transporter").mean(numeric_only=True)
mg_trans_bac.to_csv("results/mg/bac_trans.tpm.tsv", sep="\t")

Metatranscriptome

In [None]:
# Get genes classified as bacteria but not cyanobacteria
mt_transcov_bac = mt_transcov.loc[(mt_transcov.superkingdom=="Bacteria")&(mt_transcov.phylum!="Cyanobacteria")]
# Calculate sum of protein families
mt_transcov_bac_fam = mt_transcov_bac.groupby(["fam","transporter"]).sum(numeric_only=True).reset_index()
# Calculate mean of transporters
mt_trans_bac = mt_transcov_bac_fam.groupby("transporter").mean(numeric_only=True)
mt_trans_bac.to_csv("results/mt/bac_trans.tpm.tsv", sep="\t")

## Selected transporters

A subset of 58 transporters was selected for this study, based on their abundance in the dataset (a maximum of at least 0.5% of total transporter abundance in at least one sample) and their putative substrates. They were classified manually using TIGRFAM roles and Gene Ontology mappings.

The curated table is available in the [GitHub repository](https://github.com/johnne/transporters/blob/master/article/selected_transporters_classified.tab).

In [None]:
transinfo = pd.read_csv("https://raw.githubusercontent.com/johnne/transporters/master/article/selected_transporters_classified.tab", index_col=0, sep="\t")
transinfo.head()

Limit the transporter definitions to the selected transporters.

In [None]:
transdef_select = transdef.loc[transdef.transporter.isin(transinfo.index)]
print("{} transporters remaining, comprising {} TIGRFAMS".format(len(transdef_select.transporter.unique()), len(transdef_select.fam.unique())))

Add substrate categories to the dataframes.

In [None]:
mg_trans_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mg_trans,left_index=True,right_index=True)
mg_trans_select.head()

In [None]:
# Mean abundances of transporters for selected transporters
mg_trans_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mg_trans,left_index=True,right_index=True)
mg_trans_select.to_csv("results/mg/select_trans.tpm.tsv", sep="\t")
# Mean abundances of transporters for bacteria and selected transporters
mg_trans_bac_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mg_trans_bac,left_index=True,right_index=True)
mg_trans_bac_select.to_csv("results/mg/bac_select_trans.tpm.tsv", sep="\t")
# TPM values per gene for genes matching selected transporters
mg_transcov_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mg_transcov,left_index=True,right_on="transporter")
mg_transcov_select.to_csv("results/mg/select_trans_genes.tpm.tsv", sep="\t")
# TPM values per gene for bacterial genes matching selected transporters
mg_transcov_bac_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mg_transcov_bac,left_index=True,right_on="transporter")
mg_transcov_bac_select.to_csv("results/mg/bac_select_trans_genes.tpm.tsv", sep="\t")

Metatranscriptomes

In [None]:
# Mean abundances of transporters for selected transporters
mt_trans_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mt_trans,left_index=True,right_index=True)
mt_trans_select.to_csv("results/mt/select_trans.tpm.tsv", sep="\t")
# Mean abundances of transporters for bacteria and selected transporters
mt_trans_bac_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mt_trans_bac,left_index=True,right_index=True)
mt_trans_bac_select.to_csv("results/mt/bac_select_trans.tpm.tsv", sep="\t")
# TPM values per gene for genes matching selected transporters
mt_transcov_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mt_transcov,left_index=True,right_on="transporter")
mt_transcov_select.to_csv("results/mt/select_trans_genes.tpm.tsv", sep="\t")
# TPM values per gene for bacterial genes matching selected transporters
mt_transcov_bac_select = pd.merge(transinfo.loc[transdef_select.transporter.unique()],mt_transcov_bac,left_index=True,right_on="transporter")
mt_transcov_bac_select.to_csv("results/mt/bac_select_trans_genes.tpm.tsv", sep="\t")

#### Transporter type and substrate summary

Generate count summary across transporter type and substrate category.

In [None]:
# Group by and count type and substrate category
type_counts = transinfo.groupby(["type","substrate_category"]).count().reset_index().iloc[:,[0,1,2]]
SUM = transinfo.groupby("type").count().iloc[:,0]
SUM

In [None]:
# Group by and count type and substrate category
type_counts = transinfo.groupby(["type","substrate_category"]).count().reset_index().iloc[:,[0,1,2]]
# Calculate total type sum
SUM = transinfo.groupby("type").count().iloc[:,0]
# Calculate total substrate category sum
colsum = transinfo.groupby("substrate_category").count().iloc[:,0]
colsum.name = "SUM"
colsum = pd.DataFrame(colsum).T
colsum = colsum.assign(SUM=SUM.sum())
# Pivot count table
type_counts.columns = ["type","substrate_category","counts"]
type_counts = pd.pivot_table(type_counts, index=["type"], columns=["substrate_category"])
type_counts.fillna(0, inplace=True)
type_counts = type_counts["counts"]
# Add row sums
type_counts = type_counts.assign(SUM=SUM)
# Add col sums
type_counts = pd.concat([type_counts,colsum])
# Convert to integer
type_counts = type_counts.astype(int)
type_counts.to_csv("results/transporter_type_table.tsv", sep="\t")
type_counts
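
A comparable count table with row and column totals can be obtained more compactly with `pd.crosstab`. This is offered as an alternative sketch, not as the notebook's method. The column names `type` and `substrate_category` are those used above, while the transporter names and category values are made up.

```python
import pandas as pd

# Hypothetical classification table, indexed by transporter
toy_info = pd.DataFrame({
    "type": ["ABC", "ABC", "MFS", "ABC"],
    "substrate_category": ["AA", "Sugar", "AA", "AA"],
}, index=["T1", "T2", "T3", "T4"])

# Cross-tabulate type against substrate category, adding "SUM" margins
toy_counts = pd.crosstab(
    toy_info["type"],
    toy_info["substrate_category"],
    margins=True,
    margins_name="SUM",
)
print(toy_counts)
```

Missing combinations are filled with 0 and the counts are already integers, so no `fillna` or `astype` step is needed in this variant.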

Show fraction of the different transporter types.

In [None]:
type_sums = type_counts.sum(axis=1).drop("SUM")
round(type_sums.div(type_sums.sum()),2)