# Feature transformation
After filtering species by an abundance criterion, we now log-transform the data. We use the same transformation as @zeller2014potential, which is given in @eq-trans.

$$ \log_{10}(x + x_0) $$ {#eq-trans}

Here $x$ is a relative abundance value and $x_0$ is a small constant ($10^{-6}$) that keeps the logarithm defined when $x = 0$.
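
As a small illustration of @eq-trans, the sketch below applies the transformation to a few made-up relative abundance values. The values in `toy` and the helper name `log_transform` are hypothetical and do not come from the dataset. The point is that a zero abundance maps to $\log_{10}(x_0) = -6$ instead of being undefined.

```python
import numpy as np

X0 = 1e-6  # small constant x_0 from the text


def log_transform(x, x0=X0):
    """Return log10(x + x0) element-wise (hypothetical helper)."""
    return np.log10(np.asarray(x, dtype=float) + x0)


# made-up relative abundances, for illustration only
toy = [0.0, 0.001, 0.5]
transformed = log_transform(toy)
print(transformed)
```

Because the transformation is monotonic, the ordering of abundance values is preserved; only their scale changes.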

We apply this transformation to the filtered species. The code below performs that step.

In [2]:
#| echo: false

import pandas as pd

# loading the tab-separated data file with pandas and transposing it
data = pd.read_csv('Nine_CRC_cohorts_taxon_profiles.tsv',sep='\t',header=None).T

# setting the first row as column names and then removing it
data = data.rename(columns=data.loc[0]).drop(0, axis=0)

# accessing Zeller et al., 2014 dataset
zeller_db = data.loc[data['dataset_name'] == 'ZellerG_2014',:]

# selecting the columns that hold bacterial taxa
bacteria_colnames = [col for col in data.columns if 'k__Bacteria' in col]

# metadata colnames
metadata_colnames = ['dataset_name', 'sampleID', 'subjectID', 'body_site', 'study_condition',
                     'disease', 'age', 'age_category', 'gender', 'country','ajcc','alcohol',
                     'antibiotics_current_use','curator','disease_subtype','ever_smoke','fobt',
                     'hba1c','hdl','ldl','location','BMI']


import matplotlib.pyplot as plt

# dataset containing only the relative abundances of bacterial taxa,
# with every column converted to a numeric dtype (unparseable values become NaN)
microbiome = zeller_db[bacteria_colnames].apply(pd.to_numeric, errors='coerce')

# names of columns whose maximum abundance across samples exceeds 0.001
columns_to_fetch = microbiome.columns[microbiome.max(axis=0) > 0.001]

# filtered dataset
microbiome_filtered = microbiome[columns_to_fetch]


In [4]:
import numpy as np

# log transformation: log10(x + x0) with x0 = 1e-6, applied to the whole table at once
microbiome_log = np.log10(microbiome_filtered.astype(float) + 1e-6)
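
A quick sanity check on the transformed values could look like the sketch below. It uses a small made-up table (`toy`, with hypothetical column names) in place of `microbiome_filtered`, so it runs without the data file. Under that assumption, it confirms that no transformed value falls below $\log_{10}(x_0) = -6$ and that the original abundances can be recovered with $10^{y} - x_0$.

```python
import numpy as np
import pandas as pd

x0 = 1e-6  # same small constant as in the transformation

# made-up relative abundances standing in for the filtered table
toy = pd.DataFrame({
    "species_a": [0.0, 0.02, 0.3],
    "species_b": [0.001, 0.0, 0.05],
})

# same transformation as applied to the real data
toy_log = np.log10(toy + x0)

# zeros map to log10(x0), so this is the smallest possible value
lowest = toy_log.min().min()

# inverting the transformation recovers the original abundances
recovered = 10 ** toy_log - x0

print(lowest)
print(np.allclose(recovered, toy))
```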

## References