# Normalization and filtering steps

The OTU count matrix was obtained from the original study by [Wallace et al. 2018](https://apsjournals.apsnet.org/doi/10.1094/PBIOMES-02-18-0008-R).

This notebook separates the OTU table into day and night samples, which are then used to analyze co-occurrence networks and cross-correlation networks (integration with the host transcriptome).

This notebook also describes the filtering steps: an OTU is kept if its relative abundance exceeds 0.001% in at least 50% of the samples and its coefficient of variation is above the first quartile.


In [1]:
import pandas as pd

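# Read the OTU identifier column as text; the dtype key must match the column name used as index below
otu_table_df = pd.read_table('/home/santosrac/Projects/UGA_RACS/16S/otu_matrices/2f_otu_table.sample_filtered.no_mitochondria_chloroplast.tsv',
                                      comment='#', dtype = {'OTU_ID': str})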

otu_table_df.set_index('OTU_ID', inplace=True)
otu_table_df.head()



OTU_ID,LMAN.8.14A0051,LMAN.8.14A0304,LMAD.8.14A0247,LMAN.8.14A0159,LMAD.8.14A0051,LMAD.26.14A0381,LMAD.26.14A0533,LMAD.8.14A0281,LMAD.8.14A0295,LMAN.26.14A0319,...,LMAN.26.14A0303,LMAN.8.14A0011,LMAD.26.14A0137,LMAN.26.14A0327,LMAN.8.14A0205,LMAD.8.14A0265,LMAD.26.14A0155,LMAD.26.14A0167,LMAD.26.14A0481,LMAN.26.14A0329
4479944,1.0,2.0,1.0,1.0,1.0,3.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995900,0.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0,8.0,15.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533625,1.0,36.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Creating new dataframes with either the day or the night samples:

In [2]:
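# Day samples (column labels containing 'LMAD')
otu_table_day_df = otu_table_df.filter(like='LMAD')
# Night samples (column labels containing 'LMAN')
otu_table_night_df = otu_table_df.filter(like='LMAN')
# Only the last expression of a cell is displayed, so preview the night table here
otu_table_night_df.head()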

OTU_ID,LMAN.8.14A0051,LMAN.8.14A0304,LMAN.8.14A0159,LMAN.26.14A0319,LMAN.26.14A0341,LMAN.8.14A0119,LMAN.8.14A0135,LMAN.26.14A0465,LMAN.8.14A0343,LMAN.26.14A0169,...,LMAN.8.14A0197,LMAN.8.14A0247,LMAN.26.14A0211,LMAN.8.14A0339,LMAN.26.14A0093,LMAN.26.14A0303,LMAN.8.14A0011,LMAN.26.14A0327,LMAN.8.14A0205,LMAN.26.14A0329
4479944,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995900,0.0,1.0,0.0,15.0,2.0,5.0,3.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533625,1.0,36.0,0.0,12.0,2.0,56.0,0.0,42.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
# Renaming columns of the day OTU dataframe, e.g. 'LMAD.8.14A0247' becomes '14A0247_8'
# The result of rename() is reassigned: inplace=True on a filtered dataframe raises SettingWithCopyWarning
column_mapping_d = {col: col.split(".")[2]+"_"+col.split(".")[1] for col in otu_table_day_df.columns}
otu_table_day_df = otu_table_day_df.rename(columns=column_mapping_d)
# Renaming columns of the night OTU dataframe
column_mapping_n = {col: col.split(".")[2]+"_"+col.split(".")[1] for col in otu_table_night_df.columns}
otu_table_night_df = otu_table_night_df.rename(columns=column_mapping_n)
otu_table_night_df.head()


OTU_ID,14A0051_8,14A0304_8,14A0159_8,14A0319_26,14A0341_26,14A0119_8,14A0135_8,14A0465_26,14A0343_8,14A0169_26,...,14A0197_8,14A0247_8,14A0211_26,14A0339_8,14A0093_26,14A0303_26,14A0011_8,14A0327_26,14A0205_8,14A0329_26
4479944,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995900,0.0,1.0,0.0,15.0,2.0,5.0,3.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533625,1.0,36.0,0.0,12.0,2.0,56.0,0.0,42.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
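
The split-and-rename steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's own code: the helper `short_sample_name`, the OTU names and the counts are made up, and only the column-label format (`LMAD.8.14A0247`) follows the source table.

```python
import pandas as pd


def short_sample_name(label: str) -> str:
    """Hypothetical helper: 'LMAD.8.14A0247' -> '14A0247_8'."""
    _prefix, second_field, third_field = label.split(".")
    return f"{third_field}_{second_field}"


# Toy counts with made-up values; only the label format mirrors the source.
toy = pd.DataFrame(
    {"LMAD.8.14A0247": [1, 0], "LMAN.8.14A0051": [2, 3], "LMAD.26.14A0381": [0, 5]},
    index=pd.Index(["otu_a", "otu_b"], name="OTU_ID"),
)

# filter(like=...) keeps columns whose label contains the substring;
# rename() without inplace returns a new dataframe, so no warning is raised.
toy_day = toy.filter(like="LMAD").rename(columns=short_sample_name)
toy_night = toy.filter(like="LMAN").rename(columns=short_sample_name)

print(list(toy_day.columns))    # ['14A0247_8', '14A0381_26']
print(list(toy_night.columns))  # ['14A0051_8']
```

Passing a function to `rename(columns=...)` is equivalent to building the dictionary explicitly, and it keeps the label-parsing rule in one place.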


## Select samples with paired RNA-Seq, normalize and filter

Co-occurrence networks and cross-correlations were computed only for microbiome samples with paired RNA-Seq data from the host. Expression data are imported for this purpose, followed by normalization of OTU counts (relative abundance) and filtering based on abundance and the coefficient of variation.

Importing expression data:

In [5]:
expression_counts_day = pd.read_csv("/home/santosrac/Projects/UGA_RACS/Transcriptome/paper2025/paper2025/kremling_expression_v5_day.tsv",
                                    sep="\t")
expression_counts_day.set_index("Name", inplace=True)
expression_counts_night = pd.read_csv("/home/santosrac/Projects/UGA_RACS/Transcriptome/paper2025/paper2025/kremling_expression_v5_night.tsv",
                                    sep="\t")
expression_counts_night.set_index("Name", inplace=True)

In [6]:
# Filter columns of otu_table_day_df based on columns of expression_counts_day
paired_otu_table_day_df = otu_table_day_df[otu_table_day_df.columns.intersection(expression_counts_day.columns)]
paired_otu_table_day_df.head()
# Filter columns of otu_table_night_df based on columns of expression_counts_night
paired_otu_table_night_df = otu_table_night_df[otu_table_night_df.columns.intersection(expression_counts_night.columns)]
paired_otu_table_night_df.head()
# Print dataframe sizes
print(paired_otu_table_day_df.shape)
print(paired_otu_table_night_df.shape)

(9057, 176)
(9057, 228)


In total, there are **176 samples** for the day and **228 samples** for the night dataset.
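
As a small illustration of the pairing step, the sketch below selects the columns shared by two toy tables with `Index.intersection`. The sample labels (`s1`, `s2`, ...), OTU and gene names, and values are invented for the example.

```python
import pandas as pd

# Toy OTU counts (rows = OTUs) and toy expression values (rows = genes).
otu = pd.DataFrame(
    {"s1": [1, 2], "s2": [0, 4], "s3": [7, 0]},
    index=["otu_a", "otu_b"],
)
expr = pd.DataFrame(
    {"s3": [10.0], "s1": [3.5], "s9": [0.1]},
    index=["gene_x"],
)

# Sample labels present in both tables.
shared = otu.columns.intersection(expr.columns)

# Indexing both tables with the same Index gives identical column order,
# which matters when the two matrices are correlated sample by sample.
paired_otu = otu[shared]
paired_expr = expr[shared]

print(sorted(shared))    # ['s1', 's3']
print(paired_otu.shape)  # (2, 2)
```

Subsetting the expression table with the same `shared` index, as done here, is one way to make sure both matrices list the samples in the same order before any cross-correlation.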

In [7]:
# Renaming columns before merging again: '_d' marks day samples and '_n' marks night samples
# The result of rename() is reassigned: inplace=True on a column subset raises SettingWithCopyWarning
column_mapping_d_filtered = {col: col+"_d" for col in paired_otu_table_day_df.columns}
paired_otu_table_day_df = paired_otu_table_day_df.rename(columns=column_mapping_d_filtered)
column_mapping_n_filtered = {col: col+"_n" for col in paired_otu_table_night_df.columns}
paired_otu_table_night_df = paired_otu_table_night_df.rename(columns=column_mapping_n_filtered)
paired_otu_table_night_df.head()


OTU_ID,14A0051_8_n,14A0159_8_n,14A0319_26_n,14A0119_8_n,14A0135_8_n,14A0465_26_n,14A0343_8_n,14A0169_26_n,14A0263_8_n,14A0117_8_n,...,14A0085_8_n,14A0345_8_n,14A0409_26_n,14A0247_8_n,14A0211_26_n,14A0339_8_n,14A0093_26_n,14A0011_8_n,14A0205_8_n,14A0329_26_n
4479944,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995900,0.0,0.0,15.0,5.0,3.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533625,1.0,0.0,12.0,56.0,0.0,42.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Concatenate the two dataframes before filtering
paired_otu_table_day_night_df = pd.concat([paired_otu_table_day_df, paired_otu_table_night_df], axis=1)
paired_otu_table_day_night_df.shape

(9057, 404)

Total OTU counts across the samples with paired RNA-Seq, computed before filtering down to the OTU subset used for co-occurrence/cross-correlation:

In [10]:
total_sum = paired_otu_table_day_night_df.sum().sum()
print(total_sum)

37828305.0


Computing relative abundance, as a percentage of each sample's total counts, for the combined dataframe (day and night samples that have a pair in the RNA-Seq data):

In [8]:
paired_otu_table_day_night_rel_abund = paired_otu_table_day_night_df.divide(paired_otu_table_day_night_df.sum())
paired_otu_table_day_night_rel_abund = paired_otu_table_day_night_rel_abund * 100
paired_otu_table_day_night_rel_abund.head()

OTU_ID,14A0247_8_d,14A0051_8_d,14A0381_26_d,14A0533_26_d,14A0295_8_d,14A0169_26_d,14A0069_8_d,14A0497_26_d,14A0023_8_d,14A0547_26_d,...,14A0085_8_n,14A0345_8_n,14A0409_26_n,14A0247_8_n,14A0211_26_n,14A0339_8_n,14A0093_26_n,14A0011_8_n,14A0205_8_n,14A0329_26_n
4479944,0.011221,0.000866,0.039422,0.011879,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995900,0.0,0.0,0.0,0.0,0.008165,0.004026,0.000922,0.01086,0.002537,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1124709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007685,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
541139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
533625,0.0,0.0,0.026281,0.0,0.0,0.0,0.0,0.0,0.0,0.007685,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
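
The normalization above divides each column by its own total and scales to percent. A toy check, with invented counts and sample labels, shows that every sample then sums to 100:

```python
import pandas as pd

# Toy counts: rows are OTUs, columns are samples (made-up values).
counts = pd.DataFrame(
    {"s1": [10, 30], "s2": [0, 50]},
    index=["otu_a", "otu_b"],
)

# counts.sum() gives one total per sample; divide() aligns those totals
# with the columns, so each column is divided by its own total.
rel_abund_pct = counts.divide(counts.sum(axis=0), axis="columns") * 100

print(rel_abund_pct.loc["otu_a", "s1"])  # 25.0
print(rel_abund_pct.sum(axis=0).tolist())  # [100.0, 100.0]
```

Because the values are percentages rather than fractions, any threshold applied to this table is also expressed in percent.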


Filtering OTUs based on relative abundance across samples:

In [9]:
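# Number of samples in which each OTU exceeds 0.001 (a percentage, given the scaling by 100 above)
samples_above_threshold = (paired_otu_table_day_night_rel_abund > 0.001).sum(axis=1)
# Keep OTUs that exceed the threshold in at least 50% of the samples
otus_tokeep = paired_otu_table_day_night_rel_abund[samples_above_threshold >= (paired_otu_table_day_night_rel_abund.shape[1] * 0.5)].index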

In [10]:
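# Subset the raw count table (not the relative abundances) to the OTUs that passed the abundance filter
filtered_otu_table_day_night_filtered_rel_abund = paired_otu_table_day_night_df[paired_otu_table_day_night_df.index.isin(otus_tokeep)]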
filtered_otu_table_day_night_filtered_rel_abund.shape

(368, 404)
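
The abundance/prevalence rule can be checked on a small example. The sketch below uses invented percentages and the hypothetical names `threshold_pct` and `min_fraction`. It applies the same comparison as the cell above: strictly greater than the threshold, in at least half of the samples.

```python
import pandas as pd

# Toy relative abundances in percent (rows = OTUs, columns = samples).
rel_abund_pct = pd.DataFrame(
    {
        "s1": [0.5, 0.0, 0.002],
        "s2": [0.3, 0.0, 0.002],
        "s3": [0.0, 0.001, 0.0],
        "s4": [0.2, 0.004, 0.0],
    },
    index=["otu_a", "otu_b", "otu_c"],
)

threshold_pct = 0.001  # assumed to be in percent, as in the scaled table
min_fraction = 0.5     # fraction of samples that must exceed the threshold

# Per-OTU count of samples strictly above the threshold.
n_above = (rel_abund_pct > threshold_pct).sum(axis=1)
keep = n_above >= rel_abund_pct.shape[1] * min_fraction
kept_ids = rel_abund_pct.index[keep]

print(n_above.tolist())  # [3, 1, 2]
print(list(kept_ids))    # ['otu_a', 'otu_c']
```

`otu_b` is dropped because its value in `s3` equals the threshold and the comparison is strict; `otu_c` is kept because it passes in exactly half of the samples.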

Filtering OTUs by coefficient of variation (CV), keeping those whose CV is above the first quartile of all CVs:

In [11]:
import numpy as np
# CV (standard deviation / mean) of each OTU across samples, computed on the raw counts
filtered_otu_table_day_night_filtered_rel_abund_cv = filtered_otu_table_day_night_filtered_rel_abund.apply(lambda row: np.std(row) / np.mean(row), axis=1)
# Keep OTUs whose CV is above the first quartile (25th percentile)
cv_first_quartile = filtered_otu_table_day_night_filtered_rel_abund_cv.quantile(q=0.25)
filtered_otu_table_day_night_filtered_rel_abund_cv_filtered = paired_otu_table_day_night_df.loc[filtered_otu_table_day_night_filtered_rel_abund_cv[filtered_otu_table_day_night_filtered_rel_abund_cv > cv_first_quartile].index]
filtered_otu_table_day_night_filtered_rel_abund_cv_filtered.shape

(276, 404)
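
Going from 368 to 276 OTUs matches removing the lowest quarter of CV values (368 × 0.75 = 276). The sketch below reproduces the rule on invented counts. It assumes the population standard deviation (`ddof=0`), stated explicitly here so that the choice is visible; the OTU names and values are hypothetical.

```python
import pandas as pd

# Toy counts: rows are OTUs, columns are samples (made-up values).
counts = pd.DataFrame(
    [[5, 5, 5, 5], [0, 10, 0, 10], [1, 2, 3, 4], [0, 0, 0, 20]],
    index=["otu_flat", "otu_swing", "otu_ramp", "otu_spike"],
    columns=["s1", "s2", "s3", "s4"],
)

values = counts.to_numpy(dtype=float)

# Coefficient of variation per OTU: standard deviation divided by mean.
# ddof=0 (population standard deviation) is an assumption of this sketch.
cv = pd.Series(values.std(axis=1, ddof=0) / values.mean(axis=1), index=counts.index)

# Keep OTUs whose CV is strictly above the first quartile of all CVs.
first_quartile = cv.quantile(0.25)
kept = counts.loc[cv > first_quartile]

print(cv.round(3).to_dict())
print(list(kept.index))  # ['otu_swing', 'otu_ramp', 'otu_spike']
```

An OTU with a mean of zero would produce an undefined CV. In the notebook this case is excluded by the preceding abundance filter, which requires non-zero counts in at least half of the samples.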

In [12]:
filtered_otu_table_day_night_filtered_rel_abund_cv_filtered.head()

OTU_ID,14A0247_8_d,14A0051_8_d,14A0381_26_d,14A0533_26_d,14A0295_8_d,14A0169_26_d,14A0069_8_d,14A0497_26_d,14A0023_8_d,14A0547_26_d,...,14A0085_8_n,14A0345_8_n,14A0409_26_n,14A0247_8_n,14A0211_26_n,14A0339_8_n,14A0093_26_n,14A0011_8_n,14A0205_8_n,14A0329_26_n
969149,0.0,21.0,10.0,1.0,2.0,30.0,0.0,4.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3486915,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,1.0,4.0,2.0,0.0,0.0,17.0,1.0,0.0,0.0
808437,1.0,15.0,2.0,41.0,0.0,7.0,3.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,5.0,0.0,2.0,0.0,0.0,0.0
750840,0.0,0.0,4.0,9.0,1.0,1.0,0.0,2.0,7.0,18.0,...,0.0,0.0,11.0,5.0,1.0,0.0,4.0,0.0,0.0,5.0
542475,0.0,0.0,11.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,2.0,4.0,0.0,12.0,6.0,0.0,0.0


In [13]:
# Split the filtered table back into day and night samples; the regex anchors the tag at the end of the label
filtered_otu_table_day_filtered_rel_abund_cv_filtered = filtered_otu_table_day_night_filtered_rel_abund_cv_filtered.filter(regex='_d$')
print(filtered_otu_table_day_filtered_rel_abund_cv_filtered.shape)
filtered_otu_table_night_filtered_rel_abund_cv_filtered = filtered_otu_table_day_night_filtered_rel_abund_cv_filtered.filter(regex='_n$')
print(filtered_otu_table_night_filtered_rel_abund_cv_filtered.shape)

(276, 176)
(276, 228)


In [14]:
# Remove the '_d' / '_n' tags so that column names match the expression data again
# The result of rename() is reassigned: inplace=True on a filtered dataframe raises SettingWithCopyWarning
column_mapping_d_final = {col: col.replace("_d", "") for col in filtered_otu_table_day_filtered_rel_abund_cv_filtered.columns}
filtered_otu_table_day_filtered_rel_abund_cv_filtered = filtered_otu_table_day_filtered_rel_abund_cv_filtered.rename(columns=column_mapping_d_final)
column_mapping_n_final = {col: col.replace("_n", "") for col in filtered_otu_table_night_filtered_rel_abund_cv_filtered.columns}
filtered_otu_table_night_filtered_rel_abund_cv_filtered = filtered_otu_table_night_filtered_rel_abund_cv_filtered.rename(columns=column_mapping_n_final)


In [15]:
print(filtered_otu_table_night_filtered_rel_abund_cv_filtered.shape)
print(filtered_otu_table_day_filtered_rel_abund_cv_filtered.shape)

(276, 228)
(276, 176)


In [16]:
filtered_otu_table_night_filtered_rel_abund_cv_filtered.to_csv("/home/renato/projects/fapesp_bepe_pd/microbiome/filtered_otu_table_night_filtered_rel_abund_cv_filtered.tsv", sep="\t")
filtered_otu_table_day_filtered_rel_abund_cv_filtered.to_csv("/home/renato/projects/fapesp_bepe_pd/microbiome/filtered_otu_table_day_filtered_rel_abund_cv_filtered.tsv", sep="\t")