# Normalization and filtering steps

Expression counts obtained with Salmon for day and night samples are normalized in this notebook (RPKM and TPM) and then filtered by abundance and by variation across samples (coefficient of variation).


## Importing tables

### Importing expression counts with day and night samples



In [1]:
import pandas as pd

# Day samples: index by transcript name and prefix every sample column with "exp_day_"
expression_counts_day = pd.read_csv("/home/renato/projects/fapesp_bepe_pd/transcriptome/kremling_expression_v5_day.tsv",
                                    sep="\t")
expression_counts_day.set_index("Name", inplace=True)
expression_counts_day.columns = expression_counts_day.columns.str.replace('^','exp_day_', regex=True)
# Night samples: same layout, prefixed with "exp_night_" so day and night columns stay distinguishable
expression_counts_night = pd.read_csv("/home/renato/projects/fapesp_bepe_pd/transcriptome/kremling_expression_v5_night.tsv",
                                    sep="\t")
expression_counts_night.set_index("Name", inplace=True)
expression_counts_night.columns = expression_counts_night.columns.str.replace('^','exp_night_', regex=True)

In [2]:
expression_counts_day_night = pd.concat([expression_counts_day,
                                         expression_counts_night], axis=1)
expression_counts_day_night.head()

Name,exp_day_14A0253_26,exp_day_14A0171_26,exp_day_14A0045_8,exp_day_14A0085_8,exp_day_14A0079_26,exp_day_14A0467_26,exp_day_14A0039_8,exp_day_14A0403_26,exp_day_14A0063_8,exp_day_14A0513_26,...,exp_night_14A0519_26,exp_night_14A0005_8,exp_night_14A0027_8,exp_night_14A0533_26,exp_night_14A0333_26,exp_night_14A0473_26,exp_night_14A0047_8,exp_night_14A0453_26,exp_night_14A0345_8,exp_night_14A0343_8
Zm00001eb371370_T002,2,3,7,7,1,8,7,24,20,4,...,1,0,0,3,1,1,0,0,0,0
Zm00001eb371350_T001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Zm00001eb371330_T001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Zm00001eb371310_T001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Zm00001eb371280_T001,3,0,6,22,3,9,10,5,8,4,...,3,0,6,0,4,1,4,0,9,4
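
The prefix-and-concatenate step above can be sketched on toy data. The transcript IDs and sample names below are made up for illustration; `add_prefix` is used here as an equivalent of the regex replacement on `'^'`. Note that `pd.concat(..., axis=1)` aligns rows on the index and, with its default outer join, leaves `NaN` wherever a transcript is present in only one table.

```python
import pandas as pd

# Toy count tables (hypothetical transcript IDs and sample names).
day = pd.DataFrame({"s1": [2, 0], "s2": [3, 0]},
                   index=pd.Index(["tx_a", "tx_b"], name="Name"))
night = pd.DataFrame({"s1": [1, 0], "s3": [0, 4]},
                     index=pd.Index(["tx_a", "tx_c"], name="Name"))

# Prefix the sample columns so day and night samples stay distinguishable.
day = day.add_prefix("exp_day_")
night = night.add_prefix("exp_night_")

# Column-wise concatenation: rows are matched by transcript name.
combined = pd.concat([day, night], axis=1)

# Transcripts found in only one table get NaN in the other table's columns.
n_missing = int(combined.isna().sum().sum())
print(combined)
print("missing cells:", n_missing)
```

If `n_missing` is non-zero on the real tables, the day and night files do not cover the same set of transcripts, which is worth checking before normalization.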


## Normalizing data - transcript expression

Importing the transcript lengths required by the RPKM and TPM normalization methods:

In [4]:
gene_length_table = pd.read_csv('/home/renato/projects/fapesp_bepe_pd/transcriptome/Zmays_Zm_B73_REFERENCE_NAM_5_0_55_transcripts_PrimaryTranscriptOnly_length.txt',
                               sep="\t")
gene_length_table.set_index('Name', inplace=True)

### RPKM

In [5]:
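from bioinfokit.analys import norm

# Append the Length column to the counts; pd.merge defaults to an inner join,
# so transcripts missing from the length table are dropped here
kremling_raw_expression_v5_gene_length = pd.merge(expression_counts_day_night, gene_length_table, on="Name")
nm = norm()
# gl names the column holding the transcript length
nm.rpkm(df=kremling_raw_expression_v5_gene_length, gl='Length')
kremling_expression_v5_rpkm = nm.rpkm_norm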
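
For reference, the standard RPKM definition can be written directly in pandas. This is a minimal sketch on toy data, assuming lengths are in base pairs and the per-sample column sum is used as the library size; it illustrates the formula and is not a copy of the bioinfokit implementation.

```python
import pandas as pd

# Toy counts (transcripts x samples) and transcript lengths in bp (made-up values).
counts = pd.DataFrame({"s1": [10, 30], "s2": [0, 50]},
                      index=pd.Index(["tx_a", "tx_b"], name="Name"))
length_bp = pd.Series([1000, 2000], index=counts.index, name="Length")

# Library size: total counts per sample (column sums).
library_size = counts.sum(axis=0)

# RPKM = counts * 1e9 / (length in bp * library size)
rpkm = counts.mul(1e9).div(length_bp, axis=0).div(library_size, axis=1)
print(rpkm)
```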

### TPM

In [6]:
from bioinfokit.analys import norm

nm = norm()
nm.tpm(df=kremling_raw_expression_v5_gene_length, gl='Length')
kremling_expression_v5_tpm = nm.tpm_norm
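
The standard TPM definition differs from RPKM in the order of operations: counts are first divided by transcript length, and the resulting rates are then rescaled so that every sample sums to one million. The sketch below shows this on toy data, assuming lengths in base pairs; it illustrates the formula rather than reproducing the bioinfokit code.

```python
import pandas as pd

# Toy counts (transcripts x samples) and transcript lengths in bp (made-up values).
counts = pd.DataFrame({"s1": [10, 30], "s2": [0, 50]},
                      index=pd.Index(["tx_a", "tx_b"], name="Name"))
length_bp = pd.Series([1000, 2000], index=counts.index, name="Length")

# Step 1: reads per kilobase of transcript.
rate = counts.div(length_bp / 1000, axis=0)

# Step 2: rescale each sample so its rates sum to one million.
tpm = rate.div(rate.sum(axis=0), axis=1) * 1e6
print(tpm)
print(tpm.sum(axis=0))  # every column sums to 1e6
```

Because each column sums to the same total, TPM values are directly comparable across samples, which is one reason to export the TPM matrix rather than the RPKM one.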

### Transcript expression

Keep only the transcripts with RPKM > 3 in at least 80% of the samples:

In [7]:
genes_tokeep = kremling_expression_v5_rpkm[(kremling_expression_v5_rpkm > 3).sum(axis=1) >= (kremling_expression_v5_rpkm.shape[1] * 0.8)].index
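
The abundance filter can be unpacked into its three steps on a toy matrix. The values and names below are hypothetical; the thresholds (RPKM > 3, 80% of samples) are the ones used in the cell above.

```python
import pandas as pd

# Toy RPKM matrix: 3 transcripts x 5 samples (made-up values).
rpkm = pd.DataFrame(
    [[5, 4, 6, 7, 9],   # above 3 in 5/5 samples
     [5, 4, 6, 7, 1],   # above 3 in 4/5 samples (exactly 80%)
     [5, 4, 6, 1, 1]],  # above 3 in 3/5 samples
    index=pd.Index(["tx_a", "tx_b", "tx_c"], name="Name"),
    columns=["s1", "s2", "s3", "s4", "s5"],
)

# Step 1: minimum number of samples in which a transcript must be expressed.
min_samples = rpkm.shape[1] * 0.8

# Step 2: per transcript, count the samples with RPKM above the threshold.
n_expressed = (rpkm > 3).sum(axis=1)

# Step 3: keep the transcripts that reach the minimum (the comparison is inclusive).
keep = rpkm.index[n_expressed >= min_samples]
print(list(keep))
```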

Subset the TPM matrix to the transcripts that passed the RPKM threshold above:

In [8]:
kremling_expression_v5_tpm_filtered = kremling_expression_v5_tpm[kremling_expression_v5_tpm.index.isin(genes_tokeep)]

Remove the 25% of transcripts with the lowest coefficient of variation (CV = standard deviation / mean) across samples:

In [9]:
import numpy as np

# Coefficient of variation (standard deviation / mean) of each transcript (row) across samples
kremling_expression_v5_tpm_filtered_cv = kremling_expression_v5_tpm_filtered.apply(lambda row: np.std(row) / np.mean(row), axis=1)
# Keep the transcripts whose CV is above the first quartile (drops the least variable 25%)
cv_threshold = kremling_expression_v5_tpm_filtered_cv.quantile(q=0.25)
kremling_expression_v5_tpm_filtered_cv_filtered = kremling_expression_v5_tpm_filtered.loc[kremling_expression_v5_tpm_filtered_cv > cv_threshold]
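
A small worked example of the CV filter, on made-up values. In this sketch the standard deviation is computed on a NumPy array with `ddof=0` (population standard deviation); that is a choice made for the illustration. Whether the cell above uses `ddof=0` or `ddof=1` depends on how `np.std` dispatches when it receives a pandas Series, so it is worth verifying in the environment used.

```python
import numpy as np
import pandas as pd

# Toy TPM matrix: 4 transcripts x 4 samples, every row has mean 10 (made-up values).
tpm = pd.DataFrame(
    [[10, 10, 10, 10],  # CV = 0.0
     [8, 12, 8, 12],    # CV = 0.2
     [5, 15, 5, 15],    # CV = 0.5
     [1, 19, 1, 19]],   # CV = 0.9
    index=pd.Index(["tx_a", "tx_b", "tx_c", "tx_d"], name="Name"),
    columns=["s1", "s2", "s3", "s4"],
    dtype=float,
)

# Row-wise CV with an explicit ddof (assumption: population std, ddof=0).
values = tpm.to_numpy()
cv = pd.Series(values.std(axis=1, ddof=0) / values.mean(axis=1), index=tpm.index)

# First quartile of the CV distribution (linear interpolation by default).
threshold = cv.quantile(q=0.25)

# Keep the transcripts strictly above the threshold.
filtered = tpm.loc[cv > threshold]
print(cv)
print("threshold:", threshold)
print(list(filtered.index))
```

Here the invariant transcript `tx_a` is removed; a transcript whose mean is zero would yield an undefined CV, but such rows cannot pass the preceding abundance filter.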

## Exporting the whole filtered matrix

The final filtered expression matrix was used to compute co-expression (corALS) and cross-correlation (SparXCC) networks.


In [10]:
kremling_expression_v5_tpm_filtered_cv_filtered.to_csv("/home/renato/projects/fapesp_bepe_pd/transcriptome/kremling_expression_v5_tpm_filtered_cv_filtered.tsv",
                                                        sep="\t")