# Filter Snakemake Pairwise Results

This notebook filters pairwise alignment results from the Snakemake workflow using two contig datasets:
1. **All deduplicated contigs**: The complete set of deduplicated contigs
2. **Non-cellular filtered contigs**: Contigs that have been filtered to remove cellular contamination

The goal is to create a filtered dataset that only includes pairwise alignments for contigs that passed the non-cellular filtering step.

In [None]:
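import sys
import pandas as pd      # used as `pd` below
from Bio import SeqIO    # FASTA parsing
sys.path.append('../scripts')  # make the project's helper module importable
from utils import *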

## Step 1: Load and Analyze All Deduplicated Pairwise Alignments

First, we load the complete pairwise alignment results from Snakemake and extract unique SRR identifiers.

In [None]:
# Load the pairwise alignment results from Snakemake workflow
# This CSV contains alignment results for all deduplicated contigs
pairwise_aln_all_deduplicated = pd.read_csv('/home/tobamo/analize/project-tobamo/analysis/model/results/snakemake/testing_input.csv')

# Extract the SRR identifier from each contig name with a regex
# Contig names are expected to end with a suffix such as "_SRR123456"
pairwise_aln_all_deduplicated['SRR'] = pairwise_aln_all_deduplicated['contig_name'].str.extract(r"_([A-Za-z0-9]+)$", expand=False)

# Count unique SRR identifiers in the pairwise alignment data
pairwise_aln_all_deduplicated.SRR.nunique()

139
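
To make the suffix extraction concrete, here is a minimal sketch using Python's `re` module with the same pattern. The contig names are made up for illustration, and `extract_suffix` is a hypothetical helper, not part of the project's `utils`. The pattern captures whatever alphanumeric token follows the last underscore, so on its own it does not verify that the token is an SRR accession.

```python
import re

# Same pattern as in the cell above: the alphanumeric token after the last underscore.
SUFFIX_PATTERN = re.compile(r"_([A-Za-z0-9]+)$")

def extract_suffix(contig_name):
    """Return the trailing token after the final underscore, or None if absent."""
    match = SUFFIX_PATTERN.search(contig_name)
    return match.group(1) if match else None

# Hypothetical contig names, for illustration only.
examples = [
    "NODE_12_length_3456_SRR1234567",
    "k141_987_ERR7654321",
    "contigwithoutsuffix",
]
for name in examples:
    print(name, "->", extract_suffix(name))
```

In pandas, names that do not match the pattern yield `NaN` in the extracted column rather than `None`, and `nunique()` ignores them by default.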

In [None]:
# Parse the deduplicated contigs FASTA file to get all sequence IDs
all_deduplicated_ids = [seq.id for seq in SeqIO.parse('/home/tobamo/analize/project-tobamo/analysis/data/contigs/contigs_all_deduplicated.fasta', 'fasta')]

# Extract unique SRR identifiers from the FASTA sequence IDs
all_deduplicated_srrs = pd.Series(all_deduplicated_ids).str.extract(r"_([A-Za-z0-9]+)$")[0].unique()

# Display the total number of contigs and unique SRR samples in the deduplicated dataset
len(all_deduplicated_ids), len(all_deduplicated_srrs)

(2549, 139)
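
Both the alignment table and the FASTA file report 139 unique SRR identifiers, but equal counts do not by themselves show that the two sets contain the same identifiers. A minimal sketch of a set-based check follows. `compare_ids` is a hypothetical helper and the identifiers are toy values.

```python
def compare_ids(table_ids, fasta_ids):
    """Return the ids found in only one of the two collections."""
    table_set, fasta_set = set(table_ids), set(fasta_ids)
    return sorted(table_set - fasta_set), sorted(fasta_set - table_set)

# Toy identifiers standing in for the real SRR lists.
table_srrs = ["SRR1", "SRR2", "SRR3"]
fasta_srrs = ["SRR1", "SRR2", "SRR4"]

only_in_table, only_in_fasta = compare_ids(table_srrs, fasta_srrs)
print(only_in_table)
print(only_in_fasta)
```

In the notebook, the same helper could be called with `pairwise_aln_all_deduplicated['SRR'].dropna().unique()` and `all_deduplicated_srrs`. Two empty lists would indicate that the table and the FASTA file cover the same samples.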

## Step 2: Analyze Non-Cellular Filtered Contigs

Now we load the contigs that remain after cellular contamination was removed and compare them with the pairwise alignment data.

In [None]:
# Parse the non-cellular filtered contigs FASTA file
# Replace '=' with '_' in sequence IDs for consistency (some FASTA files may have '=' characters)
all_non_cellular_filtered_ids = [seq.id.replace('=','_') for seq in SeqIO.parse('/home/tobamo/analize/project-tobamo/analysis/data/contigs/contigs_non_cellular_filtered.fasta', 'fasta')]

# Extract unique SRR identifiers from the filtered contig IDs
all_non_cellular_filtered_srrs = pd.Series(all_non_cellular_filtered_ids).str.extract(r"_([A-Za-z0-9]+)$")[0].unique()

# Display the total number of filtered contigs and unique SRR samples
len(all_non_cellular_filtered_ids), len(all_non_cellular_filtered_srrs)

(510, 131)
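
The `replace('=', '_')` step matters because the later membership test is an exact string comparison. The sketch below assumes, for illustration only, that a FASTA header keeps an `=` where the alignment table has `_`. The identifiers are hypothetical and not taken from the project data.

```python
# Hypothetical ids: the FASTA header keeps '=', the alignment table uses '_'.
fasta_ids = ["NODE_7_cov=3.2_SRR1234567"]
table_names = {"NODE_7_cov_3.2_SRR1234567"}

# Without normalization the id does not match any name in the table.
raw_matches = [i for i in fasta_ids if i in table_names]

# After replacing '=' with '_' the same contig is found.
normalized_ids = [i.replace("=", "_") for i in fasta_ids]
normalized_matches = [i for i in normalized_ids if i in table_names]

print(len(raw_matches))
print(len(normalized_matches))
```

If identifiers differ in any other character, the same kind of silent mismatch would occur, so comparing the number of matched contigs against the number of FASTA records is a cheap safeguard.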

## Step 3: Filter Pairwise Alignments

Filter the pairwise alignment results to only include contigs that passed the non-cellular filtering step.

In [None]:
# Filter the pairwise alignment dataframe to only include contigs
# that are present in the non-cellular filtered dataset
# This removes alignments for contigs that were identified as cellular contamination
non_cellular_filtered_df = pairwise_aln_all_deduplicated[pairwise_aln_all_deduplicated.contig_name.isin(all_non_cellular_filtered_ids)]

# Count the number of unique contigs in the filtered dataset
non_cellular_filtered_df['contig_name'].nunique()

510

## Step 4: Save Filtered Results

Save the filtered pairwise alignment results to a new CSV file for downstream analysis.

In [None]:
# Save the filtered pairwise alignment results to a new CSV file
# This file contains only alignments for contigs that passed non-cellular filtering
# and can be used for downstream viral analysis without cellular contamination
non_cellular_filtered_df.to_csv('/home/tobamo/analize/project-tobamo/analysis/model/results/snakemake/pairwise_aln_all_deduplicated_non_cellular_filtered.csv', index=False)
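
Before relying on the saved file, two checks may be worthwhile: every retained row should belong to an allowed contig, and every allowed contig should appear at least once. The sketch below illustrates both, plus a write-and-reload round trip, using only the standard library so that it runs without the project files. The `contig_name` column comes from the notebook. The `identity` column and all values are made up for illustration.

```python
import csv
import io

# Toy alignment rows; 'identity' is a made-up column for illustration.
rows = [
    {"contig_name": "c1_SRR1", "identity": "0.91"},
    {"contig_name": "c1_SRR1", "identity": "0.88"},
    {"contig_name": "c2_SRR1", "identity": "0.75"},
    {"contig_name": "c3_SRR2", "identity": "0.60"},
]
allowed_ids = {"c1_SRR1", "c3_SRR2", "c9_SRR3"}

# Keep only rows whose contig passed filtering (the role of `isin` in pandas).
kept = [row for row in rows if row["contig_name"] in allowed_ids]

# Allowed ids with no alignment rows at all; worth inspecting if non-empty.
missing = sorted(allowed_ids - {row["contig_name"] for row in kept})

# Write to an in-memory CSV and read it back to confirm nothing changed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["contig_name", "identity"])
writer.writeheader()
writer.writerows(kept)
buffer.seek(0)
reloaded = list(csv.DictReader(buffer))

print(len(kept), missing, reloaded == kept)
```

In this notebook the filtered table contains 510 unique contigs, the same as the number of records in the filtered FASTA file. Assuming the FASTA identifiers are unique, this suggests that the equivalent of `missing` is empty for the real data.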