# Strict Pre-Filtering Data Before REDITs Step:

To create a strictly filtered file, we used this `ipynb`. For each region and position, if even one sample has a 0 in the `___Edited` row, that row is removed from all samples along with its `___Non_Edited` counterpart.
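As a rough illustration of this rule, the sketch below applies it to a small made-up table. The sample names (`Sample_A`, `Sample_B`), the site labels, and the counts are hypothetical; only the `Region_Position___Count_Type` column name and the two suffixes come from this notebook.

```r
# Toy data (hypothetical values): two sites, each with an Edited and a Non_Edited row
toy <- data.frame(
  Region_Position___Count_Type = c("GeneA_100___Edited", "GeneA_100___Non_Edited",
                                   "GeneB_250___Edited", "GeneB_250___Non_Edited"),
  Sample_A = c(3, 17, 0, 22),
  Sample_B = c(5, 12, 4, 30)
)

labels <- toy$Region_Position___Count_Type
sample_cols <- setdiff(names(toy), "Region_Position___Count_Type")

# Site key = label without its count-type suffix
site <- sub("___(Edited|Non_Edited)$", "", labels)

# Edited rows where at least one sample has a count of 0
edited_with_zero <- grepl("___Edited$", labels) &
  rowSums(toy[, sample_cols, drop = FALSE] == 0) > 0

# Drop both rows of every site flagged above
toy_strict <- toy[!(site %in% site[edited_with_zero]), ]
print(toy_strict)
```

Here `GeneB_250` has a 0 in its edited row for `Sample_A`, so both `GeneB_250` rows are dropped and only the `GeneA_100` pair remains.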

## Required Inputs:

### Specify Output File BaseName:

Specify the basename of the output file. It is recommended that you include `Strict_PreFiltered` in the output file name, since the file is created by this strict pre-filtration script.

In [None]:
# Output file Base Name (Manually Specified)
output_file_name <- "Strict_PreFiltered_Data_From_REDItools_For___DescriptionOfYourChoice"


### Load TSV To Be Processed:

Specify the path to the `tsv` produced by `Step_4___FilterationViaProportion___BasicPreFilteration`, located in its Part 5 subfolder.

In [None]:

# Path to Basic Pre-filtered tsv
tsv_file_path <- "/path/to/Step_04___BasicPreFiltrationDirectory/Part_5___MergingPivotedTables_BasicPreFiltered_REDITsInput/Output_File.tsv"



## Find Output Directory Path Relative To Input TSV File:

In [None]:
# Go up three levels from the TSV file (file -> Part 5 folder -> Step 04 folder -> parent)
# and place the Step 05 output directory next to the Step 04 directory
output_directory <- file.path(dirname(dirname(dirname(tsv_file_path))), "Step_05___Strict_PreFilteration")

# Print the resulting output directory path
cat("*Output directory:*", output_directory, "\n")
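To make the nested `dirname()` calls concrete, here is a small sketch using a hypothetical path with the same three-level layout. The `/project/...` names are placeholders, not real locations.

```r
# Hypothetical input path: <project>/<step dir>/<part dir>/<file>
example_path <- "/project/Step_04/Part_5/Output_File.tsv"

part_dir    <- dirname(example_path)   # "/project/Step_04/Part_5"
step_dir    <- dirname(part_dir)       # "/project/Step_04"
project_dir <- dirname(step_dir)       # "/project"

# The new step directory is created as a sibling of the previous step
example_output <- file.path(project_dir, "Step_05___Strict_PreFilteration")
cat(example_output, "\n")
```

If your input file sits at a different depth, the number of `dirname()` calls would need to change accordingly.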

### Check Paths Exist:

Before the script continues, this check makes sure that the input `tsv` exists and that the output directory exists (creating it if necessary).

In [None]:
# Check if the TSV file exists; stop with an error if it does not
if (file.exists(tsv_file_path)) {
  cat("*TSV File Path:*", tsv_file_path, "\n")
} else {
  # stop() halts this cell with an error; q("no") would also shut down the notebook kernel
  stop("TSV file does not exist. Please provide a valid file path.")
}

# Check if the output directory exists or create it
if (dir.exists(output_directory)) {
  cat("Output Directory:", output_directory, "\n")
} else {
  cat("Output directory does not exist. Creating...\n")
  dir.create(output_directory, recursive = TRUE)
  cat("*Output Directory:*", output_directory, "\n")
}

### Load And Briefly View The Input File Header:

In [None]:
# Use read.delim to load the TSV file
df <- read.delim(tsv_file_path, header = TRUE, sep = "\t")

# Print "Input file header"
cat("*Input file header:*\n")

# Display the first rows of the dataframe
head(df)
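One thing worth checking at this point: `read.delim()` defaults to `check.names = TRUE`, which rewrites column names that are not syntactically valid in R (for example, a hyphen becomes a dot). If the sample names in your file contain such characters, the sketch below shows the difference, using a temporary file and a hypothetical sample name `Sample-1`.

```r
# Write a tiny TSV with a hypothetical, non-syntactic sample name
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Region_Position___Count_Type\tSample-1",
             "GeneA_100___Edited\t3"), tmp)

# Default behaviour: names are passed through make.names()
names_default <- names(read.delim(tmp, header = TRUE, sep = "\t"))

# check.names = FALSE keeps the names exactly as written in the file
names_kept <- names(read.delim(tmp, header = TRUE, sep = "\t", check.names = FALSE))

print(names_default)
print(names_kept)
unlink(tmp)
```

Whether to pass `check.names = FALSE` depends on how downstream steps expect the sample columns to be named.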

The number of rows found in the code block below will be used for small calculations (found near the end of this file).

In [None]:
# Store the number of rows in df:
num_rows___df_Before_Being_Processed <- nrow(df)

## Processing:

## Find The Sample Columns:

The column that is checked for the suffixes `___Edited` and `___Non_Edited` is `Region_Position___Count_Type`. Every other column is treated as a sample column.

In [None]:

# Store the specified column name
Specified_Column <- "Region_Position___Count_Type"

# Identify sample columns (columns that are not 'Region_Position___Count_Type')
sample_columns <- names(df)[names(df) != Specified_Column]

# Print the sample column names
cat("*Sample Columns:*", paste(sample_columns, collapse = ", "), "\n")
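Because the filtering step treats each `___Edited` row and its `___Non_Edited` counterpart as a pair, it can help to confirm that the pairing actually holds before filtering. The following is a minimal sketch on hypothetical labels; `check_pairs` is an illustrative helper, not part of the original notebook.

```r
# Illustrative helper (not from the original notebook): returns the sites
# that do not have exactly one ___Edited and one ___Non_Edited row
check_pairs <- function(labels) {
  site <- sub("___(Edited|Non_Edited)$", "", labels)
  n_edited     <- tapply(grepl("___Edited$", labels), site, sum)
  n_non_edited <- tapply(grepl("___Non_Edited$", labels), site, sum)
  names(n_edited)[n_edited != 1 | n_non_edited != 1]
}

# Hypothetical labels: GeneB_250 is missing its Non_Edited row
example_labels <- c("GeneA_100___Edited", "GeneA_100___Non_Edited",
                    "GeneB_250___Edited")
unpaired <- check_pairs(example_labels)
print(unpaired)
```

On the real data this could be called as `check_pairs(df[[Specified_Column]])`; an empty result means every site is fully paired.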

### Load In Relevant Library

In [None]:
library(dplyr)

## Main:

In [None]:
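# Split each row label into its count type and its region/position key
is_edited <- grepl("___Edited$", df[[Specified_Column]])
region_position <- sub("___(Edited|Non_Edited)$", "", df[[Specified_Column]])

# Flag rows in which at least one sample has a count of 0
has_zero <- rowSums(df[, sample_columns, drop = FALSE] == 0, na.rm = TRUE) > 0

# Collect the region/position keys whose ___Edited row has a 0, then remove
# both the ___Edited row and its ___Non_Edited counterpart for each key.
# Matching by key avoids relying on row order or on row names, which stop
# matching row positions once rows have been removed.
keys_to_remove <- unique(region_position[is_edited & has_zero])
df <- df[!(region_position %in% keys_to_remove), , drop = FALSE]

# Print the first rows of the updated dataframe
head(df)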

### Rows In Strict Filtered Dataframe:

The number of rows found in the code block below will be used for small calculations (found near the end of this file).

In [None]:
num_rows___df_processed <- nrow(df)
cat("Number of rows in the processed data frame:", num_rows___df_processed , "\n")

## Small Calculations To Check:

This is where the number of rows before and after processing are used for basic calculations:

In [None]:

# Calculate the difference in the number of rows
rows_difference <- num_rows___df_Before_Being_Processed - num_rows___df_processed

# Calculate the percentage reduction
percentage_reduction <- (rows_difference / num_rows___df_Before_Being_Processed) * 100

# Print the results
cat("\n\n**Summary Statistics:**\n\n")
cat("*Number of rows in the initial file before strict filtering:*", num_rows___df_Before_Being_Processed, "\n")
cat("*Number of rows in the processed dataframe:*", num_rows___df_processed, "\n")
cat("*Difference in rows:*", rows_difference, "\n")
cat("*Percentage reduction:*", percentage_reduction, "%\n")
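
Since rows are removed in Edited/Non_Edited pairs, the number of removed rows should be even, and halving it gives the number of sites dropped. A small sketch of that sanity check, using hypothetical row counts in place of the values computed above:

```r
# Hypothetical row counts, standing in for the values computed above
rows_before <- 1000
rows_after  <- 640

rows_removed  <- rows_before - rows_after
sites_removed <- rows_removed / 2          # each site contributes two rows

# An odd difference would suggest that a site lost only one half of its pair
stopifnot(rows_removed %% 2 == 0)

cat("Sites removed:", sites_removed, "\n")
cat("Percentage of rows removed:", 100 * rows_removed / rows_before, "%\n")
```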


## Write To CSV and TSV:

Write the dataframe to a `csv` and a `tsv`.

In [None]:
# Write to TSV
tsv_file <- file.path(output_directory, paste0(output_file_name, ".tsv"))
write.table(df, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)
cat("\n\n*Output TSV Path:*", tsv_file, "\n")

# Write to CSV
csv_file <- file.path(output_directory, paste0(output_file_name, ".csv"))
write.csv(df, csv_file, row.names = FALSE)
cat("\n\n*Output CSV Path:*", csv_file, "\n")


## Session Information:

In [None]:

cat("\n\n**Session Information:**\n\n")
sessionInfo()
