# Filtering By Proportion

This `ipynb` performs three filtration substeps that together make up the basic pre-filtration. The resulting pre-filtered files are used as input for REDITs.

**Basic Filtrations Defined (The Filtration Is Done Primarily In The Part 1 and Part 2 ipynb):**

The original dataframe is the MergedSamplesIntoOneTable/Merged_Part_2___MergedSamples.tsv file created in the previous step.
* The 0th Filtration removes all rows (Region & Position) with no editing in any sample, i.e. rows where the sum of proportions is 0.
* The 1st Filtration keeps rows where at least one value in the proportion columns is greater than 0.1 (removes rows where all values are less than or equal to 0.1).
* The 2nd Filtration keeps rows where at least one value in the proportion columns is less than 0.9 (removes rows where all values are greater than or equal to 0.9).
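
The three filtrations can be sketched on a toy table. This is a minimal base R illustration rather than the notebook's own code; the sample column names and every value are invented for the example.

```r
# Toy table with two hypothetical sample columns (all values are invented).
toy <- data.frame(
  Region_Position = c("r1_10", "r1_20", "r1_30", "r1_40"),
  S1_Edited_Count_Proportion = c(0.00, 0.05, 0.95, 0.40),
  S2_Edited_Count_Proportion = c(0.00, 0.10, 0.90, 0.95),
  stringsAsFactors = FALSE
)
prop <- as.matrix(toy[, -1])

# 0th filtration: drop rows whose proportions sum to 0 (no editing anywhere).
keep_0 <- rowSums(prop) != 0
# 1st filtration: keep rows with at least one proportion strictly above 0.1.
keep_1 <- rowSums(prop > 0.1) > 0
# 2nd filtration: keep rows with at least one proportion strictly below 0.9.
keep_2 <- rowSums(prop < 0.9) > 0

toy_filtered <- toy[keep_0 & keep_1 & keep_2, ]
print(toy_filtered$Region_Position)
```

Only `r1_40` survives here: `r1_10` has no editing, `r1_20` never exceeds 0.1 (a value of exactly 0.1 does not count), and `r1_30` never drops below 0.9 (a value of exactly 0.9 does not count).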

## Define Input Path And Load File:


Give the full path to `/Step_3___RenamingRelevantColumns_And_MergingIntoOneTable/Part_2___MergedAllSampleTSVsIntoOneTable/Part_2*.tsv`, where the `*` is a wildcard.


In [None]:

file_path <- "/path/to/Step_03___RenamingRelevantColumns_And_MergingIntoOneDataSet/Part_2___MergedAllSampleTSVsIntoOneTable/Merged_Part_2___MergedSamples.tsv"


### Check the file path exists:

In [None]:
if (file.exists(file_path)) {
  cat("*TSV Path:*", file_path, "\n")
} else {
  cat("The file does not exist. Quitting . . .\n")
  # Exit with a non-zero status so the failure is visible to the caller
  q("no", status = 1, runLast = FALSE)
}

### Read in the tsv file

In [None]:
# Read the TSV file into a data frame
df <- read.table(file_path,
                 header = TRUE,
                 sep = "\t",
                 stringsAsFactors = FALSE)

### Get The Number Of Rows In Df:

Knowing the number of rows in the original file will help when looking at summary statistics.

In [None]:

number_of_rows_in_df <- nrow(df)


### View The Header Of The File:

In [None]:

head(df)


## Create The Main Output Folder Relative To The Input Path:

In [None]:

# Define the output subfolder name
output_subfolder <- "Step_04___FiltrationViaProportion___BasicPreFiltration"

# Create the output folder path (going three levels up)
output_folder <- file.path(dirname(dirname(dirname(file_path))), output_subfolder)

# Create the output folder if it doesn't exist
if (!file.exists(output_folder)) {
  dir.create(output_folder, recursive = TRUE)
}

# Print the path to the output folder
cat("*Main Output Directory:*", output_folder, "\n")
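
The nested `dirname()` calls can be hard to read at a glance. The sketch below traces them on a hypothetical, shortened path (not the real project layout) to show where the output folder ends up.

```r
# Hypothetical input path, used only to show how the nesting resolves.
example_path <- "/data/Step_03/Part_2/merged.tsv"

one_up   <- dirname(example_path)  # "/data/Step_03/Part_2"
two_up   <- dirname(one_up)        # "/data/Step_03"
three_up <- dirname(two_up)        # "/data"

# The Step_04 folder is therefore created beside the Step_03 folder.
example_output <- file.path(three_up, "Step_04")
print(example_output)
```

With the real `file_path`, the same logic places the Step 4 output folder next to the Step 3 folder rather than inside it.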


## Select For Proportion Columns Ending With `_Edited_Count_Proportion`

In [None]:

selected_columns <- grep("_Edited_Count_Proportion$", names(df), value = TRUE)

# Print the selected columns
print(selected_columns)
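
The `$` in the pattern anchors the match to the end of the column name. A small sketch with invented column names shows which names are picked up and which are not; only the suffix pattern itself is taken from the notebook.

```r
# Hypothetical column names; only the suffix pattern comes from the notebook.
example_names <- c(
  "Region_Position",
  "SampleA_Edited_Count",
  "SampleA_Edited_Count_Proportion",
  "SampleB_Edited_Count_Proportion",
  "SampleB_Edited_Count_Proportion_Rounded"
)

# `$` anchors the match at the end of the name, so the last entry is skipped.
example_selected <- grep("_Edited_Count_Proportion$", example_names, value = TRUE)
print(example_selected)
```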

### Library

In [None]:
library(dplyr)

Filtration seemed to have issues, possibly because the values were not being treated as numeric, so I force every column except `Region_Position` to numeric before filtering.

In [None]:

df <- df %>%
  mutate(across(-Region_Position, as.numeric))

# Print the updated data frame
head(df)
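
One way a non-numeric column can distort a threshold filter is sketched below with invented values. Whether this was the cause of the original problem is an assumption; the example only shows why converting first is the safer choice.

```r
# Invented values, stored as text the way a mis-parsed column would be.
as_text <- c("0.5", "1e-05", "0.95")

# Comparing text with a number coerces the number to text, so the values are
# compared as strings and scientific notation can be misclassified.
text_result <- as_text > 0.1

# After conversion the comparison is numeric.
as_number <- as.numeric(as_text)
numeric_result <- as_number > 0.1

print(numeric_result)
```

Note that `as.numeric()` turns anything it cannot parse into `NA` with a warning, so it is worth checking for new `NA` values after the conversion.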


## Sum Of Proportions:

This is the per-row sum of the edited-count proportions across all samples. It is stored in a new column, `Sum_Of_Proportions`.

In [None]:

df <- df %>%
  mutate(Sum_Of_Proportions = rowSums(select(df, all_of(selected_columns))))

# Print the updated data frame with the new column
head(df)


## Remove Zeros Where The Sum Of Proportions Equals 0:

Remove rows where `Sum_Of_Proportions` is equal to 0. This drops all the rows where there is no editing at all.

In [None]:
# Keep only rows where Sum_Of_Proportions is not equal to 0
df_zeros_removed <- df %>%
  filter(Sum_Of_Proportions != 0)

# Print the filtered data frame
head(df_zeros_removed)
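
A side effect worth knowing about: `rowSums()` returns `NA` for a row containing a missing value, and `filter()` drops rows whose condition is `NA`. The sketch below uses invented data and short hypothetical column names (`S1`, `S2`); whether the real table contains any missing proportions is not known here.

```r
library(dplyr)

# Invented data; row "c" has a missing proportion in S1.
toy <- data.frame(
  Region_Position = c("a", "b", "c"),
  S1 = c(0, 0.2, NA),
  S2 = c(0, 0.0, 0.3)
)

# rowSums() without na.rm = TRUE gives NA for row "c".
toy$Sum_Of_Proportions <- rowSums(toy[, c("S1", "S2")])

# filter() keeps only rows where the condition is TRUE, so NA rows are dropped.
kept <- filter(toy, Sum_Of_Proportions != 0)
print(kept$Region_Position)
```

Row "a" is removed because its sum is 0, and row "c" is removed because its sum is `NA`, even though one sample shows editing.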


### Get The Number Of Rows After Removing Rows Where Sum Of Proportions == 0

In [None]:
number_of_rows_in_df_zeros_removed <- nrow(df_zeros_removed)

## Filter To Remove Rows With Proportion Less Than Or Equal To 0.1 In All Proportion Columns


Keep rows where at least one value in the selected columns is greater than 0.1 (remove rows where all values are less than or equal to 0.1).


In [None]:
# Keep rows where at least one value in the selected columns is greater than 0.1
## Remove rows where all values are less than or equal to 0.1

filtered_df <- df_zeros_removed %>%
  filter(!if_all(all_of(selected_columns), ~ . <= 0.1))

# Print the filtered data frame
head(filtered_df)
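
`!if_all(..., ~ . <= 0.1)` reads as "not every value is at most 0.1", which for data without missing values is the same as "some value is above 0.1". The sketch below checks that equivalence on invented data with hypothetical column names; it assumes dplyr 1.0.4 or later, where `if_all()` and `if_any()` are available.

```r
library(dplyr)

# Invented data with two hypothetical sample columns.
toy <- data.frame(
  Region_Position = c("a", "b", "c"),
  S1_Edited_Count_Proportion = c(0.05, 0.20, 0.10),
  S2_Edited_Count_Proportion = c(0.10, 0.05, 0.11)
)
toy_columns <- c("S1_Edited_Count_Proportion", "S2_Edited_Count_Proportion")

# Form used in the notebook: drop rows where every value is <= 0.1.
via_if_all <- toy %>% filter(!if_all(all_of(toy_columns), ~ . <= 0.1))

# Equivalent form (no missing values): keep rows where any value is > 0.1.
via_if_any <- toy %>% filter(if_any(all_of(toy_columns), ~ . > 0.1))

print(via_if_all$Region_Position)
```

Row "a" is dropped because both of its values are at or below 0.1; rows "b" and "c" each have one value above the threshold and are kept by both forms.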

### Get The Number Of Rows In The Dataframe Where At Least 1 Value In The Proportion Columns Is Greater Than 0.1 (Removed Rows Where All Values Are Less Than Or Equal To 0.1)

The number of rows in this filtration will be useful for summary statistics.

In [None]:
number_of_rows_first_filtration <- nrow(filtered_df)

## Filter To Remove Rows With Proportion Greater Than Or Equal To 0.9 In All Proportion Columns:

In [None]:

# Keep rows where at least one value in the selected columns is less than 0.9
## Remove rows where all values are greater than or equal to 0.9 in the proportion columns

second_filtered_df <- filtered_df %>%
  filter(!if_all(all_of(selected_columns), ~ . >= 0.9))

# Print the second filtered data frame
head(second_filtered_df)


### Get The Number Of Rows In The Dataframe Where At Least 1 Value In The Proportion Columns Is Less Than 0.9 (Removed Rows Where All Values Are Greater Than Or Equal To 0.9)

The number of rows in this filtration will be useful for summary statistics.

In [None]:

number_of_rows_second_filtration <- nrow(second_filtered_df)


## Summary Statistics:

The number of rows in each filtration dataframe was collected to provide summary statistics.

In [None]:

# Get the number of rows in the original dataframe
cat("**Number Of Rows In Original Dataframe:**", number_of_rows_in_df, "\n")

# Calculate the difference between the original dataframe and the zeros removed filtration (the 0th filtration)
cat("**Number Of Rows In Dataframe With Zeros Removed:**", number_of_rows_in_df_zeros_removed, "\n")
diff_1 <- number_of_rows_in_df - number_of_rows_in_df_zeros_removed
percent_diff_1 <- (diff_1 / number_of_rows_in_df) * 100
cat("*Difference From Original:*", diff_1, "\n*Percentage Difference:*", percent_diff_1, "%\n\n")

# Calculate the difference between the zeros removed filtration (0th filtration) and the first filtration
cat("**Number Of Rows In The Dataframe Where At Least 1 Value In The Proportion Columns Is Greater Than 0.1 (Removed Rows Where All Proportions Are Less Than Or Equal To 0.1):**", number_of_rows_first_filtration, "\n")
diff_2 <- number_of_rows_in_df_zeros_removed - number_of_rows_first_filtration
percent_diff_2 <- (diff_2 / number_of_rows_in_df_zeros_removed) * 100
cat("*Difference Between Filtration 0 And The First Filtration:*", diff_2, "\n*Percentage Difference:*", percent_diff_2, "%\n\n")

# Calculate the difference between the first and second filtration
cat("**Number Of Rows In The Dataframe Where At Least 1 Value In The Proportion Columns Is Less Than 0.9 (Removed Rows Where All Values Are Greater Than Or Equal To 0.9):**", number_of_rows_second_filtration, "\n")
diff_3 <- number_of_rows_first_filtration - number_of_rows_second_filtration
# Divide by the row count before this step, consistent with the other percentages
percent_diff_3 <- (diff_3 / number_of_rows_first_filtration) * 100
cat("*Difference Between First Filtration And Second Filtration:*", diff_3, "\n*Percentage Difference:*", percent_diff_3, "%\n\n")

# Calculate the difference between the original dataframe and the result of all filtrations
diff_4 <- number_of_rows_in_df - number_of_rows_second_filtration
percent_diff_4 <- (diff_4 / number_of_rows_in_df) * 100
cat("**Difference Between Number Of Rows In Original Dataframe And Second/Last Filtration:**", diff_4, "\n")
cat("*Percentage Difference:*", percent_diff_4, "%\n")
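
The percentages above depend on which row count is used as the denominator; dividing by the rows that entered a step gives the share removed by that step. A small helper can make that choice explicit. `filtration_summary` is a hypothetical name, not part of the notebook, and the counts are invented.

```r
# Hypothetical helper: rows removed by one step, as a count and as a
# percentage of the rows that entered the step.
filtration_summary <- function(rows_before, rows_after) {
  removed <- rows_before - rows_after
  # Guard against division by zero when the input table is empty.
  percent <- if (rows_before > 0) removed / rows_before * 100 else NA_real_
  c(removed = removed, percent = percent)
}

# Invented counts: 200 rows enter the step and 150 remain.
step <- filtration_summary(200, 150)
print(step)
```

Calling it once per step with the counts collected earlier would give the same figures as the manual arithmetic, with the denominator stated in one place.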



## Write The Dataframe With Proportions (The Second Filtration Dataframe)

Write the dataframe with proportions to a part 1 folder as a `csv` and a `tsv`.

In [None]:
# Create the output subfolder within output_folder
subfolder_name <- "Part_1___Filtered_Proportions"
subfolder_path <- file.path(output_folder, subfolder_name)

# Create the subfolder if it doesn't exist
if (!file.exists(subfolder_path)) {
  dir.create(subfolder_path, recursive = TRUE)
}

# Print the path to the output subfolder
cat("*Part 1 Filtered Tables With Proportions Folder:*", subfolder_path, "\n")

# Write second_filtered_df to a TSV file in the subfolder
tsv_file_path <- file.path(subfolder_path, "Part_1___Proportions_Filtered.tsv")
write.table(second_filtered_df, file = tsv_file_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Write second_filtered_df to a CSV file in the subfolder
csv_file_path <- file.path(subfolder_path, "Part_1___Proportions_Filtered.csv")
write.csv(second_filtered_df, file = csv_file_path, quote = FALSE, row.names = FALSE)


## Remove Sum Of Proportions Column

In [None]:

main_df <- select(second_filtered_df, -Sum_Of_Proportions)

# Print the main data frame
head(main_df)


## Remove Per Sample Proportions Columns And Write To Files:

In [None]:

main_df_without_proportions <- select(main_df, -all_of(selected_columns))

# Print the updated data frame
head(main_df_without_proportions)


Write the dataframe without proportions to a part 2 folder as a `csv` and a `tsv`.

In [None]:
# Create the subfolder within output_folder
subfolder_name <- "Part_2___FilteredTablesWithoutProportion"
subfolder_path <- file.path(output_folder, subfolder_name)

# Create the subfolder if it doesn't exist
if (!file.exists(subfolder_path)) {
  dir.create(subfolder_path, recursive = TRUE)
}

# Print the path to the output subfolder
cat("*Part 2 Filtered Tables Without Proportions Folder:*", subfolder_path, "\n")

# Write main_df_without_proportions to a TSV file in the subfolder
tsv_file_path <- file.path(subfolder_path, "Part_2___Filtered_Without_Proportions.tsv")
write.table(main_df_without_proportions, file = tsv_file_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Write main_df_without_proportions to a CSV file in the subfolder
csv_file_path <- file.path(subfolder_path, "Part_2___Filtered_Without_Proportions.csv")
write.csv(main_df_without_proportions, file = csv_file_path, quote = FALSE, row.names = FALSE)


## Session Information

In [None]:

cat("\n\n**Session Information:**\n\n")

sessionInfo()
