### Prepare RNA Editing data from SPRINT for REDIT Statistical Inference
##### The purpose of this notebook is to reshape the SPRINT output into the format required by the REDIT R package, which performs statistical inference on RNA editing sites. For each site, REDIT requires the following format:

|                       | Mutant | Mutant | Mutant | Control | Control | Control |
|-----------------------|--------|--------|--------|---------|---------|---------|
| Number of edited reads     |        |        |        |         |         |         |
| Number of non-edited reads |        |        |        |         |         |         |

##### We will also create an ID column of the form chr_site (e.g. 1_209063) that identifies the site at which the editing occurs. The IDs will be appended after running the REDIT package; for now we only build the format that makes the data easy to extract in R.
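
As a minimal sketch of the target layout for a single site, the block below builds one 2 x 6 matrix by hand. The coverage strings are copied from the first row of the SPRINT table displayed further down in this notebook; the `non_edited:edited` ordering of each string is an assumption, and the names `coverage` and `site_matrix` are illustrative only.

```python
import pandas as pd

# Coverage strings for site 1_141516, copied from the first row of the SPRINT table.
# Assumption: each string is "non_edited:edited".
site_id = "1_141516"
coverage = {
    "Ctrl-04": "43:57", "Ctrl-05": "30:45", "Ctrl-06": "46:63",
    "NO-04": "43:60", "NO-05": "26:44", "NO-06": "29:57",
}

samples = list(coverage)
non_edited = [int(coverage[s].split(":")[0]) for s in samples]
edited = [int(coverage[s].split(":")[1]) for s in samples]

# One row of edited counts and one row of non-edited counts, one column per sample
site_matrix = pd.DataFrame(
    [edited, non_edited],
    index=["Edited", "Non-Edited"],
    columns=samples,
)
print(site_id)
print(site_matrix)
```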

In [68]:
import pandas as pd

In [69]:
df=pd.read_csv("/mnt/vast/hpc/csg/hcs2152/ZFR_RNA_Editing/SPRINT/Output/A2I_Editing/5dpf/SPRINT_ZFR_editing_5dpf.tsv",sep='\t')

In [70]:
# Inspect the full table before dropping the columns we won't be using
df

Unnamed: 0,chr,start,stop,editing type,read type,strand,Ctrl-04 coverage,Ctrl-05 coverage,Ctrl-06 coverage,NO-04 coverage,NO-05 coverage,NO-06 coverage,Ctrl-04 coverage_editing_percentage,Ctrl-05 coverage_editing_percentage,Ctrl-06 coverage_editing_percentage,NO-04 coverage_editing_percentage,NO-05 coverage_editing_percentage,NO-06 coverage_editing_percentage
0,1,141515,141516,AG,regular_and_hyper,+,43:57,30:45,46:63,43:60,26:44,29:57,57.000000,60.000000,57.798165,58.252427,62.857143,66.279070
1,1,141638,141639,AG,regular_and_hyper,+,31:37,15:15,23:28,18:18,12:13,17:19,54.411765,50.000000,54.901961,50.000000,52.000000,52.777778
2,1,141639,141640,AG,regular_and_hyper,+,36:37,15:15,27:28,18:18,12:13,18:19,50.684932,50.000000,50.909091,50.000000,52.000000,51.351351
3,1,141640,141641,AG,regular_and_hyper,+,36:37,15:15,27:28,18:18,13:13,18:19,50.684932,50.000000,50.909091,50.000000,50.000000,51.351351
4,1,141648,141649,AG,regular_and_hyper,+,29:37,12:15,17:28,15:19,8:13,16:20,56.060606,55.555556,62.222222,55.882353,61.904762,55.555556
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19688,KN150699.1,24761,24762,TC,hyper,-,5:5,6:6,7:7,13:13,5:5,7:7,50.000000,50.000000,50.000000,50.000000,50.000000,50.000000
19689,KZ115990.1,7134,7135,AG,hyper,+,2:25,2:19,4:21,1:18,3:28,2:23,92.592593,90.476190,84.000000,94.736842,90.322581,92.000000
19690,KZ115990.1,7175,7176,AG,hyper,+,2:17,2:19,4:18,1:12,3:28,1:19,89.473684,90.476190,81.818182,92.307692,90.322581,95.000000
19691,KZ115990.1,7184,7185,AG,hyper,+,2:14,2:18,1:17,1:12,1:21,1:15,87.500000,90.000000,94.444444,92.307692,95.454545,93.750000
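
Before reshaping, it helps to confirm which half of each coverage string is the edited count. The sketch below assumes the strings are `first:second` read counts and checks that `second / (first + second)` reproduces the editing percentages printed in the table above; `edited_percentage` is a hypothetical helper, not part of SPRINT or REDIT.

```python
def edited_percentage(coverage: str) -> float:
    """Percentage of reads given by the second number of a 'first:second' string."""
    first, second = (int(x) for x in coverage.split(":"))
    return 100.0 * second / (first + second)

# (coverage string, reported percentage) pairs copied from the table above
checks = [
    ("43:57", 57.000000),   # Ctrl-04, site 1:141516
    ("46:63", 57.798165),   # Ctrl-06, site 1:141516
    ("31:37", 54.411765),   # Ctrl-04, site 1:141639
    ("2:25", 92.592593),    # Ctrl-04, site KZ115990.1:7135
]

for coverage, reported in checks:
    computed = edited_percentage(coverage)
    print(coverage, round(computed, 6), reported)
```

For these rows the second number behaves as the edited count and the first as the non-edited count, which is why the reshaping step later swaps the two.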


In [71]:
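# Keep only rows whose "chr" is a numbered chromosome (1-25).
# The alternation is grouped so that ^ and $ apply to every branch, not just the first and last.
filtered_df = df[df['chr'].astype(str).str.match(r'^(?:[1-9]|1[0-9]|2[0-5])$')]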
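
In a regular expression, `|` binds more loosely than the anchors, so an ungrouped pattern such as `^[1-9]|1[0-9]|2[0-5]$` accepts any name that merely starts with a digit from 1 to 9. The sketch below compares the ungrouped and grouped forms on a few names; `"26"` and `"100"` are hypothetical names used only to show the difference and are not taken from the data.

```python
import re

ungrouped = re.compile(r'^[1-9]|1[0-9]|2[0-5]$')   # anchors apply to single branches only
grouped = re.compile(r'(?:[1-9]|1[0-9]|2[0-5])')   # intended for use with fullmatch

# "26" and "100" are hypothetical; the other names appear in this notebook
names = ["1", "9", "10", "25", "26", "100", "KN150699.1", "KZ115990.1"]

ungrouped_hits = [n for n in names if ungrouped.match(n)]
grouped_hits = [n for n in names if grouped.fullmatch(n)]

print("ungrouped:", ungrouped_hits)
print("grouped:  ", grouped_hits)
```

Both forms reject the unplaced scaffolds, so on data that only contains chromosomes 1-25 plus scaffold names the two filters select the same rows.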

In [72]:
# Keep chr (position 0), stop (position 2), and the six per-sample coverage columns (positions 6-11)
new_df = filtered_df.iloc[:, [0, 2] + list(range(6, 12))]

In [73]:
new_df

Unnamed: 0,chr,stop,Ctrl-04 coverage,Ctrl-05 coverage,Ctrl-06 coverage,NO-04 coverage,NO-05 coverage,NO-06 coverage
0,1,141516,43:57,30:45,46:63,43:60,26:44,29:57
1,1,141639,31:37,15:15,23:28,18:18,12:13,17:19
2,1,141640,36:37,15:15,27:28,18:18,12:13,18:19
3,1,141641,36:37,15:15,27:28,18:18,13:13,18:19
4,1,141649,29:37,12:15,17:28,15:19,8:13,16:20
...,...,...,...,...,...,...,...,...
19302,9,55613766,2:10,2:11,4:9,2:12,1:24,5:15
19303,9,55613771,7:10,6:11,2:8,5:12,13:24,10:11
19304,9,55613772,10:10,10:11,4:8,8:12,14:24,9:11
19305,9,55613779,9:10,4:11,2:8,5:12,12:24,7:12


In [74]:
# Create the "ID" column
id_column = new_df['chr'].astype(str) + '_' + new_df['stop'].astype(str)

# Create a DataFrame for the "ID" column
id_matrix = pd.DataFrame({'ID': id_column})

# Display the "ID" matrix
print(id_matrix)

               ID
0        1_141516
1        1_141639
2        1_141640
3        1_141641
4        1_141649
...           ...
19302  9_55613766
19303  9_55613771
19304  9_55613772
19305  9_55613779
19306  9_55613780

[19307 rows x 1 columns]


In [75]:
# Define a function that turns one row into a 2 x 6 matrix of edited / non-edited counts
def split_rearrange_and_swap(row):
    # Split the values in the coverage columns by ':' (chr and stop, the first two columns, are skipped)
    split_values = [value.split(':') for value in row.iloc[2:]]

    # Each string is "non-edited:edited" (consistent with the editing percentages in the input table),
    # so swap the pair to put the edited count first
    swapped_values = [(value[1], value[0]) for value in split_values]

    # Transpose the result to have separate rows for 'Edited' and 'Non-Edited'
    edited_nonedited_values = list(zip(*swapped_values))

    # Create a DataFrame for the result, with one column per sample
    result_df = pd.DataFrame(edited_nonedited_values, index=['Edited', 'Non-Edited'], columns=row.index[2:])
    
    return result_df

# Apply the function to each row
result_matrices = new_df.apply(split_rearrange_and_swap, axis=1)

# Concatenate the list of DataFrames into a single DataFrame
result_df = pd.concat(result_matrices.tolist(), keys=id_matrix['ID'], axis=0)

# Convert the index values to strings and include "Edited" and "Non-Edited"
result_df.index = result_df.index.map(lambda x: f"{str(x[0])} {x[1]}")

# Display the final result
print(result_df)

                      Ctrl-04 coverage Ctrl-05 coverage Ctrl-06 coverage  \
1_141516 Edited                     57               45               63   
1_141516 Non-Edited                 43               30               46   
1_141639 Edited                     37               15               28   
1_141639 Non-Edited                 31               15               23   
1_141640 Edited                     37               15               28   
...                                ...              ...              ...   
9_55613772 Non-Edited               10               10                4   
9_55613779 Edited                   10               11                8   
9_55613779 Non-Edited                9                4                2   
9_55613780 Edited                   10               11                8   
9_55613780 Non-Edited                4                7                3   

                      NO-04 coverage NO-05 coverage NO-06 coverage  
1_141516 Edited   
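
The row-wise `apply` above builds one small DataFrame per site. A possible alternative, sketched here on a toy stand-in for `new_df`, splits each coverage column once and interleaves the two halves. The names `toy`, `coverage_cols` and `long_df` are illustrative, the `non_edited:edited` ordering is assumed, and the counts are converted to integers, unlike the strings kept above.

```python
import pandas as pd

# Toy stand-in for new_df: two sites and two samples copied from the tables above
toy = pd.DataFrame({
    "chr": ["1", "1"],
    "stop": [141516, 141639],
    "Ctrl-04 coverage": ["43:57", "31:37"],
    "NO-04 coverage": ["43:60", "18:18"],
})
coverage_cols = [c for c in toy.columns if c.endswith("coverage")]
ids = toy["chr"].astype(str) + "_" + toy["stop"].astype(str)

# Split every "non_edited:edited" string column-wise and keep each half as integers
non_edited = toy[coverage_cols].apply(lambda col: col.str.split(":").str[0].astype(int))
edited = toy[coverage_cols].apply(lambda col: col.str.split(":").str[1].astype(int))

edited.index = pd.Index(ids + " Edited")
non_edited.index = pd.Index(ids + " Non-Edited")

# Interleave so each site's Edited row is followed by its Non-Edited row
# (this assumes the site IDs are unique)
order = [label for i in ids for label in (f"{i} Edited", f"{i} Non-Edited")]
long_df = pd.concat([edited, non_edited]).loc[order]
print(long_df)
```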

In [67]:
result_df.to_csv('/mnt/vast/hpc/csg/hcs2152/ZFR_RNA_Editing/SPRINT/Output/A2I_Editing/5dpf/5dpf_REDIT_input.tsv', sep='\t')
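
As a quick round-trip check, the sketch below writes a toy table in the same tab-separated layout to an in-memory buffer, reads it back, and pulls out the two rows for one site. It uses `io.StringIO` instead of the real output path, and `toy` and `site_rows` are illustrative names.

```python
import io
import pandas as pd

# Toy table in the same layout as result_df: "<ID> Edited" / "<ID> Non-Edited" row labels
toy = pd.DataFrame(
    [[57, 60], [43, 43]],
    index=["1_141516 Edited", "1_141516 Non-Edited"],
    columns=["Ctrl-04 coverage", "NO-04 coverage"],
)

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t")
buffer.seek(0)

# Read it back, using the first column as the row labels
loaded = pd.read_csv(buffer, sep="\t", index_col=0)

# Select both rows that belong to one site ID
site_rows = loaded[loaded.index.str.startswith("1_141516 ")]
print(site_rows)
```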