<a href="https://colab.research.google.com/github/marcexpositg/CRISPRed/blob/master/02.Model/2.2.LabelGen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 2.2. Label Generation

This notebook shows the script used to generate the labels for the machine learning model. It quantifies both:
- The efficiency of each gRNA
- The frequency of each gene editing outcome for each target region.

The script starts from the simulated data, which has one target region per column. For each target region, the script generates all possible outcomes and stores them in a dictionary. It then counts the number of simulated reads that belong to each outcome.
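
The idea of an outcome dictionary plus a counting pass can be sketched on a toy example. The reference sequence, cut site, reads, and the shortened identifiers (`WT`, `I01A`) below are made up for illustration and are not taken from the data set; the full script further down uses longer sequences and a wider deletion window.

```python
from collections import Counter

# Toy reference and cut site (hypothetical values, not from the data set)
reference = "AACCGGTT"
cut = 4  # the cut falls between reference[3] and reference[4]

# Outcome dictionary: outcome id -> list of sequences that represent it
outcomes = {"WT": [reference]}

# Single-base deletions starting 1 nt upstream of the cut site or at the cut site
for offset in (-1, 0):
    start = cut + offset
    outcomes[f"D01{offset:+03d}"] = [reference[:start] + "-" + reference[start + 1:]]

# Single-base insertions at the cut site
for base in "ACGT":
    outcomes[f"I01{base}"] = [reference[:cut] + base + reference[cut:]]

# Simulated reads (made up for illustration)
reads = ["AACCGGTT", "AAC-GGTT", "AACCAGGTT", "AACCGGTT", "AACCTGGTT", "TTTT"]

# Count the reads that belong to each outcome; unmatched reads go to UNIDNT
counts = Counter()
for read in reads:
    matches = [oid for oid, seqs in outcomes.items() if read in seqs]
    if matches:
        counts.update(matches)
    else:
        counts["UNIDNT"] += 1

print(dict(counts))
```

Each read is looked up among the candidate sequences, and anything that matches no candidate falls into a catch-all class, which mirrors the structure of the counting loop in the script.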

Finally, the script generates a CSV file that relates each target region to its editing efficiency, defined as the fraction of simulated reads that are edited. The outcome information is stored in a data frame that contains the frequency of each outcome for each target region.
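
As a small worked example of these two quantities, the sketch below uses made-up counts for two hypothetical regions. It follows the same convention as the script: wild-type and unidentified reads are left out of the edited total.

```python
import pandas as pd

# Made-up read counts per outcome (rows) for two hypothetical regions (columns)
counts = pd.DataFrame(
    {"region_A": [90, 6, 4, 0], "region_B": [50, 25, 25, 10]},
    index=["W00000", "D01-01", "I0100A", "UNIDNT"],
)

wt = counts.loc["W00000"]
edited = counts.drop(index=["W00000", "UNIDNT"])  # identified edits only
edited_total = edited.sum()

# Efficiency: edited reads over wild-type plus edited reads
efficiency = edited_total / (wt + edited_total)

# Frequency: share of each outcome among the edited reads (columns sum to 1)
frequency = edited / edited_total

print(efficiency)
print(frequency)
```

For `region_A` this gives an efficiency of 10 / 100 = 0.1, with the single deletion accounting for 0.6 of the edited reads and the insertion for 0.4.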

Note: The script contains several commented-out lines because a previous version required mounting Google Drive to access the required files. The new version downloads the files from GitHub, so there is no need to mount Google Drive.

Note 2: The script takes relatively long to run on all the simulated data, so only an example data set is analyzed here. The example contains 5000 reads from 3 regions, instead of the 1785 regions in the full set of simulated data.
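
The counting loop in the script compares each read against every outcome in the dictionary. One possible way to reduce the run time, which is not part of the original script, is to invert the dictionary once per region so that each read needs a single lookup. The sketch below assumes the outcome dictionary has the same shape as in the script (outcome id mapped to a list of sequences); `build_lookup` is a hypothetical helper name.

```python
from collections import defaultdict


def build_lookup(seq_outcomes):
    """Map each sequence to the list of outcome ids that contain it."""
    lookup = defaultdict(list)
    for outcome_id, seqs in seq_outcomes.items():
        for seq in seqs:
            lookup[seq].append(outcome_id)
    return lookup


# Toy outcome dictionary in the same shape as the one the script builds
seq_outcomes = {
    "W00000": ["ACGT"],
    "D01-01": ["A-GT", "AGT"],
    "UNIDNT": [],
}
lookup = build_lookup(seq_outcomes)

# Made-up reads; .get() avoids adding unknown reads to the defaultdict
reads = ["ACGT", "AGT", "A-GT", "TTTT"]
matched = [lookup.get(read, []) for read in reads]
print(matched)
```

Keeping a list of ids per sequence preserves the behavior of the original loop, in which a read is counted once for every outcome whose sequences include it.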


### 1. Import simulated data

Two files are required for the execution of the script:

- C3H_targets_example has the original, unedited sequence of each region of interest. It is required to generate all possible outcomes from the unedited sequence.

- sim_data_grna_lib_ex has 5000 reads of simulated data for 3 regions of interest.

Note that these files are imported in this section only to display them. The Python script further down also reads the files from the Git repository.
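
Because the script looks up each target region id as a column of the simulated data, it can be useful to check that the two tables agree before running it. The sketch below uses small stand-in tables with hypothetical ids and sequences rather than the real files.

```python
import pandas as pd

# Small stand-ins for the two tables (hypothetical ids and sequences)
targets = pd.DataFrame({
    "target_region_id": ["region_1", "region_2"],
    "original_sequence": ["ACGTACGT", "TTGGCCAA"],
})
reads = pd.DataFrame({
    "region_1": ["ACGTACGT", "ACG-ACGT"],
    "region_2": ["TTGGCCAA", "TTGGACCAA"],
})

# Every target id should appear as a column of the simulated reads
missing = set(targets["target_region_id"]) - set(reads.columns)
print("Target ids without reads:", missing)

# Fraction of reads identical to the unedited sequence, per region
unedited = {
    region_id: (reads[region_id] == seq).mean()
    for region_id, seq in zip(targets["target_region_id"], targets["original_sequence"])
}
print(unedited)
```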

In [1]:
import pandas as pd

NonEditedSeqs = pd.read_csv('https://raw.githubusercontent.com/marcexpositg/CRISPRed/master/02.Model/SimData/C3H_targets_example.csv', names=['target_region_id','original_sequence'])
NonEditedSeqs

Unnamed: 0,target_region_id,original_sequence
0,ENSMUSG00000033788_gR434r,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...
1,ENSMUSG00000033788_gR113f,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...
2,ENSMUSG00000033788_gR70f,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...


In [2]:
SimData = pd.read_csv('https://raw.githubusercontent.com/marcexpositg/CRISPRed/master/02.Model/SimData/sim_data_grna_lib_ex.csv')
SimData

Unnamed: 0,ENSMUSG00000033788_gR434r,ENSMUSG00000033788_gR113f,ENSMUSG00000033788_gR70f
0,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
1,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
2,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
3,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
4,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
...,...,...,...
4995,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
4996,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
4997,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...
4998,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGG...,GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAAC...,CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGG...


In [3]:
# Previous version of this cell, which mounted Google Drive to access the files:
#from google.colab import drive
#drive.mount('/content/drive')
#%cd "drive/My Drive/CRISPred/02.Model"
#!cat SimData/C3H_targets_example.csv
#!head -n 5 Labeling/sim_data_grna_lib_ex.csv

### 2. Quantify the outcomes 

Below is the script used to quantify the outcomes. The generated results are exported as CSV files and are also displayed as DataFrames in section 3.
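
The efficiency file is written without a header row, so column names have to be supplied when it is read back. A minimal round-trip sketch, using made-up values, a temporary file, and hypothetical column names:

```python
import os
import tempfile

import pandas as pd

# Made-up efficiencies for two hypothetical regions
demo_effcy = pd.Series({"region_1": 0.25, "region_2": 0.5})

# Write to a temporary location so that no real output file is overwritten
path = os.path.join(tempfile.mkdtemp(), "editing_efficiency_demo.csv")
demo_effcy.to_csv(path, header=False)

# The file has no header row, so the column names are given explicitly
loaded = pd.read_csv(
    path,
    header=None,
    names=["target_region_id", "efficiency"],
    index_col="target_region_id",
)["efficiency"]
print(loaded)
```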

In [4]:
# A progress bar is helpful when analyzing 1785 regions. For 3 regions it is not even shown.
!pip install -q progress

In [5]:
from itertools import product
import pandas as pd
from progress.bar import Bar

def deletion_outcomes(seq_outcomes, sequence, cutsite, maxdelsize, dwindow_up, dwindow_down):
    # iterate by deletion size
    for del_size in range(1, maxdelsize):

        # changing size changes the deletion start position
        low_del_pos = dwindow_up - del_size + 1

        # generate list of deletion start sites within defined window
        del_start_site = range(low_del_pos, dwindow_down)

        # iterate by deletion start site
        for del_pos in del_start_site:
            # create an identifier for the deletion
            delname_size = format(del_size, '02d')
            delname_pos = format(del_pos, '+03d')
            delname = 'D' + delname_size + delname_pos

            # generate sequence with the deletion, both with gaps and without gaps
            up_seq = sequence[: cutsite + del_pos]
            down_seq = sequence[cutsite + del_pos + del_size:]

            delseq_gaps = up_seq + '-' * del_size + down_seq
            delseq_nogap = up_seq + down_seq

            # save both values in a dictionary in a key with the deletion ID
            seq_outcomes[delname] = [delseq_gaps, delseq_nogap]

    # deletions of maxdelsize or more are not enumerated here; the main script counts them in a separate class

    return seq_outcomes


def insertion_outcomes(seq_outcomes, sequence, cutsite, maxinsize):
    nt = 'ACGT'
    # iterate by all insertion sizes (omit insertion size 0, which is the WT seq)
    for ins_size in range(1, maxinsize):
        # generate all possible insertions sequences for a given insertion size
        insertion_list = list(map("".join, product(nt, repeat=ins_size)))

        # iterate by each insertion sequence
        for insertion in insertion_list:
            # assign a unique identifier for the insertion
            insname_size = format(ins_size, '02d')
            insname_type = insertion.zfill(3) #fills the insertion sequence to 3chr
            insname = 'I' + insname_size + insname_type

            # create the sequence with the insertion
            insseq = sequence[:cutsite] + insertion + sequence[cutsite:]

            # save in the dictionary
            seq_outcomes[insname] = [insseq]

    return seq_outcomes

def read_target_seqs(target_seqs_file):
    col_names = ["gRNA_id", "target_seq"]
    target_seqs = pd.read_csv(target_seqs_file, names = col_names)
    return target_seqs

def read_mut_data(mut_file):
    mut_seq_data = pd.read_csv(mut_file)
    return mut_seq_data


## Set parameters of outcomes to quantify
cut_site = 60
max_ins_size = 2 # only insertions below max_ins_size will be classified in independent classes
max_del_size = 30  # only deletions below max_del_size will be classified in independent classes
del_window_up = -3
del_window_down = +2


## Read target genomic regions without mutations
# import the wild type (non mutated) reference sequences from a csv file to a pandas data frame
# modified to get the files from GitHub
target_seqs_file = 'https://raw.githubusercontent.com/marcexpositg/CRISPRed/master/02.Model/SimData/C3H_targets_example.csv'
# originally:
#target_seqs_file = 'SimData/C3H_targets_example.csv'
target_seqs_data = read_target_seqs(target_seqs_file)

## Read sequencing mutated data
# import the simulated data sequences from a file to a list
# modified to get the files from GitHub
mut_file = 'https://raw.githubusercontent.com/marcexpositg/CRISPRed/master/02.Model/SimData/sim_data_grna_lib_ex.csv'
# original:
#mut_file = 'Labeling/sim_data_grna_lib_ex.csv'
mut_seq_data = read_mut_data(mut_file)

## Quantify outcomes for each target region
outcome_count_all = pd.DataFrame()

bar = Bar('Quantifying outcomes:', max=len(target_seqs_data))

# For each target region (row in dataframe),
for i in range(len(target_seqs_data)):
    # get id and sequence
    target_id = target_seqs_data.iloc[i]['gRNA_id']
    target_seq = target_seqs_data.iloc[i]['target_seq']

    ## generate possible outcomes for the target sequence
    seq_outcomes = {}
    # add wild type sequence
    seq_outcomes['W00000'] = [target_seq]
    # add deletions up to max_del_size
    seq_outcomes = deletion_outcomes(seq_outcomes, target_seq, cut_site, max_del_size, del_window_up, del_window_down)
    # add insertions up to max_ins_size
    seq_outcomes = insertion_outcomes(seq_outcomes, target_seq, cut_site, max_ins_size)
    # add a class to count all insertions equal or more than max_ins_size
    eq_more_th_ins = 'I' + format(max_ins_size, '02d') + 'EMT'
    seq_outcomes[eq_more_th_ins] = []
    # add a class to count all deletions equal or more than max_del_size
    eq_more_th_del = 'D' + format(max_del_size, '02d') + 'EMT'
    seq_outcomes[eq_more_th_del] = []
    # add a class for non-planned (unidentified) outcomes
    seq_outcomes['UNIDNT'] = []

    ## Count frequency of each outcome in sequencing mutated data
    # initialize an empty object to count each outcome
    outcome_count = pd.Series(0, index=seq_outcomes.keys())
    # for each mutated sequence
    for mut_seq in mut_seq_data[target_id]:
        found = False
        # look if it is any of the defined outcomes
        for indel_id, indel_seq in seq_outcomes.items():
            if mut_seq in indel_seq:
                outcome_count[indel_id] += 1
                #outcome_count['W00000'] += 1
                found = True
        # if it does not match any defined outcome
        if not found:
            # if it has at least max_del_size gaps, it belongs to the "equal or more than max_del_size" class
            if mut_seq.count('-') >= max_del_size:
                outcome_count[eq_more_th_del] += 1
            # if it is at least max_ins_size longer than the target sequence, it belongs to the "equal or more than max_ins_size" class
            elif len(mut_seq) >= len(target_seq) + max_ins_size:
                outcome_count[eq_more_th_ins] += 1
            # if it is none of those options, it goes to the unidentified class
            else:
                outcome_count['UNIDNT'] += 1
    outcome_count_all[target_id] = outcome_count
    # print(outcome_count.sum())
    # print(outcome_count)
    # print(outcome_count_all)
    bar.next()

bar.finish()
## Calculate the fraction of edited sequences
wt_counts = outcome_count_all.loc['W00000']
mut_counts = outcome_count_all.iloc[1:-1] # exclude the first row (wild type) and the last row (unidentified outcomes)
mut_counts_sum = mut_counts.sum()

edit_effcy = mut_counts_sum / (wt_counts + mut_counts_sum)

## Calculate frequency of each outcome
# normalize mut_counts to get the frequency
mut_freq = mut_counts / mut_counts_sum

## Export results
# normally they are exported into a folder:
#edit_effcy.to_csv('Labeling/editing_efficiency_sim_ex.csv', header=None)
#mut_freq.to_csv('Labeling/outcomes_frequency_sim_ex.csv')

# In case you want to check them, you can download them from the Files panel of this session.
edit_effcy.to_csv('editing_efficiency_sim_ex.csv', header=False)
mut_freq.to_csv('outcomes_frequency_sim_ex.csv')


### 3. Get the results

The results generated are shown below.

In [6]:
# to check the CSV files produced, run:
#!cat 'editing_efficiency_sim_ex.csv'
#!cat 'outcomes_frequency_sim_ex.csv'

In [7]:
# The presentation looks better when visualizing them directly as generated in the script above
# This file has the efficiency determined for each target region, used in 2.5.EffModel
edit_effcy

ENSMUSG00000033788_gR434r    0.01
ENSMUSG00000033788_gR113f    0.03
ENSMUSG00000033788_gR70f     0.30
dtype: float64

In [8]:
# This file contains the generated outcomes, used in 2.3.OutcomesProfiling.ipynb to generate the labels for the results prediction model
mut_freq

Unnamed: 0,ENSMUSG00000033788_gR434r,ENSMUSG00000033788_gR113f,ENSMUSG00000033788_gR70f
D01-03,0.0,0.000000,0.000000
D01-02,0.0,0.000000,0.000000
D01-01,0.0,0.046667,0.038667
D01+00,0.0,0.033333,0.036000
D01+01,0.0,0.000000,0.000000
...,...,...,...
I0100C,0.0,0.006667,0.014667
I0100G,0.0,0.006667,0.008000
I0100T,0.0,0.006667,0.011333
I02EMT,0.0,0.000000,0.000000
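
The row labels above follow the six-character naming scheme built in the script: `W00000` for the wild type, `D` plus a two-digit size and a signed start offset relative to the cut site for deletions, `I` plus a two-digit size and the zero-padded inserted bases for insertions, and an `EMT` suffix for the "equal or more than" classes. A small decoder can make them easier to read; `parse_outcome_id` is a hypothetical helper, not part of the original script.

```python
def parse_outcome_id(outcome_id):
    """Split a six-character outcome id into (kind, size, detail)."""
    kind = outcome_id[0]
    if kind == "W":
        return ("wild type", 0, None)
    if outcome_id == "UNIDNT":
        return ("unidentified", None, None)
    size = int(outcome_id[1:3])
    tail = outcome_id[3:]
    if tail == "EMT":
        # class that groups all events of this size or larger
        return ("insertion" if kind == "I" else "deletion", size, "size or larger")
    if kind == "D":
        # tail is the signed deletion start offset relative to the cut site
        return ("deletion", size, int(tail))
    # tail is the inserted sequence, left-padded with zeros to 3 characters
    return ("insertion", size, tail.lstrip("0"))


for label in ["W00000", "D01-03", "D01+00", "I0100C", "I02EMT", "UNIDNT"]:
    print(label, parse_outcome_id(label))
```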
