<a href="https://colab.research.google.com/github/marcexpositg/CRISPRed/blob/master/01.DescriptiveAnalysis/1.3.CoordiantesToC3H.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1.3. Convert target regions coordinates to C3H genome

**Note: No Python was used in this process, hence this notebook is just a draft markdown file with all the bash commands used in the process. However, all files used for the analysis are kept in the C3HCoordinates/ folder**


**Note2: The process was run twice because incorrect coordinates were used in the first attempt. The information described here contains both attempts, as the first attempt is more thoroughly documented than the second one.**

**Note3: I highly recommend reading the Summary section below, which was originally written at the end of the whole process.**

The objective is to get the coordinates, in the C3H reference genome, of each target region containing sequences from the gRNA library. Target region coordinates are known in mm10 but not in the C3H genome, so they have to be converted.



## Summary

Previously, target regions were converted to the C3H genome, only to discover that a target region is just a place in the genome which has a gRNA +-500bp. This is not useful because we are interested in the cut site. This was done in the `C:\Users\Usuario\Documents\02.TransSynBio\BioinfoThings\coordinates\ConversionC3H` folder.

Then, we got the coordinates of the gRNA (the exact bp locations) in the mm10 genome. Work done on the `C:\Users\Usuario\Documents\02.TransSynBio\BioinfoThings\coordinates\ConversionC3Hv2` folder. These were used to get the following files with useful information:

- `C3H_targets.csv`: gRNA IDs and the +-60bp sequence around each gRNA cut site. Used to simulate data.
- `C3H_gRNA-coordinates.bed`: Coordinates of the gRNAs in the C3H genome, for only 1735 of the 1785 gRNAs. This is the file that should be used to get the coverage.
- `mm10_gRNA-coordinates.bed`: File with the original coordinates, corrected for the indexing system (the coordinates used previously are in `mm10_gRNA-coordinates_orig.bed`).
- `C3Hv2coordsMatch.tsv`: Relates each gRNA ID to its sequence (not reverse complemented, so Rev gRNAs are not there) and to its old and new coordinates. It is a summary of the conversion for the selected subset of 1735 gRNAs only.
- `C3H_cutsite120nt_Seq_id_flt.bed`: Could be useful because it contains the same information as `C3H_targets.csv` but also includes the C3H coordinates of each cut site, so it could be used to get coverage.

Summary of the process:

1. Converted the gRNA coordinates (`mm10_gRNA-coordinates_orig.bed`) into a BED file shifted by minus one nt (`mm10_gRNA-coordinates.bed`) so that they point to the correct location in the genome (the original had an error with 0-based and 1-based indexing).
2. Used Remap to map this file from the mm10 genome to the C3H genome. This produced two files: `remapped_mm10_gRNA-coordinates.bed` (not used) and `report_mm10_gRNA-coordinates.bed.xls` (used).
3. Converted the Remap output to a tab-separated file to work with it: `report_mm10_gRNA-coordinates.tsv`.
4. Deleted gRNAs that were duplicated. For 3 gRNAs we are not sure we know the exact location, but we keep them. Created `C3Hv2_preproccess_v2.tsv`.
5. 30 gRNAs were not found by Remap and were deleted: `C3Hv2_preproccess_v2.tsv` -> `C3Hv2_preproccess_v3.tsv`. From 1785 to 1755 gRNAs.
6. Deleted 20 gRNAs with differences in coverage (these could be corrected, but doing it manually would take too long), giving `C3H_preproccess_v4.tsv`. From 1755 to 1735 gRNAs (one coordinate corresponds to two gRNAs that are exact reverse complements of each other).
7. Converted to BED format (`C3Hv2coord_sub.bed`) and used `bedtools getfasta` to obtain the sequences (`C3Hv2coord_sub_Seq.bed`).
8. To match coordinates with gRNA IDs, the file `C3Hv2MatchtoGrna.bed` is created, a condensed version of `C3Hv2_preprocess_v4.tsv` (which has all the information relating old and new coordinates).
9. Then we add the name (`C3Hv2coordsMatch_noSeq.tsv`) and sequence of each gRNA to create `C3Hv2coordsMatch.tsv`, which has all the coordinate information plus the sequences and IDs.
10. From this file we get the `bed` coordinates of the matched gRNAs in the C3H genome, in `C3H_gRNA-coordinates.bed`, which follows the format of the original `mm10_gRNA-coordinates.bed`. Keep in mind that different applications use 0-based or 1-based indexing.
11. The names of the missing gRNAs are identified and kept in `missinggRNAs.txt`. We find them by looking up the gRNAs in the original library, stored in tab-separated format in `gRNAorigSeq.tsv`. Since the sequence in `C3Hv2coordsMatch.tsv` is in one direction only and we need to do joins, we also create `gRNAorigSeqrev`, which contains the reverse-complement sequences.
12. We also try to get the gRNAs with point mutations. I am not sure whether there are 74 or 24 (because I don't know if the 74 include the 50 gRNAs not matched on C3H). We keep them in `missinggRNAs.txt`. For that, we create the `notInFwd.txt` and `norInReverse.txt` files, which hold intermediate steps.
13. Finally, to get the cut sites in an appropriate format, we create `discrepancy.awk` and see that the strand information of some gRNAs is not correct in mm10 but, luckily, it is correct in the C3H genome, so we use those coordinates. To transform the coordinates we use the file with all the information, `C3Hv2coordsMatch.tsv`, and an awk script that centers the cut site within a 120nt window and takes into account whether the sequence is forward or reverse. The result is `C3H_cutsite120nt.bed`.
14. From this file, we use `get-sequenceFromCord_strand.sh` to obtain the sequences of the target sites, in `C3H_cutsite120nt_Seq.bed`. We need to **remove 13 gRNAs** because their 120nt sequence contains `N`, which would probably cause an error. We end up with 1722 in the `C3H_cutsite120nt_Seq_flt.bed` file.
15. Match each sequence with its ID. We match the entries of `C3H_cutsite120nt_Seq_flt.bed` using the initial coordinates and get `C3H_cutsite120nt_Seq_id.bed`. Two gRNAs had the same starting point (one of them is the reverse complement), so after deleting the duplication they caused we end up with `C3H_cutsite120nt_Seq_id_flt.bed`.
16. This file is converted to the `csv` format required by the mutations script, `C3H_targets.csv`.
17. In conclusion, we look at 1722 targets. The difference from the full library of 1785 comes from 30 not matched + 20 with coverage != 1 + 13 that contain N in the C3H genome in regions adjacent to the cut site. Some of the 1722 contain point mutations when compared to the mm10 genome.
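
The bookkeeping in step 17 can be double-checked with plain shell arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are the ones quoted in the steps above.

```bash
# Counts quoted in the summary above; variable names are illustrative only.
library_size=1785      # gRNAs in the full library
not_remapped=30        # not found by Remap (step 5)
coverage_not_one=20    # coverage != 1 (step 6)
contains_n=13          # 120nt window contains N (step 14)

after_remap=$((library_size - not_remapped))
after_coverage=$((after_remap - coverage_not_one))
final_targets=$((after_coverage - contains_n))

echo "after Remap:    $after_remap"
echo "after coverage: $after_coverage"
echo "final targets:  $final_targets"
```

The intermediate values should agree with the counts given in steps 5, 6 and 14.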

## First attempt



### 3.1. Previous problems

A BLAT search of all the gRNAs was run against the new genome, but it returns many candidate hits, of which only perfect matches should be selected. One problem is that some gRNAs do not match the C3H genome perfectly, so we miss them. Another is that some gRNAs match perfectly at more than one site, and we need to identify the one of interest.

Hence, we change the approach and use the original target coordinates in the mm10 genome to identify the gRNA of interest: the target coordinates are converted from the mm10 genome to the C3H genome.




### 3.2. Batch convert

The target coordinates are obtained from [BitBucket](https://bitbucket.org/synbiolab/library-design/src/master/Cut%20coordinates/target-sequences_mm10.bed) as `target-sequences_mm10.bed`. This file has one line per gRNA, specifying the target region of that gRNA: the target cut site with 25bp on each side, like this:

```
chr5	73647358	73647408
chr5	73647451	73647501
...
up to 1785 lines
...
chr6	87043097	87043147
chr6	87043275	87043325
```
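
Before uploading, it may be worth checking that every interval really spans 50bp. The sketch below is not part of the original workflow; `check_bed_width` is a hypothetical helper, and the demo rows are the four lines shown above.

```bash
# Sketch: report BED lines whose width (end - start) differs from the expected one.
# check_bed_width is a hypothetical helper, not part of the original workflow.
check_bed_width() {
    # $1 = BED file, $2 = expected width in bp
    awk -F'\t' -v w="$2" '($3 - $2) != w { print NR ": " $0 }' "$1"
}

# Demo on the four lines shown above.
demo_bed=$(mktemp)
printf 'chr5\t73647358\t73647408\n'  > "$demo_bed"
printf 'chr5\t73647451\t73647501\n' >> "$demo_bed"
printf 'chr6\t87043097\t87043147\n' >> "$demo_bed"
printf 'chr6\t87043275\t87043325\n' >> "$demo_bed"

bad_lines=$(check_bed_width "$demo_bed" 50)
echo "lines with unexpected width: ${bad_lines:-none}"
rm -f "$demo_bed"
```

On the real data this would be run as `check_bed_width target-sequences_mm10.bed 50`; any line it prints deserves a manual look.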

This file is used as input for NCBI [Remap](https://www.ncbi.nlm.nih.gov/genome/tools/remap), specifying:

- Source assembly: GRCm38(mm10), GCF_000001635.20
- Target assembly: C3H_HeJ_v1, GCA_001632575.1
- The rest of parameters are defaults
- The input data is the `target-sequences_mm10.bed` file.

The results are downloaded in the .xls format because it contains chromosome and strand information which were not displayed in the `.bed` format. The file is `report_target-sequences_mm10.bed.xls`. This file is converted to tab separated values using MSExcel, generating `report_target-sequences_mm10.tsv`. File format by columns:

1. #feat_name: Unique ID assigned to each target, composed of the chromosome and start coordinate in the original (mm10) genome.
2. source_int: Number of times the ID appears in the mm10 genome, is always one
3. mapped_int: Number of times the ID appears in the mapped genome (in this case, C3H). Some were found more than once in the C3H genome so it is higher than one in some cases, or null if they were not found.
4. source_id: Chromosome where the target is (assumes it is the same for mm10 and C3H)
5. mapped_id: Chromosome where the target is in the C3H genome, which uses a different naming scheme (e.g. chr5 --> CM004239.1)
6. source_length: Query is always 50bp
7. mapped_length: In some cases the region identified is less than 50bp, this will need revision
8. source_start: Initial coordinate in mm10
9. source_stop: Final coordinate in mm10
10. source_strand: Strand in mm10
11. source_sub_start: Same as column 8
12. source_sub_stop: Same as column 9
13. mapped_start: Initial coordinate in mapped genome C3H
14. mapped_stop: Final coordinate in mapped genome C3H
15. mapped_strand: Strand in C3H
16. coverage: If the mapped length is less than 50, the coverage is less than 1.
17. recip: Don't know why but some are Second Pass
18. asm_unit: Always "Primary Assembly"
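
With the column numbers above, the fields needed downstream can be pulled out with `awk`. This is a minimal sketch that assumes the `.tsv` is tab-separated with the columns in the order listed; `remap_key_columns` is a hypothetical helper name.

```bash
# Sketch: keep only the columns needed downstream from the Remap report.
# Column numbers follow the list above; remap_key_columns is a hypothetical helper.
remap_key_columns() {
    # 1 = feat_name, 5 = mapped_id, 13 = mapped_start,
    # 14 = mapped_stop, 15 = mapped_strand, 16 = coverage
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $5, $13, $14, $15, $16 }' "$1"
}

# Demo row copied from the report (chr4_53730598), written with real tabs.
demo_tsv=$(mktemp)
printf 'chr4_53730598\t1\t1\tchr4\tCM004238.1\t50\t50\t53730598\t53730647\t+\t53730598\t53730647\t52288061\t52288110\t+\t1\tFirst Pass\tPrimary Assembly\n' > "$demo_tsv"

key_columns=$(remap_key_columns "$demo_tsv")
echo "$key_columns"
rm -f "$demo_tsv"
```

Splitting on tabs only matters here because columns 17 and 18 contain spaces ("First Pass", "Primary Assembly").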




### 3.3. Refining the output

We explore the results using the .tsv format. Below is a summary of the modifications made to the file throughout this section to get the final file:

**Summary of the process:**

- Initial coordinates in mm10 genome are in `target-sequences_mm10.bed`.
- These initial coordinates are used as input for batch conversion with NCBI Remap. The batch output is an xls file, converted to tsv to work with the data: `report_target-sequences_mm10.bed.xls` -> `report_target-sequences_mm10.tsv`
- Delete repetitions of targets that appear multiple times with different IDs (the data was just repeated; we keep the correct ones): `report_target-sequences_mm10.tsv` -> `C3H_preproccess_v1.tsv`
- Delete targets that appear multiple times with the same ID: `C3H_preproccess_v1.tsv` -> `C3H_preproccess_v2.tsv`. Since we can't identify a single region where these targets match, we delete 3 gRNAs (2 appeared 7 times and 1 appeared 6 times).
- Delete 26 targets that were not found in the C3H genome: `C3H_preproccess_v2.tsv` -> `C3H_preproccess_v3.tsv`.
- Delete 53 target regions with differences in coverage (these could be corrected, but doing it manually would take too long): `C3H_preproccess_v3.tsv` -> `C3H_preproccess_v4.tsv`
- We end up with 1700 target regions, so we rename the file to indicate that it is a subset and produce a bed file: `C3H_preproccess_v4.tsv` -> `C3Hcoord_sub_Seq.bed`
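
The coverage step could be scripted rather than done by hand. The sketch below is an assumption about how that might look, not what was actually run: `keep_full_coverage` is a hypothetical helper, and the second demo row is made up purely to show a rejected case.

```bash
# Sketch: keep header lines and rows whose coverage (column 16) equals 1.
# keep_full_coverage is a hypothetical helper; the original filtering may
# well have been done by hand.
keep_full_coverage() {
    awk -F'\t' '/^#/ || ($16 + 0) == 1' "$1"
}

demo_tsv=$(mktemp)
# Real row from the report (full coverage).
printf 'chrX_93831579\t1\t1\tchrX\tCM004254.1\t50\t50\t93831579\t93831628\t+\t93831579\t93831628\t90568166\t90568215\t+\t1\tFirst Pass\tPrimary Assembly\n' > "$demo_tsv"
# MADE-UP row, for illustration only: partial coverage, should be dropped.
printf 'chrN_1000\t1\t1\tchrN\tXX000000.0\t50\t45\t1000\t1049\t+\t1000\t1049\t2000\t2044\t+\t0.9\tFirst Pass\tPrimary Assembly\n' >> "$demo_tsv"

kept_ids=$(keep_full_coverage "$demo_tsv" | cut -f1)
echo "kept: $kept_ids"
rm -f "$demo_tsv"
```

Adding `+ 0` forces a numeric comparison, so rows with an empty coverage field are dropped as well.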

#### 3.3.1. Final list of targets

Since it was difficult to identify all gRNAs in the C3H genome, some of the original gRNAs were not identified in the final set used for C3H. This is the list of the gRNAs absent in the final set and the reason they are not there. The initial file had 1785 targets and the final only 1700.

```bash
# some gRNAs appear multiple times in the C3H genome and we don't know the exact location
#gRNA_numb                    #sequence               #tsv_name
>ENSMUSG00000071816_gR76f     GGGAGCCACCATTTGGTTGCTGG chrX_8805433
>ENSMUSG00000071816_gR28f     GTCTTCAGACACCCCAGAATTGG chrX_8805385
>ENSMUSG00000071816_gR132r    TGGCTGGCGAGATAGCTAAGTGG chrX_8805489
# delete some which were not identified in the C3H genome
# TODO: identify all gRNAs which have been removed from the subset.

```
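
The TODO above could be approached by comparing the ID column of the full report with that of the final subset. A minimal sketch, assuming both files are tab-separated with the target ID in column 1; `missing_ids` is a hypothetical helper, and the demo IDs are taken from the report.

```bash
# Sketch: list IDs present in the full report but absent from the filtered subset.
# missing_ids is a hypothetical helper; both inputs are assumed to be
# tab-separated with the target ID in column 1.
missing_ids() {
    # $1 = full report, $2 = filtered subset
    tmp_dir=$(mktemp -d)
    grep -v '^#' "$1" | cut -f1 | sort -u > "$tmp_dir/all.txt"
    grep -v '^#' "$2" | cut -f1 | sort -u > "$tmp_dir/kept.txt"
    comm -23 "$tmp_dir/all.txt" "$tmp_dir/kept.txt"   # lines only in the first file
    rm -rf "$tmp_dir"
}

# Demo with IDs taken from the report.
demo_all=$(mktemp)
demo_kept=$(mktemp)
printf 'chrX_8805385\nchrX_8805433\nchrX_8805489\nchrX_93831579\n' > "$demo_all"
printf 'chrX_93831579\n' > "$demo_kept"

removed_ids=$(missing_ids "$demo_all" "$demo_kept")
echo "$removed_ids"
rm -f "$demo_all" "$demo_kept"
```

On the real files this would be `missing_ids report_target-sequences_mm10.tsv C3H_preproccess_v4.tsv`; mapping each ID back to a gRNA name is a separate step.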

#### 3.3.2. Targets found multiple times

There are targets which were found multiple times on the genome. Some of them were assigned the same ID and found multiple times, while some others were assigned different IDs.

##### Multiple times, different ID

If we count, we see that we get two more IDs than unique gRNAs. This is because the targets that were found twice were assigned a different ID, which is also why they don't show up when counting repeated IDs.

```bash
$ cut -f1 report_target-sequences_mm10.tsv | uniq -c | sort -nk1 | wc -l
1786

$ cat target-sequences_mm10.bed | wc -l
1784

$ cut -f1 report_target-sequences_mm10.tsv | uniq -c | grep __
      1 chr19_3281198__0
      1 chr1_75361545__1
      1 chr4_53730598__2
      1 chr11_61066368__3

$ cat target-sequences_mm10.bed | grep __
# Nothing out


## Explanation:
## by manually looking here $cut -f1 report_target-sequences_mm10.tsv | uniq -c | sort -nk1
## we see:
      1 chr4_53730598
      1 chr4_53730598__2
      1 chr19_3281198
      1 chr19_3281198__0
      1 chr11_61066368
      1 chr11_61066368__3
      1 chr1_75361545
      1 chr1_75361545__1
```

Do these repeated entries have the same coordinates or different ones?

All of them have the same coordinates in the C3H genome, as seen in columns 13 and 14.


```bash
$ grep 'chr4_53730598' report_target-sequences_mm10.tsv
chr4_53730598   1       1       chr4    CM004238.1      50      50      53730598        53730647        +       53730598        53730647        52288061        52288110        +       1       First Pass      Primary Assembly
chr4_53730598__2        1       1       chr4    CM004238.1      50      50      53730598        53730647        +       53730598        53730647        52288061        52288110        +       1       First Pass      Primary Assembly

$ grep 'chr19_3281198' report_target-sequences_mm10.tsv
chr19_3281198   1       1       chr19   CM004253.1      50      50      3281198 3281247 +       3281198 3281247 31736   31785   +       1       First Pass      Primary Assembly
chr19_3281198__0        1       1       chr19   CM004253.1      50      50      3281198 3281247 +       3281198 3281247 31736   31785       +       1       First Pass      Primary Assembly

$ grep 'chr11_61066368' report_target-sequences_mm10.tsv
chr11_61066368  1       1       chr11   CM004245.1      50      50      61066368        61066417        +       61066368        61066417    60562764        60562813        +       1       First Pass      Primary Assembly
chr11_61066368__3       1       1       chr11   CM004245.1      50      50      61066368        61066417        +       61066368   61066417 60562764 

$ grep 'chr1_75361545' report_target-sequences_mm10.tsv
chr1_75361545   1       1       chr1    CM004235.1      50      50      75361545        75361594        +       75361545        75361594    75650425        75650474        +       1       First Pass      Primary Assembly
chr1_75361545__1        1       1       chr1    CM004235.1      50      50      75361545        75361594        +       75361545   75361594 75650425        75650474        +       1       First Pass      Primary Assembly
```
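
The same comparison can be automated instead of eyeballing the `grep` output. This is only a sketch: `compare_suffixed` is a hypothetical helper, and it assumes the report is tab-separated with mapped coordinates in columns 13 and 14.

```bash
# Sketch: for every ID with a "__N" suffix, compare its mapped coordinates
# (columns 13-14) with those of the unsuffixed ID. compare_suffixed is hypothetical.
compare_suffixed() {
    awk -F'\t' '
        {
            id = $1
            sub(/__[0-9]+$/, "", id)          # strip the "__N" suffix, if any
            coords = $13 "-" $14
            if ($1 == id) base[id] = coords   # unsuffixed row
            else          dup[$1] = coords    # suffixed row
        }
        END {
            for (d in dup) {
                id = d
                sub(/__[0-9]+$/, "", id)
                status = (base[id] == dup[d]) ? "same" : "different"
                print d, status
            }
        }' "$1"
}

# Demo with the two chr4_53730598 rows from the report.
demo_tsv=$(mktemp)
rest='1\t1\tchr4\tCM004238.1\t50\t50\t53730598\t53730647\t+\t53730598\t53730647\t52288061\t52288110\t+\t1\tFirst Pass\tPrimary Assembly'
printf "chr4_53730598\t${rest}\n"     > "$demo_tsv"
printf "chr4_53730598__2\t${rest}\n" >> "$demo_tsv"

suffix_check=$(compare_suffixed "$demo_tsv")
echo "$suffix_check"
rm -f "$demo_tsv"
```

Any ID reported as `different` would need a manual look before its suffixed row is dropped.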

Hence, they can be directly deleted from the data. We do so with `grep -v __` to exclude the repeated IDs. The first step is therefore to create the first preprocessed file by deleting the entries found multiple times with different IDs (data that was just repeated): `report_target-sequences_mm10.tsv` -> `C3H_preproccess_v1.tsv`. The commands below show that only those four lines have been deleted.

```bash
$ grep -v "__" report_target-sequences_mm10.tsv > C3H_preprocess_v1.tsv

$ wc -l C3H_preprocess_v1.tsv
1799 C3H_preprocess_v1.tsv

$ wc -l report_target-sequences_mm10.tsv
1803 report_target-sequences_mm10.tsv

$ grep 'chr4_53730598' C3H_preprocess_v1.tsv
chr4_53730598   1       1       chr4    CM004238.1      50      50      53730598        53730647        +       53730598        53730647        52288061        52288110        +       1       First Pass      Primary Assembly

$ grep 'chr4_53730598' report_target-sequences_mm10.tsv
chr4_53730598   1       1       chr4    CM004238.1      50      50      53730598        53730647        +       53730598        53730647        52288061        52288110        +       1       First Pass      Primary Assembly
chr4_53730598__2        1       1       chr4    CM004238.1      50      50      53730598        53730647        +       53730598        53730647        52288061        52288110        +       1       First Pass      Primary Assembly
```

##### Multiple times, same ID

Sorting by ID, we see that one target was found 6 times and two targets were found 7 times. The rest appear only once in the genome (although some of those have a duplicate under a different ID).

```bash
$ cut -f1 report_target-sequences_mm10.tsv | uniq -c | sort -nk1 | tail
      1 chrX_93831579
      1 chrX_93831651
      1 chrX_9468696
      1 chrX_9468749
      1 chrX_98151471
      1 chrX_98151661
      1 chrX_98151809
      6 chrX_8805433
      7 chrX_8805385
      7 chrX_8805489
```
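
Note that `uniq -c` only counts adjacent duplicates, so the command above relies on rows with the same ID being next to each other in the report. A slightly more defensive sketch sorts first; `ids_found_more_than_once` is a hypothetical helper, and the demo uses only a subset of the report rows, deliberately interleaved.

```bash
# Sketch: list IDs that occur more than once, regardless of row order.
# ids_found_more_than_once is a hypothetical helper.
ids_found_more_than_once() {
    grep -v '^#' "$1" | cut -f1 | sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}

# Demo: a subset of the report rows (ID, source_int, mapped_int), interleaved.
demo_tsv=$(mktemp)
for i in 2 3 4; do
    printf 'chrX_8805385\t1\t%s\n' "$i" >> "$demo_tsv"
    printf 'chrX_8805433\t1\t%s\n' "$i" >> "$demo_tsv"
done
printf 'chrX_98151809\t1\t1\n' >> "$demo_tsv"

repeated=$(ids_found_more_than_once "$demo_tsv")
echo "$repeated"
rm -f "$demo_tsv"
```

Each output line is an ID followed by the number of rows it has in the file.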

We can also see the ones that appear more than once by using the 3rd column of the file, `mapped_int`, which contains information on the number of times a target was found in the C3H genome. It gives the same information as above.

```bash
$ sort -nk3 report_target-sequences_mm10.tsv | tail -25
chrX_93831274   1       1       chrX    CM004254.1      50      50      93831274        93831323        +       93831274        93831323        90567861        90567910        +       1       First Pass      Primary Assembly
chrX_93831579   1       1       chrX    CM004254.1      50      50      93831579        93831628        +       93831579        93831628        90568166        90568215        +       1       First Pass      Primary Assembly
chrX_93831651   1       1       chrX    CM004254.1      50      50      93831651        93831700        +       93831651        93831700        90568238        90568287        +       1       First Pass      Primary Assembly
chrX_9468696    1       1       chrX    CM004254.1      50      50      9468696 9468745 +       9468696 9468745 4319777 4319826 +       1       First Pass      Primary Assembly
chrX_9468749    1       1       chrX    CM004254.1      50      50      9468749 9468798 +       9468749 9468798 4319830 4319879 +       1       First Pass      Primary Assembly
chrX_98151471   1       1       chrX    CM004254.1      50      50      98151471        98151520        +       98151471        98151520        94994035        94994084        +       1       First Pass      Primary Assembly
chrX_98151661   1       1       chrX    CM004254.1      50      50      98151661        98151710        +       98151661        98151710        94994225        94994274        +       1       First Pass      Primary Assembly
chrX_98151809   1       1       chrX    CM004254.1      50      50      98151809        98151858        +       98151809        98151858        94994373        94994422        +       1       First Pass      Primary Assembly
chrX_8805385    1       2       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3169088 3169137 +       1       Second Pass     Primary Assembly
chrX_8805433    1       2       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3169136 3169185 +       1       Second Pass     Primary Assembly
chrX_8805489    1       2       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3169192 3169241 +       1       Second Pass     Primary Assembly
chrX_8805385    1       3       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3580247 3580296 +       1       Second Pass     Primary Assembly
chrX_8805433    1       3       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3580295 3580344 +       1       Second Pass     Primary Assembly
chrX_8805489    1       3       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3580351 3580400 +       1       Second Pass     Primary Assembly
chrX_8805385    1       4       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3521395 3521444 +       1       Second Pass     Primary Assembly
chrX_8805433    1       4       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3521443 3521492 +       1       Second Pass     Primary Assembly
chrX_8805489    1       4       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3521499 3521548 +       1       Second Pass     Primary Assembly
chrX_8805385    1       5       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3070567 3070616 +       1       First Pass      Primary Assembly
chrX_8805433    1       5       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3338477 3338526 -       1       Second Pass     Primary Assembly
chrX_8805489    1       5       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3072293 3072342 +       1       First Pass      Primary Assembly
chrX_8805385    1       6       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3338525 3338574 -       1       Second Pass     Primary Assembly
chrX_8805433    1       6       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3447354 3447403 -       1       Second Pass     Primary Assembly
chrX_8805489    1       6       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3338421 3338470 -       1       Second Pass     Primary Assembly
chrX_8805385    1       7       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3447402 3447451 -       1       Second Pass     Primary Assembly
chrX_8805489    1       7       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3447298 3447347 -       1       Second Pass     Primary Assembly
```

To solve that, we are going to focus on each of the 3 individually.

###### chrX_8805433

This one appears 6 times.

```bash
$ grep 'chrX_8805433' report_target-sequences_mm10.tsv
chrX_8805433    1       1       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3116031 3116080 +       1       Second Pass     Primary Assembly
chrX_8805433    1       2       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3169136 3169185 +       1       Second Pass     Primary Assembly
chrX_8805433    1       3       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3580295 3580344 +       1       Second Pass     Primary Assembly
chrX_8805433    1       4       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3521443 3521492 +       1       Second Pass     Primary Assembly
chrX_8805433    1       5       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3338477 3338526 -       1       Second Pass     Primary Assembly
chrX_8805433    1       6       chrX    CM004254.1      50      50      8805433 8805482 +       8805433 8805482 3447354 3447403 -       1       Second Pass     Primary Assembly
```

We look at the UCSC genome browser using chrX:8805433-8805482 to get the sequence in the mm10 genome. Just look for that region and go to View>DNA, or search [here](https://genome-euro.ucsc.edu/cgi-bin/hgc?hgsid=237315516_1DY0PTTUVko0aD171AnuwhtNBzci&o=8805432&g=getDna) directly.

```txt
>mm10_dna range=chrX:8805433-8805482 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCC
```

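This could also be done from bash with `twoBitToFa`. The command below is the generic example (hg19, chr21), so the `.2bit` file, sequence name and coordinates would need to be adapted to mm10: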

```bash
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
chmod a+x twoBitToFa
twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000
```
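
When adapting that command, keep in mind that `twoBitToFa` takes a zero-based `-start` and a non-inclusive `-end`, while the browser position is 1-based and inclusive. The sketch below only builds the arguments; `position_to_twobit_args` is a hypothetical helper, and it does not call `twoBitToFa` itself.

```bash
# Sketch: turn a 1-based, inclusive browser position (chrX:8805433-8805482)
# into the zero-based start / non-inclusive end that twoBitToFa expects.
# position_to_twobit_args is a hypothetical helper.
position_to_twobit_args() {
    chrom=${1%%:*}          # text before the colon
    range=${1#*:}           # text after the colon
    start=${range%-*}       # number before the dash
    end=${range#*-}         # number after the dash
    echo "-seq=$chrom -start=$((start - 1)) -end=$end"
}

twobit_args=$(position_to_twobit_args "chrX:8805433-8805482")
echo "$twobit_args"
```

The resulting start agrees with the `o=8805432` offset in the UCSC link above. The arguments would then be passed to `twoBitToFa` together with an mm10 `.2bit` file, whose location is left as an assumption here.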

We look for that sequence in the mm10 genome using BLAT to see if it appears more than once, and it does! However, there is only one perfect match, the first one, where all 50bp are identical, and it is on chromosome X.

``` bash
   ACTIONS      QUERY   SCORE START   END QSIZE IDENTITY  CHROM                 STRAND  START       END   SPAN
--------------------------------------------------------------------------------------------------------------
browser details YourSeq    50     1    50    50   100.0%  chrX                  +     8805433   8805482     50
browser details YourSeq    49     1    50    50   100.0%  chrX                  -     8587447   8587676    230
browser details YourSeq    45     1    49    50    96.0%  chrX                  -    36751607  36751655     49
more results with 100% identity but on other chromosomes.
```

Let's see which gRNA is in that region using `grep`. We save the DNA sequence in a file so that we can search it for every gRNA in the library file. `grep` reports that the matching gRNA is GGGAGCCACCATTTGGTTGCTGG, which corresponds to >ENSMUSG00000071816_gR76f.

```bash
$ echo GATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCC > DNA.txt
$ cat DNA.txt
GATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCC
$ grep -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta DNA.txt
GATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCC
# Colored region: GGGAGCCACCATTTGGTTGCTGG
```

Running BLAT on that gRNA, the top 4 results show that the best hit has 100% identity over the full length and lies within the 50bp region found above.

```txt
   ACTIONS      QUERY   SCORE START   END QSIZE IDENTITY  CHROM                 STRAND  START       END   SPAN
--------------------------------------------------------------------------------------------------------------
browser details YourSeq    23     1    23    23   100.0%  chrX                  +     8805440   8805462     23
browser details YourSeq    22     2    23    23   100.0%  chr12                 -   111642051 111642072     22
browser details YourSeq    22     1    22    23   100.0%  chr10                 -   107309794 107309815     22
browser details YourSeq    21     1    23    23    95.7%  chrX                  -    36751626  36751648     23
``` 

The next step is to run BLAT with these sequences against the C3H genome, to see whether we can identify a clear match and resolve the multiple matches. This will be done on the virtual machine.

To run BLAT under the same conditions as the web version, use the following parameters (retrieved from [BLAT FAQ num.5](https://genome.ucsc.edu/FAQ/FAQblat.html#blat5)):

```bash
blat -stepSize=5 -repMatch=2253 -minScore=20 -minIdentity=0 database.2bit query.fa output.psl
```

We run BLAT like this:

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat get-coordinates.sh
#!/bin/bash

#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mexposit@upf.edu
#SBATCH --mem=60000

#SBATCH -e stderr_filt_%j.err
#SBATCH -o stdout_filt_%j.out

# Getting variables


# Load modules
module load BLAT/3.5-foss-2016b

# BLAT
blat -t=dna -q=dna -repMatch=2253 -stepSize=5 -minMatch=1 -minScore=20 -minIdentity=0 -noHead -out=psl ../../Reference_Genomes/GCA_001632575.1_C3H_HeJ_v1.fa Repeated_sequences.fasta Repeated_C3H_gRNAcoordinates.psl
```

We get almost 600 results, so we need to filter. Filtering by chromosome (X) and by perfect match (50bp), we get:

```bash
[mexposit@mr-login Coordinates_MarcExp]$ grep chrX Repeated_C3H_gRNAcoordinates.psl | awk -F, 'int($1) == 50'
50      0       0       0       0       0       0       0       +       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3070614 3070664 1       50,     0,      3070614,
50      0       0       0       0       0       0       0       +       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3116030 3116080 1       50,     0,      3116030,
50      0       0       0       0       0       0       0       +       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3169135 3169185 1       50,     0,      3169135,
50      0       0       0       0       0       0       0       +       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3521442 3521492 1       50,     0,      3521442,
50      0       0       0       0       0       0       0       +       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3580294 3580344 1       50,     0,      3580294,
50      0       0       0       0       0       0       0       -       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3338476 3338526 1       50,     0,      3338476,
50      0       0       0       0       0       0       0       -       chrX_8805433    50      0       50 chrX                                                                                                   168548779        3447353 3447403 1       50,     0,      3447353,
50      0       0       0       0       0       1       180     -       chrX_8805433    50      0       50 chrUn_LVXL01033584v1                                                                                   2160     878     1108    2       45,5,   0,45,   878,1103,
```
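
The last line of that output is on `chrUn_LVXL01033584v1`; it passes `grep chrX` only because the query name contains `chrX`. Filtering on the PSL target-name column avoids that. A minimal sketch, assuming the standard PSL column order (1 = matches, 9 = strand, 10 = qName, 11 = qSize, 14 = tName, 16 = tStart, 17 = tEnd); `perfect_hits_on` is a hypothetical helper.

```bash
# Sketch: filter PSL hits on the target name (column 14) instead of grepping
# the whole line, and require matches (column 1) to equal the query size (column 11).
# perfect_hits_on is a hypothetical helper.
perfect_hits_on() {
    # $1 = PSL file without header, $2 = target sequence name
    awk -v target="$2" '$14 == target && $1 == $11 {
        # PSL tStart (column 16) is 0-based; add 1 to get a 1-based start
        print $10, $9, $16 + 1, $17
    }' "$1"
}

# Demo with two lines from the output above (one chrX hit, the chrUn hit).
demo_psl=$(mktemp)
printf '50 0 0 0 0 0 0 0 + chrX_8805433 50 0 50 chrX 168548779 3116030 3116080 1 50, 0, 3116030,\n' > "$demo_psl"
printf '50 0 0 0 0 0 1 180 - chrX_8805433 50 0 50 chrUn_LVXL01033584v1 2160 878 1108 2 45,5, 0,45, 878,1103,\n' >> "$demo_psl"

hits=$(perfect_hits_on "$demo_psl" chrX)
echo "$hits"
rm -f "$demo_psl"
```

After the shift, the start and end of the chrX hit line up with the first Remap row for `chrX_8805433` (3116031-3116080), which suggests the Remap report uses 1-based coordinates.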

This is more than the results we had initially. Some of them are similar to the 6 results we got. We now search with 150bp instead of just 50bp. For that, we go to the genome browser again, enter the coordinates and download the sequence. Before we used chrX:8805433-8805482; now we add 50bp to each end, so the coordinates searched are chrX:8805383-8805532, and the sequence is:

```bash
>mm10_dna range=chrX:8805383-8805532 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ACTGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCCGGAAGAGCAGCCAGTGTCCTTTACCACTTAGCTATCTCGCCAGCCAACTC
```
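
As a side check, this 150bp window can be searched for the three gRNAs listed in section 3.3.1, in both orientations. This is only a sketch: `revcomp` and `find_orientation` are hypothetical helpers, and the sequences are copied from this notebook.

```bash
# Sketch: check whether a gRNA occurs in a window in either orientation.
# revcomp and find_orientation are hypothetical helpers.
revcomp() {
    # complement with tr, then reverse the string with awk
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

find_orientation() {
    # $1 = gRNA sequence, $2 = window sequence
    case "$2" in
        *"$1"*) echo "forward"; return ;;
    esac
    case "$2" in
        *"$(revcomp "$1")"*) echo "reverse" ;;
        *) echo "absent" ;;
    esac
}

window=ACTGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCCGGAAGAGCAGCCAGTGTCCTTTACCACTTAGCTATCTCGCCAGCCAACTC

gr76f=$(find_orientation GGGAGCCACCATTTGGTTGCTGG "$window")
gr28f=$(find_orientation GTCTTCAGACACCCCAGAATTGG "$window")
gr132r=$(find_orientation TGGCTGGCGAGATAGCTAAGTGG "$window")
echo "gR76f: $gr76f  gR28f: $gr28f  gR132r: $gr132r"
```

A plain `grep` for the library sequences would miss gRNAs that sit on the reverse strand, which is why the reverse complement is tried as well.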

We put that sequence into the file we use as the BLAT query and repeat the BLAT run:

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat Repeated_sequences.fasta
>chrX_8805433
ACTGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCCGGAAGAGCAGCCAGTGTCCTTTACCACTTAGCTATCTCGCCAGCCAACTC
[mexposit@mr-login Coordinates_MarcExp]$ sbatch get-coordinates.sh
Submitted batch job 14429159
```

We get exactly the same hits as before.

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat Repeated_C3H_gRNAcoordinates.psl | awk -F, 'int($1) == 150'
150     0       0       0       0       0       0       0       +       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3070564 3070714 1       150,    0,      3070564,
150     0       0       0       0       0       0       0       +       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3115980 3116130 1       150,    0,      3115980,
150     0       0       0       0       0       0       0       +       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3169085 3169235 1       150,    0,      3169085,
150     0       0       0       0       0       0       0       +       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3521392 3521542 1       150,    0,      3521392,
150     0       0       0       0       0       0       0       +       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3580244 3580394 1       150,    0,      3580244,
150     0       0       0       0       0       0       0       -       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3338426 3338576 1       150,    0,      3338426,
150     0       0       0       0       0       0       0       -       chrX_8805433    150     0       150                                                                                                       chrX     168548779       3447303 3447453 1       150,    0,      3447303,
```
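
As a possible alternative to the `int($1)` filter, the PSL columns can be compared directly: column 1 is the number of matching bases, column 2 the mismatches and column 11 the query size. The sketch below keeps only full-length, mismatch-free hits. The first two sample lines are taken from the output above; the third is a made-up partial hit, and `full_length_hits` is a hypothetical helper name.

```bash
# Sample PSL lines: two real hits from above plus one invented 140/150 bp hit.
psl=$(mktemp)
printf '%s\n' \
  "150 0 0 0 0 0 0 0 + chrX_8805433 150 0 150 chrX 168548779 3070564 3070714 1 150, 0, 3070564," \
  "150 0 0 0 0 0 0 0 - chrX_8805433 150 0 150 chrX 168548779 3338426 3338576 1 150, 0, 3338426," \
  "140 0 0 0 0 0 0 0 + chrX_8805433 150 0 140 chrX 168548779 9999000 9999140 1 140, 0, 9999000," \
  | tr ' ' '\t' > "$psl"

# full_length_hits FILE: print target, start, end and strand for hits where
# every query base matches (matches == qSize) and there are no mismatches.
full_length_hits() {
    awk -F'\t' -v OFS='\t' '$1 == $11 && $2 == 0 { print $14, $16, $17, $9 }' "$1"
}

full_length_hits "$psl"
```

Comparing against the query size avoids hard-coding 150, so the same filter can be reused when the query length changes.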

Next, we add 400 bp on each side of the 150 bp sequence, giving a 950 bp query:

```bash
GTCCTGCGTCCTACCAGCATGTGAGGGGATGAGGCAGTGATGAAGCTTAGCTGACACTTCCTAACCAATCAGAAGGAGACTTCTAAAAGGGCACCTGTCACTGGTCAGATGAATGAGAGCCAGGCTCCACTAGCTCCTCCTCCATGAACGGGCCCACTCTCCACTCTAGCCTTCCCTAAATCAGCAAGATTCTGAGCCTTTGGAATATTTACTAACAAACAAAACCTTCTGATTTGTCATTATCTGTAGATGAATGCTTTGAAGAATCTTTTGGTGTGACACCGAGAAAACGAATGAAGGCAAGTATCACCTCTTTCCTCAGGAATGCACCCTGTTTGTCCCTCAGTTCATGTCCGCATTGTTTTTTTTTTAAGTCCTATTTATTATTATATGTAAGTACACTGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGATGTTTGGGAGCCACCATTTGGTTGCTGGGATTTGAACTCAGGACCTCCGGAAGAGCAGCCAGTGTCCTTTACCACTTAGCTATCTCGCCAGCCAACTCTTTTTTTTTTTCCGATTTTTTCTTTCCACATTGTTGAGGACACAGTTTTCTTCTTGGTGGTGGATAAGTGTATAATTATATGTTTGATATTGAAAGTGTAAATTTTAATGTGTAGCTTCACATGCTGTTTCAAATGCCAATTCTTACTGTATCTTTCATAGAGTTATATATAGTTAGTCTATATATTTATAGTCATTGAATCTTCGGGTTTGGTTAATTTTGGCATGTTATGATTTTACTTATTTGCCAGGCCCTGTTCTAAATATTATCCAGGAAATTAGAACACCCTTTATCAAGTAGCAGAAACACTCCCTGCAGATTAATATTTGCCAAAATGTCTTTTCTCTGATTCTCTTTTATAGCTGACATCAGTGACAATAAGTATTCATAATGTAGAAGG
```

The results are identical, again.

```bash
950     0       0       0       0       0       0       0       +       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3070164 3071114 1       950,    0,      3070164,
950     0       0       0       0       0       0       0       +       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3115580 3116530 1       950,    0,      3115580,
950     0       0       0       0       0       0       0       +       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3168685 3169635 1       950,    0,      3168685,
950     0       0       0       0       0       0       0       +       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3520992 3521942 1       950,    0,      3520992,
950     0       0       0       0       0       0       0       +       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3579844 3580794 1       950,    0,      3579844,
950     0       0       0       0       0       0       0       -       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3338026 3338976 1       950,    0,      3338026,
950     0       0       0       0       0       0       0       -       chrX_8805433    950     0       950                                                                                                       chrX     168548779       3446903 3447853 1       950,    0,      3446903,
```

###### chrX_8805385

This one appears 7 times.

```bash
$ grep 'chrX_8805385' report_target-sequences_mm10.tsv
chrX_8805385    1       1       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3115983 3116032 +       1       Second Pass     Primary Assembly
chrX_8805385    1       2       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3169088 3169137 +       1       Second Pass     Primary Assembly
chrX_8805385    1       3       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3580247 3580296 +       1       Second Pass     Primary Assembly
chrX_8805385    1       4       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3521395 3521444 +       1       Second Pass     Primary Assembly
chrX_8805385    1       5       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3070567 3070616 +       1       First Pass      Primary Assembly
chrX_8805385    1       6       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3338525 3338574 -       1       Second Pass     Primary Assembly
chrX_8805385    1       7       chrX    CM004254.1      50      50      8805385 8805434 +       8805385 8805434 3447402 3447451 -       1       Second Pass     Primary Assembly
```

To get the sequence of that region, we retrieve the DNA sequence for chrX:8805385-8805434 in the UCSC Genome Browser. Then we save the sequence in `DNA.txt` and use `grep` to find which gRNA is in the target sequence. The matching gRNA is >ENSMUSG00000071816_gR28f, whose sequence is GTCTTCAGACACCCCAGAATTGG.

```bash
$ echo TGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGA > DNA.txt
$ grep --color -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta DNA.txt
TGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGA
# Colored region: GTCTTCAGACACCCCAGAATTGG
```

###### chrX_8805489

This one appears 7 times.

```bash
$ grep 'chrX_8805489' report_target-sequences_mm10.tsv
chrX_8805489    1       1       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3116087 3116136 +       1       Second Pass     Primary Assembly
chrX_8805489    1       2       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3169192 3169241 +       1       Second Pass     Primary Assembly
chrX_8805489    1       3       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3580351 3580400 +       1       Second Pass     Primary Assembly
chrX_8805489    1       4       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3521499 3521548 +       1       Second Pass     Primary Assembly
chrX_8805489    1       5       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3072293 3072342 +       1       First Pass      Primary Assembly
chrX_8805489    1       6       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3338421 3338470 -       1       Second Pass     Primary Assembly
chrX_8805489    1       7       chrX    CM004254.1      50      50      8805489 8805538 +       8805489 8805538 3447298 3447347 -       1       Second Pass     Primary Assembly
```

To get the sequence of that region, we retrieve the DNA sequence for chrX:8805489-8805538 in the UCSC Genome Browser. No gRNA is found in the positive-sense sequence, so we try the negative sense (reverse complement). The negative-sense sequence is AAAAAAGAGTTGGCTGGCGAGATAGCTAAGTGGTAAAGGACACTGGCTGC, which contains the gRNA TGGCTGGCGAGATAGCTAAGTGG, that is, ENSMUSG00000071816_gR132r.

```bash
$ echo AAAAAAGAGTTGGCTGGCGAGATAGCTAAGTGGTAAAGGACACTGGCTGC > DNA.txt
$ grep -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta DNA.txt
AAAAAAGAGTTGGCTGGCGAGATAGCTAAGTGGTAAAGGACACTGGCTGC
# colored region: TGGCTGGCGAGATAGCTAAGTGG
```
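
`grep -f` shows that some gRNA matches but not which one, and it needs a second pass for the opposite strand. The sketch below reports the gRNA name and the orientation in one go. It assumes a FASTA file with one sequence per line; `find_grna` is a hypothetical helper name, and the second region is the reverse complement of the negative-sense sequence quoted above.

```bash
# Two gRNAs from the library, written to a temporary FASTA file.
lib=$(mktemp)
printf '%s\n' \
  '>ENSMUSG00000071816_gR28f'  'GTCTTCAGACACCCCAGAATTGG' \
  '>ENSMUSG00000071816_gR132r' 'TGGCTGGCGAGATAGCTAAGTGG' > "$lib"

# find_grna FASTA REGION: list gRNAs found in REGION, forward or reverse complement.
find_grna() {
    awk -v region="$(printf '%s' "$2" | tr 'acgt' 'ACGT')" '
        function revcomp(s,    i, out, c) {
            out = ""
            for (i = length(s); i >= 1; i--) {
                c = substr(s, i, 1)
                out = out (c == "A" ? "T" : c == "C" ? "G" : c == "G" ? "C" : c == "T" ? "A" : "N")
            }
            return out
        }
        /^>/ { name = substr($0, 2); next }      # remember the current gRNA name
        {
            seq = toupper($0)
            if (index(region, seq))               print name "\tforward"
            else if (index(region, revcomp(seq))) print name "\treverse_complement"
        }
    ' "$1"
}

find_grna "$lib" TGTAACTGTCTTCAGACACCCCAGAATTGGGCCTCAGATCTCATTACAGA
find_grna "$lib" GCAGCCAGTGTCCTTTACCACTTAGCTATCTCGCCAGCCAACTCTTTTTT
```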

Let's take 900 bp on each side of the target region (1850 bp in total) and align it against the C3H genome.

```bash
CTAGTGGACCTGCTCTTTGGGTCAATAATCTAGCACAGTTGAGGCAATACTCATCATCAGTTCTCTAAGTACTAAAATGTTCTAATTGAAGAAACACGGGTCCTGTATTGTAACTCTGACCAAGAACCTGTAGTTATACAGGACACCTGGGAAACCCGTTACTATGATTATGATAAATGAATGGCCCCTGGTAGGTCAACCATACTCTAGTGGATGATTCTACATCCTACAGTATGTACATAGCACAAATTGCCCTTGGGTTTTTTATTTATAAAAAAATTAAAATAAAAAAGGTATGGATGTTGGGTGAAAAACAGCTGGGAAAGATGTTTATATTCATTGGATGAAAGTGTGGGGTATGTGTGAATATGATTAAAATAAATGATATAAAATCCTCAAAGAATTAAATAAATACCAAATTGACAAACAATAACATATGTTAAGGATAAGAATTAGATTGTAGCCAACCTGCCAAGTTGGAATCATTCTCTCCAGAGGCAAGGCTTCCTTCTACATTATGAATACTTATTGTCACTGATGTCAGCTATAAAAGAGAATCAGAGAAAAGACATTTTGGCAAATATTAATCTGCAGGGAGTGTTTCTGCTACTTGATAAAGGGTGTTCTAATTTCCTGGATAATATTTAGAACAGGGCCTGGCAAATAAGTAAAATCATAACATGCCAAAATTAACCAAACCCGAAGATTCAATGACTATAAATATATAGACTAACTATATATAACTCTATGAAAGATACAGTAAGAATTGGCATTTGAAACAGCATGTGAAGCTACACATTAAAATTTACACTTTCAATATCAAACATATAATTATACACTTATCCACCACCAAGAAGAAAACTGTGTCCTCAACAATGTGGAAAGAAAAAATCGGAAAAAAAAAAAGAGTTGGCTGGCGAGATAGCTAAGTGGTAAAGGACACTGGCTGCTCTTCCGGAGGTCCTGAGTTCAAATCCCAGCAACCAAATGGTGGCTCCCAAACATCTGTAATGAGATCTGAGGCCCAATTCTGGGGTGTCTGAAGACAGTTACAGTGTACTTACATATAATAATAAATAGGACTTAAAAAAAAAACAATGCGGACATGAACTGAGGGACAAACAGGGTGCATTCCTGAGGAAAGAGGTGATACTTGCCTTCATTCGTTTTCTCGGTGTCACACCAAAAGATTCTTCAAAGCATTCATCTACAGATAATGACAAATCAGAAGGTTTTGTTTGTTAGTAAATATTCCAAAGGCTCAGAATCTTGCTGATTTAGGGAAGGCTAGAGTGGAGAGTGGGCCCGTTCATGGAGGAGGAGCTAGTGGAGCCTGGCTCTCATTCATCTGACCAGTGACAGGTGCCCTTTTAGAAGTCTCCTTCTGATTGGTTAGGAAGTGTCAGCTAAGCTTCATCACTGCCTCATCCCCTCACATGCTGGTAGGACGCAGGACATTCTGACTCTCTTTGAAGGTGTTACACATTGTCCATTTACTTGTCCAGAATAATGTACACAATAATGTTTTAGAATTGTGCATGACAGGATCTTCTACATAATCCTCAGAATGAGCTCTGTTCCACACAACCCTGATCTGAACAAGAAAAGAGCTGCCTATCACTTTAGCATGCATCCTGACTGAGTAGGTATACTTTCACAGCCCCTCCCTGATCACTCACCTTCACTGTCATGATCTTCAATGCCTTCAACCAGGAATTGCTTGGCCTGCTCCTTGCCACGCATGAAAACTGGTTGGTTCACGTTGACCCCTAAGAGGAAGCAGAGTGCTTGTTCTTTAACAGGCAGCACAGTGAGGTGAAATGCTGTTAGGAGCTGGGCTACAGCCCATA
```

We find 5 regions of the genome that match the full length of the query. Four of them coincide with the ones identified by NCBI's tool.

```bash
[mexposit@mr-login Coordinates_MarcExp]$ awk '{print $1}' Repeated_C3H_gRNAcoordinates.psl | sort -nr | uniq -c | head
      5 1850
      2 1557
      1 1549
      1 714
      1 689
      1 594
      1 531
      1 512
      1 509
      1 508
[mexposit@mr-login Coordinates_MarcExp]$ cat Repeated_C3H_gRNAcoordinates.psl | awk -F, 'int($1) == 1850'
1850    0       0       0       0       0       1       1622    -       test    1850    0       1850    chrX                                 168548779        3069770 3073242 2       890,960,        0,890,  3069770,3072282,
1850    0       0       0       0       0       1       1622    -       test    1850    0       1850    chrX                                 168548779        3115186 3118658 2       890,960,        0,890,  3115186,3117698,
1850    0       0       0       0       0       1       1622    -       test    1850    0       1850    chrX                                 168548779        3168291 3171763 2       890,960,        0,890,  3168291,3170803,
1850    0       0       0       0       0       1       1622    -       test    1850    0       1850    chrX                                 168548779        3520598 3524070 2       890,960,        0,890,  3520598,3523110,
1850    0       0       0       0       0       1       1622    -       test    1850    0       1850    chrX                                 168548779        3579450 3582922 2       890,960,        0,890,  3579450,3581962,
```
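
These hits cover all 1850 query bases, but column 8 (`tBaseInsert`) is 1622, so each hit is split into two blocks on the target. The block layout can be read from the `blockSizes` and `tStarts` columns (19 and 21). The sketch below recomputes the target span and the gap between blocks for one of the hits above; `psl_gaps` is a hypothetical helper name.

```bash
# One of the 1850 bp hits shown above, with fields joined by tabs.
hit=$(printf '%s' "1850 0 0 0 0 0 1 1622 - test 1850 0 1850 chrX 168548779 3069770 3073242 2 890,960, 0,890, 3069770,3072282," | tr ' ' '\t')

# psl_gaps: for each PSL line on stdin, print the target span, the summed
# block sizes and the gap in the target before each later block.
psl_gaps() {
    awk -F'\t' '{
        n = split($19, size, ",") - 1        # blockSizes ends with a comma
        split($21, tstart, ",")
        total = 0
        for (i = 1; i <= n; i++) total += size[i]
        printf "span=%d aligned=%d", $17 - $16, total
        for (i = 2; i <= n; i++)
            printf " gap%d=%d", i - 1, tstart[i] - (tstart[i-1] + size[i-1])
        printf "\n"
    }'
}

printf '%s\n' "$hit" | psl_gaps   # span=3472 aligned=1850 gap1=1622
```

The 3472 bp span is the 1850 aligned bases plus the 1622 bp that separate the two blocks in C3H.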

**Conclusion:** We believe these targets belong to repetitive regions in the C3H genome. Since only 3 targets are affected, we will simply delete them, so the studied set will not contain all gRNAs of the original library. We delete these gRNAs: chrX_8805433, chrX_8805385, and chrX_8805489.

We do it with `grep -v`, passing multiple patterns separated by `\|`. The v2 version has 20 fewer lines, which corresponds to the 6+7+7 matches of the 3 gRNAs. A final grep confirms that no line with these three gRNAs remains.

```bash
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ grep -v 'chrX_8805433\|chrX_8805385\|chrX_8805489' C3H_preprocess_v1.tsv >C3H_preprocess_v2.tsv
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ wc -l C3H_preprocess_v*
  1799 C3H_preprocess_v1.tsv
  1779 C3H_preprocess_v2.tsv

$ grep 'chrX_8805433\|chrX_8805385\|chrX_8805489' C3H_preprocess_v2.tsv
# empty
```

#### 3.3.3. Missing targets

There are 26 targets that have not been identified in the C3H genome.

```bash
Usuario@DESKTOP-7MAB2EU MINGW64 ~/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H
$ grep NOMAP report_target-sequences_mm10.tsv | wc -l
26

Usuario@DESKTOP-7MAB2EU MINGW64 ~/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H
$ grep NULL report_target-sequences_mm10.tsv
chr14_73561051  1       NULL    chr14   NULL    50      NULL    73561051        73561100        +       NOMAP   ALIGNGAP
chr14_73561084  1       NULL    chr14   NULL    50      NULL    73561084        73561133        +       NOMAP   ALIGNGAP
chr14_73561022  1       NULL    chr14   NULL    50      NULL    73561022        73561071        +       NOMAP   ALIGNGAP
chrX_8413590    1       NULL    chrX    NULL    50      NULL    8413590 8413639 +       NOMAP   ALIGNGAP
chrX_8413972    1       NULL    chrX    NULL    50      NULL    8413972 8414021 +       NOMAP   ALIGNGAP
chr10_61147149  1       NULL    chr10   NULL    50      NULL    61147149        61147198        +       NOMAP   EXPANDED
chrX_8591470    1       NULL    chrX    NULL    50      NULL    8591470 8591519 +       NOMAP   ALIGNGAP
chrX_8591564    1       NULL    chrX    NULL    50      NULL    8591564 8591613 +       NOMAP   ALIGNGAP
chrX_8591155    1       NULL    chrX    NULL    50      NULL    8591155 8591204 +       NOMAP   ALIGNGAP
chr9_21457955   1       NULL    chr9    NULL    50      NULL    21457955        21458004        +       NOMAP   LOWCOV
chr7_4519891    1       NULL    chr7    NULL    50      NULL    4519891 4519940 +       NOMAP   ALIGNGAP
chrX_8750174    1       NULL    chrX    NULL    50      NULL    8750174 8750223 +       NOMAP   LOWCOV
chrX_8750156    1       NULL    chrX    NULL    50      NULL    8750156 8750205 +       NOMAP   LOWCOV
chrX_8523109    1       NULL    chrX    NULL    50      NULL    8523109 8523158 +       NOMAP   ALIGNGAP
chrX_8522851    1       NULL    chrX    NULL    50      NULL    8522851 8522900 +       NOMAP   ALIGNGAP
chrX_8367645    1       NULL    chrX    NULL    50      NULL    8367645 8367694 +       NOMAP   ALIGNGAP
chrX_8367609    1       NULL    chrX    NULL    50      NULL    8367609 8367658 +       NOMAP   ALIGNGAP
chrX_8367602    1       NULL    chrX    NULL    50      NULL    8367602 8367651 +       NOMAP   ALIGNGAP
chr19_3852850   1       NULL    chr19   NULL    50      NULL    3852850 3852899 +       NOMAP   ALIGNGAP
chr19_3852890   1       NULL    chr19   NULL    50      NULL    3852890 3852939 +       NOMAP   ALIGNGAP
chrX_8461411    1       NULL    chrX    NULL    50      NULL    8461411 8461460 +       NOMAP   ALIGNGAP
chr17_54298027  1       NULL    chr17   NULL    50      NULL    54298027        54298076        +       NOMAP   ALIGNGAP
chr12_106037247 1       NULL    chr12   NULL    50      NULL    106037247       106037296       +       NOMAP   LOWCOV
chrX_8876309    1       NULL    chrX    NULL    50      NULL    8876309 8876358 +       NOMAP   ALIGNGAP
chrX_8876458    1       NULL    chrX    NULL    50      NULL    8876458 8876507 +       NOMAP   ALIGNGAP
chr11_53890658  1       NULL    chr11   NULL    50      NULL    53890658        53890707        +       NOMAP   EXPANDED
```

Let's study one of them: chr14_73561051, which is chr14:73561051-73561100. We get the sequence in mm10 using UCSC as described above: GATTTCTGAGTTCGAGGCCAGCCTGGTCTACAAAGTGAGTGCCAGGACAG. We have to search with its reverse complement, which is CTGTCCTGGCACTCACTTTGTAGACCAGGCTGGCCTCGAACTCAGAAATC. This corresponds to the gRNA >ENSMUSG00000022110_gR354r.

```bash
$ grep --color -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta DNA.txt
CTGTCCTGGCACTCACTTTGTAGACCAGGCTGGCCTCGAACTCAGAAATC
# colored region: ACTCACTTTGTAGACCAGGCTGG
```

Searching for the gRNA sequence in the C3H genome returns no results. Hence, I will consider these gRNAs as missing in the C3H genome, even though they could be located elsewhere.

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat Repeated_sequences.fasta
>missing_grna
ACTCACTTTGTAGACCAGGCTGG
[mexposit@mr-login Coordinates_MarcExp]$ sbatch get-coordinates.sh
```

Hence, we will eliminate all the missing coordinates from the file.

**TODO:** After eliminating the missing and repeated targets, looking at the coverage in regions outside the remaining subset might reveal the locations of targeted regions that were not identified by BLAST or similar tools.

To eliminate all the missing coordinates from the file, we use an inverted grep (`grep -v`) on the `NULL` tag that all missing sequences have. The final file contains 26 fewer lines, corresponding to the 26 target regions eliminated.

```bash
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ grep -v NULL C3H_preprocess_v2.tsv > C3H_preprocess_v3.tsv
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ wc -l C3H_preprocess_v*
  1799 C3H_preprocess_v1.tsv
  1779 C3H_preprocess_v2.tsv
  1753 C3H_preprocess_v3.tsv
```
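
`grep -v NULL` drops any line that contains the text `NULL` anywhere, and the removed targets are not kept. A possible variant, sketched below, tests whole fields and writes the unmapped rows to their own file, so the excluded targets can be listed later. `split_unmapped` is a hypothetical helper name, and the sample rows are copied from the report above.

```bash
# split_unmapped IN MAPPED UNMAPPED: rows with any field equal to "NULL"
# go to UNMAPPED, all other rows go to MAPPED.
split_unmapped() {
    awk -F'\t' -v mapped="$2" -v unmapped="$3" '{
        hit = 0
        for (i = 1; i <= NF; i++) if ($i == "NULL") hit = 1
        print > (hit ? unmapped : mapped)
    }' "$1"
}

report=$(mktemp); mapped=$(mktemp); unmapped=$(mktemp)
printf 'chr14_73561051\t1\tNULL\tchr14\tNULL\t50\tNULL\t73561051\t73561100\t+\tNOMAP\tALIGNGAP\n'  > "$report"
printf 'chrX_8413590\t1\tNULL\tchrX\tNULL\t50\tNULL\t8413590\t8413639\t+\tNOMAP\tALIGNGAP\n'      >> "$report"
printf 'chrX_8805385\t1\t1\tchrX\tCM004254.1\t50\t50\t8805385\t8805434\t+\t8805385\t8805434\t3115983\t3116032\t+\t1\tSecond Pass\tPrimary Assembly\n' >> "$report"

split_unmapped "$report" "$mapped" "$unmapped"
cut -f1 "$unmapped"      # names of the targets that did not map
```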

#### 3.3.4. Targets with coverage different than 1

Looking at the coverage, some targets do not match completely and some match over a longer region than the original. The 26 empty values are the targets not found in the results. There are 1724 with coverage exactly 1 and the rest have a coverage different from 1. Those with coverage<1 have target length<50, while those with coverage>1 have target length>50. These are the sites with indels or mutations between the mm10 and C3H genomes.

```bash
$ cut -f16 report_target-sequences_mm10.tsv | sort -n | uniq -c
     26
      1 coverage
   1724 1
      1 0.6
      2 0.7
      1 0.8
      2 1.5
      1 0.56
      1 0.62
      1 0.76
      1 0.78
      2 0.84
      2 0.86
      1 0.88
      2 0.92
      1 0.94
      9 0.96
      9 0.98
      9 1.02
      3 1.04
      1 1.16
      1 1.22
      2 1.26
```

It would be nice to find a way to see the mutations at all of these coordinates. For now, however, the easiest solution is to eliminate them. A total of 53 lines have coverage not equal to 1, which is equivalent to saying that end minus start is not 49 (target regions are 50 bp long and the coordinates are inclusive). The coverage table above lists only 52 such targets, so the 53rd line is presumably the header line, which also fails both tests.

```bash
$ awk '$14 - $13 != 49' C3H_preprocess_v3.tsv | wc -l
53
$ awk '$16 != 1.0' C3H_preprocess_v3.tsv | wc -l
53
```
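
When a table starts with a header line, a numeric test such as `$16 != 1.0` also selects the header, because the text `coverage` is not equal to 1. Whether `C3H_preprocess_v3.tsv` still carries its header is an assumption here. The sketch below uses a hypothetical three-column miniature of the table to show the difference; `count_off_coverage` is a hypothetical helper name.

```bash
# Hypothetical miniature table: header plus three rows, coverage in column 3.
tsv=$(mktemp)
printf 'name\tlength\tcoverage\n'               > "$tsv"
printf 'a\t50\t1\nb\t48\t0.96\nc\t51\t1.02\n'  >> "$tsv"

# count_off_coverage FILE COL: number of data rows (header skipped)
# whose coverage column is not exactly 1.
count_off_coverage() {
    awk -F'\t' -v col="$2" 'NR > 1 && $col != 1 { n++ } END { print n + 0 }' "$1"
}

count_off_coverage "$tsv" 3            # 2: rows b and c
awk -F'\t' '$3 != 1' "$tsv" | wc -l    # 3: the header is counted as well
```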

To eliminate them, we use `awk` to print only the lines with coverage equal to 1. This removes the 53 lines identified above with coverage different from 1 in the C3H genome.

```bash
~/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ awk '$16 == 1.0' C3H_preprocess_v3.tsv >C3H_preprocess_v4.tsv

~/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ wc -l C3H_preprocess_v*
  1799 C3H_preprocess_v1.tsv
  1779 C3H_preprocess_v2.tsv
  1753 C3H_preprocess_v3.tsv
  1700 C3H_preprocess_v4.tsv
  7031 total
```




### 3.4. Final target coordinates in BED format

We will convert the refined target regions in C3H into BED format, taking the chromosome from column `$4` and the start and end from columns `$13` and `$14`. See the guidelines for the `bed` format [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). We also add the target name (column `$1`) as the fourth column to identify each line.

```bash
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ awk -v OFS='\t' '{print $4, $13, $14, $1}' C3H_preprocess_v4.tsv > C3Hcoord_sub.bed
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ head -n 1 C3Hcoord_sub.bed
chr5    73597558        73597607        chr5_73647359
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ wc -l C3Hcoord_sub.bed
1700 C3Hcoord_sub.bed
```
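
BED intervals are 0-based and half-open: the first base of a chromosome is numbered 0 and the end position is not included. If columns `$13` and `$14` are 1-based and inclusive (an assumption, suggested by 50 bp regions such as 3115983-3116032), the start has to be decreased by one to keep all 50 bases. The sketch below shows that variant on one row copied from the report above; `to_bed` is a hypothetical helper name.

```bash
# One report row from above (18 tab-separated fields), used only to show the column layout.
row=$(printf 'chrX_8805385\t1\t1\tchrX\tCM004254.1\t50\t50\t8805385\t8805434\t+\t8805385\t8805434\t3115983\t3116032\t+\t1\tSecond Pass\tPrimary Assembly')

# to_bed: chromosome (col 4), start - 1 (col 13), end (col 14), name (col 1).
to_bed() {
    awk -F'\t' -v OFS='\t' '{ print $4, $13 - 1, $14, $1 }' "$@"
}

printf '%s\n' "$row" | to_bed                                   # chrX 3115982 3116032 chrX_8805385
printf '%s\n' "$row" | to_bed | awk -F'\t' '{ print $3 - $2 }'  # 50
```

Without the `- 1`, each interval is 49 nt long in BED terms, which matches the 49 nt sequences in the `bedtools getfasta` output of this section.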

We will use that BED file to get the sequence of the regions in the C3H genome with `bedtools getfasta`, as explained [here](https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html), selecting the "-bedOut Creating a tab-delimited BED file in lieu of FASTA output" option. Hence, the sequence of the target region will be in the 4th column, which is usually reserved for a name.

This is done on the Marvin server because we have the genome there, so we transfer the `.bed` file into the "Coordinates_MarcExp" folder. However, Marvin has the "BEDTools/2.27.1-foss-2018b" version installed, which is older than 2.29, so the "bedOut" option probably doesn't work. We prepare the bash script shown below and, after executing it, get the file shown below. The 2nd column is the target sequence in the C3H genome, but this is not a real `.bed` file, so we have to modify it manually: we just need to replace `:` and `-` with tabs. Then we have a real `bed` file.

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat get-sequenceFromCord.sh
#!/bin/bash

#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marc.exposit@upf.edu
#SBATCH --mem=60000

#SBATCH -e stderr_filt_%j.err
#SBATCH -o stdout_filt_%j.out

# Getting variables


# Load modules
module load BEDTools/2.27.1-foss-2018b

# BEDtools
bedtools getfasta -fi ../../Reference_Genomes/GCA_001632575.1_C3H_HeJ_v1.fa -bed C3Hcoord_sub.bed -fo C3Hcoord_sub_Seq.bed -tab


[mexposit@mr-login Coordinates_MarcExp]$ head C3Hcoord_sub_Seq.bed
chr5:73597558-73597607  CGACTGACGTTATTTTTATCCGCATGGTTTCTGGCTGCGGAGAGGCACG
chr5:73597651-73597700  GAGGGACAACTCCGCTCTCTGGCCGGCCGGGAGCCGCTGCCCAGGATCC
chr5:73597699-73597748  CAGCCCGGCCTTCGCCAGCATCTTGGCTGCGGACAGCTCCCGCCCTGCG
chr7:29844927-29844976  GATCCTCTTGTCCTCTCCACCCCCTCAGAACGTGGAAACACCCGTATCT
chr7:29844875-29844924  atttagagagatcacccgactctgcctggattaaaaccctgcTCAGCCT
chr7:29844914-29844963  tgcTCAGCCTTTTGATCCTCTTGTCCTCTCCACCCCCTCAGAACGTGGA
chr16:64537654-64537703 TTCACTGGGGAATGACAGCCGTGTTAATTAGTGACCTGGAGTAACAGTT
chr16:64537863-64537912 ATTAAAAATTAACTGTCGTGGAACAATAATCTTTAAAACCCACCTAGGA
chr10:66825377-66825426 CCCAACTTGCGCATCGCCCCAAAGTGAACAGGGTTAACAAGCCGAGGCG
chr10:66825197-66825246 CGCTACTGGGCTAGGGTCAAAGAGATGGGAAAGTTCATCAGTCGGGTTA

[mexposit@mr-login Coordinates_MarcExp]$ sed 's/:/\t/g' <C3Hcoord_sub_Seq.bed >process1.txt
[mexposit@mr-login Coordinates_MarcExp]$ sed 's/-/\t/g' <process1.txt >C3Hcoord_sub_Seq.bed
[mexposit@mr-login Coordinates_MarcExp]$ rm process1.txt
rm: remove regular file ‘process1.txt’? y
[mexposit@mr-login Coordinates_MarcExp]$ head C3Hcoord_sub_Seq.bed
chr5    73597558        73597607        CGACTGACGTTATTTTTATCCGCATGGTTTCTGGCTGCGGAGAGGCACG
chr5    73597651        73597700        GAGGGACAACTCCGCTCTCTGGCCGGCCGGGAGCCGCTGCCCAGGATCC
chr5    73597699        73597748        CAGCCCGGCCTTCGCCAGCATCTTGGCTGCGGACAGCTCCCGCCCTGCG
chr7    29844927        29844976        GATCCTCTTGTCCTCTCCACCCCCTCAGAACGTGGAAACACCCGTATCT
chr7    29844875        29844924        atttagagagatcacccgactctgcctggattaaaaccctgcTCAGCCT
chr7    29844914        29844963        tgcTCAGCCTTTTGATCCTCTTGTCCTCTCCACCCCCTCAGAACGTGGA
chr16   64537654        64537703        TTCACTGGGGAATGACAGCCGTGTTAATTAGTGACCTGGAGTAACAGTT
chr16   64537863        64537912        ATTAAAAATTAACTGTCGTGGAACAATAATCTTTAAAACCCACCTAGGA
chr10   66825377        66825426        CCCAACTTGCGCATCGCCCCAAAGTGAACAGGGTTAACAAGCCGAGGCG
chr10   66825197        66825246        CGCTACTGGGCTAGGGTCAAAGAGATGGGAAAGTTCATCAGTCGGGTTA

[mexposit@mr-login Coordinates_MarcExp]$ wc -l C3Hcoord_sub_Seq.bed
1700 C3Hcoord_sub_Seq.bed
```
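
The two `sed` passes and the temporary file can be folded into a single `awk` call. This is a sketch that assumes chromosome names contain neither `:` nor `-`; `tab_to_bed` is a hypothetical helper name.

```bash
# tab_to_bed: turn "chr:start-end<TAB>sequence" lines into four BED-like columns.
tab_to_bed() {
    awk -F'\t' -v OFS='\t' '{
        split($1, part, /[:-]/)      # chr5:73597558-73597607 -> chr5, 73597558, 73597607
        print part[1], part[2], part[3], $2
    }' "$@"
}

printf 'chr5:73597558-73597607\tCGACTGACGTTATTTTTATCCGCATGGTTTCTGGCTGCGGAGAGGCACG\n' | tab_to_bed
```

Because only the first column is split, a `-` or `:` elsewhere on the line is left untouched.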



### 3.5. Check which gRNAs were excluded

First we prepare a reverse-complement version of the gRNA sequences using an online tool, creating mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta, which has the same gRNAs as mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta but reverse complemented.

```bash
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ head -n 4 mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI*.fasta
==> mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta <==
>ENSMUSG00000078919_gR57r
GCAGGAAGGAGAAAGACGCGGGG
>ENSMUSG00000078919_gR97r
ATCTGTCAACAGATAGACACCGG

==> mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_mutated.fasta <==
>ENSMUSG00000078919_gR57r
GCAAGAAGGAGAAAGACGCGGGG
>ENSMUSG00000078919_gR97r
ATCTCTCAACAGATAGACACCGG

==> mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta <==
>ENSMUSG00000078919_gR57r_reverse_complement
CCCCGCGTCTTTCTCCTTCCTGC
>ENSMUSG00000078919_gR97r_reverse_complement
CCGGTGTCTATCTGTTGACAGAT
```
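
The reverse-complement file could also be produced on the command line instead of with an online tool. The sketch below assumes one sequence per line and that `rev` (from util-linux) and `tr` are available; `revcomp_fasta` is a hypothetical helper name, and the header suffix mimics the one shown above.

```bash
# revcomp_fasta FILE: reverse-complement every sequence line and tag the headers.
revcomp_fasta() {
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*) printf '%s_reverse_complement\n' "$line" ;;
            *)    printf '%s\n' "$line" | rev | tr 'ACGTacgt' 'TGCAtgca' ;;
        esac
    done < "$1"
}

fa=$(mktemp)
printf '%s\n' \
  '>ENSMUSG00000078919_gR57r' 'GCAGGAAGGAGAAAGACGCGGGG' \
  '>ENSMUSG00000078919_gR97r' 'ATCTGTCAACAGATAGACACCGG' > "$fa"

revcomp_fasta "$fa"
```

For these two gRNAs the output reproduces the first four lines of the `_revcomp.fasta` file shown above.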

Then, we use grep to count how many gRNAs appear in the target regions converted to the C3H genome.

```bash
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ grep -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta C3Hcoord_sub_Seq.bed | wc -l
451
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ grep -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta C3Hcoord_sub_Seq.bed | wc -l
343

mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ grep -v -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta C3Hcoord_sub_Seq.bed | grep -v -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta | wc -l
934
```

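Only 451 + 343 = 794 matches are found in the target regions, and 934 target regions do not appear to contain any gRNA in either orientation. Since 1700 - 934 = 766 regions contain at least one match, 794 - 766 = 28 regions match in both orientations. So 934+86 gRNAs in target regions are absent or contain mutations!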
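
The forward, reverse, both and neither counts can be obtained in one step. The sketch below uses `grep -i` because `bedtools getfasta` returns soft-masked (lowercase) bases for some regions, and it removes header and blank lines from the pattern files, since an empty pattern matches every line. `orientation_counts` is a hypothetical helper name, and the four regions are made up around two gRNAs from the library.

```bash
# orientation_counts FWD_FASTA RC_FASTA REGIONS: count regions matching each orientation.
orientation_counts() {
    local fwd rc either total n_fwd n_rc n_either
    fwd=$(mktemp); rc=$(mktemp); either=$(mktemp)
    grep -v -e '^>' -e '^$' "$1" > "$fwd"      # sequences only
    grep -v -e '^>' -e '^$' "$2" > "$rc"
    cat "$fwd" "$rc" > "$either"
    total=$(( $(wc -l < "$3") ))
    n_fwd=$(grep -c -i -F -f "$fwd" "$3")
    n_rc=$(grep -c -i -F -f "$rc" "$3")
    n_either=$(grep -c -i -F -f "$either" "$3")
    echo "forward=$n_fwd reverse=$n_rc either=$n_either both=$(( n_fwd + n_rc - n_either )) neither=$(( total - n_either ))"
    rm -f "$fwd" "$rc" "$either"
}

fwd_fa=$(mktemp); rc_fa=$(mktemp); regions=$(mktemp)
printf '%s\n' '>gR57r' 'GCAGGAAGGAGAAAGACGCGGGG' '>gR97r' 'ATCTGTCAACAGATAGACACCGG' > "$fwd_fa"
printf '%s\n' '>gR57r_rc' 'CCCCGCGTCTTTCTCCTTCCTGC' '>gR97r_rc' 'CCGGTGTCTATCTGTTGACAGAT' > "$rc_fa"
printf '%s\n' \
  'AAGCAGGAAGGAGAAAGACGCGGGGTT' \
  'ttccggtgtctatctgttgacagataa' \
  'GCAGGAAGGAGAAAGACGCGGGGCCGGTGTCTATCTGTTGACAGAT' \
  'ACGTACGTACGTACGTACGTACGTACGT' > "$regions"

orientation_counts "$fwd_fa" "$rc_fa" "$regions"
# forward=2 reverse=2 either=3 both=1 neither=1
```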



### 3.6. Get new gRNA coordinates

We have now obtained the coordinates of the target regions, which are about 50 bp long, but it would also be useful to get the exact coordinates of each gRNA, which can be anywhere within those 50 bp.

An alternative was to get the exact gRNA coordinates instead of the target regions. The idea was to trim the BLAT results, but it seems too complex.

Brief recap of a BLAT of the gRNAs against the C3H genome:

```
# BLAT
sufix=$1 ## The original file is split into chunks to speed up the alignment
blat -t=dna -q=dna -tileSize=12 -stepSize=6 -minMatch=1 -minScore=11 -minIdentity=100 -noHead -out=psl ../Reference_Genomes/GCA_001632575.1_C3H_HeJ_v1.fa /scratch/lab_mguell/projects/shared_data/Muscle-editing_library/Previous_data/mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.$sufix.fasta gRNA-coordinates_C3H_$sufix.psl

# Join all psl files in one
cat gRNA-coordinates_C3H_00.psl >> Coordinates_def/gRNA-coordinates_C3H.psl
[...]
cat gRNA-coordinates_C3H_09.psl >> Coordinates_def/gRNA-coordinates_C3H.psl
```

And we obtained a .psl file with multiple results for each gRNA, like:

```
23	0	0	0	0	0	0	0	+	ENSMUSG00000062380_gR176f	23	0	23	chr8	131879273	126005127	126005150	1	23,	0,	126005127,
23	0	0	0	0	0	0	0	+	ENSMUSG00000006457_gR105r	23	0	23	chr19	61067434	1744704	1744727	1	23,	0,	1744704,
23	0	0	0	0	0	0	0	+	ENSMUSG00000043639_gR113f	23	0	23	chr7	150256143	29033057	29033080	1	23,	0,	29033057,
23	0	0	0	0	0	0	0	+	ENSMUSG00000043639_gR113f	23	0	23	chr4	158643861	22301174	22301197	1	23,	0,	22301174,
23	0	0	0	0	0	0	0	+	ENSMUSG00000043639_gR113f	23	0	23	chr2	186646354	28627212	28627235	1	23,	0,	28627212,
```
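
To see which gRNAs map to more than one place, the hits can be counted per query name (column 10 of the PSL file). A minimal sketch on the five lines shown above; `hits_per_query` is a hypothetical helper name.

```bash
# The five PSL lines shown above, with fields joined by tabs.
psl_hits=$(mktemp)
printf '%s\n' \
  "23 0 0 0 0 0 0 0 + ENSMUSG00000062380_gR176f 23 0 23 chr8 131879273 126005127 126005150 1 23, 0, 126005127," \
  "23 0 0 0 0 0 0 0 + ENSMUSG00000006457_gR105r 23 0 23 chr19 61067434 1744704 1744727 1 23, 0, 1744704," \
  "23 0 0 0 0 0 0 0 + ENSMUSG00000043639_gR113f 23 0 23 chr7 150256143 29033057 29033080 1 23, 0, 29033057," \
  "23 0 0 0 0 0 0 0 + ENSMUSG00000043639_gR113f 23 0 23 chr4 158643861 22301174 22301197 1 23, 0, 22301174," \
  "23 0 0 0 0 0 0 0 + ENSMUSG00000043639_gR113f 23 0 23 chr2 186646354 28627212 28627235 1 23, 0, 28627212," \
  | tr ' ' '\t' > "$psl_hits"

# hits_per_query FILE: number of hits per query name, most frequent first.
hits_per_query() {
    awk -F'\t' '{ n[$10]++ } END { for (q in n) print n[q] "\t" q }' "$1" \
        | sort -k1,1nr -k2,2
}

hits_per_query "$psl_hits"
```

Here ENSMUSG00000043639_gR113f is reported with 3 hits and the other two gRNAs with 1 hit each.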

We then converted the `.xls` file into a `.csv` to browse it. Some targets appear more than once and some were not identified in the new C3H genome.

```
$ cut -d";" -f8 report_target-sequences_mm10.csv | sort | uniq -c | sort -nr | head
      7 8805489
      7 8805385
      6 8805433
      2 75361545
      2 61066368
      2 53730598
      2 3281198
      1 source_start
      1 99962760
      1 99962523
```


## Second attempt

Now we use the correct coordinates and create the final files.

The first conversion used the "target coordinates", which are +/- 500bp of the gRNA cut site, so it was very tricky to identify which gRNA belongs to each cut site and to get the final coordinates.

To improve on that, we now have the coordinates of the gRNAs themselves (not the target regions).

### 3.1. Setup

Work on the local computer, inside the synbio folder in a newly created folder `ConversionC3Hv2` (C:\Users\Usuario\Documents\02.TransSynBio\BioinfoThings\coordinates\ConversionC3Hv2).

The gRNA coordinates are mm10_gRNA-coordinates.bed. 

### 3.2. Decrease BED coordinates by 1 nt

When extracting the sequence using bedtools, the coordinates have to be decreased by one unit. This is probably due to a 0- versus 1-based indexing mismatch (the R script that generated the BED file may be 1-indexed while bedtools is 0-indexed, or something like that).

The first step is moving the coordinates one nucleotide down so that bedtools can be used to get their sequence. This is done using `awk`. We keep a copy, `_orig`, with the coordinates that Marta sent me, but it is not used anymore.

```bash
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3Hv2$
$ cp mm10_gRNA-coordinates.bed mm10_gRNA-coordinates_orig.bed
$ awk -v FS='\t' -v OFS='\t' '{print $1, $2-1, $3-1, $4, $5, $6}' mm10_gRNA-coordinates_orig.bed >mm10_gRNA-coordinates.bed
$ head mm10_gRNA-coordinates*
==> mm10_gRNA-coordinates.bed <==
chr5    73647260        73647283        ENSMUSG00000029156_gR308f       0       -
chr5    73647167        73647190        ENSMUSG00000029156_gR401f       0       -
chr5    73647109        73647132        ENSMUSG00000029156_gR449r       0       +
chr7    30464328        30464351        ENSMUSG00000006649_gR334r       0       -
chr7    30464266        30464289        ENSMUSG00000006649_gR282f       0       +
chr7    30464315        30464338        ENSMUSG00000006649_gR321r       0       -
chr16   65550826        65550849        ENSMUSG00000004843_gR30f        0       -
chr16   65550617        65550640        ENSMUSG00000004843_gR239f       0       -
chr10   67538887        67538910        ENSMUSG00000037868_gR406r       0       -
chr10   67538697        67538720        ENSMUSG00000037868_gR226f       0       +

==> mm10_gRNA-coordinates_orig.bed <==
chr5    73647261        73647284        ENSMUSG00000029156_gR308f       0       -
chr5    73647168        73647191        ENSMUSG00000029156_gR401f       0       -
chr5    73647110        73647133        ENSMUSG00000029156_gR449r       0       +
chr7    30464329        30464352        ENSMUSG00000006649_gR334r       0       -
chr7    30464267        30464290        ENSMUSG00000006649_gR282f       0       +
chr7    30464316        30464339        ENSMUSG00000006649_gR321r       0       -
chr16   65550827        65550850        ENSMUSG00000004843_gR30f        0       -
chr16   65550618        65550641        ENSMUSG00000004843_gR239f       0       -
chr10   67538888        67538911        ENSMUSG00000037868_gR406r       0       -
chr10   67538698        67538721        ENSMUSG00000037868_gR226f       0       +
```
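
A quick sanity check is that every interval still spans 23 nt (the gRNA length including the PAM) after the shift. The sketch below repeats the shift on two of the original lines and counts intervals of unexpected length; `shift_bed` and `count_wrong_length` are hypothetical helper names.

```bash
# Two lines from mm10_gRNA-coordinates_orig.bed, as shown above.
bed=$(mktemp)
printf 'chr5\t73647261\t73647284\tENSMUSG00000029156_gR308f\t0\t-\n'  > "$bed"
printf 'chr5\t73647168\t73647191\tENSMUSG00000029156_gR401f\t0\t-\n' >> "$bed"

# shift_bed FILE N: move start and end N positions down.
shift_bed() {
    awk -F'\t' -v OFS='\t' -v by="$2" '{ $2 -= by; $3 -= by; print }' "$1"
}

# count_wrong_length FILE LEN: number of intervals whose end - start is not LEN.
count_wrong_length() {
    awk -F'\t' -v len="$2" '$3 - $2 != len { n++ } END { print n + 0 }' "$1"
}

shift_bed "$bed" 1
count_wrong_length "$bed" 23     # 0 means every interval is 23 nt long
```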


### 3.2.2. Check initial library

First, we check whether extracting the sequences from the mm10 genome with these coordinates yields the library of gRNAs.

To get the sequences from the coordinates, we adapt the previous script:

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat get-sequenceFromCord.sh
#!/bin/bash

#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marc.exposit@upf.edu
#SBATCH --mem=60000

#SBATCH -e stderr_filt_%j.err
#SBATCH -o stdout_filt_%j.out

# Getting variables


# Load modules
module load BEDTools/2.27.1-foss-2018b

# BEDtools
bedtools getfasta -fi ../../Reference_Genomes/mm10.fa -bed mm10_gRNA-coordinates.bed -fo mm10gRNAcoord_Seq.bed -tab

# After running...
[mexposit@mr-login Coordinates_MarcExp]$ head mm10gRNAcoord_Seq.bed
chr5:73647261-73647284  CACAGTGGGCGGGGAATGGCAGA
chr5:73647168-73647191  CAGGAAACACTTGATAACGCATT
chr5:73647110-73647133  GTTACAGCTTGATATCTGAAGGG
chr7:30464329-30464352  CACCCCCTCAGAACGTGGAAACA
chr7:30464267-30464290  gagatcacccgactctgcctgga
chr7:30464316-30464339  CTCTTGTCCTCTCCACCCCCTCA
chr16:65550827-65550850 CGCAAAGGACTGAACGCTGCTTC
chr16:65550618-65550641 CACCTAGGAGCTATGATATTATT
chr10:67538888-67538911 CCCAAAGTGAACAGGGTTAACAA
chr10:67538698-67538721 GGGCTAGGGTCAAAGAGATGGGA
```

The file is copied to the local folder and manually converted to a real `bed` file, with the sequences converted to uppercase.

```bash
$ sed 's/:/\t/g' <mm10gRNAcoord_Seq.bed >process1.txt
$ sed 's/-/\t/g' <process1.txt >process2.txt
$ awk -v FS='\t' -v OFS='\t' '$4 = toupper($4)' process2.txt >mm10gRNAcoord_Seq.bed
$ rm process*.txt
$ head mm10gRNAcoord_Seq.bed
chr5    73647260        73647283        CCACAGTGGGCGGGGAATGGCAG
chr5    73647167        73647190        CCAGGAAACACTTGATAACGCAT
chr5    73647109        73647132        AGTTACAGCTTGATATCTGAAGG
chr7    30464328        30464351        CCACCCCCTCAGAACGTGGAAAC
chr7    30464266        30464289        AGAGATCACCCGACTCTGCCTGG
chr7    30464315        30464338        CCTCTTGTCCTCTCCACCCCCTC
chr16   65550826        65550849        CCGCAAAGGACTGAACGCTGCTT
chr16   65550617        65550640        CCACCTAGGAGCTATGATATTAT
chr10   67538887        67538910        CCCCAAAGTGAACAGGGTTAACA
chr10   67538697        67538720        TGGGCTAGGGTCAAAGAGATGGG
```
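The three-step `sed`/`awk` conversion can also be written as a single `awk` pass, which avoids the intermediate files. This is only a minimal sketch, assuming the two-column `chr:start-end<TAB>sequence` layout that `bedtools getfasta -tab` produced above; the temporary file and the variable names are illustrative and not part of the original workflow:

```bash
# Illustrative input: two lines in the `chr:start-end<TAB>sequence` layout shown above
tab_file=$(mktemp)
printf 'chr5:73647261-73647284\tCACAGTGGGCGGGGAATGGCAGA\n'  >"$tab_file"
printf 'chr7:30464267-30464290\tgagatcacccgactctgcctgga\n' >>"$tab_file"

# Split the first field on ':' and '-' and uppercase the sequence in one pass
bed_lines=$(awk -F'\t' -v OFS='\t' '{
    split($1, loc, /[:-]/)      # loc[1]=chrom, loc[2]=start, loc[3]=end
    print loc[1], loc[2], loc[3], toupper($2)
}' "$tab_file")

printf '%s\n' "$bed_lines"
rm -f "$tab_file"
```

Only the first field is split, so the sequence column is never altered apart from the change of case.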

Use grep to search for the library of gRNAs in this file. No extracted sequence fails to match a gRNA in either orientation (forward or reverse complement), so all gRNAs are identified.

```bash
$ grep -i -v -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta mm10gRNAcoord_Seq.bed | grep -i -v -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta | wc -l
0
```

In total, 945 lines match a gRNA as written in the library and 842 match a reverse complement. That adds up to 1787 matches, two more than the 1785 lines in the file.

```bash
$ grep -i -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta mm10gRNAcoord_Seq.bed | wc -l
945
$ grep -i -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta mm10gRNAcoord_Seq.bed | wc -l
842
```

The two extra matches come from one sequence that is found in both orientations (it is listed on two lines).

```bash
$ grep -i -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta mm10gRNAcoord_Seq.bed | grep -i -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta
chr4    53730614        53730637        CCTAGCATGCAGGAAGCCCTGGG
chr4    53730614        53730637        CCTAGCATGCAGGAAGCCCTGGG
```

Looking up this sequence shows two gRNAs that are reverse complements of each other: `ENSMUSG00000028414_gR493f` (`CCTAGCATGCAGGAAGCCCTGGG`) and `ENSMUSG00000028414_gR483r` (`CCCAGGGCTTCCTGCATGCTAGG`). They have the same coordinates and cut at different but overlapping sites of the genome, one on each strand.
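The relationship between the two guides can be checked directly. A small sketch, where `revcomp` is a hypothetical helper (not part of the original workflow) that reverses a sequence and complements each base:

```bash
# Hypothetical helper: reverse-complement the DNA sequence given as $1
revcomp() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
        print out
    }' | tr 'ACGTacgt' 'TGCAtgca'
}

fwd=CCTAGCATGCAGGAAGCCCTGGG   # ENSMUSG00000028414_gR493f
rev=CCCAGGGCTTCCTGCATGCTAGG   # ENSMUSG00000028414_gR483r

if [ "$(revcomp "$fwd")" = "$rev" ]; then
    echo "gR483r is the reverse complement of gR493f"
fi
```

Since both guides occupy the same 23 nt interval, any tool that looks only at coordinates will see them as a single feature.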


### 3.3. Remap

Using `NCBI Remap`, the target coordinates are converted from the `GRCm38(mm10)` genome to the `C3H_HeJ_v1` genome.

We use the `Download full mapping report` option, which creates the file `report_mm10_gRNA-coordinates.bed.xls`. The `Download Annotation Data` option may also be worth a look.

#### 3.3.1. Refining the output

We explore the results using the .tsv format. Below is a summary of the modifications made to the file throughout this section to get the final file:

- To work with the data: `report_mm10_gRNA-coordinates.bed.xls` -> `report_mm10_gRNA-coordinates.tsv`
- Delete a duplicated gRNA entry that was repeated with a different ID (`__0`). The number of gRNAs is unchanged and we go from 1803 to 1802 lines: `report_mm10_gRNA-coordinates.tsv` -> `C3Hv2_preprocess_v1.tsv`
- Delete duplicated entries of gRNAs mapped multiple times: `C3Hv2_preprocess_v1.tsv` -> `C3Hv2_preprocess_v2.tsv`. Since we can't identify a single region that these targets match, we keep only one of the mapped regions, even if we don't know whether it is the correct one, so that at least we retain the gRNAs. Three gRNAs are affected: 2 appeared 7 times and 1 appeared 6 times. Since we kept one set of coordinates for each, the file is reduced by 2*(7-1)+1*(6-1)=17 lines, so we go from 1802 to 1785 lines.
- Notice that 1785 lines means 1784 entries plus the header, whereas we had 1785 gRNAs. Remap only processed 1784 because one gRNA has the same start and end coordinates as another but the opposite orientation, so Remap looked only at the coordinates and treated it as a duplicate. This gRNA is added back at the end, when matching coordinates with gRNAs.
- **Delete 30 gRNAs** that were not found in the C3H genome: `C3Hv2_preprocess_v2.tsv` -> `C3Hv2_preprocess_v3.tsv`. From 1785 to 1755 lines.
- Delete 20 gRNAs with coverage different from 1 (these could be corrected, but doing it manually would take too long). The header is also deleted, so the file shrinks by 21 lines: `C3Hv2_preprocess_v3.tsv` -> `C3Hv2_preprocess_v4.tsv`. From 1755 to 1734 lines.
- We end up with 1734 coordinates corresponding to 1735 gRNA targets (one of the coordinates has two gRNAs). In total, only 50 gRNAs were deleted.

```bash
$ wc -l report_mm10_gRNA-coordinates.tsv
1803 report_mm10_gRNA-coordinates.tsv
$ wc -l C3Hv2_preprocess_v*
  1802 C3Hv2_preprocess_v1.tsv
  1785 C3Hv2_preprocess_v2.tsv
  1755 C3Hv2_preprocess_v3.tsv
  1734 C3Hv2_preprocess_v4.tsv
  7076 total
```
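The line counts can be reproduced with plain shell arithmetic, which is a quick way to confirm that each step removed exactly what the summary says. The variable names are only labels for this sketch:

```bash
report=1803                                   # lines in report_mm10_gRNA-coordinates.tsv
v1=$((report - 1))                            # drop the single `__0` entry
v2=$((v1 - (2 * (7 - 1) + 1 * (6 - 1))))      # keep one hit per multi-mapped gRNA
v3=$((v2 - 30))                               # drop the 30 unmapped gRNAs
v4=$((v3 - 20 - 1))                           # drop 20 gRNAs with coverage != 1, plus the header
echo "$v1 $v2 $v3 $v4"                        # prints: 1802 1785 1755 1734
```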

#### 3.3.2. List of modified gRNAs

Because not every gRNA could be located in the C3H genome, some of the original gRNAs are missing from, or were modified in, the final C3H set. This is the list of affected gRNAs and the reason for each:

```bash
# some gRNAs appear multiple times in the C3H genome and we don't know the exact location (we selected one of the positions randomly, so we don't know the exact location of this coordinates, but they have been included)
#gRNA code        #original bed position
chrX_8805506    chrX    8805506 8805528
chrX_8805392    chrX    8805392 8805414
chrX_8805440    chrX    8805440 8805462

# delete 30 gRNAs which were not identified in the C3H genome
chrX_86193175   chrX    86193175        86193197
chr14_73561068  chr14   73561068        73561090
chr14_73561101  chr14   73561101        73561123
chr14_73561029  chr14   73561029        73561051
chrX_8413607    chrX    8413607 8413629
chrX_8413989    chrX    8413989 8414011
chr12_16580442  chr12   16580442        16580464
chrX_8591298    chrX    8591298 8591320
chrX_8591194    chrX    8591194 8591216
chrX_8591603    chrX    8591603 8591625
chr7_4519915    chr7    4519915 4519937
chr7_4519884    chr7    4519884 4519906
chrX_8750181    chrX    8750181 8750203
chrX_8750188    chrX    8750188 8750210
chrX_8750163    chrX    8750163 8750185
chrX_8522850    chrX    8522850 8522872
chrX_8522907    chrX    8522907 8522929
chr1_68739930   chr1    68739930        68739952
chrX_8367662    chrX    8367662 8367684
chrX_8367616    chrX    8367616 8367638
chrX_8367619    chrX    8367619 8367641
chr4_19708599   chr4    19708599        19708621
chr2_119662446  chr2    119662446       119662468
chr19_3852857   chr19   3852857 3852879
chr19_3852897   chr19   3852897 3852919
chr17_54298067  chr17   54298067        54298089
chr3_32616215   chr3    32616215        32616237
chr12_106037254 chr12   106037254       106037276
chrX_8876326    chrX    8876326 8876348
chrX_8876475    chrX    8876475 8876497 

# delete 20 gRNAs with coverage not equal to 1
chr4_147902815  chr4    147902815       147902837
chr4_99962384   chr4    99962384        99962406
chr10_76613836  chr10   76613836        76613858
chr9_40270433   chr9    40270433        40270455
chr2_152331772  chr2    152331772       152331794
chr11_101071651 chr11   101071651       101071673
chr14_51198676  chr14   51198676        51198698
chr14_51198694  chr14   51198694        51198716
chr9_21457972   chr9    21457972        21457994
chr1_172166289  chr1    172166289       172166311
chr2_127012255  chr2    127012255       127012277
chr2_118598667  chr2    118598667       118598689
chr4_135760135  chr4    135760135       135760157
chr15_7397706   chr15   7397706 7397728
chrX_8523175    chrX    8523175 8523197
chr17_28233406  chr17   28233406        28233428
chr12_111713713 chr12   111713713       111713735
chr3_32615904   chr3    32615904        32615926
chr6_4746405    chr6    4746405 4746427
chr1_90829745   chr1    90829745        90829767
```

#### 3.3.3. Targets found multiple times

There are targets which were found multiple times on the genome. Some of them were assigned the same ID and found multiple times, while some others were assigned different IDs.

##### Multiple times, different ID

Counting the IDs, we get two more than the number of unique gRNAs. This is because targets that were found twice were assigned a different ID, which is why they do not appear in the search above.

```bash
$ cat mm10_gRNA-coordinates.bed | wc -l
1785
$ cut -f1 report_mm10_gRNA-coordinates.tsv | uniq -c | grep __
      1 chr4_53730615__0

$ grep chr4_53730615 report_mm10_gRNA-coordinates.tsv
chr4_53730615   1       1       chr4    CM004238.1      23      23      53730615        53730637        +  53730615     53730637 52288078        52288100        +       1.00000 First Pass      Primary Assembly
chr4_53730615__0        1       1       chr4    CM004238.1      23      23      53730615        53730637    53730615 53730637        52288078        52288100        -       1.00000 First Pass      Primary Assembly
```

This repeated entry has the same coordinates, so we delete the one with the `__0` suffix.

Such entries can be deleted directly from the data. We do so with `grep -v "__"`, which excludes the repeated IDs. This creates the first preprocessed file, without the entries that were found multiple times under a different ID (data that was simply repeated): `report_mm10_gRNA-coordinates.tsv` -> `C3Hv2_preprocess_v1.tsv`. The line counts show that only one entry has been deleted.

```bash
$ grep -v "__" report_mm10_gRNA-coordinates.tsv > C3Hv2_preprocess_v1.tsv
$ wc -l C3Hv2_preprocess_v1.tsv
1802 C3Hv2_preprocess_v1.tsv
$ wc -l report_mm10_gRNA-coordinates.tsv
1803 report_mm10_gRNA-coordinates.tsv
```
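Note that `grep -v "__"` drops any line containing a double underscore anywhere. A slightly stricter sketch, assuming the duplicate marker always has the form `__<number>` at the end of the first column (as in `chr4_53730615__0`); the sample rows are made up and reduced to two columns:

```bash
sample=$(mktemp)
printf 'chr4_53730615\t1\n'     >"$sample"
printf 'chr4_53730615__0\t1\n' >>"$sample"
printf 'chrX_98151816\t1\n'    >>"$sample"

# Keep only rows whose first column does not end in __<digits>
kept=$(awk -F'\t' '$1 !~ /__[0-9]+$/' "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

On the real report both approaches removed a single line, so the difference only matters if another column ever contains `__`.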

##### Multiple times, same ID

Counting the occurrences of each ID, we see that one target was found 6 times and two targets were found 7 times. The rest appear only once in the genome (apart from those found again under a different ID).

```bash
$ cut -f1 report_mm10_gRNA-coordinates.tsv | uniq -c | sort -nk1 | tail -n 4
      1 chrX_98151816
      6 chrX_8805440
      7 chrX_8805392
      7 chrX_8805506
```
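`uniq -c` only merges adjacent identical lines, so the count above relies on the report listing all hits of one ID consecutively. A sketch that does not depend on the row order, shown on a made-up list of IDs:

```bash
ids=$(mktemp)
printf '%s\n' chrX_8805440 chrX_98151816 chrX_8805440 chrX_8805392 chrX_8805440 >"$ids"

# Sort first so that identical IDs are adjacent, then report IDs seen more than once
repeated=$(sort "$ids" | uniq -c | awk '$1 > 1 {print $2 "\t" $1}')
printf '%s\n' "$repeated"
rm -f "$ids"
```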

The same thing happened in the previous attempt with the target-region coordinates (the same gRNAs, at the same positions). That time we deleted all three. Now it could be interesting to look at their coverage in the sequencing data to see which one is the real site.

First, get the coordinates to which one of the gRNAs has been mapped.

```bash
$ grep chrX_8805440 C3Hv2_preprocess_v1.tsv | cut -f1,13,14
chrX_8805440    3116038 3116060
chrX_8805440    3169143 3169165
chrX_8805440    3580302 3580324
chrX_8805440    3521450 3521472
chrX_8805440    3338497 3338519
chrX_8805440    3447374 3447396
```

Now, use `samtools mpileup` on the Marvin server to check the coverage. We do it using the first sample (`Cas9P`), but we could use any other one.

```bash
[mexposit@mr-login Coverage]$ cat pileup_indivCoord.sh
#!/bin/bash
#SBATCH -p normal
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu 1000
#SBATCH --time=10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marc.exposit@upf.edu


#SBATCH -e stderr_filt_%j.err
#SBATCH -o stdout_filt_%j.out

### LOAD INITIAL MODULES ###
module load SAMtools/1.9-foss-2016b


### SAVE VARIABLES ###
GenomeFile=$1
CoordFile=$2
OutFile=$3

### PIPELINE ###

samtools mpileup  -f ../Reference_Genomes/GCA_001632575.1_C3H_HeJ_v1.fa -l $CoordFile $GenomeFile -o $OutFile -a

#### Command:
#### execute inside /Coverage/ folder
#### sbatch pileup_prova.sh <genomeFile.bam> <coordfile.bed> <outputFile.pileup>

[mexposit@mr-login Coverage]$ sbatch pileup_indivCoord.sh ../Alignments_C3H/1-Cas9-pef-lib_C3H_aln.sorted.bam repeatedC3H.bed repeatedC3H_1.pileup
```
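Once the job finishes, the per-site coverage can be summarised from the fourth column of the `.pileup` file (the columns are chromosome, 1-based position, reference base, depth, read bases and qualities). A sketch, where `mean_depth` is a hypothetical helper and the pileup lines are invented for illustration:

```bash
# Hypothetical helper: mean depth over positions start..stop (1-based, inclusive)
# usage: mean_depth <pileup> <chrom> <start> <stop>
mean_depth() {
    awk -F'\t' -v chrom="$2" -v start="$3" -v stop="$4" '
        $1 == chrom && $2 >= start && $2 <= stop { sum += $4; n++ }
        END { if (n > 0) printf "%.1f\n", sum / n; else print "NA" }
    ' "$1"
}

# Invented pileup lines; bases and qualities are placeholders
pileup=$(mktemp)
printf 'chrX\t100\tA\t60\t.\tI\n'  >"$pileup"
printf 'chrX\t101\tC\t80\t.\tI\n' >>"$pileup"
printf 'chrX\t200\tG\t13\t.\tI\n' >>"$pileup"

depth=$(mean_depth "$pileup" chrX 100 101)
echo "$depth"
rm -f "$pileup"
```

Running the helper once per candidate interval gives one number per site, which makes the candidates easy to compare.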

The resulting .pileup file shows that for `chrX_8805440` all candidate sites have similar coverage, probably because more than one gRNA is targeting that area. They all match the gRNA sequence perfectly. Hence, we select only one of the matches (so we don't know the exact site).

The gRNA `chrX_8805392` has one matching site with very low coverage (13x), so we avoid selecting the coordinates `chrX 3072310 3072332`. Remarkably, the coordinates `chrX 3521517 3521538` have about 80x coverage.

For the other gRNA, `chrX_8805506`, we see the same results as with `chrX_8805392` (they must be very similar).

The affected gRNA are:

```bash
chrX_8805506    chrX    8805506 8805528
chrX_8805392    chrX    8805392 8805414
chrX_8805440    chrX    8805440 8805462
```

We look at the original coordinates and mapped coordinates for each of the gRNAs that were repeated.

```bash
$ grep -e chrX_8805440 -e chrX_8805506 -e chrX_8805392 report_mm10_gRNA-coordinates.tsv | cut -f1,4,8,9,10,13,14,15,16
chrX_8805392    chrX    8805392 8805414 +       3115990 3116012 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3169095 3169117 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3580254 3580276 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3521402 3521424 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3070574 3070596 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3338545 3338567 -       1.00000
chrX_8805392    chrX    8805392 8805414 +       3447422 3447444 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3338431 3338453 +       1.00000
chrX_8805506    chrX    8805506 8805528 -       3447308 3447330 +       1.00000
chrX_8805506    chrX    8805506 8805528 -       3116104 3116126 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3169209 3169231 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3580368 3580390 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3521516 3521538 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3072310 3072332 -       1.00000
chrX_8805440    chrX    8805440 8805462 +       3116038 3116060 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3169143 3169165 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3580302 3580324 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3521450 3521472 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3338497 3338519 -       1.00000
chrX_8805440    chrX    8805440 8805462 +       3447374 3447396 -       1.00000
```

We delete the mapped coordinates that are on the opposite strand to the original one (if the original is +, the mapped one must be + as well). Then we discard the region with very low coverage (13x), which is `chrX 3072310 3072332`.

```bash
$ grep -e chrX_8805440 -e chrX_8805506 -e chrX_8805392 report_mm10_gRNA-coordinates.tsv | cut -f1,4,8,9,10,13,14,15,16
chrX_8805392    chrX    8805392 8805414 +       3115990 3116012 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3169095 3169117 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3580254 3580276 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3521402 3521424 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3070574 3070596 +       1.00000
chrX_8805506    chrX    8805506 8805528 -       3116104 3116126 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3169209 3169231 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3580368 3580390 -       1.00000
chrX_8805506    chrX    8805506 8805528 -       3521516 3521538 -       1.00000
chrX_8805440    chrX    8805440 8805462 +       3116038 3116060 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3169143 3169165 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3580302 3580324 +       1.00000
chrX_8805440    chrX    8805440 8805462 +       3521450 3521472 +       1.00000
```
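The two rules (keep hits on the same strand as the source, drop the low-coverage hit starting at 3072310) can be expressed as an `awk` filter on the nine columns printed by `cut`. A sketch on four of the rows shown above; the temporary file is only there to make the example self-contained:

```bash
hits=$(mktemp)
# Columns: ID, chrom, source start, source stop, source strand,
#          mapped start, mapped stop, mapped strand, coverage
printf 'chrX_8805392\tchrX\t8805392\t8805414\t+\t3521402\t3521424\t+\t1.00000\n'  >"$hits"
printf 'chrX_8805392\tchrX\t8805392\t8805414\t+\t3338545\t3338567\t-\t1.00000\n' >>"$hits"
printf 'chrX_8805506\tchrX\t8805506\t8805528\t-\t3521516\t3521538\t-\t1.00000\n' >>"$hits"
printf 'chrX_8805506\tchrX\t8805506\t8805528\t-\t3072310\t3072332\t-\t1.00000\n' >>"$hits"

# Keep same-strand hits, except the low-coverage one starting at 3072310
same_strand=$(awk -F'\t' '$5 == $8 && $6 != 3072310' "$hits")
printf '%s\n' "$same_strand"
rm -f "$hits"
```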

Now, we arbitrarily pick regions. We saw higher coverage in `chrX 3521517 3521538` (80x) than in the rest of the areas (60x). Since target regions have higher coverage, there is a higher chance that this is the correct place. Hence, we select this region for the gRNA that maps to it, which is `chrX_8805506`. The coordinates of `chrX_8805440` must be slightly lower than those of `chrX_8805506`, exactly 66 nt lower. We find the region `3521450 3521472`, which is exactly 66 nt below the position found for `chrX_8805506`, so it might be correct! Finally, `chrX_8805392` lies 48 positions below `chrX_8805440`, so we retain the coordinates for `chrX_8805392` that are 48 positions below those of `chrX_8805440`. The final selected coordinates for these gRNAs are:

```bash
chrX_8805506    chrX    8805506 8805528 -       3521516 3521538 -       1.00000
chrX_8805440    chrX    8805440 8805462 +       3521450 3521472 +       1.00000
chrX_8805392    chrX    8805392 8805414 +       3521402 3521424 +       1.00000
```
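The spacing argument can be verified with shell arithmetic on the start coordinates of the three selected rows. The variable names are just labels for this sketch:

```bash
# Source (mm10) and mapped (C3H) start coordinates of the three guides
src_506=8805506; map_506=3521516
src_440=8805440; map_440=3521450
src_392=8805392; map_392=3521402

echo "506 vs 440: mm10 $((src_506 - src_440)) nt, C3H $((map_506 - map_440)) nt"
echo "440 vs 392: mm10 $((src_440 - src_392)) nt, C3H $((map_440 - map_392)) nt"
```

Both offsets (66 nt and 48 nt) are identical in the two genomes, so the selected C3H positions preserve the relative layout of the guides in mm10.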

**Conclusion**: 3 gRNAs (ENSMUSG00000071816_gR28f, ENSMUSG00000071816_gR132r, and ENSMUSG00000071816_gR76f) are in a very repetitive region of `chrX`, so the remap is not exact. We had previously studied that region and saw that even the 1 kb next to each possible position is repetitive. Since the 3 gRNAs are very close to each other, we select only one of those regions. The selection was nearly arbitrary: looking at the coverage, we identified one region with higher coverage (~80x) than the rest (~60x) and selected it as the target of our guides. Then we selected the coordinates so that the three gRNAs keep their original spacing and lie in the same region.

We manually remove the repetitions to keep only the coordinates of interest: `C3Hv2_preprocess_v1.tsv` -> `C3Hv2_preprocess_v2.tsv`.

```bash
$ cp C3Hv2_preprocess_v1.tsv C3Hv2_preprocess_v2.tsv
$ nano C3Hv2_preprocess_v2.tsv
$ grep -e chrX_8805440 -e chrX_8805506 -e chrX_8805392 C3Hv2_preprocess_v2.tsv | cut -f1,4,8,9,10,13,14,15,17
chrX_8805392    chrX    8805392 8805414 +       3521402 3521424 +       Second Pass
chrX_8805506    chrX    8805506 8805528 -       3521516 3521538 -       Second Pass
chrX_8805440    chrX    8805440 8805462 +       3521450 3521472 +       Second Pass
```

#### 3.3.4. Missing targets

There are 30 gRNAs which have not been identified on the C3H genome. These are just deleted.

```bash
$ grep NOMAP C3Hv2_preprocess_v2.tsv | wc -l
30

$ grep NOMAP C3Hv2_preprocess_v2.tsv | cut -f1,4,8,9
chrX_86193175   chrX    86193175        86193197
chr14_73561068  chr14   73561068        73561090
chr14_73561101  chr14   73561101        73561123
chr14_73561029  chr14   73561029        73561051
chrX_8413607    chrX    8413607 8413629
chrX_8413989    chrX    8413989 8414011
chr12_16580442  chr12   16580442        16580464
chrX_8591298    chrX    8591298 8591320
chrX_8591194    chrX    8591194 8591216
chrX_8591603    chrX    8591603 8591625
chr7_4519915    chr7    4519915 4519937
chr7_4519884    chr7    4519884 4519906
chrX_8750181    chrX    8750181 8750203
chrX_8750188    chrX    8750188 8750210
chrX_8750163    chrX    8750163 8750185
chrX_8522850    chrX    8522850 8522872
chrX_8522907    chrX    8522907 8522929
chr1_68739930   chr1    68739930        68739952
chrX_8367662    chrX    8367662 8367684
chrX_8367616    chrX    8367616 8367638
chrX_8367619    chrX    8367619 8367641
chr4_19708599   chr4    19708599        19708621
chr2_119662446  chr2    119662446       119662468
chr19_3852857   chr19   3852857 3852879
chr19_3852897   chr19   3852897 3852919
chr17_54298067  chr17   54298067        54298089
chr3_32616215   chr3    32616215        32616237
chr12_106037254 chr12   106037254       106037276
chrX_8876326    chrX    8876326 8876348
chrX_8876475    chrX    8876475 8876497 
```

Mapping them using BLAT was not possible.

Hence, we eliminate all the missing coordinates from the file.

**TODO:** Looking at the coverage of regions that are not in the target subset (after eliminating the missing or repeated ones) might reveal the locations of targeted regions that were not identified by BLAST or similar tools.

To eliminate all the missing coordinates from the file, we use an inverted grep (`grep -v`) on the `NULL` tag that all missing sequences have. The final file contains 30 lines fewer, corresponding to the 30 target regions eliminated.

```bash
$ grep -v NULL C3Hv2_preprocess_v2.tsv > C3Hv2_preprocess_v3.tsv
$ wc -l C3Hv2_preprocess_v*
  1802 C3Hv2_preprocess_v1.tsv
  1785 C3Hv2_preprocess_v2.tsv
  1755 C3Hv2_preprocess_v3.tsv
```

#### 3.3.5. Targets with coverage different than 1

Looking at the coverage, we see that some targets do not match completely and some match over a longer stretch than the source. The 30 unmapped entries were already removed in the previous step. There are 1734 entries with perfect coverage, and the rest have a coverage different from 1. Those with coverage < 1 mapped to a region shorter than the source interval, while those with coverage > 1 mapped to a longer one. These are the target sites with indels or mutations between the mm10 and C3H genomes.

```bash
$ cut -f16 C3Hv2_preprocess_v3.tsv | sort -n | uniq -c
      1 coverage
      2 0.56522
      1 0.69565
      1 0.78261
      1 0.82609
      1 0.86957
      1 0.91304
      3 0.95652
   1734 1.00000
      5 1.04348
      2 1.08696
      1 1.13043
      2 1.17391
```
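The coverage values are all multiples of 1/23, which suggests (this is an interpretation, not something stated in the report excerpt shown here) that coverage is the mapped length divided by the 23 nt source length. Under that assumption, the mapped length can be recovered by multiplying by 23 and rounding:

```bash
# Convert a few of the coverage values listed above into mapped lengths,
# assuming coverage = mapped length / 23
lengths=$(printf '%s\n' 0.56522 0.95652 1.00000 1.04348 1.17391 |
    awk '{ printf "%s\t%d\n", $1, $1 * 23 + 0.5 }')
printf '%s\n' "$lengths"
```

Read this way, a coverage of 0.56522 would correspond to only 13 of the 23 nt being mapped, and 1.17391 to a 27 nt mapped region.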

It would be nice to find a way to see the mutations at all of these coordinates. However, for now, the easiest solution is to eliminate them. There are 20 gRNAs whose coverage is not equal to 1. Note that the count below is 21 because the header line (`coverage`) is also included.

```bash
$ awk '$16 != 1.0' C3Hv2_preprocess_v3.tsv | wc -l
21

$ awk '$16 != 1.0' C3Hv2_preprocess_v3.tsv | cut -f1,4,8,9,10,16
#feat_name      source_id       source_start    source_stop     source_strand   coverage
chr4_147902815  chr4    147902815       147902837       +       1.17391
chr4_99962384   chr4    99962384        99962406        -       0.95652
chr10_76613836  chr10   76613836        76613858        -       0.86957
chr9_40270433   chr9    40270433        40270455        +       0.95652
chr2_152331772  chr2    152331772       152331794       +       1.04348
chr11_101071651 chr11   101071651       101071673       +       0.69565
chr14_51198676  chr14   51198676        51198698        -       1.04348
chr14_51198694  chr14   51198694        51198716        -       1.04348
chr9_21457972   chr9    21457972        21457994        -       0.56522
chr1_172166289  chr1    172166289       172166311       -       0.82609
chr2_127012255  chr2    127012255       127012277       -       0.95652
chr2_118598667  chr2    118598667       118598689       +       1.04348
chr4_135760135  chr4    135760135       135760157       +       1.17391
chr15_7397706   chr15   7397706 7397728 -       1.04348
chrX_8523175    chrX    8523175 8523197 -       0.91304
chr17_28233406  chr17   28233406        28233428        -       1.08696
chr12_111713713 chr12   111713713       111713735       +       1.08696
chr3_32615904   chr3    32615904        32615926        -       1.13043
chr6_4746405    chr6    4746405 4746427 +       0.56522
chr1_90829745   chr1    90829745        90829767        +       0.78261
```

To eliminate them, we use `awk` to print only the lines with coverage equal to 1. This removes the 20 gRNAs that were mapped with coverage different from 1 in the C3H genome, as well as the header.

```bash
$ awk '$16 == 1.0' C3Hv2_preprocess_v3.tsv >C3Hv2_preprocess_v4.tsv

~/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3H$ wc -l C3Hv2_preprocess_v*
  1802 C3Hv2_preprocess_v1.tsv
  1785 C3Hv2_preprocess_v2.tsv
  1755 C3Hv2_preprocess_v3.tsv
  1734 C3Hv2_preprocess_v4.tsv
```
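If the header is needed downstream, the filter can keep the first line explicitly. A sketch on a made-up three-column table, where the hypothetical variable `col` holds the index of the coverage column (16 in the real report):

```bash
table=$(mktemp)
printf 'feat_name\tsource_id\tcoverage\n'   >"$table"
printf 'chr5_73647261\tchr5\t1.00000\n'    >>"$table"
printf 'chr9_21457972\tchr9\t0.56522\n'    >>"$table"
printf 'chr4_147902815\tchr4\t1.17391\n'   >>"$table"

# Keep the header (line 1) plus every row whose coverage column equals 1
filtered=$(awk -F'\t' -v col=3 'NR == 1 || $col == 1' "$table")
printf '%s\n' "$filtered"
rm -f "$table"
```

In the workflow above the header was dropped on purpose, which is convenient for the BED conversion that follows, so this variant is only useful when the table is inspected further.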


### 3.4. Final target coordinates in BED format

In the end, the converted coordinates cover 1734 entries out of the 1785 initial gRNAs. We really have 1735 gRNAs, because the pair that shares the same coordinates in forward and reverse orientation is represented by a single entry here.

The start coordinate has to be decreased by one, because BED files use 0-based, half-open intervals. We also convert the refined target regions in C3H into BED format, taking columns `$13` and `$14` and the chromosome in column `$4`. See the guidelines for the `bed` format [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). We also add the name in the fourth column to identify each line, and the strand on which the target was found.

```bash
$ awk -v OFS='\t' '{print $4, $13-1, $14, $1, "0", $15}' C3Hv2_preprocess_v4.tsv >C3Hv2coord_sub.bed
$ head -n 1 C3Hv2coord_sub.bed
chr5    73597459        73597482        chr5_73647261   0       -
$ wc -l C3Hv2coord_sub.bed
1734 C3Hv2coord_sub.bed
```
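The start shift follows from the two coordinate conventions: a 1-based, inclusive interval `[start, stop]` becomes the 0-based, half-open BED interval `[start-1, stop)`, and the interval length is then simply `end - start`. A quick check on the first line of the output above (the 1-based start 73597460 is inferred by adding one back to the BED start):

```bash
# 1-based inclusive coordinates (inferred for the first entry)
start_1based=73597460
stop_1based=73597482

# BED: 0-based start, end unchanged
bed_start=$((start_1based - 1))
bed_end=$stop_1based

echo "BED interval: $bed_start-$bed_end, length $((bed_end - bed_start)) nt"
```

The resulting length of 23 nt matches the size of the extracted sequences.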

We use this BED file to get the sequence of the regions in the C3H genome as explained [here](https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html), ideally with the "-bedOut Creating a tab-delimited BED file in lieu of FASTA output" option, which puts the sequence of the target region in the 4th column (usually reserved for a name).

This is done on the Marvin server because the genome is stored there, so we transfer the `.bed` file into the "Coordinates_MarcExp" folder. However, Marvin has the "BEDTools/2.27.1-foss-2018b" version installed, which is older than 2.29, so the "bedOut" option probably doesn't work. We prepare the bash script shown below, which uses `-tab` instead. In its output, the 2nd column is the target sequence in the C3H genome, but this is not a real .bed file, so we have to modify it manually: we just need to replace `:` and `-` with tabs. Then, we have a real `bed` file.

```bash
[mexposit@mr-login Coordinates_MarcExp]$ cat get-sequenceFromCord.sh
#!/bin/bash

#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marc.exposit@upf.edu
#SBATCH --mem=60000

#SBATCH -e stderr_filt_%j.err
#SBATCH -o stdout_filt_%j.out

# Getting variables


# Load modules
module load BEDTools/2.27.1-foss-2018b

# BEDtools
bedtools getfasta -fi ../../Reference_Genomes/GCA_001632575.1_C3H_HeJ_v1.fa -bed C3Hv2coord_sub.bed -fo C3Hv2coord_sub_Seq.bed -tab
```

After that, the file is copied to the local computer and manually converted to a real `bed` file, with the sequence as the "name" column and the sequences converted to uppercase.

```bash
$ sed 's/:/\t/g' <C3Hv2coord_sub_Seq.bed >process1.txt
$ sed 's/-/\t/g' <process1.txt >process2.txt
$ awk -v FS='\t' -v OFS='\t' '$4 = toupper($4)' process2.txt >C3Hv2coord_sub_Seq.bed
$ rm process*.txt
$ head C3Hv2coord_sub_Seq.bed
chr5    73597459        73597482        CCACAGTGGGCGGGGAATGGCAG
chr5    73597366        73597389        CCAGGAAACACTTGATAACGCAT
chr5    73597308        73597331        AGTTACAGCTTGATATCTGAAGG
chr7    29844943        29844966        CCACCCCCTCAGAACGTGGAAAC
chr7    29844881        29844904        AGAGATCACCCGACTCTGCCTGG
chr7    29844930        29844953        CCTCTTGTCCTCTCCACCCCCTC
chr16   64538111        64538134        CCGCAAAGGACTGAACGCTGCTT
chr16   64537902        64537925        CCACCTAGGAGCTATGATATTAT
chr10   66825393        66825416        CCCCAAAGTGAACAGGGTTAACA
chr10   66825203        66825226        TGGGCTAGGGTCAAAGAGATGGG
```

Use grep to search for the library of gRNAs in this file. 74 gRNAs are not found in this file. We only expected around 54. Could some of them have mutations?

```bash
$ grep -i -v -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta C3Hv2coord_sub_Seq.bed | grep -i -v -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta | wc -l
74
```

In total, 880 gRNAs are found as written in the library and 781 are found as the reverse complement. This is a total of 1661 lines, out of the 1785 initial gRNAs, of which we expected 1734. Adding the 74 missing gRNAs would complete the set. What happened to the extra 20 gRNAs that we did expect to find? They had point mutations.

```bash
$ grep -i -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta C3Hv2coord_sub_Seq.bed | wc -l
880
$ grep -i -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta C3Hv2coord_sub_Seq.bed | wc -l
781
```

The reason is that there are 24 gRNAs containing point mutations: their coverage was still 1, but their sequence differs from the library. How can we match them? Later on, it turns out that there seem to be 74 gRNAs with point mutations.


### 3.5. Matching coordinates with gRNAs

Since some gRNAs contain point mutations, the coordinates can't be matched with the guides using the sequence. We match them using the end coordinates instead (not the start coordinates, because those were decreased by 1 nt to cover the entire sequence).

To do the match, we create a file containing the new and the old coordinates, starting from the 4th preprocessed file. The file contains: the chromosome; the original position in mm10 (start, end and strand); and the mapped position in C3H (start, end and strand).

```bash
$ awk -v OFS='\t' '{print $4, $8, $9, $10, $13-1, $14, $15}' C3Hv2_preprocess_v4.tsv >C3Hv2MatchtoGrna.bed
$ head C3Hv2MatchtoGrna.bed
chr5    73647261        73647283        -       73597459        73597482        -
chr5    73647168        73647190        -       73597366        73597389        -
chr5    73647110        73647132        +       73597308        73597331        +
chr7    30464329        30464351        -       29844943        29844966        -
chr7    30464267        30464289        +       29844881        29844904        +
chr7    30464316        30464338        -       29844930        29844953        -
chr16   65550827        65550849        -       64538111        64538134        -
chr16   65550618        65550640        -       64537902        64537925        -
chr10   67538888        67538910        -       66825393        66825416        -
chr10   67538698        67538720        +       66825203        66825226        +
```

Now we check whether the original end coordinate is unique for each entry and can be used to match with the gRNAs. It indeed is.

```bash
$ wc -l C3Hv2MatchtoGrna.bed
1734 C3Hv2MatchtoGrna.bed
$ cut -f2 C3Hv2MatchtoGrna.bed | uniq -c | sort -nk1 | wc -l
1734
```
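Because `uniq` only collapses adjacent duplicates, and the command above reads column 2 (the start), a stricter check would sort first and count distinct chromosome/end pairs. A sketch on three rows taken from the listing above, reduced to the first three columns; the temporary file is illustrative:

```bash
match=$(mktemp)
printf 'chr5\t73647261\t73647283\n'  >"$match"
printf 'chr7\t30464329\t30464351\n' >>"$match"
printf 'chr5\t73647168\t73647190\n' >>"$match"

total=$(wc -l <"$match" | tr -d ' ')
# Distinct (chromosome, end) pairs; equals the line count only if every key is unique
distinct=$(cut -f1,3 "$match" | sort -u | wc -l | tr -d ' ')

echo "lines: $total, distinct keys: $distinct"
rm -f "$match"
```

Including the chromosome in the key guards against two guides on different chromosomes sharing the same end position by chance.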

We join them on the third column, which contains their end position in the mm10 genome. The `-t $'\t'` option sets the tab as field separator (otherwise fields are separated by blanks by default). Both inputs are sorted because `join` requires sorted input.

```bash
$ join -j 3 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.1,2.2,2.3,2.4,2.5,2.6 -t $'\t' <(sort -k3 C3Hv2MatchtoGrna.bed) <(sort -k3 mm10_gRNA-coordinates.bed) > C3Hv2coordsMatch_noSeq.tsv
$ wc -l C3Hv2coordsMatch_noSeq.tsv
1735 C3Hv2coordsMatch_noSeq.tsv
$ wc -l C3Hv2MatchtoGrna.bed
1734 C3Hv2MatchtoGrna.bed
mexposit@DESKTOP-7MAB2EU:/c/Users/Usuario/Documents/02.TransSynBio/BioinfoThings/coordinates/ConversionC3Hv2$ head C3Hv2coordsMatch_noSeq.tsv
chr5    100038250       100038272       -       101283718       101283741       -       chr5    100038249  100038272                                                                                              ENSMUSG00000029328_gR364f        0       -
chr5    100038479       100038501       +       101283941       101283964       +       chr5    100038478  100038501                                                                                              ENSMUSG00000029328_gR125r        0       +
chr5    100038563       100038585       -       101284025       101284048       -       chr5    100038562  100038585                                                                                              ENSMUSG00000029328_gR51f 0       -
chr13   100125107       100125129       +       100745175       100745198       +       chr13   100125106  100125129                                                                                              ENSMUSG00000021645_gR59f 0       +
chr13   100125366       100125388       +       100745434       100745457       +       chr13   100125365  100125388                                                                                              ENSMUSG00000021645_gR318f        0       +
chr13   100125443       100125465       +       100745511       100745534       +       chr13   100125442  100125465                                                                                              ENSMUSG00000021645_gR395f        0       +
chr7    100177080       100177102       +       103213377       103213400       +       chr7    100177079  100177102                                                                                              ENSMUSG00000035165_gR214f        0       +
chr7    100177119       100177141       +       103213416       103213439       +       chr7    100177118  100177141                                                                                              ENSMUSG00000035165_gR253f        0       +
chr7    100177172       100177194       -       103213469       103213492       -       chr7    100177171  100177194                                                                                              ENSMUSG00000035165_gR296r        0       -
chr11   100397201       100397223       +       102089404       102089427       +       chr11   100397200  100397223                                                                                              ENSMUSG00000001552_gR319r        0       +
```
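`sort -k3` sorts on everything from the third field to the end of the line, while `join` expects both inputs ordered on the join field alone, under the same collation. A hedged sketch of a stricter variant; `join_on_col` is a hypothetical helper, and forcing the C locale is an assumption made here to keep `sort` and `join` consistent:

```bash
# Hypothetical helper: join two tab-separated files on the same column number.
# Both inputs are sorted on that single field (-k N,N) in the C locale so that
# sort and join agree on the ordering.
join_on_col() {
    col=$1
    left=$2
    right=$3
    tab=$(printf '\t')
    l_sorted=$(mktemp)
    r_sorted=$(mktemp)
    LC_ALL=C sort -t "$tab" -k "$col,$col" "$left"  > "$l_sorted"
    LC_ALL=C sort -t "$tab" -k "$col,$col" "$right" > "$r_sorted"
    # Default output: join field first, then the other fields of each file.
    LC_ALL=C join -t "$tab" -1 "$col" -2 "$col" "$l_sorted" "$r_sorted"
    rm -f "$l_sorted" "$r_sorted"
}
```

Without `-o`, the join field is printed first, so the column order differs from the `-o` list used in the command above.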

The joined file has one more line than the mapped file (1735 vs. 1734). There are no repeated gRNA names, but one end position appears twice:

```bash
$ cut -f3 C3Hv2coordsMatch_noSeq.tsv | uniq -c | sort -nk1 | tail -n 3
      1 99962562
      1 99962789
      2 53730637
$ cut -f11 C3Hv2coordsMatch_noSeq.tsv | uniq -c | sort -nk1 | tail -n 3
      1 ENSMUSG00000107482_gR84f
      1 ENSMUSG00000116461_gR127r
      1 ENSMUSG00000116461_gR176f

$ grep 53730637 C3Hv2coordsMatch_noSeq.tsv
chr4    53730615        53730637        +       52288077        52288100        +       chr4    53730614   53730637        ENSMUSG00000028414_gR483r        0       -
chr4    53730615        53730637        +       52288077        52288100        +       chr4    53730614   53730637        ENSMUSG00000028414_gR493f        0       +
```

The cause is a pair of gRNAs that are reverse complements of each other: `>ENSMUSG00000028414_gR493f:CCTAGCATGCAGGAAGCCCTGGG` and `>ENSMUSG00000028414_gR483r:CCCAGGGCTTCCTGCATGCTAGG`. They have the same coordinates and cut at different but overlapping sites of the genome, one on each strand.
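The relationship between the two guides can be checked directly. A small sketch, assuming upper-case input without ambiguity codes; `revcomp` is a hypothetical helper name:

```bash
# Hypothetical helper: reverse complement of an upper-case DNA string.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}
```

Calling `revcomp CCTAGCATGCAGGAAGCCCTGGG` returns `CCCAGGGCTTCCTGCATGCTAGG`, which is the sequence of `ENSMUSG00000028414_gR483r`.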

The Remap algorithm probably dropped this duplication because it considered it a repeated entry. This is why the counts of deleted and remaining entries did not match.

At the end, we get 1735 gRNAs matching the 1734 mapped coordinates. This is what we expected.

Next, we append the retrieved C3H sequence for each guide RNA in the file. We do it using the 5th column of the created file `C3Hv2coordsMatch_noSeq.tsv`, and the 2nd column of the bed file with the sequences.

```bash
$ head C3Hv2coord_sub_Seq.bed
chr5    73597459        73597482        CCACAGTGGGCGGGGAATGGCAG
chr5    73597366        73597389        CCAGGAAACACTTGATAACGCAT
chr5    73597308        73597331        AGTTACAGCTTGATATCTGAAGG
chr7    29844943        29844966        CCACCCCCTCAGAACGTGGAAAC
chr7    29844881        29844904        AGAGATCACCCGACTCTGCCTGG
chr7    29844930        29844953        CCTCTTGTCCTCTCCACCCCCTC
chr16   64538111        64538134        CCGCAAAGGACTGAACGCTGCTT
chr16   64537902        64537925        CCACCTAGGAGCTATGATATTAT
chr10   66825393        66825416        CCCCAAAGTGAACAGGGTTAACA
chr10   66825203        66825226        TGGGCTAGGGTCAAAGAGATGGG
$ head C3Hv2coordsMatch_noSeq.tsv
chr5    100038250       100038272       -       101283718       101283741       -       chr5    100038249  100038272        ENSMUSG00000029328_gR364f       0       -
chr5    100038479       100038501       +       101283941       101283964       +       chr5    100038478  100038501        ENSMUSG00000029328_gR125r       0       +
chr5    100038563       100038585       -       101284025       101284048       -       chr5    100038562  100038585        ENSMUSG00000029328_gR51f        0       -
chr13   100125107       100125129       +       100745175       100745198       +       chr13   100125106  100125129        ENSMUSG00000021645_gR59f        0       +
chr13   100125366       100125388       +       100745434       100745457       +       chr13   100125365  100125388        ENSMUSG00000021645_gR318f       0       +
chr13   100125443       100125465       +       100745511       100745534       +       chr13   100125442  100125465        ENSMUSG00000021645_gR395f       0       +
chr7    100177080       100177102       +       103213377       103213400       +       chr7    100177079  100177102        ENSMUSG00000035165_gR214f       0       +
chr7    100177119       100177141       +       103213416       103213439       +       chr7    100177118  100177141        ENSMUSG00000035165_gR253f       0       +
chr7    100177172       100177194       -       103213469       103213492       -       chr7    100177171  100177194        ENSMUSG00000035165_gR296r       0       -
chr11   100397201       100397223       +       102089404       102089427       +       chr11   100397200  100397223        ENSMUSG00000001552_gR319r       0       +

$ join -1 5 -2 2 -t $'\t' -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.7,1.8,1.9,1.10,1.11,1.12,1.13,2.4 <(sort -k5 C3Hv2coordsMatch_noSeq.tsv) <(sort -k2 C3Hv2coord_sub_Seq.bed) >C3Hv2coordsMatch.tsv

$ head -n 5 C3Hv2coordsMatch.tsv
chr6    84009529        84009551        -       100230452       100230475       +       +       chr6    84009528        84009551        ENSMUSG00000033788_gR434r       0       -       TGGGGTGGGAAAGGGGAGAGAGG
chr6    84009198        84009220        +       100230783       100230806       -       -       chr6    84009197        84009220        ENSMUSG00000033788_gR113f       0       +       CCTGGTCCAGCTGAAGCACAGTT
chr6    84009155        84009177        +       100230826       100230849       -       -       chr6    84009154        84009177        ENSMUSG00000033788_gR70f        0       +       CCTCAGTTTCCCCTGAGGTGCCC
chr15   99726287        99726309        -       100614257       100614280       -       -       chr15   99726286        99726309        ENSMUSG00000023020_gR346r       0       -       CCGTTAGTCAGAAGTTGTTTTCT
chr15   99726317        99726339        -       100614287       100614310       -       -       chr15   99726316        99726339        ENSMUSG00000023020_gR376r       0       -       CCAAGTCTCTCCAAAGAGGGCTA
```

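Note that the `-o` list above contains `1.7` twice, so the C3H strand is duplicated (columns 7 and 8); as a result the gRNA name ends up in column 12 and the sequence in column 15. One gRNA sequence appears twice (CCTAGCATGCAGGAAGCCCTGGG): it belongs to the pair of guides that share coordinates, one forward and one reverse. This sequence is not expected to be repeated later on, because the sequence will be reverse-complemented for the reverse guides.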

```bash
$ cut -f15 C3Hv2coordsMatch_noSeq.tsv | uniq -c | sort -nk1 | tail -n 3
   1735
$ cut -f14 C3Hv2coordsMatch_noSeq.tsv | uniq -c | sort -nk1 | tail -n 3
   1735
$ cut -f15 C3Hv2coordsMatch.tsv | uniq -c | sort -nk1 | tail -n 3
      1 TTTGTATCAGGGCAGCTTTCGGG
      1 TTTGTTTCCCAAGTGGTGAGTGG
      2 CCTAGCATGCAGGAAGCCCTGGG
```

#### 3.5.2. Create gRNA coordinates in C3H

Just as we have the gRNA coordinates in mm10, we can now get the gRNA coordinates in C3H. We take the file with all the information, `C3Hv2coordsMatch.tsv`, and use awk to print the required columns (chromosome, C3H start, C3H end, gRNA ID, a score of 0 and C3H strand) as a BED file.

```
$ awk -v FS='\t' -v OFS='\t' '{print $1,$5,$6,$12,"0",$7}' C3Hv2coordsMatch.tsv >C3H_gRNA-coordinates.bed

$ head -n 1 C3Hv2coordsMatch.tsv
chr6    84009529        84009551        -       100230452       100230475       +       +       chr6    84009528        84009551        ENSMUSG00000033788_gR434r       0       -       TGGGGTGGGAAAGGGGAGAGAGG

$ head -n 1 *gRNA-coordinates.bed
==> C3H_gRNA-coordinates.bed <==
chr6    100230452       100230475       ENSMUSG00000033788_gR434r       0       +

==> mm10_gRNA-coordinates.bed <==
chr5    73647260        73647283        ENSMUSG00000029156_gR308f       0       -

$ wc -l C3H_gRNA-coordinates.bed
1735 C3H_gRNA-coordinates.bed
```


### 3.6. Identify mutated and missing gRNAs

The retrieved sequences can be compared with the original ones to identify the mutated gRNAs: those whose name is present in the list but whose sequence does not match. To do so, we convert the library from FASTA to a TSV that maps each guide to its sequence as given (second column) and to its reverse complement (third column).

For that, we first replace the newlines with spaces and then start a new line at each '>' that precedes a name. Then we remove the `_reverse_complement` suffix with sed so that both files can be joined by name, giving one column with the forward sequence and another with the reverse complement.

```bash
$ tr '\n' ' ' < mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta > gRNAorigSeq_1.tsv
$ tr '>' '\n' < gRNAorigSeq_1.tsv > gRNAorigSeq.tsv
$ head gRNAorigSeq.tsv

ENSMUSG00000078919_gR57r GCAGGAAGGAGAAAGACGCGGGG
ENSMUSG00000078919_gR97r ATCTGTCAACAGATAGACACCGG
ENSMUSG00000078919_gR361f GCCGAGTTTGATTAGGAACCCGG
ENSMUSG00000093752_gR369r AACAACACACAAATAAGCCCGGG
ENSMUSG00000093752_gR429r ATTCTTGTTCCTAAGAGCTAGGG
ENSMUSG00000026024_gR252f GGGAAAAGTGAAGGGGGGCGGGG
ENSMUSG00000026024_gR250f TGGGGAAAAGTGAAGGGGGGCGG
ENSMUSG00000026024_gR225r TTTCCCCACGATATGCATCCTGG
ENSMUSG00000016534_gR266r TCCAGGAAACCAAGCCTCCCCGG

$ tr '\n' ' ' < mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta > gRNAorigSeqrev_1.tsv
$ tr '>' '\n' < gRNAorigSeqrev_1.tsv > gRNAorigSeqrev.tsv
$ rm gRNAorigSeq_1.tsv gRNAorigSeqrev_1.tsv
$ head gRNAorigSeqrev.tsv

ENSMUSG00000078919_gR57r_reverse_complement CCCCGCGTCTTTCTCCTTCCTGC
ENSMUSG00000078919_gR97r_reverse_complement CCGGTGTCTATCTGTTGACAGAT
ENSMUSG00000078919_gR361f_reverse_complement CCGGGTTCCTAATCAAACTCGGC
ENSMUSG00000093752_gR369r_reverse_complement CCCGGGCTTATTTGTGTGTTGTT
ENSMUSG00000093752_gR429r_reverse_complement CCCTAGCTCTTAGGAACAAGAAT
ENSMUSG00000026024_gR252f_reverse_complement CCCCGCCCCCCTTCACTTTTCCC
ENSMUSG00000026024_gR250f_reverse_complement CCGCCCCCCTTCACTTTTCCCCA
ENSMUSG00000026024_gR225r_reverse_complement CCAGGATGCATATCGTGGGGAAA
ENSMUSG00000016534_gR266r_reverse_complement CCGGGGAGGCTTGGTTTCCTGGA
$ sed 's/_reverse_complement//' gRNAorigSeqrev.tsv > gRNAorigSeqrev2.tsv
$ mv gRNAorigSeqrev2.tsv gRNAorigSeqrev.tsv
$ head gRNAorigSeqrev.tsv

ENSMUSG00000078919_gR57r CCCCGCGTCTTTCTCCTTCCTGC
ENSMUSG00000078919_gR97r CCGGTGTCTATCTGTTGACAGAT
ENSMUSG00000078919_gR361f CCGGGTTCCTAATCAAACTCGGC
ENSMUSG00000093752_gR369r CCCGGGCTTATTTGTGTGTTGTT
ENSMUSG00000093752_gR429r CCCTAGCTCTTAGGAACAAGAAT
ENSMUSG00000026024_gR252f CCCCGCCCCCCTTCACTTTTCCC
ENSMUSG00000026024_gR250f CCGCCCCCCTTCACTTTTCCCCA
ENSMUSG00000026024_gR225r CCAGGATGCATATCGTGGGGAAA
ENSMUSG00000016534_gR266r CCGGGGAGGCTTGGTTTCCTGGA

$ join -j 1 <(sort -k1 gRNAorigSeq.tsv) <(sort -k1 gRNAorigSeqrev.tsv) | sed 's/ /\t/' >gRNAorigSeqs_1.tsv
$ sed 's/  /\t/' <gRNAorigSeqs_1.tsv >gRNAorigSeqs_2.tsv
$ tail -n +2 gRNAorigSeqs_2.tsv >gRNAorigSeqs.tsv
$ rm gRNAorigSeqs_1.tsv gRNAorigSeqs_2.tsv
$ head gRNAorigSeqs.tsv
ENSMUSG00000000058_gR158f       GAAGGCTGAGCTGAGTGTGGAGG  CCTCCACACTCAGCTCAGCCTTC
ENSMUSG00000000058_gR19r        TCTCAGGTCTTGTTACCCTTGGG  CCCAAGGGTAACAAGACCTGAGA
ENSMUSG00000000058_gR81r        CTTCTGTTTAAACTCTCAGCCGG  CCGGCTGAGAGTTTAAACAGAAG
ENSMUSG00000000441_gR204f       AAAACCTATGGGCTGGATGTGGG  CCCACATCCAGCCCATAGGTTTT
ENSMUSG00000000441_gR205f       AAACCTATGGGCTGGATGTGGGG  CCCCACATCCAGCCCATAGGTTT
ENSMUSG00000000441_gR343f       TCTTCATGCATACCAAAAGAGGG  CCCTCTTTTGGTATGCATGAAGA
ENSMUSG00000000738_gR159r       TGAATGGTCTGATTGAAGAGGGG  CCCCTCTTCAATCAGACCATTCA
ENSMUSG00000000738_gR215f       GAATAAGATGTGTACAGAATAGG  CCTATTCTGTACACATCTTATTC
ENSMUSG00000000738_gR400r       CACCACTTTAGGGCTGCAGGTGG  CCACCTGCAGCCCTAAAGTGGTG
ENSMUSG00000001027_gR235r       TTTGGAATCGCCTGCTGGGATGG  CCATCCCAGCAGGCGATTCCAAA
```
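The same FASTA-to-table conversion can be sketched in a single awk pass, which avoids the intermediate files and the empty first line. This is only an illustration: `fasta_to_tsv` is a hypothetical helper, and it assumes headers consist of the gRNA name alone.

```bash
# Hypothetical helper: turn a FASTA file ($1) into "name<TAB>sequence" rows,
# optionally cutting the name at a given suffix ($2).
fasta_to_tsv() {
    awk -v suffix="$2" '
        /^>/ {
            if (name != "") print name "\t" seq      # flush the previous record
            name = substr($0, 2)                      # drop the leading ">"
            if (suffix != "" && index(name, suffix) > 0)
                name = substr(name, 1, index(name, suffix) - 1)
            seq = ""
            next
        }
        { seq = seq $0 }                              # append sequence lines
        END { if (name != "") print name "\t" seq }
    ' "$1"
}
```

Called as `fasta_to_tsv mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI_revcomp.fasta _reverse_complement`, it would produce the name and reverse-complement columns ready to be joined with the forward table.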

#### 3.6.1. Missing gRNAs

The missing gRNAs are those whose name does not appear in the file of matched gRNAs. We expect 50 in total (30 not found + 20 with incorrect coverage). An inverse join finds them: file 1 is the list of all gRNAs from the transformed library file, and file 2 is the file with the matched converted coordinates. We compare the 1st column of file 1 with the 12th column of file 2, both of which hold the gRNA name, and ask for the lines of file 1 that have no match in file 2 (`-v 1`). This returns 50 missing gRNAs, as expected. To print them we use cut (space delimited, first column, which is the gRNA ID) and tr to separate them with commas. We also save the unmatched lines in `missinggRNAs.txt`.

```bash
$ join -1 1 -2 12 -v 1 <(sort -k1 gRNAorigSeqs.tsv) <(sort -k12 C3Hv2coordsMatch.tsv) | wc -l
50
$ join -1 1 -2 12 -v 1 <(sort -k1 gRNAorigSeqs.tsv) <(sort -k12 C3Hv2coordsMatch.tsv) | cut -d ' ' -f1 | tr '\n' ','
ENSMUSG00000001751_gR341f,ENSMUSG00000001998_gR300r,ENSMUSG00000002250_gR449r,ENSMUSG00000004631_gR459r,ENSMUSG00000020241_gR167f,ENSMUSG00000020593_gR156f,ENSMUSG00000021115_gR467f,ENSMUSG00000022110_gR325f,ENSMUSG00000022110_gR354r,ENSMUSG00000022110_gR387r,ENSMUSG00000023945_gR408f,ENSMUSG00000024843_gR361f,ENSMUSG00000024843_gR401f,ENSMUSG00000025056_gR118f,ENSMUSG00000025791_gR101r,ENSMUSG00000026554_gR278r,ENSMUSG00000027305_gR57f,ENSMUSG00000027466_gR146r,ENSMUSG00000027669_gR108r,ENSMUSG00000027669_gR429f,ENSMUSG00000029020_gR424r,ENSMUSG00000033335_gR303r,ENSMUSG00000035371_gR53f,ENSMUSG00000035371_gR71f,ENSMUSG00000035371_gR78f,ENSMUSG00000035458_gR407r,ENSMUSG00000035458_gR438r,ENSMUSG00000037139_gR246f,ENSMUSG00000037787_gR212f,ENSMUSG00000040084_gR123f,ENSMUSG00000041058_gR124r,ENSMUSG00000042961_gR176f,ENSMUSG00000047894_gR71f,ENSMUSG00000047894_gR89f,ENSMUSG00000048126_gR282r,ENSMUSG00000049281_gR303f,ENSMUSG00000062209_gR346f,ENSMUSG00000068218_gR425r,ENSMUSG00000068218_gR432f,ENSMUSG00000068218_gR468r,ENSMUSG00000079697_gR271r,ENSMUSG00000079697_gR420r,ENSMUSG00000079701_gR30r,ENSMUSG00000079701_gR345f,ENSMUSG00000079701_gR439r,ENSMUSG00000079704_gR355r,ENSMUSG00000079704_gR412r,ENSMUSG00000079704_gR97f,ENSMUSG00000079705_gR469r,ENSMUSG00000079705_gR87r,

$ join -1 1 -2 12 -v 1 <(sort -k1 gRNAorigSeqs.tsv) <(sort -k12 C3Hv2coordsMat) >missinggRNAs.txt
$ wc -l missinggRNAs.txt
50 missinggRNAs.txt
```
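The inverse join can also be written as an awk lookup, which needs no sorting. A minimal sketch under the assumption that both files are tab-separated; `missing_ids` is a hypothetical helper and not part of the original workflow:

```bash
# Hypothetical helper: print the IDs in column 1 of the library table ($1)
# that never appear in column $3 of the matched table ($2).
missing_ids() {
    awk -F'\t' -v col="$3" '
        NR == FNR { seen[$col] = 1; next }   # first file read: IDs that were matched
        !($1 in seen) { print $1 }           # second file read: library IDs not seen
    ' "$2" "$1"
}
```

With the files above this would be `missing_ids gRNAorigSeqs.tsv C3Hv2coordsMatch.tsv 12`. The `NR == FNR` idiom assumes the matched table is not empty.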

#### 3.6.2. Point mutated gRNAs

To identify the point-mutated gRNAs we compare the gRNA sequences, again with inverse joins. The mutated gRNAs are those whose sequence mapped on C3H matches neither the forward (second) nor the reverse-complement (third) column of the original library. There are 74 gRNAs mutated in the C3H genome.

```bash
$ join -1 15 -2 2 -v 1 -t $'\t' <(sort -k15 C3Hv2coordsMatch.tsv) <(sort -k2 gRNAorigSeqs.tsv) >notInFwd.txt
$ wc -l notInFwd.txt
854 notInFwd.txt
$ sort -k3 gRNAorigSeqs.tsv >gRNAorigSeqs_revsort.tsv
$ join -1 1 -2 3 -v 1 notInFwd.txt gRNAorigSeqs_revsort.tsv >norInReverse.txt
$ wc -l norInReverse.txt
74
```
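A related check can be made per gRNA name instead of per sequence. The sketch below is not the method used above: it compares each C3H sequence only with the library entry of the same name, and it assumes a library table with three tab-separated columns. `mutated_ids` is a hypothetical helper.

```bash
# Hypothetical helper: list gRNAs whose C3H sequence equals neither the
# forward nor the reverse-complement sequence stored under the same name.
# $1: library table (name, forward, reverse complement), tab-separated
# $2: matched table; $3: its name column; $4: its sequence column
mutated_ids() {
    awk -F'\t' -v name_col="$3" -v seq_col="$4" '
        NR == FNR { fwd[$1] = $2; rev[$1] = $3; next }   # load the library
        ($name_col in fwd) && $seq_col != fwd[$name_col] && $seq_col != rev[$name_col] {
            print $name_col
        }
    ' "$1" "$2"
}
```

For the files above the call would be `mutated_ids gRNAorigSeqs.tsv C3Hv2coordsMatch.tsv 12 15`; because the test differs from the sequence-based joins, the count is not guaranteed to be the same.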

We create a file with the names of those gRNAs. Remember that 20 more gRNAs are mutated: the ones with coverage != 1.

```bash
$ cut -d ' ' -f13 norInReverse.txt >mutatedgRNAs.txt
$ wc -l mutatedgRNAs.txt
74 mutatedgRNAs.txt
```


### 3.7. Getting sequence to simulate the data

To simulate the data, the script was created to receive data with this format:

```
>gRNA_id_1,ATCGATCGAATTTC
>gRNA_id_2,ATCGATCGAATTTC
>gRNA_id_3,ATCGATCGAATTTC
```

where the DNA sequence consists of 120 nucleotides with the cut site in the center: 60 nt upstream and 60 nt downstream of the cut site. A gRNA cuts 3 nt upstream of its PAM sequence. For AATGAGGCTTTGAGAACACCTGG, the PAM (NGG) is TGG, so the cut falls here: AATGAGGCTTTGAGAAC|ACCTGG.
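As a quick illustration of that rule, the cut position in a 23-nt target (20-nt protospacer plus 3-nt PAM) falls after nucleotide 17. A small sketch; `mark_cutsite` is a hypothetical helper name:

```bash
# Hypothetical helper: mark the expected cut site in a 23-nt target
# (20-nt protospacer + 3-nt NGG PAM), 3 nt upstream of the PAM.
mark_cutsite() {
    printf '%s\n' "$1" | awk '{ print substr($0, 1, 17) "|" substr($0, 18) }'
}
```

`mark_cutsite AATGAGGCTTTGAGAACACCTGG` prints `AATGAGGCTTTGAGAAC|ACCTGG`, the same split as in the example above.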

The objective of this section is transforming the coordinates to get 120 nucleotides with the cutsite in the center and get the fasta sequence of that.

The reverse gRNA have to be treated in a special way.

```
Ex. Forward gRNA: >ENSMUSG00000026452_gR260f, AATGAGGCTTTGAGAACACCTGG

rel to cutsite                          --------- ++++++
position            to -60    -17    ...987654321 123456 ... to +60
gRNA                            AATGAGGCTTTGAGAAC|ACCTGG
genome                  ATCGTACTAATGAGGCTTTGAGAAC|ACCTGGATGATATGTATCG

Initial coordinate of forward gRNAs corresponds to -17 position relative to cutsite
Final coordinate of forward gRNA corresponds to +6 position relative to cutsite

Ex. Reverse gRNA: >ENSMUSG00000026452_gR82r, ATGTGTAACATAAACTCCCTTGG

rel to cutsite
position                         +654321-123456789      -17
genome5'3'                TATGAGC CCAAGG|GAGTTTATGTTACACAT  ATATAT
genome3'5'                ATACTCG GGTTCC|CTCAAATACAATGTGTA  TATATA
gRNA                              GGTTCC|CTCAAATACAATGTGTA

Initial coordinate of the reverse gRNAs corresponds to +6 position of the cutsite
Final coordinate of the reverse gRNA corresponds to -17 position of the cutsite. 
```

Since we want the window to span -60 to +60 around the cut site, we have to add different offsets depending on whether the gRNA is forward (Fwd) or reverse (Rvs). Each offset is `we want - we have`; note that for Rvs the sign of `we have` is inverted.

- Initial coordinate of Fwd gRNA = -60 - (-17) = -43
- Final coordinate of Fwd gRNA = +60 - (+6) = +54
- Initial coordinate of Rvs gRNA = -60 - (-6) = -54
- Final coordinate of Rvs gRNA = +60 - (+17) = +43

Note that some gRNAs may map to the opposite strand in C3H compared with mm10.

```bash
$ cat discrepancy.awk
{
if ($4 == "+" && $7 == "-") print $0;
else if ( $4 == "-" && $7 == "+" ) print $0;
}
$ awk -f discrepancy.awk C3Hv2coordsMatch_noSeq.tsv | wc -l
49
```

In these cases we consider the gRNA inverted. Hence, the strand that contains each gRNA is taken from the strand of the matched C3H coordinates (column 7).

To transform the coordinates with these offsets, we use an awk script that reads the strand of the matched position (C3H), not that of the original mm10.

```bash
$ awk -v OFS='\t' -f cutsitecoords.awk C3Hv2coordsMatch.tsv > C3H_cutsite120nt.bed
$ head C3H_cutsite120nt.bed
chr6    100230409       100230529       ENSMUSG00000033788_gR434r       0       +
chr6    100230729       100230849       ENSMUSG00000033788_gR113f       0       -
chr6    100230772       100230892       ENSMUSG00000033788_gR70f        0       -
chr15   100614203       100614323       ENSMUSG00000023020_gR346r       0       -
chr15   100614233       100614353       ENSMUSG00000023020_gR376r       0       -
chr15   100614291       100614411       ENSMUSG00000023020_gR434r       0       -
chr12   100701816       100701936       ENSMUSG00000021177_gR34f        0       +
chr12   100701905       100702025       ENSMUSG00000021177_gR124r       0       -
chr12   100701990       100702110       ENSMUSG00000021177_gR208f       0       +
chr13   100745132       100745252       ENSMUSG00000021645_gR59f        0       +
```
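The content of `cutsitecoords.awk` is not shown in these notes. The sketch below is a hypothetical reconstruction of its logic from the offsets derived above and the output just listed; the function name `cutsite_window` is an assumption.

```bash
# Hypothetical reconstruction of the logic in cutsitecoords.awk (not shown in
# these notes). Column numbers follow C3Hv2coordsMatch.tsv: 1 = chromosome,
# 5-6 = C3H start/end, 7 = C3H strand, 12 = gRNA ID.
cutsite_window() {
    awk -F'\t' -v OFS='\t' '
        $7 == "+" { print $1, $5 - 43, $6 + 54, $12, 0, $7 }   # forward: -43 / +54
        $7 == "-" { print $1, $5 - 54, $6 + 43, $12, 0, $7 }   # reverse: -54 / +43
    ' "$1"
}
```

Applied to the first record of `C3Hv2coordsMatch.tsv` (C3H coordinates 100230452-100230475, strand +), this gives 100230409-100230529, which agrees with the first line of `C3H_cutsite120nt.bed`.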

Let's look at one of them in detail:

```bash
chr12   100701816       100701936       ENSMUSG00000021177_gR34f        0       + 
chr12   99884669        99884691        +       100701859       100701882       +       +       chr12   99884668        99884691        ENSMUSG00000021177_gR34f        0       +       AGTGGAAGGCCAGCAGGGCTCGG

First nucleotide will be 100701816, and 43 nucleotides later we will have 100701859, which is the start of the gRNA -> OK. Then, we have 100701816+60=100701876 which is the position after the cut site! OK. Then, we look 60nt after this, and we get the last coordinate which matches with the record 100701936. OK!

Nucleotides are numbered like in between. Example:

0  1  2  3  4  5  6  7  8  9
 A  T  C  G  T  C  G  T  A
So, if you ask for the coordinates 1-7, you would get 6nt, which are TCGTCG

In the example we can see that by doing 936-816 = 120, so we will get 120 nucleotides and the cut site will be in the middle.
```
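The same length check can be applied to every line rather than to a single example. A minimal sketch, assuming a tab-separated BED file; `count_bad_windows` is a hypothetical helper:

```bash
# Hypothetical check: count BED lines whose window (end - start) is not 120 nt.
count_bad_windows() {
    awk -F'\t' '($3 - $2) != 120 { bad++ } END { print bad + 0 }' "$1"
}
```

`count_bad_windows C3H_cutsite120nt.bed` should print 0 if all windows were built with the offsets above.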

Then, the generated file is transferred to the VM to get the sequences. The script is identical to the previous one, except that the `-s` flag is added so that strand information is taken into account: features on the minus strand are returned as the reverse complement, which is essential for the inverted gRNAs.

```bash
$ cat get-sequenceFromCord_strand.sh
#!/bin/bash

#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=marc.exposit@upf.edu
#SBATCH --mem=60000

#SBATCH -e stderr_filt_%j.err
#SBATCH -o stdout_filt_%j.out

# Getting variables


# Load modules
module load BEDTools/2.27.1-foss-2018b

# BEDtools
bedtools getfasta -fi ../../Reference_Genomes/GCA_001632575.1_C3H_HeJ_v1.fa -bed C3H_cutsite120nt.bed -s -fo C3H_cutsite120nt_Seq.bed -tab
```

After running this, the sequences are in the `C3H_cutsite120nt_Seq.bed` file.

```bash
$ wc -l C3H_cutsite120nt*
  1735 C3H_cutsite120nt.bed
  1735 C3H_cutsite120nt_Seq.bed

$ head C3H_cutsite120nt_Seq.bed
chr6:100230409-100230529(+)     CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGGGGTGGGAAAGgggagagagggagggaaagaaagagggaaggagggagggagggagagagagagagggagagaga
chr6:100230729-100230849(-)     GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAACTGTGCTTCAGCTGGACCAGGGCTGAGGGCAGGCTACTACTCTTTTCTCTGGATCCCTTAAAAGATGTCTTTCCA
chr6:100230772-100230892(-)     CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAACTGTGCTTCAGCTGGACCAGGGCTGAGGGCAG
chr15:100614203-100614323(-)    AGCTTCTGTCATTTAGCCCTCTTTGGAGAGACTTGGTGTCCGCAGAAAACAACTTCTGACTAACGGAGCCCAGAAAGGTAAGCTATGAACCCCATATTCTGTAGGATCAGAGAAGGTGGT
chr15:100614233-100614353(-)    AAATAAGGATCAAAGCCCTAAAACCACCCCAGCTTCTGTCATTTAGCCCTCTTTGGAGAGACTTGGTGTCCGCAGAAAACAACTTCTGACTAACGGAGCCCAGAAAGGTAAGCTATGAAC
```

Let's give a BED format to this file.

```bash
$ tr ':' '\t' <C3H_cutsite120nt_Seq.bed >process1.txt
$ sed 's/(+)//' <process1.txt >process2.txt
$ sed 's/(-)//' <process2.txt >process3.txt
$ tr '-' '\t' <process3.txt >process4.txt
$ awk -v FS='\t' -v OFS='\t' '$4 = toupper($4)' process4.txt >C3H_cutsite120nt_Seq.bed
$ head C3H_cutsite120nt_Seq.bed
chr6    100230409       100230529       CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGGGGTGGGAAAGGGGAGAGAGGGAGGGAAAGAAAGAGGGAAGGAGGGAGGGAGGGAGAGAGAGAGAGGGAGAGAGA
chr6    100230729       100230849       GGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAACTGTGCTTCAGCTGGACCAGGGCTGAGGGCAGGCTACTACTCTTTTCTCTGGATCCCTTAAAAGATGTCTTTCCA
chr6    100230772       100230892       CTGACTAACGTTGAATATCGATAAGACACAGAGAAGAGGGGTGGGGCACCTCAGGGGAAACTGAGGCTGTGAGATAGAAGTGGCCCAACTGTGCTTCAGCTGGACCAGGGCTGAGGGCAG
chr15   100614203       100614323       AGCTTCTGTCATTTAGCCCTCTTTGGAGAGACTTGGTGTCCGCAGAAAACAACTTCTGACTAACGGAGCCCAGAAAGGTAAGCTATGAACCCCATATTCTGTAGGATCAGAGAAGGTGGT
chr15   100614233       100614353       AAATAAGGATCAAAGCCCTAAAACCACCCCAGCTTCTGTCATTTAGCCCTCTTTGGAGAGACTTGGTGTCCGCAGAAAACAACTTCTGACTAACGGAGCCCAGAAAGGTAAGCTATGAAC
chr15   100614291       100614411       AAAAGAAACTCTTGAGACTGGCAGAAGGTTTTCATCTGATGGGCATCCTTTACAAAACAAATAAGGATCAAAGCCCTAAAACCACCCCAGCTTCTGTCATTTAGCCCTCTTTGGAGAGAC
chr12   100701816       100701936       CCCTCCTCATGGCGGCGCGATGGCCGCTGCCATCAGATGTGGTAGTGGAAGGCCAGCAGGGCTCGGTGAGGACCGCGAGGAGAGCGGCGAGGCCGGGGTCCTTGGGCAGGGTTCCTCCGC
chr12   100701905       100702025       GCGTGGTCCACGGAGGGAGCGAAGGCCAGCGGCTCTGCAGCGCGTCCCAAGGGTTAGGAGCGCAGGGTCCGGGGCTCCCGGGACTCAGAGCGGAGGAACCCTGCCCAAGGACCCCGGCCT
chr12   100701990       100702110       GAGCCGCTGGCCTTCGCTCCCTCCGTGGACCACGCTCCACACCAGGGACACCGCCGTGATATTGGGCTGTCTTGAGGTTGAAATGAGTCTGTACGCTTGGAATTGTGTCTGGGCAGAGCT
chr13   100745132       100745252       GCGCCGGCGAGGCTGTGAGCCACGCTTTGGTCCTGTGTGACTGTGAGGGGATGTGCAGGAACGGGGACAGGCTGGCTGAAGCAAGGCAACCAGATAGGAGCCATTGGAAAGCTCCAGATT
```
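The four intermediate files can be avoided with one awk pass. The sketch below is an alternative, not the commands that were run: it assumes labels of the form `chr:start-end(strand)` as in the output above, and, unlike the original steps, it keeps the strand as a fifth column. `getfasta_tab_to_bed` is a hypothetical helper name.

```bash
# Hypothetical one-pass alternative: split "chr:start-end(strand)" labels from
# the two-column getfasta output and upper-case the sequence.
getfasta_tab_to_bed() {
    awk -F'\t' -v OFS='\t' '{
        label = $1
        strand = substr(label, length(label) - 1, 1)   # character inside the parentheses
        sub(/\([+-]\)$/, "", label)                    # drop the "(+)" or "(-)" suffix
        split(label, part, /[:-]/)                     # chr, start, end
        print part[1], part[2], part[3], toupper($2), strand
    }' "$1"
}
```

Removing the strand suffix before splitting on `-` matters: a plain `tr '-' '\t'` also touches the minus sign inside `(-)`, which is why the original commands delete the strand first.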

Next we look for all the gRNAs in it. Since the search uses the original library and not the C3H sequence of the gRNAs, some gRNAs are not found. However, we know that the process was correct because we got one sequence for each line in the coordinates file.

```
$ grep -f mouse_gRNAs_GC-longer_not-term_noDup_no-BspQI.fasta C3H_cutsite120nt_Seq.bed | wc -l
1673
```

We check that there are 60 nucleotides on each side of the cut site.

```bash
chr15   100614291       100614411       AAAAGAAACTCTTGAGACTGGCAGAAGGTTTTCATCTGATGGGCATCCTTTACAAAACAAATAAGGATCAAAGCCCTAAAACCACCCCAGCTTCTGTCATTTAGCCCTCTTTGGAGAGAC

The gRNA is CATCCTTTACAAAACAAATAAGG, ENSMUSG00000023020_gR434r

So we have: AAAAGAAACTCTTGAGACTGGCAGAAGGTTTTCATCTGATGGGCATCCTTTACAAAACAA on the left and  
            ATAAGGATCAAAGCCCTAAAACCACCCCAGCTTCTGTCATTTAGCCCTCTTTGGAGAGAC on the right.
This is 60nt on each side. perfect.

Another one, the ones that were different on both ends:
chr6    100230409       100230529       CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGGGGTGGGAAAGGGGAGAGAGGGAGGGAAAGAAAGAGGGAAGGAGGGAGGGAGGGAGAGAGAGAGAGGGAGAGAGA

TGGGGTGGGAAAGGGGAGAGAGG, ENSMUSG00000033788_gR434r

We can see that it has 60nt on each side of the cutsite! Perfect
```
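The manual check above can be expressed as a position test: with 43 nt before the gRNA and 17 nt of gRNA before the cut, the 23-nt target should occupy positions 44 to 66 of the 120-nt window. A small sketch; `is_centered` is a hypothetical helper and expects the window and the target on the same strand:

```bash
# Hypothetical check: succeed if the 23-nt target ($2) occupies positions
# 44-66 of the 120-nt window ($1), i.e. the cut site sits after position 60.
is_centered() {
    [ "$(printf '%s\n' "$1" | cut -c44-66)" = "$2" ]
}
```

It returns success when the target is centered and failure otherwise, so it can be used in an `if` or with `&&`.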

The only quality-control step is to remove the few sequences that contain N nucleotides. Even if the Ns do not overlap the gRNA, they would cause errors in the simulation and in inDelphi.

```
$ grep N C3H_cutsite120nt_Seq.bed
chr12   107365098       107365218       TAGCTCAATTGAAAGACAATCTATCCCATAAGAAGCATTTAGGGGGCTGGTGAGATGGCTCAGTGGGTAAGAGCACCCGACTGCTCTTCCGAAGGTCCGAAGTTCAAATCCCANNNNNNN
chr12   114355470       114355590       CTTGCCTTTGCTTCCTGACTGAGATGTGGGGGTCAGATAAGCCACAAGTATATCAGCCAGAGCTGGACGAATGAAGACAAGAGGGGCTGGGGGNNNNNNNNNNNNNNNNNNNNNNNNNNN
chr12   114355483       114355603       NNNNNNNNNNNNNNCCCCCAGCCCCTCTTGTCTTCATTCGTCCAGCTCTGGCTGATATACTTGTGGCTTATCTGACCCCCACATCTCAGTCAGGAAGCAAAGGCAAGGTACTGCCTAGCC
chr4    16724725        16724845        NNNNNNNCTGACTCCGCACCTCCTCTCACCAACAATCTTAATCTCCACCTTTGGTGACCTGGTAGGTTCAAGTGACCTTGCAAAATTTTCTAATCCTCATCTCGTGATTTGGGTCCTTTT
chr4    16725329        16725449        NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCCGGGTGAAAGTCAGGGCAGCAGTTCTGAGGCGGGGAAGCTGGGACGCCAGCAAGCCGCGGGTCAGGAGGCTGAGCACTGCAGTC
chr14   25760800        25760920        GGAGAGAAGAGGGCCAGGGAGGAGTTAGAAAGAGGAGTGGGAGAGTAGAAAGGAAGGGAAAATAGGCGGAGCTGGGGGGGGGGGGGGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
chr14   25760825        25760945        CAAGAGCAGAGGGAGGGTAAGGTCTGGAGAGAAGAGGGCCAGGGAGGAGTTAGAAAGAGGAGTGGGAGAGTAGAAAGGAAGGGAAAATAGGCGGAGCTGGGGGGGGGGGGGGGGNNNNNN
chr2    27853167        27853287        TCAGTGGGTAAGAGCACCCGACTGCTCTTCCGAAGGTCAGGAGTTCAAATCCCAGCAACCACATGGTGGCTCACAACNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
chr18   33589301        33589421        CTCTTCTGGCGTGGGGGGGGGGGGGGGGGGGGNNAGGAACGTGACAAGACAGAGCACAGCTTTTGGGCTGCCAGTCTAGGGTGCTGTTGGACGTCTCGCCCGCGAGGCACTGAGGTGGAT
chr19   40673920        40674040        AGAAAGAGGGCACAGCCGCCGCTAGGCTAAAATCTGTTCTTCCGGTCCGAGCAGCCCCACCTGTGGCGCCCAGGACCCTTTAGGGTACCACGAGGGCCTAGGNNNNNNNNNNNGCGCCTG
chr19   40673942        40674062        GATGTCACGGGGAATCGTCCGCCAGGCGCNNNNNNNNNNNCCTAGGCCCTCGTGGTACCCTAAAGGGTCCTGGGCGCCACAGGTGGGGCTGCTCGGACCGGAAGAACAGATTTTAGCCTA
chr4    41761814        41761934        TTTCCTAGCATCTTGTCTTCTCGGAGCATTTAGTCCTGAAGTCAGTCTTTCCTATTTAACTGCTGGTCCCCCTTCCCCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
chr10   76287652        76287772        CACAGGGACACAGTACCTTCCAAGGCTGTAGGACCCCTCAAGACAGGCACATGGTTTCTAGTCTGGGTTGGATAAAGCTCCCCACTCTGGTACATGCCTCAGTTTCTTCAGAAGTCNNNN

$ grep N C3H_cutsite120nt_Seq.bed | wc -l
13
$ grep -v N C3H_cutsite120nt_Seq.bed | wc -l
1722
$ grep -v N C3H_cutsite120nt_Seq.bed >C3H_cutsite120nt_Seq_flt.bed
$ wc -l C3H_cutsite120nt_Seq*
  1735 C3H_cutsite120nt_Seq.bed
  1722 C3H_cutsite120nt_Seq_flt.bed
  3457 total
$ grep N C3H_cutsite120nt_Seq_flt.bed | wc -l
0
```

So the final set has 13 fewer gRNAs, and we end up with 1722 gRNAs.

Now we give the files a proper format. We need to match each gRNA ID with the sequence of its cut site, which we can do with a join on the window start coordinate (second column of both files).

```
$ join -j 2 -o 2.4,1.4 -t $'\t' <(sort -k2 C3H_cutsite120nt_Seq_flt.bed) <(sort -k2 C3H_cutsite120nt.bed) > C3H_cutsite120nt_Seq_id.bed
```

The problem is that some gRNAs are repeated!

```
$ cut -f1 C3H_cutsite120nt_Seq_id.bed | sort | uniq -c | sort -nk 1 | tail -n 5
      1 ENSMUSG00000116461_gR176f
      2 ENSMUSG00000028322_gR170f
      2 ENSMUSG00000028322_gR171r
      2 ENSMUSG00000028414_gR483r
      2 ENSMUSG00000028414_gR493f
```

We look at them individually and remove the sequence that is not correct. For instance, for the first one it is clear that the second sequence is not correct, because the gRNA sequence is not in the middle. We remove them manually. They are the gRNAs that exist both as forward and as reverse guides at the same site; since they share the join key, each ID was paired with both sequences, so it is just a matter of removing one of them.

```
$ grep ENSMUSG00000028322_gR170f C3H_cutsite120nt_Seq_id.bed
ENSMUSG00000028322_gR170f       ATGCTATTCCCTCCTCACCTGTCATGTCTGAATCCACTGCAGCAGGAAAATGGCCCTAGTCCAAGGGCCATGGGCACTATTGGCCATCTGTTCATGGTGGCTCTAAATCCTTGGAGCTGG
ENSMUSG00000028322_gR170f       CCAGCTCCAAGGATTTAGAGCCACCATGAACAGATGGCCAATAGTGCCCATGGCCCTTGGACTAGGGCCATTTTCCTGCTGCAGTGGATTCAGACATGACAGGTGAGGAGGGAATAGCAT

ENSMUSG00000028414_gR483r	GAGAGGGTGGAGTTAAACTGTAGCTCAGTTGACAGAGCGTTTGCCTAGCATGCAGGAAGCCCTGGGTTCAATCTCAAGCACTGTATAAACCAGACATGATAGCAGGGGGGCCAGGAGCTC
ENSMUSG00000028414_gR493f	GAGAGGGTGGAGTTAAACTGTAGCTCAGTTGACAGAGCGTTTGCCTAGCATGCAGGAAGCCCTGGGTTCAATCTCAAGCACTGTATAAACCAGACATGATAGCAGGGGGGCCAGGAGCTC
ENSMUSG00000028414_gR483r	GAGAGGGTGGAGTTAAACTGTAGCTCAGTTGACAGAGCGTTTGCCTAGCATGCAGGAAGCCCTGGGTTCAATCTCAAGCACTGTATAAACCAGACATGATAGCAGGGGGGCCAGGAGCTC
ENSMUSG00000028414_gR493f	GAGAGGGTGGAGTTAAACTGTAGCTCAGTTGACAGAGCGTTTGCCTAGCATGCAGGAAGCCCTGGGTTCAATCTCAAGCACTGTATAAACCAGACATGATAGCAGGGGGGCCAGGAGCTC
```
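One way to avoid the repeated pairings, sketched here as an alternative to the manual clean-up, is to match on the full `chr:start-end(strand)` label that `bedtools getfasta -s -tab` writes instead of on the start coordinate alone. This assumes the untouched two-column getfasta output is still available; `pair_ids_with_seqs` is a hypothetical helper.

```bash
# Hypothetical helper: pair each gRNA ID with its window sequence.
# $1: BED6 file with the windows (chr, start, end, ID, score, strand)
# $2: two-column getfasta output ("chr:start-end(strand)" <TAB> sequence)
pair_ids_with_seqs() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { seq[$1] = toupper($2); next }        # label -> sequence
        {
            key = $1 ":" $2 "-" $3 "(" $6 ")"            # rebuild the label from the BED columns
            if ((key in seq) && seq[key] !~ /N/) print $4, seq[key]
        }
    ' "$2" "$1"
}
```

Each BED line yields at most one output line, so two guides that share a window start but lie on different strands each get their own sequence, and windows containing N are skipped in the same pass.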

So in the end we have each sequence, with the cut site in the middle, next to its ID in the `C3H_cutsite120nt_Seq_id_flt.bed` file.

```
$ wc -l C3H_cutsite120nt_S*
   1735 C3H_cutsite120nt_Seq.bed
   1722 C3H_cutsite120nt_Seq_flt.bed
   1726 C3H_cutsite120nt_Seq_id.bed
   1722 C3H_cutsite120nt_Seq_id_flt.bed
```

Now, we convert it to the format used to simulate the data. That's it! 

```bash
$ head -n 1 C3H_cutsite120nt_Seq_id_flt.bed
ENSMUSG00000033788_gR434r       CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGGGGTGGGAAAGGGGAGAGAGGGAGGGAAAGAAAGAGGGAAGGAGGGAGGGAGGGAGAGAGAGAGAGGGAGAGAGA
$ tr '\t' ',' <C3H_cutsite120nt_Seq_id_flt.bed >C3H_targets.csv
$ head -n 1 C3H_targets.csv
ENSMUSG00000033788_gR434r,CTCCTAAAGATTGATAATGGTGACAGCTAAACTTCCCCCACTATGGGGTGGGAAAGGGGAGAGAGGGAGGGAAAGAAAGAGGGAAGGAGGGAGGGAGGGAGAGAGAGAGAGGGAGAGAGA
$ wc -l C3H_targets.csv
1722
```