# Jupyter notebook: ncRNA genome mapping and analysis

## Overview

This Jupyter notebook attempts to map the sequences of a selection of non-coding RNAs (ncRNAs) that showed differential fitness when targeted by CRISPRi repression to the reference genome. For details, see the [CRISPRi library GitHub repository](https://github.com/m-jahn/R-notebook-crispri-lib), which contains this notebook and further information. The R analysis pipeline that led to the selection of the ncRNAs of interest can be viewed on [m-jahn.github.io](https://m-jahn.github.io/R-notebook-crispri-lib/CRISPRi_V2_data_processing.nb.html).

Tasks:

- import ncRNA sequences; probably also the genome in `genbank` format
- map ncRNAs to the *Synechocystis* sp. PCC 6803 genome
- structural and/or functional analysis of ncRNAs

## Import of required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sb
import Bio.Align
import Bio.AlignIO
import Bio.SeqIO

## Import and reshaping of data

In [2]:
data = pd.read_csv('../data/output/fitness_ncRNA.csv')
data.head(n = 5)

Unnamed: 0,sgRNA_target,sgRNA_number,condition,carbon,light,treatment,mean_fitness,wmean_fitness,top1_fitness,top2_fitness,sd_fitness,p_value,p_value_adj,score,ncRNA_type,sequence,length,direction,locus,comment
0,ncl0110,1,"HC, HL",HC,HL,,-6.231719,-6.231719,-6.231719,,,0.04878,0.221471,4.079799,ncRNA,AGTCCTGTTGGTCAAAATAATTTCGTTAAAATTAGCATTAGCATCG...,61,reverse,ncl0110,
1,ncl0110,1,"HC, IL",HC,IL,,-5.6382,-5.6382,-5.6382,,,0.04878,0.236947,3.525842,ncRNA,AGTCCTGTTGGTCAAAATAATTTCGTTAAAATTAGCATTAGCATCG...,61,reverse,ncl0110,
2,ncl0110,1,"HC, LL",HC,LL,,-4.527253,-4.527253,-4.527253,,,0.04878,0.245046,2.765033,ncRNA,AGTCCTGTTGGTCAAAATAATTTCGTTAAAATTAGCATTAGCATCG...,61,reverse,ncl0110,
3,ncl0110,1,"HC, LL, -N",HC,LL,-N,-5.203274,-5.203274,-5.203274,,,0.04878,0.238071,3.243167,ncRNA,AGTCCTGTTGGTCAAAATAATTTCGTTAAAATTAGCATTAGCATCG...,61,reverse,ncl0110,
4,ncl0110,1,"HC, LL, +FL",HC,LL,+FL,-4.752213,-4.752213,-4.752213,,,0.04878,0.251682,2.847279,ncRNA,AGTCCTGTTGGTCAAAATAATTTCGTTAAAATTAGCATTAGCATCG...,61,reverse,ncl0110,


In [3]:
data = data.groupby('sgRNA_target')
data_only_targets = data.agg(sequence = pd.NamedAgg(aggfunc = 'unique', column = 'sequence'))
data_only_targets.head()

sgRNA_target,sequence
ncl0110,[AGTCCTGTTGGTCAAAATAATTTCGTTAAAATTAGCATTAGCATC...
ncl0200,[GACCACAATTAAGCTGATATCCCCAAGTTGTCCCCCGTTGGCCAT...
ncl0320,[TGTTATGGATTGTCACCGTCGGATTTGCTTCCATTGGTGCATTGC...
ncl0360,[ATTGCTAACCAGGCGGCCCTGCGACAGCCCCAAGCTGTCCCCCGT...
ncl0400,[ATGGGCTAAAAATAAATTTCCCTAGCCCCCTCATACATTCTGAGC...
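
Because `unique` returns an array per target, the sequence has to be pulled out of that array later on. A minimal sketch of an alternative, assuming each target carries exactly one sequence, is to count the distinct sequences with `nunique` and keep the first one as a plain string. The toy table and the names `toy` and `per_target` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the fitness table (hypothetical values, two targets)
toy = pd.DataFrame({
    'sgRNA_target': ['ncl0110', 'ncl0110', 'ncl0200'],
    'sequence': ['AGTCCTG', 'AGTCCTG', 'GACCACA'],
})

# Count distinct sequences per target and keep the first one as a string
per_target = (
    toy.groupby('sgRNA_target')['sequence']
    .agg(['nunique', 'first'])
    .rename(columns={'nunique': 'n_sequences', 'first': 'sequence'})
)

# Every target should map to exactly one sequence
assert (per_target['n_sequences'] == 1).all()
print(per_target)
```

If the check fails for real data, the array-based aggregation is the safer choice because it keeps all variants visible.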


In [4]:
# Parse the GenBank file; wrapping the iterator in a list keeps the records available after the loop
ref_genome = list(Bio.SeqIO.parse('/home/michael/Documents/SciLifeLab/Resources/MS/databases/Synechocystis/Synechocystis_PCC6803_NC_000911.gbk', 'genbank'))
for record in ref_genome:
    print('ID %s' % record.id)
    print('Sequence length %i' % len(record))

ID NC_000911.1
Sequence length 3573470


## Simple text-based alignments

Now that the reference genome is loaded, how can we align (text-based) sequences to the GenBank record?
Simple text-based alignments can be done in Biopython using a `Bio.Align.PairwiseAligner()` object.
The following code chunk is an example.

In [5]:
aligner = Bio.Align.PairwiseAligner()
aligner.open_gap_score = -0.5
aligner.extend_gap_score = -0.1
aligner.target_end_gap_score = 0.0
aligner.query_end_gap_score = 0.0
alignments = aligner.align('TACCGAACCCGGATTCGATCGATCGGGATGCA', 'AGCACCCGGAT')

for i in range(10):
    if i < len(alignments):
        print('Score = %.1f:' % alignments[i].score)
        print(alignments[i])

Score = 9.4:
TACCGAACCCGGATTCGATCGATCGGGATGCA
-|--|.||||||||------------------
-A--GCACCCGGAT------------------

Score = 9.4:
TACCGAACCCGGATTCGATCGATCGGGATGCA
-|.|--||||||||------------------
-AGC--ACCCGGAT------------------
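
The score of 9.4 can be reproduced by hand. `PairwiseAligner` defaults to a match score of 1 and a mismatch score of 0, and each of the two alignments has 10 matches, one mismatch, and one internal gap of length 2 (-0.5 to open, -0.1 to extend), while end gaps are free with the settings used here. The sketch below recomputes the score from the printed rows; `score_rows` is a hypothetical helper written for this illustration, not part of Biopython.

```python
import re

def score_rows(target_row, query_row, match=1.0, mismatch=0.0,
               open_gap=-0.5, extend_gap=-0.1):
    """Score two gapped alignment rows; end gaps are free."""
    score = 0.0
    # Matches and mismatches: columns where both rows hold a letter
    for t, q in zip(target_row, query_row):
        if t != '-' and q != '-':
            score += match if t == q else mismatch
    # Internal gaps: runs of '-' flanked by letters on both sides
    for row in (target_row, query_row):
        for run in re.finditer(r'(?<=[^-])-+(?=[^-])', row):
            gap_len = run.end() - run.start()
            score += open_gap + extend_gap * (gap_len - 1)
    return score

target = 'TACCGAACCCGGATTCGATCGATCGGGATGCA'
for query_row in ('-A--GCACCCGGAT------------------',
                  '-AGC--ACCCGGAT------------------'):
    print('Score = %.1f' % score_rows(target, query_row))
```

Both rows evaluate to 9.4 (10 matches minus 0.6 for the gap), matching the scores reported by the aligner.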



## High performance alignment using external tools

The next step is to align the ncRNA sequences to the reference genome. This can be very time-consuming, so we try to select a tool that is appropriate for the task. Most importantly, many tools are optimized to align several (shorter) sequences to each other (multiple sequence alignment), whereas here we want to align short sequences to a genome, that is, to an extremely large single sequence or collection of sequences.
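
As a quick baseline before turning to external tools, a sequence that occurs unchanged in the genome can be located with an exact substring search on both strands. This is only a sketch: it finds perfect matches, not alignments, and `find_exact` and the toy sequences are hypothetical. With the real data, the genome string would presumably come from `str(record.seq)`.

```python
# Complement table for reverse-complementing plain DNA strings
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_exact(genome, query):
    """Return (start, strand) for every exact occurrence of query.

    start is 0-based on the forward strand; strand is '+' or '-'.
    """
    hits = []
    for strand, pattern in (('+', query), ('-', reverse_complement(query))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(pattern, start + 1)
    return hits

# Toy genome containing the query once on each strand
toy_genome = 'TTGACCGTAAGCTTACGGTCAA'
print(find_exact(toy_genome, 'GACCGTAA'))
```

Any sequence that such a search does not find, for example because of mismatches, still needs a proper alignment tool.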

In [6]:
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

We try a BLAST search, preferably restricted to the genome of interest, *Synechocystis* sp. PCC 6803. This is done with the `entrez_query` parameter and the taxonomy ID of `Synechocystis sp. PCC 6803`; the query roughly has the format `entrez_query='txid1148[ORGN]'`. Additional terms can be combined using ` AND `. It is important to note that the result handle of the BLAST search can be read only once and is exhausted after that.
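
A minimal sketch of how such a query string could be assembled is shown here. `build_entrez_query` is a hypothetical helper, and the extra term in the second call is only an illustration of the ` AND ` syntax, not a filter used in this notebook.

```python
def build_entrez_query(taxid, extra_terms=()):
    """Combine a taxonomy filter with optional extra Entrez terms."""
    terms = ['txid%d[ORGN]' % taxid]
    terms.extend(extra_terms)
    return ' AND '.join(terms)

# 1148 is the taxonomy ID given in the text
print(build_entrez_query(1148))
# Illustrative extra term (assumption): restrict hits to genomic records
print(build_entrez_query(1148, ['biomol_genomic[PROP]']))
```

The resulting string would be passed as the `entrez_query` argument of `NCBIWWW.qblast`.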

Before submitting sequence data to BLAST, it is necessary to rearrange them into a FASTA-like format (`*.fasta`). This seems to speed up retrieval of results from the NCBI server, whereas retrieval often takes longer or stalls when only a raw sequence is submitted.
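
The FASTA-like string can be built with a small helper. The sketch below is one convenient layout under stated assumptions: `to_fasta` is hypothetical, and wrapping the sequence over several lines is a common convention rather than a requirement. The example sequence is the first 25 nucleotides of `ncl0110` as shown in the table above.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = ['>' + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return '\n'.join(lines)

# A short width is used here only to make the wrapping visible
print(to_fasta('ncl0110', 'AGTCCTGTTGGTCAAAATAATTTCG', width=10))
```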

In [7]:
%%time

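# Write one entry per BLAST hit to a text file; the with-block closes it even on errors
with open('../data/output/ncRNA_alignment.txt', 'w') as file:
    for index, row in data_only_targets.iterrows():
        # FASTA-like query: header line with the target name, then the sequence
        str_query = '>' + index + '\n' + row['sequence'][0]
        # Remote BLAST search against 'nt', restricted by an Entrez query
        result_handle = NCBIWWW.qblast(
            program = 'blastn',
            database = 'nt',
            sequence = str_query,
            entrez_query = 'Synechocystis sp. PCC 6803 chromosome, complete genome')
        # The handle can be read only once, so parse it right away
        result_record = NCBIXML.read(result_handle)
        for alignment in result_record.alignments:
            file.write('alignment of: ' + index + '\n')
            file.write('sequence: ' + alignment.title + '\n')
            file.write('length: ' + str(alignment.length) + '\n')
            # Report only the first HSP of each hit
            file.write(str(alignment.hsps[0]) + '\n\n')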

CPU times: user 16.2 s, sys: 1.99 s, total: 18.2 s
Wall time: 7h 54min 42s
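
Given the long wall time, it may be worth keeping the raw results so that a run does not have to be repeated. Since a result handle can be read only once, one option is to read it into a string and save that to disk before parsing. The sketch below uses `io.StringIO` as a stand-in for the handle and a temporary file path; both are assumptions for illustration, and no BLAST search is performed.

```python
import io
import os
import tempfile

# Stand-in for a BLAST result handle (assumption: any file-like object)
result_handle = io.StringIO('<BlastOutput>toy content</BlastOutput>')

# Read the handle once and keep the text
xml_text = result_handle.read()
leftover = result_handle.read()  # a second read returns an empty string

# Save the text so it can be reopened and parsed as often as needed
xml_path = os.path.join(tempfile.mkdtemp(), 'ncl0110_blast.xml')
with open(xml_path, 'w') as out:
    out.write(xml_text)

with open(xml_path) as saved:
    reread = saved.read()
print(reread == xml_text, repr(leftover))
```

A saved file of this kind could later be opened and handed to `NCBIXML.read` in place of the live handle.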
