## Biokotlin Range Testing

Biokotlin SeqPositions are objects that contain an optional SeqRecord (holding a DNA/RNA or amino acid sequence) and a site.

Biokotlin SRanges are objects containing an interval of SeqPositions.

This notebook demonstrates operations to construct and manage Biokotlin SRanges.

### Setup initial imports and dependencies

In [None]:
// If this jar does not exist, run from the command line: ./gradlew shadowjar
// This is old code; use it only if the imports below don't work
//@file:DependsOn("../build/libs/biokotlin-0.03-all.jar")
@file:Repository("https://jcenter.bintray.com/")
@file:DependsOn("org.biokotlin:biokotlin:0.03")

In [None]:
import biokotlin.genome.*
// Import biokotlin.seq.* because ranges use NucSeqRecord
import biokotlin.seq.NUC.*
import biokotlin.seq.*
import java.util.*


### Create a Sequence and a SeqRecord.  

By default, calling Seq(<sequence>) or NucSeq(<sequence>) creates a NucSeq object holding a DNA/RNA sequence. To create an amino acid sequence, call ProteinSeq(<sequence>) instead.
    
In the example below, the name of the sequence is "1".  
    

In [None]:

val seq = Seq("GCAGAT")

In [None]:
val rec1 = NucSeqRecord(NucSeq("ATAACACAGAGATATATC"),"1")
val rec1a = NucSeqRecord(seq,"1")
println(rec1)
println(rec1a)

### Create a subsetted sequence

To subset a sequence, give the start and end indices you would like included for this sequence. As the comments in the cell below note, the slice indices are 0-based and both ends are inclusive.
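
The comments in the cell below describe the slice as 0-based with both ends included. That convention can be checked on a plain Kotlin String, without any Biokotlin classes. This is only an illustrative sketch: `bases` and `sliced` are names introduced here, and it does not show how Biokotlin implements slicing.

```kotlin
// Plain-Kotlin illustration of a 0-based, inclusive slice (not Biokotlin code).
val bases = "ATAACACAGAGATATATC"

// Indices 1 through 6 inclusive, counting from 0, give six characters.
val sliced = bases.slice(1..6)   // "TAACAC"
```

The six characters match the TAACAC that the notebook comment expects from rec1[1..6].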

In [None]:
// Create a subset of the sequence in rec1: positions 1 to 6
// Slice indices are 0-based, so this should pull TAACAC
// Both ends of the slice (1..6) are inclusive, giving 6 bases in total
val subSettedSeq = rec1[1..6]
subSettedSeq

### Create a SeqPosition

Biokotlin SeqPositions may be created with or without a SeqRecord object.

If a SeqPosition is created with a SeqRecord object, the site value must not exceed the length of the stored sequence.
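
The length rule above can be pictured with a small stand-in class. This is a minimal sketch, assuming the check is simply "site is no larger than the sequence length"; `SimplePosition` is a hypothetical name used for illustration and is not Biokotlin's SeqPosition.

```kotlin
// Hypothetical stand-in for illustration only; not Biokotlin's SeqPosition.
data class SimplePosition(val sequence: String?, val site: Int) {
    init {
        // Validate against the sequence only when one is attached.
        if (sequence != null) {
            require(site <= sequence.length) {
                "site $site exceeds sequence length ${sequence.length}"
            }
        }
    }
}

// Accepted: site 8 fits inside an 18-base sequence.
val withRecord = SimplePosition("ATAACACAGAGATATATC", 8)

// Accepted: no sequence is attached, so there is nothing to check against.
val withoutRecord = SimplePosition(null, 8)
```

With this sketch, `SimplePosition("GCAGAT", 8)` throws an IllegalArgumentException because the sequence has only 6 bases.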

In [None]:
val seqPos1 = SeqPosition(rec1, 8)
println(seqPos1)
val seqPosNoRecord = SeqPosition(null,8)
println(seqPosNoRecord)

### Create a Biokotlin SRange

To create an SRange object, specify a SeqRecord and provide the start and end coordinates for the range. The starting and ending SeqPosition objects are then created using the specified SeqRecord for both.

Alternatively, you can explicitly specify the SeqPosition objects for the start and end coordinates.

In [None]:
// create a Sequence Range (SRange) object from a SeqRecord
val sRange = rec1.range(8..12)
println(sRange)

// Create a Sequence Range (SRange) object from SeqPosition objects
// This also demonstrates the "plus" operator
val sRange2 = seqPos1..seqPos1.plus(4)
println(sRange2)

## Flanking
Flanking ranges is similar to bedtools flanking. flankBoth will create two new flanking intervals, one interval on each side of the SRange interval. For an SRange Set it will create two new flanking intervals for each interval in the set.

If the SRange contains a SeqRecord, flank will restrict the created flanking intervals to the size of the chromosome (i.e. no start < 0 and no end > chromosome size).

The image below shows flanking both sides by 10bps, and just flanking the left side.  
![Range_Flank.png](../resources/Range_Flank.png)
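
The clamping behaviour can be sketched on plain IntRange values. This is a minimal sketch under stated assumptions, not Biokotlin's implementation: coordinates are taken to be 1-based and inclusive, and the function names `flankLeftOf`, `flankRightOf` and `flankBothOf` are hypothetical.

```kotlin
// Illustrative flanking on IntRange; assumes 1-based, inclusive coordinates.

// Interval of up to `count` positions immediately left of `range`, or null if none fit.
fun flankLeftOf(range: IntRange, count: Int): IntRange? {
    val start = maxOf(1, range.first - count)   // never before the first position
    val end = range.first - 1
    return if (start <= end) start..end else null
}

// Interval of up to `count` positions immediately right of `range`, clipped to the sequence.
fun flankRightOf(range: IntRange, count: Int, seqLength: Int): IntRange? {
    val start = range.last + 1
    val end = minOf(seqLength, range.last + count)   // never past the sequence end
    return if (start <= end) start..end else null
}

// Both flanks, dropping a side that has no room.
fun flankBothOf(range: IntRange, count: Int, seqLength: Int): List<IntRange> =
    listOfNotNull(flankLeftOf(range, count), flankRightOf(range, count, seqLength))

// A range 8..12 on an 18-base sequence:
val bothFlanks = flankBothOf(8..12, 5, 18)    // [3..7, 13..17]
val rightFlank = flankRightOf(8..12, 10, 18)  // 13..18, clipped at the sequence end
```

Under these assumptions, a right flank of 10 on 8..12 stops at 18 because the sequence is only 18 bases long.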



In [None]:
val sRangeFlanked = sRange.flankBoth(5)

In [None]:
println("sRange: ${sRange}")
println("sRangeFlanked: ${sRangeFlanked}")

In [None]:
val sRangeFlankRight = sRange.flankRight(10)

In [None]:
// Why does it stop at 18?  Because the DNA sequence is only 18 chars long!
println(sRange)
println(sRangeFlankRight)

In [None]:
val sRangeFlankLeft = sRange.flankLeft(4)
println(sRangeFlankLeft)

## Create a set of SRanges (SRangeSet)

SRangeSet is a set of SRanges. Operations may be performed on it much the same as on an SRange.

When adding the sequences to an SRangeSet, they will be sorted based on a default comparator.  

The cell below builds two such sets, which are intersected in the next section:

In [None]:
%use krangl
val dnaString = "ACGTGGTGAATATATATGCGCGCGTGCGTGGATCAGTCAGTCATGCATGCATGTGTGTACACACATGTGATCGTAGCTAGCTAGCTGACTGACTAGCTGACCGTACGTACGTATCAGTCAGCTGACACGTGGTGAATATATATGCGCGCGTGCGTGGATCAGTCAGTCATGCATGCATGTGTGTACACA"
val dnaString2 = "ACGTGGTGAATATATATGCGCGCGTGCGTGGACGTACGTACGTACGTATCAGTCAGCTGAC"
val dnaString3 = "TCAGTGATGATGATGCACACACACACACGTAGCTAGCTGCTAGCTAGTGATACGTAGCAAAAAATTTTTT"
val record1 = NucSeqRecord(NucSeq(dnaString), "Seq1")
val record2 = NucSeqRecord(NucSeq(dnaString2), "Seq2")
val record3 = NucSeqRecord(NucSeq(dnaString3), "Seq3")
val record4 = NucSeqRecord(NucSeq(dnaString2), "Seq2-id2")
       
val sr1 = record1.range(27..40)
val sr2 = record1.range(1..15)
val sr6 = record1.range(44..58)
val sr3 = record3.range(18..33)
val sr4 = record2.range(25..35)
val sr5 = record2.range(3..13)
val set1 = nonCoalescingSetOf(SeqRangeSort.by(SeqRangeSort.numberThenAlphaSort, SeqRangeSort.leftEdge), sr1,sr6,sr2,sr3,sr5,sr4)
val s1df:DataFrame = set1.toDataFrame()
println("SRangeSet 1:")
s1df.print()

val sr10 = record1.range(30..35)
val sr20 = record1.range(18..22)
val sr60 = record1.range(40..50)
val sr30 = record3.range(1..10)
val sr40 = record2.range(45..55)
val sr50 = record2.range(10..13)
val set2 = nonCoalescingSetOf(SeqRangeSort.by(SeqRangeSort.numberThenAlphaSort, SeqRangeSort.leftEdge), sr10,sr60,sr20,sr30,sr50,sr40)
val s2df:DataFrame = set2.toDataFrame()
println("SRangeSet 2:")
s2df.print()



## Intersections - similar to bedtools intersect

![Range_Intersect.png](../resources/Range_Intersect.png)
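
The idea of an interval intersection can be sketched with plain IntRange values. This is an illustrative sketch, assuming inclusive coordinates and ranges that all lie on the same sequence; `overlapOf` and `intersectAll` are hypothetical names, and the shape of the result returned by Biokotlin's own intersect may differ.

```kotlin
// Illustrative interval intersection on IntRange; not Biokotlin's implementation.

// The shared part of two inclusive ranges, or null when they do not overlap.
fun overlapOf(a: IntRange, b: IntRange): IntRange? {
    val start = maxOf(a.first, b.first)
    val end = minOf(a.last, b.last)
    return if (start <= end) start..end else null
}

// Every non-empty pairwise overlap between two lists of ranges.
fun intersectAll(left: List<IntRange>, right: List<IntRange>): List<IntRange> =
    left.flatMap { a -> right.mapNotNull { b -> overlapOf(a, b) } }

// The Seq1 coordinates from the two sets built above.
val seq1Set1 = listOf(1..15, 27..40, 44..58)
val seq1Set2 = listOf(18..22, 30..35, 40..50)
val seq1Overlaps = intersectAll(seq1Set1, seq1Set2)   // [30..35, 40..40, 44..50]
```

The pairwise loop is quadratic, which is fine for a handful of ranges; sorted sets allow a faster sweep.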

In [None]:
// Intersect the ranges above.
val intersections = set1.intersect(set2)
println("intersection size: ${intersections.size}")
val sidf:DataFrame = intersections.toDataFrame()
sidf.print()

## Coalescing and non-Coalescing Set of Ranges

Biokotlin allows the user to create sets of ranges that merge overlapping ranges or leave them independent.

The cells below again create some ranges, then build merged and non-merged sets of SRanges sorted by a specific comparator.
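
Merging overlapping ranges can be sketched as a sort followed by a single pass. This is a minimal sketch, not Biokotlin's coalescingSetOf: `SimpleRange` and `coalesce` are hypothetical names, coordinates are assumed inclusive, and only ranges that actually overlap on the same sequence are merged (whether adjacent ranges are also merged by Biokotlin is not assumed here).

```kotlin
// Illustrative merge of overlapping ranges; not Biokotlin's coalescingSetOf.
data class SimpleRange(val seqId: String, val start: Int, val end: Int)

fun coalesce(ranges: List<SimpleRange>): List<SimpleRange> {
    val merged = mutableListOf<SimpleRange>()
    // Sort by sequence id, then by left edge, so overlapping ranges become neighbours.
    val sorted = ranges.sortedWith(compareBy<SimpleRange>({ it.seqId }, { it.start }))
    for (r in sorted) {
        val last = merged.lastOrNull()
        if (last != null && last.seqId == r.seqId && r.start <= last.end) {
            // Overlap on the same sequence: extend the previous range.
            merged[merged.lastIndex] = last.copy(end = maxOf(last.end, r.end))
        } else {
            merged.add(r)
        }
    }
    return merged
}

// The same coordinates used for range1..range4 in the cells below.
val coalesced = coalesce(
    listOf(
        SimpleRange("Sequence 1", 8, 28),
        SimpleRange("Sequence 2", 25, 40),
        SimpleRange("Sequence 1", 27, 40),
        SimpleRange("Sequence 2", 3, 19)
    )
)
// Sequence 1: 8..28 and 27..40 merge into 8..40; the Sequence 2 ranges stay separate.
```

A non-coalescing set would keep all four ranges and only apply the sort.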

In [None]:
import kotlin.collections.*
import biokotlin.genome.SeqRangeSort.leftEdge

// Create some DNA strings and make ranges from them
val dnaString = "ACGTGGTGAATATATATGCGCGCGTGCGTGGATCAGTCAGTCATGCATGCATGTGTGTACACACATGTGATCGTAGCTAGCTAGCTGACTGACTAGCTGAC"
val dnaString2 = "ACGTGGTGAATATATATGCGCGCGTGCGTGGACGTACGTACGTACGTATCAGTCAGCTGAC"
val record1 = NucSeqRecord(NucSeq(dnaString), "Sequence 1", description = "The first sequence",
                annotations = mapOf("key1" to "value1"))
val record2 = NucSeqRecord(NucSeq(dnaString2), "Sequence 2", description = "The second sequence",
                annotations = mapOf("key1" to "value1"))

var range1 = SeqPositionRanges.of(record1,8..28)
var range2 = SeqPositionRanges.of(record2,3..19)
var range3 = SeqPositionRanges.of(SeqPosition(record1, 27),SeqPosition(record1,40))
var range4 = record2.range(25..40)



In [None]:
// create a list of ranges
var srangeList = mutableListOf<SRange>()
srangeList.add(range1)
srangeList.add(range4)
srangeList.add(range3)
srangeList.add(range2)

println("\nRanges in the List are:")
for(range in srangeList) {
    println(range.toString())
}

In [None]:
// Create a set of non-merged ranges
val comparator: Comparator<SRange> = SeqRangeSort.by(SeqRangeSort.numberThenAlphaSort,leftEdge)
val nonCoalsedSet = nonCoalescingSetOf(comparator, srangeList)

println("\nThe non-coalescing set has these values:")
for (range in nonCoalsedSet) {
    println(range.toString())
}

In [None]:
// Create a set that merges overlapping ranges
val coalesedSet = coalescingSetOf(comparator,srangeList)

println("\nThe coalescing set has these values:")
for (range in coalesedSet) {
    println(range.toString())
}


## Reading BedFiles as SRanges

Given a FASTA file of sequences and a BED file whose entries are relative to that FASTA, Biokotlin can create an SRange set. The SRange functions may be applied to this set, and Krangl may be used to display the results in DataFrame format.
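
For orientation, a BED file is tab-separated text whose first three columns are the chromosome name, a 0-based start and an exclusive end. The sketch below parses those three columns from a string. It is an illustration only: `BedInterval` and `parseBed` are hypothetical names, and it makes no claim about how bedfileToSRangeSet reads the file or converts coordinates.

```kotlin
// Illustrative BED parsing; not Biokotlin's bedfileToSRangeSet.
// BED columns: chrom, start (0-based), end (exclusive), then optional fields.
data class BedInterval(val chrom: String, val start: Int, val end: Int)

fun parseBed(text: String): List<BedInterval> =
    text.lineSequence()
        .map { it.trim() }
        // Skip blank lines, comments and track/browser header lines.
        .filter {
            it.isNotEmpty() && !it.startsWith("#") &&
                !it.startsWith("track") && !it.startsWith("browser")
        }
        .map { line ->
            val fields = line.split("\t")
            require(fields.size >= 3) { "BED line needs at least 3 columns: $line" }
            BedInterval(fields[0], fields[1].toInt(), fields[2].toInt())
        }
        .toList()

// Two made-up entries for illustration.
val parsed = parseBed("chr9\t0\t10\nchr10\t5\t20\tfeatureA\n")
```

Because BED starts are 0-based and ends are exclusive, the first entry here covers ten bases.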


In [None]:
// read fasta and bed file - create SRange Set

import biokotlin.genome.*

val fasta = "../src/test/kotlin/biokotlin/genome/chr9chr10short.fa"
val bedFile = "../src/test/kotlin/biokotlin/genome/chr9chr10_SHORTwithOverlaps.bed"
var srangeSet = bedfileToSRangeSet(bedFile,fasta)
println("Size of srangeSet: ${srangeSet.size}")

In [None]:
%use krangl
// read into a Krangl frame, print the data
var df:DataFrame = srangeSet.toDataFrame()
df.print()

## Intersections on Individual Ranges from the Bed File

The section above intersected two sets. The cells below intersect an individual range with a set.
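
Finding the ranges that touch a single target can be sketched as a filter over inclusive intervals. This is an illustrative sketch with made-up coordinates; `overlapping` is a hypothetical name and it is not Biokotlin's intersectingRanges.

```kotlin
// Illustrative overlap filter on IntRange; not Biokotlin's intersectingRanges.
// Two inclusive ranges overlap when each one starts no later than the other ends.
fun overlapping(target: IntRange, ranges: List<IntRange>): List<IntRange> =
    ranges.filter { it.first <= target.last && target.first <= it.last }

// A single-position target (such as a SNP at 220) against made-up ranges.
val hits = overlapping(220..220, listOf(100..219, 200..250, 220..230, 221..300))
// [200..250, 220..230]
```

Unlike a clipped intersection, this returns the original ranges that contain or touch the target.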



In [None]:
// Define a target range and determine which ranges overlap it
// We need to use a SeqRecord that exists in the SRangeSet we are checking for overlaps
// We get one by pulling the record from an existing range
val seqRecT = srangeSet.elementAt(0).start.seqRecord

// Create the range: here it is a single position (which could be a SNP)
val targetRange = SeqPosition(seqRecT,220)..SeqPosition(seqRecT,220)

// Now find ranges in our set which intersect with this position

val targetIntersections = targetRange.intersectingRanges(srangeSet)
println(targetIntersections)
println()

// print result as dataframe
var tdf:DataFrame = targetIntersections.toDataFrame()
tdf.print()

In [None]:
// Intersect a range spanning multiple positions
val targetRange2 = SeqPosition(seqRecT,70)..SeqPosition(seqRecT,200)
println("targetRange2: $targetRange2")
println()
println("srangeSet: $srangeSet")

val targetIntersections2 = targetRange2.intersectingRanges(srangeSet)
println("intersections: $targetIntersections2")

// print result as dataframe
var tdf:DataFrame = targetIntersections2.toDataFrame()
tdf.print()