# BioKotlin GFF Parsing using the FeatureTree Package
While the fundamental model of this package is substantially complete, many biological convenience functions built on this model have yet to be implemented.

## Simple Parsing and Basic Queries

First, generate a `Genome` from a GFF3 file:

In [1]:
@file:DependsOn("../build/libs/BioKotlin-0.08-all.jar")
import biokotlin.featureTree.*
import java.io.File
val genome = Genome.fromFile("resources/b73_shortened.gff")

Let's query a gene within this genome! The `byID` function is constant-time and returns a `Feature?`, so the example below uses `!!` to assert that the ID exists.

In [2]:
val gene = genome.byID("Zm00001eb000010")!!

`gene` is a `Feature` instance because it contains the data in the 9 columns of a GFF file. Let's query some of this data.

In [3]:
println(""" 
Source: ${gene.source}
Start: ${gene.start}
biotype Attribute: ${gene.attribute("biotype")}
""".trimIndent())

Source: NAM
Start: 34617
biotype Attribute: [protein_coding]


`gene` is also a `Parent` because it is the root of a tree of `Feature` instances. Let's get access to the direct children of `gene`. In this case, there is only one child, a transcript.

In [4]:
val geneChildren = gene.children
val transcript = geneChildren.first()
println(transcript.id)

Zm00001eb000010_T001


Of course, you can also walk back up the tree with `parent`.

In [5]:
println("${transcript.parent == gene}")

true


It is often useful to apply an operation for every node below the `parent` on the tree, not only its immediate children. This is where `descendants` comes in, which produces a sequence of all nodes below the parent. These nodes are in depth-first, left-to-right order. This sequence can be combined with the powerful Kotlin collections framework to make a range of interesting queries.

In [6]:
gene.descendants().map { it.type }.toList()

[mRNA, five_prime_UTR, exon, exon, CDS, three_prime_UTR, three_prime_UTR]

`subtree` is quite similar, except that it includes the receiver as well. Observe the inclusion of "gene" in the output.

In [7]:
gene.subtree().map { it.type }.toList()

[gene, mRNA, five_prime_UTR, exon, exon, CDS, three_prime_UTR, three_prime_UTR]
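
To make the traversal order concrete, here is a minimal sketch of a depth-first, left-to-right walk written with plain Kotlin sequences. The `Node` class and the `descendantsSketch` and `subtreeSketch` functions are hypothetical stand-ins for illustration only; they are not part of the `featureTree` API, and the package's own implementation may differ.

```kotlin
// Hypothetical stand-in for a feature-tree node; not the BioKotlin API.
class Node(val type: String, val children: List<Node> = emptyList())

// Depth-first, left-to-right walk that excludes the receiver.
fun Node.descendantsSketch(): Sequence<Node> = sequence {
    for (child in children) {
        yield(child)
        yieldAll(child.descendantsSketch())
    }
}

// Same walk, but the receiver is emitted first.
fun Node.subtreeSketch(): Sequence<Node> = sequence {
    yield(this@subtreeSketch)
    yieldAll(descendantsSketch())
}

val toyGene = Node("gene", listOf(
    Node("mRNA", listOf(Node("exon"), Node("CDS")))
))

println(toyGene.descendantsSketch().map { it.type }.toList()) // [mRNA, exon, CDS]
println(toyGene.subtreeSketch().map { it.type }.toList())     // [gene, mRNA, exon, CDS]
```

Because the walk is a lazy `Sequence`, operations such as `filter` and `first` only visit as many nodes as they need.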

# Mutability
The package supports quickly shifting from mutable to immutable representations of your feature tree. This allows for mutability when you need to modify something, but deep immutability when you want an extra assurance of correctness.

The package is highly opinionated and does not allow the client to form feature trees that do not constitute a valid GFF3 file.

## Getting a mutable genome

`MutableGenome` instances can be obtained either through creating a mutable clone of an existing `Genome` through the `mutable` function, parsing a file directly into a `MutableGenome`, or creating a blank instance.

In [8]:
// Mutable cloning
val immutable1 = Genome.fromFile("resources/b73_shortened.gff")
val mutable1 = immutable1.mutable()

// Directly parsing to mutable
val mutable2 = MutableGenome.fromFile("resources/b73_shortened.gff")

// Creating a blank MutableGenome
val mutable3 = MutableGenome.blank()

Of course, when you're done with your mutations, you can clone to an immutable instance if desired.

In [9]:
val immutable2 = mutable1.immutable()

## Point mutations

A "point mutation" is one that does not affect the topology of the tree. The feature tree framework supports all the point mutations that you'd expect, allowing for convenient modification of all nine columns of data. Let's make some modifications to an exon. First, let's print out the starting state of this exon.

In [10]:
val exon = mutable1.byName("Zm00001eb000010_T001.exon.1").first()
println(exon)

chr1	NAM	exon	34617	35318	.	+	.	Parent=Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1;	



Now let's do some mutations. Note that start and end cannot be modified directly; `setRange` must be used instead. See [Discontinuous Features](#Discontinuous-Features) for the reason.

In [11]:
exon.setRange(34000..36000)
exon.addAttribute("custom_attr", "42")
exon.setID("my_favorite_exon_id")
exon.name = "my_favorite_exon_name"
exon.strand = Strand.MINUS
exon.score = 1.0
exon.source = "my_source"
println(exon)

chr1	my_source	exon	34000	36000	1.0	-	.	ID=my_favorite_exon_id;Parent=Zm00001eb000010_T001;Name=my_favorite_exon_name;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1;custom_attr=42;	



### Illegal point mutations

The package is highly opinionated and will not allow states that do not represent valid GFF3 files. While the full documentation of these illegal mutations can be found in the API specification, it is worth discussing common illegal mutations.

The Parent attribute is defined by the actual topology of the tree and may not be directly modified as other attributes can.

In [12]:
try {
    exon.setAttribute("Parent", "NewParent")
} catch (e: IllegalArgumentException) {
    println(e.message)
}

The Parent attribute may not be directly modified. Its value is based on the actual structure of the tree.
Hint: use copyTo or moveTo.


IDs must be unique within a `Genome`!

In [13]:
try {
    exon.setAttribute("ID", "Zm00001eb000010")
} catch (e: IDConflict) {
    println(e.message)
}

IDs must be unique within a Genome.
ID in conflict: Zm00001eb000010
Feature already having the ID:
chr1	NAM	gene	34617	40204	.	+	.	ID=Zm00001eb000010;biotype=protein_coding;logic_name=cshl_gene;	

Feature that would conflict:
chr1	my_source	exon	34000	36000	1.0	-	.	ID=my_favorite_exon_id;Parent=Zm00001eb000010_T001;Name=my_favorite_exon_name;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1;custom_attr=42;	

Hint: within the feature tree framework, discontinuous features are represented as a single Feature object with
several start-end ranges and an equal number of phases, not as distinct objects.
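
The uniqueness rule and the constant-time `byID` lookup fit together naturally if IDs are kept in a hash map. The sketch below shows that idea in plain Kotlin. `IdIndex` is a hypothetical class invented for this illustration, features are reduced to strings, and nothing here is claimed to match how the package stores features internally.

```kotlin
// Hypothetical ID index; a HashMap gives expected constant-time lookup by ID.
class IdIndex {
    private val byId = HashMap<String, String>()

    // Refuses to register an ID that is already in use.
    fun register(id: String, feature: String) {
        val existing = byId[id]
        check(existing == null) { "ID in conflict: $id (already used by $existing)" }
        byId[id] = feature
    }

    // Returns null when the ID is unknown, mirroring a nullable lookup.
    fun byID(id: String): String? = byId[id]
}

val index = IdIndex()
index.register("Zm00001eb000010", "gene")
println(index.byID("Zm00001eb000010")) // gene
println(index.byID("absent"))          // null
println(runCatching { index.register("Zm00001eb000010", "exon") }.isFailure) // true
```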


## Topological mutations
A topological mutation is one that modifies the shape of the tree through insertion, deletion, or sorting.

Insertion adds new features to the tree, either as children of the root `Genome` or of a particular `Feature`.

In [14]:
val mRNA = mutable1.byID("Zm00001eb000010_T001")!!
mRNA.insert(
    seqid = "my_seqid",
    source = "my_source",
    type = "exon",
    range = 100..200,
    score = 42.0,
    strand = Strand.PLUS,
    phase = Phase.ONE,
    attributes = mapOf("ID" to listOf("my_exon"), "custom_attr" to listOf("custom value"))
)

my_seqid	my_source	exon	100	200	42.0	+	1	ID=my_exon;Parent=Zm00001eb000010_T001;custom_attr=custom value;	


Now observe our inserted exon!

In [15]:
println(mRNA.children.filter { it.source == "my_source" })

[chr1	my_source	exon	34000	36000	1.0	-	.	ID=my_favorite_exon_id;Parent=Zm00001eb000010_T001;Name=my_favorite_exon_name;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1;custom_attr=42;	
, my_seqid	my_source	exon	100	200	42.0	+	1	ID=my_exon;Parent=Zm00001eb000010_T001;custom_attr=custom value;	
]


Deletion removes a feature and all of its orphaned descendants from a tree.

In [16]:
val gene2 = mRNA.parent
println("Before deletion: ${gene2.children}")
mRNA.delete()
println("After deletion: ${gene2.children}")

Before deletion: [chr1	NAM	mRNA	34617	40204	.	+	.	ID=Zm00001eb000010_T001;Parent=Zm00001eb000010;biotype=protein_coding;transcript_id=Zm00001eb000010_T001;canonical_transcript=1;	
]
After deletion: []


Still have a reference to the deleted feature? Attempting to read from or write to it throws a `DeletedAccessException` so that you don't accidentally modify anything that is no longer present in the `Genome`.

In [17]:
try {
    println(mRNA.start)
} catch (e: DeletedAccessException) {
    println(e.message)
}

Do not access deleted features or any of their orphaned descendants


### Illegal topological mutations
To maintain the correctness of your model, some topological mutations are prohibited.

Creating a parent/child relationship that does not comport with the Sequence Ontology is prohibited. For the rare cases where you do intend to override the Sequence Ontology, see [Advanced Type Schema](#Advanced-Type-Schema).

In [18]:
try {
    gene2.insert(
        seqid = "my_seqid",
        source = "my_source",
        type = "contig",
        range = 100..200,
        score = 42.0,
        strand = Strand.PLUS,
        phase = Phase.ONE,
    )
} catch (e: TypeSchemaException) {
    println(e.message)
}

contig does not have a part-of relationship with gene and cannot be inserted in a feature with type gene.
Hint: If you wish to insert something that does not follow the Sequence Ontology, use MutableGenome.defineType first.


In addition, you cannot modify the topology of the tree while you are iterating over `descendants` or `subtree`. This is because these sequences iterate over the tree, and changing the tree out from under them in the course of their iteration would lead to unpredictable results.

In [19]:
try {
    mutable1.descendants().forEach {
        it.sort { one, two -> one.start - two.start }
    }
}
catch (e: ConcurrentModificationException) {
    println("Don't do concurrent modification!")
}

Don't do concurrent modification!
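
The same failure mode exists in ordinary Kotlin collections, which makes it easy to reproduce without a genome. The sketch below uses a plain `MutableList`; the two function names are made up for this example. Whether collecting `descendants()` into a list first is enough to avoid the exception in `featureTree` is an assumption that has not been verified here.

```kotlin
import java.util.ConcurrentModificationException

// Adding to the list while its iterator is live trips the fail-fast check.
fun mutateWhileIterating(items: MutableList<Int>): Boolean = try {
    for (item in items) items.add(item * 10)
    false
} catch (e: ConcurrentModificationException) {
    true
}

// Iterating over a snapshot leaves the live list free to change.
fun mutateOverSnapshot(items: MutableList<Int>): List<Int> {
    for (item in items.toList()) items.add(item * 10)
    return items
}

println(mutateWhileIterating(mutableListOf(1, 2))) // true
println(mutateOverSnapshot(mutableListOf(1, 2)))   // [1, 2, 10, 20]
```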


# Discontinuous Features
GFF3 files represent discontinuous features as multiple rows that share an ID. Since these rows logically constitute the same feature, and consequently should have identical properties except for start, end, and phase, they are represented as a single object within the `featureTree` package. This ensures that every instance of a `Feature` has a unique ID attribute and that changes made to a discontinuous feature affect all the discontinuities, ensuring they do not become "out-of-sync" through mutation. However, this does present some drawbacks that need to be accounted for.

First, the number of descendants that a genome has does not necessarily match the number of rows in the source GFF3 file. If you want to count the number of distinct regions rather than the number of logical features within the file, you should use the `multiplicity` property. This property is equivalent to the number of distinct regions within a `Feature` and is `1` for continuous features.

In [20]:
println("Number of logical features: ${genome.descendants().count()}")
println("Number of distinct regions (equivalent to number of rows): ${genome.descendants().sumOf { it.multiplicity }}")

Number of logical features: 14
Number of distinct regions (equivalent to number of rows): 20


Accessing the distinct regions of these features is done through the `ranges`, `phases`, and `lengths` properties. `ranges` and `phases` always "line up," meaning that they have the same size (which is equivalent to `multiplicity`) and that `ranges[i]` represents the range with phase `phases[i]`. `lengths` simply holds the length of each range.

In [21]:
val cds = genome.byID("Zm00001eb000010_P001")!!
println("Ranges: ${cds.ranges}")
println("Phases: ${cds.phases}")
println("Lengths: ${cds.lengths}")

Ranges: [34722..35318, 36037..36174, 36259..36504, 36600..36713, 36822..37004, 37416..37633, 38021..38366]
Phases: [ZERO, ZERO, ZERO, ZERO, ZERO, ZERO, ONE]
Lengths: [597, 138, 246, 114, 183, 218, 346]


Of course, you can still use `start`, `end`, `range`, `phase`, and `length` for discontinuous features, though not all are useful.

In [22]:
println("Start returns the least start value among ranges: ${cds.start}")
println("End returns the greatest end value among ranges: ${cds.end}")
println("Range returns the leftmost range among ranges, but these are not always ordered meaningfully: ${cds.range}")
println("Phase returns the leftmost phase among phases, but these are not always ordered meaningfully: ${cds.phase}")
println("Length returns the length between start and end: ${cds.length}")

Start returns the least start value among ranges: 34722
End returns the greatest end value among ranges: 38366
Range returns the leftmost range among ranges, but these are not always ordered meaningfully: 34722..35318
Phase returns the leftmost phase among phases, but these are not always ordered meaningfully: ZERO
Length returns the length between start and end: 3645
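
The numbers reported by `lengths` and `length` can be checked by hand from the ranges, since GFF3 coordinates are 1-based and inclusive. The sketch below redoes that arithmetic with plain Kotlin `IntRange` values copied from the CDS output above; the variable names are illustrative and do not come from the package.

```kotlin
// Ranges copied from the CDS output above.
val cdsRanges = listOf(
    34722..35318, 36037..36174, 36259..36504, 36600..36713,
    36822..37004, 37416..37633, 38021..38366
)

// Length of each region: both endpoints count, hence the + 1.
val regionLengths = cdsRanges.map { it.last - it.first + 1 }

// Span from the least start to the greatest end, gaps included.
val spanLength = cdsRanges.maxOf { it.last } - cdsRanges.minOf { it.first } + 1

println(regionLengths)       // [597, 138, 246, 114, 183, 218, 346]
println(regionLengths.sum()) // 1842
println(spanLength)          // 3645
```

The difference between the last two numbers is why `length` is of limited use for a discontinuous feature: it covers the gaps between regions, whereas summing the per-region lengths counts only the bases inside them.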


Observe that the `toString` function for a discontinuous feature returns multiple lines of text.

In [23]:
cds.toString()

chr1	NAM	CDS	34722	35318	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36037	36174	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36259	36504	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36600	36713	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36822	37004	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	37416	37633	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	38021	38366	.	+	1	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	


## Mutability with discontinuous features
Discontinuities can be modified in a variety of ways, though some limitations are imposed. Most importantly, `ranges` and `phases` must always be the same size. Because of this, all functions that allow you to modify the discontinuities of a feature require you to specify these properties simultaneously.

In [24]:
val mutable4 = mutable2.copy()
val cds4 = mutable4.byID("Zm00001eb000010_P001")!!
println("cds4 ranges: ${cds4.ranges}")
println("cds4 phases: ${cds4.phases}")
cds4.addDiscontinuity(4000..41000, Phase.TWO)
println("~mutation 1~")
println("cds4 ranges: ${cds4.ranges}")
println("cds4 phases: ${cds4.phases}")
println("~mutation 2~")
cds4.setDiscontinuities(listOf(100..200 to Phase.TWO, 300..400 to Phase.ZERO))
println("cds4 ranges: ${cds4.ranges}")
println("cds4 phases: ${cds4.phases}")

cds4 ranges: [34722..35318, 36037..36174, 36259..36504, 36600..36713, 36822..37004, 37416..37633, 38021..38366]
cds4 phases: [ZERO, ZERO, ZERO, ZERO, ZERO, ZERO, ONE]
~mutation 1~
cds4 ranges: [34722..35318, 36037..36174, 36259..36504, 36600..36713, 36822..37004, 37416..37633, 38021..38366, 4000..41000]
cds4 phases: [ZERO, ZERO, ZERO, ZERO, ZERO, ZERO, ONE, TWO]
~mutation 2~
cds4 ranges: [100..200, 300..400]
cds4 phases: [TWO, ZERO]


Using `setPhase` or `setRange` makes the feature continuous.

In [25]:
cds4.setPhase(Phase.ONE)
println("~mutation 3~")
println("cds4 ranges: ${cds4.ranges}")
println("cds4 phases: ${cds4.phases}")

~mutation 3~
cds4 ranges: [100..200]
cds4 phases: [ONE]


Observe that, when `setPhase` collapses the feature, the range defaults to the leftmost existing range. Alternatively, you can specify the range explicitly.

Any feature with a defined ID can be made discontinuous easily.

In [26]:
val chrom = mutable4.byID("1")!!
chrom.addDiscontinuity(100..200, Phase.UNSPECIFIED)
println(chrom.ranges)

[1..308452471, 100..200]


Keep in mind that discontinuous features must *always* have an ID! This ensures that the GFF3 output can recognize that the features are indeed discontinuous.

In [27]:
val mutable5 = mutable2.copy()
val cds5 = mutable5.byID("Zm00001eb000010_P001")!!
try {
    cds5.setID(null)
} catch (e: DiscontinuousLacksID) {
    println(e.message)
}

        Discontinuous features must contain an ID property.
        Feature: chr1	NAM	CDS	34722	35318	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36037	36174	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36259	36504	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36600	36713	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	36822	37004	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	37416	37633	.	+	0	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	
chr1	NAM	CDS	38021	38366	.	+	1	ID=Zm00001eb000010_P001;Parent=Zm00001eb000010_T001;protein_id=Zm00001eb000010_P001;	



Finally, it is forbidden to try to insert a feature with a different number of phases and ranges.

In [28]:
val gene5 = mutable5.byID("Zm00001eb000010")!!
try {
    gene5.insert(
        seqid = "my_seqid",
        source = "my_source",
        type = "mRNA",
        ranges = listOf(100..200, 300..400),
        score = 42.0,
        strand = Strand.PLUS,
        phases = listOf(Phase.ONE)
    )
} catch (e: MixedMultiplicity) {
    println(e.message)
}

Features must have one range and one phase for each continuous region of the feature, but the supplied ranges
and phases do not agree in number.
Ranges: [100..200, 300..400]
Phases: [ONE]
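
The rule that ranges and phases must agree in number is a simple invariant that can be enforced at construction time. The sketch below shows one way to express it in plain Kotlin. `SketchPhase` and `Regions` are hypothetical types made up for this illustration; the package has its own `Phase` type and throws its own `MixedMultiplicity` exception rather than the `IllegalArgumentException` used here.

```kotlin
// Hypothetical stand-in for the package's Phase type.
enum class SketchPhase { ZERO, ONE, TWO, UNSPECIFIED }

class Regions(ranges: List<IntRange>, phases: List<SketchPhase>) {
    init {
        // Reject mismatched input before any state is stored.
        require(ranges.size == phases.size) {
            "Expected one phase per range, got ${ranges.size} ranges and ${phases.size} phases"
        }
    }

    // Pairing the lists keeps each range together with its phase from here on.
    val pairs: List<Pair<IntRange, SketchPhase>> = ranges.zip(phases)
}

val accepted = Regions(listOf(100..200, 300..400), listOf(SketchPhase.TWO, SketchPhase.ZERO))
println(accepted.pairs) // [(100..200, TWO), (300..400, ZERO)]

val rejected = runCatching { Regions(listOf(100..200, 300..400), listOf(SketchPhase.ONE)) }
println(rejected.isFailure) // true
```

Storing the data as pairs is one design that makes it impossible for the two lists to drift apart after construction.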


# Advanced Parsing
While the vast majority of input files do not require any special parameters, these parameters can extend the functionality of the parser, particularly its ability to correct non-standard GFF3 files.

## Text Corrector
The `textCorrecter` is a function applied to each line of text before the parser attempts to parse it. This gives you an opportunity to correct errors in your GFF3 file, such as the presence of illegal characters. The file below accidentally includes multiple ID attributes in some rows. Let's cut those out.

In [29]:
println("UNCORRECTED")
println(File("resources/b73_multi_id.gff").readText())
val genomeCorrected = Genome.fromFile(
    path = "resources/b73_multi_id.gff",
    textCorrecter = { text ->
        // True if more than one instance of ID
        if (text.split("ID=").size > 2) {
            text.replaceFirst(Regex(";ID=.*;"), ";")
        } else {
            text
        }
    }
)
println("CORRECTED")
println(genomeCorrected)

UNCORRECTED
##gff-version 3
chr1	assembly	chromosome	1	308452471	.	.	.	ID=1;ID=2;Name=chromosome:Zm-B73-REFERENCE-NAM-5.0:1:1:308452471:1
chr1	NAM	gene	34617	40204	.	+	.	ID=Zm00001eb000010;ID=3;biotype=protein_coding;logic_name=cshl_gene
chr1	NAM	mRNA	34617	40204	.	+	.	ID=Zm00001eb000010_T001;ID=4;Parent=Zm00001eb000010;biotype=protein_coding;transcript_id=Zm00001eb000010_T001;canonical_transcript=1
chr1	NAM	five_prime_UTR	34617	34721	.	+	.	Parent=Zm00001eb000010_T001
chr1	NAM	exon	34617	35318	.	+	.	Parent=Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1

CORRECTED
chr1	assembly	chromosome	1	308452471	.	.	.	ID=1;Name=chromosome:Zm-B73-REFERENCE-NAM-5.0:1:1:308452471:1;	
chr1	NAM	gene	34617	40204	.	+	.	ID=Zm00001eb000010;logic_name=cshl_gene;	
chr1	NAM	mRNA	34617	40204	.	+	.	ID=Zm00001eb000010_T001;canonical_transcript=1;	
chr1	NAM	five_prime_UTR	34617	34721	.	+	.	Parent=Zm00001eb000010_T001;	
chr1	NAM	e
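
Note that the corrected gene row above no longer carries `biotype=protein_coding`. This follows from the greedy `.*` in the pattern, which extends to the last semicolon on the line. The sketch below reproduces the effect on the attribute column alone, using only the Kotlin standard library, and compares it with a bounded pattern. The bounded pattern is a suggestion, not something taken from the package, and it assumes the duplicate ID is followed by at least one more attribute.

```kotlin
// Attribute column of the gene row from the uncorrected file above.
val attrs = "ID=Zm00001eb000010;ID=3;biotype=protein_coding;logic_name=cshl_gene"

// Greedy: `.*` runs to the last semicolon, so biotype is removed too.
val greedy = attrs.replaceFirst(Regex(";ID=.*;"), ";")

// Bounded: `[^;]*` stops at the end of the duplicate ID value.
val bounded = attrs.replaceFirst(Regex(";ID=[^;]*;"), ";")

println(greedy)  // ID=Zm00001eb000010;logic_name=cshl_gene
println(bounded) // ID=Zm00001eb000010;biotype=protein_coding;logic_name=cshl_gene
```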

## Parent Resolver
Because multiple parentage adds difficulties (see [Multiple Parentage](#Multiple-Parentage)), it may be desirable to resolve these cases while the file is being parsed. A parent resolver is a function that takes the parsed line of the child and the feature objects representing its listed parents, in the order they are listed, and returns an `Int`: the index of the desired parent within that list. Any lambda can be used, but in practice the provided `LEFT` and `RIGHT` resolvers, which pick the leftmost and rightmost parent respectively, may be most useful.

In this file, the exons have multiple parents, but you only want the leftmost parent of each.

In [30]:
println("UNRESOLVED")
println(File("resources/b73_multi_parent.gff").readText())

val genomeResolved = Genome.fromFile(
    path = "resources/b73_multi_parent.gff",
    parentResolver = LEFT
)
println("\nRESOLVED")
println(genomeResolved)

UNRESOLVED
##gff-version 3
chr1	assembly	chromosome	1	308452471	.	.	.	ID=1;Name=chromosome:Zm-B73-REFERENCE-NAM-5.0:1:1:308452471:1
chr1	NAM	gene	34617	40204	.	+	.	ID=Zm00001eb000010;biotype=protein_coding;logic_name=cshl_gene
chr1	NAM	mRNA	34617	40204	.	+	.	ID=Zm00001eb000010_T001;Parent=Zm00001eb000010;biotype=protein_coding;transcript_id=Zm00001eb000010_T001;canonical_transcript=1
chr1	NAM	five_prime_UTR	34617	34721	.	+	.	Parent=Zm00001eb000010_T001
chr1	NAM	exon	34617	35318	.	+	.	Parent=Zm00001eb000010_T001,Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1
chr1	NAM	exon	36037	36174	.	+	.	Parent=Zm00001eb000010_T001,Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.2;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.2;rank=2

RESOLVED
chr1	assembly	chromosome	1	308452471	.	.	.	ID=1;Name=chromosome:Zm-B73-REFERENCE-NAM-5.0:1:1:308452471:1;	
chr1	NAM	gene	34617	40204	.	+	.	ID=Zm000
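
To illustrate what "returns the index of the desired parent" means, here is a sketch of resolver-like lambdas in plain Kotlin. The function type used here, a child line as a `String` plus a list of candidate parent IDs, is an assumption made for this example; the package's real parameter types are not shown in this document and are likely to differ. The names `leftSketch`, `rightSketch`, and `preferSuffixB`, as well as the parent IDs, are all made up.

```kotlin
// Assumed shape: child line and candidate parent IDs in, index of the chosen parent out.
val leftSketch: (String, List<String>) -> Int = { _, _ -> 0 }
val rightSketch: (String, List<String>) -> Int = { _, parentIds -> parentIds.lastIndex }

// A custom rule: prefer the first parent whose ID ends in "_b", else fall back to the leftmost.
val preferSuffixB: (String, List<String>) -> Int = { _, parentIds ->
    parentIds.indexOfFirst { it.endsWith("_b") }.coerceAtLeast(0)
}

val candidates = listOf("mRNA_a", "mRNA_b", "mRNA_c")
println(leftSketch("", candidates))    // 0
println(rightSketch("", candidates))   // 2
println(preferSuffixB("", candidates)) // 1
```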

## Multiple Parentage
Because features with multiple parents add some complexity (see [Multiple Parentage](#Multiple-Parentage)), multiple parentage must be explicitly enabled.

Using the same multiple parentage file as above, observe the exception that occurs when the `multipleParentage` parameter is allowed to default to `false` and no `parentResolver` is specified.

In [31]:
val multipleParentage = try {
    Genome.fromFile("resources/b73_multi_parent.gff")
} catch (e: ParseException) {
    println(e.message)
}

Error parsing GFF file resources/b73_multi_parent.gff at line number 6.
Text of line:
chr1	NAM	exon	34617	35318	.	+	.	Parent=Zm00001eb000010_T001,Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1

Must enable multipleParentage to have features with multiple parents


Now, with it enabled, it parses fine.

In [32]:
val multipleParentage = Genome.fromFile(
    path = "resources/b73_multi_parent.gff",
    multipleParentage = true
)
println(multipleParentage)

chr1	assembly	chromosome	1	308452471	.	.	.	ID=1;Name=chromosome:Zm-B73-REFERENCE-NAM-5.0:1:1:308452471:1;	
chr1	NAM	gene	34617	40204	.	+	.	ID=Zm00001eb000010;biotype=protein_coding;logic_name=cshl_gene;	
chr1	NAM	mRNA	34617	40204	.	+	.	ID=Zm00001eb000010_T001;Parent=Zm00001eb000010;biotype=protein_coding;transcript_id=Zm00001eb000010_T001;canonical_transcript=1;	
chr1	NAM	five_prime_UTR	34617	34721	.	+	.	Parent=Zm00001eb000010_T001;	
chr1	NAM	exon	34617	35318	.	+	.	Parent=Zm00001eb000010_T001, Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1;	
chr1	NAM	exon	34617	35318	.	+	.	Parent=Zm00001eb000010_T001, Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb000010_T001.exon.1;rank=1;	
chr1	NAM	exon	36037	36174	.	+	.	Parent=Zm00001eb000010_T001, Zm00001eb000010_T001;Name=Zm00001eb000010_T001.exon.2;ensembl_end_phase=0;ensembl_phase=0;exon_id=Zm00001eb00

## Modify Schema
While in general the tree must obey the Sequence Ontology, it is sometimes necessary to use non-standard types or to use types in a non-standard way. Schema modifications allow you to selectively make the schema more permissive. While the schema of a `MutableGenome` can be modified at any time, for a non-standard file to parse correctly these modifications must be made at parse time. See [Advanced Type Schema](#Advanced-Type-Schema) for more.

In this example, an [Arabidopsis GFF3 file](https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR_GFF3_ssrs.gff) uses the non-standard type "satellite". Say you wish to treat this type as a synonym of the standard "satellite_DNA".

In [33]:
println("Raw file")
println(File("resources/arabidopsis_ssr.gff").readText())
val modifiedSchema = Genome.fromFile(
    path = "resources/arabidopsis_ssr.gff",
    modifySchema = { addSynonym("satellite_DNA", "satellite") }
)

Raw file
Chr1	TandemRepeatsFinder_v4.04	satellite	1	115	.	+	.	ID=SSR000001;Name=SSR000001;Unit=CCCTAAA;Length=115;Period=7;Copy=15.3
Chr1	TandemRepeatsFinder_v4.04	satellite	1	106	.	+	.	ID=SSR000002;Name=SSR000002;Unit=CCCTAAAT;Length=106;Period=8;Copy=13.9
Chr1	TandemRepeatsFinder_v4.04	satellite	3	90	.	+	.	ID=SSR000003;Name=SSR000003;Unit=CTAAATCCTTAATCCCTAAATCCCTAAACCT;Length=88;Period=31;Copy=2.9
Chr1	TandemRepeatsFinder_v4.04	satellite	4	91	.	+	.	ID=SSR000004;Name=SSR000004;Unit=TAAATCCTAAATCCA;Length=88;Period=15;Copy=5.8
Chr1	TandemRepeatsFinder_v4.04	satellite	3	106	.	+	.	ID=SSR000005;Name=SSR000005;Unit=CTAAATCCTAAATCCATAAATCCCTAAATCT;Length=104;Period=31;Copy=3.4


# Multiple Parentage
While the parser accepts multiple parentage, its support is currently experimental. `descendants` and any function dependent on it will produce duplicates, which is generally not what you want. The best workaround currently is to call `toSet` on `descendants` to remove these duplicates and then perform your operation.
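
The reason for the duplicates, and the effect of `toSet`, can be seen on a toy structure in which one node is shared by two parents. `DagNode` and `walk` are hypothetical names used only for this sketch. It also assumes that repeated visits yield equal objects, which holds for the toy class here but has not been checked against the package's `Feature` type.

```kotlin
// Hypothetical node type with default identity-based equality.
class DagNode(val id: String, val children: List<DagNode> = emptyList())

// Depth-first walk; a node reachable through two parents is visited twice.
fun DagNode.walk(): Sequence<DagNode> = sequence {
    for (child in children) {
        yield(child)
        yieldAll(child.walk())
    }
}

val sharedExon = DagNode("exon1")
val root = DagNode("gene", listOf(
    DagNode("mRNA_a", listOf(sharedExon)),
    DagNode("mRNA_b", listOf(sharedExon))
))

println(root.walk().map { it.id }.toList())  // [mRNA_a, exon1, mRNA_b, exon1]
println(root.walk().toSet().map { it.id })   // [mRNA_a, exon1, mRNA_b]
```

`toSet` on a sequence preserves first-seen order, so the deduplicated result still follows the traversal order.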

# Advanced Type Schema
The type schema governing the permissible vocabulary and parent/child relationships can be modified at parse time or, for a `MutableGenome`, whenever you want. The type schema can only be modified to be *more permissive* than the base schema defined by the Sequence Ontology.