### Genome Comparison, using [AWS](https://aws.amazon.com/), [Elasticsearch](https://www.elastic.co) and [Altair](https://altair-viz.github.io/)
#### Human, Chimp, Gorilla
Using FASTA data from (links go here).
Sequences are compared in the order they appear in each file, with all other text removed.

#### Processing Pipeline
To do the comparison, the genome data is converted into a searchable format and inserted into Elasticsearch.

- clean up data, leaving only ACGT sequence
- break sequence into fixed size chunks (1M in this example)
- *process each chunk into a sequence of "words" (smaller character sequences)*
- *process each character sequence into a different sequence*
- insert word sequence along with species, chromosome, location (chunk) into an Elasticsearch index
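
The steps above can be sketched in plain Python. This is a minimal illustration, not the code used to produce the results here: the word length of 12, the non-overlapping split, the uppercasing of lowercase bases, and the document field names (`words` in particular) are all assumptions made for the example.

```python
import re
from typing import Dict, Iterator, List

CHUNK_SIZE = 1_000_000  # chunk size from the text (1M characters)
WORD_SIZE = 12          # hypothetical word length, not taken from the source


def clean_sequence(fasta_text: str) -> str:
    """Drop FASTA header lines and keep only A, C, G, T characters."""
    body = "".join(
        line for line in fasta_text.splitlines() if not line.startswith(">")
    )
    # Assumption: lowercase bases are uppercased rather than discarded.
    return re.sub(r"[^ACGT]", "", body.upper())


def chunks(sequence: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(sequence), size):
        yield sequence[start:start + size]


def to_words(chunk: str, word_size: int = WORD_SIZE) -> List[str]:
    """Split a chunk into non-overlapping words of word_size characters."""
    return [chunk[i:i + word_size] for i in range(0, len(chunk), word_size)]


def to_documents(species: str, chromosome: str, sequence: str,
                 size: int = CHUNK_SIZE,
                 word_size: int = WORD_SIZE) -> Iterator[Dict]:
    """Build one document per chunk, ready to be indexed."""
    for location, chunk in enumerate(chunks(sequence, size)):
        yield {
            "species": species,
            "chromosome": chromosome,
            "location": location,  # chunk number within the chromosome
            "words": " ".join(to_words(chunk, word_size)),
        }
```

Each yielded dictionary corresponds to one species/chromosome/location entry in the index.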

###### Processing Steps

The steps that process the sequence can execute *before* data is inserted into Elasticsearch, or they can execute *inside* Elasticsearch using its [Character Filters, Tokenizers, and Token Filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html).
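
As a rough sketch of the *inside* option, the index could be created with a custom analyzer along the following lines. The analyzer, filter, and field names (`genome`, `acgt_only`, `fixed_words`, `sequence`) and the word size are hypothetical, and this is not necessarily the configuration used here. Note that the `ngram` tokenizer emits overlapping words, which is only one possible way to split a chunk.

```python
def build_index_body(word_size: int = 12) -> dict:
    """Return an index definition whose analyzer does the cleanup and word split.

    All names below are illustrative; word_size is a hypothetical value.
    """
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    # Character filter: drop everything that is not A, C, G or T.
                    "acgt_only": {
                        "type": "pattern_replace",
                        "pattern": "[^ACGT]",
                        "replacement": "",
                    }
                },
                "tokenizer": {
                    # Tokenizer: overlapping words of exactly word_size characters.
                    "fixed_words": {
                        "type": "ngram",
                        "min_gram": word_size,
                        "max_gram": word_size,
                    }
                },
                "analyzer": {
                    "genome": {
                        "type": "custom",
                        "char_filter": ["acgt_only"],
                        "tokenizer": "fixed_words",
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "species": {"type": "keyword"},
                "chromosome": {"type": "keyword"},
                "location": {"type": "integer"},
                "sequence": {"type": "text", "analyzer": "genome"},
            }
        },
    }
```

The returned dictionary would be passed as the request body when creating the index. Because the same analyzer is applied at search time by default, a query against the `sequence` field is processed the same way as the indexed data.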

#### Finding Relationships

This is similar to the original processing pipeline, but the end result relates the sequence data from one chunk in one species to a chunk in another species.  This is done by taking a sample of each chunk and searching for it across all genomes.  In the case below, the sample is 1/10th of 1 percent of each chunk, and the search finds all the species/chromosome/location chunks that are most similar to it.

The expected result is that the original source chunk is found with the highest score, and that any other high score indicates a common ancestor for those data segments.  The graphs below show those relationships.

A second pass over each data chunk finds the relationship, with the following steps:

- take a 1/10 of 1% sample of sequence data
- process it exactly as done for data inserted into elasticsearch
- process the reverse complement of that sample in the same way as done for data inserted into elasticsearch
- search for each sequence, save results (top 10 species/chromosome/location matches with score)
- any scores above a threshold are marked as indicating a relationship (the threshold is based on the scores returned by elasticsearch; in the case below, scores ranged from 50 to 2000, with very few above 250)
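
The sampling, reverse complement, and threshold steps can be sketched as follows. This is an illustration under assumptions: the text does not say which part of a chunk is sampled (the start is used here), the cutoff of 250 is a hypothetical value, and each search hit is assumed to be a dictionary with a `score` key.

```python
from typing import Dict, List

SAMPLE_FRACTION = 0.001  # 1/10 of 1%, as in the text
SCORE_THRESHOLD = 250    # hypothetical cutoff, not stated in the source

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Complement each base (A<->T, C<->G) and reverse the result."""
    return sequence.translate(_COMPLEMENT)[::-1]


def take_sample(chunk: str, fraction: float = SAMPLE_FRACTION) -> str:
    """Return the leading fraction of a chunk (assumption: sample the start)."""
    length = max(1, round(len(chunk) * fraction))
    return chunk[:length]


def mark_relationships(hits: List[Dict],
                       threshold: float = SCORE_THRESHOLD) -> List[Dict]:
    """Keep only the hits whose score is above the threshold."""
    return [hit for hit in hits if hit["score"] > threshold]
```

For a 1M-character chunk, `take_sample` returns 1,000 characters. Both that sample and `reverse_complement(sample)` would then be searched, and the top 10 hits of each passed to `mark_relationships`.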

The graphs below show the marked relationships, with blue showing the search term was normal, and orange showing the search term was the reverse complement.  This means an inversion will usually show up as a sequence of orange lines that cross in the middle.
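
The plotting code later in this notebook reads CSV files with the columns `x`, `y`, `x2`, `y2` and `color`. One way such rows could be produced from the marked relationships is sketched below. The layout is an assumption made for the example: each species gets a hypothetical row number in `SPECIES_ROW`, the chunk location is used as the horizontal position, and the input dictionaries with `species` and `location` keys are illustrative.

```python
import csv
import io
from typing import Dict, List

# Hypothetical vertical position of each species in the chart.
SPECIES_ROW = {"human": 2, "chimp": 1, "gorilla": 0}


def to_segment(source: Dict, match: Dict, reverse: bool) -> Dict:
    """Turn one marked relationship into a line from source chunk to matched chunk."""
    return {
        "x": source["location"],
        "y": SPECIES_ROW[source["species"]],
        "x2": match["location"],
        "y2": SPECIES_ROW[match["species"]],
        # These labels match the color scale domain defined in the notebook.
        "color": "inversed" if reverse else "same orientation",
    }


def segments_to_csv(segments: List[Dict]) -> str:
    """Serialize the segments as CSV text with the expected column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["x", "y", "x2", "y2", "color"])
    writer.writeheader()
    writer.writerows(segments)
    return buffer.getvalue()
```

With this layout, a run of reverse complement matches whose target locations decrease as the source locations increase draws the crossing orange lines described above.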


In [1]:
import altair as alt
import numpy as np
import pandas as pd

In [2]:
alt.renderers.enable()

RendererRegistry.enable('default')

In [3]:
domains = [ 'same orientation', 'inversed']
color_scale = alt.Scale(
    domain=domains,
    range=['#6baed6', '#fcae91']
)

In [6]:
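def render_csv(csv_file):
    """Draw one line per row of data/<csv_file>, from (x, y) to (x2, y2)."""
    data = pd.read_csv(f"data/{csv_file}")
    return alt.Chart(data).mark_line().encode(
        x=alt.X('x', axis=None),
        y=alt.Y('y', axis=None),
        x2='x2',
        y2='y2',
        # blue = same orientation, orange = reverse complement match
        color=alt.Color('color:N', legend=None, scale=color_scale)
    ).configure_view(
        strokeOpacity=0  # hide the border around the plot area
    )
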

##### Chromosome 1:  Human + Chimp + Gorilla

When 3 species are shown, we can identify which species had what sort of large scale event (inversion, duplication, splitting, joining).

For example, below there are events like:
- a large inversion in chimp 2A, and a smaller human inversion
- a large section of chimp 7 getting duplicated onto the end of chimp 7 (needs some more investigation)


*Chromosome 1 events*
- Gorilla, chromosome 1, a sequence of inversion events

In [7]:
render_csv('chimp1_x2.csv')

*Chromosome 2 events*
- Chimp, large 2A inversion
- Human, smaller 2 inversion

In [11]:
render_csv('chimp2A_x2.csv')

*Chromosome 2 events*
- Human, 2 chromosomes merge into one

The graph below shows that the right side of Human 2 matches 2B in Chimp and Gorilla.

In [9]:
render_csv('chimp2B_x2.csv')

*Chromosome 3 events*
- Human, several large inversions

In [8]:
render_csv('chimp3_x2.csv')

In [9]:
render_csv('chimp4_x2.csv')

In [10]:
render_csv('chimp5_x2.csv')

In [20]:
render_csv('chimp6_x2.csv')

*Chromosome 7 events*
- looks like several sections got duplicated onto end (unexpected)

In [14]:
render_csv('chimp7_x2.csv')

In [15]:
render_csv('chimp8_x2.csv')

In [23]:
render_csv('chimp9_x2.csv')

In [24]:
render_csv('chimp10_x2.csv')

In [25]:
render_csv('chimp11_x2.csv')

In [26]:
render_csv('chimp12_x2.csv')

In [27]:
render_csv('chimp13_x2.csv')

In [28]:
render_csv('chimp14_x2.csv')

In [29]:
render_csv('chimp15_x2.csv')

In [30]:
render_csv('chimp16_x2.csv')

In [31]:
render_csv('chimp17_x2.csv')

In [32]:
render_csv('chimp18_x2.csv')

In [33]:
render_csv('chimp19_x2.csv')

In [34]:
render_csv('chimp20_x2.csv')

In [35]:
render_csv('chimp21_x2.csv')

In [36]:
render_csv('chimp22_x2.csv')

In [37]:
render_csv('chimpX_x2.csv')

In [38]:
render_csv('chimpY_x2.csv')

In [39]:
render_csv("gorilla5.csv")

Gorilla 5 is in the center, with Human 17 above and Chimp 17 below.  This is the section of Gorilla 5 that is missing in the Chromosome 5 graph above.

In [40]:
render_csv("g5_17.csv")

In [41]:
render_csv("g5_5.csv")