### Genome Comparison, using [AWS](https://aws.amazon.com/), [elasticsearch](https://www.elastic.co) and [altair](https://altair-viz.github.io/)
#### Overview
This visualization shows large scale changes within genomes since their last common ancestor, using three genomes: human, chimp, and gorilla, from (links go here).

Each sequence is about 3 billion values grouped into chromosomes, and the input is just raw sequence data. The code below simply strips everything else out of the FASTA file. That probably does not yield the exact raw sequence, but it seems to be close enough that the visualization still makes sense.

There are two passes over the data. The first pass processes the files into the bulk load format for elasticsearch. The second pass samples each location (about 1/10th of 1 percent in the case below) and searches for the best match in elasticsearch. In most cases, the best match will be a corresponding location in another species.
Sequences are compared in the order they appear in the file, with all other text removed.


#### Loading genome data into elasticsearch
To do the comparison, the data needs to be inserted into a database.  The genome data is changed into a searchable format and inserted into elasticsearch.

- clean up data, leaving only ACGT sequence
- break sequence into fixed size chunks (1M in this example)
- *process each chunk into a sequence of "words" (smaller character sequences)*
- *process each character sequence into a different sequence*
- insert word sequence along with species, chromosome, location (chunk) into an Elasticsearch index
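
As a rough illustration of the steps above, here is a minimal sketch that runs the processing *before* insertion. The word length (`WORD_SIZE`), the non-overlapping word split, the upper-casing of lower-case bases, the document field names, and the index name are all assumptions made for the example, not values taken from this notebook. The output is pairs of lines in the Elasticsearch bulk (newline-delimited JSON) format.

```python
import json

CHUNK_SIZE = 1_000_000  # chunk size used in this example
WORD_SIZE = 12          # hypothetical word length, not taken from the notebook


def clean_fasta(lines):
    """Drop FASTA header lines and keep only A, C, G, T characters."""
    kept = []
    for line in lines:
        if line.startswith(">"):
            continue  # header line, not sequence data
        # Assumption: lower-case bases are upper-cased, everything else is dropped
        kept.append("".join(c for c in line.upper() if c in "ACGT"))
    return "".join(kept)


def chunks(sequence, size=CHUNK_SIZE):
    """Yield (start_offset, chunk) pairs of at most `size` characters."""
    for start in range(0, len(sequence), size):
        yield start, sequence[start:start + size]


def to_words(chunk, word_size=WORD_SIZE):
    """Split a chunk into consecutive, non-overlapping words."""
    return [chunk[i:i + word_size] for i in range(0, len(chunk), word_size)]


def bulk_lines(index, species, chromosome, sequence, size=CHUNK_SIZE):
    """Yield action/document line pairs in the Elasticsearch bulk format."""
    for start, chunk in chunks(sequence, size):
        yield json.dumps({"index": {"_index": index}})
        yield json.dumps({
            "species": species,
            "chromosome": chromosome,
            "location": start,
            "words": " ".join(to_words(chunk)),
        })


# Tiny made-up input, with a small chunk size so two documents are produced
seq = clean_fasta([">chr1 example", "ACGTNNacgt", "GGCC"])
for line in bulk_lines("genome", "human", "1", seq, size=8):
    print(line)
```

Each document is preceded by its own action line, which is what the bulk endpoint expects. A real run would stream these lines to a file or directly to the cluster rather than printing them.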

###### Processing Steps

The steps that process the sequence can execute *before* data is inserted into Elasticsearch, or they can execute *inside* Elasticsearch via its [Character Filters, Tokenizers, and Token Filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html).
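
For the *inside* option, the sketch below shows what an index definition with a custom analyzer might look like. This notebook does not say which analyzer was actually used, so the `ngram` tokenizer, the word length, and the names `kmer_tokenizer`, `kmer_analyzer`, and `sequence` are assumptions chosen for illustration. The dictionary has the shape of a create-index request body.

```python
import json

WORD_SIZE = 12  # hypothetical word length, not taken from the notebook

# Hypothetical index definition: an ngram tokenizer makes Elasticsearch split
# the raw sequence into overlapping fixed-length words, both when documents
# are indexed and when a search term is analyzed.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "kmer_tokenizer": {
                    "type": "ngram",
                    "min_gram": WORD_SIZE,
                    "max_gram": WORD_SIZE,
                    "token_chars": ["letter"],
                }
            },
            "analyzer": {
                "kmer_analyzer": {
                    "type": "custom",
                    "tokenizer": "kmer_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "species": {"type": "keyword"},
            "chromosome": {"type": "keyword"},
            "location": {"type": "long"},
            "sequence": {"type": "text", "analyzer": "kmer_analyzer"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With this approach the same analyzer is applied to the search term automatically, so the sample does not need to be pre-processed in Python before searching.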

#### Finding Relationships

This is similar to the original processing pipeline, but the end result relates the sequence data in one chunk of one species to a chunk in another species.  This is done by taking a sample of each chunk and searching for it across all genomes.  In the case below, the sample is 1/10th of 1 percent of each chunk, and the search finds the species/chromosome/location chunks that are most similar to it.

The expected result is that the original source chunk is found with the highest score, and that any other high score indicates a common ancestor for those data segments.  The graphs below show those relationships.
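
One way to turn search results into relationships is sketched below. The hit layout (dicts with `species`, `chromosome`, `location`, `score`), the function name `related_hits`, and the cutoff value are hypothetical. The sketch drops the hit that is the source chunk itself and keeps the remaining hits that score above the cutoff.

```python
def related_hits(source, hits, min_score):
    """Return hits scoring above min_score, excluding the source chunk itself.

    source: (species, chromosome, location) of the chunk the sample came from.
    hits: list of dicts with keys species, chromosome, location, score.
    """
    related = []
    for hit in hits:
        key = (hit["species"], hit["chromosome"], hit["location"])
        if key == source:
            continue  # the sample is expected to match its own chunk best
        if hit["score"] > min_score:
            related.append(hit)
    return related


# Made-up scores, for illustration only
example_hits = [
    {"species": "gorilla", "chromosome": "5", "location": 0, "score": 1800.0},
    {"species": "human", "chromosome": "17", "location": 3000000, "score": 400.0},
    {"species": "chimp", "chromosome": "7", "location": 1000000, "score": 60.0},
]
print(related_hits(("gorilla", "5", 0), example_hits, min_score=250))
```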

A second pass over each data chunk finds the relationship, with the following steps:

- take a 1/10 of 1% sample of sequence data
- process it exactly as done for data inserted into elasticsearch
- process the reverse complement of that sample in the same way
- search for each sequence, save results (top 10 species/chromosome/location matches with score)
- mark any score above a threshold as indicating a relationship (the threshold is based on the scores elasticsearch returns; in the case below they ranged from 50 to 2000, with very few above 250)
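
The sampling and search steps in this list could look roughly like the sketch below. Taking the sample as a contiguous slice from the start of the chunk, the field name `words`, and the `process` placeholder (standing in for whatever word processing was applied at load time) are assumptions. The query bodies use a plain `match` query limited to the top 10 hits.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement each base, then reverse the result."""
    return sequence.translate(_COMPLEMENT)[::-1]


def take_sample(chunk, fraction=0.001):
    """Take a contiguous sample from the start of the chunk (assumed strategy)."""
    length = max(1, int(len(chunk) * fraction))
    return chunk[:length]


def search_body(text, field="words", size=10):
    """Build a match query body that returns the top `size` hits."""
    return {"size": size, "query": {"match": {field: text}}}


def search_bodies(chunk, fraction=0.001, process=lambda s: s):
    """Build one query for the sample and one for its reverse complement.

    `process` is a placeholder for the same processing applied at load time.
    """
    sample = take_sample(chunk, fraction)
    return {
        "same orientation": search_body(process(sample)),
        "inversed": search_body(process(reverse_complement(sample))),
    }


# Tiny made-up chunk with a large fraction so the sample is visible
print(search_bodies("ACGTACGTAC", fraction=0.5))
```

The two keys match the orientation labels used by the graphs, so the orientation of each saved result can be taken directly from the query that produced it.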

The graphs below show the marked relationships: blue means the search term was the normal sequence, and orange means it was the reverse complement.  This means an inversion will usually show up as a sequence of orange lines that cross in the middle.
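
The crossing pattern follows from how an inversion maps positions. As a small worked example, assume an idealized inversion with made-up coordinates and no other rearrangement: a location `p` inside the inverted segment `[start, end]` ends up at `start + end - p` in the other genome, so every connecting line passes through the middle of the segment.

```python
def inverted_position(p, start, end):
    """Location in the other genome of position p inside an inverted segment."""
    return start + end - p


start, end = 10, 20  # hypothetical inversion, in million bp

# (x, x2) pairs: x on one species' row, x2 on the other species' row
lines = [(p, inverted_position(p, start, end)) for p in range(start, end + 1, 2)]

# Halfway between the two rows, a straight line from x to x2 is at (x + x2) / 2
midpoints = {(x + x2) / 2 for x, x2 in lines}

print(lines)
print(midpoints)
```

All lines share the single midpoint `(start + end) / 2`, which is why the orange lines appear to cross at one point.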


In [1]:
import altair as alt
import numpy as np
import pandas as pd

alt.renderers.enable()

RendererRegistry.enable('default')

#### Example #1: storing 1M bp records in elasticsearch, search for 10k of data
The charts below show the correspondence between chromosomes based on elasticsearch results.  In most cases, chromosome data corresponds to the same chromosome number, but some large scale structural changes show up in the data.

In [240]:
df = pd.read_csv('data/generated_csv/cgh_1000000/data.csv', names=['sp', 'chr', 'loc', 'score', 'msp', 'mchr', 'mloc', 'orient'])
chrdf = df.groupby(['sp','chr', 'msp', 'mchr']).size().reset_index().rename(columns={0:'count'})

In [241]:
hchr_list = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X', 'Y' ]
cchr_list = [ '1', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X', 'Y' ]
gchr_list = [ '1', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X' ]

chr_map = { "human": hchr_list, "chimp": cchr_list, "gorilla": gchr_list }

def genome_graph(df, sp1, sp2):
    sp1_chr_list = chr_map[sp1]
    sp2_chr_list = chr_map[sp2]
    ax1chrs = [ v1 for v1 in sp1_chr_list for v2 in sp2_chr_list]
    ax2chrs = [ v2 for v1 in sp1_chr_list for v2 in sp2_chr_list]
    counts = []
    for ax1_chr,ax2_chr in zip(ax1chrs,ax2chrs):
        xdf = df[(df['sp']== sp1) & (df['msp'] == sp2) & (df['chr'] == ax1_chr) & (df['mchr'] == ax2_chr)]
        count = 0
        if len(xdf) > 0:
            count = xdf.iloc[0]['count']
        counts.append(count)

    source = pd.DataFrame({'x': ax1chrs,
                           'y': ax2chrs,
                           'z': counts})
    ax1_title = f"{sp1} chromosomes"
    ax2_title = f"{sp2} chromosomes"
    return alt.Chart(source).mark_rect().encode(
        x=alt.X('x:N', axis=alt.Axis(title=ax1_title, grid=True, ticks=True)),
        y=alt.Y('y:N', axis=alt.Axis(title=ax2_title, grid=True, ticks=True)),
        color=alt.Color('z:Q', title="count")
    )
    
genome_graph(chrdf, 'human', 'chimp')

In [242]:
genome_graph(chrdf, 'human', 'gorilla')

In [243]:
genome_graph(chrdf, 'chimp', 'gorilla')

The table below shows that gorilla chromosome 5 sequence data is also found in human and chimp chromosome 17.  The graphs further below show the large scale structural differences in more detail.

In [244]:
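# NOTE: df50 is not defined in the cells shown here; it appears to be a per-chromosome
# count table with the same columns as chrdf.
df50[(df50['sp'] == 'gorilla') & (df50['chr'] == '5')]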

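| Unnamed: 0 | sp | chr | msp | mchr | count |
|---|---|---|---|---|---|
| 826 | gorilla | 5 | chimp | 17 | 82 |
| 827 | gorilla | 5 | chimp | 20 | 4 |
| 828 | gorilla | 5 | chimp | 5 | 120 |
| 829 | gorilla | 5 | chimp | 7 | 3 |
| 830 | gorilla | 5 | gorilla | 17 | 2 |
| 831 | gorilla | 5 | gorilla | 20 | 2 |
| 832 | gorilla | 5 | human | 17 | 78 |
| 833 | gorilla | 5 | human | 20 | 2 |
| 834 | gorilla | 5 | human | 5 | 108 |
| 835 | gorilla | 5 | human | 7 | 2 |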


In [279]:
domains = [ 'same orientation', 'inversed']
color_scale = alt.Scale(
    domain=domains,
    range=['#6baed6', '#fcae91']
)


def chromosome_graph(csvFile, top_species, top_chromosome, middle_species, middle_chromosome, bottom_species, bottom_chromosome):
    csv = pd.read_csv(f"data/generated_csv/{csvFile}")
    return cgraph(csv, top_species, top_chromosome, middle_species, middle_chromosome, bottom_species, bottom_chromosome)

def cgraph(df, top_species, top_chromosome, middle_species, middle_chromosome, bottom_species, bottom_chromosome, graph_width=600):
    g = alt.Chart(df).mark_line().encode(
        x=alt.X('x',axis=alt.Axis(grid=False)),
        y=alt.Y('y',axis=alt.Axis(grid=False)),
        x2='x2',
        y2='y2',
        color=alt.Color('orientation:N', title='', scale=color_scale)
    )
    
    maxes = df.max()
    maxCenter = maxes['x']
    maxRest = maxes['x2']
    maxY = max(maxes['y'], maxes['y2'])
     
    # Text labels drawn to the right of the lines, one per row of the plot
    # (top, middle, bottom), each reading "species, chromosome".
    top_label = f"{top_species}, {top_chromosome}"
    middle_label = f"{middle_species}, {middle_chromosome}"
    bottom_label = f"{bottom_species}, {bottom_chromosome}"
    X_MARGIN = 10
    Y_MARGIN = 12
    top_data = pd.DataFrame({
        'x': [ maxRest + X_MARGIN, maxRest + X_MARGIN, maxRest + X_MARGIN ],
        'y': [ maxY - Y_MARGIN, int(maxY/2), Y_MARGIN ],
        'text': [ top_label, middle_label, bottom_label ]
    })
    bars = alt.Chart(top_data).mark_text(
        stroke='grey',
        opacity=0.9, 
        fontSize=10,
        fontStyle="italic",
        align="left"
    ).encode(
        x=alt.X('x:Q'),
        y=alt.Y('y:Q'),
        text=alt.Text('text'),
        color=alt.Color('orientation:N', legend=None, scale=color_scale)
    )
    
    x = alt.Chart().mark_text().encode(
        x=alt.X('x:Q', axis=alt.Axis(title='million bp', grid=False, ticks=True)),
        y=alt.Y('y:Q', axis=alt.Axis(title='', grid=False, labels=False, ticks=False)),
        color=alt.Color('orientation:N', legend=alt.Legend(orient="left",title='', symbolType="stroke"), scale=color_scale)
    )

    return alt.layer(g, bars, x).configure_view(
        stroke='transparent',
        width=graph_width
    ).configure_axis(grid=False)


##### Chromosome 1:  Human + Chimp + Gorilla

When 3 species are shown, we can identify which species had what sort of large scale event (inversion, duplication, splitting, joining).

For example, below there are events like:
- a large inversion in chimp 2A, and a smaller human inversion
- a large section of chimp 7 getting duplicated onto the end of chimp 7 (needs some more investigation)


*Chromosome 1 events*
- Gorilla, chromosome 1, a sequence of inversion events

In [281]:
chromosome_graph('chimp1_x2.csv', "human", "1", "chimp", "1", "gorilla", "1")

*Chromosome 2 events*
- Chimp, large 2A inversion
- Human, smaller 2 inversion

In [282]:
chromosome_graph('chimp2A_x2.csv', "human", "2", "chimp", "2A", "gorilla", "2A")

*Chromosome 2 events*
- Human, 2 chromosomes merge into one

The graph below shows that the right side of human chromosome 2 corresponds to chromosome 2B in chimp and gorilla.

In [283]:
chromosome_graph('chimp2B_x2.csv', 'human', '2', 'chimp', '2B', 'gorilla', '2B')

*Chromosome 3 events*
- Human, several large inversions

In [284]:
chromosome_graph('chimp3_x2.csv', 'human', '3', 'chimp', '3', 'gorilla','3')

In [285]:
chromosome_graph('chimp4_x2.csv', 'human', '4', 'chimp', '4', 'gorilla', '4')

In [286]:
chromosome_graph('chimp5_x2.csv', 'human', '5', 'chimp', '5', 'gorilla', '5')

In [287]:
chromosome_graph('chimp6_x2.csv', 'human', '6', 'chimp', '6', 'gorilla', '6')

*Chromosome 7 events*
- several sections appear to have been duplicated onto the end of the chromosome (unexpected)

In [288]:
chromosome_graph('chimp7_x2.csv', 'human', '7', 'chimp', '7', 'gorilla', '7')

In [289]:
chromosome_graph('chimp8_x2.csv', 'human', '8', 'chimp', '8', 'gorilla', '8')

In [290]:
chromosome_graph('chimp9_x2.csv', 'human', '9', 'chimp', '9', 'gorilla', '9')

In [291]:
chromosome_graph('chimp10_x2.csv', 'human', '10', 'chimp', '10', 'gorilla', '10')

In [292]:
chromosome_graph('chimp11_x2.csv', 'human', '11', 'chimp', '11', 'gorilla', '11')

In [293]:
chromosome_graph('chimp12_x2.csv', 'human', '12', 'chimp', '12', 'gorilla', '12')

In [294]:
chromosome_graph('chimp13_x2.csv', 'human', '13', 'chimp', '13', 'gorilla', '13')

In [295]:
chromosome_graph('chimp14_x2.csv', 'human', '14', 'chimp', '14', 'gorilla', '14')

In [296]:
chromosome_graph('chimp15_x2.csv', 'human', '15', 'chimp', '15', 'gorilla', '15')

In [297]:
chromosome_graph('chimp16_x2.csv', 'human', '16', 'chimp', '16', 'gorilla', '16')

In [298]:
chromosome_graph('chimp17_x2.csv', 'human', '17', 'chimp', '17', 'gorilla', '17')

In [299]:
chromosome_graph('chimp18_x2.csv', 'human', '18', 'chimp', '18', 'gorilla', '18')

In [300]:
chromosome_graph('chimp19_x2.csv', 'human', '19', 'chimp', '19', 'gorilla', '19')

In [301]:
chromosome_graph('chimp20_x2.csv', 'human', '20', 'chimp', '20', 'gorilla', '20')

In [302]:
chromosome_graph('chimp21_x2.csv', 'human', '21', 'chimp', '21', 'gorilla', '21')

In [303]:
chromosome_graph('chimp22_x2.csv', 'human', '22', 'chimp', '22', 'gorilla', '22')

In [304]:
chromosome_graph('chimpX_x2.csv', 'human', 'X', 'chimp','X', 'gorilla', 'X')

Large parts of gorilla chromosome 5 correspond to human and chimp chromosome 17.

In [320]:
chromosome_graph("g5_17.csv", 'human', '17', 'gorilla', '5', 'chimp', '17')

In [307]:
def sp_to_y(val, top, mid, bot):
    if val == top:
        return 0
    elif val == mid:
        return 200
    else:
        return 400

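def graph_df(df, top_sp, top_chr, middle_sp, middle_chr, bottom_sp, bottom_chr, min_score):
    # Keep matches that start on the middle species/chromosome, score above
    # min_score, and land on the top or bottom chromosome.
    mask = (
        (df['sp'] == middle_sp)
        & (df['score'] > min_score)
        & (df['chr'] == middle_chr)
        & ((df['mchr'] == top_chr) | (df['mchr'] == bottom_chr))
    )
    df = df[mask].copy()  # copy so the assignments below do not write to a view
    # Locations in millions of bp
    df['x'] = df['loc'] / 1000000
    df['x2'] = df['mloc'] / 1000000
    # Vertical position of each end of the line, chosen by species
    df['y'] = [ sp_to_y(val, top_sp, middle_sp, bottom_sp) for val in df['sp']]
    df['y2'] = [ sp_to_y(val, top_sp, middle_sp, bottom_sp) for val in df['msp']]
    return df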

In [308]:
df = pd.read_csv('cgh_100000_data.csv', index_col=False)

In [309]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [322]:
gdf = graph_df(df, 'chimp', '3', 'gorilla', '3', 'human', '3', 1000)
cgraph(gdf, 'chimp', '3', 'gorilla', '3', 'human', '3', 3000)

In [323]:
gdf = graph_df(df, 'chimp', '14', 'gorilla', '14', 'human', '14', 1000)
cgraph(gdf, 'chimp', '14', 'gorilla', '14', 'human', '14', 3000)