### Genome Comparison, using [AWS](https://aws.amazon.com/), [elasticsearch](https://www.elastic.co) and [altair](https://altair-viz.github.io/)
#### Overview
This visualization shows large-scale changes within genomes since their last common ancestor. It uses three genomes: human, chimp, and gorilla, from (links go here).

Each sequence is about 3 billion values grouped into chromosomes, and the input is just raw sequence data. The code below simply strips everything else out of the FASTA file, which probably does not yield exactly the raw sequence, but it seems close enough that the visualization still makes sense.

There are two passes over the data. The first pass processes the files into bulk load format for elasticsearch. The second pass samples each location (about 1/10th of 1 percent in the example below) and searches for the best match in elasticsearch. In most cases, the best match (apart from the source chunk itself) will be a corresponding location in another species.
Sequences are compared in the order they exist in the file, with all other text removed.


#### Loading genome data into elasticsearch
To do the comparison, the data needs to be inserted into a database.  The genome data is changed into a searchable format and inserted into elasticsearch.

- clean up data, leaving only ACGT sequence
- break sequence into fixed size chunks (1M in this example)
- *process each chunk into a sequence of "words" (smaller character sequences)*
- *process each character sequence into a different sequence*
- insert word sequence along with species, chromosome, location (chunk) into an Elasticsearch index
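
The steps above can be sketched roughly as follows. This is a minimal sketch, not the code used to produce the data: the helper names (`clean_sequence`, `chunks`, `to_words`, `bulk_lines`), the word size of 8, the index name `genomes`, the document field names, and the choice of non-overlapping words are all assumptions for illustration. The bulk format itself (an action line followed by a source line, as newline-delimited JSON) is the standard Elasticsearch bulk format.

```python
import json
import re

CHUNK_SIZE = 1_000_000  # chunk size from the text (1M)
WORD_SIZE = 8           # assumed word length; the original value is not stated


def clean_sequence(fasta_text):
    """Drop FASTA header lines and keep only A, C, G, T characters."""
    body = "".join(line for line in fasta_text.splitlines()
                   if not line.startswith(">"))
    return re.sub(r"[^ACGT]", "", body.upper())


def chunks(sequence, size=CHUNK_SIZE):
    """Yield (start_offset, chunk) pairs of at most `size` characters."""
    for start in range(0, len(sequence), size):
        yield start, sequence[start:start + size]


def to_words(chunk, word_size=WORD_SIZE):
    """Split a chunk into space-separated, non-overlapping words."""
    return " ".join(chunk[i:i + word_size]
                    for i in range(0, len(chunk), word_size))


def bulk_lines(species, chromosome, sequence, index="genomes"):
    """Yield newline-delimited JSON lines in Elasticsearch bulk format."""
    for start, chunk in chunks(sequence):
        # action line, then the document itself
        yield json.dumps({"index": {"_index": index}})
        yield json.dumps({"sp": species, "chr": chromosome,
                          "loc": start, "words": to_words(chunk)})
```

Each chromosome would be passed through `bulk_lines`, with the resulting lines written to a file and posted to the bulk endpoint.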

###### Processing Steps

The steps that process the sequence can execute *before* data is inserted into Elasticsearch, or they can execute *inside* Elasticsearch via elasticsearch [Character Filters, Tokenizers, and Token Filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html).
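
If the word-splitting runs inside Elasticsearch, the index definition might look something like the sketch below. The analyzer and tokenizer names (`genome_analyzer`, `genome_words`), the field name `sequence`, and the gram size of 8 are assumptions, not the settings used for the graphs. `ngram` is one of Elasticsearch's built-in tokenizers and produces overlapping words; the `lowercase` token filter stands in for the "process each word into a different sequence" step.

```python
import json

WORD_SIZE = 8  # assumed word length; not stated in the original

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # hypothetical tokenizer name; fixed-length overlapping words
                "genome_words": {
                    "type": "ngram",
                    "min_gram": WORD_SIZE,
                    "max_gram": WORD_SIZE,
                    "token_chars": ["letter"],
                }
            },
            "analyzer": {
                # hypothetical analyzer name
                "genome_analyzer": {
                    "type": "custom",
                    "tokenizer": "genome_words",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "sp": {"type": "keyword"},
            "chr": {"type": "keyword"},
            "loc": {"type": "long"},
            "sequence": {"type": "text", "analyzer": "genome_analyzer"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

A body like this would be sent when creating the index, so that the same analysis is applied both to inserted chunks and to search terms.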

#### Finding Relationships

This is similar to the original processing pipeline, but the end result relates the sequence data from one chunk in one species to another chunk in another species.  This is done by taking a sample of each chunk and searching for it over all genomes.  In the example below, the sample is 1/10th of 1 percent of each chunk, and the search finds all the species/chromosome/location chunks that are most similar to it.

The expected result is that the original source chunk is found with the highest score, and that any other high score indicates a common ancestor for those data segments.  The graphs below show those relationships.

A second pass over each data chunk finds the relationship, with the following steps:

- take a 1/10 of 1% sample of sequence data
- process it exactly as done for data inserted into elasticsearch
- process the reverse complement of that sample in the same way as done for data inserted into elasticsearch
- search for each sequence, save results (top 10 species/chromosome/location matches with score)
- mark any score above a threshold as indicating a relationship (the threshold is based on the scores returned by elasticsearch; in the example below, scores ranged from 50 to 2000, with very few above 250)
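
The second pass described above could be sketched like this. It is a minimal sketch under stated assumptions: the sample is taken as one contiguous slice, the threshold of 250 is only a guess based on the score range mentioned in the list, and the helper names and the `words` field name are hypothetical. The query body is a plain Elasticsearch `match` query, and `_score` is the score field Elasticsearch returns on each hit.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SAMPLE_DIVISOR = 1000   # 1/10 of 1% = 1/1000, from the text
SCORE_THRESHOLD = 250   # assumed cutoff; the text only says few scores exceed 250


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_chunk(chunk, divisor=SAMPLE_DIVISOR, rng=random):
    """Pick one contiguous sample covering 1/divisor of the chunk (assumed strategy)."""
    size = max(1, len(chunk) // divisor)
    start = rng.randrange(len(chunk) - size + 1)
    return chunk[start:start + size]


def build_query(words, field="words", size=10):
    """Build a match query body asking for the top `size` hits."""
    return {"size": size, "query": {"match": {field: words}}}


def related(hits, threshold=SCORE_THRESHOLD):
    """Keep only hits whose score is above the threshold."""
    return [hit for hit in hits if hit["_score"] > threshold]
```

Both the sample and `reverse_complement(sample)` would be processed into words and searched; hits from the second search are the ones drawn as reversed in the graphs.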

The graphs below show the marked relationships, with blue showing that the search term was the normal sequence, and orange showing that the search term was the reverse complement.  This means an inversion will usually show up as a sequence of orange lines that cross in the middle.
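
To see why an inversion produces crossing lines, consider an idealised case (an assumption for illustration, not data from the graphs): a block between positions `start` and `end` is inverted in the other species and nothing else changes. A position `x` in the block then matches position `start + end - x`, and every line drawn between the two species passes through the same midpoint.

```python
def inverted_matches(start, end, step):
    """Pair each sampled position in [start, end] with its position after the block is inverted."""
    return [(x, start + end - x) for x in range(start, end + 1, step)]


def x_at(x_top, x_bottom, t=0.5):
    """x-coordinate of the line from the top row to the bottom row, a fraction t of the way down."""
    return x_top + t * (x_bottom - x_top)


pairs = inverted_matches(10, 20, 2)
print(pairs)
# [(10, 20), (12, 18), (14, 16), (16, 14), (18, 12), (20, 10)]

# Halfway between the two rows, every line is at the same x: they all cross there.
print({x_at(x, x2) for x, x2 in pairs})
# {15.0}
```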


In [1]:
import altair as alt
import numpy as np
import pandas as pd

alt.renderers.enable()

RendererRegistry.enable('default')

#### Example #1: storing 1M bp records in elasticsearch, search for 10k of data
The cells below show the correspondence between chromosomes based on elasticsearch results.  In most cases, a chromosome corresponds to the same chromosome number in the other species, but some large-scale structural changes show up in the data.

In [2]:
df = pd.read_csv('data/generated_csv/cgh_1000000/data.csv', names=['sp', 'chr', 'loc', 'score', 'msp', 'mchr', 'mloc', 'orient'])
chrdf = df.groupby(['sp','chr', 'msp', 'mchr']).size().reset_index().rename(columns={0:'count'})
# only show the corresponding matches if there seems to be as much as 50M bp correspondence in the data
df50 = chrdf[ chrdf['count'] > 50 ]
df50

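| | sp | chr | msp | mchr | count |
|---|---|---|---|---|---|
| 6 | chimp | 1 | gorilla | 1 | 260 |
| 14 | chimp | 1 | human | 1 | 250 |
| 30 | chimp | 10 | gorilla | 10 | 156 |
| 36 | chimp | 10 | human | 10 | 148 |
| 56 | chimp | 11 | gorilla | 11 | 163 |
| 61 | chimp | 11 | human | 11 | 153 |
| 74 | chimp | 12 | gorilla | 12 | 151 |
| 80 | chimp | 12 | human | 12 | 146 |
| 91 | chimp | 13 | gorilla | 13 | 101 |
| 96 | chimp | 13 | human | 13 | 105 |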


In [3]:
hchr_list = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X', 'Y' ]
cchr_list = [ '1', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X', 'Y' ]

hchrs = [ hval for hval in hchr_list for cval in cchr_list ]
cchrs = [ cval for hval in hchr_list for cval in cchr_list ]
counts = []
for hchr,cchr in zip(hchrs,cchrs):
    xdf = df50[(df50['sp']=='human') & (df50['msp'] == 'chimp') & (df50['chr'] == hchr) & (df50['mchr'] == cchr)]
    count = 0
    if len(xdf) > 0:
        count = xdf.iloc[0]['count']
    counts.append(count)

source = pd.DataFrame({'x': hchrs,
                     'y': cchrs,
                     'z': counts})

alt.Chart(source).mark_rect().encode(
    x=alt.X('x:N', axis=alt.Axis(title='human chromosomes', grid=True, ticks=True)),
    y=alt.Y('y:N', axis=alt.Axis(title='chimp chromosomes', grid=True, ticks=True)),
    color=alt.Color('z:Q', title="count")
)

In [4]:
hchr_list = [ '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X', 'Y' ]
gchr_list = [ '1', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X' ]
# to get a better sense of the 5/17 move
#hchr_list = [ '3', '4', '5', '6', '7', '15', '16', '17', '18', '19' ]
#gchr_list = [ '3', '4', '5', '6', '7', '15', '16', '17', '18', '19' ]

hchrs = [ hval for hval in hchr_list for cval in gchr_list ]
cchrs = [ cval for hval in hchr_list for cval in gchr_list ]
counts = []
for hchr,cchr in zip(hchrs,cchrs):
    xdf = df50[(df50['sp']=='human') & (df50['msp'] == 'gorilla') & (df50['chr'] == hchr) & (df50['mchr'] == cchr)]
    count = 0
    if len(xdf) > 0:
        count = xdf.iloc[0]['count']
    counts.append(count)

source = pd.DataFrame({'x': hchrs,
                     'y': cchrs,
                     'z': counts})

alt.Chart(source).mark_rect().encode(
    x=alt.X('x:N', axis=alt.Axis(title='human chromosomes', grid=True, ticks=True)),
    y=alt.Y('y:N', axis=alt.Axis(title='gorilla chromosomes', grid=True, ticks=True)),
    color=alt.Color('z:Q', title="count")
)

In [5]:
cchr_list = [ '1', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X', 'Y' ]
gchr_list = [ '1', '2A', '2B', '3', '4', '5', '6', '7', '8', '9', '10', 
         '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', 
         '21', '22', 'X' ]

hchrs = [ hval for hval in cchr_list for cval in gchr_list ]
cchrs = [ cval for hval in cchr_list for cval in gchr_list ]
counts = []
for hchr,cchr in zip(hchrs,cchrs):
    # the source species here is chimp (x axis), matched against gorilla (y axis)
    xdf = df50[(df50['sp']=='chimp') & (df50['msp'] == 'gorilla') & (df50['chr'] == hchr) & (df50['mchr'] == cchr)]
    count = 0
    if len(xdf) > 0:
        count = xdf.iloc[0]['count']
    counts.append(count)

source = pd.DataFrame({'x': hchrs,
                     'y': cchrs,
                     'z': counts})

alt.Chart(source).mark_rect().encode(
    x=alt.X('x:N', axis=alt.Axis(title='chimp chromosomes', grid=True, ticks=True)),
    y=alt.Y('y:N', axis=alt.Axis(title='gorilla chromosomes', grid=True, ticks=True)),
    color=alt.Color('z:Q', title="count")
)

The table below shows that gorilla chromosome 5 sequence data is found in human and chimp chromosome 17.  The graphs further down show the large-scale structural differences in more detail.

In [7]:
df50[(df50['sp'] == 'gorilla') & (df50['chr'] == '5')]

| | sp | chr | msp | mchr | count |
|---|---|---|---|---|---|
| 826 | gorilla | 5 | chimp | 17 | 82 |
| 828 | gorilla | 5 | chimp | 5 | 120 |
| 832 | gorilla | 5 | human | 17 | 78 |
| 834 | gorilla | 5 | human | 5 | 108 |
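
One way to surface moves like this without checking each chromosome by hand is to keep only the rows where the matched chromosome differs from the source chromosome. The sketch below rebuilds the four rows shown above in a small stand-alone frame (the names `counts` and `moved` are just for illustration); the same filter could be applied to `df50`.

```python
import pandas as pd

# The four rows shown above for gorilla chromosome 5.
counts = pd.DataFrame({
    'sp': ['gorilla'] * 4,
    'chr': ['5'] * 4,
    'msp': ['chimp', 'chimp', 'human', 'human'],
    'mchr': ['17', '5', '17', '5'],
    'count': [82, 120, 78, 108],
})

# Keep only pairs where the matched chromosome differs from the source one.
moved = counts[counts['chr'] != counts['mchr']]
print(moved)
```

Note that this simple comparison would also flag human 2 against chimp or gorilla 2A/2B, since the names differ even though the chromosomes correspond.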


In [143]:
domains = [ 'same orientation', 'inversed']
color_scale = alt.Scale(
    domain=domains,
    range=['#6baed6', '#fcae91']
)


def chromosome_graph(csvFile, top_species, top_chromosome, middle_species, middle_chromosome, bottom_species, bottom_chromosome):
    csv = pd.read_csv(f"data/generated_csv/{csvFile}")
    return cgraph(csv, top_species, top_chromosome, middle_species, middle_chromosome, bottom_species, bottom_chromosome)

def cgraph(csv, top_species, top_chromosome, middle_species, middle_chromosome, bottom_species, bottom_chromosome, graph_width=600):
    g = alt.Chart(csv).mark_line().encode(
        x=alt.X('x',axis=alt.Axis(grid=False)),
        y=alt.Y('y',axis=alt.Axis(grid=False)),
        x2='x2',
        y2='y2',
        color=alt.Color('color:N', title='', scale=color_scale)
    )
    
    maxes = csv.max()
    maxCenter = maxes['x']
    maxRest = maxes['x2']
    maxY = max(maxes['y'], maxes['y2'])
     
    # Text labels placed to the right of the lines, one per row (top, middle, bottom),
    # each reading "species, chromosome"
    top_label = f"{top_species}, {top_chromosome}"
    middle_label = f"{middle_species}, {middle_chromosome}"
    bottom_label = f"{bottom_species}, {bottom_chromosome}"
    X_MARGIN = 10
    Y_MARGIN = 12
    top_data = pd.DataFrame({
        'x': [ maxRest + X_MARGIN, maxRest + X_MARGIN, maxRest + X_MARGIN ],
        'y': [ maxY - Y_MARGIN, int(maxY/2), Y_MARGIN ],
        'text': [ top_label, middle_label, bottom_label ]
    })
    bars = alt.Chart(top_data).mark_text(
        stroke='grey',
        opacity=0.9, 
        fontSize=10,
        fontStyle="italic",
        align="left"
    ).encode(
        x=alt.X('x:Q'),
        y=alt.Y('y:Q'),
        text=alt.Text('text'),
        color=alt.Color('color:N', legend=None, scale=color_scale)
    )
    
    x = alt.Chart().mark_text().encode(
        x=alt.X('x:Q', axis=alt.Axis(title='million bp', grid=False, ticks=True)),
        y=alt.Y('y:Q', axis=alt.Axis(title='', grid=False, labels=False, ticks=False)),
        color=alt.Color('color:N', legend=alt.Legend(orient="left",title='', symbolType="stroke"), scale=color_scale)
    )

    return alt.layer(g, bars, x).configure_view(
        stroke='transparent',
        width=graph_width
    ).configure_axis(grid=False)


##### Chromosome 1:  Human + Chimp + Gorilla

When 3 species are shown, we can identify which species had what sort of large scale event (inversion, duplication, splitting, joining).

For example, below there are events like:
- a large inversion in chimp 2A, and a smaller human inversion
- a large section of chimp 7 getting duplicated onto the end of chimp 7 (needs some more investigation)


*Chromosome 1 events*
- Gorilla, chromosome 1, a sequence of inversion events

In [144]:
chromosome_graph('chimp1_x2.csv', "human", "1", "chimp", "1", "gorilla", "1")

*Chromosome 2 events*
- Chimp, large 2A inversion
- Human, smaller 2 inversion

In [100]:
chromosome_graph('chimp2A_x2.csv', "human", "2", "chimp", "2A", "gorilla", "2A")

*Chromosome 2 events*
- Human, 2 chromosomes merge into one

The graph below shows that the right side of human chromosome 2 matches chromosome 2B in chimp and gorilla.

In [101]:
chromosome_graph('chimp2B_x2.csv', 'human', '2', 'chimp', '2B', 'gorilla', '2B')

*Chromosome 3 events*
- Human, several large inversions

In [102]:
chromosome_graph('chimp3_x2.csv', 'human', '3', 'chimp', '3', 'gorilla','3')

In [103]:
chromosome_graph('chimp4_x2.csv', 'human', '4', 'chimp', '4', 'gorilla', '4')

In [104]:
chromosome_graph('chimp5_x2.csv', 'human', '5', 'chimp', '5', 'gorilla', '5')

In [105]:
chromosome_graph('chimp6_x2.csv', 'human', '6', 'chimp', '6', 'gorilla', '6')

*Chromosome 7 events*
- it looks like several sections of chimp 7 got duplicated onto the end (unexpected, needs more investigation)

In [106]:
chromosome_graph('chimp7_x2.csv', 'human', '7', 'chimp', '7', 'gorilla', '7')

In [107]:
chromosome_graph('chimp8_x2.csv', 'human', '8', 'chimp', '8', 'gorilla', '8')

In [108]:
chromosome_graph('chimp9_x2.csv', 'human', '9', 'chimp', '9', 'gorilla', '9')

In [109]:
chromosome_graph('chimp10_x2.csv', 'human', '10', 'chimp', '10', 'gorilla', '10')

In [110]:
chromosome_graph('chimp11_x2.csv', 'human', '11', 'chimp', '11', 'gorilla', '11')

In [111]:
chromosome_graph('chimp12_x2.csv', 'human', '12', 'chimp', '12', 'gorilla', '12')

In [112]:
chromosome_graph('chimp13_x2.csv', 'human', '13', 'chimp', '13', 'gorilla', '13')

In [113]:
chromosome_graph('chimp14_x2.csv', 'human', '14', 'chimp', '14', 'gorilla', '14')

In [114]:
chromosome_graph('chimp15_x2.csv', 'human', '15', 'chimp', '15', 'gorilla', '15')

In [115]:
chromosome_graph('chimp16_x2.csv', 'human', '16', 'chimp', '16', 'gorilla', '16')

In [116]:
chromosome_graph('chimp17_x2.csv', 'human', '17', 'chimp', '17', 'gorilla', '17')

In [117]:
chromosome_graph('chimp18_x2.csv', 'human', '18', 'chimp', '18', 'gorilla', '18')

In [118]:
chromosome_graph('chimp19_x2.csv', 'human', '19', 'chimp', '19', 'gorilla', '19')

In [119]:
chromosome_graph('chimp20_x2.csv', 'human', '20', 'chimp', '20', 'gorilla', '20')

In [120]:
chromosome_graph('chimp21_x2.csv', 'human', '21', 'chimp', '21', 'gorilla', '21')

In [121]:
chromosome_graph('chimp22_x2.csv', 'human', '22', 'chimp', '22', 'gorilla', '22')

In [122]:
chromosome_graph('chimpX_x2.csv', 'human', 'X', 'chimp','X', 'gorilla', 'X')

In [123]:
chromosome_graph("gorilla5.csv", 'human', '17', 'gorilla', '5', 'chimp', '17')

Large parts of gorilla chromosome 5 correspond to human and chimp chromosome 17.

In [158]:
chromosome_graph("g5_17.csv", 'human', '17', 'gorilla', '5', 'chimp', '17')

In [178]:
df = pd.read_csv('ch16_cgh_50000_data.csv', index_col=False)
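# keep only high-scoring chimp rows; copy so new columns can be added without a SettingWithCopyWarning
df = df[(df['sp'] == 'chimp') & (df['score'] > 1000)].copy()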

In [179]:
df.head()

| | sp | chr | loc | score | msp | mchr | mloc | color |
|---|---|---|---|---|---|---|---|---|
| 12 | chimp | 16 | 79100000 | 1446 | human | 16 | 27700000 | same orientation |
| 13 | chimp | 16 | 79100000 | 1011 | human | 16 | 27750000 | same orientation |
| 15 | chimp | 16 | 27200000 | 1389 | human | 16 | 20900000 | same orientation |
| 16 | chimp | 16 | 27200000 | 1156 | human | 16 | 20950000 | same orientation |
| 17 | chimp | 16 | 75800000 | 2437 | human | 16 | 73250000 | same orientation |


In [180]:
# scale positions to plot units of 500,000 bp (e.g. 79,100,000 -> 158.2)
df['x'] = df['loc'] / 500000
df['x2'] = df['mloc'] / 500000

In [181]:
def sp_to_y(val):
    if val == 'human':
        return 0
    elif val == 'chimp':
        return 200
    else:
        return 400
df['y'] = [ sp_to_y(val) for val in df['sp']]
df['y2'] = [ sp_to_y(val) for val in df['msp']]

In [182]:
df

| | sp | chr | loc | score | msp | mchr | mloc | color | x | x2 | y | y2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | chimp | 16 | 79100000 | 1446 | human | 16 | 27700000 | same orientation | 158.2 | 55.4 | 200 | 0 |
| 13 | chimp | 16 | 79100000 | 1011 | human | 16 | 27750000 | same orientation | 158.2 | 55.5 | 200 | 0 |
| 15 | chimp | 16 | 27200000 | 1389 | human | 16 | 20900000 | same orientation | 54.4 | 41.8 | 200 | 0 |
| 16 | chimp | 16 | 27200000 | 1156 | human | 16 | 20950000 | same orientation | 54.4 | 41.9 | 200 | 0 |
| 17 | chimp | 16 | 75800000 | 2437 | human | 16 | 73250000 | same orientation | 151.6 | 146.5 | 200 | 0 |
| 18 | chimp | 16 | 75800000 | 1048 | gorilla | 16 | 78100000 | same orientation | 151.6 | 156.2 | 200 | 400 |
| 39 | chimp | 16 | 16950000 | 3161 | human | 16 | 17000000 | inversed | 33.9 | 34.0 | 200 | 0 |
| 40 | chimp | 16 | 10200000 | 2152 | human | 16 | 9800000 | same orientation | 20.4 | 19.6 | 200 | 0 |
| 41 | chimp | 16 | 10200000 | 2089 | gorilla | 16 | 10200000 | same orientation | 20.4 | 20.4 | 200 | 400 |
| 52 | chimp | 16 | 39450000 | 1662 | gorilla | 16 | 38000000 | inversed | 78.9 | 76.0 | 200 | 400 |


In [183]:
cgraph(df, 'human', '16', 'chimp', '16', 'gorilla', '16', 3000)