# Huddinge Browser

This is a tool for browsing kmer enrichments in interactive two-dimensional plots. It is a work in progress, so the user interface is a bit involved and the backend is quite fragile, but bear with me. I'm happy to take comments.

Below is a sample session that doubles as a tutorial for the system.

In [1]:
import sys
import logging as log
log.basicConfig(level=log.INFO,
                format='%(asctime)s:%(funcName)s:%(levelname)s:%(message)s')
import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh', logo=False)

In [2]:
%cd /tmp/kpalin
## Temporary directory


/tmp/kpalin


In [3]:
import huddinge_tsne_browser.tsne_mapper as htm 
import huddinge_tsne_browser.huddinge_browser as hhb
import huddinge_tsne_browser.datashaderselect

The two main modules of the system are `huddinge_tsne_browser.tsne_mapper` and `huddinge_tsne_browser.huddinge_browser`. The `TsneMapper` class in `tsne_mapper` reads the input files and lays out the input kmers if they have not been laid out before. The `HuddingBrowser` class in `huddinge_browser` provides the user-facing interface.


The distribution comes with 8-mers laid out with TSNE approximating the Huddinge distance. Software for calculating the Huddinge distance between all pairs of kmers (and producing appropriate output) is in branch `huddinge_pairs` of the git repository `https://github.com/kpalin/MODER.git`, and the layout computation can be run from the command line with `python huddinge_tsne_browser`.

First you need to compute 8-mer counts for some SELEX [experiment](https://www.ebi.ac.uk/ena/data/view/PRJEB3289).

In [4]:
f = """ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172922/fastq/HNF4A_TGACAG20NGA_AF_1.fastq.gz
ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172922/fastq/HNF4A_TGACAG20NGA_AF_2.fastq.gz
ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172922/fastq/HNF4A_TGACAG20NGA_AF_3.fastq.gz
ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172922/fastq/HNF4A_TGACAG20NGA_AF_4.fastq.gz
""".split()
for i in f:
    !wget --no-clobber {i}

File ‘HNF4A_TGACAG20NGA_AF_1.fastq.gz’ already there; not retrieving.

File ‘HNF4A_TGACAG20NGA_AF_2.fastq.gz’ already there; not retrieving.

File ‘HNF4A_TGACAG20NGA_AF_3.fastq.gz’ already there; not retrieving.

File ‘HNF4A_TGACAG20NGA_AF_4.fastq.gz’ already there; not retrieving.



## Calculate kmer counts

Then calculate 8-mer counts for your data. Currently only jellyfish text output is supported. (Also note that jellyfish needs the `--disk` option for 8-mers and 7-mers.)

In [5]:
%%bash
K=8
for i in *.fastq.gz
do
    
    OUT=$(basename $i .fastq.gz).${K}mer_counts.jf
    echo $OUT
    if [ ! -e ${OUT} ];
    then
        zcat $i | /usr/bin/time -v jellyfish count -o $OUT --text -m ${K} -s 1M --bf-size 1G -t 16 --disk /dev/stdin
    fi
done

HNF4A_TGACAG20NGA_AF_1.8mer_counts.jf
HNF4A_TGACAG20NGA_AF_2.8mer_counts.jf
HNF4A_TGACAG20NGA_AF_3.8mer_counts.jf
HNF4A_TGACAG20NGA_AF_4.8mer_counts.jf
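
If you want to sanity-check a count file before loading it, the text output can be read with a few lines of Python. This is only a sketch: it assumes each data line has the form `KMER COUNT` separated by whitespace and skips anything else (such as a header line). The function name `read_kmer_counts` and the example header are made up for illustration and are not part of `huddinge_tsne_browser`.

```python
import io
import re

# Assumed line format: a DNA kmer, whitespace, an integer count.
KMER_LINE = re.compile(r"^([ACGT]+)\s+(\d+)$")


def read_kmer_counts(handle, k=8):
    """Return {kmer: count} for lines that look like 'KMER COUNT'.

    Lines that do not match (for example a header) are silently skipped,
    as are kmers whose length differs from k.
    """
    counts = {}
    for line in handle:
        match = KMER_LINE.match(line.strip())
        if match and len(match.group(1)) == k:
            counts[match.group(1)] = int(match.group(2))
    return counts


# Hypothetical file content; a real file would be opened with open(path).
example = io.StringIO(u'{"hypothetical": "header"}\nAAACGGGA 15\nAAACGGGG 11\n')
counts = read_kmer_counts(example)
print(sorted(counts.items()))
```

For a real file, pass `open("HNF4A_TGACAG20NGA_AF_1.8mer_counts.jf")` instead of the in-memory example.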


## Initialize layout

First you need to lay out your kmers. You can compute Huddinge distances between all pairs of the given kmers with the accompanying `all_pairs_huddinge` command. Its output can be laid out with the `huddinge_tsne_browser` command, and the resulting layout file should be given to `TsneMapper("kmer_layout.tsnet")`. The TSNE layout of all 8-mers takes more than an hour.
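
To give an idea of what is being approximated, here is a small sketch of the Huddinge distance for ungapped kmers, using the usual definition: the length of the longer kmer minus the largest number of matching positions over all ungapped alignments (shifts) of the two kmers. This is an illustration only. It ignores gapped kmers and reverse complements, and it is not necessarily how `all_pairs_huddinge` computes the distance; the function name `huddinge_distance` is my own.

```python
def huddinge_distance(a, b):
    """Huddinge distance between two ungapped kmers (illustrative sketch).

    Slides b along a over every possible shift, counts matching
    positions in the overlap, and subtracts the best match count
    from the length of the longer kmer.
    """
    best = 0
    for shift in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, base in enumerate(b)
            if 0 <= i + shift < len(a) and a[i + shift] == base
        )
        best = max(best, matches)
    return max(len(a), len(b)) - best


print(huddinge_distance("AAACGGGA", "AAACGGGA"))  # identical kmers
print(huddinge_distance("AAACGGGA", "AAACGGGG"))  # one mismatch
print(huddinge_distance("ACGTACGT", "CGTACGTA"))  # shifted by one base
```

The third pair shows why this differs from Hamming distance: the two kmers mismatch at every position when aligned end to end, but a one-base shift makes seven positions match.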


For 8mers, the layout has already been done and the layout can be downloaded from [here](http://www.cs.helsinki.fi/u/kpalin/kmer8_iters4k.tsne.gz).


In [310]:
!wget --no-clobber http://www.cs.helsinki.fi/u/kpalin/kmer8_iters4k.tsne.gz
!gunzip kmer8_iters4k.tsne.gz

--2018-01-12 17:45:23--  http://www.cs.helsinki.fi/u/kpalin/kmer8_iters4k.tsne.gz
Resolving www.cs.helsinki.fi (www.cs.helsinki.fi)... 128.214.166.78
Connecting to www.cs.helsinki.fi (www.cs.helsinki.fi)|128.214.166.78|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.cs.helsinki.fi/u/kpalin/kmer8_iters4k.tsne.gz [following]
--2018-01-12 17:45:23--  https://www.cs.helsinki.fi/u/kpalin/kmer8_iters4k.tsne.gz
Connecting to www.cs.helsinki.fi (www.cs.helsinki.fi)|128.214.166.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 169402557 (162M) [application/x-gzip]
Saving to: ‘kmer8_iters4k.tsne.gz’


2018-01-12 17:45:25 (64.6 MB/s) - ‘kmer8_iters4k.tsne.gz’ saved [169402557/169402557]



In [94]:
import huddinge_tsne_browser.tsne_mapper as htm 
import huddinge_tsne_browser.huddinge_browser as hhb
import huddinge_tsne_browser.datashaderselect
reload(huddinge_tsne_browser.datashaderselect)

reload(htm)


<module 'huddinge_tsne_browser.tsne_mapper' from '/home/kpalin/software/huddinge_tsne_browser/huddinge_tsne_browser/tsne_mapper.pyc'>

In [311]:
tsne = htm.TsneMapper("kmer8_iters4k.tsne")
kmer_size=8
for i in range(4):
    tsne.add_kmercounts("HNF4A_{}".format(i+1),
                        "HNF4A_TGACAG20NGA_AF_{}.{}mer_counts.jf".format(i+1,kmer_size))


2018-01-12 17:46:00,237:read_data:INFO:Read 65536 sequences.
2018-01-12 17:46:00,238:read_data:INFO:Setting embedding from input data


## Browse

Create the browser object and display the browsing window. The browsing tools can be selected at the top right. The main display at the top left shows the laid-out kmers, colored according to the counts loaded above. Clicking the main display gives you a table of the kmers in the selected rectangle at the top right and a more detailed figure, with a point-wise hover tool for counts, at the bottom. The coloring criterion of the main plot can be selected from the drop-down menu. These interactive features require a running Jupyter server.


In [312]:
reload(hhb)

br=hhb.HuddingBrowser(tsne)
p = br.holoview_plot()
p

2018-01-12 17:46:03,535:__init__:INFO:Initialized DataShaderSelect


A pandas dataframe of the selected kmers is available from the `br.selected` attribute.
The selection (all kmers, all kmers in the zoom plot, or the kmers selected in the zoom plot) can then be worked with in pandas:

In [351]:
br.selected.head()

| Sequence | tsne0 | tsne1 | HNF4A_1 | HNF4A_2 | HNF4A_3 | HNF4A_4 |
|---|---|---|---|---|---|---|
| AAACGGGA | 0.549887 | -83.427155 | 15.0 | 20.0 | 45.0 | 16.0 |
| AAACGGGG | -1.87227 | -82.756439 | 11.0 | 25.0 | 41.0 | 15.0 |
| AACCGGGA | 1.170999 | -81.577301 | 21.0 | 29.0 | 28.0 | 5.0 |
| AACCGGGG | -2.306109 | -81.515068 | 21.0 | 11.0 | 21.0 | 10.0 |
| AACCGGGT | -0.600562 | -77.583412 | 13.0 | 14.0 | 24.0 | 4.0 |


In [352]:
br.selected.to_csv("selected_kmers.tsv",sep="\t")
!head selected_kmers.tsv

Sequence	tsne0	tsne1	HNF4A_1	HNF4A_2	HNF4A_3	HNF4A_4
AAACGGGA	0.549886524677	-83.427154541	15.0	20.0	45.0	16.0
AAACGGGG	-1.87226963043	-82.756439209	11.0	25.0	41.0	15.0
AACCGGGA	1.17099869251	-81.5773010254	21.0	29.0	28.0	5.0
AACCGGGG	-2.30610871315	-81.5150680542	21.0	11.0	21.0	10.0
AACCGGGT	-0.600561797619	-77.5834121704	13.0	14.0	24.0	4.0
AACGGGAA	0.941661596298	-83.6279678345	28.0	23.0	45.0	14.0
AACGGGAC	0.935633540154	-83.6769256592	35.0	28.0	57.0	12.0
AACGGGAG	0.968995451927	-83.5255355835	15.0	14.0	22.0	15.0
AACGGGAT	0.967688202858	-83.563835144	20.0	18.0	32.0	10.0


In [116]:
br.selected["HNF4A_1"].describe()

count    424.000000
mean      17.077830
std        9.372314
min        1.000000
25%       10.000000
50%       15.500000
75%       22.000000
max       55.000000
Name: HNF4A_1, dtype: float64

In [313]:
%%opts Bivariate [bandwidth=0.5] (cmap='Blues') Points (size=2)
from holoviews.operation import gridmatrix
point_grid = gridmatrix(hv.Dataset(br.selected[tsne.data_dims]), 
                        diagonal_type=hv.Histogram, chart_type=hv.Points) 
point_grid

## Custom values in plot

You can replace the kmer counts with values you compute yourself.

In [304]:
# Pseudocount for kmer count estimates
p_cnt = (tsne.embedding[tsne.data_dims]+1.0)

# Scale to mean count 1.0
norm_cnt = p_cnt/(p_cnt*4**-8).sum()

# log fold change per cycle
ln_fold_change = np.log(norm_cnt).diff(axis=1).drop("HNF4A_1",axis=1)

# Mean fold change weighted by number of reads
w_mean_ln_fold_change = (p_cnt*ln_fold_change ).sum(axis=1)/p_cnt.sum(axis=1)

annot = tsne.embedding.join(norm_cnt,rsuffix="_norm").join(ln_fold_change,rsuffix="_lnfold")
annot["MeanFold"] = w_mean_ln_fold_change
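
To make the steps above easier to follow, here is the same arithmetic on a tiny made-up data set, written in plain Python so that each step is visible. The three 4-mers and their counts are invented for illustration; with the real data the mean is taken over all 4**8 8-mers, which is what the `4**-8` factor above does.

```python
import math

# Hypothetical counts for three kmers over three SELEX cycles.
counts = {
    "AAAA": [10, 20, 40],  # enriched in every cycle
    "CCCC": [10, 10, 10],  # constant raw count
    "GGGG": [10, 0, 0],    # depleted after the first cycle
}
n_kmers = len(counts)
n_cycles = 3

# Pseudocount of one, so that zero counts survive the logarithm.
p_cnt = {k: [c + 1.0 for c in v] for k, v in counts.items()}

# Scale every cycle (column) so that its mean count is 1.0.
col_mean = [sum(p_cnt[k][j] for k in p_cnt) / n_kmers for j in range(n_cycles)]
norm_cnt = {k: [v[j] / col_mean[j] for j in range(n_cycles)]
            for k, v in p_cnt.items()}

# Log fold change between consecutive cycles; the first cycle has no
# predecessor, which is why its column is dropped in the pandas version.
ln_fold = {k: [math.log(v[j]) - math.log(v[j - 1]) for j in range(1, n_cycles)]
           for k, v in norm_cnt.items()}

for kmer in sorted(ln_fold):
    print(kmer, [round(x, 3) for x in ln_fold[kmer]])
```

Note that the constant kmer `CCCC` gets a negative fold change in the last step: its raw count does not change, but its share of the cycle shrinks because `AAAA` grows.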


In [305]:
import huddinge_tsne_browser.tsne_mapper as htm 
import huddinge_tsne_browser.huddinge_browser as hhb
import huddinge_tsne_browser.datashaderselect
reload(huddinge_tsne_browser.datashaderselect)
reload(hhb)

reload(htm)
%pdb off

Automatic pdb calling has been turned OFF


In [314]:
tsne_fold = htm.TsneMapper("kmer8_iters4k.tsne")
tsne_fold.set_kmer_values(annot[[x for x in annot.columns if x.endswith("_lnfold")] + ["MeanFold"]])
#tsne_fold.set_kmer_values(annot[["MeanFold"]])

br_fold = hhb.HuddingBrowser(tsne_fold)

br_fold.holoview_plot()



2018-01-12 17:47:43,304:read_data:INFO:Read 65536 sequences.
2018-01-12 17:47:43,306:read_data:INFO:Setting embedding from input data
2018-01-12 17:47:43,442:__init__:INFO:Initialized DataShaderSelect


Diverging palette


### Note on coloring

The datashaded main plot is normalized with histogram normalization to maximize the dynamic range of the color palette. This might result in surprising color scales.

The colormap is viridis for data that is entirely non-negative (or entirely non-positive) and RdYlBu (Red-Yellow-Blue) for data with both positive and negative values. The diverging palette is not very informative, however, because of the histogram equalisation/normalisation.
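
The effect of the histogram normalization can be illustrated with a simplified, rank-based sketch. This is not the code the plot actually uses, only an approximation of the idea: each value is mapped to its position in the empirical cumulative distribution, so colors follow ranks rather than magnitudes. The function names and the example values are made up.

```python
from bisect import bisect_right


def linear_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / float(hi - lo) for v in values]


def equalize(values):
    """Map values onto [0, 1] by their empirical CDF (rank-based sketch)."""
    ordered = sorted(values)
    n = float(len(ordered))
    cdf = [bisect_right(ordered, v) / n for v in values]
    lo = min(cdf)
    if lo == 1.0:  # all values identical
        return [0.0 for _ in values]
    return [(c - lo) / (1.0 - lo) for c in cdf]


# Hypothetical kmer counts with one strongly enriched kmer.
values = [1, 2, 3, 4, 1000]
print([round(x, 3) for x in linear_scale(values)])
print([round(x, 3) for x in equalize(values)])
```

With linear scaling the four small values are nearly indistinguishable next to the outlier, while after equalization they are spread evenly over the palette. The price is that equal steps in color no longer correspond to equal steps in the data, which is why the color scale can look surprising.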