### Part 2: Insertions - Random Selection of Structural Variants


**Background**
- 5000 insertions and 5000 deletions were randomly selected from our union callset of sequence-resolved variants.

- **3991** unique insertions are described below.

- Features were generated by svviz to describe each variant

- tSNE was used to visualize the structure of the data

- The goal is to randomly select data points from each tSNE cluster and distribute the selected variants for manual curation.

- tSNE itself does not assign cluster labels, so DBSCAN will be used to generate them. For each set of DBSCAN cluster labels, a set number of variants will be drawn at random from each cluster.
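
The per-cluster selection described above can be sketched as follows. This is a minimal illustration, not the notebook's actual selection code: the helper name `sample_per_cluster`, the seed, the number drawn per cluster, and the toy data are all assumptions. The column name `clusterLabel` matches the label files loaded later in the notebook. scikit-learn's DBSCAN marks noise points with the label `-1`, so they form their own group here; whether noise points should be sent for curation is a separate decision.

```python
import numpy as np
import pandas as pd


def sample_per_cluster(df, label_col, n_per_cluster, seed=0):
    """Randomly pick up to n_per_cluster rows from each cluster label (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    picked = []
    for _, group in df.groupby(label_col):
        # Small clusters contribute every member they have.
        k = min(n_per_cluster, len(group))
        idx = rng.choice(group.index.to_numpy(), size=k, replace=False)
        picked.append(df.loc[idx])
    return pd.concat(picked).sort_index()


# Toy stand-in for tSNE coordinates plus DBSCAN labels (not real data).
toy = pd.DataFrame({
    "x": np.arange(10, dtype=float),
    "y": np.arange(10, dtype=float),
    "clusterLabel": [0, 0, 0, 0, 1, 1, 1, 2, 2, -1],
})

selected = sample_per_cluster(toy, "clusterLabel", n_per_cluster=2)
print(selected["clusterLabel"].value_counts().sort_index())
```

Fixing the seed keeps the selection reproducible, which helps when the same variant list has to be regenerated for several curators.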

**Technical Overview**


Part 2

- Secondary DBSCAN analyses to generate cluster groups
- The dataframe produced by the tSNE analysis was run through a DBSCAN model, with min_samples changed for each iteration: min_samples = 0, 5, 10, 15, 20
- For each iteration of the min_samples/DBSCAN analysis, the following is displayed:
    - tSNE plot with DBSCAN clusters
    - Histogram displaying the frequency of each cluster label
    - Cumulative distribution plot for each histogram
- The following analysis shows only the DBSCAN results computed from tSNE alone (SVD results are excluded). Given the results of the initial analysis, I think the DBSCAN/tSNE results give the best representation of the data.
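
As a rough sketch of the min_samples sweep described above, the loop below fits one DBSCAN model per setting on a 2-D embedding. The synthetic points, the `EPS` value, and the variable names are assumptions for illustration; the source does not state which `eps` was used. Newer scikit-learn releases may reject `min_samples=0`, so this sketch substitutes `1` for that setting.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2-D points standing in for the tSNE embedding (not real data).
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),
])

EPS = 0.5  # assumed neighborhood radius; not taken from the source

labels_by_min_samples = {}
for min_samples in (1, 5, 10, 15, 20):  # 1 stands in for the source's 0
    model = DBSCAN(eps=EPS, min_samples=min_samples)
    labels = model.fit_predict(coords)
    labels_by_min_samples[min_samples] = labels

    # Label -1 marks noise, so it is not counted as a cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    print(f"min_samples={min_samples}: {n_clusters} clusters, {n_noise} noise points")
```

With `eps` held fixed, raising `min_samples` can only shrink the set of core points, so the number of noise points never decreases across the sweep.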

***
**Part 1**
***

In [1]:
'''
Import statements
'''
import pandas as pd
import numpy as np
from fancyimpute import KNN
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import LeaveOneOut
from scipy.stats import ks_2samp
from scipy import stats
from matplotlib import pyplot
from scipy.linalg import svd
from sklearn.decomposition import TruncatedSVD
import sqlite3
from sklearn.manifold import TSNE
import bokeh.palettes as palettes
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.cluster import DBSCAN
from bokeh.charts import Scatter, Histogram, output_file, show
from bokeh.plotting import figure, show, output_file, ColumnDataSource
from bokeh.io import output_notebook
from bokeh.models import HoverTool, BoxSelectTool, Legend
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)

In [2]:
'''
Load Data
'''
df = pd.read_csv('dftsne_ins.csv')
minSample0 = pd.read_csv('INS.tSNE_minSample_0.csv')
minSample5 = pd.read_csv('INS.tSNE_minSample_5.csv')
minSample10 = pd.read_csv('INS.tSNE_minSample_10.csv')
minSample15 = pd.read_csv('INS.tSNE_minSample_15.csv')
minSample20 = pd.read_csv('INS.tSNE_minSample_20.csv')

In [3]:
df['minSample0'] = minSample0['clusterLabel']
df['minSample5'] = minSample5['clusterLabel']
df['minSample10'] = minSample10['clusterLabel']
df['minSample15'] = minSample15['clusterLabel']
df['minSample20'] = minSample20['clusterLabel']
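
Once the label columns are attached, the cluster-label frequencies and their cumulative distribution mentioned in the overview can be tabulated directly with pandas. The snippet below is a sketch on toy labels, not the notebook's plotting code; the series is a made-up stand-in for a column such as `df['minSample5']`.

```python
import pandas as pd

# Toy labels standing in for one label column (not real data).
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 2, -1], name="minSample5")

# Frequency of each cluster label, largest cluster first.
counts = labels.value_counts()

# Cumulative fraction of variants covered as clusters are added by size.
cumulative = counts.cumsum() / counts.sum()

summary = pd.DataFrame({"count": counts, "cumulative_fraction": cumulative})
print(summary)
```

The cumulative column shows how many clusters are needed to cover most of the variants, which is useful when deciding how many variants to draw from each cluster.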

**Minimum Samples: 0**

In [4]:
output_notebook()

# Hover tooltip: point index, tSNE coordinates, and DBSCAN cluster label
hover = HoverTool(
        tooltips=[
            ("index", "$index"),
            ("(x,y)", "($x, $y)"),
            ("Group ID", "@minSample0"),
        ]
    )

# Scatter plot of the tSNE embedding, colored by the min_samples = 0 cluster labels
p = Scatter(df, x='x', y='y', color='minSample0', palette=palettes.Category20[20], tools=[hover])
output_file("tSNE_DBSCAN_INS_minSample0_label.html")
show(p)

**Minimum Samples: 5**

In [18]:
# Hover tooltip: point index, tSNE coordinates, and DBSCAN cluster label
hover = HoverTool(
        tooltips=[
            ("index", "$index"),
            ("(x,y)", "($x, $y)"),
            ("Group ID", "@minSample5"),
        ]
    )

# Scatter plot of the tSNE embedding, colored by the min_samples = 5 cluster labels
p = Scatter(df, x='x', y='y', color='minSample5', palette=palettes.Category20[20], tools=[hover])
output_file("tSNE_DBSCAN_INS_minSample5_label.html")
show(p)