### Crowdvariant Analysis - Machine Learning
<br>

**Summary**

1. Data collection and data cleaning
2. Data preprocessing
3. Machine Learning analysis

** Notes **

- Gathered crowdsourced labels from the crowdvariant study
- high confidence labels only available for HG002
    - Are there other labels?
- All deletions
- 1514 data points total
    - 552 Heterozygous Variant (CrowdVar Label = 1)  [Confidence: >=84%]
    - 959 Homozygous Variant (CrowdVar Label = 0)    [Confidence: >=84%]
    - 3   Homozygous Reference (CrowdVar Label = 2)  [Confidence: >=91%]
    - 1   Unknown

***
Data 
***

**Train/Test Data**

- CrowdVariant Data - cleaned and parsed in Part 1



** Prediction Dataset **

- HG002 Deletions
- Randomly selected datapoints (Try 1) April 2017

***
Data Preprocessing
***

- Drop columns with labels
    
    'GTcons', 'GTconflict',	'GTsupp', 'CN0_prob', 'CN1_prob', 'CN2_prob', 'Label', 'TenX.GT', 'pacbio.GT', 'IllMP.GT', 'Ill250.GT', 'Ill300x.GT'



- Drop irrelevant columns
    
    'chrom', 'start', 'end', 'sample'


***
Machine Learning
***

**Objective**
- Train [Random Forest Classifier](http://scikit-learn.org/stable/modules/ensemble.html#forest) with labeled CrowdVariant Data
- Train model using train test split [train on 70% of data and test on 30%]
- Predict Genotype Labels in the 5000 randomly selected set of deletions from svviz
- [Multiclass Classifiers](http://scikit-learn.org/stable/modules/multiclass.html)

**Future Tasks:** 
1. Train model using K-Fold validation
2. Run the model multiple times and take an average of the precision score
2. Compare this precision score to the average precision score of another model [i.e.: neural net]

In [130]:
"""
Imports
"""
import pandas as pd
import numpy as np
from fancyimpute import KNN
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import LeaveOneOut
from scipy.stats import ks_2samp
from scipy import stats
from matplotlib import pyplot
from sklearn import preprocessing
from scipy.linalg import svd
from sklearn.decomposition import TruncatedSVD
import sqlite3
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA as sklearnPCA
import plotly.plotly as py
from sklearn.cluster import DBSCAN
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score, precision_score
from sklearn import preprocessing
from ggplot import *
from bokeh.charts import TimeSeries
from bokeh.models import HoverTool
from bokeh.plotting import show
from bokeh.charts import Scatter, Histogram, output_file, show
from bokeh.plotting import figure, show, output_file, ColumnDataSource
from bokeh.io import output_notebook
from bokeh.charts import Bar, output_file, show
import bokeh.palettes as palettes
from bokeh.models import HoverTool, BoxSelectTool, Legend
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)

In [131]:
### Import Data
df_crowd = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/Train/Final_DF/CrowdVar.Train_HG002.csv')

In [186]:
### Copy data in new dataframe as a later reference
df_crowd_2 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/Train/Final_DF/CrowdVar.Train_HG002.csv')

In [132]:
df_crowd.head(3)

Unnamed: 0,chrom,start,end,sample,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,...,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct,Label,CN0_prob,CN1_prob,CN2_prob
0,1,187464828,187466479,HG002,579.446154,17.934094,65.0,756.430769,163.879409,0.0,...,8,0.096911,0,0.0,0,0,1.0,0.0,0.91,0.09
1,1,33156824,33157000,HG002,557.0,20.66584,13.0,1158.307692,134.247982,0.0,...,2,0.221591,0,0.0,0,0,1.0,0.04,0.91,0.05
2,1,53594099,53595428,HG002,574.335366,18.613946,164.0,678.518293,139.056203,31.0,...,3,0.059443,0,0.0,0,0,0.0,0.96,0.04,0.0


In [133]:
### Drop irrelevant columns and GT information
df_crowd.drop(['GTcons'], axis=1, inplace = True)
df_crowd.drop(['GTconflict'], axis=1, inplace = True)
df_crowd.drop(['GTsupp'], axis=1, inplace = True)
# df_crowd.drop('SVtype', axis=1)
# df_crowd.drop('type',axis=1)
df_crowd.drop(['start'],axis=1, inplace = True)
df_crowd.drop(['end'],axis=1, inplace = True)
df_crowd.drop(['chrom'],axis=1, inplace = True)
# df_crowd.drop('Size',axis=1)
df_crowd.drop(['CN0_prob'],axis=1, inplace = True)
df_crowd.drop(['CN1_prob'],axis=1, inplace = True)
df_crowd.drop(['CN2_prob'],axis=1, inplace = True)
df_crowd.drop(['TenX.GT'],axis=1, inplace = True)
df_crowd.drop(['pacbio.GT'],axis=1, inplace = True)
df_crowd.drop(['IllMP.GT'],axis=1, inplace = True)
df_crowd.drop(['Ill250.GT'],axis=1, inplace = True)
df_crowd.drop(['Ill300x.GT'],axis=1, inplace = True)
df_crowd.drop(['sample'],axis=1, inplace = True)

In [134]:
#Save new dataframe to csv file [with dropped columns] and store header names
df_crowd.to_csv('df_crowd_headers.csv', index=False)

In [135]:
df_crowd.head(3)

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,TenX.HP2_ref_reason_alignmentScore,TenX.HP2_ref_reason_orientation,size,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct,Label
0,579.446154,17.934094,65.0,756.430769,163.879409,0.0,65.0,0.0,526.250785,88.797088,...,2.0,0.0,1651,8,0.096911,0,0.0,0,0,1.0
1,557.0,20.66584,13.0,1158.307692,134.247982,0.0,13.0,0.0,545.050452,62.371767,...,12.0,0.0,176,2,0.221591,0,0.0,0,0,1.0
2,574.335366,18.613946,164.0,678.518293,139.056203,31.0,133.0,0.0,518.29674,91.633718,...,3.0,0.0,1329,3,0.059443,0,0.0,0,0,0.0


In [136]:
# Store data in a new variable which will be converted to a matrix
X = df_crowd

In [137]:
X.head(3)

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,TenX.HP2_ref_reason_alignmentScore,TenX.HP2_ref_reason_orientation,size,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct,Label
0,579.446154,17.934094,65.0,756.430769,163.879409,0.0,65.0,0.0,526.250785,88.797088,...,2.0,0.0,1651,8,0.096911,0,0.0,0,0,1.0
1,557.0,20.66584,13.0,1158.307692,134.247982,0.0,13.0,0.0,545.050452,62.371767,...,12.0,0.0,176,2,0.221591,0,0.0,0,0,1.0
2,574.335366,18.613946,164.0,678.518293,139.056203,31.0,133.0,0.0,518.29674,91.633718,...,3.0,0.0,1329,3,0.059443,0,0.0,0,0,0.0


** Impute missing values using KNN **

Used 3 of the nearest neighbors to impute missing values

In [138]:
# Convert dataframe to matrix
X=X.as_matrix()

#Imput missing values from three closest observations
X_imputed=KNN(k=3).complete(X)
X=pd.DataFrame(X_imputed)

Imputing row 1/1515 with 0 missing, elapsed time: 1.952
Imputing row 101/1515 with 0 missing, elapsed time: 1.966
Imputing row 201/1515 with 0 missing, elapsed time: 1.968
Imputing row 301/1515 with 0 missing, elapsed time: 1.972
Imputing row 401/1515 with 0 missing, elapsed time: 1.977
Imputing row 501/1515 with 0 missing, elapsed time: 1.978
Imputing row 601/1515 with 0 missing, elapsed time: 1.983
Imputing row 701/1515 with 0 missing, elapsed time: 1.983
Imputing row 801/1515 with 0 missing, elapsed time: 1.984
Imputing row 901/1515 with 0 missing, elapsed time: 1.985
Imputing row 1001/1515 with 0 missing, elapsed time: 1.985
Imputing row 1101/1515 with 0 missing, elapsed time: 1.990
Imputing row 1201/1515 with 0 missing, elapsed time: 1.993
Imputing row 1301/1515 with 0 missing, elapsed time: 1.993
Imputing row 1401/1515 with 0 missing, elapsed time: 1.993
Imputing row 1501/1515 with 0 missing, elapsed time: 1.994


In [140]:
# Add headers to the data frame with newly imputed missing values [FYI: header removed during imputation]
X.columns=['Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_insertSizeScore_insertSizeScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','size','tandemrep_cnt','tandemrep_pct','segdup_cnt','segdup_pct','refN_cnt','refN_pct','Label']

In [141]:
X.head(3)

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,TenX.HP2_ref_reason_alignmentScore,TenX.HP2_ref_reason_orientation,size,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct,Label
0,579.446154,17.934094,65.0,756.430769,163.879409,0.0,65.0,0.0,526.250785,88.797088,...,2.0,0.0,1651.0,8.0,0.096911,0.0,0.0,0.0,0.0,1.0
1,557.0,20.66584,13.0,1158.307692,134.247982,0.0,13.0,0.0,545.050452,62.371767,...,12.0,0.0,176.0,2.0,0.221591,0.0,0.0,0.0,0.0,1.0
2,574.335366,18.613946,164.0,678.518293,139.056203,31.0,133.0,0.0,518.29674,91.633718,...,3.0,0.0,1329.0,3.0,0.059443,0.0,0.0,0.0,0.0,0.0


In [142]:
# Store Labels in a new 'Y' DataFrame
Y = pd.DataFrame()
Y['Label'] = X['Label']
#Y = X.pop('Label')

In [143]:
#Count the number of labels
pd.value_counts(Y['Label'].values, sort=False)

 1.0    959
 0.0    552
-1.0      1
 2.0      3
dtype: int64

**Select Features to Train Machine Learning Model**
- The features in the CrowdVar dataframe must match all of the features in the svviz dataframe
- The next step selects all of the features found in both the CrowdVar dataframe and the svviz dataframe

In [144]:
#Features in training set must match features in the prediction set
X=X[['Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','tandemrep_cnt','tandemrep_pct','segdup_cnt','segdup_pct','refN_cnt','refN_pct']]

In [145]:
X.head()

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,TenX.HP2_ref_insertSize_mean,TenX.HP2_ref_insertSize_std,TenX.HP2_ref_reason_alignmentScore,TenX.HP2_ref_reason_orientation,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct
0,579.446154,17.934094,65.0,756.430769,163.879409,0.0,65.0,0.0,526.250785,88.797088,...,321.5,25.5,2.0,0.0,8.0,0.096911,0.0,0.0,0.0,0.0
1,557.0,20.66584,13.0,1158.307692,134.247982,0.0,13.0,0.0,545.050452,62.371767,...,472.25,138.336019,12.0,0.0,2.0,0.221591,0.0,0.0,0.0,0.0
2,574.335366,18.613946,164.0,678.518293,139.056203,31.0,133.0,0.0,518.29674,91.633718,...,988.0,181.532366,3.0,0.0,3.0,0.059443,0.0,0.0,0.0,0.0
3,582.0,12.928956,38.0,697.026316,152.204932,4.0,34.0,0.0,528.007583,93.211521,...,330.142857,90.760168,13.0,1.0,1.0,0.24031,0.0,0.0,0.0,0.0
4,577.9375,16.320496,80.0,751.15,128.707527,0.0,80.0,0.0,528.103762,93.192739,...,627.210526,409.833031,19.0,0.0,1.0,0.046729,0.0,0.0,0.0,0.0


In [146]:
X_df = pd.DataFrame(X)
X_df.to_csv('X_df.csv', index=False)

** Standardize Data **

** Note **

Features only were standardized

** For final analysis, data not standardized **

In [147]:
scaler=preprocessing.StandardScaler()
X=scaler.fit_transform(X)

** Train Random Forest Classifier **

In [148]:
# Train Test Split
# Train on 70% of the data and test on 30%
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.7, random_state=0)

In [149]:
# Train Random Forest Classifier
model = RandomForestClassifier() 
model.fit(X_train, y_train)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [150]:
#NOTE: Training Set - Show number of Hom Ref, Hom Var, Het Var datapoints the model was trained on
ytrain = pd.DataFrame()
ytrain['ytrain'] = y_train
pd.value_counts(ytrain['ytrain'].values, sort=False)

1.0    123
0.0     11
dtype: int64

** Precision Score **
- Overall model performance
- Using 30% of original dataset (test set)
- Truth labels: CrowdVariant labels

In [151]:
model.predict(X_test)

array([ 0.,  1.,  1., ...,  0.,  1.,  0.])

In [152]:
pred = model.predict(X_test)

In [153]:
precision_score(pred, y_test, average='micro') 

0.97455230914231861

In [154]:
# Add original labels and predicted labels back to the original dataframe
df_Xtest = pd.DataFrame(X_test)
df_Xtest.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,160,161,162,163,164,165,166,167,168,169
0,0.245573,-0.600937,1.726388,0.108326,-0.20026,1.174293,1.78106,-0.083405,1.317134,0.145072,...,-0.838357,-0.635884,-0.481128,-0.091201,-0.328736,-0.073591,-0.301666,-0.2916,0.0,0.0
1,0.126697,-0.038937,-0.092265,0.134947,0.305214,-0.223517,0.050744,-0.083405,-0.174849,0.40681,...,0.864636,0.865565,0.529644,-0.091201,-0.328736,-0.170711,-0.301666,-0.2916,0.0,0.0
2,-0.302458,2.912548,-1.441374,1.235104,0.660965,-1.753877,-0.993976,24.168735,2.300369,-1.890106,...,1.369485,1.467017,-0.135338,-0.091201,-0.911915,-0.721055,-0.301666,-0.2916,0.0,0.0
3,0.059012,-1.055796,-0.661007,-0.352068,0.631609,0.198236,-1.266038,-0.083405,0.169932,0.322962,...,0.791506,0.753929,0.609441,-0.091201,-0.328736,0.905634,2.807337,1.335447,0.0,0.0
4,-0.059089,1.890943,-0.832952,-0.215059,-0.262996,-0.862171,-0.591323,-0.083405,-1.293854,0.279057,...,-0.838357,-0.635884,-0.481128,-0.091201,0.254443,0.126242,-0.301666,-0.2916,0.0,0.0


In [155]:
df_Xtest.columns=['Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','tandemrep_cnt','tandemrep_pct','segdup_cnt','segdup_pct','refN_cnt','refN_pct']

In [156]:
labels = pd.DataFrame(y_test)

In [188]:
df_Xtest['predicted_label'] = pred
df_Xtest['Label'] = df_crowd['Label']
df_Xtest['chrom'] = df_crowd_2['chrom']
df_Xtest['start'] = df_crowd_2['start']
df_Xtest['end'] = df_crowd_2['end']
# df_Xtest['Y_test'] = labels

In [189]:
df_Xtest.head()

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,segdup_pct,refN_cnt,refN_pct,model_label,Label,Y_test,predicted_label,chrom,start,end
0,0.245573,-0.600937,1.726388,0.108326,-0.20026,1.174293,1.78106,-0.083405,1.317134,0.145072,...,-0.2916,0.0,0.0,0.0,1.0,,0.0,1,187464828,187466479
1,0.126697,-0.038937,-0.092265,0.134947,0.305214,-0.223517,0.050744,-0.083405,-0.174849,0.40681,...,-0.2916,0.0,0.0,1.0,1.0,1.0,1.0,1,33156824,33157000
2,-0.302458,2.912548,-1.441374,1.235104,0.660965,-1.753877,-0.993976,24.168735,2.300369,-1.890106,...,-0.2916,0.0,0.0,1.0,0.0,0.0,1.0,1,53594099,53595428
3,0.059012,-1.055796,-0.661007,-0.352068,0.631609,0.198236,-1.266038,-0.083405,0.169932,0.322962,...,1.335447,0.0,0.0,1.0,1.0,,1.0,1,59018046,59018304
4,-0.059089,1.890943,-0.832952,-0.215059,-0.262996,-0.862171,-0.591323,-0.083405,-1.293854,0.279057,...,-0.2916,0.0,0.0,1.0,0.0,0.0,1.0,1,68008246,68009102


In [190]:
pd.value_counts(df_Xtest['Label'].values, sort=False)

 1.0    958
 0.0     99
-1.0      1
 2.0      3
dtype: int64

In [191]:
pd.value_counts(df_Xtest['predicted_label'].values, sort=False)

0.0    390
1.0    671
dtype: int64

In [192]:
from sklearn.metrics import confusion_matrix
ytest = df_Xtest['Label']
predict = df_Xtest['predicted_label']
print(confusion_matrix(ytest, predict))

[[  0   0   1   0]
 [  0  43  56   0]
 [  0 346 612   0]
 [  0   1   2   0]]


In [193]:
pd.crosstab(ytest, predict, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,0.0,1.0,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1.0,0,1,1
0.0,43,56,99
1.0,346,612,958
2.0,1,2,3
All,390,671,1061


In [184]:
# import pylab as pl
# labels = ['-1', '0', '1', '2']
# cm = confusion_matrix(y_test, pred)
# print(cm)
# fig = plt.figure()
# ax = fig.add_subplot(111)
# cax = ax.matshow(cm)
# pl.title('Confusion matrix of the classifier')
# fig.colorbar(cax)
# ax.set_xticklabels([''] + labels)
# ax.set_yticklabels([''] + labels)
# pl.xlabel('Predicted')
# pl.ylabel('True')
# pl.show()

In [159]:
df_Xtest.to_csv('df_Xtest.csv', index=False)

** Future Task **

Use GridSearch to find the best parameters

***
Prediction 
***

Used svviz.HG002 data from first 5000 random selection

**Data Label Update**

Svviz GT Labels and CrowdVar GT Labels do not match

Created a new set of labels for the svviz HG002 Deletions Dataframe

CrowdVar GT Label Key

- 0: Hom. Var.
- 1: Het Var
- 2: Hom Ref

svviz GTcons Labels

- 0: Hom Ref
- 1: Het Var
- 2: Hom Var

Changed svviz DF (svviz.Annotate.DEL.HG002_2.csv) to have the following data labels (which match CrowdVar)
New Column (GIAB_Crowd)

- 0: Hom. Var.
- 1: Het Var
- 2: Hom Ref

**Stored** all rows that have '-1' in a separate dataframe
- including '-1' will throw an error in accuracy and prediction measures
- will use these entries as final prediction set

Used excel formula to manually change svviz.HG002 labels

=IF(GB2=0,2,IF(GB2=2,0,1))

In [258]:
# Read in HG002 DEL dataframe
HG002_pred = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/svviz.Annotate.DEL.HG002_data2.csv')

In [259]:
HG002_pred_2 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/svviz.Annotate.DEL.HG002_data2.csv')

In [260]:
### Drop irrelevant columns
# Note: Features in the prediction dataframe must match the features in the dataframe (CrowdVar) used to train the RF model
# NOTE TODO: Include size
HG002_pred.drop(['GTcons'], axis=1, inplace = True)
HG002_pred.drop(['GTconflict'], axis=1, inplace = True)
HG002_pred.drop(['GTsupp'], axis=1, inplace = True)
HG002_pred.drop('SVtype', axis=1)
HG002_pred.drop('type',axis=1)
HG002_pred.drop(['type'],axis=1, inplace = True)
HG002_pred.drop(['SVtype'],axis=1, inplace = True)
HG002_pred.drop(['start'],axis=1, inplace = True)
HG002_pred.drop(['end'],axis=1, inplace = True)
HG002_pred.drop(['chrom'],axis=1, inplace = True)
HG002_pred.drop(['Size'],axis=1, inplace = True)
HG002_pred.drop(['TenX.GT'],axis=1, inplace = True)
HG002_pred.drop(['pacbio.GT'],axis=1, inplace = True)
HG002_pred.drop(['IllMP.GT'],axis=1, inplace = True)
HG002_pred.drop(['Ill250.GT'],axis=1, inplace = True)
HG002_pred.drop(['Ill300x.GT'],axis=1, inplace = True)
HG002_pred.drop(['sample'],axis=1, inplace = True)
HG002_pred.drop(['id'],axis=1, inplace = True)
# HG002_pred.drop(['GIAB_Crowd'],axis=1, inplace = True)

In [261]:
HG002_pred.head()

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,pacbio.ref_insertSize_mean,pacbio.ref_insertSize_std,pacbio.ref_reason_alignmentScore,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct,GIAB_Crowd
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,541.033679,91.702755,...,9694.425532,4306.492796,47.0,0,0.0,0,0.0,0,0,2
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,520.882022,78.521682,...,9218.592593,4009.865909,27.0,0,0.0,0,0.0,0,0,2
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.881724,77.118761,...,10296.15,4547.097621,40.0,0,0.0,1,0.675712,0,0,2
3,587.5,4.5,2.0,218.5,20.5,2.0,0.0,0.0,539.138453,85.428725,...,10774.62264,4040.498761,53.0,0,0.0,0,0.0,0,0,2
4,592.0,0.0,1.0,1009.0,0.0,0.0,1.0,0.0,514.035095,90.478044,...,10438.0,4823.436414,19.0,0,0.0,0,0.0,0,0,2


In [262]:
HG002_pred.to_csv('HG002_pred.csv', index=False)

In [263]:
X2 = HG002_pred

** Impute missing values using KNN **

In [264]:
# Convert dataframe to matrix
X2=X2.as_matrix()

#Imput missing values from three closest observations
X2_imputed=KNN(k=3).complete(X2)
X2=pd.DataFrame(X2_imputed)

Imputing row 1/2805 with 1 missing, elapsed time: 6.936
Imputing row 101/2805 with 2 missing, elapsed time: 6.950
Imputing row 201/2805 with 1 missing, elapsed time: 6.966
Imputing row 301/2805 with 1 missing, elapsed time: 6.980
Imputing row 401/2805 with 1 missing, elapsed time: 6.990
Imputing row 501/2805 with 1 missing, elapsed time: 6.995
Imputing row 601/2805 with 2 missing, elapsed time: 7.006
Imputing row 701/2805 with 1 missing, elapsed time: 7.018
Imputing row 801/2805 with 1 missing, elapsed time: 7.023
Imputing row 901/2805 with 2 missing, elapsed time: 7.029
Imputing row 1001/2805 with 1 missing, elapsed time: 7.041
Imputing row 1101/2805 with 1 missing, elapsed time: 7.050
Imputing row 1201/2805 with 2 missing, elapsed time: 7.057
Imputing row 1301/2805 with 2 missing, elapsed time: 7.066
Imputing row 1401/2805 with 1 missing, elapsed time: 7.073
Imputing row 1501/2805 with 1 missing, elapsed time: 7.080
Imputing row 1601/2805 with 1 missing, elapsed time: 7.095
Imputing 

In [265]:
# Rename Columns in the dataframe
X2.columns = ['Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_insertSizeScore_orientation','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_alignmentScore','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_insertSizeScore_insertSizeScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_insertSizeScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_insertSizeScore_insertSizeScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_insertSizeScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','tandemrep_cnt','tandemrep_pct','segdup_cnt','segdup_pct','refN_cnt','refN_pct','GIAB_Crowd']

In [266]:
# Select columns that match columns in training dataframe
#Features in training set must match features in the prediction set
X2=X2[['Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','tandemrep_cnt','tandemrep_pct','segdup_cnt','segdup_pct','refN_cnt','refN_pct']]

In [267]:
X2.to_csv('X2.csv', index=False)

** Standardize Data **

In [268]:
scaler=preprocessing.StandardScaler()
X=scaler.fit_transform(X2)

In [269]:
model.predict(X)

array([ 1.,  1.,  1., ...,  0.,  0.,  0.])

** Precision Score **

Data: 5000 randomly selected datapoints (Round 1) svviz GT vs. model selected GT from trained RF model

The ratio of tp /(tp +fn)

Intuitively the ability of a model to avoid labeling negative events as positive events

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

In [271]:
precision_score(true, test, average='micro') 

0.74224598930481278

In [270]:
model.predict_proba(X)

array([[ 0.1,  0.9],
       [ 0.2,  0.8],
       [ 0.2,  0.8],
       ..., 
       [ 0.6,  0.4],
       [ 0.5,  0.5],
       [ 0.5,  0.5]])

In [272]:
pred = model.predict_proba(X)

In [273]:
X6 = pd.concat([X2, pd.DataFrame(pred, columns=['1','2'])])

In [274]:
X6.to_csv('X6_2.csv', index=False)

In [276]:
X6 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/X6_2.csv')

In [277]:
X6.head()

Unnamed: 0,1,2,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,Ill250.alt_reason_insertSizeScore,Ill250.alt_reason_orientation,...,pacbio.ref_count,pacbio.ref_insertSize_mean,pacbio.ref_insertSize_std,pacbio.ref_reason_alignmentScore,refN_cnt,refN_pct,segdup_cnt,segdup_pct,tandemrep_cnt,tandemrep_pct
0,0.1,0.9,0.0,0.0,0,0.0,0.0,0,0,0,...,47.0,9694.425532,4306.492796,47.0,0,0,0,0.0,0,0.0
1,0.2,0.8,0.0,0.0,0,0.0,0.0,0,0,0,...,27.0,9218.592593,4009.865909,27.0,0,0,0,0.0,0,0.0
2,0.2,0.8,0.0,0.0,0,0.0,0.0,0,0,0,...,40.0,10296.15,4547.097621,40.0,0,0,1,0.675712,0,0.0
3,0.2,0.8,819.0,0.0,1,253.0,0.0,1,0,0,...,53.0,10774.62264,4040.498761,53.0,0,0,0,0.0,0,0.0
4,0.3,0.7,0.0,0.0,0,0.0,0.0,0,0,0,...,19.0,10438.0,4823.436414,19.0,0,0,0,0.0,0,0.0


In [281]:
X6.shape

(2805, 173)

In [290]:
X6['GIAB_Crowd'] = HG002_pred['GIAB_Crowd']
X6['model_pred_label'] = model.predict(X)
X6['chrom'] = HG002_pred_2['chrom']
X6['start'] = HG002_pred_2['start']
X6['end'] = HG002_pred_2['end']
X6['Size'] = HG002_pred_2['Size']
X6['GTcons'] = HG002_pred_2['GTcons']
X6['Ill250.GT'] = HG002_pred_2['Ill250.GT']
X6['Ill300x.GT'] = HG002_pred_2['Ill300x.GT']
X6['IllMP.GT'] = HG002_pred_2['IllMP.GT']
X6['pacbio.GT'] = HG002_pred_2['pacbio.GT']

In [291]:
X6.shape

(2805, 183)

In [292]:
X6.head()

Unnamed: 0,Hom_Var,Het_Var,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,Ill250.alt_reason_insertSizeScore,Ill250.alt_reason_orientation,...,model_pred_label,chrom,start,end,Size,GTcons,Ill250.GT,Ill300x.GT,IllMP.GT,pacbio.GT
0,0.1,0.9,0.0,0.0,0,0.0,0.0,0,0,0,...,1.0,10,21929089,21929235,-145,0,0,0.0,0.0,0.0
1,0.2,0.8,0.0,0.0,0,0.0,0.0,0,0,0,...,1.0,17,744621,744963,-231,0,0,0.0,0.0,0.0
2,0.2,0.8,0.0,0.0,0,0.0,0.0,0,0,0,...,1.0,21,32114178,32114915,-648,0,0,0.0,0.0,0.0
3,0.2,0.8,819.0,0.0,1,253.0,0.0,1,0,0,...,1.0,16,5702979,5703571,-230,0,0,0.0,0.0,0.0
4,0.3,0.7,0.0,0.0,0,0.0,0.0,0,0,0,...,1.0,5,168173265,168174026,-215,0,0,0.0,0.0,0.0


In [293]:
X6.to_csv('X6_all.csv', index=False)

In [294]:
X6.rename(columns={'1': 'Hom_Var'}, inplace=True)
X6.rename(columns={'2': 'Het_Var'}, inplace=True)

In [288]:
X6.head()

Unnamed: 0,Hom_Var,Het_Var,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,Ill250.alt_reason_insertSizeScore,Ill250.alt_reason_orientation,...,segdup_cnt,segdup_pct,tandemrep_cnt,tandemrep_pct,GIAB_Crowd,model_pred_label,chrom,start,end,Size
0,0.1,0.9,0.0,0.0,0,0.0,0.0,0,0,0,...,0,0.0,0,0.0,2,1.0,10,21929089,21929235,-145
1,0.2,0.8,0.0,0.0,0,0.0,0.0,0,0,0,...,0,0.0,0,0.0,2,1.0,17,744621,744963,-231
2,0.2,0.8,0.0,0.0,0,0.0,0.0,0,0,0,...,1,0.675712,0,0.0,2,1.0,21,32114178,32114915,-648
3,0.2,0.8,819.0,0.0,1,253.0,0.0,1,0,0,...,0,0.0,0,0.0,2,1.0,16,5702979,5703571,-230
4,0.3,0.7,0.0,0.0,0,0.0,0.0,0,0,0,...,0,0.0,0,0.0,2,1.0,5,168173265,168174026,-215


In [296]:
X6=X6[['chrom','start','end','Size','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','refN_cnt','refN_pct','segdup_cnt','segdup_pct','tandemrep_cnt','tandemrep_pct','Ill250.GT','Ill300x.GT','IllMP.GT','pacbio.GT','GIAB_Crowd','GTcons','model_pred_label','Hom_Var','Het_Var']]

In [297]:
X6.head()

Unnamed: 0,chrom,start,end,Size,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,...,tandemrep_pct,Ill250.GT,Ill300x.GT,IllMP.GT,pacbio.GT,GIAB_Crowd,GTcons,model_pred_label,Hom_Var,Het_Var
0,10,21929089,21929235,-145,0.0,0.0,0,0.0,0.0,0,...,0.0,0,0.0,0.0,0.0,2,0,1.0,0.1,0.9
1,17,744621,744963,-231,0.0,0.0,0,0.0,0.0,0,...,0.0,0,0.0,0.0,0.0,2,0,1.0,0.2,0.8
2,21,32114178,32114915,-648,0.0,0.0,0,0.0,0.0,0,...,0.0,0,0.0,0.0,0.0,2,0,1.0,0.2,0.8
3,16,5702979,5703571,-230,819.0,0.0,1,253.0,0.0,1,...,0.0,0,0.0,0.0,0.0,2,0,1.0,0.2,0.8
4,5,168173265,168174026,-215,0.0,0.0,0,0.0,0.0,0,...,0.0,0,0.0,0.0,0.0,2,0,1.0,0.3,0.7


In [298]:
from sklearn.metrics import confusion_matrix
ytest = X6['GIAB_Crowd']
predict = X6['model_pred_label']
print(confusion_matrix(ytest, predict))

[[ 554   44    0]
 [ 206 1528    0]
 [  18  455    0]]


In [299]:
pd.crosstab(ytest, predict, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,0.0,1.0,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,554,44,598
1,206,1528,1734
2,18,455,473
All,778,2027,2805


In [300]:
output_notebook()

In [301]:
# Plot the counts of each predicted probability for Het Var events
p = figure()
p = Histogram(X6, values='Het_Var', title='Heterozygous Variant Labels', color='LightSlateGray', bins=30, xlabel="Predict Probability", ylabel="Frequency")
output_file("pred_prob_het_var.html")
show(p)

In [302]:
# Plot the counts of each predicted probability for Hom Var events
p = figure()
p = Histogram(X6, values='Hom_Var', title='Homozygous Variant Labels', color='LightSlateGray', bins=30, xlabel="Predict Probability", ylabel="Frequency")
output_file("pred_prob_hom_var.html")
show(p)

In [303]:
pd.value_counts(X6['GIAB_Crowd'].values, sort=False)

0     598
2     473
1    1734
dtype: int64

In [304]:
pd.value_counts(X6['model_pred_label'].values, sort=False)

1.0    2027
0.0     778
dtype: int64

In [310]:
d = {'Label' : [0, 0, 1, 1, 2, 2],
     'Source' : ['Pred', 'Actual', 'Pred', 'Actual','Pred', 'Actual'],
    'Count' : [778, 598, 2027, 1734, 0, 473],}
df = pd.DataFrame(d)
df

Unnamed: 0,Count,Label,Source
0,778,0,Pred
1,598,0,Actual
2,2027,1,Pred
3,1734,1,Actual
4,0,2,Pred
5,473,2,Actual


In [312]:
p = Bar(df, label='Label', xlabel="Genotype", ylabel="Count", values='Count', group='Source',plot_width=900, 
        plot_height=650, title="Predicted vs Actual", legend='top_left')

p.legend.legend_spacing = 0
show(p)


legend_spacing was deprecated in Bokeh 0.12.3 and will be removed, use Legend.spacing instead.



#### Predict '-1' Values

In [315]:
# Read in HG002 DEL dataframe
HG002_pred_min_1 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/svviz.Annotate.DEL.HG002_minus_one.csv')
HG002_pred_min_1_ = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/svviz.Annotate.DEL.HG002_minus_one.csv')

In [314]:
### Drop irrelevant columns
# Note: Features in the prediction dataframe must match the features in the dataframe (CrowdVar) used to train the RF model
# NOTE TODO: Include size
HG002_pred_min_1.drop(['GTcons'], axis=1, inplace = True)
HG002_pred_min_1.drop(['GTconflict'], axis=1, inplace = True)
HG002_pred_min_1.drop(['GTsupp'], axis=1, inplace = True)
HG002_pred_min_1.drop('SVtype', axis=1)
HG002_pred_min_1.drop('type',axis=1)
HG002_pred_min_1.drop(['type'],axis=1, inplace = True)
HG002_pred_min_1.drop(['SVtype'],axis=1, inplace = True)
HG002_pred_min_1.drop(['start'],axis=1, inplace = True)
HG002_pred_min_1.drop(['end'],axis=1, inplace = True)
HG002_pred_min_1.drop(['chrom'],axis=1, inplace = True)
HG002_pred_min_1.drop(['Size'],axis=1, inplace = True)
HG002_pred_min_1.drop(['TenX.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['pacbio.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['IllMP.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['Ill250.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['Ill300x.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['sample'],axis=1, inplace = True)
HG002_pred_min_1.drop(['id'],axis=1, inplace = True)
# HG002_pred.drop(['GIAB_Crowd'],axis=1, inplace = True)

In [None]:
HG002_pred_min_1.head()

In [None]:
HG002_pred_min_1.to_csv('HG002_pred.csv', index=False)

In [None]:
X3 = HG002_pred_min_1

In [254]:
X2['model_pred_label'] = model.predict(X)
X2['GIAB_Crowd'] = HG002_pred['GIAB_Crowd']
X6['model_pred_label'] = model.predict(X)

In [236]:
# Compare predicted model labels to conensus GT output by preliminary svviz analysis (R Script)
test = X2['model_pred_label']
true = X2['GIAB_Crowd']

In [242]:
X2['chrom'] = HG002_pred_2['chrom']
X2['start'] = HG002_pred_2['start']
X2['end'] = HG002_pred_2['end']
X2['Size'] = HG002_pred_2['Size']

In [240]:
X2.head(3)

Unnamed: 0,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,Ill300x.amb_alnScore_std,...,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct,model_pred_label,GIAB_Crowd,chrom,start,end
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,541.033679,91.702755,...,0.0,0.0,0.0,0.0,0.0,1.0,2,10,21929089,21929235
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,520.882022,78.521682,...,0.0,0.0,0.0,0.0,0.0,1.0,2,17,744621,744963
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,531.881724,77.118761,...,0.0,1.0,0.675712,0.0,0.0,1.0,2,21,32114178,32114915


In [243]:
X2.to_csv('X2_try1.csv', index=False)

** Predict Probabilities **

In [244]:
# TODO: Make Code More efficient
# Display predicted probability for each model predicted label in dataframe

In [250]:
# TODO: Make Code More efficient
# Import Dataframe with X and Y represented as 23 and 24
X2 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/X2_try1.csv')
# Drop the following columns so that pred_prob in next step works
X2.drop(['model_pred_label'], axis=1, inplace = True)
X2.drop(['GIAB_Crowd'], axis=1, inplace = True)
X2.drop(['chrom'], axis=1, inplace = True)
X2.drop(['start'], axis=1, inplace = True)
X2.drop(['end'], axis=1, inplace = True)
X2.drop(['Size'], axis=1, inplace = True)

In [251]:
pred = model.predict_proba(X2)
X6 = pd.concat([X2, pd.DataFrame(pred, columns=['1','2'])])
# X6.to_csv('X6_pred_prob.csv', index=False)

In [255]:
X6['model_pred_label'] = X2['model_pred_label'] 
X6['GIAB_Crowd'] = HG002_pred['GIAB_Crowd']
X6['chrom'] = HG002_pred_2['chrom']
X6['start'] = HG002_pred_2['start']
X6['end'] = HG002_pred_2['end']
X6['Size'] = HG002_pred_2['Size']

In [257]:
X6.to_csv('X6_pred_prob.csv', index=False)

In [46]:
# X6_2 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/X6_pred_prob.csv')

In [60]:
# X6_2['model_pred_label'] = model.predict(X2)

In [92]:
svvz_pred_labels_df = pd.concat([HG002_pred_2, pd.DataFrame(pred, columns=['1','2'])])
svvz_pred_labels_df.to_csv('svvz_original_df.csv', index=False)

In [85]:
svvz_pred_labels_df = pd.read_csv('svvz_original_df_2.csv')

In [86]:
svvz_pred_labels_df['model_pred_label'] = X2['model_pred_label']

In [87]:
svvz_pred_labels_df.head()

Unnamed: 0,1,2,GIAB_Crowd,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,Ill250.alt_reason_insertSizeScore,...,pacbio.ref_insertSize_mean,pacbio.ref_insertSize_std,pacbio.ref_reason_alignmentScore,refN_cnt,refN_pct,segdup_cnt,segdup_pct,tandemrep_cnt,tandemrep_pct,model_pred_label
0,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,8139.555556,4575.304996,18.0,0,0,1,0.679775,0,0.0,1.0
1,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,9962.93617,4301.89526,47.0,0,0,1,1.0,0,0.0,1.0
2,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,11189.14634,4525.45141,41.0,0,0,0,0.0,0,0.0,1.0
3,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,9694.425532,4306.492796,47.0,0,0,0,0.0,0,0.0,1.0
4,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,9724.0,4161.441384,51.0,0,0,0,0.0,0,0.0,1.0


In [88]:
svvz_pred_labels_df.rename(columns={'1': 'Het_Var'}, inplace=True)
svvz_pred_labels_df.rename(columns={'2': 'Hom_Var'}, inplace=True)

In [89]:
svvz_pred_labels_df.head()

Unnamed: 0,Het_Var,Hom_Var,GIAB_Crowd,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,Ill250.alt_reason_insertSizeScore,...,pacbio.ref_insertSize_mean,pacbio.ref_insertSize_std,pacbio.ref_reason_alignmentScore,refN_cnt,refN_pct,segdup_cnt,segdup_pct,tandemrep_cnt,tandemrep_pct,model_pred_label
0,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,8139.555556,4575.304996,18.0,0,0,1,0.679775,0,0.0,1.0
1,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,9962.93617,4301.89526,47.0,0,0,1,1.0,0,0.0,1.0
2,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,11189.14634,4525.45141,41.0,0,0,0,0.0,0,0.0,1.0
3,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,9694.425532,4306.492796,47.0,0,0,0,0.0,0,0.0,1.0
4,0.2,0.8,2,0.0,0.0,0,0.0,0.0,0,0,...,9724.0,4161.441384,51.0,0,0,0,0.0,0,0.0,1.0


In [90]:
svvz_pred_labels_df.reindex_axis(sorted(svvz_pred_labels_df.columns), axis=1)

Unnamed: 0,GIAB_Crowd,Het_Var,Hom_Var,Ill250.alt_alnScore_mean,Ill250.alt_alnScore_std,Ill250.alt_count,Ill250.alt_insertSize_mean,Ill250.alt_insertSize_std,Ill250.alt_reason_alignmentScore,Ill250.alt_reason_insertSizeScore,...,pacbio.ref_count,pacbio.ref_insertSize_mean,pacbio.ref_insertSize_std,pacbio.ref_reason_alignmentScore,refN_cnt,refN_pct,segdup_cnt,segdup_pct,tandemrep_cnt,tandemrep_pct
0,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,18.0,8139.555556,4575.304996,18.0,0,0,1,0.679775,0,0.0
1,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,47.0,9962.936170,4301.895260,47.0,0,0,1,1.000000,0,0.0
2,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,41.0,11189.146340,4525.451410,41.0,0,0,0,0.000000,0,0.0
3,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,47.0,9694.425532,4306.492796,47.0,0,0,0,0.000000,0,0.0
4,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,51.0,9724.000000,4161.441384,51.0,0,0,0,0.000000,0,0.0
5,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,46.0,10630.239130,4041.605251,46.0,0,0,0,0.000000,0,0.0
6,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,47.0,9875.787234,4338.187154,47.0,0,0,0,0.000000,0,0.0
7,2,0.3,0.7,978.000000,0.000000,1,264.000000,0.000000,1,0,...,19.0,11462.421050,3755.909496,19.0,0,0,0,0.000000,0,0.0
8,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,27.0,9218.592593,4009.865909,27.0,0,0,0,0.000000,0,0.0
9,2,0.2,0.8,0.000000,0.000000,0,0.000000,0.000000,0,0,...,34.0,10439.441180,4406.025027,34.0,0,0,0,0.000000,0,0.0


In [91]:
svvz_pred_labels_df.to_csv('test.csv', index=False)

In [48]:
X2['GIAB_Crowd'] = HG002_pred['GIAB_Crowd']

In [49]:
X6_2['GIAB_Crowd'] = HG002_pred['GIAB_Crowd']

In [50]:
X6_2['GIAB_Crowd'] = HG002_pred['GIAB_Crowd']

In [51]:
horizontal = pd.concat([HG002_pred_2, X6_2], axis=1)

In [52]:
horizontal.to_csv('hor.csv', index=False)

In [53]:
pd.value_counts(X6_2['model_pred_label'].values, sort=False)

KeyError: 'model_pred_label'

In [152]:
pd.value_counts(X6_2['GIAB_Crowd'].values, sort=False)

0     598
2     473
1    1734
dtype: int64

In [146]:
X6_2.to_csv('X6_2_labels.csv', index=False)

In [107]:
X2.shape

(2805, 172)

In [45]:
# Plot that diplays count of predicted labels vs actual labels

In [46]:
output_notebook()

In [47]:
output_notebook()
p = figure()
p = Histogram(X6_2, values='2', title='Heterozygous Variant Labels', color='LightSlateGray', bins=30, xlabel="Predict Probability", ylabel="Frequency")
output_file("pred_prob_het_var.html")
show(p)

In [150]:
output_notebook()
p = figure()
p = Histogram(X6_2, values='1', title='Homozygous Variant Labels', color='LightSlateGray', bins=30, xlabel="Predict Probability", ylabel="Frequency")
output_file("pred_prob_hom_var.html")
show(p)

INFO:bokeh.core.state:Session output file 'pred_prob.html' already exists, will be overwritten.


In [145]:
# pd.value_counts(X2['model_pred_label'].values, sort=False)

In [143]:
pd.value_counts(X2['GIAB_Crowd'].values, sort=False)

0     597
2     473
1    1735
dtype: int64

In [334]:
# X3_df = pd.DataFrame({'count' : X2.groupby(['model_pred_label', 'GIAB_Crowd']).size()}).reset_index()

In [144]:
# Correct Label Count
# Do df.count --> copy into excel file and save in a new csv --> get correct GT count
# The '2' Labels are missing
label_ct = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/label_count_2.csv')

In [158]:
label_ct_2 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/new_labels.csv')

In [159]:
p = Bar(label_ct_2, label='Label', xlabel="Genotype", ylabel="Count", values='Count', group='Source',plot_width=900, 
        plot_height=650, title="Predicted vs Actual", legend='top_left')

p.legend.legend_spacing = 0
show(p)


legend_spacing was deprecated in Bokeh 0.12.3 and will be removed, use Legend.spacing instead.



In [145]:
p = Bar(label_ct, label='Label', xlabel="Genotype", ylabel="Count", values='Count', group='Source',plot_width=900, 
        plot_height=650, title="Predicted vs Actual", legend='top_left')

p.legend.legend_spacing = 0
show(p)


legend_spacing was deprecated in Bokeh 0.12.3 and will be removed, use Legend.spacing instead.



**Notes Presentation**
- Why is there a difference?  What are the different ones?
- So few predicted because there were only a few trained on

***
Determine how well trained RF Classifier can assign labels to datapoints with GTcons: '-1'
***    

In [277]:
#Import Data
HG002_pred_min_1 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/svviz.Annotate.DEL.HG002_minus_one.csv')

In [278]:
### Drop irrelevant columns
# Note: Features in the prediction dataframe must match the features in the dataframe (CrowdVar) used to train the RF model
HG002_pred_min_1.drop(['GTcons'], axis=1, inplace = True)
HG002_pred_min_1.drop(['GTconflict'], axis=1, inplace = True)
HG002_pred_min_1.drop(['GTsupp'], axis=1, inplace = True)
HG002_pred_min_1.drop('SVtype', axis=1)
HG002_pred_min_1.drop('type',axis=1)
HG002_pred_min_1.drop(['type'],axis=1, inplace = True)
HG002_pred_min_1.drop(['SVtype'],axis=1, inplace = True)
HG002_pred_min_1.drop(['start'],axis=1, inplace = True)
HG002_pred_min_1.drop(['end'],axis=1, inplace = True)
HG002_pred_min_1.drop(['chrom'],axis=1, inplace = True)
HG002_pred_min_1.drop('Size',axis=1)
HG002_pred_min_1.drop(['TenX.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['pacbio.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['IllMP.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['Ill250.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['Ill300x.GT'],axis=1, inplace = True)
HG002_pred_min_1.drop(['sample'],axis=1, inplace = True)
HG002_pred_min_1.drop(['id'],axis=1, inplace = True)

In [279]:
HG002_pred_min_1.head()

Unnamed: 0,Size,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,Ill300x.alt_count,Ill300x.alt_insertSize_mean,Ill300x.alt_insertSize_std,Ill300x.alt_reason_alignmentScore,Ill300x.alt_reason_insertSizeScore,Ill300x.alt_reason_orientation,Ill300x.amb_alnScore_mean,...,pacbio.ref_count,pacbio.ref_insertSize_mean,pacbio.ref_insertSize_std,pacbio.ref_reason_alignmentScore,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct
0,-25,576.336245,6.124298,229.0,572.580786,146.677196,229.0,0.0,0.0,536.283903,...,0.0,0.0,0.0,0.0,0,0.0,0,0.0,0,0
1,-38,570.325,3.157432,40.0,620.175,133.58235,39.0,1.0,0.0,529.624239,...,20.0,10741.9,3922.716723,20.0,0,0.0,0,0.0,0,0
2,-28,574.447761,11.021395,134.0,552.843284,155.619794,133.0,1.0,0.0,535.542208,...,1.0,8703.0,0.0,1.0,0,0.0,1,1.0,0,0
3,-153,580.686274,10.901502,51.0,892.529412,257.2088,27.0,24.0,0.0,534.831862,...,20.0,11276.3,4110.799911,20.0,0,0.0,1,1.0,0,0
4,-67,561.728814,20.434957,59.0,791.067797,236.32661,53.0,6.0,0.0,487.928712,...,5.0,13458.0,2863.866757,5.0,0,0.0,0,0.0,0,0


In [280]:
X2 = HG002_pred_min_1

### Impute missing values using KNN 

In [281]:
# Convert dataframe to matrix
X2=X2.as_matrix()

#Imput missing values from three closest observations
X2_imputed=KNN(k=3).complete(X2)
X2=pd.DataFrame(X2_imputed)

Imputing row 1/1191 with 1 missing, elapsed time: 1.277
Imputing row 101/1191 with 2 missing, elapsed time: 1.300
Imputing row 201/1191 with 1 missing, elapsed time: 1.330
Imputing row 301/1191 with 1 missing, elapsed time: 1.348
Imputing row 401/1191 with 1 missing, elapsed time: 1.362
Imputing row 501/1191 with 1 missing, elapsed time: 1.386
Imputing row 601/1191 with 1 missing, elapsed time: 1.397
Imputing row 701/1191 with 2 missing, elapsed time: 1.418
Imputing row 801/1191 with 2 missing, elapsed time: 1.427
Imputing row 901/1191 with 1 missing, elapsed time: 1.440
Imputing row 1001/1191 with 1 missing, elapsed time: 1.450
Imputing row 1101/1191 with 1 missing, elapsed time: 1.468


In [283]:
# Replace column headers
X2.columns = ['Size','Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_insertSizeScore_orientation','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_alignmentScore','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_insertSizeScore_insertSizeScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_insertSizeScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_insertSizeScore_insertSizeScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_insertSizeScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','tandemrep_cnt','tandemrep_pct','segdup_cnt','segdup_pct','refN_cnt','refN_pct']

In [284]:
X2=X2[['Ill300x.alt_alnScore_mean','Ill300x.alt_alnScore_std','Ill300x.alt_count','Ill300x.alt_insertSize_mean','Ill300x.alt_insertSize_std','Ill300x.alt_reason_alignmentScore','Ill300x.alt_reason_insertSizeScore','Ill300x.alt_reason_orientation','Ill300x.amb_alnScore_mean','Ill300x.amb_alnScore_std','Ill300x.amb_count','Ill300x.amb_insertSize_mean','Ill300x.amb_insertSize_std','Ill300x.amb_reason_alignmentScore_alignmentScore','Ill300x.amb_reason_alignmentScore_orientation','Ill300x.amb_reason_flanking','Ill300x.amb_reason_insertSizeScore_alignmentScore','Ill300x.amb_reason_insertSizeScore_insertSizeScore','Ill300x.amb_reason_multimapping','Ill300x.amb_reason_orientation_alignmentScore','Ill300x.amb_reason_orientation_orientation','Ill300x.amb_reason_same_scores','Ill300x.ref_alnScore_mean','Ill300x.ref_alnScore_std','Ill300x.ref_count','Ill300x.ref_insertSize_mean','Ill300x.ref_insertSize_std','Ill300x.ref_reason_alignmentScore','Ill300x.ref_reason_insertSizeScore','Ill300x.ref_reason_orientation','Ill250.alt_alnScore_mean','Ill250.alt_alnScore_std','Ill250.alt_count','Ill250.alt_insertSize_mean','Ill250.alt_insertSize_std','Ill250.alt_reason_alignmentScore','Ill250.alt_reason_insertSizeScore','Ill250.alt_reason_orientation','Ill250.amb_alnScore_mean','Ill250.amb_alnScore_std','Ill250.amb_count','Ill250.amb_insertSize_mean','Ill250.amb_insertSize_std','Ill250.amb_reason_alignmentScore_alignmentScore','Ill250.amb_reason_alignmentScore_orientation','Ill250.amb_reason_flanking','Ill250.amb_reason_insertSizeScore_alignmentScore','Ill250.amb_reason_multimapping','Ill250.amb_reason_orientation_alignmentScore','Ill250.amb_reason_orientation_orientation','Ill250.amb_reason_same_scores','Ill250.ref_alnScore_mean','Ill250.ref_alnScore_std','Ill250.ref_count','Ill250.ref_insertSize_mean','Ill250.ref_insertSize_std','Ill250.ref_reason_alignmentScore','Ill250.ref_reason_orientation','IllMP.alt_alnScore_mean','IllMP.alt_alnScore_std','IllMP.alt_count','IllMP.alt_insertSize_mean','IllMP.alt_insertSize_std','IllMP.alt_reason_alignmentScore','IllMP.alt_reason_insertSizeScore','IllMP.alt_reason_orientation','IllMP.amb_alnScore_mean','IllMP.amb_alnScore_std','IllMP.amb_count','IllMP.amb_insertSize_mean','IllMP.amb_insertSize_std','IllMP.amb_reason_alignmentScore_alignmentScore','IllMP.amb_reason_alignmentScore_orientation','IllMP.amb_reason_flanking','IllMP.amb_reason_insertSizeScore_insertSizeScore','IllMP.amb_reason_multimapping','IllMP.amb_reason_orientation_alignmentScore','IllMP.amb_reason_orientation_orientation','IllMP.amb_reason_same_scores','IllMP.ref_alnScore_mean','IllMP.ref_alnScore_std','IllMP.ref_count','IllMP.ref_insertSize_mean','IllMP.ref_insertSize_std','IllMP.ref_reason_alignmentScore','IllMP.ref_reason_insertSizeScore','IllMP.ref_reason_orientation','pacbio.alt_alnScore_mean','pacbio.alt_alnScore_std','pacbio.alt_count','pacbio.alt_insertSize_mean','pacbio.alt_insertSize_std','pacbio.alt_reason_alignmentScore','pacbio.amb_alnScore_mean','pacbio.amb_alnScore_std','pacbio.amb_count','pacbio.amb_insertSize_mean','pacbio.amb_insertSize_std','pacbio.amb_reason_alignmentScore_alignmentScore','pacbio.amb_reason_flanking','pacbio.amb_reason_multimapping','pacbio.amb_reason_same_scores','pacbio.ref_alnScore_mean','pacbio.ref_alnScore_std','pacbio.ref_count','pacbio.ref_insertSize_mean','pacbio.ref_insertSize_std','pacbio.ref_reason_alignmentScore','TenX.HP1_alt_alnScore_mean','TenX.HP1_alt_alnScore_std','TenX.HP1_alt_count','TenX.HP1_alt_insertSize_mean','TenX.HP1_alt_insertSize_std','TenX.HP1_alt_reason_alignmentScore','TenX.HP1_alt_reason_insertSizeScore','TenX.HP1_alt_reason_orientation','TenX.HP1_amb_alnScore_mean','TenX.HP1_amb_alnScore_std','TenX.HP1_amb_count','TenX.HP1_amb_insertSize_mean','TenX.HP1_amb_insertSize_std','TenX.HP1_amb_reason_alignmentScore_alignmentScore','TenX.HP1_amb_reason_alignmentScore_orientation','TenX.HP1_amb_reason_flanking','TenX.HP1_amb_reason_insertSizeScore_alignmentScore','TenX.HP1_amb_reason_multimapping','TenX.HP1_amb_reason_orientation_alignmentScore','TenX.HP1_amb_reason_orientation_orientation','TenX.HP1_amb_reason_same_scores','TenX.HP1_ref_alnScore_mean','TenX.HP1_ref_alnScore_std','TenX.HP1_ref_count','TenX.HP1_ref_insertSize_mean','TenX.HP1_ref_insertSize_std','TenX.HP1_ref_reason_alignmentScore','TenX.HP1_ref_reason_orientation','TenX.HP2_alt_alnScore_mean','TenX.HP2_alt_alnScore_std','TenX.HP2_alt_count','TenX.HP2_alt_insertSize_mean','TenX.HP2_alt_insertSize_std','TenX.HP2_alt_reason_alignmentScore','TenX.HP2_alt_reason_insertSizeScore','TenX.HP2_alt_reason_orientation','TenX.HP2_amb_alnScore_mean','TenX.HP2_amb_alnScore_std','TenX.HP2_amb_count','TenX.HP2_amb_insertSize_mean','TenX.HP2_amb_insertSize_std','TenX.HP2_amb_reason_alignmentScore_alignmentScore','TenX.HP2_amb_reason_alignmentScore_orientation','TenX.HP2_amb_reason_flanking','TenX.HP2_amb_reason_insertSizeScore_alignmentScore','TenX.HP2_amb_reason_multimapping','TenX.HP2_amb_reason_orientation_alignmentScore','TenX.HP2_amb_reason_orientation_orientation','TenX.HP2_amb_reason_same_scores','TenX.HP2_ref_alnScore_mean','TenX.HP2_ref_alnScore_std','TenX.HP2_ref_count','TenX.HP2_ref_insertSize_mean','TenX.HP2_ref_insertSize_std','TenX.HP2_ref_reason_alignmentScore','TenX.HP2_ref_reason_orientation']]

In [285]:
model.predict(X2)

array([ 0.,  1.,  0., ...,  0.,  0.,  0.])

In [286]:
X2['model_pred_label'] = model.predict(X2)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [287]:
X2['GIAB_Crowd'] = HG002_pred_2['GIAB_Crowd']



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [294]:
#Import Data
HG002_pred_min_1_1 = pd.read_csv('/Users/lmc2/NIST/Notebooks/CrowdVariant/svviz.Annotate.DEL.HG002_minus_one.csv')

In [295]:
HG002_pred_min_1_1['model_pred_label'] = X2['model_pred_label']

In [296]:
HG002_pred_min_1_1.to_csv('HG002_pred_min_1.csv', index=False)

** Task **
- Look at newly labeled data points (svviz images)
- Formerly assigned GT ‘-1’, now assigned CrowdVar label

    0 —> Hom Var
    1 —> Het Var