# Fetching and processing the data


I used [bedtools](http://bedtools.readthedocs.io/en/latest/) to intersect constrained coding regions (CCRs), defined as exonic regions lying between consecutive non-synonymous [gnomAD](http://gnomad.broadinstitute.org/) variants, with somatic mutations observed in cancer genomes ([COSMIC](https://cancer.sanger.ac.uk/cosmic/download)). The result is a `pickle` file containing, for each CCR, the total multi-exonic length, the number of somatic mutations, and other "features" (e.g. CpG density and synonymous variant density):

In [1]:
import pandas as pd
data_df = pd.read_pickle('ccrs.v2.20180420.lengths_numberMutations.pkl')
data_df = data_df.sample(frac=0.01, random_state=0)

data_df.head()

| | unique_key | total_number_of_mutations | total_length | chrom | ccr_pct | gene | ranges | varflag | syn_density | cpg | cov_score | resid | resid_pctile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1250370 | 1250371 | 1 | 6 | 1 | 30.131443 | PER3 | 7854050-7854056 | VARFALSE | 0.167 | 0.0 | 6.0 | -0.043 | 4.185563 |
| 3662076 | 3662078 | 0 | 4 | 19 | 0.033837 | CTD-3214H19.16 | 7747293-7747297 | VARFALSE | 0.0 | 1.0 | 3.19 | -1.765 | 0.639091 |
| 216081 | 216082 | 1 | 18 | 4 | 75.38235 | CASP3 | 185553031-185553049 | VARFALSE | 0.056 | 0.0 | 17.82 | 1.627 | 7.623689 |
| 200877 | 200878 | 2 | 18 | 11 | 76.530214 | TENM4 | 78413057-78413075 | VARFALSE | 0.0 | 0.0 | 18.0 | 1.653 | 7.676046 |
| 3125117 | 3125118 | 0 | 1 | 10 | 2.479659 | DHTKD1 | 12139770-12139771 | VARFALSE | 0.0 | 0.0 | 1.0 | -0.749 | 2.731195 |
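
The counting step behind `total_number_of_mutations` can be pictured with a small pure-Python sketch. This is not the bedtools pipeline that produced the file; it only mimics a per-interval overlap count, such as the one `bedtools intersect -c` reports, for point mutations. It assumes half-open `[start, end)` intervals, which is consistent with the rows shown above, where `total_length` equals the end of the range minus its start. The function name and the toy coordinates are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def count_mutations_per_interval(intervals, mutation_positions):
    """Count point mutations inside each half-open interval [start, end).

    intervals: list of (chrom, start, end) tuples
    mutation_positions: list of (chrom, position) tuples
    """
    # Sort the mutation positions on each chromosome once, so that each
    # interval can be counted with two binary searches.
    positions_by_chrom = defaultdict(list)
    for chrom, position in mutation_positions:
        positions_by_chrom[chrom].append(position)
    for positions in positions_by_chrom.values():
        positions.sort()

    counts = []
    for chrom, start, end in intervals:
        positions = positions_by_chrom.get(chrom, [])
        # number of positions p with start <= p < end
        counts.append(bisect_left(positions, end) - bisect_left(positions, start))
    return counts


# Toy data, made up for illustration (not taken from gnomAD or COSMIC)
toy_intervals = [('1', 100, 106), ('1', 200, 218), ('4', 50, 54)]
toy_mutations = [('1', 100), ('1', 105), ('1', 106), ('1', 210), ('11', 52)]
toy_counts = count_mutations_per_interval(toy_intervals, toy_mutations)
print(toy_counts)
```

The mutation at position 106 is not counted in the first interval because the end coordinate is treated as exclusive.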


# Edit the Python path

Make the modules containing the modeling code visible to this notebook:


In [2]:
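import os
import sys

# the modeling code lives in the sibling directory "model"
model_directory = os.path.join(os.path.dirname(os.getcwd()), 'model')
sys.path.append(model_directory)

# print(...) with a single argument behaves the same in Python 2 and 3
print('Python searches these paths when asked to import a module:')
for path in sys.path:
    print(path)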

Python searches these paths when asked to import a module:

/anaconda2/envs/tensorflow/lib/python27.zip
/anaconda2/envs/tensorflow/lib/python2.7
/anaconda2/envs/tensorflow/lib/python2.7/plat-darwin
/anaconda2/envs/tensorflow/lib/python2.7/plat-mac
/anaconda2/envs/tensorflow/lib/python2.7/plat-mac/lib-scriptpackages
/anaconda2/envs/tensorflow/lib/python2.7/lib-tk
/anaconda2/envs/tensorflow/lib/python2.7/lib-old
/anaconda2/envs/tensorflow/lib/python2.7/lib-dynload
/anaconda2/envs/tensorflow/lib/python2.7/site-packages
/anaconda2/envs/tensorflow/lib/python2.7/site-packages/IPython/extensions
/Users/petermchale/.ipython
/Users/petermchale/Work/modeling_mutation_counts_using_neural_networks/engineer_features/model


# Modeling

I assume the mutation counts are Poisson distributed with mean $l h(x)$, where $l$ is the interval length, $x$ is some feature, and $h$ is a neural network modeling the unknown mutation rate:

In [3]:
# train the model using tensorflow and just one feature 
from model import train
data_df_headings = {'l_heading_list': ['total_length'],
                    'X_heading_list': ['syn_density'],
                    'y_heading_list': ['total_number_of_mutations']}
log_df = train(data_df, **data_df_headings)
log_df



| | epoch | cost | likelihood | bias | weight |
|---|---|---|---|---|---|
| 0 | 0.0 | 5.11768 | 0.0 | -0.1 | -1.154452 |
| 1 | 10.0 | 1.948738 | 0.0 | -1.02151 | -2.05192 |
| 2 | 20.0 | 1.331807 | 0.0 | -1.664027 | -2.583327 |
| 3 | 30.0 | 1.266611 | 0.0 | -2.008585 | -2.73158 |
| 4 | 40.0 | 1.269745 | 0.0 | -2.143325 | -2.616213 |
| 5 | 50.0 | 1.264065 | 0.0 | -2.160345 | -2.352807 |
| 6 | 60.0 | 1.251779 | 0.0 | -2.127649 | -2.023236 |
| 7 | 70.0 | 1.240969 | 0.0 | -2.088615 | -1.680328 |
| 8 | 80.0 | 1.233528 | 0.0 | -2.065691 | -1.354487 |
| 9 | 90.0 | 1.228032 | 0.0 | -2.064551 | -1.0581 |
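
To make the Poisson assumption concrete, here is a minimal sketch of a cost that such a model could minimize: the mean negative log-likelihood of the counts under a Poisson distribution with mean $l h(x)$. I do not claim this is what `train` computes. The log-linear rate $h(x) = \exp(\text{bias} + \text{weight} \cdot x)$ is a stand-in assumption for the network, and the function names are hypothetical. The inputs are the `total_length`, `syn_density`, and `total_number_of_mutations` values of the five rows displayed earlier, and the two parameter pairs borrow the first and last logged `bias` and `weight` values purely as illustrative numbers, so the result is not expected to reproduce the logged `cost`.

```python
import math


def expected_count(length, x, bias, weight):
    """Poisson mean l * h(x), with h(x) = exp(bias + weight * x) as a stand-in rate."""
    return length * math.exp(bias + weight * x)


def poisson_negative_log_likelihood(lengths, xs, counts, bias, weight):
    """Mean negative log-likelihood of the counts under Poisson(l * h(x))."""
    total = 0.0
    for length, x, count in zip(lengths, xs, counts):
        mean = expected_count(length, x, bias, weight)
        # log P(count | mean) = count * log(mean) - mean - log(count!)
        log_prob = count * math.log(mean) - mean - math.lgamma(count + 1)
        total -= log_prob
    return total / len(counts)


# The five rows displayed earlier in this notebook
lengths = [6, 4, 18, 18, 1]          # total_length
xs = [0.167, 0.0, 0.056, 0.0, 0.0]   # syn_density
counts = [1, 0, 1, 2, 0]             # total_number_of_mutations

# Illustrative parameter values only (assumption: log-linear h)
nll_start = poisson_negative_log_likelihood(lengths, xs, counts, -0.1, -1.154452)
nll_end = poisson_negative_log_likelihood(lengths, xs, counts, -2.064551, -1.0581)
print(nll_start, nll_end)
```

Under this stand-in rate, the second parameter pair gives a lower negative log-likelihood on these five rows than the first, which is the direction one would expect a fitted model to move in.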


# Analysis of learned model 

In [5]:
%matplotlib inline 

from plot import plot_counts
plot_counts(data_df, **data_df_headings)

The observed mutation counts deviate from the counts the model expects, which indicates that the model is incomplete.
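
One way to quantify such a deviation, sketched here under my own assumptions rather than as a description of `plot_counts`, is to bin the intervals by the feature and compare the summed observed counts with the summed expected counts $l h(x)$ in each bin. The function name, the bin width, and the constant rate in the usage lines are hypothetical.

```python
import math
from collections import defaultdict


def observed_vs_expected(lengths, xs, counts, rate, bin_width=0.1):
    """Sum observed and expected counts within bins of the feature x.

    rate: callable returning the modeled mutation rate h(x) per unit length
    Returns {bin_index: (observed_total, expected_total)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for length, x, count in zip(lengths, xs, counts):
        bin_index = int(math.floor(x / bin_width))
        totals[bin_index][0] += count              # observed
        totals[bin_index][1] += length * rate(x)   # expected: l * h(x)
    return {b: tuple(v) for b, v in sorted(totals.items())}


# The five rows displayed earlier, with a made-up constant rate of 0.1
lengths = [6, 4, 18, 18, 1]
xs = [0.167, 0.0, 0.056, 0.0, 0.0]
counts = [1, 0, 1, 2, 0]
comparison = observed_vs_expected(lengths, xs, counts, rate=lambda x: 0.1)
for bin_index, (observed, expected) in comparison.items():
    print(bin_index, observed, expected)
```

Bins in which the observed total sits far from the expected total, relative to the Poisson spread of roughly the square root of the expected total, point to feature ranges where the model fits poorly.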