# Topologically Associated Domain boundary prediction

This notebook aims to reproduce the results of a recent paper, [Henderson et al., 2019](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz315/5485073#supplementary-data), in which the authors predicted high-resolution Topologically Associated Domain (TAD) boundaries from DNA sequence data using a convolutional neural network. After fine-tuning and optimizing the model, they report detecting boundaries with an accuracy of 96% from one-hot encoded sequence data alone. Here I attempt to reproduce their results using the [fastai](https://docs.fast.ai/) library, which is built on [PyTorch](https://pytorch.org/).
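
As a small illustration of what "one-hot encoded" means here, the sketch below maps each base to a 4-element indicator vector, so a sequence of length L becomes an L x 4 array like the samples loaded later in this notebook. The `one_hot` helper and the A, C, G, T channel order are assumptions for illustration; the channel order actually used in the dataset is not stated here.

```python
# Hypothetical helper: one-hot encode a DNA string.
# The channel order (A, C, G, T) is an assumption, not taken from the dataset.
BASES = "ACGT"

def one_hot(seq):
    """Return one row of 4 indicators per base; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

encoded = one_hot("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

A 1,000-base window encoded this way has shape (1000, 4), matching one sample of the arrays below.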

In [1]:
%reload_ext autoreload
%autoreload 2 
%matplotlib inline

In [2]:
from fastai import *
from fastai.text import *
import h5py

The data were cloned from the public repository https://github.com/lincshunter/TADBoundaryDectector, and the `.rar` archive was unpacked with the `unrar` utility in the data directory.

In [3]:
dpath = Path("data/TADBoundaryDectector")

In [4]:
dpath.ls()

[PosixPath('data/TADBoundaryDectector/.git'),
 PosixPath('data/TADBoundaryDectector/Models.py'),
 PosixPath('data/TADBoundaryDectector/README.md'),
 PosixPath('data/TADBoundaryDectector/dm3.kc167.example.rar'),
 PosixPath('data/TADBoundaryDectector/dm3.kc167.example.h5')]

In [5]:
h5file = "data/TADBoundaryDectector/dm3.kc167.example.h5"
f = h5py.File(h5file, 'r') 

In [6]:
f.keys() # list available keys

<KeysViewHDF5 ['x_test', 'x_train', 'x_val', 'y_test', 'y_train', 'y_val']>

In [7]:
x_test = np.array(f['x_test'])
x_test.shape # 1,000 samples x 1,000 positions x 4 one-hot channels

(1000, 1000, 4)

In [8]:
x_train = np.array(f['x_train'])
x_val = np.array(f['x_val']) 
y_test = np.array(f['y_test'])
y_train = np.array(f['y_train'])
y_val = np.array(f['y_val'])

In [9]:
x_test.shape

(1000, 1000, 4)

In [10]:
x_val.shape

(1000, 1000, 4)

In [11]:
y_test.shape

(1000,)

In [12]:
y_train.shape

(28127,)

In [13]:
y_val.shape

(1000,)

Now let's coerce these data into a fastai `DataBunch` object. First, convert all arrays to tensors.

In [14]:
x_test,x_train,x_val, y_test,y_train,y_val = map(torch.tensor, [x_test,x_train,x_val, y_test,y_train,y_val])

In [15]:
x_test.shape

torch.Size([1000, 1000, 4])

In [16]:
# use minibatches
bs=64
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_val, y_val)
test_ds = TensorDataset(x_test, y_test)
data = DataBunch.create(train_ds, valid_ds, test_ds, bs=bs)

In [17]:
x,y=next(iter(data.train_dl))
x.shape, y.shape

(torch.Size([64, 1000, 4]), torch.Size([64]))
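
To make the batch shapes above concrete, here is a rough pure-Python sketch of how the 28,127 training examples split into minibatches of 64. `batch_slices` is a hypothetical helper, not part of fastai; a real `DataLoader` also shuffles the data and may drop the final partial batch.

```python
def batch_slices(n_items, bs):
    """Yield (start, stop) index pairs that cover n_items in chunks of bs."""
    for start in range(0, n_items, bs):
        yield start, min(start + bs, n_items)

slices = list(batch_slices(28127, 64))
n_batches = len(slices)
last_size = slices[-1][1] - slices[-1][0]
print(n_batches, last_size)  # 440 batches, the last one holding 31 items
```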

In [18]:
x,y=data.one_batch()
y,x.shape

(tensor([1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
         1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0]),
 torch.Size([64, 1000, 4]))

In [19]:
# Conv1d expects (batch, chan, len), so swap the last two axes
x.transpose(1,2).shape

torch.Size([64, 4, 1000])
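
The transpose above can be pictured on a single toy sample. The sketch below uses plain lists rather than tensors: a (len, chan) sample of 3 positions by 4 channels becomes (chan, len), which is the per-sample layout `nn.Conv1d` expects. The toy values are made up for illustration.

```python
# One toy sample stored as (len, chan): 3 positions, 4 one-hot channels
sample = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Swap the two axes -> (chan, len): one row per channel
channels_first = [list(col) for col in zip(*sample)]

print(len(sample), len(sample[0]))                  # 3 4
print(len(channels_first), len(channels_first[0]))  # 4 3
```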

Create a model architecture like [three_CNN_LSTM](https://github.com/lincshunter/TADBoundaryDectector/blob/master/Models.py)


Pseudocode of the three_CNN_LSTM model:

 - sequential
 - conv1d
 - relu
 - dropout(0.2)
 - maxpool1d

 - conv1d
 - relu
 - dropout(0.3)
 - maxpool1d

 - conv1d
 - relu
 - dropout(0.3)
 - maxpool1d

 - bidirectional LSTM
 - flatten
 - dense
 - sigmoid

The model is trained using Adam optimization with cross-entropy loss.
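
The pseudocode ends in a sigmoid, which pairs naturally with binary cross-entropy, while the `Learner` cell at the end of this notebook uses `nn.CrossEntropyLoss`, which expects one logit per class. For two classes the two formulations agree: a sigmoid on a single logit z gives the same probability as a softmax over the logits [0, z]. A minimal sketch in plain Python, with made-up logit values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(p, y):
    """Loss for predicted probability p of class 1 and true label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cross_entropy(logits, y):
    """Loss for a list of class logits and true class index y."""
    return -math.log(softmax(logits)[y])

# Compare the two losses on a few illustrative logits and both labels
for z in (-2.0, 0.5, 3.0):
    for y in (0, 1):
        a = binary_cross_entropy(sigmoid(z), y)
        b = cross_entropy([0.0, z], y)
        print(z, y, round(a, 6), round(b, 6))
```

In practice this means a head with two output logits and cross-entropy loss is a reasonable PyTorch counterpart to a single sigmoid unit with binary cross-entropy.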



This article on bidirectional RNNs in PyTorch might be helpful:

https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66

In [258]:
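# import torch.nn.functional as F

# epochs = 150 
# kernsize = 9 
# bs = 50
# verbose = 1 
# nclass = 2
# metrics = ['accuracy']
# loss = 'binary_crossentropy'
# kernel = 'glorot_uniform'

class BoundaryModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Only the first conv sees the 4 one-hot channels; the later ones take
        # the 64 feature maps produced by the previous block.
        # padding=4 is an assumption: it keeps the length unchanged so that
        # three rounds of MaxPool1d(9) fit (1000 -> 111 -> 12 -> 1).
        # A smaller pool size would leave the GRU more than one time step.
        self.conv1 = nn.Conv1d(4, 64, 9, padding=4)
        self.conv2 = nn.Conv1d(64, 64, 9, padding=4)
        self.conv3 = nn.Conv1d(64, 64, 9, padding=4)
        self.relu = nn.ReLU()
        self.drop1 = nn.Dropout(p=0.2)
        self.drop2 = nn.Dropout(p=0.3)
        self.pool = nn.MaxPool1d(9)
        # GRU input_size is the channel count (64), not the sequence length
        self.bdGRU = nn.GRU(64, 4, bidirectional=True, batch_first=True)
        # Dense head (assumption): 2 * 4 GRU features -> 2 class logits
        self.fc = nn.Linear(2 * 4, 2)

    def forward(self, x):
        # (batch, len, chan) -> (batch, chan, len), as Conv1d expects
        x = x.float().transpose(1,2).contiguous()

        x = self.conv1(x)
        x = self.relu(x)
        x = self.drop1(x)
        x = self.pool(x)

        x = self.conv2(x)
        x = self.relu(x)
        x = self.drop2(x)
        x = self.pool(x)

        x = self.conv3(x)
        x = self.relu(x)
        x = self.drop2(x)
        x = self.pool(x)

        # back to (batch, len, chan) for the batch_first GRU, which returns (output, h_n)
        out, _ = self.bdGRU(x.transpose(1,2).contiguous())
        # last time step -> logits for nn.CrossEntropyLoss
        return self.fc(out[:, -1])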
    

In [None]:
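# one-dimensional image with length L and 4 channels
# inshape = [1000,4] # input layer shape

# N filters, each with a kernel of shape (k, 4)


model = nn.Sequential(
    # conv layer 1: Conv1d takes an integer kernel size; the 4 channels are
    # covered by in_channels, so the kernel is 9 rather than (9,4)
    nn.Conv1d(4, 64, 9),
    #nn.BatchNorm1d(1000),
    nn.ReLU(),
    nn.MaxPool1d(9),
    # conv layer 2
    # FIXME: receives 64 channels from layer 1, so in_channels should be 64
    nn.Conv1d(4, 64, 9),
    #nn.BatchNorm1d(1000),
    nn.ReLU(),
    nn.MaxPool1d(9),
    # conv layer 3
    # FIXME: same in_channels mismatch as layer 2
    nn.Conv1d(4, 64, 9), 
    #nn.BatchNorm1d(1000),
    nn.ReLU(),
    nn.MaxPool1d(9),
    # FIXME: input_size should be the channel count (64), and nn.GRU returns
    # an (output, h_n) tuple, so it needs a wrapper module inside nn.Sequential
    nn.GRU(1000, 40, bidirectional=True )
)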

In [None]:
x,y = next(iter(data.train_dl))
z = x.transpose(2,1).float().cuda()
z = z.contiguous()
z.shape

In [297]:
model(z)

RuntimeError: Given groups=1, weight of size 64 4 9, expected input[64, 64, 110] to have 4 channels, but got 64 channels instead
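
The error can be reproduced on paper by tracking the channel count and sequence length through the stack. The sketch below does this in plain Python, assuming unpadded stride-1 convolutions and `MaxPool1d(9)` with its default stride equal to the kernel size; `trace` is a hypothetical helper. It shows the first block turning a `[64, 4, 1000]` batch into `[64, 64, 110]`, which a second `Conv1d` declared with 4 input channels cannot accept.

```python
def conv_len(n, k, pad=0):
    """Output length of a stride-1 Conv1d with kernel size k."""
    return n + 2 * pad - k + 1

def pool_len(n, k):
    """Output length of MaxPool1d(k) with stride k and no padding."""
    return (n - k) // k + 1

def trace(chan, length, blocks, k=9, pool=9):
    """Follow (chan, length) through conv+pool blocks given as (in_chan, out_chan)."""
    for i, (c_in, c_out) in enumerate(blocks, start=1):
        if c_in != chan:
            return f"block {i}: expected {c_in} channels, got {chan} (length {length})"
        chan, length = c_out, pool_len(conv_len(length, k), pool)
        if length < 1:
            return f"block {i}: sequence too short"
    return chan, length

# As in the failing model: every conv declares 4 input channels
print(trace(4, 1000, [(4, 64), (4, 64), (4, 64)]))
# With matching channels, the third block still runs out of length
print(trace(4, 1000, [(4, 64), (64, 64), (64, 64)]))
```

Matching the channel counts fixes the reported error, but the second trace suggests the length then shrinks 1000 -> 110 -> 11 and the third block has too few positions left. Padding the convolutions (for example `padding=4` with a kernel of 9) or using a smaller pooling window are two possible ways around that.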

In [263]:
learn = Learner(data, model, loss_func = nn.CrossEntropyLoss(), metrics=accuracy)
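
For reference, the `accuracy` metric passed to the `Learner` compares the arg-max class of each row of model outputs with the target label, so the model needs to emit one score per class. A plain-Python sketch of that computation with made-up logits; `accuracy_sketch` is a hypothetical stand-in, not fastai's implementation:

```python
def accuracy_sketch(logits, targets):
    """Fraction of rows whose largest logit sits at the target class index."""
    preds = [row.index(max(row)) for row in logits]
    correct = sum(int(p == t) for p, t in zip(preds, targets))
    return correct / len(targets)

# Four illustrative examples with two class scores each
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
targets = [0, 1, 0, 0]
print(accuracy_sketch(logits, targets))  # 0.75
```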