## Experiments on Creating Deterministically Sparse Networks
Pinar Demetci, Lorin Crawford Lab 2019, Brown University

In our neural network model, the connections of the network are determined by biological annotations. This means that we construct deterministically sparse feedforward neural networks. After some brainstorming, we came up with the following ideas for achieving this. The last one (#5) is the method we currently use in our model; we try to show here that it indeed works, after discussing a few other options that might seem possible but are not ideal.

**1) Creating dense subnetworks and then joining them all together into one final sparsely-connected network**
> $\;\;\;\;\;$ This option does not work in the case of overlapping connections, i.e. when a SNP falls into multiple genetic annotations (due to overlap in annotations) or a gene is part of multiple pathways, which can indeed be the case biologically:
    <img src="images/sparsity_subnetwork.png">  
$\;\;\;\;\;$ This image displays the idea in a tiny example of a neural network. In the case of no overlaps in connections, the outputs of <font color=maroon>subnetwork#1</font> and <font color=blue>subnetwork#2</font> can be concatenated and used as the input of <font color=green>subnetwork#3</font>. This connects the dense subnetworks into one sparsely connected network, and backpropagation runs through the subnetworks. Example code at the end of this section.  
However, in the case of overlapping connections (i.e. a node is connected to more than one subnetwork), we need to include the node, with its connections, in more than one subnetwork. In the example figure above, this would be the purple node, "x3". As a result, a single input appears multiple times in the resulting network, as if there were multiple inputs with the same value, which introduces a bias/inaccuracy into the model.  
So this approach will not work for our purposes, where genes might participate in multiple pathways (e.g. ....) or multiple SNPs might fall in the neighborhood of multiple genetic elements based on the genomic annotation.
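
The duplication problem can be made concrete with a small NumPy sketch. The grouping of inputs (x1, x2, x3 in subnetwork #1; x3, x4 in subnetwork #2), the random data, the `tanh` activation, and all variable names are assumptions chosen for illustration; this is not the layout of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy input: 5 samples, 4 features (x1..x4).
X = rng.normal(size=(5, 4))

# Assumed grouping: x3 (column index 2) belongs to both subnetworks.
idx_sub1 = [0, 1, 2]   # x1, x2, x3
idx_sub2 = [2, 3]      # x3, x4

# One dense layer per subnetwork, each producing a single hidden unit.
W_sub1 = rng.normal(size=(len(idx_sub1), 1))
W_sub2 = rng.normal(size=(len(idx_sub2), 1))
W_sub3 = rng.normal(size=(2, 1))

h1 = np.tanh(X[:, idx_sub1] @ W_sub1)   # output of subnetwork #1
h2 = np.tanh(X[:, idx_sub2] @ W_sub2)   # output of subnetwork #2

# Concatenate the subnetwork outputs and feed them to subnetwork #3.
hidden = np.concatenate([h1, h2], axis=1)
y_hat = hidden @ W_sub3

# The inputs actually consumed by the joined network: x3 shows up twice.
X_effective = np.concatenate([X[:, idx_sub1], X[:, idx_sub2]], axis=1)
n_copies_x3 = (idx_sub1 + idx_sub2).count(2)

print(X.shape, X_effective.shape, n_copies_x3)
```

The joined network effectively sees five input columns for four real features, because the shared input is fed in once per subnetwork that uses it.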


**2) Deterministically initializing weights of the sparse layers as 0s and 1s, freezing these layers, and only training the final layer variables:** 
> $\;\;\;\;\;$ This ensures that except for the final layer, the weights are either 1 or 0, initialized based on biological annotations and are never changed due to freezing (the "frozen" layers are not trained). Illustration for this is below:  
<img src="images/sparsity_freeze.png"> 
The way to "freeze" a layer is to set that layer to be untrainable as demonstrated in the code snippet below.  
But this method means we do not allow training the weights of connections between SNPs and genes (or between genes and pathways, if we are carrying out pathway-based inference). So the effect sizes (contributions) of SNPs to genes are never learned, which limits the model. It may also be a poor choice biologically, because we know from empirical studies that not all SNPs affect gene expression to the same degree. So, this is a possible solution, but not really a desirable one. 
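
As a rough illustration of what freezing implies, the NumPy sketch below keeps a 0/1 annotation layer fixed and applies gradient descent only to the final layer. The data, the 4-by-2 annotation pattern, the linear hidden layer, the learning rate, and all names are assumptions made for this example. In TensorFlow the same effect would normally be obtained by marking the layer or variable as non-trainable rather than by writing the update loop by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 samples, 4 SNPs, one continuous phenotype.
X = rng.normal(size=(20, 4))
y = rng.normal(size=(20, 1))

# Fixed 0/1 annotation layer (rows: SNPs, columns: genes). It is "frozen":
# the training loop below never updates it.
W_frozen = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 1.0]])
W_frozen_before = W_frozen.copy()

# Final layer: the only trainable weights.
W_out = rng.normal(size=(2, 1))


def mse(W):
    """Mean squared error of the two-layer linear model for final weights W."""
    return float(np.mean((X @ W_frozen @ W - y) ** 2))


loss_before = mse(W_out)

lr = 0.05
for _ in range(100):
    H = X @ W_frozen                              # hidden layer (linear for simplicity)
    y_hat = H @ W_out
    grad_out = 2 * H.T @ (y_hat - y) / len(X)     # d(MSE)/dW_out
    W_out -= lr * grad_out                        # only the final layer moves

loss_after = mse(W_out)
print(loss_before, loss_after)
```

Every nonzero entry of `W_frozen` stays at exactly 1, so each annotated SNP contributes to its gene with the same fixed weight, which is the limitation described above.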


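```python
import tensorflow as tf

# Weight matrix built from a mix of trainable entries (tf.Variable)
# and fixed entries (tf.constant). Rows are inputs, columns are hidden nodes.
# X is assumed to be the input matrix defined in an earlier cell.
W1 = [[tf.Variable(1.0), tf.constant(0.0)],
      [tf.Variable(1.0), tf.constant(0.0)],
      [tf.Variable(1.0), tf.Variable(1.0)],
      [tf.constant(0.0), tf.Variable(1.0)]]

layer1_output = tf.matmul(X, W1)
print(layer1_output)
```

```
TypeError: Input 'b' of 'MatMul' Op has type float32 that does not match type float64 of argument 'a'.
```

The error is a dtype mismatch: `X` (argument 'a') is float64, while the entries of `W1` (argument 'b') are float32. Casting `X` to float32, for example with `tf.cast(X, tf.float32)`, makes the two dtypes agree.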

**3) Initializing weight matrices as a mix of "variables" and "constants" so only some of them are trained in the backpropagation**  
> $\;\;\;\;\;$ This is basically the same idea as #2, except instead of freezing all the weights in a given layer, we freeze only some of them by declaring them "constants". So, our weight matrix will have a mix of variables (trainable values) and constants (untrainable values).  
While this works, it is not a great solution, because:

**4) Overriding the "Dropout" layer of TensorFlow to carry out deterministic dropout.**  
> $\;\;\;\;\;$ When people create sparse feed-forward neural networks with TensorFlow or Keras, they use the "Dropout" method. This method randomly drops connections after the user defines what percent of connections in a layer they want to keep. So, one might presume that a good way to build deterministically connected layers is to modify the Dropout method so that we can specify in the input which connections to drop. The source code of Dropout can be found here:  
And the paper that first introduced the idea of dropout for regularizing neural networks is:  
<a href="http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf">Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. "Dropout: A simple way to prevent neural networks from overfitting", JMLR 2014.</a>
As can be seen in both the source code as well as the publication that introduces dropout, this method drops nodes, not connections (image below is from the paper):
<img src="images/sparsity_dropout.png">
However, in our case, every SNP will be connected to at least one gene, and every gene that has some SNP information will be connected to at least one pathway (or to an "unannotated" node). What we want is to keep every node and drop individual connections, rather than dropping connections by dropping whole neurons. So, modifying Dropout is not a good fit for our purposes.

**5) Creating a "mask matrix" of 0s and 1s to multiply weights with in every iteration of forward pass, so those connections are dropped ("zeroed-out") and don't contribute to the output or gradient calculations.**  
> $\;\;\;\;\;$ This does work quite smoothly. It is demonstrated below (by keeping track of weight updates and gradients). This is basically the same idea as dropout, but the mask multiplies the weight matrix instead of the input matrix. In the weight matrix of the first hidden layer, every row corresponds to a SNP and every column corresponds to a gene. For the second hidden layer (if we are carrying out pathway inference), the rows of the weight matrix correspond to genes and the columns correspond to pathways. Therefore, in the first hidden layer, for example, zeroing out an element of the weight matrix means dropping the connection between the SNP given by that element's row and the gene given by its column. Because the weight is always multiplied by zero, the output is the same whatever the value of the weight, so the gradient of the loss with respect to that weight is 0. With a zero gradient, the weight is never updated.
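
The zero-gradient argument can be checked numerically with a minimal NumPy sketch. The toy data, the 4-by-2 mask, the linear layers, the hand-written gradients, and all variable names are assumptions for illustration; this is not the model's actual TensorFlow code, where the analogous step would be multiplying the weight tensor by the mask before the matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 samples, 4 SNPs, one continuous phenotype.
X = rng.normal(size=(20, 4))
y = rng.normal(size=(20, 1))

# Mask of allowed connections (rows: SNPs, columns: genes).
mask = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [1.0, 1.0],
                 [0.0, 1.0]])

W1 = rng.normal(size=(4, 2))   # SNP -> gene weights (masked in the forward pass)
W2 = rng.normal(size=(2, 1))   # gene -> output weights
W1_initial = W1.copy()

lr = 0.01
for _ in range(50):
    # Forward pass: the mask zeroes out the forbidden connections.
    H = X @ (W1 * mask)
    y_hat = H @ W2

    # Backward pass for the mean squared error, written out by hand.
    d_yhat = 2 * (y_hat - y) / len(X)
    grad_W2 = H.T @ d_yhat
    d_H = d_yhat @ W2.T
    grad_W1 = (X.T @ d_H) * mask   # chain rule through W1 * mask

    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

# Masked entries received zero gradient, so they still hold their initial values.
print(grad_W1)
print(W1 - W1_initial)
```

Note that the masked entries of `W1` keep whatever value they were initialized with; they are simply never used in the forward pass and never updated.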