# Deep Learning in Medicine
### BMSC-GA 4493, BMIN-GA 3007 
### Homework 2



**Learning Objectives**:

1. More CNN
2. Recurrent Neural Networks (RNNs)

**Instructions**

1. If you need to write mathematical terms, you can type your answers in a Markdown cell via LaTeX. See <a href="https://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook">here</a> if you have issues with writing equations. For basic LaTeX notation, see <a href="https://en.wikibooks.org/wiki/LaTeX/Mathematics">here</a>.

2. Upload and submit your final Jupyter notebook file to <a href='http://newclasses.nyu.edu'>newclasses.nyu.edu</a>

3. Deadline: Tuesday March 24th 2020 (3pm)

4. Questions and Clarification: <a href="https://piazza.com/nyumc.org/spring2020/bmscga4493andbminga3007/home"> Class Piazza</a>

# Question 1: Short Questions

### 1.1.
What will be the dimensions of the feature maps after we forward propagate the image using the given convolution kernels for:

    a. Stride = 1, without zero padding?
    b. Stride = 2, without zero padding?
    c. Stride = 2, with zero padding?
    d. Stride = 3, with zero padding?
    e. A dilated convolution with stride=1, dilation rate=2 and zero padding?
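As a reminder of the arithmetic behind these cases, here is a minimal sketch of the standard output-size formula along one spatial axis, `floor((n + 2p - d(k - 1) - 1) / s) + 1`. The helper name `conv_output_size` is hypothetical. The 6x6 input and 3x3 kernel are the sizes stated in 1.6; the padding amount of 1 is an assumption, since the assignment does not say how much zero padding is used.

```python
def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial axis.

    n: input size, k: kernel size. `padding` is the number of zeros
    added on EACH side (an assumption: the assignment does not fix it).
    """
    effective_k = dilation * (k - 1) + 1  # kernel extent once dilated
    return (n + 2 * padding - effective_k) // stride + 1


# Illustration with a 6x6 input and a 3x3 kernel.
for stride, padding, dilation in [(1, 0, 1), (2, 0, 1), (2, 1, 1), (1, 1, 2)]:
    size = conv_output_size(6, 3, stride, padding, dilation)
    print(f"stride={stride} padding={padding} dilation={dilation} -> {size}x{size}")
```

The same function applies to height and width separately, so non-square inputs only need two calls.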

### 1.2.
Calculate the feature maps for the case stride=2, with zero padding. 

In [None]:
# starter code to load image:x, kernel weights:w and bias:b
import numpy as np
npzfile = np.load('Question1.npz') 
# 'Question1.npz' is provided under /beegfs/ga4493/data/HW2 folder at HPC
print(npzfile.files) # check the variable names
x = npzfile['x']
w = npzfile['w']
b = npzfile['b']

### 1.3. 

Apply the following activation functions to the feature maps calculated in 1.2 and provide the resulting activation maps:

    a. ReLU
    b. Leaky ReLU with negative slope coefficient = 0.01
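For reference, both activations are element-wise and can be sketched in NumPy as below. The function names and the `demo` array are made up for illustration; they are not the feature maps from 1.2.

```python
import numpy as np


def relu(z):
    """ReLU: keep positive values, set the rest to zero."""
    return np.maximum(z, 0.0)


def leaky_relu(z, negative_slope=0.01):
    """Leaky ReLU: scale negative values by `negative_slope` instead of zeroing them."""
    return np.where(z > 0, z, negative_slope * z)


# Made-up values, only to show the element-wise behaviour.
demo = np.array([[-2.0, 0.0], [1.5, -0.5]])
print(relu(demo))
print(leaky_relu(demo))
```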

### 1.4.
List three pooling strategies and write their mathematical forms for 2D inputs.

### 1.5.
Pick two of the three pooling strategies and provide the output features obtained by applying them to the activation maps from 1.3.b for:

    a. pool width=2, stride=1
    b. pool width=4, stride=1
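To check hand calculations, a sliding-window pooling routine can be sketched in NumPy as follows. This is a minimal sketch, not a reference solution: `pool2d` is a hypothetical helper, it assumes a square window with no padding, and the max, average and L2 reducers are just three common choices. The `demo` array is made up.

```python
import numpy as np


def pool2d(a, width, stride=1, mode="max"):
    """Pool a 2D array with a square window and no padding."""
    reducers = {
        "max": np.max,
        "avg": np.mean,
        "l2": lambda window: np.sqrt(np.sum(window ** 2)),
    }
    reducer = reducers[mode]
    out_h = (a.shape[0] - width) // stride + 1
    out_w = (a.shape[1] - width) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            out[i, j] = reducer(a[top:top + width, left:left + width])
    return out


# Made-up 3x3 activation map, only for illustration.
demo = np.arange(9, dtype=float).reshape(3, 3)
print(pool2d(demo, width=2, stride=1, mode="max"))
print(pool2d(demo, width=2, stride=1, mode="l2"))
```

For multi-channel activation maps, the same function can be applied to each channel in turn.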

### 1.6.

Here we will use the PyTorch package to calculate feature/activation maps. Write code that takes a 3x6x6 image and performs a 2D convolution (with stride=2 and zero padding) using the 3x3x3 filters provided in the picture. After the convolution layer, apply a leaky ReLU activation function (with negative slope 0.01) and an L2-pooling operation (pool width = 2 and stride = 1). Provide the code, the feature maps obtained from the convolution operation (compare with 1.2.), the activation maps (compare with 1.3.b), and the feature maps after the L2-pooling operation.

# Question 2: Deep CNN design for disease classification

In this part of the homework, we will focus on classifying lung disease using the chest x-ray dataset provided by NIH (https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community). Please go over the following paper for the details of the dataset: https://arxiv.org/pdf/1705.02315.pdf 

You need to use HPC for the training part of this question, as your computer's CPU will not be fast enough to compute learning iterations. If you use HPC, please upload your code/scripts under the questions and provide the required plots and tables there as well. Data is available on HPC under the /beegfs/ga4493/data/HW2 folder. We are interested in classifying infiltration, pneumothorax, cardiomegaly and *not*(infiltration OR pneumothorax OR cardiomegaly) cases. This gives us 4 classes that we want to identify by modelling a deep CNN.


### 2.1. Label preparation


Work with the Data_Entry_2017.csv file to identify the images that show infiltration, pneumothorax, or cardiomegaly, and the images that show none of these 3 diseases.


### 2.2. Data preparation before training
From here on, you will use HW2_trainSet.csv, HW2_testSet.csv and HW2_validationSet.csv, provided under the /beegfs/ga4493/data/HW2 folder, to define the train, test and validation set samples instead of the csv files you generate in Question 2.1 (Train, Test, and Validation Sets).


There are multiple ways of using images as input during training or validation. Here, you need to decide on one way of using images in your network. You may want to use numpy arrays as shown in Lab 4, the HDF5 file format, or the torch Dataset class (http://pytorch.org/tutorials/beginner/data_loading_tutorial.html). Once you have decided how to use images as input, write the necessary script that will let you feed images into your CNN later. If you need to save anything, please use your own folder on HPC.

### 2.1. Train, Test, and Validation Sets
Write a script that reads data from Data_Entry_2017.csv and processes it to obtain 3 sets (train, validation and test). Using the 'Finding Labels' column, define the class that each image belongs to. In total you can define 5 classes:
- 1 infiltration
- 2 pneumothorax
- 3 cardiomegaly
- 4 cases which contain at least two diseases, at least one of which belongs to classes 1, 2 and 3
- 0 for all other diseases (no infiltration, pneumothorax, or cardiomegaly) or No Finding
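The class definitions above can be sketched as a small labelling function. This is a sketch under assumptions: it assumes each 'Finding Labels' entry is a `|`-separated string of capitalised finding names (for example `Cardiomegaly|Effusion`, with `No Finding` for healthy images), which you should verify against the csv file. `TARGETS` and `assign_class` are hypothetical names.

```python
# Assumed spelling of the three target findings in the 'Finding Labels' column.
TARGETS = {"Infiltration": 1, "Pneumothorax": 2, "Cardiomegaly": 3}


def assign_class(finding_labels):
    """Map one 'Finding Labels' string to a class index in {0, 1, 2, 3, 4}."""
    findings = [f.strip() for f in finding_labels.split("|")]
    hits = [f for f in findings if f in TARGETS]
    if not hits:
        return 0  # other diseases only, or no finding at all
    if len(findings) >= 2:
        return 4  # several findings, at least one of them a target
    return TARGETS[hits[0]]  # exactly one finding, and it is a target


for example in ["Cardiomegaly", "Cardiomegaly|Effusion", "No Finding"]:
    print(example, "->", assign_class(example))
```

With pandas, the function could be applied to the whole column, for example with `df['Finding Labels'].apply(assign_class)`.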

Generate a train, validation and test set by splitting the whole dataset containing specific classes (0, 1, 2, and 3) by 60%, 20% and 20%, respectively. Since we have too many samples in Class 0, use only a random 10% of those samples when creating the sets. The test set will not be used during modelling, but it will be used to test your model's accuracy. Make sure you have similar percentages of the different classes in each subset. Provide statistics on the number of samples per class in your subsets. (You do not need to split the sets by subject for this homework. In general, we do not want images from the same subject to appear in both the train and test sets!)

Write .csv files defining the samples in your train, validation and test sets, named train.csv, validation.csv, and test.csv. Submit these files with your homework.

### 2.3. CNN Model Architecture

Now that we can import images for model training, the next step is to define the CNN model that you will train for the disease classification task. Any model requires us to select model parameters such as the number of layers, the kernel size, the number of feature maps and so on. The number of possible models is infinite, but we need to make some design choices to start. Let's design a CNN model with 5 convolutional layers and a fully connected layer followed by a classification layer. Let's use

-  3x3 convolution kernels
-  ReLU for an activation function
-  max pooling with kernel 2x2 and stride 2. 

Define the number of feature maps in the hidden layers as: 16, 16, 32, 32, 64, 32 (1st layer, ..., 6th layer). **Write a class which specifies the details of this network.**

### 2.4.
How many learnable parameters does this model have? How many learnable parameters would we have if the network had only the 5 convolutional layers, without the fully connected 6th layer? Describe why the fully connected layer needs so many trainable parameters, and suggest ways to mitigate this.
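The counting rules behind this question can be sketched as follows: a convolutional layer has one `k x k x in_channels` kernel plus one bias per output feature map, and a fully connected layer has one weight per input-output pair plus one bias per output. The helper names are hypothetical. The single input channel and the flattened size of `64 * 8 * 8` are assumptions made only for illustration, because the real flattened size depends on your image size and padding.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases of one 2D convolution layer."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def fc_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return (in_features + 1) * out_features


# Assumption: grayscale input (1 channel); feature maps as listed in 2.3.
channels = [1, 16, 16, 32, 32, 64]
conv_total = sum(conv_params(c_in, c_out)
                 for c_in, c_out in zip(channels[:-1], channels[1:]))

# Assumption: the last feature maps are 8x8 when flattened (hypothetical size).
fc_total = fc_params(64 * 8 * 8, 32)

print("conv layers:", conv_total)
print("fully connected layer:", fc_total)
```

Even with this small hypothetical 8x8 map, the single fully connected layer outweighs all five convolutional layers combined, because its parameter count grows with the spatial size of its input while the convolutional layers do not.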

### 2.5. Loss function and optimizer

Define a loss criterion and an optimizer using PyTorch. What type of loss function is applicable to our multi-class classification problem? Explain your choice of loss function. For the optimizer, let's use SGD with momentum for now. Choose an empirical learning rate and momentum.

_Some background:_ In network architecture design, we want an architecture that has enough capacity to learn. We can achieve this by using a large number of feature maps and/or many more connections and activation nodes. However, having a large number of learnable parameters can easily result in overfitting. To mitigate overfitting, we can keep the number of learnable parameters small by using either shallow networks or few feature maps. Taken too far, this approach results in underfitting, where the model can neither fit the training data nor generalize to new data. Ideally, we want to select a model at the sweet spot between underfitting and overfitting. It is hard to find the exact sweet spot.

We first need to make sure we have enough capacity to learn; without enough capacity we will underfit. Here, you will need to check whether the model designed in 2.3. can learn. Since we do not need to check the generalization capacity (overfitting is OK for now since it shows learning is possible), it is a great strategy to use a subset of training samples. Using a subset of samples is also helpful for debugging and hyperparameter search.

### 2.6. Train the network on a subset


### 2.6.a.
Write a script which takes 256 random samples from the train set (HW2_trainSet.csv); let's name this set HW2_randomTrainSet. Choose 64 random samples from the validation set (HW2_validationSet.csv); let's name this set HW2_randomValidationSet. Make sure these sample sets include data from each class.

### 2.6.b.

Use the random samples from 2.6.a. and write a script to train your network. Using the script, train your network with your choice of weight initialization strategy. If you need to define other hyperparameters, such as batch size, choose them empirically. Plot the average loss on your random sample set per epoch. (Stop the training after at most ~100 epochs.)

### 2.7. Analysis of training using a CNN model

Describe your findings. Can your network learn from 256 random samples? Does the CNN model have enough capacity to learn with your choice of empirical hyperparameters?
-  If yes, how will the average loss plot change if you multiply the learning rate by 10?
-  If no, how can you increase the model capacity? Increase your model capacity and train again until you find a model with enough capacity. If the capacity increase is not sufficient to learn, think about the empirical parameters you chose in designing your network and change some of them. Describe what changes you made to your original network and how you got this model to learn.

### 2.8. Hyperparameters
Now, we will revisit our selection of CNN model architecture, training parameters and so on, i.e. the hyperparameters. In your investigations, describe how you change each hyperparameter in light of the model performance obtained with the previous hyperparameters. Provide your rationale for choosing the next hyperparameter. Provide learning loss and accuracy curves, and model performance on HW2_randomValidationSet. You will use macro AUC as the performance metric for comparing CNN models on the disease classification task. Report macro AUC for each CNN model with different hyperparameters (check http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings).
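To make the metric concrete, macro AUC can be sketched in plain NumPy as the unweighted mean of one-vs-rest AUCs, where each binary AUC is the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. This is an illustrative stand-in, not the required implementation; scikit-learn's `roc_auc_score` with `multi_class="ovr"` and `average="macro"` is the library counterpart. The function names and the toy arrays are made up.

```python
import numpy as np


def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


def macro_auc(labels, probs):
    """Unweighted mean of one-vs-rest AUCs; probs has shape (n_samples, n_classes)."""
    aucs = [binary_auc((labels == c).astype(int), probs[:, c])
            for c in range(probs.shape[1])]
    return float(np.mean(aucs))


# Made-up labels and predicted probabilities for 3 classes.
labels = np.array([0, 1, 2, 1, 0, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
print(macro_auc(labels, probs))
```

The pairwise comparison uses memory proportional to the number of positive-negative pairs, so it is only meant for small sets such as HW2_randomValidationSet.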



### 2.8.a.
Investigate the effect of the learning rate on model performance.

### 2.8.b.
We chose SGD with momentum as the optimizer. Investigate the effect of at least two other optimizers on model performance.

### 2.8.c.
Investigate the effect of the dimension of the fully connected layer on model performance.

### 2.8.d.
Investigate the effect of the batch size on learning speed and model performance.

### 2.9. Train the network on the whole dataset

After question 2.7., you should have a network which has enough capacity to learn, and from question 2.8 you know which hyperparameters perform better on a subset of the train and validation sets. Train your network on the whole train set (HW2_trainSet.csv) and check the validation loss on the whole validation set (HW2_validationSet.csv) in each epoch. Plot the average loss and accuracy on the train and validation sets. Describe your findings. Do you see overfitting or underfitting to the train set? What else can you do to mitigate it?

### 2.10. Analysis  of the results
Use the validation loss to choose the model (let's call it the baseline model) that learns from the train data and generalizes well to the validation set. Using this model, plot the confusion matrix and ROC curve for your multi-class CNN disease classifier on the test set (HW2_testSet.csv). Report macro AUC for this CNN model as the performance metric.

### 2.11. Understanding the network
Using the best performing model (choose from the models developed in 2.10. and 2.12., in case you work on the latter), we will figure out where our network gathers information to decide the class of an image. One way of doing this is to occlude parts of the image and run it through your network. By changing the location of the occluded region, we can visualize the probability of the image being in one class as a 2-dimensional heat map. Using the best performing model, provide the heat maps for the images listed in HW2_visualize.csv. Do the heat map and the bounding box for pathologies provide similar information? Describe your findings.
Reference: https://arxiv.org/pdf/1311.2901.pdf
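The occlusion procedure described above can be sketched in NumPy as follows. It is a minimal sketch under assumptions: `predict_fn` is a hypothetical callable that wraps your trained network and returns a vector of class probabilities for one image, and the patch size, stride and fill value are arbitrary choices. A toy stand-in replaces the network so the snippet runs on its own.

```python
import numpy as np


def occlusion_heatmap(image, predict_fn, target_class, patch=16, stride=8, fill=0.0):
    """Slide an occluding patch over the image and record the target-class probability."""
    h, w = image.shape[-2:]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[..., top:top + patch, left:left + patch] = fill
            heat[i, j] = predict_fn(occluded)[target_class]
    return heat


def toy_predict(img):
    """Stand-in for a network: class-1 probability is the mean intensity of the centre."""
    p = float(img[..., 12:20, 12:20].mean())
    return np.array([1.0 - p, p])


demo = np.ones((32, 32))  # made-up image
heat = occlusion_heatmap(demo, toy_predict, target_class=1, patch=8, stride=8)
print(heat)
```

Low values in the heat map mark regions whose occlusion reduces the class probability the most, i.e. regions the model relies on. The map can be displayed with `plt.imshow(heat)`.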

### 2.12. Your CNN architecture design
Be creative and design your own CNN model. This model can be a variation of the baseline model using the information from the hyperparameter search, or it can be a totally new architecture. Use the knowledge you gained from the previous questions to design your network. For this reason, your network is expected to provide superior results. After you have trained your network on the whole train set, choose the best performing model using the loss on the whole validation set. Provide the confusion matrix, ROC curves and macro AUC for your best performing model using the whole test set. Explain your design criteria and why your performance is better compared to the baseline model. Some suggestions for architecture changes: convolution filter dimensions, dilated convolutions, a network without a fully connected layer, deeper networks, data augmentation ...

# Question 3 - Build Sequence Classifiers - Convolutional and Recurrent Neural Networks

This exercise aims to classify each <a href="">protein</a> (represented as <a href="https://en.wikipedia.org/wiki/Protein_primary_structure">a sequence of amino acids</a>), into protein families.  

Why is this an important task? Briefly, our DNA encodes the instructions for proteins, which are the molecular machines that make our cells work.

![Our DNAs encode the code for proteins, which are molecular machines that make the cells work](https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Genetic_code.svg/580px-Genetic_code.svg.png) | ![Sequence to Structure](http://www.robotics.tu-berlin.de/fileadmin/_processed_/1/1f/csm_compbio_seq2struct_1614a2532b.jpg)

Given the sequence of the amino acids, there is great scientific value in being able to predict its 3D structure, and predict whether the protein will or will not bind to other chemical molecules such as drugs or other proteins. 
The applications are numerous in disease understanding and treatment (e.g. <a href="https://en.wikipedia.org/wiki/Amyloid_beta">Alzheimer's disease is related to *beta-amyloid* proteins in our brain not folding correctly and creating plaques</a>).

In this homework, we will focus on a dataset which has more than 400,000 protein sequences and their classes. The data and related pre-processing scripts are available <a href="https://www.kaggle.com/abharg16/predicting-protein-classification/data">here</a> and <a href="https://www.kaggle.com/abharg16/predicting-protein-classification/notebook">here</a>, which are super awesome.


Here, we will focus on predicting the top few classes of proteins from the amino acid sequence of each protein.
The data is available in the cluster in /scratch/nsr3/protein/rcsb/, although you're also welcome to have your own local copy of the data and work with that. We need two files: pdb_data_seq.csv and pdb_data_no_dups.csv

### 3.1. Data Preprocessing

Most of the preprocessing is available in the kernel that came with the data. In particular, you can use the code below to pre-process your data.

How many data samples are available after the pre-processing? How many of the sequences are unique?

Select only the classes that have *more than 15,000 samples*. Only keep the rows that belong to one of these classes in your data. Which classes are there, and how many rows do you have after this filtering?

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Import Datasets
df_seq = pd.read_csv('/scratch/nsr3/protein/rcsb/pdb_data_seq.csv')
df_char = pd.read_csv('/scratch/nsr3/protein/rcsb/pdb_data_no_dups.csv')

# Filter for only proteins
protein_char = df_char[df_char.macromoleculeType == 'Protein']
protein_seq = df_seq[df_seq.macromoleculeType == 'Protein']

# Select only necessary variables to join
protein_char = protein_char[['structureId','classification']]
protein_seq = protein_seq[['structureId','sequence']]

model_f = protein_char.set_index('structureId').join(protein_seq.set_index('structureId'))
model_f = model_f.dropna()

### 3.2. More Data Preprocessing 

Write a function that takes a protein sequence *S* and converts it into a numpy array of size *25 x Len(S)* holding the *one-hot encoding of the sequence*.

You can use this list as all possible Amino Acid letters: **['H','V','G','A','P','C','D','I','R','E','K','L','W','T','Y','S','Q','F','N','M','U','X','Z','B','O']**

As an example, if S_0 is an 'H', the first column of the returned array has a 1.0 in row number 0 and 0.0 in every other row. If S_1 is a 'G', we put a 1.0 in row number 2 of the second column, and a 0.0 in every other row of that column. We continue for all letters in the sequence.
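The encoding described above can be sketched as follows, using the letter list given in this section. `one_hot_sequence` and `AA_INDEX` are hypothetical names, and raising a `KeyError` on letters outside the list is a design choice of this sketch rather than a requirement.

```python
import numpy as np

AMINO_ACIDS = ['H', 'V', 'G', 'A', 'P', 'C', 'D', 'I', 'R', 'E', 'K', 'L', 'W',
               'T', 'Y', 'S', 'Q', 'F', 'N', 'M', 'U', 'X', 'Z', 'B', 'O']
AA_INDEX = {aa: row for row, aa in enumerate(AMINO_ACIDS)}


def one_hot_sequence(seq):
    """Return a 25 x len(seq) array with a single 1.0 per column."""
    encoded = np.zeros((len(AMINO_ACIDS), len(seq)), dtype=np.float32)
    for col, aa in enumerate(seq):
        encoded[AA_INDEX[aa], col] = 1.0  # KeyError for letters not in the list
    return encoded


example = one_hot_sequence("HG")
print(example.shape)            # (25, 2)
print(example.argmax(axis=0))   # row index of the 1.0 in each column
```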

### 3.3. Train / Test / Validation Set

Convert your data into train, test and validation set. Shuffle the rows, and split them with ratios of (train:60%, valid:20%, test:20%). 

(Hint: it's useful to set the random number seed before shuffling, so you get the same results over multiple runs).
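A seeded shuffle-and-split can be sketched with NumPy as below. `split_indices` is a hypothetical helper that returns row indices rather than the data itself, and the seed value 42 is arbitrary.

```python
import numpy as np


def split_indices(n_rows, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Shuffle row indices with a fixed seed and cut them into train/valid/test."""
    rng = np.random.default_rng(seed)  # fixed seed gives the same shuffle every run
    order = rng.permutation(n_rows)
    n_train = int(round(ratios[0] * n_rows))
    n_valid = int(round(ratios[1] * n_rows))
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]  # whatever remains
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(10, seed=42)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With a pandas DataFrame, the index arrays can be passed to `df.iloc[...]` to select the corresponding rows.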

### 3.4. Data Engineering

Convert your training, validation and test sequences to one-hot numpy arrays. 
Doing so in advance will save you computation time later. Also since we will be training a classifier, convert your one-hot label variables into the index. i.e. if your label is [0, 1, 0] convert it into [1]. If it is [0, 0, 1], convert it into [2]. (Hint: Use *numpy argmax* method if needed for fast implementation).

Write a dataloader similar to what we covered in the lab session (https://github.com/nyumc-dl/BMSC-GA-4493-Spring2019/blob/master/lab8/lab8_solutions.ipynb) so that we can begin to train our networks!

### 3.5. Sequence classification model

First, build a Convolutional sequence classification model similar to the architecture in question 1, (deepbind paper). 

Use Convolution, negative log likelihood (NLL) loss, and (optional: any additions to your architecture!), to go from the one-hot sequence of size *25 x len(S)* to multi-class classifier. 

At each epoch, compute **Average NLL loss** and **one AUC score per class** on both **train and validation set** 

Plot your validation and train loss over different epochs, and also print the AUCs on train and validation sets.

### 3.6. CNN Model Analysis 

One benefit of convolutional sequence models is that they are easier to interpret later. 
Use matplotlib and plt.imshow(), to visualize the filters of the *first layer convolution* that you have: 

(Hint: as an example, if the model is named model and the first convolution layer is accessible via model.convnet1, the following code can give you those filters:
kernels = [k[0].data.numpy() for k in model.convnet1.weight])

**Note: It's ok if your model didn't converge at all. Just show the visualizations!**

**(Bonus 5 points):** Is there an equivalent of motif_plotter (i.e. line 31 in lab 8 https://github.com/nyumc-dl/BMSC-GA-4493-Spring2019/blob/master/lab8/lab8_solutions.ipynb) for Proteins and Amino Acids? Can you plot the convolution kernels using that library?

### 3.7. LSTM Model

Now, provide a second sequence classification model based on LSTMs. Build a simple LSTM model that takes as input the (25 x Len(S)) array and ends with a softmax over the total number of classes. (Hint: check your lab 7 session.)

The rest of your experimental setting should be the same as in section 3.5:

At each epoch, compute Average NLL loss and one AUC score per class on both train and validation set.

Plot your validation and train loss over different epochs, and also print the AUCs on train and validation sets.

### 3.8. Other Architectures

What are some other architectures that you could be using in future work? List a few and in a few sentences discuss why they might be a good fit for this task. 

### 3.9. Fine-tuning / Regularizations

What other fine-tuning, regularization, or similar techniques could you apply in future work to improve the scores?