# Deep Learning in Medicine
### BMSC-GA 4493, BMIN-GA 3007 
### Spring 2020
### Homework 1

**Learning Objectives**:

1. Basic Math Revision.
2. Introduction to Machine Learning.
3. Logistic Regression Model.
4. Multi-layer Perceptron Model.
5. Intro to Convolutional Neural Network Models.

**Instructions**

1. If you need to write mathematical terms, you can type your answers in a Markdown cell via LaTeX. See <a href="https://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook">here</a> if you have issues with writing equations. For basic LaTeX notation, see <a href="https://en.wikibooks.org/wiki/LaTeX/Mathematics">here</a>.

2. Upload and submit your final Jupyter notebook file at <a href='http://newclasses.nyu.edu'>newclasses.nyu.edu</a>.

3. Deadline: Thursday Feb 20th 2020 (3pm)

4. Questions and Clarification: <a href="https://piazza.com/nyumc.org/spring2020/bmscga4493andbminga3007/home"> Class Piazza</a>

---
# Question 1: Solving Linear Regression via Mean Squared Error (MSE) Optimization Problem (10 points)

Imagine that you have measured two variables X and Y for a simple task, and you believe that they might be linearly related to each other. Here, our input X has 2 dimensions, and the output Y has 1 dimension. We will use a superscript to indicate which sample it is, and a subscript to indicate which dimension it is.
The measurements are as follows:

###### (Training data D = {($X^1$, $Y^1$), ($X^2$, $Y^2$), ($X^3$, $Y^3$)})

Sample 1: $X^1 = (x_1^1, x_2^1) = (1,1)$,   $Y^1$ = 6

Sample 2: $X^2 = (x_1^2, x_2^2) = (2,3)$,   $Y^2$ = 11

Sample 3: $X^3 = (x_1^3, x_2^3) = (-1,0)$,   $Y^3$ = 2



If we assume that the relationship between X and Y is linear, we can write this relationship as:

$Y = f_{W,B}(X) = WX + B = w_1*x_1 + w_2*x_2 + B$

where $W = (w_1, w_2)$ and $B$ are the parameters of the model.	
We are interested in finding the best values for W and B.
We define 'best' in terms of a loss function between $f_{W,B}(X^i)$ and $Y^i$ for each pair ($X^i$, $Y^i$) in the training data.
Since the $Y^i$s are real numbers, let's consider the Mean Squared Error loss.

Remember that Mean Squared Error for this function, over training data, and W and B is:

$MSELoss(D=\{(X^1, Y^1), (X^2, Y^2), (X^3, Y^3)\}, W, B) = \frac{1}{3}\sum_{i=1}^{3} (f_{W,B}(X^i) - Y^i)^2$
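As a quick check of the definition, the loss can be evaluated numerically. The sketch below assumes NumPy; the helper name `mse_loss` and the parameter values ($w_1 = w_2 = B = 0.1$) are illustrative choices, not part of the assignment.

```python
import numpy as np

# Training data D from above: one row per sample, one column per input dimension.
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, 0.0]])
Y = np.array([6.0, 11.0, 2.0])

def mse_loss(X, Y, W, B):
    """Mean squared error of the linear model f(X) = W.X + B over the data."""
    predictions = X @ W + B
    return np.mean((predictions - Y) ** 2)

# Illustrative parameter values (an assumption, chosen arbitrarily).
W = np.array([0.1, 0.1])
B = 0.1
loss_value = mse_loss(X, Y, W, B)
print(loss_value)
```

Evaluating the loss at a few different parameter values is a simple way to sanity-check hand-derived gradients later on.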

### 1.1.
Compute the partial derivatives of $MSELoss(D, W, B)$ with respect to $w_1$, $w_2$, and $B$.
Remember that $X^1$, $X^2$, $X^3$, $Y^1$, $Y^2$, and $Y^3$ are constants, already given to us as training data above.

$\frac{d}{d w_1} MSELoss(D, W, B) = ?$

$\frac{d}{d w_2} MSELoss(D, W, B) = ?$

$\frac{d}{d B} MSELoss(D, W, B) = ?$

### 1.2.
Use the matplotlib library to plot $\frac{d}{d w_1} MSELoss(D, W, B)$ for `w1 = np.arange(0, 2, 0.1)`, when $w_2$ equals 2 and $B$ equals 3.

```python
import matplotlib.pyplot as plt
import numpy as np
w1 = np.arange(0, 2, 0.1)
# plot dMSELoss/dw1 here:
```



### 1.3.
What values of $w_1$, $w_2$ and $B$ make all partial derivatives zero?

### 1.4.
If you start from an initial point $w_1^0 = 0.1$ , $w_2^0 = 0.1$ and $B^0 = 0.1$, and iteratively update your $w_1$, $w_2$, and B via gradient descent as follows:
    
$ w_1^{t+1} = w_1^t - 0.01 * \frac{d}{d w_1} MSELoss(D, W, B) |_{w_1^t,w_2^t,B^t} $	
$ w_2^{t+1} = w_2^t - 0.01 * \frac{d}{d w_2} MSELoss(D, W, B) |_{w_1^t,w_2^t,B^t} $	
$ B^{t+1} = B^t - 0.01 * \frac{d}{d B} MSELoss(D, W, B) |_{w_1^t,w_2^t,B^t} $	
(Note: This is gradient descent with a 0.01 learning rate.)

What are the values of $w_1$, $w_2$, and $B$ over iterations 0 to 50? (Don't compute them by hand; write code!)
Write a Python script that computes these values for 50 iterations, i.e. the lists $\{w_1^0, w_1^1, \dots, w_1^{50}\}$, $\{w_2^0, w_2^1, \dots, w_2^{50}\}$, and $\{B^0, B^1, \dots, B^{50}\}$.
Plot the lists of $w_1$s, $w_2$s and $B$s over the 50 iterations.
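The structure of such a script can be sketched as follows. This is only an illustration of the update rule: it approximates the gradient with central finite differences instead of the analytic derivatives asked for in 1.1, and the names `mse_loss`, `numerical_grad`, and `gradient_descent` are hypothetical.

```python
import numpy as np

# Training data from Question 1.
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, 0.0]])
Y = np.array([6.0, 11.0, 2.0])

def mse_loss(params):
    """MSE of the linear model; params = (w1, w2, B)."""
    w, b = params[:2], params[2]
    return np.mean((X @ w + b - Y) ** 2)

def numerical_grad(f, params, eps=1e-6):
    """Central-difference approximation of the gradient of f at params."""
    grad = np.zeros_like(params)
    for k in range(len(params)):
        step = np.zeros_like(params)
        step[k] = eps
        grad[k] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

def gradient_descent(f, start, lr=0.01, n_iters=50):
    """Apply the update rule n_iters times and record every iterate."""
    params = np.array(start, dtype=float)
    history = [params.copy()]
    for _ in range(n_iters):
        params = params - lr * numerical_grad(f, params)
        history.append(params.copy())
    return np.array(history)

# Row t of history holds (w1, w2, B) at iteration t, for t = 0..50.
history = gradient_descent(mse_loss, start=[0.1, 0.1, 0.1])
```

Each column of `history` can then be passed to `plt.plot` to draw the trajectory of one parameter. Replacing `numerical_grad` with the analytic derivatives should give nearly identical trajectories, which makes this a convenient cross-check.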



### 1.5.
Now that you have learned the math and written the code yourself, we will use PyTorch and automatic differentiation to find the optimal W and B!
Again, consider the data to be D = {($X^1$, $Y^1$), ($X^2$, $Y^2$), ($X^3$, $Y^3$)} = {((1,1), 6), ((2,3), 11), ((-1,0), 2)}.

Some of the steps are given here. Fill in the rest and show a plot of the loss function, $w_1$, $w_2$ and B over these 10 epochs. (4 plots total)

```python
import torch
import torch.nn as nn
import numpy as np
from torch import optim

D = [((1,1), 6), ((2,3),11), ((-1,0),2)]
X = [d[0] for d in D]
Y = [d[1] for d in D]
print('data X is:', X)
print('data Y is:', Y)

model = torch.nn.Linear(2, 1, bias=True)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss = torch.nn.MSELoss()

losslist = []
w1list = []
w2list = []
blist = []

# for epoch in range(10):
    # Shuffle your training data samples
    # Loop over your training data in the new order:
        # don't forget to: optimizer.zero_grad()
        # prepare your x_input and y_target if needed
        # send the data through your model: i.e. pred_i = model(x_input)
        # send the prediction through the loss function too: i.e. lossout = loss(pred_i, y_target)
        # call backward to back-propagate: i.e. lossout.backward()
        # call optimizer.step() to update the model parameters based on the computed gradients
        # keep the w1s, w2s, bs, and loss values in lists so you can plot them later

# plot the losslist, w1s, w2s, and bs.
```


---
# Question 2: Learning Curves, Overfitting, and Machine Learning!


Now that we know how to optimize, let's get some real machine learning done!

We will use the Diabetes Dataset which can be downloaded from [here](https://drive.google.com/drive/folders/1nuZg4pMFvOZHCHxtU5gBEnHq9YJBdZX_).

In this dataset, we are trying to predict whether a person doesn't have diabetes (0), has diabetes in Stage 1, or has diabetes in Stage 2. The output labels can be found in Output.csv. All the other files will be used as input. From these files, the columns that we are interested in and what they stand for are as follows:
                          
                          SEQN : ID
                          
                          RIAGENDR : Gender
                          
                          DMDYRSUS : Years in US
                          
                          INDFMPIR : Family income
                          
                          LBXGH : GlycoHemoglobin
                          
                          BMXARMC : Arm Circum
                          
                          BMDAVSAD : Sagittal Abdominal
                          
                          MGDCGSZ : Grip Strength
                          
                          DRABF : Breast fed

We will use the first 6000 samples for training and the rest for testing.

Solve questions 2.1 to 2.7 with this information.

### 2.1. Data Analysis

Read the input CSV files. Rename the variables with the meaningful names shown above.<br>
Print all the desired variables in the training set with their count, mean, standard deviation and range. 

__Hint__ : If you are using pandas, it might save some time to look for a built-in function that returns all these!
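As a rough illustration of the renaming and summary steps, the sketch below uses a tiny made-up frame in place of the real CSV files; the values are invented, and only the column codes come from the list above.

```python
import pandas as pd

# Toy stand-in for one input CSV; the values are made up for illustration.
raw = pd.DataFrame({"SEQN": [1, 2, 3], "LBXGH": [5.1, 6.4, None]})

# Map the dataset's column codes to the readable names listed above.
renamed = raw.rename(columns={"SEQN": "ID", "LBXGH": "GlycoHemoglobin"})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = renamed.describe()
print(summary)
```

Note that the count excludes missing values, and the range can be read from the `min` and `max` rows.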

### 2.2. Fill in missing values

Fill in rows for both the train and test sets where the values are missing with the __fillna__ method. Use the following criteria:

1. Missing Years in US - 0
2. Missing GlycoHemoglobin - median
3. Missing Sagittal Abdominal - median
4. Missing Arm Circum - median
5. Missing Grip Strength - median
6. Missing Family Income - forward fill
7. Missing Breast Fed - 1
8. Missing Gender - 2 

The median for each column must be calculated on the training set only.<br>
Now create a dataframe with only the desired variables for both train and test sets. Print the training set mean, standard deviation and range again.
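The three kinds of fill used above (constant, training-set median, forward fill) might look like this on toy data. The frames and their values are made up, and the sketch calls `ffill()` directly because recent pandas versions deprecate the `method` argument of `fillna`.

```python
import pandas as pd

# Toy train/test frames; the values are made up for illustration.
train = pd.DataFrame({
    "GlycoHemoglobin": [5.0, None, 7.0],
    "Years in US": [None, 3.0, 4.0],
    "Family income": [1.5, None, 2.5],
})
test = pd.DataFrame({
    "GlycoHemoglobin": [None, 5.5],
    "Years in US": [2.0, None],
    "Family income": [3.0, None],
})

# The median comes from the training set only, then is reused for the test set.
train_median = train["GlycoHemoglobin"].median()

for frame in (train, test):
    frame["GlycoHemoglobin"] = frame["GlycoHemoglobin"].fillna(train_median)
    frame["Years in US"] = frame["Years in US"].fillna(0)
    # Forward fill: copy the last seen value downward.
    frame["Family income"] = frame["Family income"].ffill()
```

Forward fill leaves a value missing when the very first row is empty, so it is worth checking for remaining NaNs afterwards.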

### 2.3. The DataLoader

Write a dataloader class in PyTorch for our dataset.

If you need help in writing a dataloader class, read more about it __[here](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class)__.
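For orientation, a map-style dataset needs only `__len__` and `__getitem__`. The sketch below is a generic example on made-up tabular data; the class name `TabularDataset` and the toy features are assumptions, not the expected solution.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    """Wraps feature rows and integer class labels as tensors."""

    def __init__(self, features, labels):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Made-up data: 5 samples with 2 features each, 3 possible classes.
features = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
labels = [0, 1, 2, 0, 1]

dataset = TabularDataset(features, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Collect the shape of every batch; the last batch is smaller than the rest.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
```

Labels are stored as `torch.long` because PyTorch's cross-entropy loss expects integer class indices.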

### 2.4. The Network.

Write a Multilayer Perceptron class with 2 hidden layers using PyTorch. Use ReLU as the non-linearity. The hidden layer sizes are 1000 and 300.

__Bonus__: Write a function to initialize the layer weights.

### 2.5. The Training phase

Write a function to train your model and the rest of the script required to run this.<br>
Plot the average loss per epoch for 100 epochs. Report the loss values at epoch 100.<br>
Parameters to be used are as follows:<br>

Optimizer : Stochastic Gradient Descent<br>
Learning Rate : 1e-4<br>
Loss : Cross Entropy<br>
Batchsize : 200<br>
Shuffle : True
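One detail that often causes confusion is the input format of the cross-entropy loss. The minimal sketch below, on made-up tensors, shows that `nn.CrossEntropyLoss` takes raw scores (logits) of shape `(batch, classes)` and integer class targets of shape `(batch,)`; it applies the softmax internally, so the model should not apply one before the loss.

```python
import math

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Made-up batch: 4 samples, 3 classes, all logits equal (a uniform prediction).
logits = torch.zeros(4, 3)
targets = torch.tensor([0, 1, 2, 0])

loss = criterion(logits, targets)

# For a uniform prediction over 3 classes the loss equals ln(3).
uniform_reference = math.log(3)
```

A freshly initialized 3-class model usually starts near this value, which is a useful sanity check on the first epoch.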

### 2.6. The Testing Phase

Write a function to test your trained model.<br>Report the average test loss and the testing accuracy.<br>Also print the confusion matrix.
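Accuracy and the confusion matrix can be computed from arrays of true and predicted labels. The sketch below assumes scikit-learn is available and uses made-up labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted class labels for 5 samples.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

# Fraction of samples whose predicted label matches the true label.
accuracy = np.mean(y_true == y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(accuracy)
print(cm)
```

Passing `labels` explicitly keeps the matrix 3x3 even if some class never appears in a particular test batch.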

### 2.7. Model Performance
Now calculate the AUC of your trained model on the test set.<br>
**(Bonus)** Use the sklearn Multiclass Logistic Regression model and fit it to our dataset. Calculate the AUC of this model. Compare the two AUCs.
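For more than two classes, AUC needs an averaging scheme. One option, shown here as a sketch on made-up probabilities, is scikit-learn's `roc_auc_score` with `multi_class="ovr"` (one-vs-rest); it expects one row of class probabilities per sample, each row summing to 1, so model outputs would first go through a softmax.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and class probabilities (each row sums to 1).
y_true = np.array([0, 1, 2, 0, 1, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# One-vs-rest AUC, macro-averaged over the three classes by default.
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")
print(auc)
```

The toy scores are chosen so that every sample gets its highest probability for the true class; real model outputs will generally give a lower value.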

# Question 3: CNN Network design for disease classification

Disease classification is a common problem in medicine and there are many ways to solve this problem. This question aims to show you one technique to tackle this classification task.

Assume that we have 10K images in a dataset of x-rays. Each image has dimensions 128x128 and a label that defines which class the image belongs to (let's assume we have 10 disease classes in total).

You will describe your approach to classifying the disease with each of the techniques below. Make sure you do not forget the bias term. You can either describe your proposed network explicitly or provide the PyTorch code that defines the network for questions 3.1.a, 3.2.a, and 3.3.a.
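If you choose to provide PyTorch code, a small helper such as the following can be used to check parameter counts, bias terms included. The name `count_parameters` and the toy layer sizes are hypothetical and unrelated to the sizes in this question.

```python
import torch.nn as nn

def count_parameters(model):
    """Total number of trainable parameters (weights and biases)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy layer: 4 inputs, 3 outputs, so 4 * 3 weights plus 3 biases.
toy_layer = nn.Linear(4, 3, bias=True)
n_params = count_parameters(toy_layer)
print(n_params)
```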


### 3.1. Logistic Regression

### 3.1.a.

Design a multi-class logistic regression model which takes an image as input (by reshaping it to a vector: let's call this a vectorized image) and outputs the probabilities of the 10 disease classes.

### 3.1.b.
What are the sizes for your input and output?

### 3.1.c.
What type of activation function will you use, and why?

### 3.1.d.
How many parameters do you need to fit for your design?

### 3.2. Multi-layer Perceptron

### 3.2.a.
Design a multi-layer perceptron (MLP) with one hidden layer, which first maps the vectorized images to a vector of size 128 and then feeds this vector to a fully connected layer to get the probabilities of the 10 disease classes.

### 3.2.b.

Clearly state the sizes of your input and output at each layer until you get the final output vector with 10 probabilities.

### 3.2.c. 
Define two types of activation functions you can use in the first layer. Which activation function will you use on the second fully connected layer?

### 3.2.d.
How many parameters do you need to fit for your design? How does adding another hidden layer affect the number of parameters?

### 3.3. Convolutional Neural Network (CNN)

### 3.3.a.

Design a one-layer convolutional neural network which first maps the images to a vector of size 128 (with the help of convolution and pooling operations) and then feeds this vector to a fully connected layer to get the probabilities of the 10 disease classes.

### 3.3.b.
Clearly state the sizes of your input, kernel, pooling, and output at each step until you get the final output vector with 10 probabilities.
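When tracking sizes through convolution and pooling, the standard relation for one spatial dimension (without dilation) is $\lfloor (n + 2p - k)/s \rfloor + 1$. The helper below is a sketch of that formula; the function name and the example kernel, stride, and padding values are arbitrary assumptions, not a recommended design.

```python
def output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution or pooling step."""
    return (n + 2 * padding - kernel) // stride + 1

# Arbitrary example: a 3x3 convolution (stride 1, no padding) on a 128-pixel side,
# followed by 2x2 pooling with stride 2.
after_conv = output_size(128, kernel=3)
after_pool = output_size(after_conv, kernel=2, stride=2)
print(after_conv, after_pool)
```

The same function applies to height and width independently, so non-square inputs can be handled by calling it once per dimension.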

### 3.3.c.

How many parameters do you need to fit for your design?

### 3.3.d.

Increase your selected convolution kernel size by 2 in each direction. Describe the effect of using small vs large filter size during convolution. 

### 3.3.e.

Multiply your selected stride size for the convolution and pooling operations by 2. Describe the effect of this change in design criteria in terms of memory requirements, number of parameters to fit, and number of operations.

### 3.3.f.

Assume we trained the designed network and we want to classify the disease from an image of size 256x192, using your designed network for inference. Describe whether your designed CNN is capable of accepting this image without any preprocessing. If we cannot use your network with this image, please propose changes to your network that will enable it to accept images of various shapes.