# Deep Learning in Medicine
### BMSC-GA 4493, BMIN-GA 3007 
### Homework 1



**Note:** If you need to write mathematical terms, you can type your answers in a Markdown cell using LaTeX.

See <a href="https://stackoverflow.com/questions/13208286/how-to-write-latex-in-ipython-notebook">here</a> if you have issues. For basic LaTeX notation, see <a href="https://en.wikibooks.org/wiki/LaTeX/Mathematics">here</a>.
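As a small illustration of the note above (the formula is only a placeholder and is not the answer to any question), typing the following into a Markdown cell should render as a typeset equation:

```latex
$$\frac{d}{dx}\left(x^3\right) = 3x^2$$
```

Single dollar signs (`$...$`) give inline math, and double dollar signs give a displayed equation on its own line.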

**Submission instructions:** Upload and submit your final Jupyter notebook file at <a href='http://newclasses.nyu.edu'>newclasses.nyu.edu</a>.

**Submission deadline:** Tuesday Feb 13th 2018 (3:00 PM)

# Question 1: Take Derivatives!  (Total 25 points)
### Take derivatives of function f(x) with respect to x in questions 1.1 to 1.7. For 1.8, take partial derivatives of f(X, A) with respect to each $a_i$ and $x_i$. (3 points for 1.1 to 1.7 and 4 points for 1.8)


1.1) $f(x) = x^2 + 1$

1.2) $f(x) = \sin(x)+\tanh(x)$

1.3) $f(x) = \log_e(x)$

1.4) $f(x) = e^{2x + 5}$

1.5) $f(x) = \sum_{i=1}^{4}\log_e(a_i x^2 + b_i)$

($a_1$, $a_2$, $a_3$, $a_4$ and $b_1$, $b_2$, $b_3$, $b_4$ are constants)

1.6) $f(x) = \sqrt{x}$

1.7) $f(x) = \sqrt{\sum_{i=1}^{4}(a_i x)}$

($a_1$, $a_2$, $a_3$ and $a_4$ are constants)

1.8) Now consider $X$ to be a d-dimensional variable, i.e. $X=(x_1, x_2, ..., x_d)$. Consider $A = (a_1, a_2, ..., a_d)$ to also be a variable.

Compute the partial derivative of $f(X,A)$ with respect to each $x_i$ and each $a_i$:

$f(X, A) = \sum_{i=1}^{d}{\log_e(a_ix_i)}$
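One way to sanity-check a derivative you worked out by hand is to compare it against a central finite difference. The sketch below does this for an illustrative function, $g(x) = x^3$, which is deliberately not one of the questions; the helper name `central_difference` and the step size `h` are choices made for this example only.

```python
def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with the symmetric difference (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Illustrative function (not a homework question): g(x) = x^3, so g'(x) = 3 x^2.
g = lambda x: x ** 3
g_prime = lambda x: 3 * x ** 2

x0 = 2.0
numeric = central_difference(g, x0)
analytic = g_prime(x0)
print(numeric, analytic)  # the two values should agree to several decimal places
```

If the two numbers disagree noticeably, the hand derivation probably contains a mistake.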

# Question 2: Solving Linear Regression via Mean Squared Error (MSE) Optimization Problem (30 points)

Imagine that you have measured two variables X and Y for a simple task, and you believe that they might be linearly related to each other.
The measurements are as follows:

###### (Training data D = {($X_1$, $Y_1$), ($X_2$, $Y_2$), ($X_3$, $Y_3$)})

Data point 1: $X_1$ = 2, $Y_1$ = 5

Data point 2: $X_2$ = 4, $Y_2$ = 9

Data point 3: $X_3$ = 5, $Y_3$ = 11

If we assume that the relationship between X and Y is linear, we can write this relationship as:

$Y = f_{W,B}(X) = WX + B$

where $W$ and $B$ are the parameters of the model.
We are interested in finding the best values for W and B.
We define 'best' in terms of a loss function between $f_{W,B}(X_i)$ and $Y_i$ for each pair ($X_i$, $Y_i$) in the training data.
Since the $Y_i$ are real numbers, let's consider the Mean Squared Error loss.

Recall that the Mean Squared Error of this function over the training data, for given W and B, is:

$MSELoss(D=\{(X_1, Y_1), (X_2, Y_2), (X_3, Y_3)\}, W, B) = \frac{1}{3}\sum_{i=1}^{3} (f_{W,B}(X_i) - Y_i)^2$
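To make the formula concrete, here is a minimal sketch that evaluates the loss on the training data for one arbitrary choice of parameters. The function name `mse_loss` and the values W = 1, B = 0 are assumptions picked only for illustration.

```python
# Training data from the question: (X_i, Y_i) pairs.
D = [(2, 5), (4, 9), (5, 11)]

def mse_loss(data, w, b):
    """Mean of the squared errors (w * x + b - y)^2 over all (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# With W = 1, B = 0 the errors are -3, -5, -6, so the loss is (9 + 25 + 36) / 3.
print(mse_loss(D, 1.0, 0.0))  # 70/3, about 23.33
```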

### 2.1) (6 points) 
Compute the partial derivatives of $MSELoss(D, W, B)$ with respect to W and B.
Remember that $X_1$, $X_2$, $X_3$, $Y_1$, $Y_2$, and $Y_3$ are constants, and already given to us as training data above.

$\frac{d}{d W} MSELoss(D, W, B) = ?$

$\frac{d}{d B} MSELoss(D, W, B) = ?$

### 2.2) (3 points) 
Use the matplotlib library to plot $\frac{d}{d W} MSELoss(D, W, B)$ for W=range(10), when B equals 1.

### 2.3) (3 points) 
What values of W and B make both partial derivatives zero?
i.e. solve $\frac{d}{d W} MSELoss(D, W, B) = 0$ and $\frac{d}{d B} MSELoss(D, W, B) = 0$ simultaneously and find the unique solution.

### 2.4) (8 points) 
If you start from an initial point $W_0$ = 0.1 and $B_0$ = 0.1, and iteratively update your W and B via gradient descent as follows:
    

$ W_{t+1} = W_t - 0.01 * \frac{d}{d W} MSELoss(D, W, B) |_{W_t,B_t} $	
$ B_{t+1} = B_t - 0.01 * \frac{d}{d B} MSELoss(D, W, B) |_{W_t,B_t} $	
(Note: This is gradient descent with a 0.01 learning rate.)
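The update rule above can be sketched in a few lines. To avoid giving away the answer, this example minimizes an illustrative one-parameter objective, $h(w) = (w - 3)^2$, rather than the homework loss; the names `gradient_descent` and `h_grad` are hypothetical.

```python
def gradient_descent(grad, w0, learning_rate=0.01, num_steps=1000):
    """Return [w_0, w_1, ..., w_num_steps] from the update w <- w - lr * grad(w)."""
    ws = [w0]
    for _ in range(num_steps):
        ws.append(ws[-1] - learning_rate * grad(ws[-1]))
    return ws

# Illustrative objective (not the homework loss): h(w) = (w - 3)^2, so h'(w) = 2 (w - 3).
h_grad = lambda w: 2.0 * (w - 3.0)

ws = gradient_descent(h_grad, w0=0.0)
print(len(ws), ws[-1])  # 1001 values; the last one is very close to the minimizer 3
```

Storing every iterate in a list, as done here, is what makes it easy to plot the trajectory afterwards.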

What are the values of W and B over iterations 0 to 5000? (Don't compute by hand! Write code!)
Write a Python script that computes these values for 5000 iterations, i.e. the lists $\{W_0, W_1,.., W_{5000}\}$ and $\{B_0, B_1,.., B_{5000}\}$.
Plot the lists of W and B over the 5000 iterations here.


### 2.5) (10 points) 
Now that you have learned the math and written the code yourself, we will use PyTorch and automatic differentiation to find the optimal W and B!
Again, consider the data to be D = {($X_1$, $Y_1$), ($X_2$, $Y_2$), ($X_3$, $Y_3$)} = {(2,5), (4,9), (5, 11)}.

Some of the steps are given here. Fill in the rest and show plots of the loss function, W, and B over these 500 epochs. (3 plots)

In [None]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
from torch import optim

D = [(2,5), (4,9), (5, 11)]
X = [d[0] for d in D]
Y = [d[1] for d in D]
print(X, Y)

model = torch.nn.Linear(1, 1, bias=True)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss = torch.nn.MSELoss()

for epoch in range(500):
    for i in range(3):
        xinput = Variable(torch.from_numpy(np.array([X[i]]))).type(torch.FloatTensor)
        ytarget = Variable(torch.from_numpy(np.array([Y[i]]))).type(torch.FloatTensor)
        # don't forget to zero_grad your model. 
        # forward into your model and loss
        # do a backward step to compute gradients
        # make one step with the optimizer
        # keep track of the loss, W and b in some lists.

#Plot loss, W and b lists.


# Question 3: Solving Classification - Logistic Regression - via Negative Log Likelihood Optimization (30 points)

Now imagine that you have again measured two variables X and Y for a simple task, but your output $Y$ is only ever 0 or 1.
This is called classification. Our observations are as follows:

##### Training data D = {($X_1$, $Y_1$), ($X_2$, $Y_2$), ($X_3$, $Y_3$)}

Data point 1: $X_1$ = 2, $Y_1$ = 0	
Data point 2: $X_2$ = 4, $Y_2$ = 0	
Data point 3: $X_3$ = 5, $Y_3$ = 1	

How can we think of a function, f(X), which gives us binary predictions?	
Often, the solution is to model the probability of the label of X being equal to 1 (or 0).
In other words, we can try to model:

#### $P(Y=1|X) = f(X)$

Probabilities are numbers between 0 and 1, so people often apply the sigmoid function (<a href="https://en.wikipedia.org/wiki/Sigmoid_function">Read More</a>) to the output of a linear function, mapping an input X to the probability of its label Y being equal to 1.

$P_{W,B}(Y=1|X) = f_{W,B}(X) = \frac{1}{1+e^{-(WX+B)}}$

This is the basic formulation of the simplest classification model: Logistic Regression!
Note that this function, $\frac{1}{1+e^{-(WX+B)}}$, is also parametrized by W and B only.
As in Question 2, we can find the 'best' W and B by optimizing some loss function over the training data.
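As a rough sketch of the model $P_{W,B}(Y=1|X)$ defined above, the following evaluates the sigmoid of a linear function in plain Python. The function names and the branch on the sign of `z` (used only to avoid overflow in `math.exp` for very negative inputs) are choices made for this example.

```python
import math

def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)

def predict_proba(x, w, b):
    """P(Y=1 | X=x) under the logistic model with parameters w and b."""
    return sigmoid(w * x + b)

print(predict_proba(2.0, 0.0, 0.0))  # 0.5, because sigmoid(0) = 1/2
```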

In classification tasks, the common loss function to use is the negative log likelihood,
which is simply the negative of the sum of the log probabilities of the observed samples taking their correct labels, i.e.

$NLL\_Loss(D,W,B) = -\log(P_{W,B}(Y=0|X_1)) - \log(P_{W,B}(Y=0|X_2)) - \log(P_{W,B}(Y=1|X_3))$

By expanding $P_{W,B}(Y=1|X) = \frac{1}{1+e^{-(WX+B)}}$ and $P_{W,B}(Y=0|X) = 1- \frac{1}{1+e^{-(WX+B)}}$, we can use the chain rule and backpropagation to compute the derivative of $NLL\_Loss(D,W,B)$ with respect to W and B, and find the 'best' W and B for each given dataset.
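To see what the loss looks like numerically, here is a minimal sketch that evaluates $NLL\_Loss(D,W,B)$ on the training data at a single point, W = 0 and B = 0. It computes only the loss value, not its derivatives; the function names are assumptions for this example, and the simple `sigmoid` here is fine for moderate inputs but is not hardened against overflow.

```python
import math

# Training data from the question: (X_i, Y_i) pairs with binary labels.
D = [(2, 0), (4, 0), (5, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_loss(data, w, b):
    """Negative sum of log-probabilities assigned to the observed labels."""
    total = 0.0
    for x, y in data:
        p1 = sigmoid(w * x + b)                  # P(Y=1 | X=x)
        p_observed = p1 if y == 1 else 1.0 - p1  # probability of the label actually seen
        total -= math.log(p_observed)
    return total

print(nll_loss(D, 0.0, 0.0))  # 3 * ln(2), about 2.079, since every probability is 0.5
```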

### 3.1) (6 points)
What are $\frac{d}{d W} NLL\_Loss(D,W,B)$, and $\frac{d}{d B} NLL\_Loss(D,W,B)$ ?

### 3.2) (8 points)
If you start from an initial point $W_0$ = 0 and $B_0$ = 0, and iteratively update your W and B via gradient descent as follows:
    
$ W_{t+1} = W_t - 0.01 *  \frac{d}{d W} NLL\_Loss(D,W,B) |_{W_t,B_t} $	
$ B_{t+1} = B_t - 0.01 * \frac{d}{d B} NLL\_Loss(D,W,B) |_{W_t,B_t} $

What are the values of W and B over iterations 0 to 500? (Don't compute by hand!)
Write a script that computes these values for 500 iterations, and plot the lists $\{W_0, W_1,.., W_{500}\}$ and $\{B_0, B_1,.., B_{500}\}$ via matplotlib here.

### 3.3) (10 points) 
Use PyTorch to implement Logistic Regression! We have written the first part; you fill in the rest. Plot W, B, and the value of the loss function over these 500 iterations. (3 plots)


In [None]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
from torch import optim

D = [(2,0), (4,0), (5, 1)]
X = [d[0] for d in D]
Y = [d[1] for d in D]
print(X, Y)

model = #?
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss = #?

for epoch in range(500):
    for i in range(3):
        xinput = Variable(torch.from_numpy(np.array([X[i]]))).type(torch.FloatTensor)
        ytarget = Variable(torch.from_numpy(np.array([Y[i]]))).type(torch.FloatTensor)
        # don't forget to zero_grad your model. 
        # forward input into your model and loss
        # do a backward step to compute gradients
        # make one step with the optimizer
        # keep track of the loss, W and b in some lists.

#Plot loss, W and b lists.


# Question 4: Learning Curves, Overfitting, and Machine Learning! 
# (34 points +10 Bonus points)

Now that we know how to optimize, let's get some real machine learning done!

Instead of the small dataset we had in questions 2 and 3, let's now use the CBIS-DDSM (Curated Breast Imaging Subset of DDSM) dataset from <a href="https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM#385f2cd4e86f4142b1d32bdb5803bd96"> here</a>. (Click on the 'Detailed Description' tab at the bottom of the page.)


In this homework, we will *only* focus on the following items in the dataset:	
Mass-Training-Description (csv)	
Mass-Test-Description (csv)	
(Don't download the images on your laptop! That file is too big and we deal with it on the cluster later!)

This dataset contains several features related to Mammography and detection of breast cancer. 

The Mass-Training-Description and Mass-Test-Description include these columns:

patient_id	
breast_density	
left or right breast	
image view		
abnormality id		
abnormality type	
mass shape	
mass margins	
assessment	
pathology

There is more data in this dataset, including images, but for this homework we will not focus on them.

We are interested in this question:	
Using variables:	

breast_density	
left or right breast	
image view		
abnormality id		
abnormality type	
mass shape	
mass margins	

How well can we predict the **pathology type**?

We can answer that by training a model on the Mass-Training-Description, and evaluating it on Mass-Test-Description. 
See questions 4.1 and 4.2



### 4.1) (10 points)
Write a script to convert the data from variables [breast_density, left or right breast, image view, abnormality id,
abnormality type, mass shape, mass margins] into input and [pathology type] into output.

The output of your script should be a matrix X and a vector Y,
where each row of X is one set of variables for a patient
and each row of Y is the pathology type class for that patient.

Use *matplotlib.pyplot.imshow(X, aspect='auto')* to visualize X.
(If there are multiple equivalent rows per patient, keep only one of them. Which one is up to you.)
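Most of these variables are categorical, so they need a numeric encoding before they can go into a matrix. The sketch below shows one common option, one-hot encoding, on a few made-up rows; the rows, their values, and the helper `one_hot_columns` are illustrative assumptions and are not taken from the real CSV files. With pandas, `pandas.get_dummies` offers a similar transformation.

```python
# Toy rows standing in for the CSV; the values are illustrative only.
rows = [
    {"mass shape": "OVAL", "breast_density": 2, "pathology": "BENIGN"},
    {"mass shape": "IRREGULAR", "breast_density": 3, "pathology": "MALIGNANT"},
    {"mass shape": "OVAL", "breast_density": 1, "pathology": "MALIGNANT"},
]

def one_hot_columns(rows, column):
    """Return the sorted categories of `column` and one 0/1 vector per row."""
    categories = sorted({row[column] for row in rows})
    vectors = [[1 if row[column] == c else 0 for c in categories] for row in rows]
    return categories, vectors

shape_names, shape_onehot = one_hot_columns(rows, "mass shape")
label_names = sorted({row["pathology"] for row in rows})

# X: the numeric column followed by the one-hot columns; Y: integer class index.
X = [[row["breast_density"]] + onehot for row, onehot in zip(rows, shape_onehot)]
Y = [label_names.index(row["pathology"]) for row in rows]
print(shape_names, X, Y)
```

Sorting the categories keeps the column order deterministic, which matters when the same encoding has to be applied to a second file.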


### 4.2 (4 points)
Repeat Question 4.1 for the test set.

### 4.2 (10 points)
Write your training script for a multi-layer perceptron classifier with CrossEntropy loss.
Plot the ***average loss on all the train samples*** per epoch. (Stop the training after 100 epochs. You are welcome to train for more than 100 epochs, however. Up to you.)
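For reference, the quantity to plot can be sketched without any framework: the softmax cross-entropy of each sample, averaged over the samples. The logits and targets below are made-up numbers, and `cross_entropy` is a hypothetical helper; in PyTorch, `torch.nn.CrossEntropyLoss` with its default mean reduction computes the same average from raw logits and integer class labels.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# Made-up model outputs (one list of class scores per sample) and true class indices.
batch_logits = [[0.0, 0.0], [2.0, 0.0]]
batch_targets = [0, 1]

losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
average_loss = sum(losses) / len(losses)
print(losses, average_loss)
```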



### 4.2 (10 points)
Add a test-set evaluation of the loss to your training script from the previous question, and plot the ***average loss on all the test samples*** per epoch. (Stop the training after 100 epochs. You are welcome to train for more than 100 epochs. Up to you.)

### 4.3 (5 points)
Change some of the hyper-parameters of your training - number of hidden nodes or layers - and plot the train and test loss per epoch.

Also repeat your experiments with and without **normalization** of columns of test and train set.

Note: Usually, you should normalize only the non-binary columns.
You can use something like:

In [65]:
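import numpy as np
def normalize_nonbinary_columns(x):
    # Standardize every non-binary column to zero mean and unit variance.
    # Work on a float copy so integer inputs are not truncated on assignment.
    x = np.array(x, dtype=float)
    for ix in range(x.shape[1]):
        col = x[:, ix]
        is_binary = col.min() == 0 and col.max() == 1
        # Skip binary columns and constant columns (std == 0 would divide by zero).
        if not is_binary and col.std() != 0:
            print('non-binary column!', ix)
            x[:, ix] = (col - col.mean()) / col.std()
    return x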

Describe the best final model that you found. What was its configuration?
What is the best CrossEntropy loss on the validation (test) set that you found?

### 4.4 (5 points)

Add AUC computation to your evaluation at each epoch for the test and train sets. What is the best area under the ROC curve for prediction of the MALIGNANT class on the validation (test) set that you found?
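As a reminder of what the number means, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition on made-up scores; `roc_auc` is a hypothetical helper, and its quadratic pairwise loop is only meant for small inputs. A library routine such as `sklearn.metrics.roc_auc_score` is the usual choice in practice.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Each pair contributes 1 if the positive scores higher, 0.5 on a tie, else 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities for the positive class and the true labels.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(roc_auc(scores, labels))  # 0.75: three of the four pairs are ordered correctly
```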

### 4.3) (Bonus up to Max 10 points)
Be creative and think about other interesting machine learning tasks that could be done with this dataset.
Each interesting idea earns +2 bonus points, up to a maximum of 10 points.
