# P11: Introduction to ML

In this problem sheet, we revisit photometric redshift estimation and train a simple neural network (NN) to estimate galaxy redshifts from $\mathrm{mag}_u, \mathrm{mag}_g, \mathrm{mag}_r, \mathrm{mag}_i, \mathrm{mag}_z$ photometry. We will use the spectroscopic SDSS galaxy sample as an example.

For this, we will need `astroML` and `pytorch` (if you are more familiar with another deep learning package, you can of course use that instead).

## Problem 1: Get SDSS data using `astroML`

The python package `astroML` contains a number of useful data sets. In particular, it contains a sample of SDSS galaxies for which both photometric and spectroscopic measurements have been obtained. For all galaxies in this sample, we have access to photometrically-determined magnitudes $\mathrm{mag}_u, \mathrm{mag}_g, \mathrm{mag}_r, \mathrm{mag}_i, \mathrm{mag}_z$ and spectroscopically-determined redshift $z$. We will use this sample to train a neural network to predict redshift $z$ from 5 input magnitudes $\mathrm{mag}_u, \mathrm{mag}_g, \mathrm{mag}_r, \mathrm{mag}_i, \mathrm{mag}_z$, treating the spectroscopically determined redshift as ground truth.

(i) Fetch the matched SDSS sample from `astroML.datasets` making use of the routine `fetch_sdss_specgals()`.

(ii) Inspect the data set, retrieve the magnitudes $\mathrm{mag}_u, \mathrm{mag}_g, \mathrm{mag}_r, \mathrm{mag}_i, \mathrm{mag}_z$ and redshift $z$.

(iii) Plot the redshift distribution of the SDSS galaxies.

(iv) Plot the color-magnitude diagram $\mathrm{mag}_r$ vs. $\mathrm{mag}_u-\mathrm{mag}_r$ of the galaxies. What do you notice?
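Steps (i) to (iv) could be sketched as follows. This is a minimal sketch, not a reference solution: the helper names `load_sdss`, `extract_mags_and_z`, and `plot_sample` are hypothetical, and the field names `modelMag_u`, ..., `modelMag_z` and `z` are assumed to be the relevant columns of the record array returned by `fetch_sdss_specgals()`; check `data.dtype.names` to confirm. The `astroML` and `matplotlib` imports are done inside the functions so that the extraction helper can be tried without them.

```python
import numpy as np

BANDS = ("u", "g", "r", "i", "z")

def load_sdss():
    """Download (or read from the local cache) the SDSS spectroscopic galaxy sample."""
    from astroML.datasets import fetch_sdss_specgals  # lazy import
    return fetch_sdss_specgals()

def extract_mags_and_z(data):
    """Return an (N, 5) array of magnitudes and an (N,) array of redshifts.

    Assumes the fields 'modelMag_u', ..., 'modelMag_z' and 'z' exist.
    """
    mags = np.column_stack([data[f"modelMag_{b}"] for b in BANDS]).astype(np.float32)
    z = np.asarray(data["z"], dtype=np.float32)
    return mags, z

def plot_sample(mags, z):
    """Plot the redshift histogram and the mag_r vs. (mag_u - mag_r) diagram."""
    import matplotlib.pyplot as plt  # lazy import

    fig, (ax_z, ax_cmd) = plt.subplots(1, 2, figsize=(10, 4))

    ax_z.hist(z, bins=100, histtype="step")
    ax_z.set_xlabel("redshift $z$")
    ax_z.set_ylabel("number of galaxies")

    u_minus_r = mags[:, 0] - mags[:, 2]
    ax_cmd.scatter(u_minus_r, mags[:, 2], s=1, alpha=0.1)
    ax_cmd.set_xlabel(r"$\mathrm{mag}_u - \mathrm{mag}_r$")
    ax_cmd.set_ylabel(r"$\mathrm{mag}_r$")
    ax_cmd.invert_yaxis()  # brighter galaxies (smaller magnitudes) at the top
    return fig
```

A typical call would be `mags, z = extract_mags_and_z(load_sdss())` followed by `plot_sample(mags, z)`.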

## Problem 2: Build NN with `pytorch`

Use `pytorch` to build a neural network consisting of an input layer (the data itself), a hidden linear layer, and a linear output layer. As activation function, you can use ReLU. Following what we discussed in class, it is simplest if you do not apply the activation function to the output layer.

Below you can find a code snippet that defines your NN. The initialization step defines the pieces needed for your network and the `forward` function describes how to combine them in the forward pass through the NN.

device = 'cpu'

import torch.nn as nn

class model_1hl(nn.Module):
    def __init__(self, nh1):
        '''
        Class initialization.
        Args:
        nh1 (:obj:`integer`): number of neurons in hidden layer
        '''
        
        super().__init__() # Call base class' init function
        self.fc_h = ...  # Linear hidden layer (use nn.Linear); the inputs are the 5 magnitudes
        self.fc_o = ...  # Linear output layer (use nn.Linear); the single output is the redshift

    def forward(self, x):
        h = ...  # Apply ReLU to the output of fc_h (use nn.functional.relu)
        z = ...  # Pass h through the output layer
        return z
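
One possible way to fill in the skeleton is sketched below. It is only an illustration of the architecture described above (5 inputs, one hidden layer with ReLU, one linear output without activation); the extra argument `n_in` and the hidden-layer width of 32 are assumptions made for this example, not part of the problem sheet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class model_1hl(nn.Module):
    def __init__(self, nh1, n_in=5):
        '''
        Args:
        nh1 (:obj:`integer`): number of neurons in hidden layer
        n_in (:obj:`integer`): number of input features (assumed: the 5 magnitudes)
        '''
        super().__init__()
        self.fc_h = nn.Linear(n_in, nh1)  # hidden layer
        self.fc_o = nn.Linear(nh1, 1)     # output layer: one number, the redshift

    def forward(self, x):
        h = F.relu(self.fc_h(x))  # ReLU applied to the hidden layer only
        z = self.fc_o(h)          # no activation on the output layer
        return z

model = model_1hl(nh1=32)
x = torch.randn(8, 5)   # a batch of 8 fake galaxies with 5 magnitudes each
print(model(x).shape)   # torch.Size([8, 1])
```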

## Problem 3: Data preparation

In this problem, we will prepare our data for NN training.

(i) Split the data from Problem 1 into three subsamples: a training set, a validation set, and a test set. As a first approach, you can use $70\%$ of the data for training and $15\%$ each for validation and testing.

(ii) NN training is often more efficient when using normalized data, as this improves the performance of gradient descent algorithms. Normalize the data from (i) to zero mean and unit variance.

(iii) Discuss whether the normalization statistics should be computed from the entire dataset or only from the training set. How should the validation and test sets be normalized?

**Hint:** To interface with `pytorch`, you can either convert your arrays with `torch.tensor` and wrap them in a `TensorDataset` from `torch.utils.data`, or define your own dataset class. Once you have created your dataset, you can create a data loader using `DataLoader`.
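
The data preparation could look roughly like the sketch below. The function name `split_and_normalize`, the random seed, and the batch size are hypothetical choices, and the synthetic arrays only stand in for the real magnitudes and redshifts. The sketch takes one particular position on (iii): it computes mean and standard deviation from the training set only and reuses them for validation and test; it also leaves the target redshift unnormalized, which is a choice rather than a requirement.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def split_and_normalize(X, y, frac_train=0.70, frac_val=0.15, seed=0):
    """Shuffle, split into train/val/test, and standardize the features."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(frac_train * len(X))
    n_val = int(frac_val * len(X))
    parts = np.split(idx, [n_train, n_train + n_val])  # train, val, test indices

    # Statistics from the training set only (assumption of this sketch)
    mean = X[parts[0]].mean(axis=0)
    std = X[parts[0]].std(axis=0)

    datasets = []
    for part in parts:
        X_norm = (X[part] - mean) / std
        datasets.append(TensorDataset(
            torch.tensor(X_norm, dtype=torch.float32),
            torch.tensor(y[part], dtype=torch.float32).unsqueeze(1),  # shape (N, 1)
        ))
    return datasets, mean, std

# Synthetic stand-in for the SDSS magnitudes and redshifts
rng = np.random.default_rng(1)
X = rng.normal(18.0, 1.5, size=(1000, 5))
y = rng.uniform(0.0, 0.3, size=1000)

(train_ds, val_ds, test_ds), mean, std = split_and_normalize(X, y)
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=128)
test_loader = DataLoader(test_ds, batch_size=128)
```

Shuffling is only enabled for the training loader; the order of the validation and test samples does not affect the loss.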

## Problem 4: NN training

In this problem, we will write a routine to train our NN with `pytorch`. Note the following:

- As loss function, you can use the mean squared error.
- You can use the stochastic gradient descent optimizer.
- Log the training and validation losses for each epoch and store them as numpy arrays.
- Use the validation loss to pick the optimal model for a given architecture and choice of hyperparameters, i.e. pick the model with the lowest validation loss.

(i) Once you have set up the training routine, train your model for 100 epochs and save the best-performing model.

(ii) Look at the loss curves as a function of training epoch and discuss your results. Have you trained the model long enough?

(iii) Experiment with changing hyperparameters, especially the learning rate.
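
A training routine along the lines of the notes above might be sketched as follows. The function name `train`, the learning rate, the batch size, and the synthetic data are all assumptions made for illustration; the small `nn.Sequential` model merely stands in for the network from Problem 2. The best model is kept in memory as a copy of its `state_dict`; writing it to disk with `torch.save` is left out here.

```python
import copy
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, train_loader, val_loader, n_epochs=100, lr=1e-2, device="cpu"):
    """Train with MSE loss and SGD; return per-epoch training and validation losses."""
    model.to(device)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    train_losses, val_losses = np.zeros(n_epochs), np.zeros(n_epochs)
    best_val, best_state = float("inf"), None

    for epoch in range(n_epochs):
        model.train()
        running = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()
            running += loss.item() * len(xb)  # weight by batch size
        train_losses[epoch] = running / len(train_loader.dataset)

        model.eval()
        running = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                running += loss_fn(model(xb), yb).item() * len(xb)
        val_losses[epoch] = running / len(val_loader.dataset)

        # Keep a copy of the weights with the lowest validation loss so far
        if val_losses[epoch] < best_val:
            best_val = val_losses[epoch]
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)  # restore the best-performing model
    return train_losses, val_losses

# Synthetic example: 5 inputs, a target that depends linearly on them
torch.manual_seed(0)
X = torch.randn(512, 5)
y = 0.1 * X.sum(dim=1, keepdim=True)
train_loader = DataLoader(TensorDataset(X[:400], y[:400]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X[400:], y[400:]), batch_size=64)

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
train_losses, val_losses = train(model, train_loader, val_loader, n_epochs=20, lr=0.05)
```

Plotting `train_losses` and `val_losses` against the epoch number gives the loss curves asked for in (ii); rerunning with different values of `lr` is a simple way to start on (iii).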

## Problem 5: Evaluate the performance of your model 

Use the test set to evaluate the performance of your NN.
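
The problem sheet does not prescribe specific metrics. The sketch below computes the test-set mean squared error together with three summary statistics that are commonly quoted for photometric redshifts: the mean of $\Delta z/(1+z)$ as a bias, a normalized median absolute deviation as a robust scatter, and the fraction of outliers. The function name `photoz_metrics` and the outlier threshold of 0.15 are assumptions made for this example. The predictions `z_pred` would come from running the trained model on the normalized test set inside `torch.no_grad()` and converting the result to a numpy array.

```python
import numpy as np

def photoz_metrics(z_true, z_pred, outlier_cut=0.15):
    """Summary statistics comparing predicted and spectroscopic redshifts."""
    z_true = np.asarray(z_true, dtype=float).ravel()
    z_pred = np.asarray(z_pred, dtype=float).ravel()
    dz = (z_pred - z_true) / (1.0 + z_true)  # normalized residual
    return {
        "mse": float(np.mean((z_pred - z_true) ** 2)),
        "bias": float(np.mean(dz)),
        # robust scatter: 1.4826 * median absolute deviation of dz
        "sigma_nmad": float(1.4826 * np.median(np.abs(dz - np.median(dz)))),
        # fraction of galaxies with |dz| above the (assumed) threshold
        "outlier_fraction": float(np.mean(np.abs(dz) > outlier_cut)),
    }
```

A scatter plot of `z_pred` against `z_true` with the one-to-one line overplotted is a useful visual complement to these numbers.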