In this simple notebook we use a fully connected neural network to solve a previously seen problem in regression: the photometric redshift problem (see notebooks of Chapter 6 for more detail). We also explore some hyperparameter optimization strategies. 

It accompanies Chapter 8 of the book (2 of 2).

Copyright: Viviana Acquaviva (2023).

Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)



In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

In [2]:
import matplotlib
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 150)

font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
matplotlib.rcParams['figure.dpi'] = 300

In [3]:
import torch

In [4]:
import os
os.environ["KERAS_BACKEND"] = "torch"

In [5]:
import keras

from keras import Input

from keras.models import Sequential #the model is built adding layers one after the other

from keras.layers import Dense #fully connected layers: every output talks to every input

from keras.layers import Dropout #for regularization

### Problem 2: photometric redshifts

I will start from the reduced (high-quality) data set we used for Bagging and Boosting methods. For reference, our best model achieved an NMAD of around 0.02 and an outlier fraction of 4%.

In [6]:
X = pd.read_csv('../Data/sel_features.csv', sep = '\t')
y = pd.read_csv('../Data/sel_target.csv')

In [7]:
X,y = shuffle(X,y, random_state = 12)

Just a reminder of what's in here:

In [None]:
display(X)

In [8]:
fifth = int(len(y)/5) #Divide data in fifths to use 60/20/20 split

In [9]:
X_train = X.values[:3*fifth,:]
y_train = y[:3*fifth]

X_val = X.values[3*fifth:4*fifth,:]
y_val = y[3*fifth:4*fifth]

X_test = X.values[4*fifth:,:]
y_test = y[4*fifth:]

In [None]:
X_train.shape

We know that we need to scale our data!
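As a reminder of what the scaling step does, here is a minimal NumPy sketch of standardization with statistics taken from the training split only. The array `X_demo` is a hypothetical random stand-in for the six features (not the real data), and the `_demo`-style names are chosen so they do not clash with the notebook's own variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for six photometric features (NOT the real data set)
X_demo = rng.normal(loc=20.0, scale=3.0, size=(500, 6))

# Same 60/20/20 logic as in the notebook: split in fifths
n_fifth = len(X_demo) // 5
X_tr = X_demo[:3 * n_fifth]
X_va = X_demo[3 * n_fifth:4 * n_fifth]
X_te = X_demo[4 * n_fifth:]

# Mean and standard deviation come from the training split only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same training statistics are applied to all three splits
X_tr_st = (X_tr - mu) / sigma
X_va_st = (X_va - mu) / sigma
X_te_st = (X_te - mu) / sigma

print(X_tr_st.mean(axis=0).round(6))  # zero by construction
print(X_va_st.mean(axis=0).round(2))  # close to zero, but not exactly
```

The validation and test splits end up only approximately standardized, which is expected: no information from them is used to compute the scaling.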

In [None]:
scaler = StandardScaler()

scaler.fit(X_train)

In [12]:
Xst_train = scaler.transform(X_train)
Xst_val = scaler.transform(X_val)
Xst_test = scaler.transform(X_test)

In a regression problem, we will choose a different activation for the output layer (e.g. linear), and an appropriate loss function (MSE, MAE, ...).

Our input layer has six neurons for this problem.

For other parameters and the network structure, we can start with two layers with 100 neurons and go from there.
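To make the role of the linear output activation concrete, here is a small NumPy sketch of a forward pass through a 6 → 100 → 100 → 1 network. The weights are random and untrained, and the helper names (`init_dense`, `forward`) are illustrative only; this is not how Keras implements its layers internally.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    """ReLU activation: negative values are set to zero."""
    return np.maximum(z, 0.0)

def init_dense(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    weights = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))
    biases = np.zeros(n_out)
    return weights, biases

W1, b1 = init_dense(6, 100)    # six input features -> first hidden layer
W2, b2 = init_dense(100, 100)  # first hidden layer -> second hidden layer
W3, b3 = init_dense(100, 1)    # second hidden layer -> single output

def forward(X):
    """Forward pass: two ReLU hidden layers, then a linear output."""
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return h2 @ W3 + b3  # linear activation: no squashing of the output

X_toy = rng.normal(size=(4, 6))  # four hypothetical objects, six features each
y_hat = forward(X_toy)
print(y_hat.shape)  # (4, 1)
```

With a linear output the prediction can take any real value, which is what we want for a continuous target such as redshift.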

In [None]:
model = Sequential()

# Declare the input shape (number of original features)
model.add(keras.Input(shape=(6,)))

optimizer = keras.optimizers.AdamW(learning_rate=0.001)

# Add the first hidden layer and specify its size
# (the input shape is already set by keras.Input above)

model.add(Dense(100, activation='relu'))

# Add a second hidden layer and specify its size

model.add(Dense(100, activation='relu'))

# Add an output layer (linear activation for regression)

model.add(Dense(1, activation='linear'))

model.compile(loss='mse', optimizer=optimizer)


We begin with 100 epochs and batch size = 300.

In [None]:
mynet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=300)

In [None]:
results = model.evaluate(Xst_test, y_test)
print('MSE:', results) #we are only monitoring the MSE

As usual, we can plot the loss throughout the training process.

In [None]:
plt.plot(mynet.history['loss'], label = 'train')
plt.plot(mynet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.legend(loc='upper right', fontsize = 12);
#plt.savefig('Photoz_NN.png')

### As always with regression problems, it is helpful to plot the predictions against the true values.



In [None]:
plt.figure(figsize=(5,5))
    
plt.xlabel('True redshift', fontsize = 14)
plt.ylabel('Estimated redshift', fontsize = 14)

plt.scatter(y_test, model.predict(Xst_test), s =10, c = 'teal');

plt.xlim(0,2)
plt.ylim(0,2)
plt.tight_layout()
#plt.savefig('Photoz_NN_scatter.png')

We didn't do cross validation, so we can only generate predictions on our single test fold in order to derive the other metrics we are interested in: the outlier fraction (OLF) and the normalized median absolute deviation (NMAD).
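The two metrics computed in the next cells can also be wrapped in small helper functions. The sketch below uses the same definitions as the notebook (outlier threshold 0.15, NMAD prefactor 1.48); the function names and the toy redshift values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def outlier_fraction(y_true, y_pred, threshold=0.15):
    """Fraction of objects with |z_true - z_pred| > threshold * (1 + z_true)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean(np.abs(y_true - y_pred) > threshold * (1 + y_true))

def nmad(y_true, y_pred):
    """Normalized median absolute deviation, 1.48 * median(|dz| / (1 + z_true))."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return 1.48 * np.median(np.abs(y_true - y_pred) / (1 + y_true))

# Toy values chosen by hand to make the arithmetic easy to follow
z_true = np.array([0.5, 1.0, 1.0, 0.2])
z_pred = np.array([0.5, 1.1, 1.5, 0.2])

print(outlier_fraction(z_true, z_pred))  # one outlier out of four
print(nmad(z_true, z_pred))
```

Flattening both inputs with `ravel()` is a defensive choice: `model.predict` returns an array of shape `(n, 1)`, and subtracting it from a 1-D target of shape `(n,)` would silently broadcast to an `(n, n)` array.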

In [None]:
ypred = model.predict(Xst_test)

Calculate Outlier Fraction

In [None]:
len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test)

Calculate Normalized Median Absolute Deviation (NMAD)

In [None]:
1.48*np.median(np.abs(y_test-ypred)/(1 + y_test))

We have decent, but not outstanding, numbers. We can play with/optimize the hyperparameters; two experiments that are very interesting IMO are changing the loss function to see its effect on the residuals, and adding more layers.

### Let's try some optimization with keras tuner

In [25]:

from keras_tuner.tuners import RandomSearch
from keras import layers

#Some material below is adapted from the Keras Tuner documentation

# https://keras-team.github.io/keras-tuner/

This function specifies which parameters we want to tune. Tunable parameters can be of type "Choice" (we specify a set), Int, Boolean, or Float.

Keras Tuner has many other options for setting up hyperparameters; check out the API to see more: https://keras.io/keras_tuner/api/hyperparameters/
It even supports conditional hyperparameters, so you don't explore meaningless combinations of hyperparameters (which was an issue with our scikit-learn hyperparameter searches).
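To illustrate the idea of a conditional hyperparameter without relying on any particular library call, here is a plain-Python sketch that draws one random configuration from a search space like the one used in this notebook. It is not the Keras Tuner API; `sample_config` is a hypothetical helper, and the point is only that a layer size is drawn if, and only if, that layer exists.

```python
import random

def sample_config(rng):
    """Draw one configuration; units_i is only drawn for layers that exist."""
    config = {
        "num_layers": rng.randint(2, 6),  # inclusive on both ends
        "learning_rate": rng.choice([1e-2, 1e-3, 1e-4]),
    }
    # Conditional part: the number of 'units_i' entries depends on num_layers
    for i in range(config["num_layers"]):
        config[f"units_{i}"] = rng.choice([100, 200, 300])
    return config

rng = random.Random(0)  # fixed seed so the draw is reproducible
cfg = sample_config(rng)
print(cfg)
```

Every configuration produced this way corresponds to a distinct network, so no trial is spent on a hyperparameter that has no effect.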

In [27]:
def build_model(hp):
    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 2, 6)): #We try between 2 and 6 layers
        model.add(layers.Dense(units=hp.Int('units_' + str(i),
                                            min_value=100, #Each of them has 100-300 neurons, in intervals of 100
                                            max_value=300,
                                            step=100),
                               activation='relu'))
    model.add(Dense(1, activation='linear')) #last one
    model.compile(
        optimizer=keras.optimizers.AdamW(
            hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])), #And a few learning rates
        loss='mse')
    return model

### Question:
List the hyperparameters being optimized and briefly describe what each one does.

Next, we specify how we want to explore the parameter space. The Random Search is the simplest choice, but often quite effective; alternatives are Hyperband (optimized Random Search where a larger fraction of models is trained for a smaller number of epochs, but only the most promising ones survive), or Bayesian Optimization, which attempts to build a probabilistic interpretation of the model scores (the posterior probability of obtaining score x, given the values of hyperparameters).


Normally, you'd try at least 40 models or so, but this takes far too long to run in a studio (the full search whose output is shown below took about 50 minutes on my laptop). I've left the output of that full search below so you can see what happens, but we'll run an abbreviated search for now, so you can see how it works.
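To get a feeling for how sparsely 40 trials sample this search space, we can count the distinct networks that `build_model` can produce. This is a back-of-the-envelope sketch that assumes only the layer sizes of layers actually present in the network matter.

```python
# Each layer can have 100, 200, or 300 neurons
n_unit_choices = len(range(100, 301, 100))

# Three learning rates: 1e-2, 1e-3, 1e-4
n_learning_rates = 3

# For a network with n layers there are n_unit_choices**n size combinations;
# num_layers runs from 2 to 6 (inclusive)
n_architectures = sum(n_unit_choices ** n_layers for n_layers in range(2, 7))

n_configs = n_architectures * n_learning_rates
print(n_configs)       # 3267 distinct configurations
print(40 / n_configs)  # fraction covered by 40 random trials
```

Forty trials cover a little over 1% of the configurations, which is why it is worth looking at several of the top models rather than trusting a single winner.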

In [28]:
keras.backend.clear_session()
# This is an example of the settings you'd use for a more complete hyperparameter search
tuner = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=40, #number of combinations to try
    executions_per_trial=3,
    project_name='My Drive/Photoz') #may need to delete or reset

In [None]:
keras.backend.clear_session()

tuner_small = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=5, # too small for real use! Just showing you how it works
    executions_per_trial=3,
    project_name='My Drive/Photoz_small') #may need to delete or reset. This will store information about each trial

### Questions:
- What is the "objective" parameter controlling? What is it being set to?
- What is the "executions per trial" parameter controlling? Why is it set to something larger than 1? 

We can visualize the search space below:

In [None]:
tuner.search_space_summary() #for the full search

In [None]:
tuner_small.search_space_summary() #for the abbreviated search

Finally, it's time to put our tuner to work. 


This is a big job! 

### <span style="color:red"> Do not run this cell, unless you want to be waiting for a long time.</span>

In [34]:
tuner.search(Xst_train, y_train, #same signature as model.fit
             epochs=100, validation_data=(Xst_val, y_val), batch_size=300, verbose = 1) 

#Note: setting verbosity to 0 would give no output until done - it took about 50 mins on my laptop. 
#Interestingly, it was faster when I tried running with Adam instead of AdamW.

Trial 40 Complete [00h 01m 12s]
val_loss: 0.017915900175770123

Best val_loss So Far: 0.015789744754632313
Total elapsed time: 00h 49m 17s



### <span style="color:red"> Instead, run this cell to see how it works.</span>

It should take about 5 minutes.

In [None]:
tuner_small.search(Xst_train, y_train, #same signature as model.fit
             epochs=100, validation_data=(Xst_val, y_val), batch_size=300, verbose = 1) 

#Note: setting verbosity to 0 would give no output until the search is done

The "results\_summary(n)" function gives us access to the n best models. It's useful to look at a few because often the differences are minimal, and a smaller model might be preferable! Note that the summary lists a "number of units" value for every layer index, even when the number of layers in that particular trial is smaller; only the first num\_layers of those values are actually used to build the network.
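As a small illustration of how to read a trial summary, the sketch below extracts the layer sizes that are actually used. The dictionary holds the values printed for Trial 22 in the full-search output included in this notebook; `active_layer_sizes` is a hypothetical helper, not a Keras Tuner function.

```python
# Hyperparameter values copied from the "Trial 22 summary" of the full search
trial_22 = {
    "num_layers": 2,
    "units_0": 100,
    "units_1": 100,
    "learning_rate": 0.01,
    "units_2": 100,
    "units_3": 200,
    "units_4": 100,
    "units_5": 100,
}

def active_layer_sizes(hps):
    """Return only the layer sizes for layers that exist in the network."""
    return [hps[f"units_{i}"] for i in range(hps["num_layers"])]

print(active_layer_sizes(trial_22))  # [100, 100]
```

Here `units_2` through `units_5` are leftovers from the search bookkeeping and do not correspond to any layer in the two-layer model.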

### <span style="color:red">  Output for the full search, this cell won't execute if you haven't run it yourself! Don't try to run it.</span>

In [37]:
tuner.results_summary(6)

Results summary
Results in ./My Drive/Photoz
Showing 6 best trials
Objective(name="val_loss", direction="min")

Trial 20 summary
Hyperparameters:
num_layers: 4
units_0: 300
units_1: 100
learning_rate: 0.01
units_2: 100
units_3: 200
units_4: 100
units_5: 200
Score: 0.015789744754632313

Trial 22 summary
Hyperparameters:
num_layers: 2
units_0: 100
units_1: 100
learning_rate: 0.01
units_2: 100
units_3: 200
units_4: 100
units_5: 100
Score: 0.016055627415577572

Trial 12 summary
Hyperparameters:
num_layers: 4
units_0: 300
units_1: 100
learning_rate: 0.01
units_2: 100
units_3: 200
units_4: 200
units_5: 200
Score: 0.017035806862016518

Trial 07 summary
Hyperparameters:
num_layers: 3
units_0: 300
units_1: 300
learning_rate: 0.01
units_2: 300
units_3: 200
units_4: 300
units_5: 100
Score: 0.01709962698320548

Trial 30 summary
Hyperparameters:
num_layers: 5
units_0: 200
units_1: 200
learning_rate: 0.001
units_2: 100
units_3: 300
units_4: 100
units_5: 300
Score: 0.017341187223792076

Trial 29 summ

### <span style="color:red">  Output for your small search, go ahead and execute this one.</span>

In [None]:
tuner_small.results_summary(6)

### Question:
- Based on the results summary from the full tuning, is there much variation in the results of the best-performing model? Can we confidently say which the best-performing model is, with this information? Or is there something else we'd need to do?
- Based on the results summary from the full tuning, approximately how many layers should our neural network have? Approximately how many neurons per layer? Feel free to give a range of values for each.

So that you can run the cells below, we'll use the best result from your small search, but later on I'll give you the hyperparameters for the best model from the full search to actually run the network and see its performance. 

In [39]:
best_hps_small = tuner_small.get_best_hyperparameters()[0] #choose first model

In [None]:
best_hps_small.get('learning_rate')

In [None]:
best_hps_small.get('num_layers')

In [None]:
#Size of layers

print(best_hps_small.get('units_0'))
print(best_hps_small.get('units_1'))

In [43]:
model = tuner_small.hypermodel.build(best_hps_small) #define model = best model

In [44]:
model.build(input_shape=(None,6)) #build best model (if not fit yet, this will give access to summary)

In [None]:
model.summary() 

### Question:
Remember, you did a random search for hyperparameter optimization (and we ran a really small one), so answers to these questions will probably vary among your group!

Using the small model search...
- How many free parameters does your best model have? 

- Describe the structure of your best model: how many hidden layers, how many neurons per layer?

- What is the learning rate for your best model?

Now, we'll switch to the best hyperparameters I found for you using the full search, and build a neural net with the optimal hyperparameters.

There isn't a natural way to build this model using keras_tuner (as shown above for the small model), since we didn't use keras_tuner in this session to find it, so we'll just build it with keras, as we did at the beginning of the notebook. 
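Before building it, we can estimate how many free parameters such a network has. For a fully connected layer with `n_in` inputs and `n_out` neurons there are `n_in * n_out` weights plus `n_out` biases. The helper below (`dense_param_count`, an illustrative name) applies this to the best architecture from the full search (Trial 20: hidden layers of 300, 100, 100, and 200 neurons) and to the baseline network from the start of the notebook.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    n_in = n_inputs
    for n_out in layer_sizes:
        total += (n_in + 1) * n_out  # n_in weights + 1 bias per neuron
        n_in = n_out
    return total

# Six input features; the final entry of each list is the single output neuron
print(dense_param_count(6, [300, 100, 100, 200, 1]))  # 62701 (best full-search model)
print(dense_param_count(6, [100, 100, 1]))            # 10901 (baseline model)
```

The number should agree with the "Total params" line reported by `model.summary()` for the corresponding model.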

In [None]:

model = Sequential()

# Declare the input shape (number of original features)
model.add(keras.Input(shape=(6,)))

optimizer = keras.optimizers.AdamW(learning_rate=0.01)

# Add the first hidden layer and specify its size
# (the input shape is already set by keras.Input above)

model.add(Dense(300, activation='relu'))

# Add a second hidden layer and specify its size

model.add(Dense(100, activation='relu'))

# Add a third hidden layer and specify its size

model.add(Dense(100, activation='relu'))

# Add a fourth hidden layer and specify its size

model.add(Dense(200, activation='relu'))

# Add an output layer (linear activation for regression)

model.add(Dense(1, activation='linear'))

model.compile(loss='mse', optimizer=optimizer)


As before, we train for 100 epochs with batch size = 300.

In [None]:
bestnet = model.fit(Xst_train, y_train, validation_data= (Xst_val, y_val), epochs=100, batch_size=300)

We can also look at the train vs validation curves for the optimal model found by the tuner.

In [None]:
plt.plot(bestnet.history['loss'], label = 'train')
plt.plot(bestnet.history['val_loss'],'-.m', label = 'validation')
plt.ylabel('Loss', fontsize = 14)
plt.xlabel('Epoch', fontsize = 14)
plt.ylim(0,0.1)
plt.legend(loc='upper right', fontsize = 12);
#plt.savefig('OptimalNN_Photoz.png',dpi=300)

Finally, we report test scores for all the metrics of interest (MSE, OLF, NMAD):

In [None]:
model.evaluate(Xst_test, y_test)

In [None]:
ypred = model.predict(Xst_test)

#Calculate OLF

print('OLF', len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test))

#Calculate Normalized Median Absolute Deviation (NMAD)

print('NMAD', 1.48*np.median(np.abs(y_test-ypred)/(1 + y_test)))

### Question:
- Are the results improved, relative to the initial baseline model? 
- Are the results improved, relative to your ensemble decision tree models?
- Does this model have high variance?
- What else would we want to do to get definite answers to these questions?

### Below, we show the effect of changing the loss function (MSE/MAE/MAPE), and we estimate the uncertainties in the estimates of OLF/NMAD, so we can decide whether the differences are significant.

#### The model is the best model I found above (it came from a Random Search, you might find a different one).
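For reference, here is a NumPy sketch of the three losses being compared. These are the textbook definitions; the Keras implementations may differ in small details (for example, to my understanding Keras guards the MAPE denominator with a small epsilon), so treat this as an illustration rather than a reimplementation. The toy values are made up for the example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: large residuals dominate."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: every residual counts linearly."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error: residuals relative to the true value."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Toy example: two good predictions and one large miss
y_true_toy = np.array([0.5, 1.0, 2.0])
y_pred_toy = np.array([0.6, 1.0, 1.0])

print('MSE :', mse(y_true_toy, y_pred_toy))
print('MAE :', mae(y_true_toy, y_pred_toy))
print('MAPE:', mape(y_true_toy, y_pred_toy))
```

The single large miss contributes almost all of the MSE but a smaller share of the MAE, while the MAPE weights errors more heavily at low true values, which matters for objects with redshift close to zero.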

In [None]:
#this took about 5 minutes to run on my laptop
#Architecture stays the same
keras.backend.clear_session()

def build_best_model(loss):
    #Build a fresh copy of the best architecture, so the weights are re-initialized at random
    model = Sequential()

    # Declare the input shape (number of original features)
    model.add(keras.Input(shape=(6,)))

    # Four hidden layers
    model.add(Dense(300, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(200, activation='relu'))

    # Output layer (linear activation for regression)
    model.add(Dense(1, activation='linear'))

    model.compile(
        optimizer=keras.optimizers.AdamW(learning_rate = 0.01),
        loss=loss)
    return model

#We use three different loss functions and repeat the training 4x

for loss in ['mse','mae', 'mape']:

    OLF = np.zeros(4)
    NMAD = np.zeros(4)

    for i in range(0,4): #let's do this 4 times and change only random weights initialization

        #A new model is needed for each repetition; calling fit() again on the
        #same model would continue training from the previous weights
        model = build_best_model(loss)

        model.fit(Xst_train, y_train,
             epochs=100,
             validation_data=(Xst_val, y_val), batch_size=300, verbose = 0)

        ypred = model.predict(Xst_test)

        #Calculate OLF

        OLF[i] = len(np.where(np.abs(y_test-ypred)>0.15*(1+y_test))[0])/len(y_test)

        #Calculate Normalized Median Absolute Deviation (NMAD)
        
        NMAD[i] = 1.48*np.median(np.abs(y_test-ypred)/(1 + y_test))

    print('OLF mean/std using loss', loss, 'is:', "{:.3f}".format(OLF.mean()), "{:.3f}".format(OLF.std()))
    print('NMAD mean/std using loss', loss, 'is:', "{:.3f}".format(NMAD.mean()), "{:.3f}".format(NMAD.std()))

### Question:
- What are the source(s) of the variance we've measured in this test? E.g. in the past, we've measured the variance associated with changing the test set by using CV. Is that what we're doing here? If not, what causes the difference between the models we're using to get the variance on each OLF/NMAD result?

- If we want to minimize OLF and NMAD, what loss (of these default options) should we use to train our model?

### Acknowledgement Statement:

You're done, go ahead and upload both studio notebooks to Gradescope!