### Hyperparameter tuning (learning rate and batch size)

Here we run a coarse grid search over learning rates and batch sizes for training on the HSAF-HRES mismatch with the `spatiotemporal` task:

* Learning rates: 0.001, 0.005, 0.01, 0.05, 0.1
* Batch sizes: 1, 2, 5, 10, 24, 48
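
The two lists above define a full grid: every learning rate is paired with every batch size. A minimal sketch of that enumeration follows. The helper name `build_grid` and the de-duplication step are assumptions added here for illustration (a repeated value in either list would produce two runs with the same output name), and the values in the demo call are toy values, not the notebook's.

```python
from itertools import product


def build_grid(learning_rates, batch_sizes):
    """Return every unique (learning rate, batch size) pair in sweep order."""
    # dict.fromkeys drops repeated values while keeping the original order
    unique_lrs = list(dict.fromkeys(learning_rates))
    unique_bss = list(dict.fromkeys(batch_sizes))
    return list(product(unique_lrs, unique_bss))


# Toy values for illustration only
grid = build_grid([0.001, 0.01, 0.1], [1, 2])
print(len(grid))  # 3 learning rates x 2 batch sizes = 6 runs
```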

5 learning rates × 6 batch sizes = 30 runs.

Output: 30 CSV files (one per run) containing the training and validation history.
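
Once the runs have finished, the per-run history files can be compared by their best validation loss. The sketch below is one possible way to do that, not part of the original notebook. It assumes each file was written with `to_csv` from a Keras history and therefore has a `val_loss` column; the function names `best_val_loss` and `rank_runs` are hypothetical. It uses only the standard library so it runs without the training environment.

```python
import csv
from pathlib import Path


def best_val_loss(csv_path):
    """Return the lowest val_loss recorded in one training-history CSV."""
    with open(csv_path, newline="") as handle:
        losses = [float(row["val_loss"]) for row in csv.DictReader(handle)]
    return min(losses)


def rank_runs(results_dir):
    """Sort all history CSVs in results_dir by their best val_loss, lowest first."""
    scores = {
        path.name: best_val_loss(path)
        for path in Path(results_dir).glob("*.csv")
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

Calling `rank_runs(DUMP_RESULTS)` would then list the run names from best to worst; the run name encodes the learning rate and batch size, so the winning combination can be read off directly.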

In [1]:
from py_env_train import *

# Define the following:
model_data = ["HRES"] # If TSMP is part of an ensemble, it must come first so the mismatch is calculated correctly!
reference_data = ["HSAF"]
task_name = "spatiotemporal"
mm = "MM"  # or DM
date_start="2020-10-01"
date_end="2021-09-30"
variable="pr"
mask_type="no_na"
laginensemble=False

# The following is defined automatically:
n_ensembles = len(model_data)
n_channels = Func_Train.calculate_channels(n_ensembles, task_name, laginensemble=laginensemble)
if reference_data == ["COSMO_REA6"]:
    canvas_size = (400, 400)
    topo_dir='/p/project/deepacf/kiste/patakchiyousefi1/IO/03-TOPOGRAPHY/EU-11-TOPO.npz'
    trim=True
    daily=True
elif reference_data == ["HSAF"]:
    topo_dir='/p/project/deepacf/kiste/patakchiyousefi1/IO/03-TOPOGRAPHY/HSAF-TOPO.npz'
    canvas_size = (128, 256)
    trim=False
    daily=False
else:
    # Fail early instead of hitting a NameError on canvas_size/daily below
    raise ValueError(f"Unsupported reference_data: {reference_data}")
data_unique_name = f"train_data{'_daily' if daily else '_hourly'}.{variable}.{model_data}.{reference_data}.{mm}.{n_channels}.{'laginensemble' if laginensemble else ''}.{task_name}.{'.'.join(map(str, canvas_size))}.{date_start}.{date_end}.{mask_type}"
filename = f"{data_unique_name}.npz"

# load the data and define the training configurations:
train_files=np.load(TRAIN_FILES+"/"+filename)
xpixels=train_files["canvas_x"].shape[1]
ypixels=train_files["canvas_x"].shape[2]
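
A note on the loading step above: for an `.npz` archive, `np.load` returns a lazy `NpzFile`, and each `train_files["..."]` lookup reads that array from disk again. A minimal sketch of reading the arrays once and reusing them is shown below. The helper name `load_arrays` is hypothetical, and the demo archive with its shapes is made up purely so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_arrays(npz_path, keys=("canvas_x", "canvas_y", "canvas_m")):
    """Read the named arrays from an .npz archive once and return them in a dict."""
    with np.load(npz_path) as archive:  # the context manager closes the file handle
        return {key: archive[key] for key in keys}


# Tiny stand-in archive; the shapes are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npz")
    np.savez(
        path,
        canvas_x=np.zeros((4, 8, 16, 6)),
        canvas_y=np.zeros((4, 8, 16, 1)),
        canvas_m=np.ones((4, 8, 16, 1)),
    )
    arrays = load_arrays(path)
    xpixels, ypixels = arrays["canvas_x"].shape[1:3]
    print(xpixels, ypixels)  # 8 16
```

The returned dictionary holds ordinary in-memory arrays, so passing `arrays["canvas_x"]` to several training runs does not touch the disk again.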

In [2]:
# Define the following for network configs (the fixed hyperparameters)
loss="mse"
Filters=32
patience=8
epochs=64
val_split=0.25

learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
batch_sizes = [1, 2, 5, 10, 24, 48]

# Loop over the variable hyperparameters (LR and BS):
for LR in learning_rates:
    for BS in batch_sizes:
        training_unique_name = data_unique_name+"."+loss+"."+str(Filters)+"."+str(LR)+"."+str(BS)+"."+str(patience)+"."+str(val_split)+"."+str(epochs)
        print("Training: BS: ", str(BS), "LR: ", str(LR))
        
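        # Release graph and GPU state left over from the previous run before building a new model
        tf.keras.backend.clear_session()
        model = Func_Train.UNET(xpixels, ypixels, n_channels, Filters)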
        optimizer = tf.keras.optimizers.Adam(learning_rate=LR, name='Adam')
        model.compile(optimizer=optimizer, loss=loss, metrics=['mse'])
        
        model_path = '/p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/'+training_unique_name+'.h5'
        checkpointer = tf.keras.callbacks.ModelCheckpoint(model_path, verbose=2, save_best_only=True, monitor='val_loss')
        callbacks = [tf.keras.callbacks.EarlyStopping(patience=patience, monitor='val_loss'),
                     tf.keras.callbacks.TensorBoard(log_dir='/p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/'+training_unique_name)]
        
        results = model.fit(train_files["canvas_x"], train_files["canvas_y"], 
                            validation_split=val_split, 
                            batch_size=BS, 
                            epochs=epochs, 
                            verbose=1, 
                            callbacks=callbacks + [checkpointer],  # one flat list of callbacks
                            sample_weight=train_files["canvas_m"], 
                            shuffle=False)

        results_df = pd.DataFrame(results.history)
        results_df.to_csv(DUMP_RESULTS+"/"+training_unique_name+".csv")
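
The loop above ends each run through `EarlyStopping(patience=patience, monitor='val_loss')`, so with `patience=8` a run can stop long before `epochs=64`. The sketch below is a simplified re-implementation of the patience rule in plain Python, written only to show the arithmetic; it is not Keras's own code, and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which patience-based early stopping would halt.

    Returns None if training would run through every epoch in val_losses.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical curve: improves for two epochs, then stalls.
losses = [0.20, 0.19] + [0.21] * 10
print(stopping_epoch(losses, patience=8))  # 10
```

In this example the best value is reached at epoch 2, and the eighth epoch without improvement is epoch 10, which is where training stops.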

Training: BS:  1 LR:  0.001


2023-03-28 12:53:16.643372: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-28 12:53:19.945310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14659 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:60:00.0, compute capability: 7.0
2023-03-28 12:53:20.033791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14659 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-03-28 12:53:20.034569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:

Epoch 1/64


2023-03-28 12:54:02.884514: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8301


   1/6490 [..............................] - ETA: 28:00:42 - loss: 1.8540 - mse: 763.1199

2023-03-28 12:54:11.838719: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2023-03-28 12:54:11.838749: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.


   2/6490 [..............................] - ETA: 2:21:44 - loss: 2.4626 - mse: 57936.9961

2023-03-28 12:54:12.702304: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2023-03-28 12:54:12.702810: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2023-03-28 12:54:12.736239: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:673]  GpuTracer has collected 522 callback api events and 521 activity events. 
2023-03-28 12:54:12.746254: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.


  16/6490 [..............................] - ETA: 13:37 - loss: 0.6343 - mse: 190556.2500

2023-03-28 12:54:13.106036: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/train_data_hourly.pr.['HRES'].['HSAF'].MM.6..spatiotemporal.128.256.2020-10-01.2021-09-30.no_na.mse.32.0.001.1.8.0.25.64/train/plugins/profile/2023_03_28_12_54_13

2023-03-28 12:54:13.114115: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to /p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/train_data_hourly.pr.['HRES'].['HSAF'].MM.6..spatiotemporal.128.256.2020-10-01.2021-09-30.no_na.mse.32.0.001.1.8.0.25.64/train/plugins/profile/2023_03_28_12_54_13/jwc09n081.juwels.trace.json.gz
2023-03-28 12:54:13.154755: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/train_data_hourly.pr.['HRES'].['HSAF'].MM.6..spatiotemporal.128.256.2020-10-01.2021-09-30.no_na.mse.32.0.001.1.8.0.


Epoch 00001: val_loss improved from inf to 0.19652, saving model to /p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/train_data_hourly.pr.['HRES'].['HSAF'].MM.6..spatiotemporal.128.256.2020-10-01.2021-09-30.no_na.mse.32.0.001.1.8.0.25.64.h5
Epoch 2/64

Epoch 00002: val_loss improved from 0.19652 to 0.19065, saving model to /p/project/deepacf/kiste/patakchiyousefi1/AI MODELS/00-UNET/train_data_hourly.pr.['HRES'].['HSAF'].MM.6..spatiotemporal.128.256.2020-10-01.2021-09-30.no_na.mse.32.0.001.1.8.0.25.64.h5
Epoch 3/64

Epoch 00003: val_loss did not improve from 0.19065
Epoch 4/64

Epoch 00004: val_loss did not improve from 0.19065
Epoch 5/64

Epoch 00005: val_loss did not improve from 0.19065
Epoch 6/64

Epoch 00006: val_loss did not improve from 0.19065
Epoch 7/64

Epoch 00007: val_loss did not improve from 0.19065
Epoch 8/64

Epoch 00008: val_loss did not improve from 0.19065
Epoch 9/64

Epoch 00009: val_loss did not improve from 0.19065
Epoch 10/64

Epoch 00010: val_loss did

2023-03-28 13:03:53.075665: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2023-03-28 13:03:53.075694: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2023-03-28 13:03:54.020045: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2023-03-28 13:03:54.020234: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2023-03-28 13:04:34.584370: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 811.25MiB (rounded to 850657280)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-03-28 13:04:34.584440: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2023-03-28 13:04:34

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
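
The log above ends with the GPU allocator running out of memory while copying an input tensor to the device, shortly after a second profiler session starts. One plausible reading is that memory held by the previous run had not been released when the next run began. A minimal sketch of a sweep loop that cleans up after every run is given below. The names `run_sweep`, `train_one`, and `cleanup` are hypothetical, and whether this resolves the error above has not been verified here.

```python
import gc


def run_sweep(grid, train_one, cleanup):
    """Train one model per (learning_rate, batch_size) pair, cleaning up in between.

    train_one and cleanup are caller-supplied callables (hypothetical names).
    """
    histories = {}
    for learning_rate, batch_size in grid:
        try:
            histories[(learning_rate, batch_size)] = train_one(learning_rate, batch_size)
        finally:
            cleanup()     # framework-specific release of model and graph state
            gc.collect()  # drop Python-side references to the finished run
    return histories
```

With TensorFlow, `cleanup` could be `tf.keras.backend.clear_session`, and `train_one` would build, compile, and fit the model for one combination and return `results.history`. Because the cleanup sits in a `finally` block, it also runs when a single combination fails.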