# **CVAE_training**

In [1]:
import os, json

import papermill as pm
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import tensorflow as tf
import netCDF4
import cartopy

from tensorflow import keras
from keras import layers
from sklearn.model_selection import train_test_split 

print("TF version:", tf.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

2024-07-17 17:23:58.245301: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-17 17:23:58.245363: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-17 17:23:58.246889: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-17 17:23:58.253391: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


TF version: 2.16.2
GPU is available


2024-07-17 17:23:59.971440: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-17 17:24:00.016350: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-17 17:24:00.018332: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-

# Download and Convert Data
On my [first Google hit for GEFS](https://www.ncei.noaa.gov/products/weather-climate-models/global-ensemble-forecast), I clicked on [AWS Open Data Registry for GEFS](https://registry.opendata.aws/noaa-gefs-pds/) and selected [NOAA GEFS Re-forecast](https://registry.opendata.aws/noaa-gefs-reforecast/) which has no useage restrictions.  The [GEFS Re-forecast data documentation](https://noaa-gefs-retrospective.s3.amazonaws.com/Description_of_reforecast_data.pdf) is very clear and we're going to download two files, 57 MB each.  The date of the initialization of the re-forecast is in the file name in the format YYYYMMDDHH.  The c00, p01, p02, p03, p04 are the control and perturbation ensemble members (5 total).

In [2]:
data_prefix = "./gefs_data"
data_dir = "./gefs_data/converted/"
model_dir = './model_dir'

In [3]:
# data definitions

# data download
def get_data(year, month, day, ensemble):
    if f'pres_msl_{year}{month}{day}00_{ensemble}.grib2' not in os.listdir(data_prefix):
        !wget -q -P {data_prefix} https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/{year}/{year}{month}{day}00/{ensemble}/Days%3A1-10/pres_msl_{year}{month}{day}00_{ensemble}.grib2

# data deletion
def remove_data():
    !find {data_prefix} -type f -delete

# data loading
def load_data(): 
    files = os.listdir(data_dir)
    files = [f for f in files if '.nc' in f]
    
    all_data = (np.expand_dims(
        np.concatenate(
            [netCDF4.Dataset(data_dir + converted_file)['msl'][:] for converted_file in files]
        ),
        -1
    ).astype("float32") - 85000) / (110000 - 85000)
    
    return all_data

# Neural Network Design

We need to get to a small latent space. Conv2D networks are good because they help reduce the number of connections in a network in a meaningful way.  I'm using terms as defined in [this definition of conv2D](https://towardsdatascience.com/conv2d-to-finally-understand-what-happens-in-the-forward-pass-1bbaafb0b148).

**Definitions:**
K -> kernel size;
P -> padding;
S -> stride;
D -> Dilation;
G -> Groups

**Filter options:**
Longitude is easy because it is large and even, so as long as you have an even stride, you get integer results when dividing.
e.g. lon 9: stride 4, lat 7: stride 5

- Latitude - whole numbers occurr for P = 2 & K = 3 or K = 11.
- 11 grid points * 0.25 deg * 100 km/deg = 275 km filter window (a good scale for weather)
- 9 grid points * 0.25 deg * 100 km/deg = 225 km
- Longitude - whole numbers occur for P = 0 & K = 11 (nice match with Latitude), P = 1 & K = 3 or 13, P = 2 & K = 5.

For a 5 x 7 filter with 3 stride (no overlap) and no padding:
- lat: (721 - 4) / 3 = 239 possible steps (good whole number!)
- lon: (1440 - 4) / 3 = 478.6666 possible steps

## Load and Preprocess Training Data:
The standard way of manipulating arrays in Conv2D layers in TF is to use arrays in the shape:
`batch_size,  height, width, channels = data.shape`
In our case, the the `batch_size` is the number of image frames (i.e. separate samples or rows in a `.csv` file), the `height` and `width` define the size of the image frame in number of pixels, and the `channels` are the number of layers in the frames.  Typically, channels are color layers (e.g. RGB or CMYK) but in our case, we could use different metereological variables.  However, for this first experiment, **we only need one channel** because we're only going to use mean sea level pressure (msl).

## Build the Encoder:
GFS grids I have available here are at 0.25 degree resolution.  I'm doing this as a "worst case" scenario since there are also 0.5 and 1.0 degree grids with lower resolution but I can't find that data quickly and don't know what's available.

These 0.25 degree grids are 721 x 1440.
Each forecast file is 3 hourly for 10 days = 8 steps/forecast * 10 days = 80 "frames"
This demo is only using two forecasts from the control ensemble
(one launched Jan 01, 2019 and one launched Jan 02, 2019) -> this is only 
a small subset of the variability possible in the model.

This particular data set spans 2000-2019 and there are 5 ensemble members.

## Build the Decoder:
With the 11 x 11 and 5 x 5 filters, non-overlapping stride, applied here, we have a final "image" size of 14 x 27 and 64 channels.

In [4]:
# model defintions

class Sampling(layers.Layer):
    """Uses (z_mean, z_log_var) to sample z, the vector encoding a digit."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

def build_encoder(latent_dim):
    encoder_inputs = keras.Input(shape=(721, 1440, 1))
    
    x = layers.Conv2D(32, 11, activation = "relu", strides = [9, 10], padding = "valid")(encoder_inputs)
    x = layers.Conv2D(64, [5,9], activation = "relu", strides = [5, 9], padding = "valid")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(16, activation="relu")(x)
    
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
    z = Sampling()([z_mean, z_log_var])
    
    encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name = "encoder")
    
    print(encoder.summary())
    return encoder

def build_decoder(latent_dim):
    latent_inputs = keras.Input(shape=(latent_dim,))
    x = layers.Dense(15 * 15 * 64, activation="relu")(latent_inputs)
    x = layers.Reshape((15, 15, 64))(x)
    # FIXME - there is something wrong here, but at least there is a pattern.
    # Using output_padding as a fudge factor -> it may be that there is exactly
    # one "missing" filter stamp/convolution because for both Conv2DTranspose
    # operations, output_padding is set to maximum it could be in both dims
    # (i.e. exactly one less than the stride of each filter).
    x = layers.Conv2DTranspose(64, [5, 9], activation = "relu", strides = [5,9], padding = "valid", output_padding = [4, 8])(x)
    x = layers.Conv2DTranspose(32, 11, activation = "relu", strides = [9,10], padding = "valid", output_padding = [8, 9])(x)
    decoder_outputs = layers.Conv2DTranspose(1, 3, activation = "sigmoid", padding = "same")(x)
    decoder = keras.Model(latent_inputs, decoder_outputs, name = "decoder")
    
    print(decoder.summary())
    return decoder

class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = keras.metrics.Mean(name = "total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(name = "reconstruction_loss")
        self.kl_loss_tracker = keras.metrics.Mean(name = "kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    def train_step(self, data):
        
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            reconstruction = self.decoder(z)
            # FIXME: Normalize loss with the number of features (28 * 28)
            n_features = 28 * 28
            reconstruction_loss = tf.reduce_mean(
                tf.reduce_sum(
                    keras.losses.binary_crossentropy(data, reconstruction), axis = (1, 2)
                )
            ) / n_features
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis = 1)) / n_features
            total_loss = (reconstruction_loss + kl_loss)
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

    # Needed to validate (validation loss) and to evaluate
    def test_step(self, data):
        if type(data) == tuple:
            data, _ = data
            
        z_mean, z_log_var, z = self.encoder(data)
        reconstruction = self.decoder(z)
        # FIXME: Normalize loss with the number of features (28 * 28)
        n_features = 28 * 28
        reconstruction_loss = tf.reduce_mean(
            tf.reduce_sum(
                keras.losses.binary_crossentropy(data, reconstruction), axis = (1, 2)
            )
        ) / n_features
        kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
        kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis = 1)) / n_features
        total_loss = (reconstruction_loss + kl_loss)
        # grads = tape.gradient(total_loss, self.trainable_weights)
        # self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

In [5]:
# training definitions

def train_model(X_train, X_test, X_valid, date, vae):
    early_stopping_cb = keras.callbacks.EarlyStopping(patience = 5, restore_best_weights = True) # stops training early if the validation loss does not improve
    
    if os.path.exists(os.path.join(model_dir, 'vae.weights.h5')): # if the model has already been trained at least once, load that model
        vae.load_weights(os.path.join(model_dir, 'vae.weights.h5'))

    history = vae.fit(
        X_train, epochs = 50, batch_size = 40,
        callbacks = [early_stopping_cb],
        validation_data = (X_valid,)
    )

    vae.save_weights(os.path.join(model_dir, 'vae.weights.h5')) # save model weights after training
    !cp model_dir + '/vae.weights.h5' model_dir + f'/vae.weights_{date}.h5' # make a copy to save
    
    hist_pd = pd.DataFrame(history.history)
    hist_pd.to_csv(os.path.join(model_dir, f'history_{date}.csv'), index = False)

    test_loss = vae.evaluate(X_test)
    test_loss = dict(zip(["loss", "reconstruction_loss", "kl_loss"], test_loss))

    print('Test loss:', test_loss)

    with open(os.path.join(model_dir, f'test_loss_{date}.json'), 'w') as json_file:
        json.dump(test_loss, json_file, indent = 4)
        
    !dvc add model_dir + f'/history_{date}.csv'
    !dvc add model_dir + '/vae.weights.h5' model_dir + f'/vae.weights_{date}.h5'
    
    !git add model_dir + f'/history_{date}.csv'
    !git add model_dir + '/vae.weights.h5' model_dir + f'/vae.weights_{date}.h5'
    
    !dvc push
    !git commit -m f'{date}'
    !git push
    
    !rm '/vae.weights.h5' model_dir + f'/vae.weights_{date}.h5' # delete copy
    
def run_train(num_files, date, vae):
    slp = load_data() # load data
    print("shape:", np.shape(slp)) # verify data shape
    
    # split the data - y values are throw away
    X_train, X_test, y_train, y_test = train_test_split(slp[0:(num_files * 80 - 1), :, :, :], np.arange(0, num_files * 80 - 1), test_size = 0.2, random_state = 1)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.25, random_state = 1) # 0.25 x 0.8 = 0.2

    train_model(X_train, X_test, X_valid, date, vae)
    remove_data()

# Train the VAE model

In [6]:
# model build

latent_dim = 2

# build encoder
encoder = build_encoder(latent_dim)
print("Memory usage after building encoder:", tf.config.experimental.get_memory_info('GPU:0'))

# build decoder
decoder = build_decoder(latent_dim)
print("Memory usage after building decoder:", tf.config.experimental.get_memory_info('GPU:0'))

# build VAE (variational autoencoder)
vae = VAE(encoder, decoder)
vae.compile(optimizer = 'rmsprop') 
print("Memory usage after building VAE:", tf.config.experimental.get_memory_info('GPU:0'))

2024-07-17 17:24:00.181548: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-17 17:24:00.183681: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-07-17 17:24:00.185565: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-

None
Memory usage after building encoder: {'current': 1311232, 'peak': 3153408}


None
Memory usage after building decoder: {'current': 3216640, 'peak': 6049536}
Memory usage after building VAE: {'current': 3219200, 'peak': 6049536}


In [7]:
# parameter cell for pm 
year = "2018"
month = "01"
days = ["00", "10", "20"]
ensembles = ["c00"] #, "p01", "p02", "p03", "p04"]

In [8]:
# Parameters
year = "2000"
month = "01"
days = ["01", "11", "21"]


In [9]:
# training

num_files = 0
first_day = days[0]
date = year + month + first_day

# get wanted data -------------------------------------------------------------------------
for day in days:
    for ensemble in ensembles:
        get_data(year, month, day, ensemble)

        if f'pres_msl_{year}{month}{day}00_{ensemble}.grib2' in os.listdir(data_prefix):
            num_files += 1
# ------------------------------------------------------------------------------------------          
            
!csh batch_grib2nc.csh # convert files
run_train(num_files, date, vae) # run training 
    
print("Memory usage after training:", tf.config.experimental.get_memory_info('GPU:0'))

Working on ./gefs_data/pres_msl_2000010100_c00.grib2


cdo    copy:   0%100%                 

[32mcdo    copy: [0mProcessed 83059200 values from 1 variable over 80 timesteps [1.44s 84MB]
Working on ./gefs_data/pres_msl_2000011100_c00.grib2


cdo    copy:   0%100%                 

[32mcdo    copy: [0mProcessed 83059200 values from 1 variable over 80 timesteps [1.46s 84MB]


Working on ./gefs_data/pres_msl_2000012100_c00.grib2


cdo    copy:   0%100%                 

[32mcdo    copy: [0mProcessed 83059200 values from 1 variable over 80 timesteps [1.45s 84MB]


shape: (240, 721, 1440, 1)


  saveable.load_own_variables(weights_store.get(inner_path))


Epoch 1/50


I0000 00:00:1721237065.458345  107628 service.cc:145] XLA service 0x1499c800c9c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1721237065.458396  107628 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
2024-07-17 17:24:25.493828: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.


2024-07-17 17:24:25.683505: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907


2024-07-17 17:24:31.043312: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 5.07GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-07-17 17:24:31.043484: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 5.07GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-07-17 17:24:31.043502: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 5.07GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-07-17 17:24:31.043523: W external/local_tsl/tsl/framework/bfc_

2024-07-17 17:24:33.606495: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[40,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[40,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:25:17.671321: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 45.064886115s
Trying algorithm eng0{} for conv (f32[40,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[40,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:25:39.541695: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[64,32,11,11]{3,2,1,0}, u8[0]{0}) custom-call(f32[40,32,721,1440]{3,2,1,0}, f32[40,64,79,143]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:25:44.226885: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 5.685264069s
Trying algorithm eng0{} for conv (f32[64,32,11,11]{3,2,1,0}, u8[0]{0}) custom-call(f32[40,32,721,1440]{3,2,1,0}, f32[40,64,79,143]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


I0000 00:00:1721237151.392565  107628 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m4:22[0m 87s/step - kl_loss: 0.1028 - loss: 863.4670 - reconstruction_loss: 863.3642

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 238ms/step - kl_loss: 0.4448 - loss: 867.4455 - reconstruction_loss: 867.0007

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 238ms/step - kl_loss: 0.4822 - loss: 867.9163 - reconstruction_loss: 867.4340

2024-07-17 17:25:57.569571: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[23,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[23,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:26:22.482737: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 25.913233655s
Trying algorithm eng0{} for conv (f32[23,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[23,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:26:35.837930: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[64,32,11,11]{3,2,1,0}, u8[0]{0}) custom-call(f32[23,32,721,1440]{3,2,1,0}, f32[23,64,79,143]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:26:37.569529: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 2.731662731s
Trying algorithm eng0{} for conv (f32[64,32,11,11]{3,2,1,0}, u8[0]{0}) custom-call(f32[23,32,721,1440]{3,2,1,0}, f32[23,64,79,143]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17s/step - kl_loss: 0.4747 - loss: 867.8565 - reconstruction_loss: 867.3818  

2024-07-17 17:26:45.928836: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[8,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[8,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:26:53.943479: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 9.014722825s
Trying algorithm eng0{} for conv (f32[8,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[8,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m152s[0m 22s/step - kl_loss: 0.4701 - loss: 867.8207 - reconstruction_loss: 867.3505 - val_kl_loss: 0.1237 - val_loss: 864.3923 - val_reconstruction_loss: 864.2686


Epoch 2/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m1s[0m 352ms/step - kl_loss: 0.1242 - loss: 862.6890 - reconstruction_loss: 862.5648

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 238ms/step - kl_loss: 0.1238 - loss: 862.9498 - reconstruction_loss: 862.8262

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 236ms/step - kl_loss: 0.1237 - loss: 863.1104 - reconstruction_loss: 862.9867

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 200ms/step - kl_loss: 0.1236 - loss: 863.1921 - reconstruction_loss: 863.0686

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 277ms/step - kl_loss: 0.1235 - loss: 863.2412 - reconstruction_loss: 863.1177 - val_kl_loss: 0.1244 - val_loss: 864.3391 - val_reconstruction_loss: 864.2147


Epoch 3/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.1248 - loss: 863.0936 - reconstruction_loss: 862.9688

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1230 - loss: 863.1221 - reconstruction_loss: 862.9991

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1220 - loss: 863.1944 - reconstruction_loss: 863.0724

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.1211 - loss: 863.2537 - reconstruction_loss: 863.1326

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 276ms/step - kl_loss: 0.1205 - loss: 863.2893 - reconstruction_loss: 863.1687 - val_kl_loss: 0.1145 - val_loss: 864.3011 - val_reconstruction_loss: 864.1866


Epoch 4/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 332ms/step - kl_loss: 0.1147 - loss: 863.4042 - reconstruction_loss: 863.2895

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1126 - loss: 863.4469 - reconstruction_loss: 863.3343

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.1121 - loss: 863.4134 - reconstruction_loss: 863.3013

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.1112 - loss: 863.4111 - reconstruction_loss: 863.2999

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.1106 - loss: 863.4098 - reconstruction_loss: 863.2991 - val_kl_loss: 0.1146 - val_loss: 864.3949 - val_reconstruction_loss: 864.2803


Epoch 5/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 328ms/step - kl_loss: 0.1147 - loss: 863.6104 - reconstruction_loss: 863.4957

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1059 - loss: 863.5328 - reconstruction_loss: 863.4269

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.1078 - loss: 863.5624 - reconstruction_loss: 863.4545

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.1058 - loss: 863.6291 - reconstruction_loss: 863.5232

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 306ms/step - kl_loss: 0.1046 - loss: 863.6691 - reconstruction_loss: 863.5645 - val_kl_loss: 0.1475 - val_loss: 864.8248 - val_reconstruction_loss: 864.6774


Epoch 6/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m1s[0m 368ms/step - kl_loss: 0.1477 - loss: 864.1984 - reconstruction_loss: 864.0507

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1265 - loss: 863.9836 - reconstruction_loss: 863.8571

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1205 - loss: 863.8864 - reconstruction_loss: 863.7659

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.1153 - loss: 863.8137 - reconstruction_loss: 863.6984

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 276ms/step - kl_loss: 0.1121 - loss: 863.7701 - reconstruction_loss: 863.6580 - val_kl_loss: 0.1005 - val_loss: 864.2725 - val_reconstruction_loss: 864.1720


Epoch 7/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.1007 - loss: 863.5375 - reconstruction_loss: 863.4368

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0934 - loss: 863.3271 - reconstruction_loss: 863.2338

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0921 - loss: 863.2849 - reconstruction_loss: 863.1927

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0903 - loss: 863.3235 - reconstruction_loss: 863.2333

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0892 - loss: 863.3467 - reconstruction_loss: 863.2576 - val_kl_loss: 0.1024 - val_loss: 864.3152 - val_reconstruction_loss: 864.2128


Epoch 8/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 328ms/step - kl_loss: 0.1027 - loss: 863.0536 - reconstruction_loss: 862.9509

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0909 - loss: 863.1511 - reconstruction_loss: 863.0602

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0923 - loss: 863.2838 - reconstruction_loss: 863.1914

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0896 - loss: 863.3875 - reconstruction_loss: 863.2977

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0880 - loss: 863.4496 - reconstruction_loss: 863.3616 - val_kl_loss: 0.1631 - val_loss: 864.9451 - val_reconstruction_loss: 864.7819


Epoch 9/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.1636 - loss: 863.5262 - reconstruction_loss: 863.3626

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.1317 - loss: 863.7136 - reconstruction_loss: 863.5820

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.1225 - loss: 863.8107 - reconstruction_loss: 863.6882

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.1151 - loss: 863.8016 - reconstruction_loss: 863.6866

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 276ms/step - kl_loss: 0.1106 - loss: 863.7962 - reconstruction_loss: 863.6856 - val_kl_loss: 0.0919 - val_loss: 864.1982 - val_reconstruction_loss: 864.1064


Epoch 10/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.0918 - loss: 864.3002 - reconstruction_loss: 864.2084

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0870 - loss: 864.0564 - reconstruction_loss: 863.9694

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0852 - loss: 863.8790 - reconstruction_loss: 863.7938

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0830 - loss: 863.6929 - reconstruction_loss: 863.6098

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 276ms/step - kl_loss: 0.0818 - loss: 863.5812 - reconstruction_loss: 863.4994 - val_kl_loss: 0.0800 - val_loss: 864.1505 - val_reconstruction_loss: 864.0705


Epoch 11/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 328ms/step - kl_loss: 0.0801 - loss: 863.4833 - reconstruction_loss: 863.4033

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0776 - loss: 863.3242 - reconstruction_loss: 863.2467

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0759 - loss: 863.2819 - reconstruction_loss: 863.2061

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0748 - loss: 863.2762 - reconstruction_loss: 863.2014

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 276ms/step - kl_loss: 0.0742 - loss: 863.2728 - reconstruction_loss: 863.1987 - val_kl_loss: 0.0634 - val_loss: 864.1230 - val_reconstruction_loss: 864.0597


Epoch 12/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 328ms/step - kl_loss: 0.0636 - loss: 862.8126 - reconstruction_loss: 862.7491

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0643 - loss: 862.9042 - reconstruction_loss: 862.8398

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0639 - loss: 863.0123 - reconstruction_loss: 862.9484

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0647 - loss: 863.0692 - reconstruction_loss: 863.0045

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0653 - loss: 863.1033 - reconstruction_loss: 863.0381 - val_kl_loss: 0.0291 - val_loss: 864.9606 - val_reconstruction_loss: 864.9315


Epoch 13/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 328ms/step - kl_loss: 0.0291 - loss: 864.0525 - reconstruction_loss: 864.0234

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0679 - loss: 864.4945 - reconstruction_loss: 864.4266

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0719 - loss: 864.4290 - reconstruction_loss: 864.3571

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0757 - loss: 864.4130 - reconstruction_loss: 864.3372

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0780 - loss: 864.4033 - reconstruction_loss: 864.3253 - val_kl_loss: 0.0570 - val_loss: 864.1862 - val_reconstruction_loss: 864.1292


Epoch 14/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m1s[0m 336ms/step - kl_loss: 0.0572 - loss: 863.1317 - reconstruction_loss: 863.0745

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0622 - loss: 863.1326 - reconstruction_loss: 863.0704

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0639 - loss: 863.1470 - reconstruction_loss: 863.0831

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0650 - loss: 863.1766 - reconstruction_loss: 863.1116

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 276ms/step - kl_loss: 0.0656 - loss: 863.1944 - reconstruction_loss: 863.1287 - val_kl_loss: 0.0654 - val_loss: 864.0660 - val_reconstruction_loss: 864.0006


Epoch 15/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m1s[0m 338ms/step - kl_loss: 0.0656 - loss: 862.8549 - reconstruction_loss: 862.7894

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0653 - loss: 862.8738 - reconstruction_loss: 862.8086

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0647 - loss: 862.9755 - reconstruction_loss: 862.9108

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0648 - loss: 863.0292 - reconstruction_loss: 862.9644

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0649 - loss: 863.0615 - reconstruction_loss: 862.9966 - val_kl_loss: 0.0570 - val_loss: 864.0812 - val_reconstruction_loss: 864.0242


Epoch 16/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 330ms/step - kl_loss: 0.0570 - loss: 863.8434 - reconstruction_loss: 863.7864

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0633 - loss: 863.7384 - reconstruction_loss: 863.6751

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0615 - loss: 863.6578 - reconstruction_loss: 863.5964

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0633 - loss: 863.6138 - reconstruction_loss: 863.5505

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0645 - loss: 863.5875 - reconstruction_loss: 863.5229 - val_kl_loss: 0.0250 - val_loss: 865.1282 - val_reconstruction_loss: 865.1031


Epoch 17/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.0251 - loss: 863.9148 - reconstruction_loss: 863.8897

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0430 - loss: 863.8891 - reconstruction_loss: 863.8461

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0469 - loss: 863.8857 - reconstruction_loss: 863.8389

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0502 - loss: 863.8121 - reconstruction_loss: 863.7620

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0522 - loss: 863.7679 - reconstruction_loss: 863.7158 - val_kl_loss: 0.0479 - val_loss: 864.1826 - val_reconstruction_loss: 864.1346


Epoch 18/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.0479 - loss: 863.9459 - reconstruction_loss: 863.8979

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 235ms/step - kl_loss: 0.0547 - loss: 863.5126 - reconstruction_loss: 863.4579

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0549 - loss: 863.3862 - reconstruction_loss: 863.3312

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0562 - loss: 863.3948 - reconstruction_loss: 863.3386

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0570 - loss: 863.4001 - reconstruction_loss: 863.3431 - val_kl_loss: 0.0471 - val_loss: 864.0865 - val_reconstruction_loss: 864.0394


Epoch 19/50


[1m1/4[0m [32m━━━━━[0m[37m━━━━━━━━━━━━━━━[0m [1m0s[0m 329ms/step - kl_loss: 0.0472 - loss: 863.3409 - reconstruction_loss: 863.2937

[1m2/4[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0519 - loss: 863.1213 - reconstruction_loss: 863.0695

[1m3/4[0m [32m━━━━━━━━━━━━━━━[0m[37m━━━━━[0m [1m0s[0m 234ms/step - kl_loss: 0.0520 - loss: 863.1355 - reconstruction_loss: 863.0835

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step - kl_loss: 0.0536 - loss: 863.1824 - reconstruction_loss: 863.1288

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 275ms/step - kl_loss: 0.0545 - loss: 863.2106 - reconstruction_loss: 863.1561 - val_kl_loss: 0.0359 - val_loss: 864.3295 - val_reconstruction_loss: 864.2936


cp: target 'f/vae.weights_20000101.h5' is not a directory


2024-07-17 17:27:25.185260: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[32,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:28:00.237504: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 36.052307632s
Trying algorithm eng0{} for conv (f32[32,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


[1m1/2[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m47s[0m 48s/step - kl_loss: 0.0654 - loss: 864.0977 - reconstruction_loss: 864.0323

2024-07-17 17:28:10.414283: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[16,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[16,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


2024-07-17 17:28:27.441377: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 18.02715484s
Trying algorithm eng0{} for conv (f32[16,32,721,1440]{3,2,1,0}, u8[0]{0}) custom-call(f32[16,64,79,143]{3,2,1,0}, f32[64,32,11,11]{3,2,1,0}), window={size=11x11 stride=9x10}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...


[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25s/step - kl_loss: 0.0654 - loss: 864.0366 - reconstruction_loss: 863.9712 

[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m73s[0m 25s/step - kl_loss: 0.0655 - loss: 864.0163 - reconstruction_loss: 863.9508


Test loss: {'loss': 863.9100341796875, 'reconstruction_loss': 0.0655144676566124, 'kl_loss': 863.9755859375}


!If DVC froze, see `hardlink_lock` in <[36mhttps://man.dvc.org/config#core[39m>                                                                      

[31mERROR[39m: stage working dir '/home/lobielodan/parsl_mpi/run_on_cluster/cvae-weather-ensemble/f' does not exist


[0m

!If DVC froze, see `hardlink_lock` in <[36mhttps://man.dvc.org/config#core[39m>                                                                      

[31mERROR[39m: Cached output(s) outside of DVC project: /vae.weights.h5. See <[36mhttps://dvc.org/doc/user-guide/data-management/importing-external-data[39m> for more info.


[0m

fatal: pathspec '+' did not match any files


fatal: /vae.weights.h5: '/vae.weights.h5' is outside repository at '/home/lobielodan/parsl_mpi'


!If DVC froze, see `hardlink_lock` in <[36mhttps://man.dvc.org/config#core[39m>                                                                      !Collecting                                            |0.00 [00:00,    ?entry/s]Collecting                                            |0.00 [00:00,    ?entry/s]
!PushingPushing
Everything is up to date.


[0m

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .dvc/config[m
	[31mmodified:   CVAE_example.ipynb[m
	[31mmodified:   CVAE_log.ipynb[m
	[31mmodified:   CVAE_training.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31m.ipynb_checkpoints/[m
	[31mgefs_data/[m
	[31mmodel_dir/[m

no changes added to commit (use "git add" and/or "git commit -a")




Everything up-to-date


rm: cannot remove '/vae.weights.h5': No such file or directory
rm: cannot remove 'model_dir': Is a directory
rm: cannot remove '+': No such file or directory
rm: cannot remove 'f/vae.weights_20000101.h5': No such file or directory


Memory usage after training: {'current': 6921984, 'peak': 17629519872}
