# Exercise 1: Train a Small Encoder and Introduction to _hls4ml_

## Objectives
In this exercise, you will:
1. Train a simplified version of the Autoencoder
2. Learn the basics of _hls4ml_

## Instructions
Complete the code cells marked with `# TODO` comments. Follow the hints provided.

## Part 1: Environment Setup and Data Loading
Run the following cells to set up the environment (no changes needed).

In [None]:
# Load libraries and import packages (no changes needed)

# General imports
import os

# Numpy and plotting
import numpy as np
import matplotlib.pyplot as plt

# TimeSeries to hold our data
from gwpy.timeseries import TimeSeries

# Keras for the model
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Activation, Dropout, Flatten, Reshape, Input, InputLayer
from keras.optimizers import Nadam, Adam, SGD
from keras import regularizers

# Local file with some useful methods
from utils import *

# hls4ml
import hls4ml

# Set the correct libraries path for hls4ml
os.environ['XILINX_HLS']    = '/opt/tools/Xilinx/Vitis_HLS/2023.2'
os.environ['XILINX_VIVADO'] = '/opt/tools/Xilinx/Vivado/2023.2'
os.environ['XILINX_VITIS']  = '/opt/tools/Xilinx/Vitis/2023.2'
os.environ['PATH'] = os.environ["PATH"] + ":" \
                   + os.environ['XILINX_HLS'] + "/bin:" \
                   + os.environ['XILINX_VIVADO'] + "/bin:" \
                   + os.environ['XILINX_VITIS'] + "/bin:"

print("Environment setup correctly!")

Load the data needed to train the encoder.

In the first exercise you analyzed one full second of data; here we only want to analyze 64 input data points and compress them down to 8 dimensions.
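
As a rough illustration of what "64 input data points" means for the data shape, the sketch below cuts a synthetic one-second signal into non-overlapping windows of 64 samples. The sample rate, the synthetic signal and the non-overlapping split are all assumptions made for this example; the real loading code (and any helper in `utils.py`) may work differently.

```python
import numpy as np

# Hypothetical stand-in for one second of data (the 4096 Hz sample rate is an assumption)
sample_rate = 4096
signal = np.arange(sample_rate, dtype=np.float32)

window = 64  # number of inputs of the small encoder

# Drop any trailing samples that do not fill a whole window, then reshape
n_windows = signal.size // window
windows = signal[: n_windows * window].reshape(n_windows, window)

print(windows.shape)  # (64, 64): 64 windows of 64 samples each
```

Each row of `windows` is one training sample with 64 values, which is the kind of shape the small encoder expects as input.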

In [None]:
# TODO: Load the input data
chunks = None  # Replace with correct code

# Hint: We want to train a small encoder that goes from 64 inputs down to 8 outputs, so the shape of the data must be adjusted accordingly.
# Hint: You can reuse the code from the previous exercise or look in utils.py if you find a useful method...
# Hint: remember the input directory is "/data/input_data/AI_INFN/gwdata".

print(f"Loaded {chunks.shape[0]} samples with shape: {chunks[0].shape}")

## Part 2: Train a small encoder
FPGAs have limited resources: the more resources your design requires, the longer _hls4ml_ will take to produce the firmware. <br>
Additionally, the FPGA board that we are targeting for this exercise (`AMD Alveo U55c`) is quite large, which further increases the compilation time.

### Task 2.1: Define the Autoencoder
To fit within today's hands-on session, we will focus on a small encoder made of fully connected (dense) layers, with no convolutions:
- 64 inputs
- Few hidden layers
- 8 "encoded" outputs
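
To get a feeling for how "small" such an encoder is, you can count its trainable parameters by hand. The sketch below does this for a hypothetical stack of dense layers (64 → 32 → 16 → 8); the hidden layer sizes are only an example, not the required solution.

```python
def dense_params(layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases of one layer
    return total

# Hypothetical encoder: 64 inputs -> 32 -> 16 -> 8 encoded outputs
small_encoder = [64, 32, 16, 8]
print(dense_params(small_encoder))  # 2744
```

You can compare this number with the "Total params" line that Keras prints in `encoder.summary()` once your model is defined.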

**Important:** Remember to name all your layers, it will be useful when optimizing the model! <br>
Example:
```python
model.add(
    Dense(
            64,                 # N Neurons
            activation='tanh',  # Activation
            name="encoder1"     # Name of the layer
    )
)
```

In [None]:
# TODO: Define a "small" autoencoder which has 64 inputs and latent dimension of 8

# Hint: You can re-use the same model structure seen in the previous exercise...
#       but remember to update the layers and the neurons to get a small model!

def autoencoder_model( '''YOUR CODE HERE''' ):

    # ====== Encoder ======
    encoder = Sequential(name="encoder")
    # YOUR CODE HERE

    # ====== Decoder ======
    decoder = Sequential(name="decoder")
    # YOUR CODE HERE

    # ====== Full Autoencoder ======
    # YOUR CODE HERE
    autoencoder = Model(inputs=input_signal, outputs=decoded, name="autoencoder")
    autoencoder.compile(optimizer=Adam(learning_rate=1e-4), loss='mse', metrics=['mse'])

    # Return both the full model and the encoder/decoder models
    return autoencoder, decoder, encoder

In [None]:
# TODO: declare your model and visualize it

# Define model
# YOUR CODE HERE
(autoencoder, decoder, encoder) = None  # Replace with correct code

# And visualize it
print("\n - AUTOENCODER -")
autoencoder.summary()
print("\n - ENCODER -")
encoder.summary()
print("\n - DECODER -")
decoder.summary()

### Task 2.2: Train the Autoencoder

In [None]:
# TODO: train the model

# Hint: we can limit the training to a few epochs (~30) as we don't particularly care about
#       the results, we only want to compare its performance when run on CPU and on FPGA

# Actual training
history = autoencoder.fit( ''' YOUR CODE HERE ''' ) # Replace with correct code

### Task 2.3: Save the Autoencoder
We will check its performance later, for now we just save the model.

In [None]:
# Ensure output directory exists (no changes needed)
os.makedirs('small_model', exist_ok=True)

# Save all three models
encoder    .save("small_model/small_encoder.keras"    , save_format="keras_v3")
decoder    .save("small_model/small_decoder.keras"    , save_format="keras_v3")
autoencoder.save("small_model/small_autoencoder.keras", save_format="keras_v3")

We also save a small subset of the input data that we can use later to test our converted model:

In [None]:
# Save 2000 events for testing (no changes needed)
X_test = chunks[:2000]
np.save('X_test.npy', X_test)

print(f"{X_test.shape[0]} test events (with shape {X_test[0].shape}) have been saved to X_test.npy")

## Part 3: _hls4ml_ Basics
We will now use the _hls4ml_ library to convert our encoder into low-latency firmware to be run on the FPGA.
This process has a few steps:
-  Define an hls4ml config
-  Convert the model using the config
-  Compile
-  Build the firmware

For additional details:
- here is the official documentation: https://fastmachinelearning.org/hls4ml/index.html
- here is the GitHub page: https://github.com/fastmachinelearning/hls4ml

### Task 3.1: _hls4ml_ Configuration Files
We first define the _hls4ml_ config files needed to translate our model into a firmware. <br>
We need two configs:
- `hls_config`: to control the conversion options of the model
   - this is where most of the optimization will happen
- `main_config`: which contains
   - the details of the FPGA board we are targeting
   - the hls_config
   - the model
   - some I/O options
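
One example of the kind of option that `hls_config` controls is the `ReuseFactor`, which trades parallelism for resources. The sketch below is only a first-order estimate, assuming each multiplier is reused `reuse_factor` times within a dense layer; it ignores how the HLS tools actually map multiplications onto the FPGA, and the layer size (64 × 32) is hypothetical.

```python
def multipliers_needed(n_in, n_out, reuse_factor):
    """First-order estimate: each multiplier is reused `reuse_factor` times."""
    n_mult = n_in * n_out                  # multiplications in one dense layer
    return -(-n_mult // reuse_factor)      # ceiling division

# Hypothetical dense layer with 64 inputs and 32 neurons
for rf in (1, 2, 4, 8):
    print(f"ReuseFactor={rf}: ~{multipliers_needed(64, 32, rf)} multipliers")
```

A larger reuse factor needs fewer multipliers in parallel, at the price of more clock cycles to complete the layer.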

In [None]:
# Define a default hls config and inspect it (no changes needed)
hls_config = hls4ml.utils.config_from_keras_model(encoder, granularity='model')

print("="*20, "HLS Config", "="*20)
print_dict(hls_config)
print("="*50)

# You will see a few parameters that can be customized:
#  - `Precision`: the bit-wise representation of all the numbers in the model
#  - `ReuseFactor`: a mechanism to tune the firmware parallelism
#  - `Strategy`: how _hls4ml_ should do the conversion, optimizing for either resources or latency
# In the next exercise we will explore some of them and verify their impact on the model outputs.
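
To make the `Precision` parameter more concrete: a fixed-point type such as `ap_fixed<16,6>` uses 16 bits in total, 6 of which (including the sign) are for the integer part, leaving 10 bits for the fraction. The sketch below is a simplified quantizer that truncates and then saturates; it is meant to show the effect on the values, not to be a bit-exact emulation of the HLS type (whose default overflow behaviour, for instance, is different).

```python
import math

def to_fixed(value, total_bits=16, int_bits=6):
    """Quantize a float to a signed fixed-point grid (truncate, then saturate)."""
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                 # smallest representable increment
    lo = -(2.0 ** (int_bits - 1))            # most negative representable value
    hi = 2.0 ** (int_bits - 1) - step        # most positive representable value
    quantized = math.floor(value / step) * step
    return min(max(quantized, lo), hi)

print(to_fixed(0.12345))  # 0.123046875 (nearest lower multiple of 2**-10)
print(to_fixed(100.0))    # 31.9990234375 (clipped to the largest value)
```

Small differences like the first one are what you should expect to see when comparing the Keras predictions with the _hls4ml_ ones.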

In [None]:
# TODO: fix the missing parameters in the main_config

# Main config
main_cfg = hls4ml.converters.create_config(
    board = 'alveo-u55c',            # Target board                          - DO NOT CHANGE
    part = 'xcu55c-fsvh2892-2L-e',   # Target FPGA part                      - DO NOT CHANGE
    clock_period = 3,                # Clock period in ns (i.e. ~333 MHz)
    backend = 'VivadoAccelerator'    # Backend to convert the NN in firmware - DO NOT CHANGE
)

# Few more customizations
# YOUR CODE HERE
main_cfg['HLSConfig'] = None                                                      # Replace with correct config
main_cfg['IOType'] = 'io_parallel'                                                # DO NOT CHANGE
main_cfg['AcceleratorConfig']['Platform'] = 'xilinx_u55c_gen3x16_xdma_3_202210_1' # DO NOT CHANGE
main_cfg['KerasModel'] = None                                                     # Replace with correct model
main_cfg['OutputDir'] = 'small_model/' + None                                     # Replace with output name

# Inspect final config
print("="*20, "Main Config", "="*20)
print_dict(main_cfg)
print("="*50)

### Task 3.2: HLS Model and Inference
We are now ready to define an hls model, compile it and run some inference.

In [None]:
# TODO: declare the HLS model and run inference on the test data

# Define the hls model
hls_model = hls4ml.converters.keras_v2_to_hls(main_cfg)

# Compile it
hls_model.compile()

# Run inference
# Hint: use the hls4ml `predict(X)` method, just like in _Keras_; this will emulate
#       the performance of the model once converted into firmware with hls4ml

# YOUR CODE HERE
y_cpu = None # Replace with correct code
y_hls = None # Replace with correct code

# Save predictions for later comparisons
np.save('y_cpu.npy', y_cpu)
np.save('y_hls.npy', y_hls)
print("Predictions saved as 'y_cpu.npy' and 'y_hls.npy'.")

### Task 3.3: Calculate Comparison Metrics
Calculate metrics to compare the hardware and software predictions:
- Mean Squared Error (MSE) per sample
- Overall MSE
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
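
As a small worked example of these metrics, the sketch below evaluates them on two toy arrays (2 samples with 4 outputs each). The arrays `a` and `b` and all variable names are made up for illustration; in the exercise you will apply the same ideas to your own predictions.

```python
import numpy as np

# Toy arrays standing in for two sets of predictions: 2 samples, 4 outputs each
a = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 0.0, 0.0, 0.0]])
b = np.array([[1.0, 2.0, 3.0, 2.0],
              [1.0, 1.0, 0.0, 0.0]])

diff = a - b
mse_rows = np.mean(diff ** 2, axis=1)  # one value per sample: [1.0, 0.5]
mse_all = np.mean(mse_rows)            # 0.75
mae_all = np.mean(np.abs(diff))        # 0.5
rmse_all = np.sqrt(mse_all)            # ~0.866

print(mse_rows, mse_all, mae_all, rmse_all)
```

Note that the RMSE has the same units as the predictions themselves, which often makes it easier to interpret than the MSE.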

In [None]:
# TODO: Calculate comparison metrics
# 1. MSE per sample: mean of (y_cpu - y_hls)^2 along axis 1
# 2. Overall MSE: mean of all MSE per sample
# 3. MAE: mean of absolute differences
# 4. RMSE: square root of overall MSE

# YOUR CODE HERE
mse_per_sample = None  # Calculate MSE for each sample
overall_mse = None     # Calculate overall MSE
mae = None             # Calculate MAE
rmse = None            # Calculate RMSE

# Print the metrics
print(f"\n=== Software vs Hardware Reconstruction Metrics ===")
print(f"Overall MSE           : {overall_mse:.6f}")
print(f"Average MSE per sample: {np.mean(mse_per_sample):.6f}")
print(f"Min MSE               : {np.min(mse_per_sample):.6f}")
print(f"Max MSE               : {np.max(mse_per_sample):.6f}")
print(f"Mean Absolute Error   : {mae:.6f}")
print(f"RMSE                  : {rmse:.6f}")

### Task 3.4: Visualize the Comparison
Create visualizations to compare CPU and HLS predictions.

In [None]:
# Visualize comparison for first 3 samples (no changes needed)
n_examples = 3
fig, axes = plt.subplots(n_examples, 3, figsize=(15, 3*n_examples))

for i in range(n_examples):
    # CPU predictions
    axes[i, 0].plot(y_cpu[i])
    axes[i, 0].set_title(f'CPU Prediction {i}')
    axes[i, 0].set_ylabel('Amplitude')
    axes[i, 0].grid(True)
    
    # HLS predictions
    axes[i, 1].plot(y_hls[i])
    axes[i, 1].set_title(f'HLS Prediction {i}')
    axes[i, 1].grid(True)
    
    # Difference
    axes[i, 2].plot(y_cpu[i] - y_hls[i])
    axes[i, 2].set_title(f'Error (MSE: {mse_per_sample[i]:.4f})')
    axes[i, 2].grid(True)

plt.tight_layout()
plt.savefig('cpu_hls_comparison.png')
print("\nComparison plot saved as 'cpu_hls_comparison.png'")

In [None]:
# Visualize error distribution (no changes needed)
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.hist(mse_per_sample, bins=20, edgecolor='black')
plt.xlabel('MSE per Sample')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.boxplot(mse_per_sample)
plt.ylabel('MSE')
plt.title('MSE Distribution (Boxplot)')
plt.grid(True)

plt.tight_layout()
plt.savefig('error_distribution.png')
print("Error distribution plot saved as 'error_distribution.png'")

### Task 3.5: Synthesize the Model
The last step of this exercise is to run the synthesis of the model.

In this step `hls4ml` uses the Vivado/Vitis_HLS libraries to convert the Neural Network into the electrical circuit that will be loaded on the FPGA. After this step we will have a first estimate of how many resources our project will need and what will be the latency.
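
Synthesis reports usually express the latency in clock cycles, so it helps to convert it into time using the clock period set in the main config. The sketch below assumes the 3 ns clock period used in this exercise; the cycle count of 20 is purely hypothetical and is not a result from this model.

```python
def clock_mhz(clock_period_ns):
    """Clock frequency in MHz for a given clock period in ns."""
    return 1e3 / clock_period_ns

def latency_ns(n_cycles, clock_period_ns=3.0):
    """Latency in ns for a given number of clock cycles."""
    return n_cycles * clock_period_ns

print(f"{clock_mhz(3.0):.1f} MHz")  # 333.3 MHz
print(f"{latency_ns(20):.0f} ns")   # 60 ns for a hypothetical 20-cycle latency
```

Keep in mind that the post-synthesis numbers are estimates; the final timing is only known after the full implementation.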

**Important:** This step takes 5-10 minutes depending on how many layers/neurons you have used, so fix the cell below and run it now!

### Bonus:
The `build` method accepts several more parameters:
```python
hls_model.build(
    csim          = False, # Run C++ emulation
    synth         = True,  # Run Synthesis
    export        = True,  # Run Synthesis + Implementation + packaging into custom IP
    export_xo     = True,  # Build an .xo file for integration in larger project
    export_bitfile= True   # Build an .xclbin file for direct board deployment
)
```
Building a full `.xclbin` file might require several hours! In the last exercise today (`notebook3`) you will use some pre-compiled firmware to run inference on the real `Alveo U55c` boards.

In [None]:
# TODO: use the same output dir that you have used in the main_cfg

# Run the synthesis
hls_model.build(csim=False) # -- DO NOT CHANGE

# and print the reports
print("Resource usage and latency:")
print_report(''' YOUR CODE HERE ''')