# Exercise 3: FPGA Inference and Performance Analysis

## Objectives
In this exercise, you will:

0. Download the pre-built bitstream files for the encoder model (each file corresponds to a different hls4ml configuration)
1. Run inference on the FPGA hardware
2. Collect performance metrics (latency, throughput)
3. Compare hardware predictions with software predictions
4. Analyze the differences and understand the trade-offs

## Instructions
Complete the code cells marked with `# TODO` comments. Follow the hints provided.

## Part 0: Download all the pre-built bitstream files

In [None]:
%%bash

# Array of firmware files (bash array elements are separated by whitespace, not commas)
firmware_files=(
    "firmware-01_baseline.xclbin"
    "firmware-02_optimized_precision.xclbin"
    "firmware-03_extreme_precision.xclbin"
    "firmware-04_ultra_extreme_precision.xclbin"
    "firmware-05_reuse_2.xclbin"
    "firmware-06_reuse_4.xclbin"
    "firmware-07_reuse_8.xclbin"
    "firmware-09_resource_strategy.xclbin"
    "firmware-11_optimized_precision_reuse_2.xclbin"
    "firmware-12_optimized_precision_reuse_8.xclbin"
    "firmware-13_extreme_precision_reuse_8.xclbin"
    "firmware-14_extreme_precision_io_stream.xclbin"
    "firmware-15_area_optimized.xclbin"
    "firmware-16_resource_strategy_reuse_8.xclbin"
)

base_url="https://minio.131.154.98.45.myip.cloud.infn.it/public-data/firmware-hackathon"

# Download each firmware file if it doesn't exist
for firmware in "${firmware_files[@]}"; do
    if [ -f "$firmware" ]; then
        echo "Skipping $firmware (already exists)"
    else
        echo "Downloading $firmware..."
        wget -q "$base_url/$firmware"
    fi
done

## Part 1: Environment Setup
Run the following cells to set up the environment (no changes needed).

In [None]:
import os
os.environ["PATH"]=os.environ["PATH"]+":"+os.environ["BONDMACHINE_DIR"]
os.environ['XILINX_HLS'] = '/tools/Xilinx/Vitis_HLS/2023.2'
os.environ['XILINX_VIVADO'] = '/tools/Xilinx/Vivado/2023.2'
os.environ['XILINX_VITIS'] = '/tools/Xilinx/Vitis/2023.2'
os.environ['PATH']=os.environ["PATH"]+":"+os.environ['XILINX_HLS']+"/bin:"+os.environ['XILINX_VIVADO']+"/bin:"+os.environ['XILINX_VITIS']+"/bin:"
os.environ['XILINX_XRT'] = '/opt/xilinx/xrt'
os.environ['LD_LIBRARY_PATH'] = '/opt/xilinx/xrt/lib'

notebook_directory = os.path.abspath(os.path.dirname((os.environ["JPY_SESSION_NAME"])))
os.chdir(notebook_directory)

In [None]:
from utils import NeuralNetworkOverlay
import numpy as np
import matplotlib.pyplot as plt
from keras.models import load_model
import os

current_dir = os.getcwd()

inference_dir = os.path.join(current_dir, 'inference')
os.makedirs(inference_dir, exist_ok=True)

print("Loading test data from file...")
X_test = np.load(os.path.join(current_dir, 'X_test.npy'))
print(f"Test data shape: {X_test.shape}")

## Part 2: FPGA Hardware Inference

### Task 2.1: Load the FPGA Overlay
Complete the code to load the FPGA overlay.

In [None]:
# TODO: Load the FPGA overlay
# Hint: Use NeuralNetworkOverlay class with xclbin_name="myproject_kernel.xclbin"
# Store the overlay object in a variable called 'ol'

print("Loading FPGA overlay...")
# YOUR CODE HERE
ol = None  # Replace None with the correct code

print("FPGA overlay loaded...")
print(ol)

# You will discover that ol is an instance of the custom class NeuralNetworkOverlay, which inherits from the PYNQ Overlay class.
# You may ask yourself: what is PYNQ? PYNQ is a Python-based framework for Xilinx FPGAs that lets you control FPGA hardware directly from Python.
# The overlay loads the FPGA bitstream file by default, and you can inspect the properties of this object by running the next cell.

In [None]:
ol.ip_dict

### Task 2.2: Run Hardware Inference with Performance Profiling

Use the FPGA overlay to run inference on the test data. The `predict` method returns three values:
- Hardware predictions
- Latency (time taken)
- Throughput (inferences per second)

**Important:** Set `profile=True` to enable performance profiling.

In [None]:
# TODO: Run inference on FPGA hardware with profiling
# Hint: Use ol.predict() with parameters: X_test, output_shape, profile=True
# Store the three return values in variables: y_hw, latency, throughput

output_shape = (X_test.shape[0], 8) # Why 8? Where does the 8 come from?

# YOUR CODE HERE
y_hw = None       # Replace with correct code
latency = None    # Replace with correct code
throughput = None # Replace with correct code

### Task 2.3: Display Hardware Predictions

Print the hardware predictions to verify they were computed correctly.

In [None]:
# TODO: Display the hardware predictions
# YOUR CODE HERE


### Task 2.4: Analyze Performance Metrics

Print out the latency and throughput information in a clear format.
- Latency: time taken for all inferences
- Throughput: number of inferences per second
- Calculate the average time per single inference
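
The relationship between these three quantities can be sketched with made-up numbers. The values below are hypothetical and are not measurements from any board; the variable names are chosen for illustration only.

```python
# Hypothetical numbers for illustration only; they are not measurements.
n_samples = 10_000        # number of inferences in the batch
total_latency_s = 0.05    # total wall-clock time for the whole batch, in seconds

# Throughput is the number of inferences completed per second.
throughput_per_s = n_samples / total_latency_s

# Average time per inference, converted from seconds to microseconds.
time_per_inference_us = total_latency_s / n_samples * 1e6

print(f"Throughput        : {throughput_per_s:,.0f} inferences/s")
print(f"Time per inference: {time_per_inference_us:.2f} us")
```

Note that the average time per inference is simply the reciprocal of the throughput, expressed in microseconds.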

In [None]:
# TODO: Print performance metrics
# Calculate and print:
# 1. Total time for all inferences (latency)
# 2. Throughput (inferences per second)
# 3. Average time per single inference (in microseconds)

print("\n=== FPGA Hardware Performance Metrics ===")
# YOUR CODE HERE
# Hint: throughput is already inferences/second
# Hint: time per inference = latency / number_of_samples
# Hint: Convert to microseconds by multiplying by 1e6


### Cleanup: Free the FPGA Overlay

After completing inference, free the FPGA resources.

In [None]:
# Free the FPGA overlay (no changes needed)
ol.free_overlay()
print("FPGA overlay freed.")

## Part 3: Software (CPU) Inference

### Task 3.1: Load the Software Model and Run Inference

Load the Keras model and run inference on the CPU for comparison.

In [None]:
# TODO: Load the Keras encoder model
# Hint: Use load_model() with filename 'small_encoder.keras'

# YOUR CODE HERE
encoder = None  # Replace with correct code

In [None]:
# TODO: Run CPU inference
# Hint: Use encoder.predict() on X_test
# Store the result in y_cpu

# YOUR CODE HERE
y_cpu = None  # Replace with correct code

## Part 4: Compare Hardware vs Software Predictions

### Task 4.1: Convert Hardware Predictions to NumPy Array

The hardware predictions need to be converted to a NumPy array for comparison.
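
As a minimal sketch of what `np.asarray()` does, the snippet below uses a nested Python list as a stand-in for an array-like result. The actual type returned by the overlay is not shown here, so `raw` is purely a hypothetical placeholder.

```python
import numpy as np

# Stand-in for an array-like object that is not yet a NumPy array (hypothetical data).
raw = [[0.1, 0.2], [0.3, 0.4]]

arr = np.asarray(raw)
print(type(arr).__name__, arr.shape, arr.dtype)  # ndarray (2, 2) float64

# If the input is already an ndarray, np.asarray returns it without copying.
already = np.zeros((2, 2))
same_object = np.asarray(already) is already
print(same_object)  # True
```

Because `np.asarray()` avoids a copy when the input is already an `ndarray`, it is safe to call even when you are unsure of the input type.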

In [None]:
# TODO: Convert y_hw to a NumPy array
# Hint: Use np.asarray()
# Store the result in y_hls

# YOUR CODE HERE
y_hls = None  # Replace with correct code

### Task 4.2: Calculate Comparison Metrics

Calculate metrics to compare the hardware and software predictions:
- Mean Squared Error (MSE) per sample
- Overall MSE
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
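
The four metrics can be sketched on two tiny synthetic arrays. The data and the variable names (`y_ref`, `y_other`, and so on) are invented for illustration and deliberately differ from the names used in the exercise cell.

```python
import numpy as np

# Two synthetic "prediction" arrays: 2 samples x 4 outputs each (made-up values).
y_ref = np.array([[1.0, 2.0, 3.0, 4.0],
                  [0.0, 0.0, 0.0, 0.0]])
y_other = np.array([[1.0, 2.0, 3.0, 2.0],
                    [1.0, -1.0, 1.0, -1.0]])

diff = y_ref - y_other

mse_rows = np.mean(diff ** 2, axis=1)  # one MSE value per sample (row)
mse_all = np.mean(mse_rows)            # average of the per-sample values
mae_all = np.mean(np.abs(diff))        # mean absolute difference over all elements
rmse_all = np.sqrt(mse_all)            # same units as the predictions

print("MSE per sample:", mse_rows)
print("Overall MSE   :", mse_all)
print("MAE           :", mae_all)
print("RMSE          :", rmse_all)
```

Averaging the per-sample MSE gives the same result as averaging over every element only because all rows have the same length, which is the case for a fixed-size model output.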

In [None]:
# TODO: Calculate comparison metrics
# 1. MSE per sample: mean of (y_cpu - y_hls)^2 along axis 1
# 2. Overall MSE: mean of all MSE per sample
# 3. MAE: mean of absolute differences
# 4. RMSE: square root of overall MSE

# YOUR CODE HERE
mse_per_sample = None  # Calculate MSE for each sample
overall_mse = None     # Calculate overall MSE
mae = None             # Calculate MAE
rmse = None            # Calculate RMSE

# Print the metrics
print(f"\n=== Software vs Hardware Reconstruction Metrics ===")
print(f"Overall MSE           : {overall_mse:.6f}")
print(f"Average MSE per sample: {np.mean(mse_per_sample):.6f}")
print(f"Min MSE               : {np.min(mse_per_sample):.6f}")
print(f"Max MSE               : {np.max(mse_per_sample):.6f}")
print(f"Mean Absolute Error   : {mae:.6f}")
print(f"RMSE                  : {rmse:.6f}")

### Task 4.3: Visualize the Comparison

Create visualizations to compare CPU and HLS predictions.

In [None]:
# Visualize comparison for first 3 samples (no changes needed)
n_examples = 3
fig, axes = plt.subplots(n_examples, 3, figsize=(15, 3*n_examples))

for i in range(n_examples):
    # CPU predictions
    axes[i, 0].plot(y_cpu[i])
    axes[i, 0].set_title(f'CPU Prediction {i}')
    axes[i, 0].set_ylabel('Amplitude')
    axes[i, 0].grid(True)
    
    # HLS predictions
    axes[i, 1].plot(y_hls[i])
    axes[i, 1].set_title(f'HLS Prediction {i}')
    axes[i, 1].grid(True)
    
    # Difference
    axes[i, 2].plot(y_cpu[i] - y_hls[i])
    axes[i, 2].set_title(f'Error (MSE: {mse_per_sample[i]:.4f})')
    axes[i, 2].grid(True)

plt.tight_layout()
plt.savefig('cpu_hls_comparison.png')
print("\nComparison plot saved as 'cpu_hls_comparison.png'")

In [None]:
# Visualize error distribution (no changes needed)
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.hist(mse_per_sample, bins=20, edgecolor='black')
plt.xlabel('MSE per Sample')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors')
plt.grid(True)

plt.subplot(1, 2, 2)
plt.boxplot(mse_per_sample)
plt.ylabel('MSE')
plt.title('MSE Distribution (Boxplot)')
plt.grid(True)

plt.tight_layout()
plt.savefig('error_distribution.png')
print("Error distribution plot saved as 'error_distribution.png'")

## Part 5: Latency Comparison - FPGA vs CPU

Now let's compare the inference latencies between FPGA and CPU to understand the performance benefits of hardware acceleration.

### Task 5.1: Measure CPU Inference Latency

Measure the time it takes for the CPU to perform inference on the same test data.
We'll use Python's `time` module to measure the elapsed time.
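
The timing pattern can be sketched as follows. This example swaps in `time.perf_counter()`, a monotonic, high-resolution clock from the same standard-library module, in place of the `time.time()` call suggested in the cell below; either works for this exercise. The function `fake_predict` is a hypothetical stand-in workload, not part of the exercise code.

```python
import time

def fake_predict(n):
    """Hypothetical stand-in workload; replace with the call you want to time."""
    return sum(i * i for i in range(n))

start = time.perf_counter()        # record the start time
result = fake_predict(100_000)     # run the work being measured
elapsed_s = time.perf_counter() - start  # elapsed wall-clock time in seconds

print(f"Elapsed: {elapsed_s:.6f} s")
```

A single measurement can be noisy, and the first call to a model may include one-time setup cost, so repeating the measurement a few times is a reasonable precaution.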

In [None]:
# TODO: Measure CPU inference latency
# 1. Import the time module
# 2. Record start time before prediction
# 3. Run encoder.predict(X_test) again
# 4. Record end time after prediction
# 5. Calculate cpu_latency = end_time - start_time

import time

print("Measuring CPU inference latency...")

# YOUR CODE HERE
start_time = None  # Record start time using time.time()
# Run prediction here
end_time = None    # Record end time using time.time()

cpu_latency = None # Calculate the difference

print(f"CPU inference completed in {cpu_latency:.6f} seconds")

### Task 5.2: Calculate Performance Metrics

Calculate key performance metrics to compare FPGA and CPU:
- Speedup factor (how many times faster is FPGA)
- Throughput for both platforms
- Time per inference for both platforms
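
A minimal sketch of these calculations, using hypothetical timings rather than real measurements. The helper `compare_latency` and its parameter names are invented for this example.

```python
def compare_latency(n_samples, fast_s, slow_s):
    """Return speedup plus throughput and time per inference for two platforms.

    fast_s and slow_s are total batch times in seconds (hypothetical inputs).
    """
    return {
        "speedup": slow_s / fast_s,               # > 1 means the first platform is faster
        "fast_throughput": n_samples / fast_s,    # inferences per second
        "slow_throughput": n_samples / slow_s,
        "fast_us": fast_s / n_samples * 1e6,      # microseconds per inference
        "slow_us": slow_s / n_samples * 1e6,
    }

# Made-up example: 10,000 samples, 0.05 s on one platform and 0.5 s on the other.
stats = compare_latency(n_samples=10_000, fast_s=0.05, slow_s=0.5)
for name, value in stats.items():
    print(f"{name:<16} {value:,.2f}")
```

A speedup below 1 would mean the platform passed as `fast_s` is actually the slower one, which can happen for small batches where fixed overheads dominate.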

In [None]:
# TODO: Calculate comparison metrics
# 1. Calculate speedup = cpu_latency / fpga_latency (use 'latency' variable from Part 2)
# 2. Calculate cpu_throughput = num_samples / cpu_latency
# 3. Calculate cpu_time_per_inference = (cpu_latency / num_samples) * 1e6 (in microseconds)
# 4. Calculate fpga_time_per_inference = (latency / num_samples) * 1e6 (in microseconds)

num_samples = X_test.shape[0]

# YOUR CODE HERE
speedup = None                    # Calculate speedup
cpu_throughput = None             # Calculate CPU throughput
cpu_time_per_inference = None     # Calculate CPU time per inference in μs
fpga_latency = latency            # FPGA latency from Part 2
fpga_throughput = throughput      # FPGA throughput from Part 2
fpga_time_per_inference = None    # Calculate FPGA time per inference in μs

# Print comparison table
print("\n" + "="*70)
print("LATENCY COMPARISON: FPGA vs CPU")
print("="*70)
print(f"\n{'Metric':<30} {'FPGA':<20} {'CPU':<20}")
print("-"*70)
print(f"{'Total Latency (seconds)':<30} {fpga_latency:<20.6f} {cpu_latency:<20.6f}")
print(f"{'Throughput (inferences/sec)':<30} {fpga_throughput:<20.2f} {cpu_throughput:<20.2f}")
print(f"{'Time per Inference (μs)':<30} {fpga_time_per_inference:<20.4f} {cpu_time_per_inference:<20.4f}")
print("-"*70)
print(f"{'SPEEDUP FACTOR':<30} {f'{speedup:.2f}x':<20} {'1.00x':<20}")
print("="*70)
print(f"\nThe FPGA is {speedup:.2f}x faster than the CPU for this inference task.")

### Task 5.3: Visualize Latency Comparison

Create visualizations to compare the performance metrics.

In [None]:
# TODO: Create comparison visualizations
# Create bar charts comparing FPGA and CPU performance

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Total Latency Comparison
platforms = ['FPGA', 'CPU']
latencies = [fpga_latency, cpu_latency]
colors = ['#2ecc71', '#e74c3c']

axes[0].bar(platforms, latencies, color=colors, alpha=0.7, edgecolor='black')
axes[0].set_ylabel('Latency (seconds)', fontsize=12)
axes[0].set_title('Total Inference Latency', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(latencies):
    axes[0].text(i, v, f'{v:.4f}s', ha='center', va='bottom', fontweight='bold')

# Plot 2: Throughput Comparison
throughputs = [fpga_throughput, cpu_throughput]
axes[1].bar(platforms, throughputs, color=colors, alpha=0.7, edgecolor='black')
axes[1].set_ylabel('Throughput (inferences/sec)', fontsize=12)
axes[1].set_title('Inference Throughput', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)
for i, v in enumerate(throughputs):
    axes[1].text(i, v, f'{v:.0f}', ha='center', va='bottom', fontweight='bold')

# Plot 3: Time per Inference Comparison
times_per_inf = [fpga_time_per_inference, cpu_time_per_inference]
axes[2].bar(platforms, times_per_inf, color=colors, alpha=0.7, edgecolor='black')
axes[2].set_ylabel('Time per Inference (μs)', fontsize=12)
axes[2].set_title('Time per Single Inference', fontsize=14, fontweight='bold')
axes[2].grid(axis='y', alpha=0.3)
for i, v in enumerate(times_per_inf):
    axes[2].text(i, v, f'{v:.2f}μs', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig('latency_comparison.png', dpi=300, bbox_inches='tight')
print("\nLatency comparison plot saved as 'latency_comparison.png'")
plt.show()

In [None]:
# Speedup visualization
fig, ax = plt.subplots(figsize=(8, 6))

# Create a horizontal bar showing speedup
y_pos = [0, 1]
performance = [1.0, speedup]
labels = ['CPU (Baseline)', f'FPGA ({speedup:.2f}x faster)']
colors_speedup = ['#e74c3c', '#2ecc71']

bars = ax.barh(y_pos, performance, color=colors_speedup, alpha=0.7, edgecolor='black', height=0.6)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=12)
ax.set_xlabel('Relative Performance (Speedup Factor)', fontsize=12)
ax.set_title('FPGA Speedup vs CPU', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels
for i, (bar, val) in enumerate(zip(bars, performance)):
    width = bar.get_width()
    ax.text(width, bar.get_y() + bar.get_height()/2, 
            f'  {val:.2f}x', 
            ha='left', va='center', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.savefig('speedup_comparison.png', dpi=300, bbox_inches='tight')
print("Speedup comparison plot saved as 'speedup_comparison.png'")
plt.show()

## Part 6: Simulated vs Real Latency Comparison

In this final part, you will compare the **simulated latency** (from HLS synthesis) with the **real measured latency** from the FPGA.

The simulation gives you the number of **clock cycles** required for the computation. To compare with real latency, you need to convert clock cycles to time using the FPGA clock frequency.

### Task 6.1: Get Simulated Latency and Clock Frequency

From your previous HLS synthesis notebooks, you should have the simulated latency in clock cycles.

**Note**: The FPGA clock frequency is typically available as a property of the overlay or is a known constant (e.g., 200 MHz).

In [None]:
# TODO: Input your simulated latency from HLS synthesis
# This value should come from your synthesis report (e.g., 52 clock cycles)

simulated_clock_cycles = None  # Replace with your value (e.g., 52)

# TODO: Get the FPGA clock frequency
# Option 1: Try to get it from the overlay object
# Check if ol has a clock frequency property (common properties: clock_freq, frequency_mhz)
# Option 2: Use a known constant (typically 100-300 MHz for these boards)

# Try to get frequency from overlay (uncomment and try different property names)
# fpga_clock_freq_mhz = ol.clock_freq  # or ol.frequency_mhz, or ol.clock_frequency

# Or set it manually if you know the frequency:
fpga_clock_freq_mhz = None  # Replace with your FPGA clock frequency in MHz (e.g., 200)

print(f"Simulated latency: {simulated_clock_cycles} clock cycles")
print(f"FPGA clock frequency: {fpga_clock_freq_mhz} MHz")

### Task 6.2: Calculate Theoretical Latency from Simulation

Convert the simulated clock cycles to time (microseconds).

**Formula**: 
- Clock period = 1 / frequency
- Time = clock_cycles × clock_period
- If frequency is in MHz, then: Time (μs) = clock_cycles / frequency_MHz
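
The conversion can be checked with a short sketch. It uses the illustrative values mentioned in this part (52 clock cycles, 200 MHz); these are examples, not the values for your design. The function names are invented for this example.

```python
import math

def cycles_to_us(clock_cycles, freq_mhz):
    """Shortcut form: with the frequency in MHz, the result is already in microseconds."""
    return clock_cycles / freq_mhz

def cycles_to_us_long(clock_cycles, freq_mhz):
    """Long form: compute the clock period in seconds, then convert to microseconds."""
    period_s = 1.0 / (freq_mhz * 1e6)   # clock period = 1 / frequency
    return clock_cycles * period_s * 1e6

short_us = cycles_to_us(52, 200)
long_us = cycles_to_us_long(52, 200)

print(short_us)                         # 0.26
print(math.isclose(short_us, long_us))  # True
```

Both forms agree because the factor of 1e6 from MHz cancels the factor of 1e6 from seconds to microseconds.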

In [None]:
# TODO: Calculate the theoretical latency per inference from simulation
# Formula: theoretical_latency_us = simulated_clock_cycles / fpga_clock_freq_mhz

# YOUR CODE HERE
theoretical_latency_us = None  # Calculate theoretical latency in microseconds

print(f"\nTheoretical latency (from simulation): {theoretical_latency_us:.4f} μs per inference")

### Task 6.3: Compare with Real Measured Latency

Compare the theoretical latency with the actual measured latency from Part 2.
Calculate the overhead factor.
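
As a sketch with hypothetical values (0.26 μs theoretical, 5.0 μs measured; neither is a real measurement), the overhead factor and the additional latency work out as follows.

```python
# Hypothetical values for illustration only.
theoretical_us = 0.26   # latency per inference derived from simulated clock cycles
measured_us = 5.0       # latency per inference measured on hardware

overhead = measured_us / theoretical_us    # how many times larger the real latency is
extra_us = measured_us - theoretical_us    # time not explained by computation alone

print(f"Overhead factor   : {overhead:.2f}x")
print(f"Additional latency: {extra_us:.2f} us")
```

The additional latency covers everything outside the simulated computation, such as data transfer and protocol handling.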

In [None]:
# TODO: Compare simulated vs real latency
# Use fpga_time_per_inference from Part 5
# Calculate overhead_factor = real_latency / theoretical_latency

real_latency_us = fpga_time_per_inference  # From Part 5

# YOUR CODE HERE
overhead_factor = None  # Calculate the overhead factor

# Print comparison
print("\n" + "="*70)
print("SIMULATED vs REAL LATENCY COMPARISON")
print("="*70)
print(f"\n{'Metric':<35} {'Value':<20}")
print("-"*70)
print(f"{'Simulated clock cycles':<35} {simulated_clock_cycles:<20}")
print(f"{'FPGA clock frequency (MHz)':<35} {fpga_clock_freq_mhz:<20}")
print(f"{'Theoretical latency (μs)':<35} {theoretical_latency_us:<20.4f}")
print(f"{'Real measured latency (μs)':<35} {real_latency_us:<20.4f}")
print("-"*70)
print(f"{'Overhead factor':<35} {f'{overhead_factor:.2f}x':<20}")
print(f"{'Additional latency (μs)':<35} {(real_latency_us - theoretical_latency_us):<20.4f}")
print("="*70)
print(f"\nThe real latency is {overhead_factor:.2f}x higher than the simulated latency.")

### Task 6.4: Visualize Simulated vs Real Latency

In [None]:
# Visualization of simulated vs real latency
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Latency comparison
categories = ['Simulated\n(Theoretical)', 'Real\n(Measured)']
latencies = [theoretical_latency_us, real_latency_us]
colors = ['#3498db', '#e74c3c']

bars = axes[0].bar(categories, latencies, color=colors, alpha=0.7, edgecolor='black', width=0.6)
axes[0].set_ylabel('Latency per Inference (μs)', fontsize=12)
axes[0].set_title('Simulated vs Real Latency', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add value labels
for i, (bar, val) in enumerate(zip(bars, latencies)):
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{val:.4f}μs',
                ha='center', va='bottom', fontweight='bold')

# Plot 2: Breakdown of latency components
components = ['Computation\n(Simulated)', 'Overhead\n(Data Transfer,\nProtocol, etc.)']
values = [theoretical_latency_us, real_latency_us - theoretical_latency_us]
colors_breakdown = ['#3498db', '#f39c12']

bars2 = axes[1].bar(components, values, color=colors_breakdown, alpha=0.7, edgecolor='black', width=0.6)
axes[1].set_ylabel('Latency (μs)', fontsize=12)
axes[1].set_title('Latency Breakdown', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

# Add value labels
for i, (bar, val) in enumerate(zip(bars2, values)):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                f'{val:.4f}μs',
                ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig('simulated_vs_real_latency.png', dpi=300, bbox_inches='tight')
print("\nSimulated vs real latency plot saved as 'simulated_vs_real_latency.png'")
plt.show()