# **EQODEC: A Carbon-Aware Deep Learning Framework for Sustainable Video Compression**

---

A core component of our EQODEC framework is the careful selection and structured preparation of training and benchmarking datasets. Because EQODEC aims to balance reconstruction quality with environmental efficiency, we require datasets that provide both the scale needed for deep representation learning and the diversity necessary to evaluate reconstruction fidelity across modern video formats. Two dedicated notebooks—`01_data_collection.ipynb` and `02_data_preprocessing.ipynb`—outline how we acquired and prepared these datasets.

### **Data Collection**

**Rationale for Dataset Selection**

Our overarching objective in EQODEC is to learn compact latent representations that preserve visual fidelity while minimizing energy expenditure during compression and inference. Achieving this goal requires data that captures a wide spectrum of temporal dynamics, spatial textures, and scene complexities.

To meet these requirements, we selected two complementary datasets:

1. **Vimeo-90K Septuplet** — our primary training dataset.
2. **Ultra Video Group (UVG)** — our benchmarking dataset for evaluating reconstruction performance.

Each dataset plays a critical role: Vimeo-90K supports learning temporal and spatial redundancy, while UVG challenges our model with high-resolution, high-entropy content typical of real-world video applications.

**Training Dataset: Vimeo-90K (Septuplet)**

We selected the **Vimeo-90K Septuplet dataset** as the main training source due to its extensive archive of aligned seven-frame sequences. With **91,701 septuplets**, each at a fixed resolution of **448 × 256 pixels**, the dataset provides enough volume and consistency for stable training.

The temporal coherence across each septuplet allows our model to learn temporal redundancies essential for motion compensation—an important mechanism for reducing bitrate while maintaining perceptual quality.

Because this dataset is only available through manual download, we obtained the Septuplet package from the authors’ site and placed it into our project under `data/raw/vimeo/sequences`.

**Benchmarking Dataset: Ultra Video Group (UVG)**

To benchmark EQODEC meaningfully against traditional codecs such as H.264 and H.265, we required a dataset that presents substantial compression challenges. We therefore selected the **UVG dataset**, which contains **1920×1080 Full HD videos**, many of them recorded at **high frame rates (up to 120 fps)**.

Sequences such as *HoneyBee*, *Jockey*, and *ShakeNDry* provide intricate textures and rapid motion, enabling us to assess both reconstruction quality and environmental efficiency under demanding conditions.

We downloaded the sequences manually in raw YUV format and stored them under `data/raw/uvg/`.
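
Raw YUV files carry no header, so frame boundaries have to be computed from the resolution and pixel format. The sketch below shows one way to do that for planar YUV 4:2:0. The 4:2:0 layout, the 8-bit default, and the helper names (`yuv420p_frame_bytes`, `count_frames`, `read_luma_plane`) are assumptions made for illustration and are not taken from the EQODEC notebooks.

```python
import os


def yuv420p_frame_bytes(width, height, bytes_per_sample=1):
    """Size in bytes of one planar YUV 4:2:0 frame (assumed layout)."""
    luma = width * height
    # Two chroma planes (U and V), each subsampled by 2 in both dimensions.
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * bytes_per_sample


def count_frames(path, width, height, bytes_per_sample=1):
    """Number of whole frames stored in a headerless raw YUV file."""
    frame_bytes = yuv420p_frame_bytes(width, height, bytes_per_sample)
    size = os.path.getsize(path)
    if size % frame_bytes != 0:
        raise ValueError("file size is not a multiple of the frame size")
    return size // frame_bytes


def read_luma_plane(path, width, height, frame_index, bytes_per_sample=1):
    """Return the raw Y (luma) plane of one frame as bytes."""
    frame_bytes = yuv420p_frame_bytes(width, height, bytes_per_sample)
    luma_bytes = width * height * bytes_per_sample
    with open(path, "rb") as f:
        # Frames are stored back to back, so the offset is a simple product.
        f.seek(frame_index * frame_bytes)
        data = f.read(luma_bytes)
    if len(data) != luma_bytes:
        raise ValueError("frame index is past the end of the file")
    return data
```

Under these assumptions, one 8-bit 1920×1080 frame occupies 3,110,400 bytes; a 10-bit file stored in 16-bit samples would need `bytes_per_sample=2`.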

**Directory Organization**

To maintain clarity and ensure experimental reproducibility, we organized our dataset using a structured directory hierarchy. Raw videos and frames are stored separately from processed training assets such as split indices, enabling a clean and modular data pipeline.

```
eqodec/
├── notebooks/
│   ├── 01_data_collection.ipynb
│   └── 02_data_preprocessing.ipynb
└── data/
    ├── raw/
    │   ├── vimeo/
    │   │   └── sequences/
    │   └── uvg/
    └── processed/
```
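
The layout above can be created programmatically so that a fresh checkout starts from the same structure. This is a minimal sketch; the function name `ensure_layout` is hypothetical, and the notebooks may well create these folders differently or by hand.

```python
from pathlib import Path


def ensure_layout(project_root):
    """Create the directory tree shown above if it does not exist yet."""
    root = Path(project_root)
    directories = [
        root / "notebooks",
        root / "data" / "raw" / "vimeo" / "sequences",
        root / "data" / "raw" / "uvg",
        root / "data" / "processed",
    ]
    for directory in directories:
        # parents=True builds intermediate folders; exist_ok=True makes
        # the call safe to repeat.
        directory.mkdir(parents=True, exist_ok=True)
    return directories
```

Calling `ensure_layout("eqodec")` is idempotent, so it can sit at the top of each notebook without side effects on an existing tree.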

---

### **Data Preprocessing**

In the next stage of preparation, implemented in `02_data_preprocessing.ipynb`, we generate a lightweight subset of the Vimeo-90K dataset. Given its size (≈82 GB), training on the entire dataset is computationally expensive and environmentally intensive. To reduce overhead, we retain only **10% of the sequences**, sampled pseudo-randomly with a fixed seed so that the selection is repeatable.

**Subsetting the Dataset**

Our preprocessing script begins by scanning the Vimeo directory and identifying all valid seven-frame sequences. After gathering the full list, we shuffle it with a fixed random seed (42). Because the seed is fixed, the same ordering, and therefore the same subset, is produced on every run, so subsequent experiments can be replicated on identical data.

We then apply a **sample ratio of 0.10**, reducing the dataset size while keeping representative variability in motion and scene content.
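
The scan, seeded shuffle, and subsetting steps might look roughly like the sketch below. It assumes the Septuplet package unpacks into `sequences/<folder>/<clip>/im1.png` through `im7.png`; the function names `find_septuplets` and `sample_subset`, and the choice to round the subset size down, are illustrative assumptions rather than the notebook's actual code.

```python
import random
from pathlib import Path

# Assumed frame naming inside each septuplet folder.
FRAME_NAMES = [f"im{i}.png" for i in range(1, 8)]


def find_septuplets(sequences_root):
    """Return relative paths of clip folders that contain all seven frames."""
    root = Path(sequences_root)
    found = []
    for clip_dir in sorted(root.glob("*/*")):
        if clip_dir.is_dir() and all((clip_dir / n).is_file() for n in FRAME_NAMES):
            found.append(clip_dir.relative_to(root).as_posix())
    return found


def sample_subset(items, ratio=0.10, seed=42):
    """Shuffle deterministically and keep the first `ratio` share of items."""
    # Sorting first removes any dependence on filesystem listing order.
    shuffled = sorted(items)
    # A private Random instance avoids touching global random state.
    random.Random(seed).shuffle(shuffled)
    keep = int(len(shuffled) * ratio)  # rounds down (assumption)
    return shuffled[:keep]
```

With the 91,701 septuplets of the full dataset, a ratio of 0.10 with this rounding would keep 9,170 sequences, assuming every sequence passes the seven-frame check.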

**Train/Validation Split**

After forming the reduced subset, we split it into **Training (90%)** and **Validation (10%)** sets. The split is made at a fixed index into the already-shuffled list, so the same samples are consistently assigned to each set across multiple runs. This split allows us to train EQODEC effectively while monitoring generalization performance on sequences the model has not been trained on.
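
A split of this kind can be sketched as a single cut through an already-shuffled list. The function name `split_train_val` and the rounding rule for the validation count are assumptions for illustration.

```python
def split_train_val(items, val_fraction=0.10):
    """Split an ordered list into (train, val) at a fixed index."""
    items = list(items)
    # The validation size is rounded to the nearest integer (assumption);
    # everything before the cut goes to training.
    val_count = round(len(items) * val_fraction)
    cut = len(items) - val_count
    return items[:cut], items[cut:]
```

Because the function only slices, the result depends entirely on the input order: the same shuffled list always yields the same two sets. For example, a subset of 9,170 sequences (10% of 91,701, rounded down) would give 8,253 training and 917 validation sequences under this rounding.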

**Saving Split Indices**

To integrate smoothly with the training pipeline, we save the resulting indices as:

* `train_split.json`
* `val_split.json`

These files, stored under `data/processed/`, allow our training notebooks to load the dataset immediately without rescanning the raw directory. This reduces processing time and keeps the training workflow efficient and consistent.
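
Writing and reloading the split files could be done along these lines. The JSON schema used here (a flat list of relative sequence paths) and the helper names `save_split` and `load_split` are assumptions; the actual files may store additional metadata.

```python
import json
from pathlib import Path


def save_split(sequence_ids, path):
    """Write a list of sequence identifiers to a JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(list(sequence_ids), f, indent=2)


def load_split(path):
    """Read a list of sequence identifiers back from a JSON file."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)
```

A training notebook would then call, for instance, `load_split("data/processed/train_split.json")` and resolve each entry against `data/raw/vimeo/sequences` when loading frames.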

---

### **Summary**

Together, our data collection and preprocessing stages establish a structured and reproducible pipeline tailored to the objectives of a carbon-aware deep learning framework. The **Vimeo-90K Septuplet dataset** provides the temporal structure required for learning efficient latent representations, while the **UVG dataset** supplies the challenging, high-resolution benchmarks needed to compare EQODEC with traditional codecs. Our preprocessing strategy further streamlines training by subsetting, splitting, and indexing the data, ensuring that we maintain both efficiency and experimental rigor in alignment with EQODEC’s environmental goals.