## Finetuning LLaVA with Project Aria Dataset

[Project Aria](https://www.projectaria.com/) provides publicly available [datasets](https://www.projectaria.com/datasets/) containing terabytes of video and other data recorded by the smart glasses. The goal of this project was to extract context from the video frames and obtain a higher-level representation of the data, so we decided to fine-tune [LLaVA](https://huggingface.co/liuhaotian/llava-v1.5-7b). LLaVA is a multimodal model, meaning it can understand both images and text.

### Notebook Content
0. Preliminary Steps
1. Data Preprocessing
2. LLaVA & Dependency Installations
3. DeepSpeed Config and Training
4. WandB Results
---

## 1. Data Preprocessing

This repository includes `ADT_download_urls.json` for the Aria Digital Twin (ADT) dataset. You can adapt the download step as necessary to use other data.

In [1]:
from res import download_convert

download_convert.run(
    json_path="ADT_download_urls.json",
    dataset_dir="dataset",
    metadata_output_path="metadata.csv",
    json_output_path="output.json",
    fps=30,
    max_download=1
)


Downloading ADT_Apartment_release_clean_seq131_M1292_preview_rgb.mp4: 100%|██████████| 98.6M/98.6M [00:01<00:00, 72.1MiB/s]
Downloading ADT_Apartment_release_clean_seq131_M1292_main_groundtruth.zip: 100%|██████████| 19.2M/19.2M [00:00<00:00, 46.9MiB/s]
Processing sequences: 100%|██████████| 1/1 [00:12<00:00, 13.00s/sequence]


Metadata saved to metadata.csv
Total sequences downloaded and processed: 1


Converting metadata to JSON: 1it [00:00,  5.88it/s]


Converted 3526 samples to 'output.json'
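
The cells further down read the `id`, `image`, and `conversations` fields of each record and expect a turn whose `from` is `gpt`. Before splitting, it can help to confirm that every record has that shape. The following is a minimal sketch of such a check; the helper name `find_invalid_records` and the sample records are hypothetical and only illustrate the structure.

```python
import json

# Keys the later cells access on every record
REQUIRED_KEYS = {"id", "image", "conversations"}

def find_invalid_records(records):
    """Return (index, reason) pairs for records the later cells could not use."""
    problems = []
    for idx, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append((idx, f"missing keys: {sorted(missing)}"))
            continue
        # The split step extracts the first turn whose 'from' is 'gpt'
        speakers = {turn.get("from") for turn in rec["conversations"]}
        if "gpt" not in speakers:
            problems.append((idx, "no 'gpt' turn"))
    return problems

def check_file(json_path):
    """Load a JSON list of records from disk and report invalid ones."""
    with open(json_path, "r") as f:
        return find_invalid_records(json.load(f))

# Made-up records for illustration only (not taken from the real output.json)
sample = [
    {"id": "a", "image": "a.jpg", "conversations": [
        {"from": "human", "value": "<image>\nQ"},
        {"from": "gpt", "value": "A"}]},
    {"id": "b", "image": "b.jpg", "conversations": [
        {"from": "human", "value": "<image>\nQ"}]},
    {"id": "c", "conversations": []},
]
print(find_invalid_records(sample))
```

To run the same check on the converted file, call `check_file("output.json")`; an empty list means every record has the expected fields.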


Next, the downloaded data is split into a train set and a test set. The images are saved under `dataset/images`, and the annotations are saved to `dataset/train/dataset.json` and `dataset/test/dataset.json`.

In [13]:
!pip install datasets && pip install --upgrade --force-reinstall Pillow

Collecting Pillow
  Using cached pillow-11.0.0-cp310-cp310-win_amd64.whl.metadata (9.3 kB)
Using cached pillow-11.0.0-cp310-cp310-win_amd64.whl (2.6 MB)
Installing collected packages: Pillow
  Attempting uninstall: Pillow
    Found existing installation: pillow 10.4.0
    Uninstalling pillow-10.4.0:
      Successfully uninstalled pillow-10.4.0
Successfully installed Pillow-11.0.0


  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradio 4.16.0 requires pillow<11.0,>=8.0, but you have pillow 11.0.0 which is incompatible.


In [14]:
from datasets import Dataset
import json
import os
from PIL import Image

def load_custom_dataset(json_path):
    """
    Load a custom dataset from a JSON file.

    Parameters:
    - json_path: Path to the JSON file.

    Returns:
    - A Hugging Face Dataset object.
    """
    with open(json_path, 'r') as f:
        data = json.load(f)
    return Dataset.from_list(data)

def process_and_save(dataset, output_folder, subset_name, max_samples=None):
    """
    Process and save images and labels from the dataset.

    Parameters:
    - dataset: The Hugging Face Dataset to process.
    - output_folder: The directory where processed data will be saved.
    - subset_name: Name of the subset (e.g., 'train', 'test').
    - max_samples: Maximum number of samples to process. If None, process all.
    """
    # Define image subfolder within output folder
    image_subfolder = os.path.join(output_folder, 'images')

    if not os.path.exists(image_subfolder):
        os.makedirs(image_subfolder)
        print(f"Created image directory at '{image_subfolder}'.")

    # Initialize list to hold all JSON data
    json_data_list = []

    # If max_samples is set, shuffle and select
    if max_samples is not None:
        if len(dataset) < max_samples:
            raise ValueError(f"Requested {max_samples} samples, but dataset contains only {len(dataset)} samples.")
        dataset = dataset.shuffle(seed=42).select(range(max_samples))
        print(f"Sampling {max_samples} samples for subset '{subset_name}'.")

    # Process and save images and labels
    for idx, item in enumerate(dataset):
        # Load image
        image_path = "dataset/" + str(item['image'])
        
        # Skip past the '/workspace/' part if it exists
        workspace_prefix = '/workspace/'
        if image_path.startswith(workspace_prefix):
            image_path = image_path[len(workspace_prefix):]  # Remove the prefix
            print(f"[{idx+1}/{len(dataset)}] Adjusted image path to: {image_path}")
        
        # Verify if the adjusted path exists
        if not os.path.exists(image_path):
            print(f"[{idx+1}/{len(dataset)}] Adjusted image path does not exist: {image_path}")
            continue

        try:
            image = Image.open(image_path).convert('RGB')
        except Exception as e:
            print(f"[{idx+1}/{len(dataset)}] Error loading image {image_path}: {e}")
            continue

        # Use existing unique ID
        unique_id = item['id']

        # Define image path in the output directory
        output_image_path = os.path.join(image_subfolder, f"{unique_id}.jpg")

        # Save image
        try:
            image.save(output_image_path)
            print(f"[{idx+1}/{len(dataset)}] Saved image to {output_image_path}")
        except Exception as e:
            print(f"[{idx+1}/{len(dataset)}] Error saving image {output_image_path}: {e}")
            continue

        # Extract the GPT response
        gpt_response = next((conv['value'] for conv in item['conversations'] if conv['from'] == 'gpt'), '')

        # Structure for LLaVA JSON
        json_data = {
            "id": unique_id,
            "image": f"{unique_id}.jpg",  # Changed from os.path.join('images', f"{unique_id}.jpg") to just the filename
            "conversations": [
                {
                    "from": "human",
                    "value": "<image>\nWhat main activity is taking place?"
                },
                {
                    "from": "gpt",
                    "value": gpt_response
                }
            ]
        }

        # Append to list
        json_data_list.append(json_data)

    # Save the JSON data list to a file
    subset_folder = os.path.join(output_folder, subset_name)
    if not os.path.exists(subset_folder):
        os.makedirs(subset_folder)
        print(f"Created subset directory at '{subset_folder}'.")

    json_output_path = os.path.join(subset_folder, 'dataset.json')
    try:
        with open(json_output_path, 'w') as json_file:
            json.dump(json_data_list, json_file, indent=4)
        print(f"Saved processed data to '{json_output_path}'.")
    except Exception as e:
        print(f"Error saving JSON file '{json_output_path}': {e}")


In [15]:
# Define paths and parameters
output_folder = 'dataset'
json_file_path = 'output.json'  # Update this path accordingly
max_samples = 2000  # Total samples to process

# Load the dataset
dataset = load_custom_dataset(json_file_path)

print(f"Total number of samples in the dataset: {len(dataset)}")

# Ensure that the dataset has enough samples
if len(dataset) < max_samples:
    raise ValueError(f"Dataset contains only {len(dataset)} samples, which is less than the requested {max_samples} samples.")

# Shuffle and select 2,000 samples
sampled_dataset = dataset.shuffle(seed=42).select(range(max_samples))
print(f"Number of samples after sampling: {len(sampled_dataset)}")

# Split into training and testing subsets (80/20)
train_ratio = 0.8
train_size = int(max_samples * train_ratio)

# Fix the seed so the split is reproducible across runs
train_test_split = sampled_dataset.train_test_split(test_size=(max_samples - train_size), seed=42)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of testing samples: {len(test_dataset)}")

# Process and save the datasets
process_and_save(train_dataset, output_folder, 'train')
process_and_save(test_dataset, output_folder, 'test')


Total number of samples in the dataset: 3526
Number of samples after sampling: 2000
Number of training samples: 1600
Number of testing samples: 400
[1/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_release_clean_seq131_M1292_frame0286.jpg
[2/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_release_clean_seq131_M1292_frame0520.jpg
[3/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_release_clean_seq131_M1292_frame1021.jpg
[4/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_release_clean_seq131_M1292_frame2944.jpg
[5/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_release_clean_seq131_M1292_frame1026.jpg
[6/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_release_clean_seq131_M1292_frame2815.jpg
[7/1600] Saved image to dataset\images\Apartment_release_clean_seq131_M1292_Apartment_rele
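
Because the training command later points `--image_folder` at `./dataset/images`, every `image` entry in `dataset.json` should resolve to a file in that folder. The sketch below checks this; the helper name `missing_images` is hypothetical, and the paths are the ones used in this notebook.

```python
import json
import os

def missing_images(json_path, image_folder):
    """Return the 'image' entries in json_path that have no file in image_folder."""
    with open(json_path, "r") as f:
        records = json.load(f)
    return [
        rec["image"]
        for rec in records
        if not os.path.isfile(os.path.join(image_folder, rec["image"]))
    ]

# Only runs when the files produced by the cells above are present
for subset in ("train", "test"):
    subset_json = os.path.join("dataset", subset, "dataset.json")
    if os.path.exists(subset_json):
        absent = missing_images(subset_json, os.path.join("dataset", "images"))
        print(f"{subset}: {len(absent)} referenced images are missing")
```

A non-zero count usually means an image failed to load or save in the previous cell, in which case the record should not have been written either.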

## 2. LLaVA & Dependency Installations

To use LLaVA, we first clone its [GitHub repository](https://github.com/haotian-liu/LLaVA). Following the instructions in its README, or running the cells below, sets up the environment for fine-tuning LLaVA.

In [None]:
!git clone https://github.com/haotian-liu/LLaVA.git

In [None]:
# Each `!` line runs in its own shell, so `cd LLaVA` must be repeated
!cd LLaVA && pip install -e .
!cd LLaVA && pip install -e ".[train]" && pip install flash-attn --no-build-isolation
!pip install peft wandb
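
After the install cells finish, a quick way to confirm that the packages are visible to the notebook kernel is to look them up without importing them. This is a minimal sketch; the helper name `check_installed` is hypothetical, and the module names are assumed to be the import names of the packages installed above (`flash_attn` for `flash-attn`).

```python
import importlib.util

def check_installed(modules):
    """Map each top-level module name to True if the kernel can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

# Assumed import names for the packages installed in the cells above
print(check_installed(["llava", "peft", "wandb", "flash_attn"]))
```

Any `False` entry suggests the corresponding install ran in a different environment or failed, and is worth fixing before starting training.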

## 3. Deepspeed Config and Training

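DeepSpeed, created by Microsoft, is a deep learning optimization library that speeds up training and reduces its memory footprint. Its ZeRO optimizer partitions optimizer states, gradients, and parameters across GPUs, which lets developers train models with billions of parameters more efficiently, even on limited hardware. Here it is combined with LoRA adapters (or QLoRA, their quantized variant), which train small low-rank matrices instead of the full model weights.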
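
To get a feel for why adapters save memory, consider the `--lora_r 128 --lora_alpha 256` settings used in the training command in this section. For a single weight matrix of shape `d_out x d_in`, LoRA trains two matrices of shapes `d_out x r` and `r x d_in` instead. The sketch below does the arithmetic; the 5120 x 5120 layer size is an assumption chosen for illustration, not a value read from the model.

```python
def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    # A is rank x d_in, B is d_out x rank
    return rank * (d_in + d_out)

rank, alpha = 128, 256   # values from the training command
d = 5120                 # assumed layer width, for illustration only

full = d * d                               # parameters in the frozen weight
extra = lora_extra_params(d, d, rank)      # parameters in the adapter
scaling = alpha / rank                     # factor applied to the low-rank update

print(f"full weight: {full:,} params")
print(f"LoRA adapter: {extra:,} params ({extra / full:.1%} of the full weight)")
print(f"scaling alpha / r = {scaling}")
```

Under these assumptions the adapter holds about 5% of the parameters of the layer it modifies, and only those receive gradients and optimizer state.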

In [None]:
!pip install deepspeed

In [None]:
!deepspeed LLaVA/llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed LLaVA/scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-13b \
    --version v1 \
    --data_path ./dataset/train/dataset.json \
    --image_folder ./dataset/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-13b-task-lora \
    --num_train_epochs 5 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
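
It is worth checking how many optimizer steps these settings produce, since `--save_steps` and `--logging_steps` are both expressed in steps. The sketch below approximates how the Hugging Face `Trainer` counts steps; the single-GPU setting is an assumption, and the sample count of 1600 is the training split created in Section 1.

```python
import math

def training_steps(num_samples, per_device_batch, num_gpus, grad_accum, epochs):
    """Approximate total optimizer steps for a run."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

total = training_steps(
    num_samples=1600,      # training samples from Section 1
    per_device_batch=16,   # --per_device_train_batch_size
    num_gpus=1,            # assumption: adjust to your hardware
    grad_accum=1,          # --gradient_accumulation_steps
    epochs=5,              # --num_train_epochs
)
save_steps = 50000         # --save_steps

print(f"total optimizer steps: {total}")
print(f"step-based checkpoint reached during the run: {total >= save_steps}")
```

With these numbers the run is far shorter than `--save_steps`, so no intermediate step-based checkpoint would be written. Lower `--save_steps` if you want checkpoints during training.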

## 4. Weights & Biases (WandB) Results