diff --git a/docs/inputs.png b/docs/inputs.png new file mode 100644 index 00000000..57847277 Binary files /dev/null and b/docs/inputs.png differ diff --git a/docs/signature.png b/docs/signature.png new file mode 100644 index 00000000..e5bbfa2d Binary files /dev/null and b/docs/signature.png differ diff --git a/docs/xattn_langstream.png b/docs/xattn_langstream.png new file mode 100644 index 00000000..09844bdf Binary files /dev/null and b/docs/xattn_langstream.png differ diff --git a/open_flamingo/eval/README.md b/open_flamingo/eval/README.md index d67b9651..ea59e613 100644 --- a/open_flamingo/eval/README.md +++ b/open_flamingo/eval/README.md @@ -1,5 +1,4 @@ # OpenFlamingo Evaluation Suite - This is the evaluation module of OpenFlamingo. It contains a set of utilities for evaluating multimodal models on various benchmarking datasets. *This module is a work in progress! We will be updating this README as it develops. In the meantime, if you notice an issue, please file a Bug Report or Feature Request [here](https://github.com/mlfoundations/open_flamingo/issues/new/choose).* @@ -19,8 +18,19 @@ This is the evaluation module of OpenFlamingo. It contains a set of utilities fo When evaluating a model using `num_shots` shots, we sample the exemplars from the training split. Performance is evaluated on a disjoint test split, subsampled to `--num_samples` examples (or using the full test split if `--num_samples=-1`). +## Supported models +This evaluation module interfaces with models using the `EvalModel` class defined in `eval/eval_models/eval_model.py`. The `EvalModel` wrapper standardizes the generation and rank classification interfaces. + +To help standardize VLM evaluations, we have implemented EvalModel wrappers for models from three code repositories: + +* This open_flamingo repository, i.e. all models created using this repository's `src` code +* The pretrained [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) models. Note that these models can only take in one image per input sequence; this is not to be confused with the BLIP-like implementation in the open_flamingo repository, which can take in arbitrarily interleaved image/text sequences +* Huggingface's [IDEFICS](https://huggingface.co/blog/idefics) models + ## Sample scripts -Our codebase uses DistributedDataParallel to parallelize evaluation by default, so please make sure to set the `MASTER_ADDR` and `MASTER_PORT` environment variables or use `torchrun`. We provide a sample Slurm evaluation script in `open_flamingo/open_flamingo/scripts/run_eval.sh`. +Our codebase uses DistributedDataParallel to parallelize evaluation by default, so please make sure to set the `MASTER_ADDR` and `MASTER_PORT` environment variables or use `torchrun`. We provide a sample Slurm evaluation script in `open_flamingo/open_flamingo/scripts/run_eval_ddp.sh`. + +We have also implemented distributed evaluation using Deepspeed, which additionally shards model parameters across GPUs for memory savings. To use Deepspeed instead of DDP, use the `--deepspeed` flag. We also support evaluating at a lower precision using the `--precision` flag. We find minimal difference between evaluating at full precision vs. amp_bf16.
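The eval README above refers to a "rank classification" interface alongside generation. For readers unfamiliar with the term: rank classification scores each candidate answer by the log-likelihood the model assigns to it, and predicts the highest-scoring candidate. The sketch below illustrates the idea with a text-only Hugging Face model; the `gpt2` checkpoint and the prompt are arbitrary placeholders, and this is not the `EvalModel` implementation from `eval/eval_models/eval_model.py`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: a text-only stand-in for the multimodal EvalModel wrappers.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def candidate_log_likelihood(prompt: str, candidate: str) -> float:
    """Sum of log-probs the model assigns to `candidate`, conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The token at position i is predicted from positions < i.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    per_token = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the candidate's tokens (assumes the prompt tokenizes to a prefix of prompt+candidate).
    return per_token[:, prompt_len - 1:].sum().item()

prompt = "Question: Is there a dog in the image? Short answer:"
prediction = max(["yes", "no"], key=lambda c: candidate_log_likelihood(prompt, c))
print(prediction)
```

In the multimodal setting, the same scoring would additionally condition on the image inputs rather than on text alone.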
diff --git a/open_flamingo/scripts/fill_vqa_testdev_results.py b/open_flamingo/scripts/fill_vqa_testdev_results.py index c86c5ceb..512848e5 100644 --- a/open_flamingo/scripts/fill_vqa_testdev_results.py +++ b/open_flamingo/scripts/fill_vqa_testdev_results.py @@ -1,5 +1,5 @@ """ -Helper scripts to prepare a vqa test-dev evaluation for EvalAI submission. +Helper scripts to prepare a Vizwiz or VQAv2 test-dev evaluation for EvalAI submission. Note: EvalAI requires VQAv2 submissions to have predictions for all the questions in the test2015 set, not just the test-dev set. Given a json with a subset of the vqa questions, fill in the rest of the questions with an empty string as the model prediction. """ diff --git a/open_flamingo/scripts/run_eval.sh b/open_flamingo/scripts/run_eval_ddp.sh similarity index 97% rename from open_flamingo/scripts/run_eval.sh rename to open_flamingo/scripts/run_eval_ddp.sh index 52a17099..acd72d1a 100644 --- a/open_flamingo/scripts/run_eval.sh +++ b/open_flamingo/scripts/run_eval_ddp.sh @@ -9,6 +9,7 @@ Notes: - VQAv2 test-dev and test-std annotations are not publicly available. To evaluate on these splits, please follow the VQAv2 instructions and submit to EvalAI. This script will evaluate on the val split. +- Vizwiz test-dev annotations are also not publicly available; please go through EvalAI. com export PYTHONFAULTHANDLER=1 diff --git a/open_flamingo/scripts/run_eval_deepspeed.sh b/open_flamingo/scripts/run_eval_deepspeed.sh new file mode 100644 index 00000000..fbba58ff --- /dev/null +++ b/open_flamingo/scripts/run_eval_deepspeed.sh @@ -0,0 +1,77 @@ +#!/bin/bash +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=2 +#SBATCH --gpus-per-task=1 + +< 2.0.1! +com + +export PYTHONFAULTHANDLER=1 +export CUDA_LAUNCH_BLOCKING=0 +export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"` +export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) +export MASTER_PORT=$(shuf -i 0-65535 -n 1) +export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l` + +export PYTHONPATH="$PYTHONPATH:open_flamingo" +srun --cpu_bind=v --accel-bind=gn python open_flamingo/open_flamingo/train/train.py \ --lm_path meta-llama/Llama-2-13b \ --tokenizer_path meta-llama/Llama-2-13b \ --model_family flamingo \ --cross_attn_every_n_layers 4 \ --dataset_resampled \ --batch_size_mmc4 16 \ --batch_size_laion 32 \ --fsdp \ --fsdp_sharding_strategy hybrid \ --train_num_samples_mmc4 125000\ --train_num_samples_laion 250000 \ --loss_multiplier_laion 0.2 \ --workers=4 \ --run_name "fsdp" \ --num_epochs 480 \ --warmup_steps 0 \ --mmc4_textsim_threshold 0.0 \ --laion_shards "/path/to/laion-samples/{000000..000001}.tar" \ --mmc4_shards "/path/to/mmc4-samples/{000000..000001}.tar" \ --report_to_wandb diff --git a/open_flamingo/src/README.md b/open_flamingo/src/README.md new file mode 100644 index 00000000..550b69a7 --- /dev/null +++ b/open_flamingo/src/README.md @@ -0,0 +1,56 @@ +# OpenFlamingo: Modeling +We provide modules to mix-and-match into several vision-language model architectures. + +## What is a VLM? +A **vision-language model (VLM)** is a language model capable of processing a sequence of arbitrarily interleaved images/videos with text to output text. + +![A VLM takes in a sequence of interleaved images/videos with text and outputs text.](../../docs/signature.png) + +The forward signature of a VLM is as follows: + +* `vision_x`: The batch of images / videos to process.
This is a tensor of the shape `(B, T_img, F, C, H, W)`, where `B` is the batch dimension, `T_img` collates the images/videos within one input sequence, `F` collates frames within a video, and `(C, H, W)` are the channel, height, and width dimensions respectively. +* `lang_x`: The batch of input_ids (text) to process. This is a tensor of the shape `(B, T_txt)`, where `T_txt` is the number of text tokens within one input sequence. + +To explain to the model how to interleave the image/text elements within a sequence, `lang_x` should include `<image>` tokens ("media tokens") that specify where the images/videos are placed. (See figure below) + +![Illustration of what the inputs to a VLM look like.](../../docs/inputs.png) + + +## VLM modeling with the open_flamingo repository +This repository provides modules for constructing various VLM architectures. + +All models inherit from the `VLM` (vision-language model) class defined in `src/vlm.py`. As documented there, a VLM is defined by four component modules: +1. A **vision encoder** that extracts features from pixels (e.g. CLIP). This module should take in vision inputs of the shape `(B, T_img, F, C, H, W)` and output features of the shape `(B, T_img, F, v, d)`. +2. A **vision tokenizer** that converts features from the vision encoder into token-like embeddings (e.g. PerceiverResampler). This module should take in vision features of the shape `(B, T_img, F, v, d)` and output tokens of the shape `(B, T_img, n, d)`. +3. A fusion method that allows the language model to attend to these tokens, e.g. cross-attention (as done in [Flamingo](https://arxiv.org/abs/2204.14198)), or placing the tokens directly in the language model's input sequence (as done in [Kosmos](https://arxiv.org/abs/2306.14824)). +4. A language model. + +This repository allows us to construct architectures by mixing-and-matching options for all four kinds of modules. + +### Supported vision encoders +All CLIP-style encoders from the [OpenCLIP](https://github.com/mlfoundations/open_clip) library are supported. This includes OpenAI's models. + +### Supported vision tokenizers +* [Perceiver Resampler](https://arxiv.org/abs/2103.03206) +* [Q-former](https://arxiv.org/abs/2301.12597) +* Linear projection + +### Supported fusion methods +Models are further split into those that inherit from `VLMWithCrossAttention` (dense cross attention to fuse vision + language, Flamingo-style) vs. `VLMWithLanguageStream` (insert vision tokens into the language stream, Kosmos-style). + +![A VLM with cross attention and a VLM with language stream represent two methods for fusing the vision and language inputs.](../../docs/xattn_langstream.png) + +### Supported language models +All autoregressive language models from [Huggingface Transformers](https://huggingface.co/models) are supported. + +## Example architectures +Using these modules, the following architectures are implemented as examples. + +|Model|Vision tokenizer|Fusion method|Trainable parameters| |----|------------|------------|------------| |[Flamingo](https://arxiv.org/abs/2204.14198)|Perceiver|Cross attention|Added language model embeddings, vision tokenizer| |[Kosmos](https://arxiv.org/abs/2306.14824)|Perceiver|Language stream|Everything except the vision encoder| |[BLIP](https://arxiv.org/abs/2301.12597)|Q-former|Language stream|Added language model embeddings, vision tokenizer| + +We welcome contributions! If you'd like to add additional vision tokenizers, fusion methods, or model types, please open a PR.
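To make the documented forward signature concrete, here is a minimal sketch of what a batch of VLM inputs looks like. The tensors are random placeholders, and the 224x224 resolution, vocabulary size, and token counts are illustrative assumptions; constructing the model and tokenizing with this repository's code is elided.

```python
import torch

# Two sequences per batch, three single-frame images interleaved in each sequence.
B, T_img, F = 2, 3, 1
C, H, W = 3, 224, 224          # resolution is an assumption, not fixed by the VLM signature
vision_x = torch.randn(B, T_img, F, C, H, W)

# lang_x holds input_ids; each sequence would contain T_img media tokens
# (e.g. "<image>") marking where the images sit among the text tokens.
T_txt = 32
lang_x = torch.randint(0, 50_000, (B, T_txt))   # stand-in for tokenizer output

# A forward pass through a VLM from this repository then looks roughly like:
#   output = vlm(vision_x=vision_x, lang_x=lang_x)
```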
+ diff --git a/open_flamingo/train/README.md b/open_flamingo/train/README.md index d676e91b..47a029a8 100644 --- a/open_flamingo/train/README.md +++ b/open_flamingo/train/README.md @@ -1,9 +1,23 @@ # OpenFlamingo Training -To train OpenFlamingo, please ensure your environment matches that of `environment.yml`. +We provide efficient data loading and distributed training code. +To train with OpenFlamingo, please ensure your environment matches that of `environment.yml`. + +Table of contents: + +* [Data](#data) +* [Example commands](#example-training-command) +* [Distributed training](#distributed-training) ## Data Our codebase uses [WebDataset](https://github.com/webdataset/webdataset) to efficiently load `.tar` files containing image and text sequences. We recommend resampling shards with replacement during training using the `--dataset_resampled` flag. +Supported pretraining datasets +* LAION-2B +* Multimodal C4 (MMC4) +* ChatGPT-generated sequences from OpenFlamingo [technical report](https://arxiv.org/abs/2308.01390) + +We plan to add additional datasets in the future, and we welcome contributions! If you'd like to add support for a pretraining dataset, please open a PR. + ### LAION-2B Dataset [LAION-2B](https://arxiv.org/abs/2210.08402) contains 2B web-scraped (image, text) pairs. We use [img2dataset](https://github.com/rom1504/img2dataset) to download this dataset into tar files. @@ -27,7 +41,7 @@ Models trained with ChatGPT-generated sequences: * OpenFlamingo-4B-vitl-rpj3b-langinstruct ## Example training command -We provide a sample Slurm training script in `scripts/`. You can also modify the following command: +We provide sample Slurm training scripts in `scripts/`. You can also modify the following command: ``` torchrun --nnodes=1 --nproc_per_node=4 train.py \ @@ -52,9 +66,17 @@ torchrun --nnodes=1 --nproc_per_node=4 train.py \ *Note: The MPT-1B [base](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b) and [instruct](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b-dolly) modeling code does not accept the `labels` kwarg or compute cross-entropy loss directly within `forward()`, as expected by our codebase. We suggest using a modified version of the MPT-1B models found [here](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b) and [here](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b-dolly).* ## Distributed training +Our codebase supports distributed training using three frameworks: + +* Pytorch's [DistributedDataParallel](https://pytorch.org/docs/stable/torch.nn.parallel.DistributedDataParallel.html). This is the default method used by `train.py`. +* Pytorch's [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html) (FSDP). Use the `--fsdp` flag. +* [DeepSpeed](https://github.com/microsoft/DeepSpeed) stages 1-3. Use the `--deepspeed` flag. + +Note that you should use exactly one of these training methods. + +`train/distributed.py` contains utilities to help with setting up distributed training using Slurm / `torchrun`. See example scripts in the `scripts` directory. + +### FSDP notes +To use FSDP, make sure to use Pytorch Nightly (> 2.0.1). -By default, `train.py` uses Pytorch's [DistributedDataParallel](https://pytorch.org/docs/stable/torch.nn.parallel.DistributedDataParallel.html) for training. -To use [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html), make sure to use Pytorch Nightly (> 2.0.1), and use the `--fsdp` flag. -To use [DeepSpeed](https://github.com/microsoft/DeepSpeed), use the `--deepspeed` flag. 
-(Note that you should use *either* FSDP or Deepspeed, not both.) -We also implement gradient checkpointing and mixed precision training. Use the `--gradient_checkpointing` and `--precision` arguments respectively. \ No newline at end of file +We support two sharding strategies for FSDP: full sharding (model sharding across all nodes and GPUs) or hybrid sharding (model sharding across GPUs within nodes, data parallel between nodes). The former saves GPU memory; the latter saves on communication costs. diff --git a/requirements-eval.txt b/requirements-eval.txt index d594adb4..f5bd9a99 100644 --- a/requirements-eval.txt +++ b/requirements-eval.txt @@ -5,9 +5,7 @@ inflection pycocoevalcap pycocotools tqdm - -black mypy pylint pytest -requests +requests \ No newline at end of file diff --git a/requirements-training.txt b/requirements-training.txt index 79ff0bc9..8b46a831 100644 --- a/requirements-training.txt +++ b/requirements-training.txt @@ -3,3 +3,4 @@ braceexpand webdataset tqdm wandb +deepspeed \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 8f271192..143a3c5e 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,7 @@ einops einops-exts transformers>=4.28.1 -torch==2.0.1 +torch>=2.0.1 pillow open_clip_torch>=2.16.0 sentencepiece==0.1.98 \ No newline at end of file
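As a companion to the FSDP notes in the training README above, the sketch below shows the two sharding strategies using plain PyTorch FSDP rather than this repository's `train.py` wiring; mapping the `--fsdp_sharding_strategy full` / `hybrid` values onto `ShardingStrategy.FULL_SHARD` / `ShardingStrategy.HYBRID_SHARD` is an assumption for illustration.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Launch with torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 fsdp_sketch.py
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for the trainable parts of a VLM

# FULL_SHARD:   parameters sharded across every GPU (maximum memory savings).
# HYBRID_SHARD: parameters sharded across GPUs within a node, replicated across nodes
#               (less inter-node communication).
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```

Full sharding minimizes per-GPU memory at the cost of collectives that span all GPUs; hybrid sharding keeps the parameter all-gathers within each node.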