Remove FSDP (#473)
* remove fsdp

* rm fsdp in tests

* Readme

---------

Co-authored-by: Pascal Pfeiffer <pascal.pfeiffer@h2o.ai>
haqishen and pascal-pfeiffer authored Oct 31, 2023
1 parent 511b4a8 commit b86709a
Showing 21 changed files with 31 additions and 83 deletions.
12 changes: 12 additions & 0 deletions CONTRIBUTING.md
@@ -53,3 +53,15 @@ Please make sure your pull request fulfills the following checklist:
☐ If your contribution is still a work in progress, change the PR to draft mode.
☐ Ensure that the existing tests pass by running `make test`.
☐ Make sure `make style` passes to maintain consistent code style.

## Installing custom packages

If you need to install additional Python packages into the environment, you can do so using pip after activating your virtual environment via ```make shell```. For example, to install flash-attention, you would use the following commands:

```bash
make shell
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```

For a PR, update the Pipfile and the Pipfile.lock via ```pipenv install package_name```.
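
For instance, a minimal sketch of that flow (using `einops` purely as a placeholder package name):

```bash
pipenv install einops           # placeholder package; updates Pipfile and Pipfile.lock
git add Pipfile Pipfile.lock    # include both updated files in the PR
```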
31 changes: 14 additions & 17 deletions README.md
@@ -54,9 +54,10 @@ Using CLI for fine-tuning LLMs:

## What's New

- [PR 288](https://github.com/h2oai/h2o-llmstudio/pull/288) Introduced Deepspeed for sharded training, allowing larger models to be trained on machines with multiple GPUs. Requires NVLink (see the quick check sketched after this list). This feature replaces FSDP and offers more flexibility. Deepspeed requires a system installation of the CUDA toolkit, and we recommend using version 11.8. See [Recommended Install](#recommended-install).
- [PR 449](https://github.com/h2oai/h2o-llmstudio/pull/449) New problem type for Causal Classification Modeling allows training binary and multiclass models using LLMs.
- [PR 364](https://github.com/h2oai/h2o-llmstudio/pull/364) User secrets are now handled more securely and flexibly. Support for storing secrets with the 'keyring' library was added, and existing user settings are migrated automatically where possible.
- [PR 328](https://github.com/h2oai/h2o-llmstudio/pull/328) RLHF is now a separate problem type. Note that starting a new RLHF experiment from an old experiment that used RLHF is no longer supported. To continue from a previous experiment, please start a new experiment and enter the settings from the previous experiment manually.
- [PR 308](https://github.com/h2oai/h2o-llmstudio/pull/308) Sequence to sequence models have been added as a new problem type.
- [PR 152](https://github.com/h2oai/h2o-llmstudio/pull/152) Add RLHF functionality for fine-tuning LLMs.
- [PR 131](https://github.com/h2oai/h2o-llmstudio/pull/131) Add 4-bit training that allows training of larger LLM backbones with less GPU memory. See [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for a comprehensive summary of this method.
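
The Deepspeed entry above assumes NVLink between the GPUs. A quick way to check for it (a sketch, assuming the NVIDIA driver and `nvidia-smi` are already installed) is to inspect the GPU topology matrix:

```bash
# GPU pairs connected via NVLink show up as NV1, NV2, ... in the matrix;
# PIX/PXB/SYS entries indicate PCIe-only connections.
nvidia-smi topo -m
```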
@@ -90,13 +91,21 @@ If deploying on a 'bare metal' machine running Ubuntu, one may need to install t
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.4.3/local_installers/cuda-repo-ubuntu2004-11-4-local_11.4.3-470.82.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-4-local_11.4.3-470.82.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-4-local/7fa2af80.pub
sudo apt-get -y update
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-ubuntu2004-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
```
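
Once the toolkit is installed, a quick sanity check can be run (a sketch; the `nvcc` path assumes the default CUDA 11.8 install location and may differ on your system):

```bash
# Verify the toolkit version and that the driver sees the GPUs
/usr/local/cuda-11.8/bin/nvcc --version
nvidia-smi
```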

Alternatively, one can install the CUDA toolkit in a conda environment:

```bash
conda create -n llmstudio python=3.10
conda activate llmstudio
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
```

#### Create virtual environment (pipenv)

The following command will create a virtual environment using pipenv and install the dependencies:
@@ -113,18 +122,6 @@ If you wish to use conda or another virtual environment, you can also install th
pip install -r requirements.txt
```
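
For instance, a minimal sketch of that route with conda (the environment name and Python version simply mirror the toolkit example above):

```bash
conda create -n llmstudio python=3.10
conda activate llmstudio
pip install -r requirements.txt
```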

### Installing custom packages

If you need to install additional Python packages into your environment, you can do so using pip after activating your virtual environment via ```make shell```. For example, to install flash-attention, you would use the following commands:

```bash
make shell
pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
```

Alternatively, you can also directly install via ```pipenv install package_name```.

## Run H2O LLM Studio GUI

You can start H2O LLM Studio using the following command:
5 changes: 0 additions & 5 deletions documentation/docs/guide/experiments/experiment-settings.md
@@ -76,7 +76,6 @@ import PStopp from '../../tooltips/experiments/_top-p.mdx';
import ESgpus from '../../tooltips/experiments/_gpus.mdx';
import ESmixedprecision from '../../tooltips/experiments/_mixed-precision.mdx';
import EScompilemodel from '../../tooltips/experiments/_compile-model.mdx';
import ESusefsdp from '../../tooltips/experiments/_use-fsdp.mdx';
import ESfindunusedparameters from '../../tooltips/experiments/_find-unused-parameters.mdx';
import EStrustremotecode from '../../tooltips/experiments/_trust-remote-code.mdx';
import ESnumofworkers from '../../tooltips/experiments/_number-of-workers.mdx';
@@ -430,10 +429,6 @@ The settings under each category are listed and described below.

<EScompilemodel/>

### Use FSDP

<ESusefsdp/>

### Find unused parameters

<ESfindunusedparameters/>
1 change: 0 additions & 1 deletion documentation/docs/tooltips/experiments/_use-fsdp.mdx

This file was deleted.

7 changes: 0 additions & 7 deletions llm_studio/python_configs/cfg_checks.py
@@ -100,13 +100,6 @@ def check_for_common_errors(cfg: DefaultConfigProblemBase) -> dict:
"Please use LORA or set Backbone Dtype to float32."
]

# deepspeed related checks
if cfg.environment.use_deepspeed and cfg.environment.use_fsdp:
errors["title"] += ["Deepspeed and FSDP cannot be used at the same time."]
errors["message"] += [
"Deepspeed and FSDP are mutually exclusive. "
"We recommend to disable FSDP which will be deprecated."
]
if cfg.environment.use_deepspeed and cfg.architecture.backbone_dtype in [
"int8",
"int4",
@@ -342,7 +342,6 @@ class ConfigNLPCausalLMEnvironment(DefaultConfig):
mixed_precision: bool = True

compile_model: bool = False
use_fsdp: bool = False
use_deepspeed: bool = False
deepspeed_reduce_bucket_size: int = int(1e6)
deepspeed_stage3_prefetch_bucket_size: int = int(1e6)
@@ -167,7 +167,6 @@ class ConfigRLHFLMEnvironment(ConfigNLPCausalLMEnvironment):

def __post_init__(self):
super().__post_init__()
self._visibility["use_fsdp"] = -1
self._visibility["compile_model"] = -1


27 changes: 1 addition & 26 deletions llm_studio/src/utils/modeling_utils.py
@@ -14,10 +14,6 @@
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
from peft import LoraConfig, get_peft_model
from torch.cuda.amp import autocast
from torch.distributed.fsdp.fully_sharded_data_parallel import (
FullyShardedDataParallel,
MixedPrecision,
)
from torch.nn.parallel import DistributedDataParallel
from tqdm import tqdm
from transformers import (
@@ -264,28 +260,7 @@ def wrap_model_distributed(
val_dataloader: torch.utils.data.DataLoader,
cfg: Any,
):
if cfg.environment.use_fsdp:
auto_wrap_policy = None

mixed_precision_policy = None
dtype = None
if cfg.environment.mixed_precision:
dtype = torch.float16
if dtype is not None:
mixed_precision_policy = MixedPrecision(
param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype
)
model = FullyShardedDataParallel(
model,
# sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
# cpu_offload=CPUOffload(offload_params=True),
auto_wrap_policy=auto_wrap_policy,
mixed_precision=mixed_precision_policy,
device_id=cfg.environment._local_rank,
# use_orig_params=False
limit_all_gathers=True,
)
elif cfg.environment.use_deepspeed:
if cfg.environment.use_deepspeed:
ds_config = get_ds_config(cfg)
model, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize(
model=model,
@@ -49,7 +49,6 @@ environment:
seed: -1
trust_remote_code: true
use_deepspeed: false
use_fsdp: false
experiment_name: solid-spaniel
llm_backbone: facebook/opt-125m
logging:
@@ -49,7 +49,6 @@ environment:
seed: -1
trust_remote_code: true
use_deepspeed: false
use_fsdp: false
experiment_name: solid-spaniel
llm_backbone: MaxJeblick/llama2-0b-unit-test
logging:
@@ -44,7 +44,6 @@ environment:
number_of_workers: 8
seed: -1
trust_remote_code: true
use_fsdp: false
experiment_name: test-causal-language-modeling-oasst
llm_backbone: h2oai/h2ogpt-4096-llama2-7b
logging:
@@ -44,7 +44,6 @@ environment:
number_of_workers: 8
seed: -1
trust_remote_code: true
use_fsdp: false
experiment_name: test-causal-language-modeling-oasst-cpu
llm_backbone: MaxJeblick/llama2-0b-unit-test
logging:
@@ -49,7 +49,6 @@ environment:
seed: -1
trust_remote_code: true
use_deepspeed: false
use_fsdp: false
experiment_name: solid-spaniel
llm_backbone: facebook/opt-125m
logging:
@@ -49,7 +49,6 @@ environment:
seed: -1
trust_remote_code: true
use_deepspeed: false
use_fsdp: false
experiment_name: solid-spaniel
llm_backbone: MaxJeblick/llama2-0b-unit-test
logging:
@@ -44,7 +44,6 @@ environment:
number_of_workers: 8
seed: -1
trust_remote_code: true
use_fsdp: false
experiment_name: test-rlhf-language-modeling-oasst
llm_backbone: facebook/opt-125m
logging:
@@ -44,7 +44,6 @@ environment:
number_of_workers: 8
seed: -1
trust_remote_code: true
use_fsdp: false
experiment_name: test-rlhf-language-modeling-oasst
llm_backbone: MaxJeblick/llama2-0b-unit-test
logging:
@@ -44,7 +44,6 @@ environment:
number_of_workers: 8
seed: -1
trust_remote_code: true
use_fsdp: false
experiment_name: test-sequence-to-sequence-modeling-oasst
llm_backbone: t5-small
logging:
@@ -44,7 +44,6 @@ environment:
number_of_workers: 8
seed: -1
trust_remote_code: true
use_fsdp: false
experiment_name: test-sequence-to-sequence-modeling-oasst
llm_backbone: t5-small
logging:
1 change: 0 additions & 1 deletion tests/src/test_data/cfg.yaml
@@ -31,7 +31,6 @@ environment:
mixed_precision: true
number_of_workers: 8
seed: -1
use_fsdp: false
experiment_name: test
llm_backbone: EleutherAI/pythia-12b-deduped
logging:
1 change: 0 additions & 1 deletion tests/src/utils/test_load_yaml_file.py
@@ -39,7 +39,6 @@ def test_load_config_yaml():
assert cfg.environment.mixed_precision is True
assert cfg.environment.number_of_workers == 8
assert cfg.environment.seed == -1
assert cfg.environment.use_fsdp is False

assert cfg.logging.logger == "None"
assert cfg.logging.neptune_project == ""
17 changes: 4 additions & 13 deletions train.py
@@ -20,7 +20,6 @@
import pandas as pd
import torch
from torch.cuda.amp import GradScaler, autocast
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers.deepspeed import HfDeepSpeedConfig
@@ -178,12 +177,9 @@ def run_train(
Last train batch
"""

scaler: GradScaler | ShardedGradScaler | None = None
scaler: GradScaler | None = None
if cfg.environment.mixed_precision:
if cfg.environment.use_fsdp:
scaler = ShardedGradScaler()
else:
scaler = GradScaler()
scaler = GradScaler()

optimizer.zero_grad(set_to_none=True)

@@ -427,12 +423,9 @@ def run_train_rlhf(
Last train batch
"""

scaler: GradScaler | ShardedGradScaler | None = None
scaler: GradScaler | None = None
if cfg.environment.mixed_precision:
if cfg.environment.use_fsdp:
scaler = ShardedGradScaler()
else:
scaler = GradScaler()
scaler = GradScaler()

optimizer.zero_grad(set_to_none=True)

@@ -754,8 +747,6 @@ def run(cfg: Any) -> None:
else:
cfg.environment._seed = cfg.environment.seed

if cfg.environment.use_deepspeed and cfg.environment.use_fsdp:
raise ValueError("Deepspeed and FSDP cannot be used at the same time.")
if (
cfg.architecture.backbone_dtype in ["int8", "int4"]
and cfg.environment.use_deepspeed