# First we must install PyTorch

In [1]:
!conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia -y

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.



# Let's quickly look at a more naive approach to fine-tuning, one I originally explored.

## One easy way to fine-tune small transformers is to use a library called Happy Transformer, a package built on top of the Hugging Face Transformers library. With it, you can normally fine-tune small transformers very easily. Let's install it.

In [2]:
!pip install transformers happytransformer



## We will test this method by fine-tuning models on Shakespeare on my RTX 3090.

In [3]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


In [4]:
from happytransformer import HappyGeneration 
model_to_use = "EleutherAI/gpt-neo-125M"
# The 1.3B model won't fit on cards with 24GB of VRAM or less
# model_to_use = "EleutherAI/gpt-neo-1.3B"

happy_gen = HappyGeneration("GPT-NEO", model_to_use)
happy_gen.train("train.csv")

06/14/2021 04:17:22 - INFO - happytransformer.happy_transformer -   Using model: cuda


KeyboardInterrupt: 

In [4]:
# Clean up to free GPU VRAM
import gc
import torch

try:
    del happy_gen
except NameError:
    # happy_gen was never created (or was already deleted)
    pass
gc.collect()
torch.cuda.empty_cache()

## We can see that even the relatively small 125M model takes 10GB of VRAM to fine-tune, and the 1.3B parameter model can't fit in 24GB, let alone the 2.7B model. This was even with a batch size of 1.
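
A rough back-of-the-envelope estimate helps explain these numbers. The sketch below assumes plain fp32 training with an Adam-style optimizer, which needs about 16 bytes per parameter (4 for the weight, 4 for its gradient, and 8 for the two optimizer moments). The parameter counts are the nominal ones from the model names, and the function name is only illustrative. Activations, the CUDA context, and framework overhead come on top, which is why measured usage for the 125M model is well above this floor.

```python
# Rough lower bound on training memory, assuming fp32 weights and Adam.
# These byte counts are assumptions for illustration, not measurements.
BYTES_PER_PARAM = 4 + 4 + 8  # weight + gradient + two Adam moments

# Nominal parameter counts taken from the model names
models = {
    "gpt-neo-125M": 125e6,
    "gpt-neo-1.3B": 1.3e9,
    "gpt-neo-2.7B": 2.7e9,
}


def training_floor_gb(num_params):
    """Memory for weights, gradients, and optimizer state in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


for name, num_params in models.items():
    print(f"{name}: ~{training_floor_gb(num_params):.1f} GB before activations")
```

Under these assumptions the 1.3B model already needs about 20.8 GB before any activations are stored, which leaves little headroom on a 24GB card, and the 2.7B model needs about 43.2 GB.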

# Is there hope of fine-tuning these larger models on consumer-grade hardware, then? Yes, but we need to use a library called DeepSpeed.

# First we need to clone the DeepSpeed repo, as we must build some optional components of the package from source

## DeepSpeed is a deep learning optimization library by Microsoft that lets researchers run and train larger models than they otherwise could.

In [4]:
!git clone https://github.com/microsoft/DeepSpeed -b v0.4.0

Cloning into 'DeepSpeed'...
remote: Enumerating objects: 9447, done.
remote: Counting objects: 100% (207/207), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 9447 (delta 103), reused 109 (delta 53), pack-reused 9240
Receiving objects: 100% (9447/9447), 18.52 MiB | 10.32 MiB/s, done.
Resolving deltas: 100% (6412/6412), done.
Note: switching to '2d302d6abb2cfa181f63320da3ed1be45e34ded3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false



## We are now going to run ls to see our folder structure. We should now see DeepSpeed, and we are going to cd into that folder.

In [5]:
!ls

DeepSpeed  GPT_Neo_Fine-tune.ipynb  test  train.csv


In [6]:
%cd DeepSpeed

/mnt/shared_drive/projects/personal/finetune_vid/DeepSpeed


## If we run ls again, we will see the contents of the DeepSpeed repo

In [7]:
!ls

azure		    csrc	       install.sh   requirements  version.txt
bin		    deepspeed	       LICENSE	    SECURITY.md
CODE_OF_CONDUCT.md  DeepSpeedExamples  MANIFEST.in  setup.cfg
CODEOWNERS	    docker	       op_builder   setup.py
CONTRIBUTING.md     docs	       README.md    tests


## We are now going to install DeepSpeed from source, using a flag to ensure that all the needed ops are installed

In [8]:
!DS_BUILD_OPS=1 pip install .

Processing /mnt/shared_drive/projects/personal/finetune_vid/DeepSpeed
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... ^C
canceled
ERROR: Operation cancelled by user


## We can now make sure that DeepSpeed is properly installed by running the command below. All of the compatible ops should be installed. Some are required, like cpu_adam and transformer. Others may not be, such as async_io. Personally, all were compatible with my system except async_io.

In [10]:
!ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io .........

# Next we need to download the repo that will actually fine-tune the GPT Neo model using DeepSpeed.

## First we need to exit the DeepSpeed repo

In [11]:
%cd ..

/mnt/shared_drive/projects/personal/finetune_vid


In [12]:
!ls

DeepSpeed  GPT_Neo_Fine-tune.ipynb  test  train.csv


## Now we clone the finetuning repo

In [14]:
!git clone https://github.com/Xirider/finetune-gpt2xl

Cloning into 'finetune-gpt2xl'...
remote: Enumerating objects: 354, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 354 (delta 2), reused 0 (delta 0), pack-reused 304
Receiving objects: 100% (354/354), 3.41 MiB | 3.23 MiB/s, done.
Resolving deltas: 100% (221/221), done.


## Now we need to enter the cloned repo

In [1]:
!ls

DeepSpeed  finetune-gpt2xl  GPT_Neo_Fine-tune.ipynb  train.csv


In [1]:
%cd finetune-gpt2xl

/mnt/shared_drive/projects/personal/finetune_vid/finetune-gpt2xl


## Lastly, we need to download the datasets library that this repo uses.

In [2]:
!pip install datasets



# At this point we are able to finetune GPT Neo (including 2.7B) and other GPT models

## For GPT Neo 2.7B, we need a high-end machine. Roughly 70GB of RAM is the minimum required, along with roughly 16GB of VRAM. GPT Neo 1.3B and the smaller GPT-2 models have lower requirements. If you don't have a powerful enough machine, one can be rented from a cloud provider for a reasonable price.
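
The shift from VRAM to system RAM can be sketched with some simple arithmetic. This assumes mixed-precision training in which the fp16 weights stay on the GPU (2 bytes per parameter) and the fp32 optimizer state (master weights plus two Adam moments, 12 bytes per parameter) is offloaded to CPU memory. Whether ds_config_gptneo.json actually enables this offload is an assumption here, and gradients, activations, and communication buffers are left out, so the real requirements quoted above are higher than these floors.

```python
# Sketch of where memory goes with CPU optimizer offload (assumed setup).
NUM_PARAMS = 2.7e9            # nominal size of GPT Neo 2.7B
GPU_BYTES_PER_PARAM = 2       # fp16 weights kept on the GPU
CPU_BYTES_PER_PARAM = 4 + 8   # fp32 master weights + two Adam moments

gpu_gb = NUM_PARAMS * GPU_BYTES_PER_PARAM / 1e9
cpu_gb = NUM_PARAMS * CPU_BYTES_PER_PARAM / 1e9

print(f"GPU (fp16 weights only): ~{gpu_gb:.1f} GB")
print(f"CPU (optimizer state):   ~{cpu_gb:.1f} GB")
```

Under these assumptions the bulk of the training state (about 32.4 GB) lives in system RAM while only about 5.4 GB of weights must sit on the GPU, which is the general reason a 2.7B model can train on a card that could not hold its full optimizer state.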

## Let's now finetune the model on the provided Shakespeare dataset with the example flags
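
Two of the flags in the command below jointly set the effective batch size: each optimizer step sees per_device_train_batch_size × gradient_accumulation_steps × number of GPUs sequences. A minimal sketch of that arithmetic, using the values from the command (the variable names are only illustrative):

```python
# Values copied from the deepspeed command in this notebook
num_gpus = 1
per_device_train_batch_size = 4
gradient_accumulation_steps = 2

# Sequences contributing to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(f"Effective batch size: {effective_batch_size}")

# The progress bar in the training log counts optimizer steps (178 here),
# so one epoch covers roughly this many training sequences.
steps_per_epoch = 178
approx_sequences = steps_per_epoch * effective_batch_size
print(f"Approximate sequences per epoch: {approx_sequences}")
```

If you run out of VRAM, lowering per_device_train_batch_size while raising gradient_accumulation_steps by the same factor keeps the effective batch size unchanged.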

In [3]:
!deepspeed --num_gpus=1 run_clm.py \
--deepspeed ds_config_gptneo.json \
--model_name_or_path EleutherAI/gpt-neo-1.3B \
--train_file train.csv \
--validation_file validation.csv \
--do_train \
--do_eval \
--fp16 \
--overwrite_cache \
--evaluation_strategy="steps" \
--output_dir finetuned \
--num_train_epochs 1 \
--eval_steps 15 \
--gradient_accumulation_steps 2 \
--per_device_train_batch_size 4 \
--use_fast_tokenizer False \
--learning_rate 5e-06 \
--warmup_steps 10

[2021-06-14 15:33:34,610] [INFO] [runner.py:360:main] cmd = /home/blake/anaconda3/envs/gptneo_finetuned/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config_gptneo.json --model_name_or_path EleutherAI/gpt-neo-1.3B --train_file train.csv --validation_file validation.csv --do_train --do_eval --fp16 --overwrite_cache --evaluation_strategy=steps --output_dir finetuned --num_train_epochs 1 --eval_steps 15 --gradient_accumulation_steps 2 --per_device_train_batch_size 4 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10
[2021-06-14 15:33:35,151] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2021-06-14 15:33:35,152] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-06-14 15:33:35,152] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-06-14 15:33:35,152] [INFO] [launch.py:102:main] d

[INFO|modeling_utils.py:1155] 2021-06-14 15:33:39,062 >> loading weights file https://huggingface.co/EleutherAI/gpt-neo-1.3B/resolve/main/pytorch_model.bin from cache at /home/blake/.cache/huggingface/transformers/7c5fac9d60b015cbc7c007ab8fe6d0512787fbaef81968922959898c49468d73.4c6a483fbfb5a25ac384bfcd71a1ff15245f06583a00c4ab4c44ed0f761f0b08
[INFO|modeling_utils.py:1339] 2021-06-14 15:33:52,868 >> All model checkpoint weights were used when initializing GPTNeoForCausalLM.

[INFO|modeling_utils.py:1348] 2021-06-14 15:33:52,868 >> All the weights of GPTNeoForCausalLM were initialized from the model checkpoint at EleutherAI/gpt-neo-1.3B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoForCausalLM for predictions without further training.
100%|█████████████████████████████████████████████| 1/1 [00:04<00:00,  4.50s/ba]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 60.93ba/s]
  f"The tokenizer picked seems to h

  8%|███▌                                      | 15/178 [01:45<19:19,  7.12s/it][INFO|trainer.py:2115] 2021-06-14 15:36:01,866 >> ***** Running Evaluation *****
[INFO|trainer.py:2117] 2021-06-14 15:36:01,866 >>   Num examples = 4
[INFO|trainer.py:2120] 2021-06-14 15:36:01,866 >>   Batch size = 8

{'eval_loss': 3.630859375, 'eval_runtime': 0.3171, 'eval_samples_per_second': 12.613, 'epoch': 0.08}
  8%|███▌                                      | 15/178 [01:46<19:19,  7.12s/it]
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 657.72it/s]
 17%|███████                                   | 30/178 [03:31<17:28,  7.09s/it][INFO|trainer.py:2115] 2021-06-14 15:37:47,791 >> ***** Running Evaluation *****
[INFO|trainer.py:2117] 2021-06-14 15:37:47,791 >>   Num examples = 4
[INFO|trainer.py:2120] 2021-06-14 15:37:47,791 >>   Batch size = 8

                                                   

[INFO|trainer.py:2115] 2021-06-14 15:55:02,058 >> ***** Running Evaluation *****
[INFO|trainer.py:2117] 2021-06-14 15:55:02,058 >>   Num examples = 4
[INFO|trainer.py:2120] 2021-06-14 15:55:02,059 >>   Batch size = 8
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  3.66it/s]
[INFO|trainer_pt_utils.py:907] 2021-06-14 15:55:02,432 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:912] 2021-06-14 15:55:02,432 >>   epoch                     =        1.0
[INFO|trainer_pt_utils.py:912] 2021-06-14 15:55:02,432 >>   eval_loss                 =     3.6035
[INFO|trainer_pt_utils.py:912] 2021-06-14 15:55:02,432 >>   eval_mem_cpu_alloc_delta  =        0MB
[INFO|trainer_pt_utils.py:912] 2021-06-14 15:55:02,432 >>   eval_mem_cpu_peaked_delta =        0MB
[INFO|trainer_pt_utils.py:912] 2021-06-14 15:55:02,432 >>   eval_mem_gpu_alloc_delta  =        0MB
[INFO|trainer_pt_utils.py:912] 2021-06-14 15:55:02,432 >>   eval_mem_gpu_peaked_delta =     2371MB
[INFO|trainer_pt_utils.py