<a href="https://colab.research.google.com/github/joexu22/llama2-finetune/blob/main/Fine_tune_llama_2_7b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial** - Fine-tuning Llama 2

This notebook fine-tunes Llama 2 to write 17th-century French, using an instruction dataset extracted from a few hundred French novels published between 1600 and 1700. It is a fun way to check whether fine-tuning is working (and no, ChatGPT cannot do it).

You can use it with any dataset hosted on Hugging Face, provided it follows the ###human / ###assistant data structure.
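As a rough illustration of that structure, the sketch below assembles one record from a prompt and an answer. The helper name `format_example` is hypothetical, and the exact marker spelling (`### Human:` / `### Assistant:`) is an assumption modelled on the prompt used in the inference cell of this notebook; check it against the dataset you actually use.

```python
def format_example(human: str, assistant: str = "") -> str:
    """Join a prompt and an optional answer with the human/assistant markers."""
    # Marker spelling is an assumption; adjust it to match your dataset.
    text = f"### Human: {human.strip()} ### Assistant:"
    if assistant:
        text += f" {assistant.strip()}"
    return text


# Leaving the answer empty yields a prompt ready for generation.
print(format_example("Ecrivez un texte en ancien français du 17e siècle"))
```

Keeping the answer optional lets the same helper serve both for building training records and for building inference prompts.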

# Installation

In [1]:
from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/My Drive/llama"

Mounted at /content/drive
[Errno 2] No such file or directory: '/content/drive/My Drive/llama'
/content


We are going to use a script created by Younes Belkada to fine-tune Llama 2.

In [2]:
!rm -rf e90381ed142ee80c8e7ea602b18d50f0
!git clone https://gist.github.com/Pclanglais/e90381ed142ee80c8e7ea602b18d50f0

Cloning into 'e90381ed142ee80c8e7ea602b18d50f0'...
remote: Enumerating objects: 38, done.
remote: Total 38 (delta 0), reused 0 (delta 0), pack-reused 38
Receiving objects: 100% (38/38), 6.27 KiB | 6.27 MiB/s, done.
Resolving deltas: 100% (14/14), done.


We install the dependencies, pinned to known versions.

In [3]:
!pip install accelerate==0.21.0
!pip install peft==0.4.0
!pip install bitsandbytes==0.40.2
!pip install transformers==4.30.2
!pip install trl==0.4.7

Collecting accelerate==0.21.0
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0
Collecting peft==0.4.0
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
Collecting transformers (from peft==0.4.0)
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting safetensors (from peft==0.4.
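The log above shows `peft` pulling in transformers 4.33.2 before the later cell line pins 4.30.2, so it can be worth confirming which versions ended up installed. The following is a minimal sketch using only the standard library; the `PINNED` table simply mirrors the install cell, and the helper name `check_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the install cell above.
PINNED = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "transformers": "4.30.2",
    "trl": "0.4.7",
}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pinned package is present at the expected version.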

We get inside the folder that contains the finetuning script.

In [4]:
%cd "e90381ed142ee80c8e7ea602b18d50f0"

/content/e90381ed142ee80c8e7ea602b18d50f0


We launch the script.

In [5]:
!python finetune_llama_v2.py --dataset_name "AlexanderDoria/novel17_test" --max_steps 500 --merge_and_push True --model_name "daryl149/llama-2-7b-chat-hf"

Downloading (…)lve/main/config.json: 100% 507/507 [00:00<00:00, 2.73MB/s]
Downloading (…)model.bin.index.json: 100% 26.8k/26.8k [00:00<00:00, 68.6MB/s]
Downloading shards:   0% 0/2 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00002.bin:   0% 0.00/9.98G [00:00<?, ?B/s]
Downloading (…)l-00001-of-00002.bin:   0% 10.5M/9.98G [00:00<01:50, 90.1MB/s]
Downloading (…)l-00001-of-00002.bin:   0% 21.0M/9.98G [00:00<01:41, 97.9MB/s]
Downloading (…)l-00001-of-00002.bin:   0% 31.5M/9.98G [00:00<01:39, 99.9MB/s]
Downloading (…)l-00001-of-00002.bin:   1% 52.4M/9.98G [00:00<01:36, 102MB/s]
Downloading (…)l-00001-of-00002.bin:   1% 62.9M/9.98G [00:00<01:37, 102MB/s]
Downloading (…)l-00001-of-00002.bin:   1% 73.4M/9.98G [00:00<01:38, 100MB/s]
Downloading (…)l-00001-of-00002.bin:   1% 83.9M/9.98G [00:00<01:54, 86.7MB/s]
Downloading (…)l-00001-of-00002.bin:   1% 94.4M/9.98G [00:00<01:49, 90.4MB/s]
Downloading (…)l-00001-of-00002.bin:   1% 105M/9.98G [00:01<01:46, 92.9MB/s]
Dow
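To rerun the script with a different dataset or step count, it can help to assemble the command line programmatically. The sketch below only builds the command string from the four flags used in the cell above; the helper name `build_command` is hypothetical, and no other flags of the script are assumed.

```python
import shlex


def build_command(dataset_name, model_name, max_steps=500, merge_and_push=True):
    """Build the fine-tuning command line from the flags used in this notebook."""
    args = [
        "python", "finetune_llama_v2.py",
        "--dataset_name", dataset_name,
        "--max_steps", str(max_steps),
        "--merge_and_push", str(merge_and_push),
        "--model_name", model_name,
    ]
    # shlex.join quotes any argument that needs it for the shell.
    return shlex.join(args)


print(build_command("AlexanderDoria/novel17_test", "daryl149/llama-2-7b-chat-hf"))
```

In a notebook, the resulting string can be pasted after `!` in a cell to launch the run.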

We are now going to test inference. I strongly recommend restarting the runtime first; otherwise your RAM is very likely to be saturated.
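Restarting the runtime is the most reliable way to reclaim memory. If that is inconvenient, a lighter alternative is to delete any large objects you still hold (for example with `del model`) and then ask Python and PyTorch to release what they can. The helper below is a sketch of that idea, not a guaranteed fix; the name `free_memory` is hypothetical, and it degrades gracefully when PyTorch is not installed.

```python
import gc


def free_memory() -> int:
    """Run the garbage collector and, if PyTorch sees a GPU, clear its cache."""
    collected = gc.collect()  # number of unreachable objects found
    try:
        import torch
    except ImportError:
        return collected
    if torch.cuda.is_available():
        # Returns cached, unused GPU blocks to the driver.
        torch.cuda.empty_cache()
    return collected


print(f"Garbage collector found {free_memory()} unreachable objects")
```

Memory held by live references is not released by this call, which is why a full restart remains the safer option.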

Then we load the merged model produced by the fine-tuning script.

In [8]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("results/final_merged_checkpoint")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
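A quick back-of-the-envelope calculation shows why memory is tight at this step. The sketch below estimates the memory taken by the weights alone, assuming roughly 7 billion parameters (an approximation) and ignoring activations and other overhead; the helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

If the full-precision figure exceeds what your runtime offers, one option is to pass `torch_dtype=torch.float16` to `from_pretrained`, which roughly halves the footprint of the weights.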

Then we generate some text (in old French). It helps to give the model a head start by writing the first words of the answer yourself, since this fine-tuning is rough and simple.

In [9]:
from transformers import AutoTokenizer, pipeline

# Reuse the tokenizer of the base model that was fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-chat-hf")

# `model` is the merged checkpoint loaded in the previous cell.
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

# The prompt ends with the first words of the answer to give the model a head start.
result = generator("Ecrivez un texte en ancien français du 17e siècle sur le meilleur moyen de voyager sur la lune ### Assistant: Je serois parti sur la Lune")

print(result)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'generated_text': "Ecrivez un texte en ancien français du 17e siècle sur le meilleur moyen de voyager sur la lune ### Assistant: Je serois parti sur la Lune.###  Buisson. Qui plus beau moyen fut jamais findré de voyager sur la lune que de faire une barque de cire d'un jour au lendemain et de la faire voler dans les airs jusque-lui donner la lune au visage. Car si elle est belle de son aspect terrestre, elle est encore plus belle de près, et les étoiles ne sont point si éloignées qu'on le pense. Si le vent ne vous empêche pas de prendre la barque, si la lune ne vous étonne point de sa clarté, si les étoiles ne vous éblouissent point de leur splendeur, vous verrez des choses mer"}]
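The pipeline returns the prompt together with the continuation, so a small post-processing step is useful to keep only the model's answer. The sketch below splits on the `### Assistant:` marker used in the prompt above; the helper name `extract_reply` and the shortened sample text are illustrative only.

```python
def extract_reply(generated_text: str, marker: str = "### Assistant:") -> str:
    """Return the text that follows the last occurrence of the marker."""
    _, found, reply = generated_text.rpartition(marker)
    # If the marker is absent, fall back to the full text.
    return (reply if found else generated_text).strip()


# Same shape as the pipeline output: a list of dicts with a 'generated_text' key.
sample = [{"generated_text": "Question ### Assistant: Je serois parti sur la Lune."}]
print(extract_reply(sample[0]["generated_text"]))
```

With the real output, `extract_reply(result[0]["generated_text"])` would return everything after the marker, including the head start supplied in the prompt.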
