<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/train-llama2-chatbot/finetune_llama2_autotrain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

For a no-code solution, refer to this video: [LLAMA2 🦙: FINE-TUNE ON YOUR DATA WITHOUT WRITING SINGLE LINE OF CODE 🤗](https://youtu.be/uszxDfQ2qbc?si=kcsywtr8E_mAA2FL)

# Fine-tuning Llama 2 7b with AutoTrain

In this notebook, I will walk you through the steps to fine-tune Llama 2 7b on your own dataset. (For this example, we will use the Alpaca dataset from Hugging Face.)

### Install the necessary libraries

Before we get started, let's ensure we have all the necessary packages installed.

In [None]:
!pip install autotrain-advanced huggingface_hub


The step below is required for AutoTrain in Colab


In [None]:
!autotrain setup --update-torch #(optional - needed for Google Colab)

> INFO    Installing latest transformers@main
> INFO    Successfully installed latest transformers
> INFO    Installing latest peft@main
> INFO    Successfully installed latest peft
> INFO    Installing latest diffusers@main
> INFO    Successfully installed latest diffusers
> INFO    Installing latest trl@main
> INFO    Successfully installed latest trl
> INFO    Installing latest xformers
> INFO    Successfully installed latest xformers
> INFO    Installing latest PyTorch
> INFO    Successfully installed latest PyTorch


#### Getting a Hugging Face token to log in
Steps:
1. Navigate to this URL: https://huggingface.co/settings/tokens
2. Create a `write` token and copy it to your clipboard
3. Run the code below and enter your token



In [None]:
from huggingface_hub import notebook_login
notebook_login()


### Upload your dataset
Add your data set to the root directory in the Colab under the name `train.csv`. The AutoTrain command will look for your data there under that name.
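Before training, a quick sanity check can confirm the file is where AutoTrain expects it. The following is a minimal sketch, assuming the training text lives in a column named `text` (the value passed to `--text_column` later in this notebook). `check_training_csv` is a hypothetical helper written for this example, not part of AutoTrain.

```python
import csv
from pathlib import Path


def check_training_csv(path="train.csv", text_column="text"):
    """Return a list of problems found in the CSV (an empty list means it looks fine)."""
    csv_path = Path(path)
    if not csv_path.is_file():
        return [f"{csv_path} does not exist"]

    problems = []
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # The header row must contain the column AutoTrain is told to read.
        if text_column not in (reader.fieldnames or []):
            return [f"column {text_column!r} not found; columns are {reader.fieldnames}"]
        # Count rows whose training text is missing or only whitespace.
        empty_rows = sum(1 for row in reader if not (row[text_column] or "").strip())
    if empty_rows:
        problems.append(f"{empty_rows} row(s) have an empty {text_column!r} value")
    return problems
```

Calling `check_training_csv()` from the Colab root directory returns an empty list when `train.csv` is present and every row has text.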

I will be using the same dataset I used in my last video. Let's get the data.






In [None]:
!git clone https://github.com/sudarshan-koirala/youtube-stuffs.git
%cd youtube-stuffs/train-llama2-chatbot
!unzip alpaca.csv.zip alpaca.csv
%mv alpaca.csv train.csv
%mv train.csv ./../../train.csv
%cd ./../../

fatal: destination path 'youtube-stuffs' already exists and is not an empty directory.
/content/youtube-stuffs/train-llama2-chatbot
Archive:  alpaca.csv.zip
  inflating: alpaca.csv              
/content


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('train.csv')
display(data)

| Unnamed: 0 | instruction | input | output | text |
|---|---|---|---|---|
| 0 | Give three tips for staying healthy. | | 1.Eat a balanced diet and make sure to include... | Below is an instruction that describes a task.... |
| 1 | What are the three primary colors? | | The three primary colors are red, blue, and ye... | Below is an instruction that describes a task.... |
| 2 | Describe the structure of an atom. | | An atom is made up of a nucleus, which contain... | Below is an instruction that describes a task.... |
| 3 | How can we reduce air pollution? | | There are a number of ways to reduce air pollu... | Below is an instruction that describes a task.... |
| 4 | Describe a time when you had to make a difficu... | | I had to make a difficult decision when I was ... | Below is an instruction that describes a task.... |
| ... | ... | ... | ... | ... |
| 51997 | Generate an example of what a resume should li... | | Jean Tremaine\n1234 Main Street, Anytown, CA 9... | Below is an instruction that describes a task.... |
| 51998 | Arrange the items given below in the order to ... | cake, me, eating | I eating cake. | Below is an instruction that describes a task,... |
| 51999 | Write an introductory paragraph about a famous... | Michelle Obama | Michelle Obama is an inspirational woman who r... | Below is an instruction that describes a task,... |
| 52000 | Generate a list of five things one should keep... | | 1. Research potential opportunities and carefu... | Below is an instruction that describes a task.... |
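The `text` column is what AutoTrain trains on. It combines the other three columns into a single prompt. The sketch below shows how such a column could be built for your own data, assuming the commonly published Stanford Alpaca prompt template. The preview above truncates the `text` values, so the exact wording in this file is an assumption, and `build_text` is a hypothetical helper.

```python
# Assumed Alpaca-style templates: one for rows with an input, one for rows without.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)


def build_text(instruction, output, input_text=""):
    """Combine one row's fields into a single training string."""
    if input_text and input_text.strip():
        return PROMPT_WITH_INPUT.format(
            instruction=instruction, input=input_text, output=output
        )
    return PROMPT_NO_INPUT.format(instruction=instruction, output=output)


example = build_text(
    "Arrange the items given below in the order to complete the sentence.",
    "I eating cake.",
    "cake, me, eating",
)
print(example)
```

Rows with an empty `input` use the shorter template, which matches the two different openings visible in the `text` column preview.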


## Use AutoTrain to fine-tune Llama 2

In [None]:
#!autotrain -h

In [None]:
#!autotrain llm -h

## Short overview of what the command flags do.


The command in the next cell fine-tunes a large language model (LLM) with the AutoTrain library from Hugging Face. Here is a flag-by-flag breakdown:

1. `!autotrain llm --train`: This command initiates the training process for an LLM using AutoTrain.

2. `--project_name llama2-finetune-alpaca`: This specifies the name of the project.

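3. `--data_path .`: This specifies where the training data lives. Here `.` is the current directory, where AutoTrain looks for `train.csv`; a Hugging Face Hub dataset ID such as `tatsu-lab/alpaca` also works. The command also passes `--text_column text`, which names the column that holds the training text.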

4. `--model meta-llama/Llama-2-7b-hf`: This specifies the pre-trained LLM model that will be used for fine-tuning.

5. `--learning_rate 2e-4`: This sets the learning rate for the training process.

6. `--num_train_epochs 3`: This sets the number of epochs for the training process.

7. `--train_batch_size 2`: This sets the batch size for the training process.

8. `--model_max_length 2048`: This sets the maximum length of the input sequence.

9. `--use_peft`: This specifies that the PEFT (Parameter Efficient Fine-Tuning) method will be used for training. [LINK](https://huggingface.co/docs/peft/index)

10. `--use_int4`: This loads the base model with 4-bit quantization, which reduces GPU memory use during training.

11. `--trainer sft`: This specifies that the SFT (Supervised fine-tuning) method will be used for training. [LINK](https://huggingface.co/docs/trl/main/en/sft_trainer)

12. `--push_to_hub`: This specifies that the trained model will be pushed to the Hugging Face Hub.

13. `--repo_id DataScienceBasics/llama2-finetune-alpaca`: This specifies the repository ID for the trained model.

14. `--block_size 2048`: This sets the block size, that is, the maximum length in tokens of each training sequence.

15. `> training.log`: This redirects the output of the training process to a log file named "training.log".
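To make the mapping between flags and values explicit, the list above can be sketched as a Python dictionary that is turned into the same command line. This is only an illustration using the standard library; `build_autotrain_command` is a hypothetical helper, not an AutoTrain API, and the flag names are exactly those used in this notebook.

```python
import shlex

# Flags from the list above; True marks a switch that takes no value.
FLAGS = {
    "project_name": "llama2-finetune-alpaca",
    "data_path": ".",
    "text_column": "text",
    "model": "meta-llama/Llama-2-7b-hf",
    "learning_rate": "2e-4",
    "num_train_epochs": 3,
    "train_batch_size": 2,
    "model_max_length": 2048,
    "use_peft": True,
    "use_int4": True,
    "trainer": "sft",
    "push_to_hub": True,
    "repo_id": "DataScienceBasics/llama2-finetune-alpaca",
    "block_size": 2048,
}


def build_autotrain_command(flags):
    """Turn a flag dictionary into an argument list for `autotrain llm --train`."""
    args = ["autotrain", "llm", "--train"]
    for name, value in flags.items():
        if value is True:
            args.append(f"--{name}")  # boolean switch, no value
        elif value is False or value is None:
            continue  # switch turned off
        else:
            args.extend([f"--{name}", str(value)])
    return args


command = build_autotrain_command(FLAGS)
print(shlex.join(command))
```

The printed line can be pasted after `!` in a notebook cell. Keeping the flags in a dictionary makes it easy to change one value, such as `data_path`, between runs.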

In [None]:
# Fine-tune the Llama 2 model on the local train.csv we downloaded above
!autotrain llm --train \
    --project_name llama2-finetune-alpaca \
    --data_path . \
    --text_column text \
    --model meta-llama/Llama-2-7b-hf \
    --learning_rate 2e-4 \
    --num_train_epochs 3 \
    --train_batch_size 2 \
    --model_max_length 2048 \
    --use_peft \
    --use_int4 \
    --trainer sft \
    --push_to_hub \
    --repo_id DataScienceBasics/llama2-finetune-alpaca \
    --block_size 2048 > training.log
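Because the command redirects its standard output to `training.log`, progress written there does not appear in the notebook. A small helper like the one below can show the latest lines while training runs. It is a minimal sketch using only the standard library; `tail_log` is a hypothetical name, and what AutoTrain actually writes to the file is not shown in this notebook.

```python
from collections import deque
from pathlib import Path


def tail_log(path="training.log", lines=20):
    """Return the last `lines` lines of a log file, or an empty list if it is missing."""
    log_path = Path(path)
    if not log_path.is_file():
        return []
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines while streaming the file.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


for line in tail_log():
    print(line)
```

Running this cell repeatedly gives a rough view of progress without loading the whole log into memory.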

In [None]:
# Fine-tune the Llama 2 model on the tatsu-lab/alpaca dataset from the Hugging Face Hub
!autotrain llm --train \
    --project_name llama2-finetune-alpaca \
    --data_path tatsu-lab/alpaca \
    --text_column text \
    --model meta-llama/Llama-2-7b-hf \
    --learning_rate 2e-4 \
    --num_train_epochs 3 \
    --train_batch_size 2 \
    --model_max_length 2048 \
    --use_peft \
    --use_int4 \
    --trainer sft \
    --push_to_hub \
    --repo_id DataScienceBasics/llama2-finetune-alpaca \
    --block_size 2048

### This run takes hours, so I am not going ahead with it here. Once completed, the model is uploaded to your Hugging Face models.
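A rough count of optimizer steps shows why the run is slow. The sketch below assumes about 52,000 rows (a round figure based on the dataframe index shown earlier), one sequence per row, and a gradient accumulation of 1. It ignores any regrouping of text by `--block_size`, so treat the result as an order-of-magnitude estimate only; `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_rows, batch_size, epochs, grad_accum=1):
    """Estimate the total number of optimizer steps for a training run."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Assumed values: ~52,000 rows, --train_batch_size 2, --num_train_epochs 3.
print(optimizer_steps(52_000, batch_size=2, epochs=3))  # 78000
```

With tens of thousands of steps on a single Colab GPU, a multi-hour run is expected. A smaller dataset or fewer epochs reduces the step count proportionally.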

## An alternative is to use a sharded model. This might also take several hours, depending on the model you use, your internet speed, and the dataset.

- A "sharded" model, such as "TinyPixel/Llama-2-7B-bf16-sharded," is a checkpoint whose weights are split across several smaller files instead of one large file. The shards can be loaded one at a time, which lowers peak memory use and helps on memory-constrained machines such as Colab.

- How sharding is implemented, and how much it helps, varies with the model and the framework used to load it. Refer to the model's documentation or the organization that released it for details on using and deploying a sharded model.
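The idea of sharding can be illustrated with a toy example that greedily groups named weight tensors into shards under a size limit. This is a simplified sketch in the spirit of a maximum shard size setting, not the code any library uses; the tensor names and sizes are made up for illustration.

```python
def split_into_shards(tensor_sizes, max_shard_bytes):
    """Greedily group tensors, in order, into shards no larger than max_shard_bytes.

    A single tensor larger than the limit still gets a shard of its own.
    """
    shards = []
    current_size = 0
    for name, size in tensor_sizes.items():
        # Start a new shard when this tensor would push the current one over the limit.
        if not shards or current_size + size > max_shard_bytes:
            shards.append([])
            current_size = 0
        shards[-1].append(name)
        current_size += size
    return shards


# Hypothetical tensors with sizes in arbitrary units.
sizes = {"embed": 4, "layer0": 3, "layer1": 3, "head": 4}
print(split_into_shards(sizes, max_shard_bytes=7))
# [['embed', 'layer0'], ['layer1', 'head']]
```

Each inner list corresponds to one file on disk. A loader can read one file, place its tensors, and release it before reading the next, which is where the lower peak memory comes from.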

In [None]:
# Fine-tune the sharded Llama 2 model on a smaller pre-existing dataset
!autotrain llm --train \
    --project_name llama2-finetune-youtube \
    --data_path timdettmers/openassistant-guanaco \
    --model TinyPixel/Llama-2-7B-bf16-sharded \
    --text_column text \
    --learning_rate 2e-4 \
    --num_train_epochs 3 \
    --train_batch_size 2 \
    --model_max_length 2048 \
    --use_peft \
    --use_int4 \
    --trainer sft \
    --push_to_hub \
    --repo_id DataScienceBasics/llama2-finetune-youtube \
    --block_size 2048 > training.log

> INFO    Running LLM
> INFO    Params: Namespace(version=False, train=True, deploy=False, inference=False, data_path='timdettmers/openassistant-guanaco', train_split='train', valid_split=None, text_column='text', model='TinyPixel/Llama-2-7B-bf16-sharded', learning_rate=0.0002, num_train_epochs=3, train_batch_size=2, warmup_ratio=0.1, gradient_accumulation_steps=1, optimizer='adamw_torch', scheduler='linear', weight_decay=0.0, max_grad_norm=1.0, seed=42, add_eos_token=False, block_size=2048, use_peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.05, logging_steps=-1, project_name='llama2-finetune-youtube', evaluation_strategy='epoch', save_total_limit=1, save_strategy='epoch', auto_find_batch_size=False, fp16=False, push_to_hub=True, use_int8=False, model_max_length=2048, repo_id='DataScienceBasics/llama2-finetune-youtube', use_int4=True, trainer='sft', target_modules=None, merge_adapter=False, token=None, backend='default', username=None, use_flash_attention_2=False, func