<a href="https://colab.research.google.com/github/royam0820/HuggingFace/blob/main/amr_dataset_generate_images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://www.youtube.com/watch?v=z2QE12p3kMM


# Setup

In [None]:
!python --version

Python 3.10.12


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Why LLM Fine-Tuning?

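Full fine-tuning retrains all of a model's parameters for a specific task or domain; parameter-efficient methods such as LoRA, which this notebook uses through the `--use_peft` option, train only a small set of added weights. Fine-tuning a Large Language Model (LLM) is beneficial for several reasons: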

- **Domain Specificity**: General-purpose language models are trained on a wide variety of data and are not specialized in any particular domain. Fine-tuning allows you to adapt the model to specific industries, topics, or types of language, such as medical terminology, legal jargon, or technical language.

- **Improved Accuracy**: Fine-tuning on a specific dataset can improve the model's performance on tasks related to that data. This could mean more accurate classifications, better sentiment analysis, or more relevant generated text.

- **Resource Efficiency**: Fine-tuning only a subset of the model's parameters can be more computationally efficient than training a new model from scratch. This can be particularly important when computational resources are limited.

- **Data Privacy**: If you have sensitive or proprietary data, fine-tuning a pre-trained model on your own infrastructure allows you to benefit from the capabilities of large language models without sharing your data externally.

- **Task Adaptation**: General-purpose language models are not optimized for specific tasks like question-answering, summarization, or translation. Fine-tuning can adapt the model for these specialized tasks.

- **Contextual Understanding**: Fine-tuning can help the model better understand the context in which it will be used, making it more effective at generating appropriate and useful responses.

- **Reduced Training Time**: Starting with a pre-trained model and fine-tuning it for a specific task can be much faster than training a model from scratch.

- **Avoid Overfitting**: When you have a small dataset, training a large model from scratch can lead to overfitting. Fine-tuning can mitigate this risk, as the model has already learned general language features from a large dataset and only needs to adapt to the specificities of the new data.

- **Leverage Pre-trained Features**: Large language models trained on extensive datasets have already learned a wide array of features, from basic syntax and grammar to high-level semantic understanding. Fine-tuning allows you to leverage these features for your specific application.

- **Customization**: Fine-tuning allows you to tailor the model's behavior to specific requirements, such as generating text in a particular style, tone, or format.

In summary, fine-tuning a large language model allows you to customize its capabilities for specific tasks, domains, or datasets, improving its performance and making it more applicable to your particular needs.
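To put a rough number on the resource-efficiency point, the sketch below compares the trainable weights of a low-rank (LoRA) adapter with those of the full weight matrix it adapts. The rank `r = 16` matches the `lora_r=16` value reported in the training parameters later in this notebook; the 4096 x 4096 layer shape is an assumption chosen for illustration, not a value taken from this notebook.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Number of weights in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 4096  # assumed layer shape, for illustration only
r = 16               # LoRA rank

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, r)

print(f"full matrix : {full:,} weights")
print(f"LoRA adapter: {lora:,} weights ({lora / full:.2%} of the full matrix)")
```

Under these assumptions the adapter trains well under one percent of the weights of the layer it modifies, which is why adapter-based fine-tuning fits on a single Colab GPU.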

In [None]:
# retrieve the fine-tuned LLM archive from Google Drive
!unzip /content/drive/MyDrive/llm_tuning/llama2-MJ-prompts.zip

Archive:  /content/drive/MyDrive/llm_tuning/llama2-MJ-prompts.zip
   creating: content/llama2-MJ-prompts/
 extracting: content/llama2-MJ-prompts/added_tokens.json  
  inflating: content/llama2-MJ-prompts/adapter_model.bin  
  inflating: content/llama2-MJ-prompts/tokenizer.model  
  inflating: content/llama2-MJ-prompts/special_tokens_map.json  
   creating: content/llama2-MJ-prompts/checkpoint-7/
 extracting: content/llama2-MJ-prompts/checkpoint-7/added_tokens.json  
  inflating: content/llama2-MJ-prompts/checkpoint-7/adapter_model.bin  
  inflating: content/llama2-MJ-prompts/checkpoint-7/tokenizer.model  
  inflating: content/llama2-MJ-prompts/checkpoint-7/trainer_state.json  
  inflating: content/llama2-MJ-prompts/checkpoint-7/special_tokens_map.json  
  inflating: content/llama2-MJ-prompts/checkpoint-7/rng_state.pth  
  inflating: content/llama2-MJ-prompts/checkpoint-7/pytorch_model.bin  
  inflating: content/llama2-MJ-prompts/checkpoint-7/README.md  
  inflating: content/llama2-MJ-p

# GPT4 - Code interpreter - Dataset creation

https://chat.openai.com/share/5380107c-821d-4849-993b-938fc8545268

The goal is to build a dataset of concept-description pairs: for each concept, ChatGPT writes a detailed description that can be used as a prompt for an AI image generator.

Below is the prompt:
```
create a dataset that contains concept-prompt pair. For each of the concepts like "A person walking in the rain" create a detailed description that can be used by an AI image generator to create images.
```

Based on the prompt example above, ChatGPT will give you several records to satisfy your request. The dataset generated contains two fields:
- the `concept` and
- the `description`

In your prompt, make sure to give the language model an example of what you want (here, the concept "A person walking in the rain"), so that it can follow the same pattern and return many example rows.

Repeat this process several times until the dataset has at least 300 rows, so that the model has enough examples to learn from during training.

Then ask ChatGPT to add a column called `text` that combines both fields in a specific format for the LLM, using the markers `###Human:` and `###Assistant:`:

```
create another column called text which follows the following structure:
text=f"###Human:\ngenerate a midjourney prompt for {concept}\n\n###Assistant:\n{description}"
```
The added `text` column holds the concept and the description, separated by the `###Human:` and `###Assistant:` markers. Fine-tuning is based on this `text` column.

Below is an example of a generated output:

```
###Human: generate a midjourney prompt for A sunset over the mountains ###Assistant: The sun is setting behind jagged mountain peaks. The sky is filled with shades of orange, pink, and purple, casting a warm glow on the mountains. Clouds lightly scattered, allowing the colors to shine through.
```
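If you prefer to build the `text` column yourself rather than asking ChatGPT, a minimal sketch could look like the following. It reuses the f-string format given above; the helper name `build_text` and the use of the standard-library `csv` module (instead of pandas) are assumptions made to keep the example self-contained.

```python
import csv
import io


def build_text(concept: str, description: str) -> str:
    """Combine one concept/description pair into the training format."""
    return f"###Human:\ngenerate a midjourney prompt for {concept}\n\n###Assistant:\n{description}"


# One example row, taken from the generated output shown above.
rows = [
    {
        "concept": "A sunset over the mountains",
        "description": "The sun is setting behind jagged mountain peaks.",
    },
]

for row in rows:
    row["text"] = build_text(row["concept"], row["description"])

# Write the three columns as CSV text (in memory here; use a file for real data).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["concept", "description", "text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(rows[0]["text"])
```

Fields that contain newlines are quoted automatically by the CSV writer, so the multi-line `text` values survive a round trip through the file.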

These dataset creation steps give you a CSV file called `train.csv`, encoded in UTF-8.


```
df.to_csv('train.csv', encoding='utf-8', index=False)
```



Now we are ready to fine-tune a large language model.


### Issue with the train.csv file
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 18905: invalid start byte


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('train.csv',  encoding='unicode_escape')

In [None]:
df.to_csv('train.csv', encoding='utf-8', index=False)

In [None]:
df=pd.read_csv('train.csv')

In [None]:
df.text[0]

'###Human:\n\ngenerate a midjourney prompt for A person walking in the rain\n\n####Assistant:\nA young adult wearing a navy-blue raincoat and matching rain boots walks on a wet cobblestone street. Raindrops create ripples in the puddles. They hold a red umbrella that shields them from the pouring rain. Their face is relaxed, enjoying the rainfall.'
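A `UnicodeDecodeError` like the one above usually means the file was saved in an encoding other than UTF-8. As an alternative to the `unicode_escape` workaround, the sketch below tries a short list of candidate encodings and reports the first one that decodes the raw bytes. The helper name `decode_with_fallback` and the candidate list are assumptions; the actual source encoding depends on how the file was exported.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (decoded text, encoding used), trying each candidate in order."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0x89 is not a valid UTF-8 start byte, as in the error above.
sample = b"text\nA person walking in the rain \x89\n"
decoded, used = decode_with_fallback(sample)
print(used)
```

Once a working encoding is found, it can be passed to `pd.read_csv('train.csv', encoding=used)` before re-saving the file as UTF-8. Check a few rows by eye afterwards, since a wrong but permissive encoding decodes without errors and still produces garbled characters.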

NB:  LINKS:

- autotrain: https://huggingface.co/autotrain
- autotrain GitHub: https://github.com/huggingface/autotr...

# Training with autotrain-advanced
🤗 AutoTrain is a no-code tool for training state-of-the-art models for Natural Language Processing (NLP), Computer Vision (CV), Speech, and Tabular tasks. It is built on top of the tools developed by the Hugging Face team and is designed to be easy to use.

[Autotrain](https://huggingface.co/docs/autotrain/index)




## Autotrain Help



In [None]:
# #autotrain help
# !autotrain llm --help

usage: autotrain <command> [<args>] llm [-h] [--train] [--deploy] [--inference]
                                        [--data_path DATA_PATH] [--train_split TRAIN_SPLIT]
                                        [--valid_split VALID_SPLIT] [--text_column TEXT_COLUMN]
                                        [--rejected_text_column REJECTED_TEXT_COLUMN]
                                        [--model MODEL] [--learning_rate LEARNING_RATE]
                                        [--num_train_epochs NUM_TRAIN_EPOCHS]
                                        [--train_batch_size TRAIN_BATCH_SIZE]
                                        [--warmup_ratio WARMUP_RATIO]
                                        [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                                        [--optimizer OPTIMIZER] [--scheduler SCHEDULER]
                                        [--weight_decay WEIGHT_DECAY]
                                        [--max_grad_norm MAX_GRAD_NORM] [-

## AutoTrain Setup

In [None]:
!pip install -U autotrain-advanced
!pip install -U huggingface_hub

Collecting autotrain-advanced
  Downloading autotrain_advanced-0.6.41-py3-none-any.whl (128 kB)
Collecting codecarbon==2.2.3 (from autotrain-advanced)
  Downloading codecarbon-2.2.3-py3-none-any.whl (174 kB)

Collecting huggingface_hub
  Using cached huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.17.3
    Uninstalling huggingface-hub-0.17.3:
      Successfully uninstalled huggingface-hub-0.17.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tokenizers 0.14.1 requires huggingface_hub<0.18,>=0.16.4, but you have huggingface-hub 0.18.0 which is incompatible.
Successfully installed huggingface_hub-0.18.0


NB: Restart the runtime after this installation so that the updated packages are loaded.

In [None]:
!autotrain setup --update-torch
#!autotrain setup

> INFO    Installing latest transformers@main
> INFO    Successfully installed latest transformers
> INFO    Installing latest peft@main
> INFO    Successfully installed latest peft
> INFO    Installing latest diffusers@main
> INFO    Successfully installed latest diffusers
> INFO    Installing latest trl@main
> INFO    Successfully installed latest trl
> INFO    Installing latest xformers
> INFO    Successfully installed latest xformers
> INFO    Installing latest PyTorch
> INFO    Successfully installed latest PyTorch


In [None]:
# log in to the HF Hub; this prompts for your access token
from huggingface_hub import login
login()


## AutoTrain training

The command below could be written on a single line, as in this example, but in the code cell it is split across several lines so that each option is easy to see.



```
autotrain llm --train --project_name my-llm --model meta-llama/Llama-2-7b-hf --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 12 --num_train_epochs 3 --trainer sft
```
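Long shell commands are easy to break with a stray `#` or a missing backslash. One way to sidestep that, sketched below, is to assemble the argument list in Python and print the resulting command. Only flags that already appear in this notebook are used; splitting them into `switches` and `options` is an illustrative choice, and the snippet builds the command without running it.

```python
import shlex

# Flags that take no value.
switches = ["--train", "--use_peft", "--use_int4", "--fp16"]

# Flags that take a value (all values are taken from this notebook).
options = {
    "--project_name": "llama2-MJ-prompts-v3",
    "--model": "abhishek/llama-2-7b-hf-small-shards",
    "--data_path": ".",
    "--text_column": "text",
    "--learning_rate": "2e-4",
    "--train_batch_size": "4",
    "--num_train_epochs": "3",
    "--trainer": "sft",
}

cmd = ["autotrain", "llm", *switches]
for flag, value in options.items():
    cmd += [flag, value]

# Print a copy-pasteable, correctly quoted one-line command.
print(shlex.join(cmd))
```

To disable an option, comment out its entry in the dictionary; the remaining arguments are unaffected. The list can also be passed to `subprocess.run(cmd, check=True)` to launch training from Python.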



In [None]:
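# fine-tune LLM
# NB: keep commented-out options outside the command. A `#` inside a continued
# command comments out everything after it, including `--trainer sft` and the
# log redirect (the Params output below, which reports trainer='default', was
# recorded with such options commented out inline).
# Optional flags for pushing to the Hub ($HF_TOKEN is a placeholder for your own token):
#   --push_to_hub --token "$HF_TOKEN" --repo_id 'llama2-MJ-prompts-v3'
!autotrain llm \
--train \
--project_name 'llama2-MJ-prompts-v3' \
--model 'abhishek/llama-2-7b-hf-small-shards' \
--data_path . \
--text_column text \
--use_peft \
--use_int4 \
--fp16 \
--learning_rate 2e-4 \
--train_batch_size 4 \
--num_train_epochs 3 \
--trainer sft > training.log &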



> INFO    Running LLM
> INFO    Params: Namespace(version=False, train=True, deploy=False, inference=False, data_path='.', train_split='train', valid_split=None, text_column='text', rejected_text_column='rejected', model='abhishek/llama-2-7b-hf-small-shards', learning_rate=0.0002, num_train_epochs=3, train_batch_size=4, warmup_ratio=0.1, gradient_accumulation_steps=1, optimizer='adamw_torch', scheduler='linear', weight_decay=0.0, max_grad_norm=1.0, seed=42, add_eos_token=False, block_size=-1, use_peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.05, logging_steps=-1, project_name='llama2-MJ-prompts-v3', evaluation_strategy='epoch', save_total_limit=1, save_strategy='epoch', auto_find_batch_size=False, fp16=True, push_to_hub=False, use_int8=False, model_max_length=1024, repo_id=None, use_int4=True, trainer='default', target_modules=None, merge_adapter=False, token=None, backend='default', username=None, use_flash_attention_2=False, log='none', disable_gradient_checkpointin

In [None]:
# saving the fine-tune training folder
#!zip -r /content/llama2-MJ-prompts.zip /content/llama2-MJ-prompts

# Inference


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn import DataParallel #for multiple gpus

In [None]:
# accessing the newly fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("/content/llama2-MJ-prompts-v3")
model = AutoModelForCausalLM.from_pretrained("/content/llama2-MJ-prompts-v3")

Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]



In [None]:
input_context = '''
"###Human:
generate a midjourney prompt for a child running in a rain, give a detailed description

####Assistant:
'''

NB: the `####Assistant:` section is left empty because the newly fine-tuned model generates the prompt that follows it.

In [None]:
input_ids = tokenizer.encode(input_context, return_tensors='pt')

In [None]:
# temperature only takes effect when sampling is enabled, hence do_sample=True
output = model.generate(input_ids, max_length=85, do_sample=True, temperature=0.3, num_return_sequences=1)



In [None]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


"###Human:
generate a midjourney prompt for a child running in a rain, give a detailed description

####Assistant:
A child is running in the rain, with a smile on their face. They are wearing a yellow raincoat and blue rain boots. They are holding an umbrella in one hand and a toy in the other. They


In [None]:
inputs = tokenizer(input_context, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, do_sample=True, max_length=85, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



"###Human:
generate a midjourney prompt for a child running in a rain, give a detailed description

####Assistant:
The child is running in the rain, wearing a raincoat and holding an umbrella. The rain is falling steadily, and the child is laughing and splashing in the puddles. The child's
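As the outputs above show, the decoded text repeats the input prompt before the generated description. A small helper like the one below can keep only the part after the assistant marker. The function name `extract_completion` and the default marker string are assumptions for illustration; the sample text is shortened from the output above.

```python
def extract_completion(generated: str, marker: str = "Assistant:") -> str:
    """Return the text after the last occurrence of `marker`, stripped of whitespace.

    If the marker is absent, the whole string is returned (stripped).
    """
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()


sample = (
    '"###Human:\n'
    "generate a midjourney prompt for a child running in a rain, give a detailed description\n\n"
    "####Assistant:\n"
    "The child is running in the rain, wearing a raincoat and holding an umbrella."
)

print(extract_completion(sample))
```

Splitting on `Assistant:` rather than on the full `###Assistant:` marker keeps the helper tolerant of the number of `#` characters in the prompt.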


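NB:
- `**inputs`: Unpacks the tokenized input (`input_ids` and `attention_mask`) into keyword arguments.
- `num_beams=4`: Beam search with 4 beams. More beams can improve quality at the cost of speed.
- `do_sample=True`: Enables sampling, which makes the output more random. It switches generation from deterministic to probabilistic (stochastic).
- `max_length=85`: The maximum total length in tokens, counting the prompt as well as the generated text, which is why the outputs above are cut off mid-sentence.
- `eos_token_id=tokenizer.eos_token_id`: Generation stops early if the end-of-sequence token is produced.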

> **Beam search** is a search algorithm used for finding the most likely sequence of tokens when generating text from a language model. It is commonly employed in natural language processing tasks like machine translation, text summarization, and text generation.
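To make the idea concrete, here is a minimal beam search over a toy next-token table. This is a sketch of the general algorithm, not the implementation used by `model.generate`; the table `TOY_MODEL`, its probabilities, and the function names are all made up for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[List[str], float]  # (token sequence, cumulative log-probability)


def beam_search(
    next_probs: Callable[[List[str]], Dict[str, float]],
    start: str,
    eos: str,
    num_beams: int,
    max_steps: int,
) -> List[Beam]:
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams: List[Beam] = [([start], 0.0)]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for seq, score in beams:
            if seq[-1] == eos:  # finished sequences are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, p in next_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams


# Hypothetical toy model: the next-token distribution depends only on the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def toy_next_probs(seq: List[str]) -> Dict[str, float]:
    return TOY_MODEL[seq[-1]]


greedy = beam_search(toy_next_probs, "<s>", "</s>", num_beams=1, max_steps=3)[0]
beam = beam_search(toy_next_probs, "<s>", "</s>", num_beams=2, max_steps=3)[0]
print("greedy:", greedy[0], round(math.exp(greedy[1]), 2))
print("beam 2:", beam[0], round(math.exp(beam[1]), 2))
```

With one beam the search is greedy: it commits to `a` because it is the most likely first token, and ends with a sequence of probability 0.33. With two beams the less likely first token `b` is kept alive long enough to reveal the better overall sequence, with probability 0.36.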

# Merging Models and Uploading to HF Hub.

The base model used for fine-tuning is loaded with this class:

```
from transformers import AutoModelForCausalLM
```

> The base model directory contains a **`config.json`** file.

The resulting fine-tuned model, which is a PEFT adapter, is loaded with this class:



```
from peft import PeftModel
```

> The fine-tuned adapter directory contains an **`adapter_config.json`** file.

The script below merges the adapter into the base model to produce a standalone fine-tuned model and uploads it to the HF Hub. Once that is done, the model is ready for production.

Hugging Face [inference endpoints](https://ui.endpoints.huggingface.co/royam0820/endpoints) can then be used to deploy it.
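Conceptually, merging a LoRA adapter folds its low-rank update into each adapted base weight matrix: `W_merged = W + (alpha / r) * B @ A`. The sketch below shows that arithmetic on tiny hand-written matrices using plain Python lists. It illustrates the standard LoRA formula and is not the code that `merge_and_unload()` runs; the matrix values and helper names are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: W is 2 x 3, A is r x 3, B is 2 x r, with rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]

merged = merge_lora(W, A, B, alpha=2.0, r=1)
print(merged)
```

After merging, a single matrix produces the same outputs as the base weight plus the adapter path, so the merged model no longer needs the `peft` library at inference time.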


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# the original model used for fine-tuning
base_model_path = "abhishek/llama-2-7b-hf-small-shards"
# the fine-tuned model
adapter_path = "/content/llama2-MJ-prompts-v3"

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(model, adapter_path)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = model.merge_and_unload()
print('Merge Complete')

hf_repo = "royam0820/my-llm-v1"
access_token = "<YOUR_HF_WRITE_TOKEN>"  # placeholder: use your own write token and never commit a real one

model.push_to_hub(f"{hf_repo}", use_temp_dir=True, use_auth_token=access_token)
print('Model pushed to Hub')
tokenizer.push_to_hub(f"{hf_repo}", use_temp_dir=True, use_auth_token=access_token)
print('Tokenizer pushed to Hub')


Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]

Merge Complete



Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.


pytorch_model-00002-of-00003.bin:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

pytorch_model-00001-of-00003.bin:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

pytorch_model-00003-of-00003.bin:   0%|          | 0.00/3.59G [00:00<?, ?B/s]

Model pushed to Hub




tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Tokenizer pushed to Hub


# Documentation
The **`model.generate`** method is used to generate text with the transformer models provided by the Hugging Face Transformers library. Below are some commonly used arguments for `model.generate`:

- **`input_ids`**: Tensor containing the token IDs to be fed into the model.

- **`max_length`**: Maximum total length of the sequence, counting the prompt tokens plus the generated tokens. Generation stops once this length is reached. Use `max_new_tokens` to limit only the generated part.

- **`min_length`**: Minimum sequence length for the generated text. The model will continue generating until this length is reached.

- **`do_sample`**: Whether to sample the next token randomly based on the distribution of the logits (True) or to take the token with the highest logit (False).

- **`temperature`**: Controls the randomness of the token sampling when `do_sample=True`. Higher values make the output more random, while lower values make it more deterministic.

- **`top_k`**: Limits the number of highest-probability tokens considered for sampling. Only relevant when `do_sample=True`.

- **`top_p`**: Also known as "nucleus sampling," this parameter sets a cumulative probability threshold. Only the smallest set of most probable tokens whose cumulative probability reaches this value is kept for sampling; the remaining tokens are excluded. Only relevant when `do_sample=True`.

- **`num_return_sequences`**: Number of different sequences to generate. Useful for getting multiple outputs for a single input.

- **`pad_token_id`**: Token ID used for padding when the generated sequence is shorter than max_length.

- **`eos_token_id`**: Token ID signaling the end of a sequence. When this token is generated, the sequence will stop.

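- **`length_penalty`**: Exponential penalty applied to the sequence length in beam-based generation. Because sequence scores are log-likelihoods (negative numbers), values > 0.0 promote longer sequences, while values < 0.0 encourage shorter sequences.

- **`early_stopping`**: Controls the stopping condition for beam-based methods. When set to `True`, generation stops as soon as there are `num_beams` complete candidates.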

- **`num_beams`**: Number of beams for beam search. Beam search is a technique that explores multiple possibilities in parallel, aiming to find the most probable sequence. Setting this to a value greater than 1 enables beam search.

- **`no_repeat_ngram_size`**: Size of the n-gram window used to prevent repetition of n-grams in the generated text.

- **`bad_words_ids`**: List of token IDs that should not appear in the generated text.

- **`attention_mask`**: Mask to apply to the attention mechanism, typically to ignore padding tokens.

- **`decoder_start_token_id`**: Token ID that should be used as the starting token for decoding in sequence-to-sequence models.

Ref.: https://huggingface.co/docs/transformers/main_classes/text_generation
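The sampling arguments above (`temperature`, `top_k`, `top_p`) can be illustrated on a toy distribution. The sketch below is a simplified, pure-Python rendering of the ideas rather than the library's implementation, which works on logit tensors; the function names and the example probabilities are assumptions.

```python
import math
from typing import Dict, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs: Dict[str, float], top_k: int = 0, top_p: float = 1.0) -> Dict[str, float]:
    """Keep the top_k most probable tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]  # renormalize after top-k

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # the token that crosses the threshold is kept
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"rain": 0.5, "sun": 0.3, "snow": 0.15, "fog": 0.05}
print(top_k_top_p_filter(probs, top_k=3))
print(top_k_top_p_filter(probs, top_p=0.7))
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5))
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0))
```

A lower temperature concentrates probability on the top token, while a higher one flattens the distribution. The filters then restrict which tokens remain eligible before one is drawn at random.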

# Resources

[Datasets for LLM training](https://github.com/Zjh-819/LLMDataHub) also [HF datasets](https://huggingface.co/datasets)


[Model Configuration](https://huggingface.co/docs/transformers/v4.34.1/en/generation_strategies#default-text-generation-configuration)

[OpenAI fine-tuning API](https://platform.openai.com/docs/guides/fine-tuning/use-a-fine-tuned-model)

[Introduction to Gradio](https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt)

[HF text generation classes](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)