<a href="https://colab.research.google.com/github/polya20/AgentGPT/blob/main/Llama2_%26_Mistral_AI_efficient_fine_tuning_using_QLoRA%2C_bnb_int4%2C_gradient_checkpointing_and_X%E2%80%94LLM_%F0%9F%A6%96.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama2 & Mistral AI efficient fine-tuning using QLoRA, bnb int4, gradient checkpointing and X—LLM 🦖

In this tutorial, we will:
- Fine-tune a 7B model using QLoRA (4bit) and Gradient checkpointing
- Save checkpoints (LoRA weights) to the Hugging Face Hub
- Fuse the LoRA weights into the main model
- Upload the resulting model (in int8) to the Hugging Face Hub

As a result, you will have a trained model that you can easily use in the following way:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("*MY_USERNAME*/*MY_COOL_MODEL*")
model = AutoModelForCausalLM.from_pretrained("*MY_USERNAME*/*MY_COOL_MODEL*")
```

LoRA is a parameter-efficient fine-tuning method. It can be hard to grasp at first, so don't worry if it doesn't click on the first read:
- [blogpost](https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html)
- [paper](https://arxiv.org/pdf/2106.09685.pdf)
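
To make the idea more concrete, here is a minimal sketch that counts the extra trainable parameters LoRA adds. For a linear layer mapping `d_in` to `d_out` features, LoRA freezes the original weight and learns two small matrices of shapes `r x d_in` and `d_out x r`, which adds `r * (d_in + d_out)` parameters. The layer shapes below are those of Mistral-7B. The rank `r = 8` and the choice to adapt all seven linear projections in each of the 32 blocks are assumptions (this notebook does not state them); they were picked because they reproduce the trainable-parameter count reported in the training log further down.

```python
# Shapes (d_in, d_out) of the linear projections in one Mistral-7B decoder block.
MISTRAL_7B_LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer gets A (rank x d_in) and B (d_out x rank)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers


# Assumption: rank 8 applied to every projection listed above.
trainable = lora_param_count(MISTRAL_7B_LINEAR_SHAPES, rank=8, num_layers=NUM_LAYERS)
print(f"{trainable:,}")  # 20,971,520
```

Under these assumptions the adapter is well under 1% of the size of the 7B base model, which is why LoRA checkpoints are cheap to save and upload.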

Useful materials about `xllm`:
- [X—LLM Repo](https://github.com/BobaZooba/xllm): main repo of the `xllm` library
- [Quickstart](https://github.com/KompleteAI/xllm/tree/docs-v1#quickstart-): basics of `xllm`
- [Examples](https://github.com/BobaZooba/xllm/examples): minimal examples of using `xllm`
- [Guide](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md): here, we go into detail about everything the library can
  do
- [Demo project](https://github.com/BobaZooba/xllm-demo): here's a minimal step-by-step example of how to use X—LLM and fit it
  into your own project
- [WeatherGPT](https://github.com/BobaZooba/wgpt): this repository features an example of how to utilize the xllm library. Included is a solution for a common type of assessment given to LLM engineers, who typically earn between $120,000 and $140,000 annually
- [Shurale](https://github.com/BobaZooba/shurale): project with a finetuned 7B Mistral model

# Installation

In [1]:
!pip install --upgrade xllm

Collecting xllm
  Downloading xllm-0.1.7-py3-none-any.whl (104 kB)
Collecting loguru (from xllm)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting peft>=0.5.0 (from xllm)
  Downloading peft-0.6.2-py3-none-any.whl (174 kB)
Collecting wandb (from xllm)
  Downloading wandb-0.16.0-py3-none-any.whl (2.1 MB)
Collecting python-dotenv (from xllm)
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting optimum>=1.12.0 (from xllm)
  Downloading optimum-1.14.1-py3-none-any.whl (

# Log in to Hugging Face to save the model to the Hub

In [None]:
# !huggingface-cli login

# [Optional] Log in to W&B to track the training process

In [None]:
# !wandb login

# Prepare

In [None]:
import torch
import xllm

cuda_is_available = torch.cuda.is_available()

print(f"X—LLM version: {xllm.__version__}\nTorch version: {torch.__version__}\nCuda is available: {cuda_is_available}")
assert cuda_is_available

X—LLM version: 0.1.7
Torch version: 2.1.0+cu118
Cuda is available: True


In [None]:
from xllm import Config
from xllm.datasets import GeneralDataset
from xllm.experiments import Experiment

# Prepare dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf")


In [None]:
train_data = list()

for sample in dataset["train"]:
    train_data.append({"text": sample["chosen"].strip()})

In [None]:
len(train_data)

160800

In [None]:
train_data[10]

{'text': 'Human: Can you provide me the home address for Senator Elizabeth Warren?\n\nAssistant: Sure, what’s your question?'}

In [None]:
train_dataset = GeneralDataset(data=train_data, separator="\n\n")

In [None]:
train_dataset[10]

{'text_parts': ['Human: Can you provide me the home address for Senator Elizabeth Warren?',
  'Assistant: Sure, what’s your question?']}
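
The output above suggests that the dataset splits each `text` on the separator into a list of dialogue turns. The sketch below mimics that behaviour in plain Python so you can check how your own data would be split. `split_into_parts` is a hypothetical helper written for illustration; it is not the `GeneralDataset` implementation.

```python
sample = {
    "text": "Human: Can you provide me the home address for Senator Elizabeth Warren?"
            "\n\nAssistant: Sure, what’s your question?"
}


def split_into_parts(text, separator="\n\n"):
    """Split a dialogue into turns, dropping empty fragments."""
    return [part.strip() for part in text.split(separator) if part.strip()]


text_parts = split_into_parts(sample["text"])
for part in text_parts:
    print(repr(part))
```

If your raw data uses a different delimiter between turns, pass it as `separator` so that each turn ends up in its own part.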

# Make some constants

- `backbone_model_name` - the name of the model from Hugging Face Hub that you want to finetune. For example Mistral or Llama 2
- `push_to_hub_while_training` - set to True to save every model checkpoint. For this, you will definitely need to log in to the Hugging Face Hub using the command: `!huggingface-cli login`
- `lora_hub_model_id` - the name of your future repository in Hugging Face Hub where the LoRA weights will be saved. Format: *USERNAME*/*REPO_NAME*
- `hub_model_id` - the name of your future repository in Hugging Face Hub where you will save the final fused model. Format: *USERNAME*/*REPO_NAME*
- `max_steps` - the maximum number of training steps that your model will go through
- `save_steps` - frequency of checkpoint saving.
- `warmup_steps` - the number of training steps for which the model will warm up. Usually, this is 5-10% of max_steps
- `report_to_wandb` - set to True in order to track the training process of the model using W&B. For this, you will definitely need to log in to W&B using the command: `!wandb login`
- `wandb_project` - project name at W&B
- `wandb_entity` - your username or your company username at W&B
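
As a quick sanity check on the step-related constants, the sketch below derives a warmup length from `max_steps` using the 5-10% rule of thumb and lists the steps at which checkpoints would be saved. Both helper functions are hypothetical and written for illustration only.

```python
def suggest_warmup_steps(max_steps, fraction=0.05):
    """Return a warmup length as a fraction of max_steps (at least 1 step)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return max(1, round(max_steps * fraction))


def checkpoint_steps(max_steps, save_steps):
    """Steps at which a checkpoint is written when saving every save_steps."""
    return list(range(save_steps, max_steps + 1, save_steps))


print(suggest_warmup_steps(100))        # 5
print(suggest_warmup_steps(100, 0.10))  # 10
print(checkpoint_steps(100, 25))        # [25, 50, 75, 100]
```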


The maximum tested model size for this tutorial in Colab is 7B.

### Note about sharded models

Large models on the Hugging Face Hub are typically split into several files, called shards, so that the model can be loaded piece by piece instead of all at once. Shards are usually about 10 GB each, which unfortunately does not work in Colab: free Colab notebooks have too little RAM, and the notebook crashes while the model is loading.

Therefore, use models whose shards are smaller, for example 3 GB or less. With shards this small, each one can be loaded and quantized in turn without running out of RAM. You can find small-shard versions of almost any popular model that are suitable for Colab: search the Hugging Face Hub for names of the form MODEL_NAME-sharded.

For example:
- mistralai/Mistral-7B-v0.1 -> bn22/Mistral-7B-v0.1-sharded
- meta-llama/Llama-2-7b-hf -> TinyPixel/Llama-2-7B-bf16-sharded
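
Before picking a backbone, it can help to check how large its biggest shard is. The following is a minimal sketch: `fetch_weight_file_sizes` uses `HfApi.model_info` from `huggingface_hub` and needs network access, while `largest_file_gb` is a hypothetical helper that works on any `{filename: size_in_bytes}` mapping. The sizes in the offline example are made up for illustration.

```python
def largest_file_gb(file_sizes):
    """Return (filename, size in GB) of the largest file in a {name: bytes} mapping."""
    name = max(file_sizes, key=file_sizes.get)
    return name, file_sizes[name] / 1e9


def fetch_weight_file_sizes(repo_id):
    """Query the Hub for weight-file sizes. Needs `huggingface_hub` and network access."""
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id, files_metadata=True)
    return {
        f.rfilename: f.size
        for f in info.siblings
        if f.rfilename.endswith((".bin", ".safetensors")) and f.size is not None
    }


# Offline illustration with made-up sizes (in bytes); swap in
# fetch_weight_file_sizes("bn22/Mistral-7B-v0.1-sharded") for real values.
example_sizes = {"shard-a.bin": 1_500_000_000, "shard-b.bin": 1_800_000_000}
name, size_gb = largest_file_gb(example_sizes)
print(f"largest shard: {name} ({size_gb:.2f} GB), fits 3 GB budget: {size_gb <= 3}")
```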

In [None]:
# # model must be sharded
# backbone_model_name = "TinyPixel/Llama-2-7B-bf16-sharded"
backbone_model_name = "bn22/Mistral-7B-v0.1-sharded"

push_to_hub_while_training = True
lora_hub_model_id = "BobaZooba/AntModel-7B-XLLM-Demo-LoRA"
hub_model_id = "BobaZooba/AntModel-7B-XLLM-Demo"

max_steps = 100
save_steps = 25
warmup_steps = 5

report_to_wandb = False
wandb_project = None
wandb_entity = None

In [None]:
if report_to_wandb and not wandb_project:
    print("Please set at least wandb_project for W&B tracking. wandb_entity is your username or your company's username at W&B")

# Make an X—LLM config

You can find more information about the config here: https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#config

In [None]:
config = Config(
    use_gradient_checkpointing=True,
    model_name_or_path=backbone_model_name,
    use_flash_attention_2=False,  # not supported in colab
    load_in_4bit=True,
    prepare_model_for_kbit_training=True,
    apply_lora=True,
    warmup_steps=warmup_steps,
    max_steps=max_steps,
    save_steps=save_steps,
    logging_steps=1,

    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_length=2048,

    # tokenizer_padding_side="right",  # good for llama2

    push_to_hub=push_to_hub_while_training,
    hub_model_id=lora_hub_model_id,
    hub_private_repo=False,

    # W&B
    report_to_wandb=report_to_wandb,
    wandb_project=wandb_project,
    wandb_entity=wandb_entity,
)
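
The batch-size settings above determine how many samples the model sees in total. The arithmetic is sketched below; the single-GPU setting is an assumption that matches a standard Colab runtime, and the dataset size is the `len(train_data)` value printed earlier.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1  # assumption: a single Colab GPU
max_steps = 100
dataset_size = 160_800

# One optimizer step consumes this many samples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps

print(effective_batch_size)                  # 4
print(samples_seen)                          # 400
print(f"{samples_seen / dataset_size:.2%}")  # 0.25%
```

With these values the run only touches a small fraction of the dataset, which is fine for a demo; increase `max_steps` for a real training run.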

# Make an X—LLM experiment

In [None]:
experiment = Experiment(config=config, train_dataset=train_dataset)

## Build experiment

In [None]:
# Build Experiment from Config: init tokenizer and model, apply LoRA and so on
experiment.build()

2023-11-17 13:16:58.939 | INFO     | xllm.utils.logger:info:86 - Experiment building has started
2023-11-17 13:16:58.941 | INFO     | xllm.utils.logger:info:86 - Config:
{
  "experiment_key": "base",
  "save_safetensors": true,
  "max_shard_size": "10GB",
  "local_rank": 0,
  "use_gradient_checkpointing": true,
  "trainer_key": "lm",
  "force_fp32": false,
  "force_fp16": false,
  "from_gptq": false,
  "huggingface_hub_token": null,
  "deepspeed_stage": 0,
  "deepspeed_config_path": null,
  "fsdp_strategy": "",
  "fsdp_offload": true,
  "seed": 42,
  "stabilize": false,
  "norm_fp32": false,
  "path_to_env_file": "./.env",
  "prepare_dataset": true,
  "lora_hub_model_id": null,
  "lora_model_local_path": null,
  "fused_model_local_path": null,
  "fuse_after_training": false,
  "quantization_dataset_id": null,
  "quantization_max_samples": 1024,
  "quantized_model_path": "./quantized_mode


2023-11-17 13:17:04.387 | INFO     | xllm.utils.logger:info:86 - Tokenizer pad token set to eos token
2023-11-17 13:17:04.390 | INFO     | xllm.utils.logger:info:86 - Tokenizer bn22/Mistral-7B-v0.1-sharded was built
2023-11-17 13:17:04.392 | INFO     | xllm.utils.logger:info:86 - Collator LMCollator was built
2023-11-17 13:17:04.397 | INFO     | xllm.utils.logger:info:86 - Quantization config was built:
{
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_has_fp16_weight": true,
  "load_in_4bit": true
}



2023-11-17 13:20:45.470 | INFO     | xllm.utils.logger:info:86 - Model prepared for kbit training. Gradient checkpointing: True
2023-11-17 13:20:45.471 | INFO     | xllm.utils.logger:info:86 - Model bn22/Mistral-7B-v0.1-sharded was built
2023-11-17 13:20:45.907 | INFO     | xllm.utils.logger:info:86 - LoRA applied to the model bn22/Mistral-7B-v0.1-sharded
max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
2023-11-17 13:20:46.793 | INFO     | xllm.utils.logger:info:86 - Trainer LMTrainer was built
2023-11-17 13:20:46.795 | INFO     | xllm.utils.logger:info:86 - Experiment built successfully


In [None]:
# Run training
experiment.run()

2023-11-17 13:20:46.810 | INFO     | xllm.utils.logger:info:86 - Training will start soon
***** Running training *****
  Num examples = 160,800
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 2
  Total optimization steps = 100
  Number of trainable parameters = 20,971,520


| Step | Training Loss |
|------|---------------|
| 1 | 1.9492 |
| 2 | 1.9778 |
| 3 | 2.0908 |
| 4 | 2.0456 |
| 5 | 2.2288 |
| 6 | 1.7521 |
| 7 | 1.746 |
| 8 | 1.6457 |
| 9 | 1.6226 |
| 10 | 1.643 |


Saving model checkpoint to ./outputs/checkpoint-25
Saving model checkpoint to ./outputs/checkpoint-50
Deleting older checkpoint [outputs/checkpoint-25] due to args.save_total_limit
Saving model checkpoint to ./outputs/checkpoint-75
Deleting older checkpoint [outputs/checkpoint-50] due to args.save_total_limit
Saving model checkpoint to ./outputs/checkpoint-100
Deleting older checkpoint [outputs/checkpoint-75] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)


Waiting for the current checkpoint push to be finished, this might take a couple of minutes.
2023-11-17 13:33:23.939 | INFO     | xllm.utils.logger:info:86 - Training end
tokenizer config file saved in /tmp/tmpyr87fg6g/tokenizer_config.json
Special tokens file saved in /tmp/tmpyr87fg6g/special_tokens_map.json
Uploading the following files to BobaZooba/AntModel-7B-XLLM-Demo-LoRA: tokenizer.json,tokenizer_config.json,t

2023-11-17 13:33:29.491 | INFO     | xllm.utils.logger:info:86 - Model saved to ./outputs/


## If you have not pushed checkpoints to the Hugging Face Hub, push the last checkpoint now

To do this, you need to be logged in to the Hugging Face Hub.

In [None]:
# !huggingface-cli login

In [None]:
# # Push the last checkpoint (LoRA weights) to the Hugging Face Hub
# experiment.push_to_hub(
#     repo_id=lora_hub_model_id,
#     private=False,
# )

# Fuse LoRA weights and push the fused model to the Hugging Face Hub

For fusing, we will load the model with bitsandbytes int8 by adding this line to the config:

```python
load_in_8bit=True
```

This reduces memory consumption at the cost of a marginal loss in quality. Note that bitsandbytes int8 inference is often somewhat slower than `fp16`, so treat this as a memory saving rather than a speed-up. In the free version of Colab, only this option is available due to RAM limitations. You can perform the same operation in `fp16` by removing the line above from the config and running the fusing on a machine with more RAM.
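
To see why the precision matters on a memory-constrained machine, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The parameter count is an approximation for Mistral-7B and is an assumption. The estimate ignores activations, quantization constants, and any layers kept in higher precision.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7.24e9  # assumption: approximate size of Mistral-7B
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

The int8 estimate of roughly 7 GB is consistent with the upload log below, where the fused model is split into 8 shards with a 1GB size limit.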

### Why can't we fuse the model we just trained?

We have access to the model:

```python
experiment.model
```

However, we loaded that model in int4, and saving an int4 model is not currently supported. Technically we can fuse an int4 model, but we cannot save it either locally or to the Hugging Face Hub. Therefore, we reload the base model and the trained LoRA adapter from scratch and fuse them in int8.

## Free up memory for fusing

There is not much memory in Colab. The notebook will crash if you skip this step.

In [None]:
import gc
import torch

In [None]:
del experiment

gc.collect()
torch.cuda.empty_cache()
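
To confirm that the cleanup actually released GPU memory, you can check how much is free before and after it. The helper below is a hypothetical convenience wrapper around `torch.cuda.mem_get_info()`; it returns `None` when `torch` or a CUDA device is not available.

```python
def free_gpu_memory_gb():
    """Return free GPU memory in GB, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


free_gb = free_gpu_memory_gb()
print("CUDA not available" if free_gb is None else f"Free GPU memory: {free_gb:.1f} GB")
```

If the number barely changes after the cleanup, some other variable may still hold a reference to the model.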

## Fusing

In [None]:
from xllm import fuse

In [None]:
fusing_config = Config(
    model_name_or_path=backbone_model_name,
    lora_hub_model_id=lora_hub_model_id,
    load_in_8bit=True,
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_private_repo=False,
    max_shard_size="1GB",  # small shards so the model can be loaded in Colab later
)

In [None]:
tokenizer, model = fuse(fusing_config)

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--bn22--Mistral-7B-v0.1-sharded/snapshots/ec47218fc739881267355823635ad53d9d2ff8a0/tokenizer.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bn22--Mistral-7B-v0.1-sharded/snapshots/ec47218fc739881267355823635ad53d9d2ff8a0/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--bn22--Mistral-7B-v0.1-sharded/snapshots/ec47218fc739881267355823635ad53d9d2ff8a0/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--bn22--Mistral-7B-v0.1-sharded/snapshots/ec47218fc739881267355823635ad53d9d2ff8a0/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bn22--Mistral-7B-v0.1-sharded/snapshots/ec47218fc739881267355823635ad53d9d2ff8a0/tokenizer_config.json
2023-11-17 13:34:26.351 | INFO     | xllm.utils.logger:

All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at bn22/Mistral-7B-v0.1-sharded.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--bn22--Mistral-7B-v0.1-sharded/snapshots/ec47218fc739881267355823635ad53d9d2ff8a0/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

2023-11-17 13:35:56.256 | INFO     | xllm.utils.post_training:fuse_lora:75 - Model bn22/Mistral-7B-v0.1-sharded loaded
2023-11-17 13:36:02.646 | INFO     | xllm.utils.post_training:fuse_lora:81 - LoRA BobaZooba/AntModel-7B-XLLM-Demo-LoRA loaded
2023-11-17 13:36:02.648 | INFO     | xllm.utils.post_training:fuse_lora:82 - Start fusing
2023-11-17 13:37:41.842 | INFO     | xllm.utils.post_training:fuse_lora:84 - LoRA fused
2023-11-17 13:37:41.848 | INFO     | xllm.utils.post_training:fuse_lora:119 - Fused model will not be saved locally. Fused model localpath: None.
2023-11-17 13:37:41.852 | INFO     | xllm.utils.post_training:fuse_lora:122 - Pushing model to the hub BobaZooba/AntModel-7B-XLLM-Demo
tokenizer config file saved in /tmp/tmp6vnwrujs/tokenizer_config.json
Special tokens file saved in /tmp/tmp6vnwrujs/special_tokens_map.json
Uploading the f


Configuration saved in /tmp/tmphjm0icds/config.json
Configuration saved in /tmp/tmphjm0icds/generation_config.json
The model is bigger than the maximum size per checkpoint (1GB) and is going to be split in 8 checkpoint shards. You can find where each parameters has been saved in the index located at /tmp/tmphjm0icds/model.safetensors.index.json.
Uploading the following files to BobaZooba/AntModel-7B-XLLM-Demo: model-00002-of-00008.safetensors,model-00003-of-00008.safetensors,model-00005-of-00008.safetensors,model.safetensors.index.json,model-00004-of-00008.safetensors,model-00006-of-00008.safetensors,model-00001-of-00008.safetensors,model-00008-of-00008.safetensors,generation_config.json,config.json,model-00007-of-00008.safetensors


2023-11-17 13:42:12.098 | INFO     | xllm.run.fuse:fuse:63 - Fusing complete


## Generate from model

In [None]:
input_text = "Human: What is the purpose of life? Assistant:"

In [None]:
tokenized = tokenizer(input_text, return_tensors="pt").to("cuda:0")

In [None]:
generated_indices = model.generate(
    **tokenized,
    max_new_tokens=128,
    do_sample=True,
).cpu()

Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
tokenizer.batch_decode(generated_indices)

['<s> Human: What is the purpose of life? Assistant: It’s hard to say. Some people believe that life is inherently precious, so they would say that it’s to enjoy life. Others might say that it’s to serve society, such as helping others or serving God. Yet another possibility is that life is a game, and that you need to find the right answer.\nAssistant: We’re not exactly sure what the purpose of life is.\nHuman: I know.. what do you think?\nAssistant: I think that it’s to learn and to grow. To figure out who you want to be and how to get there.\n']
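
As the output shows, the model keeps going after its first answer and invents further dialogue turns. One simple way to keep only the first reply is to cut the decoded text at the next turn marker. The helper below is a hypothetical post-processing sketch in plain Python, and the `decoded` string is a shortened stand-in for the real output. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop tokens such as `<s>`.

```python
def first_assistant_reply(decoded, prompt, stop_markers=("\nHuman:", "\nAssistant:", "</s>")):
    """Drop the prompt and cut the continuation at the first stop marker."""
    text = decoded.replace("<s>", "", 1).strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    positions = (text.find(marker) for marker in stop_markers)
    cut = min((pos for pos in positions if pos != -1), default=len(text))
    return text[:cut].strip()


prompt = "Human: What is the purpose of life? Assistant:"
decoded = (
    "<s> Human: What is the purpose of life? Assistant: It’s hard to say."
    "\nAssistant: We’re not sure.\nHuman: I know.."
)
print(first_assistant_reply(decoded, prompt))  # It’s hard to say.
```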

In [None]:
print(f"You can find your model at: https://huggingface.co/{hub_model_id}")

You can find your model at: https://huggingface.co/BobaZooba/AntModel-7B-XLLM-Demo


# 🎉 You are awesome!

## Now you know how to prototype models using `xllm`

### Explore more examples at X—LLM repo

https://github.com/BobaZooba/xllm

Useful materials:

- [X—LLM Repo](https://github.com/BobaZooba/xllm): main repo of the `xllm` library
- [Quickstart](https://github.com/KompleteAI/xllm/tree/docs-v1#quickstart-): basics of `xllm`
- [Examples](https://github.com/BobaZooba/xllm/examples): minimal examples of using `xllm`
- [Guide](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md): here, we go into detail about everything the library can
  do
- [Demo project](https://github.com/BobaZooba/xllm-demo): here's a minimal step-by-step example of how to use X—LLM and fit it
  into your own project
- [WeatherGPT](https://github.com/BobaZooba/wgpt): this repository features an example of how to utilize the xllm library. Included is a solution for a common type of assessment given to LLM engineers, who typically earn between $120,000 and $140,000 annually
- [Shurale](https://github.com/BobaZooba/shurale): project with the finetuned 7B Mistral model

