<a href="https://colab.research.google.com/github/royam0820/HuggingFace/blob/main/amr_llama2_code_instr_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama2
LLaMA 2 is available in 3 sizes:
- 7b 7 billion parameters
- 13b 13 billion parameters
- 70b 70 billion parameters

This notebook fine-tunes the 7B-parameter model on the dataset [TokenBender/code_instructions_120k_alpaca_style](https://huggingface.co/datasets/TokenBender/code_instructions_120k_alpaca_style).

[Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) is a language model fine-tuned using supervised learning from a LLaMA 7B model on 52K instruction-following demonstrations generated from OpenAI's text-davinci-003.


# Fine Tuning a LLM
**Why do it?** To adapt a foundational LLM to a domain-specific corpus of data, e.g. a legal or medical corpus.

Here are the general steps involved in fine-tuning a LLM:

**Obtain a Pre-trained Model**: The first step is to have a pre-trained model ready for fine-tuning. Pre-training involves training the model on a large corpus of text data. The aim of this step is to learn a good representation of the language that can be used as a starting point for various tasks. OpenAI provides pre-trained models like GPT-3.

**Prepare Your Fine-Tuning Data**: The next step is to gather and prepare the data you will use for fine-tuning. This data should be relevant to the specific task or domain you want the model to perform or understand better. For example, if you're fine-tuning the model for medical text generation, you might use a corpus of medical literature. The data should be preprocessed and formatted in a way that's compatible with the model.

**Fine-Tune the Model**: Once you have your data prepared, you can start the fine-tuning process. This involves continuing the training of the model on your specific data. The learning rate during fine-tuning is usually much smaller than during pre-training because you don't want to drastically change the already learned representations. You're just trying to adapt them to your specific task.

**Evaluate the Model**: After fine-tuning, it's important to evaluate the model on a separate validation dataset to ensure it's learning the correct patterns. This can be done by using metrics relevant to your specific task, like accuracy, F1-score, perplexity, etc.

**Use the Fine-Tuned Model**: If the evaluation results are satisfactory, you can then use your fine-tuned model for your specific task. If not, you might need to go back and adjust some parameters, get more fine-tuning data, or make other changes.

> Remember, fine-tuning a model effectively requires a good understanding of the model architecture, the task at hand, and the data you're working with. It's part art, part science.

# Efficient methods used to train a LLM

- **LoRA** (Low-Rank Adaptation) adds **small sets of trainable parameters, injected into the layers of the Transformer architecture**, during fine-tuning. The original model weights are frozen and not updated; only the small sets of injected weights are updated. This greatly reduces the number of trainable parameters for downstream tasks. During backpropagation, gradients flow through the frozen pre-trained weights to the adapters, so only the adapters, which have a small memory footprint, are updated during training.
- **Quantization** means "rounding" values from one data type to another. It works by **squeezing values into data types with fewer bits**, at the cost of a small loss of precision.
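
As a rough illustration of the savings LoRA brings, the sketch below counts trainable parameters for a single linear layer. The 4096 × 4096 shape and the rank of 64 are illustrative values (they match the configuration used later in this notebook), and `lora_param_counts` is a hypothetical helper, not part of any library.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out                  # updating the whole weight matrix W
    lora = rank * d_in + d_out * rank    # A is (rank x d_in), B is (d_out x rank)
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=64)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA adapter:     {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.2%}")
```

The adapter grows linearly with the rank, so the rank is the main knob for trading capacity against memory.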


## HuggingFace Support for fine-tuning

HuggingFace has released several libraries that can be used to fine-tune LLMs easily.

These include:

- **PEFT** (Parameter Efficient Fine Tuning) which has support for LoRA.
- **Quantization** support: many models can be loaded in 8-bit and 4-bit precision using the `bitsandbytes` library. The basic way to load a model in 4-bit is to pass the argument `load_in_4bit=True` when calling the `from_pretrained` method.
- **Accelerate**: a library with many features that make it easy to reduce the memory requirements of models.
- **SFT** (Supervised Fine-Tuning Trainer): `SFTTrainer`, from the TRL library, is the trainer class for supervised fine-tuning of LLMs.

These combined techniques are used here to train llama2-7b on a code instruction dataset. Note that we set the storage type to 4-bit and the computation type to FP16.
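
To see why the 4-bit storage type matters on a single GPU, here is a back-of-the-envelope sketch of the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores activations, optimizer state and the small overhead of quantization constants; `weight_memory_gb` is a hypothetical helper.

```python
PARAMS = 7_000_000_000  # nominal parameter count of the 7B model (assumption)

BITS_PER_WEIGHT = {"float32": 32, "float16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(n_params: int, bits: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: {weight_memory_gb(PARAMS, bits):5.1f} GB")
```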

# Minimal Code

Here we use 4-bit quantization and PEFT to fine-tune Llama2-7b on a single Google Colab instance!
> Two major components that democratize the training of LLMs are **Parameter-Efficient Fine-Tuning (PEFT)** (e.g. LoRA (Low-Rank Adaptation), Adapter) and **quantization** techniques (8-bit, 4-bit).


![image](https://pbs.twimg.com/media/F1fj-SqWYAELWpM?format=jpg&name=medium)

NB:
- `AutoModelForCausalLM` is used for auto-regressive language models like all the GPT models.
- `SFTTrainer`: [Supervised fine-tuning](https://huggingface.co/docs/trl/main/en/sft_trainer) (or SFT for short) is a crucial step in RLHF. TRL provides an easy-to-use API to create SFT models and train them on your dataset with a few lines of code.







# Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Run the cells below to set up and install the required libraries. For our experiment we will need accelerate, peft, transformers, datasets, scipy and TRL to leverage SFTTrainer. We will use bitsandbytes to quantize the base model to 4-bit. We will also install einops; it was mainly used for loading Falcon, so I will remove it in later versions.

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops scipy wandb

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

NB:
- `bitsandbytes` is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
- Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
- [`Accelerate`](https://huggingface.co/docs/accelerate/index) is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code. In short, training and inference at scale made simple, efficient and adaptable.
- `einops` is a Python library that provides a flexible and powerful way to manipulate tensors and perform operations on them. The name stands for "Einstein Operations", as its syntax is inspired by the Einstein summation convention.
- `scipy`: SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.
- `wandb`: Weights & Biases (W&B) is a central dashboard to keep track of your hyperparameters, system metrics, and predictions so you can compare models live and share your findings.
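
The idea behind quantization can be sketched in a few lines of plain Python. The example below uses simple "absmax" scaling to int8 and is only an illustration of the rounding step; it is not the algorithm `bitsandbytes` implements, and the weight values are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.98, 0.45, 0.003, -0.27]  # made-up example values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

for w, qi, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {qi:+4d} -> {r:+.4f} (error {abs(w - r):.4f})")
```

Each value is stored in 8 bits instead of 32, and the round trip loses at most half a quantization step per value.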



# Dataset
https://huggingface.co/datasets/TokenBender/code_instructions_120k_alpaca_style

In [2]:
from datasets import load_dataset

dataset_name = 'TokenBender/code_instructions_120k_alpaca_style'
dataset = load_dataset(dataset_name, split="train")




In [3]:
dataset

Dataset({
    features: ['input', 'instruction', 'text', 'output'],
    num_rows: 121959
})

NB: Each record in this dataset has four fields:
- `instruction` holds the task description,
- `input` holds the input data for the task (for example, a list to operate on),
- `output` holds the expected answer,
- `text` combines the three into a single Alpaca-style prompt; this is the field used for training (`dataset_text_field="text"`).

In [4]:
dataset[0]

{'input': '[1, 2, 3, 4, 5]',
 'instruction': 'Create a function to calculate the sum of a sequence of integers.',
 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Create a function to calculate the sum of a sequence of integers. ### Input: [1, 2, 3, 4, 5] ### Output: # Python code\ndef sum_sequence(sequence):\n  sum = 0\n  for num in sequence:\n    sum += num\n  return sum',
 'output': '# Python code\ndef sum_sequence(sequence):\n  sum = 0\n  for num in sequence:\n    sum += num\n  return sum'}
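
The `text` field of the record above can be rebuilt from the other three fields. The sketch below does so for this one record; `build_text` is a hypothetical helper, and the separator layout is inferred from this single sample, so other records (for example ones with an empty `input`) may be formatted differently.

```python
HEADER = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.")


def build_text(instruction: str, input_: str, output: str) -> str:
    """Join the fields the way the sample record appears to be formatted."""
    return (f"{HEADER} ### Instruction: {instruction} "
            f"### Input: {input_} ### Output: {output}")


record = {
    "instruction": "Create a function to calculate the sum of a sequence of integers.",
    "input": "[1, 2, 3, 4, 5]",
    "output": "# Python code\ndef sum_sequence(sequence):\n  sum = 0\n  for num in sequence:\n    sum += num\n  return sum",
}

text = build_text(record["instruction"], record["input"], record["output"])
print(text)
```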

In [5]:
# log in to the HF Hub with an access token (authentication)
from huggingface_hub import login
login()


In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


# Loading the Model

In [7]:
# Loading a pre-trained transformer model for causal language modeling
# (i.e., predicting the next word in a sentence)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments

# model_name = "meta-llama/Llama-2-7b-chat-hf"
# if you're running on the Google Colab free tier, use the sharded checkpoint below instead
model_name = "abhishek/llama-2-7b-hf-small-shards"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",  # "nf4" is the 4-bit NormalFloat data type introduced in the QLoRA paper, designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to float16 for computation
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]

Let's also load the tokenizer below

In [8]:
# loading the tokenizer used for the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

NB:
- `tokenizer.pad_token = tokenizer.eos_token` sets the pad token to be the same as the EOS token. The Llama tokenizer does not define a pad token by default, so the EOS token is reused to pad batches to a common length.
- `trust_remote_code=True` allows custom Python code shipped in the model repository to be executed when loading. This can be useful when the model or tokenizer relies on custom code, but it is a security risk if the source is not trusted.

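Below we create the configuration for the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Note, however, that the configuration below does not set `target_modules`, so PEFT falls back to its defaults for Llama, which are the `q_proj` and `v_proj` attention projections (the printed model further down confirms this). Layer names such as `dense`, `dense_h_to_4h` and `dense_4h_to_h` belong to other architectures (for example Falcon) and do not exist in Llama.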

In [18]:
# LoRA - setting up hyperparameters for LoraConfig
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.03
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    task_type="CAUSAL_LM"
)

NB: LoRA hyperparameters:

- `lora_alpha=16` (int): the scaling factor for the LoRA update. The low-rank update is multiplied by `lora_alpha / r`, here 16 / 64 = 0.25.
- `lora_dropout=0.03` (float): the dropout probability applied to the input of the LoRA layers.
- `lora_r=64` means that the low-rank matrices used in the LoRA method have a **rank of 64**. The "rank" of a matrix is the maximum number of linearly independent rows or columns in the matrix. A higher rank gives the adapter more capacity at the cost of more trainable parameters, so this hyperparameter can be tuned depending on the specific task and the resources available for training.
- `target_modules`: not set here, so PEFT's default for Llama (`q_proj` and `v_proj`) is used. Names such as "`dense_h_to_4h`" and "`dense_4h_to_h`" refer to the feed-forward layers of other architectures: they project the model's hidden state h to a space that is four times larger, and then back down to the original size. **This is a common pattern in transformer models, where the input to the feed-forward network is first projected to a larger dimension (often called the expansion dimension), and then projected back to the original dimension**.

By listing modules in `target_modules`, you tell the LoRA method which layers receive adapters: each targeted linear layer gets its own pair of low-rank matrices, which are trained while the layer's original weights stay frozen.
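
To make the roles of the rank and of `lora_alpha` concrete, here is a minimal pure-Python sketch of the forward pass of one LoRA-adapted linear layer, `W x + (alpha / r) * B (A x)`. The tiny matrices and the values `alpha=4`, `r=2` are made up for readability; `matvec` and `lora_forward` are hypothetical helpers, not PEFT functions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) for one linear layer."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: 3 inputs, 3 outputs, rank 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.0], [0.0, 1.0, 1.0]]           # (r x d_in)
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]     # (d_out x r), zero-initialized
B_trained = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # made-up values after "training"
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, alpha=4, r=2))     # same as the frozen layer
print(lora_forward(W, A, B_trained, x, alpha=4, r=2))  # frozen output plus scaled update
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen layer; training then moves the output away from it through `A` and `B` only.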


NB: `peft`: [Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware](https://huggingface.co/blog/peft)

# Loading the Trainer

Here we will use the `SFTTrainer` from the TRL library, a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first load the training arguments below.

In [19]:
# setting up the training configuration for the model
output_dir = "./results"
per_device_train_batch_size = 4 # batch size per device
gradient_accumulation_steps = 4 # the number of steps to accumulate gradients before performing an optimization step
optim = "paged_adamw_32bit" # AdamW optimizer that works with 32-bit precision.
save_steps = 100 # save a model checkpoint every 100 steps
logging_steps = 10 # log training metrics every 10 steps
learning_rate = 1.4e-4 # learning rate for the optimizer
max_grad_norm = 0.3 # sets the maximum norm of the gradients for gradient clipping to avoid exploding gradients
max_steps = 200 # number of training steps
warmup_ratio = 0.03 # ratio of warmup steps; only takes effect with a scheduler that supports warmup (e.g. "constant_with_warmup")
lr_scheduler_type = "constant" # A "constant" scheduler keeps the learning rate constant throughout training.

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    # num_train_epochs=1
)

NB: **Mixed Precision Training**: This is a method used to speed up training and reduce the memory requirements of your model. In mixed precision training, some of your model's parameters and activations are stored as 16-bit floating point numbers (as opposed to the standard 32-bit), which take up less memory and can be processed faster by certain GPUs.

`fp16=True`: By setting this parameter to True, you're telling the training script to use mixed precision training. This means that the model will use a mix of 16-bit and 32-bit floating point numbers during training.
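
The limited precision of 16-bit floats can be observed with the standard library alone. The sketch below rounds Python floats to IEEE 754 half precision with `struct`; `to_fp16` is a hypothetical helper, and the numbers are illustrative. It shows one reason mixed-precision schemes typically keep some values in 32-bit: near 1.0, half precision has a spacing of about 0.001, so a small update can be rounded away entirely.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(to_fp16(0.1))      # nearest fp16 value to 0.1, which is not exactly 0.1
print(to_fp16(1.0001))   # a value slightly above 1.0 collapses onto 1.0

weight, update = 1.0, 1e-4
print(to_fp16(weight + update) == weight)  # the update is lost in fp16
```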

Then finally pass everything to the trainer.

In [20]:
# setting up the trainer for fine-tuning
from trl import SFTTrainer

#max_seq_length = 2048
max_seq_length = 1024

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map:   0%|          | 0/121959 [00:00<?, ? examples/s]

In [14]:
# ??SFTTrainer

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [21]:
# Upcast every module whose name contains "norm" (the RMSNorm layers) to torch.float32.
# nn.Module.to() converts the module's parameters in place, so no reassignment is needed.
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)


In [22]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(
            in_features=4096, out_features=4096, bias=False
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.03, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(
            in_features=4096, out_features=4096, bias=False
            (lora_dropout): ModuleDict(
              (default):

**Model Explanation**

This block of code is providing a textual representation of a transformer model class (**LlamaForCausalLM**) structure. The architecture of this model seems to be specifically designed for Causal Language Modeling (CLM), a task where the model predicts the next token in a sequence given the history of previous tokens.

Here's a breakdown of the main components:

**LlamaForCausalLM**: This is the top-level class for the model. It's a transformer model specifically designed for causal language modeling.

**LlamaModel**: This is the main body of the model, which contains the layers of the transformer model.

**Embedding**: This layer is responsible for converting input tokens into vectors of a specific size (in this case, 4096). **The model has a vocabulary size of 32000 and each token is represented as a 4096-dimensional vector**.

**ModuleList**: This is a PyTorch container for holding a list of layers. In this case, **it holds 32 LlamaDecoderLayer instances**.

**LlamaDecoderLayer**: This is a **single layer of the transformer model**. Each layer includes self-attention mechanism (**LlamaAttention**) and a feed-forward neural network (**LlamaMLP**), along with normalization layers (**LlamaRMSNorm**).

**LlamaAttention**: This represents the **self-attention mechanism in the transformer model**. It includes the query, key, and value projections (**q_proj**, **k_proj**, **v_proj**), each of which is a Linear4bit layer, indicating that their weights are stored in 4-bit precision. The **q_proj** and **v_proj** layers also carry **LoRA** (Low-Rank Adaptation) components (`lora_A`, `lora_B`, `lora_dropout`): small trainable matrices whose output is added to that of the frozen 4-bit layer.

**LlamaMLP**: This is the **feed-forward neural network** within each transformer layer, which includes linear layers (Linear4bit) and an activation function (**SiLUActivation**).

**LlamaRMSNorm**: This is a variant of Layer Normalization, which is a technique used to stabilize the activations in the model and speed up training.

**lm_head**: This is the **final layer of the model**, which maps the output of the transformer layers to the vocabulary size, effectively giving a probability distribution over the vocabulary for each token in the output sequence.

Note: This model uses 4-bit precision for its linear layers (Linear4bit), a quantization strategy that reduces memory usage. It also uses Low-Rank Adaptation (LoRA) adapters on attention projections, so that only a small number of added parameters are trained while the quantized base weights stay frozen. The exact details of these techniques might vary depending on the specific implementation and configuration of the model.
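
The printed architecture is enough to estimate how many parameters the base model has and how few of them LoRA trains. This is a rough sketch: the vocabulary size, hidden size, layer count and rank come from this notebook, while the MLP width of 11008 and the `o_proj`, `gate_proj`, `up_proj` and `down_proj` layers are assumptions about the Llama-2-7B architecture, since the printout above is truncated before they appear.

```python
VOCAB, HIDDEN, LAYERS = 32000, 4096, 32   # from the printed model
INTERMEDIATE = 11008                      # assumption: MLP width of Llama-2-7B
LORA_R = 64                               # rank from the LoRA configuration

attention = 4 * HIDDEN * HIDDEN           # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * HIDDEN * INTERMEDIATE           # gate_proj, up_proj, down_proj (assumed)
norms = 2 * HIDDEN                        # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

total = (VOCAB * HIDDEN                   # embed_tokens
         + LAYERS * per_layer             # decoder layers
         + HIDDEN                         # final RMSNorm
         + VOCAB * HIDDEN)                # lm_head

# LoRA on q_proj and v_proj only: A is (r x hidden), B is (hidden x r)
lora = LAYERS * 2 * (LORA_R * HIDDEN + HIDDEN * LORA_R)

print(f"base parameters:    {total:,}")
print(f"LoRA parameters:    {lora:,}")
print(f"trainable fraction: {lora / total:.2%}")
```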

# Train the model
Now let's train the model! Simply call `trainer.train()`

`trainer.train()` starts the training process. `SFTTrainer` inherits this method from the `Trainer` class of the Hugging Face Transformers library. When you create a trainer object, you typically provide it with:
- a model to train,
- a training dataset,
- a tokenizer,
- various training parameters (like the learning rate, batch size, etc.).

The Trainer object encapsulates the training loop and provides several utility functions to make training easier.

In [23]:
trainer.train()

wandb: Currently logged in as: royam0820. Use `wandb login --relogin` to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 10 | 61.3786 |
| 20 | 0.4179 |
| 30 | 9.8368 |
| 40 | 5.2354 |
| 50 | 40.7751 |
| 60 | 11.1015 |
| 70 | 1.4832 |
| 80 | 40.8745 |
| 90 | 13.2374 |
| 100 | 13.4156 |


TrainOutput(global_step=200, training_loss=28.307640545368194, metrics={'train_runtime': 801.655, 'train_samples_per_second': 3.992, 'train_steps_per_second': 0.249, 'total_flos': 1.3095702503424e+16, 'train_loss': 28.307640545368194, 'epoch': 0.03})

**Training Results Explanation**

**global_step=200**: This indicates that the model was trained for 200 steps. A step means one update to the model's weights. With gradient accumulation, each update here covers 4 batches of 4 samples, i.e. 16 samples.

**training_loss**=28.307640545368194: This is the loss value of the model on the training data. The loss is a measure of how well the model's predictions match the actual values. Lower loss values indicate that the model's predictions are closer to the actual values.

**metrics**: This is a dictionary containing various metrics that provide additional information about the training process:

> **train_runtime**: 801.655 This indicates that the training process took approximately 801.655 seconds (13.36 min).

> **train_samples_per_second**: 3.992 This measures the speed of the training process in terms of the number of training samples processed per second.

> **train_steps_per_second**: 0.249 This measures the speed of the training process in terms of the number of training steps performed per second.

> **total_flos**: 1.3095702503424e+16 This is the total number of floating-point operations performed during training. It is a count of operations, not to be confused with FLOPS (floating-point operations per second), which measures hardware speed.

> **train_loss**: 28.307640545368194 This is the same as the training_loss mentioned above.

> **epoch**: 0.03 This indicates that the training process completed 0.03 epochs. An epoch is one complete pass through the entire training dataset. In this case, it means the training process did not complete a full pass through the training data.

NB: the epoch value of 0.03 means that training went through only about 3% of the training dataset. This happens when you train for a fixed number of steps that does not cover the whole dataset, or if you stop training early. Here, 200 steps × 4 samples per batch × 4 gradient-accumulation steps = 3,200 samples, and 3,200 / 121,959 ≈ 0.026, which is reported as 0.03.
Conversely, the number of epochs exceeds 1 when training sees more samples than the dataset contains. For example, with a training dataset of 1000 examples, 200 steps and a batch size of 10, you go through 200 × 10 = 2000 examples during training. Since this is two times the size of the dataset, you have effectively gone through 2 epochs.
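
The reported epoch and throughput figures can be reproduced from the training configuration. The sketch below assumes a single GPU (so the per-device batch size is the whole batch); all numbers are taken from the cells and outputs above.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 200
num_rows = 121_959        # size of the training split
train_runtime = 801.655   # seconds, from the TrainOutput

samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps
epoch_fraction = samples_seen / num_rows

print(f"samples per optimizer step: {samples_per_step}")
print(f"samples seen:               {samples_seen}")
print(f"epoch fraction:             {epoch_fraction:.4f}")
print(f"samples per second:         {samples_seen / train_runtime:.3f}")
print(f"steps per second:           {max_steps / train_runtime:.3f}")
```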



During QLoRA training, the training loss spikes and falls sharply, as the log above shows. The training loss also dropped to zero after 200 steps in my run.

The SFTTrainer will take care of properly saving only the adapters during training instead of saving the entire model.

In [24]:
# saving the model
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

# Model Inference

In [25]:
!nvidia-smi

Mon Jul 24 12:35:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    41W / 300W |  12584MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [26]:
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)

`lora_config = LoraConfig.from_pretrained('outputs')`: This line loads the saved `LoraConfig` from the directory named 'outputs'. The `LoraConfig` contains the configuration parameters of the LoRA adapters, such as the rank, the dropout rate and the target modules. It does not contain the trained adapter weights.

`model = get_peft_model(model, lora_config)`: This line calls `get_peft_model` to wrap the model with LoRA adapters as specified by `lora_config`. It takes two arguments, the model and the LoRA configuration, and returns a PEFT model. Note that `get_peft_model` creates adapters from the configuration only; to restore trained adapter weights from disk into a freshly loaded base model, PEFT provides `PeftModel.from_pretrained(base_model, 'outputs')`.

LoRA is a technique used to reduce the number of trainable parameters when fine-tuning large models. It does this by keeping the large weight matrices frozen and learning the update to each targeted matrix as the product of two much smaller matrices.

NB: LoRA accelerates the fine-tuning of large models while consuming less memory. `lora_config` controls how LoRA is applied to the base model, for example through `r`, the rank of the update matrices (an int).

In [27]:
# prompt formatting
text = '''[INST]<<SYS>>
 You are a helpful coding assistant that provides code based on the given query in context.
<</SYS>>
Write a python program to perform binary search in a given list.[/INST]'''
device = "cuda:0"

`[INST]`: marker indicating the start of an instruction.

`<<SYS>>`: marker indicating the start of a system message. System messages might contain metadata or additional instructions for the AI model.

*You are a helpful coding assistant that provides code based on the given query in context.* This is the content of the system message.

`<</SYS>>`: marker indicating the end of a system message.

*Write a python program to perform binary search in a given list.* This is the actual task that the AI model is being asked to complete.

`[/INST]`: marker indicating the end of an instruction.

The below code indicates the following:

`inputs = tokenizer(text, return_tensors="pt").to(device)`: The input text is tokenized, and return_tensors="pt" means that the tokenizer should return PyTorch tensors. The .to(device) part sends the inputs to the specified device, which is generally a GPU ("cuda:0") or CPU ("cpu").

`outputs = model.generate(**inputs, max_new_tokens=1024)`: The model generates output text given the input. It does this by predicting what comes next for a maximum of 1024 tokens. The `**inputs` syntax unpacks the key-value pairs of the `inputs` dictionary as keyword arguments to the `generate` function.

`print(tokenizer.decode(outputs[0], skip_special_tokens=True))`: The generated output, which is in the form of token IDs, is decoded back into readable text. skip_special_tokens=True means that special tokens (like padding or end-of-sequence tokens) used by the model will be removed from the output.
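
The generate-then-decode flow described above can be mimicked with a toy example. In the sketch below a lookup table stands in for the language model (a real model predicts a probability distribution over the vocabulary at each step); `NEXT_TOKEN`, `generate` and `decode` are hypothetical stand-ins, not the transformers API.

```python
EOS = "</s>"

# Hypothetical stand-in for a language model: maps the last token to the next one.
NEXT_TOKEN = {"def": "binary_search", "binary_search": "(", "(": "arr", "arr": ")", ")": EOS}


def generate(prompt_tokens, max_new_tokens):
    """Append one predicted token at a time until EOS or the token budget is hit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], EOS)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


def decode(tokens, skip_special_tokens=True):
    """Join tokens into text, optionally dropping the EOS marker."""
    kept = [t for t in tokens if not (skip_special_tokens and t == EOS)]
    return " ".join(kept)


print(decode(generate(["def"], max_new_tokens=1024)))  # stops early at EOS
print(decode(generate(["def"], max_new_tokens=2)))     # stops at the token budget
```

`max_new_tokens` is therefore an upper bound: generation ends earlier whenever the model emits the end-of-sequence token.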

In [28]:
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



[INST]<<SYS>>
 You are a helpful coding assistant that provides code based on the given query in context.
<</SYS>>
Write a python program to perform binary search in a given list.[/INST]

### Solution

```python
def binary_search(arr, target):
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            high = mid - 1
        else:
            low = mid + 1
    return -1

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
target = 5

print(binary_search(arr, target))
```

### Explanation

The idea is to find the index of the element in the list that is equal to the given target.

The algorithm is as follows:

1. Initialize the low and high index to 0 and len(arr) - 1 respectively.
2. While the low index is less than the high index, do the following:
    - Calculate the mid index by dividing the low and high index by 2.
    - If the element at the mid index is equal to the 
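
Note that the generated program above has its two branch updates swapped: for an ascending list, `arr[mid] < target` means the target lies in the upper half, so `low` should move, not `high`. The generated version still returns 4 for `target = 5` only because the first midpoint happens to hit the target; for `target = 8` it returns -1. A corrected sketch, with the same function name and test list as the generated code:

```python
def binary_search(arr, target):
    """Return the index of target in the ascending list arr, or -1 if absent."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1    # target can only be in the upper half
        else:
            high = mid - 1   # target can only be in the lower half
    return -1


arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(binary_search(arr, 5))
print(binary_search(arr, 8))
print(binary_search(arr, 11))
```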

# Model push to the HF Hub

In [29]:
model.push_to_hub("amr_llama2-CodeInstr-finetuned-model")

adapter_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/royam0820/amr_llama2-CodeInstr-finetuned-model/commit/fbde88ba3b703f1191834697238d6b5a6dc27653', commit_message='Upload model', commit_description='', oid='fbde88ba3b703f1191834697238d6b5a6dc27653', pr_url=None, pr_revision=None, pr_num=None)

# Chat Web UI for Llama

In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-3.38.0-py3-none-any.whl (19.8 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.1.0-py3-none-any.whl (14 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.100.0-py3-none-any.whl (65 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client>=0.2.10 (from gradio)
  Downloading gradio_client-0.2.10-py3-none-any.whl (288 kB)
Collecting httpx (from gradio)
  Downloading httpx-0.24.1-py3-none-any.whl (75 kB)

We will import:
- `gradio`, a fast way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere.
- `text_generation`, a Python client for a Hugging Face Text Generation Inference server, which serves LLMs for text generation (for example, completing text or paraphrasing).

In [None]:
!pip install text_generation

Collecting text_generation
  Downloading text_generation-0.6.0-py3-none-any.whl (10 kB)
Installing collected packages: text_generation
Successfully installed text_generation-0.6.0


NB: Text generation works by utilizing algorithms and language models to process input data and generate output text. It involves training AI models on large datasets of text to learn patterns, grammar, and contextual information. These models then use this learned knowledge to generate new text based on given prompts or conditions.

In [None]:
import os

import gradio as gr
from text_generation import Client


# llama prompt starts with <s> and ends with </s>
# system prompt
PROMPT = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

"""


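# Specify the URL of the LLM server.
# `Client` expects the base URL of a running text-generation-inference endpoint.
# The LLAMA_70B environment variable, if set, overrides the default below.

#LLAMA_70B = os.environ.get("LLAMA_70B", "http://localhost:3000")
LLAMA_70B = os.environ.get("LLAMA_70B", "https://huggingface.co/royam0820/llama2-chat-hub-my-finetuned-model")
CLIENT = Client(base_url=LLAMA_70B)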

# creating a dictionary for the LLM parameters
PARAMETERS = {
    "temperature": 0.9,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "truncate": 1000,
    "max_new_tokens": 1024,
    "seed": 42,
    "stop_sequences": ["</s>"],
}

# Message formatting
def format_message(message, history, memory_limit=5):
    # handling the context, keeping 5 last messages in memory
    # always keep len(history) <= memory_limit
    if len(history) > memory_limit:
        history = history[-memory_limit:]

    if len(history) == 0:
        return PROMPT + f"{message} [/INST]"

    formatted_message = PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Handle conversation history
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Handle the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message

# Inference
# message is the current user query, history holds the past (user message, model answer) pairs
def predict(message, history):
    query = format_message(message, history)
    text = "" #it will hold the response text
    for response in CLIENT.generate_stream(query, **PARAMETERS):
        if not response.token.special:
            text += response.token.text
            yield text

# launching the Gradio web UI chat interface
gr.ChatInterface(predict).queue().launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://ff6174b69466bae4a1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
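
To check what the `format_message` function above actually sends to the model, it can be run on its own without a server. The sketch below copies the function and uses a shortened system prompt and a made-up conversation for readability; both are assumptions for illustration only.

```python
# Shortened system prompt (assumption, for readability)
PROMPT = "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"


def format_message(message, history, memory_limit=5):
    # keep only the last `memory_limit` exchanges
    if len(history) > memory_limit:
        history = history[-memory_limit:]

    if len(history) == 0:
        return PROMPT + f"{message} [/INST]"

    # the first exchange shares the [INST] block with the system prompt
    formatted_message = PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # each later exchange gets its own <s>[INST] ... [/INST] ... </s> block
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # the current message is left open for the model to answer
    formatted_message += f"<s>[INST] {message} [/INST]"
    return formatted_message


history = [("Hi", "Hello!"), ("What is LoRA?", "A fine-tuning method.")]
prompt = format_message("And QLoRA?", history)
print(prompt)
```

With more than `memory_limit` past exchanges, only the most recent ones are kept, which bounds the prompt length as the conversation grows.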


