# **Ludwig + DeepLearning.ai: Efficient Fine-Tuning for Llama2-7b on a Single GPU** 🙌

Let's explore how to fine-tune an LLM on a single commodity GPU with [Ludwig](https://ludwig.ai/latest/), an open-source package that lets you build and train machine learning models such as LLMs, neural networks and tree-based models through declarative config files.

In this notebook, we'll show an example of how to fine-tune Llama-2-7b to generate code using the CodeAlpaca dataset.

By the end of this example, you will have gained a comprehensive understanding of the following key aspects:

1. **Ludwig**: An intuitive toolkit that simplifies fine-tuning for open-source Large Language Models (LLMs).
2. **Exploring the base model with prompts**: Dive into the intricacies of prompts and prompt templates, unlocking new dimensions in LLM interaction.
3. **Fine-Tuning Large Language Models**: Navigate the world of model fine-tuning optimizations for getting the most out of a single memory-constrained GPU, including LoRA and 4-bit quantization.



### **Install Ludwig and Ludwig's LLM related dependencies.**

Install Ludwig from the latest release

In [1]:
# !pip uninstall -y tensorflow --quiet
# !pip install ludwig
# !pip install ludwig[llm]

Install Ludwig from Ludwig master

In [1]:
!pip uninstall -y tensorflow --quiet
!pip install git+https://github.com/ludwig-ai/ludwig.git@master --quiet
!pip install "ludwig[llm] @ git+https://github.com/ludwig-ai/ludwig.git@master" --quiet

Enable text wrapping so we don't have to scroll horizontally and create a function to flush CUDA cache.

In [2]:
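import torch
from IPython.display import HTML, display

def set_css(*args):
  # IPython passes an ExecutionInfo object to `pre_run_cell` callbacks,
  # so the callback must accept (and can ignore) positional arguments.
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)

def clear_cache():
  # Release cached GPU memory held by PyTorch's allocator, if a GPU is present.
  if torch.cuda.is_available():
    torch.cuda.empty_cache()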

### **Setup Your HuggingFace Token** 🤗

We'll be exploring Llama-2 today, which is a model released by Meta. However, the model is not openly accessible: you need to request access, which is then tied to your HuggingFace token.

Obtain a [HuggingFace API Token](https://huggingface.co/settings/tokens) and request access to [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) before proceeding. You may need to sign up on HuggingFace if you don't already have an account: https://huggingface.co/join

In case you haven't been given access to Llama-2-7b, that is alright. We can just use Llama-1 for the rest of this example: [huggyllama/llama-7b](https://huggingface.co/huggyllama/llama-7b).

In [4]:
# !pip uninstall torch -y
# !pip install torch
# import torch

In [5]:
# # !pip uninstall torch torchtext -y
# !pip install torch torchtext
# # !pip uninstall torchaudio -y
# !pip install torchaudio

In [3]:
import locale

# Override getpreferredencoding to always return UTF-8
locale.getpreferredencoding = lambda _=None: "UTF-8"

import getpass
# import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import yaml

from ludwig.api import LudwigModel


os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

PyTorch version 2.1.0+cu118 available.


Token: ········


### **Import The Code Generation Dataset** 📋



In [7]:
# from google.colab import drive
# drive.mount('/content/drive')

In [8]:
# from google.colab import data_table; data_table.enable_dataframe_formatter()
# import numpy as np; np.random.seed(123)
# import pandas as pd

# import json
# with open('/content/drive/MyDrive/arxiv_physics_instruct_30k.jsonl') as f1:
#     data1 = [json.loads(line) for line in f1]
# with open('/content/drive/MyDrive/arxiv_math_instruct_50k.jsonl') as f2:
#     data2 = [json.loads(line) for line in f2]

# df1 = pd.DataFrame(data1)
# df2 = pd.DataFrame(data2)
# df = pd.concat([df1, df2])
# main_df = df.sample(frac=1, random_state=42)
# main_df.reset_index(drop=True, inplace=True)
# # We're going to create a new column called `split` where:
# # 90% will be assigned a value of 0 -> train set
# # 5% will be assigned a value of 1 -> validation set
# # 5% will be assigned a value of 2 -> test set
# # Calculate the number of rows for each split value
# total_rows = len(main_df)
# split_0_count = int(total_rows * 0.9)
# split_1_count = int(total_rows * 0.05)
# split_2_count = total_rows - split_0_count - split_1_count

# # Create an array with split values based on the counts
# split_values = np.concatenate([
#     np.zeros(split_0_count),
#     np.ones(split_1_count),
#     np.full(split_2_count, 2)
# ])

# # Shuffle the array to ensure randomness
# np.random.shuffle(split_values)

# # Add the 'split' column to the DataFrame
# main_df['split'] = split_values
# main_df['split'] = main_df['split'].astype(int)

# # For this webinar, we will use just the first 1000 rows of this dataset.
# main_df = main_df.head(n=1000)

In [4]:
import numpy as np
import pandas as pd
import json

# Load data from the JSONL file
jsonl_file_path = 'ludwig_train_data.jsonl'
with open(jsonl_file_path) as f:
    data = [json.loads(line) for line in f]

# Create DataFrame from the loaded data
df = pd.DataFrame(data)

# Shuffle the DataFrame
main_df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# We're going to create a new column called `split` where:
# 90% will be assigned a value of 0 -> train set
# 5% will be assigned a value of 1 -> validation set
# 5% will be assigned a value of 2 -> test set

# Calculate the number of rows for each split value
total_rows = len(main_df)
split_0_count = int(total_rows * 0.9)
split_1_count = int(total_rows * 0.05)
split_2_count = total_rows - split_0_count - split_1_count

# Create an array with split values based on the counts
split_values = np.concatenate([
    np.zeros(split_0_count),
    np.ones(split_1_count),
    np.full(split_2_count, 2)
])

# Shuffle the array to ensure randomness
np.random.shuffle(split_values)

# Add the 'split' column to the DataFrame
main_df['split'] = split_values.astype(int)

# Display a sample of the DataFrame
main_df.sample(5)


| | input | output | instruction | split |
|---|---|---|---|---|
| 1558 | \n\nTitle: Watercress: A Rich Source of Bioact... | [[3,4-dihydroxy-5-all-trans-hexaprenylbenzoate... | The task is to extract chemical-related triple... | 0 |
| 521 | \n\nTitle: Chemical Constituents of Angelica: ... | [[Ureidoisobutyric acid, sourced through, Ange... | The task is to extract chemical-related triple... | 0 |
| 1005 | \n\nTitle: Biolocation of Pharmaceutical Compo... | [[suloctidil, biolocation is, Blood], [Polymyx... | The task is to extract chemical-related triple... | 0 |
| 1545 | \n\nTitle: Sourcing of Bioactive Compounds in ... | [[5-Methylthioribose, sourced through, Robusta... | The task is to extract chemical-related triple... | 0 |
| 1072 | \n\nTitle: The Role of PA(16:0/18:1(11Z)) in C... | [[PA(16:0/18:1(11Z)), involved in, Cardiolipin... | The task is to extract chemical-related triple... | 0 |


In [5]:
len(main_df)

1879

In [6]:
main_df.head(n=10)

| | input | output | instruction | split |
|---|---|---|---|---|
| 0 | \n\nCytidine triphosphate (CTP) is a crucial m... | [[Cytidine triphosphate, involved in, Cardioli... | The task is to extract chemical-related triple... | 2 |
| 1 | \n\nTitle: LysoPA(18:2(9Z,12Z)/0:0): A Key Pla... | [[LysoPA(18:2(9Z,12Z)/0:0), involved in, Cardi... | The task is to extract chemical-related triple... | 0 |
| 2 | \n\nTitle: The Role of Neotussilagolactone, Et... | [[Neotussilagolactone, has role of, Surfactant... | The task is to extract chemical-related triple... | 0 |
| 3 | \n\nTitle: Sourcing L-Canaline and Related Com... | [[L-canaline, sourced through, Daikon radish],... | The task is to extract chemical-related triple... | 0 |
| 4 | \n\nTitle: Phosphatidic Acid (PA) (16:0/18:1(1... | [[PA(16:0/18:1(11Z)), involved in, De Novo Tri... | The task is to extract chemical-related triple... | 0 |
| 5 | \n\nTitle: LysoPA(16:0/0:0) Involvement in Lip... | [[LysoPA(16:0/0:0), involved in, Cardiolipin B... | The task is to extract chemical-related triple... | 0 |
| 6 | \n\nTitle: Roles of Phosphate in De Novo Triac... | [[Phosphate, involved in, De Novo Triacylglyce... | The task is to extract chemical-related triple... | 0 |
| 7 | \n\nTitle: Sources of Cis-Zeatin-9-N-Glucoside... | [[cis-Zeatin-9-N-glucoside, sourced through, I... | The task is to extract chemical-related triple... | 0 |
| 8 | \n\nTitle: Palmityl-CoA: A Key Player in Lipid... | [[Palmityl-CoA, involved in, De Novo Triacylgl... | The task is to extract chemical-related triple... | 0 |
| 9 | \n\nTitle: Exploring the Phytochemical Composi... | [[trans-p-Coumaric acid, sourced through, Bitt... | The task is to extract chemical-related triple... | 0 |


The cell below reports the size of the dataset and the share of examples that rely on additional context. In this dataset every example is counted as needing the context supplied in its `input` column, so that share is 100%.

In [7]:
# All rows are counted as needing additional context (the `input` column).
num_need_context = main_df.shape[0]
print(f"Total number of examples in the dataset: {main_df.shape[0]}")

print(f"% of examples that need additional context: {round(num_need_context/main_df.shape[0] * 100, 2)}")

Total number of examples in the dataset: 1879
% of examples that need additional context: 100.0


The other aspect worth noting is the average number of characters in each of the three columns `instruction`, `input` and `output` in the dataset. Typically, every 3-4 characters maps to a *token* (the basic building blocks that language models use to understand and analyze text data), and large language models have a limit on the number of tokens they can take as input.

The maximum context length for the base LLaMA-2 model is 4096 tokens. Ludwig automatically truncates texts that are too long for the model, but looking at these sequence lengths, we should be able to fine-tune on full length examples without needing any truncation.
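
As a rough check of the reasoning above, the sketch below estimates token counts from character counts using the 3-to-4-characters-per-token rule of thumb. The helper names `estimate_tokens` and `longest_example_tokens`, the default of 3 characters per token, and the sample rows are illustrative assumptions; only a real tokenizer gives exact counts.

```python
import math

LLAMA2_CONTEXT_LENGTH = 4096  # maximum context length quoted above


def estimate_tokens(text, chars_per_token=3):
    """Rough token estimate: about one token per `chars_per_token` characters."""
    return math.ceil(len(text) / chars_per_token)


def longest_example_tokens(rows, columns=("instruction", "input", "output")):
    """Largest estimated token count over all rows, summing the given columns."""
    return max(
        sum(estimate_tokens(str(row.get(col, ""))) for col in columns)
        for row in rows
    )


# Hypothetical rows standing in for the real dataset.
sample_rows = [
    {"instruction": "Extract triples.", "input": "a" * 300, "output": "b" * 90},
    {"instruction": "Extract triples.", "input": "a" * 1200, "output": "b" * 150},
]

longest = longest_example_tokens(sample_rows)
print(f"Longest example: ~{longest} tokens")
print(f"Fits in context window: {longest <= LLAMA2_CONTEXT_LENGTH}")
```

With the DataFrame used in this notebook, `main_df.to_dict("records")` would supply the rows in the shape this helper expects.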





In [14]:
# # Calculating the length of each cell in each column
# df['num_characters_question'] = df['question'].apply(lambda x: len(x))
# # df['num_characters_input'] = df['input'].apply(lambda x: len(x))
# df['num_characters_answer'] = df['answer'].apply(lambda x: len(x))

# # Show Distribution
# df.hist(column=['num_characters_question', 'num_characters_answer'])

# # Calculating the average
# average_chars_instruction = df['num_characters_question'].mean()
# # average_chars_input = df['num_characters_input'].mean()
# average_chars_output = df['num_characters_answer'].mean()

# print(f'Average number of tokens in the instruction column: {(average_chars_instruction / 3):.0f}')
# # print(f'Average number of tokens in the input column: {(average_chars_input / 3):.0f}')
# print(f'Average number of tokens in the output column: {(average_chars_output / 3):.0f}', end="\n\n")

There are three different fine-tuning approaches in Ludwig:

1. **Full Fine-Tuning**:

- Involves continuing to train the entire pre-trained model on new data.
- All model layers and parameters are updated during fine-tuning.
- Can lead to high accuracy but requires a significant amount of computational resources and time.
- Runs the risk of catastrophic forgetting: because all of the model's weights are updated, the model can inadvertently lose knowledge it gained during pretraining. The outcome varies: in some cases error rates rise on earlier tasks, while in others the model forgets a specific task entirely and performs terribly on it.
- Best suited when the target task is significantly different from the original pre-training task.

2. **Parameter Efficient Fine-Tuning (PEFT), e.g. LoRA**:

- Focuses on updating only a subset of the model's parameters.
- Often involves freezing certain layers or parts of the model to avoid catastrophic forgetting, or inserting additional layers that are trainable while keeping the original model's weights frozen.
- Can result in faster fine-tuning with fewer computational resources, but might sacrifice some accuracy compared to full fine-tuning.
- Includes methods like LoRA, AdaLoRA and Adaption Prompt (LLaMA Adapter).
- Suitable when the new task shares similarities with the original pre-training task.

3. **Quantization-Based Fine-Tuning (QLoRA)**:

- Involves reducing the precision of model parameters (e.g., converting 32-bit floating-point values to 8-bit or 4-bit integers). This reduces the CPU and GPU memory needed for the weights by 4x with 8-bit integers, or 8x with 4-bit integers. In QLoRA, the quantized base weights stay frozen and LoRA adapter weights are trained on top of them.
- Since the weights are stored as 8-bit or 4-bit integers, we typically lose some precision and, with it, some model quality.
- This can lead to reduced memory usage and faster inference on hardware with reduced precision support.
- Particularly useful when deploying models on resource-constrained devices, such as mobile phones or edge devices.
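
To put numbers on the memory savings described above, here is a back-of-the-envelope sketch. It assumes a nominal 7 billion parameters and counts only the memory for the weights themselves; activations, gradients, optimizer state and quantization overhead are deliberately ignored, so real usage will be higher. The helper `weight_memory_gib` is a hypothetical name used only for this illustration.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


num_params = 7_000_000_000  # nominal size of a "7B" model (assumption)

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(num_params, bits):5.1f} GiB")
```

Under these assumptions, 32-bit weights alone come to roughly 26 GiB, which does not fit on a 16 GiB GPU, while 4-bit weights take roughly 3.3 GiB and leave headroom for training.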


**Today, we're going to fine-tune using method 3 since we only have access to a single T4 GPU with 16GiB of GPU VRAM on Colab.** If you have more compute available, give LoRA-based fine-tuning or full fine-tuning a try! Typically this requires 4 GPUs with 24GiB of GPU VRAM on a single-node multi-GPU cluster, and fine-tuning with DeepSpeed.


To do this, the new parameters we're introducing are:

- `adapter`: The PEFT method we want to use
- `quantization`: Load the weights in int4 or int8 to reduce memory overhead.
- `trainer`: We enable the `finetune` trainer and can configure a variety of training parameters such as epochs and learning rate.
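
To give a feel for why a LoRA `adapter` is so cheap to train, the sketch below counts the parameters LoRA adds. The architecture numbers (a Mistral-7B-style decoder with 32 layers, hidden size 4096 and a 1024-wide value projection), the rank of 8, and the choice of `q_proj` and `v_proj` as target modules are all assumptions made for illustration, not settings read from the Ludwig config.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds two low-rank matrices per adapted layer:
    A with shape (rank, in_features) and B with shape (out_features, rank)."""
    return rank * (in_features + out_features)


# Assumed architecture: Mistral-7B-style decoder.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
KV_PROJ_SIZE = 1024  # 8 key/value heads x 128 dims (grouped-query attention)
RANK = 8             # assumed LoRA rank

per_layer = (
    lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)      # q_proj
    + lora_param_count(HIDDEN_SIZE, KV_PROJ_SIZE, RANK)   # v_proj
)
trainable = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {trainable:,}")
```

Under these assumptions the count is 3,407,872, the same figure the training log in this notebook reports for `trainable params`, and well under 0.1% of the roughly 7.2 billion total parameters.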

Note, there are a few additional preprocessing parameters we should set to ensure that training runs smoothly:

```yaml
preprocessing:
  global_max_sequence_length: 512
  split:
    type: random
    probabilities:
    - 1
    - 0
    - 0
```

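- Some of the examples in the dataset have long sequences, so a `global_max_sequence_length` of 512 is a safe setting to avoid running out of GPU memory (OOM). Note that the training config in this notebook sets it to 5000 instead, which is above the longest sequences reported in the training log, so no truncation takes place there.
- Splits are set up so that all of the data is used for training, because evaluation phases are synchronous and take additional time. In a full training run, we recommend setting aside some data for the test split to get evaluation metrics.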
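
The effect of the `probabilities` list can be illustrated with a small sketch. This is not Ludwig's implementation, only a minimal picture of what a random split with given probabilities does; the function name `assign_splits` and the seed are assumptions.

```python
import random


def assign_splits(num_rows, probabilities, seed=42):
    """Assign each row to split 0 (train), 1 (validation) or 2 (test)."""
    rng = random.Random(seed)
    # Running totals of the probabilities, e.g. roughly [0.9, 0.95, 1.0].
    cumulative = []
    total = 0.0
    for p in probabilities:
        total += p
        cumulative.append(total)

    splits = []
    for _ in range(num_rows):
        draw = rng.random()  # uniform in [0, 1)
        # Pick the first split whose cumulative probability exceeds the draw;
        # fall back to the last split to guard against rounding.
        split = next(
            (i for i, c in enumerate(cumulative) if draw < c),
            len(probabilities) - 1,
        )
        splits.append(split)
    return splits


all_train = assign_splits(1879, [1, 0, 0])
print(set(all_train))
```

With probabilities of `1, 0, 0` every row lands in the training split, which is why the validation and test sets come out empty in this run.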

In [8]:
import torch

In [9]:
model = None
clear_cache()

qlora_fine_tuning_config = yaml.safe_load(
"""
model_type: llm
base_model: berkeley-nest/Starling-LM-7B-alpha

input_features:
  - name: input
    type: text

output_features:
  - name: output
    type: text

prompt:
  template: >-
    Below is an instruction that describes a task, paired with an input
    that provides further context. Write a response that appropriately
    completes the request.

    ### Instruction: {instruction}

    ### Input: {input}

    ### Response:

generation:
  temperature: 0.1
  max_new_tokens: 5000

adapter:
  type: lora

quantization:
  bits: 4

preprocessing:
  global_max_sequence_length: 5000
  split:
    type: random
    probabilities:
    - 1
    - 0
    - 0

trainer:
  type: finetune
  epochs: 1
  batch_size: 1
  eval_batch_size: 2
  gradient_accumulation_steps: 16
  learning_rate: 0.0004
  learning_rate_scheduler:
    warmup_fraction: 0.03
"""
)

model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
results = model.train(dataset=main_df)

╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤═════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                          │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                     │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /workspace/results/api_experiment_run_2                                                 │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version   │ '0.10.1.dev'                                                                            │
├──────────────────┼─────────

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loaded HuggingFace implementation of berkeley-nest/Starling-LM-7B-alpha tokenizer


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'None': 1358 (without start and stop symbols)
Max sequence length is 1358 for feature 'None'


Max length of feature 'output': 726 (without start and stop symbols)
Max sequence length is 726 for feature 'output'


Building dataset: DONE
Writing preprocessed training set cache to /workspace/f3d4d56ed75811ee97680242c0a83002.training.hdf5
Writing preprocessed validation set cache to /workspace/f3d4d56ed75811ee97680242c0a83002.validation.hdf5
Writing preprocessed test set cache to /workspace/f3d4d56ed75811ee97680242c0a83002.test.hdf5
Writing train set metadata to /workspace/f3d4d56ed75811ee97680242c0a83002.meta.json
Validation set empty. If this is unintentional, please check the preprocessing configuration.
Test set empty. If this is unintentional, please check the preprocessing configuration.

Dataset Statistics
╒═══════════╤═══════════════╤════════════════════╕
│ Dataset   │   Size (Rows) │ Size (In Memory)   │
╞═══════════╪═══════════════╪════════════════════╡
│ Training  │          1879 │ 440.52 Kb          │
╘═══════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Loading large language model...
We will use 90% of the memory on device 0 for storing the model, and 10% 

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


Done.


Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 3,407,872 || all params: 7,245,156,352 || trainable%: 0.047036555657757044

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 1879 step(s), approximately 1 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 9395 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training: 100%|██████████| 1879/1879 [29:43<00:00,  1.20s/it, loss=0.000617]
Running evaluation for step: 1879, epoch: 1
Evaluation took 0.2060s

╒═══════════════════════╤════════════╤══════════════╤════════╕
│                       │      train │ validation   │ test   │
╞═══════════════════════╪════════════╪══════════════╪════════╡
│ bleu                  │     0.0533 │              │        │
├───────────────────────┼────────────┼──────────────┼────────┤
│ char_error_rate       │     6.8877 │              │        │
├───────────────────────┼────────────┼──────────────┼──

#### Perform Inference

We can now use the model we fine-tuned above to make predictions on some test examples and see whether fine-tuning the large language model improves its ability to follow instructions and perform the tasks we ask of it.

In [10]:
print(model)

<ludwig.api.LudwigModel object at 0x7f7f805b4b50>


In [18]:
test_examples = pd.DataFrame([
    {
        "instruction" : """
          The task is to extract chemical-related triples from scientific research papers.
          The rules are:
          1. Only use the following predicates in the triple: “causes”, ”biolocation is”, “exposed through”, “sourced through”, “has role of”, “involved in”.
          2. If there is more than one noun in the object, separate it into multiple triples.
          3. If you don't find relevant chemical-related triples in the paper or you are not sure, return: null.
          4. The response is an array of the relevant triples in the form: [subject, predicate, object].
          Q: Interaction of TMPC with ANXA2 mediated attachment and colonization of S. anginosus and induced mitogen-activated protein kinase (MAPK) activation.
          A: ["TMPC", "involved in", "MAPK activation"],
          ["ANXA2", "involved in", "MAPK activation"]
          Q: α-Lipoic acid plays an essential role in mitochondrial dehydrogenase reactions.
          A: ["alpha-Lipoic acid", "involved in", "mitochondrial dehydrogenase reactions"]
          Q: Ferroptosis, a form of regulated cell death that is driven by iron-dependent phospholipid peroxidation, has been implicated in multiple diseases, including cancer
          A: ["Ferroptosis", "causes", "cancer"]
          Q: These transporters and other biotin-binding proteins partition biotin to the cytoplasm and mitochondria cell compartments.
          A: ["Biotin", "biolocation is", "cytoplasm"],
          ["Biotin", "biolocation is", "mitochondria"]""",
        "input": """Bioconversion of <Chemical>Chitin</Chemical> into <Chemical>chitin oligosaccharides</Chemical> using a novel chitinase with high <Chemical>Chitin</Chemical>-binding capacity. <Chemical>Chitin</Chemical> is the second largest renewable biomass resource in nature, it can be enzymatically degraded into high-value <Chemical>chitin oligosaccharides</Chemical> (<Chemical>CHOSs</Chemical>) by chitinases. In this study, a chitinase (ChiC8-1) was purified and biochemically characterized, its structure was analyzed by molecular modeling. ChiC8-1 had a molecular mass of approximately 96 kDa, exhibited its optimal activity at pH 6.0 and 50  C. The Km and Vmax values of ChiC8-1 towards colloidal <Chemical>Chitin</Chemical> were 10.17 mgmL-1 and 13.32 U/mg, respectively. Notably, ChiC8-1 showed high <Chemical>Chitin</Chemical>-binding capacity, which may be related to the two <Chemical>Chitin</Chemical> binding domains in the N-terminal. Based on the unique properties of ChiC8-1, a modified affinity chromatography method, which combines protein purification with <Chemical>Chitin</Chemical> hydrolysis process, was developed to purify ChiC8-1 while hydrolyzing <Chemical>Chitin</Chemical>. In this way, 9.36 +- 0.18 g <Chemical>CHOSs</Chemical> powder was directly obtained by hydrolyzing 10 g colloidal <Chemical>Chitin</Chemical> with crude enzyme solution. The <Chemical>CHOSs</Chemical> were composed of 14.77-2.83 % <Chemical>Acetylglucosamine</Chemical> and 85.23-97.17 % <Chemical>N N-diacetylchitobiose</Chemical> at different enzyme-substrate ratio. This process simplifies the tedious purification and separation steps, and may enable its potential application in the field of green production of <Chemical>chitin oligosaccharides</Chemical>.""",
    }])
# test_examples = pd.DataFrame([
#     {
#         "input": "Hi there.",
#     }])
predictions = model.predict(test_examples)[0]
for instruction, input_text, response in zip(
    test_examples["instruction"], test_examples["input"], predictions["output_response"]
):
    print(f"Instruction: {instruction}")
    print("\n\n")
    print(f"Input: {input_text}")
    print("\n\n")
    # Each response holds a list of generated strings; print the first one.
    print(f"Generated Output: {response[0]}")
    print("\n\n")



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loaded HuggingFace implementation of berkeley-nest/Starling-LM-7B-alpha tokenizer


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Prediction: 100%|██████████| 1/1 [00:07<00:00,  7.08s/it]




Finished predicting in: 8.35s.
Instruction: 
          The task is to extract chemical-related triples from scientific research papers.
          The rules are:
          1. Only use the following predicates in the triple: “causes”, ”biolocation is”, “exposed through”, “sourced through”, “has role of”, “involved in”.
          2. If there is more than one noun in the object, separate it into multiple triples.
          3. If you don't find relevant chemical-related triples in the paper or you are not sure, return: null.
          4. The response is an array of the relevant triples in the form: [subject, predicate, object].
          Q: Interaction of TMPC with ANXA2 mediated attachment and colonization of S. anginosus and induced mitogen-activated protein kinase (MAPK) activation.
          A: ["TMPC", "involved in", "MAPK activation"],
          ["ANXA2", "involved in", "MAPK activation"]
          Q: α-Lipoic acid plays an essential role in mitochondrial dehydrogenase reactions.
    

  return np.sum(np.log(sequence_probabilities))
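The prompt asks for triples in the form `[subject, predicate, object]` with a fixed set of predicates, or `null` when nothing is found. A minimal sketch of how such a response might be parsed and validated follows. The helper name `parse_triples` and the choice to silently drop malformed rows are assumptions made for illustration, not part of Ludwig.

```python
import json

# Predicates allowed by rule 1 of the prompt.
ALLOWED_PREDICATES = {
    "causes", "biolocation is", "exposed through",
    "sourced through", "has role of", "involved in",
}


def parse_triples(response: str) -> list[tuple[str, str, str]]:
    """Parse a model response into validated (subject, predicate, object) triples."""
    text = response.strip().rstrip(",")
    if not text or text.lower() == "null":
        return []
    try:
        # The response is a comma-separated run of JSON arrays, so wrap it in an outer array.
        rows = json.loads(f"[{text}]")
    except json.JSONDecodeError:
        return []
    triples = []
    for row in rows:
        if (
            isinstance(row, list)
            and len(row) == 3
            and all(isinstance(item, str) for item in row)
            and row[1] in ALLOWED_PREDICATES
        ):
            triples.append(tuple(row))
    return triples


example = '["Biotin", "biolocation is", "cytoplasm"],\n["Biotin", "biolocation is", "mitochondria"]'
print(parse_triples(example))
```

Rows with an unknown predicate or the wrong shape are discarded rather than raising, which keeps a batch evaluation running when the model occasionally produces free text.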


#### **Observations From QLoRA Fine-Tuning** 🔎
- Even when we fine-tune the model on just 100 examples from our dataset (which only takes about 4 minutes), it significantly improves the model on our task 🔥
- The answers are not perfect when we use just 100 examples, but if we inspect the *logic* in the responses, we can see that it is 95% of the way there. This is significantly better than before: there is no repetition, and the actual code aspects of the answers are all correct.
- Partial errors such as `sierp` instead of `array` indicate that we need to train on a larger amount of data for the model to better learn how to follow instructions and avoid these kinds of mistakes.

If you're looking for a managed solution that handles the hassle of figuring out the right compute for your fine-tuning jobs, ensures that they succeed without CPU or GPU out-of-memory errors, and lets you rapidly deploy the resulting models for fast real-time inference, check out [Predibase](https://www.predibase.com/).

In [19]:
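import os

# `!export` only affects a throwaway subshell, so set the variable in the kernel process instead.
# Replace the placeholder with your own token and keep real tokens out of shared notebooks.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "<your-huggingface-token>"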


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
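A shell command launched with `!` runs in a child process, so a variable exported there is not visible to the notebook kernel afterwards. The sketch below illustrates this with a hypothetical variable name, `DEMO_TOKEN`, and contrasts it with setting the variable through `os.environ`, which later child processes do inherit. In IPython, the `%env` magic has a similar effect.

```python
import os
import subprocess
import sys

VAR = "DEMO_TOKEN"  # hypothetical variable name, used only for this illustration
os.environ.pop(VAR, None)

# An export inside a child shell disappears when that shell exits.
subprocess.run(f"export {VAR}=abc", shell=True, check=False)
visible_after_shell_export = VAR in os.environ

# Setting the variable in this process makes it visible here and in later child processes.
os.environ[VAR] = "abc"
child = subprocess.run(
    [sys.executable, "-c", f"import os; print(os.environ.get('{VAR}'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
print(visible_after_shell_export, child_value)
```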


In [12]:
!ludwig upload hf_hub --repo_id mrmagic251/ChemStarling --model_path workspace/results/api_experiment_run_2





    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 

In [21]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM

config = PeftConfig.from_pretrained("arnavgrg/codealpaca_v3")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(model, "arnavgrg/codealpaca_v3")


ValueError: Can't find 'adapter_config.json' at 'arnavgrg/codealpaca_v3'
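The error says that PEFT could not find an `adapter_config.json` at the location it was given. When the adapter weights live in a local directory, a quick check before calling `PeftModel.from_pretrained` can surface this earlier. The helper `has_adapter_config` below is a hypothetical name, and the sketch only covers local directories, not Hub repositories such as the one used above.

```python
import os
import tempfile


def has_adapter_config(adapter_dir: str) -> bool:
    """Return True if the directory contains the adapter_config.json that PEFT looks for."""
    return os.path.isfile(os.path.join(adapter_dir, "adapter_config.json"))


# Demonstration on a throwaway directory: empty first, then with the config file present.
with tempfile.TemporaryDirectory() as tmp:
    missing = has_adapter_config(tmp)
    with open(os.path.join(tmp, "adapter_config.json"), "w") as f:
        f.write("{}")
    present = has_adapter_config(tmp)
print(missing, present)
```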

In [19]:
from ludwig.api import LudwigModel
import pandas as pd


In [20]:
ludwig_model = LudwigModel.load("workspace/results/api_experiment_run_2/model")


FileNotFoundError: [Errno 2] No such file or directory: '/workspace/workspace/results/api_experiment_run_2/model/model_hyperparameters.json'
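The doubled `/workspace/workspace/` prefix in the error suggests that the relative path was resolved against a working directory of `/workspace`. That is an inference from the message, not something the notebook confirms. The sketch below reproduces that resolution and shows one way to locate directories that actually contain `model_hyperparameters.json`. The helper `find_model_dirs` and the demo directory layout are hypothetical.

```python
import tempfile
from pathlib import Path, PurePosixPath

# A relative path is resolved against the working directory, which reproduces the doubled prefix.
doubled = PurePosixPath("/workspace") / "workspace/results/api_experiment_run_2/model"
print(doubled)


def find_model_dirs(root: Path) -> list[Path]:
    """Return every directory under root that contains model_hyperparameters.json."""
    return sorted(p.parent for p in root.rglob("model_hyperparameters.json"))


# Demonstration on a throwaway directory tree (the layout here is hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "results" / "run" / "model"
    model_dir.mkdir(parents=True)
    (model_dir / "model_hyperparameters.json").write_text("{}")
    found = [p.relative_to(tmp).as_posix() for p in find_model_dirs(Path(tmp))]
print(found)
```

Passing one of the directories returned by such a search to `LudwigModel.load` avoids depending on the notebook's current working directory.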