### 1B Parameter QLORA TRAINING

* This is an attempt to train a 1B-parameter LLM using QLoRA.

#### Check CUDA Availability

* First we need to confirm that CUDA is available. We can start with the nvidia-smi shell tool.

In [1]:
!nvidia-smi

Sun Jan 21 01:24:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1070        Off | 00000000:01:00.0 Off |                  N/A |
| 27%   34C    P2              29W / 151W |     16MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Set Working Directory

* Since we are importing various toolsets and saving items in specific folders, let's set the working directory so we know where we are globally for the remainder of this notebook.

In [2]:
import os

# Get the current working directory first
current_directory = os.getcwd()

print("Original Working Directory: ", current_directory)

# Set the working directory to the notebook's location
notebook_dir = '/home/jovyan/work/llm_training_experiments'
os.chdir(notebook_dir)
print("New Working Directory: ", os.getcwd())

Original Working Directory:  /home/jovyan/work/llm_training_experiments
New Working Directory:  /home/jovyan/work/llm_training_experiments


### Check Torch and GPU Specifications

* We need to ensure that torch is available within this kernel.
* We have developed a utility that checks GPU availability, which we can import after adding its folder to sys.path.

In [3]:
import sys

# Since the working directory is the notebook directory, utils is one directory up
sys.path.append('../utils')

from specs import show_torch_and_gpu_stats

show_torch_and_gpu_stats()

2.1.2+cu121
GPU is available.
Current GPU device: 0
GPU Name: NVIDIA GeForce GTX 1070
GPU Memory Capacity: 8.501919744 GB


### Get Dataset for Training

* We want to use a custom dataset, starting with our own CSV file, rather than using the dataset tools available on HuggingFace. We're doing it this way because we want to understand the fundamentals of fine-tuning a language model on our own custom datasets, rather than optimizing against publicly available datasets.
* That being said, we can use a sample dataset to start off with and transfer this into our own CSV file. The sample dataset we draw from is here: https://huggingface.co/datasets/Stevross/mmlu/viewer/abstract_algebra/dev

#### What the MMLU Dataset Includes

The MMLU dataset is a part of an LLM benchmark which is designed to measure LLM performance. The dataset includes various detailed questions, possible choices, and the correct answer. This format is typical for multiple-choice questions (MCQs), which are common in language understanding tasks. The components of MMLU include:

* Question: The main body of the text presenting a scenario or asking a question.
* Choices: A list of possible answers from which the correct one must be selected.
* Answer: The correct choice, often indicated by an index or key that corresponds to the correct option in the choices list.

The idea behind this format is to mimic real-world tasks: in real life, a person might take a multiple-choice test to get certified in something. Open-ended, essay-style testing would be a different kind of benchmark, one that involves some form of subjective assessment, as Elo scores do.

#### What Our Dataset Needs to Include

To fine-tune a language model, we want to bias it toward a set of answers to questions, so an input format like the following would suffice:

```
question	answer
What is the definition of a group in abstract algebra?	A set with an operation that is associative, has an identity element, and every element has an inverse.
```
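
As a rough sketch of how an MMLU-style record (question, choices, answer index) could be turned into the question/answer row format described above, the snippet below uses only the standard library. The helper name `mmlu_to_qa_row` and the example record are hypothetical and invented for illustration; they are not taken from the actual dataset or from this notebook's utilities.

```python
import csv
import io


def mmlu_to_qa_row(record):
    """Map an MMLU-style record to a plain question/answer pair.

    Assumes 'answer' is an integer index into the 'choices' list.
    """
    answer_text = record["choices"][record["answer"]]
    return {"question": record["question"], "answer": answer_text}


# Hypothetical record, invented for illustration only.
example = {
    "question": "Which property must every group operation satisfy?",
    "choices": ["Commutativity", "Associativity", "Distributivity", "Idempotence"],
    "answer": 1,
}

# Write the converted row to an in-memory CSV with the two expected columns.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "answer"])
writer.writeheader()
writer.writerow(mmlu_to_qa_row(example))
print(buffer.getvalue())
```

In practice the answers in our own CSV are free-text explanations rather than the text of a selected choice, so this mapping is only a starting point.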

In [3]:
import pandas as pd
import os

df = pd.read_csv("./abstract_algebra.csv")
df

Unnamed: 0,question,answer
0,What is the definition of a group in abstract ...,"A set with an operation that is associative, h..."
1,How is a ring in algebra different from a group?,A ring has two operations (addition and multip...
2,What is a field in algebra?,A field is a ring in which every non-zero elem...
3,Explain the concept of a vector space.,A vector space is a collection of vectors that...
4,What is an isomorphism in algebra?,An isomorphism is a bijective homomorphism bet...


#### Train Test Split

* Once the data has been obtained, we can split it into training and testing data using a standard sklearn tool, [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from the model_selection module, which splits data into random subsets for training and testing.
* When this notebook was authored, sklearn was not installed, so we first use conda to install it and make it usable within the kernel.

In [5]:
!conda install -y scikit-learn

Collecting package metadata (current_repodata.json): | ^C
/ 

In [4]:
import sklearn
print(sklearn.__version__)

1.3.2


In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.1)

* Once we have split the data randomly, we can view the "train" and "test" sets individually to make sure they look correct.
* This split is unlikely to be useful for evaluation: with only five rows, the single test question has nothing to do with the four training questions. This is merely a development test, but we're doing the split as a best practice.
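
To make the 4/1 split above less mysterious, here is a minimal standard-library sketch of the size arithmetic and the shuffle-then-slice idea. It assumes the test count is the ceiling of test_size times the number of rows, which is consistent with the one-row test set shown below; `split_sizes` and `shuffle_split` are hypothetical names, not sklearn functions.

```python
import math
import random


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def shuffle_split(rows, test_size, seed=0):
    """Shuffle row indices with a fixed seed, then slice off the test rows."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    _, n_test = split_sizes(len(rows), test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(5))  # stand-in for the five CSV rows
train_rows, test_rows = shuffle_split(rows, test_size=0.1)
print(len(train_rows), len(test_rows))
```

With sklearn itself, passing a fixed `random_state` to `train_test_split` makes the split reproducible across runs.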

In [6]:
train

Unnamed: 0,question,answer
0,What is the definition of a group in abstract ...,"A set with an operation that is associative, h..."
4,What is an isomorphism in algebra?,An isomorphism is a bijective homomorphism bet...
3,Explain the concept of a vector space.,A vector space is a collection of vectors that...
2,What is a field in algebra?,A field is a ring in which every non-zero elem...


In [7]:
test

Unnamed: 0,question,answer
1,How is a ring in algebra different from a group?,A ring has two operations (addition and multip...


### Config Model

Before setting up a training run, we set up a configuration dictionary, CONFIG, which organizes the settings and parameters for training the model. The configuration is structured into the sections described below.

#### Directory and Naming Conventions

* We need a START_MODEL_NAME which represents the model we are deriving from, on the HuggingFace hub.
* We need a NEW_MODEL_NAME which represents what our model will be called afterward.
* DATASET_PATH is where we can find our dataset that we're using to fine-tune the model.
* OUTPUT_DIR is where we will save the various important files after the training is completed.

#### Block, Batch, Steps, Save Limit, Learning Rate - Within basic_config

* "block_size": part of the [Preprocess](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) step. Data often needs to be broken into smaller, manageable sequences because models have a maximum sequence length they can handle (often 512 or 1024 tokens for Transformer-based models). The block_size parameter specifies the maximum length of these sequences; for example, a block_size of 128 means that each sequence of tokens fed to the model will be at most 128 tokens long. On a weaker GPU, the block size might need to be smaller.
* "num_train_epochs": an argument used by the [Trainer](https://huggingface.co/transformers/v4.2.2/main_classes/trainer.html) class that sets the total number of training epochs to perform. If it is not an integer, the fractional part is run as a partial final epoch before training stops. More epochs take more time.
* "per_device_train_batch_size": discussed alongside [Gradient Accumulation](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-accumulation), a method for training more efficiently on a single GPU. The batch size determines how many samples each device processes in one forward and backward pass. A larger batch size typically gives more stable and accurate gradient estimates, but it also requires more GPU memory. For example, if per_device_train_batch_size is set to 4, each GPU (if you have more than one) processes 4 samples per iteration. On limited hardware, a batch size that is too large can slow training down or exhaust GPU memory.
* "save_steps": the number of update steps (steps where the model's weights are updated) between two checkpoint saves; it defaults to 500. A checkpoint is a snapshot of the model's state, including its weights, optimizer state, and other parameters. For example, if this is set to 500, training saves a checkpoint after every 500 update steps. The tradeoff is speed and space: frequent saving takes up more disk space and can slow training down due to I/O. As an example scenario, if you set per_device_train_batch_size=1 and gradient_accumulation_steps=4, the effective batch size becomes 4. Suppose your training dataset has 1000 examples. With an effective batch size of 4, there will be 250 update steps in one epoch. If save_steps is 500, the model will save a checkpoint after completing two epochs.
* "save_total_limit": sets the maximum number of model checkpoints to keep, with older checkpoints being deleted.
* "learning_rate": a key hyperparameter in the optimization process of training a machine learning model. It determines the size of the steps taken during optimization and thus the rate at which the model learns. A higher learning rate allows the model to learn faster, with larger updates to weights during training. However, if the learning rate is too high, the model might overshoot optimal solutions or fail to converge.
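
The save_steps arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and it assumes the number of update steps per epoch is the dataset size divided by the effective batch size, rounded up.

```python
import math


def update_steps_per_epoch(num_examples, per_device_batch_size,
                           grad_accum_steps=1, num_devices=1):
    """Approximate optimizer update steps in one pass over the data."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch)


def epochs_until_first_checkpoint(steps_per_epoch, save_steps):
    """How many epochs pass before save_steps update steps have run."""
    return save_steps / steps_per_epoch


# Scenario from the text: 1000 examples, batch size 1, 4 accumulation steps.
steps = update_steps_per_epoch(1000, per_device_batch_size=1, grad_accum_steps=4)
print(steps, epochs_until_first_checkpoint(steps, save_steps=500))

# This notebook's values, assuming the 4-row training split and no accumulation.
tiny_steps = update_steps_per_epoch(4, per_device_batch_size=2)
print(tiny_steps, epochs_until_first_checkpoint(tiny_steps, save_steps=1000))
```

Under those assumptions, the tiny development dataset finishes its single epoch long before save_steps=1000 is reached, so no intermediate checkpoint would be triggered by save_steps during this run.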

#### bnb_config

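These are settings for quantization and data type used by the [BitsAndBytes library](https://github.com/TimDettmers/bitsandbytes), which is a lightweight wrapper around custom CUDA functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

* "load_in_4bit": specifies whether to load the model in 4-bit quantization mode or not.
* "load_4bit_use_double_quant": specifies whether to use double (nested) quantization, in which the quantization constants from the first quantization are themselves quantized to save additional memory. The matching BitsAndBytesConfig argument is named bnb_4bit_use_double_quant.
* "bnb_4bit_quant_type": the type of 4-bit quantization to be used, either fp4 or nf4, both of which are options in BitsAndBytes.
* "bnb_4bit_compute_dtype": defines the data type used for computation, e.g. torch.bfloat16, a 16-bit floating-point format. The weights are stored in 4-bit and dequantized to this type for matrix multiplications.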
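
To give a feel for what double quantization saves, the sketch below works out the storage overhead of the quantization constants per model parameter. The block sizes and bit widths (one 32-bit constant per 64 weights, then 8-bit constants with one 32-bit constant per 256 of them) follow the description in the QLoRA paper and are assumptions here, not values read from this notebook or from the library.

```python
def constant_overhead_bits(block_size=64, constant_bits=32, double_quant=False,
                           second_block_size=256, second_constant_bits=8):
    """Extra bits per parameter spent on storing quantization constants."""
    if not double_quant:
        # One full-precision constant shared by each block of weights.
        return constant_bits / block_size
    # Constants are stored in fewer bits, plus a second level of constants.
    first_level = second_constant_bits / block_size
    second_level = constant_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits(double_quant=False)
nested = constant_overhead_bits(double_quant=True)
print(f"single quantization: {plain:.3f} bits per parameter")
print(f"double quantization: {nested:.3f} bits per parameter")
print(f"saving:              {plain - nested:.3f} bits per parameter")
```

The saving is small per parameter, but it applies to every weight in the model, which matters on an 8 GB card.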

#### lora_config

These are the settings passed to LoraConfig in the peft library, which is described in more detail further below.

* "r": the rank of the low-rank adaptation matrices.
* "lora_alpha": the scaling factor applied to the adapter output.
* "target_modules": which modules of the model LoRA is applied to, here the query_key_value attention projection.
* "lora_dropout": the dropout rate applied in the adapter for regularization.
* "bias": which bias terms are trained; "none" leaves all biases frozen.
* "task_type": the kind of task the adapter is set up for, here causal language modeling.



In [3]:
START_MODEL_NAME = "tiiuae/falcon-rw-1b"
START_MODEL_PATH = "/home/jovyan/work/models/" + f"{START_MODEL_NAME}"
NEW_MODEL_NAME = "tiiuae/falcon-rw-1b-slight-tweak"
notebook_dir = '/home/jovyan/work/llm_training_experiments/'
csv_file = 'abstract_algebra.csv'
DATASET_PATH = notebook_dir + csv_file
OUTPUT_DIR = f'{notebook_dir}{NEW_MODEL_NAME}/'


CONFIG = {
    'basic_config': 
    {
        "pretrained_model": f'{START_MODEL_NAME}',
        "start_model_path": f'{START_MODEL_PATH}',
        "new_model": f'{NEW_MODEL_NAME}',
        "train_path": f'{DATASET_PATH}',
        "output_dir": f'{OUTPUT_DIR}',
        "block_size": 512,
        "num_train_epochs": 1,
        "per_device_train_batch_size": 2,
        "save_steps": 1000,
        "save_total_limit": 2,
        "learning_rate": 5e-05,
    },
    'lora_config': 
    {
        "r": 16,
        "lora_alpha": 32,
        "target_modules": ["query_key_value"],
        "lora_dropout": 0.05,
        "bias": "none",
        "task_type": "CAUSAL_LM"
    },
    'bnb_config': {
        "load_in_4bit": True,
        "load_4bit_use_double_quant": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": "torch.bfloat16"
    },
}

CONFIG

{'basic_config': {'pretrained_model': 'tiiuae/falcon-rw-1b',
  'start_model_path': '/home/jovyan/work/models/tiiuae/falcon-rw-1b',
  'new_model': 'tiiuae/falcon-rw-1b-slight-tweak',
  'train_path': '/home/jovyan/work/llm_training_experiments/abstract_algebra.csv',
  'output_dir': '/home/jovyan/work/llm_training_experiments/tiiuae/falcon-rw-1b-slight-tweak/',
  'block_size': 512,
  'num_train_epochs': 1,
  'per_device_train_batch_size': 2,
  'save_steps': 1000,
  'save_total_limit': 2,
  'learning_rate': 5e-05},
 'lora_config': {'r': 16,
  'lora_alpha': 32,
  'target_modules': ['query_key_value'],
  'lora_dropout': 0.05,
  'bias': 'none',
  'task_type': 'CAUSAL_LM'},
 'bnb_config': {'load_in_4bit': True,
  'load_4bit_use_double_quant': True,
  'bnb_4bit_quant_type': 'nf4',
  'bnb_4bit_compute_dtype': 'torch.bfloat16'}}

### config.json and Saving

* Running the save_config() function imported from our utilities writes the CONFIG dictionary to a config.json file in a config folder under the output directory.
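
The real save_config() lives in our utils folder and is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical standard-library sketch that writes a config dictionary to `<output_dir>/config/config.json`, matching the paths printed below. The names `save_config_sketch` and `load_config_sketch` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_config_sketch(config):
    """Write config to <output_dir>/config/config.json and return the path."""
    config_dir = Path(config["basic_config"]["output_dir"]) / "config"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    with config_path.open("w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config_path


def load_config_sketch(config_path):
    """Read a config dictionary back from disk."""
    with Path(config_path).open(encoding="utf-8") as f:
        return json.load(f)


# Demo in a temporary folder so nothing is written next to the notebook.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"basic_config": {"output_dir": tmp, "block_size": 512}}
    path = save_config_sketch(demo)
    print(path.name, load_config_sketch(path) == demo)
```

One plausible reason the compute dtype is stored in CONFIG as the string "torch.bfloat16" is that JSON can only hold plain values such as strings and numbers, not torch dtype objects.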


In [5]:
import sys

# remember to append the working path relative to our working directory
print("Working Directory: ", os.getcwd())
sys.path.append('../utils')

from config import save_config

Working Directory:  /home/jovyan/work/llm_training_experiments


  from .autonotebook import tqdm as notebook_tqdm


In [6]:
save_config(CONFIG)

config directory is: /home/jovyan/work/llm_training_experiments/tiiuae/falcon-rw-1b-slight-tweak/config
Config file written to /home/jovyan/work/llm_training_experiments/tiiuae/falcon-rw-1b-slight-tweak/config/config.json.


### Load Quantization Configuration to 4 bit from Hugging Face

* [Quantization](https://huggingface.co/docs/transformers/main_classes/quantization) reduces memory and computation costs by representing weights and activations with lower-precision data types, such as 8-bit or 4-bit integers rather than 32-bit floats.
* [BitsAndBytesConfig](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig) is a wrapper class that holds all of the attributes and features you can adjust for a model loaded with bitsandbytes quantization.
* At the time of writing, there are three different types of quantization: LLM.int8(), FP4, and NF4.
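
A quick back-of-the-envelope calculation shows why quantization matters on an 8 GB GPU. The sketch below assumes roughly 1.3 billion parameters, which is about what the layer shapes printed later in this notebook add up to, and counts only the weights themselves (no activations, optimizer state, or quantization constants).

```python
BITS_PER_WEIGHT = {
    "float32": 32,
    "float16 / bfloat16": 16,
    "int8": 8,
    "4-bit (FP4 or NF4)": 4,
}


def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


APPROX_PARAMS = 1.3e9  # assumption: rounded parameter count for a "1B" model

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>20}: {weight_memory_gib(APPROX_PARAMS, bits):.2f} GiB")
```

Under these assumptions the full-precision weights alone take well over half of the card's memory, while the 4-bit weights take a small fraction of it, leaving room for activations and adapter training.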

In [7]:
import sys

# remember to append the working path relative to our working directory
print("Working Directory: ", os.getcwd())
sys.path.append('../utils')

from config import get_bnb_config

bnb_config = get_bnb_config(CONFIG)

# Look at some of the objects
print("load_in_4bit:", bnb_config.load_in_4bit)
print("load_in_4bit_use_double_quant:", bnb_config.load_in_4bit_use_double_quant)
print("bnb_4bit_quant_type:", bnb_config.bnb_4bit_quant_type)
print("bnb_4bit_compute_dtype:", bnb_config.bnb_4bit_compute_dtype)

Working Directory:  /home/jovyan/work/llm_training_experiments
load_in_4bit: True
load_in_4bit_use_double_quant: True
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: torch.bfloat16


### Load the Lora Configuration

* The peft (Parameter-Efficient Fine-Tuning) library focuses on fine-tuning large pre-trained language models (PLMs) in a computationally and storage-efficient manner. Instead of fine-tuning all the model's parameters, PEFT methods fine-tune only a small number of (extra) model parameters. This approach can significantly decrease computational and storage costs while achieving performance comparable to full fine-tuning.

* Integration with Hugging Face Accelerate: The package is seamlessly integrated with Hugging Face Accelerate for handling large-scale models, leveraging technologies like DeepSpeed and Big Model Inference for efficient computing.

Supported Methods:

* LoRA (Low-Rank Adaptation): Adapts large language models by adding low-rank matrices to existing weights.
* Prefix Tuning, P-Tuning, and Prompt Tuning: These are techniques that involve optimizing continuous prompts or prepending learned embeddings to model inputs to adapt models to new tasks.
* AdaLoRA: Adaptive budget allocation for fine-tuning.
* MultiTask Prompt Tuning: Uses prompts for multitask transfer learning.
* LoHa, LoKr, LoftQ: Different parameter-efficient tuning methods focusing on aspects like low-rank Hadamard product, Kronecker Adapter, and quantization, respectively.
* OFT (Orthogonal Finetuning): A method for controlling diffusion in text-to-image models.

* LoraConfig in peft: LoraConfig is a configuration class for setting up Low-Rank Adaptation (LoRA) within the peft package. It includes parameters like r (rank of the adaptation), lora_alpha (scaling factor), lora_dropout (dropout rate for regularization), and target_modules (which parts of the model to apply LoRA to). The purpose of this configuration is to efficiently adapt large pre-trained models by adding low-rank structures to the model's layers, allowing for efficient fine-tuning with a smaller number of additional parameters.
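
To see how few parameters LoRA actually trains with our settings, the sketch below counts the adapter weights for r=16 on the query_key_value projection. The layer dimensions (2048 inputs, 6144 outputs, 24 layers) come from the model printout later in this notebook; the helper name is hypothetical, and the count assumes the usual LoRA layout of one r x in matrix and one out x r matrix per adapted module.

```python
def lora_params_per_module(in_features, out_features, r):
    """Trainable weights in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


R, LORA_ALPHA = 16, 32
IN_FEATURES, OUT_FEATURES, NUM_LAYERS = 2048, 6144, 24

per_module = lora_params_per_module(IN_FEATURES, OUT_FEATURES, R)
total_trainable = per_module * NUM_LAYERS
frozen_per_module = IN_FEATURES * OUT_FEATURES  # weights of the original projection

print(f"adapter weights per layer: {per_module:,}")
print(f"adapter weights in total:  {total_trainable:,}")
print(f"share of each projection:  {per_module / frozen_per_module:.2%}")
print(f"scaling (lora_alpha / r):  {LORA_ALPHA / R}")
```

The scaling factor lora_alpha / r is the multiplier applied to the adapter's output before it is added to the frozen layer's output, so alpha=32 with r=16 doubles the adapter contribution.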


In [13]:
conda install -c conda-forge peft

done
Solving environment: done


  current version: 4.10.1
  latest version: 23.11.0

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /opt/conda/envs/tf_gpu_env

  added / updated specs:
    - peft


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    accelerate-0.26.1          |     pyhd8ed1ab_0         178 KB  conda-forge
    peft-0.7.1                 |     pyhd8ed1ab_0          90 KB  conda-forge
    safetensors-0.3.3          |   py39h9fdd4d6_1         1.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.3 MB

The following NEW packages will be INSTALLED:

  accelerate         conda-forge/noarch::accelerate-0.26.1-pyhd8ed1ab_0
  peft               conda-forge/noarch::peft-0.7.1-pyhd8ed1ab_0
  safetensors        conda-forge/linux-64::

In [17]:
from peft import LoraConfig
def get_lora_config(config):
    """Gets the Lora Config for creating adapter weights for model."""
    lora_config = LoraConfig(
        r = config['lora_config']['r'],
        lora_alpha = config['lora_config']['lora_alpha'],
        target_modules = config['lora_config']['target_modules'],
        lora_dropout = config['lora_config']['lora_dropout'],
        bias = config['lora_config']['bias'],
        task_type = config['lora_config']['task_type']
    )
    return lora_config

In [18]:
get_lora_config(CONFIG)

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=16, target_modules={'query_key_value'}, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})

### Load the Model and Tokenizer

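* tokenizer: converts text into the token IDs the model consumes.
* model: the pretrained network itself, loaded here without quantization first so we can inspect its structure.

Once we load the model onto the GPU with model.to('cuda'), the cell output should show the loaded model as: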

```
FalconModel(
  (word_embeddings): Embedding(50304, 2048)
  (h): ModuleList(
    (0-23): 24 x FalconDecoderLayer(
      (self_attention): FalconAttention(
        (query_key_value): FalconLinear(in_features=2048, out_features=6144, bias=True)
        (dense): FalconLinear(in_features=2048, out_features=2048, bias=True)
        (attention_dropout): Dropout(p=0.0, inplace=False)
      )
      (mlp): FalconMLP(
        (dense_h_to_4h): FalconLinear(in_features=2048, out_features=8192, bias=True)
        (act): GELU(approximate='none')
        (dense_4h_to_h): FalconLinear(in_features=8192, out_features=2048, bias=True)
      )
      (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    )
  )
  (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
)
```

* word_embeddings: This is an embedding layer with a vocabulary size of 50,304 and an embedding dimension of 2,048. Embedding layers are typically used to convert token indices into dense vectors of fixed size.
* h: This is a ModuleList containing multiple layers, specifically 24 FalconDecoderLayer instances, indexed from 0 to 23. This means the model has 24 layers, which is common in large transformer models.
* Each FalconDecoderLayer includes:
    * self_attention: This is the self-attention mechanism, a hallmark of transformer models, with specific components:
        * query_key_value: A linear layer that projects input features (of size 2,048) into query, key, and value vectors (of total size 6,144). This is essential for the attention mechanism.
        * dense: Another linear layer that projects the output of the attention mechanism back to the feature size (2,048).
        * attention_dropout: A dropout layer to prevent overfitting, set to 0.0, indicating it's currently not active.
    * mlp: A feed-forward neural network (MLP) within each transformer layer, consisting of:
        * dense_h_to_4h: A linear layer that expands the feature dimension from 2,048 to 8,192.
        * act: An activation function, here a GELU (Gaussian Error Linear Unit), used for introducing non-linearity.
        * dense_4h_to_h: Another linear layer to project the features back from 8,192 to 2,048 dimensions.
    * input_layernorm and post_attention_layernorm: Layer normalization components, used to stabilize the learning process. They are applied before and after the self-attention mechanism, respectively.
* ln_f: This is a final layer normalization applied to the output of the last layer in the network.
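
The shapes in the printout are enough to count the parameters by hand. The sketch below does that using only the dimensions listed above; the helper names are hypothetical, and it counts the bare FalconModel as printed (no language-modeling head).

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(width):
    """Scale and shift vectors of a LayerNorm with elementwise_affine=True."""
    return 2 * width


HIDDEN, VOCAB, NUM_LAYERS = 2048, 50304, 24

embedding = VOCAB * HIDDEN
attention = linear_params(HIDDEN, 3 * HIDDEN) + linear_params(HIDDEN, HIDDEN)
mlp = linear_params(HIDDEN, 4 * HIDDEN) + linear_params(4 * HIDDEN, HIDDEN)
norms = 2 * layernorm_params(HIDDEN)  # input and post-attention layer norms
per_layer = attention + mlp + norms

total = embedding + NUM_LAYERS * per_layer + layernorm_params(HIDDEN)  # + ln_f

print(f"per decoder layer: {per_layer:,}")
print(f"total:             {total:,}")
print(f"size at 4 bytes per weight: {total * 4 / 2**20:.0f} MiB")
```

The total comes to roughly 1.3 billion parameters, and at 4 bytes per weight the size is in the same range as the 5088MiB that nvidia-smi reports for the process later in this notebook. The difference would plausibly be CUDA context overhead, though that is a guess.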



In [8]:
import torch

torch.cuda.empty_cache()

In [9]:
from transformers import AutoTokenizer, AutoModel
sys.path.append('../utils')

from clear_gpu import clear_gpu_memory

# use our utility to clear the GPU memory first.
clear_gpu_memory()

tokenizer = AutoTokenizer.from_pretrained(CONFIG['basic_config']['pretrained_model'], trust_remote_code=True)
model = AutoModel.from_pretrained(CONFIG['basic_config']['pretrained_model'], trust_remote_code=True)
model.to('cuda')

torch.cuda.empty_cache() executed.


Some weights of the model checkpoint at tiiuae/falcon-rw-1b were not used when initializing FalconModel: ['lm_head.weight']
- This IS expected if you are initializing FalconModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FalconModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


FalconModel(
  (word_embeddings): Embedding(50304, 2048)
  (h): ModuleList(
    (0-23): 24 x FalconDecoderLayer(
      (self_attention): FalconAttention(
        (query_key_value): FalconLinear(in_features=2048, out_features=6144, bias=True)
        (dense): FalconLinear(in_features=2048, out_features=2048, bias=True)
        (attention_dropout): Dropout(p=0.0, inplace=False)
      )
      (mlp): FalconMLP(
        (dense_h_to_4h): FalconLinear(in_features=2048, out_features=8192, bias=True)
        (act): GELU(approximate='none')
        (dense_4h_to_h): FalconLinear(in_features=8192, out_features=2048, bias=True)
      )
      (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
      (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    )
  )
  (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
)

### Deleting the Model

* We don't actually need this model, only the tokenizer, because we are going to load the QLoRA model instead.
* Running nvidia-smi in a terminal at this point shows the following:

```
0   N/A  N/A     36447      C   /opt/conda/envs/tf_gpu_env/bin/python      5088MiB
```

Once we run the following, we can run nvidia-smi again, and the GPU memory held by the PID in question should be released.

In [16]:
import gc

model.to('cpu')
del model
torch.cuda.empty_cache()

# garbage collection
gc.collect()

995

### Load QLora Model

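* Now that we have the tokenizer and have freed the unquantized model from the GPU, we can reload the model in 4-bit using the bitsandbytes configuration and wrap it with LoRA adapters.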

In [11]:
import sys

# Since the working directory is the notebook directory, utils is one directory up
sys.path.append('../utils')

from transformers import AutoModel

from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

from config import get_bnb_config
from config import get_lora_config

bnb_config = get_bnb_config(CONFIG)


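# Load the model with the bitsandbytes configuration.
# Note: AutoModel loads the bare FalconModel and drops lm_head.weight (see the warning in the output);
# a causal-LM class such as FalconForCausalLM, as in the commented line, keeps the LM head.
# model = FalconForCausalLM.from_pretrained(start_model_path, device_map="auto", trust_remote_code=True, quantization_config=bnb_config)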
model = AutoModel.from_pretrained(
    CONFIG['basic_config']['pretrained_model'],
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

# prepare quantized model using kbit training
quantized_model = prepare_model_for_kbit_training(model)

# get the lora configuration
lora_config = get_lora_config(CONFIG)

# finally, the quantized lora model using get_peft_model from the peft library
quantized_lora_model = get_peft_model(
    quantized_model, 
    lora_config
)

Some weights of the model checkpoint at tiiuae/falcon-rw-1b were not used when initializing FalconModel: ['lm_head.weight']
- This IS expected if you are initializing FalconModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FalconModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


False

The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//matplotlib_inline.backend_inline')}
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=121, Highest Compute Capability: 6.1.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Required library version not found: libbitsandbytes_cuda121_nocublaslt.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

CUDA SETUP: CUDA detection failed! Possible reasons:
1. You need to manually override the PyTorch CUDA version. Please see: "https://github.com/TimDettmers/bitsandbytes


python -m bitsandbytes


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)


RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
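
The log above says bitsandbytes looked for libbitsandbytes_cuda121_nocublaslt.so, did not find it, and fell back to the CPU library. It also found libcudart.so.11.0 while PyTorch reports CUDA 121, which hints at a version mismatch. The sketch below is a small diagnostic, not a fix: it rebuilds the expected file name and lists which libbitsandbytes_*.so files ship with the installed package. The naming rule (a _nocublaslt suffix for compute capability below 7.5) is inferred from this log and from the library's warnings, so treat it as an assumption; both helper names are hypothetical.

```python
import importlib.util
from pathlib import Path


def expected_bnb_library(cuda_version, compute_capability):
    """Guess the shared library name bitsandbytes will look for."""
    major, minor = (int(part) for part in compute_capability.split("."))
    # Assumption: older GPUs (below 7.5) use the build without cuBLASLt.
    suffix = "" if (major, minor) >= (7, 5) else "_nocublaslt"
    return f"libbitsandbytes_cuda{cuda_version}{suffix}.so"


def shipped_bnb_libraries():
    """List libbitsandbytes_*.so files in the installed package, if any."""
    spec = importlib.util.find_spec("bitsandbytes")  # does not import the package
    if spec is None or not spec.submodule_search_locations:
        return []
    libs = []
    for location in spec.submodule_search_locations:
        libs.extend(sorted(p.name for p in Path(location).glob("libbitsandbytes_*.so")))
    return libs


wanted = expected_bnb_library("121", "6.1")  # values taken from the log above
available = shipped_bnb_libraries()
print("wanted:   ", wanted)
print("available:", available)
print("present:  ", wanted in available)
```

If the wanted file is missing from the list, the options named in the error message apply: check LD_LIBRARY_PATH, run python -m bitsandbytes for details, or compile the library from source.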

In [10]:
import gc
import torch

model.to('cpu')
del model
torch.cuda.empty_cache()

# garbage collection
gc.collect()

2006

In [None]:
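from transformers import FalconForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training


def load_qlora_model(start_model_path, bnb_config, lora_config):
    """Load the base model in 4-bit and wrap it with LoRA adapters."""
    # FalconForCausalLM keeps the language-modeling head, unlike AutoModel above
    model = FalconForCausalLM.from_pretrained(
        start_model_path,
        device_map="auto",
        trust_remote_code=True,
        quantization_config=bnb_config,
    )
    # freeze the base weights and prepare the quantized model for k-bit training
    quantized_model = prepare_model_for_kbit_training(model)
    # attach the trainable low-rank adapters to the frozen quantized model
    return get_peft_model(quantized_model, lora_config)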