## <span style="color:#ff5f27">📝 Imports </span>

In [2]:
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

from functions.prompt_engineering import generate_prompt

2024-01-29 09:33:00,460 INFO: PyTorch version 2.1.2 available.
2024-01-29 09:33:00,517 INFO: TensorFlow version 2.11.0 available.


## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [3]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/1143
Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;">🪝 Feature View Retrieval </span>

In [4]:
# Retrieve the 'cqa' feature view
feature_view = fs.get_feature_view(
    name='cqa',
    version=1,
)

In [5]:
# Initialize batch scoring for the feature view
feature_view.init_batch_scoring()

# Get batch data from the feature view
data = feature_view.get_batch_data()

# Display the first three rows of the batch data
data.head(3)

Finished: Reading data from Hopsworks, using ArrowFlight (0.91s) 


Unnamed: 0,context,questions,responses
0,"NIST SP 800- 53, REV. 5 ...",What is the purpose of device identification a...,The purpose of device identification and authe...
1,"NIST SP 800- 53, REV. 5 ...",What is attack surface reduction and how does ...,Attack surface reduction is the process of red...
2,prot ection impact assessment should also be m...,What is the purpose of a data protection impac...,A data protection impact assessment is conduct...


## <span style="color:#ff5f27;">🗄️ Dataset Creation </span>

In [6]:
# Generate prompts for each record in the DataFrame using context, questions, and responses
prompts = data.apply(
    lambda record: generate_prompt(record['context'], record['questions']) + f'\n### RESPONSE:\n{record["responses"]}', 
    axis=1,
).tolist()
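
The `generate_prompt` helper is imported from `functions.prompt_engineering` and its source is not shown in this notebook. As a rough illustration of how each training text is assembled, here is a minimal sketch. The names `generate_prompt_sketch` and `build_training_text`, the `### QUESTION:` header, and the exact whitespace are assumptions; only the `[INST]` block, the `### CONTEXT:` header, and the `### RESPONSE:` suffix are visible in the notebook itself.

```python
# Hypothetical stand-in for functions.prompt_engineering.generate_prompt.
# The instruction text is copied from the sample printed in this notebook;
# the "### QUESTION:" header and the spacing are assumptions.
SYSTEM_INSTRUCTION = (
    "Instruction: You are an AI assistant specialized in regulatory documents. \n"
    "Your role is to provide accurate and informative answers based on the given context."
)


def generate_prompt_sketch(context: str, question: str) -> str:
    """Build the instruction, context, and question part of a prompt."""
    return (
        f"\n[INST] \n{SYSTEM_INSTRUCTION}\n[/INST]\n\n"
        f"### CONTEXT:\n\n{context}\n\n"
        f"### QUESTION:\n{question}"  # assumed section header
    )


def build_training_text(context: str, question: str, response: str) -> str:
    """Append the target answer, mirroring the lambda used in the cell above."""
    return generate_prompt_sketch(context, question) + f"\n### RESPONSE:\n{response}"


example = build_training_text(
    context="Some regulatory passage.",
    question="What does the passage cover?",
    response="It covers ...",
)
print(example)
```

Keeping the response in the same `text` field as the prompt is what lets `SFTTrainer` train on a single column.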

In [7]:
# Create a dataset from a dictionary with a single column named "text" containing prompts
dataset = Dataset.from_dict({
    "text": prompts,
})

In [8]:
print(dataset[10]['text'])


[INST] 
Instruction: You are an AI assistant specialized in regulatory documents. 
Your role is to provide accurate and informative answers based on the given context.
[/INST]

### CONTEXT:

The Office of the Comptroller of the Currency ( OCC) is responsible for supervising the 
federal banking system. The OCC ’s mission is to ensure that national banks, federal savings 
associations  (FSA) , and federal branches and agencies of foreign banking organizations1 
(collectively, banks2) operate in a safe and sound manner, provide fair access to financial 
services, treat customers fairly, and comply with applicable laws and regulations. To support this mission , the OCC  has prepared the “Bank Supervision Process” booklet of the 
Comptroller’s Handbook  for use by OCC examiners in connection with their supervision of 
banks.  This booklet is  the central reference for  the OCC ’s bank supervision policy, explains the 
OCC ’s risk -based  bank supervision approach, and  discusses the gener

## <span style="color:#ff5f27">⬇️ Model Loading </span>

In [9]:
# Define the model identifier for Mistral-7B-Instruct
MODEL_ID = 'mistralai/Mistral-7B-Instruct-v0.2'

In [10]:
# Load the tokenizer for Mistral-7B-Instruct model
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
)

# Set the pad token to the unknown token to handle padding
tokenizer.pad_token = tokenizer.unk_token

# Pad on the right, which SFTTrainer expects during training (it warns about other padding sides)
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

In [11]:
# 4-bit NF4 quantization config (QLoRA-style) with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [12]:
# Load the Mistral-7B-Instruct model with quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
)

# Configure the pad token ID in the model to match the tokenizer's pad token ID
model.config.pad_token_id = tokenizer.pad_token_id

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

2024-01-29 09:54:48,233 INFO: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
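
To get a feel for what the 4-bit configuration saves, the following back-of-the-envelope sketch estimates the weight footprint from the shard sizes shown in the download output above. It assumes the checkpoint stores 2 bytes per parameter (bfloat16), that every parameter is quantized (in practice some layers, such as embeddings and norms, stay in higher precision), and the block sizes described in the QLoRA paper (64 weights per block, 256 blocks per second-level group). Treat the result as a rough lower bound, not a measurement.

```python
# Shard sizes in GB, as reported by the download progress bars above.
shard_sizes_gb = [4.94, 5.00, 4.54]

checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint is stored in bfloat16, i.e. 2 bytes per parameter.
num_params = checkpoint_gb * 1e9 / 2

# NF4 stores 4 bits per weight plus quantization constants.
# With double quantization (block sizes assumed from the QLoRA paper):
#   8-bit constant per 64 weights + 32-bit constant per 64 * 256 weights.
bits_per_param = 4 + 8 / 64 + 32 / (64 * 256)

quantized_gb = num_params * bits_per_param / 8 / 1e9

print(f"checkpoint size:      {checkpoint_gb:.2f} GB")
print(f"estimated parameters: {num_params / 1e9:.2f} B")
print(f"bits per parameter:   {bits_per_param:.3f}")
print(f"estimated NF4 size:   {quantized_gb:.2f} GB")
```

Activations, LoRA weights, gradients, and optimizer state come on top of this figure during training.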

## <span style="color:#ff5f27">⚙️ Configuration </span>

In [13]:
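# LoRA adapter configuration
peft_config = LoraConfig(
    lora_alpha=64,           # scaling numerator: updates are scaled by lora_alpha / r = 2
    lora_dropout=0.1,        # dropout applied on the LoRA branch
    r=32,                    # rank of the low-rank update matrices
    bias="none",             # do not train bias terms
    task_type="CAUSAL_LM",
    target_modules=[         # attach adapters to the attention and MLP projections and the output head
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
)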
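
As a rough check on how many parameters this LoRA configuration trains, the sketch below counts `r * (in_features + out_features)` for each targeted module. The layer dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, vocabulary 32000, 32 decoder layers) are assumptions taken from the published Mistral-7B configuration, not from this notebook, so verify them against `model.config` before relying on the numbers.

```python
# LoRA hyperparameters from the configuration cell above.
R = 32
LORA_ALPHA = 64

# Assumed Mistral-7B dimensions (check model.config to confirm).
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads * 128 head dim (grouped-query attention)
INTERMEDIATE = 14336
VOCAB = 32000
NUM_LAYERS = 32

# (in_features, out_features) of each targeted module inside one decoder layer.
per_layer_modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int = R) -> int:
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o) for i, o in per_layer_modules.values())
lm_head = lora_param_count(HIDDEN, VOCAB)
total_trainable = per_layer * NUM_LAYERS + lm_head

print(f"LoRA scaling factor (alpha / r): {LORA_ALPHA / R}")
print(f"trainable params per layer:      {per_layer:,}")
print(f"trainable params in lm_head:     {lm_head:,}")
print(f"total trainable params:          {total_trainable:,}")
```

Under these assumptions the adapters add about 85 million trainable parameters, a little over 1% of the base model.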

In [14]:
training_arguments = TrainingArguments(
    output_dir="mistral7b_finetuned",       # directory for checkpoints and the saved model (also the default Hub repository id)
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # allow TF32 matmuls (requires an Ampere or newer NVIDIA GPU)
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio from the QLoRA paper; has no effect with the plain "constant" scheduler
    lr_scheduler_type="constant",           # constant learning rate (use "constant_with_warmup" for warmup_ratio to apply)
)

## <span style="color:#ff5f27">🏃🏻‍♂️ Training</span>

In [15]:
# Create the Supervised Fine-tuning Trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
    dataset_text_field='text',
)

Map:   0%|          | 0/2382 [00:00<?, ? examples/s]

In [16]:
# Train the model
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,1.4511
20,1.2486
30,1.1366
40,1.0882
50,1.048
60,1.0094
70,1.0315
80,1.0146
90,0.8944
100,0.8919




TrainOutput(global_step=1191, training_loss=0.3319690090883289, metrics={'train_runtime': 9693.2331, 'train_samples_per_second': 0.737, 'train_steps_per_second': 0.123, 'total_flos': 3.602651020216566e+17, 'train_loss': 0.3319690090883289, 'epoch': 3.0})
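
The `global_step=1191` and throughput figures above can be reproduced from the training arguments and the 2,382 examples reported by the `Map` progress bar. The sketch below walks through the arithmetic; it assumes a single GPU and no sequence packing, and it only approximates how the trainer derives its step count.

```python
import math

# Values taken from this notebook.
num_examples = 2382           # from the "Map" progress bar
per_device_batch_size = 3
gradient_accumulation_steps = 2
num_train_epochs = 3
train_runtime_s = 9693.2331   # from TrainOutput

# Assumption: one device, so the effective batch is batch size * accumulation steps.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs

samples_per_second = num_examples * num_train_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(f"effective batch size: {effective_batch_size}")
print(f"updates per epoch:    {updates_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"samples per second:   {samples_per_second:.3f}")
print(f"steps per second:     {steps_per_second:.3f}")
```

The computed totals match the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`, which suggests every example was used once per epoch.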

## <span style="color:#ff5f27">💾 Saving Model</span>

In [17]:
# Save the trained model
trainer.save_model()



## <span style="color:#ff5f27">🗄️ Model Registry</span>

In [None]:
# Create a Python model in the model registry
model_llm = mr.python.create_model(
    name="mistral_model", 
    description="Mistral Fine-tuned Model",
)

In [None]:
# Save the model directory with the fine-tuned model to the model registry
model_llm.save(training_arguments.output_dir)

---