<a href="https://colab.research.google.com/github/joshuaalpuerto/ML-guide/blob/main/Fine_tune_Mistral7b_instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git --progress-bar off
!pip install -q -U git+https://github.com/huggingface/peft.git --progress-bar off
!pip install -q -U git+https://github.com/huggingface/accelerate.git --progress-bar off
!pip install -q datasets


Set up constants to make loading and saving the model easier.

In [None]:

# model_id = "EleutherAI/pythia-2.8b-deduped"
# Models with 7B parameters or more need a sharded checkpoint to load here;
# otherwise they won't fit in 12 GB of memory.
model_id = "alexsherstinsky/Mistral-7B-v0.1-sharded"
OUTPUT_DIR = "experiments"

hugging_face_uname = 'JoshuaCAlpuerto'
base_model = model_id.split('/')[-1].lower()
hf_repo = f"{hugging_face_uname}/{base_model}-fellow-man"

print(hf_repo)

JoshuaCAlpuerto/mistral-7b-v0.1-sharded-fellow-man


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import json

def load_json(file_name, columns=('question', 'answer', 'country')):
  # Read the raw JSON from the mounted Google Drive folder.
  with open(f"/content/drive/MyDrive/datasets/{file_name}", 'r') as f:
    json_data = json.load(f)

  # Some files wrap the records in a top-level "questions" key.
  json_data = json_data["questions"] if "questions" in json_data else json_data

  data = pd.DataFrame(json_data)

  # Drop rows with missing values, then keep only the requested columns.
  data = data.dropna()
  data = data[list(columns)]

  # Remove duplicates where both question and answer are the same.
  data = data.drop_duplicates(subset=['question', 'answer'])

  return data

data = load_json('qna-augmented.json')
# data = load_json('ecommerce-faq.json',  columns=['question', 'answer'])
# data = data[data['country'] == 'Germany']


# Print modified data
print(len(data))
data.head()

1976


Unnamed: 0,question,answer,country
0,"If I move to a new country, will my tax reside...","No, you'll have to notify the Estonian tax aut...",Estonia
1,Will my tax residency be changed if I move to ...,"No, you'll have to notify the Estonian tax aut...",Estonia
2,"If I move to another country, will my tax resi...","No, you'll have to notify the Estonian tax aut...",Estonia
3,Is it necessary to change my tax residency whe...,"No, you'll have to notify the Estonian tax aut...",Estonia
4,Does the choice of country I move to affect my...,"No, you'll have to notify the Estonian tax aut...",Estonia


In [None]:
import torch
# Inference
def inference(text, model, tokenizer):
  generation_config = model.generation_config
  generation_config.max_new_tokens = 256   # maximum number of new tokens to generate
  # Note: temperature and top_p only take effect when do_sample=True
  generation_config.temperature = 0.3
  generation_config.top_p = 0.7
  generation_config.num_return_sequences = 1
  generation_config.pad_token_id = tokenizer.eos_token_id
  generation_config.eos_token_id = tokenizer.eos_token_id

  device = model.device
  # Tokenize
  encodings = tokenizer(
      text,
      return_tensors="pt",
  ).to(device)
  # Generate

  model.eval()
  with torch.no_grad():
    output = model.generate(
      input_ids=encodings.input_ids,
      attention_mask=encodings.attention_mask,
      generation_config=generation_config
    )

  # The decoded text includes the prompt
  generated_text_with_prompt = tokenizer.decode(output[0], skip_special_tokens=True)

  # Strip the prompt
  return generated_text_with_prompt[len(text):]

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    # The model will be loaded in the memory with 4-bit precision.
    load_in_4bit=True,
    # Apply the double quantization proposed in QLoRA (the quantization constants are quantized too).
    bnb_4bit_use_double_quant=True,
    # Quantization data type: "nf4" stands for 4-bit NormalFloat.
    bnb_4bit_quant_type="nf4",
    # Weights are stored in 4-bit but dequantized on the fly,
    # and all computations run in 16-bit precision (bfloat16).
    # This avoids a drastic drop in model quality.
    bnb_4bit_compute_dtype=torch.bfloat16
)
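
As a rough sanity check on why 4-bit loading matters on a 12 GB GPU, the sketch below estimates the weight footprint. The layer dimensions are taken from the public Mistral-7B-v0.1 config. Two things are assumptions rather than measurements from this notebook: that embeddings, `lm_head` and norm weights stay in 16-bit, and that double-quantized NF4 carries roughly 0.127 extra bits per parameter for its quantization constants (the figure reported in the QLoRA paper).

```python
# Back-of-the-envelope weight memory for Mistral-7B (dimensions from its public config).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
VOCAB = 32000

# Linear weights inside each decoder layer (the part that gets quantized).
per_layer = (
    2 * HIDDEN * HIDDEN            # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM          # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
)
quantized_params = LAYERS * per_layer

# Assumption: embeddings, lm_head and RMSNorm weights stay in 16-bit.
other_params = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

GIB = 1024 ** 3
fp16_gib = (quantized_params + other_params) * 2 / GIB
# 4 bits per weight plus ~0.127 bits of quantization constants (assumed overhead).
nf4_gib = (quantized_params * (4 + 0.127) / 8 + other_params * 2) / GIB

print(f"fp16 weights: {fp16_gib:.1f} GiB")
print(f"nf4 weights:  {nf4_gib:.1f} GiB")
```

This counts weights only; activations, LoRA parameters, gradients and optimizer state come on top.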

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# why padding_side="left"? - https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
max_length = 512 # This was an appropriate max length for my dataset

def formatting_func(example):
    text = f"### Question from {example['country']}: {example['question']}\n ### Answer: {example['answer']}"
    return text

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result
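
To make the template concrete, here is a small sketch that applies the same `formatting_func` to a made-up record and compares it with the prompt format used at inference time. The sample question and answer are placeholders, not rows from the dataset.

```python
# Same template as formatting_func above, applied to a placeholder record.
def formatting_func(example):
    return f"### Question from {example['country']}: {example['question']}\n ### Answer: {example['answer']}"

sample = {
    "country": "Estonia",
    "question": "Example question?",   # placeholder, not from the dataset
    "answer": "Example answer.",       # placeholder, not from the dataset
}
training_text = formatting_func(sample)

# At inference time the prompt is the same template cut off after "### Answer:".
inference_prompt = "### Question from {country}: {question}\n ### Answer:".format(
    country=sample["country"], question=sample["question"]
)

print(training_text)
# The inference prompt must be an exact prefix of the training text.
print(training_text.startswith(inference_prompt))
```

Keeping the inference prompt an exact prefix of the training text (including the space before `###` on the second line) matters, because the model only saw this one layout during fine-tuning.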

In [None]:
from datasets import Dataset
finetuning_dataset = Dataset.from_pandas(data)
finetuning_dataset = finetuning_dataset.map(generate_and_tokenize_prompt, remove_columns=finetuning_dataset.column_names)
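
One detail worth checking when `pad_token` is set to `eos_token`: the `DataCollatorForLanguageModeling(tokenizer, mlm=False)` used in the training cell rebuilds `labels` from `input_ids` and sets every position equal to the pad token id to `-100`, so the loss ignores it. Because pad and EOS share an id here, the real end-of-sequence token is masked as well. The sketch below imitates that masking in plain Python on made-up token ids (only 1 for BOS and 2 for EOS match the tokenizer; the content ids are placeholders). It illustrates the effect and is not the library code.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss
BOS_ID, EOS_ID = 1, 2
PAD_ID = EOS_ID       # mirrors tokenizer.pad_token = tokenizer.eos_token

def mask_pad_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids to labels, replacing every pad_id with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Left-padded toy sequence: pad, pad, BOS, three placeholder content ids, EOS.
input_ids = [PAD_ID, PAD_ID, BOS_ID, 100, 101, 102, EOS_ID]
labels = mask_pad_labels(input_ids)

print(labels)  # [-100, -100, 1, 100, 101, 102, -100]
```

If a fine-tuned model tends to keep generating until `max_new_tokens` is reached, a masked EOS label is one possible cause; using a dedicated pad token is a common workaround.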

In [None]:
finetuning_dataset = finetuning_dataset.train_test_split(test_size=0.1)

In [None]:
finetuning_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1976
})

In [None]:
# Test the base model before fine-tuning
question = "Is it also possible to obtain a Blue Card without a higher education qualification but based on completed vocational training?"
eval_prompt = "### Question from {country}: {question}\n ### Answer:".format(country="Germany", question=question)
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=256, pad_token_id=2, repetition_penalty=1.3)[0], skip_special_tokens=True))

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


### Question from Germany: Is it also possible to obtain a Blue Card without a higher education qualification but based on completed vocational training?
 ### Answer:м, yes. The German Federal Ministry of the Interior has published an information sheet (in English) which explains how you can apply for a residence permit as a skilled worker in accordance with § 21a AufenthG and thus receive a so-called “Blue Card”. This is not only open to university graduates; rather, applicants who have successfully completed their professional or technical college studies are equally eligible. In addition, there must be proof that your salary will exceed €56,800 per year after taxes. You should therefore contact us at our office if you would like more detailed advice regarding this matter. We look forward to hearing from you!
### Question from France: I am currently working here in Berlin as a freelancer. Can I get a work visa for my wife and children?
Answer: Yes, you certainly can – provided that c

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
from peft import LoraConfig, get_peft_model

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 85041152 || all params: 3837112320 || trainable%: 2.2162799758751914
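
The numbers printed above can be reproduced by hand, which is a useful check that the adapter targets the layers you expect. The sketch below uses the dimensions from the public Mistral-7B-v0.1 config and the rank from the `LoraConfig` above. It assumes that the 4-bit weights are stored packed two per byte (so `numel()` reports half of the real count) and that embeddings, `lm_head` and norms are left unquantized.

```python
# Mistral-7B dimensions (from its public config).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM, VOCAB = 1024, 32000
R = 32  # LoRA rank from the config above

# (in_features, out_features) of each targeted linear layer in a decoder block.
per_layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r=R):
    # LoRA adds A (r x in) and B (out x r) next to each frozen weight.
    return r * (in_features + out_features)

trainable = LAYERS * sum(lora_params(i, o) for i, o in per_layer_shapes.values())
trainable += lora_params(HIDDEN, VOCAB)  # lm_head is targeted once

# Frozen base: packed 4-bit weights report half their real element count.
linear = LAYERS * sum(i * o for i, o in per_layer_shapes.values())
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN  # embeddings, lm_head, norms
all_params = linear // 2 + unquantized + trainable

print(trainable, all_params, 100 * trainable / all_params)
```

This also explains why `all params` shows about 3.8 billion for a 7-billion-parameter model: relative to the true parameter count, the trainable share is closer to 1.2%.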


In [None]:
%load_ext tensorboard
# Default of logging directory is output_dir/runs
%tensorboard --logdir experiments/runs


In [None]:
import transformers
from datetime import datetime

tokenizer.pad_token = tokenizer.eos_token

training_args = transformers.TrainingArguments(
    output_dir=OUTPUT_DIR,
    warmup_steps=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=800,
    learning_rate=2.5e-5,          # Small learning rate for fine-tuning
    fp16=True,
    optim="paged_adamw_8bit",
    logging_steps=25,              # Log the training loss every 25 steps
    save_strategy="steps",         # Save checkpoints on a step schedule
    save_steps=100,                # Save a checkpoint every 100 steps
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=finetuning_dataset["train"],  # train_test_split above returns a DatasetDict
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
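
For orientation, the arithmetic below works out how much data this run covers and which checkpoint directories it leaves behind. It reuses the values from `TrainingArguments` above and the 1976-row dataset size printed earlier. The single-GPU setting is an assumption, and if only the 90% training split is used the epoch fraction is somewhat larger.

```python
max_steps = 800
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
save_steps = 100
num_rows = 1976  # dataset size printed earlier in the notebook

# Assumption: a single GPU, so the effective batch equals the per-device batch.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = max_steps * effective_batch
epochs = examples_seen / num_rows

# Checkpoint directories written under output_dir with save_strategy="steps".
checkpoints = [f"checkpoint-{step}" for step in range(save_steps, max_steps + 1, save_steps)]

print(f"examples seen: {examples_seen}, about {epochs:.2f} epochs")
print(checkpoints[-1])  # checkpoint-800
```

So 800 steps is less than one full pass over the data, and the last checkpoint is `experiments/checkpoint-800`.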

Save the adapter to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model.push_to_hub(hf_repo,
                  use_auth_token=True,
                  commit_message="basic training",
                  private=True)




CommitInfo(commit_url='https://huggingface.co/JoshuaCAlpuerto/mistral-7b-v0.1-sharded-fellow-man/commit/9c887b2c432087eab9c4be16210a06d91d311d60', commit_message='basic training', commit_description='', oid='9c887b2c432087eab9c4be16210a06d91d311d60', pr_url=None, pr_revision=None, pr_num=None)

By default, PEFT saves only the QLoRA adapter weights, so to use the fine-tuned model we first reload the base model from the Hugging Face Hub:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,  # Mistral, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    trust_remote_code=True,
    use_auth_token=True
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token





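Now load the QLoRA adapter on top of the base model. The cell below pulls the adapter from the Hub repository pushed above; `local_path` points at the final local checkpoint (`checkpoint-800`) as an alternative source: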

In [None]:
from peft import PeftModel

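# Final local checkpoint; pass this instead of hf_repo to load the adapter from disk.
local_path = f"{OUTPUT_DIR}/checkpoint-800"

# Load the adapter pushed to the Hub on top of the quantized base model.
ft_model = PeftModel.from_pretrained(base_model, hf_repo)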

JoshuaCAlpuerto/mistral-7b-v0.1-sharded-fellow-man



In [None]:
import torch
# Inference
def __inference(text, model, tokenizer):
  generation_config = model.generation_config
  generation_config.max_new_tokens = 100   # maximum number of new tokens to generate
  generation_config.do_sample = True # to set temperature and top_p
  generation_config.temperature = 0.1
  generation_config.top_p = 0.5
  generation_config.top_k = 5

  # generation_config.num_return_sequences = 1
  # generation_config.pad_token_id = tokenizer.eos_token_id
  # generation_config.eos_token_id = tokenizer.eos_token_id

  device = model.device
  # Tokenize
  encodings = tokenizer(
      text,
      return_tensors="pt",
  ).to(device)
  # Generate

  model.eval()
  with torch.no_grad():
    output = model.generate(
      input_ids=encodings.input_ids,
      attention_mask=encodings.attention_mask,
      generation_config=generation_config
      # do_sample=True,
      # temperature=0.1,
      # top_p=0.5,
      # max_new_tokens=100,
    )

  # The decoded text includes the prompt
  generated_text_with_prompt = tokenizer.decode(output[0], skip_special_tokens=True)

  # Return the full text (prompt + completion); the prompt is not stripped here
  return generated_text_with_prompt

In [None]:
country = "Estonia"
questions = [
    # "If I move to a new country, will my tax residency change automatically?",
    # "There is a time period marked on my short-term employment registration. What happens if I start working after the starting date specified?"

    # Check semantic question -> How long does it take to receive a residence permit card?
    "When will I receive my residence permit card?", # Able to answer from FAQ! :)

    # Check semantic question -> I'm having difficulty finding permanent accommodation. Can I register a hotel or Airbnb address?
    "Can I use airbnb as my accommodation?" # The answer to this question did not come from our FAQ. It might need query reformulation.
]

for question in questions:
  print('-- Response --')
  prompt = "### Question from {country}: {question}\n ### Answer: ".format(country=country, question=question)
  print(__inference(prompt, ft_model, tokenizer))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


-- Response --


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


### Question from Estonia: When will I receive my residence permit card?
 ### Answer:  By law, the processing of your application can take up to 2 months from submission, but the Police Board has the right to extend the deadline if necessary. Additional time for processing can take up to a month.
The card will be printed within 1 month from date of decision.
If you are applying on the basis of employment, the employment registration will be valid until the residence permit card is printed. If you are applying on the basis of EU long-term visa, Schengen
-- Response --
### Question from Estonia: Can I use airbnb as my accommodation?
 ### Answer:  Yes, as long as the property is registered as a short-term rental, it is allowed. It's best to check with the host to be sure.
If the place is not registered, the owner risks a fine, and you risk being fined as well as expelled from the country. So please be careful!
To check if a property has a short-term rental license, please visit www.addres

In [None]:
country = "Germany"
questions = [
    "Is it also possible to obtain a Blue Card without a higher education qualification but based on completed vocational training?",
    "From what point in time can I change employers?"
]

for question in questions:
  print('-- Response --')
  prompt = "### Question from {country}: {question}\n ### Answer: ".format(country=country, question=question)
  print(__inference(prompt, ft_model, tokenizer))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


-- Response --
prompt: ### Question from Germany: Is it also possible to obtain a Blue Card without a higher education qualification but based on completed vocational training?
 ### Answer: 


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 No, the Blue Card requires a higher education qualification. But holders of a German higher education qualification or a recognized or comparable foreign higher education qualification may apply for a residence title that allows skilled workers to seek employment. This allows the holder to stay in Germany for up to six months to seek employment that corresponds to their qualifications. This residence title does not entitle the holder to take up gainful employment, so the holder must have sufficient funds for the duration of their job search. If the search is successful, the residence title can be changed to a residence title for the purpose of employment. This allows the holder to work in a position that corresponds to their qualifications. This residence title can be issued for a duration of up to 21 months. If the employment contract has a set duration, the residence title can be issued for the duration of the contract. It can be extended for up to a further 14 months if the employm