<a href="https://www.kaggle.com/code/lucamassaron/gemma-2-2b-learns-how-to-tutor-in-ai-ml?scriptVersionId=224586360" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Artificial intelligence & machine learning Q&A Enhanced with Gemma 2 2b-it

We begin the Kaggle notebook by installing some Python packages using pip:

* The first line installs the bitsandbytes package from the Python Package Index (PyPI). The -q flag suppresses installation output, -U updates the package if it's already installed, and -i specifies the package index URL.

* Next, we install the trl library (Transformer Reinforcement Learning), a Hugging Face library that provides tools for training transformer-based models, from Supervised Fine-Tuning (SFT) and Reward Modeling (RM) to Proximal Policy Optimization (PPO). The -q and -U flags are used as before.

* Finally, the last line installs the wikipedia-api library, which provides a simple interface to interact with Wikipedia data. As with the other installations, -q suppresses output, and -U ensures the package is up-to-date.

In [1]:
!pip install -q -U -i https://pypi.org/simple/ bitsandbytes
!pip install -q -U trl
!pip install -q -U wikipedia-api



We then proceed by importing the os module and setting two environment variables:

* CUDA_VISIBLE_DEVICES: This variable controls which GPU(s) are visible to CUDA applications such as PyTorch. Setting it to 0 means that only the first GPU is used for computations.

* TOKENIZERS_PARALLELISM: This variable controls whether the Hugging Face tokenizers library parallelizes tokenization. Setting it to false runs tokenization in a single thread, which avoids the warnings (and potential deadlocks) that can occur when the process is forked after parallel tokenization has been used.

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

The following code snippet imports the warnings module and configures it to suppress all warnings. While these warnings typically don’t affect the fine-tuning process, they can be distracting and may cause unnecessary concern during training.

In [3]:
import warnings
warnings.filterwarnings("ignore")

Finally, we define a global variable to limit the training time when using this code as a live demo.

In [4]:
DEMO_MODE = False

The next cell contains all the main imports needed to run the notebook:

In [5]:
import re
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import wikipediaapi

import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer,
                          AutoConfig,
                          BitsAndBytesConfig, 
                          TrainingArguments,
                          DataCollatorForSeq2Seq
                          )

from datasets import Dataset
from peft import LoraConfig, PeftConfig
import bitsandbytes as bnb
from trl import SFTTrainer, SFTConfig

Before starting, we define the device to be used by the model, based on our resources.

# Step 1: get the knowledge base

Two small helper functions clean the text of tags and formatting. The extract_wikipedia_pages function then collects the references (pages or other categories) listed under a Wikipedia category, and the get_wikipedia_pages function uses it to crawl all the pages related to a set of initial Wikipedia categories.

The following cell defines two main functions (remove_braces_and_content and clean_string) and uses a pre-compiled regular expression pattern to improve efficiency when removing content between curly braces ({ }) from text.

In [6]:
# Pre-compile the regular expression pattern for better performance
BRACES_PATTERN = re.compile(r'\{.*?\}|\}')

def remove_braces_and_content(text):
    """Remove all occurrences of curly braces and their content from the given text"""
    return BRACES_PATTERN.sub('', text)

def clean_string(input_string):
    """Clean the input string."""
    
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    cleaned_string = ' '.join(input_string.split())
    
    # Collapse repeated carriage returns; after the whitespace normalization above
    # none should remain, so this step is only a safeguard
    cleaned_string = re.sub(r'\r+', '\r', cleaned_string)
    
    # Remove all occurrences of curly braces and their content from the cleaned string
    cleaned_string = remove_braces_and_content(cleaned_string)
    
    # Return the cleaned string
    return cleaned_string
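
As a quick illustration of what these helpers do, the sketch below repeats the two functions so that it can run on its own and applies them to a made-up string (the sample text is an assumption, not real Wikipedia content). Because braces are removed after whitespace has been normalized, a removed template can leave a double space behind.

```python
import re

# Same pattern as in the cell above: a lazy "{...}" match, or a stray closing brace
BRACES_PATTERN = re.compile(r'\{.*?\}|\}')

def remove_braces_and_content(text):
    return BRACES_PATTERN.sub('', text)

def clean_string(input_string):
    cleaned_string = ' '.join(input_string.split())
    cleaned_string = re.sub(r'\r+', '\r', cleaned_string)
    return remove_braces_and_content(cleaned_string)

# Hypothetical sample with irregular whitespace and a template-like fragment
sample = "Gradient   descent {{math|x}}\n is  iterative"
cleaned = clean_string(sample)
print(repr(cleaned))
```

If the leftover double space matters for a given use case, one option would be to run the whitespace normalization once more after removing the braces.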

This code defines a function, get_wikipedia_pages, which retrieves content from Wikipedia pages based on a list of input categories and organizes the retrieved text for further use. 

In [7]:
def get_wikipedia_pages(categories):
    """Retrieve Wikipedia pages from a list of categories and extract their content"""
    
    # Create a Wikipedia object
    wiki_wiki = wikipediaapi.Wikipedia('Gemma AI Assistant (gemma@example.com)', 'en')
    
    # Initialize lists to store explored categories and Wikipedia pages
    explored_categories = []
    wikipedia_pages = []

    # Iterate through each category
    print("- Processing Wikipedia categories:")
    for category_name in categories:
        print(f"\tExploring {category_name} on Wikipedia")
        
        # Get the Wikipedia page corresponding to the category
        category = wiki_wiki.page("Category:" + category_name)
        
        # Extract Wikipedia pages from the category and extend the list
        wikipedia_pages.extend(extract_wikipedia_pages(wiki_wiki, category_name))
        
        # Add the explored category to the list
        explored_categories.append(category_name)

    # Extract subcategories and remove duplicate categories
    categories_to_explore = [item.replace("Category:", "") for item in wikipedia_pages if "Category:" in item]
    wikipedia_pages = list(set([item for item in wikipedia_pages if "Category:" not in item]))
    
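    # Explore the subcategories found above (one level deep: categories discovered
    # here are recorded as explored but are not visited in turn)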
    while categories_to_explore:
        category_name = categories_to_explore.pop()
        print(f"\tExploring {category_name} on Wikipedia")
        
        # Extract more references from the subcategory
        more_refs = extract_wikipedia_pages(wiki_wiki, category_name)

        # Iterate through the references
        for ref in more_refs:
            # Check if the reference is a category
            if "Category:" in ref:
                new_category = ref.replace("Category:", "")
                # Add the new category to the explored categories list
                if new_category not in explored_categories:
                    explored_categories.append(new_category)
            else:
                # Add the reference to the Wikipedia pages list
                if ref not in wikipedia_pages:
                    wikipedia_pages.append(ref)

    # Initialize a list to store extracted texts
    extracted_texts = []
    
    # Iterate through each Wikipedia page
    print("- Processing Wikipedia pages:")
    for page_title in tqdm(wikipedia_pages):
        try:
            # Make a request to the Wikipedia page
            page = wiki_wiki.page(page_title)

            # Check if the page summary does not contain certain keywords
            if "Biden" not in page.summary and "Trump" not in page.summary:
                # Append the page title and summary to the extracted texts list
                if len(page.summary) > len(page.title):
                    extracted_texts.append(page.title + " : " + clean_string(page.summary))

                # Iterate through the sections in the page
                for section in page.sections:
                    # Append the page title and section text to the extracted texts list
                    if len(section.text) > len(page.title):
                        extracted_texts.append(page.title + " : " + clean_string(section.text))
                        
        except Exception as e:
            # Use page_title here: page may be undefined if the request itself failed
            print(f"Error processing page {page_title}: {e}")
                    
    # Return the extracted texts
    return extracted_texts

The next function takes a category name, checks whether the category exists and, if it does, collects and returns the titles of all its members (pages and subcategories) on Wikipedia. The get_wikipedia_pages function above relies on this list of titles.

In [8]:
def extract_wikipedia_pages(wiki_wiki, category_name):
    """Extract all references from a category on Wikipedia"""
    
    # Get the Wikipedia page corresponding to the provided category name
    category = wiki_wiki.page("Category:" + category_name)
    
    # Initialize an empty list to store page titles
    pages = []
    
    # Check if the category exists
    if category.exists():
        # Iterate through each article in the category and append its title to the list
        for article in category.categorymembers.values():
            pages.append(article.title)
    
    # Return the list of page titles
    return pages
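
The crawling logic can be sketched without any network access. The example below is a minimal breadth-first walk with an explicit depth limit, not the notebook's own implementation: the function name crawl_categories, the max_depth parameter, and the in-memory FAKE_WIKI dictionary are all hypothetical, and fake_members merely stands in for extract_wikipedia_pages.

```python
from collections import deque

def crawl_categories(get_members, root_categories, max_depth=1):
    """Breadth-first walk over categories, returning the set of page titles found."""
    pages = set()
    seen = set(root_categories)
    queue = deque((name, 0) for name in root_categories)
    while queue:
        category_name, depth = queue.popleft()
        for title in get_members(category_name):
            if title.startswith("Category:"):
                sub = title[len("Category:"):]
                # Only descend while within the depth budget, and never revisit
                if depth < max_depth and sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
            else:
                pages.add(title)
    return pages

# Hypothetical in-memory stand-in for extract_wikipedia_pages(wiki_wiki, name)
FAKE_WIKI = {
    "Machine_learning": ["Overfitting", "Category:Deep learning"],
    "Deep learning": ["Backpropagation", "Category:Transformers"],
    "Transformers": ["Attention"],
}

def fake_members(name):
    return FAKE_WIKI.get(name, [])

one_level = sorted(crawl_categories(fake_members, ["Machine_learning"], max_depth=1))
two_levels = sorted(crawl_categories(fake_members, ["Machine_learning"], max_depth=2))
print(one_level)
print(two_levels)
```

With max_depth=1 the walk visits the initial categories and their direct subcategories only, which keeps the crawl bounded; raising the limit pulls in more pages at the cost of more requests.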

To gather the information necessary to answer the trickiest questions about AI and machine learning, I’ve listed a few key categories, putting emphasis on generative AI topics.

In [9]:
if DEMO_MODE:
    categories = ["OpenAI", "Generative_artificial_intelligence", "Large_language_models"]
else:
    categories = ["Machine_learning", "Data_science", "Statistics", "Deep_learning", "Artificial_intelligence",
"Neural_network_architectures", "Large_language_models", "OpenAI", "Generative_pre-trained_transformers",
"Artificial_neural_networks", "Generative_artificial_intelligence", "Natural_language_processing"]

In [10]:
extracted_texts = get_wikipedia_pages(categories)
print("Found", len(extracted_texts), "Wikipedia pages")

- Processing Wikipedia categories:
	Exploring Machine_learning on Wikipedia
	Exploring Data_science on Wikipedia
	Exploring Statistics on Wikipedia
	Exploring Deep_learning on Wikipedia
	Exploring Artificial_intelligence on Wikipedia
	Exploring Neural_network_architectures on Wikipedia
	Exploring Large_language_models on Wikipedia
	Exploring OpenAI on Wikipedia
	Exploring Generative_pre-trained_transformers on Wikipedia
	Exploring Artificial_neural_networks on Wikipedia
	Exploring Generative_artificial_intelligence on Wikipedia
	Exploring Natural_language_processing on Wikipedia
	Exploring Tasks of natural language processing on Wikipedia
	Exploring Statistical natural language processing on Wikipedia
	Exploring Natural language processing researchers on Wikipedia
	Exploring Optical character recognition on Wikipedia
	Exploring Natural language processing software on Wikipedia
	Exploring Natural language generation on Wikipedia
	Exploring Machine translation on Wikipedia
	Exploring Fin

100%|██████████| 3801/3801 [04:10<00:00, 15.18it/s]

Found 17675 Wikipedia pages





In [11]:
extracted_texts[7]

'Explainable artificial intelligence : Despite ongoing endeavors to enhance the explainability of AI models, they persist with several inherent limitations.'

# Step 2: convert the knowledge base into a Q&A dataset

Now, having collected our knowledge base on AI, we need to leverage Gemma to convert it into something more useful for training a model. The idea is to use a Q&A approach.

First, let’s load Gemma 2 2b-it into memory, quantizing it to 4 bits using bitsandbytes.

In [12]:
model_name = "/kaggle/input/gemma-2/transformers/gemma-2-2b-it/2"
compute_dtype = getattr(torch, "float16") # we use float16 to save memory

# Efficient loading and quantization of large models
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization of model weights
    bnb_4bit_use_double_quant=False,       # whether double quantization is used
                                           # Double quantization also quantizes the 
                                           # quantization constants produced by the 
                                           # first quantization. This saves a little 
                                           # additional memory at the cost of extra 
                                           # dequantization work
    bnb_4bit_quant_type="nf4",             # type of quantization to use
                                           # nf4 stands for "normal float 4"
    bnb_4bit_compute_dtype=compute_dtype,  # sets the data type used during computation
)

# Loading a model configuration from a pretrained model identifier
config = AutoConfig.from_pretrained(model_name)
config.final_logit_softcapping = None  # Disable soft-capping
                                       # Soft-capping limits the range of the final 
                                       # logits to keep them bounded and smooth the 
                                       # predictions. Setting it to None disables 
                                       # this feature, which allows more extreme 
                                       # logits and hence higher predicted token 
                                       # probabilities

# Loading a language model specifically designed for causal (left-to-right) 
# language modeling tasks, such as text generation
model = AutoModelForCausalLM.from_pretrained(
    model_name,                     #
    device_map="auto",              # maps model layers to the available resources
    config=config,                  #
    attn_implementation="eager",    # specifies the attention computation strategy
    quantization_config=bnb_config, #
)

model.config.pretraining_tp = 1     # for setups where tensor parallelism is not needed 
                                    # or only a single device is available

max_seq_length = 2304  # Defines the maximum token sequence length
                       # Note that longer sequences require more memory and compute resources

# Loading a pretrained tokenizer corresponding to the model_name, 
# ensuring compatibility with the model’s input requirements
tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
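
To make the soft-capping comment more concrete, here is a minimal sketch of the operation in plain Python, assuming the usual scaled-tanh formulation, cap * tanh(logit / cap). The soft_cap helper, the sample logits, and the cap value of 30.0 are assumptions for illustration; the notebook itself only sets final_logit_softcapping to None.

```python
import math

def soft_cap(logits, cap):
    """Squash each logit into (-cap, cap) with a scaled tanh; cap=None disables it."""
    if cap is None:
        return list(logits)
    return [cap * math.tanh(x / cap) for x in logits]

raw_logits = [-100.0, 0.0, 5.0, 100.0]    # made-up values for illustration
capped = soft_cap(raw_logits, cap=30.0)   # 30.0 is an assumed cap value
uncapped = soft_cap(raw_logits, cap=None)
print([round(x, 2) for x in capped])
print(uncapped)
```

Small logits pass through almost unchanged, while large ones are pulled towards the cap, which is why disabling the cap allows more peaked token probabilities.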

A simple function wraps all the steps necessary to ask Gemma about a topic or pose a question. It accepts a sampling temperature and can either print the answer to stdout or return it as a string.

Using the whole knowledge base and asking multiple questions about the same text helps build the fine-tuning training data. Asking multiple questions is necessary because Gemma will pick just one topic from the text, and it tends to answer briefly.

In [13]:
def question_gemma(question, 
                   model=model, 
                   tokenizer=tokenizer,
                   device="cuda",
                   temperature=0.0,
                   max_new_tokens=2048,
                   return_answer=False):
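    # Tokenize the prompt and move the tensors to the target device
    inputs = tokenizer(question + "\n", return_tensors="pt").to(device)
    # Sample only when a positive temperature is requested; otherwise decode greedily
    do_sample = temperature > 0
    generation_kwargs = {"max_new_tokens": max_new_tokens, "do_sample": do_sample}
    if do_sample:
        # temperature is only meaningful (and only passed) when sampling
        generation_kwargs["temperature"] = temperature
    outputs = model.generate(**inputs, **generation_kwargs)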
    result = (str(tokenizer.decode(outputs[0]))
              .replace("<bos>", "") # removing the "beginning of sequence" token
              .replace("<eos>", "") # removing the "end of sequence" token
              .strip()
             )
    if return_answer:
        return result
    else:
        print(result)
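
The effect of the temperature argument can be illustrated with a small, self-contained sketch: the logits are divided by the temperature before the softmax, so lower values sharpen the distribution and higher values flatten it. The helper name and the example logits are hypothetical; this mirrors the standard technique rather than the library's internal code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature of 0 is not a valid divisor, which is why greedy decoding (always taking the most likely token) is the natural fallback in that case.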

We can control how Gemma returns the question and answer by asking it to return JSON in the form {"question": "...", "answer": "..."}. This makes it easy to retrieve the data from the output text with a regex.

In [14]:
qa_data = []

def extract_json(text, word):
    """Extract the value associated with a given key from a JSON-like string"""
    pattern = fr'"{word}": "(.*?)"'
    match = re.search(pattern, text)
    if match:
        return match.group(1)
    else:
        return ""

if DEMO_MODE:
    no_extracted_texts = 5 # increment this number up to len(extracted_texts)
    no_questions = 1 # increment this number to produce more questions (suggested: 5)
else:
    no_extracted_texts = 2_000
    no_questions = 1
    
for i in tqdm(range(len(extracted_texts[:no_extracted_texts]))):

    question_text = f"""Create a question and its answer from the following piece of information,
    put all the necessary information into the question (do not assume the reader knows the text),
    and return it exclusively in JSON format in the format {'{"question": "...", "answer": "..."}'}

    Here is the piece of information to elaborate:
    {extracted_texts[i]}

    OUTPUT JSON:
    """

    for j in range(no_questions):
    
        result = question_gemma(question_text, model=model, temperature=0.9, return_answer=True)
        result = result.split("OUTPUT JSON:")[-1]

        question = extract_json(result, "question")
        answer = extract_json(result, "answer")

        qa_data.append(f"{question}\n{answer}")

  0%|          | 0/2000 [00:00<?, ?it/s]The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.
100%|██████████| 2000/2000 [8:48:15<00:00, 15.85s/it]
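
The regex used by extract_json stops at the first double quote inside a value, so an answer containing an escaped quote would be cut short. As a possible alternative, and only as a sketch, the standard json module can decode the first JSON object in the generated text. The parse_qa function and the sample string below are hypothetical and not part of the original notebook.

```python
import json

def parse_qa(text):
    """Return (question, answer) from the first JSON object found in text."""
    start = text.find("{")
    if start == -1:
        return "", ""
    try:
        # raw_decode parses one JSON value starting at `start` and ignores what follows
        data, _ = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return "", ""
    if not isinstance(data, dict):
        return "", ""
    return str(data.get("question", "")), str(data.get("answer", ""))

# Made-up model output with escaped quotes and trailing tokens
sample = 'OUTPUT JSON:\n{"question": "What is \\"NF4\\"?", "answer": "A 4-bit data type."}\n<end_of_turn>'
print(parse_qa(sample))
print(parse_qa("no json here"))
```

Like the regex version, this returns empty strings when nothing usable is found, so the rest of the loop would not need to change.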


In [15]:
print(qa_data[0])

What is the probability matching strategy and how does it work? 
The probability matching strategy is a decision strategy where predictions of class membership are proportional to the class base rates. In other words, if the training set has a 60% positive and 40% negative class base rate, the observer using this strategy will predict 60% positive and 40% negative class labels for unlabeled examples. 


In [16]:
print(qa_data[1])

What is the significance of probability matching in behavioral decision-making?
According to a study by Shanks, Tunney, and McCarthy, probability matching is a theory that explains how people make decisions. This theory suggests that individuals tend to choose options based on their perceived probabilities of outcomes, not necessarily on actual outcomes. The theory suggests that people use a simplified heuristic to reason about probabilities, which can lead to systematic errors in decision-making.


Now that the dataset has been gathered, it is time to turn it into a Hugging Face Dataset.

In [17]:
train_data = (pd.DataFrame(qa_data, columns=["text"])
              .sample(frac=1, random_state=5)
              .drop_duplicates()
             )
train_data = Dataset.from_pandas(train_data)
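
Because extract_json returns an empty string when its pattern does not match, some entries of qa_data may lack a question, an answer, or both. A minimal sketch of a filter that could be applied before building the DataFrame follows; keep_complete_pairs and the sample list are hypothetical names and data, not part of the original pipeline.

```python
def keep_complete_pairs(qa_data):
    """Keep only entries where both the question and the answer are non-empty."""
    kept = []
    for entry in qa_data:
        # Entries have the form "question\nanswer"; split at the first newline
        question, _, answer = entry.partition("\n")
        if question.strip() and answer.strip():
            kept.append(entry)
    return kept

# Made-up entries: one complete pair, one empty pair, one question without an answer
sample_qa = ["What is LoRA?\nA low-rank adaptation method.", "\n", "Orphan question?\n"]
filtered = keep_complete_pairs(sample_qa)
print(filtered)
```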

# Step 3: fine-tune the Gemma model

In the following cells, LoRA is configured and the training parameters are defined. Afterward, the fine-tuning can start.

In [18]:
output_dir = "gemma_assistant"

# configuration object for LoRA
peft_config = LoraConfig(
    r=64,                    # the rank of the low-rank matrix
    lora_alpha=16,           # scaling factor applied to the low-rank matrices
    lora_dropout=0,          # dropout rate for LoRA added low-rank matrices 
    bias="none",             # how biases are handled in the adaptation layers
    task_type="CAUSAL_LM",   # the type of task for which the model is being fine-tuned
                             # Causal Language Modeling for autoregressive models 
                             # (like GPT), where the model generates text by predicting 
                             # the next token in a sequence
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
                             # layers related to projections (_proj) 
                             # and additional processing (gate_proj, up_proj, down_proj) 
                             # are targeted
)

# Settings for training a model
training_arguments = SFTConfig(
    output_dir=output_dir,          # Directory for model checkpoints, logs, and outputs
    num_train_epochs=1,             # Number of training passes through the dataset
    gradient_checkpointing=True,    # Enables gradient checkpointing to reduce memory
                                    # Gradient checkpointing saves memory by storing  
                                    # only a subset of activations, recomputing  
                                    # the rest during backpropagation.
    per_device_train_batch_size=1,  # Batch size for each device used in training
    gradient_accumulation_steps=8,  # Accumulates gradients over multiple steps 
                                    # before updating model weights
    optim="paged_adamw_8bit",       # 8-bit version of the AdamW optimizer
    save_steps=0,                   # Checkpoint saving frequency (0 = disabled)
    logging_steps=25,               # Frequency of logging training metrics
    learning_rate=5e-4,             # Initial learning rate for training
    weight_decay=0.001,             # L2 Penalty to non-zero weights to prevent overfitting
    fp16=True,                      # 16-bit floating-point precision
    bf16=False,                     # bfloat16 precision (alternative to fp16)
    max_grad_norm=0.3,              # Limits the maximum gradient norm for stability
    max_steps=-1,                   # Total number of training steps (-1 = use num_train_epochs)
    warmup_ratio=0.03,              # Warmup period for the learning rate
                                    # During the first 3% of training, the learning rate 
                                    # gradually increases, helping the model stabilize 
                                    # and prevent sudden large updates at the start
    group_by_length=False,          # Whether batches are grouped by sequence length
    evaluation_strategy='no',       # Evaluation frequency (no = no evaluation)
    lr_scheduler_type="cosine",     # Learning rate scheduler type
                                    # The "cosine" scheduler gradually reduces the 
                                    # learning rate following a cosine decay pattern
    report_to="none",               # Reporting destination for logging
    dataset_text_field="text",      #  Field in train_data containing the input text
    packing=False,                  # keeps each sequence independent in the dataset (no packing)
    max_seq_length=max_seq_length,
)

In [19]:
# Supervised Fine-Tuning (SFT)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
)

Converting train dataset to ChatML:   0%|          | 0/1998 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/1998 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/1998 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/1998 [00:00<?, ? examples/s]

In [20]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
25,20.1845
50,17.3651
75,16.9591
100,17.4077
125,17.047
150,16.304
175,15.8589
200,16.4289
225,16.3991


TrainOutput(global_step=249, training_loss=16.996801537203503, metrics={'train_runtime': 973.9319, 'train_samples_per_second': 2.051, 'train_steps_per_second': 0.256, 'total_flos': 1688150108150784.0, 'train_loss': 16.996801537203503})
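
The step count reported above can be reproduced with a little arithmetic, and the learning-rate schedule can be sketched in plain Python. The cosine_lr helper below is an illustrative re-implementation of linear warmup followed by cosine decay, not the library's own code, and rounding the warmup steps up is an assumption.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

num_examples = 1998                             # from the trainer progress bars
effective_batch = 1 * 8                         # per-device batch size * accumulation steps, one GPU
total_steps = num_examples // effective_batch   # one epoch of optimizer updates
warmup_steps = math.ceil(0.03 * total_steps)    # warmup_ratio=0.03; rounding up is an assumption
print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, 5e-4):.2e}")
```

The computed total matches the global_step value in the training output, which suggests that each optimizer update covers eight examples.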

After we finish, we simply save the model and try to reload it in order to check if everything works as expected.

In [21]:
trainer.save_model() # Saves the current state of the model to disk
tokenizer.save_pretrained(output_dir) # Saves the tokenizer to disk

('gemma_assistant/tokenizer_config.json',
 'gemma_assistant/special_tokens_map.json',
 'gemma_assistant/tokenizer.model',
 'gemma_assistant/added_tokens.json',
 'gemma_assistant/tokenizer.json')

In [22]:
from peft import AutoPeftModelForCausalLM

finetuned_model = output_dir
compute_dtype = getattr(torch, "float16")
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoPeftModelForCausalLM.from_pretrained(
     finetuned_model,
     torch_dtype=compute_dtype,
     return_dict=False,
     low_cpu_mem_usage=True, # Reduces CPU memory consumption during model loading
     device_map="auto",
)

model = model.to("cuda")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [23]:
!ls -l --block-size=MB ./gemma_assistant

total 372MB
-rw-r--r-- 1 root root   1MB Feb 26 19:37 adapter_config.json
-rw-r--r-- 1 root root 333MB Feb 26 19:37 adapter_model.safetensors
drwxr-xr-x 2 root root   1MB Feb 26 19:37 checkpoint-249
-rw-r--r-- 1 root root   1MB Feb 26 19:37 README.md
-rw-r--r-- 1 root root   1MB Feb 26 19:37 special_tokens_map.json
-rw-r--r-- 1 root root   1MB Feb 26 19:37 tokenizer_config.json
-rw-r--r-- 1 root root  35MB Feb 26 19:37 tokenizer.json
-rw-r--r-- 1 root root   5MB Feb 26 19:37 tokenizer.model
-rw-r--r-- 1 root root   1MB Feb 26 19:37 training_args.bin


In [24]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 2304, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2304, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear(
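
As a sanity check on the adapter size listed above, the number of LoRA parameters can be estimated from the layer shapes: each adapted linear layer adds r * (in_features + out_features) weights. In the sketch below only the q_proj shape comes from the printout; the other shapes are assumptions based on the Gemma 2 2b configuration, and lora_param_count is a hypothetical helper.

```python
def lora_param_count(rank, shapes, num_layers):
    """Parameters added by LoRA: rank * (in_features + out_features) per module."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

# (in_features, out_features); only q_proj is visible in the printout,
# the remaining shapes are assumed from the Gemma 2 2b configuration
shapes = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}
total = lora_param_count(rank=64, shapes=shapes, num_layers=26)
print(f"{total:,} LoRA parameters")
print(f"{total * 4 / 1e6:.0f} MB if stored as 32-bit floats")
```

Under these assumptions the estimate comes out at about 332 MB, close to the size of adapter_model.safetensors, which would be consistent with the adapter being saved in 32-bit precision. This is an inference from the file size, not something the notebook states.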

# Step 4: save the LoRA weights and merge them into Gemma

Now, the tricky part is combining the trained LoRA weights with the original Gemma model. The result is our new fine-tuned Gemma!

This cell cleans up the CPU and GPU memory.

In [25]:
import gc

del [model, tokenizer, peft_config, trainer, train_data, bnb_config, training_arguments]
del [TrainingArguments, SFTTrainer, LoraConfig, BitsAndBytesConfig]

for _ in range(10):
    torch.cuda.empty_cache()
    gc.collect()

Now we proceed to the merging procedure:


In [26]:
from peft import AutoPeftModelForCausalLM

finetuned_model = output_dir
compute_dtype = getattr(torch, "float16")
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoPeftModelForCausalLM.from_pretrained(
     finetuned_model,
     torch_dtype=compute_dtype,
     return_dict=False,
     low_cpu_mem_usage=True,
     device_map="auto",
)

merged_model = model.merge_and_unload()
merged_model.save_pretrained("./gemma_assistant_merged",
                             safe_serialization=True, 
                             max_shard_size="2GB")
tokenizer.save_pretrained("./gemma_assistant_merged")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

('./gemma_assistant_merged/tokenizer_config.json',
 './gemma_assistant_merged/special_tokens_map.json',
 './gemma_assistant_merged/tokenizer.model',
 './gemma_assistant_merged/added_tokens.json',
 './gemma_assistant_merged/tokenizer.json')
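
Conceptually, merging folds the low-rank update into each adapted weight matrix, W' = W + (lora_alpha / r) * B A, so that a single matrix multiplication reproduces the adapter's output. The toy example below checks this identity with tiny hand-written matrices; the helpers and values are illustrative assumptions (the toy rank is 1 for brevity) and do not reflect how merge_and_unload is implemented internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Element-wise a + scale * b for two matrices of the same shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy shapes: W is 2x3 (out x in), A is r x in, B is out x r, with r = 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]
B = [[2.0], [4.0]]
scaling = 16 / 64           # lora_alpha / r, reusing the ratio from the LoraConfig

x = [[1.0], [2.0], [3.0]]   # a column-vector input

# Adapter path: base output plus the scaled low-rank correction
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Merged path: fold the correction into the weights once, then do a single matmul
W_merged = add(W, matmul(B, A), scale=scaling)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint can be loaded like any regular model.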

In [27]:
!ls -l --block-size=MB ./gemma_assistant_merged

total 5268MB
-rw-r--r-- 1 root root    1MB Feb 26 19:37 config.json
-rw-r--r-- 1 root root    1MB Feb 26 19:37 generation_config.json
-rw-r--r-- 1 root root 1987MB Feb 26 19:37 model-00001-of-00003.safetensors
-rw-r--r-- 1 root root 1997MB Feb 26 19:37 model-00002-of-00003.safetensors
-rw-r--r-- 1 root root 1246MB Feb 26 19:38 model-00003-of-00003.safetensors
-rw-r--r-- 1 root root    1MB Feb 26 19:38 model.safetensors.index.json
-rw-r--r-- 1 root root    1MB Feb 26 19:38 special_tokens_map.json
-rw-r--r-- 1 root root    1MB Feb 26 19:38 tokenizer_config.json
-rw-r--r-- 1 root root   35MB Feb 26 19:38 tokenizer.json
-rw-r--r-- 1 root root    5MB Feb 26 19:38 tokenizer.model


Again, memory cleaning.

In [28]:
import gc

del [model, tokenizer, merged_model, AutoPeftModelForCausalLM]

for _ in range(10):
    torch.cuda.empty_cache()
    gc.collect()

In [29]:
for _ in range(10):
    torch.cuda.empty_cache()
    gc.collect()

The final step is reloading the fine-tuned model and trying it out!

In [30]:
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig)

model_name = "./gemma_assistant_merged"

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

max_seq_length = 1024
tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

We start with a series of DS and ML questions:

In [31]:
questions = ["In simple terms, what is a Large Language Model (LLM)?", 
             "What differentiates LLMs from traditional chatbots?", 
             "How are LLMs typically trained? (e.g., pre-training, fine-tuning)",
             "What are some of the typical applications of LLMs? (e.g., text generation, translation)", 
             "What is the role of transformers in LLM architecture?", 
             "Explain the concept of bias in LLM training data and its potential consequences.", 
             "How can prompt engineering be used to improve LLM outputs?", 
             "Describe some techniques for evaluating the performance of LLMs. (e.g., perplexity, BLEU score)", 
             "Discuss the limitations of LLMs, such as factual accuracy and reasoning abilities.", 
             "What are some ethical considerations surrounding the use of LLMs?", 
             "How do LLMs handle out-of-domain or nonsensical prompts?", 
             "Explain the concept of few-shot learning and its applications in fine-tuning LLMs.", 
             "What are the challenges associated with large-scale deployment of LLMs in real-world applications?", 
             "Discuss the role of LLMs in the broader field of artificial general intelligence (AGI).", 
             "How can the explainability and interpretability of LLM decisions be improved?", 
             "Compare and contrast LLM architectures, such as GPT-3 and LaMDA.", 
             "Explain the concept of self-attention and its role in LLM performance.", 
             "Discuss the ongoing research on mitigating bias in LLM training data and algorithms.", 
             "How can LLMs be leveraged to create more human-like conversations?", 
             "Explore the potential future applications of LLMs in various industries.", 
             "You are tasked with fine-tuning an LLM to write creative content. How would you approach this?", 
             "An LLM you’re working on starts generating offensive or factually incorrect outputs. How would you diagnose and address the issue?",
             "A client wants to use an LLM for customer service interactions. What are some critical considerations for this application?", 
             "How would you explain the concept of LLMs and their capabilities to a non-technical audience?", 
             "Imagine a future scenario where LLMs are widely integrated into daily life. What ethical concerns might arise?", 
             "Discuss some emerging trends in LLM research and development.", 
             "What are the potential societal implications of widespread LLM adoption?", 
             "How can we ensure the responsible development and deployment of LLMs?",]

In [32]:
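# Ask the model each question in turn and print a separator after each answer.
for i, question in enumerate(questions):
    print(f"QUESTION {i}")
    question_gemma(question, model=model, tokenizer=tokenizer)
    print("-" * 64)
    # In demo mode, stop after the first question to keep the run short.
    if DEMO_MODE:
        break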

QUESTION 0
In simple terms, what is a Large Language Model (LLM)?
A Large Language Model (LLM) is a type of artificial intelligence (AI) that can understand and generate human-like text. It is trained on a massive dataset of text and code, allowing it to perform various tasks such as writing different kinds of creative content, translating languages, and answering questions in an informative way. LLMs are becoming increasingly popular and are used in a variety of applications, including chatbots, virtual assistants, and content creation tools. They are also used in research and development, with researchers exploring their potential for various tasks, including scientific discovery and medical diagnosis. 
<end_of_turn>
----------------------------------------------------------------
QUESTION 1
What differentiates LLMs from traditional chatbots?
LLMs are not just chatbots. They are a new class of AI that can generate human-like text, code, and even music. They are trained on massive dat

Our final test asks the model for help in understanding the Gemma 2 paper:

Gemma Team. "Gemma 2: Improving open language models at a practical size." arXiv preprint arXiv:2408.00118 (2024). https://arxiv.org/pdf/2408.00118

In [33]:
arch = """
Similar to previous Gemma models (Gemma Team, 2024), the Gemma 2 models are based on a
decoder-only transformer architecture (Vaswani et al., 2017). 
A few architectural elements are similar to the first version of Gemma models; namely, a context
length of 8192 tokens, the use of Rotary Position Embeddings (RoPE) (Su et al., 2021), and
the approximated GeGLU non-linearity (Shazeer, 2020). A few elements differ between Gemma 1
and Gemma 2, including using deeper networks. We summarize the key differences below.
Local Sliding Window and Global Attention.
We alternate between a local sliding window attention (Beltagy et al., 2020a,b) 
and global attention (Luong et al., 2015) in every other layer.
The sliding window size of local attention layers is set to 4096 tokens, while the span of the 
global attention layers is set to 8192 tokens.
Logit soft-capping. We cap logits (Bello et al., 2016) in each attention layer and the final layer
such that the value of the logits stays between −soft_cap and +soft_cap. More specifically, we
cap the logits with the following function: logits ← soft_cap ∗ tanh(logits/soft_cap).
We set the soft_cap parameter to 50.0 for the self-attention layers and to 30.0 for the final layer.
Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm (Zhang and
Sennrich, 2019) to normalize the input and output of each transformer sub-layer, the attention
layer, and the feedforward layer. 
Grouped-Query Attention (Ainslie et al., 2023).
We use GQA with num_groups = 2, based on ablations showing increased speed at inference time
while maintaining downstream performance.
"""

prompt = """You are acting as a valuable study assistant for AI/ML topics. 
Please explain the following technical excerpt, providing information on the mentioned technical topics. 
When finished explaining each topic, just stop."""
prompt += arch

question_gemma(prompt, model=model, tokenizer=tokenizer)

You are acting as a valuable study assistant for AI/ML topics. 
Please explain the following technical excerpt, providing information on the mentioned technical topics. 
When finished explaining each topic, just stop.
Similar to previous Gemma models (Gemma Team, 2024), the Gemma 2 models are based on a
decoder-only transformer architecture (Vaswani et al., 2017). 
A few architectural elements are similar to the first version of Gemma models; namely, a context
length of 8192 tokens, the use of Rotary Position Embeddings (RoPE) (Su et al., 2021), and
the approximated GeGLU non-linearity (Shazeer, 2020). A few elements differ between Gemma 1
and Gemma 2, including using deeper networks. We summarize the key differences below.
Local Sliding Window and Global Attention.
We alternate between a local sliding window attention (Beltagy et al., 2020a,b) 
and global attention (Luong et al., 2015) in every other layer.
The sliding window size of local attention layers is set to 4096 tokens, while
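
To make the logit soft-capping formula quoted in the excerpt more concrete, here is a minimal sketch in plain Python. It only illustrates the function logits ← soft_cap ∗ tanh(logits/soft_cap) with the two cap values mentioned in the excerpt (50.0 and 30.0). The function name `soft_cap`, the constants, and the sample logits are illustrative assumptions and are not taken from the Gemma codebase.

```python
import math


def soft_cap(logits, cap):
    """Squash each logit smoothly into the interval (-cap, +cap).

    Illustrative helper (hypothetical name), implementing
    logits <- cap * tanh(logits / cap) from the excerpt.
    """
    return [cap * math.tanh(x / cap) for x in logits]


# Cap values quoted in the excerpt; the constant names are assumptions.
ATTN_SOFT_CAP = 50.0   # self-attention layers
FINAL_SOFT_CAP = 30.0  # final layer

# Made-up logits chosen to show both small and extreme values.
raw_logits = [-200.0, -5.0, 0.0, 5.0, 200.0]

attn_capped = soft_cap(raw_logits, ATTN_SOFT_CAP)
final_capped = soft_cap(raw_logits, FINAL_SOFT_CAP)

for raw, attn, final in zip(raw_logits, attn_capped, final_capped):
    print(f"{raw:8.1f} -> attention cap: {attn:8.3f} | final cap: {final:8.3f}")
```

Small logits pass through almost unchanged because tanh(x) ≈ x near zero, while extreme values are compressed toward the cap instead of being clipped abruptly, so the function stays smooth and preserves the ordering of the logits.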

We conclude the tutorial here. By following the same steps, you can fine-tune Gemma for any topic.

Enjoy fine-tuning with Google Gemma!