## Introduction

## Gemma for Beginners with Hugging Face
In this notebook, we'll learn the basics of using the Gemma model with the tools provided by Hugging Face. It focuses on the simplest workflow, without any complex processing. The practical exercise is to fine-tune a Large Language Model (LLM), Gemma, to answer questions about ***Basic Concepts of Data Science*** using the Hugging Face libraries.

## Dataset Used
* [1000-Data-Science-Concepts](https://www.kaggle.com/datasets/hserdaraltan/1000-data-science-concepts) : This dataset covers more than 1000 common data science concepts across Statistics, Machine Learning, and Artificial Intelligence. It has two columns: `Question`, which holds the questions or instructions, and `Answer`, which holds the responses to them.

## 1. What is Gemma?

**Gemma** is a family of lightweight, open-weight generative AI (GenAI) models developed by Google DeepMind. These models are primarily aimed at developers and researchers. Gemma is built from the same research and technology as Gemini, Google's closed-source family of generative AI models.

At the time of writing, there are two main models in the Gemma collection: **Gemma 2B** and **Gemma 7B**. These are text-to-text, decoder-only large language models (LLMs), each available in pretrained and instruction-tuned variants. Gemma 2B has a neural network with roughly 2 billion parameters, while Gemma 7B has roughly 7 billion.

Google offers pretrained and instruction-tuned Gemma models that are small enough to run on laptops and workstations. These models are available to developers through various platforms, including Kaggle and Hugging Face. Meta's Llama 2 is another openly available model family in a similar size range. Which of the two suits a given task better depends on the use case and is best checked empirically.

### Inputs and Outputs
* Input: Gemma models take in text strings, which can range from questions and prompts to longer documents that require summarization.
* Output: In response, they generate text in English, offering answers, summaries, or other forms of text-based output, tailored to the input provided.

<div style="text-align:center;">
    <img src="https://www.kaggle.com/competitions/64148/images/header" alt="Gemma Model" style="width:50%;"/>
</div>


## 2. Package Installation and Importing Libraries

* `huggingface_hub`: This library provides access to models, datasets, and other resources shared by the Hugging Face community.

* `transformers`: Formerly known as `pytorch-transformers` or `pytorch-pretrained-bert`, this library is developed by Hugging Face. It provides state-of-the-art pre-trained models for natural language understanding (NLU) and natural language generation (NLG) tasks.

* `accelerate`: Accelerate is a library developed by Hugging Face that simplifies distributed training for deep learning models and provides an easy-to-use interface for distributed computing frameworks.

* `bitsandbytes`: This library provides k-bit (8-bit and 4-bit) quantization for PyTorch models, along with 8-bit optimizers. It is what allows Gemma to be loaded in 4-bit precision later in this notebook.

* `trl`: TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training language models. It provides trainers for supervised fine-tuning (the `SFTTrainer` used here) as well as preference-based methods such as DPO.

* `peft`: PEFT (Parameter-Efficient Fine-Tuning) is a Hugging Face library for adapting large pre-trained models by training only a small number of additional parameters, for example with LoRA.

In [1]:
!pip install -q -U huggingface_hub
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U BitsAndBytes
%pip install -q trl
%pip install -q peft

Note: you may need to restart the kernel to use updated packages.


**General-purpose modules**
* `os`: Provides ways to interact with the operating system and its environment variables.
* `torch`: PyTorch library for deep learning applications.
* `pandas`: Powerful data processing tool, ideal for handling CSV files and other forms of structured data.
* `re` : Provides support for working with regular expressions, enabling powerful pattern-based string operations.
* `matplotlib.pyplot`: Plotting library, used here to lay out the word clouds.
* `warnings`: Used to silence warning messages in the notebook output.

**Transformers module**
* `AutoTokenizer`: Used to automatically load a pre-trained tokenizer.
* `AutoModelForCausalLM`: Used to automatically load pre-trained models for causal language modeling.
* `BitsAndBytesConfig`: Configuration class that describes how the model should be quantized with bitsandbytes (for example, 4-bit NF4).
* `AutoConfig`: Used to automatically load the model's configuration.
* `TrainingArguments`: Defines arguments for training setup.

**Wordcloud module**
* `WordCloud` : Python library used for generating word clouds, which are visual representations of text data where the size of each word indicates its frequency or importance.
* `STOPWORDS` : set of commonly used words that are often excluded from text analysis because they typically do not carry significant meaning or contribute to the understanding of the text. 

**Datasets module**
* `Dataset`: A class for handling datasets.

**Peft module**
* `LoraConfig` : A configuration class for LoRA (Low-Rank Adaptation) adapters.
* `PeftModel`: The class that represents a base model wrapped with PEFT adapters.
* `prepare_model_for_kbit_training` : A function that prepares a quantized model for k-bit training.
* `get_peft_model` : A function that wraps a base model with the adapters described by a PEFT config.

**trl module**
* `SFTTrainer`: Trainer class for SFT (Supervised Fine-Tuning) training.

**IPython.display module**
* `Markdown` : Used to output text in Markdown format.
* `display` : Used to display objects in Jupyter notebooks.

In [2]:
import torch
import os
import pandas as pd
import re
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig, TrainingArguments, pipeline
from wordcloud import WordCloud, STOPWORDS
from datasets import Dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from IPython.display import Markdown as md, display  # display is used in the test cells below
import warnings
warnings.filterwarnings('ignore')



In [3]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

access_token=UserSecretsClient().get_secret('HUGGING_FACE')
login(token=access_token)

BackendError: Unexpected response from the service. Response: {'errors': ['No user secrets exist for kernel id 55587985 and label HUGGING_FACE.'], 'error': {'code': 5, 'details': []}, 'wasSuccessful': False}.

#### Check if CUDA is available

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


## 3. Loading Gemma Model

* `BitsAndBytesConfig` : The `quantization_config` object defines how the model is quantized with the bitsandbytes library. The key arguments are:

    * **load_in_4bit (bool, optional)**: Enables 4-bit quantization of the linear layers, cutting the memory needed for those weights to roughly a quarter of what 16-bit weights require.
    * **bnb_4bit_use_double_quant (bool, optional)**: Enables nested quantization, in which the quantization constants themselves are quantized to save a little more memory.
    * **bnb_4bit_quant_type (str, optional)**: Specifies the 4-bit data type. Here, it's set to "nf4" (4-bit NormalFloat), one of the two formats bitsandbytes supports (the other is "fp4").
    * **bnb_4bit_compute_dtype (torch.dtype, optional)**: Defines the data type used for computation. Weights are stored in 4 bits but dequantized to this type for the matrix multiplications. Here, it's set to torch.bfloat16, a lower-precision format that can improve speed on compatible hardware.


* Loading the tokenizer and the quantized model:

    * **AutoTokenizer**: `AutoTokenizer.from_pretrained` loads the tokenizer that belongs to the pre-trained model at the specified path. The tokenizer is not affected by quantization, so it takes no `quantization_config`.

    * **AutoModelForCausalLM**: `AutoModelForCausalLM.from_pretrained` loads the LLM itself from the same path. The `quantization_config` argument ensures the weights are loaded in 4-bit precision, and `low_cpu_mem_usage=True` keeps peak CPU memory during loading to roughly one copy of the model.

Overall, this code snippet aims to achieve two goals:

* **Load a pre-trained LLM**: It retrieves a pre-trained causal language model from the specified path.
* **Enable Quantization for Efficiency**: By passing a `BitsAndBytesConfig` at load time, the code loads the model with 4-bit quantized weights, which substantially reduces GPU memory use.
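
The memory saving can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes a nominal 2 billion weights (the "2B" in the model name), counts weights alone, and ignores quantization constants and activations. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3

NOMINAL_PARAMS = 2_000_000_000  # assumption: nominal size taken from the "2B" name

mem_16bit = weight_memory_gib(NOMINAL_PARAMS, 16)
mem_4bit = weight_memory_gib(NOMINAL_PARAMS, 4)

print(f"16-bit weights: {mem_16bit:.2f} GiB")
print(f" 4-bit weights: {mem_4bit:.2f} GiB")
print(f"Reduction factor: {mem_16bit / mem_4bit:.0f}x")
```

In practice the saving is somewhat smaller than this estimate, because only the linear layers are stored in 4 bits. The model printout below shows `Linear4bit` projections next to a regular `Embedding` layer.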

In [5]:
%%time
tokenizer= AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/2b-it/3")
quantization_config=BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_use_double_quant=True,
                    bnb_4bit_quant_type='nf4',
                    bnb_4bit_compute_dtype=torch.bfloat16,)
model = AutoModelForCausalLM.from_pretrained("/kaggle/input/gemma/transformers/2b-it/3",quantization_config=quantization_config,low_cpu_mem_usage=True)
print(model)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
    

## 4. Q & A Using Gemma

* The code generates an answer to a short prompt with the pre-trained Gemma model and measures how much additional GPU memory is allocated while doing so.
* It tokenizes the input text, moves the tensors to the GPU, and calls `model.generate`.
* The generated tokens are then decoded and printed.
* GPU memory is read with `torch.cuda.memory_allocated()` before and after generation, and the difference is reported in MB.
* The output length is limited to 256 new tokens (`max_new_tokens=256`).

In [6]:
import torch

input_text = 'Answer common questions about the Python programming language.'
input_ids = tokenizer(input_text, return_tensors='pt').to("cuda")

# Function to get memory usage on CUDA device
def get_cuda_memory_usage():
    return torch.cuda.memory_allocated() / 1024 / 1024  # Memory usage in MB

# Memory usage before execution
memory_before = get_cuda_memory_usage()

# Execution
outputs = model.generate(**input_ids, max_new_tokens=256)

# Memory usage after execution
memory_after = get_cuda_memory_usage()

memory_used = memory_after - memory_before

print(tokenizer.decode(outputs[0]))
print('')
print("Memory used:", memory_used, "MB")


<bos>Answer common questions about the Python programming language.

**1. What is Python?**

* Python is a high-level, interpreted programming language.
* It is known for its clear and concise syntax, making it easier to learn and use than other programming languages.
* Python is widely used for various purposes, including data science, machine learning, web development, and scripting.

**2. What are the key features of Python?**

* **Dynamic typing:** Python does not require you to explicitly declare the data type of variables.
* **Indentation:** Python uses indentation to define blocks of code, making it clear and readable.
* **Modules:** Python has a vast collection of modules that extend the functionality of the language.
* **Concurrency:** Python supports multithreading, allowing multiple tasks to run concurrently.
* **Regular expressions:** Python provides powerful regular expression capabilities for text manipulation.

**3. What are the different types of data in Python?**

* **

## 5. Load Dataset

Loading your data is the first step in the machine learning pipeline. This section will guide you through loading your dataset into the Jupyter notebook environment.

To add the dataset to your Kaggle notebook, follow these simple steps:
1. Look for the "Input" option located below the "Notebook" section in the right-side menu.
2. Click on the "+ Add Input" button.
3. In the search bar that appears, type "1000+-data-science-concepts".
4. Find the dataset in the search results and click the "+" button to add it to your notebook. This action will automatically download the dataset for you.

In [None]:
data=pd.read_csv('/kaggle/input/1000-data-science-concepts/data_science_concepts.csv')
dataset= Dataset.from_pandas(data)

#### Dataset Information and Null Value Check
* Check the number of features and values in dataset.
* Check for any NULL values in the dataset

In [None]:
print("Information of Dataset: ")
data.info()  # info() prints its report directly and returns None
print()

print("Check for NULL values: ")
print(data.isnull().sum().sum())

## 6. Visualize Data
This code generates word clouds for each column in a DataFrame (`data`). Here's a step-by-step explanation:

1. **Initialization**: Initialize an empty string `comment_words` that will hold the concatenated text of the column being processed, and define a set of stopwords using the `STOPWORDS` set from the `wordcloud` library.

2. **Subplots Creation**: Create subplots with a single row and a number of columns equal to the number of columns in the dataset. Set the figure size to (10, 6) to control the overall size of the plot.

3. **Colormap Definition**: Define the colormap to be used for generating the WordCloud images. In this case, the 'viridis' colormap is chosen.

4. **Iteration through Columns**: Iterate through each column in the dataset, extracting the column name and its corresponding values. Concatenate all values in the column into a single string, converting them to uppercase.

5. **WordCloud Generation**: Generate a WordCloud for each column using the concatenated string of values. Customize the WordCloud's width, height, stopwords, minimum font size, and colormap. Plot each WordCloud image on the respective subplot, setting the title of each subplot to indicate the column it represents. Finally, adjust the layout and display the plot.

Overall, this code visualizes the distribution of words in each column of the DataFrame by creating word clouds, providing insights into the most frequent words or terms within each column.
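
A word cloud is essentially a picture of word frequencies. As a rough, text-only sketch of what it encodes, the snippet below counts words with the standard library. The sample sentences and the tiny stopword set are made up for illustration; the `wordcloud` library uses its own, larger `STOPWORDS` set and its own tokenization rules.

```python
import re
from collections import Counter

# Hypothetical sample rows standing in for one column of the dataset
sample_rows = [
    "What is a decision tree?",
    "What is overfitting in a decision tree model?",
    "Explain the bias and variance of a model.",
]

# Tiny illustrative stopword set (not the one shipped with wordcloud)
stopwords = {"what", "is", "a", "in", "the", "and", "of", "explain"}

def word_frequencies(rows, stopwords):
    """Count non-stopword words across all rows, case-insensitively."""
    text = " ".join(rows).upper()
    words = re.findall(r"[A-Z]+", text)
    return Counter(w for w in words if w.lower() not in stopwords)

freqs = word_frequencies(sample_rows, stopwords)
print(freqs.most_common(3))
```

Words with higher counts would be drawn larger in the corresponding word cloud.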

In [None]:
comment_words = ''
stopwords = set(STOPWORDS)

# Create subplots
fig, axs = plt.subplots(1, len(data.columns), figsize=(10, 6))

# Define the colormap
colormap = 'viridis'

# Iterate through the columns of the DataFrame
for i, col in enumerate(data.columns):
    # Concatenate all values in the current column into a single string
    # and convert to uppercase (assigned with =, not +=, so that each
    # cloud reflects only its own column)
    comment_words = ' '.join(str(val).upper() for val in data[col]) + ' '

    # Generate WordCloud for the current column
    wordcloud = WordCloud(width=500, height=500,
                          stopwords=stopwords,
                          min_font_size=8,
                          colormap=colormap).generate(comment_words)

    # Plot the WordCloud image
    axs[i].imshow(wordcloud, interpolation='bilinear')
    axs[i].axis("off")
    axs[i].set_title(f"Word Cloud for {col}")

plt.tight_layout()
plt.show()


## 7. Defining Functions
This Python code defines two functions:

1. **generate_DS_answers** : This function generates answers to a given prompt related to Data Science. It first constructs a prompt template using a provided context. Then, it creates a message containing the prompt using the tokenizer's apply_chat_template method. Next, it encodes the prompt into tokens, sends it to the GPU for processing, generates a response using the model, and decodes the output tokens into text. Finally, it returns the generated response.

2. **extract_content** : This function extracts the content from a given text that comes after the marker `<start_of_turn>model`. It searches for this marker in the text and returns the content that follows it. If the marker is not found, it returns a message indicating that the content was not found.

These functions work together to generate responses to prompts related to Data Science and extract the relevant content from generated text.
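
To make the marker-based extraction concrete, the sketch below applies the same idea to a hand-written string laid out in Gemma's chat format. The sample text is invented, and the helper `extract_model_turn` is a hypothetical variant that additionally strips the trailing `<end_of_turn>` and `<eos>` tokens, which the notebook's own function leaves in place.

```python
MODEL_MARKER = "<start_of_turn>model"

def extract_model_turn(text: str) -> str:
    """Return the text after the model-turn marker, without trailing control tokens."""
    index = text.find(MODEL_MARKER)
    if index == -1:
        return f"Content not found after '{MODEL_MARKER}'"
    content = text[index + len(MODEL_MARKER):]
    # Drop the end-of-turn / end-of-sequence tokens if generation emitted them
    for token in ("<eos>", "<end_of_turn>"):
        content = content.replace(token, "")
    return content.strip()

# Invented example of a decoded generation in Gemma's chat layout
decoded = (
    "<bos><start_of_turn>user\n"
    "Question: What is overfitting?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "Overfitting happens when a model memorizes noise in the training data."
    "<end_of_turn><eos>"
)

print(extract_model_turn(decoded))
```

Because the user turn comes first, searching for the model marker skips the echoed prompt and keeps only the generated answer.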

In [None]:
max_new_tokens=500  # roughly 375 words, at about 75 words per 100 tokens

def generate_DS_answers(context):
    prompt_template=f"""Provide the Answer for the following Question in 300 words.
    Provide only useful information:
    Context:{context}
    
    Output: """
    
    messages=[
        {"role": "user","content": prompt_template},
        ]
    prompt= tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
    # The chat template already prepends <bos>, so special tokens are not added again
    input_ids=tokenizer.encode(prompt,add_special_tokens=False,return_tensors='pt').to("cuda")
    
    outputs=model.generate(input_ids,max_new_tokens=max_new_tokens)

    # Special tokens are kept so that extract_content can find the model turn
    response=tokenizer.decode(outputs[0])
    
    return response

def extract_content(text):
    index=text.find('<start_of_turn>model')
  
    if index!=-1:
        content_after_model=text[index+len('<start_of_turn>model'):].strip()
    else:
        return "Content not found after '<start_of_turn>model'"
    return content_after_model


In [None]:
context="Question: "+dataset['Question'][0]
print(context)

#### Generate Response for a given input from Dataset

In [None]:
Answers=generate_DS_answers(context)
print(Answers)

#### Check for Relevant Information

In [None]:
Answers=extract_content(Answers)
md(Answers)

## 8. Test Model before Fine Tuning

Before we start the fine-tuning process, let's see how the Gemma model performs out of the box on our dataset. This section shows how to run a simple question-answering test on the first three questions.

In [None]:
for i in range(3):
    context = "### Question: " + dataset['Question'][i]
    display(md(context))
    display(md("### Answer: "))
    display(md(extract_content(generate_DS_answers(context))))
    print('\n\n')

## 9. Fine Tune Gemma

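* Define a formatting function that turns each dataset row into a single training text (the instruction followed by the response).

* Weights & Biases (W&B, `wandb`) is a useful tool for experiment tracking in machine learning. It is disabled here because experiment tracking isn't needed; you may also want to disable it while debugging.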
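
As a small illustration of the formatting step, the sketch below applies the same instruction/response template used by the formatting function in the next cell to one hand-written row. The row contents are invented; only the template and the `Question`/`Answer` column names come from this notebook.

```python
TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{response}"

def format_example(row: dict) -> str:
    """Turn one dataset row into a single training text."""
    return TEMPLATE.format(instruction=row["Question"], response=row["Answer"])

# Invented row with the same columns as the dataset
row = {
    "Question": "What is a p-value?",
    "Answer": "The probability of seeing data at least this extreme if the null hypothesis is true.",
}

text = format_example(row)
print(text)
```

Each training example therefore becomes one block of text in which the question and its reference answer are clearly separated.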


In [None]:
def formatting_func(example):
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    line = template.format(instruction=example['Question'], response=example['Answer'])
    return [line]
os.environ["WANDB_DISABLED"] = "true"

### LoRA - Low-Rank Adaptation

**LoRA** is a technique used to fine-tune large language models (LLMs) more efficiently. It freezes the pre-trained weights and trains small low-rank matrices that are added to selected layers, so you can adapt a model to a new task with far less memory and compute than full fine-tuning.

LoraConfig Parameters:

* `r (int)`: The rank of the low-rank decomposition used in LoRA. A lower value of r means fewer trainable parameters and less memory, but might lead to slightly lower accuracy. PEFT's default is 8, which is also the value set in our code.

* `target_modules (List[str])`: The names of the modules inside each Transformer layer to which LoRA is applied.
    * q_proj: Query projection (attention)
    * o_proj: Output projection (attention)
    * k_proj: Key projection (attention)
    * v_proj: Value projection (attention)
    * gate_proj: Gate projection (feed-forward MLP)
    * up_proj: Up projection, which expands the hidden size from 2048 to 16384 (feed-forward MLP)
    * down_proj: Down projection, which maps the 16384 features back to 2048 (feed-forward MLP)
    
By applying LoRA to these projection layers, the model can learn task-specific adaptations while the original model weights stay frozen; only the small adapter matrices are trained.

* `task_type (str, optional)`: The type of task the model is being fine-tuned for. Here, it's set to "CAUSAL_LM" (causal language modeling), which tells PEFT how to wrap the model.
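
The size of the resulting adapters can be worked out by hand. For a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA adapter adds two matrices with `r * d_in` and `d_out * r` entries. The sketch below sums this over the target modules, using the layer shapes from the model printout in Section 3 (18 decoder layers). It is a back-of-the-envelope estimate, not a replacement for PEFT's own `print_trainable_parameters()`.

```python
R = 8            # LoRA rank, as in lora_config
NUM_LAYERS = 18  # decoder layers, from the model printout

# (in_features, out_features) of each targeted module, from the model printout
MODULE_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"Trainable LoRA parameters per layer: {per_layer:,}")
print(f"Trainable LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters amount to just under 10 million trainable parameters, well under 1% of the model's nominal 2 billion weights.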

In [None]:
lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

### SFTTrainer for Supervised Fine-Tuning

This code snippet configures and initializes an SFTTrainer for fine-tuning a pre-trained model with LoRA for memory efficiency. The training hyperparameters are set within the TrainingArguments object.

SFTTrainer is used to fine-tune a pre-trained model (model) on a specific training dataset (dataset). It's designed for tasks where you have labeled data and want to adapt the model for a new purpose.

**Key Parameters** :

* `model (PreTrainedModel)`: This argument specifies the pre-trained model you want to fine-tune.

* `train_dataset (Dataset)`: This argument points to the training dataset you'll use for fine-tuning. The dataset should be formatted appropriately for the task.

* `max_seq_length (int)`: The maximum length, in tokens, of a training sequence. Longer examples are truncated to this length.

* `args (TrainingArguments)`: This argument is an instance of TrainingArguments that defines various hyperparameters for the training process. Here are some notable arguments within args:

    * **per_device_train_batch_size (int)** : Sets the batch size per device (GPU/TPU) during training.
    * **gradient_accumulation_steps (int)** : This parameter allows accumulating gradients over several batches before updating the model weights. With a per-device batch size of 1 and 4 accumulation steps, the effective batch size is 4.
    * **warmup_steps (int)** : This defines the number of warmup steps where the learning rate is gradually increased from 0 to its full value.
    * **max_steps (int)** : This parameter specifies the total number of training steps.
    * **learning_rate (float)** : This sets the learning rate for the optimizer. Here, it's set to 2e-4, which is a common starting point for fine-tuning.
    * **fp16 (bool)** : Enables training using 16-bit floating-point precision (mixed precision) for faster training with minimal accuracy loss (if supported by your hardware).
    * **logging_steps (int)** : Defines how often training metrics are logged during training.
    * **output_dir (str)** : Specifies the directory where training outputs (model checkpoints, logs, etc.) will be saved.
    * **optim (str)** : Defines the optimizer used for training. Here, it's set to "paged_adamw_8bit", a bitsandbytes variant of AdamW that keeps the optimizer state in 8 bits and pages it between GPU and CPU memory to avoid out-of-memory spikes.

* `peft_config (LoraConfig)`: Provides the configuration for LoRA (Low-Rank Adaptation), which helps fine-tune the model more efficiently.

* `formatting_func (Callable)`: This argument specifies a custom function for formatting the training data before feeding it to the model. This allows for specific pre-processing steps tailored to your task.
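
Two of these settings are easy to sanity-check by hand: the effective batch size and the learning-rate schedule. The sketch below assumes the default linear scheduler of `TrainingArguments` (linear warmup from 0, then linear decay to 0 at `max_steps`) and a single GPU. The function name `linear_lr` is hypothetical, and the code only mirrors the shape of that schedule; it does not call the library.

```python
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 4
WARMUP_STEPS = 2
MAX_STEPS = 100
PEAK_LR = 2e-4

def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp up linearly from 0 to the peak learning rate
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Decay: fall linearly from the peak to 0 at MAX_STEPS
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # assumes one GPU
examples_seen = effective_batch * MAX_STEPS

print(f"Effective batch size: {effective_batch}")
print(f"Training examples processed: {examples_seen}")
for step in (0, 1, 2, 51, 100):
    print(f"step {step:>3}: lr = {linear_lr(step):.6f}")
```

With a dataset of roughly 1,000 rows, 100 steps at an effective batch size of 4 cover less than one full pass over the data, which is worth keeping in mind when judging the results.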

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

In [None]:
trainer.train()

## 10. Test the Fine-Tuned Model

After training, let's see how the Gemma model's answers have changed. We'll rerun the question-answering test and compare the results to the answers produced before fine-tuning.

In [None]:
context = "### Question: " + "What are the Basic Concepts of Data Science?"
display(md(context))
display(md("### Answer: "))
display(md(extract_content(generate_DS_answers(context))))
print('\n')

In [None]:
for i in range(3):
    context = "### Question: " + dataset['Question'][i]
    display(md(context))
    display(md("### Answer: "))
    display(md(extract_content(generate_DS_answers(context))))
    print('\n')

After fine-tuning, the model's answers should follow the style and content of the dataset more closely than the out-of-the-box answers from Section 8. With only 100 training steps the change may be modest, so compare the two sets of answers side by side rather than assuming an improvement.
Fine-tuning in this way can also expose Gemma to topics it previously handled poorly.

## Conclusion

In this beginner-friendly notebook, we've outlined the process of fine-tuning the Gemma model, a Large Language Model (LLM), specifically for Basic Data Science Concept generation. Starting from data loading, we've demonstrated how to train the Gemma model effectively, even for those new to working with LLMs.

Achieving the best performance with the Gemma model (or any LLM) generally requires training with more extensive datasets and over more epochs. Future enhancements could include Retrieval-Augmented Generation (RAG), which supplies the model with passages from an external knowledge base at query time for more precise and relevant responses, and Direct Preference Optimization (DPO), which further trains the model on preferred versus rejected answers.

Ultimately, this notebook is designed to make the Gemma model approachable for beginners, illustrating that straightforward steps can unlock the potential of LLMs for diverse domain-specific tasks. It encourages users to experiment with the Gemma model across various fields, broadening the scope of its application and enhancing its utility.
