# **Imports**

In [None]:
%pip install -q -U huggingface_hub
%pip install -q -U transformers
%pip install -q -U accelerate
%pip install -q -U bitsandbytes
%pip install -q trl
%pip install -q peft

# Import torch, os, transformers, and other relevant modules

In [2]:
import torch
import os
import pandas as pd
import re
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoConfig, TrainingArguments, pipeline
from wordcloud import WordCloud, STOPWORDS
from datasets import Dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from IPython.display import Markdown as md
import warnings
warnings.filterwarnings('ignore')

2024-04-28 20:46:03.836628: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-28 20:46:03.836729: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-28 20:46:03.956971: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Loading Gemma Model

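* `BitsAndBytesConfig`: configures 4-bit NF4 quantization with double quantization, using `bfloat16` as the compute dtype.

* Loading the tokenizer and model with quantization: both are read from the local Kaggle copy of Gemma 2B-it, and the model is loaded with this quantization config.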

In [5]:
%%time
# Local Kaggle path to the Gemma 2B instruction-tuned checkpoint
model_path = "/kaggle/input/gemma/transformers/2b-it/3"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# 4-bit NF4 weights with double quantization; computation runs in bfloat16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)
print(model)

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
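
The printed shapes are enough for a rough estimate of what 4-bit loading saves. The sketch below counts only the `Linear4bit` weights and the embedding table visible in the (truncated) printout above, so normalization weights and anything past the truncation are left out. The byte sizes are assumptions: 0.5 byte per quantized weight with quantization constants ignored, and 2 bytes per weight for a 16-bit baseline and for the unquantized embedding.

```python
# (in_features, out_features) of each Linear4bit module in one decoder layer,
# copied from the printed architecture.
LINEAR_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
NUM_LAYERS = 18                   # "(0-17): 18 x GemmaDecoderLayer"
EMBEDDING_SHAPE = (256000, 2048)  # Embedding(256000, 2048)
MIB = 1024 * 1024

# Parameter counts from the shapes alone (all linear layers have bias=False).
linear_params = NUM_LAYERS * sum(i * o for i, o in LINEAR_SHAPES.values())
embedding_params = EMBEDDING_SHAPE[0] * EMBEDDING_SHAPE[1]

# Assumption: 4 bits per quantized weight, quantization constants ignored.
linear_4bit_mib = linear_params * 0.5 / MIB
# Assumption: 2 bytes per weight for a 16-bit baseline and for the embedding.
linear_16bit_mib = linear_params * 2 / MIB
embedding_mib = embedding_params * 2 / MIB

print(f"Linear weights:    {linear_params:,} parameters")
print(f"Embedding weights: {embedding_params:,} parameters")
print(f"Linear layers at 16 bits: {linear_16bit_mib:.0f} MiB")
print(f"Linear layers at 4 bits:  {linear_4bit_mib:.0f} MiB")
print(f"Embedding at 16 bits:     {embedding_mib:.0f} MiB")
```

Under these assumptions the linear layers shrink to a quarter of their 16-bit size, while the embedding table is unaffected because it is not a `Linear4bit` module.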
    

## 4. Q & A Using Gemma

* The code generates an answer with the pre-trained Gemma model and measures the GPU memory the call allocates. It defines a prompt, tokenizes it, moves the tensors to the GPU, and calls `model.generate`.
* The generated tokens are then decoded and printed.
* Allocated GPU memory is read before and after generation with `torch.cuda.memory_allocated()`, and the difference is printed in MB.
* The model and inputs are on the GPU for faster computation.
* The output length is limited to 256 new tokens (`max_new_tokens=256`).
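
A small helper can capture wall-clock time and a memory delta around a single call. This is a minimal sketch, not part of the original notebook: `measure`, `fake_probe`, and `fake_generate` are hypothetical names, and the stand-in probe and workload exist only so the example runs without a GPU.

```python
import time
from typing import Any, Callable, Tuple


def measure(fn: Callable[[], Any],
            memory_probe: Callable[[], float]) -> Tuple[Any, float, float]:
    """Run fn once; return (result, elapsed seconds, change in probed memory)."""
    memory_before = memory_probe()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    memory_delta = memory_probe() - memory_before
    return result, elapsed, memory_delta


# Stand-in probe and workload (hypothetical) so the sketch runs on any machine.
fake_memory = {"mb": 100.0}


def fake_probe() -> float:
    return fake_memory["mb"]


def fake_generate() -> str:
    fake_memory["mb"] += 25.0  # pretend the call left 25 MB allocated
    return "generated text"


result, elapsed, delta = measure(fake_generate, fake_probe)
print(result)
print(f"Elapsed: {elapsed:.4f} s, memory delta: {delta} MB")
```

In the notebook, `fn` could be `lambda: model.generate(**input_ids, max_new_tokens=256)` and `memory_probe` a function returning `torch.cuda.memory_allocated()` in MB. CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` at the end of `fn` is likely needed for the timing to reflect the full GPU work.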

In [6]:
#---------------------------------------------------------------------------

import torch

input_text = 'Answer common questions about the Python programming language.'
# The tokenizer returns a batch (input_ids and attention_mask); move it to the GPU
inputs = tokenizer(input_text, return_tensors='pt').to("cuda")

# Function to get memory usage on CUDA device
def get_cuda_memory_usage():
    return torch.cuda.memory_allocated() / 1024 / 1024  # Memory usage in MB

# Memory usage before execution
memory_before = get_cuda_memory_usage()

# Execution
outputs = model.generate(**inputs, max_new_tokens=256)

# Memory usage after execution
memory_after = get_cuda_memory_usage()

# Net change in allocated memory across the call (not the peak during generation)
memory_used = memory_after - memory_before

print(tokenizer.decode(outputs[0]))
print('')
print("Memory used:", memory_used, "MB")


<bos>Answer common questions about the Python programming language.

**1. What is Python?**

* Python is a high-level, interpreted programming language.
* It is known for its clear and concise syntax, making it easier to learn and use than other programming languages.
* Python is widely used for various purposes, including data science, machine learning, web development, and scripting.

**2. What are the key features of Python?**

* **Dynamic typing:** Python does not require you to explicitly declare the data type of variables.
* **Indentation:** Python uses indentation to define blocks of code, making it clear and readable.
* **Modules:** Python has a vast collection of modules that extend the functionality of the language.
* **Concurrency:** Python supports multithreading, allowing multiple tasks to run concurrently.
* **Regular expressions:** Python provides powerful regular expression capabilities for text manipulation.

**3. What are the different types of data in Python?**

* **