### Step 1: Install Libraries

In [1]:
! pip install bitsandbytes accelerate
! pip install -U transformers

Collecting transformers
  Using cached transformers-4.42.3-py3-none-any.whl.metadata (43 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Using cached tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached transformers-4.42.3-py3-none-any.whl (9.3 MB)
Using cached tokenizers-0.19.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.2
    Uninstalling tokenizers-0.15.2:
      Successfully uninstalled tokenizers-0.15.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.2
    Uninstalling transformers-4.38.2:
      Successfully uninstalled transformers-4.38.2
Successfully installed tokenizers-0.19.1 transformers-4.42.3


**Bitsandbytes**: A lightweight wrapper around custom CUDA kernels for quantized (8-bit and 4-bit) matrix multiplication, which lowers the memory needed to load and run large models.

**Accelerate**: A Hugging Face library that handles device placement, so the same training and inference code can run across CPUs, GPUs, and TPUs.

**Device Mapping for Model Layers**
When a model is too large for a single device, its layers can be distributed across several devices to spread the memory and compute load. The `device_maps` list below pairs each module name with the index of the GPU it is assigned to: layers 0 to 18 go to GPU 0, and layers 19 to 41, together with the token embeddings, go to GPU 1.

In [2]:
device_maps = [('model.layers.0', 0),
 ('model.layers.1', 0),
 ('model.layers.2', 0),
 ('model.layers.3', 0),
 ('model.layers.4', 0),
 ('model.layers.5', 0),
 ('model.layers.6', 0),
 ('model.layers.7', 0),
 ('model.layers.8', 0),
 ('model.layers.9', 0),
 ('model.layers.10', 0),
 ('model.layers.11', 0),
 ('model.layers.12', 0),
 ('model.layers.13', 0),
 ('model.layers.14', 0),
 ('model.layers.15', 0),
 ('model.layers.16', 0),
 ('model.layers.17', 0),
 ('model.layers.18', 0),
 ('model.layers.19', 1),
 ('model.layers.20', 1),
 ('model.layers.21', 1),
 ('model.layers.22', 1),
 ('model.layers.23', 1),
 ('model.layers.24', 1),
 ('model.layers.25', 1),
 ('model.layers.26', 1),
 ('model.layers.27', 1),
 ('model.layers.28', 1),
 ('model.layers.29', 1),
 ('model.layers.30', 1),
 ('model.layers.31', 1),
 ('model.layers.32', 1),
 ('model.layers.33', 1),
 ('model.layers.34', 1),
 ('model.layers.35', 1),
 ('model.layers.36', 1),
 ('model.layers.37', 1),
 ('model.layers.38', 1),
 ('model.layers.39', 1),
 ('model.layers.40', 1),
 ('model.layers.41', 1),
 ('model.embed_tokens', 1),
 ('model.layers', 1)]
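
The same layer-to-GPU assignment can also be generated instead of typed out by hand. The sketch below is only an illustration: `build_device_map`, `num_layers`, and `split_at` are hypothetical names that do not appear in the notebook, and the split point of 19 simply mirrors the list above. It leaves out the catch-all `'model.layers'` entry, and a map used for actual loading would presumably need entries for any remaining modules as well.

```python
def build_device_map(num_layers=42, split_at=19):
    """Assign decoder layers [0, split_at) to GPU 0 and the rest to GPU 1."""
    device_map = {
        f"model.layers.{i}": 0 if i < split_at else 1
        for i in range(num_layers)
    }
    # The token embeddings sit on GPU 1 in the hand-written list.
    device_map["model.embed_tokens"] = 1
    return device_map


layer_map = build_device_map()
layers_on_gpu0 = [name for name, gpu in layer_map.items() if gpu == 0]
print(len(layers_on_gpu0), "modules on GPU 0,", len(layer_map) - len(layers_on_gpu0), "on GPU 1")
```

Changing `split_at` is then enough to rebalance the two GPUs if one of them runs out of memory.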

In [3]:
import torch
torch.backends.cuda.enable_mem_efficient_sdp(False)

Calling `torch.backends.cuda.enable_mem_efficient_sdp(False)` disables PyTorch's memory-efficient kernel for scaled dot-product (SDP) attention, so attention falls back to another available implementation. This can be necessary for models that the kernel does not handle well, or when debugging. The aim is consistent behavior across different hardware setups rather than higher speed.
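
For readers unfamiliar with the term, SDP stands for scaled dot-product attention, which computes softmax(QKᵀ / √d)·V. The pure-Python sketch below is a reference for the math only, written for this explanation; it is not how PyTorch implements any of its kernels, and the function names are hypothetical. Whichever kernel is enabled or disabled, the result it computes should match this definition up to numerical precision.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sdp_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
print(sdp_attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```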

### Step 2: Import the Model and Its Tokenizer

In [5]:
from transformers import (AutoTokenizer, AutoModelForCausalLM, 
                          BitsAndBytesConfig, AutoConfig)

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_name = "google/gemma-2-9b-it"
hf_token = "YOUR_HF_TOKEN"   # your Hugging Face access token; request access to the Gemma model on Hugging Face first
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
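
To see why `load_in_8bit=True` matters here, a rough estimate of the weight memory helps. The sketch below assumes about 9 billion parameters, a figure inferred only from the "9b" in the model name, and counts weights alone: activations, the KV cache, and any layers that stay in higher precision are ignored. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


ASSUMED_PARAMS = 9e9  # assumption: roughly 9 billion parameters

for label, nbytes in [("bfloat16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_gib(ASSUMED_PARAMS, nbytes):.1f} GiB")
# bfloat16: 16.8 GiB
# int8: 8.4 GiB
```

Under these assumptions, 8-bit loading roughly halves the weight footprint compared with bfloat16.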


In [6]:
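# Convert the (module name, GPU index) pairs into a dict.
# Note: the loading cell below passes device_map="auto", so this dict is not used there.
device = {layer: gpu_id for layer, gpu_id in device_maps}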

In [7]:
 
config = AutoConfig.from_pretrained(model_name,token=hf_token)
config.gradient_checkpointing = True

In [8]:
config

Gemma2Config {
  "_name_or_path": "google/gemma-2-9b-it",
  "architectures": [
    "Gemma2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": 50.0,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": 30.0,
  "gradient_checkpointing": true,
  "head_dim": 256,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "gemma2",
  "num_attention_heads": 16,
  "num_hidden_layers": 42,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 224,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "sliding_window_size": 4096,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 256000
}
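
The config values also allow a back-of-the-envelope estimate of the key/value cache used during generation. The sketch below takes `num_hidden_layers`, `num_key_value_heads`, and `head_dim` from the printed config and assumes the cache is stored in bfloat16 for a single sequence. It ignores the sliding window (`sliding_window: 4096`), which would likely cap the cache for some layers, so treat the result as an upper bound. `kv_cache_bytes` is a hypothetical helper.

```python
# Values copied from the Gemma2Config printed above.
NUM_HIDDEN_LAYERS = 42
NUM_KEY_VALUE_HEADS = 8
HEAD_DIM = 256
BYTES_PER_VALUE = 2  # assumption: cache kept in bfloat16


def kv_cache_bytes(num_tokens):
    """Bytes needed to cache keys and values for one sequence of num_tokens."""
    per_token = 2 * NUM_HIDDEN_LAYERS * NUM_KEY_VALUE_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


print(f"{kv_cache_bytes(1) / 1024:.0f} KiB per token")           # 336 KiB per token
print(f"{kv_cache_bytes(8192) / 1024**3:.3f} GiB at 8192 tokens")  # 2.625 GiB at 8192 tokens
```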

In [9]:

model = AutoModelForCausalLM.from_pretrained(
    model_name, config=config, torch_dtype="auto", token=hf_token,
    quantization_config=quantization_config, device_map="auto", trust_remote_code=True,
)

Downloading shards: 100%|██████████| 4/4 [11:18<00:00, 169.65s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.73s/it]


### Step 3: Generate the Corrected Sentence

In [10]:
input_text = "correct this sentense: He does eat meat everyday"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# max_length counts the prompt tokens plus the generated tokens.
outputs = model.generate(**inputs, max_length=150)
print(tokenizer.decode(outputs[0]))



<bos>correct this sentense: He does eat meat everyday.

The corrected sentence is: **He eats meat every day.**


Here's why:

* **Subject-Verb Agreement:**  "He" is singular, so the verb needs to be "eats" (singular present tense) instead of "does eat" (present tense with auxiliary verb).
* **Word Order:**  In English, we typically place adverbs of frequency (like "every day") before the main verb. 



Let me know if you have any other sentences you'd like help with!<end_of_turn>
<eos>
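
In the output above the model first continues the prompt (it appends a period) before answering, which is typical when an instruction-tuned model receives raw text instead of a chat-formatted prompt. The sketch below wraps the prompt in Gemma-style turn markers. The exact marker layout is an assumption based on the `<end_of_turn>` token visible in the output, and `build_gemma_prompt` is a hypothetical helper. The tokenizer's own `apply_chat_template` method is the authoritative way to build this string.

```python
def build_gemma_prompt(user_message):
    """Wrap a user message in Gemma-style turn markers (assumed format)."""
    return (
        "<start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_gemma_prompt("correct this sentense: He does eat meat everyday")
print(prompt)
```

The resulting string would be passed to the tokenizer in place of `input_text`. No `<bos>` is added here because, as the output above shows, the tokenizer already prepends it.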
