# **🪄 CodeLlama-7B Instruct: Professional Google Colab Notebook**

This notebook demonstrates how to run CodeLlama-7B-Instruct efficiently on Google Colab using 8-bit quantization.
It is structured for clarity and stability.

# **Notebook Overview**

- Model: codellama/CodeLlama-7b-Instruct-hf
- Framework: Hugging Face Transformers
- Environment: Google Colab (T4 / L4 GPU)
- Quantization: 8-bit (via bitsandbytes)

**Why NOT pipeline()?**

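Called with its defaults, the Hugging Face `pipeline()` API loads the model unquantized. For a 7B-parameter model the weights alone need roughly 13 GiB in 16-bit precision and about 26 GiB in 32-bit precision, which leaves little or no headroom on a 16 GB T4 and can crash the session. This notebook instead loads the model manually with 8-bit quantization, which keeps the weights well within Colab GPU memory.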
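
The memory argument above can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes a nominal 7 billion parameters (the exact count differs slightly) and counts weights only, ignoring activations, the KV cache, and quantization overhead. The helper name `weight_memory_gib` is made up for this illustration.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Assumption: nominal parameter count for a "7B" model.
NUM_PARAMS = 7_000_000_000


def weight_memory_gib(num_params, dtype):
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the 8-bit weights come to about 6.5 GiB, compared with about 13 GiB in 16-bit and about 26 GiB in 32-bit.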

## Local Inference on GPU
Model page: https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf


## Step 1: Install Required Libraries

**Purpose:**  
Installs the Python dependencies needed for model loading and inference:
- `transformers` → Model & tokenizer
- `accelerate` → Device mapping
- `bitsandbytes` → 8-bit quantization

📌 Run this cell only once at the start.


In [None]:
!pip install -U transformers accelerate bitsandbytes
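
After the install finishes, it can help to confirm which versions were actually picked up, since `-U` pulls whatever is newest at the time. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the installed packages.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["transformers", "accelerate", "bitsandbytes"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

If any package is reported as not installed, rerun the install cell (and restart the Colab runtime if the notebook asks for it).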


## Step 2: Import Required Python Modules

**Purpose:**  
Import the core libraries used in this notebook for:
- Model loading
- Tokenization
- GPU support

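These must be imported before any model code runs. In this notebook the import statements sit at the top of the cells that first use them (Steps 3 and 4).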
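
For the "GPU support" item, a quick check that Colab has actually attached a GPU can save a failed model load later. This is a sketch, not part of the original notebook: `describe_gpu` is a hypothetical helper, and it guards the `torch` import so the cell still runs where PyTorch is missing.

```python
import importlib.util


def describe_gpu():
    """Return a short description of the first visible CUDA GPU, if any."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch  # imported lazily so the check above can run without it

    if not torch.cuda.is_available():
        return "no CUDA GPU visible (check Runtime > Change runtime type)"

    props = torch.cuda.get_device_properties(0)
    return f"{props.name}, {props.total_memory / 1024**3:.1f} GiB"


print(describe_gpu())
```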


## Step 3: Load CodeLlama-7B-Instruct (8-bit)

**Purpose:**  
Loads the CodeLlama-7B-Instruct model in **8-bit quantized format** so it
can run without crashing on Google Colab GPUs.

This approach:
✔ Reduces GPU memory usage  
✔ Avoids OOM crashes  
✔ Works reliably on T4 / L4 GPUs

**Notes:**
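- The tokenizer is also loaded here
- `pad_token` is set to the EOS token in a separate cell after the sanity test, so that later calls using `padding=True` have a padding token to work with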


In [None]:
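from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recent transformers releases expect quantization settings in a config object;
# passing load_in_8bit=True directly to from_pretrained() is deprecated.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"  # let accelerate place the layers on the available GPU
)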


## Step 4: Sanity Test (Single Prompt)

**Purpose:**  
Verify that the model loaded correctly by giving it a simple prompt
and printing the response. This helps ensure everything is working
before starting the interactive chat.

We pass a short message and generate a short reply.


In [None]:
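import torch

messages = [
    {"role": "user", "content": "Who are you?"}
]

# return_dict=True yields input_ids plus attention_mask, so generate()
# does not have to infer which positions hold real tokens.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# generate() returns prompt + reply; slice off the prompt before decoding.
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(
    outputs[0][prompt_len:],
    skip_special_tokens=True
)

print(response)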
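
The sanity test samples with `temperature=0.7` and `top_p=0.9`. To give a feel for what those two settings do, here is a toy, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a handful of made-up logits. It illustrates the idea only and is not the Transformers implementation; `top_p_filter` is a hypothetical name.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
print(top_p_filter(toy_logits, temperature=1.0))
print(top_p_filter(toy_logits, temperature=0.7))
```

With these toy logits, lowering the temperature from 1.0 to 0.7 shrinks the kept set from three tokens to two, which is why lower temperatures give more focused, less varied replies.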


In [None]:
# Reuse EOS as the padding token so the padding=True calls below have one to use.
tokenizer.pad_token = tokenizer.eos_token


## Step 5: Prepare Model Inputs Using Chat Template

**Purpose:**
Formats the conversation history into a model-compatible prompt
using CodeLlama's official chat template.

This step:
- Preserves conversation roles
- Ensures correct assistant generation
- Moves tensors to the appropriate device (GPU/CPU)
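
To make "preserves conversation roles" concrete, the sketch below approximates the `[INST] ... [/INST]` layout used by Llama-2-style chat templates. It is an illustration under assumptions, not the template shipped with the model: `render_llama2_style` is a hypothetical helper, the exact whitespace may differ, and system prompts are left out.

```python
def render_llama2_style(messages, bos="<s>", eos="</s>"):
    """Approximate a Llama-2-style chat prompt from a list of role/content dicts."""
    parts = []
    for message in messages:
        text = message["content"].strip()
        if message["role"] == "user":
            # Each user turn opens a new [INST] block.
            parts.append(f"{bos}[INST] {text} [/INST]")
        elif message["role"] == "assistant":
            # Each assistant reply is closed with the end-of-sequence token.
            parts.append(f" {text} {eos}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


demo = [
    {"role": "user", "content": "Write hello world in C."},
    {"role": "assistant", "content": "Sure, here it is."},
    {"role": "user", "content": "Now in Rust."},
]
print(render_llama2_style(demo))
```

To inspect the string the real template produces, call `tokenizer.apply_chat_template(messages, tokenize=False)` and print the result.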


In [None]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    padding=True
)

inputs = inputs.to(model.device)


## Step 6: Define Interactive Chat Function

**Purpose:**  
This block defines a function named `chat()` that:
✔ Accepts user input  
✔ Keeps a running message history  
✔ Formats input for the model  
✔ Generates responses  
✔ Prints replies in chat format

You'll run this function in the next step.
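
Because the running message history is resent to the model on every turn, the prompt keeps growing, and long sessions get slower and use more GPU memory. One possible mitigation, sketched here as an assumption rather than something the notebook does, is to keep only the most recent exchanges. `trim_history` and its `max_turns` parameter are hypothetical names.

```python
def trim_history(messages, max_turns=4):
    """Keep at most the last `max_turns` user/assistant exchanges."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")

    # Two messages (user + assistant) make up one exchange.
    trimmed = messages[-max_turns * 2:]

    # The conversation should still open with a user message.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


history = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]
print(trim_history(history, max_turns=1))
```

Inside the loop, this would be applied to the history just before it is passed to `apply_chat_template`.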


In [None]:
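def chat():
    """Run a chat loop until the user types 'exit' or 'quit'."""
    messages = []  # full conversation history, resent to the model every turn

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["exit", "quit"]:
            break
        if not user_input:
            continue  # skip empty lines instead of sending them to the model

        messages.append({"role": "user", "content": user_input})

        # return_dict=True also returns the attention mask that generate() uses
        inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
            padding=True
        ).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.7,
                top_p=0.9
            )

        # Slice off the prompt so only the newly generated tokens are decoded
        prompt_len = inputs["input_ids"].shape[-1]
        response = tokenizer.decode(
            outputs[0][prompt_len:],
            skip_special_tokens=True
        )

        print("CodeLlama:", response)
        messages.append({"role": "assistant", "content": response})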





## Step 7: Start Interactive Chat

**Purpose:**  
Boots up the interactive chat loop so you can talk to CodeLlama.
After running this cell:
- A prompt will appear below
- Type your message
- Press Enter to get a reply
- Type `exit` or `quit` when done


In [None]:
chat()


## ✅ Notebook Completed

Congratulations! You have:
✔ Installed libraries  
✔ Loaded a large language model in 8-bit  
✔ Verified model output  
✔ Built an interactive chat loop  
✔ Created a professional, GitHub-ready notebook  

🚀 Next steps (optional):
- Add a web UI (Gradio)
- Save chat logs
- Add more system prompts
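
As a starting point for the "save chat logs" idea, the sketch below writes a message list to a JSON file and reads it back. It assumes `chat()` were changed to return its `messages` list, which the version in this notebook does not do; `save_chat_log`, `load_chat_log`, and the file name are hypothetical.

```python
import json
from pathlib import Path


def save_chat_log(messages, path):
    """Write the conversation (a list of role/content dicts) to a JSON file."""
    path = Path(path)
    path.write_text(
        json.dumps(messages, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return path


def load_chat_log(path):
    """Read a conversation previously written by save_chat_log()."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


example = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am a code assistant."},
]
log_path = save_chat_log(example, "chat_log.json")
print(load_chat_log(log_path) == example)
```

On Colab, files written this way live on the temporary runtime disk, so download them or copy them to mounted storage before the session ends.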
