<a href="https://colab.research.google.com/github/heerr2005/codellama-7b-instruct-quantized/blob/main/CodeLlama_7B_Colab_Chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🪄 CodeLlama-7B Instruct: Professional Google Colab Notebook**

This notebook demonstrates how to run CodeLlama-7B-Instruct efficiently on Google Colab using 8-bit quantization.
It is structured for clarity and stability.

# **Notebook Overview**

- Model: `codellama/CodeLlama-7b-Instruct-hf`
- Framework: Hugging Face Transformers
- Environment: Google Colab (T4 / L4 GPU)
- Quantization: 8-bit (via bitsandbytes)

**Why NOT pipeline()?**

Called with its default arguments, the Hugging Face `pipeline()` API loads the model without quantization, which can exceed the GPU memory available on Colab and crash the session. This notebook instead loads the model manually with 8-bit quantization so that the weights fit in GPU memory.
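
As a rough back-of-the-envelope check of the memory argument above, the sketch below estimates the size of the weights alone at different precisions. It assumes about 7 billion parameters (read off the model name, not an exact count) and ignores activations, the KV cache, and framework overhead, so real usage will be higher. `PARAM_COUNT` and `weight_memory_gib` are illustrative names, not part of any library.

```python
# Rough weight-memory estimate.
# PARAM_COUNT is an assumption based on the "7B" in the model name.
PARAM_COUNT = 7_000_000_000

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(param_count, bytes_per_param):
    """Return the memory needed for the weights alone, in GiB."""
    return param_count * bytes_per_param / 1024**3


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: ~{weight_memory_gib(PARAM_COUNT, nbytes):.1f} GiB")
```

Under these assumptions the weights need roughly 26 GiB in float32, 13 GiB in float16, and 6.5 GiB in 8-bit, which is why the quantized load leaves headroom on a 16 GB T4.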

## Local Inference on GPU
Model page: https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf


## Step 1: Install Required Libraries

**Purpose:**  
Installs the Python dependencies needed for model loading and inference:
- `transformers` → Model & tokenizer
- `accelerate` → Device mapping
- `bitsandbytes` → 8-bit quantization

📌 Run this cell only once at the start.


In [None]:
!pip install -U transformers accelerate bitsandbytes


Collecting bitsandbytes
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.49.0
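
To confirm that the install step took effect, a small check like the one below can list the installed versions. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the three libraries named above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["transformers", "accelerate", "bitsandbytes"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

If any entry prints `not installed`, re-run the install cell before continuing.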


## Step 2: Import Required Python Modules

**Purpose:**  
Import the core libraries used in this notebook for:
- Model loading
- Tokenization
- GPU support

These must be imported before any model code runs.
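
For the GPU support mentioned above, it can help to check which GPU the runtime actually exposes before loading the model. The following is a sketch, not part of the original notebook: `gpu_summary` is a hypothetical helper, and it guards the `torch` import so the cell also runs where PyTorch is missing.

```python
import importlib.util


def gpu_summary():
    """Describe the first visible CUDA GPU, or explain why none is available."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch  # imported lazily so the check above can run without it

    if not torch.cuda.is_available():
        return "No CUDA GPU visible to PyTorch"

    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {props.total_memory / 1024**3:.1f} GiB"


print(gpu_summary())
```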


## Step 3: Load CodeLlama-7B-Instruct (8-bit)

**Purpose:**  
Loads the CodeLlama-7B-Instruct model in **8-bit quantized format** so it
can run without crashing on Google Colab GPUs.

This approach:
✔ Reduces GPU memory usage  
✔ Avoids out-of-memory (OOM) crashes  
✔ Works reliably on T4 / L4 GPUs

**Notes:**
- The tokenizer is also loaded here
- `pad_token` is set to the EOS token in a separate cell after Step 4, so that `padding=True` has a token to pad with


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Request 8-bit weights through BitsAndBytesConfig; passing load_in_8bit=True
# directly to from_pretrained() is deprecated.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto",  # let accelerate place the layers on the available device
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Step 4: Sanity Test (Single Prompt)

**Purpose:**  
Verify that the model loaded correctly by giving it a simple prompt
and printing the response. This helps ensure everything is working
before starting the interactive chat.

We pass a short message and generate a short reply.


In [None]:
import torch

messages = [
    {"role": "user", "content": "Who are you?"}
]

# return_dict=True yields both input_ids and attention_mask, so generate()
# does not have to guess which positions are padding.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(
    outputs[0][prompt_length:],
    skip_special_tokens=True
)

print(response)


 My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. My primary function is to generate human-like text responses to user input, whether it be a question, statement, or prompt. I can answer questions, provide information, and engage in conversation. I can understand and respond to natural language input, making me a highly useful tool for a wide range of applications, such as customer service, language translation, and content creation.
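
The `temperature` and `top_p` arguments passed to `generate()` control how the next token is sampled. The sketch below illustrates the general idea on a toy list of logits: scale by temperature, convert to probabilities, keep the most likely tokens until their cumulative probability reaches `top_p`, then renormalize. It is an illustration of the technique under those assumptions, not the library's implementation, and `top_p_distribution` is a hypothetical name.

```python
import math


def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


print(top_p_distribution([2.0, 1.0, 0.0], temperature=1.0, top_p=0.9))
```

With these toy logits the least likely token is dropped and the remaining two are sampled from. Lowering `temperature` concentrates probability on the top token, which makes replies more repeatable.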


In [None]:
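# Reuse EOS as the pad token so that padding=True in the following cells
# has a token to pad with.
tokenizer.pad_token = tokenizer.eos_token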


## Step 5: Prepare Model Inputs Using Chat Template

**Purpose:**
Formats the conversation history into a model-compatible prompt
using the chat template bundled with the CodeLlama tokenizer.

This step:
- Preserves conversation roles
- Ensures correct assistant generation
- Moves tensors to the appropriate device (GPU/CPU)


In [None]:
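# padding=True relies on the pad token set in the previous cell;
# return_dict=True also returns the attention mask that generate() expects.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    padding=True
)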

inputs = inputs.to(model.device)
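
To make the formatting step concrete, the sketch below approximates what a chat template does with a list of role-tagged messages. It assumes CodeLlama-Instruct follows the Llama 2 style `[INST] ... [/INST]` layout; the exact whitespace and special tokens here are assumptions, and `render_prompt` is a hypothetical helper. The authoritative string is whatever `tokenizer.apply_chat_template(messages, tokenize=False)` returns.

```python
# Assumed special tokens for illustration; the real values come from the tokenizer.
BOS, EOS = "<s>", "</s>"


def render_prompt(messages):
    """Approximate the [INST] prompt layout for alternating user/assistant messages."""
    parts = []
    for message in messages:
        content = message["content"].strip()
        if message["role"] == "user":
            # Each user turn opens a new sequence and is wrapped in instruction tags.
            parts.append(f"{BOS}[INST] {content} [/INST]")
        elif message["role"] == "assistant":
            # Each assistant turn follows the instruction and closes the sequence.
            parts.append(f" {content} {EOS}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


print(render_prompt([{"role": "user", "content": "Who are you?"}]))
```

Because the prompt ends right after `[/INST]`, the model continues with the assistant's reply, which is the effect `add_generation_prompt=True` is meant to achieve.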


## Step 6: Define Interactive Chat Function

**Purpose:**  
This block defines a function named `chat()` that:
✔ Accepts user input  
✔ Keeps a running message history  
✔ Formats input for the model  
✔ Generates responses  
✔ Prints replies in chat format

You'll run this function in the next step.


In [None]:
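def chat():
    """Run an interactive chat loop until the user types 'exit' or 'quit'."""
    messages = []  # full conversation history, resent to the model every turn

    while True:
        user_input = input("You: ")
        if user_input.strip().lower() in ["exit", "quit"]:
            break

        messages.append({"role": "user", "content": user_input})

        # return_dict=True also returns the attention mask for generate().
        inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
            padding=True
        ).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens, skipping the prompt.
        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[-1]:],
            skip_special_tokens=True
        )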

        print("CodeLlama:", response)
        messages.append({"role": "assistant", "content": response})
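
The history handling in the loop can be seen in isolation with a small sketch that swaps the model for a stand-in. `chat_turn`, `generate_fn`, and `echo_generator` are hypothetical names introduced only for this illustration; in the notebook, the role of `generate_fn` is played by the tokenizer and `model.generate()` calls above.

```python
def chat_turn(messages, user_input, generate_fn):
    """Append the user message, get a reply from generate_fn, and record it in the history."""
    messages.append({"role": "user", "content": user_input})
    reply = generate_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


# Stand-in generator for demonstration only; a real one would call model.generate().
def echo_generator(messages):
    return f"(echo) {messages[-1]['content']}"


history = []
chat_turn(history, "hello", echo_generator)
chat_turn(history, "write a function", echo_generator)

# The history alternates user and assistant roles, one pair per turn.
print([m["role"] for m in history])
```

Since the whole history is sent on every turn, the prompt grows with the conversation, so long sessions use more memory and take longer per reply.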





## Step 7: Start Interactive Chat

**Purpose:**  
Starts the interactive chat loop so you can talk to CodeLlama.
After running this cell:
- A prompt will appear below
- Type your message
- Press Enter to get a reply
- Type `exit` or `quit` when done


In [None]:
chat()


You: write a fastapi backend for a todo app


CodeLlama:  Here is an example of a FastAPI backend for a simple todo app:
```
from fastapi import FastAPI

app = FastAPI()

todo_items = []

@app.post("/todo")
async def create_todo(item: str):
    todo_items.append(item)
    return {"message": f"Todo item '{item}' created"}

@app.get("/todo")
async def get_todo():
    return todo_items

@app.delete("/todo/{item_id}")
async def delete_todo(item_id: str):
    todo_items.remove(item_id)
    return {"message": f"Todo item '{item_id}' deleted"}

@app.put("/todo/{item_id}")
async def update_todo(item_id: str, item: str):
    todo_items[item_id] = item
    return {"message": f"Todo item '{item_id}' updated"}
```
This backend defines three endpoints:

* `/todo`: Creates a new todo
You: exit


## ✅ Notebook Completed

Congratulations! You have:
✔ Installed libraries  
✔ Loaded a large language model in 8-bit  
✔ Verified model output  
✔ Built an interactive chat loop  
✔ Created a professional, GitHub-ready notebook  

🚀 Next steps (optional):
- Add a web UI (Gradio)
- Save chat logs
- Add more system prompts
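
As a starting point for two of these next steps, the sketch below saves a message history to a JSON file and reloads it, with a system prompt as the first entry. The helper names and the file name are hypothetical, and whether the chat template accepts a `system` role is an assumption to verify against the tokenizer.

```python
import json
import tempfile
from pathlib import Path


def save_chat_log(messages, path):
    """Write the message history to a JSON file."""
    Path(path).write_text(
        json.dumps(messages, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def load_chat_log(path):
    """Read a message history back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Example history; the system message is an assumption about what the template supports.
history = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "write a fastapi backend for a todo app"},
]

# A temporary directory keeps the demonstration from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "chat_log.json"
    save_chat_log(history, log_path)
    restored = load_chat_log(log_path)

print(restored == history)
```

In the notebook, `save_chat_log` could be called at the end of `chat()` with a path on mounted storage so logs survive the session.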
