<a href="https://colab.research.google.com/github/nidhesg/Great-Frontend/blob/main/Meta_Llama_3_8B_Instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install -U transformers

Collecting transformers
  Downloading transformers-4.57.0-py3-none-any.whl.metadata (41 kB)
Downloading transformers-4.57.0-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.2
    Uninstalling transformers-4.56.2:
      Successfully uninstalled transformers-4.56.2
Successfully installed transformers-4.57.0


## Local Inference on GPU
Model page: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
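
Whether the model fits on a given GPU depends mostly on the precision of its weights, which is why the loading cell in this section uses 4-bit quantization. The sketch below is a rough, weight-only estimate. The parameter count of 8 billion is an approximation, and `weight_gb` is a hypothetical helper, not part of any library.

```python
# Rough weight-only memory estimate. Activations, the KV cache and CUDA
# overhead are NOT included, so real usage will be higher.
N_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters for Llama 3 8B


def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights need about 32 GB in float32, 16 GB in float16, and 4 GB in 4-bit. In practice the 4-bit figure is somewhat higher, because the quantization constants are stored alongside the weights.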

⚠️ If the generated code snippets do not work, please open an issue on the [model repo](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), on [huggingface.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts), or both 🙏

The model you are trying to use is gated. Make sure you have access to it by visiting the model page. To run inference, either set `HF_TOKEN` in your environment variables or Secrets, or run the following cell to log in. 🤗

In [3]:
from huggingface_hub import login
login(new_session=False)


In [1]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# --- Fix 1: Load the model and tokenizer separately, with 4-bit quantization ---
# Loading through AutoModelForCausalLM lets us pass a quantization config and a device map.
# Requires the bitsandbytes and accelerate packages in addition to transformers.

# Detect the device and configure 4-bit loading.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 4-bit NF4 quantization cuts the weight footprint to roughly a quarter of float16,
# which is what lets Llama 3 8B fit on smaller GPUs without CUDA out-of-memory errors.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16 # Use float16 for computation
)

print(f"Loading model on device: {device}. Using 4-bit Quantization to conserve VRAM.")
print("NOTE: Llama 3 models are gated and may require a Hugging Face token to download.")

# Load the tokenizer and model
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # Passing the 4-bit configuration here
    device_map="auto" # This automatically manages the model across GPUs/CPU
)
model.eval()

# --- Fix 2: Correct Input Formatting for Instruct Models ---
# The model expects a single string formatted according to the Llama 3 instruction template.

messages = [
    {"role": "system", "content": "You are a helpful, harmless, and honest code generation assistant."},
    {"role": "user", "content": "Who are you? Respond in three sentences."},
]

# Apply the tokenizer's chat template to convert the list of dictionaries
# into the single, formatted string the model requires.
# 'add_generation_prompt=True' tells the tokenizer we expect a response.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# --- Fix 3: Use the Pipeline and print the output ---
# The pipeline wraps the already-loaded model and tokenizer.
# Do not pass `device` here: the model was already placed by `device_map="auto"`,
# and a pipeline cannot move a model that accelerate has dispatched.

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Run the generation using the formatted prompt
result = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    # This setting is important to only return the generated text, not the prompt itself
    return_full_text=False
)

# Print the generated text cleanly
if result and result[0] and 'generated_text' in result[0]:
    print("\n--- Generated Response ---")
    print(result[0]['generated_text'].strip())
else:
    print("\nError: Could not retrieve generated text.")

# Optional: release cached, unused GPU memory blocks after generation.
# This does not free the model weights, which stay allocated while `model` is alive.
if torch.cuda.is_available():
    torch.cuda.empty_cache()


PackageNotFoundError: No package metadata was found for bitsandbytes
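
The error above most likely comes from the fact that only `transformers` was installed, while the quantized loading cell also needs `bitsandbytes` (for `BitsAndBytesConfig`) and `accelerate` (for `device_map="auto"`). A minimal sketch of a pre-flight check follows. `missing_packages` is a hypothetical helper, and the list of required distributions is an assumption based on the cell above.

```python
from importlib import metadata


def missing_packages(names):
    """Return the subset of `names` that has no installed distribution."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Assumption: these are the extra distributions the quantized cell relies on.
required = ["bitsandbytes", "accelerate"]
missing = missing_packages(required)
if missing:
    print("Install first: pip install -U " + " ".join(missing))
else:
    print("All required packages are installed.")
```

After installing anything reported as missing, restarting the runtime before re-running the loading cell is usually the safest option.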

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
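
Both local cells hand a `messages` list to `apply_chat_template`, which turns it into the single prompt layout the model was trained on. To make that layout visible, here is a hand-rolled approximation of the Llama 3 format. `render_llama3_prompt` is a hypothetical helper written for illustration only. Real code should keep using the tokenizer, which reads the template shipped with the model.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout for a list of role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the text, and an end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Who are you?"}]
print(render_llama3_prompt(messages))
```

The trailing assistant header is what `add_generation_prompt=True` contributes: without it, the model may continue the user turn instead of answering.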

## Remote Inference via Inference Providers
Ensure you have a valid **HF_TOKEN** set in your environment. You can get your token from [your settings page](https://huggingface.co/settings/tokens). Note: running this may incur charges above the free tier.
The following Python example shows how to run the model remotely on HF Inference Providers, automatically selecting an available inference provider for you.
For more information on how to use the Inference Providers, please refer to our [documentation and guides](https://huggingface.co/docs/inference-providers/en/index).

In [None]:
import os
os.environ['HF_TOKEN'] = 'YOUR_TOKEN_HERE'
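
Pasting a real token into a cell makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads `HF_TOKEN` from the environment and only prompts for it when it is unset. `resolve_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os
from getpass import getpass


def resolve_hf_token(env=os.environ, prompt=getpass):
    """Return HF_TOKEN from `env`, asking interactively only if it is unset."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the input so the token does not end up in the cell output.
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token


# In a notebook cell: os.environ["HF_TOKEN"] = resolve_hf_token()
```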

In [None]:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
)

print(completion.choices[0].message)
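
The `OpenAI` client above works because the router exposes an OpenAI-style chat completions route. As a sketch of what goes over the wire, the same request can be built with the standard library alone. `build_chat_request` is a hypothetical helper, and the `/chat/completions` path is assumed to be what the client appends to `base_url`. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(token, model, user_text,
                       base_url="https://router.huggingface.co/v1"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "YOUR_TOKEN_HERE",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "What is the capital of France?",
)
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` and reading `choices[0]["message"]["content"]` from the JSON body would mirror what the client cell prints, subject to the same token and billing caveats.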