# 🤖 TinyLlama 1.1B Chat

**Model:** TinyLlama 1.1B Chat v1.0  
**Size:** ~650MB (Q4_K_M quantized)  
**Best for:** Fast inference, low memory usage, general chat

---

## How to use this notebook:
1. Run **Step 1** to install the packages and load the model (takes 2-3 minutes the first time)
2. Run **Step 2** to start chatting with the model
3. Modify the prompt in Step 2 and run again for different responses

## 📦 Step 1: Setup & Load Model

This cell will:
- Install required packages
- Download the model (cached after first run)
- Load the model into memory

**Just run this cell and wait for "✅ Ready to chat!"**

In [None]:
# Install dependencies
print("📥 Installing packages...")
!pip install -q --no-cache-dir llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!pip install -q huggingface_hub

# Import libraries
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import os

# Configuration
MODEL_REPO = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
MODEL_FILE = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
os.environ["LLAMA_LOG_LEVEL"] = "ERROR"
os.environ["GGML_LOG_LEVEL"] = "ERROR"

# Download model
print(f"⬇️ Downloading {MODEL_FILE}...")
model_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE)
print("✓ Model downloaded")

# Load model
print("🔄 Loading model into memory...")
llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,      # Offload all layers to the GPU (use 0 for CPU only)
    n_ctx=2048,           # Context window (TinyLlama was trained with 2048 tokens)
    verbose=False
)
print("✅ Ready to chat!")
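
To confirm that the download landed in the local cache, you can check the size of the file at the path `hf_hub_download` returned. The helper below is a minimal sketch; `file_size_mb` is a hypothetical name introduced here, not part of either library.

```python
import os

def file_size_mb(path):
    """Return the size of a file in MiB (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)
```

After Step 1 has run, `print(f"{file_size_mb(model_path):.0f} MiB at {model_path}")` should report a size in the same range as the figure quoted at the top of this notebook.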

## 💬 Step 2: Chat with the Model

**Instructions:**
- Edit the `user_prompt` below with your question
- Run this cell to get a response
- Run it multiple times with different prompts!

**Tips:**
- Increase `max_tokens` for longer responses (128-512)
- Raise `temperature` (0.7-0.9) for more creative responses
- Lower `temperature` (0.1-0.3) for more focused responses
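
To see why `temperature` changes how focused the output is, here is a toy illustration of temperature-scaled softmax, the usual way temperature is applied when sampling the next token. It is a sketch, not the sampler that llama.cpp actually runs; `softmax_with_temperature` and `toy_logits` are made-up names and values for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
toy_logits = [2.0, 1.0, 0.5]
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability goes to the top-scoring token, so responses are more predictable. At higher temperatures the probabilities flatten and less likely tokens are picked more often.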

In [None]:
# 👇 EDIT YOUR PROMPT HERE
user_prompt = "Explain what machine learning is in simple terms"

# Configuration (optional - adjust as needed)
temperature = 0.7    # 0.0 = focused, 1.0 = creative
max_tokens = 256     # Maximum response length

# Generate response
print(f"🧑 User: {user_prompt}\n")
print("🤖 Assistant: ", end="")

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": user_prompt}],
    temperature=temperature,
    max_tokens=max_tokens
)

print(response["choices"][0]["message"]["content"])
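
The cell above waits for the whole response before printing it. If you would rather see text as it is generated, `create_chat_completion` also accepts `stream=True` and then yields chunks, each carrying a `delta` with an optional piece of content. The helper below is a minimal sketch; `stream_reply` is a hypothetical name, and it takes the `llm` object from Step 1 as an argument.

```python
def stream_reply(llm, prompt, temperature=0.7, max_tokens=256):
    """Print a reply piece by piece and return the full text."""
    pieces = []
    chunks = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in chunks:
        # The first chunk usually carries only the role; later ones carry text
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            print(text, end="", flush=True)
            pieces.append(text)
    print()
    return "".join(pieces)
```

Call it as `stream_reply(llm, user_prompt)` once the model is loaded.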

## 🔄 Multi-turn Conversation (Optional)

For conversations with context, use this cell:

In [None]:
# Build a conversation with context
conversation = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level programming language."},
    {"role": "user", "content": "What are its main advantages?"}
]

response = llm.create_chat_completion(
    messages=conversation,
    temperature=0.7,
    max_tokens=256
)

print("🤖 Assistant:", response["choices"][0]["message"]["content"])
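
The conversation above is written out by hand. To keep context across several turns, one option is to append each user message and each model reply to the same list. The helper below is a minimal sketch of that pattern; `chat_turn` and `history` are hypothetical names, and the function assumes the `llm` object from Step 1.

```python
def chat_turn(llm, history, user_text, temperature=0.7, max_tokens=256):
    """Append the user message, query the model, and record its reply."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(
        messages=history,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    reply = response["choices"][0]["message"]["content"]
    # Store the reply so the next turn sees the full conversation
    history.append({"role": "assistant", "content": reply})
    return reply
```

Start with `history = []` and call `chat_turn(llm, history, "What is Python?")`, then `chat_turn(llm, history, "What are its main advantages?")`. The history grows with every turn, so a long conversation will eventually exceed the context window set by `n_ctx`. This sketch does not trim old messages.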