In this notebook we load the GGUF model, which can run on the CPU.

In [None]:
# Install llama-cpp-python
!pip install llama-cpp-python

# Download your GGUF from the Q4_K_M repo
!huggingface-cli download lippa6602/llama-3.2-1b-finetome-optimized-Q4_K_M-GGUF \
    --include "*.gguf" \
    --local-dir ./model

# Check what was downloaded
import os
print("Downloaded files:")
for f in os.listdir("./model"):
    if f.endswith('.gguf'):
        size = os.path.getsize(f"./model/{f}") / (1024**2)
        print(f"  {f}: {size:.1f} MB")

# Load and test
from llama_cpp import Llama

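# Find the GGUF file; fail early with a clear message if the download produced none
import glob
gguf_files = sorted(glob.glob("./model/*.gguf"))
if not gguf_files:
    raise FileNotFoundError("No .gguf file found in ./model; check the download step above")
gguf_file = gguf_files[0]
print(f"\nLoading: {gguf_file}")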

llm = Llama(
    model_path=gguf_file,
    n_ctx=2048,       # context window in tokens
    n_threads=2,      # CPU threads used for inference
    n_gpu_layers=0,   # 0 = keep every layer on the CPU
    verbose=False
)

print("✓ Model loaded!\n")

# Test generation
prompt = "What is machine learning?"
formatted = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

output = llm(formatted, max_tokens=150, temperature=0.7, stop=["<|eot_id|>"])
print(f"User: {prompt}")
print(f"\nAssistant: {output['choices'][0]['text'].strip()}")
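
The cell above writes the Llama 3 prompt template by hand for a single user turn. As a minimal sketch, the same string can be built from a list of messages, which makes it easier to add a system prompt or earlier turns. The helper name `format_llama3_prompt` is hypothetical and not part of llama-cpp-python; the special tokens are the ones used in the cell above.

```python
def format_llama3_prompt(messages):
    """Build a Llama 3 style prompt from a list of {"role", "content"} dicts."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Leave an open assistant header so the model writes the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = format_llama3_prompt(
    [{"role": "user", "content": "What is machine learning?"}]
)
print(example)
```

For a single user message this produces the same string as `formatted` above. llama-cpp-python also offers `llm.create_chat_completion(messages=...)`, which accepts a message list directly and may be the simpler option when no custom template is needed.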

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.16.tar.gz (50.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.7/50.7 MB 8.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 2.9 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.3.16-cp312-cp312-linux_x86_64.whl size=4422319 sha256=81e805952a6cccfdc03

llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
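
The warning above is expected: `n_ctx=2048` is far below the 131072-token context the model was trained with, and a small window keeps memory use low on a CPU. The sketch below gives a rough estimate of how the key/value cache grows with context length. The architecture figures (16 layers, 8 key/value heads, head size 64) and the 2-byte cache entries are assumptions about this 1B model and the default cache type, not values read from the GGUF file, so treat the numbers as an order-of-magnitude guide.

```python
def kv_cache_bytes(n_ctx, n_layers=16, n_kv_heads=8, head_dim=64, bytes_per_value=2):
    """Rough KV cache size in bytes; the default arguments are assumed, not measured."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


for n_ctx in (2048, 131072):
    print(f"n_ctx={n_ctx}: {kv_cache_bytes(n_ctx) / 1024**2:.0f} MiB")
```

Under these assumptions the cache needs about 64 MiB at 2048 tokens and about 4096 MiB at the full 131072 tokens, which is why a reduced `n_ctx` is a sensible choice here.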


✓ Model loaded!

User: What is machine learning?

Assistant: Machine learning (ML) is a branch of computer science that uses algorithms to learn and make predictions about data without being explicitly programmed to do so. It is a subfield of artificial intelligence (AI) that involves the use of computers to simulate how the human brain works. The goal of machine learning is to develop algorithms that can automatically learn and improve based on data without being explicitly programmed to do so. This allows for machines to learn from data and make predictions without requiring human intervention. This process is known as machine learning because the machine is doing the learning instead of the programmer. Machine learning is used in a wide range of applications, from image recognition and speech recognition to predictive analytics and natural language processing.

The development of machine learning has revolutionized many industries
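
The answer above stops mid-sentence, most likely because generation reached `max_tokens=150` before the model emitted `<|eot_id|>`. A minimal sketch of how to detect this: the completion dict returned by `llm(...)` follows the OpenAI-style layout, where each choice carries a `finish_reason` of `"stop"` or `"length"`. The helper `was_truncated` and the `sample` dict are hypothetical stand-ins so the snippet runs without a loaded model.

```python
def was_truncated(completion):
    """Return True when generation ended because the token limit was reached."""
    return completion["choices"][0].get("finish_reason") == "length"


# Hypothetical stand-in with the same shape as the dict returned by llm(...)
sample = {
    "choices": [
        {
            "text": "The development of machine learning has revolutionized many industries",
            "finish_reason": "length",
        }
    ]
}

if was_truncated(sample):
    print("Output was cut off; consider raising max_tokens.")
```

With the real model, pass `output` from the cell above instead of `sample`.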
