# 3.6 Open-Source Alternative LLMs: Transformers, Ollama & HF Inference

## Playground Notebook

Three ways to run LLMs locally or for free:

| Tool | Setup | Cost | Speed | Internet |
|------|-------|------|-------|----------|
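| **Transformers** | `pip install` | Free | Depends on local hardware | Works offline once the model is cached |
| **Ollama** | Download app | Free | Depends on local hardware | Works offline once the model is pulled |
| **HF Inference** | Free API token | Free tier | Depends on network and server load | Requires internet |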

---

In [24]:
import time
import requests
from IPython.display import display, Markdown, HTML

print("✅ Environment ready")

✅ Environment ready


In [25]:
# ============================================================
#  HELPER FUNCTIONS
# ============================================================

def print_block(title, width=60):
    """Pretty print a section header."""
    sep = "=" * width
    print(f"\n{sep}")
    print(f"  {title}")
    print(f"{sep}")


def benchmark(name, func, *args, **kwargs):
    """Run a function and measure execution time."""
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    print(f"\n⏱️  {name}: {elapsed:.2f}s")
    return result, elapsed


def show_result(result, prefix=""):
    """Display result in a readable format."""
    if isinstance(result, (list, tuple)):
        for i, item in enumerate(result, 1):
            print(f"{prefix}[{i}] {item}")
    else:
        print(f"{prefix}{result}")


print("✅ Helper functions loaded")

✅ Helper functions loaded
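
The `benchmark` helper above times a single call with `time.time()`. Some of the calls timed later in this notebook finish in hundredths of a second, where one wall-clock sample can be noisy. A minimal sketch of a variant follows; the name `benchmark_repeat` and the default of 5 repeats are assumptions for illustration, not part of the notebook. It uses `time.perf_counter()` and reports the best and mean time.

```python
import statistics
import time


def benchmark_repeat(name, func, *args, repeats=5, **kwargs):
    """Call func several times; return (last_result, list_of_timings)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    print(f"{name}: best {min(timings):.4f}s, mean {statistics.mean(timings):.4f}s")
    return result, timings


# Usage with a stand-in workload (not a model call)
total, timings = benchmark_repeat("sum 0..99999", sum, range(100_000), repeats=3)
```

The first timing of a model call usually includes one-off warm-up work, so the best time is often a fairer figure than the first.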


---

## 1. HuggingFace Transformers: Local Pipeline

**Install:** `pip install transformers torch`

In [26]:
try:
    from transformers import pipeline
    print("✅ Transformers available")
except ImportError:
    print("❌ Transformers not installed. Run: pip install transformers torch")

✅ Transformers available


### GPU Diagnostic (run this first if the pipeline falls back to CPU)

In [27]:
import torch
import subprocess

print_block("GPU Diagnostic")

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"CUDA Version: {torch.version.cuda}")

if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("\n⚠️  GPU not detected. To install CUDA-enabled PyTorch:")
    print("\nRun this in a terminal:")
    print('pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118')
    print("\nThen restart the kernel (Kernel → Restart Kernel)")


  GPU Diagnostic
PyTorch Version: 2.6.0
CUDA Available: False
CUDA Version: None

⚠️  GPU not detected. To install CUDA-enabled PyTorch:

Run this in a terminal:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Then restart the kernel (Kernel → Restart Kernel)
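
The diagnostic above checks CUDA only. On Apple Silicon, PyTorch exposes the GPU through its MPS backend instead, so a CUDA-only check reports no GPU there. A minimal sketch of a broader check follows; the helper name `pick_device` is hypothetical, and it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed: nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # Apple Silicon GPUs are exposed through the MPS backend;
    # getattr guards against older PyTorch builds that lack it.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device_name = pick_device()
print(f"Selected device: {device_name}")
```

Recent versions of `pipeline` accept a device string such as this one in place of the `0` / `-1` integers, but check the version you have installed before relying on it.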


### Experiment 1A: Text Generation

In [28]:
from transformers import pipeline
import torch

print_block("1A: Text Generation (DistilGPT-2) - Optimized")

# Check if GPU is available
device = 0 if torch.cuda.is_available() else -1
gpu_status = "✅ GPU" if device == 0 else "❌ CPU only"
print(f"Device: {gpu_status}\n")

# Load the model once; later runs of this cell reuse the existing pipeline.
# globals() returns a dict, so test membership rather than hasattr().
if 'generator' not in globals():
    print("Loading model (first time only)...")
    generator = pipeline(
        'text-generation',
        model='distilgpt2',
        device=device
    )
    print("✅ Model cached for future use\n")

prompt = "The future of artificial intelligence"

# Optimized generation parameters
result, elapsed = benchmark(
    "Generation",
    generator,
    prompt,
    max_new_tokens=30,         # Fixed: use max_new_tokens instead of max_length
    num_return_sequences=1,
    temperature=0.7,            # Lower = more consistent
    top_p=0.9
)

print(f"\n📝 Input: {prompt}")
print(f"\n📄 Output:")
print("-" * 60)
print(result[0]['generated_text'].strip())
print("-" * 60)

print(f"\n💡 Tips to make it faster:")
print("   1. GPU: Install NVIDIA drivers + 'pip install torch --upgrade'")
print("   2. First run caches model - subsequent runs are much faster")
print("   3. Reduce max_new_tokens for quicker generation")


  1A: Text Generation (DistilGPT-2) - Optimized
Device: ❌ CPU only

Loading model (first time only)...


Loading weights:   0%|          | 0/76 [00:00<?, ?it/s]

GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


✅ Model cached for future use


⏱️  Generation: 1.47s

📝 Input: The future of artificial intelligence

📄 Output:
------------------------------------------------------------
The future of artificial intelligence is in its infancy. The future of artificial intelligence is in its infancy. The future of artificial intelligence is in its infancy. The future of artificial intelligence
------------------------------------------------------------

💡 Tips to make it faster:
   1. GPU: Install NVIDIA drivers + 'pip install torch --upgrade'
   2. First run caches model - subsequent runs are much faster
   3. Reduce max_new_tokens for quicker generation
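
The `temperature` and `top_p` arguments in the cell above only change the output when sampling is enabled (`do_sample=True` in `generate`); whether the pipeline samples by default depends on the installed version. A sentence repeated in a loop, as in the output above, is what greedy decoding tends to produce. As a rough illustration of what the two settings do, here is a plain-Python sketch over a made-up four-token vocabulary. The logits are hypothetical and this is not the library's implementation.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


# Hypothetical logits for a 4-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
probs = apply_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)
print("after temperature:", [round(p, 3) for p in probs])
print("nucleus token ids:", sorted(nucleus))
```

A sampler would then draw the next token from `nucleus` rather than always taking the single most likely token.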


In [29]:
# Check for GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

# To enable GPU, install:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

PyTorch version: 2.6.0
CUDA available: False
CUDA device: None


### Experiment 1B: Sentiment Analysis

In [30]:
print_block("1B: Sentiment Analysis (DistilBERT)")

sentiment_analyzer = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english'
)

texts = [
    "This product is amazing!",
    "Terrible experience, very disappointed.",
    "It's okay, nothing special."
]

for text in texts:
    result, elapsed = benchmark(
        f"Analyze: '{text}'...",
        sentiment_analyzer,
        text
    )
    
    sentiment = result[0]['label']
    score = result[0]['score']
    print(f"   → {sentiment} ({score:.2%})")


  1B: Sentiment Analysis (DistilBERT)


Loading weights:   0%|          | 0/104 [00:00<?, ?it/s]


⏱️  Analyze: 'This product is amazing!'...: 0.07s
   → POSITIVE (99.99%)

⏱️  Analyze: 'Terrible experience, very disappointed.'...: 0.01s
   → NEGATIVE (99.98%)

⏱️  Analyze: 'It's okay, nothing special.'...: 0.02s
   → NEGATIVE (81.90%)
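
The SST-2 model used above has only two labels, POSITIVE and NEGATIVE, so the neutral sentence is forced into one of them and comes back NEGATIVE with a visibly lower score. One simple way to surface such cases is to treat low-confidence predictions as uncertain. The sketch below uses the scores recorded above; the helper name `label_with_threshold` and the 0.90 cut-off are assumptions for illustration and would need tuning on real data.

```python
def label_with_threshold(prediction, threshold=0.90):
    """Return the predicted label, or 'UNCERTAIN' when the score is below the threshold."""
    if prediction["score"] < threshold:
        return "UNCERTAIN"
    return prediction["label"]


# Labels and scores copied from the run above
recorded = [
    ("This product is amazing!", {"label": "POSITIVE", "score": 0.9999}),
    ("Terrible experience, very disappointed.", {"label": "NEGATIVE", "score": 0.9998}),
    ("It's okay, nothing special.", {"label": "NEGATIVE", "score": 0.8190}),
]

labels = [label_with_threshold(pred) for _, pred in recorded]
for (text, _), label in zip(recorded, labels):
    print(f"{label:<10} {text}")
```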


### Experiment 1C: Question Answering

In [31]:
print_block("1C: Question Answering (DistilBERT)")

qa = pipeline(
    'question-answering',
    model='distilbert-base-cased-distilled-squad'
)

context = "Python is a programming language created by Guido van Rossum in 1989."
question = "Who created Python?"

print(f"📚 Context: {context}")
print(f"❓ Question: {question}")

result, elapsed = benchmark(
    "Q&A",
    qa,
    question=question,
    context=context
)

print(f"\n💡 Answer: {result['answer']} (confidence: {result['score']:.2%})")


  1C: Question Answering (DistilBERT)


Loading weights:   0%|          | 0/102 [00:00<?, ?it/s]

📚 Context: Python is a programming language created by Guido van Rossum in 1989.
❓ Question: Who created Python?

⏱️  Q&A: 0.09s

💡 Answer: Guido van Rossum (confidence: 99.55%)


---

## 2. Ollama: Run LLMs Locally

**Install:** Download from [ollama.ai](https://ollama.ai)

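**Run:** start the server with `ollama serve`, then in another terminal pull the model used below with `ollama pull qwen2.5:1.5b` (or chat with it interactively via `ollama run qwen2.5:1.5b`)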

In [32]:
# Check if Ollama is available
try:
    import ollama
    print("✅ Ollama package available")
    HAS_OLLAMA = True
except ImportError:
    print("❌ Ollama package not installed. Install with: pip install ollama")
    HAS_OLLAMA = False

✅ Ollama package available


### Experiment 2A: Ollama Text Generation

In [33]:
def ollama_generate(prompt, model='qwen2.5:1.5b'):
    """Generate text using Ollama."""
    response = ollama.generate(model=model, prompt=prompt)
    return response['response']


print_block("2A: Ollama Text Generation")

if HAS_OLLAMA:
    prompt = "Write a haiku about coding:"
    result, elapsed = benchmark(
        "Ollama generate",
        ollama_generate,
        prompt
    )
    
    print(f"\n📝 Prompt: {prompt}")
    print(f"\n📄 Response:")
    print("-" * 60)
    print(result.strip())
    print("-" * 60)
else:
    print("⚠️  Ollama package not available. Skipping this experiment.")


  2A: Ollama Text Generation

⏱️  Ollama generate: 2.40s

📝 Prompt: Write a haiku about coding:

📄 Response:
------------------------------------------------------------
Byte by byte,
Lines of code come alive,
Programs run free.
------------------------------------------------------------
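
The `ollama` Python package talks to the local Ollama server over HTTP. As a sketch of the same call without the package, the snippet below builds a non-streaming request for the server's `/api/generate` endpoint using only the standard library. It assumes the server is listening on its default address `localhost:11434`; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_request(prompt, model="qwen2.5:1.5b"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_generate_http(prompt, model="qwen2.5:1.5b", timeout=60):
    """Return the generated text, or None if the request fails."""
    request = build_generate_request(prompt, model)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["response"]
    except (urllib.error.URLError, OSError):
        return None  # server not running, HTTP error status, or timeout


# Inspect the request without sending it
request = build_generate_request("Write a haiku about coding:")
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Calling `ollama_generate_http("Write a haiku about coding:")` sends the request; it returns `None` rather than raising when no server is reachable.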


### Experiment 2B: List Available Models

In [38]:
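def ollama_list_models():
    """Get list of installed Ollama models (empty list if the call fails)."""
    try:
        models_list = ollama.list()
        return models_list.get('models', [])
    except Exception:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit still propagate.
        return []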


print_block("2B: List Available Models")

if HAS_OLLAMA:
    models = ollama_list_models()
    if models:
        print(f"\n🤖 Installed Models:")
        for model in models:
            name = model.get('model', 'Unknown')
            size = model.get('size', 0)
            size_gb = size / (1024**3)
            print(f"   • {name} ({size_gb:.1f} GB)")
    else:
        print("   No models installed. Run: ollama pull qwen2.5:1.5b")
else:
    print("⚠️  Ollama package not available.")


  2B: List Available Models

🤖 Installed Models:
   • gpt-oss:20b (12.8 GB)
   • qwen2.5:1.5b (0.9 GB)
   • gemma3n:e2b (5.2 GB)
   • deepseek-r1:8b (4.6 GB)
   • nomic-embed-text:latest (0.3 GB)
   • jina/jina-embeddings-v2-base-en:latest (0.3 GB)
   • all-minilm:latest (0.0 GB)
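
The last entry shows `0.0 GB` because the cell always divides by `1024**3` and rounds to one decimal, so a small embedding model rounds down to zero. A small sketch of a formatter that picks the unit from the size follows; `format_size` is a hypothetical helper, and the byte counts are made up only to exercise each unit.

```python
def format_size(num_bytes):
    """Format a byte count using the largest binary unit that keeps the value below 1024."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the last unit available
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


# Hypothetical byte counts, not the sizes of the models listed above
sizes = [format_size(n) for n in (512, 48_000_000, 986_000_000, 13_700_000_000)]
print(sizes)
```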


---

## 3. HuggingFace Inference API: Free Cloud Endpoint

**Setup:** Get a free token from [huggingface.co](https://huggingface.co) → Settings → Access Tokens

In [35]:
import os
from dotenv import load_dotenv
from huggingface_hub import InferenceClient

# Load variables from .env file
load_dotenv()

# Get HF token from .env or environment variable
HF_TOKEN = os.getenv('HUGGING_FACE_HUB_TOKEN')

# Initialize the official HF Inference Client
hf_client = InferenceClient(token=HF_TOKEN) if HF_TOKEN else None

def hf_inference(text, model_name='openai-community/gpt2', task='text_generation'):
    """Call HuggingFace Inference API using the official client."""
    if not hf_client:
        return {"error": "HF_TOKEN not set"}
    try:
        if task == 'text_generation':
            result = hf_client.text_generation(text, model=model_name, max_new_tokens=50)
            return [{"generated_text": text + result}]
        elif task == 'text_classification':
            result = hf_client.text_classification(text, model=model_name)
            return result
        else:
            return {"error": f"Unknown task: {task}"}
    except Exception as e:
        return {"error": str(e)}


print_block("3. HuggingFace Inference API")
token_status = "✅ Set from .env" if HF_TOKEN else "❌ Not set"
print(f"\n🔑 Token status: {token_status}")
if not HF_TOKEN:
    print("   Create .env file with: HUGGING_FACE_HUB_TOKEN=your_token_here")


  3. HuggingFace Inference API

🔑 Token status: ✅ Set from .env


### Experiment 3A: Sentiment Analysis via API

In [36]:
print_block("3A: HF API Sentiment Analysis")

if not HF_TOKEN:
    print("⚠️  Please set HUGGING_FACE_HUB_TOKEN in .env file.")
else:
    texts = [
        "I love this!",
        "This is terrible."
    ]
    
    for text in texts:
        # NOTE: no task is passed here, so hf_inference uses its default
        # task='text_generation' and the API rejects the classification model
        # (see the output). Add task='text_classification' to classify.
        result, elapsed = benchmark(
            f"Sentiment: '{text}'...",
            hf_inference,
            text,
            'distilbert/distilbert-base-uncased-finetuned-sst-2-english'
        )
        
        if isinstance(result, dict) and 'error' in result:
            print(f"   → ❌ {result['error']}")
        elif isinstance(result, list) and len(result) > 0:
            # HF returns [[{label, score}, ...]] for classification
            top = result[0] if isinstance(result[0], dict) else result[0][0]
            print(f"   → {top.get('label', '?')} ({top.get('score', 0):.2%})")
        else:
            print(f"   → {result}")


  3A: HF API Sentiment Analysis

⏱️  Sentiment: 'I love this!'...: 0.53s
   → ❌ Model 'distilbert/distilbert-base-uncased-finetuned-sst-2-english' doesn't support task 'text-generation'. Supported tasks: 'text-classification', got: 'text-generation'

⏱️  Sentiment: 'This is terrible.'...: 0.27s
   → ❌ Model 'distilbert/distilbert-base-uncased-finetuned-sst-2-english' doesn't support task 'text-generation'. Supported tasks: 'text-classification', got: 'text-generation'
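
Both calls fail for the same reason: the model name is passed positionally but `task` is left at its default, `'text_generation'`, so a classification model is sent a generation request. The sketch below reproduces that dispatch with a stand-in client that makes no network calls. `StubClient` and `run_task` are hypothetical names, and the canned return values are placeholders, not real predictions.

```python
class StubClient:
    """Stand-in for an inference client; returns canned values and makes no network calls."""

    def text_generation(self, text, model, max_new_tokens=50):
        return " (generated continuation)"

    def text_classification(self, text, model):
        return [{"label": "POSITIVE", "score": 0.5}]  # placeholder, not a real prediction


def run_task(client, text, model_name, task="text_generation"):
    """Dispatch on task the same way hf_inference does."""
    if task == "text_generation":
        return [{"generated_text": text + client.text_generation(text, model=model_name)}]
    if task == "text_classification":
        return client.text_classification(text, model=model_name)
    return {"error": f"Unknown task: {task}"}


model = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
client = StubClient()

# Model passed positionally, task omitted: falls back to text generation
without_task = run_task(client, "I love this!", model)
# Task passed explicitly: reaches the classification branch
with_task = run_task(client, "I love this!", model, task="text_classification")

print("without task ->", list(without_task[0]))
print("with task    ->", list(with_task[0]))
```

In the notebook, `benchmark` forwards keyword arguments to the function it times, so adding `task='text_classification'` to the `benchmark(...)` call should route the request to the classification branch of `hf_inference`.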


---

## Comparison

| Aspect | Transformers | Ollama | HF API |
|--------|--------------|--------|--------|
| **Setup** | pip install | Download app | Free API token |
| **Cost** | Free | Free | Free tier |
| **Speed** | Depends on local hardware | Depends on local hardware | Adds network latency |
| **Offline** | ✅ Yes | ✅ Yes | ❌ No |
| **Customizable** | ✅ Full fine-tuning | ⚠️ Limited | ❌ No |
| **Memory** | Depends on model | Depends on model | None (cloud) |
| **Best for** | Production, learning | Local development | Quick testing |

---

## Key Takeaways

| Concept | Remember |
|---------|----------|
| **Transformers** | Ready-made pipelines for common NLP tasks, fully local |
| **Ollama** | One-command LLM setup, great for development |
| **HF Inference** | Quick cloud testing without managing infrastructure |
| **No API keys** | Transformers and Ollama run locally and need no API key |
| **Batch processing** | Pass a list of inputs in one call instead of looping over single inputs |
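
Transformers pipelines accept a list of strings as well as a single string, which is what the batch-processing takeaway refers to. For long input lists it can help to send fixed-size chunks. The sketch below shows only the chunking pattern: `fake_classifier` is a stand-in for a real model call, and the chunk size of 2 is arbitrary.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_classifier(batch):
    """Stand-in for a model call that accepts a list and returns one result per input."""
    return [{"text": text, "length": len(text)} for text in batch]


texts = ["I love this!", "This is terrible.", "It's okay, nothing special."]
calls = 0
results = []
for batch in chunked(texts, size=2):
    calls += 1  # one call per batch instead of one per text
    results.extend(fake_classifier(batch))

print(f"{len(results)} results from {calls} calls")
```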