# 🚀 Model Setup for Document Search System

This notebook downloads a GGUF model from the Hugging Face Hub and sets it up for the document search system.

## Step 1: Install Required Libraries

In [1]:
!pip install huggingface_hub transformers

Collecting huggingface_hub
  Using cached huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Collecting transformers
  Using cached transformers-4.53.3-py3-none-any.whl.metadata (40 kB)
Collecting hf-xet<2.0.0,>=1.1.2 (from huggingface_hub)
  Using cached hf_xet-1.1.5-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (879 bytes)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/usr/local/lib/python3.10/dist-packages/numpy-1.26.3.dist-info/METADATA'

[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python -m pip install --upgrade pip


## Step 2: Login to Hugging Face (Optional but recommended)

In [3]:
!pip install huggingface_hub

Collecting huggingface_hub
  Using cached huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Using cached huggingface_hub-0.33.4-py3-none-any.whl (515 kB)
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.33.4

[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python -m pip install --upgrade pip


In [5]:
from huggingface_hub import login

# Run this cell to log in to Hugging Face.
# Login is optional for public models but recommended; gated repositories require it.
login()

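For unattended runs, the interactive login widget can be avoided by reading a token from the environment. The following is a minimal sketch, assuming the token is stored in the `HF_TOKEN` environment variable; `read_hf_token` and `login_from_env` are hypothetical helper names, not part of `huggingface_hub`.

```python
import os


def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env():
    """Log in without a prompt when a token is available; otherwise do nothing."""
    token = read_hf_token()
    if token is None:
        print("HF_TOKEN is not set; continuing without login")
        return False
    # Imported lazily so the helper can be defined before the library is installed
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_from_env()` in place of `login()` keeps the cell non-interactive; public models still download when no token is set.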

## Step 3: Create Models Directory

In [6]:
from pathlib import Path

# Create models directory (no error if it already exists)
models_dir = Path("models")
models_dir.mkdir(exist_ok=True)
print(f"✅ Models directory created: {models_dir.absolute()}")

✅ Models directory created: /workspace/models
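The models downloaded in the next step are several gigabytes each, so it may be worth checking free disk space first. This is a rough sketch using only the standard library; `has_free_space` is a hypothetical helper and the 3 GiB threshold is an assumed margin, not a requirement stated by the project.

```python
import shutil
from pathlib import Path


def has_free_space(directory, required_gib):
    """Return True if the filesystem holding `directory` has at least `required_gib` GiB free."""
    usage = shutil.disk_usage(Path(directory))
    return usage.free >= required_gib * 1024**3


# Assumed headroom: somewhat more than the smallest models offered below
if not has_free_space(".", 3):
    print("Less than 3 GiB free; consider a smaller model")
```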


## Step 4: Choose and Download a Model

Select one of the following options based on your needs:

### Option 1: Phi-3 Mini (3.8B) - Recommended for speed

In [7]:
from huggingface_hub import hf_hub_download
import shutil

print("📥 Downloading Phi-3 Mini GGUF model...")

# Download the Phi-3 Mini Q4 GGUF model into models/.
# local_dir alone is enough: recent huggingface_hub releases write real files
# there and ignore the deprecated local_dir_use_symlinks argument.
downloaded_file = hf_hub_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",
    local_dir="models",
)

# Rename to the path the search system expects
target_file = models_dir / "llama-model.gguf"
if target_file.exists():
    target_file.unlink()  # Remove any previously downloaded model

shutil.move(downloaded_file, target_file)
print(f"✅ Phi-3 Mini model ready at: {target_file}")
print(f"📊 Model size: {target_file.stat().st_size / 1024 / 1024 / 1024:.2f} GB")

📥 Downloading Phi-3 Mini GGUF model...

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.

Phi-3-mini-4k-instruct-q4.gguf:   0%|          | 0.00/2.39G [00:00<?, ?B/s]

✅ Phi-3 Mini model ready at: models/llama-model.gguf
📊 Model size: 2.23 GB
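The progress bar reports 2.39G while the script prints 2.23 GB. The two figures can describe the same file: the script divides by 1024³ (binary gibibytes), and the sketch below assumes the progress bar uses decimal prefixes (powers of 1000). The byte count is the rounded value from the progress bar, not an exact file size.

```python
def to_decimal_gb(num_bytes):
    """Size in gigabytes, where 1 GB = 1000**3 bytes."""
    return num_bytes / 1000**3


def to_binary_gib(num_bytes):
    """Size in gibibytes, where 1 GiB = 1024**3 bytes (what the notebook prints as "GB")."""
    return num_bytes / 1024**3


approx_bytes = 2.39e9  # rounded figure taken from the progress bar
print(f"{to_decimal_gb(approx_bytes):.2f} GB")   # 2.39 GB
print(f"{to_binary_gib(approx_bytes):.2f} GiB")  # 2.23 GiB
```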


### Option 2: Direct download with wget - Tries Phi-3 Medium 14B, then Llama 3.2 3B, then Gemma 2 2B

In [10]:
# Alternative: direct download with wget, trying several models in order
import subprocess
import os
from pathlib import Path

print("📥 Downloading better model with wget...")

# Create models directory
models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Candidate models, tried in order until one download succeeds
model_urls = [
    ("Phi-3 Medium 14B", "https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-gguf/resolve/main/Phi-3-medium-4k-instruct-q4.gguf"),
    ("Llama 3.2 3B", "https://huggingface.co/hugging-quants/Llama-3.2-3B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-3b-instruct-q4_k_m.gguf"),
    ("Gemma 2 2B", "https://huggingface.co/google/gemma-2-2b-it-GGUF/resolve/main/2b_it_q4_k_m.gguf")
]

temp_file = models_dir / "temp-model.gguf"

for model_name, url in model_urls:
    print(f"\n🔄 Trying to download {model_name}...")
    try:
        # timeout is in seconds; a slow connection may need a larger value
        result = subprocess.run(
            ["wget", "-O", str(temp_file), url],
            capture_output=True, text=True, timeout=300
        )

        if result.returncode == 0:
            # Success! Move to final location (os.replace overwrites an existing file)
            target_file = models_dir / "llama-model.gguf"
            os.replace(temp_file, target_file)
            file_size_gb = target_file.stat().st_size / 1024 / 1024 / 1024
            print(f"✅ {model_name} downloaded successfully!")
            print(f"📊 Model size: {file_size_gb:.2f} GB")
            print(f"📁 Location: {target_file}")
            break
        else:
            print(f"❌ Failed to download {model_name}")

    except Exception as e:
        print(f"❌ Error downloading {model_name}: {e}")

    # Remove any partial download before trying the next candidate
    temp_file.unlink(missing_ok=True)
else:
    print("\n⚠️ All downloads failed. You may need to manually download a GGUF model.")
    print("Try visiting https://huggingface.co/models?library=gguf and download any instruct model.")

📥 Downloading better model with wget...

🔄 Trying to download Phi-3 Medium 14B...
❌ Failed to download Phi-3 Medium 14B

🔄 Trying to download Llama 3.2 3B...
✅ Llama 3.2 3B downloaded successfully!
📊 Model size: 1.88 GB
📁 Location: models/llama-model.gguf
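A zero exit code from `wget` does not prove the file arrived intact. One way to check is to hash the download and compare the digest with a checksum published for the file, if one is available. This is a minimal sketch using only the standard library; `sha256_of` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB chunks so multi-gigabyte models never sit fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of("models/llama-model.gguf")` returns a 64-character hex string that can be compared by eye or with `==` against the expected value.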


### Option 3: Llama 3.2 1B - Fastest, smallest model

In [None]:
from huggingface_hub import hf_hub_download
import shutil

print("📥 Downloading Llama 3.2 1B GGUF model...")

# Download the Llama 3.2 1B Q4_K_M GGUF model into models/.
# local_dir alone is enough: recent huggingface_hub releases write real files
# there and ignore the deprecated local_dir_use_symlinks argument.
downloaded_file = hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    local_dir="models",
)

# Rename to the path the search system expects
target_file = models_dir / "llama-model.gguf"
if target_file.exists():
    target_file.unlink()  # Remove any previously downloaded model

shutil.move(downloaded_file, target_file)
print(f"✅ Llama 3.2 1B model ready at: {target_file}")
print(f"📊 Model size: {target_file.stat().st_size / 1024 / 1024 / 1024:.2f} GB")

## Step 5: Verify Model Installation

In [16]:
# Check that the model file exists and report its location and size
model_path = models_dir / "llama-model.gguf"

if model_path.exists():
    file_size_gb = model_path.stat().st_size / 1024 / 1024 / 1024
    print("✅ Model successfully installed!")
    print(f"📁 Location: {model_path.absolute()}")
    print(f"📊 Size: {file_size_gb:.2f} GB")
    
    # Try to load the model to verify it works
    try:
        print("\n🧪 Testing model loading...")
        
        # Make the project's modules in the current directory importable
        import sys
        sys.path.append('.')
        
        from config import Config
        from llm_service_cpp import LLMServiceCPP
        
        config = Config()
        llm_service = LLMServiceCPP(config)
        
        if llm_service.check_model_availability():
            print("✅ Model loaded successfully and is working!")
            
            # Test generation. The prompt uses Phi-3-style chat markers;
            # adjust it if you downloaded a different model.
            test_response = llm_service.llm(
                "<|system|>You are a helpful assistant.<|user|>Say hello!<|assistant|>",
                max_tokens=50,
                temperature=0.1
            )
            print(f"🤖 Test response: {test_response['choices'][0]['text'].strip()}")
            
        else:
            print("❌ Model loaded but not responding correctly")
            
    except Exception as e:
        print(f"⚠️  Model file exists but couldn't test loading: {e}")
        print("This might be normal if dependencies aren't fully installed yet.")
        
else:
    print("❌ Model file not found. Please run one of the download options above.")

✅ Model successfully installed!
📁 Location: /workspace/models/llama-model.gguf
📊 Size: 2.23 GB

🧪 Testing model loading...
⚠️  Model file exists but couldn't test loading: No module named 'llama_cpp'
This might be normal if dependencies aren't fully installed yet.
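When the LLM library is not installed yet, the file itself can still be sanity-checked by reading its header. The sketch below assumes the little-endian GGUF layout of version 2 or later: a 4-byte `GGUF` magic, a `uint32` version, then two `uint64` counts. `read_gguf_header` is a hypothetical helper, not part of any library used in this notebook.

```python
import struct
from pathlib import Path


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count), or raise ValueError."""
    with Path(path).open("rb") as handle:
        header = handle.read(24)  # 4-byte magic + uint32 + two uint64 values
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

A truncated download or an HTML error page saved under a `.gguf` name fails the magic check immediately, which is cheaper than waiting for a full model load to fail.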


## Step 6: Install System Dependencies

In [17]:
# Install the main system requirements
print("📦 Installing system requirements...")
!pip install --upgrade pip
!pip install -r requirements.txt

print("\n✅ All dependencies installed!")
print("\n🎉 Setup Complete!")
print("\nNext steps:")
print("1. Build index: python run_search.py --build /path/to/your/documents")
print("2. Start web UI: python run_search.py --web")
print("3. Or search directly: python run_search.py --search 'your question'")

📦 Installing system requirements...
Collecting streamlit>=1.28.0 (from -r requirements.txt (line 6))
  Using cached streamlit-1.47.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain>=0.1.0 (from -r requirements.txt (line 7))
  Using cached langchain-0.3.26-py3-none-any.whl.metadata (7.8 kB)
Collecting chromadb>=0.4.0 (from -r requirements.txt (line 8))
  Using cached chromadb-1.0.15-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting llama-cpp-python>=0.2.0 (from -r requirements.txt (line 9))
  Using cached llama_cpp_python-0.3.14-cp310-cp310-linux_x86_64.whl
Collecting altair<6,>=4.0 (from streamlit>=1.28.0->-r requirements.txt (line 6))
  Using cached altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting blinker<2,>=1.5.0 (from streamlit>=1.28.0->-r requirements.txt (line 6))
  Using cached blinker-1.9.0-py3-none-any.whl.metadata (1.6 kB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit>=1.28.0->-r requirements.txt (line 6))
  Us

## Optional: Test the Complete System

In [14]:
!pip install sentence_transformers

Collecting sentence_transformers
  Using cached sentence_transformers-5.0.0-py3-none-any.whl.metadata (16 kB)
Collecting scikit-learn (from sentence_transformers)
  Using cached scikit_learn-1.7.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Using cached sentence_transformers-5.0.0-py3-none-any.whl (470 kB)
Using cached scikit_learn-1.7.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
Installing collected packages: scikit-learn, sentence_transformers
Successfully installed scikit-learn-1.7.1 sentence_transformers-5.0.0

In [1]:
# Quick system test: imports, CUDA, and configuration (the model itself is not loaded here)
try:
    print("🧪 Testing complete system...")
    
    # Test imports
    from config import Config
    from document_processor import DocumentProcessor
    from embedding_service import EmbeddingService
    from vector_database import VectorDatabase
    from llm_service_cpp import LLMServiceCPP
    
    print("✅ All modules imported successfully")
    
    # Test CUDA availability
    import torch
    if torch.cuda.is_available():
        print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"   CUDA version: {torch.version.cuda}")
        print(f"   GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    else:
        print("⚠️  CUDA not available - will use CPU mode")
    
    # Load the configuration and create its working directories
    config = Config()
    config.create_directories()
    
    print("✅ Configuration loaded")
    print("✅ System ready for document processing!")
    
except Exception as e:
    print(f"❌ System test failed: {e}")
    print("Please check the error and install missing dependencies")

🧪 Testing complete system...
❌ System test failed: No module named 'llama_cpp'
Please check the error and install missing dependencies
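The test above stops at the first failed import, so only one missing package is reported per run. A small sketch that lists every missing top-level module at once, without importing any of them, follows. `missing_modules` is a hypothetical helper, and the module list is an assumption drawn from the packages this notebook installs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, based on the packages installed earlier in this notebook
required = ["torch", "llama_cpp", "sentence_transformers", "chromadb", "streamlit"]
missing = missing_modules(required)
print("Missing:", ", ".join(missing) if missing else "none")
```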


In [2]:
# Build llama-cpp-python from source with CUDA support, targeting the RTX 4090.
# The compilation itself runs on the CPU; the flags below make it produce CUDA kernels.
print("📦 Installing llama-cpp-python with CUDA support...")
print("Targeting RTX 4090 (compute capability 8.9)...")

# These variables are inherited by the pip process started with "!" below
import os
os.environ['CMAKE_ARGS'] = '-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=89'  # RTX 4090 is compute capability 8.9
os.environ['CUDACXX'] = '/usr/local/cuda/bin/nvcc'  # Ensure CUDA compiler is found
os.environ['CMAKE_BUILD_PARALLEL_LEVEL'] = '8'  # Use 8 parallel compilation jobs

!pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir --verbose

print("✅ llama-cpp-python install command finished; check the log above for build errors")

📦 Installing llama-cpp-python with CUDA support...
Targeting RTX 4090 (compute capability 8.9)...
Using pip 25.1.1 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.3.14.tar.gz (51.0 MB)
  Running command pip subprocess to install build dependencies
  Using pip 25.1.1 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting scikit-build-core>=0.9.2 (from scikit-build-core[pyproject]>=0.9.2)
    Obtaining dependency information for scikit-build-core>=0.9.2 from https://files.pythonhosted.org/packages/45/23/0ffa0df7550ca0535f6e03b9a9ab2bf0495ac62e15fd322544c98321a10c/scikit_build_core-0.11.5-py3-none-any.whl.metadata
    Using cached scikit_build_core-0.11.5-py3-none-any.whl.
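The architecture flag depends on the GPU, so it can help to derive it from the compute capability instead of hard-coding it. This is a sketch under stated assumptions: `CMAKE_CUDA_ARCHITECTURES` is CMake's variable for target architectures, `-DGGML_CUDA=on` is the switch used in the cell above, and `cuda_cmake_args` is a hypothetical helper.

```python
import shutil


def cuda_cmake_args(compute_capability):
    """Turn a compute capability such as "8.9" into CMake flags for a CUDA build."""
    major, minor = compute_capability.split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES={arch}"


print(cuda_cmake_args("8.9"))

# The build also needs the CUDA compiler; report whether it is on PATH
nvcc_path = shutil.which("nvcc")
print("nvcc:", nvcc_path or "not found on PATH")
```

The returned string can be assigned to `os.environ['CMAKE_ARGS']` before running the `pip install` command.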

In [3]:
# Test the complete system with llama-cpp-python installed
try:
    print("🧪 Testing complete system with llama-cpp-python...")
    
    # Test imports
    from config import Config
    from document_processor import DocumentProcessor
    from embedding_service import EmbeddingService
    from vector_database import VectorDatabase
    from llm_service_cpp import LLMServiceCPP
    
    print("✅ All modules imported successfully")
    
    # Test CUDA availability
    import torch
    if torch.cuda.is_available():
        print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"   CUDA version: {torch.version.cuda}")
        print(f"   GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    else:
        print("⚠️  CUDA not available - will use CPU mode")
    
    # Load the configuration and create its working directories
    config = Config()
    config.create_directories()
    
    print("✅ Configuration loaded")
    
    # Test LLM service (this is the step that loads the model)
    print("🤖 Testing LLM service...")
    llm_service = LLMServiceCPP(config)
    
    if llm_service.check_model_availability():
        print("✅ LLM model loaded and working!")
        
        # Test a simple generation. The prompt uses Phi-3-style chat markers,
        # and max_tokens=20 can cut the reply off mid-sentence.
        test_response = llm_service.llm(
            "<|system|>You are a helpful assistant.<|user|>Say 'Hello, I am working!'<|assistant|>",
            max_tokens=20,
            temperature=0.1
        )
        print(f"🤖 Test response: {test_response['choices'][0]['text'].strip()}")
        
    else:
        print("❌ LLM model not responding correctly")
    
    print("✅ System ready for document processing!")
    print("\n🎉 Complete setup successful!")
    print("\nNext steps:")
    print("1. Build index: python run_search.py --build /path/to/your/documents")
    print("2. Start web UI: python run_search.py --web")
    print("3. Or search directly: python run_search.py --search 'your question'")
    
except Exception as e:
    print(f"❌ System test failed: {e}")
    import traceback
    traceback.print_exc()
    print("\nTroubleshooting:")
    print("- Make sure all dependencies are installed")
    print("- Check that the model file exists at models/llama-model.gguf")
    print("- Verify CUDA installation if using GPU")

🧪 Testing complete system with llama-cpp-python...
✅ All modules imported successfully
✅ CUDA available: NVIDIA GeForce RTX 4090
   CUDA version: 12.6
   GPU memory: 23.6 GB
✅ Configuration loaded
🤖 Testing LLM service...

llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility

✅ LLM model loaded and working!
🤖 Test response: Hello! I'm an AI here to assist you. While I don't work in
✅ System ready for document processing!

🎉 Complete setup successful!

Next steps:
1. Build index: python run_search.py --build /path/to/your/documents
2. Start web UI: python run_search.py --web
3. Or search directly: python run_search.py --search 'your question'
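The test prompts in this notebook place the chat markers inline with no turn terminators. As a sketch, assuming the Phi-3 Mini model from Option 1 and the chat layout described on its model card (each turn on its own line and closed with `<|end|>`), a prompt could be assembled as follows. `build_phi3_prompt` is a hypothetical helper, and other models such as Llama 3.2 expect a different template.

```python
def build_phi3_prompt(user_message, system_message="You are a helpful assistant."):
    """Assemble a single-turn prompt in the assumed Phi-3 chat layout."""
    return (
        f"<|system|>\n{system_message}<|end|>\n"
        f"<|user|>\n{user_message}<|end|>\n"
        f"<|assistant|>\n"
    )


print(build_phi3_prompt("Say hello!"))
```

If `llm_service.llm` is a llama-cpp-python model object (an assumption about the project's `LLMServiceCPP`), the result can be passed as the prompt together with `stop=["<|end|>"]` so that generation ends at the end of the assistant turn.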
