# Build llcuda v2.1.0 Binaries for Tesla T4 (Google Colab)

**Purpose**: Build complete CUDA 12 binaries for llcuda v2.1.0 on a Google Colab Tesla T4 GPU

**Output**:
1. llama.cpp binaries (264 MB) - HTTP server mode with FlashAttention
2. Complete package: `llcuda-binaries-cuda12-t4-v2.1.0.tar.gz`

**Important Notes**:
- These binaries are optimized for Tesla T4 (SM 7.5)
- Includes FlashAttention v2, CUDA Graphs, and Tensor Core optimizations
- Compatible with v2.1.0 Python APIs and Unsloth integration

**Requirements**:
- Google Colab with Tesla T4 GPU
- CUDA 12.x (pre-installed in Colab)
- Python 3.10+

**Estimated Time**: ~15 minutes

---

## Step 1: Verify GPU and Environment

In [None]:
# Check GPU
!nvidia-smi --query-gpu=name,compute_cap,driver_version,memory.total --format=csv

In [None]:
# Verify CUDA version
!nvcc --version

In [None]:
# Check Python version
import sys
print(f"Python: {sys.version}")
print(f"Expected: 3.10+ (Colab default)")

In [None]:
# Verify compute capability
import subprocess

result = subprocess.run(
    ['nvidia-smi', '--query-gpu=compute_cap', '--format=csv,noheader'],
    capture_output=True,
    text=True
)
compute_cap = result.stdout.strip()
major, minor = map(int, compute_cap.split('.'))

print(f"Compute Capability: SM {major}.{minor}")

if (major, minor) == (7, 5):
    print("✓ Tesla T4 detected - Perfect for llcuda v2.1.0!")
elif (major, minor) > (7, 5):
    # Tuple comparison also accepts newer GPUs such as SM 8.0, where minor < 5
    print(f"✓ SM {major}.{minor} detected - Compatible with llcuda v2.1.0")
else:
    print(f"⚠ WARNING: SM {major}.{minor} is below SM 7.5 (T4)")
    print("llcuda v2.1.0 requires SM 7.5+ for Tensor Cores and FlashAttention")
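
The capability check compares a (major, minor) pair against 7.5. Comparing the pair as a tuple avoids misclassifying GPUs such as SM 8.0, whose minor version is below 5. The sketch below also handles the one-line-per-GPU output that `nvidia-smi` prints on multi-GPU machines. The helper names `parse_compute_caps` and `meets_minimum` are illustrative and not part of llcuda.

```python
def parse_compute_caps(nvidia_smi_output: str) -> list[tuple[int, int]]:
    """Parse `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` output.

    Returns one (major, minor) tuple per GPU line; blank lines are skipped.
    """
    caps = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def meets_minimum(cap: tuple[int, int], minimum: tuple[int, int] = (7, 5)) -> bool:
    # Tuples compare lexicographically: major first, then minor.
    return cap >= minimum


for cap in parse_compute_caps("7.5\n8.0\n7.0\n"):
    status = "OK" if meets_minimum(cap) else "below SM 7.5"
    print(f"SM {cap[0]}.{cap[1]}: {status}")
```

In the notebook, the `result.stdout` string from the cell above could be passed to `parse_compute_caps` directly.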

## Step 2: Clone llama.cpp Repository

We'll build llama.cpp with CUDA 12 support, FlashAttention, and optimizations for Tesla T4.

In [None]:
# Clone llama.cpp
%cd /content
!git clone https://github.com/ggml-org/llama.cpp.git
%cd llama.cpp

In [None]:
# Check llama.cpp version
!git log --oneline -5

## Step 3: Configure and Build llama.cpp for Tesla T4

**Build Configuration**:
- **Target**: Tesla T4 (SM 7.5)
- **CUDA**: 12.x
- **FlashAttention**: Enabled (2-3x faster)
- **CUDA Graphs**: Enabled (20-40% latency reduction)
- **Tensor Cores**: Optimized for mixed precision
- **Shared Libraries**: Enabled for dynamic loading

In [None]:
# Configure llama.cpp for Tesla T4 with all optimizations
!cmake -B build_cuda12_t4 \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="75" \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DGGML_NATIVE=OFF \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_CUDA_GRAPHS=ON \
    -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_TOOLS=ON \
    -DLLAMA_CURL=ON \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_RPATH='$ORIGIN/../lib' \
    -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON
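
One possible way to keep the configure options and the later build record in sync is to hold the options in a dictionary and generate the `cmake` command line from it. This is a minimal sketch that covers only a subset of the flags above; `CMAKE_OPTIONS` and `cmake_configure_args` are hypothetical names introduced for illustration.

```python
import shlex

# Subset of the options from the configure cell, kept in one place.
CMAKE_OPTIONS = {
    "CMAKE_BUILD_TYPE": "Release",
    "GGML_CUDA": "ON",
    "CMAKE_CUDA_ARCHITECTURES": "75",
    "GGML_CUDA_FA_ALL_QUANTS": "ON",
    "GGML_CUDA_GRAPHS": "ON",
    "BUILD_SHARED_LIBS": "ON",
}


def cmake_configure_args(build_dir: str, options: dict[str, str]) -> list[str]:
    """Build the argv list for `cmake -B <build_dir> -D<key>=<value> ...`."""
    args = ["cmake", "-B", build_dir]
    args += [f"-D{key}={value}" for key, value in options.items()]
    return args


args = cmake_configure_args("build_cuda12_t4", CMAKE_OPTIONS)
print(shlex.join(args))
```

The resulting list can be handed to `subprocess.run(args, check=True, cwd="/content/llama.cpp")`, and the same dictionary can be written into a build-information file so the two never drift apart.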

In [None]:
# Build llama.cpp (takes ~10 minutes)
import time

print("Building llama.cpp with CUDA 12 + FlashAttention...")
print("Estimated time: 10-12 minutes\n")
start_time = time.time()

!cmake --build build_cuda12_t4 --config Release -j$(nproc)

elapsed = time.time() - start_time
print(f"\n✓ Build completed in {elapsed/60:.1f} minutes")

In [None]:
# Verify binaries were built successfully
print("=== Built Binaries ===")
!ls -lh build_cuda12_t4/bin/llama-server
!ls -lh build_cuda12_t4/bin/llama-cli
!ls -lh build_cuda12_t4/bin/llama-quantize
!ls -lh build_cuda12_t4/bin/llama-embedding
!ls -lh build_cuda12_t4/bin/llama-bench

print("\n=== Shared Libraries ===")
!ls -lh build_cuda12_t4/bin/*.so* | head -10
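
A failed `ls` only prints an error and the notebook carries on, so a missing binary is easy to overlook. As a rough sketch, the same check can be done in Python and the result inspected programmatically. The helper name `missing_binaries` is an assumption made for this example.

```python
from pathlib import Path

EXPECTED_BINARIES = [
    "llama-server",
    "llama-cli",
    "llama-quantize",
    "llama-embedding",
    "llama-bench",
]


def missing_binaries(bin_dir, names=EXPECTED_BINARIES):
    """Return the names that are not regular files inside bin_dir."""
    bin_dir = Path(bin_dir)
    return [name for name in names if not (bin_dir / name).is_file()]


missing = missing_binaries("/content/llama.cpp/build_cuda12_t4/bin")
if missing:
    print("Missing binaries:", ", ".join(missing))
else:
    print("All expected binaries are present")
```

Raising an exception when `missing` is non-empty would stop a "Run all" execution before the packaging steps.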

In [None]:
# Test llama-server binary
import os
import subprocess

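# Prepend the CUDA and build library directories to LD_LIBRARY_PATH,
# keeping any paths that are already set
lib_dirs = [
    '/usr/local/cuda/targets/x86_64-linux/lib',
    '/content/llama.cpp/build_cuda12_t4/bin',
]
existing = os.environ.get('LD_LIBRARY_PATH', '')
os.environ['LD_LIBRARY_PATH'] = ':'.join(lib_dirs + ([existing] if existing else []))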

result = subprocess.run(
    ['/content/llama.cpp/build_cuda12_t4/bin/llama-server', '--version'],
    env=os.environ,
    capture_output=True,
    text=True
)

print("STDOUT:", result.stdout)
print("STDERR:", result.stderr)

if result.returncode == 0:
    print("\n✓ llama-server works correctly!")
else:
    print(f"\n✗ Error: Return code {result.returncode}")

## Step 4: Package Binaries for Distribution

Create the `llcuda-binaries-cuda12-t4-v2.1.0.tar.gz` package with:
- llama-server, llama-cli, llama-quantize, llama-embedding, llama-bench
- All required shared libraries (.so files)
- Proper directory structure for llcuda v2.1.0

In [None]:
# Create package directory structure
%cd /content

!mkdir -p llcuda_binaries_t4/bin
!mkdir -p llcuda_binaries_t4/lib

# Copy essential binaries
print("Copying binaries...")
!cp llama.cpp/build_cuda12_t4/bin/llama-server llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-cli llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-quantize llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-embedding llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-bench llcuda_binaries_t4/bin/

# Copy all shared libraries
print("Copying shared libraries...")
!cp llama.cpp/build_cuda12_t4/bin/*.so* llcuda_binaries_t4/lib/

print("\n✓ Package structure created")
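
Plain `cp` follows symbolic links, so if the build produces versioned libraries (for example a `lib*.so` link pointing at a `lib*.so.N` file), each link is copied as a full duplicate of its target. The sketch below is one way to copy the libraries while keeping links as links. It assumes such links exist in the build output, and `copy_shared_libs` is a hypothetical helper, not part of llcuda.

```python
import shutil
from pathlib import Path


def copy_shared_libs(src_dir, dst_dir):
    """Copy *.so* entries from src_dir to dst_dir, keeping symlinks as symlinks."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob("*.so*")):
        dst = dst_dir / src.name
        if dst.is_symlink() or dst.exists():
            dst.unlink()  # allow the cell to be re-run
        # follow_symlinks=False recreates a symlink instead of copying its target
        shutil.copy2(src, dst, follow_symlinks=False)
        copied.append(dst)
    return copied
```

In this notebook the call would be `copy_shared_libs("/content/llama.cpp/build_cuda12_t4/bin", "/content/llcuda_binaries_t4/lib")`. The shell equivalent is `cp -P`.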

In [None]:
# Create README for the package
readme_content = """# llcuda v2.1.0 Binaries for Tesla T4

**Built on**: Google Colab
**GPU**: Tesla T4 (SM 7.5)
**CUDA**: 12.x
**Date**: {date}

## Contents

### bin/
- `llama-server` - HTTP inference server
- `llama-cli` - Command-line interface
- `llama-quantize` - Model quantization tool
- `llama-embedding` - Embedding generation
- `llama-bench` - Performance benchmarking

### lib/
- `libggml-cuda.so` - GGML CUDA kernels with FlashAttention
- `libllama.so` - llama.cpp library
- Other required shared libraries

## Installation

These binaries are automatically downloaded by llcuda v2.1.0 on first import.

For manual installation:
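```bash
# Extract (creates a package_t4/ directory)
tar -xzf llcuda-binaries-cuda12-t4-v2.1.0.tar.gz

# Copy to the llcuda cache (if needed)
mkdir -p ~/.cache/llcuda/binaries/cuda12/
cp -r package_t4/bin package_t4/lib ~/.cache/llcuda/binaries/cuda12/
```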

## Features

- ✅ FlashAttention v2 (2-3x faster attention)
- ✅ Tensor Core optimization (SM 7.5)
- ✅ CUDA Graphs (20-40% latency reduction)
- ✅ All quantization formats (NF4, Q4_K_M, Q5_K_M, Q8_0, F16)
- ✅ Optimized for Tesla T4 GPUs

## Compatibility

- **llcuda**: v2.1.0+
- **Python**: 3.10+
- **CUDA**: 12.x
- **GPU**: Tesla T4 (SM 7.5) or higher

## Usage with llcuda

```python
import llcuda

# Binaries are automatically loaded
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf")
result = engine.infer("Your prompt")
```

## Links

- **llcuda**: https://github.com/llcuda/llcuda
- **Documentation**: https://llcuda.github.io/
- **llama.cpp**: https://github.com/ggml-org/llama.cpp

---

**Built with**: Google Colab Tesla T4 | CUDA 12 | llama.cpp
"""

from datetime import datetime
readme_content = readme_content.format(date=datetime.now().strftime("%Y-%m-%d"))

with open('/content/llcuda_binaries_t4/README.md', 'w') as f:
    f.write(readme_content)

print("✓ README.md created")

In [None]:
# Create BUILD_INFO.txt with build details
import subprocess
import sys
from datetime import datetime, timezone

# Get llama.cpp commit hash
llamacpp_commit = subprocess.run(
    ['git', 'rev-parse', 'HEAD'],
    capture_output=True,
    text=True,
    cwd='/content/llama.cpp'
).stdout.strip()

# Use an explicit UTC timestamp so the "UTC" label is always accurate
build_date = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

build_info = f"""llcuda v2.1.0 Binary Build Information
=========================================

Build Date: {build_date}
Build Platform: Google Colab
GPU: Tesla T4 (SM 7.5)
CUDA Version: 12.x
Python Version: {sys.version.split()[0]}

llama.cpp Details:
------------------
Repository: https://github.com/ggml-org/llama.cpp
Commit: {llamacpp_commit}

Build Configuration:
-------------------
CMAKE_BUILD_TYPE=Release
GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75
GGML_CUDA_FA=ON (FlashAttention)
GGML_CUDA_FA_ALL_QUANTS=ON
GGML_CUDA_GRAPHS=ON
BUILD_SHARED_LIBS=ON

Features:
---------
- FlashAttention v2 enabled
- CUDA Graphs optimization
- Tensor Core utilization
- All quantization formats supported
- HTTP server mode

Compatible with:
----------------
- llcuda v2.1.0+
- Python 3.10+
- CUDA 12.x
- Tesla T4 or higher (SM 7.5+)
"""

with open('/content/llcuda_binaries_t4/BUILD_INFO.txt', 'w') as f:
    f.write(build_info)

print("✓ BUILD_INFO.txt created")
print("\nBuild Information:")
print(build_info)

In [None]:
# Show package contents and sizes
print("=== Package Contents ===")
!du -sh /content/llcuda_binaries_t4
!du -sh /content/llcuda_binaries_t4/bin
!du -sh /content/llcuda_binaries_t4/lib

print("\n=== Binary Files ===")
!ls -lh /content/llcuda_binaries_t4/bin/

print("\n=== Library Files ===")
!ls -lh /content/llcuda_binaries_t4/lib/ | head -15

## Step 5: Create tar.gz Archive

Create the final `llcuda-binaries-cuda12-t4-v2.1.0.tar.gz` archive.

In [None]:
# Create the tar.gz archive
%cd /content

# Rename to match expected structure
!mv llcuda_binaries_t4 package_t4

print("Creating tar.gz archive...")
!tar -czf llcuda-binaries-cuda12-t4-v2.1.0.tar.gz package_t4/

print("\n✓ Archive created successfully!")
print("\n=== Final Package ===")
!ls -lh llcuda-binaries-cuda12-t4-v2.1.0.tar.gz
!du -h llcuda-binaries-cuda12-t4-v2.1.0.tar.gz

In [None]:
# Create SHA256 checksum
!sha256sum llcuda-binaries-cuda12-t4-v2.1.0.tar.gz > llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256

print("✓ SHA256 checksum created")
print("\nChecksum:")
!cat llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256
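
Whoever downloads the archive can check it against the `.sha256` file with `sha256sum -c`, or in Python where that tool is unavailable. The following is a minimal sketch of the Python route. It assumes the checksum file uses the usual `sha256sum` layout (hex digest, whitespace, file name), and the function names are illustrative.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(archive_path, checksum_path):
    """Compare the file's digest with the first field of a sha256sum-style file."""
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    return sha256_of_file(archive_path) == expected


ARCHIVE = "/content/llcuda-binaries-cuda12-t4-v2.1.0.tar.gz"
if os.path.exists(ARCHIVE) and os.path.exists(ARCHIVE + ".sha256"):
    print("Checksum OK" if verify_checksum(ARCHIVE, ARCHIVE + ".sha256") else "Checksum MISMATCH")
```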

In [None]:
# Verify archive contents
print("=== Archive Contents ===")
!tar -tzf llcuda-binaries-cuda12-t4-v2.1.0.tar.gz | head -30
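
Reading the `tar -tzf` listing by eye works, but the same check can be automated. This sketch uses the standard `tarfile` module to confirm that a list of expected members is present. The expected paths assume the `package_t4/` top-level directory created in the archive cell, and `missing_members` is a hypothetical helper.

```python
import os
import tarfile

EXPECTED_MEMBERS = [
    "package_t4/bin/llama-server",
    "package_t4/bin/llama-cli",
    "package_t4/README.md",
    "package_t4/BUILD_INFO.txt",
]


def archive_members(archive_path):
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def missing_members(archive_path, expected):
    """Return the expected paths that are absent from the archive."""
    names = set(archive_members(archive_path))
    return [path for path in expected if path not in names]


ARCHIVE = "/content/llcuda-binaries-cuda12-t4-v2.1.0.tar.gz"
if os.path.exists(ARCHIVE):
    print("Missing from archive:", missing_members(ARCHIVE, EXPECTED_MEMBERS) or "nothing")
```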

## Step 6: Download Files

Download the binary package and checksum to your local machine.

In [None]:
# Download files
from google.colab import files

print("Downloading llcuda-binaries-cuda12-t4-v2.1.0.tar.gz (266 MB)...")
print("This may take a few minutes...\n")
files.download('/content/llcuda-binaries-cuda12-t4-v2.1.0.tar.gz')

print("\nDownloading checksum file...")
files.download('/content/llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256')

print("\n✓ All files downloaded successfully!")

## 🎉 Build Complete!

### Created Files:

1. **llcuda-binaries-cuda12-t4-v2.1.0.tar.gz** (~266 MB)
   - llama.cpp binaries with FlashAttention
   - All required shared libraries
   - README and build information

2. **llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256**
   - SHA256 checksum for verification

### Next Steps:

1. **Upload to GitHub Releases**:
   ```bash
   gh release create v2.1.0 \
       --repo llcuda/llcuda \
       --title "llcuda v2.1.0 - Tesla T4 Release" \
       --notes "Complete CUDA 12 binaries with FlashAttention for Tesla T4" \
       llcuda-binaries-cuda12-t4-v2.1.0.tar.gz \
       llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256
   ```

2. **Test Installation**:
   ```python
   import llcuda
   print(llcuda.__version__)  # Should show 2.1.0
   ```

3. **Update bootstrap.py** to download from v2.1.0 release
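
The contents of `bootstrap.py` are not shown here, so the following is only a sketch of how the download URLs for the new release could be derived. It relies on GitHub's `/releases/download/<tag>/<asset>` URL layout and on the repository and tag used in the `gh release create` command above. `release_asset_url` is a hypothetical helper.

```python
def release_asset_url(repo: str, tag: str, asset: str) -> str:
    """URL of a GitHub release asset: /<repo>/releases/download/<tag>/<asset>."""
    return f"https://github.com/{repo}/releases/download/{tag}/{asset}"


ASSET = "llcuda-binaries-cuda12-t4-v2.1.0.tar.gz"
url = release_asset_url("llcuda/llcuda", "v2.1.0", ASSET)
print(url)
print(url + ".sha256")
```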

### Package Features:

- ✅ FlashAttention v2 (2-3x faster)
- ✅ CUDA Graphs (20-40% latency reduction)
- ✅ Tensor Core optimization
- ✅ All quantization formats
- ✅ Optimized for Tesla T4 (SM 7.5)

---

**Built with**: Google Colab Tesla T4 | CUDA 12 | Python 3.10+  
**For**: llcuda v2.1.0 with Unsloth Integration  
**Repository**: https://github.com/llcuda/llcuda  
**Documentation**: https://llcuda.github.io/