# Build llcuda v2.1.0 Binaries for Tesla T4 (Google Colab)

**Purpose**: Build complete CUDA 12 binaries for llcuda v2.1.0 on Google Colab Tesla T4 GPU

**Output**:
1. llama.cpp binaries with FlashAttention, including `llama-server` for HTTP server mode
2. Complete package: `llcuda-binaries-cuda12-t4-v2.1.0.tar.gz` (about 267 MB compressed)

**Important Notes**:
- These binaries are optimized for Tesla T4 (SM 7.5)
- Includes FlashAttention v2, CUDA Graphs, and Tensor Core optimizations
- Compatible with v2.1.0 Python APIs and Unsloth integration

**Requirements**:
- Google Colab with Tesla T4 GPU
- CUDA 12.x (pre-installed in Colab)
- Python 3.10+

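**Estimated Time**: ~15 minutes was the original estimate; file timestamps in the recorded run (18:02 to 18:54) indicate the compile step alone took closer to 50 minutes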

---

## Step 1: Verify GPU and Environment

In [1]:
# Check GPU
!nvidia-smi --query-gpu=name,compute_cap,driver_version,memory.total --format=csv

name, compute_cap, driver_version, memory.total [MiB]
Tesla T4, 7.5, 550.54.15, 15360 MiB


In [2]:
# Verify CUDA version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [3]:
# Check Python version
import sys
print(f"Python: {sys.version}")
print(f"Expected: 3.10+ (Colab default)")

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Expected: 3.10+ (Colab default)


In [4]:
# Verify compute capability
import subprocess

result = subprocess.run(
    ['nvidia-smi', '--query-gpu=compute_cap', '--format=csv,noheader'],
    capture_output=True,
    text=True
)
compute_cap = result.stdout.strip()
major, minor = map(int, compute_cap.split('.'))

print(f"Compute Capability: SM {major}.{minor}")

if major == 7 and minor == 5:
    print("✓ Tesla T4 detected - Perfect for llcuda v2.1.0!")
elif (major, minor) > (7, 5):  # tuple comparison also accepts SM 8.0, 8.6, 9.0, ...
    print(f"✓ SM {major}.{minor} detected - Compatible with llcuda v2.1.0")
else:
    print(f"⚠ WARNING: SM {major}.{minor} is below SM 7.5 (T4)")
    print("llcuda v2.1.0 requires SM 7.5+ for Tensor Cores and FlashAttention")

Compute Capability: SM 7.5
✓ Tesla T4 detected - Perfect for llcuda v2.1.0!


## Step 2: Clone llama.cpp Repository

We'll build llama.cpp with CUDA 12 support, FlashAttention, and optimizations for Tesla T4.

In [5]:
# Clone llama.cpp
%cd /content
!git clone https://github.com/ggml-org/llama.cpp.git
%cd llama.cpp

/content
Cloning into 'llama.cpp'...
remote: Enumerating objects: 75979, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 75979 (delta 29), reused 17 (delta 16), pack-reused 75924 (from 3)
Receiving objects: 100% (75979/75979), 279.54 MiB | 31.18 MiB/s, done.
Resolving deltas: 100% (55154/55154), done.
/content/llama.cpp


In [6]:
# Check llama.cpp version
!git log --oneline -5

516a4ca9b (HEAD -> master, origin/master, origin/HEAD) refactor : remove libcurl, use OpenSSL when available (#18828)
3e4bb2966 (tag: b7735) vulkan: Check maxStorageBufferRange in supports_op (#18709)
47f961249 llama-model: fix unfortunate typo (#18832)
01cbdfd7e CUDA : fix typo in clang pragma comment [no ci] (#18830)
635ef78ec vulkan: work around Intel fp16 bug in mmq (#18814)


## Step 3: Configure and Build llama.cpp for Tesla T4

**Build Configuration**:
- **Target**: Tesla T4 (SM 7.5)
- **CUDA**: 12.x
- **FlashAttention**: Enabled (2-3x faster)
- **CUDA Graphs**: Enabled (20-40% latency reduction)
- **Tensor Cores**: Optimized for mixed precision
- **Shared Libraries**: Enabled for dynamic loading

In [7]:
# Configure llama.cpp for Tesla T4 with all optimizations
!cmake -B build_cuda12_t4 \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="75" \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DGGML_NATIVE=OFF \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_CUDA_GRAPHS=ON \
    -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_TOOLS=ON \
    -DLLAMA_CURL=ON \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_RPATH='$ORIGIN/../lib' \
    -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMAKE_BUILD_TYPE=Release
-- Found Git: /usr/bin/git (found version "2.34.1")
  LLAMA_CURL option is deprecated and will be ignored
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -f

In [8]:
# Build llama.cpp (takes ~10 minutes)
import time

print("Building llama.cpp with CUDA 12 + FlashAttention...")
print("Estimated time: 10-12 minutes\n")
start_time = time.time()

!cmake --build build_cuda12_t4 --config Release -j$(nproc)

elapsed = time.time() - start_time
print(f"\n✓ Build completed in {elapsed/60:.1f} minutes")

Building llama.cpp with CUDA 12 + FlashAttention...
Estimated time: 10-12 minutes

[  0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  0%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[  0%] Built target build_info
[  0%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[  1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  2%] Building CXX object vendor/cpp-httplib/CMakeFiles/cpp-httplib.dir/httplib.cpp.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[  2%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
[  3%] Linking CXX share

In [9]:
# Verify binaries were built successfully
print("=== Built Binaries ===")
!ls -lh build_cuda12_t4/bin/llama-server
!ls -lh build_cuda12_t4/bin/llama-cli
!ls -lh build_cuda12_t4/bin/llama-quantize
!ls -lh build_cuda12_t4/bin/llama-embedding
!ls -lh build_cuda12_t4/bin/llama-bench

print("\n=== Shared Libraries ===")
!ls -lh build_cuda12_t4/bin/*.so* | head -10

=== Built Binaries ===
-rwxr-xr-x 1 root root 6.7M Jan 14 18:54 build_cuda12_t4/bin/llama-server
-rwxr-xr-x 1 root root 5.1M Jan 14 18:54 build_cuda12_t4/bin/llama-cli
-rwxr-xr-x 1 root root 434K Jan 14 18:53 build_cuda12_t4/bin/llama-quantize
-rwxr-xr-x 1 root root 4.2M Jan 14 18:52 build_cuda12_t4/bin/llama-embedding
-rwxr-xr-x 1 root root 581K Jan 14 18:52 build_cuda12_t4/bin/llama-bench

=== Shared Libraries ===
lrwxrwxrwx 1 root root   17 Jan 14 18:02 build_cuda12_t4/bin/libggml-base.so -> libggml-base.so.0
lrwxrwxrwx 1 root root   21 Jan 14 18:02 build_cuda12_t4/bin/libggml-base.so.0 -> libggml-base.so.0.9.5
-rwxr-xr-x 1 root root 721K Jan 14 18:02 build_cuda12_t4/bin/libggml-base.so.0.9.5
lrwxrwxrwx 1 root root   16 Jan 14 18:03 build_cuda12_t4/bin/libggml-cpu.so -> libggml-cpu.so.0
lrwxrwxrwx 1 root root   20 Jan 14 18:03 build_cuda12_t4/bin/libggml-cpu.so.0 -> libggml-cpu.so.0.9.5
-rwxr-xr-x 1 root root 949K Jan 14 18:03 build_cuda12_t4/bin/libggml-cpu.so.0.9.5
lrwxrwxrwx 1 ro

In [15]:
import os
import subprocess

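# 1. Find libcuda.so.1 location safely
# Note: '-type f' matches regular files only; if libcuda.so.1 is a symlink
# (common for driver libraries), this search returns nothing.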
result = subprocess.run(
    ['find', '/usr', '-name', 'libcuda.so.1', '-type', 'f'],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True
)
libcuda_path = result.stdout.strip().split('\n')[0] if result.stdout else ''

if libcuda_path:
    libcuda_dir = os.path.dirname(libcuda_path)
    print(f"Found libcuda at: {libcuda_dir}")
else:
    libcuda_dir = '/usr/lib/x86_64-linux-gnu'
    print("libcuda not found, defaulting to standard path.")

# 2. Set comprehensive LD_LIBRARY_PATH
# Added the compat folder which is often needed for T4/CUDA 12.x in Colab
cuda_paths = [
    '/usr/local/cuda-12.5/compat',
    libcuda_dir,
    '/usr/local/cuda/lib64',
    '/content/llama.cpp/build_cuda12_t4/bin',
    '/usr/lib/x86_64-linux-gnu'
]
# Note: this assignment replaces any existing LD_LIBRARY_PATH instead of extending it
os.environ['LD_LIBRARY_PATH'] = ':'.join(cuda_paths)

print(f"Testing with LD_LIBRARY_PATH: {os.environ['LD_LIBRARY_PATH']}\n")

# 3. Run the version check
try:
    result = subprocess.run(
        ['/content/llama.cpp/build_cuda12_t4/bin/llama-server', '--version'],
        env=os.environ,
        capture_output=True,  # captures both stdout and stderr
        text=True
    )

    if result.returncode == 0:
        # llama-server may write its version banner to stderr, so stdout can be empty
        print(f"✓ llama-server works! \nVersion: {result.stdout}")
    else:
        print(f"✗ Error: Return code {result.returncode}")
        print(f"STDERR: {result.stderr}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

libcuda not found, defaulting to standard path.
Testing with LD_LIBRARY_PATH: /usr/local/cuda-12.5/compat:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64:/content/llama.cpp/build_cuda12_t4/bin:/usr/lib/x86_64-linux-gnu

✓ llama-server works! 
Version: 


In [18]:
import os

# 1. Get the existing system paths to avoid losing libnvidia-ml.so
current_path = os.environ.get('LD_LIBRARY_PATH', '')

# 2. Define your new required paths
build_paths = [
    '/usr/local/cuda-12.5/compat',
    '/usr/local/cuda/lib64',
    '/content/llama.cpp/build_cuda12_t4/bin',
    '/usr/lib64-nvidia',           # Critical: This is where libnvidia-ml.so usually lives
    '/usr/local/nvidia/lib64',     # Common secondary location
    '/usr/lib/x86_64-linux-gnu'
]

# 3. Combine them (filtering for paths that actually exist)
valid_paths = [p for p in build_paths if os.path.exists(p)]
if current_path:
    valid_paths.append(current_path)

os.environ['LD_LIBRARY_PATH'] = ':'.join(valid_paths)

print("✅ LD_LIBRARY_PATH restored and updated.")
print(f"Final Path: {os.environ['LD_LIBRARY_PATH']}")

# 4. Verify that nvidia-smi is back online
!nvidia-smi --query-gpu=name,memory.total --format=csv

✅ LD_LIBRARY_PATH restored and updated.
Final Path: /usr/local/cuda-12.5/compat:/usr/local/cuda/lib64:/content/llama.cpp/build_cuda12_t4/bin:/usr/lib64-nvidia:/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.5/compat:/usr/local/cuda/lib64:/content/llama.cpp/build_cuda12_t4/bin:/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.5/compat:/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64:/content/llama.cpp/build_cuda12_t4/bin:/usr/lib/x86_64-linux-gnu
name, memory.total [MiB]
Tesla T4, 15360 MiB
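
The `Final Path` printed above repeats several directories, because each run appends the previous value of `LD_LIBRARY_PATH` to the new list. A minimal sketch of an order-preserving merge that drops duplicates and empty entries follows; `merge_library_paths` is a hypothetical helper introduced here for illustration and is not part of llcuda or Colab.

```python
def merge_library_paths(new_paths, current=""):
    """Return a colon-separated search path.

    Entries from new_paths come first, followed by entries already present in
    `current`. Empty entries and duplicates are dropped (first occurrence wins).
    """
    seen = set()
    merged = []
    for path in list(new_paths) + current.split(":"):
        if path and path not in seen:
            seen.add(path)
            merged.append(path)
    return ":".join(merged)


# Example using directories that appear in this notebook.
example = merge_library_paths(
    ["/usr/local/cuda/lib64", "/usr/lib64-nvidia"],
    current="/usr/lib64-nvidia:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu",
)
print(example)
# /usr/local/cuda/lib64:/usr/lib64-nvidia:/usr/lib/x86_64-linux-gnu
```

In the cell above, this could replace the manual join, for example `os.environ['LD_LIBRARY_PATH'] = merge_library_paths(valid_paths, current_path)`, so that re-running the cell does not grow the variable.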


## Step 4: Package Binaries for Distribution

Create the `llcuda-binaries-cuda12-t4-v2.1.0.tar.gz` package with:
- llama-server, llama-cli, llama-quantize, llama-embedding, llama-bench
- All required shared libraries (.so files)
- Proper directory structure for llcuda v2.1.0

In [20]:
# Create package directory structure
%cd /content

!mkdir -p llcuda_binaries_t4/bin
!mkdir -p llcuda_binaries_t4/lib

# Copy essential binaries
print("Copying binaries...")
!cp llama.cpp/build_cuda12_t4/bin/llama-server llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-cli llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-quantize llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-embedding llcuda_binaries_t4/bin/
!cp llama.cpp/build_cuda12_t4/bin/llama-bench llcuda_binaries_t4/bin/

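# Copy all shared libraries
# Note: plain `cp` follows symlinks, so every versioned alias (.so, .so.0, .so.0.9.5)
# is stored as a full copy; `cp -P` would preserve the links instead
print("Copying shared libraries...")
!cp llama.cpp/build_cuda12_t4/bin/*.so* llcuda_binaries_t4/lib/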

print("\n✓ Package structure created")

/content
Copying binaries...
Copying shared libraries...

✓ Package structure created
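
In the build directory the `.so` and `.so.0` names are symbolic links to the fully versioned library, but the package listing in this notebook shows each alias as a full-size regular file. A minimal sketch of a copy step that keeps links as links, assuming that is what you want in the package; `copy_libs` is a hypothetical helper, and whether llcuda's own installer expects links or regular files is not stated in this notebook.

```python
import shutil
from pathlib import Path


def copy_libs(src_dir, dst_dir):
    """Copy every *.so* entry from src_dir to dst_dir, keeping symlinks as symlinks."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.glob("*.so*")):
        target = dst / entry.name
        # Remove a stale file or link so the copy below cannot fail on it.
        if target.is_symlink() or target.exists():
            target.unlink()
        # follow_symlinks=False recreates a link instead of copying its target.
        shutil.copy2(entry, target, follow_symlinks=False)
        copied.append(target)
    return copied
```

Usage would look like `copy_libs('llama.cpp/build_cuda12_t4/bin', 'llcuda_binaries_t4/lib')`. `tar` stores symbolic links as links by default, so the saving would carry over to the archive.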


In [21]:
# Create README for the package
readme_content = """# llcuda v2.1.0 Binaries for Tesla T4

**Built on**: Google Colab
**GPU**: Tesla T4 (SM 7.5)
**CUDA**: 12.x
**Date**: {date}

## Contents

### bin/
- `llama-server` - HTTP inference server
- `llama-cli` - Command-line interface
- `llama-quantize` - Model quantization tool
- `llama-embedding` - Embedding generation
- `llama-bench` - Performance benchmarking

### lib/
- `libggml-cuda.so` - GGML CUDA kernels with FlashAttention
- `libllama.so` - llama.cpp library
- Other required shared libraries

## Installation

These binaries are automatically downloaded by llcuda v2.1.0 on first import.

For manual installation:
```bash
# Extract (creates a package_t4/ directory)
tar -xzf llcuda-binaries-cuda12-t4-v2.1.0.tar.gz

# Copy to llcuda package (if needed)
cp -r package_t4/bin package_t4/lib ~/.cache/llcuda/binaries/cuda12/
```

## Features

- ✅ FlashAttention v2 (2-3x faster attention)
- ✅ Tensor Core optimization (SM 7.5)
- ✅ CUDA Graphs (20-40% latency reduction)
- ✅ All quantization formats (NF4, Q4_K_M, Q5_K_M, Q8_0, F16)
- ✅ Optimized for Tesla T4 GPUs

## Compatibility

- **llcuda**: v2.1.0+
- **Python**: 3.10+
- **CUDA**: 12.x
- **GPU**: Tesla T4 (SM 7.5) or higher

## Usage with llcuda

```python
import llcuda

# Binaries are automatically loaded
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf")
result = engine.infer("Your prompt")
```

## Links

- **llcuda**: https://github.com/llcuda/llcuda
- **Documentation**: https://llcuda.github.io/
- **llama.cpp**: https://github.com/ggml-org/llama.cpp

---

**Built with**: Google Colab Tesla T4 | CUDA 12 | llama.cpp
"""

from datetime import datetime
readme_content = readme_content.format(date=datetime.now().strftime("%Y-%m-%d"))

with open('/content/llcuda_binaries_t4/README.md', 'w') as f:
    f.write(readme_content)

print("✓ README.md created")

✓ README.md created


In [22]:
# Create BUILD_INFO.txt with build details
import subprocess
import sys  # needed for sys.version below if this cell runs on its own
from datetime import datetime, timezone

# Get llama.cpp commit hash
llamacpp_commit = subprocess.run(
    ['git', 'rev-parse', 'HEAD'],
    capture_output=True,
    text=True,
    cwd='/content/llama.cpp'
).stdout.strip()

build_info = f"""llcuda v2.1.0 Binary Build Information
=========================================

Build Date: {datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")}
Build Platform: Google Colab
GPU: Tesla T4 (SM 7.5)
CUDA Version: 12.x
Python Version: {sys.version.split()[0]}

llama.cpp Details:
------------------
Repository: https://github.com/ggml-org/llama.cpp
Commit: {llamacpp_commit}

Build Configuration:
-------------------
CMAKE_BUILD_TYPE=Release
GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75
GGML_CUDA_FA=ON (FlashAttention)
GGML_CUDA_FA_ALL_QUANTS=ON
GGML_CUDA_GRAPHS=ON
BUILD_SHARED_LIBS=ON

Features:
---------
- FlashAttention v2 enabled
- CUDA Graphs optimization
- Tensor Core utilization
- All quantization formats supported
- HTTP server mode

Compatible with:
----------------
- llcuda v2.1.0+
- Python 3.10+
- CUDA 12.x
- Tesla T4 or higher (SM 7.5+)
"""

with open('/content/llcuda_binaries_t4/BUILD_INFO.txt', 'w') as f:
    f.write(build_info)

print("✓ BUILD_INFO.txt created")
print("\nBuild Information:")
print(build_info)

✓ BUILD_INFO.txt created

Build Information:
llcuda v2.1.0 Binary Build Information
=========================================

Build Date: 2026-01-14 19:26:45 UTC
Build Platform: Google Colab
GPU: Tesla T4 (SM 7.5)
CUDA Version: 12.x
Python Version: 3.12.12

llama.cpp Details:
------------------
Repository: https://github.com/ggml-org/llama.cpp
Commit: 516a4ca9b5f2fa72c2a71f412929a67cf76a6213

Build Configuration:
-------------------
CMAKE_BUILD_TYPE=Release
GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75
GGML_CUDA_FA=ON (FlashAttention)
GGML_CUDA_FA_ALL_QUANTS=ON
GGML_CUDA_GRAPHS=ON
BUILD_SHARED_LIBS=ON

Features:
---------
- FlashAttention v2 enabled
- CUDA Graphs optimization
- Tensor Core utilization
- All quantization formats supported
- HTTP server mode

Compatible with:
----------------
- llcuda v2.1.0+
- Python 3.10+
- CUDA 12.x
- Tesla T4 or higher (SM 7.5+)
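
The build configuration in `BUILD_INFO.txt` is typed in by hand, so it can drift from what CMake actually used. CMake records the effective values in `CMakeCache.txt` inside the build directory, one `NAME:TYPE=VALUE` entry per line. A minimal sketch of reading selected entries back follows; `read_cmake_cache` and its default prefix list are hypothetical choices made for illustration.

```python
import re

# CMakeCache.txt entries have the form NAME:TYPE=VALUE
_CACHE_LINE = re.compile(r"^([A-Za-z0-9_.\-]+):([A-Z]+)=(.*)$")

# Hypothetical default selection: the options this notebook sets explicitly.
DEFAULT_PREFIXES = (
    "GGML_",
    "CMAKE_BUILD_TYPE",
    "CMAKE_CUDA_ARCHITECTURES",
    "BUILD_SHARED_LIBS",
)


def read_cmake_cache(text, prefixes=DEFAULT_PREFIXES):
    """Return {name: value} for cache entries whose name starts with a prefix."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '#' comments and '//' help strings.
        if not line or line.startswith(("#", "//")):
            continue
        match = _CACHE_LINE.match(line)
        if match and match.group(1).startswith(prefixes):
            options[match.group(1)] = match.group(3)
    return options


sample = """\
# This is the CMakeCache file.
//Build type
CMAKE_BUILD_TYPE:STRING=Release
GGML_CUDA:BOOL=ON
CMAKE_CUDA_ARCHITECTURES:STRING=75
CMAKE_INSTALL_PREFIX:PATH=/usr/local
"""
print(read_cmake_cache(sample))
# {'CMAKE_BUILD_TYPE': 'Release', 'GGML_CUDA': 'ON', 'CMAKE_CUDA_ARCHITECTURES': '75'}
```

For this build, the input would be the contents of `/content/llama.cpp/build_cuda12_t4/CMakeCache.txt`.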



In [23]:
# Show package contents and sizes
print("=== Package Contents ===")
!du -sh /content/llcuda_binaries_t4
!du -sh /content/llcuda_binaries_t4/bin
!du -sh /content/llcuda_binaries_t4/lib

print("\n=== Binary Files ===")
!ls -lh /content/llcuda_binaries_t4/bin/

print("\n=== Library Files ===")
!ls -lh /content/llcuda_binaries_t4/lib/ | head -15

=== Package Contents ===
696M	/content/llcuda_binaries_t4
17M	/content/llcuda_binaries_t4/bin
679M	/content/llcuda_binaries_t4/lib

=== Binary Files ===
total 17M
-rwxr-xr-x 1 root root 581K Jan 14 19:26 llama-bench
-rwxr-xr-x 1 root root 5.1M Jan 14 19:26 llama-cli
-rwxr-xr-x 1 root root 4.2M Jan 14 19:26 llama-embedding
-rwxr-xr-x 1 root root 434K Jan 14 19:26 llama-quantize
-rwxr-xr-x 1 root root 6.7M Jan 14 19:26 llama-server

=== Library Files ===
total 679M
-rwxr-xr-x 1 root root 721K Jan 14 19:26 libggml-base.so
-rwxr-xr-x 1 root root 721K Jan 14 19:26 libggml-base.so.0
-rwxr-xr-x 1 root root 721K Jan 14 19:26 libggml-base.so.0.9.5
-rwxr-xr-x 1 root root 949K Jan 14 19:26 libggml-cpu.so
-rwxr-xr-x 1 root root 949K Jan 14 19:26 libggml-cpu.so.0
-rwxr-xr-x 1 root root 949K Jan 14 19:26 libggml-cpu.so.0.9.5
-rwxr-xr-x 1 root root 221M Jan 14 19:26 libggml-cuda.so
-rwxr-xr-x 1 root root 221M Jan 14 19:26 libggml-cuda.so.0
-rwxr-xr-x 1 root root 221M Jan 14 19:26 libggml-cuda.so.0.9.

## Step 5: Create tar.gz Archive

Create the final `llcuda-binaries-cuda12-t4-v2.1.0.tar.gz` archive.

In [24]:
# Create the tar.gz archive
%cd /content

# Rename to match expected structure
!mv llcuda_binaries_t4 package_t4

print("Creating tar.gz archive...")
!tar -czf llcuda-binaries-cuda12-t4-v2.1.0.tar.gz package_t4/

print("\n✓ Archive created successfully!")
print("\n=== Final Package ===")
!ls -lh llcuda-binaries-cuda12-t4-v2.1.0.tar.gz
!du -h llcuda-binaries-cuda12-t4-v2.1.0.tar.gz

/content
Creating tar.gz archive...

✓ Archive created successfully!

=== Final Package ===
-rw-r--r-- 1 root root 267M Jan 14 19:27 llcuda-binaries-cuda12-t4-v2.1.0.tar.gz
267M	llcuda-binaries-cuda12-t4-v2.1.0.tar.gz


In [25]:
# Create SHA256 checksum
!sha256sum llcuda-binaries-cuda12-t4-v2.1.0.tar.gz > llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256

print("✓ SHA256 checksum created")
print("\nChecksum:")
!cat llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256

✓ SHA256 checksum created

Checksum:
1bc2dd6d837f3b2ffcc5aee90ad65829aeba63dd5f01505adb2437cd417bf5db  llcuda-binaries-cuda12-t4-v2.1.0.tar.gz
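
After downloading, the archive can be checked against the `.sha256` file. The sketch below does this in Python with `hashlib`, reading the archive in chunks so the whole file is never held in memory; the function names are illustrative and not part of llcuda.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(archive_path, checksum_path):
    """Compare the archive digest with the first field of a sha256sum-style file."""
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    return sha256_of(archive_path) == expected
```

For example, `verify_checksum('llcuda-binaries-cuda12-t4-v2.1.0.tar.gz', 'llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256')` should return `True` for an intact download. On the command line, `sha256sum -c llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256` performs the same check.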


In [26]:
# Verify archive contents
print("=== Archive Contents ===")
!tar -tzf llcuda-binaries-cuda12-t4-v2.1.0.tar.gz | head -30

=== Archive Contents ===
package_t4/
package_t4/README.md
package_t4/lib/
package_t4/lib/libggml-base.so.0
package_t4/lib/libggml-cpu.so.0.9.5
package_t4/lib/libllama.so.0
package_t4/lib/libggml.so.0.9.5
package_t4/lib/libggml-base.so
package_t4/lib/libmtmd.so.0.0.7736
package_t4/lib/libggml.so.0
package_t4/lib/libggml-cpu.so.0
package_t4/lib/libllama.so
package_t4/lib/libmtmd.so
package_t4/lib/libggml.so
package_t4/lib/libllama.so.0.0.7736
package_t4/lib/libggml-base.so.0.9.5
package_t4/lib/libggml-cpu.so
package_t4/lib/libggml-cuda.so.0
package_t4/lib/libggml-cuda.so
package_t4/lib/libmtmd.so.0
package_t4/lib/libggml-cuda.so.0.9.5
package_t4/bin/
package_t4/bin/llama-bench
package_t4/bin/llama-server
package_t4/bin/llama-cli
package_t4/bin/llama-embedding
package_t4/bin/llama-quantize
package_t4/BUILD_INFO.txt
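
Reading the listing by eye works for a one-off build. As a rough sketch, the same check can be automated with the standard `tarfile` module; the `REQUIRED` list below is an assumption based on the files named in this notebook, not an official llcuda manifest.

```python
import tarfile

# Assumed minimum contents, taken from the listing above.
REQUIRED = [
    "package_t4/bin/llama-server",
    "package_t4/bin/llama-cli",
    "package_t4/bin/llama-quantize",
    "package_t4/bin/llama-embedding",
    "package_t4/bin/llama-bench",
    "package_t4/lib/libggml-cuda.so",
    "package_t4/lib/libllama.so",
    "package_t4/README.md",
    "package_t4/BUILD_INFO.txt",
]


def missing_members(archive_path, required=REQUIRED):
    """Return the required member names that are absent from a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = set(tar.getnames())
    return [name for name in required if name not in names]
```

An empty list from `missing_members('llcuda-binaries-cuda12-t4-v2.1.0.tar.gz')` means every expected file is present.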


## Step 6: Download Files

Download the binary package and checksum to your local machine.

In [27]:
# Download files
from google.colab import files

print("Downloading llcuda-binaries-cuda12-t4-v2.1.0.tar.gz (266 MB)...")
print("This may take a few minutes...\n")
files.download('/content/llcuda-binaries-cuda12-t4-v2.1.0.tar.gz')

print("\nDownloading checksum file...")
files.download('/content/llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256')

print("\n✓ All files downloaded successfully!")

Downloading llcuda-binaries-cuda12-t4-v2.1.0.tar.gz (266 MB)...
This may take a few minutes...




Downloading checksum file...



✓ All files downloaded successfully!


## 🎉 Build Complete!

### Created Files:

1. **llcuda-binaries-cuda12-t4-v2.1.0.tar.gz** (~266 MB)
   - llama.cpp binaries with FlashAttention
   - All required shared libraries
   - README and build information

2. **llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256**
   - SHA256 checksum for verification

### Next Steps:

1. **Upload to GitHub Releases**:
   ```bash
   gh release create v2.1.0 \
       --repo llcuda/llcuda \
       --title "llcuda v2.1.0 - Tesla T4 Release" \
       --notes "Complete CUDA 12 binaries with FlashAttention for Tesla T4" \
       llcuda-binaries-cuda12-t4-v2.1.0.tar.gz \
       llcuda-binaries-cuda12-t4-v2.1.0.tar.gz.sha256
   ```

2. **Test Installation**:
   ```python
   import llcuda
   print(llcuda.__version__)  # Should show 2.1.0
   ```

3. **Update bootstrap.py** to download from v2.1.0 release

### Package Features:

- ✅ FlashAttention v2 (2-3x faster)
- ✅ CUDA Graphs (20-40% latency reduction)
- ✅ Tensor Core optimization
- ✅ All quantization formats
- ✅ Optimized for Tesla T4 (SM 7.5)

---

**Built with**: Google Colab Tesla T4 | CUDA 12 | Python 3.10+  
**For**: llcuda v2.1.0 with Unsloth Integration  
**Repository**: https://github.com/llcuda/llcuda  
**Documentation**: https://llcuda.github.io/