# Embedding Backfill (JE-3 and CLIP-v2) to Supabase (Postgres + pgvector)

This notebook sets up a **GPU in Colab**, installs dependencies, reads **secrets** (Supabase/Postgres),
uses a **Hugging Face cache on Drive**, and runs your *backfills* via `python -m src.ingest...`.

👉 *Run the cells in order.*

****

In [1]:
%%capture
# Install the minimal dependencies (%%capture has to be the first line of the cell)
!pip -q install --upgrade sentence-transformers transformers psycopg2-binary boto3 python-dotenv tqdm pillow

**Key library versions**

In [2]:
!pip show sentence-transformers transformers

Name: sentence-transformers
Version: 5.1.0
Summary: Embeddings, Retrieval, and Reranking
Home-page: https://www.SBERT.net
Author: 
Author-email: Nils Reimers <info@nils-reimers.de>, Tom Aarsen <tom.aarsen@huggingface.co>
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, Pillow, scikit-learn, scipy, torch, tqdm, transformers, typing_extensions
Required-by: 
---
Name: transformers
Version: 4.55.4
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft, sentence-transformers


**GPU and drivers**

In [3]:
!nvidia-smi
!nvidia-smi --query-gpu=name,memory.total,memory.used,driver_version --format=csv

Sun Aug 24 16:40:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

**PyTorch/CUDA/cuDNN versions**

In [4]:
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))


Torch: 2.8.0+cu126
CUDA available: True
CUDA runtime: 12.6
cuDNN: 91002
GPU: Tesla T4


In [5]:
# (Optional but recommended) Mount Drive and use it as the Hugging Face cache
from google.colab import drive
drive.mount('/content/drive')
import os
os.environ['HF_HOME'] = '/content/drive/MyDrive/hf_cache'
# TRANSFORMERS_CACHE is deprecated, so it is not set here; HF_HOME controls the cache location
print('HF_HOME =', os.environ['HF_HOME'])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
HF_HOME = /content/drive/MyDrive/hf_cache
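To see where downloaded models should end up once `HF_HOME` is set, the sketch below mirrors the cache-directory precedence described in the `huggingface_hub` documentation (`HF_HUB_CACHE`, then `HF_HOME/hub`, then the default under `~/.cache`). The helper name `resolve_hub_cache` is hypothetical, and the precedence is worth checking against the version you have installed.

```python
import os
import posixpath


def resolve_hub_cache(env):
    """Return the directory where Hub downloads are expected to be cached.

    Hypothetical helper: it re-implements the documented precedence instead of
    calling into huggingface_hub, so treat the result as an estimate.
    """
    # An explicit HF_HUB_CACHE wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return env["HF_HUB_CACHE"]
    hf_home = env.get("HF_HOME")
    if not hf_home:
        # Default location: $XDG_CACHE_HOME/huggingface or ~/.cache/huggingface.
        cache_root = env.get("XDG_CACHE_HOME") or posixpath.join(
            os.path.expanduser("~"), ".cache"
        )
        hf_home = posixpath.join(cache_root, "huggingface")
    return posixpath.join(hf_home, "hub")


drive_cache = resolve_hub_cache({"HF_HOME": "/content/drive/MyDrive/hf_cache"})
print(drive_cache)
```

In the notebook you would pass `os.environ` instead of the literal dictionary and then look inside that folder on Drive to confirm the models were saved there.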


In [6]:
# Load credentials from Colab Secrets (left panel > Secrets)
from google.colab import userdata
import os
def _get(name):
    try:
        v = userdata.get(name)
        if v is None:
            raise KeyError
        return v
    except Exception:
        raise RuntimeError(f'Secret {name} does not exist or this notebook has no access to it.')

# Suggested secret names (adjust them if you use different ones):
os.environ['user'] = _get('user')
os.environ['password'] = _get('password')
os.environ['host'] = _get('host')
os.environ['port'] = _get('port')
os.environ['dbname'] = _get('dbname')
print('Environment variables loaded: user, host, dbname (port/password hidden)')

Environment variables loaded: user, host, dbname (port/password hidden)
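The backfill scripts are not shown here, so how they consume these five variables is unknown. As a minimal sketch, assuming they are combined into a single Postgres connection URL, the hypothetical helper below percent-encodes the credentials so that special characters in the password do not break the URL. The `sslmode` default and the demo values are assumptions, not taken from the notebook.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=os.environ, sslmode="require"):
    """Assemble a postgresql:// URL from user/password/host/port/dbname.

    Hypothetical helper; sslmode="require" is an assumption.
    """
    # Percent-encode credentials so characters like '@' or '/' stay unambiguous.
    user = quote(env["user"], safe="")
    password = quote(env["password"], safe="")
    return (
        f"postgresql://{user}:{password}@{env['host']}:{env['port']}"
        f"/{env['dbname']}?sslmode={sslmode}"
    )


# Made-up values for illustration only; in the notebook, call build_postgres_url().
demo_env = {
    "user": "postgres.abc",
    "password": "p@ss/word",
    "host": "db.example.com",
    "port": "5432",
    "dbname": "postgres",
}
demo_url = build_postgres_url(demo_env)
```

A URL of this shape can be passed to `psycopg2.connect(...)`, which accepts libpq connection URIs. Avoid printing the result, since it contains the password.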


In [7]:
# Prepare the project structure for `python -m src.ingest...` style imports
import os, pathlib
for p in [
    '/content/src',
    '/content/src/ingest',
    '/content/src/embeddings'
]:
    os.makedirs(p, exist_ok=True)

# Create __init__.py files so Python treats the directories as packages
pathlib.Path('/content/src/__init__.py').write_text('')
pathlib.Path('/content/src/ingest/__init__.py').write_text('')
pathlib.Path('/content/src/embeddings/__init__.py').write_text('')
print('Structure created. Upload your .py files in the next steps.')

Structure created. Upload your .py files in the next steps.


In [8]:
# Upload your local files: backfill_* scripts and embedding modules
from google.colab import files
uploaded = files.upload()  # Select the files from your computer

import shutil
for fname in uploaded:
    # Ingest/backfill scripts
    if fname.endswith('backfill_text_e5.py'):
        shutil.move(fname, '/content/src/ingest/backfill_text_e5.py')
    elif fname.endswith('backfill_text_gte.py'):
        shutil.move(fname, '/content/src/ingest/backfill_text_gte.py')
    elif fname.endswith('backfill_text_je3.py'):
        shutil.move(fname, '/content/src/ingest/backfill_text_je3.py')
    elif fname.endswith('backfill_clip_512.py'):
        shutil.move(fname, '/content/src/ingest/backfill_clip_512.py')
    # Legacy mapping (if you use older variants)
    elif fname.endswith('backfill_clip_v2.py'):
        shutil.move(fname, '/content/src/ingest/backfill_clip_v2.py')

    # Embedding modules
    elif fname.endswith('text_e5_small.py'):
        shutil.move(fname, '/content/src/embeddings/text_e5_small.py')
    elif fname.endswith('text_gte_base.py'):
        shutil.move(fname, '/content/src/embeddings/text_gte_base.py')
    elif fname.endswith('text_je3.py'):
        shutil.move(fname, '/content/src/embeddings/text_je3.py')
    elif fname.endswith('text_clip_multi.py'):
        shutil.move(fname, '/content/src/embeddings/text_clip_multi.py')
    elif fname.endswith('image_clip_vitb32.py'):
        shutil.move(fname, '/content/src/embeddings/image_clip_vitb32.py')
    # Legacy mapping (if you use older variants)
    elif fname.endswith('clip_v2.py'):
        shutil.move(fname, '/content/src/embeddings/clip_v2.py')
    else:
        print('Unrecognized file:', fname)
print('Files placed in /content/src/...')

Saving clip_v2.py to clip_v2.py
Saving text_je3.py to text_je3.py
Files placed in /content/src/...
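The `elif` chain above depends on check order, because `endswith('text_je3.py')` also matches `backfill_text_je3.py`. A possible alternative, sketched below, routes files through lookup sets and matches on the exact file name instead of a suffix. This changes the matching rule, so it is an illustration rather than a drop-in replacement; `destination_for` is a hypothetical name.

```python
import posixpath

INGEST_DIR = "/content/src/ingest"
EMBEDDINGS_DIR = "/content/src/embeddings"

# File names taken from the upload cell above.
INGEST_FILES = {
    "backfill_text_e5.py", "backfill_text_gte.py", "backfill_text_je3.py",
    "backfill_clip_512.py", "backfill_clip_v2.py",
}
EMBEDDING_FILES = {
    "text_e5_small.py", "text_gte_base.py", "text_je3.py",
    "text_clip_multi.py", "image_clip_vitb32.py", "clip_v2.py",
}


def destination_for(fname):
    """Return the target path for an uploaded file, or None if it is not recognized."""
    base = posixpath.basename(fname)
    # Exact-name lookups make the result independent of check order.
    if base in INGEST_FILES:
        return posixpath.join(INGEST_DIR, base)
    if base in EMBEDDING_FILES:
        return posixpath.join(EMBEDDINGS_DIR, base)
    return None


print(destination_for("text_je3.py"))
print(destination_for("backfill_text_je3.py"))
```

In the upload loop this would become `dest = destination_for(fname)` followed by `shutil.move(fname, dest)` when `dest` is not `None`.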


In [9]:
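#  Make /content importable: sys.path covers imports in this kernel,
#  PYTHONPATH covers the `!python -m src...` subprocesses
import os, sys
if '/content' not in sys.path:
    sys.path.append('/content')
_paths = [p for p in os.environ.get('PYTHONPATH', '').split(os.pathsep) if p]
if '/content' not in _paths:
    os.environ['PYTHONPATH'] = os.pathsep.join(['/content'] + _paths)
print('PYTHONPATH ok')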

PYTHONPATH ok


In [10]:
#  Sanity check for imports (if this fails, check paths and file names)
from src.embeddings.text_je3 import encode_texts as _je3
from src.embeddings.text_e5_small import encode_texts as _e5
from src.embeddings.text_gte_base import encode_texts as _gte
from src.embeddings.text_clip_multi import encode_texts as _clip_txt
from src.embeddings.image_clip_vitb32 import encode_images as _clip_img
print('Imports OK')



Imports OK


**Cleanup:** Run this cell if the runtime was left "contaminated" with the malformed allocator setting (`expandable_segments=True` instead of `expandable_segments:True`) or if another cell set `TRANSFORMERS_CACHE`.

In [21]:
import os
import torch

# Clear TRANSFORMERS_CACHE (deprecated)
os.environ.pop("TRANSFORMERS_CACHE", None)  # HF recommends HF_HOME

# Fix the allocator setting if it was left with '=' instead of ':'
val = os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
if val and "expandable_segments=True" in val:
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = val.replace("expandable_segments=True",
                                                        "expandable_segments:True")

# Set the correct values explicitly if they are still missing
os.environ.setdefault("HF_HOME", "/content/drive/MyDrive/hf_cache")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print("HF_HOME =", os.environ.get("HF_HOME"))
print("PYTORCH_CUDA_ALLOC_CONF =", os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))
print("CUDA available before import:", torch.cuda.is_available())


HF_HOME = /content/drive/MyDrive/hf_cache
PYTORCH_CUDA_ALLOC_CONF = expandable_segments:True
CUDA available before import: True


In [None]:
# Run the text backfills (order: E5 → GTE → JE-3)
!python -m src.ingest.backfill_text_e5 --batch-size 512
!python -m src.ingest.backfill_text_gte --batch-size 256
!python -m src.ingest.backfill_text_je3 --batch-size 128

2025-08-24 16:41:33.787060: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1756053693.806760   24898 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756053693.812976   24898 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1756053693.828193   24898 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756053693.828224   24898 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1756053693.828230   24898 computation_placer.cc:177] computation placer alr

In [None]:
# Run the CLIP text→image backfill (text_multi)
!python -m src.ingest.backfill_clip_512 --mode text_multi --batch-size 256

In [None]:
# Run the CLIP image→text backfill (image)
!python -m src.ingest.backfill_clip_512 --mode image --batch-size 128
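
The backfill scripts themselves are not included in this notebook. As a rough sketch of the final write step, pgvector accepts vectors in the text form `[x1,x2,...]`, so an embedding can be serialized as below and sent as a query parameter. The table and column names in `HYPOTHETICAL_UPDATE_SQL` are invented for illustration, and `to_pgvector_literal` is a hypothetical helper, not something taken from the real scripts.

```python
import math


def to_pgvector_literal(vec):
    """Serialize a sequence of numbers into pgvector's text format, e.g. '[0.5,1.0]'."""
    values = [float(x) for x in vec]
    if not values:
        raise ValueError("embedding must not be empty")
    # pgvector does not accept NaN or infinite components, so fail early.
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(v) for v in values) + "]"


# Hypothetical statement: 'items' and 'embedding_je3' are placeholder names.
HYPOTHETICAL_UPDATE_SQL = "UPDATE items SET embedding_je3 = %s::vector WHERE id = %s"

literal = to_pgvector_literal([0.5, -0.25, 1])
print(literal)
```

With psycopg2 this would be executed as `cur.execute(HYPOTHETICAL_UPDATE_SQL, (literal, row_id))`, where `row_id` stands for the primary key of the row being updated.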

### Quick tips
- If you hit **OOM** (out-of-memory) on the GPU, lower `--batch-size`.
- If Colab disconnects you, run the backfill in **chunks** (e.g., by ranges of `id`).
- Reuse the cache on Drive to avoid re-downloading large models.
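
The first two tips can be sketched in plain Python. `id_ranges` splits an id interval into inclusive chunks that can be run one at a time, and `run_with_backoff` halves the batch size whenever a batch runs out of memory. Both names are hypothetical, and `MemoryError` stands in for the GPU error so the sketch runs without PyTorch; in a real script you would catch `torch.cuda.OutOfMemoryError` instead.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) id ranges covering [min_id, max_id]."""
    if step <= 0:
        raise ValueError("step must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def run_with_backoff(process, items, batch_size, min_batch_size=1):
    """Apply process(batch) over items, halving batch_size after each MemoryError.

    Returns the collected results and the batch size that finally worked.
    """
    results = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(process(batch))
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        i += len(batch)
    return results, batch_size


def fake_encode(batch):
    """Toy stand-in for an encoder that cannot handle more than 2 items at once."""
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in batch]


chunks = list(id_ranges(1, 10, 4))
encoded, final_batch_size = run_with_backoff(fake_encode, list(range(7)), batch_size=8)
```

Running each chunk as a separate cell execution means a disconnect costs at most one chunk of work, provided the script can be restricted to an id range.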