# 02 - Pipeline Deployment - RAG OpenShift AI

## 🎯 Objective
Create, compile, and deploy the RAG pipeline on OpenShift AI Data Science Pipelines.
This notebook takes the components developed in the previous notebook and prepares them for a real deployment.

## 📋 What We Will Do
1. **Create the components as separate Python files**
2. **Define the real pipeline with imported components**
3. **Compile the pipeline to YAML**
4. **Configure MinIO with the required buckets**
5. **Deploy to OpenShift AI v2**
6. **Test the deployed pipeline**

## 🔧 Initial Setup

In [22]:
import os
import sys
import json
import yaml
from datetime import datetime
from pathlib import Path
import warnings
import stat
warnings.filterwarnings('ignore')

# Check for KFP and its dependencies
try:
    import kfp
    from kfp import dsl, compiler
    from kfp.client import Client
    from kfp.dsl import component, pipeline, Input, Output, Dataset
    print(f"✅ Kubeflow Pipelines available: {kfp.__version__}")
except ImportError:
    print("❌ Installing Kubeflow Pipelines...")
    # Quote the requirement: an unquoted ">=" is treated by the shell as an output redirect
    !pip install "kfp>=2.0.0"
    print("ℹ️ Re-run this cell once the installation finishes")

# Check for the MinIO client
try:
    from minio import Minio
    print("✅ MinIO client available")
except ImportError:
    print("❌ Installing MinIO client...")
    !pip install minio
    print("ℹ️ Re-run this cell once the installation finishes")

print("🚀 Deployment notebook started")
print(f"📁 Current directory: {os.getcwd()}")

✅ Kubeflow Pipelines available: 2.12.1
✅ MinIO client available
🚀 Deployment notebook started
📁 Current directory: /opt/app-root/src


## 📁 Create the File Structure for the Components

We create the directory layout and package files that will hold a separate Python file for each pipeline component.

In [23]:
def create_project_structure():
    """Create the project's directory and file structure"""
    
    # Define the directory structure
    directories = [
        "components",
        "pipelines",
        "webhook",
        "config",
        "deploy/minio",
        "deploy/elasticsearch",
        "deploy/openshift-ai",
        "deploy/webhook",
        "tests"
    ]
    
    print("📁 Creating directory structure...")
    
    # Create the directories
    for dir_path in directories:
        Path(dir_path).mkdir(parents=True, exist_ok=True)
        print(f"  ✅ {dir_path}/")
    
    # Create __init__.py files so the directories are importable Python packages
    init_files = [
        "components/__init__.py",
        "pipelines/__init__.py",
        "webhook/__init__.py"
    ]
    
    for init_file in init_files:
        Path(init_file).touch(exist_ok=True)
        print(f"  ✅ {init_file}")
    
    print("\n📋 Structure created:")
    print("  components/          # Pipeline components")
    print("  pipelines/           # Pipeline definitions")
    print("  webhook/             # Webhook handler")
    print("  config/              # Configuration files")
    print("  deploy/              # Deployment manifests")
    print("  tests/               # Integration tests")
    
    return directories

# Create the structure
created_dirs = create_project_structure()

📁 Creating directory structure...
  ✅ components/
  ✅ pipelines/
  ✅ webhook/
  ✅ config/
  ✅ deploy/minio/
  ✅ deploy/elasticsearch/
  ✅ deploy/openshift-ai/
  ✅ deploy/webhook/
  ✅ tests/
  ✅ components/__init__.py
  ✅ pipelines/__init__.py
  ✅ webhook/__init__.py

📋 Structure created:
  components/          # Pipeline components
  pipelines/           # Pipeline definitions
  webhook/             # Webhook handler
  config/              # Configuration files
  deploy/              # Deployment manifests
  tests/               # Integration tests


## 🔧 Component 1: Text Processing Components

We create the file that contains the text processing components.

In [24]:
def create_text_processing_components():
    """Create the file components/text_processing.py"""
    
    text_processing_content = '''"""
Text Processing Components for the RAG Pipeline
Includes: extract_text_component and chunk_text_component
Author: Carlos Estay (github: pkstaz)
"""

from kfp.dsl import component, Input, Output, Dataset

@component(
    base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
    packages_to_install=[
        "PyPDF2==3.0.1",
        "python-docx==0.8.11", 
        "minio==7.1.17",
        "chardet==5.2.0"
    ]
)
def extract_text_component(
    bucket_name: str,
    object_key: str,
    minio_endpoint: str,
    minio_access_key: str,
    minio_secret_key: str,
    extracted_text: Output[Dataset],
    metadata: Output[Dataset]
):
    """
    Extract text from documents stored in MinIO.
    
    Args:
        bucket_name: Name of the MinIO bucket
        object_key: Path of the file within the bucket
        minio_endpoint: MinIO endpoint
        minio_access_key: MinIO access key
        minio_secret_key: MinIO secret key
        extracted_text: Output dataset with the extracted text
        metadata: Output dataset with the document metadata
    """
    import os
    import json
    import tempfile
    from pathlib import Path
    from datetime import datetime
    from minio import Minio
    import PyPDF2
    from docx import Document
    import chardet
    
    # Connect to MinIO
    minio_client = Minio(
        minio_endpoint,
        access_key=minio_access_key,
        secret_key=minio_secret_key,
        secure=False  # plain HTTP; set to True when MinIO is served over TLS
    )
    
    # Create a temporary directory
    with tempfile.TemporaryDirectory() as temp_dir:
        local_file_path = os.path.join(temp_dir, object_key.split('/')[-1])
        
        # Download the file from MinIO
        try:
            minio_client.fget_object(bucket_name, object_key, local_file_path)
            print(f"✅ File downloaded: {local_file_path}")
        except Exception as e:
            raise Exception(f"Error downloading file: {str(e)}") from e
        
        # Detect the file type
        file_extension = Path(local_file_path).suffix.lower()
        file_size = os.path.getsize(local_file_path)
        
        # Extract the text according to the file type
        extracted_content = ""
        
        if file_extension == '.pdf':
            try:
                with open(local_file_path, 'rb') as file:
                    pdf_reader = PyPDF2.PdfReader(file)
                    # Count the pages while the file is still open
                    page_count = len(pdf_reader.pages)
                    for page in pdf_reader.pages:
                        extracted_content += page.extract_text() + "\\n"
                print(f"✅ PDF processed: {page_count} pages")
            except Exception as e:
                raise Exception(f"Error processing PDF: {str(e)}") from e
                
        elif file_extension == '.docx':
            try:
                doc = Document(local_file_path)
                for paragraph in doc.paragraphs:
                    extracted_content += paragraph.text + "\\n"
                print(f"✅ DOCX processed: {len(doc.paragraphs)} paragraphs")
            except Exception as e:
                raise Exception(f"Error processing DOCX: {str(e)}") from e
                
        elif file_extension in ['.txt', '.md']:
            try:
                # Detect the encoding
                with open(local_file_path, 'rb') as file:
                    raw_data = file.read()
                    # chardet reports None when it cannot guess; fall back to UTF-8
                    encoding = chardet.detect(raw_data)['encoding'] or 'utf-8'
                
                # Read with the detected encoding
                with open(local_file_path, 'r', encoding=encoding) as file:
                    extracted_content = file.read()
                print(f"✅ Text file processed with encoding: {encoding}")
            except Exception as e:
                raise Exception(f"Error processing text file: {str(e)}") from e
                
        else:
            raise Exception(f"Unsupported file type: {file_extension}")
        
        # Validate that some content was extracted
        if not extracted_content.strip():
            raise Exception("Could not extract any text from the document")
        
        # Prepare the metadata
        document_metadata = {
            "source_file": object_key,
            "file_type": file_extension,
            "file_size": file_size,
            "processed_at": datetime.now().isoformat(),
            "char_count": len(extracted_content),
            "word_count": len(extracted_content.split()),
            "bucket_name": bucket_name
        }
        
        # Save the outputs
        with open(extracted_text.path, 'w', encoding='utf-8') as f:
            f.write(extracted_content)
            
        with open(metadata.path, 'w', encoding='utf-8') as f:
            json.dump(document_metadata, f, indent=2)
        
        print(f"✅ Text extracted: {len(extracted_content)} characters")


@component(
    base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
    packages_to_install=[
        "tiktoken==0.5.1",
        "langchain==0.0.350"
    ]
)
def chunk_text_component(
    extracted_text: Input[Dataset],
    metadata: Input[Dataset],
    chunk_size: int,
    chunk_overlap: int,
    chunks: Output[Dataset]
):
    """
    Split the text into overlapping chunks for downstream processing.
    
    Args:
        extracted_text: Input dataset with the extracted text
        metadata: Input dataset with the document metadata
        chunk_size: Target chunk size in tokens (applied approximately, at about 4 characters per token)
        chunk_overlap: Overlap between chunks in tokens (same approximation)
        chunks: Output dataset with the processed chunks
    """
    import json
    import tiktoken
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    # Read the input data
    with open(extracted_text.path, 'r', encoding='utf-8') as f:
        text_content = f.read()
    
    with open(metadata.path, 'r', encoding='utf-8') as f:
        doc_metadata = json.load(f)
    
    # Configure the tokenizer
    encoding = tiktoken.get_encoding("cl100k_base")
    
    def count_tokens(text: str) -> int:
        return len(encoding.encode(text))
    
    # Configure the text splitter. Sizes are converted from tokens to characters
    # using the approximation 1 token ≈ 4 characters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size * 4,
        chunk_overlap=chunk_overlap * 4,
        length_function=len,
        separators=["\\n\\n", "\\n", ". ", " ", ""]
    )
    
    # Split the text into chunks
    text_chunks = text_splitter.split_text(text_content)
    print(f"✅ Text split into {len(text_chunks)} chunks")
    
    # Process each chunk
    processed_chunks = []
    
    for i, chunk_text in enumerate(text_chunks):
        token_count = count_tokens(chunk_text)
        
        chunk_metadata = {
            "chunk_id": f"{doc_metadata['source_file']}_chunk_{i:04d}",
            "chunk_index": i,
            "total_chunks": len(text_chunks),
            "text": chunk_text.strip(),
            "token_count": token_count,
            "char_count": len(chunk_text),
            "word_count": len(chunk_text.split()),
            "source_document": doc_metadata['source_file'],
            "file_type": doc_metadata['file_type'],
            "processed_at": doc_metadata['processed_at']
        }
        
        processed_chunks.append(chunk_metadata)
    
    # Drop very small chunks (fewer than 10 tokens).
    # Note: chunk_index and total_chunks keep their pre-filter values
    processed_chunks = [chunk for chunk in processed_chunks if chunk['token_count'] >= 10]
    
    print(f"✅ Chunks processed: {len(processed_chunks)}")
    
    # Save the chunks
    with open(chunks.path, 'w', encoding='utf-8') as f:
        json.dump(processed_chunks, f, indent=2, ensure_ascii=False)
'''
    
    # Write the file
    with open('components/text_processing.py', 'w', encoding='utf-8') as f:
        f.write(text_processing_content)
    
    print("✅ File created: components/text_processing.py")
    print("📋 Components included:")
    print("  - extract_text_component")
    print("  - chunk_text_component")

# Create the text processing file
create_text_processing_components()

✅ File created: components/text_processing.py
📋 Components included:
  - extract_text_component
  - chunk_text_component
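
The chunking component above sizes its chunks in characters (about four per token) and only counts tokens afterwards. As a rough illustration of what overlap means when chunks are measured directly in tokens, the sketch below slides a fixed-size window over a token list. It is a simplified stand-in, not the `RecursiveCharacterTextSplitter` algorithm: whitespace-separated words play the role of tokens, and `sliding_window_chunks` is a hypothetical helper name.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return windows


# Whitespace-separated words stand in for tokens (assumption for illustration only)
words = "a b c d e f g h i j".split()
for window in sliding_window_chunks(words, chunk_size=4, chunk_overlap=1):
    print(window)
```

With `chunk_size=4` and `chunk_overlap=1`, each window starts three tokens after the previous one, so the last token of one window is repeated as the first token of the next.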


## 🎯 Component 2: Vector Processing Components

We create the file that contains the embedding and indexing components.

In [27]:
def create_vector_processing_components():
    """Create the file components/vector_processing.py"""
    
    vector_processing_content = '''"""
Vector Processing Components for the RAG Pipeline
Includes: generate_embeddings_component and index_elasticsearch_component
Author: Carlos Estay (github: pkstaz)
"""

from kfp.dsl import component, Input, Output, Dataset

@component(
    base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
    packages_to_install=[
        "sentence-transformers==2.2.2",
        "numpy==1.24.3"
    ]
)
def generate_embeddings_component(
    chunks: Input[Dataset],
    model_name: str,
    embeddings: Output[Dataset]
):
    """
    Generate vector embeddings for the text chunks.
    
    Args:
        chunks: Input dataset with the text chunks
        model_name: Name of the embedding model
        embeddings: Output dataset with the generated embeddings
    """
    import json
    import numpy as np
    from sentence_transformers import SentenceTransformer
    import torch
    
    # Check whether a GPU is available
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"🖥️ Using device: {device}")
    
    # Load the embedding model
    print(f"📥 Loading model: {model_name}")
    model = SentenceTransformer(model_name, device=device)
    
    # Read the chunks
    with open(chunks.path, 'r', encoding='utf-8') as f:
        chunk_data = json.load(f)
    
    # Fail early with a clear message instead of an IndexError further down
    if not chunk_data:
        raise Exception("No chunks to embed")
    
    print(f"📝 Processing {len(chunk_data)} chunks")
    
    # Extract the texts to embed
    texts = [chunk['text'] for chunk in chunk_data]
    
    # Generate the embeddings in batches for efficiency
    batch_size = 32
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        batch_embeddings = model.encode(
            batch_texts,
            convert_to_numpy=True,
            show_progress_bar=(i == 0),
            normalize_embeddings=True
        )
        all_embeddings.extend(batch_embeddings)
        
        if i % (batch_size * 5) == 0:
            print(f"  Processed: {min(i + batch_size, len(texts))}/{len(texts)} chunks")
    
    print(f"✅ Embeddings generated: {len(all_embeddings)} vectors of {len(all_embeddings[0])} dimensions")
    
    # Combine the chunks with their embeddings
    enriched_chunks = []
    for chunk, embedding in zip(chunk_data, all_embeddings):
        enriched_chunk = chunk.copy()
        enriched_chunk['embedding'] = embedding.tolist()
        enriched_chunk['embedding_dim'] = len(embedding)
        enriched_chunk['embedding_model'] = model_name
        enriched_chunks.append(enriched_chunk)
    
    # Save the chunks enriched with embeddings
    with open(embeddings.path, 'w', encoding='utf-8') as f:
        json.dump(enriched_chunks, f, indent=2, ensure_ascii=False)
    
    print("✅ Enriched chunks saved")


@component(
    base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
    packages_to_install=[
        "elasticsearch==8.11.0"
    ]
)
def index_elasticsearch_component(
    enriched_chunks: Input[Dataset],
    es_endpoint: str,
    es_index: str,
    index_status: Output[Dataset]
):
    """
    Index the enriched chunks in Elasticsearch.
    
    Args:
        enriched_chunks: Input dataset with the chunks and embeddings
        es_endpoint: Elasticsearch endpoint
        es_index: Name of the index
        index_status: Output dataset with the indexing status
    """
    import json
    from datetime import datetime
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk
    
    # Connect to Elasticsearch.
    # The 8.x client expects a full URL including the scheme, e.g. http://elasticsearch:9200
    try:
        es = Elasticsearch([es_endpoint], verify_certs=False)
        
        if not es.ping():
            raise Exception("Cannot connect to Elasticsearch")
        
        print(f"✅ Connected to Elasticsearch: {es_endpoint}")
    except Exception as e:
        raise Exception(f"Error connecting to Elasticsearch: {str(e)}") from e
    
    # Read the enriched chunks
    with open(enriched_chunks.path, 'r', encoding='utf-8') as f:
        chunks_data = json.load(f)
    
    print(f"📝 Indexing {len(chunks_data)} chunks into index: {es_index}")
    
    # Define the index mapping
    index_mapping = {
        "mappings": {
            "properties": {
                "chunk_id": {"type": "keyword"},
                "text": {
                    "type": "text",
                    "analyzer": "standard"
                },
                "embedding": {
                    "type": "dense_vector",
                    "dims": chunks_data[0]['embedding_dim'] if chunks_data else 384,
                    "index": True,
                    "similarity": "cosine"
                },
                "source_document": {"type": "keyword"},
                "file_type": {"type": "keyword"},
                "chunk_index": {"type": "integer"},
                "total_chunks": {"type": "integer"},
                "token_count": {"type": "integer"},
                "char_count": {"type": "integer"},
                "word_count": {"type": "integer"},
                "processed_at": {"type": "date"},
                "indexed_at": {"type": "date"}
            }
        },
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    }
    
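    # Create the index if it does not exist
    # (mappings/settings are passed as keyword arguments; body= is deprecated in the 8.x client)
    if not es.indices.exists(index=es_index):
        es.indices.create(
            index=es_index,
            mappings=index_mapping["mappings"],
            settings=index_mapping["settings"]
        )
        print(f"✅ Index created: {es_index}")
    else:
        print(f"ℹ️ Index already exists: {es_index}")
    
    # Prepare the documents for bulk indexing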
    documents = []
    for chunk in chunks_data:
        doc = {
            "_index": es_index,
            "_id": chunk['chunk_id'],
            "_source": {
                **chunk,
                "indexed_at": datetime.now().isoformat()
            }
        }
        documents.append(doc)
    
    # Index in batches
    try:
        success_count, failed_items = bulk(
            es,
            documents,
            chunk_size=100,
            request_timeout=300
        )
        
        print("✅ Indexing completed:")
        print(f"  Successful documents: {success_count}")
        print(f"  Failed documents: {len(failed_items)}")
        
    except Exception as e:
        raise Exception(f"Error during bulk indexing: {str(e)}") from e
    
    # Refresh the index
    es.indices.refresh(index=es_index)
    
    # Verify the indexing
    doc_count = es.count(index=es_index)['count']
    print(f"✅ Total documents in index: {doc_count}")
    
    # Prepare the indexing status
    indexing_status = {
        "index_name": es_index,
        "total_chunks": len(chunks_data),
        "indexed_chunks": success_count,
        "failed_chunks": len(failed_items),
        "total_documents_in_index": doc_count,
        "indexed_at": datetime.now().isoformat(),
        "success": not failed_items
    }
    
    # Save the status
    with open(index_status.path, 'w', encoding='utf-8') as f:
        json.dump(indexing_status, f, indent=2)
    
    print("✅ Indexing status saved")
'''
    
    # Write the file
    with open('components/vector_processing.py', 'w', encoding='utf-8') as f:
        f.write(vector_processing_content)
    
    print("✅ File created: components/vector_processing.py")
    print("📋 Components included:")
    print("  - generate_embeddings_component")
    print("  - index_elasticsearch_component")

# Create the vector processing file
create_vector_processing_components()

✅ File created: components/vector_processing.py
📋 Components included:
  - generate_embeddings_component
  - index_elasticsearch_component
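
The embeddings component passes `normalize_embeddings=True`, and the index mapping declares `"similarity": "cosine"`. For unit-length vectors, cosine similarity reduces to a plain dot product, which the following standard-library sketch illustrates. The helper names and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings from the model used here are much longer.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a query embedding and a chunk embedding
query = [1.0, 2.0, 2.0]
chunk = [2.0, 1.0, 2.0]

cos = cosine_similarity(query, chunk)
dot_of_normalized = dot(l2_normalize(query), l2_normalize(chunk))
print(math.isclose(cos, dot_of_normalized))
```

Because the vectors are normalized once at embedding time, the per-query work at search time is only the dot product.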


## 🔗 Pipeline Definition: Main RAG Pipeline

We create the main pipeline that orchestrates all the components.

In [28]:
def create_rag_pipeline():
    """Create the file pipelines/rag_pipeline.py"""
    
    rag_pipeline_content = '''"""
Main RAG Pipeline - OpenShift AI Data Science Pipeline
Orchestrates all the components for end-to-end document processing
Author: Carlos Estay (github: pkstaz)
"""

from kfp import dsl
from kfp.dsl import pipeline

# Import the components
from components.text_processing import extract_text_component, chunk_text_component
from components.vector_processing import generate_embeddings_component, index_elasticsearch_component

@pipeline(
    name="rag-document-processing-v1",
    description="Complete RAG document processing pipeline for OpenShift AI"
)
def rag_document_pipeline(
    bucket_name: str = "raw-documents",
    object_key: str = "",
    minio_endpoint: str = "minio:9000",
    minio_access_key: str = "minio",
    minio_secret_key: str = "minio123",
    es_endpoint: str = "http://elasticsearch:9200",
    es_index: str = "rag-documents",
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
):
    """
    Complete RAG document processing pipeline.
    
    Args:
        bucket_name: Name of the MinIO bucket
        object_key: Path of the file to process
        minio_endpoint: MinIO endpoint
        minio_access_key: MinIO access key
        minio_secret_key: MinIO secret key
        es_endpoint: Elasticsearch endpoint, as a full URL including the scheme
        es_index: Name of the Elasticsearch index
        chunk_size: Chunk size in tokens
        chunk_overlap: Overlap between chunks in tokens
        embedding_model: Model used to generate the embeddings
    
    Returns:
        Final indexing status
    """
    
    # Step 1: Extract text from document
    extract_task = extract_text_component(
        bucket_name=bucket_name,
        object_key=object_key,
        minio_endpoint=minio_endpoint,
        minio_access_key=minio_access_key,
        minio_secret_key=minio_secret_key
    )
    extract_task.set_display_name("📄 Extract Text")
    extract_task.set_cpu_limit("500m")
    extract_task.set_memory_limit("1Gi")
    extract_task.set_retry(3)
    
    # Step 2: Chunk the extracted text
    chunk_task = chunk_text_component(
        extracted_text=extract_task.outputs['extracted_text'],
        metadata=extract_task.outputs['metadata'],
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    ).after(extract_task)
    chunk_task.set_display_name("🧩 Chunk Text")
    chunk_task.set_cpu_limit("500m")
    chunk_task.set_memory_limit("1Gi")
    chunk_task.set_retry(3)
    
    # Step 3: Generate embeddings
    embedding_task = generate_embeddings_component(
        chunks=chunk_task.outputs['chunks'],
        model_name=embedding_model
    ).after(chunk_task)
    embedding_task.set_display_name("🎯 Generate Embeddings")
    embedding_task.set_cpu_limit("1000m")
    embedding_task.set_memory_limit("4Gi")
    embedding_task.set_retry(2)
    # If GPUs are available, uncomment the next line (set_gpu_limit is deprecated in KFP v2):
    # embedding_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    
    # Step 4: Index in ElasticSearch
    index_task = index_elasticsearch_component(
        enriched_chunks=embedding_task.outputs['embeddings'],
        es_endpoint=es_endpoint,
        es_index=es_index
    ).after(embedding_task)
    index_task.set_display_name("🔍 Index ElasticSearch")
    index_task.set_cpu_limit("500m")
    index_task.set_memory_limit("2Gi")
    index_task.set_retry(3)
    
    # Return final status
    return index_task.outputs['index_status']


if __name__ == "__main__":
    # For local testing of the pipeline definition
    print("✅ RAG pipeline defined correctly")
    print("📋 Available pipeline functions:")
    print("  - rag_document_pipeline: Single-document processing")
'''
    
    # Write the file
    with open('pipelines/rag_pipeline.py', 'w', encoding='utf-8') as f:
        f.write(rag_pipeline_content)
    
    print("✅ File created: pipelines/rag_pipeline.py")
    print("📋 Pipeline function included:")
    print("  - rag_document_pipeline: Main pipeline")

# Create the pipeline file
create_rag_pipeline()

✅ File created: pipelines/rag_pipeline.py
📋 Pipeline function included:
  - rag_document_pipeline: Main pipeline
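
The pipeline signature leaves `object_key` empty by default and takes `chunk_size` and `chunk_overlap` as plain integers, so an invalid combination would only surface once a run is already executing. A small check before submitting a run can catch this earlier. The sketch below is one way such a check might look; `validate_run_arguments` is a hypothetical helper (not part of KFP or of this project), and `docs/manual.pdf` is a made-up object key.

```python
from typing import Any, Dict, List


def validate_run_arguments(arguments: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the run arguments (empty if none)."""
    problems = []

    # The pipeline default for object_key is "", which cannot name a file
    if not arguments.get("object_key"):
        problems.append("object_key must name a file in the bucket")

    # Fall back to the pipeline defaults when a value is not supplied
    chunk_size = arguments.get("chunk_size", 512)
    chunk_overlap = arguments.get("chunk_overlap", 50)

    if chunk_size <= 0:
        problems.append("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        problems.append("chunk_overlap must be in [0, chunk_size)")

    return problems


good = {"bucket_name": "raw-documents", "object_key": "docs/manual.pdf",
        "chunk_size": 512, "chunk_overlap": 50}
bad = {"chunk_size": 100, "chunk_overlap": 100}

print(validate_run_arguments(good))
print(validate_run_arguments(bad))
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue at once.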


## ⚙️ Configuration Files

We create the configuration files needed for the deployment.

In [30]:
def create_configuration_files():
    """Create the project's configuration files"""
    
    # 1. Pipeline Configuration
    pipeline_config = {
        "pipeline": {
            "name": "rag-document-processing-v1",
            "version": "1.0.0",
            "description": "RAG Document Processing Pipeline for OpenShift AI",
            "author": "Carlos Estay",
            "github": "pkstaz",
            "created": datetime.now().isoformat()
        },
        "components": {
            "base_image": "pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
            "extract_text": {
                "cpu_limit": "500m",
                "memory_limit": "1Gi",
                "retry_limit": 3
            },
            "chunk_text": {
                "cpu_limit": "500m", 
                "memory_limit": "1Gi",
                "retry_limit": 3
            },
            "generate_embeddings": {
                "cpu_limit": "1000m",
                "memory_limit": "4Gi",
                "retry_limit": 2,
                "gpu_limit": "0"  # Set to "1" if a GPU is available
            },
            "index_elasticsearch": {
                "cpu_limit": "500m",
                "memory_limit": "2Gi",
                "retry_limit": 3
            }
        },
        "default_parameters": {
            "chunk_size": 512,
            "chunk_overlap": 50,
            "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
            "es_index": "rag-documents"
        },
        "storage": {
            "minio": {
                "endpoint": "minio:9000",
                "access_key": "minio",
                "secret_key": "minio123",
                "bucket_raw": "raw-documents",
                "bucket_processed": "processed-documents",
                "bucket_failed": "failed-documents",
                "bucket_pipeline": "pipeline"
            },
            "elasticsearch": {
                "endpoint": "http://elasticsearch:9200",
                "index_prefix": "rag-",
                "replicas": 0,
                "shards": 1
            }
        }
    }
    
    with open('config/pipeline_config.yaml', 'w') as f:
        yaml.dump(pipeline_config, f, indent=2, default_flow_style=False)
    
    print("✅ Created: config/pipeline_config.yaml")
    
    # 2. Secrets Template
    # Placeholder values for development; replace them before applying to a cluster
    secrets_template = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": "rag-pipeline-secrets",
            "namespace": "rag-openshift-ai",
            "labels": {
                "app": "rag-pipeline",
                "author": "carlos-estay",
                "github": "pkstaz"
            }
        },
        "type": "Opaque",
        "stringData": {
            "minio-access-key": "minio",
            "minio-secret-key": "minio123",
            "elasticsearch-username": "",
            "elasticsearch-password": "",
            "pipeline-webhook-token": "rag-pipeline-token-change-me"
        }
    }
    
    with open('config/secrets.yaml', 'w') as f:
        yaml.dump(secrets_template, f, indent=2, default_flow_style=False)
    
    print("✅ Created: config/secrets.yaml")
    
    # 3. Requirements file
    requirements_content = '''# RAG Pipeline Requirements
# Author: Carlos Estay (github: pkstaz)
# Core KFP
kfp>=2.0.0

# Document Processing
PyPDF2==3.0.1
python-docx==0.8.11
chardet==5.2.0

# Text Processing
tiktoken==0.5.1
langchain==0.0.350

# ML/Embeddings
sentence-transformers==2.2.2
torch>=2.0.1
numpy==1.24.3

# Storage/DB
minio==7.1.17
elasticsearch==8.11.0

# Web/API (for webhook)
flask==2.3.3
requests==2.31.0

# Utilities
pyyaml==6.0.1
python-dateutil==2.8.2
'''
    
    with open('requirements.txt', 'w') as f:
        f.write(requirements_content)
    
    print("✅ Created: requirements.txt")
    
    # 4. Environment variables template
    env_template = '''# RAG Pipeline Environment Variables
# Author: Carlos Estay (github: pkstaz)
# Copy to .env and modify as needed

# MinIO Configuration
MINIO_ENDPOINT=minio:9000
MINIO_ACCESS_KEY=minio
MINIO_SECRET_KEY=minio123
MINIO_SECURE=false

# Elasticsearch Configuration
ES_ENDPOINT=http://elasticsearch:9200
ES_INDEX=rag-documents
ES_USERNAME=
ES_PASSWORD=

# Pipeline Configuration
CHUNK_SIZE=512
CHUNK_OVERLAP=50
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# OpenShift AI Pipeline
KFP_HOST=http://ml-pipeline:8888
PIPELINE_NAMESPACE=rag-openshift-ai

# Webhook Configuration
WEBHOOK_PORT=8080
WEBHOOK_TOKEN=rag-pipeline-token-change-me
'''
    
    with open('config/env.template', 'w') as f:
        f.write(env_template)
    
    print("✅ Created: config/env.template")
    
    print("\n📋 Configuration files created:")
    print("  config/pipeline_config.yaml - Pipeline configuration")
    print("  config/secrets.yaml - K8s secrets template")
    print("  requirements.txt - Python dependencies")
    print("  config/env.template - Environment variables")

# Create the configuration files
create_configuration_files()

✅ Created: config/pipeline_config.yaml
✅ Created: config/secrets.yaml
✅ Created: requirements.txt
✅ Created: config/env.template

📋 Configuration files created:
  config/pipeline_config.yaml - Pipeline configuration
  config/secrets.yaml - K8s secrets template
  requirements.txt - Python dependencies
  config/env.template - Environment variables
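
The files above repeat the same MinIO and pipeline defaults in several places. One way to consume `config/env.template` at runtime is to read each value from the process environment and fall back to the template's default when it is not set. The sketch below assumes the variable names from the template; `Settings` and `load_settings` are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    minio_endpoint: str
    minio_secure: bool
    es_index: str
    chunk_size: int
    chunk_overlap: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, using the template defaults as fallbacks."""
    env = os.environ if env is None else env
    return Settings(
        minio_endpoint=env.get("MINIO_ENDPOINT", "minio:9000"),
        # Environment values are strings, so booleans and integers are parsed explicitly
        minio_secure=env.get("MINIO_SECURE", "false").strip().lower() == "true",
        es_index=env.get("ES_INDEX", "rag-documents"),
        chunk_size=int(env.get("CHUNK_SIZE", "512")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "50")),
    )


# Passing a dict instead of os.environ keeps the example reproducible
settings = load_settings({"CHUNK_SIZE": "256", "MINIO_SECURE": "TRUE"})
print(settings)
```

Credentials are left out of the sketch on purpose; in a cluster they would more naturally come from the Kubernetes Secret than from defaults in code.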


## 🪣 MinIO Configuration - Create Buckets

We configure MinIO and create all the required buckets, including the "pipeline" bucket.

In [32]:
def setup_minio_buckets_complete():
    """Configure MinIO and create all the required buckets, including 'pipeline'"""
    
    print("🔧 Configurando MinIO y creando buckets completos...")
    
    # Configuración MinIO
    MINIO_ENDPOINT = "minio:9000"
    MINIO_ACCESS_KEY = "minio"
    MINIO_SECRET_KEY = "minio123"
    
    try:
        from minio import Minio
        from minio.error import S3Error
        
        # Conectar a MinIO
        print(f"🔗 Conectando a MinIO: {MINIO_ENDPOINT}")
        
        minio_client = Minio(
            MINIO_ENDPOINT,
            access_key=MINIO_ACCESS_KEY,
            secret_key=MINIO_SECRET_KEY,
            secure=False  # HTTP para desarrollo
        )
        
        # Verificar conectividad
        try:
            buckets = minio_client.list_buckets()
            print(f"✅ Conectado a MinIO exitosamente")
            print(f"📊 Buckets existentes: {len(buckets)}")
            for bucket in buckets:
                print(f"  - {bucket.name} (creado: {bucket.creation_date})")
        except Exception as e:
            print(f"❌ Error conectando a MinIO: {str(e)}")
            return False
        
        # Buckets necesarios para RAG pipeline + OpenShift AI DSP
        required_buckets = [
            {
                "name": "pipeline",
                "description": "Bucket requerido por OpenShift AI Data Science Pipelines"
            },
            {
                "name": "raw-documents",
                "description": "Documentos originales subidos por usuarios"
            },
            {
                "name": "processed-documents", 
                "description": "Documentos procesados exitosamente"
            },
            {
                "name": "failed-documents",
                "description": "Documentos que fallaron en procesamiento"
            },
            {
                "name": "test-datasets",
                "description": "Datasets de prueba para desarrollo"
            }
        ]
        
        print(f"\n🪣 Creando buckets necesarios...")
        
        created_buckets = []
        existing_buckets = []
        
        for bucket_info in required_buckets:
            bucket_name = bucket_info["name"]
            description = bucket_info["description"]
            
            try:
                # Verificar si bucket ya existe
                if minio_client.bucket_exists(bucket_name):
                    existing_buckets.append(bucket_name)
                    print(f"  ℹ️ Bucket ya existe: {bucket_name}")
                else:
                    # Crear bucket
                    minio_client.make_bucket(bucket_name)
                    created_buckets.append(bucket_name)
                    print(f"  ✅ Bucket creado: {bucket_name}")
                
                print(f"    📝 {description}")
                    
            except S3Error as e:
                print(f"  ❌ Error con bucket {bucket_name}: {str(e)}")
        
        print(f"\n📊 Resumen de buckets:")
        print(f"  ✅ Creados: {len(created_buckets)}")
        print(f"  ℹ️ Ya existían: {len(existing_buckets)}")
        
        # Listar todos los buckets después de la configuración
        final_buckets = minio_client.list_buckets()
        print(f"\n🗂️ Buckets disponibles:")
        for bucket in final_buckets:
            print(f"  - {bucket.name}")
        
        # Verificar bucket crítico para OpenShift AI
        if minio_client.bucket_exists("pipeline"):
            print(f"\n✅ Bucket 'pipeline' configurado correctamente para OpenShift AI DSP")
        else:
            print(f"\n❌ Bucket 'pipeline' no se pudo crear - requerido para OpenShift AI DSP")
            return False
        
        return True
        
    except ImportError:
        print("❌ MinIO client no disponible. Instalar con: pip install minio")
        return False
    except Exception as e:
        print(f"❌ Error configurando MinIO: {str(e)}")
        return False

def upload_test_document():
    """Subir documento de prueba al bucket raw-documents"""
    
    print(f"\n📄 Subiendo documento de prueba...")
    
    try:
        from minio import Minio
        import tempfile
        
        # Conectar a MinIO
        minio_client = Minio(
            "minio:9000",
            access_key="minio",
            secret_key="minio123",
            secure=False
        )
        
        # Crear documento de prueba más detallado
        test_content = """# Documento de Prueba RAG Pipeline - OpenShift AI
        
Este es un documento de prueba para validar el pipeline RAG en OpenShift AI Data Science Pipelines.

## Configuración del Ambiente

### MinIO Buckets Configurados:
- **pipeline**: Bucket requerido por OpenShift AI DSP
- **raw-documents**: Documentos originales 
- **processed-documents**: Documentos procesados exitosamente
- **failed-documents**: Documentos con errores de procesamiento
- **test-datasets**: Datasets para desarrollo y pruebas

### Pipeline de Procesamiento RAG

El sistema procesa documentos siguiendo estos pasos:

1. **Extracción de texto**: 
   - Soporta PDF, DOCX, TXT
   - Detecta encoding automáticamente
   - Extrae metadatos del documento

2. **Chunking inteligente**:
   - Chunks de 512 tokens con 50 de overlap
   - Usa tiktoken para conteo preciso
   - Mantiene contexto entre fragmentos

3. **Generación de embeddings**:
   - Modelo: sentence-transformers/all-MiniLM-L6-v2
   - Vectores de 384 dimensiones
   - Normalización para cosine similarity

4. **Indexación en ElasticSearch**:
   - Índice híbrido (texto + vectores)
   - Búsqueda semántica y por palabras clave
   - Metadatos preservados para trazabilidad

## Información del Proyecto

- **Autor**: Carlos Estay
- **GitHub**: pkstaz  
- **Pipeline**: rag-document-processing-v1
- **Platform**: OpenShift AI Data Science Pipelines

Este documento será procesado automáticamente cuando el webhook detecte 
su upload al bucket raw-documents.

## Testing del Pipeline

Para validar el funcionamiento:

1. Este documento se procesa automáticamente
2. Se generan aproximadamente 4-6 chunks
3. Cada chunk obtiene un embedding de 384 dimensiones  
4. Se indexa en ElasticSearch con ID único
5. Está disponible para búsqueda semántica

¡Pipeline RAG funcionando en OpenShift AI!
"""
        
        # Crear archivo temporal
        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f:
            f.write(test_content)
            temp_file_path = f.name
        
        # Subir archivo a MinIO
        object_name = "test-document-openshift-ai.txt"
        bucket_name = "raw-documents"
        
        try:
            minio_client.fput_object(
                bucket_name=bucket_name,
                object_name=object_name,
                file_path=temp_file_path,
                content_type="text/plain"
            )
            
            print(f"✅ Documento subido: {bucket_name}/{object_name}")
            print(f"📏 Tamaño: {len(test_content)} caracteres")
            print(f"📊 Palabras aproximadas: {len(test_content.split())}")
            
            # Verify that the upload succeeded
            try:
                # Named obj_stat to avoid shadowing the stdlib `stat` module imported at the top
                obj_stat = minio_client.stat_object(bucket_name, object_name)
                print(f"📋 Verification:")
                print(f"  Size: {obj_stat.size} bytes")
                print(f"  Content-Type: {obj_stat.content_type}")
                print(f"  Last-Modified: {obj_stat.last_modified}")
            except Exception as e:
                print(f"⚠️ Could not verify the file: {str(e)}")
            
            return {
                "bucket": bucket_name,
                "object": object_name,
                "size": len(test_content),
                "word_count": len(test_content.split()),
                "success": True
            }
            
        except Exception as e:
            print(f"❌ Error subiendo archivo: {str(e)}")
            return None
            
        finally:
            # Limpiar archivo temporal
            import os
            if os.path.exists(temp_file_path):
                os.unlink(temp_file_path)
        
    except Exception as e:
        print(f"❌ Error en upload de documento de prueba: {str(e)}")
        return None

# Ejecutar configuración completa de MinIO
print("🚀 Configuración completa de MinIO para OpenShift AI DSP")
print("=" * 60)

minio_setup_success = setup_minio_buckets_complete()

# Subir documento de prueba si MinIO está configurado
if minio_setup_success:
    test_upload_result = upload_test_document()
    
    if test_upload_result:
        print(f"\n🎯 MINIO CONFIGURADO PARA OPENSHIFT AI DSP:")
        print(f"  ✅ Bucket 'pipeline' creado (requerido por OpenShift AI)")
        print(f"  ✅ Buckets RAG creados (raw-documents, processed-documents, etc.)")
        print(f"  ✅ Documento de prueba subido: {test_upload_result['object']}")
        print(f"  📊 Tamaño: {test_upload_result['word_count']} palabras")
        print(f"\n🚀 LISTO PARA CONFIGURAR OPENSHIFT AI DSP")
    else:
        print(f"\n⚠️ MinIO configurado pero falta documento de prueba")
else:
    print(f"\n❌ Configurar MinIO antes de continuar")

🚀 Configuración completa de MinIO para OpenShift AI DSP
🔧 Configurando MinIO y creando buckets completos...
🔗 Conectando a MinIO: minio:9000
✅ Conectado a MinIO exitosamente
📊 Buckets existentes: 6
  - failed-documents (creado: 2025-07-01 17:40:26.746000+00:00)
  - pipeline (creado: 2025-07-01 17:46:35.035000+00:00)
  - processed-documents (creado: 2025-07-01 17:40:26.739000+00:00)
  - rag-documents (creado: 2025-07-01 16:45:51.180000+00:00)
  - raw-documents (creado: 2025-07-01 17:40:26.732000+00:00)
  - test-datasets (creado: 2025-07-01 17:40:26.753000+00:00)

🪣 Creando buckets necesarios...
  ℹ️ Bucket ya existe: pipeline
    📝 Bucket requerido por OpenShift AI Data Science Pipelines
  ℹ️ Bucket ya existe: raw-documents
    📝 Documentos originales subidos por usuarios
  ℹ️ Bucket ya existe: processed-documents
    📝 Documentos procesados exitosamente
  ℹ️ Bucket ya existe: failed-documents
    📝 Documentos que fallaron en procesamiento
  ℹ️ Bucket ya existe: test-datasets
    📝 Data
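
The two functions above each hard-code the MinIO endpoint and credentials. A minimal sketch of reading them once from environment variables, assuming the variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `MINIO_SECURE` (hypothetical; the notebook does not define them) and falling back to the development defaults used in this notebook:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MinioSettings:
    endpoint: str
    access_key: str
    secret_key: str
    secure: bool


def load_minio_settings(env=None):
    """Read MinIO settings from a mapping (defaults to os.environ).

    The variable names are assumptions for this sketch; the fallbacks are
    the development values hard-coded in the notebook.
    """
    env = os.environ if env is None else env
    secure_raw = env.get("MINIO_SECURE", "false").strip().lower()
    return MinioSettings(
        endpoint=env.get("MINIO_ENDPOINT", "minio:9000"),
        access_key=env.get("MINIO_ACCESS_KEY", "minio"),
        secret_key=env.get("MINIO_SECRET_KEY", "minio123"),
        secure=secure_raw in {"1", "true", "yes"},
    )


# Example with an explicit mapping instead of the real environment
settings = load_minio_settings({"MINIO_ENDPOINT": "minio.example.internal:9000"})
print(settings.endpoint, settings.secure)
```

The resulting values could then be passed to `Minio(settings.endpoint, access_key=settings.access_key, secret_key=settings.secret_key, secure=settings.secure)` in both functions, so the connection details live in one place.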

## 🔨 Pipeline Compilation - KFP 2.12.1 Compatible

This cell compiles a simplified two-step test pipeline (simulated extraction and indexing) with the KFP 2.12.1 SDK to confirm that compilation works.

In [34]:
def compile_standalone_pipeline():
    """Compilar pipeline standalone compatible con KFP 2.12.1"""
    
    print("🔨 Compilando Pipeline Standalone...")
    
    try:
        from kfp import dsl
        from kfp.dsl import component, pipeline
        from kfp import compiler
        
        print("✅ Imports para KFP 2.12.1 correctos")
        
        @component(
            base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
            packages_to_install=["PyPDF2==3.0.1", "minio==7.1.17"]
        )
        def simple_extract_component(
            bucket_name: str,
            object_key: str
        ) -> str:
            """Component simplificado para testing - retorna string"""
            
            # Simular extracción
            test_content = f"Contenido extraído de {object_key} en bucket {bucket_name}"
            
            print(f"✅ Texto extraído: {len(test_content)} caracteres")
            
            return test_content

        @component(
            base_image="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
            packages_to_install=["elasticsearch==8.11.0"]
        )
        def simple_index_component(
            extracted_text: str,
            es_endpoint: str
        ) -> dict:
            """Component simplificado de indexación - retorna dict"""
            from datetime import datetime
            
            # Simular indexación
            status = {
                "indexed_at": datetime.now().isoformat(),
                "content_length": len(extracted_text),
                "es_endpoint": es_endpoint,
                "success": True
            }
            
            print(f"✅ Indexación simulada completada")
            
            return status

        @pipeline(
            name="rag-simple-test-v1",
            description="Pipeline RAG simplificado para testing - Carlos Estay (pkstaz)"
        )
        def simple_rag_pipeline(
            bucket_name: str = "raw-documents",
            object_key: str = "test-document-openshift-ai.txt",
            es_endpoint: str = "elasticsearch:9200"
        ) -> dict:
            """Pipeline simplificado para verificar compilación"""
            
            # Step 1: Extract
            extract_task = simple_extract_component(
                bucket_name=bucket_name,
                object_key=object_key
            )
            extract_task.set_display_name("📄 Simple Extract")
            extract_task.set_cpu_limit("500m")
            extract_task.set_memory_limit("1Gi")
            
            # Step 2: Index
            index_task = simple_index_component(
                extracted_text=extract_task.output,
                es_endpoint=es_endpoint
            )
            index_task.set_display_name("🔍 Simple Index")
            index_task.set_cpu_limit("500m")
            index_task.set_memory_limit("1Gi")
            
            return index_task.output
        
        print("✅ Pipeline simplificado definido con sintaxis KFP 2.12.1")
        
        # Compilar pipeline
        output_file = 'rag_simple_pipeline_v1.yaml'
        
        compiler.Compiler().compile(
            pipeline_func=simple_rag_pipeline,
            package_path=output_file
        )
        
        print(f"✅ Pipeline compilado exitosamente: {output_file}")
        
        # Verificar archivo
        if Path(output_file).exists():
            file_size = Path(output_file).stat().st_size
            print(f"📄 Tamaño del archivo: {file_size} bytes")
            
            # Verificar contenido
            with open(output_file, 'r') as f:
                content = f.read()
                
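            # KFP v2 emits a pipeline spec (IR) document, not a Kubernetes
            # manifest, so it has no 'apiVersion'/'kind' keys. Check the
            # IR sections instead.
            try:
                yaml_data = yaml.safe_load(content)
            except yaml.YAMLError:
                yaml_data = None
                print("⚠️ Could not parse the compiled YAML")
            
            if isinstance(yaml_data, dict) and 'pipelineInfo' in yaml_data and 'root' in yaml_data:
                print("✅ Valid pipeline spec generated")
                
                # Count component references (one per task in the DAG)
                component_count = content.count('componentRef')
                print(f"📊 Component references found in YAML: {component_count}")
                
                # Show the pipeline name from the IR metadata
                pipeline_info = yaml_data.get('pipelineInfo', {})
                print(f"📋 Pipeline name: {pipeline_info.get('name', 'N/A')}")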
            
            print(f"\n📝 Primeras líneas del YAML:")
            lines = content.split('\n')[:8]
            for i, line in enumerate(lines, 1):
                print(f"  {i}: {line}")
        
        return output_file
        
    except Exception as e:
        print(f"❌ Error en compilación: {str(e)}")
        
        # Más debugging específico
        import traceback
        print(f"\n🔍 Traceback completo:")
        traceback.print_exc()
        
        return None

# Ejecutar compilación
compiled_file = compile_standalone_pipeline()

if compiled_file:
    print(f"\n✅ PIPELINE COMPILADO EXITOSAMENTE")
    print(f"📄 Archivo: {compiled_file}")
    print(f"🚀 Listo para subir a OpenShift AI Dashboard")
else:
    print(f"\n❌ Error en compilación del pipeline")

🔨 Compilando Pipeline Standalone...
✅ Imports para KFP 2.12.1 correctos
✅ Pipeline simplificado definido con sintaxis KFP 2.12.1
✅ Pipeline compilado exitosamente: rag_simple_pipeline_v1.yaml
📄 Tamaño del archivo: 6423 bytes

📝 Primeras líneas del YAML:
  1: # PIPELINE DEFINITION
  2: # Name: rag-simple-test-v1
  3: # Description: Pipeline RAG simplificado para testing - Carlos Estay (pkstaz)
  4: # Inputs:
  5: #    bucket_name: str [Default: 'raw-documents']
  6: #    es_endpoint: str [Default: 'elasticsearch:9200']
  7: #    object_key: str [Default: 'test-document-openshift-ai.txt']
  8: # Outputs:

✅ PIPELINE COMPILADO EXITOSAMENTE
📄 Archivo: rag_simple_pipeline_v1.yaml
🚀 Listo para subir a OpenShift AI Dashboard
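
KFP 2.x compiles to a pipeline spec (IR) document rather than a Kubernetes manifest, which is why the file above starts with a `# PIPELINE DEFINITION` header. A minimal sketch of a structural check on the parsed YAML, assuming it has already been loaded into a dict (for example with `yaml.safe_load`); `summarize_pipeline_spec` is a hypothetical helper, and the sample dict is a hand-written, heavily reduced stand-in for the real file:

```python
REQUIRED_SECTIONS = ("pipelineInfo", "root", "components", "deploymentSpec")


def summarize_pipeline_spec(spec):
    """Return name, task names and missing sections of a parsed KFP v2 spec."""
    missing = [key for key in REQUIRED_SECTIONS if key not in spec]
    tasks = spec.get("root", {}).get("dag", {}).get("tasks", {})
    return {
        "name": spec.get("pipelineInfo", {}).get("name"),
        "tasks": sorted(tasks),
        "missing_sections": missing,
    }


# Reduced, hand-written sample; a real compiled file has many more fields.
sample_spec = {
    "pipelineInfo": {"name": "rag-simple-test-v1"},
    "components": {},
    "deploymentSpec": {"executors": {}},
    "root": {
        "dag": {
            "tasks": {
                "simple-extract-component": {
                    "componentRef": {"name": "comp-simple-extract-component"}
                },
                "simple-index-component": {
                    "componentRef": {"name": "comp-simple-index-component"}
                },
            }
        }
    },
}

print(summarize_pipeline_spec(sample_spec))
```

An empty `missing_sections` list is only a shape check; it does not prove the pipeline will run on the cluster.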


## 🛠️ OpenShift AI Data Science Pipelines v2 Configuration

We create a `DataSciencePipelinesApplication` configuration compatible with OpenShift AI v2 pipelines, backed by an external MinIO instance.
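
The DSP configuration in the following cell references a secret named `minio-secret` with the keys `accesskey` and `secretkey`. That manifest (`deploy/openshift-ai/minio-secret.yaml`) is not shown in this section, so the following is only a sketch of what a matching Kubernetes `Secret` could look like; `build_minio_secret` is a hypothetical helper and the credential values are placeholders:

```python
import json


def build_minio_secret(namespace, access_key, secret_key, name="minio-secret"):
    """Build a Secret manifest whose keys match s3CredentialsSecret."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # stringData lets the API server handle the base64 encoding
        "stringData": {"accesskey": access_key, "secretkey": secret_key},
    }


manifest = build_minio_secret("rag-openshift-ai", "<access-key>", "<secret-key>")
print(json.dumps(manifest, indent=2))
```

`oc apply -f` accepts JSON as well as YAML, so the printed document can be saved to a file and applied as-is.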

In [44]:
def recreate_dsp_completely():
    """Recrear DSP v2 desde cero con configuración correcta"""
    
    print("🔧 Recreando DSP v2 completamente...")
    
    # Configuración DSP mínima y funcional
    dsp_minimal_config = {
        "apiVersion": "datasciencepipelinesapplications.opendatahub.io/v1alpha1",
        "kind": "DataSciencePipelinesApplication",
        "metadata": {
            "name": "rag-dsp-minimal",
            "namespace": "rag-openshift-ai"
        },
        "spec": {
            "dspVersion": "v2",
            "apiServer": {
                "deploy": True,
                "enableSamplePipeline": False
            },
            "objectStorage": {
                "externalStorage": {
                    "bucket": "pipeline",
                    "host": "minio-api-poc-rag.apps.cluster-2gbhp.2gbhp.sandbox1120.opentlc.com",
                    "scheme": "https",
                    "s3CredentialsSecret": {
                        "accessKey": "accesskey",
                        "secretKey": "secretkey",
                        "secretName": "minio-secret"
                    }
                }
            },
            "database": {
                "mariaDB": {
                    "deploy": True,
                    "pvcSize": "5Gi"
                }
            }
        }
    }
    
    with open('deploy/openshift-ai/dsp-minimal.yaml', 'w') as f:
        yaml.dump(dsp_minimal_config, f, indent=2, default_flow_style=False)
    
    print("✅ Configuración DSP mínima creada: deploy/openshift-ai/dsp-minimal.yaml")
    
    return True

def create_complete_rebuild_script():
    """Script para reconstruir DSP completamente"""
    
    rebuild_script = '''#!/bin/bash
# Complete DSP Rebuild Script
# Author: Carlos Estay (pkstaz)

set -e

echo "🔧 Rebuilding DSP v2 completely..."

# Colors
RED='\\033[0;31m'
GREEN='\\033[0;32m'
YELLOW='\\033[1;33m'
NC='\\033[0m'

NAMESPACE="rag-openshift-ai"

# Step 1: Complete cleanup
echo -e "${YELLOW}🗑️ Complete cleanup...${NC}"
oc delete datasciencepipelinesapplication --all -n $NAMESPACE --ignore-not-found=true

# Wait for complete deletion
echo -e "${YELLOW}⏳ Waiting for complete cleanup...${NC}"
sleep 30

# Delete any remaining resources
oc delete pods --all -n $NAMESPACE --ignore-not-found=true
oc delete pvc --all -n $NAMESPACE --ignore-not-found=true
oc delete configmaps --all -n $NAMESPACE --ignore-not-found=true
oc delete secrets --all -n $NAMESPACE --ignore-not-found=true

# Wait more
sleep 10

# Step 2: Recreate namespace fresh
echo -e "${YELLOW}📦 Recreating namespace...${NC}"
oc delete namespace $NAMESPACE --ignore-not-found=true
sleep 15
oc apply -f deploy/openshift-ai/namespace.yaml

# Step 3: Apply MinIO secret
echo -e "${YELLOW}🔑 Creating MinIO secret...${NC}"
oc apply -f deploy/openshift-ai/minio-secret.yaml

# Step 4: Deploy minimal DSP
echo -e "${YELLOW}🔬 Deploying minimal DSP...${NC}"
oc apply -f deploy/openshift-ai/dsp-minimal.yaml

# Step 5: Wait for DSP to be ready
echo -e "${YELLOW}⏳ Waiting for DSP to be ready (this may take 5-10 minutes)...${NC}"
oc wait --for=condition=Ready datasciencepipelinesapplication/rag-dsp-minimal -n $NAMESPACE --timeout=600s

# Step 6: Check status
echo -e "${YELLOW}📊 Checking final status...${NC}"
oc get datasciencepipelinesapplications -n $NAMESPACE
echo ""
oc get pods -n $NAMESPACE
echo ""
oc get pvc -n $NAMESPACE

# Step 7: Verify DSP is working
echo -e "${YELLOW}🔍 Verifying DSP functionality...${NC}"
API_SERVER_POD=$(oc get pods -n $NAMESPACE -l app=ds-pipeline-rag-dsp-minimal -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")

if [ ! -z "$API_SERVER_POD" ]; then
    echo -e "${GREEN}✅ API Server pod found: $API_SERVER_POD${NC}"
    oc logs $API_SERVER_POD -n $NAMESPACE --tail=10
else
    echo -e "${RED}❌ API Server pod not found${NC}"
fi

echo -e "${GREEN}🎉 DSP rebuild completed!${NC}"
echo -e "${YELLOW}📋 Next steps:${NC}"
echo "1. Wait for all pods to be Running"
echo "2. Access OpenShift AI Dashboard"
echo "3. Navigate to Data Science Projects > rag-openshift-ai"
echo "4. Try uploading a simple pipeline"
'''
    
    with open('rebuild_dsp.sh', 'w') as f:
        f.write(rebuild_script)
    
    Path('rebuild_dsp.sh').chmod(0o755)
    
    print("✅ Script de rebuild creado: rebuild_dsp.sh")
    
    return True

def create_verification_pipeline():
    """Crear pipeline super simple para verificar que DSP funciona"""
    
    print("\n🧪 Creando pipeline de verificación ultra-simple...")
    
    try:
        from kfp import dsl, compiler
        from kfp.dsl import component, pipeline
        
        @component(base_image="python:3.9")
        def verify_component() -> str:
            """Ultra simple verification component"""
            import os
            import sys
            
            result = f"DSP is working! Python version: {sys.version}"
            print(result)
            return result

        @pipeline(
            name="dsp-verification",
            description="Ultra simple pipeline to verify DSP is working"
        )
        def verification_pipeline() -> str:
            """Verification pipeline"""
            
            task = verify_component()
            task.set_display_name("Verify")
            task.set_cpu_limit("100m")
            task.set_memory_limit("128Mi")
            
            return task.output
        
        # Compilar
        verify_file = 'dsp_verification.yaml'
        
        compiler.Compiler().compile(
            pipeline_func=verification_pipeline,
            package_path=verify_file
        )
        
        print(f"✅ Pipeline de verificación creado: {verify_file}")
        return verify_file
        
    except Exception as e:
        print(f"❌ Error creating verification pipeline: {str(e)}")
        return None

def show_troubleshooting_plan():
    """Plan completo de troubleshooting"""
    
    print(f"\n📋 PLAN DE TROUBLESHOOTING COMPLETO:")
    print("=" * 60)
    
    plan = [
        "🔥 PROBLEMA: launcher-v2 faltante indica DSP mal configurado",
        "",
        "✅ SOLUCIÓN:",
        "1. Reconstruir DSP completamente desde cero",
        "2. Usar configuración mínima (sin customizaciones)",
        "3. Dejar que OpenShift AI use imágenes por defecto",
        "4. Probar con pipeline ultra-simple",
        "",
        "📋 PASOS EJECUTAR:",
        "1. ./rebuild_dsp.sh (reconstruir DSP)",
        "2. Esperar 5-10 minutos para que se inicialice",
        "3. Verificar: oc get pods -n rag-openshift-ai",
        "4. Subir: dsp_verification.yaml al Dashboard",
        "5. Si funciona, subir pipelines más complejos",
        "",
        "⚠️ SI SIGUE FALLANDO:",
        "- El cluster puede no tener OpenShift AI instalado correctamente",
        "- Verificar con administrador del cluster",
        "- Puede necesitar reinstalar OpenShift AI operator"
    ]
    
    for item in plan:
        print(item)
    
    return True

# Ejecutar todas las soluciones
print("🔧 SOLUCIÓN DEFINITIVA - Reconstruir DSP v2")
print("=" * 60)

recreate_dsp_completely()
create_complete_rebuild_script()
verification_pipeline = create_verification_pipeline()
show_troubleshooting_plan()

print(f"\n🎯 ACCIÓN REQUERIDA:")
print(f"  ⚠️  El DSP actual está corrupto/mal configurado")
print(f"  🔧 SOLUCIÓN: Reconstruir completamente")
print(f"  🚀 EJECUTAR: ./rebuild_dsp.sh")

print(f"\n📁 ARCHIVOS CREADOS:")
print(f"  📄 deploy/openshift-ai/dsp-minimal.yaml - Configuración mínima")
print(f"  🔧 rebuild_dsp.sh - Script de reconstrucción completa")
print(f"  📄 dsp_verification.yaml - Pipeline ultra-simple para testing")

print(f"\n⏱️ TIEMPO ESTIMADO:")
print(f"  🕐 Rebuild: 5-10 minutos")
print(f"  🕑 Verificación: 2-3 minutos")
print(f"  🕒 Total: ~15 minutos")

print(f"\n💡 ESTE ENFOQUE DEBERÍA FUNCIONAR:")
print(f"  ✅ Configuración mínima sin customizaciones problemáticas")
print(f"  ✅ Deja que OpenShift AI use sus imágenes por defecto")
print(f"  ✅ Elimina cualquier configuración corrupta")

🔧 SOLUCIÓN DEFINITIVA - Reconstruir DSP v2
🔧 Recreando DSP v2 completamente...
✅ Configuración DSP mínima creada: deploy/openshift-ai/dsp-minimal.yaml
✅ Script de rebuild creado: rebuild_dsp.sh

🧪 Creando pipeline de verificación ultra-simple...
✅ Pipeline de verificación creado: dsp_verification.yaml

📋 PLAN DE TROUBLESHOOTING COMPLETO:
🔥 PROBLEMA: launcher-v2 faltante indica DSP mal configurado

✅ SOLUCIÓN:
1. Reconstruir DSP completamente desde cero
2. Usar configuración mínima (sin customizaciones)
3. Dejar que OpenShift AI use imágenes por defecto
4. Probar con pipeline ultra-simple

📋 PASOS EJECUTAR:
1. ./rebuild_dsp.sh (reconstruir DSP)
2. Esperar 5-10 minutos para que se inicialice
3. Verificar: oc get pods -n rag-openshift-ai
4. Subir: dsp_verification.yaml al Dashboard
5. Si funciona, subir pipelines más complejos

⚠️ SI SIGUE FALLANDO:
- El cluster puede no tener OpenShift AI instalado correctamente
- Verificar con administrador del cluster
- Puede necesitar reinstalar OpenS
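
The rebuild script above waits with fixed `sleep` intervals between cleanup steps. One alternative is to poll a condition until it holds or a timeout expires. A minimal sketch, assuming a caller-supplied `predicate` (hypothetical; for example a function that runs `oc get` and inspects the result):

```python
import time


def wait_until(predicate, timeout=60.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with a fake condition that becomes true on the third check.
attempts = {"count": 0}


def namespace_gone():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(namespace_gone, timeout=5.0, interval=0.01))
```

Compared with a fixed sleep, this returns as soon as the condition is met and reports a timeout explicitly instead of continuing silently.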

## 🔍 Deployment Diagnostics

We check what is happening with the DSP deployment.

In [38]:
def diagnose_dsp_deployment():
    """Diagnosticar problemas con el deployment DSP"""
    
    print("🔍 Diagnosticando deployment de DSP...")
    
    diagnostic_commands = [
        "# Ver estado del DSP",
        "oc get datasciencepipelinesapplications -n rag-openshift-ai -o wide",
        "",
        "# Ver detalles del DSP",
        "oc describe datasciencepipelinesapplication rag-pipeline-dsp-v2 -n rag-openshift-ai",
        "",
        "# Ver pods en el namespace",
        "oc get pods -n rag-openshift-ai",
        "",
        "# Ver eventos recientes",
        "oc get events -n rag-openshift-ai --sort-by='.lastTimestamp' | tail -20",
        "",
        "# Ver logs del operator si existe",
        "oc logs -n redhat-ods-operator deployment/rhods-operator --tail=50",
        "",
        "# Verificar si hay conflictos de recursos",
        "oc get all -n rag-openshift-ai",
        "",
        "# Ver status de namespace",
        "oc describe namespace rag-openshift-ai"
    ]
    
    print("📋 Ejecuta estos comandos para diagnosticar:")
    print("=" * 50)
    for cmd in diagnostic_commands:
        print(cmd)
    
    return diagnostic_commands

# Mostrar comandos de diagnóstico
diagnose_dsp_deployment()

print("\n💡 POSIBLES CAUSAS:")
print("  1. OpenShift AI Operator no está instalado")
print("  2. Namespace sin permisos adecuados")
print("  3. Recursos insuficientes en el cluster")
print("  4. Problema de conectividad con MinIO externo")
print("  5. CRD de DataSciencePipelinesApplication no existe")

print("\n🔧 VERIFICACIONES RÁPIDAS:")
print("  # Verificar si el CRD existe:")
print("  oc get crd datasciencepipelinesapplications.datasciencepipelinesapplications.opendatahub.io")
print("")
print("  # Verificar OpenShift AI operator:")
print("  oc get pods -n redhat-ods-operator | grep rhods")
print("")
print("  # Verificar permisos:")
print("  oc auth can-i create datasciencepipelinesapplications --as=system:serviceaccount:rag-openshift-ai:default")

🔍 Diagnosticando deployment de DSP...
📋 Ejecuta estos comandos para diagnosticar:
# Ver estado del DSP
oc get datasciencepipelinesapplications -n rag-openshift-ai -o wide

# Ver detalles del DSP
oc describe datasciencepipelinesapplication rag-pipeline-dsp-v2 -n rag-openshift-ai

# Ver pods en el namespace
oc get pods -n rag-openshift-ai

# Ver eventos recientes
oc get events -n rag-openshift-ai --sort-by='.lastTimestamp' | tail -20

# Ver logs del operator si existe
oc logs -n openshift-operators deployment/rhods-operator --tail=50

# Verificar si hay conflictos de recursos
oc get all -n rag-openshift-ai

# Ver status de namespace
oc describe namespace rag-openshift-ai

💡 POSIBLES CAUSAS:
  1. OpenShift AI Operator no está instalado
  2. Namespace sin permisos adecuados
  3. Recursos insuficientes en el cluster
  4. Problema de conectividad con MinIO externo
  5. CRD de DataSciencePipelinesApplication no existe

🔧 VERIFICACIONES RÁPIDAS:
  # Verificar si el CRD existe:
  oc get crd dat
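
The list returned by `diagnose_dsp_deployment()` mixes comment lines, blank separators, and commands. A minimal sketch of running it instead of copying commands by hand, assuming the `oc` CLI is installed and logged in; `executable_commands` and `run_diagnostics` are hypothetical helpers, and `shell=True` is used only because one of the commands contains a pipe:

```python
import shutil
import subprocess


def executable_commands(lines):
    """Drop blank lines and '#' comment lines, keep real commands."""
    return [
        ln.strip()
        for ln in lines
        if ln.strip() and not ln.strip().startswith("#")
    ]


def run_diagnostics(lines, timeout=60):
    """Run each command and return (command, exit code, combined output)."""
    if shutil.which("oc") is None:
        print("oc CLI not found; nothing was executed")
        return []
    results = []
    for cmd in executable_commands(lines):
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
        results.append((cmd, proc.returncode, proc.stdout + proc.stderr))
    return results


# Two entries copied from the list in the cell above
sample = [
    "# Ver pods en el namespace",
    "oc get pods -n rag-openshift-ai",
    "",
    "# Ver eventos recientes",
    "oc get events -n rag-openshift-ai --sort-by='.lastTimestamp' | tail -20",
]
print(executable_commands(sample))
```

Only pass trusted, fixed command strings to `run_diagnostics`, since `shell=True` hands each string to the shell unmodified.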

## 🎉 DSP Working - Next Steps

Now that DSP is operational, we test our real RAG pipeline.
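
The chunking component in the following cell approximates tokens with a fixed character window (about 4 characters per token) and advances by the window size minus the overlap. A standalone sketch of that sliding window, with an added guard (not in the original) against an overlap that is not smaller than the window; `sliding_chunks` is a hypothetical name:

```python
def sliding_chunks(text, max_chars, overlap_chars):
    """Split text into windows of max_chars that share overlap_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        # Without this guard the window would never advance
        raise ValueError("overlap_chars must be in [0, max_chars)")

    step = max_chars - overlap_chars
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        piece = text[start:end]
        if piece.strip():
            chunks.append(piece)
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", max_chars=4, overlap_chars=1))
# ['abcd', 'defg', 'ghij']
```

With the pipeline defaults (`chunk_size=512`, `chunk_overlap=50`) the equivalent call would use `max_chars=2048` and `overlap_chars=200`.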

In [45]:
def create_real_rag_pipeline():
    """Crear el pipeline RAG real ahora que DSP funciona"""
    
    print("🚀 Creando pipeline RAG real para el DSP funcional...")
    
    try:
        from kfp import dsl, compiler
        from kfp.dsl import component, pipeline
        
        @component(
            base_image="python:3.9",
            packages_to_install=[
                "minio==7.1.17",
                "requests==2.31.0"
            ]
        )
        def extract_text_real(
            bucket_name: str,
            object_key: str,
            minio_endpoint: str,
            minio_access_key: str,
            minio_secret_key: str
        ) -> str:
            """Extract text from MinIO document"""
            
            from minio import Minio
            import tempfile
            import os
            
            print(f"Extracting text from: {object_key}")
            
            try:
                # Conectar a MinIO público
                minio_client = Minio(
                    minio_endpoint,
                    access_key=minio_access_key,
                    secret_key=minio_secret_key,
                    secure=True  # HTTPS
                )
                
                # Reserve a temporary file path for the download
                with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as temp_file:
                    temp_path = temp_file.name
                
                try:
                    # Download the object
                    minio_client.fget_object(bucket_name, object_key, temp_path)
                    print(f"File downloaded: {object_key}")
                    
                    # Read the content
                    with open(temp_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                finally:
                    # Always remove the temporary file, even on failure
                    os.unlink(temp_path)
                
                print(f"Text extracted: {len(content)} characters")
                return content
                
            except Exception as e:
                # Fail the task instead of returning the error as "content";
                # otherwise downstream steps would chunk the error message
                # and the retry policy would never trigger.
                print(f"Error extracting text: {str(e)}")
                raise

        @component(
            base_image="python:3.9",
            packages_to_install=[
                "tiktoken==0.5.1"
            ]
        )
        def chunk_text_real(
            text_content: str,
            chunk_size: int = 512,
            chunk_overlap: int = 50
        ) -> str:
            """Chunk text into processable pieces"""
            
            import tiktoken
            import json
            
            print(f"Chunking text: {len(text_content)} characters")
            
            try:
                # Configurar tokenizer
                encoding = tiktoken.get_encoding("cl100k_base")
                
                def count_tokens(text: str) -> int:
                    return len(encoding.encode(text))
                
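                # Simple character-window chunking
                if not 0 <= chunk_overlap < chunk_size:
                    # An overlap >= chunk size would keep the window from advancing
                    raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
                max_chars = chunk_size * 4  # Approx. 1 token = 4 chars
                overlap_chars = chunk_overlap * 4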
                
                chunks = []
                start = 0
                chunk_id = 0
                
                while start < len(text_content):
                    end = min(start + max_chars, len(text_content))
                    chunk_text = text_content[start:end]
                    
                    if chunk_text.strip():
                        chunk_data = {
                            "chunk_id": f"chunk_{chunk_id:04d}",
                            "text": chunk_text.strip(),
                            "token_count": count_tokens(chunk_text),
                            "char_count": len(chunk_text)
                        }
                        chunks.append(chunk_data)
                        chunk_id += 1
                    
                    start = end - overlap_chars if end < len(text_content) else len(text_content)
                
                result = json.dumps(chunks, ensure_ascii=False)
                print(f"Created {len(chunks)} chunks")
                
                return result
                
            except Exception as e:
                # Fail the task so the error message is not passed downstream as chunk data
                print(f"Error chunking text: {str(e)}")
                raise

        @component(
            base_image="python:3.9",
            packages_to_install=[
                "sentence-transformers==2.2.2",
                "torch==2.0.1",
                "numpy==1.24.3"
            ]
        )
        def generate_embeddings_real(
            chunks_json: str,
            model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
        ) -> str:
            """Generate embeddings for chunks"""
            
            import json
            from sentence_transformers import SentenceTransformer
            import torch
            
            print(f"Generating embeddings with model: {model_name}")
            
            try:
                # Cargar chunks
                chunks = json.loads(chunks_json)
                print(f"Processing {len(chunks)} chunks")
                
                # Cargar modelo
                device = 'cuda' if torch.cuda.is_available() else 'cpu'
                print(f"Using device: {device}")
                
                model = SentenceTransformer(model_name, device=device)
                
                # Generar embeddings
                texts = [chunk['text'] for chunk in chunks]
                embeddings = model.encode(
                    texts,
                    convert_to_numpy=True,
                    normalize_embeddings=True,
                    show_progress_bar=True
                )
                
                # Enriquecer chunks con embeddings
                for i, chunk in enumerate(chunks):
                    chunk['embedding'] = embeddings[i].tolist()
                    chunk['embedding_dim'] = len(embeddings[i])
                    chunk['embedding_model'] = model_name
                
                result = json.dumps(chunks, ensure_ascii=False)
                print(f"Embeddings generated: {len(embeddings)} vectors")
                
                return result
                
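            except Exception as e:
                # Re-raise so the step is marked as failed instead of returning
                # an error string as if it were the embeddings JSON
                error_msg = f"Error generating embeddings: {str(e)}"
                print(error_msg)
                raise RuntimeError(error_msg) from e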

        @pipeline(
            name="rag-full-pipeline",
            description="Complete RAG Pipeline - Carlos Estay (pkstaz)"
        )
        def rag_full_pipeline(
            bucket_name: str = "raw-documents",
            object_key: str = "test-document-openshift-ai.txt",
            minio_endpoint: str = "minio-api-poc-rag.apps.cluster-2gbhp.2gbhp.sandbox1120.opentlc.com",
            # NOTE: demo credentials for this PoC; these defaults are written
            # into the compiled YAML
            minio_access_key: str = "minio",
            minio_secret_key: str = "minio123",
            chunk_size: int = 512,
            chunk_overlap: int = 50,
            embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
        ) -> str:
            """Complete RAG processing pipeline"""
            
            # Step 1: Extract Text
            extract_task = extract_text_real(
                bucket_name=bucket_name,
                object_key=object_key,
                minio_endpoint=minio_endpoint,
                minio_access_key=minio_access_key,
                minio_secret_key=minio_secret_key
            )
            extract_task.set_display_name("Extract Text")
            extract_task.set_cpu_limit("500m")
            extract_task.set_memory_limit("1Gi")
            extract_task.set_retry(2)
            
            # Step 2: Chunk Text
            chunk_task = chunk_text_real(
                text_content=extract_task.output,
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
            chunk_task.set_display_name("Chunk Text")
            chunk_task.set_cpu_limit("500m")
            chunk_task.set_memory_limit("1Gi")
            chunk_task.set_retry(2)
            
            # Step 3: Generate Embeddings
            embedding_task = generate_embeddings_real(
                chunks_json=chunk_task.output,
                model_name=embedding_model
            )
            embedding_task.set_display_name("Generate Embeddings")
            embedding_task.set_cpu_limit("1000m")
            embedding_task.set_memory_limit("4Gi")
            embedding_task.set_retry(1)
            
            return embedding_task.output
        
        # Compile the real pipeline to YAML
        real_pipeline_file = 'rag_full_pipeline.yaml'
        
        compiler.Compiler().compile(
            pipeline_func=rag_full_pipeline,
            package_path=real_pipeline_file
        )
        
        print(f"✅ Full RAG pipeline compiled: {real_pipeline_file}")
        return real_pipeline_file
        
    except Exception as e:
        print(f"❌ Error creating real RAG pipeline: {str(e)}")
        return False
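
# Sketch (not part of the original notebook): the components above hand data
# to each other as JSON strings, so a downstream step may want to fail fast
# when its input is not chunk JSON. `parse_chunks_payload` is a hypothetical
# helper name. KFP lightweight components are self-contained, so its body
# would have to be inlined in a component rather than imported from this cell.
import json


def parse_chunks_payload(chunks_json):
    """Parse a chunks JSON string and check its basic shape."""
    try:
        chunks = json.loads(chunks_json)
    except json.JSONDecodeError as exc:
        # Show only the start of the payload to keep the log readable
        raise ValueError(
            f"Upstream output is not valid JSON: {chunks_json[:80]!r}"
        ) from exc
    if not isinstance(chunks, list):
        raise ValueError("Expected a JSON list of chunks")
    for i, chunk in enumerate(chunks):
        if not isinstance(chunk, dict) or "text" not in chunk:
            raise ValueError(f"Chunk {i} has no 'text' field")
    return chunks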

def show_next_steps():
    """Show the next steps now that DSP is working"""
    
    print("\n🎯 NEXT STEPS WITH A WORKING DSP:")
    print("=" * 50)
    
    steps = [
        "1. Upload rag_full_pipeline.yaml to the OpenShift AI Dashboard",
        "2. Create an experiment named 'rag-testing'",
        "3. Run the pipeline with the default parameters",
        "4. Monitor the run in the Dashboard",
        "5. Verify that it processes test-document-openshift-ai.txt",
        "6. Review the logs of each step",
        "7. Confirm that embeddings are generated successfully"
    ]
    
    for step in steps:
        print(f"  {step}")
    
    print("\n📋 PIPELINE PARAMETERS:")
    params = [
        "bucket_name: raw-documents",
        "object_key: test-document-openshift-ai.txt", 
        "minio_endpoint: minio-api-poc-rag.apps.cluster-2gbhp.2gbhp.sandbox1120.opentlc.com",
        "minio_access_key: minio",
        "minio_secret_key: minio123",
        "chunk_size: 512",
        "chunk_overlap: 50",
        "embedding_model: sentence-transformers/all-MiniLM-L6-v2"
    ]
    
    for param in params:
        print(f"  📄 {param}")
    
    print("\n💡 WHAT TO EXPECT:")
    expectations = [
        "✅ Extract Text: Downloads and reads test-document-openshift-ai.txt",
        "✅ Chunk Text: Splits it into ~4-6 chunks of 512 tokens",
        "✅ Generate Embeddings: Creates 384-dimensional vectors",
        "⏱️ Total time: ~5-10 minutes (the first run downloads the model)",
        "📊 Output: JSON with chunks and embeddings ready for indexing"
    ]
    
    for expectation in expectations:
        print(f"  {expectation}")
    
    return True
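
# Sketch (not part of the original notebook): steps 1-3 above can also be done
# from code instead of the Dashboard. `submit_rag_run` is a hypothetical helper;
# it assumes `client` is an already-configured kfp Client pointing at the DSP
# route. The host and token below are placeholders, not values from this setup:
#   client = Client(host="https://<dsp-route>", existing_token="<token>")
#   run = submit_rag_run(client)
def submit_rag_run(client, pipeline_file="rag_full_pipeline.yaml",
                   experiment_name="rag-testing", arguments=None):
    """Submit the compiled pipeline package as a run in the given experiment."""
    return client.create_run_from_pipeline_package(
        pipeline_file=pipeline_file,
        # An empty dict means every parameter keeps its pipeline default
        arguments=arguments or {},
        experiment_name=experiment_name,
    )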

# Create the real RAG pipeline
print("🚀 CREATING REAL RAG PIPELINE")
print("=" * 60)

real_pipeline = create_real_rag_pipeline()

if real_pipeline:
    # Only show the next steps when compilation succeeded
    show_next_steps()
    
    print("\n✅ REAL RAG PIPELINE READY:")
    print(f"  📄 {real_pipeline}")
    print("  🎯 3 components: Extract → Chunk → Embeddings")
    print("  🔗 Connects to the public MinIO endpoint")
    print("  🧠 Uses sentence-transformers for embeddings")
    
    print("\n🚀 READY FOR REAL TESTING:")
    print(f"  1. Upload {real_pipeline} to the Dashboard")
    print("  2. Run it with the test document")
    print("  3. Verify the pipeline end-to-end")
    
    print("\n🎉 RAG PIPELINE COMPILED AND READY FOR OPENSHIFT AI!")
else:
    print("\n❌ Error creating real RAG pipeline")
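
# Sketch (not part of the original notebook): once a run finishes, its JSON
# output can be sanity-checked against the expectations listed above.
# `check_embeddings` is a hypothetical helper. It assumes 384-dimensional,
# L2-normalized vectors, which matches all-MiniLM-L6-v2 with
# normalize_embeddings=True; adjust expected_dim for other models.
import math


def check_embeddings(chunks, expected_dim=384, tol=1e-3):
    """Return a list of problems found in the embedded chunks (empty if none)."""
    problems = []
    for i, chunk in enumerate(chunks):
        vector = chunk.get("embedding")
        if not vector:
            problems.append(f"chunk {i}: missing embedding")
            continue
        if len(vector) != expected_dim:
            problems.append(f"chunk {i}: dim {len(vector)} != {expected_dim}")
        # Normalized embeddings should have Euclidean length close to 1
        norm = math.sqrt(sum(x * x for x in vector))
        if abs(norm - 1.0) > tol:
            problems.append(f"chunk {i}: norm {norm:.4f} is not ~1.0")
    return problems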

🚀 CREATING REAL RAG PIPELINE
🚀 Creating real RAG pipeline for the working DSP...
✅ Full RAG pipeline compiled: rag_full_pipeline.yaml

🎯 NEXT STEPS WITH A WORKING DSP:
  1. Upload rag_full_pipeline.yaml to the OpenShift AI Dashboard
  2. Create an experiment named 'rag-testing'
  3. Run the pipeline with the default parameters
  4. Monitor the run in the Dashboard
  5. Verify that it processes test-document-openshift-ai.txt
  6. Review the logs of each step
  7. Confirm that embeddings are generated successfully

📋 PIPELINE PARAMETERS:
  📄 bucket_name: raw-documents
  📄 object_key: test-document-openshift-ai.txt
  📄 minio_endpoint: minio-api-poc-rag.apps.cluster-2gbhp.2gbhp.sandbox1120.opentlc.com
  📄 minio_access_key: minio
  📄 minio_secret_key: minio123
  📄 chunk_size: 512
  📄 chunk_overlap: 50
  📄 embedding_model: sentence-transformers/all-MiniLM-L6-v2

💡 WHAT TO EXPECT:
  ✅ Extract Text: Downloads and reads test-document-openshift-ai.txt
  ✅ Chunk Text: Splits it into ~4-6 chunks of 512 tokens
  