This code implements a search system based on dense embeddings generated from the descriptions in a CSV file. Its main functions and structure are described below:

## Functionality
1. **Data Loading**:
   - The input is a CSV file (`data/repd_vp_cedulas_principal.csv`) whose `descripcion_desaparicion` column holds the descriptions.
   - Optionally, the CSV can be filtered to keep only the rows whose `condicion_localizacion` is `NO APLICA`; the result is saved to `data/filtered_dataset.csv`.

2. **Embedding Generation**:
   - The `BGEM3FlagModel` model (`BAAI/bge-m3`) generates a dense embedding for each description.
   - The embeddings are cached (`data/database_embeddings.pkl` or `data/filtered_embeddings.pkl`) together with an MD5 hash of the source CSV, so they are recomputed only when that file changes.

3. **Search**:
   - The user enters a query in natural language.
   - The system generates an embedding for the query and computes its similarity to the precomputed embeddings.
   - It returns the most similar results (top-k) with their scores.

4. **Result Export**:
   - The results are exported in two formats:
     - **HTML**: A visual file (`output/search_results.html`) that shows each result with its rank, similarity score, and `condicion_localizacion` value.
     - **CSV**: A file (`output/filtered_results.csv`) that contains the original CSV rows matching the results.

5. **Output Folder**:
   - All generated files (HTML and CSV) are saved in the `output` folder, which the code creates automatically if it does not exist. After each search, both files are renamed with a timestamp suffix.
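
The search step described above can be illustrated without the model or a GPU. The following is a minimal sketch in plain Python, not the notebook's implementation: the helper names `cosine_similarity` and `top_k_similar` are hypothetical, and the toy 3-dimensional vectors are made-up values standing in for real BGE-M3 embeddings.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, doc_vecs, k):
    """Return up to k (index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    k = max(0, min(k, len(scored)))  # Never ask for more results than exist
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for real embeddings (made-up values)
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vec = [1.0, 0.1, 0.0]

ranked = top_k_similar(query_vec, doc_vecs, k=2)
for index, score in ranked:
    print(index, round(score, 4))
```

Returning the index alongside the score mirrors what the notebook does: the index is what later links a result back to its row in the CSV.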

## Folder and File Structure
- **Input**:
  - `data/repd_vp_cedulas_principal.csv`: Main CSV file.
  - `data/filtered_dataset.csv`: (Optional) Filtered CSV.
- **Cache**:
  - `data/database_embeddings.pkl`: Embedding cache for the full dataset.
  - `data/filtered_embeddings.pkl`: Embedding cache for the filtered dataset.
- **Output**:
  - `output/search_results_YYYYMMDD_HHMMSS.html`: Results in HTML.
  - `output/filtered_results_YYYYMMDD_HHMMSS.csv`: Results in CSV.
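
The cache files listed above are only reused while the source CSV is unchanged, which the notebook checks by comparing MD5 hashes. The sketch below shows that idea in isolation, using a temporary folder and a tiny made-up CSV; the helper names (`file_md5`, `load_if_fresh`, `save_cache`) and the `payload` key are hypothetical and are not the notebook's own functions.

```python
import hashlib
import os
import pickle
import tempfile


def file_md5(path):
    """MD5 of a file's bytes, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_cache(cache_path, source_path, payload):
    """Store the payload together with the current hash of the source file."""
    with open(cache_path, "wb") as f:
        pickle.dump({"file_hash": file_md5(source_path), "payload": payload}, f)


def load_if_fresh(cache_path, source_path):
    """Return the cached payload, or None if it is missing or stale."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path, "rb") as f:
        cache = pickle.load(f)
    if cache.get("file_hash") != file_md5(source_path):
        return None  # Source file changed since the cache was written
    return cache["payload"]


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "toy.csv")
    cache = os.path.join(tmp, "toy.pkl")
    with open(source, "w", encoding="utf-8") as f:
        f.write("descripcion_desaparicion\nfirst row\n")

    save_cache(cache, source, payload=[[0.1, 0.2]])
    hit = load_if_fresh(cache, source)  # Source unchanged: cache is reused

    with open(source, "a", encoding="utf-8") as f:
        f.write("second row\n")
    miss = load_if_fresh(cache, source)  # Source changed: cache is rejected

print(hit is not None, miss is None)
```

Since `pickle.load` can execute arbitrary code, cache files like these should only be loaded when they come from a trusted location.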

## Requirements
- Install the required dependencies:

In [None]:
# Install required dependencies
!pip install pandas numpy torch sentence-transformers FlagEmbedding jinja2 tqdm

In [None]:
import os
import pandas as pd
import torch
from sentence_transformers import util
import hashlib
import pickle
from tqdm import tqdm
from FlagEmbedding import BGEM3FlagModel
from jinja2 import Environment, FileSystemLoader
import numpy as np
import csv
from datetime import datetime

os.environ["TRANSFORMERS_NO_ADVISORY_WARNINGS"] = "1"

def ensure_output_folder_exists():
    """
    Ensures that the 'output' folder exists.
    """
    output_folder = "output"
    os.makedirs(output_folder, exist_ok=True)  # No-op if the folder already exists
    return output_folder

def set_memory_limit(memory_fraction=0.9):
    """
    Sets a memory limit for GPU usage to avoid crashes.
    :param memory_fraction: Fraction of GPU memory to allocate (default is 0.9, i.e. 90%).
    """
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(memory_fraction)
        torch.cuda.empty_cache()
        print(f"GPU memory usage capped at {memory_fraction * 100:.0f}%.")

def compute_file_hash(file_path):
    """
    Computes the MD5 hash of a file to detect changes.
    """
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def load_cached_embeddings(cache_file, file_hash):
    """
    Loads cached embeddings if the file hash matches.
    """
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            cache = pickle.load(f)
            if cache.get("file_hash") == file_hash:
                print("Loaded cached embeddings.")
                return cache.get("embeddings"), cache.get("texts")
    return None, None

def save_embeddings_to_cache(cache_file, file_hash, embeddings, texts):
    """
    Saves embeddings and their corresponding texts to a cache file along with the file hash.
    """
    with open(cache_file, "wb") as f:
        pickle.dump({"file_hash": file_hash, "embeddings": embeddings, "texts": texts}, f)
    print("Saved embeddings to cache.")

def apply_filter(input_csv, filtered_csv):
    """
    Applies a filter to the original CSV file and saves the filtered data to a new CSV file.
    :param input_csv: Path to the original CSV file.
    :param filtered_csv: Path to save the filtered CSV file.
    :return: Path to the filtered CSV file.
    """
    print("Applying filter to the dataset...")
    cases = []
    with open(input_csv, "r", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Filter cases that meet the conditions
            if (row['condicion_localizacion'].strip().upper() == "NO APLICA"):
                cases.append(row)

    # Save the filtered cases to a new CSV file
    with open(filtered_csv, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(cases)

    print(f"Filtered dataset saved to {filtered_csv}")
    return filtered_csv

def generate_embeddings_for_database(input_csv, column, cache_file):
    """
    Generates dense embeddings for the specified column in the database and caches them.
    :param input_csv: Path to the input CSV file.
    :param column: Column to process.
    :param cache_file: Path to the cache file.
    :return: List of dense embeddings and corresponding texts.
    """
    # Compute file hash
    file_hash = compute_file_hash(input_csv)

    # Load cached embeddings if available
    embeddings, texts = load_cached_embeddings(cache_file, file_hash)
    if embeddings is not None and texts is not None:
        # load_cached_embeddings already reports the cache hit
        return embeddings, texts

    # Generate embeddings
    print("Generating dense embeddings for the database...")
    model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # Use FP16 for faster computation
    df = pd.read_csv(input_csv)
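    # Keep one text per CSV row (empty string for missing values) so that result
    # indices stay aligned with df.iloc[...] in generate_html and export_to_csv
    texts = df[column].fillna("").astype(str).tolist()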

    embeddings = []
    batch_size = 3  # Adjust batch size based on available memory
    total_batches = (len(texts) + batch_size - 1) // batch_size  # Calculate total number of batches

    for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings", unit="batch", total=total_batches):
        batch_texts = texts[i:i + batch_size]
        print(f"\nProcessing batch {i // batch_size + 1}/{total_batches}...")  # Verbose output for each batch
        with torch.no_grad():
            # Generate dense embeddings
            batch_embeddings = model.encode(batch_texts, batch_size=batch_size, max_length=512)['dense_vecs']
            embeddings.extend(batch_embeddings)
        print(f"Batch {i // batch_size + 1} completed. Processed {len(batch_texts)} texts.")

    # Save embeddings to cache
    save_embeddings_to_cache(cache_file, file_hash, embeddings, texts)
    print("Embeddings generation completed and saved to cache.")

    return embeddings, texts

def search_database(query, embeddings, texts, top_k=5):
    """
    Searches the database for the most similar entries to the query using dense embeddings.
    :param query: User's query in natural language.
    :param embeddings: Precomputed dense embeddings for the database.
    :param texts: Corresponding texts for the embeddings.
    :param top_k: Number of top results to return.
    :return: List of top-k results with their similarity scores.
    """
    print("Processing query...")
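    # Load the model once and reuse it across queries instead of reloading it on every call
    if not hasattr(search_database, "_model"):
        search_database._model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # Use FP16 for faster computation
    model = search_database._model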

    with torch.no_grad():
        # Generate dense embedding for the query
        query_embedding = model.encode([query], max_length=512)['dense_vecs'][0]

    # Stack the embeddings into a float32 matrix (FP16 matmul is not supported on every CPU build)
    embeddings_tensor = torch.tensor(np.asarray(embeddings), dtype=torch.float32)
    query_embedding_tensor = torch.tensor(query_embedding, dtype=torch.float32).unsqueeze(0)  # Shape (1, dim)

    # Dot product of each row with the query; this equals cosine similarity only if the
    # vectors are L2-normalized (assumed here to be the case for BGE-M3 dense vectors)
    similarities = (embeddings_tensor @ query_embedding_tensor.T).squeeze(1)

    # Clamp k so torch.topk never asks for more results than there are texts
    k = min(max(top_k, 1), len(texts))
    top_results = torch.topk(similarities, k=k)

    results = []
    for idx, score in zip(top_results.indices.tolist(), top_results.values.tolist()):
        results.append((texts[idx], score, idx))  # Include the index for CSV export

    return results

def generate_html(query, results, output_file="output/search_results.html", input_csv=None):
    """
    Generates an HTML file to display search results in the 'output' folder.
    """
    ensure_output_folder_exists()
    print("Generating HTML file for search results...")

    # Load the original CSV to get additional information (e.g., condicion_localizacion)
    df = pd.read_csv(input_csv) if input_csv else None

    # Build a Jinja2 environment with autoescaping so the query and texts are HTML-escaped
    env = Environment(loader=FileSystemLoader("."), autoescape=True)
    template = env.from_string("""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Search Results</title>
        <style>
            body {
                font-family: Arial, sans-serif;
                margin: 20px;
                padding: 0;
                background-color: #f4f4f9;
            }
            h1 {
                color: #333;
            }
            .result {
                border: 1px solid #ddd;
                border-radius: 5px;
                padding: 15px;
                margin-bottom: 10px;
                background-color: #fff;
            }
            .result h2 {
                margin: 0;
                font-size: 18px;
                color: #007BFF;
            }
            .result p {
                margin: 5px 0;
                font-size: 14px;
                color: #555;
            }
        </style>
    </head>
    <body>
        <h1>Search Results for: "{{ query }}"</h1>
        <p>Showing {{ results|length }} results:</p>
        {% for result in results %}
        <div class="result">
            <h2>Result {{ result.rank }}</h2>
            <p><strong>Similarity:</strong> {{ "%.4f"|format(result.score) }}</p>
            <p><strong>Condition:</strong> {{ result.condicion }}</p>
            <p>{{ result.text }}</p>
        </div>
        {% endfor %}
    </body>
    </html>
    """)

    # Prepare the results with additional information from the CSV
    enriched_results = []
    for rank, (text, score, idx) in enumerate(results, start=1):
        condicion = df.iloc[idx]['condicion_localizacion'] if df is not None else "N/A"
        enriched_results.append({"rank": rank, "text": text, "score": score, "condicion": condicion})

    # Render the HTML content
    html_content = template.render(query=query, results=enriched_results)

    # Write the HTML content to a file
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(html_content)

    print(f"HTML file generated: {output_file}")

def export_to_csv(results, original_csv, output_csv="output/filtered_results.csv"):
    """
    Exports the search results to a new CSV file in the 'output' folder.
    """
    ensure_output_folder_exists()
    print("Exporting results to CSV...")
    df = pd.read_csv(original_csv)
    indices = [idx for _, _, idx in results]  # Extract indices from results
    filtered_df = df.iloc[indices]  # Filter rows by indices
    filtered_df.to_csv(output_csv, index=False)  # Save to a new CSV file
    print(f"Results exported to {output_csv}")

def save_with_timestamp(file_path):
    """
    Saves a file with a timestamp in its filename inside the 'output' folder.
    :param file_path: Path to the file to save.
    """
    output_folder = ensure_output_folder_exists()
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base, ext = os.path.splitext(file_path)
    timestamped_file = os.path.join(output_folder, f"{os.path.basename(base)}_{timestamp}{ext}")
    os.rename(file_path, timestamped_file)
    print(f"File saved with timestamp: {timestamped_file}")
    return timestamped_file

def main():
    # Set GPU memory limit
    set_memory_limit(memory_fraction=1.0)

    input_csv = "data/repd_vp_cedulas_principal.csv"
    column = "descripcion_desaparicion"
    cache_file = "data/database_embeddings.pkl"

    # Ask the user whether to apply a filter
    apply_filter_option = input("Do you want to apply a filter to the dataset? (yes/no): ").strip().lower()
    if apply_filter_option == "yes":
        filtered_csv = "data/filtered_dataset.csv"
        input_csv = apply_filter(input_csv, filtered_csv)
        cache_file = "data/filtered_embeddings.pkl"  # Switch to the cache file for the filtered dataset

    # Generate or load embeddings
    embeddings, texts = generate_embeddings_for_database(input_csv, column, cache_file)

    while True:
        try:
            # Ask the user for a query
            query = input("\nEnter your query (or type 'exit' to quit): ")
            if query.lower() == "exit":
                print("\nExiting... Goodbye!")
                break

            # Ask the user how many results to display
            try:
                top_k = int(input("How many results do you want to display? "))
            except ValueError:
                print("Invalid input. Showing 5 results by default.")
                top_k = 5

            # Search the database
            results = search_database(query, embeddings, texts, top_k=top_k)

            # Generate an HTML file with the results
            output_html = "output/search_results.html"
            generate_html(query, results, output_html, input_csv=input_csv)
            save_with_timestamp(output_html)

            # Export results to a new CSV
            output_csv = "output/filtered_results.csv"
            export_to_csv(results, input_csv, output_csv)
            save_with_timestamp(output_csv)

        except KeyboardInterrupt:
            print("\nKeyboard interrupt detected. Returning to options...\n")
            continue

if __name__ == "__main__":
    main()