This notebook, which is part of v5-CLASS-bert_ft.ipynb, reduces the operator's interactions to a smaller set of options with the help of an LLM.

In [1]:
!which python

/root/bert_ft_marcos/vllm_experiments/.venv/bin/python


In [2]:
import pandas as pd

df = pd.read_excel("processed_llm_output_anonimized_clean.xlsx",
                   usecols=["ACTOR", "DIALOG_ACT", "SUBACT", "ATA_TEXTO_TAGGED"])
# drop duplicate interactions
df = df.drop_duplicates(subset=["ATA_TEXTO_TAGGED"], keep="first")
# before dropping duplicates there were 5052 records
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3847 entries, 0 to 5049
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  3847 non-null   object
 1   ACTOR             3847 non-null   object
 2   DIALOG_ACT        3847 non-null   object
 3   SUBACT            3847 non-null   object
dtypes: object(4)
memory usage: 150.3+ KB


In [15]:
# count the operators' interactions
df_operadores = df[df["ACTOR"] == "Operador"].copy()
df_operadores.shape

(1778, 4)

In [16]:
# of the 3847 interactions, 1778 come from operators
# now break them down by dialog act
df_operadores["DIALOG_ACT"].value_counts()

DIALOG_ACT
Pregunta       1446
Rutina          196
Descripción      93
Orden            38
Ruido             5
Name: count, dtype: int64

In [17]:
# now break them down by dialog subact
df_operadores["SUBACT"].value_counts()

SUBACT
localización            641
otro                    312
confirmación            219
contacto                179
rutina_operador         154
tipo_incidente          140
heridos                  58
cortesía                 36
permanecer_línea         23
evolución                 5
ruido                     5
seguir_instrucciones      4
acción_seguridad          2
Name: count, dtype: int64

The idea is to reduce the data gradually: take each top-level dialog act in turn and, within it, each of its subacts.
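
The order of work implied by this plan can be sketched as a list of (dialog act, subact) groups sorted by size. This is only an illustration using the standard library: the `rows` list and the `reduction_plan` helper are hypothetical and stand in for the `DIALOG_ACT` and `SUBACT` columns of `df_operadores`.

```python
from collections import Counter

# Hypothetical (dialog act, subact) labels standing in for the operator rows.
rows = [
    ("Pregunta", "localización"),
    ("Pregunta", "localización"),
    ("Pregunta", "contacto"),
    ("Rutina", "rutina_operador"),
]

def reduction_plan(labels):
    """Return (act, subact, count) tuples, largest groups first."""
    counts = Counter(labels)
    return [(act, sub, n) for (act, sub), n in counts.most_common()]

plan = reduction_plan(rows)
for act, sub, n in plan:
    print(f"{act} / {sub}: {n} utterances")
```

Working from the largest group down matches the order followed in the sections below, where localización is reduced first.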

In [6]:
# these are the dialog acts:

# Pregunta
# Rutina
# Descripción
# Orden

In [18]:
# this is the client for the gpt-oss model that runs as
# a service on the HPC
from openai import OpenAI

system_prompt = """You are a linguistic analyst specialized in emergency call transcripts.
You must follow instructions precisely and output only the requested format.
Reasoning: high
"""

client = OpenAI(base_url="http://localhost:8010/v1", api_key="EMPTY")

def get_llm_answer(user_prompt):
    """Send one user prompt to the local gpt-oss service and return the reply text."""
    result = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return result.choices[0].message.content

# Pregunta

In [7]:
# for Pregunta, these are the subacts
df_operadores_pregunta = df_operadores[df_operadores["DIALOG_ACT"] == "Pregunta"].copy()
df_operadores_pregunta["SUBACT"].value_counts()

SUBACT
localización      623
otro              234
confirmación      219
contacto          179
tipo_incidente    135
heridos            56
Name: count, dtype: int64

## Localización

In [8]:
# within Pregunta, reduce localización first
df_operadores_pregunta_localizacion = df_operadores_pregunta[df_operadores_pregunta["SUBACT"] == "localización"]


In [9]:
# prompt for the LLM
loc_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Question
- SUBACT: Localization

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that request the SAME information.
- Do NOT merge questions that request DIFFERENT information.
- Write natural, neutral questions an operator would ask.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Canonical questions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}

Utterances:
"""

# to be appended at the end of the prompt:
# Reasoning: Medium.

In [11]:
# since there are 623 localización questions,
# they are sent in batches of 50;
# once each batch is reduced, the resulting general questions are pooled
# and the LLM is asked to reduce them one last time,
# which ensures that every utterance has been seen by the LLM
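
The two-stage scheme described above (reduce each batch, pool the partial results, reduce the pool once more) can be sketched independently of the LLM. In this minimal sketch `batched`, `two_stage_reduce` and the `dedupe` reducer are hypothetical names; `dedupe` is only a stand-in for the LLM call, and the utterances are synthetic.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def two_stage_reduce(items, reduce_fn, size=50):
    """Reduce each batch, pool the partial results, then reduce the pool once more."""
    partial = []
    for batch in batched(items, size):
        partial.extend(reduce_fn(batch))
    return reduce_fn(partial)

# Stand-in reducer (an assumption): the LLM call would go here.
def dedupe(batch):
    return sorted(set(batch))

# Synthetic data: 623 utterances drawn from 7 distinct strings.
utterances = [f"pregunta {i % 7}" for i in range(623)]
n_batches = len(list(batched(utterances, 50)))
final = two_stage_reduce(utterances, dedupe)
print(n_batches, len(final))
```

With 623 items and a batch size of 50 this gives 13 batches, the last one holding the remaining 23 items.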

In [11]:
from tqdm import tqdm
import pandas as pd
import json_repair

slice_size = 50
num_rows = len(df_operadores_pregunta_localizacion)

for i in tqdm(range(0, num_rows, slice_size), desc="Processing location questions"):
    sliced_df = df_operadores_pregunta_localizacion.iloc[i:i + slice_size]
    utterances = sliced_df["ATA_TEXTO_TAGGED"].tolist()

    user_prompt = loc_prompt + str(utterances)

    response = get_llm_answer(user_prompt)

    parsed_response = json_repair.repair_json(
        response,
        return_objects=True,
        ensure_ascii=False
    )
    
    temp_df = pd.DataFrame({
        "canonical_questions": parsed_response["canonical_questions"]
    })
    
    temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

    temp_df_name = (
        f"reduccion_interacciones_operador/"
        f"reduccion_pregunta/"
        f"reduccion_localizacion/v1/{i}_{i+len(sliced_df)-1}.csv"
    )

    temp_df.to_csv(temp_df_name, index=False)

Processing location questions: 100%|████████████████████████████████████████████████████████| 13/13 [02:58<00:00, 13.70s/it]
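
The loop above indexes `parsed_response["canonical_questions"]` directly, so a reply without that key would stop the run partway through. A minimal sketch of a guard with retries follows. It assumes strict parsing with the standard `json` module instead of `json_repair`, and `extract_questions`, `ask_with_retries` and the fake `ask` callable are hypothetical names used only for illustration.

```python
import json

def extract_questions(raw, key="canonical_questions"):
    """Parse an LLM reply and return a clean list of question strings.

    Raises ValueError when the reply is not a JSON object holding a list
    under `key`, so the caller can retry instead of failing later.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get(key), list):
        raise ValueError(f"reply has no list under {key!r}")
    return [q.strip() for q in data[key] if isinstance(q, str) and q.strip()]

def ask_with_retries(ask, prompt, attempts=3):
    """Call `ask(prompt)` until a usable reply arrives or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return extract_questions(ask(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no usable reply after {attempts} attempts") from last_error

# Fake replies (an assumption): the first one is malformed, the second is valid.
replies = iter(["not json", '{"canonical_questions": ["¿Dónde se encuentra?", " "]}'])
questions = ask_with_retries(lambda prompt: next(replies), "prompt")
print(questions)
```

In the notebook, `ask` would be `get_llm_answer`, and the strict `json.loads` call could be swapped for the more tolerant `json_repair.repair_json` already in use.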


In [None]:
# Once all 623 operator interactions under
# PREGUNTA / LOCALIZACIÓN have been processed
# (sent in batches of 50 and saved as CSV files),
# the 13 CSV files are concatenated into one and the LLM is asked
# one last time, with the same prompt, to reduce them.
# The result is the set of canonical (base) questions asked by operators
# in the pregunta/localizacion category

In [12]:
import pandas as pd
import glob
import os

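out_dir = "reduccion_interacciones_operador/reduccion_pregunta/reduccion_localizacion/v1"

# match only the batch files (<start>_<end>.csv) so that a re-run does not
# pick up concat_all_dfs.csv or the final CSV saved in the same directory
all_csv = sorted(glob.glob(os.path.join(out_dir, "[0-9]*_[0-9]*.csv")))
concat_df = pd.concat(map(pd.read_csv, all_csv), ignore_index=True)
print(concat_df.info())
concat_df.to_csv(os.path.join(out_dir, "concat_all_dfs.csv"), index=False)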

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  52 non-null     object
dtypes: object(1)
memory usage: 544.0+ bytes
None
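
The concatenated file and the final CSV are written into the same directory as the batch files, so a plain `*.csv` pattern would also match them when the cell is run a second time. A minimal sketch of selecting only the batch files and ordering them by their numeric start index is shown here. The `select_batch_files` helper is hypothetical, and the file names follow the `<start>_<end>.csv` naming used in the loop.

```python
import re

# Batch files are named '<start>_<end>.csv', e.g. '0_49.csv'.
BATCH_FILE = re.compile(r"^(\d+)_(\d+)\.csv$")

def select_batch_files(names):
    """Keep only '<start>_<end>.csv' names, ordered by numeric start index."""
    matches = [(BATCH_FILE.match(name), name) for name in names]
    kept = [(int(m.group(1)), name) for m, name in matches if m]
    return [name for _, name in sorted(kept)]

names = ["100_149.csv", "concat_all_dfs.csv", "0_49.csv",
         "final_pregunta_localizacion.csv", "50_99.csv"]
print(select_batch_files(names))
```

Sorting on the integer start index avoids the lexicographic order in which `100_149.csv` would come before `50_99.csv`.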


In [13]:
# as shown, the 623 interactions yielded 52 canonical questions;
# these 52 questions are now sent to the LLM to obtain the base set, because the LLM
# only saw batches of 50 and near-identical questions may remain across batches
import pandas as pd

utterances = str(concat_df["canonical_questions"].tolist())
user_prompt = loc_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_questions": parsed_response["canonical_questions"]
})

temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_localizacion/v1/final_pregunta_localizacion.csv"
)

temp_df.to_csv(temp_df_name, index=False)
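
Questions that differ only in case, accents or punctuation can be removed cheaply before the pooled list goes back to the LLM. The following is a minimal sketch of that optional pre-filter, not a step the notebook performs. `normalize` and `drop_near_duplicates` are hypothetical helpers and the `pooled` list is invented for illustration.

```python
import unicodedata

def normalize(question):
    """Casefold, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", question.casefold())
    kept = [c for c in decomposed
            if not unicodedata.combining(c) and (c.isalnum() or c.isspace())]
    return " ".join("".join(kept).split())

def drop_near_duplicates(questions):
    """Keep the first question for each normalized form, preserving order."""
    seen = set()
    unique = []
    for q in questions:
        key = normalize(q)
        if key and key not in seen:
            seen.add(key)
            unique.append(q)
    return unique

# Invented examples: the first two differ only in accents and punctuation.
pooled = ["¿Cuál es la dirección?", "cual es la direccion", "¿En qué ciudad se encuentra?"]
print(drop_near_duplicates(pooled))
```

This only catches surface variants; merging questions that are phrased differently but ask for the same information still needs the second LLM pass.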

In [14]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  8 non-null      object
dtypes: object(1)
memory usage: 192.0+ bytes


In [15]:
# in the end, 8 canonical localización questions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_localizacion/v1/final_pregunta_localizacion.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Otro

In [9]:
# within Pregunta, now reduce otro
df_operadores_pregunta_otro = df_operadores_pregunta[df_operadores_pregunta["SUBACT"] == "otro"]
df_operadores_pregunta_otro.info()


<class 'pandas.core.frame.DataFrame'>
Index: 234 entries, 0 to 5046
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  234 non-null    object
 1   ACTOR             234 non-null    object
 2   DIALOG_ACT        234 non-null    object
 3   SUBACT            234 non-null    object
dtypes: object(4)
memory usage: 9.1+ KB


In [10]:
# prompt for the LLM
other_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Question
- SUBACT: Other

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that request the SAME information.
- Do NOT merge questions that request DIFFERENT information.
- Write natural, neutral questions an operator would ask.
- Canonical questions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}

Utterances:
"""

# to be appended at the end of the prompt:
# Reasoning: Medium.
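
The prompts in this notebook share the same text and differ only in the SUBACT line; the prompt for "Other" additionally leaves out the guideline about ignoring unrelated utterances. A minimal sketch of building them from one function follows. The name `build_prompt` and its `ignore_unrelated` flag are hypothetical, while the wording is copied from the prompts in this notebook.

```python
GUIDELINES = [
    "Abstract recurring intents, but do not over-simplify.",
    "Merge different phrasings that request the SAME information.",
    "Do NOT merge questions that request DIFFERENT information.",
    "Write natural, neutral questions an operator would ask.",
    "Canonical questions must be general and must not include specific names, "
    "places, institutions, numbers, or unique identifiers.",
    "Do NOT add labels, explanations, or extra text.",
    "Do NOT use placeholders or brackets.",
]
IGNORE_RULE = ("Ignore utterances that are not genuinely related to the "
               "specified DIALOG ACT and SUBACT.")

OUTPUT_SPEC = '''Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}
'''

def build_prompt(subact, dialog_act="Question", ignore_unrelated=True):
    """Assemble the reduction prompt for one dialog act / subact pair."""
    rules = list(GUIDELINES)
    if ignore_unrelated:
        rules.insert(4, IGNORE_RULE)  # same position as in the written-out prompts
    lines = [
        "You will be given real utterances produced by OPERATORS at an emergency center.",
        "All utterances belong to:",
        f"- DIALOG ACT: {dialog_act}",
        f"- SUBACT: {subact}",
        "",
        "The utterances may be incomplete or noisy.",
        "Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS",
        "written in SPANISH.",
        "",
        "Guidelines:",
        *[f"- {rule}" for rule in rules],
        "",
        OUTPUT_SPEC,
        "Utterances:",
        "",
    ]
    return "\n".join(lines)

loc_like = build_prompt("Localization")
other_like = build_prompt("Other", ignore_unrelated=False)
print(other_like)
```

Keeping a single template makes a difference such as the omitted guideline an explicit choice rather than something that can drift between copies.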

In [None]:
# since there are 234 otro questions,
# they are sent in batches of 50;
# once each batch is reduced, the resulting general questions are pooled
# and the LLM is asked to reduce them one last time,
# which ensures that every utterance has been seen by the LLM

In [11]:
from tqdm import tqdm
import pandas as pd
import json_repair

slice_size = 50
num_rows = len(df_operadores_pregunta_otro)

for i in tqdm(range(0, num_rows, slice_size), desc="Processing other questions"):
    sliced_df = df_operadores_pregunta_otro.iloc[i:i + slice_size]
    utterances = sliced_df["ATA_TEXTO_TAGGED"].tolist()

    user_prompt = other_prompt + str(utterances)

    response = get_llm_answer(user_prompt)

    parsed_response = json_repair.repair_json(
        response,
        return_objects=True,
        ensure_ascii=False
    )
    
    temp_df = pd.DataFrame({
        "canonical_questions": parsed_response["canonical_questions"]
    })
    
    temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

    temp_df_name = (
        f"reduccion_interacciones_operador/"
        f"reduccion_pregunta/"
        f"reduccion_otro/v1/{i}_{i+len(sliced_df)-1}.csv"
    )

    temp_df.to_csv(temp_df_name, index=False)

Processing other questions: 100%|█████████████████████████████████████████████████████████████| 5/5 [02:12<00:00, 26.59s/it]
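
The loops in this notebook append `str(utterances)` to the prompt, which is Python's list representation and mixes single and double quotes depending on the content of each string. One alternative, shown here only as a sketch, is to serialize the batch with `json.dumps`, which always yields a valid JSON array. The example utterances are invented.

```python
import json

# Invented utterances containing accents and both kinds of quotes.
utterances = ["¿Cuál es su dirección?", 'dijo "aquí" y colgó', "it's near the park"]

as_repr = str(utterances)                             # Python list repr, mixed quoting
as_json = json.dumps(utterances, ensure_ascii=False)  # JSON array, accents preserved

print(as_repr)
print(as_json)

# The JSON form round-trips back to the same list.
round_trip = json.loads(as_json)
```

Whether this changes the quality of the model's output is not something this notebook measures; the point is only that the JSON form is unambiguous to parse.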


In [11]:
# Once all 234 operator interactions under
# PREGUNTA / OTRO have been processed
# (sent in batches of 50 and saved as CSV files),
# the 5 CSV files are concatenated into one and the LLM is asked
# one last time, with the same prompt, to reduce them.
# The result is the set of canonical (base) questions asked by operators
# in the pregunta/otro category

In [12]:
import pandas as pd
import glob
import os

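out_dir = "reduccion_interacciones_operador/reduccion_pregunta/reduccion_otro/v1"
# match only the batch files (<start>_<end>.csv) so that a re-run does not
# pick up concat_all_dfs.csv or the final CSV saved in the same directory
all_csv = sorted(glob.glob(os.path.join(out_dir, "[0-9]*_[0-9]*.csv")))
concat_df = pd.concat(map(pd.read_csv, all_csv), ignore_index=True)
print(concat_df.info())
concat_df.to_csv(os.path.join(out_dir, "concat_all_dfs.csv"), index=False)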

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  66 non-null     object
dtypes: object(1)
memory usage: 656.0+ bytes
None


In [13]:
# as shown, the 234 interactions yielded 66 canonical questions;
# these 66 questions are now sent to the LLM to obtain the base set, because the LLM
# only saw batches of 50 and near-identical questions may remain across batches
import pandas as pd

utterances = str(concat_df["canonical_questions"].tolist())
user_prompt = other_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_questions": parsed_response["canonical_questions"]
})

temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_otro/v1/final_pregunta_otro.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [14]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  29 non-null     object
dtypes: object(1)
memory usage: 360.0+ bytes


In [15]:
# in the end, 29 canonical otro questions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_otro/v1/final_pregunta_otro.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Confirmación

In [8]:
# within Pregunta, now reduce confirmación
df_operadores_pregunta_confirmacion = df_operadores_pregunta[df_operadores_pregunta["SUBACT"] == "confirmación"]
df_operadores_pregunta_confirmacion.info()


<class 'pandas.core.frame.DataFrame'>
Index: 219 entries, 72 to 4989
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  219 non-null    object
 1   ACTOR             219 non-null    object
 2   DIALOG_ACT        219 non-null    object
 3   SUBACT            219 non-null    object
dtypes: object(4)
memory usage: 8.6+ KB


In [9]:
# prompt for the LLM
conf_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Question
- SUBACT: Confirmation

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that request the SAME information.
- Do NOT merge questions that request DIFFERENT information.
- Write natural, neutral questions an operator would ask.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Canonical questions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}

Utterances:
"""

# to be appended at the end of the prompt:
# Reasoning: Medium.

In [10]:
# since there are 219 confirmación questions,
# they are sent in batches of 50;
# once each batch is reduced, the resulting general questions are pooled
# and the LLM is asked to reduce them one last time,
# which ensures that every utterance has been seen by the LLM

In [10]:
from tqdm import tqdm
import pandas as pd
import json_repair

slice_size = 50
num_rows = len(df_operadores_pregunta_confirmacion)

for i in tqdm(range(0, num_rows, slice_size), desc="Processing confirmation questions"):
    sliced_df = df_operadores_pregunta_confirmacion.iloc[i:i + slice_size]
    utterances = sliced_df["ATA_TEXTO_TAGGED"].tolist()

    user_prompt = conf_prompt + str(utterances)

    response = get_llm_answer(user_prompt)

    parsed_response = json_repair.repair_json(
        response,
        return_objects=True,
        ensure_ascii=False
    )
    
    temp_df = pd.DataFrame({
        "canonical_questions": parsed_response["canonical_questions"]
    })
    
    temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

    temp_df_name = (
        f"reduccion_interacciones_operador/"
        f"reduccion_pregunta/"
        f"reduccion_confirmacion/v1/{i}_{i+len(sliced_df)-1}.csv"
    )

    temp_df.to_csv(temp_df_name, index=False)

Processing confirmation questions: 100%|██████████████████████████████████████████████████████| 5/5 [01:59<00:00, 23.82s/it]
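
Each loop writes its batch files to a nested directory and assumes that the directory already exists. A minimal sketch of building the path with `pathlib` and creating the directory on demand is given below. The `batch_csv_path` helper and its parameters are hypothetical, and the demonstration writes into a temporary directory rather than the project tree.

```python
from pathlib import Path
import tempfile

def batch_csv_path(root, subact, start, batch_len, version="v1"):
    """Return '<root>/reduccion_pregunta/reduccion_<subact>/<version>/<start>_<end>.csv'.

    The parent directory is created if it does not exist yet.
    """
    out_dir = Path(root) / "reduccion_pregunta" / f"reduccion_{subact}" / version
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{start}_{start + batch_len - 1}.csv"

with tempfile.TemporaryDirectory() as tmp:
    # Last confirmación batch: 219 rows in batches of 50 leaves 19 rows from index 200.
    path = batch_csv_path(tmp, "confirmacion", 200, 19)
    name = path.name
    parent_exists = path.parent.is_dir()

print(name, parent_exists)
```

The file name keeps the `<start>_<end>.csv` convention of the loops, so the last batch of 219 rows is named `200_218.csv`.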


In [11]:
# Once all 219 operator interactions under
# PREGUNTA / CONFIRMACIÓN have been processed
# (sent in batches of 50 and saved as CSV files),
# the 5 CSV files are concatenated into one and the LLM is asked
# one last time, with the same prompt, to reduce them.
# The result is the set of canonical (base) questions asked by operators
# in the pregunta/confirmación category

In [12]:
import pandas as pd
import glob
import os

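out_dir = "reduccion_interacciones_operador/reduccion_pregunta/reduccion_confirmacion/v1"
# match only the batch files (<start>_<end>.csv) so that a re-run does not
# pick up concat_all_dfs.csv or the final CSV saved in the same directory
all_csv = sorted(glob.glob(os.path.join(out_dir, "[0-9]*_[0-9]*.csv")))
concat_df = pd.concat(map(pd.read_csv, all_csv), ignore_index=True)
print(concat_df.info())
concat_df.to_csv(os.path.join(out_dir, "concat_all_dfs.csv"), index=False)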

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  25 non-null     object
dtypes: object(1)
memory usage: 328.0+ bytes
None


In [13]:
# as shown, the 219 interactions yielded 25 canonical questions;
# these 25 questions are now sent to the LLM to obtain the base set, because the LLM
# only saw batches of 50 and near-identical questions may remain across batches
import pandas as pd

utterances = str(concat_df["canonical_questions"].tolist())
user_prompt = conf_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_questions": parsed_response["canonical_questions"]
})

temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_confirmacion/v1/final_pregunta_confirmacion.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [14]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  22 non-null     object
dtypes: object(1)
memory usage: 304.0+ bytes


In [15]:
# in the end, 22 canonical confirmación questions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_confirmacion/v1/final_pregunta_confirmacion.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Contacto

In [8]:
# within Pregunta, now reduce contacto
df_operadores_pregunta_contacto = df_operadores_pregunta[df_operadores_pregunta["SUBACT"] == "contacto"]
df_operadores_pregunta_contacto.info()


<class 'pandas.core.frame.DataFrame'>
Index: 179 entries, 24 to 5048
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  179 non-null    object
 1   ACTOR             179 non-null    object
 2   DIALOG_ACT        179 non-null    object
 3   SUBACT            179 non-null    object
dtypes: object(4)
memory usage: 7.0+ KB


In [9]:
# prompt for the LLM
contact_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Question
- SUBACT: Contact

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that request the SAME information.
- Do NOT merge questions that request DIFFERENT information.
- Write natural, neutral questions an operator would ask.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Canonical questions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}

Utterances:
"""

# to be appended at the end of the prompt:
# Reasoning: Medium.

In [10]:
# since there are 179 contacto questions,
# they are sent in batches of 50;
# once each batch is reduced, the resulting general questions are pooled
# and the LLM is asked to reduce them one last time,
# which ensures that every utterance has been seen by the LLM

In [10]:
from tqdm import tqdm
import pandas as pd
import json_repair

slice_size = 50
num_rows = len(df_operadores_pregunta_contacto)

for i in tqdm(range(0, num_rows, slice_size), desc="Processing contact questions"):
    sliced_df = df_operadores_pregunta_contacto.iloc[i:i + slice_size]
    utterances = sliced_df["ATA_TEXTO_TAGGED"].tolist()

    user_prompt = contact_prompt + str(utterances)

    response = get_llm_answer(user_prompt)

    parsed_response = json_repair.repair_json(
        response,
        return_objects=True,
        ensure_ascii=False
    )
    
    temp_df = pd.DataFrame({
        "canonical_questions": parsed_response["canonical_questions"]
    })
    
    temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

    temp_df_name = (
        f"reduccion_interacciones_operador/"
        f"reduccion_pregunta/"
        f"reduccion_contacto/v1/{i}_{i+len(sliced_df)-1}.csv"
    )

    temp_df.to_csv(temp_df_name, index=False)

Processing contact questions: 100%|███████████████████████████████████████████████████████████| 4/4 [01:21<00:00, 20.39s/it]


In [None]:
# Once all 179 operator interactions under
# PREGUNTA / CONTACTO have been processed
# (sent in batches of 50 and saved as CSV files),
# the 4 CSV files are concatenated into one and the LLM is asked
# one last time, with the same prompt, to reduce them.
# The result is the set of canonical (base) questions asked by operators
# in the pregunta/contacto category

In [11]:
import pandas as pd
import glob
import os

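out_dir = "reduccion_interacciones_operador/reduccion_pregunta/reduccion_contacto/v1"
# match only the batch files (<start>_<end>.csv) so that a re-run does not
# pick up concat_all_dfs.csv or the final CSV saved in the same directory
all_csv = sorted(glob.glob(os.path.join(out_dir, "[0-9]*_[0-9]*.csv")))
concat_df = pd.concat(map(pd.read_csv, all_csv), ignore_index=True)
print(concat_df.info())
concat_df.to_csv(os.path.join(out_dir, "concat_all_dfs.csv"), index=False)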

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  12 non-null     object
dtypes: object(1)
memory usage: 224.0+ bytes
None


In [None]:
# as shown, the 179 interactions yielded 12 canonical questions;
# these 12 questions are now sent to the LLM to obtain the base set, because the LLM
# only saw batches of 50 and near-identical questions may remain across batches
import pandas as pd

utterances = str(concat_df["canonical_questions"].tolist())
user_prompt = contact_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_questions": parsed_response["canonical_questions"]
})

temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_contacto/v1/final_pregunta_contacto.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [13]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  4 non-null      object
dtypes: object(1)
memory usage: 160.0+ bytes


In [None]:
# in the end, 4 canonical contacto questions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_contacto/v1/final_pregunta_contacto.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Tipo_incidente

In [8]:
# within Pregunta, now reduce tipo_incidente
df_operadores_pregunta_tipo = df_operadores_pregunta[df_operadores_pregunta["SUBACT"] == "tipo_incidente"]
df_operadores_pregunta_tipo.info()


<class 'pandas.core.frame.DataFrame'>
Index: 135 entries, 2 to 4866
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  135 non-null    object
 1   ACTOR             135 non-null    object
 2   DIALOG_ACT        135 non-null    object
 3   SUBACT            135 non-null    object
dtypes: object(4)
memory usage: 5.3+ KB


In [9]:
# prompt for the LLM
type_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Question
- SUBACT: Incident Type

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that request the SAME information.
- Do NOT merge questions that request DIFFERENT information.
- Write natural, neutral questions an operator would ask.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Canonical questions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}

Utterances:
"""

# to be appended at the end of the prompt:
# Reasoning: Medium.

In [10]:
# since there are 135 tipo_incidente questions,
# they are sent in batches of 50;
# once each batch is reduced, the resulting general questions are pooled
# and the LLM is asked to reduce them one last time,
# which ensures that every utterance has been seen by the LLM

In [11]:
from tqdm import tqdm
import pandas as pd
import json_repair

slice_size = 50
num_rows = len(df_operadores_pregunta_tipo)

for i in tqdm(range(0, num_rows, slice_size), desc="Processing incident type questions"):
    sliced_df = df_operadores_pregunta_tipo.iloc[i:i + slice_size]
    utterances = sliced_df["ATA_TEXTO_TAGGED"].tolist()

    user_prompt = type_prompt + str(utterances)

    response = get_llm_answer(user_prompt)

    parsed_response = json_repair.repair_json(
        response,
        return_objects=True,
        ensure_ascii=False
    )
    
    temp_df = pd.DataFrame({
        "canonical_questions": parsed_response["canonical_questions"]
    })
    
    temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

    temp_df_name = (
        f"reduccion_interacciones_operador/"
        f"reduccion_pregunta/"
        f"reduccion_tipo_incidente/v1/{i}_{i+len(sliced_df)-1}.csv"
    )

    temp_df.to_csv(temp_df_name, index=False)

Processing incident type questions: 100%|█████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.24s/it]


In [None]:
# Once all 135 operator interactions under
# PREGUNTA / TIPO_INCIDENTE have been processed
# (sent in batches of 50 and saved as CSV files),
# the 3 CSV files are concatenated into one and the LLM is asked
# one last time, with the same prompt, to reduce them.
# The result is the set of canonical (base) questions asked by operators
# in the pregunta/tipo_incidente category

In [12]:
import pandas as pd
import glob
import os

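out_dir = "reduccion_interacciones_operador/reduccion_pregunta/reduccion_tipo_incidente/v1"
# match only the batch files (<start>_<end>.csv) so that a re-run does not
# pick up concat_all_dfs.csv or the final CSV saved in the same directory
all_csv = sorted(glob.glob(os.path.join(out_dir, "[0-9]*_[0-9]*.csv")))
concat_df = pd.concat(map(pd.read_csv, all_csv), ignore_index=True)
print(concat_df.info())
concat_df.to_csv(os.path.join(out_dir, "concat_all_dfs.csv"), index=False)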

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  6 non-null      object
dtypes: object(1)
memory usage: 176.0+ bytes
None


In [14]:
# as shown, the 135 interactions yielded 6 canonical questions;
# these 6 questions are now sent to the LLM to obtain the base set, because the LLM
# only saw batches of 50 and near-identical questions may remain across batches
import pandas as pd

utterances = str(concat_df["canonical_questions"].tolist())
user_prompt = type_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_questions": parsed_response["canonical_questions"]
})

temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_tipo_incidente/v1/final_pregunta_tipo_incidente.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [15]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  2 non-null      object
dtypes: object(1)
memory usage: 144.0+ bytes


In [16]:
# in the end, 2 canonical tipo_incidente questions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_tipo_incidente/v1/final_pregunta_tipo_incidente.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Heridos

In [8]:
# within Pregunta, now reduce heridos
df_operadores_pregunta_heridos = df_operadores_pregunta[df_operadores_pregunta["SUBACT"] == "heridos"]
df_operadores_pregunta_heridos.info()


<class 'pandas.core.frame.DataFrame'>
Index: 56 entries, 6 to 4965
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  56 non-null     object
 1   ACTOR             56 non-null     object
 2   DIALOG_ACT        56 non-null     object
 3   SUBACT            56 non-null     object
dtypes: object(4)
memory usage: 2.2+ KB


In [9]:
# prompt for the LLM
inj_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Question
- SUBACT: Injured

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE QUESTIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that request the SAME information.
- Do NOT merge questions that request DIFFERENT information.
- Write natural, neutral questions an operator would ask.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Canonical questions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_questions": [
    "Pregunta canónica 1?",
    "Pregunta canónica 2?"
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [11]:
# there are only 56 heridos questions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_pregunta_heridos["ATA_TEXTO_TAGGED"].tolist())
user_prompt = inj_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_questions": parsed_response["canonical_questions"]
})

temp_df["canonical_questions"] = temp_df["canonical_questions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_heridos/v1/final_pregunta_heridos.csv"
)

temp_df.to_csv(temp_df_name, index=False)
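
The cell above indexes `parsed_response["canonical_questions"]` directly, which fails with a `KeyError` or `TypeError` if the model returns anything other than the expected object. A minimal sketch of a more defensive extraction, assuming a hypothetical helper `extract_items` (not part of the original notebook); the demo input is made up and parsed with the standard `json` module only to keep the example self-contained:

```python
import json


def extract_items(parsed, key):
    """Hypothetical helper: return the cleaned list stored under `key`, or raise ValueError."""
    if not isinstance(parsed, dict) or key not in parsed:
        raise ValueError(f"expected a JSON object with key {key!r}")
    items = parsed[key]
    if not isinstance(items, list):
        raise ValueError(f"{key!r} should hold a list")
    cleaned = []
    for item in items:
        if not isinstance(item, str):
            continue  # skip non-string entries
        text = item.replace('"', "").strip()  # same quote removal as in the notebook
        if text:
            cleaned.append(text)
    return cleaned


# Made-up stand-in for a parsed model response.
demo = json.loads('{"canonical_questions": ["¿Hay personas heridas?", "  ", 3]}')
print(extract_items(demo, "canonical_questions"))
```

The returned list could then be passed to `pd.DataFrame({"canonical_questions": ...})` as in the cell above.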

In [12]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   canonical_questions  8 non-null      object
dtypes: object(1)
memory usage: 192.0+ bytes


In [13]:
# finally, 8 canonical questions were obtained for heridos
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_pregunta/"
    "reduccion_heridos/v1/final_pregunta_heridos.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

# Rutina (Routine)

In [7]:
# For Rutina, these are the subacts
df_operadores_rutina = df_operadores[df_operadores["DIALOG_ACT"] == "Rutina"].copy()
df_operadores_rutina["SUBACT"].value_counts()

SUBACT
rutina_operador    154
cortesía            36
otro                 6
Name: count, dtype: int64

## Rutina Operador (Operator Routine)

In [9]:
# From Rutina, first reduce rutina_operador
df_operadores_rutina_operador = df_operadores_rutina[df_operadores_rutina["SUBACT"] == "rutina_operador"]
df_operadores_rutina_operador.info()


<class 'pandas.core.frame.DataFrame'>
Index: 154 entries, 26 to 4959
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  154 non-null    object
 1   ACTOR             154 non-null    object
 2   DIALOG_ACT        154 non-null    object
 3   SUBACT            154 non-null    object
dtypes: object(4)
memory usage: 6.0+ KB


In [12]:
# prompt for the LLM
rut_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Routine
- SUBACT: Operator Routine

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE EXPRESSIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that perform the SAME routine function.
- Do NOT merge expressions that perform DIFFERENT functions.
- Write natural, neutral expressions an operator would use.
- When an expression encodes a standard institutional formula (e.g., emergency center identification or service assurance), preserve that function in the canonical expression.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Expressions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_expressions": [
    "Expresión canónica 1.",
    "Expresión canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [None]:
# there are 154 operator routine utterances,
# so they are sent in batches of 50
# once every batch has been reduced, the resulting general expressions are merged
# and the LLM is asked to reduce them one last time
# this ensures the LLM has seen every utterance
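
The batching plan described above can be sketched without pandas. This is only an illustration, assuming a hypothetical generator `iter_batches` and placeholder utterances; the index pairs it yields correspond to the `{first}_{last}.csv` file names written by the loop in the next cell:

```python
def iter_batches(items, batch_size=50):
    """Yield (first_index, last_index, batch) for consecutive slices of `items`."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        yield start, start + len(batch) - 1, batch


# 154 placeholder utterances, mirroring the size of the rutina_operador subset.
utterances = [f"utterance {n}" for n in range(154)]
ranges = [(first, last) for first, last, _ in iter_batches(utterances)]
print(ranges)
```

With 154 items and a batch size of 50, this gives three full batches and a final batch of 4 items.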

In [14]:
from tqdm import tqdm
import pandas as pd
import json_repair

slice_size = 50
num_rows = len(df_operadores_rutina_operador)

for i in tqdm(range(0, num_rows, slice_size), desc="Processing operator routines"):
    sliced_df = df_operadores_rutina_operador.iloc[i:i + slice_size]
    utterances = sliced_df["ATA_TEXTO_TAGGED"].tolist()

    user_prompt = rut_prompt + str(utterances)

    response = get_llm_answer(user_prompt)

    parsed_response = json_repair.repair_json(
        response,
        return_objects=True,
        ensure_ascii=False
    )
    
    temp_df = pd.DataFrame({
        "canonical_expressions": parsed_response["canonical_expressions"]
    })
    
    temp_df["canonical_expressions"] = temp_df["canonical_expressions"].str.replace('"', '', regex=False)

    temp_df_name = (
        f"reduccion_interacciones_operador/"
        f"reduccion_rutina/"
        f"reduccion_rutina_operador/v1/{i}_{i+len(sliced_df)-1}.csv"
    )

    temp_df.to_csv(temp_df_name, index=False)

Processing operator routines: 100%|███████████████████████████████████████████████████████████| 4/4 [00:43<00:00, 10.81s/it]


In [None]:
# Once all 154 operator utterances in
# RUTINA / RUTINA_OPERADOR have been processed
# (sent in batches of 50 and saved as CSV files),
# the 4 CSV files are concatenated into one and the LLM is asked,
# with the same prompt, to reduce the expressions one last time
# this yields the canonical (base) expressions used by operators
# in the rutina/rutina_operador category

In [15]:
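import pandas as pd
import glob
import os

# output folder for this subact (named out_dir so the built-in dir() is not shadowed)
out_dir = "reduccion_interacciones_operador/reduccion_rutina/reduccion_rutina_operador/v1"

# match only the per-batch files (e.g. 0_49.csv), so that re-running the cell
# does not also read concat_all_dfs.csv or the final CSV saved in the same folder
all_csv = sorted(glob.glob(os.path.join(out_dir, "[0-9]*_[0-9]*.csv")))
concat_df = pd.concat(map(pd.read_csv, all_csv), ignore_index=True)
print(concat_df.info())
concat_df.to_csv(os.path.join(out_dir, "concat_all_dfs.csv"), index=False)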

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 1 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   canonical_expressions  14 non-null     object
dtypes: object(1)
memory usage: 240.0+ bytes
None
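
Because each batch was reduced independently, the concatenated list can contain expressions that differ only in casing, spacing, or punctuation. A sketch of an optional filter that could run before the final LLM pass; `normalize` and `dedupe_expressions` are hypothetical helpers, not part of the original notebook, the sample expressions are made up, and the filter does not replace the semantic merging done by the LLM:

```python
import re


def normalize(text):
    """Lowercase, collapse whitespace, and drop surrounding punctuation (for comparison only)."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.strip(" .,;:!?¡¿")


def dedupe_expressions(expressions):
    """Keep the first occurrence of each expression, comparing normalized forms."""
    seen = set()
    unique = []
    for expression in expressions:
        key = normalize(expression)
        if key and key not in seen:
            seen.add(key)
            unique.append(expression)
    return unique


# Made-up expressions, used only to illustrate the behaviour.
merged = ["Buenas tardes.", "buenas  tardes", "Gracias por llamar.", "Buenas tardes."]
print(dedupe_expressions(merged))
```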


In [17]:
# as shown above, the 154 interactions were reduced to 14 canonical expressions
# now the 14 expressions are sent back to the LLM to obtain the base ones: the LLM
# only saw batches of 50 at a time, so near-duplicate expressions may remain across batches
import pandas as pd

utterances = str(concat_df["canonical_expressions"].tolist())
user_prompt = rut_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_expressions": parsed_response["canonical_expressions"]
})

temp_df["canonical_expressions"] = temp_df["canonical_expressions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_rutina/"
    "reduccion_rutina_operador/v1/final_rutina_operador.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [18]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   canonical_expressions  4 non-null      object
dtypes: object(1)
memory usage: 160.0+ bytes


In [None]:
# finally, 4 canonical operator routines were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_rutina/"
    "reduccion_rutina_operador/v1/final_rutina_operador.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Cortesía (Courtesy)

In [9]:
# From Rutina, now reduce cortesía
df_operadores_rutina_cortesia = df_operadores_rutina[df_operadores_rutina["SUBACT"] == "cortesía"]
df_operadores_rutina_cortesia.info()


<class 'pandas.core.frame.DataFrame'>
Index: 36 entries, 217 to 4961
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  36 non-null     object
 1   ACTOR             36 non-null     object
 2   DIALOG_ACT        36 non-null     object
 3   SUBACT            36 non-null     object
dtypes: object(4)
memory usage: 1.4+ KB


In [10]:
# prompt for the LLM
court_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Routine
- SUBACT: Courtesy 

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE EXPRESSIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that perform the SAME routine function.
- Do NOT merge expressions that perform DIFFERENT functions.
- Write natural, neutral expressions an operator would use.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Expressions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_expressions": [
    "Expresión canónica 1.",
    "Expresión canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.
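
The prompts in this notebook share one skeleton and differ mainly in the dialog act, the subact, the kind of unit to produce, the guidelines, and the JSON key. A sketch of how that skeleton could be filled from parameters, assuming a hypothetical `build_prompt` helper and `PROMPT_TEMPLATE` constant that are not part of the original notebook; only two of the guidelines are shown in the demo call:

```python
PROMPT_TEMPLATE = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: {dialog_act}
- SUBACT: {subact}

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE {unit}
written in SPANISH.

Guidelines:
{guidelines}

Output JSON (STRICT):

{{
  "{json_key}": [
    "{example_stem} 1.",
    "{example_stem} 2."
  ]
}}

Utterances:
"""


def build_prompt(dialog_act, subact, unit, json_key, example_stem, guidelines):
    """Hypothetical helper: fill the shared prompt skeleton for one dialog act / subact."""
    bullet_list = "\n".join(f"- {line}" for line in guidelines)
    return PROMPT_TEMPLATE.format(
        dialog_act=dialog_act,
        subact=subact,
        unit=unit,
        json_key=json_key,
        example_stem=example_stem,
        guidelines=bullet_list,
    )


prompt = build_prompt(
    dialog_act="Routine",
    subact="Courtesy",
    unit="EXPRESSIONS",
    json_key="canonical_expressions",
    example_stem="Expresión canónica",
    guidelines=[
        "Do NOT add labels, explanations, or extra text.",
        "Do NOT use placeholders or brackets.",
    ],
)
print(prompt)
```

The doubled braces in the template are how `str.format` emits literal `{` and `}`. The question prompts end their examples with `?` instead of `.`, so they would need a small variation of this template.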

In [12]:
# there are only 36 cortesía expressions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_rutina_cortesia["ATA_TEXTO_TAGGED"].tolist())
user_prompt = court_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_expressions": parsed_response["canonical_expressions"]
})

temp_df["canonical_expressions"] = temp_df["canonical_expressions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_rutina/"
    "reduccion_cortesia/v1/final_rutina_cortesia.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [13]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 1 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   canonical_expressions  11 non-null     object
dtypes: object(1)
memory usage: 216.0+ bytes


In [None]:
# finally, 11 canonical courtesy routines were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_rutina/"
    "reduccion_cortesia/v1/final_rutina_cortesia.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Otro (Other)

In [8]:
# From Rutina, now reduce otro
df_operadores_rutina_otro = df_operadores_rutina[df_operadores_rutina["SUBACT"] == "otro"]
df_operadores_rutina_otro.info()


<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 374 to 4699
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  6 non-null      object
 1   ACTOR             6 non-null      object
 2   DIALOG_ACT        6 non-null      object
 3   SUBACT            6 non-null      object
dtypes: object(4)
memory usage: 240.0+ bytes


In [9]:
# prompt for the LLM
other_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Routine
- SUBACT: Other 

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE EXPRESSIONS
written in SPANISH.

Guidelines:
- Abstract recurring intents, but do not over-simplify.
- Merge different phrasings that perform the SAME routine function.
- Do NOT merge expressions that perform DIFFERENT functions.
- Write natural, neutral expressions an operator would use.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Expressions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_expressions": [
    "Expresión canónica 1.",
    "Expresión canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [10]:
# there are only 6 otro routine utterances,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_rutina_otro["ATA_TEXTO_TAGGED"].tolist())
user_prompt = other_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_expressions": parsed_response["canonical_expressions"]
})

temp_df["canonical_expressions"] = temp_df["canonical_expressions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_rutina/"
    "reduccion_otro/v1/final_rutina_otro.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [12]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   canonical_expressions  4 non-null      object
dtypes: object(1)
memory usage: 160.0+ bytes


In [None]:
# finally, 4 canonical routines were obtained for otro
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_rutina/"
    "reduccion_otro/v1/final_rutina_otro.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

# Descripción (Description)

In [8]:
# For Descripción, these are the subacts
df_operadores_descripcion = df_operadores[df_operadores["DIALOG_ACT"] == "Descripción"].copy()
df_operadores_descripcion["SUBACT"].value_counts()

SUBACT
otro              63
localización      18
tipo_incidente     5
evolución          5
heridos            2
Name: count, dtype: int64

## Otro (Other)

In [18]:
# From Descripción, first reduce otro
df_operadores_descripcion_otro = df_operadores_descripcion[df_operadores_descripcion["SUBACT"] == "otro"]
df_operadores_descripcion_otro.info()


<class 'pandas.core.frame.DataFrame'>
Index: 63 entries, 260 to 4754
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  63 non-null     object
 1   ACTOR             63 non-null     object
 2   DIALOG_ACT        63 non-null     object
 3   SUBACT            63 non-null     object
dtypes: object(4)
memory usage: 2.5+ KB


In [20]:
# prompt for the LLM
other_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Description
- SUBACT: Other

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE DESCRIPTIVE STATEMENTS
written in SPANISH.

Guidelines:
- Ignore conversational routines, greetings, fillers, confirmations, transfers, or administrative talk.
- Abstract recurring descriptive content, but do not over-simplify.
- Merge different phrasings that describe the SAME situation or information.
- Write clear, neutral descriptive statements spoken directly by the operator.
- Do NOT describe the operator's actions; write the statements the operator would say.
- Descriptions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_descriptions": [
    "Descripción canónica 1.",
    "Descripción canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [21]:
# there are only 63 otro descriptions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_descripcion_otro["ATA_TEXTO_TAGGED"].tolist())
user_prompt = other_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_descriptions": parsed_response["canonical_descriptions"]
})

temp_df["canonical_descriptions"] = temp_df["canonical_descriptions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_otro/v1/final_descripcion_otro.csv"
)

temp_df.to_csv(temp_df_name, index=False)
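
Each of these cells makes a single LLM call and assumes the reply is usable. A sketch of a retry loop that asks again when the reply cannot be parsed or lacks the expected key, assuming a hypothetical helper `call_with_retry` and a fake model function; in the notebook, `get_llm_answer` would be passed as `ask`, and `json_repair.repair_json` could take the place of `json.loads`:

```python
import json


def call_with_retry(ask, prompt, key, max_attempts=3):
    """Hypothetical helper: call ask(prompt) until the reply holds a list under `key`."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        reply = ask(prompt)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError as error:
            last_error = error
            continue  # malformed reply: try again
        if isinstance(parsed, dict) and isinstance(parsed.get(key), list):
            return parsed[key]
        last_error = ValueError(f"attempt {attempt}: missing list under {key!r}")
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error


# Fake model used only for the demo: the first reply is malformed, the second is valid.
replies = iter(["not json", '{"canonical_descriptions": ["Descripción de ejemplo."]}'])
result = call_with_retry(lambda prompt: next(replies), "prompt text", "canonical_descriptions")
print(result)
```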

In [22]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_descriptions  6 non-null      object
dtypes: object(1)
memory usage: 176.0+ bytes


In [23]:
# finally, 6 canonical descriptions were obtained for otro
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_otro/v1/final_descripcion_otro.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Localización (Location)

In [8]:
# From Descripción, now reduce localización
df_operadores_descripcion_localizacion = df_operadores_descripcion[df_operadores_descripcion["SUBACT"] == "localización"]
df_operadores_descripcion_localizacion.info()


<class 'pandas.core.frame.DataFrame'>
Index: 18 entries, 184 to 4606
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  18 non-null     object
 1   ACTOR             18 non-null     object
 2   DIALOG_ACT        18 non-null     object
 3   SUBACT            18 non-null     object
dtypes: object(4)
memory usage: 720.0+ bytes


In [9]:
# prompt for the LLM
loc_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Description
- SUBACT: Localization

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE DESCRIPTIVE STATEMENTS
written in SPANISH.

Guidelines:
- Only consider utterances that describe the location of the incident or the emergency.
- Abstract recurring location descriptions, but do not over-simplify.
- Merge different phrasings that describe the SAME type of location information.
- Write clear, neutral descriptive statements spoken directly by the operator.
- Do NOT describe the operator's actions; write the statements the operator would say.
- Descriptions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_descriptions": [
    "Descripción canónica 1.",
    "Descripción canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [10]:
# there are only 18 localización descriptions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_descripcion_localizacion["ATA_TEXTO_TAGGED"].tolist())
user_prompt = loc_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_descriptions": parsed_response["canonical_descriptions"]
})

temp_df["canonical_descriptions"] = temp_df["canonical_descriptions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_localizacion/v1/final_descripcion_localizacion.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [11]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_descriptions  10 non-null     object
dtypes: object(1)
memory usage: 208.0+ bytes


In [None]:
# finally, 10 canonical location descriptions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_localizacion/v1/final_descripcion_localizacion.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Tipo_incidente (Incident Type)

In [13]:
# From Descripción, now reduce tipo_incidente
df_operadores_descripcion_tipo_incidente = df_operadores_descripcion[df_operadores_descripcion["SUBACT"] == "tipo_incidente"]
df_operadores_descripcion_tipo_incidente.info()


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 104 to 2577
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  5 non-null      object
 1   ACTOR             5 non-null      object
 2   DIALOG_ACT        5 non-null      object
 3   SUBACT            5 non-null      object
dtypes: object(4)
memory usage: 200.0+ bytes


In [15]:
# prompt for the LLM
type_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Description
- SUBACT: Incident Type

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE DESCRIPTIVE STATEMENTS
that describe the TYPE OF INCIDENT,
written in SPANISH.

Guidelines:
- Only consider utterances that clearly describe the nature or type of the incident.
- Ignore questions, routines, transfers, greetings, or administrative talk.
- Abstract the incident type; remove specific details, entities, or values.
- Merge different phrasings that describe the SAME type of incident.
- Write clear, neutral descriptive statements spoken directly by the operator.
- Do NOT describe the operator's actions or mention the caller.
- Descriptions must be general and must not include specific names, places, institutions, brands, body parts, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_descriptions": [
    "Descripción canónica 1.",
    "Descripción canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [16]:
# there are only 5 tipo_incidente descriptions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_descripcion_tipo_incidente["ATA_TEXTO_TAGGED"].tolist())
user_prompt = type_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_descriptions": parsed_response["canonical_descriptions"]
})

temp_df["canonical_descriptions"] = temp_df["canonical_descriptions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_tipo_incidente/v1/final_descripcion_tipo_incidente.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [17]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_descriptions  2 non-null      object
dtypes: object(1)
memory usage: 144.0+ bytes


In [None]:
# finally, 2 canonical incident-type descriptions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_tipo_incidente/v1/final_descripcion_tipo_incidente.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Evolución (Evolution)

In [9]:
# From Descripción, now reduce evolución
df_operadores_descripcion_evolucion = df_operadores_descripcion[df_operadores_descripcion["SUBACT"] == "evolución"]
df_operadores_descripcion_evolucion.info()


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 615 to 2745
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  5 non-null      object
 1   ACTOR             5 non-null      object
 2   DIALOG_ACT        5 non-null      object
 3   SUBACT            5 non-null      object
dtypes: object(4)
memory usage: 200.0+ bytes


In [None]:
# prompt for the LLM
evo_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Description
- SUBACT: Incident Evolution

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE DESCRIPTIVE STATEMENTS
written in SPANISH.

Guidelines:
- Only consider utterances that clearly describe the incident evolution.
- Ignore questions, routines, transfers, greetings, or administrative talk.
- Merge different phrasings that describe the SAME incident evolution.
- Write clear, neutral descriptive statements spoken directly by the operator.
- Do NOT describe the operator's actions or mention the caller.
- Descriptions must be general and must not include specific names, places, institutions, brands, body parts, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_descriptions": [
    "Descripción canónica 1.",
    "Descripción canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [10]:
# there are only 5 evolución descriptions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_descripcion_evolucion["ATA_TEXTO_TAGGED"].tolist())
user_prompt = evo_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_descriptions": parsed_response["canonical_descriptions"]
})

temp_df["canonical_descriptions"] = temp_df["canonical_descriptions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_evolucion/v1/final_descripcion_evolucion.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [11]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_descriptions  3 non-null      object
dtypes: object(1)
memory usage: 152.0+ bytes


In [None]:
# finally, 3 canonical evolution descriptions were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_evolucion/v1/final_descripcion_evolucion.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Heridos (Injured)

In [10]:
# From Descripción, now reduce heridos
df_operadores_descripcion_heridos = df_operadores_descripcion[df_operadores_descripcion["SUBACT"] == "heridos"]
df_operadores_descripcion_heridos.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 1081 to 2342
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  2 non-null      object
 1   ACTOR             2 non-null      object
 2   DIALOG_ACT        2 non-null      object
 3   SUBACT            2 non-null      object
dtypes: object(4)
memory usage: 80.0+ bytes


In [15]:
# prompt for the LLM
inj_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Description
- SUBACT: Injured

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE DESCRIPTIVE STATEMENTS
written in SPANISH.

Guidelines:
- Only consider utterances that clearly describe injured people or injuries.
- Ignore questions, routines, transfers, greetings, or administrative talk.
- Merge different phrasings that describe the SAME injury information.
- Write clear, grammatically correct, neutral descriptive statements spoken directly by the operator.
- Do NOT describe the operator's actions or mention the caller.
- Descriptions must be general and must not include specific names, places, institutions, brands, body parts, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_descriptions": [
    "Descripción canónica 1.",
    "Descripción canónica 2."
  ]
}

Utterances:
"""

# append at the end of the prompt:
# Reasoning: Medium.

In [16]:
# there are only 2 heridos descriptions,
# so they are all sent in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_descripcion_heridos["ATA_TEXTO_TAGGED"].tolist())
user_prompt = inj_prompt + str(utterances)
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_descriptions": parsed_response["canonical_descriptions"]
})

temp_df["canonical_descriptions"] = temp_df["canonical_descriptions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_heridos/v1/final_descripcion_heridos.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [17]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_descriptions  2 non-null      object
dtypes: object(1)
memory usage: 144.0+ bytes


In [None]:
# finally, 2 canonical descriptions were obtained for heridos
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_descripcion/"
    "reduccion_heridos/v1/final_descripcion_heridos.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

# Orden (Order)

In [19]:
# For Orden, these are the subacts
df_operadores_orden = df_operadores[df_operadores["DIALOG_ACT"] == "Orden"].copy()
df_operadores_orden["SUBACT"].value_counts()

SUBACT
permanecer_línea        23
otro                     9
seguir_instrucciones     4
acción_seguridad         2
Name: count, dtype: int64

## Permanecer en línea (Stay on the Line)

In [None]:
# From Orden, reduce permanecer_línea
df_operadores_orden_linea = df_operadores_orden[df_operadores_orden["SUBACT"] == "permanecer_línea"]
df_operadores_orden_linea.info()


<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 422 to 4326
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  23 non-null     object
 1   ACTOR             23 non-null     object
 2   DIALOG_ACT        23 non-null     object
 3   SUBACT            23 non-null     object
dtypes: object(4)
memory usage: 920.0+ bytes


In [12]:
# prompt for the LLM
stay_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Order
- SUBACT: Stay on line

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE INSTRUCTIONS
written in SPANISH.

Guidelines:
- Abstract recurring directive intents, but do not over-simplify.
- Merge different phrasings that instruct the SAME action.
- Do NOT merge instructions that require DIFFERENT actions.
- Write clear, neutral instructions an operator would give.
- Preserve safety-related or procedural intent when present.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Instructions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_instructions": [
    "Instrucción canónica 1.",
    "Instrucción canónica 2."
  ]
}

Utterances:
"""

# to append at the end of the prompt:
# Reasoning: Medium.
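
The Order prompts in this section share the same opening lines and differ mainly in the SUBACT label. A minimal sketch of building them from one template is shown here. `HEADER` and `build_prompt` are hypothetical names that do not appear in the notebook, and serializing the utterances with `json.dumps` instead of `str(list)` is an assumption about what is easier for the model to read, not something the notebook tested.

```python
import json

# Shared opening of every prompt; only the two labels change (hypothetical constant).
HEADER = (
    "You will be given real utterances produced by OPERATORS at an emergency center.\n"
    "All utterances belong to:\n"
    "- DIALOG ACT: {dialog_act}\n"
    "- SUBACT: {subact}\n"
)


def build_prompt(dialog_act, subact, body, utterances):
    """Fill the header, append the guidelines body, then the utterances as a JSON array."""
    header = HEADER.format(dialog_act=dialog_act, subact=subact)
    # ensure_ascii=False keeps accented Spanish characters readable in the prompt
    payload = json.dumps(list(utterances), ensure_ascii=False)
    return f"{header}\n{body.rstrip()}\n\nUtterances:\n{payload}"


if __name__ == "__main__":
    demo = build_prompt("Order", "Stay on line", "Guidelines:\n- ...", ["no cuelgue"])
    print(demo)
```

The `body` argument is passed as a plain value and is never formatted, so the JSON braces of the output schema inside it do not need escaping.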

In [13]:
# there are only 23 stay-on-the-line orders,
# so they are all sent to the LLM in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_orden_linea["ATA_TEXTO_TAGGED"].tolist())
user_prompt = stay_prompt + utterances
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_instructions": parsed_response["canonical_instructions"]
})

temp_df["canonical_instructions"] = temp_df["canonical_instructions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_permanecer_linea/v1/final_orden_permanecer_linea.csv"
)

temp_df.to_csv(temp_df_name, index=False)
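
The cell above indexes `parsed_response["canonical_instructions"]` directly, which fails if the model's answer cannot be repaired into a dictionary with that key. A small validation step could be sketched as follows. `extract_canonical` is a hypothetical helper, not part of the notebook, and dropping duplicates and empty strings is an added assumption about what is wanted.

```python
def extract_canonical(parsed, key):
    """Return a clean list of strings from a parsed LLM response, or raise ValueError."""
    if not isinstance(parsed, dict) or key not in parsed:
        raise ValueError(f"response has no {key!r} field: {parsed!r}")
    items = parsed[key]
    if not isinstance(items, list):
        raise ValueError(f"{key!r} is not a list: {items!r}")

    cleaned = []
    for item in items:
        if not isinstance(item, str):
            continue  # skip anything that is not text
        # same quote removal as the notebook, plus whitespace trimming
        text = item.replace('"', "").strip()
        if text and text not in cleaned:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = {"canonical_instructions": ['"No cuelgue."', " No cuelgue. "]}
    print(extract_canonical(sample, "canonical_instructions"))
```

The returned list can be passed straight to `pd.DataFrame({"canonical_instructions": ...})`.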

In [14]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_instructions  3 non-null      object
dtypes: object(1)
memory usage: 152.0+ bytes


In [15]:
# in the end, 3 canonical stay-on-the-line orders were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_permanecer_linea/v1/final_orden_permanecer_linea.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)
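
The output paths in this section follow one pattern: `reduccion_<act>/reduccion_<subact>/v1/final_<act>_<subact>` under `reduccion_interacciones_operador/`. Since `to_csv` and `to_excel` do not create missing folders, a helper along these lines could build both paths and make sure the folder exists. `output_paths` is a hypothetical name, and the `version` and `root` defaults simply mirror the strings used in the cells.

```python
from pathlib import Path


def output_paths(act, subact, version="v1", root="reduccion_interacciones_operador"):
    """Return (csv_path, xlsx_path) for one subact and create the folder if needed."""
    folder = Path(root) / f"reduccion_{act}" / f"reduccion_{subact}" / version
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    stem = f"final_{act}_{subact}"
    return folder / f"{stem}.csv", folder / f"{stem}.xlsx"


if __name__ == "__main__":
    csv_path, xlsx_path = output_paths("orden", "permanecer_linea")
    print(csv_path)
    print(xlsx_path)
```

Both returned paths can be handed to `temp_df.to_csv(csv_path, index=False)` and `temp_df.to_excel(xlsx_path, index=False)`.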

## Other

In [10]:
# from Orden, now reduce otro
df_operadores_orden_otro = df_operadores_orden[df_operadores_orden["SUBACT"] == "otro"]
df_operadores_orden_otro.info()


<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, 14 to 3319
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  9 non-null      object
 1   ACTOR             9 non-null      object
 2   DIALOG_ACT        9 non-null      object
 3   SUBACT            9 non-null      object
dtypes: object(4)
memory usage: 360.0+ bytes


In [11]:
# prompt for the LLM
order_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Order
- SUBACT: Other

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE INSTRUCTIONS
written in SPANISH.

Guidelines:
- Abstract recurring directive intents, but do not over-simplify.
- Merge different phrasings that instruct the SAME action.
- Do NOT merge instructions that require DIFFERENT actions.
- Write clear, neutral instructions an operator would give.
- Preserve safety-related or procedural intent when present.
- Ignore utterances that are not genuine operator instructions.
- Instructions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_instructions": [
    "Instrucción canónica 1.",
    "Instrucción canónica 2."
  ]
}

Utterances:
"""

# to append at the end of the prompt:
# Reasoning: Medium.

In [12]:
# there are only 9 orders tagged as otro,
# so they are all sent to the LLM in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_orden_otro["ATA_TEXTO_TAGGED"].tolist())
user_prompt = order_prompt + utterances
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_instructions": parsed_response["canonical_instructions"]
})

temp_df["canonical_instructions"] = temp_df["canonical_instructions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_otro/v1/final_orden_otro.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [13]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_instructions  2 non-null      object
dtypes: object(1)
memory usage: 144.0+ bytes


In [None]:
# in the end, 2 canonical orders were obtained for otro
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_otro/v1/final_orden_otro.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)
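
Each subact in this section goes through the same steps: stringify the utterances, append them to the prompt, call the LLM, parse the JSON answer, and strip double quotes. Those steps could be folded into one function, sketched here. `reduce_subact` is a hypothetical name. `ask_llm` stands in for the notebook's `get_llm_answer`, and `parse` defaults to `json.loads` only so that the sketch runs without extra dependencies (the notebook uses `json_repair.repair_json` with `return_objects=True`, which could be passed in instead).

```python
import json


def reduce_subact(utterances, prompt, ask_llm, key, parse=json.loads):
    """Send all utterances in one request and return the canonical strings.

    ask_llm: callable taking the full prompt and returning the raw answer text.
    parse:   callable turning that text into a Python object.
    """
    user_prompt = prompt + str(list(utterances))
    parsed = parse(ask_llm(user_prompt))
    # same cleanup as the notebook cells: drop double quotes
    return [text.replace('"', "") for text in parsed[key]]


if __name__ == "__main__":
    # Fake LLM for demonstration only; it ignores the prompt.
    def fake_llm(_prompt):
        return '{"canonical_instructions": ["Permanezca en la línea."]}'

    print(reduce_subact(["espere", "no cuelgue"], "Utterances:\n", fake_llm,
                        "canonical_instructions"))
```

Injecting the LLM call as an argument also makes the function easy to exercise with a fake model, as in the demo at the bottom.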

## Follow instructions

In [8]:
# from Orden, now reduce seguir_instrucciones
df_operadores_orden_instrucciones = df_operadores_orden[df_operadores_orden["SUBACT"] == "seguir_instrucciones"]
df_operadores_orden_instrucciones.info()


<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 10 to 4354
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  4 non-null      object
 1   ACTOR             4 non-null      object
 2   DIALOG_ACT        4 non-null      object
 3   SUBACT            4 non-null      object
dtypes: object(4)
memory usage: 160.0+ bytes


In [9]:
# prompt for the LLM
inst_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Order
- SUBACT: Follow Instructions

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE INSTRUCTIONS
written in SPANISH.

Guidelines:
- Abstract recurring directive intents, but do not over-simplify.
- Merge different phrasings that instruct the SAME action.
- Do NOT merge instructions that require DIFFERENT actions.
- Write clear, neutral instructions an operator would give.
- Preserve safety-related or procedural intent when present.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Instructions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_instructions": [
    "Instrucción canónica 1.",
    "Instrucción canónica 2."
  ]
}

Utterances:
"""

# to append at the end of the prompt:
# Reasoning: Medium.

In [10]:
# there are only 4 follow-instructions orders,
# so they are all sent to the LLM in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_orden_instrucciones["ATA_TEXTO_TAGGED"].tolist())
user_prompt = inst_prompt + utterances
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_instructions": parsed_response["canonical_instructions"]
})

temp_df["canonical_instructions"] = temp_df["canonical_instructions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_seguir_instrucciones/v1/final_orden_seguir_instruciones.csv"
)

temp_df.to_csv(temp_df_name, index=False)
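
Sending everything in one request works here because each Order subact has at most 23 utterances. For a much larger subact (for example the 641 `localización` utterances counted at the top of the notebook), the list would probably have to be split across several requests. A minimal batching sketch follows; `batched` is a hypothetical helper and the batch size of 50 in the demo is an arbitrary assumption, not a value taken from the notebook.

```python
def batched(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    utterances = [f"utterance {i}" for i in range(23)]
    for batch in batched(utterances, 50):
        # with 23 items and size 50 there is a single batch, i.e. one request
        print(len(batch))
```

The canonical lists returned for each batch would still need a final merge pass, since different batches can produce near-duplicate instructions.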

In [11]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_instructions  3 non-null      object
dtypes: object(1)
memory usage: 152.0+ bytes


In [None]:
# in the end, 3 canonical follow-instructions orders were obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_seguir_instrucciones/v1/final_orden_seguir_instruciones.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)

## Security action

In [26]:
# from Orden, now reduce acción_seguridad
df_operadores_orden_seguridad = df_operadores_orden[df_operadores_orden["SUBACT"] == "acción_seguridad"]
df_operadores_orden_seguridad.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 852 to 2125
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   ATA_TEXTO_TAGGED  2 non-null      object
 1   ACTOR             2 non-null      object
 2   DIALOG_ACT        2 non-null      object
 3   SUBACT            2 non-null      object
dtypes: object(4)
memory usage: 80.0+ bytes


In [28]:
# prompt for the LLM
seg_prompt = """You will be given real utterances produced by OPERATORS at an emergency center.
All utterances belong to:
- DIALOG ACT: Order
- SUBACT: Security Action

The utterances may be incomplete or noisy.
Your task is to reduce them into a distinct set of REPRESENTATIVE BASE INSTRUCTIONS
written in SPANISH.

Guidelines:
- Abstract recurring directive intents, but do not over-simplify.
- Merge different phrasings that instruct the SAME action.
- Do NOT merge instructions that require DIFFERENT actions.
- Write clear, neutral instructions an operator would give.
- Preserve safety-related or procedural intent when present.
- Write the instructions as direct imperatives spoken to the caller; do NOT describe the act of giving an instruction.
- Ignore utterances that are not genuinely related to the specified DIALOG ACT and SUBACT.
- Instructions must be general and must not include specific names, places, institutions, numbers, or unique identifiers.
- Do NOT add labels, explanations, or extra text.
- Do NOT use placeholders or brackets.

Output JSON (STRICT):

{
  "canonical_instructions": [
    "Instrucción canónica 1.",
    "Instrucción canónica 2."
  ]
}

Utterances:
"""

# to append at the end of the prompt:
# Reasoning: Medium.

In [29]:
# there are only 2 security-action orders,
# so they are all sent to the LLM in a single request
import pandas as pd
import json_repair

utterances = str(df_operadores_orden_seguridad["ATA_TEXTO_TAGGED"].tolist())
user_prompt = seg_prompt + utterances
response = get_llm_answer(user_prompt)
parsed_response = json_repair.repair_json(
    response,
    return_objects=True,
    ensure_ascii=False
)

temp_df = pd.DataFrame({
    "canonical_instructions": parsed_response["canonical_instructions"]
})

temp_df["canonical_instructions"] = temp_df["canonical_instructions"].str.replace('"', '', regex=False)

temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_accion_seguridad/v1/final_orden_accion_seguridad.csv"
)

temp_df.to_csv(temp_df_name, index=False)

In [30]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   canonical_instructions  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes


In [None]:
# in the end, 1 canonical security-action order was obtained
temp_df_name = (
    "reduccion_interacciones_operador/"
    "reduccion_orden/"
    "reduccion_accion_seguridad/v1/final_orden_accion_seguridad.xlsx"
)
temp_df.to_excel(temp_df_name, index=False)
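
As a quick check on the Order reduction, the counts printed in the cells above can be tallied. The numbers below are copied from the `info()` outputs in this section. They reflect a single LLM run and may differ if the prompts are re-executed.

```python
# (original utterances, canonical instructions) per Order subact,
# taken from the outputs shown in this section
reduction = {
    "permanecer_línea": (23, 3),
    "otro": (9, 2),
    "seguir_instrucciones": (4, 3),
    "acción_seguridad": (2, 1),
}

total_before = sum(before for before, _ in reduction.values())
total_after = sum(after for _, after in reduction.values())

for subact, (before, after) in reduction.items():
    print(f"{subact:<22} {before:>3} -> {after}")
print(f"{'total':<22} {total_before:>3} -> {total_after}")
```

The total of original utterances matches the 38 operator interactions tagged as Orden.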