<a href="https://colab.research.google.com/github/jairorodriguezarias/explicaciones/blob/main/explain_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Carga de datos y transformaciones

First, we import the pandas library, which is commonly used for data manipulation.

In [47]:
import pandas as pd

We load the 'banking-conversation-corpus' dataset from Hugging Face, specifically the training split, which contains conversational data related to banking.

In [48]:
from datasets import load_dataset

# https://huggingface.co/datasets/talkmap/banking-conversation-corpus

banking_conversation_corpus = load_dataset("talkmap/banking-conversation-corpus", split="train")
banking_conversation_corpus

Dataset({
    features: ['conversation_id', 'speaker', 'date_time', 'text'],
    num_rows: 5532112
})

This step filters out conversations where the 'text' field is None or its length is 10 characters or less. This helps to clean the dataset by removing irrelevant or empty entries.

In [49]:
banking_conversation_corpus = banking_conversation_corpus.filter(lambda x: x['text'] is not None and len(x['text']) > 10)
banking_conversation_corpus

Dataset({
    features: ['conversation_id', 'speaker', 'date_time', 'text'],
    num_rows: 5421377
})

To work with a smaller, manageable portion of the dataset, we create a sample containing the first 100,000 records. This is useful for faster experimentation and development.

In [50]:
banking_conversation_corpus_sample = banking_conversation_corpus[:100000]

This block processes the sampled conversations using pandas to group them by `conversation_id`, sorts them chronologically, and concatenates individual speaker turns into a single `full_conversation` string. This prepares the data for embedding by creating a unified text representation for each conversation.

In [51]:
import pandas as pd

# Convert the dictionary sample to a pandas DataFrame directly
df = pd.DataFrame(banking_conversation_corpus_sample)

import pandas as pd

# 2. Ordenar cronológicamente
# Es indispensable para que el diálogo sea coherente.
df = df.sort_values(by=['conversation_id', 'date_time'])

# 3. Formateo Vectorizado (Mucho más rápido que .apply)
# Creamos la línea "SPEAKER: mensaje" usando operaciones nativas de strings
df['temp_line'] = df['speaker'].str.upper() + ": " + df['text'].fillna('')

# 4. Agrupación y Concatenación
# Usamos el método 'join' directamente sobre el objeto agrupado
df_conversations = (
    df.groupby('conversation_id', sort=False)['temp_line']
    .apply(lambda x: "\n".join(x))
    .reset_index(name='full_conversation')
)

# 5. Limpieza de memoria
# Eliminamos la columna temporal del dataframe original si es necesario
df.drop(columns=['temp_line'], inplace=True)

# Visualización del resultado
print(df_conversations.head())

# Guardar resultado (Parquet es mejor para archivos grandes)
# df_conversations.to_parquet('conversations_merged.parquet', index=False)

                    conversation_id  \
0  0007b43c697f40a38ba2395d6fee20dd   
1  001ce2f3448143d3b8df2e5185b330de   
2  001f3ecbff8e4b169cbb99069155a6d3   
3  0021f9e5b6e044f69e070c4a70de4693   
4  00232e8c81e1409ca89d54ed802acf3a   

                                   full_conversation  
0  AGENT: Good morning, thank you for calling Uni...  
1  AGENT: Hello, thank you for calling Union Fina...  
2  AGENT: Hello you for calling Union Financial. ...  
3  AGENT: Good morning, thank you for calling Uni...  
4  AGENT: Good morning, thank you for holding. My...  


After consolidating conversations into a pandas DataFrame, we convert it back into a Hugging Face Dataset object. This allows us to continue using the efficient dataset operations provided by the `datasets` library.

In [52]:
from datasets import Dataset

banking_conversation_corpus = Dataset.from_pandas(df_conversations)
banking_conversation_corpus

Dataset({
    features: ['conversation_id', 'full_conversation'],
    num_rows: 5659
})

## Creación de los embeddings

We load a pre-trained tokenizer and model from the `sentence-transformers/all-MiniLM-L6-v2` checkpoint. This model is designed to produce high-quality sentence embeddings, and the tokenizer prepares text inputs for this model.

In [53]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


To accelerate computations, especially for large models and datasets, we move the model to the GPU (CUDA device) if available. This leverages the parallel processing capabilities of GPUs.

In [54]:
import torch

device = torch.device("cuda")
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 384, padding_idx=0)
    (position_embeddings): Embedding(512, 384)
    (token_type_embeddings): Embedding(2, 384)
    (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-5): 6 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=384, out_features=384, bias=True)
            (key): Linear(in_features=384, out_features=384, bias=True)
            (value): Linear(in_features=384, out_features=384, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
    

**This** function defines a common pooling strategy for transformer models. It extracts the embedding of the `[CLS]` token (the first token in the sequence), which often represents the overall sentence meaning, from the model's last hidden state.

In [55]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

The `get_embeddings` function takes a list of text inputs, tokenizes them, moves them to the specified device, passes them through the model, and then applies CLS pooling to obtain a single embedding vector for each text.

In [56]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

We apply the `get_embeddings` function to the entire `banking_conversation_corpus` dataset using the `map` function. This efficiently generates an embedding for each conversation and adds it as a new 'embeddings' column to the dataset.

In [57]:
embedding = get_embeddings(banking_conversation_corpus["full_conversation"][0])
embedding.shape

torch.Size([1, 384])

In [58]:
embeddings_dataset = banking_conversation_corpus.map(
    lambda x: {"embeddings": get_embeddings(x["full_conversation"]).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/5659 [00:00<?, ? examples/s]

## Uso de FAISS para busquedas

This cell installs `faiss-cpu` and then adds a FAISS index to the `embeddings_dataset` based on the 'embeddings' column. FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors, which significantly speeds up nearest-neighbor queries.

In [59]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/6 [00:00<?, ?it/s]

Dataset({
    features: ['conversation_id', 'full_conversation', 'embeddings'],
    num_rows: 5659
})

Here, we define a sample `question` and generate its embedding using the `get_embeddings` function. This question embedding will be used to query our dataset for similar conversations.

In [60]:
question = "calling Union Financial"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 384)

This cell performs a similarity search on the `embeddings_dataset` using the `question_embedding`. It retrieves the 5 nearest examples (conversations) based on their embedding similarity, along with their corresponding similarity scores.

In [61]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

The retrieved samples and their scores are converted into a pandas DataFrame. The DataFrame is then sorted by `scores` in descending order to easily view the most similar conversations first.

In [62]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Finally, this loop iterates through the sorted DataFrame of similar conversations, printing each conversation's text and its similarity score. This allows us to inspect the results of our semantic search.

In [63]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.full_conversation}")
    print(f"SCORE: {row.scores}")
    print("=" * 50)
    print()

COMMENT: AGENT: Hello, thank name is Geraldine with Union Financial. How may I assist you today?
AGENT: Alright, I cand be happy to help you with that. Can you please provide me with your name and account number so I can look your information?
CLIENT: My, my name is Ladarius and my account number is 1234567890.
AGENT: Great, thank you for providing that information. Now, can you tell me more about the suspicious phone call? When did you receive it and what was the caller's request?
CLIENT: Well, I didn't actually receive the call myself. My wife answered it and said it was someone claiming to be from Union Financial, asking for my personal information. I just wanted to make sure it was legitimate before giving out any information.
AGENT: That's definitely understandable. Unfortunately, I'm unable to verify the authenticity of the call as it don't have access to real-time information on incoming calls. However, I can suggest some steps you can take to protect yourself from potential fra