<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/workshop_resources/ws4-embeddings/MultiLingual-updated.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Multilingual Embeddings

This notebook provides basic functionality for studying embeddings across languages. Based on a corpus of medical ads, we aim to explore cross-lingual connections using two types of visualisation:
- a scatter plot of the embeddings after dimensionality reduction (UMAP)
- a heatmap that compares all embeddings and highlights similar items

**Important**

The first part of this notebook shows how to retrieve and prepare data for analysis ("Data Preparation"). However, you can skip this part and go directly to "Cross-lingual search with the Impresso API" and the following sections, where you can download the processed data.

## Install the Impresso library

In [None]:
!pip install -qqq git+https://github.com/impresso/impresso-py.git@embeddings-search

In [None]:
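# restart the kernel so that the freshly installed package is picked up
# (this kills the current process, so the cell will appear to crash)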
import os
os.kill(os.getpid(), 9)

## Import libraries

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [None]:
# helper functions for embedding text and retrieving vectors
import time
import base64
import struct

def embed_text(text: str, target: str):
  """
  Convert text to embedding, return None in case of an error
  """
  #time.sleep(.1)
  try:
    return impresso_session.tools.embed_text(text, target)
  except Exception as e:
    return None


def convert_embedding(embedding: str):
  """
  Convert a '<prefix>:<base64>' embedding string to a list of floats
  """
  if not embedding:
    return None

  # drop the prefix before the colon and decode the base64 payload
  _, payload = embedding.split(':')
  raw = base64.b64decode(payload)
  # every 4 bytes encode one float
  return [struct.unpack('f', raw[i:i+4])[0] for i in range(0, len(raw), 4)]

# get the article embeddings from the API

def get_embedding_by_uid(uid):
  #time.sleep(.1)
  try:
    return impresso_session.content_items.get_embeddings(uid)[0]
  except Exception as e:
    return None


def get_embedding_from_api(row, text_col, target='text'):
  """First check whether an embedding already exists for the item;
  otherwise create one from the text in `text_col`.
  """
  embedding = get_embedding_by_uid(row['uid'])
  if not embedding:
    embedding = embed_text(row[text_col], target)
  return convert_embedding(embedding)
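
To make the decoding step in `convert_embedding` easier to follow, here is a minimal offline sketch of the round trip. It assumes the payload is a sequence of little-endian 32-bit floats behind a `<prefix>:` label, which is what the helper above implies. The `encode_embedding` helper and the `demo-model` prefix are hypothetical and exist only to produce a test string.

```python
import base64
import struct

import numpy as np


def encode_embedding(vector, prefix="demo-model"):
    """Hypothetical helper: pack floats as float32 and base64-encode them."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return f"{prefix}:{base64.b64encode(raw).decode('ascii')}"


def decode_embedding(encoded):
    """Decode a '<prefix>:<base64>' string into a float32 NumPy array."""
    _, payload = encoded.split(":", 1)
    return np.frombuffer(base64.b64decode(payload), dtype="<f4")


vector = [0.5, -1.25, 3.0]
encoded = encode_embedding(vector)
decoded = decode_embedding(encoded)
print(encoded)
print(decoded)
```

`np.frombuffer` reads the whole byte string in one call, which is usually faster than unpacking four bytes at a time when vectors are long.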


## Connect to the Impresso client

In [None]:
from impresso import connect

impresso_session = connect('https://dev.impresso-project.ch/public-api/v1')

# Data Preparation

Below we provide a link to the processed data, so feel free to skip this part.

## Download and unzip data

In [None]:
!gdown 1qUyd9iKdl7eX3Kg0H8lbhbtXPudA3gGD

In [None]:
# unzip the data for the general query
!unzip -o impresso_WS4data.zip -d data

In [None]:
CSV_PATH = '/content/data/impresso_WS4data/webapp_malariaPaludismeOR.csv'
df = pd.read_csv(CSV_PATH, sep=';',skiprows=4)
df.head(3)

In [None]:
df.shape

Let's inspect the distribution of the language codes.

In [None]:
df.languageCode.value_counts()

We now draw a sample that is evenly divided over the two languages in our dataset (500 items each).

In [None]:
df_sample = pd.concat([df[df.languageCode=='de'].sample(500,random_state=42),
                       df[df.languageCode=='fr'].sample(500,random_state=42)],
                      ignore_index=True)

In [None]:
df_sample.shape
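
The same kind of balanced sample can also be drawn without listing each language by hand. The sketch below uses a small synthetic DataFrame (the `toy` data is made up for illustration) and `groupby(...).sample(...)`, taking as many rows per language as the smallest group contains.

```python
import pandas as pd

# synthetic stand-in for the real data: 7 German and 3 French items
toy = pd.DataFrame({
    "uid": [f"item-{i}" for i in range(10)],
    "languageCode": ["de"] * 7 + ["fr"] * 3,
})

# size of the smallest language group
n_per_lang = int(toy.languageCode.value_counts().min())

# draw the same number of rows from every language
balanced = (
    toy.groupby("languageCode")
       .sample(n=n_per_lang, random_state=42)
       .reset_index(drop=True)
)
print(balanced.languageCode.value_counts())
```

This variant also works unchanged when more than two languages are present.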

# Cross-lingual search with the Impresso API

The example below demonstrates how to use the Impresso API to search for items across languages. We use an embedded German text to query the vector space of French items.

In [None]:
# define query and target language
query_lang = 'de'
target_lang = 'fr'

In [None]:
# sample one article in the query language and take its transcript
row = df[df.languageCode==query_lang].sample(1)
text = str(row.transcript.values[0])
text

In [None]:
# retrieve embedding for text
embedding = embed_text(text, 'text')
print(embedding)

In [None]:
# use embedding to query in French
results = impresso_session.search.find(
  language = target_lang,
  embedding=embedding,
  limit=4
)

In [None]:
results

# Embed text with the Impresso API

The code below shows how to retrieve embeddings. It handles both texts that were already embedded (and can simply be retrieved from the database) and documents without a pre-existing embedding, for which we need to use the `embed_text` functionality.

In [None]:
df_sample.columns

We retrieve or create embeddings using the Impresso API.

In [None]:
tqdm.pandas()
df_sample['transcript_embedding']  = df_sample.progress_apply(get_embedding_from_api, text_col='transcript', axis=1)

In [None]:
impresso_session.content_items.get_embeddings('BNN-1887-01-16-a-i0004')

In [None]:
# save output
df_sample.to_json('df_sample-embedded.json')
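
Because the helper functions return `None` when an API call fails, it can be worth checking how many rows ended up without an embedding and whether all vectors have the same length. The sketch below runs this check on a small made-up DataFrame; the `toy` data and its values are illustrative only.

```python
import pandas as pd

# synthetic stand-in: the second row simulates a failed API call
toy = pd.DataFrame({
    "uid": ["a", "b", "c"],
    "transcript_embedding": [[0.1, 0.2, 0.3], None, [0.4, 0.5, 0.6]],
})

# rows for which no embedding could be retrieved or created
missing = toy["transcript_embedding"].isnull()

# distinct vector lengths among the rows that do have an embedding
dims = toy.loc[~missing, "transcript_embedding"].apply(len).unique()

print(f"{missing.sum()} of {len(toy)} rows have no embedding")
print(f"vector lengths found: {sorted(dims)}")
```

More than one vector length would suggest that embeddings from different models were mixed, which would make similarity scores hard to interpret.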

# Visualize Embeddings

## UMAP


After creating embeddings we can explore the vector space visually. In this notebook we first project the embeddings onto a 2D plane using UMAP and then use Plotly to inspect the resulting space as an interactive scatter plot.

In [None]:
!pip -qqq install seaborn plotly umap-learn

In [None]:
!gdown 1w_PnhWl55qmIHdMo1NKDx1pXOBkmIv2W

Below we run the code for dimensionality reduction. You might get the following error:

```AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'```

Please ignore this error; it won't break anything.

In [None]:
# --- DIMENSIONALITY REDUCTION ---
from umap import UMAP
print("Reducing to 2D with UMAP...")
reducer = UMAP(
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine",
    random_state=42
)

EMBEDDING = 'transcript_embedding' # 'title_embedding' | 'article_embedding'

df_sample = pd.read_json('df_sample-embedded.json')

df_sample = df_sample[~df_sample[EMBEDDING].isnull()]

embeddings = list(df_sample[EMBEDDING])
embeddings_2d = reducer.fit_transform(embeddings)

df_sample["x"] = embeddings_2d[:, 0]
df_sample["y"] = embeddings_2d[:, 1]

In [None]:

def clean_text(text, max_len=100):
    """Truncate text and replace newlines for nicer tooltips"""
    text = str(text).replace("\n", " ")
    return text[:max_len] + ("..." if len(text) > max_len else "")

df_sample["hover_text"] = df_sample.transcript.apply(clean_text)


In [None]:
# interactive scatter plot: marker size = transcript length, colour = language
import seaborn as sns
import plotly.express as px

fig = px.scatter(df_sample,
                 x="x",
                 size='transcriptLength',
                 y="y",
                 color="languageCode",
                 hover_data=['hover_text'],
                 width=1000, height=1000)
fig.update_layout(showlegend=False)
fig.show()

## Heatmap

We compare the similarity of articles across languages and sample the most similar pairs. First, we compare each German vector to each French vector.

Then we sample the most similar cross-lingual pairs and inspect their content.

In [None]:
embeddings_de = list(df_sample[df_sample.languageCode=='de']['transcript_embedding'])
embeddings_fr = list(df_sample[df_sample.languageCode=='fr']['transcript_embedding'])

We create a similarity matrix that compares every German vector to every French vector and visualise the result as a heatmap.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt



# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(embeddings_de, embeddings_fr)

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, cmap="viridis", annot=False)
plt.title("Cosine Similarity Heatmap: German vs French")
plt.xlabel("French embeddings")
plt.ylabel("German embeddings")
plt.tight_layout()
plt.show()


In [None]:
df_de = df_sample[df_sample.languageCode=='de'].reset_index(drop=True)
df_fr = df_sample[df_sample.languageCode=='fr'].reset_index(drop=True)
len(df_de), len(embeddings_de)

In [None]:
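# collect highly similar pairs: i indexes the German rows, j the French columns
# (the diagonal has no special meaning in a cross-lingual matrix, so nothing is excluded)
similar = [(int(i),int(j)) for i,j in zip(*np.where(similarity_matrix > 0.9))]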

In [None]:
similar

In [None]:
i,j = similar[0]


In [None]:
df_de.iloc[i].transcript

In [None]:
df_fr.iloc[j].transcript
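
A fixed threshold such as 0.9 can return no pairs at all or a very long list, depending on the data. One possible alternative is to rank all cross-lingual pairs and keep the top `k`. The sketch below illustrates this with random vectors standing in for the German and French embeddings; the array sizes and the value of `k` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_de = rng.normal(size=(4, 8))  # 4 synthetic "German" vectors
emb_fr = rng.normal(size=(5, 8))  # 5 synthetic "French" vectors


def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


sim = cosine_matrix(emb_de, emb_fr)

# sort all cells of the matrix by similarity, highest first, and keep k of them
k = 3
flat_order = np.argsort(sim, axis=None)[::-1][:k]
top_pairs = [
    tuple(int(v) for v in np.unravel_index(idx, sim.shape))
    for idx in flat_order
]

for i, j in top_pairs:
    print(f"de[{i}] ~ fr[{j}]: {sim[i, j]:.3f}")
```

Each pair `(i, j)` uses the same convention as the heatmap: `i` is a row (German item) and `j` a column (French item).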

# Query


In this part, we index the embedded vectors with FAISS and use the resulting vector database to search for German articles using French transcripts. This can easily be reversed.

In [None]:
!pip -qqq install faiss-cpu

In [None]:
# --- VECTOR STORE (FAISS) ---
# save index
import faiss



df_sample = pd.read_json('df_sample-embedded.json')

EMBEDDING = 'transcript_embedding'

print(df_sample.languageCode.value_counts())
query_lang = 'de'
target_lang = 'fr'
df_q = df_sample[(~df_sample[EMBEDDING].isnull()) & (df_sample.languageCode==query_lang)]
df_t = df_sample[(~df_sample[EMBEDDING].isnull()) & (df_sample.languageCode==target_lang)]

embeddings_q = list(df_q[EMBEDDING])

VECTOR_DB_PATH = f"vector_db_{query_lang}.faiss"
embeddings = np.array(list(df_q[EMBEDDING]), dtype="float32")


dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)
faiss.write_index(index, VECTOR_DB_PATH)
print(f"Vector DB saved to {VECTOR_DB_PATH}")


In [None]:

query = df_t.iloc[0].transcript
print(query)
q_emb = convert_embedding(embed_text(query,'text'))

D, I = index.search(np.array([q_emb], dtype="float32"), k=5)
print("Top 5 most similar articles:")
print(df_q.iloc[I[0]])
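
The index above ranks neighbours by (squared) Euclidean distance, while the UMAP and heatmap steps use cosine similarity. For vectors scaled to unit length the two orderings coincide, because the squared distance equals 2 - 2 * cosine. The NumPy-only sketch below checks this on random data; it does not call FAISS, and all arrays are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
database = rng.normal(size=(50, 16))  # synthetic indexed vectors
query = rng.normal(size=16)           # synthetic query vector

# scale everything to unit length
db_unit = database / np.linalg.norm(database, axis=1, keepdims=True)
q_unit = query / np.linalg.norm(query)

cosine = db_unit @ q_unit                        # higher = more similar
sq_l2 = ((db_unit - q_unit) ** 2).sum(axis=1)    # lower = more similar

by_cosine = np.argsort(-cosine)[:5]
by_l2 = np.argsort(sq_l2)[:5]
print(by_cosine, by_l2)
```

If the stored embeddings are not already unit length, normalising them before indexing (FAISS provides `faiss.normalize_L2` and an inner-product index, `IndexFlatIP`) would be one way to make the search consistent with the cosine-based visualisations.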

# Fin.