# Testing txtai embeddings

This notebook checks txtai's embedding capabilities.

We will use the book Don Quijote from Project Gutenberg.

In [2]:
from txtai import Embeddings



In [3]:
# !curl https://www.gutenberg.org/cache/epub/2000/pg2000.txt -o ../input/Don_Quijote.txt

## Embeddings models

By default, the model used is 'all-MiniLM-L6-v2' (https://neuml.github.io/txtai/models/). This model is lightweight and is allowed even for commercial use.

Here is the official txtai embeddings page: https://neuml.github.io/txtai/embeddings/

Short tutorial: [colab](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb#scrollTo=QxX9EtIc6Xzg)

Here is a table of some relevant models that can be used:


| Model | Dimensions | Size | Speed | Spanish Performance | English Performance | Multilingual Support | Memory Usage | Best Use Case |
|-------|------------|------|-------|-------------------|-------------------|-------------------|---------------|--------------|
| all-MiniLM-L6-v2 | 384 | ~80MB | Very Fast | Medium (6.5/10) | Excellent (9/10) | Limited | Low | English-focused projects with resource constraints |
| nli-mpnet-base-v2 | 768 | ~420MB | Medium | Medium (6/10) | Excellent (9.5/10) | Limited | High | High-precision English semantic tasks |
| paraphrase-multilingual-mpnet-base-v2 | 768 | ~1.1GB | Slow | Excellent (8.5/10) | Excellent (9/10) | Strong (50+ languages) | High | Production multilingual applications |
| LaBSE | 768 | ~1.5GB | Slow | Excellent (9/10) | Excellent (8.5/10) | Strong (109 languages) | Very High | Cross-lingual information retrieval |
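
The Dimensions column also affects how large the vector index becomes. The following is a rough sketch, assuming the vectors are stored as uncompressed 32-bit floats with no extra index overhead. The helper name `index_size_mb` and the count of 100,000 vectors are illustrative assumptions, not part of txtai.

```python
def index_size_mb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Approximate size in megabytes of a flat vector index.

    Assumes uncompressed float32 values (4 bytes each) and ignores
    any overhead added by the vector index backend.
    """
    return num_vectors * dimensions * bytes_per_value / 1_000_000


# Compare a 384-dimension model with a 768-dimension model
# for a hypothetical collection of 100,000 vectors.
for name, dims in (("all-MiniLM-L6-v2", 384), ("nli-mpnet-base-v2", 768)):
    print(f"{name:20s} {index_size_mb(100_000, dims):8.1f} MB")
```

Under these assumptions, doubling the dimensions doubles the storage needed for the same number of sentences.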

Declare data in Spanish and English to check how txtai embeddings work with different models.

In [4]:
data_es = [
  "EEUU supera los 5 millones de casos confirmados de virus",
  "La última plataforma de hielo intacta de Canadá se derrumba repentinamente, formando un iceberg del tamaño de Manhattan",
  "Pekín moviliza embarcaciones de invasión a lo largo de la costa mientras aumentan las tensiones con Taiwán",
  "El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso",
  "Hombre de Maine gana $1M con un billete de lotería de $25",
  "Obtenga grandes ganancias sin trabajo, gane hasta $100,000 al día"
]

data_en = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]

Create embeddings for English and Spanish text.

In [5]:
# Create an embeddings instance for English text
en_beddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

In [6]:
# Create an embeddings instance for Spanish (multilingual) text
es_beddings = Embeddings(path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

Let's check whether we can retrieve the sentences with some queries.

### English embeddings:

In [7]:
# English embeddings with english data

# Create an index for the list of text
en_beddings.index(data_en)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = en_beddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data_en[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day
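
The loop above keeps only the uid of the top hit, so a weak match looks the same as a strong one. A small helper that also returns the score can make this visible. This is a sketch: `top_matches` and `fake_search` are hypothetical names, and `fake_search` returns placeholder scores so the example runs without downloading a model. In the notebook, `en_beddings.search` could be passed in its place, since it is called the same way as in the cell above.

```python
from typing import Callable, List, Sequence, Tuple

# Any callable that takes (query, limit) and returns [(uid, score), ...]
SearchFn = Callable[[str, int], List[Tuple[int, float]]]


def top_matches(search: SearchFn, query: str, data: Sequence[str],
                limit: int = 3) -> List[Tuple[float, str]]:
    """Return (score, text) pairs for the best `limit` results of `query`."""
    return [(float(score), data[uid]) for uid, score in search(query, limit)]


def fake_search(query: str, limit: int) -> List[Tuple[int, float]]:
    """Stand-in search function; the scores are placeholders, not model output."""
    ranked = [(1, 0.9), (0, 0.2)]
    return ranked[:limit]


docs = ["first document", "second document"]
for score, text in top_matches(fake_search, "any query", docs):
    print(f"{score:.4f}  {text}")
```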


In [8]:
# English embeddings with spanish data

# Create an index for the list of text
en_beddings.index(data_es)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("historia feliz", "cambio climático", "salud pública", "guerra", "vida salvaje", "asia", "suerte", "estafa"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = en_beddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data_es[uid]))

Query                Best Match
--------------------------------------------------
historia feliz       El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
cambio climático     El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
salud pública        El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
guerra               El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
vida salvaje         El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
asia                 Pekín moviliza embarcaciones de invasión a lo largo de la costa mientras aumentan las tensiones con Taiwán
suerte               La última plataforma de hielo intacta de Canadá se derrumba repentinamente, formando un iceberg del tamaño de Manhattan
estafa               El Servicio de Parques Naci

### Spanish embeddings:

I tried several models, but all of them performed worse than "paraphrase-multilingual-mpnet-base-v2".

Additionally, this model appears to be quite sensitive to capitalization and doesn't seem to handle synonyms well.

In [20]:
# Spanish embeddings with spanish data
# (paraphrase-multilingual-mpnet-base-v2)

# Create an index for the list of text
es_beddings.index(data_es)

print("%-20s %s" % ("Query", "Mejor coincidencia"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("historia feliz", "cambio climático", "salud pública", "enfermedades", "guerra", "vida salvaje", "Asia", "asia", "suerte", "estafa", "timo"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = es_beddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data_es[uid]))

Query                Mejor coincidencia
--------------------------------------------------
historia feliz       Obtenga grandes ganancias sin trabajo, gane hasta $100,000 al día
cambio climático     La última plataforma de hielo intacta de Canadá se derrumba repentinamente, formando un iceberg del tamaño de Manhattan
salud pública        El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
enfermedades         EEUU supera los 5 millones de casos confirmados de virus
guerra               Pekín moviliza embarcaciones de invasión a lo largo de la costa mientras aumentan las tensiones con Taiwán
vida salvaje         El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
Asia                 Pekín moviliza embarcaciones de invasión a lo largo de la costa mientras aumentan las tensiones con Taiwán
asia                 La última plataforma de hielo intacta de Canadá se derrumba repentinamente, formando u

**Context**:
We can see that the "salud pública" query doesn't match the virus story.

**Lower/upper case**:

The model doesn't return the same answer for 'Asia' and 'asia'.
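
A quick way to find which queries are affected is to compare the top hit for each query with the top hit for its lowercased form. This is a minimal sketch: `case_sensitive_queries` and `stub_search` are hypothetical names, and the stub only replays the top hits printed in the output above so that it runs without a model. In the notebook, `es_beddings.search` could be passed instead.

```python
def case_sensitive_queries(search, queries):
    """Return the queries whose top hit changes when the query is lowercased."""
    changed = []
    for query in queries:
        original = search(query, 1)
        lowered = search(query.lower(), 1)
        top_original = original[0][0] if original else None
        top_lowered = lowered[0][0] if lowered else None
        if top_original != top_lowered:
            changed.append(query)
    return changed


# Top hits copied from the printed output:
# "Asia" -> sentence 2, "asia" -> sentence 1, "guerra" -> sentence 2.
observed = {"Asia": 2, "asia": 1, "guerra": 2}


def stub_search(query, limit):
    """Stand-in for a real search call; the score 1.0 is a placeholder."""
    return [(observed[query], 1.0)][:limit]


print(case_sensitive_queries(stub_search, ["Asia", "guerra"]))  # ['Asia']
```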

**Synonyms**:

'estafa' and 'timo' are synonyms, but they don't match in this example.

## Hybrid search

In [31]:
# Create an embeddings instance with hybrid (semantic + keyword) search
en_beddings = Embeddings(hybrid=True, path="sentence-transformers/nli-mpnet-base-v2")

# Create an index for the list of text
en_beddings.index(data_en)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = en_beddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data_en[uid]))

Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day


These are the same results as with semantic search. Let's run the same example with only a keyword index to compare.

In [33]:
# Create an embeddings instance with a keyword index only
en_beddings = Embeddings(keyword=True)

# Create an index for the list of text
en_beddings.index(data_en)

print(en_beddings.search("feel good story"))
print(en_beddings.search("lottery"))

[]
[(4, np.float64(0.5234998733628726))]


Note that when the embeddings instance uses only a keyword index, it can't find semantic matches, only keyword matches.
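
The keyword-only behaviour can be reproduced with a toy scorer that counts shared tokens. This is only an illustration of why a query with no overlapping words returns nothing. It is not txtai's keyword scoring, and `tokenize` and `keyword_search` are hypothetical helpers.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(query, data):
    """Return (uid, shared_token_count) for documents sharing a token with the query."""
    query_tokens = tokenize(query)
    hits = []
    for uid, text in enumerate(data):
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            hits.append((uid, overlap))
    # Highest overlap first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


data_en = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

print(keyword_search("feel good story", data_en))  # []
print(keyword_search("lottery", data_en))          # [(4, 1)]
```

No sentence contains "feel", "good" or "story", so the first query has nothing to match, while "lottery" appears literally in sentence 4.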

---

Now in Spanish:

In [34]:
# Create an embeddings instance with hybrid (semantic + keyword) search
es_beddings = Embeddings(hybrid=True, path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Create an index for the list of text
es_beddings.index(data_es)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("historia feliz", "cambio climático", "salud pública", "enfermedades", "guerra", "vida salvaje", "Asia", "asia", "suerte", "estafa", "timo"):
  # Extract uid of first result
  # search result format: (uid, score)
  uid = es_beddings.search(query, 1)[0][0]

  # Print text
  print("%-20s %s" % (query, data_es[uid]))

Query                Best Match
--------------------------------------------------
historia feliz       Obtenga grandes ganancias sin trabajo, gane hasta $100,000 al día
cambio climático     La última plataforma de hielo intacta de Canadá se derrumba repentinamente, formando un iceberg del tamaño de Manhattan
salud pública        El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
enfermedades         EEUU supera los 5 millones de casos confirmados de virus
guerra               Pekín moviliza embarcaciones de invasión a lo largo de la costa mientras aumentan las tensiones con Taiwán
vida salvaje         El Servicio de Parques Nacionales advierte contra sacrificar amigos más lentos en un ataque de oso
Asia                 Pekín moviliza embarcaciones de invasión a lo largo de la costa mientras aumentan las tensiones con Taiwán
asia                 La última plataforma de hielo intacta de Canadá se derrumba repentinamente, formando un iceber

In [35]:
# Create an embeddings instance with a keyword index only
es_beddings = Embeddings(keyword=True)

# Create an index for the list of text
es_beddings.index(data_es)

print(es_beddings.search("historia feliz"))
print(es_beddings.search("lotería"))

[]
[(4, np.float64(0.4956687859191595))]
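
One way to picture why the hybrid index still returns semantic matches when the keyword index finds nothing is to treat the hybrid score as a weighted blend of a semantic score and a keyword score. The sketch below assumes both scores are already scaled to the range 0 to 1 and uses an equal weighting. It is a conceptual illustration, not a description of how txtai combines scores internally, and `hybrid_scores` and the example values are made up for the demonstration.

```python
def hybrid_scores(dense, sparse, weight=0.5):
    """Blend two {uid: score} maps into a ranked list of (uid, score).

    `weight` is the share given to the dense (semantic) score; the
    remainder goes to the sparse (keyword) score. A uid missing from
    one map contributes 0.0 for that side.
    """
    uids = set(dense) | set(sparse)
    combined = {
        uid: weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0)
        for uid in uids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: document 5 is the closer semantic match,
# but document 4 also contains the query keyword.
dense = {4: 0.2, 5: 0.6}
sparse = {4: 1.0}

for uid, score in hybrid_scores(dense, sparse):
    print(uid, round(score, 3))
```

With an empty keyword result, the ranking follows the semantic scores alone, which is consistent with the hybrid and semantic searches returning the same matches in the examples above.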
