[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/multilingual/cohere-multilingual/cohere-multilingual-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/multilingual/cohere-multilingual/cohere-multilingual-search.ipynb)

In [1]:
!pip install -qU datasets cohere 'pinecone-client[grpc]'


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# Multilingual Search with Cohere

Cohere released what might be the most advanced multilingual embedding model back in December 2022.

Cohere's multilingual model supports **100+** languages and at the time of release provided *230%* better performance than the previous state-of-the-art in multilingual search.

A key advance of this model, beyond raw performance, is its ability to create *meaningful* embeddings for longer text. Previous multilingual models would not produce quality embeddings for anything longer than a sentence, whereas Cohere's `multilingual-22-12` model can do so for whole paragraphs.

## Dataset

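We'll start by setting up our dataset for multilingual search. The dataset being used is the [Wikipedia multilingual dataset](https://huggingface.co/datasets/wikipedia), which we load below under the Hugging Face dataset ID `Cohere/wikipedia-22-12`.

Because we pass `streaming=True`, records are streamed on demand rather than downloaded in full. To set up the English and Italian streams we do: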

In [2]:
from datasets import load_dataset

en = load_dataset("Cohere/wikipedia-22-12", "en", streaming=True)
it = load_dataset("Cohere/wikipedia-22-12", "it", streaming=True)

In [3]:
next(iter(en['train']))

{'id': 0,
 'title': 'Deaths in 2022',
 'text': 'The following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence:',
 'url': 'https://en.wikipedia.org/wiki?curid=69407798',
 'wiki_id': 69407798,
 'views': 5674.4492597435465,
 'paragraph_id': 0,
 'langs': 38}

In [6]:
next(iter(it['train']))

{'id': 0,
 'title': 'Italia',
 'text': "LItalia (, ), ufficialmente Repubblica Italiana, è uno Stato membro dell'Unione europea, situato nell'Europa meridionale, il cui territorio coincide in gran parte con l'omonima regione geografica. L'Italia è una repubblica parlamentare unitaria e conta una popolazione di circa 59 milioni di abitanti, che ne fanno il terzo Stato dell'Unione europea per numero di abitanti. La capitale è Roma.",
 'url': 'https://it.wikipedia.org/wiki?curid=2340360',
 'wiki_id': 2340360,
 'views': 3425.779427882056,
 'paragraph_id': 0,
 'langs': 307}

We have 6.46M English records, and 1.74M Italian records.

Feel free to use the full dataset if you like, but embedding and storing that many records will naturally cost money.

For the sake of time and your pocket, in this demo we'll stick with a smaller set of ~10K records from each language. You can change this number through `lang_limit` when we get to the **Indexing** step.

## Encoding with Cohere

To embed our text using Cohere we need to first initialize our connection to Cohere. For this we need an [API key](https://dashboard.cohere.ai/api-keys), then we do:

In [7]:
import cohere

co = cohere.Client("COHERE_API_KEY")
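Rather than pasting the key into the notebook, it can be read from an environment variable. The snippet below is a minimal sketch of that idea; the helper name `get_api_key` and the variable name `COHERE_API_KEY` are assumptions made for illustration, not part of the Cohere client.

```python
import os


def get_api_key(name: str) -> str:
    """Return the API key stored in the environment variable `name`.

    Hypothetical helper: raises early with a clear message instead of
    sending an empty key to the API.
    """
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# usage (assumes the variable was exported before starting the notebook):
# co = cohere.Client(get_api_key("COHERE_API_KEY"))
```

The same helper could be reused for the Pinecone key later in the notebook.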

Given some text we embed it using the `multilingual-22-12` model like so:

In [8]:
texts = ["hi, how are you!", "ciao come va?"]

# create embeddings
res = co.embed(texts=texts, model='multilingual-22-12')
# pull embeddings from response
embeds = res.embeddings
print(f"{len(embeds[0])}, {len(embeds)}")

768, 2


This shows that we have `2` vectors of `768` dimensions each, one for each *text* that we just encoded.

That's it, we've created our embeddings — it's incredibly easy to do.
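To give a feel for how such vectors are compared, here is a small sketch that ranks two documents against a query by dot product, the metric this notebook uses for its index. The three-dimensional vectors are made-up toy values standing in for real `768`-dimensional embeddings, and the names `dot`, `query`, and `docs` are chosen only for this example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# toy vectors (assumption: not real model output)
query = [0.2, 0.9, 0.1]
docs = {
    "doc_a": [0.1, 0.8, 0.0],
    "doc_b": [0.9, 0.1, 0.3],
}

# score every document against the query and pick the highest score
scores = {name: dot(query, vec) for name, vec in docs.items()}
best = max(scores, key=scores.get)
print(best)  # doc_a
```

A vector index performs the same kind of comparison, only at a much larger scale and with approximate search.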

Before we embed and index everything we'll need to initialize a vector index using Pinecone to store our embeddings within.

## Creating Vector Index

To create a vector index we first need to initialize our connection to Pinecone. For this we need a [free API key](https://app.pinecone.io/), which we pass below:

In [10]:
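from pinecone import Pinecone

# create the client; the instance is named `pinecone` because the cells
# below call methods on that name. With this client the pod environment
# is given when a pod-based index is created, not here.
pinecone = Pinecone(
    api_key="PINECONE_API_KEY"  # app.pinecone.io
)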

Now we can initialize the vector index. There are a few parameters we need to do this:

* `dimension`: vector dimensionality, this must align to the embedding model dimensionality — for us this is `768`.

* `metric`: the similarity metric used to compare vectors. Different embedding models produce vectors that should be compared with different metrics; in this case we need `'dotproduct'`.

* `pod_type`: choose `p1` for speed, `s1` for storage, or `p2` for *even more speed*.

* `pods`: number of pods needed — for `p1` we can fit ~1M vectors on `1` pod, for `s1` we can fit ~5M vectors on `1` pod.
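The capacity figures in the list above can be turned into a rough pod estimate. This is a back-of-the-envelope sketch, assuming the approximate ~1M (`p1`) and ~5M (`s1`) vectors-per-pod numbers quoted here; the helper `pods_needed` and the constant `APPROX_VECTORS_PER_POD` are hypothetical names, not Pinecone API.

```python
import math

# approximate per-pod capacities quoted in the list above
APPROX_VECTORS_PER_POD = {"p1": 1_000_000, "s1": 5_000_000}


def pods_needed(n_vectors: int, pod_type: str) -> int:
    """Estimate how many pods of `pod_type` hold `n_vectors` vectors."""
    capacity = APPROX_VECTORS_PER_POD[pod_type]
    return max(1, math.ceil(n_vectors / capacity))


demo_size = 20_000                   # roughly what this demo indexes
full_size = 6_460_000 + 1_740_000    # full English + Italian sets

print(pods_needed(demo_size, "s1"))  # 1
print(pods_needed(full_size, "s1"))  # 2
print(pods_needed(full_size, "p1"))  # 9
```

For the demo a single pod is plenty; the estimate mostly matters if you decide to index the full dataset.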

In [11]:
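from pinecone import PodSpec

index_name = 'cohere-multi-wiki'

# check if index already exists (it won't if this is the first time running)
if index_name not in pinecone.list_indexes().names():
    # now create the new pod-based index
    pinecone.create_index(
        index_name,
        dimension=len(embeds[0]),  # 768
        metric='dotproduct',
        spec=PodSpec(
            environment="YOUR_ENV",  # next to api key in console
            pod_type='s1.x1',  # s1 pod at the smallest (x1) size
            pods=1
        )
    )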

# connect to index
index = pinecone.Index(index_name)
# then check index status
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

With our embedding model and vector index set up, we can move on to indexing *everything*.

## Indexing Everything

In [12]:
from tqdm.auto import tqdm

batch_size = 300
lang_limit = 10_000  # number of records to index from each language

data = {'en': iter(en['train']), 'it': iter(it['train'])}

for i in tqdm(range(0, lang_limit, batch_size)):
    # every iteration pulls a full batch of batch_size records per language,
    # so the last batch is not truncated and we index slightly more than
    # lang_limit records from each language
    for lang in ['en', 'it']:
        # get the relevant batch
        batch = [next(data[lang]) for _ in range(batch_size)]
        # extract text
        texts = [x['text'] for x in batch]
        # create embeddings
        embeds = co.embed(texts=texts, model='multilingual-22-12').embeddings
        # create ids
        ids = [f"{lang}-{x['id']}" for x in batch]
        # we might also want to create metadata containing:
        # the original text, title, url, and language
        metadata = [{
            'text': x['text'], 'title': x['title'], 'url': x['url'], 'lang': lang
        } for x in batch]
        # now we can index the batch
        index.upsert(vectors=list(zip(ids, embeds, metadata)))

  0%|          | 0/34 [00:00<?, ?it/s]
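The loop above always requests a full batch, so with `lang_limit = 10_000` and `batch_size = 300` it indexes a little more than the limit for each language. If an exact cap matters, one option is to shrink the final batch. The generator below is a sketch of that approach using `itertools.islice`; `capped_batches` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from itertools import islice


def capped_batches(records, batch_size, limit):
    """Yield lists of up to `batch_size` records, stopping after `limit`."""
    it = iter(records)
    remaining = limit
    while remaining > 0:
        # the last batch is shortened so the total never exceeds `limit`
        batch = list(islice(it, min(batch_size, remaining)))
        if not batch:
            break  # the source ran out before reaching the limit
        remaining -= len(batch)
        yield batch


# stand-in iterable; in the notebook this would be en['train'] or it['train']
sizes = [len(b) for b in capped_batches(range(1_000_000), 300, 10_000)]
print(len(sizes), sizes[-1], sum(sizes))  # 34 100 10000
```

Because it consumes a plain iterator, the same helper works with the streaming datasets, which cannot be sliced by index.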

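We check the total number of vectors added to the index. With `lang_limit = 10_000` and `batch_size = 300` the loop runs 34 times and indexes a full batch each time, so we expect 34 × 300 = 10,200 vectors per language, or 20,400 in total: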

In [13]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 20400}},
 'total_vector_count': 20400}

Now we move on to querying.

## Making Queries

We first define a `search` function to handle embedding, querying, and printing results.

In [22]:
import urllib.parse

def search(query: str):
    # create query vector
    xq = co.embed(texts=[query], model='multilingual-22-12').embeddings[0]
    # search index
    res = index.query(vector=xq, top_k=3, include_metadata=True)
    # print results
    for i, record in enumerate(res['matches']):
        metadata = record['metadata']
        # print key info
        print(f"{i+1}. {metadata['title']} ({metadata['lang']})")
        print(f"  {metadata['url']}")
        print(f"  {metadata['text'][:100]}...")
        # get translate link if not english already
        if metadata['lang'] != 'en':
            translate_url = "https://translate.google.com/?sl=auto&tl=en&text="+urllib.parse.quote_plus(
                metadata['title']+"\n"+metadata['text']
            )
            print(f"  Translate: {translate_url}")
        print()
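Since every vector is stored with a `lang` metadata field, a search could also be restricted to one language through Pinecone's metadata filtering. The sketch below only builds the keyword arguments for `index.query`; `build_query_kwargs` is a hypothetical helper, and the filter assumes the `lang` values written during indexing (`'en'` and `'it'`).

```python
def build_query_kwargs(xq, top_k=3, lang=None):
    """Build keyword arguments for index.query, optionally filtered by language."""
    kwargs = {"vector": xq, "top_k": top_k, "include_metadata": True}
    if lang is not None:
        # metadata filter: keep only records whose `lang` equals the given code
        kwargs["filter"] = {"lang": {"$eq": lang}}
    return kwargs


# usage with the index and query vector from the search function above:
# res = index.query(**build_query_kwargs(xq, lang="it"))
```

Leaving `lang` as `None` reproduces the unfiltered query used in `search`.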

Let's try something not well covered by English Wikipedia pages:

In [24]:
search("who is giovanni falcone?")

1. Mostro di Firenze (it)
  https://it.wikipedia.org/wiki?curid=658864
  Uno dei testimoni principali dell'accusa contro Pacciani fu Giuseppe Bevilacqua, un funzionario dell...
  Translate: https://translate.google.com/?sl=auto&tl=en&text=Mostro+di+Firenze%0AUno+dei+testimoni+principali+dell%27accusa+contro+Pacciani+fu+Giuseppe+Bevilacqua%2C+un+funzionario+dell%27American+Battle+Monuments+Commission+che+nel+1985+dirigeva+il+cimitero+americano+di+Firenze+in+localit%C3%A0+Falciani%2C+a+poche+centinaia+di+metri+dall%27ultima+scena+del+crimine+del+Mostro+in+Via+degli+Scopeti.

2. Gianluigi Buffon (it)
  https://it.wikipedia.org/wiki?curid=103015
  Nell'estate 2009 viene ingaggiato come testimonial dalla "poker room online" PokerStars. Nell'ottobr...
  Translate: https://translate.google.com/?sl=auto&tl=en&text=Gianluigi+Buffon%0ANell%27estate+2009+viene+ingaggiato+come+testimonial+dalla+%22poker+room+online%22+PokerStars.+Nell%27ottobre+2011%2C+si+ritrova+assieme+ad+Eleonora+Abbagnato%2C+i

Once you're done, delete the index to save resources.

In [None]:
pinecone.delete_index(index_name)

---