# Document similarity

The most interesting document similarity writeup is Jeremy Merrill's [How Quartz used AI to sort through the Luanda Leaks](https://qz.com/1786896/ai-for-investigations-sorting-through-the-luanda-leaks), because with the right model *it doesn't matter what language something is written in*.

In [None]:
!pip install -q sentence-transformers sentencepiece

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)
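
Each sentence comes back as a long list of numbers, and comparing two sentences means comparing those lists. The cells further down do that with scikit-learn's `cosine_similarity`. As a rough sketch of what that score measures, here is a hand-rolled version. The three-number vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
fish = [0.9, 0.1, 0.0]
carp = [0.8, 0.2, 0.1]
house = [0.0, 0.2, 0.9]

print(cosine(fish, carp))   # vectors pointing the same way score near 1
print(cosine(fish, house))  # vectors pointing different ways score near 0
```

A score near 1 means the two vectors point in almost the same direction, which is how the model signals that two sentences mean similar things.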

In [None]:
import pandas as pd

sentences = [
    "Molly ate a fish",
    "Jen consumed a carp",
    "I would like to sell you a house",
    "Я пытаюсь купить дачу", # I'm trying to buy a summer home
    "J'aimerais vous louer un grand appartement", # I would like to rent a large apartment to you
    "This is a wonderful investment opportunity",
    "Это прекрасная возможность для инвестиций", # investment opportunity
    "C'est une merveilleuse opportunité d'investissement", # investment opportunity
    "これは素晴らしい投資機会です", # investment opportunity
    "野球はあなたが思うよりも面白いことがあります", # baseball can be more interesting than you think
    "Baseball can be more interesting than you'd think"
]

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

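If the translated sentences don't show up as close matches above, that's because `all-MiniLM-L6-v2` was trained mostly on English text. Don't worry, there are some [pretrained multilingual models](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models) built for exactly this.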

In [None]:
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
embeddings = model.encode(sentences)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)
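
The colored grid is good for eyeballing, but it gets hard to read as the list grows. One way to summarize it is to pull out each sentence's best match, skipping the diagonal, where every sentence matches itself perfectly. The helper below is a hypothetical sketch, not part of any library, and the toy matrix stands in for the real `similarities`.

```python
def closest_matches(similarities, labels):
    """For each label, return (label, best other label, score), ignoring the diagonal."""
    results = []
    for i, row in enumerate(similarities):
        # Skip k == i: every sentence is trivially most similar to itself
        j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
        results.append((labels[i], labels[j], float(row[j])))
    return results

# Toy similarity matrix (made-up numbers for illustration)
toy_labels = ["fish", "carp", "house"]
toy_similarities = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

for sentence, match, score in closest_matches(toy_similarities, toy_labels):
    print(f"{sentence!r} -> {match!r} ({score:.2f})")
```

With the variables from the cells above, the call would be `closest_matches(similarities, sentences)`.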