### Cross-lingual sentence similarity using TensorFlow

Based on https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb#scrollTo=MSeY-MUQo2Ha

This version uses LaBSE (Language-agnostic BERT Sentence Embedding) in place of the Multilingual Universal Sentence Encoder.

In [2]:
%load_ext autoreload
%autoreload 2

In [40]:
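import os

import numpy as np
import pandas as pd
# Import the submodule explicitly; a bare `import sklearn` is not guaranteed to load it.
import sklearn.metrics.pairwise
from sentence_transformers import SentenceTransformer
from tqdm import tqdm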

In [8]:
tqdm.pandas()

In [9]:
def _layer(lines, num_overlaps, comb=' '):
    if num_overlaps < 1:
        raise ValueError('num_overlaps must be >= 1')
    out = ['PAD', ] * min(num_overlaps - 1, len(lines))
    for ii in range(len(lines) - num_overlaps + 1):
        out.append(comb.join(lines[ii:ii + num_overlaps]))
    return out

def _preprocess_line(line):
    line = line.strip()
    if len(line) == 0:
        line = 'BLANK_LINE'
    return line

def yield_overlaps(lines, num_overlaps):
    lines = [_preprocess_line(line) for line in lines]
    for overlap in range(1, num_overlaps + 1):
        for out_line in _layer(lines, overlap):
            # Truncate to 10,000 characters so arbitrarily long lines are never encoded.
            out_line2 = out_line[:10000]
            yield out_line2
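
To make the layout of the overlaps concrete, here is a small self-contained sketch that repeats the `_layer` windowing logic on three toy lines (the inputs `a`, `b`, `c` are illustrative, not from the corpus). Each layer ends up with exactly one entry per input line, with `PAD` filling the positions where a full window does not exist yet.

```python
def layer(lines, num_overlaps, comb=' '):
    """Same windowing logic as `_layer` above, repeated so the sketch runs on its own."""
    out = ['PAD'] * min(num_overlaps - 1, len(lines))
    for ii in range(len(lines) - num_overlaps + 1):
        out.append(comb.join(lines[ii:ii + num_overlaps]))
    return out

toy_lines = ['a', 'b', 'c']  # illustrative input
layers = [layer(toy_lines, n) for n in (1, 2, 3)]
for n, entries in enumerate(layers, start=1):
    print(n, entries)
```

Because every layer has the same length as the input, the flat list of overlaps can later be reshaped into a `(num_overlaps, n_sents, ...)` array.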

In [10]:
class Encoder:
    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name)
        self.model_name = model_name

    def transform(self, sents, num_overlaps):
        overlaps = []
        for line in yield_overlaps(sents, num_overlaps):
            overlaps.append(line)

        sent_vecs = self.model.encode(overlaps)
        # Each overlap layer has one entry per sentence, so the flat output
        # reshapes to (num_overlaps, n_sents, embedding_dim).
        sent_vecs = sent_vecs.reshape(num_overlaps, len(sents), -1)

        # Length of each overlap in UTF-8 bytes, laid out the same way.
        len_vecs = np.array([len(line.encode("utf-8")) for line in overlaps])
        len_vecs = len_vecs.reshape(num_overlaps, len(sents))

        return sent_vecs, len_vecs
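
The array bookkeeping in `transform` relies on every overlap layer having exactly one entry per input sentence. The sketch below checks that layout with a stand-in encoder: `fake_encode` is hypothetical and only returns random unit vectors, it is not the LaBSE model, so only the shapes are illustrated.

```python
import numpy as np

def fake_encode(texts, dim=4, seed=0):
    """Hypothetical stand-in for an encoder: one random unit vector per text."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

n_sents, num_overlaps = 3, 2
# Flat list in the order yield_overlaps produces it: layer 1 first, then layer 2.
overlaps = ['a', 'b', 'c', 'PAD', 'a b', 'b c']

flat = fake_encode(overlaps)
sent_vecs = flat.reshape(num_overlaps, n_sents, -1)

# Lengths are measured in UTF-8 bytes, not characters.
len_vecs = np.array([len(t.encode('utf-8')) for t in overlaps])
len_vecs = len_vecs.reshape(num_overlaps, n_sents)

print(sent_vecs.shape)  # (2, 3, 4)
print(len_vecs)
```

With this layout, `sent_vecs[k - 1, i]` is the vector for the window of `k` lines ending at line `i`.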

In [11]:
model_name = "LaBSE"
model = Encoder(model_name)

In [26]:
src_sents = ["That is good.","That is great!"]
tgt_sents = ["Das ist gut.", "Das ist nicht gut.", "Ja, ich bin ein Berliner."]
max_align = 5

In [27]:
src_result = model.model.encode(src_sents)
tgt_result = model.model.encode(tgt_sents)

In [28]:
src_result.shape

(2, 768)

In [29]:
tgt_result.shape

(3, 768)

In [33]:
src_result[0,:].shape

(768,)

In [31]:
np.linalg.norm(src_result[0,:])

1.0

In [41]:
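def compute_sim(l1_tensor, l2_tensor):
    # arccos-based similarity; clip guards against cosines that round slightly outside [-1, 1]
    cos = sklearn.metrics.pairwise.cosine_similarity(l1_tensor, l2_tensor)
    sim = 1 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    return float(sim.item())  # expects a single pair, i.e. a 1x1 similarity matrix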

In [42]:
np.dot(src_result[0,:],tgt_result[0,:])

0.93773484

In [43]:
np.dot(src_result[1,:],tgt_result[0,:])

0.6652988
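
Since the embeddings have unit norm (as the `np.linalg.norm` check above suggests), a plain dot product is already the cosine similarity. A minimal sketch with made-up vectors, not model outputs:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity for vectors of any length."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative vectors, normalised to unit length.
u = np.array([3.0, 4.0, 0.0])
u = u / np.linalg.norm(u)
v = np.array([4.0, 3.0, 0.0])
v = v / np.linalg.norm(v)

dot = float(np.dot(u, v))
print(dot, cosine(u, v))
```

For vectors that are not normalised, the two values differ, so the shortcut only applies when the norm check holds.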

In [45]:
compute_sim([src_result[0,:]], [tgt_result[0,:]])

0.8870809078216553

In [46]:
compute_sim([src_result[1,:]], [tgt_result[0,:]])

0.7316958904266357
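
The scores above follow directly from the cosine values: `compute_sim` maps a cosine `c` to `1 - arccos(c) / pi`, so identical directions score 1, orthogonal ones 0.5, and opposite ones 0. The standard-library sketch below redoes that arithmetic for the two cosines printed earlier; `angular_similarity` is a hypothetical helper name, and the clamping step is an added safeguard rather than part of the original function.

```python
import math

def angular_similarity(cos_sim):
    """1 - arccos(cos_sim) / pi, with the input clamped to [-1, 1]."""
    clamped = max(-1.0, min(1.0, cos_sim))
    return 1.0 - math.acos(clamped) / math.pi

# Cosine values copied from the dot-product cells above.
for cos_sim in (0.93773484, 0.6652988):
    print(round(angular_similarity(cos_sim), 4))
```

The results agree with the `compute_sim` outputs to about four decimal places; small differences come from float32 versus float64 arithmetic.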

In [14]:
print("Embedding source and target text using {} ...".format(model.model_name))
src_vecs, src_lens = model.transform(src_sents, max_align - 1)
tgt_vecs, tgt_lens = model.transform(tgt_sents, max_align - 1)

Embedding source and target text using LaBSE ...


In [47]:
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/LaBSE/2"

# Note: this rebinds `model`, replacing the Encoder instance created earlier.
model = hub.load(module_url)

def embed_text(texts):
    return model(texts)

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run Identity: Dst tensor is not initialized. [Op:Identity]

In [None]:
model("Testing 123")

In [37]:
# Some texts of different lengths in different languages.
arabic_sentences = ['كلب', 'الجراء لطيفة.', 'أستمتع بالمشي لمسافات طويلة على طول الشاطئ مع كلبي.']
chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']
english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']
french_sentences = ['chien', 'Les chiots sont gentils.', 'J\'aime faire de longues promenades sur la plage avec mon chien.']
german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']
italian_sentences = ['cane', 'I cuccioli sono carini.', 'Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.']
japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']
korean_sentences = ['개', '강아지가 좋다.', '나는 나의 개와 해변을 따라 길게 산책하는 것을 즐긴다.']
russian_sentences = ['собака', 'Милые щенки.', 'Мне нравится подолгу гулять по пляжу со своей собакой.']
spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']

# Multilingual example
multilingual_example = ["Willkommen zu einfachen, aber", "verrassend krachtige", "multilingüe", "compréhension du langage naturel", "модели.", "大家是什么意思" , "보다 중요한", ".اللغة التي يتحدثونها"]
multilingual_example_in_en =  ["Welcome to simple yet", "surprisingly powerful", "multilingual", "natural language understanding", "models.", "What people mean", "matters more than", "the language they speak."]

In [38]:
# Compute embeddings.
ar_result = embed_text(arabic_sentences)
en_result = embed_text(english_sentences)
es_result = embed_text(spanish_sentences)
de_result = embed_text(german_sentences)
fr_result = embed_text(french_sentences)
it_result = embed_text(italian_sentences)
ja_result = embed_text(japanese_sentences)
ko_result = embed_text(korean_sentences)
ru_result = embed_text(russian_sentences)
zh_result = embed_text(chinese_sentences)

multilingual_result = embed_text(multilingual_example)
multilingual_in_en_result = embed_text(multilingual_example_in_en)

NameError: name 'embed_text' is not defined

In [None]:
visualize_similarity(multilingual_in_en_result, multilingual_result,
                     multilingual_example_in_en, multilingual_example,
                     "Multilingual Universal Sentence Encoder for Semantic Retrieval (Yang et al., 2019)")
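
`visualize_similarity` is not defined anywhere in this notebook, so the call above cannot run as written. Assuming it is meant to display pairwise similarities between two sets of embeddings, the underlying matrix can be sketched in NumPy. `similarity_matrix` is a hypothetical helper name, and the toy vectors are not model outputs.

```python
import numpy as np

def similarity_matrix(emb_1, emb_2):
    """Pairwise angular similarity between the rows of emb_1 and emb_2."""
    a = emb_1 / np.linalg.norm(emb_1, axis=1, keepdims=True)
    b = emb_2 / np.linalg.norm(emb_2, axis=1, keepdims=True)
    cos = np.clip(a @ b.T, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return 1.0 - np.arccos(cos) / np.pi

# Toy embeddings: 2 "English" rows against 3 "other language" rows.
emb_en = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_xx = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

sim = similarity_matrix(emb_en, emb_xx)
print(sim.shape)
print(np.round(sim, 2))
```

Entry `sim[i, j]` compares row `i` of the first set with row `j` of the second, which is the grid a heatmap would show.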

In [23]:
#visualize_similarity(zh_result, ko_result, chinese_sentences, korean_sentences, 'Chinese-Korean Similarity')

In [11]:
# Now try it with aligned texts

In [12]:
aligned_path = "../Texts_Aligned/de_tge.en_fowkes/"

In [15]:
cur_fpath = os.path.join(aligned_path, "ch01.align.txt")

In [20]:
cur_df = pd.read_csv(cur_fpath, delimiter='\t', header=None, names=['de','en','alignment_id'])

In [43]:
cur_df.fillna("", inplace=True)

In [45]:
cur_df.iloc[0]

de                                                    
en                         Chapter 1: The Commodity 1.
alignment_id    ch01_clean.de_tge-ch01_clean.en_fowkes
Name: 0, dtype: object

In [46]:
de_result = embed_text(cur_df.iloc[2]['de'])
en_result = embed_text(cur_df.iloc[2]['en'])

In [36]:
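def compute_sim(l1_tensor, l2_tensor):
    # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)
    # clip guards against cosines that round slightly outside [-1, 1]
    cos = sklearn.metrics.pairwise.cosine_similarity(l1_tensor, l2_tensor)
    sim = 1 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
    return float(sim.item())  # expects a single pair, i.e. a 1x1 similarity matrix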

In [48]:
def compute_row_sim(row):
    return compute_sim(embed_text(row['de']), embed_text(row['en']))

In [49]:
cur_df['sim'] = cur_df.apply(compute_row_sim, axis=1)
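
Scoring row by row calls the encoder twice per row. If each column is instead embedded in a single batched call (for example with `SentenceTransformer.encode` on the full list of texts, as used earlier), the per-row score reduces to array arithmetic. The sketch below assumes that setup; `rowwise_angular_similarity` is a hypothetical helper, and the toy arrays stand in for real embeddings.

```python
import numpy as np

def rowwise_angular_similarity(emb_a, emb_b):
    """Angular similarity between emb_a[i] and emb_b[i] for every row i."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return 1.0 - np.arccos(cos) / np.pi

# Toy stand-ins for the embedded 'de' and 'en' columns (3 aligned rows).
de_emb = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
en_emb = np.array([[1.0, 0.0], [3.0, 0.0], [1.0, 0.0]])

sims = rowwise_angular_similarity(de_emb, en_emb)
print(np.round(sims, 2))
```

The resulting vector has one score per row and could be assigned to a `sim` column directly.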

In [50]:
cur_df

| | de | en | alignment_id | sim |
|---|---|---|---|---|
| 0 | | Chapter 1: The Commodity 1. | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.497484 |
| 1 | ERSTES KAPITEL Die Ware 1. Die zwei Faktoren d... | THE TWO FACTORS OF THE COMMODITY: USE-VALUE AN... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.757847 |
| 2 | Unsere Untersuchung beginnt daher mit der Anal... | Our investigation therefore begins with the an... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.859049 |
| 3 | Die Ware ist zunächst ein äußerer Gegenstand, ... | The commodity is, first of all, an external ob... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.812498 |
| 4 | Die Natur dieser Bedürfnisse, ob sie z.B. dem ... | The nature of these needs, whether they arise,... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.753188 |
| ... | ... | ... | ... | ... |
| 613 | Ein Mensch oder ein Gemeinwesen ist reich; ein... | A man or a community is rich, a pearl or a dia... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.813036 |
| 614 | Bisher hat noch kein Chemiker Tauschwert in Pe... | So far no chemist has ever discovered exchange... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.734134 |
| 615 | Die ökonomischen Entdecker dieser chemischen S... | The economists who have discovered this chemic... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.762967 |
| 616 | Was sie hierin bestätigt, ist der sonderbare U... | What confirms them in this view is the peculia... | ch01_clean.de_tge-ch01_clean.en_fowkes | 0.748863 |

In [51]:
cur_df['sim'].mean()

0.7231409977940679
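
Row 0 of the Fowkes table has an empty German side and scores about 0.497, well below the other rows shown, so rows like it lower the mean. One option, sketched here on a toy frame with made-up scores (not values from the corpus), is to average only over rows where both sides are non-empty.

```python
import pandas as pd

# Toy frame with invented scores; the first row mimics an unmatched segment.
toy_df = pd.DataFrame({
    'de': ['', 'Hund', 'Katze'],
    'en': ['Chapter 1', 'dog', 'cat'],
    'sim': [0.50, 0.90, 0.80],
})

# Keep only rows where both the German and the English side contain text.
both_sides = (toy_df['de'].str.strip() != '') & (toy_df['en'].str.strip() != '')

mean_all = toy_df['sim'].mean()
mean_aligned = toy_df.loc[both_sides, 'sim'].mean()
print(mean_all, mean_aligned)
```

Whether unmatched rows should count against a translation is a design choice; reporting both means keeps the comparison transparent.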

In [58]:
v2_fpath = "../Texts_Aligned/de_tge.en_aveling/ch01.align.txt"

In [61]:
v2_df = pd.read_csv(v2_fpath, delimiter='\t', header=None, names=['de','en','alignment_id'])

In [64]:
v2_df.fillna("", inplace=True)

In [68]:
v2_df['sim'] = v2_df.progress_apply(compute_row_sim, axis=1)

100%|██████████| 625/625 [00:08<00:00, 77.74it/s]


In [69]:
v2_df

| | de | en | alignment_id | sim |
|---|---|---|---|---|
| 0 | ERSTES KAPITEL Die Ware 1. Die zwei Faktoren d... | Chapter 1: Commodities Section 1: The Two Fact... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.799426 |
| 1 | Unsere Untersuchung beginnt daher mit der Anal... | Our investigation must therefore begin with th... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.805118 |
| 2 | Die Ware ist zunachst ein ausserer Gegenstand,... | A commodity is, in the first place, an object ... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.742811 |
| 3 | Die Natur dieser Bedurfnisse, ob sie z.B. dem ... | The nature of such wants, whether, for instanc... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.732976 |
| 4 | Es handelt sich hier auch nicht darum, wie die... | Neither are we here concerned to know how the ... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.685437 |
| ... | ... | ... | ... | ... |
| 620 | Eine Perle oder ein Diamant hat Wert als Perle... | A pearl or a diamond is valuable as a pearl or... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.789187 |
| 621 | Bisher hat noch kein Chemiker Tauschwert in Pe... | So far no chemist has ever discovered exchange... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.732880 |
| 622 | Die okonomischen Entdecker dieser chemischen S... | The economic discoverers of this chemical elem... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.699186 |
| 623 | Was sie hierin bestatigt, ist der sonderbare U... | What confirms them in this view, is the peculi... | ch01_clean.de_tge-ch01_clean.en_aveling | 0.733851 |

In [70]:
v2_df['sim'].mean()

0.7033427904129028