In [None]:
# Author: Zhengxiang (Jack) Wang 
# Date: created on 2021-07-18; modified on 2021-07-20
# GitHub: https://github.com/jaaack-wang 
# About: calculating text similarity in paddlenlp

# Overview

In [loading pre-trained word embedding in paddlenp.ipynb](https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing), we learned how to load pre-trained word embedding models with `paddlenlp.embeddings.TokenEmbedding`, as well as models predefined in `paddlenlp`. This notebook uses some simple examples to show how word embeddings can be applied to calculate the similarity between word pairs or between text (e.g., phrase, sentence) pairs. More on text similarity will be added to my GitHub project [dl-nlp-using-paddlenlp](https://github.com/jaaack-wang/dl-nlp-using-paddlenlp). 


<br>


<table align="right">
  <td>
    <a target="_blank" href="https://colab.research.google.com/drive/1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO?usp=sharing"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab </a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/jaaack-wang"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> Author's GitHub </a>
  </td>
  <td>
    <a href="https://docs.google.com/uc?export=download&id=1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download this notebook </a>
  </td>
</table> 


<br>


# Table of Contents

- [1. Cosine similarity](#1)
  - [1.1 Concept and formula](#1-1)
  - [1.2 Calculating cosine similarity in numpy](#1-2)
- [2. Calculating similarity between two words](#2)
- [3. Calculating similarity between two sentences](#3)
  - [3.1 The Bag-of-Words model](#3-1)
  - [3.2 Real-World examples: the performance of the Bag-of-Words model](#3-2)
    - [3.2.1 Chinese examples](#3-2-1)
    - [3.2.2 English examples](#3-2-2)
    - [3.2.3 Summary](#3-2-3)
- [4. References](#4)

In [None]:
# always install or update paddlepaddle and paddlenlp if you use Colab

!pip3 install --upgrade paddlepaddle
!pip3 install --upgrade paddlenlp

Collecting paddlepaddle
  Downloading paddlepaddle-2.1.1-cp37-cp37m-manylinux1_x86_64.whl (108.9 MB)
Installing collected packages: paddlepaddle
Successfully installed paddlepaddle-2.1.1
Collecting paddlenlp
  Downloading paddlenlp-2.0.6-py3-none-any.whl (485 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting colorlog
  Downloading colorlog-5.0.1-py2.py3-none-any.whl (10 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting visualdl
  Downloading visualdl-2.2.0-py3-none-any.whl (2.7 MB)
Collecting bce-python-sdk
  Downloading bce_python_sdk-0.8.61-py3-none-any.whl (197 kB)
Collecting Flask-Babel>=1.0.0
  Downloading Fla

<a name="1"></a>
# 1. Cosine similarity

<a name="1-1"></a>
### 1.1 Concept and formula

There are several ways to measure the similarity or dissimilarity between two embeddings (which in essence are vectors), such as calculating the distance between the two vectors in a given vector space, but cosine similarity remains the most popular choice. The formula for cosine similarity is given as follows:

$$CosineSim(\vec{u}, \vec{v}) = \cos(\theta) = \frac{\vec{u}\cdot \vec{v} }{\vert\vec{u}\vert * \vert \vec{v}\vert} \tag{1}$$


where $\vec{u}\cdot \vec{v}$ is the inner/dot product of vector $\vec{u}$ and vector $\vec{v}$. 

<br>

Cosine similarity can be seen as the cosine of the angle $\theta$ between the two vectors $\vec{u}$ and $\vec{v}$. The larger the cosine value, the smaller the angle. (To get $\theta$, we can use the inverse function of cosine, namely: $\theta = \arccos (\frac{\vec{u}\cdot \vec{v} }{\vert\vec{u}\vert * \vert \vec{v}\vert})$.) The basic interpretation of the results of the cosine similarity formula is illustrated in the figure below:

![](https://datascience-enthusiast.com/figures/cosine_sim.png)
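
As a quick sanity check of this interpretation, the sketch below evaluates equation (1) for three hand-picked 2-D vector pairs and recovers the angle with `np.arccos`. The vectors and the `results` dictionary are made up for illustration; they are not part of the original notebook.

```python
import numpy as np

u = np.array([1.0, 0.0])

# Hand-picked vectors (illustrative only) at known angles from u
candidates = {
  "same direction": np.array([2.0, 0.0]),
  "orthogonal": np.array([0.0, 3.0]),
  "opposite direction": np.array([-1.0, 0.0]),
}

results = {}
for name, v in candidates.items():
  # Equation (1): dot product divided by the product of the norms
  cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
  # Clip to [-1, 1] to guard against tiny floating-point overshoots
  theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
  results[name] = (cos, theta)
  print(f"{name}: cos = {cos:.2f}, theta = {theta:.1f} degrees")
```

The three pairs should give cosine values of 1, 0 and -1, i.e. angles of 0, 90 and 180 degrees. Note that the length of a vector does not matter, only its direction.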


<a name="1-2"></a>
### 1.2 Calculating cosine similarity in numpy


In numpy, the dot product of two vectors, two matrices, or one vector and one matrix can be calculated with `numpy.dot` or the special operator `@`, i.e. (suppose $\vec{u}$ and $\vec{v}$ are both (n, 1)-dimensional vectors):


```python
import numpy as np

>>> dot_prod = np.dot(u.T, v)
# Or:
>>> dot_prod = u.T @ v
```

The norm of a vector can be calculated in the following two ways:

```python
>>> norm_u, norm_v = np.sqrt(np.dot(u.T, u)), np.sqrt(np.dot(v.T, v))
# Or
>>> norm_u, norm_v = np.linalg.norm(u, ord=2), np.linalg.norm(v, ord=2)
```

<br>

However, `np.dot(u.T, v)` is less robust than the equivalent element-wise form given in equation (2) below: if the vectors are (1, n)-dimensional instead, it returns an unwanted (n, n) matrix rather than a scalar value:


$$\vec{u}\cdot \vec{v} = \sum_{i=1}^{n} u_i * v_i \tag{2}$$

We can thus use equation (2) to rewrite the `dot_prod` and (one of the) `norm_u`/`norm_v` expressions given above as follows:

```python
>>> dot_prod = np.sum(u * v)
>>> norm_u, norm_v = np.sqrt(np.sum(u * u)), np.sqrt(np.sum(v * v))
```
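
To make the shape issue concrete, here is a minimal sketch comparing `np.dot(u.T, v)` with `np.sum(u * v)` for column and row vectors. The variable names (`u_col`, `u_row`, and so on) and the values are chosen only for illustration.

```python
import numpy as np

# Column vectors of shape (3, 1) and their row-vector counterparts of shape (1, 3)
u_col = np.arange(1.0, 4.0).reshape(3, 1)   # [[1.], [2.], [3.]]
v_col = np.arange(4.0, 7.0).reshape(3, 1)   # [[4.], [5.], [6.]]
u_row, v_row = u_col.T, v_col.T

dot_col = np.dot(u_col.T, v_col)  # (1, 3) @ (3, 1): a 1x1 array, not a bare scalar
dot_row = np.dot(u_row.T, v_row)  # (3, 1) @ (1, 3): a 3x3 outer product
print(dot_col.shape)  # (1, 1)
print(dot_row.shape)  # (3, 3)

# The element-wise form of equation (2) gives the same scalar for either layout
sum_col = np.sum(u_col * v_col)
sum_row = np.sum(u_row * v_row)
print(sum_col, sum_row)  # 32.0 32.0
```

Even in the column-vector case, `np.dot(u.T, v)` returns a (1, 1) array rather than a plain number, which is another reason to prefer the element-wise form here.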

<br>

Taken together, we can define a cosine similarity function in numpy as follows:

```python
import numpy as np

def cosine_similarity(u, v):
  return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))
```
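
One way to gain some confidence in this function is to compare it with the `np.dot` / `np.linalg.norm` formulation on random vectors stored in different layouts. The sketch below does that; the random vectors, the seed and the `reference` name are assumptions made for the check, not part of the original notebook.

```python
import numpy as np

def cosine_similarity(u, v):
  return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))

# Two random 300-dimensional vectors (seed fixed so the check is repeatable)
rng = np.random.default_rng(0)
u, v = rng.normal(size=300), rng.normal(size=300)

# Reference value computed with np.dot and np.linalg.norm on the 1-D arrays
reference = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# The element-wise version should agree for 1-D, column and row layouts alike
same_for_all_shapes = all(
  np.isclose(cosine_similarity(u.reshape(shape), v.reshape(shape)), reference)
  for shape in [(300,), (300, 1), (1, 300)]
)
print(same_for_all_shapes)
```

Keep in mind that the function divides by the product of the norms, so an all-zero vector would produce a division by zero; that case is not handled here.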

<a name="2"></a>
# 2. Calculating similarity between two words

- First, we need to get the embeddings of the two words, which can be done with the `paddlenlp.embeddings.TokenEmbedding` class, as described in [loading pre-trained word embedding in paddlenp.ipynb](https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing). 
- Then, we can use the `cosine_similarity` function built above to get the cosine similarity value.

<br>

In paddlenlp, however, we can also use the built-in `cosine_sim` function of the `TokenEmbedding` class to calculate the cosine similarity between two words. <ins>Please note that this function cannot be used to measure the cosine similarity between two texts.</ins>

<br>

**Reference**: [TokenEmbedding](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.embeddings.token_embedding.html) 

**Functions of interest inside `TokenEmbedding`**:
- **`vocab.to_tokens(indices)`**: Looks up tokens in the vocabulary of the pre-trained model by index
  - indices: `list` or `int`. 
  - Returns: the corresponding tokens in the vocabulary of the model.
- **`search(words)`**: Gets the pre-trained embedding of a given word/token
  - words: `list` or `str` or `int`. 
  - Returns: the vectors of the specified words.
- **`cosine_sim(word_a, word_b)`**: Cosine similarity between two words (**!! words, NOT texts**)
  - word_a (str), word_b (str) -- The two word strings to compare.
  - Returns: The cosine similarity value of the 2 words.

In [None]:
# import paddlenlp.embeddings.TokenEmbedding
from paddlenlp.embeddings import TokenEmbedding


# model name: w2v.baidu_encyclopedia.target.word-character.char1-4.dim300	
# model size: 679.51 MB, vocab size: 636038
token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-character.char1-4.dim300")

100%|██████████| 695822/695822 [01:18<00:00, 8878.92it/s] 
[2021-07-20 18:55:38,829] [    INFO] - Loading token embedding...
[2021-07-20 18:55:45,778] [    INFO] - Finish loading embedding vector.
[2021-07-20 18:55:45,780] [    INFO] - Token Embedding info:
Unknown index: 636036
Unknown token: [UNK]
Padding index: 636037
Padding token: [PAD]
Shape: [636038, 300]


In [None]:
# # uncomment if you are interested 
# # check the vocab
# vocab = token_embedding.vocab.to_tokens(list(range(636038)))
# # print 50 consecutive words from the vocab (indices 1000 to 1049)
# print(vocab[1000:1050])

['普通', '形象', '客户', '容易', '),', '那些', '级', '现任', '1953', '出生', '仍', '因素', 'g', '中央', '计算', '90', '平台', '迅速', '400', '而是', '行政', '举办', '运行', '23', '----', '若', '自由', '设施', '*', '所谓', '更加', '感觉', '1937', '训练', '副书记', '亿元', '演员', '此外', '1938', '论', '强', '成果', '几乎', '0', '调整', '自然保护区', '1954', '最终', '科研', '1500']


In [None]:
# # uncomment if you are interested 
# # Try to get the embeddings of any given words, 
# # say cat = 猫, dog = 狗, human = 人

# cat_em, dog_em, human_em = token_embedding.search(['猫', '狗', '人'])

# # check the embedding for cat 
# print('This is the word embedding for cat in the chosen model:\n\n', cat_em)

In [None]:
# calculate the cosine similarity between two words, say, cat and dog versus cat and human
# we would expect cat to be more similar to dog than to human

cos_sim1 = token_embedding.cosine_sim('猫', '狗')
cos_sim2 = token_embedding.cosine_sim('猫', '人')
print('This is the cosine similarity score for cat and dog: ', cos_sim1)
print('This is the cosine similarity score for cat and human: ', cos_sim2)
print('Is cos_sim1 greater than cos_sim2 as expected?: ', cos_sim1 > cos_sim2)

This is the cosine similarity score for cat and dog:  0.77376914
This is the cosine similarity score for cat and human:  0.36149243
Is cos_sim1 greater than cos_sim2 as expected?:  True


In [None]:
# # uncomment if you are interested (you need to uncomment the above cells as well)
# # Let's also check whether the cosine_similarity function we built above is correct

# import numpy as np

# def cosine_similarity(u, v):
#   return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))


# cat_dog_sim = cosine_similarity(cat_em, dog_em)
# cat_human_sim = cosine_similarity(cat_em, human_em)
# # compare with a tolerance, since exact float equality is fragile
# assert np.isclose(cos_sim1, cat_dog_sim), f'A bug here! cat_dog_sim({cat_dog_sim}) != {cos_sim1}'
# assert np.isclose(cos_sim2, cat_human_sim), f'A bug here! cat_human_sim({cat_human_sim}) != {cos_sim2}'
# print('Your function works! Congratulations!')

Your function works! Congratulations!


<a name="3"></a>
# 3. Calculating similarity between two sentences

<a name="3-1"></a>
### 3.1 The Bag-of-Words model

- To calculate the cosine similarity between two sentences, we first need to compute the embeddings of the sentence pair(s) of interest. 
- The embedding of a sentence (or a larger text) can usually be seen as the sum of the embeddings of the words/tokens in it. This is also known as the [Bag-of-Words model](https://en.wikipedia.org/wiki/Bag-of-words_model). 
- Since the `cosine_sim` function provided by `paddlenlp.embeddings.TokenEmbedding` only applies to a pair of words, we need to use our own `cosine_similarity` function instead. 

<br>

---

<br>

In short, we can calculate the cosine similarity between two sentences in the following five steps:

- Tokenize a sentence into a list of words/tokens (use [`jieba`](https://github.com/fxsjy/jieba));
- Retrieve the embeddings of these words (use the `search` function of the `TokenEmbedding` class, see above); 
- Sum up these embeddings to get the sentence's embedding;
- Do the same thing to get the embedding of the other sentence;
- Calculate the cosine similarity between the two sentence embeddings (use the `cosine_similarity` function defined above). 
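
The five steps can be sketched without downloading any model. In the block below, `toy_vectors` is a hypothetical table of made-up 3-dimensional embeddings and `toy_sentence_embedding` is an illustrative helper that splits on whitespace; neither belongs to `paddlenlp` or `jieba`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up purely for illustration
toy_vectors = {
  "i": np.array([0.9, 0.1, 0.0]),
  "like": np.array([0.2, 0.8, 0.1]),
  "enjoy": np.array([0.3, 0.7, 0.2]),
  "cats": np.array([0.1, 0.2, 0.9]),
}

def toy_sentence_embedding(sentence):
  tokens = sentence.lower().split()            # step 1: tokenize
  vectors = [toy_vectors[t] for t in tokens]   # step 2: look up word embeddings
  return np.sum(vectors, axis=0)               # step 3: sum them up

def cosine_similarity(u, v):
  return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))

emb_a = toy_sentence_embedding("I like cats")
emb_b = toy_sentence_embedding("I enjoy cats")   # step 4: the other sentence
emb_c = toy_sentence_embedding("cats like I")    # same words, different order

sim_ab = cosine_similarity(emb_a, emb_b)         # step 5: compare
sim_ac = cosine_similarity(emb_a, emb_c)
print("paraphrase:", sim_ab)
print("reordered: ", sim_ac)
```

Because summation ignores word order, the reordered sentence gets (up to floating-point rounding) exactly the same embedding as the original, so its similarity score is essentially 1.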


In [None]:
# first install jieba 
!pip3 install jieba



In [None]:
# use jieba.lcut as the Chinese tokenizer
# you can also use the built-in JiebaTokenizer in paddlenlp
# which can directly load vocab from 'token_embedding'
# from paddlenlp.data import JiebaTokenizer
# tokenizer = JiebaTokenizer(vocab=token_embedding.vocab)


import jieba

exmp_sent = '我来自中国' # I = 我 [come from] = 来自 China = 中国
jieba.lcut(exmp_sent) # This will return a list of tokens directly

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.025 seconds.
Prefix dict has been built successfully.


['我', '来自', '中国']

In [None]:
import numpy as np

# then define a function to get sentence embedding 


def sentence_embedding(sentence, tokenizer=jieba.lcut, embedder=token_embedding):
  '''Return sentence embedding(s) given a sentence or a list of sentences. 

  Arguments:
    sentence: str or list of str
    tokenizer: a tokenizer function that returns a list of tokens. Defaults to 
      jieba.lcut. 
    embedder: a TokenEmbedding instance. Defaults to the token_embedding loaded above. 
  Returns: sentence embedding(s)
  '''
  def embedding(sent):
    tokens = tokenizer(sent)
    return np.sum(embedder.search(tokens), axis=0) # sum vertically

  if isinstance(sentence, str):
    return embedding(sentence)
  elif isinstance(sentence, list):
    return [embedding(sent) for sent in sentence]
  else:
    raise TypeError('sentence should be either a str or a list. '
                    f'{type(sentence)} not supported.')


def cosine_similarity(u, v):
  return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))


def sentence_cosine_sim(sent_a, sent_b, tokenizer=jieba.lcut, embedder=token_embedding):
  '''Returns the cosine similarity between two sentences.

  Arguments:
    sent_a, sent_b: str
  Returns:
    cosine similarity between the two sentence embeddings
  '''

  sent_a_em, sent_b_em = sentence_embedding([sent_a, sent_b], tokenizer, embedder)
  return cosine_similarity(sent_a_em, sent_b_em)

In [None]:
# Let's calculate two made-up sentences!
# sent_a = I enjoy learning deep learning
# sent_b = I am interested in learning deep learning

sent_a, sent_b = '我喜欢学习深度学习', '我对学习深度学习很有兴趣'
sentence_cosine_sim(sent_a, sent_b)

0.94084466

In [None]:
# These two sentences have very different structures,
# and there are also five non-overlapping words (喜欢, 对, 很, 有, 兴趣),
# but the cosine similarity score is over 0.94! Nice work!

print('Tokens of Sentence_a:', jieba.lcut(sent_a))
print('Tokens of Sentence_b:', jieba.lcut(sent_b))

Tokens of Sentence_a: ['我', '喜欢', '学习', '深度', '学习']
Tokens of Sentence_b: ['我', '对', '学习', '深度', '学习', '很', '有', '兴趣']


<br>

**Side project: Calculating the cosine similarity between two English sentences**

<br>

As the [TokenEmbedding API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/embeddings.html) in `paddlenlp` also includes some pre-trained English word embedding models (see [loading pre-trained word embedding in paddlenp.ipynb](https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing)), we can easily calculate the cosine similarity score of two English sentences using the function we just built. 

In [None]:
# model name: glove.wiki2014-gigaword.target.word-word.dim300.en
# model size: 422.83 MB; vocab size: 400002

tk_embedding_en = TokenEmbedding(embedding_name='glove.wiki2014-gigaword.target.word-word.dim300.en')

100%|██████████| 432979/432979 [00:58<00:00, 7338.69it/s]
[2021-07-20 18:58:53,383] [    INFO] - Loading token embedding...
[2021-07-20 18:58:57,444] [    INFO] - Finish loading embedding vector.
[2021-07-20 18:58:57,452] [    INFO] - Token Embedding info:
Unknown index: 400000
Unknown token: [UNK]
Padding index: 400001
Padding token: [PAD]
Shape: [400002, 300]


In [None]:
# The cosine similarity score is over 0.87! Nice work!
sent_a, sent_b = 'I enjoy learning deep learning', 'I am interested in learning deep learning'
en_tokenizer = lambda x: x.split() # here, we simply use space to tokenize English text

sentence_cosine_sim(sent_a, sent_b, en_tokenizer, tk_embedding_en)

0.8740044

<a name="3-2"></a>
### 3.2 Real-World examples: the performance of the Bag-of-Words model

We have just used two pre-trained word embedding models (one in Chinese, one in English) to build a Bag-of-Words model that can calculate the cosine similarity between two sentences. **The results of the two made-up sentences above look nice, but how well can the models we built work for real-world datasets?** Let's give a casual try! 

<br>

<a name="3-2-1"></a>
#### 3.2.1 Chinese examples

Here, we will use 50 examples extracted from the Large-scale Chinese Question Matching Corpus ([lcqmc](https://aclanthology.org/C18-1166/)) to see how well the simple Bag-of-Words model can give us clues about the similarity between Chinese sentence pairs. In this dataset, a similar sentence pair is labelled with "1" whereas a dissimilar sentence pair is labelled with "0". You can download the file [lcqmc_sample.tsv](https://drive.google.com/file/d/1-HVs8IEYqXX9Z63zLFODYETucW35Eq_T/view?usp=sharing) here.

<br>

**More on how to handle external files in Colab can be found [here](https://colab.research.google.com/notebooks/io.ipynb).**

In [None]:
from google.colab import drive

drive.mount('/drive', force_remount=True)

Mounted at /drive


In [None]:
# define a data loader that will load the dataset 
# in the structure of [sent_a, sent_b, (similarity) label]

def data_loader(filepath):
  out = [] # output
  # open with an explicit encoding; the file is closed automatically
  with open(filepath, 'r', encoding='utf-8') as data:
    for line in data:
      fields = line.split('\t')
      sent_a, sent_b, label = fields[0], fields[1], fields[2].strip()
      out.append([sent_a, sent_b, label])

  return out 

# the compare() function will calculate the cosine similarity between sentence pairs
# and print them out along with the sentence pairs and given labels
# tokenizer and embedder allow you to switch from the Chinese model to the English model.  
def compare(data, tokenizer=jieba.lcut, embedder=token_embedding):
  for item in data:
    sent_a, sent_b, label = item
    sim_score = sentence_cosine_sim(sent_a, sent_b, tokenizer=tokenizer, embedder=embedder)
    print('sent_a: ', sent_a)
    print('sent_b: ', sent_b)
    print('similarity score: ', sim_score)
    print('Given label: ', label)
    print()

In [None]:
file_path1 = '/drive/My Drive/lcqmc_sample.tsv'
data1 = data_loader(file_path1)
print('first five examples:\n')
for item in data1[:5]:
  print(item)

first five examples:

['喜欢打篮球的男生喜欢什么样的女生', '爱打篮球的男生喜欢什么样的女生', '1']
['我手机丢了，我想换个手机', '我想买个新手机，求推荐', '1']
['大家觉得她好看吗', '大家觉得跑男好看吗？', '0']
['求秋色之空漫画全集', '求秋色之空全集漫画', '1']
['晚上睡觉带着耳机听音乐有什么害处吗？', '孕妇可以戴耳机听音乐吗?', '0']


In [None]:
# It appears that this raw model does well on similar sentence pairs, but not dissimilar ones!
# The raw model needs to be further trained in order to make more accurate predictions 
compare(data1)

sent_a:  喜欢打篮球的男生喜欢什么样的女生
sent_b:  爱打篮球的男生喜欢什么样的女生
similarity score:  0.9883471
Given label:  1

sent_a:  我手机丢了，我想换个手机
sent_b:  我想买个新手机，求推荐
similarity score:  0.7669853
Given label:  1

sent_a:  大家觉得她好看吗
sent_b:  大家觉得跑男好看吗？
similarity score:  0.94471115
Given label:  0

sent_a:  求秋色之空漫画全集
sent_b:  求秋色之空全集漫画
similarity score:  0.99999994
Given label:  1

sent_a:  晚上睡觉带着耳机听音乐有什么害处吗？
sent_b:  孕妇可以戴耳机听音乐吗?
similarity score:  0.9046495
Given label:  0

sent_a:  学日语软件手机上的
sent_b:  手机学日语的软件
similarity score:  0.98783
Given label:  1

sent_a:  打印机和电脑怎样连接，该如何设置
sent_b:  如何把带无线的电脑连接到打印机上
similarity score:  0.9338498
Given label:  0

sent_a:  侠盗飞车罪恶都市怎样改车
sent_b:  侠盗飞车罪恶都市怎么改车
similarity score:  0.9842851
Given label:  1

sent_a:  什么花一年四季都开
sent_b:  什么花一年四季都是开的
similarity score:  0.96798384
Given label:  1

sent_a:  看图猜一电影名
sent_b:  看图猜电影！
similarity score:  0.9307659
Given label:  1

sent_a:  这上面写的是什么？
sent_b:  胃上面是什么
similarity score:  0.8654834
Given label:  0

sent_a:  建议您重新注册，辛苦您了。
sent_b:  

<a name="3-2-2"></a>
#### 3.2.2 English examples

Here, we will use 50 examples extracted from the Semantic Textual Similarity benchmark ([sts](http://ixa2.si.ehu.eus/stswiki/index.php/Main_Page)) to see how well the simple Bag-of-Words model can give us clues about the similarity between English sentence pairs. In this dataset, each sentence pair is labelled with a similarity score ranging from 0 to 5, which has been normalized to [0, 1] in our test sample. You can download the file [sts_sample.tsv](https://drive.google.com/file/d/1TJgl4WtKY4JlcCVZtd9CQK-1w26wNo9D/view?usp=sharing) here.

In [None]:
drive.mount('/drive', force_remount=True)

file_path2 = '/drive/My Drive/sts_sample.tsv'
data2 = data_loader(file_path2)
print('first five examples:\n')
for item in data2[:5]:
  print(item)

Mounted at /drive
first five examples:

['A multi-colored bird clings to a wire fence.', 'A bird holding on to a metal gate.', '0.64']
['A woman is mixing meat.', 'A woman is feeding a man.', '0.2']
['Two men sailing in a small sailboat.', 'Two brown horses standing in grassy field.', '0.0']
['the united states and other nato members have refused to do ratify the updates cfe.', 'the united states and other nato members have refused to ratify the amended treaty. ', '0.64']
["Thieves steal Channel swimmer's wheelchair", "Thieves snatch English Channel swimmer's custom-made wheelchair", '0.8']


In [None]:
# Again, it appears that this raw model does well on similar sentence pairs, but not dissimilar ones!
# The raw model needs to be further trained in order to make more accurate predictions 
en_tokenizer = lambda x: x.split() # here, we simply use space to tokenize English text
compare(data2, en_tokenizer, tk_embedding_en)

sent_a:  A multi-colored bird clings to a wire fence.
sent_b:  A bird holding on to a metal gate.
similarity score:  0.7458005
Given label:  0.64

sent_a:  A woman is mixing meat.
sent_b:  A woman is feeding a man.
similarity score:  0.81256783
Given label:  0.2

sent_a:  Two men sailing in a small sailboat.
sent_b:  Two brown horses standing in grassy field.
similarity score:  0.62510234
Given label:  0.0

sent_a:  the united states and other nato members have refused to do ratify the updates cfe.
sent_b:  the united states and other nato members have refused to ratify the amended treaty. 
similarity score:  0.980547
Given label:  0.64

sent_a:  Thieves steal Channel swimmer's wheelchair
sent_b:  Thieves snatch English Channel swimmer's custom-made wheelchair
similarity score:  0.69551796
Given label:  0.8

sent_a:  Men are falling into a pool.
sent_b:  People flip into a swimming pool.
similarity score:  0.7382532
Given label:  0.55

sent_a:  Fire destroys Tibetan town in China
sent_

<a name="3-2-3"></a>
#### 3.2.3 Summary

It is quite understandable that the Bag-of-Words models do not work well on dissimilar sentence pairs in either the Chinese or the English dataset. **Our models are too simple to be powerful**. All we did was take the pre-trained word embeddings and sum them up to get the sentence embeddings; such a model still needs further fine-tuning before it has real predictive power.
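
To put a rough number on this, the sketch below takes the eleven complete Chinese pairs printed in section 3.2.1 (score and gold label copied from that output) and compares the average score of similar and dissimilar pairs, then tries a naive threshold rule. The thresholds 0.90 and 0.95 are arbitrary choices for illustration, and eleven pairs are far too few for a reliable estimate.

```python
import numpy as np

# (similarity score, gold label) for the eleven complete pairs shown in 3.2.1
printed_results = [
  (0.9883471, 1), (0.7669853, 1), (0.94471115, 0), (0.99999994, 1),
  (0.9046495, 0), (0.98783, 1), (0.9338498, 0), (0.9842851, 1),
  (0.96798384, 1), (0.9307659, 1), (0.8654834, 0),
]
scores = np.array([score for score, _ in printed_results])
labels = np.array([label for _, label in printed_results])

mean_similar = scores[labels == 1].mean()
mean_dissimilar = scores[labels == 0].mean()
print("mean score, similar pairs:   ", mean_similar)
print("mean score, dissimilar pairs:", mean_dissimilar)

# Naive rule: predict "similar" when the score exceeds a cutoff
accuracy = {}
for threshold in (0.90, 0.95):
  predictions = (scores > threshold).astype(int)
  accuracy[threshold] = (predictions == labels).mean()
  print(f"threshold {threshold}: accuracy = {accuracy[threshold]:.2f}")
```

On this small sample the dissimilar pairs still average above 0.9, so the two groups are separated by only a narrow margin and the accuracy depends heavily on where the cutoff is placed.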

<br>

As a sentence embedding is just the sum of its tokens' embeddings, two sentences whose tokens' embeddings are similar but whose meanings are different can have a very high cosine similarity score. Moreover, if two sentences have identical words except that one contains a negation (e.g., "I am happy" versus "I am not happy"), their cosine similarity score will be very high as well. 
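
The negation effect can be illustrated even without a trained model. The sketch below uses random 300-dimensional vectors as hypothetical stand-ins for real word embeddings (the `vectors` table and the `embed` helper are assumptions for this example only), so that the only thing the two sentences share is their common words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors as stand-ins for real embeddings (hypothetical, illustration only)
vectors = {word: rng.normal(size=300) for word in ["I", "am", "not", "happy"]}

def embed(sentence):
  # Sum of the token vectors, as in the Bag-of-Words model above
  return np.sum([vectors[token] for token in sentence.split()], axis=0)

def cosine_similarity(u, v):
  return np.sum(u * v) / (np.sqrt(np.sum(u * u)) * np.sqrt(np.sum(v * v)))

negation_score = cosine_similarity(embed("I am happy"), embed("I am not happy"))
print(negation_score)
```

The two sums differ by a single vector, so the score stays high even though the sentences mean the opposite. With real embeddings the exact value would differ, but the mechanism is the same: one extra token can only shift the summed vector by a limited amount.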

<br>

More on text similarity will be updated in my GitHub Project [dl-nlp-using-paddlenlp](https://github.com/jaaack-wang/dl-nlp-using-paddlenlp).  

<a name="4"></a>
# 4. References


- [PaddleNLP Embedding API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/embeddings.html)