A more modern approach (as of 2021) is to use a machine-learning NLP model. Pre-trained models exist for exactly this task, many of them derived from BERT, so you don't have to train your own (though you could if you wanted to). Here is a code example that uses the excellent Hugging Face Transformers library with PyTorch. It's based on this example:

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch





not paraphrase: 10%
is paraphrase: 90%


In [7]:
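# Note: this checkpoint was fine-tuned on MRPC, an English paraphrase corpus,
# so scores for non-English input such as the Vietnamese below may be unreliable.
model_name = "bert-base-cased-finetuned-mrpc"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# "The children are playing basketball in the park near their home on a sunny afternoon."
sequence_0 = "Những đứa trẻ đang chơi bóng rổ tại công viên gần nhà họ vào buổi chiều mặt trời tỏa sáng."
# "The birds are singing loudly on the dense branches in a quiet forest."
sequence_1 = "Những con chim đang hót vang trên những cành cây rậm rạp ở một khu rừng tĩnh lặng."

# Encode both sentences together as one sentence-pair input
tokens = tokenizer(sequence_0, sequence_1, return_tensors="pt")
with torch.no_grad():  # inference only, so no gradients are needed
    classification_logits = model(**tokens).logits
# Softmax turns the two logits into probabilities that sum to 1
results = torch.softmax(classification_logits, dim=1).tolist()[0]

classes = ["not paraphrase", "is paraphrase"]
for label, prob in zip(classes, results):
    print(f"{label}: {round(prob * 100)}%")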

not paraphrase: 61%
is paraphrase: 39%
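
The percentages printed above are the result of applying a softmax to the two raw scores (logits) the classifier returns. As a minimal sketch of that step in plain Python, assuming made-up logit values (they are illustrative only and are not taken from the model):

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum first so the exponentials cannot overflow
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order ["not paraphrase", "is paraphrase"]
example_logits = [1.0, -1.0]
probs = softmax(example_logits)

for label, p in zip(["not paraphrase", "is paraphrase"], probs):
    print(f"{label}: {round(p * 100)}%")
```

The class with the larger logit always receives the larger probability, and two equal logits would give an even 50/50 split.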


In [None]:
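from gensim.models import Word2Vec

# Keywords (Vietnamese employment-related terms): employment, contract,
# relationship, agreement, obligation, right, work, wages, labor
keywords = ['việc làm', 'Hợp đồng', 'quan hệ', 'thỏa thuận', 'nghĩa vụ', 'quyền', 'công', 'tiền lương', 'lao động']

# Word2Vec expects a list of tokenized sentences. Here the corpus is a single
# "sentence" made of the keywords, which is far too little data for meaningful
# vectors, so the similarities printed below are essentially noise.
sentences = [keywords]

# Train a CBOW (sg=0) Word2Vec model; it is named w2v_model so that it does
# not overwrite the BERT `model` defined in the earlier cell
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# For each keyword, look up its nearest neighbours by cosine similarity
similar_words = {}
for keyword in keywords:
    similar_words[keyword] = w2v_model.wv.most_similar(keyword)

# Print each keyword with its neighbours
for keyword in keywords:
    print(f"Words related to '{keyword}':")
    for word, similarity in similar_words[keyword]:
        print(f"- {word} (Similarity: {similarity:.2f})")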
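
The similarity scores that `most_similar` reports are cosine similarities between word vectors. As a rough sketch of what that measure computes, here is a plain-Python version applied to three hypothetical 3-dimensional vectors (the vectors and the helper name `cosine_similarity` are assumptions for illustration, not values or functions from gensim):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: vec_b points the same way as vec_a,
# while vec_c is orthogonal to both
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
vec_c = [0.0, 0.0, 3.0]

print(f"a vs b: {cosine_similarity(vec_a, vec_b):.2f}")
print(f"a vs c: {cosine_similarity(vec_a, vec_c):.2f}")
```

Because the measure depends only on direction, not length, vectors learned from a tiny corpus can still produce scores that look confident. That is why a larger training corpus matters before reading anything into the neighbours.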