# Universal sentence encoder

### [Multilingual universal sentence encoder for semantic retrieval](https://arxiv.org/abs/1907.04307)
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. arXiv preprint arXiv:1907.04307.

### Code examples
Semantic similarity examples: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

# Setup environment

In [None]:
# Install tensorflow_text; pip also pulls in the matching TensorFlow release.
!pip install tensorflow_text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow_text
  Downloading tensorflow_text-2.11.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)
Collecting tensorflow<2.12,>=2.11.0
  Downloading tensorflow-2.11.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (588.3 MB)
Collecting flatbuffers>=2.0
  Downloading flatbuffers-23.1.21-py2.py3-none-any.whl (26 kB)
Collecting keras<2.12,>=2.11.0
  Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
Collecting tensorboard<2.12,>=2.11
  Downloading tensorboard-2.11.2-py3-none-any.whl (6.0 MB)

In [None]:
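import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
# Importing tensorflow_text registers the custom ops that the multilingual
# model relies on; the SentencepieceTokenizer name is not used directly below.
from tensorflow_text import SentencepieceTokenizer
import numpy as np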

# Load and initialize model

In [None]:
# The 16-language multilingual module is the default but feel free
# to pick others from the list and compare the results.
model_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3' #@param ['https://tfhub.dev/google/universal-sentence-encoder-multilingual/3', 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3']
use_model = hub.load(model_url)

# Example

![USEFigure.png](https://learnopencv.com/wp-content/uploads/2018/11/Universal-Sentence-Encoder.png)

In [None]:
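# Thai test sentence; it reads roughly "Is it because of blood group B or not?"
test_str = "\u0E40\u0E1B\u0E47\u0E19\u0E40\u0E1E\u0E23\u0E32\u0E30\u0E40\u0E25\u0E37\u0E2D\u0E14 group b \u0E2B\u0E23\u0E37\u0E2D\u0E40\u0E1B\u0E25\u0E48\u0E32" #@param {type:"string"}
sentences = [test_str]  # List of strings (named to avoid shadowing the built-in `input`)
output = use_model(sentences)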
print("Vector size:", output.shape)
print("Object type:", type(output))

Vector size: (1, 512)
Object type: <class 'tensorflow.python.framework.ops.EagerTensor'>
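
Since every sentence is mapped to a vector of the same size, two sentences can be compared by the cosine of the angle between their vectors. The block below is a minimal sketch of that comparison, not part of the original notebook: `cosine_similarity_matrix` is a hypothetical helper, and the random array only stands in for real USE output so that the snippet runs without downloading the model.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # Normalise each row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is their cosine similarity.
    return unit @ unit.T

# Assumption: random stand-in for three 512-dimensional sentence embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 512))
sim = cosine_similarity_matrix(fake_embeddings)
print(sim.shape)
```

With the loaded model, the same helper could be applied to `use_model([...]).numpy()` for a list of real sentences; entry `[i, j]` is then the similarity between sentence `i` and sentence `j`.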


# Load preprocessed datasets

In [None]:
!wget -O train_set_action.pkl https://github.com/ChanatipSaetia/SpeechAndLanguageTechnologies/releases/download/1.0/train_set_action.pkl
!wget -O test_set_action.pkl https://github.com/ChanatipSaetia/SpeechAndLanguageTechnologies/releases/download/1.0/test_set_action.pkl


In [None]:
import pickle

# read dataset
with open('train_set_action.pkl', 'rb') as f:
  train_text, train_labels = pickle.load(f)
with open('test_set_action.pkl', 'rb') as f:
  test_text, test_labels = pickle.load(f)

In [None]:
# Display the first 10 entries of the train set
for text, label in zip(train_text[:10], train_labels[:10]):
  print("Label:", label, "     \t", "Text:", text)

Label: enquire      	 Text: ให้รหัสไวไฟมา ใส่เท่าไหร่ก็เด้งไม่ถูกต้อง
Label: buy      	 Text: รบกวนสมัครเน็ตให้หน่อยค่ะ
Label: enquire      	 Text: จะสอบถามค่าบริการของทรู ว่าค้างออยู่รึเปล่าค่ะ
Label: enquire      	 Text: ผมเติมทรูมันนี่ มันไม่เข้าครับ
Label: activate      	 Text: พี่ครับ ผมซื้อซิมมาจาก <phone_number_removed> ผมโทรไปขอเบอร์แล้วเค้าบอกให้รอเอสเอ็มเอส รอเป็นชั่วโมงแล้ว ยังไม่ได้เลย
Label: enquire      	 Text: ตอนนี้อยู่จีนนะค่ะ เปิด data romimg เหมาจ่าย 333 บาท บุฟเฟ่ต์/วัน มี account ของทรูมูฟ เช็คค่าใช้จ่ายเกินมา 500 กว่าบาทแล้วค่ะ
Label: enquire      	 Text: อยากสอบถามยอดค้างบริการค่ะ
Label: cancel      	 Text: จะยกเลิกบริการเสริมครับ
Label: change      	 Text: เอ่อจะเปลี่ยนโปรโมชั่นอ่ะครับพี่
Label: enquire      	 Text: พอดีสมัครไวไฟแล้ว แล้วคราวนี้มันปิดไม่ได้ค่ะ


# Encode sentences

In [None]:
def batch_feed(texts, batch_size=32):
  """Encode the texts batch by batch to avoid running out of memory (OOM)."""
  results = []
  for i in range(0, len(texts), batch_size):
    vectors = use_model(texts[i:i+batch_size]) # Feed to USE
    vectors = vectors.numpy() # Convert from tf.Tensor to numpy array
    results.append(vectors)
  results = np.concatenate(results, axis=0)
  return results
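
To make the batching in `batch_feed` concrete, the sketch below lists the index ranges that the `range(0, len(texts), batch_size)` loop walks through. `batch_slices` is a hypothetical helper introduced only for illustration, and the count of 70 texts is an arbitrary assumption.

```python
def batch_slices(n_items, batch_size=32):
    """Return the (start, stop) index pairs that cover n_items in order."""
    return [(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]

# 70 texts with the default batch size: two full batches and one batch of 6.
slices = batch_slices(70, batch_size=32)
print(slices)  # [(0, 32), (32, 64), (64, 70)]
```

For an empty list of texts the loop produces no batches at all, and `np.concatenate` raises a `ValueError` when given an empty list, so `batch_feed` presumably needs at least one input text.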

In [None]:
encoded_train_text = batch_feed(train_text)
encoded_test_text = batch_feed(test_text)

# Train a model

Train a logistic regression classifier on the sentence vectors produced by USE.



In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=4, max_iter=1000, random_state=42)

In [None]:
model.fit(encoded_train_text, train_labels)

LogisticRegression(C=4, max_iter=1000, random_state=42)

# Evaluate

In [None]:
test_predict = model.predict(encoded_test_text)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_labels, test_predict, digits=3))

              precision    recall  f1-score   support

    activate      0.727     0.667     0.696        48
         buy      0.864     0.722     0.786        79
      cancel      0.920     0.880     0.900       117
      change      0.875     0.761     0.814        46
     enquire      0.869     0.939     0.903       859
      report      0.760     0.633     0.691       150
     request      0.875     0.424     0.571        33

    accuracy                          0.858      1332
   macro avg      0.841     0.718     0.766      1332
weighted avg      0.856     0.858     0.853      1332
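
The gap between the macro and weighted averages comes from class imbalance: `enquire` accounts for 859 of the 1332 test examples. As a worked check, the sketch below recomputes both recall averages from the per-class recall and support values in the report above; the variable names are illustrative only.

```python
# Per-class (recall, support) pairs copied from the classification report.
report = {
    "activate": (0.667, 48),
    "buy":      (0.722, 79),
    "cancel":   (0.880, 117),
    "change":   (0.761, 46),
    "enquire":  (0.939, 859),
    "report":   (0.633, 150),
    "request":  (0.424, 33),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, however rare it is.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall * support
                      for recall, support in report.values()) / total_support

print(f"macro recall:    {macro_recall:.3f}")     # 0.718
print(f"weighted recall: {weighted_recall:.3f}")  # 0.858
```

The weighted recall is dominated by `enquire` and coincides with the overall accuracy, while the macro recall is pulled down by small classes such as `request` (recall 0.424 on 33 examples).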



In [None]:
%%time
# Roughly: "I was given a Wi-Fi password, but however I enter it, it is rejected as incorrect."
text = "ให้รหัสไวไฟมา ใส่เท่าไหร่ก็เด้งไม่ถูกต้อง"
encoded_text = use_model([text])
predict = model.predict(encoded_text)[0]
print(predict)

enquire
CPU times: user 29.1 ms, sys: 1.98 ms, total: 31.1 ms
Wall time: 27.4 ms


In [None]:
%%time
# Roughly: "Cancel SMS"
text = "ยกเลิก sms"
encoded_text = use_model([text])
predict = model.predict(encoded_text)[0]
print(predict)

cancel
CPU times: user 35 ms, sys: 1.23 ms, total: 36.3 ms
Wall time: 36.1 ms


In [None]:
%%time
# Roughly: "Enquire about the monthly balance"
text = "สอบถามยอดรายเดือน"
encoded_text = use_model([text])
predict = model.predict(encoded_text)[0]
print(predict)

enquire
CPU times: user 40 ms, sys: 2.21 ms, total: 42.2 ms
Wall time: 56.9 ms


In [None]:
def classification_text(data_text):
  """Encode a single text with USE and return its predicted label."""
  encoded_text = use_model([data_text])  # Shape (1, 512)
  predict = model.predict(encoded_text)[0]
  return predict
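
A single predicted label hides how close the runner-up classes were. scikit-learn's `LogisticRegression` also provides `predict_proba` and `classes_`, so the labels could be ranked by probability. The sketch below shows only the ranking step: `top_k_labels` is a hypothetical helper, and the probabilities are made-up values for illustration, not output from the trained model.

```python
def top_k_labels(classes, probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(classes, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Assumption: illustrative numbers only, NOT produced by the model above.
example_classes = ["activate", "buy", "cancel", "change",
                   "enquire", "report", "request"]
example_probs = [0.05, 0.30, 0.02, 0.03, 0.10, 0.10, 0.40]

top2 = top_k_labels(example_classes, example_probs, k=2)
print(top2)  # [('request', 0.4), ('buy', 0.3)]
```

With the trained classifier, the inputs would presumably be `model.classes_` and `model.predict_proba(encoded_text)[0]`.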

In [None]:
text = "ซื้อมือถือ"  # Roughly: "Buy a mobile phone"
print(classification_text(text))

request


In [None]:
text = "ยกเลิก SMS"  # Roughly: "Cancel SMS"
print(classification_text(text))

cancel


In [None]:
text = "สอบถามยอดรายเดือน"  # Roughly: "Enquire about the monthly balance"
print(classification_text(text))

enquire


In [None]:
text = "โทรออกไม่ได้"  # Roughly: "Cannot make outgoing calls"
print(classification_text(text))

report
