<a href="https://colab.research.google.com/github/legacyai/legacyai_notebooks/blob/master/sentence2vec_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!nvidia-smi

Tue Dec 14 12:23:21 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
!git clone -b tft_typing https://github.com/legacyai/tf-transformers.git

!pip install sentencepiece

!pip install tensorflow-text

!pip install transformers

!pip install wandb

!pip install datasets

Cloning into 'tf-transformers'...
remote: Enumerating objects: 4271, done.
remote: Counting objects: 100% (2524/2524), done.
remote: Compressing objects: 100% (1543/1543), done.
remote: Total 4271 (delta 1803), reused 1602 (delta 946), pack-reused 1747
Receiving objects: 100% (4271/4271), 4.15 MiB | 16.09 MiB/s, done.
Resolving deltas: 100% (3058/3058), done.
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 5.3 MB/s 
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96
Collecting tensorflow-text
  Downloading tensorflow_text-2.7.3-cp37-cp37m-manylinux2010_x86_64.whl (4.9 MB)
     |████████████████████████████████| 4.9 MB 5.3 MB/s 
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.7.3
Collecting transformers
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)


In [4]:
import csv
import gzip
import os
import numpy as np

class STSDataReader:
    """
    Reads in the STS dataset. Each line contains two sentences (s1_col_idx, s2_col_idx) and one label (score_col_idx).
    The default values expect a tab-separated file with the sentence pair in the first and second columns and the score in the third column. The default config normalizes scores from 0...5 to 0...1.
    """
    def __init__(self, dataset_folder, s1_col_idx=0, s2_col_idx=1, score_col_idx=2, delimiter="\t",
                 quoting=csv.QUOTE_NONE, normalize_scores=True, min_score=0, max_score=5):
        self.dataset_folder = dataset_folder
        self.score_col_idx = score_col_idx
        self.s1_col_idx = s1_col_idx
        self.s2_col_idx = s2_col_idx
        self.delimiter = delimiter
        self.quoting = quoting
        self.normalize_scores = normalize_scores
        self.min_score = min_score
        self.max_score = max_score

    def get_examples(self, filename, max_examples=0):
        """
        filename specifies which data split to use (train.csv, dev.csv, test.csv).
        """
        filepath = os.path.join(self.dataset_folder, filename)
        with gzip.open(filepath, 'rt', encoding='utf8') if filename.endswith('.gz') else open(filepath, encoding="utf-8") as fIn:
            data = csv.reader(fIn, delimiter=self.delimiter, quoting=self.quoting)
            examples = []
            for idx, row in enumerate(data):  # idx avoids shadowing the built-in id()
                score = float(row[self.score_col_idx])
                if self.normalize_scores:  # Normalize to a 0...1 value
                    score = (score - self.min_score) / (self.max_score - self.min_score)

                s1 = row[self.s1_col_idx]
                s2 = row[self.s2_col_idx]
                examples.append({'file_name': filename+str(idx), 's1': s1, 's2': s2, 'label': score})

                if max_examples > 0 and len(examples) >= max_examples:
                    break

        return examples
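The score normalization inside `get_examples` is a plain min-max rescale. The sketch below isolates that step; `normalize_score` is a hypothetical helper introduced only for illustration and is not part of the notebook.

```python
def normalize_score(score, min_score=0.0, max_score=5.0):
    """Min-max rescale a raw score into the 0...1 range (hypothetical helper)."""
    return (score - min_score) / (max_score - min_score)

# Raw STS gold scores lie between 0 and 5.
raw_scores = [0.0, 2.5, 5.0]
normalized = [normalize_score(s) for s in raw_scores]
print(normalized)  # [0.0, 0.5, 1.0]
```

With `normalize_scores=False` the reader skips this step and returns the raw scores unchanged.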

In [5]:
!wget https://data.deepai.org/Stsbenchmark.zip
!unzip -d stsbenchmark Stsbenchmark.zip

sts_reader = STSDataReader(dataset_folder='stsbenchmark/stsbenchmark/', s1_col_idx=5, s2_col_idx=6, score_col_idx=4)
sts_examples = sts_reader.get_examples(filename='sts-dev.csv')

--2021-12-14 12:24:01--  https://data.deepai.org/Stsbenchmark.zip
Resolving data.deepai.org (data.deepai.org)... 138.201.36.183
Connecting to data.deepai.org (data.deepai.org)|138.201.36.183|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 409703 (400K) [application/zip]
Saving to: ‘Stsbenchmark.zip’


2021-12-14 12:24:02 (1.15 MB/s) - ‘Stsbenchmark.zip’ saved [409703/409703]

Archive:  Stsbenchmark.zip
   creating: stsbenchmark/stsbenchmark/
  inflating: stsbenchmark/stsbenchmark/readme.txt  
  inflating: stsbenchmark/stsbenchmark/sts-test.csv  
  inflating: stsbenchmark/stsbenchmark/correlation.pl  
  inflating: stsbenchmark/stsbenchmark/LICENSE.txt  
  inflating: stsbenchmark/stsbenchmark/sts-dev.csv  
  inflating: stsbenchmark/stsbenchmark/sts-train.csv  


In [6]:
# Read STS examples
s1_examples = [example['s1'] for example in sts_examples]
s2_examples = [example['s2'] for example in sts_examples]
sts_labels  = [example['label'] for example in sts_examples]
sts_labels  = np.array(sts_labels)

In [7]:
import sys
sys.path.append("/content/tf-transformers/src/")

import tensorflow as tf
import tensorflow_text as tf_text

In [8]:
from google.colab import auth
auth.authenticate_user()

In [9]:
import tensorflow as tf
from tf_transformers.core import LegacyLayer, LegacyModel
from tf_transformers.utils import tf_utils
from tf_transformers.models import AlbertModel, AlbertTokenizerTFText

class Similarity_Model_Pretraining(LegacyLayer):
    def __init__(
        self,
        encoder,
        projection_dimension,
        decoder=None,
        is_training=False,
        use_dropout=False,
        initializer="glorot_uniform",
        siamese=True,
        **kwargs,
    ):
        super(Similarity_Model_Pretraining, self).__init__(
            is_training=is_training, use_dropout=use_dropout, name=encoder.name, **kwargs
        )
        self.is_training = is_training
        if siamese:
            self.encoder = encoder
            self.decoder = encoder
        else:
            if decoder is None:
                raise ValueError("When siamese = False, decoder has to be provided. Provided decoder = None.")
            self.encoder = encoder
            self.decoder = decoder

        self.linear_projection = tf.keras.layers.Dense(
            units=projection_dimension,
            activation=None,
            kernel_initializer=initializer,
            name="linear_projection",
        )

        # As per CLIP paper
        self.logits_scale = tf.Variable(tf.math.log(1 / 0.07), name='logits_scale')


    def get_mean_embeddings(self, token_embeddings, input_mask):
        """Mean-pool the token embeddings over non-padded positions."""
        # cls_embeddings = token_embeddings[:, 0, :]  # 0 is CLS (<s>)
        # mask PAD tokens
        token_emb_masked = token_embeddings * tf.cast(tf.expand_dims(input_mask, 2), tf.float32)
        total_non_padded_tokens_per_batch = tf.cast(tf.reduce_sum(input_mask, axis=1), tf.float32)
        # Convert to 2D
        total_non_padded_tokens_per_batch = tf.expand_dims(total_non_padded_tokens_per_batch, 1)
        mean_embeddings = tf.reduce_sum(token_emb_masked, axis=1) / total_non_padded_tokens_per_batch
        return mean_embeddings

    def call(self, inputs):
        """Call"""
        centre_outputs = self.encoder(inputs)
        neighbour_outputs = self.decoder(inputs)

        if 'cls_output' not in centre_outputs:
            centre_outputs['cls_output'] = tf.keras.layers.Lambda(lambda x: tf.squeeze(x[:, 0:1, :], axis=1))(
                centre_outputs['token_embeddings']
            )
        if 'cls_output' not in neighbour_outputs:
            neighbour_outputs['cls_output'] = tf.keras.layers.Lambda(lambda x: tf.squeeze(x[:, 0:1, :], axis=1))(
                neighbour_outputs['token_embeddings']
            )

        centre_sentence_embedding = self.linear_projection(centre_outputs['cls_output'])
        neighbour_sentence_embedding = self.linear_projection(neighbour_outputs['cls_output'])

        centre_sentence_embedding_mean = self.linear_projection(self.get_mean_embeddings(centre_outputs['token_embeddings'], inputs['input_mask']))
        neighbour_sentence_embedding_mean = self.linear_projection(self.get_mean_embeddings(neighbour_outputs['token_embeddings'], inputs['input_mask']))

        centre_sentence_embedding_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(
            centre_sentence_embedding
        )
        neighbour_sentence_embedding_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(
            neighbour_sentence_embedding
        )

        centre_sentence_embedding_mean_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(
            centre_sentence_embedding_mean
        )
        neighbour_sentence_embedding_mean_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(
            neighbour_sentence_embedding_mean
        )


        # # Clamp logits to a max of tf.math.log(100) = 4.6051702 as per CLIP model
        # logits_scale = tf.math.exp(self.logits_scale)
        # logits_scale = tf.clip_by_value(
        #     logits_scale, clip_value_min=tf.math.log(1 / 0.07), clip_value_max=4.6051752
        # )

        # logits = tf.matmul(
        #             centre_sentence_embedding_normalized, neighbour_sentence_embedding_normalized, transpose_b=True
        #         )
        # logits = tf.cast(logits_scale, dtype=tf_utils.get_dtype()) * logits

        # logits_mean = tf.matmul(
        #             centre_sentence_embedding_mean_normalized, neighbour_sentence_embedding_mean_normalized, transpose_b=True
        #         )
        # logits_mean = tf.cast(logits_scale, dtype=tf_utils.get_dtype()) * logits_mean

        # scores = tf.matmul(centre_sentence_embedding_normalized,
        #         centre_sentence_embedding_normalized, transpose_b=True)
        # scores_mask = tf.where(tf.equal(scores, tf.linalg.diag_part(scores)), _large_compatible_negative(scores.dtype), tf.cast(0.0, scores.dtype))
        # # Reset only diagonal entries back to 1.0
        # scores_mask = tf.linalg.set_diag(scores_mask, tf.zeros(shape=(tf.shape(scores_mask)[0])), name='make_diagonal_one')

        outputs = {}
        #outputs['logits'] = logits + scores_mask
        #outputs['logits_mean'] = logits_mean + scores_mask

        outputs['centre_sentence_embedding'] = centre_sentence_embedding
        outputs['centre_sentence_embedding_mean'] = centre_sentence_embedding_mean
        outputs['centre_sentence_embedding_normalized'] = centre_sentence_embedding_normalized
        outputs['neighbour_sentence_embedding_normalized'] = neighbour_sentence_embedding_normalized
        outputs['centre_sentence_embedding_mean_normalized'] = centre_sentence_embedding_mean_normalized
        outputs['neighbour_sentence_embedding_mean_normalized'] = neighbour_sentence_embedding_mean_normalized

        return outputs
        

    def get_model(self):
        inputs = self.encoder.input
        layer_output = self(inputs)
        model = LegacyModel(inputs=inputs, outputs=layer_output, name="similarity_model")
        try:
            model.model_config = self.encoder._config_dict
        except AttributeError:
            model.model_config = self.encoder.model_config
        return model
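The `get_mean_embeddings` method above averages token vectors while ignoring padding. The following is a minimal pure-Python sketch of the same idea for a single sentence, with no TensorFlow dependency; `masked_mean` and the toy values are assumptions made for illustration only.

```python
def masked_mean(token_embeddings, input_mask):
    """Average token vectors over positions where input_mask is 1.

    token_embeddings: list of token vectors for one sentence.
    input_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, input_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep  # padded positions contribute zero
    n_tokens = sum(input_mask)  # number of non-padded tokens
    return [value / n_tokens for value in summed]

# Two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

Dividing by the count of real tokens rather than by the padded sequence length keeps the mean independent of how much padding a batch happens to contain.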


In [10]:
def model_fn():
  encoder = AlbertModel.from_pretrained("albert-base-v2")


  decoder_config = AlbertModel.get_config("albert-base-v2")
  decoder_config['num_hidden_layers']= 6
  decoder = AlbertModel.from_config(decoder_config)
  encoder.save_checkpoint("/tmp/albert/", overwrite=True)
  decoder.load_checkpoint("/tmp/albert")

  model = Similarity_Model_Pretraining(encoder=encoder, projection_dimension=768, decoder=decoder, siamese=False)
  model = model.get_model() 

  return model

finetuned_model = model_fn()

INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /root/.cache/huggingface/hub/tftransformers__albert-base-v2.main.999c3eeace9b4d2c3f2ad87aad4548b3b73ea3cc/ckpt-1
INFO:absl:Successful ✅: Loaded model from tftransformers/albert-base-v2


INFO:absl:Create model from config
INFO:absl:Successful ✅: Saved model at /tmp/albert/ckpt-1
INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /tmp/albert/ckpt-1


In [11]:
finetuned_model.load_checkpoint("gs://legacyai-bucket/sentence2vec_sample/")
#finetuned_model.load_checkpoint("gs://legacyai-bucket/sentence2vec_sample_old_delimter_batch64")
model_endoder = finetuned_model.layers[-1].encoder

INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from gs://legacyai-bucket/sentence2vec_sample/ckpt-1


In [31]:
import tqdm
import math
import scipy.stats  # import the submodule explicitly; scipy.stats.pearsonr is used below
import random

In [18]:
tokenizer_layer_eval = AlbertTokenizerTFText.from_pretrained("albert-base-v2", add_special_tokens=True, dynamic_padding=True, max_length=256)
batch_size = 32

INFO:absl:Loading albert-base-v2 tokenizer to /tmp/tftransformers_tokenizer_cache/albert-base-v2/spiece.model


In [53]:
# Base AlbertModel
base_model = AlbertModel.from_pretrained("albert-base-v2")


INFO:absl:Successful ✅✅: Model checkpoints matched and loaded from /root/.cache/huggingface/hub/tftransformers__albert-base-v2.main.999c3eeace9b4d2c3f2ad87aad4548b3b73ea3cc/ckpt-1
INFO:absl:Successful ✅: Loaded model from tftransformers/albert-base-v2


In [36]:
s1_eval = tf.data.Dataset.from_tensor_slices({'text': s1_examples}).batch(batch_size, drop_remainder=False)
s2_eval = tf.data.Dataset.from_tensor_slices({'text': s2_examples}).batch(batch_size, drop_remainder=False)

def run_sts_benchmark(model):

  s1_embeddings = []
  for batch_inputs in tqdm.tqdm(s1_eval):
    model_outputs = model(tokenizer_layer_eval(batch_inputs))
    s1_embeddings.append(model_outputs['cls_output'])

  s1_embeddings = tf.concat(s1_embeddings, axis=0)
  s1_embeddings = tf.nn.l2_normalize(s1_embeddings, axis=1)

  s2_embeddings = []
  for batch_inputs in tqdm.tqdm(s2_eval):
    model_outputs = model(tokenizer_layer_eval(batch_inputs))
    s2_embeddings.append(model_outputs['cls_output'])

  s2_embeddings = tf.concat(s2_embeddings, axis=0)
  s2_embeddings = tf.nn.l2_normalize(s2_embeddings, axis=1)

  cosine_similarities = tf.reduce_sum(tf.multiply(s1_embeddings, s2_embeddings), axis=1)
  clip_cosine_similarities = tf.clip_by_value(cosine_similarities, -1.0, 1.0)
  scores = 1.0 - tf.acos(clip_cosine_similarities) / math.pi
  # Return the L2-normalized embeddings and the angular similarity scores
  return s1_embeddings, s2_embeddings, scores
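The score computed in `run_sts_benchmark` is an angular similarity: the cosine is clipped to [-1, 1], converted to an angle with `acos`, and rescaled so that identical directions give 1 and opposite directions give 0. Below is a minimal pure-Python sketch of that mapping for a single pair of vectors; `angular_similarity` is a hypothetical name used only here.

```python
import math

def angular_similarity(u, v):
    """Map the cosine similarity of two vectors to a score in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding error
    return 1.0 - math.acos(cosine) / math.pi

print(angular_similarity([1.0, 0.0], [1.0, 0.0]))   # identical direction: 1.0
print(angular_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.5
print(angular_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: 0.0
```

The clipping step matters because floating-point rounding can push a cosine slightly outside [-1, 1], where `acos` is undefined.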




In [40]:
# Base Model
s1_embeddings_base, s2_embeddings_base, scores_base = run_sts_benchmark(base_model)
pearson_correlation = scipy.stats.pearsonr(scores_base, sts_labels)
print('Pearson correlation coefficient = {0}\np-value = {1}'.format(
    pearson_correlation[0], pearson_correlation[1]))

Pearson correlation coefficient = 0.0780602674718766
p-value = 0.002483390146319606
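`scipy.stats.pearsonr` returns the Pearson correlation coefficient together with a p-value. The coefficient itself measures how linearly the predicted similarity scores track the gold labels. A minimal sketch of the coefficient in pure Python follows; the `pearson` helper and the toy data are assumptions for illustration and are not taken from the benchmark.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: gold is exactly twice the prediction, so the relation is perfectly linear.
predicted = [0.1, 0.4, 0.5, 0.9]
gold = [0.2, 0.8, 1.0, 1.8]
print(round(pearson(predicted, gold), 6))  # 1.0
```

Because the coefficient is invariant to positive linear rescaling, normalizing the gold scores from 0...5 to 0...1 does not change it.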


In [47]:
# Encoder Model from Finetuned (sentence2vec)
# Pearson correlation coefficient = 0.5523335229198306
# p-value = 1.5276410871835363e-120

# old_delimiter_batch64
# pearson correlation coefficient = 0.5803103286608059
# p-value = 9.576509371272565e-136


s1_embeddings_new, s2_embeddings_new, scores_new = run_sts_benchmark(model_endoder)
pearson_correlation = scipy.stats.pearsonr(scores_new, sts_labels)
print('Pearson correlation coefficient = {0}\np-value = {1}'.format(
    pearson_correlation[0], pearson_correlation[1]))

100%|██████████| 47/47 [03:21<00:00,  4.30s/it]
100%|██████████| 47/47 [03:21<00:00,  4.30s/it]

Pearson correlation coefficient = 0.5523335229198306
p-value = 1.5276410871835363e-120





In [15]:
# Encoder Model from Finetuned
s1_embeddings_new, s2_embeddings_new, scores_new = run_sts_benchmark(model_endoder)
pearson_correlation = scipy.stats.pearsonr(scores_new, sts_labels)
print('Pearson correlation coefficient = {0}\np-value = {1}'.format(
    pearson_correlation[0], pearson_correlation[1]))

100%|██████████| 47/47 [00:16<00:00,  2.84it/s]
100%|██████████| 47/47 [00:20<00:00,  2.30it/s]

Pearson correlation coefficient = 0.5803103286608059
p-value = 9.576509371272565e-136





In [34]:
def run_sts_benchmark_finetuned(model):

  s1_embeddings_centre = []
  s1_embeddings_centre_mean = []
  for batch_inputs in tqdm.tqdm(s1_eval):
    model_outputs = model(tokenizer_layer_eval(batch_inputs))
    s1_embeddings_centre.append(model_outputs['centre_sentence_embedding'])
    s1_embeddings_centre_mean.append(model_outputs['centre_sentence_embedding_mean'])

  s1_embeddings_centre = tf.concat(s1_embeddings_centre, axis=0)
  s1_embeddings_centre = tf.nn.l2_normalize(s1_embeddings_centre, axis=1)

  s1_embeddings_centre_mean = tf.concat(s1_embeddings_centre_mean, axis=0)
  s1_embeddings_centre_mean = tf.nn.l2_normalize(s1_embeddings_centre_mean, axis=1)

  s2_embeddings_centre = []
  s2_embeddings_centre_mean = []
  for batch_inputs in tqdm.tqdm(s2_eval):
    model_outputs = model(tokenizer_layer_eval(batch_inputs))
    s2_embeddings_centre.append(model_outputs['centre_sentence_embedding'])
    s2_embeddings_centre_mean.append(model_outputs['centre_sentence_embedding_mean'])

  s2_embeddings_centre = tf.concat(s2_embeddings_centre, axis=0)
  s2_embeddings_centre = tf.nn.l2_normalize(s2_embeddings_centre, axis=1)

  s2_embeddings_centre_mean = tf.concat(s2_embeddings_centre_mean, axis=0)
  s2_embeddings_centre_mean = tf.nn.l2_normalize(s2_embeddings_centre_mean, axis=1)

  cosine_similarities = tf.reduce_sum(tf.multiply(s1_embeddings_centre, s2_embeddings_centre), axis=1)
  clip_cosine_similarities = tf.clip_by_value(cosine_similarities, -1.0, 1.0)
  scores = 1.0 - tf.acos(clip_cosine_similarities) / math.pi

  cosine_similarities = tf.reduce_sum(tf.multiply(s1_embeddings_centre_mean, s2_embeddings_centre_mean), axis=1)
  clip_cosine_similarities = tf.clip_by_value(cosine_similarities, -1.0, 1.0)
  scores_mean = 1.0 - tf.acos(clip_cosine_similarities) / math.pi
  # Return the L2-normalized CLS and mean embeddings along with their angular similarity scores
  return s1_embeddings_centre, s2_embeddings_centre, s1_embeddings_centre_mean, s2_embeddings_centre_mean,  scores, scores_mean

In [37]:
s1_embeddings_centre, s2_embeddings_centre, s1_embeddings_centre_mean, s2_embeddings_centre_mean,  scores, scores_mean = run_sts_benchmark_finetuned(finetuned_model)


100%|██████████| 47/47 [00:40<00:00,  1.15it/s]
100%|██████████| 47/47 [00:40<00:00,  1.15it/s]


In [56]:
# legacyai-bucket/sentence2vec_sample/

# Pearson correlation coefficient = 0.5494236469069682
# p-value = 4.825365549321115e-119

# Mean
# Pearson correlation coefficient = 0.5578009309028753
# p-value = 2.119342998731924e-123


# legacyai-bucket/sentence2vec_old_delimiter64/

# Pearson correlation coefficient = 0.5895740794403129
# p-value = 4.1226925295805925e-141

# Mean
# Pearson correlation coefficient = 0.6198279847214399
# p-value = 6.587348365270443e-160

pearson_correlation = scipy.stats.pearsonr(scores, sts_labels)
print('Pearson correlation coefficient = {0}\np-value = {1}'.format(
    pearson_correlation[0], pearson_correlation[1]))

pearson_correlation = scipy.stats.pearsonr(scores_mean, sts_labels)
print('Pearson correlation coefficient = {0}\np-value = {1}'.format(
    pearson_correlation[0], pearson_correlation[1]))

Pearson correlation coefficient = 0.5494236469069682
p-value = 4.825365549321115e-119


In [29]:

def score_base_model(model, embeddings, examples,  text):
  query_vec = model(tokenizer_layer_eval({'text': [text]}))['cls_output']
  query_vec =  tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(query_vec)
  scores = tf.matmul(query_vec, embeddings, transpose_b=True)
  probs, indexes = tf.nn.top_k(scores, k=10)

  for idx, p_index in enumerate(indexes.numpy()[0]):
    print(examples[p_index], '-->', probs.numpy()[0][idx])



def score_finetuned_model(model, embeddings, examples,  text):
  query_vec = model(tokenizer_layer_eval({'text': [text]}))['centre_sentence_embedding']
  query_vec =  tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(query_vec)
  scores = tf.matmul(query_vec, embeddings, transpose_b=True)
  probs, indexes = tf.nn.top_k(scores, k=10)

  for idx, p_index in enumerate(indexes.numpy()[0]):
    print(examples[p_index], '-->', probs.numpy()[0][idx])

def score_finetuned_model_mean(model, embeddings, examples,  text):
  query_vec = model(tokenizer_layer_eval({'text': [text]}))['centre_sentence_embedding_mean']
  query_vec =  tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(query_vec)
  scores = tf.matmul(query_vec, embeddings, transpose_b=True)
  probs, indexes = tf.nn.top_k(scores, k=10)

  for idx, p_index in enumerate(indexes.numpy()[0]):
    print(examples[p_index], '-->', probs.numpy()[0][idx])
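The three scoring functions above share one pattern: L2-normalize the query vector, take dot products against normalized corpus embeddings, and keep the top-k results. The values stored in `probs` are therefore cosine similarities rather than probabilities. The pattern can be sketched in pure Python as follows; `l2_normalize`, `top_k_similar`, and the toy 2-D vectors are hypothetical and stand in for the TensorFlow calls.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def top_k_similar(query, corpus_vectors, corpus_texts, k=2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    query = l2_normalize(query)
    scored = []
    for vector, text in zip(corpus_vectors, corpus_texts):
        vector = l2_normalize(vector)
        # For unit vectors the dot product equals the cosine similarity.
        score = sum(q * v for q, v in zip(query, vector))
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 2-D "embeddings"; real sentence embeddings have far more dimensions.
texts = ["a", "b", "c"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_similar([2.0, 0.0], vectors, texts)
for score, text in results:
    print(text, '-->', round(score, 4))
```

Sorting the full list is fine for a sketch; for a large corpus a partial selection such as `heapq.nlargest` avoids sorting every score.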

In [59]:
text = '3 traffic accidents leave 56 dead in China'
score_base_model(base_model, s1_embeddings_base, s1_examples,  text)

Gunmen 'kill 10 tourists' in Kashmir --> 0.9716114
Military plane crashes in south France: authorities --> 0.96852636
Thai police use tear gas against protesters --> 0.9684438
Suicide bomber kills 4 near NATO's Afghan HQ --> 0.96780616
1 person killed in sectarian clashes in Lebanon --> 0.964547
UN Chemical Weapons Experts to Visit Syria --> 0.9637456
2 pro-Palestinian activists arrested at B-G Airport --> 0.9634037
Food poisoning kills at least 20 children in India --> 0.9633391
House blaze kills 7 in northern Pakistan --> 0.9633246
Death toll in Colorado floods rises to four --> 0.9621356


In [20]:
text = '3 traffic accidents leave 56 dead in China'
score_base_model(model_endoder, s1_embeddings_new, s1_examples,  text)

South Africa train crash kills at least 19 --> 0.87800825
Motorists killed after Japanese tunnel collapses --> 0.8757251
'Around 100 dead or injured' after China earthquake --> 0.8573516
Floods leave six dead in Philippines --> 0.8552806
No radiation leak at Iran's nuclear plant --> 0.83268195
17 govt employees killed in Pakistan bus bombing --> 0.81438434
Military plane crashes in south France: authorities --> 0.8100992
Residents near Gaza instructed to stay near shelter --> 0.8079426
Saudi gas truck blast kills at least 22 --> 0.8064612
Deaths confirmed after helicopter crashes into Scottish pub --> 0.79784274


In [39]:
text = '3 traffic accidents leave 56 dead in China'
score_finetuned_model(finetuned_model, s1_embeddings_centre, s1_examples,  text)

South Africa train crash kills at least 19 --> 0.877387
Saudi gas truck blast kills at least 22 --> 0.841351
17 govt employees killed in Pakistan bus bombing --> 0.793483
Motorists killed after Japanese tunnel collapses --> 0.78722435
2 pro-Palestinian activists arrested at B-G Airport --> 0.786338
Deaths in rollover crashes accounted for 82 percent of the number of traffic deaths in 2002, the agency says. --> 0.7750963
FAA lifts ban on U.S. flights to Tel Aviv --> 0.77028334
Thai police use tear gas against protesters --> 0.7697736
The plane was estimated to be within 100 pounds of its maximum takeoff weight. --> 0.7663858
'Around 100 dead or injured' after China earthquake --> 0.76290655


In [40]:
text = '3 traffic accidents leave 56 dead in China'
score_finetuned_model_mean(finetuned_model, s1_embeddings_centre_mean, s1_examples,  text)

South Africa train crash kills at least 19 --> 0.8828801
Saudi gas truck blast kills at least 22 --> 0.8380425
17 govt employees killed in Pakistan bus bombing --> 0.7877832
Thai police use tear gas against protesters --> 0.7864125
Motorists killed after Japanese tunnel collapses --> 0.77681893
Floods leave six dead in Philippines --> 0.7743559
'Around 100 dead or injured' after China earthquake --> 0.7713753
2 pro-Palestinian activists arrested at B-G Airport --> 0.7616317
Turkish police mass near Istanbul park protest area --> 0.75263625
Two blasts hit Syria capital --> 0.7523016


In [23]:
text = 'The song is good'
score_finetuned_model_mean(finetuned_model, s1_embeddings_centre_mean, s1_examples,  text)

A man plays the guitar and sings. --> 0.92224723
A man plays an acoustic guitar. --> 0.90142226
A man plays the guitar. --> 0.8978389
A man playing the guitar. --> 0.8914678
A man is playing guitar. --> 0.88675064
A man plays a guitar. --> 0.88396347
A woman is playing the guitar. --> 0.8818873
A woman is playing the guitar. --> 0.88188714
A girl is playing a flute. --> 0.88187635
A musician is smearing jam on his white guitar at a concert. --> 0.87784964


In [13]:
# Quora question pairs
from datasets import load_dataset

quora_dataset = load_dataset("quora")



Downloading and preparing dataset quora/default (download: 55.48 MiB, generated: 55.46 MiB, post-processed: Unknown size, total: 110.94 MiB) to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04...


Dataset quora downloaded and prepared to /root/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04. Subsequent calls will reuse this data.



In [14]:
all_questions = []
for item in quora_dataset['train']:
  all_questions.extend(item['questions']['text'])
  
all_questions = list(set(all_questions))

In [54]:
def get_embeddings(model, sentences):

  dataset = tf.data.Dataset.from_tensor_slices({'text': sentences}).batch(batch_size, drop_remainder=False)
  s1_embeddings_centre = []
  s1_embeddings_centre_mean = []
  for batch_inputs in tqdm.tqdm(dataset):
    model_outputs = model(tokenizer_layer_eval(batch_inputs))
    s1_embeddings_centre.append(model_outputs['centre_sentence_embedding'])
    s1_embeddings_centre_mean.append(model_outputs['centre_sentence_embedding_mean'])

  s1_embeddings_centre = tf.concat(s1_embeddings_centre, axis=0)
  s1_embeddings_centre_mean = tf.concat(s1_embeddings_centre_mean, axis=0)

  return s1_embeddings_centre, s1_embeddings_centre_mean

def get_embeddings_base_model(model, sentences):

  dataset = tf.data.Dataset.from_tensor_slices({'text': sentences}).batch(batch_size, drop_remainder=False)
  s1_embeddings_centre = []
  for batch_inputs in tqdm.tqdm(dataset):
    model_outputs = model(tokenizer_layer_eval(batch_inputs))
    s1_embeddings_centre.append(model_outputs['cls_output'])

  s1_embeddings_centre = tf.concat(s1_embeddings_centre, axis=0)

  return s1_embeddings_centre

In [28]:
all_questions_sample = random.sample(all_questions, 10000)
quora_embeddings, quora_embeddings_mean = get_embeddings(finetuned_model, all_questions_sample)

100%|██████████| 313/313 [03:21<00:00,  1.55it/s]


In [58]:
quora_embeddings_base = get_embeddings_base_model(base_model, all_questions_sample)
quora_embeddings_base_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(quora_embeddings_base)

In [30]:
quora_embeddings_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(
            quora_embeddings
        )
quora_embeddings_mean_normalized = tf.keras.layers.Lambda(lambda x: tf.nn.l2_normalize(x, axis=1))(
            quora_embeddings_mean
        )

In [83]:
text = random.sample(all_questions, 1)
print(text)

['Why is the Grand Canyon famous?']


In [103]:
text = random.sample(all_questions_sample, 1)
print(text)

['How much margin does amazon keep in gift vouchers?']


In [114]:
text = 'Which are the best countries to live in?'
text = 'How is green tea related to weight loss?'
text = 'What are the best action movies ?'
text = 'What is difference between a computer science engineer and an IT engineer?'
text = 'What is Blackcore Edge Max? Is it effective?'
text = 'How can I find all the uninstalled APK files on my Android phone and delete them?'
text = "What is the best printer to use to print lined paper for notebooks so that it doesn't curl or buckle?"
text = 'Why is the Grand Canyon famous'

text = 'How did biological life evolve from chemical elements?'
text = 'What is Gradient Descent?'
text = 'How Deep Learning differs from Machine Learning?'
text = 'Artificial Neural Networks'


score_finetuned_model(finetuned_model, quora_embeddings_normalized, all_questions_sample,  text)

print()
print('---------------------------')
print("Mean Model")
print()
score_finetuned_model_mean(finetuned_model, quora_embeddings_mean_normalized, all_questions_sample,  text)

print()
print("Base Model")
print('---------------------------')
print()
score_base_model(base_model, quora_embeddings_base_normalized, all_questions_sample,  text)


What are the Applications of automata theory in Simulation and Modelling? --> 0.80909824
How can I partially automate signaling pathways reconstruction from the literature? --> 0.7967655
What is machine learning algorithm? --> 0.7902843
What are computer algorithms? --> 0.78896767
How important are data structure and algorithms to learn programming? --> 0.7754864
How does Neuro-Linguistic Programming work? --> 0.77190477
Whether or not disproving the null hypothesis is very important for qualitative researchers? --> 0.7710546
What is biomedical engineering? --> 0.7708437
Can the intelligence of a certain human being be measured in a Lab using modern technology (MRI combined with other tests for example) and our current knowledge of the human brain? --> 0.7640928
What are the positions of the modern cell theory? --> 0.76299983

---------------------------
Mean Model

What are the Applications of automata theory in Simulation and Modelling? --> 0.8045826
What is machine learning algorith