## Contradictory, My Dear Watson

Can machines determine the relationships between sentences?

Given two sentences, there are three ways they could be related:
* one could entail the other
* one could contradict the other
* they could be unrelated

In [1]:
import numpy as np
import pandas as pd

from transformers import BertTokenizer, TFBertModel
import matplotlib.pyplot as plt
import tensorflow as tf

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/contradictory-my-dear-watson/sample_submission.csv
/kaggle/input/contradictory-my-dear-watson/train.csv
/kaggle/input/contradictory-my-dear-watson/test.csv


## Set up the TPU

In [2]:
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
  strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
  strategy = tf.distribute.get_strategy()  # Fall back to CPU and/or single GPU

# Report the replica count for whichever strategy was selected
print(f'Number of replicas: {strategy.num_replicas_in_sync}')

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: local


INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:5, TPU, 0, 0)

## Load the Data

| Label | Meaning |
| --- | --- |
| 0 | entailment |
| 1 | neutral |
| 2 | contradiction |

In [3]:
from sklearn.model_selection import train_test_split

train_raw = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
test = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')

train, valid = train_test_split(train_raw,
                                test_size=0.2,
                                random_state=0,
                                stratify=train_raw.lang_abv)

train\
    .loc[lambda df: df.lang_abv.eq('en')]\
    .groupby('label')\
    .sample(2, random_state=1)\
    [['premise', 'hypothesis', 'label']]\
    .style.hide(axis='index')

| premise | hypothesis | label |
| --- | --- | --- |
| so are can i just ask you are you Canadian | Are you from Canada? | 0 |
| The first, reached from Luxor, is Esna, 54 km (33 miles) by road. | Esna is located 54km away from Luxor. | 0 |
| The association's mission is to reduce the incidence of fraud and white-collar crime through prevention and education. | The association hopes people will not steal someone's identity. | 1 |
| yeah well are you you with TI | TI is the tourism international society. | 1 |
| Despite protests by preservationists, there was little alternative. | There were various alternatives and one that was appeasing to everyone was implemented. | 2 |
| Participation in the rulemaking process requires (1) the public to be aware of opportunities to participate and (2) systems that will allow agencies to receive comments in an efficient and effective manner. | The public need not be made aware of any opportunities for rulemaking processes. | 2 |


## Prepare Data for Input

We'll use a pretrained BERT model from HuggingFace.

First, we'll download the tokenizer.

Tokenizers turn sequences of words into arrays of numbers.

In [4]:
model_name = 'bert-base-multilingual-cased'

# We imported BertTokenizer from transformers
tokenizer = BertTokenizer.from_pretrained(model_name)



### Let's look at an example tokenization

In [5]:
def encode_sentence(sentence):
  # Use the BERT tokenizer we just downloaded to split the sentence into tokens.
  # For example, "I love machine learning" becomes
  # ['I', 'love', 'machine', 'learning']
  tokens = list(tokenizer.tokenize(sentence))

  # Append the separator token
  tokens.append('[SEP]')

  # Convert the tokens to their integer vocabulary IDs.
  # So ['I', 'love', 'machine', 'learning', '[SEP]'] becomes
  # [146, 16138, 21432, 26901, 102]
  return tokenizer.convert_tokens_to_ids(tokens)

encode_sentence('I love machine learning')

[146, 16138, 21432, 26901, 102]

In [6]:
encode_sentence('I LOVE MACHINE LEARNING')

[146, 52734, 71008, 108880, 93280, 84977, 52188, 52898, 34065, 102]
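The all-caps sentence produces nine IDs before `[SEP]` instead of four because `bert-base-multilingual-cased` is case-sensitive: words that are not in its vocabulary are split into smaller subword pieces. The sketch below illustrates the greedy longest-match-first idea behind WordPiece splitting in plain Python. The `wordpiece` function and `toy_vocab` are hypothetical stand-ins for illustration only; the real tokenizer's vocabulary and pre-processing differ.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (not the real multilingual BERT vocabulary)
toy_vocab = {'love', 'LO', '##VE', 'machine', 'MA', '##CH', '##INE'}

lower_pieces = wordpiece('love', toy_vocab)
upper_pieces = wordpiece('LOVE', toy_vocab)
print(lower_pieces, upper_pieces)
```

With this toy vocabulary the lowercase word stays whole while the uppercase word is split, mirroring the longer ID list seen above.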

BERT requires three inputs:
* input word IDs (what you see above)
* input masks
* input type IDs

These allow the model to know that the premise and hypothesis are distinct sentences and to ignore any padding from the tokenizer.

A `[CLS]` token is used to denote the beginning of the inputs and a `[SEP]` token is used to separate the premise and hypothesis.

We also need to pad all of the inputs to be the same size.

You can read more about BERT inputs at HuggingFace.
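As a rough illustration of the three inputs, here is a plain-Python sketch for a single sentence pair. The helper `build_bert_inputs` is hypothetical and is not part of the notebook's pipeline. The special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) and the example token IDs are taken from the outputs shown in this notebook.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as they appear in this notebook's outputs


def build_bert_inputs(first_ids, second_ids, max_len):
    """Hypothetical helper: build word IDs, mask, and type IDs for one pair."""
    first = [CLS_ID] + first_ids + [SEP_ID]   # segment 0: [CLS] + first sentence + [SEP]
    second = second_ids + [SEP_ID]            # segment 1: second sentence + [SEP]

    word_ids = (first + second)[:max_len]     # truncate anything past max_len
    type_ids = ([0] * len(first) + [1] * len(second))[:max_len]
    mask = [1] * len(word_ids)                # 1 marks a real token

    pad = max_len - len(word_ids)             # number of padding positions
    return {
        'input_word_ids': word_ids + [PAD_ID] * pad,
        'input_mask': mask + [0] * pad,       # 0 tells the model to ignore padding
        'input_type_ids': type_ids + [0] * pad,
    }


example = build_bert_inputs(
    [40690, 117, 146, 21852, 119],
    [10657, 117, 146, 16938, 112, 188, 21852, 119],
    max_len=20,
)
print(example)
```

The mask and the type IDs have the same length as the word IDs, so all three can be stacked into fixed-size tensors.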

Now we can encode all premise/hypothesis pairs for input into BERT.

### How long are the sentences?

In [7]:
train\
    .hypothesis.add(train.premise)\
    .apply(encode_sentence)\
    .apply(len)\
    .describe()

count    9696.000000
mean       46.374587
std        23.264591
min         2.000000
25%        30.000000
50%        43.000000
75%        59.000000
max       257.000000
dtype: float64
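These statistics guide the choice of `max_len`. The notebook uses `max_len=75`, which is above the 75th percentile (59) but far below the maximum (257), so some pairs will be truncated. The sketch below estimates the truncated share for a candidate `max_len`. The function `truncation_rate`, the `extra_tokens=2` allowance (an assumption covering the `[CLS]` token and the second `[SEP]` that the length summary does not count), and the toy lengths are all hypothetical.

```python
def truncation_rate(lengths, max_len, extra_tokens=2):
    """Share of sequences that would exceed max_len once special tokens are added."""
    if not lengths:
        return 0.0
    too_long = sum(1 for n in lengths if n + extra_tokens > max_len)
    return too_long / len(lengths)


# Hypothetical token counts, not the real training data
toy_lengths = [2, 30, 43, 59, 74, 257]

rate = truncation_rate(toy_lengths, max_len=75)
print(f'{rate:.2%} of the toy sequences would be truncated')
```

Running the same function over the real encoded lengths for a few candidate values would show how much text each choice of `max_len` discards.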

In [8]:
max_len=75

def bert_encode(hypotheses, premises, tokenizer, max_len):

  num_examples = len(hypotheses)

  # Encode the input sentences and convert the results to tensors
  hypoth_tensors = tf.ragged.constant(
      [encode_sentence(s) for s in np.array(hypotheses)]
  )

  premise_tensors = tf.ragged.constant(
      [encode_sentence(s) for s in np.array(premises)]
  )

  # Create the appropriate number of start tokens and then encode them
  cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])] * num_examples

  # Create the input by combining the start token, the hypothesis, and the
  # premise. (Keep in mind that the separator token was added by the
  # encode_sentence function.)
  # Concatenate along axis=1 so the pieces are joined within each example
  # (axis=0 would stack them as extra rows instead).
  input_word_ids = tf.concat([cls, hypoth_tensors, premise_tensors], axis=1)

  # The input mask is 1 for every real token; to_tensor pads the remaining
  # positions with 0 so the model can ignore the padding
  input_mask = tf.ones_like(input_word_ids).to_tensor(
          shape=[input_word_ids.shape[0], max_len])

  # The type IDs mark the two segments: 0 for [CLS] and the hypothesis,
  # 1 for the premise (padding positions are filled with 0)
  type_cls=tf.zeros_like(cls)
  type_hypoth = tf.zeros_like(hypoth_tensors)
  type_premise = tf.ones_like(premise_tensors)
  input_type_ids = tf.concat([type_cls, type_hypoth, type_premise], axis=1)\
    .to_tensor(
          shape=[input_word_ids.shape[0], max_len])

  # Combine all inputs into a dictionary
  inputs = {
      'input_word_ids': input_word_ids.to_tensor(
          shape=[input_word_ids.shape[0], max_len]),
      'input_mask': input_mask,
      'input_type_ids': input_type_ids
  }

  return inputs

bert_encode(
    train.head(2).hypothesis.values,
    train.head(2).premise.values,
    tokenizer,
    max_len
)

{'input_word_ids': <tf.Tensor: shape=(2, 75), dtype=int32, numpy=
 array([[   101,  40690,    117,    146,  21852,    119,    102,  10657,
            117,    146,  16938,    112,    188,  21852,    119,    102,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0,      0,      0,      0,      0,      0,
              0,      0,      0],
        [   101,  39301,    189,  12577,  77008,  10343,  11471,  15694,
          15673,  10113,  10339,  87051,  10116,  10549,  10440,  15694,
         106156,  29175,    119,    102,  77056,  68192,  62310,  10113,
          11735,  19402,

In [9]:
train_input = bert_encode(
    train.hypothesis.values,
    train.premise.values,
    tokenizer,
    max_len
)

valid_input = bert_encode(
    valid.hypothesis.values,
    valid.premise.values,
    tokenizer,
    max_len
)

test_input = bert_encode(
    test.hypothesis.values,
    test.premise.values,
    tokenizer,
    max_len
)

print('Input data prepared for modeling')

Input data prepared for modeling


In [10]:
print(train.hypothesis.iloc[0], train.premise.iloc[0], train.label.iloc[0])

# We can see the encoding of the first train sentence here like this
train_input['input_word_ids'][0]

Yes, I know. No, I don't know.  2


<tf.Tensor: shape=(75,), dtype=int32, numpy=
array([  101, 40690,   117,   146, 21852,   119,   102, 10657,   117,
         146, 16938,   112,   188, 21852,   119,   102,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0], dtype=int32)>

## Create the Model

In [11]:
def build_model():
  # Load the pretrained BERT model from HuggingFace
  bert_encoder = TFBertModel.from_pretrained(model_name)

  # Create the input layers for the model
  input_word_ids = tf.keras.Input(shape=(max_len,),
                                  dtype=tf.int32,
                                  name='input_word_ids')
  
  input_mask = tf.keras.Input(shape=(max_len,),
                              dtype=tf.int32,
                              name='input_mask')
  
  input_type_ids = tf.keras.Input(shape=(max_len,),
                                  dtype=tf.int32,
                                  name='input_type_ids')
  
  # Encode the input sentences
  # This creates a tensor of shape=(None, max_len, 768)
  embedding = bert_encoder([input_word_ids, input_mask, input_type_ids])[0]

  # Specify the output: a softmax over the three classes, computed from the
  # embedding of the [CLS] token (position 0). Shape is (None, 3).
  output = tf.keras.layers.Dense(3, activation='softmax')(embedding[:,0, :])

  # Create the model given the inputs and outputs
  model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids],
                         outputs=output)
  
  # Compile the model based on accuracy metric
  model.compile(tf.keras.optimizers.Adam(learning_rate=1e-5),
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  
  return model

# Instantiate the model and print summary
with strategy.scope():
  model = build_model()
  model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_word_ids (InputLayer  [(None, 75)]                 0         []                            
 )                                                                                                
                                                                                                  
 input_mask (InputLayer)     [(None, 75)]                 0         []                            
                                                                                                  
 input_type_ids (InputLayer  [(None, 75)]                 0         []                            
 )                                                                                                
                                                                                              

## Train the Model

Use early stopping to control the number of epochs.

In [13]:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',  # You can change this to 'val_loss'
    patience=3,              # Stop after 3 epochs with no improvement
    restore_best_weights=True
)

model.fit(train_input,
          train.label.values,
          validation_data=(valid_input, valid.label.values),
          epochs=50,
          verbose=1,
          batch_size=64,
          callbacks=[early_stopping])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


<tf_keras.src.callbacks.History at 0x78c2801ba980>
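Training stopped after epoch 8 of 50. With `patience=3`, that is consistent with validation accuracy last improving at epoch 5, although the per-epoch metrics are not shown in the output. The sketch below is a simplified version of the patience logic in plain Python, not Keras's actual implementation. The function `early_stopping_epoch` and the validation scores are hypothetical.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (epoch at which training stops, epoch with the best score)."""
    best = float('-inf')
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; the best epoch's weights would be restored
    return len(val_scores), best_epoch


# Hypothetical validation accuracies, one per epoch
toy_scores = [0.50, 0.55, 0.58, 0.60, 0.64, 0.63, 0.62, 0.63, 0.70]

stopped_at, best_at = early_stopping_epoch(toy_scores, patience=3)
print(stopped_at, best_at)
```

In this toy run the ninth epoch would have been better, which shows the trade-off: a small `patience` saves time but can stop before a later improvement.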

## Estimate Accuracy on the Validation Set

In [17]:
from sklearn.metrics import accuracy_score

class_probabilities = model.predict(valid_input)

predictions_valid = class_probabilities.argmax(axis=-1)

accuracy_score(y_true=valid.label.values,
               y_pred=predictions_valid)



0.6386138613861386
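The steps from model output to this accuracy figure can be sketched in plain Python: the final layer turns three scores into probabilities with softmax, the loss used during training is the negative log-probability of the true class, and the prediction is the index of the largest probability. All function names and the toy logits and labels below are hypothetical; the notebook itself relies on Keras and scikit-learn for these steps.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[label])


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical logits for three examples and their true labels
toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [0.0, 1.5, 1.0]]
toy_labels = [0, 2, 2]

toy_probs = [softmax(row) for row in toy_logits]
toy_losses = [sparse_categorical_crossentropy(p, y) for p, y in zip(toy_probs, toy_labels)]
toy_preds = [argmax(p) for p in toy_probs]
toy_accuracy = accuracy(toy_labels, toy_preds)
print(toy_preds, toy_accuracy)
```

With three roughly balanced classes, always predicting one class would score about one third, which gives a baseline for reading the accuracy above.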

## Make Predictions on the Test Set

In [18]:
class_probabilities_test = model.predict(test_input)

predictions_test = class_probabilities_test.argmax(axis=-1)

submission = pd.DataFrame({'id': test.id, 'prediction': predictions_test})

submission.to_csv(
    'submission.csv',
    index=False)



## Hypothesis: The model performs better in English

If this is true, can we translate each sentence to English before classification?

In [22]:
valid\
    .assign(
        pred = predictions_valid,
        correct = lambda df: df.label.eq(df.pred)
    )\
    .groupby('language')\
    .correct\
    .mean()\
    .sort_values()

language
Thai          0.472973
Greek         0.546667
Russian       0.546667
Urdu          0.552632
German        0.557143
Turkish       0.571429
Swahili       0.584416
French        0.602564
Bulgarian     0.608696
Vietnamese    0.644737
Hindi         0.653333
English       0.668122
Chinese       0.670732
Spanish       0.684932
Arabic        0.687500
Name: correct, dtype: float64
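A per-group mean hides how many examples each group contains, and an accuracy computed from a small group is a noisier estimate. The sketch below reports the group size next to each accuracy, in plain Python. The function `accuracy_by_group` and the toy data are hypothetical; in the notebook the same information could be obtained by aggregating `correct` with both a mean and a count.

```python
from collections import defaultdict


def accuracy_by_group(groups, y_true, y_pred):
    """Return {group: (accuracy, number of examples)}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: (hits[group] / totals[group], totals[group]) for group in totals}


# Hypothetical toy data, not the real validation set
toy_languages = ['English', 'English', 'English', 'Thai', 'Thai']
toy_true = [0, 1, 2, 0, 1]
toy_pred = [0, 1, 0, 0, 2]

by_language = accuracy_by_group(toy_languages, toy_true, toy_pred)
for language, (acc, n) in sorted(by_language.items(), key=lambda kv: kv[1][0]):
    print(f'{language}: accuracy={acc:.3f} (n={n})')
```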

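English accuracy (0.668) is only slightly above the overall validation accuracy (0.639), and Chinese, Spanish, and Arabic all score higher than English, so these results do not clearly support the hypothesis.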

## Look at some incorrect predictions

In [43]:
label_map = {
    0: 'Entail',
    1: 'Neutral',
    2: 'Contra'
}

valid\
    .assign(
        _pred = predictions_valid,
        correct = lambda df: df.label.eq(df._pred),
        incorrect = lambda df: ~df.correct,
        pred = lambda df: df._pred.replace(label_map),
        label = lambda df: df.label.replace(label_map)
    )\
    .groupby(['language', 'label', 'pred'], as_index=False)\
    .agg(
        num_incorrect = ('incorrect', 'sum')
    )\
    .sort_values(by=['num_incorrect'], ascending=False)\
    .loc[lambda df: df.num_incorrect.gt(0)]\
    .head(10)

|  | language | label | pred | num_incorrect |
| --- | --- | --- | --- | --- |
| 29 | English | Contra | Neutral | 101 |
| 32 | English | Entail | Neutral | 98 |
| 30 | English | Entail | Contra | 82 |
| 28 | English | Contra | Entail | 69 |
| 33 | English | Neutral | Contra | 67 |
| 34 | English | Neutral | Entail | 39 |
| 50 | German | Entail | Neutral | 14 |
| 100 | Thai | Contra | Entail | 11 |
| 95 | Swahili | Entail | Neutral | 9 |
| 122 | Urdu | Entail | Neutral | 9 |
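Summing `incorrect` over each (language, label, pred) combination amounts to counting the off-diagonal cells of a confusion matrix. A plain-Python sketch of the same idea for a single language follows. The functions `confusion_counts` and `top_errors` and the toy labels are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def top_errors(counts, n):
    """Return the n most frequent pairs where the prediction was wrong."""
    errors = {pair: count for pair, count in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical toy labels, not the real validation set
toy_true = ['Contra', 'Contra', 'Entail', 'Entail', 'Neutral', 'Contra']
toy_pred = ['Neutral', 'Neutral', 'Entail', 'Neutral', 'Neutral', 'Entail']

counts = confusion_counts(toy_true, toy_pred)
worst = top_errors(counts, n=1)
print(worst)
```

Raw error counts grow with group size, so dividing each count by the number of examples in its language would make the languages comparable.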


In [45]:
valid\
    .assign(
        _pred = predictions_valid,
        correct = lambda df: df.label.eq(df._pred),
        incorrect = lambda df: ~df.correct,
        pred = lambda df: df._pred.replace(label_map),
        label = lambda df: df.label.replace(label_map)
    )\
    .loc[lambda df: df.language.eq('English') & ~df.correct & df.label.eq('Contra')]\
    [['premise', 'hypothesis', 'label', 'pred']]\
    .sample(10, random_state=1)\
    .style.hide(axis='index')

| premise | hypothesis | label | pred |
| --- | --- | --- | --- |
| Using teams can also assist in integrating different perspectives, flattening organizational structure, and streamlining operations. | Organizational structure isn't one of the issues that the team has been known to assist with. | Contra | Neutral |
| Despite their 17th-century origins, these gardens avoid the rigid geometry of the Tuileries and Versailles. | These gardens were around well before the 17th-century. | Contra | Neutral |
| Jon's defense began to weaken and slow. | Jon felt stronger and more defensive than ever. | Contra | Neutral |
| well this is real interesting that you're as far away as you are because i really thought this was uh uh we're | you're so nearby, it's surprising. | Contra | Entail |
| The Tunnel of Eupalinos can be explored but it's not for the claustrophobic. | The tunnel of Eupalinos is only one foot in diameter, barely large enough for a child to squeeze through. | Contra | Neutral |
| Pro-choicers point out that these close-up images literally cut the fetus's context--the woman--out of the picture. | Pro-choices say the close-up images are fair. | Contra | Neutral |
| Closed on the Sabbath. | Sabbath is closed. | Contra | Entail |
| I see, said Tuppence thoughtfully. | "I can't comprehend it," said Tuppence fitfully. | Contra | Neutral |
| it would probably be a lot more work and probably not turn out as good | Oh that way sounds great, it could turn out even better | Contra | Entail |
| DOD's common practice for managing this environment has been to create aggressive risk reduction efforts in its programs. | The DOD increases risk to manage the environment. | Contra | Neutral |
