# Group 42 - COMP34812



**Task A :** Natural Language Inference (NLI)

*Given a premise and a hypothesis, determine if the hypothesis is true based on the
premise. You will be given more than 26K premise-hypothesis pairs as training data, and
more than 6K pairs as validation data.*

**Solution C :** Deep learning-based approaches underpinned by transformer architectures

*Our final model is an ensemble that combines the predictions of three transformer models (T5, RoBERTa and Flan-T5) by hard voting. Each pre-trained model was fine-tuned on the task dataset, with a BiLSTM layer added to its classification head. Starting from pre-trained weights rather than training from scratch is intended to give faster convergence and better performance.*

**Group 42 :** Aisha Wahid & Libby Walton

## Preparing Dataset

Required Environment

In [1]:
import os
import numpy as np
os.environ["KERAS_BACKEND"] = "tensorflow"
%env TF_USE_LEGACY_KERAS=1
import tensorflow as tf

env: TF_USE_LEGACY_KERAS=1


Cloning Resources from Git

In [2]:
!git clone https://github.com/aishawahid/COMP34812.git resources

Cloning into 'resources'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 24 (delta 1), reused 9 (delta 1), pack-reused 15
Receiving objects: 100% (24/24), 41.77 MiB | 27.81 MiB/s, done.
Resolving deltas: 100% (2/2), done.


Loading test data

In [3]:
import pandas as pd

test_df = pd.read_csv('/content/resources/Data/test.csv')
test_df['hypothesis'] = test_df['hypothesis'].astype(str)

## Loading Models

### Tokenisers

T5Tokenizer

In [4]:
import tensorflow as tf
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration

# T5 tokeniser
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base")

def t5_encode(premises, hypotheses, tokenizer, max_length=120):
    # Join each pair into one string: "<premise> [SEP] <hypothesis>"
    concatenated_inputs = [p + ' [SEP] ' + h for p, h in zip(np.array(premises), np.array(hypotheses))]

    inputs = tokenizer(
        concatenated_inputs,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

    return {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask']
    }

# Tokenize test data
test_input_T5 = t5_encode(test_df.premise.values, test_df.hypothesis.values, tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


RobertaTokenizer

In [5]:
import tensorflow as tf
import numpy as np
from transformers import RobertaTokenizer

# RoBERTa Tokeniser
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

def roberta_encode(premises, hypotheses, tokenizer, max_length=120):
    # Join each pair into one string: "<premise> </s> <hypothesis>"
    concatenated_inputs = [p + ' </s> ' + h for p, h in zip(np.array(premises), np.array(hypotheses))]

    inputs = tokenizer(
        concatenated_inputs,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

    return {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask']
    }

# Tokenise test data
test_input_RB = roberta_encode(test_df.premise.values, test_df.hypothesis.values, tokenizer)

Flan-T5Tokenizer

In [6]:
import tensorflow as tf
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration

# FLAN T5 Tokeniser
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

def flan_t5_encode(premises, hypotheses, tokenizer, max_length=120):
    # Join each pair into one string: "<premise> [SEP] <hypothesis>"
    concatenated_inputs = [p + ' [SEP] ' + h for p, h in zip(np.array(premises), np.array(hypotheses))]

    inputs = tokenizer(
        concatenated_inputs,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='tf'
    )

    return {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask']
    }

# Tokenise test data
test_input_FLAN = flan_t5_encode(test_df.premise.values, test_df.hypothesis.values, tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Load Models

Downloading the fine-tuned models from Google Drive (public links)

In [7]:
# Create a new folder called "Models" in /content/resources/
!mkdir -p /content/resources/Models

# Download the models from google drive into the "Models" folder
!gdown --output "/content/resources/Models/t5_model.h5" 1cF2SIFknn-DVoYo2xjKP4raSmLAlyvIU
!gdown --output "/content/resources/Models/t5_flan_model.h5" 1-D7BhUqyX0kDixryv6WbDv5HX2kPnbbs
!gdown --output "/content/resources/Models/roberta_model.h5" 1-L4AsV0kLzZ49DWK9bYq4OTslVc8y1_P

Downloading...
From (original): https://drive.google.com/uc?id=1cF2SIFknn-DVoYo2xjKP4raSmLAlyvIU
From (redirected): https://drive.google.com/uc?id=1cF2SIFknn-DVoYo2xjKP4raSmLAlyvIU&confirm=t&uuid=3c0d8c83-74dd-4748-8a7b-cda21d146b79
To: /content/resources/Models/t5_model.h5
100% 881M/881M [00:05<00:00, 166MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1-D7BhUqyX0kDixryv6WbDv5HX2kPnbbs
From (redirected): https://drive.google.com/uc?id=1-D7BhUqyX0kDixryv6WbDv5HX2kPnbbs&confirm=t&uuid=eb75f6b5-afc1-42f6-8e7b-9c6f331d2fe6
To: /content/resources/Models/t5_flan_model.h5
100% 881M/881M [00:05<00:00, 159MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1-L4AsV0kLzZ49DWK9bYq4OTslVc8y1_P
From (redirected): https://drive.google.com/uc?id=1-L4AsV0kLzZ49DWK9bYq4OTslVc8y1_P&confirm=t&uuid=0a195f0c-b69a-4c5d-be7d-ae072cc89c72
To: /content/resources/Models/roberta_model.h5
100% 1.00G/1.00G [00:06<00:00, 160MB/s]
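
Before loading, it can help to confirm that each download actually produced a non-empty file, since a failed download may only show up later as a confusing load error. The helper below is a minimal sketch of such a check; `missing_or_empty` is a hypothetical name introduced here, not part of the original notebook.

```python
from pathlib import Path

def missing_or_empty(paths):
    """Return the paths that are not regular files or have zero size."""
    return [str(p) for p in map(Path, paths)
            if not p.is_file() or p.stat().st_size == 0]

# Paths used by the download cell above
model_paths = [
    "/content/resources/Models/t5_model.h5",
    "/content/resources/Models/t5_flan_model.h5",
    "/content/resources/Models/roberta_model.h5",
]

problems = missing_or_empty(model_paths)
if problems:
    print("Missing or empty model files:", problems)
else:
    print("All model files are present.")
```

If any path is reported, re-running the corresponding `gdown` line is the obvious first step.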


Loading models using transformers

In [8]:
import transformers

# Load the saved Keras models, registering the transformers classes they contain
modelT5 = tf.keras.models.load_model('/content/resources/Models/t5_model.h5', custom_objects={"TFT5EncoderModel": transformers.TFT5EncoderModel})
modelRoBERTa = tf.keras.models.load_model('/content/resources/Models/roberta_model.h5', custom_objects={"TFRobertaModel": transformers.TFRobertaModel})
modelFlanT5 = tf.keras.models.load_model('/content/resources/Models/t5_flan_model.h5', custom_objects={"TFT5EncoderModel": transformers.TFT5EncoderModel})



### Ensemble

In [9]:
from collections import Counter

# Generate predictions for each model
predictionsT5 = np.argmax(modelT5.predict(test_input_T5), axis=1)
predictionsRoBERTa = np.argmax(modelRoBERTa.predict(test_input_RB), axis=1)
predictionsT5Flan = np.argmax(modelFlanT5.predict(test_input_FLAN), axis=1)

# Perform hard voting: each example gets the label predicted by most models
ensemble_predictions = []
for pred_t5, pred_roberta, pred_flan in zip(predictionsT5, predictionsRoBERTa, predictionsT5Flan):
    votes = Counter([pred_t5, pred_roberta, pred_flan])
    ensemble_predictions.append(votes.most_common(1)[0][0])
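
The voting step can be illustrated on toy data without loading any model. The sketch below uses made-up class scores for three hypothetical models, takes the arg-max per example, and then applies the same `Counter`-based majority vote. The helper names `argmax` and `hard_vote` are introduced here for illustration only.

```python
from collections import Counter

def argmax(scores):
    """Index of the largest score (the first index wins on equal scores)."""
    return max(range(len(scores)), key=scores.__getitem__)

def hard_vote(*model_predictions):
    """Majority label per example.

    On a tie, Counter.most_common returns the tied label that was
    encountered first, i.e. the one from the earliest-listed model.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Made-up class scores: three examples, two classes, three hypothetical models
scores_a = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]
scores_b = [[0.3, 0.7], [0.6, 0.4], [0.7, 0.3]]
scores_c = [[0.6, 0.4], [0.2, 0.8], [0.8, 0.2]]

preds = [[argmax(row) for row in scores]
         for scores in (scores_a, scores_b, scores_c)]
print(hard_vote(*preds))  # [1, 0, 0]
```

With three voters and two labels a tie cannot occur, so the tie-breaking rule only matters if the ensemble is extended to more classes or to an even number of models.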



### Writing predicted labels to csv

Generating Dataframe

In [10]:
pd.set_option('display.max_rows', None)
# One integer label per test example
result_df = pd.DataFrame({'prediction': ensemble_predictions})
result_df['prediction'] = result_df['prediction'].astype(int)
# Prepend the column name as a data row, because the CSV is written with header=False
column_name_row = pd.DataFrame({'prediction': ['prediction']}, index=[0])
result_df = pd.concat([column_name_row, result_df]).reset_index(drop=True)
result_df

| | prediction |
|---|---|
| 0 | prediction |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |


Writing to CSV

In [11]:
result_df.to_csv('Group_42_C.csv', encoding='utf-8', index=False, header=False)
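
A quick sanity check on the written file can catch formatting slips before submission. The sketch below assumes the file should contain a single `prediction` header line followed by one label per test example, and that the labels are 0 and 1 (as in the output shown above). `check_submission` is a hypothetical helper, and the demo runs on a throwaway file rather than the real submission.

```python
import csv
import os
import tempfile

def check_submission(path, expected_rows, allowed_labels=(0, 1)):
    """Return a list of problems found in a predictions CSV (empty if none)."""
    allowed = {str(label) for label in allowed_labels}  # assumed label set
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows or rows[0] != ["prediction"]:
        problems.append("first line is not the 'prediction' header")
    labels = rows[1:]
    if len(labels) != expected_rows:
        problems.append(f"expected {expected_rows} labels, found {len(labels)}")
    for line_no, row in enumerate(labels, start=2):
        if len(row) != 1 or row[0] not in allowed:
            problems.append(f"line {line_no}: unexpected value {row!r}")
    return problems

# Demo on a tiny throwaway file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        f.write("prediction\n1\n0\n1\n")
    demo_problems = check_submission(demo_path, expected_rows=3)

print(demo_problems)  # []
```

For the real file, the call would be `check_submission('Group_42_C.csv', expected_rows=len(test_df))`.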