# Task B: BiLSTM Model - Demo
This notebook loads the trained model, the tokenizer file saved during training, and the test data, then generates binary predictions (0 or 1) indicating whether each hypothesis is logically entailed by its corresponding premise.

This demo performs the following:

- Loads and preprocesses the test.csv evaluation dataset.
- Loads our trained BiLSTM model.
- Runs model.predict() to predict the labels for the given inputs.
- Saves the predictions to Group_18_B.csv.

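*_Note: Please make sure the trained model (best_nli_model_B.keras) and the tokenizer file (tokenizer.pkl) have been downloaded and are available in the working directory before running this notebook._*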
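
Before running the cells below, it can help to confirm that the files this notebook expects are actually in place. The following is a minimal sketch using only the standard library; `find_missing` is a hypothetical helper, and the file names are simply the ones used later in this notebook, so adjust the paths if your files live elsewhere.

```python
from pathlib import Path

# File names as they appear later in this notebook; adjust to your environment.
REQUIRED_FILES = [
    "best_nli_model_B.keras",  # trained BiLSTM model
    "tokenizer.pkl",           # tokenizer saved during training
    "/content/test.csv",       # evaluation data
]


def find_missing(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


missing = find_missing(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```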


# Installing Required Packages

In [None]:
!pip install nltk



# Data Loading, Model Loading and NLTK Setup

Loads the saved model and the test dataset, and downloads the NLTK resources necessary for preprocessing.

In [None]:
import pandas as pd
import tensorflow as tf
import nltk
from tensorflow.keras.models import load_model
import tensorflow.keras.backend as K
nltk.download('punkt_tab')
nltk.download('wordnet')

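#the saved model references a custom function named 'abs_diff',
#so it must be defined here and passed to load_model via custom_objects
def abs_diff(x):
    return K.abs(x)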

model = load_model("best_nli_model_B.keras", custom_objects={'abs_diff': abs_diff})
df_test = pd.read_csv("/content/test.csv")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


# Text Preprocessing for Premises and Hypotheses

This step cleans the text data for both the premise and hypothesis columns in the dataset.

## Preprocessing Pipeline Includes:
- Lowercasing the text
- Removing non-alphabetic characters
- Tokenising using NLTK's word_tokenize
- Lemmatising using WordNetLemmatizer
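
To make the first two steps concrete, here is a small sketch of the cleaning stage using only the standard library. It is an illustration, not the notebook's pipeline: `clean_text` and the sample sentence are hypothetical, `str.split()` stands in for `word_tokenize`, and lemmatisation is left out so the example has no NLTK dependency.

```python
import re


def clean_text(text):
    """Lowercase, trim, and drop every character that is not a letter or whitespace."""
    return re.sub(r"[^A-Za-z\s]", "", text.lower().strip())


sample = "The cats aren't sleeping, at 9pm!"
cleaned = clean_text(sample)
tokens = cleaned.split()  # stand-in for word_tokenize; not identical in general

print(cleaned)  # the cats arent sleeping at pm
print(tokens)   # ['the', 'cats', 'arent', 'sleeping', 'at', 'pm']
```

Note that digits and apostrophes are removed along with punctuation, so "aren't" becomes "arent" and "9pm" becomes "pm".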

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re


lemmatizer = WordNetLemmatizer()

#Preprocess the data
def preprocessing(texts):
    preprocessed_texts = []

    for text in texts:
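        if not isinstance(text, str) or not text.strip():  #handle missing or empty text
            preprocessed_texts.append([])  #keep one entry per row so premise/hypothesis pairs stay aligned
            continue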

        #clean the text: lowercase, strip whitespace, and remove non-alphabetic characters
        cleaned_text = re.sub(r'[^A-Za-z\s]', '', text.lower().strip())

        #tokenise the cleaned text
        tokenized_words = word_tokenize(cleaned_text)

        #lemmatise each token
        lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_words]

        #append the lemmatised words to the list
        preprocessed_texts.append(lemmatized_words)

    return preprocessed_texts


#apply preprocessing
premise_test = preprocessing(df_test['premise'])
hypothesis_test = preprocessing(df_test['hypothesis'])


# Prepare Test Data
In this step, the preprocessed premise and hypothesis texts are converted into padded sequences of integers.

## What This Step Does:
### Tokenisation
- We load the tokeniser file saved during training. This ensures consistency: words are mapped to the same indices the model saw during training.
- The tokenised texts are then converted into sequences of integers.

### Padding Sequences
- During training, we calculated the maximum sequence length (max_len) across the training data to standardise input shapes for the model. This value must remain consistent during evaluation to ensure that the input shapes match the model's expected dimensions.
- We then use pad_sequences to pad each sequence to this length using 'post' padding.
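
The two steps above can be illustrated with a dependency-free sketch. It is not the Keras implementation: `texts_to_ids`, `pad_post` and the toy vocabulary are hypothetical. It assumes the saved tokeniser was built without an `oov_token` (in which case Keras skips unknown words) and mirrors the `pad_sequences` default of truncating over-long sequences from the front.

```python
def texts_to_ids(token_lists, word_index):
    """Map tokens to integer ids, skipping words that are not in word_index."""
    return [[word_index[w] for w in tokens if w in word_index] for tokens in token_lists]


def pad_post(sequences, maxlen, value=0):
    """Pad on the right with `value`; over-long sequences keep their last `maxlen` ids."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # drop ids from the front, like truncating='pre'
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Toy vocabulary; in this notebook the real mapping comes from tokenizer.pkl.
word_index = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
token_lists = [["the", "cat", "sat"], ["the", "dog", "sat", "on", "the", "mat"]]

ids = texts_to_ids(token_lists, word_index)
print(ids)               # [[1, 2, 3], [1, 3, 4, 1, 5]]
print(pad_post(ids, 4))  # [[1, 2, 3, 0], [3, 4, 1, 5]]
```

Because "dog" is not in the toy vocabulary it is dropped. A tokeniser fitted again on the test data would assign different ids, which is why the training-time tokeniser is reused here.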

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pickle

#load the tokenizer
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)


max_len = 281  #must match the max_len used during training

#convert to sequences using the loaded tokenizer
premise_test_seq = tokenizer.texts_to_sequences(premise_test)
hypothesis_test_seq = tokenizer.texts_to_sequences(hypothesis_test)

#pad sequences to max_len using 'post' padding
premise_test_pad = pad_sequences(premise_test_seq, maxlen=max_len, padding='post')
hypothesis_test_pad = pad_sequences(hypothesis_test_seq, maxlen=max_len, padding='post')

# Final Prediction on Test Data
We use the loaded model to generate predictions on the unseen test data.

1. Predict Probabilities: We use the trained model to predict the probabilities of entailment for each input pair in the test data.
2. Apply Optimal Threshold: We apply the best threshold (approximately 0.49), chosen on the development set as the value giving the highest Matthews Correlation Coefficient (MCC).
3. Save Predictions: The binary predictions are saved to a CSV file named Group_18_B.csv.
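
The threshold tuning itself is not part of this notebook. As a rough illustration of how a threshold can be picked by MCC, here is a minimal pure-Python sketch. The helper names (`mcc`, `best_threshold`), the candidate grid, and the toy labels and probabilities are all made up for the example and are not the values used in our tuning.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, probs, candidates):
    """Return (threshold, mcc) with the highest MCC; ties keep the earliest candidate."""
    best = (None, -2.0)  # MCC is always >= -1, so any candidate beats this
    for thresh in candidates:
        preds = [int(p > thresh) for p in probs]  # same strict '>' as the cell below
        score = mcc(y_true, preds)
        if score > best[1]:
            best = (thresh, score)
    return best


# Toy dev-set labels and predicted probabilities (illustrative only).
y_dev = [0, 0, 1, 1, 1, 0]
p_dev = [0.12, 0.42, 0.48, 0.81, 0.93, 0.62]
candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95

thresh, score = best_threshold(y_dev, p_dev, candidates)
print(thresh, round(score, 3))  # 0.45 0.707
```

On real data a finer grid is usually worthwhile, and the chosen threshold should come from the development set only, never from the test set.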

In [None]:
#predict probabilities
preds_test = model.predict([premise_test_pad, hypothesis_test_pad])
best_thresh = 0.48999999999999977 #best threshold from tuning on the dev set

#apply threshold
pred_labels_test = (preds_test > best_thresh).astype(int)

#save to CSV
prediction_df = pd.DataFrame({'prediction': pred_labels_test.flatten()})
prediction_df.to_csv("Group_18_B.csv", index=False)

[1m104/104[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m79s[0m 730ms/step
