## Introduction

In this task we aim to predict whether an article, whose text is given in `body`, is relevant to one of 50 searches, each described by `topic_title`, `narrative`, and `description`.

We implemented several supervised machine learning models to predict the relevance of unseen article-search pairs.

Our models consist of a logistic regression model, a single-hidden-layer neural network, a deep neural network, and two models with more complex architectures.
While the more complex architectures outperformed the simpler models, all models fall within a narrow range of 83% to 88% accuracy.

Our main takeaway from these models is that models which take in two inputs are more effective in search relevance problems.



## Method
### The Data
We have a labeled set of 19758 articles and their respective searches, together with their relevance, and 4884 unlabeled article-search pairs.

For each article-search pair we used the following attributes:

*   `doc_id` - unique id for each article-search pair
*   `topic_title` - title of the topic of the search
*   `description` - description of the information needed in the search
*   `narrative` - description of what is relevant for the search
*   `author` - name of the author of the article
*   `title` - title of the article
*   `body` - text of the main body of the article

For the labeled data we also have:

*   `judgement` - relevance of the article-search pair: 0 if irrelevant, 1 if relevant.


Note that `topic_title`, `description`, and `narrative` are perfectly correlated: each search has exactly one of each, so any one of them identifies the search.
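
One way to check this kind of claim is to count how many distinct `description` and `narrative` values occur for each `topic_title`. The sketch below does so on a small made-up frame; `toy` and its values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy stand-in for df_train (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'topic_title': ['solar power', 'solar power', 'river pollution'],
    'description': ['uses of solar energy', 'uses of solar energy', 'causes of polluted rivers'],
    'narrative': ['relevant if it covers solar', 'relevant if it covers solar', 'relevant if it names a cause'],
})

# Number of distinct descriptions/narratives observed per topic title
distinct_per_topic = toy.groupby('topic_title')[['description', 'narrative']].nunique()

# True when every topic title maps to exactly one description and one narrative
one_to_one = bool((distinct_per_topic == 1).all().all())
print(one_to_one)
```

Running the same two lines on `df_train` instead of `toy` would confirm or refute the statement for the real data.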


In [None]:
#Select parquet files with train and test data from device if relevant
from google.colab import files
uploaded = files.upload()

In [None]:
#Import necessary libraries for code execution. Note some of these libraries may require additional downloading of packages
import numpy as np
import pandas as pd
import scipy
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from keras.layers import Input, Embedding, Convolution2D, MaxPooling2D, Flatten, Dense
from keras.models import Model
from keras.layers import Reshape
from keras.layers import concatenate
from keras.layers import LSTM
from keras.optimizers import Adam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re
import matplotlib.pyplot as plt

In [None]:
#Import data. May need to modify depending on the path in your machine
df_train = pd.read_parquet('relevance_train.parquet')
df_test = pd.read_parquet('relevance_test.parquet')

In [None]:
#Observe sample of training data and their labels
df_train.head()

In [None]:
#Observe sample of test data and their labels
df_test.head()

### Data Pre-processing:
As we are dealing mostly with text data, we need to turn it into a form from which a model can pick up the text's structure, semantics, and context.

To do that we first remove noise: HTML tags, newline tokens, stopwords, digits, and punctuation.

We then reduce each word to its base form so that only its meaning is kept. We tried both lemmatization and stemming; stemming yields slightly better results on our data. Finally, the text must be vectorised (in our case with sklearn's `TfidfVectorizer`) before a model can use it. Vectorisation is applied after the data has been split into training and validation sets.

Additional pre-processing involves converting NaN values into strings for type consistency.
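
To make the order of the cleaning steps concrete, here is a simplified sketch that uses only the standard library. The stopword set `TOY_STOPWORDS` and the helper `clean` are illustrative assumptions: the notebook itself uses NLTK's English stopword list and also applies stemming, which is left out here. This sketch also replaces the newline token with a space so that neighbouring words stay separate.

```python
import re

# Tiny illustrative stopword list (assumption; the notebook uses NLTK's English list)
TOY_STOPWORDS = {'the', 'a', 'of', 'in', 'and'}

def clean(text):
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)       # drop HTML tags
    text = text.replace('\\n', ' ')          # drop literal "\n" tokens
    text = re.sub(r'\d', '', text)           # drop digits
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    return ' '.join(words)

sample = '<p>The 3 Reports\\nof 2021, in Brief!</p>'
print(clean(sample))  # reports brief
```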


In [None]:
#Suggested downloads to run code. May be unnecessary if your environment already contains them
# Uncomment for download

# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download("wordnet")
# nltk.download("omw-1.4")

In [None]:
#Function to remove English stopwords from a sentence

#Fetch stopwords in English
stopset = set(stopwords.words('english'))

# Define a function to remove stopwords from a sentence
def remove_stop_words(sentence):
  '''
  sentence --->  unprocessed sentence (str)
  output ---> sentence without stop words (str)
  '''
  # Split the sentence into individual words
  words = sentence.split()

  # List comprehension to remove stopwords
  filtered_words = [word for word in words if word not in stopset]

  # Join the filtered words back into a sentence
  return ' '.join(filtered_words)

In [None]:
#Define a function to stem all words in a sentence
ps = PorterStemmer()
def stem(sentence):
    '''
    sentence --->  unprocessed sentence (str)
    output ---> sentence with just stem words (str)
    '''
    stemmed = []
    # Split the sentence into individual words
    words = sentence.split()
    for word in words:
        stemmed.append(ps.stem(word))
    #Join the stemmed words back into a sentence
    return ' '.join(stemmed)

In [None]:
#NOT USED IN OUR FINAL PRE-PROCESSING METHOD
#Define a function to find all lemmas of the words in a sentence
wl = WordNetLemmatizer()
def lemma(sentence):
    '''
    sentence --->  unprocessed sentence (str)
    output ---> sentence with just lemma of words (str)
    '''
    lemmas = []
    # Split the sentence into individual words
    words = sentence.split()
    for word in words:
        lemmas.append(wl.lemmatize(word))
    #Join the lemmatised words back into a sentence
    return ' '.join(lemmas)

In [None]:
#Define a function to perform language processing on a dataframe column
def nlp (iterable):
    '''
    iterable --->  unprocessed sentences (iterable of str)
    output ---> cleaned, pre-processed sentences (list of str)
    '''
    processed = []
    for element in iterable:
        element = str(element)
        element = element.lower()
        element = ''.join([i for i in element if not i.isdigit()]) #Remove digits from string
        element = re.sub(r"<.*?>", " ", element) #Remove all HTML tags
        element = element.replace('\\n', '') #Remove literal "\n" new line tokens
        element = re.sub(r'[^\w\s]','',element) #Remove punctuation
        element = remove_stop_words(element) #Call remove stopwords function
        #element = lemma(element) #does not add much to the stemming. performs worse than stemming
        element = stem(element)
        #Add processed element into the list
        processed.append(element)
    return processed

In [None]:
#Label encoding for searches. Helps determine the correlation between topic_title, description, and narrative
def id_creator (iterable):
    '''
    iterable --->  unprocessed sentences (iter)
    output ---> label encoded unique ids, one per sentence (list)
    '''
    ids = []
    id_dict = {}
    for item in iterable:
        item = str(item)
        if item not in id_dict:
            id_dict[item] = len(id_dict) + 1
        ids.append(id_dict[item])
    return ids

In [None]:
#Generate DataFrame with the processed text for train
processed_train = {'id': df_train['doc_id'],
                   'author': df_train['author'],
                   'title': nlp(df_train['title']),
                   'topic_id': df_train['topic_id'],
                   'topic': nlp(df_train['topic_title']),
                   'body': nlp(df_train['body']),
                   'description_id': id_creator(df_train['description']),
                   'description': nlp(df_train['description']),
                   'narrative_id': id_creator(df_train['narrative']),
                   'narrative': nlp(df_train['narrative']),
                   'judgement':df_train['judgement']}

#Convert to dataframe
df_train_processed = pd.DataFrame(processed_train)

In [None]:
#Generate DataFrame with the processed text for test
processed_test = {'id': df_test['doc_id'],
                   'author': df_test['author'],
                   'title': nlp(df_test['title']),
                   'topic_id': df_test['topic_id'],
                   'topic': nlp(df_test['topic_title']),
                   'body': nlp(df_test['body']),
                   'description_id': id_creator(df_test['description']),
                   'description': nlp(df_test['description']),
                   'narrative_id': id_creator(df_test['narrative']),
                   'narrative': nlp(df_test['narrative'])}

#Convert to dataframe
df_test_processed = pd.DataFrame(processed_test)

In [None]:
#Optional storage. Used to not run pre-processing each time we tested a model

#Export processed text to excel
df_train_processed.to_excel('relevance_train_processed.xlsx')
df_test_processed.to_excel('relevance_test_processed.xlsx')

In [None]:
#Re-upload the stored processed files if starting from a new session
from google.colab import files
uploaded = files.upload()


In [None]:
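#Reload the processed text. index_col=0 restores the index written by to_excel instead of adding an extra 'Unnamed: 0' column
df_train_processed = pd.read_excel('relevance_train_processed.xlsx', index_col=0)
df_test_processed = pd.read_excel('relevance_test_processed.xlsx', index_col=0)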

### Down-sampling

Further inspection shows that the classes are imbalanced: 84.28% of the labeled pairs are irrelevant and only 15.72% are relevant. Such a disparity could skew models towards predicting irrelevance. Down-sampling the majority class reduces the amount of training data but should give better-fitted models. We run our model validation on both the full and the down-sampled training data.
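
As a rough worked example of what down-sampling costs, the arithmetic below estimates the class counts from the figures quoted in this report. The percentages are rounded, so the counts are approximations rather than exact values read from the data.

```python
# Figures quoted in the text
n_labeled = 19758
relevant_share = 0.1572

# Approximate class counts (the share is rounded, so these are estimates)
n_relevant = round(n_labeled * relevant_share)
n_irrelevant = n_labeled - n_relevant

# Down-sampling keeps every minority row and an equal number of majority rows
n_balanced = 2 * min(n_relevant, n_irrelevant)
discarded = n_labeled - n_balanced

print(n_relevant, n_irrelevant, n_balanced, discarded)
```

Under these assumptions the balanced set keeps roughly 6,200 of the 19,758 labeled pairs, so about two thirds of the labeled data is not used for training.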

In [None]:
#Determine percentage of relevant samples
relevant_count = sum(df_train_processed['judgement'])
relevant_percentage = relevant_count/len(df_train_processed['judgement'])

print('irrelevant:', 1 - relevant_percentage, 'relevant:', relevant_percentage)

In [None]:
#Function to produce downsampling with selected seed
def down_sampling(X, seed = 42):
  '''
  Apply down sampling to dataframe
  X ---> train set data (pandas DataFrame). Must contain judgement column!
  seed ---> selection of random seed that will be used (int)
  '''
  #Create separate dataframes for relevant and irrelevant searches (0, 1 judgements)
  df_relevant = X[X['judgement']==1]
  df_irrelevant = X[X['judgement']==0]

  #Determine which class contains more data (majority) and which contains less (minority)
  down_needed, down_no_needed = df_irrelevant, df_relevant
  if df_relevant.shape[0] > df_irrelevant.shape[0]:
    down_needed, down_no_needed = df_relevant, df_irrelevant

  #Down-sample the majority class to the size of the minority class
  df_downsampled = down_needed.sample(down_no_needed.shape[0], random_state = seed)

  #Concatenate the balanced classes and return the final dataframe
  return pd.concat([df_downsampled, down_no_needed])

### Determining the validation set

To build the validation set, we arbitrarily select 20% of the topics, together with their corresponding articles. Since the test data does not share any topics with the training data, we believe this split is representative of the problem at hand.
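
The point of splitting by topic rather than by row is that no topic ends up on both sides of the split. The plain-Python sketch below illustrates the idea on made-up rows; `topic_split` and the toy `rows` are hypothetical and only mirror the logic described here.

```python
import random

def topic_split(rows, size=0.2, seed=42):
    """Split rows so that no topic_id appears in both train and validation."""
    topics = sorted({row['topic_id'] for row in rows})
    rng = random.Random(seed)
    n_val = max(1, round(len(topics) * size))
    val_topics = set(rng.sample(topics, n_val))
    train = [r for r in rows if r['topic_id'] not in val_topics]
    val = [r for r in rows if r['topic_id'] in val_topics]
    return train, val

# Toy data: 10 topics with 3 article-search pairs each (hypothetical)
rows = [{'topic_id': t, 'doc_id': f'{t}-{i}'} for t in range(10) for i in range(3)]
train, val = topic_split(rows)

train_topics = {r['topic_id'] for r in train}
val_topics = {r['topic_id'] for r in val}
print(len(train), len(val), train_topics & val_topics)
```

A row-wise random split would almost certainly place pairs from the same topic in both sets, which would make validation easier than the real test.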

In [None]:
#Function to perform 20% topic validation sets
def validation_split(X, size = 0.2, seed = 42):
  '''
  Select percentage of the topics as validation data
  X ---> train set data (pandas DataFrame). Must contain topic_id column!
  size ---> percentage in decimal form of topics selected (float)
  seed ---> selection of random seed that will be used (int)
  output ---> train (pandas DataFrame), val (pandas DataFrame)
  '''
  #Get all the unique topic ids
  topics = X['topic_id'].unique()

  #Select the arbitrary validation topics
  topics_val = pd.Series(topics).sample(frac=size, random_state = seed)

  #Split df into training and validation sets
  train = X[~X['topic_id'].isin(topics_val)]
  val = X[X['topic_id'].isin(topics_val)]

  return train, val

## Models

### "Sparse" model
We acknowledge the 84% accuracy that would be obtained from assigning all entries in the training set a 0 in the `judgement` attribute, and wish to include an “everything is deemed irrelevant” model as our first benchmark.
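
A small sketch of why this benchmark matters: on labels with roughly the reported class balance, predicting "irrelevant" everywhere scores high accuracy while finding no relevant pair at all. The helper functions and the toy labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    # Share of truly relevant pairs that were predicted relevant
    relevant = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(relevant) / len(relevant)

# Toy labels with roughly the reported class balance (84 irrelevant, 16 relevant)
y_true = [0] * 84 + [1] * 16
y_pred = [0] * len(y_true)   # "everything is irrelevant"

print(accuracy(y_true, y_pred), recall(y_true, y_pred))
```

Any model we train therefore has to be judged against this accuracy, not against 50%.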

In [None]:
def sparse(tbd):
  '''
  tbd ---> data for which we want a prediction (numpy array)
  output ---> prediction array
  '''

  #Create a bucket of 0s of the same length as the labelled data
  return [0] * len(list(tbd))

### Standard Machine Learning model
Using standard machine learning techniques, we take the `topic_title`, `description`, `narrative`, `body`, and `title` attributes of the training data as features and `judgement` as the label, and fit a random forest model to predict `judgement` on the validation data. Other machine learning models were tested and yielded less accurate results, and some (like logistic regression) did not converge within 10000 iterations.
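
For orientation, a minimal sketch of how text could be turned into features for such a model is shown below, assuming TF-IDF vectors fitted on the training split only. The toy texts and labels are made up, and joining search and article into one string is an assumption of this sketch, not necessarily the exact feature layout used in our experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, already-cleaned texts (hypothetical): search and article joined into one string
train_texts = ['solar energi panel solar power plant',
               'solar energi footbal match result',
               'river pollut chemic spill river',
               'river pollut stock market report']
train_labels = [1, 0, 1, 0]
val_texts = ['solar energi new solar farm', 'river pollut tenni final']

# Fit the vocabulary and IDF weights on the training split only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)   # words unseen in training are ignored

model = LogisticRegression(max_iter=10000)
model.fit(X_train, train_labels)
predictions = model.predict(X_val)
print(X_train.shape, X_val.shape, predictions)
```

The resulting sparse matrices can be passed to the `standard` or `forest` functions in the same way.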

In [None]:
def standard(X, y, tbd):
  '''
  X ---> training data tokenised (numpy array)
  y ---> labels of training data (numpy array)
  tbd ---> data for which we want a prediction (numpy array)
  output ---> prediction array
  '''

  # Train logistic regression model
  model = LogisticRegression(max_iter=10000)
  model.fit(X, y)

  #Return the prediction
  return model.predict(tbd)

In [None]:
def forest(X, y, tbd, estimators = 100):
  '''
  X ---> training data tokenised (numpy array)
  y ---> labels of training data (numpy array)
  tbd ---> data for which we want a prediction (numpy array)
  estimators ---> number of trees in the forest (int)
  output ---> prediction array
  '''
  model = RandomForestClassifier(n_estimators = estimators, random_state=42)
  model.fit(X, y)
  return model.predict(tbd)

### 3-layer Neural Network
Our training data is fed into a neural network of the following structure:

Layer 1: Input layer (number of neurons dependent on input shape)

Layer 2: Hidden layer (variable neurons)

Layer 3: Output layer (1 neuron)

We vary the number of neurons in the hidden layer to check its impact on performance in our validation set.

We tested several activation functions and layer types. Our final model is a Keras `Sequential` model with two dense layers: the hidden layer uses a Rectified Linear Unit (ReLU) activation function, and the output layer uses a sigmoid function.
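
To show what this architecture computes, here is a NumPy sketch of a single forward pass through one ReLU hidden layer and a sigmoid output neuron. The layer sizes and the random weights are illustrative assumptions; in the notebook the weights are learned by Keras.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a single sigmoid output neuron."""
    hidden = relu(X @ W1 + b1)        # shape: (n_samples, n_hidden)
    return sigmoid(hidden @ W2 + b2)  # shape: (n_samples, 1), values in (0, 1)

rng = np.random.default_rng(0)
n_features, n_hidden = 5, 3           # illustrative sizes
X = rng.normal(size=(4, n_features))  # 4 toy samples
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)

probs = forward(X, W1, b1, W2, b2)
print(probs.shape)
```

Each output can be read as the predicted probability that the article-search pair is relevant.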


In [None]:
def threeNN (X, y, X_val, y_val, hidden, input_shape, epochs = 10, batch = 32):
  '''
  X ---> training data tokenised (numpy array)
  y ---> labels of training data (numpy array)
  X_val ---> validation data tokenised (numpy array)
  y_val ---> labels of validation data (numpy array)
  hidden ---> number of neurons in hidden layer (int)
  input_shape ---> number of neurons in input layer (int)
  epochs ---> number of epochs that the network is trained (int)
  batch ---> number of samples in a batch (int)
  output1 ---> last_accuracy, final accuracy for training data (float)
  output2 ---> val_accuracy, final accuracy for validation data (float)
  '''

  #Create 3 layer architecture
  model = Sequential()
  model.add(Dense(hidden, activation='relu', input_shape = (input_shape,)))
  model.add(Dense (1, activation='sigmoid'))

  #Compile the model
  model.compile(optimizer ='adam', loss = 'binary_crossentropy', metrics=['accuracy'])

  #Fit model
  history = model.fit(X, y, epochs = epochs, batch_size = batch, validation_data = (X_val, y_val))

  #Evaluate model
  loss, accuracy = model.evaluate(X_val, y_val)

  #For graphics extract last accuracy
  last_accuracy = history.history['accuracy'][-1]
  val_accuracy = history.history['val_accuracy'][-1]

  return last_accuracy, val_accuracy


### Deep Neural Network
Expanding on our 3-layer Neural Network model, we use our best performing iteration of the 3-layer neural network and add hidden layers.

We test several configurations with different amounts of hidden layers, following the same structure as our 3-layer Neural Network hidden layer.
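
Adding hidden layers increases the number of trainable parameters, which is worth keeping in mind when comparing configurations. The helper below counts weights and biases for a stack of equally sized dense layers followed by one output neuron; `dense_param_count` and the sizes used are illustrative assumptions, not values from our experiments.

```python
def dense_param_count(input_shape, hidden, hidden_layers):
    """Weights and biases in a stack of equally sized dense layers plus one output neuron."""
    first = input_shape * hidden + hidden                      # input -> first hidden layer
    middle = (hidden_layers - 1) * (hidden * hidden + hidden)  # hidden -> hidden
    output = hidden + 1                                        # last hidden layer -> output neuron
    return first + middle + output

# Illustrative sizes (assumptions, not the values used in our experiments)
for layers in (1, 2, 4):
    print(layers, dense_param_count(input_shape=1000, hidden=64, hidden_layers=layers))
```

With a wide input, most parameters sit in the first layer, so extra hidden layers add comparatively little capacity.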


In [None]:
def DNN (X, y, X_val, y_val, hidden, hidden_layers, input_shape, epochs = 10, batch =32):
  '''
  X ---> training data tokenised (numpy array)
  y ---> labels of training data (numpy array)
  X_val ---> validation data tokenised (numpy array)
  y_val ---> labels of validation data (numpy array)
  hidden ---> number of neurons in hidden layer (int)
  hidden_layers ---> number of hidden layers in the network (int)
  input_shape ---> number of neurons in input layer (int)
  epochs ---> number of epochs that the network is trained (int)
  batch ---> number of samples in a batch (int)
  output1 ---> last_accuracy, final accuracy for training data (float)
  output2 ---> val_accuracy, final accuracy for validation data (float)
  '''

  #Build model architecture
  model = Sequential()
  model.add(Dense(hidden, activation = 'relu', input_shape = (input_shape,)))
  for i in range(hidden_layers - 1):
    model.add(Dense(hidden, activation = 'relu'))
  model.add(Dense(1, activation = 'sigmoid'))


  #Compile the model
  model.compile(optimizer ='adam', loss = 'binary_crossentropy', metrics=['accuracy'])

  #Fit model
  history = model.fit(X, y, epochs = epochs, batch_size = batch, validation_data = (X_val, y_val))

  #Evaluate model
  loss, accuracy = model.evaluate(X_val, y_val)

  #For graphics extract last accuracy
  last_accuracy = history.history['accuracy'][-1]
  val_accuracy = history.history['val_accuracy'][-1]

  return last_accuracy, val_accuracy

### Complex model 1
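MatchPyramid is a semantic matching model [2], introduced by Pang et al. in "Text Matching as Image Recognition" (2016). It compares the words of the search with the words of the article to build a matching matrix, and then treats that matrix like an image. Our implementation below is a simplified variant: it embeds the search and the article separately, passes both through shared convolution and pooling layers, and concatenates the flattened features.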

The algorithm contains a complex architecture including embedding, convolutional, and pooling layers. After passing through those layers it outputs a matching score indicating how relevant the article is to the search.
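
As we understand the MatchPyramid formulation, the comparison between the two texts takes the form of a 2-D matching matrix whose entry (i, j) scores search word i against article word j. The NumPy sketch below builds such a matrix with a dot-product interaction, which is one of several possible choices; the random embeddings stand in for learned ones and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 4                                   # illustrative size
search_emb = rng.normal(size=(3, embedding_dim))    # 3 search tokens
article_emb = rng.normal(size=(6, embedding_dim))   # 6 article tokens

# Entry (i, j) is the dot product of search token i and article token j
matching_matrix = search_emb @ article_emb.T
print(matching_matrix.shape)
```

Convolution and pooling layers can then look for local patterns in this matrix, much as they would in an image.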


In [None]:
def matchPyramid (X_train_article, X_train_topic, y_train, X_val_article, X_val_topic, y_val, input_shape, vocab_size, epochs = 10, embedding_dim = 100):
    '''
    X_train_article ---> training data for articles tokenised (numpy array)
    X_train_topic ---> training data for searches tokenised (numpy array)
    y_train ---> labels of training data (numpy array)
    X_val_article ---> validation data for articles tokenised (numpy array)
    X_val_topic ---> validation data for searches tokenised (numpy array)
    y_val ---> labels of validation data (numpy array)
    input_shape ---> length of the padded input sequences (int)
    vocab_size ---> number of tokens in the vocabulary, including the padding index (int)
    epochs ---> number of epochs that the network is trained (int)
    embedding_dim ---> size of the embedding vectors (int)
    output1 ---> last_accuracy, final accuracy for training data (float)
    output2 ---> val_accuracy, final accuracy for validation data (float)
    '''

    # Input layers
    input_article = Input(shape=(input_shape,))
    input_topic = Input(shape=(input_shape,))
    def build_matchpyramid_model(input_shape, vocab_size, embedding_dim=100, num_filters=16, kernel_size=(3, 3), output_size=1):
      # Embedding layer
      embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
      embedded_article = embedding_layer(input_article)
      embedded_topic = embedding_layer(input_topic)

      # Reshape inputs to add channel dimension
      reshape_layer = Reshape((input_shape, embedding_dim, 1))
      reshaped_article = reshape_layer(embedded_article)
      reshaped_topic = reshape_layer(embedded_topic)

      # Convolutional layers
      convolution_layer = Convolution2D(filters=num_filters, kernel_size=kernel_size, activation='relu')
      conv_article = convolution_layer(reshaped_article)
      conv_topic = convolution_layer(reshaped_topic)

      # Max pooling layers
      max_pooling_layer = MaxPooling2D(pool_size=(2, 2))
      pool_article = max_pooling_layer(conv_article)
      pool_topic = max_pooling_layer(conv_topic)

      # Flatten layers
      flatten_layer = Flatten()
      flat_article = flatten_layer(pool_article)
      flat_topic = flatten_layer(pool_topic)

      # Merge layers
      merged_layer = Dense(64, activation='relu')(concatenate([flat_article, flat_topic]))

      # Output layer
      output_layer = Dense(output_size, activation='sigmoid')(merged_layer)

      #Compile the model
      model = Model(inputs=[input_article, input_topic], outputs=output_layer)
      model.compile(optimizer ='adam', loss = 'binary_crossentropy', metrics=['accuracy'])
      return model

    #Create model
    model = build_matchpyramid_model(input_shape, vocab_size, embedding_dim)

    #Fit model
    history = model.fit([X_train_article, X_train_topic], y_train, validation_data=([X_val_article, X_val_topic], y_val), epochs=epochs, batch_size=32)

    #Evaluate model
    loss, accuracy = model.evaluate([X_val_article, X_val_topic], y_val)

    #Extract last accuracy
    last_accuracy = history.history['accuracy'][-1]
    val_accuracy = history.history['val_accuracy'][-1]

    return last_accuracy, val_accuracy

In [None]:
#Function to associate relevance based on match connections and a threshold
def map_to_relevance(y_pred, threshold=0.5):
    '''
    y_pred ---> list of values between 0 and 1 (list)
    threshold ---> value between 0 and 1 (float)
    output ---> list of relevance labels, 0 or 1 (list)
    '''
    connectedness_predictions = []
    for pred in y_pred:
        if pred >= threshold:
            connectedness_predictions.append(1)  # Connected
        else:
            connectedness_predictions.append(0)  # Not connected
    return connectedness_predictions

### Complex model 2
This custom complex model creates tailored LSTM models for different groups of searches, which are related to each other.

It begins by combining all data into a single DataFrame and clusters the data based on the searches using k-means clustering. For each cluster it creates a separate LSTM model with a core LSTM layer of 64 memory units, followed by a sigmoid output layer that determines relevance.
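
With one model per cluster, a new sample has to be routed to a model before it can be scored. One plausible routing, and an assumption of this sketch rather than something the notebook code shows, is to pick the cluster with the nearest centroid, which is also how k-means assigns points. The centroids and the stand-in `models` below are hypothetical.

```python
import numpy as np

def nearest_cluster(x, centroids):
    """Index of the centroid closest to x (Euclidean distance)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# Toy centroids and stand-in "models" (hypothetical; real ones would be fitted LSTMs)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
models = {0: lambda x: 0.2, 1: lambda x: 0.9}

sample = np.array([9.0, 8.5])
cluster = nearest_cluster(sample, centroids)
score = models[cluster](sample)
print(cluster, score)  # 1 0.9
```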


In [None]:
# Define LSTM model architecture
def create_lstm_model(vocab_size, max_seq_length):
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=128, input_length=max_seq_length),
        LSTM(64),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
def clusters(X, y, X_val, y_val, vocab_size, k=20, epoch=10, seed=42):
    '''
    X ---> training data tokenised (numpy array)
    y ---> labels of training data (numpy array)
    X_val ---> validation data tokenised (numpy array)
    y_val ---> labels of validation data (numpy array)
    vocab_size ---> number of tokens in vocabulary (int)
    k ---> number of clusters (int)
    epoch ---> number of epochs that the network is trained (int)
    seed ---> selection of random seed that will be used (int)
    output ---> training history of each cluster's LSTM model, keyed by cluster index (dict)
    '''
    # Convert arrays to DataFrames
    X = pd.DataFrame(X)
    X_val = pd.DataFrame(X_val)

    # Concatenate the target values for training and validation
    judge = np.concatenate((y, y_val)).astype('float32')

    # Add 'train' and 'val' labels to the data
    X['status'] = 'train'
    X_val['status'] = 'val'

    # Concatenate the DataFrames
    all_X = pd.concat([X, X_val], ignore_index=True)

    # Perform clustering on the token sequences only
    kmeans = KMeans(n_clusters=k, random_state=seed)
    cluster_labels = kmeans.fit_predict(all_X.drop(columns=['status']))

    # Pad or truncate the validation sequences once, without the 'status' column
    max_seq_length = 100
    val_padded = pad_sequences(X_val.drop(columns=['status']).values, maxlen=max_seq_length)
    y_val_float32 = np.asarray(y_val).astype('float32')

    # Train one LSTM model per cluster
    lstm_models = {}
    histories = {}
    is_train = (all_X['status'] == 'train').values
    for cluster in range(k):
        # Use only the training rows of this cluster so validation labels do not leak into training
        mask = (cluster_labels == cluster) & is_train
        if not mask.any():
            continue

        # Pad sequences to ensure consistent length
        texts = all_X[mask].drop(columns=['status'])
        texts_padded = pad_sequences(texts.values, maxlen=max_seq_length)

        # Extract judgments for the current cluster
        cluster_judgments = judge[mask]

        # Train LSTM model
        lstm_model = create_lstm_model(vocab_size, max_seq_length)
        histories[cluster] = lstm_model.fit(texts_padded, cluster_judgments, validation_data=(val_padded, y_val_float32), epochs=epoch, batch_size=32)
        lstm_models[cluster] = lstm_model

    return histories

## Results
### Training the models

In order to train our models we condense our input attributes into 4:

*   `article`: concatenation of `title` and `body`
*   `search`: concatenation of `topic_title`, `narrative`, and `description`
*   `relevant`: labelled relevance
*   `topic_id`: relevant unique topic identifier (for validation set split purposes)

The attributes `article` and `search` are constructed separately because some of our models take them as two separate inputs for comparison.


In [None]:
#Function to create condensed data
def condensed (X):
  '''
  X --> train or test data (Pandas dataframe). Must contain `title`, `body`, `topic`, `narrative`, and `description` attributes
  output --> condensed data frame: `article`, `search`, and `relevant` if train; `article` and `search` if test (Pandas dataframe)
  '''
  #Ensure all NaN values and floats can be processed as strings
  df = X.astype(str)

  #Join the text attributes with a space so words at the boundaries do not merge
  data = {'article': df['title'] + ' ' + df['body'],
          'search': df['topic'] + ' ' + df['narrative'] + ' ' + df['description']}

  #Labelled (training) data also carries the judgement, kept in its original numeric type
  if 'judgement' in X.columns:
    data['relevant'] = X['judgement']
  return pd.DataFrame(data)


For a model to interpret our data, the text must first be vectorised.
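
As a rough picture of what vectorisation into padded sequences means, the plain-Python sketch below maps each word to an integer and left-pads every sequence with zeros to a common length. `build_word_index` and `to_padded` are hypothetical helpers; the Keras `Tokenizer` used in the notebook assigns indices by word frequency, so the actual numbers will differ.

```python
def build_word_index(texts):
    """Map each distinct word to a positive integer; 0 is reserved for padding."""
    index = {}
    for text in texts:
        for word in text.split():
            index.setdefault(word, len(index) + 1)
    return index

def to_padded(texts, index, max_len):
    """Convert texts to integer sequences, left-padded with zeros to max_len."""
    padded = []
    for text in texts:
        # Keep the last max_len known words, then pad on the left
        seq = [index[w] for w in text.split() if w in index][-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

texts = ['solar power plant', 'river pollut']
index = build_word_index(texts)
print(to_padded(texts, index, max_len=4))  # [[0, 1, 2, 3], [0, 0, 4, 5]]
```

Padding every sequence to the same length is what lets the article and search inputs be stacked into fixed-shape arrays.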

In [None]:
#Function to tokenise and convert train data into sequences + padding to ensure the same length
def token_train(X_train, X_val):
    '''
    X_train --> condensed train data (Pandas dataframe). Must contain `article` and `search` attributes
    X_val --> condensed validation data (Pandas dataframe). Must contain `article` and `search` attributes
    output --> list of numpy array of 'article' data tokenised and padded, and numpy array of 'search' data tokenised and padded
    '''
    print("Sample of input data:")
    print(X_train.head())

    #Ensure all elements are strings
    X_train = X_train.astype(str)
    X_val = X_val.astype(str)

    # Combine training and validation data for tokenization (concat avoids index-aligned string addition)
    all_text = pd.concat([X_train['article'], X_train['search'], X_val['article'], X_val['search']])

    # Tokenize
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(all_text)

    # Convert to sequences and pad sequences to the maximum length
    sequences_train_article = tokenizer.texts_to_sequences(X_train['article'])
    sequences_train_search = tokenizer.texts_to_sequences(X_train['search'])
    sequences_val_article = tokenizer.texts_to_sequences(X_val['article'])
    sequences_val_search = tokenizer.texts_to_sequences(X_val['search'])

    max_seq_length = max(
        max(len(seq) for seq in sequences_train_article),
        max(len(seq) for seq in sequences_train_search),
        max(len(seq) for seq in sequences_val_article),
        max(len(seq) for seq in sequences_val_search)
    )

    article_padded_train = pad_sequences(sequences_train_article, maxlen=max_seq_length)
    search_padded_train = pad_sequences(sequences_train_search, maxlen=max_seq_length)
    article_padded_val = pad_sequences(sequences_val_article, maxlen=max_seq_length)
    search_padded_val = pad_sequences(sequences_val_search, maxlen=max_seq_length)

    print("Tokenized sequences:")
    print(sequences_train_article[:5])  # Print the first 5 sequences
    print("Sequence lengths:")
    print([len(seq) for seq in sequences_train_article])  # Print the lengths of tokenized sequences

    return [article_padded_train, search_padded_train], [article_padded_val, search_padded_val]


In [None]:
#Function to tokenise and convert test and train data into sequences + padding to ensure the same length
#Needs to include train and test data when used on this dataset so vectorisation is consistent throughout


def token_test(X, Y):
    '''
    X --> condensed train data (Pandas dataframe). Must contain `article` and `search` attributes
    Y --> condensed test data (Pandas dataframe). Must contain `article` and `search` attributes
    output --> to_train and to_test, each a list of two numpy arrays (`article` and `search` data tokenised and padded), and the fitted tokenizer
    '''
    # Ensure all elements are strings
    X = X.astype(str)
    Y = Y.astype(str)

    # Combine training and testing data for calculating max sequence length and tokens
    # The space keeps the last word of an article from merging with the first word of its search
    combined_texts = pd.concat([X['article'], Y['article']]) + ' ' + pd.concat([X['search'], Y['search']])

    # Tokenize
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(combined_texts)

    # Convert to sequences
    sequences_article = tokenizer.texts_to_sequences(X['article'])
    sequences_search = tokenizer.texts_to_sequences(X['search'])
    sequences_article_test = tokenizer.texts_to_sequences(Y['article'])
    sequences_search_test = tokenizer.texts_to_sequences(Y['search'])

    # Debugging: print out sequence lengths
    print("Sequence lengths before padding:")
    print("Train Article:", [len(seq) for seq in sequences_article])
    print("Train Search:", [len(seq) for seq in sequences_search])
    print("Test Article:", [len(seq) for seq in sequences_article_test])
    print("Test Search:", [len(seq) for seq in sequences_search_test])

    # Pad sequences to same length
    max_seq_length = max(len(seq) for seq in sequences_article + sequences_search + sequences_article_test + sequences_search_test)
    article_padded = pad_sequences(sequences_article, maxlen=max_seq_length)
    search_padded = pad_sequences(sequences_search, maxlen=max_seq_length)
    article_padded_test = pad_sequences(sequences_article_test, maxlen=max_seq_length)
    search_padded_test = pad_sequences(sequences_search_test, maxlen=max_seq_length)

    # Debugging: print out shape of padded arrays
    print("Shape of padded arrays:")
    print("Train Article:", article_padded.shape)
    print("Train Search:", search_padded.shape)
    print("Test Article:", article_padded_test.shape)
    print("Test Search:", search_padded_test.shape)

    # Organize return data
    to_train = [np.array(article_padded), np.array(search_padded)]
    to_test = [np.array(article_padded_test), np.array(search_padded_test)]

    return to_train, to_test, tokenizer




In [None]:
#Function to separate x and y inputs for data training and fitting
def separate_train (X, down = True, size = 0.2, seed = 42):
  '''
  X ---> train data (Pandas dataframe). Must contain `title`, `body`, `topic`, `narrative`, `description` and `topic_id` attributes
  down ---> indicate whether we want to apply downsampling on data (bool). Default True
  size ---> size of our generated topic selection for validation data (float). Default 0.2
  seed ---> selection of random seed that will be used (int)
  output ---> X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val (all numpy arrays), and the fitted tokenizer
  '''
  #Downsample if specified
  if down:
    X = down_sampling(X, seed)

  #Separate into training and validation
  train, val = validation_split(X, size, seed)

  #Condense dataframe
  con_train = condensed(train)
  con_val = condensed(val)

  #Separate judgement from the rest of the data
  y_train, y_val =  np.array(con_train['relevant']), np.array(con_val['relevant'])
  x_train, x_val =  con_train.drop('relevant', axis=1), con_val.drop('relevant', axis=1)


  #Implement tokenisation of X
  tokens_train, tokens_val, tokenizer = token_test(x_train, x_val)

  #Separate into article and topics
  X_train_article = np.array(tokens_train[0])
  X_val_article = np.array(tokens_val[0])
  X_train_search = np.array(tokens_train[1])
  X_val_search = np.array(tokens_val[1])

  print(np.array(tokens_train).shape, np.array(tokens_val).shape)

  return X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer




In [None]:
#Function to build the final training inputs and labels, plus the unlabeled test inputs
def separate_test (X, Y, down = True, size = 0.2, seed = 42):
  '''
  X ---> train data (Pandas dataframe). Must contain `title`, `body`, `topic`, `narrative`, `description` and `topic_id` attributes
  Y ---> test data (Pandas dataframe). Must contain `title`, `body`, `topic`, `narrative`, `description` and `topic_id` attributes
  down ---> indicate whether we want to apply downsampling on the train data (bool). Default True
  size ---> not used in this function; kept so the signature matches separate_train (float). Default 0.2
  seed ---> random seed used for down sampling (int). Default 42
  output ---> X_train_article, X_test_article, X_train_search, X_test_search, y_train all numpy arrays
  '''
  #Downsample if specified
  if down:
    X = down_sampling(X, seed)

  #Condense dataframe
  con_train = condensed(X)
  con_test = condensed(Y)

  #Separate judgement from the rest of the data
  y_train=  np.array(con_train['relevant'])
  x_train, x_test =  con_train.drop('relevant', axis=1), con_test


  #Implement tokenisation of X (token_test also returns the tokenizer, which is not needed here)
  tokens_train, tokens_test, _ = token_test(x_train, x_test)

  #Separate into article and topics
  X_train_article = np.array(tokens_train[0])
  X_test_article = np.array(tokens_test[0])
  X_train_search = np.array(tokens_train[1])
  X_test_search = np.array(tokens_test[1])

  return X_train_article, X_test_article, X_train_search, X_test_search, y_train


### "Sparse" and Standard Machine Learning model

The sparse benchmark behaves as expected: circa 84% accuracy without down sampling and circa 50% on the down sampled data.

For the standard benchmark, we tried multiple models, including linear models, logistic regression, and random forests. Some of them, such as logistic regression, failed to converge. The best performing model on our validation set was the random forest, which scored slightly above the sparse benchmark: circa 86% without down sampling and circa 53% on the down sampled data.
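
To make the two benchmark figures concrete, the sketch below assumes the sparse benchmark simply predicts the majority class (irrelevant, `0`) for every pair. This is an assumption: the `sparse` function itself is defined earlier in the notebook and may differ. The helper name `majority_baseline_accuracy` and the toy label lists are illustrative only. Under this assumption, accuracy equals the share of the majority class, which would explain a score of about 50% on a balanced (down sampled) set.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label (hypothetical helper)."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Toy labels, not the notebook's data: an imbalanced set and a balanced set
imbalanced = [0] * 84 + [1] * 16
balanced = [0] * 50 + [1] * 50

acc_imbalanced = majority_baseline_accuracy(imbalanced)
acc_balanced = majority_baseline_accuracy(balanced)
print(acc_imbalanced, acc_balanced)
```

Any model trained on the imbalanced data has to beat the first figure to be useful, which is why accuracy alone is a demanding yardstick here.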


In [None]:
#Sparse and standard validation with down sampling
acc_ds_sparse, acc_ds_standard = [], []
for i in range(10):
    X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train(df_train_processed, down=True, size=0.2, seed=i)
    X_train = np.concatenate((X_train_article, X_train_search), axis=1)
    X_val = np.concatenate((X_val_article, X_val_search), axis=1)
    print(X_train.shape)
    pred_sparse = sparse(X_val)
    acc_ds_sparse.append(accuracy_score(y_val, pred_sparse))
    #pred_standard = standard(X_train, y_train, X_val)
    pred_standard = forest(X_train, y_train, X_val, 100)
    acc_ds_standard.append(accuracy_score(y_val, pred_standard))
print(acc_ds_sparse, acc_ds_standard)


In [None]:
#Graphs
import matplotlib.pyplot as plt

indices = [i for i in range(10)]

#Plot both accuracies against the seed used for each run
plt.figure(figsize=(10, 6))
plt.plot(indices, acc_ds_sparse, marker='o', label='Sparse Accuracy')
plt.plot(indices, acc_ds_standard, marker='o', label='Standard Accuracy')
plt.title('Sparse and Standard Accuracy (DownSampling)')
plt.xlabel('Seed')
plt.ylabel('Accuracy')
plt.xticks(indices, labels=[str(i) for i in indices])
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
#Sparse and standard validation without down sampling
acc_nds_sparse, acc_nds_standard = [], []
for i in range(10):
    X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train(df_train_processed, down=False, size=0.2, seed=i)
    X_train = np.concatenate((X_train_article, X_train_search), axis=1)
    X_val = np.concatenate((X_val_article, X_val_search), axis=1)
    print(X_train.shape)
    pred_sparse = sparse(X_val)
    acc_nds_sparse.append(accuracy_score(y_val, pred_sparse))
    #pred_standard = standard(X_train, y_train, X_val)
    pred_standard = forest(X_train, y_train, X_val, 100)
    acc_nds_standard.append(accuracy_score(y_val, pred_standard))
print(acc_nds_sparse, acc_nds_standard)


In [None]:
#Graphs
indices = [i for i in range(10)]

#Plot
plt.figure(figsize=(10, 6))
plt.plot(indices, acc_nds_sparse, marker='o', label='Sparse Accuracy')
plt.plot(indices, acc_nds_standard, marker='o', label='Standard Accuracy')
plt.title('Sparse and Standard Accuracy (no DownSampling)')
plt.xlabel('Seed')
plt.ylabel('Accuracy')
plt.xticks(indices, labels=[str(i) for i in indices])  # linear axis: seed 0 has no position on a log scale
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

### 3NN
We run a three-layer neural network and vary the number of neurons in its hidden layer over the powers of two from 1 to 2048.

In [None]:
X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = True, size = 0.2, seed = 42)

#Concatenate article and search tokens into a single input per pair
X_train = np.concatenate((X_train_article, X_train_search), axis=1)
X_val = np.concatenate((X_val_article, X_val_search), axis=1)


#Storing second dimension for initialisation
second_dimension = X_train.shape[1]

train_acc_3d = []
val_acc_3d =[]
for i in range(12):
  last_accuracy, val_accuracy = threeNN(X_train, y_train, X_val, y_val, 2**i, second_dimension, 10, 32)
  train_acc_3d.append(last_accuracy)
  val_acc_3d.append(val_accuracy)

print(train_acc_3d)
print(val_acc_3d)

In [None]:
X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = False, size = 0.2, seed = 42)

#Concatenate article and search tokens into a single input per pair
X_train = np.concatenate((X_train_article, X_train_search), axis=1)
X_val = np.concatenate((X_val_article, X_val_search), axis=1)


#Storing second dimension for initialisation
second_dimension = X_train.shape[1]

train_acc_3nd = []
val_acc_3nd =[]
for i in range(12):
  last_accuracy, val_accuracy = threeNN(X_train, y_train, X_val, y_val, 2**i, second_dimension, 10, 32)
  train_acc_3nd.append(last_accuracy)
  val_acc_3nd.append(val_accuracy)

print(train_acc_3nd)
print(val_acc_3nd)

In [None]:
#Graphs
indices = [2**i for i in range(12)]

#Plot
plt.figure(figsize=(10, 6))
plt.plot(indices, train_acc_3d, marker='o', label='Training Accuracy')
plt.plot(indices, val_acc_3d, marker='o', label='Validation Accuracy')
plt.title('Training and Validation Accuracy vs. Neurons in Hidden Layer (DownSampling)')
plt.xlabel('Number of Neurons')
plt.ylabel('Accuracy')
plt.xscale('log')  # Setting x-axis to logarithmic scale
plt.xticks(indices, labels=[str(i) for i in indices])
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
#Graphs
indices = [2**i for i in range(12)]

#Plot
plt.figure(figsize=(10, 6))
plt.plot(indices, train_acc_3nd, marker='o', label='Training Accuracy')
plt.plot(indices, val_acc_3nd, marker='o', label='Validation Accuracy')
plt.title('Training and Validation Accuracy vs. Neurons in Hidden Layer (No DownSampling)')
plt.xlabel('Number of Neurons')
plt.ylabel('Accuracy')
plt.xscale('log')  # Setting x-axis to logarithmic scale
plt.xticks(indices, labels=[str(i) for i in indices])
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()

As the accuracy plots show, with and without down sampling the model over-fits once the hidden layer grows beyond 128 neurons: training accuracy keeps rising while validation accuracy does not. The best balance between training and validation accuracy is reached with 16 to 32 neurons in the hidden layer.

The accuracy of this model is low, close to a random allocation of relevance. Without down sampling, it always performs below the sparse benchmark.
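
A sweep like this can also be summarised programmatically. The sketch below picks the hidden-layer size with the highest validation accuracy and reports the gap between training and validation accuracy for each size, a simple indicator of over-fitting. The helper `summarise_sweep` and the toy numbers are hypothetical; in the notebook the inputs would be lists such as `train_acc_3d` and `val_acc_3d`.

```python
def summarise_sweep(sizes, train_acc, val_acc):
    """Return the size with the best validation accuracy and the train-validation gap per size."""
    gaps = {size: tr - va for size, tr, va in zip(sizes, train_acc, val_acc)}
    best_index = max(range(len(sizes)), key=lambda k: val_acc[k])
    return sizes[best_index], gaps


# Illustrative numbers only, not results from this notebook
sizes = [2**i for i in range(5)]  # 1, 2, 4, 8, 16
toy_train = [0.50, 0.55, 0.60, 0.70, 0.90]
toy_val = [0.50, 0.52, 0.54, 0.53, 0.51]

best_size, gaps = summarise_sweep(sizes, toy_train, toy_val)
print("Best hidden size:", best_size)
print("Train-validation gap per size:", gaps)
```

A gap that widens as the layer grows, while validation accuracy stalls or drops, is the pattern described above.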

The next step is to add more hidden layers and check whether predictions improve.


### DNN
As can be seen from the graphs, increasing the number of hidden layers improves training accuracy but does not significantly improve performance on the validation set. This finding is consistent with other research in the field that shows no significant improvements [1].

In [None]:
X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = True, size = 0.2, seed = 42)

#Concatenate article and search tokens into a single input per pair
X_train = np.concatenate((X_train_article, X_train_search), axis=1)
X_val = np.concatenate((X_val_article, X_val_search), axis=1)


#Storing second dimension for initialisation
second_dimension = X_train.shape[1]

train_acc_dnnds = []
val_acc_dnnds =[]
for i in range(8):
  last_accuracy, val_accuracy = DNN(X_train, y_train, X_val, y_val, 16, i+3, second_dimension, 30, 32)
  train_acc_dnnds.append(last_accuracy)
  val_acc_dnnds.append(val_accuracy)

print(train_acc_dnnds)
print(val_acc_dnnds)

In [None]:
X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = False, size = 0.2, seed = 42)

#Concatenate article and search tokens into a single input per pair
X_train = np.concatenate((X_train_article, X_train_search), axis=1)
X_val = np.concatenate((X_val_article, X_val_search), axis=1)


#Storing second dimension for initialisation
second_dimension = X_train.shape[1]

train_acc_dnnnds = []
val_acc_dnnnds =[]
for i in range(8):
  last_accuracy, val_accuracy = DNN(X_train, y_train, X_val, y_val, 16, i+3, second_dimension, 30, 32)
  train_acc_dnnnds.append(last_accuracy)
  val_acc_dnnnds.append(val_accuracy)

print(train_acc_dnnnds)
print(val_acc_dnnnds)

These results show that deeper neural networks bring no significant improvement over a standard machine learning approach. We hypothesise that one reason is that relevance depends on two inputs, the article and the search, whereas our standard machine learning, 3-layer neural network, and deep neural network models only take a single, concatenated input.

### MatchPyramid

In [None]:
for i in range(10):
  X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = False, size = 0.2, seed = i)

  #Storing second dimension for initialisation
  second_dimension = max(X_train_article.shape[1], X_train_search.shape[1])
  vocab_size = len(tokenizer.word_index) + 1

  matchPyramid(X_train_article, X_train_search, y_train, X_val_article, X_val_search, y_val, second_dimension, vocab_size, epochs = 2, embedding_dim = 100)

In [None]:
X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = True, size = 0.2, seed = 1)

#Storing second dimension for initialisation
second_dimension = max(X_train_article.shape[1], X_train_search.shape[1])
vocab_size = len(tokenizer.word_index) + 1

matchPyramid(X_train_article, X_train_search, y_train, X_val_article, X_val_search, y_val, second_dimension, vocab_size, epochs = 4, embedding_dim = 100)

MatchPyramid [2] significantly improves on the sparse benchmark, reaching 88% validation accuracy. We tried several iterations of the MatchPyramid algorithm, modifying the embedding dimension. The results were promising, as the deviation in accuracy is very low.
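
For readers unfamiliar with the architecture, MatchPyramid [2] starts from a matching matrix that holds a similarity score between every search token and every article token; a convolutional network then processes that matrix much like an image. The sketch below builds such a matrix using dot-product similarity over toy embeddings. The function name `matching_matrix`, the choice of dot product, and the embedding values are illustrative assumptions and are not taken from the `matchPyramid` function used in this notebook.

```python
def matching_matrix(search_ids, article_ids, embeddings):
    """Dot-product similarity between every search token and every article token."""
    matrix = []
    for s in search_ids:
        row = []
        for a in article_ids:
            row.append(sum(x * y for x, y in zip(embeddings[s], embeddings[a])))
        matrix.append(row)
    return matrix


# Toy 2-dimensional embeddings for a 4-token vocabulary (id 0 stands for padding)
toy_embeddings = {
    0: [0.0, 0.0],
    1: [1.0, 0.0],
    2: [0.0, 1.0],
    3: [1.0, 1.0],
}

search_ids = [1, 2]       # two search tokens
article_ids = [3, 1, 0]   # two article tokens followed by padding

M = matching_matrix(search_ids, article_ids, toy_embeddings)
for row in M:
    print(row)
```

Because the two inputs stay separate until the matrix is built, the model sees token-level interactions between search and article, which a single concatenated input does not expose directly.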

### K-means

We tested this custom model with three different algorithms applied after k-means clustering. Linear regression did not achieve significantly positive results, LSTM performed slightly better than the sparse benchmark, and BERT underfitted the data, with a strong tendency to predict irrelevance.

In [None]:
X_train_article, X_val_article, X_train_search, X_val_search, y_train, y_val, tokenizer = separate_train (df_train_processed, down = False, size = 0.2, seed = 42)

#Element-wise sum of article and search token ids (the concatenation below is disabled)
X_train, X_val = X_train_article + X_train_search, X_val_article + X_val_search

# X_train = np.concatenate((X_train_article, X_train_search), axis=1)
# X_val = np.concatenate((X_val_article, X_val_search), axis=1)

#Vocabulary size (largest token index + 1)
vocab_size = len(tokenizer.word_index) + 1

#Storing second dimension for initialisation
second_dimension = X_train.shape[1]

train_acc, val_acc = clusters(X_train, y_train, X_val, y_val, vocab_size)

print(train_acc)
print(val_acc)

### Validation Results
Validation accuracy without down sampling, running each model with 10 different seeds:

| Model                          | Average Accuracy | Standard Deviation |
|--------------------------------|------------------|--------------------|
| Sparse Model                   | 0.84116902       | 0.00548342         |
| Standard ML: Random Forests    | 0.8646539        | 0.0011255          |
| 3-layer Neural Network         | 0.83799          | 0.019875           |
| Deep Neural Network (4 hidden) | 0.84084          | 0.0192229          |
| MatchPyramid                   | 0.879469319      | 0.00032617         |
| k-means (LSTM)                 | 0.863246         | 0.002375           |
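
Averages and standard deviations like those in the table can be computed from per-seed accuracy lists such as `acc_nds_sparse`. The sketch below is a minimal version that assumes the population standard deviation (the default of `np.std`); the notebook does not state which variant was used, and the helper name and toy values are illustrative.

```python
import statistics


def summarise_accuracies(accuracies):
    """Return (mean, population standard deviation) of a list of per-seed accuracies."""
    mean = statistics.fmean(accuracies)
    std = statistics.pstdev(accuracies)
    return mean, std


# Toy per-seed accuracies, not results from this notebook
toy_accuracies = [0.84, 0.85, 0.83, 0.84]

mean_acc, std_acc = summarise_accuracies(toy_accuracies)
print(f"mean={mean_acc:.4f}, std={std_acc:.4f}")
```

Using `statistics.stdev` instead would give the sample standard deviation, which is slightly larger for a small number of seeds.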

### Test Results
Test data without down sampling:

| Model   | Kaggle Public Score              |
|---------|----------------------------------|
| Sparse Model  | 0.87305|
| Standard ML: Linear Regression| 0.87278|
| Standard ML: Random Forests| 0.86513 |
| 3-layer Neural Network| 0.87168|
| Deep Neural Network| 0.87305 |
| MatchPyramid | 0.79634|
| k-means (linear regression)| 0.84520|
| k-means (LSTM) | 0.85558|
| k-means (BERT) | 0.87250|

## Summary
Throughout our model exploration and implementation we researched various machine learning and deep learning models, including random forest classifiers, neural networks, MatchPyramid, and a custom k-means + LSTM model. We also tested some pre-trained models and architectures, such as BERT, LDA, and DSSM.

Our recommended model for the relevance of text-based searches is MatchPyramid. The recommendation rests on the fact that MatchPyramid is the best performing model on the validation data and has the lowest deviation across seeds.

A potential next step is to combine different models into an ensemble. We tested some of our models in an ensemble voting classifier, which did not prove more effective than the individual models. We still believe ensemble methods can mitigate the errors and biases of individual models.
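
As an illustration of the voting idea, the sketch below combines binary predictions from several models by hard majority vote. It is a simplified stand-in, not the ensemble classifier we actually ran; the function name `majority_vote`, the tie rule (ties resolve to `0`, irrelevant), and the toy predictions are assumptions.

```python
def majority_vote(predictions):
    """Combine per-model lists of 0/1 labels (all of equal length) by hard majority vote."""
    n_models = len(predictions)
    voted = []
    for votes in zip(*predictions):
        # Predict relevant (1) only if strictly more than half of the models say so
        voted.append(1 if sum(votes) * 2 > n_models else 0)
    return voted


# Toy predictions from three hypothetical models on four article-search pairs
toy_predictions = [
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 0],  # model B
    [1, 1, 0, 0],  # model C
]

ensemble = majority_vote(toy_predictions)
print(ensemble)
```

Voting only helps when the models make different mistakes; if they all lean towards predicting irrelevance, the ensemble inherits the same bias.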

Another way to improve performance is to use pre-trained text-processing neural networks and customise them to our problem. A potential next step is to implement BERT processing, freezing part of the pre-trained network and training the rest on our data.

Furthermore, we wish to add further semantic and contextual processing of our text data in the pre-processing stages. With these future steps we aim to achieve better predictions and to generalise better to other search relevance problems.


## References

[1] Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems (pp. 2924-2932).

[2] Pang, L., Lan, Y., Guo, J., Xu, J., & Cheng, X. (2016). A Study of MatchPyramid Models on Ad-hoc Retrieval. arXiv preprint arXiv:1606.04648.