# Booking Model Notebook

### Overview
This notebook demonstrates the training and prediction stages of the model developed for the Booking Kaggle competition.

#### Important Notes:
 - Certain cells involve file-saving operations and have been disabled to optimize runtime for subsequent executions.
 - The notebook reflects the same process as the Python scripts used during development, ensuring consistency and reproducibility.


#### Utils Package
The utils package provides essential helper functions and classes for data preprocessing, tokenization, and dataset handling. These include:

 - General utilities: Functions for loading and cleaning text, filling missing values, and merging datasets (utils.data_utils).
 - Loader utilities: Classes for efficiently loading and batching training and validation datasets (utils.loader_utils).
 - Model utilities: Functions for loading pre-trained models and performing mean pooling of embeddings (utils.model_utils).

This modular structure ensures consistency across scripts and simplifies the workflow for training and inference.

## Prepare Data

In [1]:
from utils.data_utils import (load_data, fill_na, merge_users_reviews_by_matches, to_lower, clean_text,
                              replace_multiple_spaces, is_english)

### User Data Processing:
1. Load Data: Load user data for the specified split (train, val, test).
2. Handle Missing Values: Replace missing guest_country values with "Unknown".
3. Feature Engineering: Create a consolidated features_text field, combining key user and accommodation information (e.g., guest type, country, star rating).
4. Save Processed Data: Save the processed user data to a CSV file for further use.

In [2]:
def process_users_data(split):
    print('Loading users data...')
    df_users = load_data(split=split, data_type='users')

    print('Fill null values...')
    users_fill_na_params = {'guest_country': 'Unknown'}
    for col_name, new_value in users_fill_na_params.items():
        fill_na(df_users, col_name, new_value)

    df_users['features_text'] =\
        'Guest type: ' + df_users['guest_type'] + '\n' +\
        'Guest country: ' + df_users['guest_country'] + '\n' +\
        'Month: ' + df_users['month'].astype(str) + '\n' +\
        'Accommodation type: ' + df_users['accommodation_type'] + '\n' +\
        'Accommodation country: ' + df_users['accommodation_country'] + '\n' +\
        'Accommodation score: ' + df_users['accommodation_score'].astype(str) + '\n' +\
        'Accommodation star rating: ' + df_users['accommodation_star_rating'].astype(str) + '\n' +\
        'Room nights: ' + df_users['room_nights'].astype(str)

    df_users.to_csv(f'../data/processed_data/processed_{split}_users.csv', index=False)
    return df_users

### Review Data Processing:
1. Load Data: Load review data for the specified split (train, val, test).
2. Handle Missing Values:
 - For the train split, remove rows whose text fields are all missing or empty, and rows whose combined text has 50 words or fewer.
 - For other splits, fill missing text fields with empty strings.
3. Text Cleaning:
 - Convert text to lowercase.
 - Remove unwanted characters (e.g., HTML tags, URLs, and non-alphanumeric characters).
 - Replace multiple spaces with a single space.
4. Feature Engineering: Create an all_text field consolidating review title, positive/negative content, score, and helpful votes.
5. Language Filter (Train Only): Retain only reviews written in English.
6. Save Processed Data: Save the processed review data to a CSV file for further use.
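
The `clean_text` and `replace_multiple_spaces` helpers come from the utils package and are not shown in this notebook. A minimal sketch of the text-cleaning steps listed above, assuming regex-based rules; the function name `clean_review_text` and the exact patterns are illustrative and may differ from the project's helpers:

```python
import re

# Hypothetical patterns; the real utils helpers may use different rules.
_HTML_TAG = re.compile(r'<[^>]+>')
_URL = re.compile(r'(?:https?://|www\.)\S+')
_NON_ALNUM = re.compile(r'[^a-z0-9\s]')
_MULTI_SPACE = re.compile(r'\s+')


def clean_review_text(text: str) -> str:
    """Lowercase, strip HTML tags and URLs, drop non-alphanumerics, collapse spaces."""
    text = text.lower()
    text = _HTML_TAG.sub(' ', text)
    # Remove URLs before stripping punctuation, otherwise fragments of them survive.
    text = _URL.sub(' ', text)
    text = _NON_ALNUM.sub(' ', text)
    return _MULTI_SPACE.sub(' ', text).strip()


print(clean_review_text('Great <b>stay</b>!!  See https://example.com   for pics'))
```

The order of the substitutions matters: stripping punctuation first would break URLs into word-like pieces that the URL pattern no longer matches.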

In [3]:
def process_reviews_data(split):
    print('Loading reviews data...')
    df_reviews = load_data(split=split, data_type='reviews')
    reviews_text_columns = ['review_title', 'review_positive', 'review_negative']

    if split == 'train':
        print('Fill null values...')
        # Drop rows where all text columns are NaN or empty
        df_reviews = df_reviews.dropna(subset=reviews_text_columns, how='all')
        df_reviews = df_reviews[(df_reviews['review_title'] != '') | (df_reviews['review_positive'] != '') | (df_reviews['review_negative'] != '')]

        print('Drop rows with very short combined text...')
        # Drop rows whose combined text has 50 words or fewer.
        # Note: a NaN in any text column makes the concatenation NaN, so such rows are dropped here too.
        df_reviews['text'] = df_reviews['review_title'] + ' ' + df_reviews['review_positive'] + ' ' + df_reviews['review_negative']
        df_reviews = df_reviews[df_reviews['text'].str.split().str.len() > 50]
        df_reviews = df_reviews.drop(columns='text')
    else:
        print('Fill null values...')
        for col_name in reviews_text_columns:
            fill_na(df_reviews, col_name, '')

    print('Cleaning text...')
    # Convert text to lowercase
    for col_name in reviews_text_columns:
        df_reviews = to_lower(df_reviews, col_name)

    # Remove unwanted characters (e.g., HTML tags, URLs, non-alphanumeric characters)
    for col_name in reviews_text_columns:
        df_reviews[col_name] = df_reviews[col_name].apply(clean_text)

    # Replace multiple spaces with a single space
    for col_name in reviews_text_columns:
        df_reviews = replace_multiple_spaces(df_reviews, col_name)

    df_reviews['all_text'] = 'Review title: ' + df_reviews['review_title'] + '\n' + \
                             'Review positive: ' + df_reviews['review_positive'] + '\n' + \
                             'Review negative: ' + df_reviews['review_negative'] + '\n' + \
                             'Review score: ' + df_reviews['review_score'].astype(str) + '\n' + \
                             'Review helpful votes: ' + df_reviews['review_helpful_votes'].astype(str)

    if split == 'train':
        print('Check if text is in English...')
        # Keep only rows whose combined text is detected as English
        df_reviews = df_reviews[df_reviews['all_text'].apply(is_english)]

    df_reviews.to_csv(f'../data/processed_data/processed_{split}_reviews.csv', index=False)
    return df_reviews
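
The `is_english` filter is imported from the utils package, and its implementation is not shown here. One dependency-free way to approximate such a filter is a stop-word heuristic. The sketch below is an assumption for illustration only: the name `looks_english`, the word list, and the threshold are all hypothetical, and a real pipeline would more likely rely on a dedicated language-detection library.

```python
import re

# Small, hypothetical stop-word list; not taken from the project.
ENGLISH_STOPWORDS = frozenset({
    'the', 'and', 'was', 'is', 'a', 'to', 'of', 'in', 'it', 'for',
    'very', 'with', 'we', 'were', 'not', 'but', 'this', 'that',
})


def looks_english(text: str, min_ratio: float = 0.2) -> bool:
    """Return True if enough tokens are common English stop words."""
    tokens = re.findall(r'[a-z]+', text.lower())
    if not tokens:
        return False
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens) >= min_ratio


print(looks_english('the room was very clean and the staff were friendly'))
print(looks_english('das zimmer war sehr sauber und das personal freundlich'))
```

A heuristic like this is cheap but crude, so short or mixed-language reviews can be misclassified in either direction.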

### User-Review Merging (Train and Validation Only):
Combine user and review datasets using match files to create positive examples for training and validation.
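
The merge itself is handled by `merge_users_reviews_by_matches`, whose body is not part of this notebook. As a rough sketch of the idea, assuming each match record links a `user_id` to a `review_id` (the column names, the list-of-dicts layout, and the function name `merge_by_matches` are assumptions, not the project's schema):

```python
def merge_by_matches(users, reviews, matches):
    """Join users and reviews through (user_id, review_id) match records.

    All three inputs are lists of dicts; the key names are assumptions.
    """
    users_by_id = {user['user_id']: user for user in users}
    reviews_by_id = {review['review_id']: review for review in reviews}
    pairs = []
    for match in matches:
        user = users_by_id.get(match['user_id'])
        review = reviews_by_id.get(match['review_id'])
        # Skip matches whose user or review was filtered out during preprocessing.
        if user is None or review is None:
            continue
        pairs.append({'features_text': user['features_text'],
                      'all_text': review['all_text']})
    return pairs


# Made-up records: review 'r2' is absent, as if it had been filtered out.
users = [{'user_id': 'u1', 'features_text': 'Guest type: Couple'},
         {'user_id': 'u2', 'features_text': 'Guest type: Solo traveller'}]
reviews = [{'review_id': 'r1', 'all_text': 'Review title: lovely stay'}]
matches = [{'user_id': 'u1', 'review_id': 'r1'},
           {'user_id': 'u2', 'review_id': 'r2'}]
positive_pairs = merge_by_matches(users, reviews, matches)
```

Behaving like an inner join matters for the train split, where short and non-English reviews have already been removed and their matches no longer have a review to pair with.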

In [4]:
for split in ['train', 'val', 'test']:
    df_users = process_users_data(split)
    df_reviews = process_reviews_data(split)
    if split != 'test':
        merge_users_reviews_by_matches(df_users, df_reviews, split)

Loading users data...
Fill null values...
Loading reviews data...
Fill null values...
Drop rows with very short combined text...
Cleaning text...
Check if text is in English...
Loading users data...
Fill null values...
Loading reviews data...
Fill null values...
Cleaning text...
Loading users data...
Fill null values...
Loading reviews data...
Fill null values...
Cleaning text...


## Train Model

This section demonstrates the training process for the dual-encoder model used to match reviews and user features.

Key Steps:

1. Model Initialization:

 - Load pre-trained sentence-transformers/all-MiniLM-L6-v2 models for reviews and user features.
 - Transfer models to GPU (cuda) for efficient training.

2. Dataset Preparation:

 - Utilize preprocessed training and validation datasets loaded via UsersReviewsTrainDataset and UsersReviewsValDataset.
 - Batch size is set to 64 for training and 128 for validation.

3. Training Process:

 - Train the models over 100 epochs.
 - Use AdamW optimizer with a learning rate of 3e-5 for both models.
 - Apply a learning rate scheduler (ReduceLROnPlateau) to dynamically adjust learning rates based on validation loss.

4. Loss Function:

 - Compute similarity scores between review and user feature embeddings.
 - Optimize similarity matrix against a diagonal target using Binary Cross-Entropy Loss with Logits.

5. Forward Pass:

 - Compute embeddings for reviews and user features.
 - Aggregate token embeddings for each sequence using mean pooling.
 - Calculate similarity scores as the dot product of review and feature embeddings.

6. Validation:

 - Evaluate model performance on the validation dataset after each epoch.
 - Use validation loss to adjust the learning rate dynamically.

7. Model Saving:

 - Save the trained models (review encoder and feature encoder) after each epoch for checkpointing and future use.
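
The effect of the plateau scheduler with a patience of 1 can be illustrated without torch. The sketch below is a simplified re-implementation of the plateau rule (relative improvement threshold, no cooldown), assuming a reduction factor of 0.1 and a threshold of 1e-4, which are PyTorch's documented defaults for `ReduceLROnPlateau`. The function name and the validation losses are hypothetical.

```python
def simulate_plateau_schedule(val_losses, lr=3e-5, factor=0.1, patience=1,
                              min_lr=1e-8, threshold=1e-4):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float('inf')
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best * (1 - threshold):
            # Meaningful improvement: remember it and reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # Too many epochs without improvement: shrink the learning rate.
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up validation losses: improvement stalls in epochs 2 and 3.
schedule = simulate_plateau_schedule([0.50, 0.40, 0.41, 0.42, 0.39])
for epoch, rate in enumerate(schedule):
    print(epoch, rate)
```

With these values the learning rate stays at 3e-5 for the first three epochs and drops by a factor of ten after the second consecutive epoch without improvement.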

In [7]:
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import numpy as np

from utils.loader_utils import UsersReviewsTrainDataset, UsersReviewsValDataset
from utils.model_utils import load_model, mean_pooling
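
The `mean_pooling` helper imported above is defined in utils.model_utils and is not shown in this notebook. The following is a dependency-free sketch of what masked mean pooling computes, written over plain nested lists so the arithmetic is visible. The name `masked_mean_pool` is hypothetical, and the project's helper operates on torch tensors and may differ in detail.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: [batch][seq_len][dim] nested lists of floats
    attention_mask:   [batch][seq_len] nested lists of 0/1
    """
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        total = [0.0] * dim
        for vector, keep in zip(tokens, mask):
            if keep:
                for j, value in enumerate(vector):
                    total[j] += value
        count = max(sum(mask), 1)  # guard against an all-padding row
        pooled.append([value / count for value in total])
    return pooled


# One sequence of three tokens; the last token is padding and must be ignored.
embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(masked_mean_pool(embeddings, mask))
```

Because the inputs are padded to a fixed length, averaging without the mask would let padding vectors distort every sequence embedding.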

In [8]:
def train_model():
    print('Training model...')
    num_epochs = 100
    batch_size = 64

    # Load model
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    review_model, review_tokenizer = load_model(model_name)
    review_model.to('cuda')
    features_model, features_tokenizer = load_model(model_name)
    features_model.to('cuda')

    train_dataset = UsersReviewsTrainDataset(datapath='../data/processed_data/processed_train.csv',
                                             review_tokenizer=review_tokenizer,
                                             features_tokenizer=features_tokenizer,
                                             batch_size=batch_size)
    val_dataset = UsersReviewsValDataset(datapath='../data/processed_data/processed_val.csv',
                                         review_tokenizer=review_tokenizer,
                                         features_tokenizer=features_tokenizer,
                                         frac=0.2,
                                         batch_size=batch_size)

    train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size*2, shuffle=False)

    optimizer = torch.optim.AdamW([
        {"params": review_model.parameters(), "lr": 3e-5},  # Learning rate for the review encoder
        {"params": features_model.parameters(), "lr": 3e-5},   # Learning rate for the features encoder
    ])
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                           patience=1,
                                                           min_lr=1e-8,
                                                           mode='min')
    # Binary cross-entropy with logits, matching the diagonal-target loss computed in forward_pass
    loss_fn = torch.nn.BCEWithLogitsLoss()
    running_loss = []

    for epoch in range(num_epochs):
        review_model.train()
        features_model.train()
        pb_train = tqdm(enumerate(train_dataloader), desc=f'Epoch {epoch}', total=len(train_dataloader))
        for idx, batch_data in pb_train:
            review_text, features_text = batch_data
            review_text = {key: value.squeeze(0) for key, value in review_text.items()}
            features_text = {key: value.squeeze(0) for key, value in features_text.items()}
            batch_data = review_text, features_text

            loss = forward_pass(review_model, features_model, batch_data, loss_fn)

            running_loss += [loss.item()]
            pb_train.set_postfix_str(f"loss: {np.mean(running_loss)}")

            # Backpropagate the loss and update both encoders
            optimizer.zero_grad()  # clear gradients left over from the previous step
            loss.backward()
            optimizer.step()

        val_loss = validate(review_model, features_model, val_dataloader, loss_fn)
        scheduler.step(val_loss)
        review_model.save_pretrained(f"./runs/run6/{model_name.split('/')[-1]}_review_e{epoch}.pth")
        features_model.save_pretrained(f"./runs/run6/{model_name.split('/')[-1]}_features_e{epoch}.pth")

In [9]:
def forward_pass(review_model, features_model, batch_data, loss_fn):
    review_text, features_text = batch_data
    review_text = {key: value.to('cuda') for key, value in review_text.items()}
    features_text = {key: value.to('cuda') for key, value in features_text.items()}

    review_emb = review_model(**review_text)
    features_emb = features_model(**features_text)

    review_emb = mean_pooling(review_emb, review_text['attention_mask'])
    features_emb = mean_pooling(features_emb, features_text['attention_mask'])

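    # Compute similarity scores: a (batch_size x batch_size) matrix.
    # Row N holds the dot products between review N and user features 0..batch_size-1.
    similarity_scores = review_emb @ features_emb.T

    # Matching review/feature pairs sit on the diagonal, so the target is the identity matrix.
    # Note: the loss_fn argument is not used here; BCE-with-logits is instantiated directly.
    target = torch.eye(review_emb.shape[0], dtype=torch.float32, device='cuda')
    loss = torch.nn.BCEWithLogitsLoss()(similarity_scores, target)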
    return loss
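
To make the diagonal target concrete, here is a small worked example in plain Python (no torch), assuming the standard numerically stable form of binary cross-entropy with logits, averaged over all entries of the matrix. The function names and the 2x2 similarity values are made up for illustration.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def diagonal_bce_loss(similarity):
    """Mean BCE between a square similarity matrix and the identity matrix."""
    n = len(similarity)
    total = 0.0
    for i, row in enumerate(similarity):
        for j, logit in enumerate(row):
            target = 1.0 if i == j else 0.0  # matching pairs sit on the diagonal
            total += bce_with_logits(logit, target)
    return total / (n * n)


# Made-up scores: row i is review i, column j is user-feature vector j.
similarity = [[2.0, -1.0],
              [0.0, 3.0]]
loss = diagonal_bce_loss(similarity)
print(round(loss, 4))
```

Every off-diagonal entry acts as an in-batch negative, so a training batch of 64 pairs yields 64 positive targets and 4,032 negative ones.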

In [10]:
@torch.no_grad()
def validate(review_model, features_model, val_dataloader, loss_fn):
    review_model.eval()
    features_model.eval()
    running_loss = []
    pb_val = tqdm(enumerate(val_dataloader), desc='Validation', total=len(val_dataloader))
    for idx, batch_data in pb_val:
        loss = forward_pass(review_model, features_model, batch_data, loss_fn)
        running_loss += [loss.item()]
        pb_val.set_postfix_str(f"loss: {np.mean(running_loss)}")
    return np.mean(running_loss)

In [12]:
train_model()

## Predict and Prepare Submission
### Overview:
This section describes the process of generating predictions and preparing the final submission file for the Kaggle competition. The predictions involve matching reviews with user-accommodation pairs based on their embeddings.

### Key Steps:
1. Load Models and Tokenizers:

 - Load the fine-tuned review and user-feature encoders (based on sentence-transformers/all-MiniLM-L6-v2) saved from the best training run.
 - Load tokenizers corresponding to the models.

2. Load Processed Data:

 - Load test user and review datasets (processed_test_users.csv and processed_test_reviews.csv).
 - Index the data by accommodation_id for efficient filtering.

3. Batch Prediction:

 - Iterate over accommodations in batches of size N.
 - For each batch:
     - Tokenize and encode review and user feature data.
     - Pass the encoded data through their respective models to generate embeddings.
     - Use mean pooling to compute sequence embeddings.

4. Similarity Computation:

 - Compute similarity scores between user-accommodation embeddings and review embeddings.
 - Identify the top 10 most relevant reviews for each user-accommodation pair.

5. Prepare Submission File:

 - Append results to the submission file, ensuring each row contains:
     - accommodation_id
     - user_id
     - The IDs of the top 10 matching reviews (or placeholders if fewer than 10 reviews are available).

6. Optimize Resource Usage:

 - Clear GPU memory (torch.cuda.empty_cache()) after processing each batch to ensure efficient resource usage.

7. Finalize Submission:

 - Add a unique ID column to the submission file.
 - Save the final submission as a CSV file in the specified directory.
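
The ranking and padding in steps 4 and 5 can be sketched in a few lines of plain Python. The function name `top_k_review_ids` and the sample scores are hypothetical; the placeholder value of -1 follows the notebook's convention for accommodations with fewer than 10 reviews.

```python
def top_k_review_ids(scores, review_ids, k=10, pad_value=-1):
    """Return the k review IDs with the highest scores, padded to length k."""
    ranked = sorted(zip(scores, review_ids), key=lambda pair: pair[0], reverse=True)
    top_ids = [review_id for _, review_id in ranked[:k]]
    # Pad so that every submission row has exactly k review columns.
    return top_ids + [pad_value] * (k - len(top_ids))


# Made-up scores for an accommodation with only three reviews.
row = top_k_review_ids([0.2, 0.9, 0.5], ['r1', 'r2', 'r3'], k=5)
print(row)
```

Padding keeps the number of columns constant, which a fixed-width submission file requires.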

In [13]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm
from utils.model_utils import mean_pooling

In [15]:
def prepare_submission(user_accommodation_embeddings, review_embeddings, users_df, reviews_df):
    submission = []

    for idx, row in users_df.iterrows():
        user_id = row['user_id']
        accommodation_id = row['accommodation_id']

        # Get the corresponding embedding
        user_accommodation_embedding = user_accommodation_embeddings[idx]  # Shape: (embedding_dim,)

        # Find all reviews for this accommodation
        relevant_reviews = reviews_df[reviews_df['accommodation_id'] == accommodation_id]

        if relevant_reviews.empty:
            continue  # No reviews for this accommodation

        # Select the embeddings of the relevant reviews
        relevant_review_embeddings = review_embeddings[relevant_reviews.index]  # Shape: (num_reviews_for_accommodation, embedding_dim)

        # Compute similarity scores (matrix @ vector; a 1-D tensor needs no transpose)
        similarity_scores = (relevant_review_embeddings @ user_accommodation_embedding).detach().cpu().numpy()

        # Get top 10 most similar reviews
        top_review_indices = similarity_scores.argsort()[-10:][::-1]  # Get top 10 indices

        # Retrieve the corresponding review IDs
        top_review_ids = relevant_reviews.iloc[top_review_indices]['review_id'].tolist()

        # Ensure we have exactly 10 reviews (fill with -1 if not enough)
        while len(top_review_ids) < 10:
            top_review_ids.append(-1)  # Placeholder for missing reviews

        # Store result
        submission.append([accommodation_id, user_id] + top_review_ids)

    # Build the DataFrame once, after all users have been processed
    submission_df = pd.DataFrame(submission, columns=['accommodation_id', 'user_id'] + [f"review_{i+1}" for i in range(10)])
    return submission_df

In [16]:
def prepare_review(sample, review_tokenizer):
    all_text = review_tokenizer(sample['all_text'],
                                padding='max_length',
                                truncation=True,
                                return_tensors='pt',
                                max_length=review_tokenizer.model_max_length)
    all_text.data['input_ids'] = all_text.data['input_ids'].squeeze()
    all_text.data['attention_mask'] = all_text.data['attention_mask'].squeeze()
    all_text.data['token_type_ids'] = all_text.data['token_type_ids'].squeeze()
    return all_text

In [17]:
def prepare_features_text(sample, features_tokenizer):
    features = features_tokenizer(sample['features_text'],
                                  padding='max_length',
                                  truncation=True,
                                  return_tensors='pt',
                                  max_length=features_tokenizer.model_max_length)
    features.data['input_ids'] = features.data['input_ids'].squeeze()
    features.data['attention_mask'] = features.data['attention_mask'].squeeze()
    features.data['token_type_ids'] = features.data['token_type_ids'].squeeze()
    return features

In [18]:
def stack_batch_encodings(batch_encodings):
    # Extract and stack 'input_ids'
    input_ids = torch.stack([enc['input_ids'].squeeze(0) for enc in batch_encodings.values()])

    # Extract and stack 'attention_mask'
    attention_mask = torch.stack([enc['attention_mask'].squeeze(0) for enc in batch_encodings.values()])

    # Optional: handle 'token_type_ids' if the tokenizer produced them
    first_encoding = next(iter(batch_encodings.values()))
    if "token_type_ids" in first_encoding:
        token_type_ids = torch.stack([enc['token_type_ids'].squeeze(0) for enc in batch_encodings.values()])
    else:
        token_type_ids = None

    # Return a dictionary for the batch
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

In [19]:
@torch.no_grad()
def predict():
    start = 0
    N = 10
    counter = 1
    save_path = "./submissions/best_submission/submission.csv"
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
    review_model = AutoModel.from_pretrained(f"./runs/best_run/{model_name.split('/')[-1]}_review.pth")
    review_model.to('cuda')
    review_tokenizer = AutoTokenizer.from_pretrained(model_name)
    features_model = AutoModel.from_pretrained(f"./runs/best_run/{model_name.split('/')[-1]}_features.pth")
    features_model.to('cuda')
    features_tokenizer = AutoTokenizer.from_pretrained(model_name)

    users_df = pd.read_csv('../data/processed_data/processed_test_users.csv')
    reviews_df = pd.read_csv('../data/processed_data/processed_test_reviews.csv')
    accommodation_ids = users_df['accommodation_id'].unique()
    users_indexed = users_df.set_index("accommodation_id")
    reviews_indexed = reviews_df.set_index("accommodation_id")

    for i in tqdm(range(start, len(accommodation_ids), N)):
        filtered_acc_ids = accommodation_ids[i:i + N]
        filtered_users_df = users_indexed.loc[filtered_acc_ids]
        filtered_users_df = filtered_users_df.reset_index()
        filtered_reviews_df = reviews_indexed.loc[filtered_acc_ids]
        filtered_reviews_df = filtered_reviews_df.reset_index()

        review_input = filtered_reviews_df.apply(lambda x: prepare_review(x, review_tokenizer), axis=1)
        features_input = filtered_users_df.apply(lambda x: prepare_features_text(x, features_tokenizer), axis=1)

        review_input = {key: value.to('cuda') for key, value in review_input.items()}
        features_input = {key: value.to('cuda') for key, value in features_input.items()}

        review_input = stack_batch_encodings(review_input)
        features_input = stack_batch_encodings(features_input)

        review_emb = review_model(**review_input)
        features_emb = features_model(**features_input)

        review_emb = mean_pooling(review_emb, review_input['attention_mask'])
        features_emb = mean_pooling(features_emb, features_input['attention_mask'])

        submission_df = prepare_submission(features_emb, review_emb, filtered_users_df,
                                           filtered_reviews_df)
        submission_df.to_csv(
            save_path,
            mode='a',
            header=(counter == 1),  # write the header only for the first batch
            index=False)
        counter += 1

        del features_input, review_input, features_emb, review_emb  # Delete variables no longer needed
        torch.cuda.empty_cache()  # Release unused GPU memory

    submission_df = pd.read_csv(save_path)
    if 'ID' not in submission_df.columns:
        submission_df.insert(0, 'ID', range(1, len(submission_df) + 1))
    submission_df.to_csv(save_path, index=False)

In [20]:
predict()
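
Because `predict()` writes each batch in append mode, a submission file left over from an earlier run would be extended rather than replaced. The sketch below shows the same write-in-chunks pattern with the standard `csv` module and removes any stale file first. That safeguard, the function names, and the sample rows are additions for illustration, not something the notebook does.

```python
import csv
import os
import tempfile


def write_chunks(path, chunks, columns):
    """Write row chunks to one CSV, emitting the header only once."""
    # Start from a clean file so a rerun does not append to stale results.
    if os.path.exists(path):
        os.remove(path)
    for chunk_number, rows in enumerate(chunks):
        with open(path, 'a', newline='') as handle:
            writer = csv.writer(handle)
            if chunk_number == 0:
                writer.writerow(columns)
            writer.writerows(rows)


def add_id_column(path):
    """Rewrite the CSV with a 1-based ID as the first column."""
    with open(path, newline='') as handle:
        header, *rows = list(csv.reader(handle))
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['ID'] + header)
        for row_id, row in enumerate(rows, start=1):
            writer.writerow([row_id] + row)


# Usage on a temporary file with made-up rows.
path = os.path.join(tempfile.mkdtemp(), 'submission.csv')
write_chunks(path, [[[1, 10]], [[2, 20], [3, 30]]], ['accommodation_id', 'user_id'])
add_id_column(path)
```

Appending per chunk keeps memory usage flat, at the cost of a final pass over the file to add the ID column.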