# 1 - Introduction
In this part I import the dataset and inspect how it is structured.

### Data Download

In this cell, I download the input data from the official source provided in the exam instructions.  

The downloaded file contains the dataset described in the exam assignment, with reviews and various hotel-related and user-related fields.


In [None]:
!wget -O input_data.pkl 'http://frasca.di.unimi.it/MLDNN/input_data.pkl'

--2025-07-06 14:01:51--  http://frasca.di.unimi.it/MLDNN/input_data.pkl
Resolving frasca.di.unimi.it (frasca.di.unimi.it)... 159.149.130.139
Connecting to frasca.di.unimi.it (frasca.di.unimi.it)|159.149.130.139|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://frasca.di.unimi.it/MLDNN/input_data.pkl [following]
--2025-07-06 14:01:53--  https://frasca.di.unimi.it/MLDNN/input_data.pkl
Connecting to frasca.di.unimi.it (frasca.di.unimi.it)|159.149.130.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3390300 (3.2M)
Saving to: ‘input_data.pkl’


2025-07-06 14:01:54 (4.14 MB/s) - ‘input_data.pkl’ saved [3390300/3390300]



### Library Imports and Random Seeds

In this cell I import all the required libraries and tools necessary for data processing, model building, and training.

In [None]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Dropout, Concatenate, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from scipy.stats import randint, uniform
from collections import Counter
import random
import gc
from itertools import product
from sklearn.model_selection import KFold

# Fix the seeds of all random number generators used in the notebook
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

### Load Dataset

In this cell, I load the dataset from the downloaded pickle file into a pandas DataFrame, as instructed in the exam assignment.  

Printing the column names is a quick verification step to confirm that the dataset includes all the fields described in the assignment.
These will be preprocessed or discarded according to the strategy detailed in section 2.a of the written document.


In [None]:
# Load data
file_path = "input_data.pkl"  # same relative path used by wget above (/content in Colab)
df = pd.read_pickle(file_path)
print("Dataset columns:", df.columns.tolist())

Dataset columns: ['Hotel_Address', 'Review_Date', 'Average_Score', 'Hotel_Name', 'Reviewer_Nationality', 'Hotel_number_reviews', 'Reviewer_number_reviews', 'Review_Score', 'Review', 'Review_Type']


### Drop Unused Columns and Explore Dataset

In this section, I drop the columns that I explicitly chose not to use in my model, according to point 2.a of the written exam.

- Hotel_Address and Reviewer_Nationality were discarded because they are considered irrelevant to the prediction task.
- Review_Type is dropped because it would introduce leakage, as it directly relates to the target variable (Review_Score).
- Average_Score is removed because it was not among the input features listed in the exam assignment.

After dropping columns, I print: general info on the DataFrame, a preview of the data and a count of missing values per column.

This is a necessary exploratory step before proceeding with preprocessing, confirming that the data is clean and matches the structure described in the assignment.


In [None]:
df = df.drop(columns=['Hotel_Address', 'Reviewer_Nationality', 'Review_Type', 'Average_Score'], errors='ignore')
# Basic data exploration
print("Dataset info:")
print(df.info())
print("\nFirst few rows:")
print(df.head())
print("\nMissing values:")
print(df.isnull().sum())

Dataset info:
<class 'pandas.core.frame.DataFrame'>
Index: 13772 entries, 88526 to 61379
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Review_Date              13772 non-null  object 
 1   Hotel_Name               13772 non-null  object 
 2   Hotel_number_reviews     13772 non-null  int64  
 3   Reviewer_number_reviews  13772 non-null  int64  
 4   Review_Score             13772 non-null  float64
 5   Review                   13772 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 753.2+ KB
None

First few rows:
      Review_Date                              Hotel_Name  \
88526    5/2/2017  Copthorne Tara Hotel London Kensington   
42019    8/4/2016  BEST WESTERN Maitrise Hotel Maida Vale   
80574  11/17/2016                 Catalonia Ramblas 4 Sup   
27131    2/4/2016              Hyatt Regency Paris Etoile   
63857   7/27/2016         Best Western PLUS Epping Fores

# 2 - Input

This section of the exam was divided into two parts:
* How to (if) preprocess input data and which data would you retain/use;
* Which is, after the preprocessing step, the input of the model: type, shape, value, domain

### Feature Extraction and Date Processing

This step addresses part 2.a of my written exam. I extract structured features to be used by the model alongside the review text.

- I first split `Review_Date` (stored as month/day/year) into three integer features: `Month`, `Day`, and `Year`, as stated in the preprocessing strategy.

- The features used (`Hotel_Name`, `Reviewer_number_reviews`, `Hotel_number_reviews`, `Day`, `Month`, `Year`) are selected based on their expected predictive power, as explained in 2.b of the exam.

- `Review_Score` is used as the regression target `y`.

This structured input will later be scaled and encoded, then concatenated with the BiLSTM output and fed to the MLP.
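
As a small illustration of the date split, the sketch below applies the same `str.split` idea to a toy frame and cross-checks the result against `pd.to_datetime`. The frame name `toy` is hypothetical, the two dates are taken from the data preview, and the month/day/year order is an assumption based on that preview.

```python
import pandas as pd

# Toy frame with dates in the month/day/year format seen in the preview
toy = pd.DataFrame({"Review_Date": ["5/2/2017", "11/17/2016"]})

# Same idea as in the notebook: split on '/' and cast the parts to int
toy[["Month", "Day", "Year"]] = (
    toy["Review_Date"].str.split("/", expand=True).astype(int)
)

# Cross-check with an explicit datetime parse (assumes month/day/year order)
parsed = pd.to_datetime(toy["Review_Date"], format="%m/%d/%Y")
assert (parsed.dt.month == toy["Month"]).all()
assert (parsed.dt.day == toy["Day"]).all()
assert (parsed.dt.year == toy["Year"]).all()

print(toy)
```

If a date were stored in a different order, the explicit `format` would raise an error instead of silently swapping day and month, which is why the cross-check can be useful.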


In [None]:
# Extract day, month, and year as separate integer features from the review date
df[['Month', 'Day', 'Year']] = df['Review_Date'].str.split('/', expand=True).astype(int)

# Select structured features to use as model input (beside review text)
feature_cols = ['Hotel_Name', 'Reviewer_number_reviews', 'Hotel_number_reviews', 'Day', 'Month', 'Year']

# Create separate variables for structured input, textual input, and target
X_structured = df[feature_cols].copy()
X_text = df['Review'].copy()
y = df['Review_Score'].values

### Train/Validation/Test Splitting

Here, I split the dataset into training, validation, and test sets in proportions 70% / 15% / 15%, as described in point 6 of the exam. This is necessary to evaluate the model’s generalization performance.

- I first split 15% as the test set.
- Then I compute the size of the validation set relative to the remaining 85% (0.15 / 0.85 ≈ 0.176) and apply a second split, so that the validation set is 15% of the full dataset.
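
The arithmetic behind the two-step split can be checked with a few lines of plain Python. This is only a sketch: it assumes that `train_test_split` rounds the size of the held-out part up, and it uses the row count reported by `df.info()`.

```python
import math

n_total = 13772                      # number of rows reported by df.info()
train_ratio, val_ratio, test_ratio = 0.70, 0.15, 0.15

# First split: test_size is a fraction of the full dataset
n_test = math.ceil(test_ratio * n_total)
n_remaining = n_total - n_test

# Second split: test_size must be expressed relative to the remaining rows
val_ratio_relative = val_ratio / (train_ratio + val_ratio)
n_val = math.ceil(val_ratio_relative * n_remaining)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
```

Passing `0.15` directly to the second split would instead yield a validation set of about 12.75% of the full dataset (15% of 85%), which is why the relative ratio is needed.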

In [None]:
# Define target proportions for training, validation, and test sets
train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15

# First split: separate out the test set (15% of the total)
X_struct_temp, X_struct_test, X_text_temp, X_text_test, y_temp, y_test = train_test_split(
    X_structured, X_text, y, test_size=0.15, random_state=42
)

# Compute validation size relative to the remaining data (85% of original)
val_ratio_relative = val_ratio / (train_ratio + val_ratio)  # 0.15 / 0.85 = 0.17647

# Second split: separate training and validation from remaining 85%
X_struct_train, X_struct_val, X_text_train, X_text_val, y_train, y_val = train_test_split(
    X_struct_temp, X_text_temp, y_temp, test_size=val_ratio_relative, random_state=42
)

# Print the distribution of samples across splits
total = len(X_structured)
print(f"Train: {len(X_struct_train)} ({len(X_struct_train)/total:.2%})")
print(f"Validation: {len(X_struct_val)} ({len(X_struct_val)/total:.2%})")
print(f"Test: {len(X_struct_test)} ({len(X_struct_test)/total:.2%})")

Train: 9640 (70.00%)
Validation: 2066 (15.00%)
Test: 2066 (15.00%)


## Preprocessing of the Structured Features

### One-Hot Encoding of Hotel Names

**CHANGE**: In the implementation, I used a OneHotEncoder instead of a LabelEncoder for encoding the `Hotel_Name` feature.

**Motivation**:
In the written solution (point 2.a), I stated that each hotel name would be mapped to an integer. This would typically imply a LabelEncoder. However, using integer values alone would introduce ordinal relationships between hotels, which could mislead the model.

To better reflect the categorical and unordered nature of hotel names, I replaced the label encoding with one-hot encoding. This avoids introducing artificial hierarchies and allows the model to treat each hotel as an independent entity, which is more suitable for MLP input.

In [None]:
# One-hot encode hotel names, fit only on training data to avoid leakage
hotel_ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit the encoder on training hotel names
hotel_name_train_ohe = hotel_ohe.fit_transform(X_struct_train[['Hotel_Name']])

# Apply the trained encoder to validation and test sets
hotel_name_val_ohe = hotel_ohe.transform(X_struct_val[['Hotel_Name']])
hotel_name_test_ohe = hotel_ohe.transform(X_struct_test[['Hotel_Name']])

### Scaling Numerical Structured Features

Here I scale all numerical features to the [0, 1] range using a MinMaxScaler, as stated in points 2.a and 5 of the written exam.

- The scaler is fitted on training data only, and then applied to validation and test sets to ensure consistency and avoid leakage.

These scaled values ensure that all structured inputs are normalized before entering the MLP.
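
One detail worth illustrating: because the scaler learns its minimum and maximum from the training split only, values in the other splits that exceed the training range land outside [0, 1]. The toy numbers below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [5.0], [9.0]])   # toy training values
val = np.array([[3.0], [11.0]])           # 11 exceeds the training maximum

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=1, max=9
val_scaled = scaler.transform(val)          # reuses the training min/max

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Here the second validation value maps to 1.25, so the [0, 1] guarantee strictly holds for the training set; this is the expected price of avoiding leakage.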


In [None]:
# Scale numerical features using MinMaxScaler fitted on training data
scaler = MinMaxScaler()
numerical_cols = ['Reviewer_number_reviews', 'Hotel_number_reviews', 'Day', 'Month', 'Year']

# Copy structured data to avoid modifying originals
X_struct_train_scaled = X_struct_train.copy()
X_struct_val_scaled = X_struct_val.copy()
X_struct_test_scaled = X_struct_test.copy()

# Fit scaler on training set, apply it to all splits
X_struct_train_scaled[numerical_cols] = scaler.fit_transform(X_struct_train[numerical_cols])
X_struct_val_scaled[numerical_cols] = scaler.transform(X_struct_val[numerical_cols])
X_struct_test_scaled[numerical_cols] = scaler.transform(X_struct_test[numerical_cols])

### Combine Structured Inputs and Normalize Target Variable

In this step, I concatenate the one-hot encoded `Hotel_Name` with the scaled numerical features to form the final structured input for the model.

Additionally:
- The target variable `Review_Score` is scaled from [0, 10] to [0, 1] to match the output range of the final sigmoid activation, as specified in point 3 of the written solution.
- I also extract the full list of structured feature names, mainly for interpretability and debugging purposes.

The structured input matrix is now ready for model integration with the BiLSTM output.


In [None]:
# Concatenate encoded hotel names and scaled numerical columns
X_struct_train_final = np.hstack([hotel_name_train_ohe, X_struct_train_scaled[numerical_cols].values])
X_struct_val_final = np.hstack([hotel_name_val_ohe, X_struct_val_scaled[numerical_cols].values])
X_struct_test_final = np.hstack([hotel_name_test_ohe, X_struct_test_scaled[numerical_cols].values])

# Normalize review scores from [0,10] to [0,1] for sigmoid output
y_train_scaled = y_train / 10
y_val_scaled = y_val / 10
y_test_scaled = y_test / 10

# Prepare feature names for reference (optional)
hotel_name_features = hotel_ohe.get_feature_names_out(['Hotel_Name'])
struct_feature_cols = list(hotel_name_features) + numerical_cols

# Show shapes and final feature list
print(f"Training structured features shape: {X_struct_train_final.shape}")
print(f"Validation structured features shape: {X_struct_val_final.shape}")
print(f"Test structured features shape: {X_struct_test_final.shape}")
print(f"Feature names: {struct_feature_cols}")

Training structured features shape: (9640, 1303)
Validation structured features shape: (2066, 1303)
Test structured features shape: (2066, 1303)
Feature names: ['Hotel_Name_11 Cadogan Gardens', 'Hotel_Name_1K Hotel', 'Hotel_Name_25hours Hotel beim MuseumsQuartier', 'Hotel_Name_41', 'Hotel_Name_88 Studios', 'Hotel_Name_9Hotel Republique', 'Hotel_Name_ABaC Restaurant Hotel Barcelona GL Monumento', 'Hotel_Name_AC Hotel Barcelona Forum a Marriott Lifestyle Hotel', 'Hotel_Name_AC Hotel Diagonal L Illa a Marriott Lifestyle Hotel', 'Hotel_Name_AC Hotel Irla a Marriott Lifestyle Hotel', 'Hotel_Name_AC Hotel Milano a Marriott Lifestyle Hotel', 'Hotel_Name_AC Hotel Paris Porte Maillot by Marriott', 'Hotel_Name_AC Hotel Sants a Marriott Lifestyle Hotel', 'Hotel_Name_AC Hotel Victoria Suites a Marriott Lifestyle Hotel', 'Hotel_Name_ADI Doria Grand Hotel', 'Hotel_Name_ARCOTEL Kaiserwasser Superior', 'Hotel_Name_ARCOTEL Wimberger', 'Hotel_Name_AZIMUT Hotel Vienna', 'Hotel_Name_Abba Sants', 'Hotel_Na

### Text Preprocessing: Tokenization and Cleaning

In this step, I clean and tokenize the training reviews, as described in point 2.a of the written exam.

This preprocessing prepares the review text for vocabulary construction and ensures consistent formatting across the dataset.

**CHANGE**: In addition to punctuation removal, I also removed non-alphabetic tokens.

**Motivation**: This was done to exclude tokens like numbers, dates, or mixed alphanumeric strings (e.g., "room123") that are unlikely to contribute semantic value and could introduce noise. While this step was not explicitly mentioned in the written solution, it helps clean the vocabulary for a more meaningful embedding space.


In [None]:
# Define function to clean and tokenize a review
def get_words(text):
    text = str(text).replace('--', ' ')  # Normalize double hyphens
    words = text.split()
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in words]           # Remove punctuation
    words = [word for word in words if word.isalpha()]    # Keep only alphabetic tokens
    words = [word.lower() for word in words]              # Convert to lowercase
    return words

# Apply tokenization and cleaning only on training reviews
tokenized_train = X_text_train.apply(get_words)
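
To see what the cleaning does in practice, the following standalone sketch condenses the same steps into one function and applies it to a made-up sentence (the sample text is hypothetical).

```python
import string

def get_words(text):
    # Same steps as the notebook's function, condensed
    text = str(text).replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in text.split()]
    return [w.lower() for w in words if w.isalpha()]

sample = "The room--number 12B--was great, wasn't it?"
tokens = get_words(sample)
print(tokens)
```

Two side effects are visible: the mixed token `12B` is dropped by the alphabetic filter, and the contraction `wasn't` collapses to `wasnt` because the apostrophe is removed together with the other punctuation.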

### Vocabulary Construction and Word Index Mapping

From the cleaned training tokens, I build the full vocabulary used to index the review words:

- I count the frequency of all words.
- I do not apply any limit on the vocabulary size, so all words found in the training set are included.
- I create a word-to-index dictionary (word2idx) where:
  - Indexing starts at 1 for known words.
  - Index 0 is reserved for unknown words (UNK), ensuring robust handling of unseen tokens in val/test sets. Since `pad_sequences` also pads with 0 by default, UNK and padding share the same index.

**CHANGE**: I introduced a special token UNK to represent unknown words not seen during training.

**Motivation**: In the written solution, I described building a vocabulary from the training data, but I did not specify how to handle words that appear only in the validation or test sets. To ensure that the model can handle such cases gracefully and avoid indexing errors, I map unknown words to a dedicated index (0), reserved for UNK.
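
A minimal sketch of the vocabulary construction and of the UNK fallback, on two hypothetical tokenized reviews:

```python
from collections import Counter

# Toy tokenized training reviews (hypothetical)
tokenized_train = [["good", "room"], ["good", "staff"]]

# Count words and index them by decreasing frequency, starting at 1
word_counts = Counter(w for tokens in tokenized_train for w in tokens)
word2idx = {w: i + 1 for i, (w, _) in enumerate(word_counts.most_common())}
word2idx["<UNK>"] = 0  # index 0 is reserved for unknown words

# Words outside the training vocabulary fall back to index 0
ids = [word2idx.get(w, 0) for w in ["good", "pool", "staff"]]
print(word2idx)
print(ids)
```

The word `pool` never appears in the toy training data, so it is mapped to 0 rather than raising a `KeyError`.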


In [None]:
# Flatten the list of token lists into one list of all words
all_words = [word for tokens in tokenized_train for word in tokens]

# Count frequencies of all words
word_counts = Counter(all_words)

# Build vocabulary using all words found
most_common_words = word_counts.most_common()

# Create word-to-index dictionary (index starts at 1, 0 reserved for unknowns)
word2idx = {word: idx + 1 for idx, (word, _) in enumerate(most_common_words)}
word2idx["<UNK>"] = 0  # Special token for unknown words


### Convert Tokens to Integer Sequences

I convert the cleaned review text into sequences of word indices using the word2idx dictionary:

- For the training set, I use the already tokenized list.
- For validation and test sets, I reapply the same tokenization function (get_words()).
- Words not found in the training vocabulary are replaced with the index 0 (UNK).

This transformation is required to prepare the input for the embedding layer.

In [None]:
# Function to map a token list to its corresponding index list
def tokens_to_ids(tokens_list, word2idx):
    return [word2idx.get(word, 0) for word in tokens_list]  # Return 0 if word not found

# Apply conversion to all datasets
X_text_train_ids = tokenized_train.apply(lambda tokens: tokens_to_ids(tokens, word2idx))
X_text_val_ids = X_text_val.apply(lambda text: tokens_to_ids(get_words(text), word2idx))
X_text_test_ids = X_text_test.apply(lambda text: tokens_to_ids(get_words(text), word2idx))

### Sequence Padding for BiLSTM Input

**CHANGE**: I applied padding to the review sequences using a fixed maximum length (max_seq_len = 100).

**Motivation**: While the written solution assumes that reviews are represented as sequences of word indices, it did not specify that those sequences must be of uniform length. However, padding is required for compatibility with neural network layers like Embedding and BiLSTM, which expect inputs of equal shape. I used post-padding and truncation to preserve the initial part of the reviews while ensuring computational efficiency.
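
The effect of post-padding and post-truncation can be sketched in plain Python. The helper `pad_post` is hypothetical and not part of Keras; it only mimics what `pad_sequences(..., padding='post', truncating='post')` is expected to produce.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad or truncate each sequence at the end so all have length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # post-truncation: drop the tail
        padded.append(seq + [value] * (maxlen - len(seq)))   # post-padding: fill the tail
    return padded

result = pad_post([[1, 2, 3, 4, 5], [6, 7]], maxlen=4)
print(result)
```

The long sequence keeps its first four indices, while the short one is filled with zeros at the end, so every row has the same length.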


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Set a fixed maximum sequence length
max_seq_len = 100

# Pad and truncate sequences to uniform length (post-padding)
X_text_train_pad = pad_sequences(X_text_train_ids, maxlen=max_seq_len, padding='post', truncating='post')
X_text_val_pad = pad_sequences(X_text_val_ids, maxlen=max_seq_len, padding='post', truncating='post')
X_text_test_pad = pad_sequences(X_text_test_ids, maxlen=max_seq_len, padding='post', truncating='post')

# 3 - Output, 4 - Loss, 5 - Model Configuration


This section represents the following three parts of the written exam:

3. OUTPUT: How would you design the output layer and why;

4. LOSS: Which loss function and labels would you use to train your model and why;

5. MODEL CONFIGURATION

   a) Model composition (composition of layers, regardless their number, or their dimension)  
   b) How do you intend to configure your model to optimize your choice, mention a few hyperparameters most relevant in your opinion.


### Model Architecture: Text + Structured Fusion Model

In this cell, I define the full model architecture, as described in point 1 and point 5 of my written exam.

The model follows a dual-branch architecture:

- Text input branch:  
  A sequence of word indices representing the review is passed through an Embedding layer and then processed by a Bidirectional LSTM, as described in point 1. This branch is designed to capture the contextual and sequential structure of the review text.

- Structured input branch:  
  A second input receives the structured features, as specified in point 2.b, and passes them directly to the dense layers after concatenation.

These two branches are then concatenated and passed through an MLP with:
- a dense layer,
- dropout (for regularization),
- and batch normalization (for training stability), as outlined in point 5.

The final output is a single neuron with sigmoid activation, predicting a value in [0,1], which corresponds to the normalized `Review_Score` (see point 3).

The model is compiled with MSE loss, consistent with the regression objective described in point 4.
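
To make the tensor sizes concrete, the following back-of-the-envelope sketch computes the width of the merged vector and a few parameter counts. It assumes the default arguments of `create_model`, the structured width of 1303 reported earlier, and the default behaviour of `Bidirectional`, which concatenates the forward and backward outputs.

```python
# Default sizes from create_model and the structured matrix shape reported above
embedding_dim = 50
lstm_units = 64
dense_units = 64
structured_dim = 1303

# Bidirectional concatenates the forward and backward LSTM outputs
bilstm_out = 2 * lstm_units

# One LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_params

# Width of the concatenated vector entering the MLP, and size of the first Dense layer
concat_dim = bilstm_out + structured_dim
dense_params = concat_dim * dense_units + dense_units

print(bilstm_out, bilstm_params, concat_dim, dense_params)
```

With these defaults, 1303 of the 1431 values entering the MLP come from the structured branch, almost all of them one-hot hotel columns, so the text representation is a small part of the merged vector.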


In [None]:
# Define a dual-input model combining textual and structured information
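def create_model(embedding_dim=50, lstm_units=64,
                 dense_units=64, dropout_rate=0.2,
                 output_activation='sigmoid',
                 loss_fn='mse', metrics_list=None):
    # Avoid a mutable default argument; fall back to MSE as the only metric
    if metrics_list is None:
        metrics_list = ['mse']

    # Define model input dimensions from preprocessed data
    vocab_size = len(word2idx) + 1           # Indices run from 0 (UNK/padding) to len(word2idx) - 1; the +1 leaves one spare row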
    input_seq_len = X_text_train_pad.shape[1]  # Fixed sequence length for text
    structured_input_dim = X_struct_train_final.shape[1]  # Number of structured features

    # Text branch: Embedding + BiLSTM
    text_input = Input(shape=(input_seq_len,), name='text_input')
    embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim, mask_zero=False)(text_input)
    bilstm = Bidirectional(LSTM(lstm_units))(embedding)

    # Structured branch: direct dense input
    structured_input = Input(shape=(structured_input_dim,), name='structured_input')

    # Merge both branches and apply MLP
    combined = Concatenate()([bilstm, structured_input])
    x = Dense(dense_units, activation='sigmoid')(combined)
    x = Dropout(dropout_rate)(x)
    x = BatchNormalization()(x)

    # Output layer: single neuron for score prediction in [0,1]
    output = Dense(1, activation=output_activation)(x)

    # Compile model with specified loss and metrics
    model = Model(inputs=[text_input, structured_input], outputs=output)
    model.compile(optimizer='adam', loss=loss_fn, metrics=metrics_list)

    return model

# Build and inspect the model architecture
test_model = create_model()
test_model.summary()


### Define Hyperparameter Grid and Sample Configurations

In this cell, I define the hyperparameter grid for tuning the model architecture and training setup.
The hyperparameters I used are those I proposed to tune in point 5.b of the written exam (I decided not to include n_layers for either the BiLSTM or the MLP, to avoid a high number of combinations).

**CHANGE**: Although I initially intended to use GridSearchCV for hyperparameter tuning (as stated in point 6 of the written exam), I encountered several practical issues during implementation:

- KerasRegressor compatibility: The scikit-learn wrapper KerasRegressor does not support models with multiple inputs, such as the combination of text and structured features used in this architecture.

- Multi-input data incompatibility: GridSearchCV expects a single input array X, whereas my model requires two separate inputs (`[X_text, X_structured]`), which cannot be handled natively through the scikit-learn interface.

- Version conflicts: Incompatibilities between scikit-learn, tensorflow, and keras versions introduced instability and errors when using wrapper classes.

Due to these limitations, I replaced GridSearchCV with a manual randomized search, which preserves full control over model construction and input formatting.  
To keep the execution time reasonable in the Colab environment, I randomly selected 5 of the 24 combinations in the parameter grid instead of evaluating all of them.
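
As an illustration of the sampling idea, the sketch below builds every combination as a dictionary, so that each value stays attached to its hyperparameter name, and then draws 5 of them. It is a variant for illustration, not the notebook's exact code: the variable names `keys`, `rng` and `sampled` are hypothetical, and the drawn configurations are not claimed to match the ones evaluated in this notebook.

```python
import random
from itertools import product

param_grid = {
    'embedding_dim': [50, 100, 150],
    'n_units': [16, 32],
    'dropout_rate': [0.1, 0.2],
    'lr': [0.0001, 0.0005],
    'batch_size': [64],
    'epochs': [2],
}

# Build every combination as a dict so each value stays attached to its name
keys = list(param_grid)
all_combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]
print(len(all_combos))  # 3 * 2 * 2 * 2 * 1 * 1 combinations

# Draw 5 distinct configurations with a dedicated, seeded generator
rng = random.Random(42)
sampled = rng.sample(all_combos, 5)
for cfg in sampled:
    print(cfg)
```

Working with dictionaries avoids relying on the positional order of the tuple when unpacking a configuration.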



In [None]:
# Define the search space for hyperparameter tuning
param_grid = {
    'embedding_dim': [50, 100, 150],
    'n_units': [16, 32],
    'dropout_rate': [0.1, 0.2],
    'lr': [0.0001, 0.0005],
    'batch_size': [64],
    'epochs': [2]
}

# Generate all possible hyperparameter combinations
all_combos = list(product(
    param_grid['dropout_rate'],
    param_grid['embedding_dim'],
    param_grid['lr'],
    param_grid['n_units'],
    param_grid['batch_size'],
    param_grid['epochs']
))

# Sample a fixed number of combinations for efficiency (random subset)
random.seed(42)
sampled_combos = random.sample(all_combos, 5)

### Initialize K-Fold Cross-Validation and Score Tracking

In this cell, I define a 2-fold cross-validation strategy using KFold.  
This replaces the fixed validation split originally proposed in point 6 of the written exam.

**CHANGE**: Since I could not use GridSearchCV due to compatibility issues with multi-input Keras models, I implemented my own manual search strategy. As part of this, I replaced the fixed validation split with K-Fold cross-validation.

**Motivation**:
- K-Fold provides a more robust estimate of model performance by validating on multiple folds.
- It ensures that every sample is used for both training and validation, making better use of the available data.
- This compensates for the loss of internal cross-validation that GridSearchCV would normally handle.

This setup ensures that my manual tuning loop has a reliable and reproducible validation structure.
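
The key point for a multi-input model is that one set of fold indices must slice every input and the target consistently. A minimal sketch on toy arrays (hypothetical stand-ins for the padded text, the structured matrix, and the scaled target):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-ins for the two model inputs and the target (same row order)
X_text = np.arange(12).reshape(6, 2)
X_struct = np.arange(18).reshape(6, 3)
y = np.linspace(0.0, 1.0, 6)

kf = KFold(n_splits=2, shuffle=True, random_state=42)
seen_val = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_text), start=1):
    # The same index arrays slice every input, keeping rows aligned
    text_tr, struct_tr, y_tr = X_text[train_idx], X_struct[train_idx], y[train_idx]
    text_va, struct_va, y_va = X_text[val_idx], X_struct[val_idx], y[val_idx]
    seen_val.extend(val_idx.tolist())
    print(f"Fold {fold}: train rows {train_idx.tolist()}, val rows {val_idx.tolist()}")

# Each row is used for validation exactly once across the folds
assert sorted(seen_val) == list(range(6))
```

Because `kf.split` only needs the number of rows, it can be called on either input; the resulting indices are then reused for all arrays.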


In [None]:
# Initialize K-Fold cross-validation and tracking variables
kf = KFold(n_splits=2, shuffle=True, random_state=42)
best_score = float('inf')
best_params = None
all_val_scores = []

### Train and Evaluate Sampled Hyperparameter Configurations

In this cell, I loop over the 5 randomly sampled hyperparameter combinations and evaluate each using 2-fold cross-validation:

- For each fold, I split the training data, train the model, and evaluate it on the validation fold.
- The average validation MSE across folds is used to score the configuration.
- The configuration with the lowest average MSE is stored as the best.

This process simulates the behavior of GridSearchCV but is implemented manually to support the custom multi-input model.

While the evaluation setup is different from the one originally planned, the overall intent to reliably select the best hyperparameters based on validation performance is fully preserved.


In [None]:
# Evaluate sampled hyperparameter sets
for combo_count, combo in enumerate(sampled_combos, start=1):
    dropout_rate, embedding_dim, lr, n_units, batch_size, epochs = combo
    print(f"\n=== Combo {combo_count} ===")
    print(f"Embedding: {embedding_dim}, Units: {n_units}, Dropout: {dropout_rate}, LR: {lr}, Batch Size: {batch_size}, Epochs: {epochs}")

    fold_scores = []

    for train_index, val_index in kf.split(X_text_train_pad):
        # Prepare training and validation splits for this fold
        X_text_fold_train = X_text_train_pad[train_index]
        X_struct_fold_train = X_struct_train_final[train_index]
        y_fold_train = y_train_scaled[train_index]

        X_text_fold_val = X_text_train_pad[val_index]
        X_struct_fold_val = X_struct_train_final[val_index]
        y_fold_val = y_train_scaled[val_index]

        # Build model with current hyperparameters
        model = create_model(
            embedding_dim=embedding_dim,
            lstm_units=n_units,
            dropout_rate=dropout_rate,
            output_activation='sigmoid',
            loss_fn='mse',
            metrics_list=['mse']
        )
        model.optimizer.learning_rate.assign(lr)

        # Train model
        model.fit(
            [X_text_fold_train, X_struct_fold_train], y_fold_train,
            validation_data=([X_text_fold_val, X_struct_fold_val], y_fold_val),
            epochs=epochs,
            batch_size=batch_size,
            verbose=0
        )

        # Evaluate model performance on validation fold
        val_loss, val_mse = model.evaluate([X_text_fold_val, X_struct_fold_val], y_fold_val, verbose=0)
        fold_scores.append(val_mse)

    # Store average score for current hyperparameter combo
    avg_score = np.mean(fold_scores)
    all_val_scores.append((combo_count, avg_score))

    print(f"Average validation MSE: {avg_score:.4f}")

    # Track best-performing configuration
    if avg_score < best_score:
        best_score = avg_score
        best_params = {
            'embedding_dim': embedding_dim,
            'n_units': n_units,
            'dropout_rate': dropout_rate,
            'lr': lr,
            'batch_size': batch_size,
            'epochs': epochs
        }

# Print the best hyperparameter configuration and its score
print("\n=== Best Hyperparameters ===")
print(best_params)
print(f"Best Validation MSE: {best_score:.4f}")


=== Combo 1 ===
Embedding: 150, Units: 16, Dropout: 0.2, LR: 0.0001, Batch Size: 64, Epochs: 2
Average validation MSE: 0.0531

=== Combo 2 ===
Embedding: 50, Units: 32, Dropout: 0.1, LR: 0.0005, Batch Size: 64, Epochs: 2
Average validation MSE: 0.0747

=== Combo 3 ===
Embedding: 50, Units: 16, Dropout: 0.1, LR: 0.0001, Batch Size: 64, Epochs: 2
Average validation MSE: 0.0766

=== Combo 4 ===
Embedding: 150, Units: 32, Dropout: 0.2, LR: 0.0005, Batch Size: 64, Epochs: 2
Average validation MSE: 0.0944

=== Combo 5 ===
Embedding: 150, Units: 16, Dropout: 0.1, LR: 0.0001, Batch Size: 64, Epochs: 2
Average validation MSE: 0.0579

=== Best Hyperparameters ===
{'embedding_dim': 150, 'n_units': 16, 'dropout_rate': 0.2, 'lr': 0.0001, 'batch_size': 64, 'epochs': 2}
Best Validation MSE: 0.0531


### Initialize Final Model with Best Hyperparameters

In this cell, I instantiate the final model using the best-performing hyperparameter configuration found during the randomized search.

The model is built using the `create_model` function, passing in:
- `embedding_dim`, `lstm_units`, and `dropout_rate` for architecture
- `output_activation='sigmoid'` to ensure the output stays in [0,1] as specified in point 3 of the written exam
- `loss_fn='mse'` and `metrics_list=['mse']`, as specified in point 4 of the written exam

In [None]:
# Initialize final model with best hyperparameters
print("\nTraining final model with best hyperparameters...")

# Build model using the best parameters found during tuning
final_model = create_model(
    embedding_dim=best_params['embedding_dim'],
    lstm_units=best_params['n_units'],
    dropout_rate=best_params['dropout_rate'],
    output_activation='sigmoid',
    loss_fn='mse',
    metrics_list=['mse']
)

# Manually assign the best learning rate to the optimizer
final_model.optimizer.learning_rate.assign(best_params['lr'])


Training final model with best hyperparameters...


<tf.Tensor: shape=(), dtype=float32, numpy=9.999999747378752e-05>

### Train Final Model and Evaluate on Test Set

In this step, I train the model on the full training set and evaluate it on the held-out test set.

- The model is trained using the full preprocessed training data (`[X_text_train_pad, X_struct_train_final]`), with the validation set used for monitoring.
- After training, the model is evaluated on the test set to measure generalization performance.

This step corresponds to point 6 of the written exam.


In [None]:
# Train the model on all training data, using validation set for monitoring
history = final_model.fit(
    [X_text_train_pad, X_struct_train_final], y_train_scaled,
    validation_data=([X_text_val_pad, X_struct_val_final], y_val_scaled),
    epochs=best_params['epochs'],
    batch_size=best_params['batch_size'],
    verbose=1
)

# Evaluate model performance on the held-out test set
test_loss, test_mse = final_model.evaluate([X_text_test_pad, X_struct_test_final], y_test_scaled, verbose=1)
print(f"\nTest MSE: {test_mse:.4f}")

Epoch 1/2
151/151 ━━━━━━━━━━━━━━━━━━━━ 24s 120ms/step - loss: 0.1533 - mse: 0.1533 - val_loss: 0.0518 - val_mse: 0.0518
Epoch 2/2
151/151 ━━━━━━━━━━━━━━━━━━━━ 17s 115ms/step - loss: 0.1159 - mse: 0.1159 - val_loss: 0.0319 - val_mse: 0.0319
65/65 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 0.0336 - mse: 0.0336

Test MSE: 0.0331


### Predict and Display Test Set Results

In this final step, I use the trained model to make predictions on the test set and inspect a few individual results.

- Predictions are made on the test inputs and scaled back from [0,1] to the original [0,10] review score range, as stated in points 3 and 5.b of the written exam.
- A small number of prediction/ground-truth pairs are printed for manual inspection.

This step is aligned with point 6 of the written exam: it helps verify how well the model generalizes and whether the predictions are numerically reasonable.
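
Since the loss was computed on targets divided by 10, the reported test MSE can be converted back to the original score scale with simple arithmetic. The sketch below uses the test MSE printed above; squared errors scale with the square of the rescaling factor.

```python
import math

test_mse_scaled = 0.0331   # test MSE reported above, on targets in [0, 1]
scale = 10                 # factor used to map scores back to [0, 10]

# Squared errors scale with the square of the factor
test_mse_original = test_mse_scaled * scale ** 2
test_rmse_original = math.sqrt(test_mse_original)

print(f"MSE on the 0-10 scale:  {test_mse_original:.2f}")
print(f"RMSE on the 0-10 scale: {test_rmse_original:.2f}")
```

Read this way, the model's typical error is a bit under two points on the 0-10 scale, which is consistent with the size of the gaps in the printed examples.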


In [None]:
# Predict scaled outputs on the test set
y_pred_scaled = final_model.predict([X_text_test_pad, X_struct_test_final])

# Rescale predictions and ground truth to original [0,10] scale
y_pred = (y_pred_scaled * 10).flatten()
y_true = (y_test_scaled * 10).flatten()

# Print a few prediction results for inspection
for i in range(10):
    print(f"Review {i+1}: predicted = {y_pred[i]:.2f}, actual = {y_true[i]:.2f}")

65/65 ━━━━━━━━━━━━━━━━━━━━ 2s 22ms/step
Review 1: predicted = 7.81, actual = 7.10
Review 2: predicted = 8.28, actual = 6.30
Review 3: predicted = 8.36, actual = 5.80
Review 4: predicted = 6.26, actual = 5.80
Review 5: predicted = 7.62, actual = 6.30
Review 6: predicted = 4.51, actual = 6.30
Review 7: predicted = 4.65, actual = 4.20
Review 8: predicted = 5.70, actual = 6.70
Review 9: predicted = 5.70, actual = 6.70
Review 10: predicted = 4.05, actual = 2.50
