In this notebook, we will train a second-level Logistic Regression model using descriptive text features along with the predictions from the three first-level models: DeBERTa, LSTM, and XGBoost. The holdout dataset will be used for training the second-level model, and the final model's performance will be assessed on the test set.

In [1]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.0.0-py3-none-any.whl.metadata (16 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.3-py3-none-any.whl.metadata (7.4 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.5-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.0.0-py3-none-any.whl (362 kB)
Downloading alembic-1.13.3-py3-none-any.whl (233 kB)
Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Downloading Mako-1.3.5-py3-none-any.whl (78 kB)
Installing collected packages: Ma

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import json
import pickle
import os
import joblib
from joblib import dump, load

import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, log_loss
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import optuna

from tqdm import tqdm
from collections import Counter

In [4]:
pd.set_option('display.max_columns', None)

In [5]:
BASIC_PATH = '/content/gdrive/MyDrive/ML/projects/feedback-prize/'
PREDICTION_DATASETS = '1st_level_preds/'
MODEL_PATH = '2nd_level_models/'

We will now load the previously saved predictions from the DeBERTa, LSTM, and XGBoost models, and combine them with the descriptive text features. We'll then split the data into training (holdout data), validation, and test sets for training the second-level Logistic Regression model.

In [6]:
holdout_deberta_preds = pd.read_csv(BASIC_PATH+PREDICTION_DATASETS+'holdout_1st_level_deberta_preds.csv')
holdout_lstm_preds = pd.read_csv(BASIC_PATH+PREDICTION_DATASETS+'holdout_1st_level_lstm_preds.csv')
holdout_xgb_preds = pd.read_csv(BASIC_PATH+PREDICTION_DATASETS+'holdout_1st_level_xgb_preds.csv')

test_deberta_preds = pd.read_csv(BASIC_PATH+PREDICTION_DATASETS+'test_1st_level_deberta_preds.csv')
test_lstm_preds = pd.read_csv(BASIC_PATH+PREDICTION_DATASETS+'test_1st_level_lstm_preds.csv')
test_xgb_preds = pd.read_csv(BASIC_PATH+PREDICTION_DATASETS+'test_1st_level_xgb_preds.csv')

In [7]:
holdout_deberta_preds.drop(['essay_id', 'target'], axis = 1, inplace = True)
holdout_lstm_preds.drop(['essay_id', 'target'], axis = 1, inplace = True)

test_deberta_preds.drop(['essay_id', 'target'], axis = 1, inplace = True)
test_lstm_preds.drop(['essay_id', 'target'], axis = 1, inplace = True)

In [8]:
train_df = holdout_deberta_preds.\
merge(holdout_lstm_preds, on = 'discourse_id', how = 'left').\
merge(holdout_xgb_preds, on = 'discourse_id', how = 'left')

test_df = test_deberta_preds.\
merge(test_lstm_preds, on = 'discourse_id', how = 'left').\
merge(test_xgb_preds, on = 'discourse_id', how = 'left')
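
A left merge on discourse_id silently duplicates rows if a key repeats and leaves NaNs if a key is missing. The sketch below shows one way to guard against both with pandas' validate and indicator options. The toy frames and their values are hypothetical stand-ins for the prediction files, not the real data.

```python
import pandas as pd

# Toy stand-ins for two first-level prediction files (hypothetical values).
deberta = pd.DataFrame({"discourse_id": ["a", "b", "c"],
                        "1st_level_deberta_preds": [0, 1, 2]})
lstm = pd.DataFrame({"discourse_id": ["a", "b", "c"],
                     "1st_level_lstm_preds": [0, 1, 1]})

# validate="one_to_one" raises MergeError if discourse_id repeats on either side;
# indicator=True adds a "_merge" column recording where each row was found.
merged = deberta.merge(lstm, on="discourse_id", how="left",
                       validate="one_to_one", indicator=True)

# Count rows of the left frame that found no partner on the right.
unmatched = int((merged["_merge"] != "both").sum())
print(f"rows: {len(merged)}, unmatched: {unmatched}")

# Drop the helper column before further processing.
merged = merged.drop(columns="_merge")
```

The same two arguments could be added to the merge calls above if the prediction files are expected to contain exactly one row per discourse_id.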

In [9]:
COLS_TO_DROP = ['discourse_id', 'essay_id', 'target']
TARGET = 'target'
CAT_FEATURES = ['1st_level_deberta_preds', '1st_level_lstm_preds', '1st_level_xgb_preds', 'discourse_type']
NUM_FEATURES = ['discourse_len', 'essay_len']
OTHER_FEATURES = ['discourse_num_long_words', 'discourse_num_short_words', 'discourse_noun_count',
                  'discourse_adj_count', 'discourse_pnoun_count', 'essay_num_long_words',
                  'essay_num_short_words', 'essay_noun_count', 'essay_adj_count', 'essay_pnoun_count']

We will set aside 20% of the essay_ids in test_df as a separate validation dataset. Splitting by essay_id keeps all discourse elements of an essay in the same split. This validation set will be used for evaluation during hyperparameter optimization.

In [10]:
test_ids, validation_ids = train_test_split(test_df['essay_id'].unique(), test_size = 0.2, random_state = 77)

In [11]:
validation_df = test_df[test_df['essay_id'].isin(validation_ids)].copy()
test_df = test_df[test_df['essay_id'].isin(test_ids)].copy()

validation_df.reset_index(drop = True, inplace = True)
test_df.reset_index(drop = True, inplace = True)
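
The split above works on unique essay ids rather than on rows. A minimal sketch on a hypothetical toy frame shows the idea and checks that no essay ends up in both parts. The column values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: several discourse rows per essay (hypothetical ids).
df = pd.DataFrame({
    "essay_id": ["e1", "e1", "e2", "e2", "e3", "e4", "e4", "e5", "e5", "e5"],
    "discourse_id": list(range(10)),
})

# Split the unique essay ids, not the rows, so an essay lands in exactly one part.
keep_ids, val_ids = train_test_split(
    df["essay_id"].unique(), test_size=0.2, random_state=77
)

val_part = df[df["essay_id"].isin(val_ids)].reset_index(drop=True)
test_part = df[df["essay_id"].isin(keep_ids)].reset_index(drop=True)

# No essay may appear in both parts, and no row may be lost.
overlap = set(val_part["essay_id"]) & set(test_part["essay_id"])
assert not overlap
assert len(val_part) + len(test_part) == len(df)
```

Because the fraction applies to essays, the share of rows in each part can differ from 20% when essays contain different numbers of discourse elements.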

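Separate the target from the features, then one-hot encode the categorical columns and standardize the numerical ones. The remaining columns are passed through unchanged.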

In [12]:
y_train = train_df[TARGET]
y_val = validation_df[TARGET]
y_test = test_df[TARGET]

train_df.drop(COLS_TO_DROP, axis = 1, inplace = True)
validation_df.drop(COLS_TO_DROP, axis = 1, inplace = True)
test_df.drop(COLS_TO_DROP, axis = 1, inplace = True)

In [13]:
preprocessor = ColumnTransformer(
    transformers = [
        ('cat', OneHotEncoder(drop = 'first'), CAT_FEATURES),
        ('num', StandardScaler(), NUM_FEATURES)
        ],
    remainder = 'passthrough')

In [14]:
X_train_transformed = preprocessor.fit_transform(train_df)
X_validation_transformed = preprocessor.transform(validation_df)

In [15]:
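# Rebuild the column names in the transformer's output order: one-hot columns,
# then scaled numerical columns, then the passthrough columns.
# This assumes the passthrough columns appear in the DataFrame in the order of OTHER_FEATURES.
transformed_columns = (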
    preprocessor
    .transformers_[0][1]
    .get_feature_names_out(CAT_FEATURES).tolist() +
    NUM_FEATURES + OTHER_FEATURES
)

In [16]:
pd.DataFrame(X_validation_transformed, columns = transformed_columns)

Unnamed: 0,1st_level_deberta_preds_1,1st_level_deberta_preds_2,1st_level_lstm_preds_1,1st_level_lstm_preds_2,1st_level_xgb_preds_1,1st_level_xgb_preds_2,discourse_type_Concluding Statement,discourse_type_Counterclaim,discourse_type_Evidence,discourse_type_Lead,discourse_type_Position,discourse_type_Rebuttal,discourse_len,essay_len,discourse_num_long_words,discourse_num_short_words,discourse_noun_count,discourse_adj_count,discourse_pnoun_count,essay_num_long_words,essay_num_short_words,essay_noun_count,essay_adj_count,essay_pnoun_count
0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.906488,2.133127,0.211382,0.471545,0.317073,0.105691,0.0,0.213421,0.525853,0.243124,0.088009,0.005501
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.566913,2.133127,0.294118,0.411765,0.352941,0.176471,0.0,0.213421,0.525853,0.243124,0.088009,0.005501
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.610168,2.133127,0.250000,0.312500,0.187500,0.062500,0.0,0.213421,0.525853,0.243124,0.088009,0.005501
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.692746,2.133127,0.250000,0.583333,0.416667,0.083333,0.0,0.213421,0.525853,0.243124,0.088009,0.005501
4,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,6.652744,2.133127,0.180791,0.559322,0.197740,0.064972,0.0,0.213421,0.525853,0.243124,0.088009,0.005501
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1516,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.857901,1.669537,0.250000,0.250000,0.500000,0.250000,0.0,0.131579,0.591533,0.141876,0.094966,0.009153
1517,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.161889,1.669537,0.121951,0.634146,0.121951,0.097561,0.0,0.131579,0.591533,0.141876,0.094966,0.009153
1518,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.775324,1.669537,0.111111,0.444444,0.222222,0.222222,0.0,0.131579,0.591533,0.141876,0.094966,0.009153
1519,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.321781,1.669537,0.142857,0.555556,0.126984,0.095238,0.0,0.131579,0.591533,0.141876,0.094966,0.009153


Configure Optuna optimization and initiate the search for optimal hyperparameters.

In [17]:
def objective(trial):

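    # Search space: penalty type, inverse regularization strength C (log scale),
    # and the elastic-net mixing ratio.
    penalty = trial.suggest_categorical('penalty', ['l2', 'elasticnet', 'l1'])
    C = trial.suggest_float('C', 1e-4, 1e2, log = True)
    # Note: l1_ratio only takes effect when penalty == 'elasticnet'; it is sampled for every trial here.
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)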

    model = LogisticRegression(
        penalty = penalty,
        C = C,
        l1_ratio = l1_ratio,
        solver = 'saga',
        max_iter = 1000,
        random_state = 97
    )

    # Fit on the holdout features and score class probabilities on the validation split.
    model.fit(X_train_transformed, y_train)
    val_preds = model.predict_proba(X_validation_transformed)

    return log_loss(y_val, val_preds)
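
In the objective above, l1_ratio is sampled for every trial, although scikit-learn uses it only with penalty = 'elasticnet'. The helper below is a hypothetical sketch (the name logreg_kwargs is not part of the notebook) that builds the keyword arguments so that l1_ratio is passed only where it matters. Inside an Optuna objective, the corresponding suggest_float call could likewise be placed in the elasticnet branch, since Optuna allows conditional search spaces.

```python
def logreg_kwargs(penalty, C, l1_ratio=None):
    """Build LogisticRegression keyword arguments (hypothetical helper).

    l1_ratio is included only for the elastic-net penalty, the one case
    in which it influences the fit.
    """
    kwargs = {
        "penalty": penalty,
        "C": C,
        "solver": "saga",       # saga supports l1, l2 and elasticnet
        "max_iter": 1000,
        "random_state": 97,
    }
    if penalty == "elasticnet":
        if l1_ratio is None:
            raise ValueError("l1_ratio is required when penalty='elasticnet'")
        kwargs["l1_ratio"] = l1_ratio
    return kwargs


# For 'l1' the mixing ratio is dropped; for 'elasticnet' it is kept.
l1_args = logreg_kwargs("l1", 0.5, l1_ratio=0.9)
enet_args = logreg_kwargs("elasticnet", 0.5, l1_ratio=0.9)
print(l1_args)
print(enet_args)
```

The resulting dictionary can be unpacked into the constructor, for example LogisticRegression(**logreg_kwargs(penalty, C, l1_ratio)).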

In [18]:
study = optuna.create_study(direction = 'minimize', study_name = 'LogReg parameters')

[I 2024-10-08 13:14:54,297] A new study created in memory with name: LogReg parameters


In [19]:
def callback(study, trial):
    # Advance the shared tqdm progress bar (pbar, created in the next cell) after each trial.
    pbar.update(1)

In [20]:
N_trials = 200

with tqdm(total = N_trials, desc = "Optuna Optimization", dynamic_ncols = True, bar_format = '{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]') as pbar:
    study.optimize(objective, n_trials = N_trials, callbacks = [callback])

[I 2024-10-08 13:15:10,920] Trial 0 finished with value: 0.7132072657428062 and parameters: {'penalty': 'elasticnet', 'C': 85.90915510729018, 'l1_ratio': 0.16321712364000374}. Best is trial 0 with value: 0.7132072657428062.
[I 2024-10-08 13:15:11,013] Trial 1 finished with value: 0.7276195059954488 and parameters: {'penalty': 'l1', 'C': 0.011570432497351835, 'l1_ratio': 0.6837687815599354}. Best is trial 0 with value: 0.7132072657428062.
[I 2024-10-08 13:15:17,386] Trial 2 finished with value: 0.7119283818750393 and parameters: {'penalty': 'l2', 'C': 11.698876475697332, 'l1_ratio': 0.8694673409936721}. Best is trial 2 with value: 0.7119283818750393.
[I 2024-10-08 13:15:17,609] Trial 3 finished with value: 0.7091926291241677 and parameters: {'penalty': 'l1', 'C': 0.13709858927801774, 'l1_ratio': 0.6229429841101539}. Best is trial 3 with value: 0.7091926291241677.
Optuna Optimization:   2%|▏         | 4/200 [00:20<12:56]
[I 2024-10-08 13:15:17,973] Trial 4 finished with value: 0.708279258

In [21]:
print("Best trial:")
trial = study.best_trial

print(f"  Value: {trial.value}")
print("  Params:")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

Best trial:
  Value: 0.707178574793981
  Params:
    penalty: l1
    C: 0.5276546058877624
    l1_ratio: 0.9953593630180491


Train the Logistic Regression model using the best hyperparameters found by Optuna, then save the trained model for future use.

In [22]:
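# Best hyperparameters from the Optuna study above.
# With penalty = 'l1', l1_ratio has no effect; it is kept only to mirror the best trial.
logreg_model = LogisticRegression(
    penalty = 'l1',
    C = 0.5276546058877624,
    solver = 'saga',
    l1_ratio = 0.9953593630180491,
    max_iter = 1000,
    random_state = 97
)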

logreg_model.fit(X_train_transformed, y_train)



In [23]:
pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('logreg', logreg_model)
])

In [24]:
dump(pipeline, BASIC_PATH+MODEL_PATH+'logistic_regression_pipeline_with_model.joblib')

['/content/gdrive/MyDrive/ML/projects/feedback-prize/2nd_level_models/logistic_regression_pipeline_with_model.joblib']
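
Saving the preprocessor and the model as one pipeline means raw feature frames can be passed straight to predict_proba after loading. The sketch below illustrates the save and load round trip on synthetic data, with a temporary directory standing in for the Drive path. The data, step names, and file name are hypothetical, and the pipeline here is fitted as one unit rather than assembled from fitted parts.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit preprocessing and model together as one pipeline.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Round trip through a temporary file instead of a mounted Drive folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(pipe, path)
    restored = load(path)

# The restored pipeline should reproduce the original probabilities exactly.
same = np.array_equal(pipe.predict_proba(X), restored.predict_proba(X))
print("identical predictions:", same)
```

Joblib files are generally tied to the library versions that wrote them, so it is worth recording the scikit-learn version next to the saved file.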

Load the pipeline and evaluate the final metrics on the test set.

In [25]:
logreg_pipeline = load(BASIC_PATH+MODEL_PATH+'logistic_regression_pipeline_with_model.joblib')

In [26]:
test_probs = logreg_pipeline.predict_proba(test_df)
test_preds = test_probs.argmax(-1)
Counter(test_preds)

Counter({0: 4019, 1: 1360, 2: 482})

In [27]:
print('Test metrics:')
print(f"Loss: {log_loss(y_test, test_probs)}")
print(f"Precision: {precision_score(y_test, test_preds, average = 'macro')}")
print(f"Recall: {recall_score(y_test, test_preds, average = 'macro')}")
print(f"F1: {f1_score(y_test, test_preds, average = 'macro')}")

Test metrics:
Loss: 0.6835903988059158
Precision: 0.6609212962858931
Recall: 0.5885780373436105
F1: 0.6053336100336718
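
The loss reported above is the multiclass log loss: the mean negative log of the probability assigned to the true class. The NumPy sketch below spells this out on hypothetical toy values. The function name and the clipping constant are assumptions made for illustration, not the scikit-learn implementation.

```python
import numpy as np


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability of the true class (illustrative sketch)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    probs = probs / probs.sum(axis=1, keepdims=True)  # renormalize rows after clipping
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(probs[rows, np.asarray(y_true)])))


# Toy example with three classes (hypothetical values).
y_toy = [0, 1, 2]
p_toy = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6]]
toy_loss = multiclass_log_loss(y_toy, p_toy)

# Reference point: always predicting 1/3 per class gives ln(3), about 1.0986.
uniform_loss = multiclass_log_loss(y_toy, np.full((3, 3), 1.0 / 3.0))

print(toy_loss, uniform_loss)
```

The uniform baseline gives a rough scale for reading the test loss: a three-class model that has learned nothing would score about 1.0986.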


After fitting the second-level model, we achieved a slight improvement over the first-level models in multiclass log loss, the competition's target metric. More importantly, the other reported metrics (macro-averaged precision, recall, and F1) improved overall. This research is not aimed at building a competition solution but at bringing together various NLP techniques; even so, the results are promising.

Keep in mind that I set aside a significant portion of the original data for testing as part of this research. In a competition setting, that data would typically go into the training or holdout set used to train the second-level model, and final quality would be measured on the hidden competition test set. By building on these methods and adding a few ideas of your own, there is a good chance of improving the target metric further.