# Semantic Textual Similarity Project

## Authors
- Kacper Poniatowski
- Pau Blanco

## Introduction
This notebook is dedicated to the exploration and implementation of methods for Semantic Textual Similarity (STS) as part of the end-of-semester project in the Introduction to Human Language Technologies (IHLT) course. This project revolves around Task 6 of the 'SamEval 2012' competition, which focuses on measuring the similarity of sentence pairs within text documents.

The pipeline outlined in this notebook consists of multiple stages:
1. Data Preparation and Preprocessing: Loading datasets and cleaning text for analysis.
2. Feature Extraction. Generating features that explore lexical, syntactic and other additional dimensions to capture semantic similarity.
3. Model Training and Evaluation: Various models are trained on the dataset and then evaluated, with primary focus on Random Forest (RF).
4. Result Analysis: Obtained results are interpreted and visualised.

This project adheres to the constraint outlined within the assignment brief: avoiding the use of pre-trained word embeddings like BERT.

## Pre-Reqs

### Imports

In [8]:
# Force auto-reload
%load_ext autoreload
%autoreload 2

import pandas as pd
import nltk
import numpy as np
import os

from utils import load_data, evaluate_rf_model, drop_highly_correlated_features, update_results_csv, generate_plots_from_metrics
from models import ModelTrainer
from feature_extraction import FeatureExtractor
from utils import save_predictions
from scipy.stats import pearsonr

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet_ic')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\paubl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\paubl\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\paubl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\paubl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paubl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     C:\Users\paubl\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet_ic is already u

True

### Define Constants

In [9]:
# Define constant paths used throughout notebook
TRAIN_PATH = '../data/train/01_raw/'
TRAIN_GS_PATH = '../data/train/scores/'
TEST_PATH = '../data/test/01_raw/'
TEST_GS_PATH = '../data/test/scores/'
TRAIN_SAVE_PATH = '../data/train/02_preprocessed/preprocessed_train_data.csv'
TEST_SAVE_PATH = '../data/test/02_preprocessed/preprocessed_test_data.csv'
PREDICTED_SAVE_PATH = '../data/test/03_predicted/'
RESULTS_SAVE_PATH = "project/test/03_predicted/results.csv"

### Load Data

In [10]:
# Load train data
print('\n Loading train data')
all_train_files = ['SMTeuroparl', 'MSRvid', 'MSRpar']
df_train = load_data(TRAIN_PATH, TRAIN_GS_PATH, all_train_files)

# Load test data
print('\n Loading test data')
all_test_files = ['SMTeuroparl', 'MSRvid', 'MSRpar', 'surprise.OnWN', 'surprise.SMTnews']
df_test = load_data(TEST_PATH, TEST_GS_PATH, all_test_files)

print('\n Train and test datasets loaded successfully')


 Loading train data

 Loading test data

 Train and test datasets loaded successfully


### Extract Features

In [4]:
feature_extractor = FeatureExtractor()

# Extract the desired features
def add_features(dt):
    feature_extractor.add_POS_statistics(dt)
    feature_extractor.add_synset_statistics(dt)
    feature_extractor.add_lemma_statistics(dt)

# Add features to the training and test data
add_features(df_train)
add_features(df_test)

print('\n Features added to datasets successfully')

# Save df_train and df_test to respective files - this is done to avoid re-running the
# feature extraction process each time as synset extraction is computationally expensive
df_test.to_csv(TEST_SAVE_PATH, index=False)
df_train.to_csv(TRAIN_SAVE_PATH, index=False)

print('\n Train and test datasets saved as .csv files successfully')

Adding POS based features...
Adding synset-based features...
Processed 2234 of 2234 rows (100%)      
Adding lemma based features...
Adding POS based features...
Adding synset-based features...
Processed 3108 of 3108 rows (100%)      
Adding lemma based features...

 Features added to datasets successfully

 Train and test datasets saved as .csv files successfully


## Pearson Score Calculations

### Load Existing Dataframes

In [11]:
if TRAIN_SAVE_PATH is None or TEST_SAVE_PATH is None:
    raise ValueError('TRAIN_SAVE_PATH and TEST_SAVE_PATH must be defined')

if not os.path.exists(TRAIN_SAVE_PATH):
    raise FileNotFoundError(f'The file {TRAIN_SAVE_PATH} does not exist')

if not os.path.exists(TEST_SAVE_PATH):
    raise FileNotFoundError(f'The file {TEST_SAVE_PATH} does not exist')

original_df_train = pd.read_csv(TRAIN_SAVE_PATH)
original_df_test = pd.read_csv(TEST_SAVE_PATH)

print('\n Train and test datasets loaded from .csv files successfully')


 Train and test datasets loaded from .csv files successfully


### Perform Feature Selection

#### Correlation Matrix

In [12]:
# Drop first 4 columns to remove non-numerical columns
df_train = original_df_train.drop(original_df_train.columns[:4], axis=1)
df_test = original_df_test.drop(original_df_test.columns[:4], axis=1)

correlation_matrix = df_train.corr()
print(correlation_matrix)

# Total number of columns in the correlation matrix
total_cols = correlation_matrix.shape[1]
print(f'\n Total number of columns (features) in the correlation matrix: {total_cols}')

                           s1_n_words  s2_n_words  s1_n_verbs_tot  \
s1_n_words                   1.000000    0.918735        0.758300   
s2_n_words                   0.918735    1.000000        0.673713   
s1_n_verbs_tot               0.758300    0.673713        1.000000   
s2_n_verbs_tot               0.590479    0.698453        0.740191   
s1_n_verbs_pres              0.246101    0.191457        0.571911   
...                               ...         ...             ...   
lemma_lcs_length             0.814533    0.793255        0.545660   
lemma_edit_distance          0.781976    0.809274        0.570023   
proportion_s1_in_s2          0.202598    0.231090        0.077001   
proportion_s2_in_s1          0.298470    0.213312        0.158630   
lemma_position_similarity    0.153036    0.110382        0.118022   

                           s2_n_verbs_tot  s1_n_verbs_pres  s2_n_verbs_pres  \
s1_n_words                       0.590479         0.246101         0.113169   
s2_n_words   

#### Find best correlation threshold

In [13]:
all_features = [
            's1_n_words', 's1_n_verbs_tot', 's1_n_verbs_pres', 's1_n_verbs_past', 's1_n_nouns', 's1_n_adjectives', 's1_n_adverbs', 
            's2_n_words', 's2_n_verbs_tot', 's2_n_verbs_pres', 's2_n_verbs_past', 's2_n_nouns', 's2_n_adjectives', 's2_n_adverbs', 
            'dif_n_words', 'dif_n_verbs_tot', 'dif_n_verbs_pres', 'dif_n_verbs_past', 'dif_n_nouns', 'dif_n_adjectives', 'dif_n_adverbs', 
            'jaccard_all_words', 'jaccard_verbs', 'jaccard_nouns', 'jaccard_adjectives', 'jaccard_adverbs',
            
            'all_all_shared_synsets_count', 'all_all_shared_synsets_ratio', 'all_all_avg_synset_similarity', 'all_all_max_synset_similarity',
            'all_verb_shared_synsets_count', 'all_verb_shared_synsets_ratio', 'all_verb_avg_synset_similarity', 'all_verb_max_synset_similarity',
            'all_noun_shared_synsets_count', 'all_noun_shared_synsets_ratio', 'all_noun_avg_synset_similarity', 'all_noun_max_synset_similarity',
            'all_adj_shared_synsets_count', 'all_adj_shared_synsets_ratio', 'all_adj_avg_synset_similarity', 'all_adj_max_synset_similarity',
            'all_adv_shared_synsets_count', 'all_adv_shared_synsets_ratio', 'all_adv_avg_synset_similarity', 'all_adv_max_synset_similarity',

            'best_all_shared_synsets_count', 'best_all_shared_synsets_ratio', 'best_all_avg_synset_similarity', 'best_all_max_synset_similarity',
            'best_verb_shared_synsets_count', 'best_verb_shared_synsets_ratio', 'best_verb_avg_synset_similarity', 'best_verb_max_synset_similarity',
            'best_noun_shared_synsets_count', 'best_noun_shared_synsets_ratio', 'best_noun_avg_synset_similarity', 'best_noun_max_synset_similarity',
            'best_adj_shared_synsets_count', 'best_adj_shared_synsets_ratio', 'best_adj_avg_synset_similarity', 'best_adj_max_synset_similarity',
            'best_adv_shared_synsets_count', 'best_adv_shared_synsets_ratio', 'best_adv_avg_synset_similarity', 'best_adv_max_synset_similarity',

            'lemma_diversity', 'shared_lemmas_ratio', 'lemma_jackard_similarity', 'avg_lemma_similarity', 'max_lemma_similarity', 'shared_lemma_count', 'dice_coefficient',
            'lemma_bigram_overlap', 'lemma_lcs_length', 'lemma_edit_distance', 'proportion_s1_in_s2', 'proportion_s2_in_s1', 'lemma_position_similarity'
            ]

def extract_features(feature_set, drop_features):
    return [f for f in feature_set if f not in drop_features]

model_trainer = ModelTrainer()
print(f"Total feaures: {len(all_features)}")
thresholds = [0.7, 0.75, 0.80, 0.85, 0.9, 0.95, 1]
for threshold in thresholds:
    print(f'\nDropping columns with correlation above {threshold}')
    dropped_columns = drop_highly_correlated_features(df_train, threshold)
    print(f'Columns dropped: {len(dropped_columns)}')
    new_columns = extract_features(all_features, dropped_columns)
    print(f'Remaining columns: {len(new_columns)}')
    best_rf_model_lex, rf_params_lex, metrics_lex, mean_correlation_rf_lex = evaluate_rf_model(
        model_trainer, 
        original_df_train, 
        original_df_test, 
        new_columns, 
        'gs', 
        PREDICTED_SAVE_PATH,
        f"test_{threshold}",
        10
    )

Total feaures: 79

Dropping columns with correlation above 0.7
Columns dropped: 41
Remaining columns: 38
Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}

Pearson correlation for the best RF model: 0.750920528636134
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7513897840948833
Predicted data saved to CSV: ..\data\test\03_predicted\test_0.7\2024-12-11_18-25-38_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\test_0.7\2024-12-11_18-25-38_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\test_0.7\2024-12-11_18-25-38_test_data.xlsx

Dropping columns with correlation above 0.75
Columns dropped: 36
Remaining columns: 43
Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}

Pearson correlation for the best RF model: 0.7623454002064807
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7591629084486324
Predicted data saved to CSV: ..\data\test\03_predicted\test_0.75\2024-12-11_18-26-26_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\test_0.75\2024-12-11_18-26-26_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\test_0.75\2024-12-11_18-26-26_test_data.xlsx

Dropping columns with correlation above 0.8
Columns dropped: 32
Remaining columns: 47
Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}

Pearson correlation for the best RF model: 0

  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}

Pearson correlation for the best RF model: 0.7520333671236472
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7531644600918147
Predicted data saved to CSV: ..\data\test\03_predicted\test_0.85\2024-12-11_18-28-37_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\test_0.85\2024-12-11_18-28-37_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\test_0.85\2024-12-11_18-28-37_test_data.xlsx

Dropping columns with correlation above 0.9
Columns dropped: 14
Remaining columns: 65
Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Pearson correlation for the best RF model: 0.7580475730311692
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7590360612800903
Predicted data saved to CSV: ..\data\test\03_predicted\test_0.9\2024-12-11_18-29-41_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\test_0.9\2024-12-11_18-29-41_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\test_0.9\2024-12-11_18-29-41_test_data.xlsx

Dropping columns with correlation above 0.95
Columns dropped: 10
Remaining columns: 69
Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}

Pearson correlation for the best RF model: 0.7612824081096524
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7599192290067266
Predicted data saved to CSV: ..\data\test\03_predicted\test_0.95\2024-12-11_18-30-57_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\test_0.95\2024-12-11_18-30-57_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\test_0.95\2024-12-11_18-30-57_test_data.xlsx

Dropping columns with correlation above 1
Columns dropped: 0
Remaining columns: 79
Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}

Pearson correlation for the best RF model: 0.7524110007850136
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7524362569046683
Predicted data saved to CSV: ..\data\test\03_predicted\test_1\2024-12-11_18-31-46_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\test_1\2024-12-11_18-31-46_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\test_1\2024-12-11_18-31-46_test_data.xlsx


In [15]:
BEST_TRESHOLD = 0.8
# Initial number of columns
total_cols = len(df_train.columns)

# Find columns to drop from df_train and df_test
dropped_columns = drop_highly_correlated_features(df_train, BEST_TRESHOLD)

df_train_trimmed = df_train.drop(columns=dropped_columns)
df_test_trimmed = df_test.drop(columns=dropped_columns)

# Statistics
print(f'\nNumber of features dropped: {len(dropped_columns)}')
print("Remaining features after dropping highly correlated ones:", df_train_trimmed.columns)
print(f'\nNumber of remaining features after dropping highly correlated ones: {len(df_train_trimmed.columns)}')
print(f'\nPercentage of features removed: {(total_cols - len(df_train_trimmed.columns)) / total_cols * 100:.2f}%')

# Print shape of df_test to verify consistency
# print(f'\nShape of df_test after column removal: {df_test_trimmed.shape}')


Number of features dropped: 32
Remaining features after dropping highly correlated ones: Index(['s1_n_words', 's1_n_verbs_tot', 's2_n_verbs_tot', 's1_n_verbs_pres',
       's2_n_verbs_pres', 's1_n_verbs_past', 's2_n_verbs_past',
       's1_n_adjectives', 's1_n_adverbs', 's2_n_adverbs', 'dif_n_words',
       'dif_n_verbs_tot', 'dif_n_verbs_pres', 'dif_n_verbs_past',
       'dif_n_nouns', 'dif_n_adjectives', 'dif_n_adverbs', 'jaccard_all_words',
       'jaccard_verbs', 'jaccard_nouns', 'jaccard_adjectives',
       'jaccard_adverbs', 'all_all_shared_synsets_count',
       'all_all_avg_synset_similarity', 'all_verb_shared_synsets_ratio',
       'all_verb_avg_synset_similarity', 'all_verb_max_synset_similarity',
       'all_noun_shared_synsets_ratio', 'all_noun_avg_synset_similarity',
       'all_adj_shared_synsets_count', 'all_adj_shared_synsets_ratio',
       'all_adj_max_synset_similarity', 'all_adv_shared_synsets_count',
       'all_adv_shared_synsets_ratio', 'all_adv_max_synset_simila

### Define Used Feature Sets (Prior to Feature Selection)

In [16]:
# Feature sets for the analysis
PoS_features = [
            's1_n_words', 's1_n_verbs_tot', 's1_n_verbs_pres', 's1_n_verbs_past', 's1_n_nouns', 's1_n_adjectives', 's1_n_adverbs', 
            's2_n_words', 's2_n_verbs_tot', 's2_n_verbs_pres', 's2_n_verbs_past', 's2_n_nouns', 's2_n_adjectives', 's2_n_adverbs', 
            'dif_n_words', 'dif_n_verbs_tot', 'dif_n_verbs_pres', 'dif_n_verbs_past', 'dif_n_nouns', 'dif_n_adjectives', 'dif_n_adverbs', 
            'jaccard_all_words', 'jaccard_verbs', 'jaccard_nouns', 'jaccard_adjectives', 'jaccard_adverbs',
            ]

synset_features = [
            'all_all_shared_synsets_count', 'all_all_shared_synsets_ratio', 'all_all_avg_synset_similarity', 'all_all_max_synset_similarity',
            'all_verb_shared_synsets_count', 'all_verb_shared_synsets_ratio', 'all_verb_avg_synset_similarity', 'all_verb_max_synset_similarity',
            'all_noun_shared_synsets_count', 'all_noun_shared_synsets_ratio', 'all_noun_avg_synset_similarity', 'all_noun_max_synset_similarity',
            'all_adj_shared_synsets_count', 'all_adj_shared_synsets_ratio', 'all_adj_avg_synset_similarity', 'all_adj_max_synset_similarity',
            'all_adv_shared_synsets_count', 'all_adv_shared_synsets_ratio', 'all_adv_avg_synset_similarity', 'all_adv_max_synset_similarity',

            'best_all_shared_synsets_count', 'best_all_shared_synsets_ratio', 'best_all_avg_synset_similarity', 'best_all_max_synset_similarity',
            'best_verb_shared_synsets_count', 'best_verb_shared_synsets_ratio', 'best_verb_avg_synset_similarity', 'best_verb_max_synset_similarity',
            'best_noun_shared_synsets_count', 'best_noun_shared_synsets_ratio', 'best_noun_avg_synset_similarity', 'best_noun_max_synset_similarity',
            'best_adj_shared_synsets_count', 'best_adj_shared_synsets_ratio', 'best_adj_avg_synset_similarity', 'best_adj_max_synset_similarity',
            'best_adv_shared_synsets_count', 'best_adv_shared_synsets_ratio', 'best_adv_avg_synset_similarity', 'best_adv_max_synset_similarity',
            ]

lemma_features = [
            'lemma_diversity', 'shared_lemmas_ratio', 'lemma_jackard_similarity', 'avg_lemma_similarity', 'max_lemma_similarity', 'shared_lemma_count', 'dice_coefficient',
            'lemma_bigram_overlap', 'lemma_lcs_length', 'lemma_edit_distance', 'proportion_s1_in_s2', 'proportion_s2_in_s1', 'lemma_position_similarity'
            ]

lexical_features = [
            'all_all_shared_synsets_count', 'all_all_shared_synsets_ratio', 'all_all_avg_synset_similarity', 'all_all_max_synset_similarity',
            'all_verb_shared_synsets_count', 'all_verb_shared_synsets_ratio', 'all_verb_avg_synset_similarity', 'all_verb_max_synset_similarity',
            'all_noun_shared_synsets_count', 'all_noun_shared_synsets_ratio', 'all_noun_avg_synset_similarity', 'all_noun_max_synset_similarity',
            'all_adj_shared_synsets_count', 'all_adj_shared_synsets_ratio', 'all_adj_avg_synset_similarity', 'all_adj_max_synset_similarity',
            'all_adv_shared_synsets_count', 'all_adv_shared_synsets_ratio', 'all_adv_avg_synset_similarity', 'all_adv_max_synset_similarity',

            'best_all_shared_synsets_count', 'best_all_shared_synsets_ratio', 'best_all_avg_synset_similarity', 'best_all_max_synset_similarity',
            'best_verb_shared_synsets_count', 'best_verb_shared_synsets_ratio', 'best_verb_avg_synset_similarity', 'best_verb_max_synset_similarity',
            'best_noun_shared_synsets_count', 'best_noun_shared_synsets_ratio', 'best_noun_avg_synset_similarity', 'best_noun_max_synset_similarity',
            'best_adj_shared_synsets_count', 'best_adj_shared_synsets_ratio', 'best_adj_avg_synset_similarity', 'best_adj_max_synset_similarity',
            'best_adv_shared_synsets_count', 'best_adv_shared_synsets_ratio', 'best_adv_avg_synset_similarity', 'best_adv_max_synset_similarity',

            'lemma_diversity', 'shared_lemmas_ratio', 'lemma_jackard_similarity', 'avg_lemma_similarity', 'max_lemma_similarity', 'shared_lemma_count', 'dice_coefficient',
            'lemma_bigram_overlap', 'lemma_lcs_length', 'lemma_edit_distance', 'proportion_s1_in_s2', 'proportion_s2_in_s1', 'lemma_position_similarity'
]

# Organize feature sets into labeled tuples
features_sets = [
    ('All', all_features),
    ('Synsets', synset_features),
    ('Lemmas', lemma_features),
    ('PoS (Syntactic)', PoS_features),
    ('Lexical', lexical_features),
]

# File sets for the analysis
files_sets = [
    ('SMTeuroparl', ['SMTeuroparl'], ['SMTeuroparl', 'surprise.OnWN', 'surprise.SMTnews']),
    ('MSRvid', ['MSRvid'], ['MSRvid', 'surprise.OnWN', 'surprise.SMTnews']),
    ('MSRpar', ['MSRpar'], ['MSRpar', 'surprise.OnWN', 'surprise.SMTnews']),
    ('All', ['SMTeuroparl', 'MSRvid', 'MSRpar'], ['SMTeuroparl', 'MSRvid', 'MSRpar', 'surprise.OnWN', 'surprise.SMTnews'])
]

In [17]:
# Pull remaining feature column names from df into "all_features"
remaining_features = df_train_trimmed.columns.tolist()

def filter_features(feature_set, remaining_features):
    return [f for f in feature_set if f in remaining_features]

# Create new filtered feature groups
new_all_features = filter_features(all_features, remaining_features)
new_PoS_features = filter_features(PoS_features, remaining_features)
new_synset_features = filter_features(synset_features, remaining_features)
new_lemma_features = filter_features(lemma_features, remaining_features)
new_lexical_features = filter_features(lexical_features, remaining_features)

filtered_feature_sets = [
    ('Original', all_features),
    ('All', new_all_features),
    ('PoS (Syntactic)', new_PoS_features),
    ('Synsets', new_synset_features),
    ('Lemmas', new_lemma_features),
    ('Lexical', new_lexical_features),
]

# Display remaining features in each feature set
for name, features in filtered_feature_sets:
    print(f"{name}: {len(features)} features")
    print(features)

Original: 79 features
['s1_n_words', 's1_n_verbs_tot', 's1_n_verbs_pres', 's1_n_verbs_past', 's1_n_nouns', 's1_n_adjectives', 's1_n_adverbs', 's2_n_words', 's2_n_verbs_tot', 's2_n_verbs_pres', 's2_n_verbs_past', 's2_n_nouns', 's2_n_adjectives', 's2_n_adverbs', 'dif_n_words', 'dif_n_verbs_tot', 'dif_n_verbs_pres', 'dif_n_verbs_past', 'dif_n_nouns', 'dif_n_adjectives', 'dif_n_adverbs', 'jaccard_all_words', 'jaccard_verbs', 'jaccard_nouns', 'jaccard_adjectives', 'jaccard_adverbs', 'all_all_shared_synsets_count', 'all_all_shared_synsets_ratio', 'all_all_avg_synset_similarity', 'all_all_max_synset_similarity', 'all_verb_shared_synsets_count', 'all_verb_shared_synsets_ratio', 'all_verb_avg_synset_similarity', 'all_verb_max_synset_similarity', 'all_noun_shared_synsets_count', 'all_noun_shared_synsets_ratio', 'all_noun_avg_synset_similarity', 'all_noun_max_synset_similarity', 'all_adj_shared_synsets_count', 'all_adj_shared_synsets_ratio', 'all_adj_avg_synset_similarity', 'all_adj_max_synset_si

In [18]:
# Add the first 4 columns from df_train to df at the beginning
df_train_trimmed = pd.concat([original_df_train[original_df_train.columns[:4]], df_train_trimmed], axis=1)
df_test_trimmed = pd.concat([original_df_test[original_df_test.columns[:4]], df_test_trimmed], axis=1)

### Initialise Model Trainer

In [19]:
model_trainer = ModelTrainer()
N_ITERATIONS = 10

### All the features (Not trimmed)

In [20]:
best_rf_model_lex, rf_params_lex, metrics_lex, mean_correlation_rf_lex = evaluate_rf_model(
    model_trainer, 
    original_df_train, 
    original_df_test, 
    all_features, 
    'gs', 
    PREDICTED_SAVE_PATH,
    "predicted_rf_original"
)

update_results_csv(
    results_file=f"{PREDICTED_SAVE_PATH}/results.csv",
    model_name="RandomForest",
    feature_set="Original",
    metrics=metrics_lex,
    prediction_file=f"{PREDICTED_SAVE_PATH}/predicted_rf_original.csv"
)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}

Pearson correlation for the best RF model: 0.7514347843502969
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7504323834995141
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_rf_original\2024-12-11_18-37-57_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\predicted_rf_original\2024-12-11_18-37-57_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_rf_original\2024-12-11_18-37-57_test_data.xlsx


### Trimmed features

#### Lexical features

In [None]:
# Synset based (semantic) features
best_rf_model_lex, rf_params_lex, metrics_lex, mean_correlation_rf_lex = evaluate_rf_model(
    model_trainer, 
    df_train_trimmed, 
    df_test_trimmed, 
    new_synset_features, 
    'gs', 
    PREDICTED_SAVE_PATH,
    "predicted_rf_synset"
)

update_results_csv(
    results_file=f"{PREDICTED_SAVE_PATH}/results.csv",
    model_name="RandomForest",
    feature_set="Synsets",
    metrics=metrics_lex,
    prediction_file=f"{PREDICTED_SAVE_PATH}/predicted_rf_sybnset.csv"
)

# Lemma based features
best_rf_model_lex, rf_params_lex, metrics_lex, mean_correlation_rf_lex = evaluate_rf_model(
    model_trainer, 
    df_train_trimmed, 
    df_test_trimmed, 
    new_lemma_features, 
    'gs', 
    PREDICTED_SAVE_PATH,
    "predicted_rf_lemma"
)

update_results_csv(
    results_file=f"{PREDICTED_SAVE_PATH}/results.csv",
    model_name="RandomForest",
    feature_set="Lemma",
    metrics=metrics_lex,
    prediction_file=f"{PREDICTED_SAVE_PATH}/predicted_rf_lemma.csv"
)

# All lexical features
best_rf_model_lex, rf_params_lex, metrics_lex, mean_correlation_rf_lex = evaluate_rf_model(
    model_trainer, 
    df_train_trimmed, 
    df_test_trimmed, 
    new_lexical_features, 
    'gs', 
    PREDICTED_SAVE_PATH,
    "predicted_rf_lexical"
)

update_results_csv(
    results_file=f"{PREDICTED_SAVE_PATH}/results.csv",
    model_name="RandomForest",
    feature_set="Lexical",
    metrics=metrics_lex,
    prediction_file=f"{PREDICTED_SAVE_PATH}/predicted_rf_lexical.csv"
)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Pearson correlation for the best RF model: 0.7140327037401144
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7139252144670927
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_rf_synset\2024-12-11_18-38-48_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\predicted_rf_synset\2024-12-11_18-38-48_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_rf_synset\2024-12-11_18-38-48_test_data.xlsx
Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}

Pearson correlation for the best RF model: 0.668500072605863
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.6688331779502276
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_rf_lemma\2024-12-11_18-39-19_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\predicted_rf_lemma\2024-12-11_18-39-19_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_rf_lemma\2024-12-11_18-39-19_test_data.xlsx
Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}

Pearson correlation for the best RF model: 0.742642731968866
Computing mean Pearson correlation for 10 itera

#### Syntactical features

In [22]:
best_rf_model_pos, rf_params_pos, metrics_pos, mean_correlation_rf_pos = evaluate_rf_model(
    model_trainer, 
    df_train_trimmed, 
    df_test_trimmed, 
    new_PoS_features, 
    'gs', 
    PREDICTED_SAVE_PATH,
    "predicted_rf_syn"
)

update_results_csv(
    results_file=f"{PREDICTED_SAVE_PATH}/results.csv",
    model_name="RandomForest",
    feature_set="Syntactical",
    metrics=metrics_pos,
    prediction_file=f"{PREDICTED_SAVE_PATH}/predicted_rf_syn.csv"
)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}

Pearson correlation for the best RF model: 0.6872474376830605
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.6845818162480091
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_rf_syn\2024-12-11_18-41-28_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\predicted_rf_syn\2024-12-11_18-41-28_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_rf_syn\2024-12-11_18-41-28_test_data.xlsx


#### All trimmed features

In [23]:
best_rf_model_all, rf_params_all, metrics_rf_all, mean_correlation_rf_all = evaluate_rf_model(
    model_trainer, 
    df_train_trimmed, 
    df_test_trimmed, 
    new_all_features, 
    'gs', 
    PREDICTED_SAVE_PATH,
    "predicted_rf_all"
)

update_results_csv(
    results_file=f"{PREDICTED_SAVE_PATH}/results.csv",
    model_name="RandomForest",
    feature_set="All",
    metrics=metrics_rf_all,
    prediction_file=f"{PREDICTED_SAVE_PATH}/predicted_rf_all.csv"
)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'bootstrap': False, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Pearson correlation for the best RF model: 0.7627933471204229
Computing mean Pearson correlation for 10 iterations...
Mean Pearson correlation over 10 iterations: 0.7608256233940128
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_rf_all\2024-12-11_18-42-58_test_data.csv
Feature importance graph saved as: ..\data\test\03_predicted\predicted_rf_all\2024-12-11_18-42-58_feature_importance.png
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_rf_all\2024-12-11_18-42-58_test_data.xlsx


#### Generate Plots

In [24]:
generate_plots_from_metrics(f"{PREDICTED_SAVE_PATH}/results.csv", save_path=f"{PREDICTED_SAVE_PATH}/plots")

Plots have been saved to ../data/test/03_predicted//plots/


<Figure size 1200x800 with 0 Axes>

### Evaluation of Additional Models

#### NN Model

In [25]:
# Train the best NN model using all features
best_nn_model = model_trainer.train_NN(df_train_trimmed, new_all_features, 'gs')

# Predict the test data
df_test_trimmed['predicted_nn'] = best_nn_model.predict(df_test_trimmed[new_all_features])

# Calculate the Pearson correlation for the best NN model
correlation_nn = pearsonr(df_test_trimmed['gs'], df_test_trimmed['predicted_nn'])[0]
print(f'\n Pearson correlation for the best NN model: {correlation_nn}')

# Save the predictions
save_predictions(df_test, PREDICTED_SAVE_PATH, 'predicted_nn_all.csv')

Fitting 3 folds for each of 8 candidates, totalling 24 fits


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Best Parameters: {'batch_size': 16, 'epochs': 100, 'model__hidden_layers': 2, 'model__learning_rate': 0.001, 'model__neurons': 10}

 Pearson correlation for the best NN model: 0.6666183714193672
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_nn_all.csv\2024-12-11_18-47-05_test_data.csv
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_nn_all.csv\2024-12-11_18-47-05_test_data.xlsx


#### MLP Model

In [26]:
# Train the best MLP model using all features
best_mlp_model = model_trainer.train_MLP(df_train_trimmed, new_all_features, 'gs')

# Predict the test data
df_test_trimmed['predicted_mlp'] = best_mlp_model.predict(df_test_trimmed[new_all_features])

# Calculate the Pearson correlation for the best NN model
correlation_mlp = pearsonr(df_test_trimmed['gs'], df_test_trimmed['predicted_mlp'])[0]
print(f'\n Pearson correlation for the best MLP model: {correlation_mlp}')

# Save the predictions
save_predictions(df_test, PREDICTED_SAVE_PATH, 'predicted_mlp_all.csv')

Fitting 5 folds for each of 144 candidates, totalling 720 fits


  _data = np.array(data, dtype=dtype, copy=copy,


Best Hyperparameters: {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'max_iter': 200, 'solver': 'adam'}

 Pearson correlation for the best MLP model: 0.7120592736152007
Predicted data saved to CSV: ..\data\test\03_predicted\predicted_mlp_all.csv\2024-12-11_18-47-45_test_data.csv
Predicted data saved to Excel: ..\data\test\03_predicted\predicted_mlp_all.csv\2024-12-11_18-47-45_test_data.xlsx


## Result Analysis

### Introduction

This section discusses the obtained results from performing the various methods throughout the project. 

The primary evaluation metric used to evaluate the trained models is the Pearson correlation score, which in the context of this project, is used to measure how closely the similarity scores predicted by our trained models align with the ground truth scores provided to us. 
The score range is [-1, 1], where:
- r = 1: the model is detecting paraphrase scores perfectly
- r = 0: the model is failing to predict the paraphase scores.
- r = -1: the model is inverting the ground truth score for each sentence pair.

The key approaches that were compared are:
1. Lexical features
2. Syntactic features
3. Combination of lexical and syntactic features
4. Additional features


### Baseline Results

The official baseline score for Task 6 of the SamEval 2012 paper uses simple lexical features and reports a Pearson correlation of 0.311. 

Here is a table with the most significant results to consider:

| Method/System        | Pearson Correlation | Description                              |
|----------------------|----------------------|------------------------------------------|
| SemEval Baseline     | 0.311               | Official baseline from SemEval Task 6, obtained from 'Table 1' of the SamEval 2012 paper   |
| Top-Performing System| 0.8239              | Best result from SemEval Task 6, obtained by UKP-run2         |
| 10th Participant     | 0.7562              | Threshold for excellent performance     |


### Performance of Individual Approaches

#### Lexical Features
Lexical features capture word-level semantic relationships between sentences using metrics such as shared synsets, lemma-based similarity, and Jaccard overlap across different parts of speech (e.g., nouns, verbs, adjectives). The primary features used include:

- **Synset-based Similarity**: Measures such as shared synsets and maximum synset similarity.
- **Lemma-based Features**: Metrics such as lemma diversity, proportion overlap, and lemma-based edit distance.

To evaluate their effectiveness, we trained a **Random Forest model** separately on synset-based features, lemma-based features, and a combination of both. The results (averaged over 10 runs) are summarized below:

| Feature Set   | Mean Pearson Correlation |
|---------------|---------------------------|
| Synset-based  | **0.7139**               |
| Lemma-based   | **0.6688**               |
| Combination   | **0.7416**               |

#### Syntactic Features
Syntactic features emphasize structural and grammatical aspects of sentences. 

These features focus on:
- **Part-of-Speech (PoS) Counts**: Quantities of nouns, verbs, adjectives, and adverbs in each sentence.
- **Jaccard Similarity**: Similarity in PoS distributions between sentences.

In this case, we maintained one feature set. The results (averaged over 10 runs) are as follows:

| Feature Set         | Mean Pearson Correlation |
|---------------------|---------------------------|
| Syntactic | **0.6846**               |

### Combined Features Analysis

The combined approach utilises both **lexical** and **syntactic features** to leverage the strengths of each. Lexical features capture semantic relationships at the word level, while syntactic features provide insight into structural alignment between sentences. This combination aims to address the limitations of each feature set when used independently.

The performance of the combined feature set, evaluated using a **Random Forest model**, is summarized below (averaged over 10 runs):

| Feature Set              | Mean Pearson Correlation |
|--------------------------|---------------------------|
| Lexical Features         | 0.7416               |
| Syntactic Features       | 0.6846               |
| Combined (Lexical + Syntactic) | **0.7672**               |

#### Comparison of Individual Features
The combined feature set achieved a Mean Pearson Correlation of 0.7672, surpassing the performance of both lexical-only (0.7416) and syntactic-only (0.6846) models. By integrating semantic similarity from lexical features and structural alignment from syntactic features, the model demonstrated improved robustness against diverse paraphrase types. This integration allowed the model to handle semantically equivalent sentences with different structures and structurally similar sentences with varying word usage more effectively.

Despite this improvement, it's apparent that lexical features contribute the majority of the predictive power in this task. By utilising syntactic features during training of the model, the Pearson correlation improved by a very modest 0.0256.


### Comparison with Official SemEval Results

The best-performing approach in this project achieved a **Mean Pearson Correlation of 0.7672** using a combination of lexical and syntactic features. This result is evaluated against the official SemEval 2012 Task 6 benchmarks:

| System                  | Pearson Correlation | Description                                   |
|-------------------------|----------------------|-----------------------------------------------|
| SemEval Baseline        | 0.311               | Simple word overlap baseline.                |
| UKP-run2 (Top System)   | 0.8239              | Best result using advanced features and ensemble techniques. |
| SRIUBC-SYSTEM2          | 0.7562              | 10th-ranked system, representing high-performance threshold. |
| Takelab-simple          | 0.8133              | Lightweight lexical and syntactic model.      |
| UNT-CombinedRegression  | 0.7418              | Regression-based approach with feature combination. |
| **Our Best Approach**   | **0.7672**          | Combined lexical and syntactic feature model.|

#### **Comparison**
Our best-performing approach significantly outperforms the SemEval baseline of **0.311**, demonstrating the effectiveness of incorporating both lexical and syntactic features. It also surpasses the **10th-ranked system (0.7562)**, placing it among the top-performing systems from the competition. Notably, it is competitive with other strong systems like **UNT-CombinedRegression (0.7418)** while falling slightly short of the **Takelab-simple (0.8133)** and **UKP-run2 (0.8239)** models.

#### **Insights**
The results show that our feature-driven approach is competitive with high-ranking systems in the SemEval task. While it does not surpass the top systems, it demonstrates the strength of a combined feature approach. Further exploration of advanced features, model tuning, or ensemble methods could narrow the gap with the top-performing systems. These results emphasize the value of balancing simplicity and complexity in feature engineering for sentence similarity tasks.


### Visualisation of Results

#### RF Mean Correlation by Feature Set
<img src="../data/test/03_predicted/plots/rf_mean_correlation_by_feature_set.png" alt="RF Mean Correlation" width="800"/>

The graph illustrates the mean Pearson correlation achieved by the Random Forest model across different feature sets, including Original, Synsets, Lemma, Lexical, Syntactical, and All (combined). The combined feature set achieves the highest mean correlation (~0.767)



#### RF RMSE by Feature Set
<img src="../data/test/03_predicted/plots/rf_rmse_by_feature_set.png" alt="RF RMSE" width="800"/>

This bar chart presents the RMSE for each feature set, reflecting the model's prediction error. The combined feature set achieves the lowest RMSE, demonstrating its ability to minimize errors effectively. Lexical features also exhibit a relatively low RMSE, consistent with their strong performance in mean correlation. In comparison, syntactic and lemma features have higher RMSE values, indicating their limitations in accurately predicting similarity when used alone.



#### RF Std Correlation by Feature Set
<img src="../data/test/03_predicted/plots/rf_std_correlation_by_feature_set.png" alt="RF Std Correlation" width="800"/>

This graph displays the standard deviation of correlation values for each feature set. All feature sets display a very low standard deviation, with the combined feature set achieving the highest (~0.0035), indicating consistent and stable performance between runs for all feature sets.



### Conclusion

This project explored the effectiveness of various feature-based approaches for Semantic Textual Similarity (STS), focusing on lexical, syntactic, and combined feature sets. Using the Random Forest model, we evaluated the predictive power of these features in approximating human similarity scores.

Key findings from the analysis include:

- Lexical Features: Demonstrated strong performance, with a mean Pearson correlation of 0.7416, highlighting their ability to capture word-level semantics and explicit overlaps effectively.
- Syntactic Features: Provided complementary insights into sentence structure, achieving a mean correlation of 0.6846 when combined with lexical features.
- Combined Features: Achieved the best performance, with a mean correlation of 0.7672, showcasing the synergistic benefits of integrating lexical and syntactic perspectives.

The results demonstrate that a feature-engineering approach can achieve competitive performance relative to benchmarks from the SemEval 2012 Task 6 competition. Our best model outperformed the SemEval baseline (0.311) and ranked above the 10th-place system (0.7562), underscoring the robustness of our feature combinations.

While promising, the results also highlight areas for improvement. The limited gain from syntactic features suggests potential redundancy or inefficiencies in feature selection. Future work could focus on:

- Selecting a more diverse set of features.
- Train our own word embedding model (to remain within the constraints of the brief)