# IHLT Project

---

## Introduction

In this project, we implement approaches to detect paraphrases using sentence similarity metrics by exploring:

- **Lexical features alone**
- **Syntactic features alone**
- **Combination of lexical, syntactic, and semantic features**

We use **XGBoost** as our machine learning model and cite the SemEval 2012 paper from which each technique is derived.

### Motivation

Understanding semantic similarity between sentences is essential for various NLP tasks, such as machine translation, summarization, and question answering. SemEval 2012 Task 6 provided a benchmark for evaluating semantic textual similarity methods.

### Features Overview

Based on insights from SemEval 2012 Task 6 papers ([2], [4], [8]), we implement the following features:

- **Lexical Features**  
  Derived from methods used in SemEval 2012 papers [2], [4], [8].
  - Jaccard similarity
  - Normalized edit distance
  - Cosine similarity using TF-IDF vectors
  - Word n-gram overlap
  - Character n-gram overlap
  - Token overlap ratio
  - Longest common subsequence
  - String matching metrics
  - Word order similarity
  - Normalized difference in sentence lengths

- **Syntactic Features**  
  Derived from methods in SemEval 2012 papers [2], [3].
  - POS tag overlap ratio
  - POS tag sequence similarity
  - Dependency relation overlap
  - Grammatical relations overlap

- **Semantic Features**  
  Derived from methods in SemEval 2012 papers [2], [8].
  - WordNet-based similarity metrics
  - Named entity overlap
  - Semantic word overlap using synonyms
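
To make a few of the lexical features listed above concrete, here is a minimal sketch of Jaccard similarity, normalized edit distance, and longest common subsequence. It is not the code in `scripts/feature_extraction.py`: the function names, the whitespace tokenizer, and the choice to normalize by the longer string are assumptions made for illustration only.

```python
def tokenize(sentence: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return sentence.lower().split()


def jaccard_similarity(a: list[str], b: list[str]) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    if not a and not b:
        return 0.0
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1] / max(len(a), len(b))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


s1 = tokenize("A badger is burrowing a hole.")
s2 = tokenize("A badger is digging the earth.")
print(jaccard_similarity(s1, s2))                       # 0.375
print(lcs_length(s1, s2))                               # 3
print(normalized_edit_distance("kangroo", "kangaroo"))  # 0.125
```

The badger pair has a gold score of 4.6 in the test data but a Jaccard similarity of only 0.375, which illustrates why surface-overlap features alone can underestimate paraphrases.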

---

This Jupyter notebook was run with Python 3.10.12, the version available in Google Colab.

## 1. Data Preparation

### 1.1 Import Libraries

In [None]:
# basic
import os
import pandas as pd

# our scripts
from scripts.data_loader import load_data
from scripts.feature_extraction import FeatureExtractor
from scripts.experiments import run_experiment
from scripts.feature_analysis import (
    load_best_model,
    get_feature_importances,
    analyze_feature_importance_per_dataset,
    get_top_features,
    plot_feature_importances_grid,
    plot_dataset_permutation_importances_grid,
    plot_error_distribution_grid,
    plot_true_vs_predicted_density_grid,
    plot_feature_correlation_matrix_grid,
    get_hardest_failures
)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 1.2 Load Data

In [2]:
# 1.2 Load data

data_dir = 'data'  # Replace with your data directory path

# Load training data
train_data = load_data(data_dir, dataset_type='train')

# Load test data
test_data = load_data(data_dir, dataset_type='test')

# Check data format
print(f"Number of training samples: {len(train_data)}")
print(f"Number of test samples: {len(test_data)}")


Number of training samples: 2234
Number of test samples: 3108


### 1.3 Explore Data

In [None]:
pd.DataFrame(test_data)

## 2. Feature Extraction

To avoid recomputation, we extract all features once and then select the relevant columns before training.

In [3]:
extractor = FeatureExtractor()

In [None]:
train_df  = extractor.extract_features_parallel(train_data)

In [None]:
test_df = extractor.extract_features_parallel(test_data)

In [None]:
results_folder = 'results'
os.makedirs(results_folder, exist_ok=True)

# Save DataFrames to CSV files
train_csv_path = os.path.join(results_folder, 'train_features.csv')
test_csv_path = os.path.join(results_folder, 'test_features.csv')

In [None]:
# save them
train_df.to_csv(train_csv_path, index=False)
test_df.to_csv(test_csv_path, index=False)

In [5]:
# load them
train_df = pd.read_csv(train_csv_path)
test_df = pd.read_csv(test_csv_path)

## 3. Experiments

### 3.1 Define feature sets

In [6]:
lexical_features_columns = [col for col in train_df.columns if col.startswith('lex_')]
syntactic_features_columns = [col for col in train_df.columns if col.startswith('syn_')]
semantic_features_columns = [col for col in train_df.columns if col.startswith('sem_')]

feature_sets = {
    'lexical': lexical_features_columns,
    'syntactic': syntactic_features_columns,
    'semantic': semantic_features_columns,
    'combined': lexical_features_columns + syntactic_features_columns + semantic_features_columns
}

### 3.2 Run Experiments

In [None]:
model_save_path = 'models'
os.makedirs(model_save_path, exist_ok=True)

# Prepare a dictionary to store metrics
metrics_dict = {}

# Run experiments to find the best models
for feature_set_name, feature_columns in feature_sets.items():
    print("="*80)
    print(f"Running experiment for feature set: {feature_set_name}")
    metrics = run_experiment(
        train_df,
        test_df.copy(),
        feature_columns,
        feature_set_name,
        model_save_path
    )
    metrics_dict[feature_set_name] = metrics

### 3.3 Feature Importance

In [7]:
# Collect the per-feature-set results needed by the plots and tables below

feature_importances_dict = {}
feature_importance_per_dataset_dict = {}
error_dict = {}
y_true_dict = {}
y_pred_dict = {}
data_dict = {}
feature_set_names = list(feature_sets.keys())

model_save_path = 'models'

for feature_set_name, feature_columns in feature_sets.items():
    # Load the best model
    best_model = load_best_model(feature_set_name, model_save_path)
    
    # Get feature importances
    feature_importances = get_feature_importances(best_model, feature_columns)
    feature_importances_dict[feature_set_name] = feature_importances
    
    # Analyze feature importance per dataset
    feature_importance_per_dataset = analyze_feature_importance_per_dataset(
        best_model, test_df, feature_columns
    )
    feature_importance_per_dataset_dict[feature_set_name] = feature_importance_per_dataset
    
    # Predict on test data
    X_test = test_df[feature_columns]
    y_test = test_df['score']
    y_pred = best_model.predict(X_test)
    y_true_dict[feature_set_name] = y_test
    y_pred_dict[feature_set_name] = y_pred
    error_dict[feature_set_name] = y_pred - y_test
    
    # Store data for feature correlation matrix
    data_dict[feature_set_name] = test_df[feature_columns]


Loaded best model for 'lexical' from: models/best_model_lexical.joblib
Loaded best model for 'syntactic' from: models/best_model_syntactic.joblib
Loaded best model for 'semantic' from: models/best_model_semantic.joblib
Loaded best model for 'combined' from: models/best_model_combined.joblib


#### 3.3.1 Summarizing Feature Importances

In [None]:
plot_feature_importances_grid(feature_importances_dict, top_n=20)

#### 3.3.2 Permutation Importances per Dataset

In [None]:
# Plot permutation importances per dataset
plot_dataset_permutation_importances_grid(feature_importance_per_dataset_dict, feature_set_names, top_n=10)
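
For readers unfamiliar with permutation importance, the sketch below shows the general idea: shuffle one feature column at a time and measure how much the error grows. It is a plain-Python illustration under stated assumptions, not the implementation behind `analyze_feature_importance_per_dataset`, which may use a different metric or a library routine. The `permutation_importance` function, the toy model, and the data are all hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between gold and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_features, n_repeats=5, seed=0):
    """Average increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y_true, [predict(row) for row in rows])
    importances = []
    for j in range(n_features):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j; every other column keeps its original values.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled_rows = [row[:j] + [value] + row[j + 1:]
                             for row, value in zip(rows, column)]
            shuffled_error = mean_squared_error(
                y_true, [predict(row) for row in shuffled_rows])
            increases.append(shuffled_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model whose output depends only on feature 0."""
    return 5.0 * row[0]


rows = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5], [1.0, 0.3]]
y_true = [toy_predict(row) for row in rows]
importances = permutation_importance(toy_predict, rows, y_true, n_features=2)
print(importances)  # feature 1 is ignored by the model, so its importance is 0.0
```

Running the same procedure separately on the rows of each test dataset would give one importance ranking per dataset, which is the kind of breakdown the grid plot displays.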

#### 3.3.3 Other Interesting Visualizations

In [None]:
# Plot error distributions
plot_error_distribution_grid(error_dict, feature_set_names)

In [None]:
# Plot true vs. predicted density plots
plot_true_vs_predicted_density_grid(y_true_dict, y_pred_dict, feature_set_names)

In [None]:
# Plot feature correlation matrices
plot_feature_correlation_matrix_grid(data_dict, feature_importances_dict, feature_set_names, top_n=10)

### 3.4 Hardest Failures

In [8]:
for feature_set_name in feature_set_names:
    print("="*80)
    print(f"Identifying hardest failures for feature set: {feature_set_name}")
    
    # Load the best model
    best_model = load_best_model(feature_set_name, model_save_path)
    
    # Predict on test data
    X_test = test_df[feature_sets[feature_set_name]]
    y_test = test_df['score']
    y_pred = best_model.predict(X_test)
    test_df_copy = test_df.copy()
    test_df_copy['predicted_score'] = y_pred
    
    # Get hardest failures
    failures = get_hardest_failures(test_df_copy, test_data, y_true_col='score', y_pred_col='predicted_score', top_n=5)
    
    # Display failures
    for dataset, df_failures in failures.items():
        print(f"Dataset: {dataset}")
        display(df_failures)


Identifying hardest failures for feature set: lexical
Loaded best model for 'lexical' from: models/best_model_lexical.joblib
Dataset: MSRpar


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
712,"The woman was hospitalized June 15, Kansas hea...",Missouri health officials said he had not been...,1.0,3.757185,2.757185
169,The SIA says the DRAM market is expected to gr...,The Americas market will decline 2.1 percent t...,1.2,3.727381,2.527381
174,A New Castle County woman has become the first...,A 62-year-old West Babylon man has contracted ...,1.5,4.00247,2.50247
707,"Shares of USA Interactive rose $2.28, or 7 per...","Shares of LendingTree rose $6.03, or 41 percen...",1.0,3.248112,2.248112
8,"In afternoon trading in Europe, France's CAC-4...","In Europe, France's CAC-40 rose 1.3 percent, B...",2.0,4.236926,2.236926


Dataset: MSRvid


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1352,The lady peeled the potatoe.,A woman is peeling a potato.,4.75,0.950958,3.799042
950,A kangroo is eating something.,A kangaroo is eating.,4.8,1.443057,3.356943
1449,The lady sliced a tomatoe.,Someone is cutting a tomato.,4.0,0.760429,3.239571
1188,A badger is burrowing a hole.,A badger is digging the earth.,4.6,1.457905,3.142095
939,A band is performing on a stage.,A band is playing onstage.,5.0,1.90918,3.09082


Dataset: SMTeuroparl


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1952,Then perhaps we could have avoided a catastrophe.,We might have been able to prevent a disaster.,4.25,0.099411,4.150589
1805,Then perhaps we could have avoided a catastrophe.,Then we might have been able to avoid a disaster.,4.6,0.754275,3.845725
1859,Then perhaps we could have avoided a catastrophe.,Then we might have been able to avoid a disaster.,4.6,0.754275,3.845725
1885,That provision could open the door wide to arb...,This point of the regulations opens the door t...,5.0,1.436707,3.563293
1882,That provision could open the door wide to arb...,This point of the regulations opens the door t...,5.0,1.667895,3.332105


Dataset: surprise.OnWN


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1966,restrict or confine,place limits on (extent or access).,4.75,0.26083,4.48917
2026,a concern or affair,some situation or event that is thought about.,4.5,0.194788,4.305212
2442,an expanse of land,an extended area of land.,4.5,0.641664,3.858336
2118,The act of having and controlling property.,the state or fact of being an owner.,4.25,0.523829,3.726171
2238,"have faith in, bet on",have faith or confidence in.,4.25,0.571267,3.678733


Dataset: surprise.SMTnews


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
3037,But they were necessary.,But they were needed.,5.0,1.03642,3.96358
2808,Other ways are needed.,Other means should be found.,4.6,0.790847,3.809153
3025,Other ways are needed.,Other means should be found.,4.6,0.790847,3.809153
2779,Other ways are needed.,It is necessary to find other means.,4.5,0.751772,3.748228
2924,The questions are many.,The questions are numerous.,5.0,1.434217,3.565783


Identifying hardest failures for feature set: syntactic
Loaded best model for 'syntactic' from: models/best_model_syntactic.joblib
Dataset: MSRpar


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
323,Moore had no immediate comment Tuesday.,Moore did not have an immediate response Tuesday.,4.778,0.74618,4.03182
632,"Five alternate jurors were also chosen, with a...","Five alternate jurors also were selected, with...",4.6,1.337417,3.262583
707,"Shares of USA Interactive rose $2.28, or 7 per...","Shares of LendingTree rose $6.03, or 41 percen...",1.0,4.057784,3.057784
169,The SIA says the DRAM market is expected to gr...,The Americas market will decline 2.1 percent t...,1.2,4.14138,2.94138
442,"One, Capt. Doug McDonald, remained hospitalize...","Her 20-year-old sister, Allyson, was severely ...",0.75,3.561767,2.811767


Dataset: MSRvid


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1120,The man is kissing and hugging the woman.,A man is hugging and kissing a woman.,5.0,0.821082,4.178918
1188,A badger is burrowing a hole.,A badger is digging the earth.,4.6,0.509353,4.090647
962,A woman is chopping up garlic.,The woman is dicing garlic.,4.8,0.871991,3.928009
1488,A man is smoking.,The man sat in his suit and smoked.,4.0,0.546399,3.453601
1466,A man jumping rope,A man is talking.,0.4,3.83578,3.43578


Dataset: SMTeuroparl


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1805,Then perhaps we could have avoided a catastrophe.,Then we might have been able to avoid a disaster.,4.6,0.777908,3.822092
1859,Then perhaps we could have avoided a catastrophe.,Then we might have been able to avoid a disaster.,4.6,0.777908,3.822092
1656,"Consumers will lose out, employees will lose o...","The consumers are the losers, with the employe...",4.75,1.045613,3.704387
1527,Then perhaps we could have avoided a catastrophe.,Perhaps we should have been able to prevent a ...,4.5,0.832982,3.667018
1952,Then perhaps we could have avoided a catastrophe.,We might have been able to prevent a disaster.,4.25,0.76665,3.48335


Dataset: surprise.OnWN


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
2234,marriage offer,an offer of marriage.,5.0,0.036269,4.963731
2001,physical matter left behind after a removal pr...,matter that remains after something has been r...,5.0,0.105415,4.894585
2000,persuade or achieve acceptance,persuade somebody to accept something.,4.333,-0.377483,4.710483
2286,"duplicate, match",duplicate or match.,4.75,0.134241,4.615759
2340,put or store in a bottle,store (liquids or gases) in bottles.,4.25,-0.337853,4.587853


Dataset: surprise.SMTnews


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
2814,This gross error is leading Russia to politica...,And this gross mistake is conducting Russia po...,4.75,0.679343,4.070657
2891,Today's great Pax Europa and today's pan-Europ...,"The large Europa of today, just as the prosper...",4.25,0.506787,3.743213
2779,Other ways are needed.,It is necessary to find other means.,4.5,0.813166,3.686834
3066,Today's great Pax Europa and today's pan-Europ...,"The great Pax Europe of today, as prosperity p...",5.0,1.404561,3.595439
2896,The People Versus Putin,People against Putin,4.75,1.218854,3.531146


Identifying hardest failures for feature set: semantic
Loaded best model for 'semantic' from: models/best_model_semantic.joblib
Dataset: MSRpar


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
169,The SIA says the DRAM market is expected to gr...,The Americas market will decline 2.1 percent t...,1.2,4.214226,3.014226
190,"""It's going to happen,"" said Jim Santangelo, p...","""That really affects the companies, big time,""...",1.5,4.468565,2.968565
174,A New Castle County woman has become the first...,A 62-year-old West Babylon man has contracted ...,1.5,3.858725,2.358725
14,"RT Jones analyst Juli Niemann said Grant was ""...","He has a very good reputation,"" RT Jones analy...",1.4,3.754665,2.354665
442,"One, Capt. Doug McDonald, remained hospitalize...","Her 20-year-old sister, Allyson, was severely ...",0.75,3.088209,2.338209


Dataset: MSRvid


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1320,A group of people sing.,Some people are singing.,5.0,0.278298,4.721702
1352,The lady peeled the potatoe.,A woman is peeling a potato.,4.75,0.249444,4.500556
1281,The man is slicing the tape from the box.,A man is cutting open a box.,4.333,0.759738,3.573262
1070,A person plays a keyboard.,Someone is playing a keyboard.,5.0,1.642818,3.357182
1188,A badger is burrowing a hole.,A badger is digging the earth.,4.6,1.257051,3.342949


Dataset: SMTeuroparl


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1536,There must be a balance as a whole.,Group must be in equilibrium.,4.5,0.979583,3.520417
1656,"Consumers will lose out, employees will lose o...","The consumers are the losers, with the employe...",4.75,1.386841,3.363159
1719,There must be a balance as a whole.,The unit must be in balance.,4.75,1.605036,3.144964
1722,The leaders have now been given a new chance a...,The leaders are here today to a new chance and...,5.0,2.243053,2.756947
1885,That provision could open the door wide to arb...,This point of the regulations opens the door t...,5.0,2.374075,2.625925


Dataset: surprise.OnWN


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
2187,"be against, resist",act against or in opposition to.,4.5,0.160818,4.339182
2061,a written message of nonacceptance,a message refusing to accept something that is...,5.0,0.859726,4.140274
2370,the state of being retained,the act of retaining something.,4.5,0.392772,4.107228
2026,a concern or affair,some situation or event that is thought about.,4.5,0.460879,4.039121
2202,a region allocated to hold something,the particular portion of space occupied by so...,4.5,0.577752,3.922248


Dataset: surprise.SMTnews


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
2822,This tendency extends deeper than headscarves.,This trend goes well beyond simple scarves.,4.5,1.130048,3.369952
2997,This tendency extends deeper than headscarves.,This trend goes well beyond simple scarves.,4.5,1.130048,3.369952
3087,This tendency extends deeper than headscarves.,This trend goes well beyond simple scarves.,4.5,1.130048,3.369952
2811,Other ways are needed.,We must find other ways.,4.4,1.176538,3.223462
2823,Other ways are needed.,We must find other ways.,4.4,1.176538,3.223462


Identifying hardest failures for feature set: combined
Loaded best model for 'combined' from: models/best_model_combined.joblib
Dataset: MSRpar


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
174,A New Castle County woman has become the first...,A 62-year-old West Babylon man has contracted ...,1.5,3.868588,2.368588
190,"""It's going to happen,"" said Jim Santangelo, p...","""That really affects the companies, big time,""...",1.5,3.836157,2.336157
748,"Graves reported from Albuquerque, Villafranca ...",Pete Slover reported from Laredo and Gromer Je...,1.25,3.403617,2.153617
45,"Earlier this month, RIM had said it expected t...",Excluding legal fees and other charges it expe...,1.2,3.350104,2.150104
712,"The woman was hospitalized June 15, Kansas hea...",Missouri health officials said he had not been...,1.0,3.129573,2.129573


Dataset: MSRvid


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1188,A badger is burrowing a hole.,A badger is digging the earth.,4.6,1.019477,3.580523
950,A kangroo is eating something.,A kangaroo is eating.,4.8,1.422938,3.377062
1051,Two little girls are talking on the phone.,A little girl is walking down the street.,0.5,3.731322,3.231322
851,A man and a woman are kissing.,A man and woman kiss.,5.0,1.833189,3.166811
1416,A woman is chopping a hard egg.,A person is cutting boiled egg into pieces.,3.533,0.404836,3.128164


Dataset: SMTeuroparl


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1805,Then perhaps we could have avoided a catastrophe.,Then we might have been able to avoid a disaster.,4.6,1.332725,3.267275
1859,Then perhaps we could have avoided a catastrophe.,Then we might have been able to avoid a disaster.,4.6,1.332725,3.267275
1527,Then perhaps we could have avoided a catastrophe.,Perhaps we should have been able to prevent a ...,4.5,1.307546,3.192454
1952,Then perhaps we could have avoided a catastrophe.,We might have been able to prevent a disaster.,4.25,1.431283,2.818717
1545,The vote will take place today at 5.30 p.m.,The vote will take place with 17h30.,4.75,2.265949,2.484051


Dataset: surprise.OnWN


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
1966,restrict or confine,place limits on (extent or access).,4.75,0.219731,4.530269
1968,"Bring back to life, return from the dead",cause to become alive again.,4.75,0.294106,4.455894
2026,a concern or affair,some situation or event that is thought about.,4.5,0.346234,4.153766
2131,"become master of, overcome, dominate",get on top of; deal with successfully.,4.0,0.084929,3.915071
2033,head of a country,the chief executive of a republic.,4.5,0.889548,3.610452


Dataset: surprise.SMTnews


Unnamed: 0,sentence1,sentence2,score,predicted_score,error
2924,The questions are many.,The questions are numerous.,5.0,1.755224,3.244776
2879,"Western Europeans, who have been spared this l...","Europeans of the West, who forgot this history...",5.0,1.845204,3.154796
2845,Being a Muslim and being an Islamist are not t...,It is necessary to are two different things.,0.25,3.38589,3.13589
3037,But they were necessary.,But they were needed.,5.0,2.275248,2.724752
2779,Other ways are needed.,It is necessary to find other means.,4.5,1.792892,2.707108
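
The tables above rank test pairs by absolute error within each dataset. The sketch below shows one way such a ranking can be computed. It is an illustration only: `hardest_failures` is a hypothetical stand-in for `get_hardest_failures`, whose actual implementation is not shown in this notebook, and the three records are made-up values rather than rows from the results.

```python
from collections import defaultdict


def hardest_failures(records, top_n=5):
    """Group records by dataset and keep the top_n largest absolute errors."""
    by_dataset = defaultdict(list)
    for record in records:
        error = abs(record["predicted_score"] - record["score"])
        by_dataset[record["dataset"]].append({**record, "error": error})
    return {
        dataset: sorted(rows, key=lambda r: r["error"], reverse=True)[:top_n]
        for dataset, rows in by_dataset.items()
    }


# Made-up records for illustration; field names mirror the columns used above.
records = [
    {"dataset": "MSRvid", "score": 4.6, "predicted_score": 1.0},
    {"dataset": "MSRvid", "score": 0.5, "predicted_score": 0.75},
    {"dataset": "MSRpar", "score": 1.0, "predicted_score": 3.5},
]
failures = hardest_failures(records, top_n=1)
for dataset, rows in failures.items():
    print(dataset, rows)
```

Using the absolute error means over-predictions (as in most MSRpar rows) and under-predictions (as in most MSRvid rows) are ranked on the same scale.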



**References for All Features:**

- **Word Overlap Measures (Jaccard similarity, Dice coefficient, Overlap coefficient):** Used by multiple teams in SemEval 2012 Task 6, including [Baer et al., 2012], [Glinos, 2012], and [Jimenez et al., 2012].

- **Edit Distance and String Similarity Measures:** Used by [Glinos, 2012] and [Jimenez et al., 2012].

- **TF-IDF Vector Similarity:** Employed by the UKP team [Baer et al., 2012] for computing cosine similarity using TF-IDF vectors.

- **Character N-gram Features:** Utilized by teams like [Baer et al., 2012] and [Jimenez et al., 2012].

- **BLEU Score:** Used by [Baer et al., 2012] as part of the feature set.

- **Content Word Overlap:** Considered by [Jimenez et al., 2012] in their similarity measures.

- **POS Tag Features:** Teams like [Baer et al., 2012] and [Glinos, 2012] used POS tag overlaps and distributions.

- **Dependency Relations and Tree Structures:** Explored by [Šarić et al., 2012] for syntactic similarity.

- **WordNet-based Semantic Features:** Used extensively by the UKP team [Baer et al., 2012] and the TakeLab team [Šarić et al., 2012], including synonym overlap, hypernym/hyponym overlap, and various similarity measures.

- **Named Entity Features:** Incorporated by [Baer et al., 2012].

- **Sentiment Analysis Features:** Included by teams like [Gupta et al., 2012] in their submissions.

- **Negation Handling:** Addressed by [Baer et al., 2012] to capture differences due to negation.

**Referenced Papers:**

- **[Baer et al., 2012]:**

  Baer, D., Biemann, C., Gurevych, I., and Zesch, T. (2012). UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures. *SemEval-2012*.

- **[Šarić et al., 2012]:**

  Šarić, F., Glavaš, G., Karan, M., Šnajder, J., and Dalbelo Bašić, B. (2012). TakeLab: Systems for Measuring Semantic Text Similarity. *SemEval-2012*.

- **[Glinos, 2012]:**

  Glinos, D. (2012). ATA-Semantics: Measuring the Similarity between Sentences. *SemEval-2012*.

- **[Jimenez et al., 2012]:**

  Jimenez, S., Becerra, C., and Gelbukh, A. (2012). Soft Cardinality: A Generalization of Dice's Similarity Coefficient for Enumerated Sets. *SemEval-2012*.

- **[Gupta et al., 2012]:**

  Gupta, S., Agarwal, A., and Joshi, S. (2012). Yedi: A Hybrid Distributional and Knowledge-based Word Similarity Measure. *SemEval-2012*.

**Note:** All features utilize methods and resources available in 2012, adhering to the constraints of the SemEval 2012 Task 6.

## References

- Baer, D., Biemann, C., Gurevych, I., & Zesch, T. (2012). **UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures**. In *Proceedings of the First Joint Conference on Lexical and Computational Semantics* (pp. 435–440).
- Šarić, F., Glavaš, G., Karan, M., Šnajder, J., & Dalbelo Bašić, B. (2012). **TakeLab: Systems for Measuring Semantic Text Similarity**. In *Proceedings of the First Joint Conference on Lexical and Computational Semantics* (pp. 441–448).
- Jimenez, S., Becerra, C., & Gelbukh, A. (2012). **Soft Cardinality: A Generalized Similarity Measure for Comparison of NLP Objects**. In *Proceedings of the First Joint Conference on Lexical and Computational Semantics* (pp. 449–453).
- Glinos, D. (2012). **ATA System: Text Similarity with LSA, Machine Learning, and Linguistic Features**. In *Proceedings of the First Joint Conference on Lexical and Computational Semantics* (pp. 475–480).
- Gupta, S., et al. (2012). **UMBC at SemEval-2012 Task 6: Similarity Based on Semantic Alignments**. In *Proceedings of the First Joint Conference on Lexical and Computational Semantics*.



In [None]:
# 4.2 Training on Combined Data and Evaluating per Dataset

# Prepare training data
X_train_sets = {}
for feature_set_name, feature_columns in feature_sets.items():
    X_train_sets[feature_set_name] = train_df[feature_columns]
y_train = train_df['score']

# Prepare test data per dataset
test_datasets = test_df['dataset'].unique()
results = {}

for feature_set_name, X_train in X_train_sets.items():
    print(f"Training model using {feature_set_name} features...")
    
    # Train the model on combined training data
    model = train_random_forest(X_train, y_train)
    
    # Evaluate on each test dataset separately
    for dataset in test_datasets:
        df_test_dataset = test_df[test_df['dataset'] == dataset]
        X_test = df_test_dataset[feature_sets[feature_set_name]]
        y_test = df_test_dataset['score']
        
        # Predict and evaluate
        y_pred = model.predict(X_test)
        test_correlation = evaluate_model(y_test, y_pred)
        
        print(f"Dataset: {dataset}, Pearson Correlation: {test_correlation:.4f}")
        
        # Store results
        results[(dataset, feature_set_name)] = test_correlation
    print()
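
The cell above calls `train_random_forest` and `evaluate_model`, neither of which is imported or defined in this notebook. The sketch below shows what such helpers could look like, assuming that `evaluate_model` returns the Pearson correlation (as the print statement suggests) and that `train_random_forest` fits a scikit-learn `RandomForestRegressor`. The hyperparameters are placeholders, and this random-forest variant differs from the XGBoost models trained by `run_experiment`.

```python
import math


def evaluate_model(y_true, y_pred):
    """Pearson correlation between gold and predicted scores."""
    y_true, y_pred = list(y_true), list(y_pred)
    mean_true = sum(y_true) / len(y_true)
    mean_pred = sum(y_pred) / len(y_pred)
    covariance = sum((t - mean_true) * (p - mean_pred)
                     for t, p in zip(y_true, y_pred))
    norm_true = math.sqrt(sum((t - mean_true) ** 2 for t in y_true))
    norm_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in y_pred))
    if norm_true == 0 or norm_pred == 0:
        return float("nan")  # correlation is undefined for constant input
    return covariance / (norm_true * norm_pred)


def train_random_forest(X_train, y_train, n_estimators=100, random_state=42):
    """Fit a random forest regressor (assumes scikit-learn is installed)."""
    # Imported inside the function so evaluate_model works without scikit-learn.
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=n_estimators,
                                  random_state=random_state)
    model.fit(X_train, y_train)
    return model


# Perfectly linearly related scores give a correlation of (approximately) 1.
print(evaluate_model([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because Pearson correlation is invariant to scale and offset, a model can score well on this metric even when its absolute errors on individual pairs are large.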


In [None]:
# 5. Results and Analysis

# Create a list to collect rows for the DataFrame
rows = []

for key, value in results.items():
    dataset, feature_set = key
    test_corr = value
    rows.append({
        'Dataset': dataset,
        'Feature_Set': feature_set,
        'Test_Correlation': test_corr
    })

# Create a DataFrame from the collected rows
results_df = pd.DataFrame(rows)

# Display the results
print(results_df.pivot(index='Dataset', columns='Feature_Set', values='Test_Correlation'))


In [None]:
# 6.1 Analyzing Feature Importances

# Assume we've trained the combined model earlier
# For simplicity, we'll retrain it here
X_train_combined = train_df[feature_sets['combined']]
model_combined = train_random_forest(X_train_combined, y_train)

# Get feature importance
importance = model_combined.feature_importances_
feature_importance = pd.Series(importance, index=feature_sets['combined'])
feature_importance.sort_values(ascending=False, inplace=True)

# Display top 10 features
print("Top 10 Features:")
print(feature_importance.head(10))


In [None]:
# 6.2 Feature Selection

# Select top N features
top_N = 20
top_features = feature_importance.index[:top_N]

# Retrain the model with top features
X_train_top = train_df[top_features]
model_top = train_random_forest(X_train_top, y_train)

# Evaluate on test datasets
print(f"\nEvaluating model with top {top_N} features:")
for dataset in test_datasets:
    df_test_dataset = test_df[test_df['dataset'] == dataset]
    X_test = df_test_dataset[top_features]
    y_test = df_test_dataset['score']
    
    # Predict and evaluate
    y_pred = model_top.predict(X_test)
    test_correlation = evaluate_model(y_test, y_pred)
    
    print(f"Dataset: {dataset}, Pearson Correlation: {test_correlation:.4f}")
    
    # Update results
    results[(dataset, f'top_{top_N}')] = test_correlation


In [None]:
# 7. Final Results

# Collect new rows to add to the DataFrame
new_rows = []

for key, value in results.items():
    dataset, feature_set = key
    test_corr = value
    # Check if the combination already exists in the DataFrame
    if not ((results_df['Dataset'] == dataset) & (results_df['Feature_Set'] == feature_set)).any():
        new_rows.append({
            'Dataset': dataset,
            'Feature_Set': feature_set,
            'Test_Correlation': test_corr
        })

# If there are new rows, append them to the DataFrame
if new_rows:
    results_df = pd.concat([results_df, pd.DataFrame(new_rows)], ignore_index=True)

# Display the updated results
print(results_df.pivot(index='Dataset', columns='Feature_Set', values='Test_Correlation'))
