## In this notebook, two subtasks of the Lexical Simplification pipeline are implemented:

1. Substitution Generation (the first method is Word2Vec augmented with POS filtering; the second is LSBert, which is based on masked language models)
2. Substitution Selection (the first method compares the vectors of the sentence's content words with each candidate's vector to select the best generated candidates; the second asks a decoder-only language model to select from the substitutes)

For the Substitution Generation (SG) task, neither the trial nor the training dataset is used, because Word2Vec and BERT are already pre-trained on large text corpora.

The MLSP2024 dataset is used for testing and evaluation.



---
# 1. **Substitution Generation**: once the target complex word has been selected, two methods are used to generate candidates.



---
## 1.1 Word2Vec with POS filtering:
A pre-trained Word2Vec model is loaded and applied to every selected complex word in the labelled MLSP dataset. The generated candidates are then evaluated against the human-annotated candidates in the dataset.

The metrics used are:

1. Potential@10
2. Precision@10
3. Recall@10
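
As a minimal sketch of how these three metrics can be computed for a single instance (the function name `metrics_at_k` and the toy word lists are illustrative assumptions, not part of the notebook's own evaluation code):

```python
def metrics_at_k(predicted, gold, k=10):
    """Return (potential, precision, recall) at k for one instance."""
    top_k = [p.lower() for p in predicted[:k]]
    gold = {g.lower() for g in gold}
    hits = len(set(top_k) & gold)

    potential = 1 if hits > 0 else 0               # at least one gold substitute found
    precision = hits / k                           # correct predictions divided by k
    recall = hits / len(gold) if gold else 0.0     # share of gold substitutes recovered
    return potential, precision, recall


# Toy example: 2 of the 4 gold substitutes are among the predictions
predicted = ["upset", "angry", "worried", "furious"]
gold = ["worried", "troubled", "distressed", "upset"]
print(metrics_at_k(predicted, gold, k=10))  # (1, 0.2, 0.5)
```

Averaging each of the three values over all instances gives the dataset-level scores.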


When the first code block is run, the environment will ask the user to restart the session. Please do so, then run the remaining cells.

In [None]:
!pip install --upgrade --force-reinstall numpy pandas gensim


In [None]:
import pandas as pd

# loading the dataset of MLSP with labels to navigate through the candidates
df = pd.read_csv("/content/multils_test_english_combined_labels.tsv", sep="\t")

# Subset the important columns (.copy() so that new columns can be added later without a SettingWithCopyWarning)
df_subset = df[["context", "target"] + [col for col in df.columns if col.startswith("substitution_")]].copy()

# Preview
df_subset.head()


Unnamed: 0,context,target,substitution_1,substitution_2,substitution_3,substitution_4,substitution_5,substitution_6,substitution_7,substitution_8,...,substitution_21,substitution_22,substitution_23,substitution_24,substitution_25,substitution_26,substitution_27,substitution_28,substitution_29,substitution_30
0,After Ron nearly dies drinking poisoned mead t...,distraught,worried,worried,troubled,troubled,distressed,distressed,upset,terrified,...,,,,,,,,,,
1,After Ron nearly dies drinking poisoned mead t...,drinking,consuming,consuming,consuming,consuming,taking,taking,intaking,ingesting,...,,,,,,,,,,
2,"After the war, Hitler remained in the army and...",oratory,speech,speech,speaking,speaking,vocal,talk,oral,,...,,,,,,,,,,
3,"After the war, Hitler remained in the army and...",reporting,notifying,notifying,notifying,informing,informing,informing,telling,telling,...,,,,,,,,,,
4,"After the war, Hitler remained in the army and...",infiltrating,invading,invading,invading,penetrating,penetrating,intruding,entering,,...,,,,,,,,,,


In [None]:
import nltk #loading this library for POS filtering
from nltk.corpus import wordnet as wn
from nltk import pos_tag, word_tokenize
from gensim.models import KeyedVectors #loading a pre-trained Word2Vec to generate candidates for the complex words
import gensim.downloader as api

# Download required NLTK resources
nltk.download("punkt")
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
w2v_model = api.load("word2vec-google-news-300")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.




In [None]:
def generate_substitutes(sentence, target_word, topn=10):
    """
    Generates substitutes for a given target word in a sentence using Word2Vec
    and POS filtering.

    1) First parameter is the sentence, containing the target word
    2) Second parameter is the target word
    3) Third parameter is the number of substitutes to generate

    Returns a list of substitutes that are semantically and syntactically
    similar to the target word.
    """
    words = word_tokenize(sentence) #the sentence is tokenized into words to find the POS of the target word
    tags = pos_tag(words)

    # Find POS of target word
    target_pos = None
    for word, tag in tags:
        if word.lower() == target_word.lower():
            target_pos = tag
            break

    # Get the words most similar to the complex word by querying Word2Vec
    if target_word in w2v_model:
        candidates = [w for w, _ in w2v_model.most_similar(target_word, topn=50)]
    else:
        return []

    # Finally, run the candidates through the POS filter.
    filtered = []
    for cand in candidates:
        cand_tagged = pos_tag([cand])[0][1]
        if cand_tagged == target_pos:
            filtered.append(cand)
        if len(filtered) == topn:
            break

    return filtered


In [None]:
# Trying the function on a random sentence
example = df_subset.iloc[80]
sentence = example["context"]
target_word = example["target"]

subs = generate_substitutes(sentence, target_word)
print("Sentence:", sentence)
print("Target Word:", target_word)
print("Substitutes:", subs)


Sentence: Currently, one seat equals roughly 600,000 constituents.
Target Word: equals
Substitutes: ['equates', 'translates', 'means', 'begets', 'Equals', 'implies', 'constitutes', 'multiplies', 'trumps', 'exceeds']




---

## 1.2 Masked Language Model (MLM):

In this approach, candidates are generated using a Transformer-based language model, BERT. The complex word is hidden behind the special token [MASK]. The model then uses the attention mechanism to capture the context and the linguistic nuances of the sentence, and predicts potential alternatives for the masked position.

This approach differs from the Word2Vec one in that the sentence context is taken into account when generating candidates.
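
One practical detail when masking is to replace only the whole word, so that a target is not matched inside a longer token. A minimal sketch, assuming a hypothetical helper `mask_target` built on Python's `re` module (the notebook's own function uses plain `str.replace`):

```python
import re


def mask_target(sentence, target, mask_token="[MASK]"):
    """Replace the first whole-word occurrence of `target` with `mask_token`."""
    # \b anchors the match at word boundaries; re.escape guards special characters
    pattern = r"\b" + re.escape(target) + r"\b"
    return re.sub(pattern, lambda m: mask_token, sentence, count=1, flags=re.IGNORECASE)


print(mask_target("The seat equals roughly 600,000 constituents.", "equals"))
# The seat [MASK] roughly 600,000 constituents.
```

The word-boundary anchor assumes the target starts and ends with a word character, which holds for single-word targets such as the ones shown in the dataset preview.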

The metrics used are:

1. Potential@10
2. Precision@10
3. Recall@10


In [None]:
#testing the MLM approach on one sentence with the complex word being masked

!pip install -q transformers sentencepiece


from transformers import pipeline

# Load the fill-mask pipeline with BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Example usage
masked_sent = "After Ron nearly dies drinking poisoned mead that was apparently intended for Professor Dumbledore, Hermione becomes so [MASK] that they end their feud for good."
print(fill_mask(masked_sent))



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'score': 0.2317497581243515, 'token': 4854, 'token_str': 'angry', 'sequence': 'after ron nearly dies drinking poisoned mead that was apparently intended for professor dumbledore, hermione becomes so angry that they end their feud for good.'}, {'score': 0.17666611075401306, 'token': 18835, 'token_str': 'enraged', 'sequence': 'after ron nearly dies drinking poisoned mead that was apparently intended for professor dumbledore, hermione becomes so enraged that they end their feud for good.'}, {'score': 0.1215352937579155, 'token': 6314, 'token_str': 'upset', 'sequence': 'after ron nearly dies drinking poisoned mead that was apparently intended for professor dumbledore, hermione becomes so upset that they end their feud for good.'}, {'score': 0.0993843525648117, 'token': 9943, 'token_str': 'furious', 'sequence': 'after ron nearly dies drinking poisoned mead that was apparently intended for professor dumbledore, hermione becomes so furious that they end their feud for good.'}, {'score': 0.0

In [None]:
from transformers import pipeline
from tqdm import tqdm
import pandas as pd



# Set up the BERT masked language model pipeline (LSBert-style generation)
mlm_pipeline = pipeline("fill-mask", model="bert-base-uncased", device=0)

# Define the substitution function using the masked language model
def generate_lm_substitutes(sentence, target_word, topn=10):
    """
    Generates context-sensitive substitutes for a target word in a sentence
    using the Masked Language Model (MLM) approach.

    The first occurrence of the complex word is replaced with a [MASK] token;
    the model (BERT here) then predicts the top-n tokens that best fit the
    masked position given the surrounding context.

    1) First parameter is the sentence, containing the target word
    2) Second parameter is the target word
    3) Third parameter is the number of substitutes to generate

    Returns a list of substitutes (empty if prediction fails)
    """
    try:
        # Replace the target word with the [MASK] token
        masked_sentence = sentence.replace(target_word, "[MASK]", 1)

        # Generate predictions
        outputs = mlm_pipeline(masked_sentence, top_k=topn)

        # Extract the top substitutes
        substitutes = [out["token_str"] for out in outputs]
        return substitutes
    except Exception:
        # e.g. the target word was not found, so the sentence contains no [MASK]
        return []


Device set to use cpu


In [None]:

"""
This code block loops through each row in the dataset and applies the two
generation functions, one based on Word2Vec and the other on a masked language model.
The generated substitutes are stored in two new columns, "generated_substitutes"
and "mlm_substitutes", respectively, so that they can be evaluated against the
human-annotated substitutes in the dataset.
"""

from tqdm import tqdm
tqdm.pandas(desc="Running all generation steps")

# Word2Vec + POS filtering SG
if "generated_substitutes" not in df_subset.columns:
    df_subset["generated_substitutes"] = df_subset.progress_apply(
        lambda row: generate_substitutes(row["context"], row["target"], topn=10),
        axis=1
    )

# MLM (like LSBert or masked BERT) SG
if "mlm_substitutes" not in df_subset.columns:
    df_subset["mlm_substitutes"] = df_subset.progress_apply(
        lambda row: generate_lm_substitutes(row["context"], row["target"]),
        axis=1
    )


Running all generation steps: 100%|██████████| 570/570 [01:32<00:00,  6.16it/s]
Running all generation steps: 100%|██████████| 570/570 [00:42<00:00, 13.31it/s]



---

# After the candidates have been generated using two different approaches, each approach is evaluated with its own testing function

1. The Word2vec approach will be evaluated with a function called

```
def evaluate_predictions(df, k=10):
```

2. The MLM approach will be evaluated with a function called


```
def evaluate_mlm_predictions(df, k=10):

```



In [None]:
import numpy as np

def evaluate_predictions(df, k=10):
    """
    Evaluates the Word2Vec-generated substitutes against the human-annotated
    substitutes in the dataset.

    It takes the DataFrame containing the additional "generated_substitutes" column
    and the number k of top predicted substitutes to evaluate.

    The function prints the metrics:
    1. Potential@k: proportion of instances where at least one correct substitute appears in the top-k predictions
    2. Precision@k: average number of correct substitutes found among the top-k predictions, divided by k
    3. Recall@k: average proportion of gold substitutes that appear in the top-k predictions
    """
    potential_hits = 0
    total_precision = 0
    total_recall = 0
    total_instances = 0

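    # Select the gold columns by name; a positional slice such as df.columns[2:-1]
    # can also pick up non-gold columns such as "generated_substitutes"
    sub_cols = [col for col in df.columns if col.startswith("substitution_")]

    for _, row in df.iterrows():
        gold_subs = set(str(s).lower() for s in row[sub_cols].dropna())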
        pred_subs = row["generated_substitutes"][:k] if isinstance(row["generated_substitutes"], list) else []

        if not gold_subs or not pred_subs:
            continue
        total_instances += 1
        gold_hit = len(set(pred_subs) & gold_subs)

        # Metrics
        potential_hits += int(gold_hit > 0)
        total_precision += gold_hit / k
        total_recall += gold_hit / len(gold_subs)

    print(f"\n Evaluation Results (k={k}):")
    print(f"Potential@{k}:", round(potential_hits / total_instances, 3))
    print(f"Precision@{k}:", round(total_precision / total_instances, 3))
    print(f"Recall@{k}:", round(total_recall / total_instances, 3))



evaluate_predictions(df_subset, k=10)


 Evaluation Results (k=10):
Potential@10: 0.442
Precision@10: 0.072
Recall@10: 0.115


In [None]:
import numpy as np

def evaluate_mlm_predictions(df, k=10):
    """
    Evaluates the MLM-generated substitutes against the human-annotated
    substitutes in the dataset.

    It takes the DataFrame containing the additional "mlm_substitutes" column
    and the number k of top predicted substitutes to evaluate.

    The function prints the metrics:
    1. Potential@k: proportion of instances where at least one correct substitute appears in the top-k predictions
    2. Precision@k: average number of correct substitutes found among the top-k predictions, divided by k
    3. Recall@k: average proportion of gold substitutes that appear in the top-k predictions
    """
    potential_hits = 0
    total_precision = 0
    total_recall = 0
    total_instances = 0

    sub_cols = [col for col in df.columns if col.startswith("substitution_")]

    for _, row in df.iterrows():
        gold_subs = set(str(s).lower() for s in row[sub_cols].dropna())
        pred_subs = row["mlm_substitutes"][:k] if isinstance(row["mlm_substitutes"], list) else []

        if not gold_subs or not pred_subs:
            continue  # Skip if missing data

        total_instances += 1
        gold_hit = len(set(pred_subs) & gold_subs)

        # Metrics
        potential_hits += int(gold_hit > 0)
        total_precision += gold_hit / k
        total_recall += gold_hit / len(gold_subs)

    print(f"\nEvaluation Results for MLM (k={k}):")
    print(f"Potential@{k}:", round(potential_hits / total_instances, 3))
    print(f"Precision@{k}:", round(total_precision / total_instances, 3))
    print(f"Recall@{k}:", round(total_recall / total_instances, 3))

# Run it
evaluate_mlm_predictions(df_subset, k=10)


Evaluation Results for MLM (k=10):
Potential@10: 0.577
Precision@10: 0.098
Recall@10: 0.201




---
# 2. Substitution Selection: once the candidate words have been generated, two methods are used to select the best candidate






---
## 2.1 Semantic ranking in the style of Paetzold & Specia (2015):



To select the best of the candidates generated by the Word2Vec method, the candidates are ranked by their semantic similarity to the content words in the sentence.
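
A minimal sketch of the ranking idea, with toy 2-dimensional vectors standing in for the Word2Vec embeddings (the vectors, the words, and the helper names `cosine` and `rank_candidates` are illustrative assumptions):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(content_vecs, candidate_vecs):
    """Sort candidates by their mean cosine similarity to the content-word vectors."""
    scores = {
        cand: sum(cosine(vec, c) for c in content_vecs) / len(content_vecs)
        for cand, vec in candidate_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors: two content words and three candidates
content_vecs = [[1.0, 0.0], [0.8, 0.6]]
candidate_vecs = {"near": [0.9, 0.1], "far": [-1.0, 0.2], "between": [0.0, 1.0]}
print(rank_candidates(content_vecs, candidate_vecs))  # ['near', 'between', 'far']
```

The candidate whose vector points in roughly the same direction as the content words ends up first.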



In [None]:
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# Download the stopwords corpus so that stopwords can be removed from the sentence;
# only content words are used for ranking
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Grab proper English stopwords
stop_words = set(stopwords.words("english"))

# Paetzold & Specia (2015) style ranking function
def rank_by_context_similarity(context, candidates):
    # Tokenize, then drop punctuation, stopwords, and words not found in Word2Vec
    tokens = word_tokenize(context.lower())
    content_words = [w for w in tokens if w.isalpha() and w not in stop_words and w in w2v_model]
    if not content_words:
        return []

    # Stack the vectors of the content words into one matrix
    content_vecs = [w2v_model[w] for w in content_words]
    context_matrix = np.stack(content_vecs)

    ranked = []
    for cand in candidates:
        if cand in w2v_model:
            # Compare the candidate vector with every content-word vector,
            # then average the similarities
            cand_vec = w2v_model[cand].reshape(1, -1)
            sims = cosine_similarity(cand_vec, context_matrix)
            avg_sim = np.mean(sims)
            ranked.append((cand, avg_sim))

    ranked.sort(key=lambda x: x[1], reverse=True)
    return [w for w, _ in ranked]

# Apply it to the Word2Vec-generated substitutes
df_subset["w2v_context_ranked_subs"] = df_subset.apply(
    lambda row: rank_by_context_similarity(row["context"], row["generated_substitutes"])
    if isinstance(row["generated_substitutes"], list) else [],
    axis=1
)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
def evaluate_ranked_predictions_by_proportion(df, proportion=0.5, column="w2v_context_ranked_subs"):
    """
    Evaluate ranked Word2Vec-based SS using top 'proportion' of candidates.

    Args:
        df: DataFrame with ranked substitutes in column.
        proportion: Proportion (e.g., 0.5 means top 50%).
        column: Column name of ranked substitutes.

    Returns:
        Prints Potential@p, Precision@p, Recall@p
    """
    potential_hits = 0
    total_precision = 0
    total_recall = 0
    total_instances = 0

    sub_cols = [col for col in df.columns if col.startswith("substitution_")]

    for _, row in df.iterrows():
        gold_subs = set(str(s).lower() for s in row[sub_cols].dropna())
        ranked_subs = row[column]

        if not gold_subs or not isinstance(ranked_subs, list) or not ranked_subs:
            continue

        k = max(1, int(len(ranked_subs) * proportion))
        selected_subs = ranked_subs[:k]

        gold_hit = len(set(selected_subs) & gold_subs)

        total_instances += 1
        potential_hits += int(gold_hit > 0)
        total_precision += gold_hit / k
        total_recall += gold_hit / len(gold_subs)

    print(f"\nEvaluation Results for W2V Context-Ranking (Top {int(proportion*100)}%):")
    print(" Potential@p:", round(potential_hits / total_instances, 3))
    print(" Precision@p:", round(total_precision / total_instances, 3))
    print(" Recall@p:", round(total_recall / total_instances, 3))

# Usage example
evaluate_ranked_predictions_by_proportion(df_subset, proportion=0.5)



Evaluation Results for W2V Context-Ranking (Top 50%):
 Potential@p: 0.344
 Precision@p: 0.113
 Recall@p: 0.095




---
## 2.2 Prompt Engineering:

In this method, the substitutes generated by the MLM are fed into a decoder-only model (Mistral-7B-Instruct-v0.1). Given the list of candidates, the language model decides which one is the best replacement for the complex word.

The prompt is:



```
"""<s>[INST] Given the sentence:

"{context}"

What is the best replacement for the complex word "{target}" in this list?

{candidate_str}

Respond with one word only. Do not explain. [/INST]
"""
```



In [None]:
!pip install transformers accelerate --quiet


In [None]:
!huggingface-cli login




In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)





Writing a prompting function to select the best replacement:

In [None]:
def build_prompt(context, target, candidates):
    candidate_str = ", ".join(candidates)

    return f"""<s>[INST] Given the sentence:

"{context}"

What is the best replacement for the complex word "{target}" in this list?

{candidate_str}

Respond with one word only. Do not explain. [/INST]
"""


Setting the generation parameters of the model so that it returns the output we need, i.e. only one word:

In [None]:
def get_best_substitute(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,     # we just need one word
        temperature=0.7,      # a bit of diversity
        top_k=20,             # limit randomness
        top_p=0.9,            # nucleus sampling
        do_sample=True,       # allow sampling
        repetition_penalty=1.2
    )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded.split("\n")[-1].strip()
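
Because sampling can produce extra punctuation or a word outside the candidate list, it may help to normalize the model's answer before evaluation. A minimal sketch, assuming a hypothetical helper `clean_selection` that is not part of the notebook's pipeline:

```python
import re


def clean_selection(raw_answer, candidates):
    """Return the first word of the answer if it is one of the candidates, else None."""
    # Keep alphabetic words only (apostrophes and hyphens allowed inside a word)
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", raw_answer.lower())
    if not words:
        return None
    allowed = {c.lower() for c in candidates}
    return words[0] if words[0] in allowed else None


print(clean_selection(' "Upset."', ["angry", "upset", "furious"]))  # upset
print(clean_selection("None of these", ["angry", "upset"]))         # None
```

Returning `None` for out-of-list answers is one possible design choice; such rows could equally be kept and counted as misses.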


Splitting the dataset into three batches to save time:

In [None]:
# Define split points (.copy() so that a new column can be added to each batch safely)
batch1 = df_subset.iloc[:190].copy()     # Rows 0–189
batch2 = df_subset.iloc[190:380].copy()  # Rows 190–379
batch3 = df_subset.iloc[380:].copy()     # Rows 380–569


In [None]:
from tqdm import tqdm
tqdm.pandas(desc="Mistral v1 SS - Batch 1")

batch1["mlm_selected_sub"] = batch1.progress_apply(
    lambda row: get_best_substitute(
        build_prompt(row["context"], row["target"], row["mlm_substitutes"])
    ) if isinstance(row["mlm_substitutes"], list) and row["mlm_substitutes"] else None,
    axis=1
)



In [None]:
batch1.to_csv("mistral_batch1.tsv", sep="\t", index=False)
from google.colab import files
files.download("mistral_batch1.tsv")



In [None]:
from tqdm import tqdm
tqdm.pandas(desc="Mistral v1 SS - Batch 2")

batch2["mlm_selected_sub"] = batch2.progress_apply(
    lambda row: get_best_substitute(
        build_prompt(row["context"], row["target"], row["mlm_substitutes"])
    ) if isinstance(row["mlm_substitutes"], list) and row["mlm_substitutes"] else None,
    axis=1
)



In [None]:
batch2.to_csv("mistral_batch2.tsv", sep="\t", index=False)

from google.colab import files
files.download("mistral_batch2.tsv")



In [None]:
from tqdm import tqdm
tqdm.pandas(desc="Mistral v1 SS - Batch 3")

batch3["mlm_selected_sub"] = batch3.progress_apply(
    lambda row: get_best_substitute(
        build_prompt(row["context"], row["target"], row["mlm_substitutes"])
    ) if isinstance(row["mlm_substitutes"], list) and row["mlm_substitutes"] else None,
    axis=1
)



In [None]:
batch3.to_csv("mistral_batch3.tsv", sep="\t", index=False)

from google.colab import files
files.download("mistral_batch3.tsv")



The three batches are combined for testing:

In [None]:
import pandas as pd


batch1 = pd.read_csv("/content/mistral_batch1.tsv", sep="\t")
batch2 = pd.read_csv("/content/mistral_batch2.tsv", sep="\t")
batch3 = pd.read_csv("/content/mistral_batch3.tsv", sep="\t")

# Merge all three
merged_df = pd.concat([batch1, batch2, batch3], ignore_index=True)

# Save merged file
merged_df.to_csv("merged_mistral_substitutions.tsv", sep="\t", index=False)

from google.colab import files
files.download("merged_mistral_substitutions.tsv")



In [None]:
def evaluate_prompt_based_selection_at_k(df, k=10):
    """
    Evaluate prompt-based selection on the top-k selected substitutes (if available).

    Parameters:
    - df: DataFrame with an 'mlm_selected_sub' column (a single selected word,
      or a list of candidates).
    - k: number of top substitutes to consider; the model selects only one,
      so Precision@k cannot exceed 1/k.

    Prints:
    - Potential@k
    - Precision@k
    - Recall@k
    """
    potential_hits = 0
    total_precision = 0
    total_recall = 0
    total_instances = 0

    sub_cols = [col for col in df.columns if col.startswith("substitution_")]

    for _, row in df.iterrows():
        gold_subs = set(str(s).lower() for s in row[sub_cols].dropna())
        preds = row["mlm_selected_sub"]

        if isinstance(preds, str):
            preds = [preds.strip().lower()]
        elif isinstance(preds, list):
            preds = [p.strip().lower() for p in preds[:k]]
        else:
            continue

        if not gold_subs or not preds:
            continue

        gold_hit = len(set(preds) & gold_subs)
        total_instances += 1
        potential_hits += int(gold_hit > 0)
        total_precision += gold_hit / k
        total_recall += gold_hit / len(gold_subs)

    print(f"\n Evaluation Results for Prompt-based SS at Top-{k}:")
    print(f" Potential@{k}:", round(potential_hits / total_instances, 3))
    print(f" Precision@{k}:", round(total_precision / total_instances, 3))
    print(f" Recall@{k}:", round(total_recall / total_instances, 3))

evaluate_prompt_based_selection_at_k(merged_df, k=10)


 Evaluation Results for Prompt-based SS at Top-10:
 Potential@10: 0.208
 Precision@10: 0.021
 Recall@10: 0.045
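
Since the model returns a single word per instance, each instance contributes either 0 or 1/k to the precision sum, so Precision@k here equals Potential@k divided by k. A quick check of that relationship against the reported figures, as a sketch:

```python
k = 10
potential_at_k = 0.208  # Potential@10 reported for prompt-based selection

# With exactly one prediction per instance, the number of hits is 0 or 1,
# so the averaged precision is the averaged potential divided by k.
precision_at_k = potential_at_k / k
print(round(precision_at_k, 3))  # 0.021
```

This matches the reported Precision@10, which means the precision score adds no information beyond Potential@10 in this setting.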
