<a href="https://colab.research.google.com/github/hadar-grimberg/data-science-portfolio/blob/main/SentenceTransformer_for_similarity_w_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description

### The purpose of the code is to fine-tune a pretrained model to measure the semantic similarity of two texts: a job title and a job description (a Semantic Textual Similarity task, STS)

### The data is a labeled dataset of job ads that contains the following features:
<br>
**row_index_title_en** - index of the row
<br>
**title_en** - the job title in English
<br>
**row_index_description_en** - the row index of description
<br>
**description_en** - the job description in English
<br>
**subcategory_id_title_en** - subcategory index of title
<br>
**subcategory_id_description_en** - subcategory index of description
<br>
**label** - 0 or 1
<br>
**label_title** - diff or same
<br>
**subcategory_title_en** - the subcategory of the title
<br>
**subcategory_description_en**  - the subcategory of the description
<br><br>
* The label depends on whether the two subcategories are identical: 1 ("same") when the title and the description belong to the same subcategory, 0 ("diff") otherwise.
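The labeling rule can be sketched as a small function. This is only an illustration of the rule as described above; the function name `label_pair` is hypothetical, and how the original dataset treated missing subcategories is not known.

```python
def label_pair(subcategory_id_title, subcategory_id_description):
    """Return (label, label_title) for one title/description pair."""
    same = subcategory_id_title == subcategory_id_description
    return (1, "same") if same else (0, "diff")


# Subcategory ids taken from the sample rows printed in the Data Exploration section.
print(label_pair(539.0, 539.0))   # (1, 'same')
print(label_pair(1631.0, 669.0))  # (0, 'diff')
```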

### The model:
<br>
The model is "all-MiniLM-L6-v2", a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space.
"all-MiniLM-L6-v2" is faster than other pretrained sentence-transformers models and still offers good-quality results for the STS task.
<br>
The purpose of the code is to fine-tune the network so that it recognizes the similarity between two texts.
I used cosine similarity loss (`CosineSimilarityLoss`) as the loss function for this task.
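To make the loss concrete: with its default settings, `CosineSimilarityLoss` compares the cosine similarity of the two embeddings with the label using mean squared error. The sketch below reproduces that arithmetic in plain Python on toy 2-dimensional vectors. The vectors and helper names are made up for illustration and are not part of the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy "embeddings"; the real model produces 384-dimensional vectors.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine = 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0
]
labels = [1.0, 0.0]
loss = cosine_similarity_loss(pairs, labels)
print(loss)  # 0.0, because both predictions match their labels exactly
```

A pair labeled 1.0 whose embeddings are orthogonal would contribute a squared error of 1.0, which is what pushes "same" pairs toward each other during fine-tuning.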

## Load Libraries

In [None]:
import os
import numpy as np
import pandas as pd
from datetime import datetime
from scipy import stats
import torch
from torch.utils.data import DataLoader # Python iterable over a dataset
from sentence_transformers import evaluation # SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Configure

In [None]:
# configure pandas display
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
# configure loading and saving paths
model_name = "similarity_classification"  # a short, representative name for the algorithm being used
time_stamp = str(datetime.now().timestamp())  # timestamp used to make the output directory unique
model_dir_data = model_name + "_" + time_stamp
base_path = os.path.abspath('')
input_path = os.path.join(base_path, "input_data")
output_path = os.path.join(base_path, "output_models")
model_path = os.path.join(output_path, model_dir_data)
if not os.path.exists(model_path):
    os.makedirs(model_path)
max_seq_length = 128  # maximum number of tokens the model reads per input

## Create Model

"all-MiniLM-L6-v2" model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.

In [None]:
# Define appropriate model for the task
model = SentenceTransformer('all-MiniLM-L6-v2')  # A model for Semantic Textual Similarity
# Set the maximum input length (in tokens)
model.max_seq_length = max_seq_length

## Load Data

In [None]:
# Load the datasets from csv
train_df = pd.read_csv(os.path.join(input_path, 'trial_1', 'train.csv'), encoding='utf-8-sig')
val_df = pd.read_csv(os.path.join(input_path, 'trial_1', 'val.csv'), encoding='utf-8-sig')
test_df = pd.read_csv(os.path.join(input_path, 'trial_1', 'test.csv'), encoding='utf-8-sig')

## Data Exploration

In [None]:
# Check if all datasets have the same columns (see that the data structure is the same, otherwise this should be taken into account in the preprocess)
print(all(train_df.columns==val_df.columns))
print(all(train_df.columns==test_df.columns))
# all tables have the same features, now, let's examine them
print(f"dataset has {len(train_df.columns)} columns")
print(train_df.columns)


True
True
dataset has 12 columns
Index(['row_index_title_en', 'title_en', 'row_index_description_en',
       'description_en', 'subcategory_id_title_en',
       'subcategory_id_description_en', 'job_id_title_en',
       'job_id_description_en', 'label', 'label_title', 'subcategory_title_en',
       'subcategory_description_en'],
      dtype='object')


In [None]:
# check the type of the columns
train_df.dtypes

row_index_title_en                 int64
title_en                          object
row_index_description_en           int64
description_en                    object
subcategory_id_title_en          float64
subcategory_id_description_en    float64
job_id_title_en                    int64
job_id_description_en              int64
label                              int64
label_title                       object
subcategory_title_en              object
subcategory_description_en        object
dtype: object

In [None]:
# Check the first 5 rows to see what the data looks like
train_df.head()

Unnamed: 0,row_index_title_en,title_en,row_index_description_en,description_en,subcategory_id_title_en,subcategory_id_description_en,job_id_title_en,job_id_description_en,label,label_title,subcategory_title_en,subcategory_description_en
0,35612,Intern for the Neurological Department,35612,The neurological department at a medical cente...,539.0,539.0,5226341,5226341,1,same,Physician and surgeons,Physician and surgeons
1,97706,You are invited to join the service family in ...,97706,Because your time is precious to us! You are i...,693.0,693.0,6155479,6155479,1,same,Customer service representatives,Customer service representatives
2,52920,Helpdesk Man/Wife,52920,Require a Helpdesk man/wife. \nFull-time in th...,1459.0,1459.0,5728215,5728215,1,same,Help Desk - Software,Help Desk - Software
3,4932,Architectural engineer,90211,A law firm located in BSR 3 tower in Bnei Brak...,1631.0,669.0,5291642,6111701,0,diff,architecture engineer,Lawyer
4,77544,Management teams for Israel's largest fashion ...,27736,A construction and architecture company specia...,1517.0,400.0,5620686,5947423,0,diff,Branch manager / store manager,Foreman


In [None]:
# It looks like the numeric features are the indices of the nominal features. Let's check how many unique values each one has
numeric_features = train_df.select_dtypes(include=[np.number]).columns.to_list()
print(train_df[numeric_features].nunique())
print(train_df.label_title.unique())

row_index_title_en               1000
row_index_description_en         1000
subcategory_id_title_en           381
subcategory_id_description_en     379
job_id_title_en                  1000
job_id_description_en            1000
label                               2
dtype: int64
['same' 'diff']


There are 1000 unique values for both the job titles and the job descriptions (but their indices are not the same)
<br>
There are 381 unique values of subcategory_id_title_en and 379 unique values of subcategory_id_description_en
<br>
The labels are "same" or "diff"

In [None]:
# Check whether the unique indices are the same for titles and descriptions
print("Are all job_id_title the same as job_id_description? ", all(np.sort(train_df.job_id_title_en.unique())==np.sort(train_df.job_id_description_en.unique())))
print("Are any job_id_title the same as job_id_description? ", any(np.sort(train_df.job_id_title_en.unique())==np.sort(train_df.job_id_description_en.unique())))
# only a few of them
print("Are all row_index_title the same as row_index_description? ", all(np.sort(train_df.row_index_title_en.unique())==np.sort(train_df.row_index_description_en.unique())))
print("Are any row_index_title the same as row_index_description? ", any(np.sort(train_df.row_index_title_en.unique())==np.sort(train_df.row_index_description_en.unique())))

Are all job_id_title the same as job_id_description?  False
Are any job_id_title the same as job_id_description?  True
Are all row_index_title the same as row_index_description?  False
Are any row_index_title the same as row_index_description?  True
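The check above compares the two sorted arrays position by position, so a single extra id shifts every later position. A set-based comparison answers the same question without depending on alignment. The sketch below is one possible way to do it; `compare_id_sets` and the ids are hypothetical and not taken from the dataset.

```python
def compare_id_sets(title_ids, description_ids):
    """Summarize the overlap between two collections of ids."""
    title_set, description_set = set(title_ids), set(description_ids)
    return {
        "identical": title_set == description_set,
        "shared": len(title_set & description_set),
        "only_in_titles": len(title_set - description_set),
        "only_in_descriptions": len(description_set - title_set),
    }


# Hypothetical ids, for illustration only.
report = compare_id_sets([101, 102, 103], [102, 103, 104])
print(report)
```

With the notebook's data, the same function could be called as `compare_id_sets(train_df.job_id_title_en, train_df.job_id_description_en)`.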


In [None]:
# Check whether there are rows where row_index_title_en equals row_index_description_en but job_id_title_en differs from job_id_description_en
print(train_df[["job_id_title_en", "job_id_description_en"]][(train_df.row_index_title_en==train_df.row_index_description_en)&(train_df.job_id_title_en!=train_df.job_id_description_en)])

# The table is empty: whenever the row indices match, the job ids match too, so job_id is redundant
# Let's remove it
train_df.drop(["job_id_title_en", "job_id_description_en"], inplace=True, axis=1)
val_df.drop(["job_id_title_en", "job_id_description_en"], inplace=True, axis=1)
test_df.drop(["job_id_title_en", "job_id_description_en"], inplace=True, axis=1)

Empty DataFrame
Columns: [job_id_title_en, job_id_description_en]
Index: []


Check for nulls in data

In [None]:
train_df.isnull().sum()

row_index_title_en               0
title_en                         0
row_index_description_en         0
description_en                   0
subcategory_id_title_en          3
subcategory_id_description_en    3
label                            0
label_title                      0
subcategory_title_en             3
subcategory_description_en       3
dtype: int64

In [None]:
val_df.isnull().sum()

row_index_title_en               0
title_en                         0
row_index_description_en         0
description_en                   0
subcategory_id_title_en          0
subcategory_id_description_en    0
label                            0
label_title                      0
subcategory_title_en             0
subcategory_description_en       0
dtype: int64

In [None]:
test_df.isnull().sum()

row_index_title_en               0
title_en                         0
row_index_description_en         0
description_en                   0
subcategory_id_title_en          0
subcategory_id_description_en    0
label                            0
label_title                      0
subcategory_title_en             0
subcategory_description_en       0
dtype: int64

There are only a few nulls, all in the train set (three in each subcategory column), and these columns are not fed to the model. There is no need to deal with missing values.

In [None]:
#Check maximum length of strings in "object" columns
measurer = np.vectorize(len)
str_len=measurer(train_df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)
cols=train_df.select_dtypes(include=[object]).columns.to_list()
for c, i in zip(cols, str_len):
  print(f"The maximum length of {c} column is {i}")

The maximum length of title_en column is 163
The maximum length of description_en column is 3344
The maximum length of label_title column is 4
The maximum length of subcategory_title_en column is 70
The maximum length of subcategory_description_en column is 70


It looks like description_en is too long (up to 3,344 characters) and we need to handle it, since our model can process at most 128 tokens per input.
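Note that the lengths measured above are in characters, while the model's limit is in tokens. As a rough proxy only, one can count whitespace-separated words: the model's tokenizer generally produces at least one token per word, so a text with more than 128 words almost certainly exceeds the limit. The helper names and sample texts below are hypothetical.

```python
def whitespace_word_count(text):
    """Crude proxy for token count: number of whitespace-separated words."""
    return len(text.split())


def share_over_budget(texts, budget=128):
    """Fraction of texts whose word count exceeds the budget."""
    over = sum(1 for t in texts if whitespace_word_count(t) > budget)
    return over / len(texts)


# Hypothetical texts: one short ad and one 200-word ad.
texts = ["short ad", "word " * 200]
share = share_over_budget(texts)
print(share)  # 0.5
```

An exact count would require running the model's own tokenizer over each description.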

In [None]:
train_df["length"] = train_df.description_en.apply(lambda x: len(x))

In [None]:
# 94.7% of the records have a length of 1000 characters or less; about 50% have a length of 321 or less
print(train_df["length"].describe())
print("The percentile of length 128: ", stats.percentileofscore(train_df["length"], 128))
print("The percentile of length 1000: ", stats.percentileofscore(train_df["length"], 1000))
print(f'The length in the 90th percentile: {train_df["length"].quantile(0.90):.2f}')

count   1,000.00
mean      395.45
std       330.43
min         6.00
25%       176.00
50%       321.00
75%       506.25
max     3,344.00
Name: length, dtype: float64
The percentile of length 128:  15.55
The percentile of length 1000:  94.7
The length in the 90th percentile: 750.30
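The percentile figures above answer the question "what share of descriptions is at most this long?". A minimal plain-Python version of that idea is sketched below. It counts values at or below the threshold, whereas `stats.percentileofscore` with its default settings averages over ties, so the two can differ slightly. The lengths are hypothetical.

```python
def percent_at_or_below(values, threshold):
    """Percentage of values that are less than or equal to the threshold."""
    return 100.0 * sum(1 for v in values if v <= threshold) / len(values)


# Hypothetical description lengths in characters.
lengths = [6, 120, 176, 321, 506, 900, 3344]
print(percent_at_or_below(lengths, 1000))  # 6 of the 7 values qualify
```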


In [None]:
# check if the labels in the various datasets are balanced
print(f"% of same within the train set is: {(len(train_df[train_df.label==1])/len(train_df))*100:.1f}% out of {len(train_df)} records")
print(f"% of same within the validation set is: {(len(val_df[val_df.label==1])/len(val_df))*100:.1f}% out of {len(val_df)} records")
print(f"% of same within the test set is: {(len(test_df[test_df.label==1])/len(test_df))*100:.1f}% out of {len(test_df)} records")

% of same within the train set is: 52.1% out of 1000 records
% of same within the validation set is: 54.0% out of 100 records
% of same within the test set is: 54.0% out of 100 records


About 83% of the data is for train, 8% for validation and 8% for test.
<br>
The labels look balanced within each dataset.

## Define callback routine

In [None]:
# The callback prints the evaluation score after each epoch, to track the improvement during training
def callback(score, epoch, steps):
    print(f"The evaluation score after epoch {epoch}: {score}")

## Preprocess data

In [None]:
# Convert the labels to float so that the model can process them
train_df.label= train_df.label.astype("float64")
val_df.label= val_df.label.astype("float64")
test_df.label= test_df.label.astype("float64")

We found that description_en is too long, so let's summarize it before the STS task.

In [None]:
# compute degree_centrality_scores for text-summarization
"""
LexRank implementation
Source: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/text-summarization/LexRank.py
"""

from scipy.sparse.csgraph import connected_components

def degree_centrality_scores(
    similarity_matrix,
    threshold=None,
    increase_power=True):
    if not (
        threshold is None
        or isinstance(threshold, float)
        and 0 <= threshold < 1):
        raise ValueError(
            '\'threshold\' should be a floating-point number '
            'from the interval [0, 1) or None')

    if threshold is None:
        markov_matrix = create_markov_matrix(similarity_matrix)

    else:
        markov_matrix = create_markov_matrix_discrete(
            similarity_matrix,
            threshold)

    scores = stationary_distribution(
        markov_matrix,
        increase_power=increase_power,
        normalized=False)

    return scores


def _power_method(transition_matrix, increase_power=True):
    eigenvector = np.ones(len(transition_matrix))

    if len(eigenvector) == 1:
        return eigenvector

    transition = transition_matrix.transpose()

    while True:
        eigenvector_next = np.dot(transition, eigenvector)

        if np.allclose(eigenvector_next, eigenvector):
            return eigenvector_next

        eigenvector = eigenvector_next

        if increase_power:
            transition = np.dot(transition, transition)


def connected_nodes(matrix):
    _, labels = connected_components(matrix)

    groups = []

    for tag in np.unique(labels):
        group = np.where(labels == tag)[0]
        groups.append(group)

    return groups


def create_markov_matrix(weights_matrix):
    n_1, n_2 = weights_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'weights_matrix\' should be square')

    row_sum = weights_matrix.sum(axis=1, keepdims=True)

    return weights_matrix / row_sum


def create_markov_matrix_discrete(weights_matrix, threshold):
    discrete_weights_matrix = np.zeros(weights_matrix.shape)
    ixs = np.where(weights_matrix >= threshold)
    discrete_weights_matrix[ixs] = 1

    return create_markov_matrix(discrete_weights_matrix)


def graph_nodes_clusters(transition_matrix, increase_power=True):
    clusters = connected_nodes(transition_matrix)
    clusters.sort(key=len, reverse=True)

    centroid_scores = []

    for group in clusters:
        t_matrix = transition_matrix[np.ix_(group, group)]
        eigenvector = _power_method(t_matrix, increase_power=increase_power)
        centroid_scores.append(eigenvector / len(group))

    return clusters, centroid_scores


def stationary_distribution(
    transition_matrix,
    increase_power=True,
    normalized=True,):
    n_1, n_2 = transition_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'transition_matrix\' should be square')

    distribution = np.zeros(n_1)

    grouped_indices = connected_nodes(transition_matrix)

    for group in grouped_indices:
        t_matrix = transition_matrix[np.ix_(group, group)]
        eigenvector = _power_method(t_matrix, increase_power=increase_power)
        distribution[group] = eigenvector

    if normalized:
        distribution /= n_1

    return distribution

In [None]:
# Create an extractive summary of a long job description.
def summarization(df, model):
  # Split each description into sentences
  df["description_sentences"] = df.description_en.apply(lambda x: nltk.sent_tokenize(x))
  # Compute the sentence embeddings
  df["embeddings"] = df.description_sentences.apply(lambda x: model.encode(x, convert_to_tensor=True))
  # Compute the pair-wise cosine similarities and, from them, the centrality of each sentence
  df["centrality_scores"] = df.embeddings.apply(lambda x: degree_centrality_scores(util.cos_sim(x, x).cpu().numpy()))
  # Argsort so that the first element is the index of the sentence with the highest score
  # (from here on the column holds sentence indices, not scores)
  df["centrality_scores"]=df.centrality_scores.apply(lambda x: np.argsort(-x))
  # Keep the (up to) 5 sentences with the highest scores, together with their embeddings
  df["description_summary"] = df[["description_sentences", "centrality_scores"]].apply(lambda x: [x.description_sentences[i] for i in x.centrality_scores[:5]],axis=1)
  df["embeddings_summary"] = df[["embeddings", "centrality_scores"]].apply(lambda x: [x.embeddings[i] for i in x.centrality_scores[:5]],axis=1)
  return df
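The ranking step inside `summarization` can be illustrated on a toy similarity matrix. The sketch below row-normalizes the matrix into a Markov transition matrix, runs a plain power iteration to approximate the stationary distribution, and keeps the top-scoring sentences. It is a simplification under stated assumptions: it does not split the graph into connected components or square the matrix as the LexRank code above does, and the sentence names and similarity values are made up.

```python
def row_normalize(matrix):
    """Turn a non-negative similarity matrix into a row-stochastic matrix."""
    return [[x / sum(row) for x in row] for row in matrix]


def stationary_scores(transition, tol=1e-10, max_iter=1000):
    """Power iteration: repeatedly apply the transposed transition matrix."""
    n = len(transition)
    scores = [1.0] * n
    for _ in range(max_iter):
        nxt = [sum(scores[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, scores)) < tol:
            return nxt
        scores = nxt
    return scores


def top_k_sentences(sentences, similarity, k=5):
    """Return the k sentences with the highest centrality, best first."""
    scores = stationary_scores(row_normalize(similarity))
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in order[:k]]


# Hypothetical sentences and cosine similarities.
sentences = ["A", "B", "C"]
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
summary = top_k_sentences(sentences, similarity, k=2)
print(summary)  # ['B', 'A']: B is the most similar to the other sentences
```

As in the notebook, the selected sentences come out ordered by centrality rather than by their position in the original description.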

In [None]:
# Run the summarization for the three datasets
train_df = summarization(train_df, model)
val_df = summarization(val_df, model)
test_df = summarization(test_df, model)

In [None]:
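# Prepare the data for the model: each example pairs a title with its summarized description
# Note: str() turns the list of summary sentences into a single string, brackets and quotes included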
input_train_examples_list = train_df[["title_en", "description_summary", "label"]].apply(lambda x: InputExample(texts=[x['title_en'], str(x['description_summary'])], label=torch.full([1], x["label"], dtype=torch.float)),axis=1)
input_train_examples_list=list(input_train_examples_list)
input_val_examples_list = val_df[["title_en", "description_summary", "label"]].apply(lambda x: InputExample(texts=[x['title_en'], str(x['description_summary'])], label=torch.full([1], x["label"], dtype=torch.float)),axis=1)
input_val_examples_list=list(input_val_examples_list)
train_dataloader = DataLoader(input_train_examples_list, shuffle=True, batch_size=16)

# define the loss function
train_loss = losses.CosineSimilarityLoss(model=model)

scores = val_df['label'].to_list()
# Define the evaluator for the model
evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(input_val_examples_list)
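One detail of the cell above is worth seeing in isolation: `description_summary` is a list of sentences, and `str()` keeps the list punctuation in the text the model receives. Joining the sentences with a space is a possible alternative. It is shown here only as an untested suggestion, since it would change the model inputs and therefore the scores reported below. The sentences are hypothetical.

```python
# Hypothetical summary sentences, for illustration only.
summary_sentences = ["We are hiring a nurse.", "Night shifts are required."]

as_list_repr = str(summary_sentences)
as_joined_text = " ".join(summary_sentences)

print(as_list_repr)    # ['We are hiring a nurse.', 'Night shifts are required.']
print(as_joined_text)  # We are hiring a nurse. Night shifts are required.
```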



## Train the model

In [None]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=100, callback=callback,
          output_path=model_path,save_best_model=True)

The evaluation score after epoch 0: 0.651292585348354
The evaluation score after epoch 1: 0.6554630821595494
The evaluation score after epoch 2: 0.6603286617726107
The evaluation score after epoch 3: 0.6749254006117948
The evaluation score after epoch 4: 0.688131973847247
The evaluation score after epoch 5: 0.7020336298845652
The evaluation score after epoch 6: 0.7124598719125537
The evaluation score after epoch 7: 0.7235811967424082
The evaluation score after epoch 8: 0.7326172731666649
The evaluation score after epoch 9: 0.7458238464021171
The evaluation score after epoch 10: 0.7520795916189105
The evaluation score after epoch 11: 0.7569451712319716
The evaluation score after epoch 12: 0.7604205852413012
The evaluation score after epoch 13: 0.7666763304580945
The evaluation score after epoch 14: 0.7694566616655579
The evaluation score after epoch 15: 0.7715419100711557
The evaluation score after epoch 16: 0.7750173240804853
The evaluation score after epoch 17: 0.7729320756748875
The evaluation score after epoch 18: 0.7784927380898149
The evaluation score after epoch 19: 0.7791878208916806
The evaluation score after epoch 20: 0.784053400504742
The evaluation score after epoch 21: 0.784748483306608
The evaluation score after epoch 22: 0.7826632349010102
The evaluation score after epoch 23: 0.7833583177028762
The evaluation score after epoch 24: 0.7861386489103398
The evaluation score after epoch 25: 0.7875288145140716
The evaluation score after epoch 26: 0.7854435661084739
The evaluation score after epoch 27: 0.784053400504742
The evaluation score after epoch 28: 0.7826632349010102
The evaluation score after epoch 29: 0.7826632349010102
The evaluation score after epoch 30: 0.7826632349010102
The evaluation score after epoch 31: 0.784053400504742
The evaluation score after epoch 32: 0.7819681520991443
The evaluation score after epoch 33: 0.7826632349010102
The evaluation score after epoch 34: 0.784748483306608
The evaluation score after epoch 35: 0.7812730692972785
The evaluation score after epoch 36: 0.7854435661084739
The evaluation score after epoch 37: 0.7854435661084739
The evaluation score after epoch 38: 0.7819681520991443
The evaluation score after epoch 39: 0.7812730692972785
The evaluation score after epoch 40: 0.7875288145140716
The evaluation score after epoch 41: 0.7861386489103398
The evaluation score after epoch 42: 0.7896140629196694
The evaluation score after epoch 43: 0.7910042285234011
The evaluation score after epoch 44: 0.7910042285234011
The evaluation score after epoch 45: 0.7944796425327307
The evaluation score after epoch 46: 0.7944796425327307
The evaluation score after epoch 47: 0.7937845597308647
The evaluation score after epoch 48: 0.7930894769289988
The evaluation score after epoch 49: 0.7923943941271329
The evaluation score after epoch 50: 0.7889189801178035
The evaluation score after epoch 51: 0.7896140629196694
The evaluation score after epoch 52: 0.7910042285234011
The evaluation score after epoch 53: 0.7875288145140716
The evaluation score after epoch 54: 0.7833583177028762
The evaluation score after epoch 55: 0.784053400504742
The evaluation score after epoch 56: 0.7812730692972785
The evaluation score after epoch 57: 0.7916993113252669
The evaluation score after epoch 58: 0.7819681520991443
The evaluation score after epoch 59: 0.7798829036935466
The evaluation score after epoch 60: 0.7805779864954124
The evaluation score after epoch 61: 0.7854435661084739
The evaluation score after epoch 62: 0.784053400504742
The evaluation score after epoch 63: 0.777797655287949
The evaluation score after epoch 64: 0.7791878208916806
The evaluation score after epoch 65: 0.7805779864954124
The evaluation score after epoch 66: 0.7896140629196694
The evaluation score after epoch 67: 0.7826632349010102
The evaluation score after epoch 68: 0.7896140629196694
The evaluation score after epoch 69: 0.7812730692972785
The evaluation score after epoch 70: 0.7805779864954124
The evaluation score after epoch 71: 0.784748483306608
The evaluation score after epoch 72: 0.7791878208916806
The evaluation score after epoch 73: 0.7812730692972785
The evaluation score after epoch 74: 0.7819681520991443
The evaluation score after epoch 75: 0.7812730692972785
The evaluation score after epoch 76: 0.7861386489103398
The evaluation score after epoch 77: 0.7826632349010102
The evaluation score after epoch 78: 0.7743222412786194
The evaluation score after epoch 79: 0.7819681520991443
The evaluation score after epoch 80: 0.7805779864954124
The evaluation score after epoch 81: 0.7750173240804853
The evaluation score after epoch 82: 0.777797655287949
The evaluation score after epoch 83: 0.7729320756748875
The evaluation score after epoch 84: 0.784053400504742
The evaluation score after epoch 85: 0.7826632349010102
The evaluation score after epoch 86: 0.777102572486083
The evaluation score after epoch 87: 0.7694566616655579
The evaluation score after epoch 88: 0.7708468272692899
The evaluation score after epoch 89: 0.7757124068823513
The evaluation score after epoch 90: 0.7757124068823513
The evaluation score after epoch 91: 0.777797655287949
The evaluation score after epoch 92: 0.7729320756748875
The evaluation score after epoch 93: 0.7736271584767535
The evaluation score after epoch 94: 0.7764074896842171
The evaluation score after epoch 95: 0.7687615788636921
The evaluation score after epoch 96: 0.7687615788636921
The evaluation score after epoch 97: 0.7812730692972785
The evaluation score after epoch 98: 0.7638959992506308
The evaluation score after epoch 99: 0.7708468272692899


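During fine-tuning, the validation score (computed on a set the model is not trained on) improved from 0.65 to 0.79. The score is a correlation between the predicted similarities and the labels, not an accuracy. The best weights are saved, not the last ones.
<br>
There is not a lot of improvement after epoch 18 (0.78): the score peaks at 0.79 around epochs 45-46 and then drifts down slightly.
<br>
Future improvement: add early stopping.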
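Early stopping can be expressed as a simple patience rule over the per-epoch validation scores: stop once the score has not improved for a given number of epochs. The sketch below is library-independent and only illustrates the rule. The function name, the `patience` and `min_delta` parameters, and the scores are assumptions, not part of the training code above.

```python
def early_stopping_epoch(scores, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a list of per-epoch scores.

    Training stops at the first epoch that is `patience` epochs past the
    best one without an improvement larger than `min_delta`.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(scores):
        if score > best_score + min_delta:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(scores) - 1, best_epoch


# Hypothetical validation scores.
scores = [0.65, 0.70, 0.74, 0.73, 0.72, 0.74, 0.71]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

A larger `patience` tolerates the kind of noisy plateau seen in the log above, at the cost of more training time.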



## Configure the test set paths

In [None]:
sentence_transformer_model = SentenceTransformer(model_path)
transformer_path = os.path.join(output_path, "results" + "_" + time_stamp)
if not os.path.exists(transformer_path):
    os.makedirs(transformer_path)

## Test the model after the fine-tuning

In [None]:
# Prepare the test data for the model
input_test_examples_list = test_df[["title_en", "description_summary", "label"]].apply(lambda x: InputExample(texts=[x['title_en'], str(x['description_summary'])], label=torch.full([1], x["label"], dtype=torch.float)),axis=1)
input_test_examples_list=list(input_test_examples_list)
# Evaluate the fine-tuned model on the test set


test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(input_test_examples_list)
test_evaluator(sentence_transformer_model, output_path=transformer_path)

0.7673714132599603

The test score is 0.767, measured with the same evaluator metric as the validation score.
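To help interpret this number: as far as I know, `EmbeddingSimilarityEvaluator` reports a Spearman rank correlation between the predicted similarities and the labels, although this is an assumption about the installed library version. The plain-Python sketch below computes Spearman's correlation with average ranks for ties. The similarity values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


labels = [1.0, 1.0, 0.0, 0.0]
similarities = [0.9, 0.7, 0.4, 0.1]  # hypothetical model outputs
rho = spearman(similarities, labels)
print(round(rho, 3))  # 0.894
```

In this toy case every "same" pair scores higher than every "diff" pair, yet the correlation stays below 1 because the binary labels are heavily tied. A score of 0.767 should be read with that ceiling in mind rather than as a percentage of correct predictions.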