## Sentence Transformation and Embeddings

### Initialization

#### Acknowledgements

- https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- https://www.kaggle.com/datasets/mauricerupp/englishspeaking-politicians (v6 Dataset)

#### Packages

In [1]:
import sentence_transformers as pkg_sentence_transformers
import scipy.spatial as pkg_spatial
import pandas as pkg_pandas
import sklearn.model_selection as pkg_model_selection

#### Common

In [2]:
model_path = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
model = pkg_sentence_transformers.SentenceTransformer(model_path)

In [3]:
cosine_similarity = lambda x,y: (1 - pkg_spatial.distance.cosine(x,y))

### Sample: Find Similiarity

In [4]:
# Reference Sentence
ref_sentence = "That is a happy person"

# Sentences we want compare with
cmp_sentences = [
    "Sunny days make people happy",
    "Rainy days make people happy"
    "Relationships enhance happiness",
    "There is no happy person in the world. Happiness is misnomer",
    "Mind free from tensions make a person happy",
    "People on vacation at beach side over bright and sunny days",
    "People kind at heart and wearing smile always",
    "People having fun, laughing out loud, elevating the mood"
]

ref_embeddings = model.encode(sentences=[ref_sentence])
print("Embedding/Vector Shape = {}".format(ref_embeddings[0].shape))

cmp_embeddings = model.encode(cmp_sentences)

print ("Reference Sentence: {}".format(ref_sentence))
for i in range(len(cmp_embeddings)):
    similiarity = cosine_similarity(ref_embeddings[0], cmp_embeddings[i])
    print("Similiarity: {} for sentence = {}".format(similiarity, cmp_sentences[i]))

Embedding/Vector Shape = (768,)
Reference Sentence: That is a happy person
Similiarity: 0.6686985492706299 for sentence = Sunny days make people happy
Similiarity: 0.5185865163803101 for sentence = Rainy days make people happyRelationships enhance happiness
Similiarity: 0.48634278774261475 for sentence = There is no happy person in the world. Happiness is misnomer
Similiarity: 0.5353088974952698 for sentence = Mind free from tensions make a person happy
Similiarity: 0.23558664321899414 for sentence = People on vacation at beach side over bright and sunny days
Similiarity: 0.5658841729164124 for sentence = People kind at heart and wearing smile always
Similiarity: 0.553466796875 for sentence = People having fun, laughing out loud, elevating the mood


### Play: Find Similar Sentences

#### Load Data

In [5]:
baseline_df = pkg_pandas.read_csv("../data/kaggle/datasets/english-speaking-politicians-v6_0.csv.gz", compression='gzip', on_bad_lines='skip')
baseline_inputs = baseline_df
baseline_inputs.head()

Unnamed: 0,Author,Country,Date,Speech,Title,URL
0,Justin Trudeau,Canada,2020-10-20,"Good morning, everyone.\nI’m happy to be here ...",Prime Minister’s remarks on Small Business Wee...,https://pm.gc.ca/en/news/speeches/2020/10/20/p...
1,Justin Trudeau,Canada,2020-10-16,Hello. Good morning everyone.\nI’m pleased to ...,Prime Minister’s remarks on the measures taken...,https://pm.gc.ca/en/news/speeches/2020/10/16/p...
2,Justin Trudeau,Canada,2020-10-13,Hello everyone.\nIt’s good to be here this mor...,Prime Minister’s remarks on COVID-19 testing a...,https://pm.gc.ca/en/news/speeches/2020/10/13/p...
3,Justin Trudeau,Canada,2020-10-09,Hello everyone.\nI’m happy to be joined today ...,Prime Minister’s remarks on support for Canadi...,https://pm.gc.ca/en/news/speeches/2020/10/09/p...
4,Justin Trudeau,Canada,2020-10-08,"Hello everyone.\nThank you, Minister Bains. It...",Prime Minister’s remarks on a new commitment t...,https://pm.gc.ca/en/news/speeches/2020/10/08/p...


In [6]:
# This dataset contains 9567 samples, train_size of 2% means roughly 191 samples
train_inputs, test_inputs = pkg_model_selection.train_test_split(baseline_inputs, train_size=0.02)

print("=== Baseline Split - Train and Test ===")
print("Lengths: Baseline = {}, Train = {}, Test = {}".format(len(baseline_inputs), len(train_inputs), len(test_inputs)))

=== Baseline Split - Train and Test ===
Lengths: Baseline = 9567, Train = 191, Test = 9376


In [7]:
# Persist the train samples as these are the ones we refer to using an index
train_inputs.to_csv("../.outputs/persisted/st_mpnet_basev2_sample_sentences.csv", index=False)

#### Process

In [8]:
# Calculate Sentence Embeddings
# (Assume order is preserved)
train_embeddings = model.encode(train_inputs.Speech.array)
train_embeddings.shape

(191, 768)

In [9]:
# Output Column Name
output_column_name = "similiarity"

# Calculate Similarities
similarites_df = pkg_pandas.DataFrame(columns=["i", "j", output_column_name])

for i in range(len(train_embeddings)):
    for j in range(len(train_embeddings)-i-1):
        # Do not compare various speeches of same author (bias removal?) 
        if (train_inputs.iloc[i].Author != train_inputs.iloc[j].Author):
            similiarity = cosine_similarity(train_embeddings[i], train_embeddings[i+j+1])
            similarites_df.loc[len(similarites_df)] = [i, j, similiarity]

similarites_df.head()

Unnamed: 0,i,j,similiarity
0,0.0,1.0,0.288101
1,0.0,2.0,0.210644
2,0.0,4.0,0.262663
3,0.0,6.0,0.396193
4,0.0,11.0,0.142414


In [10]:
# Filter rows that do not have at least 60% similarity
similarites_df = similarites_df[(similarites_df[output_column_name] > 0.60)]
similarites_df.head()

Unnamed: 0,i,j,similiarity
122,1.0,5.0,0.703965
189,1.0,73.0,0.61095
215,1.0,100.0,0.629581
250,1.0,139.0,0.635559
362,2.0,64.0,0.60959


In [11]:
# Sort using similarity
similarites_df.sort_values(by=[output_column_name], ascending=False, inplace=True)
similarites_df.to_csv("../.outputs/persisted/st_mpnet_basev2_similar_sentences.csv", index=False)
similarites_df.head()

Unnamed: 0,i,j,similiarity
1997,14.0,6.0,0.88455
3885,27.0,18.0,0.835458
2062,14.0,73.0,0.824804
3011,21.0,9.0,0.814522
3067,21.0,66.0,0.795639
