## Sentence Transformation and Embeddings

### Initialization

#### Acknowledgements

- https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- https://www.kaggle.com/datasets/mauricerupp/englishspeaking-politicians (v6 Dataset)

#### Packages

In [1]:
import sentence_transformers as pkg_sentence_transformers
import scipy.spatial.distance as pkg_distance
import pandas as pkg_pandas
import sklearn.model_selection as pkg_model_selection

#### Common

In [2]:
model_path = 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2'
model = pkg_sentence_transformers.SentenceTransformer(model_path)

### Sample: Find Cosine Similiarity

In [3]:
# Reference Sentence
ref_sentence = "That is a happy person"

# Sentences we want compare with
cmp_sentences = [
    "Sunny days make people happy",
    "Rainy days make people happy"
    "Relationships enhance happiness",
    "There is no happy person in the world. Happiness is misnomer",
    "Mind free from tensions make a person happy",
    "People on vacation at beach side over bright and sunny days",
    "People kind at heart and wearing smile always",
    "People having fun, laughing out loud, elevating the mood"
]

ref_embeddings = model.encode(sentences=[ref_sentence])
print("Embedding/Vector Shape = {}".format(ref_embeddings[0].shape))

cmp_embeddings = model.encode(cmp_sentences)

print ("Reference Sentence: {}".format(ref_sentence))
for i in range(len(cmp_embeddings)):
    similiarity = 1 - pkg_distance.cosine(ref_embeddings[0], cmp_embeddings[i])
    print("Cosine Similiarity: {} for sentence = {}".format(similiarity, cmp_sentences[i]))

Embedding/Vector Shape = (768,)
Reference Sentence: That is a happy person
Cosine Similiarity: 0.6686985492706299 for sentence = Sunny days make people happy
Cosine Similiarity: 0.5185865163803101 for sentence = Rainy days make people happyRelationships enhance happiness
Cosine Similiarity: 0.48634278774261475 for sentence = There is no happy person in the world. Happiness is misnomer
Cosine Similiarity: 0.5353088974952698 for sentence = Mind free from tensions make a person happy
Cosine Similiarity: 0.23558664321899414 for sentence = People on vacation at beach side over bright and sunny days
Cosine Similiarity: 0.5658841729164124 for sentence = People kind at heart and wearing smile always
Cosine Similiarity: 0.553466796875 for sentence = People having fun, laughing out loud, elevating the mood


### Play: Find Similar Sentences

#### Load Data

In [4]:
baseline_df = pkg_pandas.read_csv("../data/kaggle/datasets/english-speaking-politicians-v6_0.csv.gz", compression='gzip', on_bad_lines='skip')
baseline_inputs = baseline_df
baseline_inputs.head()

Unnamed: 0,Author,Country,Date,Speech,Title,URL
0,Justin Trudeau,Canada,2020-10-20,"Good morning, everyone.\nI’m happy to be here ...",Prime Minister’s remarks on Small Business Wee...,https://pm.gc.ca/en/news/speeches/2020/10/20/p...
1,Justin Trudeau,Canada,2020-10-16,Hello. Good morning everyone.\nI’m pleased to ...,Prime Minister’s remarks on the measures taken...,https://pm.gc.ca/en/news/speeches/2020/10/16/p...
2,Justin Trudeau,Canada,2020-10-13,Hello everyone.\nIt’s good to be here this mor...,Prime Minister’s remarks on COVID-19 testing a...,https://pm.gc.ca/en/news/speeches/2020/10/13/p...
3,Justin Trudeau,Canada,2020-10-09,Hello everyone.\nI’m happy to be joined today ...,Prime Minister’s remarks on support for Canadi...,https://pm.gc.ca/en/news/speeches/2020/10/09/p...
4,Justin Trudeau,Canada,2020-10-08,"Hello everyone.\nThank you, Minister Bains. It...",Prime Minister’s remarks on a new commitment t...,https://pm.gc.ca/en/news/speeches/2020/10/08/p...


In [5]:
# This dataset contains 9567 samples, train_size of 2% means roughly 191 samples
train_inputs, test_inputs = pkg_model_selection.train_test_split(baseline_inputs, train_size=0.02)

print("=== Baseline Split - Train and Test ===")
print("Lengths: Baseline = {}, Train = {}, Test = {}".format(len(baseline_inputs), len(train_inputs), len(test_inputs)))

=== Baseline Split - Train and Test ===
Lengths: Baseline = 9567, Train = 191, Test = 9376


In [6]:
# Persist the train samples as these are the ones we refer to using an index
train_inputs.to_csv("../.outputs/persisted/st_mpnet_basev2_sample_sentences.csv", index=False)

#### Process

In [7]:
# Calculate Sentence Embeddings
# (Assume order is preserved)
train_embeddings = model.encode(train_inputs.Speech.array)
train_embeddings.shape

(191, 768)

In [8]:
# Calculate Distances/Similarities
similarites_df = pkg_pandas.DataFrame(columns=["i", "j", "cosine_distance", "euclidean_distance"])

for i in range(len(train_embeddings)):
    for j in range(len(train_embeddings)-i-1):
        # Do not compare various speeches of same author (bias removal?) 
        if (train_inputs.iloc[i].Author != train_inputs.iloc[j].Author):
            cosine_distance = pkg_distance.cosine(train_embeddings[i], train_embeddings[i+j+1])
            euclidean_distance = pkg_distance.euclidean(train_embeddings[i], train_embeddings[i+j+1])
            similarites_df.loc[len(similarites_df)] = [i, j, cosine_distance, euclidean_distance]

similarites_df.head()

Unnamed: 0,i,j,cosine_distance,euclidean_distance
0,0.0,1.0,0.697709,3.26246
1,0.0,2.0,0.612179,3.035995
2,0.0,3.0,0.717433,3.32487
3,0.0,4.0,0.619551,2.997704
4,0.0,5.0,0.620591,3.029624


In [9]:
# Filter rows that have less than 40% cosine distance
similarites_df = similarites_df[(similarites_df["cosine_distance"] < 0.40)]
similarites_df.head()

Unnamed: 0,i,j,cosine_distance,euclidean_distance
22,0.0,23.0,0.393853,2.384844
49,0.0,50.0,0.368908,2.339922
201,1.0,13.0,0.265744,2.093289
237,1.0,49.0,0.24949,2.013427
239,1.0,51.0,0.322221,2.250472


In [10]:
# Sort using similarity
similarites_df.sort_values(by=["cosine_distance", "euclidean_distance"], ascending=True, inplace=True)
similarites_df.to_csv("../.outputs/persisted/st_mpnet_basev2_similar_sentences.csv", index=False)
similarites_df.head()

Unnamed: 0,i,j,cosine_distance,euclidean_distance
2293,14.0,102.0,0.085004,1.152188
15066,153.0,30.0,0.157777,1.645235
7382,51.0,133.0,0.176266,1.628031
14309,133.0,17.0,0.18663,1.690782
4675,30.0,150.0,0.190413,1.744928
