In this notebook, we examine an example output of a speech-to-text model (i.e., a conversation transcript generated from an audio recording). We try to extract the main topics of this conversation using Gensim.

In [9]:
import json
import pandas as pd
import spacy

modelSpacy = spacy.load('en')

In [10]:
# Load both JSON files; the context managers close the file handles afterwards.
with open("sentences.json") as sentencesFile:
    sentencesData = json.load(sentencesFile)

with open("transcript.json") as transcriptFile:
    transcriptData = json.load(transcriptFile)

# Exploring data

Each sentence in the conversation is stored with the following fields: speaker (integer), sentence (string), and document (integer index; a document contains all consecutive sentences from one speaker before the conversation switches to another speaker).
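To make the notion of a document concrete, here is a minimal sketch on toy data. The `toyLines` list and the `assignDocuments` helper are invented for illustration and are not taken from `sentences.json`. A new document index starts whenever the speaker changes, so the same speaker can own several documents.

```python
from itertools import groupby

# Hypothetical toy input: (speaker, sentence) pairs in spoken order.
toyLines = [
    (1, "Hello everyone."),
    (1, "Welcome."),
    (0, "Thank you."),
    (1, "Let's start."),
]

def assignDocuments(lines):
    """Give each run of consecutive sentences by one speaker its own document index."""
    rows = []
    runs = groupby(lines, key=lambda pair: pair[0])
    for docIdx, (speaker, run) in enumerate(runs):
        for _, sentence in run:
            rows.append({"speaker": speaker, "document": docIdx, "sentence": sentence})
    return rows

toyRows = assignDocuments(toyLines)
for row in toyRows:
    print(row)
```

In this toy example, speaker 1 ends up with two documents (indices 0 and 2) because speaker 0 talks in between.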

In [11]:
sentences = {}
sentIdx = 0
docIdx = 0
currentSpeaker = int(sentencesData['data'][0]['speaker'])
for line in sentencesData['data']:
    # Start a new document whenever the speaker changes.
    if int(line['speaker']) != currentSpeaker:
        docIdx += 1
        currentSpeaker = int(line['speaker'])
    for sent in line['sentence']:
        sentences[sentIdx] = {"speaker": line['speaker'], "document":docIdx, "sentence": sent, "sentWordCount":len(sent.split())}
        sentIdx+=1


sentencesDf = pd.DataFrame.from_dict(sentences, orient='index')

print("Number of sentences:", len(sentencesDf))
print("Number of documents:", len(sentencesDf.document.unique()))
print("Number of speaker:", len(sentencesDf.speaker.unique()), sentencesDf.speaker.unique())

Number of sentences: 644
Number of documents: 35
Number of speaker: 7 [1 0 2 3 4 5 6]


In [7]:
sentencesDf.head()

index,speaker,document,sentence,sentWordCount
0,1,0,Yes.,1
1,1,0,Hello everyone.,2
2,1,0,And welcome to today's master class.,6
3,1,0,It's about how to develop ethical AI in busine...,14
4,1,0,"It was the mother young Maya, she's the CEO an...",14


We count the documents and sentences per speaker. Speakers 0 and 1 account for most of the conversation: 593 of the 644 sentences (about 92%).

In [None]:
sentencesDf.groupby(by='speaker').agg(documentQty=('document', 'nunique'), sentenceQty=('sentence', 'count'))

speaker,documentQty,sentenceQty
0,14,496
1,14,97
2,1,10
3,3,15
4,1,10
5,1,9
6,1,7


We count the sentences per document. Document sizes vary widely: the three largest documents (3, 16, and 8) contain 404 of the 644 sentences, while every other document contains 25 sentences or fewer.

In [17]:
sentencesDf.groupby(by='document').agg(sentenceQty=('sentence', 'count')).sort_values(by='sentenceQty', ascending=False)

document,sentenceQty
3,237
16,101
8,66
31,25
34,21
32,18
6,16
14,15
0,15
28,14


We now look at all words in the conversation with the following fields: word, frequency (number of occurrences), speakers (the speakers who used the word), speakerQty (the number of such speakers), and confidence (the mean confidence score of the word). A confidence score measures how confident the speech-to-text model is in a word it generated.

In [18]:
stopwords = modelSpacy.Defaults.stop_words

vocabulary = {}
for line in transcriptData["monologues"]:
    for element in line["elements"]:
        if element["type"]=="text":
            value = modelSpacy(element["value"].lower())
            for word in value:
                lemmaWord = word.lemma_
                if lemmaWord not in vocabulary:
                    vocabulary[lemmaWord] = {"frequency":1, "confidence":element["confidence"], "speakers":[line["speaker"]]}
                else:
                    vocabulary[lemmaWord]["frequency"]+=1
                    vocabulary[lemmaWord]["confidence"]+=element["confidence"]
                    if line["speaker"] not in vocabulary[lemmaWord]["speakers"]:
                        vocabulary[lemmaWord]["speakers"].append(line["speaker"])

for word in vocabulary:
    vocabulary[word]["confidence"] = vocabulary[word]["confidence"]/vocabulary[word]["frequency"]
    vocabulary[word]["speakerQty"] = len(vocabulary[word]["speakers"])

vocabularyDf = pd.DataFrame.from_dict(vocabulary, orient='index').reset_index().rename(columns={"index": "word"})

The most frequent spoken words (tokenized and lemmatized) are shown in the table below. Filler words (e.g., um, uh, yeah) have the highest frequencies. Note that we treat "-PRON-" (the lemma spaCy assigns to pronouns) as a stopword.

In [20]:
stopwords.add("-PRON-")
vocabularyDf[~vocabularyDf["word"].isin(stopwords)].sort_values(by=['frequency'], ascending=False).head(10)

index,word,frequency,confidence,speakers,speakerQty
134,um,125,0.93712,"[0, 2, 1, 3, 4, 5, 6]",7
129,uh,92,0.823261,"[0, 2, 3, 4, 5]",5
115,yeah,64,0.914375,"[0, 1, 2]",3
16,ai,61,0.992787,"[1, 0, 2, 3]",4
45,like,50,0.9672,"[1, 0, 2, 3, 5, 6]",6
128,think,50,0.991,"[0, 1, 4, 5]",4
182,kind,49,0.982653,"[0, 1]",2
236,system,43,0.98814,[0],1
222,solution,43,0.997907,"[0, 1]",2
559,value,42,0.972857,[0],1


# Topic modeling

Several traditional approaches exist for topic modeling: latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), random projection, and principal component analysis (PCA) \[[Albalawi et al., 2020](https://www.frontiersin.org/articles/10.3389/frai.2020.00042/full)\]. This notebook shows the results of the LDA method implemented in [Gensim](https://radimrehurek.com/gensim/) \(based on this [tutorial](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/)\).

Another approach is the variational autoencoder (VAE) \[[Srivastava and Sutton, 2017](https://arxiv.org/abs/1703.01488)\]. [Dieng et al. (2019)](https://arxiv.org/abs/1907.04907v1) propose the Embedded Topic Model (ETM) for large and heavy-tailed vocabularies. [Bianchi et al. (2021)](https://aclanthology.org/2021.eacl-main.143/) extend the VAE approach with transfer learning. A convenient framework that includes these approaches is [OCTIS](https://github.com/MIND-Lab/OCTIS). [Pineda et al. (2018)](https://arxiv.org/abs/1807.00938v2) use Min-Hashing (an approach widely used in collaborative filtering with large numbers of items and users); they report that their model "can handle massive text corpora and large vocabularies using modest computer hardware and does not require to fix the number of topics in advance".

In [None]:
!pip install --upgrade gensim

In [22]:
from gensim.corpora import Dictionary
from gensim.models import Phrases, LdaModel, CoherenceModel, nmf

def preprocessing(documents):
    # Lowercase and lemmatize each document, dropping stopwords and punctuation.
    documentsPreprocessed = []
    for document in documents:
        docIn = modelSpacy(document.lower())
        docOut = []
        for token in docIn:
            if token.lemma_ not in stopwords and not token.is_punct:
                docOut.append(token.lemma_)
        documentsPreprocessed.append(docOut)

    # Detect frequent bigrams (min_count=5) and append them to each document
    # alongside the original unigrams.
    bigram = Phrases(documentsPreprocessed, min_count=5)
    for idx in range(len(documentsPreprocessed)):
        for token in bigram[documentsPreprocessed[idx]]:
            if '_' in token:
                documentsPreprocessed[idx].append(token)

    return documentsPreprocessed

def computePerformance(ldaModel, corpus, documentsPreprocessed, dictionary):
    # log_perplexity returns the per-word likelihood bound (a negative number),
    # not the perplexity itself.
    print('Perplexity: ', ldaModel.log_perplexity(corpus))

    # Compute Coherence Score
    coherenceLdaModel = CoherenceModel(model=ldaModel, texts=documentsPreprocessed, dictionary=dictionary, coherence='c_v')
    coherenceLda = coherenceLdaModel.get_coherence()
    print('Coherence Score: ', coherenceLda)

    return ldaModel, documentsPreprocessed

def trainLDA(documents, num_topics = 10, chunksize = 2000, passes = 20, iterations = 100, eval_every = None):
    documentsPreprocessed = preprocessing(documents)
    
    dictionary = Dictionary(documentsPreprocessed)
    corpus = [dictionary.doc2bow(doc) for doc in documentsPreprocessed]

    print('Number of unique tokens:', len(dictionary))
    print('Number of documents:', len(corpus))

    # Build the index-to-word mapping. Accessing dictionary[0] forces Gensim
    # to populate id2token, which is otherwise filled lazily.
    temp = dictionary[0]
    id2word = dictionary.id2token

    ldaModel = LdaModel(
        corpus=corpus,
        id2word=id2word,
        chunksize=chunksize,
        alpha='auto',
        eta='auto',
        iterations=iterations,
        num_topics=num_topics,
        passes=passes,
        eval_every=eval_every
    )

    computePerformance(ldaModel, corpus, documentsPreprocessed, dictionary)

    return ldaModel

def printBigram(documentsPreprocessed):
    bigrams = []
    for doc in documentsPreprocessed:
        for token in doc:
            if len(token.split('_')) > 1 and token not in bigrams:
                bigrams.append(token)

    print(bigrams)

## Corpus based on Documents or Sentences?

We can build the corpus from documents or from sentences. Recall that a document is a part of the conversation spoken by one speaker without interruption, and a speaker can have more than one document. The results of both options are shown in the next cells.

In [25]:
# Based on Documents
print("Topics based on documents:")
documents = sentencesDf[['document','sentence']].groupby(by='document')['sentence'].apply(lambda x: ' '.join(x))
ldaModel = trainLDA(documents)
ldaModel.print_topics(num_topics=-1, num_words=5)

Topics based on documents:
Number of unique tokens: 1045
Number of documents: 35
Perplexity:  -6.5617880875509105
Coherence Score:  0.23608711616131087


[(0, '0.001*"uh" + 0.001*"um" + 0.001*"kind" + 0.001*"system" + 0.001*"yeah"'),
 (1,
  '0.040*"um" + 0.035*"example" + 0.035*"work" + 0.015*"uh" + 0.015*"good"'),
 (2, '0.040*"um" + 0.025*"uh" + 0.017*"ai" + 0.017*"yeah" + 0.015*"like"'),
 (3,
  '0.040*"thank" + 0.023*"know" + 0.023*"bye" + 0.023*"okay" + 0.017*"maybe"'),
 (4,
  '0.052*"think" + 0.026*"question" + 0.020*"yeah" + 0.020*"general" + 0.017*"good"'),
 (5,
  '0.025*"accountable" + 0.017*"decide" + 0.017*"employee" + 0.017*"foresee" + 0.009*"second"'),
 (6, '0.018*"um" + 0.011*"uh" + 0.011*"ai" + 0.011*"like" + 0.011*"business"'),
 (7,
  '0.032*"um" + 0.026*"know" + 0.019*"yeah" + 0.016*"think" + 0.016*"like"'),
 (8,
  '0.054*"uh" + 0.024*"think" + 0.018*"thank" + 0.012*"um" + 0.012*"question"'),
 (9,
  '0.024*"um" + 0.022*"uh" + 0.019*"ai" + 0.017*"solution" + 0.015*"value"')]

In [26]:
# Based on Sentences
print("Topics based on sentences:")
documentsBySentence = sentencesDf['sentence'].to_list()
ldaModel = trainLDA(documentsBySentence)
ldaModel.print_topics(num_topics=-1, num_words=5)

Topics based on sentences:
Number of unique tokens: 1045
Number of documents: 644
Perplexity:  -6.935174328913833
Coherence Score:  0.3939808402838773


[(0,
  '0.040*"um" + 0.039*"uh" + 0.019*"kind" + 0.016*"example" + 0.016*"learning"'),
 (1,
  '0.040*"machine" + 0.036*"machine_learning" + 0.034*"learning" + 0.033*"ai" + 0.021*"datum"'),
 (2, '0.039*"um" + 0.028*"like" + 0.027*"uh" + 0.018*"ai" + 0.014*"okay"'),
 (3,
  '0.027*"want" + 0.026*"bias" + 0.024*"thank" + 0.023*"solution" + 0.017*"ai"'),
 (4,
  '0.048*"um" + 0.031*"important" + 0.030*"value" + 0.022*"system" + 0.021*"uh"'),
 (5,
  '0.024*"security" + 0.023*"know" + 0.019*"mean" + 0.019*"important" + 0.016*"question"'),
 (6,
  '0.019*"value" + 0.016*"ai" + 0.016*"color" + 0.016*"term" + 0.015*"kind"'),
 (7,
  '0.035*"um" + 0.022*"usually" + 0.017*"yeah" + 0.017*"project" + 0.014*"kind"'),
 (8, '0.079*"yeah" + 0.033*"um" + 0.032*"think" + 0.028*"good" + 0.016*"uh"'),
 (9,
  '0.031*"uh" + 0.029*"solution" + 0.025*"um" + 0.023*"question" + 0.018*"think"')]

**Based on the coherence scores (0.236 for documents versus 0.394 for sentences), building the corpus from documents is not an appropriate approach for topic modeling here.**
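Note that the value printed as "Perplexity" in the two runs is the per-word likelihood bound returned by `log_perplexity`, not the perplexity itself. To our knowledge, Gensim derives its logged perplexity estimate as 2 raised to the negative bound. The sketch below applies that conversion to the two bounds printed in the outputs. The helper name `boundToPerplexity` is ours, and because the two corpora segment the same text differently, the comparison should be read as indicative only.

```python
# Per-word likelihood bounds copied from the two runs (documents vs. sentences).
bounds = {
    "documents": -6.5617880875509105,
    "sentences": -6.935174328913833,
}

def boundToPerplexity(perWordBound):
    """Convert a per-word likelihood bound (log base 2) to a perplexity estimate."""
    return 2 ** (-perWordBound)

perplexities = {name: boundToPerplexity(bound) for name, bound in bounds.items()}
for name, value in perplexities.items():
    print(f"{name}: perplexity estimate = {value:.1f}")
```

A lower perplexity estimate means a better fit to the training corpus, but it does not necessarily track topic quality, which is why the coherence score is used for the comparison here.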

## Problem with fillers

As the LDA results show, the topics include several fillers such as um, yeah, and uh, because these words are the most frequent. We therefore need to modify the preprocessing step. One possible solution is to keep only tokens tagged NOUN, PROPN (proper noun), ADV, or ADJ.

In [None]:
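# Part-of-speech tags to keep; everything else (including fillers) is dropped.
pos = ["NOUN", "PROPN", "ADV", "ADJ"]
def preprocessing(documents):
    # Lowercase and lemmatize each document, keeping only tokens with an allowed POS tag.
    documentsPreprocessed = []
    for document in documents:
        docIn = modelSpacy(document.lower())
        docOut = []
        for token in docIn:
            if token.lemma_ not in stopwords and not token.is_punct and token.pos_ in pos:
                docOut.append(token.lemma_)
        documentsPreprocessed.append(docOut)

    # Detect frequent bigrams (min_count=5) and append them to each document
    # alongside the original unigrams.
    bigram = Phrases(documentsPreprocessed, min_count=5)
    for idx in range(len(documentsPreprocessed)):
        for token in bigram[documentsPreprocessed[idx]]:
            if '_' in token:
                documentsPreprocessed[idx].append(token)

    return documentsPreprocessed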

In [None]:
documentsBySentence = sentencesDf['sentence'].to_list()
ldaModel = trainLDA(documentsBySentence)
print("Topics:")
ldaModel.print_topics(num_topics=-1, num_words=5)

Number of unique tokens: 803
Number of documents: 644
Perplexity:  -6.862700005709316
Coherence Score:  0.5115125855999766
Topics:


[(0,
  '0.064*"value" + 0.043*"good" + 0.027*"security" + 0.024*"kind" + 0.021*"basically"'),
 (1,
  '0.028*"maybe" + 0.021*"experience" + 0.021*"datum" + 0.021*"example" + 0.018*"woman"'),
 (2,
  '0.046*"usually" + 0.029*"lot" + 0.022*"time" + 0.017*"accountable" + 0.014*"important"'),
 (3,
  '0.074*"question" + 0.053*"kind" + 0.034*"ai" + 0.018*"hand" + 0.016*"project"'),
 (4,
  '0.039*"example" + 0.020*"company" + 0.017*"ethical" + 0.017*"aspect" + 0.017*"nice"'),
 (5,
  '0.050*"solution" + 0.026*"value" + 0.022*"ai" + 0.019*"kind" + 0.018*"general"'),
 (6,
  '0.061*"learning" + 0.045*"machine" + 0.037*"machine_learning" + 0.032*"kind" + 0.029*"big"'),
 (7,
  '0.028*"intelligence" + 0.027*"human" + 0.021*"solution" + 0.019*"actually" + 0.019*"development"'),
 (8,
  '0.038*"ai" + 0.028*"actually" + 0.021*"ethic" + 0.018*"business" + 0.016*"possible"'),
 (9,
  '0.052*"important" + 0.044*"bias" + 0.044*"system" + 0.033*"bit" + 0.025*"solution"')]

The coherence score improves by about 0.12, from 0.394 to 0.512.

## Ensemble learning in Gensim to generate stable topics

In [None]:
from gensim.models import EnsembleLda

documentsBySentence = sentencesDf['sentence'].to_list()
documentsPreprocessed = preprocessing(documentsBySentence)
dictionary = Dictionary(documentsPreprocessed)
corpus = [dictionary.doc2bow(doc) for doc in documentsPreprocessed]

num_topics = 10
chunksize = 2000
passes = 20
iterations = 100
epsilon = 0.5

temp = dictionary[0]
id2word = dictionary.id2token
num_models = 10

ensemble = EnsembleLda(
    corpus=corpus,
    id2word=id2word,
    num_topics=num_topics,
    passes=passes,
    epsilon=epsilon,
    num_models=num_models,
    topic_model_class='lda',
    iterations=iterations
)

print("Number of topics: ", len(ensemble.ttda))
print("Number of stable topics: ", len(ensemble.get_topics()))

Number of topics:  100
Number of stable topics:  3


In [None]:
ensemble.print_topics(num_topics=-1, num_words=10)

[(0,
  '0.025*"ai" + 0.015*"datum" + 0.014*"actually" + 0.012*"bias" + 0.012*"intelligence" + 0.011*"example" + 0.011*"solution" + 0.011*"lot" + 0.009*"course" + 0.009*"kind"'),
 (1,
  '0.044*"learning" + 0.033*"system" + 0.027*"deep" + 0.026*"machine" + 0.023*"deep_learning" + 0.020*"machine_learning" + 0.014*"kind" + 0.013*"ai" + 0.012*"model" + 0.011*"big"'),
 (2,
  '0.082*"question" + 0.029*"maybe" + 0.019*"lot" + 0.018*"solution" + 0.012*"actually" + 0.012*"example" + 0.010*"intelligence" + 0.010*"ethical" + 0.009*"good" + 0.007*"human"')]

## Confidence score of words in the topics

We look at the confidence scores of the words found in the stable topics produced by Gensim's EnsembleLda. As the following cell shows, the most frequent of these words all have mean confidence scores above 0.95.

In [None]:
import re

topics = []
for topic in ensemble.print_topics(num_topics=-1, num_words=20):
    for word in topic[1].split('+'):
        for token in re.search('[a-z_]+', word)[0].split('_'):
            if token not in topics:
                topics.append(token)

print(topics)

vocabularyDf[vocabularyDf["word"].isin(topics)].sort_values(by=['frequency'], ascending=False)

['ai', 'datum', 'actually', 'bias', 'intelligence', 'example', 'solution', 'lot', 'course', 'kind', 'time', 'important', 'business', 'specific', 'people', 'system', 'human', 'artificial', 'certain', 'big', 'learning', 'deep', 'machine', 'model', 'aspect', 'security', 'usually', 'good', 'way', 'question', 'maybe', 'ethical', 'hand', 'bit', 'ethic', 'practical', 'user', 'moment', 'technical']


index,word,frequency,confidence,speakers,speakerQty
16,ai,61,0.992787,"[1, 0, 2, 3]",4
182,kind,49,0.982653,"[0, 1]",2
236,system,43,0.98814,[0],1
222,solution,43,0.997907,"[0, 1]",2
200,actually,34,0.989118,"[0, 2, 5]",3
130,important,33,0.999091,"[0, 3]",2
143,question,32,0.985313,"[1, 0, 2, 4, 6]",5
109,datum,31,0.953226,"[1, 0, 3]",3
237,example,31,0.984194,"[0, 2, 1]",3
181,learning,30,1.0,[0],1


## Replacing words with low confidence scores in the output of the speech-to-text model

Several words have a low confidence score (e.g., lower than 0.9). We use a Transformer-based masked language model to re-fill the uncertain words of a sentence, with the aim of improving the similarity between this sentence and its neighbors.

In [None]:
!pip install transformers

In [None]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-cased')

In [None]:
sentencesProcessed = []
threshold = 0.90
for line in transcriptData["monologues"]:
    sentence = []
    confidence = []
    
    for element in line["elements"]:
        if element["type"] == 'punct' and element["value"] == '.':
            sentence.append(element['value'])
            confidence.append(1.0)
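            # The first candidate is always the original sentence.
            candidates = [" ".join(sentence)]
            for idx, conf in enumerate(confidence):
                if conf < threshold:
                    # Mask only the current low-confidence token.
                    maskedSentence = sentence.copy()
                    maskedSentence[idx] = "[MASK]"
                    predictions = unmasker(" ".join(maskedSentence))
                    # Keep the two most probable replacements.
                    for prediction in predictions[:2]:
                        candidates.append(prediction['sequence'])

            sentencesProcessed.append(candidates)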

            sentence = []
            confidence = []
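        elif element["type"] == 'punct':
            # Keep punctuation (except whitespace) with full confidence.
            if element["value"] != ' ':
                sentence.append(element['value'])
                confidence.append(1.0)
        elif element["type"] != 'unknown':
            # Keep recognized words with the confidence reported by the model.
            sentence.append(element['value'])
            confidence.append(element['confidence'])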

For each uncertain word (e.g., confidence score < 0.9) of a sentence, only this word is masked (i.e., the other words in the sentence are kept), and several replacements (e.g., the two most probable ones) are predicted by the language model. Several examples are displayed in the next cell; note that the first sentence in each list is the original sentence.

For short sentences, the replacements can be harder to predict, e.g., \['Yeah , I was .', 'Yeah, it was.', 'Yeah, he was.', 'Yeah, I thought.', 'Yeah, I said.']. One possible solution is to concatenate several neighboring sentences from the same document with the masked sentence.
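The masking strategy, combined with the neighbor-context idea, can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions: the function `buildMaskedInputs` and its `leftContext` and `rightContext` parameters are hypothetical, the confidence scores in the toy example are invented, and the resulting strings would still have to be passed to the fill-mask pipeline.

```python
def buildMaskedInputs(tokens, confidences, threshold=0.90,
                      leftContext="", rightContext="", maskToken="[MASK]"):
    """Return one (token index, masked input) pair per low-confidence token.

    Only one token is masked at a time; optional neighbor sentences are
    concatenated around the masked sentence to give the model more context.
    """
    maskedInputs = []
    for idx, conf in enumerate(confidences):
        if conf < threshold:
            masked = list(tokens)
            masked[idx] = maskToken
            parts = [leftContext, " ".join(masked), rightContext]
            maskedInputs.append((idx, " ".join(p for p in parts if p)))
    return maskedInputs

# Toy example with invented confidence scores.
tokens = ["Yeah", ",", "I", "was", "."]
confidences = [0.95, 1.0, 0.62, 0.97, 1.0]
maskedInputs = buildMaskedInputs(
    tokens, confidences,
    leftContext="Were you at the meeting ?",
)
for idx, text in maskedInputs:
    print(idx, text)
```

With added context, the `sequence` field returned by the pipeline would also contain the neighbor sentences. One option is to read the predicted word from the `token_str` field instead and substitute it back into the original sentence at the stored index.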

In [None]:
for idx, sent in enumerate(sentencesProcessed[:20]):
    if len(sent) > 1:
        print("Sentence %d:" % idx, sent)

Sentence 4: ["It was the mother young Maya , she's the CEO and founder of Skype form .", "she was the mother young Maya, she's the CEO and founder of Skype form.", "She was the mother young Maya, she's the CEO and founder of Skype form.", "It is the mother young Maya, she's the CEO and founder of Skype form.", "It : the mother young Maya, she's the CEO and founder of Skype form.", "It was her mother young Maya, she's the CEO and founder of Skype form.", "It was my mother young Maya, she's the CEO and founder of Skype form.", "It was the mother of Maya, she's the CEO and founder of Skype form.", "It was the mother for Maya, she's the CEO and founder of Skype form.", "It was the mother young lady, she's the CEO and founder of Skype form.", "It was the mother young entrepreneur, she's the CEO and founder of Skype form.", 'It was the mother young Maya, also the CEO and founder of Skype form.', 'It was the mother young Maya, later the CEO and founder of Skype form.', "It was the mother youn

### How to score these replacement sentences?

We consider the following direction: choose the replacement that maximizes the similarity between the sentence and its neighboring sentences in the same document.

In [None]:
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

Let's look at sentence 4:

In [None]:
print("Original    0:", sentencesProcessed[4][0])
for idx, sent in enumerate(sentencesProcessed[4][1:]):
    print("Replacement %d: " % (idx+1), sent)

Original    0: It was the mother young Maya , she's the CEO and founder of Skype form .
Replacement 1:  she was the mother young Maya, she's the CEO and founder of Skype form.
Replacement 2:  She was the mother young Maya, she's the CEO and founder of Skype form.
Replacement 3:  It is the mother young Maya, she's the CEO and founder of Skype form.
Replacement 4:  It : the mother young Maya, she's the CEO and founder of Skype form.
Replacement 5:  It was her mother young Maya, she's the CEO and founder of Skype form.
Replacement 6:  It was my mother young Maya, she's the CEO and founder of Skype form.
Replacement 7:  It was the mother of Maya, she's the CEO and founder of Skype form.
Replacement 8:  It was the mother for Maya, she's the CEO and founder of Skype form.
Replacement 9:  It was the mother young lady, she's the CEO and founder of Skype form.
Replacement 10:  It was the mother young entrepreneur, she's the CEO and founder of Skype form.
Replacement 11:  It was the mother young

The neighbors of sentence 4 are shown in the next cell. Note that we use the original neighbor sentences.

In [None]:
neighborsName = []
neighbors = []
for idx, sent in enumerate(sentencesProcessed[0:10]):
    if idx != 4:
        print("Sentence %d:" % idx, sent)
        neighbors.append(sent[0])
        neighborsName.append("Sent %d" % idx)

Sentence 0: ['Yes .']
Sentence 1: ['Hello everyone .']
Sentence 2: ["And welcome to today's master class ."]
Sentence 3: ["It's about how to develop ethical AI in business and what come with me ."]
Sentence 5: ['And she basically sits in Zurich .']
Sentence 6: ["And for all of you who really want to know a little bit more about who she is , I'm just like quoting her from her website .", "And for all of you who really want to know a little bit more about who she is, I'm just like quoting her from her website.", "And for all of you who really need to know a little bit more about who she is, I'm just like quoting her from her website."]
Sentence 7: ["She's a computer national science scientist engineer , and then award-winning serial entrepreneur .", 'became a computer national science scientist engineer, and then award - winning serial entrepreneur.', 'was a computer national science scientist engineer, and then award - winning serial entrepreneur.', "She's a computer and science scienti

In [None]:
sentenceEmb = model.encode(sentencesProcessed[4])
neighborsEmb = model.encode(neighbors)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

simDf = pd.DataFrame(cosine_similarity(sentenceEmb,neighborsEmb), columns =neighborsName)
simDf['sumSimilarity'] = simDf[list(simDf.columns)].sum(axis=1)
simDf

candidate,Sent 0,Sent 1,Sent 2,Sent 3,Sent 5,Sent 6,Sent 7,Sent 8,Sent 9,sumSimilarity
0,0.163223,0.255804,0.455655,0.490742,0.465042,0.455459,0.718225,0.393399,0.444212,3.84176
1,0.171276,0.26325,0.46047,0.483422,0.453373,0.457857,0.71631,0.407703,0.451926,3.865587
2,0.171276,0.26325,0.46047,0.483422,0.453373,0.457857,0.71631,0.407703,0.451926,3.865587
3,0.167195,0.261912,0.459853,0.494875,0.461442,0.459008,0.716117,0.39699,0.448645,3.866036
4,0.122391,0.22366,0.441149,0.48562,0.446049,0.463232,0.724966,0.387394,0.441749,3.736209
5,0.114416,0.20378,0.434559,0.465304,0.454964,0.456837,0.712915,0.376906,0.426432,3.646114
6,0.132762,0.219291,0.431743,0.470328,0.467449,0.452055,0.720607,0.37516,0.430295,3.69969
7,0.219586,0.295831,0.456679,0.479233,0.488567,0.455948,0.712811,0.397159,0.446205,3.952019
8,0.173093,0.257805,0.459738,0.507308,0.482808,0.470137,0.717014,0.402475,0.454485,3.924863
9,0.166825,0.2577,0.430422,0.487051,0.459299,0.478369,0.720439,0.415312,0.459218,3.874634


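This suggests that replacement 7, "It was the mother of Maya, she's the CEO and founder of Skype form.", is the most similar to the rest of the document: among the candidates shown, it has the highest summed similarity (3.952, versus 3.842 for the original sentence).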
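The selection step can be automated with a small helper. The sketch below is only an illustration: the function `pickReplacement` and its `minGain` margin are our own assumptions, not part of the notebook's pipeline. It takes the summed similarities from the table and returns the best candidate, falling back to the original sentence (index 0) when no replacement improves on it by more than the margin.

```python
# Values copied from the sumSimilarity column; index 0 is the original sentence.
sumSimilarity = [3.84176, 3.865587, 3.865587, 3.866036, 3.736209,
                 3.646114, 3.69969, 3.952019, 3.924863, 3.874634]

def pickReplacement(scores, minGain=0.0):
    """Return the index of the best-scoring candidate, or 0 (the original)
    if no replacement beats the original by more than minGain."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if best != 0 and scores[best] - scores[0] > minGain:
        return best
    return 0

best = pickReplacement(sumSimilarity)
gain = sumSimilarity[best] - sumSimilarity[0]
print("Best candidate:", best, "gain over original: %.3f" % gain)
```

A positive `minGain` makes the procedure more conservative, which may be useful because the differences between candidates are small compared with the summed scores.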

There are other methods for computing the similarity between two sentences, such as considering different semantic granularities (e.g., the approach of [Liu, 2018](https://arxiv.org/abs/1803.00179)). Several approaches can be found on [Papers with Code](https://paperswithcode.com/task/semantic-textual-similarity).