The file 'test_sent.txt' is tied to the notebook runtime and is deleted when the runtime terminates. This file, along with the results of this analysis ('BERT-similarities.txt' and 'GloVe-similarities.txt'), can be found on Canvas.

Using the data obtained in 'Final-Word-Embeddings-Sinha.ipynb', we test the similarity between 'words' that apparently share the same definition. For instance, 'smh' and 'shaking my head' might superficially mean the same thing; however, the analysis below suggests that they do not carry the same connotations.

To examine the contextual meanings of these words, acronyms, and phrases, we use test sentences from 'tumblr.txt'. The sentences in each pair are identical except that one apparent synonym is swapped for another, which lets us test how much the substitution changes the meaning of the sentence. We consider two pre-trained models: BERT (for contextual embeddings) and GloVe (for non-contextual embeddings). The two models approach the problem differently. Their scores are close for the longer sentences, but in the pairs printed below GloVe's scores are lower than BERT's in all but one case, and much lower for the short 'smh' sentences.
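To make the non-contextual case concrete, here is a minimal sketch of a sentence embedding built by averaging word vectors, which is the general idea behind an averaged GloVe model. The 3-dimensional vectors are invented for illustration (they are not real GloVe values), and `embed` and `cosine` are hypothetical helpers, not part of any library used in this notebook.

```python
import math

# Toy 3-d word vectors, invented for illustration (NOT real GloVe values).
toy_vectors = {
    "smh": [0.9, 0.1, 0.0],
    "shaking": [0.1, 0.8, 0.2],
    "my": [0.2, 0.2, 0.9],
    "head": [0.0, 0.9, 0.3],
    "the": [0.3, 0.3, 0.3],
    "disrespect": [0.7, 0.2, 0.1],
}

def embed(sentence):
    """Average the toy word vectors of a whitespace-tokenized sentence."""
    vecs = [toy_vectors[word] for word in sentence.split()]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Short pair: the swapped words make up most of each sentence.
short_pair = cosine(embed("smh the disrespect"),
                    embed("shaking my head the disrespect"))

# Longer pair: extra shared words are added to both sentences.
long_pair = cosine(embed("smh the the the the the disrespect"),
                   embed("shaking my head the the the the the disrespect"))

print(f"short pair: {short_pair:.4f}")
print(f"long pair:  {long_pair:.4f}")
```

In this toy setup the longer pair scores higher than the short one, because the shared words dominate the average. This may be one reason a substitution moves the score more in a short sentence, although real GloVe vectors will behave differently in detail.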

The goal of this exercise is to highlight a pertinent problem in translation, in other areas of natural language processing, and even in manual linguistic analysis: connotations. Because of connotations, the same word (or, equivalently, its synonyms) can convey different meanings depending on the context. The findings are discussed in greater detail in the final paper.


In [1]:
# Install sentence_transformers
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.98-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence_transformers)
  Downloading huggingface_hub-0.14.1-py3-n

In [10]:
# Import useful libraries
from sentence_transformers import SentenceTransformer, util
import re
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# The following code is from Practicum 3
# Takes in a string of text and returns a list of cleaned tokens
def clean_text(text):
  tokens = nltk.word_tokenize(text)
  pattern = re.compile("[a-zA-Z0-9']")
  clean_tokens = [tok.lower() for tok in tokens if pattern.match(tok)]
  return clean_tokens
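
The filtering step in `clean_text` can be tried on its own, without the tokenizer. The sketch below uses the same pattern; the token list is written by hand for illustration and is only an assumption about what the tokenizer might emit.

```python
import re

# Same pattern as in clean_text: a token is kept if its FIRST character is
# a letter, a digit, or an apostrophe (re.match only tests the start).
pattern = re.compile("[a-zA-Z0-9']")

# Hand-written tokens for illustration (not produced by nltk here).
tokens = ["SMH", ",", "wo", "n't", "'help", "'", "!", "14"]

# Keep the matching tokens and lowercase them, as clean_text does.
kept = [tok.lower() for tok in tokens if pattern.match(tok)]
print(kept)  # ['smh', 'wo', "n't", "'help", "'", '14']
```

Because a lone apostrophe and split contractions pass the filter, joining the kept tokens with spaces is consistent with fragments such as `'help '` and `wo n't` in the printed sentences further down.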

In [4]:
# Reads the contents of the file 'test_sent.txt'
with open('test_sent.txt') as f:
  text = f.read()

# Splits the text into sentences, one per line, skipping blank lines
# (e.g. the empty string produced by a trailing newline).
sents = [line for line in text.split('\n') if line.strip()]

# Splits the sentences into two lists. Corresponding indices of the lists are to
# be compared below.
sent_1 = []
sent_2 = []

# Alternately places the cleaned test sentences in each of the lists.
for i, sent in enumerate(sents):
  cleaned = ' '.join(clean_text(sent))
  if i % 2 == 0:
    sent_1.append(cleaned)
  else:
    sent_2.append(cleaned)
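
The alternating split can also be written with slices, with a check that every sentence has a partner. This is a sketch only: the `lines` list is a hypothetical stand-in for the contents of 'test_sent.txt', and the names `originals`, `variants`, and `pairs` are not used elsewhere in the notebook.

```python
# Hypothetical file contents: each sentence is followed by its variant.
lines = [
    "smh the disrespect",
    "shaking my head the disrespect",
    "lol haha",
    "lol lmao",
    "",  # e.g. produced by a trailing newline at the end of the file
]

# Drop blank lines so a trailing newline cannot unbalance the two lists.
lines = [line for line in lines if line.strip()]

originals = lines[0::2]  # lines at even positions
variants = lines[1::2]   # lines at odd positions

# Every sentence needs a partner for the index-by-index comparison.
if len(originals) != len(variants):
    raise ValueError("expected an even number of non-empty lines")

pairs = list(zip(originals, variants))
for first, second in pairs:
    print(first, "|", second)
```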

In [6]:
# Loads the pre-trained BERT model.
bert_model = SentenceTransformer('stsb-bert-large')


In [9]:
# The following code is adapted from:
# https://www.sbert.net/docs/usage/semantic_textual_similarity.html
# Creates sentence embeddings for the test sentences.
bert_embeddings1 = bert_model.encode(sent_1, convert_to_tensor=True)
bert_embeddings2 = bert_model.encode(sent_2, convert_to_tensor=True)

# Computes cosine-similarities.
cosine_scores = util.cos_sim(bert_embeddings1, bert_embeddings2)

# Prints the pairs and their scores.
for i in range(len(sent_1)):
    print("{}\n{}\nScore: {:.4f}\n".format(sent_1[i], sent_2[i], cosine_scores[i][i]))

rethinking my policy on sharing 'help ' posts now meanwhile my leg is literally snapped in half smh
rethinking my policy on sharing 'help ' posts now meanwhile my leg is literally snapped in half shaking my head
Score: 0.9463

smh the disrespect
shaking my head the disrespect
Score: 0.8274

smh i have so much to say but i just wo n't i 'll just unfollow for now
shaking my head i have so much to say but i just wo n't i 'll just unfollow for now
Score: 0.9081

smh why am i like this
shaking my head why am i like this
Score: 0.7687

lol haha
lol lmao
Score: 0.8482

haha lol
haha lmao
Score: 0.7877

alright im done being sappy now haha
alright enough being sappy now haha
Score: 0.8967

maybe leon should smoke weed instead
maybe leon should smoke marijuana instead
Score: 0.9564

like wow i met you on livejournal when i was 14 hi girlie they wouldnt recognize me and idk if theyd remember me if i asked so here we are
like wow i met you on livejournal when i was 14 hi girlie they wouldnt recog
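
`util.cos_sim` compares every sentence in the first list with every sentence in the second and returns the full matrix of scores, which is why the loop above reads only the diagonal entries `cosine_scores[i][i]`. The pure-Python sketch below mirrors that behavior with made-up 2-d vectors; `cosine` and `cos_sim_matrix` are hypothetical helpers, not library functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cos_sim_matrix(a, b):
    """Entry [i][j] is the cosine similarity between a[i] and b[j]."""
    return [[cosine(u, v) for v in b] for u in a]

# Made-up 2-d "embeddings" for two sentence pairs (illustration only).
emb_1 = [[1.0, 0.0], [0.0, 1.0]]
emb_2 = [[1.0, 1.0], [0.0, 2.0]]

scores = cos_sim_matrix(emb_1, emb_2)

# Only the diagonal compares sentence i with its own variant.
paired = [scores[i][i] for i in range(len(emb_1))]
print([round(score, 4) for score in paired])  # [0.7071, 1.0]
```

The off-diagonal entries compare unrelated sentences, so they are computed but never printed.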

In [26]:
# Loads the pre-trained GloVe model.
glove_model = SentenceTransformer('average_word_embeddings_glove.6B.300d')


In [27]:
# The following code is adapted from:
# https://www.sbert.net/docs/usage/semantic_textual_similarity.html
# Creates sentence embeddings for the test sentences.
glove_embeddings1 = glove_model.encode(sent_1, convert_to_tensor=True)
glove_embeddings2 = glove_model.encode(sent_2, convert_to_tensor=True)

# Computes cosine-similarities.
cosine_scores = util.cos_sim(glove_embeddings1, glove_embeddings2)

# Prints the pairs and their scores.
for i in range(len(sent_1)):
    print("{}\n{}\nScore: {:.4f}\n".format(sent_1[i], sent_2[i], cosine_scores[i][i]))

rethinking my policy on sharing 'help ' posts now meanwhile my leg is literally snapped in half smh
rethinking my policy on sharing 'help ' posts now meanwhile my leg is literally snapped in half shaking my head
Score: 0.9186

smh the disrespect
shaking my head the disrespect
Score: 0.3602

smh i have so much to say but i just wo n't i 'll just unfollow for now
shaking my head i have so much to say but i just wo n't i 'll just unfollow for now
Score: 0.8602

smh why am i like this
shaking my head why am i like this
Score: 0.3378

lol haha
lol lmao
Score: 0.7924

haha lol
haha lmao
Score: 0.7893

alright im done being sappy now haha
alright enough being sappy now haha
Score: 0.8801

maybe leon should smoke weed instead
maybe leon should smoke marijuana instead
Score: 0.8807

like wow i met you on livejournal when i was 14 hi girlie they wouldnt recognize me and idk if theyd remember me if i asked so here we are
like wow i met you on livejournal when i was 14 hi girlie they wouldnt recog
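
The scores for the eight complete pairs printed in the two outputs above can be put side by side. In the sketch below the two lists are copied by hand from those outputs (the truncated ninth pair is left out), and the variable names are chosen for illustration.

```python
# Scores copied by hand from the two outputs above, in printed order.
bert_scores = [0.9463, 0.8274, 0.9081, 0.7687, 0.8482, 0.7877, 0.8967, 0.9564]
glove_scores = [0.9186, 0.3602, 0.8602, 0.3378, 0.7924, 0.7893, 0.8801, 0.8807]

# Positive gap: BERT rates the pair as more similar than GloVe does.
gaps = [round(b - g, 4) for b, g in zip(bert_scores, glove_scores)]

# Number of pairs where GloVe scores lower than BERT.
glove_lower = sum(1 for gap in gaps if gap > 0)

# Index of the pair on which the two models disagree the most.
largest_gap_index = max(range(len(gaps)), key=lambda i: abs(gaps[i]))

print(gaps)
print(glove_lower, "of", len(gaps), "pairs score lower under GloVe")
print("largest gap at pair index", largest_gap_index)
```

Index 1 corresponds to the pair 'smh the disrespect' / 'shaking my head the disrespect', where the GloVe score is 0.3602 against 0.8274 for BERT.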