# Get Cosine Similarity

tl;dr: A script for computing the semantic similarity between two words (a target word and the participant's memory or guess of it) using the pretrained GloVe word embedding glove.6B.100d.

#### What problem does this notebook help you solve?

In psychological research, participants are often asked to recall a previously seen word or to guess a word from some cues (e.g., watching a muted video in which the word is discussed). If the response matches the correct answer, it is scored as correct. If the response is incorrect, however, we sometimes want to know how incorrect it is, so that we can score responses more finely. For example, if the answer is "apple," we would give the trial a higher score when the response is "pear" than when it is "kettle," since "pear" is more similar to "apple" than "kettle" is. To achieve this, we can use a semantic similarity measure such as the cosine distance between the vectors that represent the two words in a word embedding. In this tutorial, we use a pretrained GloVe embedding.
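To make the measure concrete: the cosine distance between two vectors is one minus the cosine of the angle between them, so it is 0 for vectors that point the same way and grows as they diverge. The sketch below computes it in plain Python. The three-dimensional vectors are made-up toy values, not real GloVe vectors, and `cosine_distance` is a hypothetical helper name.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# toy vectors (assumed values, for illustration only)
apple = [0.9, 0.8, 0.1]
pear = [0.8, 0.9, 0.2]
kettle = [0.1, 0.2, 0.9]

print("apple vs pear:  ", round(cosine_distance(apple, pear), 4))
print("apple vs kettle:", round(cosine_distance(apple, kettle), 4))
```

With these toy values the apple/pear distance comes out smaller than the apple/kettle distance, which is the pattern we want a graded score to reflect.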

I used the pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) word embedding glove.6B.100d, which should be downloaded from https://nlp.stanford.edu/projects/glove/ and saved in the trained_glove folder. This embedding was trained on 6 billion tokens, has a vocabulary of 400,000 words, and has 100 dimensions. You can also use a different pretrained GloVe embedding, e.g., glove.6B.300d with 300 dimensions.


Before computing the cosine distance, participants' responses are autocorrected using [symspellpy](https://symspellpy.readthedocs.io/en/latest/examples/lookup.html#basic-usage), and both the participants' responses and the answers are lemmatized using [Stanza](https://stanfordnlp.github.io/stanza/lemma.html).


Required packages: 
- pandas 
- gensim
- symspellpy 
- stanza

In [2]:
import pandas as pd
import os

#### Set up for cosine similarity

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

In [7]:
# make an absolute path to training set from relative path
root_folder='.'
data_folder_name='trained_glove'
glove_filename='glove.6B.100d.txt' # edit as required
DATA_PATH = os.path.abspath(os.path.join(root_folder, data_folder_name))
glove_path = os.path.abspath(os.path.join(DATA_PATH, glove_filename))

In [9]:
# convert the GloVe-format file at `glove_path` to word2vec format and write it to `word2vec_output_file`
word2vec_output_file = glove_filename+'.word2vec'
glove2word2vec(glove_path, word2vec_output_file)


(400000, 100)
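The `(400000, 100)` returned by the conversion is the number of vectors and their dimensionality. As far as I understand it, the two text formats differ only in that a word2vec file starts with a header line holding exactly these two numbers, while a GloVe file does not. The sketch below derives such a header from GloVe-style lines. `glove_header` and the three sample lines are hypothetical and only illustrate the idea; this is not gensim's implementation.

```python
def glove_header(lines):
    """Count vectors and dimensions in GloVe-style text lines.

    Each line is assumed to look like: word v1 v2 ... vN
    """
    n_vectors = 0
    n_dims = None
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        dims = len(parts) - 1  # the first field is the word itself
        if n_dims is None:
            n_dims = dims
        elif dims != n_dims:
            raise ValueError(f"inconsistent dimensions for {parts[0]!r}")
        n_vectors += 1
    return n_vectors, n_dims

# made-up sample in GloVe text layout (not real embedding values)
sample = [
    "the 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
    "dog 0.7 0.8 0.9",
]
n_vectors, n_dims = glove_header(sample)
# the header line a word2vec text file would start with
print(f"{n_vectors} {n_dims}")
```

In gensim 4.0 and later, `KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)` should be able to read the GloVe file directly, which would make the conversion step unnecessary; check this against your installed version.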

In [10]:
# load KeyedVectors from the word2vec-format text file written above (text, hence binary=False)
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

In [11]:
# test if working 
print("distance between potato & garlic = " + str(model.distance('potato','garlic')))
print("distance between onion & garlic = " + str(model.distance('onion','garlic')))


distance between potato & garlic = 0.4077409505844116
distance between onion & garlic = 0.13924723863601685


Garlic is closer to onion than it is to potato, which makes sense.

#### Set up for autocorrect

In [12]:
import pkg_resources
from symspellpy import SymSpell, Verbosity

In [14]:
# set up how SymSpell functions 
# max_dictionary_edit_distance: Maximum edit distance for doing lookups.
# prefix_length: The length of word prefixes used for spell checking.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# load dictionary to spell checker. 
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True

In [19]:
# see if it works! 
input_terms = [
    "memebers",  # misspelling of "members"
    "chatalainee", # misspelling of "chatelaine"
    "papple"
]
for input_term in input_terms:
    print("input: ")
    print(input_term)
    suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, max_edit_distance=2)
    print("suggestions: ")
    for suggestion in suggestions:
        print(suggestion)


input: 
memebers
suggestions: 
members, 1, 226656153
input: 
chatalainee
suggestions: 
chatelaine, 2, 86807
input: 
papple
suggestions: 
apple, 1, 50551171
dapple, 1, 62176
popple, 1, 34253
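The second field of each suggestion is the edit distance between the input and the suggested term (the third is the term's frequency count in the dictionary). The sketch below computes one common variant, the optimal string alignment distance (insertions, deletions, substitutions, and swaps of adjacent letters), to show where those numbers come from. `osa_distance` is a hypothetical helper written for illustration; it is not symspellpy's code, and I am assuming a Damerau-Levenshtein-style distance matches what the library reports for these examples.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]

pairs = [("memebers", "members"), ("chatalainee", "chatelaine"), ("papple", "apple")]
for typed, suggested in pairs:
    print(typed, "->", suggested, osa_distance(typed, suggested))
```

Because "papple" is one edit away from "apple", "dapple", and "popple" alike, `Verbosity.CLOSEST` returns all three, ordered by frequency; the analysis later in this notebook takes the first (most frequent) one.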


#### Set up lemmatizer

In [20]:
# set up Stanza for lemma 
import stanza
stanza.download('en') # only need to run this the first time 
stanza_nlp = stanza.Pipeline('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 55.3MB/s]                    
2022-12-12 11:07:23 INFO: Downloading default packages for language: en (English) ...
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.1/models/default.zip: 100%|██████████| 561M/561M [00:15<00:00, 36.7MB/s] 
2022-12-12 11:07:48 INFO: Finished downloading models and saved to C:\Users\n\stanza_resources.
2022-12-12 11:07:48 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 48.3MB/s]                    
2022-12-12 11:07:50 INFO: Loading these models for language: en (English):
| Processor    | Package   |
--------------

In [32]:
# see if it works!
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')

word2lemma = 'ran' 
doc = nlp(word2lemma)
lemmad_word = doc.sentences[0].words[0].lemma
print('the lemma of ['+word2lemma+'] is ['+lemmad_word+']')

2022-12-12 11:16:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 64.4MB/s]                    
2022-12-12 11:16:32 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| pos       | combined |
| lemma     | combined |

2022-12-12 11:16:32 INFO: Use device: cpu
2022-12-12 11:16:32 INFO: Loading: tokenize
2022-12-12 11:16:32 INFO: Loading: pos
2022-12-12 11:16:33 INFO: Loading: lemma
2022-12-12 11:16:33 INFO: Done loading processors!


the lemma of [ran] is [run]


#### Set up utility dictionaries

In the study this code was written for, we wanted to give a full score to certain responses that were either typos or synonyms of the answer. answerkey maps each answer to these accepted responses.

In [33]:
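# each answer maps to a list of responses that also earn a full score;
# every value is a list so that `in` tests list membership, not substrings
# (with a plain string, 'a' in 'leak' would be True)
answerkey = {
  'toucan': ['tucan'],
  'castanets': ['castagnette'],
  'binoculars': ['lorgnette'],
  'scissors': ['scizor','scisor'],
  'leek': ['leak'],
  'kazoo': ['cazoo', 'kazu'],
  'pan' : ['frying pan'],
  'anteater' : ['ant eater'],
  'pliers' : ['pliers','plier']
}

answer = 'kazoo'
response = 'kazu'
if answer in answerkey:
    good_enough_answer = answerkey[answer]
    if response in good_enough_answer:
        # in the full analysis such a response is scored as correct
        print('the answer is good enough')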

the answer is good enough
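If the dictionary values are a mix of plain strings and lists, which can easily happen when entries are typed by hand, Python's `in` behaves differently for the two: on a list it tests membership, on a string it tests for a substring. The sketch below wraps the lookup in a small function that treats both cases the same way. `is_good_enough` and `answerkey_excerpt` are hypothetical names used only for this illustration.

```python
def is_good_enough(answer, response, answerkey):
    """Return True if `response` is a manually accepted variant of `answer`."""
    accepted = answerkey.get(answer, [])
    if isinstance(accepted, str):
        accepted = [accepted]  # a lone string becomes a one-item list
    return response in accepted

# hypothetical excerpt that mixes a string value and a list value
answerkey_excerpt = {
    'leek': 'leak',
    'kazoo': ['cazoo', 'kazu'],
}

print(is_good_enough('kazoo', 'kazu', answerkey_excerpt))   # accepted variant
print(is_good_enough('leek', 'a', answerkey_excerpt))       # substring of 'leak', not accepted
print(is_good_enough('apple', 'pear', answerkey_excerpt))   # answer not in the key
```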


We also used a list of responses that many participants gave to say they didn't remember or couldn't guess. Matching against this list was preferred, for speed, over letting the code generate a large distance value.

In [34]:
no_answers = ['?','??','=','can,t guess', "can't answer","can't guess","cannot guess",
"canot guess","cant guess","cant tell", "cant' guess", "don't know","don't know again",
"don't know again","don't know at all I'm afraid", "don't know here, having trouble with the English accents in these videos",
"dont know","no beep","no clue","no guess","no idea","no idea on that one","no selection",
"no word","not sure","not sure here at all","nan"]

for i in range(len(no_answers)):
    no_answers[i] = no_answers[i].lower()
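Exact matching against this list is sensitive to small differences such as extra spaces or curly apostrophes. One possible way to make it more forgiving, which is an assumption on my part and not part of the original analysis, is to normalize each response before the lookup. `normalize_response` and `no_answers_excerpt` are hypothetical names.

```python
def normalize_response(text):
    """Lower-case, straighten curly apostrophes, and collapse whitespace."""
    text = str(text).lower().replace("\u2019", "'")
    return " ".join(text.split())

# small subset of the list, stored as a set for fast membership tests
no_answers_excerpt = {"don't know", "no idea", "not sure"}

for raw in ["  Don\u2019t  know ", "No idea", "pear"]:
    cleaned = normalize_response(raw)
    print(repr(cleaned), cleaned in no_answers_excerpt)
```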


#### Analyse the responses  

In [55]:
data = pd.read_csv("data.csv",index_col=False)

answer_response = data[['ANSWER', 'Response']].values.tolist()  # turn df to list
# answer - what the correct answer to the task is
# response - what the participant remembered and typed in

outputs = [] # to store autocorrected, lemmatized, cosine_dist, & correct
error_n = 0

for answer, response in answer_response:

    # what the participant has written (in lower case)
    response = str(response).lower()

    # answers we manually decided are good enough
    if answer in answerkey:
        good_enough_answer = answerkey[answer]
        # wrap a single string in a list so that `in` tests list membership
        # instead of substring containment (e.g. 'a' in 'leak' is True)
        if isinstance(good_enough_answer, str):
            good_enough_answer = [good_enough_answer]
        if response in good_enough_answer:
            outputs.append([[],[],[],1]) # no autocorrect, no lemma, no cos dist, correct = 1
            continue

    # responses that amount to "I don't know/remember", e.g. "can't guess"
    if response in no_answers:
        outputs.append([[],[],[],0]) # no autocorrect, no lemma, no cos dist, correct = 0
        continue

    # lemmatize the answer
    answer_lemma = nlp(answer).sentences[0].words[0].lemma
    word_correct = 0
    # substring match, so e.g. the response "sausages" counts as correct for the answer "sausage"
    if (answer_lemma in response) or (response in answer):
        outputs.append([[],[],[],1]) # no autocorrect, no lemma, no cos dist, correct = 1
        continue

    # for some words autocorrect fails (no suggestion is found) and/or the word
    # does not exist in the vector space; these are counted and skipped
    try:
        # autocorrect: take the first (closest, most frequent) suggestion
        response_corrected = sym_spell.lookup(
            response, Verbosity.CLOSEST, max_edit_distance=2)[0].term

        # lemmatizer
        response_lematize = nlp(response_corrected).sentences[0].words[0].lemma

        # if the corrected, lemmatized response matches the answer lemma, the response is correct
        word_correct = 0
        if (response_lematize in answer_lemma):
            word_correct = 1

        # cosine distance rounded to 4 decimal places
        word_cosine = model.distance(answer_lemma, response_lematize).round(4)

        outputs.append([response_corrected,response_lematize,word_cosine,word_correct]) # autocorrect, lemma, cos dist, correct

    except Exception:  # e.g. IndexError (no suggestion) or KeyError (word not in the embedding)
        error_n += 1
        outputs.append([[],[],[],word_correct]) # no autocorrect, no lemma, no cos dist; correct keeps its current value


print(f"there were {error_n} errors")

outputs_df = pd.DataFrame(outputs,columns=['autocorrected', 'lemmatized','cosine_dist','correct'])

df = data.join(outputs_df)
df.head(20)


there were 3 errors


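| Unnamed: 0 | ANSWER | Response | autocorrected | lemmatized | cosine_dist | correct |
|---|---|---|---|---|---|---|
| 0 | xylophone | xylophone | [] | [] | [] | 1 |
| 1 | peacock | horses | horses | horse | 0.678 | 0 |
| 2 | spade | showel | showed | show | 0.7993 | 0 |
| 3 | pot | bowl | bowl | bowl | 0.5847 | 0 |
| 4 | sausage | burger | burger | burger | 0.5398 | 0 |
| 5 | apple | pear | pear | pear | 0.411 | 0 |
| 6 | mushroom | banana | banana | banana | 0.6099 | 0 |
| 7 | sausage | sausages | [] | [] | [] | 1 |
| 8 | parrot | bird | bird | bird | 0.4826 | 0 |
| 9 | cat | dog | dog | dog | 0.1202 | 0 |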


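As can be seen in the output above, the cosine distance is smaller for semantically similar words (e.g., cat & dog, 0.1202) than for words that are less semantically similar (e.g., peacock & horse, 0.678). Note also that autocorrect can change a response's meaning: "showel" was corrected to "showed" rather than "shovel", which inflates the distance for that trial.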
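One way to turn these columns into a single graded score, offered here as an assumption rather than the scoring used in the study, is to give correct responses a score of 1 and every other response `1 - cosine_dist`, clipped to the range 0 to 1. `graded_score` is a hypothetical helper; rows without a distance (shown as `[]` in the output) are given 0. The example rows reuse values from the table.

```python
import math

def graded_score(correct, cosine_dist):
    """Map a (correct, cosine_dist) pair to a score between 0 and 1."""
    if correct == 1:
        return 1.0
    try:
        dist = float(cosine_dist)  # also accepts NumPy floats
    except (TypeError, ValueError):
        return 0.0  # no distance available, e.g. an empty list
    if math.isnan(dist):
        return 0.0
    # cosine distance can range from 0 to 2, so clip the result
    return round(min(1.0, max(0.0, 1.0 - dist)), 4)

# (answer, response, cosine_dist, correct), taken from the table
rows = [
    ("xylophone", "xylophone", [], 1),
    ("apple", "pear", 0.411, 0),
    ("cat", "dog", 0.1202, 0),
    ("spade", "showel", 0.7993, 0),
]
for answer, response, dist, correct in rows:
    print(answer, response, graded_score(correct, dist))
```

Whether a linear mapping like this is appropriate depends on the study; it is only meant to show how the `correct` and `cosine_dist` columns could be combined.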