# Generating incorrect answer suggestions
Using word embeddings, I'm going to find the words most similar to an answer and then order them by how relevant they are.

## Importing the word embeddings
Unfortunately our beloved *spacy* does not offer most similar words. We'll use **gensim** for that.

In [1]:
import gensim
#model = gensim.models.KeyedVectors.load_word2vec_format('data/embeddings/GoogleNews-vectors-negative300.bin', binary=True)



The word2vec dataset (*3.39GB*) bricked my laptop (*twice*). Seems like a smaller pretrained embedding should suffice.

In [8]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
# Raw strings keep the Windows backslashes literal instead of treating them as escape sequences
glove_file = datapath(r'D:\ML\Question-Generation\05.Generating-incorrect-answers\embeddings\glove.6B.300d.txt')
tmp_file = get_tmpfile(r'D:\ML\Question-Generation\05.Generating-incorrect-answers\embeddings\word2vec-glove.6B.300d.txt')

# call glove2word2vec script
# default way (through CLI): python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_file, tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)
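
For context, the conversion step is small. As far as I understand it, the word2vec text format is the GloVe text format plus one header line holding the number of vectors and their dimensionality. Here is a minimal pure-Python sketch of that idea. The function name `add_word2vec_header` and the toy file are made up for illustration, and this is not gensim's own code.

```python
import os
import tempfile


def add_word2vec_header(glove_path, w2v_path):
    """Copy a GloVe text file, prepending a '<count> <dim>' header line."""
    # First pass: count the vectors and read the dimensionality.
    count, dim = 0, 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                # Each line is "<word> <v1> <v2> ...", so dim = fields - 1.
                dim = len(line.split()) - 1
            count += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(w2v_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            dst.write(line)
    return count, dim


# Toy two-word "embedding" file, only there to exercise the function.
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy.w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("koala 0.1 0.2 0.3\nwombat 0.2 0.1 0.4\n")

header = add_word2vec_header(toy_glove, toy_w2v)
print(header)
```

Reading the file twice keeps memory use flat, which matters for a file as large as the 300d GloVe vectors.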

## Generating similar words

In [9]:
model.most_similar(positive=['koala'], topn=10)

[('probo', 0.5426342487335205),
 ('koalas', 0.4729689359664917),
 ('orangutan', 0.4557289779186249),
 ('grizzly', 0.41816502809524536),
 ('marsupial', 0.39361128211021423),
 ('wombat', 0.3832378685474396),
 ('cuddly', 0.3804110288619995),
 ('kodiak', 0.37843799591064453),
 ('kade', 0.37742379307746887),
 ('kangaroo', 0.3612629175186157)]

It seems to be working fine. Though what *the f\** is a probo?

![image.png](https://i.gyazo.com/8e982abd6da0025cb985b388c07507a8.png)

Ok.
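
To make the scores above less magical: `most_similar` ranks the vocabulary by cosine similarity to the query vector. Below is a minimal pure-Python sketch of that ranking. The three-dimensional `toy_vectors` are invented for illustration and have nothing to do with the real GloVe values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional vectors, only to show the mechanics.
toy_vectors = {
    "koala": [0.9, 0.8, 0.1],
    "wombat": [0.8, 0.9, 0.2],
    "kangaroo": [0.7, 0.6, 0.4],
    "lisbon": [0.1, 0.2, 0.9],
}

ranking = most_similar("koala", toy_vectors)
print(ranking)
```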

At this point we assume that we have our answer, the sentence it's in, the entire text, and the title. Let's explore some words.

*__Oxygen__ is a chemical element with symbol O and atomic number 8.*  

In [10]:
model.most_similar(positive=['oxygen'], topn=10)

[('hydrogen', 0.63267982006073),
 ('nitrogen', 0.6251459717750549),
 ('helium', 0.5435217022895813),
 ('nutrients', 0.5369840860366821),
 ('breathing', 0.5023170709609985),
 ('chlorine', 0.4946938157081604),
 ('monoxide', 0.4911428987979889),
 ('dioxide', 0.4911195933818817),
 ('ammonia', 0.49079084396362305),
 ('carbon', 0.4836854636669159)]

That was easy. Let's try something more difficult.

*the oldest portuguese university was first established in **lisbon** before moving to coimbra.*

In [14]:
model.most_similar(positive=['lisbon'], topn=10)

[('portugal', 0.6408252716064453),
 ('porto', 0.5835250616073608),
 ('benfica', 0.5504175424575806),
 ('copenhagen', 0.5288481712341309),
 ('portuguese', 0.5266897678375244),
 ('madrid', 0.5219067335128784),
 ('brussels', 0.5173484683036804),
 ('oporto', 0.5147969126701355),
 ('prague', 0.5037161707878113),
 ('amsterdam', 0.5018222332000732)]

Seems like we are getting closer to *football teams* rather than *cities with old universities*. Let's add some more words from the sentence.

In [15]:
model.most_similar(positive=['lisbon', 'university'], topn=10)

[('faculty', 0.5288037061691284),
 ('college', 0.523701012134552),
 ('professor', 0.5193326473236084),
 ('graduate', 0.5135288834571838),
 ('universities', 0.5098860859870911),
 ('copenhagen', 0.5022274255752563),
 ('campus', 0.4942850172519684),
 ('prague', 0.4880773425102234),
 ('madrid', 0.4852182865142822),
 ('portugal', 0.4788099527359009)]

Great! But now the words are getting too close to university. It would be good if we could add more weight to the original answer.

I can do it manually by taking the closest words to the original answer and counting how many times each of them also occurs among the closest words of the answer combined with another word from the sentence.
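
A rough sketch of that counting idea, before doing it on real embeddings. The neighbour lists here are hand-written and hypothetical. In the notebook they would come from `model.most_similar(...)`.

```python
from collections import Counter

# Hypothetical neighbour lists, written by hand for illustration.
answer_neighbours = ["porto", "madrid", "benfica", "prague"]
joint_neighbours = {
    "university": ["faculty", "prague", "madrid", "college"],
    "coimbra": ["porto", "braga", "benfica", "madrid"],
    "oldest": ["porto", "madrid", "ancient", "prague"],
}

# Start every neighbour of the answer at zero votes.
votes = Counter({word: 0 for word in answer_neighbours})

# A neighbour of the answer gets one vote each time it also shows up
# among the neighbours of (answer + some other word of the sentence).
for context_word, neighbours in joint_neighbours.items():
    for neighbour in neighbours:
        if neighbour in votes:
            votes[neighbour] += 1

ranked = votes.most_common()
print(ranked)
```

Words that only show up because of the context word (like *faculty*) never get counted, since they are not neighbours of the answer in the first place.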

In [23]:
model.most_similar(positive=['lisbon', 'coimbra'], topn=10)

[('porto', 0.6089159846305847),
 ('portugal', 0.6070287823677063),
 ('oporto', 0.5988742113113403),
 ('braga', 0.5796492099761963),
 ('benfica', 0.5514551401138306),
 ('leiria', 0.5170067548751831),
 ('aveiro', 0.4983532428741455),
 ('viseu', 0.491713285446167),
 ('évora', 0.4914955198764801),
 ('são', 0.4868907928466797)]

Using another city really makes a difference and shows some good candidates. I think it'll be a good idea to use words in the sentence that are next to the answer.
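
One way to pick "words next to the answer" is a small window over the sentence once the stopwords are gone. This is only a sketch: the helper name `context_window`, the window size, and the tiny hand-picked `STOPWORDS` set are my assumptions, not something used later in the notebook.

```python
# Tiny hand-picked stopword set, an assumption for this example only.
STOPWORDS = {"the", "was", "in", "before", "to"}


def context_window(sentence, answer, size=2):
    """Return up to `size` non-stopword tokens on each side of the answer."""
    tokens = [
        token
        for token in sentence.lower().replace(".", "").split()
        if token not in STOPWORDS
    ]
    if answer not in tokens:
        return []
    position = tokens.index(answer)
    left = tokens[max(0, position - size):position]
    right = tokens[position + 1:position + 1 + size]
    return left + right


sentence = ("the oldest portuguese university was first established "
            "in lisbon before moving to coimbra.")
neighbours = context_window(sentence, "lisbon", size=2)
print(neighbours)
```

Dropping stopwords before taking the window is what lets *coimbra* make it in. On the raw tokens the closest neighbours would be *in* and *before*.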

### Words with the same stem

In [17]:
model.most_similar(positive=['write'], topn=10)

[('writing', 0.6969849467277527),
 ('read', 0.6291235089302063),
 ('wrote', 0.6251993179321289),
 ('written', 0.6065735816955566),
 ('publish', 0.5670630931854248),
 ("'d", 0.5343195796012878),
 ('writes', 0.5341792702674866),
 ('tell', 0.5337096452713013),
 ('you', 0.5316603779792786),
 ('books', 0.5285096168518066)]

We could just remove all similar words that have the same stem as the original answer.

Additionally, the incorrect answers should be the same part of speech as the answer. For **write**, the words *read*, *publish* and *tell* are good candidates, but *books* can easily be discarded for being a noun.
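
Both filters can be sketched as one small function that takes the stemmer and the part-of-speech lookup as arguments. The names `filter_candidates`, `toy_stem` and `toy_pos` are hypothetical, and the toy lookup tables only stand in for a real stemmer and a real POS source so the example runs on its own.

```python
def filter_candidates(answer, candidates, stem, pos_of):
    """Keep candidates with the answer's part of speech but a different stem."""
    answer_stem = stem(answer)
    answer_pos = pos_of(answer)
    kept = []
    for word in candidates:
        if stem(word) == answer_stem:
            continue  # same word family as the answer, e.g. "writing"
        if pos_of(word) != answer_pos:
            continue  # different part of speech, e.g. the noun "books"
        kept.append(word)
    return kept


# Toy stand-ins (assumptions) so the sketch needs no external data.
TOY_STEMS = {"writing": "write", "writes": "write"}
TOY_POS = {
    "write": "v", "writing": "v", "writes": "v",
    "read": "v", "publish": "v", "tell": "v",
    "books": "n",
}


def toy_stem(word):
    return TOY_STEMS.get(word, word)


def toy_pos(word):
    return TOY_POS.get(word)


candidates = ["writing", "read", "publish", "writes", "tell", "books"]
kept = filter_candidates("write", candidates, toy_stem, toy_pos)
print(kept)
```

Irregular forms such as *wrote* and *written* would probably survive a suffix-stripping stemmer, so a lemmatizer might be a better fit for those.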

### Numbers

In [18]:
model.most_similar(positive=['1944'], topn=10)

[('1943', 0.9581360220909119),
 ('1942', 0.9418259859085083),
 ('1941', 0.9256348609924316),
 ('1940', 0.8975383043289185),
 ('1945', 0.8817087411880493),
 ('1939', 0.8315708637237549),
 ('1946', 0.8234671950340271),
 ('1938', 0.781980574131012),
 ('1937', 0.7764101028442383),
 ('1935', 0.7516504526138306)]

Not that bad. They seem to gravitate around the events of WW2. That seems better than random numbers or the numerically closest numbers if we need a multiple-choice question. But I think it may make a better question if you have to input the number yourself, and you get a better score the closer you are to the correct answer.
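
A possible scoring rule for such an input question, as a sketch. The linear falloff, the `tolerance` of 10 and the name `closeness_score` are all assumptions on my side. Any decreasing function of the error would do.

```python
def closeness_score(guess, correct, tolerance=10):
    """Score in [0, 1]: 1 for an exact hit, falling linearly to 0 at `tolerance`."""
    error = abs(guess - correct)
    return max(0.0, 1.0 - error / tolerance)


# Score a few hypothetical guesses against the answer 1944.
scores = {guess: closeness_score(guess, 1944) for guess in (1944, 1942, 1939, 1920)}
print(scores)
```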

### Names

In [24]:
model.most_similar(positive=['bush'], topn=10)

[('clinton', 0.7889922261238098),
 ('obama', 0.7570987939834595),
 ('gore', 0.6871949434280396),
 ('w.', 0.6750580072402954),
 ('cheney', 0.6621242761611938),
 ('mccain', 0.6613168716430664),
 ('barack', 0.6568867564201355),
 ('administration', 0.6468127965927124),
 ('george', 0.6463572978973389),
 ('kerry', 0.6004412174224854)]

In [20]:
model.most_similar(positive=['euclid'], topn=10)

[('postulate', 0.4412064254283905),
 ('archimedes', 0.43941453099250793),
 ('n.e.', 0.39649108052253723),
 ('pythagoras', 0.39116495847702026),
 ('aristotle', 0.3895653486251831),
 ('avenue', 0.38695406913757324),
 ('proclus', 0.3855825662612915),
 ('greektown', 0.3836863040924072),
 ('ptolemy', 0.38028305768966675),
 ('berea', 0.37123364210128784)]

In [30]:
model.most_similar(positive=['atanasov'], topn=10)

[('atanas', 0.6365466713905334),
 ('fery', 0.4410214424133301),
 ('simeonov', 0.4386071562767029),
 ('atanassov', 0.4376071095466614),
 ('mladenov', 0.4347333312034607),
 ('sergeevich', 0.4314761757850647),
 ('neophytos', 0.4266960620880127),
 ('geleta', 0.419179230928421),
 ('vassilev', 0.41890764236450195),
 ('stoev', 0.414333313703537)]

I expected it to be a lot worse. Names of famous people get us other names of people with the same profession: US presidents and Greek mathematicians come up pretty easily.

But with some lesser-known figures, like a general in a certain battle, it wouldn't work. In those cases it would be good to find other names in the same text, or, if we're working with a textbook, to use the names from other topics.
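
A sketch of that fallback, assuming some earlier step has already pulled the person names out of the text (that step is not shown here, and the `names_in_text` list is a hypothetical example of its output). The idea is simply to offer the most frequently mentioned other names as wrong answers.

```python
from collections import Counter


def name_distractors(answer, names_in_text, topn=3):
    """Most frequent names in the text, excluding the answer itself."""
    counts = Counter(name for name in names_in_text if name != answer)
    return [name for name, _ in counts.most_common(topn)]


# Hypothetical output of a name-extraction step over a single text.
names_in_text = [
    "Wellington", "Napoleon", "Blücher", "Napoleon",
    "Ney", "Wellington", "Napoleon",
]
distractors = name_distractors("Wellington", names_in_text)
print(distractors)
```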

# Filtering closer words

In [69]:
sentence = 'oxygen is a chemical element with symbol O and atomic number 8.'
answer = 'oxygen'

### Stemming

In [70]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [71]:
print(stemmer.stem('koalas'))

koala


### Part of speech

In [72]:
from nltk.corpus import wordnet as wn
words = ['write', 'oxygen', 'lisbon']

for word in words:
    tmp = wn.synsets(word)[0].pos()
    print (word, ":", tmp)

write : v
oxygen : n
lisbon : n


### Removing stopwords from sentence

In [73]:
from nltk.corpus import stopwords
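word_list = sentence.replace(answer, '').split()
# Note: this line starts again from `sentence`, so it overwrites the line above
# and the answer stays in the list (see 'oxygen' in the output below)
word_list = sentence.replace('.', '').split()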

filtered_words = [word for word in word_list if word not in stopwords.words('english')]

print(filtered_words)

['oxygen', 'chemical', 'element', 'symbol', 'O', 'atomic', 'number', '8']
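
Note that *oxygen* is still in the list, because the second assignment starts over from `sentence`. A variant that actually drops the answer could look like the sketch below. The name `context_words` and the tiny `STOPWORDS` set are assumptions (the set stands in for `stopwords.words('english')` so the example runs without NLTK data), and the outputs further down were produced with the original cell, not with this one.

```python
import string

# Assumption: a tiny stand-in for the NLTK English stopword list.
STOPWORDS = {"is", "a", "with", "and"}


def context_words(sentence, answer):
    """Tokens of the sentence without punctuation, stopwords or the answer."""
    # Strip all ASCII punctuation, then split on whitespace.
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return [
        token
        for token in cleaned.split()
        if token != answer and token not in STOPWORDS
    ]


sentence = 'oxygen is a chemical element with symbol O and atomic number 8.'
filtered = context_words(sentence, 'oxygen')
print(filtered)
```

Building the stopword collection once as a set also avoids calling `stopwords.words('english')` again for every token.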


### Extracting closest words

In [74]:
closestWords = model.most_similar(positive=[answer], topn=40)

In [75]:
answerPOS = wn.synsets(answer)[0].pos()
answerStem = stemmer.stem(answer)
words = []

for word, similarity in closestWords:
    # Threshold: keep only neighbours whose similarity to the answer is above 0.20
    if similarity > 0.20:
        wordPOS = wn.synsets(word)[0].pos()
        wordStem = stemmer.stem(word)
        # Keep words with the same part of speech as the answer but a different stem
        if wordPOS == answerPOS and wordStem != answerStem:
            words.append(word)

print(words)

['hydrogen', 'nitrogen', 'helium', 'nutrients', 'breathing', 'chlorine', 'monoxide', 'dioxide', 'ammonia', 'carbon', 'liquid', 'hemoglobin', 'tissues', 'vapor', 'respiration', 'atoms', 'molecules', 'oxide', 'hypoxia', 'sulfur', 'phosphorus', 'photosynthesis', 'water', 'nutrient', 'potassium', 'molecule', 'lungs', 'calcium', 'co2', 'peroxide', 'methane', 'combustion', 'gases', 'blood', 'acid', 'glucose', 'ions']


### Strength of closest words
A list holding, for each closest word of the answer, how many times it also appears among the closest words of the answer combined with each other word in the sentence.

In [76]:
wordOccurences =  [0] * len(words)

for sentenceWordIndex in range(len(filtered_words)):
    if filtered_words[sentenceWordIndex] not in model.vocab:
        continue
        
    sentenceWords = model.most_similar(positive=[answer, filtered_words[sentenceWordIndex]], topn=40)
    sentenceWordClosest = []
    for i in range(len(sentenceWords)):
        #Threshold: keep only neighbours with a similarity above 0.20
        if sentenceWords[i][1] > 0.20:
            sentenceWordClosest.append(sentenceWords[i][0])
                
    for i in range(len(sentenceWordClosest)):
        #Checking if this neighbour is also among the closest words of the answer alone
        if sentenceWordClosest[i] in words:
            wordIndex = words.index(sentenceWordClosest[i])
            wordOccurences[wordIndex]+=1
            
print(wordOccurences)

[7, 5, 5, 3, 1, 3, 2, 5, 2, 5, 4, 1, 2, 4, 1, 5, 4, 3, 1, 4, 2, 1, 4, 1, 2, 5, 1, 1, 1, 2, 2, 1, 4, 2, 3, 1, 2]


In [79]:
combined = list(zip(words, wordOccurences))

In [80]:
sorted(combined, key=lambda x: x[1], reverse=True)

[('hydrogen', 7),
 ('nitrogen', 5),
 ('helium', 5),
 ('dioxide', 5),
 ('carbon', 5),
 ('atoms', 5),
 ('molecule', 5),
 ('liquid', 4),
 ('vapor', 4),
 ('molecules', 4),
 ('sulfur', 4),
 ('water', 4),
 ('gases', 4),
 ('nutrients', 3),
 ('chlorine', 3),
 ('oxide', 3),
 ('acid', 3),
 ('monoxide', 2),
 ('ammonia', 2),
 ('tissues', 2),
 ('phosphorus', 2),
 ('potassium', 2),
 ('peroxide', 2),
 ('methane', 2),
 ('blood', 2),
 ('ions', 2),
 ('breathing', 1),
 ('hemoglobin', 1),
 ('respiration', 1),
 ('hypoxia', 1),
 ('photosynthesis', 1),
 ('nutrient', 1),
 ('lungs', 1),
 ('calcium', 1),
 ('co2', 1),
 ('combustion', 1),
 ('glucose', 1)]
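
One detail worth spelling out: Python's `sorted` is stable, even with `reverse=True`, so words with the same count stay in their original order, which here is descending similarity to the answer. A short sketch using the first eight words and counts from the outputs above (the variable names `word_occurrences` and `top_three` are mine):

```python
# First eight words and counts, copied from the outputs above.
words = ['hydrogen', 'nitrogen', 'helium', 'nutrients',
         'breathing', 'chlorine', 'monoxide', 'dioxide']
word_occurrences = [7, 5, 5, 3, 1, 3, 2, 5]

# Stable sort: ties keep the order of `words` (most similar first).
ranked = sorted(zip(words, word_occurrences),
                key=lambda pair: pair[1], reverse=True)

# The three best candidates for incorrect answers.
top_three = [word for word, _ in ranked[:3]]
print(ranked)
print(top_three)
```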

## Function

In [98]:
def getClosestWords(sentence, answer):

    #Stemming
    stemmer = PorterStemmer()

    #Removing stopwords and punctuation from sentence
    #Note: the second assignment starts again from `sentence`, so the answer is not actually removed
    word_list = sentence.replace(answer, '').split()
    word_list = sentence.replace('.', '').split()

    filtered_words = [word for word in word_list if word not in stopwords.words('english')]

    ##Extracting closest words for the answer
    closestWords = model.most_similar(positive=[answer], topn=40)
    
    answerPOS = wn.synsets(answer)[0].pos()
    answerStem = stemmer.stem(answer)
    words = []

    for i in range(len(closestWords)):
        #Threshold: keep only neighbours with a similarity above 0.20
        if closestWords[i][1] > 0.20:
            word = closestWords[i][0]
            wordStem = stemmer.stem(word)
            if wn.synsets(word) != [] and wn.synsets(word)[0].pos() == answerPOS and wordStem != answerStem:
                words.append(word)

    #Strength of closest words
    wordOccurences =  [0] * len(words)

    for sentenceWordIndex in range(len(filtered_words)):
        if filtered_words[sentenceWordIndex] not in model.vocab:
            continue

        sentenceWords = model.most_similar(positive=[answer, filtered_words[sentenceWordIndex]], topn=40)
        sentenceWordClosest = []
        for i in range(len(sentenceWords)):
            #Threshold: keep only neighbours with a similarity above 0.20
            if sentenceWords[i][1] > 0.20:
                sentenceWordClosest.append(sentenceWords[i][0])

        for i in range(len(sentenceWordClosest)):
            #Checking if this neighbour is also among the closest words of the answer alone
            if sentenceWordClosest[i] in words:
                wordIndex = words.index(sentenceWordClosest[i])
                wordOccurences[wordIndex]+=1

    combined = list(zip(words, wordOccurences))
    return sorted(combined, key=lambda x: x[1], reverse=True)

In [100]:
sentence = 'the oldest portuguese university was first established in lisbon before moving to coimbra.'
answer = 'lisbon'

getClosestWords(sentence, answer)

[('portugal', 8),
 ('porto', 8),
 ('madrid', 8),
 ('amsterdam', 7),
 ('copenhagen', 6),
 ('portuguese', 5),
 ('prague', 5),
 ('brussels', 4),
 ('oporto', 4),
 ('rome', 4),
 ('paris', 4),
 ('braga', 3),
 ('istanbul', 3),
 ('warsaw', 3),
 ('barcelona', 3),
 ('luanda', 3),
 ('stockholm', 3),
 ('bucharest', 2),
 ('budapest', 2),
 ('treaty', 2),
 ('madeira', 2),
 ('athens', 2),
 ('aires', 1),
 ('milan', 1),
 ('lima', 1),
 ('seville', 1),
 ('helsinki', 1),
 ('sofia', 1)]
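
From a ranking like this, the last step would be picking a handful of options for the question. A minimal sketch, assuming we also want to skip candidates that already appear in the sentence (like *portuguese*). The helper name `pick_distractors` and that extra rule are my assumptions. The `ranked` list is the top of the output above.

```python
def pick_distractors(ranked, sentence, n=3):
    """Take the first n ranked words that do not already occur in the sentence."""
    sentence_tokens = set(sentence.lower().replace(".", "").split())
    chosen = []
    for word, _count in ranked:
        if word in sentence_tokens:
            continue  # the sentence itself would give this option away
        chosen.append(word)
        if len(chosen) == n:
            break
    return chosen


# Top of the ranking returned by getClosestWords above.
ranked = [('portugal', 8), ('porto', 8), ('madrid', 8), ('amsterdam', 7),
          ('copenhagen', 6), ('portuguese', 5), ('prague', 5)]
sentence = ('the oldest portuguese university was first established '
            'in lisbon before moving to coimbra.')

options = pick_distractors(ranked, sentence, n=6)
print(options)
```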