# Question Answering with Nearest Neighbour Search and Locality Sensitive Hashing

---



## Downloading and Unpacking Word Embeddings


<b>Google News Vectors</b><br>
* Total Vocab: 3,000,000 words and phrases<br>
* Dimensions: 300



In [None]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gzip -d "GoogleNews-vectors-negative300.bin.gz"

--2020-10-09 08:15:36--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.34.22
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.34.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-10-09 08:16:05 (54.9 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]




## Importing Packages



* Pandas (DataFrame and CSV)
* NumPy (Arrays and Vector Maths)
* NLTK (Language Processing, Stopwords, WordNet)
* Gensim (Word2Vec Model)



In [None]:
import re
import pickle
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from gensim.models import KeyedVectors

nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Cleaning Data

In [None]:
# df = pd.read_excel('sem-faq-db.xlsx')
# df.drop(['Source', 'Metadata', 'SuggestedQuestions', 'IsContextOnly', 'Prompts'], axis = 1, inplace = True)
# print(df.head())
# df.to_csv('sem-faq-db.csv')
df = pd.read_csv('sem-faq-db.csv')
df.drop([df.columns[0]], axis = 1, inplace = True)
df.head()

| | Question | Answer | QnaId |
|---|---|---|---|
| 0 | Do you ever get hurt? | I don't have a body. | 1 |
| 1 | Can you breathe | I don't have a body. | 1 |
| 2 | Do you ever breathe | I don't have a body. | 1 |
| 3 | can you masticate? | I don't have a body. | 1 |
| 4 | Can you burp? | I don't have a body. | 1 |


## Word2Vec Model

* Vocab Considered: 1,100,000 (the `limit` passed to the loader)
* Dimensions: 300

In [None]:
en_embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True, limit=1100000)
# pickle.dump(en_embeddings, open( "en_embeddings.p", "wb" ))
# en_embeddings_subset = pickle.load(open("en_embeddings.p", "rb"))



## Processing Data


1. Process Text
    * Remove punctuation
    * Convert to lowercase
    * Tokenize (split on whitespace)
    * Lemmatize each word (treated as an adjective, `pos='a'`)
    * Remove stopwords

In [None]:
def process_que(text):
    """Clean a question and return its list of tokens."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    text = re.sub(r"'", "", text)            # drop apostrophes: "don't" -> "dont"
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace every non-letter with a space
    text = text.lower()
    # pos='a' makes the lemmatizer treat every token as an adjective
    tokens = [lemmatizer.lemmatize(t, pos='a') for t in text.split()]
    # stopwords are removed after lemmatization
    return [w for w in tokens if w not in stop_words]
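
The regex part of this cleaning can be checked on its own, without NLTK. The sketch below is only an illustration: `strip_non_letters` is a hypothetical helper name, and the sample sentence is made up.

```python
import re

def strip_non_letters(text):
    """Regex-only part of the cleaning: drop apostrophes, blank out non-letters, lowercase."""
    text = re.sub(r"'", "", text)           # "Don't" -> "Dont"
    text = re.sub(r"[^a-zA-Z]", " ", text)  # punctuation and digits become spaces
    return " ".join(text.split()).lower()   # collapse runs of whitespace, then lowercase

cleaned = strip_non_letters("Don't you EVER get hurt?!")
tokens = cleaned.split()
print(tokens)
```

Because the apostrophe is removed before stopword filtering, contractions reach the stopword check in their apostrophe-free form (`dont` rather than `don't`).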

2. Cosine Similarity

In [None]:
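def cosine_similarity(A, B):
    # A is a single vector or a matrix with one document vector per row; B is a vector.
    dot = np.dot(A, B)
    # Take norms along the last axis so that each row of A gets its own norm.
    # Without axis, np.linalg.norm of a matrix is a single Frobenius norm.
    norma = np.linalg.norm(A, axis=-1)
    normb = np.linalg.norm(B)
    denom = norma * normb
    # Zero vectors (no in-vocabulary words) get similarity 0 rather than NaN.
    cos = np.divide(dot, denom, out=np.zeros_like(dot, dtype=float), where=denom != 0)
    return cos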
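
For two vectors, cosine similarity is the dot product divided by the product of the norms. When one argument is a matrix of document vectors, each row needs its own norm, otherwise longer vectors win simply because their dot products are larger. A minimal sketch with toy 2-D vectors (the name `row_cosine` and the data are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def row_cosine(matrix, query):
    """Cosine similarity between each row of `matrix` and `query` (hypothetical helper)."""
    dots = matrix @ query
    # axis=1 gives one norm per row; without it NumPy returns one norm for the whole matrix
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms

docs = np.array([[1.0, 0.0],     # same direction as the query
                 [10.0, 10.0],   # long vector, 45 degrees away from the query
                 [0.0, 2.0]])    # orthogonal to the query
query = np.array([1.0, 0.0])

sims = row_cosine(docs, query)
best_by_cosine = int(np.argmax(sims))       # picks the row with the closest direction
best_by_dot = int(np.argmax(docs @ query))  # picks the row with the largest dot product
print(sims, best_by_cosine, best_by_dot)
```

In this toy case the two rankings disagree: the cosine picks row 0, while the raw dot product picks the longer row 1.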

3. Vector for Each Document

In [None]:
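def get_document_embedding(que, en_embeddings):
    # Document vector = sum of the vectors of its in-vocabulary words.
    doc_embedding = np.zeros(300)
    processed_doc = process_que(que)
    for word in processed_doc:
        # Testing membership on the KeyedVectors object works in gensim 3.x and 4.x
        # (the .vocab attribute was removed in gensim 4).
        if word in en_embeddings:
            doc_embedding += en_embeddings[word]
    return doc_embedding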

exmp = df.Question.values[0]
print(exmp)
exmp_embedding = get_document_embedding(exmp, en_embeddings)
exmp_embedding[:5]

Do you ever get hurt?


array([-0.1962738 , -0.02954102, -0.26391602,  0.30395508, -0.24365234])
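
The same idea on a toy scale: a document vector is the sum of the vectors of the words that have an embedding, and out-of-vocabulary words contribute nothing. The dictionary, its 3-dimensional values, and the helper names below are made up for illustration and stand in for the real `KeyedVectors` object.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; words and values are invented for this sketch.
toy_embeddings = {
    "get":  np.array([1.0, 0.0, 0.0]),
    "hurt": np.array([0.0, 2.0, 0.0]),
}

def toy_document_embedding(tokens, embeddings, dim=3):
    """Sum the vectors of the tokens that have an embedding; skip the rest."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    return vec

def cos(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

summed = toy_document_embedding(["ever", "get", "hurt"], toy_embeddings)  # "ever" is skipped
mean = summed / 2                                                        # average over the 2 known tokens
unknown = toy_document_embedding(["xyzzy"], toy_embeddings)              # no known tokens
print(summed, cos(summed, mean), unknown)
```

Summing and averaging give vectors that point in the same direction, so the choice does not change a cosine-based ranking. A document with no known words ends up as the zero vector, which has no defined cosine similarity and needs separate handling.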

4. Embeddings for All Documents

In [None]:
def get_document_vecs(all_docs, en_embeddings):
    ind2Doc_dict = {}
    document_vec_l = []
    for i, doc in enumerate(all_docs):
        doc_embedding = get_document_embedding(doc, en_embeddings)
        ind2Doc_dict[i] = doc_embedding
        document_vec_l.append(doc_embedding)
    document_vec_matrix = np.vstack(document_vec_l)
    return document_vec_matrix, ind2Doc_dict

document_vecs, ind2Doc = get_document_vecs(df.Question.values, en_embeddings)
print(f"length of dictionary {len(ind2Doc)}")
print(f"shape of document_vecs {document_vecs.shape}")

length of dictionary 9793
shape of document_vecs (9793, 300)


## Searching the Dataset with Nearest Neighbour

Testing Nearest Neighbour Search

In [None]:
Query = input("Question: ")
que_embed = get_document_embedding(Query, en_embeddings)
idx = np.argmax(cosine_similarity(document_vecs, que_embed))
print("Matched Question: " + df.Question.values[idx]) 
print("Possible Answer: " + df.Answer.values[idx])

Question: are you good?
Matched Question: Looks like you'd better start job hunting
Possible Answer: Okay, but I'm still here if you need me.
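
When a match looks off, it can help to inspect the top few neighbours and their scores rather than only the single best index. The following is a sketch on toy vectors, not the notebook's own code: `top_k` is a hypothetical helper, and it assigns a score of 0 to all-zero document vectors.

```python
import numpy as np

def top_k(doc_matrix, query_vec, k=3):
    """Return indices and scores of the k rows most cosine-similar to query_vec."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = np.zeros(len(doc_matrix))
    # Divide only where the norm product is non-zero; zero rows keep a score of 0.
    np.divide(doc_matrix @ query_vec, norms, out=scores, where=norms != 0)
    order = np.argsort(-scores)[:k]   # indices sorted by descending similarity
    return order, scores[order]

toy_docs = np.array([[1.0, 0.0],
                     [0.0, 0.0],   # a document with no known words
                     [1.0, 1.0],
                     [0.0, 1.0]])
idx, scores = top_k(toy_docs, np.array([1.0, 0.2]), k=2)
print(idx, scores)
```

With the real data, the returned indices could be used to look up `df.Question.values[idx]` and compare several candidate matches side by side.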


Inference Function

In [None]:
def ask(que):
    que_embed = get_document_embedding(que, en_embeddings)
    idx = np.argmax(cosine_similarity(document_vecs, que_embed))

    return df.Answer.values[idx]

# Evaluate on the dataset's own questions: each one is used as a query.
df['Predicted'] = [ask(x) for x in df.Question.values]

# Fraction of questions whose retrieved answer equals the stored answer.
acc = (df.Answer == df.Predicted).mean()

print('Accuracy: ', acc)

  


Accuracy:  0.4159093229858062
