# Ranking documents using a skip-gram word embedding model for phrase queries

In [1]:
import nltk  # needed for the nltk.download calls in the next cell
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import pickle
import gensim 
import numpy as np
import pandas as pd
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Uncomment on the first run to download the required NLTK data
# nltk.download('punkt')
# nltk.download('words')
# nltk.download('stopwords')

## Setup
#### The corpus/documents are extracted from the pickle files.
#### Stemmer has been initialised to convert a term to its stem (Example: received -> receiv)

In [3]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
root = Path("../")

my_path = root / "Pickled_files" / "Documents"
# Load the pickled corpus; the context manager closes the file even on error
with open(my_path, 'rb') as dbfile:
    documents = pickle.load(dbfile)

## Model

The skip-gram model is a shallow neural network architecture commonly used to learn word embeddings. It needs no labelled data, because the training signal comes from the text itself, and it aims to learn distributed representations of words. The basic idea is to predict the context words surrounding a given target word, rather than predicting the target word from its context words as CBOW does. The context words are those that occur within a fixed-size window around the target word.

The skip-gram model consists of an input layer, a hidden layer, and an output layer. The input layer represents the target word, and the output layer represents the context words. The hidden layer transforms the input word into a distributed representation, or embedding, which is used to predict the context words.

During training, the skip-gram model is fed a large corpus of text. For each target word in the corpus, one training example is created per context word in its window: the target word is the input and the context word is the expected output. The model is then trained to predict each context word given the target word.

The training process adjusts the weights of the model to minimize the difference between the predicted context word and the actual context word. The weights are updated by gradient descent, with the gradients computed by backpropagation from the prediction error.

After training, the skip-gram model provides a dense vector representation for each word in the vocabulary. These vectors capture semantic and syntactic information about the words and can be used as input to other machine learning models for natural language processing tasks such as text classification, information retrieval, and machine translation.
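As a rough illustration of how training examples are formed, the sketch below enumerates (target, context) pairs from a token list using a fixed window. It is a simplification: the function name `skipgram_pairs` and the toy sentence are hypothetical, and real implementations typically add refinements such as negative sampling that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical toy sentence, already tokenized and lowercased
toy_tokens = ["the", "excluded", "driver"]
toy_pairs = skipgram_pairs(toy_tokens, window=1)
# toy_pairs == [("the", "excluded"), ("excluded", "the"),
#               ("excluded", "driver"), ("driver", "excluded")]
```

Each pair is one training example: the first element is fed to the input layer and the second is the word the model should predict.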

## Building the model

The following code initializes an empty list called data, which will be used to store the tokenized documents. Then, it loops over each document in the documents list and tokenizes its text (the first element of each document tuple) into a list of words. Each word is converted to lowercase and added to a temporary list called temp. The temp list is then appended to the data list, which ends up holding one list of lowercased tokens per document.

Next, the Word2Vec model from the gensim library is used to train a word embedding model on the data list. The min_count parameter specifies the minimum frequency count of a word in the corpus for it to be included in the model. The vector_size parameter specifies the dimensionality of the word vectors. The window parameter specifies the maximum distance between the target word and the context word within a sentence. The sg parameter specifies the training algorithm to be used - 1 for skip-gram and 0 for CBOW. In this case, sg is set to 1 which corresponds to the skip-gram algorithm for training.

After training, the word embedding model skipgram_model contains a dense vector representation for each word in its vocabulary. With min_count set to 1, that vocabulary includes every distinct token in the corpus.

In [4]:
data = list()
for i in documents: 
    temp = list()
    # tokenize the sentence into words 
    for j in word_tokenize(i[0]): 
        temp.append(j.lower()) 
    data.append(temp)

# sg=1 selects skip-gram training (sg=0 would select CBOW)
skipgram_model = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)
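To make the effect of min_count concrete, here is a small dependency-free sketch that counts token frequencies in a toy corpus and keeps only the words that meet the threshold. It mirrors the documented meaning of the parameter rather than gensim's internal code; `vocab_after_min_count` and `toy_corpus` are hypothetical names used only for illustration.

```python
from collections import Counter


def vocab_after_min_count(tokenized_docs, min_count):
    """Return the words whose total corpus frequency is at least `min_count`."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, freq in counts.items() if freq >= min_count}


# Hypothetical toy corpus: one token list per document
toy_corpus = [
    ["excluded", "driver", "policy"],
    ["driver", "policy", "coverage"],
    ["policy"],
]

vocab_1 = vocab_after_min_count(toy_corpus, min_count=1)  # keeps all four words
vocab_2 = vocab_after_min_count(toy_corpus, min_count=2)  # keeps "driver" and "policy"
```

With min_count set to 1, as in the cell above, no word is dropped, so even words that occur only once receive a vector.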

## Generating word embeddings for documents

The following code defines two functions.

1. **preprocess(text)** takes a string of text as input, tokenizes it, removes stop words and filters out words with length less than two. It returns a list of preprocessed tokens.

2. **text_embedding(text, model)** takes a string of text and a pre-trained word embedding model as inputs. It preprocesses the text using the preprocess() function, obtains the word embeddings for each word in the preprocessed text from the pre-trained word embedding model, takes the average of the word embeddings to generate a document embedding, and returns the document embedding as a numpy array.

Finally, the **document_vecs** variable is assigned a list of document embeddings generated using the text_embedding() function and the pre-trained word embedding model (skipgram_model) for each document in the documents list.


In [5]:
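def preprocess(text):
    tokens = []
    for word in word_tokenize(text, language='english'):
        # Lowercase so tokens match the lowercased training vocabulary and
        # the lowercase stop-word list (e.g. "Excluded" in a query)
        word = word.lower()
        if len(word) >= 2 and word not in stop_words:
            tokens.append(word)
    return tokens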

# Define a function to generate text embeddings
def text_embedding(text, model):
    # Preprocess the text (tokenize, remove stop words and one-character tokens)
    preprocessed_text = preprocess(text)
    # Get the word embeddings for each word in the document
    word_embeddings = [model.wv[word] for word in preprocessed_text if word in model.wv.key_to_index]
    # Take the average of the word embeddings to generate a document embedding
    if len(word_embeddings) > 0:
        document_embedding = np.mean(word_embeddings, axis=0)
    else:
        document_embedding = np.zeros(model.vector_size)
    return document_embedding

document_vecs = [text_embedding(document[0], skipgram_model) for document in documents]
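The averaging step can be seen in isolation with a toy lookup table in place of the trained model. This is a dependency-free sketch, not the notebook's implementation: `average_embedding` and `toy_vectors` are hypothetical names, and the two-dimensional vectors are made-up values chosen so the arithmetic is easy to follow.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension across all found vectors
    return [sum(column) / len(found) for column in zip(*found)]


# Hypothetical 2-dimensional embeddings
toy_vectors = {
    "excluded": [1.0, 0.0],
    "driver": [3.0, 2.0],
}

# "zzz" is out of vocabulary and is skipped, so the result is the mean of two vectors
doc_vec = average_embedding(["excluded", "driver", "zzz"], toy_vectors, dim=2)
# doc_vec == [2.0, 1.0]

# No known tokens at all: fall back to the zero vector
empty_vec = average_embedding(["zzz"], toy_vectors, dim=2)
# empty_vec == [0.0, 0.0]
```

Because out-of-vocabulary tokens are skipped silently, a query whose tokens do not match the vocabulary exactly (for example because of casing) contributes fewer words to the average than intended.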

In [6]:
pd.DataFrame(document_vecs)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.029961,0.137852,0.062892,-0.068636,0.146065,-0.310918,0.260103,0.312034,0.085926,-0.058731,...,0.274211,-0.067892,0.193503,-0.108390,0.133346,0.016425,0.352965,-0.080150,0.173055,0.033377
1,0.048956,0.234289,0.091454,-0.085685,0.082340,-0.287559,0.304864,0.296115,0.054177,0.057240,...,0.189385,0.059318,0.233837,-0.185014,0.125110,0.014443,0.592293,0.182993,0.217441,-0.070333
2,0.016117,0.101110,0.106387,-0.169617,0.151749,-0.275378,0.192407,0.284254,0.063564,0.102821,...,0.243920,0.053439,0.261050,-0.167039,0.216173,0.064337,0.637676,0.147637,0.076408,0.102397
3,-0.003193,0.060116,0.077803,-0.178501,0.169118,-0.322478,0.214613,0.379391,0.065270,0.057274,...,0.311613,-0.011907,0.250686,-0.105574,0.113224,0.001763,0.529442,0.129211,0.103091,0.152931
4,0.064057,0.228277,0.115788,-0.094056,0.122794,-0.212706,0.220429,0.212295,0.121120,0.042652,...,0.221088,0.066263,0.284462,-0.167477,0.198747,0.168184,0.530396,0.123433,0.212684,0.025017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3470,-0.118166,0.080370,-0.036874,-0.134574,0.177163,-0.342189,0.307943,0.371783,0.085643,-0.011377,...,0.435848,-0.052106,0.202213,-0.166312,0.224200,0.195284,0.424065,-0.084877,0.113451,0.105040
3471,-0.051591,0.098346,0.058517,-0.072703,0.126266,-0.244646,0.276554,0.280397,0.028105,0.010057,...,0.273462,-0.015334,0.217563,-0.182987,0.229434,0.126491,0.399488,-0.086635,0.098177,0.005683
3472,-0.122357,0.128365,0.046634,0.126418,0.052945,-0.296985,0.271161,0.230271,0.071167,-0.098535,...,0.283309,-0.030030,0.039129,-0.072816,0.084887,0.247428,0.109177,-0.357931,0.095543,0.124211
3473,-0.141574,0.083228,0.023239,0.177175,0.036909,-0.447578,0.315401,0.190893,0.052432,-0.145392,...,0.330529,-0.066754,-0.023257,-0.066352,0.081898,0.274095,0.035102,-0.518561,0.058561,0.131167


## Similarity measure

Cosine similarity is used to compare the query vector with each document vector.

In [7]:
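# Cosine similarity between two vectors; returns 0.0 if either is the zero vector
def cosine_similarity(u, v):
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    # Guard against division by zero (text_embedding may return all zeros)
    return np.dot(u, v) / norm_product if norm_product else 0.0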
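For intuition about the values this measure produces, the following dependency-free sketch computes it on a few hand-picked vectors. The function name `cosine` and the example vectors are illustrative assumptions; the zero-vector guard is one possible convention for handling embeddings that came back as all zeros.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors, 0
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite vectors, -1
zero_case = cosine([0.0, 0.0], [1.0, 1.0])        # guarded case, 0
```

The score depends only on direction, not on vector length, which is why a long document and a short query can still be compared after averaging.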

## Saving document vectors for future use

In [8]:
my_path = root / "Pickled_files" / "Document_vectors"
# Save the document vectors
with open(my_path, 'wb') as dbfile:
    pickle.dump(document_vecs, dbfile)

# Load them back; the context manager closes the file even on error
with open(my_path, 'rb') as dbfile:
    document_vecs = pickle.load(dbfile)

## Processing the query

This code ranks the documents against a given query. The document vectors computed earlier (document_vecs) are reused, and a query vector is generated with the same text_embedding function. The cosine similarity between the query vector and each document vector is calculated and stored in the similarity_scores list. Finally, the documents are sorted in descending order of their similarity to the query, and the top five documents are displayed.
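The ranking step can be sketched on toy data as follows. This is a minimal, dependency-free illustration and not the notebook's code: `rank_documents`, `cosine`, and the toy vectors are hypothetical, and sorting on the score alone (via `key`) is a design choice that avoids comparing the documents themselves when two scores tie.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs, docs, top_k=5):
    """Return the top_k (score, document) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical documents with made-up 2-dimensional embeddings
toy_docs = ["doc_a", "doc_b", "doc_c"]
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query_vec = [1.0, 0.0]

top_two = rank_documents(toy_query_vec, toy_doc_vecs, toy_docs, top_k=2)
top_names = [doc for _, doc in top_two]
# top_names == ["doc_a", "doc_c"]
```

Keeping the score next to each document also makes it easy to inspect how close the top results are to one another.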

In [9]:
# Example usage: compare a query with a set of documents
query = "Excluded driver"

query_vec = text_embedding(query,skipgram_model)
# Calculate similarity scores between the query and all documents
similarity_scores = [cosine_similarity(query_vec, document_vec) for document_vec in document_vecs]
# Rank documents based on similarity scores
ranked_documents = [document for _, document in sorted(zip(similarity_scores, documents), reverse=True)]
# Print the ranked documents
ranked_documents[0:5]

[('excluded drivers and driving without permission except for certain accident benefits coverage , there is no coverage ( including coverage for occupants ) under this policy if the automobile is used or operated by a person in possession of the automobile without the owner ’ s consent or is driven by a person named as an excluded driver of the automobile policy or a person who , at the time he or she willingly becomes an occupant of an automobile , knows or ought reasonably to know that the automobile is being used or operated by a person in possession of the automobile without the owner ’ s consent . ',
  '.\\Docs\\Auto\\1215E.2.docx'),
 ('we will not pay damages to or for any household member who has a massachusetts auto policy of his or her own or who is covered by a massachusetts auto policy of another household member providing underinsured auto insurance with higher limits.anyone else while occupying your auto.we will not pay damages to or for anyone else who has a massachusetts