<center>
<h1><b>Word Embedding Based Answer Evaluation System for Online Assessments (WebAES)</b></h1>
<h3>A smart system to automate the process of answer evaluation in online assessments.</h3>
<h5>Doc2Vec Model training</h5>
</center>

# Environment setup

For this project the following packages and libraries are required:

1. **string:** To perform string manipulation required for basic text pre-processing.
2. **gensim:** Contains the Doc2Vec model and other functions for building, training and saving a Doc2Vec model.
3. **scipy:** Contains implementations of mathematical functions, such as the cosine distance between 2 vectors.
4. **time:** To measure the time elapsed for model training.

These packages and libraries are imported in the following code cell. gensim and scipy must already be installed; string and time are part of the Python standard library.

In [1]:
# Import all required packages

# For string manipulation
import string

# To build, train and save a Doc2Vec model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.downloader as api
from gensim.test.utils import get_tmpfile

# To determine similarity
import scipy.spatial.distance

# To measure time elapsed
import time

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Doc2Vec Model

## 1. Training corpus

To build and train a word/document embedding model using Doc2Vec, a training corpus of text documents is required. One such corpus is the **text8 corpus**, which contains the first **10^8 bytes (100MB)** of cleaned text from an **English Wikipedia** dump. It therefore covers a wide variety of topics, which makes it suitable for building a Doc2Vec model that can form vector representations of documents and words.

In the following code cell, the text8 corpus is loaded through gensim's downloader API (it is downloaded on first use) and the first 25 tokens of a sample document are displayed.

In [2]:
# Load text8 corpus
text8_corpus = api.load('text8')
text8_data = [doc for doc in text8_corpus]

# Display a sample document
print(text8_data[1][0:25])

['reciprocity', 'qualitative', 'impairments', 'in', 'communication', 'as', 'manifested', 'by', 'at', 'least', 'one', 'of', 'the', 'following', 'delay', 'in', 'or', 'total', 'lack', 'of', 'the', 'development', 'of', 'spoken', 'language']


In order to train a Doc2Vec model, a list of Tagged Documents is required. The Doc2Vec model takes this list of Tagged Documents as input and generates vector representations of texts (words and documents) by learning which words tend to be used together. This makes it possible to capture semantic relationships between words and thereby capture the meaning of a document (a set of words).

In the following code cell, a function is defined to generate Tagged Documents using the *TaggedDocument* class available in the gensim library. Each document in the text corpus is represented as a list of its individual words (tokens) and tagged with a unique integer ID. A sample Tagged Document is shown.

In [3]:
# Function to generate tagged documents from text corpus
def tagged_document(corpus_documents):
    # For each document in corpus, yield a TaggedDocument object
    for i, list_of_words in enumerate(corpus_documents):
        yield TaggedDocument(list_of_words, [i])

In [4]:
# Get tagged documents for training data
training_data = list(tagged_document(text8_data))

# Display a sample tagged document
print(training_data[1])

TaggedDocument(['reciprocity', 'qualitative', 'impairments', 'in', 'communication', 'as', 'manifested', 'by', 'at', 'least', 'one', 'of', 'the', 'following', 'delay', 'in', 'or', 'total', 'lack', 'of', 'the', 'development', 'of', 'spoken', 'language', 'not', 'accompanied', 'by', 'an', 'attempt', 'to', 'compensate', 'through', 'alternative', 'modes', 'of', 'communication', 'such', 'as', 'gesture', 'or', 'mime', 'in', 'individuals', 'with', 'adequate', 'speech', 'marked', 'impairment', 'in', 'the', 'ability', 'to', 'initiate', 'or', 'sustain', 'a', 'conversation', 'with', 'others', 'stereotyped', 'and', 'repetitive', 'use', 'of', 'language', 'or', 'idiosyncratic', 'language', 'lack', 'of', 'varied', 'spontaneous', 'make', 'believe', 'play', 'or', 'social', 'imitative', 'play', 'appropriate', 'to', 'developmental', 'level', 'restricted', 'repetitive', 'and', 'stereotyped', 'patterns', 'of', 'behavior', 'interests', 'and', 'activities', 'as', 'manifested', 'by', 'at', 'least', 'one', 'of',

## 2. Initialise and train Doc2Vec model

Having loaded the text8 corpus and generated Tagged Documents for each document, the next step is to initialise and train the Doc2Vec model.

For the purpose of this project, the following parameters are selected for the Doc2Vec model:

1. **vector_size = 50** (the model will generate a 50-dimensional vector for each document)
2. **min_count = 2** (words that occur fewer than 2 times in the text corpus will be ignored)
3. **epochs = 40** (the model will train iteratively for 40 epochs)
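
The effect of `min_count` can be illustrated without gensim. The following is a minimal sketch, assuming a tiny made-up corpus (`toy_corpus`) and a hypothetical helper (`build_toy_vocab`). It only mimics the frequency cut-off and is not how gensim builds its vocabulary internally.

```python
from collections import Counter

def build_toy_vocab(documents, min_count=2):
    """Return the set of words that occur at least `min_count` times overall."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical two-document corpus, already tokenized
toy_corpus = [
    ['data', 'science', 'is', 'fun'],
    ['data', 'mining', 'is', 'related'],
]

toy_vocab = build_toy_vocab(toy_corpus, min_count=2)
print(sorted(toy_vocab))  # ['data', 'is']
```

Only 'data' and 'is' occur twice, so every other word is dropped from the toy vocabulary.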

First, the vocabulary for the model is built based on the text8 corpus. This includes all the words that occur at least twice in the corpus. Next, the model is trained using the training data (corpus). Total time elapsed for this process (in seconds) is shown.

In [5]:
# Get start time
start = time.time()

# Initialise Doc2Vec model
model = Doc2Vec(vector_size=50, min_count=2, epochs=40)

# Build vocabulary from given text corpus
model.build_vocab(training_data)

# Train model
model.train(training_data, total_examples=model.corpus_count, epochs=model.epochs)

# Get end time
end = time.time()

# Display time elapsed
print('Time elapsed: {} seconds'.format(end-start))

Time elapsed: 569.3098156452179 seconds


The Doc2Vec model takes about 9.5 minutes (roughly 569 seconds) to train in this run.

Once the Doc2Vec model is trained, it can be used to infer the vector representation of any piece of text, and thereby determine the similarity between pairs of texts using the cosine similarity measure. An example is shown below.

In [6]:
# Sample texts
sample_text1 = 'Nothing is bad'
sample_text2 = 'Everything is good'

# Infer vectors for sample texts using trained Doc2Vec model
sample_text1_vector = model.infer_vector(sample_text1.split())
sample_text2_vector = model.infer_vector(sample_text2.split())

# Determine similarity using cosine similarity
sim = 1 - scipy.spatial.distance.cosine(sample_text1_vector, sample_text2_vector)

# Display similarity score
print(sim)

0.9429934327925633


Using the described method, the similarity between the 2 sample texts is determined to be 0.9430 (~94%). Because vector inference involves randomness, the exact score can differ slightly between runs.
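
One caveat with the sample texts: they are split without lowercasing, while the text8 corpus is entirely lowercase. Tokens such as 'Nothing' and 'Everything' are therefore unlikely to be in the model's vocabulary, and out-of-vocabulary words are expected to be ignored during inference (an assumption about gensim's behaviour, worth verifying for the installed version). The sketch below uses a hypothetical stand-in vocabulary (`toy_vocab`) and a hypothetical helper (`known_tokens`) to show the difference.

```python
# Hypothetical stand-in for the trained model's vocabulary (lowercase, like text8)
toy_vocab = {'nothing', 'everything', 'is', 'bad', 'good'}

def known_tokens(text, lowercase):
    """Split text on whitespace and keep only the tokens found in toy_vocab."""
    tokens = text.lower().split() if lowercase else text.split()
    return [token for token in tokens if token in toy_vocab]

print(known_tokens('Nothing is bad', lowercase=False))  # ['is', 'bad']
print(known_tokens('Nothing is bad', lowercase=True))   # ['nothing', 'is', 'bad']
```

Passing `sample_text1.lower().split()` to `infer_vector` would keep all three words in play. The score reported here was computed without that step.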

## 3. Test with sample documents

After training the Doc2Vec model, it can be tested with a pair of sample documents to determine semantic similarity between them.

The function defined below takes a text document as input, removes punctuation marks, converts the text to lowercase and splits it into individual words (tokens). It returns the list of tokens for the input document.

In [7]:
# Function to tokenize text documents
def tokenize(document):
    # Remove all punctuation marks
    document = document.translate(str.maketrans('', '', string.punctuation))
    
    # Split document into individual words
    tokens = document.lower().split()
    
    # Return list of tokens
    return tokens
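
A quick check of `tokenize()` on a short, made-up sentence shows what the function returns. The block repeats the function so that it runs on its own. Note that because punctuation is deleted rather than replaced by a space, a hyphenated word is merged into a single token.

```python
import string

def tokenize(document):
    # Same logic as the function above: strip punctuation, lowercase, split
    document = document.translate(str.maketrans('', '', string.punctuation))
    return document.lower().split()

# Made-up sentence used only for illustration
print(tokenize('Data science, in turn, is AI-driven!'))
# ['data', 'science', 'in', 'turn', 'is', 'aidriven']
```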

Sample document 1 (source: https://www.datarobot.com/wiki/data-science/)

The vector representation for the sample document is inferred using the trained Doc2Vec model and displayed here.

In [8]:
sample_doc1 = '''Data science is the field of study that combines domain expertise, programming skills, and knowledge of 
mathematics and statistics to extract meaningful insights from data. Data science practitioners apply machine learning 
algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems to perform 
tasks that ordinarily require human intelligence. In turn, these systems generate insights which analysts and business 
users can translate into tangible business value.'''

# Tokenize sample document
sample_doc1_tokens = tokenize(sample_doc1)

# Infer vector using trained model
doc1_vector = model.infer_vector(sample_doc1_tokens)

# Display inferred vector
print(doc1_vector)

[ 0.13286893 -1.2322876  -1.9172021   0.8146317   0.92000616  0.07555575
  0.33878285 -0.8097486  -0.3451892  -0.6000168   0.430709    1.2293409
  0.2270408  -0.21894792  0.15632525 -1.7304384   0.05622553 -0.2064903
 -1.1359855  -1.6128716  -1.426148   -0.22439794 -0.85566664 -0.36731967
 -0.09791453 -1.7759222  -1.0530732  -1.1950265  -0.40541992 -0.75427556
  1.5395336  -0.45556417 -0.79839605 -0.5609269  -1.0341463  -0.03317484
  1.0452812   0.37783346 -0.43193278  1.2293113   0.6621519  -0.71475995
  0.82637686 -1.4677111   1.0038252  -0.15299498 -0.13292463 -0.54987985
  1.1619424  -0.8855966 ]


Sample document 2 (source: https://en.wikipedia.org/wiki/Data_science)

The vector representation for the sample document is inferred using the trained Doc2Vec model and displayed here.

In [9]:
sample_doc2 = '''Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and 
systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and 
actionable insights from data across a broad range of application domains. Data science is related to data mining, 
machine learning and big data.'''

# Tokenize sample document
sample_doc2_tokens = tokenize(sample_doc2)

# Infer vector using trained model
doc2_vector = model.infer_vector(sample_doc2_tokens)

# Display inferred vector
print(doc2_vector)

[-0.15553468 -1.266999   -1.3765799   1.0690178   0.7393133   0.6061039
 -0.00314287 -0.25976923  0.1151078  -0.38529044  0.0763981   0.80284387
  0.8806282  -0.43346977  0.18722571 -1.5667044   0.06902097  0.58789337
 -0.6691542  -1.7919095  -1.5733536   0.6398737  -0.9336003  -0.09777802
 -1.3884649  -1.0396327  -0.66473716 -1.219378    0.35274348 -0.46030825
  1.2755452  -0.1862244   0.03700712  0.63001686 -0.7626884   0.10283075
  0.45660853 -0.09105806 -0.12229317  0.98861     0.33697143 -1.1812943
 -0.4700193  -1.387528    0.5422426   0.5360279  -0.02451541 -0.6216781
  1.092974   -0.537946  ]


The cosine of the angle between these 2 vectors can be used as a measure of similarity between the 2 documents.

The cosine of the angle between 2 vectors indicates how closely the 2 vectors point in the same direction in the vector space. The closer 2 vectors are in direction, the larger the cosine of the angle between them (up to a maximum of 1). SciPy's *cosine()* function returns the cosine distance, which is 1 minus this cosine, so subtracting the returned value from 1 recovers the cosine similarity of the pair of vectors.
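
The relationship between cosine similarity and the cosine distance returned by `scipy.spatial.distance.cosine` can be made explicit with a small pure-Python sketch. The helper name `cosine_similarity` and the toy vectors are assumptions for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors

print(same_direction)      # 1.0
print(orthogonal)          # 0.0

# Cosine distance is 1 - cosine similarity, so parallel vectors have distance 0
print(1 - same_direction)  # 0.0
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and the distance is simply the complement of the similarity.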

In [10]:
# Determine similarity score using cosine similarity
sim_score = 1 - scipy.spatial.distance.cosine(doc1_vector, doc2_vector)

# Display score
print(sim_score)

0.8301742423909662


The similarity score for the given sample documents, as calculated using cosine similarity, is 0.8302. It can be said that the 2 documents are ~83% similar.

# Save trained model

This trained Doc2Vec model can be saved for later use.

In [11]:
# Save model to disk
model.save('./WebAES_Doc2Vec_Model.model')