<center>
<div style="text-align: Center">Documents by Context Category</div>
<div style="text-align: Center">Linear Relationships</div>
<div style="text-align: Center">Pawel Sobieralski, 2022 </div>
</center>

# Solution Design - Linear Relationships

Determine which context category a document belongs to.

## Description:
This solution builds a vector representation of each document with doc2vec, using the gensim implementation. It then uses the linear relationships between the learned vectors to compare each context with a new document. Vectors that lie close together in the vector space have similar meanings based on context, and vectors distant from each other have differing meanings.

Is vector("**new document**") more similar to:

vector("**laundering activity**") - vector('**allegations','accusations','charges**') + vector('**conviction','sentencing**')  

or is it more similar to:

vector("**laundering activity**") - vector('**conviction','sentencing**') + vector('**allegations','accusations','charges**')
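
The comparison above can be written compactly. The symbols are notation introduced here for illustration and do not appear in the original code: $d$ is the vector of the new document, $\ell$ the vector for "laundering activity", $c_1$ the vector for the keywords allegations, accusations, charges, and $c_2$ the vector for conviction, sentencing. The sketch assumes that similarity is measured with the standard cosine, which is what the name of the `se_cosine_similarity` helper used later suggests.

$$
a_1 = \ell - c_2 + c_1, \qquad a_2 = \ell - c_1 + c_2
$$

$$
\operatorname{category}(d) =
\begin{cases}
1 & \text{if } \cos(d, a_1) > \cos(d, a_2) \\
2 & \text{otherwise}
\end{cases}
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

Here $a_1$ shifts the laundering vector towards context 1 and away from context 2, and $a_2$ does the opposite, so the document is assigned to whichever shifted vector it points closer to.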


## Example usage:
 
<strong>Preprocess</strong>  
train_doc = se_process_document(train_corpus)  
test_doc = se_process_document(test_document)

<strong>Build vector space</strong>  
vocab_dict = se_build_vocabulary(train_doc)

<strong>Build context vectors</strong>  
train_vector = se_build_context(train_doc,vocab_dict)
test_vector = se_build_context(test_doc,vocab_dict)

<strong>Compare documents in given category</strong>  
se_compare_by_context(train_vector, test_vector, context1)

## Files list

Compare Documents by Context Category - Jupyter Notebook with solution prototype  
utils.py - Utilities  
DocByContextCategory.py - The same solution as a Python class  

In [1]:
import sys
sys.path.append("../Python")

from utils import se_cosine_similarity

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim import corpora

# Sample Corpus

These documents are all assumed to be money laundering documents. While the sentences may not make sense to humans, they have the property required for this project: a corpus with multiple contexts.

First 5 sentences: [allegations, accusations, charges]

Last 4 sentences: [conviction, sentencing]

Also included: a previously unseen document.

In [9]:
train_corpus = [
    "Laundering Human machine interface for lab abc allegations applications",
    "Laundering A survey of user opinion of allegations accusations charges time",
    "Laundering The EPS user interface management accusations",
    "Laundering Accusations and human accusations engineering testing of EPS",
    "Laundering Relation of user perceived charges time to error measurement",
    "Laundering The generation of random binary unordered conviction",
    "Laundering The intersection sentencing of paths in conviction",
    "Laundering Sentencing minors IV Widths of conviction and well quasi ordering",
    "Laundering Sentencing minors A survey",
]

context1 = ['allegations','accusations','charges']
context2 = ['conviction','sentencing']

# Previously unseen document
# It belongs to context 1 but contains no keyword from context 1
test_document = ["Machine versus human movies"]

text_tokens = [mydoc.split() for mydoc in train_corpus]
text_dict = corpora.Dictionary(text_tokens)
dictionary_tokens = text_dict.token2id.keys()
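
As a quick sanity check of the corpus layout described above, the following sketch labels each training sentence by the context keywords it contains. The helper `matching_contexts` and the `contexts` dictionary are hypothetical names introduced for this illustration; they are not part of `utils.py` or the original notebook. Matching is case-insensitive, which is an assumption made here because the corpus mixes "Sentencing" and "sentencing".

```python
# Sample corpus and context keywords, copied from the cell above
train_corpus = [
    "Laundering Human machine interface for lab abc allegations applications",
    "Laundering A survey of user opinion of allegations accusations charges time",
    "Laundering The EPS user interface management accusations",
    "Laundering Accusations and human accusations engineering testing of EPS",
    "Laundering Relation of user perceived charges time to error measurement",
    "Laundering The generation of random binary unordered conviction",
    "Laundering The intersection sentencing of paths in conviction",
    "Laundering Sentencing minors IV Widths of conviction and well quasi ordering",
    "Laundering Sentencing minors A survey",
]
context1 = ['allegations', 'accusations', 'charges']
context2 = ['conviction', 'sentencing']


def matching_contexts(sentence, contexts):
    """Return the names of all contexts whose keywords occur in the sentence.

    Hypothetical helper: tokens are lower-cased and split on whitespace.
    """
    tokens = set(sentence.lower().split())
    return [name for name, keywords in contexts.items() if tokens & set(keywords)]


contexts = {"context1": context1, "context2": context2}
labels = [matching_contexts(sentence, contexts) for sentence in train_corpus]

for sentence, label in zip(train_corpus, labels):
    print(label, sentence)

# The unseen test sentence shares no keyword with either context
print(matching_contexts("Machine versus human movies", contexts))
```

Each training sentence matches exactly one context, and the test sentence matches none, so a plain keyword lookup cannot classify it. That is the gap the vector comparison is meant to fill.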

# Build Model

Estimate model parameters

In [3]:
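def se_build_model(train_corpus, vector_size, window, min_count=1, workers=4):
    
    # Doc2Vec expects each document as a list of tokens, so split the raw sentences;
    # lower-casing lets the lower-case context keywords match the corpus vocabulary
    documents = [TaggedDocument(doc.lower().split(), [i]) for i, doc in enumerate(train_corpus)]
    # Pass min_count and workers through instead of hard-coding them
    model = Doc2Vec(documents, vector_size=vector_size, window=window, min_count=min_count, workers=workers)

    return model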

model = se_build_model(train_corpus,len(dictionary_tokens), 3)

# Infer Vector
Infer Vector for a New Document, Context 1 and Context 2

In [4]:
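def se_infer_vector(model, tokens):
    
    # infer_vector expects a list of string tokens rather than a raw sentence
    return model.infer_vector(list(tokens))

corpus_vector = se_infer_vector(model, dictionary_tokens) #corpus
# test_document holds one raw sentence; split it into lower-case tokens before inference
new_document_vector = se_infer_vector(model, test_document[0].lower().split()) #Test document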

context1_vector = se_infer_vector(model, context1)
context2_vector = se_infer_vector(model, context2)
laundering_vector = se_infer_vector(model, ['laundering'])

# Similarity Between Context Category and Unseen Document
Compare the new document vector with each context vector

In [5]:
from utils import se_cosine_similarity

def get_context_category(new_doc_vector, context1_vector, context2_vector):
    
    # Cosine similarity of the new document with each context vector
    context1_similarity = se_cosine_similarity(new_doc_vector,context1_vector)
    context2_similarity = se_cosine_similarity(new_doc_vector,context2_vector)
    
    print("New document similarity with context 1 is " + str(context1_similarity))
    print("New document similarity with context 2 is " + str(context2_similarity))
    
    # Return the closer context; a tie falls through to context 2
    if context1_similarity > context2_similarity:
        return 1
    else:
        return 2
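
The `se_cosine_similarity` helper lives in `utils.py` and is not shown in this notebook. As a rough idea of what such a helper computes, here is a minimal pure-Python sketch of cosine similarity. The name `cosine_similarity_sketch` is hypothetical, and returning 0.0 for a zero vector is an assumption made for this example; the actual implementation in `utils.py` may differ.

```python
import math


def cosine_similarity_sketch(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a zero vector is treated as dissimilar to everything
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-D vectors, not taken from the trained model
print(cosine_similarity_sketch([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity_sketch([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

Because the measure depends only on direction, not on vector length, scores always fall between -1 and 1, and values close to 0 mean the two vectors are nearly unrelated.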

# Context Category from Linear Relationship

Is vector("**new document**") more similar to:

vector("**laundering activity**") - vector('**allegations','accusations','charges**') + vector('**conviction','sentencing**')  

or is it more similar to:

vector("**laundering activity**") - vector('**conviction','sentencing**') + vector('**allegations','accusations','charges**')

In [6]:
laundering_context1 = laundering_vector - context2_vector + context1_vector
laundering_context2 = laundering_vector - context1_vector + context2_vector

get_context_category(new_document_vector, laundering_context1, laundering_context2)

New document similarity with context 1 is 0.04724569
New document similarity with context 2 is 0.016226169


1