<center>
<div style="text-align: Center">Documents by Context Category</div>
<div style="text-align: Center">From Scratch</div>
<div style="text-align: Center">Pawel Sobieralski, 2022 </div>
</center>

# Solution Design - From Scratch

Determine which context category a document belongs to.


## Description:
This is a prototype solution built from basic numerical functions. 
The solution builds two context vectors in vocabulary space: one for the training corpus and one for an unseen document. It then compares the distance between these vectors in a given subset of dimensions. The subset of dimensions corresponds to a context category.
The training corpus vector in the dimensions of context 1 will be similar to the new document vector if the true document context is 1.
The training corpus vector in the dimensions of context 2 will be dissimilar to the new document vector if the true document context is 1.

Another approach I considered is to use linear relationships in doc2vec; see my other workbook.

Additional Python code introduces an interface and a class-style implementation of this solution.

## Example usage:
 
<strong>Preprocess</strong>  
train_doc = se_process_document(train_corpus)  
test_doc = se_process_document(test_document)

<strong>Build vector space</strong>  
vocab_dict = se_build_vocabulary(train_doc)

<strong>Build context vectors</strong>  
train_vector = se_build_context(train_doc, vocab_dict, 2)  
test_vector = se_build_context(test_doc, vocab_dict, 2)

<strong>Compare documents in given category</strong>  
se_compare_by_context(train_vector, test_vector, context1)

## Files list

This document - Jupyter Notebook with solution prototype  
context_category.py - Python class implementation  
utils.py - Utilities  

In [1]:
import sys
sys.path.append("../Python")

# Utilities
import utils

import pandas as pd
import numpy as np
from numpy import linalg as la

# Sample Corpus
These documents are all assumed to be money laundering documents. The sentences may not make sense to a human reader, but they have the property this project needs: a corpus with multiple contexts.

First 5 sentences: context 1 [allegations, accusations, charges]

Last 4 sentences: context 2 [conviction, sentencing]

Also included: a document not seen before

In [2]:
train_corpus = [
    "Human machine interface for lab abc allegations applications",
    "A survey of user opinion of allegations accusations charges time",
    "The EPS user interface management accusations",
    "Accusations and human accusations engineering testing of EPS",
    "Relation of user perceived charges time to error measurement",
    "The generation of random binary unordered conviction",
    "The intersection sentencing of paths in conviction",
    "Sentencing minors IV Widths of conviction and well quasi ordering",
    "Sentencing minors A survey",
]

context1 = ['allegations','accusations','charges']
context2 = ['conviction','sentencing']

# Previously unseen document
# It belongs to context 1 but does not contain any keyword from context 1
test_document = ["Machine management interface"]

# Process Document
Tokenize each document into individual words and remove stop words.

In [3]:
def se_process_document(corpus_doc):
    
    proc_doc = list(map(utils.process_document,corpus_doc)) 
    tokens_flat_list = [item for sublist in proc_doc for item in sublist]
    return tokens_flat_list

train_doc = se_process_document(train_corpus)
train_doc[:10]

['human',
 'machine',
 'interface',
 'lab',
 'abc',
 'allegations',
 'applications',
 'survey',
 'user',
 'opinion']
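
`utils.process_document` is not shown in this notebook. The block below is a minimal sketch of what such a helper might look like, assuming lowercase tokenization on letters and a small hand-written stop-word list. Both `process_document_sketch` and `STOP_WORDS` are assumptions for illustration, not the contents of `utils.py`.

```python
import re

# Hypothetical stop-word list; the real utils.py may use a different one
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to"}

def process_document_sketch(text):
    """Lowercase the text, split it on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

first_sentence = "Human machine interface for lab abc allegations applications"
print(process_document_sketch(first_sentence))
```

On the first training sentence this yields the same seven tokens that start `train_doc` above.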

# Build Vector Space
Builds a sorted vocabulary and indexes the words for faster lookup later.

The same vector space is used throughout training and test/validation.

In [4]:
def se_build_vocabulary(tc):

    vocab = list(set(tc))
    vocab.sort()
    
    vocab_dict = {}
    for index, word in enumerate(vocab):
        vocab_dict[word] = index
        
    return vocab_dict

vocab_dict = se_build_vocabulary(train_doc)
vocab_dict['allegations']

2
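
To make the indexing concrete, here is a small sketch of the same idea on a toy token list. `build_vocabulary_sketch` is an illustrative name; it uses a dict comprehension in place of the explicit loop above.

```python
def build_vocabulary_sketch(tokens):
    """Map each distinct token to its position in the sorted vocabulary."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}

# Toy input with one repeated token
tokens = ["human", "machine", "interface", "lab", "abc", "allegations", "machine"]
vocab = build_vocabulary_sketch(tokens)
print(vocab)
```

Duplicates collapse to a single entry, and sorting makes the index assignment reproducible between runs.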

# Build Context
Builds context as a co-occurrence matrix for a given corpus and vector space.

In [5]:
def se_build_context(corpus_list, vocab_dict, window):

    # Square word-by-word matrix, initialised to zero
    co_occurrence_vectors = pd.DataFrame(
        np.zeros([len(vocab_dict), len(vocab_dict)]),
        index = vocab_dict.keys(),
        columns = vocab_dict.keys()
    )
    
    for index, element in enumerate(corpus_list):
        
        # Neighbours up to `window` tokens on either side, clipped to the list
        start = max(0, index - window)
        finish = min(len(corpus_list), index + window + 1)
        
        context = corpus_list[start:index] + corpus_list[index+1:finish]
        for word in context:
            
            co_occurrence_vectors.loc[element, word] += 1
            
    return co_occurrence_vectors/len(corpus_list) #Scale

train_vector = se_build_context(train_doc,vocab_dict, 2)
train_vector.iloc[0:5,0:5]

| | abc | accusations | allegations | applications | binary |
|---|---|---|---|---|---|
| abc | 0.0 | 0.0 | 0.019231 | 0.019231 | 0.0 |
| accusations | 0.0 | 0.076923 | 0.019231 | 0.0 | 0.0 |
| allegations | 0.019231 | 0.019231 | 0.0 | 0.019231 | 0.0 |
| applications | 0.019231 | 0.0 | 0.019231 | 0.0 | 0.0 |
| binary | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
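
The windowed counting can be easier to follow on a three-token list. The sketch below uses a plain NumPy array instead of a DataFrame and skips the final scaling. `cooccurrence_counts` and the toy `vocab` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cooccurrence_counts(tokens, vocab, window):
    """Count, for each token, the tokens within `window` positions of it."""
    counts = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        finish = min(len(tokens), i + window + 1)
        for neighbour in tokens[start:i] + tokens[i + 1:finish]:
            counts[vocab[word], vocab[neighbour]] += 1
    return counts

tokens = ["machine", "management", "interface"]
vocab = {"interface": 0, "machine": 1, "management": 2}
counts = cooccurrence_counts(tokens, vocab, window=1)
print(counts)
```

With `window=1` only adjacent tokens are counted, so `machine` and `interface` never co-occur, while `management` co-occurs once with each of them.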


# Measure of Similarity - Pair Wise
Example of the top words by pairwise similarity; for example, charges is fairly close to accusations (cosine similarity of about 0.37).
This is for research only; the solution uses vector similarity, not pairwise similarity.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_words = pd.DataFrame(
    
    cosine_similarity(train_vector),
    columns = vocab_dict.keys(),
    index = vocab_dict.keys()

)

similarity_words.loc['charges'].sort_values(ascending=False).head(10)

charges        1.000000
user           0.507093
time           0.500000
measurement    0.474342
opinion        0.474342
relation       0.474342
perceived      0.474342
eps            0.400000
management     0.387298
accusations    0.368932
Name: charges, dtype: float64
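
For readers without scikit-learn, the row-by-row computation can be sketched with NumPy alone. `pairwise_cosine` is an illustrative name, and treating all-zero rows as having similarity 0 to everything is an assumption of this sketch.

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = matrix / norms
    return unit @ unit.T

m = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
sim = pairwise_cosine(m)
print(sim.round(3))
```

Rows 0 and 1 are 45 degrees apart, so their similarity is about 0.707; the zero row scores 0 against every row, including itself.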

# Measure of Similarity - Vector Wise

In [7]:
def se_cosine_similarity(v1,v2):
    
    denominator = la.norm(v1) * la.norm(v2)
    
    if denominator > 0:
        return v1.dot(v2) / denominator
    else:
        return 0
    
v = np.array([40,11,4,11,14]) #Test

se_cosine_similarity(v,v)

0.9999999999999998
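
For reference, the quantity `se_cosine_similarity` computes is the standard cosine of the angle between two vectors; the `else` branch defines the result as 0 when either vector has zero length:

$$
\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}, \qquad \lVert v_1 \rVert \, \lVert v_2 \rVert > 0
$$

As a worked case, $v_1 = (1, 0)$ and $v_2 = (1, 1)$ give $\frac{1}{1 \cdot \sqrt{2}} \approx 0.707$. Comparing a vector with itself should give exactly 1; the value 0.9999999999999998 above differs from 1 only by floating-point rounding.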

In [8]:
se_cosine_similarity(np.array(train_vector['charges']),np.array(train_vector['accusations']))

0.36893239368631087

In [9]:
#Test OK
se_cosine_similarity(np.array(train_vector['charges']),np.array(train_vector['human']))

0.273861278752583

# Document Unseen Before
Builds context from the unseen document in exactly the same vector space as the training corpus.

In [11]:
test_doc = se_process_document(test_document)
test_vector = se_build_context(test_doc,vocab_dict,2)
test_vector.iloc[0:5,0:5]

| | abc | accusations | allegations | applications | binary |
|---|---|---|---|---|---|
| abc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| accusations | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| allegations | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| applications | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| binary | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
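
`se_build_context` looks up every token in the co-occurrence matrix by label, so a test token that is absent from `vocab_dict` has no row or column to update and the lookup would most likely fail with a `KeyError`. The test sentence here only uses words from the training corpus. For other documents, one option (an assumption, not something this notebook does) is to drop out-of-vocabulary tokens first; `keep_known_tokens` is a hypothetical helper.

```python
def keep_known_tokens(tokens, vocab_dict):
    """Return only the tokens that have an index in the training vocabulary."""
    return [t for t in tokens if t in vocab_dict]

# Toy vocabulary and a document containing one unknown word
toy_vocab = {"interface": 0, "machine": 1, "management": 2}
known = keep_known_tokens(["machine", "learning", "interface"], toy_vocab)
print(known)
```

Dropping tokens shortens the document, which also changes the scaling by `len(corpus_list)`, so this is a trade-off rather than a free fix.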


# Compare Documents by Context
Compares two documents within the dimensions of a given context.

In [12]:
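def se_compare_by_context(doc1, doc2, dimensions):
    
    # Convert each per-dimension cosine similarity to an angle in radians
    # so that the angles can be averaged. A keyword that never occurs in a
    # document gives a zero vector, a similarity of 0 and an angle of pi/2.
    angle = 0.0
    for d in dimensions:
        
        similarity = se_cosine_similarity(np.array(doc1[d]), np.array(doc2[d]))
        angle += np.arccos(np.clip(similarity, -1.0, 1.0))
    
    # Average once, after all dimensions have been accumulated
    angle = angle / len(dimensions)
    
    return (dimensions,np.cos(angle))

# Tests

In [13]:
def se_get_context_score(train_vector, test_vector, context_list):
    
    return round(se_compare_by_context(train_vector, test_vector, context_list)[1],3)

# Tuples rather than lists, so they can be used as dictionary keys
context1 = ('allegations','accusations','charges')
context2 = ('conviction','sentencing')

doc_by_context = {}
doc_by_context[context1] = se_get_context_score(train_vector, test_vector, context1)
doc_by_context[context2] = se_get_context_score(train_vector, test_vector, context2)

if doc_by_context[context1] > doc_by_context[context2]:
    print('Test sentence is in ' + ','.join(context1) + ' context')
    print('Score: ' + str(doc_by_context[context1]))
else:
    print('Test sentence is in ' + ','.join(context2) + ' context')
    print('Score: ' + str(doc_by_context[context2]))

Test sentence is in conviction,sentencing context
Score: 0.0

The test document contains none of the context keywords, so its columns for those keywords are all zero and every per-dimension similarity is 0. Averaged over all dimensions, both contexts therefore score 0.0, and the tie falls through to the `else` branch. If the average and the `return` sit inside the loop, only the first dimension is used and the angle pi/2 is divided by the number of keywords, which gives cos(pi/6) = 0.866 for the three-keyword context and cos(pi/4) = 0.707 for the two-keyword context regardless of the document.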
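
The averaging in `se_compare_by_context` is meant to work on angles rather than on cosines directly. Here is a minimal standalone sketch of that step, assuming the per-dimension similarities have already been computed. `mean_angle_score` is an illustrative name, not part of the notebook.

```python
import numpy as np

def mean_angle_score(similarities):
    """Average cosine similarities on the angle scale, then map back to a cosine."""
    sims = np.clip(np.asarray(similarities, dtype=float), -1.0, 1.0)
    return float(np.cos(np.mean(np.arccos(sims))))

# Angles of 0, 60 and 90 degrees average to 50 degrees
score = mean_angle_score([1.0, 0.5, 0.0])
print(round(score, 3))
```

The clipping guards against similarities such as 1.0000000000000002 that floating-point rounding can produce and that `np.arccos` would otherwise turn into `nan`.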
