In [5]:
import os
import pickle
import spacy
import pandas as pd
import settings

# Distributed vector representation model. Training and results.

All the hard preparatory work is left out of this report and will be issued as a separate document
in the near future. For now, the final results matter more to us.
This document is a terse description of the work stages with some results.

As input for training the network and creating the model we use one large text file (49.5 MB, about 7,000,000 words).
The file consists of texts downloaded from sec.gov.

The network is initialized and trained with the following parameters:
1. Vector size - 100
2. Context window - 5

Training took about 2 hours on an ordinary computer.

In [6]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [7]:
NORMALIZED_TEXT_FILE_NAME = 'normalized_text.txt'
normalized_text_file = os.path.join(settings.NORMALIZED_DATA_PATH, NORMALIZED_TEXT_FILE_NAME)

In [8]:
# Load normalized text
normalized_text = LineSentence(normalized_text_file)
# Path where model will be saved
word2vec_filepath = os.path.join(settings.NORMALIZED_DATA_PATH, 'word2vec_model')

### Start distributed model training. Set vector dimension and number of epochs.

At this stage we train the model.

In [6]:
# Run it to retrain our model. Model will be saved in "word2vec_filepath"
# Most important training parameters:
# size - word vector dimension
# window - context size

# Training parameters
# Word vector dimension 
vector_dim = 100
# Context size
context_size = 15
# Training epochs
epochs = 20

# Set to False to use the pretrained model
if True:
    # Take the trigram text and run the initial training
    text2vec = Word2Vec(normalized_text, size=vector_dim, window=context_size,
                        min_count=20, sg=1, workers=4)
    # Save first iteration
    text2vec.save(word2vec_filepath)
    # Train the remaining epochs, saving the model to "word2vec_filepath" after each one
    for i in range(1,epochs):
        text2vec.train(normalized_text)
        print("\rThis is epoch N-{}".format(i), flush=True)
        text2vec.save(word2vec_filepath)
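
To make the `window` parameter more concrete, here is a minimal sketch of how a skip-gram model (`sg=1`) pairs each word with its neighbours. It is an illustration only: the function `skipgram_pairs` and the sample sentence are hypothetical, and the window is fixed here, while the library's own sampling of context words differs in detail.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, tokens[j]


# Hypothetical normalized sentence; n-grams are already joined with "_"
sentence = ["company", "record", "pretax_gain", "on", "sale"]

pairs = list(skipgram_pairs(sentence, window=1))
for center, context in pairs:
    print(center, "->", context)
```

A larger window produces more pairs per word, so the model sees broader (more topical) context at the cost of longer training.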

### Load trained model

At this stage we load the trained model into memory.

In [9]:
# Load vector representation from trained 'word2vec_model'
text2vec = Word2Vec.load(word2vec_filepath)
text2vec.init_sims()

# Shows number of training epochs
print('{} training epochs.'.format(text2vec.train_count))

20 training epochs.


In [11]:
# Create a list of (term, index, count) tuples
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in text2vec.vocab.items()]

# Sort the vocabulary by count, rarest terms first
ordered_vocab = sorted(ordered_vocab, key=lambda entry: entry[2]) # use -entry[2] for most frequent first

# Make three lists of: terms, indices, counts
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# Create a pandas DataFrame of the word vector representation
word_vectors = pd.DataFrame(text2vec.syn0norm[term_indices, :],
                            index=ordered_terms)
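
The cell above sorts the vocabulary in ascending order of count, so the rarest terms come first. If the most frequent terms should come first instead, the sort and the `zip(*...)` unpacking could look like the sketch below. The three tuples are made-up stand-ins for the real vocabulary entries.

```python
# Hypothetical (term, index, count) tuples standing in for the real vocabulary
vocab = [("cost", 0, 900), ("bad_faith", 2, 40), ("brand", 1, 310)]

# Sort by count, most frequent first
by_count_desc = sorted(vocab, key=lambda entry: entry[2], reverse=True)

# Split the sorted tuples into three parallel tuples
terms, indices, counts = zip(*by_count_desc)

print(terms)
print(indices)
print(counts)
```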

### Word vector representation

Below are some of the words (30) from our M&A corpus and their 100-dimensional vector representations.
Every vector represents the meaning of the word in our context.
Every number in each dimension of a vector is a result of the neural network training.

As we can see, the preprocessing algorithm built some meaningful phrases (n-grams) out of the raw texts.
These are all the words with an underscore. The neural network treats each of them as one word/token.
For example:
* collision_repair
* dealership_portfolio
* deutsche_bank
* september_22_2014
* wilmington_trust_national_association
* pretax_gain
* bad_faith

So they really do look like meaningful phrases.


In [23]:
word_vectors[500:530]

word,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
collision_repair,-0.234665,-0.105964,-0.18197,-0.027173,0.216593,0.066296,-0.018588,0.122088,0.022738,-0.064255,...,-0.013122,-0.028042,0.0154,0.053156,0.093446,-0.090473,-0.051553,0.030833,-0.049862,-0.029715
jones,-0.14629,-0.352505,0.018154,0.048279,0.080486,-0.036981,-0.064703,-0.095003,0.066536,-0.008441,...,0.048305,-0.091004,-0.070644,0.008029,-0.037263,-0.123986,-0.007194,-0.096259,-0.078466,-0.129087
pr_newswire,-0.032324,-0.155148,0.002653,0.003505,-0.018679,0.077924,0.125363,-0.097271,0.207067,-0.099413,...,-0.065402,0.132563,-0.154628,0.127788,0.13929,-0.191377,0.004012,-0.042212,-0.106033,-0.041239
continually_evaluate,-0.033985,-0.229221,0.134431,0.003934,0.131076,0.109065,-0.18933,0.142083,-0.027587,-0.1914,...,-0.021968,0.172532,0.051193,0.120674,0.065641,-0.089725,0.132214,0.004765,0.022643,-0.076314
indiana,0.035653,-0.244028,0.065464,0.187814,-0.003585,-0.016603,0.066988,-0.016253,0.15356,0.180725,...,-0.038464,0.158812,0.096473,-0.138742,0.036389,0.030722,-0.195299,0.031037,-0.141276,-0.081856
dealership_portfolio,-0.189963,-0.149379,-0.080655,0.049316,0.027082,-0.028766,-0.005897,0.09584,-0.0733,-0.096352,...,-0.053223,0.128556,0.135245,0.147787,0.088493,-0.05578,0.064052,0.076589,0.075423,0.170601
brooks_pierce_mclendon_humphrey,-0.16039,-0.313627,0.037305,-0.030371,0.026818,0.000555,0.072036,-0.179898,0.160174,-0.078982,...,0.080582,-0.064754,-0.118342,-0.001408,-0.073989,-0.092229,-0.040657,-0.026282,-0.061205,-0.102548
renewal_refunding,-0.208038,-0.105159,0.16163,-0.040642,0.054332,-0.163459,0.113355,0.080622,0.021484,-0.053682,...,0.037282,0.193606,0.035139,0.094498,0.055051,-0.112136,0.043139,-0.118641,0.054777,0.127092
avenue,0.131787,-0.148764,-0.220184,0.020768,0.040241,-0.008429,-0.035631,0.12541,-0.030789,-0.071048,...,0.006728,-0.10686,0.017012,-0.014196,0.044144,-0.017346,-0.143514,0.132297,-0.015919,0.01685
utility,0.082244,-0.072461,-0.126125,0.108329,0.079772,-0.166692,-0.052188,0.108333,0.140084,-0.066708,...,0.004082,-0.109741,-0.109243,0.030326,-0.01897,-0.027581,0.080442,0.05487,0.017541,-0.063395


### Word similarity function

In [25]:
# Get similar/related words in context
def get_context_related_words(token, topn=10):
    '''Returns topn context related words as a dictionary'''
    word_sim = {}
    for word, similarity in text2vec.most_similar(positive=[token], topn=topn):
        word_sim.update({word:similarity})
    return word_sim

### Testing the model's performance. Trying to find semantically close words and phrases.

Now we test how well our trained model finds the 20 most semantically/contextually close words for:
* 'cost'
* 'brand'
* 'september_22_2014'
* 'renewal_refunding'

And even for something as abstract as 'bad_faith' (whatever it means).
The numbers after the colon are the cosine similarity between the two word vectors: the higher the number, the closer the contextual meaning.
As we can see, the results are not bad and look promising.
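
For readers who want to see what such a score means, here is a minimal sketch of a cosine-similarity ranking. It uses made-up 3-dimensional vectors rather than the trained model, and `most_similar_sketch` is a hypothetical helper, not the library function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors; the real model has 100 dimensions per word
toy_vectors = {
    "cost": [0.9, 0.1, 0.0],
    "expense": [0.8, 0.3, 0.1],
    "aviation": [0.0, 0.2, 0.9],
}


def most_similar_sketch(token, topn=2):
    """Rank every other word by cosine similarity to `token`."""
    query = toy_vectors[token]
    scores = [(word, cosine_similarity(query, vector))
              for word, vector in toy_vectors.items() if word != token]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar_sketch("cost")
for word, score in ranked:
    print(word, round(score, 3))
```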

In [21]:
get_context_related_words('cost', topn=20)

{'153.7': 0.38838207721710205,
 '2,472.9': 0.42199376225471497,
 '2,494.1': 0.4554964601993561,
 '27.7': 0.48493120074272156,
 '3,663.0': 0.4689284861087799,
 'amortize': 0.5120112895965576,
 'brokerage': 0.39271965622901917,
 'deferred': 0.3934595584869385,
 'discount': 0.40827229619026184,
 'expense': 0.44802170991897583,
 'expensing': 0.39666426181793213,
 'extraordinary': 0.39504194259643555,
 'gain': 0.3928598165512085,
 'human_resource': 0.43243271112442017,
 'legal': 0.414577454328537,
 'obstacle': 0.397826224565506,
 'pricing': 0.40854325890541077,
 'recognition': 0.4200395345687866,
 'retrospective_adoption_approximately': 0.4277220666408539,
 'technology': 0.39840424060821533}

In [26]:
get_context_related_words('brand', topn=20)

{'192907': 0.48337626457214355,
 '31': 0.45746520161628723,
 '70': 0.4456402063369751,
 'capability': 0.5140751600265503,
 'central_transportation': 0.4337252080440521,
 'core': 0.4336196184158325,
 'daily_newspaper': 0.4709404408931732,
 'dealership': 0.4575847387313843,
 'digital': 0.5604076385498047,
 'follow_chart': 0.5184388756752014,
 'internet': 0.4958656430244446,
 'location': 0.45098748803138733,
 'manufacture': 0.5362410545349121,
 'many_different_industry': 0.5087342858314514,
 'network': 0.46270158886909485,
 'nonvested': 0.4447895288467407,
 'primary': 0.4532267153263092,
 'reportable_segment': 0.45416444540023804,
 'sheet': 0.4404601454734802,
 'thousand': 0.4499105215072632}

In [27]:
get_context_related_words('september_22_2014', topn=20)

{'00104762': 0.5499814748764038,
 '12_2011': 0.5256521105766296,
 '24_2002': 0.530756950378418,
 '89120': 0.5044031739234924,
 '9_2012': 0.5485042929649353,
 'april_5': 0.5080548524856567,
 'august_10_2005': 0.5389214158058167,
 'august_12': 0.5492216348648071,
 'definitive_proxy': 0.5590988397598267,
 'february_5': 0.5649344325065613,
 'february_6_2008': 0.5622209310531616,
 'instructions': 0.49542659521102905,
 'july_19': 0.5380418300628662,
 'kpmg_llp_independent': 0.5129932165145874,
 'november_14': 0.5055705308914185,
 'november_18_2011': 0.5434110760688782,
 'omit': 0.509431004524231,
 'rule_430b': 0.49982258677482605,
 'rule_430c': 0.5344349145889282,
 'schedule_14a': 0.5104280114173889}

In [30]:
get_context_related_words('renewal_refunding', topn=20)

{'36.0': 0.4212509095668793,
 'arrangement': 0.4238327741622925,
 'client_sit_weekly': 0.40826311707496643,
 'contain': 0.4170317053794861,
 'contractual_encumbrance': 0.4866047501564026,
 'depositary': 0.486359566450119,
 'different_currency': 0.4359331727027893,
 'disadvantageous': 0.4139227569103241,
 'instrument': 0.43563172221183777,
 'item_9.01': 0.41418778896331787,
 'legally_bind_write': 0.4364796578884125,
 'periodic_reporting': 0.4792257249355316,
 'promptly_disclose': 0.40269172191619873,
 'proviso': 0.42787623405456543,
 'refunding_extension': 0.44858142733573914,
 'remain_unsold': 0.46572163701057434,
 'rescission': 0.4025128483772278,
 'typically_visit': 0.41896378993988037,
 'undersigned_registrant_undertake': 0.47711119055747986,
 'unutilized_commitment': 0.4239489436149597}

In [31]:
get_context_related_words('bad_faith', topn=20)

{'18_1101': 0.46810561418533325,
 '5.02': 0.5020743012428284,
 'aviation': 0.4724143445491791,
 'competent_jurisdiction': 0.5114517211914062,
 'donor': 0.481427937746048,
 'enterprise_nonprofit_entity': 0.4395328164100647,
 'former_governing': 0.46154260635375977,
 'fraud': 0.4677311182022095,
 'fraud_willful_violation': 0.6946159601211548,
 'gross_negligence': 0.7734322547912598,
 'hold_harmless': 0.4359149932861328,
 'indicate_when_signing': 0.4479483664035797,
 'intentional_misconduct': 0.5549612045288086,
 'loyalty_owe': 0.560850977897644,
 'misconduct': 0.5061584711074829,
 'nonexempt_prohibit': 0.47448039054870605,
 'particularly_important': 0.4628704786300659,
 'submit_evidence': 0.44530439376831055,
 'willful': 0.5116668343544006,
 'willful_misconduct': 0.44860759377479553}

### Word meaning linear algebra function

Since our words are represented as vectors and live in a linear space,
we can try to apply some linear operations to them, such as vector addition and vector subtraction.
Apart from these two operations there are many others, such as linear operators, matrix inversion and so on.
Currently they are not implemented, but I believe they can also be useful.

The first two operations, namely vector addition and subtraction, have proved useful for meaning extraction.
So let's try some of them:

In [33]:
def word_algebra(add=None, subtract=None, topn=1):
    '''Prints the topn words closest to the combined vector:
    words in add are added, words in subtract are subtracted, e.g.
    add=['token1','token2']
    subtract=['token1','token2']
    '''
    answers = text2vec.most_similar(positive=add or [], negative=subtract or [], topn=topn)
    for term, similarity in answers:
        print(term)
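
To show what happens behind such a query, here is a minimal sketch with made-up 2-dimensional vectors. It assumes the usual scheme: normalize each input vector, add the `add` vectors, subtract the `subtract` vectors, then pick the nearest remaining word by cosine similarity. The names `unit`, `combine` and `nearest` and the compass words are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine(vectors, add=(), subtract=()):
    """Sum the unit vectors of `add`, subtract those of `subtract`, renormalize."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in add:
        total = [t + x for t, x in zip(total, unit(vectors[word]))]
    for word in subtract:
        total = [t - x for t, x in zip(total, unit(vectors[word]))]
    return unit(total)


def nearest(vectors, add=(), subtract=(), topn=1):
    """Words closest to the combined vector, excluding the input words."""
    query = combine(vectors, add, subtract)
    used = set(add) | set(subtract)
    scored = [(word, sum(q * x for q, x in zip(query, unit(vector))))
              for word, vector in vectors.items() if word not in used]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:topn]]


# Made-up compass vectors, chosen so the geometry is easy to check by eye
toy = {
    "east": [1.0, 0.0],
    "north": [0.0, 1.0],
    "northeast": [1.0, 1.0],
    "southeast": [1.0, -1.0],
    "southwest": [-1.0, -1.0],
}

print(nearest(toy, add=["east", "north"]))        # east + north
print(nearest(toy, add=["east"], subtract=["north"]))  # east - north
print(nearest(toy, subtract=["east", "north"]))   # -(east + north)
```

Note the difference between the last two calls: putting both words in `subtract` negates their sum, which is not the same as subtracting one word from the other.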

### Adding and subtracting meanings

### Addition
Let's try to add some meanings:
* 'cost' + 'brand' makes 'human_resource'; surprisingly, that makes some sense to me.
* 'brand' + 'bad_faith' makes 'gross_negligence'
* 'instrument' + 'enhance' makes 'evaluate'
* 'internet' + 'brand' makes 'advertising', which really surprised me!
* 'periodic_reporting' + 'follow_chart' makes 'accurate', also surprisingly exact!

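### Subtraction
Let's try to subtract meanings. Note that the cells below pass both words in `subtract` and leave `add` empty,
so each query is the negated sum of the two vectors, -('a' + 'b'), rather than 'a' - 'b'.
To compute 'a' - 'b' itself, call `word_algebra(add=['a'], subtract=['b'])`.
*  subtracting 'cost' and 'brand' gives 'correspondingly_reduce', great!
*  subtracting 'internet' and 'brand' gives 'encumbrance', hmm!
*  subtracting 'internet' and 'human_resource' gives 'redesignation'
*  subtracting 'august_12' and 'july_19' gives 'ratio_threshold', strange, maybe because of the numbers.
*  subtracting 'expense' and 'discount' gives 'protection'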

So, after some simple experiments with manipulating word meanings in linear space on the M&A corpus,
we can conclude with a high degree of plausibility that the results we got are not just random coincidences,
but have a profound and as yet undiscovered foundation behind them.

It is really cool and surprising to me.
The experiments also showed that the result depends strongly on the quality of the raw text.
So we have to pay special attention to the quality and the size of the corpus.
Since this experiment used only texts from sec.gov, which contain many numbers, I see a strong
bias of the contextual meaning toward numbers. We have to do something about this problem.
I suggest tagging the terms that repeat very often and carry little meaning.
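
One possible way to do such tagging is sketched below: every purely numeric token and every month-based date token is replaced by a single placeholder before training. This is only an assumption about how the tagging could work; the tag names `NUM` and `DATE`, the patterns and the helper functions are all hypothetical.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|"
          "august|september|october|november|december")

# Plain numbers such as 153.7, 2,472.9 or 192907
NUMBER = re.compile(r"^\d[\d,]*(\.\d+)?$")
# Date n-grams such as september_22_2014 or august_12
DATE = re.compile(r"^(" + MONTHS + r")_\d{1,2}(_\d{4})?$")


def tag_token(token):
    """Replace a numeric or date token with a placeholder tag, keep others as is."""
    if NUMBER.match(token):
        return "NUM"
    if DATE.match(token):
        return "DATE"
    return token


def tag_line(line):
    """Apply tag_token to every whitespace-separated token of a normalized line."""
    return " ".join(tag_token(token) for token in line.split())


print(tag_line("pretax_gain 2,472.9 record september_22_2014"))
```

With this preprocessing thousands of distinct numbers collapse into one token, so they can no longer crowd the nearest-neighbour lists of ordinary words.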

In [34]:
word_algebra(add=['cost','brand'], subtract=[], topn=1)

human_resource


In [36]:
word_algebra(add=['brand','bad_faith'], subtract=[], topn=1)

gross_negligence


In [37]:
word_algebra(add=[], subtract=['cost','brand'], topn=1)

correspondingly_reduce


In [49]:
word_algebra(add=['instrument','enhance'], subtract=[], topn=1)

evaluate


In [53]:
word_algebra(add=['internet','brand'], subtract=[], topn=1)

advertising


In [54]:
word_algebra(add=[], subtract=['internet','brand'], topn=1)

encumbrance


In [59]:
word_algebra(add=[], subtract=['august_12','july_19'], topn=1)

ratio_threshold


In [62]:
word_algebra(add=[], subtract=['expense','discount'], topn=1)

protection


In [71]:
word_algebra(add=['periodic_reporting','follow_chart'], subtract=[], topn=1)

accurate


## t-SNE: t-distributed stochastic neighbor embedding

t-SNE maps high-dimensional data to 2 or 3 dimensions.
The results are used to present the word vector space graphically.

In [72]:
from sklearn.manifold import TSNE

In [74]:
# Number of vectors to apply t-SNE 
tsne_vectors = 5000

# Take data from the pandas DataFrame. Remove stopwords from it.
tsne_input = word_vectors.drop(spacy.en.STOPWORDS, errors=u'ignore')

# Take the first `tsne_vectors` rows (word_vectors is sorted by ascending count)
tsne_input = tsne_input.head(tsne_vectors)

# Path to save the model in binary file 'tsne_model'
tsne_filepath = os.path.join(settings.NORMALIZED_DATA_PATH, 'tsne_model')

# Path to save the vectors in binary file 'tsne_vectors.npy'
tsne_vectors_filepath = os.path.join(settings.NORMALIZED_DATA_PATH, 'tsne_vectors.npy')
print("t-SNE vectors saved")

t-SNE vectors saved


### t-SNE training

At this stage we train the t-distributed stochastic neighbor embedding.

In [75]:
# Trains t-SNE dimension reduction. !!!Check additional tweaks.
# Saves t-sne model in file 'tsne_filepath'
# Saves t-sne vectors in file 'tsne_vectors_filepath'
if True:
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)
    pd.np.save(tsne_vectors_filepath, tsne_vectors)
    print("t-SNE training completed")

t-SNE training completed
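
The `if True:` switch used for the expensive steps can also be written as a small cache helper that trains only when no saved result exists. The sketch below shows the pattern with `pickle` and a stand-in computation; `load_or_compute` and `expensive_step` are hypothetical names, and a temporary directory replaces the real data path.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`; compute and save it first if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_step():
    """Stand-in for a long-running fit; records how many times it actually ran."""
    calls.append(1)
    return {"x_coord": [0.1, 0.2], "y_coord": [0.3, 0.4]}


cache_path = os.path.join(tempfile.mkdtemp(), "tsne_vectors.pkl")

first = load_or_compute(cache_path, expensive_step)   # computes and saves
second = load_or_compute(cache_path, expensive_step)  # loads from disk
print(len(calls))
```

To force retraining with this pattern, delete the cached file instead of editing the notebook cell.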


### Load pretrained t-SNE model

Load the trained t-SNE dimensionality reduction model and the reduced vectors.
This model tries to find a transformation of the vectors that best separates the meanings.

In [76]:
# Loads t-SNE models
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)

# Loads t-SNE vectors from 'tsne_vectors_filepath'
tsne_vectors = pd.np.load(tsne_vectors_filepath)

# Converts tsne_vectors to a pandas DataFrame
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])

print("t-SNE model loaded")

t-SNE model loaded


### Checking the reduced vector space

Now we check that the vector space was really reduced to 2 dimensions, x and y.

In [79]:
# Just shows everything is OK.
tsne_vectors[500:530]

word,x_coord,y_coord
collision_repair,-8.721708,-0.881913
jones,4.599451,-12.735028
pr_newswire,-2.907264,-8.021707
continually_evaluate,-7.654195,-1.998917
indiana,-8.487156,7.454317
dealership_portfolio,-6.098268,-2.869543
brooks_pierce_mclendon_humphrey,4.614096,-12.730273
renewal_refunding,2.300097,-2.56723
avenue,1.857357,12.313571
utility,-9.701994,-3.30124


In [80]:
# Copies the index (the words) into a 'word' column, used by the hover tooltip
tsne_vectors['word'] = tsne_vectors.index

## Plot t-SNE data

Here we'll try to see our vector space nonlinearly transformed and projected onto 2-dimensional space.

In [81]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [82]:
# Add tsne_vectors from DataFrame to bokeh as ColumnDataSource 
plot_data = ColumnDataSource(tsne_vectors)

# Create plot
tsne_plot = figure(title='Word embedding for M&A domain vector space',
                   plot_width = 800,
                   plot_height = 800,
                   tools= ('pan, wheel_zoom, box_zoom,'
                           'box_select, resize, reset'),
                   active_scroll='wheel_zoom')

# Add hover tool to plot
tsne_plot.add_tools(HoverTool(tooltips = '@word'))

# Plot words as circle
tsne_plot.circle('x_coord', 'y_coord', source=plot_data,
                 color='blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color='black')
# Title
tsne_plot.title.text_font_size = value('16pt')

# Axis parameters
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

In [83]:
show(tsne_plot)

The projection of the word vector space onto the 2-dimensional plane shows signs of clustering.
In reality the true clustering of meanings exists on the surface/volume of a manifold embedded in
100-dimensional space, which is very hard for humans to imagine (maybe kitties can).
You can imagine a sphere with colored spots scattered over its surface that tend to gravitate toward the centers
of meanings.

## Conclusion

The first results of testing the algorithm and the generated model showed a strong relation between the linear
vector space representation of words and the semantics of human language, in our case English.
As for me, some results of the addition and subtraction operations astonished me so much that I am planning
to write some kind of philosophical essay on the theme in the near future.

This was the first test with a big enough corpus of the M&A domain: about 7,000,000 words, 49.5 MB.
The corpus included only SEC materials.

### Experiments have shown

1. A big and clean corpus is the main precondition for a good result.
I think we should extend our corpus to 1 GB or more.
Like any categorical algorithm, it cannot handle a word it has never seen. So size matters.
We should add texts from other sources to our corpus, such as wiki, reddit and others. The more the better.

2. We should continue to experiment with different parameters of the algorithm.
We will try to train the network with a larger vector dimension, from 300 to 500, and different context windows, maybe up to 30.

3. We should continue to experiment with different neural network models,
particularly deep neural networks. I believe they are promising.

4. We should build a testing engine to measure different implementations of the algorithms.
For that we need the real or quasi-real texts that Dealroom clients are going to use.
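
As a starting point for the testing engine from item 4, one simple measure is precision at k: for each query word, the share of the top-k neighbours that a human marked as related. The sketch below is only one possible design. The gold set, the stub model and its scores are made up for illustration, and any function with the signature `most_similar(query, k)` could be plugged in.

```python
def precision_at_k(most_similar, gold, k):
    """Average share of the top-k neighbours that appear in the gold set."""
    scores = []
    for query, expected in gold.items():
        neighbours = [word for word, _ in most_similar(query, k)]
        hits = sum(1 for word in neighbours if word in expected)
        scores.append(hits / k)
    return sum(scores) / len(scores)


# Hypothetical hand-labelled pairs: query word -> words judged to be related
gold = {
    "bad_faith": {"gross_negligence", "willful_misconduct"},
    "cost": {"expense", "pricing"},
}


def stub_model(query, k):
    """Stand-in for a trained model; returns canned (word, score) pairs."""
    canned = {
        "bad_faith": [("gross_negligence", 0.9), ("aviation", 0.5)],
        "cost": [("expense", 0.9), ("pricing", 0.8)],
    }
    return canned[query][:k]


score = precision_at_k(stub_model, gold, k=2)
print(score)
```

Running the same gold set against models trained with different vector sizes or windows would give one comparable number per model.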

### About implementation of algorithm with Dealroom

This algorithm is very abstract: it can extract the meaning from any text and operate on it in the context of
the pretrained model. This universality is its main advantage. We can compare the closeness of meaning of any
client texts and pick the best match. This can be done with a simple softmax function.
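
A minimal sketch of such a comparison follows. It assumes that a text is represented by the average of its word vectors, which is one common choice and not something fixed by this report. The toy vectors, the texts and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def text_vector(tokens, vectors):
    """Average the vectors of the tokens the model knows; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Made-up 2-dimensional vectors standing in for the trained model
toy = {
    "merger": [1.0, 0.0],
    "acquisition": [0.9, 0.1],
    "lease": [0.0, 1.0],
    "rent": [0.1, 0.9],
}

query = ["merger", "agreement"]            # "agreement" is not in the toy model
candidates = [["acquisition"], ["lease", "rent"]]

query_vec = text_vector(query, toy)
similarities = [cosine(query_vec, text_vector(c, toy)) for c in candidates]
probabilities = softmax(similarities)
best = candidates[probabilities.index(max(probabilities))]
print(best, probabilities)
```

The softmax does not change which candidate wins; it only rescales the similarities into weights that are easier to compare or threshold.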

### Implementation workflow
 
This is the next stage of our work, which comes after we train and test the model on the new, big corpus.

In any case, I see it this way.
I train the model on a local machine with the big and clean corpus (by the way, training can take from days up to a week on an ordinary computer). After the model is trained it can be uploaded as a file (its size is up to 1 GB, maybe more) to an Amazon instance and used by the Dealroom service. This is the preferred way because of message delays. The model can also be loaded and run on any local computer with enough RAM, but I suspect network delays will be unacceptable.
Periodically, as we get new results from clients, we retrain the model and reload it, hoping that it will
be better than the previous one.