## Explore Daedalus Word Vector Space Models

Word embeddings (also called word vectors or distributed representations) are **dense** representations of words in a (relatively) **low-dimensional vector space**, in contrast to e.g. one-hot representations, which are high-dimensional and sparse.
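
To make the contrast concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the 2-dimensional dense values are invented for illustration and do not come from any trained model.

```python
import math

vocab = ["king", "queen", "apple"]

def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 2-dimensional dense vectors, invented for illustration only.
dense = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Distinct one-hot vectors are always orthogonal, so they carry no similarity signal.
one_hot_sim = cosine(one_hot("king"), one_hot("queen"))
# Dense vectors can place related words close to each other.
dense_sim_close = cosine(dense["king"], dense["queen"])
dense_sim_far = cosine(dense["king"], dense["apple"])
```

With one-hot vectors the dimensionality equals the vocabulary size, whereas the dense dimensionality is chosen independently of it.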


- [1]	Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
- [2]	Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
- [3]	Optimizing word2vec in gensim, [Link](http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/)


Word2vec skip-gram:
- small window (e.g. 2, i.e. a span of 5 words: the target plus 2 words to the left and right) to get an interchangeability model
- larger window (e.g. 50) to get an attribute model
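
As a rough illustration of what the window size controls, the sketch below generates the (target, context) pairs a skip-gram model would train on. The function name `skipgram_pairs` and the example sentence are made up for this note, and real implementations may additionally subsample words or vary the effective window.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the old king was crowned in the cathedral".split()

# window=2: each target sees at most 2 words on either side (a 5-word span).
narrow = skipgram_pairs(tokens, window=2)
# window=50: wider than the sentence, so every other word is a context.
wide = skipgram_pairs(tokens, window=50)

king_contexts = [context for target, context in narrow if target == "king"]
```

With the small window, "king" is only paired with its immediate neighbours; with the large window it is paired with every word in the sentence, which is what pushes the model towards topical (attribute) similarity.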

https://www.youtube.com/watch?v=tAxrlAVw-Tk&t=17s

What is the closest word to "king"? Is it "Canute" or is it "crowned"? There are many ways to define "similar words" and "similar texts", and which word embedding to use depends on your definition. A new generation of word embeddings that use morphological information and learning-to-rank has been added to the Gensim open source NLP package: Facebook's FastText, VarEmbed and WordRank.


|Method|Origin|Description|
|---|---|---|
|Word2vec|Google|Needs >= 5m words. https://code.google.com/p/word2vec/|
|FastText|Facebook|Better than Word2Vec because it uses morphology: words are split into subword units (character n-grams)|
|GloVe|Stanford|Non-probabilistic (numeric)|
|**WordRank**||Performs well on small data (1m words). Gives top 10 words, skips the rest. Ranks words. Slow!|
|VarEmbed|||

 https://github.com/pydataberlin/conf2017slides/tree/master/data_science_for_digital_humanities
Create a dataset to validate the model, e.g. a table with related words.
https://github.com/pydataberlin
 https://gist.github.com/tmylk
 https://gist.github.com/tmylk/690374ba266d90f11bc40221cc6d3d90
 https://code.google.com/archive/p/word2vec/
 
How?

The co-occurrence matrix is vocab x vocab: good, but too big.

Reduce dimensions by factorizing it:

- (vocab x vocab) = (vocab x small) \* (small x vocab)
- C = **U** x V
- In word2vec, U x V approximates PMI(C) - log(m)
- Co-occurrence score in word2vec = Uword * Vcontext
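
The notes above relate word2vec to the pointwise mutual information (PMI) of the co-occurrence matrix. The following is a minimal sketch of how PMI can be computed from raw co-occurrence counts. The toy corpus is invented, each sentence is treated as one context, and the `log(m)` shift and the factorization step are deliberately left out.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; each sentence is one context.
sentences = [
    ["king", "crown"],
    ["king", "crown"],
    ["queen", "crown"],
    ["apple", "tree"],
]

# Count ordered (word, context) pairs within each sentence.
pair_counts = Counter()
for sent in sentences:
    for word in sent:
        for context in sent:
            if word != context:
                pair_counts[(word, context)] += 1

total = sum(pair_counts.values())
word_counts = Counter()
context_counts = Counter()
for (word, context), n in pair_counts.items():
    word_counts[word] += n
    context_counts[context] += n

def pmi(word, context):
    """PMI(w, c) = log(P(w, c) / (P(w) * P(c))); -inf if the pair never co-occurs."""
    joint = pair_counts[(word, context)] / total
    if joint == 0:
        return float("-inf")
    p_word = word_counts[word] / total
    p_context = context_counts[context] / total
    return math.log(joint / (p_word * p_context))
```

A high PMI means that a word and a context co-occur more often than their individual frequencies would suggest; the dot product Uword * Vcontext is the low-dimensional stand-in for such a score.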

![image.png](attachment:image.png)

### Setup Notebook and Dependencies
The first three lines in the first cell run utility scripts stored in the `common` sub-folder next to the notebook. The remaining lines import dependent libraries and set up the notebook.

In [80]:
# Setup Environment
%run ./common/wordvector_utility
%run ./common/vectorspace_utility
%run ./common/widgets_utility

import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

from IPython.core.display import display, HTML, clear_output
from IPython.core.interactiveshell import InteractiveShell
from nltk import word_tokenize

import ipywidgets as widgets
import bokeh.models as bm
import bokeh.plotting as bp
import bokeh.io as bio
import pandas as pd
import numpy as np
import types
import os  # used by ModelState.set_model to build the model path

%autosave 120
%config IPCompleter.greedy=True

InteractiveShell.ast_node_interactivity = "all"
TOOLS = "pan,wheel_zoom,box_zoom,reset,previewsave"

bp.output_notebook()


Autosaving every 120 seconds


In [31]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) { return false; }

<IPython.core.display.Javascript object>

### Select Word Vector Space Model (to be Used in Subsequent Steps)
Available vector space models are stored in the sub-folder ./data. New models can be added simply by uploading them to **~/notebooks/VaticanTexts/data** (with for instance WinSCP).

- **get_model_names** retrieves the filenames of all dat-files.
- **load_model_vector** loads the word vectors of a model stored in the ./data sub-folder.

Remember to re-run all dependent cells (use the **run** button or **shift-enter**) after a new model is loaded. Subsequent cells are not executed automatically.
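
The utility scripts are not shown in this notebook, so the following is only a guess at what a helper like **get_model_names** might do. The function name `list_model_files` and the `extension` parameter are hypothetical and are not part of the actual `WordVectorUtility`.

```python
import glob
import os

def list_model_files(data_folder, extension="dat"):
    """Return the sorted base names of all model files in data_folder.

    Hypothetical stand-in for a helper such as get_model_names.
    """
    pattern = os.path.join(data_folder, "*." + extension)
    return sorted(os.path.basename(path) for path in glob.glob(pattern))

# Example call (the folder name is taken from the text above):
# list_model_files("./data")
```

Returning base names rather than full paths keeps the values short enough to show in a selection widget; the full path can be rebuilt with `os.path.join` when a model is loaded.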

In [None]:
# Current Model
class ModelState:
    
    def __init__(self, data_folder):
        
        self.data_folder = data_folder
        self.filenames = WordVectorUtility.get_model_names(data_folder)
        self.filename = self.filenames[0]
        self.wordvectors = None
        
    def set_model(self, filename=None):

        filename = filename or self.filename
        self.filename = filename
        self.wordvectors = WordVectorUtility.load_model_vector(os.path.join(self.data_folder, filename))
        print('Model {} loaded...'.format(self.filename))

state = ModelState('./vsm-data')

z = BaseWidgetUtility()
z.filename = z.create_select_widget(description='Model', options=state.filenames, value=state.filename, layout=widgets.Layout(width='75%'))
w = widgets.interactive(state.set_model, filename=z.filename)
display(widgets.VBox((z.filename,) + (w.children[-1],)))
w.update()

### Corpus Statistics

### Reduce the high-dimensional word vectors to 2D using PCA and t-SNE (sklearn)

Dimensionality reduction is a way to visualize the high-dimensional word vectors produced by word2vec in 2D or 3D. Note that these methods reduce the data in different ways and under certain assumptions. Reducing data to lower dimensions will (almost) certainly lose information (see the discussion in [Q&A](https://stats.stackexchange.com/questions/66060/does-dimension-reduction-always-lose-some-information)).

See (Pearson, 1901 [PDF](http://stat.smmu.edu.cn/history/pearson1901.pdf)) and [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis#Dimensionality_reduction) for a description of Principal Component Analysis (PCA). See [(Maaten and Hinton, 2008)](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) for a description of t-SNE. t-SNE transforms a set of coordinates in a high-dimensional vector space (our word vectors) into a "faithful" representation in a lower-dimensional space, e.g. 3D space or a 2D plane. t-SNE tries to preserve *local distances* when coordinates are transformed to lower dimensions (i.e. it preserves clusters).

Open questions:
1. What is the impact of reducing *only a subset* of the coordinates (compared to reducing the entire vocabulary)? That is, if we are interested in a subset of words, we can 1) reduce the entire vocabulary or 2) reduce only the subset (for speed).
2. t-SNE has a “perplexity” parameter which affects the local vs global clustering of the data. The perplexity affects the (guessed) distribution of neighbouring points to a given point. Try different values for this parameter. It has no effect for PCA.
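
The first open question can be explored with a small experiment. The sketch below is an assumption-laden illustration, not the notebook's `VectorSpaceHelper`: it implements PCA directly with numpy's SVD, uses random vectors as a stand-in for word vectors, and the function name `pca_2d` is hypothetical.

```python
import numpy as np

def pca_2d(X, fit_on=None):
    """Project the rows of X onto the top-2 principal components of fit_on (default: X)."""
    basis = X if fit_on is None else fit_on
    mean = basis.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(basis - mean, full_matrices=False)
    return (X - mean) @ Vt[:2].T

rng = np.random.default_rng(0)
vocab_vectors = rng.normal(size=(200, 50))  # random stand-in for the full vocabulary
subset = vocab_vectors[:10]                 # stand-in for the words of interest

# Option 1: fit on the entire vocabulary, then look at the subset.
coords_full = pca_2d(subset, fit_on=vocab_vectors)
# Option 2: fit on the subset only.
coords_subset = pca_2d(subset)

# Total variance of the subset that each 2D picture retains.
var_full = coords_full.var(axis=0).sum()
var_subset = coords_subset.var(axis=0).sum()
```

Fitting on the subset always retains at least as much of the subset's own variance, but the axes then depend on which words were selected, so two plots made from different word lists are not directly comparable.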


In [73]:
# Setup T-SNE Plot
def setup_tsne_plot(title=''):
    xp = bp.figure(
        plot_width=900, plot_height=600, title=title,
        tools=TOOLS, toolbar_location="right",
        x_axis_type=None, y_axis_type=None
    )
    source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))
    crs = xp.scatter(x='x', y='y', source=source)
    crl = bm.LabelSet(x='x', y='y', text='w', level='glyph', x_offset=-2, y_offset=5, source=source)
    xp.add_layout(crl)
    handle = bp.show(xp, notebook_handle=True)
    return types.SimpleNamespace(handle=handle,source=source,points=crs,labels=crl)

def tsne_plot_reduce(words_of_interest, method='tsne', perplexity=30):
    global state
    X_m_space, index2word = WordVectorUtility.create_X_m_space_matrix(state.wordvectors, words_of_interest)
    opts = dict(n_components=2, init='pca', random_state=55887, perplexity=perplexity)
    X_2_space = VectorSpaceHelper.reduce_dimensions(X_m_space, method=method, **opts)
    return X_2_space, index2word
    
tsne_plot = setup_tsne_plot('Dimensionality Reduced Word Vector Space Model')


In [34]:
# T-SNE/PCA dimensionality reduction
def update_tsne_plot(x,y,w):
    tsne_plot.source.data.update(dict(w=w,x=x, y=y))
    bio.push_notebook(handle=tsne_plot.handle)
    
def reduce_dimensions_and_display_words(raw_text, method, perplexity):
    global state, cw

    if len(raw_text) > 0 and raw_text[-1] != ' ':
        return
    words_of_interest = list(set(word_tokenize(raw_text)))

    if len(words_of_interest) >= 3:
        cw.progress.value = 2
        X_2_space, index2word = tsne_plot_reduce(words_of_interest, method.lower(), perplexity)
        cw.progress.value = 3
        update_tsne_plot(x=X_2_space[:, 0], y=X_2_space[:, 1], w=index2word)
        cw.progress.value = 5
    cw.progress.value = 0
    
cw = BaseWidgetUtility(
    progress = wf.create_int_progress_widget(min=0, max=5, step=1, value=0),
    words=wf.create_text_area_input_widget(
        description='Words', placeholder='(enter or paste words to be plotted)',
        value='', layout=widgets.Layout(width='80%', height='200px') #, continuous_update=False
    ),
    method=wf.create_select_widget(description='Method', options=['TSNE', 'PCA'], value='PCA',
                                   layout=widgets.Layout(width="60%")),
    perplexity=wf.create_int_slider('Perplexity', min=5, max=100, step=1, value=30),
)

iwa = widgets.interactive(
    reduce_dimensions_and_display_words,
    raw_text=cw.words,
    method=cw.method,
    perplexity=cw.perplexity
)

display(widgets.VBox([
    widgets.HBox([cw.words, widgets.VBox([cw.method, cw.perplexity, cw.progress])]),
    iwa.children[-1]
]))
iwa.update()

In [23]:
' '.join(list(state.wordvectors.vocab)[:250])

'brevet senast mej storman allvarligt trappan stadiet grundförstärkning bruksanvisning balplex ångtrycket establishments inköpas rest föreställningsvärld klädd avritade kontorschefen köken hartz byggnadsarbetena urvalet motive abstrakta apple sthlm åhret realiteten radions bildfönster åkermans stark eiffeltornet yrkeskategorier ytan återupptog emeritus järnvägsbyggare högtstående årsproduktionen universitetens optikern styras krutets måttstav kökets ägarnas högtidligt kronolänsman f. omöjliggjorde perspective livsstil mobil klättrar köps ridån idrifttagningen nämligen kyl skapade poleringen skutor ursprungsland kista fyrkantiga amongst gruppera tillträdet kraftvärmeverk isak ungefärliga anger osmundjärn slarvat industrihistorien handstycket bakugn nertill aritmetisk beans exportvara commodore hushållets skänkt elspisar kunders patentsystem decenniernas läroböcker utrustning släktnamn oil arbetsrummet renhorn brandförsäkringen fönstret mödosamt curio frankrike karlskrona framställes väv

In [None]:
state.wordvectors?

In [35]:
# Setup expression plot
def update_expression_plot(exp_points, result_points, expr_trail):
    global expression_plot
    expression_plot.expr_words_source.data.update(exp_points)
    expression_plot.result_words_source.data.update(result_points)
    expression_plot.expr_trail_source.data.update(expr_trail)
    bio.push_notebook(handle=expression_plot.handle)


def setup_expression_plot():
    xp = bp.figure(
        plot_width=900, plot_height=600, title='Word Vector Expressions',
        tools=TOOLS, toolbar_location="right",
        #x_axis_type=None, y_axis_type=None
    )

    xp.cross(x=0, y=0, size=10, color='blue')
                                                                    
    expr_words_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))
    result_words_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))
    expr_trail_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0], x2=[0], y2=[0]))
    
    crs = xp.scatter(x='x', y='y', color='black', source=expr_words_source)
    crl = bm.LabelSet(x='x', y='y', text='w', level='glyph', x_offset=-2, y_offset=5, source=expr_words_source)
    xp.add_layout(crl)
    
    rp = xp.scatter(x='x', y='y', size=5, color='green', source=result_words_source)
    rl = bm.LabelSet(x='x', y='y', text='w', level='glyph', x_offset=-2, y_offset=5, source=result_words_source)
    xp.add_layout(rl)
    
    xp.add_layout(bm.Arrow(
        x_start=0,
        y_start=0,
        x_end='x',
        y_end='y',
        line_color='red',
        source=expr_words_source,
        end=bm.NormalHead(size=10, fill_color='black', fill_alpha=1.0, line_alpha=0.2),
    ))

    xp.add_layout(bm.Arrow(
        x_start='x',
        y_start='y',
        x_end='x2',
        y_end='y2',
        line_color='blue',
        source=expr_trail_source,
        end=bm.NormalHead(size=10, fill_color='black', fill_alpha=1.0, line_alpha=0.2),
    ))
 
    handle = bp.show(xp, notebook_handle=True)
    return types.SimpleNamespace(
        handle=handle,
        points=crs,
        labels=crl,
        expr_words_source=expr_words_source,
        result_words_source=result_words_source,
        expr_trail_source=expr_trail_source
    )

    
expression_plot = setup_expression_plot()
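
The calculator cell works on expressions such as `sverige + oslo - stockholm`. How `WordVectorUtility.compute_most_similar_expression` parses them is not shown here, so the sketch below is only one plausible way to split such a string into positive and negative terms; the function name `parse_expression` is hypothetical.

```python
import re

def parse_expression(expression):
    """Split 'a + b - c' into (positives, negatives); an unsigned term counts as positive."""
    positives, negatives = [], []
    # Each match is an optional sign followed by a word (Unicode letters, digits, underscore).
    for sign, word in re.findall(r"([+-]?)\s*(\w+)", expression):
        (negatives if sign == "-" else positives).append(word)
    return positives, negatives

positives, negatives = parse_expression("sverige + oslo - stockholm")
```

If the loaded vectors are gensim `KeyedVectors` (an assumption), lists like these could be passed on as `most_similar(positive=positives, negative=negatives)`. Note that this simple pattern would split hyphenated words into separate terms.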

In [None]:
# Calculator
history_state = ['', 'man - pojke + flicka']
z = BaseWidgetUtility(
    method = wf.create_select_widget(
        description='Reducer',
        options=['pca', 'tsne' ],
        value='tsne'
    ),
    perplexity=wf.create_int_slider(description='Perplexity', min=10, max=100, step=1, value=30),
    expression = wf.create_text_input_widget(
        description='Expression',
        placeholder='(enter expression e.g. sverige + oslo - stockholm)',
        value='',
        layout=widgets.Layout(width='90%')
    )
)

def compute_expression(expression, method, perplexity, n_top=10):
    global state, history_state
    
    result, options = WordVectorUtility.compute_most_similar_expression(state.wordvectors, expression)

    if result is None or options is None:
        return
    
    expression_words = (options['positives'] or []) + (options['negatives'] or [])
    
    #if len(expression_words) < 3:
    #    return
    
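    # Drop results that are part of the expression, keeping words and weights aligned
    result = [ (word, weight) for word, weight in result[:n_top] if word not in expression_words ]
    if len(result) == 0:
        return
    result_words, result_weights = [ list(t) for t in zip(*result) ]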

    #df = pd.DataFrame(result_words).assign(weight=result_weights)
    #display(HTML(df.to_html()))
    #return
    
    X_2_space, index2word = tsne_plot_reduce(expression_words+result_words, method=method, perplexity=perplexity)
    
    word2index = dict(zip(index2word, range(0, len(index2word))))

    expr_index = [ word2index[x] for x in expression_words ]
    result_index = [ word2index[x] for x in result_words ]
    positives = options['positives']
    color = len(index2word) * ['red']
    expr_words=dict(
        x=list(X_2_space[expr_index, 0]),
        y=list(X_2_space[expr_index, 1]),
        w=[ index2word[i] for i in expr_index ],
        s=[ index2word[i] in positives for i in expr_index ]
    )
    result_words=dict(
        x=list(X_2_space[result_index, 0]),
        y=list(X_2_space[result_index, 1]),
        w=[ index2word[i] for i in result_index ],
    )
    
    expr_points = [dict(zip(expr_words,t)) for t in zip(*expr_words.values())]
    # Let the trail start at the origin
    trail = [dict(x=0, y=0, w='', x2=0, y2=0)]
    for p in expr_points:
        # tp is our current position
        tp = trail[-1]
        sign = 1 if p['s'] else -1
        
        # (x, y) is our next position
        x, y = tp['x'] + sign * p['x'], tp['y'] + sign * p['y']
        w = tp['w'] + (' + ' if p['s'] else ' - ') + p['w']
        
        # Named next_point (not np) to avoid shadowing the numpy alias
        next_point = dict(x=x, y=y, w=w, x2=0, y2=0)
        tp['x2'], tp['y2'] = next_point['x'], next_point['y']
        trail += [ next_point ]
    
    expr_trail = dict(
        x=[ p['x'] for p in trail ][1:-1],
        y=[ p['y'] for p in trail ][1:-1],
        w=[ p['w'] for p in trail ][1:-1],
        x2=[ p['x2'] for p in trail ][1:-1],
        y2=[ p['y2'] for p in trail ][1:-1]
    )

    update_expression_plot(expr_words, result_words, expr_trail)
    
    #point_map = dict(zip(index2word, zip(X_2_space[:, 0], X_2_space[:, 1])))

    #update_expression_plot(x=X_2_space[:, 0], y=X_2_space[:, 1], w=index2word)
            
    df = pd.DataFrame(result_words).assign(weight=result_weights)
    display(HTML(df.to_html()))

w = widgets.interactive(compute_expression, expression=z.expression, method=z.method, perplexity=z.perplexity)
display(widgets.VBox([
    widgets.HBox([z.expression]),
    widgets.HBox([z.method, z.perplexity]),
    w.children[-1]
]))
# w.update()
# sverige + oslo - stockholm

### Similarity to Anthologies

In [79]:
# Setup anthology similarity and plot
from scipy import spatial  # needed for spatial.distance.cosine below
anthology_plot = setup_tsne_plot('Dimensionality Reduced Word Vector Space Model')
def compute_similarity_to_anthologies(word_vectors, scale_x_pair, scale_y_pair, word_list):

    scale_x = word_vectors[scale_x_pair[0]] - word_vectors[scale_x_pair[1]]
    scale_y = word_vectors[scale_y_pair[0]] - word_vectors[scale_y_pair[1]]

    word_x_similarity = [1 - spatial.distance.cosine(scale_x, word_vectors[x]) for x in word_list ]
    word_y_similarity = [1 - spatial.distance.cosine(scale_y, word_vectors[x]) for x in word_list ]

    df = pd.DataFrame({ 'words': word_list, 'x': word_x_similarity, 'y': word_y_similarity })

    return df

def compute_similarity_to_single_words(word_vectors, word_x, word_y, word_list):

    word_x_similarity = [ word_vectors.similarity(x, word_x) for x in word_list ]
    word_y_similarity = [ word_vectors.similarity(x, word_y) for x in word_list ]

    df = pd.DataFrame({ 'words': word_list, 'x': word_x_similarity, 'y': word_y_similarity })

    return df
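
The idea behind `compute_similarity_to_anthologies` is to build an axis from the difference of two word vectors and then measure how strongly other words point along it. A minimal, dependency-free sketch of that idea follows. The 2-dimensional vectors are invented, and the helper names `cosine_similarity` and `scale_position` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def scale_position(vectors, scale_pair, word):
    """Cosine similarity between a word and the axis vectors[pair[0]] - vectors[pair[1]]."""
    first, second = vectors[scale_pair[0]], vectors[scale_pair[1]]
    axis = [p - q for p, q in zip(first, second)]
    return cosine_similarity(axis, vectors[word])

# Hypothetical 2-D vectors, invented for illustration only
# (jordbruk = agriculture, industri = industry, plog = plough, fabrik = factory).
toy_vectors = {
    "jordbruk": [1.0, 0.0],
    "industri": [0.0, 1.0],
    "plog": [0.9, 0.1],
    "fabrik": [0.1, 0.9],
}

plog_x = scale_position(toy_vectors, ("jordbruk", "industri"), "plog")
fabrik_x = scale_position(toy_vectors, ("jordbruk", "industri"), "fabrik")
```

A positive value places a word towards the first word of the pair and a negative value towards the second, so two such axes can serve as the x and y coordinates of a scatter plot.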

In [77]:
# Code
def update_anthology_plot(x,y,w):
    anthology_plot.source.data.update(dict(w=w,x=x, y=y))
    bio.push_notebook(handle=anthology_plot.handle)
    
def compute_anthology_similarity_and_display(
    x_low, x_high, y_low, y_high, raw_text, method, perplexity
):
    global state, aw

    # if len(raw_text) > 0 and raw_text[-1] != ' ':
    #    return
    aw.progress.value = 1
    words_of_interest = list(set(word_tokenize(raw_text)))

    # if len(words_of_interest) >= 3:
    aw.progress.value = 2
    X_2_space, index2word = tsne_plot_reduce(words_of_interest, method.lower(), perplexity)
    aw.progress.value = 3
    update_anthology_plot(x=X_2_space[:, 0], y=X_2_space[:, 1], w=index2word)
    aw.progress.value = 5

    aw.progress.value = 5
    aw.progress.value = 0
    
aw = BaseWidgetUtility(
    progress = wf.create_int_progress_widget(min=0, max=5, step=1, value=0, layout=widgets.Layout(width='100%')),
    words=wf.create_text_area_input_widget(
        description='Words', placeholder='(enter or paste words to be plotted)',
        value='', layout=widgets.Layout(width='80%', height='200px') #, continuous_update=False
    ),
    x_low=wf.create_text_input_widget(description='X-low', value='Industri', continuous_update=False),
    x_high=wf.create_text_input_widget(description='X-high', value='Jordbruk', continuous_update=False),
    y_low=wf.create_text_input_widget(description='Y-low', value='Män', continuous_update=False),
    y_high=wf.create_text_input_widget(description='Y-high', value='Kvinnor', continuous_update=False),
    
    method=wf.create_select_widget(description='Method', options=['TSNE', 'PCA'], value='PCA',
                                   layout=widgets.Layout(width="40%")),
    perplexity=wf.create_int_slider('Perplexity', min=5, max=100, step=1, value=30, layout=widgets.Layout(width='65%'))
)

awa = widgets.interactive(
    compute_anthology_similarity_and_display,
    x_low=aw.x_low,
    x_high=aw.x_high,
    y_low=aw.y_low,
    y_high=aw.y_high,
    raw_text=aw.words,
    method=aw.method,
    perplexity=aw.perplexity
)

display(widgets.VBox([
    widgets.HBox([
        aw.words, widgets.VBox([
            aw.progress,
            widgets.HBox([aw.x_low, aw.x_high]),
            widgets.HBox([aw.y_low, aw.y_high]),
            widgets.HBox([aw.method, aw.perplexity]),
           ])]),
    awa.children[-1]
]))
    
awa.update()