## Explore Daedalus Word Vector Space Models

Word embeddings (a.k.a. word vectors or distributed representations) are **dense** representations of words in a (relatively) **low-dimensional vector space**, in contrast to e.g. one-hot representations, which are high-dimensional and sparse.

- Zellig S. Harris, DISTRIBUTIONAL STRUCTURE, 1954 [PDF](http://www.tandfonline.com/doi/pdf/10.1080/00437956.1954.11659520)
- Tomas Mikolov, et al. Efficient Estimation of Word Representations in Vector Space. [PDF](https://arxiv.org/pdf/1301.3781)
- Tomas Mikolov, et al. Distributed Representations of Words and Phrases and their Compositionality. [PDF](https://arxiv.org/abs/1310.4546)
- Turney, Pantel: From frequency to meaning: Vector space models of semantics
- Google Code Archive word2vec [Link](https://code.google.com/archive/p/word2vec/)
- Shusen Liu: Visual Exploration of Semantic Relationships in Neural Word Embeddings, 2018 [PDF](http://ieeexplore.ieee.org/abstract/document/8019864/)
- Word2vec in Python, Part Two: Optimizing (RADIM ŘEHŮŘEK), [Link](http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/)
- Talk by one of gensim's developers: [Video](https://www.youtube.com/watch?v=tAxrlAVw-Tk&t=17s)
- Word2Vec lecture: [Video](https://youtu.be/T8tQZChniMk)
- Lecture, Christopher Manning, Stanford: [Video](https://youtu.be/ERibwqs9p38)

- How exactly does word2vec work? [PDF](http://www.1-4-5.net/~dmm/ml/how_does_word2vec_work.pdf)
- Ryan Heuser: [Word Vectors in the Eighteenth Century](http://ryanheuser.org/word-vectors/)
- Ben Schmidt: [Vector Space Models for the Digital Humanities](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html) [Visualization](http://benschmidt.org/profGender/#%7B%22database%22%3A%22RMP%22%2C%22plotType%22%3A%22pointchart%22%2C%22method%22%3A%22return_json%22%2C%22search_limits%22%3A%7B%22word%22%3A%5B%22funny%22%5D%2C%22department__id%22%3A%7B%22%24lte%22%3A25%7D%7D%2C%22aesthetic%22%3A%7B%22x%22%3A%22WordsPerMillion%22%2C%22y%22%3A%22department%22%2C%22color%22%3A%22gender%22%7D%2C%22counttype%22%3A%5B%22WordsPerMillion%22%5D%2C%22groups%22%3A%5B%22department%22%2C%22gender%22%5D%2C%22testGroup%22%3A%22D%22%7D)
- Michael A. Gavin: [The Arithmetic of Concepts: a response to Peter de Bolla](http://modelingliteraryhistory.org/2015/09/18/the-arithmetic-of-concepts-a-response-to-peter-de-bolla/)


Word2vec skip-gram: Use a small window (e.g. 2, i.e. 5 words: the target word plus 2 words to the left and 2 to the right) to get an **interchangeability** model, and a larger window (e.g. 50) to get an **attribute** model.
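
To make the effect of the window size concrete, the following minimal sketch generates the (target, context) training pairs that a skip-gram model would see. The function name `skipgram_pairs` and the sample sentence are made up for illustration; real implementations add details such as subsampling and dynamic window shrinking that are left out here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()

# Window 2: each word is paired only with its immediate neighbours.
narrow = skipgram_pairs(tokens, window=2)

# Window 50: in a short text every word is paired with every other word.
wide = skipgram_pairs(tokens, window=50)

print(len(narrow), len(wide))
```

With the narrow window, "brown" is paired only with "the", "quick", "fox" and "jumps"; with the wide window it is paired with every other word in the sentence, which pushes the model towards topical rather than substitutable similarity.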

|Method|Origin|Description|
|---|---|---|
|Word2vec|Google|Needs >= 5m words. https://code.google.com/p/word2vec/|
|FastText|Facebook|Reported to be better than Word2vec thanks to its use of morphology: words are split into subword units (character n-grams)|
|GloVe|Stanford|Non-probabilistic (numeric) method|
|**WordRank**||Performs well on small data (1m words). Gives top 10 words, skips the rest. Ranks words. Slow!|
|VarEmbed|x|x|

 https://github.com/pydataberlin/conf2017slides/tree/master/data_science_for_digital_humanities
 https://github.com/pydataberlin
 https://gist.github.com/tmylk
 https://gist.github.com/tmylk/690374ba266d90f11bc40221cc6d3d90
 
#### How?
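Assume a corpus C with a vocabulary of size N. Create a co-occurrence matrix $M_{CO}$ of size $N \times N$. The dimensionality is then reduced by factorizing the matrix into two smaller matrices, such that $(N \times N) = (N \times K) \times (K \times N)$ where K is the target dimensionality:
$M_{CO} \approx U \times V$. In word2vec, $U \times V$ approximates $PMI(M_{CO}) - \log(m)$ (for skip-gram with negative sampling, m is the number of negative samples).

The co-occurrence score of a word and a context in word2vec is then (presumably) the dot product $U_{word} \cdot V_{context}$.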
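
The factorization idea can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the two-sentence corpus is made up, the window is one word to each side, and a truncated SVD is used as a stand-in factorization. Word2vec itself does not run an SVD; it learns the two matrices by gradient descent.

```python
import numpy as np

# Toy corpus (made up for illustration).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
N = len(vocab)

# Symmetric co-occurrence counts within a window of 1 word left/right.
M = np.zeros((N, N))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[index[w], index[sent[j]]] += 1

# Truncated SVD: keep only the K largest singular values,
# so that M is approximated by an (N x K) times (K x N) product.
K = 2
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :K] * S[:K]   # N x K, one row per word
context_vectors = Vt[:K, :]       # K x N, one column per context word
approx = word_vectors @ context_vectors

print(word_vectors.shape, context_vectors.shape, approx.shape)
```

Each row of `word_vectors` is a K-dimensional representation of a word, and the dot product of a word row with a context column reconstructs (approximately) the corresponding cell of the co-occurrence matrix.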

Definition of PMI (Pointwise Mutual Information), from [Q&A](https://stackoverflow.com/questions/13488817/pointwise-mutual-information-on-text). P(x, y) is the probability that words x and y co-occur in the same context, and P(x) and P(y) are the (global) probabilities of x and y. Since P(x, y) = P(x|y)P(y), PMI can equally be written in terms of the probability of x occurring given that y occurs.
\begin{equation}
\mathrm{pmi}(x,y)=\log(\frac{P(x, y)}{P(x)P(y)})=\log(\frac{P(x|y)}{P(x)})
\end{equation}
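
As a worked example of the PMI formula, the sketch below estimates the probabilities from a handful of made-up contexts and checks that both forms of the definition agree. The data and the function names `pmi` and `pmi_conditional` are illustrative assumptions, and a word is counted at most once per context.

```python
import math
from collections import Counter

# Each inner list is one context (e.g. a sentence or a window).
contexts = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["new", "car"],
    ["old", "car"],
]
n = len(contexts)

# Number of contexts containing each word, and each unordered word pair.
word_counts = Counter(w for ctx in contexts for w in set(ctx))
pair_counts = Counter(
    frozenset((x, y)) for ctx in contexts for x in set(ctx) for y in set(ctx) if x < y
)


def pmi(x, y):
    """log( P(x, y) / (P(x) P(y)) ), probabilities estimated per context."""
    p_xy = pair_counts[frozenset((x, y))] / n
    if p_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log(p_xy / ((word_counts[x] / n) * (word_counts[y] / n)))


def pmi_conditional(x, y):
    """Equivalent form log( P(x | y) / P(x) ); assumes x and y co-occur."""
    p_x_given_y = pair_counts[frozenset((x, y))] / word_counts[y]
    return math.log(p_x_given_y / (word_counts[x] / n))


print(pmi("new", "york"), pmi_conditional("new", "york"))
print(pmi("new", "car"))
```

"new" and "york" co-occur more often than chance would predict, so their PMI is positive; "new" and "car" co-occur less often than chance, so their PMI is negative.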


#### Sample NLP Tasks
- Language translation, detection
- Sentiment analysis
- PoS-tagging, Dependency analysis
- Document ranking/classification
- Document querying (compare query vector and document vector with e.g. [KL-divergence](http://web.engr.illinois.edu/~hanj/cs412/bk3/KL-divergence.pdf))
- Recommendation system (such as Spotify). See [this](http://arno.uvt.nl/show.cgi?fid=136352) and [this](https://github.com/mattdennewitz/playlist-to-vec) link.


### Step 0: Setup Notebook and Dependencies
The first three lines in the first cell run code found in utility scripts which reside in the `common` subfolder of the notebook's folder. The remaining lines import the required libraries and set up the notebook.

<img src="./tm-data/images/w2v-workflow-prepare.svg">

In [3]:
# Setup Environment
%run ./common/wordvector_utility
%run ./common/vectorspace_utility
%run ./common/widgets_utility

import glob
import os
import re
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

from IPython.display import display, HTML, clear_output
from IPython.core.interactiveshell import InteractiveShell
from nltk import word_tokenize

import ipywidgets as widgets
import bokeh.models as bm
import bokeh.plotting as bp
import bokeh.io as bio
import pandas as pd
import numpy as np
import types

# %autosave 120
# %config IPCompleter.greedy=True

# InteractiveShell.ast_node_interactivity = "all"
TOOLS = "pan,wheel_zoom,box_zoom,reset,previewsave"

bp.output_notebook()


In [5]:
# %%javascript
# IPython.OutputArea.prototype._should_scroll = function(lines) { return false; }

In [3]:
%%bash
# sudo ./update_user_notebooks.bash

### Step 1: Select Word Vector Space Model (to be Used in Subsequent Steps)
Available vector space models are stored in the **./vsm-data** subfolder. New models can be added simply by uploading them to this folder (with, for instance, WinSCP). All files in this folder with extension '.dat' or '.bin.gz' are assumed to be word embedding models (WEM) and are hence added to the dropdown. Remember to re-run all dependent cells (use the **run** button or **shift-enter**) when a new model has been selected. There is **no** automatic execution of subsequent cells!

In [6]:
# Current Model
class ModelState:
    
    def __init__(self, data_folder):
        
        self.data_folder = data_folder
        self.WEM = None
        self.patterns = ['*.dat', '*.bin.gz']
        self.filenames = self.get_model_names(self.data_folder, self.patterns)
        self.filename = self.filenames[0]
        
    def set_model(self, filename=None):

        filename = filename or self.filename
        
        if filename is None:
            print('Please select a model')
            return
        
        print('Loading...')
        self.filename = filename
        self.WEM = WordVectorUtility.load_model_vector(os.path.join(self.data_folder, filename), limit=100000)
        print('Model {} loaded...'.format(self.filename))

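    def get_basename(self):
        basename = self.filename
        for prefix in ['w2v_model_', 'w2v_']:
            if basename.startswith(prefix):
                basename = basename[len(prefix):]
        # Strip the model extension; a plain '[:-4]' only fits '.dat', not '.bin.gz'
        for suffix in ['.bin.gz', '.dat']:
            if basename.endswith(suffix):
                return basename[:-len(suffix)]
        return basename[:-4]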

    def get_statistics(self):
        filename = os.path.join(self.data_folder, 'stats_{}.tsv'.format(self.get_basename()))
        # print(filename)
        df = pd.read_csv(filename, sep='\t')
        if 'Unnamed: 0' in df.columns:
            df.drop('Unnamed: 0', axis=1, inplace=True)
        return df
    
    def get_model_names(self, data_folder, patterns):

        filenames = flatten([ glob.glob(os.path.join(data_folder, x)) for x in patterns ])
        filenames = [ os.path.split(x)[1] for x in filenames ]
        filenames.sort()
        return [ None ] + filenames

state = ModelState('./vsm-data')

sw = BaseWidgetUtility(
    filename=wf.create_select_widget(
        description='Model', options=state.filenames, value=state.filename, layout=widgets.Layout(width='85%')
    )
)
swi = widgets.interactive(state.set_model, filename=sw.filename)
display(widgets.VBox([sw.filename, swi.children[-1]]))
swi.update()

Please select a model


### Corpus Statistics

In [5]:
# Press SHIFT-ENTER to display or update stats
wchart = widgets.Output()
wtable = widgets.Output()

def display_stats(count_type, view_type):
    global wchart, wtable
    try:
        df = state.get_statistics()
        df['year'] = df.filename.apply(lambda x: int(re.search(r'(\d{4})', x).group(0)))
        token_count = 'total_tokens' if count_type == 'Total' else 'tokens'
        title = 'Total tokens per year' + (' stopwords, rare words removed, bigrams etc.' if count_type != 'Total' else '')
        df_count = df.groupby('year')[[token_count]].agg([len, np.sum, np.mean, np.min, np.max, np.std])
        df_count.columns = ['Documents', 'Tokens', 'Average', 'Min', 'Max', 'Std']
        if view_type == 'Chart':
            df_count[['Tokens', 'Average', 'Min', 'Max']].plot(figsize=(14,8), title=title)
        else:
            display(df_count)
    except Exception as ex:
        print("Statistics missing for selected model")
        
cw = BaseWidgetUtility(
    count_type=widgets.ToggleButtons(options=['Total', 'Processed'], description='Type:', disabled=False, button_style=''),
    view_type=widgets.ToggleButtons(options=['Chart', 'Table'], description='View:', disabled=False, button_style='')
)

cwi = widgets.interactive(display_stats, count_type=cw.count_type, view_type=cw.view_type)
display(widgets.VBox([widgets.HBox([cw.count_type, cw.view_type]), cwi.children[-1]]))
cwi.update()


### Step: Seed Similar Words

In [6]:
# Code
%run ./common/vectorspace_utility
def seed_similar_words(expression, n_top=10):
    global state
    #try:
    result, options = WordVectorUtility.compute_most_similar_expression(state.WEM, expression, topn=n_top)
    if result is None or options is None:
        return
    result_words, _ = list(zip(*result[:n_top]))
    print(' '.join(list(result_words)))
    df = pd.DataFrame(result, columns=['word', 'similarity'])
    display(df)
#    except Exception as ex:
#        print(ex)

sw = BaseWidgetUtility(
    n_top=wf.create_int_slider(description='Count', min=1, max=250, step=1, value=20),
    expression = wf.create_text_input_widget(
        description='Expression',
        placeholder='(enter seed expression or word)',
        value='',
        layout=widgets.Layout(width='50%')
    )
)
isw = widgets.interactive(seed_similar_words, expression=sw.expression, n_top=sw.n_top)
display(widgets.VBox([
    widgets.HBox([sw.expression, sw.n_top]),
    isw.children[-1]
]))




### Step: Display a Set of Words Using Dimensionality Reduction

Dimensionality reduction is a way to visualize the high-dimensional word vectors produced by word2vec in 2D or 3D. Note that these methods reduce the data in different ways, and under certain assumptions. Reducing data to lower dimensions will in general lose information (see discussion in [Q&A](https://stats.stackexchange.com/questions/66060/does-dimension-reduction-always-lose-some-information)). [This article](http://www.sci.utah.edu/~beiwang/publications/Word_Embeddings_BeiWang_2017.pdf) discusses some problems with dimensionality reductions.

See (Pearson, 1901 [PDF](http://stat.smmu.edu.cn/history/pearson1901.pdf)) and [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis#Dimensionality_reduction) for a description of Principal Component Analysis (PCA). See [(Maaten and Hinton, 2008)](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) for a description of t-SNE. t-SNE transforms a set of coordinates in a high-dimensional vector space (our word vectors) into a "faithful" representation in a lower-dimensional space, e.g. a 3D space or a 2D plane. t-SNE tries to preserve *local distances* when coordinates are transformed to lower dimensions (i.e. it preserves clusters).

Open questions:
1. What is the impact of reducing *only a subset* of the coordinates (compared to reducing the entire vocabulary)? That is, if we are interested in a subset of words, we can 1) reduce the entire vocabulary or 2) only reduce the subset (for speed).
2. t-SNE has a “perplexity” parameter which affects the local vs global clustering of the data. The perplexity affects the (guessed) distribution of neighbouring points to a given point. Try different values for this parameter. This parameter has no effect for PCA.
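
The first open question can be explored with a small experiment. The sketch below is an illustration only: `pca_2d` is a hypothetical helper written with NumPy's SVD (not the notebook's `VectorSpaceHelper.reduce_dimensions`), and random vectors stand in for a real model.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


rng = np.random.default_rng(0)
vocabulary_vectors = rng.normal(size=(200, 50))  # stand-in for a full model
subset = np.arange(10)                           # indices of the words of interest

# Option 1: reduce the entire vocabulary, then pick out the subset.
full_then_select = pca_2d(vocabulary_vectors)[subset]

# Option 2: reduce only the subset.
subset_only = pca_2d(vocabulary_vectors[subset])

print(full_then_select.shape, subset_only.shape)
print(np.allclose(full_then_select, subset_only))
```

The two options generally give different coordinates. Reducing only the subset spreads those particular words out as much as a linear projection allows, whereas reducing the whole vocabulary places them relative to all other words.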


In [7]:
# Setup T-SNE Plot
def setup_tsne_plot(title=''):
    xp = bp.figure(
        plot_width=1000, plot_height=600, title=title,
        tools=TOOLS, toolbar_location="right",
        x_axis_type=None, y_axis_type=None
    )
    source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))
    crs = xp.scatter(x='x', y='y', source=source)
    crl = bm.LabelSet(x='x', y='y', text='w', level='glyph', x_offset=-2, y_offset=5, source=source)
    xp.add_layout(crl)
    handle = bp.show(xp, notebook_handle=True)
    return types.SimpleNamespace(handle=handle,source=source,points=crs,labels=crl)

def tsne_plot_reduce(words_of_interest, method='tsne', perplexity=30):
    global state
    X_m_space, index2word = WordVectorUtility.create_X_m_space_matrix(state.WEM, words_of_interest)
    opts = dict(n_components=2, init='pca', random_state=55887, perplexity=perplexity)
    X_2_space = VectorSpaceHelper.reduce_dimensions(X_m_space, method=method, **opts)
    return X_2_space, index2word
    
tsne_plot = setup_tsne_plot('Dimensionality Reduced Word Vector Space Model')


In [9]:
# T-SNE/PCA dimensionality reduction
def update_tsne_plot(x,y,w):
    tsne_plot.source.data.update(dict(w=w,x=x, y=y))
    bio.push_notebook(handle=tsne_plot.handle)
    
def reduce_dimensions_and_display_words(raw_text, method, perplexity):
    global state, cw

    if len(raw_text) > 0 and raw_text[-1] != ' ':
        return
    words_of_interest = list(set(word_tokenize(raw_text)))

    if len(words_of_interest) >= 3:
        cw.progress.value = 2
        X_2_space, index2word = tsne_plot_reduce(words_of_interest, method.lower(), perplexity)
        cw.progress.value = 3
        update_tsne_plot(x=X_2_space[:, 0], y=X_2_space[:, 1], w=index2word)
        cw.progress.value = 5
    cw.progress.value = 0
    
cw = BaseWidgetUtility(
    progress = wf.create_int_progress_widget(min=0, max=5, step=1, value=0),
    words=wf.create_text_area_input_widget(
        description='Words', placeholder='(enter or paste words to be plotted)',
        value='', layout=widgets.Layout(width='80%', height='200px') #, continuous_update=False
    ),
    method=wf.create_select_widget(description='Method', options=['TSNE', 'PCA'], value='PCA',
                                   layout=widgets.Layout(width="60%")),
    perplexity=wf.create_int_slider('Perplexity', min=5, max=100, step=1, value=30),
)

iwa = widgets.interactive(
    reduce_dimensions_and_display_words,
    raw_text=cw.words,
    method=cw.method,
    perplexity=cw.perplexity
)

display(widgets.VBox([
    widgets.HBox([cw.words, widgets.VBox([cw.method, cw.perplexity, cw.progress])]),
    iwa.children[-1]
]))
iwa.update()


### Word Vector Expression Calculator
This is an attempt to visualize word-vector expressions in 2D. Note that the word vector expression is computed in the high-dimensional vector space, as given in the result table, and in the 2-dimensional vector space, as visualized in the chart. The dimensionality reduction loses information, which explains the discrepancies between the two results.

The **compute_most_similar_expression** function in WordVectorUtility parses a string of words, each (optionally) prefixed with a plus or minus sign. The extracted "positive" and "negative" words are then used as arguments to gensim's **most_similar** function, which, using numeric vector operations (add and subtract), finds the words most similar (by cosine similarity) to the result of the given expression.
Example expressions:
- kvinnor - flickor + pojkar
- sverige + oslo - stockholm
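
The parsing and vector arithmetic described above can be sketched in plain Python. Everything here is an assumption made for illustration: the toy vectors, and the helper names `split_expression` and `most_similar`, which only mimic the idea. The notebook's utility and gensim's real implementation differ in details (for instance, gensim normalizes the vectors before combining them).

```python
import math
import re

# Toy 3-dimensional vectors; real models use hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def split_expression(expression):
    """Split 'a - b + c' into lists of positive and negative words."""
    positives, negatives = [], []
    for sign, word in re.findall(r"([+-]?)\s*(\w+)", expression):
        (negatives if sign == "-" else positives).append(word)
    return positives, negatives


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(expression, topn=3):
    """Rank all other words by cosine similarity to the expression's result vector."""
    positives, negatives = split_expression(expression)
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for sign, words in ((1.0, positives), (-1.0, negatives)):
        for w in words:
            target = [t + sign * x for t, x in zip(target, vectors[w])]
    # Words used in the expression itself are excluded from the result.
    candidates = [w for w in vectors if w not in positives + negatives]
    scored = [(w, cosine(target, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(split_expression("sverige + oslo - stockholm"))
print(most_similar("king - man + woman"))
```

With these toy vectors, "king - man + woman" lands closest to "queen", which is the classic analogy example.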

In [10]:
#
def update_expression_plot(exp_points, result_points, expr_trail):
    global expression_plot
    expression_plot.expr_words_source.data.update(exp_points)
    expression_plot.result_words_source.data.update(result_points)
    expression_plot.expr_trail_source.data.update(expr_trail)
    bio.push_notebook(handle=expression_plot.handle)


def setup_expression_plot():
    try:
        xp = bp.figure(
            plot_width=900, plot_height=600, title='Word Vector Expressions',
            tools=TOOLS, toolbar_location="right",
            #x_axis_type=None, y_axis_type=None
        )

        xp.cross(x=0, y=0, size=10, color='blue')

        expr_words_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))
        result_words_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))
        expr_trail_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0], x2=[0], y2=[0]))

        crs = xp.scatter(x='x', y='y', color='black', source=expr_words_source)
        crl = bm.LabelSet(x='x', y='y', text='w', level='glyph', x_offset=-2, y_offset=5, source=expr_words_source)
        xp.add_layout(crl)

        rp = xp.scatter(x='x', y='y', size=5, color='green', source=result_words_source)
        rl = bm.LabelSet(x='x', y='y', text='w', level='glyph', x_offset=-2, y_offset=5, source=result_words_source)
        xp.add_layout(rl)

        xp.add_layout(bm.Arrow(
            x_start=0,
            y_start=0,
            x_end='x',
            y_end='y',
            line_color='red',
            source=expr_words_source,
            end=bm.NormalHead(size=10, fill_color='black', fill_alpha=1.0, line_alpha=0.2),
        ))

        xp.add_layout(bm.Arrow(
            x_start='x',
            y_start='y',
            x_end='x2',
            y_end='y2',
            line_color='blue',
            source=expr_trail_source,
            end=bm.NormalHead(size=10, fill_color='black', fill_alpha=1.0, line_alpha=0.2),
        ))

        handle = bp.show(xp, notebook_handle=True)
        return types.SimpleNamespace(
            handle=handle,
            points=crs,
            labels=crl,
            expr_words_source=expr_words_source,
            result_words_source=result_words_source,
            expr_trail_source=expr_trail_source
        )
    except Exception as ex:
        print('Something bad happened: {}'.format(ex))
    
expression_plot = setup_expression_plot()

In [11]:
# Calculator
history_state = ['', 'man - pojke + flicka']
z = BaseWidgetUtility(
    method = wf.create_select_widget(
        description='Reducer',
        options=['pca', 'tsne' ],
        value='tsne'
    ),
    perplexity=wf.create_int_slider(description='Perplexity', min=10, max=100, step=1, value=30),
    expression = wf.create_text_input_widget(
        description='Expression',
        placeholder='(enter expression e.g. sverige + oslo - stockholm)',
        value='',
        layout=widgets.Layout(width='90%')
    )
)

def compute_expression(expression, method, perplexity, n_top=10):
    global state, history_state
    try:
        result, options = WordVectorUtility.compute_most_similar_expression(state.WEM, expression)

        if result is None or options is None:
            return

        expression_words = (options['positives'] or []) + (options['negatives'] or [])

        #if len(expression_words) < 3:
        #    return

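        # Drop words that are part of the expression, keeping words and weights aligned
        filtered = [ (word, weight) for word, weight in result[:n_top] if word not in expression_words ]
        if len(filtered) == 0:
            return
        result_words, result_weights = [ list(t) for t in zip(*filtered) ]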

        #df = pd.DataFrame(result_words).assign(weight=result_weights)
        #display(HTML(df.to_html()))
        #return

        X_2_space, index2word = tsne_plot_reduce(expression_words+result_words, method=method, perplexity=perplexity)

        word2index = dict(zip(index2word, range(0, len(index2word))))

        expr_index = [ word2index[x] for x in expression_words ]
        result_index = [ word2index[x] for x in result_words ]
        positives = options['positives']
        color = len(index2word) * ['red']
        expr_words=dict(
            x=list(X_2_space[expr_index, 0]),
            y=list(X_2_space[expr_index, 1]),
            w=[ index2word[i] for i in expr_index ],
            s=[ index2word[i] in positives for i in expr_index ]
        )
        result_words=dict(
            x=list(X_2_space[result_index, 0]),
            y=list(X_2_space[result_index, 1]),
            w=[ index2word[i] for i in result_index ],
        )

        expr_points = [dict(zip(expr_words,t)) for t in zip(*expr_words.values())]
        # Let the trail start at the origin
        trail = [dict(x=0, y=0, w='', x2=0, y2=0)]
        for p in expr_points:
            # tp is our current position
            tp = trail[-1]
            sign = 1 if p['s'] else -1

            # (x, y) is our next position
            x, y = tp['x'] + sign * p['x'], tp['y'] + sign * p['y']
            w = tp['w'] + (' + ' if p['s'] else ' - ') + p['w']

            # Named next_point (not np) to avoid shadowing the numpy alias
            next_point = dict(x=x, y=y, w=w, x2=0, y2=0)
            tp['x2'], tp['y2'] = next_point['x'], next_point['y']
            trail += [ next_point ]

        expr_trail = dict(
            x=[ p['x'] for p in trail ][1:-1],
            y=[ p['y'] for p in trail ][1:-1],
            w=[ p['w'] for p in trail ][1:-1],
            x2=[ p['x2'] for p in trail ][1:-1],
            y2=[ p['y2'] for p in trail ][1:-1]
        )

        update_expression_plot(expr_words, result_words, expr_trail)

        #point_map = dict(zip(index2word, zip(X_2_space[:, 0], X_2_space[:, 1])))

        #update_expression_plot(x=X_2_space[:, 0], y=X_2_space[:, 1], w=index2word)

        df = pd.DataFrame(result_words).assign(weight=result_weights)
        display(HTML(df.to_html()))
    except Exception as ex:
        print(ex)
        
w = widgets.interactive(compute_expression, expression=z.expression, method=z.method, perplexity=z.perplexity)
display(widgets.VBox([
    widgets.HBox([z.expression]),
    widgets.HBox([z.method, z.perplexity]),
    w.children[-1]
]))
# w.update()
# sverige + oslo - stockholm  


### Similarity to Analogies/Dichotomies.

#### Similarity to analogy direction

Let $A$ and $B$ be a word-pair analogy (e.g. "good" and "evil") represented by word vectors $V_A$ and $V_B$. Then the *analogy direction vector* $V_{AB}$ is defined as $V_{AB} = V_B - V_A$. The position of a word $W$ represented by vector $V_W$ is then defined as the cosine similarity of $V_{AB}$ and $V_W$.
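
A minimal sketch of this definition, using made-up 2-D vectors (real word vectors have many more dimensions). The function names are chosen for illustration and are not part of the notebook's utilities.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy_direction_position(v_a, v_b, v_w):
    """Cosine similarity between V_AB = V_B - V_A and V_W."""
    v_ab = [b - a for a, b in zip(v_a, v_b)]
    return cosine_similarity(v_ab, v_w)


# Made-up vectors for illustration only.
v_good, v_evil = [1.0, 0.2], [-1.0, 0.2]
v_villain, v_hero = [-0.8, 0.5], [0.7, 0.6]

print(analogy_direction_position(v_good, v_evil, v_villain))
print(analogy_direction_position(v_good, v_evil, v_hero))
```

A positive value means the word points in the direction from $A$ towards $B$ (here, towards "evil"), and a negative value means it points the opposite way.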

#### TODO Alternative Center

Let $A$ and $B$ be a word-pair analogy (e.g. "good" and "evil") represented by word vectors $V_A$ and $V_B$. Then the position of a word $W$ represented by vector $V_W$ on an *"$A$ to $B$ scale"* is defined as the distance between $V_{A}$ and $V_W$ minus the distance between $V_{B}$ and $V_W$. The distance can be e.g. cosine distance (one minus the cosine similarity) or Euclidean distance.
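
The alternative definition could be sketched as below, here with Euclidean distance and made-up 2-D vectors. The name `scale_position` is hypothetical, and any other distance function can be passed in through the `distance` argument.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def scale_position(v_a, v_b, v_w, distance=euclidean):
    """Distance from A to W minus distance from B to W.

    Negative values mean W lies closer to A, positive values closer to B.
    """
    return distance(v_a, v_w) - distance(v_b, v_w)


# Made-up vectors for illustration only.
v_good, v_evil = [1.0, 0.2], [-1.0, 0.2]
v_villain, v_hero = [-0.8, 0.5], [0.7, 0.6]

print(scale_position(v_good, v_evil, v_villain))
print(scale_position(v_good, v_evil, v_hero))
```

Unlike the direction-based measure, this scale is zero for any word that is equally far from both ends, regardless of the direction it points in.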


In [12]:
# setup_anthology_similarity_and_plot

def setup_plot_xy_words(**fig_kwargs):
    
    xp = bp.figure(**fig_kwargs)
    
    word_source = bm.ColumnDataSource(dict(w=[''],x=[0], y=[0]))  # , color=['blue']))
    word_opts = dict(x_offset=5, y_offset=5, render_mode='canvas', level='glyph', text_font_size="9pt", text_color='black')
    crs = xp.scatter(x='x', y='y', size=8, source=word_source, alpha=0.5, color='blue')
    crw = bm.LabelSet(x='x', y='y', text='w', source=word_source, **word_opts)
    xp.add_layout(crw)

    reference_source = bm.ColumnDataSource(dict(w=[],x=[], y=[]))  # , color=['blue']))
    ref_opts = dict(x_offset=5, y_offset=5, render_mode='canvas', level='glyph', text_font_style='bold', text_color='red')
    crr = xp.scatter(x='x', y='y', size=8, source=reference_source, alpha=0.5, color='red')
    crx = bm.LabelSet(x='x', y='y', text='w', text_font_size="9pt", source=reference_source, **ref_opts)
    xp.add_layout(crx)

    handle = bp.show(xp, notebook_handle=True)
                                 
    return types.SimpleNamespace(
        handle=handle,
        points=crs,
        labels=crw,
        word_source=word_source,
        reference_source=reference_source,
        plot=xp
    )

args = dict(plot_width=900, plot_height=600,
            title='Dichotomy plot',tools=TOOLS, toolbar_location="right",
            #x_axis_type=None,
            #y_axis_type=None
           )
dichotomy_plot = setup_plot_xy_words(**args)


In [12]:
# Code
import numpy as np
from functools import reduce

# @staticmethod
def WordVectorUtility_compute_word_expression(wem, wexpr):
    try:
        options = WordVectorUtility.split_word_expression(wexpr)

        positives = [ wem.word_vec(w) for w in options['positives'] ]
        negatives = [ -1.0 * (wem.word_vec(w)) for w in options['negatives'] ]
        result = reduce(np.add, positives + negatives)
        
        return result, options
    except Exception as ex:
        logger.error(str(ex))
        return None, None
    
#def WordVectorUtility_compute_analogy_direction_vector_similarity(wv, axis, words):
#    scale = (wv[axis[1]] - wv[axis[0]]) if len(axis) == 2 else wv[axis[0]]
#    values = [(1.0 - spatial.distance.cosine(scale, wv[w])) for w in words ]
#    return values

def show_similarity_to_dichotomies(wem, xy_axis, xy_expr, words, display_table=False):
    global dichotomy_plot
    
    words = [ x for x in words if x in wem.vocab.keys() ]
    
    if len(words) == 0:
        return
    
    vectors = [ wem.word_vec(w) for w in words ]
    
    wxs = wem.cosine_similarities(xy_axis[0], vectors)
    wys = wem.cosine_similarities(xy_axis[1], vectors)
    
    dichotomy_plot.word_source.data.update(dict(w=words+['',''], x=list(wxs)+[0.0, 1.0], y=list(wys)+[0.0,1.0]))
    #dichotomy_plot.word_source.data.update(dict(w=words, x=wxs, y=wys))

    if xy_expr is not None and xy_expr[2]:
        axis_vectors = [ wem.word_vec(w) for w in xy_expr[3] ]
        rxs = wem.cosine_similarities(xy_axis[0], axis_vectors + [xy_axis[0], xy_axis[1]])
        rys = wem.cosine_similarities(xy_axis[1], axis_vectors + [xy_axis[0], xy_axis[1]])
        axis_words = xy_expr[3] + [xy_expr[0], xy_expr[1]]
        dichotomy_plot.reference_source.data.update(dict(w=axis_words, x=rxs, y=rys))
        
    else:
        dichotomy_plot.reference_source.data.update(dict(w=[], x=[], y=[]))

    bio.push_notebook(handle=dichotomy_plot.handle)
    
    if display_table:
        df = pd.DataFrame({ 'Word': words, 'Similarity to X-expr': wxs, 'Similarity to Y-expr': wys })
        display(df)
            
def normalize(v):
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm

def display_dichotomies_similarity(
    x_axis_expr, y_axis_expr, raw_text, plot_axis_words=False, do_normalize=False, display_table=False
):
    global state
    
    try:
        wem = state.WEM
        
        x_axis, opt1 = WordVectorUtility_compute_word_expression(wem, x_axis_expr)
        y_axis, opt2 = WordVectorUtility_compute_word_expression(wem, y_axis_expr)
        
        if x_axis is None or y_axis is None:
            return
        
        if do_normalize is True:
            x_axis = normalize(x_axis)
            y_axis = normalize(y_axis)
            
        words_of_interest = list(set(word_tokenize(raw_text)))

        if len(words_of_interest) == 0:
            print("Please enter some words to plot.")
            return
        
        dichotomy_plot.plot.xaxis[0].axis_label = x_axis_expr
        dichotomy_plot.plot.yaxis[0].axis_label = y_axis_expr

        axis_words = list(
            set(opt1.get('positives', []) +
                opt1.get('negatives', []) +
                opt2.get('positives', []) +
                opt2.get('negatives', []))) if opt1 is not None and opt2 is not None else []

        show_similarity_to_dichotomies(
            wem=wem,
            xy_axis=(x_axis, y_axis),
            xy_expr=(x_axis_expr, y_axis_expr, plot_axis_words, axis_words),
            words=words_of_interest,
            display_table=display_table
        )
        
        print('')
        
    except Exception as ex:
        print(ex)
        # raise
    
aw = BaseWidgetUtility(
    progress = widgets.IntProgress(min=0, max=5, step=1, value=0, layout=widgets.Layout(width='98%')),
    words=widgets.Textarea(
        description='Words', placeholder='(enter or paste words to be plotted)',
        value='dator', layout=widgets.Layout(width='80%', height='200px'), continuous_update=True
    ),
    x_axis_expr=widgets.Text(description='X-axis', value='människa', layout=widgets.Layout(width='98%')),
    y_axis_expr=widgets.Text(description='Y-axis', value='maskin', layout=widgets.Layout(width='98%')),
    plot_axis_words=widgets.Checkbox(value=True, description='Plot axis words'),
    do_normalize=widgets.Checkbox(value=False, description='Normalize'),
    display_table=widgets.Checkbox(value=False, description='Show table'),
)

awa = widgets.interactive(
    display_dichotomies_similarity,
    x_axis_expr=aw.x_axis_expr,
    y_axis_expr=aw.y_axis_expr,
    raw_text=aw.words,
    plot_axis_words=aw.plot_axis_words,
    do_normalize=aw.do_normalize,
    display_table=aw.display_table,
)

display(widgets.VBox([
    widgets.HBox([
        aw.words, widgets.VBox([
            aw.progress,
            aw.x_axis_expr,
            aw.y_axis_expr,
            widgets.HBox([aw.plot_axis_words, aw.do_normalize]),
            aw.display_table,
           ])]),
    awa.children[-1]
]))
    
awa.update()


### Venn diagram of words most similar to expressions
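
The core of the cell in this section is a set partition of two "most similar" lists, where words found in both lists get the average of their two scores. A minimal sketch of that step, with made-up result lists and a hypothetical helper name `venn_partition`:

```python
def venn_partition(result1, result2):
    """Split two (word, similarity) lists into left-only, shared and right-only parts."""
    weights1, weights2 = dict(result1), dict(result2)
    shared = set(weights1) & set(weights2)
    only1 = {w: s for w, s in weights1.items() if w not in shared}
    only2 = {w: s for w, s in weights2.items() if w not in shared}
    # Shared words get the average of their two similarity scores.
    both = {w: (weights1[w] + weights2[w]) / 2.0 for w in shared}
    return only1, both, only2


# Made-up results for illustration only.
result1 = [("mother", 0.9), ("person", 0.8), ("girl", 0.7)]
result2 = [("father", 0.9), ("person", 0.6), ("boy", 0.7)]

only1, both, only2 = venn_partition(result1, result2)
print(sorted(only1), sorted(both), sorted(only2))
```

The three dictionaries correspond to the left circle, the intersection and the right circle of the Venn diagram.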

In [13]:
# Code
from matplotlib import pyplot as plt
#from matplotlib_venn import venn2, venn3, venn3_circles
from matplotlib_venn_wordcloud import venn2_wordcloud
#%matplotlib notebook
%matplotlib inline

#import mpld3
#mpld3.enable_notebook()

def seed_word_toplist(wem, seed_word, topn=100):
    values = wem.most_similar(seed_word, topn=topn)
    return values

def color_func(word, *args, **kwargs):
    if word in just_dem:
        # return "#000080" # navy blue
        return "#0000ff" # blue1
    elif word in just_rep:
        # return "#8b0000" # red4
        return "#ff0000" # red1
    else:
        return "#0f0f0f" # gray6 (aka off-black)

def display_wenn_diagram(expr1, expr2, word_count):
    global state, vw
    try:
        vw.progress.value = 0
        expr1, expr2 = expr1.lower(), expr2.lower()
        
        vw.progress.value = 1
        
        #result1, _ = WordVectorUtility.compute_most_similar_expression(state.WEM, expr1, topn=word_count)
        #result2, _ = WordVectorUtility.compute_most_similar_expression(state.WEM, expr2, topn=word_count)
        
        axis1_vector, _ = WordVectorUtility_compute_word_expression(state.WEM, expr1)
        axis2_vector, _ = WordVectorUtility_compute_word_expression(state.WEM, expr2)
        
        result1 = state.WEM.similar_by_vector(axis1_vector, topn=word_count)
        result2 = state.WEM.similar_by_vector(axis2_vector, topn=word_count)
        
        # print([positive_words, negative_words])
        
        if result1 is None or result2 is None:
            return
        
        result1 = result1 + [('x',0.1)]
        result2 = result2 + [('x',0.1)]
        
        word_frequencies = [ dict(result1), dict(result2) ]
        x_words, y_words = list(word_frequencies[0].keys()), list(word_frequencies[1].keys())

        ''' Use avg weight for words in intersection '''
        intersection_words = (set(x_words) & set(y_words))
        intersection_dict = { w: ( word_frequencies[0][w] + word_frequencies[1][w] ) / 2.0 for w in intersection_words }

        word_to_frequency = {}
        word_to_frequency = extend(word_to_frequency, word_frequencies[0])
        word_to_frequency = extend(word_to_frequency, word_frequencies[1])
        
        vw.progress.value = 2
        if len(intersection_dict.keys()) > 0:
            word_to_frequency = extend(word_to_frequency, intersection_dict)
            
        word_to_frequency = { w: 100.0 * word_to_frequency[w] for w in word_to_frequency.keys() }

        fig, ax = plt.subplots(1,1, figsize=(16,12))
        #ax.set_title("Venn diagram over most similar words", fontsize=36)
        vw.progress.value = 3

        v = venn2_wordcloud(
            [ set(x_words), set(y_words) ],
            [ expr1, expr2 ],
            #set_edgecolors=['b', 'r'],
            ax=ax,
            word_to_frequency=word_to_frequency,
            wordcloud_kwargs=dict(relative_scaling=0.5),  #
        )
        v.get_patch_by_id('10').set_color('green')
        v.get_patch_by_id('01').set_color('blue')
        v.get_patch_by_id('11').set_color('red')
        
        vw.progress.value = 5
        vw.progress.value = 0
    except Exception as ex:
        # Reset the progress bar before re-raising; code placed after `raise` never runs
        vw.progress.value = 0
        print(ex)
        raise
    
vw = BaseWidgetUtility(
    progress = wf.create_int_progress_widget(min=0, max=5, step=1, value=0, layout=widgets.Layout(width='40%')),
    expr1=wf.create_text_input_widget(description='Expr #1', value='kvinnor', continuous_update=False),
    expr2=wf.create_text_input_widget(description='Expr #2', value='män', continuous_update=False),
    #plot_vector_words=widgets.Checkbox(value=False, description='Plot refs.', disabled=False)
    word_count=wf.create_int_slider('Word count', min=5, max=500, step=1, value=40)
)

iwv = widgets.interactive(
    display_wenn_diagram,
    expr1=vw.expr1,
    expr2=vw.expr2,
    word_count=vw.word_count
)

display(widgets.VBox([
    widgets.HBox([vw.expr1, vw.expr2, vw.word_count, vw.progress]),
    iwv.children[-1]
]))
    
iwv.update()
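
The weighting step inside `display_wenn_diagram` can be read in isolation. The sketch below is a standalone approximation, not the notebook's own code: `merge_venn_weights` is a hypothetical name, the input pairs are invented, and it assumes that the `extend` helper behaves like a plain dictionary merge.

```python
def merge_venn_weights(result1, result2, scale=100.0):
    """Turn two (word, similarity) lists into two word sets and one weight table."""
    weights1, weights2 = dict(result1), dict(result2)
    shared = set(weights1) & set(weights2)

    # Start from the union of both tables (assumed to be what `extend` does).
    merged = {**weights1, **weights2}

    # Words present in both lists get the average of their two similarities.
    for word in shared:
        merged[word] = (weights1[word] + weights2[word]) / 2.0

    # Scale the similarities so that they can act as word cloud frequencies.
    merged = {word: scale * weight for word, weight in merged.items()}
    return set(weights1), set(weights2), merged


# Invented pairs, shaped like the output of gensim's similarity queries.
left = [("mother", 0.8), ("sister", 0.6), ("person", 0.5)]
right = [("father", 0.75), ("brother", 0.5), ("person", 0.25)]

set1, set2, weights = merge_venn_weights(left, right)
print(sorted(set1 & set2))
print(weights["person"])
```

The two sets and the weight table correspond to the first positional argument and the `word_to_frequency` argument passed to `venn2_wordcloud` above.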

### Semantic axis - display words similar to axis directions

In [15]:
### Semantic axis test

def display_semantic_axis_diagram(semantic_axis_expr, threshold, output_format='chart'):
    global state, saw
    try:
        semantic_axis_expr = semantic_axis_expr.lower()
        semantic_axis_vector, _ = WordVectorUtility_compute_word_expression(state.WEM, semantic_axis_expr)
        positive_words = state.WEM.similar_by_vector(semantic_axis_vector, topn=100)
        negative_words = state.WEM.similar_by_vector(-1.0 * semantic_axis_vector, topn=100)
        
        words = positive_words + [ (w, -d) for w, d in negative_words ]
        words = [ (w, d) for w, d in words if abs(float(d)) > threshold ]
        df = pd.DataFrame(words, columns=['word', 'distance']).sort_values('distance', ascending=False)
        df['color'] = df.distance.apply(lambda x: 'lightgreen' if x >= 0 else 'pink')
        if output_format == 'chart':

            source = bm.ColumnDataSource(dict(y=df.word, right=df.distance, color=df.color))

            ydr = list(df.word.values)
            xdr = bm.DataRange1d(*(-1.0, 1.0))
            word_count = len(df)
            plot = bp.figure(
                title=None,
                x_range=xdr, y_range=ydr,
                plot_width=900, plot_height=300+word_count*10,
                h_symmetry=False, v_symmetry=False,
                min_border=0,
                toolbar_location=None)

            plot.hbar(y="y", right="right", left=0, height=0.5, fill_color='color', source=source)
            
            plot.xgrid.grid_line_color = None

            bp.show(plot)
        else:
            display(df)
    except Exception as ex:
        logger.error(ex)
        # raise
    
saw = BaseWidgetUtility(
    semantic_axis=widgets.Text(
        value='kvinnor-män', placeholder='(word expr)', description='Axis:', continuous_update=False,
        layout=widgets.Layout(width='60%')
    ),
    threshold=widgets.BoundedFloatText(
        value=0.20, min=0.05, max=1.0, step=0.01, description='Threshold:', layout=widgets.Layout(width='15%')
    )
)

isaw = widgets.interactive(
    display_semantic_axis_diagram,
    semantic_axis_expr=saw.semantic_axis,
    threshold=saw.threshold
)

display(widgets.VBox([
    widgets.HBox([saw.semantic_axis, saw.threshold]),
    isaw.children[-1]
]))
    
isaw.update()
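
As a rough illustration of what the chart shows, the sketch below scores words against an axis built from a vector difference. It uses invented three-dimensional vectors and a hypothetical helper, `rank_along_axis`. It also assumes that an expression such as `kvinnor-män` resolves to the difference of the two word vectors, which is a guess about `WordVectorUtility_compute_word_expression` rather than something shown in this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_along_axis(vectors, positive, negative, threshold=0.2):
    """Score every other word against the axis vec(positive) - vec(negative)."""
    axis = [p - n for p, n in zip(vectors[positive], vectors[negative])]
    scores = {
        word: cosine(vector, axis)
        for word, vector in vectors.items()
        if word not in (positive, negative)
    }
    # Drop words that lie close to neither end of the axis.
    kept = [(word, score) for word, score in scores.items() if abs(score) > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented toy vectors; a trained model would have hundreds of dimensions.
toy_vectors = {
    "kvinnor": [1.0, 0.0, 0.5],
    "män": [-1.0, 0.0, 0.5],
    "syster": [0.8, 0.1, 0.4],
    "bror": [-0.8, 0.1, 0.4],
    "hus": [0.0, 1.0, 0.2],
}

ranked = rank_along_axis(toy_vectors, "kvinnor", "män")
for word, score in ranked:
    print(f"{word:8s} {score:+.3f}")
```

Because the cosine similarity to the negated axis is the negative of the similarity to the axis itself, scoring each word once should agree with the two `similar_by_vector` queries above, up to their `topn` cutoff.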

### Some notes


#### Tool selection strategy:


|Task|Corpus size|Tool|
|---|---|---|
|Attribute model (related words)|> 1M|word2vec with large window|
|Attribute model (related words)|< 1M|WordRank|
|Interchangeability model (synonyms)|\*|FastText, word2vec with small window, VarEmbed|

- Ofir Pele and Michael Werman, "A linear time histogram metric for improved SIFT matching".
- Ofir Pele and Michael Werman, "Fast and robust earth mover's distances".
- Matt Kusner et al. "From Word Embeddings To Document Distances".


#### Sample research questions (within the humanities)
- Find sentiments in text...?
- Find implicit stereotyping...?
- Find semantic relationships (word analogies)...?
- Find "viral text", i.e. the same or similar documents (see [Cordell's Viral Text project](http://viraltexts.org/), which does not use word embeddings though)
- Ben Schmidt: [Vector Space Models for the Digital Humanities](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html) [blog](http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html)

#### How to validate a model
- Test semantic relationships using linear algebraic calculations (i.e. word pair analogies such as queen-king)
- Create a dataset to validate the model, e.g. a table with related words
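
The two validation ideas above can be combined into a small analogy test. The following is a minimal sketch with invented two-dimensional vectors and hypothetical helper names (`solve_analogy`, `analogy_accuracy`); with a trained gensim model the same query would normally go through `most_similar(positive=[b, c], negative=[a])`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded, as is usual for this kind of test.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


def analogy_accuracy(vectors, questions):
    """Share of (a, b, c, expected) questions that are answered correctly."""
    hits = sum(solve_analogy(vectors, a, b, c) == expected for a, b, c, expected in questions)
    return hits / len(questions)


# Invented toy vectors: the first dimension loosely encodes royalty, the second gender.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.0],
}

# A hand-made validation table of the kind suggested above.
questions = [
    ("man", "king", "woman", "queen"),
    ("king", "man", "queen", "woman"),
]

accuracy = analogy_accuracy(toy_vectors, questions)
print(accuracy)
```

For a real model the table of questions would have to be curated for the corpus language and domain, since generic analogy sets may not cover its vocabulary.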

#### Sample pitfalls and challenges

- Visualization using dimensionality reduction (e.g. t-SNE, PCA) [Shusen Liu, 2016]
- Size of data
- How to test validity of model

#### Some critical perspectives
- Where to begin...?

#### Visualization Tools

- Word Embedding Visual Explorer [Online](http://residue3.sci.utah.edu/?)
- wevi: word embedding visual inspector [source](https://ronxin.github.io/wevi/), [paper](https://arxiv.org/abs/1411.2738)
- [Taporware](http://taporware.ualberta.ca/)
- LAMVI (a tool for inspecting model training)
- TensorBoard (Google)
- ~~Sentiview Hiérarchie~~

#### TODOs

- Test other word embeddings (see alternatives above)
- Append PoS-tags to words for better precision (see e.g. [link](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/))
- Use PoS-tags as filters instead of stopwords
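
The last two TODO items could be prototyped along the following lines. This is only a sketch under stated assumptions: the tag names follow the Universal Dependencies convention, the `|` separator and the function name `tag_tokens` are hypothetical, and the tagged sentence is written by hand rather than produced by a tagger.

```python
def tag_tokens(tagged_tokens, keep_tags=("NOUN", "VERB", "ADJ")):
    """Keep only tokens with the wanted PoS tags and append the tag to each word.

    Filtering on tags replaces a stopword list, and the appended tag keeps
    homographs such as bank (noun) and bank (verb) apart during training.
    """
    return [f"{word.lower()}|{tag}" for word, tag in tagged_tokens if tag in keep_tags]


# Hand-tagged example sentence; a real pipeline would get these pairs from a PoS tagger.
sentence = [
    ("The", "DET"),
    ("bank", "NOUN"),
    ("can", "AUX"),
    ("bank", "VERB"),
    ("on", "ADP"),
    ("it", "PRON"),
]

tokens = tag_tokens(sentence)
print(tokens)
```

Lists produced this way can be fed to the embedding trainer in place of the raw token lists, at the cost of a larger and sparser vocabulary.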

