# h2o for w2v

### This is a simple natural language processing (NLP) demo that shows the power of word embeddings, such as those generated by the Word2vec algorithm.

It was designed as part of the development of this course (https://learn.mikegchambers.com/p/aws-machine-learning-specialty-certification-course) for the AWS Machine Learning Specialty Certification. However, its use is general, and it has been published here for the wider good.

As a courtesy please attribute this repository if you present this demo.

## Import the libraries...

This includes a helper class that is saved alongside this notebook, which keeps the notebook as clutter-free as possible.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from helper import helper as h

## The h2o dataset

(Dataset is probably too grandiose a name for this simple Python dict but it gets the job done.)

This Python dictionary maps each word to a corresponding two-dimensional vector. These vectors were produced by hand, but in principle they could have been generated by a Word2vec algorithm.

In [None]:
words = {
    'ice'   : np.array([-1.0,-1.0]),
    'water' : np.array([-1.0, 0.0]),
    'steam' : np.array([-1.0, 1.0]),
    'freeze': np.array([ 0.0,-1.0]),
    'thaw'  : np.array([ 0.0, 0.0]),
    'boil'  : np.array([ 0.0, 1.0]),
    'cold'  : np.array([ 1.0,-1.0]),
    'tepid' : np.array([ 1.0, 0.0]),
    'hot'   : np.array([ 1.0, 1.0]),
}

## Plot the Words on a Vector Map

Here we use matplotlib to create a 2D word map. You can start to see the relationships between the words in vector space.

In [None]:
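fig = plt.figure()
ax = fig.add_subplot(111)
ax.margins(0.2)  # pad the axes so the labels at the edges are not clipped
for word, vec in words.items():
    ax.scatter(vec[0], vec[1], s=10)            # mark the word's position
    ax.text(vec[0], vec[1], word, fontsize=20)  # label the point with the word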

## Word Equations

Now we can perform some word equations, and see simple NLP in action:

First, we take two word vectors (in this case 'water' and 'hot') and add them to form a new vector.

In [None]:
new_vector = words['water'] + words['hot']

print(new_vector)

We can find the location of the resulting vector by eye in the vector space plotted above.

Alternatively, we can use a helper function called `nearest_word`, which calculates the Euclidean distance between the new vector and every word vector, then returns the word with the closest match.

In [None]:
h.nearest_word([new_vector], words)
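
The helper's source is not shown in this notebook, so the following is only a minimal sketch of how a nearest-word lookup by Euclidean distance might work. The function name `nearest_word_sketch` is hypothetical and is not part of the helper class. The dictionary is repeated so the sketch runs on its own.

```python
import numpy as np

# Same toy vectors as in the notebook, repeated so this sketch is self-contained.
words = {
    'ice'   : np.array([-1.0, -1.0]),
    'water' : np.array([-1.0,  0.0]),
    'steam' : np.array([-1.0,  1.0]),
    'freeze': np.array([ 0.0, -1.0]),
    'thaw'  : np.array([ 0.0,  0.0]),
    'boil'  : np.array([ 0.0,  1.0]),
    'cold'  : np.array([ 1.0, -1.0]),
    'tepid' : np.array([ 1.0,  0.0]),
    'hot'   : np.array([ 1.0,  1.0]),
}

def nearest_word_sketch(vector, vocab):
    """Hypothetical stand-in for the helper: return the word in `vocab`
    whose vector has the smallest Euclidean distance to `vector`."""
    # Flatten so that both a bare vector and a one-item list wrapping it work.
    target = np.asarray(vector, dtype=float).reshape(-1)
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - target))

print(nearest_word_sketch(words['water'] + words['hot'], words))
```

Because the toy vectors sit on a regular grid, the sum here lands exactly on an existing word, so the smallest distance is zero.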

In summary, we just 'calculated' that:

`water + hot = boil`

Which is, if you'll pardon the pun, very cool!

We can do some more equations, and reduce the code down to a single line:

In [None]:
h.nearest_word(words['water'] + words['cold'], words)

In [None]:
h.nearest_word(words['ice'] + words['hot'], words)
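
The three sums so far can also be checked with plain NumPy, without the helper. This sketch looks for an exact match rather than a nearest neighbour, which works here only because the hand-made vectors are whole numbers on a grid. The function name `exact_word` is hypothetical.

```python
import numpy as np

# Same toy vectors as in the notebook, repeated so this sketch is self-contained.
words = {
    'ice'   : np.array([-1.0, -1.0]),
    'water' : np.array([-1.0,  0.0]),
    'steam' : np.array([-1.0,  1.0]),
    'freeze': np.array([ 0.0, -1.0]),
    'thaw'  : np.array([ 0.0,  0.0]),
    'boil'  : np.array([ 0.0,  1.0]),
    'cold'  : np.array([ 1.0, -1.0]),
    'tepid' : np.array([ 1.0,  0.0]),
    'hot'   : np.array([ 1.0,  1.0]),
}

def exact_word(vector, vocab):
    """Hypothetical helper: return the word whose vector equals `vector`,
    or None when no word sits exactly at that point."""
    for word, vec in vocab.items():
        if np.array_equal(vec, vector):
            return word
    return None

for a, b in [('water', 'hot'), ('water', 'cold'), ('ice', 'hot')]:
    total = words[a] + words[b]
    print(f"{a} + {b} = {total} -> {exact_word(total, words)}")
```

With real Word2vec embeddings a sum would almost never coincide exactly with another word's vector, which is why a distance-based lookup is the usual approach.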

## What if we Subtract the Vectors?

In [None]:
h.nearest_word(words['ice'] - words['cold'], words)
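
Unlike the sums above, this subtraction gives a point that is not in the dictionary, so the distance calculation matters. The sketch below works through the arithmetic with plain NumPy. It assumes `nearest_word` uses ordinary Euclidean distance as described earlier, and it does not call the helper itself.

```python
import numpy as np

# Same toy vectors as in the notebook, repeated so this sketch is self-contained.
words = {
    'ice'   : np.array([-1.0, -1.0]),
    'water' : np.array([-1.0,  0.0]),
    'steam' : np.array([-1.0,  1.0]),
    'freeze': np.array([ 0.0, -1.0]),
    'thaw'  : np.array([ 0.0,  0.0]),
    'boil'  : np.array([ 0.0,  1.0]),
    'cold'  : np.array([ 1.0, -1.0]),
    'tepid' : np.array([ 1.0,  0.0]),
    'hot'   : np.array([ 1.0,  1.0]),
}

# [-1, -1] - [1, -1] = [-2, 0], which lies outside the 3x3 grid of words.
result = words['ice'] - words['cold']

# Euclidean distance from the result to every word vector.
distances = {w: float(np.linalg.norm(v - result)) for w, v in words.items()}

# Print the words from closest to farthest.
for word, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{word:<7}{dist:.3f}")
```

Under that assumption, 'water' is the closest word, at a distance of 1, which reads as `ice - cold = water`.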

_Git pull requests welcome.  Thanks for playing with data!_