> You shall know a word by the company it keeps
>
> *John Rupert Firth, 1957*

This idea is used to learn numerical representations of words. Simply assigning a random, unique numerical code to each word does not capture similarities and differences between words, so it cannot exploit what, for example, `cats` and `dogs` have in common in a pet context.

However, one could try to capture each word by a list of numbers, for example:

`[0.5, 0.26, -0.213, 0.76]`

The first number could indicate whether the word is more material or more conceptual, the second whether it is a verb or not, the third whether it is an animal, the fourth whether it is furry, and so on. Of course $4$ numbers are not enough, but no worries: modern computers can handle a bit more than $4$ numbers.
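
A minimal sketch of why such lists of numbers are more useful than arbitrary codes: similar words get similar lists, and that similarity can be measured. The feature values below are invented for illustration (they are not learned from data), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math

# Hypothetical hand-made features: [material, verb, animal, furry]
vectors = {
    "cat": [0.9, 0.0, 0.9, 0.8],
    "dog": [0.9, 0.0, 0.9, 0.7],
    "run": [0.2, 1.0, 0.0, 0.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_run = cosine_similarity(vectors["cat"], vectors["run"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs run: {cat_run:.3f}")
```

With these made-up numbers, `cat` and `dog` score much closer to each other than `cat` and `run`, which is exactly what a random unique code per word cannot express.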

The concept is cute, but how would one define these numbers? Well, by the company of the words! If you have a lot of sentences, you can look at windows of, for example, 5 adjacent words. For example, the sentence

`The grey cat has fluffy hair and drinks milk` has these windows:

* the, grey, **cat**, has, fluffy
* grey, cat, **has**, fluffy, hair
* cat, has, **fluffy**, hair, and
* has, fluffy, **hair**, and, drinks
* fluffy, hair, **and**, drinks, milk
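
The windows listed above can be generated with a few lines of Python. This is only a sketch: `context_windows` is a helper name chosen here, and the window size of 5 follows the example.

```python
def context_windows(words, size=5):
    """Return every run of `size` adjacent words from the list `words`."""
    return [words[i:i + size] for i in range(len(words) - size + 1)]

sentence = "The grey cat has fluffy hair and drinks milk".lower().split()
windows = context_windows(sentence)

for window in windows:
    middle = len(window) // 2
    target = window[middle]                           # the word to predict
    context = window[:middle] + window[middle + 1:]   # its company
    print(f"{target:>7} <- {context}")
```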

The middle word of each window is bold on purpose: it is kept company by two other words on each side. With a so-called *neural network* one could try to predict the middle word from the four accompanying words. Of course, this is ill-defined, because several words often fit in the middle of the same $4$ words:

* the, grey, **cat**, has, fluffy
* the, grey, **dog**, has, fluffy
* the, grey, **bird**, has, fluffy

However,

* the, grey, **snake**, has, fluffy

makes less sense, and in general will not occur in a corpus such as all the text of Wikipedia.
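
To make this concrete, one can count which middle words actually occur with a given context. The corpus below is made up for illustration, and the counting is only a stand-in for what a neural network would learn from far more text.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = [
    "the grey cat has fluffy hair",
    "the grey dog has fluffy hair",
    "the grey bird has fluffy feathers",
    "the grey snake has smooth scales",
]

# Map each 4-word context (two words on each side) to the middle words seen with it.
middle_words = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for i in range(2, len(words) - 2):
        context = tuple(words[i - 2:i] + words[i + 1:i + 3])
        middle_words[context][words[i]] += 1

fluffy_context = ("the", "grey", "has", "fluffy")
print(dict(middle_words[fluffy_context]))
```

In this toy corpus `cat`, `dog` and `bird` share the context, while `snake` never appears in it.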


For the network to perform well, it has to learn similar features for cats, dogs and birds, but not for snakes. If you then train the network on, for example, Wikipedia or any other large corpus of a language, the numerical representations start to make sense. In practice the numerical representation of a word typically consists of tens to a few hundred numbers.

Let's use gensim and see what happens.

Run the cell below.

In [None]:
from gensim.models import Word2Vec, KeyedVectors

Now let's load the sentences from the previous exercise.

In [None]:
sentences = []
with open('poems.txt', 'r') as poems_file:
    for line in poems_file:
        # Lowercase and split on whitespace to get a list of words.
        sentence = line.strip().lower().split()
        # Skip empty lines, which produce an empty list.
        if sentence:
            sentences.append(sentence)

And then create a model and train it on these sentences.

In [None]:
# gensim >= 4.0 names this parameter `vector_size`; older releases call it `size`.
model = Word2Vec(vector_size=64, window=5, min_count=5)

model.build_vocab(sentences)
model.train(sentences, total_examples=len(sentences), epochs=1000)
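
The `min_count=5` argument tells gensim to ignore words that occur fewer than five times in total. A rough pure-Python sketch of that filter shows how quickly a small corpus shrinks. Note that `toy_sentences` is a made-up stand-in, not the poems file, and the code only mimics the idea rather than gensim's internals.

```python
from collections import Counter

# Hypothetical stand-in for the tokenized poems.
toy_sentences = [
    ["my", "heart", "is", "open"],
    ["my", "heart", "is", "heavy"],
    ["the", "heart", "knows"],
    ["my", "heart", "sings"],
    ["a", "heart", "of", "gold"],
]

min_count = 5
word_counts = Counter(word for sentence in toy_sentences for word in sentence)

# Keep only words that appear at least `min_count` times in total.
vocabulary = {word for word, count in word_counts.items() if count >= min_count}
print(vocabulary)
```

Here only `heart` survives, so every other word would get no vector at all.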

Check out the 64 numbers that represent **"heart"**

In [None]:
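# Word vectors live on the `wv` attribute; indexing the model directly no longer works in gensim 4.
model.wv['heart']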

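The number of sentences is way too small to create a decent model of the words; as a rough rule of thumb, one needs at least 100 sentences per word. Stanford's NLP Group published decent pre-trained word vectors (GloVe) for several large corpora, including one of 2 billion tweets; the `glove.6B` set loaded below was trained on 6 billion tokens from Wikipedia and Gigaword. Let's load it.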

In [None]:
# Takes about 2 minutes. binary=False reads the file as plain text in word2vec format.

stanford_model = KeyedVectors.load_word2vec_format('glove.6B.100d.bin', binary=False)

Now that this is done, let's find out what **queen - woman + man** evaluates to.

In [None]:
stanford_model.most_similar(positive=['queen', 'man'], negative=['woman'])

How about **walking - aiming + aim**?

In [None]:
stanford_model.most_similar(positive=['walking', 'aim'], negative=['aiming'])
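
To see what such a query does conceptually, here is a simplified sketch: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. The three-dimensional vectors and the `analogy` helper are invented for illustration; gensim's actual implementation differs in details such as normalization.

```python
import math

# Hypothetical vectors with dimensions [royalty, gender, food].
# Real models learn their dimensions from data, and they are not this interpretable.
toy_vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(positive, negative, vectors):
    """Rank all other words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # Like gensim, leave the input words themselves out of the ranking.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    scored = [(w, cosine_similarity(query, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = analogy(positive=["queen", "man"], negative=["woman"], vectors=toy_vectors)
print(ranking)
```

With these hand-picked numbers `king` ranks first; with learned vectors the result is only approximately on target, which is why `most_similar` returns a ranked list with scores.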

# Exercise

As an exercise, compute **France - Netherlands + Amsterdam**.

In [None]:
# TODO: One line of code

Some words seem to make sense, but that's probably just randomness anyway.