# Introduction
In this notebook I follow up on `1. Machine Translation - Character Level Model` by building a seq2seq model based on word embeddings instead of character-level embeddings.
My previous notebook was heavily inspired by Francois Chollet's article [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) and the accompanying [code](https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py). This notebook will be similar, but will probably diverge more.

## Approach
Instead of one-hot encoding characters and treating them as separate units in the input and output sequences, I will be encoding full word tokens using an embedding layer.

It would be nice to use pre-trained word embeddings, which are widely available for English but not as common for Swedish. One approach would be to train Swedish embeddings on an auxiliary task, but let's just use the standard Keras `Embedding` layer this time.
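
To make the difference from one-hot encoding concrete, here is a minimal NumPy sketch (not part of the model below). It shows that looking up rows in an embedding matrix gives the same result as multiplying one-hot vectors by that matrix. The vocabulary size, embedding dimension and random weights are made-up values for illustration.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([2, 0, 4])  # a "sentence" of three word indices

# Embedding-layer view: pick rows of the matrix by index.
looked_up = embedding_matrix[token_ids]

# One-hot view: build (3, vocab_size) one-hot rows and multiply.
one_hot = np.eye(vocab_size)[token_ids]
multiplied = one_hot @ embedding_matrix

print(looked_up.shape)                     # (3, 3)
print(np.allclose(looked_up, multiplied))  # True
```

The lookup avoids ever building the large one-hot vectors, which matters once the vocabulary has thousands of words instead of a few dozen characters.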

# Data
I will be using data from the same source as Chollet, http://www.manythings.org/anki/. I'm using the 17303-sentence swe-eng data set, which contains English sentences and their Swedish translations. The French data set used by Chollet is much larger, but he limited his training set to 10 000 sentences and used 20% of it for validation during training.

## Load the data

In [1]:
data_path = 'data/swe-eng/swe.txt'
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

Read all sentences. I won't add the beginning of sentence symbol `<BOS>` until after tokenization, to avoid it being split into multiple tokens.

In [2]:
input_sentences, target_sentences = [], []
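for line in lines:
    try:
        input_text, target_text, *_ = line.split('\t')
    except ValueError:
        # Skip malformed lines (e.g. a trailing empty line) instead of
        # appending the previous sentence pair a second time.
        print(line)
        continue

    input_sentences.append(input_text)
    target_sentences.append(target_text)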




## Tokenize and encode the data
I will use the NLTK tokenizer to split sentences into tokens, and then process these tokens with the Keras tokenizer, which maps tokens to integers.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.tokenize import word_tokenize

Using TensorFlow backend.


NLTK splits sentences into lists, but the Keras tokenizer expects them to be strings with a delimiter character.
I will convert the lists back to strings with space as the delimiter.

Also, I will be adding a beginning of sentence token `<BOS>` to all target sequences, which will be used to seed the predictions during inference.
I will also add the end of sentence token `<EOS>` to the targets.

I do not need these tokens for the inputs, as I will simply read the full input sentence during inference. The only stop condition I have will be based on the output. 

I will end up with four special tokens:
* `<BOS>` and `<EOS>` are added explicitly 
* `<PAD>` is added implicitly during padding
* `<UNK>`, the unknown token, will be inserted during inference for unrecognized words
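
The role of the special tokens can be sketched in plain Python. This is only an illustration of the idea, not the Keras `Tokenizer` implementation; how unknown words are actually handled there depends on its `oov_token` setting. The helper names `build_vocab` and `encode` are hypothetical.

```python
# Hypothetical helpers; index 0 is reserved for padding.
PAD, UNK = "<pad>", "<unk>"

def build_vocab(tokenized_sentences):
    """Map each distinct lower-cased token to an integer, starting at 2."""
    word_index = {PAD: 0, UNK: 1}
    for sentence in tokenized_sentences:
        for token in sentence.lower().split():
            word_index.setdefault(token, len(word_index))
    return word_index

def encode(sentence, word_index):
    """Replace tokens missing from the vocabulary with the <unk> index."""
    unk = word_index[UNK]
    return [word_index.get(tok, unk) for tok in sentence.lower().split()]

vocab = build_vocab(["<BOS> Spring ! <EOS>", "<BOS> Hej ! <EOS>"])
print(vocab)
# {'<pad>': 0, '<unk>': 1, '<bos>': 2, 'spring': 3, '!': 4, '<eos>': 5, 'hej': 6}
print(encode("<BOS> Spring nu ! <EOS>", vocab))
# [2, 3, 1, 4, 5]
```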

In [4]:
input_tokenized = [" ".join(word_tokenize(sentence)) for sentence in input_sentences]

In [5]:
input_tokenized[0]

'Run !'

NLTK even comes with a pre-trained Swedish tokenizer!

In [6]:
target_tokenized = ["<BOS> " + " ".join(word_tokenize(sentence, language='swedish')) + " <EOS>" for sentence in target_sentences]

In [7]:
target_tokenized[0]

'<BOS> Spring ! <EOS>'

I want to keep the punctuation marks `!?.,`, so I won't be filtering those out.
I think one of the nicest features of my character level model was that it understood where to put punctuation.
I will not be keeping the case of characters. Sure, it's nice if sentences start with a capital letter, but this can easily be handled by a heuristic during inference.
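
One possible version of such a heuristic is sketched below. The function name `detokenize` is hypothetical and it is not used elsewhere in this notebook. It only capitalizes the first character of the output, so a second sentence in the same string would stay lower-case.

```python
PUNCTUATION = {"!", "?", ".", ","}

def detokenize(decoded_sentence):
    """Drop <bos>/<eos>, attach punctuation to the previous word, capitalize the start."""
    tokens = [t for t in decoded_sentence.split() if t not in ("<bos>", "<eos>")]
    text = ""
    for token in tokens:
        if token in PUNCTUATION:
            text += token                          # no space before punctuation
        else:
            text += (" " if text else "") + token
    # Upper-case only the first character and leave the rest untouched.
    return text[:1].upper() + text[1:]

print(detokenize("jag vet att det är svårt . <eos>"))  # Jag vet att det är svårt.
```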

Also, I will keep the `<>` characters in the targets, as I use them in `<BOS>` and `<EOS>`.
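
The effect of the two filter strings can be approximated in plain Python. This is a rough sketch of the behaviour (lower-case, blank out the filtered characters, split on whitespace), not the Keras code itself, and `simple_word_sequence` is a hypothetical helper.

```python
def simple_word_sequence(text, filters, lower=True):
    """Lower-case, replace every character in `filters` with a space, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()

input_filters = '"#$%&()*+-/:;<=>@[\\]^_`{|}~ '   # also removes < and >
target_filters = '"#$%&()*+-/:;=@[\\]^_`{|}~ '    # keeps < and >

sentence = "<BOS> Spring ! <EOS>"
print(simple_word_sequence(sentence, target_filters))  # ['<bos>', 'spring', '!', '<eos>']
print(simple_word_sequence(sentence, input_filters))   # ['bos', 'spring', '!', 'eos']
```

With the input filters the angle brackets are stripped, which is why the target tokenizer needs its own filter string.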

In [8]:
input_tokenizer = Tokenizer(filters='"#$%&()*+-/:;<=>@[\]^_`{|}~ ')
input_tokenizer.fit_on_texts(input_tokenized)
input_sequences = input_tokenizer.texts_to_sequences(input_tokenized)

In [9]:
target_tokenizer = Tokenizer(filters='"#$%&()*+-/:;=@[\]^_`{|}~ ')
target_tokenizer.fit_on_texts(target_tokenized)
target_sequences = target_tokenizer.texts_to_sequences(target_tokenized)

In [10]:
input_vocab_size = len(input_tokenizer.word_index)
target_vocab_size = len(target_tokenizer.word_index)

Build a reverse lookup table from integer to word.

In [11]:
reverse_input_word_index = dict(
    (i, word) for word, i in input_tokenizer.word_index.items())
reverse_target_word_index = dict(
    (i, word) for word, i in target_tokenizer.word_index.items())

In [12]:
" ".join((map(lambda x: reverse_input_word_index[x], input_sequences[0])))

'run !'

In [13]:
" ".join((map(lambda x: reverse_target_word_index[x], target_sequences[0])))

'<bos> spring ! <eos>'

In [14]:
max_input_seq_len = max([len(sent) for sent in input_sequences])

In [15]:
max_target_seq_len = max([len(sent) for sent in target_sequences])

In [16]:
print("Max input sequence length: {}".format(max_input_seq_len))
print("Max target sequence length: {}".format(max_target_seq_len))

Max input sequence length: 36
Max target sequence length: 34


The longest sentence is 36 tokens long. Last time I limited sentences to 50 characters to reduce the impact of padding. This time I will not limit them.

## Pad the data

In [17]:
input_sequences = pad_sequences(input_sequences, maxlen=max_input_seq_len, padding='post')
target_sequences = pad_sequences(target_sequences, maxlen=max_target_seq_len, padding='post')
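
As a plain-Python sketch of what post-padding does: every sequence is extended on the right with zeros until all have the same length. Keras's `pad_sequences` also handles truncation and other options, which are left out here. `pad_post` is a hypothetical helper.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (truncation is not handled)."""
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

toy_sequences = [[5, 2], [7, 8, 9, 3], [4]]
toy_maxlen = max(len(seq) for seq in toy_sequences)
print(pad_post(toy_sequences, toy_maxlen))
# [[5, 2, 0, 0], [7, 8, 9, 3], [4, 0, 0, 0]]
```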

Shift the decoder targets by one timestep and reshape to fit each target inside an array.

In [18]:
import numpy as np

In [19]:
decoder_target_sequences = np.zeros((target_sequences.shape[0], target_sequences.shape[1]))
decoder_target_sequences[:,:-1] = target_sequences[:,1:]
decoder_target_sequences = decoder_target_sequences.reshape(decoder_target_sequences.shape[0], 
                                                            decoder_target_sequences.shape[1], 1)

In [20]:
decoder_target_sequences.shape

(17304, 34, 1)
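
The shift is easier to see on a toy batch. The token ids below are made up (1 for `<bos>`, 2 for `<eos>`, 0 for padding); the operations mirror the cell above.

```python
import numpy as np

# Toy padded target batch: 1 = <bos>, 2 = <eos>, 0 = padding (made-up ids).
toy_targets = np.array([[1, 5, 6, 2, 0],
                        [1, 7, 2, 0, 0]])

# The decoder input at step t is token t; the expected output is token t + 1.
toy_decoder_targets = np.zeros_like(toy_targets)
toy_decoder_targets[:, :-1] = toy_targets[:, 1:]

# Add a trailing axis so each timestep holds a single class id.
toy_decoder_targets = toy_decoder_targets[..., np.newaxis]

print(toy_decoder_targets[..., 0])
# [[5 6 2 0 0]
#  [7 2 0 0 0]]
print(toy_decoder_targets.shape)  # (2, 5, 1)
```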

## Divide data into a training and a validation set
I will use 8 000 sentences as the training set and 2 000 as the validation set.

In [21]:
import numpy as np

training_size, validation_size = 8000, 2000

shuffle_idx = np.random.permutation(len(input_sentences))

train_idx, val_idx = shuffle_idx[:training_size], shuffle_idx[training_size:training_size+validation_size]

input_sequences_train, input_sequences_val = input_sequences[train_idx], input_sequences[val_idx]

target_sequences_train, target_sequences_val = target_sequences[train_idx], target_sequences[val_idx]

decoder_target_sequences_train, decoder_target_sequences_val = decoder_target_sequences[train_idx], decoder_target_sequences[val_idx]

In [80]:
input_sentences_train = np.array(input_sentences)[train_idx]
input_sentences_val = np.array(input_sentences)[val_idx]

target_sentences_train = np.array(target_sentences)[train_idx]
target_sentences_val = np.array(target_sentences)[val_idx]
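
The split above draws a new permutation on every run. A seeded variant could look like the sketch below; the seed and the helper name `split_indices` are additions for illustration and were not used for the results in this notebook.

```python
import numpy as np

def split_indices(n_samples, n_train, n_val, seed=0):
    """Return disjoint train and validation index arrays drawn from one shuffle."""
    if n_train + n_val > n_samples:
        raise ValueError("not enough samples for the requested split")
    rng = np.random.default_rng(seed)   # seeding is an addition, for repeatable runs
    shuffled = rng.permutation(n_samples)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]

train_idx_demo, val_idx_demo = split_indices(n_samples=100, n_train=80, n_val=20)
print(len(train_idx_demo), len(val_idx_demo))        # 80 20
print(set(train_idx_demo).isdisjoint(val_idx_demo))  # True
```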

# Training Model

Just like in my previous notebook, I will opt for GRUs instead of LSTMs, mainly because I like the idea of their simpler architecture and because I would like to compare my results with my previous attempt.
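
To illustrate what "simpler" means here, below is a NumPy sketch of a single GRU timestep in one common formulation: two gates and one hidden state, with no separate cell state as in an LSTM. The sizes and random weights are made up, and the Keras `GRU` layer may differ in details such as bias handling.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU timestep with an update gate z and a reset gate r."""
    z = sigmoid(x @ params["W_z"] + h_prev @ params["U_z"] + params["b_z"])
    r = sigmoid(x @ params["W_r"] + h_prev @ params["U_r"] + params["b_r"])
    h_cand = np.tanh(x @ params["W_h"] + (r * h_prev) @ params["U_h"] + params["b_h"])
    # Blend the old state and the candidate state.
    return z * h_prev + (1.0 - z) * h_cand

# Hypothetical toy sizes and random weights, for illustration only.
input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
params = {}
for gate in "zrh":
    params["W_" + gate] = rng.normal(size=(input_dim, hidden_dim))
    params["U_" + gate] = rng.normal(size=(hidden_dim, hidden_dim))
    params["b_" + gate] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # run over a 5-step toy sequence
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```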

In [22]:
# Config
batch_size = 64
latent_dim = 256
embedding_dim = 100

In [23]:
from keras.layers import GRU, Embedding, Input, Dense
from keras.models import Model

I name layers that I won't reference in the future `x`.

In [98]:
encoder_inputs = Input(shape=(None,))
x = Embedding(input_vocab_size+1, embedding_dim)(encoder_inputs)
x, state_h = GRU(latent_dim, return_state=True)(x)

decoder_inputs = Input(shape=(None,))
decoder_embeddings = Embedding(target_vocab_size+1, embedding_dim)(decoder_inputs)
decoder_gru = GRU(latent_dim, return_sequences=True, return_state=True)
x, _ = decoder_gru(decoder_embeddings, initial_state=state_h)
decoder_dense = Dense(target_vocab_size+1, activation='softmax')
decoder_outputs = decoder_dense(x)

In [99]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

My targets are sequences of integers, so I use the `sparse_categorical_crossentropy` loss function.

In [113]:
# Run training
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', sample_weight_mode='temporal')

Let's use `sample_weight` to ignore the padding when calculating the loss function!

In [109]:
sample_weights_train = (decoder_target_sequences_train != 0).reshape(decoder_target_sequences_train.shape[0], decoder_target_sequences_train.shape[1])

sample_weights_train = sample_weights_train.astype(int)
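
The idea behind these weights can be sketched in NumPy: compute the negative log-likelihood per timestep, zero out the padded positions, and average over the rest. This is an illustration under that assumption; the exact way Keras normalises a weighted loss may differ. `masked_sparse_crossentropy` is a hypothetical helper.

```python
import numpy as np

def masked_sparse_crossentropy(probs, targets, pad_id=0):
    """Mean negative log-likelihood over non-padding timesteps only.

    probs:   (batch, time, vocab) predicted distributions
    targets: (batch, time) integer class ids, with pad_id marking padding
    """
    batch_idx, time_idx = np.indices(targets.shape)
    token_nll = -np.log(probs[batch_idx, time_idx, targets])
    weights = (targets != pad_id).astype(float)
    return (token_nll * weights).sum() / weights.sum()

# Toy example: uniform predictions over a 4-word vocabulary.
toy_probs = np.full((2, 3, 4), 0.25)
toy_loss_targets = np.array([[2, 3, 0],
                             [1, 0, 0]])
toy_loss = masked_sparse_crossentropy(toy_probs, toy_loss_targets)
print(toy_loss)  # equals log(4), about 1.386
```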

In [114]:
model.fit([input_sequences_train, target_sequences_train], decoder_target_sequences_train,
          batch_size=batch_size,
          epochs=2,
          sample_weight=sample_weights_train,
          validation_data=([input_sequences_val, target_sequences_val], decoder_target_sequences_val))

Train on 8000 samples, validate on 2000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1d70239d7b8>

Let's see how it translates some of the training sentences.

In [115]:
pred_train = model.predict([input_sequences_train[:10], target_sequences_train[:10]])
pred_val = model.predict([input_sequences_val[:10], target_sequences_val[:10]])

In [69]:
def decode_output_seq(output_seq):
    return " ".join([reverse_target_word_index[sampled_word_index] if sampled_word_index > 0 else "" for sampled_word_index in np.argmax(output_seq, 1)])

In [123]:
print("Predictions by the training model (Fed correct decoder input at each step)")
for i, pred in enumerate(pred_train[:5]):
    print("Input Sentence: " + input_sentences_train[i])
    print("Target Sentence: " + target_sentences_train[i])
    print("Predicted Sentence: " + decode_output_seq(pred))
    print("--")

Predictions by the training model (Fed correct decoder input at each step)
Input Sentence: Did you enjoy that?
Target Sentence: Njöt du av det där?
Predicted Sentence: tom är inte tom ? ? <eos> . . . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: He is kind.
Target Sentence: Han är snäll.
Predicted Sentence: jag är inte . <eos> . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: I know this is hard.
Target Sentence: Jag vet att det är svårt.
Predicted Sentence: jag har att jag är . . <eos> . . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: We can deal with it.
Target Sentence: Vi kan ta itu med det.
Predicted Sentence: tom har inte det ? ? ? <eos> . . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: I must finish my homework before dinner.
Target Sentence: Jag måste göra klart läxan innan middagen.
Predicted Sentence: jag har inte att ? . . . <eos> . . . . . . . . . . . . . . . . . . . . . . . . .
--


Okay, so 30 minutes of training and it cannot handle even the training sentences, even when being guided with the correct decoder input at each step. Let's see how it performs on the validation sentences.

In [122]:
print("Predictions by the training model (Fed correct decoder input at each step)\n")
for i, pred in enumerate(pred_val[:5]):
    print("Input Sentence: " + input_sentences_val[i])
    print("Target Sentence: " + target_sentences_val[i])
    print("Predicted Sentence: " + decode_output_seq(pred))
    print("--")

Predictions by the training model (Fed correct decoder input at each step)

Input Sentence: I don't think this armchair is comfortable.
Target Sentence: Jag tycker inte att den här fåtöljen är bekväm.
Predicted Sentence: jag är att att jag . . . . . <eos> . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: I'm cold. May I close the window?
Target Sentence: Jag fryser. Kan jag stänga fönstret?
Predicted Sentence: jag är inte <eos> att jag . . <eos> . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: A child is missing.
Target Sentence: Ett barn är försvunnet.
Predicted Sentence: jag är . . . <eos> . . . . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: How was the reunion?
Target Sentence: Hur var återträffen?
Predicted Sentence: tom du du ? <eos> ? . . . . . . . . . . . . . . . . . . . . . . . . . . . .
--
Input Sentence: Greek is not an easy language.
Target Sentence: Grekiska är inget lätt språk.
Predicted Sentence: tom är inte . . . <

The predictions seem to follow two patterns: they either start with `tom` or `jag`. The `jag` predictions are actually accurate for the first word!
Anyway, the performance is very poor.

It also predicts words after the `<eos>` token. This will not happen during inference though, since decoding stops at the first `<eos>`.
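
For the teacher-forced predictions printed above, the trailing tokens could be cut off with a small helper like this one. `truncate_at_eos` is hypothetical and was not used to produce the outputs shown.

```python
def truncate_at_eos(decoded, eos="<eos>"):
    """Keep tokens up to and including the first end-of-sentence marker."""
    tokens = decoded.split()
    if eos in tokens:
        tokens = tokens[:tokens.index(eos) + 1]
    return " ".join(tokens)

print(truncate_at_eos("jag är inte . <eos> . . . ."))  # jag är inte . <eos>
```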

# Inference Model

In [119]:
encoder_model = Model(encoder_inputs, state_h)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_outputs, state_h = decoder_gru(
    decoder_embeddings, initial_state=decoder_state_input_h)
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + [decoder_state_input_h],
    [decoder_outputs] + [state_h])

In [124]:
# This cell is straight up copy pasted from https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
# With only small modifications to fit my GRU model and my global variable names


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate target sequence with just the <bos> token.
    target_seq = np.array(target_tokenizer.word_index['<bos>']).reshape(1,1)

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = []
    while not stop_condition:
        output_tokens, h = decoder_model.predict(
            [target_seq] + [states_value])

        # Sample a token
        sampled_word_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = reverse_target_word_index[sampled_word_index] if sampled_word_index > 0 else ""
        decoded_sentence.append(sampled_word)

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_word == '<eos>' or
           len(decoded_sentence) > max_target_seq_len):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.array(target_tokenizer.word_index[sampled_word] if sampled_word else 0).reshape(1,1)

        # Update states
        states_value = h

    return " ".join(decoded_sentence)
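
Stripped of the Keras calls, the loop above is plain greedy decoding: feed the last predicted token back in and stop at `<eos>` or at a maximum length. The sketch below shows that control flow with a toy step function standing in for `decoder_model.predict`; all names in it are hypothetical.

```python
def greedy_decode(step_fn, initial_state, bos_id, eos_id, max_len):
    """Feed each predicted token back in until <eos> or max_len is reached.

    step_fn(token_id, state) -> (scores, new_state) stands in for the decoder model.
    """
    token, state, output = bos_id, initial_state, []
    while len(output) < max_len:
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(token)
        if token == eos_id:
            break
    return output

# Toy "decoder": always prefers the next id and ignores the state.
def toy_step(token_id, state):
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[(token_id + 1) % vocab_size] = 1.0
    return scores, state

print(greedy_decode(toy_step, initial_state=None, bos_id=1, eos_id=4, max_len=10))
# [2, 3, 4]
```

Greedy decoding only ever follows the single most likely token at each step, so one early mistake carries through the rest of the sentence.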

First, let's check what the inference model predicts for some of the sentences in the training set.

In [125]:
for seq_index in range(5):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = input_sequences_train[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_train[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: Did you enjoy that?
Decoded sentence: tom är inte att jag har inte . <eos>
-
Input sentence: He is kind.
Decoded sentence: jag har inte att jag har inte . <eos>
-
Input sentence: I know this is hard.
Decoded sentence: jag har inte en dag . <eos>
-
Input sentence: We can deal with it.
Decoded sentence: tom är inte att jag har inte . <eos>
-
Input sentence: I must finish my homework before dinner.
Decoded sentence: jag har inte att jag har inte . <eos>


My character level model performed much better after two epochs, but let's give this model the chance to train a little bit more.

In [126]:
model.fit([input_sequences_train, target_sequences_train], decoder_target_sequences_train,
          batch_size=batch_size,
          epochs=4,
          sample_weight=sample_weights_train,
          validation_data=([input_sequences_val, target_sequences_val], decoder_target_sequences_val))

Train on 8000 samples, validate on 2000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1d702874a90>

In [128]:
for seq_index in range(5):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = input_sequences_train[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_train[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: Did you enjoy that?
Decoded sentence: kan du få det här ? <eos>
-
Input sentence: He is kind.
Decoded sentence: han är en bra . <eos>
-
Input sentence: I know this is hard.
Decoded sentence: jag är en vän . <eos>
-
Input sentence: We can deal with it.
Decoded sentence: vi har en vän . <eos>
-
Input sentence: I must finish my homework before dinner.
Decoded sentence: jag har inte varit en dag . <eos>


The language has improved! I'll give the model another couple of epochs to train.

In [130]:
model.save('keras_models/s2s_word_6epochs.h5')



In [131]:
model.fit([input_sequences_train, target_sequences_train], decoder_target_sequences_train,
          batch_size=batch_size,
          epochs=24,
          sample_weight=sample_weights_train,
          validation_data=([input_sequences_val, target_sequences_val], decoder_target_sequences_val))

Train on 8000 samples, validate on 2000 samples
Epoch 1/24
Epoch 2/24
Epoch 3/24
Epoch 4/24
Epoch 5/24
Epoch 6/24
Epoch 7/24
Epoch 8/24
Epoch 9/24
Epoch 10/24
Epoch 11/24
Epoch 12/24
Epoch 13/24
Epoch 14/24
Epoch 15/24
Epoch 16/24
Epoch 17/24
Epoch 18/24
Epoch 19/24
Epoch 20/24
Epoch 21/24
Epoch 22/24
Epoch 23/24
Epoch 24/24


<keras.callbacks.History at 0x1d702874908>

Almost 8 hours of training in total!

In [133]:
model.save('keras_models/s2s_word_30epochs.h5')



Obviously I should have used sample weights for the validation data as well. Let's see what the actual validation loss was.

In [132]:
sample_weights_val = (decoder_target_sequences_val != 0).reshape(decoder_target_sequences_val.shape[0], decoder_target_sequences_val.shape[1])

sample_weights_val = sample_weights_val.astype(int)

In [134]:
model.evaluate([input_sequences_val, target_sequences_val], decoder_target_sequences_val, sample_weight=sample_weights_val)



3.283867162704468

So the loss is about 3, not 13!
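
One way to read this number is as a perplexity. This is a rough calculation that assumes the reported loss is a mean natural-log cross-entropy per non-padding token, which I have not verified for this Keras version.

```python
import math

val_loss = 3.283867162704468      # value reported by model.evaluate above
perplexity = math.exp(val_loss)   # assumes a mean natural-log cross-entropy per token
print(round(perplexity, 1))
```

Under that assumption the result is roughly 27, meaning the model is on average about as uncertain as if it were choosing uniformly among that many words at each step.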

In [135]:
for seq_index in range(5):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = input_sequences_train[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_train[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: Did you enjoy that?
Decoded sentence: tycker du om det här ? <eos>
-
Input sentence: He is kind.
Decoded sentence: han är din vän . <eos>
-
Input sentence: I know this is hard.
Decoded sentence: jag vet att det är svårt . <eos>
-
Input sentence: We can deal with it.
Decoded sentence: vi kan inte göra det . <eos>
-
Input sentence: I must finish my homework before dinner.
Decoded sentence: jag måste göra mig nästa gång i morgon . <eos>


Wow, almost correct translations!

In [137]:
for seq_index in range(10):
    # Take one sequence (part of the validation set)
    # for trying out decoding.
    input_seq = input_sequences_val[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', input_sentences_val[seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: I don't think this armchair is comfortable.
Decoded sentence: jag vet inte vad som är för så att jag älskar dig . <eos>
-
Input sentence: I'm cold. May I close the window?
Decoded sentence: jag är på det som jag är borta på att dansa . <eos>
-
Input sentence: A child is missing.
Decoded sentence: de kommer att ha reda . <eos>
-
Input sentence: How was the reunion?
Decoded sentence: hur ska vi hjälpa ? <eos>
-
Input sentence: Greek is not an easy language.
Decoded sentence: det är mycket att vi ska göra . <eos>
-
Input sentence: Be friendly.
Decoded sentence: var inte . <eos>
-
Input sentence: Tom is lying.
Decoded sentence: tom har rätt . <eos>
-
Input sentence: I must be there.
Decoded sentence: jag måste göra det . <eos>
-
Input sentence: I feel so pretty.
Decoded sentence: jag har en barn . <eos>
-
Input sentence: I attempted to swim across the river.
Decoded sentence: jag gick i en vid vid mina saker . <eos>


But the results on the validation data are much worse...
The start words are typically correct, like `I` -> `Jag`, `Be` -> `Var`, but I think the language is worse than what the character level model produced. 
The sentences have less flow to them, and sound less like real Swedish.

# Investigating the word embeddings
I trained my model to translate English sentences to Swedish sentences.
In the process it has produced two word embedding layers. 

Let's explore the properties of these layers!

In [144]:
# Reuse the decoder's input and embedding layers to expose the learned target embeddings.
target_embeddings_model = Model(decoder_inputs, decoder_embeddings)

In [172]:
def encode_word(word_index):
    return target_embeddings_model.predict(np.array(word_index).reshape(1,1)).flatten()

In [173]:
embeddings = {}
for word, i in target_tokenizer.word_index.items():
    embeddings[word] = encode_word(i)

In [174]:
len(embeddings)

7112

Sometimes word embeddings place similar words close to each other in the vector space. If that is the case, similar words can be found by checking the Euclidean distance between the words' embeddings.

In [177]:
def word_distance(w1, w2):
    embedding1 = embeddings[w1]
    embedding2 = embeddings[w2]
    
    return np.linalg.norm(embedding1 - embedding2)

In [233]:
def most_similar(word, k=5, embedded=False):
    if embedded:
        embedding=word
    else:
        embedding = embeddings[word]
    # Calculate the distance to all other words
    distances = np.array(list(map(lambda x: np.linalg.norm(embedding - x), embeddings.values())))
    
    # Find the k shortest distances, might not be sorted
    idx = np.argpartition(distances, k)[:k]
    
    # Sort the k shortest distances
    idx = idx[np.argsort(distances[idx])]
    
    # Return the words with the k shortest distances, together with the distance
    closest_embeddings = np.array(list(embeddings.keys()))[idx]
    return list(zip(closest_embeddings, distances[idx]))
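
The same search can also be done without a Python-level loop by stacking the embeddings into one matrix. The sketch below uses hand-made 2-D toy vectors, not values from the trained model, and sorts all distances with `argsort` for simplicity instead of using `argpartition`. `nearest_words` is a hypothetical helper.

```python
import numpy as np

def nearest_words(query, words, matrix, k=5):
    """Return the k words whose rows in `matrix` are closest to `query` (Euclidean)."""
    distances = np.linalg.norm(matrix - query, axis=1)   # one distance per row
    k = min(k, len(words))
    order = np.argsort(distances)[:k]
    return [(words[i], float(distances[i])) for i in order]

# Hypothetical 2-D toy embeddings, chosen so the distances are easy to check by hand.
toy_words = ["jag", "du", "katt", "hund"]
toy_matrix = np.array([[0.0, 0.0],
                       [1.0, 0.0],
                       [3.0, 4.0],
                       [3.0, 5.0]])

print(nearest_words(toy_matrix[0], toy_words, toy_matrix, k=2))
# [('jag', 0.0), ('du', 1.0)]
```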

In [226]:
word_distance('tom', 'jag')

1.570869

In [234]:
most_similar('jag')

[('jag', 0.0),
 ('hon', 1.1784053),
 ('du', 1.2056358),
 ('man', 1.2721634),
 ('ni', 1.2950184)]

Wow, all the closest words are actually pronouns!
Pronouns are very common in the training data; most sentences actually start with some kind of pronoun, so I would think this word group might be the easiest to find.

In [235]:
most_similar('köpa', k=10)

[('köpa', 0.0),
 ('svalde', 0.88832903),
 ('släpp', 0.89586461),
 ('tände', 0.90464681),
 ('bevisade', 0.91694701),
 ('hålla', 0.93043923),
 ('skadade', 0.93186134),
 ('blonda', 0.93238038),
 ('for', 0.94291615),
 ('kallade', 0.94447839)]

Okay, so almost all of these words are verbs, though not in the same tense as `köpa`.

In [244]:
most_similar('katt')

[('katt', 0.0),
 ('lov', 0.6620847),
 ('öron', 0.69108564),
 ('lek', 0.69138545),
 ('order', 0.69297498)]

All are nouns, but I wouldn't say they carry very similar meaning.

In [246]:
word_distance('katt', 'hund')

1.0792097

So my embeddings think `Katt`, which means `Cat`, is much closer to the word `Öron`, which means `Ears`, than to `Hund`, which is `Dog`.

I think my training set is way too small to learn the similarities of these nouns; they simply don't appear in similar contexts that often.

Let's see if we can do operations on the word embeddings. Unfortunately I don't have a very big vocabulary, so I can't try classics like `King` - `Man` + `Woman`.

In [241]:
most_similar(embeddings['flygresa'] - embeddings['flyg'], k=10, embedded=True)

[('flygresa', 0.27931079),
 ('diskar', 0.48389259),
 ('jordnötter', 0.49537137),
 ('räkningen', 0.49831137),
 ('teckenspråk', 0.4996382),
 ('frivilligt', 0.5030669),
 ('cookie', 0.50501686),
 ('vilar', 0.50531614),
 ('glänser', 0.50846142),
 ('hushållssysslorna', 0.51011735)]

I didn't find any good operations. Maybe there are some, but as with the other words I think my training set is way too small to build meaningful embeddings.
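
For reference, this is what a working embedding operation looks like on hand-made vectors. The 2-D values below are invented so that the arithmetic works out exactly; they say nothing about what my trained embeddings contain. `analogy` is a hypothetical helper.

```python
import numpy as np

# Hand-made 2-D vectors (axis 0 ~ "royalty", axis 1 ~ "gender"); purely illustrative.
toy_embeddings = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - query))

print(analogy("king", "man", "woman", toy_embeddings))  # queen
```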

# Summary
I trained an RNN encoder-decoder model to translate English to Swedish based on word tokens.
In the process I trained two word embedding layers!

I did not achieve great results with this model, as I still think the translations are very poor. I think this largely depends on the size of my training data, but unfortunately training time is so long that I cannot really scale up the training data size or the model complexity.

I experimented with the embeddings created for the Swedish words and found that the closest words to a pronoun were pronouns, and the same held for verbs and nouns. However, I did not think the closest words to any of the ones I tried were very similar, other than having the same part of speech tag. Anyway, I think this really shows the potential of word embeddings in part of speech tagging.