# Realms2Vec

© Yuriy Guts, 2016

Using only the raw text of [several Forgotten Realms novels](https://en.wikipedia.org/wiki/List_of_Forgotten_Realms_novels), we'll derive and explore the semantic properties of their words.

Credit for this code goes to [Yuriy Guts](https://github.com/YuriyGuts/); I have merely run it on a corpus of Forgotten Realms novels.

## Imports

In [1]:
from __future__ import absolute_import, division, print_function

In [2]:
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re

In [3]:
import nltk
import gensim.models.word2vec as w2v
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download NLTK tokenizer models (only the first time)**

In [6]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Prepare Corpus

**Load books from files**

In [7]:
book_filenames = sorted(glob.glob("data/*.txt"))

In [8]:
print("Found books:")
book_filenames

Found books:


['data/01. Dark Elf - Homeland.txt',
 'data/01. Parched Sea, The.txt',
 'data/02. Dark Elf - Exile.txt',
 'data/02. Elfshadow.txt',
 'data/03. Dark Elf - Sojourn.txt',
 'data/03. Red Magic.txt',
 'data/04. Icewind Dale - The Crystal Shard.txt',
 'data/04. Night Parade, The.txt',
 'data/05. Icewind Dale - Streams of Silver.txt',
 'data/05. Ring of Winter, The.epub.txt',
 'data/06. Crypt of the Shadowking.txt',
 "data/06. Icewind Dale - The Halfling's Gem.txt",
 'data/07. Legacy of the Drow - The Legacy.txt',
 'data/07. Soldiers of Ice.txt',
 'data/08. Elfsong.txt',
 'data/08. Legacy of the Drow - Starless Night.txt',
 'data/09. Crown of Fire.txt',
 'data/09. Legacy of the Drow - Siege of Darkness.txt',
 'data/1. A Neverwinter Novella, Part I.txt',
 'data/1. Abduction, The.txt',
 'data/1. Alabaster Staff, The.txt',
 'data/1. Archmage.txt',
 'data/1. Ascendance.txt',
 'data/1. Azure Bonds.txt',
 "data/1. Baldur's Gate.txt",
 'data/1. Blackstaff Tower.txt',
 'data/1. Blackstaff.txt',
 'dat

**Combine the books into one string**

In [None]:
corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

**Split the corpus into sentences**

In [10]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [11]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [12]:
# Convert a raw sentence into a list of words:
# replace every non-letter character (punctuation, digits, hyphens) with a space,
# then split on whitespace.
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [13]:
# Build a list of sentences, each represented as a list of word tokens.
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [14]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

“The Spider Queen will not accept the sacrifice until the child is named!”
“Drizzt,” breathed Matron Malice.
['The', 'Spider', 'Queen', 'will', 'not', 'accept', 'the', 'sacrifice', 'until', 'the', 'child', 'is', 'named', 'Drizzt', 'breathed', 'Matron', 'Malice']
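
The regular expression in `sentence_to_wordlist` replaces every character outside `a-z`/`A-Z` with a space, so apostrophes, hyphens, and digits all act as word breaks. A small self-contained check of that behaviour (the sample strings are made up for illustration, not taken from the corpus):

```python
import re

def sentence_to_wordlist(raw):
    # Same logic as the notebook cell above: keep ASCII letters only
    # and treat everything else as a separator.
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.split()

# Hypothetical sample inputs.
possessive = sentence_to_wordlist("The Halfling's Gem")
hyphenated = sentence_to_wordlist("a half-elf ranger")
with_digits = sentence_to_wordlist("Year 1358 DR")

print(possessive)
print(hyphenated)
print(with_digits)
```

One consequence is that possessives leave behind a stand-alone `s` token and a name such as "Baldur's Gate" is split into `Baldur`, `s`, and `Gate`, which is why `Baldur` and `Gate` appear as separate neighbours in the similarity results later on.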


In [15]:
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 34,603,549 tokens


## Train Word2Vec

In [16]:
# Dimensionality of the resulting word vectors.
num_features = 300

# Minimum word count threshold.
min_word_count = 3

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsample setting for frequent words.
downsampling = 1e-3

# Seed for the RNG, to make the results reproducible.
seed = 1

In [17]:
realms2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
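
The `sample` threshold controls how aggressively very frequent words are randomly dropped during training. As a rough sketch, the keep-probability formula from the reference word2vec code (which, as far as I know, gensim follows as well) can be computed as below. The word frequencies are made-up illustrative values, and `keep_probability` is a hypothetical helper, not a gensim function.

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word.

    word_fraction: share of all corpus tokens taken up by this word.
    sample: the downsampling threshold (the notebook uses 1e-3).
    """
    ratio = sample / word_fraction
    # (sqrt(f / t) + 1) * (t / f), capped at 1 so rare words are always kept.
    return min(1.0, (math.sqrt(1.0 / ratio) + 1.0) * ratio)

# Illustrative (made-up) frequencies: a very common word vs. a rare name.
for label, fraction in [("very common word", 0.05), ("rare name", 0.00001)]:
    print("{0}: keep with p = {1:.3f}".format(label, keep_probability(fraction)))
```

Under this formula, only words that make up a noticeably larger share of the corpus than the threshold are thinned out; everything rarer is kept in full.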

In [None]:
realms2vec.build_vocab(sentences)

In [19]:
print("Word2Vec vocabulary length:", len(realms2vec.wv.vocab))

Word2Vec vocabulary length: 71128
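
`build_vocab` counts every token and, with `min_count=3`, discards words seen fewer than three times, which trims very rare words from the vocabulary. A minimal sketch of that filtering step on a toy corpus, using plain `collections.Counter` rather than gensim itself (`filter_vocab` is a hypothetical helper):

```python
from collections import Counter

def filter_vocab(tokenized_sentences, min_count):
    # Count every token across all sentences.
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    # Keep only words that occur at least min_count times.
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy sentences, made up for illustration.
toy_sentences = [
    ["the", "drow", "drew", "his", "scimitars"],
    ["the", "drow", "ranger", "left", "the", "city"],
    ["the", "dwarf", "met", "the", "drow"],
]
toy_vocab = filter_vocab(toy_sentences, min_count=3)
print(sorted(toy_vocab.items()))
```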


**Start training, this might take a minute or two...**

In [None]:
realms2vec.train(sentences, total_words=token_count, epochs=100)

**Save to file, can be useful later**

In [21]:
if not os.path.exists("trained"):
    os.makedirs("trained")

In [23]:
realms2vec.save(os.path.join("trained", "realms2vec.w2v"))

2018-02-10 13:12:25,028 : INFO : saving Word2Vec object under trained/realms2vec.w2v, separately None
2018-02-10 13:12:25,028 : INFO : storing np array 'syn0' to trained/realms2vec.w2v.wv.syn0.npy
2018-02-10 13:12:25,089 : INFO : not storing attribute syn0norm
2018-02-10 13:12:25,090 : INFO : storing np array 'syn1neg' to trained/realms2vec.w2v.syn1neg.npy
2018-02-10 13:12:25,126 : INFO : not storing attribute cum_table
2018-02-10 13:12:25,247 : INFO : saved trained/realms2vec.w2v


## Explore the trained model

In [7]:
realms2vec = w2v.Word2Vec.load(os.path.join("trained", "realms2vec.w2v"))

2018-02-10 13:31:16,234 : INFO : loading Word2Vec object from trained/realms2vec.w2v
2018-02-10 13:31:16,374 : INFO : loading wv recursively from trained/realms2vec.w2v.wv.* with mmap=None
2018-02-10 13:31:16,375 : INFO : loading syn0 from trained/realms2vec.w2v.wv.syn0.npy with mmap=None
2018-02-10 13:31:16,529 : INFO : setting ignored attribute syn0norm to None
2018-02-10 13:31:16,532 : INFO : loading syn1neg from trained/realms2vec.w2v.syn1neg.npy with mmap=None
2018-02-10 13:31:16,679 : INFO : setting ignored attribute cum_table to None
2018-02-10 13:31:16,681 : INFO : loaded trained/realms2vec.w2v


### Compress the word vectors into 2D space and plot them

In [52]:
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [10]:
all_word_vectors_matrix = realms2vec.wv.syn0

**Train t-SNE, this could take a minute or two...**

NOTE: in my environment this step never completes (probably because the vocabulary is too large). If that happens to you, skip ahead to the "Explore semantic similarities between book characters" section.

In [None]:
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)
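
If t-SNE does not finish on the full vocabulary, one common workaround is to project only the most frequent words. The sketch below shows the idea on stand-in data; `top_n_rows`, the toy matrix, and the word list are hypothetical. It assumes that row `i` of the vector matrix belongs to the `i`-th most frequent word (which I believe is gensim's default ordering), so taking the first `n` rows selects the `n` most frequent words.

```python
def top_n_rows(vectors, words, n):
    """Return the first n words and their vectors.

    Assumes rows are ordered by descending word frequency.
    """
    n = min(n, len(words))
    return words[:n], vectors[:n]

# Stand-in data: four "words" with 3-dimensional vectors.
toy_words = ["the", "drow", "sword", "Keczulla"]
toy_vectors = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

subset_words, subset_vectors = top_n_rows(toy_vectors, toy_words, n=2)
print(subset_words)
```

With the real model, the same kind of slice would be applied to the word-vector matrix before calling `tsne.fit_transform`, and the plotting code would then have to iterate over the matching subset of words rather than the whole vocabulary.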

**Plot the big picture**

In [None]:
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[realms2vec.wv.vocab[word].index])
            for word in realms2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"]
)

In [None]:
points.head(10)

In [None]:
sns.set_context("poster")

In [None]:
points.plot.scatter("x", "y", s=10, figsize=(20, 12))

**Zoom in to some interesting places**

In [None]:
def plot_region(x_bounds, y_bounds):
    # Select the points that fall inside the requested bounding box.
    region = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]

    ax = region.plot.scatter("x", "y", s=35, figsize=(10, 8))
    # Label every point with its word, slightly offset from the marker.
    for i, point in region.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

**In the original notebook's corpus, people related to the Kingsguard ended up together. The bounds below are specific to that corpus.**

In [None]:
plot_region(x_bounds=(4.0, 4.2), y_bounds=(-0.5, -0.1))

**In the original notebook's corpus, food products were grouped nicely as well, and Aerys (The Mad King) being close to "roasted" also looked sadly correct. The bounds below are specific to that corpus.**

In [None]:
plot_region(x_bounds=(0, 1), y_bounds=(4, 4.5))

### Explore semantic similarities between book characters

**Words closest to the given word**

In [45]:
realms2vec.wv.most_similar("Elminster")

[('El', 0.7434680461883545),
 ('Shadowdale', 0.6829797029495239),
 ('Shandril', 0.6622568368911743),
 ('Mirt', 0.6510697603225708),
 ('Storm', 0.6506309509277344),
 ('Manshoon', 0.6389539837837219),
 ('Mage', 0.6381931304931641),
 ('Sharantyr', 0.6207765340805054),
 ('Narm', 0.6157053709030151),
 ('Srinshee', 0.6120926141738892)]

In [46]:
realms2vec.wv.most_similar("spell")

[('spells', 0.7400574684143066),
 ('enchantment', 0.6742175817489624),
 ('incantation', 0.6424131989479065),
 ('magic', 0.6384660601615906),
 ('casting', 0.5918986797332764),
 ('cast', 0.5778337717056274),
 ('counterspell', 0.5731417536735535),
 ('magics', 0.5412431359291077),
 ('divination', 0.5345152616500854),
 ('charm', 0.5302313566207886)]

In [47]:
realms2vec.wv.most_similar("Amn")

[('Sembia', 0.47273480892181396),
 ('Baldur', 0.462149053812027),
 ('Waterdeep', 0.4475569725036621),
 ('Cormyr', 0.4439469277858734),
 ('Esmeltaran', 0.4284575879573822),
 ('Suzail', 0.4167151153087616),
 ('Tethyr', 0.41307923197746277),
 ('Keczulla', 0.4013843834400177),
 ('Calimshan', 0.396533340215683),
 ('Gate', 0.38300296664237976)]
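
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure, using made-up 3-dimensional vectors in place of the model's 300-dimensional ones (the `most_similar` function here is a simplified stand-in, not the library method):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, topn=2):
    # Rank every other word by cosine similarity to the query word.
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors; real values come from the trained model.
toy_vectors = {
    "spell": [0.9, 0.1, 0.0],
    "magic": [0.8, 0.3, 0.1],
    "sword": [0.1, 0.9, 0.2],
    "tavern": [0.0, 0.2, 0.9],
}
ranked = most_similar("spell", toy_vectors)
print(ranked)
```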

**Linear relationships between word pairs**

In [48]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = realms2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2

In [49]:
nearest_similarity_cosmul("human", "Elminster", "Drizzt")
nearest_similarity_cosmul("Menzoberranzan", "Drizzt", "Bruenor")
nearest_similarity_cosmul("wizard", "human", "orc")

human is related to Elminster, as drow is related to Drizzt
Menzoberranzan is related to Drizzt, as Mithral is related to Bruenor
wizard is related to human, as shaman is related to orc


'shaman'
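
`most_similar_cosmul` ranks candidates with a multiplicative objective (often called 3CosMul): similarities to the positive words are multiplied together and divided by the similarities to the negative words, after shifting each cosine from [-1, 1] into [0, 1]. A rough sketch on toy vectors follows. The vectors are invented, and `cosmul_score` and `analogy` are hypothetical helpers that mirror my understanding of the formula, not gensim's code.

```python
import math

def cosine(a, b):
    # Cosine of the angle between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosmul_score(candidate, positive, negative, epsilon=1e-6):
    def shifted(vec):
        # Map the cosine from [-1, 1] to [0, 1] so products stay non-negative.
        return (1.0 + cosine(candidate, vec)) / 2.0

    numerator = 1.0
    for vec in positive:
        numerator *= shifted(vec)
    denominator = 1.0
    for vec in negative:
        denominator *= shifted(vec)
    # epsilon guards against division by zero.
    return numerator / (denominator + epsilon)

def analogy(start1, end1, end2, vectors):
    # Same argument order as nearest_similarity_cosmul in the notebook.
    inputs = {start1, end1, end2}
    candidates = [word for word in vectors if word not in inputs]
    return max(
        candidates,
        key=lambda word: cosmul_score(
            vectors[word],
            positive=[vectors[end2], vectors[start1]],
            negative=[vectors[end1]],
        ),
    )

# Invented toy vectors: axes loosely mean "human-ness", "orc-ness", "magic".
toy = {
    "human": [1.0, 0.0, 0.1],
    "orc": [0.0, 1.0, 0.1],
    "wizard": [1.0, 0.0, 1.0],
    "shaman": [0.0, 1.0, 1.0],
    "warrior": [0.5, 0.5, 0.0],
}
best = analogy("wizard", "human", "orc", toy)
print(best)
```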

In [50]:
realms2vec.wv.doesnt_match("Alustriel Dove Storm Lolth".split()) # Who is not one of the Seven Sisters?

'Lolth'

In [51]:
realms2vec.wv.doesnt_match("Lolth Lathander Bahamut Helm Bruenor".split()) # Who is not a god?

'Bruenor'
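
`doesnt_match` can be understood roughly as follows: normalise each word vector to unit length, average them, and return the word whose vector is least similar to that average. A minimal sketch with invented 2-dimensional vectors (the helper names are hypothetical, and this is a simplification of the library routine):

```python
import math

def unit(vec):
    # Scale a vector to length 1 so that only its direction matters.
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def doesnt_match(words, vectors):
    units = {word: unit(vectors[word]) for word in words}
    dims = len(next(iter(units.values())))
    # Average the unit vectors, then normalise the mean as well.
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    # The odd one out has the smallest dot product with the mean direction.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Invented toy vectors; real values come from the trained model.
toy_vectors = {
    "Alustriel": [0.9, 0.1],
    "Dove": [0.8, 0.2],
    "Storm": [0.85, 0.15],
    "Lolth": [0.1, 0.9],
}
odd_one = doesnt_match(["Alustriel", "Dove", "Storm", "Lolth"], toy_vectors)
print(odd_one)
```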