# Thrones2Vec

© Yuriy Guts, 2016

Using only the raw text of [A Song of Ice and Fire](https://en.wikipedia.org/wiki/A_Song_of_Ice_and_Fire), we'll derive and explore the semantic properties of its words.

## Imports

In [1]:
from __future__ import absolute_import, division, print_function

In [2]:
# text encoding (read files as UTF-8)
import codecs
# find all pathnames matching a shell-style wildcard pattern
import glob
# event logging for libraries
import logging
# concurrency
import multiprocessing
# operating system utilities
import os
# pretty printing
import pprint
# regular expressions
import re

In [3]:
#natural language toolkit
import nltk
#word 2 vec
import gensim.models.word2vec as w2v
#dimensionality reduction
import sklearn.manifold
#math
import numpy as np
#plotting
import matplotlib.pyplot as plt
#parse dataset
import pandas as pd
#visualization
import seaborn as sns



In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download NLTK tokenizer models (only the first time)**

In [6]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dev\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dev\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Prepare Corpus

**Load books from files**

In [7]:
book_filenames = sorted(glob.glob("data/*.txt"))

In [8]:
print("Found books:")
book_filenames

Found books:


['data\\got1.txt',
 'data\\got2.txt',
 'data\\got3.txt',
 'data\\got4.txt',
 'data\\got5.txt']

**Combine the books into one string**

In [9]:
corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading 'data\got1.txt'...
Corpus is now 1787941 characters long

Reading 'data\got2.txt'...
Corpus is now 4110003 characters long

Reading 'data\got3.txt'...
Corpus is now 6452402 characters long

Reading 'data\got4.txt'...
Corpus is now 8185413 characters long

Reading 'data\got5.txt'...
Corpus is now 9811978 characters long



**Split the corpus into sentences**

In [10]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [11]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [12]:
# Convert a raw sentence into a list of words:
# replace every non-letter character (punctuation, digits, hyphens) with a space,
# then split on whitespace.
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [13]:
#sentence where each word is tokenized
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [14]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

Heraldic crest by Virginia Norey.
['Heraldic', 'crest', 'by', 'Virginia', 'Norey']
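
To make the effect of the cleaning step concrete, here is a small standalone check. The sample sentence is made up for illustration and is not taken from the books. Because every non-letter character becomes a space, hyphenated words and contractions are split into fragments, and digits disappear entirely.

```python
import re

def sentence_to_wordlist(raw):
    # Same logic as the notebook's helper: non-letters become spaces.
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.split()

# Hypothetical sample sentence, not taken from the corpus.
sample = "Jon's half-brother didn't answer; 3 ravens flew."
words = sentence_to_wordlist(sample)
print(words)
```

Fragments such as `s` and `t` end up as separate tokens, which is a trade-off of this simple approach rather than something the model handles specially.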


In [15]:
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 1,818,103 tokens


## Train Word2Vec

In [16]:
# Step 3: build the model.
# Once we have word vectors, they help with three main tasks:
# distance, similarity, and ranking.

# Dimensionality of the resulting word vectors.
# More dimensions are more computationally expensive to train,
# but can also capture more nuance.
num_features = 300
# Minimum word count threshold.
min_word_count = 3

# Number of threads to run in parallel.
#more workers, faster we train
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsample setting for frequent words.
# Values between 0 and 1e-5 are often suggested as a useful range.
downsampling = 1e-3

# Seed for the random number generator, to make the results reproducible.
# Deterministic runs are good for debugging.
seed = 1
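
As a rough illustration of what `min_word_count` and `downsampling` control, the sketch below prunes rare words and computes the chance of keeping one occurrence of a frequent word. The helper names `prune_vocab` and `keep_probability` are hypothetical, the toy sentences are invented, and the keep-probability rule is an assumption based on the reference word2vec C code; the library's internals may differ in detail.

```python
from __future__ import division, print_function
import math
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

def keep_probability(word_count, total_count, sample):
    """Chance of keeping one occurrence of a word under subsampling.

    Assumption: the rule from the reference word2vec C code,
    p = (sqrt(f / sample) + 1) * (sample / f), capped at 1,
    where f is the word's share of all tokens.
    """
    frequency = word_count / total_count
    p = (math.sqrt(frequency / sample) + 1) * (sample / frequency)
    return min(1.0, p)

# Invented toy corpus: "the" appears 3 times, "wolf" twice, the rest once.
toy_sentences = [["the", "wolf", "howled"], ["the", "wolf", "ran"], ["the", "crow"]]
vocab = prune_vocab(toy_sentences, min_count=2)
print(sorted(vocab))

# A word making up 5% of the corpus is dropped most of the time,
# while a word making up 0.05% is always kept.
p_common = keep_probability(word_count=50000, total_count=1000000, sample=1e-3)
p_rare = keep_probability(word_count=500, total_count=1000000, sample=1e-3)
print(p_common, p_rare)
```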

In [17]:
thrones2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
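
With `sg=1` the model is trained as skip-gram: each word is used to predict the words around it, up to `window` positions away. The sketch below shows how such (center, context) training pairs can be enumerated. It is a simplification under stated assumptions: `skipgram_pairs` is a hypothetical helper, and real implementations typically also shrink the window at random and apply subsampling.

```python
from __future__ import print_function

def skipgram_pairs(sentence, window):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, sentence[j]

# The tokenized example sentence printed earlier in the notebook.
sentence = ["Heraldic", "crest", "by", "Virginia", "Norey"]
pairs = list(skipgram_pairs(sentence, window=2))
print(len(pairs))
print(pairs[:4])
```

A larger window, such as the 7 used here, pairs each word with more distant neighbours, which tends to favour topical relatedness over strict syntactic similarity.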

In [18]:
thrones2vec.build_vocab(sentences)

2017-06-12 17:40:37,499 : INFO : collecting all words and their counts
2017-06-12 17:40:37,502 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-12 17:40:37,551 : INFO : PROGRESS: at sentence #10000, processed 140984 words, keeping 10280 word types
2017-06-12 17:40:37,598 : INFO : PROGRESS: at sentence #20000, processed 279730 words, keeping 13558 word types
2017-06-12 17:40:37,653 : INFO : PROGRESS: at sentence #30000, processed 420336 words, keeping 16598 word types
2017-06-12 17:40:37,686 : INFO : PROGRESS: at sentence #40000, processed 556581 words, keeping 18324 word types
2017-06-12 17:40:37,725 : INFO : PROGRESS: at sentence #50000, processed 686247 words, keeping 19714 word types
2017-06-12 17:40:37,770 : INFO : PROGRESS: at sentence #60000, processed 828497 words, keeping 21672 word types
2017-06-12 17:40:37,812 : INFO : PROGRESS: at sentence #70000, processed 973830 words, keeping 23093 word types
2017-06-12 17:40:37,851 : INFO : PROGRESS: at 

In [19]:
print("Word2Vec vocabulary length:", len(thrones2vec.wv.vocab))

Word2Vec vocabulary length: 17277

**Start training. This might take a minute or two...**

In [20]:
thrones2vec.train(sentences)

2017-06-12 17:40:45,541 : INFO : training model with 4 workers on 17277 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=7
2017-06-12 17:40:45,544 : INFO : expecting 128868 sentences, matching count from corpus used for vocabulary survey
2017-06-12 17:40:46,563 : INFO : PROGRESS: at 2.30% examples, 161287 words/s, in_qsize 7, out_qsize 0
2017-06-12 17:40:47,567 : INFO : PROGRESS: at 4.98% examples, 172754 words/s, in_qsize 8, out_qsize 0
2017-06-12 17:40:48,615 : INFO : PROGRESS: at 7.44% examples, 166633 words/s, in_qsize 7, out_qsize 0
2017-06-12 17:40:49,707 : INFO : PROGRESS: at 10.08% examples, 167450 words/s, in_qsize 7, out_qsize 0
2017-06-12 17:40:50,728 : INFO : PROGRESS: at 12.65% examples, 170326 words/s, in_qsize 8, out_qsize 0
2017-06-12 17:40:51,743 : INFO : PROGRESS: at 15.25% examples, 171019 words/s, in_qsize 8, out_qsize 0
2017-06-12 17:40:52,771 : INFO : PROGRESS: at 17.69% examples, 170074 words/s, in_qsize 8, out_qsize 0
2017-06-12 17:40:

7021533

**Save the model to a file; it can be useful later**

In [21]:
if not os.path.exists("trained"):
    os.makedirs("trained")

In [22]:
thrones2vec.save(os.path.join("trained", "thrones2vec.w2v"))

2017-06-12 17:41:26,563 : INFO : saving Word2Vec object under trained\thrones2vec.w2v, separately None
2017-06-12 17:41:26,572 : INFO : not storing attribute syn0norm
2017-06-12 17:41:26,577 : INFO : not storing attribute cum_table
2017-06-12 17:41:27,203 : INFO : saved trained\thrones2vec.w2v


## Explore the Trained Model

In [23]:
thrones2vec = w2v.Word2Vec.load(os.path.join("trained", "thrones2vec.w2v"))

2017-06-12 17:41:27,213 : INFO : loading Word2Vec object from trained\thrones2vec.w2v
2017-06-12 17:41:27,605 : INFO : loading wv recursively from trained\thrones2vec.w2v.wv.* with mmap=None
2017-06-12 17:41:27,606 : INFO : setting ignored attribute syn0norm to None
2017-06-12 17:41:27,608 : INFO : setting ignored attribute cum_table to None
2017-06-12 17:41:27,611 : INFO : loaded trained\thrones2vec.w2v


### Compress the word vectors into 2D space and plot them

In [24]:
# t-SNE projects the high-dimensional word vectors down to 2 dimensions for plotting
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [None]:
all_word_vectors_matrix = thrones2vec.wv.syn0

**Train t-SNE. This could take a minute or two...**

In [None]:
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

**Plot the big picture**

In [None]:
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[thrones2vec.wv.vocab[word].index])
            for word in thrones2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"]
)

In [None]:
points.head(10)

In [None]:
sns.set_context("poster")

In [None]:
points.plot.scatter("x", "y", s=10, figsize=(20, 12))

**Zoom in to some interesting places**

In [None]:
def plot_region(x_bounds, y_bounds):
    # Keep only the points that fall inside the bounding box.
    region = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]

    ax = region.plot.scatter("x", "y", s=35, figsize=(10, 8))
    # Label each point with its word, slightly offset from the marker.
    for i, point in region.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

**People related to Kingsguard ended up together**

In [None]:
plot_region(x_bounds=(4.0, 4.2), y_bounds=(-0.5, -0.1))

**Food products are grouped nicely as well. Aerys (The Mad King) being close to "roasted" also looks sadly correct**

In [None]:
plot_region(x_bounds=(0, 1), y_bounds=(4, 4.5))

### Explore semantic similarities between book characters

**Words closest to the given word**

In [None]:
thrones2vec.wv.most_similar("Stark")

In [None]:
thrones2vec.wv.most_similar("Aerys")

In [None]:
thrones2vec.wv.most_similar("direwolf")
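
Nearest-neighbour queries like these rank every other word by the cosine similarity of its vector to the query vector. The sketch below reproduces that idea in plain Python. The two-dimensional vectors are invented for illustration and do not come from the trained model, and `most_similar` here is a toy stand-in, not the library function.

```python
from __future__ import division, print_function
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors: first axis is roughly "animal", second roughly "drink".
toy_vectors = {
    "wolf": [1.0, 0.1],
    "direwolf": [0.9, 0.2],
    "wine": [0.1, 1.0],
    "ale": [0.2, 0.9],
}
neighbours = most_similar("direwolf", toy_vectors)
print(neighbours)
```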

**Linear relationships between word pairs**

In [None]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = thrones2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2

In [None]:
nearest_similarity_cosmul("Stark", "Winterfell", "Riverrun")
nearest_similarity_cosmul("Jaime", "sword", "wine")
nearest_similarity_cosmul("Arya", "Nymeria", "dragons")
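
The analogy helper relies on a multiplicative combination of cosine similarities. The sketch below assumes the 3CosMul objective of Levy and Goldberg: each cosine is shifted into the range [0, 1], the positive terms are multiplied together, and the product is divided by the negative terms plus a small epsilon. The vectors and the functions `cosmul_score` and `analogy` are hypothetical and only illustrate the scoring; they are not the library's implementation.

```python
from __future__ import division, print_function
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of shifted positive cosines over product of shifted negative ones."""
    pos = 1.0
    for vec in positive:
        pos *= (1 + cosine(candidate, vec)) / 2
    neg = 1.0
    for vec in negative:
        neg *= (1 + cosine(candidate, vec)) / 2
    return pos / (neg + eps)

def analogy(start1, end1, end2, vectors):
    """Find start2 such that start1 : end1 is like start2 : end2."""
    exclude = {start1, end1, end2}
    positive = [vectors[end2], vectors[start1]]
    negative = [vectors[end1]]
    scores = {
        word: cosmul_score(vec, positive, negative)
        for word, vec in vectors.items()
        if word not in exclude
    }
    return max(scores, key=scores.get)

# Invented axes: (is a house, is a castle, north vs. riverlands).
toy_vectors = {
    "Stark": [1, 0, 1],
    "Winterfell": [0, 1, 1],
    "Riverrun": [0, 1, -1],
    "Tully": [1, 0, -1],
    "Harrenhal": [0, 1, 0],
    "Lannister": [1, 0, 0],
}
answer = analogy("Stark", "Winterfell", "Riverrun", toy_vectors)
print(answer)
```

With these toy vectors the best candidate is `Tully`, because it is close to `Riverrun` and far from `Winterfell`. Results on the real model depend on the training run.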