# Thrones2Vec

© Yuriy Guts, 2016

Using only the raw text of [A Song of Ice and Fire](https://en.wikipedia.org/wiki/A_Song_of_Ice_and_Fire), we'll derive and explore the semantic properties of its words.

## Imports

In [11]:
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

In [12]:
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re

In [13]:
import nltk
import gensim.models.word2vec as w2v
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [14]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [15]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download NLTK tokenizer models (only the first time)**

In [16]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/gabe/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/gabe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Prepare Corpus

**Load books from files**

In [17]:
book_filenames = sorted(glob.glob("sports-6k/*"))

In [18]:
print("Found books:")
book_filenames

Found books:


['sports-6k/ARTICLE-755871',
 'sports-6k/ARTICLE-755901',
 'sports-6k/ARTICLE-755902',
 'sports-6k/ARTICLE-755907',
 'sports-6k/ARTICLE-755927',
 'sports-6k/ARTICLE-755938',
 'sports-6k/ARTICLE-755966',
 'sports-6k/ARTICLE-755974',
 'sports-6k/ARTICLE-755975',
 'sports-6k/ARTICLE-755989',
 'sports-6k/ARTICLE-755992',
 'sports-6k/ARTICLE-756007',
 'sports-6k/ARTICLE-756024',
 'sports-6k/ARTICLE-756025',
 'sports-6k/ARTICLE-756046',
 'sports-6k/ARTICLE-756048',
 'sports-6k/ARTICLE-756053',
 'sports-6k/ARTICLE-756068',
 'sports-6k/ARTICLE-756111',
 'sports-6k/ARTICLE-756135',
 'sports-6k/ARTICLE-756137',
 'sports-6k/ARTICLE-756146',
 'sports-6k/ARTICLE-756149',
 'sports-6k/ARTICLE-756155',
 'sports-6k/ARTICLE-756163',
 'sports-6k/ARTICLE-756184',
 'sports-6k/ARTICLE-756188',
 'sports-6k/ARTICLE-756197',
 'sports-6k/ARTICLE-756213',
 'sports-6k/ARTICLE-756216',
 'sports-6k/ARTICLE-756217',
 'sports-6k/ARTICLE-756230',
 'sports-6k/ARTICLE-756233',
 'sports-6k/ARTICLE-756234',
 'sports-6k/AR

**Combine the books into one string**

In [19]:
corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading 'sports-6k/ARTICLE-755871'...
Corpus is now 77 characters long

Reading 'sports-6k/ARTICLE-755901'...
Corpus is now 1099 characters long

Reading 'sports-6k/ARTICLE-755902'...
Corpus is now 4649 characters long

Reading 'sports-6k/ARTICLE-755907'...
Corpus is now 5446 characters long

Reading 'sports-6k/ARTICLE-755927'...
Corpus is now 6420 characters long

Reading 'sports-6k/ARTICLE-755938'...
Corpus is now 7274 characters long

Reading 'sports-6k/ARTICLE-755966'...
Corpus is now 7986 characters long

Reading 'sports-6k/ARTICLE-755974'...
Corpus is now 9094 characters long

Reading 'sports-6k/ARTICLE-755975'...
Corpus is now 9973 characters long

Reading 'sports-6k/ARTICLE-755989'...
Corpus is now 11566 characters long

Reading 'sports-6k/ARTICLE-755992'...
Corpus is now 12256 characters long

Reading 'sports-6k/ARTICLE-756007'...
Corpus is now 13091 characters long

Reading 'sports-6k/ARTICLE-756024'...
Corpus is now 14130 characters long

Reading 'sports-6k/ARTICLE-756025'..

Corpus is now 643543 characters long

Reading 'sports-6k/ARTICLE-759878'...
Corpus is now 645976 characters long

Reading 'sports-6k/ARTICLE-759881'...
Corpus is now 647211 characters long

Reading 'sports-6k/ARTICLE-759888'...
Corpus is now 648971 characters long

Reading 'sports-6k/ARTICLE-759911'...
Corpus is now 650038 characters long

Reading 'sports-6k/ARTICLE-760007'...
Corpus is now 651598 characters long

Reading 'sports-6k/ARTICLE-760011'...
Corpus is now 652680 characters long

Reading 'sports-6k/ARTICLE-760015'...
Corpus is now 654307 characters long

Reading 'sports-6k/ARTICLE-760019'...
Corpus is now 656906 characters long

Reading 'sports-6k/ARTICLE-760023'...
Corpus is now 658470 characters long

Reading 'sports-6k/ARTICLE-760033'...
Corpus is now 659231 characters long

Reading 'sports-6k/ARTICLE-760036'...
Corpus is now 660079 characters long

Reading 'sports-6k/ARTICLE-760048'...
Corpus is now 662251 characters long

Reading 'sports-6k/ARTICLE-760058'...
Corpus is no

Reading 'sports-6k/ARTICLE-763844'...
Corpus is now 1500749 characters long

Reading 'sports-6k/ARTICLE-763861'...
Corpus is now 1501382 characters long

Reading 'sports-6k/ARTICLE-763864'...
Corpus is now 1502461 characters long

Reading 'sports-6k/ARTICLE-763901'...
Corpus is now 1504636 characters long

Reading 'sports-6k/ARTICLE-763907'...
Corpus is now 1506419 characters long

Reading 'sports-6k/ARTICLE-763928'...
Corpus is now 1507512 characters long

Reading 'sports-6k/ARTICLE-763971'...
Corpus is now 1508880 characters long

Reading 'sports-6k/ARTICLE-763975'...
Corpus is now 1510538 characters long

Reading 'sports-6k/ARTICLE-763990'...
Corpus is now 1511839 characters long

Reading 'sports-6k/ARTICLE-763996'...
Corpus is now 1513281 characters long

Reading 'sports-6k/ARTICLE-764013'...
Corpus is now 1514782 characters long

Reading 'sports-6k/ARTICLE-764062'...
Corpus is now 1515903 characters long

Reading 'sports-6k/ARTICLE-764066'...
Corpus is now 1516883 characters long


Reading 'sports-6k/ARTICLE-767390'...
Corpus is now 2018615 characters long

Reading 'sports-6k/ARTICLE-767391'...
Corpus is now 2019507 characters long

Reading 'sports-6k/ARTICLE-767409'...
Corpus is now 2020763 characters long

Reading 'sports-6k/ARTICLE-767412'...
Corpus is now 2021439 characters long

Reading 'sports-6k/ARTICLE-767441'...
Corpus is now 2022459 characters long

Reading 'sports-6k/ARTICLE-767454'...
Corpus is now 2023476 characters long

Reading 'sports-6k/ARTICLE-767459'...
Corpus is now 2025355 characters long

Reading 'sports-6k/ARTICLE-767477'...
Corpus is now 2026364 characters long

Reading 'sports-6k/ARTICLE-767488'...
Corpus is now 2027177 characters long

Reading 'sports-6k/ARTICLE-767495'...
Corpus is now 2027757 characters long

Reading 'sports-6k/ARTICLE-767497'...
Corpus is now 2028803 characters long

Reading 'sports-6k/ARTICLE-767503'...
Corpus is now 2029824 characters long

Reading 'sports-6k/ARTICLE-767541'...
Corpus is now 2031247 characters long



Reading 'sports-6k/ARTICLE-770866'...
Corpus is now 2599256 characters long

Reading 'sports-6k/ARTICLE-770870'...
Corpus is now 2601200 characters long

Reading 'sports-6k/ARTICLE-770876'...
Corpus is now 2602524 characters long

Reading 'sports-6k/ARTICLE-770894'...
Corpus is now 2603784 characters long

Reading 'sports-6k/ARTICLE-770896'...
Corpus is now 2604840 characters long

Reading 'sports-6k/ARTICLE-770905'...
Corpus is now 2606909 characters long

Reading 'sports-6k/ARTICLE-770906'...
Corpus is now 2608293 characters long

Reading 'sports-6k/ARTICLE-770909'...
Corpus is now 2608391 characters long

Reading 'sports-6k/ARTICLE-770917'...
Corpus is now 2609625 characters long

Reading 'sports-6k/ARTICLE-770921'...
Corpus is now 2610527 characters long

Reading 'sports-6k/ARTICLE-770927'...
Corpus is now 2611921 characters long

Reading 'sports-6k/ARTICLE-770936'...
Corpus is now 2612578 characters long

Reading 'sports-6k/ARTICLE-770949'...
Corpus is now 2614183 characters long

Reading 'sports-6k/ARTICLE-776000'...
Corpus is now 3141086 characters long

Reading 'sports-6k/ARTICLE-776045'...
Corpus is now 3143290 characters long

Reading 'sports-6k/ARTICLE-776051'...
Corpus is now 3145919 characters long

Reading 'sports-6k/ARTICLE-776054'...
Corpus is now 3147473 characters long

Reading 'sports-6k/ARTICLE-776068'...
Corpus is now 3148741 characters long

Reading 'sports-6k/ARTICLE-776073'...
Corpus is now 3149733 characters long

Reading 'sports-6k/ARTICLE-776090'...
Corpus is now 3150829 characters long

Reading 'sports-6k/ARTICLE-776095'...
Corpus is now 3151689 characters long

Reading 'sports-6k/ARTICLE-776106'...
Corpus is now 3152565 characters long

Reading 'sports-6k/ARTICLE-776115'...
Corpus is now 3153621 characters long

Reading 'sports-6k/ARTICLE-776126'...
Corpus is now 3154765 characters long

Reading 'sports-6k/ARTICLE-776131'...
Corpus is now 3155738 characters long

Reading 'sports-6k/ARTICLE-776171'...
Corpus is now 3157051 characters long



Reading 'sports-6k/ARTICLE-780300'...
Corpus is now 3733665 characters long

Reading 'sports-6k/ARTICLE-780306'...
Corpus is now 3734808 characters long

Reading 'sports-6k/ARTICLE-780308'...
Corpus is now 3735930 characters long

Reading 'sports-6k/ARTICLE-780348'...
Corpus is now 3737375 characters long

Reading 'sports-6k/ARTICLE-780352'...
Corpus is now 3739273 characters long

Reading 'sports-6k/ARTICLE-780355'...
Corpus is now 3739991 characters long

Reading 'sports-6k/ARTICLE-780362'...
Corpus is now 3741089 characters long

Reading 'sports-6k/ARTICLE-780377'...
Corpus is now 3741719 characters long

Reading 'sports-6k/ARTICLE-780384'...
Corpus is now 3742929 characters long

Reading 'sports-6k/ARTICLE-780386'...
Corpus is now 3743340 characters long

Reading 'sports-6k/ARTICLE-780394'...
Corpus is now 3743901 characters long

Reading 'sports-6k/ARTICLE-780402'...
Corpus is now 3745101 characters long

Reading 'sports-6k/ARTICLE-780403'...
Corpus is now 3746115 characters long

Reading 'sports-6k/ARTICLE-784261'...
Corpus is now 4200436 characters long

Reading 'sports-6k/ARTICLE-784266'...
Corpus is now 4201207 characters long

Reading 'sports-6k/ARTICLE-784283'...
Corpus is now 4201963 characters long

Reading 'sports-6k/ARTICLE-784286'...
Corpus is now 4202982 characters long

Reading 'sports-6k/ARTICLE-784304'...
Corpus is now 4203887 characters long

Reading 'sports-6k/ARTICLE-784315'...
Corpus is now 4205365 characters long

Reading 'sports-6k/ARTICLE-784321'...
Corpus is now 4206452 characters long

Reading 'sports-6k/ARTICLE-784334'...
Corpus is now 4207967 characters long

Reading 'sports-6k/ARTICLE-784336'...
Corpus is now 4209321 characters long

Reading 'sports-6k/ARTICLE-784346'...
Corpus is now 4210521 characters long

Reading 'sports-6k/ARTICLE-784389'...
Corpus is now 4210997 characters long

Reading 'sports-6k/ARTICLE-784402'...
Corpus is now 4212099 characters long

Reading 'sports-6k/ARTICLE-784454'...
Corpus is now 4212876 characters long


Corpus is now 4525867 characters long

Reading 'sports-6k/ARTICLE-786972'...
Corpus is now 4526390 characters long

Reading 'sports-6k/ARTICLE-786991'...
Corpus is now 4527422 characters long

Reading 'sports-6k/ARTICLE-787017'...
Corpus is now 4528492 characters long

Reading 'sports-6k/ARTICLE-787022'...
Corpus is now 4529248 characters long

Reading 'sports-6k/ARTICLE-787036'...
Corpus is now 4530377 characters long

Reading 'sports-6k/ARTICLE-787048'...
Corpus is now 4531101 characters long

Reading 'sports-6k/ARTICLE-787067'...
Corpus is now 4532389 characters long

Reading 'sports-6k/ARTICLE-787078'...
Corpus is now 4533658 characters long

Reading 'sports-6k/ARTICLE-787083'...
Corpus is now 4534573 characters long

Reading 'sports-6k/ARTICLE-787095'...
Corpus is now 4535433 characters long

Reading 'sports-6k/ARTICLE-787102'...
Corpus is now 4536434 characters long

Reading 'sports-6k/ARTICLE-787112'...
Corpus is now 4537824 characters long

Reading 'sports-6k/ARTICLE-787118'...


Reading 'sports-6k/ARTICLE-790406'...
Corpus is now 4852307 characters long

Reading 'sports-6k/ARTICLE-790409'...
Corpus is now 4853238 characters long

Reading 'sports-6k/ARTICLE-790418'...
Corpus is now 4854204 characters long

Reading 'sports-6k/ARTICLE-790419'...
Corpus is now 4859355 characters long

Reading 'sports-6k/ARTICLE-790427'...
Corpus is now 4860620 characters long

Reading 'sports-6k/ARTICLE-790444'...
Corpus is now 4862119 characters long

Reading 'sports-6k/ARTICLE-790454'...
Corpus is now 4863662 characters long

Reading 'sports-6k/ARTICLE-790456'...
Corpus is now 4864793 characters long

Reading 'sports-6k/ARTICLE-790468'...
Corpus is now 4865669 characters long

Reading 'sports-6k/ARTICLE-790482'...
Corpus is now 4866315 characters long

Reading 'sports-6k/ARTICLE-790494'...
Corpus is now 4867634 characters long

Reading 'sports-6k/ARTICLE-790500'...
Corpus is now 4868671 characters long

Reading 'sports-6k/ARTICLE-790520'...
Corpus is now 4869882 characters long

Reading 'sports-6k/ARTICLE-793434'...
Corpus is now 5264372 characters long

Reading 'sports-6k/ARTICLE-793450'...
Corpus is now 5265810 characters long

Reading 'sports-6k/ARTICLE-793453'...
Corpus is now 5266472 characters long

Reading 'sports-6k/ARTICLE-793458'...
Corpus is now 5267547 characters long

Reading 'sports-6k/ARTICLE-793465'...
Corpus is now 5270415 characters long

Reading 'sports-6k/ARTICLE-793478'...
Corpus is now 5270955 characters long

Reading 'sports-6k/ARTICLE-793499'...
Corpus is now 5272576 characters long

Reading 'sports-6k/ARTICLE-793509'...
Corpus is now 5274258 characters long

Reading 'sports-6k/ARTICLE-793527'...
Corpus is now 5275473 characters long

Reading 'sports-6k/ARTICLE-793535'...
Corpus is now 5276730 characters long

Reading 'sports-6k/ARTICLE-793555'...
Corpus is now 5277588 characters long

Reading 'sports-6k/ARTICLE-793566'...
Corpus is now 5279345 characters long

Reading 'sports-6k/ARTICLE-793575'...
Corpus is now 5279768 characters long


Corpus is now 5734507 characters long

Reading 'sports-6k/ARTICLE-797777'...
Corpus is now 5735642 characters long

Reading 'sports-6k/ARTICLE-797785'...
Corpus is now 5738063 characters long

Reading 'sports-6k/ARTICLE-797795'...
Corpus is now 5738742 characters long

Reading 'sports-6k/ARTICLE-797800'...
Corpus is now 5739207 characters long

Reading 'sports-6k/ARTICLE-797804'...
Corpus is now 5739909 characters long

Reading 'sports-6k/ARTICLE-797809'...
Corpus is now 5741072 characters long

Reading 'sports-6k/ARTICLE-797811'...
Corpus is now 5741636 characters long

Reading 'sports-6k/ARTICLE-797819'...
Corpus is now 5742889 characters long

Reading 'sports-6k/ARTICLE-797821'...
Corpus is now 5744262 characters long

Reading 'sports-6k/ARTICLE-797825'...
Corpus is now 5746302 characters long

Reading 'sports-6k/ARTICLE-797837'...
Corpus is now 5748192 characters long

Reading 'sports-6k/ARTICLE-797859'...
Corpus is now 5749424 characters long

Reading 'sports-6k/ARTICLE-797861'...

Corpus is now 5906061 characters long

Reading 'sports-6k/ARTICLE-799121'...
Corpus is now 5907182 characters long

Reading 'sports-6k/ARTICLE-799139'...
Corpus is now 5908226 characters long

Reading 'sports-6k/ARTICLE-799143'...
Corpus is now 5908918 characters long

Reading 'sports-6k/ARTICLE-799154'...
Corpus is now 5909610 characters long

Reading 'sports-6k/ARTICLE-799155'...
Corpus is now 5910285 characters long

Reading 'sports-6k/ARTICLE-799158'...
Corpus is now 5912838 characters long

Reading 'sports-6k/ARTICLE-799165'...
Corpus is now 5914304 characters long

Reading 'sports-6k/ARTICLE-799174'...
Corpus is now 5915664 characters long

Reading 'sports-6k/ARTICLE-799180'...
Corpus is now 5916383 characters long

Reading 'sports-6k/ARTICLE-799191'...
Corpus is now 5916919 characters long

Reading 'sports-6k/ARTICLE-799194'...
Corpus is now 5918115 characters long

Reading 'sports-6k/ARTICLE-799210'...
Corpus is now 5919638 characters long

Reading 'sports-6k/ARTICLE-799220'...

Reading 'sports-6k/ARTICLE-800556'...
Corpus is now 6078842 characters long

Reading 'sports-6k/ARTICLE-800564'...
Corpus is now 6078857 characters long

Reading 'sports-6k/ARTICLE-800573'...
Corpus is now 6079992 characters long

Reading 'sports-6k/ARTICLE-800581'...
Corpus is now 6081315 characters long

Reading 'sports-6k/ARTICLE-800631'...
Corpus is now 6083962 characters long

Reading 'sports-6k/ARTICLE-800638'...
Corpus is now 6084847 characters long

Reading 'sports-6k/ARTICLE-800650'...
Corpus is now 6085929 characters long

Reading 'sports-6k/ARTICLE-800661'...
Corpus is now 6087516 characters long

Reading 'sports-6k/ARTICLE-800665'...
Corpus is now 6088326 characters long

Reading 'sports-6k/ARTICLE-800674'...
Corpus is now 6089272 characters long

Reading 'sports-6k/ARTICLE-800688'...
Corpus is now 6089329 characters long

Reading 'sports-6k/ARTICLE-800693'...
Corpus is now 6090353 characters long

Reading 'sports-6k/ARTICLE-800700'...
Corpus is now 6091054 characters long


Reading 'sports-6k/ARTICLE-803197'...
Corpus is now 6340705 characters long

Reading 'sports-6k/ARTICLE-803204'...
Corpus is now 6342512 characters long

Reading 'sports-6k/ARTICLE-803212'...
Corpus is now 6343455 characters long

Reading 'sports-6k/ARTICLE-803221'...
Corpus is now 6344365 characters long

Reading 'sports-6k/ARTICLE-803225'...
Corpus is now 6345697 characters long

Reading 'sports-6k/ARTICLE-803230'...
Corpus is now 6346953 characters long

Reading 'sports-6k/ARTICLE-803235'...
Corpus is now 6348151 characters long

Reading 'sports-6k/ARTICLE-803241'...
Corpus is now 6350072 characters long

Reading 'sports-6k/ARTICLE-803248'...
Corpus is now 6350924 characters long

Reading 'sports-6k/ARTICLE-803263'...
Corpus is now 6352359 characters long

Reading 'sports-6k/ARTICLE-803271'...
Corpus is now 6354025 characters long

Reading 'sports-6k/ARTICLE-803280'...
Corpus is now 6355114 characters long

Reading 'sports-6k/ARTICLE-803281'...
Corpus is now 6358026 characters long


Corpus is now 6506838 characters long

Reading 'sports-6k/ARTICLE-804468'...
Corpus is now 6508295 characters long

Reading 'sports-6k/ARTICLE-804483'...
Corpus is now 6509580 characters long

Reading 'sports-6k/ARTICLE-804512'...
Corpus is now 6510891 characters long

Reading 'sports-6k/ARTICLE-804550'...
Corpus is now 6511952 characters long

Reading 'sports-6k/ARTICLE-804562'...
Corpus is now 6513254 characters long

Reading 'sports-6k/ARTICLE-804566'...
Corpus is now 6513831 characters long

Reading 'sports-6k/ARTICLE-804586'...
Corpus is now 6514596 characters long

Reading 'sports-6k/ARTICLE-804594'...
Corpus is now 6515560 characters long

Reading 'sports-6k/ARTICLE-804612'...
Corpus is now 6516386 characters long

Reading 'sports-6k/ARTICLE-804621'...
Corpus is now 6517089 characters long

Reading 'sports-6k/ARTICLE-804633'...
Corpus is now 6518442 characters long

Reading 'sports-6k/ARTICLE-804639'...
Corpus is now 6519101 characters long

Reading 'sports-6k/ARTICLE-804681'...

Reading 'sports-6k/ARTICLE-806551'...
Corpus is now 6715250 characters long

Reading 'sports-6k/ARTICLE-806566'...
Corpus is now 6716676 characters long

Reading 'sports-6k/ARTICLE-806587'...
Corpus is now 6718584 characters long

Reading 'sports-6k/ARTICLE-806599'...
Corpus is now 6720030 characters long

Reading 'sports-6k/ARTICLE-806610'...
Corpus is now 6721117 characters long

Reading 'sports-6k/ARTICLE-806651'...
Corpus is now 6722334 characters long

Reading 'sports-6k/ARTICLE-806669'...
Corpus is now 6723504 characters long

Reading 'sports-6k/ARTICLE-806680'...
Corpus is now 6724863 characters long

Reading 'sports-6k/ARTICLE-806690'...
Corpus is now 6726354 characters long

Reading 'sports-6k/ARTICLE-806717'...
Corpus is now 6727049 characters long

Reading 'sports-6k/ARTICLE-806722'...
Corpus is now 6728263 characters long

Reading 'sports-6k/ARTICLE-806725'...
Corpus is now 6729694 characters long

Reading 'sports-6k/ARTICLE-806734'...
Corpus is now 6731054 characters long


Reading 'sports-6k/ARTICLE-809759'...
Corpus is now 7096327 characters long

Reading 'sports-6k/ARTICLE-809761'...
Corpus is now 7097777 characters long

Reading 'sports-6k/ARTICLE-809771'...
Corpus is now 7098654 characters long

Reading 'sports-6k/ARTICLE-809777'...
Corpus is now 7103628 characters long

Reading 'sports-6k/ARTICLE-809796'...
Corpus is now 7104725 characters long

Reading 'sports-6k/ARTICLE-809798'...
Corpus is now 7105658 characters long

Reading 'sports-6k/ARTICLE-809812'...
Corpus is now 7107859 characters long

Reading 'sports-6k/ARTICLE-809813'...
Corpus is now 7110113 characters long

Reading 'sports-6k/ARTICLE-809818'...
Corpus is now 7111377 characters long

Reading 'sports-6k/ARTICLE-809823'...
Corpus is now 7112825 characters long

Reading 'sports-6k/ARTICLE-809825'...
Corpus is now 7114612 characters long

Reading 'sports-6k/ARTICLE-809827'...
Corpus is now 7115679 characters long

Reading 'sports-6k/ARTICLE-809832'...
Corpus is now 7117414 characters long



Reading 'sports-6k/ARTICLE-812917'...
Corpus is now 7459295 characters long

Reading 'sports-6k/ARTICLE-812918'...
Corpus is now 7461022 characters long

Reading 'sports-6k/ARTICLE-812919'...
Corpus is now 7462909 characters long

Reading 'sports-6k/ARTICLE-812920'...
Corpus is now 7464796 characters long

Reading 'sports-6k/ARTICLE-812926'...
Corpus is now 7468067 characters long

Reading 'sports-6k/ARTICLE-812927'...
Corpus is now 7468565 characters long

Reading 'sports-6k/ARTICLE-812948'...
Corpus is now 7470001 characters long

Reading 'sports-6k/ARTICLE-812953'...
Corpus is now 7471810 characters long

Reading 'sports-6k/ARTICLE-812959'...
Corpus is now 7474078 characters long

Reading 'sports-6k/ARTICLE-812967'...
Corpus is now 7474746 characters long

Reading 'sports-6k/ARTICLE-812968'...
Corpus is now 7477013 characters long

Reading 'sports-6k/ARTICLE-812983'...
Corpus is now 7478652 characters long

Reading 'sports-6k/ARTICLE-812998'...
Corpus is now 7479864 characters long


Reading 'sports-6k/ARTICLE-817378'...
Corpus is now 7960110 characters long

Reading 'sports-6k/ARTICLE-817399'...
Corpus is now 7960517 characters long

Reading 'sports-6k/ARTICLE-817400'...
Corpus is now 7961542 characters long

Reading 'sports-6k/ARTICLE-817402'...
Corpus is now 7963570 characters long

Reading 'sports-6k/ARTICLE-817403'...
Corpus is now 7964904 characters long

Reading 'sports-6k/ARTICLE-817404'...
Corpus is now 7965834 characters long

Reading 'sports-6k/ARTICLE-817417'...
Corpus is now 7966770 characters long

Reading 'sports-6k/ARTICLE-817439'...
Corpus is now 7967685 characters long

Reading 'sports-6k/ARTICLE-817443'...
Corpus is now 7968322 characters long

Reading 'sports-6k/ARTICLE-817459'...
Corpus is now 7969445 characters long

Reading 'sports-6k/ARTICLE-817468'...
Corpus is now 7973173 characters long

Reading 'sports-6k/ARTICLE-817481'...
Corpus is now 7975418 characters long

Reading 'sports-6k/ARTICLE-817508'...
Corpus is now 7976603 characters long

Corpus is now 8257983 characters long

Reading 'sports-6k/ARTICLE-820083'...
Corpus is now 8259866 characters long

Reading 'sports-6k/ARTICLE-820093'...
Corpus is now 8261113 characters long

Reading 'sports-6k/ARTICLE-820100'...
Corpus is now 8262276 characters long

Reading 'sports-6k/ARTICLE-820103'...
Corpus is now 8264266 characters long

Reading 'sports-6k/ARTICLE-820111'...
Corpus is now 8265594 characters long

Reading 'sports-6k/ARTICLE-820128'...
Corpus is now 8266897 characters long

Reading 'sports-6k/ARTICLE-820145'...
Corpus is now 8267573 characters long

Reading 'sports-6k/ARTICLE-820152'...
Corpus is now 8268593 characters long

Reading 'sports-6k/ARTICLE-820168'...
Corpus is now 8269969 characters long

Reading 'sports-6k/ARTICLE-820183'...
Corpus is now 8271060 characters long

Reading 'sports-6k/ARTICLE-820184'...
Corpus is now 8273985 characters long

Reading 'sports-6k/ARTICLE-820193'...
Corpus is now 8274986 characters long

Reading 'sports-6k/ARTICLE-820231'...

Reading 'sports-6k/ARTICLE-822505'...
Corpus is now 8529177 characters long

Reading 'sports-6k/ARTICLE-822507'...
Corpus is now 8530634 characters long

Reading 'sports-6k/ARTICLE-822512'...
Corpus is now 8532091 characters long

Reading 'sports-6k/ARTICLE-822519'...
Corpus is now 8533365 characters long

Reading 'sports-6k/ARTICLE-822524'...
Corpus is now 8535329 characters long

Reading 'sports-6k/ARTICLE-822527'...
Corpus is now 8537457 characters long

Reading 'sports-6k/ARTICLE-822534'...
Corpus is now 8538532 characters long

Reading 'sports-6k/ARTICLE-822536'...
Corpus is now 8539697 characters long

Reading 'sports-6k/ARTICLE-822541'...
Corpus is now 8540394 characters long

Reading 'sports-6k/ARTICLE-822546'...
Corpus is now 8541758 characters long

Reading 'sports-6k/ARTICLE-822548'...
Corpus is now 8542986 characters long

Reading 'sports-6k/ARTICLE-822553'...
Corpus is now 8544210 characters long

Reading 'sports-6k/ARTICLE-822562'...
Corpus is now 8545271 characters long



Reading 'sports-6k/ARTICLE-826371'...
Corpus is now 8998504 characters long

Reading 'sports-6k/ARTICLE-826373'...
Corpus is now 8999411 characters long

Reading 'sports-6k/ARTICLE-826374'...
Corpus is now 9000765 characters long

Reading 'sports-6k/ARTICLE-826376'...
Corpus is now 9002040 characters long

Reading 'sports-6k/ARTICLE-826378'...
Corpus is now 9002395 characters long

Reading 'sports-6k/ARTICLE-826382'...
Corpus is now 9004250 characters long

Reading 'sports-6k/ARTICLE-826393'...
Corpus is now 9006067 characters long

Reading 'sports-6k/ARTICLE-826404'...
Corpus is now 9007665 characters long

Reading 'sports-6k/ARTICLE-826410'...
Corpus is now 9010641 characters long

Reading 'sports-6k/ARTICLE-826419'...
Corpus is now 9011264 characters long

Reading 'sports-6k/ARTICLE-826423'...
Corpus is now 9012745 characters long

Reading 'sports-6k/ARTICLE-826427'...
Corpus is now 9014237 characters long

Reading 'sports-6k/ARTICLE-826436'...
Corpus is now 9015555 characters long

Reading 'sports-6k/ARTICLE-829916'...
Corpus is now 9304476 characters long

Reading 'sports-6k/ARTICLE-829951'...
Corpus is now 9305683 characters long

Reading 'sports-6k/ARTICLE-829965'...
Corpus is now 9306729 characters long

Reading 'sports-6k/ARTICLE-829968'...
Corpus is now 9307883 characters long

Reading 'sports-6k/ARTICLE-829973'...
Corpus is now 9309383 characters long

Reading 'sports-6k/ARTICLE-829977'...
Corpus is now 9310493 characters long

Reading 'sports-6k/ARTICLE-829986'...
Corpus is now 9312059 characters long

Reading 'sports-6k/ARTICLE-829987'...
Corpus is now 9312899 characters long

Reading 'sports-6k/ARTICLE-830079'...
Corpus is now 9314554 characters long

Reading 'sports-6k/ARTICLE-830087'...
Corpus is now 9316179 characters long

Reading 'sports-6k/ARTICLE-830098'...
Corpus is now 9317271 characters long

Reading 'sports-6k/ARTICLE-830149'...
Corpus is now 9319486 characters long

Reading 'sports-6k/ARTICLE-830173'...
Corpus is now 9321197 characters long


**Split the corpus into sentences**

In [20]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [21]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [22]:
# Convert a raw sentence into a list of lowercase words.
# Every character that is not a letter (including á, é, í, ó, ú, ñ) or a hyphen
# is replaced with a space before splitting on whitespace.
def sentence_to_wordlist(raw):
    regex = u"[^a-zA-Záéíóúñ-]"
    clean = re.sub(regex, " ", raw).encode("utf-8").lower()
    words = clean.split()
    return words

In [23]:
# Tokenize every non-empty sentence into a list of words.
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [24]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

Pero hay que cambiar mucho para 2017.
['pero', 'hay', 'que', 'cambiar', 'mucho', 'para']
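
The cleaning step can be tried in isolation. The sketch below is a variant of `sentence_to_wordlist` that skips the `encode` call, so it returns text strings on both Python 2 and 3. The names `NON_WORD` and `to_words` are introduced here for illustration and are not part of the notebook.

```python
# -*- coding: utf-8 -*-
import re

# Anything that is not an ASCII letter, one of the listed accented letters,
# or a hyphen counts as a separator.
NON_WORD = re.compile(u"[^a-zA-Záéíóúñ-]")


def to_words(raw):
    """Replace separator characters with spaces, lowercase, and split."""
    return NON_WORD.sub(u" ", raw).lower().split()


print(to_words(u"Pero hay que cambiar mucho para 2017."))
```

Digits are treated as separators, which is why `2017` disappears from the tokenized sentence. Uppercase accented letters such as `Á` are not in the character class either, so they are also treated as separators; lowercasing before the substitution would keep them.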


In [25]:
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 1,653,785 tokens


## Train Word2Vec

In [26]:
# Step 3: build the model.
# Once we have word vectors, they help with three main tasks:
# distance, similarity, and ranking.

# Dimensionality of the resulting word vectors.
# More dimensions are more expensive to train,
# but can capture finer distinctions between words.
num_features = 300

# Minimum word count threshold: rarer words are dropped from the vocabulary.
min_word_count = 3

# Number of threads to run in parallel.
# More workers generally means faster training.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsampling threshold for frequent words.
# Values between 0 and 1e-5 are commonly recommended.
downsampling = 1e-3

# Seed for the random number generator, to make the results reproducible.
# Deterministic runs are handy for debugging; note that training with
# several worker threads can still vary slightly from run to run.
seed = 1
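
To get a feel for what the `downsampling` threshold does, here is a minimal sketch of the keep-probability formula from the original word2vec C code, which gensim appears to follow as well. The function name `keep_probability` is hypothetical. The total of 1,653,785 is the raw token count reported above and is only an approximation, since the library computes the threshold from the words that survive `min_count`.

```python
import math


def keep_probability(word_count, total_words, sample=1e-3):
    """Probability of keeping one occurrence of a word during subsampling.

    Assumes word_count > 0 and sample > 0.
    """
    threshold = sample * total_words
    ratio = word_count / float(threshold)
    # Equivalent to (sqrt(count / threshold) + 1) * (threshold / count).
    probability = (math.sqrt(ratio) + 1.0) / ratio
    # Rare words would get a value above 1, which simply means "always keep".
    return min(probability, 1.0)


total = 1653785  # approximate corpus size taken from the token count above
for count in (10, 1000, 50000):
    print("{0:>6} occurrences -> keep {1:.3f}".format(
        count, keep_probability(count, total)))
```

Only words whose count is well above `sample * total_words` are thinned out; everything else is kept unchanged.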

In [27]:
thrones2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
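
Setting `sg=1` selects the skip-gram architecture, which learns to predict the words around each center word, and `window` bounds how far away a context word may be. The sketch below lists the (center, context) pairs a fixed window would produce. It is a simplification: the real training loop also varies the effective window size at random and applies downsampling and negative sampling, all of which are ignored here. `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(words, window):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield center, words[j]


sentence = ["pero", "hay", "que", "cambiar", "mucho", "para"]
for pair in skipgram_pairs(sentence, window=2):
    print(pair)
```

A larger window yields more pairs per sentence and tends to capture broader, topical relatedness, at the cost of longer training.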

In [28]:
thrones2vec.build_vocab(sentences)

2017-12-04 16:53:21,316 : INFO : collecting all words and their counts
2017-12-04 16:53:21,324 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-04 16:53:21,379 : INFO : PROGRESS: at sentence #10000, processed 284390 words, keeping 16662 word types
2017-12-04 16:53:21,430 : INFO : PROGRESS: at sentence #20000, processed 571779 words, keeping 24630 word types
2017-12-04 16:53:21,482 : INFO : PROGRESS: at sentence #30000, processed 847238 words, keeping 31247 word types
2017-12-04 16:53:21,542 : INFO : PROGRESS: at sentence #40000, processed 1126713 words, keeping 36997 word types
2017-12-04 16:53:21,598 : INFO : PROGRESS: at sentence #50000, processed 1421879 words, keeping 42310 word types
2017-12-04 16:53:21,639 : INFO : collected 45655 word types from a corpus of 1653785 raw words and 57658 sentences
2017-12-04 16:53:21,640 : INFO : Loading a fresh vocabulary
2017-12-04 16:53:21,721 : INFO : min_count=3 retains 20446 unique words (44% of original 4565
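
The drop from 45,655 word types to 20,446 retained words is the effect of `min_count`. A minimal sketch of that pruning step, using a made-up toy corpus and a hypothetical `prune_vocab` helper rather than gensim's own implementation:

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy data for illustration only.
toy_sentences = [
    ["gol", "de", "messi"],
    ["gol", "de", "penal"],
    ["otro", "gol", "de", "messi"],
]
vocab = prune_vocab(toy_sentences, min_count=3)
print(sorted(vocab))  # prints ['de', 'gol']
```

Words below the threshold never receive a vector, so they cannot be queried later.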

In [29]:
print("Word2Vec vocabulary length:", len(thrones2vec.wv.vocab))

Word2Vec vocabulary length: 20446

**Start training, this might take a minute or two...**

In [30]:
thrones2vec.train(sentences, total_examples=thrones2vec.corpus_count, epochs=thrones2vec.iter)

2017-12-04 16:54:11,500 : INFO : training model with 4 workers on 20446 vocabulary and 300 features, using sg=1 hs=0 sample=0.001 negative=5 window=7
2017-12-04 16:54:12,544 : INFO : PROGRESS: at 2.52% examples, 134934 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:13,562 : INFO : PROGRESS: at 4.68% examples, 129335 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:14,562 : INFO : PROGRESS: at 6.53% examples, 123535 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:15,648 : INFO : PROGRESS: at 8.76% examples, 121390 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:16,671 : INFO : PROGRESS: at 11.30% examples, 124174 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:17,704 : INFO : PROGRESS: at 13.77% examples, 125787 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:18,716 : INFO : PROGRESS: at 16.22% examples, 128256 words/s, in_qsize 8, out_qsize 0
2017-12-04 16:54:19,722 : INFO : PROGRESS: at 18.40% examples, 128624 words/s, in_qsize 7, out_qsize 0
2017-12-04 16:54:20,738 : INFO

5758189

**Save to file, can be useful later**

In [32]:
thrones2vec.save("tn2vec.w2v")

2017-12-04 16:55:31,000 : INFO : saving Word2Vec object under tn2vec.w2v, separately None
2017-12-04 16:55:31,004 : INFO : not storing attribute syn0norm
2017-12-04 16:55:31,008 : INFO : not storing attribute cum_table
2017-12-04 16:55:31,312 : INFO : saved tn2vec.w2v


## Explore the trained model

In [37]:
thrones2vec = w2v.Word2Vec.load("tn2vec.w2v")

2017-12-04 16:56:32,471 : INFO : loading Word2Vec object from tn2vec.w2v
2017-12-04 16:56:32,566 : INFO : loading wv recursively from tn2vec.w2v.wv.* with mmap=None
2017-12-04 16:56:32,567 : INFO : setting ignored attribute syn0norm to None
2017-12-04 16:56:32,567 : INFO : setting ignored attribute cum_table to None
2017-12-04 16:56:32,568 : INFO : loaded tn2vec.w2v


### Compress the word vectors into 2D space and plot them

In [40]:
# t-SNE projects the high-dimensional word vectors down to two dimensions.
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [None]:
all_word_vectors_matrix = thrones2vec.wv.syn0

**Train t-SNE, this could take a minute or two...**

In [None]:
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)

**Plot the big picture**

In [None]:
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, all_word_vectors_matrix_2d[thrones2vec.wv.vocab[word].index])
            for word in thrones2vec.wv.vocab
        ]
    ],
    columns=["word", "x", "y"]
)

In [None]:
points.head(10)

In [None]:
sns.set_context("poster")

In [None]:
points.plot.scatter("x", "y", s=10, figsize=(20, 12))

**Zoom in to some interesting places**

In [None]:
def plot_region(x_bounds, y_bounds):
    # Select the points that fall inside the bounding box.
    region = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]

    ax = region.plot.scatter("x", "y", s=35, figsize=(10, 8))
    # Label each point with its word, offset slightly from the marker.
    for i, point in region.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

**People related to Kingsguard ended up together**

In [None]:
plot_region(x_bounds=(4.0, 4.2), y_bounds=(-0.5, -0.1))

**Food products are grouped nicely as well. Aerys (The Mad King) being close to "roasted" also looks sadly correct**

In [None]:
plot_region(x_bounds=(0, 1), y_bounds=(4, 4.5))

### Explore semantic similarities between book characters

**Words closest to the given word**

In [34]:
thrones2vec.most_similar("Stark")

[(u'Eddard', 0.742438018321991),
 (u'Winterfell', 0.64848792552948),
 (u'Brandon', 0.6438549757003784),
 (u'Lyanna', 0.6438394784927368),
 (u'Robb', 0.6242259740829468),
 (u'executed', 0.6220564842224121),
 (u'Arryn', 0.6189971566200256),
 (u'Benjen', 0.6188897490501404),
 (u'direwolf', 0.614366352558136),
 (u'beheaded', 0.6046538352966309)]
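
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that ranking, using made-up 3-dimensional vectors (the real model uses 300 dimensions); `cosine_similarity`, `rank_neighbours`, and `toy_vectors` are names introduced here for illustration:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_neighbours(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors, for illustration only.
toy_vectors = {
    "wolf": [0.9, 0.1, 0.0],
    "direwolf": [0.8, 0.2, 0.1],
    "wine": [0.0, 0.1, 0.9],
}
print(rank_neighbours("wolf", toy_vectors))
```

Because cosine similarity ignores vector length, two words score highly when their vectors point in a similar direction, regardless of how frequent either word is.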

In [35]:
thrones2vec.most_similar("Aerys")

[(u'Jaehaerys', 0.7991689443588257),
 (u'Daeron', 0.7808291912078857),
 (u'II', 0.7649893164634705),
 (u'reign', 0.7466063499450684),
 (u'Mad', 0.7380156517028809),
 (u'Beggar', 0.7334001660346985),
 (u'Rhaegar', 0.7308052182197571),
 (u'Unworthy', 0.7120681405067444),
 (u'Cruel', 0.7089171409606934),
 (u'Dome', 0.7070454359054565)]

In [36]:
thrones2vec.most_similar("direwolf")

[(u'Rickon', 0.6617892980575562),
 (u'SHAGGYDOG', 0.643834114074707),
 (u'wolf', 0.6403605341911316),
 (u'GHOST', 0.6385751962661743),
 (u'pup', 0.6156360507011414),
 (u'Robb', 0.6147520542144775),
 (u'Stark', 0.614366352558136),
 (u'crannogman', 0.6082616448402405),
 (u'wight', 0.606614351272583),
 (u'RICKON', 0.6039268970489502)]

**Linear relationships between word pairs**

In [37]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = thrones2vec.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2

In [38]:
nearest_similarity_cosmul("Stark", "Winterfell", "Riverrun")
nearest_similarity_cosmul("Jaime", "sword", "wine")
nearest_similarity_cosmul("Arya", "Nymeria", "dragons")

Stark is related to Winterfell, as Tully is related to Riverrun
Jaime is related to sword, as Tyrion is related to wine
Arya is related to Nymeria, as Dany is related to dragons


u'Dany'
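
For intuition about what `most_similar_cosmul` computes, here is a minimal sketch of the multiplicative combination objective (often called 3CosMul) as we understand it: each cosine similarity is shifted into the range [0, 1], the similarities to the positive words are multiplied together, and the product is divided by the similarities to the negative words plus a small epsilon. The vectors are made up, and `shifted_cosine` and `cosmul_analogy` are hypothetical helpers rather than the library's implementation.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def shifted_cosine(a, b):
    """Cosine similarity mapped from [-1, 1] to [0, 1]."""
    return (1.0 + sum(x * y for x, y in zip(unit(a), unit(b)))) / 2.0


def cosmul_analogy(vectors, positive, negative, epsilon=1e-6):
    """Return the candidate word with the highest multiplicative score."""
    best_word, best_score = None, -1.0
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # the query words themselves are not valid answers
        numerator = 1.0
        for p in positive:
            numerator *= shifted_cosine(vec, vectors[p])
        denominator = 1.0
        for n in negative:
            denominator *= shifted_cosine(vec, vectors[n])
        score = numerator / (denominator + epsilon)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up vectors: axes loosely mean (north, riverlands, family-vs-castle).
toy = {
    "Stark": [1.0, 0.0, 1.0],
    "Winterfell": [1.0, 0.0, -1.0],
    "Tully": [0.0, 1.0, 1.0],
    "Riverrun": [0.0, 1.0, -1.0],
    "wine": [0.2, 0.2, 0.0],
}
print(cosmul_analogy(toy, positive=["Riverrun", "Stark"], negative=["Winterfell"]))
```

The argument order mirrors `nearest_similarity_cosmul`: the word being looked for should be close to `end2` and `start1` while being far from `end1`.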