<img src="images/alice.jpg?raw=true"/>

In [1]:
import re
import spacy

from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.summarization.textcleaner import get_sentences

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

from sklearn import cluster
from sklearn import metrics
from sklearn.manifold import TSNE
import numpy as np
import pandas as pd
print("Libraries imported")

Libraries imported


# Data Preprocessing

In [2]:
# Load raw text
file = "text/alice.txt"
with open(file, encoding='latin-1') as f:
    book_raw_text = f.readlines()
    
# View the first 20 rows 
book_raw_text[:20]

["Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\n",
 '\n',
 'This eBook is for the use of anyone anywhere at no cost and with\n',
 'almost no restrictions whatsoever.  You may copy it, give it away or\n',
 're-use it under the terms of the Project Gutenberg License included\n',
 'with this eBook or online at www.gutenberg.net\n',
 '\n',
 '\n',
 "Title: Alice's Adventures in Wonderland\n",
 '       Illustrated by Arthur Rackham. With a Proem by Austin Dobson\n',
 '\n',
 'Author: Lewis Carroll\n',
 '\n',
 'Illustrator: Arthur Rackham\n',
 '\n',
 'Release Date: May 19, 2009 [EBook #28885]\n',
 '\n',
 'Language: English\n',
 '\n',
 '\n']

The book is structured as follows:
* Preamble about the publication.
* A poem.
* Table of contents.
* List of plates.
* Twelve chapters with content.

We will focus on the content of the chapters and split the book into the corresponding twelve sections.
<br>Each chapter starts with the word "CHAPTER", so we use it as the delimiter for the split.

In [3]:
chapters = "".join(book_raw_text).split("CHAPTER")
len(chapters)

13
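
The count follows from how `str.split` works: n markers produce n + 1 pieces, and the first piece is whatever precedes the first marker. A minimal sketch on a made-up string (`toy_book` is hypothetical, not taken from the file):

```python
# Hypothetical miniature "book": a preamble followed by three chapters.
toy_book = "Preamble text\nCHAPTER I\nfirst\nCHAPTER II\nsecond\nCHAPTER III\nthird\n"

pieces = toy_book.split("CHAPTER")

# Three markers give four pieces; pieces[0] is the text before the first marker.
print(len(pieces))
print(repr(pieces[0]))
```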

We remove the first entry in the list of chapters, as it holds everything before the first "CHAPTER" marker: the preamble, poem, table of contents and list of plates.

In [240]:
chapters.pop(0)

'Project Gutenberg\'s Alice\'s Adventures in Wonderland, by Lewis Carroll\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\nTitle: Alice\'s Adventures in Wonderland\n       Illustrated by Arthur Rackham. With a Proem by Austin Dobson\n\nAuthor: Lewis Carroll\n\nIllustrator: Arthur Rackham\n\nRelease Date: May 19, 2009 [EBook #28885]\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK ALICE\'S ADVENTURES IN WONDERLAND ***\n\n\n\n\nProduced by Jana Srna, Emmy and the Online Distributed\nProofreading Team at http://www.pgdp.net (This file was\nproduced from images generously made available by the\nUniversity of Florida Digital Collections.)\n\n\n\n\n\n\n\n\n\n\n\nALICE\'S ADVENTURES IN WONDERLAND\n\n[Illustration: "Alice"]\n\n[Illustration:\n\n          ALICE\'SÂ·ADV

In [241]:
num_chapters = len(chapters)
num_chapters

12

The last chapter contains additional license information after the end of the story, which is marked with "THE END". Since this information is not part of the story, we split the last chapter at that marker and keep only the text before it.

In [242]:
chapters[num_chapters - 1]

' XII\n\n\n[Sidenote: _Alice\'s Evidence_]\n\n"HERE!" cried Alice, quite forgetting in the flurry of\nthe moment how large she had grown in the last few minutes, and she\njumped up in such a hurry that she tipped over the jury-box with the\nedge of her skirt, upsetting all the jurymen on to the heads of the\ncrowd below, and there they lay sprawling about, reminding her very much\nof a globe of gold-fish she had accidentally upset the week before.\n\n"Oh, I _beg_ your pardon!" she exclaimed in a tone of great dismay, and\nbegan picking them up again as quickly as she could, for the accident of\nthe gold-fish kept running in her head, and she had a vague sort of idea\nthat they must be collected at once and put back into the jury-box, or\nthey would die.\n\n"The trial cannot proceed," said the King in a very grave voice, "until\nall the jurymen are back in their proper places--_all_," he repeated\nwith great emphasis, looking hard at Alice as he said so.\n\nAlice looked at the jury-box,

In [243]:
end = chapters[num_chapters - 1].split("THE END")
chapters[len(chapters)-1] = end[0]
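
As a hedged alternative, `str.partition` returns the text before the first occurrence of a marker and hands back the whole string when the marker is absent. The sample strings below are made up for illustration:

```python
# Hypothetical last chapter followed by licence text.
last_chapter = "Alice woke up. THE END End of the Project Gutenberg EBook"

story, marker, licence = last_chapter.partition("THE END")

print(repr(story))    # text before the marker
print(repr(marker))   # the marker itself, or '' if it was not found

# When the marker is missing, the whole string comes back as the first element.
unchanged, missing, _ = "no marker here".partition("THE END")
```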

In [245]:
#chapters[num_chapters - 1]

In [246]:
# Display the first 30 characters of each chapter
for chapter in chapters:
    print(chapter[:30])

 I


[Sidenote: _Down the Rabb
 II


[Sidenote: _Pool of Tear
 III


[Sidenote: _A Caucus-ra
 IV


[Sidenote: _The Rabbit s
 V


[Sidenote: _Advice from a
 VI


[Sidenote: _Pig and Pepp
 VII


[Sidenote: _A Mad Tea-p
 VIII


[Sidenote: _The Queen'
 IX


[Sidenote: _The Mock Tur
 X


[Sidenote: _The Lobster Q
 XI


[Sidenote: _Who Stole th
 XII


[Sidenote: _Alice's Evi


## Text preprocessing
Steps:
1. Remove chapter titles, illustrations and sidenotes
2. Remove newline, `*` and `_` characters

In [247]:
# Remove the first line of each chapter (the chapter number)
chapters = [re.sub(r"^.*\n", "", chapter) for chapter in chapters]

# Remove chapter titles/illustrations/sidenotes - these are of the form "[text]"
chapters = [re.sub(r"\[.*?\]", " ", chapter) for chapter in chapters]

# Collapse runs of whitespace (spaces and new lines) into a single space
chapters = [re.sub(r"\s+", " ", chapter) for chapter in chapters]

# Replace the * characters (the ***** separator lines) with spaces
chapters = [re.sub(r"\*", " ", chapter) for chapter in chapters]

# Replace the _ characters (emphasis markers) with spaces
chapters = [re.sub("_", " ", chapter) for chapter in chapters]

# Let's look at how the text looks now
for i, chapter in enumerate(chapters, start=1):
    print("Chapter {}: {}".format(i, chapter[:50]))

Chapter 1:  ALICE was beginning to get very tired of sitting 
Chapter 2:  "CURIOUSER and curiouser!" cried Alice (she was s
Chapter 3:  THEY were indeed a queer-looking party that assem
Chapter 4:  IT was the White Rabbit, trotting slowly back aga
Chapter 5:  THE Caterpillar and Alice looked at each other fo
Chapter 6:  FOR a minute or two she stood looking at the hous
Chapter 7:  THERE was a table set out under a tree in front o
Chapter 8:  A LARGE rose-tree stood near the entrance of the 
Chapter 9:  "YOU can't think how glad I am to see you again, 
Chapter 10:  THE Mock Turtle sighed deeply, and drew the back 
Chapter 11:  THE King and Queen of Hearts were seated on their
Chapter 12:  "HERE!" cried Alice, quite forgetting in the flur
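
One detail worth checking on a small sample: by default `.` does not match a newline, so a bracketed block that spans several lines is not removed by `\[.*?\]` unless `re.DOTALL` is passed. The sketch below uses a made-up `sample` string; whether the chapters actually contain such multi-line blocks is an assumption to verify against the file.

```python
import re

# Made-up sample with a single-line sidenote and a multi-line illustration block.
sample = "[Sidenote: _Title_]\nALICE was tired.\n[Illustration:\n\n  caption]\nShe slept."

# Default flags: '.' stops at newlines, so only the single-line block is removed.
default_clean = re.sub(r"\[.*?\]", " ", sample)

# re.DOTALL lets '.' cross newlines, so both blocks are removed.
dotall_clean = re.sub(r"\[.*?\]", " ", sample, flags=re.DOTALL)

# Collapse whitespace afterwards, as in the cell above.
default_clean = re.sub(r"\s+", " ", default_clean).strip()
dotall_clean = re.sub(r"\s+", " ", dotall_clean).strip()

print(default_clean)
print(dotall_clean)
```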


In [248]:
chapter_sentences = [list(get_sentences(doc)) for doc in chapters]

In [249]:
len(chapter_sentences)

12

In [250]:
total = 0
for i, chapter in enumerate(chapter_sentences, start=1):
    print("Phrases in chapter {}: {}".format(i, len(chapter)))
    total += len(chapter)
print("Phrases in total: {}".format(total))

Phrases in chapter 1: 60
Phrases in chapter 2: 79
Phrases in chapter 3: 64
Phrases in chapter 4: 95
Phrases in chapter 5: 73
Phrases in chapter 6: 99
Phrases in chapter 7: 98
Phrases in chapter 8: 80
Phrases in chapter 9: 85
Phrases in chapter 10: 93
Phrases in chapter 11: 77
Phrases in chapter 12: 69
Phrases in total: 972


In [251]:
chapter_sentences[0][:5]

['ALICE was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?" So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid) whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.',
 'There was nothing so  very  remarkable in that; nor did Alice think it so  very  much out of the way to hear the Rabbit say to itself, "Oh dear!',
 'Oh dear!',
 'I shall be too late!" (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually  took a watch out of its waistcoat-pocket , 

We no longer need the sentences grouped by chapter: Word2Vec just expects a list of sentences. So let's flatten the chapter list.

In [252]:
sentences = [item for chapter in chapter_sentences for item in chapter]
len(sentences)

972
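
The same flattening can also be written with `itertools.chain.from_iterable`. A small sketch with made-up data (`toy_chapters` is hypothetical):

```python
from itertools import chain

# Made-up per-chapter sentence lists.
toy_chapters = [["Sentence one.", "Sentence two."], ["Sentence three."]]

# chain.from_iterable walks each inner list in order and yields its items.
flat = list(chain.from_iterable(toy_chapters))
print(flat)
```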

In [253]:
sentences[:5]

['ALICE was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?" So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid) whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.',
 'There was nothing so  very  remarkable in that; nor did Alice think it so  very  much out of the way to hear the Rabbit say to itself, "Oh dear!',
 'Oh dear!',
 'I shall be too late!" (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually  took a watch out of its waistcoat-pocket , 

Tokenise all sentences, lowercasing them and removing punctuation.

In [255]:
# simple_preprocess lowercases, tokenises and drops punctuation; deacc=False leaves accent marks in place
tokenised_sentences = [simple_preprocess(str(sentence), deacc=False) for sentence in sentences]
print(len(tokenised_sentences))
# Let's view the words of the first sentence of the first chapter
print(tokenised_sentences[0])

972
['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use', 'of', 'book', 'thought', 'alice', 'without', 'pictures', 'or', 'conversations', 'so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', 'as', 'well', 'as', 'she', 'could', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', 'whether', 'the', 'pleasure', 'of', 'making', 'daisy', 'chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', 'up', 'and', 'picking', 'the', 'daisies', 'when', 'suddenly', 'white', 'rabbit', 'with', 'pink', 'eyes', 'ran', 'close', 'by', 'her']
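
To make the effect of this step concrete, here is a rough standard-library approximation of what `simple_preprocess` does with its default settings: lowercasing, keeping alphabetic tokens, and discarding tokens shorter than 2 or longer than 15 characters. The function `rough_tokenise` is hypothetical and only mimics the behaviour for plain English text; it is not gensim's implementation.

```python
import re

def rough_tokenise(text, min_len=2, max_len=15):
    """Lowercase, extract runs of letters, and drop very short or very long tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_tokenise('"and what is the use of a book," thought Alice')
print(tokens)
```

Note how the one-letter word "a" disappears, which matches the tokenised output above.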


Create bigrams and trigrams from the tokenised sentences using Phraser objects.

In [277]:
bigram_phrases = Phrases(tokenised_sentences, min_count=2, threshold=1)
bigram = Phraser(bigram_phrases)

trigram_phrases = Phrases(bigram[tokenised_sentences], min_count=1, threshold=1, delimiter=b'_')
trigram = Phraser(trigram_phrases)
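
For intuition about `min_count` and `threshold`: gensim's default scoring for `Phrases` is roughly `(count(a, b) - min_count) * N / (count(a) * count(b))`, and a pair is joined when its score exceeds `threshold`. The sketch below is a simplified re-implementation on a made-up corpus; the helper `score_bigrams` is hypothetical, and using the number of distinct words for `N` is an assumption that may differ from gensim's internal bookkeeping.

```python
from collections import Counter

def score_bigrams(sentences, min_count=2):
    """Score adjacent word pairs; higher scores mean the pair co-occurs more than chance."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # assumption: N = number of distinct words
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab >= min_count:
            scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

# Made-up tokenised sentences.
toy_sentences = [
    ["white", "rabbit", "ran"],
    ["white", "rabbit", "sat"],
    ["white", "rabbit", "hid"],
    ["the", "rabbit", "ran"],
]
scores = score_bigrams(toy_sentences, min_count=2)
print(scores)
```

Pairs seen fewer than `min_count` times are not scored at all, and a pair seen exactly `min_count` times scores zero, so raising either parameter yields fewer joined phrases.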

In [278]:
bigram

<gensim.models.phrases.Phraser at 0x124511ac0>

In [258]:
# Define helper functions for tokenisation, bigrams, trigrams and lemmatisation

# Gensim simple preprocessing pipeline - lowercases and tokenises
def simple_preprocessing(texts):
    return [simple_preprocess(str(doc)) for doc in texts]

def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram[doc] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """
    Lemmatise with spaCy, keeping only nouns, adjectives, verbs and adverbs.
    Relies on the module-level `nlp` pipeline, which must be loaded before this is called.
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Skip spaCy's '-PRON-' placeholder lemma for pronouns
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags and token.lemma_ != '-PRON-'])
    return texts_out

In [260]:
# Simple Gensim processing

tokens_preprocessed = simple_preprocessing(tokenised_sentences)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
tokens_lemmatized = lemmatization(tokens_preprocessed, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Form Bigrams
tokens_bigrams = make_bigrams(tokens_lemmatized)

# Form Trigrams
tokens_trigrams = make_trigrams(tokens_bigrams)

data_processed = tokens_trigrams

# Let's look at the first and fifth sentences
print(data_processed[0])
print(data_processed[4])

['begin', 'very_tired', 'sit', 'sister', 'bank', 'have', 'once', 'twice', 'peep', 'book', 'sister', 'read', 'picture', 'conversation', 'use', 'book', 'think', 'alice', 'picture', 'conversation', 'consider', 'own', 'mind', 'as_well', 'could', 'hot', 'day', 'make', 'feel_very', 'sleepy', 'stupid', 'pleasure', 'make', 'daisy', 'chain', 'would', 'worth', 'trouble', 'get', 'pick', 'daisy', 'when_suddenly', 'white_rabbit', 'pink', 'eye', 'run', 'close']
['moment', 'down', 'go', 'alice', 'never', 'once', 'consider', 'how', 'world', 'again']


In [261]:
# Initialise the model - sg=1 selects skip-gram (sg=0 would select CBOW)
w2v_model = Word2Vec(min_count=1,
                     window=5,
                     size=300,
                     sg=1) 

#  – CBOW (Continuous bag-of-words): The order of the context words does not influence prediction
#  – Skip-grams: nearby context words are weighted more heavily than distant ones.
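
To illustrate what skip-gram trains on: each word is used to predict the words within `window` positions of it. The sketch below enumerates these (centre, context) pairs for a made-up token list. `skipgram_pairs` is a hypothetical helper, and it ignores refinements such as random window shrinking and negative sampling that actual word2vec implementations use.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word and its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

# Made-up lemmatised sentence.
pairs = skipgram_pairs(["white_rabbit", "pink", "eye", "run"], window=1)
print(pairs)
```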

In [262]:
# Build the vocab
w2v_model.build_vocab(data_processed, progress_per=10000)

In [263]:
# Train the model
w2v_model.train(data_processed, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

(299195, 352770)

In [264]:
# Normalise the Word2vec vectors, needed for some similarity measures
# (replace=True overwrites the raw vectors, so the model cannot be trained further)
w2v_model.init_sims(replace=True)
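
Why normalise? Once vectors have unit length, cosine similarity reduces to a plain dot product. A small check with made-up 2-D vectors (the helpers below are hypothetical and use only the standard library):

```python
import math

def l2_normalise(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity computed from the raw, unnormalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up vectors.
u, v = [3.0, 4.0], [4.0, 3.0]

raw_cosine = cosine(u, v)
unit_dot = dot(l2_normalise(u), l2_normalise(v))
print(raw_cosine, unit_dot)
```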

In [265]:
vocab = list(w2v_model.wv.vocab)
print("Vocab length %d" % len(vocab))
print('Examples: %s' % vocab[:50])

Vocab length 2001
Examples: ['begin', 'very_tired', 'sit', 'sister', 'bank', 'have', 'once', 'twice', 'peep', 'book', 'read', 'picture', 'conversation', 'use', 'think', 'alice', 'consider', 'own', 'mind', 'as_well', 'could', 'hot', 'day', 'make', 'feel_very', 'sleepy', 'stupid', 'pleasure', 'daisy', 'chain', 'would', 'worth', 'trouble', 'get', 'pick', 'when_suddenly', 'white_rabbit', 'pink', 'eye', 'run', 'close', 'so_very', 'remarkable', 'much', 'way', 'hear', 'rabbit', 'say', 'dear', 'shall']


In [267]:
w2v_model.wv.similar_by_word("queen", topn=10)

[('shout', 0.9367688298225403),
 ('heart', 0.9245328903198242),
 ('voice', 0.9165550470352173),
 ('knave', 0.9147292971611023),
 ('off', 0.9073618650436401),
 ('player', 0.8932963609695435),
 ('turn', 0.8849328756332397),
 ('king', 0.8848758935928345),
 ('hear', 0.8825916647911072),
 ('quarrel', 0.8788511753082275)]

In [268]:
w2v_model.wv.most_similar(negative=["queen"], topn=5)

[('ootiful', 0.007597293704748154),
 ('grow', -0.0065366365015506744),
 ('able', -0.007076224312186241),
 ('eat', -0.1465734839439392),
 ('large', -0.17693555355072021)]

In [269]:
w2v_model.wv.similarity('alice', 'venture')

0.9157008

In [270]:
w2v_model.wv.similarity('alice', 'hurry')

0.5510908

In [271]:
w2v_model.wv.doesnt_match("alice queen hatter caterpillar rat".split())

'queen'
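
Conceptually, `doesnt_match` picks the word whose vector is least similar to the mean of all the given vectors. A hedged sketch with made-up 2-D vectors (the `toy_vectors` values and the `odd_one_out` helper are hypothetical, not the trained model's embeddings):

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the key whose unit vector has the lowest dot product with the mean direction."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up embeddings: three vectors point roughly the same way, one does not.
toy_vectors = {
    "alice": [1.0, 0.1],
    "hatter": [0.9, 0.2],
    "caterpillar": [1.0, 0.0],
    "queen": [0.1, 1.0],
}
result = odd_one_out(toy_vectors)
print(result)
```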

In [276]:
def get_word_table(table, key):
    return pd.DataFrame(table, columns=[key, 'similarity'])

keys = [ 'alice','queen', 'caterpillar', 'hatter', 'turtle', 'cheshire', 'gryphon' ]
tables = []
for key in keys:
    tables.append(get_word_table(w2v_model.wv.similar_by_word(key), key))
pd.concat(tables, axis=1)

| | alice | similarity | queen | similarity | caterpillar | similarity | hatter | similarity | turtle | similarity | cheshire | similarity | gryphon | similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | name | 0.962324 | shout | 0.936769 | sobs | 0.974068 | well | 0.965805 | stuff | 0.993188 | well_perhaps | 0.998294 | tis | 0.990063 |
| 1 | better | 0.957314 | heart | 0.924533 | hookah | 0.972747 | tea | 0.951851 | line | 0.992961 | soothe | 0.997949 | feeble | 0.989691 |
| 2 | shall | 0.954481 | voice | 0.916555 | unfold | 0.97013 | bottom | 0.950032 | teach | 0.992585 | fig | 0.997766 | alone | 0.989557 |
| 3 | let | 0.954189 | knave | 0.914729 | rub | 0.969993 | butter | 0.933944 | fellow | 0.992104 | visit | 0.997643 | great_hurry | 0.988886 |
| 4 | understand | 0.948094 | off | 0.907362 | yawn | 0.969683 | of_course | 0.922191 | master | 0.991592 | first_why | 0.997618 | meal | 0.988879 |
| 5 | else | 0.947338 | player | 0.893296 | while | 0.967609 | riddle | 0.918792 | likely | 0.991529 | caucus_race | 0.99759 | indignant | 0.988598 |
| 6 | little_timidly | 0.947244 | turn | 0.884933 | deep | 0.966299 | sleep | 0.918538 | grief | 0.991493 | dreamy | 0.997506 | beginning | 0.988494 |
| 7 | trial | 0.946075 | king | 0.884876 | slowly | 0.966177 | majesty | 0.917205 | hollow | 0.991462 | depend | 0.997436 | thunder | 0.988297 |
| 8 | so_many | 0.945742 | hear | 0.882592 | deeply | 0.964562 | live | 0.917043 | should_like | 0.991372 | memory | 0.997344 | sobs | 0.988236 |
| 9 | do | 0.944342 | quarrel | 0.878851 | silence | 0.964389 | treacle | 0.914755 | eel | 0.991022 | offended | 0.997338 | general | 0.987999 |
