In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords

In [2]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r"\[.*?\]", "", text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text[0:900000]


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)

In [4]:
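# Parse the data. This can take some time.
# The 'en' shortcut works in older spaCy releases; spaCy 3+ needs a full
# model name such as 'en_core_web_sm'.
nlp = spacy.load('en')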
austen_doc = nlp(austen_clean)

In [6]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


print(sentences[20])
# len(austen_clean) counts characters in the cleaned text, not tokens.
print('We have {} sentences and {} characters.'.format(len(sentences), len(austen_clean)))

[u'lady', u'russell', u'steady', u'age', u'character', u'extremely', u'provide', u'thought', u'second', u'marriage', u'need', u'apology', u'public', u'apt', u'unreasonably', u'discontent', u'woman', u'marry', u'sir', u'walter', u'continue', u'singleness', u'require', u'explanation']
We have 9298 sentences and 900000 characters.


Keep in mind that word2vec operates on the assumption that frequent proximity indicates similarity, but words can be "similar" in various ways.
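To make "proximity" concrete, the sketch below lists the context words a window-based model would pair with each target word in one short sentence. The helper name `context_pairs` and the fixed symmetric window are assumptions for illustration only; gensim's training loop also subsamples frequent words and varies the effective window size, so this is not the library's implementation.

```python
def context_pairs(tokens, window=2):
    """Return (target, context_words) pairs using a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs


# A few lemmas from the sample sentence printed above.
example = ['lady', 'russell', 'steady', 'age', 'character']
for target, context in context_pairs(example, window=2):
    print(target, context)
```

Words that repeatedly share the same context words end up with similar vectors, whether they are synonyms, antonyms, or simply members of the same category.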

In [8]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3,   # Threshold for downsampling very frequent words.
    size=300,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

print('done!')



done!


In [9]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is calculated using the cosine, so again 1 is total
# similarity and 0 is no similarity.
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[(u'clay', 0.9496152400970459), (u'musgrove', 0.9437923431396484), (u'goddard', 0.9164848327636719), (u'harville', 0.9056604504585266), (u'benwick', 0.8973242044448853), (u'colonel', 0.8872060775756836), (u'croft', 0.8822200298309326), (u'weston', 0.8744116425514221), (u'smith', 0.8720744252204895), (u'charles', 0.8646600246429443)]
0.9151025533104298
marriage
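As a rough illustration of the cosine measure mentioned in the cell above, the following sketch computes it directly with NumPy. The function name `cosine_similarity` and the toy vectors are assumptions made for this example; they are not taken from the trained model or from gensim's code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 2-d vectors (hypothetical), chosen to show the extreme cases.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score 1 and orthogonal vectors score 0. The measure can also go negative when vectors point in opposite directions.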




## Sense2vec
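Word2vec assigns a single vector to each word form, so it cannot tell the 'break' in "Give me a break" (a noun) from the 'break' in "Break a leg" (a verb).  
Sense2vec is a modification of word2vec that incorporates part-of-speech information: whether a word is a noun, verb, adjective, and so on. In a sense2vec model, each word is tagged with its part of speech before training, so the noun 'break' and the verb 'break' receive separate vectors.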
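One way to picture the part-of-speech idea is to attach a tag to each token before training, so that the two uses of 'break' become distinct vocabulary entries. The sketch below is a minimal illustration under stated assumptions: the helper `add_sense_tags` is hypothetical, the `word|POS` format is just one possible convention, and the tags are hand-labeled rather than produced by a tagger.

```python
def add_sense_tags(tagged_tokens):
    """Join each (word, part_of_speech) pair into a single 'word|POS' token."""
    return ['{}|{}'.format(word.lower(), pos) for word, pos in tagged_tokens]


# Hand-labeled parts of speech (assumed); a real pipeline would use a tagger.
give_me_a_break = [('Give', 'VERB'), ('me', 'PRON'), ('a', 'DET'), ('break', 'NOUN')]
break_a_leg = [('Break', 'VERB'), ('a', 'DET'), ('leg', 'NOUN')]

print(add_sense_tags(give_me_a_break))
print(add_sense_tags(break_a_leg))
```

Feeding tagged tokens like these to a word2vec-style model would give 'break|NOUN' and 'break|VERB' their own vectors.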

## Ngrams
Consider the word 'vain' in these two sentences: "She labored in vain, the rock would not move." "She was so vain, her bathroom mirror was covered in lip prints." In both sentences, 'vain' is an adjective. In sentence 1, it signals a lack of success. In sentence 2, the same word means vanity. Since the two usages can't be distinguished by their part of speech, how can we tell them apart?

Ngrams incorporate context information by creating features made up of a series of consecutive words. The ‘N’ refers to the number of words included in the series. For example, the 2-gram representation of sentence 1 would be:

"She labored", "labored in", "in vain", "vain the", "the rock", "rock would", "would not", "not move"

The 3-gram representation of sentence 2 would be:

"She was so", "was so vain", "so vain her", "vain her bathroom", "her bathroom mirror", "bathroom mirror was", "mirror was covered", "was covered in", "covered in lip", "in lip prints"
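The n-gram representations above can be generated with a few lines of plain Python. This is a minimal sketch: the helper name `ngrams` and the simple punctuation stripping are assumptions for illustration, and a real pipeline would more likely build n-grams from tokenizer output.

```python
def ngrams(text, n):
    """Return the n-grams of consecutive words in text, as strings."""
    # Strip basic punctuation so that 'vain,' and 'vain' are the same word.
    words = [w.strip('.,!?;:"\'') for w in text.split()]
    words = [w for w in words if w]
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence_1 = "She labored in vain, the rock would not move."
sentence_2 = "She was so vain, her bathroom mirror was covered in lip prints."

print(ngrams(sentence_1, 2))
print(ngrams(sentence_2, 3))
```

The 2-gram 'in vain' appears only in sentence 1 and the 3-gram 'was so vain' only in sentence 2, so n-gram features can separate the two uses of 'vain' even though the single word cannot.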