# Text Processing

# Topics
- Parsing, Stemming, Lemmatization
- Named Entity Recognition
- Frequency Analysis
- Stop Words
- Word Embeddings

## Setup

Run this command from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) conda install gensim cython nltk spacy
```

### gensim: for training word2vec

https://radimrehurek.com/gensim/


### Cython: to speed up training word2vec
http://docs.cython.org/en/latest/src/quickstart/install.html


### NLTK: NLP toolkit
Installation: https://www.nltk.org/install.html

Book: http://www.nltk.org/book

### spaCy: another NLP toolkit

Simpler to use than NLTK (but usually fewer knobs)

https://spacy.io/usage/spacy-101

# Parsing, Stemming, Lemmatization

- Tokenization: splitting text into words
- Sentence boundary detection: splitting text into sentences
- Stemming: finding word stems
   - stating => state, reference => refer
- Lemmatization: finding the base form of words
   - was => be

## Tokenization

- Segmenting text into words, punctuation, etc.
- Rule-based
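
To make "rule-based" concrete, here is a minimal sketch of a tokenizer built from a single regular expression. The pattern and the `simple_tokenize` name are made up for illustration; real tokenizers such as those in spaCy and NLTK apply many more rules and exceptions.

```python
import re

# Hypothetical rule, for illustration only:
#   - a run of word characters, optionally joined by internal '.' or "'"
#     (so "5.30am" and "don't" stay in one piece), or
#   - any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:[.']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens using one regex rule."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("The time is now 5.30am, he said."))
# ['The', 'time', 'is', 'now', '5.30am', ',', 'he', 'said', '.']
```

Changing the rule changes the tokens, which is why different tokenizers can disagree on the same text.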

### Tokenization with spaCy

![tokenization in spaCy](assets/text/tokenization.svg)

(image: https://spacy.io/usage/spacy-101#annotations-token)

In [9]:
# Download the English model
# You can find other models here: https://spacy.io/models/en
!python -m spacy download en_core_web_sm

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 14.2MB/s

    Linking successful
    /home/lisaong/miniconda3/envs/mldds03/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/lisaong/miniconda3/envs/mldds03/lib/python3.6/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [21]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The Bukit Panjang Light Rail Transit (BPLRT) will resume operations on Sundays starting from July, although operating hours will be shortened, announced SMRT on Thursday (Jun 21).")

for token in doc:
    # text, part-of-speech, syntactic dependencies
    print(token.text, token.pos_, token.dep_)

The DET det
Bukit PROPN compound
Panjang PROPN compound
Light PROPN compound
Rail PROPN compound
Transit PROPN nsubj
( PUNCT punct
BPLRT PROPN appos
) PUNCT punct
will VERB aux
resume VERB ROOT
operations NOUN dobj
on ADP prep
Sundays NOUN pobj
starting VERB advcl
from ADP prep
July PROPN pobj
, PUNCT punct
although ADP mark
operating VERB compound
hours NOUN nsubjpass
will VERB aux
be VERB auxpass
shortened VERB advcl
, PUNCT punct
announced VERB conj
SMRT PROPN dobj
on ADP prep
Thursday PROPN pobj
( PUNCT punct
Jun PROPN appos
21 NUM nummod
) PUNCT punct
. PUNCT punct


In [12]:
spacy.explain('DET')

'determiner'

In [22]:
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The Bukit Panjang Light Rail Transit (BPLRT) will resume operations on Sundays starting from July, although operating hours will be shortened, announced SMRT on Thursday (Jun 21).")

displacy.render(doc, style='dep', jupyter=True, options={'distance': 140})

### Tokenization with NLTK

http://www.nltk.org/api/nltk.tokenize.html

nltk.tokenize
 - sent_tokenize
 - word_tokenize
 - wordpunct_tokenize


In [32]:
# Download the Punkt sentence tokenizer
# https://www.nltk.org/_modules/nltk/tokenize/punkt.html

# List of available corpora: http://www.nltk.org/book/ch02.html#tab-corpora
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/lisaong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [33]:
from nltk.tokenize import sent_tokenize

text = 'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'

# list of sentences
sent_tokenize(text)

['SMRT advised commuters to plan their journeys ahead while operating hours are shortened.',
 'It will deploy staff to assist commuters during the affected Sundays, it said.']

In [34]:
from nltk.tokenize import word_tokenize

text = 'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'

# flat list of words
word_tokenize(text)

['SMRT',
 'advised',
 'commuters',
 'to',
 'plan',
 'their',
 'journeys',
 'ahead',
 'while',
 'operating',
 'hours',
 'are',
 'shortened',
 '.',
 'It',
 'will',
 'deploy',
 'staff',
 'to',
 'assist',
 'commuters',
 'during',
 'the',
 'affected',
 'Sundays',
 ',',
 'it',
 'said',
 '.']

In [87]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = 'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'

sentences = sent_tokenize(text)

# list of lists
[word_tokenize(sentence) for sentence in sentences]

[['SMRT',
  'advised',
  'commuters',
  'to',
  'plan',
  'their',
  'journeys',
  'ahead',
  'while',
  'operating',
  'hours',
  'are',
  'shortened',
  '.'],
 ['It',
  'will',
  'deploy',
  'staff',
  'to',
  'assist',
  'commuters',
  'during',
  'the',
  'affected',
  'Sundays',
  ',',
  'it',
  'said',
  '.']]

In [39]:
from nltk.tokenize import wordpunct_tokenize

text = "'The time is now 5.30am,' he said."

print(word_tokenize(text))

print(wordpunct_tokenize(text))

["'The", 'time', 'is', 'now', '5.30am', ',', "'", 'he', 'said', '.']
["'", 'The', 'time', 'is', 'now', '5', '.', '30am', ",'", 'he', 'said', '.']


In [86]:
# Part of speech tagging
import nltk
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import sent_tokenize, word_tokenize

text = 'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'

sentences = sent_tokenize(text)
sentences = [word_tokenize(sentence) for sentence in sentences]

# pos_tag takes the list of tokens for one sentence
[nltk.pos_tag(sentence) for sentence in sentences]

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/lisaong/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[[('SMRT', 'NNP'),
  ('advised', 'VBD'),
  ('commuters', 'NNS'),
  ('to', 'TO'),
  ('plan', 'VB'),
  ('their', 'PRP$'),
  ('journeys', 'NNS'),
  ('ahead', 'RB'),
  ('while', 'IN'),
  ('operating', 'NN'),
  ('hours', 'NNS'),
  ('are', 'VBP'),
  ('shortened', 'VBN'),
  ('.', '.')],
 [('It', 'PRP'),
  ('will', 'MD'),
  ('deploy', 'VB'),
  ('staff', 'NN'),
  ('to', 'TO'),
  ('assist', 'VB'),
  ('commuters', 'NNS'),
  ('during', 'IN'),
  ('the', 'DT'),
  ('affected', 'JJ'),
  ('Sundays', 'NNP'),
  (',', ','),
  ('it', 'PRP'),
  ('said', 'VBD'),
  ('.', '.')]]

#### Twitter-aware tokenizer

`nltk.tokenize.TweetTokenizer`

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

In [61]:
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
text = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

tknzr.tokenize(text)

['This',
 'is',
 'a',
 'cooool',
 '#dummysmiley',
 ':',
 ':-)',
 ':-P',
 '<3',
 'and',
 'some',
 'arrows',
 '<',
 '>',
 '->',
 '<--']

In [43]:
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
text = '@remy: This is waaaaayyyy too much for you!!!!!!'

tknzr.tokenize(text)

[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']

## Stemming vs. Lemmatization

- Stemming uses rule-based heuristics
  - ponies => poni
  - Faster, but less precise
- Lemmatization uses vocabulary and morphological analysis
  - ponies => pony
  - For English, often only a modest improvement over stemming, because the context in which a word is used matters more

https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
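
As a rough illustration of what "uses vocabulary" means, the sketch below looks words up in an exception table first, then tries suffix rules and accepts a result only if it is a known word. The tables and the `toy_lemmatize` name are invented and tiny. Real lemmatizers rely on full dictionaries and part-of-speech information.

```python
# All three tables are hypothetical and deliberately tiny.
LEMMA_EXCEPTIONS = {'was': 'be', 'were': 'be', 'is': 'be', 'said': 'say'}
KNOWN_LEMMAS = {'pony', 'journey', 'commuter', 'hour'}
SUFFIX_RULES = [('ies', 'y'), ('s', '')]

def toy_lemmatize(word):
    """Return a base form using exceptions, then vocabulary-checked suffix rules."""
    word = word.lower()
    if word in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Unlike a stemmer, only accept the result if it is a real word
            if candidate in KNOWN_LEMMAS:
                return candidate
    return word

for w in ['ponies', 'was', 'journeys', 'bus']:
    print(w, '=>', toy_lemmatize(w))
```

Here "bus" is left alone because "bu" is not in the vocabulary, whereas a purely rule-based stemmer would have no way to know that.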

## Porter Stemmer

- 5 sequential phases of word reductions
- Applies rules such as "sses => ss", "ies => i"
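
The suffix rules can be sketched in a few lines. The example below covers only the first group of rules (step 1a in Porter's original description), with the first matching suffix winning. The `porter_step_1a` name is ours, and full implementations such as NLTK's `PorterStemmer` apply several more phases and conditions.

```python
# Step 1a suffix rules, tried in order; the first matching suffix is applied.
STEP_1A_RULES = [
    ('sses', 'ss'),  # caresses => caress
    ('ies', 'i'),    # ponies   => poni
    ('ss', 'ss'),    # caress   => caress (protects 'ss' from the next rule)
    ('s', ''),       # cats     => cat
]

def porter_step_1a(word):
    """Apply only the first phase of suffix stripping to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ['caresses', 'ponies', 'caress', 'cats']:
    print(w, '=>', porter_step_1a(w))
```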

![stemmers](assets/text/stemmers.png)

(image: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

### Lemmatization with spaCy

spaCy provides a lemmatizer but no stemmer.

`spacy.lemmatizer.Lemmatizer`

https://spacy.io/api/lemmatizer

In [65]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

nlp = spacy.load('en_core_web_sm')
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

text = u'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'
doc = nlp(text)

for token in doc:
    print(lemmatizer(token.text, token.pos_))

['smrt']
['advise']
['commuter']
['to']
['plan']
['their']
['journey']
['ahead']
['while']
['operate']
['hour']
['be']
['shorten']
['.']
['it']
['will']
['deploy']
['staff']
['to']
['assist']
['commuter']
['during']
['the']
['affect']
['sunday']
[',']
['it']
['say']
['.']


### Stemming & Lemmatization with NLTK

`nltk.stem`
- `PorterStemmer`
- `WordNetLemmatizer`

http://www.nltk.org/api/nltk.stem.html

In [70]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

text = u'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'
tokens = word_tokenize(text)

for token in tokens:
    print(stemmer.stem(token))

smrt
advis
commut
to
plan
their
journey
ahead
while
oper
hour
are
shorten
.
It
will
deploy
staff
to
assist
commut
dure
the
affect
sunday
,
it
said
.


In [72]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

text = u'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'
tokens = word_tokenize(text)

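# lemmatize() treats every word as a noun (pos='n') unless told otherwise,
# so verbs such as 'advised' are left unchanged
for token in tokens:
    print(lemmatizer.lemmatize(token))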

[nltk_data] Downloading package wordnet to /home/lisaong/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
SMRT
advised
commuter
to
plan
their
journey
ahead
while
operating
hour
are
shortened
.
It
will
deploy
staff
to
assist
commuter
during
the
affected
Sundays
,
it
said
.


## Named Entity Recognition

- Find and classify entities within text
  - Persons
  - Organizations
  - Locations
  - Time expressions
  - Quantities
  - Phone numbers
  - etc
  
- Approaches: grammar-based rules or trained statistical classifiers

- Entity types depend on the corpus the model was trained on, see https://spacy.io/api/annotation#named-entities

### Named Entity Recognition with spaCy

https://spacy.io/api/annotation#named-entities

In [80]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The Bukit Panjang Light Rail Transit (BPLRT) will resume operations on Sundays starting from July, although operating hours will be shortened, announced SMRT on Thursday (Jun 21).")
# To experiment, spell out the month ("June 21") and compare the entity labels:
# doc = nlp(u"The Bukit Panjang Light Rail Transit (BPLRT) will resume operations on Sundays starting from July, although operating hours will be shortened, announced SMRT on Thursday (June 21).")

for entity in doc.ents:
    print(entity.text, entity.label_, entity.start_char, entity.end_char)

BPLRT ORG 38 43
Sundays DATE 71 78
July DATE 93 97
hours TIME 118 123
SMRT ORG 153 157
Thursday DATE 161 169
Jun 21 PERSON 171 177


In [79]:
spacy.explain('PERSON')

'People, including fictional'

In [78]:
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The Bukit Panjang Light Rail Transit (BPLRT) will resume operations on Sundays starting from July, although operating hours will be shortened, announced SMRT on Thursday (Jun 21).")

displacy.render(doc, style='ent', jupyter=True)

### Named Entity Recognition with NLTK

```
nltk.ne_chunk()
```

https://www.nltk.org/book/ch07.html

In [91]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.tokenize import sent_tokenize, word_tokenize

text = 'SMRT advised commuters to plan their journeys ahead while operating hours are shortened. It will deploy staff to assist commuters during the affected Sundays, it said.'

sentences = sent_tokenize(text)
sentences = [word_tokenize(sentence) for sentence in sentences]

# ne_chunk expects one sentence as a list of part-of-speech tagged tokens
sentences_pos_tagged = [nltk.pos_tag(sentence) for sentence in sentences]

[nltk.ne_chunk(tagged) for tagged in sentences_pos_tagged]

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/lisaong/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/lisaong/nltk_data...
[nltk_data]   Package words is already up-to-date!


[Tree('S', [Tree('ORGANIZATION', [('SMRT', 'NNP')]), ('advised', 'VBD'), ('commuters', 'NNS'), ('to', 'TO'), ('plan', 'VB'), ('their', 'PRP$'), ('journeys', 'NNS'), ('ahead', 'RB'), ('while', 'IN'), ('operating', 'NN'), ('hours', 'NNS'), ('are', 'VBP'), ('shortened', 'VBN'), ('.', '.')]),
 Tree('S', [('It', 'PRP'), ('will', 'MD'), ('deploy', 'VB'), ('staff', 'NN'), ('to', 'TO'), ('assist', 'VB'), ('commuters', 'NNS'), ('during', 'IN'), ('the', 'DT'), ('affected', 'JJ'), Tree('ORGANIZATION', [('Sundays', 'NNP')]), (',', ','), ('it', 'PRP'), ('said', 'VBD'), ('.', '.')])]

# Workshop: Creating Word2Vec Models

Credits:
- https://codesachin.wordpress.com/2015/10/09/generating-a-word2vec-model-from-a-block-of-text-using-gensim-python/
- https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

Word2Vec
- Learns vector representations of words that capture semantic relationships
- Trained with a shallow neural network
- Words are compared using cosine similarity
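
Cosine similarity is the cosine of the angle between two vectors: the dot product divided by the product of the vector lengths. The sketch below computes it in plain Python. The three vectors are made-up 3-dimensional examples, not values from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance is simply 1 minus similarity."""
    return 1.0 - cosine_similarity(u, v)

# Made-up vectors for illustration only
changi = [0.9, 0.1, 0.4]
airport = [0.8, 0.2, 0.5]
durian = [-0.3, 0.9, 0.0]

print('changi vs airport: %.3f' % cosine_similarity(changi, airport))
print('changi vs durian:  %.3f' % cosine_similarity(changi, durian))
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.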

## Download text

For demonstration purposes, we'll start with Wikipedia articles.

We'll use a Python library that wraps the Wikipedia APIs.

https://pypi.org/project/wikipedia/

Run this from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) pip install wikipedia
```

In [None]:
import wikipedia
from wikipedia import search, page

# Get our documents: wikipedia articles
topic = 'singapore'

titles = search(topic)
titles

In [None]:
# retrieve all pages
wikipages = [page(title) for title in titles]

# inspect the first page
wikipages[0].summary

## Process text

- Split into sentences
- Remove special characters
- Convert to lowercase
- Tokenize the text into words
- Optionally remove stop words such as 'a', 'the'
- Stem each word

In [None]:
import re # python regular expressions library
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download NLTK corpora
# List of available corpora: http://www.nltk.org/book/ch02.html#tab-corpora

# 1. Download the Punkt sentence tokenizer
# https://www.nltk.org/_modules/nltk/tokenize/punkt.html
nltk.download('punkt')

# 2. Download the Stop Words corpus
nltk.download('stopwords')

# 3. Helper function to convert text
def text_to_sentence_wordlists(text, remove_stopwords=True):
    """Cleans and converts text to a list of lists of tokens
    Args:
        text: input text
        remove_stopwords: whether to remove stopwords
    Returns:
        A tuple (sentences, wordcount), where sentences is a list of
        lists of tokens that looks like:
           [["cat", "say", "meow"], ["dog", "say", "woof"]]
        and wordcount is the total number of tokens
    """
    # Split into sentences
    # Reference: http://www.nltk.org/api/nltk.tokenize.html
    sentences = nltk.sent_tokenize(text)

    # set of stopwords
    stops = set(stopwords.words('english'))

    stemmer = PorterStemmer()
    
    wordcount = 0
    result = []
    for sentence in sentences:
        # Replace every character that is not a letter or digit with a space
        sentence = re.sub('[^a-zA-Z0-9]', ' ', sentence)

        # Convert to lowercase
        sentence = sentence.lower()
        
        # Tokenize the sentence into words
        tokens = nltk.word_tokenize(sentence)
    
        if remove_stopwords:
            # Remove stop words
            tokens = [token for token in tokens if token not in stops]
    
        # Stem the words
        tokens = [stemmer.stem(t) for t in tokens]
        
        result += [tokens]
        wordcount += len(tokens)
    
    return (result, wordcount)

In [None]:
# Test our helper function to see what it does
text = wikipages[0].summary
print('===== Original text for first article =====')
print(text)

wordlist, count = text_to_sentence_wordlists(text,
                                             remove_stopwords=False)
print('\n===== Stem words [%d words] =====' % count)
print(wordlist)

wordlist, count = text_to_sentence_wordlists(text)
print('\n===== Stem words - stopwords [%d words] =====' % count)
print(wordlist)

### Convert all articles to sentence word lists

Let's now convert all articles on our topic to sentence word lists.

So far we have only examined the summary of each article. Let's see how to get the full content.

Looking at the wikipedia library's documentation, we can use `WikipediaPage.content` to get to the plain text content for each page: https://wikipedia.readthedocs.io/en/latest/code.html

In [None]:
wikipages[0].content

In [None]:
print('Converting %d articles to training set...' % len(titles))

training_set = []
training_set_size = 0

for wikipage in wikipages:
    wordlist, count = text_to_sentence_wordlists(wikipage.content)

    training_set_size += count
    training_set += wordlist
    
print('Training set size: %d stem words, %d sentences' \
      % (training_set_size, len(training_set)))

### Question to ponder:

Should we randomize the training set?

Why or why not?

## Train a word2vec model

(Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors)

With the list of nicely parsed sentences, we're ready to train the model. There are a number of parameter choices that affect the run time and the quality of the final model that is produced.

For details on the algorithms below, see the [word2vec API documentation](https://radimrehurek.com/gensim/models/word2vec.html) as well as the [Google documentation](https://code.google.com/archive/p/word2vec/) (Performance section).

### Domain characteristics

Our training set is:
- Small (under 25k words). Word2vec training sets typically contain hundreds of thousands of words or more.
- Wikipedia articles about a common topic. We expect some words (e.g. singapore) to appear far more often than they would in general text. Whether this is something we need to worry about is unclear.

### Hyperparameters

#### Architecture:
Architecture options are skip-gram (`sg=1`: slower, better for infrequent words) or continuous bag of words (`sg=0`, the gensim default: faster).

#### Training algorithm:
This controls which algorithm is used to train the output layer.

Options are hierarchical softmax (`hs=1`: better for infrequent words) or negative sampling (`negative > 0`, the gensim default: better for frequent words and for low-dimensional vectors). Start with the default.

#### Downsampling of frequent words:
This controls the threshold for frequent words to be removed randomly. 

Randomly removing frequent words in large datasets can improve both accuracy and speed.

$$p = \frac{f-t}{f} - \sqrt{\frac{t}{f}}$$

Where:
- $p$: probability that an occurrence of the word is removed
- $f$: relative frequency of the word in the corpus
- $t$: the threshold (our downsampling hyperparameter)

A smaller $t$ means more words will be randomly removed.

(Source: https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)

The [Google documentation](https://code.google.com/archive/p/word2vec/) recommends values between 1e-3 and 1e-5. Let's try 1e-3 and then iterate from there, since our training set is small.
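
To get a feel for the formula, the sketch below evaluates it for a few word frequencies, reading $p$ as the chance that one occurrence of a word is dropped (this reading is what makes a smaller $t$ remove more words). Clamping negative values to zero is our interpretation: a negative result is taken to mean the word is never dropped. The `removal_probability` name is hypothetical.

```python
import math

def removal_probability(f, t):
    """Evaluate p = (f - t) / f - sqrt(t / f) for relative frequency f and threshold t."""
    p = (f - t) / f - math.sqrt(t / f)
    # Assumption: a negative value means the word is rare enough to always be kept
    return max(0.0, p)

t = 1e-3
for f in (0.05, 0.01, 0.001, 0.0001):
    print('f = %-7g p = %.3f' % (f, removal_probability(f, t)))
```

With $t = 10^{-3}$, a word that makes up 1% of the corpus has roughly a 58% chance of being dropped at each occurrence, while words at or below the threshold are never dropped.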

#### Word vector dimensionality:
This controls how many features each word vector has. Higher dimensionality (more features) usually results in better models, but also longer run times. Reasonable values range from the tens to the hundreds. We'll try 200.

#### Context / window size:
This defines how many words on each side of the current word are treated as its context. Typical values are around 10 for skip-gram and around 5 for CBOW. More is better, up to a point.
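
The effect of the window size is easiest to see by listing the (word, context word) pairs that a window produces. This is a simplified sketch with a hypothetical `skipgram_pairs` helper. It uses a fixed window, whereas actual word2vec implementations also vary the effective window during training.

```python
def skipgram_pairs(tokens, window):
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ['smrt', 'advis', 'commut', 'plan', 'journey']

for window in (1, 2):
    pairs = skipgram_pairs(sentence, window)
    print('window=%d: %d pairs' % (window, len(pairs)))
    print(pairs)
```

A larger window yields more training pairs per sentence and relates words that are further apart, at the cost of longer training.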

#### Worker threads:
Number of worker threads used for training. More threads can significantly improve training speed.

The number to choose depends on how many logical CPU cores your computer has (on Windows: Start Menu -> System Information, then look for Processors).

Start with 2-4, and increase the number if your computer has more cores.
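
On any operating system, Python can report the number of logical cores directly. The sketch below uses it to suggest a starting value, capped at 4 to follow the guideline above. The `suggested_workers` name is just for illustration.

```python
import os

# os.cpu_count() returns the number of logical cores, or None if unknown
logical_cores = os.cpu_count() or 1

# Start conservatively: at least 1 thread, at most 4
suggested_workers = max(1, min(4, logical_cores))

print('Logical cores: %d, suggested workers: %d'
      % (logical_cores, suggested_workers))
```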

#### Minimum word count:
This limits the vocabulary to meaningful words. Any word that occurs fewer than this many times across all documents is ignored.

Reasonable values are between 10 and 100. Higher values also reduce run time.

For the Wikipedia articles, we'll try a minimum word count of 10.
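
The filtering itself amounts to counting tokens and keeping those that reach the threshold. Here is a minimal sketch on a made-up three-sentence training set, using a threshold of 2 so that the effect is visible on such a small sample.

```python
from collections import Counter

# Made-up stemmed sentences, in the same list-of-lists shape as the training set
sentences = [
    ['singapor', 'citi', 'state'],
    ['singapor', 'port', 'citi'],
    ['singapor', 'island'],
]
min_count = 2

# Count every token across all sentences
counts = Counter(token for sentence in sentences for token in sentence)

# Keep only the words that occur at least min_count times
vocab = sorted(word for word, n in counts.items() if n >= min_count)

print(counts)
print(vocab)  # ['citi', 'singapor']
```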

In [None]:
from gensim.models import word2vec

word2vec.Word2Vec?

In [None]:
# Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

# Set values for various parameters
sg = 1                # Architecture: 1 = skip-gram, 0 = CBOW
num_features = 200    # Word vector dimensionality                      
min_word_count = 10   # Minimum word count                        
num_workers = 2       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Initialize and train the model.
# This may take a while if your training set is large (e.g. 500,000 words)
print('Training Word2Vec model...')
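# Pass sg explicitly, otherwise gensim falls back to its default (CBOW).
# Note: gensim 4 renames the `size` parameter to `vector_size`.
%time model = word2vec.Word2Vec(training_set, sg=sg, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)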

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "wikipedia_{}features_{}minwords_{}context_{}downsampling.w2v" \
    .format(num_features, min_word_count, context, str(downsampling))
model.save(model_name)

print('Saved model as %s' % model_name)

## Loading the saved model

Here's how to load a previously saved model.

In [None]:
# Use the file name that was printed when the model was saved
model_name = "wikipedia_200features_10minwords_10context_0.001downsampling.w2v"

model = word2vec.Word2Vec.load(model_name)

## Evaluating the model

The trained model exposes a read-only `models.keyedvectors.Word2VecKeyedVectors` object as `model.wv`, with methods for evaluating word relationships.

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors

Here are some things to try with the word2vec model:

Get the vocabulary of the model:

In [None]:
# number of words in the vocab
len(model.wv.vocab)

In [None]:
model.wv.vocab

Check if a stem word is in the model's vocabulary:

In [None]:
stemmer = PorterStemmer()
stemmer.stem('malaysia') in model.wv.vocab

In [None]:
stemmer.stem('korea') in model.wv.vocab

Find a word that doesn't match in a list of words:

In [None]:
test = 'raffles indian chinese malay'

# you can either use the helper function to convert to stem words
# or call stemmer.stem() directly on each word
wordlist, _ = text_to_sentence_wordlists(test)
print('Input: %s' % wordlist[0])

print("Word that doesn't match: %s"
      % model.wv.doesnt_match(wordlist[0]))
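
A rough sketch of the idea behind this kind of query, assuming the odd word out is the one whose unit vector is least similar to the mean of all the unit vectors. gensim's own implementation may differ in detail. The `odd_one_out` helper and the 2-dimensional vectors are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """vectors: dict mapping word -> list of floats.
    Returns the word least similar to the mean of all unit vectors."""
    units = {word: unit(v) for word, v in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units)
                 for d in range(dims)])
    # For unit vectors, the dot product with the mean is the cosine similarity
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors, not taken from the trained model
toy_vectors = {
    'indian': [0.9, 0.1],
    'chinese': [0.8, 0.2],
    'malay': [0.85, 0.15],
    'raffles': [0.1, 0.9],
}
print(odd_one_out(toy_vectors))  # raffles
```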

Get the top N most similar words:

In [None]:
word = stemmer.stem('singapore')
model.wv.most_similar(word, topn=10)

In [None]:
word = stemmer.stem('changi')
model.wv.most_similar(word, topn=10)

Measure the cosine distance and similarity between two words:

In [None]:
word1 = stemmer.stem('changi')
word2 = stemmer.stem('aircraft')

print('distance: %f' %
      model.wv.distance(word1, word2))

print('similarity: %f' %
      model.wv.similarity(word1, word2))

In [None]:
word1 = stemmer.stem('changi')
word2 = stemmer.stem('british')

print('distance: %f' %
      model.wv.distance(word1, word2))

print('similarity: %f' %
      model.wv.similarity(word1, word2))

Get the word's representation in vector space as a 1D NumPy array:

In [None]:
word = stemmer.stem('malaysia')

raw_vectors = model.wv.word_vec(word, use_norm=True)

raw_vectors.shape

In [None]:
raw_vectors

# Visualizing Word2Vec

Next, we'll plot the word vectors to see what the clusters look like:

1. Use t-Distributed Stochastic Neighbor Embedding ([t-SNE](https://lvdmaaten.github.io/tsne/)) to reduce the high-dimensional word vectors to 2D
2. Plot the 2D representation of the word2vec model, with the words in its vocabulary as the labels

Credits: https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

vocab = list(model.wv.vocab)
# Look up the vectors through model.wv (indexing the model directly is deprecated)
X = model.wv[vocab]

# Apply t-SNE
# this can take a while (like 1 minute or more)
tsne = TSNE(n_components=2)
%time X_tsne = tsne.fit_transform(X)

X_tsne

In [None]:
import pandas as pd

# Create a dataframe for the 2 dimensions,
# indexed by the words in the vocab
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
df.head()

In [None]:
# create a zoomable interactive plot
%matplotlib notebook

# Plot the 2D representation of the word2vec model,
# with the words in its vocabulary as the labels

fig, ax = plt.subplots(figsize=(10, 10))

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos)

## Exercise - Create Corpus and Train Word2Vec

In this exercise, we will create our own corpus and use it to train Word2Vec.

### Create Corpus

Create a corpus of text files, organized in a structure like this:

```
corpus/
   text001.txt
   text002.txt
   text003.txt
   ...
```

A sample corpus is included in the `corpus` folder, created with the first 3 chapters of Moby Dick:
https://www.gutenberg.org/files/2701/2701-0.txt
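
If your texts are already in memory, a folder with this layout can be produced with a few lines of standard-library code. This is only a sketch: `write_corpus` is a hypothetical helper, the two sample texts are placeholders, and the files are written to a temporary directory so that nothing in the working folder is overwritten.

```python
import tempfile
from pathlib import Path

def write_corpus(texts, corpus_dir):
    """Write each text to corpus_dir as text001.txt, text002.txt, ..."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts, start=1):
        path = corpus_dir / ('text%03d.txt' % i)
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths

# Placeholder texts; replace with your own documents
texts = ['Call me Ishmael.', 'It was a humorously perilous business.']

demo_dir = Path(tempfile.mkdtemp()) / 'corpus'
paths = write_corpus(texts, demo_dir)
print([p.name for p in paths])  # ['text001.txt', 'text002.txt']
```

Files written this way are UTF-8 encoded, so read them back with `encoding='utf-8'`.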

### Import corpus using NLTK

We will use [`nltk.corpus.reader.plaintext`](http://www.nltk.org/howto/corpus.html) to import the corpus.

Credits: https://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk

In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# directory containing the corpus
corpus_dir = 'corpus/'

# By default, PlaintextCorpusReader splits texts into sentences with the
# Punkt sentence tokenizer and into words with WordPunctTokenizer
newcorpus = PlaintextCorpusReader(corpus_dir,
                                  r'.*\.txt',
                                  encoding='latin1') # or 'utf-8'

In [None]:
# files found by the reader
newcorpus.fileids()

In [None]:
# print the first file in the corpus
f = newcorpus.open('text001.txt')
print(f.read().strip())

In [None]:
# sentences in the corpus:
newcorpus.sents()

In [None]:
# number of sentences
len(newcorpus.sents())

In [None]:
def clean_sentence_lists(sentence_lists, remove_stopwords=True):
    """Cleans and converts the sentence lists
    Args:
        sentence_lists: list of sentences, each a list of tokens
        remove_stopwords: whether to remove stopwords
    Returns:
        A tuple:
            The cleaned sentence list
            The token count
    """
    # set of English stop words
    stops = set(stopwords.words('english'))

    stemmer = PorterStemmer()
    
    result = []
    wordcount = 0

    for sentence in sentence_lists:
        # Convert to lowercase
        tokens = [t.lower() for t in sentence]
        
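        # Remove stop words
        if remove_stopwords:
            tokens = [t for t in tokens if t not in stops]
        
        # Strip non-alphanumeric characters, then drop tokens left empty
        # (e.g. pure punctuation) so they are not counted as words
        tokens = [re.sub('[^a-zA-Z0-9]', '', t) for t in tokens]
        tokens = [t for t in tokens if t]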
        
        # Stem the words
        tokens = [stemmer.stem(t) for t in tokens]
        
        result += [tokens]
        wordcount += len(tokens)
    
    return (result, wordcount)

Your Tasks:

1. Convert newcorpus.sents() to sentence wordlists, using the `clean_sentence_lists` helper function
2. Train a Word2Vec model, with initial hyperparameter settings (use your best guess)
3. Try some word similarity queries
4. Tweak your model by adjusting some hyperparameter settings
5. Plot the completed Word2Vec model

In [None]:
# 1. Convert newcorpus.sents() to sentence wordlists, 
# using the clean_sentence_lists helper function
#
# Your code here

print('Converting %d sentences to training set...' % len(newcorpus.sents()))

training_set, training_set_size = clean_sentence_lists(newcorpus.sents())

print('Training set size: %d stem words, %d sentences' \
      % (training_set_size, len(training_set)))

training_set

In [None]:
# 2. Train a Word2Vec model, with initial hyperparameter settings
#
# Your code here

# Set values for various parameters
sg = 1                # Architecture: 1 = skip-gram, 0 = CBOW
num_features = 200    # Word vector dimensionality                      
min_word_count = 10   # Minimum word count                        
num_workers = 2       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Initialize and train the model.
# This may take a while if your training set is large (e.g. 500,000 words)
print('Training Word2Vec model...')
# Pass sg explicitly, otherwise gensim falls back to its default (CBOW).
%time model = word2vec.Word2Vec(training_set, sg=sg, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "corpus_{}features_{}minwords_{}context_{}downsampling.w2v" \
    .format(num_features, min_word_count, context, str(downsampling))
model.save(model_name)

print('Saved model as %s' % model_name)

In [None]:
# 3. Try some word similarity queries
# Your code here

print('Vocab length:', len(model.wv.vocab))
print('Vocab:', model.wv.vocab)

In [None]:
word1 = stemmer.stem('whale')
word2 = stemmer.stem('harpoon')

print('distance: %f' %
      model.wv.distance(word1, word2))

print('similarity: %f' %
      model.wv.similarity(word1, word2))

In [None]:
word1 = stemmer.stem('whale')
word2 = stemmer.stem('landlord')

print('distance: %f' %
      model.wv.distance(word1, word2))

print('similarity: %f' %
      model.wv.similarity(word1, word2))

In [None]:
word = stemmer.stem('harpoon')
model.wv.most_similar(word, topn=10)

In [None]:
vocab = list(model.wv.vocab)
X = model.wv[vocab]

# Apply t-SNE
# this can take a while (like 1 minute or more)
tsne = TSNE(n_components=2)
%time X_tsne = tsne.fit_transform(X)

# Create a dataframe for the 2 dimensions,
# indexed by the words in the vocab
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
df.head()

In [None]:
# Plot the completed Word2Vec model

# create a zoomable interactive plot
%matplotlib notebook

# Plot the 2D representation of the word2vec model,
# with the words in its vocabulary as the labels

fig, ax = plt.subplots(figsize=(10, 10))

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos)

https://github.com/charlieg/A-Smattering-of-NLP-in-Python

http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html