# Vectorizing Words and Tokens

The vectorization of words and tokens stems from the distributional hypothesis in linguistics, due to Firth: 'a word is known by the company it keeps'.  Rephrased mathematically, this can be thought of as a version of the Yoneda lemma from category theory.  The same notion is common even in dictionaries, where example sentences using a word help to define it.  Taken to the extreme, one can say that the definition of a word *is* the document made up of the sentences containing it.  The vectorization of documents is covered in the VectorizingDocumentParameters notebook, and very similar notions are applied here.

Computationally and practically, a word is often viewed as a collection of windows rather than full sentences (to avoid long run-on sentences), i.e. the collection of nearby words that come before and after it.  From here, one can simply one-hot encode these windows for each word, creating a word-word cooccurrence matrix in which column j counts how often word j occurred in the windows around the word in row i.  More generally, one may wish to weight the words based on how far apart they are (generally assuming nearer words in the window are more informative), vary the sizes of windows based on their content, and, just as in the case of documents, filter the words under consideration.
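The windowed counting described above can be sketched in plain Python. This is only a toy illustration with unweighted, fixed-size windows; the function name `cooccurrence_counts` and the toy corpus are made up for the example and are not part of the vectorizers library.

```python
from collections import defaultdict


def cooccurrence_counts(token_sequences, window_radius=2):
    """Count, for each word, the words within `window_radius` positions of it."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in token_sequences:
        for i, word in enumerate(sequence):
            # The window covers up to `window_radius` tokens on either side
            start = max(0, i - window_radius)
            stop = min(len(sequence), i + window_radius + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][sequence[j]] += 1
    return counts


toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
toy_counts = cooccurrence_counts(toy_corpus, window_radius=2)
print(dict(toy_counts["sat"]))
```

Each inner dictionary plays the role of one sparse row of the word-word cooccurrence matrix.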

These options are generally all captured in the TokenCooccurrenceVectorizer, which produces the sparse word-word cooccurrence matrix. Notice that counting pairs of words in a window is also precisely what the SkipgramVectorizer is doing at its heart.  Consequently, these two classes have many of the same options.

#### First let's get some data! We'll use 20newsgroups and remove documents of 100 characters or fewer.

In [1]:
import sklearn.datasets
import numpy as np
import vectorizers
import textmap
import textmap.tokenizers 
import textmap.transformers
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /Users/colin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
news = sklearn.datasets.fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

In [3]:
long_enough = [len(t) > 100 for t in news['data']]
data = np.array(news['data'])
data = data[long_enough]
targets = np.array(news.target)
targets = targets[long_enough]
target_names = np.array(news.target_names)

#### Next we need to tokenize the data

In [4]:
%%time
tokens = textmap.tokenizers.NLTKTweetTokenizer().fit_transform(data)

CPU times: user 10.3 s, sys: 203 ms, total: 10.5 s
Wall time: 10.6 s


In [5]:
%%time
vectorizer = vectorizers.TokenCooccurrenceVectorizer()
count_matrix = vectorizer.fit_transform(tokens)
count_matrix

The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.
File "../../../PycharmProjects/vectorizers/vectorizers/_vectorizers.py", line 422:
@numba.njit(nogil=True, parallel=True)
def sequence_skip_grams(
^

  self.func_ir.loc))


CPU times: user 8.81 s, sys: 536 ms, total: 9.34 s
Wall time: 8.34 s


<120664x120664 sparse matrix of type '<class 'numpy.float32'>'
	with 6709483 stored elements in Compressed Sparse Row format>

The default settings have produced a count matrix whose (i, j) entry is the number of times word_j appears within the window radius (default window_radius = 5) of word_i, either before or after it (default window_orientation = 'symmetric').

#### Without filtering the tokens to count, we have ended up with a massive, extremely sparse matrix.
By Zipf's law, the vast majority of these tokens occur only a small number of times.  We can filter the feature space in several ways, based on how often the tokens occur:
* min_frequency
* max_frequency
* min_occurrences
* max_occurrences

or based on their content via

* ignored_tokens (a list of tokens to ignore)
* excluded_token_regex (ignore all tokens that fullmatch the regular expression)
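To make the effect of these filters concrete, here is a rough pure-Python sketch of the same kind of vocabulary pruning. The function `filter_vocabulary` is hypothetical, and whether the library treats its thresholds as inclusive or exclusive is an assumption here, so treat this as an illustration rather than the vectorizer's actual logic.

```python
import re
from collections import Counter


def filter_vocabulary(token_sequences, min_occurrences=None, max_frequency=None,
                      ignored_tokens=(), excluded_token_regex=None):
    """Return the sorted tokens that survive every filter (illustrative only)."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    pattern = re.compile(excluded_token_regex) if excluded_token_regex else None
    ignored = set(ignored_tokens)
    kept = []
    for token, count in counts.items():
        if token in ignored:
            continue  # explicitly ignored, e.g. stop words
        if pattern is not None and pattern.fullmatch(token):
            continue  # the whole token matches the excluded pattern
        if min_occurrences is not None and count < min_occurrences:
            continue  # too rare (assumed inclusive lower bound)
        if max_frequency is not None and count / total > max_frequency:
            continue  # too common (assumed inclusive upper bound)
        kept.append(token)
    return sorted(kept)


toy_docs = [["the", "cat", "!", "the", "cat", "sat"], ["the", "dog", "...", "sat"]]
kept_tokens = filter_vocabulary(
    toy_docs,
    min_occurrences=2,
    max_frequency=0.25,
    ignored_tokens=["sat"],
    excluded_token_regex=r"\W+",
)
print(kept_tokens)
```

In this toy corpus 'the' is dropped for being too frequent, 'dog' for being too rare, 'sat' for being ignored, and the punctuation tokens for matching the regular expression.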

In [6]:
%%time 
vectorizer = vectorizers.TokenCooccurrenceVectorizer(
                                            max_frequency = 1e-4, 
                                            min_occurrences = 50, 
                                            ignored_tokens=stopwords.words('english'),
                                            excluded_token_regex=r"\W+")
count_matrix = vectorizer.fit_transform(tokens)
count_matrix

CPU times: user 2.57 s, sys: 102 ms, total: 2.68 s
Wall time: 2.64 s


<3304x3304 sparse matrix of type '<class 'numpy.float32'>'
	with 1846381 stored elements in Compressed Sparse Row format>

#### The vectorizer transform returns a token by token sparse matrix.  The fit has also produced two dictionaries, column_label_dictionary_ and column_index_dictionary_, to access the feature space.  These two dicts relate the tokens to column numbers in both directions (the columns are sorted by token).
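As a small sketch of how two such lookups relate to each other, the following builds a label-to-index mapping over sorted tokens and inverts it. The variable names mirror the fitted attributes but are plain local variables, and the three tokens are made up for the example.

```python
# Hypothetical surviving tokens, in arbitrary order
tokens_kept = ["apply", "apple", "applied"]

# Token -> column number, with columns assigned in sorted token order
column_label_dictionary = {tok: i for i, tok in enumerate(sorted(tokens_kept))}

# Column number -> token, the inverse mapping
column_index_dictionary = {i: tok for tok, i in column_label_dictionary.items()}

print(column_label_dictionary["apple"])
print([column_index_dictionary[i] for i in range(3)])
```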

In [7]:
vectorizer.column_label_dictionary_['apple']

342

In [8]:
[vectorizer.column_index_dictionary_[i] for i in range(342, 350)]

['apple',
 'applications',
 'applied',
 'applies',
 'apply',
 'appreciate',
 'appreciated',
 'approach']

Just as with documents, we can consider a word (token) as a distribution over the words around it and embed with UMAP using the Hellinger distance.
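For intuition about the metric, here is a minimal sketch of the standard Hellinger distance between two rows of counts, each normalized to a probability distribution first. The helper `hellinger` is written for this illustration only; UMAP's own implementation may differ in its details.

```python
import math


def hellinger(counts_p, counts_q):
    """Hellinger distance between two non-negative count vectors of equal length."""
    p_total, q_total = sum(counts_p), sum(counts_q)
    # Bhattacharyya coefficient of the two normalized distributions
    bc = sum(
        math.sqrt((p / p_total) * (q / q_total))
        for p, q in zip(counts_p, counts_q)
    )
    # Clamp at zero to guard against tiny negative values from rounding
    return math.sqrt(max(0.0, 1.0 - bc))


print(hellinger([2, 4], [1, 2]))   # same proportions, so the distance is (near) zero
print(hellinger([1, 0], [0, 1]))   # no shared context words, so the distance is maximal
```

Because each row is normalized, a frequent word and a rare word with the same mix of context words end up close together.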

In [9]:
import umap
import umap.plot
import pandas
from bokeh.io import output_notebook
import matplotlib.pyplot as plt
output_notebook()

In [10]:
hover = pandas.DataFrame()
hover['data'] = vectorizer.column_label_dictionary_.keys()

In [11]:
# We are just going to set a random seed for reproducibility 
mapper = umap.UMAP(metric = "hellinger", random_state=42).fit(count_matrix)

In [12]:
# The mapper fitted in the previous cell is reused here
pic = umap.plot.interactive(mapper, hover_data = hover, point_size=5, values = vectorizer._token_frequencies_)
umap.plot.show(pic)

## Applying NLP techniques:

First off, it is common when embedding words and tokens to consider each sentence separately so that windows do not cross sentence boundaries.  Finding sentence boundaries is more time consuming, however.
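The difference can be seen on a toy example. The sketch below collects the (word, following word) pairs found within a window, once treating the text as a single sequence and once as two sentences; `window_pairs` and the toy tokens are made up for this illustration.

```python
def window_pairs(token_sequences, window_radius=2):
    """Collect (word, later word) pairs that fall within the window."""
    pairs = set()
    for sequence in token_sequences:
        for i, word in enumerate(sequence):
            for context in sequence[i + 1 : i + 1 + window_radius]:
                pairs.add((word, context))
    return pairs


# The same text tokenized as one document and as two sentences
as_document = [["it", "rained", ".", "cats", "slept", "."]]
as_sentences = [["it", "rained", "."], ["cats", "slept", "."]]

crossing = window_pairs(as_document) - window_pairs(as_sentences)
print(sorted(crossing))
```

The pairs in `crossing` are exactly the ones that straddle the sentence boundary, which is what tokenizing by sentence removes.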

In [None]:
%%time
tokens = textmap.tokenizers.NLTKTweetTokenizer(tokenize_by='sentence').fit_transform(data)

Secondly, it is common to place a kernel over the window that weights the counts lower for words that are further apart.  In this case we may wish to use a triangular or harmonic kernel. 
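As a rough sketch of what such kernels look like, the functions below return one weight per distance from the center word. These definitions are illustrative assumptions (a weight of 1/d for harmonic and a linear decay for triangular); the exact scaling used by the library may differ.

```python
def flat_kernel(window_radius):
    """Every position in the window counts equally."""
    return [1.0] * window_radius


def harmonic_kernel(window_radius):
    """Weight 1/d for a context word d positions away (assumed form)."""
    return [1.0 / d for d in range(1, window_radius + 1)]


def triangular_kernel(window_radius):
    """Weight decays linearly from 1 at distance 1 (assumed form)."""
    return [
        (window_radius - d + 1) / window_radius
        for d in range(1, window_radius + 1)
    ]


print(harmonic_kernel(4))
print(triangular_kernel(4))
```

The harmonic weights drop off quickly and then flatten, while the triangular weights fall at a constant rate across the window.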

We may also wish to vary the window size depending on its content.  For example, we may wish to broaden the window if it contains several frequent words (like stop-words) to gather more information.  The 'information' window function computes the expected information for a window of size 'window_radius' via
$$
E(\mathrm{window\_information}) = -\mathrm{window\_radius} \cdot \sum_{\mathrm{tokens}} P(\mathrm{token}) \log_2 P(\mathrm{token})
$$
and grows the windows until they exceed the expected information.
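A minimal sketch of this idea follows, assuming the information of a token is $-\log_2 P(\mathrm{token})$ and that the window simply grows forward until the gathered information reaches the expected amount. The function `information_window`, the toy probabilities, and the exact stopping rule are assumptions made for illustration, not the library's implementation.

```python
import math


def information_window(sequence, position, token_probability, window_radius):
    """Grow a forward window until it holds the expected information (sketch)."""
    entropy = -sum(p * math.log2(p) for p in token_probability.values())
    target = window_radius * entropy  # expected information of a fixed window
    window, gathered = [], 0.0
    for token in sequence[position + 1:]:
        if gathered >= target:
            break
        window.append(token)
        gathered += -math.log2(token_probability[token])
    return window


# Toy unigram probabilities: entropy is 1.75 bits, so radius 2 targets 3.5 bits
probs = {"the": 0.5, "a": 0.25, "zebra": 0.125, "quark": 0.125}

common_context = information_window(
    ["quark", "the", "the", "a", "the", "zebra"], 0, probs, window_radius=2
)
rare_context = information_window(
    ["quark", "zebra", "quark", "the"], 0, probs, window_radius=2
)
print(common_context)
print(rare_context)
```

A window filled with frequent words has to reach further to gather the same information, so it ends up wider than one filled with rare words.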

In [None]:
%%time 
vectorizer = vectorizers.TokenCooccurrenceVectorizer(
                                            window_function='information',
                                            kernel_function='harmonic',
                                            max_frequency = 1e-4, 
                                            min_occurrences = 50, 
                                            ignored_tokens=stopwords.words('english'),
                                            excluded_token_regex=r"\W+")
count_matrix = vectorizer.fit_transform(tokens)
count_matrix

In [None]:
mapper = umap.UMAP(metric = "hellinger", random_state=42).fit(count_matrix)

In [None]:
hover = pandas.DataFrame()
hover['data'] = vectorizer.column_label_dictionary_.keys()
pic = umap.plot.interactive(mapper, hover_data = hover, point_size=5, values = vectorizer._token_frequencies_)
umap.plot.show(pic)

### Changing the window radius. 

In general, smaller windows tend to capture more syntactic similarity between words (similar parts of speech tend to be more similar), whereas wider windows tend to capture more semantic similarity between words (words used when discussing the same topic tend to be similar).  In some sense the kernel functions are one way to balance these two notions, but we can also just vary the window_radius depending on the desired effect. For example, if we cared more about syntactic similarity we could set a smaller window radius.

In [None]:
%%time 
vectorizer = vectorizers.TokenCooccurrenceVectorizer(
                                            window_radius = 2,
                                            window_function='fixed',
                                            kernel_function='triangular',
                                            max_frequency = 1e-4, 
                                            min_occurrences = 50, 
                                            ignored_tokens=stopwords.words('english'),
                                            excluded_token_regex=r"\W+")
count_matrix = vectorizer.fit_transform(tokens)
count_matrix

In [None]:
mapper = umap.UMAP(metric = "hellinger", random_state=42).fit(count_matrix)

In [None]:
hover = pandas.DataFrame()
hover['data'] = vectorizer.column_label_dictionary_.keys()
pic = umap.plot.interactive(mapper, hover_data = hover, point_size=5, values = vectorizer._token_frequencies_)
umap.plot.show(pic)

### Capturing word order:

The default setting (and the typical method) is to combine the counts of context words that occur before a given word with those that occur after it.  However, we can separate these two cases.  In fact, under the hood the TokenCooccurrenceVectorizer first produces a (kernel weighted) count matrix $M$ for only the words that come afterwards. The (kernel weighted) count of only the words that come before is then just the transpose $M^T$, and the count of both is just their sum $M+M^T$.  Which of these gets returned is controlled by the window_orientation parameter.  By changing this parameter, it is quite easy to combine them as $(M|M^T)$ and so treat the counts before and after separately.
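The relationship between $M$, $M^T$, $M+M^T$ and $(M|M^T)$ can be checked by hand on a tiny example. The sketch below uses plain nested lists and unweighted counts; `after_counts` and the other helpers are written only for this illustration.

```python
def after_counts(token_sequences, vocabulary, window_radius=1):
    """Dense matrix M where M[i][j] counts word j appearing after word i."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for sequence in token_sequences:
        for i, word in enumerate(sequence):
            for context in sequence[i + 1 : i + 1 + window_radius]:
                matrix[index[word]][index[context]] += 1
    return matrix


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hstack(a, b):
    return [ra + rb for ra, rb in zip(a, b)]


vocab = ["new", "york", "city"]
M = after_counts([["new", "york", "city"], ["new", "york"]], vocab)

before = transpose(M)          # counts of words appearing before each word
symmetric = add(M, before)     # counts in either direction
directed = hstack(M, before)   # after and before kept as separate features

print(symmetric)
print(directed[vocab.index("york")])
```

In the stacked version the row for 'york' records that 'city' follows it and 'new' precedes it, information the symmetric sum blurs together.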

In [None]:
import scipy.sparse

In [None]:
%%time 
vectorizer = vectorizers.TokenCooccurrenceVectorizer(
                                            window_orientation='after',
                                            max_frequency = 1e-4, 
                                            min_occurrences = 50, 
                                            ignored_tokens=stopwords.words('english'),
                                            excluded_token_regex=r"\W+",
                                        )
count_matrix = vectorizer.fit_transform(tokens)

directed_count_matrix = scipy.sparse.hstack([count_matrix, count_matrix.T])

In [None]:
mapper = umap.UMAP(metric = "hellinger", random_state=42).fit(directed_count_matrix)

In [None]:
hover = pandas.DataFrame()
hover['data'] = vectorizer.column_label_dictionary_.keys()
pic = umap.plot.interactive(mapper, hover_data = hover, point_size=5, values = vectorizer._token_frequencies_)
umap.plot.show(pic)

## Combining all the options:

We can combine all of the above options in various ways by simply stacking them together, adding more features to the description of a word.  We should be somewhat careful to make sure that each matrix is on a similar scale (which will not be true by default if we vary the window sizes and kernels), and so it is best to $L_1$ normalize the rows of each matrix and then take a weighted combination of them as desired.
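To see why the normalization matters, consider two toy matrices whose rows are on very different scales. The sketch below uses plain nested lists; `l1_normalize_rows` and `weighted_hstack` are hypothetical helpers written for this illustration, and the weights 2 and 3 are arbitrary.

```python
def l1_normalize_rows(matrix):
    """Scale each row so its absolute values sum to 1 (all-zero rows stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(abs(v) for v in row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def weighted_hstack(blocks, weights):
    """Place the blocks side by side, scaling each block by its weight."""
    rows = []
    for row_parts in zip(*blocks):
        combined = []
        for part, weight in zip(row_parts, weights):
            combined.extend(weight * v for v in part)
        rows.append(combined)
    return rows


narrow_counts = [[1, 3], [2, 2]]        # small counts from a narrow window
wide_counts = [[10, 30], [100, 300]]    # much larger counts from a wide window

mixed = weighted_hstack(
    [l1_normalize_rows(narrow_counts), l1_normalize_rows(wide_counts)],
    weights=[2, 3],
)
print(mixed)
```

After normalization every row sums to the total of the weights, so the relative influence of each block is set by its weight rather than by the raw size of its counts.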

We should also be consistent in our labelling.  For that, we can pass in the token dictionary from one vectorizer into another so that the row and column labels are consistent between the two.  

In [None]:
from sklearn.preprocessing import normalize

In [None]:
%%time 
syntax_vectorizer = vectorizers.TokenCooccurrenceVectorizer(
                                            window_radius = 2,
                                            window_function = 'fixed',
                                            kernel_function = 'triangular',
                                            window_orientation='after',
                                            max_frequency = 1e-4, 
                                            min_occurrences = 50, 
                                            ignored_tokens=stopwords.words('english'),
                                            excluded_token_regex=r"\W+",
                                        )
syntax_matrix = syntax_vectorizer.fit_transform(tokens)

# Pass in the token dictionary from syntax_vectorizer above to ensure consistent labelling
semantic_vectorizer = vectorizers.TokenCooccurrenceVectorizer(
                                            window_radius = 8,
                                            window_function = 'information',
                                            kernel_function = 'harmonic',
                                            window_orientation='symmetric',
                                            token_dictionary = syntax_vectorizer.column_label_dictionary_,
                                        )
semantic_matrix = semantic_vectorizer.fit_transform(tokens)


mixed_count_matrix = scipy.sparse.hstack([
                                            2*normalize(syntax_matrix, 'l1'),
                                            2*normalize(syntax_matrix.T, 'l1'),
                                            3*normalize(semantic_matrix, 'l1'),       
])

In [None]:
mapper = umap.UMAP(metric = "hellinger", random_state=42).fit(mixed_count_matrix)

In [None]:
hover = pandas.DataFrame()
hover['data'] = vectorizer.column_label_dictionary_.keys()
pic = umap.plot.interactive(mapper, hover_data = hover, point_size=5, values = vectorizer._token_frequencies_)
umap.plot.show(pic)