# TextMAP Tokenization

Here we walk through the various transforms available within TextMAP that are used in word and document embeddings. In general, tokenization takes a string of text (a document) and returns a sequence of tokens (words or word-like objects). While one would like to assume this is a simple process of breaking a string on spaces, language is much more complex than that: some tokens contain non-alphanumeric characters, periods, spaces, etc., and the language may not use the Roman alphabet at all. In English, tokens such as "can't", "x-ray", "Ms.", "$5.00", "www\.words\.com", "john\@doe\.edu", "3D", "Las Vegas", ":-D", "20\%", etc., have several 'natural' tokenizations depending on the use case.
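
To make the ambiguity concrete, here is a small sketch using only Python's standard library (not TextMAP). It compares a plain whitespace split with a simple "runs of word characters" regex. Both rules are illustrative choices, not what any of the tokenizers below necessarily do.

```python
import re

text = "Ms. Smith can't pay $5.00 for an x-ray in Las Vegas :-D"

# Splitting on whitespace keeps punctuation glued to words
# and breaks "Las Vegas" into two tokens.
whitespace_tokens = text.split()

# Keeping only runs of word characters drops punctuation,
# but also splits "can't", "x-ray" and "$5.00" into pieces.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
print(word_tokens)
```

Neither rule is "right": which pieces should stay together depends on what the tokens are used for downstream.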

TextMAP contains several standard NLP tokenizers, all made to work within the standard scikit-learn fit_transform API. There are several default options available, each of which makes slightly different choices when tokenizing a document, and all of them let the user supply their own tokenizer instance. TextMAP also contains a transformer for bigram and n-gram contraction, which replaces commonly occurring pairs (or n-tuples more generally) with a single token for the pair (or n-gram). This deals with multi-token terms like "Las Vegas", "ice cream", "without loss of generality", etc. in an unsupervised or semi-supervised fashion.
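
As a rough picture of what "working within the fit_transform API" means, the sketch below defines a toy tokenizer with `fit`, `transform` and `fit_transform` methods. The class name `WhitespaceTokenizerSketch` and its whitespace-splitting rule are hypothetical stand-ins, not part of TextMAP; only the `lower_case` option and the tuple-of-tuples output shape mirror what appears later in this walkthrough.

```python
class WhitespaceTokenizerSketch:
    """Hypothetical stand-in for a tokenizer transformer (not a TextMAP class)."""

    def __init__(self, lower_case=True):
        self.lower_case = lower_case

    def fit(self, X, y=None):
        # A whitespace split has nothing to learn; return self so calls chain.
        return self

    def transform(self, X):
        docs = (doc.lower() if self.lower_case else doc for doc in X)
        # One tuple of tokens per input document.
        return tuple(tuple(doc.split()) for doc in docs)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


docs = ["Ice cream in Las Vegas", "Without loss of generality"]
tokens = WhitespaceTokenizerSketch().fit_transform(docs)
print(tokens)
```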

To demonstrate some of the tokenizers and their usage we'll walk through some examples and explore the options available. 

First let's get some data! We'll use the 20 Newsgroups dataset and remove documents that are 100 characters or shorter.

In [1]:
import sklearn.datasets
import numpy as np
import pandas
import vectorizers
import textmap
import textmap.tokenizers 
import textmap.transformers
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /Users/colin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
news = sklearn.datasets.fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

In [3]:
long_enough = [len(t) > 100 for t in news['data']]
data = np.array(news['data'])
data = data[long_enough]
targets = np.array(news.target)
targets = targets[long_enough]
target_names = np.array(news.target_names)

## Default Tokenizers:

Let's look at a couple of documents and notice all of the awful text in there... numbers, email addresses, hyphens, signatures, special characters, etc... and these are some of the better ones!

In [4]:
data[0:3]

array(['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
       "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade an

The SKLearnTokenizer tokenizes the corpus using the same methodology as scikit-learn's CountVectorizer.

In [5]:
%%time
tokens = textmap.tokenizers.SKLearnTokenizer().fit_transform(data)

CPU times: user 4.37 s, sys: 2.03 s, total: 6.4 s
Wall time: 9.35 s


Looking at the first 3 documents, we can see how it performs. Notice that it removes tokens of length 1 and only produces 'word-like' tokens.

In [6]:
tokens[0:3]

(('was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'saw',
  'the',
  'other',
  'day',
  'it',
  'was',
  'door',
  'sports',
  'car',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s',
  'early',
  '70s',
  'it',
  'was',
  'called',
  'bricklin',
  'the',
  'doors',
  'were',
  'really',
  'small',
  'in',
  'addition',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  'this',
  'is',
  'all',
  'know',
  'if',
  'anyone',
  'can',
  'tellme',
  'model',
  'name',
  'engine',
  'specs',
  'years',
  'of',
  'production',
  'where',
  'this',
  'car',
  'is',
  'made',
  'history',
  'or',
  'whatever',
  'info',
  'you',
  'have',
  'on',
  'this',
  'funky',
  'looking',
  'car',
  'please',
  'mail'),
 ('fair',
  'number',
  'of',
  'brave',
  'souls',
  'who',
  'upgraded',
  'their',
  'si',
  'clock',
  'oscillator',
  'have',
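
The behaviour seen above can be reproduced on a single sentence with the standard library. This sketch assumes scikit-learn's default CountVectorizer settings, namely lowercasing plus the token pattern `(?u)\b\w\w+\b`. Whether SKLearnTokenizer uses exactly these defaults is an assumption based on the description above.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "I saw a 2-door sports car, late 60s/early 70s. Please e-mail."
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)
```

Single-character pieces such as "i", "a", the "2" in "2-door" and the "e" in "e-mail" never match the pattern, which is why they are missing from the output.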


NLTK's default tokenizer keeps the punctuation and tries to handle special characters more naturally. 

In [7]:
%%time
tokens = textmap.tokenizers.NLTKTokenizer().fit_transform(data)

CPU times: user 26.4 s, sys: 1.96 s, total: 28.3 s
Wall time: 30 s


In [8]:
tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  'the',
  'other',
  'day',
  '.',
  'it',
  'was',
  'a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/',
  'early',
  '70s',
  '.',
  'it',
  'was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in',
  'addition',
  ',',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'this',
  'is',
  'all',
  'i',
  'know',
  '.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  'have',
  'on',
  'this',
  'funky',
  'looking',
  'car',
  ',',
  'please',

NLTK's tweet tokenizer tries to tokenize emojis, urls, and email addresses as single tokens as well. 

In [9]:
%%time
tokens = textmap.tokenizers.NLTKTweetTokenizer().fit_transform(data)

CPU times: user 9.85 s, sys: 238 ms, total: 10.1 s
Wall time: 10.1 s


In [10]:
tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  'the',
  'other',
  'day',
  '.',
  'it',
  'was',
  'a',
  '2',
  '-',
  'door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s',
  '/',
  'early',
  '70s',
  '.',
  'it',
  'was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in',
  'addition',
  ',',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'this',
  'is',
  'all',
  'i',
  'know',
  '.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  'have',
  'on',
  'this',
  'funky',
  'looking',
  'car',


SpaCy uses several language processing techniques to make even more refined choices. 

In [11]:
%%time
tokens = textmap.tokenizers.SpacyTokenizer().fit_transform(data)

CPU times: user 26.2 s, sys: 214 ms, total: 26.4 s
Wall time: 26.5 s


In [12]:
tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  '\n',
  'the',
  'other',
  'day',
  '.',
  'it',
  'was',
  'a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/',
  '\n',
  'early',
  '70s',
  '.',
  'it',
  'was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in',
  'addition',
  ',',
  '\n',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'this',
  'is',
  '\n',
  'all',
  'i',
  'know',
  '.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  '\n',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  '\n',
  'have',
  'on',
  'this',
  '

Stanza uses its own language processing tools (which we default to English) for tokenization. Because of all of the additional NLP processing, Stanza can be time consuming for large corpora, so here we only tokenize the first 100 documents.

In [13]:
%%time
tokens = textmap.tokenizers.StanzaTokenizer().fit_transform(data[0:100])

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.0.0.json: 116kB [00:00, 4.07MB/s]                    
2020-04-24 17:34:05 INFO: Downloading these customized packages for language: en (English)...
| Processor | Package |
-----------------------
| tokenize  | ewt     |

2020-04-24 17:34:05 INFO: File exists: /Users/colin/stanza_resources/en/tokenize/ewt.pt.
2020-04-24 17:34:05 INFO: Finished downloading models and saved to /Users/colin/stanza_resources.
2020-04-24 17:34:05 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |

2020-04-24 17:34:05 INFO: Use device: cpu
2020-04-24 17:34:05 INFO: Loading: tokenize
2020-04-24 17:34:05 INFO: Done loading processors!


CPU times: user 1min 46s, sys: 2.77 s, total: 1min 49s
Wall time: 29.7 s


In [14]:
tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  'the',
  'other',
  'day',
  '.',
  'it',
  'was',
  'a',
  '2',
  '-',
  'door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s',
  '/',
  'early',
  '70s',
  '.',
  'it',
  'was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in',
  'addition',
  ',',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'this',
  'is',
  'all',
  'i',
  'know',
  '.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  'have',
  'on',
  'this',
  'funky',
  'looking',
  'car',


## Tokenizer options:

All of the tokenizers return a tokenization by document by default, but they can also tokenize by sentence (returning one token sequence per sentence) or by sentence by document (returning, for each document, a sequence of per-sentence token sequences). This is more time consuming, as finding sentence breaks presents its own challenges. For example:

In [15]:
%%time
tokens = textmap.tokenizers.NLTKTokenizer(tokenize_by = "sentence").fit_transform(data)
tokens[0:3]

CPU times: user 32.7 s, sys: 770 ms, total: 33.5 s
Wall time: 34.3 s


(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  'the',
  'other',
  'day',
  '.'),
 ('it',
  'was',
  'a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/',
  'early',
  '70s',
  '.'),
 ('it', 'was', 'called', 'a', 'bricklin', '.'))

In [16]:
%%time
tokens = textmap.tokenizers.NLTKTweetTokenizer(tokenize_by = "sentence_by_document").fit_transform(data)
tokens[0:3]

CPU times: user 15.1 s, sys: 575 ms, total: 15.7 s
Wall time: 15.7 s


((('i',
   'was',
   'wondering',
   'if',
   'anyone',
   'out',
   'there',
   'could',
   'enlighten',
   'me',
   'on',
   'this',
   'car',
   'i',
   'saw',
   'the',
   'other',
   'day',
   '.'),
  ('it',
   'was',
   'a',
   '2',
   '-',
   'door',
   'sports',
   'car',
   ',',
   'looked',
   'to',
   'be',
   'from',
   'the',
   'late',
   '60s',
   '/',
   'early',
   '70s',
   '.'),
  ('it', 'was', 'called', 'a', 'bricklin', '.'),
  ('the', 'doors', 'were', 'really', 'small', '.'),
  ('in',
   'addition',
   ',',
   'the',
   'front',
   'bumper',
   'was',
   'separate',
   'from',
   'the',
   'rest',
   'of',
   'the',
   'body',
   '.'),
  ('this', 'is', 'all', 'i', 'know', '.'),
  ('if',
   'anyone',
   'can',
   'tellme',
   'a',
   'model',
   'name',
   ',',
   'engine',
   'specs',
   ',',
   'years',
   'of',
   'production',
   ',',
   'where',
   'this',
   'car',
   'is',
   'made',
   ',',
   'history',
   ',',
   'or',
   'whatever',
   'info',
   'you',
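
The three output shapes can be compared side by side on a toy corpus. The sketch below uses a deliberately naive regex sentence splitter and tokenizer from the standard library; the helper names `split_sentences` and `tokenize` are hypothetical and are not how TextMAP or NLTK find sentence breaks.

```python
import re

def split_sentences(doc):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]

def tokenize(text):
    # Lowercased words, plus each punctuation mark as its own token.
    return tuple(re.findall(r"\w+|[^\w\s]", text.lower()))

docs = ["It was a car. It was small.", "Please e-mail."]

# "document": one token sequence per document.
by_document = tuple(tokenize(doc) for doc in docs)

# "sentence": one flat sequence of sentences across the whole corpus.
by_sentence = tuple(tokenize(s) for doc in docs for s in split_sentences(doc))

# "sentence_by_document": sentences stay grouped under their document.
by_sentence_by_document = tuple(
    tuple(tokenize(s) for s in split_sentences(doc)) for doc in docs
)

print(len(by_document), len(by_sentence), len(by_sentence_by_document))
```

Note that the flat "sentence" form loses track of which document a sentence came from, while the nested form keeps it at the cost of one more level of nesting.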
 

All tokenizers lowercase by default, but this option can be changed as well. For example:

In [17]:
tokens = textmap.tokenizers.NLTKTokenizer(lower_case = False).fit_transform(data)
tokens[0:3]

(('I',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'I',
  'saw',
  'the',
  'other',
  'day',
  '.',
  'It',
  'was',
  'a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/',
  'early',
  '70s',
  '.',
  'It',
  'was',
  'called',
  'a',
  'Bricklin',
  '.',
  'The',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'In',
  'addition',
  ',',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'This',
  'is',
  'all',
  'I',
  'know',
  '.',
  'If',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  'have',
  'on',
  'this',
  'funky',
  'looking',
  'car',
  ',',
  'please',

If you have a pre-built NLP model (from spaCy, Stanza, or NLTK), you can pass it in to replace the default. For example:

In [18]:
my_nlp = nltk.tokenize.SpaceTokenizer()
tokens = textmap.tokenizers.NLTKTokenizer(nlp = my_nlp).fit_transform(data)
tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw\nthe',
  'other',
  'day.',
  'it',
  'was',
  'a',
  '2-door',
  'sports',
  'car,',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/\nearly',
  '70s.',
  'it',
  'was',
  'called',
  'a',
  'bricklin.',
  'the',
  'doors',
  'were',
  'really',
  'small.',
  'in',
  'addition,\nthe',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body.',
  'this',
  'is',
  '\nall',
  'i',
  'know.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name,',
  'engine',
  'specs,',
  'years\nof',
  'production,',
  'where',
  'this',
  'car',
  'is',
  'made,',
  'history,',
  'or',
  'whatever',
  'info',
  'you\nhave',
  'on',
  'this',
  'funky',
  'looking',
  'car,',
  'please',
  'e-mail.'),
 ('a',
  'fair',
  'number',
  'of',
  'brave',
  'souls',
  'who',
  'upgraded',
  'th

In [19]:
from spacy.lang.fr import French
tokens = textmap.tokenizers.SpacyTokenizer(nlp = French()).fit_transform(data)
tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  '\n',
  'the',
  'other',
  'day',
  '.',
  'it',
  'was',
  'a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/',
  '\n',
  'early',
  '70s',
  '.',
  'it',
  'was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in',
  'addition',
  ',',
  '\n',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'this',
  'is',
  '\n',
  'all',
  'i',
  'know',
  '.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  '\n',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  '\n',
  'have',
  'on',
  'this',
  '

You can pass any tokenizer an instance of any class you like, as long as it provides the same basic functionality. For example, the NLTK tokenizer can accept any object that has a tokenize method.

In [20]:
class SillyNLP():
    
    def tokenize(self, X):
        if len(X) % 5 == 4: 
            return ["Badger"]*(len(X) // 10 ) + ['aghh', 'Snake', 'A', 'Snake', 'Ooooh', 'Its', 'A', 'Snake']
        return ["Badger"]*(len(X) // 10 ) + ['mushroom', 'mushroom']


In [21]:
silly_tokens = textmap.tokenizers.NLTKTokenizer(nlp = SillyNLP()).fit_transform(data)
silly_tokens[0:3]

(('badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'mushroom',
  'mushroom'),
 ('badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',
  'badger',

In [22]:
silly_tokens = textmap.tokenizers.NLTKTokenizer(nlp = SillyNLP(), lower_case = False).fit_transform(data)
silly_tokens[0:3]

(('Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'mushroom',
  'mushroom'),
 ('Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
  'Badger',
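
A slightly more practical object in the same spirit might wrap a regular expression. The class `RegexNLP` and its pattern are hypothetical; the only thing taken from the text above is that the object exposes a `tokenize` method that takes a string and returns a list of tokens.

```python
import re

class RegexNLP:
    """Hypothetical tokenizer object exposing a tokenize method."""

    def __init__(self, pattern=r"[\w'-]+|[^\w\s]"):
        # Words may contain apostrophes and hyphens; any other
        # non-space character becomes a token of its own.
        self._pattern = re.compile(pattern)

    def tokenize(self, text):
        return self._pattern.findall(text)


nlp = RegexNLP()
example_tokens = nlp.tokenize("I can't find my x-ray, please e-mail.")
print(example_tokens)

# It could then be passed in the same way as SillyNLP above, e.g.
# textmap.tokenizers.NLTKTokenizer(nlp=RegexNLP()).fit_transform(data)
```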

## Multi-token expressions:

Now we may wish to contract common n-grams into a single token to capture common phrases. By default, this computes, for each pair of adjacent tokens, the ratio of how often the pair actually occurs to how often it would be expected to occur under independence (the product of their frequencies). If a pair of tokens occurs more than 2^7 times as often as expected under independence, then occurrences of that adjacent pair are contracted into a single token. The process is then repeated on the contracted text (by default, one more time) to potentially contract larger n-grams.

(This class relies heavily on the NLTK MultiWordExpression infrastructure, which is quite excellent.)
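
The ratio described above can be written out by hand for a toy token stream. This is a sketch of the idea as described in the text, not the library's implementation: the helper `pair_ratios` is hypothetical, and the exact counts and normalisation TextMAP uses may differ.

```python
from collections import Counter

def pair_ratios(tokens):
    """Observed adjacency frequency divided by the frequency expected under independence."""
    n_tokens = len(tokens)
    n_pairs = n_tokens - 1
    token_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    ratios = {}
    for (a, b), count in pair_counts.items():
        observed = count / n_pairs
        expected = (token_counts[a] / n_tokens) * (token_counts[b] / n_tokens)
        ratios[(a, b)] = observed / expected
    return ratios


tokens = "las vegas is in nevada and las vegas is hot".split()
ratios = pair_ratios(tokens)

threshold = 2 ** 7  # the default cutoff described above
to_contract = [pair for pair, ratio in ratios.items() if ratio > threshold]

print(ratios[("las", "vegas")])
print(to_contract)
```

In a corpus this small no pair can come close to a ratio of 2^7, so nothing is selected; the threshold only becomes meaningful when individual token frequencies are tiny, as they are in a corpus the size of 20 Newsgroups.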

In [23]:
mte = textmap.transformers.MultiTokenExpressionTransformer()
new_tokens = mte.fit_transform(tokens)

For reproducibility, the model stores the multi-token expressions as a list of pairs of tokens to contract per iteration. 

In [24]:
mte.mtes_

[[('_', '_'),
  ("don'", 't'),
  ("it'", 's'),
  ("i'", 'm'),
  ('of', 'the'),
  ('in', 'the'),
  ('i', 'am'),
  ('if', 'you'),
  ("i'", 've'),
  ("can'", 't'),
  ('it', 'is'),
  ("didn'", 't'),
  ("doesn'", 't'),
  ('to', 'be'),
  ("that'", 's'),
  ('on', 'the'),
  ('i', 'have'),
  ('i', "don'"),
  ("you'", 're'),
  ('i', 'think'),
  ('would', 'be'),
  ('will', 'be'),
  ('at', 'least'),
  ('have', 'been'),
  ('the', 'same'),
  ('this', 'is'),
  ("isn'", 't'),
  ("i'", 'd'),
  ('there', 'are'),
  ('the', 'the'),
  ('there', 'is'),
  ('has', 'been'),
  ('want', 'to'),
  ('you', 'can'),
  ('does', 'not'),
  ("i'", 'll'),
  ('they', 'are'),
  ("there'", 's'),
  ('a', 'lot'),
  ('a', 'few'),
  ('is', 'not'),
  ('can', 'be'),
  ('should', 'be'),
  ('is', 'a'),
  ('going', 'to'),
  ("they'", 're'),
  ('able', 'to'),
  ("won'", 't'),
  ('it', 'was'),
  ('do', 'not'),
  ('of', 'course'),
  ("wouldn'", 't'),
  ('united', 'states'),
  ('t', 'know'),
  ('to', 'get'),
  ("we'", 're'),
  ('may', 'b

By default, it will not contract any token that is a non-word (under the regex r"\W+") but this can be easily changed as we will see later on. 

The tokens are contracted on a first-come, first-served basis, one round at a time, ultimately producing the new sequence of (multi-)tokens.
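
One plausible reading of "first-come, first-served" is a single left-to-right pass per round, where a token that has already been merged cannot be merged again in the same round. The sketch below illustrates that reading; `contract_pairs`, the underscore separator and the pair sets are hypothetical choices for illustration, not the library's code.

```python
def contract_pairs(tokens, pairs, sep="_"):
    """Merge listed adjacent pairs in one left-to-right pass."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # both tokens are consumed by the merge
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["i", "was", "wondering", "if", "anyone"]

# Round one: ("was", "wondering") is listed but never fires,
# because "was" has already been consumed by ("i", "was").
round_one = contract_pairs(
    tokens, {("i", "was"), ("was", "wondering"), ("wondering", "if")}
)

# Round two runs on the contracted text and can build longer n-grams.
round_two = contract_pairs(round_one, {("i_was", "wondering_if")})

print(round_one)
print(round_two)
```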

In [25]:
new_tokens[0:3]

(('i_was_wondering_if',
  'anyone_out_there',
  'could',
  'enlighten',
  'me',
  'on_this',
  'car',
  'i_saw',
  '\n',
  'the_other_day',
  '.',
  'it_was_a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to_be',
  'from_the',
  'late',
  '60s/',
  '\n',
  'early',
  '70s',
  '.',
  'it_was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in_addition',
  ',',
  '\n',
  'the_front',
  'bumper',
  'was',
  'separate',
  'from_the',
  'rest_of',
  'the_body',
  '.',
  'this_is',
  '\n',
  'all',
  'i_know',
  '.',
  'if_anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  '\n',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or_whatever',
  'info',
  'you',
  '\n',
  'have',
  'on_this',
  'funky',
  'looking',
  'car',
  ',',
  'please_e-mail',
  '.'),
 ('a_fair',
  'number_of',
  'brave',
  'souls',
  'who'

Notice that this has a tendency to combine stop words into stop phrases, as stop words tend to occur next to each other much more often than one would expect by chance. However, it also captures other bigrams you might expect, like 'electrical_engineering', which helps distinguish this token from 'electrical' and 'engineering' when they occur separately.

Perhaps we want to ignore stop words, for example. In this case we can just add them to the ignored tokens.

In [26]:
mte = textmap.transformers.MultiTokenExpressionTransformer(ignored_tokens=stopwords.words('english'))
new_tokens = mte.fit_transform(tokens)

In [27]:
new_tokens[0:3]

(('i',
  'was',
  'wondering',
  'if',
  'anyone',
  'out',
  'there',
  'could',
  'enlighten',
  'me',
  'on',
  'this',
  'car',
  'i',
  'saw',
  '\n',
  'the',
  'other',
  'day',
  '.',
  'it',
  'was',
  'a',
  '2-door',
  'sports',
  'car',
  ',',
  'looked',
  'to',
  'be',
  'from',
  'the',
  'late',
  '60s/',
  '\n',
  'early',
  '70s',
  '.',
  'it',
  'was',
  'called',
  'a',
  'bricklin',
  '.',
  'the',
  'doors',
  'were',
  'really',
  'small',
  '.',
  'in',
  'addition',
  ',',
  '\n',
  'the',
  'front',
  'bumper',
  'was',
  'separate',
  'from',
  'the',
  'rest',
  'of',
  'the',
  'body',
  '.',
  'this',
  'is',
  '\n',
  'all',
  'i',
  'know',
  '.',
  'if',
  'anyone',
  'can',
  'tellme',
  'a',
  'model',
  'name',
  ',',
  'engine',
  'specs',
  ',',
  'years',
  '\n',
  'of',
  'production',
  ',',
  'where',
  'this',
  'car',
  'is',
  'made',
  ',',
  'history',
  ',',
  'or',
  'whatever',
  'info',
  'you',
  '\n',
  'have',
  'on',
  'this',
  '

You can also filter the tokens to contract in various ways, including by minimum and maximum frequency (or number of occurrences) or by regular expression, or only contract n-grams if the pair occurs sufficiently often. You can also set the maximum number of iterations to be larger or smaller to control the maximum length of a contracted n-gram.

In [28]:
mte = textmap.transformers.MultiTokenExpressionTransformer(max_token_frequency = 1e-4, 
                                                           min_token_occurrences = 50, 
                                                           min_ngram_occurrences = 30,
                                                           excluded_token_regex=r"\W",
                                                           max_iterations=1
                                                          )
mte.fit(tokens)
mte.mtes_

[[('los', 'angeles'),
  ('serdar', 'argic'),
  ('gordon', 'banks'),
  ('st.', 'louis'),
  ('greatly', 'appreciated'),
  ('newsletter', '                                             '),
  ('et', 'al'),
  ('apr', '93'),
  ('hicnet', 'medical'),
  ('                                             ', 'page'),
  ('holy', 'spirit'),
  ('tampa', 'bay'),
  ('bear', 'arms'),
  ('rom', 'bios'),
  ('stanley', 'cup'),
  ('medical', 'newsletter'),
  ('maple', 'leafs'),
  ('cross', 'linked'),
  ('------', '------'),
  ('tear', 'gas'),
  ('allocation', 'unit'),
  ('excellent', 'condition'),
  ('soviet', 'union'),
  ('virtual', 'reality'),
  ('cubs', 'suck'),
  ('remote', 'sensing'),
  ('ottoman', 'empire'),
  ('red', 'sox'),
  ('red', 'wings'),
  ("o'", 'clock'),
  ('mailing', 'lists'),
  ('st', "john'"),
  ('south', 'georgia'),
  ('radar', 'detector'),
  ('serial', 'port'),
  ('soviet', 'armenia'),
  ('middle', 'east'),
  ('summer', 'jobs'),
  ('western', 'digital')]]

(Yes, one of the tokens is a long run of whitespace in this example... meh... it's just an example.)
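
The kind of filtering described above can be sketched with a counter over a toy token stream. Everything here is hypothetical: the helper `eligible_tokens` and its parameter names are not TextMAP's, and the choice to exclude a token only when the regex matches the whole token is an assumption (it happens to be consistent with '------' and the whitespace token surviving an exclusion regex of `\W` in the output above, but that is an inference, not documented behaviour).

```python
import re
from collections import Counter

def eligible_tokens(tokens, min_occurrences=1, max_frequency=1.0, excluded_regex=None):
    """Return the set of tokens allowed to take part in a contraction."""
    counts = Counter(tokens)
    total = len(tokens)
    keep = set()
    for token, count in counts.items():
        if count < min_occurrences:
            continue  # too rare
        if count / total > max_frequency:
            continue  # too common
        # Assumption: the regex must match the entire token to exclude it.
        if excluded_regex is not None and re.fullmatch(excluded_regex, token):
            continue
        keep.add(token)
    return keep


tokens = ["the"] * 6 + ["red", "sox"] * 2 + [",", ",", "------", "------", "rare"]
kept = eligible_tokens(tokens, min_occurrences=2, max_frequency=0.3, excluded_regex=r"\W")
print(sorted(kept))
```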

You can also change the minimum score required to merge an n-gram and/or the score function itself to anything that behaves like those found in nltk.metrics.BigramAssocMeasures.
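
For reference, NLTK's bigram association measures are called as `measure(n_ii, (n_ix, n_xi), n_xx)`, where `n_ii` is the bigram count, `n_ix` and `n_xi` are the counts of the first and second word, and `n_xx` is the total number of bigrams. The sketch below writes a pointwise mutual information score with that argument layout in plain Python. The function `pmi_score` and the counts are made up for illustration, and whether TextMAP calls a custom score function with exactly these arguments is an assumption drawn from the sentence above.

```python
import math

def pmi_score(n_ii, n_ix_xi, n_xx):
    """Pointwise mutual information with NLTK's bigram-measure argument layout.

    n_ii    -- number of times the bigram (a, b) occurs
    n_ix_xi -- tuple (count of a as first word, count of b as second word)
    n_xx    -- total number of bigrams in the corpus
    """
    n_ix, n_xi = n_ix_xi
    return math.log2(n_ii * n_xx) - math.log2(n_ix * n_xi)


# Made-up counts: a pair seen 30 times, its words seen 40 and 50 times,
# in a corpus of 100,000 bigrams.
score = pmi_score(30, (40, 50), 100_000)
print(score)
```

Since this score is the base-2 logarithm of an observed-to-expected ratio, a cutoff of 7 on it corresponds to a ratio of 2^7.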


In [29]:
from nltk.metrics import BigramAssocMeasures

In [30]:
%%time
mte = textmap.transformers.MultiTokenExpressionTransformer(score_function=BigramAssocMeasures.chi_sq,
                                                           min_score = 1e5,
                                                           max_token_frequency = 1e-4, 
                                                           min_token_occurrences = 50, 
                                                           min_ngram_occurrences = 30,
                                                           excluded_token_regex=r"\W",
                                                           max_iterations=1
                                                          )
mte.fit(tokens)
mte.mtes_

CPU times: user 11.5 s, sys: 64.5 ms, total: 11.6 s
Wall time: 11.6 s


[[('serdar', 'argic'),
  ('los', 'angeles'),
  ('newsletter', '                                             '),
  ('gordon', 'banks'),
  ('tampa', 'bay'),
  ('stanley', 'cup'),
  ('maple', 'leafs'),
  ('                                             ', 'page'),
  ('st.', 'louis'),
  ('cubs', 'suck'),
  ('hicnet', 'medical'),
  ('st', "john'"),
  ('ottoman', 'empire'),
  ('cross', 'linked'),
  ('remote', 'sensing'),
  ('------', '------'),
  ('et', 'al'),
  ('bear', 'arms'),
  ('tear', 'gas'),
  ('radar', 'detector'),
  ('holy', 'spirit'),
  ('greatly', 'appreciated'),
  ('allocation', 'unit'),
  ('medical', 'newsletter'),
  ('apr', '93'),
  ('rom', 'bios'),
  ('soviet', 'union'),
  ('red', 'sox'),
  ('south', 'georgia'),
  ('virtual', 'reality'),
  ('mailing', 'lists'),
  ("o'", 'clock'),
  ('red', 'wings')]]