# Scikit-learn compatible vectorizers built with the spaCy NLP framework

In this notebook I will show basic examples of how and when to use custom classes and vectorizers inspired by ```scikit-learn```'s ```CountVectorizer```. They add more accurate tokenization and lemmatization functionality with the help of the <a href='https://spacy.io/'>spaCy</a> NLP framework. Simple <a href='https://keras.io/preprocessing/text/'>Keras</a>-like punctuation removal is also supported.

Let's do the imports first.  

In [1]:
import spacy
from vectorizers import SpacyTokenizer
from vectorizers import SpacyLemmatizer
from vectorizers import SpacyPipeProcessor
from vectorizers import SpacyTokenCountVectorizer
from vectorizers import SpacyLemmaCountVectorizer
from vectorizers import SpacyWord2VecVectorizer

Here we will load the ```en_core_web_md``` model for spaCy and create some example single-sentence documents.

In [2]:
%%time
nlp = spacy.load('en_core_web_md')

Wall time: 21.5 s


In [3]:
# Example documents
raw_documents = ["The quick brown fox jumps over the lazy dog.",
                 "This is a test sentence.",
                 "This sentence contains exclamation mark, comma and (round brackets)!"]

We'll start with the helper classes for tokenization and lemmatization.

### SpacyTokenizer

```SpacyTokenizer``` uses the spaCy <a href='https://spacy.io/usage/linguistic-features#section-tokenization'>tokenizer</a> for document tokenization. When the ```join_str``` argument is set to ```None```, each document is returned as a ```list``` of strings (tokens). Characters listed in the ```ignore_chars``` argument are removed from every token, but tokens that become empty are kept. You can also specify the ```batch_size``` and ```n_threads``` arguments for parallel processing of large datasets. Lowercasing isn't performed.

In [4]:
tokenizer = SpacyTokenizer(nlp, join_str=None, ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', 
                           batch_size=10000, n_threads=1)

tokens = tokenizer(raw_documents) # generator object is returned
for tokenized_doc in tokens:
    print(tokenized_doc)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '']
['This', 'is', 'a', 'test', 'sentence', '']
['This', 'sentence', 'contains', 'exclamation', 'mark', '', 'comma', 'and', '', 'round', 'brackets', '', '']
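
The empty strings in the output above come from tokens that consist only of punctuation. As a rough illustration (this is not the actual ```SpacyTokenizer``` implementation, which isn't shown in this notebook), the same effect can be reproduced in plain Python. The token list below is typed in by hand instead of coming from spaCy:

```python
# Sketch only: mimics per-token punctuation removal with str.translate.
# The token list is hard-coded here; in the notebook it comes from spaCy.
ignore_chars = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
strip_table = str.maketrans('', '', ignore_chars)

tokens = ['This', 'is', 'a', 'test', 'sentence', '.']
cleaned = [token.translate(strip_table) for token in tokens]
print(cleaned)  # ['This', 'is', 'a', 'test', 'sentence', '']
```

The final ```'.'``` token loses its only character and survives as an empty string, which matches the trailing ```''``` in each printed list.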


Here's the difference when ```join_str``` is set to a space character. ```SpacyTokenizer``` then returns each document as a single string of joined tokens (empty punctuation-only tokens included, which is why some spaces are doubled).

In [5]:
tokenizer = SpacyTokenizer(nlp, join_str=' ', n_threads=1)
tokens = tokenizer(raw_documents) # generator object is returned
for tokenized_doc in tokens:
    print(tokenized_doc)

The quick brown fox jumps over the lazy dog 
This is a test sentence 
This sentence contains exclamation mark  comma and  round brackets  


Finally, this example shows the usual result of tokenization and punctuation removal. Note that you must call the ```split()``` method on each joined string to obtain a list of tokens without the empty ones.

In [6]:
tokenizer = SpacyTokenizer(nlp, join_str=' ', n_threads=1)
tokens = tokenizer(raw_documents) # generator object is returned
for tokenized_doc in tokens:
    print(tokenized_doc.split())

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['This', 'is', 'a', 'test', 'sentence']
['This', 'sentence', 'contains', 'exclamation', 'mark', 'comma', 'and', 'round', 'brackets']
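
Why ```split()``` drops the empty tokens can be seen without spaCy at all. The sketch below uses a hand-written token list, with ```''``` standing in for removed punctuation, and compares ```split()``` with and without an explicit separator:

```python
# Hand-written tokens; '' marks positions where punctuation was removed
tokens = ['This', 'sentence', 'contains', 'exclamation', 'mark', '',
          'comma', 'and', '', 'round', 'brackets', '', '']

joined = ' '.join(tokens)        # empty tokens become doubled spaces
without_empty = joined.split()   # no argument: runs of whitespace collapse
with_empty = joined.split(' ')   # explicit separator: empty strings are kept

print(repr(joined))
print(without_empty)
print(len(with_empty), len(without_empty))  # 13 9
```

Calling ```split()``` with no arguments treats any run of whitespace as one separator, so the doubled spaces left by empty tokens disappear.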


### SpacyLemmatizer

```SpacyLemmatizer``` is very similar to ```SpacyTokenizer```, but it returns lowercased lemmas instead of tokens.

In [7]:
lemmatizer = SpacyLemmatizer(nlp, join_str=None, ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', 
                             batch_size=10000, n_threads=1)
lemmas = lemmatizer(raw_documents) # generator object is returned
for lemmatized_doc in lemmas:
    print(lemmatized_doc)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '']
['this', 'be', 'a', 'test', 'sentence', '']
['this', 'sentence', 'contain', 'exclamation', 'mark', '', 'comma', 'and', '', 'round', 'bracket', '', '']


In [8]:
lemmatizer = SpacyLemmatizer(nlp, join_str=' ', n_threads=1)
lemmas = lemmatizer(raw_documents) # generator object is returned
for lemmatized_doc in lemmas:
    print(lemmatized_doc)

the quick brown fox jump over the lazy dog 
this be a test sentence 
this sentence contain exclamation mark  comma and  round bracket  


In [9]:
lemmatizer = SpacyLemmatizer(nlp, join_str=' ', n_threads=1)
lemmas = lemmatizer(raw_documents) # generator object is returned
for lemmatized_doc in lemmas:
    print(lemmatized_doc.split())

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']
['this', 'be', 'a', 'test', 'sentence']
['this', 'sentence', 'contain', 'exclamation', 'mark', 'comma', 'and', 'round', 'bracket']


### SpacyTokenCountVectorizer

```SpacyTokenCountVectorizer``` inherits from ```scikit-learn```'s ```CountVectorizer``` to enable tokenization with ```spaCy``` models. Its ```fit()```, ```fit_transform()``` and ```transform()``` methods accept an iterable of <a href=https://spacy.io/api/doc>Doc</a> objects as the ```spacy_docs``` parameter (```X``` in ```scikit-learn```). This iterable can be obtained from the ```SpacyPipeProcessor``` class.

In [10]:
spp = SpacyPipeProcessor(nlp, n_threads=1)  # creates iterable of spaCy Doc objects
spacy_docs = spp(raw_documents)

In this example we can see that the result of ```SpacyTokenCountVectorizer```'s ```fit_transform()``` method is a CSR sparse matrix, just like the one a standard ```CountVectorizer``` returns.

In [11]:
stcv = SpacyTokenCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
count_vectors = stcv.fit_transform(spacy_docs); count_vectors

<3x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

In [12]:
print(stcv.vocabulary_)

{'the': 18, 'this': 19, 'mark': 12, 'quick': 14, 'brackets': 2, 'dog': 6, 'a': 0, 'round': 15, 'jumps': 10, 'brown': 3, 'over': 13, 'fox': 8, 'lazy': 11, 'comma': 4, 'sentence': 16, 'contains': 5, 'exclamation': 7, 'is': 9, 'and': 1, 'test': 17}
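
To make the ```3x20``` shape, the 22 stored elements and the ```vocabulary_``` indices less opaque, here is a minimal standard-library sketch that rebuilds the same numbers by hand. It assumes lowercasing and alphabetical index assignment, which is consistent with the output above, and it uses hand-tokenized documents instead of spaCy:

```python
from collections import Counter

# Hand-tokenized, punctuation-free versions of the three example documents
docs = [
    "The quick brown fox jumps over the lazy dog".split(),
    "This is a test sentence".split(),
    "This sentence contains exclamation mark comma and round brackets".split(),
]

# Lowercase the tokens and count them per document
counts = [Counter(token.lower() for token in doc) for doc in docs]

# Column indices follow the alphabetical order of the terms
terms = sorted(set().union(*counts))
vocabulary = {term: index for index, term in enumerate(terms)}

# Dense count matrix: one row per document, one column per term
matrix = [[doc_counts[term] for term in terms] for doc_counts in counts]
stored = sum(1 for row in matrix for value in row if value != 0)

print(len(matrix), len(terms), stored)  # 3 20 22
print(vocabulary['a'], vocabulary['the'], vocabulary['this'])  # 0 18 19
```

Only 22 of the 60 cells are non-zero, which is why a sparse format is the natural choice for count matrices.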


If you initialize a ```SpacyPipeProcessor``` object with the ```multi_iters``` parameter set to ```True```, the result of its ```__call__``` method will be a list of ```Doc``` objects instead of a single ```generator```. This allows you to iterate through the returned objects multiple times, for example when calling ```fit()``` and then ```transform()``` on the same documents.

In [13]:
spp = SpacyPipeProcessor(nlp, n_threads=1, multi_iters=True)
spacy_docs = spp(raw_documents)

stcv = SpacyTokenCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
stcv.fit(spacy_docs)
count_vectors = stcv.transform(spacy_docs); count_vectors

<3x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

In [14]:
print(stcv.vocabulary_)

{'the': 18, 'this': 19, 'mark': 12, 'quick': 14, 'brackets': 2, 'dog': 6, 'a': 0, 'round': 15, 'jumps': 10, 'brown': 3, 'over': 13, 'fox': 8, 'lazy': 11, 'comma': 4, 'sentence': 16, 'contains': 5, 'exclamation': 7, 'is': 9, 'and': 1, 'test': 17}
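
The reason ```multi_iters=True``` matters for separate ```fit()``` and ```transform()``` calls is ordinary Python generator behaviour. The sketch below uses a hypothetical ```process()``` function as a stand-in for a lazy pipeline; it is not part of the ```vectorizers``` module:

```python
def process(texts):
    # Hypothetical stand-in for a lazy pipeline: yields one item at a time
    for text in texts:
        yield text.upper()

texts = ["first document", "second document"]

lazy_docs = process(texts)
first_pass = list(lazy_docs)
second_pass = list(lazy_docs)   # the generator is already exhausted
print(len(first_pass), len(second_pass))  # 2 0

stored_docs = list(process(texts))  # materialized once, reusable
print(len(list(stored_docs)), len(list(stored_docs)))  # 2 2
```

A generator saves memory but supports a single pass, while a list costs memory and can be traversed as often as needed.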


### SpacyLemmaCountVectorizer

```SpacyLemmaCountVectorizer``` is very similar to ```SpacyTokenCountVectorizer```, but it performs lemmatization instead of tokenization.

In [24]:
spp = SpacyPipeProcessor(nlp, n_threads=1)
spacy_docs = spp(raw_documents)

slcv = SpacyLemmaCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
count_vectors = slcv.fit_transform(spacy_docs); count_vectors

<3x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

In [25]:
print(slcv.vocabulary_)

{'dog': 7, 'contain': 6, 'sentence': 16, 'the': 18, 'this': 19, 'round': 15, 'brown': 4, 'exclamation': 8, 'fox': 9, 'mark': 12, 'jump': 10, 'over': 13, 'be': 2, 'and': 1, 'a': 0, 'comma': 5, 'test': 17, 'quick': 14, 'lazy': 11, 'bracket': 3}


In [26]:
spp = SpacyPipeProcessor(nlp, n_threads=1, multi_iters=True)
spacy_docs = spp(raw_documents)

slcv = SpacyLemmaCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
slcv.fit(spacy_docs)
count_vectors = slcv.transform(spacy_docs); count_vectors

<3x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

In [27]:
print(slcv.vocabulary_)

{'dog': 7, 'contain': 6, 'sentence': 16, 'the': 18, 'this': 19, 'round': 15, 'brown': 4, 'exclamation': 8, 'fox': 9, 'mark': 12, 'jump': 10, 'over': 13, 'be': 2, 'and': 1, 'a': 0, 'comma': 5, 'test': 17, 'quick': 14, 'lazy': 11, 'bracket': 3}


## Tests

Here we'll test the classes described above in a modified version of Olivier Grisel's example from <a href="http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py">here</a>. Instead of ```LogisticRegression``` in the original example, we'll use ```LinearSVC```. These code samples show a grid search over several parameters of a text processing ```Pipeline``` on two categories of the 20 newsgroups dataset.

In [15]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

random_state = 42 

# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

Automatically created module for IPython interactive environment
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories



In [4]:
# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Automatically created module for IPython interactive environment
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 42 candidates, totalling 126 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   26.4s
[Parallel(n_jobs=-1)]: Done 126 out of 126 | elapsed:  1.4min finished


done in 84.489s

Best score: 0.943
Best parameters set:
	clf__C: 100
	vect__max_df: 1.0
	vect__ngram_range: (1, 2)
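
The "42 candidates, totalling 126 fits" line in the log follows directly from the parameter grid. A short sketch of the arithmetic, with the grid copied from the cell above and the fold count read off the log:

```python
from itertools import product

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
}
n_folds = 3  # taken from the "Fitting 3 folds" line in the log

# Every combination of parameter values is one candidate
candidates = list(product(*parameters.values()))
print(len(candidates))            # 42
print(len(candidates) * n_folds)  # 126
```

Uncommenting one of the extra parameters multiplies the candidate count by the number of its values, which is what the "combinatorial" warning in the code comment refers to.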


The best score using ```CountVectorizer``` was 94.3%. Now we will create the ```spacy_docs``` list for the customized vectorizers and perform grid searches using ```SpacyTokenCountVectorizer``` and ```SpacyLemmaCountVectorizer```. Their methods run much longer than those of ```CountVectorizer```.

In [8]:
%%time
print('Processing dataset with spaCy...')
spacy_docs = SpacyPipeProcessor(nlp, multi_iters=True, n_threads=1)(data.data)

Processing dataset with spaCy...
Wall time: 2min 19s


In [41]:
pipeline = Pipeline([
    ('vect', SpacyTokenCountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block
    
    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(spacy_docs, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Processing dataset with Spacy...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 42 candidates, totalling 126 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 11.4min
[Parallel(n_jobs=-1)]: Done 126 out of 126 | elapsed: 32.7min finished


done in 1966.337s

Best score: 0.940
Best parameters set:
	clf__C: 1
	vect__max_df: 0.75
	vect__ngram_range: (1, 1)


With ```SpacyTokenCountVectorizer``` we obtained 94.0%, with different best hyperparameters.

In [42]:
pipeline = Pipeline([
    ('vect', SpacyLemmaCountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(spacy_docs, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Processing dataset with Spacy...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 42 candidates, totalling 126 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 11.6min
[Parallel(n_jobs=-1)]: Done 126 out of 126 | elapsed: 33.5min finished


done in 2013.056s

Best score: 0.935
Best parameters set:
	clf__C: 1
	vect__max_df: 0.5
	vect__ngram_range: (1, 1)


93.5% was the best result for ```SpacyLemmaCountVectorizer```. It seems that these custom vectorizers aren't a very good choice for this particular dataset, and a more extensive hyperparameter search and preprocessing are probably needed.

### SpacyWord2VecVectorizer

```SpacyWord2VecVectorizer``` converts a ```list``` of ```Doc``` objects to their vector representations. Vectors are stored in a ```float32``` ```numpy``` array, where the number of rows equals the number of documents and the number of columns is the vector dimensionality, which depends on the ```nlp``` model used. Word vectors have 300 dimensions in this case. When the ```sparsify``` parameter is ```True```, the resulting matrix will be sparse (CSR).

**Important note:** ```SpacyWord2VecVectorizer``` is **not thread safe** at the moment.

In [4]:
spp = SpacyPipeProcessor(nlp, n_threads=1)
spacy_docs = spp(raw_documents)

w2v = SpacyWord2VecVectorizer(sparsify=True)
word_vectors = w2v.fit_transform(spacy_docs); word_vectors

<3x300 sparse matrix of type '<class 'numpy.float32'>'
	with 900 stored elements in Compressed Sparse Row format>
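
One common way to get a single vector per document is to average the vectors of its tokens, which is also what spaCy's ```Doc.vector``` does by default. Whether ```SpacyWord2VecVectorizer``` relies on it is an assumption, since its source isn't shown here. A toy sketch with made-up 3-dimensional vectors:

```python
# Toy 3-dimensional "word vectors", invented for illustration;
# the real en_core_web_md vectors have 300 dimensions
toy_vectors = {
    'quick': [1.0, 0.0, 2.0],
    'brown': [0.0, 2.0, 2.0],
    'fox':   [2.0, 1.0, 2.0],
}

def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of the given tokens."""
    rows = [vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]

doc_vector = average_vector(['quick', 'brown', 'fox'], toy_vectors)
print(doc_vector)  # [1.0, 1.0, 2.0]
```

Note that the matrix above stores 900 elements for a ```3x300``` shape, so every entry is kept and the CSR format brings no memory savings for dense word vectors like these.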

We can also use ```fit()``` and ```transform()``` methods: 

In [5]:
spp = SpacyPipeProcessor(nlp, n_threads=1)
spacy_docs = spp(raw_documents)

w2v = SpacyWord2VecVectorizer(sparsify=True)
word_vectors = w2v.fit(spacy_docs).transform(spacy_docs); word_vectors

<3x300 sparse matrix of type '<class 'numpy.float32'>'
	with 900 stored elements in Compressed Sparse Row format>

Here's a classification test with ```SpacyWord2VecVectorizer```.

In [5]:
pipeline = Pipeline([
    ('vect', SpacyWord2VecVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    #'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1) #  n_jobs=1 for thread safety

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(spacy_docs, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Processing dataset with Spacy...
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000)}
Fitting 3 folds for each of 7 candidates, totalling 21 fits


[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    7.8s finished


done in 8.457s

Best score: 0.842
Best parameters set:
	clf__C: 100


84.2% suggests that a larger hyperparameter search space is needed, together with other features such as bag of words.

### Performance Considerations

In [9]:
%%time
features = CountVectorizer().fit(data.data).transform(data.data)

Wall time: 573 ms


In [10]:
%%time
features = SpacyTokenCountVectorizer().fit(spacy_docs).transform(spacy_docs)

Wall time: 2.2 s


In [11]:
%%time
features = SpacyLemmaCountVectorizer().fit(spacy_docs).transform(spacy_docs)

Wall time: 2.35 s


In [13]:
%%time
features = SpacyWord2VecVectorizer().fit(spacy_docs).transform(spacy_docs)

Wall time: 2.28 s
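
For a quick comparison, the wall times above can be turned into slowdown factors relative to ```CountVectorizer```. The numbers are copied from the cell outputs and come from single runs, so they are only indicative:

```python
baseline = 0.573  # CountVectorizer wall time in seconds (573 ms)

# Wall times of the custom vectorizers, in seconds
timings = {
    'SpacyTokenCountVectorizer': 2.2,
    'SpacyLemmaCountVectorizer': 2.35,
    'SpacyWord2VecVectorizer': 2.28,
}

# Slowdown factor of each vectorizer relative to the baseline
ratios = {name: round(seconds / baseline, 2)
          for name, seconds in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x slower")
```

These figures exclude the one-off spaCy processing step that produced ```spacy_docs```, which took 2min 19s on its own.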


### Conclusion

In general, we see that the custom vectorizers are about 4 times slower than the original ```CountVectorizer```, even when the documents have already been processed by spaCy. This suggests that their tokenizers and lemmatizers should be used as a one-off preprocessing step before extensive hyperparameter optimization. As this <a href="https://stackoverflow.com/a/45212615">answer</a> suggests, ```CountVectorizer``` works well for vectorizing pre-tokenized or pre-lemmatized documents, since it's a faster and more memory-friendly solution. Moreover, the customized vectorizers didn't improve classification accuracy on the small subset of the 20 newsgroups dataset used here, but this isn't evidence that they are not useful in general.
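
A minimal sketch of that preprocessing approach is to pass a callable ```analyzer``` to the standard ```CountVectorizer``` so that it accepts documents that are already lists of lemmas. The ```identity_analyzer``` helper is a name made up for this example, and the lemmas are copied from the ```SpacyLemmatizer``` output earlier in this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity_analyzer(tokens):
    # Documents are already lists of lemmas, so pass them through unchanged
    return tokens

# Lemmas copied from the SpacyLemmatizer output earlier in this notebook
lemmatized_docs = [
    ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog'],
    ['this', 'be', 'a', 'test', 'sentence'],
    ['this', 'sentence', 'contain', 'exclamation', 'mark', 'comma', 'and',
     'round', 'bracket'],
]

vectorizer = CountVectorizer(analyzer=identity_analyzer)
count_vectors = vectorizer.fit_transform(lemmatized_docs)
print(count_vectors.shape)  # (3, 20)
print(count_vectors.nnz)    # 22
```

One trade-off: a callable ```analyzer``` bypasses the built-in preprocessing and n-gram extraction, so ```ngram_range``` would have no effect with this setup.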