# Scikit-learn compatible vectorizers built with the spaCy NLP framework

In this notebook I will show basic examples of how and when to use custom classes and vectorizers inspired by ```scikit-learn```'s ```CountVectorizer```. They add more accurate tokenization and lemmatization functionality with the help of the <a href='https://spacy.io/'>spaCy</a> NLP framework. Simple <a href='https://keras.io/preprocessing/text/'>Keras</a>-like punctuation removal is also supported.

Let's do the imports first.  

In [1]:
import spacy
from vectorizers import SpacyTokenizer
from vectorizers import SpacyLemmatizer
from vectorizers import SpacyPipeProcessor
from vectorizers import SpacyTokenCountVectorizer
from vectorizers import SpacyLemmaCountVectorizer
from vectorizers import SpacyWord2VecVectorizer

In [2]:
import pandas as pd
import os
from sklearn.feature_selection import chi2
import numpy as np
%matplotlib inline  

In [3]:
df = pd.read_json('./data/dfExoplanetsNASAdetected100rand_v2.json', orient = 'table')
df = df[['sent', 'label']]

In [4]:
df.head(3)

Unnamed: 0,sent,label
0,We detected visual companions within 1'' for 5...,discovery
1,Using these data and photometry from the Spitz...,discovery
2,"Of the over 800 exoplanets detected to date, o...",


Here we will load the ```en_core_web_md``` model for spaCy and create some example single-sentence documents.

In [5]:
raw_documents = df['sent'].tolist()
raw_documents

["We detected visual companions within 1'' for 5 stars, between 1'' and 2'' for 7 stars, and between 2'' and 4'' for 15 stars.",
 'Using these data and photometry from the Spitzer Space Telescope, we have identified members with infrared excess emission from circumstellar disks and have estimated the evolutionary stages of the detected disks, which include 31 new full disks and 16 new candidate transitional, evolved, evolved transitional, and debris disks.',
 'Of the over 800 exoplanets detected to date, over half are on non-circular orbits, with eccentricities as high as 0.93.',
 'We find that for these false positive scenarios, CO at 2.35 μm, CO_2 at 2.0 and 4.3 μm, and O_4 at 1.27 μm are all stronger features in transmission than O_2/O_3 and could be detected with S/Ns ≳ 3 for an Earth-size planet orbiting a nearby M dwarf star with as few as 10 transits, assuming photon-limited noise.',
 'We present two exoplanets detected at Keck Observatory.',
 'This disfavours the possibility of

In [6]:
%%time
nlp = spacy.load('en_core_web_md')

CPU times: user 15.4 s, sys: 540 ms, total: 16 s
Wall time: 16.9 s


# Example documents
raw_documents = ["The quick brown fox jumps over the lazy dog.",
                 "This is a test sentence.",
                 "This sentence contains exclamation mark, comma and (round brackets)!"]

We'll start with the helper classes for tokenization and lemmatization.

### SpacyTokenizer

```SpacyTokenizer``` uses the spaCy <a href='https://spacy.io/usage/linguistic-features#section-tokenization'>tokenizer</a> for document tokenization. When the ```join_str``` argument is set to ```None```, the call returns a generator that yields one ```list``` of strings (tokens) per document. Characters listed in the ```ignore_chars``` argument are removed from each token, and tokens that become empty are kept. You can also specify the ```batch_size``` and ```n_threads``` arguments for parallel processing of large datasets. No lowercasing is performed.

In [7]:
tokenizer = SpacyTokenizer(nlp, join_str=None, ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', 
                           batch_size=10000, n_threads=1)

tokens = tokenizer(raw_documents) # generator object is returned
for tokenized_doc in tokens:
    print(tokenized_doc)

['We', 'detected', 'visual', 'companions', 'within', '1', '', '', 'for', '5', 'stars', '', 'between', '1', '', '', 'and', '2', '', '', 'for', '7', 'stars', '', 'and', 'between', '2', '', '', 'and', '4', '', '', 'for', '15', 'stars', '']
['Using', 'these', 'data', 'and', 'photometry', 'from', 'the', 'Spitzer', 'Space', 'Telescope', '', 'we', 'have', 'identified', 'members', 'with', 'infrared', 'excess', 'emission', 'from', 'circumstellar', 'disks', 'and', 'have', 'estimated', 'the', 'evolutionary', 'stages', 'of', 'the', 'detected', 'disks', '', 'which', 'include', '31', 'new', 'full', 'disks', 'and', '16', 'new', 'candidate', 'transitional', '', 'evolved', '', 'evolved', 'transitional', '', 'and', 'debris', 'disks', '']
['Of', 'the', 'over', '800', 'exoplanets', 'detected', 'to', 'date', '', 'over', 'half', 'are', 'on', 'non', '', 'circular', 'orbits', '', 'with', 'eccentricities', 'as', 'high', 'as', '093', '']
['We', 'find', 'that', 'for', 'these', 'false', 'positive', 'scenarios', '
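The empty strings in the output above are tokens that consisted only of characters listed in ```ignore_chars```. Here is a minimal sketch of that behaviour in plain Python, assuming the characters are simply deleted from each token. The helper name ```strip_ignored``` is hypothetical, and this is not necessarily how ```SpacyTokenizer``` is implemented.

```python
IGNORE_CHARS = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


def strip_ignored(tokens, ignore_chars=IGNORE_CHARS):
    """Delete every ignored character from each token, keeping empty results."""
    table = str.maketrans('', '', ignore_chars)
    return [token.translate(table) for token in tokens]


# Hand-written tokens standing in for a tokenizer's output.
tokens = ['eccentricities', 'as', 'high', 'as', '0.93', '.']
cleaned = strip_ignored(tokens)
print(cleaned)  # ['eccentricities', 'as', 'high', 'as', '093', '']
```

Numbers lose their decimal point as well, which is consistent with ```0.93``` appearing as ```093``` in the output above.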

Here's the difference when ```join_str``` is set to a space character. ```SpacyTokenizer``` will then return one string per document, made of the joined tokens (including the empty tokens left over from punctuation-only tokens).

In [8]:
tokenizer = SpacyTokenizer(nlp, join_str=' ', n_threads=1)
tokens = tokenizer(raw_documents) # generator object is returned
for tokenized_doc in tokens:
    print(tokenized_doc)

We detected visual companions within 1   for 5 stars  between 1   and 2   for 7 stars  and between 2   and 4   for 15 stars 
Using these data and photometry from the Spitzer Space Telescope  we have identified members with infrared excess emission from circumstellar disks and have estimated the evolutionary stages of the detected disks  which include 31 new full disks and 16 new candidate transitional  evolved  evolved transitional  and debris disks 
Of the over 800 exoplanets detected to date  over half are on non  circular orbits  with eccentricities as high as 093 
We find that for these false positive scenarios  CO at 235 μm  CO2 at 20 and 43 μm  and O4 at 127 μm are all stronger features in transmission than O2O3 and could be detected with S  Ns ≳ 3 for an Earth  size planet orbiting a nearby M dwarf star with as few as 10 transits  assuming photon  limited noise 
We present two exoplanets detected at Keck Observatory 
This disfavours the possibility of GI  caused spiral structure

Finally, this example shows the usual result of tokenization and punctuation removal. Notice that you must call the ```split()``` method to obtain a list of tokens without the empty ones.

In [9]:
tokenizer = SpacyTokenizer(nlp, join_str=' ', n_threads=1)
tokens = tokenizer(raw_documents) # generator object is returned
for tokenized_doc in tokens:
    print(tokenized_doc.split())

['We', 'detected', 'visual', 'companions', 'within', '1', 'for', '5', 'stars', 'between', '1', 'and', '2', 'for', '7', 'stars', 'and', 'between', '2', 'and', '4', 'for', '15', 'stars']
['Using', 'these', 'data', 'and', 'photometry', 'from', 'the', 'Spitzer', 'Space', 'Telescope', 'we', 'have', 'identified', 'members', 'with', 'infrared', 'excess', 'emission', 'from', 'circumstellar', 'disks', 'and', 'have', 'estimated', 'the', 'evolutionary', 'stages', 'of', 'the', 'detected', 'disks', 'which', 'include', '31', 'new', 'full', 'disks', 'and', '16', 'new', 'candidate', 'transitional', 'evolved', 'evolved', 'transitional', 'and', 'debris', 'disks']
['Of', 'the', 'over', '800', 'exoplanets', 'detected', 'to', 'date', 'over', 'half', 'are', 'on', 'non', 'circular', 'orbits', 'with', 'eccentricities', 'as', 'high', 'as', '093']
['We', 'find', 'that', 'for', 'these', 'false', 'positive', 'scenarios', 'CO', 'at', '235', 'μm', 'CO2', 'at', '20', 'and', '43', 'μm', 'and', 'O4', 'at', '127', 'μm'
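The reason ```split()``` drops the empty tokens is that joining with a space turns each empty token into a run of consecutive spaces, and ```str.split()``` without arguments treats any run of whitespace as a single separator. A small sketch with hand-written tokens:

```python
tokens = ['stars', '', 'between', '1', '', '', 'and', '2']

joined = ' '.join(tokens)       # empty tokens become repeated spaces
print(repr(joined))             # 'stars  between 1   and 2'

print(joined.split())           # whitespace runs collapse, empties vanish
print(joined.split(' '))        # an explicit separator keeps the empties
```

Calling ```split(' ')``` with an explicit separator would bring the empty tokens back, so the argument-free form is the one to use here.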

### SpacyLemmatizer

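```SpacyLemmatizer``` is very similar to ```SpacyTokenizer```, but it returns lemmas instead of tokens. Most lemmas are lowercased, although, as the outputs below show, some tokens such as the proper nouns ```Spitzer``` and ```Keck``` keep their case.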

In [10]:
lemmatizer = SpacyLemmatizer(nlp, join_str=None, ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', 
                             batch_size=10000, n_threads=1)
lemmas = lemmatizer(raw_documents) # generator object is returned
for lemmatized_doc in lemmas:
    print(lemmatized_doc)

['we', 'detect', 'visual', 'companion', 'within', '1', '', '', 'for', '5', 'star', '', 'between', '1', '', '', 'and', '2', '', '', 'for', '7', 'star', '', 'and', 'between', '2', '', '', 'and', '4', '', '', 'for', '15', 'star', '']
['use', 'these', 'datum', 'and', 'photometry', 'from', 'the', 'Spitzer', 'Space', 'Telescope', '', 'we', 'have', 'identify', 'member', 'with', 'infrared', 'excess', 'emission', 'from', 'circumstellar', 'disk', 'and', 'have', 'estimate', 'the', 'evolutionary', 'stage', 'of', 'the', 'detect', 'disk', '', 'which', 'include', '31', 'new', 'full', 'disk', 'and', '16', 'new', 'candidate', 'transitional', '', 'evolve', '', 'evolve', 'transitional', '', 'and', 'debris', 'disk', '']
['of', 'the', 'over', '800', 'exoplanet', 'detect', 'to', 'date', '', 'over', 'half', 'be', 'on', 'non', '', 'circular', 'orbit', '', 'with', 'eccentricity', 'as', 'high', 'as', '093', '']
['we', 'find', 'that', 'for', 'these', 'false', 'positive', 'scenario', '', 'CO', 'at', '235', 'μm', 

In [11]:
lemmatizer = SpacyLemmatizer(nlp, join_str=' ', n_threads=1)
lemmas = lemmatizer(raw_documents) # generator object is returned
for lemmatized_doc in lemmas:
    print(lemmatized_doc)

we detect visual companion within 1   for 5 star  between 1   and 2   for 7 star  and between 2   and 4   for 15 star 
use these datum and photometry from the Spitzer Space Telescope  we have identify member with infrared excess emission from circumstellar disk and have estimate the evolutionary stage of the detect disk  which include 31 new full disk and 16 new candidate transitional  evolve  evolve transitional  and debris disk 
of the over 800 exoplanet detect to date  over half be on non - circular orbit  with eccentricity as high as 093 
we find that for these false positive scenario  CO at 235 μm  co2 at 20 and 43 μm  and o4 at 127 μm be all strong feature in transmission than o2o3 and could be detect with S  ns ≳ 3 for an earth - size planet orbit a nearby M dwarf star with as few as 10 transit  assume photon - limit noise 
we present two exoplanet detect at Keck Observatory 
this disfavour the possibility of GI - cause spiral structure in system with qlt025 be detect in relativ

In [12]:
lemmatizer = SpacyLemmatizer(nlp, join_str=' ', n_threads=1)
lemmas = lemmatizer(raw_documents) # generator object is returned
for lemmatized_doc in lemmas:
    print(lemmatized_doc.split())

['we', 'detect', 'visual', 'companion', 'within', '1', 'for', '5', 'star', 'between', '1', 'and', '2', 'for', '7', 'star', 'and', 'between', '2', 'and', '4', 'for', '15', 'star']
['use', 'these', 'datum', 'and', 'photometry', 'from', 'the', 'Spitzer', 'Space', 'Telescope', 'we', 'have', 'identify', 'member', 'with', 'infrared', 'excess', 'emission', 'from', 'circumstellar', 'disk', 'and', 'have', 'estimate', 'the', 'evolutionary', 'stage', 'of', 'the', 'detect', 'disk', 'which', 'include', '31', 'new', 'full', 'disk', 'and', '16', 'new', 'candidate', 'transitional', 'evolve', 'evolve', 'transitional', 'and', 'debris', 'disk']
['of', 'the', 'over', '800', 'exoplanet', 'detect', 'to', 'date', 'over', 'half', 'be', 'on', 'non', '-', 'circular', 'orbit', 'with', 'eccentricity', 'as', 'high', 'as', '093']
['we', 'find', 'that', 'for', 'these', 'false', 'positive', 'scenario', 'CO', 'at', '235', 'μm', 'co2', 'at', '20', 'and', '43', 'μm', 'and', 'o4', 'at', '127', 'μm', 'be', 'all', 'strong'

### SpacyTokenCountVectorizer

```SpacyTokenCountVectorizer``` inherits from ```scikit-learn```'s ```CountVectorizer``` to enable tokenization with ```spaCy``` models. Its ```fit()```, ```fit_transform()``` and ```transform()``` methods accept an iterable of <a href='https://spacy.io/api/doc'>Doc</a> objects as the ```spacy_docs``` parameter (```X``` in ```scikit-learn```). This iterable can be obtained from the ```SpacyPipeProcessor``` class.

In [13]:
spp = SpacyPipeProcessor(nlp, n_threads=1)  # creates iterable of spaCy Doc objects
spacy_docs = spp(raw_documents)

In this example we can see that the result of ```SpacyTokenCountVectorizer```'s ```fit_transform()``` method is a CSR sparse matrix, just like the one a standard ```CountVectorizer``` returns.

In [None]:
stcv = SpacyTokenCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
count_vectors = stcv.fit_transform(spacy_docs); count_vectors

<100x980 sparse matrix of type '<class 'numpy.int64'>'
	with 2361 stored elements in Compressed Sparse Row format>

In [None]:
print(stcv.vocabulary_)

{'we': 944, 'detected': 256, 'visual': 934, 'companions': 207, 'within': 958, '1': 7, 'for': 357, '5': 43, 'stars': 821, 'between': 144, 'and': 100, '2': 23, '7': 48, '4': 40, '15': 18, 'using': 920, 'these': 872, 'data': 240, 'photometry': 651, 'from': 364, 'the': 869, 'spitzer': 817, 'space': 808, 'telescope': 863, 'have': 400, 'identified': 426, 'members': 531, 'with': 957, 'infrared': 442, 'excess': 321, 'emission': 303, 'circumstellar': 193, 'disks': 272, 'estimated': 310, 'evolutionary': 316, 'stages': 819, 'of': 603, 'which': 950, 'include': 434, '31': 37, 'new': 576, 'full': 365, '16': 19, 'candidate': 168, 'transitional': 892, 'evolved': 317, 'debris': 245, 'over': 628, '800': 57, 'exoplanets': 325, 'to': 885, 'date': 241, 'half': 395, 'are': 108, 'on': 607, 'non': 581, 'circular': 189, 'orbits': 618, 'eccentricities': 296, 'as': 114, 'high': 405, '093': 5, 'find': 348, 'that': 868, 'false': 338, 'positive': 662, 'scenarios': 757, 'co': 198, 'at': 121, '235': 34, 'μm': 972, 'c
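The integers in ```vocabulary_``` are column indices of the count matrix. ```CountVectorizer``` assigns them after sorting the terms, which is consistent with digit-only terms such as ```'093'``` and ```'1'``` getting small indices above. Below is a rough pure-Python sketch of the idea on pre-tokenized documents. The function names are hypothetical, and the real class additionally handles n-grams, ```max_df```, sparse storage and more.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct lowercased token to its rank in sorted order."""
    terms = sorted({token.lower() for doc in tokenized_docs for token in doc})
    return {term: index for index, term in enumerate(terms)}


def count_matrix(tokenized_docs, vocabulary):
    """Return one dense row of term counts per document."""
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for token in doc:
            row[vocabulary[token.lower()]] += 1
        rows.append(row)
    return rows


docs = [['We', 'detected', '5', 'stars'],
        ['We', 'present', 'two', 'exoplanets', 'detected']]
vocabulary = build_vocabulary(docs)
print(vocabulary)
print(count_matrix(docs, vocabulary))
```

Each row has one column per vocabulary entry, so most cells are zero on a real corpus. That is why the vectorizers return a sparse matrix rather than nested lists.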

If you initialize a ```SpacyPipeProcessor``` object with the ```multi_iters``` parameter set to ```True```, the result of its ```__call__``` method will be a list of ```Doc``` objects instead of a single ```generator```. This allows you to iterate over the returned objects multiple times if needed.
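The difference matters because a generator can be consumed only once. The sketch below shows this with a plain Python generator, assuming the default return value of ```SpacyPipeProcessor``` behaves like one. ```make_docs``` is a hypothetical stand-in for the spaCy pipeline.

```python
def make_docs(texts):
    """Yield processed documents lazily, one at a time."""
    for text in texts:
        yield text.upper()  # stand-in for an expensive NLP pipeline


texts = ['first document', 'second document']

lazy_docs = make_docs(texts)
first_pass = list(lazy_docs)
second_pass = list(lazy_docs)    # the generator is already exhausted
print(len(first_pass), len(second_pass))   # 2 0

stored_docs = list(make_docs(texts))       # materialize once, reuse freely
print(len(list(stored_docs)), len(list(stored_docs)))   # 2 2
```

Calling ```fit()``` and then ```transform()``` on the same documents, or running cross-validation, iterates over the data more than once, so a list is needed there. The price is that all documents stay in memory.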

In [None]:
spp = SpacyPipeProcessor(nlp, n_threads=1, multi_iters=True)
spacy_docs = spp(raw_documents)

stcv = SpacyTokenCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
stcv.fit(spacy_docs)
count_vectors = stcv.transform(spacy_docs); count_vectors

<100x980 sparse matrix of type '<class 'numpy.int64'>'
	with 2361 stored elements in Compressed Sparse Row format>

In [None]:
print(stcv.vocabulary_)

{'we': 944, 'detected': 256, 'visual': 934, 'companions': 207, 'within': 958, '1': 7, 'for': 357, '5': 43, 'stars': 821, 'between': 144, 'and': 100, '2': 23, '7': 48, '4': 40, '15': 18, 'using': 920, 'these': 872, 'data': 240, 'photometry': 651, 'from': 364, 'the': 869, 'spitzer': 817, 'space': 808, 'telescope': 863, 'have': 400, 'identified': 426, 'members': 531, 'with': 957, 'infrared': 442, 'excess': 321, 'emission': 303, 'circumstellar': 193, 'disks': 272, 'estimated': 310, 'evolutionary': 316, 'stages': 819, 'of': 603, 'which': 950, 'include': 434, '31': 37, 'new': 576, 'full': 365, '16': 19, 'candidate': 168, 'transitional': 892, 'evolved': 317, 'debris': 245, 'over': 628, '800': 57, 'exoplanets': 325, 'to': 885, 'date': 241, 'half': 395, 'are': 108, 'on': 607, 'non': 581, 'circular': 189, 'orbits': 618, 'eccentricities': 296, 'as': 114, 'high': 405, '093': 5, 'find': 348, 'that': 868, 'false': 338, 'positive': 662, 'scenarios': 757, 'co': 198, 'at': 121, '235': 34, 'μm': 972, 'c

### SpacyLemmaCountVectorizer

```SpacyLemmaCountVectorizer``` is very similar to ```SpacyTokenCountVectorizer```, but it performs lemmatization instead of tokenization.

In [None]:
spp = SpacyPipeProcessor(nlp, n_threads=1)
spacy_docs = spp(raw_documents);

slcv = SpacyLemmaCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
count_vectors = slcv.fit_transform(spacy_docs); count_vectors

<100x881 sparse matrix of type '<class 'numpy.int64'>'
	with 2320 stored elements in Compressed Sparse Row format>

In [None]:
print(slcv.vocabulary_)

{'we': 847, 'detect': 239, 'visual': 839, 'companion': 194, 'within': 860, '1': 7, 'for': 329, '5': 43, 'star': 740, 'between': 139, 'and': 99, '2': 23, '7': 48, '4': 40, '15': 18, 'use': 825, 'these': 785, 'datum': 226, 'photometry': 587, 'from': 335, 'the': 782, 'spitzer': 737, 'space': 729, 'telescope': 777, 'have': 369, 'identify': 390, 'member': 478, 'with': 859, 'infrared': 404, 'excess': 296, 'emission': 280, 'circumstellar': 182, 'disk': 252, 'estimate': 287, 'evolutionary': 291, 'stage': 739, 'of': 544, 'which': 852, 'include': 398, '31': 37, 'new': 519, 'full': 336, '16': 19, 'candidate': 159, 'transitional': 803, 'evolve': 292, 'debris': 230, 'over': 566, '800': 57, 'exoplanet': 298, 'to': 796, 'date': 225, 'half': 364, 'be': 134, 'on': 548, 'non': 524, 'circular': 178, 'orbit': 556, 'eccentricity': 274, 'as': 112, 'high': 374, '093': 5, 'find': 321, 'that': 781, 'false': 311, 'positive': 597, 'scenario': 684, 'co': 186, 'at': 119, '235': 34, 'μm': 873, 'co2': 187, '20': 24,

In [None]:
spp = SpacyPipeProcessor(nlp, n_threads=1, multi_iters=True)
spacy_docs = spp(raw_documents);

slcv = SpacyLemmaCountVectorizer(ignore_chars='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
slcv.fit(spacy_docs)
count_vectors = slcv.transform(spacy_docs); count_vectors

<100x881 sparse matrix of type '<class 'numpy.int64'>'
	with 2320 stored elements in Compressed Sparse Row format>

In [None]:
print(slcv.vocabulary_)

{'we': 847, 'detect': 239, 'visual': 839, 'companion': 194, 'within': 860, '1': 7, 'for': 329, '5': 43, 'star': 740, 'between': 139, 'and': 99, '2': 23, '7': 48, '4': 40, '15': 18, 'use': 825, 'these': 785, 'datum': 226, 'photometry': 587, 'from': 335, 'the': 782, 'spitzer': 737, 'space': 729, 'telescope': 777, 'have': 369, 'identify': 390, 'member': 478, 'with': 859, 'infrared': 404, 'excess': 296, 'emission': 280, 'circumstellar': 182, 'disk': 252, 'estimate': 287, 'evolutionary': 291, 'stage': 739, 'of': 544, 'which': 852, 'include': 398, '31': 37, 'new': 519, 'full': 336, '16': 19, 'candidate': 159, 'transitional': 803, 'evolve': 292, 'debris': 230, 'over': 566, '800': 57, 'exoplanet': 298, 'to': 796, 'date': 225, 'half': 364, 'be': 134, 'on': 548, 'non': 524, 'circular': 178, 'orbit': 556, 'eccentricity': 274, 'as': 112, 'high': 374, '093': 5, 'find': 321, 'that': 781, 'false': 311, 'positive': 597, 'scenario': 684, 'co': 186, 'at': 119, '235': 34, 'μm': 873, 'co2': 187, '20': 24,

## Tests

Here we'll test the classes described above in a modified version of Olivier Grisel's example from <a href="http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py">here</a>. Instead of ```LogisticRegression``` in the original example, we'll use ```LinearSVC```. These code samples show a grid search over several parameters of a text-processing ```Pipeline``` on two categories of the 20 newsgroups dataset.

In [None]:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

random_state = 42 

# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

Automatically created module for IPython interactive environment
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
857 documents
2 categories



In [None]:
# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 42 candidates, totalling 126 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done 126 out of 126 | elapsed:   24.0s finished


done in 25.309s

Best score: 0.943
Best parameters set:
	clf__C: 100
	vect__max_df: 1.0
	vect__ngram_range: (1, 2)
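The log line "Fitting 3 folds for each of 42 candidates, totalling 126 fits" follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. Here is a small sketch that enumerates the same grid with ```itertools.product```. It only illustrates the counting and is not how ```GridSearchCV``` is implemented internally.

```python
from itertools import product

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
}

# One candidate per element of the Cartesian product of all value tuples.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_folds = 3  # number of folds reported in the log above
print(len(candidates), len(candidates) * n_folds)   # 42 126
```

Uncommenting further entries in ```parameters``` multiplies the candidate count, which is the combinatorial growth the code comment warns about.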


The best score using ```CountVectorizer``` was 94.3%. Now we will create a ```spacy_docs``` list for the custom vectorizers and perform grid searches using ```SpacyTokenCountVectorizer``` and ```SpacyLemmaCountVectorizer```. Their methods take much longer to run than those of ```CountVectorizer```.

In [None]:
%%time
print('Processing dataset with spaCy...')
spacy_docs = SpacyPipeProcessor(nlp, multi_iters=True, n_threads=1)(data.data)

Processing dataset with spaCy...
CPU times: user 56.3 s, sys: 16.1 s, total: 1min 12s
Wall time: 49 s


In [None]:
pipeline = Pipeline([
    ('vect', SpacyTokenCountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block
    
    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(spacy_docs, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 3 folds for each of 42 candidates, totalling 126 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 10.0min


With ```SpacyTokenCountVectorizer``` we obtained a best score of 94%, with different best hyperparameters.

In [None]:
pipeline = Pipeline([
    ('vect', SpacyLemmaCountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(spacy_docs, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

93.5% was the best result for ```SpacyLemmaCountVectorizer```. It seems that these custom vectorizers aren't a very good choice for this particular dataset, and a more extensive hyperparameter search and more preprocessing are probably needed.

### SpacyWord2VecVectorizer

```SpacyWord2VecVectorizer``` converts a ```list``` of ```Doc``` objects to their vector representations. Vectors are stored in a ```float32``` ```numpy``` array, where the number of rows equals the number of documents and the number of columns is the vector dimensionality, which depends on the ```nlp``` model used. Word vectors have 300 dimensions in this case. When the ```sparsify``` parameter is ```True```, the resulting matrix will be sparse (CSR).

**Important note:** ```SpacyWord2VecVectorizer``` is **not thread safe** at the moment.
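A common way to get one vector per document from word vectors is to average the vectors of its tokens, and spaCy's ```Doc.vector``` defaults to such an average. Whether ```SpacyWord2VecVectorizer``` does exactly this is an assumption here. The sketch below uses toy 3-dimensional vectors and hypothetical names, and as a simplification it skips tokens that have no vector.

```python
# Toy 3-dimensional word vectors; real models use e.g. 300 dimensions.
WORD_VECTORS = {
    'exoplanet': [1.0, 0.0, 2.0],
    'detected':  [0.0, 2.0, 2.0],
    'star':      [2.0, 4.0, 2.0],
}


def doc_vector(tokens, vectors=WORD_VECTORS, dim=3):
    """Average the vectors of known tokens; all zeros if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


docs = [['exoplanet', 'detected'], ['star'], ['unknown']]
matrix = [doc_vector(doc) for doc in docs]   # one row per document
print(matrix)
```

The result has one row per document and one column per vector dimension, which matches the shape described above.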

In [None]:
spp = SpacyPipeProcessor(nlp, n_threads=1)
spacy_docs = spp(raw_documents)

w2v = SpacyWord2VecVectorizer(sparsify=True)
word_vectors = w2v.fit_transform(spacy_docs); word_vectors

We can also use ```fit()``` and ```transform()``` methods: 

In [None]:
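# multi_iters=True returns a list, so fit() and transform() can both iterate over it
spp = SpacyPipeProcessor(nlp, n_threads=1, multi_iters=True)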
spacy_docs = spp(raw_documents)

w2v = SpacyWord2VecVectorizer(sparsify=True)
word_vectors = w2v.fit(spacy_docs).transform(spacy_docs); word_vectors

Here's a classification test with ```SpacyWord2VecVectorizer```.

In [None]:
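# spacy_docs was overwritten above with the example sentences,
# so rebuild it from the 20 newsgroups texts to match data.target
spacy_docs = SpacyPipeProcessor(nlp, multi_iters=True, n_threads=1)(data.data)

pipeline = Pipeline([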
    ('vect', SpacyWord2VecVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC(random_state=random_state))
])

parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    #'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__C': (0.001, 0.01, 0.1, 1, 10, 100, 1000),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1) #  n_jobs=1 for thread safety

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(spacy_docs, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

84.2% suggests that a larger hyperparameter search space is needed, together with other features such as bag of words.

### Performance Considerations

In [None]:
%%time
features = CountVectorizer().fit(data.data).transform(data.data)

In [None]:
%%time
features = SpacyTokenCountVectorizer().fit(spacy_docs).transform(spacy_docs)

In [None]:
%%time
features = SpacyLemmaCountVectorizer().fit(spacy_docs).transform(spacy_docs)

In [None]:
%%time
features = SpacyWord2VecVectorizer().fit(spacy_docs).transform(spacy_docs)

### Conclusion

In general, we see that the custom vectorizers are about 4 times slower than the original ```CountVectorizer```. This suggests that their tokenizers and lemmatizers are better used as a one-time preprocessing step before extensive hyperparameter optimization. As this <a href="https://stackoverflow.com/a/45212615">answer</a> suggests, ```CountVectorizer``` works well for vectorizing pre-tokenized or pre-lemmatized documents, since it's a faster and more memory-friendly solution. Moreover, the custom vectorizers didn't improve classification performance on the small subset of the 20 newsgroups dataset used here, but this isn't evidence that they are not useful in general.
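One way to feed pre-lemmatized documents to a plain ```CountVectorizer``` is to pass a callable ```analyzer``` that returns each token list unchanged. The sketch below uses hand-written lemmas in place of real spaCy output, and it is only one possible setup, not necessarily the one from the linked answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pretend these lemmas were produced once, up front, by a spaCy-based step.
lemmatized_docs = [
    ['we', 'detect', 'visual', 'companion'],
    ['we', 'present', 'two', 'exoplanet'],
]

# A callable analyzer receives each document as-is and returns its terms,
# so passing the token lists through unchanged skips sklearn's own tokenizer.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vectorizer.fit_transform(lemmatized_docs)

print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```

With a callable ```analyzer```, options such as ```lowercase``` and ```ngram_range``` no longer apply, so any n-gram generation would have to happen inside the callable.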