# Introduction to NLP with Python spaCy: analyzing restaurant reviews


_This work is licensed under a [Creative Commons BY-SA 4.0 License](http://creativecommons.org/licenses/by-sa/4.0/)_

<br><br><br><br>
__Daniel Kapitan__<br>
`e. d.kapitan@jads.nl`<br>
`w. https://kapitan.net`<br>



<img style="float: left" src="https://github.com/jads-nl/public-lectures/blob/main/logos/jads-gold-250x60.png?raw=true">

## The challenge: predict the restaurant review score with the review text

Thanks to [the people at analyticslab.nl](https://www.theanalyticslab.nl/about-us/) we will use a restaurant review dataset as our introduction to NLP with spaCy. The dataset contains nearly 370,000 reviews collected over an eight-year period, which they scraped from the original review website (I think it is iens.nl). We will follow along with [their blog post series](https://www.theanalyticslab.nl/nlpblogs_0_preparing_restaurant_review_data_for_nlp_and_predictive_modeling/), replacing their R code with a workflow in Python spaCy.

In this notebook we compare different NLP techniques to show how to get valuable information from unstructured text. Given the restaurant reviews, the challenge is whether they can be used to:
- Extract more structured information from the review texts without having to read all the reviews one by one
- Extract actionable feedback for the restaurant staff
- Predict the review score using the text

Note that the original blog posts aimed to predict the next Michelin star. We will leave that as an extra challenge for you to try ...

The following tools and techniques will be demonstrated:

- How to set up a reproducible text pipeline in Python spaCy for text analysis;
- How to apply [topic modeling](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) as the primary tool to extract information from the review texts, which is then combined with predictive modeling techniques to arrive at our predictions
 

## Data preparation

Before we delve into the analytical side of things, we need some prepared textual data. As all true data scientists know, proper data preparation takes most of your time and largely determines the quality of the analysis results. Preparing textual data is a different cup of tea from preparing structured numeric or categorical data, so we also want to show how we cleaned and prepared the data we gathered. In this notebook we therefore start with the data dump of all reviews, then explore and prepare it in a number of steps:

![](https://bhciaaablob.blob.core.windows.net/thefork/Text%20preprocessing%20pipeline_noheader.png)

In [1]:
import re

import pandas as pd
import pendulum


# # not needed for this notebook, required for uploading data to GitHub in smaller files < 25 MB
# REVIEWS = (
#     "https://bhciaaablob.blob.core.windows.net/cmotionsnlpblogs/RestoReviewRawdata.csv"
# )
# resto = pd.read_csv(REVIEWS, decimal=",")
# resto['reviewYear'] = resto.reviewDate.str[-4:].astype('float').astype('Int64')
# resto.to_parquet('data/restaurant-reviews', partition_cols=['reviewYear'])


raw_reviews = pd.read_parquet("data/restaurant-reviews")
raw_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 368529 entries, 0 to 368528
Data columns (total 25 columns):
 #   Column               Non-Null Count   Dtype   
---  ------               --------------   -----   
 0   restoId              368529 non-null  int64   
 1   restoName            368529 non-null  object  
 2   tags                 356517 non-null  object  
 3   address              368529 non-null  object  
 4   scoreTotal           348081 non-null  float64 
 5   avgPrice             336224 non-null  object  
 6   numReviews           368529 non-null  int64   
 7   scoreFood            347984 non-null  float64 
 8   scoreService         347984 non-null  float64 
 9   scoreDecor           348037 non-null  float64 
 10  review_id            368529 non-null  float64 
 11  numreviews2          368529 non-null  float64 
 12  valueForPriceScore   296774 non-null  object  
 13  noiseLevelScore      296902 non-null  object  
 14  waitingTimeScore     296902 non-null  object  
 15  

Let's look at some `reviewText`.

In [2]:
raw_reviews.reviewText.head()

0    b'We komen al meer dan 8 jaar in dit restauran...
1    b'Een werkelijk prachtige ijssalon,blinkende u...
2    b'Naast dat men hier heerlijk grieks eten heef...
3    b'Via de Sweetdeal genoten van het 3 gangenkeu...
4    b'Vakantieveiling is een leuk ding om restaura...
Name: reviewText, dtype: object

Ok, there's clearly some cleaning to be done here.

First of all, the available texts are all wrapped in "b'...'", indicating the texts are byte literals. You might also spot some strange character sequences, as in 'ingredi\xc3\xabnten', indicating that our texts include UTF-8 encoded characters (here, the character ë, which is \xc3\xab in UTF-8). This combination of byte-literal wrapping and UTF-8 escape codes shows that the encoding got messed up a bit when the source data was created, making it difficult to obtain the review texts from the encoded data. We won't go into too much detail here (if you want, read [this](https://diveintopython3.net/strings.html)), but you might run into similar issues when you start working with textual data. In short, there are different encoding types and you need to know which one you are working with. We need to make sure we use the right encoding, and we should get rid of the "b'...'" wrapper in the strings.

We could spend some time figuring out how to undo this encoding mix-up as well as possible. However, to avoid spending too much time and effort on it, we take a shortcut with minimal loss of data and clean the texts with some regular expressions. Depending on your goal, you might want to go the extra mile and try to restore the texts to their original UTF-8 encoding, though! As so often in data science projects, we're struggling with available time and resources: you need to pick your battles, and pick them wisely!
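For readers who do want to go the extra mile, one possible approach is to parse each string back into a real `bytes` object and decode it as UTF-8. The sketch below assumes every value is a valid Python byte literal stored as text. `restore_utf8` is a hypothetical helper name, not part of the original workflow, and it falls back to an empty string when parsing fails.

```python
import ast


def restore_utf8(raw):
    """Turn the text "b'...'" back into a decoded string (hypothetical helper)."""
    try:
        # literal_eval safely parses the byte literal, including its escape codes
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return ""
    if not isinstance(value, bytes):
        return ""
    # invalid byte sequences are replaced rather than raising an error
    return value.decode("utf-8", errors="replace")


example = r"b'verse ingredi\xc3\xabnten'"
print(restore_utf8(example))
```

Unlike a pattern that only looks for single quotes, this also accepts byte literals wrapped in double quotes, at the cost of parsing every string.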

Do we have other things to cover? To get a better understanding of our data, let's check the most frequent, identical review texts:

In [3]:
raw_reviews.reviewText.value_counts(normalize=True).head()

b'- Recensie is momenteel in behandeling -'    0.003937
b'Heerlijk gegeten!'                           0.001037
b'Heerlijk gegeten'                            0.000795
b'Heerlijk gegeten.'                           0.000448
b'Top'                                         0.000293
Name: reviewText, dtype: float64

Ok, several things to solve here:

- About 3% of all reviews have no review text so they are not useful and we can delete those.
- Another 0.4% has the value "b'- Recensie is momenteel in behandeling -'" (in English: the review is currently being processed), so the actual review text is not published yet. As with empty reviews, we can delete these.
- Several reviews are very short and not that helpful when trying to learn from the review text. This is very context dependent: for sentiment analysis, short reviews like 'Top!' (English: Top!), 'Prima' (English: Fine/OK) and 'Heerlijk gegeten' (English: Had a nice meal) might still have much value. Nevertheless, we will set a minimum length for reviews.

We will deal with punctuation later in spaCy.

### Pattern matching with regex

In [4]:
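import re


# compiled once; captures everything between the leading b' and the last '
BYTESTRING_PATTERN = re.compile(r"^b'(.*)'")


def fix_bytestring(string):
    """Strip the b'...' wrapper from a stringified byte literal.

    Escape codes inside the text are left in place.
    """
    match = BYTESTRING_PATTERN.search(string)
    # strings not wrapped in b'...' (for example b"...") yield an empty string
    return match[1] if match else ""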
    

#TO DO: when reading csv data, error in fix_bytestring function
# raw_reviews = pd.read_csv('data/RestoReviewRawdata.csv', encoding='utf-8')

In [5]:
reviews = raw_reviews.loc[:, ['restoId', 'reviewerId', 'review_id', 'reviewerFame', 'reviewerNumReviews']].copy()
reviews['reviewText'] = raw_reviews.reviewText.apply(fix_bytestring)
reviews.reviewText.head()

0    We komen al meer dan 8 jaar in dit restaurant ...
1    Een werkelijk prachtige ijssalon,blinkende uit...
2    Naast dat men hier heerlijk grieks eten heeft,...
3    Via de Sweetdeal genoten van het 3 gangenkeuze...
4    Vakantieveiling is een leuk ding om restaurant...
Name: reviewText, dtype: object

In [6]:
def validate_review(review):
    if review == '- Recensie is momenteel in behandeling -' or len(review) < 4:
        return 0
    else:
        return 1
    
reviews['is_valid'] = reviews.reviewText.apply(validate_review)
reviews[reviews.is_valid==0]['reviewText'].value_counts(normalize=True).head(10)

                                            0.956331
- Recensie is momenteel in behandeling -    0.035183
Top                                         0.002619
.                                           0.000970
Nvt                                         0.000679
-                                           0.000485
..                                          0.000291
Kip                                         0.000218
nvt                                         0.000218
Ok                                          0.000218
Name: reviewText, dtype: float64

So that looks OK, we can safely delete `is_valid == 0` reviews later. Let's do some more data prep.

### Parse localized datestrings with `pendulum`

In [7]:
import pendulum


pendulum.set_locale('nl')
pendulum.date(2021, 2, 12).format('D MMM YYYY')  # example

'12 feb. 2021'

In [8]:
def parse_date(date):
    return pendulum.from_format(date, fmt='D MMM YYYY', locale='nl')

reviews['reviewDate'] = raw_reviews.reviewDate.apply(parse_date).dt.date

In [9]:
reviews.reviewDate.head()

0    2012-09-19
1    2012-07-12
2    2012-11-29
3    2012-12-13
4    2012-10-19
Name: reviewDate, dtype: object
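If `pendulum` is not available, the same parsing can be approximated with the standard library. This is a sketch under the assumption that the raw dates look like `'19 sep. 2012'` (day, abbreviated Dutch month with an optional trailing dot, year). The `DUTCH_MONTHS` table and `parse_dutch_date` are illustrative names, not part of the original notebook.

```python
from datetime import date

# Assumed abbreviations, matching the 'feb.' style shown above
DUTCH_MONTHS = {
    "jan": 1, "feb": 2, "mrt": 3, "apr": 4, "mei": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12,
}


def parse_dutch_date(text):
    """Parse strings like '19 sep. 2012' into a datetime.date (illustrative helper)."""
    day, month, year = text.split()
    month_number = DUTCH_MONTHS[month.rstrip(".").lower()]
    return date(int(year), month_number, int(day))


print(parse_dutch_date("19 sep. 2012"))
```

Month names spelled differently from the table raise a `KeyError`, which makes unexpected formats easy to spot.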

### Format numerical columns

In [10]:
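# avgPrice has whitespace and euro character; keep only the trailing number
def clean_price(string):
    # missing values (None or NaN) and empty strings become None
    if isinstance(string, str) and string:
        return string.split(" ")[-1]
    else:
        return None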


reviews["avgPrice"] = raw_reviews["avgPrice"].apply(clean_price)

In [11]:
# turn categorical columns into ordinal values, lower is better
# note to Dutch audience: do you think the ordinal order is sensible and correct?
map_scores = {
    "waitingTimeScore": {
        "Hoog tempo": 1,
        "Kort": 2,
        "Redelijk": 3,
        "Kan beter": 4,
        "Lang": 5,
    },
    "valueForPriceScore": {
        "Erg gunstig": 1,
        "Gunstig": 2,
        "Redelijk": 3,
        "Precies goed": 4,
        "Kan beter": 5,
    },
    "noiseLevelScore": {
        "Erg rustig": 1,
        "Rustig": 2,
        "Precies goed": 3,
        "Rumoerig": 4,
    },
}

for col, mapping in map_scores.items():
    # unmapped or missing categories become <NA> in the nullable integer column
    reviews[col] = raw_reviews[col].map(mapping).astype("Int64")

In [12]:
# numerical columns have comma as decimal separator --> cast to floats
numerical_cols = [
    "scoreFood",
    "scoreService",
    "scoreDecor",
    "reviewScoreOverall",
    "scoreTotal",
]
for col in numerical_cols:
    reviews[col] = pd.to_numeric(raw_reviews[col])

In [13]:
reviews.head()

Unnamed: 0,restoId,reviewerId,review_id,reviewerFame,reviewerNumReviews,reviewText,is_valid,reviewDate,avgPrice,waitingTimeScore,valueForPriceScore,noiseLevelScore,scoreFood,scoreService,scoreDecor,reviewScoreOverall,scoreTotal
0,236127,111373143.0,20.0,Fijnproever,4.0,We komen al meer dan 8 jaar in dit restaurant ...,1,2012-09-19,35.0,,,,8.6,8.4,7.2,8.5,8.4
1,246631,111355027.0,11.0,Meesterproever,21.0,"Een werkelijk prachtige ijssalon,blinkende uit...",1,2012-07-12,,,,,8.2,7.6,8.0,10.0,8.0
2,243427,112961389.0,3.0,Expertproever,9.0,"Naast dat men hier heerlijk grieks eten heeft,...",1,2012-11-29,,,,,,,,8.0,
3,234077,111347867.0,107.0,Meesterproever,97.0,Via de Sweetdeal genoten van het 3 gangenkeuze...,1,2012-12-13,45.0,,,,8.0,8.0,7.6,7.0,7.9
4,240845,112167929.0,14.0,Meesterproever,40.0,Vakantieveiling is een leuk ding om restaurant...,1,2012-10-19,43.0,,,,7.3,7.6,7.4,8.5,7.4


### Exercise: perform exploratory data analysis

Prior to diving into NLP with spaCy, perform an EDA to explore possible correlations:
- reviewer type vs. given scores
- length of reviews vs. scores
- value-for-money vs

Learning objective:
- Always do a short EDA before getting lost in details ...
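As a starting point for the exercise, the first two comparisons can be sketched with plain pandas. The rows below are made up to mimic the columns of `reviews` and are not real data; on the actual dataset you would apply the same calls to `reviews` directly.

```python
import pandas as pd

# Made-up rows that mimic the columns of `reviews`; not real data
toy = pd.DataFrame(
    {
        "reviewerFame": ["Fijnproever", "Meesterproever", "Fijnproever", "Meesterproever"],
        "reviewText": ["Top", "Heerlijk gegeten", "Lange wachttijd en koud eten", "Prima"],
        "reviewScoreOverall": [9.0, 8.0, 5.0, 7.0],
    }
)

# reviewer type vs. given score
score_by_fame = toy.groupby("reviewerFame")["reviewScoreOverall"].mean()

# length of review vs. score (Pearson correlation, the pandas default)
toy["text_length"] = toy["reviewText"].str.len()
length_score_corr = toy["text_length"].corr(toy["reviewScoreOverall"])

print(score_by_fame)
print(f"Correlation between length and score: {length_score_corr:.2f}")
```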

## Getting started with spaCy

To develop reproducible pipelines, we will follow the recommended workflow from spaCy.

![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps; this is also referred to as the __processing pipeline__. The pipeline used by the [trained pipelines](https://spacy.io/models) typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed `Doc`, which is then passed on to the next component.

The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in `nlp.pipe_names`. The reason is that there can only really be one tokenizer, and while all other pipeline components take a `Doc` and return it, the tokenizer takes a __string of text__ and turns it into a `Doc`. You can still customize the tokenizer, though. `nlp.tokenizer` is writable, so you can either create your own [`Tokenizer` class from scratch](https://spacy.io/usage/linguistic-features#native-tokenizers), or even replace it with an [entirely custom function](https://spacy.io/usage/linguistic-features#custom-tokenizer).
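The special status of the tokenizer can be checked with a blank pipeline, which needs no model download. This is a minimal sketch: `blank_nlp` is an illustrative name, and a blank Dutch pipeline only tokenizes, so it stands in for the trained model purely to show the mechanics.

```python
import spacy

# A blank Dutch pipeline: tokenizer only, no trained components
blank_nlp = spacy.blank("nl")

# The tokenizer is not listed among the pipeline components
print(blank_nlp.pipe_names)

# ...yet calling the pipeline still turns a string into a Doc of tokens
doc = blank_nlp("Heerlijk gegeten!")
print([token.text for token in doc])
```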

We will use the large Dutch model which is 546 MB in size. The download command needs to be run once on your system. You may want to restart your Jupyter Notebook kernel to ensure spaCy is loaded properly with the newly downloaded model.

In [14]:
# !python -m spacy download nl_core_news_lg

In [15]:
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy


nlp = spacy.load("nl_core_news_lg")

We will show different options of tokenizing, see [this blog by Dataquest](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/) for more details.

### Documents in spaCy

In [18]:
# create spaCy doc from one of the reviews
example_doc = nlp(reviews.reviewText[100])

# Doc object has 51 attributes and methods
print(f"Methods and attributes of spaCy Doc object: {[attr for attr in dir(example_doc) if not '__' in attr]}")

Methods and attributes of spaCy Doc object: ['_', '_bulk_merge', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'to_utf8_array', 'user_data', 'user_hooks', 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']


In [23]:
spacy.displacy.render(nlp(reviews.reviewText[10]), style='dep')

In [19]:
# visualize named entities
spacy.displacy.render(example_doc, style='ent')

In [20]:
# visualize dependencies
spacy.displacy.render(example_doc, style='dep')

### Tokens in spaCy

In [25]:
# example_doc holds all tokens
print(f"Length of example doc: {len(example_doc)}")

# which you can access as a list
print(f"First token example doc: {example_doc[0]}")

# each token is a spaCy Token object with 90 methods and attributes
token_attrs = [attr for attr in dir(example_doc[0]) if not '__' in attr]
print(f"Each token is a {type(example_doc[0])} with 90 attributes")
print(token_attrs)

Length of example doc: 84
First token example doc: We
Each token is a <class 'spacy.tokens.token.Token'> with 90 attributes
['_', 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', '

In [37]:
# let's view token attributes we want to use
pd.DataFrame(
    [
        (
            token.text,
            token.lower_,
            token.lemma_,
            token.shape_,
            token.is_alpha,
            token.is_stop,
            token.is_punct,
            token.tag_,
        )
        for token in example_doc
    ],
    columns=[
        "text",
        "lower_",
        "lemma_",
        "shape_",
        "is_alpha",
        "is_stop",
        "is_punct",
        "tag_",
    ],
)

Unnamed: 0,text,lower_,lemma_,shape_,is_alpha,is_stop,is_punct,tag_
0,We,we,we,Xx,True,True,False,VNW|pers|pron|nomin|red|1|mv__Case=Nom|Person=...
1,kozen,kozen,kiezen,xxxx,True,False,False,WW|pv|verl|mv__Number=Plur|Tense=Past|VerbForm...
2,voor,voor,voor,xxxx,True,True,False,VZ|init
3,dit,dit,dit,xxx,True,True,False,VNW|aanw|det|stan|prenom|zonder|evon
4,restaurant,restaurant,restaurant,xxxx,True,False,False,N|soort|ev|basis|onz|stan__Gender=Neut|Number=...
...,...,...,...,...,...,...,...,...
79,in,in,in,xx,True,True,False,VZ|init
80,positieve,positieve,positief,xxxx,True,False,False,ADJ|prenom|basis|met-e|stan__Degree=Pos
81,zin,zin,zin,xxx,True,False,False,N|soort|ev|basis|zijd|stan__Gender=Com|Number=...
82,),),),),False,False,True,LET


### Simple tokenizer

In [31]:
def tokenize_simple(text):
    """Tokenizer returning lowercase tokens with no stop words, no punctuation and no words with encoding errors"""
    doc = nlp(text)
    return [token.lower_ for token in doc if not (token.is_stop or token.is_punct or ("\\" in token.lower_))]

tokenize_simple(reviews.reviewText[100])

['kozen',
 'restaurant',
 'vieren',
 'absoluut',
 'gelukt',
 '4-gangen',
 'verrassingsmenu',
 'voorafgegaan',
 'twee',
 'amuses',
 'dessert',
 'kregen',
 'wijnarrangement',
 'beroerd',
 'geschonken',
 'kortom',
 'weet',
 'klant',
 'watten',
 'leggen',
 'voelt',
 'echt',
 'gast',
 'eten',
 'heerlijk',
 'bood',
 'randstedelijke',
 'ogen',
 'absurde',
 'prijs-kwaliteitsverhouding',
 'positieve',
 'zin']

### Using lemmas as tokens

In [32]:
def tokenize_lemma(text):
    """Tokenizer returning lemmas with no stop words, no punctuation and no words with encoding errors"""
    doc = nlp(text)
    return [token.lemma_ for token in doc if not (token.is_stop or token.is_punct or ("\\" in token.lower_))]

tokenize_lemma(reviews.reviewText[100])

['kiezen',
 'restaurant',
 'vieren',
 'absoluut',
 'lukken',
 '4-gang',
 'verrassingsmenu',
 'voorafgaan',
 'twee',
 'amuse',
 'dessert',
 'krijgen',
 'wijnarrangement',
 'beroerd',
 'schenken',
 'kortom',
 'weten',
 'klant',
 'watten',
 'leggen',
 'voelen',
 'echt',
 'ga',
 'eten',
 'heerlijk',
 'bieden',
 'randstedelijk',
 'oog',
 'absurd',
 'prijs-kwaliteitsverhouding',
 'positief',
 'zin']

### Exercises

Since it takes quite some computing power to process all reviews, we suggest doing the exercises for a single year only.
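When processing many reviews, spaCy's `nlp.pipe` streams texts through the pipeline in batches, which is usually faster than calling `nlp` on each text separately. The sketch below uses a blank Dutch pipeline as a stand-in so that it runs without the large model; the texts and the batch size are made up for illustration.

```python
import spacy

# Blank pipeline as a stand-in so the sketch runs without the large model
blank_nlp = spacy.blank("nl")

texts = ["Heerlijk gegeten!", "Prima bediening, lekker eten"]

# nlp.pipe yields one Doc per input text, processed in batches
docs = blank_nlp.pipe(texts, batch_size=50)
tokens_per_review = [
    [token.lower_ for token in doc if not token.is_punct] for doc in docs
]
print(tokens_per_review)
```

With the trained model loaded, the same loop works on `nlp.pipe(reviews.reviewText)` for the subset of reviews you select.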


#### Explore association between overall review score and length of text

#### Extract ADJ-NOUN bigrams from the text

Imagine you are advising the owner of a restaurant. The owner wants to mine the reviews to find specific improvement points. Assuming that ADJECTIVE - NOUN bigrams can provide useful input, for example 'oude servies' or 'lange wachttijd', your task is to:
- Extract ADJ-NOUN bigrams for each review
- Find the most interesting bigrams for positive and negative reviews per restaurant
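One way to approach the first step is to walk over adjacent token pairs and keep those tagged ADJ followed by NOUN. The sketch below is only a starting point: `FakeToken` is a hypothetical stand-in that mimics the `.text` and `.pos_` attributes of spaCy tokens so the code runs without the model, and the tags in the example are assigned by hand rather than predicted.

```python
from collections import namedtuple

# Stand-in for spaCy tokens so the sketch runs without the model;
# real spaCy tokens expose the same .text and .pos_ attributes
FakeToken = namedtuple("FakeToken", ["text", "pos_"])


def adj_noun_bigrams(tokens):
    """Return (adjective, noun) pairs for directly adjacent ADJ-NOUN tokens."""
    tokens = list(tokens)
    return [
        (first.text.lower(), second.text.lower())
        for first, second in zip(tokens, tokens[1:])
        if first.pos_ == "ADJ" and second.pos_ == "NOUN"
    ]


tagged = [
    FakeToken("De", "DET"),
    FakeToken("lange", "ADJ"),
    FakeToken("wachttijd", "NOUN"),
    FakeToken("en", "CCONJ"),
    FakeToken("oude", "ADJ"),
    FakeToken("servies", "NOUN"),
]
print(adj_noun_bigrams(tagged))
```

With the trained pipeline you would pass a `Doc` instead, for example `adj_noun_bigrams(nlp(text))`, and count the resulting pairs per restaurant.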

## Document-Term Matrix & TF-IDF model


### Creating the Document-Term Matrix
Use `sklearn.feature_extraction.text.CountVectorizer` to create the DTM. Since this is a large and sparse matrix, the result is stored as a SciPy sparse matrix instead of a pandas dataframe. We will use spaCy's `Defaults.stop_words`. Since creating the DTM for all reviews takes a long time, we will test it first with just 1,000 records.

In [33]:
%%time
# recent scikit-learn versions expect stop_words as a list, so convert spaCy's set
count_vectorizer = CountVectorizer(tokenizer=tokenize_simple, stop_words=list(nlp.Defaults.stop_words), ngram_range=(1,1))
count_vectorizer.fit_transform(reviews.reviewText[:1000])



CPU times: user 16.9 s, sys: 5.31 ms, total: 16.9 s
Wall time: 16.9 s


<1000x6707 sparse matrix of type '<class 'numpy.int64'>'
	with 26657 stored elements in Compressed Sparse Row format>

The cell above also raises an interesting scikit-learn warning: our stop words may be inconsistent with our preprocessing. Let's look at the stop words that we have used.

In [34]:
nlp.Defaults.stop_words

{"'t",
 'aan',
 'aangaangde',
 'aangezien',
 'achter',
 'achterna',
 'af',
 'afgelopen',
 'al',
 'aldus',
 'alhoewel',
 'alle',
 'allebei',
 'alleen',
 'allen',
 'alles',
 'als',
 'altijd',
 'ander',
 'andere',
 'anderen',
 'anders',
 'anderzijds',
 'behalve',
 'beide',
 'beiden',
 'ben',
 'beneden',
 'bent',
 'bepaald',
 'beter',
 'betere',
 'betreffende',
 'bij',
 'bijna',
 'bijvoorbeeld',
 'binnen',
 'binnenin',
 'boven',
 'bovenal',
 'bovendien',
 'bovenstaand',
 'buiten',
 'daar',
 'daarheen',
 'daarin',
 'daarna',
 'daarnet',
 'daarom',
 'daarop',
 'dan',
 'dat',
 'de',
 'den',
 'der',
 'des',
 'deze',
 'dezelfde',
 'dezen',
 'die',
 'dien',
 'dikwijls',
 'dit',
 'doch',
 'doen',
 'doet',
 'door',
 'doorgaand',
 'doorgaans',
 'dus',
 'echter',
 'een',
 'eens',
 'eerder',
 'eerst',
 'eerste',
 'eersten',
 'effe',
 'eigen',
 'elk',
 'elke',
 'en',
 'enige',
 'enkel',
 'enkele',
 'enz',
 'er',
 'erdoor',
 'etc',
 'even',
 'eveneens',
 'evenwel',
 'ff',
 'gauw',
 'ge',
 'gedurende',


So what happens here is:
- spaCy tokenizes, for example, the stop word `zo'n` into `[zo, ', n]`
- since we have chosen to remove punctuation we end up with `[zo, n]` as the tokens of the original stop word
- `zo` was already in the list of stop words, but `n` isn't, which is what the warning is about

Adding `n` and `t` to the stop words fixes the warning.

In [35]:
%%time
# set.update() works in place and returns None, so build the extended list explicitly
stop_words = list(nlp.Defaults.stop_words | {"n", "t"})
count_vectorizer = CountVectorizer(tokenizer=tokenize_simple, stop_words=stop_words, ngram_range=(1,1))
count_vectorizer.fit_transform(reviews.reviewText[:1000])

CPU times: user 9.38 s, sys: 0 ns, total: 9.38 s
Wall time: 9.38 s


<1000x6707 sparse matrix of type '<class 'numpy.int64'>'
	with 26657 stored elements in Compressed Sparse Row format>
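To get a feel for what the document-term matrix contains, it helps to build one for a corpus small enough to print. This is a sketch on a made-up mini corpus: plain whitespace splitting stands in for `tokenize_simple`, and `token_pattern=None` is passed only to signal that the default pattern is unused.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; whitespace splitting stands in for tokenize_simple
docs = ["heerlijk gegeten", "heerlijk eten top", "top"]

vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None)
dtm = vectorizer.fit_transform(docs)

# Recover the terms in column order from the fitted vocabulary
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)

# Each row is a document, each column a term, each cell a count
print(dtm.toarray())
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the same matrix layout with TF-IDF weights instead of raw counts.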

## Topic modeling

## Classification challenge

As a final exercise, combine all the different techniques we have covered so far to build a model that predicts the review score using the text. Use `TransformerMixin` to encapsulate all the functions into one sklearn pipeline.

In [38]:
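from sklearn.base import BaseEstimator


class tokenize(BaseEstimator, TransformerMixin):
    """Stateless transformer that turns each text into a list of tokens."""

    def fit(self, X, y=None, **fit_params):
        # nothing to learn, the tokenizer is fixed
        return self

    def transform(self, X, **transform_params):
        return [tokenize_simple(text) for text in X]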

In [None]:
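# Create pipeline using Bag of Words; the classifier is an assumed placeholder, swap in your own
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
bow_vector = CountVectorizer(analyzer=lambda tokens: tokens)  # tokenize() already yields token lists
classifier = LogisticRegression(max_iter=1000)  # expects a discretized score as target
pipe = Pipeline([("cleaner", tokenize()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])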