## Text preprocessing with `keras` versus `sklearn`
<hr>

Preprocessing text data for modeling can be done in a number of ways. The default workflow enabled by `keras.preprocessing.text.Tokenizer` appears to be somewhat different from that of `sklearn.feature_extraction.text.CountVectorizer`, which creates a count-based document-term matrix (DTM) from a set of documents ("corpus"). 

However, the differences are superficial and the same DTM can be created easily with either workflow. The only meaningful difference appears to be a cultural one: in `keras` examples, DTMs are usually *binary*, not count-based. Binary DTMs can be created with `sklearn` by setting `binary=True` on instantiation of `CountVectorizer()`. And count-based DTMs can be created with `keras` by passing `mode='count'` to `Tokenizer.sequences_to_matrix()` (or `Tokenizer.texts_to_matrix()`).
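To make the count-versus-binary distinction concrete, here is a minimal pure-Python sketch. It does not use either library; the helper names `count_row` and `binarize` are made up for illustration, and whitespace splitting is assumed to be an adequate tokenizer for this toy sentence.

```python
from collections import Counter

def count_row(doc, vocab):
    """Count-based DTM row: how often each vocab word occurs in the doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def binarize(row):
    """Binary DTM row: 1 if the word occurs at all, else 0."""
    return [1 if count > 0 else 0 for count in row]

doc = 'a doc is just a string'
vocab = sorted(set(doc.split()))   # alphabetical toy vocabulary

count_vec = count_row(doc, vocab)
binary_vec = binarize(count_vec)

print(vocab)
print(count_vec)    # 'a' occurs twice, so its count is 2
print(binary_vec)   # the same position collapses to 1
```

The only information a binary DTM discards is how many times a word repeats within a document.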


This notebook demonstrates...

1. how to create a count-based DTM with the `keras` preprocessing workflow 
2. how to create a count-based DTM with the `sklearn` preprocessing workflow 
3. that the two strategies produce equivalent results 

Whether the `keras` cultural norm of using binary DTMs is important for classification accuracy will be assessed in a subsequent notebook. 


#### 0. create a toy corpus
<hr>

In [1]:
docs = ['this is me toy corp', 'a corp is just a bunch of docs', 
        'a doc is just a string', 'a string is just some chars',
        'and this doc is a doc and the last doc of the docs']

#### 1. construct count DTM with `keras` workflow
<hr>

- take the docs
- tokenize them with `Tokenizer.fit_on_texts()` 
- integer-encode them with `Tokenizer.texts_to_sequences()`  
- count-vectorize the integer tokens with `Tokenizer.sequences_to_matrix()` 

In [2]:
from keras.preprocessing.text import Tokenizer

# instantiate Tokenizer class (`num_words` to restrict vocab size)
tokenizer = Tokenizer(num_words=None)

# extract vocab and count words (makes several attrs available) 
tokenizer.fit_on_texts(docs)

# integer encode the documents 
docs_int_encoded = tokenizer.texts_to_sequences(docs)

# transform encoded docs into a DTM (default is binary)
# `mode` can be one of "binary", "count", "tfidf", "freq"
dtm_countK = tokenizer.sequences_to_matrix(docs_int_encoded, mode='count')

Using TensorFlow backend.
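The three steps above can be imitated in plain Python to see what each one contributes. This is only a stand-in under stated assumptions, not the `keras` implementation: it assumes whitespace tokenization is enough (the toy corpus has no punctuation or capitals), that words are indexed from `1` by descending corpus frequency, and that ties keep first-seen order. The variable names are illustrative.

```python
from collections import Counter

docs = ['this is me toy corp', 'a corp is just a bunch of docs',
        'a doc is just a string', 'a string is just some chars',
        'and this doc is a doc and the last doc of the docs']

# step 1 (stand-in for fit_on_texts): count words, then index them
# from 1 by descending frequency; index 0 is left unused
word_counts = Counter(word for doc in docs for word in doc.split())
ranked_words = sorted(word_counts, key=lambda w: word_counts[w], reverse=True)
word_index = {word: i for i, word in enumerate(ranked_words, start=1)}

# step 2 (stand-in for texts_to_sequences): replace each word by its index
sequences = [[word_index[word] for word in doc.split()] for doc in docs]

# step 3 (stand-in for sequences_to_matrix with mode='count'):
# one row per doc, one column per index, including the unused column 0
n_cols = len(word_index) + 1
dtm = []
for seq in sequences:
    row = [0] * n_cols
    for idx in seq:
        row[idx] += 1
    dtm.append(row)

print(['<>'] + ranked_words)
for row in dtm:
    print(row)
```

The extra leading column comes from leaving index `0` unused, which is why the resulting matrix is one column wider than the vocabulary.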


#### 2. construct count DTM with `sklearn` workflow
<hr>

- take the docs
- construct DTM directly with `CountVectorizer.fit_transform()`

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# the default `token_pattern` only keeps tokens of two or more characters,
# which would drop the word "a"; this pattern keeps one-character tokens too
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')

dtm_countSKL = vectorizer.fit_transform(docs)
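The effect of the custom `token_pattern` can be checked with the standard `re` module alone. The sketch below applies `CountVectorizer`'s documented default pattern and the one-character-friendly pattern used above to a single toy document; the variable names are illustrative.

```python
import re

default_pattern = r'(?u)\b\w\w+\b'   # CountVectorizer's default token_pattern
custom_pattern = r'(?u)\b\w+\b'      # pattern used in this notebook

doc = 'a doc is just a string'

default_tokens = re.findall(default_pattern, doc)
custom_tokens = re.findall(custom_pattern, doc)

print(default_tokens)   # single-character "a" is dropped
print(custom_tokens)    # "a" is kept, both times
```

Without the custom pattern, the `a` column would be missing from the `sklearn` DTM and the two matrices would no longer line up.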

#### 3. show that the two approaches yield equivalent results
<hr>

The columns are in different orders because the two tools order their vocabularies differently: `keras` indexes words by descending frequency in the corpus, while `sklearn` sorts the vocabulary alphabetically. But the column contents are identical across the two data frames. Note also that `keras` reserves the vocabulary index `0`, which is why the placeholder label `<>` is shown as the first column header in the first df; that column is all zeros.

In [4]:
from pandas import DataFrame

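# get vocab for each approach in correct order to make column names 
vocabK = list(tokenizer.word_index.keys())
# note: `get_feature_names()` was removed in scikit-learn 1.2;
# on newer versions use `vectorizer.get_feature_names_out()` instead
vocabSKL = vectorizer.get_feature_names()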

# get both dtms as dataframes and compare side-by-side
dtm_df_K = DataFrame(dtm_countK.astype(int), columns=['<>'] + vocabK)
dtm_df_SKL = DataFrame(dtm_countSKL.toarray().astype(int), columns=vocabSKL)


print('count DTM from keras.preprocessing.text.Tokenizer():')
display(dtm_df_K)

print('count DTM from sklearn.feature_extraction.text.CountVectorizer():')
dtm_df_SKL

count DTM from keras.preprocessing.text.Tokenizer():


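| | `<>` | a | is | doc | just | this | corp | of | docs | string | and | the | me | toy | bunch | some | chars | last |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 1 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |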


count DTM from sklearn.feature_extraction.text.CountVectorizer():


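| | a | and | bunch | chars | corp | doc | docs | is | just | last | me | of | some | string | the | this | toy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 1 | 2 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |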
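Comparing the two tables by eye is error-prone, so the equivalence can also be checked programmatically. The sketch below is one way to do it in plain Python, with the two outputs above copied in as literal lists; the helper `to_records` is hypothetical and simply keys each row by column name so that column order no longer matters.

```python
keras_cols = ['<>', 'a', 'is', 'doc', 'just', 'this', 'corp', 'of', 'docs',
              'string', 'and', 'the', 'me', 'toy', 'bunch', 'some', 'chars', 'last']
keras_rows = [
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 2, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 3, 0, 1, 0, 1, 1, 0, 2, 2, 0, 0, 0, 0, 0, 1],
]

skl_cols = ['a', 'and', 'bunch', 'chars', 'corp', 'doc', 'docs', 'is', 'just',
            'last', 'me', 'of', 'some', 'string', 'the', 'this', 'toy']
skl_rows = [
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1],
    [2, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 3, 1, 1, 0, 1, 0, 1, 0, 0, 2, 1, 0],
]

def to_records(cols, rows, drop=()):
    """Turn each row into a {column name: value} dict, skipping dropped columns."""
    return [{c: v for c, v in zip(cols, row) if c not in drop} for row in rows]

# ignore the reserved placeholder column when comparing
keras_records = to_records(keras_cols, keras_rows, drop=('<>',))
skl_records = to_records(skl_cols, skl_rows)

tables_match = keras_records == skl_records
placeholder_empty = all(row[0] == 0 for row in keras_rows)

print(tables_match, placeholder_empty)
```

Inside the notebook, a similar check could probably be written directly on the data frames, for example by dropping the `<>` column, reordering the `keras` columns with `vocabSKL`, and calling `DataFrame.equals()`. That variant is untested here.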
