## Extract The Tokens Most Associated With Each Category

### Retrieve ML model that has been cached to disk

In [1]:
import pickle as pkl
import joblib

In [2]:
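def tokenize(text):
    # NOTE: relies on `en_nlp`, `stopwords` and `stemmer`, which are not
    #   defined in this notebook; they must exist before this is called.
    # tokenize the text using spaCy's model for English
    doc = en_nlp(text)
    # while we lemmatize the now tokenized text, let's not forget to drop
    #   tokens that are stop words or punctuation
    # CAUTION: `token` is a spaCy Token, not a string. If `stopwords` holds
    #   strings, this membership test likely never matches, which would explain
    #   stop words such as 'the' and 'of' in the canon tokens below. Left
    #   unchanged so it stays consistent with the cached vectorizer.
    lemmas = [token.lemma_ for token in doc
              if token not in stopwords and not token.is_punct]
    # Had better luck with this nltk stemmer
    return [stemmer.stem(lemma) for lemma in lemmas]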

In [4]:
with open('../models/ml_model.pkl', 'rb') as f:
    ml_model = joblib.load(f)
    
with open('../models/nlp_model.pkl', 'rb') as f:
    nlp_model = joblib.load(f)

In [5]:
ml_model.get_params

<bound method Pipeline.get_params of Pipeline(steps=[('multioutputclassifier',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
                                                                    learning_rate=1,
                                                                    n_estimators=10)))])>

In [6]:
nlp_model.get_params

<bound method Pipeline.get_params of Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(min_df=5,
                                 tokenizer=<function tokenize at 0x1106adf70>))])>

### Extract The Vocabulary

Is there some other way we could get the vocabulary?  Some way where we don't have to reconstruct and train the vectorizer?
- Can we store the vectorizer like we store the model?  Then we could just read it from disk.

In [7]:
nlp_model.named_steps

{'tfidfvectorizer': TfidfVectorizer(min_df=5, tokenizer=<function tokenize at 0x1106adf70>)}

In [8]:
nlp_model['tfidfvectorizer']

TfidfVectorizer(min_df=5, tokenizer=<function tokenize at 0x1106adf70>)

In [9]:
# vocab = vect.vocabulary_
vocab = list(nlp_model['tfidfvectorizer'].vocabulary_.keys())
# Caution: vocabulary_ is a dict that maps each token to its column index.
# keys() yields tokens in dict insertion order, which need not be column order,
# so vocab[i] is not guaranteed to be the token behind column i.
# I've gone ahead even though I'm still concerned about this issue.
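
One way to sidestep the ordering worry is to sort the tokens by their stored column index instead of relying on `keys()`. The following is a minimal sketch: `toy_vocabulary` and `tokens_in_column_order` are hypothetical names, and the small dict only stands in for `vocabulary_`. With the real vectorizer, the same `sorted` call would presumably be applied to `nlp_model['tfidfvectorizer'].vocabulary_`.

```python
# Hypothetical stand-in for TfidfVectorizer.vocabulary_: token -> column index.
# The insertion order here deliberately differs from the column order.
toy_vocabulary = {"water": 2, "flood": 1, "child": 0, "food": 3}


def tokens_in_column_order(vocabulary):
    """Return tokens sorted by column index, so result[i] is column i's token."""
    return sorted(vocabulary, key=vocabulary.get)


ordered = tokens_in_column_order(toy_vocabulary)
print(ordered)  # ['child', 'flood', 'water', 'food']
```

Because the sort key is the index stored in the dict, the result does not depend on how the dict happens to be ordered.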

In [10]:
n_vocab = len(vocab)
n_vocab

4859

In [11]:
with open('nlp_vocabulary.joblib', 'wb') as f:
    joblib.dump(vocab, f)

### Extract Canon Tokens

Here's what we're up to. For every token in the vocabulary, we take its vector representation (a single 1 and 0s everywhere else) and feed it to the pipeline to get a prediction. That prediction is a row of 36 elements, which becomes one row of a big table. With 4859 tokens in the vocabulary, we end up with a table of 4859 x 36.

From this table we then extract the columns. For each non-zero entry in a column, we look up the associated token. We call these tokens the "canon tokens".

What we are doing is a crude kind of matrix inversion. It would give us our features perfectly if the pipeline were a strictly linear system, but of course it isn't. Still, it's the best we've got.
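
To make the idea concrete, here is a toy sketch of the probing procedure in plain Python. Everything in it is an assumption for illustration: `toy_vocab`, `toy_predict` and `one_hot` are hypothetical, and the two-category rule-based predictor is not the cached pipeline.

```python
# Toy vocabulary and a hypothetical 2-category predictor (not the real pipeline).
toy_vocab = ["flood", "water", "food", "the"]


def toy_predict(vec):
    """Return [category_0, category_1] for one token vector."""
    flood, water, food, _the = vec
    cat0 = 1 if (flood or water) else 0   # fires on a single token
    cat1 = 1 if (water and food) else 0   # fires only on a pair of tokens
    return [cat0, cat1]


def one_hot(i, n):
    """Vector with a single 1 at position i and 0s elsewhere."""
    return [1 if j == i else 0 for j in range(n)]


# One prediction row per token: an n_vocab x n_categories table.
n = len(toy_vocab)
table = [toy_predict(one_hot(i, n)) for i in range(n)]

# Read each column and collect the tokens whose row holds a 1.
canon = [[toy_vocab[i] for i in range(n) if table[i][c] == 1]
         for c in range(2)]
print(canon)  # [['flood', 'water'], []]
```

Category 1 ends up with no canon tokens even though `toy_predict([0, 1, 1, 0])` returns `[1, 1]`. That is the non-linearity at work: single-token probes cannot see categories that only respond to combinations of tokens.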

In [12]:
import numpy as np

In [13]:
def genTokenTable(model, n_vocab):
    result = np.zeros(shape = (n_vocab, 36))
    for i in range(n_vocab):
        # construct token vector for single token
        tokenVec = np.zeros(shape=n_vocab)
        tokenVec[i] = 1
        # compute categories for that single token and append to table
        result[i] = model.predict([tokenVec])
    return result

~~Rewrite `genTokenTable()` to use csr (or lil) rather than np.array.~~  Ok. That was a terrible idea.  It's non-trivial to use sparse.lil arrays with the exceptionally useful `numpy.where()` function.  Let's go back to the original method.

In [None]:
# from scipy import sparse

In [None]:
# def genTokenTable(n_vocab):
#     result = sparse.lil_matrix((n_vocab, 36), dtype=int)
#     for i in range(n_vocab):
#         # construct token vector for single token
#         tokenVec = np.zeros(shape=n_vocab)
#         tokenVec[i] = 1
#         # compute categories for that single token and append to table
#         result[i,:] = pipe.predict([tokenVec])
#     return result

In [15]:
%%time
tokenTable = genTokenTable(ml_model, n_vocab)

CPU times: user 4min 59s, sys: 2.18 s, total: 5min 2s
Wall time: 5min 8s
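
The loop above calls `predict` once per token. A possible alternative is to probe many tokens per call by passing a block of one-hot rows at a time. The sketch below is untested against the cached pipeline: `ToyModel`, `gen_token_table_batched`, `n_categories` and `batch_size` are hypothetical names, and whether batching actually shortens the run depends on the model.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in exposing predict(), like the cached pipeline."""

    def predict(self, X):
        X = np.asarray(X)
        # Category 0 fires on even-numbered columns, category 1 on column 3.
        cat0 = (X[:, ::2].sum(axis=1) > 0).astype(int)
        cat1 = (X[:, 3] > 0).astype(int)
        return np.column_stack([cat0, cat1])


def gen_token_table_batched(model, n_vocab, n_categories, batch_size=512):
    """Probe the model with one-hot rows, batch_size tokens per predict call."""
    result = np.zeros((n_vocab, n_categories))
    for start in range(0, n_vocab, batch_size):
        stop = min(start + batch_size, n_vocab)
        # Rows start..stop of the identity matrix: one one-hot vector per token.
        batch = np.zeros((stop - start, n_vocab))
        batch[np.arange(stop - start), np.arange(start, stop)] = 1
        result[start:stop] = model.predict(batch)
    return result


table = gen_token_table_batched(ToyModel(), n_vocab=6, n_categories=2,
                                batch_size=4)
print(table)
```

Building each batch on the fly keeps memory bounded at `batch_size x n_vocab` values, rather than materializing the full `n_vocab x n_vocab` identity matrix.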


In [16]:
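def genCanonTable(table):
    result = []
    # Convert once so the vocabulary can be indexed with an array of indices
    vocabArray = np.array(vocab)
    # One column per category (36 for this model)
    for i in range(table.shape[1]):
        # Use the `table` argument rather than the global tokenTable
        categoryVector = table[:, i]
        tokenIndices = np.where(categoryVector == 1)[0]
        result.append(list(vocabArray[tokenIndices]))
    return result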

In [17]:
# canonTable = [canonTokens(tokenTable, i) for i in range(36)]
canonTable = genCanonTable(tokenTable)

In [18]:
canonTable

[['the',
  'medan',
  'chapter',
  'of',
  'taiwan',
  'buddhist',
  'tzu',
  'chi',
  'foundat',
  'set',
  'up',
  'a',
  'recept',
  'center',
  'at',
  'an',
  'indonesian',
  'militari',
  'base',
  'in',
  'on',
  'dec',
  '29',
  'to',
  'help',
  'victim',
  'flee',
  'tsunami',
  'ravag',
  'ach',
  'provinc',
  'i',
  'be',
  'flood',
  'port',
  'au',
  'princ',
  'now',
  'live',
  'gonaiv',
  'need',
  'make',
  'sure',
  '-pron-',
  'manhattan',
  'have',
  'emerg',
  'food',
  'frankenstorm',
  'hurricanesandi',
  'get',
  "'s",
  'mini',
  'amp',
  'water',
  'for',
  'about',
  'these',
  'hurrican',
  'with',
  'go',
  'all',
  'digicel',
  'offic',
  'find',
  '160',
  'can',
  'not',
  'one',
  'deliveri',
  'more',
  'than',
  '41',
  'ton',
  'therapeut',
  'milk',
  'and',
  '1.5',
  'treat',
  'child',
  'acut',
  'sever',
  'malnutrit',
  '31',
  'feed',
  'centr',
  'oper',
  'by',
  'unicef',
  'partner',
  'both',
  'cloth',
  'non',
  'perish',
  'donat',
 

In [None]:
# canon = dict(pd.Series(data=canonTable, index=out_columns, name='Canon Tokens'))

In [None]:
import joblib

In [None]:
with open('canon.joblib', 'wb') as f:
    joblib.dump(canonTable, f)