## NLP for ML Classification

**Hypothesis**: Part of Speech (POS) tagging and syntactic dependency parsing provide valuable information for classifying imperative phrases. The thinking is that being able to detect imperative phrases will transfer well to detecting tasks and to-dos.

#### Some Terminology
- [_Imperative mood_](https://en.wikipedia.org/wiki/Imperative_mood) is "used principally for ordering, requesting or advising the listener to do (or not to do) something... also often used for giving instructions as to how to perform a task."
- _Part of speech (POS)_ is a way of categorizing a word based on its syntactic function.
    - The POS tagger from Spacy.io that is used in this notebook differentiates between [*pos_* and *tag_*](https://spacy.io/docs/api/annotation#pos-tagging-english) - *POS (pos_)* refers to "coarse-grained part-of-speech" like `VERB`, `ADJ`, or `PUNCT`; and *POSTAG (tag_)* refers to "fine-grained part-of-speech" like `VB`, `JJ`, or `.`.
- _Syntactic dependency parsing_ is a way of connecting words based on syntactic relationships, [such as](https://spacy.io/docs/api/annotation#dependency-parsing-english) `DOBJ` (direct object), `PREP` (prepositional modifier), or `POBJ` (object of preposition).
    - Check out the dependency parse for the phrase ["Send the report by Kyle by tomorrow"](https://demos.explosion.ai/displacy/?text=Send%20the%20report%20by%20Kyle%20by%20tomorrow&model=en&cpu=1&cph=1) as an example

#### Features
The imperative mood centers around _actions_, and actions are generally represented in English using verbs. So the features are engineered to also center on the VERB:
1. *FeatureName.VERB*: Does the phrase contain VERB(s) of the tag form VB*?
2. *FeatureName.FOLLOWING_POS*: Are the words following the VERB(s) of certain parts of speech?
3. *FeatureName.FOLLOWING_POSTAG*: Are the words following the VERB(s) of certain POS tags?
4. *FeatureName.CHILD_DEP*: Are the VERB(s) parents of certain syntactic dependencies?
5. *FeatureName.PARENT_DEP*: Are the VERB(s) children of certain syntactic dependencies?
6. *FeatureName.CHILD_POS*: Are the words that the VERB(s) parent (their syntactic children) of certain parts of speech?
7. *FeatureName.CHILD_POSTAG*: Are the words that the VERB(s) parent (their syntactic children) of certain POS tags?
8. *FeatureName.PARENT_POS*: Are the words that the VERB(s) are children of (their syntactic heads) of certain parts of speech?
9. *FeatureName.PARENT_POSTAG*: Are the words that the VERB(s) are children of (their syntactic heads) of certain POS tags?

Note that features 2-9 all depend on feature 1 being `True`; if `False`, phrase vectorization will result in all zeroes.
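To make the feature list concrete, here is a minimal sketch of the same idea on hand-annotated tokens, with no spaCy required. The `Tok` tuple, the `sketch_features` helper, and the tags and dependency labels on the two toy phrases are my own assumptions for illustration; they are not parser output.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a parsed token: text, fine-grained tag, coarse POS,
# dependency label, and index of its head (the root points at itself).
Tok = namedtuple("Tok", "text tag pos dep head")

def sketch_features(tokens):
    """Count verb-centered features for one phrase."""
    feats = defaultdict(int)
    for i, tok in enumerate(tokens):
        if not tok.tag.startswith("VB"):
            continue  # features 2-9 are only emitted for VB* tokens
        feats["VERB"] += 1
        if i + 1 < len(tokens):  # word following the verb
            nxt = tokens[i + 1]
            feats[f"FOLLOWING_POS_{nxt.pos}"] += 1
            feats[f"FOLLOWING_POSTAG_{nxt.tag}"] += 1
        if tok.head != i:  # the root has no parent
            parent = tokens[tok.head]
            feats[f"PARENT_DEP_{parent.dep.upper()}"] += 1
            feats[f"PARENT_POS_{parent.pos}"] += 1
            feats[f"PARENT_POSTAG_{parent.tag}"] += 1
        for j, child in enumerate(tokens):  # words whose head is this verb
            if child.head == i and j != i:
                feats[f"CHILD_DEP_{child.dep.upper()}"] += 1
                feats[f"CHILD_POS_{child.pos}"] += 1
                feats[f"CHILD_POSTAG_{child.tag}"] += 1
    return dict(feats)

# Hand-annotated toy phrases (assumed annotations)
send = [
    Tok("Send", "VB", "VERB", "ROOT", 0),
    Tok("the", "DT", "DET", "det", 2),
    Tok("report", "NN", "NOUN", "dobj", 0),
]
no_verb = [
    Tok("Great", "JJ", "ADJ", "amod", 1),
    Tok("movie", "NN", "NOUN", "ROOT", 1),
]

print(sketch_features(send))
print(sketch_features(no_verb))  # no VB* token, so nothing to vectorize
```

The second phrase has no `VB*` token, so its feature dict is empty, which is the all-zeroes case mentioned above.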

## Data and Setup

### Building a recipe corpus

I wrote and ran `epicurious_recipes.py`\* to scrape Epicurious.com for recipe instructions and descriptions. Output is `epicurious-pos.txt` and `epicurious-neg.txt`.

\* _script loosely based off of https://github.com/benosment/hrecipe-parse_

Note that deriving all negative examples in the training set from Epicurious recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.

To (hopefully?) correct for this a bit, I will add the short movie reviews found at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.

This still feels weird because we're selecting negative examples only from specific categories of text (recipe descriptions, short movie reviews) - just because they're readily available.

Ultimately though, this recipe corpus is a **stopgap/proof of concept** for a corpus more relevant to tasks later on, so I won't worry further about this for now.
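The length bias mentioned above is easy to check before training. This is a rough sketch that compares whitespace-token counts between the two classes; `length_summary` is a hypothetical helper, and the negative example is a made-up sentence standing in for a recipe description.

```python
from statistics import mean, median

def length_summary(samples):
    """Return (mean, median) whitespace-token counts, skipping blank lines."""
    lengths = [len(s.split()) for s in samples if s.strip()]
    return mean(lengths), median(lengths)

# Toy stand-ins for the lines of the positive and negative files
pos_examples = [
    "Whisk vigorously until the mixture comes together.",
    "Repeat with remaining steaks.",
]
neg_examples = [
    "A rich, slow-cooked stew that tastes even better the next day.",  # made up
]

print("pos (mean, median):", length_summary(pos_examples))
print("neg (mean, median):", length_summary(neg_examples))
```

If the two summaries differ a lot on the real files, a classifier could pick up on length or complexity instead of mood.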

In [1]:
import os
import random

In [2]:
BASE_DIR = os.getcwd()
pos_data_path = BASE_DIR + '/pos.txt'
neg_data_path = BASE_DIR + '/neg.txt'

In [3]:
with open(pos_data_path, 'r', encoding='utf-8') as f:
    pos_data = f.read()
with open(neg_data_path, 'r', encoding='utf-8') as f:
    neg_data = f.read()

In [4]:
pos_data_split = pos_data.split('\n')
neg_data_split = neg_data.split('\n')

num_pos = len(pos_data_split)
num_neg = len(neg_data_split)

# 50/50 split between the number of positive and negative samples
num = min(num_pos, num_neg)

# shuffle samples
random.shuffle(pos_data_split)
random.shuffle(neg_data_split)

In [5]:
lines = []
for l in pos_data_split[:num]:
    lines.append((l, 'pos'))
for l in neg_data_split[:num]:
    lines.append((l, 'neg'))
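The shuffle-and-truncate steps above give a different sample on every run, and `split('\n')` can leave empty strings in the lists (for example from a trailing newline). A possible variant, sketched under the assumption that reproducibility matters more than fresh randomness: `balanced_sample` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def balanced_sample(pos_lines, neg_lines, seed=0):
    """Return an equal number of ('text', label) pairs per class, reproducibly."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    pos = [l for l in pos_lines if l.strip()]  # drop blank lines
    neg = [l for l in neg_lines if l.strip()]
    n = min(len(pos), len(neg))
    labeled = [(l, 'pos') for l in rng.sample(pos, n)]
    labeled += [(l, 'neg') for l in rng.sample(neg, n)]
    return labeled

toy = balanced_sample(
    ["Mow the lawn", "Feed the dog", "Buy new shoes", ""],
    ["A quiet, slow film.", "This stew is a family favorite."],
)
print(toy)
```

Whether blank lines should be dropped is a judgment call; they would otherwise produce empty feature dicts.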

In [6]:
from enum import Enum, auto
class FeatureName(Enum):
    VERB = auto()
    FOLLOWING_POS = auto()
    FOLLOWING_POSTAG = auto()
    CHILD_DEP = auto()
    PARENT_DEP = auto()
    CHILD_POS = auto()
    CHILD_POSTAG = auto()
    PARENT_POS = auto()
    PARENT_POSTAG = auto()

## [spaCy.io](https://spacy.io/) for NLP
_Because Stanford CoreNLP is hard to install for Python_

Found Spacy through an article on ["Training a Classifier for Relation Extraction from Medical Literature"](https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/) ([GitHub](https://github.com/CatalystCode/corpus-to-graph-ml))

<img src="nltk_library_comparison.png" alt="NLTK library comparison chart https://spacy.io/docs/api/#comparison" style="width: 400px; margin: 0;"/>

In [7]:
#!conda config --add channels conda-forge
#!conda install spacy
#!python -m spacy download en

### Using the Spacy Data Model for NLP

In [8]:
import spacy
nlp = spacy.load('en')

Spacy's sentence segmentation is lacking... https://github.com/explosion/spaCy/issues/235. So each '\n' will start a new Spacy Doc.

In [9]:
def create_spacy_docs(ll):
    dd = [(nlp(l[0]), l[1]) for l in ll]
    # collapse noun phrases into single compounds
    for d in dd:
        for np in d[0].noun_chunks:
            np.merge(np.root.tag_, np.text, np.root.ent_type_)
    return dd

In [10]:
docs = create_spacy_docs(lines)

### NLP output

Tokenization, POS tagging, and dependency parsing happened automatically with the `nlp(line)` calls above! So let's look at the outputs.

https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward

In [11]:
for doc in docs[:10]:
    print(list(doc[0].sents))

[Whisk vigorously until the mixture comes together.]
[Add shrimp and cook (no need to return to a boil), stirring gently, until shrimp turn pink, about 3 minutes.]
[Beat the other egg in a small bowl and begin to stir it in—you may not want to add all of it in the interest of dryness.]
[Repeat with remaining steaks.]
[Flip once, cover, and grill for another 4 to 6 minutes, until cooked through.]
[Remove squid from bag, letting marinade drip off.]
[Whisk sour cream and remaining 2 Tbsp., lime juice in a small bowl.]
[Heat over medium-high until thermometer registers 350°F.]
[Rub with 1 Tbsp. oil and 1/2 tsp. salt and transfer cut side down to another rimmed baking sheet.]
[Transfer to a rimmed baking sheet and let cool.]


In [12]:
for doc in docs[:10]:
    print(list(doc[0].noun_chunks))

[Whisk, the mixture]
[shrimp, cook, a boil, shrimp, pink]
[the other egg, a small bowl, it, you, it, the interest, dryness]
[remaining steaks]
[Flip, grill, another 4 to 6 minutes]
[squid, bag, marinade drip]
[Whisk, cream, lime juice, a small bowl]
[thermometer]
[1 Tbsp, transfer, side, another rimmed baking sheet]
[Transfer, a rimmed baking sheet, let cool]


[Spacy's dependency graph visualization](https://demos.explosion.ai/displacy)

In [13]:
for doc in docs[:5]:
    for token in doc[0]:
        print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))

Whisk ROOT Whisk PROPN NNP Whisk [comes, .]
vigorously advmod vigorously ADV RB comes []
until mark until ADP IN comes []
the mixture nsubj the mixture NOUN NN comes []
comes advcl come VERB VBZ Whisk [vigorously, until, the mixture, together]
together advmod together ADV RB comes []
. punct . PUNCT . Whisk []
Add ROOT add VERB VB Add [shrimp, (, need, ,, stirring, .]
shrimp dobj shrimp NOUN NN Add [and, cook]
and cc and CCONJ CC shrimp []
cook conj cook NOUN NN shrimp []
( punct ( PUNCT -LRB- Add []
no det no DET DT need []
need npadvmod need NOUN NN Add [no, return, )]
to aux to PART TO return []
return acl return VERB VB need [to, to]
to prep to ADP IN return [a boil]
a boil pobj a boil NOUN NN to []
) punct ) PUNCT -RRB- need []
, punct , PUNCT , Add []
stirring advcl stir VERB VBG Add [gently, ,, turn]
gently advmod gently ADV RB stirring []
, punct , PUNCT , stirring []
until mark until ADP IN turn []
shrimp nsubj shrimp NOUN NN turn []
turn advcl turn VERB VBP stirring [until, s

### Featurization

In [14]:
import re
from collections import defaultdict

def featurize(d):
    s_features = defaultdict(int)
    for idx, token in enumerate(d):
        #print(token, token.pos_, token.tag_)
        if re.match(r'VB.?', token.tag_) is not None: # note: not using token.pos == VERB because this also includes BES, HVS, MD tags 
            s_features[FeatureName.VERB.name] += 1
            # FOLLOWING_POS
            next_idx = idx + 1
            if next_idx < len(d):
                s_features[f'{FeatureName.FOLLOWING_POS.name}_{d[next_idx].pos_}'] += 1
                s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{d[next_idx].tag_}'] += 1
            # PARENT_DEP, PARENT_POS, PARENT_POSTAG
            # "Because the syntactic relations form a tree, every word has exactly one head.
            # You can therefore iterate over the arcs in the tree by iterating over the words in the sentence."
            # https://spacy.io/docs/usage/dependency-parse#navigating
            if (token.head is not token):
                s_features[f'{FeatureName.PARENT_DEP.name}_{token.head.dep_.upper()}'] += 1
                s_features[f'{FeatureName.PARENT_POS.name}_{token.head.pos_}'] += 1
                s_features[f'{FeatureName.PARENT_POSTAG.name}_{token.head.tag_}'] += 1
            # CHILD_DEP, CHILD_POS, CHILD_POSTAG
            for child in token.children:
                s_features[f'{FeatureName.CHILD_DEP.name}_{child.dep_.upper()}'] += 1
                s_features[f'{FeatureName.CHILD_POS.name}_{child.pos_}'] += 1
                s_features[f'{FeatureName.CHILD_POSTAG.name}_{child.tag_}'] += 1
    return dict(s_features)
        #print(dict(s_features))
    #print()

#print(featuresets, len(featuresets))

In [15]:
featuresets = [(featurize(doc[0]), doc[1]) for doc in docs]

In [16]:
from statistics import mean, median, mode, stdev
f_lengths = [len(fs[0]) for fs in featuresets]

print('Stats on feature set lengths:')
print(f'mean: {mean(f_lengths)}')
print(f'stdev: {stdev(f_lengths)}')
print(f'median: {median(f_lengths)}')
print(f'mode: {mode(f_lengths)}')
print(f'max: {max(f_lengths)}')
print(f'min: {min(f_lengths)}')

Stats on feature set lengths:
mean: 23.033783783783782
stdev: 14.534981922236032
median: 23.0
mode: 0
max: 75
min: 0


In [17]:
featuresets[:2]

[({'CHILD_DEP_ADVMOD': 2,
   'CHILD_DEP_MARK': 1,
   'CHILD_DEP_NSUBJ': 1,
   'CHILD_POSTAG_IN': 1,
   'CHILD_POSTAG_NN': 1,
   'CHILD_POSTAG_RB': 2,
   'CHILD_POS_ADP': 1,
   'CHILD_POS_ADV': 2,
   'CHILD_POS_NOUN': 1,
   'FOLLOWING_POSTAG_RB': 1,
   'FOLLOWING_POS_ADV': 1,
   'PARENT_DEP_ROOT': 1,
   'PARENT_POSTAG_NNP': 1,
   'PARENT_POS_PROPN': 1,
   'VERB': 1},
  'pos'),
 ({'CHILD_DEP_ADVCL': 2,
   'CHILD_DEP_ADVMOD': 1,
   'CHILD_DEP_AUX': 1,
   'CHILD_DEP_DOBJ': 2,
   'CHILD_DEP_MARK': 1,
   'CHILD_DEP_NPADVMOD': 2,
   'CHILD_DEP_NSUBJ': 1,
   'CHILD_DEP_PREP': 1,
   'CHILD_DEP_PUNCT': 4,
   'CHILD_POSTAG_,': 2,
   'CHILD_POSTAG_-LRB-': 1,
   'CHILD_POSTAG_.': 1,
   'CHILD_POSTAG_IN': 2,
   'CHILD_POSTAG_NN': 4,
   'CHILD_POSTAG_NNS': 1,
   'CHILD_POSTAG_RB': 1,
   'CHILD_POSTAG_TO': 1,
   'CHILD_POSTAG_VBG': 1,
   'CHILD_POSTAG_VBP': 1,
   'CHILD_POS_ADP': 2,
   'CHILD_POS_ADV': 1,
   'CHILD_POS_NOUN': 5,
   'CHILD_POS_PART': 1,
   'CHILD_POS_PUNCT': 4,
   'CHILD_POS_VERB': 2,


### Classification

In [18]:
random.shuffle(featuresets)

split_num = round(num / 5)

# train and test sets
testing_set = featuresets[:split_num]
training_set =  featuresets[split_num:]
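Note that `num` is the per-class count while `featuresets` holds `2 * num` samples, so `round(num / 5)` holds out roughly 10% of the data, not 20%. A plain slice of a shuffled list also leaves the class balance of the test set to chance. Here is a sketch of a per-label split; `stratified_split` and `test_fraction` are hypothetical names, and the toy feature dicts are placeholders.

```python
import random
from collections import defaultdict

def stratified_split(labeled, test_fraction=0.2, seed=0):
    """Split (features, label) pairs so each label keeps the same test share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in labeled:
        by_label[item[1]].append(item)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * test_fraction)
        test.extend(items[:cut])
        train.extend(items[cut:])
    rng.shuffle(train)  # avoid all-'neg'-then-all-'pos' ordering
    rng.shuffle(test)
    return train, test

toy_sets = [({'VERB': i}, 'pos') for i in range(10)]
toy_sets += [({'VERB': i}, 'neg') for i in range(10)]
toy_train, toy_test = stratified_split(toy_sets, test_fraction=0.2)
print(len(toy_train), len(toy_test))
```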

In [19]:
# decoupling the functionality of nltk.classify.accuracy
def predict(classifier, gold):
    predictions = classifier.classify_many([fs for (fs, l) in gold])
    return list(zip([l for (fs, l) in gold], predictions))

def accuracy(predict):
    correct = [label == prediction for (label, prediction) in predict]
    if correct:
        return sum(correct) / len(correct)
    else:
        return 0

In [20]:
from nltk import NaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

nb = NaiveBayesClassifier.train(training_set)
nb_predict = predict(nb, testing_set)
nb_accuracy = accuracy(nb_predict)
print("NaiveBayes classifier accuracy percent:", nb_accuracy*100)

multinomial_nb = SklearnClassifier(MultinomialNB())
multinomial_nb.train(training_set)
mnb_predict = predict(multinomial_nb, testing_set)
mnb_accuracy = accuracy(mnb_predict)
print("MultinomialNB classifier accuracy percent:", mnb_accuracy*100)

bernoulli_nb = SklearnClassifier(BernoulliNB())
bernoulli_nb.train(training_set)
bnb_predict = predict(bernoulli_nb, testing_set)
bnb_accuracy = accuracy(bnb_predict)
print("BernoulliNB classifier accuracy percent:", bnb_accuracy*100)

logistic_regression = SklearnClassifier(LogisticRegression())
logistic_regression.train(training_set)
lr_predict = predict(logistic_regression, testing_set)
lr_accuracy = accuracy(lr_predict)
print("LogisticRegression classifier accuracy percent:", lr_accuracy*100)

sgd = SklearnClassifier(SGDClassifier())
sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print("SGDClassifier classifier accuracy percent:", sgd_accuracy*100)

svc = SklearnClassifier(SVC())
svc.train(training_set)
svc_predict = predict(svc, testing_set)
svc_accuracy = accuracy(svc_predict)
print("SVC classifier accuracy percent:", svc_accuracy*100)

linear_svc = SklearnClassifier(LinearSVC())
linear_svc.train(training_set)
linear_svc_predict = predict(linear_svc, testing_set)
linear_svc_accuracy = accuracy(linear_svc_predict)
print("LinearSVC classifier accuracy percent:", linear_svc_accuracy*100)

# slow
dt = DecisionTreeClassifier.train(training_set)
dt_predict = predict(dt, testing_set)
dt_accuracy = accuracy(dt_predict)
print("DecisionTree classifier accuracy percent:", dt_accuracy*100)

NaiveBayes classifier accuracy percent: 64.4880174291939
MultinomialNB classifier accuracy percent: 76.25272331154684
BernoulliNB classifier accuracy percent: 76.03485838779956
LogisticRegression classifier accuracy percent: 83.87799564270153
SGDClassifier classifier accuracy percent: 77.99564270152506
SVC classifier accuracy percent: 83.66013071895425
LinearSVC classifier accuracy percent: 83.87799564270153
DecisionTree classifier accuracy percent: 79.520697167756


In [21]:
phrases = ["Mow lawn", "Mow the lawn", "Buy new shoes", "Feed the dog", "Send report to Kyle", "Send the report to Kyle", "Peel the potatoes"]
features = [featurize(nlp(phrase)) for phrase in phrases]

predict_linear_svc = linear_svc.classify_many(features)
predict_logistic = logistic_regression.classify_many(features)
predict_sgd = sgd.classify_many(features)

print(f'LinearSVC: {predict_linear_svc}')
print(f'LogisticRegression: {predict_logistic}')
print(f'SGD: {predict_sgd}')

LinearSVC: ['pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos']
LogisticRegression: ['pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos']
SGD: ['pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos']


Interestingly, the highest performing classifiers (`LogisticRegression, LinearSVC: 88.89`) produced quite different results on our sample task list. (These notes appear to come from an earlier run than the cell output above, where both classifiers scored 83.88 and labeled every phrase `pos`.)

- "Mow lawn"
    - LinearSVC: neg
    - LogisticRegression: neg
- "Mow the lawn"
    - LinearSVC: pos
    - LogisticRegression: neg
- "Buy new shoes"
    - LinearSVC: pos
    - LogisticRegression: pos
- "Feed the dog"
    - LinearSVC: pos
    - LogisticRegression: neg
- "Send report to Kyle"
    - LinearSVC: neg
    - LogisticRegression: neg
- "Send the report to Kyle"
    - LinearSVC: pos
    - LogisticRegression: neg

Observations:
1. LogisticRegression seems _heavily_ biased to be negative.
2. LinearSVC seems more fragile when grammar is off (e.g., missing _the_'s) - however, this feels fixable with a more varied/realistic training set.

### Multiple Epochs

In [23]:
random.shuffle(training_set)

logistic_regression.train(training_set)
lr_predict = predict(logistic_regression, testing_set)
lr_accuracy = accuracy(lr_predict)
print("LogisticRegression classifier accuracy percent:", lr_accuracy*100)

sgd.train(training_set)
sgd_predict = predict(sgd, testing_set)
sgd_accuracy = accuracy(sgd_predict)
print("SGDClassifier classifier accuracy percent:", sgd_accuracy*100)

linear_svc.train(training_set)
linear_svc_predict = predict(linear_svc, testing_set)
linear_svc_accuracy = accuracy(linear_svc_predict)
print("LinearSVC classifier accuracy percent:", linear_svc_accuracy*100)

LogisticRegression classifier accuracy percent: 83.87799564270153
SGDClassifier classifier accuracy percent: 73.20261437908496
LinearSVC classifier accuracy percent: 83.87799564270153


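`logistic_regression` and `linear_svc` accuracies did not change after retraining on randomly shuffled training data. `sgd`'s accuracy however did (as I suspected it might from my deep learning course). One caveat: as far as I can tell, `SklearnClassifier.train` calls the underlying estimator's `fit`, which starts over from scratch instead of continuing from the previous weights. So each `train` call is a fresh training run, not an additional epoch, and the change in `sgd` accuracy most likely reflects the randomness of stochastic gradient descent (sample order) and not accumulated learning.

So let's retrain `sgd` a few more times for now (until I learn if multiple epochs can work with other types of classifiers)...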

In [24]:
num_epochs = 8
for i in range(num_epochs):
    random.shuffle(training_set)

    sgd.train(training_set)
    sgd_predict = predict(sgd, testing_set)
    sgd_accuracy = accuracy(sgd_predict)
    print(f"SGDClassifier classifier accuracy percent (epoch {i+1}):", sgd_accuracy*100)

SGDClassifier classifier accuracy percent (epoch 1): 77.99564270152506
SGDClassifier classifier accuracy percent (epoch 2): 84.31372549019608
SGDClassifier classifier accuracy percent (epoch 3): 81.48148148148148
SGDClassifier classifier accuracy percent (epoch 4): 82.78867102396515
SGDClassifier classifier accuracy percent (epoch 5): 77.77777777777779
SGDClassifier classifier accuracy percent (epoch 6): 78.21350762527233
SGDClassifier classifier accuracy percent (epoch 7): 82.13507625272331
SGDClassifier classifier accuracy percent (epoch 8): 82.13507625272331
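The accuracies above bounce around instead of climbing, which fits the idea that each `train` call restarts from scratch. For contrast, here is a pure-Python sketch of what real epochs look like: the weights persist between passes. This is a toy logistic-regression SGD written for illustration, not scikit-learn's implementation; `sgd_epoch`, `score`, `classify`, and the toy data are all assumptions. In scikit-learn itself, `SGDClassifier.partial_fit` is the incremental API that keeps weights between calls.

```python
import math
import random

def score(weights, feats):
    """Linear score of a sparse feature dict under the current weights."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def sgd_epoch(weights, data, rng, lr=0.1):
    """One shuffled pass over the data; updates `weights` in place."""
    order = list(data)
    rng.shuffle(order)
    for feats, label in order:
        z = max(-30.0, min(30.0, score(weights, feats)))  # clamp to keep exp() safe
        p_pos = 1.0 / (1.0 + math.exp(-z))
        error = (1.0 if label == 'pos' else 0.0) - p_pos
        for name, value in feats.items():
            weights[name] = weights.get(name, 0.0) + lr * error * value

def classify(weights, feats):
    return 'pos' if score(weights, feats) > 0 else 'neg'

# Made-up feature sets in the same shape as `featuresets`
toy_training_set = [
    ({'VERB': 1, 'CHILD_DEP_DOBJ': 1}, 'pos'),
    ({'VERB': 1, 'FOLLOWING_POS_DET': 1}, 'pos'),
    ({'VERB': 1, 'CHILD_DEP_NSUBJ': 1}, 'neg'),
    ({}, 'neg'),
]

weights = {}              # created once, reused across epochs
rng = random.Random(0)
for epoch in range(50):
    sgd_epoch(weights, toy_training_set, rng)

train_accuracy = sum(
    classify(weights, feats) == label for feats, label in toy_training_set
) / len(toy_training_set)
print("training accuracy after 50 epochs:", train_accuracy)
```

The key difference from the loop above is that `weights` is created once outside the loop, so every epoch starts from where the previous one ended.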


### Analysis

#### Most Informative Features

In [25]:
# https://stackoverflow.com/a/11140887
def show_most_informative_features(vectorizer, clf, n=20):
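    # get_feature_names() was removed in scikit-learn 1.2; newer releases use get_feature_names_out()
    get_names = getattr(vectorizer, 'get_feature_names_out', None) or vectorizer.get_feature_names
    feature_names = get_names()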
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
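The two slices in `show_most_informative_features` are easy to misread, so here is the same pairing on a handful of made-up coefficients (the numbers and names are illustrative only). After sorting ascending, `[:n]` gives the `n` most negative weights and `[:-(n + 1):-1]` walks backwards from the end to give the `n` most positive ones, largest first.

```python
# Made-up coefficients, for illustration only
coefs = [0.9, -1.2, 0.1, 2.0, -0.4]
names = ['VERB', 'CHILD_DEP_NSUBJ', 'FOLLOWING_POS_ADV',
         'CHILD_DEP_DOBJ', 'CHILD_DEP_AUX']

pairs = sorted(zip(coefs, names))     # ascending by coefficient
n = 2
most_negative = pairs[:n]             # left-hand column of the printout
most_positive = pairs[:-(n + 1):-1]   # right-hand column, largest first

for (c_neg, f_neg), (c_pos, f_pos) in zip(most_negative, most_positive):
    print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (c_neg, f_neg, c_pos, f_pos))
```

Assuming the labels are encoded in sorted order (`neg` = 0, `pos` = 1), negative coefficients push a phrase toward `neg` and positive ones toward `pos`.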

In [26]:
#print("Naive Bayes'")
#classifier.show_most_informative_features(15)

print("Logistic Regression's")
print('Most Informative Features')
show_most_informative_features(logistic_regression._vectorizer, logistic_regression._clf, 15)

print('LinearSVC\'s')
print('Most Informative Features')
show_most_informative_features(linear_svc._vectorizer, linear_svc._clf, 15)

Logistic Regression's
Most Informative Features
	-2.3360	CHILD_DEP_AGENT		2.0531	CHILD_POSTAG_-RRB-
	-1.9459	CHILD_POSTAG_HYPH		1.4447	CHILD_DEP_NPADVMOD
	-1.7239	CHILD_POSTAG_``		1.3925	FOLLOWING_POS_NUM
	-1.5496	CHILD_DEP_NSUBJ		1.3925	FOLLOWING_POSTAG_CD
	-1.4814	CHILD_DEP_INTJ 		1.2902	FOLLOWING_POSTAG_RB
	-1.4610	PARENT_DEP_AMOD		1.2464	CHILD_DEP_DOBJ||XCOMP
	-1.3615	PARENT_POSTAG_VBZ		1.2198	CHILD_POSTAG_WRB
	-1.3491	FOLLOWING_POSTAG_VB		1.2187	VERB           
	-1.3442	FOLLOWING_POSTAG_WRB		1.2039	CHILD_POSTAG_NFP
	-1.2937	FOLLOWING_POSTAG_-RRB-		1.1131	PARENT_POS_PROPN
	-1.2319	CHILD_DEP_ATTR 		1.1131	PARENT_POSTAG_NNP
	-1.2314	CHILD_DEP_NEG  		1.0512	CHILD_POSTAG_MD
	-1.2281	CHILD_POSTAG_WP		1.0500	FOLLOWING_POSTAG_JJ
	-1.1658	PARENT_DEP_DATIVE		1.0041	CHILD_POS_PROPN
	-1.1486	CHILD_DEP_AUX  		0.9206	PARENT_DEP_DEP 
LinearSVC's
Most Informative Features
	-1.5210	CHILD_POSTAG_``		1.4167	CHILD_DEP_ADVMOD||XCOMP
	-1.1372	CHILD_DEP_AGENT		1.3514	FOLLOWING_POSTAG_JJS
	-1.0790	FOLLO

#### Scikit Learn metrics: Confusion matrix, Classification report, F1 score, Log loss

http://scikit-learn.org/stable/modules/model_evaluation.html

**TODO** log loss requires `predict_proba`

In [27]:
from sklearn import metrics
    
# http://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics
def f1_macro(predict):
    labels, predictions = zip(*predict)
    return metrics.f1_score(labels, predictions, average='macro')

def classification_report(predict):
    labels, predictions = zip(*predict)
    return metrics.classification_report(labels, predictions)

def confusion_matrix(predict):
    labels, predictions = zip(*predict)
    print('layout:\n[[tn   fp]\n [fn   tp]]\n')
    return metrics.confusion_matrix(labels, predictions)

In [28]:
f1_macro(nb_predict)

0.6095900061052929

In [29]:
print(classification_report(nb_predict))

             precision    recall  f1-score   support

        neg       0.58      0.96      0.73       225
        pos       0.91      0.34      0.49       234

avg / total       0.75      0.64      0.61       459



In [30]:
print(confusion_matrix(nb_predict))

layout:
[[tn   fp]
 [fn   tp]]

[[217   8]
 [155  79]]
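As a sanity check, the numbers in the classification report and the macro F1 above can be re-derived from this confusion matrix by hand. The sketch below uses only the four counts printed above; `prf` is a hypothetical helper.

```python
# Counts from the confusion matrix above, with 'pos' as the positive class
tn, fp, fn, tp = 217, 8, 155, 79

def prf(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

pos_p, pos_r, pos_f1 = prf(tp, fp, fn)
# For the 'neg' class the roles swap: its hits are tn, its false alarms are fn
neg_p, neg_r, neg_f1 = prf(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (pos_f1 + neg_f1) / 2

print(f"neg  precision {neg_p:.2f}  recall {neg_r:.2f}  f1 {neg_f1:.2f}")
print(f"pos  precision {pos_p:.2f}  recall {pos_r:.2f}  f1 {pos_f1:.2f}")
print(f"accuracy {accuracy:.4f}  macro f1 {macro_f1:.4f}")
```

The low recall for `pos` (79 of 234) is the main story here: NaiveBayes misses about two thirds of the imperative phrases.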


In [None]:
#log_loss(nb_predict)
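Log loss can't be computed from `nb_predict` as it stands, because that list only holds hard labels. It needs the predicted probability of `pos` for each sample. Here is a minimal sketch of the binary formula, assuming those probabilities are available; `binary_log_loss` and the toy inputs are hypothetical. In scikit-learn, `LogisticRegression` exposes `predict_proba` while `LinearSVC` does not.

```python
import math

def binary_log_loss(labels, pos_probs, positive='pos', eps=1e-15):
    """Mean negative log-likelihood of the true labels.

    labels    -- true labels, e.g. ['pos', 'neg', ...]
    pos_probs -- predicted probability of the positive label per sample
    """
    total = 0.0
    for label, p in zip(labels, pos_probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never happens
        total += math.log(p) if label == positive else math.log(1 - p)
    return -total / len(labels)

# Made-up probabilities: one confident hit, one mild miss toward 'pos'
print(binary_log_loss(['pos', 'neg'], [0.9, 0.2]))
```

Unlike accuracy, this penalizes confident wrong answers much more than hesitant ones.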

**Next up**: digging into the results (confusion matrix, most informative features), comparing results to LUIS model

## Next Steps and Improvements

1. Training set may be too specific/not relevant enough (recipe instructions for positive dataset, recipe descriptions+short movie reviews for negative dataset)
2. Throwing features into a blender - need to understand value of each
3. Need to review different classifiers, strengths/weaknesses
4. Phrase vectorizations of all 0s
5. Varying feature vector lengths
6. Voting
7. Reducing dimensionality
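For item 6, a first pass at voting could simply take the majority label across the prediction lists that `classify_many` already returns. This is a rough sketch; `majority_vote` is a hypothetical helper, and the three toy prediction lists are made up.

```python
from collections import Counter

def majority_vote(*prediction_lists):
    """Combine per-classifier label lists into (label, agreement) pairs."""
    voted = []
    for votes in zip(*prediction_lists):
        label, count = Counter(votes).most_common(1)[0]
        voted.append((label, count / len(votes)))
    return voted

# Toy predictions from three classifiers on two phrases
combined = majority_vote(
    ['pos', 'neg'],
    ['pos', 'neg'],
    ['neg', 'neg'],
)
print(combined)
```

With an even number of voters a tie goes to whichever label was seen first, so an odd number of classifiers is the safer choice.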

# Things abandoned

## NLTK

I needed a library that supports dependency parsing, which NLTK does not... so I thought I'd add the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) toolkit and [its associated software](https://nlp.stanford.edu/software/) to NLTK. However, there are many conflicting instructions for installing the Java-based project, depending on NLTK version used. By the time I figured this out, the installation had become a time sink. So I abandoned this effort in favor of Spacy.io.

I might return this way if I want to improve results/implement a voter system between the various linguistic and classification methods later.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Tokenization

In [None]:
# `lines` holds (text, label) tuples, so unpack the text before tokenizing
sentences = [s for (l, _label) in lines for s in sent_tokenize(l)] # punkt
sentences

In [None]:
tagged_sentences = []
for s in sentences:
    words = word_tokenize(s)
    tagged = nltk.pos_tag(words) # averaged_perceptron_tagger
    tagged_sentences.append(tagged)
print(tagged_sentences)

#### Note: POS accuracy

`Run down to the shop, will you, Peter` is parsed unexpectedly by `nltk.pos_tag`:
> `[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]`

`Run` is tagged as a `NNP (proper noun, singular)`

I expected an output more like what the [Stanford Parser](http://nlp.stanford.edu:8080/parser/) provides:
> `Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

`Run` is tagged as a `VBG (verb, gerund/present participle)` - still not quite the `VB` I want, but at least it's a `V*`

_MEANWHILE..._

`nltk.pos_tag` did better with:
> `[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]`

Compared to [Stanford CoreNLP](http://nlp.stanford.edu:8080/corenlp/process) (note that this is different than what [Stanford Parser](http://nlp.stanford.edu:8080/parser/) outputs):
> `(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))`

Concern: _clean_ as `VB (verb, base form)` vs `JJ (adjective)` 

**IMPROVE** POS taggers should vote: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc.

Note what Spacy POS tagger did with `Run down to the shop, will you, Peter`:

`Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

Here `Run` is the `VB` I expected from POS tagging (compared to `nltk.pos_tag` result of `NNP`). Also note that Spacy collapses `the shop` into a single unit, which should be helpful during featurization.

### Featurization

In [None]:
import re
from collections import defaultdict

featuresets = []
for ts in tagged_sentences:
    s_features = defaultdict(int)
    for idx, tup in enumerate(ts):
        #print(tup)
        pos = tup[1]
        # FeatureName.VERB
        is_verb = re.match(r'VB.?', pos) is not None
        print(tup, is_verb)
        if is_verb:
            s_features[FeatureName.VERB.name] += 1
            # FOLLOWING_POSTAG (nltk.pos_tag returns fine-grained Penn Treebank tags)
            next_idx = idx + 1
            if next_idx < len(ts):
                s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{ts[next_idx][1]}'] += 1
            # VERB_MODIFIER
            # VERB_MODIFYING
    # make sure the VERB key exists (as 0) when no verb was found,
    # without resetting the count on every non-verb token
    s_features.setdefault(FeatureName.VERB.name, 0)
    featuresets.append(dict(s_features))

print()
print(featuresets)

### [Stanford NLP](https://nlp.stanford.edu/software/)
Setup guide used: https://stackoverflow.com/a/34112695

In [None]:
# Get dependency parser, NER, POS tagger
!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
!unzip stanford-parser-full-2017-06-09.zip
!unzip stanford-ner-2017-06-09.zip
!unzip stanford-postagger-full-2017-06-09.zip

In [None]:
from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer