## Data and Setup

In [1]:
import os

In [2]:
BASE_DIR = os.getcwd()
pos_data_path = BASE_DIR + '/pos.txt'
neg_data_path = BASE_DIR + '/neg.txt'

In [3]:
with open(pos_data_path, 'r', encoding='utf-8') as f:
    pos_data = f.read()
with open(neg_data_path, 'r', encoding='utf-8') as f:
    neg_data = f.read()

In [4]:
pos_data_split = pos_data.split('\n')
neg_data_split = neg_data.split('\n')

num_pos = len(pos_data_split)
num_neg = len(neg_data_split)
# 50/50 split between the number of positive and negative samples
num = num_pos if num_pos > num_neg else num_neg

In [5]:
lines = []
for l in pos_data_split[:num]:
    lines.append((l, 'pos'))
for l in neg_data_split[:num]:
    lines.append((l, 'neg'))

In [6]:
from enum import Enum, auto
class FeatureName(Enum):
    VERB = auto() # does this sentence contain a VB*?
    FOLLOWING = auto() # is the following word a <POS>? postfixed with _<POS>
    VERB_CHILD_DEP = auto() # what are the child (outgoing edges) dependencies (arc labels)? postfixed with _<DEP>
    VERB_HEAD_DEP = auto() # what are the head (incoming edge) dependencies (arc labels)? postfixed with _<DEP>
    VERB_CHILD_POS = auto() # is the child dependency a <POS>? postfixed with _<POS>
    VERB_HEAD_POS = auto() # is the head dependency a <POS>? postfixed with _<POS>

## [spaCy.io](https://spacy.io/)
_Because Stanford CoreNLP is hard to install for Python_

Found Spacy through an article on ["Training a Classifier for Relation Extraction from Medical Literature"](https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/) ([GitHub](https://github.com/CatalystCode/corpus-to-graph-ml))

<img src="nltk_library_comparison.png" alt="NLTK library comparison chart https://spacy.io/docs/api/#comparison" style="width: 400px; margin: 0;"/>

In [None]:
!conda config --add channels conda-forge
!conda install spacy
!python -m spacy download en

### Using the Spacy Data Model for NLP

In [7]:
import spacy
nlp = spacy.load('en')

Spacy's sentence segmentation is lacking... https://github.com/explosion/spaCy/issues/235. So each '\n' will start a new Spacy Doc.

In [8]:
def create_spacy_docs(ll):
    dd = [(nlp(l[0]), l[1]) for l in ll]
    # collapse noun phrases into single compounds
    for d in dd:
        for np in d[0].noun_chunks:
            np.merge(np.root.tag_, np.text, np.root.ent_type_)
    return dd

In [9]:
docs = create_spacy_docs(lines)

### NLP output

Tokenization, POS tagging, and syntactic parsing happened automatically with the `nlp(line)` calls above! So let's look at these outputs.

https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward

In [10]:
for doc in docs[:10]:
    print(list(doc[0].sents))

[Be kind]
[Get out of here]
[Look this over]
[Paul, do your homework now]
[Do not clean soot off the window]
[Turn your phones off, please]
[Run down to the shop, will you, Peter]
[Look at this]
[Stir until smooth]
[Pick up milk]


In [11]:
for doc in docs[:10]:
    print(list(doc[0].noun_chunks))

[]
[]
[]
[Paul, your homework]
[soot, the window]
[your phones]
[the shop, you]
[]
[]
[milk]


[Spacy's dependency graph visualization](https://demos.explosion.ai/displacy)

In [12]:
for doc in docs[:10]:
    for token in doc[0]:
        print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))

Be ROOT be VERB VB Be [kind]
kind acomp kind ADJ JJ Be []
Get ROOT get VERB VB Get [out]
out prep out ADP IN Get [of]
of prep of ADP IN out [here]
here pcomp here ADV RB of []
Look ROOT look VERB VB Look [this, over]
this dobj this DET DT Look []
over prep over ADP IN Look []
Paul nsubj Paul PROPN NNP do [,]
, punct , PUNCT , Paul []
do ROOT do VERB VB do [Paul, your homework, now]
your homework dobj your homework NOUN NN do []
now advmod now ADV RB do []
Do ROOT do VERB VBP Do [clean]
not neg not ADV RB clean []
clean acomp clean ADJ JJ Do [not, soot]
soot dobj soot NOUN NN clean [off]
off prep off ADP IN soot [the window]
the window pobj the window NOUN NN off []
Turn ROOT turn VERB VB Turn [your phones, off, ,, please]
your phones dobj your phones NOUN NNS Turn []
off prt off PART RP Turn []
, punct , PUNCT , Turn []
please intj please INTJ UH Turn []
Run ROOT run VERB VB Run [down, to, ,, will]
down prt down PART RP Run []
to prep to ADP IN Run [the shop]
the shop pobj the shop NOU

Note what Spacy POS tagger did with `Run down to the shop, will you Peter`:

`Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

where `Run` is the `VB` I expected earlier from POS tagging. Also note that `the shop` has been collapsed to a single compound, which will be helpful during featurization.

### Featurization

In [13]:
import re
from collections import defaultdict

def featurize(d):
    fs = []
    s_features = defaultdict(int)
    for idx, token in enumerate(d):
        #print(token, token.pos_, token.tag_)
        if re.match(r'VB.?', token.tag_) is not None: # note: not using token.pos == VERB because this also includes BES, HVS, MD tags 
            s_features[FeatureName.VERB.name] += 1
            # FOLLOWING_POS
            next_idx = idx + 1;
            if next_idx < len(d):
                s_features[f'{FeatureName.FOLLOWING.name}_{d[next_idx].tag_}'] += 1
            # VERB_HEAD_DEP
            # VERB_HEAD_POS
            '''
            "Because the syntactic relations form a tree, every word has exactly one head.
            You can therefore iterate over the arcs in the tree by iterating over the words in the sentence."
            https://spacy.io/docs/usage/dependency-parse#navigating
            '''
            if (token.head is not token):
                s_features[f'{FeatureName.VERB_HEAD_DEP.name}_{token.head.dep_.upper()}'] += 1
                s_features[f'{FeatureName.VERB_HEAD_POS.name}_{token.head.tag_}'] += 1
            # VERB_CHILD_DEP
            # VERB_CHILD_POS
            for child in token.children:
                s_features[f'{FeatureName.VERB_CHILD_DEP.name}_{child.dep_.upper()}'] += 1
                s_features[f'{FeatureName.VERB_CHILD_POS.name}_{child.tag_}'] += 1            
    return dict(s_features)
        #print(dict(s_features))
    #print()

#print(featuresets, len(featuresets))

In [14]:
featuresets = [(featurize(doc[0]), doc[1]) for doc in docs]

### Building a recipe corpus

I wrote and ran `epicurious_recipes.py`\* to scrape Epicurious.com for recipe instructions and descriptions. Output is `epicurious-pos.txt` and `epicurious-neg.txt`.

\* _script loosely based off of https://github.com/benosment/hrecipe-parse_

Note that building a training set entirely from recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.

To (hopefully?) correct for this a bit, I will combine the short movie reviews at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.

Ultimately though, this recipe corpus is a stopgap for more relevant corpus later on, so I won't worry further about this.

### Classification

In [15]:
import random

random.shuffle(featuresets)

split_num = round(num / 5)

# train and test sets
training_set = featuresets[:split_num]
testing_set =  featuresets[split_num:]

In [16]:
from nltk import classify, NaiveBayesClassifier
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

classifier = NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy percent:", (classify.accuracy(classifier, testing_set))*100)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (classify.accuracy(SGDClassifier_classifier, testing_set))*100)

##SVC_classifier = SklearnClassifier(SVC())
##SVC_classifier.train(training_set)
##print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (classify.accuracy(LinearSVC_classifier, testing_set))*100)

Naive Bayes Algo accuracy percent: 84.00358262427228
MNB_classifier accuracy percent: 87.0577698163905
BernoulliNB_classifier accuracy percent: 77.25929243170623
LogisticRegression_classifier accuracy percent: 88.14151365875503
SGDClassifier_classifier accuracy percent: 87.8280340349306
LinearSVC_classifier accuracy percent: 88.40125391849529


In [17]:
phrase = "Pick up milk"
feature = featurize(nlp(phrase))

predict_linearSVC = LinearSVC_classifier.classify_many(feature)[0]
predict_naivebayes = classifier.classify_many([feature])[0]

print(f'LinearSVC: {predict_linearSVC}')
print(f'NaiveBayes: {predict_naivebayes}')

LinearSVC: pos
NaiveBayes: neg


**Next up**: digging into the results (confusion matrix), improving results, comparing results to LUIS model, reducing dimensionality, VoteClassifier?

### Analysis

In [18]:
print("Naive Bayes'")
classifier.show_most_informative_features(15)

Naive Bayes'
Most Informative Features
      VERB_CHILD_POS_WDT = 1                 neg : pos    =     36.8 : 1.0
       VERB_HEAD_POS_NNP = 1                 pos : neg    =     34.2 : 1.0
       VERB_CHILD_POS_WP = 1                 neg : pos    =     18.4 : 1.0
       VERB_HEAD_POS_VBZ = 1                 neg : pos    =     11.8 : 1.0
      VERB_CHILD_POS_PRP = 3                 neg : pos    =      8.7 : 1.0
       VERB_HEAD_POS_VBZ = 2                 neg : pos    =      8.3 : 1.0
      VERB_HEAD_DEP_NMOD = 1                 pos : neg    =      8.1 : 1.0
 VERB_CHILD_DEP_NPADVMOD = 2                 pos : neg    =      8.1 : 1.0
    VERB_CHILD_DEP_APPOS = 2                 pos : neg    =      8.1 : 1.0
     VERB_HEAD_DEP_APPOS = 3                 pos : neg    =      8.1 : 1.0
     VERB_CHILD_DEP_PREP = 5                 pos : neg    =      8.1 : 1.0
     VERB_CHILD_DEP_DOBJ = 6                 pos : neg    =      8.1 : 1.0
       VERB_CHILD_POS_IN = 6                 pos : neg    =  

In [19]:
# https://stackoverflow.com/a/11140887
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
       
print('LinearSVC\'s')
print('Most Informative Features')
show_most_informative_features(LinearSVC_classifier._vectorizer, LinearSVC_classifier._clf, 15)

LinearSVC's
Most Informative Features
	-1.3291	VERB_HEAD_DEP_NPADVMOD		1.3190	FOLLOWING_MD   
	-1.2536	VERB_HEAD_POS_IN		1.2541	VERB_HEAD_POS_NNP
	-1.1268	VERB_HEAD_POS_CD		1.0593	FOLLOWING_PDT  
	-0.9553	VERB_CHILD_POS_WDT		1.0263	VERB_HEAD_DEP_OPRD
	-0.9482	FOLLOWING_HYPH 		0.9821	VERB_HEAD_DEP_ADVMOD||CONJ
	-0.9174	FOLLOWING_JJR  		0.9790	VERB_CHILD_POS_-RRB-
	-0.8975	VERB_CHILD_DEP_AGENT		0.7315	FOLLOWING_CD   
	-0.8735	VERB_HEAD_POS_RB		0.7277	VERB_HEAD_DEP_CSUBJ
	-0.8319	VERB_HEAD_DEP_DET		0.6689	VERB_CHILD_POS_NNP
	-0.8150	VERB_CHILD_POS_POS		0.6601	VERB_CHILD_DEP_AMOD
	-0.8150	FOLLOWING_POS  		0.6315	VERB           
	-0.8147	VERB_CHILD_POS_HYPH		0.5917	VERB_CHILD_POS_PDT
	-0.7742	VERB_CHILD_DEP_NMOD		0.5542	VERB_HEAD_DEP_ADVMOD
	-0.7741	VERB_CHILD_POS_``		0.5487	VERB_CHILD_DEP_NPADVMOD
	-0.7512	VERB_HEAD_POS_UH		0.5190	VERB_CHILD_DEP_MARK


# Things abandoned

## NLTK

I needed a library that supports dependency parsing, which NLTK does not... so I thought I'd add the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) toolkit and [its associated software](https://nlp.stanford.edu/software/) to NLTK. However, there are many conflicting instructions for installing the Java-based project, depending on NLTK version used. By the time I figured this out, the installation had become a time sink. So I abandoned this effort in favor of Spacy.io.

I might return this way if I want to improve results/implement a voter system between the various linguistic and classification methods later.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Tokenization

In [None]:
sentences = [s for l in lines for s in sent_tokenize(l)] # punkt
sentences

In [None]:
tagged_sentences = []
for s in sentences:
    words = word_tokenize(s)
    tagged = nltk.pos_tag(words) # averaged_perceptron_tagger
    tagged_sentences.append(tagged)
print(tagged_sentences)

#### Note: POS accuracy

`Run down to the shop, will you, Peter` is parsed unexpectedly by `nltk.pos_tag`:
> `[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]`

`Run` is tagged as a `NNP (proper noun, singular)`

I expected an output more like what the [Stanford Parser](http://nlp.stanford.edu:8080/parser/) provides:
> `Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

`Run` is tagged as a `VGB (verb, gerund/present participle)` - still not quite the `VB` I want, but at least it's a `V*`

_MEANWHILE..._

`nltk.pos_tag` did better with:
> `[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]`

Compared to [Stanford CoreNLP](http://nlp.stanford.edu:8080/corenlp/process) (note that this is different than what [Stanford Parser](http://nlp.stanford.edu:8080/parser/) outputs):
> `(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))`

Concern: _clean_ as `VB (verb, base form)` vs `JJ (adjective)` 

**IMPROVE** POS taggers should vote: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc.

### Featurization

In [None]:
import re
from collections import defaultdict

featuresets = []
for ts in tagged_sentences:
    s_features = defaultdict(int)
    for idx, tup in enumerate(ts):
        #print(tup)
        pos = tup[1]
        # FeatureName.VERB
        is_verb = re.match(r'VB.?', pos) is not None
        print(tup, is_verb)
        if is_verb:
            s_features[FeatureName.VERB] += 1
            # FOLLOWING_POS
            next_idx = idx + 1;
            if next_idx < len(ts):
                s_features[f'{FeatureName.FOLLOWING}_{ts[next_idx][1]}'] += 1
            # VERB_MODIFIER
            # VERB_MODIFYING
        else:
            s_features[FeatureName.VERB] = 0
    featuresets.append(dict(s_features))

print()
print(featuresets)

### [Stanford NLP](https://nlp.stanford.edu/software/)
Setup guide used: https://stackoverflow.com/a/34112695

In [None]:
# Get dependency parser, NER, POS tagger
!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
!unzip stanford-parser-full-2017-06-09.zip
!unzip stanford-ner-2017-06-09.zip
!unzip stanford-postagger-full-2017-06-09.zip

In [None]:
from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer