## Data and Setup

In [1]:
import os

In [2]:
BASE_DIR = os.getcwd()
pos_data_path = os.path.join(BASE_DIR, 'pos.txt')
neg_data_path = os.path.join(BASE_DIR, 'neg.txt')

In [3]:
with open(pos_data_path, 'r') as f:
    pos_data = f.read()
with open(neg_data_path, 'r') as f:
    neg_data = f.read()

In [4]:
# one entry per line of each file, positive examples first
lines = pos_data.split('\n') + neg_data.split('\n')
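
The merge above no longer records which file each line came from. A minimal sketch of keeping that information, assuming `pos.txt` holds the positive examples and `neg.txt` the negative ones. The helper name `label_lines`, the label strings, and the blank-line filter are illustrative assumptions, not part of the notebook:

```python
def label_lines(text, label):
    """Pair each non-blank line of `text` with `label` (hypothetical helper)."""
    return [(line, label) for line in text.split('\n') if line.strip()]

# Stand-in strings; in the notebook these would be pos_data and neg_data.
pos_example = 'Be kind\nGet out of here\n'
neg_example = 'Help is on the way\n'

labeled = label_lines(pos_example, 'pos') + label_lines(neg_example, 'neg')
print(labeled)
```

Keeping the label next to the text would make it possible to attach it to each feature set later.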

In [5]:
from enum import Enum, auto
class FeatureName(Enum):
    VERB = auto() # does this sentence contain a VB*?
    FOLLOWING = auto() # is the following word a <POS>? postfixed with _<POS>
    VERB_CHILD_DEP = auto() # what are the child (outgoing edges) dependencies (arc labels)? postfixed with _<DEP>
    VERB_HEAD_DEP = auto() # what are the head (incoming edge) dependencies (arc labels)? postfixed with _<DEP>
    VERB_CHILD_POS = auto() # is the child dependency a <POS>? postfixed with _<POS>
    VERB_HEAD_POS = auto() # is the head dependency a <POS>? postfixed with _<POS>
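
The `_<POS>` and `_<DEP>` postfixes mentioned in the comments are appended to the enum member when a feature key is built. A small sketch of that convention; `feature_key` is a hypothetical helper, and a trimmed copy of the enum is repeated so the block runs on its own:

```python
from enum import Enum, auto

class FeatureName(Enum):
    VERB = auto()
    FOLLOWING = auto()

def feature_key(name, postfix):
    """Build a string key such as 'FeatureName.FOLLOWING_JJ' (hypothetical helper)."""
    # str() of a plain Enum member is 'ClassName.MEMBER'
    return f'{str(name)}_{postfix}'

key = feature_key(FeatureName.FOLLOWING, 'JJ')
print(key)
```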

## NLTK

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Tokenization

In [None]:
sentences = [s for l in lines for s in sent_tokenize(l)] # punkt
sentences

In [None]:
tagged_sentences = []
for s in sentences:
    words = word_tokenize(s)
    tagged = nltk.pos_tag(words) # averaged_perceptron_tagger
    tagged_sentences.append(tagged)
print(tagged_sentences)

#### Note: POS accuracy

`Run down to the shop, will you, Peter` is tagged unexpectedly by `nltk.pos_tag`:
> `[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]`

`Run` is tagged as `NNP (proper noun, singular)`.

I expected an output more like what the [Stanford Parser](http://nlp.stanford.edu:8080/parser/) provides:
> `Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

`Run` is tagged as `VBG (verb, gerund/present participle)`. That is still not quite the `VB` I want, but at least it's a `V*`.

_MEANWHILE..._

`nltk.pos_tag` did better with:
> `[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]`

Compare that with [Stanford CoreNLP](http://nlp.stanford.edu:8080/corenlp/process) (note that this differs from what the [Stanford Parser](http://nlp.stanford.edu:8080/parser/) outputs):
> `(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))`

Concern: `nltk.pos_tag` tags _clean_ as `VB (verb, base form)`, while CoreNLP tags it as `JJ (adjective)`.

**IMPROVE** Have several POS taggers vote on each tag: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc.
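
One way the voting idea could look is a per-token majority vote. This is only a sketch: `vote_tags` is a hypothetical helper, it assumes every tagger uses the same tokenization, and the third tag sequence is invented purely to give the vote a tie-breaker.

```python
from collections import Counter

def vote_tags(tag_sequences):
    """Majority vote per token over equally long tag sequences (hypothetical helper).

    Ties go to the tagger listed first.
    """
    lengths = {len(seq) for seq in tag_sequences}
    if len(lengths) != 1:
        raise ValueError('all taggers must tag the same tokenization')
    voted = []
    for tags_at_position in zip(*tag_sequences):
        voted.append(Counter(tags_at_position).most_common(1)[0][0])
    return voted

nltk_tags = ['VB', 'RB', 'VB', 'NN', 'IN', 'DT', 'NN']     # nltk.pos_tag output quoted above
corenlp_tags = ['VB', 'RB', 'JJ', 'NN', 'IN', 'DT', 'NN']  # leaves of the CoreNLP parse quoted above
third_tags = ['VB', 'RB', 'VB', 'NN', 'RP', 'DT', 'NN']    # invented third tagger, illustration only

voted = vote_tags([nltk_tags, corenlp_tags, third_tags])
print(voted)
```

In practice the taggers may tokenize differently, so the sequences would need to be aligned before voting.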

### Featurization

In [None]:
import re
from collections import defaultdict

featuresets = []
for ts in tagged_sentences:
    s_features = defaultdict(int)
    for idx, tup in enumerate(ts):
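        pos = tup[1]
        # FeatureName.VERB: count tokens tagged VB, VBD, VBG, VBN, VBP, or VBZ
        is_verb = re.match(r'VB.?', pos) is not None
        print(tup, is_verb)
        if is_verb:
            s_features[FeatureName.VERB] += 1
            # FeatureName.FOLLOWING: tag of the token right after the verb
            next_idx = idx + 1
            if next_idx < len(ts):
                s_features[f'{FeatureName.FOLLOWING}_{ts[next_idx][1]}'] += 1
            # VERB_MODIFIER
            # VERB_MODIFYING
    # a sentence without any verb still gets an explicit VERB count of 0;
    # resetting inside the loop would wipe the count at every non-verb token
    s_features.setdefault(FeatureName.VERB, 0)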
    featuresets.append(dict(s_features))

print()
print(featuresets)
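
To check the counting logic without running the tagger, the loop can be wrapped in a function and fed a hand-tagged sentence. A sketch, assuming the intent is one `VERB` count per `VB*` token and one `FOLLOWING_<POS>` count per verb. Plain string keys replace `FeatureName` here to keep the block self-contained, and `featurize` is a hypothetical name:

```python
import re
from collections import defaultdict

def featurize(tagged_sentence):
    """Count VB* tokens and the tag that follows each one (hypothetical helper)."""
    features = defaultdict(int)
    for idx, (_word, pos) in enumerate(tagged_sentence):
        if re.match(r'VB.?', pos):
            features['VERB'] += 1
            if idx + 1 < len(tagged_sentence):
                features[f'FOLLOWING_{tagged_sentence[idx + 1][1]}'] += 1
    return dict(features)

# tags copied from the nltk.pos_tag output quoted in the POS accuracy note
tagged = [('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'),
          ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]
features = featurize(tagged)
print(features)
```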

Next, I need a library with a ready-to-use dependency parser, which NLTK does not provide on its own (it only wraps external parsers such as Stanford's).

## [spaCy.io](https://spacy.io/)
_Because Stanford NLP is hard to install_

<img src="nltk_library_comparison.png" alt="NLTK library comparison chart" style="width: 400px; margin: 0;"/>

In [None]:
!conda config --add channels conda-forge
!conda install spacy
!python -m spacy download en

### Using the spaCy Data Model for NLP

In [6]:
import spacy
from spacy.tokens.doc import Doc
nlp = spacy.load('en')

spaCy's sentence segmentation is unreliable on this input (see https://github.com/explosion/spaCy/issues/235), so each `'\n'`-separated line gets its own spaCy `Doc`.

In [7]:
docs = [nlp(line) for line in lines]
docs

[Be kind,
 Get out of here,
 Look this over,
 Paul, do your homework now,
 Do not clean soot off the window,
 Turn your phones off, please,
 Run down to the shop, will you, Peter,
 Look at this,
 Help is on the way,
 I can't feel my face when I'm with you,
 Will you marry me?]

In [8]:
# collapse noun phrases into single compounds
for doc in docs:
    for np in doc.noun_chunks:
        np.merge(np.root.tag_, np.text, np.root.ent_type_)

### NLP output

Tokenization, POS tagging, and syntactic parsing happened automatically with the `nlp(line)` calls above! So let's look at these outputs.

https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward

In [9]:
for doc in docs:
    print(list(doc.sents))

[Be kind]
[Get out of here]
[Look this over]
[Paul, do your homework now]
[Do not clean soot off the window]
[Turn your phones off, please]
[Run down to the shop, will you, Peter]
[Look at this]
[Help is on the way]
[I can't feel my face when I'm with you]
[Will you marry me?]


In [10]:
for doc in docs:
    print(list(doc.noun_chunks))

[]
[]
[]
[Paul, your homework]
[soot, the window]
[your phones]
[the shop, you]
[]
[Help, the way]
[I, my face, I, you]
[you, me]


[spaCy's dependency graph visualization](https://demos.explosion.ai/displacy)

In [11]:
for doc in docs:
    for token in doc:
        print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))

Be ROOT be VERB VB Be [kind]
kind acomp kind ADJ JJ Be []
Get ROOT get VERB VB Get [out]
out prep out ADP IN Get [of]
of prep of ADP IN out [here]
here pcomp here ADV RB of []
Look ROOT look VERB VB Look [this, over]
this dobj this DET DT Look []
over prep over ADP IN Look []
Paul nsubj Paul PROPN NNP do [,]
, punct , PUNCT , Paul []
do ROOT do VERB VB do [Paul, your homework, now]
your homework dobj your homework NOUN NN do []
now advmod now ADV RB do []
Do ROOT do VERB VBP Do [clean]
not neg not ADV RB clean []
clean acomp clean ADJ JJ Do [not, soot]
soot dobj soot NOUN NN clean [off]
off prep off ADP IN soot [the window]
the window pobj the window NOUN NN off []
Turn ROOT turn VERB VB Turn [your phones, off, ,, please]
your phones dobj your phones NOUN NNS Turn []
off prt off PART RP Turn []
, punct , PUNCT , Turn []
please intj please INTJ UH Turn []
Run ROOT run VERB VB Run [down, to, ,, will]
down prt down PART RP Run []
to prep to ADP IN Run [the shop]
the shop pobj the shop NOU

Note what the spaCy POS tagger did with `Run down to the shop, will you, Peter`:

`Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`

where `Run` is the `VB` I expected earlier from POS tagging. Also note that `the shop` has been collapsed to a single compound, which will be helpful during featurization.
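
The `head` and `children` columns in the printed output describe a tree: each token points at one head, and the root points at itself. A minimal pure-Python sketch of that structure, using rows copied by hand from the output for `Turn your phones off, please`. The tuple layout and the helper `children_of` are illustrative assumptions, not spaCy API:

```python
# (text, dep, tag, head_index) rows copied by hand from the printed output
tokens = [
    ('Turn', 'ROOT', 'VB', 0),          # the root is its own head
    ('your phones', 'dobj', 'NNS', 0),
    ('off', 'prt', 'RP', 0),
    (',', 'punct', ',', 0),
    ('please', 'intj', 'UH', 0),
]

def children_of(tokens, head_idx):
    """Indices of tokens whose head is head_idx, skipping the root's self-loop."""
    return [i for i, tok in enumerate(tokens) if tok[3] == head_idx and i != head_idx]

child_deps = [tokens[i][1] for i in children_of(tokens, 0)]
print(child_deps)
```

Collapsing `your phones` into one token means the verb sees a single `dobj` child rather than a determiner and a noun.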

### Featurization

In [12]:
from spacy.symbols import VERB
import re
from collections import defaultdict

featuresets = []
for doc in docs:
    s_features = defaultdict(int)
    for idx, token in enumerate(doc):
        print(token, token.pos_, token.tag_)
        # match on the fine-grained tag rather than token.pos == VERB,
        # because the coarse VERB class also covers the BES, HVS, and MD tags
        if re.match(r'VB.?', token.tag_) is not None:
            s_features[FeatureName.VERB] += 1
            # FOLLOWING: tag of the token right after the verb
            next_idx = idx + 1
            if next_idx < len(doc):
                s_features[f'{FeatureName.FOLLOWING}_{doc[next_idx].tag_}'] += 1
            # VERB_HEAD_DEP
            # VERB_HEAD_POS
            # "Because the syntactic relations form a tree, every word has exactly one head.
            # You can therefore iterate over the arcs in the tree by iterating over the words in the sentence."
            # https://spacy.io/docs/usage/dependency-parse#navigating
            # The root is its own head, so skip it; comparing token indices avoids
            # relying on spaCy returning the identical Python object for the head.
            if token.head.i != token.i:
                s_features[f'{FeatureName.VERB_HEAD_DEP}_{token.head.dep_.upper()}'] += 1
                s_features[f'{FeatureName.VERB_HEAD_POS}_{token.head.tag_}'] += 1
            # VERB_CHILD_DEP
            # VERB_CHILD_POS
            for child in token.children:
                s_features[f'{FeatureName.VERB_CHILD_DEP}_{child.dep_.upper()}'] += 1
                s_features[f'{FeatureName.VERB_CHILD_POS}_{child.tag_}'] += 1            
    if len(s_features) > 0:
        featuresets.append(dict(s_features))
        print(dict(s_features))
    print()

#print(featuresets, len(featuresets))

Be VERB VB
kind ADJ JJ
{<FeatureName.VERB: 1>: 1, 'FeatureName.FOLLOWING_JJ': 1, 'FeatureName.VERB_CHILD_DEP_ACOMP': 1, 'FeatureName.VERB_CHILD_POS_JJ': 1}

Get VERB VB
out ADP IN
of ADP IN
here ADV RB
{<FeatureName.VERB: 1>: 1, 'FeatureName.FOLLOWING_IN': 1, 'FeatureName.VERB_CHILD_DEP_PREP': 1, 'FeatureName.VERB_CHILD_POS_IN': 1}

Look VERB VB
this DET DT
over ADP IN
{<FeatureName.VERB: 1>: 1, 'FeatureName.FOLLOWING_DT': 1, 'FeatureName.VERB_CHILD_DEP_DOBJ': 1, 'FeatureName.VERB_CHILD_POS_DT': 1, 'FeatureName.VERB_CHILD_DEP_PREP': 1, 'FeatureName.VERB_CHILD_POS_IN': 1}

Paul PROPN NNP
, PUNCT ,
do VERB VB
your homework NOUN NN
now ADV RB
{<FeatureName.VERB: 1>: 1, 'FeatureName.FOLLOWING_NN': 1, 'FeatureName.VERB_CHILD_DEP_NSUBJ': 1, 'FeatureName.VERB_CHILD_POS_NNP': 1, 'FeatureName.VERB_CHILD_DEP_DOBJ': 1, 'FeatureName.VERB_CHILD_POS_NN': 1, 'FeatureName.VERB_CHILD_DEP_ADVMOD': 1, 'FeatureName.VERB_CHILD_POS_RB': 1}

Do VERB VBP
not ADV RB
clean ADJ JJ
soot NOUN NN
off ADP IN
the win

### Classification

# Things that didn't work

### [Stanford NLP](https://nlp.stanford.edu/software/)
Setup guide used: https://stackoverflow.com/a/34112695

In [None]:
# Get dependency parser, NER, POS tagger
!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
!unzip stanford-parser-full-2017-06-09.zip
!unzip stanford-ner-2017-06-09.zip
!unzip stanford-postagger-full-2017-06-09.zip

In [None]:
from nltk.parse.stanford import StanfordParser
from nltk.parse.stanford import StanfordDependencyParser
from nltk.parse.stanford import StanfordNeuralDependencyParser
from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger
from nltk.tokenize.stanford import StanfordTokenizer