<h3>Basic Recipe for Training a POS Tagger with spaCy</h3>
<ol>
<li id="loaddatatitle"><a href="#-Load-Data-">Load Data </a>
<ol><li>We'll be using a sample from the Web Treebank corpus, in CoNLL-X format</ol>
<li><a href="#Prepare-Environment-for-New-Model">Prepare environment for a new model</a>
<ol><li>Create a new model directory with tagger and parser subdirectories (ensure you have write permission)</ol>
<li><a href="#Build-a-Vocabulary">Build a vocabulary</a>

<ol>
<li>We simply load the default English vocabulary
<li>It defines how we get attributes (like suffix) from a token string
<li>It includes Brown cluster data on lexemes, which we'll use as a feature for the tagger
</ol>
<li> <a href="#Build-a-Tagger">Build a Tagger</a>
<ol><li>Ensure a tag map is provided if needed
<li>Decide which features should be used to train the tagger</ol>
<li><a href="#Train-Tagger"> Train Tagger</a>
<ol><li>Averaged Perceptron algorithm
<li>For each epoch: 
<ol><li>For each document in training data:
<ol><li>For each sentence in document:
<ol>
    <li>Create document with sentence words (tagger not yet applied)
    <li>Create GoldParse object with annotated labels
    <li>Apply the tagger to the document to get predictions
    <li>Update the tagger with the GoldParse and the document (actual vs. predicted)
</ol>
</ol>
<li> Score predictions on validation set
</ol>
</ol>
<li><a href="#Save-Tagger">Save Tagger</a>

<h3> Load Data </h3>

In [2]:
import sys
sys.path.append('/home/jupyter/site-packages/')

In [38]:
import requests
from spacy.syntax.arc_eager import PseudoProjectivity

            
def read_conllx(text):
    """Yield one (None, [[tuples, []]]) entry per sentence in a CoNLL-X/CoNLL-U string."""
    bad_lines = 0
    n_sent = 0
    n_line = 0
    print('text=%d' % len(text))
    # Sentences are separated by blank lines
    for sent in text.strip().split('\n\n'):
        n_sent += 1
        lines = sent.strip().split('\n')
        # Drop leading comment lines such as "# sent_id = ..."
        while lines and lines[0].startswith('#'):
            lines.pop(0)
        tokens = []
        for line in lines:
            n_line += 1
            try:
                id_, word, lemma, tag, pos, morph, head, dep, _1, _2 = line.split()
                # Skip multiword-token ranges such as "1-2"
                if '-' in id_:
                    continue
                # Convert the 1-based ID to 0-based; float() also accepts IDs such as "8.1"
                id_ = float(id_) - 1
                try:
                    head = (int(head) - 1) if head != '0' else id_
                except ValueError:
                    # Non-numeric head: attach the token to itself
                    head = id_
                dep = 'ROOT' if dep == 'root' else dep
                tokens.append((id_, word, pos, int(head), dep, 'O'))
            except Exception:
                # Report the offending line, then stop: malformed input is not silently skipped
                bad_lines += 1
                print('***', line)
                raise
        if not tokens:
            continue
        # Transpose per-token tuples into columns: ids, words, tags, heads, deps, ner
        tuples = [list(t) for t in zip(*tokens)]
        yield (None, [[tuples, []]])
    print("Skipped %d malformed lines" % bad_lines)
    print('n_sent=%d' % n_sent)
    print('n_line=%d' % n_line)

                        
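def LoadData(url, path, make_projective=False):
    if url:
        # .text decodes the response body; str(.content) would yield the "b'...'" repr in Python 3
        conll_string = requests.get(url).text
    elif path:
        with open(path, encoding='utf8') as f:
            conll_string = f.read()
    else:
        raise ValueError('Provide either url or path')
    print('conll_string=%d' % len(conll_string))
    sents = list(read_conllx(conll_string))
    if make_projective:
        sents = PseudoProjectivity.preprocess_training_data(sents)
    return sents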
    
    
train_url = 'https://raw.githubusercontent.com/UniversalDependencies/UD_English/master/en-ud-train.conllu'
test_url  = 'https://raw.githubusercontent.com/UniversalDependencies/UD_English/master/en-ud-test.conllu'
train_path = '/Users/pcadmin/code/spacy-examples/en-ud-train.conllu.txt'
train_sents = LoadData(None, train_path)
# test_sents = LoadData(test_url, None)
print('train=%d' % len(train_sents))
#print('test =%d' % len(test_sents))


conll_string=11843424
text=11843424
Skipped 0 malformed lines
n_sent=12543
n_line=204605
train=12543
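
To make the column layout that `read_conllx` unpacks more concrete, here is a minimal standalone sketch that parses a tiny hand-written sample. The sample sentence and the helper name `parse_conllu_sentence` are assumptions made for illustration; they are not part of the corpus or of spaCy.

```python
# Hypothetical two-token sample in the 10-column layout:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "# text = He ate\n"
    "1\tHe\the\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tate\teat\tVERB\tVBD\t_\t0\troot\t_\t_\n"
)


def parse_conllu_sentence(block):
    """Return (words, tags, heads, deps) for one sentence block."""
    words, tags, heads, deps = [], [], [], []
    for line in block.strip().split("\n"):
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        # Skip malformed rows, multiword ranges ("1-2") and empty nodes ("8.1")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue
        idx = int(cols[0]) - 1                      # 0-based token index
        head = int(cols[6]) - 1 if cols[6] != "0" else idx  # root points to itself
        words.append(cols[1])
        tags.append(cols[4])                        # fine-grained (XPOS) tag
        heads.append(head)
        deps.append("ROOT" if cols[7] == "root" else cols[7])
    return words, tags, heads, deps


words, tags, heads, deps = parse_conllu_sentence(SAMPLE)
print(words, tags, heads, deps)
```

Note that the tag taken here is the fifth column, which is why the training corpus ends up with Penn Treebank style tags such as `PRP` and `VBD`.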


In [41]:
def sent_iter(conll_corpus):
    """Flatten the corpus into one (ids, words, tags, heads, deps, ner) tuple per sentence."""
    for _, doc_sents in conll_corpus:
        for (ids, words, tags, heads, deps, ner), _ in doc_sents:
            yield ids, words, tags, heads, deps, ner
            
print('train=%d' % len(train_sents))
sent_counter = 0
unique_tags = set()
for ids, words, tags, heads, deps, ner in sent_iter(train_sents):
    unique_tags.update(tags)
    sent_counter += 1
doc_counter = len(train_sents)
print("Training corpus metadata")
print()
print("Number of Sentences: %d" % sent_counter)
print("Number of Unique Tags: %d" % len(unique_tags))
print("Unique Tags: %s" % sorted(unique_tags))

train=12543
Training corpus metadata

Number of Sentences: 12543
Number of Unique Tags: 50
Unique Tags: ['$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'GW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', 'XX', '``']


<a href="#loaddatatitle">back</a>
<br>
### Prepare Environment for New Model

In [24]:
from pathlib import Path
import spacy

def prepare_environment_for_new_tagger(model_path, tagger_path):
    # Create the model directory first, then the tagger subdirectory inside it
    if not model_path.exists():
        model_path.mkdir()
    if not tagger_path.exists():
        tagger_path.mkdir()
        
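# NOTE: spacy.en.get_data_path() is not available in every spaCy release (see the
# AttributeError below); in spaCy 1.x, spacy.util.get_data_path() is a likely alternative.
data_dir = spacy.en.get_data_path()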
model_dir = data_dir / 'en-1.1.0'
tagger_dir = model_dir / 'custom-pos-tagger'
prepare_environment_for_new_tagger(model_dir, tagger_dir)

AttributeError: module 'spacy.en' has no attribute 'get_data_path'
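
Independently of where spaCy keeps its data, the directory setup itself can be sketched with `pathlib` alone. This is a minimal sketch under stated assumptions: the helper name `ensure_model_dirs` is hypothetical, and a temporary directory stands in for the real spaCy data path.

```python
import tempfile
from pathlib import Path


def ensure_model_dirs(model_dir, subdirs=("custom-pos-tagger",)):
    """Create model_dir and the given subdirectories if they are missing."""
    model_dir = Path(model_dir)
    created = []
    for name in subdirs:
        path = model_dir / name
        # parents=True creates model_dir too; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# A temporary directory stands in for the spaCy data directory (assumption)
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_model_dirs(Path(tmp) / "en-1.1.0")
    ensure_model_dirs(Path(tmp) / "en-1.1.0")  # second call is a no-op
    all_exist = all(p.is_dir() for p in paths)

print(all_exist)
```

Using `exist_ok=True` avoids the separate `exists()` checks, which matters when the notebook cell is run more than once.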

<a href="#loaddatatitle">back</a>
<br>
### Build a Vocabulary

In [15]:
from spacy.vocab import Vocab
def build_vocab(model_dir, vec_path = None, lexeme_path = None):
    vocab = Vocab.load(model_dir)
    if lexeme_path:
        vocab.load_lexemes(lexeme_path)
    if vec_path:
        vocab.load_vectors_from_bin_loc(vec_path)
        
    return vocab
    
lexeme_path = model_dir / 'vocab' / 'lexemes.bin'
vocab = build_vocab(model_dir, lexeme_path=lexeme_path)

In [6]:
#test clusters are available
from spacy.tokens import Doc

doc = Doc(vocab, words=[u'He',u'ate',u'pizza',u'.'])
print("Cluster Value for '{}': {}".format(doc[0], doc[0].cluster))

Cluster Value for 'He': 126


<a href="#loaddatatitle">back</a>
<br>
### Build a Tagger

In [7]:
from spacy.tagger import Tagger
from spacy.tagger import *

features = [
    (W_orth,),(W_shape,),(W_cluster,),(W_flags,),(W_suffix,),(W_prefix,),    #current word attributes   
    (P1_pos,),(P1_cluster,),(P1_flags,),(P1_suffix,),                        #-1 word attributes 
    (P2_pos,),(P2_cluster,),(P2_flags,),                                     #-2 word attributes  
    (N1_orth,),(N1_suffix,),(N1_cluster,),(N1_flags,),                       #+1 word attributes       
    (N2_orth,),(N2_cluster,),(N2_flags,),                                    #+2 word attributes 
    (P1_lemma, P1_pos),(P2_lemma, P2_pos), (P1_pos, P2_pos),(P1_pos, W_orth) #combination attributes 
]

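# NOTE: this replaces the hand-written feature list above with spaCy's default
# English tagger features; remove the next line to train with the custom list instead.
features = spacy.en.English.Defaults.tagger_features
tag_map = spacy.en.tag_map
statistical_model = spacy.tagger.TaggerModel(features)
tagger = Tagger(vocab, tag_map=tag_map, statistical_model=statistical_model)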
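
To give a feel for what feature templates such as `(W_orth,)` or `(P1_pos, W_orth)` express, here is a rough pure-Python sketch. It is only an illustration: spaCy's real templates work on integer attribute IDs inside compiled code, and the function `extract_features`, the string feature names, and the `-START-`/`-END-` padding symbols are all assumptions made for this example.

```python
def extract_features(words, i, prev_tag, prev2_tag):
    """Build string-valued features for the token at position i."""

    def word_at(j):
        # Pad positions that fall outside the sentence
        if j < 0:
            return "-START-"
        if j >= len(words):
            return "-END-"
        return words[j]

    w = words[i]
    return {
        "W_orth=" + w,                                   # current word form
        "W_suffix=" + w[-3:],                            # last three characters
        "W_prefix=" + w[:1],                             # first character
        "P1_pos=" + prev_tag,                            # tag of word -1
        "P2_pos=" + prev2_tag,                           # tag of word -2
        "N1_orth=" + word_at(i + 1),                     # word +1
        "N2_orth=" + word_at(i + 2),                     # word +2
        "P1_pos+P2_pos=" + prev_tag + "," + prev2_tag,   # combination feature
        "P1_pos+W_orth=" + prev_tag + "," + w,           # combination feature
    }


feats = extract_features(["He", "ate", "pizza", "."], 1, "PRP", "-START-")
print(sorted(feats))
```

Each single-element tuple in the feature list corresponds to one atomic feature, and each multi-element tuple corresponds to a conjunction like the last two entries here.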

<a href="#loaddatatitle">back</a>
<br>
### Train Tagger

In [9]:
from spacy.scorer import Scorer
from spacy.gold import GoldParse
import random


def score_model(vocab, tagger, gold_docs, verbose=False):
    """Tag every sentence in gold_docs and accumulate accuracy in a Scorer."""
    scorer = Scorer()
    for _, gold_doc in gold_docs:
        for (ids, words, tags, heads, deps, entities), _ in gold_doc:
            # Words are already text strings in Python 3, so no unicode() conversion is needed
            doc = Doc(vocab, words=list(words))
            tagger(doc)
            gold = GoldParse(doc, tags=tags)
            scorer.score(doc, gold, verbose=verbose)
    return scorer


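def train(tagger, vocab, train_sents, test_sents, model_dir, n_iter=20, seed=0, feat_set=u'basic'):
    """Train the tagger for n_iter epochs, scoring on test_sents after each one.

    model_dir, seed and feat_set are accepted but not used by this loop.
    """
    scorer = score_model(vocab, tagger, test_sents)
    print('%s:\t\t%s' % ("Iteration", "POS Tag Accuracy"))
    print('%s:\t\t%.3f' % ("Pretraining", scorer.tags_acc))

    # TRAINING STARTS HERE
    for itn in range(n_iter):
        for ids, words, tags, heads, deps, ner in sent_iter(train_sents):
            # Build an untagged Doc from the gold tokenization (words are already text in Python 3)
            doc = Doc(vocab, words=list(words))
            gold = GoldParse(doc, tags=tags, heads=heads, deps=deps)
            # Predict tags, then update the weights from gold vs. predicted
            tagger(doc)
            tagger.update(doc, gold)
        # Shuffle so the next epoch sees the sentences in a different order
        random.shuffle(train_sents)
        scorer = score_model(vocab, tagger, test_sents)
        print('%d:\t\t\t%.3f' % (itn, scorer.tags_acc))
    return tagger


# test_sents must be loaded first, e.g. by uncommenting the LoadData(test_url, None) line in the Load Data cell
trained_tagger = train(tagger, vocab, train_sents, test_sents, model_dir, n_iter=10)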

Iteration:		POS Tag Accuracy
Pretraining:		0.000
0:			87.655
1:			89.122
2:			91.250
3:			91.110
4:			91.453
5:			91.851
6:			92.545
7:			92.302
8:			92.246
9:			91.843
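
For readers curious about what the averaged perceptron update is doing during this loop, here is a minimal pure-Python sketch on toy data. It is not spaCy's implementation (which lives in compiled code); the class `AveragedPerceptron`, the string features and the three-example dataset are assumptions made purely for illustration.

```python
from collections import defaultdict


class AveragedPerceptron:
    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> class -> weight
        self._totals = defaultdict(float)  # (feature, class) -> accumulated weight
        self._stamps = defaultdict(int)    # (feature, class) -> step of last change
        self.step = 0

    def predict(self, features):
        scores = {c: 0.0 for c in self.classes}
        for f in features:
            for c, w in self.weights.get(f, {}).items():
                scores[c] += w
        # Ties are broken by class name so the result is deterministic
        return max(self.classes, key=lambda c: (scores[c], c))

    def update(self, truth, guess, features):
        self.step += 1
        if truth == guess:
            return  # no weight change on a correct prediction
        for f in features:
            for c, delta in ((truth, 1.0), (guess, -1.0)):
                key = (f, c)
                # Credit the old weight for every step it was in effect
                self._totals[key] += (self.step - self._stamps[key]) * self.weights[f][c]
                self._stamps[key] = self.step
                self.weights[f][c] += delta

    def average(self):
        """Replace each weight by its average over all update steps."""
        for f, per_class in self.weights.items():
            for c, w in per_class.items():
                key = (f, c)
                total = self._totals[key] + (self.step - self._stamps[key]) * w
                per_class[c] = total / self.step if self.step else w


# Toy training data (hypothetical): each item is (feature set, gold tag)
data = [
    ({"W_orth=He", "W_suffix=He"}, "PRP"),
    ({"W_orth=ate", "W_suffix=ate"}, "VBD"),
    ({"W_orth=pizza", "W_suffix=zza"}, "NN"),
]

model = AveragedPerceptron({"PRP", "VBD", "NN"})
for epoch in range(5):
    for feats, gold in data:
        guess = model.predict(feats)       # apply the current model
        model.update(gold, guess, feats)   # learn from gold vs. predicted
model.average()

predictions = [model.predict(feats) for feats, _ in data]
print(predictions)
```

Averaging the weights over all steps tends to damp the oscillation you can see in the per-epoch accuracies above, where later epochs do not always improve on earlier ones.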


<a href="#loaddatatitle">back</a>
<br>
### Save Tagger

In [None]:
def ensure_dir(path):
    if not path.exists():
        path.mkdir()
        
ensure_dir(tagger_dir)
trained_tagger.model.dump(str(tagger_dir / 'model'))

### Notes
<br>
1. spaCy will be rolling out a neural network model soon!
<br>
<br>
2. Check out Speech and Language Processing by Daniel Jurafsky and James H. Martin.
<br>
<br>
3. Next section: Vector space models for natural language.