In [1]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.matcher import Matcher

# Gameplan

What I want to do is:

1. take instances where my matcher agrees with my labelling (taking care that I only grab sentences with 1 language in the document, just in case)
2. use the matcher to assign the correct labels such that I have a NER datastructure that I can feed to the training loop
3. replace the NER in the pipeline with my own one
4. train an instance of NER and have a peek at the results

The main goal of this part is to explain NER and to get to a working training loop. We will need to iterate on this training loop but this is probably better for another video.

## 1st things 1st 

I figured it would be good to first focus on what `NER` is. The easiest way to do this is to show that spacy comes with a few detectors from the get-go. 

In [2]:
nlp = spacy.load("en_core_web_md")

In [3]:
doc = nlp("My name is Steve and I was born on 24th June 1973. \
           I work at Apple in Amsterdam. The contract states that we will \
           pay $1000 upfront for the service and 20 euro for every room sold \
           via your website.")
displacy.render(doc, style="ent")

# spaCy pipelines

![](from-docs.png)

You might remember the `NER` from a previous video. We turned it off because our matchers only used the part of speech and dependency features. 

What we will now do is replace this part of the pipeline.

In [4]:
nlp.pipe_names

['tagger', 'parser', 'ner']

## Back to the Task at Hand 

Let's first grab our matcher from before. I've moved the matcher code into a small package so that our code-base isn't scattered across notebooks (good idea).

```python
from spacy.matcher import Matcher

def _create_versioned(name):
    return [
        [{'LOWER': name}], 
        # raw strings keep Python from treating \d and \. as string escapes
        [{'LOWER': {'REGEX': rf'({name}\d+\.?\d*.?\d*)'}}], 
        [{'LOWER': name}, {'TEXT': {'REGEX': r'(\d+\.?\d*.?\d*)'}}],
    ]

def create_patterns():
    # let us first make the patterns with versions 
    versioned_languages = ['ruby', 'php', 'python', 'perl', 'java', 'haskell', 
                       'scala', 'c', 'cpp', 'matlab', 'bash', 'delphi']
    flatten = lambda l: [item for sublist in l for item in sublist]
    versioned_patterns = flatten([_create_versioned(lang) for lang in versioned_languages])
    
    # next we'll keep a list of non-standard patterns 
    lang_patterns = [
        [{'LOWER': 'objective'},{'IS_PUNCT': True, 'OP': '?'},{'LOWER': 'c'}],
        [{'LOWER': 'objectivec'}],
        [{'LOWER': 'c'}, {'LOWER': '#'}],
        [{'LOWER': 'c'}, {'LOWER': 'sharp'}],
        [{'LOWER': 'c#'}],
        [{'LOWER': 'f'}, {'LOWER': '#'}],
        [{'LOWER': 'f'}, {'LOWER': 'sharp'}],
        [{'LOWER': 'f#'}],
        [{'LOWER': 'lisp'}],
        [{'LOWER': 'common'}, {'LOWER': 'lisp'}],
        [{'LOWER': 'go', 'POS': {'NOT_IN': ['VERB']}}],
        [{'LOWER': 'golang'}],
        [{'LOWER': 'html'}],
        [{'LOWER': 'css'}],
        [{'LOWER': 'sql'}],
        [{'LOWER': {'IN': ['js', 'javascript']}}],
        [{'LOWER': 'c++'}]]
    
    return versioned_patterns + lang_patterns
```
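
To get a feel for what the versioned expression accepts, here is a small sketch that uses plain `re` instead of the matcher. It assumes the expression is searched (not fully matched) against the lowercased token text, which is my reading of how `REGEX` behaves. `versioned_regex` and `accepts` are helpers made up for this illustration.

```python
import re

def versioned_regex(name):
    # Same expression as in _create_versioned, written as a raw f-string.
    return re.compile(rf'({name}\d+\.?\d*.?\d*)')

def accepts(name, token_text):
    # Assumption: the pattern is searched in the lowercased token text.
    return versioned_regex(name).search(token_text.lower()) is not None

for candidate in ["python3", "Python3.7", "python", "python2.7.15", "cpython3"]:
    print(candidate, accepts("python", candidate))
```

Under that assumption a bare `python` is left to the plain `LOWER` pattern, while something like `cpython3` would also be picked up because the expression is not anchored.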

With our matcher defined, we can load it in.

In [5]:
from proglang.matcher import create_patterns

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("PROG_LANG", None, *create_patterns())

We can confirm that our matcher does indeed match things we're interested in.

In [29]:
doc = nlp("I am Vincent and I do code with datastuff and golang seems like a cool language but I mostly work in python.")

for idx, start, end in matcher(doc):
    print(doc[start:end])

golang
python


We can also confirm that what we match on is a `Span` object. Good to note that named entities in a document cannot overlap, so the spans we turn into entities cannot overlap either.

In [30]:
type(doc[start:end])

spacy.tokens.span.Span
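
Because entities are not allowed to overlap, any overlapping matches (for example when two patterns fire on the same tokens) would have to be resolved before they become annotations. Below is a minimal sketch of one way to do that on plain `(start, end)` token offsets, keeping the longest span. `drop_overlaps` is a hypothetical helper and not part of the matcher package.

```python
def drop_overlaps(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one.

    `spans` is a list of (start, end) token offsets, end exclusive.
    """
    kept = []
    # Longest first; ties are broken by the earliest start.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)

print(drop_overlaps([(0, 1), (0, 2), (5, 6)]))  # -> [(0, 2), (5, 6)]
```

In this notebook I sidestep the issue by only keeping documents with exactly one match, so this would only matter once that restriction is lifted.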

## Parsing a Doc 

First I'll need to prepare the data such that it fits the API. Docs found [here](https://spacy.io/usage/training#training-simple-style).

In [31]:
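def parse_train_data(doc):
    # Turn every match into a (start_char, end_char, label) character offset triple
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    entities = [(s.start_char, s.end_char, 'PROGLANG') for s in spans]
    return (doc.text, {'entities': entities})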

parse_train_data(doc)

('I am Vincent and I do code with datastuff and golang seems like a cool language but I mostly work in python.',
 {'entities': [(46, 52, 'PROGLANG'), (101, 107, 'PROGLANG')]})
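
The offsets are character positions into the text, with the end exclusive. A quick way to check an example by eye is to slice the text with them. The sketch below does that in plain Python on the output above; `show_entities` is a name I made up for the illustration.

```python
def show_entities(example):
    """Return the (surface text, label) pairs that the character offsets point at."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations['entities']]

example = (
    'I am Vincent and I do code with datastuff and golang seems like a cool '
    'language but I mostly work in python.',
    {'entities': [(46, 52, 'PROGLANG'), (101, 107, 'PROGLANG')]},
)

print(show_entities(example))  # -> [('golang', 'PROGLANG'), ('python', 'PROGLANG')]
```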

This is nice. Let's grab some data and train on it. The idea is that I will look at examples where my Excel label overlaps with a matcher that finds a single match.

In [40]:
df = pd.read_csv("have_label.txt", nrows=5_000, sep='\t', usecols=['Label', 'Title'])
titles = df.loc[lambda d: d['Label'] == 1]['Title']
titles = df['Title']  # note: this reassignment overrides the Label filter on the previous line
titles.shape, sum(1 for d in nlp.pipe(titles) if len(matcher(d)) == 1)

((2000,), 416)

In [41]:
g = (parse_train_data(d) for d in nlp.pipe(titles) if len(matcher(d)) == 1)

In [44]:
TRAIN_DATA = [parse_train_data(d) for d in nlp.pipe(titles) if len(matcher(d)) == 1]
TRAIN_DATA[:5]

[('Deploying SQL Server Databases from Test to Live',
  {'entities': [(10, 13, 'PROGLANG')]}),
 ('Is Windows Server 2008 "Server Core" appropriate for a SQL Server instance?',
  {'entities': [(55, 58, 'PROGLANG')]}),
 ('Good STL-like library for C', {'entities': [(26, 27, 'PROGLANG')]}),
 ('Paging SQL Server 2005 Results', {'entities': [(7, 10, 'PROGLANG')]}),
 ('MySQL/Apache Error in PHP MySQL query',
  {'entities': [(22, 25, 'PROGLANG')]})]
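
Before training it can help to count which surface forms the annotations actually cover, since that is all the model gets to see. Here is a small sketch on the five examples printed above; `entity_text_counts` is a hypothetical helper.

```python
from collections import Counter

def entity_text_counts(train_data):
    """Count the lowercased entity texts in (text, {'entities': [...]}) examples."""
    counts = Counter()
    for text, annotations in train_data:
        for start, end, _label in annotations['entities']:
            counts[text[start:end].lower()] += 1
    return counts

sample = [
    ('Deploying SQL Server Databases from Test to Live',
     {'entities': [(10, 13, 'PROGLANG')]}),
    ('Is Windows Server 2008 "Server Core" appropriate for a SQL Server instance?',
     {'entities': [(55, 58, 'PROGLANG')]}),
    ('Good STL-like library for C', {'entities': [(26, 27, 'PROGLANG')]}),
    ('Paging SQL Server 2005 Results', {'entities': [(7, 10, 'PROGLANG')]}),
    ('MySQL/Apache Error in PHP MySQL query',
     {'entities': [(22, 25, 'PROGLANG')]}),
]

print(entity_text_counts(sample).most_common())
```

Running the same helper over the full `TRAIN_DATA` would show how skewed the examples are towards a few languages.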

Now that we have training data we'll start a new spacy model. 

![](from-docs.png)


In [15]:
def create_blank_nlp():
    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
    # Register every label that appears in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    return nlp

In [16]:
import random  # needed for random.shuffle in the training loop

from spacy.util import minibatch, compounding

In [45]:
nlp = create_blank_nlp()
optimizer = nlp.begin_training()  
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(f"Losses at iteration {i}", losses)

Losses {'ner': 171.68839159619745}
Losses {'ner': 12.77679905959765}
Losses {'ner': 8.576888525372755}
Losses {'ner': 7.921659459052945}
Losses {'ner': 14.079479234089929}
Losses {'ner': 22.2691320396331}
Losses {'ner': 1.814979443285463}
Losses {'ner': 3.614291375231467}
Losses {'ner': 13.192082488298539}
Losses {'ner': 13.265022523456754}
Losses {'ner': 7.289025678655073}
Losses {'ner': 27.61016798911226}
Losses {'ner': 7.929718542739575}
Losses {'ner': 12.558411253964168}
Losses {'ner': 6.295474202151618}
Losses {'ner': 0.0010332701030309791}
Losses {'ner': 2.9600180028579797e-05}
Losses {'ner': 6.542552724515043e-07}
Losses {'ner': 3.88690716435697e-10}
Losses {'ner': 3.078534097405294e-10}


In [47]:
doc = nlp("I like coding in python, java script, go and rust")
displacy.render(doc, style="ent")

In [49]:
doc = nlp("I like coding in go")
displacy.render(doc, style="ent")

Mhmm ... not perfect.

## Different Train Loop

Here's another run that has more bells and whistles.

In [50]:
import tqdm

In [51]:
nlp = create_blank_nlp()
optimizer = nlp.begin_training()
pbar = tqdm.tqdm(range(25))
for i in pbar:
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
            texts,  # batch of texts
            annotations,  # batch of annotations
            drop=0.1,  # dropout - make it harder to memorise data
            losses=losses,
        )
    pbar.set_description(f"loss={losses['ner']}")

loss=9.656993702075896: 100%|██████████| 25/25 [00:56<00:00,  2.52s/it]     
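
To make the batching less of a black box, here is a plain-Python approximation of what I understand `compounding` and `minibatch` to do: the batch size starts at the first value, is multiplied by the factor after every batch, and is capped at the second value. The names `compounding_sizes` and `minibatches` are my own, and the real spaCy helpers may differ in details.

```python
import itertools

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound

def minibatches(items, sizes):
    """Split items into consecutive batches whose sizes come from `sizes`."""
    items = iter(items)
    for size in sizes:
        batch = list(itertools.islice(items, int(size)))
        if not batch:
            return
        yield batch

# A large factor makes the growth visible on a tiny dataset.
print([len(b) for b in minibatches(range(20), compounding_sizes(4.0, 32.0, 1.5))])  # -> [4, 6, 9, 1]

# The settings from the loop above, on 416 items.
sizes = [len(b) for b in minibatches(range(416), compounding_sizes(4.0, 32.0, 1.001))]
print(len(sizes), set(sizes))
```

If this approximation is right, then with a factor of 1.001 and 416 examples the batch size never leaves 4 within one pass, because the schedule restarts every time `minibatch` is called inside the loop.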


In [52]:
doc = nlp("I like coding in python, java script, go and rust")
displacy.render(doc, style="ent")

In [53]:
doc = nlp("I like coding in go")
displacy.render(doc, style="ent")

The model is also not perfect. So we'll need to do some iterations of improvement. But there's a lot to cover there and that is for the next video.
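
To judge those iterations by more than eyeballing the displacy output, one option is to compare predicted spans against hand-labelled ones. Below is a minimal sketch of exact-match precision and recall on `(start_char, end_char, label)` triples. The gold and predicted offsets are written by hand for illustration and are not model output, and `span_scores` is a made-up helper.

```python
def span_scores(gold, predicted):
    """Exact-match precision and recall over (start_char, end_char, label) triples."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

text = "I like coding in python, java script, go and rust"
# Hypothetical annotations: pretend only "python" and "go" were labelled,
# and the model only found "python".
gold = [(17, 23, 'PROGLANG'), (38, 40, 'PROGLANG')]
predicted = [(17, 23, 'PROGLANG')]

print(span_scores(gold, predicted))  # -> (1.0, 0.5)
```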

## Final Thoughts 

It feels like explaining entities and setting up a training loop might be enough for a single video. Talking about all the things one could tweak during a training loop might be worth a separate video. It also feels like there will be a phase of 'label some more and set up a proper evaluation pipeline' which might also be good to have in a separate video. I'm also getting a feeling that it's gonna get mighty convenient if I could use prodigy here.

Thoughts? 

In [174]:
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)

In [176]:
ruler.add_patterns([{'label':'PROGLANG', 'pattern': p} for p in create_patterns()])
nlp.add_pipe(ruler)

In [179]:
nlp.to_disk('entity-ruler')

In [177]:
displacy.render(nlp("I am Vincent and I like to code in go and python"), 'ent')

In [126]:
g = (d for d in nlp.pipe(df['Title']) if len(d.ents) > 0)

In [143]:
foo = next(g)
foo, foo.ents

(How can I Java webstart multiple, dependent, native libraries?, (Java,))

Let's move this into a text file.

In [156]:
import json
import pathlib

In [169]:
df = (pd.read_csv("Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [170]:
p = pathlib.Path("./proglang-text.jsonl").resolve()
txt = ""
for i in range(5000):
    txt += json.dumps({"text": df['Title'][i]}) + '\n'
    
p.write_text(txt)

322057
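
The file written above is in JSON Lines format: one JSON object per line, each with a `text` key. Here is a small round-trip sketch that writes to a temporary directory so it doesn't touch the real file. `write_jsonl` and `read_jsonl` are hypothetical helpers and the titles are just two taken from earlier in the notebook.

```python
import json
import pathlib
import tempfile

def write_jsonl(path, texts):
    """Write one {"text": ...} object per line and return the number of characters written."""
    lines = [json.dumps({"text": t}) for t in texts]
    return path.write_text("\n".join(lines) + "\n", encoding="utf-8")

def read_jsonl(path):
    """Read the objects back, one per non-empty line."""
    content = path.read_text(encoding="utf-8")
    return [json.loads(line) for line in content.split("\n") if line]

titles = ["Good STL-like library for C", "Paging SQL Server 2005 Results"]

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.jsonl"
    write_jsonl(path, titles)
    records = read_jsonl(path)

print(records)
```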

In [173]:
import spacy

?spacy.cli.train


[0;31mSignature:[0m
[0mspacy[0m[0;34m.[0m[0mcli[0m[0;34m.[0m[0mtrain[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mlang[0m[0;34m:[0m[0;34m([0m[0;34m'Model language'[0m[0;34m,[0m [0;34m'positional'[0m[0;34m,[0m [0;32mNone[0m[0;34m,[0m [0;34m<[0m[0;32mclass[0m [0;34m'str'[0m[0;34m>[0m[0;34m)[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0moutput_path[0m[0;34m:[0m[0;34m([0m[0;34m'Output directory to store model in'[0m[0;34m,[0m [0;34m'positional'[0m[0;34m,[0m [0;32mNone[0m[0;34m,[0m [0;34m<[0m[0;32mclass[0m [0;34m'pathlib.Path'[0m[0;34m>[0m[0;34m)[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mtrain_path[0m[0;34m:[0m[0;34m([0m[0;34m'Location of JSON-formatted training data'[0m[0;34m,[0m [0;34m'positional'[0m[0;34m,[0m [0;32mNone[0m[0;34m,[0m [0;34m<[0m[0;32mclass[0m [0;34m'pathlib.Path'[0m[0;34m>[0m[0;34m)[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdev_path[0m[0;34m:[0m[0;34m([0m[0;34m'Location 