<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/intro/methods_intro_nlp_spacy_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

This is a crash-course introduction to NLP functionality in the [spacy](https://spacy.io/) package.

It is aimed at people who might want to understand, use, and change the spacy code used in this project,
or use spacy as the main platform for their own analysis.

The below focuses largely on understanding the lowish-level mechanics well enough to be precise in your own research.

If you wish to explore other, comparable libraries that sometimes do specific jobs better, take a look at things like [nltk](https://www.nltk.org/), [pattern](https://github.com/clips/pattern), [CoreNLP](https://stanfordnlp.github.io/CoreNLP/).

If you want more automation, you might care for one of many (paid-for) online tools, such as https://studio.oneai.com

## Installing, Importing, loading

Loading spacy can take a minute. This is one reason notebooks (and colab) are useful for exploration, as it will stay loaded in the background for the whole session.

In [None]:
## install spacy, and wetsuite

# note that you can run most of the below with a standard spacy install (pip3 install -U spacy) and do not need to install wetsuite yet

# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
#!pip3 install -U --no-cache-dir --quiet https://github.com/knobs-dials/wetsuite-dev/archive/refs/heads/main.zip

# also, download language models we use here
#!python3 -m spacy download en_core_web_lg    # reasonable on CPU
#!python3 -m spacy download en_core_web_trf   # works better, but can be rather slow without GPU properly set up

In [1]:
## load spacy (the library)

#  importing spacy is likely to spit out a bunch of warnings.  These are sometimes useful errors, but usually they are mostly about optional libraries, 
#  so you could decide to have it be less spammy in your everyday work by doing the following before the import:
#import os
#os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'    

import spacy
import spacy.displacy

import wetsuite.helpers.spacy    # some helpers of our own, barely used here.

2024-03-12 22:53:04.514676: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-12 22:53:06.783161: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-03-12 22:53:06.787817: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-03-12 22:53:06.787943: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero


In [2]:
## Load a language model that you previously downloaded
english_lg  = spacy.load('en_core_web_lg')   
english_trf = spacy.load('en_core_web_trf')
# Tutorials often call the loaded model object 'nlp'.  This uses more descriptive names because we use more than one.

In [3]:
# Ask spacy what this particular model (roughly) will calculate and annotate.   
#   or, in spacy's terms, what components(/pipes) are in this model's pipeline?
for pipe_name in english_lg.pipe_names:
    print( '==== component: %s ====\n%s\n'%(pipe_name, english_lg.get_pipe(pipe_name).__doc__) )

==== component: tok2vec ====
Apply a "token-to-vector" model and set its outputs in the doc.tensor
    attribute. This is mostly useful to share a single subnetwork between multiple
    components, e.g. to have one embedding and CNN network shared between a
    parser, tagger and NER.

    In order to use the `Tok2Vec` predictions, subsequent components should use
    the `Tok2VecListener` layer as the tok2vec subnetwork of their model. This
    layer will read data from the `doc.tensor` attribute during prediction.
    During training, the `Tok2Vec` component will save its prediction and backprop
    callback for each batch, so that the subsequent components can backpropagate
    to the shared weights. This implementation is used because it allows us to
    avoid relying on object identity within the models to achieve the parameter
    sharing.
    

==== component: tagger ====
Pipeline component for part-of-speech tagging.

    DOCS: https://spacy.io/api/tagger
    

==== component: 

## Inspection of a parse

Calling a model object with some text produces a [Doc](https://spacy.io/api/doc) object, the analysis of that text. 

Iterating that document object gives you a series of [Token](https://spacy.io/api/token) objects.


Pipeline components usually put their results in attributes on the Doc and/or the Token objects. 
You could get a summary of what specific attributes are set, with `.analyze_pipes()`, but this is not the most readable. This document will go through many of them in more natural examples.

One example of annotation onto the Doc is the sentence splitting that various models do (often in the parser component): the [.sents](https://spacy.io/api/doc#sents) attribute gives you a sequence of [Span](https://spacy.io/api/span) objects, each of which is a sequence of tokens from the document that the Span belongs to.
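To make the Doc / Span relation concrete without downloading anything, here is a minimal sketch. It assumes a blank English pipeline with spacy's rule-based `sentencizer` component as a stand-in for a trained model, so the splitting itself is cruder than what the models above do. It also shows that `analyze_pipes()` returns a dict you can inspect.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based sentencizer stands in for a trained model
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes() returns a dict; its "summary" entry lists what each component assigns
analysis = nlp.analyze_pipes()
print(analysis["summary"]["sentencizer"]["assigns"])

doc = nlp("Ducks are nice. Cats and dogs are cute.")
sentences = list(doc.sents)
for sent in sentences:
    # each Span stores its token offsets (start, end) into the Doc it came from
    print(sent.start, sent.end, repr(sent.text))
```

Because a Span is only a view onto its Doc, `doc[sent.start:sent.end]` gives you the same tokens again.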

In [5]:
statements_txt = "Machine learning can be easy. Deep Learning isn't straightforward. Ducks are nice. Cats and dogs are cute. Neural nets are useful. Long Cat looks like a name."
statements_doc = english_trf( statements_txt )

print("TOKENS")
for tok in statements_doc: 
    print( '  %s/%s'%(tok.text,  tok.pos_),  end='' )  # print token text, and its part of speech tag


print("\n\nSENTENCES")
for sent in statements_doc.sents:  # https://spacy.io/api/doc#sents
    print( '  %s'%sent.text )
    #for tok in sent: 
    #    print( '  %s/%s'%(tok.text,  tok.pos_), end='' )

TOKENS
  Machine/NOUN  learning/NOUN  can/AUX  be/AUX  easy/ADJ  ./PUNCT  Deep/ADJ  Learning/NOUN  is/AUX  n't/PART  straightforward/ADJ  ./PUNCT  Ducks/NOUN  are/AUX  nice/ADJ  ./PUNCT  Cats/NOUN  and/CCONJ  dogs/NOUN  are/AUX  cute/ADJ  ./PUNCT  Neural/ADJ  nets/NOUN  are/AUX  useful/ADJ  ./PUNCT  Long/ADJ  Cat/PROPN  looks/VERB  like/ADP  a/DET  name/NOUN  ./PUNCT

SENTENCES
  Machine learning can be easy.
  Deep Learning isn't straightforward.
  Ducks are nice.
  Cats and dogs are cute.
  Neural nets are useful.
  Long Cat looks like a name.


It can be useful to be aware of the other things a Token can do for you, including but not limited to the following. Chances are you will never use half of these, but it's good to know they are there.
- (there's `.orth_`, yet it seems to be an implementation detail, and practically equivalent to `.text`)
- `.lemma_`: lemmatized form of a token
- `.norm_`:  normalized form of a token  (seems to mostly resolve contractions, and otherwise seems to be largely the lowercased token text)
- `.pos_`:  coarse tagging  (often following wider conventions)
- `.tag_`:   finer tagging  (more easily model/language specific)
- specifics like `.is_sent_start`, `.is_bracket`, `.is_quote`, `.is_left_punct`, `.is_punct`, `.is_upper`;  `.like_url`, `.like_email`,  and a few more
- the ability to fetch the original whitespace  (this example uses `.text` which ignores whitespace,  while `.text_with_ws` is basically `.text` + `.whitespace_` )
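A small sketch of the whitespace and underscore points in this list. It assumes only a blank English pipeline (tokenizer, no trained components), which is enough for these lexical attributes; attributes such as `.lemma_` or `.pos_` would stay empty here.

```python
import spacy

# Assumption: tokenizer-only pipeline, no model download needed
nlp = spacy.blank("en")
doc = nlp("I don't think so.")

for tok in doc:
    # .lower is an integer ID, .lower_ is the string that ID stands for
    print(repr(tok.text), repr(tok.whitespace_), tok.lower, tok.lower_, tok.is_punct)

# joining text_with_ws reproduces the original text exactly
rebuilt = "".join(tok.text_with_ws for tok in doc)

# the integer variants can be resolved back to strings via the vocabulary's string store
ids_resolve = all(nlp.vocab.strings[tok.lower] == tok.lower_ for tok in doc)
print(rebuilt == doc.text, ids_resolve)
```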

In [6]:
fields = '%15s %15s %15s %10s %10s %10s %10s  %15s  %s' # (for aligned printing)
head = ('TEXT', 'LEMMA', 'NORMALIZED', 'POS', 'TAG', 'STOP?', 'UNKNOWN?', 'EXPL(POS)', 'EXPL(TAG)')   # unknown as in 'out of vocabulary'
print( fields % head )
for tok in english_lg( "I don't think fhqwhgads was weird."):
    print( fields%(tok.text, tok.lemma_, tok.norm_, tok.pos_, tok.tag_, tok.is_stop, tok.is_oov, spacy.explain(tok.pos_), spacy.explain(tok.tag_) ) )

# Notes:
# - for certain models, is_oov will always be True.  Don't rely on it until you understand why.
# - if you're not sure about tagset details - the values that pos_ and tag_ can take, and what they mean - you can ask spacy via explain(), like we do here
# - if you're wondering: those underscores are the human-readable variants, 
#   and there will also be an attribute without the underscore that stores a number, which is a more compact, internal reference

           TEXT           LEMMA      NORMALIZED        POS        TAG      STOP?   UNKNOWN?        EXPL(POS)  EXPL(TAG)
              I               I               i       PRON        PRP       True      False          pronoun  pronoun, personal
             do              do              do        AUX        VBP       True      False        auxiliary  verb, non-3rd person singular present
            n't             not             not       PART         RB       True      False         particle  adverb
          think           think           think       VERB         VB      False      False             verb  verb, base form
      fhqwhgads        fhqwhgad       fhqwhgads       NOUN        NNS      False       True             noun  noun, plural
            was              be             was        AUX        VBD       True      False        auxiliary  verb, past tense
          weird           weird           weird        ADJ         JJ      False      False        adjective  a

In [None]:
# When looking to extract the more contentful words and ignore filler/function words, 
# there are fancier ways to go about that (say, the word embeddings),
# but one quick and dirty way is "ignore tokens by annotation, mostly parts of speech", like:
interesting, boring, todo = [], [], []
for tok in statements_doc:
    if tok.is_stop:
        boring.append( tok )
    elif tok.pos_ in ('PUNCT','SPACE', 'X'):
        boring.append( tok )
    elif tok.pos_ in ('AUX','DET','CCONJ'): # auxiliary verbs, determiners, conjunctions are mostly just function words
        boring.append( tok )
    elif tok.pos_ in ('NOUN','PROPN','NUM'):  # nouns tend to be useful.   Numbers could be ignored; if they're dates, there is some chance that they're picked up as entities
        interesting.append( tok )
    elif tok.pos_ in ('ADJ','VERB','ADP', 'ADV'): # adjectives, verbs, adpositions, adverbs - are often halfway useful
        interesting.append( tok )
    else: #  (debug: labels not yet in the above)
        todo.append( '%s/%s'%(tok.text, tok.pos_) )

print("CONTENTFUL")
for tok in interesting:
    print( '  %s'%tok.text )

print("\nBORING")
for tok in boring:
    print( '  %s'%tok.text)

#print("\nTODO")
#for tok in todo:
#    print( '  ',tok )

CONTENTFUL
  Machine
  learning
  easy
  Deep
  Learning
  straightforward
  Ducks
  nice
  Cats
  dogs
  cute
  Neural
  nets
  useful
  Long
  Cat
  looks
  like

BORING
  can
  be
  .
  is
  n't
  .
  are
  .
  and
  are
  .
  are
  .
  a
  name
  .


## Noun chunks and entities

Various models try to find adjacent nouns that seem to belong together (somewhat like noun phrases, but that is not exactly the goal), placed onto Doc's [.noun_chunks](https://spacy.io/api/doc#noun_chunks) attribute.

There is also named entity extraction, placed onto Doc's [.ents](https://spacy.io/api/doc#ents) attribute. While more directed, this may also be incomplete and/or messy - this is one of the things you may want to train for your specific use case.

In [None]:
print("\nNOUN CHUNKS")    # https://spacy.io/api/doc#noun_chunks
for nc in statements_doc.noun_chunks:
    print( '    %-20s   head=%-15s   head.dep_=%s'%(nc.text, nc.root.text, nc.root.dep_))  # the root and dep things will be explained below

print("\nNAMED ENTITIES") # https://spacy.io/api/doc#ents
for ent in english_lg( statements_txt ).ents:  # using the other model because _trf doesn't find any.
    print( '    %-20s   label=%-15s   head=%-15s   head.dep_=%s'%(ent.text, ent.label_, ent.root.text, ent.root.dep_))


NOUN CHUNKS
    Machine learning       head=learning          head.dep_=nsubj
    Deep Learning          head=Learning          head.dep_=nsubj
    Ducks                  head=Ducks             head.dep_=nsubj
    Cats                   head=Cats              head.dep_=nsubj
    dogs                   head=dogs              head.dep_=conj
    Neural nets            head=nets              head.dep_=nsubj
    Long Cat               head=Cat               head.dep_=nsubj
    a name                 head=name              head.dep_=pobj

NAMED ENTITIES
    Deep Learning          label=WORK_OF_ART       head=Learning          head.dep_=nsubj
    Long Cat               label=ORG               head=Cat               head.dep_=nsubj


Note that these two things are done separately, so their results may well overlap.

In [None]:
bd = "Bill Diamond is an American puppeteer, puppet fabricator, and producer."
print("\nNOUN CHUNKS")
for nc in english_trf(bd).noun_chunks:
    print( '    %s'%(nc.text) )

print("\nNAMED ENTITIES")
for ent in english_trf(bd).ents: 
    print( '    %-25s   label=%-15s'%(ent.text, ent.label_))


NOUN CHUNKS
    Bill Diamond
    an American puppeteer
    puppet fabricator
    producer

NAMED ENTITIES
    Bill Diamond                label=PERSON         
    American                    label=NORP           
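If you want to detect such overlap in code, comparing token offsets is enough, because noun chunks and entities are both Span objects on the same Doc. The sketch below assumes a blank English pipeline and hand-picked token ranges that mirror the output above; with a trained model you would iterate over `doc.noun_chunks` and `doc.ents` instead. The helper name `spans_overlap` is our own, not part of spacy.

```python
import spacy

def spans_overlap(a, b):
    """True if two spans from the same Doc share at least one token (hypothetical helper)."""
    return a.start < b.end and b.start < a.end

nlp = spacy.blank("en")
doc = nlp("Bill Diamond is an American puppeteer, puppet fabricator, and producer.")

# Assumption: token ranges typed in by hand to mirror the model output shown above
chunks   = [doc[0:2], doc[3:6], doc[7:9], doc[11:12]]
entities = [doc[0:2], doc[4:5]]

overlaps = [(ent.text, nc.text)
            for ent in entities
            for nc in chunks
            if spans_overlap(ent, nc)]
for ent_text, nc_text in overlaps:
    print('entity %r overlaps noun chunk %r' % (ent_text, nc_text))
```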


### Dependency relations
Models also detail certain dependencies/relations between tokens. 

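On each token, `.head` is the token at the other end of that relation (the token it attaches to), and `.dep_` is the relation's type. The root of a sentence has itself as its head.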

It may become a little clearer once you see this visualized - scroll down until you see images.  Even then, this is a somewhat unusual way of representing and storing these.

In [7]:
for tok in english_trf( "John, I gave the bean to Alice's sister" ):
   print( f'{tok.text:>10s}   <----{spacy.explain(tok.dep_):->35s}--   {tok.head} ' )

      John   <------noun phrase as adverbial modifier--   gave 
         ,   <----------------------------punctuation--   gave 
         I   <------------------------nominal subject--   gave 
      gave   <-----------------------------------root--   gave 
       the   <-----------------------------determiner--   bean 
      bean   <--------------------------direct object--   gave 
        to   <-----------------prepositional modifier--   gave 
     Alice   <--------------------possession modifier--   sister 
        's   <---------------------------case marking--   Alice 
    sister   <------------------object of preposition--   to 
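Besides `.head`, tokens also let you walk the tree in other directions via `.children`, `.subtree` and `.ancestors`. The sketch below avoids loading a model by building a Doc by hand: the heads are copied from the output above, and the dependency labels are our reading of the explanations printed there (for example `npadvmod` for "noun phrase as adverbial modifier"), so treat those as assumptions.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words  = ["John", ",", "I", "gave", "the", "bean", "to", "Alice", "'s", "sister"]
spaces = [False, True, True, True, True, True, True, False, True, False]
# Assumption: annotations typed in by hand; heads are absolute token positions
heads  = [3, 3, 3, 3, 5, 3, 3, 9, 7, 6]
deps   = ["npadvmod", "punct", "nsubj", "ROOT", "det", "dobj", "prep", "poss", "case", "pobj"]
doc = Doc(vocab, words=words, spaces=spaces, heads=heads, deps=deps)

root = next(tok for tok in doc if tok.head.i == tok.i)   # the root points at itself
root_children = [tok.text for tok in root.children]      # tokens directly attached to the root
to_phrase = "".join(tok.text_with_ws for tok in doc[6].subtree)  # 'to' plus everything under it
case_ancestors = [tok.text for tok in doc[8].ancestors]  # walking up from "'s" to the root

print(root.text, root_children)
print(repr(to_phrase))
print(case_ancestors)
```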


In [None]:
# Dependencies help with certain tasks, such as extracting a sentence's subjects and objects. 
# If you do this, give it some weird and complex sentences too, to see what it does when subjects and objects are implicit, missing, or when there is more than one.
for sent_i, sent in enumerate( statements_doc.sents ):
    for tok in sent:
        if tok.dep_ == 'nsubj':
            print( "SUBJECT = %-15s    for sentence %r"%(tok.text, sent) )
        if tok.dep_ == 'pobj':   # note: pobj is the object of a preposition; direct objects are labelled 'dobj'
            print( "OBJECT  = %-15s    for sentence %r"%(tok.text, sent) )       

# In various cases you may care to figure out whether a token marked as subject or object (or other) is part of a larger thing, like a named entity or noun chunk.
# This may be easier to check the other way around: for each noun chunk or entity, see if its head/root has the relation you're interested in:
print()
for nc in statements_doc.noun_chunks:
    if nc.root.dep_ in ('nsubj','pobj'):
        print( 'NOUN CHUNK %-20r   head=%-15s   has relation %s'%(nc.text, nc.root.text, nc.root.dep_))

SUBJECT = learning           for sentence Machine learning can be easy.
SUBJECT = Learning           for sentence Deep Learning isn't straightforward.
SUBJECT = Ducks              for sentence Ducks are nice.
SUBJECT = Cats               for sentence Cats and dogs are cute.
SUBJECT = nets               for sentence Neural nets are useful.
SUBJECT = Cat                for sentence Long Cat looks like a name.
OBJECT  = name               for sentence Long Cat looks like a name.

NOUN CHUNK 'Machine learning'     head=learning          has relation nsubj
NOUN CHUNK 'Deep Learning'        head=Learning          has relation nsubj
NOUN CHUNK 'Ducks'                head=Ducks             has relation nsubj
NOUN CHUNK 'Cats'                 head=Cats              has relation nsubj
NOUN CHUNK 'Neural nets'          head=nets              has relation nsubj
NOUN CHUNK 'Long Cat'             head=Cat               has relation nsubj
NOUN CHUNK 'a name'               head=name              has r

## Visualisation
spacy's own displacy is geared towards python notebooks.   You can also fish out representations for display elsewhere.
- `style='ent'` is for entities, shown with HTML blocks, and fine for longer texts
- `style='dep'` is for dependencies, shown as an SVG diagram, and generally unwieldy if showing more than one sentence at a time
- `style='span'` is for any spans (noun chunks, entities, others that you may add) which may overlap.
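As a sketch of "fishing out" a representation for display elsewhere: with `jupyter=False`, `displacy.render` returns the markup as a string rather than displaying it. This assumes a blank English pipeline with entities set by hand, since no trained model is loaded here.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The first railway line to Paris opened in 1837.")
# Assumption: entities set by hand; a trained model would normally fill doc.ents
doc.ents = [Span(doc, 5, 6, label="GPE"), Span(doc, 8, 9, label="DATE")]

# jupyter=False makes render() return the HTML instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:100])
```

The resulting string can be written to a file or embedded in a page of your own.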

In [None]:
paris_text = """During the Restoration, the bridges and squares of Paris were returned to their pre-Revolution names; the July Revolution in 1830 
(commemorated by the July Column on the Place de la Bastille) brought a constitutional monarch, Louis Philippe I, to power.
The first railway line to Paris opened in 1837, beginning a new period of massive migration from the provinces to the city.
Louis-Philippe was overthrown by a popular uprising in the streets of Paris in 1848. His successor, Napoleon III, 
alongside the newly appointed prefect of the Seine, Georges-Eugène Haussmann, launched a gigantic public works project to 
build wide new boulevards, a new opera house, a central market, new aqueducts, sewers and parks, including the Bois de Boulogne 
and Bois de Vincennes.""".replace('\n',' ')
paris_doc = english_trf( paris_text )

print("NAMED ENTITIES")
for ent in paris_doc.ents:
    print( '  %-28s   label=%-8s (%-53s)    head=%-15s   head.dep_=%s'%(ent.text, ent.label_, spacy.explain(ent.label_), ent.root.text, ent.root.dep_))

NAMED ENTITIES
  Restoration                    label=EVENT    (Named hurricanes, battles, wars, sports events, etc. )    head=Restoration       head.dep_=pobj
  Paris                          label=GPE      (Countries, cities, states                            )    head=Paris             head.dep_=pobj
  the July Revolution            label=EVENT    (Named hurricanes, battles, wars, sports events, etc. )    head=Revolution        head.dep_=nsubj
  1830                           label=DATE     (Absolute or relative dates or periods                )    head=1830              head.dep_=pobj
  the July Column                label=FAC      (Buildings, airports, highways, bridges, etc.         )    head=Column            head.dep_=pobj
  the Place de la Bastille       label=FAC      (Buildings, airports, highways, bridges, etc.         )    head=Bastille          head.dep_=pobj
  Louis Philippe I               label=PERSON   (People, including fictional                          )    head=I 

In [None]:
# Visualized in context of a sentence:
spacy.displacy.render(paris_doc, style='ent', jupyter=True)    # Note: colab requires explicit jupyter=True,  local notebooks do not.

In [None]:
# Dependencies between the tokens in a sentence:
louis_doc = english_trf( "Louis XVI and the royal family were brought to Paris and made prisoners within the Tuileries Palace." )
spacy.displacy.render(louis_doc, style='dep', jupyter=True)   # jupyter=True is required in some environments, optional in others

In [None]:
# You can get it a little more compact, and do other styling, like
spacy.displacy.render(louis_doc, style='dep', options={'compact':True, 'bg':'#00000000', 'color':'#666666', 'distance':85, 'word_spacing':42, 'arrow_stroke':1}, jupyter=True)

# Slightly less basic

## Extracting patterns with rule-based matching

In [None]:
from spacy.matcher import Matcher

# Look for   one or more adjectives  before  a noun or proper noun
an_pattern = [
    [ {"POS": "ADJ", "OP": "+"},   {"POS": {"IN":["NOUN","PROPN"]}} ],
    # you can have more rules in a matcher
]
matcher = Matcher(english_trf.vocab)
matcher.add("adjective-noun", an_pattern)
matches = matcher( paris_doc )
for match_id, start_i, end_i in matches:
    print( paris_doc[ start_i : end_i ] ) # from the indices, print the described span 

# Notes: 
# - you can express more complex types of patterns, see https://spacy.io/api/matcher#patterns
# - You could extend this to more complex tasks, like maybe rule-based phrase and named entity extraction.
#   - ...though you might base that on more specific existing code like PhraseMatcher and EntityRuler,
#     which may work faster and/or annotate automatically.
#   - and in the case of NER would probably still be less effective than existing trained NER model components


Revolution names
-Revolution names
pre-Revolution names
constitutional monarch
first railway
new period
massive migration
popular uprising
public works
gigantic public works
new boulevards
wide new boulevards
new opera
central market
new aqueducts


##  Inventing your own functions to apply to spacy

In [None]:
def complexity( span ):
    ''' Takes a parsed spacy sentence, and estimates its complexity.

        Currently uses _only_ the average distance of the dependencies, which is decent for how basic it is. 
        Consider e.g.
        - long sentences are often more complex, but not proportional to their length
          They might, say, separate independent clauses.
          They become harder to parse e.g. if there are references to something further away.
        - parenthetical sentences will lengthen references across them
        - lists and flat compounds will drag the complexity down
        which all seem to make basic sense.

        We don't even really need normalization, in that the unit is 'token distance' (though we should take more care across PUNCT and such)

        Downsides include that spacy seems to assign some dependencies just because it needs to, not necessarily because they are linguistically sensible.
        Also, we should probably count most named entities as a single thing, not the number of tokens in them
    '''
    # this idea happens to be easy to implement, because every token has a relation .head, and every token has a position
    dists = []
    for tok in span:
        dist = tok.head.i - tok.i
        dists.append( dist ) 
    abs_dists = list( abs(d)  for d in dists )
    avg_dist = float(sum(abs_dists)) / len(abs_dists)
    return avg_dist


for sent in paris_doc.sents:   # paris_doc is already a parsed Doc, so there is no need to run the model again
    print('[%s]\nComplexity: %.1f\n'%(sent.text.replace('\n',' ').strip(), complexity(sent) ) )

[During the Restoration, the bridges and squares of Paris were returned to their pre-Revolution names; the July Revolution in 1830  (commemorated by the July Column on the Place de la Bastille) brought a constitutional monarch, Louis Philippe I, to power.]
Complexity: 4.3

[The first railway line to Paris opened in 1837, beginning a new period of massive migration from the provinces to the city.]
Complexity: 2.4

[Louis-Philippe was overthrown by a popular uprising in the streets of Paris in 1848.]
Complexity: 2.5

[His successor, Napoleon III,  alongside the newly appointed prefect of the Seine, Georges-Eugène Haussmann, launched a gigantic public works project to  build wide new boulevards, a new opera house, a central market, new aqueducts, sewers and parks, including the Bois de Boulogne  and Bois de Vincennes.]
Complexity: 4.4
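One of the caveats in that docstring is punctuation. As a sketch of one possible variation (not necessarily a better measure), the function below leaves punctuation tokens and the root's link to itself out of the average. To stay runnable without a model it assumes a hand-annotated Doc, with heads copied from the "John, I gave the bean" output earlier; the function name is our own.

```python
import spacy
from spacy.tokens import Doc

def complexity_skipping_punct(span):
    """Average absolute head distance, ignoring punctuation tokens and the root's self-link."""
    dists = [abs(tok.head.i - tok.i)
             for tok in span
             if not tok.is_punct and tok.head.i != tok.i]
    return sum(dists) / len(dists) if dists else 0.0

vocab = spacy.blank("en").vocab   # English vocab, so that is_punct is available
words = ["John", ",", "I", "gave", "the", "bean", "to", "Alice", "'s", "sister"]
# Assumption: heads (absolute positions) and labels typed in by hand
heads = [3, 3, 3, 3, 5, 3, 3, 9, 7, 6]
deps  = ["npadvmod", "punct", "nsubj", "ROOT", "det", "dobj", "prep", "poss", "case", "pobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

plain   = sum(abs(tok.head.i - tok.i) for tok in doc) / len(doc)  # the measure as defined above
skipped = complexity_skipping_punct(doc)
print('all tokens: %.1f   without punctuation and root: %.1f' % (plain, skipped))
```

Note that this still counts punctuation tokens as positions when measuring distance; it only stops them from contributing dependencies of their own.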



## Some specialized models

Take a look around existing models.
There are sometimes readymade models that help do specific tasks, and then possibly _only_ those tasks.

Say you have a text that you know contains multiple languages, and you want to feed each to a parser for that language:

### Language detection
This is useful when one piece of text contains multiple languages and you want to feed each part to the right nlp object (which you would still need to have loaded yourself first, so this probably makes the most sense if you want to work with a small, pre-set number of languages).

There is a [spacy_fastlang](https://pypi.org/project/spacy-fastlang/) library (which depends on the [fasttext](https://fasttext.cc/) library) that can do that for you.

The below uses our own helper function called `detect_language()`, which is a handful of lines mostly just doing the boilerplate of "create model, stick fastlang on it, run it, and report the best-scoring language and our certainty that it's correct".
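Once you have a detected language and a certainty, routing text to the right nlp object is a small dictionary lookup. The sketch below is only an illustration of that idea: it assumes blank pipelines in place of the trained models you would load yourself, and it takes the detector as an argument (any callable returning a `(language, certainty)` pair), using a hypothetical stand-in here so that it runs without extra installs.

```python
import spacy

# Assumption: blank pipelines stand in for the trained models you would load yourself
pipelines = {"en": spacy.blank("en"), "nl": spacy.blank("nl")}

def parse_in_detected_language(text, detect, min_certainty=0.5, fallback="en"):
    """Run text through the pipeline for its detected language (hypothetical helper).

    detect is any callable returning (language_code, certainty).
    Uncertain or unsupported detections go to the fallback pipeline.
    """
    lang, certainty = detect(text)
    if certainty < min_certainty or lang not in pipelines:
        lang = fallback
    return pipelines[lang](text)

def fake_detect(text):
    """Hypothetical stand-in for a real language detector."""
    return ("nl", 0.53) if "kaas" in text else ("en", 0.73)

for example in ("I am cheese", "Ik ben kaas"):
    doc = parse_in_detected_language(example, fake_detect)
    print(doc.lang_, repr(doc.text))
```

The certainty threshold is a judgment call; short texts in particular tend to get low scores.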

In [None]:
%pip install --quiet spacy_fastlang

In [None]:
print("LANGUAGE DETECTION")
for example in ("I am cheese", "Ik ben kaas", "Je suis fromage", "Ich bin Käse", "Ek is kaas", "Saya keju", "Olen juusto", "ben peynirim"):
    lang, score = wetsuite.helpers.spacy.detect_language(example)
    print( "    %s (certainty: %.2f) for  %r"%( lang, score, example ) )

LANGUAGE DETECTION
    en (certainty: 0.73) for  'I am cheese'
    nl (certainty: 0.53) for  'Ik ben kaas'
    fr (certainty: 1.00) for  'Je suis fromage'
    de (certainty: 1.00) for  'Ich bin Käse'
    af (certainty: 0.66) for  'Ek is kaas'
    id (certainty: 0.58) for  'Saya keju'
    fi (certainty: 0.91) for  'Olen juusto'
    tr (certainty: 0.29) for  'ben peynirim'


### Sentence splitting

Some models use rule based sentence splitting that happens to be... unpredictable and not always great.
For one example:

In [8]:
dutch = spacy.load('nl_core_news_lg')

from IPython.display import HTML  # this uses HTML output to mark the split unambiguously with color. A little overkill, but clearer.
splitter = ' <span style="color:red">//</span> '

for txt in (
        'Wario Land 4 ( Japans : ワリオランドアドバンス; Wario Land Advance ) is een platformspel, uitgebracht voor de Game Boy Advance in Europa op 16 november 2001.',
        'Wario Land 4 is een platformspel, uitgebracht voor de Game Boy Advance in Europa op 16 november 2001.',
        #'Wario 4 is een platformspel, uitgebracht voor de Game Boy Advance in Europa op 16 november 2001.',
        'Land 4 is een platformspel, uitgebracht voor de Game Boy Advance in Europa op 16 november 2001.',
    ):
    display( HTML( splitter.join( s.text for s in dutch(txt+' '+txt).sents) ) )


I have no idea why it does that and why it seems to vary so much.  I just know we can get better.

For example, the [xx_sent_ud_sm](https://spacy.io/models/xx#xx_sent_ud_sm) model is trained on multiple languages. 

It seems to fare a little better, and even when it does not, it seems more of a known quantity than 'whatever the model you use today happens to do'.

In [None]:
# !python3 -m spacy download xx_sent_ud_sm

In [14]:
print("SENTENCE SPLITTING")
split_doc = wetsuite.helpers.spacy.sentence_split("""C'est n'est pas une pipe. A capital after Mr. Abbreviation might throw things off. (Also) weird sentence starts could. 
     As can ellipses... as you can imagine. As could "Things we quote?" ...as they are embedded sentences and what comes after is unknown. (en die laatste zou anders werken met een komma)""") 

# add language detection, as per our initial plan
for sent in split_doc.sents: 
    text = sent.text.strip()
    print('    [%s] %s'%( wetsuite.helpers.spacy.detect_language(text)[0], text ) )   

SENTENCE SPLITTING
    [fr] C'est n'est pas une pipe.
    [en] A capital after Mr. Abbreviation might throw things off.
    [en] (Also) weird sentence starts could.
    [en] As can ellipses... as you can imagine.
    [en] As could "Things we quote?"
    [en] ...as they are embedded sentences and what comes after is unknown.
    [nl] (en die laatste zou anders werken met een komma)
