## Intro to spacy
This is a crash-course introduction to the basic NLP functionality in spacy, for people who have not used it before.

## Installing, Importing, Loading

In [None]:
# ON YOUR OWN COMPUTER(S) you need to install only once.
#   It is also worth sorting out details like GPU support, once, for the speed.
# ON GOOGLE COLAB you get a fresh environment each time, and need to install each time.
#   There it is easier to not bother with GPU, particularly for a few quick tests where efficiency does not matter.

!pip3 install -U spacy                       
# also, download models you may use
#!python3 -m spacy download en_core_web_sm   # loads faster, handy for quick tests
!python3 -m spacy download en_core_web_lg    # works better, reasonable on CPU
!python3 -m spacy download en_core_web_trf   # works better, but can be rather slow on CPU
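
Since downloads are slow, it can help to check which models are already installed before running the cell above. A minimal sketch, assuming only that the model names match the ones used in this notebook:

```python
import spacy
from spacy.util import is_package

# Model package names used in this notebook.
wanted = ("en_core_web_sm", "en_core_web_lg", "en_core_web_trf")

# spaCy models are distributed as Python packages;
# is_package() reports whether a package of that name is installed.
installed = {name: is_package(name) for name in wanted}
for name, present in installed.items():
    print("%-18s %s" % (name, "installed" if present else "missing"))
print("spaCy version:", spacy.__version__)
```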

In [None]:
# compute libraries are heavy and somewhat fragile, so they like to give you plentiful warnings  (mostly coming from tensorflow)
#import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'   #  assume errors are non-fatal and make it quiet.

import spacy           # initial load can take a while (it's one reason notebooks are useful - you don't need to rerun this as long as the backend is still running)
import spacy.displacy

import sys
sys.path.insert(0, '/var/www/coding')
import helpers_spacy # some helpers of our own.   Barely used here.  chances are you will eventually make your own as well.

In [None]:
# load the model that you chose to download

# see notes above on model choice
english_lg = spacy.load('en_core_web_lg')   
english_trf = spacy.load('en_core_web_trf')
# Notes:
# - tutorials often call the loaded model object 'nlp'.  We mix models below, so use more descriptive names.
# - Older versions allowed things like load('en') as a 'you decide for me'.  It's less confusing to not do that.


# what components are in this model's pipeline - what does it calculate and annotate?
for pipe_name in english_trf.pipe_names:
    print( '==== %s ====\n%s'%(pipe_name, english_trf.get_pipe(pipe_name).__doc__) )

==== transformer ====
spaCy pipeline component that provides access to a transformer model from
    the Huggingface transformers library. Usually you will connect subsequent
    components to the shared transformer using the TransformerListener layer.
    This works similarly to spaCy's Tok2Vec component and Tok2VecListener
    sublayer.

    The activations from the transformer are saved in the doc._.trf_data extension
    attribute. You can also provide a callback to set additional annotations.

    vocab (Vocab): The Vocab object for the pipeline.
    model (Model[List[Doc], FullTransformerBatch]): A thinc Model object wrapping
        the transformer. Usually you will want to use the TransformerModel
        layer for this.
    set_extra_annotations (Callable[[List[Doc], FullTransformerBatch], None]): A
        callback to set additional information onto the batch of `Doc` objects.
        The doc._.trf_data attribute is set prior to calling the callback.
        By default, no add
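
To see how a pipeline is composed without downloading anything, you can start from a blank one. This is a small sketch, not how the trained models above are built: a blank pipeline has only a tokenizer, and components are added by name.

```python
import spacy

# A blank pipeline has only a tokenizer, which is not listed as a component.
nlp = spacy.blank("en")
print(nlp.pipe_names)            # no components yet

# Add the rule-based sentence splitter that ships with spaCy.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# has_pipe() is a cheap check before relying on an annotation.
for name in ("sentencizer", "parser", "ner"):
    print(name, nlp.has_pipe(name))
```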

## Inspection of a parse

In [None]:
# Calling a model with some text produces a Doc object, the analysis of that text.
# Iterating that object gives you a series of Token objects.
#
# Doc as well as Token get annotated by the pipeline components.
# Much of the interesting annotation sits on the token objects, some of it on the document (more on that below).
# 
# One example of data annotated on the doc is doc.sents (many models add this), 
#   a sequence of Span objects, where each is the series of tokens for a single sentence.
#   You may find iterating a sentence at a time more convenient than iterating the entire doc in one go.
statements_txt = "Machine learning can be easy. Deep Learning isn't straightforward. Ducks are nice. Cats and dogs are cute. Neural nets are useful. Long Cat looks like a name."
statements_doc = english_trf( statements_txt )

print("TOKENS")
for tok in statements_doc: 
    print( '  %s/%s'%(tok.text,  tok.pos_),  end='' )  # print token text, and its part of speech tag

# the 'parser' component also annotates sentences
# The sents attribute gives you a sequence of Span objects, each of which is a sequence of tokens from the document these Spans belong to
print("\n\nSENTENCES")
for sent in statements_doc.sents:  # https://spacy.io/api/doc#sents
    print( '  %s'%sent.text )
    #for tok in sent: 
    #    print( '  %s/%s'%(tok.text,  tok.pos_), end='' )

TOKENS
  Machine/NOUN  learning/NOUN  can/AUX  be/AUX  easy/ADJ  ./PUNCT  Deep/PROPN  Learning/PROPN  is/AUX  n't/PART  straightforward/ADJ  ./PUNCT  Ducks/NOUN  are/AUX  nice/ADJ  ./PUNCT  Cats/NOUN  and/CCONJ  dogs/NOUN  are/AUX  cute/ADJ  ./PUNCT  Neural/ADJ  nets/NOUN  are/AUX  useful/ADJ  ./PUNCT

SENTENCES
  Machine learning can be easy.
  Deep Learning isn't straightforward.
  Ducks are nice.
  Cats and dogs are cute.
  Neural nets are useful.
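
The relation between Doc, Span and Token can be seen in a small sketch that needs no trained model. It assumes a blank English pipeline with the rule-based `sentencizer` instead of the parser used above, so it only illustrates the object structure, not the quality of the splitting.

```python
import spacy

nlp = spacy.blank("en")          # tokenizer only, no trained components
nlp.add_pipe("sentencizer")      # rule-based: sets sentence boundaries at punctuation
doc = nlp("Ducks are nice. Cats and dogs are cute.")

sentences = list(doc.sents)
for sent in sentences:
    # start and end are token offsets into the parent Doc
    print(sent.start, sent.end, repr(sent.text))
    for tok in sent:
        # tok.i is the position in the Doc, not in the sentence
        print("   ", tok.i, tok.text)

# Slicing the Doc with a Span's offsets gives the same tokens back
first = sentences[0]
print(doc[first.start:first.end].text == first.text)
```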


In [19]:
# Various models also try to find sequences that you might think of as noun phrases (though that is not exactly the goal).
print("\nNOUN CHUNKS")    # https://spacy.io/api/doc#noun_chunks
for nc in statements_doc.noun_chunks:
    print( '    %-20s   head=%-15s   head.dep_=%s'%(nc.text, nc.root.text, nc.root.dep_))

print("\nNAMED ENTITIES") # https://spacy.io/api/doc#ents
for ent in statements_doc.ents:           
    print( '    %-20s   label=%-10s   head=%-15s   head.dep_=%s'%(ent.text, ent.label_, ent.root.text, ent.root.dep_))
# Keep in mind that named entity recognition in particular finds only the clearer cases, and is never quite complete.
# In this example, there are 0 or 1 entities depending on the model you used.


NOUN CHUNKS
    Machine learning       head=learning          head.dep_=nsubj
    Deep Learning          head=Learning          head.dep_=nsubj
    Ducks                  head=Ducks             head.dep_=nsubj
    Cats                   head=Cats              head.dep_=nsubj
    dogs                   head=dogs              head.dep_=conj
    Neural nets            head=nets              head.dep_=nsubj

NAMED ENTITIES


In [17]:
# pipelines add further attributes to tokens, which often include
# - lemmatized form of a token, in lemma_
# - normalized form of a token, in norm_  (seems to mostly resolve contractions, though is often largely the same as lemmatizer output)
# - coarse tagging, in pos_   (often following wider conventions)
# - finer tagging, in tag_   (more easily model/language specific)
# - there's also .orth_ but it seems to be an implementation detail, and practically equivalent to .text

fields = '%15s %15s %15s %10s %10s %10s %10s  %15s  %s'
head = ('TEXT', 'LEMMA', 'NORMALIZED', 'POS', 'TAG', 'STOP?', 'UNKNOWN?', 'EXPL(POS)', 'EXPL(TAG)')   # unknown as in 'out of vocabulary'
print( fields % head )
complicated = english_trf( "I don't think this was complex.")
for tok in complicated:
    print( fields%(tok.text, tok.lemma_, tok.norm_, tok.pos_, tok.tag_, tok.is_stop, tok.is_oov, spacy.explain(tok.pos_), spacy.explain(tok.tag_) ) )

# Notes:
# - If you're not sure about the tagset (which can vary per language), you can ask spacy via explain(), like above
# - for certain models, is_oov will always be True.  Don't rely on it until you understand why.
# - things you may want to know exist, even if you do not often use them
#   - .is_sent_start, .is_bracket, .is_quote, is_left_punct, is_punct, is_upper;  like_url, like_email,  and a few more
#   - the ability to fetch the original whitespace  (this example uses .text which ignores that,  while .text_with_ws is basically .text+.whitespace_ )
#   - because you're probably wondering: those underscores are the human-readable variants, 
#     and there will also be an attribute without the underscore that stores a number, which refers to something in the model it came from (this allows more compact storage of parses)

           TEXT           LEMMA      NORMALIZED        POS        TAG      STOP?   UNKNOWN?        EXPL(POS)  EXPL(TAG)
              I               I               i       PRON        PRP       True       True          pronoun  pronoun, personal
             do              do              do        AUX        VBP       True       True        auxiliary  verb, non-3rd person singular present
            n't             not             not       PART         RB       True       True         particle  adverb
          think           think           think       VERB         VB      False       True             verb  verb, base form
           this            this            this       PRON         DT       True       True          pronoun  determiner
            was              be             was        AUX        VBD       True       True        auxiliary  verb, past tense
        complex         complex         complex        ADJ         JJ      False       True        adjective  adj

In [16]:
# Models also detail certain dependencies/relations between tokens. 
# On each token,  .head is the source of that relation, .dep_ is its type 
# Yes, this is a somewhat unusual way of representing and storing these.
bean = english_trf( "John, I gave the bean to Alice's sister" )
for tok in bean:
   print( '%10s   <--%s--   %s '%(tok.text,  spacy.explain(tok.dep_).rjust(35, '-'), tok.head) )
# This may become clearer once you see these visualized - scroll down

      John   <----noun phrase as adverbial modifier--   gave 
         ,   <--------------------------punctuation--   gave 
         I   <----------------------nominal subject--   gave 
      gave   <---------------------------------root--   gave 
       the   <---------------------------determiner--   bean 
      bean   <------------------------direct object--   gave 
        to   <---------------prepositional modifier--   gave 
     Alice   <------------------possession modifier--   sister 
        's   <-------------------------case marking--   Alice 
    sister   <----------------object of preposition--   to 
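
Because every token stores only its own head, walking the tree means following `.head` upwards or `.children` downwards. The sketch below builds a Doc by hand so it runs without a model; the heads and labels are typed in for illustration and are an assumption, not parser output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["I", "gave", "the", "bean", "to", "Alice"]
heads = [1, 1, 3, 1, 1, 4]       # absolute index of each token's head; the root points at itself
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for tok in doc:
    # ancestors follows .head repeatedly until it reaches the root
    path = [tok.text] + [anc.text for anc in tok.ancestors]
    print("%-6s %-6s %s" % (tok.text, tok.dep_, " -> ".join(path)))

# The root is the one token that is its own head
root = [tok for tok in doc if tok.head.i == tok.i][0]
root_children = [child.text for child in root.children]
print(root.text, root_children)
```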


In [20]:
# Dependencies are a decent way to figure out a sentence's subjects and objects. 
# Do test this on more complex sentences too, to see what it does when subjects and objects are implicit or absent, and when there is more than one in a sentence.
for i, sent in enumerate( statements_doc.sents ):
    for tok in sent:
        if tok.dep_ == 'nsubj':
            print( "Sentence %s SUBJECT: %s"%(i, tok.text) )
        if tok.dep_ in ('dobj', 'pobj'):   # direct object, or object of a preposition
            print( "Sentence %s OBJECT:  %s"%(i, tok.text) )       

# In various cases you may want to figure out whether the token marked as e.g. subject or object is part of a larger thing, like a named entity or noun chunk.
# This may be easier to check the other way around: for each noun chunk or entity, see if its head/root has the relation you're interested in:
print("\nNOUN CHUNKS")
for nc in statements_doc.noun_chunks:     
    print( '    %-20s   head=%-15s   head.dep_=%s'%(nc.text, nc.root.text, nc.root.dep_))

Sentence 0 SUBJECT: learning
Sentence 1 SUBJECT: Learning
Sentence 2 SUBJECT: Ducks
Sentence 3 SUBJECT: Cats
Sentence 4 SUBJECT: nets

NOUN CHUNKS
    Machine learning       head=learning          head.dep_=nsubj
    Deep Learning          head=Learning          head.dep_=nsubj
    Ducks                  head=Ducks             head.dep_=nsubj
    Cats                   head=Cats              head.dep_=nsubj
    dogs                   head=dogs              head.dep_=conj
    Neural nets            head=nets              head.dep_=nsubj
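
Another way to get from a subject token to the larger phrase is to take everything that depends on it. A sketch on a hand-built Doc (the heads and labels are assumed for illustration, not produced by a model); `phrase_for` is a hypothetical helper name.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Cats", "and", "dogs", "are", "cute", "."]
heads = [3, 0, 0, 3, 3, 3]       # absolute head indices, typed in by hand
deps = ["nsubj", "cc", "conj", "ROOT", "acomp", "punct"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

def phrase_for(tok):
    """Span covering a token and everything that depends on it."""
    return tok.doc[tok.left_edge.i : tok.right_edge.i + 1]

subjects = [phrase_for(tok).text for tok in doc if tok.dep_ == "nsubj"]
print(subjects)
```

Compared to the noun chunks above, this reports the coordinated subject as one phrase rather than as two separate chunks.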


In [21]:
# When looking to split the more contentful from the filler/function words, there are a few ways to go about it.
# One quick and dirty way is to 'give me all potentially useful tokens, ignoring the less interesting parts of speech', like:
interesting, boring, todo = [], [], []
for tok in statements_doc:
    if tok.is_stop:
        boring.append( tok )
    elif tok.pos_ in ('PUNCT','SPACE', 'X'):
        boring.append( tok )
    elif tok.pos_ in ('AUX','DET','CCONJ'): # auxiliary verbs, determiners, conjunctions are mostly just function words
        boring.append( tok )
    elif tok.pos_ in ('NOUN','PROPN','NUM'):  # nouns tend to be useful.   Numbers could be ignored; if they're dates, there is some chance that they're picked up as entities
        interesting.append( tok )
    elif tok.pos_ in ('ADJ','VERB','ADP', 'ADV'): # adjectives, verbs, adpositions, adverbs - are often halfway useful
        interesting.append( tok )
    else: #  (debug: labels not yet in the above)
        todo.append( '%s/%s'%(tok.text, tok.pos_) )

print("CONTENTFUL")
for tok in interesting:
    print( ' ', tok, tok.prob )

print("\nBORING")
for tok in boring:
    print( ' %r'%tok.text, end=', ' )

#print("\nTODO")
#for tok in todo:
#    print( '  ',tok )


CONTENTFUL
  Machine -20.0
  learning -20.0
  easy -20.0
  Deep -20.0
  Learning -20.0
  straightforward -20.0
  Ducks -20.0
  nice -20.0
  Cats -20.0
  dogs -20.0
  cute -20.0
  Neural -20.0
  nets -20.0
  useful -20.0

BORING
 'can',  'be',  '.',  'is',  "n't",  '.',  'are',  '.',  'and',  'are',  '.',  'are',  '.', 
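
If no tagger is available, a cruder split is still possible using only lexical flags. This sketch assumes a blank English pipeline, where `pos_` is empty but `is_stop`, `is_punct` and `is_space` still work.

```python
import spacy

nlp = spacy.blank("en")   # no tagger, so tok.pos_ is empty; lexical flags still work
doc = nlp("Cats and dogs are cute.")

def is_filler(tok):
    """Stopwords, punctuation and whitespace tokens count as filler here."""
    return tok.is_stop or tok.is_punct or tok.is_space

contentful = [tok.text for tok in doc if not is_filler(tok)]
boring = [tok.text for tok in doc if is_filler(tok)]
print("CONTENTFUL", contentful)
print("BORING", boring)
```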

## Visualisation
spacy's own displacy is geared towards python notebooks.   You can also fish out representations for display elsewhere.
-  style='ent' is for entities, shown with HTML blocks, and fine for longer texts
-  style='dep' is for dependencies, shown as an SVG diagram, and generally unwieldy if showing more than one sentence at a time
-  style='span' is for any spans (noun chunks, entities, others that you may add) which may overlap.

In [22]:
paris_text = """During the Restoration, the bridges and squares of Paris were returned to their pre-Revolution names; the July Revolution in 1830 
(commemorated by the July Column on the Place de la Bastille) brought a constitutional monarch, Louis Philippe I, to power.
The first railway line to Paris opened in 1837, beginning a new period of massive migration from the provinces to the city.
Louis-Philippe was overthrown by a popular uprising in the streets of Paris in 1848. His successor, Napoleon III, 
alongside the newly appointed prefect of the Seine, Georges-Eugène Haussmann, launched a gigantic public works project to 
build wide new boulevards, a new opera house, a central market, new aqueducts, sewers and parks, including the Bois de Boulogne 
and Bois de Vincennes."""
paris_doc = english_trf( paris_text )

print("NAMED ENTITIES")
for ent in paris_doc.ents:
    print( '  %-28s   label=%-8s (%-53s)    head=%-15s   head.dep_=%s'%(ent.text, ent.label_, spacy.explain(ent.label_), ent.root.text, ent.root.dep_))

spacy.displacy.render(paris_doc, style='ent')   


NAMED ENTITIES
  Restoration                    label=EVENT    (Named hurricanes, battles, wars, sports events, etc. )    head=Restoration       head.dep_=pobj
  Paris                          label=GPE      (Countries, cities, states                            )    head=Paris             head.dep_=pobj
  the July Revolution            label=EVENT    (Named hurricanes, battles, wars, sports events, etc. )    head=Revolution        head.dep_=nsubj
  1830                           label=DATE     (Absolute or relative dates or periods                )    head=1830              head.dep_=pobj
  the July Column                label=FAC      (Buildings, airports, highways, bridges, etc.         )    head=Column            head.dep_=pobj
  the Place de la Bastille       label=FAC      (Buildings, airports, highways, bridges, etc.         )    head=Bastille          head.dep_=pobj
  Louis Philippe I               label=PERSON   (People, including fictional                          )    head=I 

In [25]:
louis_doc = english_trf( "Louis XVI and the royal family were brought to Paris and made prisoners within the Tuileries Palace." )
spacy.displacy.render(louis_doc, style='dep')

In [11]:
# You can get it a little more compact like
spacy.displacy.render(louis_doc, style='dep', options={'compact':True, 'bg':'#00000000', 'color':'#ffffff', 'distance':85, 'word_spacing':42, 'arrow_stroke':1})
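
For display outside a notebook, displacy can hand back the markup or the underlying data instead of showing it. A sketch under the assumption that one entity is labelled by hand, so no trained model is needed:

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
text = "The first railway line to Paris opened in 1837."
doc = nlp(text)

# Hand-label one entity so the example needs no trained model.
start = text.index("Paris")
doc.ents = [doc.char_span(start, start + len("Paris"), label="GPE")]

# jupyter=False returns the markup as a string instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])

# parse_ents() gives the plain data that the entity view is rendered from
data = displacy.parse_ents(doc)
print(data["ents"])
```

The string in `html` can be written to a file or embedded in a page; the dict in `data` is handy if you want to render with your own template.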

## Extracting patterns with rule-based matching

In [34]:
from spacy.matcher import Matcher

# One or more adjectives  before  a noun or proper noun
an_pattern = [
    [ {"POS": "ADJ", "OP": "+"},   {"POS": {"IN":["NOUN","PROPN"]}} ],
    # you can have more rules in a matcher
]
matcher = Matcher(english_trf.vocab)
matcher.add("adjective-noun", an_pattern)
matches = matcher( paris_doc )
for match_id, start_i, end_i in matches:
    print( paris_doc[start_i:end_i] ) # from the indices, print the described span 

# Notes: 
# - you can express more complex types of patterns, see https://spacy.io/api/matcher#patterns
# - You could extend this to more complex tasks, like maybe rule-based phrase and named entity extraction.
#   - ...though you might base that on more specific existing code like PhraseMatcher and EntityRuler,
#     which may work faster and/or annotate automatically.
#   - and in the case of NER would probably still be less effective than existing trained NER model components


Revolution names
-Revolution names
pre-Revolution names
constitutional monarch
first railway
new period
massive migration
popular uprising
public works
gigantic public works
new boulevards
wide new boulevards
new opera
central market
new aqueducts
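
The output above contains overlapping matches (e.g. "public works" and "gigantic public works"), because the `+` operator reports every matching run. One way to keep only the longest non-overlapping ones is `spacy.util.filter_spans`. The sketch uses a blank pipeline and the lexical flag `IS_TITLE` instead of POS tags, purely so that it runs without a model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")   # IS_TITLE is a lexical flag, so no tagger is needed
doc = nlp("the reign of Louis Philippe ended in 1848.")

matcher = Matcher(nlp.vocab)
matcher.add("capitalized-run", [[{"IS_TITLE": True, "OP": "+"}]])

spans = [doc[start:end] for _, start, end in matcher(doc)]
print([s.text for s in spans])        # includes the overlapping shorter runs

longest = filter_spans(spans)         # prefers longer spans, drops overlaps
print([s.text for s in longest])
```

`Matcher.add` also accepts a `greedy="LONGEST"` argument, which filters per pattern at match time and may be the simpler choice when you have a single rule.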


##  Inventing your own functions to apply to spacy

In [35]:
def complexity( span ):
    ''' Takes a parsed spacy sentence, and estimates its complexity.

        Currently uses only the average distance of the dependencies, which is fairly decent for how simple it is. 
        Consider e.g.
        - long sentences aren't necessarily complex at all (they can just be separate things joined by a comma),
            they mainly become harder to parse if they introduce long-distance references.
        - parenthetical sentences will lengthen references across them
        - lists and flat compounds will drag the complexity down
        - this doesn't really need normalization

        Downsides include that spacy seems to assign some dependencies just because it needs to, not necessarily sensibly.
        Also, we should probably count most named entities as a single thing, not the amount of tokens in them
    '''
    dists = []
    for tok in span:
        dist = tok.head.i - tok.i
        dists.append( dist ) 
    abs_dists = list( abs(d)  for d in dists )
    avg_dist = float(sum(abs_dists)) / len(abs_dists)
    return avg_dist

for sent in paris_doc.sents:   # paris_doc is already parsed, no need to run the pipeline again
    print('[%s]\nComplexity: %.1f\n'%(sent.text.replace('\n',' ').strip(), complexity(sent) ) )


[During the Restoration, the bridges and squares of Paris were returned to their pre-Revolution names; the July Revolution in 1830  (commemorated by the July Column on the Place de la Bastille) brought a constitutional monarch, Louis Philippe I, to power.]
Complexity: 4.3

[The first railway line to Paris opened in 1837, beginning a new period of massive migration from the provinces to the city. Louis-Philippe was overthrown by a popular uprising in the streets of Paris in 1848.]
Complexity: 4.0

[His successor, Napoleon III,  alongside the newly appointed prefect of the Seine, Georges-Eugène Haussmann, launched a gigantic public works project to  build wide new boulevards, a new opera house, a central market, new aqueducts, sewers and parks, including the Bois de Boulogne  and Bois de Vincennes.]
Complexity: 4.7
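
To make the arithmetic behind this measure concrete, here is the same calculation on plain head indices, without spacy. Both parses are typed in by hand, and the second one is a made-up attachment that exists only to show how one long-distance dependency moves the average.

```python
def average_head_distance(heads):
    """Mean absolute distance between each token index and its head index."""
    distances = [abs(head - i) for i, head in enumerate(heads)]
    return sum(distances) / len(distances)

# "I gave the bean to Alice": most tokens attach to "gave" (index 1),
# and the root points at itself, contributing a distance of 0
flat = [1, 1, 3, 1, 1, 4]

# Hypothetical variant: "Alice" attaches all the way back to "I" (index 0)
stretched = [1, 1, 3, 1, 1, 0]

flat_score = average_head_distance(flat)            # (1+0+1+2+3+1) / 6
stretched_score = average_head_distance(stretched)  # (1+0+1+2+3+5) / 6
print("%.2f  %.2f" % (flat_score, stretched_score))
```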



## Some specialized models

There are sometimes readymade models that help do specific tasks - possibly _only_ those tasks.

Say you have a text that you know contains multiple languages, and you want to feed each to a parser for that language.

### Language detection
This is useful when one piece of text mixes multiple languages, and you want to feed each part to the right nlp object.

There is a spacy_fastlang library (which depends on the fasttext library) that can do that for you.

The below uses detect_language() - which is a handful of lines mostly just doing the boilerplate of "create model, stick fastlang on it, run it, and report"

In [None]:
!pip3 install --quiet spacy_fastlang

In [10]:
print("LANGUAGE DETECTION")
for example in ("I am cheese", "Ik ben kaas", "Je suis fromage", "Ich bin Käse", "Ek is kaas", "Saya keju", "Olen juusto", "ben peynirim"):
    lang, score = helpers_spacy.detect_language(example)
    print( "    %s (certainty: %.2f) for  %r"%( lang, score, example ) )

LANGUAGE DETECTION
    en (certainty: 0.73) for  'I am cheese'
    nl (certainty: 0.53) for  'Ik ben kaas'
    fr (certainty: 1.00) for  'Je suis fromage'
    de (certainty: 1.00) for  'Ich bin Käse'
    af (certainty: 0.66) for  'Ek is kaas'
    id (certainty: 0.58) for  'Saya keju'
    fi (certainty: 0.91) for  'Olen juusto'
    tr (certainty: 0.29) for  'ben peynirim'
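
Given a detected language and its certainty, choosing which model to feed the text to can be a simple lookup. This is a sketch: the mapping, the fallback and the 0.5 cutoff are assumptions for illustration, and the input pairs are copied from the output above.

```python
# Hypothetical mapping from detected language code to a spaCy model name.
MODEL_FOR = {
    "en": "en_core_web_sm",
    "nl": "nl_core_news_sm",
    "fr": "fr_core_news_sm",
    "de": "de_core_news_sm",
}
FALLBACK = "xx_sent_ud_sm"   # multi-language model, used when unsure or unsupported
MIN_CERTAINTY = 0.5          # assumed cutoff, tune on your own data

def choose_model(lang, score, min_certainty=MIN_CERTAINTY):
    """Pick a model name for a (language, certainty) detection result."""
    if score < min_certainty:
        return FALLBACK
    return MODEL_FOR.get(lang, FALLBACK)

detections = [("en", 0.73), ("nl", 0.53), ("fr", 1.00), ("tr", 0.29), ("fi", 0.91)]
for lang, score in detections:
    print("%s (%.2f) -> %s" % (lang, score, choose_model(lang, score)))
```

Short inputs like the three-word examples above give low certainties, so in practice it may be better to detect per paragraph than per sentence.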


### Sentence splitting
The [xx_sent_ud_sm](https://spacy.io/models/xx#xx_sent_ud_sm) model is trained on multiple languages, and may even do better than a language-specific model's own sentence splitter (particularly when that splitter is rule-based, which tends to be naive).

In [None]:
!python3 -m spacy download xx_sent_ud_sm

In [37]:
print("SENTENCE SPLITTING")
split_doc = helpers_spacy.sentence_split("""C'est n'est pas une pipe. A capital after Mr. Abbreviation might throw things off. (Also) weird sentence starts could. 
     As can ellipses... as you can imagine. As could "Things we quote?" ...as they are embedded sentences and what comes after is unknown. (en die laatste zou anders werken met een komma)""") 
for sent in split_doc.sents:
    text = sent.text.strip()
    print('    [%s] %s'%( helpers_spacy.detect_language(text)[0], text ) )   # add language detection, as per our initial plan


SENTENCE SPLITTING
    [fr] C'est n'est pas une pipe.
    [en] A capital after Mr. Abbreviation might throw things off.
    [en] (Also) weird sentence starts could.
    [en] As can ellipses... as you can imagine.
    [en] As could "Things we quote?"
    [en] ...as they are embedded sentences and what comes after is unknown.
    [nl] (en die laatste zou anders werken met een komma)
