### Fair Warning:

* I will show you cool things, in random order
* I am very very biased toward the English language

### Word Tokenization and Sentence Segmentation: What is a lexical unit?

What is a word? [Dictionary entries + alternate word forms]

### Basic, Deterministic Approach:
* words: split on white space
* sentences: split on punctuation

#### Regex Examples:

Regex | Matches | Example
--- | --- | ---
Jupyter | Every occurrence of the literal string "Jupyter" | Welcome to **Jupyter**Con 2017!
[JCe] | Any single character "J", "C", or "e" | W**e**lcom**e** to **J**upyt**e**r**C**on 2017!
[A-Z] | Any single capital letter | **W**elcome to **J**upyter**C**on 2017!
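
The three patterns in the table can be checked directly with Python's built-in `re` module. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
import re

text = "Welcome to JupyterCon 2017!"

# A literal pattern matches the exact character sequence.
literal = re.findall(r"Jupyter", text)

# A character set matches any ONE of the listed characters.
char_set = re.findall(r"[JCe]", text)

# A range inside a set matches any ONE character in that range.
capitals = re.findall(r"[A-Z]", text)

print(literal)   # ['Jupyter']
print(char_set)  # ['e', 'e', 'J', 'e', 'C']
print(capitals)  # ['W', 'J', 'C']
```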


### Build a Regex Preview
* Text Box
* Pattern Box
* Results

In [3]:
#%matplotlib notebook
from ipywidgets import Button, Textarea, Layout, Box, Label
from IPython.display import display, Markdown, clear_output
import re

class RegexFinder(object):
    def __init__(self, init_text=""):
        # The top box holds the text to search; the bottom box holds the regex.
        self.text_field = Textarea(value=init_text, layout=Layout(height='100px'))
        self.pattern_field = Textarea(layout=Layout(height='20px'))
        self.text_box = Box([Label(value='Text Box'), self.text_field])
        self.pattern_box = Box([Label(value='Pattern Box'), self.pattern_field])
        
        self.match_button = Button(description='Match Pattern', )
        self.match_button.on_click(self.match_pattern)
        
        display(self.text_box)
        display(self.pattern_box)
        display(self.match_button)
        self.match_pattern(None)
        
    @property
    def pattern(self):
        return re.compile(self.pattern_field.value)
    
    @property
    def text(self):
        return self.text_field.value    
    
    def match_pattern(self, b):
        clear_output()
        display(Markdown(self.format_match_markdown(self.text, self.pattern)))
        
    def format_match_markdown(self, text, pattern):

        new_string = ""
        last = 0

        for i in pattern.finditer(text):
            start, stop = i.span()
            new_string += text[last:start] + "<b style='color:blue;'><u>" + text[start:stop] +  "</u></b>"
            last = stop

        new_string += text[last:]
        return new_string
        
    
    
r = RegexFinder("Welcome to JupyterCon 2017!")

<b style='color:blue;'><u>Welco</u></b>me to JupyterCon 2017!

### Exercise: Match all proper nouns

In [4]:
!pip3 install wikipedia >> pip-wiki.txt
import wikipedia
wiki = wikipedia.WikipediaPage('Noam Chomsky')
intro_paragraph = wiki.content.split('\n')[0]
finder = RegexFinder(intro_paragraph)

<b style='color:blue;'><u>Avram</u></b> <b style='color:blue;'><u>Noam</u></b> <b style='color:blue;'><u>Chomsky</u></b> (US:  a-VRAHM NOHM CHOM-<b style='color:blue;'><u>skee</u></b>; <b style='color:blue;'><u>born</u></b> <b style='color:blue;'><u>December</u></b> 7, 1928) <b style='color:blue;'><u>is</u></b> <b style='color:blue;'><u>an</u></b> <b style='color:blue;'><u>American</u></b> <b style='color:blue;'><u>linguist</u></b>, <b style='color:blue;'><u>philosopher</u></b>, <b style='color:blue;'><u>cognitive</u></b> <b style='color:blue;'><u>scientist</u></b>, <b style='color:blue;'><u>historian</u></b>, <b style='color:blue;'><u>social</u></b> <b style='color:blue;'><u>critic</u></b>, <b style='color:blue;'><u>and</u></b> <b style='color:blue;'><u>political</u></b> <b style='color:blue;'><u>activist</u></b>. <b style='color:blue;'><u>Sometimes</u></b> <b style='color:blue;'><u>described</u></b> <b style='color:blue;'><u>as</u></b> "<b style='color:blue;'><u>the</u></b> <b style='color:blue;'><u>father</u></b> <b style='color:blue;'><u>of</u></b> <b style='color:blue;'><u>modern</u></b> <b style='color:blue;'><u>linguistics</u></b>", <b style='color:blue;'><u>Chomsky</u></b> <b style='color:blue;'><u>is</u></b> <b style='color:blue;'><u>also</u></b> a <b style='color:blue;'><u>major</u></b> <b style='color:blue;'><u>figure</u></b> <b style='color:blue;'><u>in</u></b> <b style='color:blue;'><u>analytic</u></b> <b style='color:blue;'><u>philosophy</u></b> <b style='color:blue;'><u>and</u></b> <b style='color:blue;'><u>one</u></b> <b style='color:blue;'><u>of</u></b> <b style='color:blue;'><u>the</u></b> <b style='color:blue;'><u>founders</u></b> <b style='color:blue;'><u>of</u></b> <b style='color:blue;'><u>the</u></b> <b style='color:blue;'><u>field</u></b> <b style='color:blue;'><u>of</u></b> <b style='color:blue;'><u>cognitive</u></b> <b style='color:blue;'><u>science</u></b>. 
<b style='color:blue;'><u>He</u></b> <b style='color:blue;'><u>is</u></b> <b style='color:blue;'><u>Institute</u></b> <b style='color:blue;'><u>Professor</u></b> <b style='color:blue;'><u>Emeritus</u></b> <b style='color:blue;'><u>at</u></b> <b style='color:blue;'><u>the</u></b> <b style='color:blue;'><u>Massachusetts</u></b> <b style='color:blue;'><u>Institute</u></b> <b style='color:blue;'><u>of</u></b> <b style='color:blue;'><u>Technology</u></b> (MIT), <b style='color:blue;'><u>where</u></b> <b style='color:blue;'><u>he</u></b> <b style='color:blue;'><u>has</u></b> <b style='color:blue;'><u>worked</u></b> <b style='color:blue;'><u>since</u></b> 1955, <b style='color:blue;'><u>and</u></b> <b style='color:blue;'><u>is</u></b> <b style='color:blue;'><u>the</u></b> <b style='color:blue;'><u>author</u></b> <b style='color:blue;'><u>of</u></b> <b style='color:blue;'><u>over</u></b> 100 <b style='color:blue;'><u>books</u></b> <b style='color:blue;'><u>on</u></b> <b style='color:blue;'><u>topics</u></b> <b style='color:blue;'><u>such</u></b> <b style='color:blue;'><u>as</u></b> <b style='color:blue;'><u>linguistics</u></b>, <b style='color:blue;'><u>war</u></b>, <b style='color:blue;'><u>politics</u></b>, <b style='color:blue;'><u>and</u></b> <b style='color:blue;'><u>mass</u></b> <b style='color:blue;'><u>media</u></b>. <b style='color:blue;'><u>Ideologically</u></b>, <b style='color:blue;'><u>he</u></b> <b style='color:blue;'><u>aligns</u></b> <b style='color:blue;'><u>with</u></b> <b style='color:blue;'><u>anarcho</u></b>-<b style='color:blue;'><u>syndicalism</u></b> <b style='color:blue;'><u>and</u></b> <b style='color:blue;'><u>libertarian</u></b> <b style='color:blue;'><u>socialism</u></b>.
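
One possible attempt at the exercise, offered as a sketch rather than a solution: treat every run of capitalized words as a proper-noun candidate. The pattern name `PROPER_NOUN_RE` and the abridged sample text are assumptions made for this example. The result shows why a purely orthographic rule is not enough: sentence-initial words are matched, and all-caps names such as "MIT" are missed.

```python
import re

# A capital letter followed by lowercase letters, optionally repeated
# across whitespace so that multi-word names stay together.
PROPER_NOUN_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

# Abridged from the introduction paragraph shown above.
sample = (
    'Avram Noam Chomsky is an American linguist. '
    'Sometimes described as "the father of modern linguistics", '
    'Chomsky is also a major figure in analytic philosophy. '
    'He is Institute Professor Emeritus at the '
    'Massachusetts Institute of Technology (MIT).'
)

candidates = PROPER_NOUN_RE.findall(sample)
for candidate in candidates:
    print(candidate)
```

"Sometimes" and "He" are false positives, "Massachusetts Institute of Technology" is split at the lowercase "of", and "MIT" is not found at all.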

In [5]:
import re

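def sent_segmenter(doc):
    # Naive rule: every '.', '?' or '!' ends a sentence.
    sent_pattern = re.compile(r'[.?!]')
    for sent in sent_pattern.split(doc):
        yield sent


def word_segmenter(sent):
    # Naive rule: every single whitespace character separates two words.
    word_pattern = re.compile(r'\s')
    for word in word_pattern.split(sent):
        yield word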
        
        
sentences = sent_segmenter(intro_paragraph)
for word in word_segmenter(next(sentences)):
    print(word)

Avram
Noam
Chomsky
(US:

a-VRAHM
NOHM
CHOM-skee;
born
December
7,
1928)
is
an
American
linguist,
philosopher,
cognitive
scientist,
historian,
social
critic,
and
political
activist


Failures of Deterministic Approach:
* "Dr. White saw a patient." is one sentence, not two.
* "(US: ...)" is probably best tokenized as "(", "US", ":", ...; i.e., we missed other forms of punctuation.
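
Both failures are easy to reproduce with the same two naive rules (split sentences on `.`, `?`, `!` and split words on whitespace). The input strings below are short illustrations, not taken from a real corpus.

```python
import re

# The abbreviation "Dr." is treated as a sentence boundary.
naive_sentences = re.split(r"[.?!]", "Dr. White saw a patient.")
print(naive_sentences)  # ['Dr', ' White saw a patient', '']

# Brackets and colons stay glued to the neighbouring word.
naive_words = re.split(r"\s", "Chomsky (US: a-VRAHM)")
print(naive_words)  # ['Chomsky', '(US:', 'a-VRAHM)']
```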

### How spaCy's Tokenizer Works:
* Iterate over space-separated substrings
* Check whether we have an explicitly defined rule for this substring. If we do, use it.
* Otherwise, try to consume a prefix.
* If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
* If we didn't consume a prefix, try to consume a suffix.
* If we can't consume a prefix or suffix, look for "infixes" (hyphens and the like) and split on them.
* Once we can't consume any more of the string, handle it as a single token.
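
The loop described above can be sketched in plain Python without spaCy. Everything in this block is an assumption made for illustration: the tiny `SPECIAL_CASES` table, the three regexes, and the function names are hypothetical stand-ins, not spaCy's real rules or API.

```python
import re

# Hypothetical, deliberately tiny rule tables (not spaCy's real data).
SPECIAL_CASES = {"don't": ["do", "n't"], "dont": ["do", "nt"]}
PREFIX_RE = re.compile(r'''^[\[("']''')         # one opening character
SUFFIX_RE = re.compile(r'''[\])"'!?.,;:]$''')   # one closing character
INFIX_RE = re.compile(r"[-~]")                  # splits inside a token


def tokenize_chunk(chunk):
    """Tokenize one whitespace-delimited substring."""
    prefixes, suffixes = [], []
    while chunk:
        # Special cases are checked first on every pass, so they win.
        if chunk in SPECIAL_CASES:
            return prefixes + SPECIAL_CASES[chunk] + suffixes[::-1]
        match = PREFIX_RE.search(chunk)
        if match:
            prefixes.append(match.group())
            chunk = chunk[match.end():]
            continue  # back to the top of the loop
        match = SUFFIX_RE.search(chunk)
        if match:
            suffixes.append(match.group())
            chunk = chunk[:match.start()]
            continue
        break  # nothing left to strip from either end

    # Split whatever remains on infixes, keeping the infix as a token.
    middle, last = [], 0
    for match in INFIX_RE.finditer(chunk):
        if match.start() > last:
            middle.append(chunk[last:match.start()])
        middle.append(match.group())
        last = match.end()
    if last < len(chunk):
        middle.append(chunk[last:])

    # Suffixes were collected outside-in, so reverse them.
    return prefixes + middle + suffixes[::-1]


def sketch_tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_chunk(chunk))
    return tokens


print(sketch_tokenize('She said "don\'t forget to take out the (old) trash!"'))
```

Because the special-case lookup sits at the top of the loop, `"don't` is first stripped of its opening quote and the remainder is then resolved by the exception table instead of being split further.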

### Example
She didnt say "don't forget to take out the trash!"

In [181]:
# Tokenizer exceptions (the `spacy.en` import path is specific to spaCy 1.x):
from spacy.en import TOKENIZER_EXCEPTIONS
from spacy.attrs import ORTH

print(TOKENIZER_EXCEPTIONS['dont'])
print([i[ORTH] for i in TOKENIZER_EXCEPTIONS['dont']])

[{65: 'do', 73: 'do'}, {65: 'nt', 75: 'RB', 73: 'not'}]
['do', 'nt']


In [6]:
#!pip install spacy
#!python -m spacy download en
import spacy
nlp = spacy.load('en')


In [6]:

from spacy.attrs import ORTH
from spacy.en import TOKENIZER_EXCEPTIONS  # spaCy 1.x import path, as above
import re

whitespace = re.compile(r"\s+")

def tokenize(doc):
    
    token_list = []
    
    # begin by iterating over candidate tokens, 
    # which are separated by white space
    for token in whitespace.split(doc):
        #expand the candidate tokens to the full list
        new_tokens = iter_tokenize(token, [])
        token_list.extend(new_tokens)
    
    return token_list

def iter_tokenize(token, tokens):
    if token in TOKENIZER_EXCEPTIONS:
        for addition in TOKENIZER_EXCEPTIONS[token]:
            print("appending exception ", addition[ORTH])
            tokens.append(addition[ORTH])
        return tokens
    else:
        pre_match = nlp.tokenizer.prefix_search(token)
        suff_match = nlp.tokenizer.suffix_search(token)     
        if pre_match:
            start, end = pre_match.start(), pre_match.end()
            prefix, token = token[start:end], token[end:]
            print("appending prefix ", prefix)
            tokens.append(prefix)
            tokens = iter_tokenize(token, tokens)
        elif suff_match:
            start, end = suff_match.start(), suff_match.end()
            token, suffix = token[:start], token[start:end]
            tokens = iter_tokenize(token, tokens)
            print("appending suffix ", suffix)
            tokens.append(suffix)
        else:
            print("appending token ", token)            
            tokens.append(token)
        return tokens


In [178]:
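# Reconstructed from the output below; `example` was never defined in a cell.
example = 'She said "don\'t forget to take out the (old) trash!" '
tokens = tokenize(example)
print(tokens)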

appending token  She
appending token  said
appending prefix  "
appending exception  do
appending exception  n't
appending token  forget
appending token  to
appending token  take
appending token  out
appending token  the
appending prefix  (
appending token  old
appending suffix  )
appending token  trash
appending suffix  !
appending suffix  "
appending token  
['She', 'said', '"', 'do', "n't", 'forget', 'to', 'take', 'out', 'the', '(', 'old', ')', 'trash', '!', '"', '']


# Extending the spaCy tokenizer: when and how
### When 

### How
* add a special case
* modify the tokenizer's prefix/suffix/infix search patterns
* create an entirely new tokenizer

#### Adding a special case

In [189]:
list(TOKENIZER_EXCEPTIONS.keys()).index('dont')

508

In [184]:
example2 = u'Gimme that sandwich!'
for token in nlp(example2):
    print(token)

Gimme
that
sandwich
!


In [190]:
from spacy.attrs import ORTH, LEMMA, TAG
nlp.tokenizer.add_special_case('gimme', [{ORTH:u'gim', LEMMA:u'give', TAG:u"VB"}, {ORTH:'me'}])
nlp.tokenizer.add_special_case('Gimme', [{ORTH:u'Gim', LEMMA:u'give', TAG:u"VB"}, {ORTH:'me'}])

In [191]:
example2 = u'Gimme that sandwich!'
for token in nlp(example2):
    print(token)


Gim
me
that
sandwich
!


More Examples:
* willya -> will, you
* tbt -> throwback Thursday

### Modify the tokenizer

In [201]:
from spacy.tokenizer import Tokenizer

exceptions = {'Gimme': [{ORTH: u'Gim', LEMMA: u'give', TAG: u"VB"}, {ORTH: 'me'}]}
# Anchor the patterns: a prefix must match at the start of the string
# and a suffix at the end, otherwise `search` can match mid-token.
prefixes = re.compile(r'''^[\[\]\(\)'"#@]''')
suffixes = re.compile(r'''[\[\]\(\)'"]$''')

custom_tokenizer = Tokenizer(nlp.vocab,
                             exceptions,
                             prefixes.search,
                             suffixes.search,
                             nlp.tokenizer.infix_finditer)

example3 = "Gimme that sandwich @John"
print("Original Tokenizer")
print(list(nlp.tokenizer(example3)))
print("New Tokenizer")
print(list(custom_tokenizer(example3)))

    
    

Original Tokenizer
[Gim, me, that, sandwich, @John]
New Tokenizer
[Gim, me, that, sandwich, @, John]
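
A note on prefix and suffix patterns: `re.search` scans the whole string, so a prefix pattern without `^` (or a suffix pattern without `$`) can match in the middle of a token. The sketch below uses only the standard `re` module; the pattern names and sample strings are made up for illustration. How a tokenizer reacts to such a mid-token match depends on its implementation, so anchoring is the safer assumption.

```python
import re

unanchored_prefix = re.compile(r"[@#]")
anchored_prefix = re.compile(r"^[@#]")

# The unanchored pattern finds the '@' in the middle of the string.
mid_match = unanchored_prefix.search("John@home")
print(mid_match.span())  # (4, 5)

# The anchored pattern only accepts a match at position 0.
print(anchored_prefix.search("John@home"))      # None
print(anchored_prefix.search("@John").group())  # @
```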


### Adding your Tokenizer to the language pipeline

In [228]:
# Method 1: replace make_doc on an already loaded pipeline
nlp = spacy.load('en')
nlp.make_doc = custom_tokenizer

# Method 2: pass the tokenizer when loading the pipeline
nlp = spacy.load('en', make_doc=custom_tokenizer)

### Create an entirely new tokenizer

In [240]:
from spacy.tokens import Doc
class CustomTokenizer(object):
    
    def __init__(self, nlp):
        self.vocab = nlp.vocab

    def __call__(self, text):
        words = tokenize(text)
        # Assume every token is followed by a single space.
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)
    

nlp.make_doc = CustomTokenizer(nlp)
nlp(example3)

appending token  Gimme
appending token  that
appending token  sandwich
appending token  @John


Gimme that sandwich @John 
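
Marking every token as followed by a space is why the printed text ends with a trailing space, and it would also put a space between "@" and "John" if they were separate tokens. One way to recover the real whitespace flags is to align the tokens against the original string. `align_spaces` below is a hypothetical helper written for this example, not part of spaCy, and it assumes the tokens appear in the text in order.

```python
def align_spaces(text, words):
    """Return one flag per word: True if a space follows it in `text`."""
    spaces = []
    position = 0
    for word in words:
        # Raises ValueError if a word cannot be found in the remaining text.
        start = text.index(word, position)
        position = start + len(word)
        spaces.append(text[position:position + 1] == " ")
    return spaces


sample_text = "Gimme that sandwich @John"
sample_words = ["Gimme", "that", "sandwich", "@", "John"]
sample_spaces = align_spaces(sample_text, sample_words)
print(sample_spaces)  # [True, True, True, False, False]

# The flags are enough to rebuild the original string exactly.
rebuilt = "".join(w + (" " if s else "") for w, s in zip(sample_words, sample_spaces))
print(rebuilt == sample_text)  # True
```

The resulting list could then be passed as the `spaces` argument when constructing the `Doc`.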