### Fair Warning:

* I will show you cool things, in random order
* I am very, very biased toward the English language

### Word Tokenization and Sentence Segmentation: What is a lexical unit?

What is a word? [Dictionary entries + alternate word forms]

### Basic, Deterministic Approach:
* words: split on white space
* sentences: split on punctuation

#### Regex Examples:

Regex | Matches | Example
--- | --- | ---
Jupyter | Instances of "Jupyter" | Welcome to **Jupyter**Con 2017!
[JCe] | Any single "J", "C", or "e" character | W**e**lcom**e** to **J**upyt**e**r**C**on 2017!
[A-Z] | Any single capital letter | **W**elcome to **J**upyter**C**on 2017!
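
The rows of the table can be checked directly with Python's `re` module. This is only a small sketch; the loop and the variable name `text` are illustrative, not part of the original notebook.

```python
import re

text = "Welcome to JupyterCon 2017!"

# Apply each pattern from the table and list every match, left to right.
for pattern in ["Jupyter", "[JCe]", "[A-Z]"]:
    print(pattern, re.findall(pattern, text))

# Jupyter ['Jupyter']
# [JCe] ['e', 'e', 'J', 'e', 'C']
# [A-Z] ['W', 'J', 'C']
```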


### Build a Regex Preview
* Text Box
* Pattern Box
* Results

In [12]:
#%matplotlib notebook
from ipywidgets import Button, Textarea, Layout, Box, Label
from IPython.display import display, Markdown, clear_output
import re

class RegexFinder(object):
    def __init__(self, init_text=""):
        # init_text is the text to search; the regex goes in the pattern field.
        self.text_field = Textarea(init_text, layout=Layout(height='100px'))
        self.pattern_field = Textarea(layout=Layout(height='20px'))
        self.text_box = Box([Label(value='Text Box'), self.text_field])
        self.pattern_box = Box([Label(value='Pattern Box'), self.pattern_field])
        
        self.match_button = Button(description='Match Pattern', )
        self.match_button.on_click(self.match_pattern)
        
        display(self.text_box)
        display(self.pattern_box)
        display(self.match_button)
        self.match_pattern(None)
        
    @property
    def pattern(self):
        return re.compile(self.pattern_field.value)
    
    @property
    def text(self):
        return self.text_field.value    
    
    def match_pattern(self, b):
        clear_output()
        display(Markdown(self.format_match_markdown(self.text, self.pattern)))
        
    def format_match_markdown(self, text, pattern):

        new_string = ""
        last = 0

        for i in pattern.finditer(text):
            start, stop = i.span()
            new_string += text[last:start] + "<b style='color:blue;'><u>" + text[start:stop] +  "</u></b>"
            last = stop

        new_string += text[last:]
        return new_string
        
    
    
r = RegexFinder("Welcome to JupyterCon 2017!")

<b style='color:blue;'><u>W</u></b>elcome to <b style='color:blue;'><u>J</u></b>upyter<b style='color:blue;'><u>C</u></b>on 2017!

### Exercise: Match all proper nouns

In [13]:
!pip3 install wikipedia >> pip-wiki.txt
import wikipedia
wiki = wikipedia.WikipediaPage('Noam Chomsky')
intro_paragraph = wiki.content.split('\n')[0]
finder = RegexFinder(intro_paragraph)

<b style='color:blue;'><u>Avram</u></b> <b style='color:blue;'><u>Noam</u></b> <b style='color:blue;'><u>Chomsky</u></b> (US: /æˈvrɑːm ˈnoʊm ˈtʃɒmski/ a-VRAHM nohm CHOM-skee; born <b style='color:blue;'><u>December</u></b> 7, 1928) is an <b style='color:blue;'><u>American</u></b> linguist, philosopher, cognitive scientist, historian, social critic, and political activist. <b style='color:blue;'><u>Sometimes</u></b> described as "the father of modern linguistics", <b style='color:blue;'><u>Chomsky</u></b> is also a major figure in analytic philosophy and one of the founders of the field of cognitive science. <b style='color:blue;'><u>He</u></b> is <b style='color:blue;'><u>Institute</u></b> <b style='color:blue;'><u>Professor</u></b> <b style='color:blue;'><u>Emeritus</u></b> at the <b style='color:blue;'><u>Massachusetts</u></b> <b style='color:blue;'><u>Institute</u></b> of <b style='color:blue;'><u>Technology</u></b> (MIT), where he has worked since 1955, and is the author of over 100 books on topics such as linguistics, war, politics, and mass media. <b style='color:blue;'><u>Ideologically</u></b>, he aligns with anarcho-syndicalism and libertarian socialism.
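
The pattern typed into the Pattern Box is not shown, but the highlights above are consistent with "one capital letter followed by lowercase letters". The sketch below assumes that pattern (the name `proper_noun_guess` and the sample sentence are made up for illustration) and shows why it is only a rough answer: it catches sentence-initial words that are not proper nouns and misses all-caps names.

```python
import re

# Assumed candidate answer: a capital letter followed by one or more lowercase letters.
proper_noun_guess = re.compile(r"\b[A-Z][a-z]+\b")

sample = "Sometimes described as a linguist, Chomsky works at MIT."

# "Sometimes" is a false positive; "MIT" is missed because it has no lowercase letters.
print(proper_noun_guess.findall(sample))  # ['Sometimes', 'Chomsky']
```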

In [47]:
import re

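def sent_segmenter(doc):
    # Naive rule: every '.', '?' or '!' ends a sentence.
    sent_pattern = re.compile(r'[.?!]')
    for sent in sent_pattern.split(doc):
        yield sent
        
        
def word_segmenter(sent):
    # Naive rule: every whitespace character separates two words.
    word_pattern = re.compile(r'\s')
    for word in word_pattern.split(sent):
        yield word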
        
        
sentences = sent_segmenter(intro_paragraph)
for word in word_segmenter(next(sentences)):
    print(word)

Avram
Noam
Chomsky
(US:
/æˈvrɑːm
ˈnoʊm
ˈtʃɒmski/
a-VRAHM
nohm
CHOM-skee;
born
December
7,
1928)
is
an
American
linguist,
philosopher,
cognitive
scientist,
historian,
social
critic,
and
political
activist


### Failures of the Deterministic Approach:
* "Dr. White saw a patient." is one sentence, not two.
* (US: ...) is probably best tokenized as "(", "US", ":", ... i.e., we missed other forms of punctuation.
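
Both failures are easy to reproduce with plain `re`. The snippet below is a minimal sketch: the example strings and the name `token_pattern` are illustrative, and the last pattern is just one simple way to peel punctuation off words, not what spaCy does internally.

```python
import re

# Failure 1: splitting on every period cuts the abbreviation "Dr." in two.
naive_sentences = re.split(r"[.?!]", "Dr. White saw a patient.")
print(naive_sentences)  # ['Dr', ' White saw a patient', '']

# Failure 2: whitespace splitting leaves punctuation glued to the words.
naive_words = "(US: born 1928)".split()
print(naive_words)  # ['(US:', 'born', '1928)']

# A slightly better rule: a token is a run of word characters
# or a single non-space punctuation mark.
token_pattern = re.compile(r"\w+|[^\w\s]")
print(token_pattern.findall("(US: born 1928)"))
# ['(', 'US', ':', 'born', '1928', ')']
```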

In [38]:
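import spacy

# 'en' is the legacy shortcut name; newer spaCy releases expect a full
# model name such as 'en_core_web_sm'.
nlp = spacy.load('en')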

for word in nlp(intro_paragraph):
    print(word)



    Only loading the 'en' tokenizer.

Avram
Noam
Chomsky
(
US
:
/æˈvrɑːm
ˈnoʊm
ˈtʃɒmski/
a
-
VRAHM
nohm
CHOM
-
skee
;
born
December
7
,
1928
)
is
an
American
linguist
,
philosopher
,
cognitive
scientist
,
historian
,
social
critic
,
and
political
activist
.
Sometimes
described
as
"
the
father
of
modern
linguistics
"
,
Chomsky
is
also
a
major
figure
in
analytic
philosophy
and
one
of
the
founders
of
the
field
of
cognitive
science
.
He
is
Institute
Professor
Emeritus
at
the
Massachusetts
Institute
of
Technology
(
MIT
)
,
where
he
has
worked
since
1955
,
and
is
the
author
of
over
100
books
on
topics
such
as
linguistics
,
war
,
politics
,
and
mass
media
.
Ideologically
,
he
aligns
with
anarcho
-
syndicalism
and
libertarian
socialism
.


Avram
Noam
Chomsky
(US:
/æˈvrɑːm
ˈnoʊm
ˈtʃɒmski/
a-VRAHM
nohm
CHOM-skee;
born
December
7,
1928)
is
an
American
linguist,
philosopher,
cognitive
scientist,
historian,
social
critic,
and
political
activist

Sometimes
described
as
"the
father
of
modern
linguistics",
Chomsky
is
also
a
major
figure
in
analytic
philosophy
and
one
of
the
founders
of
the
field
of
cognitive
science

He
is
Institute
Professor
Emeritus
at
the
Massachusetts
Institute
of
Technology
(MIT),
where
he
has
worked
since
1955,
and
is
the
author
of
over
100
books
on
topics
such
as
linguistics,
war,
politics,
and
mass
media

Ideologically,
he
aligns
with
anarcho-syndicalism
and
libertarian
socialism



In [20]:
tweet = """
We shouldn't be looking for heroes, we should be looking for good ideas.” 
― Noam Chomsky
@allenmichael89 #NoamChomsky #Literature
"""

In [28]:
import spacy