# Intro to NLP
## Katy Machine Learning Meetup
## 11 Aug 2018

This is a quick demo of spaCy that shows the basic tools found in an NLP library.

  * Author: K. Scott Ferguson
  * Environment: Python 3.6 + spaCy (+ fast.ai/PyTorch)

General Setup Notes:

If you haven't set up a working ML machine, here are my suggestions:

0. I now prefer to do ML work in Ubuntu (dual-booting with Win 10). Win 10 now has a pretty decent Canonical Ubuntu terminal in the MS Store, but I'm not sure whether it supports GPU usage on your machine.

1. It's easier to set up and use a GPU machine in the cloud than to configure, say, your laptop GPU. I currently use Paperspace, but it's not free. AWS and GCP have free tiers; grab a tutorial and get started. My Paperspace setup is documented here: https://gist.github.com/ksferguson/0b384e892689617d1539d35c1254eb01

2. Alternatively, GCP, Paperspace, and others offer Jupyter Notebooks as a service if you want to abstract away from machines.

3. As I mentioned, I dual-boot with Win 10 on a Dell laptop that has its own NVIDIA GPU (in addition to the built-in Intel video). Setting up a fully functional Ubuntu on my own laptop was _complicated_ and required multiple iterations of driver installs to get it right. My laptop setup is documented here: https://gist.github.com/ksferguson/a6eba79df658826cacb629dcc14992ea
4. Use Anaconda to manage your Python environments.
5. Get a free GitHub account for source control unless you use something else.
6. I tend to use Jupyter notebooks for code like this, but Atom (or Vim) for larger code files. Visual Studio Code is free and nice as well. Lepton is handy for Gists.


# spaCy

Install spaCy (https://spacy.io/) and download the default (small) English model.

Use Anaconda to set up a new environment:

In [1]:
# conda create --name myspacy python=3.6
# . activate myspacy
# conda install ipykernel
# python -m ipykernel install --user --name=myspacy
# conda install nb_conda

Install spaCy

In [2]:
# conda install spacy
# python -m spacy download en
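
Before moving on, it can help to confirm that the package is visible from the kernel you are actually running. The snippet below is a minimal sketch that uses only the standard library; `is_installed` is a hypothetical helper name chosen for illustration, not part of spaCy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical helper: reports whether spaCy is importable from this kernel.
if is_installed("spacy"):
    print("spaCy found")
else:
    print("spaCy not found: check that the 'myspacy' environment is active")
```

This only checks that the package can be located; it does not verify that the English model has been downloaded.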

In [13]:
import os
import spacy
from spacy import displacy

# Load the English model
nlp = spacy.load('en')

In [14]:
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

#doc = nlp(unicode(text, 'utf-8'))

#matches = matcher(nlp(unicode(text, 'utf-8')))

    

Token Properties:

  * Text: The original word text.
  * Lemma: The base form of the word.
  * POS: The simple part-of-speech tag.
  * Tag: The detailed part-of-speech tag.
  * Dep: Syntactic dependency, i.e. the relation between tokens.
  * Shape: The word shape – capitalisation, punctuation, digits.
  * is alpha: Does the token consist only of alphabetic characters?
  * is stop: Is the token part of a stop list, i.e. the most common words of the language?
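
To make the Shape and is alpha properties more concrete, here is a rough pure-Python sketch. It is an approximation written for this demo, not spaCy's own code: `word_shape` is a hypothetical helper that maps uppercase letters to `X`, lowercase letters to `x`, digits to `d`, and keeps at most four characters of any repeated run, which is consistent with shapes such as `Xxxxx` for `Sebastian` and `dddd` for `2007`. The is alpha flag appears to behave like Python's built-in `str.isalpha()`.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape for a single token string."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        # Track the length of the current run of identical shape characters.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        # Keep at most four characters of any run.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Sebastian", "2007", "on", "-"]:
    print(word, word_shape(word), word.isalpha())
```

Truncating long runs means words of different lengths can share one shape, which is what makes shape useful as a coarse feature.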

In [15]:
print('Tokens:')
print('text lemma POS tag dep shape is_alpha is_stop')
print('---- ----- --- --- --- ----- -------- -------')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Tokens:
text lemma POS tag dep shape is_alpha is_stop
---- ----- --- --- --- ----- -------- -------
When when ADV WRB advmod Xxxx True False
Sebastian sebastian PROPN NNP compound Xxxxx True False
Thrun thrun PROPN NNP nsubj Xxxxx True False
started start VERB VBD advcl xxxx True False
working work VERB VBG xcomp xxxx True False
on on ADP IN prep xx True True
self self NOUN NN npadvmod xxxx True False
- - PUNCT HYPH punct - False False
driving drive VERB VBG amod xxxx True False
cars car NOUN NNS pobj xxxx True False
at at ADP IN prep xx True True
Google google PROPN NNP pobj Xxxxx True False
in in ADP IN prep xx True True
2007 2007 NUM CD pobj dddd False False
, , PUNCT , punct , False False
few few ADJ JJ amod xxx True True
people people NOUN NNS nsubj xxxx True False
outside outside ADV RB prep xxxx True False
of of ADP IN prep xx True True
the the DET DT det xxx True True
company company NOUN NN pobj xxxx True False
took take VERB VBD ROOT xxxx True False
him -PRON- PRON PRP dobj x

In [16]:
print('Entities:')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Entities:
Sebastian Thrun 5 20 PERSON
Google 61 67 ORG
2007 71 75 DATE
American 173 181 NORP
Thrun 271 276 PERSON
Recode 370 376 ORG
earlier this week 377 394 DATE
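
The two numbers printed for each entity are character offsets into the original string, with the end offset exclusive. A small sketch that needs no spaCy at all: the text is the first sentence of the sample above, and the offsets are copied by hand from the printed output (the `entities` list is just an illustration, not a spaCy object).

```python
# First sentence of the sample text, copied from the cell above.
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously.")

# (start_char, end_char, label) triples copied from the printed output.
entities = [(5, 20, "PERSON"), (61, 67, "ORG"), (71, 75, "DATE")]

for start, end, label in entities:
    # end_char is exclusive, so a plain slice recovers the entity text.
    print(text[start:end], label)
```

Because the offsets index the raw string, they can be used to highlight or replace entities in the source text without touching the tokenization.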


In [17]:
displacy.render(doc, style='ent', jupyter=True)

In [18]:
print('Sentences:')
for sent in doc.sents:
    print(sent.text, sent.start_char, sent.end_char)

Sentences:
When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. 0 130 
“ 131 132 
I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to,” said Thrun, now the co-founder and CEO of online higher education startup Udacity, in an interview with Recode earlier this week. 132 395 


In [19]:
displacy.render(doc, style='dep', jupyter=True, options={'compact': True})

In [None]:
displacy.serve(doc, style='dep', options={'compact': True})


    Serving on port 5000...
    Using the 'dep' visualizer



127.0.0.1 - - [05/Aug/2018 19:50:56] "GET / HTTP/1.1" 200 52196
127.0.0.1 - - [05/Aug/2018 19:50:57] "GET /favicon.ico HTTP/1.1" 200 52196
127.0.0.1 - - [05/Aug/2018 19:52:19] "GET / HTTP/1.1" 200 52196
127.0.0.1 - - [05/Aug/2018 19:52:19] "GET /favicon.ico HTTP/1.1" 200 52196


http://localhost:5000