# Intro to NLP
## Katy Machine Learning Meetup
## 11 Aug 2018

This is a quick demo of spaCy that shows the basic tools found in an NLP library.

  * Author: K. Scott Ferguson
  * Environment: Python 3.6 + spaCy (+ fast.ai/PyTorch)

General Setup Notes:

If you haven't set up a working ML machine, here are my suggestions:

0. I now prefer to do ML work in Ubuntu (dual booting with Win 10). Win 10 now has a pretty decent Canonical Ubuntu terminal in the MS Store, but I'm not sure whether it supports GPU usage on your machine.

1. It's easier to set up and use a GPU machine in the cloud than to configure, say, your laptop GPU. I currently use Paperspace, but it's not free. AWS and GCP offer free tiers; grab a tutorial and get started. My Paperspace setup is documented here: https://gist.github.com/ksferguson/0b384e892689617d1539d35c1254eb01.

1. Alternatively, GCP, Paperspace, and others offer Jupyter Notebooks as a service if you want to abstract away from machines.

2. As I mentioned, I dual boot with Win 10 on a Dell laptop that has its own NVIDIA GPU (in addition to the built-in Intel video). Setting up a fully functional Ubuntu on my own laptop was _complicated_ and required multiple iterations of driver installs to get it right. My laptop setup is documented here: https://gist.github.com/ksferguson/a6eba79df658826cacb629dcc14992ea
3. Use Anaconda to manage your Python environments.
4. Get a free GitHub account for source control unless you use something else.
5. I tend to use Jupyter notebooks for code like this, but Atom (or Vim) for larger code files. Visual Studio Code is free and nice as well. Lepton is handy for Gists.  


# spaCy

Install spaCy (https://spacy.io/) and download the default (small) English model.

First, use Anaconda to set up a new environment:

In [5]:
# conda create --name myspacy python=3.6
# . activate myspacy
# conda install ipykernel
# python -m ipykernel install --user --name=myspacy
# conda install nb_conda

Install spaCy

In [6]:
# conda install spacy
# python -m spacy download en
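
Before loading a model, it can help to confirm that the packages are visible to the kernel you just registered. The following is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and the package name `en_core_web_sm` is an assumption about what the `en` shortcut resolves to in your spaCy version.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the current interpreter can find the top-level package."""
    return importlib.util.find_spec(package_name) is not None


# 'en_core_web_sm' is an assumed name for the package behind the 'en' shortcut.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(name, status)
```

If either package shows as missing, the notebook is probably running on a different kernel than the `myspacy` environment.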

In [2]:
import os
import spacy
from spacy import displacy

# Load the English model
nlp = spacy.load('en')

In [3]:
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")

In [None]:
text = (u"This is one of the dumbest films, I've ever seen. It rips off nearly "
        u"ever type of thriller and manages to make a mess of them all.<br /><br />There's "
        u"not a single good line or character in the whole mess. If there was a plot, it was "
        u"an afterthought and as far as acting goes, there's nothing good to say so Ill say nothing. "
        u"I honestly cant understand how this type of nonsense gets produced and actually released, "
        u"does somebody somewhere not at some stage think, 'Oh my god this really is a load of shite' "
        u"and call it a day. Its crap like this that has people downloading illegally, the trailer "
        u"looks like a completely different film, at least if you have download it, you haven't wasted "
        u"your time or money Don't waste your time, this is painful.")

In [9]:
text = (u"This is one of the dumbest films, I've ever seen.")

In [10]:
doc = nlp(text)

Token Properties:

  * Text: The original word text.
  * Lemma: The base form of the word.
  * POS: The simple part-of-speech tag.
  * Tag: The detailed part-of-speech tag.
  * Dep: Syntactic dependency, i.e. the relation between tokens.
  * Shape: The word shape – capitalisation, punctuation, digits.
  * is alpha: Does the token consist of alphabetic characters?
  * is stop: Is the token part of a stop list, i.e. the most common words of the language?
  
https://spacy.io/api/annotation#section-text-processing
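
The Shape property is the least self-explanatory of these. As a rough illustration (an approximation written for this demo, not spaCy's actual implementation), a shape can be derived by mapping uppercase letters to `X`, lowercase letters to `x`, digits to `d`, and capping runs of the same symbol at four characters. The function name `word_shape_sketch` is hypothetical.

```python
def word_shape_sketch(text):
    """Approximate a word shape: X = upper, x = lower, d = digit, runs capped at 4."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation is kept as-is
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:
            shape.append(symbol)
    return "".join(shape)


for word in ("This", "dumbest", "'ve", "2007"):
    print(word, word_shape_sketch(word))
```

Shapes are handy as features because they let a model generalize over words it has never seen, such as years or capitalized names.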

In [11]:
print('Tokens:')
print('text lemma POS tag dep shape is_alpha is_stop')
print('---- ----- --- --- --- ----- -------- -------')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Tokens:
text lemma POS tag dep shape is_alpha is_stop
---- ----- --- --- --- ----- -------- -------
This this DET DT nsubj Xxxx True False
is be VERB VBZ ccomp xx True True
one one NUM CD attr xxx True True
of of ADP IN prep xx True True
the the DET DT det xxx True True
dumbest dumb ADJ JJS amod xxxx True False
films film NOUN NNS pobj xxxx True False
, , PUNCT , punct , False False
I -PRON- PRON PRP nsubj X True False
've have VERB VB aux 'xx False False
ever ever ADV RB advmod xxxx True True
seen see VERB VBN ROOT xxxx True False
. . PUNCT . punct . False False
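
The `is_alpha` and `is_stop` flags are typically used to filter tokens before further processing. Here is a minimal sketch that uses rows copied from the output above rather than a live `doc`, so it runs without a model. The variable names are illustrative.

```python
# (text, lemma, is_alpha, is_stop) copied from the token table above
rows = [
    ("This", "this", True, False),
    ("is", "be", True, True),
    ("one", "one", True, True),
    ("of", "of", True, True),
    ("the", "the", True, True),
    ("dumbest", "dumb", True, False),
    ("films", "film", True, False),
    (",", ",", False, False),
    ("I", "-PRON-", True, False),
    ("'ve", "have", False, False),
    ("ever", "ever", True, True),
    ("seen", "see", True, False),
    (".", ".", False, False),
]

# Keep alphabetic tokens that are not flagged as stop words.
content_words = [text for text, lemma, is_alpha, is_stop in rows
                 if is_alpha and not is_stop]
print(content_words)
```

With a live `doc`, the equivalent filter would be `[t.text for t in doc if t.is_alpha and not t.is_stop]`. Note that "This" and "I" survive the filter because the output above does not flag them as stop words.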


In [12]:
print('Entities:')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Entities:
one 8 11 CARDINAL
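
The `start_char` and `end_char` values are character offsets into the original string, so slicing the text with them should recover the entity. A quick check using the values printed above:

```python
text = "This is one of the dumbest films, I've ever seen."

# Offsets copied from the entity output above: one 8 11 CARDINAL
start_char, end_char = 8, 11
entity_text = text[start_char:end_char]
print(entity_text)  # one
```

Because the offsets refer to the raw string, they can be used to highlight entities in the original text without re-tokenizing it.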


In [13]:
displacy.render(doc, style='ent', jupyter=True)

In [14]:
print('Sentences:')
for sent in doc.sents:
    print(sent.text, sent.start_char, sent.end_char, sent.label_)

Sentences:
This is one of the dumbest films, I've ever seen. 0 49 


In [15]:
displacy.render(doc, style='dep', jupyter=True)

In [None]:
displacy.serve(doc, style='dep', options={'compact': True})


    Serving on port 5000...
    Using the 'dep' visualizer



127.0.0.1 - - [11/Aug/2018 14:08:23] "GET / HTTP/1.1" 200 8316
127.0.0.1 - - [11/Aug/2018 14:08:23] "GET /favicon.ico HTTP/1.1" 200 8316


http://localhost:5000