In [1]:
import pandas as pd
import numpy as np

# The `spaCy` Quickstart

[back to index](index.ipynb)

This follows the guide provided by the [spaCy project](https://spacy.io/usage/#section-quickstart). It looks like the good stuff starts in the [spaCy 101](https://spacy.io/usage/spacy-101) section.

## First steps

To get to this point I've done a `conda install` and a `python -m spacy download en`. No hiccups so far on Lubuntu 16.04.

In [2]:
import spacy
%time nlp = spacy.load('en')

In [3]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

Wow. First try, no hiccups.

In [4]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


## Visualizers?

In [5]:
from spacy import displacy

In [8]:
displacy.serve(doc, style='dep')


    Serving on port 5000...
    Using the 'dep' visualizer


    Shutting down server on port 5000.



Wow. Nice. No really... It serves up the parsed sentence so you can look at it in your browser. It's on port 5000, so I'm guessing it's a simple Flask app?

In [10]:
displacy.render(doc, style='ent', jupyter=True)

AND IT HAS NATIVE SUPPORT FOR JUPYTER.

## Entities

In [11]:
for ent in doc.ents:
    print(ent.text)

Apple
U.K.
$1 billion


## Word Vectors

In [29]:
tokens = nlp(u'dog cat banana')

In [30]:
for i in tokens:
    for j in tokens:
        print(i, i.similarity(j), j)
    print()

dog 1.0 dog
dog 0.53907 cat
dog 0.28761 banana
cat 0.53907 dog
cat 1.0 cat
cat 0.487522 banana
banana 0.28761 dog
banana 0.487522 cat
banana 1.0 banana


Not sure what that means, and my numbers are different from the ones in the tutorial. It turns out that the model you get from

    $ python -m spacy download en

is the small one by default, so it doesn't perform as well. You can get a bigger and better model with

    $ python -m spacy download en_core_web_lg

I'm downloading it now. 30% done and it's already at 240MB. It weighs in at 852MB. That's pretty big.

In [31]:
%time nlp = spacy.load('en_core_web_lg')

In [32]:
tokens = nlp(u'dog cat banana')

In [34]:
for i in tokens:
    for j in tokens:
        print(i, i.similarity(j), j)
    print()

dog 1.0 dog
dog 0.801686 cat
dog 0.243276 banana

cat 0.801686 dog
cat 1.0 cat
cat 0.281544 banana

banana 0.243276 dog
banana 0.281544 cat
banana 1.0 banana



And those numbers now do match up with the ones on the tutorial page.
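
As for what the scores mean: as far as I can tell from the spaCy docs, `similarity` is the cosine similarity between the two word vectors, i.e. the dot product divided by the product of the two vector lengths. Here's a minimal sketch in plain Python. The 3-dimensional vectors are made up for illustration (they are not real spaCy vectors), and `cosine_similarity` is my own helper name, not a spaCy function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only (not real spaCy vectors).
toy = {
    "dog": [1.0, 0.9, 0.1],
    "cat": [0.9, 1.0, 0.2],
    "banana": [0.1, 0.2, 1.0],
}

# Same nested loop as in the cells above, but over the toy vectors.
for w1, v1 in toy.items():
    for w2, v2 in toy.items():
        print(w1, round(cosine_similarity(v1, v2), 3), w2)
    print()
```

A word compared with itself always scores 1.0, and the score is symmetric, which matches the pattern in the output above. Vectors pointing in similar directions score close to 1, unrelated ones closer to 0.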

In [35]:
banana = nlp(u'banana')
banana.vector

array([  2.02280000e-01,  -7.66180009e-02,   3.70319992e-01,
         3.28450017e-02,  -4.19569999e-01,   7.20689967e-02,
        -3.74760002e-01,   5.74599989e-02,  -1.24009997e-02,
         5.29489994e-01,  -5.23800015e-01,  -1.97710007e-01,
        -3.41470003e-01,   5.33169985e-01,  -2.53309999e-02,
         1.73800007e-01,   1.67720005e-01,   8.39839995e-01,
         5.51070012e-02,   1.05470002e-01,   3.78719985e-01,
         2.42750004e-01,   1.47449998e-02,   5.59509993e-01,
         1.25210002e-01,  -6.75960004e-01,   3.58420014e-01,
        -4.00279984e-02,   9.59490016e-02,  -5.06900012e-01,
        -8.53179991e-02,   1.79800004e-01,   3.38669986e-01,
         1.32300004e-01,   3.10209990e-01,   2.18779996e-01,
         1.68530002e-01,   1.98740005e-01,  -5.73849976e-01,
        -1.06490001e-01,   2.66689986e-01,   1.28380001e-01,
        -1.28030002e-01,  -1.32839993e-01,   1.26570001e-01,
         8.67229998e-01,   9.67210010e-02,   4.83060002e-01,
         2.12709993e-01,