In [1]:
import pandas as pd
import numpy as np

# The `spaCy` Quickstart

[back to index](index.ipynb)

This follows the guide provided by the [spaCy project](https://spacy.io/usage/#section-quickstart). And it looks like the good stuff starts at the [spaCy 101](https://spacy.io/usage/spacy-101) section.

## First steps

To get to this point I've done a `conda install` and a `python -m spacy download en`. No hiccups so far on Lubuntu 16.04.

In [2]:
import spacy
%time nlp = spacy.load('en')

Wall time: 2.25 s


In [3]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

### Lubuntu 16.04

Wow. First try, no hiccups.

### Windows...

Surprise, surprise: Windows had a couple of hiccups.

1. You need Visual Studio Tools C++ 14
1. You need to run Git Bash as Administrator

The error for the C++ 14 issue prints the URL you need for downloading the tools. Luckily they are free, and the installer is a wizard that takes care of everything. The second issue is more interesting: you have to run Git Bash as Administrator, and the current directory must be on your local drive. Otherwise you get this error:

    PermissionError: [WinError 5] Access is denied
    
But finally I was able to get it working.

In [4]:
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


## Visualizers?

In [5]:
from spacy import displacy

In [6]:
#displacy.serve(doc, style='dep')


    Serving on port 5000...
    Using the 'dep' visualizer


    Shutting down server on port 5000.



Wow. Nice. No really... It serves up the sentence so you can look at it in your browser. It's on port 5000, so I'm thinking it's a simple Flask app?

On Windows, it triggers a permission pop-up that you may or may not want to approve.

In [20]:
# Use a separate name so the `doc` from In [3] stays intact for the cells below.
ent_doc = nlp(u'John Smith had a heart attack, he was treated by doctor joe shmoe.')
displacy.render(ent_doc, style='ent', jupyter=True)

AND IT HAS NATIVE SUPPORT FOR JUPYTER.

## Entities

In [8]:
for ent in doc.ents:
    print(ent.text)

Apple
U.K.
$1 billion


## Word Vectors

In [9]:
tokens = nlp(u'dog cat banana')

In [10]:
for i in tokens:
    for j in tokens:
        print(i, i.similarity(j), j)
    print()

dog 1.0 dog
dog 0.53907 cat
dog 0.28761 banana

cat 0.53907 dog
cat 1.0 cat
cat 0.487522 banana

banana 0.28761 dog
banana 0.487522 cat
banana 1.0 banana



Not sure what that means. My numbers also differ from the ones in the tutorial. It turns out that the models installed by

    $ python -m spacy download en

are the small ones by default, so they don't perform as well. You can get bigger and better models with

    $ python -m spacy download en_core_web_lg

I'm downloading it now: 30% done and it's already at 240MB. It weighs in at 852MB. That's pretty big.

In [11]:
%time nlp = spacy.load('en_core_web_lg')

Wall time: 9.94 s


In [12]:
tokens = nlp(u'dog cat banana')

In [13]:
for i in tokens:
    for j in tokens:
        print(i, i.similarity(j), j)
    print()

dog 1.0 dog
dog 0.801685 cat
dog 0.243276 banana

cat 0.801685 dog
cat 1.0 cat
cat 0.281544 banana

banana 0.243276 dog
banana 0.281544 cat
banana 1.0 banana



And those numbers now do match up with the ones on the tutorial page.
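
As for what the similarity scores mean: as far as I know, spaCy's default `similarity` is the cosine similarity of the two word vectors, i.e. the dot product divided by the product of the two vector lengths. Here is a minimal pure-Python sketch of that calculation. The three short vectors are made-up toy values (the real `en_core_web_lg` vectors have 300 dimensions), and `cosine_similarity` is my own helper, not a spaCy function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only (not real spaCy vectors).
toy_vectors = {
    'dog': [0.9, 0.8, 0.1],
    'cat': [0.8, 0.9, 0.2],
    'banana': [0.1, 0.2, 0.9],
}

# Same nested loop as in the cells above, but on the toy vectors.
for name_i, vec_i in toy_vectors.items():
    for name_j, vec_j in toy_vectors.items():
        print(name_i, round(cosine_similarity(vec_i, vec_j), 3), name_j)
    print()
```

A score of 1.0 means the two vectors point in the same direction, which is why every word scores 1.0 against itself, and the score is symmetric, which is why `dog`/`cat` and `cat`/`dog` match.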

In [14]:
banana = nlp(u'banana')
banana.vector[:10]

array([ 0.20228   , -0.076618  ,  0.37031999,  0.032845  , -0.41957   ,
        0.072069  , -0.37476   ,  0.05746   , -0.012401  ,  0.52948999], dtype=float32)
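
Note that `banana` here is a `Doc`, not a single token. As far as I understand, a `Doc`'s vector defaults to the average of its token vectors, so for a one-word document it is just that word's vector. Below is a rough pure-Python sketch of the averaging, using invented 3-dimensional toy vectors and a hypothetical helper `average_vectors` (neither comes from spaCy).

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

# Toy token vectors, invented for illustration (not real spaCy output).
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]

# The "document" vector is the mean of its token vectors.
doc_vector = average_vectors(token_vectors)
print(doc_vector)

# With a single token, the average is the token vector itself.
print(average_vectors([[1.0, 2.0, 3.0]]))
```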