# Getting Started

### How to install
spaCy can be installed using either `pip` or `conda`:
* `pip install -U spacy`
* `conda install -c conda-forge spacy`

In [13]:
import spacy
spacy.__version__

'2.0.11'

# Loading a Vocabulary Model

After importing the library, load one of the available models. Models can be installed with spaCy's download command:

```shell
python -m spacy download <MODEL NAME>
```
A list of available models and their features can be found in the [Available models section of the documentation](https://spacy.io/usage/models#available).

In [14]:
# Load the en_core_web_sm model:
# https://spacy.io/models/en#en_core_web_sm
nlp = spacy.load('en_core_web_sm')
type(nlp).__name__

'English'

We now have everything we need to process text. The next step is to pass the text to `nlp` and use the attributes and methods appropriate for the analysis we want to perform.

# Exploring Features

## Basic things that spaCy can do
* Tokenization (word and sentence)
* Lemmatization
* Part-of-speech tagging
* Dependency parsing
* Named entity recognition

For the full list, see [this page](https://spacy.io/usage/spacy-101#features).

In [15]:
with open('facebook_md_transcript.txt', 'r') as f:
    text = f.readlines()[0]
text[:500]

"Good morning. Welcome, thanks for joining us for Facebook’s 2018 Annual Meeting of Stockholders. I’m Dave Kling, Deputy General Counsel and Corporate Secretary of Facebook and Chairman of this Annual Meeting, which I now call to order. Let me run through today's agenda. First, we’ll focus on the formal business set forth in the proxy statement. Then, Mark will spend a few minutes talking about how Facebook has been doing and finally, we’ll conclude with the Q&A session.Before we begin with the f"

## Accessing features

Once a vocabulary model has been loaded, text processing is a matter of passing the text into the Language object in the `nlp` variable.

In [16]:
# create object of class Doc
# see: https://spacy.io/api/doc
doc = nlp(text)
type(doc).__name__

'Doc'

In [17]:
len(doc)

7047

In [18]:
for token in doc[0:100]:
    print('Token:', token,
          '|Lemma:', token.lemma_,
          '|P-O-S:', token.pos_,
          '|Dep. Parse:', token.dep_,
          '|Shape:', token.shape_,
          '|Stop Word:', token.is_stop,
          '|Alphanumeric:', token.is_alpha,  # note: is_alpha is True only for purely alphabetic tokens
          '\n---')

Token: Good |Lemma: good |P-O-S: ADJ |Dep. Parse: amod |Shape: Xxxx |Stop Word: False |Alphanumeric: True 
---
Token: morning |Lemma: morning |P-O-S: NOUN |Dep. Parse: ROOT |Shape: xxxx |Stop Word: False |Alphanumeric: True 
---
Token: . |Lemma: . |P-O-S: PUNCT |Dep. Parse: punct |Shape: . |Stop Word: False |Alphanumeric: False 
---
Token: Welcome |Lemma: welcome |P-O-S: INTJ |Dep. Parse: ROOT |Shape: Xxxxx |Stop Word: False |Alphanumeric: True 
---
Token: , |Lemma: , |P-O-S: PUNCT |Dep. Parse: punct |Shape: , |Stop Word: False |Alphanumeric: False 
---
Token: thanks |Lemma: thank |P-O-S: NOUN |Dep. Parse: npadvmod |Shape: xxxx |Stop Word: False |Alphanumeric: True 
---
Token: for |Lemma: for |P-O-S: ADP |Dep. Parse: prep |Shape: xxx |Stop Word: True |Alphanumeric: True 
---
Token: joining |Lemma: join |P-O-S: VERB |Dep. Parse: pcomp |Shape: xxxx |Stop Word: False |Alphanumeric: True 
---
Token: us |Lemma: -PRON- |P-O-S: PRON |Dep. Parse: dobj |Shape: xx |Stop Word: True |Alphanumeric:

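Some notes on features:
* Tokenization and lemmatization: the tokenizer splits on whitespace, but also handles contractions and punctuation (e.g. `I’m` becomes `I` and `’m`); lemmatization reduces each token to its base form (e.g. `joining` becomes `join`)
* Part-of-speech tagging: uses the loaded model to predict the part of speech of each token
* Dependency parsing: also uses the loaded model. Useful for resolving ambiguity in text (e.g. "scientists study whales from space")
* Shape: an abstract pattern of the token's characters, e.g. `Xxxx` for "Good" (use case?)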
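
To make the shape feature more concrete, here is a simplified, pure-Python sketch of how such a pattern could be derived. It is not spaCy's implementation: the function name `word_shape` and the `max_run` parameter are made up for illustration, and capping repeated symbols at four is an assumption inferred from the output above (e.g. `Welcome` has shape `Xxxxx`).

```python
def word_shape(text, max_run=4):
    """Map a token to a coarse character pattern (illustrative sketch only)."""
    shape = []
    last_symbol = None
    run_length = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and other characters are kept as-is
        # Count how many times the same symbol has repeated in a row
        run_length = run_length + 1 if symbol == last_symbol else 1
        last_symbol = symbol
        # Assumption: runs longer than max_run are truncated
        if run_length <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["Good", "morning", "Welcome", "2018"]:
    print(word, "->", word_shape(word))
```

One plausible use case: a pattern like this lets downstream code treat tokens it has never seen before alike, for example grouping all four-digit tokens that might be years.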

### Named Entity Recognition
To get the named entities, access the `ents` attribute of the `Doc` object.

In [19]:
len(doc.ents)

379

In [20]:
for ent in doc.ents:
    print("Entity:", ent.text, 
          "|Start index", ent.start_char, 
          "|End index:", ent.end_char, 
          "|Label:", ent.label_)

Entity: morning |Start index 5 |End index: 12 |Label: TIME
Entity: Facebook |Start index 49 |End index: 57 |Label: ORG
Entity: 2018 |Start index 60 |End index: 64 |Label: DATE
Entity: Annual Meeting of Stockholders |Start index 65 |End index: 95 |Label: WORK_OF_ART
Entity: Dave Kling |Start index 101 |End index: 111 |Label: PERSON
Entity: Counsel |Start index 128 |End index: 135 |Label: PERSON
Entity: Corporate |Start index 140 |End index: 149 |Label: GPE
Entity: Facebook |Start index 163 |End index: 171 |Label: ORG
Entity: this Annual Meeting |Start index 188 |End index: 207 |Label: ORG
Entity: today |Start index 255 |End index: 260 |Label: DATE
Entity: First |Start index 271 |End index: 276 |Label: ORDINAL
Entity: Mark |Start index 353 |End index: 357 |Label: PERSON
Entity: a few minutes |Start index 369 |End index: 382 |Label: TIME
Entity: Facebook |Start index 401 |End index: 409 |Label: ORG
Entity: Q&A |Start index 462 |End index: 465 |Label: LAW
Entity: Board of Directors |Start 

Entity: Gary Coleman |Start index 19611 |End index: 19623 |Label: PERSON
Entity: the AFL-CIO Reserve Fund |Start index 19655 |End index: 19679 |Label: ORG
Entity: CActually |Start index 19685 |End index: 19694 |Label: PERSON
Entity: Richard |Start index 19701 |End index: 19708 |Label: PERSON
Entity: eight |Start index 19754 |End index: 19759 |Label: CARDINAL
Entity: the AFL-CIO |Start index 19773 |End index: 19784 |Label: ORG
Entity: Board |Start index 19814 |End index: 19819 |Label: ORG
Entity: Facebook |Start index 19996 |End index: 20004 |Label: ORG
Entity: Facebook |Start index 20098 |End index: 20106 |Label: ORG
Entity: Company |Start index 20135 |End index: 20142 |Label: ORG
Entity: Company |Start index 20289 |End index: 20296 |Label: ORG
Entity: the Company |Start index 20458 |End index: 20469 |Label: ORG
Entity: Company |Start index 20475 |End index: 20482 |Label: ORG
Entity: UK |Start index 20547 |End index: 20549 |Label: GPE
Entity: U.S. |Start index 20558 |End index: 20562 |
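
The start and end indices in the output above appear to be character offsets into the original text, with the end index exclusive. The sketch below checks this in plain Python, without spaCy, using the opening of the transcript and a few offsets and labels copied from the output. The variable names are made up for illustration.

```python
from collections import Counter

# Opening of the transcript, copied from the text shown earlier
text = (
    "Good morning. Welcome, thanks for joining us for Facebook’s 2018 "
    "Annual Meeting of Stockholders. I’m Dave Kling, Deputy General "
    "Counsel and Corporate Secretary of Facebook"
)

# (start_char, end_char, label) triples copied from the entity output above
entities = [
    (5, 12, "TIME"),
    (49, 57, "ORG"),
    (60, 64, "DATE"),
    (65, 95, "WORK_OF_ART"),
    (101, 111, "PERSON"),
    (163, 171, "ORG"),
]

# Slicing the raw text with the offsets recovers each entity's text
spans = [text[start:end] for start, end, _ in entities]
print(spans)

# Tally how often each label occurs
label_counts = Counter(label for _, _, label in entities)
print(label_counts.most_common())
```

With a real `Doc`, the same tally could be built from `ent.label_` for each `ent` in `doc.ents`. A quick count like this is a handy way to spot questionable labels, such as `Q&A` being tagged as `LAW` above.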

# Other Features

The `Doc` object offers other features in addition to the ones demonstrated above. For a full list of features, see `dir(doc)`.

Examples include sentence boundary detection, noun chunks, word vectors, and word similarity.

In [21]:
# Sentence boundary detection
for sent in doc.sents:
    print(sent)

Good morning.
Welcome, thanks for joining us for Facebook’s 2018 Annual Meeting of Stockholders.
I’m Dave Kling, Deputy General Counsel and Corporate Secretary of Facebook and Chairman of this Annual Meeting, which I now call to order.
Let me run through today's agenda.
First, we’ll focus on the formal business set forth in the proxy statement.
Then, Mark will spend a few minutes talking about how Facebook has been doing and finally, we’ll conclude with the Q&A session.
Before we begin with the formal business, I’d like to remind you to please turn off all mobile devices, and of the rules of conduct that were provided to you when you entered the meeting, including the rule prohibiting photos and recording of any kind.
Thanks for your cooperation.
Let me start by introducing the members of our Board of Directors who are here today; Mark Zuckerberg, Sheryl Sandberg, Erskine Bowles, Ken Chenault and Sue Desmond-Hellmann.
I would also like to introduce two other people who are in attendanc

In [22]:
# Noun chunks
for nc in doc.noun_chunks:
    print(nc)

Good morning
us
Facebook’s 2018 Annual Meeting
Stockholders
I
Dave Kling
Deputy General Counsel
Corporate Secretary
Facebook
Chairman
this Annual Meeting
I
order
me
today's agenda
we
the formal business
the proxy statement
Mark
a few minutes
Facebook
we
the Q&A session
we
the formal business
I
you
all mobile devices
the rules
conduct
you
you
the meeting
the rule
photos
recording
any kind
Thanks
your cooperation
me
the members
our Board
Directors
who
I
two other people
who
attendance
Alex Bender
Ernst
Young LLP
our independent registered public accounting firm
Kris Veaco
who
the Inspector
Election
this meeting
the results
the voting
Ms. Veaco
the oath
Inspector
Election
the formal business
I
a copy
the notice
the declaration
the mailing
the proxy statement
all Facebook stockholders
record
April
I
a list
stockholders
this meeting
inspection
any stockholder
any proxyholder
who
a stockholder
which list
records
this meeting
I
the Inspector
Election
the holders
shares
at least the majority
t

In [23]:
# doc.vector returns the average of the token vectors in the text
print(doc.vector.shape)

(384,)
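
The shape is `(384,)` even though the document has thousands of tokens, because averaging is done element-wise and keeps the dimensionality of a single token vector. A minimal pure-Python sketch of that averaging, using invented 3-dimensional toy vectors and a hypothetical helper `mean_vector` (not part of spaCy):

```python
# Toy 3-dimensional "token vectors"; the numbers are invented for illustration
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / count for component in zip(*vectors)]


doc_vector = mean_vector(token_vectors)
print(doc_vector)       # [2.0, 3.0, 4.0]
print(len(doc_vector))  # 3, the same dimensionality as each token vector
```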


In [24]:
# Get vector of each token
for token in doc[0:2]:
    print(token.vector)

[-1.01393139 -1.40702009 -2.64994955  2.09281063 -1.97974396  1.62159455
  1.4604938   1.26499486 -1.71690214 -0.38691491 -0.16623339 -0.57863778
  0.16432497 -3.3152163  -1.84567595 -0.0419752  -1.46858501 -2.71334958
  0.5873363   2.03711629  1.20276451 -0.93468851  0.33614558  1.98867559
 -1.73306501 -1.80320525 -3.26318407  0.65803003  1.05705953  2.32139993
  2.80969214 -1.53222191  0.9180609   0.50728929 -0.67875248  4.46587181
  2.92354059  0.30113083 -0.34886426 -1.92665195 -3.7593081   1.07089591
  0.16429344 -0.53713918 -0.17663306 -0.52894181 -2.07969904  5.14629459
 -2.19922924 -3.10876894  0.69134474 -1.29334843 -1.55373967  2.84668589
 -2.28205085  0.94399607  4.30578947 -0.58578634 -2.3346312  -2.00597453
  0.47395083  4.9656105  -0.78003526  0.40823543  2.06077528 -1.84092426
 -0.62683618  1.5516696  -2.17767739  1.7413547  -1.09436584 -3.12682962
 -0.03999323 -1.67738426  0.05101991 -0.90884304  2.32613373 -2.54049635
  0.95585465 -1.76753855 -0.12845905 -0.21103936 -2

In [25]:
# Word vectors can be used to calculate the L2 norm and
# the cosine similarity between words
for t1 in doc[0:20]:
    for t2 in doc[0:20]:
        # skip single-character tokens, which make `similarity` raise a TypeError
        if len(t1) > 1 and len(t2) > 1:
            print(t1, t2, t1.similarity(t2))

Good Good 1.0
Good morning 0.14695
Good Welcome 0.309286
Good thanks 0.0573976
Good for 0.112577
Good joining 0.155585
Good us 0.146946
Good for 0.0986884
Good Facebook 0.216687
Good ’s -0.110767
Good 2018 0.161359
Good Annual 0.470256
Good Meeting 0.0324741
Good of -0.0176348
Good Stockholders 0.105944
Good ’m -0.0342152
morning Good 0.14695
morning morning 1.0
morning Welcome 0.377752
morning thanks 0.131367
morning for 0.0800371
morning joining 0.251703
morning us 0.113255
morning for 0.00541531
morning Facebook 0.3568
morning ’s 0.171865
morning 2018 0.0325751
morning Annual 0.0621184
morning Meeting 0.427617
morning of 0.080633
morning Stockholders 0.108235
morning ’m 0.0974803
Welcome Good 0.309286
Welcome morning 0.377752
Welcome Welcome 1.0
Welcome thanks 0.332843
Welcome for 0.0823989
Welcome joining -0.0283853
Welcome us 0.0520562
Welcome for -0.0498189
Welcome Facebook 0.229774
Welcome ’s -0.0155238
Welcome 2018 -0.0467492
Welcome Annual 0.135207
Welcome Meeting 0.0431619
We
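
For reference, the L2 norm and cosine similarity mentioned in the cell above can be written out in a few lines of plain Python. This is a sketch of the underlying arithmetic, not spaCy's code. The function names and toy vectors are made up for illustration.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Invented toy vectors
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction)  # close to 1.0
print(orthogonal)      # 0.0
print(opposite)        # -1.0
```

The score depends only on the angle between the two vectors, not on their lengths, which explains why every token scores 1.0 against itself in the output above and why some pairs come out slightly negative.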

# Tips and Bugs I Encountered
1. `is_stop` depends on capitalization
    * Example --> The: False, the: True
    * Workaround: lemmatize words (using the `lemma_` attribute) before checking `is_stop`
    * Link to issue: https://github.com/explosion/spaCy/issues/1889
2. Multi-threading doesn't work (i.e. `n_threads > 0` does not make a difference) when using `nlp.pipe`
    * Link to issue: https://github.com/explosion/spaCy/issues/2075
    * Note on multi-threading in spaCy: https://explosion.ai/blog/multithreading-with-cython
3. The `similarity` method raises a TypeError when a single-character token is encountered
    * See the `len` check in the similarity cell, above
    * Link to issue: https://github.com/explosion/spaCy/issues/2219
4. Don't need P-O-S tagging, dependency parsing, or named entity recognition? You can turn them off.
    * Disabling and modifying pipeline components: https://spacy.io/usage/processing-pipelines#disabling
    * Link to issue: https://github.com/explosion/spaCy/issues/1837
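
As an illustration of tip 1, the sketch below shows why a case-sensitive lookup misses "The", and how normalizing case before the lookup avoids the problem. It uses a made-up miniature stop-word set and a hypothetical helper, `is_stop_case_insensitive`, rather than spaCy itself. Lowercasing is an assumed alternative to the lemmatization workaround mentioned above.

```python
# Hypothetical miniature stop-word list, for illustration only
STOP_WORDS = {"the", "for", "us", "of"}


def is_stop_case_insensitive(word, stop_words=STOP_WORDS):
    """Check membership after lowercasing, so 'The' and 'the' behave alike."""
    return word.lower() in stop_words


print("The" in STOP_WORDS)                   # False: the lookup is case-sensitive
print(is_stop_case_insensitive("The"))       # True
print(is_stop_case_insensitive("Facebook"))  # False
```

With spaCy tokens, the lowercased form is available as `token.lower_`, which would be a natural input for a check like this.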

# Summary

In summary, the only code you need (after installation) to get started with spaCy is as follows:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Text to process goes here")
```

`nlp("Text to process goes here")` creates the `Doc` object, which contains the tokens of the text. You then access the features of your text through the attributes of each individual `Token`. Additional features are available on the `Doc` object itself. These can be explored by running `dir(doc)`.

Also, the model need not be `en_core_web_sm`. You can choose from [this list](https://spacy.io/usage/models#available).

See the documentation for more [detailed and in-depth examples](https://spacy.io/usage/examples).