# Getting Started

### How to install
spaCy can be installed using either `pip` or `conda`:
* `pip install -U spacy`
* `conda install -c conda-forge spacy`

In [None]:
import spacy
spacy.__version__

# Loading a Vocabulary Model

After importing the library, load one of the available models. Models can be installed with spaCy's download command:

```bash
python -m spacy download <MODEL NAME>
```
A list of available models and their features can be found in the [Available models section of the documentation](https://spacy.io/usage/models#available).
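
Before loading a model, it can be handy to check whether it has been downloaded. The sketch below uses `spacy.util.is_package`, which reports whether a name maps to an installed pip package. The helper name `model_is_installed` and the printed messages are illustrative and not part of spaCy.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"


def model_is_installed(name):
    """Return True if `name` is an installed pip package (e.g. a spaCy model)."""
    # Hypothetical helper: a thin wrapper around spacy.util.is_package
    return is_package(name)


if model_is_installed(MODEL_NAME):
    nlp = spacy.load(MODEL_NAME)
    print("Loaded", MODEL_NAME)
else:
    print("Model not installed; run: python -m spacy download", MODEL_NAME)
```

This avoids an `OSError` from `spacy.load` when the model is missing and points to the download command instead.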

In [None]:
# Load vocabulary model
# 'en_core_web_sm' model:
# https://spacy.io/models/en#en_core_web_sm
nlp = spacy.load('en_core_web_sm')
type(nlp).__name__

We now have all the components we need to process text. The next step is to pass the text data to `nlp` and use the attributes and methods appropriate for the analysis we want to undertake.

# Exploring Features

## Basic things that spaCy can do
* Tokenization (word and sentence)
* Lemmatization
* Part-of-speech tagging
* Dependency parsing
* Named entity recognition

For the full list, see [this page](https://spacy.io/usage/spacy-101#features).

In [None]:
# Read the first line of the transcript file
with open('facebook_md_transcript.txt', 'r') as f:
    text = f.readline()
# Preview the first 500 characters
text[:500]

## Accessing features

Once a vocabulary model has been loaded, text processing is a matter of passing the text to the `Language` object bound to the `nlp` variable.

In [None]:
# create object of class Doc
# see: https://spacy.io/api/doc
doc = nlp(text)
type(doc).__name__

In [None]:
len(doc)

In [None]:
for token in doc:
    print('Token:', token,
          '|Lemma:', token.lemma_,
          '|P-O-S:', token.pos_,
          '|Dep. Parse:', token.dep_,
          '|Shape:', token.shape_,
          '|Stop Word:', token.is_stop,
          '\n----')

Some notes on features:
* Tokenization and lemmatization: the tokenizer splits on whitespace, but also handles contractions and punctuation
* Part-of-speech tagging: uses the language model to predict the POS tag of each token
* Dependency parsing: also uses the language model. Useful for resolving ambiguity in text (e.g. "Scientists study whales from space")
* Shape: characterizes the orthographic shape of a token, e.g. `Xxxxx` for a capitalized word or `dddd` for a four-digit number (use case?)
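
The tokenization and shape notes above can be checked without downloading a model. This is a minimal sketch that assumes `spacy.blank("en")`, which builds an English pipeline containing only the tokenizer; the example sentence is made up for illustration.

```python
import spacy

# Tokenizer-only English pipeline; no statistical model is required
nlp = spacy.blank("en")

# Made-up sentence with a contraction, a capitalized word, a number and punctuation
doc = nlp("I don't know when Apple opened in 1976.")

# Collect the token texts and a mapping from token text to its shape
token_texts = [token.text for token in doc]
shapes = {token.text: token.shape_ for token in doc}

for token in doc:
    print(token.text, "|", token.shape_, "|", token.is_punct)
```

The contraction is split into two tokens and the final period becomes its own token, even though neither is separated by whitespace. Shapes such as `dddd` are one way to spot year-like tokens without a regular expression.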

### Named Entity Recognition
To get named entities, access the `ents` attribute on the `Doc` object.

In [None]:
len(doc.ents)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
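
A common next step is to count how often each entity label occurs. The sketch below keeps itself runnable without a downloaded model by setting the entities by hand on a blank pipeline; with `en_core_web_sm` loaded, `doc.ents` would be filled in by the pipeline instead. The sentence, the hand-assigned labels, and the helper `count_entity_labels` are all illustrative assumptions.

```python
from collections import Counter

import spacy
from spacy.tokens import Span

# Tokenizer-only pipeline, so entities are assigned manually below
nlp = spacy.blank("en")
doc = nlp("Facebook opened an office in London")

# Hand-labelled spans (token offsets), standing in for model predictions
doc.ents = [
    Span(doc, 0, 1, label="ORG"),  # "Facebook"
    Span(doc, 5, 6, label="GPE"),  # "London"
]


def count_entity_labels(doc):
    """Count how many entities of each label a Doc contains (hypothetical helper)."""
    return Counter(ent.label_ for ent in doc.ents)


label_counts = count_entity_labels(doc)
for label, count in label_counts.most_common():
    print(label, count)
```

The same helper can be called on the `doc` created from the transcript to get a quick overview of which entity types dominate the text.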

# Other Features

The `Doc` object offers other features in addition to the ones demonstrated above. For a full list of features, see `dir(doc)`.

Some examples include sentence boundary detection, noun chunks, word vectors, and word similarity.

In [None]:
# Sentence boundary detection
for sent in doc.sents:
    print(sent)

In [None]:
# Noun chunks
for nc in doc.noun_chunks:
    print(nc)

In [None]:
# doc.vector is the average of the token vectors in the text
print(doc.vector.shape)

In [None]:
# Get vector of each token
for token in doc[0:2]:
    print(token.vector)

In [None]:
# Word vectors can be used to compute the L2 norm and
# the cosine similarity between words.
# Single-character tokens are skipped (see bug 3 below).
for t1 in doc[0:20]:
    for t2 in doc[0:20]:
        if (len(t1) > 1 and len(t2) > 1):
            print(t1, t2, t1.similarity(t2))
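
By default, `similarity` is a cosine similarity over vectors. To make that concrete, here is a plain-Python sketch of the computation; the function name `cosine_similarity` and the three small vectors are made up for illustration and are not taken from spaCy.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))  # L2 norm of u
    norm_v = math.sqrt(sum(b * b for b in v))  # L2 norm of v
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors: vec_b points in the same direction as vec_a, vec_c is orthogonal to it
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]
vec_c = [3.0, 0.0, -1.0]

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.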

# Bugs I Encountered
1. `is_stop` depends on capitalization
    * Example --> The: False, the: True
    * Workaround: lemmatize words first (using the `lemma_` attribute) before checking `is_stop`
    * Link to issue: https://github.com/explosion/spaCy/issues/1889
2. Multi-threading doesn't work (i.e. `n_threads` > 0 does not make a difference) when using `nlp.pipe`
    * Link to issue: https://github.com/explosion/spaCy/issues/2075
    * Note on multi-threading in spaCy: https://explosion.ai/blog/multithreading-with-cython
3. The `similarity` method raises a TypeError when a single-character string is encountered
    * See the `len(t1) > 1` check in the similarity cell above
    * Link to issue: https://github.com/explosion/spaCy/issues/2219
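
For the first bug, an alternative to lemmatizing is to lowercase the word and look it up in the English stop-word set directly. This is a minimal sketch, assuming the `STOP_WORDS` set in `spacy.lang.en.stop_words`; the helper name `is_stop_ignore_case` is hypothetical. Other spaCy releases may not show the capitalization issue at all, so treat it as a workaround for the version used in this notebook.

```python
from spacy.lang.en.stop_words import STOP_WORDS


def is_stop_ignore_case(word):
    """Case-insensitive stop-word check against spaCy's English stop-word set."""
    # Hypothetical helper: lowercases the word before the lookup
    return word.lower() in STOP_WORDS


for word in ["The", "the", "whales"]:
    print(word, is_stop_ignore_case(word))
```

With a processed `Doc`, the helper can be applied per token as `is_stop_ignore_case(token.text)`.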

# Summary

In summary, the only code you need (after installation) to get started with spaCy is as follows:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Text to process goes here")
```

`nlp("Text to process goes here")` creates the `Doc` object, which contains the tokens of the text. You then access the features of your text through the attributes of each individual `Token`. Additional features are also available on the `Doc` object itself. These can be explored by running `dir(doc)`.

See the documentation for even more [detailed and in-depth examples](https://spacy.io/usage/examples).

# Reference
1. [spaCy 101](https://spacy.io/usage/spacy-101)