# Spacy Basics

The two main NLP libraries we are going to use are **Spacy** and **NLTK**.

Main differences between the two libraries:
- NLTK was released in 2001 and implements many alternative algorithms and models for each task.
- Spacy was released in 2015 and typically ships a single, well-optimized implementation per task; it is often reported to be much faster than NLTK, in some benchmarks by more than 100x.

Installing Spacy can be tricky; see [Spacy Installation](https://spacy.io/usage). Note that the language models must be downloaded separately. I installed everything as follows:

```bash
conda install keras nltk
conda install -c conda-forge spacy
# Download the language models
python -m spacy download en_core_web_sm  # spacy.load('en_core_web_sm')
python -m spacy download es_core_news_sm # spacy.load('es_core_news_sm')
python -m spacy download de_core_news_sm # spacy.load('de_core_news_sm')
```
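To confirm that the installation worked, a quick check like the following can help. It is a minimal sketch that assumes the three model names from the commands above; `spacy.util.is_package` only reports whether a package with that name is installed, it does not load the model.

```python
import spacy
from spacy.util import is_package

print("spaCy version:", spacy.__version__)

# Model names taken from the download commands above
model_names = ["en_core_web_sm", "es_core_news_sm", "de_core_news_sm"]

# is_package returns True if a pip-installed package with that name exists
installed = {name: is_package(name) for name in model_names}
for name, ok in installed.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Any model reported as missing can be fetched with the corresponding `python -m spacy download` command.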

Both libraries are used to perform **Natural Language Processing**, which consists of parsing and structuring raw text so that it can be processed by a computer.

Overview of contents:

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided through a download link; I haven't found a repository to fork from.*

## 1. Model, Doc, Pipeline

```python
import spacy

# Load our English model
nlp = spacy.load('en_core_web_sm')
```

```python
# Create a Doc object:
# the nlp model processes the text
# and stores the structured result in the Doc object.
# The u prefix marks a Unicode string; in Python 3 every str is
# already Unicode, so the prefix is optional
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
```
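A `Doc` behaves like a sequence of tokens: it has a length, it can be indexed to get a `Token`, and it can be sliced to get a `Span`. The sketch below illustrates this. As an assumption made only so that it runs without a downloaded model, it uses a blank English pipeline (`spacy.blank("en")`, tokenizer only) and the hypothetical names `nlp_blank` and `doc_blank`; the same indexing works on the `doc` created above.

```python
import spacy

# Assumption: blank English pipeline (tokenizer only), no model download needed
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

# Number of tokens in the Doc
print(len(doc_blank))

# Indexing returns a Token, slicing returns a Span
first = doc_blank[0]
amount = doc_blank[-3:]
print(first.text)
print(amount.text)
print(type(doc_blank).__name__, type(first).__name__, type(amount).__name__)
```

A blank pipeline does not assign part-of-speech tags or dependencies; those come from the trained components of a model such as `en_core_web_sm`.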

```python
# Print each token separately
# Tokens are the basic units of the text: words, punctuation, symbols
# Note that spacy already does a lot of identification work:
# $ is a symbol, U.S. is kept as a single token, etc.
for token in doc:
    # token.text: raw text
    # token.pos_: part of speech: proper noun, verb, ...
    # token.dep_: syntactic dependency
    print(token.text, token.pos_, token.dep_)
```

```
Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN dobj
startup VERB dep
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj
```
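The tag and dependency labels in this output are abbreviations. `spacy.explain` returns a short description for labels in spaCy's glossary and needs no loaded model. The following sketch looks up some of the labels printed above; the `labels` list is just an illustrative selection.

```python
import spacy

# A selection of the part-of-speech and dependency labels printed above
labels = ["PROPN", "AUX", "ADP", "SYM", "nsubj", "pcomp", "dobj", "quantmod", "pobj"]

# spacy.explain returns a description string, or None for unknown labels
explanations = {label: spacy.explain(label) for label in labels}
for label, text in explanations.items():
    print(f"{label:>8}: {text}")
```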


```python
# The Doc object contains the processed text
# To see how it is processed, we can show the pipeline used
nlp.pipeline
```

```
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fbd3c548ec0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fbd3c548bb0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fbd18bc3050>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fbd18b55dc0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fbd18b6dd20>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fbd18b31a50>)]
```

```python
# We can get just the names of the steps in the pipeline
nlp.pipe_names
```

```
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
```

For a more detailed explanation of pipelines and their steps, see: [Spacy Pipelines](https://spacy.io/usage/spacy-101#pipelines)

![Spacy Pipeline](../pics/spacy_pipeline.png)
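To see that a pipeline is just an ordered list of components, it can help to build one by hand. The sketch below is not part of the course material. It assumes the spaCy v3 API (where `add_pipe` takes a component name as a string) and starts from a blank English pipeline, to which it adds the built-in rule-based `sentencizer`; `nlp_custom` and `doc_custom` are hypothetical names.

```python
import spacy

# A blank pipeline has a tokenizer but no processing components
nlp_custom = spacy.blank("en")
print(nlp_custom.pipe_names)

# Add the built-in rule-based sentence splitter (spaCy v3 style)
nlp_custom.add_pipe("sentencizer")
print(nlp_custom.pipe_names)

# The new component now runs on every text passed to the pipeline
doc_custom = nlp_custom("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc_custom.sents]
for sentence in sentences:
    print(sentence)
```

A pretrained model such as `en_core_web_sm` follows the same principle, only with more components (tagger, parser, NER, and so on) already registered.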

## 2. Tokenization