# Getting Started with spaCy

spaCy is my go-to library for Natural Language Processing (NLP) tasks. I’d venture to say that’s the case for the majority of NLP experts out there!

## spaCy’s Statistical Models

These models are the engine behind spaCy: they enable it to perform NLP tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

Below are spaCy's core English statistical models along with their specifications (sizes are approximate and vary between releases):

- `en_core_web_sm`: English multi-task CNN trained on OntoNotes. Size – 11 MB
- `en_core_web_md`: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 91 MB
- `en_core_web_lg`: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 789 MB

Loading these models is super easy. Once a model has been installed (for example with `python -m spacy download en_core_web_sm`), we can load it by calling `spacy.load("model_name")`.
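
If you are not sure whether a model is installed, a minimal sketch like the one below can check before loading. The fallback to a blank English pipeline is my own assumption for illustration, not something spaCy does for you, and a blank pipeline only tokenizes.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"

if is_package(MODEL_NAME):
    # The model package is installed, so load the full trained pipeline.
    nlp = spacy.load(MODEL_NAME)
else:
    # Illustrative fallback (an assumption of this sketch): a blank English
    # pipeline has a tokenizer but no tagger, parser, or entity recognizer.
    nlp = spacy.blank("en")

print(nlp.lang, nlp.pipe_names)
```

With the fallback, later calls that rely on tags, dependencies, or entities will return empty annotations, so install the model before running the rest of the notebook.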


## spaCy’s Processing Pipeline
The first step when working with spaCy is to pass a text string to an NLP object (conventionally named `nlp`). This object is essentially a pipeline of text-processing operations that the input string passes through.

![spaCy processing pipeline](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/spacy_pipeline.png)

As you can see in the figure above, the NLP pipeline has multiple components, such as the tokenizer, tagger, parser, and named entity recognizer (ner). The input text string passes through each of these components before we can work with it.
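
To see which components a pipeline actually contains, we can inspect `nlp.pipe_names`. The sketch below assumes spaCy v3, where `add_pipe` takes a string name, and uses a blank English pipeline plus the rule-based `sentencizer` so that it runs without downloading a model.

```python
import spacy

# A blank pipeline has only the tokenizer, which is not listed as a component.
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Add a rule-based sentence splitter (string-name API, assuming spaCy v3).
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# The text now passes through the tokenizer and then the sentencizer.
doc = nlp("spaCy splits text into tokens. Components then annotate the Doc.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Running `nlp.pipe_names` on a loaded model such as `en_core_web_sm` lists its trained components in the same way.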






# spaCy in Action - Let's import

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

# Linguistic annotations

spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.



In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, "-->", token.pos_, "-->", token.dep_)

Apple --> PROPN --> nsubj
is --> AUX --> aux
looking --> VERB --> ROOT
at --> ADP --> prep
buying --> VERB --> pcomp
U.K. --> PROPN --> dobj
startup --> NOUN --> dep
for --> ADP --> prep
$ --> SYM --> quantmod
1 --> NUM --> compound
billion --> NUM --> pobj


# Tokenization

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
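
Each token also carries simple lexical attributes that come from the tokenizer and vocabulary rather than from a trained model. As a rough sketch, the example below uses a blank English pipeline (an assumption made so it runs without a model download) to pick out the currency symbol and the number-like tokens.

```python
import spacy

# The tokenizer alone is enough for lexical attributes.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # Index, text, and a few boolean lexical flags for each token.
    print(token.i, token.text, token.is_alpha, token.is_currency, token.like_num)

currency_tokens = [token.text for token in doc if token.is_currency]
number_like = [token.text for token in doc if token.like_num]
print(currency_tokens, number_like)
```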


# Part-of-speech tags and dependencies
Part-of-speech (POS) tagging is the task of automatically assigning a POS tag to every word in a sentence. It is helpful in various downstream NLP tasks, such as feature engineering, language understanding, and information extraction.

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_, "(", spacy.explain(token.pos_), ")")

Apple --> PROPN ( proper noun )
is --> AUX ( auxiliary )
looking --> VERB ( verb )
at --> ADP ( adposition )
buying --> VERB ( verb )
U.K. --> PROPN ( proper noun )
startup --> NOUN ( noun )
for --> ADP ( adposition )
$ --> SYM ( symbol )
1 --> NUM ( numeral )
billion --> NUM ( numeral )

Each token's dependency label, which describes its syntactic relation to its head word, is available as `token.dep_`:


In [None]:
for token in doc:
    print(token.text, "-->", token.dep_)

Apple --> nsubj
is --> aux
looking --> ROOT
at --> prep
buying --> pcomp
U.K. --> dobj
startup --> dep
for --> prep
$ --> quantmod
1 --> compound
billion --> pobj
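
The dependency labels above are quite terse. The same `spacy.explain` helper we used for POS tags also works for dependency labels, so a small sketch like this can build a quick glossary. The list of labels is simply copied from the output above.

```python
import spacy

# Dependency labels taken from the output above.
labels = ["nsubj", "aux", "prep", "pcomp", "dobj", "pobj", "quantmod", "compound"]

# Map each label to its human-readable description.
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(label, "-->", description)
```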


In [None]:
displacy.serve(doc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
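
`displacy.serve` starts a web server and blocks until it is stopped. If you only want the markup, `displacy.render` returns it as a string. The sketch below uses displaCy's manual input format with a hand-written parse of the first three tokens, so it runs without a model. The labels follow the output above, and the arcs assume that "looking" (the ROOT) is the head of both other tokens.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual format (illustrative data only).
parsed = {
    "words": [
        {"text": "Apple", "tag": "PROPN"},
        {"text": "is", "tag": "AUX"},
        {"text": "looking", "tag": "VERB"},
    ],
    "arcs": [
        # "dir" says which end of the arc carries the arrow head.
        {"start": 0, "end": 2, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "aux", "dir": "left"},
    ],
}

# jupyter=False makes render() return the SVG markup instead of displaying it.
svg = displacy.render(parsed, style="dep", manual=True, jupyter=False)
print(len(svg))
```

With a parsed `doc` inside a notebook, `displacy.render(doc, style="dep", jupyter=True)` displays the same kind of diagram inline.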


# Named Entities
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, "-->", ent.label_)

Apple --> ORG
U.K. --> GPE
$1 billion --> MONEY
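
One simple form of tuning is to add rule-based patterns for entities you already know about. This is a minimal sketch, assuming spaCy v3 and its built-in `entity_ruler` component. The two patterns are my own illustrative choices, and the pipeline is blank so that only the rules assign entities.

```python
import spacy

# Blank pipeline: no statistical NER, only the rules added below.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative exact-match patterns (assumptions for this example).
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "U.K."},
])

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The same component can also be added to a trained pipeline to combine rules with the model's predictions.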


In [None]:

text = "Apple is looking at buying U.K. startup for $1 billion, this is happening today"
doc = nlp(text)
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
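
The entity visualizer can also return HTML without starting a server. The sketch below feeds `displacy.render` the entities printed earlier for the shorter sentence, using displaCy's manual format. The character offsets are computed with `str.find`, which is an assumption that works here because each phrase occurs only once.

```python
from spacy import displacy

text = "Apple is looking at buying U.K. startup for $1 billion"

# Entity texts and labels copied from the earlier output.
labelled = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]

ents = []
for phrase, label in labelled:
    # Character offsets of the phrase within the text.
    start = text.find(phrase)
    ents.append({"start": start, "end": start + len(phrase), "label": label})

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(
    {"text": text, "ents": ents, "title": None},
    style="ent",
    manual=True,
    jupyter=False,
)
print(len(html))
```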
