# Chapter 1: Introduction to spaCy
In this lesson, we'll take a look at the most important concepts of spaCy and how to get started.

### The nlp object

In [2]:
from spacy.lang.en import English

# Create the nlp object
# - contains the processing pipeline
# - includes language-specific rules for tokenization etc.
nlp = English()

In [3]:
nlp

<spacy.lang.en.English at 0x7f8218b5d6d8>

At the center of spaCy is the object containing the processing pipeline. We usually call this variable `nlp`.

For example, to create an English nlp object, you can import the English language class from `spacy.lang.en` and instantiate it. You can use the `nlp` object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in `spacy.lang`.
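As a small sketch of these language-specific rules, the snippet below creates blank English and German pipelines and tokenizes a short text with each. It assumes spaCy is installed and uses `spacy.blank`, which builds the language class from a language code without loading a trained model. The example sentences are made up for illustration.

```python
import spacy

# spacy.blank("en") returns an instance of the same English class used above
nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# English rules split the contraction "don't" into "do" and "n't"
tokens_en = [token.text for token in nlp_en("I don't know.")]
tokens_de = [token.text for token in nlp_de("Hallo Welt!")]

print(tokens_en)
print(tokens_de)
```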

### The Doc object

In [5]:
doc = nlp("Hello world!")

#Iterate over tokens in a doc
for token in doc:
    print(token.text)

Hello
world
!


When you process a text with the nlp object, spaCy creates a `Doc` object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The `Doc` behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. But more on that later!
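A minimal sketch of the sequence behavior, and of the claim that no information is lost. It uses only the tokenizer, so no trained model is needed. The variable name `rebuilt` is just for illustration.

```python
from spacy.lang.en import English

nlp = English()
text = "Hello world!"
doc = nlp(text)

print(len(doc))      # number of tokens
print(doc[0].text)   # first token
print(doc[-1].text)  # last token; negative indices work as well

# Every token remembers its trailing whitespace, so the text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```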

![](https://course.spacy.io/doc.png)

### The Token object

In [7]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the `.text` attribute returns the verbatim token text.

### The Span object

![](https://course.spacy.io/doc_span.png)

In [10]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


In [12]:
type(span)

spacy.tokens.span.Span

A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a span, you can use Python's slice notation. For example, `1:3` will create a slice starting from the token at position `1`, up to – but not including! – the token at position `3`.
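To make the "view" idea concrete, the following sketch inspects a few attributes of the slice. It assumes nothing beyond the tokenizer-only `English` pipeline used above.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")
span = doc[1:3]

print(span.start, span.end)         # token offsets of the slice within the Doc
print(len(span))                    # number of tokens in the span
print(span.doc is doc)              # the span points back to its parent Doc
print([token.i for token in span])  # tokens keep their Doc-level index
```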

### Lexical Attributes

In [13]:
doc = nlp("It costs $5.")

In [16]:
print(f"Index: {[token.i for token in doc]}")
print(f"Text: {[token.text for token in doc]}")

Index: [0, 1, 2, 3, 4]
Text: ['It', 'costs', '$', '5', '.']


In [17]:
print(f"is_alpha: {[token.is_alpha for token in doc]}")
print(f"is_punct: {[token.is_punct for token in doc]}")
print(f"like_num: {[token.like_num for token in doc]}")

is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]



Here you can see some of the available token attributes:

`i` is the index of the token within the parent document.

`text` returns the token text.

`is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number. For example, `like_num` is true both for the token "10" and for the word "ten".

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.
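The sketch below illustrates both points: `like_num` fires for digits and for spelled-out numbers, and the same flags can be read from the vocabulary entry without any surrounding sentence. The example sentence is made up, and looking up the entry via `nlp.vocab["ten"]` is shown here only as an illustration of "entry in the vocabulary".

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I have 10 apples and ten pears")

# Both the digits "10" and the spelled-out "ten" resemble a number
number_like = [token.text for token in doc if token.like_num]
print(number_like)

# The same attributes are available on the vocabulary entry (the lexeme),
# independent of any context
lexeme = nlp.vocab["ten"]
print(lexeme.text, lexeme.like_num, lexeme.is_alpha)
```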

## Documents, spans and tokens

When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

**Step 1**

1. Import the English language class and create the nlp object.
1. Process the text and instantiate a Doc object in the variable doc.
1. Select the first token of the Doc and print its text.

In [18]:
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


**Step 2**
1. Import the English language class and create the nlp object.
1. Process the text and instantiate a Doc object in the variable doc.
1. Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.

In [19]:
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:-1]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## Lexical attributes

In this example, you’ll use spaCy’s `Doc` and `Token` objects, and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

1. Use the like_num token attribute to check whether a token in the doc resembles a number.
1. Get the token following the current token in the document. The index of the next token in the doc is `token.i + 1`.
1. Check whether the next token’s text attribute is a percent sign "%".

In [21]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
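One caveat with `doc[token.i + 1]`: if the last token of a text resembles a number, the lookup goes past the end of the doc. A hedged variant is sketched below as a small helper. The name `find_percentages` and the second test sentence are hypothetical and not part of the exercise.

```python
from spacy.lang.en import English

nlp = English()

def find_percentages(doc):
    """Return the text of every number-like token followed by a "%" token."""
    found = []
    for token in doc:
        # Only look ahead when a next token actually exists
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found

doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(doc)
print(percentages)

# A text ending in a number is handled without an IndexError
trailing = find_percentages(nlp("The rate fell to 4"))
print(trailing)
```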


## Statistical models

Let's add some more power to the nlp object!

In this lesson, you'll learn about spaCy's statistical models.

#### What are statistical models?
- Enable spaCy to predict linguistic attributes in context
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities

- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

### Model packages

- Binary weights
- Vocabulary
- Meta information (language, pipeline)

In [23]:
$ python -m spacy download en_core_web_sm

In [24]:
import spacy

nlp = spacy.load("en_core_web_sm")

spaCy provides a number of pre-trained model packages you can download using the spacy download command. For example, the "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` function loads a model package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

### Predicting Part-of-speech Tags

In [30]:
# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_, token.pos)


She PRON 95
ate VERB 100
the DET 90
pizza NOUN 92


Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English model and receive an `nlp` object.

Next, we're processing the text "She ate the pizza".

For each token in the doc, we can print the text and the `.pos_` attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.
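The underscore convention can be tried without a trained model. Since `.pos_` needs a model's predictions, this sketch substitutes the lexical attribute pair `.lower_` / `.lower`, which works with the tokenizer alone, and uses the shared string store to map between the two forms.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("She ate the pizza")
token = doc[1]

# The underscore variant returns the string ...
print(token.lower_)
# ... and the variant without the underscore returns an integer ID
print(token.lower)

# The shared string store maps between the two representations
print(nlp.vocab.strings[token.lower] == token.lower_)
print(nlp.vocab.strings[token.lower_] == token.lower)
```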

### Predicting Syntactic Dependencies

In [33]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The `.dep_` attribute returns the predicted dependency label.

The `.head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

### Dependency label scheme

![](https://course.spacy.io/dep_example.png)

Label | Description | Example
------|-------------|--------
nsubj | nominal subject | She
dobj | direct object | pizza
det | determiner (article) | the


To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:

The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".

### Predicting Named Entities

![](https://course.spacy.io/ner_example.png)

In [36]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the model.

It **returns a tuple of Span objects**, so we can iterate over it and print the entity text and the entity label using the `.label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.
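To see that entities are ordinary `Span` objects, here is a sketch that needs no trained model. The `Doc` is built by hand and the entities are assigned manually, copied from the output above. They are not model predictions. The `words` and `spaces` lists are written out for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Doc, Span

nlp = English()

# Build the Doc by hand so that no trained model is needed
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup",
         "for", "$", "1", "billion"]
spaces = [True] * 8 + [False, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Hand-assigned entities (NOT predictions), given as token offsets
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 5, 6, label="GPE"),
    Span(doc, 8, 11, label="MONEY"),
]

for ent in doc.ents:
    # Each entity is a Span, so it has token offsets as well as a label
    print(ent.text, ent.label_, ent.start, ent.end)
```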

### Tip: the spacy.explain function

Get quick definitions of the most common tags and labels.

In [41]:
spacy.explain("GPE")

'Countries, cities, states'

In [42]:
spacy.explain("LOC")

'Non-GPE locations, mountain ranges, bodies of water'

In [39]:
spacy.explain("NNP")

'noun, proper singular'

In [40]:
spacy.explain("dobj")

'direct object'

A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

## Rule-based matching

In this lesson, we'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.

### Why not just regular expressions?

- Match on `Doc` objects, not just strings
- Match on tokens and token attributes
- Use the model's predictions
- Example: "duck" (verb) vs. "duck" (noun)

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use the model's predictions.

For example, find the word "duck" only if it's a verb, not a noun.

### Match patterns

Lists of dictionaries, one per token

Match exact token texts

`[{"TEXT": "iPhone"}, {"TEXT": "X"}]`

Match lexical attributes

`[{"LOWER": "iphone"}, {"LOWER": "x"}]`

Match any token attributes

`[{"LEMMA": "buy"}, {"POS": "NOUN"}]`

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".

In [49]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", None, pattern)

doc = nlp("Upcoming iPhone X release date leaked")

matches = matcher(doc)

matches

[(9528407286733565721, 1, 3)]

To use a pattern, we first import the matcher from spacy.matcher.

We also load a model and create the nlp object.

The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.

The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to `None`. The third argument is the pattern.

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.
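The case-insensitive `LOWER` pattern from earlier can be tried the same way. The calls in this chapter pass `None` as the callback in second position, which matches older spaCy releases. The sketch below instead assumes spaCy 3.x, where `matcher.add` takes a list of patterns. The pattern name `IPHONE_LOWER` and the sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here, because LOWER is a lexical attribute
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# Assumed spaCy 3.x signature: patterns are passed as a list of patterns
matcher.add("IPHONE_LOWER", [pattern])

doc = nlp("Both iPhone X and IPHONE x are listed")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```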

In [53]:
# Call the matcher on the doc
doc = nlp("Upcoming iPhone X release date leaked through an iPhone X")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(match_id, start, end, matched_span.text)

9528407286733565721 1 3 iPhone X
9528407286733565721 8 10 iPhone X


- `match_id`: hash value of the **pattern name**
- `start`: start index of matched span
- `end`: end index of matched span
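Because the match ID is a hash of the pattern name, the name can be recovered from the string store. A minimal sketch, again assuming the spaCy 3.x form of `matcher.add` with a list of patterns:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")

# Look each hash up in the string store to recover the pattern name
results = [
    (nlp.vocab.strings[match_id], start, end)
    for match_id, start, end in matcher(doc)
]
print(results)
```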

In [54]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

doc = nlp("2018 FIFA World Cup: France won!")

matcher = Matcher(nlp.vocab)
matcher.add("FIFA", None, pattern)

matches = matcher(doc)
matches

[(851579294197118795, 0, 5)]

In [55]:
doc[0:5]

2018 FIFA World Cup:

Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

A token consisting of only digits.

Three case-insensitive tokens for "fifa", "world" and "cup".

And a token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

### Matching other token attributes

In [59]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

doc = nlp("I loved dogs but now I love cats more.")

matcher = Matcher(nlp.vocab)
matcher.add("LOVE", None, pattern)

matches = matcher(doc)
matches

[(18437031736592595799, 1, 3), (18437031736592595799, 6, 8)]

In this example, we're looking for two tokens:

A verb with the lemma "love", followed by a noun.

This pattern will match "loved dogs" and "love cats".

### Using operators and quantifiers (1)

In [60]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"}, # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")

matcher = Matcher(nlp.vocab)
matcher.add("LOVE", None, pattern)

matches = matcher(doc)
matches

[(18437031736592595799, 1, 4), (18437031736592595799, 8, 10)]

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

### Using operators and quantifiers (2)

| Example | Description |
|---------|-------------|
| `{"OP": "!"}` | Negation: match 0 times |
| `{"OP": "?"}` | Optional: match 0 or 1 times |
| `{"OP": "+"}` | Match 1 or more times |
| `{"OP": "*"}` | Match 0 or more times |


"OP" can have one of four values:

- An "!" negates the token, so it's matched 0 times.

- A "?" makes the token optional, and matches it 0 or 1 times.

- A "+" matches a token 1 or more times.

- And finally, an "*" matches 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
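As one illustration of that added complexity, the sketch below uses the "+" operator on a made-up sentence. A repeated token may produce several overlapping candidate matches, so the example keeps only the longest non-overlapping spans with `spacy.util.filter_spans`. It assumes spaCy 3.x for the `matcher.add` call, and uses `LOWER` rather than model predictions so that a blank pipeline suffices.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("This is very very good and that is very good")

matcher = Matcher(nlp.vocab)
# One or more "very" tokens, followed by "good"
matcher.add("VERY_GOOD", [[{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]])

# Collect all candidate spans, then keep the longest non-overlapping ones
spans = [doc[start:end] for _, start, end in matcher(doc)]
longest = [span.text for span in filter_spans(spans)]
print(longest)
```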