# spaCy Statistical Models

### What are statistical models?

Some of the most interesting things you can analyze are context-specific. For example, whether a word is a verb, or whether it is a personal name.

- Statistical models enable spaCy to make predictions (predict linguistic attributes) *in context*. This usually includes:
    - part-of-speech tags
    - syntactic dependencies
    - named entities
- Models are trained on large datasets of labeled example texts.
- They can be updated with more examples to fine-tune predictions, for example, to perform better on your specific data.

### Model Packages

spaCy provides a number of pre-trained model packages you can download. For example, the `en_core_web_sm` package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` method loads a model package by name and returns an `nlp` object.

```python
import spacy
nlp = spacy.load('en_core_web_sm')
```

The package provides:

- The binary weights that enable spaCy to make predictions.
- The vocabulary, plus meta information that tells spaCy which language class to use and how to configure the processing pipeline.
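
The items above can be inspected on the loaded `nlp` object. The following is a minimal sketch, not part of the original notebook: it assumes that falling back to a blank English pipeline is acceptable when `en_core_web_sm` is not installed, in which case the pipeline has no trained components.

```python
import spacy

# spacy.load raises OSError when the package cannot be found,
# so fall back to a blank English pipeline (tokenizer only).
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

# Meta information: language code and package name
print(nlp.meta.get("lang"), nlp.meta.get("name"))

# Names of the processing pipeline components (empty for a blank pipeline)
print(nlp.pipe_names)

# Number of strings currently stored in the vocabulary
print(len(nlp.vocab.strings))
```

With the trained package installed, `nlp.pipe_names` lists the components that produce the predictions discussed in the following sections.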

Let's take a look at the model's predictions.

### Predicting Part-of-speech Tags

In this example, we use spaCy to predict part-of-speech tags, that is, the word types in context.

First, we load the small English model and receive an `nlp` object.

In [4]:
import spacy
# Load the small English model
nlp = spacy.load('en_core_web_sm')

Next, we process the text:

In [5]:
# Process a text
doc = nlp("Shee ate the pizza")

For each token in the `doc` we print the text and the `pos_` attribute, the predicted part-of-speech tag:

In [6]:
# Iterate over the tokens
for token in doc:
    
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

Shee PROPN
ate VERB
the DET
pizza NOUN


> In spaCy, attributes that return a string usually end with an underscore. Attributes without the underscore return an integer ID.

Here, the model correctly predicts `ate` as a verb and `pizza` as a noun. The misspelled `Shee` in the input is tagged as a proper noun (`PROPN`).
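
To make the underscore convention concrete, here is a small sketch that compares both forms of an attribute. It is an illustration rather than part of the original notebook: it uses a blank English pipeline so it runs without a downloaded model, and the part-of-speech value is set by hand instead of being predicted.

```python
import spacy

# A blank pipeline only tokenizes, so no trained model is needed
nlp = spacy.blank("en")
doc = nlp("She ate the pizza")
token = doc[1]

# Set by hand for illustration; a trained model would predict this value
token.pos_ = "VERB"

# With the underscore: the readable string label
print(token.pos_, type(token.pos_).__name__)

# Without the underscore: the integer ID of the same label
print(token.pos, type(token.pos).__name__)

# The token text follows the same pattern
print(token.orth_, token.orth)
```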

### Predicting Syntactic Dependencies

In addition to the part-of-speech tags, we can also predict how the words are related.

For example, whether a word is the subject of the sentence or an object. The `dep_` attribute returns the predicted dependency label.

The `head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [7]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Shee PROPN nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
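
The head relation can also be followed in the other direction, from a token to the tokens attached to it. The sketch below is an illustration under stated assumptions: it assumes spaCy v3, where the `Doc` constructor accepts `heads` (absolute token indices) and `deps`, and it writes the parse by hand instead of relying on a model, using the correctly spelled sentence.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse for illustration (assumes the spaCy v3 Doc constructor)
words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The root is the only token that is its own head
root = [token for token in doc if token.head.i == token.i][0]
print("Root:", root.text)

# For each token, print its label, its head and the tokens attached to it
for token in doc:
    children = [child.text for child in token.children]
    print(token.text, token.dep_, token.head.text, children)
```

With a trained pipeline, the same `head` and `children` attributes are filled in by the parser, so the loop works unchanged on a `doc` returned by `nlp`.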


### Dependency label scheme

To describe syntactic dependencies, spaCy uses a standardized label scheme. Here is an example of some common labels, rendered with the displaCy visualizer.

In [8]:
from spacy import displacy

In [10]:
displacy.render(doc, style='dep', jupyter=True)
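
Outside a notebook, the same visualization can be produced as an SVG string. The following sketch is an assumption-laden illustration, not the notebook's own code: it uses displaCy's `manual=True` input format with a hand-written parse, so it needs no trained model, and it passes `jupyter=False` so that the markup is returned instead of displayed.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual input format
parse = {
    "words": [
        {"text": "She", "tag": "PRON"},
        {"text": "ate", "tag": "VERB"},
        {"text": "the", "tag": "DET"},
        {"text": "pizza", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "dobj", "dir": "right"},
    ],
}

# jupyter=False returns the SVG markup as a string
svg = displacy.render(parse, style="dep", manual=True, jupyter=False)
print(len(svg), "characters of SVG markup")
```

The returned string can be written to a file and opened in a browser.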

### Predicting Named Entities

Named entities are real-world objects that are assigned a name, for example, a person, an organization, or a country. The `doc.ents` property lets you access the named entities predicted by the model.


In [20]:
import spacy
# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

print("")
displacy.render(doc, style='ent', jupyter=True)

Apple ORG
U.K. GPE
$1 billion MONEY



`doc.ents` returns a sequence of `Span` objects, so we can print the entity text and the entity label using the `label_` attribute. In this case, the model correctly predicts "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

### Tip: the explain method

A quick tip: to get definitions for the most common tags and labels, you can use the `spacy.explain` helper function.

For example, `GPE` for geopolitical entity isn't exactly intuitive, but `spacy.explain` can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [24]:
spacy.explain('NNP')

'noun, proper singular'

In [25]:
spacy.explain('dobj')

'direct object'
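
To look up several labels at once, `spacy.explain` can be called in a loop. This short sketch uses labels that appear in the outputs above; it needs spaCy itself but no trained model.

```python
import spacy

# Tags and labels taken from the outputs above
labels = ["PROPN", "VERB", "nsubj", "dobj", "GPE", "ORG", "MONEY"]

for label in labels:
    # Left-align the label, then print its description
    print(f"{label:<8}{spacy.explain(label)}")
```

For a term that is not in its glossary, `spacy.explain` returns `None`.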

## Model packages
What's **not** included in a model package that you can load into spaCy?

1. A meta file including the language, pipeline and license.
2. Binary weights to make statistical predictions.
3. The labelled data that the model was trained on.
4. Strings of the model's vocabulary and their hashes.

Answer: 3

That's correct! Statistical models allow you to generalize based on a set of training examples. Once they're trained, they use binary weights to make predictions. That's why it's not necessary to ship them with their training data.
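
Option 4 mentions strings and their hashes. As a rough illustration of that mapping, the sketch below looks up a string's hash in the vocabulary and then resolves the hash back to the string. It uses a blank English pipeline as a stand-in for a model package's vocabulary, and the example sentence is made up for this purpose.

```python
import spacy

nlp = spacy.blank("en")

# Processing a text adds its words to the vocabulary's string store
doc = nlp("I love coffee")

# Look up the hash for a string, then the string for that hash
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)
```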

## Loading models
Let's start by loading a model. `spacy` is already imported.

- Use `spacy.load` to load the small English model 'en_core_web_sm'.
- Process the text and print the document text.

In [29]:
# Load the 'en_core_web_sm' model – spaCy is already imported
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


- Use `spacy.load` to load the small German model 'de_core_news_sm'.
- Process the text and print the document text.

In [33]:
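# Load the 'de_core_news_sm' model – spaCy is already imported
# (the package must be installed: python -m spacy download de_core_news_sm)
# A separate variable keeps the English `nlp` object available for later cells
nlp_de = spacy.load('de_core_news_sm')

text = "Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht"

# Process the text
doc = nlp_de(text)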

# Print the document text
print(doc.text)

Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht


## Predicting linguistic annotations
You'll now get to try one of spaCy's pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! The small English model is already available as the variable `nlp`.

To find out what a tag or label means, you can call `spacy.explain` in the IPython shell. For example: `spacy.explain('PROPN')` or `spacy.explain('GPE')`.

- Process the text with the `nlp` object and create a `doc`.
- For each token, print the token text, the token's `.pos_` (part-of-speech tag) and the token's `.dep_` (dependency label).


In [34]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      ccomp     
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      



- Process the text and create a `doc` object.
- Iterate over the `doc.ents` and print the entity `text` and `label_` attribute.

In [35]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


## Predicting named entities in context
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's take a look at an example. The small English model is available as the variable `nlp`.

- Process the text with the `nlp` object.
- Iterate over the entities and print the entity text and label.

In [36]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

Apple ORG


- Looks like the model didn't predict "iPhone X". Create a span for those tokens manually.

In [37]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

Apple ORG
Missing entity: iPhone X
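
A manually created span can also be given a label and registered as an entity on the `doc`. The sketch below is an illustration rather than the exercise's solution: it builds the `doc` from a hand-written word list with a blank pipeline, sets the "Apple" entity by hand in place of a model prediction, and uses `PRODUCT` as an assumed label for "iPhone X".

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Hand-written tokens for the example headline
words = ["New", "iPhone", "X", "release", "date", "leaked", "as", "Apple",
         "reveals", "pre", "-", "orders", "by", "mistake"]
doc = Doc(nlp.vocab, words=words)

# Stand-in for the entity the model predicted
doc.ents = [Span(doc, 7, 8, label="ORG")]

# The missing entity; "PRODUCT" is an assumed label choice
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep the existing entities and add the new one
doc.ents = list(doc.ents) + [iphone_x]

for ent in doc.ents:
    print(ent.text, ent.label_)
```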


Perfect! Of course, you don't always have to do this manually. In the next lesson, you'll learn about spaCy's rule-based matcher, which can help you find certain words and phrases in text.