# Performance Tips

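It is tempting to process texts one at a time in a list comprehension, but fight the urge! Use `nlp.pipe` instead: it processes the texts as a stream, in batches, which is typically much faster than calling `nlp` on each text separately.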

**BAD**

```
docs = [nlp(text) for text in LOTS_OF_TEXTS]
```

**GOOD**

```
docs = list(nlp.pipe(LOTS_OF_TEXTS))
```
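
To check that the two forms really are interchangeable, here is a minimal sketch using a blank English pipeline (tokenizer only, so no model download is needed). The sample texts and the `batch_size` value are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only (an assumption for this sketch)
nlp = spacy.blank("en")

# Hypothetical sample texts
texts = ["This is a text.", "And another text.", "One more for good measure."]

# One nlp() call per text
loop_docs = [nlp(text) for text in texts]

# nlp.pipe yields Doc objects lazily and processes the texts in batches
pipe_docs = list(nlp.pipe(texts, batch_size=2))

# Both approaches produce the same tokens
loop_tokens = [[token.text for token in doc] for doc in loop_docs]
pipe_tokens = [[token.text for token in doc] for doc in pipe_docs]
print(loop_tokens == pipe_tokens)
```

The results match; the difference is only in how the work is scheduled, which starts to matter once the pipeline has real components and the list of texts is long.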

> tl;dr - use `nlp.pipe` to process lists of texts. It yields the docs one by one, so wrap it in `list` if you need an actual list of docs.

## Get tuple of text and metadata using Pipeline

If you need to keep "context" alongside each text, pass `as_tuples=True` to `nlp.pipe`. The input is then a sequence of `(text, context)` tuples, and `nlp.pipe` yields `(doc, context)` tuples. This is useful for passing along additional metadata, like an ID associated with the text, or a page number.

The example below makes this concrete.

In [1]:
from spacy.lang.en import English

nlp = English()

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16
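
Often the texts and their metadata start out in separate, parallel lists. A minimal sketch of pairing them up for `as_tuples=True` could look like this; the `texts` and `page_numbers` lists and the `pages_by_text` dict are hypothetical names used only for illustration.

```python
from spacy.lang.en import English

nlp = English()

# Hypothetical inputs kept in separate, parallel lists
texts = ["This is a text", "And another text"]
page_numbers = [15, 16]

# Pair each text with a context dict, the shape as_tuples=True expects
data = [(text, {"page_number": page}) for text, page in zip(texts, page_numbers)]

# Each yielded item is a (doc, context) tuple
pages_by_text = {}
for doc, context in nlp.pipe(data, as_tuples=True):
    pages_by_text[doc.text] = context["page_number"]

print(pages_by_text)
```

The context object is passed through untouched, so it can be any Python object, not just a dict.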


## Pipeline to add metadata/context as custom attributes

The example below registers two custom `Doc` extensions, `id` and `page_number`, which default to `None`, and then fills them in from the context of each text.

In [2]:
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
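
The cell above sets the attributes but never reads them back. Here is a minimal sketch that does both. It also guards the registration with `Doc.has_extension`, since registering the same extension twice raises an error (easy to hit when re-running a notebook cell). The `summary` list is a hypothetical name for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Only register the extensions if they do not exist yet
for name in ("id", "page_number"):
    if not Doc.has_extension(name):
        Doc.set_extension(name, default=None)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]
    docs.append(doc)

# The metadata now travels with each Doc
summary = [(doc._.id, doc._.page_number, doc.text) for doc in docs]
print(summary)

# A Doc that never had the attribute set falls back to the default (None)
print(nlp.make_doc("No metadata here")._.id)
```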

## Use *only* the tokenizer

If you only need to tokenize the text, then running the whole `nlp` pipeline (e.g. POS tagging, dependency parsing, NER) eats up unnecessary time and resources.

**Solution**: If you only need a tokenized `Doc` object, you can use the `nlp.make_doc` method instead, which takes a text and returns a `Doc` without running any pipeline components.

**BAD**

```
doc = nlp("Hello world!")
```

**GOOD**

```
doc = nlp.make_doc("Hello world!")
```

In [12]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']
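
`nlp.make_doc` handles one text at a time. For many texts, one option is to stream them through the tokenizer alone with `nlp.tokenizer.pipe`. The sketch below assumes a blank English pipeline and two made-up sample texts.

```python
import spacy

# Blank English pipeline (an assumption; a loaded model's tokenizer works the same way)
nlp = spacy.blank("en")

# Hypothetical sample texts
texts = ["Hello world!", "Tokenize me, please."]

# Stream the texts through the tokenizer only; no other components run
docs = list(nlp.tokenizer.pipe(texts))
tokens = [[token.text for token in doc] for doc in docs]
print(tokens)
```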


## Disable components of a pipeline

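Similar to above, if you want to skip specific components, the "tagger" for instance, you can use `nlp.disable_pipes` as a context manager. The named components are disabled inside the `with` block and restored afterwards.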

Example:
```
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)
```

In [13]:
# Example - Disable the tagger and parser
# i.e. should be left w/ tokenize -> NER

import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Disable the tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(American, College Park, Georgia)
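
To see that the components really come back after the `with` block, here is a minimal sketch. It assumes spaCy v3, where `nlp.add_pipe` takes a component name as a string and `nlp.select_pipes` is the newer spelling of `nlp.disable_pipes`. A blank pipeline with a `sentencizer` stands in for a full model, purely for illustration.

```python
import spacy

# Assumes spaCy v3: add_pipe accepts the component's string name
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

before = list(nlp.pipe_names)

# Temporarily disable the component inside the with block
with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)

# On exit, the disabled component is restored
after = list(nlp.pipe_names)
print(before, inside, after)
```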


# More Examples

## Example - Process a bunch of tweets

Create a pipeline to process the tweets and extract the **adjectives**.

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

TEXTS = [
    'McDonalds is my favorite restaurant.',
    'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
    'People really still eat McDonalds :(',
    'The McDonalds in Spain has chicken wings. My heart is so happy ',
    '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
    'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
    'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it',
]

for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible', 'gettin', 'payin']


In [None]:
# THIS IS THE **WRONG** APPROACH (same output, but one nlp() call per text is slower than nlp.pipe)

# for text in TEXTS:
#     doc = nlp(text)
#     print([token.text for token in doc if token.pos_ == "ADJ"])

**Print out the entities from the tweets.**

HINT: use `list(nlp.pipe(TEXTS))` for optimal performance

In [6]:
# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) (@McDonalds,) (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) (WANT, McRib) (This morning,)


## Example - Add author, book metadata to quotes

Here, the text data is available as a list of `[text, context]` pairs, where `context` is a dictionary of metadata, e.g. `{'author': 'Frank Harrell', 'book': 'Regression Modeling Strategies'}`.

In [9]:
import requests

# Download the quotes: a JSON list of [text, context] pairs
req = requests.get("https://raw.githubusercontent.com/ines/spacy-course/master/exercises/bookquotes.json")
req.raise_for_status()  # Fail early if the download did not succeed
DATA = req.json()

In [10]:
DATA

[['One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.',
  {'author': 'Franz Kafka', 'book': 'Metamorphosis'}],
 ["I know not all that may be coming, but be it what it will, I'll go to it laughing.",
  {'author': 'Herman Melville', 'book': 'Moby-Dick or, The Whale'}],
 ['It was the best of times, it was the worst of times.',
  {'author': 'Charles Dickens', 'book': 'A Tale of Two Cities'}],
 ['The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.',
  {'author': 'Jack Kerouac', 'book': 'On the Road'}],
 ['It was a bright cold day in April, and the clocks were striking thirteen.',
  {'author': 'George Orwell', 'book': '1984'}],
 ['Nowadays people know the price of everything and the valu

In [11]:
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None)

# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']

    # Print the text and custom attribute data
    print(doc.text, "\n", "— '{}' by {}".format(doc._.book, doc._.author), "\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 

