In [1]:
import spacy

nlp = spacy.load('en_core_web_md')

To process a large volume of text efficiently, use `nlp.pipe`:
```python
# LARGE_TEXT is assumed to be an iterable of strings
docs = list(nlp.pipe(LARGE_TEXT))
```
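`nlp.pipe` processes the texts as a stream, in batches, and returns a generator that yields `Doc` objects. This is much faster than calling `nlp` on each text separately.
We can also pass `(text, context)` tuples by setting `as_tuples=True`; the generator then yields `(doc, context)` tuples: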
```python
nlp.pipe(data, as_tuples=True)
```

In [3]:
data = [
    ("this is text1", {"id":1, "page_no":10}),
    ("this is another one",{"id":2, "page_no":12})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_no"])


this is text1 10
this is another one 12

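To see the streaming behaviour directly, the sketch below pulls documents from `nlp.pipe` one at a time. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that no model download is needed; the third text and the `batch_size` value are arbitrary choices made for illustration.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline (tokenizer only) stands in for en_core_web_md
nlp = spacy.blank("en")

texts = ["this is text1", "this is another one", "and a third text"]

# nlp.pipe returns an iterator; Docs are produced as we consume it
stream = nlp.pipe(texts, batch_size=2)

first = next(stream)  # the first Doc
rest = list(stream)   # the remaining Docs

print(type(first).__name__, "-", first.text)
print(len(rest), "docs left in the stream")
```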

### Avoid Running the entire pipeline

Sometimes we do not need the entire processing performed for us by the pipeline. 
If we just need to tokenize and create a doc, we use:
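```python
# Only tokenize: returns a Doc without running any pipeline components
doc = nlp.make_doc("something goes here")

# Temporarily disable components
# (spaCy v2 call; spaCy v3 also offers nlp.select_pipes(disable=[...]))
text = "something goes here"  # any input string
with nlp.disable_pipes("tagger", "parser"):
    doc = nlp(text)
    print(doc.ents)
```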
The disabled components are restored automatically when the `with` block exits, so the full pipeline runs again afterwards.
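A small sketch of what happens around the `with` block: the component is switched off inside it and back on afterwards. To avoid needing a trained model, this assumes a blank pipeline with a rule-based `sentencizer` standing in for `tagger` and `parser`; the version check is there only because `add_pipe` is called differently in spaCy v2 and v3.

```python
import spacy

nlp = spacy.blank("en")

# Assumption: "sentencizer" stands in for tagger/parser so no model is needed.
# add_pipe takes a component object in v2 and a string name in v3.
if spacy.__version__.startswith("2."):
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
else:
    nlp.add_pipe("sentencizer")

before = list(nlp.pipe_names)
with nlp.disable_pipes("sentencizer"):
    inside = list(nlp.pipe_names)  # component is switched off here
after = list(nlp.pipe_names)       # and restored once the block exits

print(before, inside, after)
```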

In [4]:
people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))

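These pattern docs are meant to be handed to a `PhraseMatcher`. A minimal sketch follows, under several assumptions: a blank English pipeline, an arbitrary rule name `"PEOPLE"`, an invented sample sentence, and the `matcher.add(key, patterns)` call form of recent spaCy versions (older v2 releases use `matcher.add(key, None, *patterns)`).

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank pipeline is enough, since only tokenization is needed
nlp = spacy.blank("en")

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of pattern Docs for the PhraseMatcher
patterns = list(nlp.pipe(people))

matcher = PhraseMatcher(nlp.vocab)
matcher.add("PEOPLE", patterns)  # "PEOPLE" is an arbitrary rule name

# Invented sample sentence
doc = nlp("Lady Gaga sang a David Bowie song.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```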
## Updating And Training Models
The steps of a training loop:
1. Loop for a number of times.
2. Shuffle the training data.
3. Divide the data into batches.
4. Update the model for each batch.
5. Save the updated model.


<img src="https://course.spacy.io/training.png">

* Training data: Examples and their annotations.
* Text: The input text the model should predict a label for.
* Label: The label the model should predict.
* Gradient: How to change the weights.

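Steps 1 to 4 of the loop can be sketched without spaCy at all. In the snippet below, `simple_minibatch` and `fake_update` are hypothetical stand-ins for `spacy.util.minibatch` and `nlp.update`, written only for illustration; the toy data, seed, and batch size are arbitrary.

```python
import random


def simple_minibatch(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_update(batch, log):
    """Placeholder for a real model update: just record the batch size."""
    log.append(len(batch))


# Toy (text, annotation) pairs; the labels carry no meaning
data = [("text %d" % i, {"label": i % 2}) for i in range(5)]
rng = random.Random(0)  # fixed seed so the run is repeatable

update_log = []
for epoch in range(3):                            # 1. loop a number of times
    rng.shuffle(data)                             # 2. shuffle the training data
    for batch in simple_minibatch(data, size=2):  # 3. divide into batches
        fake_update(batch, update_log)            # 4. update for each batch

print(update_log)
```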
#### Template for updating previous models

```python
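import random
import spacy

# Note: nlp.update(texts, annotations) below is the spaCy v2 call form;
# spaCy v3 expects a list of Example objects instead.
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    # And many more examples...
]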

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model (path_to_model is a placeholder for an output directory)
nlp.to_disk(path_to_model)
```
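Entity annotations are character offsets, so an off-by-one error silently mislabels the data. A small pure-Python check can slice each text with its offsets so the spans can be inspected before training. The helper name `entity_spans` is made up for this sketch.

```python
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
]


def entity_spans(text, annotations):
    """Return (substring, label) for every (start, end, label) annotation."""
    return [
        (text[start:end], label)
        for start, end, label in annotations["entities"]
    ]


for text, annotations in TRAINING_DATA:
    print(entity_spans(text, annotations))
```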


#### Template for setting up a new pipeline
```python
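import random
import spacy

# spaCy v2 API: in v3, use nlp.add_pipe("ner") and nlp.initialize() instead
nlp = spacy.blank("en")
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)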
# Add a new label
ner.add_label("GADGET")

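# Start the training (initializes the model weights)
nlp.begin_training()
# Train for 10 iterations
# `examples` is a list of (text, annotations) tuples, like TRAINING_DATA above
for itn in range(10):
    random.shuffle(examples)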
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)
```
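The template above follows the spaCy v2 API. For readers on spaCy v3, a rough equivalent is sketched below, assuming v3 is installed. The two training sentences are toy data (the second is invented here). `Example.from_dict`, `nlp.initialize`, and the string form of `add_pipe` are the v3 counterparts of the calls used above.

```python
import random

import spacy
from spacy.training import Example

# Toy data; the second sentence is invented for this sketch
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("I need a new phone case", {"entities": []}),
]

nlp = spacy.blank("en")
# In v3, add_pipe takes the component name and returns the component
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")

# v3 trains on Example objects rather than (text, annotations) tuples
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAINING_DATA
]

# Initialize the weights; this returns the optimizer
optimizer = nlp.initialize(lambda: examples)

for itn in range(10):
    random.shuffle(examples)
    losses = {}
    for batch in spacy.util.minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)

print(losses)
```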


### Problems with model updating and training
1. Catastrophic forgetting: if we train only on new data, the model can "forget" what it learned from older examples. The usual remedy is to mix older examples in with the newer ones in each training iteration.
2. Models are only good at `local context` predictions. They can struggle when the surrounding context does not make the decision clear. To avoid this, keep labels fairly generic: "clothing" is a better label than "adult clothing", "kid clothing", etc.

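One way to picture the remedy for catastrophic forgetting is to rebuild the training set on every iteration from all new examples plus a random sample of older ones. The helper `mixed_training_set`, the sample size, and the placeholder "old" data below are assumptions made for this sketch, not part of spaCy.

```python
import random


def mixed_training_set(new_examples, old_examples, n_old, rng):
    """Combine all new examples with a random sample of n_old old ones."""
    sampled_old = rng.sample(old_examples, min(n_old, len(old_examples)))
    combined = list(new_examples) + sampled_old
    rng.shuffle(combined)
    return combined


# Placeholder data: real "old" examples would carry the labels the model
# already predicts well
old_examples = [("old text %d" % i, {"entities": []}) for i in range(10)]
new_examples = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
]

rng = random.Random(0)  # fixed seed so the run is repeatable
for itn in range(3):
    data = mixed_training_set(new_examples, old_examples, n_old=4, rng=rng)
    # A real loop would now batch `data` and update the model
    print(itn, len(data))
```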
Annotation tools such as *brat* and *Prodigy* can help with rapid data labelling.