# Training Best Practices

## Avoiding the "catastrophic forgetting" problem

If you update an existing model with new data, especially new labels, it can overfit to the new examples and "unlearn" what it already knew. For example, if you update it only with examples of `WEBSITE`, it can "unlearn" what a `PERSON` is.

spaCy can help you with this. To keep the model from forgetting, your training data should also include examples of the labels it already predicts correctly. You can create those additional examples by running the existing model over data and extracting the entity spans you care about. You can then mix those examples in with your new training data and update the model with annotations of all labels.

**BAD**
```
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]
```

**GOOD**
```
TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]
```
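The "run the existing model and keep its predictions" step could be sketched roughly as follows. This is only an illustration: `pseudo_annotate` and `keep_labels` are hypothetical names that are not part of spaCy, and the helper assumes `nlp` is a loaded spaCy pipeline (or any callable that returns an object with `.ents`, whose items expose `start_char`, `end_char` and `label_`).

```python
def pseudo_annotate(nlp, texts, keep_labels):
    """Build training examples from what an existing pipeline already predicts.

    Hypothetical helper, not a spaCy function. Returns a list of
    (text, {"entities": [(start_char, end_char, label), ...]}) tuples.
    """
    examples = []
    for text in texts:
        doc = nlp(text)
        # Keep only the entity spans whose labels we want the model to remember.
        entities = [
            (ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents
            if ent.label_ in keep_labels
        ]
        examples.append((text, {"entities": entities}))
    return examples


# Usage sketch (assumes spaCy and a pretrained pipeline are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# old_examples = pseudo_annotate(nlp, ["Obama is a person"], {"PERSON"})
# TRAINING_DATA = TRAINING_DATA + old_examples
```

The model's own predictions can be wrong, so it is probably worth reviewing the generated examples before mixing them into the training data.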

```
# Example of annotation

TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    (
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]
```
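Each entity in the annotations above is a `(start, end, label)` tuple of character offsets into the text, with `end` exclusive. Off-by-one mistakes are easy to make when counting by hand, so a quick check like the following sketch can help. `entity_texts` is a hypothetical helper name, not a spaCy function.

```python
def entity_texts(example):
    """Return the (substring, label) pairs that an example's offsets point to."""
    text, annotations = example
    return [
        (text[start:end], label)
        for start, end, label in annotations["entities"]
    ]


example = (
    "There's also a Paris in Arkansas, lol",
    {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
)
print(entity_texts(example))  # [('Paris', 'GPE'), ('Arkansas', 'GPE')]
```

If a printed substring is cut off or includes stray whitespace or punctuation, the offsets for that example need adjusting.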

# There's More!

- Docs on [training](https://spacy.io/usage/training)
- Docs on [customizing the tokenizer](https://spacy.io/usage/linguistic-features#tokenization)