# **Tokenization**

Tokenization is the process of splitting textual input, such as a sentence or a paragraph, into meaningful units [[1]](#scrollTo=op-j6UywUt5i).

This notebook shows examples of word and sentence tokenization with ``spaCy``.

## **Word tokenization**

Word tokenization splits a text into words and punctuation marks. Some schemes go further and use smaller word pieces such as word stems, prefixes, and suffixes.
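
To see why a dedicated tokenizer is helpful, the following minimal sketch compares two naive approaches that use only the Python standard library. It is not how ``spaCy`` tokenizes; the sample text and variable names are chosen purely for illustration.

```python
import re

text = "Please send the menu to abs-xyz@gmail.com. They are going to the U.K."

# Naive approach 1: split on whitespace only.
# Punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Naive approach 2: runs of word characters, or single non-space symbols.
# This breaks e-mail addresses and abbreviations into fragments.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The first approach leaves the final period attached to the e-mail address, while the second splits "U.K." into four separate pieces. A rule-based tokenizer such as the one in ``spaCy`` is designed to handle cases like these.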

For word tokenization, we will apply the following steps:
1. Import the ``spaCy`` library
2. Load the language model (English)
3. Create a ``Doc`` object and perform word tokenization
4. Print the word tokens



### Import ``spaCy`` library
``spaCy`` is a free, open-source library for advanced natural language processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[2]](https://spacy.io/usage/spacy-101). For example, it supports tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=op-j6UywUt5i). For more information about ``spaCy``, please refer to [[3]](https://spacy.io/).

In [None]:
# Import the spaCy library
import spacy

### Load language model
We will load the ``en_core_web_sm`` English language model by using the ``spaCy`` library.
For more details about ``en_core_web_sm``, please refer to [[4]](https://spacy.io/models).

In [None]:
# Load "en_core_web_sm" English language model
sp = spacy.load('en_core_web_sm')
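
If the model package is not installed, ``spacy.load`` raises an ``OSError``. The sketch below shows one possible way to handle this, assuming spaCy v3: it falls back to a blank English pipeline, which contains only the tokenizer. The helper name ``load_english`` and the variable ``nlp_checked`` are hypothetical and not part of ``spaCy``.

```python
import spacy


def load_english(model_name="en_core_web_sm"):
    """Load the named pipeline, or a tokenizer-only English pipeline as a fallback."""
    try:
        return spacy.load(model_name)
    except OSError:
        # Raised by spaCy when the model package cannot be found.
        return spacy.blank("en")


nlp_checked = load_english()
print(nlp_checked.lang)        # language code of the loaded pipeline
print(nlp_checked.pipe_names)  # component names; empty for the blank fallback
```

The model itself can be installed with ``python -m spacy download en_core_web_sm``. The blank fallback is sufficient for word tokenization, but it does not include components such as the parser.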

### Create a ``Doc`` object and perform word tokenization

When we create a ``Doc`` object by using the ``spaCy`` library, it automatically produces tokens for the input text. The following figure demonstrates the processing pipeline of a given text to create a ``Doc`` object [[5]](https://spacy.io/usage/processing-pipelines):

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)
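
The tokenizer is the first stage of this pipeline and can be tried on its own. The following minimal sketch assumes spaCy v3 and uses ``spacy.blank("en")``, which creates an English pipeline without any further components. The names ``nlp_tok`` and ``demo_doc`` and the sample sentence are illustrative only.

```python
import spacy

# A blank English pipeline: tokenizer only, no tagger, parser, or NER.
nlp_tok = spacy.blank("en")
demo_doc = nlp_tok("Please send the menu to abs-xyz@gmail.com.")

print(nlp_tok.pipe_names)  # no components after the tokenizer

# Each token carries lexical attributes that do not require a trained model.
for token in demo_doc:
    print(token.text, token.is_alpha, token.is_punct, token.like_email)
```

Even without a trained model, the resulting ``Doc`` already contains the tokens, which shows that tokenization happens as soon as the ``Doc`` is created.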

In [None]:
# Create a Doc object "doc"
doc = sp('I am non-vegetarian, please send me the menu to abs-xyz@gmail.com. "They are going to the U.K. and then to the U.S.A"')

### Print word tokens

In [None]:
# Print each word token in the object "doc":
for token in doc:
    print(token.text)

I
am
non
-
vegetarian
,
please
send
me
the
menu
to
abs-xyz@gmail.com
.
"
They
are
going
to
the
U.K.
and
then
to
the
U.S.A
"


## **Sentence tokenization**
Sentence tokenization is used to split up a given text into individual sentences [[1]](#scrollTo=op-j6UywUt5i).

In the section ["Word tokenization"](#scrollTo=3CPahqp-pF9i), we have already imported the ``spaCy`` library, loaded the language model and created a ``Doc`` object. We will use the same ``Doc`` object in this section.

### Print sentence tokens

As explained [before](#scrollTo=hXK00FUnQb97), ``spaCy`` automatically performs tokenization when the ``Doc`` object is created.

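To print the sentences, we simply iterate over the ``doc.sents`` property of the ``Doc`` object. In the ``en_core_web_sm`` pipeline, the sentence boundaries are derived from the dependency parser; ``spaCy`` also offers the rule-based "Sentencizer" component as a lighter alternative. For more details, please refer to [[6]](https://spacy.io/api/sentencizer).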
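
The rule-based "Sentencizer" can also be used without a trained model. The following minimal sketch assumes spaCy v3, where the component is added by its name ``"sentencizer"``. The names ``nlp_rules`` and ``demo`` and the sample text are illustrative and separate from the ``Doc`` object used in the rest of this notebook.

```python
import spacy

# Blank English pipeline plus the rule-based sentence splitter.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

demo = nlp_rules("Tokenization splits text. Each sentence becomes a span.")

# Sentences are Span objects; ".text" gives their string content.
sentences = [sent.text for sent in demo.sents]

# Tokens that open a sentence are flagged with "is_sent_start".
starts = [token.text for token in demo if token.is_sent_start]

print(sentences)
print(starts)
```

Because the "Sentencizer" only looks at punctuation, its boundaries may differ from those of the parser-based approach on more complex text, for example around quotation marks.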

In [None]:
# Iterate over the sentences in the document with "doc.sents".
# Each sentence is a Span, i.e., a slice of the Doc.
# spaCy stores the sentence boundaries on the tokens themselves:
# "token.is_sent_start" is True for the first token of each sentence
# and False for all other tokens.
for sentence in doc.sents:
    print(sentence)

I am non-vegetarian, please send me the menu to abs-xyz@gmail.com.
"They are going to the U.K. and then to the U.S.A"


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://spacy.io/usage/spacy-101
- [3] https://spacy.io
- [4] https://spacy.io/models
- [5] https://spacy.io/usage/processing-pipelines
- [6] https://spacy.io/api/sentencizer

Copyright © 2022 IU International University of Applied Sciences