# Tokenizers divide strings into lists of substrings. For example, a sentence tokenizer finds the list of sentences in a string, and a word tokenizer finds the list of words.

## Many NLP tools include a sentence tokenization function, such as OpenNLP, NLTK, TextBlob, and MBSP.

In [4]:
from nltk.tokenize import sent_tokenize

text = 'this is a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it’s your turn.'

In [5]:
sent_tokenize_list = sent_tokenize(text)

In [6]:
len(sent_tokenize_list)

5

In [7]:
sent_tokenize_list

['this is a sent tokenize test.',
 'this is sent two.',
 'is this sent three?',
 'sent 4 is cool!',
 'Now it’s your turn.']

### sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows which punctuation and characters mark the end of one sentence and the beginning of the next.
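To see why a trained model helps, here is a minimal sketch of a purely rule-based splitter. It is not how Punkt works; `naive_sent_split` is a hypothetical helper, written with only the standard library, that splits after `.`, `!` or `?` when whitespace follows.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works on simple input.
sample = "this is sent one. is this sent two? sent 3 is cool!"
print(naive_sent_split(sample))

# Breaks on abbreviations: "Dr." is treated as a sentence end.
tricky = "Dr. Smith arrived. He sat down."
print(naive_sent_split(tricky))
```

The second call returns `'Dr.'` as a sentence of its own. Avoiding that kind of mistake is the reason for using a tokenizer that has learned which periods actually end sentences.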

#### NLTK supports sentence tokenization for 17 European languages in total. You can load a language-specific tokenizer as follows:

In [12]:
import nltk.data

## English

tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
tokenizer.tokenize(text)

['this is a sent tokenize test.',
 'this is sent two.',
 'is this sent three?',
 'sent 4 is cool!',
 'Now it’s your turn.']

In [14]:
## Spanish

spanish_tokenizer = nltk.data.load("tokenizers/punkt/spanish.pickle")
spanish_tokenizer.tokenize("Hola amigo. Estoy bien.")

['Hola amigo.', 'Estoy bien.']
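The two cells above differ only in the language name inside the resource path, so the pattern can be wrapped in a small helper. This is a sketch under assumptions: `punkt_resource` and `load_sentence_tokenizer` are hypothetical names, and loading only succeeds if the Punkt pickle models are available in your NLTK data directory.

```python
def punkt_resource(language):
    """Build the resource path in the same form as the cells above."""
    return "tokenizers/punkt/{}.pickle".format(language.lower())

def load_sentence_tokenizer(language):
    """Load the Punkt tokenizer for a language (requires NLTK and its data)."""
    # Imported lazily so the path helper can be used without NLTK installed.
    import nltk.data
    return nltk.data.load(punkt_resource(language))

print(punkt_resource("Spanish"))
```

With the models in place, `load_sentence_tokenizer("spanish").tokenize("Hola amigo. Estoy bien.")` should behave like the Spanish cell above. `sent_tokenize` also accepts a `language` argument, for example `sent_tokenize(text, language="spanish")`.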

## Tokenizing text into words

Tokenizing text into words in NLTK is very simple: just call word_tokenize from the nltk.tokenize module.

In [15]:
from nltk.tokenize import word_tokenize

In [16]:
word_tokenize("Hello World.")

['Hello', 'World', '.']

In [17]:
word_tokenize('this is a test.')

['this', 'is', 'a', 'test', '.']

### Another equivalent method

In [18]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('this’s a test')

['this', '’', 's', 'a', 'test']

### Besides TreebankWordTokenizer, there are other word tokenizers, such as PunktWordTokenizer and WordPunctTokenizer:

PunktWordTokenizer splits on punctuation but keeps it with the word. (It throws an error here because of version conflicts; the class is missing from newer NLTK releases.)

WordPunctTokenizer splits all punctuation into separate tokens:

In [21]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()

In [22]:
word_punct_tokenizer.tokenize('This’s a test')

['This', '’', 's', 'a', 'test']
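WordPunctTokenizer is documented as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below reproduces that behaviour with the standard library only, to show what the pattern does; `word_punct_split` is a hypothetical name, not part of NLTK.

```python
import re

def word_punct_split(text):
    """Return runs of word characters, or runs of punctuation, as tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_split("This’s a test"))
print(word_punct_split("Hello, World!!"))
```

Note that consecutive punctuation marks stay together as one token (`'!!'`), because the second alternative matches a run of non-word, non-space characters.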