#### Tokenization in NLP

##### Tokenization Using Python's Inbuilt Method

In [2]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
# Split text by whitespace
tokens = text.split()

In [3]:
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge,', 'library', 'and', 'purpose', 'of', 'modeling.']


Notice that tokens such as 'data.' and 'modeling.' in the list above still carry trailing punctuation. With no arguments, Python's split() method splits only on whitespace, so it does not treat punctuation as a separate token.
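One simple way to drop the trailing punctuation is to strip it from each piece after splitting. The snippet below is a minimal sketch: the shortened sample text is an assumption made for illustration, and stripping with string.punctuation is just one possible cleanup, not the only one.

```python
import string

# Shortened sample text (an assumption for illustration, not the notebook's full text)
text = "We can choose any method based on language, library and purpose of modeling."

# Split on whitespace, then strip leading/trailing punctuation from each piece
tokens = [word.strip(string.punctuation) for word in text.split()]

# Drop pieces that consisted only of punctuation and are now empty
tokens = [token for token in tokens if token]

print(tokens)
```

Note that strip() only removes punctuation at the ends of a piece, so characters inside a word (for example an apostrophe) are left untouched.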

##### Sentence Tokenization

In [4]:
# Let's split the given text on the full stop (.)
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

In [5]:
text.split(". ") # The space after the full stop avoids an empty string at the end of the list.

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method.']

As you can see, because split() accepts only one separator at a time, it failed to split the last two sentences at the exclamation mark (!). We could work around this by applying split() several times with different separators, but there are better ways to do it.
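For illustration, here is a rough sketch of the "apply split() several times" workaround. The sample text and the list of separators are assumptions chosen for this example.

```python
# Hypothetical sample text with three different sentence endings
text = "First sentence. Second sentence! Third sentence? Fourth sentence."

# Separators to try one after another (each includes the trailing space)
separators = [". ", "! ", "? "]

sentences = [text]
for separator in separators:
    # Split every piece obtained so far on the current separator
    sentences = [piece for sentence in sentences for piece in sentence.split(separator)]

print(sentences)
```

This works, but every new separator needs another pass over the data, which is why a single regular expression is usually more convenient.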


##### Tokenization Using Regular Expressions (RegEx)

##### Word Tokenization

In [6]:
import re

In [7]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = re.findall(r"[\w]+", text)  # Raw string so that \w is passed to the regex engine unchanged

In [8]:
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


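[] :    A set of characters; the match can be any one character from the set.
\w :    Matches any word character: letters (a-z, A-Z), digits (0-9), and the underscore (_). For Python 3 string patterns, this also covers Unicode letters and digits by default.
+  :    One or more occurrences of the preceding element.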
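A small sketch of what this pattern does with apostrophes and hyphens, plus one possible variant that keeps punctuation as separate tokens. The sample sentence and the second pattern are assumptions added for illustration.

```python
import re

# Hypothetical sample sentence with an apostrophe and hyphens
sample = "Don't split state-of-the-art models, please!"

# \w+ keeps only runs of word characters, so "Don't" is broken at the apostrophe
word_tokens = re.findall(r"\w+", sample)
print(word_tokens)

# Variant: a run of word characters OR a single character that is
# neither a word character nor whitespace (i.e. punctuation)
tokens_with_punct = re.findall(r"\w+|[^\w\s]", sample)
print(tokens_with_punct)
```

Whether punctuation should be dropped or kept as tokens depends on the downstream task, so neither pattern is the "right" one in general.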

##### Sentence Tokenization

In [9]:
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
tokens_sent = re.compile(r'[.!?] ').split(text) # The character class [.!?] matches any one of the three separators, followed by a space

In [10]:
tokens_sent

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time',
 'So sentence tonenization wont be foolproof with split() method.']

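As the result above shows, a single pattern lets us split sentences on multiple separators. Note that the matched separator is removed, so only the last sentence keeps its closing punctuation.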
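If the closing punctuation should stay attached to each sentence, one option is to split on the whitespace that follows a separator instead of on the separator itself. This is a minimal sketch using a lookbehind assertion; the sample text is an assumption for illustration.

```python
import re

# Hypothetical sample text with three kinds of sentence endings
text = "Periods end sentences. So do exclamation marks! What about questions? Yes."

# (?<=[.!?]) asserts that the previous character is '.', '!' or '?'
# without consuming it; only the whitespace after it is used as the split point
sentences = re.split(r"(?<=[.!?])\s+", text)

print(sentences)
```

Like the pattern above, this is still a heuristic: abbreviations such as "e.g. " would be treated as sentence boundaries.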

##### Tokenization Using NLTK

In [1]:
from nltk.tokenize import word_tokenize

In [3]:
import nltk
nltk.download('punkt')  # Recent NLTK releases may ask for the 'punkt_tab' resource instead

[nltk_data] Downloading package punkt to /Users/h6x/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = word_tokenize(text)

In [5]:
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


In [6]:
from nltk.tokenize import sent_tokenize

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
sent_tokenize(text)

['Characters like periods, exclamation point and newline char are used to separate the sentences.',
 'But one drawback with split() method, that we can only use one separator at a time!',
 'So sentence tonenization wont be foolproof with split() method.']

##### Tokenization Using spaCy

In [7]:
# Import the English language class from spaCy
from spacy.lang.en import English

In [8]:
# Create a blank English pipeline. It contains only the tokenizer, with no tagger, parser, or NER.
# Calling the 'nlp' object on a text returns a 'Doc' object that holds the tokens.
nlp = English()

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

In [9]:
# Now we process the text above with the 'nlp' object, which returns a Doc containing the tokens
my_doc = nlp(text)

In [10]:
# The step above has already tokenized our text, but the result is a Doc object, so let's write a for loop to build a list of token strings
token_list = []
for token in my_doc:
    token_list.append(token.text)

In [11]:
print(token_list)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


##### Sentence Tokenization

In [14]:
# Create a blank English pipeline (tokenizer only)
nlp = English()

# Add the rule-based 'sentencizer' component, which sets sentence boundaries
nlp.add_pipe('sentencizer')


text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

# Calling the nlp object creates a Doc whose sentence boundaries are set by the sentencizer
doc = nlp(text)

# Create a list of sentence strings
sentence_list = []
for sentence in doc.sents:
    sentence_list.append(sentence.text)
print(sentence_list)

['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', 'So sentence tonenization wont be foolproof with split() method.']


##### Tokenization Using Keras

To perform word tokenization, we use the text_to_word_sequence() function from the keras.preprocessing.text module. By default, this function does three things:

- Splits the text on spaces (split=" ").
- Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n').
- Converts the text to lowercase (lower=True).
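To make the three default steps concrete, they can be approximated in plain Python. The function name simple_word_sequence is hypothetical, and this is only a sketch of the behavior described above, not the Keras implementation itself.

```python
# Default filter characters, as listed in the description above
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical helper that approximates the three default steps."""
    if lower:
        # Step 1: convert the text to lowercase
        text = text.lower()
    # Step 2: replace every filtered character with the split string
    text = text.translate(str.maketrans({char: split for char in filters}))
    # Step 3: split on the split string and drop empty pieces
    return [token for token in text.split(split) if token]


print(simple_word_sequence("Hello, World! It's fine."))
print(simple_word_sequence("Hello, World! It's fine.", lower=False))
```

The apostrophe is not among the default filter characters, so "it's" stays a single token in this sketch.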

In [7]:
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence

In [8]:
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""

tokens = text_to_word_sequence(text)
print(tokens)

['there', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'we', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


As you can see, all words are also converted to lowercase. This is the default behavior; we can change it through the arguments, e.g. text_to_word_sequence(text, lower=False).


##### Sentence Tokenization

In [9]:
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

text_to_word_sequence(text, split=".", filters="!.\n")  # '!', '.' and newline act as separators; the text is then split on '.'

['characters like periods, exclamation point and newline char are used to separate the sentences',
 ' but one drawback with split() method, that we can only use one separator at a time',
 ' so sentence tonenization wont be foolproof with split() method']

Note that the result is lowercased, the closing punctuation of every sentence is dropped, and each piece after the first keeps a leading space, because only the separator character itself is removed.