In [1]:
# Install the libraries needed for tokenization (uncomment the lines below to run)
#!pip install nltk -U -q
#!pip install spacy -U -q

In [2]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

import nltk
import re
import spacy
import zipfile
import requests
from io import BytesIO
from nltk.tokenize import word_tokenize

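# load the spaCy English language model
# (it must already be installed, for example with: python -m spacy download en_core_web_sm)
sp = spacy.load('en_core_web_sm')

# download the 'punkt' tokenizer models
# (recent NLTK releases may instead require nltk.download('punkt_tab'))
nltk.download('punkt')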

[nltk_data] Downloading package punkt to C:\Users\Manikant IIT
[nltk_data]     Delhi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True


## Read and load the data
For this project, we will use the Asian religious text data set from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/dataset/512/a+study+of+asian+religious+and+biblical+texts). This tutorial focuses on tokenizing one text file in order to highlight the additional steps involved in creating high-quality tokenized text data. We will also explore different types of tokenization. Note that many of these steps can be combined in a single Python function that iterates through a text corpus.

- Download the Asian religious text data from the UCI Machine Learning Repository.
- Unzip the file and read the `Complete_data` text file.
- Remove successive whitespace and line breaks.

In [3]:
# URL of the ZIP file on the UCI Machine Learning Repository
url = 'https://archive.ics.uci.edu/static/public/512/a+study+of+asian+religious+and+biblical+texts.zip'

# Download the ZIP file and stop early if the request failed
response = requests.get(url)
response.raise_for_status()

with zipfile.ZipFile(BytesIO(response.content)) as z:
    # Open 'Complete_data .txt' inside the archive (the file name contains a space)
    with z.open('Complete_data .txt') as file:
        # Read and decode the file, skipping bytes that are not valid UTF-8
        raw_bytes = file.read()
        working_txt = raw_bytes.decode("utf-8", errors="ignore")

        # Clean text by collapsing successive whitespace and line breaks
        clean_txt = re.sub(r"\s+", " ", working_txt).strip()

# Print the first 500 characters of the cleaned text
print(clean_txt[:500])

0.1 1.The Buddha: "What do you think, Rahula: What is a mirror for?"The Buddha:Rahula: "For reflection, sir."Rahula:The Buddha: "In the same way, Rahula, bodily acts, verbal acts, & mental acts are to be done with repeated reflection.The Buddha:"Whenever you want to perform a bodily act, you should reflect on it: 'This bodily act I want to perform would it lead to self-affliction, to the affliction of others, or to both? Is it an unskillful bodily act, with painful consequences, painful results?


## Tokenize the text
The NLTK library comes with functions to tokenize text at various degrees of granularity. For this first task, we will tokenize at the word level. We can pass our cleaned text string through the `word_tokenize()` function.

In [4]:
tokens = word_tokenize(clean_txt)
print(tokens[:200])

['0.1', '1.The', 'Buddha', ':', '``', 'What', 'do', 'you', 'think', ',', 'Rahula', ':', 'What', 'is', 'a', 'mirror', 'for', '?', '``', 'The', 'Buddha', ':', 'Rahula', ':', '``', 'For', 'reflection', ',', 'sir', '.', '``', 'Rahula', ':', 'The', 'Buddha', ':', '``', 'In', 'the', 'same', 'way', ',', 'Rahula', ',', 'bodily', 'acts', ',', 'verbal', 'acts', ',', '&', 'mental', 'acts', 'are', 'to', 'be', 'done', 'with', 'repeated', 'reflection.The', 'Buddha', ':', "''", 'Whenever', 'you', 'want', 'to', 'perform', 'a', 'bodily', 'act', ',', 'you', 'should', 'reflect', 'on', 'it', ':', "'This", 'bodily', 'act', 'I', 'want', 'to', 'perform', 'would', 'it', 'lead', 'to', 'self-affliction', ',', 'to', 'the', 'affliction', 'of', 'others', ',', 'or', 'to', 'both', '?', 'Is', 'it', 'an', 'unskillful', 'bodily', 'act', ',', 'with', 'painful', 'consequences', ',', 'painful', 'results', '?', "'", 'If', ',', 'on', 'reflection', ',', 'you', 'know', 'that', 'it', 'would', 'lead', 'to', 'self-affliction', '

## Remove noisy data

The first four tokens of the tokenization output reveal much about NLTK’s tokenizer:

> “0.1” “1.The” “Buddha” “:”

In tokenization, a delimiter is the character or sequence of characters at which the tokenizer divides text into tokens. The NLTK `word_tokenize()` function splits primarily on whitespace. It also separates words from adjacent punctuation, as shown by the separate output tokens for "Buddha" and its adjacent colon. Even so, the tokenizer is not infallible: it does not recognize "1." and "The" as separate semantic units. This may be due to internal rules that keep a period attached when it is followed directly by another character with no whitespace, which is what keeps decimals such as "0.1" intact.
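To make the role of the delimiter concrete, the sketch below contrasts a pure whitespace split with a small regular expression that also separates punctuation. It uses only Python's `re` module, and the pattern is an illustrative assumption, not the rule set that NLTK actually applies.

```python
import re

sample = '0.1 1.The Buddha: "What do you think'

# Whitespace as the only delimiter: punctuation stays attached to words
whitespace_tokens = sample.split()

# Illustrative pattern (an assumption, not NLTK's rules): runs of letters,
# numbers with an optional decimal part, or any single punctuation mark
pattern = r"[A-Za-z]+|\d+(?:\.\d+)?|[^\w\s]"
regex_tokens = re.findall(pattern, sample)

print(whitespace_tokens)
print(regex_tokens)
```

With this pattern, "1.The" is divided into "1", ".", and "The", while "0.1" survives as a single number. The whitespace split keeps "1.The" and "Buddha:" as single tokens.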

**Overall, our tokenized output contains a lot of noise. Some tokens consist of nothing but punctuation, such as colons or quotation marks, and others combine digits and letters. This will create problems if we want to use the tokenized data for training a classifier or for word embedding. We can remove non-alphabetic tokens using this command:**

In [7]:
# remove non-alphabetic tokens
filtered_tokens_alpha = [word for word in tokens if word.isalpha()]
print(filtered_tokens_alpha[:200])

['Buddha', 'What', 'do', 'you', 'think', 'Rahula', 'What', 'is', 'a', 'mirror', 'for', 'The', 'Buddha', 'Rahula', 'For', 'reflection', 'sir', 'Rahula', 'The', 'Buddha', 'In', 'the', 'same', 'way', 'Rahula', 'bodily', 'acts', 'verbal', 'acts', 'mental', 'acts', 'are', 'to', 'be', 'done', 'with', 'repeated', 'Buddha', 'Whenever', 'you', 'want', 'to', 'perform', 'a', 'bodily', 'act', 'you', 'should', 'reflect', 'on', 'it', 'bodily', 'act', 'I', 'want', 'to', 'perform', 'would', 'it', 'lead', 'to', 'to', 'the', 'affliction', 'of', 'others', 'or', 'to', 'both', 'Is', 'it', 'an', 'unskillful', 'bodily', 'act', 'with', 'painful', 'consequences', 'painful', 'results', 'If', 'on', 'reflection', 'you', 'know', 'that', 'it', 'would', 'lead', 'to', 'to', 'the', 'affliction', 'of', 'others', 'or', 'to', 'both', 'it', 'would', 'be', 'an', 'unskillful', 'bodily', 'act', 'with', 'painful', 'consequences', 'painful', 'results', 'then', 'any', 'bodily', 'act', 'of', 'that', 'sort', 'is', 'absolutely', '

Unfortunately, since this command removes every token that contains a non-alphabetic character, we lose tokens that contain actual words, such as “1.The”, “reflection.The”, and “self-affliction”. Of course, a token consisting only of “The” may be removed anyway if later steps in our text preprocessing pipeline apply a stopword list.
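The effect is easy to inspect on a handful of tokens. The sketch below copies a few tokens from the `word_tokenize()` output above into a hypothetical `sample_tokens` list and separates what `isalpha()` keeps from what it drops.

```python
# A few tokens copied from the word_tokenize output above
sample_tokens = ['0.1', '1.The', 'Buddha', ':', 'reflection.The',
                 'self-affliction', 'results', '?']

# Tokens made up entirely of letters survive the filter
kept = [tok for tok in sample_tokens if tok.isalpha()]
dropped = [tok for tok in sample_tokens if not tok.isalpha()]

# Dropped tokens that still contain at least one letter, that is, lost words
lost_words = [tok for tok in dropped if any(ch.isalpha() for ch in tok)]

print(kept)
print(lost_words)
```

Here `lost_words` holds the tokens that carried real words but were discarded because of an attached digit, period, or hyphen.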

For now, let’s assume we want every word from the initial text in our tokenized output. To account for cases such as “1.The,” we need to remove non-alphabetic characters before tokenization. We can adapt the regular expression we previously used to remove whitespace and line breaks from the raw text. Because some words are separated only by punctuation marks with no whitespace, we will replace every non-alphabetic character with a single space, then remove successive, leading, and trailing spaces:

In [8]:
# replace non-alphabetic characters with single whitespace
reg_txt = re.sub(r'[^a-zA-Z\s]', ' ', clean_txt)
# remove any whitespace that appears in sequence
reg_txt = re.sub(r"\s+", " ", reg_txt)
# remove any new leading and trailing whitespace
reg_txt = reg_txt.strip()

print(reg_txt[:500])

The Buddha What do you think Rahula What is a mirror for The Buddha Rahula For reflection sir Rahula The Buddha In the same way Rahula bodily acts verbal acts mental acts are to be done with repeated reflection The Buddha Whenever you want to perform a bodily act you should reflect on it This bodily act I want to perform would it lead to self affliction to the affliction of others or to both Is it an unskillful bodily act with painful consequences painful results If on reflection you know that i


Now, we can tokenize the regularized text:

In [9]:
# tokenize regularized text
reg_tokens = word_tokenize(reg_txt)
print(reg_tokens[:200])

['The', 'Buddha', 'What', 'do', 'you', 'think', 'Rahula', 'What', 'is', 'a', 'mirror', 'for', 'The', 'Buddha', 'Rahula', 'For', 'reflection', 'sir', 'Rahula', 'The', 'Buddha', 'In', 'the', 'same', 'way', 'Rahula', 'bodily', 'acts', 'verbal', 'acts', 'mental', 'acts', 'are', 'to', 'be', 'done', 'with', 'repeated', 'reflection', 'The', 'Buddha', 'Whenever', 'you', 'want', 'to', 'perform', 'a', 'bodily', 'act', 'you', 'should', 'reflect', 'on', 'it', 'This', 'bodily', 'act', 'I', 'want', 'to', 'perform', 'would', 'it', 'lead', 'to', 'self', 'affliction', 'to', 'the', 'affliction', 'of', 'others', 'or', 'to', 'both', 'Is', 'it', 'an', 'unskillful', 'bodily', 'act', 'with', 'painful', 'consequences', 'painful', 'results', 'If', 'on', 'reflection', 'you', 'know', 'that', 'it', 'would', 'lead', 'to', 'self', 'affliction', 'to', 'the', 'affliction', 'of', 'others', 'or', 'to', 'both', 'it', 'would', 'be', 'an', 'unskillful', 'bodily', 'act', 'with', 'painful', 'consequences', 'painful', 'resul

We now have a tokenized output with far less non-alphabetic noise. While it is not necessary to remove non-alphabetic characters from text before tokenization, doing so simplifies additional preprocessing (for example, stemming and lemmatization) and can produce more accurate results in NLP tasks such as sentiment analysis.
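The cleaning and tokenizing steps so far can be combined into one function and applied across a corpus. The following is a minimal sketch under stated assumptions: the function name `regularize_and_tokenize`, its `tokenizer` parameter, and the two-document `corpus` are hypothetical, and `str.split` stands in for `word_tokenize` so that the example runs without NLTK.

```python
import re

def regularize_and_tokenize(raw_text, tokenizer=str.split):
    """Clean raw text and return word tokens.

    `tokenizer` is any callable that maps a string to a list of tokens.
    str.split is a stand-in here; word_tokenize could be passed instead.
    """
    # Replace non-alphabetic characters with a space
    text = re.sub(r"[^a-zA-Z\s]", " ", raw_text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return tokenizer(text)

# Hypothetical two-document corpus for illustration
corpus = {
    "doc_a": '0.1 1.The Buddha: "What is a mirror for?"',
    "doc_b": "repeated reflection.The Buddha:\n\nself-affliction",
}

tokenized_corpus = {
    name: regularize_and_tokenize(text) for name, text in corpus.items()
}
print(tokenized_corpus)
```

Passing the tokenizer as an argument keeps the cleaning logic separate from the choice of tokenizer, so the same function can be reused with `word_tokenize` on the full data set.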

## Other tokenization methods
Word-level tokenization is one of the most common types of tokenization in preparing for NLP tasks, but it is not the only level of granularity for tokenizing text.

### Character tokenization
One issue that can arise when using a word-level tokenizer is unknown word tokens. If you use a tokenizer with a pre-trained vocabulary, out-of-vocabulary (OOV) terms (that is, words not in that vocabulary) might be returned as unknown tokens ([UNK]). Character tokenization is one way to address this. Because it tokenizes at the character level, the chances of meeting OOV terms are minuscule. But character tokens in and of themselves might not provide meaningful or helpful data for NLP tasks that focus on word-level units, such as word embedding models.
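The contrast can be sketched with a toy vocabulary. Everything in the example below is hypothetical: the `vocabulary` set, the `lookup_words` helper, and the sample words are made up for illustration and do not reflect how any particular pre-trained tokenizer stores its vocabulary.

```python
# Toy vocabulary (hypothetical; real pre-trained vocabularies are far larger)
vocabulary = {"the", "buddha", "what", "is", "a", "mirror", "for"}

def lookup_words(words, vocab):
    """Keep each known word and replace unknown words with [UNK]."""
    return [w if w.lower() in vocab else "[UNK]" for w in words]

words = ["The", "Buddha", "Rahula"]
word_level = lookup_words(words, vocabulary)

# Character-level tokens: every character of the unknown word comes from
# a small alphabet, so nothing needs to be mapped to [UNK]
char_level = list("Rahula")

print(word_level)
print(char_level)
```

The word "Rahula" is lost at the word level but fully preserved as six character tokens, at the cost of tokens that carry little meaning on their own.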

NLTK's `regexp_tokenize()` function requires a pattern that specifies what counts as a token. The pattern in the following code matches any single non-whitespace character (`\S`) or any single whitespace character (`\s`), so letters, digits, punctuation marks, and spaces each become individual tokens:

In [11]:
# import NLTK regular expression tokenizer
from nltk.tokenize import regexp_tokenize

# Exercise: use pattern = r"\S|\s" with regexp_tokenize on clean_txt
# tokenize text at character level
pattern = r"\S|\s"
character_tokens = regexp_tokenize(clean_txt, pattern)

# print first 200 character tokens
print(character_tokens[:200])

['0', '.', '1', ' ', '1', '.', 'T', 'h', 'e', ' ', 'B', 'u', 'd', 'd', 'h', 'a', ':', ' ', '"', 'W', 'h', 'a', 't', ' ', 'd', 'o', ' ', 'y', 'o', 'u', ' ', 't', 'h', 'i', 'n', 'k', ',', ' ', 'R', 'a', 'h', 'u', 'l', 'a', ':', ' ', 'W', 'h', 'a', 't', ' ', 'i', 's', ' ', 'a', ' ', 'm', 'i', 'r', 'r', 'o', 'r', ' ', 'f', 'o', 'r', '?', '"', 'T', 'h', 'e', ' ', 'B', 'u', 'd', 'd', 'h', 'a', ':', 'R', 'a', 'h', 'u', 'l', 'a', ':', ' ', '"', 'F', 'o', 'r', ' ', 'r', 'e', 'f', 'l', 'e', 'c', 't', 'i', 'o', 'n', ',', ' ', 's', 'i', 'r', '.', '"', 'R', 'a', 'h', 'u', 'l', 'a', ':', 'T', 'h', 'e', ' ', 'B', 'u', 'd', 'd', 'h', 'a', ':', ' ', '"', 'I', 'n', ' ', 't', 'h', 'e', ' ', 's', 'a', 'm', 'e', ' ', 'w', 'a', 'y', ',', ' ', 'R', 'a', 'h', 'u', 'l', 'a', ',', ' ', 'b', 'o', 'd', 'i', 'l', 'y', ' ', 'a', 'c', 't', 's', ',', ' ', 'v', 'e', 'r', 'b', 'a', 'l', ' ', 'a', 'c', 't', 's', ',', ' ', '&', ' ', 'm', 'e', 'n', 't', 'a', 'l', ' ', 'a', 'c', 't', 's', ' ', 'a', 'r', 'e', ' ', 't', 'o']

You can also implement RegEx commands (similar to those used with the word tokenizer) before character tokenization to remove digits, punctuation, and whitespace if required. Doing this cleanup will help remove whitespace tokens if you only care about the actual alphabetic characters used.
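As a minimal sketch of that cleanup, the example below strips everything except ASCII letters from a short sample string and then splits the result into characters. It uses only Python's `re` module and `list()` as stand-ins, so it runs without NLTK; the sample string is taken from the start of the cleaned text.

```python
import re

sample = '0.1 1.The Buddha: "What is a mirror for?"'

# Drop everything that is not an ASCII letter, including whitespace
letters_only = re.sub(r"[^a-zA-Z]", "", sample)

# Each remaining character becomes its own token
letter_tokens = list(letters_only)

print(letters_only)
print(letter_tokens[:9])
```

Note that removing whitespace also removes word boundaries, so this variant only suits tasks that care about the letters themselves.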

### Sentence tokenization
Sentence tokenization has several use cases, such as sentiment analysis tasks or machine translation. For example, with regard to machine translation, a word’s significance or meaning in another language cannot always be determined in isolation from its context. In this case, you might prefer a sentence tokenization algorithm as opposed to word-level tokenization.


In [13]:
# Solution
# import the NLTK sentence tokenizer
from nltk.tokenize import sent_tokenize

# Exercise: tokenize clean_txt at sentence level using sent_tokenize
sentence_tokens = sent_tokenize(clean_txt)

# print first 10 sentence tokens
print(sentence_tokens[:10])

['0.1 1.The Buddha: "What do you think, Rahula: What is a mirror for?', '"The Buddha:Rahula: "For reflection, sir.', '"Rahula:The Buddha: "In the same way, Rahula, bodily acts, verbal acts, & mental acts are to be done with repeated reflection.The Buddha:"Whenever you want to perform a bodily act, you should reflect on it: \'This bodily act I want to perform would it lead to self-affliction, to the affliction of others, or to both?', "Is it an unskillful bodily act, with painful consequences, painful results?'", 'If, on reflection, you know that it would lead to self-affliction, to the affliction of others, or to both; it would be an unskillful bodily act with painful consequences, painful results, then any bodily act of that sort is absolutely unfit for you to do.', 'But if on reflection you know that it would not cause affliction... it would be a skillful bodily act with happy consequences, happy results, then any bodily act of that sort is fit for you to do.', '(Similarly with verba

**The third token in this output contains several syntactic units, which should be divided into several tokens. The tokenizer most likely clumps them together because of the original text file's inconsistent formatting, such as missing whitespace before and after punctuation marks. There are ways to correct for these inconsistencies, but they require a more involved regularization process that is beyond the scope of this tutorial.**
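Although a full regularization is out of scope, one small piece of it can be sketched: inserting the missing space after sentence-ending punctuation. The example below is a crude heuristic under stated assumptions, not a complete fix. The sample string is abridged from the text, and a naive `re.split` stands in for a real sentence tokenizer so that the example runs without NLTK.

```python
import re

sample = ('bodily acts are to be done with repeated reflection.'
          'The Buddha: "Whenever you want to act, reflect on it."')

# Insert a space after ., ?, or ! when a capital letter follows directly.
# Crude heuristic: it would also split abbreviations such as "U.S.A".
repaired = re.sub(r"([.?!])(?=[A-Z])", r"\1 ", sample)

# Naive splitter used as a stand-in for a real sentence tokenizer
naive_sentences = re.split(r"(?<=[.?!])\s+", repaired)

for sentence in naive_sentences:
    print(sentence)
```

Because the heuristic cannot tell a sentence boundary from an abbreviation or a numbered heading such as "1.The", any real pipeline would need additional rules tailored to the data.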

Sina Nazeri, Ph.D.

Kang Wang is a Data Scientist at IBM. He is also a PhD Candidate at the University of Waterloo.

Original IBM Developer tutorial written by Jacob Murel, Ph.D.