This notebook explores common tokenization techniques in NLP using NLTK: the sentence tokenizer, the word tokenizer, the word-punctuation tokenizer, and the Treebank word tokenizer.

In [1]:
pip install nltk



In [2]:
corpus = """"The Night Librarian" by Christopher Lincoln brings a unique twist to the adventure genre, combining the nostalgic charm of "Night at the Museum" with the fantastical elements of "The Land of Stories." The story follows twins Page and Turner, who have always found solace in the New York Public Library amidst their parents' frequent travels. However, their routine visits take an unexpected turn when they embark on a secret mission that revolves around their father's rare edition of Bram Stoker's "Dracula."

As the twins delve deeper, they encounter a world hidden within the library's walls, brought to life by the enigmatic Night Librarian. This character serves as their guide through a realm where famous literary heroes and villains have broken free from their pages, creating a dynamic and often chaotic environment. The stakes are high as Page and Turner, alongside their newfound allies, must prevent the library's imminent destruction.

"The Night Librarian" is a commendable effort that will likely appeal to young readers and fans of literary adventures. Its blend of mystery, magic, and familiar faces from classic literature offers a captivating experience, even if it doesn't fully realize its potential. With more focused storytelling and deeper character exploration, this series has the potential to become a beloved staple in the graphic novel genre. For now, it stands as an enjoyable, if somewhat uneven, introduction to the magical world hidden within the library's walls."""

## Sentence Tokenization

In [3]:
from nltk.tokenize import sent_tokenize

In [4]:
import nltk

In [5]:
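# Punkt models back sent_tokenize and word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')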

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
docs = sent_tokenize(corpus)

In [7]:
docs

['"The Night Librarian" by Christopher Lincoln brings a unique twist to the adventure genre, combining the nostalgic charm of "Night at the Museum" with the fantastical elements of "The Land of Stories."',
 "The story follows twins Page and Turner, who have always found solace in the New York Public Library amidst their parents' frequent travels.",
 'However, their routine visits take an unexpected turn when they embark on a secret mission that revolves around their father\'s rare edition of Bram Stoker\'s "Dracula."',
 "As the twins delve deeper, they encounter a world hidden within the library's walls, brought to life by the enigmatic Night Librarian.",
 'This character serves as their guide through a realm where famous literary heroes and villains have broken free from their pages, creating a dynamic and often chaotic environment.',
 "The stakes are high as Page and Turner, alongside their newfound allies, must prevent the library's imminent destruction.",
 '"The Night Librarian" is

In [8]:
for sent in docs:
    print(sent)

"The Night Librarian" by Christopher Lincoln brings a unique twist to the adventure genre, combining the nostalgic charm of "Night at the Museum" with the fantastical elements of "The Land of Stories."
The story follows twins Page and Turner, who have always found solace in the New York Public Library amidst their parents' frequent travels.
However, their routine visits take an unexpected turn when they embark on a secret mission that revolves around their father's rare edition of Bram Stoker's "Dracula."
As the twins delve deeper, they encounter a world hidden within the library's walls, brought to life by the enigmatic Night Librarian.
This character serves as their guide through a realm where famous literary heroes and villains have broken free from their pages, creating a dynamic and often chaotic environment.
The stakes are high as Page and Turner, alongside their newfound allies, must prevent the library's imminent destruction.
"The Night Librarian" is a commendable effort that w
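
To see why a trained sentence tokenizer is useful, it can help to compare it with a naive rule. The sketch below is a standard-library approximation, not NLTK's Punkt algorithm: the hypothetical helper `naive_sent_split` simply breaks after `.`, `!`, or `?` (optionally followed by one closing quote) whenever whitespace follows. The sample text is made up for illustration.

```python
import re

# Break on whitespace that follows sentence-ending punctuation,
# optionally with one closing quote after it.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[.!?][\"'])\s+")

def naive_sent_split(text):
    """Hypothetical helper: split text at every apparent sentence end."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]

sample = 'He read "Dracula." Then he slept. Dr. Smith agreed.'
for sentence in naive_sent_split(sample):
    print(sentence)
```

The naive rule splits `Dr.` off as its own sentence, which is the kind of abbreviation case a trained model such as Punkt is designed to handle.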

## Word Tokenization

In [9]:
from nltk.tokenize import word_tokenize

In [10]:
print(word_tokenize(corpus))

['``', 'The', 'Night', 'Librarian', "''", 'by', 'Christopher', 'Lincoln', 'brings', 'a', 'unique', 'twist', 'to', 'the', 'adventure', 'genre', ',', 'combining', 'the', 'nostalgic', 'charm', 'of', '``', 'Night', 'at', 'the', 'Museum', "''", 'with', 'the', 'fantastical', 'elements', 'of', '``', 'The', 'Land', 'of', 'Stories', '.', "''", 'The', 'story', 'follows', 'twins', 'Page', 'and', 'Turner', ',', 'who', 'have', 'always', 'found', 'solace', 'in', 'the', 'New', 'York', 'Public', 'Library', 'amidst', 'their', 'parents', "'", 'frequent', 'travels', '.', 'However', ',', 'their', 'routine', 'visits', 'take', 'an', 'unexpected', 'turn', 'when', 'they', 'embark', 'on', 'a', 'secret', 'mission', 'that', 'revolves', 'around', 'their', 'father', "'s", 'rare', 'edition', 'of', 'Bram', 'Stoker', "'s", '``', 'Dracula', '.', "''", 'As', 'the', 'twins', 'delve', 'deeper', ',', 'they', 'encounter', 'a', 'world', 'hidden', 'within', 'the', 'library', "'s", 'walls', ',', 'brought', 'to', 'life', 'by',
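
The `` and '' tokens in the output reflect the Penn Treebank convention of rewriting straight double quotes as opening and closing marks. The following is a simplified standard-library sketch of that one idea, not NLTK's actual rule set: `treebank_style_quotes` is a hypothetical helper, and it ignores every other tokenization rule (it only splits on whitespace afterwards).

```python
import re

def treebank_style_quotes(text):
    """Hypothetical helper: rewrite straight double quotes Treebank-style."""
    # A double quote at the very start of the text opens a quotation.
    text = re.sub(r'^"', "`` ", text)
    # So does one that follows a space or an opening bracket.
    text = re.sub(r'([ (\[{<])"', r"\1`` ", text)
    # Any double quote still left is treated as a closing quote.
    text = text.replace('"', " '' ")
    return text.split()

print(treebank_style_quotes('charm of "Night at the Museum" with'))
```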

In [11]:
from nltk.tokenize import wordpunct_tokenize
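
As a rough mental model, `wordpunct_tokenize` behaves like a single regular expression that matches either a run of word characters or a run of punctuation. The sketch below assumes the pattern `\w+|[^\w\s]+`, which is how NLTK documents `WordPunctTokenizer`; `wordpunct_sketch` is a hypothetical stand-in written with the standard library only.

```python
import re

# Assumption: same pattern that NLTK documents for WordPunctTokenizer.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Hypothetical stand-in: runs of word chars or runs of punctuation."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("it doesn't end."))
print(wordpunct_sketch('of "Dracula."'))
```

Under this pattern a contraction such as `doesn't` falls apart into three tokens, and adjacent punctuation such as `."` stays together as one token.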

In [12]:
wp_tokens = wordpunct_tokenize(corpus)  # a list of tokens, not a tokenizer object

In [13]:
from nltk.tokenize import TreebankWordTokenizer

In [14]:
tokenizer = TreebankWordTokenizer()

In [15]:
tb_tokens = tokenizer.tokenize(corpus)

In [16]:
set1 = set(wp_tokens)  # unique word-punctuation tokens

In [17]:
set2 = set(tb_tokens)  # unique Treebank tokens

#### How does the word-punctuation tokenizer differ from the Treebank word tokenizer?

In [18]:
set1 - set2  # items that are in set1 but not in set2

{'"',
 '."',
 'Dracula',
 'Stories',
 'adventures',
 'destruction',
 'doesn',
 'environment',
 's',
 't',
 'travels'}

In [19]:
set2 - set1  # items that are in set2 but not in set1

{"''",
 "'s",
 'Dracula.',
 'Librarian.',
 'Stories.',
 '``',
 'adventures.',
 'destruction.',
 'does',
 'environment.',
 'genre.',
 "n't",
 'potential.',
 'travels.'}
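
The two difference sets suggest where the tokenizers diverge. The word-punctuation tokenizer splits at every boundary between word characters and punctuation (`doesn`, `t`, `Dracula`, `."`), while the Treebank tokenizer keeps contractions as `does` + `n't` and, when given a multi-sentence string, appears to leave sentence-internal periods attached (`Dracula.`, `travels.`). Because a `set` discards duplicates, a count-aware comparison can be more informative. The sketch below uses `collections.Counter`; the two short token lists are hand-written stand-ins modeled on the outputs above, and `token_diff` is a hypothetical helper.

```python
from collections import Counter

def token_diff(tokens_a, tokens_b):
    """Hypothetical helper: tokens (with counts) unique to each list."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Counter subtraction keeps only positive counts.
    return counts_a - counts_b, counts_b - counts_a

# Hand-written stand-ins for the two tokenizers' output on "it doesn't end."
wp_like = ["it", "doesn", "'", "t", "end", "."]
tb_like = ["it", "does", "n't", "end", "."]

only_wp, only_tb = token_diff(wp_like, tb_like)
print(dict(only_wp))
print(dict(only_tb))
```

With the real token lists, the same helper would also show how often each differing token occurs, which plain set subtraction cannot.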