# Chapter 2: Tokenization using NLTK

This notebook demonstrates several text tokenization techniques, from plain string splitting and regular expressions to the NLTK and spaCy libraries. We'll compare how each approach breaks text into tokens, then look at the linguistic features spaCy attaches to them.

## Setup

First, let's import the required libraries and download necessary NLTK data.

In [1]:
import re
from pathlib import Path

import nltk
import spacy
from nltk import word_tokenize
from spacy import displacy

# Download required NLTK data (recent NLTK releases may require "punkt_tab" instead)
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\giloz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Example Texts

We'll use a literary excerpt and several test sentences with various linguistic features to demonstrate different tokenization approaches.

In [2]:
# Original example
text = (
    "Trust me, though, the words were on their way, and when "
    "they arrived, Liesel would hold them in her hands like "
    "the clouds, and she would wring them out, like the rain."
)

# New interesting sentences with various linguistic features
new_sentences = [
    "The AI researcher's model achieved 99.9% accuracy - a groundbreaking result!",
    "Mr. Smith bought a Ph.D. degree from example.com for $9,999...",
    "She exclaimed, 'OMG! This can't be real!' while reading the email.",
    "The code runs fast (about 2.5x faster) than our previous implementation.",
]

## Basic String Tokenization

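Let's start with Python's built-in `str.split()`. It splits on whitespace only, so punctuation stays attached to the neighboring word (note `'me,'` and `'though,'` in the output).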

In [3]:
print("Basic split() tokenization:")
tokens = text.split()
print(tokens[:8])

Basic split() tokenization:
['Trust', 'me,', 'though,', 'the', 'words', 'were', 'on', 'their']
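
A common first patch for the attached punctuation is to trim it from each piece after splitting. The sketch below is illustrative only: `split_and_strip` is a hypothetical helper, not part of the notebook, and it discards punctuation instead of emitting it as separate tokens.

```python
import string

text = (
    "Trust me, though, the words were on their way, and when "
    "they arrived, Liesel would hold them in her hands like "
    "the clouds, and she would wring them out, like the rain."
)


def split_and_strip(s):
    """Split on whitespace, then trim leading/trailing punctuation from each piece."""
    stripped = (piece.strip(string.punctuation) for piece in s.split())
    # Drop pieces that consisted only of punctuation (e.g. a lone "-").
    return [piece for piece in stripped if piece]


tokens = split_and_strip(text)
print(tokens[:8])
```

Because `strip` only touches the ends of each piece, an apostrophe inside a word such as `can't` survives. The punctuation itself is lost, which is one reason to move on to the regex and library tokenizers.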


## Regular Expression Tokenization

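A regular expression gives finer control. The pattern below keeps contractions such as `can't` together and emits each punctuation mark as its own token. It is not number-aware, though: `2.5x` is split into `'2'`, `'.'`, and `'5x'` in the output.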

In [4]:
print("Regex tokenization:")
pattern = r"\w+(?:'\w+)?|[^\w\s]"
texts = [text] + new_sentences
tokens = re.findall(pattern, texts[-1])
print(tokens)

Regex tokenization:
['The', 'code', 'runs', 'fast', '(', 'about', '2', '.', '5x', 'faster', ')', 'than', 'our', 'previous', 'implementation', '.']
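
One way to keep numbers such as `2.5x` intact is to try a number-like alternative before the word alternative. The pattern below is a hypothetical extension of the notebook's pattern, offered as a sketch and not as a complete solution; `NUMBER_AWARE_PATTERN` and `regex_tokenize` are names introduced here for illustration.

```python
import re

# Assumption: a "number" is digits, optionally followed by .digits or ,digits groups,
# optionally followed by a letter suffix (as in "2.5x"). Alternatives are tried left to right.
NUMBER_AWARE_PATTERN = r"\d+(?:[.,]\d+)*\w*|\w+(?:'\w+)?|[^\w\s]"


def regex_tokenize(s):
    """Tokenize with the number-aware pattern."""
    return re.findall(NUMBER_AWARE_PATTERN, s)


sentence = "The code runs fast (about 2.5x faster) than our previous implementation."
number_aware_tokens = regex_tokenize(sentence)
print(number_aware_tokens)
```

The order of the alternatives matters: if the plain `\w+` branch came first, it would consume `2` before the number branch had a chance to match `2.5x`.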


## NLTK Tokenization

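NLTK's `word_tokenize` handles several of these edge cases. In the output below it keeps abbreviations (`Mr.`, `Ph.D.`), decimals (`99.9`), and domain names (`example.com`) intact, and it splits clitics such as `'s` and `n't` into separate tokens.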

In [5]:
print("NLTK tokenization:")
for sentence in new_sentences:
    print("\nOriginal:", sentence)
    print("Tokens:", word_tokenize(sentence))

NLTK tokenization:

Original: The AI researcher's model achieved 99.9% accuracy - a groundbreaking result!
Tokens: ['The', 'AI', 'researcher', "'s", 'model', 'achieved', '99.9', '%', 'accuracy', '-', 'a', 'groundbreaking', 'result', '!']

Original: Mr. Smith bought a Ph.D. degree from example.com for $9,999...
Tokens: ['Mr.', 'Smith', 'bought', 'a', 'Ph.D.', 'degree', 'from', 'example.com', 'for', '$', '9,999', '...']

Original: She exclaimed, 'OMG! This can't be real!' while reading the email.
Tokens: ['She', 'exclaimed', ',', "'OMG", '!', 'This', 'ca', "n't", 'be', 'real', '!', "'", 'while', 'reading', 'the', 'email', '.']

Original: The code runs fast (about 2.5x faster) than our previous implementation.
Tokens: ['The', 'code', 'runs', 'fast', '(', 'about', '2.5x', 'faster', ')', 'than', 'our', 'previous', 'implementation', '.']
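
To see exactly where two tokenizers disagree, their token lists can be aligned with the standard library's `difflib`. The sketch below copies the regex and NLTK token lists for the last sentence from the outputs above; `token_differences` is a hypothetical helper introduced for illustration.

```python
from difflib import SequenceMatcher

# Token lists copied from the regex and NLTK outputs for the last test sentence.
regex_tokens = ['The', 'code', 'runs', 'fast', '(', 'about', '2', '.', '5x', 'faster', ')',
                'than', 'our', 'previous', 'implementation', '.']
nltk_tokens = ['The', 'code', 'runs', 'fast', '(', 'about', '2.5x', 'faster', ')',
               'than', 'our', 'previous', 'implementation', '.']


def token_differences(a, b):
    """Return (span_from_a, span_from_b) pairs for every region where the lists disagree."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = token_differences(regex_tokens, nltk_tokens)
for left, right in differences:
    print(f"regex: {left}  |  NLTK: {right}")
```

For this sentence the only disagreement is the handling of `2.5x`; running the same comparison on the other test sentences would need their regex token lists, which the notebook does not print.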


## spaCy Tokenization and Analysis

spaCy runs a full NLP pipeline: calling `nlp` on a string returns a `Doc` object whose tokens carry linguistic annotations in addition to their text.

In [6]:
print("SpaCy tokenization:")
nlp = spacy.load("en_core_web_sm")
doc = nlp(texts[-1])
print(type(doc))
tokens = [token.text for token in doc]
print(tokens)

SpaCy tokenization:
<class 'spacy.tokens.doc.Doc'>
['The', 'code', 'runs', 'fast', '(', 'about', '2.5x', 'faster', ')', 'than', 'our', 'previous', 'implementation', '.']


## Advanced spaCy Features

Each spaCy token exposes its lemma (`lemma_`), coarse part-of-speech tag (`pos_`), and fine-grained tag (`tag_`). Let's print them for the sentence above.

In [7]:
print("SpaCy's advanced features:")
for token in doc:
    print(f"Token: {token.text:15} | Lemma: {token.lemma_:15} | POS: {token.pos_:10} | Tag: {token.tag_:10}")

SpaCy's advanced features:
Token: The             | Lemma: the             | POS: DET        | Tag: DT        
Token: code            | Lemma: code            | POS: NOUN       | Tag: NN        
Token: runs            | Lemma: run             | POS: VERB       | Tag: VBZ       
Token: fast            | Lemma: fast            | POS: ADV        | Tag: RB        
Token: (               | Lemma: (               | POS: PUNCT      | Tag: -LRB-     
Token: about           | Lemma: about           | POS: ADV        | Tag: RB        
Token: 2.5x            | Lemma: 2.5x            | POS: NOUN       | Tag: NN        
Token: faster          | Lemma: fast            | POS: ADV        | Tag: RBR       
Token: )               | Lemma: )               | POS: PUNCT      | Tag: -RRB-     
Token: than            | Lemma: than            | POS: ADP        | Tag: IN        
Token: our             | Lemma: our             | POS: PRON       | Tag: PRP$      
Token: previous        | Lemma: previous        |
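
Per-token attributes like these are often summarized, for example by counting how many tokens carry each part-of-speech tag. The sketch below uses only the rows that appear in full in the output above, copied as `(text, pos)` pairs; in the notebook the same list could be built as `[(t.text, t.pos_) for t in doc]`. The variable names are introduced here for illustration.

```python
from collections import Counter

# (text, coarse POS) pairs copied from the complete rows of the output above.
tagged = [
    ("The", "DET"), ("code", "NOUN"), ("runs", "VERB"), ("fast", "ADV"),
    ("(", "PUNCT"), ("about", "ADV"), ("2.5x", "NOUN"), ("faster", "ADV"),
    (")", "PUNCT"), ("than", "ADP"), ("our", "PRON"),
]

# Tally the tokens by part-of-speech tag.
pos_counts = Counter(pos for _, pos in tagged)
for pos, count in pos_counts.most_common():
    print(f"{pos:6} {count}")
```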

## Dependency Parsing Visualization

Finally, let's visualize the sentence's dependency parse with spaCy's `displacy` and save the diagram as an SVG file.

In [8]:
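sentence_span = list(doc.sents)[0]
# jupyter=False makes render() return the SVG markup as a string;
# with jupyter=True it displays inline and returns None, leaving nothing to write.
svg = displacy.render(sentence_span, style="dep", jupyter=False)
with Path("sentence_diagram.svg").open("w", encoding="utf-8") as f:
    f.write(svg)
# Display inline in the notebook (spaCy auto-detects Jupyter when `jupyter` is omitted)
displacy.render(sentence_span, style="dep")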

ImportError: cannot import name 'display' from 'IPython.core.display' (c:\dev\ai_experiments\code_cademy_nlp\venv\Lib\site-packages\IPython\core\display.py)