# Tokenization Types with LangChain

In this notebook, we'll use LangChain to demonstrate three common types of tokenization: word, character, and subword.

## 1️⃣ Word Tokenization

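Word tokenization splits text into words. Let's see how to do this with LangChain's `RecursiveCharacterTextSplitter`, using a single space as its only separator. By default the splitter tries the separators `["\n\n", "\n", " ", ""]` in order, so the space has to be passed explicitly. Setting `chunk_size=1` keeps the splitter from merging adjacent words back into larger chunks.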

In [None]:
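# Newer LangChain releases ship this class in the `langchain_text_splitters` package
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "LangChain makes tokenization easy and flexible."

# Word-level tokenization: a space is the only separator, and chunk_size=1
# means every word is emitted as its own chunk.
# keep_separator=False drops the space; otherwise pieces may keep a leading space.
word_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1, chunk_overlap=0, separators=[" "], keep_separator=False
)
word_tokens = word_splitter.split_text(text)
# Punctuation stays attached to the neighboring word (e.g. "flexible.")
print("Word tokens:", word_tokens)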
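For comparison, here is a minimal sketch of word tokenization in plain Python, without LangChain. It shows two common variants: splitting on whitespace, and a regex that also separates punctuation from words. The variable names and the regex are our own illustrative choices, not part of any LangChain API.

```python
import re

text = "LangChain makes tokenization easy and flexible."

# Whitespace split: punctuation stays attached to the neighboring word
whitespace_tokens = text.split()
print("Whitespace tokens:", whitespace_tokens)

# Regex split: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print("Regex tokens:", regex_tokens)
```

The regex variant is usually closer to what people mean by "words", since the final period becomes its own token instead of being glued to `flexible`.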

## 2️⃣ Character Tokenization

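Character tokenization splits text into individual characters, including spaces and punctuation. Passing an empty string as the only separator makes the splitter break the text between every character.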

In [None]:
# Character-level tokenization (split on every character)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=1, chunk_overlap=0, separators=[""])
char_tokens = char_splitter.split_text(text)
print("Character tokens:", char_tokens)
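Character tokens are usually mapped to integer ids before they are fed to a model. The sketch below shows one simple way to do that in plain Python; the names `plain_char_tokens`, `char_vocab`, and `id_to_char` are illustrative, and building the vocabulary from sorted unique characters is just one possible convention.

```python
text = "LangChain makes tokenization easy and flexible."

# list() on a string yields one element per character (spaces included)
plain_char_tokens = list(text)

# Vocabulary: each unique character gets an integer id
char_vocab = {ch: i for i, ch in enumerate(sorted(set(plain_char_tokens)))}
ids = [char_vocab[ch] for ch in plain_char_tokens]

# Decoding reverses the mapping and recovers the original text
id_to_char = {i: ch for ch, i in char_vocab.items()}
decoded = "".join(id_to_char[i] for i in ids)

print("Number of tokens:", len(ids))
print("Vocabulary size:", len(char_vocab))
print("Round trip OK:", decoded == text)
```

The vocabulary is tiny, but the token sequence is as long as the text itself.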

## 3️⃣ Subword Tokenization (Simulated)

LangChain's text splitters do not return subword tokens directly. Real subword tokenizers such as BPE or WordPiece learn their vocabulary from data; here we only simulate the idea by splitting on a few common fragments (like 'ing', 'ion', etc.).

In [None]:
import re

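def simple_subword_tokenizer(text):
    # Illustrative fragments only: 'ing', 'ion', 'iz', 'ed', 'ly', 'er', 'es', 's'
    pattern = r"(ing|ion|iz|ed|ly|er|es|s)"
    tokens = []
    # Split on whitespace first so no token spans a word boundary
    for word in text.split():
        # The capturing group makes re.split keep the matched fragments;
        # empty strings left at word edges are dropped
        tokens.extend(t for t in re.split(pattern, word) if t)
    return tokens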

subword_tokens = simple_subword_tokenizer("Tokenization is amazing and powerful.")
print("Subword tokens:", subword_tokens)
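To get a little closer to how a real subword tokenizer behaves, here is a minimal sketch of greedy longest-match splitting against a fixed vocabulary, in the style of WordPiece. The function `greedy_subword_tokenize` and the hand-written `toy_vocab` are hypothetical and exist only for illustration; a real vocabulary is learned from a corpus. The `##` prefix marks a piece that continues a word, following the WordPiece convention.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical hand-written vocabulary (a real one is learned from data)
toy_vocab = {"token", "##ization", "##ize", "amaz", "##ing",
             "power", "##ful", "is", "and"}

sentence = "tokenization is amazing and powerful"
tokens = [p for w in sentence.split()
          for p in greedy_subword_tokenize(w, toy_vocab)]
print("Greedy subword tokens:", tokens)
```

Unlike the regex simulation, this version only produces pieces that exist in the vocabulary, and it falls back to `[UNK]` when a word cannot be covered.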

---

**Summary:**
- Word tokenization splits by spaces.
- Character tokenization splits by each character.
- Subword tokenization splits by common subword patterns (simulated here).

LangChain's text splitters are built mainly for chunking documents, but as shown here you can also use them to illustrate basic tokenization schemes in your NLP workflows!