
Text Processing

Temp edited this page Oct 3, 2025 · 1 revision

Text Processing Tools

Four specialized tools for text similarity, fuzzy matching, pattern extraction, and full-text search in PostgreSQL.


📊 Overview

| Tool | Purpose | Algorithm |
| --- | --- | --- |
| text_similarity | Find similar text strings | Trigram similarity |
| text_search_advanced | Full-text search with ranking | tsvector/tsquery |
| regex_extract_all | Extract patterns from text | PostgreSQL regex |
| fuzzy_match | Fuzzy string matching | Levenshtein distance |

Requirements: the pg_trgm and fuzzystrmatch extensions must be installed (CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;)


🔧 Tool Details

text_similarity

Find text strings similar to a search term using trigram similarity.

Parameters:

  • table_name (string, required): Table to search
  • text_column (string, required): Column containing text
  • search_text (string, required): Text to find similar matches for
  • similarity_threshold (number, optional): Minimum similarity (0.0-1.0, default: 0.3)
  • limit (integer, optional): Maximum results (default: 10)

Returns:

  • Matching rows with similarity scores
  • Sorted by similarity (highest first)

Example:

result = text_similarity(
    table_name="products",
    text_column="name",
    search_text="laptop",
    similarity_threshold=0.3,
    limit=10
)
# Returns: [
#   {"name": "Laptop Computer", "similarity_score": 0.75},
#   {"name": "Gaming Laptop", "similarity_score": 0.68},
#   {"name": "Laptop Bag", "similarity_score": 0.45}
# ]

Use Cases:

  • Product search with typo tolerance
  • Autocomplete suggestions
  • Duplicate detection
  • "Did you mean?" features

Similarity Thresholds:

  • 0.1-0.2 - Very lenient (catches distant matches)
  • 0.3-0.5 - Moderate (good default)
  • 0.6-0.8 - Strict (close matches only)
  • 0.9-1.0 - Exact matches only
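To build intuition for these thresholds, pg_trgm-style similarity can be approximated in a few lines of Python. This is a simplified sketch, not the extension itself (pg_trgm also handles punctuation and other edge cases):

```python
def trigrams(text: str) -> set[str]:
    # Approximate pg_trgm: lowercase, pad each word with two leading
    # spaces and one trailing space, then take every 3-character window.
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    # Shared trigrams divided by total distinct trigrams, as pg_trgm does.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

print(similarity("laptop", "laptop"))      # identical strings score 1.0
print(similarity("laptop", "Laptop Bag"))  # partial overlap, between 0 and 1
```

Because padding gives word boundaries extra weight, short probe strings match prefixes of longer strings surprisingly well, which is why 0.3 works as a lenient default.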

text_search_advanced

Full-text search with PostgreSQL's tsvector and tsquery for ranked results.

Parameters:

  • table_name (string, required): Table to search
  • text_columns (list, required): Columns to search (can be multiple)
  • search_query (string, required): Search query (supports AND, OR, NOT)
  • language (string, optional): Text search language (default: 'english')
  • limit (integer, optional): Maximum results (default: 10)

Returns:

  • Matching rows with relevance ranking
  • Headline snippets showing matches

Example:

# Simple search
result = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content"],
    search_query="postgresql database",
    limit=10
)

# Boolean search
result = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content"],
    search_query="postgresql & (performance | optimization)",
    language="english",
    limit=20
)
# Returns: [
#   {
#     "title": "PostgreSQL Performance Tuning",
#     "rank": 0.95,
#     "headline": "...PostgreSQL <b>performance</b> tips..."
#   }
# ]

Search Operators:

  • word1 & word2 - AND (both words)
  • word1 | word2 - OR (either word)
  • !word - NOT (exclude word)
  • word1 <-> word2 - FOLLOWED BY (adjacent words)
  • word:* - Prefix matching (matches any word starting with "word")
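Under the hood, a ranked search using these operators corresponds roughly to SQL of the following shape. The helper below is a hypothetical sketch for illustration (not the tool's actual implementation), showing how to_tsvector, to_tsquery, ts_rank, and ts_headline fit together:

```python
def build_fts_sql(table: str, columns: list[str], language: str = "english") -> str:
    # Concatenate the searched columns into one tsvector expression.
    vector = f"to_tsvector('{language}', " + " || ' ' || ".join(columns) + ")"
    tsquery = f"to_tsquery('{language}', %s)"
    return (
        f"SELECT *, ts_rank({vector}, {tsquery}) AS rank, "
        f"ts_headline('{language}', {columns[0]}, {tsquery}) AS headline "
        f"FROM {table} WHERE {vector} @@ {tsquery} "
        f"ORDER BY rank DESC LIMIT %s"
    )

sql = build_fts_sql("articles", ["title", "content"])
```

The @@ operator performs the match, ts_rank orders results by relevance, and ts_headline produces the highlighted snippet shown in the example output above.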

Supported Languages:

  • english, spanish, french, german, russian, portuguese, and more (note: PostgreSQL ships no built-in chinese configuration)
  • See PostgreSQL documentation for full list

Use Cases:

  • Content search
  • Documentation search
  • Article/blog search
  • Knowledge base queries

regex_extract_all

Extract all matches of a regex pattern from text columns.

Parameters:

  • table_name (string, required): Table to search
  • text_column (string, required): Column containing text
  • pattern (string, required): Regex pattern
  • flags (string, optional): Regex flags ('i' for case-insensitive, 'g' for global)
  • limit (integer, optional): Maximum rows to search

Returns:

  • All matching patterns extracted from text
  • Original row data with extracted matches

Example:

# Extract email addresses
result = regex_extract_all(
    table_name="contacts",
    text_column="notes",
    pattern="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    flags="gi"
)
# Returns: [
#   {"notes": "Contact: john@example.com", "matches": ["john@example.com"]},
#   {"notes": "Emails: alice@test.org, bob@test.org", "matches": ["alice@test.org", "bob@test.org"]}
# ]

# Extract phone numbers
result = regex_extract_all(
    table_name="customers",
    text_column="contact_info",
    pattern="\\d{3}-\\d{3}-\\d{4}",
    limit=100
)

# Extract URLs
result = regex_extract_all(
    table_name="posts",
    text_column="content",
    pattern="https?://[^\\s]+",
    flags="gi"
)

Common Patterns:

# Email
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# Phone (US)
"\\d{3}-\\d{3}-\\d{4}"

# URL
"https?://[^\\s]+"

# IP Address
"\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"

# Date (YYYY-MM-DD)
"\\d{4}-\\d{2}-\\d{2}"
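Since these are ordinary regex patterns, they can be sanity-checked locally with Python's re module before running them against the database (Python's and PostgreSQL's regex dialects differ in some corners, but these simple patterns behave the same in both):

```python
import re

# One sample string containing all four pattern types.
text = "Mail bob@test.org on 2025-10-03, call 555-123-4567, see https://example.com/docs"

emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
dates  = re.findall(r"\d{4}-\d{2}-\d{2}", text)
urls   = re.findall(r"https?://[^\s]+", text)

print(emails, phones, dates, urls)
# ['bob@test.org'] ['555-123-4567'] ['2025-10-03'] ['https://example.com/docs']
```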

Use Cases:

  • Data extraction from unstructured text
  • Contact information parsing
  • Log analysis
  • Data migration/transformation

fuzzy_match

Find approximate string matches using Levenshtein distance.

Parameters:

  • table_name (string, required): Table to search
  • text_column (string, required): Column containing text
  • search_text (string, required): Text to match
  • max_distance (integer, required): Maximum edit distance allowed
  • limit (integer, optional): Maximum results

Returns:

  • Rows within edit distance threshold
  • Actual Levenshtein distance for each match

Example:

# Find names similar to "John Smith"
result = fuzzy_match(
    table_name="users",
    text_column="full_name",
    search_text="John Smith",
    max_distance=2,
    limit=10
)
# Returns: [
#   {"full_name": "John Smith", "distance": 0},
#   {"full_name": "Jon Smith", "distance": 1},
#   {"full_name": "John Smyth", "distance": 1},
#   {"full_name": "John Smit", "distance": 1}
# ]

# Find product names with typos
result = fuzzy_match(
    table_name="products",
    text_column="name",
    search_text="iphone",
    max_distance=1
)
# Finds: "iPhone", "ipbone", "iphne"

Edit Distance Guidelines:

  • 1 - One character different (typo tolerance)
  • 2 - Two characters different (moderate fuzzy)
  • 3-5 - Multiple differences (loose matching)
  • >5 - Very loose matching (use with caution)
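The distance itself is straightforward to define. The fuzzystrmatch extension computes it server-side, so this compact reference implementation is only for building intuition about the guidelines above:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic programming over a rolling row: prev[j] holds the edit
    # distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("John Smith", "Jon Smith"))  # 1 (one deletion)
```

Note that the comparison is case-sensitive, so "iPhone" vs. "iphone" already costs one edit.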

Use Cases:

  • Spell checking
  • Duplicate detection
  • Name matching
  • Data cleansing

🎯 Common Workflows

Product Search with Typo Tolerance

# 1. Try exact match first
exact = execute_sql(
    "SELECT * FROM products WHERE name ILIKE %s",
    params=[f"%{search_term}%"]
)

# 2. If no results, try similarity search
similar = []
if not exact:
    similar = text_similarity(
        table_name="products",
        text_column="name",
        search_text=search_term,
        similarity_threshold=0.3
    )

# 3. If still nothing, fall back to fuzzy matching
if not exact and not similar:
    fuzzy = fuzzy_match(
        table_name="products",
        text_column="name",
        search_text=search_term,
        max_distance=2
    )

Content Search with Ranking

# Full-text search with relevance ranking
results = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content", "tags"],
    search_query="database & (postgresql | postgres)",
    language="english",
    limit=20
)

# Results are automatically ranked by relevance
for article in results["results"]:
    print(f"Rank: {article['rank']}, Title: {article['title']}")

Data Extraction Pipeline

# 1. Extract emails
emails = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
)

# 2. Extract phone numbers
phones = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="\\d{3}-\\d{3}-\\d{4}"
)

# 3. Extract URLs
urls = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="https?://[^\\s]+"
)

Duplicate Detection

# Find potential duplicate names
duplicates = text_similarity(
    table_name="contacts",
    text_column="name",
    search_text="John Doe",
    similarity_threshold=0.7,
    limit=20
)

# Or use fuzzy matching for stricter detection
fuzzy_dups = fuzzy_match(
    table_name="contacts",
    text_column="name",
    search_text="John Doe",
    max_distance=2
)

📊 Performance Optimization

Create Text Search Indexes

-- Trigram index for similarity search
CREATE INDEX idx_products_name_trgm ON products USING gin(name gin_trgm_ops);

-- Full-text search index
CREATE INDEX idx_articles_fts ON articles USING gin(to_tsvector('english', title || ' ' || content));

Composite Text Search Index

-- For multi-column text search
ALTER TABLE articles ADD COLUMN search_vector tsvector;

UPDATE articles SET search_vector =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));

CREATE INDEX idx_articles_search ON articles USING gin(search_vector);

-- Keep search_vector current for inserts/updates, e.g. with a trigger
-- (tsvector_update_trigger) or, on PostgreSQL 12+, by declaring the
-- column GENERATED ALWAYS AS (...) STORED instead of the ALTER above.

🚀 Best Practices

1. Use Appropriate Tool

  • Exact match: Use SQL LIKE or ILIKE
  • Typo tolerance: Use text_similarity or fuzzy_match
  • Content search: Use text_search_advanced
  • Pattern extraction: Use regex_extract_all

2. Choose Right Thresholds

# Strict (reduce false positives)
text_similarity(..., similarity_threshold=0.7)
fuzzy_match(..., max_distance=1)

# Moderate (balanced)
text_similarity(..., similarity_threshold=0.3)
fuzzy_match(..., max_distance=2)

# Lenient (more results)
text_similarity(..., similarity_threshold=0.1)
fuzzy_match(..., max_distance=3)

3. Create Indexes

Always create appropriate indexes for text search columns to improve performance.


See Home for more tool categories.
