
Text Processing

Temp edited this page Oct 3, 2025 · 1 revision

Text Processing Tools

Four specialized tools for text similarity, fuzzy matching, pattern extraction, and full-text search in PostgreSQL.


📊 Overview

| Tool | Purpose | Algorithm |
| --- | --- | --- |
| text_similarity | Find similar text strings | Trigram similarity |
| text_search_advanced | Full-text search with ranking | tsvector/tsquery |
| regex_extract_all | Extract patterns from text | PostgreSQL regex |
| fuzzy_match | Fuzzy string matching | Levenshtein distance |

Requirements: the pg_trgm and fuzzystrmatch extensions must be installed (CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;)


🔧 Tool Details

text_similarity

Find text strings similar to a search term using trigram similarity.

Parameters:

  • table_name (string, required): Table to search
  • text_column (string, required): Column containing text
  • search_text (string, required): Text to find similar matches for
  • similarity_threshold (number, optional): Minimum similarity (0.0-1.0, default: 0.3)
  • limit (integer, optional): Maximum results (default: 10)

Returns:

  • Matching rows with similarity scores
  • Sorted by similarity (highest first)

Example:

result = text_similarity(
    table_name="products",
    text_column="name",
    search_text="laptop",
    similarity_threshold=0.3,
    limit=10
)
# Returns: [
#   {"name": "Laptop Computer", "similarity_score": 0.75},
#   {"name": "Gaming Laptop", "similarity_score": 0.68},
#   {"name": "Laptop Bag", "similarity_score": 0.45}
# ]

Use Cases:

  • Product search with typo tolerance
  • Autocomplete suggestions
  • Duplicate detection
  • "Did you mean?" features

Similarity Thresholds:

  • 0.1-0.2 - Very lenient (catches distant matches)
  • 0.3-0.5 - Moderate (good default)
  • 0.6-0.8 - Strict (close matches only)
  • 0.9-1.0 - Exact matches only
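To build intuition for these thresholds, pg_trgm-style similarity can be approximated in a few lines of Python. This is a simplified sketch, not the extension itself (pg_trgm also handles punctuation and other edge cases):

```python
def trigrams(text: str) -> set[str]:
    # Approximate pg_trgm: lowercase, pad each word with two leading
    # spaces and one trailing space, then take every 3-character window.
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    # Shared trigrams divided by total distinct trigrams, as pg_trgm does.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

print(similarity("laptop", "laptop"))      # identical strings score 1.0
print(similarity("laptop", "Laptop Bag"))  # partial overlap, between 0 and 1
```

Because padding gives word boundaries extra weight, short probe strings match prefixes of longer strings surprisingly well, which is why 0.3 works as a lenient default.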

text_search_advanced

Full-text search with PostgreSQL's tsvector and tsquery for ranked results.

Parameters:

  • table_name (string, required): Table to search
  • text_columns (list, required): Columns to search (can be multiple)
  • search_query (string, required): Search query (supports AND, OR, NOT)
  • language (string, optional): Text search language (default: 'english')
  • limit (integer, optional): Maximum results (default: 10)

Returns:

  • Matching rows with relevance ranking
  • Headline snippets showing matches

Example:

# Simple search
result = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content"],
    search_query="postgresql database",
    limit=10
)

# Boolean search
result = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content"],
    search_query="postgresql & (performance | optimization)",
    language="english",
    limit=20
)
# Returns: [
#   {
#     "title": "PostgreSQL Performance Tuning",
#     "rank": 0.95,
#     "headline": "...PostgreSQL <b>performance</b> tips..."
#   }
# ]

Search Operators:

  • word1 & word2 - AND (both words)
  • word1 | word2 - OR (either word)
  • !word - NOT (exclude word)
  • word1 <-> word2 - FOLLOWED BY (adjacent words)
  • word:* - Prefix matching (matches any word starting with "word")
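Under the hood, a ranked search using these operators corresponds roughly to SQL of the following shape. The helper below is a hypothetical sketch for illustration (not the tool's actual implementation), showing how to_tsvector, to_tsquery, ts_rank, and ts_headline fit together:

```python
def build_fts_sql(table: str, columns: list[str], language: str = "english") -> str:
    # Concatenate the searched columns into one tsvector expression.
    vector = f"to_tsvector('{language}', " + " || ' ' || ".join(columns) + ")"
    tsquery = f"to_tsquery('{language}', %s)"
    return (
        f"SELECT *, ts_rank({vector}, {tsquery}) AS rank, "
        f"ts_headline('{language}', {columns[0]}, {tsquery}) AS headline "
        f"FROM {table} WHERE {vector} @@ {tsquery} "
        f"ORDER BY rank DESC LIMIT %s"
    )

sql = build_fts_sql("articles", ["title", "content"])
```

The @@ operator performs the match, ts_rank orders results by relevance, and ts_headline produces the highlighted snippet shown in the example output above.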

Supported Languages:

  • english, spanish, french, german, russian, portuguese, and more (note: PostgreSQL ships no built-in chinese configuration)
  • See PostgreSQL documentation for full list

Use Cases:

  • Content search
  • Documentation search
  • Article/blog search
  • Knowledge base queries

regex_extract_all

Extract all matches of a regex pattern from text columns.

Parameters:

  • table_name (string, required): Table to search
  • text_column (string, required): Column containing text
  • pattern (string, required): Regex pattern
  • flags (string, optional): Regex flags ('i' for case-insensitive, 'g' for global)
  • limit (integer, optional): Maximum rows to search

Returns:

  • All matching patterns extracted from text
  • Original row data with extracted matches

Example:

# Extract email addresses
result = regex_extract_all(
    table_name="contacts",
    text_column="notes",
    pattern="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    flags="gi"
)
# Returns: [
#   {"notes": "Contact: john@example.com", "matches": ["john@example.com"]},
#   {"notes": "Emails: alice@test.org, bob@test.org", "matches": ["alice@test.org", "bob@test.org"]}
# ]

# Extract phone numbers
result = regex_extract_all(
    table_name="customers",
    text_column="contact_info",
    pattern="\\d{3}-\\d{3}-\\d{4}",
    limit=100
)

# Extract URLs
result = regex_extract_all(
    table_name="posts",
    text_column="content",
    pattern="https?://[^\\s]+",
    flags="gi"
)

Common Patterns:

# Email
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# Phone (US)
"\\d{3}-\\d{3}-\\d{4}"

# URL
"https?://[^\\s]+"

# IP Address
"\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"

# Date (YYYY-MM-DD)
"\\d{4}-\\d{2}-\\d{2}"
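Since these are ordinary regex patterns, they can be sanity-checked locally with Python's re module before running them against the database (Python's and PostgreSQL's regex dialects differ in some corners, but these simple patterns behave the same in both):

```python
import re

# One sample string containing all four pattern types.
text = "Mail bob@test.org on 2025-10-03, call 555-123-4567, see https://example.com/docs"

emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
dates  = re.findall(r"\d{4}-\d{2}-\d{2}", text)
urls   = re.findall(r"https?://[^\s]+", text)

print(emails, phones, dates, urls)
# ['bob@test.org'] ['555-123-4567'] ['2025-10-03'] ['https://example.com/docs']
```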

Use Cases:

  • Data extraction from unstructured text
  • Contact information parsing
  • Log analysis
  • Data migration/transformation

fuzzy_match

Find approximate string matches using Levenshtein distance.

Parameters:

  • table_name (string, required): Table to search
  • text_column (string, required): Column containing text
  • search_text (string, required): Text to match
  • max_distance (integer, required): Maximum edit distance allowed
  • limit (integer, optional): Maximum results

Returns:

  • Rows within edit distance threshold
  • Actual Levenshtein distance for each match

Example:

# Find names similar to "John Smith"
result = fuzzy_match(
    table_name="users",
    text_column="full_name",
    search_text="John Smith",
    max_distance=2,
    limit=10
)
# Returns: [
#   {"full_name": "John Smith", "distance": 0},
#   {"full_name": "Jon Smith", "distance": 1},
#   {"full_name": "John Smyth", "distance": 1},
#   {"full_name": "John Smit", "distance": 1}
# ]

# Find product names with typos
result = fuzzy_match(
    table_name="products",
    text_column="name",
    search_text="iphone",
    max_distance=1
)
# Finds: "iPhone", "ipbone", "iphne"

Edit Distance Guidelines:

  • 1 - One character different (typo tolerance)
  • 2 - Two characters different (moderate fuzzy)
  • 3-5 - Multiple differences (loose matching)
  • >5 - Very loose matching (use with caution)
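The distance itself is straightforward to define. The fuzzystrmatch extension computes it server-side, so this compact reference implementation is only for building intuition about the guidelines above:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic programming over a rolling row: prev[j] holds the edit
    # distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("John Smith", "Jon Smith"))  # 1 (one deletion)
```

Note that the comparison is case-sensitive, so "iPhone" vs. "iphone" already costs one edit.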

Use Cases:

  • Spell checking
  • Duplicate detection
  • Name matching
  • Data cleansing

🎯 Common Workflows

Product Search with Typo Tolerance

# 1. Try exact match first
exact = execute_sql(
    "SELECT * FROM products WHERE name ILIKE %s",
    params=[f"%{search_term}%"]
)

# 2. If no results, try similarity search
similar = []
if not exact:
    similar = text_similarity(
        table_name="products",
        text_column="name",
        search_text=search_term,
        similarity_threshold=0.3
    )

# 3. If still nothing, fall back to fuzzy matching
if not exact and not similar:
    fuzzy = fuzzy_match(
        table_name="products",
        text_column="name",
        search_text=search_term,
        max_distance=2
    )

Content Search with Ranking

# Full-text search with relevance ranking
results = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content", "tags"],
    search_query="database & (postgresql | postgres)",
    language="english",
    limit=20
)

# Results are automatically ranked by relevance
for article in results["results"]:
    print(f"Rank: {article['rank']}, Title: {article['title']}")

Data Extraction Pipeline

# 1. Extract emails
emails = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
)

# 2. Extract phone numbers
phones = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="\\d{3}-\\d{3}-\\d{4}"
)

# 3. Extract URLs
urls = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="https?://[^\\s]+"
)

Duplicate Detection

# Find potential duplicate names
duplicates = text_similarity(
    table_name="contacts",
    text_column="name",
    search_text="John Doe",
    similarity_threshold=0.7,
    limit=20
)

# Or use fuzzy matching for stricter detection
fuzzy_dups = fuzzy_match(
    table_name="contacts",
    text_column="name",
    search_text="John Doe",
    max_distance=2
)

📊 Performance Optimization

Create Text Search Indexes

-- Trigram index for similarity search
CREATE INDEX idx_products_name_trgm ON products USING gin(name gin_trgm_ops);

-- Full-text search index
CREATE INDEX idx_articles_fts ON articles USING gin(to_tsvector('english', title || ' ' || content));

Composite Text Search Index

-- For multi-column text search
ALTER TABLE articles ADD COLUMN search_vector tsvector;

UPDATE articles SET search_vector =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));

CREATE INDEX idx_articles_search ON articles USING gin(search_vector);

-- Keep search_vector current for inserts/updates, e.g. with a trigger
-- (tsvector_update_trigger) or, on PostgreSQL 12+, by declaring the
-- column GENERATED ALWAYS AS (...) STORED instead of the ALTER above.

🚀 Best Practices

1. Use Appropriate Tool

  • Exact match: Use SQL LIKE or ILIKE
  • Typo tolerance: Use text_similarity or fuzzy_match
  • Content search: Use text_search_advanced
  • Pattern extraction: Use regex_extract_all

2. Choose Right Thresholds

# Strict (reduce false positives)
text_similarity(..., similarity_threshold=0.7)
fuzzy_match(..., max_distance=1)

# Moderate (balanced)
text_similarity(..., similarity_threshold=0.3)
fuzzy_match(..., max_distance=2)

# Lenient (more results)
text_similarity(..., similarity_threshold=0.1)
fuzzy_match(..., max_distance=3)

3. Create Indexes

Always create appropriate indexes for text search columns to improve performance.


See Home for more tool categories.
