# Text Processing

Specialized tools for text similarity, fuzzy matching, and full-text search in PostgreSQL.
| Tool | Purpose | Algorithm |
|---|---|---|
| `text_similarity` | Find similar text strings | Trigram similarity |
| `text_search_advanced` | Full-text search with ranking | ts_vector/ts_query |
| `regex_extract_all` | Extract patterns from text | PostgreSQL regex |
| `fuzzy_match` | Fuzzy string matching | Levenshtein distance |
Requirements: the `pg_trgm` and `fuzzystrmatch` extensions
## text_similarity

Find text strings similar to a search term using trigram similarity.
Parameters:
- `table_name` (string, required): Table to search
- `text_column` (string, required): Column containing text
- `search_text` (string, required): Text to find similar matches for
- `similarity_threshold` (number, optional): Minimum similarity (0.0-1.0, default: 0.3)
- `limit` (integer, optional): Maximum results (default: 10)
Returns:
- Matching rows with similarity scores
- Sorted by similarity (highest first)
Example:

```python
result = text_similarity(
    table_name="products",
    text_column="name",
    search_text="laptop",
    similarity_threshold=0.3,
    limit=10
)
# Returns: [
#   {"name": "Laptop Computer", "similarity_score": 0.75},
#   {"name": "Gaming Laptop", "similarity_score": 0.68},
#   {"name": "Laptop Bag", "similarity_score": 0.45}
# ]
```
Use Cases:
- Product search with typo tolerance
- Autocomplete suggestions
- Duplicate detection
- "Did you mean?" features
Similarity Thresholds:
- `0.1-0.2` - Very lenient (catches distant matches)
- `0.3-0.5` - Moderate (good default)
- `0.6-0.8` - Strict (close matches only)
- `0.9-1.0` - Exact matches only
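To build intuition for these scores, trigram similarity can be approximated outside the database. The sketch below mimics `pg_trgm`'s behavior (lowercasing, padding each word, then comparing trigram sets); it is a simplified model for illustration, not the extension's exact internals:

```python
def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm trigram extraction: lowercase each word,
    pad it with two leading spaces and one trailing space, and
    collect every 3-character window."""
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

# Identical strings score 1.0; unrelated strings score near 0.0
score = similarity("laptop", "Gaming Laptop")
```

Because both sides are reduced to trigram sets, small typos change only a few trigrams, which is why a low threshold still catches misspellings.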
## text_search_advanced

Full-text search with PostgreSQL's ts_vector and ts_query for ranked results.
Parameters:
- `table_name` (string, required): Table to search
- `text_columns` (list, required): Columns to search (can be multiple)
- `search_query` (string, required): Search query (supports AND, OR, NOT)
- `language` (string, optional): Text search language (default: 'english')
- `limit` (integer, optional): Maximum results (default: 10)
Returns:
- Matching rows with relevance ranking
- Headline snippets showing matches
Example:

```python
# Simple search
result = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content"],
    search_query="postgresql database",
    limit=10
)

# Boolean search
result = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content"],
    search_query="postgresql & (performance | optimization)",
    language="english",
    limit=20
)
# Returns: [
#   {
#     "title": "PostgreSQL Performance Tuning",
#     "rank": 0.95,
#     "headline": "...PostgreSQL <b>performance</b> tips..."
#   }
# ]
```
Search Operators:
- `word1 & word2` - AND (both words)
- `word1 | word2` - OR (either word)
- `!word` - NOT (exclude word)
- `word1 <-> word2` - FOLLOWED BY (adjacent words)
- `word:*` - Prefix matching (word*)
Supported Languages:
- `english`, `spanish`, `french`, `german`, `russian`, etc. - See the PostgreSQL documentation (or run `\dF` in psql) for the full list of built-in text search configurations
Use Cases:
- Content search
- Documentation search
- Article/blog search
- Knowledge base queries
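Under the hood, the boolean search example above corresponds roughly to SQL like the following. This is an illustrative sketch of the `ts_vector`/`ts_query` machinery, not necessarily the exact query the tool generates:

```sql
SELECT title,
       ts_rank(to_tsvector('english', title || ' ' || content), query) AS rank,
       ts_headline('english', content, query) AS headline
FROM articles,
     to_tsquery('english', 'postgresql & (performance | optimization)') AS query
WHERE to_tsvector('english', title || ' ' || content) @@ query
ORDER BY rank DESC
LIMIT 20;
```

`ts_rank` produces the relevance score and `ts_headline` produces the highlighted snippet seen in the tool's results.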
## regex_extract_all

Extract all matches of a regex pattern from text columns.
Parameters:
- `table_name` (string, required): Table to search
- `text_column` (string, required): Column containing text
- `pattern` (string, required): Regex pattern
- `flags` (string, optional): Regex flags ('i' for case-insensitive, 'g' for global)
- `limit` (integer, optional): Maximum rows to search
Returns:
- All matching patterns extracted from text
- Original row data with extracted matches
Example:

```python
# Extract email addresses
result = regex_extract_all(
    table_name="contacts",
    text_column="notes",
    pattern="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    flags="gi"
)
# Returns: [
#   {"notes": "Contact: john@example.com", "matches": ["john@example.com"]},
#   {"notes": "Emails: alice@test.org, bob@test.org", "matches": ["alice@test.org", "bob@test.org"]}
# ]

# Extract phone numbers
result = regex_extract_all(
    table_name="customers",
    text_column="contact_info",
    pattern="\\d{3}-\\d{3}-\\d{4}",
    limit=100
)

# Extract URLs
result = regex_extract_all(
    table_name="posts",
    text_column="content",
    pattern="https?://[^\\s]+",
    flags="gi"
)
```
Common Patterns:

```python
# Email
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# Phone (US)
"\\d{3}-\\d{3}-\\d{4}"

# URL
"https?://[^\\s]+"

# IP Address
"\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"

# Date (YYYY-MM-DD)
"\\d{4}-\\d{2}-\\d{2}"
```
Use Cases:
- Data extraction from unstructured text
- Contact information parsing
- Log analysis
- Data migration/transformation
## fuzzy_match

Find approximate string matches using Levenshtein distance.
Parameters:
- `table_name` (string, required): Table to search
- `text_column` (string, required): Column containing text
- `search_text` (string, required): Text to match
- `max_distance` (integer, required): Maximum edit distance allowed
- `limit` (integer, optional): Maximum results
Returns:
- Rows within edit distance threshold
- Actual Levenshtein distance for each match
Example:

```python
# Find names similar to "John Smith"
result = fuzzy_match(
    table_name="users",
    text_column="full_name",
    search_text="John Smith",
    max_distance=2,
    limit=10
)
# Returns: [
#   {"full_name": "John Smith", "distance": 0},
#   {"full_name": "Jon Smith", "distance": 1},
#   {"full_name": "John Smyth", "distance": 1},
#   {"full_name": "John Smit", "distance": 1}
# ]

# Find product names with typos
result = fuzzy_match(
    table_name="products",
    text_column="name",
    search_text="iphone",
    max_distance=1
)
# Finds: "iPhone", "ipbone", "iphne"
```
Edit Distance Guidelines:
- `1` - One character different (typo tolerance)
- `2` - Two characters different (moderate fuzzy)
- `3-5` - Multiple differences (loose matching)
- `>5` - Very loose matching (use with caution)
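Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal dynamic-programming version, mirroring what fuzzystrmatch's `levenshtein()` computes (a reference sketch, not the extension's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum edits (insert/delete/substitute) to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Note that the comparison is case-sensitive, which is why "iphone" matches "iPhone" at distance 1 in the example above.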
Use Cases:
- Spell checking
- Duplicate detection
- Name matching
- Data cleansing
Example: progressive search (exact match first, then similarity, then fuzzy):

```python
# 1. Try exact match first
exact = execute_sql(
    "SELECT * FROM products WHERE name ILIKE %s",
    params=[f"%{search_term}%"]
)

# 2. If no results, try similarity search
if not exact:
    similar = text_similarity(
        table_name="products",
        text_column="name",
        search_text=search_term,
        similarity_threshold=0.3
    )

    # 3. If still no results, fall back to fuzzy matching
    if not similar:
        fuzzy = fuzzy_match(
            table_name="products",
            text_column="name",
            search_text=search_term,
            max_distance=2
        )
```
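The fallback chain above can be packaged as a small helper. The callables here stand in for the real tool calls, and the return shape is an illustrative assumption:

```python
from typing import Callable

def progressive_search(
    term: str,
    exact: Callable[[str], list],
    similar: Callable[[str], list],
    fuzzy: Callable[[str], list],
) -> dict:
    """Try strategies from strictest to loosest; return the first hit."""
    strategies = (("exact", exact), ("similarity", similar), ("fuzzy", fuzzy))
    for name, search in strategies:
        results = search(term)
        if results:
            return {"strategy": name, "results": results}
    return {"strategy": "none", "results": []}
```

Keeping the order strict-to-loose means cheap, precise lookups short-circuit the slower fuzzy ones.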
Example: ranked content search:

```python
# Full-text search with relevance ranking
results = text_search_advanced(
    table_name="articles",
    text_columns=["title", "content", "tags"],
    search_query="database & (postgresql | postgres)",
    language="english",
    limit=20
)

# Results are automatically ranked by relevance
for article in results["results"]:
    print(f"Rank: {article['rank']}, Title: {article['title']}")
```
Example: extracting contact data:

```python
# 1. Extract emails
emails = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
)

# 2. Extract phone numbers
phones = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="\\d{3}-\\d{3}-\\d{4}"
)

# 3. Extract URLs
urls = regex_extract_all(
    table_name="messages",
    text_column="body",
    pattern="https?://[^\\s]+"
)
```
Example: duplicate detection:

```python
# Find potential duplicate names
duplicates = text_similarity(
    table_name="contacts",
    text_column="name",
    search_text="John Doe",
    similarity_threshold=0.7,
    limit=20
)

# Or use fuzzy matching for stricter detection
fuzzy_dups = fuzzy_match(
    table_name="contacts",
    text_column="name",
    search_text="John Doe",
    max_distance=2
)
```
Recommended indexes:

```sql
-- Trigram index for similarity search
CREATE INDEX idx_products_name_trgm ON products USING gin(name gin_trgm_ops);

-- Full-text search index
CREATE INDEX idx_articles_fts ON articles USING gin(to_tsvector('english', title || ' ' || content));

-- For multi-column text search
ALTER TABLE articles ADD COLUMN search_vector tsvector;
UPDATE articles SET search_vector =
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));
CREATE INDEX idx_articles_search ON articles USING gin(search_vector);
```
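A dedicated `search_vector` column must also stay in sync as rows change. PostgreSQL ships a built-in trigger function for this; the column names below assume the `articles` table from the example:

```sql
-- Keep search_vector up to date on INSERT/UPDATE
CREATE TRIGGER articles_search_vector_update
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION
  tsvector_update_trigger(search_vector, 'pg_catalog.english', title, content);
```

On PostgreSQL 12+, a `GENERATED ALWAYS AS (...) STORED` column is an alternative that avoids the trigger entirely.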
Choosing a tool:
- Exact match: Use SQL `LIKE` or `ILIKE`
- Typo tolerance: Use `text_similarity` or `fuzzy_match`
- Content search: Use `text_search_advanced`
- Pattern extraction: Use `regex_extract_all`
Tuning match strictness:

```python
# Strict (reduce false positives)
text_similarity(..., similarity_threshold=0.7)
fuzzy_match(..., max_distance=1)

# Moderate (balanced)
text_similarity(..., similarity_threshold=0.3)
fuzzy_match(..., max_distance=2)

# Lenient (more results)
text_similarity(..., similarity_threshold=0.1)
fuzzy_match(..., max_distance=3)
```
Always create appropriate indexes for text search columns to improve performance.
Related:
- Core Database Tools - Schema and SQL
- JSON Operations - JSON text search
- Extension Setup - pg_trgm, fuzzystrmatch

See Home for more tool categories.