# Advanced GFQL Validation Patterns

A deep dive into complex GFQL validation scenarios: multi-hop queries, advanced predicates, performance considerations, schema evolution, and custom validation rules.

## Prerequisites
- Complete the [GFQL Validation Fundamentals](./gfql_validation_fundamentals.ipynb) notebook
- Experience writing GFQL queries
- Understanding of graph traversal concepts

In [None]:
# Imports
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import time
import graphistry

from graphistry.compute.validate import (
    validate_syntax,
    validate_schema,
    validate_query,
    extract_schema_from_dataframes,
    extract_schema_from_plottable
)
from graphistry.compute.ast import n, e_forward, e_reverse, e

print(f"PyGraphistry version: {graphistry.__version__}")

## Complex Multi-Hop Queries

Validate queries with multiple hops and complex traversal patterns.

In [None]:
# Create a more complex dataset (seeded so the random columns are reproducible)
np.random.seed(42)
nodes_df = pd.DataFrame({
    'id': range(1, 21),
    'name': [f'Entity_{i}' for i in range(1, 21)],
    'type': ['user', 'product', 'order', 'payment'] * 5,
    'risk_score': np.random.uniform(0, 100, 20),
    'created_at': pd.date_range('2024-01-01', periods=20, freq='D'),
    'tags': [['premium'], ['sale'], [], ['urgent']] * 5
})

edges_df = pd.DataFrame({
    'src': np.random.choice(range(1, 21), 50),
    'dst': np.random.choice(range(1, 21), 50),
    'rel_type': np.random.choice(['purchased', 'viewed', 'paid_for', 'related_to'], 50),
    'weight': np.random.uniform(0, 1, 50),
    'timestamp': pd.date_range('2024-01-01', periods=50, freq='H')
})

schema = extract_schema_from_dataframes(nodes_df, edges_df)
print(f"Dataset: {len(nodes_df)} nodes, {len(edges_df)} edges")

In [None]:
# Multi-hop validation with bounded hops
multi_hop_query = [
    {"type": "n", "filter": {"type": {"eq": "user"}}},
    {"type": "e_forward", "hops": 2},  # 2-hop traversal
    {"type": "n", "filter": {"risk_score": {"gt": 50}}}
]

issues = validate_query(multi_hop_query, nodes_df=nodes_df, edges_df=edges_df)
print("2-hop traversal query:")
print(f"Validation issues: {len(issues)}")
for issue in issues:
    print(f"- {issue.level}: {issue.message}")
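
Before handing a query to the validator, it can help to lint hop counts directly. The following is a minimal sketch, not part of the `graphistry.compute.validate` API: `find_hop_problems` and its `max_hops` threshold are hypothetical, and it assumes, as this notebook does, that an edge operation without `hops` should be flagged as unbounded.

```python
EDGE_TYPES = {"e_forward", "e_reverse", "e"}

def find_hop_problems(query, max_hops=5):
    """Return (operation_index, message) pairs for edge ops with risky hop settings.

    Hypothetical helper: max_hops is an arbitrary illustrative threshold.
    """
    problems = []
    for i, op in enumerate(query):
        if op.get("type") not in EDGE_TYPES:
            continue
        hops = op.get("hops")
        if hops is None:
            problems.append((i, "no explicit 'hops' value"))
        elif isinstance(hops, bool) or not isinstance(hops, int) or hops < 1:
            problems.append((i, f"'hops' must be a positive integer, got {hops!r}"))
        elif hops > max_hops:
            problems.append((i, f"'hops'={hops} exceeds limit {max_hops}"))
    return problems

multi_hop_query = [
    {"type": "n", "filter": {"type": {"eq": "user"}}},
    {"type": "e_forward", "hops": 2},
    {"type": "n", "filter": {"risk_score": {"gt": 50}}},
]

print(find_hop_problems(multi_hop_query))
print(find_hop_problems([{"type": "n"}, {"type": "e_forward"}, {"type": "n"}]))
```

The first call reports nothing because the 2-hop edge is within the limit; the second flags the edge at index 1.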

In [None]:
# Named operations for complex patterns
named_ops_query = [
    {"type": "n", "name": "start_users", "filter": {"type": {"eq": "user"}}},
    {"type": "e_forward", "filter": {"rel_type": {"eq": "purchased"}}},
    {"type": "n", "name": "products", "filter": {"type": {"eq": "product"}}},
    {"type": "e_reverse", "filter": {"rel_type": {"eq": "viewed"}}},
    {"type": "n", "name": "viewers"}
]

issues = validate_query(named_ops_query, nodes_df=nodes_df, edges_df=edges_df)
print("Named operations query (find who viewed products that users purchased):")
print(f"Validation issues: {len(issues)}")
if not issues:
    print("✅ Complex pattern validated successfully!")

In [None]:
# Path validation with alternating patterns
path_query = [
    {"type": "n"},
    {"type": "e_forward"},
    {"type": "n"},
    {"type": "e_reverse"},
    {"type": "n"},
    {"type": "e"},  # Bidirectional
    {"type": "n"}
]

issues = validate_syntax(path_query)
print("Alternating path pattern:")
print(f"Pattern length: {len(path_query)} operations")
print(f"Validation issues: {len(issues)}")
print("✅ Valid path structure" if not issues else "❌ Invalid path")

## Advanced Predicates

Complex filtering with temporal, nested, and custom predicates.

In [None]:
# Temporal predicates validation
temporal_query = [
    {"type": "n", "filter": {
        "created_at": {
            "gt": {"type": "datetime", "value": "2024-01-10T00:00:00Z"}
        }
    }},
    {"type": "e_forward", "filter": {
        "timestamp": {
            "between": [
                {"type": "datetime", "value": "2024-01-01T00:00:00Z"},
                {"type": "datetime", "value": "2024-01-15T00:00:00Z"}
            ]
        }
    }}
]

issues = validate_query(temporal_query, nodes_df=nodes_df, edges_df=edges_df)
print("Temporal predicates query:")
print(f"Validation issues: {len(issues)}")
if not issues:
    print("✅ Temporal predicates validated!")
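
Writing ISO timestamps by hand is error-prone, so a small builder can generate the datetime literals used above from Python `datetime` objects. This is a sketch only: `datetime_literal` is a hypothetical helper, and it assumes naive datetimes are already in UTC.

```python
from datetime import datetime, timezone

def datetime_literal(dt):
    """Build a {"type": "datetime", "value": ...} literal with a UTC 'Z' suffix."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are treated as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    value = dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"type": "datetime", "value": value}

cutoff = datetime_literal(datetime(2024, 1, 10))
temporal_filter = {"created_at": {"gt": cutoff}}
print(temporal_filter)
```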

In [None]:
# Nested predicates with AND/OR logic
nested_query = [
    {"type": "n", "filter": {
        "_and": [
            {"type": {"in": ["user", "payment"]}},
            {"_or": [
                {"risk_score": {"gte": 75}},
                {"tags": {"contains": "urgent"}}
            ]}
        ]
    }}
]

issues = validate_query(nested_query, nodes_df=nodes_df, edges_df=edges_df)
print("Nested predicates query:")
print("Finding: (user OR payment) AND (high risk OR urgent)")
print(f"Validation issues: {len(issues)}")
for issue in issues:
    print(f"- {issue.level}: {issue.message}")
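
When a nested filter fails validation, it is useful to know which columns it references. The sketch below walks the `_and`/`_or` structure recursively in plain Python; `filter_columns` is a hypothetical helper, and `known_columns` is copied by hand from `nodes_df` above.

```python
LOGICAL_KEYS = {"_and", "_or"}

def filter_columns(filter_dict):
    """Collect every column name referenced in a (possibly nested) filter."""
    columns = set()
    for key, value in filter_dict.items():
        if key in LOGICAL_KEYS:
            # Logical groups hold a list of sub-filters
            for sub_filter in value:
                columns |= filter_columns(sub_filter)
        else:
            columns.add(key)
    return columns

nested_filter = {
    "_and": [
        {"type": {"in": ["user", "payment"]}},
        {"_or": [
            {"risk_score": {"gte": 75}},
            {"tags": {"contains": "urgent"}},
        ]},
    ]
}

known_columns = {"id", "name", "type", "risk_score", "created_at", "tags"}
referenced = filter_columns(nested_filter)
print("Referenced:", sorted(referenced))
print("Unknown:", sorted(referenced - known_columns))
```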

In [None]:
# Type-specific validation
type_specific_queries = [
    # Numeric predicates
    [{"type": "n", "filter": {"risk_score": {"between": [25, 75]}}}],
    
    # String predicates  
    [{"type": "n", "filter": {"name": {"regex": "Entity_[1-5]$"}}}],
    
    # Array predicates
    [{"type": "n", "filter": {"tags": {"is_empty": False}}}],
    
    # Null checks
    [{"type": "n", "filter": {"name": {"is_null": False}}}]
]

for i, query in enumerate(type_specific_queries):
    issues = validate_query(query, nodes_df=nodes_df, edges_df=edges_df)
    print(f"Query {i+1}: {query[0]['filter']}")
    print(f"  Valid: {'✅' if not issues else '❌'}")

## Edge Queries with Complex Filters

Advanced edge filtering with source/destination constraints.

In [None]:
# Edge query with source/destination filters
edge_filter_query = [
    {"type": "e", 
     "filter": {"rel_type": {"eq": "purchased"}},
     "source_filter": {"type": {"eq": "user"}},
     "destination_filter": {"type": {"eq": "product"}}
    }
]

issues = validate_query(edge_filter_query, nodes_df=nodes_df, edges_df=edges_df)
print("Edge query with endpoint filters:")
print("Finding: purchased edges from users to products")
print(f"Validation issues: {len(issues)}")
if not issues:
    print("✅ Complex edge filters validated!")
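
To make the intent of the endpoint filters concrete, here is the same selection written out in plain Python over a tiny hand-made graph. It is an illustration of the expected result, not how GFQL evaluates the query; `match_edges` is hypothetical and, for simplicity, treats every edge as directed from source to destination and supports equality checks only.

```python
# Toy data: node id -> type, and (src, dst, rel_type) edge tuples
node_types = {1: "user", 2: "product", 3: "user", 4: "payment"}
edges = [
    (1, 2, "purchased"),
    (3, 2, "viewed"),
    (3, 4, "purchased"),
    (1, 4, "paid_for"),
]

def match_edges(edges, node_types, rel_type, source_type, destination_type):
    """Keep edges whose relation and endpoint node types all match."""
    return [
        (src, dst, rel)
        for src, dst, rel in edges
        if rel == rel_type
        and node_types[src] == source_type
        and node_types[dst] == destination_type
    ]

matches = match_edges(edges, node_types, "purchased", "user", "product")
print(matches)
```

Only the edge from node 1 to node 2 survives: the other `purchased` edge ends at a payment node, and the remaining edges have a different relation type.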

In [None]:
# Complex edge pattern with multiple constraints
complex_edge_pattern = [
    {"type": "n", "filter": {"risk_score": {"gt": 80}}},
    {"type": "e_forward", 
     "filter": {
         "_and": [
             {"weight": {"gte": 0.5}},
             {"timestamp": {"gte": {"type": "datetime", "value": "2024-01-05T00:00:00Z"}}}
         ]
     },
     "destination_filter": {"type": {"ne": "payment"}}
    },
    {"type": "n"}
]

issues = validate_query(complex_edge_pattern, nodes_df=nodes_df, edges_df=edges_df)
print("Complex edge pattern:")
print("Finding: High-risk entities with strong recent connections (not to payments)")
print(f"Validation issues: {len(issues)}")

## Performance Considerations

Validate queries while considering performance implications.

In [None]:
# Performance comparison: bounded vs unbounded hops
# Note: the timings below measure validation only, not query execution

# Bounded hops (good performance)
bounded_query = [
    {"type": "n", "filter": {"id": {"eq": 1}}},
    {"type": "e_forward", "hops": 3},
    {"type": "n"}
]

# Unbounded hops (potential performance issue)
unbounded_query = [
    {"type": "n", "filter": {"id": {"eq": 1}}},
    {"type": "e_forward"},  # No hops limit!
    {"type": "n"}
]

# Validate bounded (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
bounded_issues = validate_syntax(bounded_query)
bounded_time = time.perf_counter() - start

# Validate unbounded
start = time.perf_counter()
unbounded_issues = validate_syntax(unbounded_query)
unbounded_time = time.perf_counter() - start

print("Performance Analysis:")
print("\nBounded query (3 hops):")
print(f"  Validation time: {bounded_time*1000:.2f}ms")
print(f"  Issues: {len(bounded_issues)}")

print("\nUnbounded query:")
print(f"  Validation time: {unbounded_time*1000:.2f}ms")
print(f"  Issues: {len(unbounded_issues)}")
for issue in unbounded_issues:
    if issue.level == "warning":
        print(f"  ⚠️  {issue.message}")
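
A single timed call is noisy at sub-millisecond scale, so repeating the measurement and taking the median gives a steadier number. The sketch below uses only the standard library; `median_time_ms` is a hypothetical helper, and `check_types` is a stand-in workload that could be swapped for something like `lambda: validate_syntax(bounded_query)`.

```python
import statistics
import time

def median_time_ms(fn, repeats=200):
    """Run fn() repeatedly and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

sample_query = [{"type": "n"}, {"type": "e_forward", "hops": 3}, {"type": "n"}]
KNOWN_TYPES = {"n", "e", "e_forward", "e_reverse"}

def check_types():
    """Stand-in workload: list operations whose type is not recognized."""
    return [op for op in sample_query if op.get("type") not in KNOWN_TYPES]

elapsed = median_time_ms(check_types)
print(f"Median over 200 runs: {elapsed:.4f}ms")
```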

In [None]:
# Validate query complexity
def estimate_query_complexity(query):
    """Estimate relative complexity of a query."""
    complexity = 0
    
    for op in query:
        if op["type"] in ["e_forward", "e_reverse", "e"]:
            hops = op.get("hops", float('inf'))
            complexity += min(hops, 10) * 2  # Cap at 10 for estimation
        
        if "filter" in op:
            # Complex filters add to complexity
            if "_and" in op["filter"] or "_or" in op["filter"]:
                complexity += 3
            else:
                complexity += 1
    
    return complexity

# Test different query complexities
queries = [
    [{"type": "n"}, {"type": "e_forward", "hops": 1}, {"type": "n"}],
    [{"type": "n"}, {"type": "e_forward", "hops": 5}, {"type": "n"}],
    [{"type": "n", "filter": {"_and": [{"type": {"eq": "user"}}, {"risk_score": {"gt": 50}}]}},
     {"type": "e_forward"}, {"type": "n"}],
]

for i, q in enumerate(queries):
    complexity = estimate_query_complexity(q)
    issues = validate_syntax(q)
    print(f"\nQuery {i+1} complexity: {complexity}")
    print(f"  Operations: {len(q)}")
    print(f"  Has warnings: {'Yes' if any(issue.level == 'warning' for issue in issues) else 'No'}")
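
Working the heuristic through by hand shows how strongly a missing `hops` value dominates the score. The function is repeated here so the snippet runs on its own; the expected values follow directly from its arithmetic.

```python
def estimate_query_complexity(query):
    """Same heuristic as above: 2 points per hop (capped at 10 hops), plus filter cost."""
    complexity = 0
    for op in query:
        if op["type"] in ["e_forward", "e_reverse", "e"]:
            hops = op.get("hops", float("inf"))
            complexity += min(hops, 10) * 2
        if "filter" in op:
            if "_and" in op["filter"] or "_or" in op["filter"]:
                complexity += 3
            else:
                complexity += 1
    return complexity

one_hop = [{"type": "n"}, {"type": "e_forward", "hops": 1}, {"type": "n"}]
five_hops = [{"type": "n"}, {"type": "e_forward", "hops": 5}, {"type": "n"}]
no_limit = [
    {"type": "n", "filter": {"_and": [{"type": {"eq": "user"}}, {"risk_score": {"gt": 50}}]}},
    {"type": "e_forward"},
    {"type": "n"},
]

# 1 hop * 2 = 2
print(estimate_query_complexity(one_hop))
# 5 hops * 2 = 10
print(estimate_query_complexity(five_hops))
# 3 for the "_and" filter + (missing hops capped at 10) * 2 = 23
print(estimate_query_complexity(no_limit))
```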

## Schema Evolution

Handle schema changes and maintain backwards compatibility.

In [None]:
# Original schema
original_nodes = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'value': [10, 20, 30]
})

# Evolved schema (renamed column, added column)
evolved_nodes = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'score': [10, 20, 30],  # 'value' renamed to 'score'
    'category': ['X', 'Y', 'Z']  # New column
})

# Query using old schema
legacy_query = [
    {"type": "n", "filter": {"value": {"gte": 15}}}
]

# Validate against both schemas
print("Legacy query validation:")
print(f"Query: {legacy_query}")

original_schema = extract_schema_from_dataframes(original_nodes, pd.DataFrame())
evolved_schema = extract_schema_from_dataframes(evolved_nodes, pd.DataFrame())

original_issues = validate_schema(legacy_query, original_schema)
evolved_issues = validate_schema(legacy_query, evolved_schema)

print(f"\nOriginal schema: {'✅ Valid' if not original_issues else '❌ Invalid'}")
print(f"Evolved schema: {'✅ Valid' if not evolved_issues else '❌ Invalid'}")

if evolved_issues:
    print("\nMigration needed:")
    for issue in evolved_issues:
        print(f"  - {issue.message}")
        if issue.suggestion:
            print(f"    Suggestion: {issue.suggestion}")

In [None]:
# Backwards compatibility helper
import copy

def create_compatible_query(query, column_mapping):
    """Return a copy of the query with top-level filter columns renamed.

    Columns nested inside "_and"/"_or" groups are not rewritten.
    """
    new_query = copy.deepcopy(query)
    
    for op in new_query:
        if "filter" in op:
            for old_col, new_col in column_mapping.items():
                if old_col in op["filter"]:
                    op["filter"][new_col] = op["filter"].pop(old_col)
    
    return new_query

# Update query for new schema
column_mapping = {"value": "score"}
updated_query = create_compatible_query(legacy_query, column_mapping)

print("Updated query for new schema:")
print(f"Before: {legacy_query}")
print(f"After: {updated_query}")

# Validate updated query
updated_issues = validate_schema(updated_query, evolved_schema)
print(f"\nValidation: {'✅ Valid' if not updated_issues else '❌ Invalid'}")
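
The helper above renames only top-level filter keys, so a column referenced inside an `_and`/`_or` group would be left behind. A recursive variant might look like the sketch below; `rename_filter_columns` and `migrate_query` are hypothetical names, and covering `source_filter` and `destination_filter` assumes those keys hold filters of the same shape as `filter`.

```python
import copy

LOGICAL_KEYS = {"_and", "_or"}
FILTER_FIELDS = ("filter", "source_filter", "destination_filter")

def rename_filter_columns(filter_dict, column_mapping):
    """Return a new filter with columns renamed at every nesting level."""
    renamed = {}
    for key, value in filter_dict.items():
        if key in LOGICAL_KEYS:
            renamed[key] = [rename_filter_columns(sub, column_mapping) for sub in value]
        else:
            renamed[column_mapping.get(key, key)] = copy.deepcopy(value)
    return renamed

def migrate_query(query, column_mapping):
    """Return a migrated copy of the query; the input is left untouched."""
    migrated = []
    for op in query:
        new_op = copy.deepcopy(op)
        for field in FILTER_FIELDS:
            if field in new_op:
                new_op[field] = rename_filter_columns(new_op[field], column_mapping)
        migrated.append(new_op)
    return migrated

nested_legacy_query = [
    {"type": "n", "filter": {"_and": [
        {"value": {"gte": 15}},
        {"_or": [{"name": {"eq": "A"}}, {"value": {"lt": 5}}]},
    ]}}
]

migrated_query = migrate_query(nested_legacy_query, {"value": "score"})
print(migrated_query)
```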

## Custom Validation Logic

Extend validation for domain-specific requirements.

In [None]:
# Custom validation for business rules
from graphistry.compute.validate import ValidationIssue

def validate_business_rules(query, schema):
    """Add custom business rule validation."""
    custom_issues = []
    
    # Rule 1: Warn about unfiltered node queries, which could expose
    # sensitive columns such as 'risk_score' or 'payment_id'
    for i, op in enumerate(query):
        if op.get("type") == "n" and "filter" not in op:
            custom_issues.append(ValidationIssue(
                level="warning",
                message="Unfiltered node query may expose sensitive data",
                operation_index=i,
                suggestion="Add filters to limit data exposure"
            ))
    
    # Rule 2: Warn about expensive patterns
    consecutive_edges = 0
    for i, op in enumerate(query):
        if op["type"] in ["e_forward", "e_reverse", "e"]:
            consecutive_edges += 1
            if consecutive_edges > 2:
                custom_issues.append(ValidationIssue(
                    level="warning",
                    message=f"Query has {consecutive_edges} consecutive edge operations",
                    operation_index=i,
                    suggestion="Consider adding node filters between edge operations"
                ))
        else:
            consecutive_edges = 0
    
    return custom_issues

# Test custom validation
risky_query = [
    {"type": "n"},  # No filter!
    {"type": "e_forward"},
    {"type": "e_forward"},
    {"type": "e_forward"},  # Three consecutive edges
    {"type": "n"}
]

# Standard validation
standard_issues = validate_syntax(risky_query)
print(f"Standard validation: {len(standard_issues)} issues")

# Custom validation
custom_issues = validate_business_rules(risky_query, schema)
print(f"\nCustom validation: {len(custom_issues)} issues")
for issue in custom_issues:
    print(f"  - {issue.level}: {issue.message}")
    print(f"    {issue.suggestion}")
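
Standard and custom checks usually need to be reported together. The sketch below merges several issue lists and sorts them by severity and position. `Issue` is a stand-in dataclass so the snippet runs without PyGraphistry, the `"error"` level is an assumption (this notebook only shows `"warning"`), and the sample messages are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Issue:
    """Stand-in with the same attribute names this notebook reads from issues."""
    level: str
    message: str
    operation_index: Optional[int] = None

# Assumed severity levels; unknown levels sort last
SEVERITY_ORDER = {"error": 0, "warning": 1}

def merge_issues(*issue_lists):
    """Flatten issue lists and sort by severity, then by operation index."""
    merged = [issue for issues in issue_lists for issue in issues]
    return sorted(
        merged,
        key=lambda issue: (
            SEVERITY_ORDER.get(issue.level, len(SEVERITY_ORDER)),
            -1 if issue.operation_index is None else issue.operation_index,
        ),
    )

def has_errors(issues):
    """True when at least one issue should block the query."""
    return any(issue.level == "error" for issue in issues)

standard = [Issue("warning", "Edge operation has no hop limit", 1)]
custom = [
    Issue("warning", "Unfiltered node query may expose sensitive data", 4),
    Issue("error", "Column 'payment_id' is restricted", 0),
]

report = merge_issues(standard, custom)
for issue in report:
    print(f"[{issue.level}] op {issue.operation_index}: {issue.message}")
print("Blocked:", has_errors(report))
```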

In [None]:
# Domain-specific validation example
def validate_security_query(query, schema):
    """Validate queries for security/compliance use cases."""
    issues = []
    
    # Check for required audit fields
    has_timestamp_filter = False
    
    for op in query:
        if "filter" in op:
            filters = op["filter"]
            if "timestamp" in filters or "created_at" in filters:
                has_timestamp_filter = True
                break
    
    if not has_timestamp_filter:
        from graphistry.compute.validate import ValidationIssue
        issues.append(ValidationIssue(
            level="warning",
            message="Security queries should include time-based filters",
            suggestion="Add timestamp or created_at filter for audit compliance"
        ))
    
    return issues

# Test security validation
security_query = [
    {"type": "n", "filter": {"type": {"eq": "payment"}}},
    {"type": "e_forward"},
    {"type": "n", "filter": {"risk_score": {"gt": 90}}}
]

security_issues = validate_security_query(security_query, schema)
print("Security validation for payment query:")
if security_issues:
    for issue in security_issues:
        print(f"⚠️  {issue.message}")
        print(f"   {issue.suggestion}")
else:
    print("✅ Passes security validation")

## Integration with Plottable

Advanced validation using Plottable objects.

In [None]:
# Create a Plottable and extract schema
g = graphistry.nodes(nodes_df, 'id').edges(edges_df, 'src', 'dst')

# Extract schema from Plottable
plottable_schema = extract_schema_from_plottable(g)

# Advanced query using Plottable schema
advanced_query = [
    {"type": "n", "filter": {
        "_and": [
            {"type": {"in": ["user", "payment"]}},
            {"risk_score": {"between": [70, 100]}}
        ]
    }},
    {"type": "e_forward", "filter": {
        "rel_type": {"in": ["purchased", "paid_for"]}
    }},
    {"type": "n", "name": "targets"}
]

# Validate against the Plottable's node and edge dataframes
issues = validate_query(advanced_query, nodes_df=g._nodes, edges_df=g._edges)
print("Advanced query validation with Plottable:")
print(f"Issues: {len(issues)}")
if not issues:
    print("✅ Query validated successfully against Plottable schema!")

## Summary & Best Practices

### Key Takeaways
1. **Multi-hop queries**: Specify hop limits so traversals stay bounded
2. **Complex predicates**: Nest `_and`/`_or` groups for sophisticated filtering
3. **Schema evolution**: Validate legacy queries against the new schema to catch renamed or removed columns
4. **Custom validation**: Extend for domain-specific requirements
5. **Performance**: Estimate query complexity during validation

### Best Practices
- ✅ Validate early and often during development
- ✅ Use named operations for complex patterns
- ✅ Add custom validation for business rules
- ✅ Cache schemas for better performance
- ✅ Monitor validation warnings in production
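
Schema caching is not demonstrated elsewhere in this notebook, so here is one possible shape for it. Everything in this sketch is hypothetical: `SchemaCache` memoizes whatever extractor it is given under a caller-chosen key, and `toy_extract` stands in for a real extractor such as `extract_schema_from_dataframes`.

```python
class SchemaCache:
    """Memoize schema extraction per dataset key (illustrative sketch)."""

    def __init__(self, extract):
        self._extract = extract  # callable taking (nodes, edges)
        self._schemas = {}
        self.extractions = 0     # number of cache misses so far

    def get(self, key, nodes, edges):
        """Return the cached schema for key, extracting it on first use."""
        if key not in self._schemas:
            self._schemas[key] = self._extract(nodes, edges)
            self.extractions += 1
        return self._schemas[key]

    def invalidate(self, key):
        """Drop a cached schema, e.g. after the dataset's columns change."""
        self._schemas.pop(key, None)

def toy_extract(nodes, edges):
    """Stand-in extractor: records sorted column names only."""
    return {"node_columns": sorted(nodes), "edge_columns": sorted(edges)}

cache = SchemaCache(toy_extract)
nodes = {"id": [1, 2], "type": ["user", "product"]}
edges = {"src": [1], "dst": [2]}

first = cache.get("orders_v1", nodes, edges)
second = cache.get("orders_v1", nodes, edges)
print(first is second, cache.extractions)
```

The second lookup reuses the stored schema, so the extractor runs once; calling `invalidate` forces a fresh extraction on the next `get`.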

### Next Steps
- [GFQL Validation for LLMs](./gfql_validation_llm.ipynb) - AI integration
- [Production Validation Patterns](./gfql_validation_production.ipynb) - Scale validation
- [GFQL Documentation](https://docs.graphistry.com/gfql/) - Complete reference