# GFQL Call Operations

This notebook demonstrates the Call operation in GFQL, which supports:
- Invoking user-defined functions on graph data
- Applying custom data transformations and enrichments
- Integrating with external services
- Running advanced analytics within graph queries

**Security Note**: Call operations can only invoke functions that appear on a safelist of allowed functions.

## Setup

In [None]:
import pandas as pd
import numpy as np
import graphistry
from graphistry import n, e_forward, e_reverse
from graphistry.compute.ast import ASTCall, ASTLet, ASTChainRef

# Convenience aliases (shorter names used in the rest of the notebook)
from graphistry import call, let

graphistry.__version__

## Sample Data: Network Traffic Analysis

In [None]:
# Create sample network traffic data
edges_df = pd.DataFrame({
    'src_ip': ['192.168.1.1', '192.168.1.1', '192.168.1.2', '10.0.0.1', '10.0.0.1', 
               '192.168.1.3', '192.168.1.3', '10.0.0.2', '172.16.0.1'],
    'dst_ip': ['192.168.1.2', '10.0.0.1', '192.168.1.3', '192.168.1.3', '10.0.0.2',
               '10.0.0.2', '172.16.0.1', '172.16.0.1', 'external.com'],
    'protocol': ['HTTP', 'HTTPS', 'SSH', 'HTTP', 'DNS', 'HTTPS', 'SSH', 'HTTP', 'HTTPS'],
    'bytes': [1024, 2048, 512, 4096, 128, 8192, 256, 16384, 32768],
    'packets': [10, 20, 5, 40, 2, 80, 3, 160, 320],
    'timestamp': pd.to_datetime([
        '2024-01-01 10:00:00', '2024-01-01 10:05:00', '2024-01-01 10:10:00',
        '2024-01-01 10:15:00', '2024-01-01 10:20:00', '2024-01-01 10:25:00',
        '2024-01-01 10:30:00', '2024-01-01 10:35:00', '2024-01-01 10:40:00'
    ])
})

nodes_df = pd.DataFrame({
    'ip': ['192.168.1.1', '192.168.1.2', '192.168.1.3', '10.0.0.1', 
           '10.0.0.2', '172.16.0.1', 'external.com'],
    'type': ['workstation', 'workstation', 'server', 'gateway', 
             'dns_server', 'web_server', 'external'],
    'risk_level': [0.2, 0.3, 0.7, 0.5, 0.4, 0.8, 0.9]
})

g = graphistry.edges(edges_df, 'src_ip', 'dst_ip').nodes(nodes_df, 'ip')
print(f"Graph has {len(g._nodes)} nodes and {len(g._edges)} edges")

## Basic Call Operations

Call operations allow you to invoke functions to transform or analyze data.

In [None]:
# Define some analysis functions (these would be in the safelist)
def calculate_traffic_score(df):
    """Score each edge by its bytes and packets relative to the column means."""
    if 'bytes' in df.columns and 'packets' in df.columns:
        # Work on a copy so the caller's DataFrame is not modified in place
        df = df.copy()
        df['traffic_score'] = (
            (df['bytes'] / df['bytes'].mean()) + 
            (df['packets'] / df['packets'].mean())
        ) / 2
    return df

def classify_protocol_risk(df):
    """Classify protocols by risk level."""
    risk_map = {
        'HTTP': 'medium',
        'HTTPS': 'low',
        'SSH': 'low',
        'DNS': 'low',
        'FTP': 'high',
        'TELNET': 'critical'
    }
    if 'protocol' in df.columns:
        df['protocol_risk'] = df['protocol'].map(risk_map).fillna('unknown')
    return df

# Use Call to apply transformations
# Note: In production, function names must be in the safelist
enriched = g.gfql([
    # Apply traffic scoring to all edges
    call('calculate_traffic_score', target='edges'),
    
    # Apply protocol risk classification
    call('classify_protocol_risk', target='edges')
])

print("Enriched edges:")
print(enriched._edges[['src_ip', 'dst_ip', 'protocol', 'traffic_score', 'protocol_risk']].head())
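
The cell above refers to functions by name but does not show how a name becomes a function. One way to picture it, as a minimal sketch rather than the library's actual mechanism, is a dictionary that maps safelisted names to callables, with `target` choosing which table the function receives. The names `REGISTRY`, `register`, and `dispatch_call` are hypothetical and are not part of graphistry.

```python
import pandas as pd

# Hypothetical registry: safelisted name -> callable (not a graphistry API)
REGISTRY = {}

def register(fn):
    """Add a function to the hypothetical safelist under its own name."""
    REGISTRY[fn.__name__] = fn
    return fn

@register
def classify_protocol_risk(df):
    """Label each protocol with a coarse risk class."""
    risk_map = {'HTTP': 'medium', 'HTTPS': 'low', 'SSH': 'low', 'DNS': 'low'}
    out = df.copy()  # leave the caller's DataFrame untouched
    out['protocol_risk'] = out['protocol'].map(risk_map).fillna('unknown')
    return out

def dispatch_call(name, tables, target='edges', args=None):
    """Look up `name` in the registry and apply it to tables[target]."""
    if name not in REGISTRY:
        raise KeyError(f"Function {name!r} is not in the safelist")
    result = dict(tables)  # shallow copy so the input mapping is untouched
    result[target] = REGISTRY[name](tables[target], **(args or {}))
    return result

tables = {'edges': pd.DataFrame({'protocol': ['HTTP', 'SSH', 'ICMP']})}
enriched_tables = dispatch_call('classify_protocol_risk', tables, target='edges')
print(enriched_tables['edges'])
```

Keeping the lookup in one place means an unknown name fails before any user data is touched.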

## Call with Arguments

Call operations can accept arguments to customize their behavior.

In [None]:
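import operator

# Map operation names to the comparison they perform
_COMPARATORS = {
    'gt': operator.gt,
    'lt': operator.lt,
    'gte': operator.ge,
    'lte': operator.le,
}

def filter_by_threshold(df, column, threshold, operation='gt'):
    """Keep rows where `column` compares to `threshold` under `operation`."""
    if operation not in _COMPARATORS:
        # Fail loudly instead of silently returning unfiltered data
        raise ValueError(f"Unknown operation: {operation!r}")
    if column not in df.columns:
        # Nothing to filter on; return the input unchanged
        return df
    return df[_COMPARATORS[operation](df[column], threshold)]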

# Filter high-traffic connections
high_traffic = g.gfql([
    # First calculate traffic scores
    call('calculate_traffic_score', target='edges'),
    
    # Then filter for high scores
    call('filter_by_threshold', 
         target='edges',
         args={'column': 'traffic_score', 'threshold': 1.5, 'operation': 'gt'})
])

print(f"High traffic connections: {len(high_traffic._edges)} edges")
print(high_traffic._edges[['src_ip', 'dst_ip', 'bytes', 'packets', 'traffic_score']])
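
To see what the 1.5 threshold selects, the same arithmetic can be run directly in pandas on the sample `bytes` and `packets` values, without GFQL. This is only a sketch of the computation the two calls are meant to perform.

```python
import pandas as pd

# Same bytes/packets values as the sample edges above
edges = pd.DataFrame({
    'bytes': [1024, 2048, 512, 4096, 128, 8192, 256, 16384, 32768],
    'packets': [10, 20, 5, 40, 2, 80, 3, 160, 320],
})

# Score = average of each column's ratio to its own mean,
# so the scores average to 1.0 across all edges
scores = (edges['bytes'] / edges['bytes'].mean()
          + edges['packets'] / edges['packets'].mean()) / 2
edges = edges.assign(traffic_score=scores)

# Keep only edges scoring above the threshold
high = edges[edges['traffic_score'] > 1.5]
print(high)
```

On this data only the two largest transfers (16384 and 32768 bytes) exceed 1.5.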

## Combining Call with Let Bindings

Call operations can appear inside Let bindings, so a named intermediate result can be passed on to a function.

In [None]:
def calculate_node_centrality(g):
    """Score each node by the mean of its in-degree and out-degree."""
    # Count incoming and outgoing connections
    in_degree = g._edges.groupby(g._destination).size().to_frame('in_degree')
    out_degree = g._edges.groupby(g._source).size().to_frame('out_degree')
    
    # Merge with nodes
    nodes = g._nodes.copy()
    nodes = nodes.merge(in_degree, left_on=g._node, right_index=True, how='left')
    nodes = nodes.merge(out_degree, left_on=g._node, right_index=True, how='left')
    nodes['centrality'] = (nodes['in_degree'].fillna(0) + nodes['out_degree'].fillna(0)) / 2
    
    return g.nodes(nodes)

# Complex analysis combining Let and Call
analysis = let({
    # Find high-risk nodes
    'risky_nodes': n({'risk_level': lambda x: x > 0.6}),
    
    # Find their network
    'risky_network': [
        ASTChainRef('risky_nodes'),
        e_forward(hops=2)
    ],
    
    # Calculate centrality for the risky network
    'analyzed_network': [
        ASTChainRef('risky_network'),
        call('calculate_node_centrality', target='graph')
    ]
})

result = g.gfql([analysis])
print("Nodes with centrality scores:")
print(result._nodes[['ip', 'type', 'risk_level', 'centrality']].sort_values('centrality', ascending=False))
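
The centrality score used above is the mean of a node's in-degree and out-degree. A minimal pandas-only sketch on a four-edge toy graph (the node names `a` to `d` are made up for illustration) shows the intermediate tables:

```python
import pandas as pd

edges = pd.DataFrame({'src': ['a', 'a', 'a', 'b'], 'dst': ['b', 'c', 'd', 'c']})
nodes = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})

# Count edges arriving at and leaving each node
in_degree = edges.groupby('dst').size().rename('in_degree')
out_degree = edges.groupby('src').size().rename('out_degree')

# Left-join so nodes missing from one side are kept, then treat missing as 0
scored = (
    nodes
    .merge(in_degree, left_on='id', right_index=True, how='left')
    .merge(out_degree, left_on='id', right_index=True, how='left')
    .fillna({'in_degree': 0, 'out_degree': 0})
)
scored['centrality'] = (scored['in_degree'] + scored['out_degree']) / 2
print(scored)
```

The left joins matter: node `a` has no incoming edges and nodes `c` and `d` have no outgoing ones, so an inner join would drop them.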

## Call for Data Enrichment

Call operations can enrich data with external information or complex calculations.

In [None]:
def enrich_with_geolocation(df):
    """Simulate IP geolocation enrichment."""
    # In production, this might call an actual geolocation service
    geo_data = {
        '192.168.1.1': {'country': 'US', 'city': 'New York', 'lat': 40.7128, 'lon': -74.0060},
        '192.168.1.2': {'country': 'US', 'city': 'New York', 'lat': 40.7128, 'lon': -74.0060},
        '192.168.1.3': {'country': 'US', 'city': 'Boston', 'lat': 42.3601, 'lon': -71.0589},
        '10.0.0.1': {'country': 'UK', 'city': 'London', 'lat': 51.5074, 'lon': -0.1278},
        '10.0.0.2': {'country': 'UK', 'city': 'London', 'lat': 51.5074, 'lon': -0.1278},
        '172.16.0.1': {'country': 'JP', 'city': 'Tokyo', 'lat': 35.6762, 'lon': 139.6503},
        'external.com': {'country': 'CN', 'city': 'Beijing', 'lat': 39.9042, 'lon': 116.4074}
    }
    
    if 'ip' in df.columns:
        for col in ['country', 'city', 'lat', 'lon']:
            df[col] = df['ip'].map(lambda x: geo_data.get(x, {}).get(col))
    
    return df

def calculate_geo_distance(df):
    """Calculate geographical distance for edges."""
    if all(col in df.columns for col in ['src_lat', 'src_lon', 'dst_lat', 'dst_lon']):
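        # Planar approximation: treats one degree of latitude or longitude
        # as ~111 km, which overstates east-west distances away from the equator
        df['geo_distance'] = np.sqrt(
            (df['dst_lat'] - df['src_lat'])**2 + 
            (df['dst_lon'] - df['src_lon'])**2
        ) * 111  # km per degree (approximate)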
    return df

# Enrich with geolocation data
geo_enriched = g.gfql([
    # Add geolocation to nodes
    call('enrich_with_geolocation', target='nodes'),
    
    # Calculate geographical distances for edges
    call('calculate_geo_distance', target='edges')
])

print("Nodes with geolocation:")
print(geo_enriched._nodes[['ip', 'type', 'country', 'city']].head())

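# Note: calculate_geo_distance is a no-op here because the edges have no
# src_lat/src_lon/dst_lat/dst_lon columns. Computing edge distances requires
# first joining the node geo data onto the edges.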
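
As the closing note says, edge distances need the node coordinates joined onto each edge first. The sketch below does that join in pandas and then uses the haversine great-circle formula in place of the planar approximation in `calculate_geo_distance`. The helper names `attach_endpoint_coords` and `haversine_km` are hypothetical, and the radius of 6371 km is the commonly used mean Earth radius.

```python
import numpy as np
import pandas as pd

def attach_endpoint_coords(edges, nodes, src='src_ip', dst='dst_ip', node='ip'):
    """Join node lat/lon onto edges as src_lat/src_lon/dst_lat/dst_lon."""
    coords = nodes[[node, 'lat', 'lon']]
    out = edges
    for side, col in (('src', src), ('dst', dst)):
        renamed = coords.rename(columns={
            node: col, 'lat': f'{side}_lat', 'lon': f'{side}_lon'})
        out = out.merge(renamed, on=col, how='left')
    return out

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Two nodes with the simulated coordinates used above (Boston, London)
nodes = pd.DataFrame({
    'ip': ['192.168.1.3', '10.0.0.1'],
    'lat': [42.3601, 51.5074],
    'lon': [-71.0589, -0.1278],
})
edges = pd.DataFrame({'src_ip': ['10.0.0.1'], 'dst_ip': ['192.168.1.3']})

edges_geo = attach_endpoint_coords(edges, nodes)
edges_geo['geo_distance'] = haversine_km(
    edges_geo['src_lat'], edges_geo['src_lon'],
    edges_geo['dst_lat'], edges_geo['dst_lon'])
print(edges_geo[['src_ip', 'dst_ip', 'geo_distance']])
```

Because a function with `target='edges'` only sees the edge table, a join like this would most naturally live in a function that receives the whole graph.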

## Advanced: Multi-Stage Analysis Pipeline

Combine multiple Call operations in a complex analysis pipeline.

In [None]:
def detect_anomalies(df, columns, method='zscore', threshold=2):
    """Detect anomalies in specified columns."""
    for col in columns:
        if col in df.columns:
            if method == 'zscore':
                mean = df[col].mean()
                std = df[col].std()
                df[f'{col}_anomaly'] = np.abs((df[col] - mean) / std) > threshold
            elif method == 'iqr':
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                df[f'{col}_anomaly'] = (
                    (df[col] < (Q1 - 1.5 * IQR)) | 
                    (df[col] > (Q3 + 1.5 * IQR))
                )
    return df

def aggregate_risk_scores(g):
    """Aggregate risk scores from edges to nodes."""
    if 'traffic_score' in g._edges.columns:
        # Max traffic score over each node's outgoing edges (grouped by source only)
        node_risk = g._edges.groupby(g._source)['traffic_score'].max().to_frame('max_traffic_score')
        
        nodes = g._nodes.copy()
        nodes = nodes.merge(node_risk, left_on=g._node, right_index=True, how='left')
        nodes['combined_risk'] = (
            nodes['risk_level'] * 0.5 + 
            nodes['max_traffic_score'].fillna(0) * 0.5
        )
        
        return g.nodes(nodes)
    return g

# Multi-stage security analysis pipeline
security_analysis = g.gfql([
    # Stage 1: Calculate traffic scores
    call('calculate_traffic_score', target='edges'),
    
    # Stage 2: Classify protocol risks
    call('classify_protocol_risk', target='edges'),
    
    # Stage 3: Detect anomalies in traffic
    call('detect_anomalies', 
         target='edges',
         args={'columns': ['bytes', 'packets'], 'method': 'zscore', 'threshold': 1.5}),
    
    # Stage 4: Aggregate risks to nodes
    call('aggregate_risk_scores', target='graph'),
    
    # Stage 5: Filter for high-risk scenarios
    n({'combined_risk': lambda x: x > 0.7}),
    e_forward(),
    n()
])

print(f"Security analysis found {len(security_analysis._nodes)} nodes of interest")
print("\nHigh-risk nodes:")
print(security_analysis._nodes[['ip', 'type', 'risk_level', 'combined_risk']].sort_values('combined_risk', ascending=False))
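
The two methods in `detect_anomalies` use different notions of "unusual", so it helps to run both on one column. This pandas-only sketch applies them to the sample `bytes` values with the same 1.5 z-score threshold as Stage 3. It relies on pandas defaults: `Series.std()` is the sample standard deviation (`ddof=1`) and `quantile` interpolates linearly.

```python
import pandas as pd

bytes_col = pd.Series([1024, 2048, 512, 4096, 128, 8192, 256, 16384, 32768])

# z-score: distance from the mean in units of standard deviation
z = (bytes_col - bytes_col.mean()) / bytes_col.std()
zscore_flags = z.abs() > 1.5

# IQR: more than 1.5 * IQR below the first quartile or above the third
q1, q3 = bytes_col.quantile(0.25), bytes_col.quantile(0.75)
iqr = q3 - q1
iqr_flags = (bytes_col < q1 - 1.5 * iqr) | (bytes_col > q3 + 1.5 * iqr)

print(pd.DataFrame({'bytes': bytes_col, 'z': z.round(2),
                    'zscore_flag': zscore_flags, 'iqr_flag': iqr_flags}))
```

On this column both methods flag only the 32768-byte transfer; the 16384-byte transfer has a z-score below 1.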

## Call with Custom Return Types

A called function still returns a graph here, but it can attach other results to it, such as a summary dictionary stored as metadata.

In [None]:
def summarize_network(g):
    """Generate a summary report of the network."""
    summary = {
        'node_count': len(g._nodes),
        'edge_count': len(g._edges),
        'node_types': g._nodes['type'].value_counts().to_dict() if 'type' in g._nodes else {},
        'protocols': g._edges['protocol'].value_counts().to_dict() if 'protocol' in g._edges else {},
        'avg_bytes': g._edges['bytes'].mean() if 'bytes' in g._edges else 0,
        'total_traffic': g._edges['bytes'].sum() if 'bytes' in g._edges else 0
    }
    
    # Add summary as graph metadata (in practice, might return separately)
    g._metadata = summary
    return g

# Generate network summary
summarized = g.gfql([
    call('summarize_network', target='graph')
])

print("Network Summary:")
if hasattr(summarized, '_metadata'):
    for key, value in summarized._metadata.items():
        print(f"  {key}: {value}")

## Security Considerations

Call operations have important security features:

1. **Safelist**: Only pre-approved functions can be called
2. **Sandboxing**: Functions run in a restricted environment
3. **Resource Limits**: Execution time and memory are bounded
4. **Input Validation**: Arguments are validated before execution

In [None]:
# Example of safelist configuration (typically done at server level)
SAFELIST = {
    'calculate_traffic_score': {
        'module': 'network_analysis',
        'allowed_args': ['method'],
        'timeout': 30,
        'memory_limit': '1GB'
    },
    'classify_protocol_risk': {
        'module': 'security_utils',
        'allowed_args': [],
        'timeout': 10
    },
    'filter_by_threshold': {
        'module': 'data_filters',
        'allowed_args': ['column', 'threshold', 'operation'],
        'timeout': 20
    }
}

# Attempting to call non-safelisted function would raise an error
try:
    # This would fail in production if not in safelist
    result = g.gfql([
        call('dangerous_function', target='graph')
    ])
except Exception as e:
    # Report the error that was actually raised rather than a hard-coded message
    print(f"Expected error: {type(e).__name__}: {e}")
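
The safelist and input-validation steps can be pictured as a check of the function name and argument names against the safelist entry before anything runs. The sketch below is hypothetical: `validate_call` is not a graphistry function, and the configuration shape simply mirrors the `SAFELIST` dictionary above.

```python
# Trimmed copy of the example configuration above
SAFELIST = {
    'classify_protocol_risk': {'allowed_args': []},
    'filter_by_threshold': {'allowed_args': ['column', 'threshold', 'operation']},
}

def validate_call(name, args=None, safelist=SAFELIST):
    """Raise if `name` is not safelisted or `args` has unexpected keys."""
    if name not in safelist:
        raise PermissionError(f"Function {name!r} not in safelist")
    unexpected = set(args or {}) - set(safelist[name]['allowed_args'])
    if unexpected:
        raise ValueError(
            f"Unexpected arguments for {name!r}: {sorted(unexpected)}")
    return True

# A safelisted call with permitted arguments passes
validate_call('filter_by_threshold', {'column': 'bytes', 'threshold': 1000})

# An unknown function is rejected before it can run
try:
    validate_call('dangerous_function')
except PermissionError as err:
    print(f"Rejected: {err}")
```

Checking argument names separately from the function name keeps a safelisted function from being driven with parameters its entry never allowed.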

## Summary

Call operations in GFQL provide powerful capabilities for:

1. **Data Transformation**: Apply complex transformations to graph data
2. **Enrichment**: Add external data or calculated fields
3. **Analysis**: Run sophisticated algorithms within queries
4. **Integration**: Connect with external services and APIs
5. **Pipelines**: Build multi-stage analysis workflows

Key concepts:
- `call(function_name, target='nodes'|'edges'|'graph', args={...})`
- Functions must be in the server's safelist
- Can be combined with Let bindings and other operations
- Support for different return types and side effects
- Security through sandboxing and resource limits