
Statistical Analysis

Temp edited this page Oct 3, 2025 · 1 revision

Statistical Analysis Tools

Eight specialized tools for descriptive statistics, correlation analysis, and time-series operations in PostgreSQL.


📊 Overview

| Tool | Purpose | Statistical Method |
| --- | --- | --- |
| stat_describe | Descriptive statistics | Mean, median, stddev, percentiles |
| stat_correlation | Correlation analysis | Pearson correlation coefficient |
| stat_regression | Linear regression | Least squares regression |
| stat_distribution | Distribution analysis | Histogram, quartiles, outliers |
| stat_time_series | Time-series analysis | Moving averages, trends |
| stat_aggregates | Advanced aggregations | Mode, variance, skewness |
| stat_percentile | Percentile calculations | Custom percentile ranges |
| stat_outliers | Outlier detection | IQR method, z-score |

🔧 Tool Details

stat_describe

Calculate comprehensive descriptive statistics for numeric columns.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Numeric column to analyze
  • group_by (string, optional): Column to group by

Returns:

  • count: Number of values
  • mean: Average value
  • median: Middle value
  • mode: Most frequent value
  • stddev: Standard deviation
  • variance: Variance
  • min: Minimum value
  • max: Maximum value
  • quartiles: Q1, Q2 (median), Q3
  • iqr: Interquartile range

Example:

result = stat_describe(
    table_name="sales",
    column_name="amount"
)
# Returns: {
#   "count": 10000,
#   "mean": 125.45,
#   "median": 98.50,
#   "stddev": 45.23,
#   "min": 5.00,
#   "max": 500.00,
#   "quartiles": {"q1": 65.00, "q2": 98.50, "q3": 145.00},
#   "iqr": 80.00
# }

# With grouping
result = stat_describe(
    table_name="sales",
    column_name="amount",
    group_by="region"
)

Use Cases:

  • Data exploration
  • Quality assurance
  • Performance baselines
  • Report generation
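
For intuition, the same summary statistics can be reproduced on a small in-memory sample with Python's standard `statistics` module. This is an illustrative sketch only, not the tool's PostgreSQL implementation:

```python
import statistics

def describe(values):
    # Quartiles via linear interpolation, treating min/max as endpoints
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "quartiles": {"q1": q1, "q2": q2, "q3": q3},
        "iqr": q3 - q1,
    }

stats = describe([5, 10, 15, 20, 25, 30, 35])
# stats["median"] == 20, stats["iqr"] == 15.0
```

Note that different quartile interpolation methods (and different databases) can produce slightly different values on small samples.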

stat_correlation

Calculate correlation between two numeric columns.

Parameters:

  • table_name (string, required): Source table
  • column1 (string, required): First numeric column
  • column2 (string, required): Second numeric column

Returns:

  • correlation: Pearson correlation coefficient (-1 to 1)
  • p_value: Statistical significance
  • sample_size: Number of data points
  • interpretation: Human-readable interpretation

Example:

result = stat_correlation(
    table_name="products",
    column1="price",
    column2="sales_volume"
)
# Returns: {
#   "correlation": -0.78,
#   "p_value": 0.0001,
#   "sample_size": 500,
#   "interpretation": "Strong negative correlation"
# }

Interpretation:

  • 1.0: Perfect positive correlation
  • 0.7 to 1.0: Strong positive correlation
  • 0.4 to 0.7: Moderate positive correlation
  • 0.1 to 0.4: Weak positive correlation
  • -0.1 to 0.1: Little or no correlation
  • -0.4 to -0.1: Weak negative correlation
  • -0.7 to -0.4: Moderate negative correlation
  • -1.0 to -0.7: Strong negative correlation
  • -1.0: Perfect negative correlation

Use Cases:

  • Price vs demand analysis
  • Feature selection for ML
  • A/B test validation
  • Business metric relationships
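
Under the hood, the Pearson coefficient is just covariance normalized by the two standard deviations. A pure-Python sketch for sanity-checking results on small samples (the tool itself presumably delegates to the database; this is not its implementation):

```python
def pearson(xs, ys):
    # r = cov(x, y) / (stddev(x) * stddev(y)), always in [-1, 1]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r_pos = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # ~1.0: perfect positive
r_neg = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])   # ~-1.0: perfect negative
```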

stat_regression

Perform linear regression analysis.

Parameters:

  • table_name (string, required): Source table
  • x_column (string, required): Independent variable
  • y_column (string, required): Dependent variable

Returns:

  • slope: Regression coefficient
  • intercept: Y-intercept
  • r_squared: Coefficient of determination
  • equation: Regression equation
  • interpretation: Human-readable interpretation

Example:

result = stat_regression(
    table_name="marketing",
    x_column="ad_spend",
    y_column="revenue"
)
# Returns: {
#   "slope": 3.45,
#   "intercept": 1000.00,
#   "r_squared": 0.85,
#   "equation": "y = 3.45x + 1000.00",
#   "interpretation": "For every $1 in ad spend, revenue increases by $3.45"
# }

Use Cases:

  • Revenue forecasting
  • Cost prediction
  • Trend analysis
  • ROI calculation
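
The least-squares fit reduces to two closed-form expressions, plus a goodness-of-fit ratio. A sketch of the math with made-up points lying exactly on y = 3.45x + 1000 (illustrative, not the tool's implementation):

```python
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = cov(x, y) / var(x); the fitted line passes through (mx, my)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    # r_squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = linear_fit([0, 1, 2, 3], [1000.0, 1003.45, 1006.9, 1010.35])
# slope ~3.45, intercept ~1000.0, r2 ~1.0
```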

stat_distribution

Analyze data distribution with histograms and frequency analysis.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Column to analyze
  • num_bins (integer, optional): Number of histogram bins (default: 10)

Returns:

  • histogram: Frequency distribution
  • quartiles: Q1, Q2 (median), Q3
  • outliers: Values outside the normal range
  • skewness: Distribution asymmetry (positive = right-skewed)
  • kurtosis: Distribution tailedness

Example:

result = stat_distribution(
    table_name="orders",
    column_name="order_value",
    num_bins=10
)
# Returns: {
#   "histogram": [
#     {"bin": "0-50", "count": 120},
#     {"bin": "50-100", "count": 450},
#     {"bin": "100-150", "count": 380},
#     ...
#   ],
#   "quartiles": [25, 75, 125],
#   "outliers": {"low": [], "high": [850, 920, 1050]},
#   "skewness": 0.45,
#   "kurtosis": 2.1
# }

Use Cases:

  • Data quality checks
  • Anomaly detection
  • Pricing strategy
  • Inventory optimization
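
The histogram output amounts to equal-width binning over the column's range. A hypothetical sketch of how bin labels like "0-50" could be produced (not the tool's actual SQL):

```python
def histogram(values, num_bins=10):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes at least two distinct values
    counts = [0] * num_bins
    for v in values:
        # clamp so the maximum value lands in the last bin
        counts[min(int((v - lo) / width), num_bins - 1)] += 1
    return [{"bin": f"{lo + i * width:g}-{lo + (i + 1) * width:g}", "count": c}
            for i, c in enumerate(counts)]

hist = histogram([0, 25, 50, 75, 100], num_bins=2)
# [{"bin": "0-50", "count": 2}, {"bin": "50-100", "count": 3}]
```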

stat_time_series

Analyze time-series data with moving averages and trends.

Parameters:

  • table_name (string, required): Source table
  • time_column (string, required): Timestamp/date column
  • value_column (string, required): Numeric value column
  • window_size (integer, optional): Moving average window (default: 7)

Returns:

  • moving_average: Rolling average values
  • trend: Linear trend direction
  • seasonality: Detected patterns
  • forecast: Next period prediction

Example:

result = stat_time_series(
    table_name="daily_sales",
    time_column="sale_date",
    value_column="total_amount",
    window_size=7
)
# Returns: {
#   "moving_average": [
#     {"date": "2025-10-01", "value": 1250.50, "ma_7": 1200.00},
#     {"date": "2025-10-02", "value": 1300.00, "ma_7": 1225.00},
#     ...
#   ],
#   "trend": "increasing",
#   "trend_slope": 15.5,
#   "forecast_next": 1450.00
# }

Use Cases:

  • Sales forecasting
  • Demand planning
  • Performance monitoring
  • Capacity planning
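
A trailing moving average with window size w is simply the mean of the most recent w values, emitted once a full window is available; earlier positions have no value, which is why a 7-day window only starts on day 7. A minimal sketch:

```python
def moving_average(values, window=7):
    # one output per position from index window-1 onward
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

ma = moving_average([1, 2, 3, 4, 5], window=3)
# [2.0, 3.0, 4.0]
```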

stat_aggregates

Calculate advanced statistical aggregates.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Column to analyze
  • group_by (string, optional): Grouping column

Returns:

  • mode: Most frequent value
  • variance: Variance
  • stddev_pop: Population standard deviation
  • stddev_samp: Sample standard deviation
  • coef_variation: Coefficient of variation

Example:

result = stat_aggregates(
    table_name="sensor_data",
    column_name="temperature",
    group_by="device_id"
)
# Returns: {
#   "groups": [
#     {
#       "device_id": "sensor_1",
#       "mode": 22.5,
#       "variance": 4.2,
#       "stddev_pop": 2.05,
#       "coef_variation": 0.091
#     },
#     ...
#   ]
# }

Use Cases:

  • Quality control
  • Process monitoring
  • Variance analysis
  • Data validation
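
For reference, these aggregates map directly onto Python's `statistics` module; the distinction that matters is population vs sample standard deviation, and the coefficient of variation is just stddev scaled by the mean. A local sketch (illustrative only):

```python
import statistics

def aggregates(values):
    mean = statistics.mean(values)
    pstdev = statistics.pstdev(values)  # population stddev
    return {
        "mode": statistics.mode(values),
        "variance": statistics.pvariance(values),
        "stddev_pop": pstdev,
        "stddev_samp": statistics.stdev(values),  # divides by n-1
        # coefficient of variation: spread relative to the mean (unitless)
        "coef_variation": pstdev / mean,
    }

agg = aggregates([10, 20, 20, 30])
# mode 20, variance 50, coef_variation ~0.354
```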

stat_percentile

Calculate custom percentiles and quantiles.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Numeric column
  • percentiles (list, required): List of percentiles (0-100)

Returns:

  • Percentile values for requested percentiles

Example:

result = stat_percentile(
    table_name="response_times",
    column_name="latency_ms",
    percentiles=[50, 90, 95, 99]
)
# Returns: {
#   "p50": 45.2,
#   "p90": 125.8,
#   "p95": 180.5,
#   "p99": 450.0
# }

Use Cases:

  • SLA monitoring
  • Performance analysis
  • Capacity planning
  • Quality metrics
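
Continuous percentiles such as p99 are typically computed by interpolating linearly between the two nearest ranks, which is how PostgreSQL's percentile_cont behaves. A sketch of that interpolation (whether the tool uses percentile_cont is an assumption):

```python
def percentile(values, p):
    # continuous percentile: interpolate between the two nearest ranks
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

latencies = list(range(1, 101))  # 1..100 ms
p50, p99 = percentile(latencies, 50), percentile(latencies, 99)
# p50 == 50.5, p99 ~99.01
```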

stat_outliers

Detect outliers using statistical methods.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Column to analyze
  • method (string, optional): Detection method (iqr, zscore)
  • threshold (number, optional): Detection threshold

Returns:

  • outliers: List of outlier values
  • lower_bound: Lower threshold
  • upper_bound: Upper threshold
  • outlier_count: Number of outliers
  • outlier_percentage: Fraction of rows flagged as outliers (0.05 = 5%)

Example:

# IQR method (default)
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="iqr"
)
# Returns: {
#   "method": "iqr",
#   "lower_bound": -50.00,
#   "upper_bound": 350.00,
#   "outliers": [450.00, 520.00, 890.00],
#   "outlier_count": 3,
#   "outlier_percentage": 0.03
# }

# Z-score method
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="zscore",
    threshold=3.0
)

Use Cases:

  • Fraud detection
  • Data cleansing
  • Anomaly detection
  • Quality assurance
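
Both detection methods are standard textbook techniques. A pure-Python sketch of each; the 1.5×IQR multiplier is the conventional choice, and whether the tool uses exactly these formulas is an assumption:

```python
import statistics

def iqr_outliers(values, k=1.5):
    # flag values outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return {"lower_bound": lo, "upper_bound": hi,
            "outliers": [v for v in values if v < lo or v > hi]}

def zscore_outliers(values, threshold=3.0):
    # flag values more than `threshold` population stddevs from the mean
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [v for v in values if abs(v - mu) / sd > threshold]

amounts = [10, 12, 11, 13, 12, 100]
flagged = iqr_outliers(amounts)  # outliers: [100]
```

The z-score method is more sensitive to the threshold on small samples, since an extreme value inflates the standard deviation it is measured against.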

🎯 Common Workflows

Data Exploration Workflow

# 1. Get descriptive statistics
desc = stat_describe(
    table_name="sales",
    column_name="revenue"
)

# 2. Check distribution
dist = stat_distribution(
    table_name="sales",
    column_name="revenue",
    num_bins=20
)

# 3. Detect outliers
outliers = stat_outliers(
    table_name="sales",
    column_name="revenue",
    method="iqr"
)

Correlation Analysis Workflow

# 1. Check correlations
corr_price_sales = stat_correlation(
    table_name="products",
    column1="price",
    column2="units_sold"
)

corr_marketing_revenue = stat_correlation(
    table_name="campaigns",
    column1="ad_spend",
    column2="revenue"
)

# 2. Build regression model
regression = stat_regression(
    table_name="campaigns",
    x_column="ad_spend",
    y_column="revenue"
)

Time-Series Analysis Workflow

# 1. Calculate moving averages
time_series = stat_time_series(
    table_name="daily_metrics",
    time_column="date",
    value_column="revenue",
    window_size=7
)

# 2. Get percentiles for SLA
percentiles = stat_percentile(
    table_name="daily_metrics",
    column_name="response_time",
    percentiles=[50, 90, 95, 99]
)

# 3. Detect anomalies
outliers = stat_outliers(
    table_name="daily_metrics",
    column_name="revenue",
    method="zscore",
    threshold=3.0
)

📊 Best Practices

1. Data Quality First

# Always check for nulls and outliers first
desc = stat_describe(table_name="data", column_name="value")
if desc["count"] < total_rows:
    print("Warning: Missing values detected")

outliers = stat_outliers(table_name="data", column_name="value")
if outliers["outlier_percentage"] > 0.05:
    print("Warning: >5% outliers detected")

2. Use Appropriate Grouping

# Regional analysis
stat_describe(
    table_name="sales",
    column_name="revenue",
    group_by="region"
)

# Time-based analysis
stat_aggregates(
    table_name="metrics",
    column_name="value",
    group_by="DATE_TRUNC('month', timestamp)"
)

3. Validate Correlations

corr = stat_correlation(table_name="data", column1="x", column2="y")

# Check statistical significance
if corr["p_value"] < 0.05:
    print("Correlation is statistically significant")
    
# Check sample size
if corr["sample_size"] < 30:
    print("Warning: Small sample size")

See Home for more tool categories.
