# Statistical Analysis
8 specialized tools for descriptive statistics, correlation analysis, and time-series operations in PostgreSQL.
| Tool | Purpose | Statistical Method |
|---|---|---|
| `stat_describe` | Descriptive statistics | Mean, median, stddev, percentiles |
| `stat_correlation` | Correlation analysis | Pearson correlation coefficient |
| `stat_regression` | Linear regression | Least squares regression |
| `stat_distribution` | Distribution analysis | Histogram, quartiles, outliers |
| `stat_time_series` | Time-series analysis | Moving averages, trends |
| `stat_aggregates` | Advanced aggregations | Mode, variance, skewness |
| `stat_percentile` | Percentile calculations | Custom percentile ranges |
| `stat_outliers` | Outlier detection | IQR method, z-score |
## stat_describe

Calculate comprehensive descriptive statistics for numeric columns.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Numeric column to analyze
- `group_by` (string, optional): Column to group by
Returns:

- `count`: Number of values
- `mean`: Average value
- `median`: Middle value
- `mode`: Most frequent value
- `stddev`: Standard deviation
- `variance`: Variance
- `min`: Minimum value
- `max`: Maximum value
- `quartiles`: Q1, Q2 (median), Q3
- `iqr`: Interquartile range
Example:

```python
result = stat_describe(
    table_name="sales",
    column_name="amount"
)
# Returns: {
#   "count": 10000,
#   "mean": 125.45,
#   "median": 98.50,
#   "stddev": 45.23,
#   "min": 5.00,
#   "max": 500.00,
#   "quartiles": {"q1": 65.00, "q2": 98.50, "q3": 145.00},
#   "iqr": 80.00
# }

# With grouping
result = stat_describe(
    table_name="sales",
    column_name="amount",
    group_by="region"
)
```
Use Cases:
- Data exploration
- Quality assurance
- Performance baselines
- Report generation
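For spot-checking the tool's output, the same summary metrics can be reproduced locally; a minimal sketch using Python's `statistics` module (the tool itself computes these server-side in PostgreSQL, and the exact quartile method it uses is an assumption here):

```python
import statistics


def describe(values):
    """Pure-Python version of the summary metrics stat_describe returns."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "quartiles": {"q1": q1, "q2": q2, "q3": q3},
        "iqr": q3 - q1,
    }
```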
## stat_correlation

Calculate correlation between two numeric columns.
Parameters:

- `table_name` (string, required): Source table
- `column1` (string, required): First numeric column
- `column2` (string, required): Second numeric column
Returns:

- `correlation`: Pearson correlation coefficient (-1 to 1)
- `p_value`: Statistical significance
- `sample_size`: Number of data points
- `interpretation`: Human-readable interpretation
Example:

```python
result = stat_correlation(
    table_name="products",
    column1="price",
    column2="sales_volume"
)
# Returns: {
#   "correlation": -0.78,
#   "p_value": 0.0001,
#   "sample_size": 500,
#   "interpretation": "Strong negative correlation"
# }
```
Interpretation:

- `1.0`: Perfect positive correlation
- `0.7 to 0.9`: Strong positive correlation
- `0.4 to 0.6`: Moderate positive correlation
- `0.1 to 0.3`: Weak positive correlation
- `0.0`: No correlation
- `-0.1 to -0.3`: Weak negative correlation
- `-0.4 to -0.6`: Moderate negative correlation
- `-0.7 to -0.9`: Strong negative correlation
- `-1.0`: Perfect negative correlation
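The scale above maps onto a small helper like the following (a hypothetical sketch; the server generates the `interpretation` field itself, and the exact boundary handling is an assumption):

```python
def interpret_correlation(r: float) -> str:
    """Map a Pearson coefficient onto the interpretation scale above."""
    strength = abs(r)
    if strength == 1.0:
        label = "Perfect"
    elif strength >= 0.7:
        label = "Strong"
    elif strength >= 0.4:
        label = "Moderate"
    elif strength >= 0.1:
        label = "Weak"
    else:
        return "No correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"
```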
Use Cases:
- Price vs demand analysis
- Feature selection for ML
- A/B test validation
- Business metric relationships
## stat_regression

Perform linear regression analysis.
Parameters:

- `table_name` (string, required): Source table
- `x_column` (string, required): Independent variable
- `y_column` (string, required): Dependent variable
Returns:

- `slope`: Regression coefficient
- `intercept`: Y-intercept
- `r_squared`: Coefficient of determination
- `equation`: Regression equation
- `predictions`: Sample predictions
Example:

```python
result = stat_regression(
    table_name="marketing",
    x_column="ad_spend",
    y_column="revenue"
)
# Returns: {
#   "slope": 3.45,
#   "intercept": 1000.00,
#   "r_squared": 0.85,
#   "equation": "y = 3.45x + 1000.00",
#   "interpretation": "For every $1 in ad spend, revenue increases by $3.45"
# }
```
Use Cases:
- Revenue forecasting
- Cost prediction
- Trend analysis
- ROI calculation
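The least squares math itself is compact enough to sketch; a pure-Python version of the slope, intercept, and R² calculation (PostgreSQL exposes the same quantities through its `regr_slope`, `regr_intercept`, and `regr_r2` aggregates, which the tool plausibly uses):

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    # R^2: proportion of variance in y explained by the fitted line
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared
```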
## stat_distribution

Analyze data distribution with histograms and frequency analysis.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Column to analyze
- `num_bins` (integer, optional): Number of histogram bins (default: 10)
Returns:

- `histogram`: Frequency distribution
- `quartiles`: Q1, Q2, Q3, Q4
- `outliers`: Values outside normal range
- `skewness`: Distribution skew
- `kurtosis`: Distribution peakedness
Example:

```python
result = stat_distribution(
    table_name="orders",
    column_name="order_value",
    num_bins=10
)
# Returns: {
#   "histogram": [
#     {"bin": "0-50", "count": 120},
#     {"bin": "50-100", "count": 450},
#     {"bin": "100-150", "count": 380},
#     ...
#   ],
#   "quartiles": [25, 75, 125, 250],
#   "outliers": {"low": [], "high": [850, 920, 1050]},
#   "skewness": 0.45,
#   "kurtosis": 2.1
# }
```
Use Cases:
- Data quality checks
- Anomaly detection
- Pricing strategy
- Inventory optimization
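The bin labels in the example suggest equal-width binning over the column's range; a sketch of that scheme (the exact edge convention the tool uses, e.g. where the maximum lands, is an assumption):

```python
def histogram(values, num_bins=10):
    """Equal-width binning over [min, max], like the histogram field above."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the maximum value into the last bin
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [
        {"bin": f"{lo + i * width:g}-{lo + (i + 1) * width:g}", "count": c}
        for i, c in enumerate(counts)
    ]
```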
## stat_time_series

Analyze time-series data with moving averages and trends.
Parameters:

- `table_name` (string, required): Source table
- `time_column` (string, required): Timestamp/date column
- `value_column` (string, required): Numeric value column
- `window_size` (integer, optional): Moving average window (default: 7)
Returns:

- `moving_average`: Rolling average values
- `trend`: Linear trend direction
- `seasonality`: Detected patterns
- `forecast`: Next period prediction
Example:

```python
result = stat_time_series(
    table_name="daily_sales",
    time_column="sale_date",
    value_column="total_amount",
    window_size=7
)
# Returns: {
#   "moving_average": [
#     {"date": "2025-10-01", "value": 1250.50, "ma_7": 1200.00},
#     {"date": "2025-10-02", "value": 1300.00, "ma_7": 1225.00},
#     ...
#   ],
#   "trend": "increasing",
#   "trend_slope": 15.5,
#   "forecast_next": 1450.00
# }
```
Use Cases:
- Sales forecasting
- Demand planning
- Performance monitoring
- Capacity planning
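A trailing moving average is the core of the `ma_7` column above; a minimal sketch (how the tool handles the first `window_size - 1` rows, where no full window exists yet, is an assumption — here they come back as `None`):

```python
def moving_average(values, window_size=7):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            out.append(None)  # not enough history yet
        else:
            window = values[i + 1 - window_size : i + 1]
            out.append(sum(window) / window_size)
    return out
```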
## stat_aggregates

Calculate advanced statistical aggregates.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Column to analyze
- `group_by` (string, optional): Grouping column
Returns:

- `mode`: Most frequent value
- `variance`: Variance
- `stddev_pop`: Population standard deviation
- `stddev_samp`: Sample standard deviation
- `coef_variation`: Coefficient of variation
Example:

```python
result = stat_aggregates(
    table_name="sensor_data",
    column_name="temperature",
    group_by="device_id"
)
# Returns: {
#   "groups": [
#     {
#       "device_id": "sensor_1",
#       "mode": 22.5,
#       "variance": 4.2,
#       "stddev_pop": 2.05,
#       "coef_variation": 0.091
#     },
#     ...
#   ]
# }
```
Use Cases:
- Quality control
- Process monitoring
- Variance analysis
- Data validation
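Of these, the coefficient of variation is the least standard; it is the standard deviation expressed relative to the mean, which makes spread comparable across groups with different scales. A sketch (using the population standard deviation, consistent with the example's `stddev_pop`/`coef_variation` values, though that choice is an assumption):

```python
import statistics


def coef_variation(values):
    """Coefficient of variation: population stddev divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)
```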
## stat_percentile

Calculate custom percentiles and quantiles.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Numeric column
- `percentiles` (list, required): List of percentiles (0-100)
Returns:

- A value for each requested percentile (keyed `p50`, `p90`, etc.)
Example:

```python
result = stat_percentile(
    table_name="response_times",
    column_name="latency_ms",
    percentiles=[50, 90, 95, 99]
)
# Returns: {
#   "p50": 45.2,
#   "p90": 125.8,
#   "p95": 180.5,
#   "p99": 450.0
# }
```
Use Cases:
- SLA monitoring
- Performance analysis
- Capacity planning
- Quality metrics
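PostgreSQL's `percentile_cont` aggregate, a plausible backend for this tool, interpolates linearly between sorted values; a pure-Python sketch of that semantics (whether the tool uses continuous or discrete percentiles is an assumption):

```python
def percentile(values, p):
    """Linear-interpolation percentile, like PostgreSQL's percentile_cont."""
    xs = sorted(values)
    rank = (p / 100) * (len(xs) - 1)  # fractional index into sorted data
    lo = int(rank)
    frac = rank - lo
    if lo + 1 == len(xs):
        return xs[lo]
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])
```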
## stat_outliers

Detect outliers using statistical methods.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Column to analyze
- `method` (string, optional): Detection method (`iqr`, `zscore`)
- `threshold` (number, optional): Detection threshold
Returns:

- `outliers`: List of outlier values
- `lower_bound`: Lower threshold
- `upper_bound`: Upper threshold
- `outlier_count`: Number of outliers
- `outlier_percentage`: Percentage of data
Example:

```python
# IQR method (default)
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="iqr"
)
# Returns: {
#   "method": "iqr",
#   "lower_bound": -50.00,
#   "upper_bound": 350.00,
#   "outliers": [450.00, 520.00, 890.00],
#   "outlier_count": 3,
#   "outlier_percentage": 0.03
# }

# Z-score method
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="zscore",
    threshold=3.0
)
```
Use Cases:
- Fraud detection
- Data cleansing
- Anomaly detection
- Quality assurance
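Both detection methods are standard textbook techniques and can be sketched directly (the default IQR multiplier of 1.5 and the use of the sample standard deviation for z-scores are assumptions about the tool's implementation):

```python
import statistics


def iqr_bounds(values, threshold=1.5):
    """IQR fences: anything outside [Q1 - k*IQR, Q3 + k*IQR] is an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


def zscore_outliers(values, threshold=3.0):
    """Values whose standardized distance from the mean exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]
```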
## Workflows

Data exploration:

```python
# 1. Get descriptive statistics
desc = stat_describe(
    table_name="sales",
    column_name="revenue"
)

# 2. Check distribution
dist = stat_distribution(
    table_name="sales",
    column_name="revenue",
    num_bins=20
)

# 3. Detect outliers
outliers = stat_outliers(
    table_name="sales",
    column_name="revenue",
    method="iqr"
)
```

Correlation and regression modeling:

```python
# 1. Check correlations
corr_price_sales = stat_correlation(
    table_name="products",
    column1="price",
    column2="units_sold"
)

corr_marketing_revenue = stat_correlation(
    table_name="campaigns",
    column1="ad_spend",
    column2="revenue"
)

# 2. Build regression model
regression = stat_regression(
    table_name="campaigns",
    x_column="ad_spend",
    y_column="revenue"
)
```

Performance monitoring:

```python
# 1. Calculate moving averages
time_series = stat_time_series(
    table_name="daily_metrics",
    time_column="date",
    value_column="revenue",
    window_size=7
)

# 2. Get percentiles for SLA
percentiles = stat_percentile(
    table_name="daily_metrics",
    column_name="response_time",
    percentiles=[50, 90, 95, 99]
)

# 3. Detect anomalies
outliers = stat_outliers(
    table_name="daily_metrics",
    column_name="revenue",
    method="zscore",
    threshold=3.0
)
```
## Best Practices

Validate data quality before analyzing:

```python
# Always check for nulls and outliers first
desc = stat_describe(table_name="data", column_name="value")
if desc["count"] < total_rows:
    print("Warning: Missing values detected")

outliers = stat_outliers(table_name="data", column_name="value")
if outliers["outlier_percentage"] > 0.05:
    print("Warning: >5% outliers detected")
```

Use grouping for segmented analysis:

```python
# Regional analysis
stat_describe(
    table_name="sales",
    column_name="revenue",
    group_by="region"
)

# Time-based analysis
stat_aggregates(
    table_name="metrics",
    column_name="value",
    group_by="DATE_TRUNC('month', timestamp)"
)
```

Check significance and sample size before trusting a correlation:

```python
corr = stat_correlation(table_name="data", column1="x", column2="y")

# Check statistical significance
if corr["p_value"] < 0.05:
    print("Correlation is statistically significant")

# Check sample size
if corr["sample_size"] < 30:
    print("Warning: Small sample size")
```
Related:

- Core Database Tools - Basic SQL operations
- Performance Intelligence - Query optimization
- Monitoring & Alerting - Real-time metrics

See Home for more tool categories.