`docs/architecture-decisions/high-precision-fractional-bucketing.md` (new file, 279 additions)

---
# Valid statuses: draft | proposed | rejected | accepted | superseded
status: draft
author: Michael Beemer
created: 2025-09-10
updated: 2025-09-10
---

# High-Precision Fractional Bucketing for Sub-Percent Traffic Allocation

This ADR proposes enhancing the fractional operation to support high-precision traffic allocation down to 0.001% granularity by increasing the internal bucket count from 100 to 100,000 while maintaining the existing weight-based API.

## Background

The current fractional operation in flagd uses a 100-bucket system that maps hash values to percentages in the range [0, 100].
This approach works well for most use cases but has significant limitations in high-throughput environments where precise sub-percent traffic allocation is required.

Currently, the smallest allocation possible is 1%, which is insufficient for:

- Gradual rollouts in ultra-high-traffic systems where 1% could represent millions of users
- A/B testing scenarios requiring precise control over small experimental groups
- Canary deployments where operators need to start with very small traffic percentages (e.g., 0.1% or 0.01%)

The current implementation in `fractional.go` calculates bucket assignment using:

```go
bucket := hashRatio * 100 // in range [0, 100]
```

This limits granularity to 1% increments, making it impossible to achieve the precision required for sophisticated traffic management strategies.

## Requirements

- Support traffic allocation precision down to 0.001% (3 decimal places)
- Maintain backwards compatibility with existing weight-based API
- Preserve deterministic bucketing behavior (same hash input always produces same bucket)
- Ensure consistent bucket assignment across different programming languages
- Support weight values up to a reasonable maximum that works across multiple languages
- Maintain current performance characteristics
- Prevent users from being moved between buckets when only distribution percentages change
- Guarantee that any variant with weight > 0 receives some traffic allocation
- Handle edge cases gracefully without silent failures
- Validate weight configurations and provide clear error messages for invalid inputs

> **Review thread (on the weight > 0 guarantee):**
>
> **Member Author:** I went back and forth on this. It isn't necessary if the flag is configured properly, but I'm afraid it wouldn't be that obvious that there's a misconfiguration. This basically prevents a 0% distribution if a weight is defined.
>
> **@toddbaert (Member, Oct 20, 2025):** I also feel conflicted about this, if we were to go forward with a strictly defined max bucket size. TBH I'm not sure the special handling is worth the possible user confusion in this extreme case. The other obvious solution is to add a warning at evaluation time (we do similar things for other rules, like invalid semver params).

## Considered Options

- **Option 1: 10,000 buckets (0.01% precision)** - allocations as small as 1 in every 10,000 users; better, but still insufficient for many high-throughput use cases
- **Option 2: 100,000 buckets (0.001% precision)** - allocations as small as 1 in every 100,000 users; meets most high-precision needs
- **Option 3: 1,000,000 buckets (0.0001% precision)** - allocations as small as 1 in every 1,000,000 users; likely overkill, and could impact performance
> **Review thread on lines +47 to +49 (the considered options):**
>
> **@toddbaert (Member, Oct 20, 2025):** Is there any obvious reason that we don't want the max bucket amount to be the sum of all bucket ratios? i.e., `["variant-a", 1], ["variant-b", 1000000]` results in 1,000,001 buckets? This is:
>
> - backwards compatible
> - supports infinite precision (though we'd still have to do the max integer check you mentioned)
> - allows us to sidestep the "Minimum Allocation Guarantee" (what you are describing here), which I think is somewhat surprising behavior to end users and feels somewhat arbitrary
>
> **Member Author:** We must have a static bucket size, or we lose determinism. Here's an example showing a static bucket *(image)*, and here's one showing a dynamic bucket size *(image)*. The important part is that the bucket used must always remain the same, regardless of the configuration.
>
> **Member Author:** I need to think about this a bit more. I don't think my second image is accurate because, despite the bucket sizes changing, the location across the distribution should be consistent. I'll need to run a few tests to see which approach is better.
>
> **@cupofcat (Contributor, Oct 29, 2025):** If I understand the challenge here properly (and I am not 100% sure, so please correct me), we want to ensure that, for a given value "X", it's always distributed into the same bucket no matter how we express the bucket proportions. So `[0.3, 0.5, 0, 0.2]`, `[3, 5, 2]`, `[9, 15, 6]`, etc. should all work the same, across all platforms and all languages.
>
> Additionally, we would like to ensure that very skewed distributions (e.g. `[0.1, 1000000]`) don't end up simplifying some buckets to 0. Lastly, we need to be cautious about floating-point arithmetic. Overall, this is a pretty complex problem :D
>
> To meet all these requirements, I think we need to implement some sort of integer-based bucket normalization to get the canonical representation of the bucket proportions (e.g. the examples above would all normalize to `[3, 5, 2]`). To achieve that we might need the following:
>
> 1. Use exact-precision libraries (e.g. Decimal) when parsing the buckets rather than built-in float types (not sure if that exists for all our languages).
> 2. If non-integer buckets are present, scale everything to integers.
> 3. Calculate the greatest common divisor and normalize using that.
> 4. Use some advanced algorithm to translate the hash into the correct range (using a simple mod operation might skew the distribution).
> 5. We probably need to do the scaling, normalization, and subsequent bucket calculations using bigints rather than standard int32/int64 types (because of overflows).
>
> The challenge with that approach, though, is that the sum of all the buckets can now be larger than the maximum hash... There are ways around this (they would require a different approach to "hashing", or downscaling the buckets to fit in an int), or we can just say that such cases constitute invalid inputs (I think saying that the minimum valid resolution is 0.001% essentially guarantees that?).
>
> **@toddbaert (Member, Nov 26, 2025):** Ya, I think you understand the challenge. I spoke with @beeme1mr a bit about this, and we suspect my proposal here might solve most of our concerns. I don't think we "lose determinism" in any substantially different way than the alternatives. We can lean on JSON Schema to mark any non-integer inputs as invalid; we don't really need to support decimals here, as long as we give users the ability to describe relative weights (this can be done with ints, obviously), and we can sidestep that mess, IMO. Overflows might still be a concern for high numbers, but we can also specify and document limits on the total weight and error if we exceed them; that might be better than calculating a GCD, which adds a performance cost that's not going to be necessary in most cases (I'm quite confident most people will just use configs with weights like 1/1/1 or 50/50, etc.).
>
> **Member:** I'm going to do a little PoC for this.
>
> **Contributor:** Even if we get rid of floats, in your proposal, how would we deal with same proportions but different values, e.g. `[3, 5, 2]` vs. `[9, 15, 6]`?

## Proposal

Implement a 100,000-bucket system that provides 0.001% precision while maintaining the existing integer weight-based API.

### API Changes

No API changes are required. The existing fractional operation syntax remains unchanged:

```json
"fractional": [
{ "cat": [{ "var": "$flagd.flagKey" }, { "var": "email" }] },
["red", 50],
["blue", 30],
["green", 20]
]
```
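
With 100,000 buckets, the same syntax can express sub-percent splits. For example (a hypothetical configuration; the variant names and weights are illustrative), weights summing to 100,000 give a 0.001% canary slice:

```json
"fractional": [
  { "cat": [{ "var": "$flagd.flagKey" }, { "var": "email" }] },
  ["canary", 1],
  ["stable", 99999]
]
```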

### Implementation Changes

1. **Bucket Count**: Change from 100 to 100,000 buckets by modifying the bucket calculation from `hashRatio * 100` to `hashRatio * 100000` (see the sketch after this list)
2. **Minimum Allocation Guarantee**: Any variant with weight > 0 receives at least 1 bucket (0.001%)
3. **Excess Bucket Handling**: Remove excess buckets from the largest variant to maintain exactly 100,000 total buckets
4. **Weight Sum Validation**: Reject configurations where total weight exceeds maximum safe integer value
5. **Maximum Weight Sum**: Use language-specific maximum 32-bit signed integer constants for cross-platform compatibility
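
A minimal sketch of the calculation change in item 1 (the `hashRatio` value is a hypothetical example; the constant name follows the proposal below):

```go
package main

import "fmt"

const bucketCount = 100000

func main() {
	hashRatio := 0.4237519 // example hash ratio in [0, 1]

	fmt.Println(int(hashRatio * 100))         // current: bucket 42 (1% granularity)
	fmt.Println(int(hashRatio * bucketCount)) // proposed: bucket 42375 (0.001% granularity)
}
```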

### Minimum Allocation Guarantee

To prevent silent configuration failures, any variant with a positive weight will receive at least 0.001% allocation (1 bucket), even if the calculated percentage would round to zero. This ensures predictable behavior where positive weights always result in some traffic allocation.

**Example**: Configuration `["variant-a", 1], ["variant-b", 1000000]`

- Without guarantee: variant-a gets 0% (never selected)
- With guarantee: variant-a gets 0.001%, variant-b gets 99.999%
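
A small sketch of the arithmetic behind this example, assuming the integer-truncating allocation shown later in this ADR (requires Go 1.21+ for the `max` builtin):

```go
package main

import "fmt"

func main() {
	const bucketCount = 100000
	weight, totalWeight := int64(1), int64(1000001) // ["variant-a", 1], ["variant-b", 1000000]

	proportional := int(weight * bucketCount / totalWeight) // truncates to 0 buckets
	buckets := max(1, proportional)                         // guarantee bumps variant-a to 1 bucket

	fmt.Println(proportional, buckets) // 0 1  => variant-a gets 0.001%
}
```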

### Excess Bucket Management

When minimum allocations cause the total to exceed 100,000 buckets, excess buckets are removed sequentially from the variants with the largest allocations, starting with the largest, until the total bucket count is exactly 100,000.
This approach:

- Maintains the minimum guarantee for small variants
- Has minimal impact on large variants (small relative reduction)
- Preserves deterministic behavior
- Prevents bucket count overflow
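
A worked sketch of the excess rule with hypothetical weights `["a", 1], ["b", 1], ["c", 1000000]` (total weight 1,000,002):

```go
package main

import "fmt"

func main() {
	const bucketCount = 100000
	weights := []int64{1, 1, 1000000}
	total := int64(1000002)

	allocations, sum := make([]int, len(weights)), 0
	for i, w := range weights {
		b := max(1, int(w*bucketCount/total)) // minimum guarantee: at least 1 bucket
		allocations[i] = b
		sum += b
	}
	fmt.Println(allocations, sum) // [1 1 99999] 100001 -> 1 excess bucket

	// The excess is removed from the largest allocation: "c" drops to 99998,
	// bringing the total back to exactly 100000.
	allocations[2] -= sum - bucketCount
	fmt.Println(allocations) // [1 1 99998]
}
```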

### Weight Sum Validation

When the total weight sum exceeds the maximum safe integer value, the fractional evaluation will return a validation error with a clear message.
This prevents integer overflow issues and provides immediate feedback to users about invalid configurations.

```go
import "math"

func validateWeightSum(variants []fractionalEvaluationVariant) error {
var totalWeight int64 = 0
for _, variant := range variants {
totalWeight += int64(variant.weight)
if totalWeight > math.MaxInt32 {
return fmt.Errorf("total weight sum %d exceeds maximum allowed value %d",
totalWeight, math.MaxInt32)
}
}
return nil
}
```

Implementations should prefer built-in language constants (e.g., `math.MaxInt32` in Go, `Integer.MAX_VALUE` in Java, `int.MaxValue` in C#) rather than hardcoded values to ensure maintainability and clarity.

### Edge Case Handling

The implementation addresses several edge cases:

1. **All weights are 0**: Returns empty string (maintains current behavior)
2. **Negative weights**: Treated as 0 (maintains current validation behavior)
3. **Single variant**: Receives all 100,000 buckets regardless of weight value
4. **Empty variants**: Returns error (maintains current validation behavior)
5. **Weight sum overflow**: Returns validation error with clear message
6. **Multiple variants with minimum allocation**: Excess removed sequentially from the variants with the largest allocations (see Excess Bucket Management)

### Maximum Weight Considerations

To ensure cross-language compatibility, we establish a maximum total weight sum equal to the maximum 32-bit signed integer value (2,147,483,647). This limit:
- Works reliably across all target languages (Go, Java, .NET, JavaScript, Python)
- Provides more than sufficient range for any practical use case
- Prevents integer overflow issues in 32-bit signed integer systems
- Allows for extremely fine-grained control (individual weights can be 1 out of 2+ billion)
- Uses language-native constants for better maintainability

> **Member:** I think this is a good limit, regardless of this.

### Code Changes

The following shows how the core logic in `fractional.go` would be modified.

```go
import (
	"fmt"
	"math"
	"sort"
)

const bucketCount = 100000

// bucketAllocation represents the number of buckets allocated to a variant
type bucketAllocation struct {
	variant string
	buckets int
}

func (fe *Fractional) Evaluate(values, data any) any {
	valueToDistribute, feDistributions, err := parseFractionalEvaluationData(values, data)
	if err != nil {
		fe.Logger.Warn(fmt.Sprintf("parse fractional evaluation data: %v", err))
		return nil
	}

	if err := validateWeightSum(feDistributions.weightedVariants); err != nil {
		fe.Logger.Warn(fmt.Sprintf("weight validation failed: %v", err))
		return nil
	}

	return distributeValue(valueToDistribute, feDistributions)
}

func validateWeightSum(variants []fractionalEvaluationVariant) error {
	var totalWeight int64 = 0
	for _, variant := range variants {
		totalWeight += int64(variant.weight)
		if totalWeight > math.MaxInt32 {
			return fmt.Errorf("total weight sum %d exceeds maximum allowed value %d",
				totalWeight, math.MaxInt32)
		}
	}
	return nil
}
func calculateBucketAllocations(variants []fractionalEvaluationVariant, totalWeight int) []bucketAllocation {
	allocations := make([]bucketAllocation, len(variants))
	totalAllocated := 0

	// Calculate initial allocations
	for i, variant := range variants {
		if variant.weight == 0 {
			allocations[i] = bucketAllocation{variant: variant.variant, buckets: 0}
		} else {
			// Calculate proportional allocation
			proportional := int((int64(variant.weight) * bucketCount) / int64(totalWeight))
			// Ensure minimum allocation of 1 bucket for any positive weight
			buckets := max(1, proportional)
			allocations[i] = bucketAllocation{variant: variant.variant, buckets: buckets}
		}
		totalAllocated += allocations[i].buckets
	}

	// Handle excess buckets by removing from largest allocation
	excess := totalAllocated - bucketCount
	if excess > 0 {
		// Sort indices by bucket count (descending) to find largest allocation
		indices := make([]int, len(allocations))
		for i := range indices {
			indices[i] = i
		}
		sort.Slice(indices, func(i, j int) bool {
			if allocations[indices[i]].buckets == allocations[indices[j]].buckets {
				return allocations[indices[i]].variant < allocations[indices[j]].variant // Tie-break by variant name
			}
			return allocations[indices[i]].buckets > allocations[indices[j]].buckets
		})

		// Remove excess from largest allocation, respecting minimum guarantee
		for _, idx := range indices {
			if excess <= 0 {
				break
			}

			// Don't reduce below 1 bucket if original weight > 0
			minAllowed := 0
			if variants[idx].weight > 0 {
				minAllowed = 1
			}

			canRemove := allocations[idx].buckets - minAllowed
			toRemove := min(excess, canRemove)
			allocations[idx].buckets -= toRemove
			excess -= toRemove
		}
	}

	return allocations
}
```

> **Review thread on `calculateBucketAllocations`:**
>
> **Contributor (severity: high):** The `calculateBucketAllocations` function handles the case where the sum of allocated buckets exceeds `bucketCount` (`excess > 0`). However, it doesn't account for the opposite scenario: when the total allocated buckets is less than `bucketCount` due to rounding down during the proportional calculation (`int(...)`). If a deficit of buckets occurs, some hash values in `distributeValue` will not fall into any variant's bucket range, causing the function to incorrectly return an empty string. The ADR should specify how to handle bucket deficits, for instance, by distributing the remaining buckets among the variants (e.g., starting with the largest ones) to ensure the total is exactly `bucketCount`.
>
> **Member Author:** Properly supporting guaranteed bucketing may add more complexity than I'd like. I'm sure I can address this issue, but I'd like feedback on whether it's worth supporting this at all. The reason I added it is to avoid configurations like `["variant-a", 1], ["variant-b", 1000000]` resulting in zero evaluations returning variant-a. You could argue it's a misconfiguration, but it may not be obvious to the user.

**Replace the distribution logic:**

```go
func distributeValue(value string, feDistribution *fractionalEvaluationDistribution) string {
	if feDistribution.totalWeight == 0 {
		return ""
	}

	allocations := calculateBucketAllocations(feDistribution.weightedVariants, feDistribution.totalWeight)

	hashValue := int32(murmur3.StringSum32(value))
	hashRatio := math.Abs(float64(hashValue)) / math.MaxInt32
	bucket := int(hashRatio * bucketCount) // in range [0, bucketCount)

	currentBucket := 0
	for _, allocation := range allocations {
		currentBucket += allocation.buckets
		if bucket < currentBucket {
			return allocation.variant
		}
	}

	return ""
}
```
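
For illustration, a self-contained sketch of the hash-to-bucket mapping (assuming the `github.com/twmb/murmur3` package, which matches the `murmur3.StringSum32` call above; the input string is a hypothetical concatenation of flag key and targeting property):

```go
package main

import (
	"fmt"
	"math"

	"github.com/twmb/murmur3"
)

const bucketCount = 100000

func main() {
	// The hashed value is the result of the "cat" operation,
	// e.g. flag key concatenated with the user's email.
	value := "my-flaguser@example.com"

	hashValue := int32(murmur3.StringSum32(value))
	hashRatio := math.Abs(float64(hashValue)) / math.MaxInt32
	bucket := int(hashRatio * bucketCount)

	// The same input always lands in the same bucket, which keeps
	// variant assignment sticky across evaluations.
	fmt.Println(bucket)
}
```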

### Consequences

- Good, because it enables precise traffic control for high-throughput environments
- Good, because it matches industry-standard precision offered by leading vendors
- Good, because it maintains API backwards compatibility
- Good, because integer weights remain simple to understand and configure
- Good, because it prevents silent configuration failures through minimum allocation guarantee
- Good, because excess handling is predictable and fair
- Good, because weight validation provides clear error messages for invalid configurations
- Bad, because it represents a behavioral breaking change for existing configurations
- Bad, because it slightly increases memory usage for bucket calculations
- Bad, because actual percentages may differ slightly from configured weights due to minimum allocations

### Implementation Plan

1. Update flagd-testbed with comprehensive test cases for high-precision fractional bucketing across all evaluation modes
2. Implement core logic in flagd to support 100,000-bucket system with minimum allocation guarantee and excess handling
3. Update flagd providers to ensure consistent behavior and testing across language implementations
4. Documentation updates, migration guides, and example configurations to demonstrate the new precision capabilities