Make triplex id distribution for ranges more homogeneous #13
Comments
I have spent some time studying how we are generating these ids. I have been generating ids using some simple Python code that follows our algo, along the lines of:

```python
import math
import random

def random_path(x1, x2):
    return x1 + round(random.random() * math.sqrt(x2 - x1))

def gen_paths(n, x1, x2):
    ids = []
    lower = x1
    while len(ids) < n:
        y = random_path(lower, x2)
        ids.append(y)
        lower = y
    return ids
```

To understand this I have run simulations and derived some approximate results for how this method generates sequences of random numbers. In the limit where `lower` is far from `x2`, each increment is `X_i = U_i * sqrt(x2)`, where the `U_i` are independent uniform draws on `[0, 1]`. The position after `j` steps is

    S_j = sum_{i=1..j} X_i = sqrt(x2) * sum_{i=1..j} U_i

Because `<U_i> = 1/2`, the expected position is `<S_j> = sqrt(x2) * j / 2`, and the expected spacing between consecutive ids is `<S_j> - <S_{j-1}> = sqrt(x2) / 2`. These two results show the following:
Let's make this more concrete and estimate how many random paths this algorithm can generate in the range. Our path space is 6 bytes, i.e. 2**48 possible values (before allocating another triplex id). But in practice we get far fewer values in this space, because the mean spacing between values is the constant `sqrt(2**48) / 2`, roughly 8.4 million. This makes it clear why we are seeing memory consumption increase rapidly. The fix will be to alter our path generation to vary the spacing with the number of values: the simplest approach is to have the average spacing between n values in a range be `(max - min) / (n + 1)`.
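The `sqrt(x2)/2` mean-spacing result is easy to check numerically. A minimal simulation sketch, reusing the `random_path` rule from the snippet above (the `x2 = 10**12` value is arbitrary, chosen so that 1000 steps stay far from the end of the range):

```python
import math
import random

def random_path(x1, x2):
    # Jump uniformly into the first sqrt(x2 - x1) part of the remaining range.
    return x1 + round(random.random() * math.sqrt(x2 - x1))

def mean_spacing(n_steps, x2):
    random.seed(42)  # deterministic run
    lower = 0
    total = 0
    for _ in range(n_steps):
        y = random_path(lower, x2)
        total += y - lower
        lower = y
    return total / n_steps

x2 = 10**12
observed = mean_spacing(1000, x2)
predicted = math.sqrt(x2) / 2
print(observed / predicted)  # close to 1
```

With the real 2**48 path space the same formula predicts a mean spacing of about 8.4 million, so only tens of millions of ids fit before the path has to grow.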
I mostly agree with your points, but a few comments:
Note that the random path density has two asymptotic behaviors: roughly constant spacing (your `sqrt(x2) / 2`) at the start of the range, and sharply increasing density when approaching the end of the range (but of course limited by binning to integers). Consequently, I see that the number of IDs that are typically generated before the path is expanded is around 67 million (4 orders of magnitude higher):

```javascript
const N = 2 ** 26;  // ≈ 67 million
let last = 0;
for (let i = 0; i < 2 * N; ++i) {
  last = createTriplexId(0, 0, last);
  if (last.length > 8) {
    console.log([i, last]);
    break;
  }
}
```
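The ~67 million figure is consistent with a continuum estimate: with local spacing `sqrt(x2 - y) / 2` the id density is `2 / sqrt(x2 - y)`, which integrates to `4 * sqrt(x2)` ids over the whole range, and `4 * sqrt(2**48) = 2**26 ≈ 67` million. This can be checked quickly at a toy scale in Python (the `2**20` range is arbitrary; the `y > lower` guard skips the zero-length jumps that appear once the remaining range rounds to zero):

```python
import math
import random

def random_path(x1, x2):
    return x1 + round(random.random() * math.sqrt(x2 - x1))

def count_ids(x2):
    # Count distinct ids generated before the range is exhausted.
    random.seed(0)
    lower = 0
    count = 0
    while lower < x2 - 1:
        y = random_path(lower, x2)
        if y > lower:  # zero-length jump: no new id
            count += 1
            lower = y
    return count

x2 = 2**20
count = count_ids(x2)
print(count)  # on the order of 4 * sqrt(x2) = 4096
```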
I have been picking through the details of how we can improve the triplex id generation. There are two main ideas I have been exploring:
It's hard to give a proper review of the proposed methods without any (pseudo) code. However, I want to note that the current algorithm behaves pretty well for a somewhat typical editing scenario.
Maybe there exists a body of captured user edit patterns somewhere that could be used as a profiling baseline?
Here is a draft version of idea (2) above:

```typescript
export
function createTriplexIds(n: number, version: number, store: number, lower: string, upper: string): string[] {
  let ids: string[] = [];
  whileIds: while (ids.length < n) {
    const MAX_PATH = 0xFFFFFFFFFFFF;
    let id = '';
    let lowerCount = lower ? Private.idTripletCount(lower) : 0;
    let upperCount = upper ? Private.idTripletCount(upper) : 0;
    forCount: for (let i = 0, p = Math.max(lowerCount, upperCount); i < p; ++i) {
      let lp: number;
      let lc: number;
      let ls: number;
      if (i >= lowerCount) {
        lp = 0;
        lc = 0;
        ls = 0;
      } else {
        lp = Private.idPathAt(lower, i);
        lc = Private.idVersionAt(lower, i);
        ls = Private.idStoreAt(lower, i);
      }
      let up: number;
      let uc: number;
      let us: number;
      if (i >= upperCount) {
        up = upperCount === 0 ? MAX_PATH + 1 : 0;
        uc = 0;
        us = 0;
      } else {
        up = Private.idPathAt(upper, i);
        uc = Private.idVersionAt(upper, i);
        us = Private.idStoreAt(upper, i);
      }
      // lower === upper
      if (lp === up && lc === uc && ls === us) {
        id += Private.createTriplet(lp, lc, ls);
        continue forCount;
      }
      if ((up - lp - 1) >= (n - ids.length)) {
        let paths = Private.generatePaths(n, lp, up);
        for (let j = 0, m = n - ids.length; j < m; j++) {
          ids.push(id + Private.createTriplet(paths[j], version, store));
        }
        return ids;
      }
      id += Private.createTriplet(lp, lc, ls);
      upperCount = 0;
    } // forCount
    let np = Private.generatePath(1, 1, MAX_PATH);
    id += Private.createTriplet(np, version, store);
    ids.push(id.slice());
    id = '';
  } // whileIds
  return ids;
}
```

The main idea is that when there is enough room between the lower and upper paths at some level, `Private.generatePaths` is used to fill all of the remaining ids with evenly spaced paths.

Pros:
Cons:
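To sanity-check the control flow of the draft above, here is a hypothetical, heavily simplified Python model of the same idea. Ids are modeled as plain tuples of integer paths rather than encoded triplets, and `create_ids` / `generate_paths` are illustrative names, not the real API:

```python
import math

MAX_PATH = 0xFFFFFFFFFFFF  # 2**48 - 1: the 6-byte path space

def generate_paths(n, lo, hi):
    # Evenly space n paths strictly between lo and hi.
    delta = (hi - lo) / (n + 1)
    return [math.floor(lo + i * delta) for i in range(1, n + 1)]

def create_ids(n, lower=(), upper=()):
    """Return n ids (tuples of paths) ordered strictly between lower and upper."""
    prefix = []
    lo, hi = tuple(lower), tuple(upper)
    level = 0
    while True:
        lp = lo[level] if level < len(lo) else 0
        up = hi[level] if level < len(hi) else (MAX_PATH + 1 if not hi else 0)
        if level < len(lo) and level < len(hi) and lp == up:
            prefix.append(lp)  # identical path at this level: keep descending
            level += 1
            continue
        if up - lp - 1 >= n:   # enough room between the bounds at this level
            return [tuple(prefix) + (p,) for p in generate_paths(n, lp, up)]
        prefix.append(lp)      # no room: fix the lower path and descend;
        hi = ()                # everything under this prefix sorts below upper
        level += 1

print(create_ids(5, (10,), (100,)))  # five evenly spaced one-level ids
print(create_ids(3, (10,), (12,)))   # no room at level 0, so descend a level
```

Tuple comparison in Python mirrors the lexicographic ordering of the encoded ids, so `(10,) < (10, p) < (12,)` holds for any path `p`, which is what makes the descend-and-reset-upper step valid.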
Here is the `generatePaths` helper:

```typescript
namespace Private {
  export
  function generatePaths(n: number, min: number, max: number): number[] {
    let m = max - min;
    let delta = m / (n + 1);
    let paths: number[] = [];
    for (let i = 1; i <= n; i++) {
      paths.push(Math.floor(min + i * delta));
    }
    return paths;
  }
}
```
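As a quick check of the spacing behavior, here is a direct Python transliteration of `generatePaths` (a sketch for experimentation, not the shipped code):

```python
import math

def generate_paths(n, lo, hi):
    # Evenly distribute n paths across (lo, hi): a gap of roughly
    # (hi - lo) / (n + 1) on each side and between neighbours.
    delta = (hi - lo) / (n + 1)
    return [math.floor(lo + i * delta) for i in range(1, n + 1)]

print(generate_paths(4, 0, 100))  # [20, 40, 60, 80]
```

Note that when `n + 1` exceeds `hi - lo`, `delta` drops below 1 and the floored paths collide, so the caller has to verify that the range has room first (the `up - lp - 1 >= n` test in the draft above does exactly that).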
Also: one thing this work points to is that we likely need a benchmark suite to test the performance of different algorithms under different conditions.
I have a version working now with a benchmark notebook.

Existing algorithm:

New algorithm:

The main findings here are that memory usage is 3x lower and that the generated ids remain shorter for a given number of list elements and edits. This is promising, so I will continue to improve the benchmark to cover different types of usage (copy/paste, deleting, etc.).
The current behavior skews new ids to the end of the range for large inserts.
Context:
For creating a single triplex ID, a random path is generated within the start of the available range (uniformly in the first `Math.sqrt(max - min)` part of the range). Currently, when inserting a range, this ID generation is called sequentially, each time using the previous random ID as the start of the range.

Behavior:

With this logic, any sequence insert will have an ID distribution that is heavy towards the end of the insertion range (the density increases as more of the range is consumed). This effect is further pronounced if the range is already small, e.g. for insertions within a previous insertion.
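The skew is easy to demonstrate at a toy scale. A quick Python sketch (the `2**20` range and the seed are arbitrary), counting what fraction of sequentially generated ids land in the last 10% of the range; a uniform distribution would put about 10% of ids there, while the sqrt rule concentrates roughly a third of them:

```python
import math
import random

def random_path(x1, x2):
    # The current rule: jump uniformly into the first
    # sqrt(x2 - x1) part of the remaining range.
    return x1 + round(random.random() * math.sqrt(x2 - x1))

random.seed(1)  # deterministic run
x2 = 2**20      # toy path space (the real one is 2**48)
ids = []
lower = 0
while lower < x2 - 1:
    y = random_path(lower, x2)
    if y > lower:  # skip zero-length jumps near the end
        ids.append(y)
        lower = y

tail = sum(1 for y in ids if y > 0.9 * x2) / len(ids)
print(round(tail, 2))  # roughly 0.32, far above the 0.10 of a uniform spread
```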
Proposed solution:
For range inserts: