Skip to content
This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

sort symbols in order of frequency rather than lexicographically #280

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

cstyan
Copy link
Contributor

@cstyan cstyan commented Feb 9, 2018

for #249

as we keep track of symbols in head we now also keep track of how many times we've seen that symbol, and when we write the symbols in IndexWriter we sort the symbols in order of frequency seen before writing them

@fabxc
Copy link
Contributor

fabxc commented Feb 9, 2018

Thanks for working on this! Currently we appear to have some severe bugs in Prometheus 2.1 tied to the storage. Thus, I'd suggest we freeze any new features to prometheus/tsdb until things go back to being stable.

So this PR will probably be on hold for a bit.

@cstyan
Copy link
Contributor Author

cstyan commented Feb 9, 2018

@fabxc no problem! I can pick this (#249 ) back up when we're in a more stable state. Is there anything I can help with regarding the bugs in 2.1?

@cstyan
Copy link
Contributor Author

cstyan commented Feb 23, 2018

@fabxc will the 2.2.0 release unblock this?

@cstyan
Copy link
Contributor Author

cstyan commented Mar 17, 2018

rebased off master, fixed a merge conflict that I'd missed when I pushed last

Copy link
Collaborator

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll run it locally for a while and report back.

head.go Outdated

for s := range h.head.symbols {
res[s] = struct{}{}
res[s] = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this 0? It should be:

for s, num := range h.head.symbols {
    res[s] = num
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I had a reason, I don't remember at this point :)

@cstyan
Copy link
Contributor Author

cstyan commented Apr 1, 2018

changed the value assigned to the symbols in Reader in head.go and squashed it into the original commit

let me know if there's anything I can do to help test this :)

@cstyan
Copy link
Contributor Author

cstyan commented May 23, 2018

@gouthamve @fabxc are these changes still relevant?

@krasi-georgiev
Copy link
Contributor

Shouldn't you sort the symbols in the index writer , before writing it to disk?
IIUC this is where we want to save some space.

https://github.com/prometheus/tsdb/blob/c848349f07c83bd38d5d19faa5ea71c7fd8923ea/index/index.go#L343
instead of this , sort by popularity.

@cstyan
Copy link
Contributor Author

cstyan commented Jun 12, 2018

@krasi-georgiev I'll have to double check, haven't looked at this in a while

@cstyan
Copy link
Contributor Author

cstyan commented Sep 27, 2018

@krasi-georgiev is that not what is happening here: https://github.com/prometheus/tsdb/pull/280/files#diff-71ebe2bcf31a915b1fa3b3b289d5d31dR354 ?

rebased off master to fix the conflict in head.go

@krasi-georgiev
Copy link
Contributor

failing tests

Copy link
Contributor

@krasi-georgiev krasi-georgiev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test how much are the savings by this change?

Also I don't see any test to ensure the behaviour that Symbols are saved ordered by frequency.
something like

  • Add some symbols
  • Save index
  • Read index
  • Check Symbols order.

index/index.go Outdated Show resolved Hide resolved
@krasi-georgiev
Copy link
Contributor

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan
Copy link
Contributor Author

cstyan commented Jan 5, 2019

Added a test for the sorting of symbols

I'm not sure if we want to get back the frequency #'s when we read the symbols back out of the table?

@brian-brazil
Copy link
Contributor

I don't see a reason to expose that in the API.

@krasi-georgiev
Copy link
Contributor

failing tests

also

Can you test how much are the savings by this change?

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan
Copy link
Contributor Author

cstyan commented Jan 7, 2019

@krasi-georgiev

Can you test how much are the savings by this change?

Yes I'll have a look at, I guess including a benchmark test? But if you read #249 the goal is to reduce the size of the index file.

@brian-brazil

I don't see a reason to expose that in the API.

I'll have to double check. When I was reading the use of the index and block reader to get the symbols, when compaction happens we read the current symbols, determine which are still in use, and then write those to a new index. In that case we would want the frequencies to persist across that write. But I've probably just misread what's happening.

@krasi-georgiev
Copy link
Contributor

Yes I'll have a look at, I guess including a benchmark test? But if you read #249 the goal is to reduce the size of the index file.

yes this is what I meant , how much is the index file size reduced by this change.

@krasi-georgiev
Copy link
Contributor

@cstyan would you have time to continue with this?
Otherwise I can try to find some time to continue from here.

@cstyan
Copy link
Contributor Author

cstyan commented Jan 25, 2019

@krasi-georgiev yeah I should have some time next week, if you wanted to try something before then feel free.

Krasi Georgiev added 2 commits February 4, 2019 11:39
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
@krasi-georgiev
Copy link
Contributor

updated to the latest master and resolved the conflicts.

Now will run some tests locally to compare the index file savings with this change.

@krasi-georgiev
Copy link
Contributor

krasi-georgiev commented Feb 4, 2019

using the following test I don't see any difference in the index file size.

The index file size with or without the changes in this PR is 180Mb.

func TestIndexSize(t *testing.T) {
	tmpdir, err := ioutil.TempDir("", "testIndex")
	testutil.Ok(t, err)
	var dirs []string

	for i := int64(0); i < 5; i++ {
		dirs = append(dirs, createBlock(t, tmpdir, genSeries(1000, 10, i*1000, i*1000+1000)))

	}
	compactor, err := NewLeveledCompactor(nil, log.NewNopLogger(), []int64{1000000}, nil)
	testutil.Ok(t, err)

	uid, err := compactor.Compact(tmpdir, dirs, nil)
	testutil.Ok(t, err)
	fmt.Println(tmpdir + "/" + uid.String())
}

This generates random series so it should generate enough churn.

@krasi-georgiev
Copy link
Contributor

ping @cstyan

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants