Improve Stacktraces Samples Memory Layout #796

cyriltovena · 2023-06-26T12:38:47Z

This introduces a new way to abstract the in-memory representation away from the on-disk file format (parquet). The implementation relies heavily on parquet.RowReader:

// RowReader reads a sequence of parquet rows.
type RowReader interface {
	// ReadRows reads rows from the reader, returning the number of rows read
	// into the buffer, and any error that occurred. Note that the rows read
	// into the buffer are not safe for reuse after a subsequent call to
	// ReadRows. Callers that want to reuse rows must copy the rows using Clone.
	//
	// When all rows have been read, the reader returns io.EOF to indicate the
	// end of the sequence. It is valid for the reader to return both a non-zero
	// number of rows and a non-nil error (including io.EOF).
	//
	// The buffer of rows passed as argument will be used to store values of
	// each row read from the reader. If the rows are not nil, the backing array
	// of the slices will be used as an optimization to avoid re-allocating new
	// arrays.
	//
	// The application is expected to handle the case where ReadRows returns
	// less rows than requested and no error, by looking at the first returned
	// value from ReadRows, which is the number of rows that were read.
	ReadRows([]Row) (int, error)
}
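The contract in the comment above (short reads, rows returned together with io.EOF, rows that must be copied before the next call) can be sketched as follows. This is only an illustration: `Row` and `RowReader` are local stand-ins that mirror the quoted interface rather than the parquet library's own types, and `samplesReader` and `drain` are hypothetical names that do not come from this PR.

```go
package main

import (
	"errors"
	"io"
)

// Row and RowReader are local stand-ins mirroring the interface quoted
// above; they are NOT the parquet library's own types.
type Row []uint64

type RowReader interface {
	ReadRows([]Row) (int, error)
}

// Samples is the columnar layout from this PR description.
// The two slices are assumed to have the same length.
type Samples struct {
	StacktraceIDs []uint32
	Values        []uint64
}

// samplesReader (hypothetical) exposes Samples row by row without first
// materializing a slice of structs.
type samplesReader struct {
	s   Samples
	pos int
}

func (r *samplesReader) ReadRows(rows []Row) (int, error) {
	n := 0
	for n < len(rows) && r.pos < len(r.s.StacktraceIDs) {
		// Reuse each row's backing array instead of allocating a new one.
		rows[n] = append(rows[n][:0],
			uint64(r.s.StacktraceIDs[r.pos]), r.s.Values[r.pos])
		n++
		r.pos++
	}
	if r.pos >= len(r.s.StacktraceIDs) {
		return n, io.EOF // n may be non-zero alongside io.EOF
	}
	return n, nil
}

// drain reads every row, handling short reads and rows that arrive
// together with io.EOF.
func drain(r RowReader, batch int) ([]Row, error) {
	if batch <= 0 {
		batch = 1 // a zero-length buffer could never make progress
	}
	var out []Row
	buf := make([]Row, batch)
	for {
		n, err := r.ReadRows(buf)
		for _, row := range buf[:n] {
			// Copy: rows are not safe to keep after the next ReadRows call.
			out = append(out, append(Row(nil), row...))
		}
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
	}
}
```

Calling `drain(&samplesReader{s: samples}, 64)` would return one two-value row per sample; the rows are processed before the error is inspected, so a final batch returned with io.EOF is not lost.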

It uses this new abstraction to represent stacktrace samples differently.

Instead of using a slice of pointers to structs, such as:

type Profile struct {
	Samples []*Sample
}
type Sample struct {
	StacktraceID uint64             `parquet:",delta"`
	Value        int64              `parquet:",delta"`
	Labels       []*profilev1.Label `parquet:",list"`
}

It uses a struct of two parallel slices:

type Profile struct {
	Samples Samples
}
type Samples struct {
	StacktraceIDs []uint32
	Values        []uint64
}

This greatly reduces the amount of memory used while ingesting profiles, since we use less address space. On top of that, we no longer use reflection when flushing profiles, because we use a custom parquet serialisation.
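As a rough illustration of where the saving comes from, the sketch below converts the old layout into the new one and compares the per-sample footprint of each. `toColumnar` and `bytesPerSample` are hypothetical helpers, `Label` is a simplified stand-in for profilev1.Label, and the conversion assumes that stacktrace IDs fit in a uint32 and that values are non-negative. Labels are dropped because the new struct in this description has no Labels field.

```go
package main

import "unsafe"

// Label is a simplified stand-in for profilev1.Label (assumption).
type Label struct{ Name, Value string }

// Sample is the previous row-oriented layout: one heap object per sample.
type Sample struct {
	StacktraceID uint64
	Value        int64
	Labels       []*Label
}

// Samples is the columnar layout: two parallel slices.
type Samples struct {
	StacktraceIDs []uint32
	Values        []uint64
}

// toColumnar (hypothetical) converts the old layout to the new one.
// It assumes IDs fit in uint32 and values are non-negative.
func toColumnar(in []*Sample) Samples {
	out := Samples{
		StacktraceIDs: make([]uint32, 0, len(in)),
		Values:        make([]uint64, 0, len(in)),
	}
	for _, s := range in {
		if s == nil {
			continue // skip missing entries rather than panic
		}
		out.StacktraceIDs = append(out.StacktraceIDs, uint32(s.StacktraceID))
		out.Values = append(out.Values, uint64(s.Value))
	}
	return out
}

// bytesPerSample reports the per-sample footprint of each layout,
// excluding allocator overhead and the labels themselves.
func bytesPerSample() (rowOriented, columnar uintptr) {
	var s Sample
	var p *Sample
	rowOriented = unsafe.Sizeof(s) + unsafe.Sizeof(p) // struct plus the pointer to it
	columnar = unsafe.Sizeof(uint32(0)) + unsafe.Sizeof(uint64(0))
	return rowOriented, columnar
}
```

On a typical 64-bit platform this should come out to 48 bytes per sample for the old layout (a 40-byte struct plus an 8-byte pointer) against 12 bytes for the columnar one, before counting per-object allocator and garbage collector overhead.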

This is running in dev and has reduced memory usage by 50%.

* Ingest stacktraces in the new symdb
* Setup read in memory read path
* Fix up a comment placement
* Start setting up the read path
* Update to uint32
* Introduce stacktrace partition (#775)
* Introduce stacktrace partition

  This determines the partition of a particular profile by looking first at its metadata:

  * If there is a `Filename` on the main mapping, use its filepath.Base(Filename)
  * Failing that, take the externally supplied `service_name`
  * Fall back to `unknown`

  Take the underlying string value and hash it.
* After a chat with cyril we decided to no longer mod and to use the hash straight away. We didn't want to risk collisions between two very big stacktrace applications.
* Remove reconstructMeta from singleBlockQuerier
* Support multiple versions of stacktraces resolver
* Integrate v2 reader for stacktraces in block reader
* Fixes tests
* Rewrite locations Ids
* Rewrite test for counting uniq stacktraces
* lint and fmt
* Fixes more tests
* Fixes leftover from todo

---------

Co-authored-by: Christian Simon <simon@swine.de>
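The partition rule described in the commit notes above can be sketched as follows. The function names are hypothetical, and FNV-1a is used only as a placeholder because the notes do not say which hash function is applied. In line with the notes, the hash is used directly, without a modulo.

```go
package main

import (
	"hash/fnv"
	"path/filepath"
)

// partitionName (hypothetical) picks the string that identifies a profile's
// partition: the base name of the main mapping's Filename, else the
// externally supplied service_name, else "unknown".
func partitionName(mainMappingFilename, serviceName string) string {
	if mainMappingFilename != "" {
		return filepath.Base(mainMappingFilename)
	}
	if serviceName != "" {
		return serviceName
	}
	return "unknown"
}

// stacktracePartition (hypothetical) hashes the chosen name and returns the
// hash as-is. FNV-1a is an assumption; the PR does not name the hash.
func stacktracePartition(mainMappingFilename, serviceName string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(partitionName(mainMappingFilename, serviceName)))
	return h.Sum64()
}
```

With this rule, two profiles whose main mapping has the same base file name land in the same partition regardless of directory or service name.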

cyriltovena and others added 30 commits June 2, 2023 10:29

Increase parquet writer PageBufferSize

813ac6a

reduce by 2 page buffer size

97eb8c5

Introduce symdb

5b33929

Add chunk format description

046e825

Add chunk format description

bacbe04

Improve naming

1c96cd5

Implement stack trace appender

a967b21

Limit chunk by number of nodes

b380330

Stacktrace ID is uint32

b108bac

Add in-memory stacktrace resolver

401c00f

Add writer

25bce59

Add writer

6bba758

Fix stacktrace resolver

ddab48e

Single pass write

71a9ee1

Index file refactoring

b48d915

Fixes, improvements, notes

42c607d

Ignore empty stacktraces

9ce0a86

Fix chunk boundary check

38d1e0b

Fix tests

0240f17

Store chunk headers sorted

c7892a6

Make chunk index explicit

58cbafc

Add file reader

da00cc0

Use group varint encoding

44fa701

Refine stacktrace tree

845d559

Stacktrace tree race condition elimination

2d5abc0

Remove unused stacktracesResolve.do

1271212

Better nil coalescence in stack trace appender

8fb9a64

Format imports

5a349f8

Merge remote-tracking branch 'origin/main' into feat/symdb

146866b

kolesnikovae and others added 27 commits June 20, 2023 18:35

Use prefixed bucket for symbols

fcf5b8b

Initialize locationsIdsByStacktraceID

dc8f2a1

Initialize locationsIdsByStacktraceID for pprof as well

5701177

Fix chunk headers sort

a7e5597

Inline node alloc

6568873

Mapping filename extraction

36922e7

Tidy go.mod

902ec0c

Fix TestHeadIngestStacktraces

6757ee1

Use symdb.DefaultDirName

0acb4de

Sort mappings on write

2f5753b

Make column iterator to respect the context

10d1dbf

Fix unexpected EOF on stacktrace chunk unmarshal

825235c

Fix symbols upload

c31f93d

Fix symbols upload

20a815e

Release fetched data

9241faa

Reduce memory layout by using a more tailored struct

515e4a1

better tests name

509bec2

Fixes tests and implement the new model

9e892e4

Fixes tests and empty repeated values bug

7a09d72

Reset slice by copy is safer.

40b9fbd

Merge branch 'experiment-page-size' into feat/symdb

03dc721

3MB Page Buffer Size

4d7eb65

Sort stacktraces IDs as expected by the resolver

819f6e9

Merge branch 'feat/symdb' into improve-memory-layout

7675607

Refactor samples compaction.

f77053c

Merge remote-tracking branch 'origin/main' into improve-memory-layout

5832ae9

make fmt

b67c811

cyriltovena mentioned this pull request Jun 26, 2023

Improve Locations in memory representation #797

Closed

cyriltovena closed this Jun 27, 2023
