Update out of order with main #2072

Merged
merged 193 commits into from
Jun 10, 2022


jesusvazquez
Member

mimir/out-of-order was based on a commit from April 26. This PR updates it to the latest main commit.

pstibrany and others added 30 commits April 26, 2022 18:20
…o binaries. (#1759)

* Extend Dockerfiles to support multiarch builds for all Go binaries.

By calling any of

make push-multiarch-./cmd/metaconvert/.uptodate
make push-multiarch-./cmd/mimir/.uptodate
make push-multiarch-./cmd/query-tee/.uptodate
make push-multiarch-./cmd/mimir-continuous-test/.uptodate
make push-multiarch-./cmd/mimirtool/.uptodate
make push-multiarch-./operations/mimir-rules-action/.uptodate

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* Update to latest dskit and memberlist fork

Fixes #1743

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Update changelog

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* mimirtool config: Add more retained old defaults

The following parameters have their old defaults retained even when
`--update-defaults` is used with `mimirtool config convert`:

* `activity_tracker.filepath`
* `alertmanager.data_dir`
* `blocks_storage.filesystem.dir`
* `compactor.data_dir`
* `ruler.rule_path`
* `ruler_storage.filesystem.dir`
* `graphite.querier.schemas.backend` (only in GEM)

These are filepaths for which the new defaults don't make more sense
than the old ones. In fact, updating these can lead to a subpar migration
experience because components start using directories that don't exist.

Because activity_tracker.filepath changed its name since Cortex, the
tests needed to allow for differentiating old common options and new
ones. This is something that was already there for GEM and was added
for cortex/mimir too.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Update CHANGELOG.md

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
* dashboards: add flag to skip gateway

The gateway component seems to be an enterprise component, so groups
that aren't running enterprise shouldn't need the empty panels and rows
in their dashboards. This patch adds a flag to drop gateway-related
widgets from the mixin dashboards.

Signed-off-by: Josh Carp <jm.carp@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>
* Gracefully shutdown querier when using query-scheduler

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed comment

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added TestQueuesOnTerminatingQuerier

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Commented executionContext

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Update pkg/querier/worker/util.go

Co-authored-by: Peter Štibraný <pstibrany@gmail.com>

* Fixed typo in suggestion

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed superfluous time sensitive assertion

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Commented newExecutionContext()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
* Gracefully shutdown querier when not using query-scheduler

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Updated CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improved comment

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Refactoring

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Increase mimir-continuous-test query timeout from 30s to 60s

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added PR number to CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Increased default -tests.run-interval from 1m to 5m

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added PR number to CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Fix flaky tests on querier graceful shutdown

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Remove spurious newline

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Update build-image to use golang:1.17.8-bullseye, and add skopeo to build image.

Skopeo will be used in a subsequent PR to push multiarch images.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Update build image. Use ubuntu-latest for workflow steps.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* Publish multiarch images.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Tag with extra tag, if pushing tagged commit or release.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Split building of docker images and archiving them into tar.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* When tagging with test, use --all.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Only run deploy step on tags or weekly release branches.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Don't tag with test anymore.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Address review feedback.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Fix license check.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
When `K6_HA_REPLICAS` is greater than 1, Mimir accepts all HTTP calls, but
some of those calls receive status code `202`. This commit treats that
status code as expected; otherwise the user receives the following
error:
```
reads_inat write (file:///.../mimir-k6/load-testing-with-k6.js:254:8(137))
reads_inat native  executor=ramping-arrival-rate scenario=writing_metrics source=stacktrace
ERRO[0015] GoError: ERR: write failed. Status: 202. Body: replicas did not mach, rejecting sample: replica=replica_1, elected=replica_0
```

The summary at the end of the benchmark also reports the failed checks:
```
     ✗ write worked
      ↳  20% — ✓ 23 / ✗ 92
```

Example of load testing:
```shell
./k6 run load-testing-with-k6.js \
    -e K6_SCHEME="https" \
    -e K6_WRITE_HOSTNAME="${mimir}" \
    -e K6_READ_HOSTNAME="${mimir}" \
    -e K6_USERNAME="${user}" \
    -e K6_WRITE_TOKEN="${password}" \
    -e K6_READ_TOKEN="${password}" \
    -e K6_HA_CLUSTERS="1" \
    -e K6_HA_REPLICAS="3" \
    -e K6_DURATION_MIN="5"
```
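
The status check described in this commit message can be sketched as follows. This is only an illustration in Go (the actual k6 script is JavaScript), and the `writeAccepted` helper, its signature, and the rule that `202` is accepted only when more than one HA replica is configured are assumptions, not the script's real code.

```go
package main

import "fmt"

// writeAccepted reports whether a remote-write response status should count
// as a successful write in the load test.
// Hypothetical helper: the name, signature, and exact rule are assumptions.
func writeAccepted(status, haReplicas int) bool {
	if status == 200 {
		return true
	}
	// With several HA replicas, samples from non-elected replicas are
	// rejected by deduplication and answered with 202, which is expected.
	return haReplicas > 1 && status == 202
}

func main() {
	fmt.Println(writeAccepted(202, 3)) // true
	fmt.Println(writeAccepted(202, 1)) // false
}
```

With a check of this shape, the `write worked` summary line would no longer count deduplicated replicas as failures.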

Signed-off-by: Wilfried Roset <wilfriedroset@users.noreply.github.com>
* implement read v2

* updated CHANGELOG.md

* extend maxBytesInFrame comment.

* addressed PR feedback

* addressed PR feedback

* addressed PR feedback

* use indexed xor chunk function to assert stream remote read tests

* updated CHANGELOG.md

Co-authored-by: Miguel Ángel Ortuño <miguel.ortuno@grafana.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…e is 0 (#1783)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Print version+arch of Mimir loaded to Docker.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Use debug log for distributor.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…ortex_distributor_ingester_query_failures_total (#1797)

* Remove unused metrics cortex_distributor_ingester_queries_total and cortex_distributor_ingester_query_failures_total

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Remove unused fields

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Added options support to SendSumOfCountersPerUser()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Renamed SkipZeroValueMetrics() to WithSkipZeroValueMetrics()

Signed-off-by: Marco Pracucci <marco@pracucci.com>
… to let people install both while migrating from Cortex to Mimir (#1801)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
…1808)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Allow customizing mimir cli flags per zone for the store gateway.
Copied the same solution as we have for ingesters.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
…n the ring (#1806)

* Add protection to store-gateway to not drop all blocks if unhealthy in the ring

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Update CHANGELOG.md

Co-authored-by: Peter Štibraný <pstibrany@gmail.com>

Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
…tor_ingester_append_failures_total unused metrics (#1799)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Extract and test TracerTransport functionality

We need to use a TracerTransport in mimir-continuous-test. We have that
in the frontend package, but I don't want to import frontend from
mimir-continuous-test, so we extract it to util/instrumentation.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Set up global tracer in mimir-continuous-test

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Add tracing to the client and spans to the tests

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Add jaeger-mixin to mimir-continuous-test container

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* make license

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Add traces to the write path

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Update CHANGELOG.md

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
* Removed unused Info() and advLabelSets from BucketStore

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed unused FilterConfig from BucketStore

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed unused relabelConfig from store-gateway tests

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed unused function expectedTouchedBlockOps()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed unused recorder from BucketStore tests

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* go mod vendor

Signed-off-by: Marco Pracucci <marco@pracucci.com>
williamzelesny and others added 23 commits June 6, 2022 18:32
* Upgrade alpine to 3.16.0

* Enhance MimirRequestLatency runbook with more advice (#1967)

* Enhance MimirRequestLatency runbook with more advice

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Include helm-docs in build and CI (#2026)

* Update the mimir build image and its build doc

Dockerfile: Add helm-docs package to the image.
how-to: Write down the requirements for build in more detail. Add
information about build on linux.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

* Expand make doc with helm-docs command

This enables generating the helm chart README with the same make doc
command as all other documentation.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

* Update docs/internal/how-to-update-the-build-image.md

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Update contributing guides for the helm chart (#2008)

* Update contributing guides for the helm chart

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

* Turn off helm version increment check in CI

This enables periodic releases, as opposed to requiring version bump
for release at every PR.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

* Add extraEnvFrom to all services and enable injection into mimir config (#2017)

Add `extraEnvFrom` capability to all Mimir services to enable injecting
secrets via environment variables.

Enable the `-config.expand-env=true` option in all Mimir services to be able
to take secrets/settings from the environment and inject them into the
Mimir configuration file.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

* Docs: fix mimir-mixin installation instructions (#2015)

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Docs: make documentation a first class citizen in CHANGELOG (#2025)

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* upgrade to alpine 3.16.0

* upgrade alpine to 3.16.0

Co-authored-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
This should be automated, but for now it is done manually.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
The default value of 200ms, shared with all other memcached caches,
is too aggressive in most cases. This results in TSDB data often being
fetched from object storage in cases where a slightly longer timeout
would result in a cache hit.

This is set in Jsonnet and Helm instead of as a default of the CLI
flag since the flags (and hence their defaults) are shared among all
caches (index, chunks, metadata, results).

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
* Add test-enterprise-values.yaml
…ry sharding is enabled (#2036)

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Renamed newDiscoveryService() to newMimirDiscoveryService()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added newMimirPdb() utility

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added newMimirStatefulSet() utility

Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Helm: Add golden-record build script
* Helm: add test-values golden record
* Add PR check for `check-helm-tests`
* Add Helm setup to lint-helm action
* Update generated helm tests
* Fix bash linting
* Update contribution guidelines
* Update generated helm manifests
* Helm: fix kube version

Set kubeVersionOverride to generate PodDisruptionBudget API version
consistently.

When I ran the test, I got a diff, because my k8s is newer (1.23).

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

* Update operations/helm/tests/build.sh

Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Co-authored-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
… indexheader reads. (#2019)

Introduces a new experimental configuration option (`-blocks-storage.bucket-store.index-header.map-populate-enabled`).

This enables the use of the `MAP_POPULATE` flag when `mmap`-ing index-header files in the store-gateway. What this flag does is advise the kernel to (synchronously) pre-fault all pages in the memory region, loading them into the file system cache.

Why is this a good idea?
- The initial read process of the index-header files has been shown to cause hangups in the store-gateway.
- By using this option, I/O is done in the mmap() syscall, which the Go scheduler can cope with.
- We reduce the likelihood of goroutines getting stalled in major page faults.
- The initial read process walks the entire file anyway, so we are not doing any more I/O.
- It's a very low risk change compared to re-writing the BinaryReader (work in progress).

Why is this not perfect?
- The kernel does not guarantee the pages will stay in memory, so we are only reducing the probability of major page faults.

Rationale about the implementation:
- I have copied the mmap utilities from Prometheus as a temporary measure, for the sake of evaluating this change.
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* Update Prometheus with async chunk mapper changes.

Included changes:

grafana/mimir-prometheus#131
grafana/mimir-prometheus#247

These result in lower memory usage by the chunk mapper.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
* Fix ruler config in getting started guide

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>
A previous change (#2019) assumed MAP_POPULATE was available on Darwin. This fixes the build.
…#1949)

* mixin: adapt alerts/playbooks to take into consideration ruler query path components.

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>

* applied PR suggestion

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>

* applied PR suggestion

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>

* restored ruler missed evaluations alert

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>

* updated CHANGELOG.md

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>
* Return and log detailed services information on /ready

This helps debug starting services more easily.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Only return non-running services

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
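
The behaviour described in these two commits can be sketched as below. This is an illustration under stated assumptions rather than Mimir's implementation: the `notRunning` and `readyBody` helpers, the string-based service states, and the exact response text are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// notRunning returns a sorted summary of the services whose state is
// anything other than "Running".
// Hypothetical: a real service manager would expose typed states.
func notRunning(states map[string]string) []string {
	var out []string
	for name, state := range states {
		if state != "Running" {
			out = append(out, name+": "+state)
		}
	}
	sort.Strings(out)
	return out
}

// readyBody builds the status code and body of a readiness response,
// listing only the services that are not running yet.
func readyBody(states map[string]string) (int, string) {
	pending := notRunning(states)
	if len(pending) == 0 {
		return 200, "ready"
	}
	return 503, "not ready:\n" + strings.Join(pending, "\n")
}

func main() {
	code, body := readyBody(map[string]string{
		"distributor": "Running",
		"ingester":    "Starting",
	})
	fmt.Println(code)
	fmt.Println(body)
}
```

Listing only the non-running services keeps the response short while still pointing at the component that is blocking readiness.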
…2009)

* add validation.RateLimited to error catalogue

* Add validation.TooManyHAClusters to error catalogue

* update docs

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* improve new MessageWithLimitConfig and add tests

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Update from changes in code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>
* Add Patrick Oyarzun as Team Member

* Update MAINTAINERS.md
* mimir-continuous-test: Add smoke test mode

* Add PR number to CHANGELOG

* Update error assertions in write_read_series_test

* Fix doc formatting

* Address PR feedback

* Fix goimports formatting
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Make MessageWithLimitConfig accept multiple flags

* Add tenant string in per-tenant error labels

* Revert "Add tenant string in per-tenant error labels"

This reverts commit 758ef72.

* rename error too-many-ha-clusters
* ruler: report failed eval on any 5xx status

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>

* addressed PR suggestion

Signed-off-by: Miguel Ángel Ortuño <ortuman@gmail.com>
@jesusvazquez jesusvazquez self-assigned this Jun 10, 2022
The OOO implementation changed the ChunkReader interface.

Mimir imports Thanos and there are issues with the changes on that
interface so we had to fork Thanos to perform the interface change.

We'll try to upstream this soon so that we don't need to do this
in the future.
@jesusvazquez jesusvazquez force-pushed the jvp/update-out-of-order-with-main branch from c5cc520 to a2bb750 Compare June 10, 2022 09:48
@jesusvazquez jesusvazquez marked this pull request as ready for review June 10, 2022 10:03
@jesusvazquez jesusvazquez merged commit b3149aa into out-of-order Jun 10, 2022
@jesusvazquez jesusvazquez deleted the jvp/update-out-of-order-with-main branch June 10, 2022 10:05