
Misconfigured query_ingesters_within may cause queries to miss logs #6043

Open
honganan opened this issue Apr 28, 2022 · 4 comments

Comments

@honganan
Contributor

honganan commented Apr 28, 2022

Describe the bug
I have a log line with several labels, e.g. application=myapp, source=error.log, and it contains some keywords:

2022-04-28 06:11:31.239, TID:221a40a1baa748cbbeeea45b2af4cbb2.393.16511262905704693, ERROR, com.xxx.xxx.xxx.dao.BaseDAO, BaseDAO.java, 223, add id clash id=6131027,trynum=3,e=java.sql.SQLIntegrityConstraintViolationException,....

When I query with only the label matchers and no pipeline, the line is returned. But when I add a pipeline stage filtering on a keyword contained in the log, the result is empty.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (2.4.2)
  2. Configured the write component with:
     chunk_idle_period: 3h
     chunk_retain_period: 30s
     max_chunk_age: 2h
  3. Configured the querier and query-frontend components with (see the consolidated config sketch after this list):
     query_store_only: false
     query_ingesters_within: 15m
  4. Executed two queries:
     1st: {application="myapp",source="error.log"}
     2nd: {application="myapp",source="error.log"} |= "SQLIntegrityConstraintViolationException"
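
For reference, a minimal sketch of how the settings above would sit together in one Loki config file (assuming the documented 2.4.x ingester/querier section layout; only the fields mentioned in the steps are shown):

  # Sketch of the configuration described in the steps above
  ingester:
    chunk_idle_period: 3h
    chunk_retain_period: 30s
    max_chunk_age: 2h

  querier:
    query_store_only: false
    query_ingesters_within: 15m  # far shorter than max_chunk_age (2h)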

Expected behavior
Expected both queries to return the log line, but only the 1st did. After about 2 hours, both of them returned it.

Environment:

  • Infrastructure: VM
  • Deployment tool: Docker
  • Deployment mode: micro services
@honganan
Contributor Author

I guess it's because the stream produces logs slowly and the chunk had not been flushed yet when I queried, while query_ingesters_within configures queriers to ask ingesters only about the most recent 15 minutes. So logs between now-2h and now-15m cannot be queried at all. I want to confirm: am I right?

If that's the problem, should we add some checks on the config values and give a warning?
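
To make the suspected gap concrete, an illustrative timeline with the values above (my arithmetic, assuming the chunk has neither hit max_chunk_age nor been idle-flushed):

  now-2h  ... oldest data that may still exist only in an unflushed ingester chunk (max_chunk_age)
  now-15m ... oldest data the querier will ask ingesters about (query_ingesters_within)

  => log lines with timestamps in (now-2h, now-15m] can sit only in ingester
     memory, yet the querier never asks the ingesters about that window, so
     they stay invisible until the chunk is flushed to the store.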

@cyriltovena
Contributor

cc @sandeepsukhani

@sandeepsukhani
Contributor

> I guess it's because the stream produces logs slowly and the chunk had not been flushed yet when I queried, while query_ingesters_within configures queriers to ask ingesters only about the most recent 15 minutes. So logs between now-2h and now-15m cannot be queried at all. I want to confirm: am I right?
>
> If that's the problem, should we add some checks on the config values and give a warning?

Yes, you got it right. query_ingesters_within should be at least max_chunk_age, which defaults to 2h. We should add validation for this and for similar configs where Loki operators can make mistakes that could be avoided by checks.
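
Following that rule of thumb, a corrected sketch (mine, not from the thread) keeps the ingester lookback at or above the maximum chunk age:

  ingester:
    max_chunk_age: 2h
  querier:
    query_ingesters_within: 2h  # >= max_chunk_age; larger (e.g. 3h) gives headroom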

@duj4

duj4 commented Mar 6, 2024

Hi @sandeepsukhani,

It seems I have hit a similar issue. I am running Loki on K8s in SSD mode; max_chunk_age is at its default of 2h and query_ingesters_within is set to 3h. I ingested a log snippet that was generated on March 1st:

2024-03-01 02:52:27,703+0800 jira-stats-0 INFO [c.a.j.util.stats.JiraStats] [JIRA-STATS] [NODE-START] snapshot stats: duration=PT1H0.007S, statsOverhead=n/a, data={"_statsName":"NODE-START","_statsType":"snapshot","_time":"2024-02-29T18:52:27.702Z","_timestamp":1709232747702,"_duration":"PT1H0.007S","_invocations":0,"_statsOverhead":"n/a","configuration":{},"checkIndexOnStart":{"result":"NOT_RUN","timeInSeconds":0},"getIndexBy":{"rebuildLocalIndex":{"result":"NOT_RUN","timeInSeconds":0},"pickIndexSnapshotFromSharedHome":{"result":"NOT_RUN","timeInSeconds":0},"requestIndexSnapshotFromAnotherNode":{"result":"NOT_RUN","timeInSeconds":0},"performFullForegroundReindex":{"result":"NOT_RUN","timeInSeconds":0}},"ensureFreshIndexSnapshot":{"result":"NOT_RUN","snapshotExisted":false,"snapshotCreated":false,"timeInSeconds":0}}
2024-03-01 02:52:27,704+0800 jira-stats-0 INFO [c.a.j.util.stats.JiraStats] [JIRA-STATS] [NODE-START] total stats: duration=PT205H6M19.18S, statsOverhead=n/a, data={"_statsName":"NODE-START","_statsType":"total","_time":"2024-02-29T18:52:27.702Z","_timestamp":1709232747702,"_duration":"PT205H6M19.18S","_invocations":0,"_statsOverhead":"n/a","configuration":{},"checkIndexOnStart":{"result":"NOT_RUN","timeInSeconds":0},"getIndexBy":{"rebuildLocalIndex":{"result":"NOT_RUN","timeInSeconds":0},"pickIndexSnapshotFromSharedHome":{"result":"NOT_RUN","timeInSeconds":0},"requestIndexSnapshotFromAnotherNode":{"result":"NOT_RUN","timeInSeconds":0},"performFullForegroundReindex":{"result":"NOT_RUN","timeInSeconds":0}},"ensureFreshIndexSnapshot":{"result":"NOT_RUN","snapshotExisted":false,"snapshotCreated":false,"timeInSeconds":0}}
2024-03-01 02:52:29,578+0800 plugin-transaction-0 INFO [c.a.jira.plugin.PluginTransactionListener] [JIRA-STATS] [plugin-transaction] numberStartEvents:627, numberEndEvents:627, numberSendEvents:278, numberEventsInTransactions:12196, numberOfPluginEnableEvents:276
2024-03-01 02:52:43,445+0800 Caesium-1-2 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Start to check for index consistency.
2024-03-01 02:52:43,446+0800 Caesium-1-2 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Index consistency check finished. indexConsistent: true; timeToCheckIndexConsistency: 0 ms
2024-03-01 02:52:43,446+0800 Caesium-1-2 INFO ServiceRunner [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 1.343 ms

When I tried to query it from Grafana, it gave me an error:
[screenshot: Grafana query error]

But when I ingested a log snippet generated just now, everything looked normal:

2024-03-06 16:32:14,374+0800 Caesium-1-1 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Start to check for index consistency.
2024-03-06 16:32:14,380+0800 Caesium-1-1 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Index consistency check finished. indexConsistent: true; timeToCheckIndexConsistency: 2 ms
2024-03-06 16:32:14,380+0800 Caesium-1-1 INFO ServiceRunner [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 5.981 ms
2024-03-06 16:33:14,375+0800 Caesium-1-4 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Start to check for index consistency.
2024-03-06 16:33:14,379+0800 Caesium-1-4 INFO ServiceRunner [c.a.j.issue.index.DefaultIndexManager] Index consistency check finished. indexConsistent: true; timeToCheckIndexConsistency: 0 ms
2024-03-06 16:33:14,380+0800 Caesium-1-4 INFO ServiceRunner [c.a.jira.servlet.ApplicationStateResolverImpl] Checking index consistency. Time taken: 3.970 ms

Per the doc:

query_ingesters_within: Maximum lookback beyond which queries are not sent to ingester

So it means that if the query time range (7 days) falls outside the scope of query_ingesters_within (3h), Loki will not search the ingesters at all, only the backend storage. However, since the data is still within the max_chunk_age (2h) / chunk_idle_period (30min) window, the logs in memory have not been flushed to storage yet, so I could find them in neither memory nor storage.

Once I set query_ingesters_within to 0, the older logs became searchable again.

Please correct me if there is any misunderstanding.

Thanks.
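
For reference, the workaround described above would look like this in the querier section (a sketch; a zero lookback makes queriers always consult ingesters, at the cost of sending every query to them):

  querier:
    query_ingesters_within: 0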
