
PROMTAIL TO LOKI - Only latest file #8590

Open · H3-LAB opened this issue Feb 22, 2023 · 5 comments

H3-LAB commented Feb 22, 2023

Hey guys,
I am new to Promtail configuration with Loki.
I'm trying a progressive migration from Telegraf with InfluxDB to this promising solution.

The problem I'm having is that I can't get Promtail to parse only the latest file in a directory.

Imagine the following scenario:

/path/to/log/file-2023.02.01.log and so on, following a date pattern.

When Promtail starts, it grabs all the files ending in ".log", and despite several attempts to make it look only at the latest file, I don't see any other solution than creating a symbolic link to the latest file in that directory.

Because Promtail is loading all the files, I end up with out-of-order logs, and the order in which new lines arrive in the most recent file is not respected.

This obviously impacts the Grafana dashboards, which are no longer accurate with respect to the entry time.

Here is my Promtail config file:

server:
  http_listen_port: 0
  grpc_listen_port: 0

positions:
  filename: /data/loki/app01_positions.yaml

clients:
  - url: http://localhost:8888/loki/api/v1/push

scrape_configs:
  - job_name: AWS_APP01
    static_configs:
    - labels:
        job: AWS_APP01
        __path__: /opt/logs/hosts/hostXPTO/app01-*.log
    pipeline_stages:
      - match:
          selector: '{job="app01"}'
          stages:
            - regex:
                expression: '^app01-(?P<date>\d{4}.\d{2}.\d{2}).log'
            - timestamp:
                source: date
                format: '2006.01.02'
            - regex:
                expression: '^(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>\w+)\s+\[(?P<action>[^\]]+)\]\((?P<process>[^)]+)\)\[(?P<service_id>[^\]]+)\]\s+(?P<message>.*)'
            - labels:
                level:
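
For reference, a rough, untested sketch of the symbolic-link idea mentioned above (latest-app01.log is a made-up name for a link that an external script keeps pointed at the newest app01-*.log file):

scrape_configs:
  - job_name: AWS_APP01
    static_configs:
    - labels:
        job: AWS_APP01
        # the link is maintained outside Promtail (e.g. with inotify),
        # so only one "active" file ever matches __path__
        __path__: /opt/logs/hosts/hostXPTO/latest-app01.log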

I did try to map the file name as a value to be used as the timestamp, hoping that Promtail would use it as a reference to the latest file.

Example:
In Telegraf there is a rule/option that forces the process to look only at the last file:
from_beginning: false (or something like that...)

Is there anything similar for Promtail?

I understand that Promtail uses the positions file to know which files it has to read, but how can I tell it to only look at the latest file?

Am I thinking wrong, influenced by Telegraf?
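
The closest built-in knob I could find is excluding files rather than selecting the newest one; this is untested and only applies if the Promtail version in use supports the __path_exclude__ label:

scrape_configs:
  - job_name: AWS_APP01
    static_configs:
    - labels:
        job: AWS_APP01
        __path__: /opt/logs/hosts/hostXPTO/app01-*.log
        # illustrative only: keeps Promtail away from last year's files,
        # but it still does not select "only the newest" file
        __path_exclude__: /opt/logs/hosts/hostXPTO/app01-2022.*.log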

Cheers!!

DylanGuedes (Contributor) commented:

Thanks for your report.

The last time I checked, Promtail was scraping files in order, so I'm surprised that you are seeing out-of-order issues. That said, can you confirm that it works fine if you add the files after Promtail is already running?

H3-LAB (Author) commented Feb 22, 2023

Thanks for your quick response @DylanGuedes!

Well, maybe I'm misunderstanding something when I look at Grafana's "Live" mode.
Since I'm watching the job in Live mode, I was expecting to see only the new lines from the most recent file, but what I'm seeing is the entire input of the job for all the files listed in the positions file.

When I look into the positions file I see:

/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.15.log: "57459091"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.16.log: "83489537"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.17.log: "74243738"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.18.log: "89573666"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.19.log: "84541299"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.20.log: "78505419"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.21.log: "76142089"
/opt/logs/hosts/host01-prod/appLogs/app01/app01-2023.02.22.log: "79103375"

The order seems to be right, but the Live view is showing me logs from yesterday being inserted at 20:13 today (for example).

If I want to build a dashboard based on what's happening right now, I will have data from every log file mixed together, with statuses and log levels from different days.

I don't see anything in the Promtail documentation that explains what else I can do with the __path__ variable, other than specifying the path where the logs are stored.

From what I can see in the documentation, log ingestion seems to assume that the latest file in a directory is named app.log and the rest are archived, e.g. app.log.gz, which you can exclude.
The main example is /var/log/, where messages.log is always the same file and gets rotated to messages.log.<date>...
That way, messages.log keeps being updated and its time frame follows the last line written.

But if we have app01-2023.02.15.log, app01-2023.02.16.log, app01-2023.02.17.log, and so on, how does Promtail know which is the latest file? Every file ends in .log, so apparently it doesn't know which one is the last.
I thought it would figure that out from the system time, or from some input/variable telling it to always read the latest file...

H3-LAB (Author) commented Feb 22, 2023

One question that must be important:

Should I configure the scrape config with some variable to define or match timestamps between the log file and Loki?

In the pipeline stage I'm using a regex to label some values:
expression: '^(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{3})\s+'

Should the timestamp label be used in some way to match the timestamps between Promtail ingestion and Loki's timestamps/time frames? (See the sketch after the config below.)

This is my actual pipeline stage:

    pipeline_stages:
      - match:
          selector: '{job="app01"}'
          stages:
            - regex:
                expression: '^app01-(?P<date>\d{4}.\d{2}.\d{2}).log'
            - timestamp:
                source: date
                format: '2006.01.02'
            - regex:
                expression: '^(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<level>\w+)\s+\[(?P<action>[^\]]+)\]\((?P<process>[^)]+)\)\[(?P<service_id>[^\]]+)\]\s+(?P<message>.*)'
            - labels:
                level:
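
For what it's worth, here is a minimal, untested sketch of how the timestamp stage maps an extracted value onto the entry's timestamp when a line carries a full date and time (the example line format is made up; the real lines here only contain the time of day, so the date would still have to come from somewhere else, e.g. the symbolic-link approach plus a full timestamp in the line):

    pipeline_stages:
      - regex:
          # hypothetical line: "2023-02-22 20:13:05.123 INFO ..."
          expression: '^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})'
      - timestamp:
          source: ts
          # Go reference-time layout matching the captured value
          format: '2006-01-02 15:04:05.000'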

paulrozhkin commented Apr 28, 2023

I have the same problem. If you don't have a single .log file at the end of the rotation, file parsing becomes a problem. Is there no new solution?
I know you can read all the content of the files, extract the timestamp, and drop old entries with older_than, but that can be a problem, because it means parsing a lot of files.
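
For reference, a minimal, untested sketch of that drop-by-age idea (it assumes a timestamp stage earlier in the pipeline has already set the entry time from the log content):

    pipeline_stages:
      - drop:
          # drop entries whose parsed timestamp is more than 24h old;
          # note that Promtail still has to read and parse every matching file first
          older_than: 24h
          drop_counter_reason: line_too_old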

H3-LAB (Author) commented Apr 28, 2023

> I have the same problem. If you don't have a single .log file at the end of the rotation, file parsing becomes a problem. Is there no new solution? I know you can read all the content of the files, extract the timestamp, and drop old entries with older_than, but that can be a problem, because it means parsing a lot of files.

Hello @paulrozhkin!
Yes, you're right. I'm trying to read an NGINX ingress access log with hundreds of requests per second (a widely used API), and a file-naming scheme like nginx_ingress-2022-04-28 does not respect the organization of log entries by date.
That said, my solution was to create a script that maintains a symbolic link to the latest file in the NGINX logs directory.
The script uses inotify to make sure the symbolic link is updated as soon as a new file appears; a rough sketch follows below.

I had to do it this way since I have no control over the file format/naming.
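
A rough, untested sketch of that script (the directory, file pattern, and link name are placeholders, not my real paths):

    #!/usr/bin/env bash
    # Keep a "latest" symlink pointing at the newest dated log file in a directory.
    LOG_DIR=/path/to/nginx/logs                 # placeholder path
    LINK="$LOG_DIR/latest-access.log"           # Promtail's __path__ points at this link

    update_link() {
        # dated file names sort lexicographically, so the last one is the newest
        latest=$(ls "$LOG_DIR"/nginx_ingress-* 2>/dev/null | sort | tail -n 1)
        [ -n "$latest" ] && ln -sfn "$latest" "$LINK"
    }

    update_link
    # re-point the link whenever a new log file shows up in the directory
    inotifywait -m -q -e create -e moved_to --format '%f' "$LOG_DIR" |
    while read -r name; do
        case "$name" in
            nginx_ingress-*) update_link ;;
        esac
    done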
