
Telegraf fails after random time period when using specific log parser #3142

Closed
birengoodco opened this issue Aug 21, 2017 · 7 comments
Labels: area/influxdb, bug (unexpected problem or unintended behavior)

@birengoodco

Bug report

Per Daniel's request from here, I'm opening this bug.

Telegraf keeps failing after some random amount of time and has to be restarted.

So what seems to happen is the log fills up with a bunch of “unable to parse” errors due to “bad timestamp”… but it makes no sense, since the data looks good in the log message and, on top of that, I’m seeing the data actually end up in Influx. So I'm not sure why I’m getting those… and then after a few minutes, it just dies altogether and nothing makes it into Influx anymore. I get the error E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster

I can restart Telegraf and it works fine for a while until it dies again. In my pool of servers it doesn’t seem very consistent (all of the servers are running the same Telegraf config file… the only difference being the hostname that’s set). Any ideas?

I wasn’t having any problems until I introduced this log parser…

I’m also seeing this error on some other servers: E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'sidekiq_log,... (I'm leaving out the rest of the error as it contains details I'd have to scrub).

Relevant telegraf.conf:

[[inputs.logparser]]
   files = ["/opt/applications/core/current/log/sidekiq.log"]
   from_beginning = false
   [inputs.logparser.grok]
     patterns = ["%{CUSTOM_SIDEKIQ_LOG}"]
     measurement = "sidekiq_log"
     custom_patterns = '''
       THREAD_ID (?:TID-\S+)
       JOB_ID (?:JID-\S+)
       CUSTOM_SIDEKIQ_LOG %{TIMESTAMP_ISO8601:timestamp:ts-"2006-01-02T15:04:05.000Z"} %{POSINT:pid:int} %{THREAD_ID:tid} %{NOTSPACE:worker:tag} %{JOB_ID:jid} %{LOGLEVEL:severity:tag}: %{GREEDYDATA:message}
     '''
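
A quick aside, not from the original report: the ts-"2006-01-02T15:04:05.000Z" modifier in the grok pattern is a Go reference-time layout, so the layout string can be sanity-checked directly with Go's time package. The timestamp below is a made-up sample in the format the pattern expects:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Same layout string as in the grok pattern above.
	const layout = "2006-01-02T15:04:05.000Z"

	// Hypothetical sidekiq timestamp; any value in this format should parse.
	t, err := time.Parse(layout, "2017-08-21T20:08:43.379Z")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}

	// Telegraf uses the parsed time as the point's timestamp, which ends up
	// in line protocol as nanoseconds since the epoch.
	fmt.Println(t.UnixNano())
}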

System info:

Telegraf Version: 1.3.5
OS: Ubuntu 14.04.5


@danielnelson added the bug (unexpected problem or unintended behavior) label Aug 21, 2017
@danielnelson
Contributor

Can you scrub the sensitive data and add the full error message?

@birengoodco
Author

birengoodco commented Aug 21, 2017

Here you go, here are a few lines:

2017-08-21T21:02:10Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-08-21T21:02:20Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse 'sidekiq_log,worker=Social::UserFriendSignupWorker,severity=DEBUG,instance-id=i-0bd88ec891a72520e,host=server-10.1.1.1\  pid=1225i,tid="TID-gz3qk",jid="JID-ed117a815d6e96fab62c9b9c",message="  ^[[1m^[[35mSQL (0.3ms)^[[0m  INSERT INTO \"user_notifications\" (\"notification_type\", \"origin_id\", \"user_id\", \"created_at\", \"updated_at\", \"tokens\") VALUES ($1, $2, $3, $4, $5, $6) RETURNING \"id\"  [[\"notification_type\", \"signup\"], [\"origin_id\", 25319250], [\"user_id\", 25319258], [\"created_at\", \"2017-08-21 20:08:43.379426\"], [\"updated_at\", \"2017-08-21 20:08:43.379426\"], [\"tokens\", \"{\\"user_first_name\\":\\"John\\",\\"user_name\\":\\"John Smith\\",\\"origin_first_name\\":\\"Bob\\",\\"origin_name\\":\\"Bob Jones\\"}\"]]" 1503346123383000000': bad timestamp
unable to parse 'sidekiq_log,worker=Social::UserFriendSignupWorker,severity=DEBUG,instance-id=i-0bd88ec891a72520e,host=prod-serveronly-as-10.22.35.211\  pid=1225i,tid="TID-gz3qk",jid="JID-ed117a815d6e96fab62c9b9c",message="  ^[[1m^[[35mSQL (0.3ms)^[[0m  INSERT INTO \"user_notifications\" (\"notification_type\", \"origin_id\", \"user_id\", \"created_at\", \"updated_at\", \"tokens\") VALUES ($1, $2, $3, $4, $5, $6) RETURNING \"id\"  [[\"notification_type\", \"signup\"], [\"origin_id\", 25319250], [\"user_id\", 25319257], [\"created_at\", \"2017-08-21 20:08:43.401638\"], [\"updated_at\", \"2017-08-21 20:08:43.401638\"], [\"tokens\", \"{\\"user_first_name\\":\\"Billy\\",\\"user_name\\":\\"Billy Smith\\",\\"origin_first_name\\":\\"Jen\\",\\"origin_name\\":\\"Jen Smith\\"}\"]]" 1503346123405000000': bad timestamp dropped=0]

@danielnelson
Contributor

I included a change in #3155 that should mitigate this problem. If InfluxDB rejects a point as unparsable, the error will still be logged, but the write batch will be treated as successfully written. My reasoning is that such a point would never become parsable and will always fail, so it is better to drop it than get stuck.

This change will be in 1.4.0.
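
A rough sketch of that idea, with made-up names; this is not the actual #3155 change, just the behavior described above, assuming the output inspects the HTTP response from InfluxDB:

package output

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// shouldRetry reports whether a failed InfluxDB write is worth retrying.
// Points rejected as unparsable ("partial write: unable to parse ...") will
// never succeed, so that batch is logged and reported as delivered instead
// of being retried until the metric buffer fills up.
func shouldRetry(resp *http.Response) (bool, error) {
	if resp.StatusCode == http.StatusNoContent {
		return false, nil // 204: the whole batch was accepted
	}
	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode == http.StatusBadRequest && strings.Contains(string(body), "partial write") {
		// The rejected points are malformed and will never parse on retry,
		// so log the response and treat the batch as written.
		fmt.Printf("E! InfluxDB Output Error: %s\n", body)
		return false, nil
	}
	return true, fmt.Errorf("status %d: %s", resp.StatusCode, body)
}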

@birengoodco
Author

birengoodco commented Aug 23, 2017

I guess I'm confused about why it even thinks it's not parsable... It logs many rows of similar type just fine, and then suddenly it starts failing. Let me provide some more details... So I decided to comment out the sidekiq log parser since I was having problems with it until you guys were able to get a fix out... but this morning, I realized that now it's failing with the same error for a different log parser... I wasn't seeing these fail in the logs before. Almost as if it's always the "last" log parser that it fails on...

This log parser's conf looks like this:

[[inputs.logparser]]
   files = ["/opt/applications/application/current/log/staging.log"]
   from_beginning = false
   [inputs.logparser.grok]
     patterns = ["%{CUSTOM_CORE_LOG}"]
     measurement = "app_core_log"
     custom_patterns = '''
       LEVEL (?:[IWDE])
       CUSTOM_CORE_LOG %{LEVEL}, \[%{TIMESTAMP_ISO8601:timestamp:ts-"2006-01-02T15:04:05.000000"} \#%{POSINT:pid:int}\]%{SPACE}%{NOTSPACE:severity:tag} -- %{DATA:program}:%{SPACE}%{GREEDYDATA:message}
     '''

Here's a sample record (and also the very record that corresponds with the error that comes later in this post):
D, [2017-08-23T13:21:51.476963 #11596] DEBUG -- : ^[[1m^[[36mSQL (0.7ms)^[[0m ^[[1mINSERT INTO "user_notifications" ("notification_type", "origin_id", "user_id", "created_at", "updated_at", "tokens") VALUES ($1, $2, $3, $4, $5, $6) RETURNING "id"^[[0m [["notification_type", "check_xxx"], ["origin_id", 68340], ["user_id", 68340], ["created_at", "2017-08-23 13:21:51.474298"], ["updated_at", "2017-08-23 13:21:51.474298"], ["tokens", "{\"user_first_name\":\"John\",\"user_name\":\"John Smith\",\"origin_first_name\":\"John\",\"origin_name\":\"John Smith\"}"]]
For a while, Telegraf was parsing records like this just fine... and then it randomly started failing with the kind of error shown below, repeated in the log... this particular error is for the record above:

unable to parse 'app_core_log,severity=DEBUG,instance-id=i-03545b80a9f76411a,host=server-10.1.1.1\  pid=11596i,message="^[[1m^[[36mSQL (0.7ms)^[[0m  ^[[1mINSERT INTO \"user_notifications\" (\"notification_type\", \"origin_id\", \"user_id\", \"created_at\", \"updated_at\", \"tokens\") VALUES ($1, $2, $3, $4, $5, $6) RETURNING \"id\"^[[0m  [[\"notification_type\", \"check_xxx\"], [\"origin_id\", 68340], [\"user_id\", 68340], [\"created_at\", \"2017-08-23 13:21:51.474298\"], [\"updated_at\", \"2017-08-23 13:21:51.474298\"], [\"tokens\", \"{\\"user_first_name\\":\\"John\\",\\"user_name\\":\\"John Smith\\",\\"origin_first_name\\":\\"John\\",\\"origin_name\\":\\"John Smith\\"}\"]]" 1503494511476963000': bad timestamp dropped=0]
2017-08-23T13:25:30Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster

I don't understand the bad timestamp error... I'm pointing this out because I'm not sure your proposed fix gets to the heart of the issue... perhaps it keeps Telegraf from dying... but I think this particular error will continue to come up, which results in unreliable data in my InfluxDB...

Thoughts? Is this now a separate issue for InfluxDB instead?

@danielnelson
Contributor

Telegraf is parsing the lines fine but they are being rejected by InfluxDB due to an escaping issue. Check the InfluxDB issue that is linked above for more details.
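
An aside for anyone hitting the same thing (an assumption, not something confirmed in this thread): the rejected points above carry message field values full of embedded quotes and backslashes, which is exactly the part of line protocol where escaping rules matter and where parser versions have disagreed. Below is a minimal, hypothetical sketch of the escaping a double-quoted string field value needs under the current line protocol documentation; it is illustrative only, not Telegraf's serializer:

package lineprotocol

import "strings"

// escapeStringField escapes a value for use inside a double-quoted string
// field in InfluxDB line protocol: backslashes become \\ and double quotes
// become \". Hypothetical helper for illustration; Telegraf's serializer and
// the InfluxDB parser versions involved here may handle backslash runs
// differently, which is the mismatch referred to above.
func escapeStringField(v string) string {
	return strings.NewReplacer(`\`, `\\`, `"`, `\"`).Replace(v)
}

// Example:
//   escapeStringField(`{"user_first_name":"John"}`)
// returns
//   {\"user_first_name\":\"John\"}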

@danielnelson
Contributor

@birengoodco I think I've fixed the issue, can you test it out with 1.4.0-rc2?

@birengoodco
Author

Sweet!!! Thanks @danielnelson! I will give it a whirl tomorrow morning!! Appreciate the fast turnaround!
