
Lambda-promtail: add support for AWS WAF S3 logs #8254

Conversation


@matmisie-gemini matmisie-gemini commented Jan 24, 2023

What this PR does / why we need it:
This PR adds support for adding AWS WAF S3 logs to Loki.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
This is my first Go code, so I am open to suggestions if things can be done better.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@matmisie-gemini matmisie-gemini requested a review from a team as a code owner January 24, 2023 11:02

CLAassistant commented Jan 24, 2023

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Jan 24, 2023
// regex that extracts the timestamp from the AWS VPC Flow Logs message log
// format: timestamp (seconds)
// example: ... 1669842701 1669842702 ACCEPT
timestampRegex: regexp.MustCompile(`(?P<begin_date>\d+) (?P<end_date>\d+) (?:ACCEPT|REJECT)`),
Author

@chriskuchin AWS VPC Flow Log lines have begin and end times in seconds. I have added the begin date as the timestamp. Did you miss it? See https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html (Available fields section)
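For context, a minimal Go sketch of extracting the begin timestamp with this regex (the helper name and the sample log line are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// Same pattern as in the PR: begin/end are epoch seconds preceding ACCEPT/REJECT.
var vpcTimestampRegex = regexp.MustCompile(`(?P<begin_date>\d+) (?P<end_date>\d+) (?:ACCEPT|REJECT)`)

// extractBeginTime is a hypothetical helper: it pulls the begin_date capture
// group out of a VPC Flow Log line and converts it to a time.Time.
func extractBeginTime(line string) (time.Time, bool) {
	match := vpcTimestampRegex.FindStringSubmatch(line)
	if match == nil {
		return time.Time{}, false
	}
	secs, err := strconv.ParseInt(match[1], 10, 64)
	if err != nil {
		return time.Time{}, false
	}
	return time.Unix(secs, 0), true
}

func main() {
	// Default VPC Flow Log format: ... packets bytes start end action log-status
	line := "2 123456789010 eni-1235b8ca 10.0.0.1 10.0.0.2 443 49152 6 10 840 1669842701 1669842702 ACCEPT OK"
	if ts, ok := extractBeginTime(line); ok {
		fmt.Println(ts.Unix()) // prints 1669842701
	}
}
```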

Contributor

FWIW, I explicitly ignored them, as they weren't the Loki standard, and I felt that more advanced parsing should be left to actual promtail rather than this lambda becoming a full parser too.

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@chriskuchin
Contributor

chriskuchin commented Jan 24, 2023

So, a couple of thoughts; take them with a grain of salt, as I am just a consumer of Loki like you.

I don't love the struct pattern here, as it makes it feel like adding any new log type will always require an MR here, rather than pushing toward something more generic. Honestly, the only reason I used an | in the regex was caution; I would have preferred to just wildcard that section and set the log type to the captured value. I did the remapping because of how the original implementation did it, although again I would have preferred to dump the value from the regex into that field rather than remapping; I kept it to preserve backwards compatibility. Looking back, I would probably advocate for a flag to disable the mapping.

I think the parsing should stay a single regex and we should simply wildcard the type segment. If we need to keep the remapping, that's fine too; less than ideal in my opinion, but necessary for backwards compatibility.

On the topic of timestamps: I explicitly ignored the timestamps in the Flow Logs because they don't match the Loki standard and could appear anywhere in the log line. Also, because of the way Firehose works, I figured the timestamp would be close enough. If it isn't, I think the correct pattern to promote would be for the lambda to check a couple of standards and then fall back to the lambda timestamp; if you really want the correct timestamp, you can use promtail to extract it from the log line and set it. I think this is better than needing yet another MR every time we want to support a new log type from S3, if we push most of the parsing out of the lambda and into promtail.

@matmisie-gemini matmisie-gemini force-pushed the lambda-promtail-add-support-for-aws-waf-s3-logs branch from 2e555c4 to 6a7bd5d on January 24, 2023 17:34
@matmisie-gemini
Author

@chriskuchin thanks for taking the time to share your thoughts. You convinced me to make it more generic; I applied your suggestions.

I am not sure what you mean by the date in VPC Flow Logs not being "Loki standard"?

@chriskuchin
Contributor

I think the Loki API expected a different timestamp format than the ones in the Flow Logs. On top of that, with both a beginning and an end timestamp, I wasn't sure which people would expect or want, and figured it would vary from person to person. Also, it's not the first field in the log, or even guaranteed to be where it is by default.

Basically, I punted the handling of timestamps to the implementer using promtail. To be honest, for me the 5-minute window of the Firehose is probably good enough for these logs at the histogram level, and if I need more accuracy I can parse the timestamp out of the log line using LogQL.

Also, when I implemented mine there was only the one timestamp regex, and with all the questions around the "correct" way to handle Flow Logs I opted to keep the existing one and not introduce a new one.

Does that make sense?

Contributor

@JStickler JStickler left a comment


[Docs review] A couple of small changes.

@@ -95,6 +95,10 @@ For those using Cloudwatch and wishing to test out Loki in a low-risk way, this

Note: Propagating logs from Cloudwatch to Loki means you'll still need to _pay_ for Cloudwatch.

### WAF logs
Contributor

WAF? Web Application Framework? Write Amplification Factor? There are 48 definitions for WAF on Acronym Finder. I really had to hunt through the Amazon docs, but apparently it is "web application firewall".

docs/sources/clients/lambda-promtail/_index.md (outdated, resolved)
Co-authored-by: J Stickler <julie.stickler@grafana.com>

@cstyan cstyan self-requested a review January 25, 2023 00:59
@matmisie-gemini
Author

@chriskuchin It does make sense, but I think the begin date is acceptable: it could be closer to the actual time something happened than date.now, and that could make it easier to precisely find issues in the logs related to a specific point in time.

@CCOLLOT
Contributor

CCOLLOT commented Jan 25, 2023

@chriskuchin It does make sense, but I think the begin date is acceptable: it could be closer to the actual time something happened than date.now, and that could make it easier to precisely find issues in the logs related to a specific point in time.

I agree with @chriskuchin. In the end, I think it should be up to the end user to parse the start/end timestamp using LogQL. Also, if we used the start/end timestamps, we could run into issues with reject_old_samples_max_age if, for some reason, logs are delivered by lambda-promtail a while after the packet was captured.

In #8231 (waiting for review), I add the possibility of avoiding log loss when lambda-promtail faces execution failures, using 2 SQS queues (a main queue receiving S3 notifications and a dead-letter queue receiving failed events). If those failed events are reprocessed after reject_old_samples_max_age, we might not be able to ingest those logs anymore.

Alternatively, an operator could decide they don't want to ingest a network packet log that happened before reject_old_samples_max_age, in which case it could make sense to use start/end. But still, if such a transformation is required, I think it might as well be done on the promtail side.

Besides, I think this PR is getting very close (if not a duplicate) to this one: #8066

@matmisie-gemini
Author

@CCOLLOT thanks for the comments. I am ok with not parsing the begin/end dates of VPC Flow Logs. Let's see what the Loki team thinks, and I will make the change.

Thanks for pointing out this PR: #8066
I added a comment there that it won't work for WAF logs: the filename regex will not catch the WAF S3 path, as it also contains the hour and minute, and the timestamp in the log line has a different date format.
If that PR gains changes to catch WAF logs, I am ok with cancelling this one.

Comment on lines 32 to 43
// AWS WAF logs
// source: https://docs.aws.amazon.com/waf/latest/developerguide/logging-s3.html
// format: bucket[/prefix]/AWSLogs/aws-account-id/WAFLogs/region/webacl-name/YYYY/MM/dd/HH/mm/aws-account-id_waflogs_region_webacl-name_YYYYMMddTHHmmZ_random-string.log.gz
// example: aws-waf-logs-test/AWSLogs/11111111111/WAFLogs/us-east-1/TEST-WEBACL/2021/10/28/19/50/11111111111_waflogs_us-east-1_TEST-WEBACL_20211028T1950Z_e0ca43b5.log.gz
filenameRegex = regexp.MustCompile(`AWSLogs/(?P<account_id>\d+)/(?P<type>\w+)/(?P<region>[\w-]+)/(?:[\w-]+/)?(?P<year>\d+)/(?P<month>\d+)/(?P<day>\d+)/(?:(?P<hour>\d+)/)?(?:(?P<minute>\d+)/)?\d+_(?:elasticloadbalancing|vpcflowlogs|waflogs)_\w+-\w+-\d_(?:(?:app|nlb|net)\.*?)?(?P<src>[a-zA-Z0-9\-]+)`)

// regex that extracts the timestamp from message log
timestampRegexList = []*regexp.Regexp {
regexp.MustCompile(`\w+ (?P<timestamp>\d+-\d+-\d+T\d+:\d+:\d+\.\d+Z)`), //h2 2022-12-20T23:55:02.599911Z ...
regexp.MustCompile(`(?P<begin_date>\d{8,}) (?P<end_date>\d{8,}) (?:ACCEPT|REJECT)`), //... 1669842701 1669842702 ACCEPT ... (seconds)
regexp.MustCompile(`"timestamp":(?P<timestamp>\d+)`), //{"timestamp":1671624901861,... (milliseconds)
}
Contributor

I think these regexes are getting large enough that we should start organizing them a bit more. What do you think about splitting the filename regex into

filenames := `elasticloadbalancing|vpcflowlogs|waflogs`

and

fileBaseRegex := regexp.MustCompile(fmt.Sprintf(`AWSLogs/(?P<account_id>\d+)/(?P<type>\w+)/(?P<region>[\w-]+)/(?:[\w-]+/)?(?P<year>\d+)/(?P<month>\d+)/(?P<day>\d+)/(?:(?P<hour>\d+)/)?(?:(?P<minute>\d+)/)?\d+_(?:%s)_\w+-\w+-\d_(?:(?:app|nlb|net)\.*?)?(?P<src>[a-zA-Z0-9\-]+)`, filenames))

And giving more obvious names to each of the timestamp regexes, since each parses a different format.

Author

On one hand you are right, but on the other it is still hard to read if we put "%s" inside this regex.
If we split more parts of the regex into separate variables and leave many "%s" placeholders inside, it could be a cognitive strain to understand what goes where in the regex. In its current state we can easily copy and paste the regex into an analysis tool like https://regex101.com/
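For what it's worth, the split version can still be checked as a whole; here is a runnable sketch of the suggestion above (the variable names and the sample key are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"regexp"
)

// Sketch of the reviewer's suggestion: keep the alternation of supported log
// types in its own variable and interpolate it into the base filename pattern.
var (
	filenames = `elasticloadbalancing|vpcflowlogs|waflogs`

	fileBaseRegex = regexp.MustCompile(fmt.Sprintf(
		`AWSLogs/(?P<account_id>\d+)/(?P<type>\w+)/(?P<region>[\w-]+)/(?:[\w-]+/)?(?P<year>\d+)/(?P<month>\d+)/(?P<day>\d+)/(?:(?P<hour>\d+)/)?(?:(?P<minute>\d+)/)?\d+_(?:%s)_\w+-\w+-\d_(?:(?:app|nlb|net)\.*?)?(?P<src>[a-zA-Z0-9\-]+)`,
		filenames))
)

func main() {
	// The WAF example key from the comment in the diff.
	key := "aws-waf-logs-test/AWSLogs/11111111111/WAFLogs/us-east-1/TEST-WEBACL/2021/10/28/19/50/11111111111_waflogs_us-east-1_TEST-WEBACL_20211028T1950Z_e0ca43b5.log.gz"
	fmt.Println(fileBaseRegex.MatchString(key)) // prints true
}
```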

Comment on lines 146 to 153
if err == nil {
//https://stackoverflow.com/questions/23929145/how-to-test-if-a-given-time-stamp-is-in-seconds-or-milliseconds
dateNowSecs := time.Now().Unix()
if timeToParseNumber > dateNowSecs {
return time.UnixMilli(timeToParseNumber)
}
return time.Unix(timeToParseNumber, 0)
}
Contributor

we could format this as

if err != nil {
    continue
}
// seconds vs ms check here

is this "seconds or milliseconds" timestamp format specific to WAF logs?

Author

Ok, will do, thanks.

Milliseconds is the format of the timestamp field for AWS WAF logs.
Seconds is the format of the begin/end fields for AWS VPC Flow Logs.

I will add more information about this in the regex comments.
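The seconds-vs-milliseconds check from the diff (via the Stack Overflow link) can be sketched as follows; parseEpoch is a hypothetical helper name, and the heuristic assumes an epoch value larger than "now in seconds" must be in milliseconds:

```go
package main

import (
	"fmt"
	"time"
)

// parseEpoch mirrors the heuristic discussed above: a value greater than the
// current Unix time in seconds is treated as milliseconds (the WAF "timestamp"
// field), otherwise as seconds (the VPC Flow Log begin/end fields).
func parseEpoch(v int64) time.Time {
	if v > time.Now().Unix() {
		return time.UnixMilli(v)
	}
	return time.Unix(v, 0)
}

func main() {
	fmt.Println(parseEpoch(1669842701).Unix())    // seconds input, prints 1669842701
	fmt.Println(parseEpoch(1671624901861).Unix()) // milliseconds input, prints 1671624901
}
```

Note the heuristic is ambiguous only for timestamps billions of seconds in the future, which is fine for log ingestion.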

Comment on lines 171 to 179
if len(match) > 0 {
for i, name := range filenameRegex.SubexpNames() {
if i != 0 && name != "" && match[i] != "" {
labels[name] = match[i]
}
}
labels["type"] = strings.ToLower(labels["type"])
} else {
fmt.Printf("Unknown AWS S3 log filename format: %s\n", labels["key"])
Contributor

again we can simplify this

if len(match) < 1 {
    fmt.Printf("Unknown AWS S3 log filename format: %s\n", labels["key"])
    return labels, nil
}
// do the subexpression checks here

Author

Ok, will do, thanks.
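Putting the early-return suggestion together with the subexpression loop from the diff gives something like this sketch (the regex is shortened here for illustration; the label names follow the diff):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Shortened illustrative version of the filename regex; the real one also
// captures year/month/day/hour/minute and the source name.
var filenameRegex = regexp.MustCompile(`AWSLogs/(?P<account_id>\d+)/(?P<type>\w+)/(?P<region>[\w-]+)/`)

// extractLabels returns early on an unknown key, then copies each named,
// non-empty capture group into the label map.
func extractLabels(key string) map[string]string {
	labels := map[string]string{"key": key}
	match := filenameRegex.FindStringSubmatch(key)
	if len(match) < 1 {
		fmt.Printf("Unknown AWS S3 log filename format: %s\n", key)
		return labels
	}
	for i, name := range filenameRegex.SubexpNames() {
		if i != 0 && name != "" && match[i] != "" {
			labels[name] = match[i]
		}
	}
	labels["type"] = strings.ToLower(labels["type"])
	return labels
}

func main() {
	labels := extractLabels("AWSLogs/11111111111/WAFLogs/us-east-1/TEST-WEBACL/...")
	fmt.Println(labels["type"]) // prints waflogs
}
```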

@cstyan cstyan self-assigned this Jan 27, 2023
Contributor

@cstyan cstyan left a comment


Just one minor comment and then we're good to go!

@@ -37,8 +37,17 @@ var (

// regex that extracts the timestamp from message log
timestampRegexList = []*regexp.Regexp {
// regex that extracts the timestamp from the AWS Loadbalancer message log
// format: timestamp (RFC3339)
// example: h2 2022-12-20T23:55:02.599911Z ...
regexp.MustCompile(`\w+ (?P<timestamp>\d+-\d+-\d+T\d+:\d+:\d+\.\d+Z)`), //h2 2022-12-20T23:55:02.599911Z ...
Contributor

we can remove the comment at the end of this line, and the one on line 51, now that you have the more detailed comments :)

Author

Done

@matmisie-gemini
Author

matmisie-gemini commented Feb 2, 2023

@cstyan Do you have any thoughts on the topic of timestamps for AWS VPC Flow Logs, i.e. the begin/end dates? Should we use these as timestamps for Loki, or use date.now instead (see the discussion between chriskuchin and CCOLLOT in this PR)?

@cstyan
Contributor

cstyan commented Feb 6, 2023

I tend to agree with @chriskuchin re:

Then if you really want the correct timestamp you can use promtail to extract it from the log line and set it. I think this is better than needing yet another MR every time we want to support a new log type from S3, if we push most of the parsing out of the lambda and into promtail.

lambda-promtail is still only a utility, not meant as a full replacement for something like promtail, and as a team we currently don't have much time to dedicate to it beyond reviewing community members' PRs. We can't continue to expand to new features/cloud provider formats endlessly, since the maintenance burden would be too high. I think for your PR you just added parsing/setting of the timestamp for this WAF type because that was the existing behaviour for other types. I think we could keep that logic, but add a new flag AWSKeepTimestamps that defaults to false (in which case we use time.Now() as the timestamp for each line we receive); if true, we attempt to parse and keep the timestamp from the line based on the regex.

@cstyan
Contributor

cstyan commented Feb 22, 2023

@mmisiewicz-g let me know if my last comments were unclear or if there's anything else I can help you with here :)

@MasslessParticle
Contributor

No activity here for months. Closing for inactivity.

Labels
size/L, type/docs

8 participants