SparkTrail

Query AWS CloudTrail using Spark (Python) to perform analysis

Use this Python script to start a Spark standalone session and interact with a CloudTrail S3 bucket. Spark lets you query the logs using a SQL-like syntax.

The main.py startup script automatically loads SSO credentials and sets temporary AWS credentials from the SSO session to authenticate to the bucket.
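
For reference, a minimal sketch of what such a credential bootstrap can look like; the actual logic lives in main.py, and the profile name below is a placeholder:

import os
import boto3

# Resolve temporary credentials from an SSO-backed profile and export them as
# environment variables, where the Hadoop s3a connector can pick them up.
# "my-sso-profile" is a placeholder, not the repository's actual configuration.
session = boto3.Session(profile_name="my-sso-profile")
credentials = session.get_credentials().get_frozen_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
os.environ["AWS_SESSION_TOKEN"] = credentials.token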

Usage

  1. Spawn a Poetry shell:

poetry shell

  2. Start the cluster:

PYSPARK_DRIVER_PYTHON=ipython PYTHONSTARTUP=main.py pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 --driver-memory 15G --executor-memory 5G --name SparkTrail

  3. The environment is now configured. Once the IPython shell is up, link the S3 bucket and start running queries:
spark = link_s3("audit-cloudtrail-logs/AWSLogs/")
spark.select("Records.eventName").distinct().show(10)
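
The link_s3 helper is defined in main.py and is not reproduced here; as a rough sketch, assuming it reads the partitioned JSON logs through the s3a connector and returns a DataFrame, it could look like this:

from pyspark.sql import DataFrame, SparkSession

def link_s3(path: str) -> DataFrame:
    # Hypothetical sketch: read the partitioned CloudTrail JSON logs under the
    # given bucket/prefix via s3a; recursiveFileLookup picks up the nested
    # per-account/per-region/per-day partitions.
    session = SparkSession.builder.getOrCreate()
    return session.read.option("recursiveFileLookup", "true").json(f"s3a://{path}")

The returned DataFrame then supports the select/distinct/show calls shown above.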

Notes

  • You can use this script with any bucket that contains partitioned JSON files (e.g., Databricks audit logs); see the SQL example below
  • You may need to adjust the memory settings to fit your environment
  • You need to run aws sso login before starting the cluster
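
Because any such bucket loads as a regular DataFrame, you can also register it as a temporary view and query it with plain SQL instead of the DataFrame API. A small example, reusing the bucket prefix from the Usage section:

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
logs = link_s3("audit-cloudtrail-logs/AWSLogs/")
logs.createOrReplaceTempView("cloudtrail")
session.sql("SELECT DISTINCT Records.eventName FROM cloudtrail").show(10)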
