Use this Python script to start a standalone Spark session for interacting with the CloudTrail bucket. Spark lets you query the logs using a SQL-like syntax.
The startup script `main.py` automatically loads SSO credentials and sets temporary AWS credentials from the SSO session to authenticate to the bucket.
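As a rough illustration of that credential step (a minimal sketch, not the actual contents of `main.py`; the profile name and function name are assumptions), `boto3` can resolve the SSO-cached credentials and export them as the environment variables that `hadoop-aws` reads by default:

```python
import os

import boto3


def load_sso_credentials(profile: str = "default") -> None:
    # Resolve the SSO-cached credentials for the given profile (assumed name)
    creds = (
        boto3.Session(profile_name=profile)
        .get_credentials()
        .get_frozen_credentials()
    )
    # hadoop-aws picks up these temporary-credential variables by default
    os.environ["AWS_ACCESS_KEY_ID"] = creds.access_key
    os.environ["AWS_SECRET_ACCESS_KEY"] = creds.secret_key
    os.environ["AWS_SESSION_TOKEN"] = creds.token
```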
- Spawn the Poetry shell:

```shell
poetry shell
```
- Start the cluster (a programmatic alternative is sketched after the command):

```shell
PYSPARK_DRIVER_PYTHON=ipython PYTHONSTARTUP=main.py pyspark \
  --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  --driver-memory 15G --executor-memory 5G --name SparkTrail
```
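If you prefer not to use the `pyspark` launcher, an equivalent session can be built programmatically. This is a sketch, not the repo's code; it assumes the same package versions and memory settings as the command above:

```python
from pyspark.sql import SparkSession

# Equivalent configuration to the pyspark command above (sketch)
spark = (
    SparkSession.builder
    .appName("SparkTrail")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,"
        "com.amazonaws:aws-java-sdk-bundle:1.12.262",
    )
    .config("spark.driver.memory", "15g")
    .config("spark.executor.memory", "5g")
    .getOrCreate()
)
```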
- The environment is now configured. Once the IPython shell is up, link the S3 bucket and start running queries:
```python
df = link_s3("audit-cloudtrail-logs/AWSLogs/")  # load the bucket's partitioned JSON logs
df.select("Records.eventName").distinct().show(10)
```
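From there, standard DataFrame operations apply. As a sketch (assuming `Records` is the usual top-level CloudTrail array and `df` is the handle linked above), you can explode it to one row per event and aggregate:

```python
from pyspark.sql import functions as F

# One row per CloudTrail event, then count API calls by name
events = df.select(F.explode("Records").alias("r"))
events.groupBy("r.eventName").count().orderBy(F.desc("count")).show(10)
```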
- You can use this script with any bucket of partitioned JSON files (e.g., Databricks audit logs)
- You may need to adjust the driver and executor memory settings to fit your environment
- You need to run `aws sso login` before starting the session
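To confirm the SSO credentials resolve before launching Spark, a quick STS call works as a sanity check (a sketch; assumes the default profile):

```python
import boto3

# Raises an error if the SSO session is missing or expired
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```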