This project implements a log analytics pipeline that combines several Big Data technologies: Kafka, Spark Streaming, the Parquet file format, and HDFS. The pipeline processes and analyzes log files through a flow that begins with a Kafka producer injecting log lines into a topic, followed by a Kafka consumer that performs transformations using Spark. The transformed data is then written as Parquet files to HDFS.
- `producer.py` - script to read a log file and inject its data into a Kafka topic line by line.

  Arguments:
  - `--bootstrap-servers`: Kafka bootstrap servers. Default is `localhost:9092`.
  - `--topic`: Kafka topic to publish the data to. Default is `kafka_test`.
  - `--file`: Path to the log file to be injected into Kafka. Default is `40MBFile.log`.
  - `--limit`: Limit the number of log records to be injected. Default is `-1`, which injects all lines.
  - `--reset`: Clean up the Kafka topic before producing new messages. Default is `False`.
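The producer's core loop can be sketched as follows. This is a minimal sketch, assuming the `kafka-python` package; the helper name `read_lines` and the overall structure are illustrative, not taken from the actual script:

```python
import argparse


def read_lines(path, limit=-1):
    """Yield up to `limit` lines from the log file (-1 means all lines)."""
    with open(path, "r", errors="replace") as fh:
        for i, line in enumerate(fh):
            if limit != -1 and i >= limit:
                break
            yield line.rstrip("\n")


def main():
    # kafka-python is only needed when actually producing messages.
    from kafka import KafkaProducer

    parser = argparse.ArgumentParser()
    parser.add_argument("--bootstrap-servers", default="localhost:9092")
    parser.add_argument("--topic", default="kafka_test")
    parser.add_argument("--file", default="40MBFile.log")
    parser.add_argument("--limit", type=int, default=-1)
    args = parser.parse_args()

    # Send each log line as one Kafka message.
    producer = KafkaProducer(bootstrap_servers=args.bootstrap_servers)
    for line in read_lines(args.file, args.limit):
        producer.send(args.topic, line.encode("utf-8"))
    producer.flush()


if __name__ == "__main__":
    main()
```

Sending line by line keeps memory use flat even for large input files, since the file is streamed rather than loaded whole.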
- `consumer.py` - script to read and process Kafka streaming data.

  Arguments:
  - `--bootstrap-servers`: Kafka bootstrap servers. Default is `localhost:9092`.
  - `--topic`: Kafka topic to listen to. Default is `kafka_test`.
  - `--reset`: Clean up the previous Spark checkpoint and output before consuming new data. Default is `False`.
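The consumer side can be sketched with Spark Structured Streaming's Kafka source. This is a sketch under stated assumptions: the HDFS output and checkpoint paths are placeholders, and the function names are illustrative rather than taken from the actual script:

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source; start from the earliest offset."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def main():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("log_analytics_consumer").getOrCreate()

    # Kafka delivers each record as binary key/value; cast the value to text.
    opts = kafka_source_options("localhost:9092", "kafka_test")
    raw = (
        spark.readStream.format("kafka").options(**opts).load()
        .select(col("value").cast("string").alias("line"))
    )

    # Write the stream out as Parquet files on HDFS (paths are assumptions).
    query = (
        raw.writeStream
        .format("parquet")
        .option("path", "hdfs://localhost:9000/log_analytics/output")
        .option("checkpointLocation", "hdfs://localhost:9000/log_analytics/checkpoint")
        .start()
    )
    query.awaitTermination()


if __name__ == "__main__":
    main()
```

The checkpoint location is what lets Spark resume from the last committed Kafka offsets after a restart, which is why the `--reset` flag described above also has to clear it.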
- `transformation.py` - contains helper functions for wrangling Spark streaming DataFrames.
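One kind of helper such a module typically contains is a log-line parser. The actual schema of `40MBFile.log` is not documented here, so the pattern below is a hypothetical sketch assuming Apache common log format; in the real module the equivalent logic would likely run on Spark DataFrame columns (e.g. via `regexp_extract`) rather than plain strings:

```python
import re

# Hypothetical pattern: assumes Apache common log format, e.g.
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

Returning `None` for non-matching lines makes it easy to filter malformed records out of the stream before writing Parquet.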
- Spin up HDFS:

  ```shell
  hdfs namenode -format
  $HADOOP_HOME/sbin/start-all.sh
  ```
- Inject all lines from the input file into Kafka:

  ```shell
  python producer.py --topic log_analytics --file 40MBFile.log --limit -1
  ```
- Consume the streaming data from Kafka via Spark Streaming:

  ```shell
  spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 consumer.py --topic log_analytics
  ```
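Once the streaming query has produced output, the Parquet files on HDFS can be inspected with a short batch job. This is a sketch: the NameNode address and output directory are assumptions matching a default single-node setup, and `output_uri` is an illustrative helper, not part of the project:

```python
def output_uri(namenode, directory):
    """Build an HDFS URI for the output directory (illustrative helper)."""
    return "hdfs://{}/{}".format(namenode, directory.strip("/"))


def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inspect_output").getOrCreate()
    # Use whatever path consumer.py actually writes to.
    df = spark.read.parquet(output_uri("localhost:9000", "/log_analytics/output"))
    df.printSchema()
    df.show(10, truncate=False)


if __name__ == "__main__":
    main()
```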