This project implements a log analytics pipeline that combines several Big Data technologies: Kafka, Spark Streaming, the Parquet file format, and HDFS. The pipeline processes and analyzes log files through a flow that begins with a Kafka producer injecting log lines into a topic, followed by a Kafka consumer that performs transformations using Spark. The transformed data is then written as Parquet files to HDFS.
- `producer.py` - script to read a log file and inject its data into a Kafka topic line by line.

  Arguments:
  - `--bootstrap-servers`: Kafka bootstrap servers. Default is `localhost:9092`.
  - `--topic`: Kafka topic to publish the data to. Default is `kafka_test`.
  - `--file`: Path to the log file to be injected into Kafka. Default is `40MBFile.log`.
  - `--limit`: Limit the number of log records to be injected. Default is `-1`, which injects all lines.
  - `--reset`: Clean up the Kafka topic before producing new messages. Default is `False`.
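The producer's core loop can be sketched as follows. This is a minimal sketch, assuming the `kafka-python` package; the helper name `read_lines` and the overall structure are illustrative, not taken from the actual script:

```python
import argparse


def read_lines(path, limit=-1):
    """Yield up to `limit` lines from the log file (-1 means all lines)."""
    with open(path, "r", errors="replace") as fh:
        for i, line in enumerate(fh):
            if limit != -1 and i >= limit:
                break
            yield line.rstrip("\n")


def main():
    # kafka-python is only needed when actually producing messages.
    from kafka import KafkaProducer

    parser = argparse.ArgumentParser()
    parser.add_argument("--bootstrap-servers", default="localhost:9092")
    parser.add_argument("--topic", default="kafka_test")
    parser.add_argument("--file", default="40MBFile.log")
    parser.add_argument("--limit", type=int, default=-1)
    args = parser.parse_args()

    # Send each log line as one Kafka message.
    producer = KafkaProducer(bootstrap_servers=args.bootstrap_servers)
    for line in read_lines(args.file, args.limit):
        producer.send(args.topic, line.encode("utf-8"))
    producer.flush()


if __name__ == "__main__":
    main()
```

Sending line by line keeps memory use flat even for large input files, since the file is streamed rather than loaded whole.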
- `consumer.py` - script to read and process Kafka streaming data.

  Arguments:
  - `--bootstrap-servers`: Kafka bootstrap servers. Default is `localhost:9092`.
  - `--topic`: Kafka topic to listen to. Default is `kafka_test`.
  - `--reset`: Clean up the previous Spark checkpoint and output before consuming new data. Default is `False`.
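The consumer side can be sketched with Spark Structured Streaming's Kafka source. This is a sketch under stated assumptions: the HDFS output and checkpoint paths are placeholders, and the function names are illustrative rather than taken from the actual script:

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source; start from the earliest offset."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def main():
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("log_analytics_consumer").getOrCreate()

    # Kafka delivers each record as binary key/value; cast the value to text.
    opts = kafka_source_options("localhost:9092", "kafka_test")
    raw = (
        spark.readStream.format("kafka").options(**opts).load()
        .select(col("value").cast("string").alias("line"))
    )

    # Write the stream out as Parquet files on HDFS (paths are assumptions).
    query = (
        raw.writeStream
        .format("parquet")
        .option("path", "hdfs://localhost:9000/log_analytics/output")
        .option("checkpointLocation", "hdfs://localhost:9000/log_analytics/checkpoint")
        .start()
    )
    query.awaitTermination()


if __name__ == "__main__":
    main()
```

The checkpoint location is what lets Spark resume from the last committed Kafka offsets after a restart, which is why the `--reset` flag described above also has to clear it.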
- `transformation.py` - contains helper functions for wrangling Spark streaming DataFrames.
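One kind of helper such a module typically contains is a log-line parser. The actual schema of `40MBFile.log` is not documented here, so the pattern below is a hypothetical sketch assuming Apache common log format; in the real module the equivalent logic would likely run on Spark DataFrame columns (e.g. via `regexp_extract`) rather than plain strings:

```python
import re

# Hypothetical pattern: assumes Apache common log format, e.g.
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

Returning `None` for non-matching lines makes it easy to filter malformed records out of the stream before writing Parquet.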
- Spin up HDFS:

  ```shell
  hdfs namenode -format
  $HADOOP_HOME/sbin/start-all.sh
  ```
- Inject all lines from the input file into Kafka:

  ```shell
  python producer.py --topic log_analytics --file 40MBFile.log --limit -1
  ```
- Consume the streaming data from Kafka via Spark Streaming:

  ```shell
  spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 consumer.py --topic log_analytics
  ```
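Once the streaming query has produced output, the Parquet files on HDFS can be inspected with a short batch job. This is a sketch: the NameNode address and output directory are assumptions matching a default single-node setup, and `output_uri` is an illustrative helper, not part of the project:

```python
def output_uri(namenode, directory):
    """Build an HDFS URI for the output directory (illustrative helper)."""
    return "hdfs://{}/{}".format(namenode, directory.strip("/"))


def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inspect_output").getOrCreate()
    # Use whatever path consumer.py actually writes to.
    df = spark.read.parquet(output_uri("localhost:9000", "/log_analytics/output"))
    df.printSchema()
    df.show(10, truncate=False)


if __name__ == "__main__":
    main()
```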