KafkaHadopHiveSparkExampleProject

Fetch data, publish it to a Kafka topic, consume it from the topic with Spark, and save it to HDFS and Apache Hive.

Project Steps:

About Data: Get today's Google trending search results.

Features:

- title: Search title

- subtitle: subtitle about search

- source: source of search

- time_published

- searches: Number of searches for this title
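
As a rough illustration of the record shape described by the features above, a single trend could be modeled as shown below. The class name `TrendRecord`, the helper `to_json`, the string field types, and the sample values are assumptions made for this sketch; the project's own code may represent records differently.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TrendRecord:
    """One trending search, using the field names listed above (types are assumed)."""
    title: str
    subtitle: str
    source: str
    time_published: str
    searches: str


def to_json(record: TrendRecord) -> str:
    """Serialize a record to a JSON string, keeping non-ASCII characters readable."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Made-up sample values, not real trend data.
sample = TrendRecord(
    title="example title",
    subtitle="example subtitle",
    source="example source",
    time_published="2h ago",
    searches="20K+",
)
print(to_json(sample))
```

Keeping every field as a string avoids guessing how values such as the search count are formatted in the raw data.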

This project uses Apache Kafka, Apache Spark, Hadoop, Apache Hive, and Python 3.

These technologies must therefore be installed on your system. In brief:

- I installed an Ubuntu virtual machine on my machine. You can use UTM or Oracle VM for this, depending on your system.
- Install python3 on your system.
- Download Kafka from https://kafka.apache.org/downloads. Kafka requires Java, so install Java first if it is not already on your system.
- Install Hadoop from https://hadoop.apache.org/releases.html, then add its environment variables to the .bashrc file on Linux.
- Add your configs in hadoop-env.sh.
- Edit the core-site.xml and hdfs-site.xml files.
- Install Apache Hive from https://hive.apache.org/downloads.html.
- Add its environment variables to the .bashrc file.
- Edit hive-env.sh and hive-site.xml.
- Install Spark from https://spark.apache.org/downloads.html. For Spark, Scala must be installed on the system.
- Add its environment variables to the .bashrc file.
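
After the installation steps above, a quick check that the command-line tools are visible on PATH can save debugging later. The sketch below is only an assumption about what "installed correctly" means here: the tool list `REQUIRED_TOOLS` and the helper `missing_tools` are illustrative names, and the Kafka scripts are left out because the commands in this README run them from the Kafka directory.

```python
import shutil

# Executable names assumed to be on PATH once the .bashrc edits are in place.
REQUIRED_TOOLS = ["java", "python3", "hdfs", "hive", "spark-submit"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

If a tool is reported missing, open a new shell (or re-source .bashrc) before re-checking, since the variables only apply to shells started after the edit.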

Terminal 1 -- Zookeeper: bin/zookeeper-server-start.sh config/zookeeper.properties
Terminal 2 -- Kafka Broker: bin/kafka-server-start.sh config/server.properties
Terminal 3 -- Kafka Config, create a topic: bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic trends
Terminal 3 -- Hadoop: hdfs namenode -format (first run only, since formatting erases existing HDFS metadata), then start-dfs.sh
Terminal 4 -- Hive Metastore: hive --service metastore
Terminal 5 -- Hive: hive
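
Before moving on, it can help to confirm that the services started above are actually listening. The following is a minimal sketch that assumes default ports: 2181 for Zookeeper (the clientPort in the stock config/zookeeper.properties), 9092 for the Kafka broker (as in the topic command above), and 9083 for the Hive metastore. The names `SERVICES` and `port_open` are illustrative, and the ports need adjusting if your configs differ.

```python
import socket

# Assumed default ports; change these if your configuration files differ.
SERVICES = {"Zookeeper": 2181, "Kafka broker": 9092, "Hive metastore": 9083}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if port_open("localhost", port) else "not reachable"
        print(f"{name} (localhost:{port}): {state}")
```

An open port only shows that something is listening there; it does not prove the service is healthy.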

Note: Set up a virtual environment for the Python code and install the packages from requirements.txt into it.

Terminal 6 -- Stream Producer: python3 producer.py
Terminal 7 -- Stream Consumer + Spark Transformer: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 transformer.py
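
The two steps above can be sketched roughly as follows. This is not the repository's actual producer.py or transformer.py: both halves are shown in one listing for brevity, and several things are assumptions. The producer half assumes the kafka-python package, every field is treated as a string, and the HDFS URI `hdfs://localhost:9000/trends` and Hive table name `trends` are hypothetical (the URI depends on what you set in core-site.xml). Function names such as `send_records` and `run_transformer` are illustrative.

```python
import json

TOPIC = "trends"              # topic created in the Kafka step above
BOOTSTRAP = "localhost:9092"  # broker address used in the Kafka step above
FIELDS = ["title", "subtitle", "source", "time_published", "searches"]
HDFS_PATH = "hdfs://localhost:9000/trends"  # hypothetical; match your core-site.xml
HIVE_TABLE = "trends"                       # hypothetical table name


# --- producer side (would live in producer.py) ---

def serialize(record: dict) -> bytes:
    """Encode one record as UTF-8 JSON bytes, the form sent to Kafka."""
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def send_records(records):
    """Publish each record dict to the topic."""
    from kafka import KafkaProducer  # kafka-python, assumed to be installed

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, value_serializer=serialize)
    for record in records:
        producer.send(TOPIC, value=record)
    producer.flush()  # block until buffered messages are delivered
    producer.close()


# --- Spark side (would live in transformer.py) ---

def build_stream(spark):
    """Read the topic as a stream and expand the JSON value into one column per field."""
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType([StructField(name, StringType(), True) for name in FIELDS])
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", BOOTSTRAP)
        .option("subscribe", TOPIC)
        .option("startingOffsets", "earliest")
        .load()
    )
    parsed = raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    return parsed.select("data.*")


def write_batch(batch_df, batch_id):
    """Append one micro-batch to HDFS as Parquet and to the Hive table."""
    batch_df.write.mode("append").parquet(HDFS_PATH)
    batch_df.write.mode("append").saveAsTable(HIVE_TABLE)


def run_transformer():
    """Start the streaming query and block until it is stopped."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("trends-transformer")
        .enableHiveSupport()
        .getOrCreate()
    )
    query = build_stream(spark).writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()
```

Nothing runs on import: call `send_records([...])` from the producer script and `run_transformer()` from the script passed to spark-submit. The `foreachBatch` approach is one way to write the same micro-batch to two sinks; a production job would likely also configure a checkpoint location.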
