This repo contains a complete Hadoop docker-compose environment with containers for HDFS, Hue, Spark, Jupyter Notebooks, etc. It was forked from https://github.com/zar3bski/hadoop-sandbox and the following components were added:
- A PostgreSQL database backend for Hue, so that HiveQL queries can be executed in the Hue UI
- A Spark cluster with one master and two worker nodes
- A Jupyter Notebook server to write code that is executed on the Spark cluster
- Example Spark applications that connect to HDFS and Hive (see `jupyter-spark/work`)
- A MongoDB instance along with Mongo Express
You'll need a Docker engine and docker-compose.
- Clone this repo
- Add an `.env` file at the root directory that contains the following info:
```
CLUSTER_NAME=the_name_of_your_cluster
ADMIN_NAME=your_name
ADMIN_PASSWORD=secret
INSTALL_PYTHON=true # whether you want python or not (to run hadoop streaming)
INSTALL_SQOOP=true
```
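As a concrete example, the `.env` file can be created from the shell like this (the values below are placeholders — substitute your own cluster name and credentials):

```shell
# Write a minimal .env file at the repo root (placeholder values -- use your own)
cat > .env <<'EOF'
CLUSTER_NAME=sandbox-cluster
ADMIN_NAME=admin
ADMIN_PASSWORD=change_me
INSTALL_PYTHON=true
INSTALL_SQOOP=true
EOF

# Sanity check: every variable the setup expects is present
for var in CLUSTER_NAME ADMIN_NAME ADMIN_PASSWORD INSTALL_PYTHON INSTALL_SQOOP; do
  grep -q "^${var}=" .env && echo "${var} ok"
done
```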
- Install and start all services with `docker-compose up --build -d`
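Once the services are up, it can be handy to verify that the key containers are actually running. A sketch of such a check, written to a helper script (only `namenode` and `jupyter-spark` are container names confirmed by this README — extend the list from your `docker-compose.yml`):

```shell
# Write a small post-startup check script (container names beyond namenode
# and jupyter-spark are assumptions -- adjust to your docker-compose.yml)
cat > check_services.sh <<'EOF'
#!/bin/sh
for c in namenode jupyter-spark; do
  if [ "$(docker inspect -f '{{.State.Running}}' "$c" 2>/dev/null)" = "true" ]; then
    echo "$c is running"
  else
    echo "$c is NOT running" >&2
  fi
done
EOF
chmod +x check_services.sh

# Validate the script's syntax without executing it
sh -n check_services.sh && echo "check script OK"
```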
- Hadoop streaming: `/opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar`
- YARN resource manager
- Hue
- Namenode overview
- Spark master
- Jupyter Notebook server. To see which token must be entered, execute `docker exec jupyter-spark jupyter notebook list`
Most sources were gathered from big-data-europe's repos, a complete list of HDFS commands, and a Udemy Hadoop course.
Go into the namenode container and download some data:
```shell
# Go into the namenode container
docker exec -it namenode bash
# Update the package list
apt-get update
# Install some software utilities
apt-get install wget unzip
# Download some data into the hadoop-data directory
cd /hadoop-data
wget "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
# Extract the zip file and remove the archive
unzip ml-100k.zip
rm ml-100k.zip
# Create a directory in HDFS and print out where it is located
hadoop fs -mkdir -p playground # The -p is important! It creates intermediate dirs on the fly
hadoop fs -find / -name "playground" # Yields /user/root/playground
# Copy the data into HDFS and verify it worked
hadoop fs -copyFromLocal ml-100k playground/
hadoop fs -ls playground/ml-100k
```
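With the data in HDFS, the hadoop-streaming jar listed above can run simple map/reduce jobs. Here is a sketch that counts how often each rating value occurs in `ml-100k/u.data` (which is tab-separated: user, item, rating, timestamp). The mapper/reducer pipeline can be dry-run locally with a pipe before submitting it to the cluster; the final `hadoop jar` command is shown as a comment because it must run inside the namenode container:

```shell
# Mapper: emit "<rating>\t1" for every line of u.data
cat > mapper.py <<'EOF'
import sys
for line in sys.stdin:
    parts = line.strip().split('\t')
    if len(parts) == 4:
        print(parts[2] + '\t1')
EOF

# Reducer: sum the counts per rating (streaming input arrives sorted by key)
cat > reducer.py <<'EOF'
import sys
current, count = None, 0
for line in sys.stdin:
    key, val = line.strip().split('\t')
    if key != current:
        if current is not None:
            print(current + '\t' + str(count))
        current, count = key, 0
    count += int(val)
if current is not None:
    print(current + '\t' + str(count))
EOF

# Dry-run the pipeline locally on a tiny sample before submitting to Hadoop
printf '1\t50\t5\t88\n2\t50\t3\t99\n3\t60\t5\t77\n' \
  | python3 mapper.py | sort | python3 reducer.py
# Prints:
# 3	1
# 5	2

# Inside the namenode container, the same job would be submitted roughly
# like this (jar path as listed in this README):
# hadoop jar /opt/hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar \
#   -files mapper.py,reducer.py \
#   -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
#   -input playground/ml-100k/u.data -output playground/rating_counts
```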
Now, browse the HDFS file system from the namenode UI and convince yourself that the data is really there!
To load some data from HDFS into Hive, open the Hue UI, open a new HiveQL query console and execute the commands shown in `hue/queries/load_ratings_into_hive.sql` or `hue/queries/load_names_into_hive.sql`. Alternatively, navigate to the Hive explorer in the Hue UI, click the Import button and follow the on-screen instructions.
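The repo's actual queries live in `hue/queries/`; as a rough idea of what such a load looks like, here is a HiveQL sketch for `u.data`, written to a local reference file (the table and column names are illustrative guesses, not the repo's):

```shell
# HiveQL sketch for loading MovieLens ratings (illustrative only -- the real
# queries are in hue/queries/; paste the SQL into the Hue query console)
cat > load_ratings_sketch.sql <<'EOF'
-- u.data is tab-separated: user_id, movie_id, rating, epoch_seconds
CREATE TABLE IF NOT EXISTS ratings (
  user_id INT,
  movie_id INT,
  rating INT,
  rated_at BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Path matches the HDFS location created in the namenode steps above
LOAD DATA INPATH '/user/root/playground/ml-100k/u.data' INTO TABLE ratings;
EOF
echo "wrote load_ratings_sketch.sql"
```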
For sample Spark applications that run on the Spark cluster and connect to HDFS and Hive, check out the notebooks in `jupyter-spark/work/assignments`.