Docker-Compose Environment for Big Data R & D

Overview

Stack:

  • Spark/Spark-Connect/PySpark
  • Hadoop/HDFS
  • JupyterLab

Containers:

  • spark-master
  • spark-worker-1
  • hadoop-namenode
  • hadoop-datanode
  • hadoop-resourcemanager
  • hadoop-nodemanager-1
  • hadoop-historyserver
  • jupyterlab

Server Requirements:

  • docker

Local Requirements:

  • java-jdk

Tested Host OS:

  • Ubuntu Server 22.04.2 LTS

Coming Soon:

  • JupyterHub integration
  • LocalStack S3 integration

Description

This project provides a Docker Compose environment that serves JupyterLab, a scalable Spark cluster, and a scalable Hadoop file system. The JupyterLab instance has direct access to the Spark cluster and Hadoop resources. Spark and Hadoop can also be accessed and used from a remote machine via Spark-Connect. This is useful for scalable data analytics, machine learning development, and gaining hands-on experience with a personal data stack.
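
As a sketch of that remote workflow (assuming a client with pyspark 3.4+ and the Spark-Connect extras installed, and the default port listed under Access Interfaces below):

    # Minimal Spark-Connect client sketch. <SERVER_IP_ADDR> is the address of
    # the machine running the stack; 15002 is the port from Access Interfaces.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://<SERVER_IP_ADDR>:15002").getOrCreate()
    df = spark.range(5)    # the work runs on the remote cluster
    print(df.count())      # expect: 5
    spark.stop()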

Get Running

  1. Clone this repo
    git clone git@github.com:indierambler/data-environment
  2. Move into the new project directory
    cd data-environment
  3. Create and update the .env file
    mv sample-env.md .env
    Make sure to update the values inside the new .env file (nano .env)
    • set subdomain values only if connecting to a reverse proxy
    • set directories to local locations where container data can be stored
    • set IP addresses to the local server's network address (ports need no change)
    • allocate cores and memory based on what is available in your system
  4. Add execute permissions to build scripts
    chmod +x build/build.sh build/*/build.sh
  5. Build the Docker base images
    build/build.sh
  6. Launch the containers
    docker compose up -d
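
Once the containers are up, a quick smoke test from a notebook inside the JupyterLab instance confirms the Spark cluster is reachable. A minimal sketch, assuming the spark-master container name above and Spark's default standalone port 7077 (check docker-compose.yaml if these differ):

    # Run in a JupyterLab notebook served by this environment.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://spark-master:7077")  # assumed container name + default port
        .appName("smoke-test")
        .getOrCreate()
    )
    print(spark.range(10).count())  # expect: 10
    spark.stop()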

Update (via git)

  1. Move into data-environment directory
    cd path/to/data-environment
  2. Make sure data-environment is shut down
    docker compose down
  3. Pull any existing Git changes
    git pull
  4. Start data-environment back up
    docker compose up -d

Access Interfaces

From the local machine (the server running the data-environment), use localhost as the address. From a remote machine (a client with SSH access to the server running the data-environment), use the server's IP address:

  • JupyterLab web UI: http://<SERVER_IP_ADDR>:8888
  • Hadoop web UI: http://<SERVER_IP_ADDR>:9870
  • Spark master web UI: http://<SERVER_IP_ADDR>:8080
  • Spark-Connect URL: sc://<SERVER_IP_ADDR>:15002

Spark and HDFS Interactions

A Jupyter Notebook for learning and testing basic interactions with Spark and HDFS can be found in the project folder at data-environment/demo/spark-demo.ipynb. This file can be used on a remote machine with SSH access to the Spark/Hadoop server or uploaded directly to the JupyterLab instance for testing.
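
For a taste of what the notebook covers, basic HDFS I/O through Spark only needs an hdfs:// path. A minimal sketch, assuming a Spark-Connect session as above and that the namenode answers at hadoop-namenode:9000 (match the host and port to fs.defaultFS in your Hadoop config):

    # Write a small DataFrame to HDFS and read it back.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://<SERVER_IP_ADDR>:15002").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.write.mode("overwrite").parquet("hdfs://hadoop-namenode:9000/tmp/demo")
    spark.read.parquet("hdfs://hadoop-namenode:9000/tmp/demo").show()
    spark.stop()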

Using With Nginx Reverse Proxy

This project is set up to work with the nginx-proxy/nginx-proxy project. Keep the following in mind when connecting to the reverse proxy:

  • Make sure the reverse proxy and data-environment docker-compose.yaml files put all services on the same Docker network
  • Make sure the subdomain values in the .env file are set correctly
