Gathers data science and machine learning problem solving using PySpark and Hadoop.
compose
notebooks
screenshot
.gitignore
Dockerfile
Dockerfile-cluster
LICENSE
README.md
cluster.sh
core-site.xml
docker-compose-cluster.yml
hdfs-site.xml
mapred-site.xml
ssh_config
start-all.sh
supervisord.conf
yarn-site.xml

README.md

Pyspark-ML

A collection of data science and machine learning problems solved with PySpark and Hadoop.

Covered

  1. Test Pyspark
  2. Text classification IMDB dataset using logistic regression
  3. Text classification IMDB dataset using multinomial
  4. Topic Modelling TFIDF + LDA
  5. Word Vector
  6. Read Iris csv from Hadoop DFS
  7. PCA on Iris dataset
  8. MNIST feed-forward sparkflow
  9. MNIST CNN sparkflow
  10. MNIST RNN-LSTM sparkflow
  11. Fashion-MNIST Inception v1 sparkflow

How-to Notebook

  1. Run Docker Compose:

compose/build

Or, to run in cluster mode:

docker-compose -f docker-compose-cluster.yml up --build --remove-orphans

  2. Visit localhost:8089 to open the Jupyter notebook, which requires no password.

How-to Hadoop

Check Hadoop health, localhost:9870

Hadoop DFS Web UI, localhost:9870/explorer.html#/

Hadoop Node Manager, localhost:8042/node
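
Before opening these pages in a browser, it can help to confirm that the ports are accepting connections at all. The following is a minimal sketch using only the Python standard library. The port numbers come from this README; the helper names `is_port_open` and `check_services` are hypothetical and not part of this repository. An open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports as listed in this README; the labels are descriptive only.
HADOOP_SERVICES = {
    "Hadoop health / DFS web UI": 9870,
    "Hadoop Node Manager": 8042,
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host="localhost", services=HADOOP_SERVICES):
    """Return a dict mapping each service label to True (open) or False."""
    return {name: is_port_open(host, port) for name, port in services.items()}
```

Calling `check_services()` on the machine running the containers returns one boolean per service, so a `False` entry points at the container to inspect first.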

How-to Spark-cluster

If cluster mode started successfully, the logs show each worker registering with the master:

slave_2   | 2018-11-18 07:57:59 INFO  Worker:54 - Successfully registered with master spark://192.168.128.2:7077
slave_1   | 2018-11-18 07:58:10 INFO  Worker:54 - Successfully registered with master spark://192.168.128.2:7077
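
Registration lines like the two above can also be checked from a script, for example on captured `docker-compose` output. This is a rough sketch that assumes the log format is exactly as shown here; the function name `registered_workers` is hypothetical, and other Spark versions may word the message differently.

```python
import re

# Matches lines of the form shown in this README, e.g.:
# slave_2   | 2018-11-18 07:57:59 INFO  Worker:54 - Successfully registered with master spark://192.168.128.2:7077
REGISTERED = re.compile(
    r"^(?P<container>\S+)\s*\|\s*"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+INFO\s+Worker:\d+ - "
    r"Successfully registered with master (?P<master>spark://\S+)"
)


def registered_workers(log_text):
    """Map each container name to the master URL it registered with."""
    workers = {}
    for line in log_text.splitlines():
        match = REGISTERED.match(line.strip())
        if match:
            workers[match.group("container")] = match.group("master")
    return workers
```

If the returned dict has fewer entries than the number of worker containers in `docker-compose-cluster.yml`, at least one worker has not registered yet.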

Check Spark health, localhost:8080
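
The same check can be scripted. The sketch below assumes the Spark standalone master web UI serves a JSON summary at `/json/` containing a `status` field and a `workers` list whose entries carry a `state` field. That endpoint and those field names are assumptions to verify against the Spark version in this image; `fetch_master_state` and `summarize_master` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_master_state(url="http://localhost:8080/json/", timeout=5):
    """Download and parse the master's JSON summary (assumed endpoint)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_master(state):
    """Reduce the master state dict to its status and worker counts."""
    workers = state.get("workers", [])
    # "ALIVE" is the assumed state value for a healthy, registered worker.
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    return {
        "status": state.get("status"),
        "alive_workers": len(alive),
        "total_workers": len(workers),
    }
```

With the cluster running, `summarize_master(fetch_master_state())` should report as many alive workers as there are worker containers.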
