diff --git a/README.md b/README.md index fda2068..eea1673 100644 --- a/README.md +++ b/README.md @@ -3,20 +3,20 @@ Step-by-step Deep Learning Tutorials on Apache Spark using [BigDL](https://github.com/intel-analytics/BigDL/). The tutorials are inspired by [Apache Spark examples](http://spark.apache.org/examples.html), the [Theano Tutorials](https://github.com/Newmu/Theano-Tutorials) and the [Tensorflow tutorials](https://github.com/nlintz/TensorFlow-Tutorials). ### Topics -1. [RDD](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/spark_basics/RDD.ipynb) -2. [DataFrame](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/spark_basics/DataFrame.ipynb) -3. [SparkSQL](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/spark_basics/spark_sql.ipynb) -4. [StructureStreaming](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/spark_basics/structured_streaming.ipynb) -5. [Forward and backward](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/forward_and_backward.ipynb) -6. [Linear Regression](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/linear_regression.ipynb) -7. [Introduction to MNIST](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/introduction_to_mnist.ipynb) -8. [Logistic Regression](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/logistic_regression.ipynb) -9. [Feedforward Neural Network](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/deep_feed_forward_neural_network.ipynb) -10. [Convolutional Neural Network](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/cnn.ipynb) -11. [Recurrent Neural Network](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/rnn.ipynb) -12. 
[LSTM](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/lstm.ipynb) -13. [Bi-directional RNN](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/birnn.ipynb) -14. [Auto-encoder](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/autoencoder.ipynb) +1. RDD [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/spark_basics/RDD.ipynb)] +2. DataFrame [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/spark_basics/DataFrame.ipynb)] +3. SparkSQL [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/spark_basics/spark_sql.ipynb)] +4. StructureStreaming [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/spark_basics/structured_streaming.ipynb)] +5. Forward and backward [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/forward_and_backward.ipynb)] +6. Linear Regression [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/linear_regression.ipynb) | [Scala](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/scala/neural_networks/linear_regression.ipynb)] +7. Introduction to MNIST [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/introduction_to_mnist.ipynb) | [Scala](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/scala/neural_networks/introduction_to_mnist.ipynb)] +8. Logistic Regression [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/logistic_regression.ipynb) | [Scala](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/scala/neural_networks/logistic_regression.ipynb)] +9. 
Feedforward Neural Network [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/deep_feed_forward_neural_network.ipynb)] +10. Convolutional Neural Network [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/cnn.ipynb)] +11. Recurrent Neural Network [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/rnn.ipynb)] +12. LSTM [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/lstm.ipynb)] +13. Bi-directional RNN [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/birnn.ipynb)] +14. Auto-encoder [[Python](https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/python/neural_networks/autoencoder.ipynb)] ### Environment + Python 2.7 diff --git a/notebooks/neural_networks/autoencoder.ipynb b/notebooks/python/neural_networks/autoencoder.ipynb similarity index 100% rename from notebooks/neural_networks/autoencoder.ipynb rename to notebooks/python/neural_networks/autoencoder.ipynb diff --git a/notebooks/neural_networks/birnn.ipynb b/notebooks/python/neural_networks/birnn.ipynb similarity index 100% rename from notebooks/neural_networks/birnn.ipynb rename to notebooks/python/neural_networks/birnn.ipynb diff --git a/notebooks/neural_networks/cnn.ipynb b/notebooks/python/neural_networks/cnn.ipynb similarity index 100% rename from notebooks/neural_networks/cnn.ipynb rename to notebooks/python/neural_networks/cnn.ipynb diff --git a/notebooks/neural_networks/deep_feed_forward_neural_network.ipynb b/notebooks/python/neural_networks/deep_feed_forward_neural_network.ipynb similarity index 100% rename from notebooks/neural_networks/deep_feed_forward_neural_network.ipynb rename to notebooks/python/neural_networks/deep_feed_forward_neural_network.ipynb diff --git 
a/notebooks/neural_networks/forward_and_backward.ipynb b/notebooks/python/neural_networks/forward_and_backward.ipynb similarity index 100% rename from notebooks/neural_networks/forward_and_backward.ipynb rename to notebooks/python/neural_networks/forward_and_backward.ipynb diff --git a/notebooks/neural_networks/introduction_to_mnist.ipynb b/notebooks/python/neural_networks/introduction_to_mnist.ipynb similarity index 100% rename from notebooks/neural_networks/introduction_to_mnist.ipynb rename to notebooks/python/neural_networks/introduction_to_mnist.ipynb diff --git a/notebooks/neural_networks/linear_regression.ipynb b/notebooks/python/neural_networks/linear_regression.ipynb similarity index 100% rename from notebooks/neural_networks/linear_regression.ipynb rename to notebooks/python/neural_networks/linear_regression.ipynb diff --git a/notebooks/neural_networks/logistic_regression.ipynb b/notebooks/python/neural_networks/logistic_regression.ipynb similarity index 100% rename from notebooks/neural_networks/logistic_regression.ipynb rename to notebooks/python/neural_networks/logistic_regression.ipynb diff --git a/notebooks/neural_networks/lstm.ipynb b/notebooks/python/neural_networks/lstm.ipynb similarity index 100% rename from notebooks/neural_networks/lstm.ipynb rename to notebooks/python/neural_networks/lstm.ipynb diff --git a/notebooks/neural_networks/rnn.ipynb b/notebooks/python/neural_networks/rnn.ipynb similarity index 100% rename from notebooks/neural_networks/rnn.ipynb rename to notebooks/python/neural_networks/rnn.ipynb diff --git a/notebooks/neural_networks/tutorial_images/Bi-directional_RNN/Bi-directional_RNN.jpg b/notebooks/python/neural_networks/tutorial_images/Bi-directional_RNN/Bi-directional_RNN.jpg similarity index 100% rename from notebooks/neural_networks/tutorial_images/Bi-directional_RNN/Bi-directional_RNN.jpg rename to notebooks/python/neural_networks/tutorial_images/Bi-directional_RNN/Bi-directional_RNN.jpg diff --git 
a/notebooks/neural_networks/tutorial_images/autoencoder/autoencoder_schema.jpg b/notebooks/python/neural_networks/tutorial_images/autoencoder/autoencoder_schema.jpg similarity index 100% rename from notebooks/neural_networks/tutorial_images/autoencoder/autoencoder_schema.jpg rename to notebooks/python/neural_networks/tutorial_images/autoencoder/autoencoder_schema.jpg diff --git a/notebooks/neural_networks/tutorial_images/deep_feed_forward_NN/feedforwardNN_structure.png b/notebooks/python/neural_networks/tutorial_images/deep_feed_forward_NN/feedforwardNN_structure.png similarity index 100% rename from notebooks/neural_networks/tutorial_images/deep_feed_forward_NN/feedforwardNN_structure.png rename to notebooks/python/neural_networks/tutorial_images/deep_feed_forward_NN/feedforwardNN_structure.png diff --git a/notebooks/neural_networks/utils.py b/notebooks/python/neural_networks/utils.py similarity index 100% rename from notebooks/neural_networks/utils.py rename to notebooks/python/neural_networks/utils.py diff --git a/notebooks/spark_basics/DataFrame.ipynb b/notebooks/python/spark_basics/DataFrame.ipynb similarity index 100% rename from notebooks/spark_basics/DataFrame.ipynb rename to notebooks/python/spark_basics/DataFrame.ipynb diff --git a/notebooks/spark_basics/RDD.ipynb b/notebooks/python/spark_basics/RDD.ipynb similarity index 100% rename from notebooks/spark_basics/RDD.ipynb rename to notebooks/python/spark_basics/RDD.ipynb diff --git a/notebooks/spark_basics/spark_sql.ipynb b/notebooks/python/spark_basics/spark_sql.ipynb similarity index 100% rename from notebooks/spark_basics/spark_sql.ipynb rename to notebooks/python/spark_basics/spark_sql.ipynb diff --git a/notebooks/spark_basics/structured_streaming.ipynb b/notebooks/python/spark_basics/structured_streaming.ipynb similarity index 100% rename from notebooks/spark_basics/structured_streaming.ipynb rename to notebooks/python/spark_basics/structured_streaming.ipynb diff --git 
a/notebooks/scala/neural_networks/introduction_to_mnist.ipynb b/notebooks/scala/neural_networks/introduction_to_mnist.ipynb new file mode 100644 index 0000000..98a2a1d --- /dev/null +++ b/notebooks/scala/neural_networks/introduction_to_mnist.ipynb @@ -0,0 +1,201 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction to the MNIST database" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the following tutorials, we are going to use the MNIST database of handwritten digits. MNIST is a simple computer vision dataset of handwritten digits. It has 60,000 training examples and 10,000 test examples. \"It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.\" For more details about this database, please check out the website [MNIST](http://yann.lecun.com/exdb/mnist/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In BigDL, when using Scala, we need to write a function to read the MNIST data files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import java.nio.ByteBuffer\n", + "import java.nio.file.{Files, Path, Paths}\n", + "\n", + "import com.intel.analytics.bigdl.dataset.ByteRecord\n", + "import com.intel.analytics.bigdl.utils.File\n", + "import scopt.OptionParser\n", + "\n", + "def load(featureFile: String, labelFile: String): Array[ByteRecord] = {\n", + " val featureBuffer = ByteBuffer.wrap(Files.readAllBytes(Paths.get(featureFile)))\n", + " val labelBuffer = ByteBuffer.wrap(Files.readAllBytes(Paths.get(labelFile)))\n", + " \n", + " val labelMagicNumber = labelBuffer.getInt()\n", + " require(labelMagicNumber == 2049)\n", + " val featureMagicNumber = featureBuffer.getInt()\n", + " require(featureMagicNumber == 2051)\n", + "\n", + " val labelCount = labelBuffer.getInt()\n", + " val featureCount = featureBuffer.getInt()\n", + " require(labelCount == featureCount)\n", + "\n", + " val rowNum = featureBuffer.getInt()\n", + " val colNum = featureBuffer.getInt()\n", + "\n", + " val result = new Array[ByteRecord](featureCount)\n", + " var i = 0\n", + " while (i < featureCount) {\n", + " val img = new Array[Byte]((rowNum * colNum))\n", + " var y = 0\n", + " while (y < rowNum) {\n", + " var x = 0\n", + " while (x < colNum) {\n", + " img(x + y * colNum) = featureBuffer.get()\n", + " x += 1\n", + " }\n", + " y += 1\n", + " }\n", + " result(i) = ByteRecord(img, labelBuffer.get().toFloat + 1.0f)\n", + " i += 1\n", + " }\n", + "\n", + " result\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we need to import the necessary packages and initialize the engine." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import org.apache.log4j.{Level, Logger}\n", + "import org.apache.spark.SparkContext\n", + "\n", + "import com.intel.analytics.bigdl.utils._\n", + "import com.intel.analytics.bigdl.dataset.DataSet\n", + "import com.intel.analytics.bigdl.dataset.image.{BytesToGreyImg, GreyImgNormalizer, GreyImgToBatch, GreyImgToSample}\n", + "import com.intel.analytics.bigdl.nn.{ClassNLLCriterion, Module}\n", + "import com.intel.analytics.bigdl.models.lenet.Utils._\n", + "import com.intel.analytics.bigdl.nn.{ClassNLLCriterion, Linear, LogSoftMax, Sequential, Reshape}\n", + "import com.intel.analytics.bigdl.numeric.NumericFloat\n", + "import com.intel.analytics.bigdl.optim.{SGD, Top1Accuracy}\n", + "import com.intel.analytics.bigdl.utils.{Engine, LoggerFilter, T, Table}\n", + "import com.intel.analytics.bigdl.tensor.Tensor\n", + "\n", + "Engine.init" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, the paths of training data and validation data should be set." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "val trainData = \"../datasets/mnist/train-images-idx3-ubyte\"\n", + "val trainLabel = \"../datasets/mnist/train-labels-idx1-ubyte\"\n", + "val validationData = \"../datasets/mnist/t10k-images-idx3-ubyte\"\n", + "val validationLabel = \"../datasets/mnist/t10k-labels-idx1-ubyte\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "Then, we need to define some parameters for loading the MNIST data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "//Parameters\n", + "val batchSize = 2048\n", + "val learningRate = 0.2\n", + "val maxEpochs = 15\n", + "\n", + "//Network Parameters\n", + "val nInput = 784 //MNIST data input (img shape: 28*28)\n", + "val nClasses = 10 //MNIST total classes (0-9 digits)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can use the predefined function to load and serialize the MNIST data. If you want to output the data, some modifications to the function are needed." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "val trainSet = \n", + " DataSet.array(load(trainData, trainLabel), sc) -> BytesToGreyImg(28, 28) -> GreyImgNormalizer(trainMean, trainStd) -> GreyImgToBatch(batchSize)\n", + "val validationSet = \n", + " DataSet.array(load(validationData, validationLabel), sc) -> BytesToGreyImg(28, 28) -> GreyImgNormalizer(testMean, testStd) -> GreyImgToBatch(batchSize)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "sc.stop()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Apache Toree - Scala", + "language": "scala", + "name": "apache_toree_scala" + }, + "language_info": { + "file_extension": ".scala", + "name": "scala", + "version": "2.11.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/start_toree.sh b/start_toree.sh deleted file mode 100755 index 5f3c031..0000000 --- a/start_toree.sh +++ /dev/null @@ -1,43 +0,0 @@ -#!/bin/bash - -# Check environment variables -if [ -z "${BIGDL_HOME}" ]; then - echo "Please set BIGDL_HOME environment variable" - exit 1 -fi - -if [ -z "${SPARK_HOME}" ]; then - echo "Please set SPARK_HOME environment variable" - exit 1 -fi - -#setup pathes -export PYSPARK_DRIVER_PYTHON=jupyter -export 
PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=./ --ip=* --no-browser --NotebookApp.token=''" -export BIGDL_JAR_NAME=`ls ${BIGDL_HOME}/lib/ | grep jar-with-dependencies.jar` -export BIGDL_JAR="${BIGDL_HOME}/lib/$BIGDL_JAR_NAME" -export BIGDL_PY_ZIP_NAME=`ls ${BIGDL_HOME}/lib/ | grep python-api.zip` -export BIGDL_PY_ZIP="${BIGDL_HOME}/lib/$BIGDL_PY_ZIP_NAME" -export BIGDL_CONF=${BIGDL_HOME}/conf/spark-bigdl.conf - -# Check files -if [ ! -f ${BIGDL_CONF} ]; then - echo "Cannot find ${BIGDL_CONF}" - exit 1 -fi - -if [ ! -f ${BIGDL_PY_ZIP} ]; then - echo "Cannot find ${BIGDL_PY_ZIP}" - exit 1 -fi - -if [ ! -f $BIGDL_JAR ]; then - echo "Cannot find $BIGDL_JAR" - exit 1 -fi - -export SPARK_OPTS="--master local[4] --driver-memory 4g --properties-file ${BIGDL_CONF} --jars ${BIGDL_JAR} --conf spark.driver.extraClassPath=${BIGDL_JAR} --conf spark.executor.extraClassPath=${BIGDL_JAR} --conf spark.sql.catalogImplementation='in-memory'" - -echo 'Install toree to jupyter, this may need root privilege' -sudo jupyter toree install --spark_home=${SPARK_HOME} --spark_opts='${SPARK_OPTS}' -jupyter notebook --notebook-dir=./ --ip=* --no-browser --NotebookApp.token=''