This example shows how to use BigDL to train a model on the Stanford Sentiment Treebank dataset using a binary TreeLSTM and GloVe word embedding vectors. Tree-LSTM is a kind of recursive neural network, described in the paper Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks by Kai Sheng Tai, Richard Socher, and Christopher Manning.
The dataset is a corpus of ~10K one-sentence movie reviews from Rotten Tomatoes. Each sentence has been parsed into a constituency-based parse tree, a kind of binary tree with a word at each leaf. After pre-processing, every node is tagged with a label ranging from -2 to 2, representing the sentiment of the word or phrase: the values -2 to 2 correspond to highly negative, moderately negative, neutral, moderately positive, and highly positive, respectively. The root of the tree represents the sentiment of the entire sentence.
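For illustration, a labeled tree in this scheme might look like the following (a made-up example, not taken from the corpus; each node carries its sentiment label, and words sit at the leaves):
(-1 (0 not) (2 (1 very) (2 good)))
Here the phrase "very good" is labeled positive, but the negation flips the sentence to moderately negative at the root.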
First, run the following script:
python fetch_and_preprocess.py
The treebank dataset and the GloVe word embedding vectors will be downloaded to the
/tmp/.bigdl/dataset/
directory. After that, the treebank will be split into three folders
corresponding to train, dev, and test in an appropriate format.
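Assuming default settings, the resulting layout looks roughly like this (the directory names below are illustrative, not guaranteed; check the script's output for the exact paths):
/tmp/.bigdl/dataset/
├── glove/      # GloVe word embedding vectors
└── sst/
    ├── train/  # training split
    ├── dev/    # development split
    └── test/   # test split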
Next, run one of the following commands to train the model:
- Spark local:
spark-submit --master "local[physical_core_number]" --driver-memory 20g \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar
- Spark cluster:
- Standalone:
MASTER=spark://xxx.xxx.xxx.xxx:xxxx spark-submit --master ${MASTER} --driver-memory 20g --executor-memory 10g \
--total-executor-cores 32 --executor-cores 8 \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar
- Yarn client:
spark-submit --master yarn --driver-memory 20g --executor-memory 10g \
--num-executors 4 --executor-cores 8 \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar
- NOTE: The total batch size is 128, and the batch size per node is 128/nodeNum. You can also set the regularization rate, learning rate, LSTM hidden size, dropout probability, and number of epochs by adding any of the options below:
--baseDir # where the data is located, default is '/tmp/.bigdl/dataset/'
--batchSize # batch size, default is 128
--hiddenSize # TreeLSTM hidden size, default is 250
--learingRate # learning rate, default is 0.05
--regRate # L2 regularization rate, default is 1e-4
--p # dropout probability, default is 0.5
--epoch # number of epochs, default is 5
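For example, to override the batch size and the number of epochs in local mode, append the options after the jar (the values here are arbitrary, for illustration only):
spark-submit --master "local[physical_core_number]" --driver-memory 20g \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar \
--batchSize 256 --epoch 10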