This example shows how to use BigDL to train a model on the Stanford Sentiment Treebank dataset using a binary TreeLSTM and GloVe word embedding vectors. Tree-LSTM is a kind of recursive neural network, described in the paper Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks by Kai Sheng Tai, Richard Socher, and Christopher Manning.
The dataset is a corpus of ~10K one-sentence movie reviews from Rotten Tomatoes. Each sentence has been parsed into a constituency-based parse tree, a kind of binary tree with a word at each leaf. After pre-processing, every node is tagged with a label ranging from -2 to 2, representing the sentiment of the word or phrase: the values -2 to 2 correspond to highly negative, moderately negative, neutral, moderately positive, and highly positive, respectively. The root of the tree represents the sentiment of the entire sentence.
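For illustration, a labeled tree in this scheme might look like the following (a made-up example, not taken from the corpus; each node carries its sentiment label, and words sit at the leaves):
(-1 (0 not) (2 (1 very) (2 good)))
Here the phrase "very good" is labeled positive, but the negation flips the sentence to moderately negative at the root.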
First, run the following script:
python fetch_and_preprocess.py
The treebank dataset and the GloVe word embedding vectors will be downloaded to the
/tmp/.bigdl/dataset/
directory. After that, the treebank will be split into three folders
corresponding to train, dev, and test in an appropriate format.
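Assuming default settings, the resulting layout looks roughly like this (the directory names below are illustrative, not guaranteed; check the script's output for the exact paths):
/tmp/.bigdl/dataset/
├── glove/      # GloVe word embedding vectors
└── sst/
    ├── train/  # training split
    ├── dev/    # development split
    └── test/   # test split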
Next, run one of the following commands to train the model:
- Spark local:
spark-submit --master "local[physical_core_number]" --driver-memory 20g \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar
- Spark cluster:
- Standalone:
MASTER=spark://xxx.xxx.xxx.xxx:xxxx spark-submit --master ${MASTER} --driver-memory 20g --executor-memory 10g \
--total-executor-cores 32 --executor-cores 8 \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar
- Yarn client:
spark-submit --master yarn --driver-memory 20g --executor-memory 10g \
--num-executors 4 --executor-cores 8 \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar
- NOTE: The total batch size is 128, and the batch size per node is 128/nodeNum. You can also set the regularization rate, learning rate, LSTM hidden size, dropout probability, and number of epochs by adding any of the options below:
--baseDir # where the data is located, default is '/tmp/.bigdl/dataset/'
--batchSize # batch size, default is 128
--hiddenSize # TreeLSTM hidden size, default is 250
--learingRate # learning rate, default is 0.05
--regRate # L2 regularization rate, default is 1e-4
--p # dropout probability, default is 0.5
--epoch # number of epochs, default is 5
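For example, to override the batch size and the number of epochs in local mode, append the options after the jar (the values here are arbitrary, for illustration only):
spark-submit --master "local[physical_core_number]" --driver-memory 20g \
--class com.intel.analytics.bigdl.example.treeLSTMSentiment.Train \
bigdl-VERSION-jar-with-dependencies.jar \
--batchSize 256 --epoch 10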