Siamese LSTM for evaluating semantic similarity between sentences of the Quora Question Pairs Dataset.

Siamese-LSTM

This project uses the MaLSTM model (a Siamese network of LSTMs compared with Manhattan distance) to detect semantic similarity between question pairs. The training data is a subset of the original Quora Question Pairs Dataset (~363K pairs used).

It is a Keras implementation based on the original paper (PDF) and an excellent Medium article.
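
To make the "LSTM with Manhattan distance" idea concrete, here is a minimal sketch of the similarity function used by MaLSTM: the two sentence encodings are compared as exp(-L1 distance), which is 1 for identical encodings and approaches 0 as they move apart. The function name and the toy vectors are illustrative assumptions, not code from this repository, and plain Python lists stand in for the final LSTM hidden states.

```python
import math
from typing import Sequence


def manhattan_similarity(h_left: Sequence[float], h_right: Sequence[float]) -> float:
    """Return exp(-L1 distance) between two sentence encodings, a value in (0, 1]."""
    if len(h_left) != len(h_right):
        raise ValueError("both encodings must have the same length")
    # Manhattan (L1) distance: sum of absolute coordinate differences.
    l1_distance = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1_distance)


# Hypothetical 3-dimensional encodings, for illustration only.
same = manhattan_similarity([0.2, -0.1, 0.5], [0.2, -0.1, 0.5])
different = manhattan_similarity([0.2, -0.1, 0.5], [0.9, 0.4, -0.3])
print(same, different)  # identical encodings score 1.0; distant ones score lower
```

In a Siamese setup both questions pass through the same LSTM (shared weights), so the score depends only on how the two encodings differ.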

Prerequisite

Kaggle's test.csv is too big, so I extracted only the top 20 questions into a file called test-20.csv, which predict.py uses.

Put all data files in the ./data directory.
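
If you need to recreate test-20.csv yourself, something like the following should work. It is only a sketch: it assumes "top 20" means the first 20 data rows after the header line, and the function name `extract_head` is hypothetical, not part of this repository. The `csv` module is used instead of raw line slicing so that quoted questions containing commas or line breaks are kept intact.

```python
import csv
from itertools import islice


def extract_head(src_path: str, dst_path: str, n_rows: int = 20) -> int:
    """Copy the header plus the first n_rows data rows of a CSV file.

    Returns the number of data rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            # Empty source file: nothing to copy.
            return 0
        writer.writerow(header)
        written = 0
        for row in islice(reader, n_rows):
            writer.writerow(row)
            written += 1
        return written
```

Assuming the Kaggle file was saved as ./data/test.csv, calling `extract_head("./data/test.csv", "./data/test-20.csv")` would write the smaller file next to it.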

How to Run

Training

$ python3 train.py

Predicting

Prediction uses the test-20.csv file mentioned above.

$ python3 predict.py

The Results

I tried various parameters, such as the number of hidden units in the LSTM cell, the activation function of the LSTM cell, and the number of epochs. Training ran on two NVIDIA Tesla P40 GPUs with a batch size of 1024*2, and 10% of the data was held out as the validation set. As a result, I reached about 82.29% validation accuracy after 50 epochs, which took about 10 minutes.

Epoch 50/50
363861/363861 [==============================] - 12s 33us/step - loss: 0.1172 - acc: 0.8486 - val_loss: 0.1315 - val_acc: 0.8229
Training time finished.
50 epochs in       601.24
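
As a quick sanity check on the numbers in the log, the short sketch below works out how many batches one epoch contains and how long an epoch takes on average. It assumes the final log line reports seconds; all other values are copied from the text and log above.

```python
import math

train_samples = 363861   # from the progress bar in the log
batch_size = 1024 * 2    # as stated in the text
epochs = 50
total_seconds = 601.24   # last log line, assumed to be in seconds

# The last batch of an epoch may be smaller than batch_size, hence ceil.
batches_per_epoch = math.ceil(train_samples / batch_size)
seconds_per_epoch = total_seconds / epochs

print(batches_per_epoch)            # 178
print(round(seconds_per_epoch, 2))  # 12.02
```

The average of roughly 12 seconds per epoch agrees with the `12s` shown in the progress bar for the last epoch.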