Siamese LSTM for evaluating semantic similarity between sentences of the Quora Question Pairs Dataset.


This project uses the MaLSTM model (Siamese networks + LSTM with Manhattan distance) to detect semantic similarity between question pairs. The training dataset is a subset of the original Quora Question Pairs Dataset (~363K pairs used).
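As a rough illustration of the Manhattan-distance part: in the MaLSTM formulation, both questions are encoded by the same LSTM, and the similarity score is exp(-||h1 - h2||_1), which lies in (0, 1]. The sketch below computes that score in plain Python on made-up vectors. The function name `malstm_similarity` and the example vectors are assumptions for illustration and are not taken from this repository's code.

```python
import math

def malstm_similarity(h_left, h_right):
    """Return exp(-L1 distance) between two equal-length encodings."""
    if len(h_left) != len(h_right):
        raise ValueError("encodings must have the same length")
    # Manhattan (L1) distance between the two sentence encodings.
    l1 = sum(abs(a - b) for a, b in zip(h_left, h_right))
    # Maps distance 0 to similarity 1, and large distances toward 0.
    return math.exp(-l1)

# Hypothetical 3-dimensional encodings; real LSTM states are much larger.
same = malstm_similarity([0.2, -0.5, 0.1], [0.2, -0.5, 0.1])  # identical vectors
far = malstm_similarity([0.2, -0.5, 0.1], [-0.9, 0.8, 0.7])   # distant vectors
```

Identical encodings give a score of exactly 1, and the score decays toward 0 as the encodings move apart, so it can be compared directly against a 0/1 duplicate label.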

It is a Keras implementation based on the original paper (PDF) and an excellent Medium article.


Kaggle's test.csv is too big, so I extracted only the first 20 questions into a file called test-20.csv. It is used in the

Put all data files in the ./data directory.
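One way to produce a small file like test-20.csv is sketched below. How the author actually created it is not stated, so treat this as an assumption: the helper name `write_csv_head` is hypothetical, and the code assumes the source CSV starts with a header row. It uses the `csv` module rather than plain line slicing so that quoted questions containing line breaks are kept as single rows.

```python
import csv
from itertools import islice

def write_csv_head(src_path, dst_path, n_rows=20):
    """Copy the header plus the first n_rows data rows of a CSV file.

    Returns the number of data rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)  # assumed: first row is the header
        if header is None:
            return 0                 # empty input file
        writer.writerow(header)
        written = 0
        for row in islice(reader, n_rows):
            writer.writerow(row)
            written += 1
        return written
```

With the layout described here, a call such as `write_csv_head("./data/test.csv", "./data/test-20.csv")` would write the header and the first 20 rows.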

How to Run


$ python3


It uses the test-20.csv file mentioned above.

$ python3

The Results

I tried various parameters, such as the number of hidden units in the LSTM cell, the activation function of the LSTM cell, and the number of epochs. Training ran on two NVIDIA Tesla P40 GPUs, with 10% of the data held out as the validation set (batch size = 1024*2). The model reached about 82.29% validation accuracy after 50 epochs, which took about 10 minutes.

Epoch 50/50
363861/363861 [==============================] - 12s 33us/step - loss: 0.1172 - acc: 0.8486 - val_loss: 0.1315 - val_acc: 0.8229
Training time finished.
50 epochs in       601.24
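As a quick sanity check, the figures in the log above can be cross-checked with a little arithmetic. The sketch below only reuses numbers that appear in this section. It assumes that 601.24 is in seconds, which is consistent with the roughly 12 s per epoch shown in the progress line.

```python
import math

train_samples = 363861   # from the progress line "363861/363861"
batch_size = 1024 * 2    # batch size stated in the text
epochs = 50
total_seconds = 601.24   # assumed to be seconds ("50 epochs in 601.24")

batches_per_epoch = math.ceil(train_samples / batch_size)
seconds_per_epoch = total_seconds / epochs
total_minutes = total_seconds / 60

print(batches_per_epoch)            # 178
print(round(seconds_per_epoch, 2))  # 12.02
print(round(total_minutes, 1))      # 10.0
```

So each epoch is about 178 batches and takes about 12 seconds, and the whole run comes to about 10 minutes.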