- Semantic Textual Similarity 2012-2017 Dataset
- The benchmark comprises 8628 sentence pairs. Each pair is accompanied by a similarity score from 0 to 5, where 0 is the least similar and 5 is the most similar.
- The train-dev-test split is as follows:
genre | train | dev | test | total |
---|---|---|---|---|
news | 3299 | 500 | 500 | 4299 |
caption | 2000 | 625 | 625 | 3250 |
forum | 450 | 375 | 254 | 1079 |
total | 5749 | 1500 | 1379 | 8628 |
- I have designed a system similar to the one proposed in "Siamese Recurrent Architectures for Learning Sentence Similarity".
- Instead of a plain LSTM, I have used a BiLSTM followed by a dense layer.
- The loss is based on the exponential of the negative Manhattan distance between the two sentence representations generated by the Siamese network.
- The results can be further improved by using the non-parametric log-linear classifier as a post-processing step, as illustrated in the base paper.
- utils.py contains all the helper functions.
- siamese_model.py contains the model architecture.
- main.ipynb contains the implementation.
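
The Manhattan-distance similarity mentioned above can be sketched in plain Python, without any deep learning framework. This is only an illustration: the function names (`manhattan_similarity`, `rescale_gold_score`, `squared_error`) are hypothetical and are not taken from utils.py or siamese_model.py, and both the rescaling of the 0-5 gold scores to [0, 1] and the squared-error objective are assumptions, not something this repository confirms.

```python
import math
from typing import Sequence


def manhattan_similarity(h_a: Sequence[float], h_b: Sequence[float]) -> float:
    """Return exp(-||h_a - h_b||_1), a similarity value in (0, 1]."""
    if len(h_a) != len(h_b):
        raise ValueError("sentence representations must have the same length")
    # L1 (Manhattan) distance between the two sentence representations.
    l1_distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    return math.exp(-l1_distance)


def rescale_gold_score(score: float, max_score: float = 5.0) -> float:
    """Map a 0-5 gold score to [0, 1] (assumed rescaling, for illustration)."""
    return score / max_score


def squared_error(h_a: Sequence[float], h_b: Sequence[float], gold_score: float) -> float:
    """Squared error between predicted similarity and rescaled gold score (assumed objective)."""
    predicted = manhattan_similarity(h_a, h_b)
    return (predicted - rescale_gold_score(gold_score)) ** 2
```

Identical representations give a similarity of exactly 1, and the value decays towards 0 as the two vectors move apart, so the prediction stays in the same range as the rescaled gold scores.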