groverjeenu/Bilingual-Word-Embeddings-with-Bucketed-CNN-for-Parallel-Sentence-Extraction


Bilingual Word Embeddings with Bucketed CNN for Parallel Sentence Extraction

A TensorFlow implementation of our ACL 2017 (Student Track) paper, Bilingual Word Embeddings with Bucketed CNN for Parallel Sentence Extraction.

Two sentences are said to be aligned, or semantically similar, if they convey the same meaning in their respective languages. Our code uses bilingual word embeddings to capture the semantic relatedness of words across languages. A similarity matrix is constructed between the words of the two sentences and is then dynamically pooled to a fixed dimension 'dim' for classification. Because a single fixed-size representation does not work well for all sentence-pair sizes, we split the data into buckets of different sizes and train a separate CNN on each bucket.
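
The similarity matrix and dynamic pooling steps described above can be sketched roughly as follows. This is an illustrative sketch and not the code in main.py. It assumes cosine similarity between word vectors, max pooling over a dim x dim grid, and repetition of rows or columns when a sentence is shorter than 'dim'. The function names (cosine_similarity_matrix, dynamic_pool) are hypothetical, and the repository may make different choices.

```python
import numpy as np


def cosine_similarity_matrix(src_vecs, tgt_vecs, eps=1e-8):
    """Cosine similarity between every source word and every target word.

    src_vecs: array of shape (m, d), one bilingual embedding per source word.
    tgt_vecs: array of shape (n, d), one bilingual embedding per target word.
    Returns an (m, n) matrix. Cosine similarity is an assumption here.
    """
    src = src_vecs / (np.linalg.norm(src_vecs, axis=1, keepdims=True) + eps)
    tgt = tgt_vecs / (np.linalg.norm(tgt_vecs, axis=1, keepdims=True) + eps)
    return np.dot(src, tgt.T)


def _upsample(matrix, dim):
    """Repeat rows/columns so that both axes have at least `dim` entries."""
    rows, cols = matrix.shape
    if rows < dim:
        matrix = np.repeat(matrix, -(-dim // rows), axis=0)  # ceil division
    if cols < dim:
        matrix = np.repeat(matrix, -(-dim // cols), axis=1)
    return matrix


def dynamic_pool(matrix, dim, reduce=np.max):
    """Pool a variable-size matrix down to a fixed (dim, dim) matrix.

    The matrix is cut into a dim x dim grid of near-equal regions and each
    region is reduced to one value (max by default, which is an assumption).
    """
    matrix = _upsample(matrix, dim)
    row_chunks = np.array_split(np.arange(matrix.shape[0]), dim)
    col_chunks = np.array_split(np.arange(matrix.shape[1]), dim)
    pooled = np.empty((dim, dim))
    for i, r in enumerate(row_chunks):
        for j, c in enumerate(col_chunks):
            pooled[i, j] = reduce(matrix[np.ix_(r, c)])
    return pooled


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    # Random vectors stand in for real bilingual embeddings.
    sim = cosine_similarity_matrix(rng.randn(7, 50), rng.randn(3, 50))
    print(dynamic_pool(sim, dim=4).shape)
```

Whatever the lengths of the two sentences, the pooled matrix has the same shape, which is what lets a CNN with a fixed input size classify the pair. With a separate 'dim' per bucket, short sentence pairs would need less upsampling and long pairs would be compressed less aggressively.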

Model Overview

Pre-requisites

  • Python 2.7
  • TensorFlow
  • Numpy
  • Scikit Learn
  • Matplotlib

Usage

To replicate the results from our paper, use the testing command below.

$ python main.py

The results are appended to the end of the corresponding files in the results folder. To retrain the model, uncomment the lines indicated in main.py.

Attribution / Thanks

  • Bilingual Word Representations with Monolingual Quality in Mind Link
  • BUCC 2017 dataset Link

License

MIT
