
Welcome to the Quora-Question-Pairs-Kaggle-RedundancyChecker wiki!

There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance; however, given the nature of the problem, it seems reasonable to keep the same data bias in the ML model, since negative instances are more likely in a real-life scenario.
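
As a quick sanity check, the class split can be reproduced with a few lines of pandas; this is a minimal sketch assuming the Kaggle `train.csv` with its `is_duplicate` column:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# 0 = non-duplicate (negative), 1 = duplicate (positive)
counts = df["is_duplicate"].value_counts()
print(counts)             # expect counts close to those quoted above
print(counts / len(df))   # class ratio is kept as-is, matching the real-world bias
```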

When we analyze the data, the shortest question is 1 character long (useless for the task) and the longest question is 1169 characters (a long, complicated love-affair question). If either question in a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average question length is 59 characters with a standard deviation of 32.
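
The length filter can be written as a simple mask; a sketch assuming the same DataFrame as above (`astype(str)` guards against the few NaN entries in the question columns):

```python
import pandas as pd

df = pd.read_csv("train.csv")
q1_len = df["question1"].astype(str).str.len()
q2_len = df["question2"].astype(str).str.len()

# keep only pairs where both questions are at least 10 characters long
df = df[(q1_len >= 10) & (q2_len >= 10)]
print(q1_len.describe())  # mean ~59 and std ~32, per the numbers above
```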

There are two other columns, "q1id" and "q2id", but I do not see how they are useful, since the same question appearing in different rows has different ids.

Some labels are incorrect, especially among the duplicate pairs. In any case, I decided to rely on the given labels and defer any pruning, since it would require heavy manual effort.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Google’s Word2Vec model to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is a duplicate.
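
To make the plan concrete, here is a minimal Keras sketch of such a Siamese setup. It assumes each question has already been reduced to a single 300-dimensional Word2Vec vector; the layer sizes and the absolute-difference merge are illustrative choices, not the final design:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def encoder(dim=300):
    # small shared encoder applied to both questions
    inp = layers.Input(shape=(dim,))
    x = layers.Dense(128, activation="relu")(inp)
    x = layers.Dense(64, activation="relu")(x)
    return Model(inp, x)

q1_in = layers.Input(shape=(300,))
q2_in = layers.Input(shape=(300,))
shared = encoder()                      # the two branches share weights
e1, e2 = shared(q1_in), shared(q2_in)

# absolute difference of the two embeddings, then a sigmoid "duplicate?" score
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([e1, e2])
out = layers.Dense(1, activation="sigmoid")(diff)

model = Model([q1_in, q2_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```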

Word2Vec is a general term for a family of algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.

Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors into Recurrent Neural Networks for more advanced applications. Word2vec is a group of shallow, two-layer models used for producing word embeddings. Presented in Efficient Estimation of Word Representations in Vector Space, word2vec takes a large corpus of text as its input and produces a vector space [13]. Every word in the corpus obtains a corresponding vector in this space. The distinctive feature is that words from common contexts in the corpus are located close to one another in the vector space.
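
Both the similarity computation and the king/queen analogy above can be tried directly with gensim; a sketch assuming the pretrained GoogleNews vectors (a separate, large download, not part of this repo):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(wv.similarity("question", "query"))          # semantic word similarity
print(wv.most_similar(positive=["king", "woman"],  # king - man + woman ≈ queen
                      negative=["man"], topn=1))
```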

There are two well-known algorithms in this domain. One is Google’s network architecture, which learns representations by trying to predict the surrounding words of a target word within a given window size. GloVe is the other method, which relies on co-occurrence matrices. GloVe is easy to train and makes it flexible to add new words outside your vocabulary. You might like to visit this tutorial to learn more, and check out this brilliant use case, Sense2Vec.
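
For contrast with GloVe’s count-based approach, the prediction-based (skip-gram) training looks like this in gensim (version 4+); the two-sentence corpus here is only a stand-in for the tokenized question text:

```python
from gensim.models import Word2Vec

corpus = [["how", "do", "i", "learn", "python"],
          ["what", "is", "machine", "learning"]]

# sg=1 selects skip-gram: predict surrounding words from the target word
model = Word2Vec(corpus, vector_size=300, window=5, sg=1, min_count=1)
vec = model.wv["python"]   # 300-d embedding for a single word
```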

I have used a Multilayer Perceptron with sigmoid activations for both the hidden and output layers, and also an SVM implementation.
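
A sketch of how the two classifiers can be set up with scikit-learn, assuming a pair-feature matrix `X_train` and labels `y_train` (both hypothetical names here); "logistic" is scikit-learn’s name for the sigmoid activation, and the binary output layer of `MLPClassifier` is logistic by default:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

mlp = MLPClassifier(hidden_layer_sizes=(64,), activation="logistic",
                    max_iter=500)
svm = SVC(kernel="rbf")

# mlp.fit(X_train, y_train); svm.fit(X_train, y_train)
```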
