## Quora Question Pairs
This is a personal attempt at the 2017 Quora Question Pairs Kaggle competition (https://www.kaggle.com/c/quora-question-pairs/overview).

At the heart of this problem is measuring the similarity between two questions; the task is to identify whether the questions are duplicates, or in other words, semantically equivalent.

In [1]:
import os
print("File Sizes:")
file_size = os.path.getsize('./test.csv')
num_lines = sum(1 for line in open('./test.csv'))
print("test.csv " + str(round(file_size / (1024*1024),2)) + "MB " + str(num_lines) + " lines")
file_size = os.path.getsize('./train.csv')
num_lines = sum(1 for line in open('./train.csv'))
print("train.csv " + str(round(file_size / (1024*1024),2)) + "MB " + str(num_lines) + " lines")

File Sizes:
test.csv 299.47MB 2345806 lines
train.csv 60.46MB 404302 lines


As we can see, the test data is significantly larger than the training data, with roughly 6x as many lines.

In [2]:
import pandas
import numpy
# `str` is used instead of `numpy.unicode_`, which was removed in NumPy 2.0
train = pandas.read_csv('./train.csv', dtype={'question1': str, 'question2': str})
#train.head()
train.question1 = train.question1.astype(str)
train.question2 = train.question2.astype(str)

In [3]:
print(len(train))

404290


Why does the length of this data frame differ from the line count of the file? Even after accounting for the header row, the file has 11 more lines than the frame has rows. Opening up the training file, it looks like some of the question text contains a newline character that splits a single example across multiple lines, though this seems to happen pretty infrequently. As an aside, this file uses carriage returns, so one might reasonably assume it was written on Windows. The first example I see is at line 2334 of train.csv (example id 2332). Just to verify...

In [4]:
#train.iloc[2332]

Looks like this was parsed into the frame properly, so no worries.
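
The mismatch between line count and row count can be reproduced on a tiny in-memory CSV. The sketch below uses made-up rows (not taken from train.csv) in which one quoted field contains a newline, and compares a naive line count with what a CSV parser sees.

```python
import csv
import io

# Hypothetical CSV with a header and two records; the first record has a
# newline inside a quoted field.
raw = (
    'id,question1,question2\r\n'
    '0,"What is\nPython?","Why use Python?"\r\n'
    '1,"What is Java?","Why use Java?"\r\n'
)

# Counting newline characters treats the embedded newline as a record boundary.
physical_lines = raw.count("\n")

# A CSV parser keeps the quoted newline inside the field.
records = list(csv.reader(io.StringIO(raw, newline="")))

print(physical_lines, len(records))
```

The naive count is one higher than the number of parsed records, which is the same effect seen between the file's line count and `len(train)`.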

The next thing we might want to do is look at the occurrence of question duplication in the training set. We can do this simply by taking the mean of the `is_duplicate` column.

In [5]:
avg_duplicate_likelihood = train['is_duplicate'].mean()
print("Fraction of question pairs that are duplicates: " + str(avg_duplicate_likelihood))

Fraction of question pairs that are duplicates: 0.369197853026293


In [6]:
# `str` is used instead of `numpy.unicode_`, which was removed in NumPy 2.0
test = pandas.read_csv('./test.csv', dtype={'question1': str, 'question2': str})
test.question1 = test.question1.astype(str)
test.question2 = test.question2.astype(str)

So we can see that duplicates occur about 36.9% of the time. As a baseline, we could simply predict that all examples have a 36.9% probability of being a duplicate.

In [7]:
sub = pandas.DataFrame({'test_id': test['test_id'], 'is_duplicate': avg_duplicate_likelihood})
sub.to_csv('baseline.csv', index=False)

This gives a 0.554 public score. Let's see how we could do better.
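
To put the 0.554 in context, here is a small sketch of the log loss of a constant prediction, assuming the score is the binary log loss used by the competition. The helper names are hypothetical. If the test set had the same duplicate rate as the training set, a constant 36.9% prediction would score noticeably worse than 0.554, so the score can be inverted to estimate the duplicate rate in the scored portion of the test set.

```python
import math

def constant_log_loss(pred, positive_rate):
    """Log loss of predicting `pred` for every pair when a fraction
    `positive_rate` of the pairs are true duplicates."""
    return -(positive_rate * math.log(pred)
             + (1 - positive_rate) * math.log(1 - pred))

def implied_positive_rate(pred, observed_loss):
    """Invert constant_log_loss for the positive rate (the loss is linear in it)."""
    loss_if_positive = -math.log(pred)
    loss_if_negative = -math.log(1 - pred)
    return (observed_loss - loss_if_negative) / (loss_if_positive - loss_if_negative)

train_rate = 0.369197853026293   # fraction of duplicates printed above
train_loss = constant_log_loss(train_rate, train_rate)

# 0.554 is the public score reported above for the constant baseline.
test_rate = implied_positive_rate(train_rate, 0.554)

print(round(train_loss, 3), round(test_rate, 3))
```

Under these assumptions the implied duplicate rate on the public leaderboard data is well below the training rate, which is one motivation for rebalancing the training set later on.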

## Feature Creation
Let's start with a boosting model, creating a number of features. We could consider the proportion of shared words between sentences, shared n-grams, a tf-idf weighted share, cosine similarity (based on word2vec), and Levenshtein distance.

In [8]:
from nltk.corpus import stopwords
import string
stops = set(stopwords.words("english"))

def word_match_share(row):
    q1words = {}
    q2words = {}
    for word in str(row['question1']).translate(str.maketrans('', '', string.punctuation)).lower().split():
        if word not in stops:
            q1words[word] = 1
    for word in str(row['question2']).translate(str.maketrans('', '', string.punctuation)).lower().split():
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    shared_words_in_q1 = [w for w in q1words.keys() if w in q2words]
    shared_words_in_q2 = [w for w in q2words.keys() if w in q1words]
    R = (len(shared_words_in_q1) + len(shared_words_in_q2))/(len(q1words) + len(q2words))
    return R

Word sharing as a feature alone provides a public score of 0.42, which is a marked improvement over our baseline.
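
As a worked example of the word-share feature, the sketch below applies the same formula to a pair of toy questions. It uses a tiny hypothetical stopword set in place of NLTK's list so that it runs without any downloads; the function names are illustrative only.

```python
import string

# Hypothetical mini stopword list standing in for NLTK's English stopwords.
toy_stops = {"how", "do", "i", "what", "is", "the", "to"}

def content_words(text, stops):
    """Lowercase, strip punctuation, and drop stopwords."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return {w for w in cleaned.split() if w not in stops}

def toy_word_match_share(q1, q2, stops=toy_stops):
    w1, w2 = content_words(q1, stops), content_words(q2, stops)
    if not w1 or not w2:
        return 0
    shared = w1 & w2
    # Each shared word is counted once per question, as in word_match_share.
    return 2 * len(shared) / (len(w1) + len(w2))

share = toy_word_match_share("How do I learn Python quickly?",
                             "What is the best way to learn Python?")
print(share)
```

Here the first question keeps 3 content words, the second keeps 4, and 2 are shared, so the share is 4/7.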

In [9]:
import re
from nltk.util import ngrams

def bigram_match_share(row):
    q1bigram = {}
    q2bigram = {}
    tokens = [token for token in str(row['question1']).translate(str.maketrans('', '', string.punctuation)).lower().split(" ") if token != "" and token not in stops]
    output = list(ngrams(tokens, 2))
    for bigram in output:
        q1bigram[bigram] = 1
    tokens = [token for token in str(row['question2']).translate(str.maketrans('', '', string.punctuation)).lower().split(" ") if token != "" and token not in stops]
    output = list(ngrams(tokens, 2))
    for bigram in output:
        q2bigram[bigram] = 1
    if len(q1bigram) == 0 or len(q2bigram) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    shared_bigram_in_q1 = [w for w in q1bigram.keys() if w in q2bigram]
    shared_bigram_in_q2 = [w for w in q2bigram.keys() if w in q1bigram]
    R = (len(shared_bigram_in_q1) + len(shared_bigram_in_q2))/(len(q1bigram) + len(q2bigram))
    return R

#bigram_match_share(train.iloc[5])

In [10]:
import re
from nltk.util import ngrams

def trigram_match_share(row):
    q1ngram = {}
    q2ngram = {}
    tokens = [token for token in str(row['question1']).translate(str.maketrans('', '', string.punctuation)).lower().split(" ") if token != "" and token not in stops]
    output = list(ngrams(tokens, 3))
    for ngram in output:
        q1ngram[ngram] = 1
    tokens = [token for token in str(row['question2']).translate(str.maketrans('', '', string.punctuation)).lower().split(" ") if token != "" and token not in stops]
    output = list(ngrams(tokens, 3))
    for ngram in output:
        q2ngram[ngram] = 1
    if len(q1ngram) == 0 or len(q2ngram) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    shared_ngram_in_q1 = [w for w in q1ngram.keys() if w in q2ngram]
    shared_ngram_in_q2 = [w for w in q2ngram.keys() if w in q1ngram]
    R = (len(shared_ngram_in_q1) + len(shared_ngram_in_q2))/(len(q1ngram) + len(q2ngram))
    return R


Adding the bi-gram and tri-gram features provided a marginal improvement of about 0.01 on our public score.
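
The bigram and trigram functions differ only in the n-gram size, so the same idea can be sketched once with `n` as a parameter. This is a simplified illustration on pre-tokenized input, using plain Python instead of `nltk.util.ngrams`; the names are hypothetical.

```python
def ngram_set(tokens, n):
    """All contiguous n-grams in a token list, as a set of tuples."""
    return set(zip(*(tokens[i:] for i in range(n))))

def ngram_match_share(tokens1, tokens2, n):
    g1, g2 = ngram_set(tokens1, n), ngram_set(tokens2, n)
    if not g1 or not g2:
        # Questions shorter than n tokens have no n-grams to compare.
        return 0
    # Shared n-grams are counted once per question, as in the functions above.
    return 2 * len(g1 & g2) / (len(g1) + len(g2))

t1 = ["learn", "python", "fast"]
t2 = ["learn", "python", "quickly"]
bigram_share = ngram_match_share(t1, t2, 2)
trigram_share = ngram_match_share(t1, t2, 3)
print(bigram_share, trigram_share)
```

The toy pair shares one of its two bigrams but no trigram, which hints at why longer n-grams add less signal on short questions.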

In [11]:
from collections import Counter

# If a word appears only once, we ignore it completely (likely a typo)
# Epsilon defines a smoothing constant, which makes the effect of extremely rare words smaller
def get_weight(count, eps=10000, min_count=2):
    if count < min_count:
        return 0
    else:
        return 1 / (count + eps)

# NOTE: eps is never passed to get_weight, so the weights below use the default eps=10000
eps = 5000
train_qs = pandas.Series(train['question1'].tolist() + train['question2'].tolist()).astype(str)
words = (" ".join(train_qs)).lower().split()
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}

In [12]:
def tfidf_word_match_share(row):
    q1words = {}
    q2words = {}
    for word in str(row['question1']).translate(str.maketrans('', '', string.punctuation)).lower().split():
        if word not in stops:
            q1words[word] = 1
    for word in str(row['question2']).translate(str.maketrans('', '', string.punctuation)).lower().split():
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    
    shared_weights = [weights.get(w, 0) for w in q1words.keys() if w in q2words] + [weights.get(w, 0) for w in q2words.keys() if w in q1words]
    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]
    
    R = numpy.sum(shared_weights) / numpy.sum(total_weights)
    return R

The tf-idf feature provided a marginal decrease.
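
To make the weighting concrete, the sketch below applies the same `get_weight` rule to a three-sentence toy corpus. The corpus, the small `eps=1`, and the `weighted_share` helper are assumptions for illustration; unlike the function above, the helper also guards against a zero denominator.

```python
from collections import Counter

def get_weight(count, eps=10000, min_count=2):
    # Same rule as above: ignore words seen fewer than min_count times.
    return 0 if count < min_count else 1 / (count + eps)

toy_corpus = ["what is python", "what is java", "is python fast"]
toy_counts = Counter(" ".join(toy_corpus).split())
# A small eps keeps the toy numbers readable.
toy_weights = {w: get_weight(c, eps=1) for w, c in toy_counts.items()}

def weighted_share(q1, q2, weights):
    w1, w2 = set(q1.split()), set(q2.split())
    shared = w1 & w2
    # Shared words contribute their weight once per question.
    shared_weight = 2 * sum(weights.get(w, 0) for w in shared)
    total_weight = (sum(weights.get(w, 0) for w in w1)
                    + sum(weights.get(w, 0) for w in w2))
    return shared_weight / total_weight if total_weight else 0

score = weighted_share("what is python", "is python fast", toy_weights)
print(toy_weights, score)
```

Words seen only once ("java", "fast") get weight 0, and the more frequent "is" gets a lower weight than "what" or "python".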

In [13]:
import pyemd
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)



In [14]:
def word2vec_cosine_similarity(row):
    # Note: `model.vocab` exists in gensim < 4.0; gensim >= 4.0 exposes `model.key_to_index` instead
    q1words = []
    q2words = []
    for word in str(row['question1']).translate(str.maketrans('', '', string.punctuation)).lower().split():
        if word not in stops and word in model.vocab:
            q1words.append(word)
    for word in str(row['question2']).translate(str.maketrans('', '', string.punctuation)).lower().split():
        if word not in stops and word in model.vocab:
            q2words.append(word)
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    sim = model.n_similarity(q1words, q2words)
    return sim

The word2vec cosine similarities did not improve the model performance.
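
As far as I understand it, `n_similarity` takes the cosine similarity between the mean vectors of the two word lists. The sketch below reproduces that idea with NumPy on made-up 3-dimensional vectors (not real word2vec embeddings), and shows one way averaging can hide differences between questions.

```python
import numpy as np

# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "python": np.array([1.0, 0.0, 0.0]),
    "java":   np.array([0.0, 1.0, 0.0]),
    "code":   np.array([1.0, 1.0, 0.0]),
}

def mean_vector_cosine(words1, words2, vectors):
    """Cosine similarity between the averaged vectors of two word lists."""
    v1 = np.mean([vectors[w] for w in words1], axis=0)
    v2 = np.mean([vectors[w] for w in words2], axis=0)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

same = mean_vector_cosine(["python", "code"], ["python", "code"], toy_vectors)
orthogonal = mean_vector_cosine(["python"], ["java"], toy_vectors)
# Different words, but the averaged vectors point in the same direction.
mixed = mean_vector_cosine(["python", "java"], ["code"], toy_vectors)
print(same, orthogonal, mixed)
```

The last case scores as highly as an exact match even though the word lists differ, which may be part of why this feature added little on top of the word-share features.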

In [15]:
# x_train = pandas.DataFrame()
# x_train['word_match'] = train.apply(word_match_share, axis=1, raw=True)
# #x_train['tfidf_word_match'] = train.apply(tfidf_word_match_share, axis=1, raw=True)
# x_train['word2vec_cosine_similarity'] = train.apply(word2vec_cosine_similarity, axis=1, raw=True)
# x_train['q1len'] = train['question1'].str.len()
# x_train['q1len'][numpy.isnan(x_train['q1len'])] = 0
# x_train['q2len'] = train['question2'].str.len()
# x_train['q2len'][numpy.isnan(x_train['q2len'])] = 0
# x_train['len_diff'] = abs(x_train['q1len']-x_train['q2len'])
# x_train['q1_n_words'] = train['question1'].apply(lambda row: len(str(row).split(" ")))
# x_train['q2_n_words'] = train['question2'].apply(lambda row: len(str(row).split(" ")))
# x_train['q1_num_caps'] = train['question1'].apply(lambda row: sum(1 for c in str(row) if c.isupper()))
# x_train['q2_num_caps'] = train['question2'].apply(lambda row: sum(1 for c in str(row) if c.isupper()))

# y_train = pandas.DataFrame()
# y_train = train['is_duplicate'].values

# x_test = pandas.DataFrame()
# x_test['word_match'] = test.apply(word_match_share, axis=1, raw=True)
# #x_test['tfidf_word_match'] = test.apply(tfidf_word_match_share, axis=1, raw=True)
# x_test['word2vec_cosine_similarity'] = test.apply(word2vec_cosine_similarity, axis=1, raw=True)
# x_test['q1len'] = test['question1'].str.len()
# x_test['q1len'][numpy.isnan(x_test['q1len'])] = 0
# x_test['q2len'] = test['question2'].str.len()
# x_test['q2len'][numpy.isnan(x_test['q2len'])] = 0
# x_test['len_diff'] = abs(x_test['q1len']-x_test['q2len'])
# x_test['q1_n_words'] = test['question1'].apply(lambda row: len(str(row).split(" ")))
# x_test['q2_n_words'] = test['question2'].apply(lambda row: len(str(row).split(" ")))
# x_test['q1_num_caps'] = test['question1'].apply(lambda row: sum(1 for c in str(row) if c.isupper()))
# x_test['q2_num_caps'] = test['question2'].apply(lambda row: sum(1 for c in str(row) if c.isupper()))

In [16]:
# from fuzzywuzzy import fuzz
# def lev_dist(row):
#     lev_dist = fuzz.ratio(str(row['question1']), str(row['question2']))
#     return lev_dist/100
# x_train['lev_dist'] = train.apply(lev_dist, axis=1, raw=True)
# x_test['lev_dist'] = test.apply(lev_dist, axis=1, raw=True)

Levenshtein distance provided a mild increase in model performance.
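
For reference, here is a minimal pure-Python sketch of the Levenshtein edit distance and one simple way to turn it into a 0-to-1 similarity. Note that `fuzz.ratio` applies its own normalization, so this is an illustration of the underlying idea and not a reimplementation of that function.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"), normalized_similarity("kitten", "sitting"))
```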

In [17]:
# x_train['bigram_word_match'] = train.apply(bigram_match_share, axis=1, raw=True)
# x_test['bigram_word_match'] = test.apply(bigram_match_share, axis=1, raw=True)

In [18]:
# x_train['trigram_word_match'] = train.apply(trigram_match_share, axis=1, raw=True)
# x_test['trigram_word_match'] = test.apply(trigram_match_share, axis=1, raw=True)

In [19]:
# pos_train = x_train[y_train == 1]
# neg_train = x_train[y_train == 0]

# # Now we oversample the negative class
# # There is likely a much more elegant way to do this...
# p = 0.165
# scale = ((len(pos_train) / (len(pos_train) + len(neg_train))) / p) - 1
# while scale > 1:
#     neg_train = pandas.concat([neg_train, neg_train])
#     scale -=1
# neg_train = pandas.concat([neg_train, neg_train[:int(scale * len(neg_train))]])
# print(len(pos_train) / (len(pos_train) + len(neg_train)))

# x_train = pandas.concat([pos_train, neg_train])
# y_train = (numpy.zeros(len(pos_train)) + 1).tolist() + numpy.zeros(len(neg_train)).tolist()
# del pos_train, neg_train

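Another kernel suggested rebalancing the data so that the positive and negative training examples are better proportioned; the cell above does this by oversampling the negative class toward a target positive rate of p = 0.165. Doing this significantly improved the model score, by 0.07.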
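
The oversampling arithmetic can also be made explicit. Working through the loop above by hand with the 36.9% duplicate rate, it appears to end at a positive share of roughly 0.19 instead of exactly 0.165. The sketch below is an alternative, not the kernel's method: it computes the replication factor that hits a target rate and samples the extra negatives with NumPy. The function names and the toy labels are hypothetical.

```python
import numpy as np

def negative_oversample_factor(pos_rate, target_rate):
    """Factor by which the negative class must grow so that positives
    make up `target_rate` of the resampled data."""
    return (pos_rate * (1 - target_rate)) / (target_rate * (1 - pos_rate))

def oversample_negatives(y, target_rate, seed=0):
    """Return row indices: all positives, all negatives, plus enough
    randomly repeated negatives to reach the target positive rate."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    factor = negative_oversample_factor(len(pos_idx) / len(y), target_rate)
    n_neg = int(round(factor * len(neg_idx)))
    extra = rng.choice(neg_idx, size=max(n_neg - len(neg_idx), 0), replace=True)
    return np.concatenate([pos_idx, neg_idx, extra])

# Toy labels with roughly the training set's class balance.
toy_y = np.array([1] * 370 + [0] * 630)
idx = oversample_negatives(toy_y, target_rate=0.165)
resampled_rate = toy_y[idx].mean()
print(len(idx), resampled_rate)
```

The returned indices could be used to select rows from a feature frame and its labels together, which keeps the two aligned.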

In [20]:
from sklearn import naive_bayes
from sklearn import ensemble
from sklearn.model_selection import train_test_split
#print(x_train)
#x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train)

## Random Forest
A simple random forest classifier was attempted, but the performance was significantly worse than the boosting model.

In [21]:
# rf_model = ensemble.RandomForestClassifier()
# #print(x_train.values)
# #numpy.isnan(y_train)

# rf_model.fit(x_train.fillna(0), y_train)
# predictions = rf_model.predict(x_valid.fillna(0))
# from sklearn import metrics
# print(metrics.accuracy_score(predictions, y_valid))

In [22]:
# predictions = rf_model.predict_proba(x_test.fillna(0))
# print(predictions[:,1])
# sub = pandas.DataFrame({'test_id': test['test_id'], 'is_duplicate': predictions[:,1]})
# sub.to_csv('random_forest.csv', index=False)

## XGBoost Model
XGBoost is a popular gradient boosting library. It builds an ensemble of weak learners sequentially, as opposed to the independently trained trees of a random forest. I saw some of the best performance from this.

In [23]:
# import xgboost as xgb

# # Set our parameters for xgboost
# params = {}
# params['objective'] = 'binary:logistic'
# params['eval_metric'] = 'logloss'
# params['eta'] = 0.02
# params['max_depth'] = 8

# d_train = xgb.DMatrix(x_train, label=y_train)
# d_valid = xgb.DMatrix(x_valid, label=y_valid)

# watchlist = [(d_train, 'train'), (d_valid, 'valid')]

# bst = xgb.train(params, d_train, 500, watchlist, early_stopping_rounds=50, verbose_eval=10)

In [None]:
# d_test = xgb.DMatrix(x_test)
# p_test = bst.predict(d_test)

# sub = pandas.DataFrame()
# sub['test_id'] = test['test_id']
# sub['is_duplicate'] = p_test
# sub.to_csv('simple_xgb.csv', index=False)

## Neural Networks
After attempting numerous iterations and feature engineering with the above boosting model, I tried a variety of NN architectures via Keras.

In [None]:
from keras.models import Sequential
from keras import layers
# input_dim = x_train.shape[1]  # Number of features

Using TensorFlow backend.


## Shallow Neural Network
The first attempt at a NN was a simple shallow network with the existing feature set.

In [None]:
# model = Sequential()
# model.add(layers.Dense(100, input_dim=input_dim, activation='relu'))
# model.add(layers.Dense(1, activation='sigmoid'))
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.summary()
# history = model.fit(x_train, y_train, epochs=100, verbose=True, validation_data=(x_valid, y_valid), batch_size=100)
# loss, accuracy = model.evaluate(x_train, y_train, verbose=True)
# print("Training Accuracy: {:.4f}".format(accuracy))
# loss, accuracy = model.evaluate(x_valid, y_valid, verbose=True)
# print("Testing Accuracy:  {:.4f}".format(accuracy))

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(history):
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()


In [None]:
# predictions = model.predict_proba(x_test)
# print(predictions[:,0])
# sub = pandas.DataFrame({'test_id': test['test_id'], 'is_duplicate': predictions[:,0]})
# sub.to_csv('shallow_nn.csv', index=False)

This model was the first attempt at using Keras and yielded similar results to the previous boosting approach.

## Neural Networks and Word Embeddings

In [None]:
from keras.preprocessing.text import Tokenizer
# (Removed an unused `questions` variable that concatenated question1 with itself.)
tokenizer = Tokenizer(num_words=200000)
# Fit the vocabulary on both question columns
tokenizer.fit_on_texts(numpy.append(train['question1'],train['question2']))

question1_word_sequences = tokenizer.texts_to_sequences(train['question1'])
question2_word_sequences = tokenizer.texts_to_sequences(train['question2'])

vocab_size = len(tokenizer.word_index) + 1 
print(vocab_size)

In [None]:
from keras.preprocessing.sequence import pad_sequences

maxlen = 25  # must match the Input(shape=(25,)) layers used by the models below

q1_tokenized = pad_sequences(question1_word_sequences, maxlen=maxlen)
q2_tokenized = pad_sequences(question2_word_sequences, maxlen=maxlen)
#x_test['question1'] = pad_sequences(x_test['question1'], padding='post', maxlen=maxlen)
#x_test['question2'] = pad_sequences(x_test['question2'], padding='post', maxlen=maxlen)

In [None]:
question1_word_sequences_test = tokenizer.texts_to_sequences(test['question1'])
question2_word_sequences_test = tokenizer.texts_to_sequences(test['question2'])
q1_tokenized_test = pad_sequences(question1_word_sequences_test, maxlen=25)
q2_tokenized_test = pad_sequences(question2_word_sequences_test, maxlen=25)

In [None]:
print('Shape of question1 data tensor:', q1_tokenized.shape)
print('Shape of question2 data tensor:', q2_tokenized.shape)
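
For readers unfamiliar with these two steps, here is a simplified pure-Python sketch of what tokenizing and padding produce. It is an approximation, not the Keras implementation: the real `Tokenizer` also filters punctuation, and the helper names here are hypothetical. It does mirror two Keras defaults, namely that unknown words are dropped and that `pad_sequences` pads and truncates at the front of the sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent first; id 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word by its id and drop unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_pre(sequence, maxlen, value=0):
    """Keep the last maxlen ids and left-pad with `value` (maxlen must be positive)."""
    kept = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

toy_texts = ["what is python", "is python fast"]
toy_index = build_word_index(toy_texts)

short_padded = pad_pre(to_sequence("is python fast", toy_index), maxlen=5)
long_padded = pad_pre(list(range(1, 8)), maxlen=5)
print(short_padded, long_padded)
```

Short questions end up with leading zeros, and questions longer than `maxlen` keep only their final tokens.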

In [None]:
X = numpy.stack((q1_tokenized, q2_tokenized), axis=1)

# Labels as a plain NumPy array (the empty DataFrame placeholder was redundant)
y_train = train['is_duplicate'].values

x_train, x_test, y_train, y_test = train_test_split(X, y_train, test_size=0.1)
q1_train = x_train[:,0]
q2_train = x_train[:,1]
q1_test = x_test[:,0]
q2_test = x_test[:,1]
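
The reason for stacking the two padded arrays before splitting is that each question pair then travels through the shuffle as a single row. A small NumPy sketch with made-up data shows the shapes involved and that the pairing survives a random permutation.

```python
import numpy as np

n_pairs, seq_len = 4, 25

# Made-up "tokenized" questions; q2 is the negation of q1 so pairs are easy to check.
q1 = np.arange(n_pairs * seq_len).reshape(n_pairs, seq_len)
q2 = -q1

# Shape (n_pairs, 2, seq_len): axis 1 selects question 1 or question 2.
pairs = np.stack((q1, q2), axis=1)

# Shuffling rows keeps each question with its partner.
rng = np.random.default_rng(0)
shuffled = pairs[rng.permutation(n_pairs)]
q1_shuffled, q2_shuffled = shuffled[:, 0], shuffled[:, 1]

print(pairs.shape, q1_shuffled.shape)
```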

In [None]:
def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
    embedding_matrix = numpy.zeros((vocab_size, embedding_dim))

    embeddings_index = {}
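    with open(filepath, encoding='utf-8') as f:
        for line in f:
            # Some GloVe tokens contain spaces, so read the vector from the end of the line
            values = line.rstrip().split(' ')
            word = ' '.join(values[:-embedding_dim])
            embedding = numpy.asarray(values[-embedding_dim:], dtype='float32')
            embeddings_index[word] = embedding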

    print('Word embeddings: %d' % len(embeddings_index))
    
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

In [None]:
embedding_dim = 300
embedding_matrix = create_embedding_matrix('./glove.840B.300d.txt',tokenizer.word_index, embedding_dim)

In [None]:
print('Null word embeddings: %d' % numpy.sum(numpy.sum(embedding_matrix, axis=1) == 0))
nonzero_elements = numpy.count_nonzero(numpy.count_nonzero(embedding_matrix, axis=1))
nonzero_elements / vocab_size
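
The embedding lookup and coverage check can be tried on a tiny in-memory stand-in for the GloVe file. Everything in this sketch (the two 3-dimensional vectors, the toy word index) is made up for illustration; it only shows how rows of the matrix are filled and how words without a pretrained vector are left as zeros.

```python
import io
import numpy as np

# Hypothetical GloVe-style text: one token followed by its vector per line.
glove_text = "python 0.1 0.2 0.3\nlearn 0.4 0.5 0.6\n"
toy_word_index = {"python": 1, "learn": 2, "quora": 3}  # index 0 is reserved
dim = 3

toy_matrix = np.zeros((len(toy_word_index) + 1, dim))
for line in io.StringIO(glove_text):
    parts = line.rstrip().split(" ")
    word = " ".join(parts[:-dim])
    vector = np.asarray(parts[-dim:], dtype="float32")
    row = toy_word_index.get(word)
    if row is not None:
        toy_matrix[row] = vector

# Rows with at least one nonzero entry are words that found a pretrained vector.
covered = np.count_nonzero(np.count_nonzero(toy_matrix, axis=1))
coverage = covered / len(toy_word_index)
print(covered, coverage)
```

Here "quora" has no pretrained vector, so its row stays at zero and coverage is 2 out of 3 words.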

## Dense NN Model

In [None]:
# import keras
# from keras import backend as K
# from keras.layers import Input, LSTM, Dense, Dropout, TimeDistributed, Lambda, BatchNormalization
# from keras.models import Model

# question_1 = Input(shape=(25,))
# question_2 = Input(shape=(25,))
# #word_match = Input(shape=(1,))

# # Add the word embedding Layer
# embedding_layer_1 = layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(question_1)
# #embedding_layer_1 = layers.SpatialDropout1D(0.3)(embedding_layer_1)
# embedding_layer_1 = TimeDistributed(Dense(embedding_dim, activation='relu'))(embedding_layer_1)
# embedding_layer_1 = Lambda(lambda x: K.max(x, axis=1), output_shape=(embedding_dim, ))(embedding_layer_1)

# embedding_layer_2 = layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(question_2)
# #embedding_layer_2 = layers.SpatialDropout1D(0.3)(embedding_layer_2)
# embedding_layer_2 = TimeDistributed(Dense(embedding_dim, activation='relu'))(embedding_layer_2)
# embedding_layer_2 = Lambda(lambda x: K.max(x, axis=1), output_shape=(embedding_dim, ))(embedding_layer_2)

# merged_embedding_vector = keras.layers.concatenate([embedding_layer_1, embedding_layer_2], axis=-1)

# # Add the LSTM Layer
# #lstm_layer = layers.LSTM(512)

# #encoded_q1 = lstm_layer(embedding_layer_1)
# #encoded_q2 = lstm_layer(embedding_layer_2)

# #merged_embedding_vector = keras.layers.concatenate([encoded_q1, encoded_q2], axis=-1)
# #merged_vector = keras.layers.concatenate([encoded_q1, encoded_q2], axis=-1)
# #merged_vector = keras.layers.concatenate([merged_embedding_vector, word_match], axis=-1)

# # Add the output Layers
# merged_vector = layers.Dense(200, activation="relu")(merged_embedding_vector)
# merged_vector = Dropout(0.1)(merged_vector)
# merged_vector = BatchNormalization()(merged_vector)
# merged_vector = layers.Dense(200, activation="relu")(merged_vector)
# merged_vector = Dropout(0.1)(merged_vector)
# merged_vector = BatchNormalization()(merged_vector)
# merged_vector = layers.Dense(200, activation="relu")(merged_vector)
# merged_vector = Dropout(0.1)(merged_vector)
# merged_vector = BatchNormalization()(merged_vector)
# merged_vector = layers.Dense(200, activation="relu")(merged_vector)
# merged_vector = Dropout(0.1)(merged_vector)
# merged_vector = BatchNormalization()(merged_vector)
# #output_layer1 = layers.Dropout(0.25)(output_layer1)

# output_layer2 = layers.Dense(1, activation="sigmoid")(merged_vector)

# # Compile the model
# #model = Model(inputs=[question_1,question_2,word_match], outputs=output_layer2)
# model = Model(inputs=[question_1,question_2], outputs=output_layer2)
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.summary()

In [None]:
# print(x_train['question1'].shape)
# print(x_train['question2'].shape)
# # print(x_train['word_match'].shape)
# q_1 = x_train['question1'].values
# q_2 = x_train['question2'].values


#input_list = [q_1, q_2, x_train['word_match'].values]
#validation_list = [x_test['question1'].values, x_test['question2'].values, x_test['word_match']]

# input_list = [q_1, q_2]
# validation_list = [x_test['question1'].values, x_test['question2'].values]
input_list = [q1_train, q2_train]
validation_list = [q1_test, q2_test]

In [None]:
# history = model.fit(input_list, y_train,epochs=20, verbose=True, validation_data=(validation_list, y_test), batch_size=5000)

In [None]:
# loss, accuracy = model.evaluate(input_list, y_train, verbose=False)
# print("Training Accuracy: {:.4f}".format(accuracy))
# loss, accuracy = model.evaluate(validation_list, y_test, verbose=False)
# print("Testing Accuracy:  {:.4f}".format(accuracy))
# plot_history(history)

In [None]:
# predictions = model.predict([q1_tokenized_test,q2_tokenized_test])
# print(predictions[:,0])
# sub = pandas.DataFrame({'test_id': test['test_id'], 'is_duplicate': predictions[:,0]})
# sub.to_csv('dense_nn.csv', index=False)

## LSTM Model

In [None]:
import keras
from keras import backend as K
from keras.layers import Input, LSTM, Dense, Dropout, TimeDistributed, Lambda, BatchNormalization
from keras.models import Model
from keras.regularizers import l2

question_1 = Input(shape=(25,))
question_2 = Input(shape=(25,))
#word_match = Input(shape=(1,))

# Add the word embedding Layer
embedding_layer_1 = layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(question_1)
embedding_layer_1 = layers.SpatialDropout1D(0.3)(embedding_layer_1)

embedding_layer_2 = layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(question_2)
embedding_layer_2 = layers.SpatialDropout1D(0.3)(embedding_layer_2)

merged_embedding_vector = keras.layers.concatenate([embedding_layer_1, embedding_layer_2], axis=-1)

# Add the LSTM Layer
lstm_layer = layers.LSTM(256)

encoded_q1 = lstm_layer(embedding_layer_1)
encoded_q2 = lstm_layer(embedding_layer_2)

merged_embedding_vector = keras.layers.concatenate([encoded_q1, encoded_q2], axis=-1)
#merged_vector = keras.layers.concatenate([encoded_q1, encoded_q2], axis=-1)
#merged_vector = keras.layers.concatenate([merged_embedding_vector, word_match], axis=-1)

# Add the output Layers
merged_vector = layers.Dense(100, activation="relu", kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01))(merged_embedding_vector)
merged_vector = BatchNormalization()(merged_vector)
merged_vector = layers.Dense(100, activation="relu", kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01))(merged_vector)
merged_vector = BatchNormalization()(merged_vector)
#output_layer1 = layers.Dropout(0.25)(output_layer1)

output_layer2 = layers.Dense(1, activation="sigmoid")(merged_vector)

# Compile the model
model = Model(inputs=[question_1,question_2], outputs=output_layer2)
#model = Model(inputs=[question_1,question_2], outputs=output_layer2)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(input_list, y_train,epochs=10, verbose=True, validation_data=(validation_list, y_test), batch_size=100)

In [None]:
loss, accuracy = model.evaluate(input_list, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(validation_list, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)

In [None]:
predictions = model.predict([q1_tokenized_test,q2_tokenized_test])
print(predictions[:,0])
sub = pandas.DataFrame({'test_id': test['test_id'], 'is_duplicate': predictions[:,0]})
sub.to_csv('lstm_nn.csv', index=False)