## Question Classification

In [1]:
#read data
texts = []
labels = []

with open('data/LabelledData.txt','r') as f:
    for line in f:
        # Each line has the form "<question> ,,, <label>"
        text, label = line.split(",,,")
        texts.append(text.strip())
        labels.append(label.strip())
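
The loop above assumes every line contains the `,,,` separator exactly once; a blank or malformed line would raise a `ValueError` on unpacking. A slightly more tolerant reader might look like the sketch below. The helper name `parse_labelled_lines` and the sample lines are hypothetical and only illustrate the assumed file format.

```python
def parse_labelled_lines(lines, sep=",,,"):
    """Split 'question ,,, label' lines into parallel lists of texts and labels."""
    texts, labels = [], []
    for line in lines:
        if sep not in line:
            # Skip blank or malformed lines instead of failing on unpacking
            continue
        # Split on the last separator so extra commas in the question are kept
        text, label = line.rsplit(sep, 1)
        texts.append(text.strip())
        labels.append(label.strip())
    return texts, labels


# Hypothetical sample lines in the assumed format
sample = [
    "what is the oldest profession ? ,,, what\n",
    "\n",
    "when is boxing day ? ,,, when\n",
]
sample_texts, sample_labels = parse_labelled_lines(sample)
print(sample_texts)
print(sample_labels)
```

With a real file, the same function could be called as `parse_labelled_lines(f)` inside the `with open(...)` block.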

In [2]:
#for text, label in zip(texts[:10], labels[:10]):
#    print(text," -->", label)

In [3]:
import re
def pre_process(text):
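    # Expand an attached clitic ("what's" -> "what is"). The pattern needs a word
    # character directly before the apostrophe, so a detached " 's" is not matched.
    text = re.sub(r"\b's\b"," is",text)
    # Lowercase, then replace everything except a-z, "?" and "." with a space
    text = re.sub(r"[^a-z?\.]"," ",text.lower())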
    return text
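
The printed sample further down contains `baseball  s st. louis`, which suggests the dataset writes the clitic as a separate token (`baseball 's`). Because `\b's\b` requires a word character immediately before the apostrophe, that form is not expanded and the apostrophe is simply blanked out. A possible variant that handles both forms is sketched here; `pre_process_detached` is a hypothetical name, and collapsing repeated spaces is an extra step that is not in the original function.

```python
import re


def pre_process_detached(text):
    # Match "'s" whether attached ("what's") or detached ("what 's")
    text = re.sub(r"\s*'s\b", " is", text)
    # Lowercase, then replace everything except a-z, "?" and "." with a space
    text = re.sub(r"[^a-z?.]", " ", text.lower())
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()


attached = pre_process_detached("What's the oldest profession ?")
detached = pre_process_detached("What 's the oldest profession ?")
print(attached)
print(detached)
```

Note that this rule also rewrites possessives (`baseball 's` becomes `baseball is`), so it is only a rough normalization.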

In [4]:
processed_texts = [pre_process(text) for text in texts]

In [5]:
for text, label in zip(processed_texts[:10], labels[:10]):
    print(text," -->", label)

how did serfdom develop in and then leave russia ?  --> unknown
what films featured the character popeye doyle ?  --> what
how can i find a list of celebrities   real names ?  --> unknown
what fowl grabs the spotlight after the chinese year of the monkey ?  --> what
what is the full form of .com ?  --> what
what contemptible scoundrel stole the cork from my lunch ?  --> what
what team did baseball  s st. louis browns become ?  --> what
what is the oldest profession ?  --> what
what are liver enzymes ?  --> what
name the scar faced bounty hunter of the old west .  --> unknown


In [6]:
import numpy as np

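# Note: the raw `texts` are used from here on, not `processed_texts`, which is
# why the misclassified examples printed later still contain "'s" and backticks.
X = np.array(texts)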
y = np.array(labels, dtype='str')

In [7]:
unique, counts = np.unique(y, return_counts=True)
dict(zip(unique, counts))

{'affirmation': 104, 'unknown': 272, 'what': 609, 'when': 96, 'who': 402}
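
The classes are noticeably imbalanced, so it helps to know what a trivial classifier would score before reading the model's accuracy. The sketch below uses the counts printed above to compute the majority-class baseline and a set of inverse-frequency class weights. The weighting formula `total / (n_classes * count)` is one common convention, assumed here rather than taken from the notebook.

```python
# Class counts copied from the output above
counts = {'affirmation': 104, 'unknown': 272, 'what': 609, 'when': 96, 'who': 402}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy of always predicting the most frequent class
baseline_accuracy = counts[majority_label] / total

# Inverse-frequency weights: rarer classes get larger weights
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print("total samples:", total)
print("majority class:", majority_label)
print("baseline accuracy: %.3f" % baseline_accuracy)
for label, weight in sorted(class_weights.items()):
    print("%-12s weight %.2f" % (label, weight))
```

Always predicting `what` would be right for 609 of the 1483 questions, roughly 41%.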

In [8]:
from sklearn.preprocessing import LabelBinarizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import os, sys

from keras.layers import Dense, Input, GlobalMaxPooling1D, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model, Sequential
from keras import utils
from keras.layers import concatenate, Activation
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


In [9]:
MAX_SEQUENCE_LENGTH = 45
MAX_NUM_WORDS = 1000
VALIDATION_SPLIT = 0.1

In [10]:
# Split data into train and test
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
data = X[indices]
labels = y[indices]
num_validation_samples = int(VALIDATION_SPLIT * X.shape[0])

In [11]:
train_x = data[:-num_validation_samples]
train_y = labels[:-num_validation_samples]
test_x = data[-num_validation_samples:]
test_y = labels[-num_validation_samples:]
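
The split sizes can be worked out from the class counts above (1483 questions in total). The sketch below also reproduces the `Train on 1201 samples, validate on 134 samples` line that appears in the training log, under the assumption that Keras computes the cut point of `validation_split` as `int(n * (1 - validation_split))` and takes the validation samples from the end of the training array.

```python
n_samples = 1483           # sum of the class counts printed earlier
VALIDATION_SPLIT = 0.1

# Hold-out test set, as computed in the cell above
n_test = int(VALIDATION_SPLIT * n_samples)
n_train = n_samples - n_test

# Second split performed inside model.fit(validation_split=0.1)
# (assumed cut-point formula, see the lead-in)
split_at = int(n_train * (1 - 0.1))
n_fit = split_at
n_val = n_train - split_at

print("test:", n_test)
print("train (before validation split):", n_train)
print("fit on:", n_fit, "validate on:", n_val)
```

So the model is fitted on 1201 questions, validated on 134 and tested on 148.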

In [12]:
encoder = LabelBinarizer()
encoder.fit(train_y)
y_train = encoder.transform(train_y)
y_test = encoder.transform(test_y)
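
For more than two classes, `LabelBinarizer` produces one row per sample with a single 1 in the column of its class, and the columns follow the sorted class names. The pure-Python sketch below imitates that encoding for illustration; `one_hot` is a hypothetical helper, not part of scikit-learn.

```python
def one_hot(labels, classes):
    """Return one row per label with a 1 in the column of its class."""
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[label] == j else 0 for j in range(len(classes))]
            for label in labels]


# Columns follow the sorted class names
classes = sorted({'affirmation', 'unknown', 'what', 'when', 'who'})
rows = one_hot(['what', 'who', 'affirmation'], classes)

print(classes)
for row in rows:
    print(row)
```

This column order is what `np.argmax` indexes into when predictions are mapped back to label names through `encoder.classes_`.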

In [13]:
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_x)
x_train = tokenizer.texts_to_sequences(train_x)
x_test = tokenizer.texts_to_sequences(test_x)

In [14]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 3434 unique tokens.
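
`word_index` lists every token seen during fitting (3434 here), but with `num_words=1000` only the most frequent words are kept when texts are converted to sequences. The sketch below is a simplified stand-in for that behaviour, assuming that indices are assigned by descending frequency starting at 1 and that only indices below `num_words` survive. It splits on whitespace and ignores the punctuation filtering that the real `Tokenizer` applies; both function names are hypothetical.

```python
from collections import Counter


def fit_word_index(texts):
    # The most frequent word gets index 1, the next one index 2, and so on
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words=None):
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            # Drop words whose rank falls outside the vocabulary limit
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


toy_texts = ["what is a cat", "what is a dog", "what is it"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

In the toy example the rare words `cat`, `dog` and `it` are dropped from the sequences even though they still appear in the word index.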


In [15]:
x_train = pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(x_test, maxlen=MAX_SEQUENCE_LENGTH)
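
By default `pad_sequences` pads with zeros on the left and, for sequences longer than `maxlen`, drops tokens from the front. The sketch below mimics that default for a single sequence; `pad_pre` is a hypothetical helper for illustration only.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    kept = list(sequence)[-maxlen:]              # truncate from the front
    return [value] * (maxlen - len(kept)) + kept  # pad on the left


short = pad_pre([5, 3, 8], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

Index 0 is reserved for padding, which is one reason the embedding layer later uses `len(word_index) + 1` rows.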


In [16]:
#print('x_train shape:', x_train.shape)
#print('x_test shape:', x_test.shape)
#print('y_train shape:', y_train.shape)
#print('y_test shape:', y_test.shape)

### Model

In [17]:
from keras.layers import Dense, Input, GlobalMaxPooling1D, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.layers import concatenate, Activation

In [18]:
vocab_size = len(word_index)+1

In [19]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = Embedding(vocab_size, 100)(sequence_input)

x1 = Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1)(embedded_sequences)
x1 = GlobalMaxPooling1D()(x1)

x2 = Conv1D(filters=100, kernel_size=3, padding='valid', activation='relu', strides=1)(embedded_sequences)
x2 = GlobalMaxPooling1D()(x2)

merged = concatenate([x1, x2], axis=1)
merged = Dense(256, activation='relu')(merged)
merged = Dropout(0.5)(merged)
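merged = Dense(5)(merged)
# Note: the five classes are mutually exclusive, so softmax with
# categorical_crossentropy would be the conventional choice. With sigmoid and
# binary_crossentropy, this Keras version reports 'accuracy' per output unit,
# which is higher than the fraction of questions classified correctly.
output = Activation('sigmoid')(merged)

model = Model(inputs=[sequence_input], outputs=[output])
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])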
model.summary()

Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 45)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 45, 100)      343500      input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 44, 100)      20100       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 43, 100)      30100       embedding_1[0][0]                
__________________________
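
The rows visible in the (truncated) summary can be checked by hand. The sketch below recomputes the parameter counts and output lengths from the layer definitions; the two dense-layer counts are derived from the code, since those rows are cut off in the output above.

```python
vocab_size = 3434 + 1      # len(word_index) + 1
embedding_dim = 100
seq_len = 45               # MAX_SEQUENCE_LENGTH
filters = 100


def conv1d_params(kernel_size, in_channels, out_filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * out_filters + out_filters


def conv1d_valid_length(length, kernel_size, stride=1):
    # Output length for padding='valid'
    return (length - kernel_size) // stride + 1


def dense_params(n_in, n_out):
    return n_in * n_out + n_out


embedding_params = vocab_size * embedding_dim
conv2_params = conv1d_params(2, embedding_dim, filters)
conv3_params = conv1d_params(3, embedding_dim, filters)
conv2_len = conv1d_valid_length(seq_len, 2)
conv3_len = conv1d_valid_length(seq_len, 3)
dense1_params = dense_params(2 * filters, 256)   # input is the two pooled branches
dense2_params = dense_params(256, 5)

print("embedding:", embedding_params)
print("conv1d k=2:", conv2_params, "length", conv2_len)
print("conv1d k=3:", conv3_params, "length", conv3_len)
print("dense 256:", dense1_params)
print("dense 5:", dense2_params)
```

The embedding table dominates the model size, and because the tokenizer only emits indices below `MAX_NUM_WORDS`, most of its rows are never looked up.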

In [20]:
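# Save the best weights by validation accuracy. Note: the callback only takes
# effect if it is passed to model.fit via callbacks=[checkpoint]; the fit call
# below does not pass it, so no weights are written.
filepath="weights/CNN_weights.{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')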

In [21]:
model.fit(x_train, y_train,
          batch_size=64,
          epochs=5,
          verbose=1,
          validation_split=0.1)

Train on 1201 samples, validate on 134 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd4e57b1b38>

In [22]:
score = model.evaluate(x_test, y_test,
                       batch_size=64, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

Test score: 0.08441820277555569
Test accuracy: 0.9702702841243228
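
The reported test accuracy should be read with care. With `binary_crossentropy`, this Keras version appears to score `accuracy` per output unit rather than per question: 0.97027 × (148 × 5) = 718, a whole number of correct entries out of 740. The error listing further down prints 8 wrong predictions out of 148, which corresponds to 140/148 ≈ 0.946 when the predicted class is taken as the argmax. The sketch below illustrates the difference between the two measures on made-up probabilities; both function names and the toy data are hypothetical.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual output units on the correct side of the threshold."""
    correct = total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / total


def argmax_accuracy(y_true, y_prob):
    """Fraction of samples whose highest-scoring class is the true class."""
    hits = 0
    for true_row, prob_row in zip(y_true, y_prob):
        hits += int(true_row.index(max(true_row)) == prob_row.index(max(prob_row)))
    return hits / len(y_true)


# Toy one-hot targets and sigmoid outputs (made up for illustration)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y_prob = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.6, 0.1, 0.4], [0.3, 0.4, 0.2]]

per_unit = binary_accuracy(y_true, y_prob)
per_sample = argmax_accuracy(y_true, y_prob)
print("per-unit accuracy:", per_unit)
print("per-sample accuracy:", per_sample)
```

In the toy data half of the samples are misclassified, yet three quarters of the individual units are still correct, because most units in a one-hot target are zeros that are easy to get right.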


In [23]:
text_labels = encoder.classes_

In [24]:
for i in range(len(x_test)):
    prediction = model.predict(np.array([x_test[i]]))
    predicted_label = text_labels[np.argmax(prediction)]
    if test_y[i]!=predicted_label:
        print(test_x[i][:100], "...")
        print('Actual label:' + test_y[i])
        print("Predicted label: " + predicted_label + "\n")

when not adventuring on rann , what does adam strange call his profession ? ...
Actual label:when
Predicted label: what

when did the berlin wall go up ? ...
Actual label:unknown
Predicted label: when

when did rococo painting and architecture flourish ? ...
Actual label:what
Predicted label: when

is there a lag time after you take it out of the box before it starts to work ? ...
Actual label:affirmation
Predicted label: unknown

when superman needs to get away from it all , where does he go ? ...
Actual label:when
Predicted label: unknown

when it 's time to relax , what one beer stands clear ? ...
Actual label:when
Predicted label: what

what soap was touted as being `` for people who like people '' ? ...
Actual label:what
Predicted label: who

when is boxing day ? ...
Actual label:what
Predicted label: when
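
To see where the errors concentrate, the eight (actual, predicted) pairs printed above can be tallied. The pairs in the sketch below are copied by hand from that output.

```python
from collections import Counter

# (actual, predicted) pairs copied from the misclassifications listed above
errors = [
    ("when", "what"),
    ("unknown", "when"),
    ("what", "when"),
    ("affirmation", "unknown"),
    ("when", "unknown"),
    ("when", "what"),
    ("what", "who"),
    ("what", "when"),
]

confusions = Counter(errors)
for (actual, predicted), n in confusions.most_common():
    print(f"{actual:>12} -> {predicted:<8} {n}")

# Number of errors in which 'when' is either the true or the predicted label
involving_when = sum(n for pair, n in confusions.items() if "when" in pair)
print("errors involving 'when':", involving_when, "of", len(errors))
```

Six of the eight errors involve the `when` class, which is also the smallest class in the dataset.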



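Some of these look like labelling errors in the dataset rather than model mistakes: for example, "when did the berlin wall go up ?" is tagged `unknown` and "when is boxing day ?" is tagged `what`, while the model predicts `when` for both.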
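
One rough way to surface such candidates for manual review is a surface-pattern check. The sketch below flags questions that open with "when" followed by an auxiliary verb but carry a different label. The rule, the auxiliary list and the function name are assumptions made for illustration, not part of the dataset's annotation scheme, and the examples are the `when`-initial questions from the listing above.

```python
# Hypothetical list of auxiliaries that typically follow "when" in a time question
AUXILIARIES = {"did", "is", "was", "were", "will", "does", "do", "are"}


def suspicious_when_label(text, label):
    """True if the text looks like a time question but is not labelled 'when'."""
    tokens = text.lower().split()
    looks_like_when = (len(tokens) >= 2
                       and tokens[0] == "when"
                       and tokens[1] in AUXILIARIES)
    return looks_like_when and label != "when"


# (text, dataset label) pairs taken from the misclassifications above
examples = [
    ("when not adventuring on rann , what does adam strange call his profession ?", "when"),
    ("when did the berlin wall go up ?", "unknown"),
    ("when did rococo painting and architecture flourish ?", "what"),
    ("when superman needs to get away from it all , where does he go ?", "when"),
    ("when it 's time to relax , what one beer stands clear ?", "when"),
    ("when is boxing day ?", "what"),
]

flagged = [(text, label) for text, label in examples
           if suspicious_when_label(text, label)]
for text, label in flagged:
    print(label, "<-", text)
```

Applied to the full dataset, a check like this would only produce a shortlist to inspect by hand; it cannot decide which label is correct.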