# SemEval 2018: Task 8 - Question classification practice notebook

Uses 100-dimensional GloVe embeddings trained on Twitter data, which should work better with informal text.
The embeddings need 1-2 GB of RAM and 2-3 GB of disk space. The notebook is self-sufficient: it downloads and unpacks its own data, because it is intended to be run in [Google Colab](https://colab.research.google.com).


## Download GloVe embeddings

In [0]:
!wget nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip

--2018-12-23 11:53:15--  http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip [following]
--2018-12-23 11:53:15--  https://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1520408741 (1.4G) [application/zip]
Saving to: ‘glove.twitter.27B.zip’


2018-12-23 11:53:51 (41.1 MB/s) - ‘glove.twitter.27B.zip’ saved [1520408741/1520408741]



In [0]:
!ls -l

total 5242484
-rw-rw-r-- 1 root root 1021671926 Dec 23  2015 glove.twitter.27B.100d.txt
-rw-rw-r-- 1 root root 2057595650 Dec 23  2015 glove.twitter.27B.200d.txt
-rw-rw-r-- 1 root root  257699930 Dec 23  2015 glove.twitter.27B.25d.txt
-rw-rw-r-- 1 root root  510889212 Dec 23  2015 glove.twitter.27B.50d.txt
-rw-r--r-- 1 root root 1520408741 Dec 23  2015 glove.twitter.27B.zip
drwxr-xr-x 1 root root       4096 Dec 18 20:29 sample_data


In [0]:
!unzip glove.twitter.27B.zip

Archive:  glove.twitter.27B.zip
replace glove.twitter.27B.100d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

KeyboardInterrupt: ignored
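
The prompt above appears because the `.txt` files were already unpacked, so `unzip` asks before overwriting and the cell has to be interrupted. A minimal sketch of a non-interactive alternative, using Python's standard `zipfile` module, is shown below. The helper name `extract_missing` is hypothetical and not part of the original notebook.

```python
import os
import zipfile


def extract_missing(zip_path, dest_dir="."):
    """Extract only the archive members that are not already on disk."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            target = os.path.join(dest_dir, name)
            if os.path.exists(target):
                continue  # already unpacked: skip instead of prompting
            archive.extract(name, dest_dir)
            extracted.append(name)
    return extracted


# In the notebook this would be called as:
# extract_missing("glove.twitter.27B.zip")
```

Re-running the cell is then harmless: files that already exist are left untouched and the function returns an empty list.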

## Download dataset

In [0]:
!wget -O 'dataset.zip' 'https://competitions.codalab.org/my/datasets/download/30915784-67bb-4974-8c24-21a7ec86f587'

--2018-12-23 14:03:45--  https://competitions.codalab.org/my/datasets/download/30915784-67bb-4974-8c24-21a7ec86f587
Resolving competitions.codalab.org (competitions.codalab.org)... 134.158.75.178
Connecting to competitions.codalab.org (competitions.codalab.org)|134.158.75.178|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://newcodalab.lri.fr/prod-private/dataset_data_file/None/77777/train_and_dev_sets_questions_and_an.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=c10a894860b7e12bc63fd4ba116f59fb458a06e9d899ebc6ab7c207d5b03f65d&X-Amz-Date=20181223T140335Z&X-Amz-Credential=AZIAIOSAODNN7EX123LE%2F20181223%2Fnewcodalab%2Fs3%2Faws4_request [following]
--2018-12-23 14:03:46--  https://newcodalab.lri.fr/prod-private/dataset_data_file/None/77777/train_and_dev_sets_questions_and_an.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=c10a894860b7e12bc63fd4ba116f

In [0]:
!unzip dataset.zip
!rm -rf __MACOSX

Archive:  dataset.zip
  inflating: questions_train.xml     
   creating: __MACOSX/
  inflating: __MACOSX/._questions_train.xml  
  inflating: questions_dev.xml       
  inflating: __MACOSX/._questions_dev.xml  
  inflating: answers_train.xml       
  inflating: __MACOSX/._answers_train.xml  
  inflating: answers_dev.xml         
  inflating: __MACOSX/._answers_dev.xml  


## Data setup

In [0]:
import numpy as np
import pandas as pd
import csv

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.initializers import Constant

In [0]:
MAX_SEQUENCE_LENGTH = 200
MAX_NUM_WORDS = 10000
EMBEDDING_DIM = 100

glove_embeddings_file = "./glove.twitter.27B.100d.txt"
questions_train_file = "./questions_train.xml"
questions_dev_file = "./questions_dev.xml"

In [0]:
import xml.etree.ElementTree as ET

class XMLParser:
    """Parser assumes the first level of xml tags are to be transformed
    to rows in a Pandas dataframe. For each of the first-level tags it takes
    all of their subtags and attributes, and puts them as columns to
    the current row in the dataframe."""

    def __init__(self, xml_data):
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        """Return a list of dictionaries from the text
         and attributes of the children under this XML root."""
        return [self.parse_element(child) for child in iter(root)]

    def parse_element(self, element, parsed=None):
        """ Collect {key:attribute} and {tag:text} from the XML
         element and all its children into a single dictionary of strings."""

        if parsed is None:
            parsed = dict()

        if element.tag == "RelComment":
            return parsed

        for key in element.keys():
            if key not in parsed:
                parsed[key] = element.attrib.get(key)
            else:
                raise ValueError('Duplicate attribute {0} = {1}, prev = {2}' \
                                 .format(key,
                                         element.attrib.get(key),
                                         parsed[key]))

        for child in iter(element):
            if child.tag == "RelQSubject" or child.tag == "RelQBody":
                parsed[child.tag] = child.text
            else:
                self.parse_element(child, parsed)

        return parsed

    def to_df(self):
        """ Initiate the root XML, parse it, and return a dataframe"""
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)
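
To see what one parsed row looks like, here is a small standalone sketch of the same flattening idea. The sample XML is made up: it reuses the tag and attribute names this notebook reads (`THREAD_SEQUENCE`, `RELQ_FACT_LABEL`, `RelQSubject`, `RelQBody`, `RelComment`), while the nesting, the values, `RELC_ID` and `RelCText` are assumptions for illustration. The duplicate-attribute check of `parse_element` is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature input; structure and values are invented.
SAMPLE_XML = """
<root>
  <Thread THREAD_SEQUENCE="Q1_R1">
    <RelQuestion RELQ_FACT_LABEL="Factual">
      <RelQSubject>Visa renewal</RelQSubject>
      <RelQBody>How long does it take?</RelQBody>
    </RelQuestion>
    <RelComment RELC_ID="Q1_R1_C1">
      <RelCText>About two weeks.</RelCText>
    </RelComment>
  </Thread>
</root>
"""


def flatten(element, row=None):
    """Merge attributes and question text of an element tree into one dict."""
    if row is None:
        row = {}
    if element.tag == "RelComment":
        return row  # comments are skipped, as in XMLParser.parse_element
    row.update(element.attrib)
    for child in element:
        if child.tag in ("RelQSubject", "RelQBody"):
            row[child.tag] = child.text
        else:
            flatten(child, row)
    return row


# One dictionary per first-level element, ready for pd.DataFrame(rows).
rows = [flatten(thread) for thread in ET.fromstring(SAMPLE_XML)]
print(rows)
```

Each first-level element becomes one dictionary, so the list can be passed straight to `pd.DataFrame`, which is what `to_df` does.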


In [0]:
def df_from_xml_file(filename):
    """Make a dataframe from an XML file (not recursive)."""
    with open(filename, 'r') as content_file:
        content = content_file.read()

    xml = XMLParser(content)
    xml_df = xml.to_df()
    return xml_df


def make_data(df, class_map):
    """Compact and transform the dataframe needed for classification.
    Returns:
        x - 1d nparray with docs
        y - 1d nparray with classes as integers
        data - dataframe with only the needed features for classification
    """
    data = pd.DataFrame({
        'id': df.THREAD_SEQUENCE,
        'subject': df.RelQSubject,
        'question': df.RelQBody,
        'type': df.RELQ_FACT_LABEL,
    })

    data.question = data.question.fillna("")
    data.question = data.subject + "\t"  + data.question
    data = data.drop(['subject'], axis=1)
    
    x, y = data.question.values, None
    if class_map:
        # Map each label string (e.g. 'Factual') to its integer class id.
        y = data.type.transform(lambda label: class_map[label]).values

    return x, y, data

In [0]:
types_map = {
    'Opinion': 0, 
    'Factual': 1, 
    'Socializing': 2
}  

questions_train_df = df_from_xml_file(questions_train_file)
questions_test_df = df_from_xml_file(questions_dev_file)

x, y, questions_train_df = make_data(questions_train_df, class_map=types_map)
x_test, _, questions_test_df = make_data(questions_test_df, class_map=None)

In [0]:
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(x)
sequences = tokenizer.texts_to_sequences(x)
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
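
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, drops tokens from the front as well. The pure-Python sketch below imitates that default behaviour on toy input. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


result = pad_pre([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(result)
# [[0, 0, 5, 3], [1, 4, 9, 2]]
```

For this notebook it means that a question longer than `MAX_SEQUENCE_LENGTH` tokens keeps its ending rather than its beginning.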

In [0]:
labels = to_categorical(y)

In [0]:
x_train, x_val, y_train, y_val = train_test_split(data,
                                                  labels,
                                                  test_size=0.20,
                                                  random_state=42)

## Model setup

In [0]:
# Load embeddings
embeddings_matrix = pd.read_table(glove_embeddings_file, 
                                  sep=" ", 
                                  index_col=0, 
                                  header=None, 
                                  quoting=csv.QUOTE_NONE)

In [0]:
embedding_layer = Embedding(embeddings_matrix.shape[0],
                            EMBEDDING_DIM,
                            weights=[embeddings_matrix.values],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
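
Note that the `Tokenizer` numbers words by their frequency in `x`, while the rows of `embeddings_matrix` follow the order of the GloVe file, so index `i` in the padded sequences does not generally refer to the vector of the same word in row `i`. One common way to line the two up is to build a matrix whose row `i` holds the GloVe vector of the word the tokenizer calls `i`. The sketch below shows the idea on toy data in plain Python. The helper `build_embedding_rows` and the toy inputs are assumptions, not part of the original notebook.

```python
import io


def build_embedding_rows(word_index, glove_lines, dim, max_words):
    """Return one vector per tokenizer index; unknown words stay all-zero."""
    num_rows = min(max_words, len(word_index) + 1)  # index 0 is the padding id
    rows = [[0.0] * dim for _ in range(num_rows)]
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        index = word_index.get(word)
        if index is not None and index < num_rows and len(values) == dim:
            rows[index] = [float(v) for v in values]
    return rows


# Toy stand-ins for tokenizer.word_index and the GloVe text file.
toy_word_index = {"hello": 1, "world": 2, "unseen": 3}
toy_glove = io.StringIO("world 0.1 0.2\nhello 0.3 0.4\nextra 0.5 0.6\n")
rows = build_embedding_rows(toy_word_index, toy_glove, dim=2, max_words=10)
print(rows)
# [[0.0, 0.0], [0.3, 0.4], [0.1, 0.2], [0.0, 0.0]]
```

In the notebook, the inputs would be `tokenizer.word_index`, the open GloVe file, `EMBEDDING_DIM` and `MAX_NUM_WORDS`. The result, converted with `np.asarray(rows)`, could then serve as the `weights` of an `Embedding` layer with `len(rows)` input rows, which is also far smaller than the full GloVe table.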

In [0]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
pooling = GlobalMaxPooling1D()(embedded_sequences)
fully_connected = Dense(128, activation='relu')(pooling)
preds = Dense(3, activation='softmax')(fully_connected)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

In [148]:
model.summary()
model.fit(x_train, y_train,
          batch_size=32,
          epochs=100,
          validation_data=(x_val, y_val))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_15 (InputLayer)        (None, 200)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 200, 100)          119351700 
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 100)               0         
_________________________________________________________________
dense_23 (Dense)             (None, 128)               12928     
_________________________________________________________________
dense_24 (Dense)             (None, 3)                 387       
Total params: 119,365,015
Trainable params: 13,315
Non-trainable params: 119,351,700
_________________________________________________________________
Train on 894 samples, validate on 224 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch

<keras.callbacks.History at 0x7f5387d9d198>
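
The parameter counts in the summary above can be reproduced by hand. A dense layer has `inputs * units` weights plus `units` biases, and the embedding layer has one `EMBEDDING_DIM`-sized vector per row. A short check, using only the numbers printed in the summary:

```python
embedding_dim = 100
embedding_params = 119351700                      # from the summary table

vocab_rows = embedding_params // embedding_dim    # rows in the embedding table
dense_hidden = embedding_dim * 128 + 128          # weights + biases, Dense(128)
dense_output = 128 * 3 + 3                        # weights + biases, Dense(3)
trainable = dense_hidden + dense_output
total = embedding_params + trainable

print(vocab_rows)                  # 1193517
print(dense_hidden, dense_output)  # 12928 387
print(trainable, total)            # 13315 119365015
```

Only the two dense layers are trained, since the embedding layer is created with `trainable=False`.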