# Assignment 5
## Word class prediction with neural networks
The assignment and data are available here: https://snlp2018.github.io/assignments.html

The file `data/train.txt` (and `test.txt` as well) contains a two-column, tab-separated dataset of German words, either nouns or verbs, each with its class label. We train a character-level neutral network to learn the word classes.

### Exercise 1
Data pre-processing. Read the data and encode as follows: target labels as 0s and 1s;  word characters to integers and words to lists of integers.

First, read data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_train = pd.read_csv("data/train.txt", sep = "\t", names = ["class", "word"])
df_train.head()

Unnamed: 0,class,word
0,noun,gemeinderat
1,noun,grenzpolizei
2,verb,ruinieren
3,noun,halbtönen
4,noun,energieexporteuren


In [3]:
df_test = pd.read_csv("data/test.txt", sep = "\t", names = ["class", "word"])
df_test.head()

Unnamed: 0,class,word
0,noun,kaufverpflichtung
1,verb,kosten
2,noun,n
3,noun,blousons
4,noun,verwaltungsgeschäfte


Next, encode `class`; convention: `noun` -> `0`; `verb` -> `1`

In [4]:
train_y = np.where(df_train["class"] == "noun", 0, 1)
train_y[0:5]

array([0, 0, 1, 0, 0])

In [6]:
test_y = np.where(df_test["class"] == "noun", 0, 1)
test_y[0:5]

array([0, 1, 0, 0, 0])

Next, extract alphabet of Unicode characters from `df_train["words"]` and map each to an integer (its position in the list will do):

In [5]:
alphabet = list(set([c for word in [list(str(word)) for word in df_train["word"].tolist()] for c in word]))

In [7]:
print(alphabet)
print(len(alphabet))

['b', 'p', 'g', 'd', 'f', 'w', 'y', 'ä', 'x', 'i', 'k', 'h', 'l', 'c', 'ß', 'ü', 'j', 'm', 'e', 'v', 'q', 'r', 'n', 'u', 'z', 'ö', 's', 'a', 'o', 't']
30


We define a function which takes a word and an alphabet and return a list of integers which encodes the input word: 

In [8]:
def word_encoder(word, alphabet):
    word = str(word)
    out_list = np.zeros(shape = len(word), dtype = "int32")
    for i, char in enumerate(word):
        if char in alphabet: # if the character belongs to the alphabet...
            out_list[i] = alphabet.index(char)+1 # ...its encoding is simply its position plus one, we'll use 0 for padding
        else:
            out_list[i] = 999 # integer reserved for out-of-alphabet characters
    return out_list

For example:

In [9]:
word_encoder("hey", alphabet)

array([12, 19,  7])

In [11]:
word_encoder("heÿ", alphabet)

array([ 12,  19, 999])

In [10]:
word_encoder("yehÿ", alphabet)

array([  7,  19,  12, 999])

It works!

Next, apply this function to each row in our dfs:

In [34]:
train_x = df_train.apply(lambda row: word_encoder(row.word, alphabet), axis = 1)

In [35]:
train_x[0:5]

0           [3, 19, 18, 19, 10, 23, 4, 19, 22, 28, 30]
1       [3, 22, 19, 23, 25, 2, 29, 13, 10, 25, 19, 10]
2                 [22, 24, 10, 23, 10, 19, 22, 19, 23]
3                  [12, 28, 13, 1, 30, 26, 23, 19, 23]
4    [19, 23, 19, 22, 3, 10, 19, 19, 9, 2, 29, 22, ...
dtype: object

In [36]:
test_x = df_test.apply(lambda row: word_encoder(row.word, alphabet), axis = 1)

In [37]:
test_x[0:5]

0    [11, 28, 24, 5, 20, 19, 22, 2, 5, 13, 10, 14, ...
1                             [11, 29, 27, 30, 19, 23]
2                                                 [23]
3                      [1, 13, 29, 24, 27, 29, 23, 27]
4    [20, 19, 22, 6, 28, 13, 30, 24, 23, 3, 27, 3, ...
dtype: object

Next, padding: the sequences of features should all be of the same length (we use the length of the longest sequence). To do so, we use `sequence.pad_sequences` from `keras.preprocessing`:

In [38]:
from keras.preprocessing.sequence import pad_sequences

In [39]:
# how long is the longest word?
max_word = max(len(word) for word in train_x)
print(max_word)

31


In [40]:
train_x = pad_sequences(train_x)

In [41]:
train_x.shape

(20000, 31)

In [42]:
# for example
train_x[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  3, 19, 18, 19, 10, 23,  4, 19, 22, 28, 30])

In [43]:
# same for test features
test_x = pad_sequences(test_x, maxlen = train_x.shape[1]) # the max length be the same between train and test

Finally, one-hot encoding for characters (hence words are encoded into lists of vectors of 0s and 1s). We use `to_categorical` from `keras.utils`:

In [44]:
from keras.utils import to_categorical

In [45]:
train_x_cat = to_categorical(train_x)

In [46]:
train_x_cat.shape

(20000, 31, 31)

In [48]:
# flatten each row
onehot_train = train_x_cat.reshape(train_x_cat.shape[0], train_x_cat.shape[1]*train_x_cat.shape[2])

In [49]:
onehot_train.shape

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [50]:
# same for test
test_x_cat = to_categorical(test_x)

onehot_test = test_x_cat.reshape(test_x_cat.shape[0], test_x_cat.shape[1]*test_x_cat.shape[2])

In [51]:
onehot_test.shape

(6561, 961)

### Exercise 2
We train and tune a simple feed-forward nn with `train_onehot` (features) as input and `train_y` (labels) as output, using `keras`.

First, describe the model:

In [61]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

In [62]:
# model
model = Sequential() # initialize
model.add(Dense(64, activation = "relu", input_dim = onehot_train.shape[1])) # dense layer with ReLU activation
model.add(Dropout(0.2)) # droput
model.add(Dense(1, activation = 'softmax')) # binary classification

model.compile(loss = 'binary_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy'])

In [63]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_6 (Dense)              (None, 64)                61568     
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 65        
Total params: 61,633
Trainable params: 61,633
Non-trainable params: 0
_________________________________________________________________


Training:

In [64]:
# training
history = model.fit(onehot_train,
                    train_y,
                    epochs = 30,
                    batch_size = 64,
                    validation_split = 0.2,
                    verbose = 1)

Train on 16000 samples, validate on 4000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [None]:
loss, accuracy = model.evaluate(onehot_train, train_y, verbose = False)
print("On training set, Loss={}, Accuracy={}".format(loss, accuracy))

In [60]:
loss, accuracy = model.evaluate(onehot_test, test_y, verbose = False)
print("On testing set, Loss={}, Accuracy={}".format(loss, accuracy))

On testing set, Loss=12.306769774828973, Accuracy=0.19295838475227356
