# Assignment 5
## Word class prediction with neural networks
The assignment and data are available here: https://snlp2018.github.io/assignments.html

The files `data/train.txt` and `data/test.txt` each contain a two-column, tab-separated dataset of German words, either nouns or verbs, each with its class label. We train a character-level neural network to predict the word class.

### Exercise 1
Data pre-processing. Read the data and encode it as follows: target labels as 0s and 1s, characters as integers, and words as lists of integers.

First, read the data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_train = pd.read_csv("data/train.txt", sep = "\t", names = ["class", "word"])
df_train.head()

Unnamed: 0,class,word
0,noun,gemeinderat
1,noun,grenzpolizei
2,verb,ruinieren
3,noun,halbtönen
4,noun,energieexporteuren


In [3]:
df_test = pd.read_csv("data/test.txt", sep = "\t", names = ["class", "word"])
df_test.head()

Unnamed: 0,class,word
0,noun,kaufverpflichtung
1,verb,kosten
2,noun,n
3,noun,blousons
4,noun,verwaltungsgeschäfte


Next, encode `class` with the convention `noun` -> `0`, `verb` -> `1`:

In [4]:
df_train["class_bin"] = np.where(df_train["class"] == "noun", 0, 1)
df_train.head()

Unnamed: 0,class,word,class_bin
0,noun,gemeinderat,0
1,noun,grenzpolizei,0
2,verb,ruinieren,1
3,noun,halbtönen,0
4,noun,energieexporteuren,0


In [5]:
df_test["class_bin"] = np.where(df_test["class"] == "noun", 0, 1)
df_test.head()

Unnamed: 0,class,word,class_bin
0,noun,kaufverpflichtung,0
1,verb,kosten,1
2,noun,n,0
3,noun,blousons,0
4,noun,verwaltungsgeschäfte,0


Next, extract the alphabet of Unicode characters from `df_train["word"]` and map each character to an integer (its position in the list will do):

In [6]:
# collect every distinct character that occurs in the training words
alphabet = list({c for word in df_train["word"] for c in str(word)})

In [7]:
print(alphabet)
print(len(alphabet))

['e', 'j', 'u', 'ß', 's', 'l', 'd', 'y', 'v', 'w', 'a', 'p', 'ö', 'c', 'k', 'q', 'ü', 'x', 'ä', 'g', 'm', 't', 'h', 'i', 'r', 'f', 'b', 'o', 'n', 'z']
30
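
Because the alphabet is built from a `set`, the order of its characters (and therefore each character's integer code) is not guaranteed to be the same in every Python session. One way to make the mapping reproducible is to sort the characters. This is a minimal sketch under that assumption; the `words` list is a hypothetical stand-in for `df_train["word"]`, and `alphabet_sorted` is a name introduced here.

```python
# Hypothetical stand-in for df_train["word"].
words = ["gemeinderat", "ruinieren", "halbtönen"]

# A set comprehension collects the distinct characters;
# sorting fixes their order (by Unicode code point).
alphabet_sorted = sorted({char for word in words for char in str(word)})

print(alphabet_sorted)
print(len(alphabet_sorted))
```

With a sorted alphabet, the integer codes depend only on which characters occur in the data, not on the session that produced them.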


We define a function which takes a word and an alphabet and returns a list of integers encoding the input word:

In [8]:
def word_encoder(word, alphabet):
    out_list = []
    for char in str(word):
        if char in alphabet: # if the character belongs to the alphabet...
            out_list.append(alphabet.index(char) + 1) # ...its encoding is its position plus one; 0 is reserved for padding
        else:
            out_list.append(999) # integer reserved for out-of-alphabet characters
    return out_list

For example:

In [9]:
word_encoder("hey", alphabet)

[23, 1, 8]

In [10]:
word_encoder("heÿ", alphabet)

[23, 1, 999]

In [11]:
word_encoder("yehÿ", alphabet)

[8, 1, 23, 999]

It works!
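
The same encoding can be sketched with a dictionary lookup, which avoids the linear scan of `alphabet.index` for every character. The names `build_index` and `encode_word`, the toy alphabet, and the choice of `len(alphabet) + 1` as the unknown-character code are assumptions for illustration, not part of the assignment. A compact unknown code keeps a later one-hot encoding narrow, whereas a code of 999 would require 1000 columns if an unknown character ever appeared.

```python
def build_index(alphabet):
    """Map each character to its 1-based position; 0 stays free for padding."""
    return {char: i + 1 for i, char in enumerate(alphabet)}


def encode_word(word, char_to_int, unknown):
    """Encode a word as a list of integers, using `unknown` for unseen characters."""
    return [char_to_int.get(char, unknown) for char in str(word)]


# Hypothetical toy alphabet, sorted so the mapping is reproducible.
toy_alphabet = sorted(set("hey"))        # ['e', 'h', 'y']
char_to_int = build_index(toy_alphabet)  # {'e': 1, 'h': 2, 'y': 3}
unknown_code = len(toy_alphabet) + 1     # 4

print(encode_word("hey", char_to_int, unknown_code))  # [2, 1, 3]
print(encode_word("heÿ", char_to_int, unknown_code))  # [2, 1, 4]
```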

Next, apply this function to each row of both data frames:

In [12]:
df_train.loc[:, "word_int"] = df_train.apply(lambda row: word_encoder(row.word, alphabet), axis = 1)

In [13]:
df_train.head()

Unnamed: 0,class,word,class_bin,word_int
0,noun,gemeinderat,0,"[20, 1, 21, 1, 24, 29, 7, 1, 25, 11, 22]"
1,noun,grenzpolizei,0,"[20, 25, 1, 29, 30, 12, 28, 6, 24, 30, 1, 24]"
2,verb,ruinieren,1,"[25, 3, 24, 29, 24, 1, 25, 1, 29]"
3,noun,halbtönen,0,"[23, 11, 6, 27, 22, 13, 29, 1, 29]"
4,noun,energieexporteuren,0,"[1, 29, 1, 25, 20, 24, 1, 1, 18, 12, 28, 25, 2..."


In [14]:
df_test.loc[:, "word_int"] = df_test.apply(lambda row: word_encoder(row.word, alphabet), axis = 1)

In [15]:
df_test.head()

Unnamed: 0,class,word,class_bin,word_int
0,noun,kaufverpflichtung,0,"[15, 11, 3, 26, 9, 1, 25, 12, 26, 6, 24, 14, 2..."
1,verb,kosten,1,"[15, 28, 5, 22, 1, 29]"
2,noun,n,0,[29]
3,noun,blousons,0,"[27, 6, 28, 3, 5, 28, 29, 5]"
4,noun,verwaltungsgeschäfte,0,"[9, 1, 25, 10, 11, 6, 22, 3, 29, 20, 5, 20, 1,..."


Next, padding: all feature sequences must have the same length (we use the length of the longest training word plus one). To do so, we use `pad_sequences` from `keras.preprocessing.sequence`:

In [16]:
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [17]:
# how long is the longest word?
max_word = max(len(word) for word in df_train["word_int"])
print(max_word)

31


In [18]:
train_x = pad_sequences(df_train["word_int"], maxlen = max_word + 1) # left-pad with 0s to one position more than the longest training word

In [19]:
train_x.shape

(20000, 32)

In [20]:
# for example
train_x[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0, 20,  1, 21,  1, 24, 29,  7,  1, 25, 11, 22])
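
To make the padding step concrete, here is a minimal NumPy sketch of left-padding. It mirrors the behaviour visible in the output above (zeros added at the front); cutting over-long sequences from the front is an assumption about the default truncation, and `pad_left` is a hypothetical helper, not a Keras function.

```python
import numpy as np


def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]  # keep the last `maxlen` items if too long
        out[row, -len(kept):] = kept
    return out


padded = pad_left([[20, 1, 21], [29], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(padded)
# [[ 0  0 20  1 21]
#  [ 0  0  0  0 29]
#  [ 2  3  4  5  6]]
```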

In [21]:
# same for test features
test_x = pad_sequences(df_test["word_int"], maxlen = train_x.shape[1]) # the max length must be the same for train and test

Finally, one-hot encoding for characters (hence words are encoded into lists of vectors of 0s and 1s). We use `to_categorical` from `keras.utils`:

In [22]:
from keras.utils import to_categorical

In [23]:
train_x_cat = to_categorical(train_x)

In [24]:
train_x_cat.shape

(20000, 32, 31)

In [25]:
# flatten each row
onehot_train = [] 
for i in range(0, train_x_cat.shape[0]):
    onehot_train.append(train_x_cat[i].flatten().tolist())

In [26]:
print(len(onehot_train))
print(len(onehot_train[0]))

20000
992
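
The width of 992 follows from 32 padded positions times 31 classes (30 letters plus the padding code 0). The one-hot and flatten steps can also be sketched in plain NumPy by indexing an identity matrix and reshaping. `one_hot_flat` and the toy matrix are hypothetical and only illustrate the idea on a small scale.

```python
import numpy as np


def one_hot_flat(int_matrix, num_classes):
    """One-hot encode a (words, positions) integer matrix and flatten each word."""
    eye = np.eye(num_classes, dtype=np.float32)
    encoded = eye[int_matrix]  # shape: (words, positions, num_classes)
    return encoded.reshape(int_matrix.shape[0], -1)


# Two toy "words" of 3 positions each, with codes in 0..3.
toy = np.array([[0, 2, 1],
                [0, 0, 3]])
flat = one_hot_flat(toy, num_classes=4)

print(flat.shape)  # (2, 12)
print(flat[0])     # [1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
```

Passing the number of classes explicitly keeps the width identical for any dataset encoded this way, even if one of them happens not to contain the highest code.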


In [27]:
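# same for test features; fix the number of classes so train and test rows have the same width
test_x_cat = to_categorical(test_x, num_classes = train_x_cat.shape[2])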

onehot_test = []
for i in range(0, test_x_cat.shape[0]):
    onehot_test.append(test_x_cat[i].flatten().tolist())

In [28]:
print(len(onehot_test))
print(len(onehot_test[0]))

6561
992


### Exercise 2
We train and tune a simple feed-forward neural network with `onehot_train` (features) as input and `df_train["class_bin"]` (labels) as output, using `keras`.

First, define the model:

In [29]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

In [30]:
# model
model = Sequential() # initialize
model.add(Dense(64, activation = "relu", input_dim = len(onehot_train[0]))) # dense layer with ReLU activation
model.add(Dropout(0.5)) # dropout
model.add(Dense(1, activation = 'sigmoid')) # binary classification: one unit giving the probability of class 1 (softmax over a single unit would always output 1)

model.compile(loss = 'binary_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy'])
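
With a single output unit, the activation has to map one logit to a probability. Softmax normalizes across the units of a layer, so on a one-unit layer it returns 1 for every input, while the logistic sigmoid yields a value in (0, 1) that `binary_crossentropy` can use. The sketch below illustrates this in plain NumPy; the helper functions and the logit values are made up for illustration and are not taken from Keras.

```python
import numpy as np


def softmax(logits):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sigmoid(logits):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-logits))


# Three hypothetical pre-activation values, one output unit each.
logits = np.array([[-2.0], [0.0], [3.0]])

print(softmax(logits).ravel())  # [1. 1. 1.]
print(sigmoid(logits).ravel())  # values strictly between 0 and 1
```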

In [31]:
# training
model.fit(x = np.array(onehot_train), # features as a 2-D array (samples x 992)
          y = df_train["class_bin"].values, # labels: 0 = noun, 1 = verb
          epochs = 5,
          batch_size = 32,
          validation_split = 0.2,
          verbose = 1)