# Assignment 5
## Word class prediction with neural networks
The assignment and data are available here: https://snlp2018.github.io/assignments.html

The files `data/train.txt` and `data/test.txt` each contain a two-column, tab-separated dataset of German words, either nouns or verbs, each with its class label. We train a character-level neural network to predict the word class.

### Exercise 1
Data pre-processing. Read the data and encode it as follows: target labels as 0s and 1s, word characters as integers, and words as lists of integers.

First, read the data:

In [32]:
import pandas as pd
import numpy as np

In [33]:
df_train = pd.read_csv("data/train.txt", sep = "\t", names = ["class", "word"])
df_train.head()

Unnamed: 0,class,word
0,noun,gemeinderat
1,noun,grenzpolizei
2,verb,ruinieren
3,noun,halbtönen
4,noun,energieexporteuren


In [34]:
df_test = pd.read_csv("data/test.txt", sep = "\t", names = ["class", "word"])
df_test.head()

Unnamed: 0,class,word
0,noun,kaufverpflichtung
1,verb,kosten
2,noun,n
3,noun,blousons
4,noun,verwaltungsgeschäfte


Next, encode `class` with the convention `noun` -> `0`, `verb` -> `1`:

In [35]:
df_train["class_bin"] = np.where(df_train["class"] == "noun", 0, 1)
df_train.head()

Unnamed: 0,class,word,class_bin
0,noun,gemeinderat,0
1,noun,grenzpolizei,0
2,verb,ruinieren,1
3,noun,halbtönen,0
4,noun,energieexporteuren,0


In [36]:
df_test["class_bin"] = np.where(df_test["class"] == "noun", 0, 1)
df_test.head()

Unnamed: 0,class,word,class_bin
0,noun,kaufverpflichtung,0
1,verb,kosten,1
2,noun,n,0
3,noun,blousons,0
4,noun,verwaltungsgeschäfte,0
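
`np.where` assigns `1` to every value that is not exactly `"noun"`, so a misspelled or missing label would silently be counted as a verb. A minimal sketch of a stricter alternative, assuming a hypothetical `label_map` dictionary and a toy DataFrame standing in for `df_train`:

```python
import pandas as pd

# Hypothetical explicit mapping; follows the convention noun -> 0, verb -> 1.
label_map = {"noun": 0, "verb": 1}

# Toy frame standing in for df_train (the last row has an unexpected label).
toy = pd.DataFrame({"class": ["noun", "verb", "nuon"],
                    "word": ["gemeinderat", "ruinieren", "kosten"]})

# Series.map leaves unmapped labels as NaN instead of forcing them to 1.
toy["class_bin"] = toy["class"].map(label_map)

# Rows whose label is not in the mapping can then be inspected or dropped.
unexpected = toy[toy["class_bin"].isna()]
print(unexpected)
```

Whether such a check is needed depends on how clean the label column is; for data with only the two expected labels, both approaches give the same encoding.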


Next, extract the alphabet of Unicode characters from `df_train["word"]` and map each character to an integer (its position in the list will do):

In [37]:
alphabet = list({c for word in df_train["word"] for c in str(word)})

In [38]:
print(alphabet)
print(len(alphabet))

['o', 'g', 'ß', 'j', 'q', 'k', 'm', 'a', 'r', 'y', 'z', 'ä', 'd', 'w', 'x', 'v', 'b', 's', 'l', 'c', 'ö', 't', 'ü', 'h', 'f', 'n', 'u', 'p', 'i', 'e']
30
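
Because the alphabet is built from a `set`, the order of its characters (and hence every integer code) can differ from one Python session to the next, since string hashing is randomized by default. A minimal sketch of a reproducible variant, assuming a toy word list in place of `df_train["word"]`; the name `sorted_alphabet` is introduced here only for illustration:

```python
# Toy word list standing in for df_train["word"].
words = ["gemeinderat", "ruinieren", "halbtönen"]

# Collect the distinct characters, then sort them so the order is fixed.
sorted_alphabet = sorted({char for word in words for char in str(word)})

print(sorted_alphabet)
print(len(sorted_alphabet))
```

The integer codes printed in the rest of this section come from the unsorted list, so they would differ under this variant.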


We define a function that takes a word and an alphabet and returns a list of integers encoding the input word:

In [39]:
def word_encoder(word, alphabet):
    out_list = []
    for char in str(word):
        if char in alphabet: # if the character belongs to the alphabet...
            out_list.append(alphabet.index(char)+1) # ...its encoding is simply its position plus one, we'll use 0 for padding
        else:
            out_list.append(999) # integer reserved for out-of-alphabet characters
    return out_list

For example:

In [40]:
word_encoder("hey", alphabet)

[24, 30, 10]

In [41]:
word_encoder("heÿ", alphabet)

[24, 30, 999]

In [42]:
word_encoder("yehÿ", alphabet)

[10, 30, 24, 999]

It works!
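
`alphabet.index(char)` scans the list once for every character. A sketch of the same encoding with a precomputed dictionary, assuming a hypothetical helper `make_encoder` and a small example alphabet; it keeps the conventions used above (codes start at 1, `0` is left for padding, `999` marks out-of-alphabet characters):

```python
def make_encoder(alphabet, unknown=999):
    """Return a function that maps a word to a list of integer codes."""
    # Position plus one, so that 0 stays free for padding.
    char_to_int = {char: i + 1 for i, char in enumerate(alphabet)}

    def encode(word):
        # dict.get falls back to the reserved code for unseen characters.
        return [char_to_int.get(char, unknown) for char in str(word)]

    return encode


# Small illustrative alphabet, not the one extracted from the training data.
encode = make_encoder(["e", "h", "y"])
print(encode("hey"))
print(encode("heÿ"))
```

For a 30-character alphabet the speed difference is small; the main benefit is that the lookup table is built once and reused for both data sets.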

Next, apply this function to each row of both DataFrames:

In [43]:
df_train.loc[:, "word_int"] = df_train.apply(lambda row: word_encoder(row.word, alphabet), axis = 1)

In [44]:
df_train.head()

Unnamed: 0,class,word,class_bin,word_int
0,noun,gemeinderat,0,"[2, 30, 7, 30, 29, 26, 13, 30, 9, 8, 22]"
1,noun,grenzpolizei,0,"[2, 9, 30, 26, 11, 28, 1, 19, 29, 11, 30, 29]"
2,verb,ruinieren,1,"[9, 27, 29, 26, 29, 30, 9, 30, 26]"
3,noun,halbtönen,0,"[24, 8, 19, 17, 22, 21, 26, 30, 26]"
4,noun,energieexporteuren,0,"[30, 26, 30, 9, 2, 29, 30, 30, 15, 28, 1, 9, 2..."


In [45]:
df_test.loc[:, "word_int"] = df_test.apply(lambda row: word_encoder(row.word, alphabet), axis = 1)

In [46]:
df_test.head()

Unnamed: 0,class,word,class_bin,word_int
0,noun,kaufverpflichtung,0,"[6, 8, 27, 25, 16, 30, 9, 28, 25, 19, 29, 20, ..."
1,verb,kosten,1,"[6, 1, 18, 22, 30, 26]"
2,noun,n,0,[26]
3,noun,blousons,0,"[17, 19, 1, 27, 18, 1, 26, 18]"
4,noun,verwaltungsgeschäfte,0,"[16, 30, 9, 14, 8, 19, 22, 27, 26, 2, 18, 2, 3..."


Next, padding: all feature sequences must have the same length (here, one more than the length of the longest training word). To do so, we use `pad_sequences` from `keras.preprocessing.sequence`:

In [47]:
from keras.preprocessing.sequence import pad_sequences

In [48]:
# how long is the longest word?
max_word = max(len(word) for word in df_train["word_int"])
print(max_word)

31


In [49]:
train_x = pad_sequences(df_train["word_int"], maxlen = max_word + 1) # one position longer than the longest training word, so every row starts with at least one padding 0

In [50]:
train_x.shape

(20000, 32)

In [51]:
# for example
train_x[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  2, 30,  7, 30, 29, 26, 13, 30,  9,  8, 22])

In [52]:
# same for test features
test_x = pad_sequences(df_test["word_int"], maxlen = train_x.shape[1]) # the max length must be the same for train and test
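
By default `pad_sequences` pads with zeros at the front and, when a sequence is longer than `maxlen`, also truncates it from the front. The sketch below reimplements that default behaviour in plain NumPy to make it visible; `pad_left` is a hypothetical helper, not part of Keras:

```python
import numpy as np

def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        kept = seq[-maxlen:]            # keep the last maxlen items
        out[row, -len(kept):] = kept    # right-align them in the row
    return out

padded = pad_left([[26], [6, 1, 18, 22, 30, 26], [1, 2, 3, 4, 5, 6, 7, 8]], maxlen=6)
print(padded)
```

A test word longer than the padded training length would therefore lose its first characters rather than its last ones.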

Finally, one-hot encoding for characters (hence words are encoded into lists of vectors of 0s and 1s). We use `to_categorical` from `keras.utils`:

In [53]:
from keras.utils import to_categorical

In [54]:
train_x_cat = to_categorical(train_x)

In [55]:
train_x_cat.shape

(20000, 32, 31)

In [56]:
# flatten each row: (words, positions, classes) -> (words, positions * classes)
onehot_train = train_x_cat.reshape(train_x_cat.shape[0], -1)

In [57]:
onehot_train.shape

(20000, 992)

In [58]:
# same for test features; fix num_classes so the width matches the training encoding
test_x_cat = to_categorical(test_x, num_classes = train_x_cat.shape[2])

onehot_test = test_x_cat.reshape(test_x_cat.shape[0], -1)

In [59]:
onehot_test.shape

(6561, 992)
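
`to_categorical` infers the number of classes from the largest value it sees unless `num_classes` is given, so a single `999` code in the test set would change the width of the one-hot vectors. The sketch below illustrates one-hot encoding with a fixed width, followed by flattening, in plain NumPy. `one_hot` is a hypothetical stand-in for `to_categorical`, and remapping unknown characters to the index `num_classes - 1` is an assumption, not something this notebook does:

```python
import numpy as np

def one_hot(padded, num_classes):
    """One-hot encode an integer array of shape (n_words, max_len)."""
    # Assumption: any code outside the known range is treated as "unknown"
    # and mapped to the last index.
    clipped = np.where(padded >= num_classes, num_classes - 1, padded)
    # Indexing the identity matrix yields shape (n_words, max_len, num_classes).
    return np.eye(num_classes, dtype="float32")[clipped]

# Toy codes: 0 = padding, 1-3 = letters, 999 = out-of-alphabet character.
padded = np.array([[0, 0, 2, 1],
                   [3, 1, 999, 2]])
encoded = one_hot(padded, num_classes=5)      # index 4 is reserved for "unknown"
flat = encoded.reshape(encoded.shape[0], -1)  # max_len * num_classes values per word
print(encoded.shape, flat.shape)
```

With a fixed width, train and test features keep the same number of columns regardless of which characters happen to occur in each file.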

### Exercise 2
We train and tune a simple feed-forward neural network with `onehot_train` (features) as input and `df_train["class_bin"]` (labels) as output, using `keras`.

First, describe the model:

In [60]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

In [63]:
# model
model = Sequential() # initialize
model.add(Dense(64, activation = "relu", input_dim = onehot_train.shape[1])) # dense layer with ReLU activation
model.add(Dropout(0.5)) # dropout
model.add(Dense(1, activation = "sigmoid")) # single sigmoid unit for binary classification

model.compile(loss = 'binary_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy'])
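
With a single output unit the choice of activation matters: softmax normalizes across the units of a layer, so over one unit it returns 1 for every input, whereas a sigmoid maps the logit to a probability between 0 and 1. A quick NumPy check, with both functions written out by hand for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Normalize over the last axis (the units of the layer).
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Three example logits from a layer with a single unit: shape (3, 1).
logits = np.array([[-2.0], [0.0], [3.0]])

print(softmax(logits).ravel())  # constant, carries no information
print(sigmoid(logits).ravel())  # varies with the logit
```

This is why `binary_crossentropy` is normally paired with one sigmoid unit, while softmax is used with one unit per class.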

In [66]:
# training; fit() takes the features and labels as `x` and `y`
history = model.fit(x = onehot_train,
                    y = df_train["class_bin"],
                    epochs = 5,
                    batch_size = 32,
                    validation_split = 0.2)
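
`validation_split = 0.2` holds out the last 20% of the rows passed to `fit`, taken before any shuffling, so the validation set is only representative if the file is not ordered by class. A sketch of the equivalent manual split in NumPy, using toy arrays in place of `onehot_train` and the labels; the helper name `split_last` is hypothetical:

```python
import numpy as np

def split_last(x, y, fraction=0.2):
    """Hold out the last `fraction` of the rows, without shuffling."""
    n_val = int(round(len(x) * fraction))
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Toy features and labels: 10 rows, 2 columns.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1])

(x_fit, y_fit), (x_val, y_val) = split_last(x, y)
print(x_fit.shape, x_val.shape)
print(y_val)
```

In this toy example both held-out labels belong to the same class, which shows how an unshuffled tail can give a skewed validation set.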