# This talk is about
- `dask` (which is awesome)
- NNs in `keras` (which I've not managed to wrap my head around)

### but mostly about failure

## @mattilyra


# Neural Networks are not easy to understand and implement
- Neural networks are the new state-of-the-art (sort of)
- Easy to use (no faffing around with feature extraction or selection - sort of)

----

### the plan was to
- take the datasets from Felix and apply deep learning
- take already published research and implement their model in `keras`
- alternatively take an already existing implementation and use that
- run NNs on GPUs on AWS and tell everyone how cool everything is!!

### Problems
- very little data (probably won't see an improvement)
    - maybe something interesting will happen with word embeddings

## No experience in applying NNs to anything (other than toy problems during BSc)

### What my time was spent on
- Setting up AWS machinery (1 hour)
    - No GPU because AWS decided I shouldn't be using one
    - I exceeded my "maximum limit of 0 GPU nodes"
    
- Installing software and downloading datasets (3 hours)
- Preprocessing data (6 hours)
- Trying to get `keras` to do _something_ / _anything_ (30 hours)

In [9]:
import dask.bag as db
import json

records = db.read_text('../pydata_hackathon/bundestag/unravelled/*.json').map(json.loads)
records.take(2)

({'filename': 'data/txt/17001.txt',
  'in_writing': False,
  'sequence': 169,
  'sitzung': 1,
  'speaker': None,
  'speaker_cleaned': None,
  'speaker_fp': None,
  'speaker_party': None,
  'text': 'Beifall',
  'type': 'poi',
  'wahlperiode': 17},
 {'filename': 'data/txt/17001.txt',
  'in_writing': False,
  'sequence': 168,
  'sitzung': 1,
  'speaker': 'Präsident Dr. Norbert Lammert',
  'speaker_cleaned': 'Dr. Norbert Lammert',
  'speaker_fp': 'norbert-lammert',
  'speaker_party': None,
  'text': 'im Rahmen eines kleinen Empfangs in der Fraktionsebene Gelegenheit zum Gespräch mit den neu gewählten Mitgliedern des Präsidiums ist.',
  'type': 'chair',
  'wahlperiode': 17})

In [10]:
records.pluck('speaker_cleaned').distinct().count().compute()

3683

In [11]:
speaker_freq = records.pluck('speaker_cleaned').frequencies()
speaker_freq.compute()[:5]

[('', 35),
 ('Meine Damen und Herren, der französische Bischof und Staatsmann Talleyrand sagte einmal',
  3),
 ('Zu den Abläufen hat der damalige Finanzminister gesagt', 10),
 ('Fakt ist', 10),
 ('Ich wollte nur gerade darauf hinweisen, dass es aus meiner Sicht zwei große Schwierigkeiten gibt',
  3)]

In [14]:
speakers = (records
           .pluck('speaker_cleaned')
           .filter(lambda speaker: speaker is not None and 0 < len(speaker) < 20))
speaker_freq = speakers.frequencies()
speaker_freq.topk(10, key=lambda x: x[1]).compute()

[('Dr. Norbert Lammert', 10234),
 ('Petra Pau', 9029),
 ('Eduard Oswald', 5080),
 ('Hubertus Heil', 4131),
 ('Volker Kauder', 4110),
 ('Claudia Roth', 3981),
 ('Gerda Hasselfeldt', 3438),
 ('Ulla Schmidt', 3424),
 ('Elke Ferner', 3310),
 ('Renate Künast', 3212)]

In [26]:
%%timeit
speakers = (records
           .pluck('speaker_cleaned')
           .filter(lambda speaker: speaker is not None and len(speaker) < 20)
           .frequencies()
           .topk(10, key=lambda x: x[1])).compute()

1 loop, best of 3: 7.3 s per loop


In [16]:
import glob
from collections import defaultdict

In [31]:
%%timeit
files = glob.glob('../pydata_hackathon/bundestag/unravelled/*.json')
speaker_freq = defaultdict(lambda: 0)
for f in files:
    with open(f, 'r') as fh:
        for line in fh:
            item = json.loads(line)
            speaker = item['speaker_cleaned']
            if speaker is not None and len(speaker) < 20:
                speaker_freq[speaker] += 1
sorted(list(speaker_freq.items()), key=lambda i: i[1], reverse=True)[:10]

1 loop, best of 3: 11.5 s per loop


- `dask.bag` is faster: 7.3 s vs 11.5 s per loop, roughly 1.6x

#### BUT

- calling `json.loads` per line is expensive
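
For reference, the plain-Python baseline can be written a bit more compactly with `collections.Counter`. This is only a sketch on a few in-memory lines: `top_speakers` and `sample_lines` are made-up names and data, not part of the hackathon code. It still calls `json.loads` once per line, which is the cost noted above.

```python
import json
from collections import Counter


def top_speakers(lines, k=10, max_len=20):
    """Count speakers over JSON lines; every line costs one json.loads call."""
    counts = Counter()
    for line in lines:
        speaker = json.loads(line).get('speaker_cleaned')
        # same filter as the dask version: drop missing, empty and overlong names
        if speaker is not None and 0 < len(speaker) < max_len:
            counts[speaker] += 1
    return counts.most_common(k)


# hypothetical records, shaped like the ones shown earlier
sample_lines = [
    json.dumps({'speaker_cleaned': 'Petra Pau', 'text': 'a'}),
    json.dumps({'speaker_cleaned': 'Petra Pau', 'text': 'b'}),
    json.dumps({'speaker_cleaned': None, 'text': 'Beifall'}),
    json.dumps({'speaker_cleaned': 'Dr. Norbert Lammert', 'text': 'c'}),
]
print(top_speakers(sample_lines, k=2))
```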

![assets/char_cnn_title.png](assets/char_cnn_title.png)

_Character-level Convolutional Networks for Text Classification*_

(Zhang et al., Neural Information Processing Systems 2015)

![assets/char_cnn_model.png](assets/char_cnn_model.png)

_Character-level Convolutional Networks for Text Classification*_

(Zhang et al., Neural Information Processing Systems 2015)

![assets/char_cnn_keymodules.png](assets/char_cnn_keymodules.png)

_Character-level Convolutional Networks for Text Classification*_

(Zhang et al., Neural Information Processing Systems 2015)

In [None]:
# this model is almost certainly wrong - please don't use it

from keras.models import Sequential
from keras.layers.convolutional import Convolution1D
from keras.layers.pooling import MaxPooling1D
# Activation is used below but was never imported
from keras.layers.core import Dense, Flatten, Dropout, Activation

model = Sequential()
model.add(Convolution1D(256, 7, border_mode='same', input_length=1014, input_dim=132))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_length=3, stride=1, border_mode='valid'))

model.add(Convolution1D(256, 7, border_mode='same', input_dim=132))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_length=3, stride=1, border_mode='valid'))

model.add(Convolution1D(256, 3, border_mode='same', input_dim=132))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_length=3, stride=1, border_mode='valid'))

model.add(Convolution1D(256, 3, border_mode='same', input_dim=132))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_length=3, stride=1, border_mode='valid'))

model.add(Convolution1D(256, 3, border_mode='same', input_dim=132))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_length=3, stride=1, border_mode='valid'))

model.add(Convolution1D(256, 3, border_mode='same', input_dim=132))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_length=3, stride=1, border_mode='valid'))

In [None]:
model.add(Flatten())

model.add(Dense(1024, init='normal',  # input size is inferred from Flatten()
    activation='relu', weights=None, W_regularizer=None,
    b_regularizer=None, activity_regularizer=None, W_constraint=None,
    b_constraint=None, bias=True))

model.add(Dropout(0.5))

model.add(Dense(1024, init='normal',
    activation='relu', weights=None, W_regularizer=None,
    b_regularizer=None, activity_regularizer=None, W_constraint=None,
    b_constraint=None, bias=True))

model.add(Dropout(0.5))

n_classes = len(records.pluck('speaker_party').distinct().compute())
model.add(Dense(n_classes, init='normal',  # input size is inferred from the previous layer
    activation='softmax', weights=None, W_regularizer=None,
    b_regularizer=None, activity_regularizer=None, W_constraint=None,
    b_constraint=None, bias=True))

In [None]:
from keras.optimizers import SGD

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

# data preprocessing ...

- character quantization to 1-hot character vectors
- truncate all documents to N=1014 characters
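
A minimal sketch of what these two steps could look like with `numpy`, plus one-hot targets for the softmax layer. The `ALPHABET` below is an assumption made up for illustration (the model uses `input_dim=132`, but the actual 132 symbols are not listed here), and `quantize` and `one_hot_labels` are hypothetical helper names. The model input then has shape `(n_documents, 1014, alphabet_size)`, and a softmax output trained with `categorical_crossentropy` needs targets of shape `(n_documents, n_classes)`.

```python
import numpy as np

# Assumed alphabet, for illustration only; not the 132 symbols used in the talk.
ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß0123456789 .,;:!?-'\"()"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 1014


def quantize(text, max_len=MAX_LEN):
    """Return a (max_len, len(ALPHABET)) one-hot matrix for `text`."""
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    # lowercase, truncate to max_len characters; shorter texts stay zero-padded
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the alphabet stay all-zero
            encoded[pos, idx] = 1.0
    return encoded


def one_hot_labels(labels):
    """Map labels to a (n_samples, n_classes) one-hot matrix."""
    classes = sorted(set(labels), key=str)  # key=str so None sorts next to strings
    index = {label: i for i, label in enumerate(classes)}
    y = np.zeros((len(labels), len(classes)), dtype=np.float32)
    y[np.arange(len(labels)), [index[label] for label in labels]] = 1.0
    return y, classes


texts = ['Beifall', 'Fakt ist']  # hypothetical documents
parties = ['spd', None]          # hypothetical labels

X = np.stack([quantize(t) for t in texts])
y, classes = one_hot_labels(parties)
print(X.shape, y.shape)
```

For integer class ids, Keras also ships `keras.utils.np_utils.to_categorical`, which builds the same kind of target matrix.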

```python
/home/ubuntu/.conda/envs/keras/lib/python3.5/site-packages/keras/engine/training.py in standardize_input_data(data, names, shapes, check_batch_dim, exception_prefix)
    106                                         ' to have shape ' + str(shapes[i]) +
    107                                         ' but got array with shape ' +
--> 108                                         str(array.shape))
    109     return arrays
    110 

Exception: Error when checking model target: expected dense_21 to have shape (None, 22) but got array with shape (1, 1)
```

![](assets/sentence_cnn_title.png)

_Convolutional Neural Networks for Sentence Classification_

(Kim, Empirical Methods in Natural Language Processing 2014)


![](./assets/sentence_cnn_model.png)

https://github.com/bwallace/CNN-for-text-classification (Keras)

# data preprocessing ...

- truncate all sentences to N=30 words
- get word vectors from `spacy` and stack sentences into
  - `1x9000` matrices (doesn't work)
  - `30x300` matrices (doesn't work)
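
A small `numpy` sketch of the candidate input layouts, assuming 11 sentences of 30 words with 300-dimensional vectors. The random numbers only stand in for the `spacy` vectors, and the variable names are made up. Which layout a model accepts depends on its first layer: as far as I understand, a 1D convolution takes `(batch, steps, channels)`, a `Dense` layer takes `(batch, features)`, and an `Embedding` layer takes a 2D array of integer word ids, `(batch, steps)`.

```python
import numpy as np

n_sentences, n_words, vec_dim = 11, 30, 300

# stand-in for spaCy word vectors (random numbers, shape illustration only)
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(n_sentences, n_words, vec_dim)).astype(np.float32)

# (batch, steps, channels): one 30x300 matrix per sentence
X_seq = word_vectors

# (batch, features): each sentence flattened into a single 9000-dim row
X_flat = word_vectors.reshape(n_sentences, n_words * vec_dim)

# (batch, steps): integer word ids instead of vectors (vocabulary size 5000 is an assumption)
X_ids = rng.integers(0, 5000, size=(n_sentences, n_words))

for name, arr in [('X_seq', X_seq), ('X_flat', X_flat), ('X_ids', X_ids)]:
    print(name, arr.shape, arr.ndim)
```

Only `X_flat` and `X_ids` are two-dimensional, so a model that reports "expected input to have 2 dimensions" is asking for one of those, not for the stacked `30x300` matrices.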

```python

/usr/data/miniconda3/envs/keras/lib/python3.5/site-packages/keras/engine/training.py in standardize_input_data(data, names, shapes, check_batch_dim, exception_prefix)
     95                                 ' to have ' + str(len(shapes[i])) +
     96                                 ' dimensions, but got array with shape ' +
---> 97                                 str(array.shape))
     98             for j, (dim, ref_dim) in enumerate(zip(array.shape, shapes[i])):
     99                 if not j and not check_batch_dim:

Exception: Error when checking model input: expected input to have 2 dimensions, but got array with shape (11, 30, 300)
```

- Never figured out what the dimensionality of the input matrix `X` needs to be!!

# Finally - a comparison 



```
IMDB Movie Reviews (Accuracy)

Keras example CNN model: 0.89
SVM (SGD): 0.88
MNB: 0.84
```

- The comparison isn't really fair to any of the models
  - the NN doesn't have enough training data
  - the document representation is not very good for SVM and NB

# WHYY?!

- can someone please tell me what is going on?
- was I wrong to use `keras`?
- good resources:
    - http://wildml.com
    - http://github.com (but you need to find the correct repo)

- https://github.com/zhangxiangxiao/Crepe (Torch 7)
- https://github.com/yoonkim/CNN_sentence (Theano)
- https://github.com/bwallace/CNN-for-text-classification (Keras)
- https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras (Keras)
- http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ (TensorFlow)

## @mattilyra