# Natural Language Processing using Artificial Neural Networks

> “In God we trust. All others must bring data.” – W. Edwards Deming, statistician

# Word Embeddings

### What?
Convert words to vectors in a high-dimensional space. Each dimension can be thought of as capturing one aspect of a word, such as gender or the type of object it denotes.

"Word embeddings" are a family of natural language processing techniques aiming at mapping semantic meaning into a geometric space. This is done by associating a numeric vector to every word in a dictionary, such that the distance (e.g. L2 distance or more commonly cosine distance) between any two vectors would capture part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space.



### Why?
By converting words to vectors we can express relations between words numerically. The more similar two words are along a dimension, the closer their values in that dimension.
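Closeness between word vectors is usually measured with cosine similarity or L2 distance. The following is a minimal NumPy sketch; the three vectors are hand-picked, hypothetical values used only for illustration and do not come from any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog), cosine_similarity(cat, car))
print(l2_distance(cat, dog), l2_distance(cat, car))
```

With these toy values, `cat` and `dog` have a higher cosine similarity and a smaller L2 distance than `cat` and `car`.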

### Example
_W(green) = (1.2, 0.98, 0.05, ...)_

_W(red) = (1.1, 0.2, 0.5, ...)_

Here the values of _green_ and _red_ are very similar in the first dimension (1.2 vs. 1.1) because both are colours. The values in the second dimension are very different (0.98 vs. 0.2), perhaps because red is associated with something negative in the training data while green is used for something positive.

By vectorizing, we indirectly build different kinds of relations between words.
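The per-dimension comparison of _green_ and _red_ can be checked directly. This sketch uses only the first three components given in the example; the remaining components are elided in the text and are not guessed here.

```python
import numpy as np

# Only the first three components are given in the example; the rest are elided.
w_green = np.array([1.2, 0.98, 0.05])
w_red = np.array([1.1, 0.2, 0.5])

# Absolute gap between the two words in each dimension
per_dim_gap = np.abs(w_green - w_red)
closest_dim = int(np.argmin(per_dim_gap))   # dimension where the words agree most
farthest_dim = int(np.argmax(per_dim_gap))  # dimension where they differ most

print(per_dim_gap, closest_dim, farthest_dim)
```

The smallest gap is in dimension 0 and the largest in dimension 1, matching the description of the example.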

## Example of `word2vec` using gensim

In [1]:
from gensim.models import word2vec
from gensim.models.word2vec import Word2Vec

### Reading blog posts from the data directory

In [2]:
import os
import pickle

In [3]:
DATA_DIRECTORY = os.path.join(os.path.abspath(os.path.curdir), '..', 
                              'data', 'word_embeddings')

In [4]:
male_posts = []
female_posts = []

In [5]:
# Despite the .txt extension, the files are pickled, hence binary mode
with open(os.path.join(DATA_DIRECTORY, "male_blog_list.txt"), "rb") as male_file:
    male_posts = pickle.load(male_file)

with open(os.path.join(DATA_DIRECTORY, "female_blog_list.txt"), "rb") as female_file:
    female_posts = pickle.load(female_file)

In [6]:
print(len(female_posts))
print(len(male_posts))

2252
2611


In [7]:
filtered_male_posts = list(filter(lambda p: len(p) > 0, male_posts))
filtered_female_posts = list(filter(lambda p: len(p) > 0, female_posts))
posts = filtered_female_posts + filtered_male_posts

In [8]:
print(len(filtered_female_posts), len(filtered_male_posts), len(posts))

2247 2595 4842


## Word2Vec

In [9]:
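# 200-dimensional vectors; min_count=1 keeps every token, however rare
w2v = Word2Vec(size=200, min_count=1)
# Build the vocabulary from the first 100 posts, tokenized on whitespace
w2v.build_vocab([post.split() for post in posts[:100]])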
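To make the effect of whitespace tokenization and `min_count` concrete, here is a plain-Python sketch. `build_vocabulary` is a hypothetical helper written for illustration; it is not gensim's implementation, which does considerably more. The two documents are toy strings.

```python
from collections import Counter

def build_vocabulary(documents, min_count=1):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter(token for doc in documents for token in doc.split())
    return {token: n for token, n in counts.items() if n >= min_count}

# Toy documents, for illustration only
docs = ["My husband bought me a notepad", "my notepad is funny"]

vocab = build_vocabulary(docs, min_count=1)
print(sorted(vocab))
print(build_vocabulary(docs, min_count=2))
```

Because `str.split()` neither lowercases nor strips punctuation, `My` and `my` are separate vocabulary entries. This matters when looking up tokens such as `'I'` and `'My'` later.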

In [13]:
w2v.similarity('I', 'My')

-0.0017214040437083416

In [14]:
print(posts[5])
w2v.similarity('ring', 'husband')

I've tried starting blog after blog and it just never feels right.  Then I read today that it feels strange to most people, but the more you do it the better it gets (hmm, sounds suspiciously like something else!) so I decided to give it another try.    My husband bought me a notepad at  urlLink McNally  (the best bookstore in Western Canada) with that title and a picture of a 50s housewife grinning desperately.  Each page has something funny like "New curtains!  Hurrah!".  For some reason it struck me as absolutely hilarious and has stuck in my head ever since.  What were those women thinking?


  


-0.0081163338784362837

In [15]:
w2v.similarity('ring', 'housewife')

-0.015692758390509903

In [16]:
w2v.similarity('women', 'housewife')  # Diversity friendly

-0.033483647840445246
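All of the similarity values printed in this section are close to zero. One plausible reading is that the vectors were still at their random initialization when these calls ran. That is an assumption: no training step appears in the cells shown here. The NumPy sketch below does not use gensim; it only shows that random 200-dimensional vectors tend to have cosine similarity near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200  # same dimensionality as size=200 in the Word2Vec model

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1000 pairs of random vectors standing in for untrained embeddings
sims = np.array([
    cosine(rng.uniform(-1, 1, dim), rng.uniform(-1, 1, dim))
    for _ in range(1000)
])

print(sims.mean(), sims.std())
```

The mean is close to zero and the spread is on the order of `1 / sqrt(dim)`, so values such as -0.03 are what random vectors would produce. With gensim, learning meaningful vectors requires calling the model's `train` method (or passing the corpus to the constructor) after building the vocabulary.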

## Doc2Vec

The same technique used in word2vec is extended to documents: in addition to learning a vector for every word, we also learn a vector for every document.
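As a rough illustration of vectorizing the documents too, the sketch below shows the forward pass of one common formulation of paragraph vectors (the "distributed memory" variant), in which a per-document vector is averaged with the context word vectors to predict the next word. All names, sizes and the random weights are hypothetical; this is untrained NumPy code, not gensim's `Doc2Vec`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["my", "husband", "bought", "me", "a", "notepad"]  # toy vocabulary
n_docs, dim = 2, 8

W = rng.normal(scale=0.1, size=(len(vocab), dim))  # one row per word
D = rng.normal(scale=0.1, size=(n_docs, dim))      # one row per document
U = rng.normal(scale=0.1, size=(dim, len(vocab)))  # output projection

def predict_next_word(doc_id, context_ids):
    """Average the document vector with the context word vectors, then softmax."""
    h = np.vstack([D[doc_id][None, :], W[context_ids]]).mean(axis=0)
    logits = h @ U
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

probs = predict_next_word(doc_id=0, context_ids=[0, 1, 2])
print(probs.shape, probs.sum())
```

During training, gradients would update `W` and `U` as in word2vec and additionally the row of `D` belonging to the current document, which is how each document ends up with its own vector.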

In [17]:
import numpy as np

In [18]:
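# 0 for male, 1 for female.
# `posts` was built as female posts followed by male posts,
# so the labels must follow the same order.
y_posts = np.concatenate((np.ones(len(filtered_female_posts)),
                          np.zeros(len(filtered_male_posts))))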

In [19]:
len(y_posts)

4842

# Convolutional Neural Networks for Sentence Classification

Train a convolutional network for sentiment analysis.

Based on
"Convolutional Neural Networks for Sentence Classification" by Yoon Kim
http://arxiv.org/pdf/1408.5882v2.pdf

'CNN-non-static' reaches 82.1% after 61 epochs with the following settings:

- embedding_dim = 20
- filter_sizes = (3, 4)
- num_filters = 3
- dropout_prob = (0.7, 0.8)
- hidden_dims = 100

'CNN-rand' reaches 78-79% after 7-8 epochs with the following settings:

- embedding_dim = 20
- filter_sizes = (3, 4)
- num_filters = 150
- dropout_prob = (0.25, 0.5)
- hidden_dims = 150

'CNN-static' reaches 75.4% after 7 epochs with the following settings:

- embedding_dim = 100
- filter_sizes = (3, 4)
- num_filters = 150
- dropout_prob = (0.25, 0.5)
- hidden_dims = 150

It turns out that a data set as small as "Movie reviews with one sentence per review" (Pang and Lee, 2005) requires a much smaller network than the one introduced in the original article:

- the embedding dimension is only 20 (instead of 300; 'CNN-static' still requires ~100)
- 2 filter sizes (instead of 3)
- higher dropout probabilities
- 3 filters per filter size are enough for 'CNN-non-static' (instead of 100)
- embedding initialization does not require prebuilt Google Word2Vec data; training Word2Vec on the same "Movie reviews" data set is enough to achieve the performance reported in the article (81.6%)

Another distinct difference is a sliding MaxPooling window of length 2 instead of MaxPooling over the whole feature map, as in the article.
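The difference between the two pooling schemes can be sketched in NumPy. The feature map is a hypothetical output of a single convolution filter; the windowed version assumes a stride equal to the window length, which is the default for Keras `MaxPooling1D`.

```python
import numpy as np

# Hypothetical feature map from one convolution filter: 6 time steps
feature_map = np.array([0.1, 0.7, 0.3, 0.9, 0.2, 0.4])

# Max pooling over the whole feature map (as in the article): one value per filter
global_max = feature_map.max()

# Window of length 2 moved in steps of 2: one value per window
windowed_max = feature_map.reshape(-1, 2).max(axis=1)

print(global_max)    # 0.9
print(windowed_max)  # [0.7 0.9 0.4]
```

Windowed pooling keeps some positional information (three values here instead of one), at the cost of more inputs to the dense layer that follows.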

In [20]:
import numpy as np

# Helper modules for data loading and Word2Vec training (not part of Keras)
import word_embedding
from word2vec import train_word2vec

from keras.models import Sequential, Model
from keras.layers import (Activation, Concatenate, Dense, Dropout, Embedding,
                          Flatten, Input,
                          Conv1D, MaxPooling1D)

np.random.seed(2)

Using TensorFlow backend.


### Parameters

Model variations. See Section 3 of Yoon Kim's "Convolutional Neural Networks for
Sentence Classification" for details.

In [21]:
model_variation = 'CNN-rand'  #  CNN-rand | CNN-non-static | CNN-static
print('Model variation is %s' % model_variation)

Model variation is CNN-rand


In [22]:
# Model Hyperparameters
sequence_length = 56
embedding_dim = 20          
filter_sizes = (3, 4)
num_filters = 150
dropout_prob = (0.25, 0.5)
hidden_dims = 150

In [23]:
# Training parameters
batch_size = 32
num_epochs = 100
val_split = 0.1
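With `val_split = 0.1`, Keras holds out the last 10% of the samples passed to `fit` for validation, taken before any per-epoch shuffling. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper, and the total of 10662 is simply the sum of the train and validation counts reported in the training log.

```python
def split_sizes(n_samples, validation_split):
    """Sizes of the train and validation parts when the last fraction is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

# 10662 = 9595 + 1067, the sample counts reported in the training log
n_train, n_val = split_sizes(10662, 0.1)
print(n_train, n_val)  # 9595 1067
```

Because the held-out part is always the tail of the arrays, the data should be shuffled before calling `fit`, which this notebook does in its "Shuffle data" cell.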

In [24]:
# Word2Vec parameters, see train_word2vec
min_word_count = 1  # Minimum word count                        
context = 10        # Context window size    
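The body of `train_word2vec` is not shown here, so the following only illustrates what a context window size means in word2vec-style training: each word is paired with the words at most `window` positions away. `context_pairs` is a hypothetical helper, and the sentence is a toy example.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

tokens = "the movie was surprisingly good".split()
pairs = list(context_pairs(tokens, window=2))
print(len(pairs))
print(pairs[:4])
```

With `window=2`, "the" is paired with "movie" and "was" but not with "good". A window of 10, as configured here, covers most of a short review sentence.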

### Data Preparation 

In [25]:
# Load data
print("Loading data...")
x, y, vocabulary, vocabulary_inv = word_embedding.load_data()

if model_variation=='CNN-non-static' or model_variation=='CNN-static':
    embedding_weights = train_word2vec(x, vocabulary_inv, 
                                       embedding_dim, min_word_count, 
                                       context)
    if model_variation=='CNN-static':
        x = embedding_weights[0][x]
elif model_variation=='CNN-rand':
    embedding_weights = None
else:
    raise ValueError('Unknown model variation')    

Loading data...
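For 'CNN-static', the line `x = embedding_weights[0][x]` replaces every word index with its embedding vector before training. The sketch below shows that lookup with NumPy fancy indexing. It assumes `embedding_weights[0]` is a matrix with one row per vocabulary index; the matrix and the index array are toy values.

```python
import numpy as np

vocab_size, embedding_dim = 5, 4

# Hypothetical embedding matrix: one row per vocabulary index
weights = np.arange(vocab_size * embedding_dim, dtype=float).reshape(vocab_size, embedding_dim)

# Two sentences, each a sequence of three word indices
x = np.array([[0, 2, 4],
              [1, 1, 3]])

# Indexing with an integer array looks up one row per index
x_embedded = weights[x]
print(x_embedded.shape)  # (2, 3, 4)
```

The result has shape `(n_sentences, sequence_length, embedding_dim)`, which is why the 'CNN-static' model needs no `Embedding` layer and its vectors are never updated during training.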


In [26]:
# Shuffle data
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices].argmax(axis=1)
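The shuffle step applies one permutation to both arrays so that inputs and labels stay aligned, and `argmax(axis=1)` turns one-hot rows into class indices. This is a toy sketch; it assumes `y` is one-hot encoded with one column per class, which is what the `argmax` call suggests.

```python
import numpy as np

np.random.seed(2)

x = np.array([[10, 11], [20, 21], [30, 31]])  # toy inputs
y = np.array([[1, 0], [0, 1], [1, 0]])        # toy one-hot labels

idx = np.random.permutation(len(y))
x_s = x[idx]                  # rows reordered by the permutation
y_s = y[idx].argmax(axis=1)   # same order, one-hot rows -> class indices

print(idx, y_s)
```

Indexing both arrays with the same `idx` is what keeps each label attached to its sentence.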

In [27]:
print("Vocabulary Size: {:d}".format(len(vocabulary)))

Vocabulary Size: 18765


### Building CNN Model

In [28]:
# Convolutional sub-model: one Conv1D + MaxPooling1D branch per filter size
graph_in = Input(shape=(sequence_length, embedding_dim))
convs = []
for fsz in filter_sizes:
    conv = Conv1D(filters=num_filters,
                  kernel_size=fsz,
                  padding='valid',
                  activation='relu',
                  strides=1)(graph_in)
    pool = MaxPooling1D(pool_size=2)(conv)
    flatten = Flatten()(pool)
    convs.append(flatten)

# Concatenate needs at least two inputs
if len(filter_sizes) > 1:
    out = Concatenate()(convs)
else:
    out = convs[0]

graph = Model(inputs=graph_in, outputs=out)

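# Main sequential model
model = Sequential()
# 'CNN-static' inputs are already embedded during data preparation,
# so only the other variations need an Embedding layer
if model_variation != 'CNN-static':
    model.add(Embedding(len(vocabulary), embedding_dim, input_length=sequence_length,
                        weights=embedding_weights))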
model.add(Dropout(dropout_prob[0], input_shape=(sequence_length, embedding_dim)))
model.add(graph)
model.add(Dense(hidden_dims))
model.add(Dropout(dropout_prob[1]))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
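The number of features that the convolutional sub-model passes to the dense layer follows from the hyperparameters. The arithmetic below assumes 'valid' padding with stride 1 and non-overlapping pooling windows, as configured in this notebook; `conv_branch_features` is a hypothetical helper.

```python
def conv_branch_features(sequence_length, kernel_size, num_filters, pool_size=2):
    """Flattened output size of one Conv1D -> MaxPooling1D -> Flatten branch."""
    conv_steps = sequence_length - kernel_size + 1  # 'valid' padding, stride 1
    pooled_steps = conv_steps // pool_size          # non-overlapping windows
    return pooled_steps * num_filters

# sequence_length = 56, num_filters = 150, filter_sizes = (3, 4)
sizes = [conv_branch_features(56, k, 150) for k in (3, 4)]
print(sizes, sum(sizes))  # [4050, 3900] 7950
```

With the 'CNN-rand' settings, the concatenated vector therefore has 7950 entries, which a `Dense(150)` layer then reduces.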

  


In [29]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', 
              metrics=['accuracy'])

# Train the model
model.fit(x_shuffled, y_shuffled, batch_size=batch_size,
          epochs=num_epochs, validation_split=val_split, verbose=2)



Train on 9595 samples, validate on 1067 samples
Epoch 1/100
 - 8s - loss: 0.6496 - acc: 0.6049 - val_loss: 0.5656 - val_acc: 0.7160
Epoch 2/100
 - 6s - loss: 0.4559 - acc: 0.7921 - val_loss: 0.5143 - val_acc: 0.7563
Epoch 3/100
 - 6s - loss: 0.3546 - acc: 0.8493 - val_loss: 0.5100 - val_acc: 0.7741
Epoch 4/100
 - 6s - loss: 0.2938 - acc: 0.8795 - val_loss: 0.5208 - val_acc: 0.7713
Epoch 5/100
 - 5s - loss: 0.2418 - acc: 0.9030 - val_loss: 0.5642 - val_acc: 0.7798
Epoch 6/100
 - 6s - loss: 0.1969 - acc: 0.9260 - val_loss: 0.6212 - val_acc: 0.7788
Epoch 7/100
 - 6s - loss: 0.1563 - acc: 0.9378 - val_loss: 0.6965 - val_acc: 0.7423
Epoch 8/100
 - 6s - loss: 0.1278 - acc: 0.9530 - val_loss: 0.7444 - val_acc: 0.7610
Epoch 9/100
 - 6s - loss: 0.1027 - acc: 0.9622 - val_loss: 0.8164 - val_acc: 0.7610
Epoch 10/100
 - 6s - loss: 0.0739 - acc: 0.9732 - val_loss: 0.9005 - val_acc: 0.7488
Epoch 11/100
 - 5s - loss: 0.0593 - acc: 0.9783 - val_loss: 1.0114 - val_acc: 0.7563
Epoch 12/100
 - 5s - loss:

Epoch 97/100
 - 6s - loss: 1.5686e-05 - acc: 1.0000 - val_loss: 3.1914 - val_acc: 0.7526
Epoch 98/100
 - 6s - loss: 0.0015 - acc: 0.9998 - val_loss: 3.1834 - val_acc: 0.7507
Epoch 99/100
 - 6s - loss: 0.0025 - acc: 0.9995 - val_loss: 3.2276 - val_acc: 0.7498
Epoch 100/100
 - 6s - loss: 8.2668e-05 - acc: 1.0000 - val_loss: 3.1403 - val_acc: 0.7582


<keras.callbacks.History at 0x7f1eca36e5c0>
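The log shows training accuracy approaching 1.0 while validation loss climbs from about 0.51 at epoch 3 to above 3 by epoch 100, a typical sign of overfitting. The plain-Python sketch below applies a simple patience-based stopping rule to the validation losses of the first eleven epochs. `early_stop_epoch` is a hypothetical helper, not Keras code; Keras offers `keras.callbacks.EarlyStopping` (with `monitor` and `patience` arguments) for the same purpose.

```python
# val_loss for epochs 1-11, copied from the training log
val_loss = [0.5656, 0.5143, 0.5100, 0.5208, 0.5642, 0.6212,
            0.6965, 0.7444, 0.8164, 0.9005, 1.0114]

def early_stop_epoch(losses, patience):
    """Return (stop_epoch, best_epoch), 1-based, for a patience-based stopping rule."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(losses), best_epoch

stop_epoch, best_epoch = early_stop_epoch(val_loss, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

With a patience of 3, training would have stopped after epoch 6, with the lowest validation loss recorded at epoch 3.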

# Another Example

Using Keras + [**GloVe**](http://nlp.stanford.edu/projects/glove/) - **Global Vectors for Word Representation**