# Gensim

Gensim is a Python library for statistical semantics. It has scalable implementations of:

* word2vec
* paragraph2vec
* Collocation detection
* Corpus tools
* and much more...

Here we will see a simple example of how to train your own word2vec model using gensim.

In [4]:
from gensim.models import Word2Vec

import re # Regular expressions package

To train our word2vec model, we need a corpus. We will use the `big.txt` file from Peter Norvig's website (http://norvig.com/big.txt). Run the following command on the file before continuing:

```bash
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' big.txt | sed 's/,/ /g' > out.txt
```

This command replaces every newline and comma with a space and saves the result to `out.txt`.
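
If `sed` is not available, roughly the same preprocessing can be done in Python. The sketch below is an approximate equivalent rather than an exact port of the command; `flatten_text` is a hypothetical helper name, and the commented-out file paths are placeholders.

```python
def flatten_text(text):
    """Replace every newline and comma with a single space."""
    return text.replace('\n', ' ').replace(',', ' ')


# Toy input standing in for the contents of big.txt
sample = "The quick, brown fox\njumps over the lazy dog.\n"
print(flatten_text(sample))

# Assumed usage on the real file (paths are placeholders):
# with open('big.txt') as fin, open('out.txt', 'w') as fout:
#     fout.write(flatten_text(fin.read()))
```

Unlike the `sed` pipeline, this version also replaces the final trailing newline, which makes no difference for the tokenization that follows.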

Now let's read our corpus, split it into sentences and tokenize it:

In [5]:
with open('/home/andfre/Downloads/out.txt') as fin:
    raw_text = fin.read()

# Split into sentences on runs of '.', '!' and '?'
# (raw string, so the backslashes are not treated as string escapes)
sentences = re.split(r'\?+!+|!+\?+|\.+|!+|\?+', raw_text)

# Get rid of empty sentences
sentences = [s.strip() for s in sentences if len(s.strip()) > 0]

# Tokenize sentences (simple space tokenizer) and lower case them
sentences = [[w.lower() for w in s.split()] for s in sentences]
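
To see what these three steps produce, here is a small sketch that applies the same pattern as the cell above to a toy string. The `preprocess` function and the sample text are only for illustration.

```python
import re

# Runs of '.', '!' and '?' (including mixes such as '?!') end a sentence
SENTENCE_END = re.compile(r'\?+!+|!+\?+|\.+|!+|\?+')


def preprocess(raw_text):
    """Split raw text into sentences, then into lower-cased whitespace tokens."""
    pieces = SENTENCE_END.split(raw_text)
    pieces = [p.strip() for p in pieces if p.strip()]
    return [[w.lower() for w in p.split()] for p in pieces]


toy_sentences = preprocess("Hello world! How are you?! Fine... Thanks.")
print(toy_sentences)
# [['hello', 'world'], ['how', 'are', 'you'], ['fine'], ['thanks']]
```

Note that other punctuation, such as quotes or semicolons, stays attached to the tokens with this simple tokenizer.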

Now that we have our corpus, we can train our word2vec model.

In [6]:
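# size: vector dimensionality (renamed `vector_size` in gensim 4.0); window: context size;
# min_count: ignore words rarer than this; workers: number of training threads
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
fname = '/home/andfre/w2v_model'
model.save(fname)
model = Word2Vec.load(fname)  # you can continue training with the loaded model!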

In [7]:
model = Word2Vec(sentences) #Using default parameters
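
The `min_count=5` argument in the first training call tells gensim to ignore words that occur fewer than five times in the corpus. As a rough illustration of the idea only (this is not gensim's implementation, and `prune_vocabulary` is a hypothetical helper), the filtering can be sketched as:

```python
from collections import Counter


def prune_vocabulary(tokenized_sentences, min_count):
    """Drop every word that occurs fewer than `min_count` times in the corpus."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    keep = {w for w, c in counts.items() if c >= min_count}
    return [[w for w in sent if w in keep] for sent in tokenized_sentences]


toy_corpus = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'bird', 'flew']]
print(prune_vocabulary(toy_corpus, min_count=2))
# [['the', 'sat'], ['the', 'sat'], ['the']]
```

Rare words have too few contexts to get a reliable vector, so dropping them usually shrinks the model without hurting quality much.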

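When our model is trained we can perform various syntactic/semantic NLP word tasks. In newer gensim releases (4.0 and later) these queries are made through `model.wv`, for example `model.wv.most_similar('power')`; the calls below use the older interface. For example: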

In [8]:
model.most_similar('power')

[('law', 0.9024247527122498),
 ('force', 0.8977823853492737),
 ('action', 0.8927546143531799),
 ('constitution', 0.8727369904518127),
 ('laws', 0.8689889907836914),
 ('government', 0.868581235408783),
 ('free', 0.8610345721244812),
 ('state', 0.8484005331993103),
 ('events', 0.8425267338752747),
 ('effect', 0.8419129848480225)]

In [9]:
model.doesnt_match(['night', 'day', 'job', 'afternoon'])

'job'
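
One way to think about `doesnt_match` is: pick the word whose vector is least similar to the average of all the given vectors. The sketch below implements that idea on hand-made 2-d vectors. It is an assumption about the general approach, not a copy of gensim's code, and `odd_one_out` and the toy vectors are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dim)])
    # The dot product of two unit vectors is their cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hand-made 2-d vectors, purely illustrative (not taken from a trained model)
toy_vectors = {
    'night': [0.9, 0.1],
    'day': [1.0, 0.2],
    'afternoon': [0.8, 0.15],
    'job': [0.1, 1.0],
}
print(odd_one_out(['night', 'day', 'job', 'afternoon'], toy_vectors))  # job
```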

In [10]:
model.similarity('man', 'woman')

0.84561689182326416
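
The number returned by `similarity` is the cosine similarity of the two word vectors: 1 means they point in the same direction, 0 means they are orthogonal. A minimal pure-Python sketch of that formula, on made-up vectors:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors, 0.0
```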

In [11]:
model['horse'] # Raw numpy vector

array([-0.13624771, -0.02266871,  0.63472003, -0.11907553, -0.22339933,
       -0.29526153,  0.73241997,  0.21248201, -0.61262256, -0.11415902,
       -0.26086268, -1.11159909,  0.61068088, -0.79289532,  0.04550897,
        0.33390293, -0.71127284, -0.09242124, -0.33723864,  0.43697152,
        0.55743402, -1.02065897, -0.29523399,  0.58874989, -1.2041502 ,
        0.64604473,  0.70840406, -0.03391901, -0.10667828,  0.7500543 ,
        0.50467169, -0.8855266 ,  0.1187309 ,  0.59485567, -1.5292424 ,
       -0.52705014, -0.15584329,  0.32358038,  1.29023087,  0.02368196,
       -0.42170233,  0.24598011, -0.15846446, -0.0611285 ,  0.0333156 ,
        0.41311353,  0.02710177,  0.96595174, -0.77029204, -0.23648697,
        0.41375861, -0.67001677,  0.8149513 , -0.01893596, -0.68569922,
        0.42287305,  0.94769877,  1.10255313,  0.00508502,  0.93396974,
        0.54887873, -0.71612567,  0.76724315,  0.42328995,  0.34710354,
        0.11477941,  0.4066942 , -0.16123356,  0.3841711 , -0.05

If we are satisfied with our model, we can normalize the vectors and save them.

In [12]:
model.init_sims(replace=True)  # L2-normalize in place; the model cannot be trained further afterwards
model.save_word2vec_format('/home/andfre/wordvecs.txt')
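
To make the two steps concrete, the sketch below normalizes toy vectors to unit length and writes them in the plain-text word2vec format (a header line, then one word per line followed by its components). The helper `write_word2vec_text` and the toy vectors are assumptions for illustration; gensim's own writer may format the numbers differently.

```python
import io
import math


def write_word2vec_text(vectors, out):
    """Write unit-length vectors in the plain-text word2vec format."""
    dim = len(next(iter(vectors.values())))
    out.write('%d %d\n' % (len(vectors), dim))  # header: vocabulary size, dimension
    for word, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        out.write(word + ' ' + ' '.join('%.6f' % (x / norm) for x in vec) + '\n')


# Toy 2-d vectors, purely illustrative
toy_vectors = {'horse': [3.0, 4.0], 'power': [1.0, 0.0]}
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
print(buffer.getvalue())
# 2 2
# horse 0.600000 0.800000
# power 1.000000 0.000000
```

The first line of this format is a header with the vocabulary size and the vector dimensionality, so a reader that expects one word per line needs to skip it.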

In [None]:
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.Word2Vec.load_word2vec_format('/home/andfre/Downloads/w2v/GoogleNews-vectors-negative300.bin', binary=True) 

In [None]:
model.most_similar('power')

### Exercise

* Download the shakespeare.txt file from Norvig's blog (http://norvig.com/ngrams/shakespeare.txt) and train a word2vec model in the same way we did before.

* Search the gensim documentation for how to retrieve the k nearest neighbors of the vector:
$$ \vec{woman} + \vec{king} - \vec{man} $$

* Compute the k nearest neighbors for both models (big.txt and shakespeare.txt). What happens? Does one model perform better than the other (by finding the vector for 'queen')? Can you explain the disparity between the results?

In [None]:
# PUT YOUR CODE HERE

# Keras for NLP

Last lecture we learned about Theano and Keras, tools to build and train neural networks. Now let's take a look at how to use these tools for NLP tasks.


### Padding

We will only be using Keras' API, but since Keras is built on top of Theano, it also inherits its limitations. One limitation that is very important for NLP tasks is that Theano can only deal with full (rectangular) tensors. That means you can't have a matrix whose first row has length 20 and whose second row has length 30. This is a problem for NLP because we normally deal with sentences, which vary in length by nature.

A simple workaround for this problem is to pad every sentence with 0s, so that they all have the same length. Keras provides a tool for that.

In [13]:
from keras.preprocessing.sequence import pad_sequences
sentences = [ # Words indexes
             [3, 2, 25, 2],
             [1, 74],
             [3, 2, 6, 3, 2, 7]
             ]
pad_sequences(sentences)

Using Theano backend.


array([[ 0,  0,  3,  2, 25,  2],
       [ 0,  0,  0,  0,  1, 74],
       [ 3,  2,  6,  3,  2,  7]], dtype=int32)

Now our sentences can be used for training.
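
By default `pad_sequences` adds the zeros at the front of each sequence, and when a `maxlen` is given it also cuts longer sequences from the front. The pure-Python sketch below mimics that default behaviour on the same toy input; `pad_left` is a hypothetical helper and returns lists rather than a numpy array.

```python
def pad_left(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) lists of ints to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + list(seq))
    return padded


toy_batch = [[3, 2, 25, 2], [1, 74], [3, 2, 6, 3, 2, 7]]
for row in pad_left(toy_batch):
    print(row)
# [0, 0, 3, 2, 25, 2]
# [0, 0, 0, 0, 1, 74]
# [3, 2, 6, 3, 2, 7]
```

With `maxlen=3` the same call would keep only the last three indices of each sentence, for example `[2, 25, 2]` for the first one.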

### Embedding Layer

A lot of NLP tasks use word vectors (normally trained using word2vec models or GloVe). To use these word vectors in Keras, we need an `Embedding` layer. But first, let's learn how to read pre-trained vectors.

In [14]:
import numpy as np

def read_wordvecs(filename):
    word2index = {}

    # Masking
    word2index['MASK'] = 0
    # Out of vocabulary words
    word2index['UNKNOWN'] = 1
    # Padding
    word2index['PADDING'] = 2

    word_vecs = []

    # Each line holds a word followed by the components of its vector
    with open(filename) as fin:
        for line in fin:
            split_line = line.strip().split()
            word = split_line[0]
            word_vecs.append(split_line[1:])

            word2index[word] = len(word2index)

    # Rows 0-2 (MASK, UNKNOWN, PADDING) stay all-zero; the remaining rows are the pre-trained vectors
    word_vecs_np = np.zeros(shape=(len(word2index), len(word_vecs[0])), dtype='float32')
    word_vecs_np[3:] = word_vecs

    return word_vecs_np, word2index

Now a simple call to `read_wordvecs` will return a word vectors matrix and a word to index dictionary.
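
The dictionary is what turns a tokenized sentence into the word indexes the network expects. A minimal sketch of that lookup, assuming unseen words should fall back to the `UNKNOWN` entry (`tokens_to_indices` and the toy dictionary are made up for illustration):

```python
def tokens_to_indices(tokens, word2index):
    """Look up each token, falling back to the UNKNOWN index for unseen words."""
    unknown = word2index['UNKNOWN']
    return [word2index.get(tok, unknown) for tok in tokens]


# Toy dictionary with the same three reserved entries as read_wordvecs
toy_word2index = {'MASK': 0, 'UNKNOWN': 1, 'PADDING': 2, 'der': 3, 'Hund': 4}
print(tokens_to_indices(['der', 'Hund', 'bellt'], toy_word2index))  # [3, 4, 1]
```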

Let's have a look at how to instantiate an Embedding layer:

In [15]:
from keras.layers.embeddings import Embedding
from keras.models import Sequential

# Word vectors for German words
word_vecs, word2index = read_wordvecs('/home/andfre/Downloads/GENSIM_KERAS_CLASS/GermEval/embeddings/GermEval.emb')

sentences = [ # Words indexes
             [3, 2, 25, 2],
             [1, 74],
             [3, 2, 6, 3, 2, 7]
             ]
sentences = pad_sequences(sentences)

model = Sequential()

emb_layer = Embedding(output_dim=word_vecs.shape[1], input_dim=word_vecs.shape[0],
                      mask_zero=True, weights=[word_vecs], input_length=sentences.shape[1])

model.add(emb_layer)

model.compile('rmsprop', 'mse')

output = model.predict(sentences)

print(output.shape)

(3, 6, 100)
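
Conceptually, an embedding layer is a lookup table: every word index selects one row of the weight matrix, which is why a batch of shape (3, 6) comes out with shape (3, 6, 100). The sketch below shows the same lookup with plain Python lists and a made-up 3-dimensional table; it ignores masking and is not how Keras implements the layer.

```python
def embed(batch, table):
    """Replace every word index in a padded batch by its row of the embedding table."""
    return [[table[idx] for idx in seq] for seq in batch]


# Toy embedding table: 5 words, 3 dimensions each (row 0 is the padding row)
toy_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
toy_batch = [[0, 0, 3, 2], [1, 4, 2, 3]]

embedded = embed(toy_batch, toy_table)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 3)
```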


### NER Classification

Now that we have learned how to extract and use semantic features from text, let's try named entity recognition (NER) using these features.

Make sure you have the `GermEval` folder before continuing. This folder contains a dataset for German NER and tools to read it. (More info at https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/publikationen/2014/2014_GermEval_Nested_Named_Entity_Recognition_with_Neural_Networks.pdf)

We will use a model that takes a window of size 2 (2 words to the left and 2 words to the right) to predict the NER tag of the word in the middle. Let's start by reading our dataset. Remember that we already read the word vectors in the previous cell.
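
To make the windowing concrete, the sketch below builds one window per word and fills the positions that fall outside the sentence with a padding index. This is a guess at the general idea, not the code in `GermEvalReader`; `context_windows` is a hypothetical helper, and using index 2 for padding is an assumption based on the `PADDING` entry defined earlier.

```python
def context_windows(indices, window_size, pad_index):
    """Return one window of 2*window_size+1 indices per token in the sentence."""
    padded = [pad_index] * window_size + list(indices) + [pad_index] * window_size
    width = 2 * window_size + 1
    return [padded[i:i + width] for i in range(len(indices))]


# A three-word sentence as word indices; 2 stands for the PADDING index
for window in context_windows([5, 6, 7], window_size=2, pad_index=2):
    print(window)
# [2, 2, 5, 6, 7]
# [2, 5, 6, 7, 2]
# [5, 6, 7, 2, 2]
```

Each window becomes one training sample, and its label is the tag of the word in the middle.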

In [16]:
import sys
from keras.utils import np_utils
sys.path.append('/home/andfre/Downloads/GENSIM_KERAS_CLASS')

from GermEval import GermEvalReader
from GermEval import BIOF1Validation

windowSize = 2 # 2 to the left, 2 to the right
numHiddenUnits = 100
trainFile = '/home/andfre/Downloads/GENSIM_KERAS_CLASS/GermEval/data/NER-de-train.tsv'
devFile = '/home/andfre/Downloads/GENSIM_KERAS_CLASS/GermEval/data/NER-de-dev.tsv'
testFile = '/home/andfre/Downloads/GENSIM_KERAS_CLASS/GermEval/data/NER-de-test.tsv'

# Create a mapping for our labels
label2index = {'O':0}
idx = 1

for bioTag in ['B-', 'I-']:
    for nerClass in ['PER', 'LOC', 'ORG', 'OTH']:
        for subtype in ['', 'deriv', 'part']:
            label2index[bioTag+nerClass+subtype] = idx 
            idx += 1
            
#Inverse label mapping
index2label = {v: k for k, v in label2index.items()}

# Read in data   
train_sentences = GermEvalReader.readFile(trainFile)
dev_sentences = GermEvalReader.readFile(devFile)
test_sentences = GermEvalReader.readFile(testFile)

# Create numpy arrays
train_x, train_y = GermEvalReader.createNumpyArray(train_sentences, windowSize, word2index, label2index)
dev_x, dev_y = GermEvalReader.createNumpyArray(dev_sentences, windowSize, word2index, label2index)
test_x, test_y = GermEvalReader.createNumpyArray(test_sentences, windowSize, word2index, label2index)

# train_y is a 1-dimensional vector containing the index of each label
# With np_utils.to_categorical we map it to a one-hot matrix
n_out = len(label2index)
train_y_cat = np_utils.to_categorical(train_y, n_out)
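
A one-hot matrix has one row per sample and one column per class, with a single 1 in the column of the sample's label. A small pure-Python sketch of that conversion (`to_one_hot` is a hypothetical stand-in that returns lists, not a numpy array):

```python
def to_one_hot(labels, num_classes):
    """Turn a list of integer labels into a list of one-hot rows."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows


print(to_one_hot([0, 2, 1], num_classes=3))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

With the label mapping above there are 1 + 2 × 4 × 3 = 25 classes, so each row of `train_y_cat` has 25 columns.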

We have our data. Now we will build our model:

* An Embedding Layer
* An LSTM Layer
* A Dense Layer (Softmax activation)

In [17]:
from keras.models import Sequential
from keras.layers.core import Dense
from keras.utils import np_utils
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.regularizers import l2

n_in = 2*windowSize+1
n_hidden = numHiddenUnits
n_out = len(label2index)

number_of_epochs = 10
batch_size = 35

model = Sequential()

model.add(Embedding(output_dim=word_vecs.shape[1], input_dim=word_vecs.shape[0],
                    input_length=n_in,  weights=[word_vecs], mask_zero=False))  

model.add(LSTM(n_hidden, W_regularizer=l2(0.0001), U_regularizer=l2(0.0001)))

model.add(Dense(n_out, activation='softmax', W_regularizer=l2(0.0001)))
            
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

With our model compiled, the only thing left to do is train.

Since we want to validate with BIO F1 rather than with Keras' built-in metrics (accuracy or loss), we will train the model for one epoch, validate, and repeat.
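
BIO F1 scores whole entities instead of single tokens: a predicted entity only counts as correct if its start, end and type all match the gold annotation. The sketch below shows one common way to compute it for a single tag sequence. It is an illustration under that assumption and not the code in `BIOF1Validation`; `extract_spans` and `bio_f1` are hypothetical names.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag list."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing span
        begins = tag.startswith('B-')
        continues = tag.startswith('I-') and ent_type == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if begins:
            start, ent_type = i, tag[2:]
    return spans


def bio_f1(pred_tags, gold_tags):
    """Entity-level precision, recall and F1 for one tag sequence."""
    pred, gold = extract_spans(pred_tags), extract_spans(gold_tags)
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
pred = ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
print(bio_f1(pred, gold))  # (0.5, 0.5, 0.5)
```

In this toy case the person entity is found, the location is missed and a spurious organization is predicted, so precision and recall are both one half.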

In [18]:
import time
import sys

print(str(train_x.shape[0]) + ' train samples')
print(str(train_x.shape[1]) + ' train dimension')
print(str(test_x.shape[0]) + ' test samples')

print("\n%d epochs" % number_of_epochs)
print("%d mini batches" % (len(train_x)/batch_size))

print("\nA little bit too much for the lecture. Using 10k samples instead\n")
train_x = train_x[:10000]
dev_x = dev_x[:10000]
test_x = test_x[:10000]
train_y_cat = train_y_cat[:10000]
dev_y = dev_y[:10000]
test_y = test_y[:10000]

sys.stdout.flush()

for epoch in range(number_of_epochs):    
    start_time = time.time()
    
    #Train for 1 epoch
    model.fit(train_x, train_y_cat, nb_epoch=1, batch_size=batch_size, verbose=False, shuffle=True)   
    print("%.2f sec for training" % (time.time() - start_time))
    sys.stdout.flush()
  
    # Compute precision, recall, F1 on dev & test data
    pre_dev, rec_dev, f1_dev = BIOF1Validation.compute_f1(model.predict_classes(dev_x, verbose=0), dev_y, index2label)
    pre_test, rec_test, f1_test = BIOF1Validation.compute_f1(model.predict_classes(test_x, verbose=0), test_y, index2label)

    print("%d epoch: F1 on dev: %f, F1 on test: %f" % (epoch+1, f1_dev, f1_test))
    sys.stdout.flush()

452830 train samples
5 train dimension
96483 test samples

10 epochs
12938 mini batches

A little bit too much for the lecture. Using 10k samples instead

104.38 sec for training
1 epoch: F1 on dev: 0.094899, F1 on test: 0.116915
79.00 sec for training
2 epoch: F1 on dev: 0.263254, F1 on test: 0.287046
75.29 sec for training
3 epoch: F1 on dev: 0.339383, F1 on test: 0.370438


KeyboardInterrupt: 

### Exercise

* Using the following code, load the IMDB sentiment analysis dataset.

```python
from keras.datasets import imdb

#Using 20k most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=20000,
                                                      test_split=0.2)
```

* Pad X_train and X_test using Keras' preprocessing tools. It's a good idea to set the `maxlen` parameter to something around 80 when padding sentences.
* Build a model with an Embedding layer at the beginning and a binary output (you can use LSTM or GRU as hidden layers). The Embedding layer shouldn't be initialized with pre-trained vectors (just omit the `weights` parameter).
* Train and evaluate

In [None]:
# PUT YOUR CODE HERE