# Introduction to neural networks using NumPy

## Resources
* Stanford CS224n Lecture 4 (Winter 2018) [Slides](https://web.stanford.edu/class/cs224n/lectures/lecture4.pdf)
* Stanford CS224n Lecture 4 (Winter 2017) [Video](https://youtu.be/uc2_iwVqrRI)
* Denny Britz's post (2015): [Implementing a Neural Network from Scratch](http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/)

In [1]:
import numpy as np
import random
import os
import sys
import urllib.request

from tempfile import gettempdir

In [2]:
print('Python', sys.version)

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]


## Task: Classify whether the center word within a window of words is a location

* To build a simple neural network model to illustrate non-linear function approximation, backpropagation and stochastic gradient descent.
* Background to the Named Entity Recognition (NER) problem on [Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition)
* Description of task in Stanford CS224n Lecture 4 [slide 45](https://web.stanford.edu/class/cs224n/lectures/lecture4.pdf#page=45)

## Download and read the data from file
Tjong Kim Sang et al. [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
](http://www.aclweb.org/anthology/W03-0419.pdf)

In [3]:
def maybe_download(url, filename, expected_bytes):
    "Download the file if not present, and make sure it's the right size."    
    local_filename = os.path.join(gettempdir(), filename)
    if not os.path.exists(local_filename):
        local_filename, _ = urllib.request.urlretrieve(url + filename, local_filename)
        statinfo = os.stat(local_filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception('Failed to verify ' + local_filename + 
                            '. Can you get to it with a browser?')
    return local_filename


def read_data(filename):
    "Reads the eng.train data file from CONLL2003"
    sents, sent_tags = [], []
    with open(filename) as f:
        dictionary = {'<PAD>': 0}
        sent, tags = [], []
        for line in f:
            if line.startswith('-DOCSTART-'):
                continue
            if line.startswith('\n'):
                if sent and tags:
                    sents.append(sent)
                    sent_tags.append(tags)
                    sent, tags = [], []
                continue
            word, _, _, tag = line.split()
            sent.append(word)
            tags.append(tag)
            if not dictionary.get(word):
                dictionary[word] = len(dictionary)
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return sents, sent_tags, dictionary, reversed_dictionary


In [4]:
filename = maybe_download(
    url='https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/',
    filename='eng.train',
    expected_bytes=3283420)

Click to view raw text: https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train

## Read the data from file and build dictionary

In [5]:
""" sents               : list of sentences (where each sentence is a list of words)
    sent_tags           : list of named-entity tags corresponding to each word in sents
    dictionary          : maps words(strings) to their IDs(int)
    reversed_dictionary : maps IDs(int) to their words(strings)
"""
sents, sent_tags, dictionary, reversed_dictionary = read_data(filename)

In [6]:
print('Sample sentence:', sents[0], sent_tags[0])

Sample sentence: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']


In [7]:
len(dictionary)   # vocabulary size

23624

## Prepare word windows for training

In [8]:
def prepare_windows(window_size=2):
    """
    Param: window_size (int) for each side of center word
    Returns: : tuple (list of +ve windows, list of -ve windows)
    """
    pos_windows, neg_windows = [], []
    span = 2*window_size + 1
    for sent, tags in zip(sents, sent_tags):
        count = len(sent)
        # pad sentence at front and end
        sent = [0]*window_size + [dictionary[word] for word in sent] + [0]*window_size
        for i in range(count):
            window = sent[i:i+span]
            # positive if center word is tagged as location
            if tags[i] in ['B-LOC', 'I-LOC']:
                pos_windows.append(window)
            else:
                neg_windows.append(window)
    return pos_windows, neg_windows

In [9]:
pos_windows, neg_windows = prepare_windows(window_size=2)
print('Number of positive windows: ', len(pos_windows))
print('Number of negative windows: ', len(neg_windows))
print('Sample positive window: ', pos_windows[0], [reversed_dictionary[i] for i in pos_windows[0]])
print('Sample negative window: ', neg_windows[0], [reversed_dictionary[i] for i in neg_windows[0]])

Number of positive windows:  8297
Number of negative windows:  195324
Sample positive window:  [0, 0, 12, 13, 0] ['<PAD>', '<PAD>', 'BRUSSELS', '1996-08-22', '<PAD>']
Sample negative window:  [0, 0, 1, 2, 3] ['<PAD>', '<PAD>', 'EU', 'rejects', 'German']


## Set hyperparameter values 

In [10]:
seed = 0
embedding_size = 100      # word embdedding dimension size 
window_size = 2           # size of window on each side of center word
hidden_size = 200         # size of the hidden layer
vocab_size = len(dictionary)
learning_rate = 0.001     # initial learning rate
num_epochs = 1000         # number of passes over true window samples

## Build and train the classifier

In [11]:
""" This code trains a simple neural network as a binary classifier. 
    The model calculates a score when it is given a window of words. 
    The score is used to determine whether the center word in the
    window is a location or not.
"""

np.random.seed(seed)
x_dim = embedding_size * (2*window_size + 1)

# Initialize model parameters 
embeddings = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_size))
W = np.random.randn(x_dim, hidden_size) * np.sqrt(1.0/x_dim)
b = np.zeros(hidden_size)
u = np.random.randn(hidden_size)

average_error = 0

# Training loop
for epoch in range(num_epochs):
    for i, pos_window in enumerate(pos_windows):
        
        pos_sample = np.array(pos_window)      # s (positive/true window)
        neg_sample = np.array(
            random.sample(neg_windows, k=1))   # s_c (corrupt window)
        
        inputs = np.vstack((pos_sample, neg_sample)) # 2 word vectors
        X = embeddings[inputs].reshape(-1, x_dim)    # joined together
                                                     # and reshaped.
        
        # Forward pass    
        z = X.dot(W) + b                # affine transformation
        a = 1. / (1. + np.exp(-z))      # non-linearity (sigmoid)
        scores = a.dot(u)               # scalar unnormalized scores

        # Max-margin objective
        error = 1 if max(0, 1 - scores[0] + scores[1]) > 0 else 0
        
        # Backward pass
        grad_u = error * (a[1] -a[0])   # gradient for u
        delta = grad_u * (a * (1 - a))  # multiply with sigmoid derivative
        grad_W = np.dot(X.T, delta)     # gradient for W
        grad_b = delta.sum(axis=0)      # gradient for b
        # gradient for the 2 word vectors
        grad_X = np.dot(delta, W.T).reshape(-1, 2*window_size + 1, embedding_size)
                                        
        
        # Parameter updates using gradient descent
        u -= learning_rate * grad_u
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
        embeddings[inputs] -= learning_rate * grad_X
        
        # Keep track of any errors
        if error: average_error += 1 - scores[0] + scores[1]
     
    # Print average error per epoch
    print('Epoch', epoch + 1, 'error: ', average_error / i)
    
    # Stop training when average error is low enough
    if average_error / i < 0.01:
        break
    
    average_error = 0
    
    # Decay learning rate exponentially every epoch
    learning_rate = learning_rate * 0.9999
            

Epoch 1 error:  1.2543090980779557
Epoch 2 error:  0.8502328591689242
Epoch 3 error:  0.8219493988752279
Epoch 4 error:  0.7964259181583688
Epoch 5 error:  0.8155511480789995
Epoch 6 error:  0.8004339041474317
Epoch 7 error:  0.8159947442810662
Epoch 8 error:  0.7948763515910987
Epoch 9 error:  0.7302909140954216
Epoch 10 error:  0.7135596711859354
Epoch 11 error:  0.6741291547283738
Epoch 12 error:  0.6742432995278562
Epoch 13 error:  0.6451955145812827
Epoch 14 error:  0.606109704150611
Epoch 15 error:  0.5806997630091584
Epoch 16 error:  0.5686487667332706
Epoch 17 error:  0.5476950657018854
Epoch 18 error:  0.5190020315936291
Epoch 19 error:  0.5227998087062005
Epoch 20 error:  0.5051725722335456
Epoch 21 error:  0.493015337391649
Epoch 22 error:  0.489575434440291
Epoch 23 error:  0.4768398116634627
Epoch 24 error:  0.465069174516525
Epoch 25 error:  0.4616172769757095
Epoch 26 error:  0.4592984175784474
Epoch 27 error:  0.4480061542520365
Epoch 28 error:  0.44193215760324467
Epoc

Epoch 223 error:  0.16129254832254766
Epoch 224 error:  0.16270097486783902
Epoch 225 error:  0.15532418837466527
Epoch 226 error:  0.1573840840456377
Epoch 227 error:  0.16254784135332223
Epoch 228 error:  0.16267536023113427
Epoch 229 error:  0.159143343964731
Epoch 230 error:  0.15206929627066818
Epoch 231 error:  0.16019448139264553
Epoch 232 error:  0.16773317673226196
Epoch 233 error:  0.16141329553685127
Epoch 234 error:  0.15507117204877172
Epoch 235 error:  0.1521317253314569
Epoch 236 error:  0.14805124330529737
Epoch 237 error:  0.16060148959545917
Epoch 238 error:  0.15313599380666335
Epoch 239 error:  0.16138608090376894
Epoch 240 error:  0.1573005302116273
Epoch 241 error:  0.14765786373459225
Epoch 242 error:  0.15283173558731872
Epoch 243 error:  0.1378473201119857
Epoch 244 error:  0.15194846376704485
Epoch 245 error:  0.14301178455935748
Epoch 246 error:  0.13799962210634378
Epoch 247 error:  0.14429118893110912
Epoch 248 error:  0.14499553822679823
Epoch 249 error:  

Epoch 440 error:  0.08429377427993424
Epoch 441 error:  0.08324149838199742
Epoch 442 error:  0.08662197895191691
Epoch 443 error:  0.08824525607844656
Epoch 444 error:  0.08623473189840561
Epoch 445 error:  0.08768974263539729
Epoch 446 error:  0.08390202115165764
Epoch 447 error:  0.09110027762565164
Epoch 448 error:  0.08434842591409346
Epoch 449 error:  0.08434254881585623
Epoch 450 error:  0.08428552177189368
Epoch 451 error:  0.08335934608818157
Epoch 452 error:  0.08128395899397282
Epoch 453 error:  0.09117223863003894
Epoch 454 error:  0.07778999068090622
Epoch 455 error:  0.07451097042758015
Epoch 456 error:  0.08130810296622663
Epoch 457 error:  0.07813228366946179
Epoch 458 error:  0.08397093980459187
Epoch 459 error:  0.0792293585596362
Epoch 460 error:  0.07134163517200108
Epoch 461 error:  0.08305697917486358
Epoch 462 error:  0.07639097308283462
Epoch 463 error:  0.07643684445378726
Epoch 464 error:  0.07216975234888037
Epoch 465 error:  0.06934037086656626
Epoch 466 err

Epoch 655 error:  0.05347423785335021
Epoch 656 error:  0.049737131238661275
Epoch 657 error:  0.04659684838216013
Epoch 658 error:  0.052489494404010906
Epoch 659 error:  0.047035777199882875
Epoch 660 error:  0.050774927034476176
Epoch 661 error:  0.05099704328084098
Epoch 662 error:  0.04805154835772248
Epoch 663 error:  0.04325134720388602
Epoch 664 error:  0.053584170023730344
Epoch 665 error:  0.044172384230600485
Epoch 666 error:  0.046701638712228634
Epoch 667 error:  0.04453976096758394
Epoch 668 error:  0.04759639684676993
Epoch 669 error:  0.0456287620060764
Epoch 670 error:  0.04847634445786038
Epoch 671 error:  0.048500275375233354
Epoch 672 error:  0.05035720885856662
Epoch 673 error:  0.053250993410792355
Epoch 674 error:  0.04523231387934181
Epoch 675 error:  0.04939357985949955
Epoch 676 error:  0.04725917953865014
Epoch 677 error:  0.04771881056059636
Epoch 678 error:  0.04681429963897433
Epoch 679 error:  0.05148560122592555
Epoch 680 error:  0.04499127901283376
Epoc

Epoch 870 error:  0.035682367686126454
Epoch 871 error:  0.03594363782863291
Epoch 872 error:  0.03452406595901349
Epoch 873 error:  0.035869394083541645
Epoch 874 error:  0.035804422603696776
Epoch 875 error:  0.037334916965303486
Epoch 876 error:  0.037627145905730075
Epoch 877 error:  0.03287014599481271
Epoch 878 error:  0.031986823057523255
Epoch 879 error:  0.03511839261618608
Epoch 880 error:  0.03497912383489957
Epoch 881 error:  0.03912058828540176
Epoch 882 error:  0.0370363673796316
Epoch 883 error:  0.03232906835999501
Epoch 884 error:  0.03610860071330862
Epoch 885 error:  0.0357110045465154
Epoch 886 error:  0.03172141193388374
Epoch 887 error:  0.031765566564145226
Epoch 888 error:  0.03555441041878228
Epoch 889 error:  0.03862824775329095
Epoch 890 error:  0.03591914216048891
Epoch 891 error:  0.04271316961841926
Epoch 892 error:  0.03450421914065156
Epoch 893 error:  0.034330486911438834
Epoch 894 error:  0.03774723607114327
Epoch 895 error:  0.03473990096895335
Epoch 

## Obtaining the scores 

As the model was trained on a max-margin objective and the ouput scores are unnormalized, the scores can only be interpreted by knowing the class boundary.

In [26]:
X_test = embeddings[pos_windows[0]].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Sample positive window', [reversed_dictionary[i] for i in pos_windows[0]], 'Score:', score_test)

Sample positive window ['<PAD>', '<PAD>', 'BRUSSELS', '1996-08-22', '<PAD>'] Score: [1.58155037]


In [27]:
# score statistics for all the positive windows in the training set
X_test = embeddings[pos_windows].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Max:', np.max(score_test), 'Min:', np.min(score_test), 
      'Mean:', np.mean(score_test), 'Median:', np.median(score_test))

Max: 6.09395330640287 Min: -5.756840786017964 Mean: -0.73314883258487 Median: -0.426166377895731


In [28]:
X_test = embeddings[neg_windows[0]].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Sample negative window', [reversed_dictionary[i] for i in neg_windows[0]], 'Score:', score_test)

Sample negative window ['<PAD>', '<PAD>', 'EU', 'rejects', 'German'] Score: [-6.39355903]


In [29]:
# score statistics for all the negative windows in the training set
X_test = embeddings[neg_windows].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Max:', np.max(score_test), 'Min:', np.min(score_test), 
      'Mean:', np.mean(score_test), 'Median:', np.median(score_test))

Max: 3.523909607407841 Min: -10.927709177663939 Mean: -6.064599375906003 Median: -6.233390325818526


From this, we can be rather certain that when the model ouputs a score of -6 or below, the center word is not a location.

In [30]:
def predict_score(sentence):
    "Param: sentence (list of strings)"
    word_ids = [dictionary[word] for word in sentence.split()]
    X = embeddings[word_ids].reshape(-1, x_dim)
    z = X.dot(W) + b
    a = 1. / (1. + np.exp(-z))
    score = a.dot(u)
    return score

In [31]:
predict_score('shops in Paris are amazing')

array([-1.79525794])

In [32]:
predict_score('not all shops in Paris')

array([-2.92915112])

In [33]:
predict_score('I love New York City')

array([-0.13167138])

In [34]:
predict_score('watch The New Black movie')

array([2.25757973])