# Introduction to Neural Networks using NumPy

## Resources
* Stanford CS224n Lecture 4 (Winter 2018) [Slides](https://web.stanford.edu/class/cs224n/lectures/lecture4.pdf)
* Stanford CS224n Lecture 4 (Winter 2017) [Video](https://youtu.be/uc2_iwVqrRI)
* Denny Britz's post (2015): [Implementing a Neural Network from Scratch](http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/)

In [1]:
import numpy as np
import random
import os
import sys
import urllib.request

from tempfile import gettempdir

In [2]:
print('Python', sys.version)

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]


## Task: Classify whether the center word within a window of words is a location

* To build a simple neural network model to illustrate non-linear function approximation, backpropagation and stochastic gradient descent.
* Background to the Named Entity Recognition (NER) problem on [Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition)
* Description of task in Stanford CS224n Lecture 4 [slide 45](https://web.stanford.edu/class/cs224n/lectures/lecture4.pdf#page=45)

## Download and read the data from file
Tjong Kim Sang et al. [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
](http://www.aclweb.org/anthology/W03-0419.pdf)

In [3]:
def maybe_download(url, filename, expected_bytes):
    "Download the file if not present, and make sure it's the right size."    
    local_filename = os.path.join(gettempdir(), filename)
    if not os.path.exists(local_filename):
        local_filename, _ = urllib.request.urlretrieve(url + filename, local_filename)
        statinfo = os.stat(local_filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception('Failed to verify ' + local_filename + 
                            '. Can you get to it with a browser?')
    return local_filename


def read_data(filename):
    "Reads the eng.train data file from CONLL2003"
    sents, sent_tags = [], []
    with open(filename) as f:
        dictionary = {'<PAD>': 0}
        sent, tags = [], []
        for line in f:
            if line.startswith('-DOCSTART-'):
                continue
            if line.startswith('\n'):
                if sent and tags:
                    sents.append(sent)
                    sent_tags.append(tags)
                    sent, tags = [], []
                continue
            word, _, _, tag = line.split()
            sent.append(word)
            tags.append(tag)
            if not dictionary.get(word):
                dictionary[word] = len(dictionary)
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return sents, sent_tags, dictionary, reversed_dictionary


In [4]:
filename = maybe_download(
    url='https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/',
    filename='eng.train',
    expected_bytes=3283420)

Click to view raw text: https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train

## Read the data from file and build dictionary

In [5]:
""" sents               : list of sentences (where each sentence is a list of words)
    sent_tags           : list of named-entity tags corresponding to each word in sents
    dictionary          : maps words(strings) to their IDs(int)
    reversed_dictionary : maps IDs(int) to their words(strings)
"""
sents, sent_tags, dictionary, reversed_dictionary = read_data(filename)

In [6]:
print('Sample sentence:', sents[0], sent_tags[0])

Sample sentence: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']


In [7]:
len(dictionary)   # vocabulary size

23624

## Prepare word windows for training

In [8]:
def prepare_windows(window_size=2):
    """
    Param: window_size (int) for each side of center word
    Returns: : tuple (list of +ve windows, list of -ve windows)
    """
    pos_windows, neg_windows = [], []
    span = 2*window_size + 1
    for sent, tags in zip(sents, sent_tags):
        count = len(sent)
        # pad sentence at front and end
        sent = [0]*window_size + [dictionary[word] for word in sent] + [0]*window_size
        for i in range(count):
            window = sent[i:i+span]
            # positive if center word is tagged as location
            if tags[i] in ['B-LOC', 'I-LOC']:
                pos_windows.append(window)
            else:
                neg_windows.append(window)
    return pos_windows, neg_windows

In [9]:
pos_windows, neg_windows = prepare_windows(window_size=2)
print('Number of positive windows: ', len(pos_windows))
print('Number of negative windows: ', len(neg_windows))
print('Sample positive window: ', pos_windows[0], [reversed_dictionary[i] for i in pos_windows[0]])
print('Sample negative window: ', neg_windows[0], [reversed_dictionary[i] for i in neg_windows[0]])

Number of positive windows:  8297
Number of negative windows:  195324
Sample positive window:  [0, 0, 12, 13, 0] ['<PAD>', '<PAD>', 'BRUSSELS', '1996-08-22', '<PAD>']
Sample negative window:  [0, 0, 1, 2, 3] ['<PAD>', '<PAD>', 'EU', 'rejects', 'German']


## Set hyperparameter values 

In [10]:
seed = 0
embedding_size = 100      # word embdedding dimension size 
window_size = 2           # size of window on each side of center word
hidden_size = 200         # size of the hidden layer
learning_rate = 0.02     # initial learning rate
num_epochs = 100         # number of passes over true window samples

## Prepare test values to monitor training 

In [11]:
pos_sent = [dictionary[word] for word in 'shops in Paris are amazing'.split()]
neg_sent = [dictionary[word] for word in 'not all shops in Paris'.split()]

word_ids = np.vstack((pos_sent, neg_sent))

## Build and train the classifier

In [12]:
""" This code trains a simple neural network as a binary classifier. 
    The model calculates a score when it is given a window of words. 
    The score is used to determine whether the center word in the
    window is a location or not.
"""

np.random.seed(seed)
vocab_size = len(dictionary)
x_dim = embedding_size * (2*window_size + 1)

# Initialize model parameters 
embeddings = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_size))
W = np.random.randn(x_dim, hidden_size) * np.sqrt(1.0/x_dim)
b = np.zeros(hidden_size)
u = np.random.randn(hidden_size)

average_error = 0

# Training loop
for epoch in range(num_epochs):
    for i, pos_window in enumerate(pos_windows):
        neg_window = random.sample(neg_windows, k=1) # s_c
        inputs = np.vstack((pos_window, neg_window)) # stack as matrix
        X = embeddings[inputs].reshape(-1, x_dim)    # concat the words
        
        # Forward pass    
        z = X.dot(W) + b                # affine transformation
        a = 1. / (1. + np.exp(-z))      # non-linearity (sigmoid)
        scores = a.dot(u)               # scalar unnormalized scores

        # Max-margin objective
        error = 1 if max(0, 1 - scores[0] + scores[1]) > 0 else 0
        
        # Backward pass (no updating if error is 0)
        grad_u = error * (a[1] - a[0])       # gradient for u
        delta = grad_u.dot(u) * (a*(1 - a))  # multiply with sigmoid derivative
        grad_W = X.T.dot(delta)              # gradient for W
        grad_b = delta.sum(axis=0)           # gradient for b
        grad_X = delta.dot(W.T)              # gradient for the word vectors
        grad_X = grad_X.reshape(-1, 2*window_size + 1, embedding_size)                                
        
        # Parameter updates using gradient descent
        u -= learning_rate * grad_u
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
        embeddings[inputs] -= learning_rate * grad_X
        
        # Keep track of any errors
        if error: average_error += 1 - scores[0] + scores[1]
         
    # Check scores for test pair
    if epoch == 0 or (epoch + 1) % 10 == 0:
        X_test = embeddings[word_ids].reshape(-1, x_dim)
        z_test = X_test.dot(W) + b
        a_test = 1. / (1. + np.exp(-z_test))
        scores_test = a_test.dot(u)
        print('Positive window ("shops in Paris are amazing") score:', scores_test[0])
        print('Negative window ("not all shops in Paris") score:', scores_test[1])

    # Print average error per epoch
    print('Epoch', epoch + 1, 'error: ', average_error / i)
    
    # Stop training when average error is low enough
    if average_error / i < 0.02:
        break
    
    average_error = 0
        
    # Decay learning rate exponentially every epoch
    learning_rate = learning_rate * 0.9999


Positive window ("shops in Paris are amazing") score: 0.8836468783513292
Negative window ("not all shops in Paris") score: -1.256260939044756
Epoch 1 error:  0.5579765121942374
Epoch 2 error:  0.3331931522768411
Epoch 3 error:  0.2909576297149046
Epoch 4 error:  0.2541241764738869
Epoch 5 error:  0.20039501078979563
Epoch 6 error:  0.18689564710487705
Epoch 7 error:  0.1707173158075877
Epoch 8 error:  0.18102232846116376
Epoch 9 error:  0.18015253669296025
Positive window ("shops in Paris are amazing") score: 0.036454762724530854
Negative window ("not all shops in Paris") score: -2.8344061881812475
Epoch 10 error:  0.1639897789982405
Epoch 11 error:  0.15749696569041816
Epoch 12 error:  0.13340050849125523
Epoch 13 error:  0.12343361510971182
Epoch 14 error:  0.12525819434294844
Epoch 15 error:  0.12672452132982615
Epoch 16 error:  0.13123665753790192
Epoch 17 error:  0.13913871885225032
Epoch 18 error:  0.11926359214746965
Epoch 19 error:  0.11676539673896921
Positive window ("shops i

## Scores 

In [13]:
# score statistics for all the positive windows in the training set
X_test = embeddings[pos_windows].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Max:', np.max(score_test), 'Min:', np.min(score_test), 
      'Mean:', np.mean(score_test), 'Median:', np.median(score_test))

Max: 8.254149662257406 Min: -5.2024479942477635 Mean: 1.1503232412480688 Median: 0.7110955563133758


In [14]:
# score statistics for all the negative windows in the training set
X_test = embeddings[neg_windows].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Max:', np.max(score_test), 'Min:', np.min(score_test), 
      'Mean:', np.mean(score_test), 'Median:', np.median(score_test))

Max: 2.6992802260686886 Min: -78.50381807390244 Mean: -11.558772555719747 Median: -8.607182845892048
