# Introduction to Neural Networks using NumPy

## Resources
* Stanford CS224n Lecture 4 (Winter 2018) [Slides](https://web.stanford.edu/class/cs224n/lectures/lecture4.pdf)
* Stanford CS224n Lecture 4 (Winter 2017) [Video](https://youtu.be/uc2_iwVqrRI)
* Denny Britz's post (2015): [Implementing a Neural Network from Scratch](http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/)
* Chenglei Si's Gradient calcluation post for CS224n (2018): [Backpropagation](https://medium.com/@sichenglei1125/backpropagation-faa7a0bc6e5c)

In [1]:
import numpy as np
import random
import os
import sys
import urllib.request

from tempfile import gettempdir

In [2]:
print('Python', sys.version)

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]


## Task: Classify whether the center word within a window of words is a location

* To build a simple neural network model to illustrate non-linear function approximation, backpropagation and stochastic gradient descent.
* Background to the Named Entity Recognition (NER) problem on [Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition)
* Description of task in Stanford CS224n Lecture 4 [slide 45](https://web.stanford.edu/class/cs224n/lectures/lecture4.pdf#page=45)

## Download and read the data from file
Tjong Kim Sang et al. [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
](http://www.aclweb.org/anthology/W03-0419.pdf)

In [3]:
def maybe_download(url, filename, expected_bytes):
    "Download the file if not present, and make sure it's the right size."    
    local_filename = os.path.join(gettempdir(), filename)
    if not os.path.exists(local_filename):
        local_filename, _ = urllib.request.urlretrieve(url + filename, local_filename)
        statinfo = os.stat(local_filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception('Failed to verify ' + local_filename + 
                            '. Can you get to it with a browser?')
    return local_filename


def read_data(filename):
    "Reads the eng.train data file from CONLL2003"
    sents, sent_tags = [], []
    with open(filename) as f:
        dictionary = {'<PAD>': 0}
        sent, tags = [], []
        for line in f:
            if line.startswith('-DOCSTART-'):
                continue
            if line.startswith('\n'):
                if sent and tags:
                    sents.append(sent)
                    sent_tags.append(tags)
                    sent, tags = [], []
                continue
            word, _, _, tag = line.split()
            sent.append(word)
            tags.append(tag)
            if not dictionary.get(word):
                dictionary[word] = len(dictionary)
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return sents, sent_tags, dictionary, reversed_dictionary


In [4]:
filename = maybe_download(
    url='https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/',
    filename='eng.train',
    expected_bytes=3283420)

Click to view raw text: https://raw.githubusercontent.com/patverga/torch-ner-nlp-from-scratch/master/data/conll2003/eng.train

(There appear to be some errors: e.g. The African Football Confederation is tagged as a location?)

## Read the data from file and build dictionary

In [5]:
""" sents               : list of sentences (where each sentence is a list of words)
    sent_tags           : list of named-entity tags corresponding to each word in sents
    dictionary          : maps words(strings) to their IDs(int)
    reversed_dictionary : maps IDs(int) to their words(strings)
"""
sents, sent_tags, dictionary, reversed_dictionary = read_data(filename)

In [6]:
print('Sample sentence:', sents[0], sent_tags[0])

Sample sentence: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']


In [7]:
len(dictionary)   # vocabulary size

23624

## Prepare word windows for training

In [8]:
def prepare_windows(window_size=2):
    """
    Param: window_size (int) for each side of center word
    Returns: : tuple (list of +ve windows, list of -ve windows)
    """
    pos_windows, neg_windows = [], []
    span = 2*window_size + 1
    for sent, tags in zip(sents, sent_tags):
        count = len(sent)
        # pad sentence at front and end
        sent = [0]*window_size + [dictionary[word] for word in sent] + [0]*window_size
        for i in range(count):
            window = sent[i:i+span]
            # positive if center word is tagged as location
            if tags[i] in ['B-LOC', 'I-LOC']:
                pos_windows.append(window)
            else:
                neg_windows.append(window)
    return pos_windows, neg_windows

In [9]:
pos_windows, neg_windows = prepare_windows(window_size=2)
print('Number of positive windows: ', len(pos_windows))
print('Number of negative windows: ', len(neg_windows))
print('Sample positive window: ', pos_windows[0], [reversed_dictionary[i] for i in pos_windows[0]])
print('Sample negative window: ', neg_windows[0], [reversed_dictionary[i] for i in neg_windows[0]])

Number of positive windows:  8297
Number of negative windows:  195324
Sample positive window:  [0, 0, 12, 13, 0] ['<PAD>', '<PAD>', 'BRUSSELS', '1996-08-22', '<PAD>']
Sample negative window:  [0, 0, 1, 2, 3] ['<PAD>', '<PAD>', 'EU', 'rejects', 'German']


## Set hyperparameter values 

In [10]:
seed = 0
embedding_size = 128     # word embdedding dimension size 
window_size = 2          # size of window on each side of center word
hidden_size = 256        # size of the hidden layer
learning_rate = 0.1      # initial learning rate
num_epochs = 30          # number of passes over true window samples

## Build and train the classifier

In [11]:
""" This code trains a simple neural network as a binary classifier. 
    The model calculates a score when it is given a window of words. 
    The score is used to determine whether the center word in the
    window is a location or not.
"""

np.random.seed(seed)
vocab_size = len(dictionary)
x_dim = embedding_size * (2*window_size + 1)

# Initialize model parameters 
embeddings = np.random.uniform(-0.5, 0.5, (vocab_size, embedding_size))
W = np.random.randn(x_dim, hidden_size) * np.sqrt(1.0/x_dim)
b = np.zeros(hidden_size)
u = np.random.randn(hidden_size)

average_error = 0

# Training loop
for epoch in range(num_epochs):
    for i, pos_window in enumerate(pos_windows):
        neg_window = random.sample(neg_windows, k=1) # s_c
        inputs = np.vstack((pos_window, neg_window)) # stack as matrix
        X = embeddings[inputs].reshape(-1, x_dim)    # concat the words
        
        # Forward pass    
        z = X.dot(W) + b                # affine transformation
        a = 1. / (1. + np.exp(-z))      # non-linearity (sigmoid)
        scores = a.dot(u)               # scalar unnormalized scores

        # Max-margin objective
        error = 1 if max(0, 1 - scores[0] + scores[1]) > 0 else 0
        
        # Backward pass (no updating if error is 0)
        # See Algorithm 6.4 of http://www.deeplearningbook.org/contents/mlp.html
        # Also https://medium.com/@sichenglei1125/backpropagation-faa7a0bc6e5c
        grad_u = error * (a[1] - a[0])  # gradient for u
        delta =  u * (a*(1 - a))        # multiply with sigmoid derivative
        delta[0] = -delta[0]            # flip sign for gradient contributing to pos window
        grad_W = X.T.dot(delta)         # gradient for W
        grad_b = delta.sum(axis=0)      # gradient for b
        grad_X = delta.dot(W.T)         # gradient for the word vectors
        grad_X = grad_X.reshape(-1, 2*window_size + 1, embedding_size)                                
        
        # Parameter updates using gradient descent
        u -= learning_rate * grad_u
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
        embeddings[inputs] -= learning_rate * grad_X
        
        # Keep track of any errors
        if error == 1: 
            average_error += 1 - scores[0] + scores[1]
            # For every 200 errors, pick 1 to print
            if random.random() < 0.005:
                print('Positive window: ', 
                      [reversed_dictionary[i] for i in pos_window], scores[0])
                print('Negative window: ', 
                      [reversed_dictionary[i] for i in neg_window[0]], scores[1])                
                X = embeddings[inputs].reshape(-1, x_dim)
                z = X.dot(W) + b
                a = 1. / (1. + np.exp(-z))
                scores = a.dot(u)
                print('Positive window (after backprop): ', 
                      [reversed_dictionary[i] for i in pos_window], scores[0])
                print('Negative window (after backprop): ', 
                      [reversed_dictionary[i] for i in neg_window[0]], scores[1], '\n')

    # Print average error per epoch
    print('Epoch', epoch + 1, 'error: ', average_error / i)
    
    # Stop training when average error is low enough
    if average_error / i < 0.01:
        break
    
    average_error = 0
        
    # Decay learning rate exponentially every epoch
    learning_rate = learning_rate * 0.999


Positive window:  ['Umbria', 'between', 'Rome', 'and', 'Florence'] 7.790065364703469
Negative window:  ['slumped', 'against', 'Oncins', ',', 'who'] 8.114010976988057
Positive window (after backprop):  ['Umbria', 'between', 'Rome', 'and', 'Florence'] 8.673733827822637
Negative window (after backprop):  ['slumped', 'against', 'Oncins', ',', 'who'] 8.259611921357704 

Positive window:  [')', ',', 'Trinidad', "'s", 'Hasely'] -3.7213851988053177
Negative window:  ['paid', 'in', 'full', '.', '<PAD>'] 7.546468880695793
Positive window (after backprop):  [')', ',', 'Trinidad', "'s", 'Hasely'] 4.768707919668095
Negative window (after backprop):  ['paid', 'in', 'full', '.', '<PAD>'] -1.730932377919238 

Epoch 1 error:  0.42661604182118
Epoch 2 error:  0.07468615968231973
Epoch 3 error:  0.05445076275777558
Positive window:  ['Public', 'Park', '&', 'Rec', '.'] -4.832648529804965
Negative window:  ['be', '"', 'discrimination', '"', 'against'] -4.777297126966601
Positive window (after backprop):  [

Epoch 23 error:  0.020854775554744135
Positive window:  ['"', 'Nirmal', 'Hriday', '"', '('] -14.371113325527602
Negative window:  ['Advanced', 'Medical', 'and', 'IMED', 'president'] -14.37111724109667
Positive window (after backprop):  ['"', 'Nirmal', 'Hriday', '"', '('] -14.371116621123516
Negative window (after backprop):  ['Advanced', 'Medical', 'and', 'IMED', 'president'] -14.371120536745586 

Epoch 24 error:  0.02133358801927359
Epoch 25 error:  0.02759124600007012
Positive window:  ['"', 'Nirmal', 'Hriday', '"', '('] -11.106694954388153
Negative window:  ['on', 'the', 'ground', 'reached', '62'] -11.106684609554504
Positive window (after backprop):  ['"', 'Nirmal', 'Hriday', '"', '('] -11.106683365880967
Negative window (after backprop):  ['on', 'the', 'ground', 'reached', '62'] -11.106673021783493 

Positive window:  ['Queen', 'of', 'the', 'Angels', ')'] 1.6613654966040468
Negative window:  ['published', 'in', 'the', 'Oct.', '1'] 1.66136549630725
Positive window (after backprop):

In [12]:
# score statistics for all the positive windows in the training set
X_test = embeddings[pos_windows].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Max:', np.max(score_test), 'Min:', np.min(score_test), 
      'Mean:', np.mean(score_test), 'Median:', np.median(score_test))

Max: 10.34201227380021 Min: -10.80154501144774 Mean: 9.98605154511056 Median: 10.341150870076595


In [13]:
# score statistics for all the negative windows in the training set
X_test = embeddings[neg_windows].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
score_test = a_test.dot(u)
print('Max:', np.max(score_test), 'Min:', np.min(score_test), 
      'Mean:', np.mean(score_test), 'Median:', np.median(score_test))

Max: 10.34115294254265 Min: -10.871716617564262 Mean: -10.664492594838745 Median: -10.80154499798712


In [14]:
pos_sent = [dictionary[word] for word in 'shops in Paris are amazing'.split()]
neg_sent = [dictionary[word] for word in 'not all shops in Paris'.split()]

word_ids = np.vstack((pos_sent, neg_sent))

X_test = embeddings[word_ids].reshape(-1, x_dim)
z_test = X_test.dot(W) + b
a_test = 1. / (1. + np.exp(-z_test))
scores_test = a_test.dot(u)
print('Positive window ("shops in Paris are amazing") score:', scores_test[0])
print('Negative window ("not all shops in Paris") score:', scores_test[1])

Positive window ("shops in Paris are amazing") score: 10.341152892687347
Negative window ("not all shops in Paris") score: -10.798703471946109
