In [1]:
%autosave 0

Autosave disabled


## Assignment 3 - Named Entity Recognition

In this assignment, we are going to build a Named Entity Recognition model. With this model, we will also tag new data.

More on Named Entity Recognition:

https://blog.paralleldots.com/data-science/named-entity-recognition-milestone-models-papers-and-technologies/

https://blog.paralleldots.com/product/applications-named-entity-recognition-api/

### Steps:

**1. Import the data**

**2. Build the model**

**3. Pick a dataset to run the model on**

**4. Build a function to load new data and print the tags**

Your web application will load small sections of text (such as tweets or headlines) and from that, you will tag the text based on the presence of named entities.

*What you will be graded on:*

1. Ability to build a model on word and tag data

2. Ability to use the model to predict on new data and display that prediction

*The model will be based on:*
1. Embeddings from words
2. Embeddings from tag inputs

### Step 1: Importing the data

Below is some code to get you started. As in the part of speech tagging example, you will have to write code to:

0. Split your data into a train/test set (do an 80/20 or 90/10 split, since we'll later apply this model to an entirely separate set of data)
1. Find the set of all words
2. Find the set of all tags
3. **Create a function called ent_tagger** that will turn a sentence into this output for model building:
```
[('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have', 'O'), ('marched', 'O'), ('through', 'O'), ('London', 'B-geo'), ('to', 'O'), ('protest', 'O'), ('the', 'O'), ('war', 'O'), ('in', 'O'), ('Iraq', 'B-geo'), ('and', 'O'), ('demand', 'O'), ('the', 'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops', 'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]
```
4. Make a dictionary of words to index and entity tag to index
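
As a rough illustration of items 1 to 4, the sketch below works on a handful of hand-written rows rather than the real CSV. The `(sentence_id, word, tag)` row layout and the helper names `group_rows` and `build_index` are assumptions made for this example; they are not part of the starter code.

```python
from itertools import groupby

# Toy rows in (sentence_id, word, tag) form; the real data comes from ner_dataset.csv.
rows = [
    (1, "Thousands", "O"), (1, "marched", "O"), (1, "through", "O"), (1, "London", "B-geo"),
    (2, "British", "B-gpe"), (2, "troops", "O"),
]

def group_rows(rows):
    """Group consecutive rows by sentence id into lists of (word, tag) pairs."""
    return [[(w, t) for _, w, t in grp] for _, grp in groupby(rows, key=lambda r: r[0])]

def build_index(items):
    """Map each distinct item to an integer, leaving 0 for padding and 1 for <UNK>."""
    index = {"<UNK>": 1}
    for item in items:
        if item not in index:
            index[item] = len(index) + 1
    return index

sentences = group_rows(rows)
all_words = {w for sent in sentences for w, _ in sent}   # set of all words
all_tags = {t for sent in sentences for _, t in sent}    # set of all tags
word_to_idx = build_index(sorted(all_words))
tag_to_idx = build_index(sorted(all_tags))

print(sentences[0])
print(tag_to_idx)
```

Sorting the sets before indexing only makes the toy mapping reproducible; any stable order would do.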

In [2]:
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
import pickle

Using TensorFlow backend.


In [3]:
import pandas as pd
import numpy as np

data = pd.read_csv("../data/ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
data.tail(10)

Unnamed: 0,Sentence #,Word,POS,Tag
1048565,Sentence: 47958,impact,NN,O
1048566,Sentence: 47958,.,.,O
1048567,Sentence: 47959,Indian,JJ,B-gpe
1048568,Sentence: 47959,forces,NNS,O
1048569,Sentence: 47959,said,VBD,O
1048570,Sentence: 47959,they,PRP,O
1048571,Sentence: 47959,responded,VBD,O
1048572,Sentence: 47959,to,TO,O
1048573,Sentence: 47959,the,DT,O
1048574,Sentence: 47959,attack,NN,O


In [4]:
# Reformat data so that each sentence is put into a vector per row of a pandas dataframe
cleanDat = data.groupby('Sentence #', sort=False).apply(lambda x: pd.DataFrame(data = {'token_sents': [x.Word.tolist()], 'token_tags': [x.Tag.tolist()]}))

Split data into train/test

In [5]:
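# Random state: seed NumPy's global RNG. Note that np.random.seed() returns None,
# so `seed` is None and train_test_split draws from the global RNG seeded here.
seed = np.random.seed(10)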

# Split data based on sentence number
train_sents, test_sents = train_test_split(cleanDat, test_size = .15, random_state = seed)

In [6]:
def make_lexicon(token_seqs, min_freq=1):
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Real tokens start at index 2: 0 is reserved for padding, and 1 is reserved for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return (lexicon, list(token_counts.keys()))

In [7]:
print("Words:")
words_lexicon, all_words = make_lexicon(train_sents.token_sents, min_freq=2)
with open('models/words_lexicon.pkl', 'wb') as f: #save the words lexicon by pickling it
    pickle.dump(words_lexicon, f)

print('')
print("TAGS:")
tags_lexicon, all_tags = make_lexicon(train_sents.token_tags, min_freq=2)
with open('models/tags_lexicon.pkl', 'wb') as f: #save the tags lexicon by pickling it
    pickle.dump(tags_lexicon, f)

Words:
LEXICON SAMPLE (18835 total items):
{'President': 2, 'Bush': 3, 'has': 4, 'outlined': 5, 'the': 6, 'agenda': 7, 'for': 8, 'his': 9, 'second': 10, 'term': 11, 'in': 12, 'office': 13, 'and': 14, 'asked': 15, 'support': 16, 'of': 17, 'all': 18, 'Americans': 19, 'weekly': 20, 'Saturday': 21}

TAGS:
LEXICON SAMPLE (18 total items):
{'B-per': 2, 'I-per': 3, 'O': 4, 'B-gpe': 5, 'B-tim': 6, 'I-tim': 7, 'B-org': 8, 'B-geo': 9, 'I-org': 10, 'B-art': 11, 'I-geo': 12, 'B-eve': 13, 'I-eve': 14, 'I-gpe': 15, 'I-art': 16, 'B-nat': 17, 'I-nat': 18, '<UNK>': 1}


In [8]:
def ent_tagger(sentence):
    # Pair each word with its tag; the column is named token_sents (not token_sent)
    return list(zip(sentence.token_sents, sentence.token_tags))

### Step 1a: Formatting the data
Data will need to be

1. Indexed
2. Limited by vocabulary (i.e., replace tokens with UNKNOWN if they are too rare; choose a reasonable frequency limit based on your survey of the data and on model performance)
3. Padded
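
The three steps can be sketched on a toy corpus without Keras. This is a minimal illustration, not the assignment solution: `build_lexicon`, `to_idxs`, and `pad_left` are hypothetical helper names, and `pad_left` assumes the default behaviour of `pad_sequences` (zeros added at the front, overlong sequences truncated from the front).

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved indices: 0 for padding, 1 for unknown words

def build_lexicon(token_seqs, min_freq=2):
    """Index every token seen at least min_freq times, starting at 2."""
    counts = Counter(tok for seq in token_seqs for tok in seq)
    kept = [tok for tok, c in counts.items() if c >= min_freq]
    lexicon = {tok: i + 2 for i, tok in enumerate(kept)}
    lexicon["<UNK>"] = UNK
    return lexicon

def to_idxs(seq, lexicon):
    """Replace each token with its index; rare or unseen tokens become UNK."""
    return [lexicon.get(tok, UNK) for tok in seq]

def pad_left(idxs, max_len):
    """Keep the last max_len items and left-pad with zeros."""
    idxs = idxs[-max_len:]
    return [PAD] * (max_len - len(idxs)) + idxs

toy = [["the", "cat", "sat"], ["the", "dog"], ["the", "cat"]]
lex = build_lexicon(toy, min_freq=2)
padded = [pad_left(to_idxs(s, lex), 4) for s in toy]
print(lex)
print(padded)
```

Here "sat" and "dog" each occur once, so with `min_freq=2` both map to the unknown index.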

In [9]:
'''Make a dictionary where the string representation of a lexicon item can be retrieved from its numerical index'''

def get_lexicon_lookup(lexicon):
    lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
    print("LEXICON LOOKUP SAMPLE:")
    print(dict(list(lexicon_lookup.items())[:20]))
    return lexicon_lookup

In [10]:
def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

train_sents['Sentence_Idxs'] = tokens_to_idxs(train_sents['token_sents'], words_lexicon)
train_sents['Tag_Idxs'] = tokens_to_idxs(train_sents['token_tags'], tags_lexicon)
train_sents[['token_sents', 'Sentence_Idxs', 'token_tags', 'Tag_Idxs']][:10]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Sentence #,,token_sents,Sentence_Idxs,token_tags,Tag_Idxs
Sentence: 30966,0,"[President, Bush, has, outlined, the, agenda, ...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 4...","[B-per, I-per, O, O, O, O, O, O, O, O, O, O, O...","[2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 32442,0,"[Commuters, were, angered, Tuesday, morning, w...","[1, 26, 27, 28, 22, 29, 30, 31, 32, 33, 17, 34...","[O, O, O, B-tim, I-tim, O, O, O, O, O, O, O, O...","[4, 4, 4, 6, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4]"
Sentence: 13584,0,"[Retirement, and, social, assistance, pensions...","[1, 14, 36, 37, 38, 26, 39, 40, 25]","[O, O, O, O, O, O, O, O, O]","[4, 4, 4, 4, 4, 4, 4, 4, 4]"
Sentence: 38902,0,"[The, Shepherd, did, so, ,, and, the, Lion, ,,...","[41, 42, 43, 44, 45, 14, 6, 46, 45, 47, 48, 49...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 38245,0,"[Lee, ,, a, former, Hyundai, executive, and, S...","[59, 45, 60, 61, 62, 63, 14, 64, 65, 66, 67, 6...","[B-per, O, O, O, B-org, O, O, B-geo, O, O, O, ...","[2, 4, 4, 4, 8, 4, 4, 9, 4, 4, 4, 4, 8, 10, 10..."
Sentence: 8197,0,"[The, largest, of, sea, turtles, roams, the, w...","[41, 80, 17, 81, 82, 1, 6, 83, 71, 84, 45, 1, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
Sentence: 47512,0,"[The, region, is, located, along, a, major, As...","[41, 88, 89, 90, 91, 60, 92, 93, 94, 95, 8, 96...","[O, O, O, O, O, O, O, O, O, O, O, O, O]","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]"
Sentence: 21414,0,"[Nobel, laureate, and, former, U.S., Vice, Pre...","[97, 98, 14, 61, 99, 100, 2, 101, 102, 4, 103,...","[B-art, O, O, O, B-geo, B-per, I-per, I-per, I...","[11, 4, 4, 4, 9, 2, 3, 3, 3, 4, 4, 4, 8, 4, 2,..."
Sentence: 30009,0,"[The, infrastructure, loans, are, part, of, $,...","[41, 110, 111, 112, 113, 17, 114, 115, 116, 12...","[O, O, O, O, O, O, O, O, O, O, O, O, B-geo, O,...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 9, 4, 4, ..."
Sentence: 25218,0,"[A, major, figure, in, U.S., journalism, ,, Ti...","[127, 92, 128, 12, 99, 129, 45, 130, 131, 45, ...","[O, O, O, O, B-geo, O, O, B-per, I-per, O, O, ...","[4, 4, 4, 4, 9, 4, 4, 2, 3, 4, 4, 4, 4, 9, 4, ..."


In [11]:
def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; 
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

max_seq_len = max([len(idx_seq) for idx_seq in train_sents['Sentence_Idxs']]) # Get length of longest sequence
train_padded_words = pad_idx_seqs(train_sents['Sentence_Idxs'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1
train_padded_tags = pad_idx_seqs(train_sents['Tag_Idxs'],
                                 max_seq_len + 1)  #Add one to max length for offsetting sequence by 1

print("WORDS:\n", train_padded_words)
print("SHAPE:", train_padded_words.shape, "\n")

print("TAGS:\n", train_padded_tags)
print("SHAPE:", train_padded_tags.shape, "\n")

WORDS:
 [[   0    0    0 ...   23   24   25]
 [   0    0    0 ...   34   35   25]
 [   0    0    0 ...   39   40   25]
 ...
 [   0    0    0 ...    6  318   25]
 [   0    0    0 ...  259  374   25]
 [   0    0    0 ... 3531  781   25]]
SHAPE: (40765, 105) 

TAGS:
 [[0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 ...
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]
 [0 0 0 ... 4 4 4]]
SHAPE: (40765, 105) 



### Step 2. Build the model

Here we will build a Bidirectional LSTM-CRF model using the `Bidirectional` wrapper from Keras and the `CRF` layer from keras-contrib.

**Documentation and source code:**

https://keras.io/layers/wrappers/#bidirectional

https://github.com/keras-team/keras-contrib

Fit your model with a validation split of 0.1; feel free to use as many epochs as you like. Base your predictions on both the input words **and** the tags of the previous words, as in the POS example.

After building your model, grade its performance on your test set, both by comparing your predicted output to the actual tags (*at least 3 examples*) and by calculating the averaged precision and recall for your tags.
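
Conditioning on the previous tag works by offsetting the inputs by one position. A minimal sketch with plain lists, assuming sequences that were left-padded to `max_seq_len + 1` as in Step 1a (the function name `shift_for_training` is made up for this illustration):

```python
def shift_for_training(padded_words, padded_tags):
    """Split sequences padded to length T+1 into inputs and targets of length T."""
    word_in = [w[1:] for w in padded_words]   # word at step t
    tag_in = [t[:-1] for t in padded_tags]    # tag at step t-1 (0 before the first word)
    target = [t[1:] for t in padded_tags]     # tag at step t, which the model must predict
    return word_in, tag_in, target

# One toy sentence of 3 tokens, left-padded to max_seq_len + 1 = 5.
words = [[0, 0, 7, 8, 9]]
tags = [[0, 0, 2, 4, 4]]
word_in, tag_in, target = shift_for_training(words, tags)
print(word_in, tag_in, target)
```

At every step the model sees the current word together with the tag of the word before it, and the target is the tag of the current word.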

In [12]:
from keras.models import Model
from keras.layers import Input, concatenate, Concatenate, TimeDistributed, Dense, Bidirectional
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU, LSTM
from keras_contrib.layers import CRF
from keras.optimizers import Adam
from keras import regularizers
from keras.callbacks import ModelCheckpoint

In [13]:
def create_model(seq_input_len, n_word_input_nodes, n_tag_input_nodes, n_word_embedding_nodes,
                 n_tag_embedding_nodes, n_hidden_nodes, n_dense_nodes, 
                 stateful=False, batch_size=None):
    
    #Layers 1
    word_input = Input(batch_shape=(batch_size, seq_input_len), name='word_input_layer')
    tag_input = Input(batch_shape=(batch_size, seq_input_len), name='tag_input_layer')

    #Layers 2
    word_embeddings = Embedding(input_dim=n_word_input_nodes,
                                output_dim=n_word_embedding_nodes, 
                                mask_zero=True, name='word_embedding_layer')(word_input) #mask_zero will ignore 0 padding
    #Output shape = (batch_size, seq_input_len, n_word_embedding_nodes)
    tag_embeddings = Embedding(input_dim=n_tag_input_nodes,
                               output_dim=n_tag_embedding_nodes,
                               mask_zero=True, name='tag_embedding_layer')(tag_input) 
    #Output shape = (batch_size, seq_input_len, n_tag_embedding_nodes)
    
    #Layer 3
#     merged_embeddings = Concatenate(axis=-1, name='concat_embedding_layer')([word_embeddings, tag_embeddings])
    merged_embeddings = concatenate([word_embeddings, tag_embeddings], name='concat_embedding_layer')
    #Output shape =  (batch_size, seq_input_len, n_word_embedding_nodes + n_tag_embedding_nodes)
    
    #Layer 4
    hidden_layer = Bidirectional(LSTM(units=n_hidden_nodes, return_sequences=True, 
                                     stateful=stateful, name='hidden_layer'))(merged_embeddings)
#     hidden_layer = Bidirectional(GRU(units=n_hidden_nodes, return_sequences=True, 
#                                      stateful=stateful, name='hidden_layer', 
#                                      recurrent_regularizer=regularizers.l2(.01),
#                                      kernel_regularizer=regularizers.l2(0.01),
#                                      activity_regularizer=regularizers.l2(0.01)))(merged_embeddings)
    #Output shape = (batch_size, seq_input_len, 2 * n_hidden_nodes), since Bidirectional concatenates both directions by default
    
    #Layer 5
    dense_layer = TimeDistributed(Dense(units=n_dense_nodes, activation='relu'), name='dense_layer')(hidden_layer)

    #Layer 6
    crf = CRF(units=n_tag_input_nodes, learn_mode='marginal', sparse_target=True, name='output_layer')
#     output_layer = crf(hidden_layer)
    output_layer = crf(dense_layer)
    # Output shape = (batch_size, seq_input_len, n_tag_input_nodes)
    
    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[word_input, tag_input], outputs=output_layer)
#     adamOpt = Adam(clipvalue = 1, clipnorm = 1)
    model.compile(loss=crf.loss_function, optimizer="rmsprop", metrics=[crf.accuracy])
    
    return model



#     output_layer = TimeDistributed(Dense(units=n_tag_input_nodes, 
#                                          activation='softmax'), name='output_layer')(hidden_layer)
#     # Output shape = (batch_size, seq_input_len, n_tag_input_nodes)
    
#     #Specify which layers are input and output, compile model with loss and optimization functions
#     model = Model(inputs=[word_input, tag_input], outputs=output_layer)
#     model.compile(loss="sparse_categorical_crossentropy",
#                   optimizer='adam', metrics=['accuracy'])
#     return model


In [14]:
n_word_embedding_nodes=300
n_tag_embedding_nodes=150
n_hidden_nodes=400
n_dense_nodes=100

In [15]:
model = create_model(seq_input_len=train_padded_words.shape[-1] - 1, #subtract 1 from matrix length because of offset
                     n_word_input_nodes=len(words_lexicon) + 1, #Add one for 0 padding
                     n_tag_input_nodes=len(tags_lexicon) + 1, #Add one for 0 padding
                     n_word_embedding_nodes=n_word_embedding_nodes,
                     n_tag_embedding_nodes=n_tag_embedding_nodes,
                     n_hidden_nodes=n_hidden_nodes, 
                     n_dense_nodes=n_dense_nodes)

In [16]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
word_input_layer (InputLayer)   (None, 104)          0                                            
__________________________________________________________________________________________________
tag_input_layer (InputLayer)    (None, 104)          0                                            
__________________________________________________________________________________________________
word_embedding_layer (Embedding (None, 104, 300)     5650800     word_input_layer[0][0]           
__________________________________________________________________________________________________
tag_embedding_layer (Embedding) (None, 104, 150)     2850        tag_input_layer[0][0]            
__________________________________________________________________________________________________
concat_emb
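
The parameter counts of the two embedding layers in the summary can be checked by hand: an embedding layer stores one vector per index, so its size is the number of indices times the embedding width. A small sketch using the lexicon sizes printed earlier (`embedding_params` is a helper name invented for this check):

```python
def embedding_params(vocab_size, embedding_dim):
    """An embedding layer holds one vector of size embedding_dim per index."""
    return vocab_size * embedding_dim

n_words = 18835 + 1  # len(words_lexicon) plus one for the padding index 0
n_tags = 18 + 1      # len(tags_lexicon) plus one for the padding index 0

print(embedding_params(n_words, 300))  # 5650800, matches word_embedding_layer
print(embedding_params(n_tags, 150))   # 2850, matches tag_embedding_layer
```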

In [17]:
filepath="./models/ner_temp_model_weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1)
callbacks_list = [checkpoint]

In [18]:
'''Train the model'''

# output matrix (y) has an extra 3rd dimension added because the sparse-target loss expects one integer label per time step
model.fit(x=[train_padded_words[:,1:], train_padded_tags[:,:-1]], 
          y=train_padded_tags[:, 1:, None], 
          batch_size=128, epochs=15, validation_split=.1, 
          callbacks=callbacks_list)
# model.save_weights('models/ner_temp_model_weights.h5') #Save model

Train on 36688 samples, validate on 4077 samples
Epoch 1/15
Epoch 00001: saving model to ./models/ner_temp_model_weights-01-0.1598.hdf5
Epoch 2/15
Epoch 00002: saving model to ./models/ner_temp_model_weights-02-0.0009.hdf5
Epoch 3/15
Epoch 00003: saving model to ./models/ner_temp_model_weights-03-0.0002.hdf5
Epoch 4/15
Epoch 00004: saving model to ./models/ner_temp_model_weights-04-0.0001.hdf5
Epoch 5/15
Epoch 00005: saving model to ./models/ner_temp_model_weights-05-0.0000.hdf5
Epoch 6/15
Epoch 00006: saving model to ./models/ner_temp_model_weights-06-0.0000.hdf5
Epoch 7/15
Epoch 00007: saving model to ./models/ner_temp_model_weights-07-0.0000.hdf5
Epoch 8/15
Epoch 00008: saving model to ./models/ner_temp_model_weights-08-0.0000.hdf5
Epoch 9/15
Epoch 00009: saving model to ./models/ner_temp_model_weights-09-0.0000.hdf5
Epoch 10/15
Epoch 00010: saving model to ./models/ner_temp_model_weights-10-0.0000.hdf5
Epoch 11/15
Epoch 00011: saving model to ./models/ner_temp_model_weights-11-0.00

<keras.callbacks.History at 0x246a16abcc0>

In [19]:
with open('models/words_lexicon.pkl', 'rb') as f:
    words_lexicon = pickle.load(f)
    
with open('models/tags_lexicon.pkl', 'rb') as f:
    tags_lexicon = pickle.load(f)

tags_lexicon_lookup = get_lexicon_lookup(tags_lexicon)

predictor_model = create_model(seq_input_len=1,
                               n_word_input_nodes=len(words_lexicon) + 1,
                               n_tag_input_nodes=len(tags_lexicon) + 1,
                               n_word_embedding_nodes=n_word_embedding_nodes,
                               n_tag_embedding_nodes=n_tag_embedding_nodes,
                               n_hidden_nodes=n_hidden_nodes, 
                               n_dense_nodes=n_dense_nodes,
                               stateful=True,
                               batch_size=1)

#Transfer the weights from the trained model
predictor_model.load_weights('./models/ner_temp_model_weights-15-0.0000.hdf5')

LEXICON LOOKUP SAMPLE:
{2: 'B-per', 3: 'I-per', 4: 'O', 5: 'B-gpe', 6: 'B-tim', 7: 'I-tim', 8: 'B-org', 9: 'B-geo', 10: 'I-org', 11: 'B-art', 12: 'I-geo', 13: 'B-eve', 14: 'I-eve', 15: 'I-gpe', 16: 'I-art', 17: 'B-nat', 18: 'I-nat', 1: '<UNK>'}


In [20]:
'''Load the test set and apply same processing steps performed above for training set'''

test_sents['Sentence_Idxs'] = tokens_to_idxs(test_sents['token_sents'], words_lexicon)
test_sents['Tag_Idxs'] = tokens_to_idxs(test_sents['token_tags'], tags_lexicon)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [21]:
'''Predict tags for sentences in test set'''

import numpy

pred_tags = []
for _, sent in test_sents.iterrows():
    tok_sent = sent['token_sents']
    sent_idxs = sent['Sentence_Idxs']
    sent_gold_tags = sent['token_tags']
    sent_pred_tags = []
    prev_tag = 1  #initialize predicted tag sequence with the <UNK> tag index (1), not padding (0)
#     prev_tag = 0  #initialize predicted tag sequence with padding
    for cur_word in sent_idxs:
        # cur_word and prev_tag are just integers, but the model expects an input array
        # with the shape (batch_size, seq_input_len), so prepend two dimensions to these values
        p_next_tag = predictor_model.predict(x=[numpy.array(cur_word)[None, None],
                                                numpy.array(prev_tag)[None, None]])[0]
        prev_tag = numpy.argmax(p_next_tag, axis=-1)[0]
        sent_pred_tags.append(prev_tag)
    predictor_model.reset_states()

    #Map tags back to string labels
    sent_pred_tags = [tags_lexicon_lookup[tag] for tag in sent_pred_tags]
    pred_tags.append(sent_pred_tags) #one list of predicted tags per sentence; no padding to filter here

test_sents['Predicted_token_tags'] = pred_tags

#print sample
for _, sent in test_sents[30:50].iterrows():
    print("SENTENCE:\t{}".format("\t".join(sent['token_sents'])))
    print("PREDICTED:\t{}".format("\t".join(sent['Predicted_token_tags'])))
    print("GOLD:\t\t{}".format("\t".join(sent['token_tags'])))
    print("CORRECT:\t{}".format("\t".join([str(x) for x in np.array(sent['token_tags']) == np.array(sent['Predicted_token_tags'])])), "\n\n")

    

SENTENCE:	The	Geo	and	SAMA	channels	,	were	taken	off	the	air	for	a	short	period	of	time	.
PREDICTED:	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O
GOLD:		O	B-org	O	B-org	O	O	O	O	O	O	O	O	O	O	O	O	O	O
CORRECT:	True	False	True	False	True	True	True	True	True	True	True	True	True	True	True	True	True	True 


SENTENCE:	The	provincial	police	chief	,	Chaudhry	Mohammed	Yaqoob	,	says	the	suspects	are	all	ethnic	Baluch	tribesmen	and	were	arrested	overnight	in	a	series	of	raids	in	Quetta	,	the	capital	of	Baluchistan	province	.
PREDICTED:	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O	O
GOLD:		O	O	O	O	O	B-per	I-per	I-per	O	O	O	O	O	O	O	B-org	O	O	O	O	B-tim	O	O	O	O	O	O	B-geo	O	O	O	O	B-geo	O	O
CORRECT:	True	True	True	True	True	False	False	False	True	True	True	True	True	True	True	False	True	True	True	True	False	True	True	True	True	True	True	False	True	True	True	True	False	True	True 


SENTENCE:	Palestinian	and	Egyptian	security	sources	say	Mahmoud	al-Zahar	crossed	into	Egypt	Thursday	with	a	smal

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
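
The prediction loop in cell 21 is a greedy decoder: it tags one word at a time and feeds each predicted tag back in as the previous tag for the next word. The sketch below shows that control flow in isolation. `toy_predict_step` is a hypothetical stand-in for `predictor_model.predict`, and its rules and index values are invented for the example.

```python
def greedy_decode(word_idxs, predict_step, start_tag=1):
    """Tag a sentence word by word, feeding each predicted tag into the next step."""
    prev_tag = start_tag
    tags = []
    for word in word_idxs:
        scores = predict_step(word, prev_tag)                        # one score per tag index
        prev_tag = max(range(len(scores)), key=scores.__getitem__)   # argmax over tag scores
        tags.append(prev_tag)
    return tags

def toy_predict_step(word, prev_tag):
    """Hypothetical scorer: returns a list of 5 tag scores for one word."""
    scores = [0.0] * 5
    if word == 10:                        # pretend index 10 starts an entity (tag 2)
        scores[2] = 1.0
    elif word == 11 and prev_tag == 2:    # continuation (tag 3) only right after tag 2
        scores[3] = 1.0
    else:                                 # everything else is outside an entity (tag 4)
        scores[4] = 1.0
    return scores

print(greedy_decode([5, 10, 11, 11], toy_predict_step))
```

Because each step commits to a single tag, an early mistake is carried into every later step of the sentence.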


In [22]:
'''Evaluate the model by precision, recall, and F1'''

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

if __name__ == '__main__':
    all_gold_tags = [tag for sent_tags in test_sents['token_tags'] for tag in sent_tags]
    all_pred_tags = [tag for sent_tags in test_sents['Predicted_token_tags'] for tag in sent_tags]
    accuracy = accuracy_score(y_true=all_gold_tags, y_pred=all_pred_tags)
    precision = precision_score(y_true=all_gold_tags, y_pred=all_pred_tags, average='weighted')
    recall = recall_score(y_true=all_gold_tags, y_pred=all_pred_tags, average='weighted')
    f1 = f1_score(y_true=all_gold_tags, y_pred=all_pred_tags, average='weighted')

    print("ACCURACY: {:.3f}".format(accuracy))
    print("PRECISION: {:.3f}".format(precision))
    print("RECALL: {:.3f}".format(recall))
    print("F1: {:.3f}".format(f1))

  'precision', 'predicted', average, warn_for)


ACCURACY: 0.741
PRECISION: 0.741
RECALL: 0.741
F1: 0.740
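
When reading these numbers, note that a support-weighted average of per-tag recall is always equal to plain accuracy, so `average='weighted'` recall adds no information beyond the accuracy line. The sketch below recomputes the metrics by hand on a toy example to show this. `per_tag_scores`, `weighted_average`, and `macro_average` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

def per_tag_scores(gold, pred):
    """Return {tag: (precision, recall, support)} from flat gold and predicted tag lists."""
    tp, pred_n, gold_n = Counter(), Counter(pred), Counter(gold)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
    scores = {}
    for tag in set(gold_n) | set(pred_n):
        precision = tp[tag] / pred_n[tag] if pred_n[tag] else 0.0
        recall = tp[tag] / gold_n[tag] if gold_n[tag] else 0.0
        scores[tag] = (precision, recall, gold_n[tag])
    return scores

def weighted_average(scores, which):
    """Average precision (which=0) or recall (which=1), weighted by gold support."""
    total = sum(s[2] for s in scores.values())
    return sum(s[which] * s[2] for s in scores.values()) / total

def macro_average(scores, which):
    """Unweighted mean over tags, so rare entity tags count as much as 'O'."""
    return sum(s[which] for s in scores.values()) / len(scores)

gold = ["O", "B-org", "O", "B-org", "O", "O"]
pred = ["O", "O", "O", "B-org", "O", "O"]
scores = per_tag_scores(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print("accuracy:", accuracy)
print("weighted recall:", weighted_average(scores, 1))
print("macro recall:", macro_average(scores, 1))
```

Since most tokens are tagged `O`, a macro average (or per-tag scores for the entity tags only) gives a clearer picture of how well entities are recognized.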




### Step 3. Pick a dataset

Pick a dataset that has short text, similar to the sentences you just tagged. Headlines and tweets are good choices.

https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=news&page=1&pageSize=20&size=all&filetype=all&license=all

In [23]:
from keras.preprocessing.text import text_to_word_sequence

In [24]:
# Limit new data to only 100 rows for testing purposes
newData = pd.read_csv('../data/abcnews-date-text.csv', nrows = 100)
newData['token_sents'] = newData['headline_text'].apply(lambda x: text_to_word_sequence(x, lower=False))

In [25]:
newData

Unnamed: 0,publish_date,headline_text,token_sents
0,20030219,aba decides against community broadcasting lic...,"[aba, decides, against, community, broadcastin..."
1,20030219,act fire witnesses must be aware of defamation,"[act, fire, witnesses, must, be, aware, of, de..."
2,20030219,a g calls for infrastructure protection summit,"[a, g, calls, for, infrastructure, protection,..."
3,20030219,air nz staff in aust strike for pay rise,"[air, nz, staff, in, aust, strike, for, pay, r..."
4,20030219,air nz strike to affect australian travellers,"[air, nz, strike, to, affect, australian, trav..."
5,20030219,ambitious olsson wins triple jump,"[ambitious, olsson, wins, triple, jump]"
6,20030219,antic delighted with record breaking barca,"[antic, delighted, with, record, breaking, barca]"
7,20030219,aussie qualifier stosur wastes four memphis match,"[aussie, qualifier, stosur, wastes, four, memp..."
8,20030219,aust addresses un security council over iraq,"[aust, addresses, un, security, council, over,..."
9,20030219,australia is locked into war timetable opp,"[australia, is, locked, into, war, timetable, ..."


In [26]:
# Need to convert lexicon to lower case since all new data are lower case.
# Note: words that differ only by case (e.g. 'The' and 'the') collapse to one key,
# and the entry processed last wins.
all_lower_words_lexicon = {}
for key, val in words_lexicon.items():
    all_lower_words_lexicon[key.lower()] = val

# Restore <UNK> as uppercase to work with previous functions, and drop the lowercase key
all_lower_words_lexicon['<UNK>'] = all_lower_words_lexicon.pop('<unk>')
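
Lower-casing the keys makes differently cased words collide, and the loop above keeps whichever entry comes last. Which entry to keep is a judgement call. As one possible alternative, the sketch below keeps the first index seen and leaves the `<UNK>` key untouched (`lowercase_lexicon` is a hypothetical helper, and the toy indices are made up):

```python
def lowercase_lexicon(lexicon, unk_token="<UNK>"):
    """Lower-case the keys of a word-to-index lexicon, keeping the first index on collisions."""
    lowered = {}
    for word, idx in lexicon.items():
        key = word if word == unk_token else word.lower()
        lowered.setdefault(key, idx)  # keep the earliest entry, e.g. 'The' before 'the'
    return lowered

toy_lexicon = {"The": 41, "the": 6, "Iraq": 77, "<UNK>": 1}
lowered = lowercase_lexicon(toy_lexicon)
print(lowered)
```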

### Step 4. Tag your new data!

Create a modification of the **ent_tagger function** that combines words and tags from your original dataset. Now allow the function to also load new text from your new dataset and output the tags predicted by your trained model alongside the text. Make your function load five random texts from your data and output the tagged text.
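
One way to structure the "five random texts" part is sketched below. It is only an outline under stated assumptions: `tag_random_texts` is a made-up name, tokenization is a plain whitespace split, and `toy_tagger` is a hypothetical stand-in for the trained model so that the example runs on its own.

```python
import random

def tag_random_texts(texts, tag_tokens, n=5, seed=None):
    """Pick n random texts, tokenize on whitespace, and pair each token with a predicted tag."""
    rng = random.Random(seed)
    chosen = rng.sample(texts, min(n, len(texts)))
    tagged = []
    for text in chosen:
        tokens = text.split()
        tagged.append(list(zip(tokens, tag_tokens(tokens))))
    return tagged

def toy_tagger(tokens):
    """Hypothetical stand-in for the model: tags a fixed set of place names as B-geo."""
    places = {"iraq", "australia"}
    return ["B-geo" if tok in places else "O" for tok in tokens]

headlines = [
    "air nz staff in aust strike for pay rise",
    "ambitious olsson wins triple jump",
    "aust addresses un security council over iraq",
    "australia is locked into war timetable opp",
    "bathhouse plans move ahead",
    "act fire witnesses must be aware of defamation",
]

for sent in tag_random_texts(headlines, toy_tagger, n=5, seed=0):
    print(sent)
```

Swapping `toy_tagger` for a function that wraps the trained predictor keeps the sampling and display logic unchanged.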

In [27]:
def predict_ner_tag_from_words(new_sents, words_lexicon=None, print_obs=True):
    '''Predict tags for sentences in new dataset'''
    
    new_sents['Sentence_Idxs'] = tokens_to_idxs(new_sents['token_sents'], words_lexicon)
    print(new_sents)

    pred_tags = []
    for _, sent in new_sents.iterrows():
        tok_sent = sent['token_sents']
        sent_idxs = sent['Sentence_Idxs']
        sent_pred_tags = []
        prev_tag = 1  #initialize predicted tag sequence to UNKNOWN since padding breaks...
#         prev_tag = 0  #initialize predicted tag sequence with padding
        for cur_word in sent_idxs:
            # cur_word and prev_tag are just integers, but the model expects an input array
            # with the shape (batch_size, seq_input_len), so prepend two dimensions to these values
            p_next_tag = predictor_model.predict(x=[numpy.array(cur_word)[None, None],
                                                    numpy.array(prev_tag)[None, None]])[0]
            prev_tag = numpy.argmax(p_next_tag, axis=-1)[0]
            sent_pred_tags.append(prev_tag)
        predictor_model.reset_states()

        #Map tags back to string labels
        sent_pred_tags = [tags_lexicon_lookup[tag] for tag in sent_pred_tags]
        pred_tags.append(sent_pred_tags) #one list of predicted tags per sentence; no padding to filter here

    new_sents['Predicted_token_tags'] = pred_tags

    if print_obs:
        for _, sent in new_sents.iterrows():
            print("SENTENCE:\t{}".format("\t".join(sent['token_sents'])))
            print("PREDICTED:\t{}".format("\t".join(sent['Predicted_token_tags'])))
            print("")


In [28]:
predict_ner_tag_from_words(newData, words_lexicon=all_lower_words_lexicon)

    publish_date                                      headline_text  \
0       20030219  aba decides against community broadcasting lic...   
1       20030219     act fire witnesses must be aware of defamation   
2       20030219     a g calls for infrastructure protection summit   
3       20030219           air nz staff in aust strike for pay rise   
4       20030219      air nz strike to affect australian travellers   
5       20030219                  ambitious olsson wins triple jump   
6       20030219         antic delighted with record breaking barca   
7       20030219  aussie qualifier stosur wastes four memphis match   
8       20030219       aust addresses un security council over iraq   
9       20030219         australia is locked into war timetable opp   
10      20030219  australia to contribute 10 million in aid to iraq   
11      20030219  barca take record as robson celebrates birthda...   
12      20030219                         bathhouse plans move ahead   
13    

[100 rows x 4 columns]
SENTENCE:	aba	decides	against	community	broadcasting	licence
PREDICTED:	O	O	O	O	O	O

SENTENCE:	act	fire	witnesses	must	be	aware	of	defamation
PREDICTED:	O	O	O	O	O	O	O	O

SENTENCE:	a	g	calls	for	infrastructure	protection	summit
PREDICTED:	O	O	O	O	O	O	O

SENTENCE:	air	nz	staff	in	aust	strike	for	pay	rise
PREDICTED:	I-gpe	I-per	I-per	I-per	I-per	I-per	I-per	I-per	I-per

SENTENCE:	air	nz	strike	to	affect	australian	travellers
PREDICTED:	I-gpe	I-per	I-per	I-per	I-per	I-per	I-per

SENTENCE:	ambitious	olsson	wins	triple	jump
PREDICTED:	O	O	O	O	O

SENTENCE:	antic	delighted	with	record	breaking	barca
PREDICTED:	O	O	O	O	O	O

SENTENCE:	aussie	qualifier	stosur	wastes	four	memphis	match
PREDICTED:	O	O	O	O	O	O	O

SENTENCE:	aust	addresses	un	security	council	over	iraq
PREDICTED:	O	O	O	O	O	O	O

SENTENCE:	australia	is	locked	into	war	timetable	opp
PREDICTED:	B-geo	I-geo	I-geo	I-geo	I-geo	I-geo	I-geo

SENTENCE:	australia	to	contribute	10	million	in	aid	to	iraq
PREDICTED:	B-geo	I-g