### Mention Classification Model:

This model takes every sentence that mentions a named entity and classifies it as a dataset mention or not. The training pipeline was as follows:

- Load the training sentences and keep those between 5 and 65 words long.
- Preprocess the sentences
    - Expand abbreviations
    - Remove punctuation
    - Remove non-alphabetic words
    - Convert to lower case
    - Remove stop words
    - Perform word stemming
    - Apply special mappings (e.g. replace "study" with "survey")
- Build a vocabulary
- Tokenize sentences into vectors
- Train the model
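
The preprocessing steps above can be traced on a single sentence. The following is a minimal sketch using only the standard library. The sample sentence, the one-entry abbreviation dictionary and the name `preprocess_sketch` are made up for illustration, and stop-word removal, stemming and the special mapping are left out because they rely on NLTK.

```python
import re
import string

# Hypothetical one-entry abbreviation dictionary (illustration only).
SAMPLE_ABBREVIATIONS = {
    "NHANES": ["National Health and Nutrition Examination Survey"],
}

def preprocess_sketch(sentence, abbdict):
    # 1. expand abbreviations: runs of two or more capital letters
    for abb in re.findall(r"\b[A-Z][A-Z]+\b", sentence):
        if abb in abbdict:
            sentence = sentence.replace(abb, abbdict[abb][0])
    # 2. split on whitespace and strip punctuation from each token
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) for w in sentence.split()]
    # 3. drop tokens that are not purely alphabetic (numbers, empty strings)
    tokens = [w for w in tokens if w.isalpha()]
    # 4. convert to lower case
    return [w.lower() for w in tokens]

tokens = preprocess_sketch("Data from NHANES (2005) were used.", SAMPLE_ABBREVIATIONS)
print(tokens)
```

The year is dropped by the alphabetic filter, and the abbreviation is replaced by its full name before the text is lower-cased.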

### Classifier
An artificial neural network (ANN) classifier was used for this purpose. We were inspired by the wide use of CNNs for document classification, but also recognized that LSTMs are better at modeling intricate linguistic properties, especially those with long-range dependencies. We therefore tested both LSTMs and CNNs on this task, and CNNs gave better results (our comparison of the two was not exhaustive in terms of architecture and hyperparameters). The model tended to overfit very quickly, so we limited training to a few epochs and applied heavy dropout, because we wanted it to generalize rather than memorize the training data. After hyperparameter tuning and trying different data cleaning methods, we reached about 95% accuracy on the held-out test split. The important observations are:
- the abbreviation expansion module improved accuracy by 7%
- word stemming improved accuracy by 6%
- word2vec and GloVe word embeddings did not perform well
- training our own embedding layer together with the model performed well

### Model Architecture:
After hyperparameter tuning, the following architecture was selected:
<img src="modelArchi.png">


In [None]:
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
import string
import json
import numpy as np
import pandas as pd
import re

In [2]:
# load doc into memory
def load_doc(filename):
    # open the file read-only; the context manager closes it even on error
    with open(filename, 'r') as file:
        # read all text
        return file.read()

In [3]:
## save a list as a text file, one item per line
def save_list(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

In [4]:
# Load word stemmer
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
 
ps = PorterStemmer()

In [5]:
# Load positive sentences
positiveSentences = load_doc('all_positive_sentences.txt').split('\n')

In [6]:
# Load negative sentences
negativeSentences = load_doc('all_negative_sentences.txt').split('\n')

In [7]:
# Filter sentences by length (word count strictly between 5 and 65)
positiveSentences = [s for s in positiveSentences if 5 < len(s.split()) < 65]
negativeSentences = [s for s in negativeSentences if 5 < len(s.split()) < 65]

In [8]:
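## Randomly downsample negative examples to balance class sizes
from sklearn.model_selection import train_test_split
# fraction of negatives to keep so that both classes end up the same size
# (assumes there are more negative than positive sentences, so ratio < 1)
ratio = len(positiveSentences)/len(negativeSentences)
# train_test_split is used only as a random sampler: the "test" part is the kept subset
_, negativeSentences, _, _ = train_test_split(negativeSentences, np.ones(len(negativeSentences)), test_size=ratio)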
len(negativeSentences)

33625
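
The cell above uses `train_test_split` purely as a random sampler. A more direct way to draw the same kind of subset is `random.sample`. The sketch below assumes the negative class is the larger one; the helper name `downsample`, the fixed seed and the toy lists are choices made here, not part of the original notebook.

```python
import random

def downsample(majority, target_size, seed=42):
    """Return target_size items drawn without replacement from majority."""
    # a local generator keeps the global random state untouched
    rng = random.Random(seed)
    # never ask for more items than are available
    return rng.sample(majority, min(target_size, len(majority)))

# toy data standing in for the real sentence lists
positives = ["p%d" % i for i in range(3)]
negatives = ["n%d" % i for i in range(10)]
balanced_negatives = downsample(negatives, len(positives))
print(len(balanced_negatives))
```

Fixing the seed makes the selected subset reproducible across runs, which the original cell does not guarantee because it passes no `random_state`.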

In [9]:
## load abbreviations
file = 'abbreviations.json'
abbtext = load_doc(file)
abbreviations = json.loads(abbtext)

In [10]:
def findAbbreviation(sentence):
    # candidate abbreviations: whole words made of two or more capital letters
    regex = r"\b[A-Z][A-Z]+\b"
    return re.findall(regex, sentence)

In [11]:
def expandAbbreviation(sentence, abbdict):
    abbs = findAbbreviation(sentence)
    for a in abbs:
        if a in abbdict:
            sentence = sentence.replace(a,abbdict[a][0])
    return sentence
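
One caveat with the function above: `str.replace` substitutes every occurrence of the abbreviation string, including occurrences inside a longer capitalized token. A possible variant that only replaces whole-word matches is sketched below. The name `expand_abbreviation_whole_word` and the sample dictionary are made up for illustration, and the dictionary layout (abbreviation mapped to a list of expansions, first one used) is assumed to match `abbreviations.json`.

```python
import re

# same pattern as findAbbreviation: whole words of two or more capital letters
ABBREVIATION_RE = re.compile(r"\b[A-Z][A-Z]+\b")

def expand_abbreviation_whole_word(sentence, abbdict):
    def substitute(match):
        abb = match.group(0)
        # use the first listed expansion, or keep the token unchanged
        return abbdict[abb][0] if abb in abbdict else abb
    # re.sub calls substitute once per whole-word match
    return ABBREVIATION_RE.sub(substitute, sentence)

sample = {"US": ["United States"]}
expanded = expand_abbreviation_whole_word("US and USA data", sample)
print(expanded)
```

With plain `str.replace`, the same input would also rewrite the first two letters of "USA". Whether this difference matters for accuracy was not measured here.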

In [12]:
def specialMapping(word):
    # tokens are already stemmed at this point; 'studi' is the Porter stem of 'study'
    if word == 'studi':
        return 'survey'
    else:
        return word

In [34]:
# build the stop-word set and the punctuation table once, not on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

# turn a doc into clean tokens
def clean_doc(doc):
    # expand abbreviations
    doc = expandAbbreviation(doc, abbreviations)
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    tokens = [w.translate(PUNCT_TABLE) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    # filter out stop words
    tokens = [w for w in tokens if w not in STOP_WORDS]
    # stemming
    tokens = [ps.stem(word) for word in tokens]
    # special mapping
    tokens = [specialMapping(word) for word in tokens]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

## Build Vocab

In [35]:
# clean a sentence and add its tokens to the vocab
def add_doc_to_vocab(sentence, vocab):
    # clean the sentence
    tokens = clean_doc(sentence)
    # update counts
    vocab.update(tokens)

In [36]:
def process_docs(sentences, vocab):
    # add every sentence to the vocabulary counts
    for sentence in sentences:
        add_doc_to_vocab(sentence, vocab)

In [37]:
# define vocab
vocab1 = Counter()
# add all docs to vocab
process_docs(positiveSentences, vocab1)
process_docs(negativeSentences, vocab1)
# print the size of the vocab
print(len(vocab1))
# print the top words in the vocab
print(vocab1.most_common(10))

41776
[('survey', 43501), ('health', 32462), ('nation', 31143), ('examin', 20376), ('nutrit', 18962), ('data', 11460), ('use', 11258), ('age', 7899), ('sampl', 6195), ('iii', 5258)]


In [38]:
# keep tokens that occur at least 50 times
min_occurrence = 50
tokens = [k for k,c in vocab1.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'bmvocab.txt')

2233


## Save Prepared Data

In [39]:
# clean a sentence and return it as a line of in-vocabulary tokens
def doc_to_line(sentence, vocab):
    # clean the sentence
    tokens = clean_doc(sentence)
    # filter by vocab
    tokens = [w for w in tokens if w in vocab]
    return ' '.join(tokens)

In [40]:
# clean all sentences (this redefines process_docs from the vocabulary step)
def process_docs(sentences, vocab):
    lines = list()
    # go through all sentences
    for sentence in sentences:
        # clean the sentence and keep only vocabulary tokens
        line = doc_to_line(sentence, vocab)
        # add to list
        lines.append(line)
    return lines

In [41]:
# load vocabulary
vocab_filename = 'bmvocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

In [42]:
# prepare negative sentences
negative_lines = process_docs(negativeSentences, vocab)
save_list(negative_lines, 'negative.txt')
# prepare positive sentences
positive_lines = process_docs(positiveSentences, vocab)
save_list(positive_lines, 'positive.txt')

## Model

In [43]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
# Conv1D and MaxPooling1D are exported from keras.layers directly
from keras.layers import Dense, Flatten, Embedding, Conv1D, MaxPooling1D, Dropout
from keras import regularizers

In [44]:
# load all prepared sentences
positive_docs = load_doc('positive.txt').split('\n')
negative_docs = load_doc('negative.txt').split('\n')
x = negative_docs + positive_docs
y = array([0 for _ in range(len(negative_docs))] + [1 for _ in range(len(positive_docs))])
#train_docs = negative_docs + positive_docs

In [45]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [46]:
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(X_train)

In [47]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(X_train)
# pad sequences
max_length = max([len(s.split()) for s in X_train])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

In [48]:
max_length

66

In [49]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(X_test)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
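
For readers unfamiliar with `pad_sequences`, the pure-Python sketch below mirrors what the two padding cells do with `padding='post'`: shorter sequences get zeros appended, and, under the Keras default `truncating='pre'`, sequences longer than `maxlen` lose tokens from the front. The function name `pad_post` is made up here and assumes `maxlen` is positive; it is an illustration, not a replacement for the Keras call.

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        # keep only the last maxlen items (truncation at the front)
        seq = list(seq)[-maxlen:]
        # append the padding value until the sequence reaches maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

example = pad_post([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(example)
```

Because `max_length` is computed from the training split only, any longer test sentence is truncated in this way before it reaches the model.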

In [50]:
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1

In [51]:
# define model
# model = Sequential()
# model.add(Embedding(vocab_size, 100, input_length=max_length))
# model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
# model.add(MaxPooling1D(pool_size=2))
# model.add(Flatten())
# model.add(Dense(10, activation='relu'))
# model.add(Dense(1, activation='sigmoid'))
# print(model.summary())

# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
#model.add(Dropout(0.5))
model.add(Conv1D(filters=16, kernel_size=4, activation='relu'))
# f 8 k 16
#model.add(Dropout(0.5))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 66, 100)           223400    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 63, 16)            6416      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 31, 16)            0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 496)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 496)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                4970      
_________________________________________________________________
dropout_4 (Dropout)          (None, 10)                0         
__________
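
The shapes and parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the layer settings, assuming the default 'valid' convolution padding and a stride of 1, which the model definition leaves unchanged. The vocabulary size of 2234 is inferred from the embedding parameter count (223400 / 100).

```python
vocab_size, embed_dim, seq_len = 2234, 100, 66
filters, kernel_size, pool_size, dense_units = 16, 4, 2, 10

# embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim
# 'valid' convolution shortens the sequence by kernel_size - 1
conv_len = seq_len - kernel_size + 1
# one kernel_size x embed_dim weight block per filter, plus one bias per filter
conv_params = kernel_size * embed_dim * filters + filters
# max pooling with pool_size 2 halves the length, rounding down
pool_len = conv_len // pool_size
# flatten concatenates all positions and filters
flat_size = pool_len * filters
# dense layer: weights plus one bias per unit
dense_params = flat_size * dense_units + dense_units

print(embedding_params, conv_len, conv_params, pool_len, flat_size, dense_params)
```

These values match the `Output Shape` and `Param #` columns of the embedding, convolution, pooling, flatten and first dense rows above.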

In [52]:
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [53]:
# fit network
model.fit(Xtrain, y_train, epochs=3, verbose=2)

Epoch 1/3
 - 8s - loss: 0.3034 - acc: 0.8972
Epoch 2/3
 - 8s - loss: 0.2136 - acc: 0.9399
Epoch 3/3
 - 7s - loss: 0.1887 - acc: 0.9458


<keras.callbacks.History at 0x7fd224321ef0>

In [54]:
# evaluate
loss, acc = model.evaluate(Xtest, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 94.944238


## Precision-Recall Analysis

In [33]:
y_prob = model.predict(Xtest).reshape(len(y_test),)

result = []
for t in np.linspace(0, 1, 30):
    # predict positive when the probability exceeds the threshold t
    y_pred = y_prob > t
    # true positives: predicted positive and actually positive
    true_pos = np.sum(y_pred & (y_test == 1))
    acc = np.mean(y_pred == y_test)
    # note: prec is undefined (nan) once no sentence is predicted positive
    prec = true_pos / np.sum(y_pred)
    rec = true_pos / np.sum(y_test)
    fscore = 2 * (prec * rec) / (prec + rec)
    result.append({'t': t, 'Acc': acc, 'Prec': prec, 'Rec': rec, 'F_score': fscore})
print(pd.DataFrame(result))

         Acc   F_score      Prec       Rec         t
0   0.504238  0.670423  0.504238  1.000000  0.000000
1   0.795688  0.830726  0.713394  0.994249  0.034483
2   0.885279  0.897003  0.819490  0.990711  0.068966
3   0.913457  0.919857  0.862826  0.984960  0.103448
4   0.925204  0.929729  0.883329  0.981274  0.137931
5   0.930706  0.934394  0.893992  0.978620  0.172414
6   0.934944  0.938053  0.902220  0.976850  0.206897
7   0.937993  0.940708  0.908292  0.975523  0.241379
8   0.940446  0.942888  0.912881  0.974934  0.275862
9   0.941933  0.944186  0.916100  0.974049  0.310345
10  0.943197  0.945249  0.919548  0.972427  0.344828
11  0.945056  0.946907  0.923357  0.971690  0.379310
12  0.946989  0.948635  0.927455  0.970805  0.413793
13  0.948104  0.949588  0.930634  0.969331  0.448276
14  0.949888  0.951181  0.934795  0.968151  0.482759
15  0.951004  0.952125  0.938422  0.966234  0.517241
16  0.953234  0.954104  0.944388  0.964022  0.551724
17  0.954498  0.955198  0.948532  0.961958  0.
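
The threshold sweep divides by the number of predicted positives, which becomes zero once the threshold exceeds every predicted probability. A sketch of the same metrics computed from confusion counts, with explicit guards for that case, is given below. The function name `threshold_metrics`, the toy labels and the convention of reporting 0.0 for an undefined ratio are assumptions made here.

```python
def threshold_metrics(y_true, y_prob, t):
    """Accuracy, precision, recall and F-score when predicting positive for prob > t."""
    y_pred = [p > t for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for yt, yp in pairs if yp and yt == 1)      # true positives
    fp = sum(1 for yt, yp in pairs if yp and yt == 0)      # false positives
    fn = sum(1 for yt, yp in pairs if not yp and yt == 1)  # false negatives
    tn = len(pairs) - tp - fp - fn                         # true negatives
    acc = (tp + tn) / len(pairs)
    # report 0.0 instead of dividing by zero when a ratio is undefined
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {'t': t, 'Acc': acc, 'Prec': prec, 'Rec': rec, 'F_score': fscore}

toy_true = [1, 1, 0, 0]
toy_prob = [0.9, 0.4, 0.6, 0.1]
mid = threshold_metrics(toy_true, toy_prob, 0.5)
top = threshold_metrics(toy_true, toy_prob, 1.0)
print(mid)
print(top)
```

At a threshold of 1.0 nothing is predicted positive, so precision, recall and F-score are all reported as 0.0 rather than `nan`.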

  


## Saving model and tokenizer for reuse

In [34]:
# serialize model to JSON
model_json = model.to_json()
with open("CNNmodel.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("CNNmodel.h5")
print("Saved model to disk")

Saved model to disk


In [35]:
import pickle

# saving
with open('CNNtokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
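
To reuse the saved artifacts, the architecture, weights and tokenizer have to be loaded back. The following is a minimal sketch, assuming the three files written above sit in the working directory and that a compatible Keras version is installed. The function name `load_classifier` is made up here.

```python
import pickle

def load_classifier(model_json_path="CNNmodel.json",
                    weights_path="CNNmodel.h5",
                    tokenizer_path="CNNtokenizer.pickle"):
    # imported lazily so that defining the function does not require Keras
    from keras.models import model_from_json

    # rebuild the architecture from its JSON description
    with open(model_json_path, "r") as json_file:
        model = model_from_json(json_file.read())
    # restore the trained weights into the rebuilt architecture
    model.load_weights(weights_path)
    # restore the fitted tokenizer so word indices match those used in training
    with open(tokenizer_path, "rb") as handle:
        tokenizer = pickle.load(handle)
    return model, tokenizer
```

New sentences would need the same treatment as the training data before prediction: `clean_doc`, vocabulary filtering, `texts_to_sequences`, and `pad_sequences` with the same `max_length` (66 in this run).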


In [476]:
len(y_test)

20175