# Question 3 - Named Entity Recognition

Named Entity Recognition (NER) is an NLP task whose goal is to identify proper nouns, i.e. names of individuals, locations and organizations.<br>
In this task we use the conll2002 data set.<br> The data set is annotated with both POS and NER tags.

**Copying the word2vec file**<br>
Before running the notebook, download and copy the word-vector file from:<br>
* URL: https://github.com/uchile-nlp/spanish-word-embeddings
* section: GloVe embeddings from SBWC
* file link: "Vector format (.vec.gz) (906 MB)"

to: ./notbookPath/data/glove-sbwc.i25.vec
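
The download is a gzip archive, while the target path above is an uncompressed `.vec` file, so the archive presumably has to be unpacked first. A minimal sketch using only the standard library; the archive name `glove-sbwc.i25.vec.gz` and the helper `unpack_vectors` are assumptions made for illustration and are not part of the original notebook:

```python
import gzip
import shutil
from pathlib import Path

def unpack_vectors(archive_path, target_path):
    """Decompress a .gz archive to target_path, creating parent folders if needed."""
    target = Path(target_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
        # copy in chunks, so the large file is never held in memory at once
        shutil.copyfileobj(src, dst)
    return target

# Hypothetical usage (archive name assumed):
# unpack_vectors("glove-sbwc.i25.vec.gz", "data/glove-sbwc.i25.vec")
```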

---

### Importing the data
First we check that the data exists and look at the format of its records.

In [1]:
from nltk.corpus import conll2002
from nltk.chunk import tree2conlltags

etr = conll2002.chunked_sents('esp.train')  # In Spanish
eta = conll2002.chunked_sents('esp.testa')  # In Spanish
etb = conll2002.chunked_sents('esp.testb')  # In Spanish

dtr = conll2002.chunked_sents('ned.train')  # In Dutch
dta = conll2002.chunked_sents('ned.testa')  # In Dutch
dtb = conll2002.chunked_sents('ned.testb')  # In Dutch

x = etr.__getitem__
print("esp.train:: data point: type %s; value %s" % (type(x), x))

esp.train:: data point: type <class 'method'>; value <bound method LazyMap.__getitem__ of [Tree('S', [Tree('LOC', [('Melbourne', 'NP')]), ('(', 'Fpa'), Tree('LOC', [('Australia', 'NP')]), (')', 'Fpt'), (',', 'Fc'), ('25', 'Z'), ('may', 'NC'), ('(', 'Fpa'), Tree('ORG', [('EFE', 'NC')]), (')', 'Fpt'), ('.', 'Fp')]), Tree('S', [('-', 'Fg')]), ...]>


**Received sentence structure:**<br>
A data point in the data set is a sentence, represented in a tree format: the chunked sentence format.
In a chunked sentence, the tokens are grouped by their function in the sentence. That may be useful for checking the validity of the BIO tags and fixing them, but initially we do not need that structure.<br>Therefore we transform each sentence into a list of 3-tuples of <token, postag, biotag>.<br>

**List sentence structure:**<br>
$\{\langle token, postag, biotag \rangle_1, \ldots, \langle token, postag, biotag \rangle_k\}$

Now we compare the two sentence formats:

In [2]:
# printing one sentence in both formats

sent = etr[0]
sent_vec = tree2conlltags(sent)

print("tree format:\n%s" % sent)
print("\n")
print("list-of-tuples format:\n%s" % sent_vec)

tree format:
(S
  (LOC Melbourne/NP)
  (/Fpa
  (LOC Australia/NP)
  )/Fpt
  ,/Fc
  25/Z
  may/NC
  (/Fpa
  (ORG EFE/NC)
  )/Fpt
  ./Fp)


list-of-tuples format:
[('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'), ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
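
The B-/I-/O convention visible in the list above can be sketched in plain Python. This is only an illustration of the tagging scheme, not NLTK's implementation; the `(label, tokens)` input layout and the name `chunks_to_bio` are assumptions introduced here:

```python
def chunks_to_bio(chunks):
    """Flatten (label, tokens) chunks into (token, postag, biotag) triples.

    `label` is an entity type such as "LOC", or None for tokens outside any entity.
    """
    triples = []
    for label, tokens in chunks:
        for position, (token, postag) in enumerate(tokens):
            if label is None:
                biotag = "O"            # outside any entity
            elif position == 0:
                biotag = "B-" + label   # first token of the entity
            else:
                biotag = "I-" + label   # continuation of the same entity
            triples.append((token, postag, biotag))
    return triples

# First four tokens of the sentence shown above
chunks = [("LOC", [("Melbourne", "NP")]),
          (None,  [("(", "Fpa")]),
          ("LOC", [("Australia", "NP")]),
          (None,  [(")", "Fpt")])]
print(chunks_to_bio(chunks))
```

For these four tokens the sketch yields the same triples as the first four entries of the list-of-tuples output.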


---

### Choosing initial features for vector representation
Before training, we need to generate a vector representation for each token in the data.
**Feature selection** is a task that often requires prior knowledge of the problem and of which features are efficient predictors.

For the initial stage, we selected the following features:
* word form, in lowercase
* part of speech
* whether the token is a digit string, and whether it is all uppercase
* prefixes of length {1,2}
* suffixes of length {1,2}

### Vectorizing a word
The following code generates a vector from a word in a given sentence:<br>

**Note: the infrastructure is prepared so that the model order can be chosen flexibly**

In [3]:
def bool2int(flag):
    # map a boolean to 1 / 0
    if flag:           return 1
    else:              return 0

def setBoolVal(flag=None):
    # True -> +1.0, False -> -1.0, None (padding) -> 0.0
    if flag is None:   return 0.0
    else:              return 2.0*(bool2int(flag)-0.5)

def getAdjWordFeatures(token=None, postag=None):
    # manually selected features for the adjacent word
    if (token == None):
        features = ["",
                    postag,
                    0,
                    0]
    else:
        features = [token.lower(),
                    postag,
                    setBoolVal(token.isdigit()),
                    setBoolVal(token.isupper())]
    return features

def getWordFeatures(token, postag):
    # manually selected features for the word
    features = [token.lower(),
                postag,
                setBoolVal(token.isdigit()),
                setBoolVal(token.isupper()),
                token[:1],
                token[:2],
                token[-1:],
                token[-2:]]
    return features

def word2features(sent, i, order):
    # sent      list of 3-tuples, each tuple is <token, POS, NER-tag>
    # i         index for tuple for feature extraction
    # order     number of context words on each side to be added to the features
    token  = sent[i][0]
    postag = sent[i][1]
    # set token features
    features = []
    features.extend(getWordFeatures(token,postag))

    # adding features for order-previous and order successive tokens
    for j in range(order):
        # indexes of the (j+1)-th previous and next tokens
        prv = (i - (j + 1))
        nxt = (i + (j + 1))

        if (prv >= 0):
            # token exist
            prvToken  = sent[prv][0]
            prvPostag = sent[prv][1]
            features.extend(getAdjWordFeatures(prvToken, prvPostag))
        else:
            # add pad
            features.extend(getAdjWordFeatures(token=None,postag="BOS"))
        if (nxt<len(sent)):
            # token exist
            nxtToken  = sent[nxt][0]
            nxtPostag = sent[nxt][1]
            features.extend(getAdjWordFeatures(nxtToken, nxtPostag))
        else:
            # add pad
            features.extend(getAdjWordFeatures(token=None,postag="EOS"))

    return features

### 3.1.1 Feature extraction:
We now sample 2 words from a sentence and print their vector representations under the features we have selected.<br>

**Note:**<br>
We add adjacent words at a configurable distance from the current word. That requires special attention in the vectorization process, since we want an equal number of features for all words (also for words whose neighbourhood is **out of sentence bounds**).<br>
Therefore padding is required for these out-of-bounds indexes, and we need to set neutral values for the features associated with them.

Chosen neutral values:
* boolean-valued features are mapped to **{-1,0,+1}** where **-1=False, +1=True, 0=padding**
* For padded tokens, the POS tag is set to **'BOS' for beginning of sentence and 'EOS' for end of sentence**, and the word form to the empty string
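
Because every missing neighbour is padded, the vector length depends only on the model order: each token contributes 8 features of its own and 4 features per context word. A small sketch of that arithmetic (the constant and function names are hypothetical, introduced only for this check):

```python
OWN_FEATURES = 8       # lowercase form, POS, isdigit, isupper, 2 prefixes, 2 suffixes
CONTEXT_FEATURES = 4   # lowercase form, POS, isdigit, isupper

def expected_feature_count(order):
    """Vector length for a model with `order` context words on each side."""
    return OWN_FEATURES + 2 * order * CONTEXT_FEATURES

for order in (0, 1, 2):
    print(order, expected_feature_count(order))
```

For order 2 this gives 24, which matches the number of entries in the order-2 vectors printed in this section.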

The following code shows two sample words whose context window extends beyond the sentence bounds:

In [4]:
order = 2

tokenIndex = 0
print("sample a token: index in sentence: %d; type %s; value %s" % (tokenIndex, type(sent_vec[tokenIndex]), sent_vec[tokenIndex]))
print(word2features(sent_vec,tokenIndex,order))

tokenIndex = len(sent_vec)-1
print("sample a token: index in sentence: %d; type %s; value %s" % (tokenIndex, type(sent_vec[tokenIndex]), sent_vec[tokenIndex]))
print(word2features(sent_vec,tokenIndex,order))

sample a token: index in sentence: 0; type <class 'tuple'>; value ('Melbourne', 'NP', 'B-LOC')
['melbourne', 'NP', -1.0, -1.0, 'M', 'Me', 'e', 'ne', '', 'BOS', 0, 0, '(', 'Fpa', -1.0, -1.0, '', 'BOS', 0, 0, 'australia', 'NP', -1.0, -1.0]
sample a token: index in sentence: 10; type <class 'tuple'>; value ('.', 'Fp', 'O')
['.', 'Fp', -1.0, -1.0, '.', '.', '.', '.', ')', 'Fpt', -1.0, -1.0, '', 'EOS', 0, 0, 'efe', 'NC', -1.0, 1.0, '', 'EOS', 0, 0]


---

### Data Encoding
Now we encode the data according to the vectorization scheme above.

In [5]:
# encoding data set
def sent2features(sent, order):
    return [word2features(sent, i, order) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

def getX(sentenceDataSet,order):
    xOut = []
    for sent in sentenceDataSet:
        sent_vec = tree2conlltags(sent)
        xOut.extend(sent2features(sent_vec,order))
    return xOut

def getY(sentenceDataSet):
    yOut = []
    for sent in sentenceDataSet:
        sent_vec = tree2conlltags(sent)
        yOut.extend(sent2labels(sent_vec))
    return yOut

def getTokens(sentenceDataSet):
    tokenOut =[]
    for sent in sentenceDataSet:
        sent_vec = tree2conlltags(sent)
        tokenOut.extend(sent2tokens(sent_vec))
    return tokenOut

The code above defines functions that, given a data set of sentences in chunked-sentence format, generate the token data points and their BIO tags for training.<br>We print the first data point and its tag from the preprocessed data set:

In [6]:
print("-> testing encoding\n")
x           = getX(etr, order)
y           = getY(etr)
tokenList   = getTokens(etr)

print("token sample: %s" % tokenList[0])
print("y sample:     %s" % y[0])
print("x sample:     %s" % x[0])

-> testing encoding

token sample: Melbourne
y sample:     B-LOC
x sample:     ['melbourne', 'NP', -1.0, -1.0, 'M', 'Me', 'e', 'ne', '', 'BOS', 0, 0, '(', 'Fpa', -1.0, -1.0, '', 'BOS', 0, 0, 'australia', 'NP', -1.0, -1.0]


---

### 3.1.2 Training
In the first clause we were asked to explore the data and vectorize each word as described in the notebook: http://nbviewer.jupyter.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb

To train with the python-crfsuite package (used in that notebook) or with scikit-learn, we need to generate a similar vector, but in dictionary representation.

**Why dictionary?**<br>
These packages are designed to receive very sparse vectors. To save space, they work with a dictionary representation of a sparse vector: a feature whose value is 0 does not appear in the dictionary at all (there is no key for that feature).<br>

**Pro** Space;<br> Consider just the lowercase word form: with roughly 200,000 distinct word forms, a one-hot encoding would need 200,000 features per token. The dictionary representation handles that sparsity.<br>

**Con** Time complexity;<br> Accessing an element is still O(1), but with a larger constant factor that depends on the dictionary implementation.
Since element access is the most basic operation and is performed at least O(n) times, that constant factor carries over to the total run time.<br>
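
The space argument can be made concrete with a toy comparison. The vocabulary below is made up for illustration, and `one_hot` and `sparse_dict` are hypothetical helpers, not functions of any of the packages mentioned:

```python
def one_hot(word, vocabulary):
    """Dense one-hot list: one slot per vocabulary entry, a single 1.0."""
    return [1.0 if entry == word else 0.0 for entry in vocabulary]

def sparse_dict(word):
    """Dictionary form: only the active feature is stored."""
    return {"token.lower=" + word: 1.0}

# Toy vocabulary (made up; a real one is far larger)
vocabulary = ["melbourne", "australia", "efe", "may", "sao", "paulo"]

dense = one_hot("efe", vocabulary)
sparse = sparse_dict("efe")

print(len(dense), dense)    # one entry per vocabulary word
print(len(sparse), sparse)  # a single key, independent of vocabulary size
```

The dense list grows with the vocabulary, while the dictionary keeps one entry per active feature.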

We show the updated code for that purpose, vectorization to a dictionary representation:

In [7]:
def bool2int(flag):
    # map a boolean to 1 / 0
    if flag:           return 1
    else:              return 0

def setBoolVal(flag=None):
    # True -> 1.0, False -> 0.0, None (padding) -> 0.5
    if flag is None:   return 0.5
    else:              return 1.0*bool2int(flag)

def getPadWordFeatures(token=None, postag=None, pref=""):
    features = {pref + 'token':         token,
                pref + 'postag':        postag,
                pref + 'isdigit':       setBoolVal(None),
                pref + 'isupper':       setBoolVal(None)}
    return features

def getAdjWordFeatures(token=None, postag=None, pref=""):
    # manually selected features for the adjacent word
    features = {pref+'token.lower': token.lower(),
                pref+'postag':      postag,
                pref+'isdigit':     setBoolVal(token.isdigit()),
                pref+'isupper':     setBoolVal(token.isupper())}
    return features

def getWordFeatures(token, postag):
    # manually selected features for the word
    features = {'token.lower':  token.lower(),
                'postag':       postag,
                'isdigit':      setBoolVal(token.isdigit()),
                'isupper':      setBoolVal(token.isupper()),
                'perf1':        token[:1],
                'pref2':        token[:2],
                'suff1':        token[-1:],
                'suff2':        token[-2:]}
    return features

def word2features(sent, i, order):
    # sent      list of 3-tuples, each tuple is <token, POS, NER-tag>
    # i         index for tuple for feature extraction
    # order     number of context words on each side to be added to the features
    token  = sent[i][0]
    postag = sent[i][1]
    # set token features
    features = {}
    features.update(getWordFeatures(token,postag))

    # adding features for order-previous and order successive tokens
    for j in range(order):
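        # indexes of the (j+1)-th previous and next tokens
        prv = (i - (j + 1))
        nxt = (i + (j + 1))
        # feature-name prefix: offset relative to the current token (-1, 1, -2, 2, ...)
        prvStr = str(-(j + 1))
        nxtStr = str(j + 1)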

        if (prv >= 0):
            # token exist
            prvToken  = sent[prv][0]
            prvPostag = sent[prv][1]
            features.update(getAdjWordFeatures(prvToken, prvPostag,prvStr))
        else:
            # add pad
            features.update(getPadWordFeatures(token="PAD",postag="BOS", pref=prvStr))
        if (nxt<len(sent)):
            # token exist
            nxtToken  = sent[nxt][0]
            nxtPostag = sent[nxt][1]
            features.update(getAdjWordFeatures(nxtToken, nxtPostag, nxtStr))
        else:
            # add pad
            features.update(getPadWordFeatures(token="PAD",postag="EOS", pref=nxtStr))

    return features

# encoding data set
def sent2features(sent, order):
    return [word2features(sent, i, order) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

def getX(sentenceDataSet,order):
    xOut = []
    for sent in sentenceDataSet:
        sent_vec = tree2conlltags(sent)
        xOut.extend(sent2features(sent_vec,order))
    return xOut

def getY(sentenceDataSet):
    yOut = []
    for sent in sentenceDataSet:
        sent_vec = tree2conlltags(sent)
        yOut.extend(sent2labels(sent_vec))
    return yOut

def getTokens(sentenceDataSet):
    tokenOut =[]
    for sent in sentenceDataSet:
        sent_vec = tree2conlltags(sent)
        tokenOut.extend(sent2tokens(sent_vec))
    return tokenOut

And we test the vectorized representation of a token again:

In [8]:
print("-> encoding training data\n")
X_train     = getX(etr, order)
Y_train     = getY(etr)
tokenList   = getTokens(etr)
print("data size:  |X_train| = %d; type = %s" % (len(X_train), type(X_train)))
print("label size: |Y_train| = %d; type = %s" % (len(Y_train), type(Y_train)))

print("\n\n-> testing encoding\n")
print("token sample: %s" % tokenList[0])
print("y sample:     %s" % Y_train[0])
print("x sample:     %s" % X_train[0])

-> encoding training data

data size:  |X_train| = 264715; type = <class 'list'>
label size: |Y_train| = 264715; type = <class 'list'>


-> testing encoding

token sample: Melbourne
y sample:     B-LOC
x sample:     {'token.lower': 'melbourne', 'postag': 'NP', 'isdigit': 0.0, 'isupper': 0.0, 'perf1': 'M', 'pref2': 'Me', 'suff1': 'e', 'suff2': 'ne', '-1token': 'PAD', '-1postag': 'BOS', '-1isdigit': 0.5, '-1isupper': 0.5, '1token.lower': '(', '1postag': 'Fpa', '1isdigit': 0.0, '1isupper': 0.0, '-2token': 'PAD', '-2postag': 'BOS', '-2isdigit': 0.5, '-2isupper': 0.5, '2token.lower': 'australia', '2postag': 'NP', '2isdigit': 0.0, '2isupper': 0.0}


### Trainer
For training we use the python-crfsuite package (`pycrfsuite`).

In [9]:
import pycrfsuite

print("-> setting pycrfsuite trainer")
trainer = pycrfsuite.Trainer(verbose=False)

print("-> setting trainer parameters")
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

print("-> setting data to trainer")
trainer.append(X_train,Y_train)
trainer.train('conll2002-esp.crfsuite')

-> setting pycrfsuite trainer
-> setting trainer parameters
-> setting data to trainer


Classifier was trained and saved as an output file in:<br>
**.../conll2002-esp.crfsuite**
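
The `c1` and `c2` parameters set above weight the two penalty terms of the training objective. As a sketch, assuming the usual elastic-net form of a regularized CRF objective (the symbols $w$ for the feature weights and $(x^{(n)}, y^{(n)})$ for the training sequences are notation introduced here, not taken from the notebook):

$$\min_{w} \; -\sum_{n} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2$$

With $c_1 = 1.0$ and $c_2 = 10^{-3}$ the L1 term dominates, which tends to drive many feature weights to exactly zero.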

---

### Testing a single sentence

Before measuring the trained model, we would like a sanity check:<br> - tagging a single sentence and comparing the result with the true tags.

We first encode the testa sentence data set in our format:

In [10]:
print("-> encoding test data with order %d" % order)
X_test      = getX(eta, order)
Y_test      = getY(eta)
token_test  = getTokens(eta)

-> encoding test data with order 2


We now print 2 sample sentences and compare their predicted tags with the true tags;<br>
(**Recall our current model order is 2**)

In [11]:
print("-> setting tagger")
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

def getSentX(sent, order):
    xOut = []
    sent_vec = tree2conlltags(sent)
    xOut.extend(sent2features(sent_vec, order))
    return xOut

def getSentY(sent):
    yOut = []
    sent_vec = tree2conlltags(sent)
    yOut.extend(sent2labels(sent_vec))
    return yOut

def getSentTokens(sent):
    tokenOut = []
    sent_vec = tree2conlltags(sent)
    tokenOut.extend(sent2tokens(sent_vec))
    return tokenOut

example_sent = eta[0]
print("")
print(' '.join(getSentTokens(example_sent)), end='\n\n')
print("Predicted:", ' '.join(tagger.tag(getSentX(example_sent, order))))
print("Real     :", ' '.join(getSentY(example_sent)))

example_sent = eta[2]
print("")
print(' '.join(getSentTokens(example_sent)), end='\n\n')
print("Predicted:", ' '.join(tagger.tag(getSentX(example_sent, order))))
print("Real     :", ' '.join(getSentY(example_sent)))

-> setting tagger

Sao Paulo ( Brasil ) , 23 may ( EFECOM ) .

Predicted: B-LOC I-LOC O B-LOC O O O O O B-ORG O O
Real     : B-LOC I-LOC O B-LOC O O O O O B-ORG O O

La multinacional española Telefónica ha impuesto un récord mundial al poner en servicio tres millones de nuevas líneas en el estado brasileño de Sao Paulo desde que asumió el control de la operadora Telesp hace 20 meses , anunció hoy el presidente de Telefónica do Brasil , Fernando Xavier Ferreira .

Predicted: O O O B-ORG O O O O O O O O O O O O O O O O O O O B-LOC I-LOC O O O O O O O O B-ORG O O O O O O O O O B-ORG I-ORG I-ORG O B-PER I-PER I-PER O
Real     : O O O B-ORG O O O O O O O O O O O O O O O O O O O B-LOC I-LOC O O O O O O O O B-ORG O O O O O O O O O B-ORG I-ORG I-ORG O B-PER I-PER I-PER O


Tagging seems fine, meaning that we have no compatibility issues and may proceed to the main task: measuring the effect of model order on performance.

---

### Examining the effect of model order on results

We now run 6 experiments, each measuring the performance of a trained model.

Experiments:
* Spanish X model-order=0
* Spanish X model-order=1
* Spanish X model-order=2
* Dutch X model-order=0
* Dutch X model-order=1
* Dutch X model-order=2
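
The grid above is the cross product of language and model order. A short sketch that enumerates it, using the naming pattern this notebook uses for its model files; the variable names are hypothetical:

```python
from itertools import product

languages = ["Spanish", "Dutch"]
orders = [0, 1, 2]

# One name per (language, order) pair
experiment_names = ["Lang=%s_X_Order=%d" % (lang, order)
                    for lang, order in product(languages, orders)]

for name in experiment_names:
    print(name)
```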

In [12]:
from sklearn.preprocessing import LabelBinarizer
from itertools import chain
from sklearn.metrics import classification_report

def bio_classification_report(y_true, y_pred, ommitBaseClass):
    """
    Classification report for a list of BIO-encoded tags.
    It computes token-level metrics; the "O" label is discarded
    when ommitBaseClass is True.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(y_true)
    # reuse the classes fitted on y_true so both matrices share the same column order
    y_pred_combined = lb.transform(y_pred)
    
    if (ommitBaseClass):  tagset = set(lb.classes_) - {'O'}
    else:                 tagset = set(lb.classes_)
        
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

def measureModel(train, test, order, experimentName = "defExperimentName", verbose = False, ommitBaseClass = True):
    # encode training data
    if verbose: print("-> encoding train data")
    X_train     = getX(train, order)
    Y_train     = getY(train)
    
    # train a classifier
    if verbose: print("-> setting trainer parameters")
    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params({
        'c1': 1.0,   # coefficient for L1 penalty
        'c2': 1e-3,  # coefficient for L2 penalty
        'max_iterations': 50,  # stop earlier
        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
        })
    if verbose: print("-> setting data to trainer")
    trainer.append(X_train,Y_train)
    if verbose: print("-> training; classifier saved on ./%s" % experimentName)
    trainer.train(experimentName)
    
    # encode test data
    if verbose: print("-> encoding test data")
    X_test      = getX(test, order)
    Y_test      = getY(test)
    
    # tag test data (prediction)
    if verbose: print("-> tagging test data with trained classifier %s" % experimentName)
    tagger = pycrfsuite.Tagger()
    tagger.open(experimentName)
    Y_pred      = tagger.tag(X_test)  
    
    # print measured results:
    print("\n\nModel performance classification results for experiment: %s" % experimentName)
    print(bio_classification_report(Y_test, Y_pred, ommitBaseClass))


measureModel(etr, eta, 0, experimentName = "Lang=Spanish_X_Order=0", verbose = False, ommitBaseClass = True)

# measure models order={0,1,2} for Spanish
measureModel(etr, eta, 0, experimentName = "Lang=Spanish_X_Order=0", verbose = False)
measureModel(etr, eta, 1, experimentName = "Lang=Spanish_X_Order=1", verbose = False)
measureModel(etr, eta, 2, experimentName = "Lang=Spanish_X_Order=2", verbose = False)

# measure models order={0,1,2} for Dutch
measureModel(dtr, dta, 0, experimentName = "Lang=Dutch_X_Order=0", verbose = False)
measureModel(dtr, dta, 1, experimentName = "Lang=Dutch_X_Order=1", verbose = False)
measureModel(dtr, dta, 2, experimentName = "Lang=Dutch_X_Order=2", verbose = False)



Model performance classification results for experiment: Lang=Spanish_X_Order=0
             precision    recall  f1-score   support

      B-LOC       0.61      0.76      0.68       985
      I-LOC       0.61      0.70      0.66       336
     B-MISC       0.58      0.48      0.52       445
     I-MISC       0.39      0.50      0.44       654
      B-ORG       0.81      0.68      0.74      1700
      I-ORG       0.76      0.65      0.70      1366
      B-PER       0.79      0.72      0.76      1222
      I-PER       0.77      0.90      0.83       859

avg / total       0.71      0.69      0.70      7567
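
The figures in the report above can be cross-checked by hand. The sketch below assumes that the `avg / total` row is the support-weighted mean of each column and that f1 is the harmonic mean of precision and recall; the per-tag values are copied from the report, and the helper names are hypothetical:

```python
# (precision, recall, support) per tag, copied from the report above
report = {
    "B-LOC":  (0.61, 0.76,  985),
    "I-LOC":  (0.61, 0.70,  336),
    "B-MISC": (0.58, 0.48,  445),
    "I-MISC": (0.39, 0.50,  654),
    "B-ORG":  (0.81, 0.68, 1700),
    "I-ORG":  (0.76, 0.65, 1366),
    "B-PER":  (0.79, 0.72, 1222),
    "I-PER":  (0.77, 0.90,  859),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def weighted_average(column):
    """Support-weighted mean of one column (0 = precision, 1 = recall)."""
    total = sum(row[2] for row in report.values())
    return sum(row[column] * row[2] for row in report.values()) / total

print("total support     : %d" % sum(row[2] for row in report.values()))
print("B-LOC f1          : %.2f" % f1(0.61, 0.76))
print("weighted precision: %.2f" % weighted_average(0))
print("weighted recall   : %.2f" % weighted_average(1))
```

Up to rounding of the inputs, these values agree with the B-LOC row and the `avg / total` row of the report.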



### Results analysis
**Base class "O" not omitted**<br>
We also present a single test in which the base class "O" is not omitted.<br>
The interesting cases are the other classes, so we focus on them, but it is also useful to see one test that includes the base class: it clearly shifts the overall precision to an uninterestingly high score.

**Order 0 => order 1**<br>
In both languages we see an improvement in all BIO tags (except the Dutch tag 'I-MISC').
That is the expected behaviour, assuming the extra features from neighbouring words carry information about the tag of the current word.

**Order 1 => order 2**<br>
In both languages the results do not improve when moving from order 1 to order 2 (Spanish average precision remains 0.73, while Dutch average precision deteriorates slightly from 0.72 to 0.71).
This may seem surprising, but we should keep 2 factors in mind:
* The relation between a word and its neighbours at distance 2 may be quite weak
* We did not take a significant number of features from the classified word itself.

Therefore, selecting features as described on a small training set may dilute the 'important' features, so that they have a weaker effect than they would with a larger sample.<br>

In conclusion:
If one selects a large number of features, one should make sure that their number is in proportion to the size of the training data.

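### 3.1.3 Checking for sequence-related tag errors

In the NER task, the output can be partially self-verified against its sequential rules: an I-X tag is legal only directly after B-X or I-X.
Below is code that checks these rules for a pair of adjacent BIO tags. It flags B-X followed by I-Y, I-X followed by I-Y (with X different from Y), and O followed by I-X: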

In [13]:
def isBX_IY(tag1: str, tag2: str):
    if (tag1 != "O" and tag2 != "O"):
        delType1, bioType1 = tag1.split('-')
        delType2, bioType2 = tag2.split('-')
        if (delType1 == "B" and delType2 == "I" and bioType1 != bioType2):
            return True
    return False

def isIX_IY(tag1: str, tag2: str):
    if (tag1 != "O" and tag2 != "O"):
        delType1, bioType1 = tag1.split('-')
        delType2, bioType2 = tag2.split('-')
        if (delType1 == "I" and delType2 == "I" and bioType1 != bioType2):
            return True
    return False

def isO_IX(tag1: str, tag2: str):
    if (tag1 == "O" and tag2 != "O"):
        delType2, bioType2 = tag2.split('-')
        if (delType2 == "I"):
            return True
    return False

def countErr(Y_pred, ErrFunc):
    errCnt = 0
    for i in range(len(Y_pred) - 1):
        tag1 = Y_pred[i]
        tag2 = Y_pred[i + 1]
        if ErrFunc(tag1, tag2): errCnt += 1
    return errCnt

#### Validating the error-count code:
We must verify that the code indeed finds such illegal sequences.<br> 
Therefore, the following code generates illegal pairs for that purpose:

In [14]:
# {'I-PER', 'B-LOC', 'O', 'I-LOC', 'B-PER', 'I-ORG', 'B-MISC', 'I-MISC', 'B-ORG'}
errorSeq = ["O", "I-PER", "I-ORG"]
print(errorSeq)
print("BX_IY error count = %3d" % countErr(errorSeq, isBX_IY))
print("IX_IY error count = %3d" % countErr(errorSeq, isIX_IY))
print("O_IX  error count = %3d" % countErr(errorSeq, isO_IX ))

['O', 'I-PER', 'I-ORG']
BX_IY error count =   0
IX_IY error count =   1
O_IX  error count =   1


In [15]:
# {'I-PER', 'B-LOC', 'O', 'I-LOC', 'B-PER', 'I-ORG', 'B-MISC', 'I-MISC', 'B-ORG'}
errorSeq = ["O", "B-MISC", "I-ORG"]
print(errorSeq)
print("BX_IY error count = %3d" % countErr(errorSeq, isBX_IY))
print("IX_IY error count = %3d" % countErr(errorSeq, isIX_IY))
print("O_IX  error count = %3d" % countErr(errorSeq, isO_IX ))

['O', 'B-MISC', 'I-ORG']
BX_IY error count =   1
IX_IY error count =   0
O_IX  error count =   0


In [16]:
# {'I-PER', 'B-LOC', 'O', 'I-LOC', 'B-PER', 'I-ORG', 'B-MISC', 'I-MISC', 'B-ORG'}
errorSeq = ["O", "B-LOC", "I-LOC", "I-ORG", "B-ORG", "I-MISC", "O" , "B-MISC", "I-PER"]
print(errorSeq)
print("BX_IY error count = %3d" % countErr(errorSeq, isBX_IY))
print("IX_IY error count = %3d" % countErr(errorSeq, isIX_IY))
print("O_IX  error count = %3d" % countErr(errorSeq, isO_IX ))

['O', 'B-LOC', 'I-LOC', 'I-ORG', 'B-ORG', 'I-MISC', 'O', 'B-MISC', 'I-PER']
BX_IY error count =   2
IX_IY error count =   1
O_IX  error count =   0
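
The three pairwise checks share one underlying rule: an I-X tag is legal only directly after B-X or I-X. The sketch below folds them into a single pass and also reports where each violation occurs. The function name `illegal_transitions` is hypothetical, and unlike `countErr` it treats the start of the sequence as "O", so a sequence that begins with an I-X tag is reported as well:

```python
def illegal_transitions(tags):
    """Return (position, previous_tag, tag) for every I-X that does not continue an X chunk."""
    violations = []
    prev = "O"  # assumption: the start of the sequence behaves like an outside tag
    for pos, tag in enumerate(tags):
        if tag.startswith("I-"):
            prev_type = prev.split("-", 1)[1] if prev != "O" else None
            if prev_type != tag.split("-", 1)[1]:
                violations.append((pos, prev, tag))
        prev = tag
    return violations

errorSeq = ["O", "B-LOC", "I-LOC", "I-ORG", "B-ORG", "I-MISC", "O", "B-MISC", "I-PER"]
for violation in illegal_transitions(errorSeq):
    print(violation)
```

On this sequence the sketch reports three violations, the same total as the BX_IY, IX_IY and O_IX counts above.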


### Testing sequential correctness for all tests
We want to check all our test data, i.e. {Spanish, Dutch} X {a,b}

In [17]:
def checkLabelSequenceCorrectness(train, test, order, experimentName = "defExperimentName", verbose = False):
    # encode training data
    if verbose: print("-> encoding train data")
    X_train     = getX(train, order)
    Y_train     = getY(train)
    
    # train a classifier
    if verbose: print("-> setting trainer parameters")
    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params({
        'c1': 1.0,   # coefficient for L1 penalty
        'c2': 1e-3,  # coefficient for L2 penalty
        'max_iterations': 50,  # stop earlier
        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
        })
    if verbose: print("-> setting data to trainer")
    trainer.append(X_train,Y_train)
    if verbose: print("-> training; classifier saved on ./%s" % experimentName)
    trainer.train(experimentName)
    
    # encode test data
    if verbose: print("-> encoding test data")
    X_test      = getX(test, order)
    Y_test      = getY(test)
    
    # tag test data (prediction)
    if verbose: print("-> tagging test data with trained classifier %s" % experimentName)
    tagger = pycrfsuite.Tagger()
    tagger.open(experimentName)
    Y_pred      = tagger.tag(X_test)  
    
    # print measured results:
    print("\n\nTaged bio-labels sequential corectness experiment: %s" % experimentName)
    print("BX_IY error count = %3d" % countErr(Y_pred, isBX_IY))
    print("IX_IY error count = %3d" % countErr(Y_pred, isIX_IY))
    print("O_IX  error count = %3d" % countErr(Y_pred, isO_IX ))

# testing sequential correctness of bio-tags
order = 0
checkLabelSequenceCorrectness(etr, eta, order, experimentName = "Lang=Spanish_X_TestSet=a_X_order=0", verbose = False)
checkLabelSequenceCorrectness(etr, etb, order, experimentName = "Lang=Spanish_X_TestSet=b_X_order=0", verbose = False)

checkLabelSequenceCorrectness(dtr, dta, order, experimentName = "Lang=Dutch_X_TestSet=a_X_order=0", verbose = False)
checkLabelSequenceCorrectness(dtr, dtb, order, experimentName = "Lang=Dutch_X_TestSet=b_X_order=0", verbose = False)

order = 1
checkLabelSequenceCorrectness(etr, eta, order, experimentName = "Lang=Spanish_X_TestSet=a_X_order=1", verbose = False)
checkLabelSequenceCorrectness(etr, etb, order, experimentName = "Lang=Spanish_X_TestSet=b_X_order=1", verbose = False)

checkLabelSequenceCorrectness(dtr, dta, order, experimentName = "Lang=Dutch_X_TestSet=a_X_order=1", verbose = False)
checkLabelSequenceCorrectness(dtr, dtb, order, experimentName = "Lang=Dutch_X_TestSet=b_X_order=1", verbose = False)



Taged bio-labels sequential corectness experiment: Lang=Spanish_X_TestSet=a_X_order=0
BX_IY error count =   0
IX_IY error count =   0
O_IX  error count =   0


Taged bio-labels sequential corectness experiment: Lang=Spanish_X_TestSet=a_X_order=1
BX_IY error count =   0
IX_IY error count =   0
O_IX  error count =   0


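#### Results analysis
Quite surprisingly, no pair of BIO tags forms an illegal sequence (even though the error-locating code was validated).
A plausible explanation is that the CRF learns weights for label-to-label transitions, so transitions that never occur in the training data are strongly penalized at prediction time.
We are happy to see that the predicted sequences contain no illegal pairs of BIO tags.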

---

## 3.2 Adding a feature: dense vector representation

In this section we add the word embedding of the word as a feature.<br>
For that purpose, we use a pretrained word embedder that was trained on a Spanish corpus.<br> 

The embedder receives a word and returns its dense vector form.<br>

Since words with similar meanings are known to have similar dense vectors, we expect this feature to capture the essence of the entities we are seeking, i.e. that entities are indeed similar to one another in the dense vector space (e.g. with respect to the inner product).<br>
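
As an illustration of similarity with respect to the inner product, here is a sketch of cosine similarity, which is the inner product of two vectors divided by the product of their lengths. The three vectors are made up for this example and do not come from the embedding file:

```python
import math

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only
toy = {
    "madrid": [0.9, 0.1, 0.0],
    "paris":  [0.8, 0.2, 0.1],
    "comer":  [0.0, 0.3, 0.9],
}

print(cosine_similarity(toy["madrid"], toy["paris"]))  # two place names: close to 1
print(cosine_similarity(toy["madrid"], toy["comer"]))  # place name vs. verb: close to 0
```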

We will present the results after adding that feature, expecting improved performance.

### Fetching data

In [18]:
# Code from the project reference notebook:
from gensim.models.keyedvectors import KeyedVectors
import numpy as np

# forward slash works on every platform
wordvectors_file_vec = 'data/glove-sbwc.i25.vec'
# read only the first `cantidad` vectors of the file
cantidad = 10000
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)
print("-> data was loaded")
print("-> data was loaded")



-> data was loaded


In [19]:
# Accessing the data

sampleWord = list(wordvectors.vocab.keys())[10]
print(sampleWord)
sampleWord2Vec = wordvectors[sampleWord]
#print(sampleWord2Vec)
dimension = len(wordvectors[sampleWord])
print(dimension)

if (not ("fewgfhjqwiowehjwerpghiwer" in wordvectors.vocab.keys())) : print("notfound")
if sampleWord in wordvectors.vocab.keys() : print("found")
# setting default vector (all zeros) for words that are not in the vocabulary
defVector = np.zeros(dimension) 

las
300
notfound
found


### Updating the vectorizer code
Below is the code that must be updated in order to add the dense vector form of the word as a feature:

In [24]:
# the update needed for the feature format:
# number of embedding dimensions used as features (the first 100 of the 300 available)
denceVectoreNofParam = 100

def word2vec(tokenLower):
    if tokenLower in wordvectors.vocab.keys():
        # found word
        denceVecForm = wordvectors[tokenLower]
    else:
        # not found word
        denceVecForm = np.zeros(denceVectoreNofParam)
    return denceVecForm

def getPadWordFeatures(token=None, postag=None, pref=""):
    features = {pref + 'token':         token,
                pref + 'postag':        postag,
                pref + 'isdigit':       setBoolVal(None),
                pref + 'isupper':       setBoolVal(None)}
    
    # add dense vector form as a feature
    denceFeatureDict = {}
    for i in range(denceVectoreNofParam):
        featureName = pref + "dence" + str(i)
        denceFeatureDict[featureName] = 0.0
    features.update(denceFeatureDict)
    
    return features

def getAdjWordFeatures(token=None, postag=None, pref=""):
    # manually selected features for the adjacent word
    features = {pref+'token.lower': token.lower(),
                pref+'postag':      postag,
                pref+'isdigit':     setBoolVal(token.isdigit()),
                pref+'isupper':     setBoolVal(token.isupper())}
    
    # computing features for dense form
    denceVecForm = word2vec(token.lower())
    
    # add dense vector form as a feature
    # dim          = len(denceVecForm)
    denceFeatureDict = {}
    for i in range(denceVectoreNofParam):
        val = denceVecForm[i]
        featureName = pref + "dence" + str(i)
        denceFeatureDict[featureName] = val
    features.update(denceFeatureDict)
    
    return features

def getWordFeatures(token, postag):
    # manually selected features for the word
    
    features = {'token.lower':  token.lower(),
                'postag':       postag,
                'isdigit':      setBoolVal(token.isdigit()),
                'isupper':      setBoolVal(token.isupper()),
                'pref1':        token[:1],
                'pref2':        token[:2],
                'suff1':        token[-1:],
                'suff2':        token[-2:]}
    
    # computing features for dense form
    denceVecForm = word2vec(token.lower())
    
    # add dense vector form as a feature
    # dim          = len(denceVecForm)
    denceFeatureDict = {}
    for i in range(denceVectoreNofParam):
        val = denceVecForm[i]
        featureName = "dence" + str(i)
        denceFeatureDict[featureName] = val
    features.update(denceFeatureDict)
    
    return features
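
The three functions above repeat the same loop for building the dense features. As a possible simplification, the sketch below moves that loop into one helper. It is not the notebook's code: `dense_features`, `DENSE_DIMS`, `toy_vectors` and the `"-1:"` prefix are hypothetical names and values, and a plain dict stands in for the loaded `wordvectors` object so that the example runs on its own.

```python
import numpy as np

DENSE_DIMS = 4  # hypothetical; the notebook uses denceVectoreNofParam = 100

# Hypothetical stand-in for the loaded KeyedVectors object.
toy_vectors = {"madrid": np.array([0.1, 0.2, 0.3, 0.4, 0.5])}

def dense_features(token_lower, vectors, n_dims=DENSE_DIMS, pref=""):
    """Map the first n_dims vector components to named features.

    Unknown tokens and padding positions (token_lower=None) get zeros.
    """
    if token_lower is not None and token_lower in vectors:
        vec = vectors[token_lower][:n_dims]
    else:
        vec = np.zeros(n_dims)
    return {pref + "dence" + str(i): float(vec[i]) for i in range(n_dims)}

known = dense_features("madrid", toy_vectors)
unknown = dense_features("xyz", toy_vectors, pref="-1:")
padding = dense_features(None, toy_vectors, pref="+1:")

print(known)
print(unknown)
print(padding)
```

Each feature function could then call `features.update(dense_features(...))` once, which keeps the feature names and the zero fallback in a single place.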

### Measure results

In [None]:
measureModel(etr, eta, 0, experimentName = "addedWrod2VecFeature_Lang=Spanish_X_Order=0", verbose = True)
measureModel(etr, eta, 1, experimentName = "addedWrod2VecFeature_Lang=Spanish_X_Order=1", verbose = True)
measureModel(etr, eta, 2, experimentName = "addedWrod2VecFeature_Lang=Spanish_X_Order=2", verbose = True)

-> encoding train data
-> setting trainer parameters
-> setting data to trainer
-> training; classifier saved on ./addedWrod2VecFeature_Lang=Spanish_X_Order=0
-> encoding test data
-> tagging test data with trained classifier addedWrod2VecFeature_Lang=Spanish_X_Order=0


Model performance classification results for experiment: addedWrod2VecFeature_Lang=Spanish_X_Order=0
             precision    recall  f1-score   support

      B-LOC       0.63      0.78      0.70       985
      I-LOC       0.62      0.77      0.68       336
     B-MISC       0.65      0.50      0.56       445
     I-MISC       0.47      0.51      0.49       654
      B-ORG       0.83      0.71      0.76      1700
      I-ORG       0.77      0.69      0.73      1366
      B-PER       0.83      0.76      0.79      1222
      I-PER       0.86      0.93      0.89       859

avg / total       0.74      0.72      0.73      7567

-> encoding train data
-> setting trainer parameters
-> setting data to trainer
-> training; c

### Results analysis

In the above experiments we added features that were learned by an external dense vector representation model;
We assumed that those features would have some effect on the performance, as they are related to the word's location in the language topology;

We faced a problem: the space complexity of this task is very high; the number of dimensions of the representation is 300;
For comparison, initially we had 8 features for the central word and 4 for neighbouring words;

The local system that ran the process ran out of RAM and therefore operated on disk, which is very slow;

Therefore we took only the first 100 dense vector features;
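
To get a feel for the growth in feature count, the sketch below counts features per token for different numbers of dense dimensions. It assumes that order k means k neighbouring words on each side of the central word, which is our reading of the setup rather than something stated above; the function name and its parameters are illustrative.

```python
def features_per_token(order, dense_dims, central_base=8, neighbour_base=4):
    """Count the features of one token.

    Assumption: `order` neighbours on each side, each carrying its own
    copy of the dense vector features.
    """
    central = central_base + dense_dims
    neighbours = 2 * order * (neighbour_base + dense_dims)
    return central + neighbours

for dims in (0, 100, 300):
    counts = [features_per_token(order, dims) for order in (0, 1, 2)]
    print("dense dims = %3d -> features per token for order 0, 1, 2: %s" % (dims, counts))
```

Under this assumption the dense features dominate the count as soon as they are added, and the count scales almost linearly with the number of dimensions kept.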

order 0: avg precision = 0.74 => improvement over the best previous result of 0.73 avg precision<br>
order 1: avg precision = 0.75<br>
order 2: avg precision = still running<br>

To conclude:<br>
**Adding dense vector features may help improve results (0.74 to 0.75 avg precision versus the previous best of 0.73), but it is costly in space resources;**<br>
(we could also run with all 300 features, but that would take 3 times more space on the currently overloaded system)