# IIC-3670 NLP UC

- Library versions, Python 3.8.10

- numpy 1.20.3
- nltk 3.7
- gensim 4.1.2
- keras 2.9.0
- tensorflow 2.9.1

## In-class activity

We are going to train a POS tagger using NLTK's **treebank** corpus. To do so, complete the following steps:

- Import the tagged sents from treebank.
- Preprocess the tagged sents as shown in class.
- Split the data into training and test partitions with **test_size = 0.2**.
- Build the symbol sets for tags and words from the training partition.
- Convert the tagged sents into integer sequences.
- Build the pad sequences in keras.
- Define the ignore accuracy function.
- Define the model using keras's Sequential(). Your model will have **two bidirectional layers, both with dim = 256**. Use categorical cross-entropy loss (LCE) and Adam with a learning rate of 0.001.
- Compile the model.
- Convert the output symbols to a categorical (one-hot) variable.
- Train with **batch_size=64 and 20 epochs**. If training takes too long, reduce the number of epochs. Use a **validation split of 10%**.
- Evaluate the POS tagger on the test partition in terms of accuracy.
- What does the ignore accuracy function do?
- Interpret the results.
- When you finish, let me know so I can give you an **L (logrado, i.e. achieved)**.
- Remember that L marks earn a bonus on the final course grade.


***You have until the end of class.***

See the dataset description at: https://www.kaggle.com/datasets/crawford/20-newsgroups


In [1]:
import nltk
 
tagged_sentences = nltk.corpus.treebank.tagged_sents()
 
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences:  3914
Tagged words: 100676


In [2]:
import numpy as np
 
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence, tags = zip(*tagged_sentence)
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))
 
print(sentences[1])
print(sentence_tags[1])

['Mr.' 'Vinken' 'is' 'chairman' 'of' 'Elsevier' 'N.V.' ',' 'the' 'Dutch'
 'publishing' 'group' '.']
['NNP' 'NNP' 'VBZ' 'NN' 'IN' 'NNP' 'NNP' ',' 'DT' 'NNP' 'VBG' 'NN' '.']
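
The `zip(*tagged_sentence)` call in the cell above transposes a list of `(word, tag)` pairs into one tuple of words and one tuple of tags. A minimal standalone sketch, using a hand-written toy sentence (not taken from the corpus):

```python
# Toy (word, tag) pairs; illustrative only, not read from treebank.
toy_sentence = [("the", "DT"), ("board", "NN"), ("meets", "VBZ")]

# zip(*pairs) passes each pair as a separate argument, so zip regroups
# all first elements together and all second elements together.
toy_words, toy_tags = zip(*toy_sentence)

print(toy_words)
print(toy_tags)
```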


In [3]:
from sklearn.model_selection import train_test_split
 
train_sentences, test_sentences, train_tags, test_tags = train_test_split(sentences, sentence_tags, test_size=0.2)

In [4]:
words, tags = set(), set()
 
for s in train_sentences:
    for w in s:
        words.add(w.lower())
 
for ts in train_tags:
    for t in ts:
        tags.add(t)
 
word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs
 
tag2index = {t: i + 1 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0   # The special value used for padding
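
The mapping above reserves the lowest indices for the special symbols and numbers the remaining symbols in set-iteration order. That order can differ between interpreter sessions for strings, so the indices are not guaranteed to be reproducible. The sketch below shows one possible variant that sorts the symbols first and uses `dict.get` for the out-of-vocabulary fallback. The helper names `build_index` and `encode` and the toy vocabulary are hypothetical and not part of the notebook.

```python
def build_index(symbols, reserved):
    """Give reserved symbols the lowest ids, then the sorted symbols after them."""
    index = {name: i for i, name in enumerate(reserved)}
    for symbol in sorted(symbols):
        index[symbol] = len(index)
    return index

def encode(sentence, index, oov="-OOV-"):
    """Lowercase each word and fall back to the OOV id for unseen words."""
    return [index.get(w.lower(), index[oov]) for w in sentence]

toy_words = {"the", "board", "meets"}
toy_word2index = build_index(toy_words, reserved=["-PAD-", "-OOV-"])
toy_ids = encode(["The", "board", "adjourns"], toy_word2index)

print(toy_word2index)
print(toy_ids)
```

Here "adjourns" is not in the toy vocabulary, so it receives the `-OOV-` id, which is the same fallback the `try`/`except KeyError` blocks in the next cell implement.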


In [5]:
train_sentences_X, test_sentences_X, train_tags_y, test_tags_y = [], [], [], []
 
for s in train_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
 
    train_sentences_X.append(s_int)
 
for s in test_sentences:
    s_int = []
    for w in s:
        try:
            s_int.append(word2index[w.lower()])
        except KeyError:
            s_int.append(word2index['-OOV-'])
 
    test_sentences_X.append(s_int)
 
for s in train_tags:
    train_tags_y.append([tag2index[t] for t in s])
 
for s in test_tags:
    test_tags_y.append([tag2index[t] for t in s])
 
print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])

[7678, 8768, 5131, 2447, 1425, 6817, 9197, 700, 2918, 2447, 727, 1655, 4511, 8860, 2929, 3550, 7897, 9905, 8212, 6428, 4020, 700, 3717, 1074, 5172, 9125, 9799, 2589, 8250, 5131, 2447, 9254, 9073, 7971, 451, 8768, 5131, 2447, 9180, 7807, 700, 2145, 6579, 3717, 2447, 7919, 5131, 4265, 5072, 1510, 4610, 8826, 343, 700, 3717, 7586, 4973, 5545, 4973, 8148, 9684, 2458, 8299, 3639, 4822, 6799, 2184, 9626, 6025, 6579, 3717, 6906, 7669, 5131, 7493, 4345, 7787, 700, 2918, 588, 293, 2447, 7724, 1544, 5131, 2447, 7772, 7367, 5444, 9938]
[1017, 6319, 6565, 6976, 3717, 4546, 9036, 8846, 532, 9036, 2918, 2447, 3675, 4077, 7615, 5444]
[38, 33, 19, 39, 15, 14, 45, 37, 19, 39, 14, 14, 19, 15, 42, 1, 37, 3, 4, 31, 33, 37, 29, 26, 15, 33, 19, 33, 42, 19, 39, 33, 33, 17, 5, 33, 19, 39, 14, 45, 37, 15, 15, 29, 39, 14, 19, 39, 33, 32, 26, 45, 37, 37, 29, 26, 41, 26, 41, 12, 26, 39, 33, 24, 37, 3, 45, 45, 37, 15, 29, 39, 33, 19, 33, 38, 45, 37, 19, 33, 19, 39, 15, 33, 19, 39, 4, 46, 34, 9]
[33, 33, 14, 20, 29

In [7]:
MAX_LENGTH = len(max(train_sentences_X, key=len))
print(MAX_LENGTH)

271


In [8]:
# Only pad_sequences is needed here; importing keras loads TensorFlow as its backend.
from keras.utils import pad_sequences
 
train_sentences_X = pad_sequences(train_sentences_X, maxlen=MAX_LENGTH, padding='post')
test_sentences_X = pad_sequences(test_sentences_X, maxlen=MAX_LENGTH, padding='post')
train_tags_y = pad_sequences(train_tags_y, maxlen=MAX_LENGTH, padding='post')
test_tags_y = pad_sequences(test_tags_y, maxlen=MAX_LENGTH, padding='post')
 
print(train_sentences_X[0])
print(test_sentences_X[0])
print(train_tags_y[0])
print(test_tags_y[0])

[7678 8768 5131 2447 1425 6817 9197  700 2918 2447  727 1655 4511 8860
 2929 3550 7897 9905 8212 6428 4020  700 3717 1074 5172 9125 9799 2589
 8250 5131 2447 9254 9073 7971  451 8768 5131 2447 9180 7807  700 2145
 6579 3717 2447 7919 5131 4265 5072 1510 4610 8826  343  700 3717 7586
 4973 5545 4973 8148 9684 2458 8299 3639 4822 6799 2184 9626 6025 6579
 3717 6906 7669 5131 7493 4345 7787  700 2918  588  293 2447 7724 1544
 5131 2447 7772 7367 5444 9938    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
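
With `padding='post'`, zeros are appended after the real tokens until every row reaches `maxlen`. The pure-Python sketch below mimics that behaviour on toy data; `pad_post` is a hypothetical helper, not a keras function. It also mirrors keras's default `truncating='pre'`, under which a sequence longer than `maxlen` keeps only its last `maxlen` items. This matters here because `MAX_LENGTH` is measured on the training partition only, so a longer test sentence would lose its first tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])                             # 'pre' truncation
        padded.append(kept + [value] * (maxlen - len(kept)))   # 'post' padding
    return padded

toy = [[5, 3], [7, 8, 9, 4, 2]]
toy_padded = pad_post(toy, maxlen=4)
print(toy_padded)
```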

In [9]:
from keras import backend as K

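def ignore_class_accuracy(to_ignore=0):
    """Build an accuracy metric that skips positions whose true class is `to_ignore`."""
    def ignore_accuracy(y_true, y_pred):
        # Both tensors have shape (batch, time, n_tags); argmax recovers the class ids.
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)
 
        # Mask on the true labels so padded positions never count, whatever the model predicts there.
        ignore_mask = K.cast(K.not_equal(y_true_class, to_ignore), 'int32')
        matches = K.cast(K.equal(y_true_class, y_pred_class), 'int32') * ignore_mask
        # Guard against division by zero when a batch contains only ignored positions.
        accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1)
        return accuracy
    return ignore_accuracy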
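
To see what a padding-aware accuracy measures, the NumPy sketch below compares plain accuracy with accuracy restricted to positions whose true tag is not padding. The class ids are invented for illustration, and masking on the true labels is an assumption about the intended behaviour of the metric.

```python
import numpy as np

PAD = 0
# One toy sentence of 3 real tokens padded to length 8 (class ids are invented).
y_true = np.array([4, 2, 7, 0, 0, 0, 0, 0])
# The model gets 1 real token right and every padded position right.
y_pred = np.array([4, 5, 1, 0, 0, 0, 0, 0])

# Plain accuracy counts every position, padding included.
plain_accuracy = np.mean(y_true == y_pred)

# Masked accuracy counts only positions holding a real token.
mask = y_true != PAD
masked_accuracy = np.sum((y_true == y_pred) & mask) / max(np.sum(mask), 1)

print(plain_accuracy)   # 6 of 8 positions match
print(masked_accuracy)  # 1 of 3 real tokens match
```

The gap between the two numbers grows with the amount of padding, which is why plain accuracy alone can look better than the tagger really is.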

In [11]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.optimizers import Adam
 
 
model = Sequential()
model.add(InputLayer(input_shape=(MAX_LENGTH, )))
model.add(Embedding(len(word2index), 128))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(len(tag2index))))
model.add(Activation('softmax'))
 
model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy', ignore_class_accuracy(0)])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 271, 128)          1296384   
                                                                 
 bidirectional_2 (Bidirectio  (None, 271, 512)         788480    
 nal)                                                            
                                                                 
 bidirectional_3 (Bidirectio  (None, 271, 512)         1574912   
 nal)                                                            
                                                                 
 time_distributed (TimeDistr  (None, 271, 47)          24111     
 ibuted)                                                         
                                                                 
 activation (Activation)     (None, 271, 47)           0         
                                                      

In [12]:
def to_categorical(sequences, categories):
    cat_sequences = []
    for s in sequences:
        cats = []
        for item in s:
            cats.append(np.zeros(categories))
            cats[-1][item] = 1.0
        cat_sequences.append(cats)
    return np.array(cat_sequences)


In [13]:
cat_train_tags_y = to_categorical(train_tags_y, len(tag2index))
print(cat_train_tags_y[0])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]
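
The nested loops in `to_categorical` can also be written as a single NumPy indexing operation: row `k` of an identity matrix is the one-hot vector for class `k`, so indexing the identity matrix with the integer array should give the same encoding. A sketch on toy data (the function name `one_hot` is hypothetical):

```python
import numpy as np

def one_hot(sequences, categories):
    """Return an array of shape sequences.shape + (categories,) with one-hot rows."""
    return np.eye(categories)[np.asarray(sequences)]

# Two toy padded tag sequences with 4 possible classes (0 is padding).
toy_tags = np.array([[2, 1, 0],
                     [3, 0, 0]])
encoded = one_hot(toy_tags, categories=4)

print(encoded.shape)
print(encoded[0])
```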


In [14]:
model.fit(train_sentences_X, cat_train_tags_y, batch_size=64, epochs=20, validation_split=0.1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f73f0a03970>

In [15]:
scores = model.evaluate(test_sentences_X, to_categorical(test_tags_y, len(tag2index)))
print(f"{model.metrics_names[1]}: {scores[1] * 100}")   

accuracy: 99.10788536071777
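
When interpreting the `accuracy` figure, it helps to estimate how many of the scored positions are padding. The arithmetic below uses only numbers printed earlier in this notebook (3914 sentences, 100676 tagged words, `MAX_LENGTH` of 271). Because it averages over the whole corpus rather than the random test partition, the result should be treated as approximate.

```python
n_sentences = 3914      # printed by the first cell
n_words = 100676        # printed by the first cell
max_length = 271        # printed as MAX_LENGTH

# Average number of real tokens per sentence across the whole corpus.
avg_len = n_words / n_sentences
# Share of each padded row that is filler rather than real tokens.
pad_fraction = 1 - avg_len / max_length

print(f"average sentence length: {avg_len:.2f}")
print(f"approximate share of padded positions: {pad_fraction:.1%}")
```

Under these assumptions roughly nine in ten positions are padding, so a model that predicted `-PAD-` everywhere would already score about 90% on plain `accuracy`. The ignore accuracy metric is therefore the more informative number for the real tokens.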
