# Text classifier with *transformer* models
Let's implement a classifier by fine-tuning a BERT model on a Twitter sentiment analysis dataset.  

We use the `transformers` library in its `TensorFlow` implementation.

In [None]:
# install the library
!pip install transformers

In [None]:
import pandas as pd
pd.options.display.max_colwidth = None
import numpy as np
from transformers import AutoTokenizer, AutoConfig, TFAutoModelForSequenceClassification
import tensorflow as tf

from sklearn.model_selection import train_test_split

In [None]:
# model to use
nombre_modelo = 'bert-base-multilingual-uncased'

In [None]:
# Read the data
df = pd.read_csv('https://idal.uv.es/tweets_all.csv')

# select the columns of interest
df = df[['content', 'polarity']]

# keep only tweets with a definite polarity (positive or negative)
df = df[(df['polarity']=='P') | (df['polarity']=='N')]

df.head()

In [None]:
df.info()

## Text cleaning
We do some light text cleaning: mentions and URLs are replaced with placeholder tokens, and punctuation marks are replaced with spaces.

In [None]:
import re, string

pattern1 = re.compile(r'@[\w_]+') # matches mentions
pattern2 = re.compile(r'https?://[\w_./]+') # matches URLs
pattern4 = re.compile('[{}]+'.format(re.escape(string.punctuation))) # matches runs of punctuation marks

def clean_text(text):
    """Replace mentions and URLs with placeholder tokens, then replace
    punctuation (including the '#' of hashtags) with spaces."""
    text = pattern1.sub('mención', text)
    text = pattern2.sub('URL', text)
    text = pattern4.sub(' ', text)
    
    return text

Let's see an example of the cleaning on the first tweet.

In [None]:
df['content'].iloc[0]

In [None]:
clean_text(df['content'].iloc[0])
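
As a self-contained illustration of the cleaning step, the sketch below applies the same three patterns to a made-up tweet. The tweet text is an assumption for illustration only and does not come from the dataset.

```python
import re
import string

# Same patterns as in the notebook
pattern1 = re.compile(r'@[\w_]+')            # mentions
pattern2 = re.compile(r'https?://[\w_./]+')  # URLs
pattern4 = re.compile('[{}]+'.format(re.escape(string.punctuation)))  # runs of punctuation

def clean_text(text):
    """Replace mentions and URLs with placeholders, then punctuation with spaces."""
    text = pattern1.sub('mención', text)
    text = pattern2.sub('URL', text)
    text = pattern4.sub(' ', text)
    return text

# Hypothetical tweet (not taken from the dataset)
tweet = "@user I love this! https://example.com/abc #happy"
cleaned = clean_text(tweet)
print(repr(cleaned))
print(cleaned.split())
```

Note that punctuation is replaced by a space rather than removed, so the cleaned text can contain runs of spaces. This is harmless for the tokenizer, which splits on whitespace before applying WordPiece.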

## Preparing the dataset
We apply the cleaning function to the whole dataset.  
We split it into training and validation (test) sets. We do not hold out a separate, true test set.
We encode the output labels as integers.

In [None]:
# clean the text and drop tweets that end up empty (or whitespace-only)
df.content=df.content.apply(clean_text)
df = df[df['content'].str.strip()!='']
# the target is the polarity, which we convert to binary:
# 'P' is encoded as 1 and 'N' stays as 0
Y=(df.polarity=='P').values*1

# Split into training and test sets
# strictly speaking, the tokens should be derived from the training set only...
X_train_tweets, X_test_tweets, Y_train, Y_test = train_test_split(
    df.content,
    Y, 
    test_size = 0.3,
    random_state = 0)
print(X_train_tweets.shape,Y_train.shape)
print(X_test_tweets.shape,Y_test.shape)

In [None]:
# Estimate the maximum document length (in whitespace-separated words)

MAX_SEQUENCE_LENGTH=np.max([len(l.split()) for l in X_train_tweets])
print('maximum length: {}'.format(MAX_SEQUENCE_LENGTH))


## Preparing the model input  
`transformers` models use 3 vectors for each input:  
 - 'input_ids': vocabulary IDs used to look up the token embeddings  
 - 'token_type_ids': ID of the sentence each token belongs to (for tasks with 2 input sentences)
 - 'attention_mask': attention mask marking real tokens (1) versus padding (0)  

We use the model-specific tokenizer function to obtain these vectors.
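
To make the three vectors concrete, here is a minimal sketch that pads two already-tokenized sequences by hand. The token IDs are made up for illustration (101, 102 and 0 merely stand in for `[CLS]`, `[SEP]` and `[PAD]`), and `pad_batch` is a hypothetical helper, not part of the `transformers` API.

```python
# Hypothetical token IDs for two already-tokenized tweets
PAD_ID = 0
sequences = [
    [101, 2023, 2003, 102],
    [101, 7592, 102],
]

def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the companion vectors."""
    max_len = max(len(s) for s in seqs)
    input_ids, attention_mask, token_type_ids = [], [], []
    for s in seqs:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)
        attention_mask.append([1] * len(s) + [0] * n_pad)  # 1 = real token, 0 = padding
        token_type_ids.append([0] * max_len)               # single-sentence input: all zeros
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

batch = pad_batch(sequences)
for name, rows in batch.items():
    print(name, rows)
```

The real tokenizer does the same bookkeeping when called with `padding=True`, in addition to splitting the text into tokens.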

In [None]:
# Tokenize and encode as TensorFlow tensors
tokenizer = AutoTokenizer.from_pretrained(nombre_modelo)
train_encodings = tokenizer(X_train_tweets.to_list(), truncation=True, padding=True, return_tensors="tf")


In [None]:
train_encodings['input_ids'].shape

The maximum length is greater than the number of words, because *WordPiece* tokenization splits words into subword pieces and adds the special tokens.
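
The following sketch shows why the token count exceeds the word count. It implements the greedy longest-match-first splitting that WordPiece uses at tokenization time, over a tiny hypothetical vocabulary. The function `wordpiece` and the vocabulary are assumptions for illustration, and the real tokenizer also handles normalization and other details omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
vocab = {"fine", "##tun", "##ing", "bert", "works"}
words = "finetuning bert works".split()
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, vocab)] + ["[SEP]"]
print(len(words), "words ->", len(tokens), "tokens:", tokens)
```

Three words become seven tokens here: one word is split into three pieces and two special tokens are added.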

We can inspect the generated tokens and decode each document back.

In [None]:
train_encodings.keys()

We take the first tweet as an example.

In [None]:
X_train_tweets.to_list()[0]

In [None]:
train_encodings['input_ids'][0]

In [None]:
train_encodings['attention_mask'][0]

In [None]:
print(tokenizer.convert_ids_to_tokens(train_encodings['input_ids'][0]))

### Exercise
Tokenize the TEST set using the same model and the same settings (maximum length) as the TRAIN set, and store the result in `test_encodings`, which later cells use.  
Store the maximum input length to use in the variable `MAX_SEQUENCE_LENGTH`.

In [None]:
## to be completed


To feed this data to the model we convert it into a TensorFlow `Dataset`, an iterable object that returns a dictionary with the samples of each iteration (or batch).
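
Conceptually, iterating over a batched `Dataset` built from a dictionary of features and a label array behaves like the plain NumPy generator below. This is only a sketch of the idea with made-up arrays; `iterate_batches` is a hypothetical helper and not how TensorFlow implements it.

```python
import numpy as np

def iterate_batches(features, labels, batch_size):
    """Yield (dict of feature slices, label slice) pairs, one per batch."""
    n = len(labels)
    for start in range(0, n, batch_size):
        stop = start + batch_size
        batch_features = {name: values[start:stop] for name, values in features.items()}
        yield batch_features, labels[start:stop]

# Made-up encodings for 5 samples of length 4
features = {
    "input_ids": np.arange(20).reshape(5, 4),
    "attention_mask": np.ones((5, 4), dtype=int),
}
labels = np.array([0, 1, 1, 0, 1])

batches = list(iterate_batches(features, labels, batch_size=2))
for x, y in batches:
    print({k: v.shape for k, v in x.items()}, y)
```

The last batch is smaller when the number of samples is not a multiple of the batch size.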

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    Y_train
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    Y_test
))

In [None]:
train_dataset

# Fine-tuning BERT
We fit the BERT model to our classification problem.

In [None]:
# define the classification model
id2label = {0: "Neg", 1: "Pos"}
label2id = {val: key for key, val in id2label.items()}
config = AutoConfig.from_pretrained(
    nombre_modelo, hidden_dropout_prob=0.1, num_labels=2, id2label=id2label, label2id=label2id)
model = TFAutoModelForSequenceClassification.from_pretrained(
    nombre_modelo, config=config)

In [None]:
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

model.summary()
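
The loss above takes integer labels and raw logits. As a rough NumPy sketch of what that means (an illustration, not the Keras implementation), the function below applies a log-softmax to the logits and averages the negative log-probability of the true class. The logits and labels are made-up values.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(labels, logits):
    """Mean cross-entropy for integer labels and unnormalized class scores."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # log-softmax
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical logits for 3 tweets and 2 classes (0 = Neg, 1 = Pos)
logits = np.array([[2.0, -1.0], [0.0, 0.0], [-3.0, 3.0]])
labels = np.array([0, 1, 1])

loss = sparse_categorical_crossentropy_from_logits(labels, logits)
accuracy = (logits.argmax(axis=1) == labels).mean()
print(loss, accuracy)
```

Because the labels are integers rather than one-hot vectors, the "sparse" variants of the loss and the accuracy metric are the ones that apply.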

### Question
Where does the number of parameters in the last layer come from?

In [None]:
# Train
batch_size = 8
n_epochs = 5
# the dataset is already batched, so `batch_size` is not passed to fit()
history=model.fit(train_dataset.batch(batch_size),
    epochs=n_epochs,
    validation_data=test_dataset.batch(batch_size))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Fine-tuning BERT')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
score,acc = model.evaluate(test_dataset.batch(batch_size), verbose = 2)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

In [None]:
# get the model predictions
predict=model.predict(test_dataset.batch(batch_size))

In [None]:
predict.keys()

In [None]:
predict.logits.shape

In [None]:
np.argmax(predict.logits, 1)

In [None]:
# The Transformers sequence classification model always returns the logits of each class from its last layer;
# the number of classes is set in the model configuration

Y_test_label = list(map(lambda l: model.config.id2label[l], Y_test))
predict_label = list(map(lambda l: model.config.id2label[l], np.argmax(predict.logits, 1)))

In [None]:
from sklearn.metrics import classification_report

print(classification_report(Y_test_label, predict_label))

In [None]:

# We only need the probabilities to compute the AUC or to tune the decision threshold
predict_proba = tf.nn.softmax(predict.logits)

from sklearn.metrics import roc_auc_score

roc_auc_score(Y_test, predict_proba[:,1])
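
The sketch below, on made-up logits, illustrates why the probabilities are only needed for the AUC or for threshold tuning: softmax does not change the argmax, and with two classes the positive-class probability equals a sigmoid of the logit difference. The 0.6 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits for 3 tweets (columns: Neg, Pos)
logits = np.array([[1.5, -0.5], [0.2, 0.4], [-2.0, 1.0]])
proba = softmax(logits)
p_pos = proba[:, 1]

# Two-class identity: softmax probability of class 1 = sigmoid(logit_1 - logit_0)
sigmoid_diff = 1.0 / (1.0 + np.exp(-(logits[:, 1] - logits[:, 0])))

pred_default = (p_pos > 0.5).astype(int)  # same decision as argmax over the logits
pred_strict = (p_pos > 0.6).astype(int)   # a stricter, hypothetical threshold
print(p_pos, pred_default, pred_strict)
```

Moving the threshold away from 0.5 changes the predicted labels, which is what the ROC curve summarizes across all thresholds.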

In [None]:
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_predictions(
    Y_test,
    predict_proba[:,1],
    name="Positive class",
    color="darkorange",
)
plt.plot([0, 1], [0, 1], "k--", label="chance level (AUC = 0.5)")
plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend()
plt.show()

## Fine-tuning in Keras
We use the BERT model as the first layer of a Keras model and train a binary classifier on its output. This way we could add more hidden layers between the output of the BERT encoder and the output layer of the classifier.  
If we use the first output of the BERT model we get the `last_hidden_state` of every input token, that is, a tensor of `MAX_SEQUENCE_LENGTH` x 768 values.   
If we use the second output of the BERT model we get the `pooler_output`, computed from the hidden state of the first token (CLS), that is, a vector of 768 values.  
We replicate the model used by TFBertForSequenceClassification (https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/models/bert/modeling_tf_bert.py#L1601):  
 - The pooler_output goes through a dropout layer.  
 - The result feeds a dense layer that produces the class scores. TFBertForSequenceClassification returns them as raw logits; here we add a sigmoid or softmax activation, depending on the classifier.  
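
As a shape-level sketch of this head, the NumPy code below goes from `last_hidden_state` to a class probability. All weights are random stand-ins (an assumption for illustration); in BERT the pooler is a dense layer with tanh activation applied to the hidden state of the first token.

```python
import numpy as np

rng = np.random.default_rng(0)
bs, seq_len, dim = 2, 5, 768

# Random stand-ins for the encoder output and the layer weights
last_hidden_state = rng.normal(size=(bs, seq_len, dim))  # first BERT output
W_pool, b_pool = rng.normal(scale=0.02, size=(dim, dim)), np.zeros(dim)
W_cls, b_cls = rng.normal(scale=0.02, size=(dim, 1)), np.zeros(1)

cls_hidden = last_hidden_state[:, 0, :]                # (bs, dim): hidden state of [CLS]
pooler_output = np.tanh(cls_hidden @ W_pool + b_pool)  # (bs, dim): second BERT output
# Dropout is the identity at inference time, so it is omitted here
logits = pooler_output @ W_cls + b_cls                 # (bs, 1): dense layer, no activation
proba = 1.0 / (1.0 + np.exp(-logits))                  # sigmoid for the binary classifier

print(cls_hidden.shape, pooler_output.shape, logits.shape, proba.shape)
```

The two variants that follow differ only in which vector feeds the dropout: `pooler_output`, or the raw `[CLS]` hidden state without the pooler's dense and tanh step.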

### Using the `pooler_output` output

In [None]:
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from transformers import TFAutoModel

#######################################
### --------- Setup BERT ---------- ###


# Load transformers config and set output_hidden_states to False
config = AutoConfig.from_pretrained(nombre_modelo)
config.output_hidden_states = False

# Load the Transformers BERT model
transformer_model = TFAutoModel.from_pretrained(nombre_modelo, config = config)

#######################################
### ------- Build the model ------- ###
# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model

# Build your model input
input_ids = Input(shape=(MAX_SEQUENCE_LENGTH,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(MAX_SEQUENCE_LENGTH,), name='attention_mask', dtype='int32') 
# Load the Transformers BERT model as a layer in a Keras model
pooled_output = transformer_model(input_ids, attention_mask)[1]  # (bs, dim)
logits = Dropout(0.1)(pooled_output)  # (bs, dim)
# Then build your model output
output = Dense(units=1,
               kernel_initializer=TruncatedNormal(stddev=config.initializer_range),
               activation="sigmoid",
               name='clases')(logits)
# And combine it all in a model object
model = Model(inputs=[input_ids, attention_mask], outputs=output, name='BERT_BinaryClass')
# Take a look at the model
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, epsilon=1e-08)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history=model.fit(train_dataset.batch(batch_size), epochs=n_epochs, validation_data=test_dataset.batch(batch_size))

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Fine-tuning Keras')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
predict = model.predict(test_dataset.batch(batch_size))
predict_label = list(map(lambda l: id2label[l], predict.ravel()>0.5))

print(classification_report(Y_test_label, predict_label))

### Using the `[CLS]` embedding

In [None]:
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from transformers import TFAutoModel

#######################################
### --------- Setup BERT ---------- ###


# Load transformers config and set output_hidden_states to False
config = AutoConfig.from_pretrained(nombre_modelo)
config.output_hidden_states = False

# Load the Transformers BERT model
transformer_model = TFAutoModel.from_pretrained(nombre_modelo, config = config)

#######################################
### ------- Build the model ------- ###
# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model

# Build your model input
input_ids = Input(shape=(MAX_SEQUENCE_LENGTH,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(MAX_SEQUENCE_LENGTH,), name='attention_mask', dtype='int32') 
# Load the Transformers BERT model as a layer in a Keras model
last_hidden_state = transformer_model(input_ids, attention_mask)[0]  # (bs, seq, dim)
logits = Dropout(0.1)(last_hidden_state[:,0,:])  # (bs, dim)
# Then build your model output
output = Dense(units=1,
               kernel_initializer=TruncatedNormal(stddev=config.initializer_range),
               activation="sigmoid",
               name='clases')(logits)
# And combine it all in a model object
model = Model(inputs=[input_ids, attention_mask], outputs=output, name='BERT_BinaryClass')
# Take a look at the model
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, epsilon=1e-08)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history=model.fit(train_dataset.batch(batch_size), epochs=n_epochs, validation_data=test_dataset.batch(batch_size))

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Fine-tuning Keras')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
predict = model.predict(test_dataset.batch(batch_size))
predict_label = list(map(lambda l: id2label[l], predict.ravel()>0.5))

print(classification_report(Y_test_label, predict_label))

## Using the pre-trained BERT model to generate sentence embeddings
We use the pre-trained BERT model (without fine-tuning) to generate the embeddings of the tweets. Then we train a binary classifier on our corpus.  
We try both the `pooler_output` and the `last_hidden_state` of the first token (CLS).

In [None]:
# Get the token IDs used as model input
entradas_train = train_encodings['input_ids']

In [None]:
entradas_train.shape

In [None]:
from transformers import TFAutoModel
# Load transformers config and set output_hidden_states to False
config = AutoConfig.from_pretrained(nombre_modelo)
config.output_hidden_states = False
# reload the model with the configuration above
transformer_model = TFAutoModel.from_pretrained(nombre_modelo, config = config)

# compute the document embeddings for the inputs (inference)
output_train = transformer_model(entradas_train)

In [None]:
# The output holds the last hidden layer for every token and the 'pooler_output' for the whole sequence
output_train.keys()

In [None]:
output_train.last_hidden_state.shape

In [None]:
output_train.pooler_output.shape

In [None]:
# Take the pooler_output as the document embedding
salidas_train = output_train.pooler_output
salidas_train.shape

In [None]:
# compute the vectors for the test set
# The token IDs were already computed, so we take them from the encodings variable
entradas_test = test_encodings['input_ids']
output_test = transformer_model(entradas_test)
salidas_test = output_test.pooler_output
salidas_test.shape

### Training and validating with scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

modelLR = LogisticRegression(solver='liblinear')
# Train the model on the training set and validate
modelLR.fit(salidas_train, Y_train)
prediccion = modelLR.predict(salidas_test)
print(classification_report(Y_test, prediccion, target_names=['N','P']))

### Exercise
Repeat the model using the embedding of the `[CLS]` token instead of the `pooler_output`.

In [None]:
# solution

### Using the attention mask
We repeat the experiment passing the attention masks (valid tokens only) when extracting the sentence embeddings, to see whether the result changes.
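
To see what the mask changes inside the model, here is a minimal NumPy sketch of a masked softmax over attention scores. Padded positions receive a large negative bias before the softmax, so their attention weight becomes effectively zero. The scores, the mask and the bias value of -1e4 are assumptions for illustration.

```python
import numpy as np

def masked_softmax(scores, attention_mask):
    """Softmax over the last axis, ignoring key positions where the mask is 0."""
    bias = (1.0 - attention_mask[:, None, :]) * -1e4  # large negative bias on padding
    z = scores + bias
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One sequence of 4 positions with equal scores; the last position is padding
scores = np.zeros((1, 4, 4))
attention_mask = np.array([[1.0, 1.0, 1.0, 0.0]])

weights_masked = masked_softmax(scores, attention_mask)
weights_unmasked = masked_softmax(scores, np.ones((1, 4)))
print(weights_masked[0, 0])
print(weights_unmasked[0, 0])
```

Without the mask, the `[CLS]` vector also attends to the padding tokens, which is why the embeddings (and possibly the classifier results) can differ between the two runs.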

In [None]:
mask_train = train_encodings['attention_mask']
mask_test = test_encodings['attention_mask']

# compute the document embeddings for the inputs
last_hidden_states = transformer_model.predict({'input_ids':entradas_train, 'attention_mask':mask_train})
salidas_train = last_hidden_states.last_hidden_state[:,0,:]
last_hidden_states = transformer_model.predict({'input_ids':entradas_test, 'attention_mask':mask_test})
salidas_test = last_hidden_states.last_hidden_state[:,0,:]

In [None]:
# Train the model on the training set and validate
modelLR.fit(salidas_train, Y_train)
prediccion = modelLR.predict(salidas_test)
print(classification_report(Y_test, prediccion, target_names=['N','P']))

In [None]:
# Compute the 'pooler_output' and train on this tensor
salidas_train = transformer_model.predict({'input_ids':entradas_train, 'attention_mask':mask_train})[1]
salidas_test = transformer_model.predict({'input_ids':entradas_test, 'attention_mask':mask_test})[1]

In [None]:
# Train the model on the training set and validate
modelLR.fit(salidas_train, Y_train)
prediccion = modelLR.predict(salidas_test)
print(classification_report(Y_test, prediccion, target_names=['N','P']))

### Training and validating with a dense layer
We train a basic DL model equivalent to the one used by `TFBertForSequenceClassification`, except that here the weights of the BERT model are not updated.  
We define a model with the Keras functional API that takes the BERT output vectors as sentence embeddings at its input.

In [None]:
# Build your model input
doc_embeddings = Input(shape=(config.hidden_size,), name='doc_embeddings', dtype='float32')

# Then build your model output
pooled_output = Dropout(0.1)(doc_embeddings)  # (bs, dim)
output = Dense(units=1,
               kernel_initializer=TruncatedNormal(stddev=config.initializer_range),
               activation="sigmoid",
               name='clases')(pooled_output)
# And combine it all in a model object
model = Model(inputs=doc_embeddings, outputs=output, name='Binary_BertPretrained')
# Take a look at the model
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-08) # training is very sensitive to these values
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(salidas_train, np.asarray(Y_train), epochs=n_epochs, batch_size=batch_size, validation_data=(salidas_test, np.asarray(Y_test)))

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('BERT embeddings + dense layer')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
predict = model.predict(salidas_test)
predict_clases = predict>0.5

print(classification_report(Y_test, predict_clases, target_names=['N','P']))

This model is equivalent to using the base `TFBertModel` as a frozen (not retrained) layer inside the full model.

In [None]:
# Load the Transformers BERT model
transformer_model = TFAutoModel.from_pretrained(nombre_modelo, config = config, name="Bert_model")
transformer_model.bert.trainable = False # freeze the BERT layers so their weights are not updated

#######################################
### ------- Build the model ------- ###
# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model

# Build your model input
input_ids = Input(shape=(MAX_SEQUENCE_LENGTH,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(MAX_SEQUENCE_LENGTH,), name='attention_mask', dtype='int32') 
# Load the Transformers BERT model as a layer in a Keras model
cls_output = transformer_model(input_ids, attention_mask)[0]  # (bs, seq, dim)
logits = Dropout(0.1)(cls_output[:,0,:])  # (bs, dim)
# Then build your model output
output = Dense(units=1,
               kernel_initializer=TruncatedNormal(stddev=config.initializer_range),
               activation="sigmoid",
               name='clases')(logits)
# And combine it all in a model object
model = Model(inputs=[input_ids, attention_mask], outputs=output, name='BERT_BinaryClass')

# Take a look at the model
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-08)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history=model.fit(train_dataset.batch(batch_size), epochs=n_epochs, validation_data=test_dataset.batch(batch_size))

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Pre-trained BERT with attention mask')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
predict = model.predict(test_dataset.batch(batch_size))
predict_clases = predict>0.5
from sklearn.metrics import classification_report

print(classification_report(Y_test, predict_clases, target_names=['N','P'], zero_division=0))

## Using the internal layers of the BERT model
Finally, we try concatenating the `[CLS]` vectors of the last 4 hidden layers for the prediction.

In [None]:
# Load the Transformers BERT model
transformer_model = TFAutoModel.from_pretrained(nombre_modelo, output_hidden_states=True, name="Bert_model")
transformer_model.bert.trainable = False # freeze the BERT layers so their weights are not updated

#######################################
### ------- Build the model ------- ###
# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model

# Build your model input
input_ids = Input(shape=(MAX_SEQUENCE_LENGTH,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(MAX_SEQUENCE_LENGTH,), name='attention_mask', dtype='int32') 
# Load the Transformers BERT model as a layer in a Keras model
hidden_states = transformer_model(input_ids, attention_mask)[2]  # (layer, bs, seq, dim)

hidden_states_size = 4 # count of the last states 
hiddes_states_ind = list(range(-hidden_states_size, 0, 1))

selected_hiddes_states = tf.keras.layers.concatenate(tuple([hidden_states[i][:,0,:] for i in hiddes_states_ind])) #first token of each layer

pooled_output = Dropout(0.1)(selected_hiddes_states)  # (bs, dim)
# Then build your model output
output = Dense(units=1,
               kernel_initializer=TruncatedNormal(stddev=config.initializer_range),
               activation="sigmoid",
               name='clases')(pooled_output)
# And combine it all in a model object
model = Model(inputs=[input_ids, attention_mask], outputs=output, name='BERT_BinaryClass')

# Take a look at the model
model.summary()

The number of parameters in this case is the number of concatenated layers x the embedding dimension + the bias.

In [None]:
4 * 768 + 1
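
The same count can be checked with a small NumPy sketch that mimics the concatenation on random stand-in tensors. The hidden states here are random values, and `dense_params` is a hypothetical helper. With `output_hidden_states=True` the model returns one tensor for the embeddings output plus one per encoder layer, hence 13 tensors for a 12-layer model.

```python
import numpy as np

n_layers, bs, seq_len, dim = 12, 2, 6, 768
rng = np.random.default_rng(0)

# Random stand-ins: embeddings output + one tensor per encoder layer
hidden_states = tuple(rng.normal(size=(bs, seq_len, dim)) for _ in range(n_layers + 1))

hidden_states_size = 4                                       # number of last layers to use
indices = list(range(-hidden_states_size, 0))                # [-4, -3, -2, -1]
cls_vectors = [hidden_states[i][:, 0, :] for i in indices]   # first token of each layer
features = np.concatenate(cls_vectors, axis=-1)              # (bs, 4 * dim)

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

n_params = dense_params(features.shape[1], 1)
print(features.shape, n_params)
```

The same formula applies to any dense output layer: the number of inputs times the number of units, plus one bias per unit.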

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-08)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history=model.fit(train_dataset.batch(batch_size), epochs=n_epochs, validation_data=test_dataset.batch(batch_size))

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Pre-trained BERT, last 4 layers + dense')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
predict = model.predict(test_dataset.batch(batch_size))
predict_clases = predict>0.5
from sklearn.metrics import classification_report

print(classification_report(Y_test, predict_clases, target_names=['N','P']))

We repeat this model, now training the whole network.

In [None]:
# Load the Transformers BERT model
transformer_model = TFAutoModel.from_pretrained(nombre_modelo, output_hidden_states=True, name="Bert_model")

#######################################
### ------- Build the model ------- ###
# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model

# Build your model input
input_ids = Input(shape=(MAX_SEQUENCE_LENGTH,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(MAX_SEQUENCE_LENGTH,), name='attention_mask', dtype='int32') 
# Load the Transformers BERT model as a layer in a Keras model
hidden_states = transformer_model(input_ids, attention_mask)[2]  # (layer, bs, seq, dim)

hidden_states_size = 4 # count of the last states 
hiddes_states_ind = list(range(-hidden_states_size, 0, 1))

selected_hiddes_states = tf.keras.layers.concatenate(tuple([hidden_states[i][:,0,:] for i in hiddes_states_ind])) #first token of each layer

pooled_output = Dropout(0.1)(selected_hiddes_states)  # (bs, dim)
# Then build your model output
output = Dense(units=1,
               kernel_initializer=TruncatedNormal(stddev=config.initializer_range),
               activation="sigmoid",
               name='clases')(pooled_output)
# And combine it all in a model object
model = Model(inputs=[input_ids, attention_mask], outputs=output, name='BERT_BinaryClass')

# Take a look at the model
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history=model.fit(train_dataset.batch(batch_size), epochs=n_epochs, validation_data=test_dataset.batch(batch_size))

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('BERT, last 4 layers + dense')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

In [None]:
predict = model.predict(test_dataset.batch(batch_size))
predict_clases = predict>0.5
from sklearn.metrics import classification_report

print(classification_report(Y_test, predict_clases, target_names=['N','P']))