# Traducción con DeepL y FinBERT

Notebook de referencia:

https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f

https://github.com/jamescalam/transformers/blob/main/course/language_classification/04_window_method_in_pytorch.ipynb

## Traducción

In [None]:
# Librería de traducción DeepL
!pip install deepl

In [81]:
import deepl
translator = deepl.Translator("3ef95800-106e-d003-f11d-49c2dd263233:fx")

In [3]:
import pandas as pd
df_mini = pd.read_csv('../texto_limpio.csv', index_col = 'Unnamed: 0')

from sklearn.utils import shuffle
dataset = shuffle(df_mini, random_state=42).reset_index(drop=True)
dataset

Unnamed: 0,ticker,date,body,r_adj,label,texto_limpio
0,TEF,2021-11-21 11:42:00.000,El magistrado escuchará a varios imputados rel...,0.056964,1,magistrado escuchará varios imputados relacion...
1,ANA,2021-08-27 13:48:00.000,"SEVILLA, 27 Ago. (EUROPA PRESS) - La sección d...",0.012567,1,27 sección agrupación sindical conductores soc...
2,TEF,2021-01-26 18:37:52.000,El Ibex 35 ha regresado a las subidas y rozado...,0.032565,1,ibex 35 regresado subidas rozado cota puntos p...
3,NTGY,2021-01-26 09:16:00.000,"MADRID, 26 Ene. (EUROPA PRESS) - El Ibex 35 ha...",0.022345,1,26 ibex 35 iniciado sesión martes subida lleva...
4,ELE,2021-12-16 11:48:00.000,"Naturgy se sitúa en la plaza 39 MADRID, 16 Dic...",-0.012581,-1,naturgy sitúa plaza 39 16 acciona energía reva...
...,...,...,...,...,...,...
21336,FER,2021-06-07 11:41:00.000,"MADRID, 7 Jun. (EUROPA PRESS) - Adriano Care, ...",0.012905,1,7 adriano socimi orientada residencias anciano...
21337,TEF,2021-06-05 17:30:00.000,"PAMPLONA, 5 Jun. (EUROPA PRESS) - La Asociació...",0.012192,1,5 asociación consumidores navarra irache recib...
21338,ELE,2021-03-08 15:11:00.000,"GRANADA, 8 Mar. (EUROPA PRESS) - El incendio, ...",0.019342,1,8 pasado 27 centro transformación ubicado call...
21339,ELE,2021-01-10 21:33:08.000,Filomena deja en España ciudades incomunicadas...,-0.012100,-1,filomena deja españa ciudades incomunicadas ba...


Controlando la cuota máxima de 500.000 caracteres de traducción en DeepL con la modalidad gratuita. Luego cuesta 20,00 € por cada millón de caracteres traducidos (aparte de una cuota mensual de 5€).

In [78]:
quota_deepl_control = dataset.body.str.len().cumsum() < 500000
translate_texts = dataset[quota_deepl_control].body.to_list()

Haciendo las traducciones en bloques de 40 y pausas de 1 minuto para no quedar bloqueado por DeepL en la modalidad gratuita.

In [82]:
from time import sleep
batch_size = 40
consolidated = []
for idx in range(0,len(translate_texts), batch_size):
    lista = translate_texts[idx:min(idx+batch_size, len(translate_texts)-1)]
    results = translator.translate_text(text=lista, source_lang="ES", target_lang="EN-US")
    consolidated.extend([i.text for i in results])
    sleep(60)
    

Guardando las traducciones en un pkl para evitar volver a reprocesarlas.

In [None]:
# import pickle
# pickle.dump(consolidated, open("../translation.pkl", "wb" ))

Cargando las traducciones desde el pkl para evitar reprocesarlas.

In [1]:
# import pickle
# consolidated = pickle.load(open("../translation.pkl", "rb" ))

Comprobando el resultado de la traducción.

In [2]:
consolidated[0]

'The magistrate will hear several defendants related to the alleged drug trafficking network MADRID, Nov 21 (EUROPA PRESS) - The judge of the National Court Ismael Moreno will start from Monday a new round of statements citing several defendants in the \'Operation Titella\', among which are the nephew of television producer José Luis Moreno and the alleged notary of the organization. In an order, to which Europa Press has had access, the head of the Central Court of Instruction Number 2 has agreed to take a statement to Raul Fernandez Rodriguez, whom the investigators place as the one in charge of making false invoices between the companies of his uncle to defraud the Treasury, avoiding the payment of the invoiced VAT. The agents, who point to Moreno\'s nephew as the administrator of Dreamlight International Productions, also indicate that he would have participated "directly" in the alleged swindle of the ventriloquist\'s partner, Alejandro Roemmers, as well as in the misappropriation

## Aplicando FinBERT

In [21]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

Se predice el sentimiento noticia por noticia en bloques de 510 caracteres ya que la gran mayoría exceden el límite de 512 caracteres. Entonces con este bucle se halla el sentimiento por bloques de 510 caracteres dentro de cada noticia y luego se promedia el sentimiento.

In [22]:
chunksize = 512
proba = []

for text in consolidated:
    tokens = tokenizer.encode_plus(text, add_special_tokens=False,
                                return_tensors='pt')
    input_id_chunks = tokens['input_ids'][0].split(510)
    mask_chunks = tokens['attention_mask'][0].split(510)

    # split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))

    # loop through each chunk
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([
            torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
        ])
        # add attention tokens to attention mask
        mask_chunks[i] = torch.cat([
            torch.tensor([1]), mask_chunks[i], torch.tensor([1])
        ])
        # get required padding length
        pad_len = chunksize - input_id_chunks[i].shape[0]
        # check if tensor length satisfies required chunk size
        if pad_len > 0:
            # if padding length is more than 0, we must add padding
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i], torch.Tensor([0] * pad_len)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.Tensor([0] * pad_len)
            ])
    
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    input_dict = {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.int()
    }

    outputs = model(**input_dict)
    probs = torch.nn.functional.softmax(outputs[0], dim=-1)
    probs = probs.mean(dim=0)
    proba.append(probs.detach().numpy().tolist())                

Token indices sequence length is longer than the specified maximum sequence length for this model (1429 > 512). Running this sequence through the model will result in indexing errors


Se incluyen las predicciones dentro del dataset_preds para un análisis consolidado y se codifican las predicciones del mismo que tenemos las labels en el resto de modelos.

### Con 3 clases de salida:

In [23]:
import numpy as np
proba_matrix = np.array(proba)
dataset_preds = dataset.iloc[:len(consolidated)].copy()
dataset_preds["Finbert_3label"] = proba_matrix.argmax(axis=1)

In [24]:
dataset_preds.Finbert_3label.replace({2:1,1:0,0:-1}, inplace=True)

In [25]:
dataset_preds[["body", "label", "Finbert_3label"]]

Unnamed: 0,body,label,Finbert_3label
0,El magistrado escuchará a varios imputados rel...,1,1
1,"SEVILLA, 27 Ago. (EUROPA PRESS) - La sección d...",1,1
2,El Ibex 35 ha regresado a las subidas y rozado...,1,-1
3,"MADRID, 26 Ene. (EUROPA PRESS) - El Ibex 35 ha...",1,0
4,"Naturgy se sitúa en la plaza 39 MADRID, 16 Dic...",-1,0
...,...,...,...
137,Los inversores esperaban con cierto nerviosism...,-1,0
138,"VALÈNCIA, 14 May. (EUROPA PRESS) - La aerolíne...",-1,-1
139,"BARCELONA, 29 Abr. (EUROPA PRESS) - El piloto ...",-1,1
140,Colocará en torno a 2.800 millones en el merca...,1,1


Pseudo-accuracy para poder comparar resultados entre diferentes clasificaciones de sentimiento que hemos hecho:

In [26]:
sum(dataset_preds.label == dataset_preds.Finbert_3label)/dataset_preds.shape[0]

0.34507042253521125

### Con 2 clases de salida:

In [27]:
import numpy as np
proba_matrix = np.array(proba)
proba_matrix = proba_matrix[:,[0,2]]
dataset_preds = dataset.iloc[:len(consolidated)].copy()
dataset_preds["Finbert_2label"] = proba_matrix.argmax(axis=1)

In [28]:
dataset_preds.Finbert_2label.replace({1:1,0:-1}, inplace=True)

In [29]:
dataset_preds[["body", "label", "Finbert_2label"]]

Unnamed: 0,body,label,Finbert_2label
0,El magistrado escuchará a varios imputados rel...,1,1
1,"SEVILLA, 27 Ago. (EUROPA PRESS) - La sección d...",1,1
2,El Ibex 35 ha regresado a las subidas y rozado...,1,-1
3,"MADRID, 26 Ene. (EUROPA PRESS) - El Ibex 35 ha...",1,-1
4,"Naturgy se sitúa en la plaza 39 MADRID, 16 Dic...",-1,1
...,...,...,...
137,Los inversores esperaban con cierto nerviosism...,-1,-1
138,"VALÈNCIA, 14 May. (EUROPA PRESS) - La aerolíne...",-1,-1
139,"BARCELONA, 29 Abr. (EUROPA PRESS) - El piloto ...",-1,1
140,Colocará en torno a 2.800 millones en el merca...,1,1


Pseudo-accuracy para poder comparar resultados entre diferentes clasificaciones de sentimiento que hemos hecho:

In [30]:
sum(dataset_preds.label == dataset_preds.Finbert_2label)/dataset_preds.shape[0]

0.49295774647887325

## FinBERT truncando con la longitud 512 caracteres (max BERT)

A continuación se hace una prueba truncando los 512 primeros caracteres y utilizando la pipeline de transformers.

In [192]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="ProsusAI/finbert")

In [193]:
preds = classifier(consolidated, padding=True, truncation=True)
dataset_preds["Finbert_truncated"] = pd.DataFrame(preds).label.replace({"positive":1,"neutral":0,"negative":-1})
dataset_preds

Unnamed: 0,ticker,date,body,r_adj,label,texto_limpio,Finbert_label,Finbert_truncated
0,TEF,2021-11-21 11:42:00.000,El magistrado escuchará a varios imputados rel...,0.056964,1,magistrado escuchará varios imputados relacion...,1,0
1,ANA,2021-08-27 13:48:00.000,"SEVILLA, 27 Ago. (EUROPA PRESS) - La sección d...",0.012567,1,27 sección agrupación sindical conductores soc...,1,0
2,TEF,2021-01-26 18:37:52.000,El Ibex 35 ha regresado a las subidas y rozado...,0.032565,1,ibex 35 regresado subidas rozado cota puntos p...,-1,1
3,NTGY,2021-01-26 09:16:00.000,"MADRID, 26 Ene. (EUROPA PRESS) - El Ibex 35 ha...",0.022345,1,26 ibex 35 iniciado sesión martes subida lleva...,0,1
4,ELE,2021-12-16 11:48:00.000,"Naturgy se sitúa en la plaza 39 MADRID, 16 Dic...",-0.012581,-1,naturgy sitúa plaza 39 16 acciona energía reva...,0,-1
...,...,...,...,...,...,...,...,...
137,AMS,2021-07-13 08:33:01.000,Los inversores esperaban con cierto nerviosism...,-0.019445,-1,inversores esperaban cierto nerviosismo refere...,0,-1
138,AENA,2021-05-14 15:36:00.000,"VALÈNCIA, 14 May. (EUROPA PRESS) - La aerolíne...",-0.023974,-1,14 aerolínea valenciana air nostrum vuelto rev...,-1,1
139,REP,2021-04-29 18:17:00.000,"BARCELONA, 29 Abr. (EUROPA PRESS) - El piloto ...",-0.021390,-1,29 piloto motogp marc márquez reconocido tras ...,1,0
140,BKT,2021-06-21 14:12:00.000,Colocará en torno a 2.800 millones en el merca...,0.014920,1,colocará torno millones mercado si alcanza pre...,1,0


Pseudo-accuracy para poder comparar resultados entre diferentes clasificaciones de sentimiento que hemos hecho:

In [200]:
sum(dataset_preds.label == dataset_preds.Finbert_truncated)/dataset_preds.shape[0]

0.323943661971831

# Transfer Learning con FinBERT

A continuación se reentrena la última capa únicamente de FinBERT para la clasificación de las labels del dataset de noticias y alpha.

In [9]:
len(consolidated)

142

In [25]:
dataset_preds = dataset.iloc[:len(consolidated)].copy()

In [26]:
dataset_preds.label = dataset_preds.label.replace({-1:0})

In [None]:
import numpy as np
proba_matrix = np.array(proba)
dataset_preds["Finbert_TL_2label"] = proba_matrix.argmax(axis=1)

In [27]:
import tensorflow as tf
import keras
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = TFAutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert", from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [28]:
model.layers[0].trainable=False
model.layers[-1].units = 1
model.layers[-1].activation=keras.activations.sigmoid

In [29]:
model.compile(
loss='binary_crossentropy',
optimizer=Adam(learning_rate=0.0001),
metrics=['accuracy']
)

In [16]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(consolidated, dataset_preds.label, test_size=0.2, random_state=42)

In [18]:
x_train = tokenizer(x_train, padding="max_length", truncation=True, return_tensors="tf")

In [19]:
x_test = tokenizer(x_test, padding="max_length", truncation=True, return_tensors="tf")

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint

filepath = './checkpoint'
model_checkpoint_callback = ModelCheckpoint(
    filepath=filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True
)

hist = model.fit(
    x_train.data,
    y_train,
    batch_size=1,
    epochs=200,
    validation_split=0.1,
    callbacks=[model_checkpoint_callback]
    )

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend()
plt.grid()
plt.show()

In [None]:
model.load_weights(filepath)
model.evaluate(x_test.data, y_test)

In [None]:
y_pred = model.predict(x_test.data)

In [None]:
from sklearn.metrics import (classification_report,
                             confusion_matrix,
                             roc_auc_score, precision_score)

In [None]:
report = classification_report(y_test, y_pred)
print(report)

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

def plot_cm(labels, predictions, p=0.5):
    cm = confusion_matrix(labels, predictions)
    plt.figure(figsize=(5, 5))
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title("Confusion matrix (non-normalized))")
    plt.ylabel("Actual label")
    plt.xlabel("Predicted label")


plot_cm(y_test, y_pred)