### NLP model for detecting whether messages from the game DIPLOMACY are truthful, using XLNet
A prediction model that takes a message as a string and indicates whether it is true or false, using the pretrained XLNet model.
https://huggingface.co/docs/transformers/model_doc/xlnet

28 Jan 2024, Jhon A. Monsalve

In [1]:
import warnings
warnings.simplefilter("ignore")  # set before importing pandas so its import-time warnings are silenced

import numpy as np
import pandas as pd



Deception in Diplomacy Dataset
Dataset with intended and perceived deception labels in the negotiation-based game Diplomacy, where seven players compete for world domination by forging and breaking alliances with each other. Over 17,000 messages are annotated by the sender for their intended truthfulness and by the receiver for their perceived truthfulness. This dataset captures deception in long-lasting relationships, where the interlocutors strategically combine truth with lies to advance their objectives.

Distributed together with: It Takes Two to Lie: One to Lie, and One to Listen. Denis Peskov, Benny Cheng, Ahmed Elgohary, Joe Barrow, Cristian Danescu-Niculescu-Mizil and Jordan Boyd-Graber. Proceedings of ACL 2020.

In [2]:
# Download the historical DIPLOMACY data from ConvoKit
# !pip install convokit
from convokit import Corpus, download
corpus = Corpus(filename=download("diplomacy-corpus"))

Dataset already exists at C:\Users\josef\.convokit\downloads\diplomacy-corpus


In [3]:
# Summary statistics for the data source
corpus.print_summary_stats()

Number of Speakers: 83
Number of Utterances: 17289
Number of Conversations: 246


In [4]:
# Training data
corpus_train = Corpus(filename=download("diplomacy-corpus"))
corpus_train.filter_conversations_by(lambda convo: convo.meta.get('acl2020_fold')=='Train')
corpus_train.print_summary_stats()                                    

Dataset already exists at C:\Users\josef\.convokit\downloads\diplomacy-corpus
Number of Speakers: 62
Number of Utterances: 13132
Number of Conversations: 184


In [5]:
# Test data
corpus_test = Corpus(filename=download("diplomacy-corpus"))
corpus_test.filter_conversations_by(lambda convo: convo.meta.get('acl2020_fold')=='Test')
corpus_test.print_summary_stats()   

Dataset already exists at C:\Users\josef\.convokit\downloads\diplomacy-corpus
Number of Speakers: 14
Number of Utterances: 2741
Number of Conversations: 42


In [6]:
# Validation data
corpus_validation = Corpus(filename=download("diplomacy-corpus"))
corpus_validation.filter_conversations_by(lambda convo: convo.meta.get('acl2020_fold')=='Validation')
corpus_validation.print_summary_stats()   

Dataset already exists at C:\Users\josef\.convokit\downloads\diplomacy-corpus
Number of Speakers: 7
Number of Utterances: 1416
Number of Conversations: 20
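
The three cells above each reload the corpus and then keep a single fold; their counts add up to the full corpus (184 + 42 + 20 = 246 conversations, 13132 + 2741 + 1416 = 17289 utterances). As a rough sketch of the same partitioning done in one pass, the snippet below groups conversations by their `acl2020_fold` value. The dictionaries are hypothetical stand-ins for ConvoKit conversations (only the `acl2020_fold` key comes from the notebook), and `split_by_fold` is an illustrative helper, not part of ConvoKit.

```python
from collections import defaultdict

def split_by_fold(conversations, fold_key="acl2020_fold"):
    """Group conversation ids by the fold stored in their metadata."""
    folds = defaultdict(list)
    for convo in conversations:
        # Conversations without a fold label are grouped under None
        folds[convo["meta"].get(fold_key)].append(convo["id"])
    return dict(folds)

# Hypothetical toy conversations, not real corpus entries
toy_conversations = [
    {"id": "c1", "meta": {"acl2020_fold": "Train"}},
    {"id": "c2", "meta": {"acl2020_fold": "Test"}},
    {"id": "c3", "meta": {"acl2020_fold": "Train"}},
    {"id": "c4", "meta": {"acl2020_fold": "Validation"}},
]

folds = split_by_fold(toy_conversations)
print({name: len(ids) for name, ids in folds.items()})
```

Grouping once avoids loading three copies of the corpus, at the cost of working with ids rather than filtered `Corpus` objects.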


In [7]:
# Load the full set of corpus interactions from utterances.jsonl
utterances_path = './data/utterances.jsonl'  # renamed from `json` to avoid shadowing the json module

df = pd.read_json(utterances_path, lines=True)  # Load the JSONL file into a pandas DataFrame

# 'meta' holds one dict per row; expand its keys into separate columns
if 'meta' in df.columns:
    df_meta = df['meta'].apply(pd.Series)

    # Append the new columns to the original DataFrame
    dataframe = pd.concat([df, df_meta], axis=1)
else:
    # Without 'meta' there is nothing to expand; keep the data as loaded
    dataframe = df.copy()

dataframe.tail(3)

Unnamed: 0,id,root,text,speaker,meta,reply-to,timestamp,speaker_intention,receiver_perception,receiver,absolute_message_index,relative_message_index,year,game_score,game_score_delta,deception_quadrant
17286,Game9-turkey-france-31,Game9-turkey-france,you have anything else in mind?,turkey-Game9,"{'speaker_intention': 'Truth', 'receiver_perce...",Game9-turkey-france-30,1369,Truth,Truth,france-Game9,1369,31,1903,4,-1,Straightforward
17287,Game9-turkey-france-32,Game9-turkey-france,I guess I'd also be happy to support you into ...,turkey-Game9,"{'speaker_intention': 'Truth', 'receiver_perce...",Game9-turkey-france-31,1370,Truth,Truth,france-Game9,1370,32,1903,4,-1,Straightforward
17288,Game9-turkey-france-33,Game9-turkey-france,"That would be interesting, but I think I want ...",france-Game9,"{'speaker_intention': 'Truth', 'receiver_perce...",Game9-turkey-france-32,1385,Truth,,turkey-Game9,1385,33,1903,5,1,Unknown
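
To make the `meta` expansion concrete, here is a minimal standard-library sketch of what happens to a single JSONL line. The record is abbreviated from the rows shown above, and `flatten_meta` is a hypothetical helper written for illustration. Unlike the notebook cell, which keeps the `meta` column and drops it later, this sketch omits it right away.

```python
import json

def flatten_meta(record):
    """Return a copy of the record with the keys of 'meta' promoted to top level."""
    flat = {key: value for key, value in record.items() if key != "meta"}
    # Keys inside 'meta' become ordinary columns; a missing 'meta' is treated as empty
    flat.update(record.get("meta") or {})
    return flat

# One JSONL line shaped like the rows shown above (fields abbreviated)
line = json.dumps({
    "id": "Game9-turkey-france-31",
    "text": "you have anything else in mind?",
    "meta": {"speaker_intention": "Truth", "receiver_perception": "Truth"},
})

row = flatten_meta(json.loads(line))
print(row["speaker_intention"], row["receiver_perception"])
```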


In [9]:
dataframe.to_csv("./output/data.csv", index=False)

In [3]:
# Keep only the columns the model uses: the message text and the sender's intended truthfulness
df = dataframe[['text', 'speaker_intention']].copy()

# Encode speaker_intention as integers for the model: "Lie" -> 0, "Truth" -> 1
df['speaker_intention'] = df['speaker_intention'].map({"Lie": 0, "Truth": 1})
df = df.rename(columns={'text': 'texto', 'speaker_intention': 'respuesta'})
df.dtypes

texto        object
respuesta     int64
dtype: object
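
The label encoding can also be sketched without pandas, which makes it easy to fail loudly on an unexpected label and to check how many messages fall in each class. The mapping matches the notebook ("Lie" to 0, "Truth" to 1); `encode_labels` and the toy labels are hypothetical and only illustrate the idea.

```python
from collections import Counter

LABEL_TO_ID = {"Lie": 0, "Truth": 1}  # same encoding as the notebook

def encode_labels(labels):
    """Map label strings to integers, failing loudly on unexpected values."""
    unknown = sorted({label for label in labels if label not in LABEL_TO_ID}, key=repr)
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]

# Hypothetical labels for illustration only
toy_labels = ["Truth", "Truth", "Lie", "Truth"]
encoded = encode_labels(toy_labels)
counts = Counter(encoded)
print(encoded, dict(counts))
```

Inspecting the counts before training is a cheap way to see whether the two classes are balanced.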

# XLNet MODEL
A pretrained model that uses a Transformer architecture.

In [4]:
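import torch
import torch.nn.functional as F
# The AdamW re-exported by transformers is deprecated; PyTorch's version accepts the
# arguments used below, but note its default weight_decay is 0.01 rather than 0.0.
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import XLNetTokenizer, XLNetForSequenceClassification
# !pip install sentencepiece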

In [5]:
# Load the tokenizer and the pretrained XLNet model for sequence classification
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=2)

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Tokenize the texts
# XLNet defines no maximum length, so truncation=True has no effect unless max_length is given
# (the tokenizer otherwise warns that it defaults to no truncation).
# max_length=128 is an illustrative choice, not a tuned value.
inputs = tokenizer(df['texto'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')
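
The effect of padding and truncation in the tokenizer call above can be illustrated with a toy function. This is a sketch of the general idea only, not the tokenizer's implementation: `pad_and_truncate`, the pad id of 0, and left-side padding are all illustrative assumptions, and a real tokenizer also handles special tokens.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0, pad_left=True):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]              # truncation: drop tokens past the limit
    padding = [pad_id] * (max_length - len(kept))
    mask = [1] * len(kept)                     # 1 marks real tokens, 0 marks padding
    if pad_left:
        return padding + kept, [0] * len(padding) + mask
    return kept + padding, mask + [0] * len(padding)

# Hypothetical token ids; real ids would come from the tokenizer
short_ids, short_mask = pad_and_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_and_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Without a length limit, every message is padded to the length of the longest one in the corpus, which inflates the input tensors and slows each training step.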


In [7]:
# Build a TensorDataset from the token ids, attention masks and labels
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], torch.tensor(df['respuesta'].tolist()))

In [8]:
# Split the dataset into training and test sets [80-20]
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
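
`random_split` draws a different partition on every run and does not look at the labels. As a hedged alternative, the sketch below builds a seeded, stratified 80/20 split in plain Python; `stratified_split` and the toy labels are hypothetical. The resulting index lists could then be wrapped with `torch.utils.data.Subset`.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) keeping label proportions similar."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = int(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Hypothetical labels: 10 truths (1) and 5 lies (0)
toy_labels = [1] * 10 + [0] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Splitting within each label keeps the rarer class represented in both sets, which a purely random split does not guarantee.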

In [9]:
# Create DataLoaders to batch the data for training and evaluation
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

In [10]:
# Put the model in training mode
model.train()

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

In [19]:
# Train the model
for epoch in range(2):  # Number of passes over the training set
    for batch in train_dataloader:
        input_ids, attention_mask, labels = batch

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        optimizer.step()

KeyboardInterrupt: 
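
The loss in the training loop is the mean cross-entropy between the logits and the integer labels. A small worked example in plain Python may help show what that quantity is. The logits are made up, and `cross_entropy` here is a teaching sketch of the formula, not PyTorch's implementation.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under a softmax."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(value - peak) for value in row))
        total += log_sum_exp - row[label]
    return total / len(labels)

# Hypothetical logits for two messages, classes ordered [Lie, Truth]
toy_logits = [[0.0, 0.0], [2.0, 0.0]]
toy_labels = [1, 0]
loss = cross_entropy(toy_logits, toy_labels)
print(round(loss, 4))
```

The first message has equal logits, so it contributes log 2. The second favours the correct class, so its contribution is smaller.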

In [None]:
# Evaluate the model
model.eval()
total_correct = 0
total_samples = 0

with torch.no_grad():
    for batch in test_dataloader:
        input_ids, attention_mask, labels = batch
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        total_correct += (predictions == labels).sum().item()
        total_samples += labels.size(0)

accuracy = total_correct / total_samples
print(f'Accuracy on test set: {accuracy}')

I did not manage to reach this final stage: training was interrupted before it finished, so the evaluation cell was never run.
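
The evaluation cell reports accuracy only. If the two classes turn out to be imbalanced, accuracy can look good for a model that never predicts "Lie", so per-class precision, recall and F1 are a useful complement. The sketch below uses made-up predictions; `binary_f1` is a hypothetical helper, and class 0 stands for "Lie" as in the notebook's encoding.

```python
def binary_f1(predictions, labels, positive=0):
    """Precision, recall and F1 for one class (default: 0, i.e. 'Lie')."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for pred, gold in pairs if pred == positive and gold == positive)
    fp = sum(1 for pred, gold in pairs if pred == positive and gold != positive)
    fn = sum(1 for pred, gold in pairs if pred != positive and gold == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical predictions: a model that always answers "Truth" (1)
toy_labels = [1, 1, 1, 1, 0]
always_truth = [1, 1, 1, 1, 1]
accuracy = sum(p == g for p, g in zip(always_truth, toy_labels)) / len(toy_labels)
print(accuracy, binary_f1(always_truth, toy_labels))
```

In this toy case the accuracy is high while the F1 for the "Lie" class is zero, which is the failure mode that accuracy alone hides.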