#Detecting AI-generated scientific papers

##Data preparation

Download the dataset

In [1]:
!pip install -q opendatasets

In [2]:
import opendatasets as od


od.download('https://www.kaggle.com/competitions/detecting-generated-scientific-papers')

Downloading detecting-generated-scientific-papers.zip to ./detecting-generated-scientific-papers
Extracting archive ./detecting-generated-scientific-papers/detecting-generated-scientific-papers.zip to ./detecting-generated-scientific-papers

Import the core libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

List the available data files

In [4]:
from pathlib import Path
import os


PATH_TO_DATA = Path('./detecting-generated-scientific-papers/')
INDEX_COL_NAME = 'id'
INPUT_COL_NAME = 'text'
TARGET_COL_NAME = 'fake'

os.listdir(PATH_TO_DATA)

['fake_papers_test_public.csv',
 'fake_papers_train_part_public.csv',
 'sample_submission.csv',
 'fake_papers_train_part_public_extended.csv',
 'fake_papers_test_public_extended.csv']

We use the competition's test file as the training set and its train file as the test set, because the test file contains roughly four times more labeled rows (21,403 vs. 5,350)

In [5]:
test_df = pd.read_csv(PATH_TO_DATA / "fake_papers_train_part_public_extended.csv", index_col=INDEX_COL_NAME)
train_df = pd.read_csv(PATH_TO_DATA / "fake_papers_test_public_extended.csv", index_col=INDEX_COL_NAME)
sample_sumbission_df = pd.read_csv(PATH_TO_DATA / "sample_submission.csv", index_col=INDEX_COL_NAME)

In [6]:
train_df.head()

| id | text | source | fake |
|---|---|---|---|
| 0 | The next chapter will consist of a series of e... | summarized_sdg | 1 |
| 3 | This chapter opens with a discussion of how "A... | summarized_sdg | 1 |
| 4 | Formal privileges to land are many times advan... | spinbot_paraphrased_sdg | 1 |
| 6 | In this paper, the paper focuses on the role t... | summarized_sdg | 1 |
| 7 | This article discusses the relationship betwee... | generated_sdg | 1 |


In [7]:
test_df.head()

| id | text | source | fake |
|---|---|---|---|
| 1 | Modern two-dimensional imaging is of such qual... | sdg_abstracts_original | 0 |
| 2 | Background: The optimal sequence of systemic p... | generated_sdg | 1 |
| 5 | This chapter opens with a discussion of the ef... | summarized_sdg | 1 |
| 10 | The time scale of the ultra-short-term can str... | micpro_retracted | 1 |
| 23 | Electronic nose or machine olfaction are syste... | generated_micpro | 1 |


In [8]:
train_df[TARGET_COL_NAME].value_counts()

| fake | count |
|---|---|
| 1 | 14660 |
| 0 | 6743 |
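
The class counts above show that about two thirds of the training texts are labeled fake, so a useful reference point for any F1 score is a trivial classifier that labels everything as fake. The sketch below computes that baseline from the counts alone; it is plain arithmetic, not part of the original pipeline.

```python
# Class counts taken from the value_counts() output above.
n_fake, n_real = 14660, 6743

# Hypothetical baseline: predict class 1 (fake) for every text.
tp, fp, fn = n_fake, n_real, 0

precision = tp / (tp + fp)   # equals the share of fake texts
recall = tp / (tp + fn)      # every fake text is caught
baseline_f1 = 2 * precision * recall / (precision + recall)

print(f"share of fake texts: {precision:.3f}, all-fake baseline F1: {baseline_f1:.3f}")
```

A model is only informative to the extent that its F1 clearly exceeds this baseline.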


In [9]:
train_df[INPUT_COL_NAME].apply(lambda s: len(s.split())).describe()

| statistic | text |
|---|---|
| count | 21403.0 |
| mean | 140.798626 |
| std | 69.129464 |
| min | 50.0 |
| 25% | 100.0 |
| 50% | 116.0 |
| 75% | 170.0 |
| max | 1535.0 |


In [10]:
x_train = train_df[INPUT_COL_NAME]
y_train = train_df[TARGET_COL_NAME]


x_test = test_df[INPUT_COL_NAME]
y_test = test_df[TARGET_COL_NAME]

##Modeling

Install the library for working with transformers. This notebook uses BERT

In [11]:
!pip install -q transformers

Tokenize the text data

In [12]:
import torch
import torch.nn as nn
from transformers import BertTokenizer
from torch.utils.data import DataLoader, TensorDataset


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encoded_train = tokenizer.batch_encode_plus(
    x_train,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='pt'
)

encoded_test = tokenizer.batch_encode_plus(
    x_test,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='pt'
)
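
With `truncation=True` and `max_length=512`, longer texts are cut to 512 tokens, and with `padding=True` every sequence in a call is padded to the longest one in that call. Because each split is encoded in a single call, all rows of a split end up with the same length. The toy sketch below imitates that behavior in plain Python. The helper `pad_and_truncate`, the integer ids, and `pad_id=0` are illustrative assumptions; the real tokenizer also keeps its special tokens in place when truncating, which this simplification ignores.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Three toy "tokenized texts" of different lengths (ids are arbitrary).
toy_batch = [[11, 12, 13, 14], [21, 22, 23, 24, 25, 26, 27], [31, 32]]
ids, mask = pad_and_truncate(toy_batch, max_length=5)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```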


Convert the data into torch datasets

In [13]:
import torch
from torch.utils.data import TensorDataset, DataLoader

BATCH = 4
NUM_WORKERS = 2

labels_train = torch.tensor(y_train.values, dtype=torch.long)
labels_test  = torch.tensor(y_test.values, dtype=torch.long)

train_dataset = TensorDataset(encoded_train['input_ids'], encoded_train['attention_mask'], labels_train)
test_dataset  = TensorDataset(encoded_test['input_ids'],  encoded_test['attention_mask'],  labels_test)

Split train into train/valid (80/20)

In [14]:
from sklearn.model_selection import train_test_split


train_idx, val_idx = train_test_split(np.arange(len(train_dataset)), test_size=0.2, random_state=42, stratify=y_train)

def subset_loader(dataset, idxs, batch_size=BATCH, shuffle=True):
    subset = torch.utils.data.Subset(dataset, idxs)
    return DataLoader(subset, batch_size=batch_size, shuffle=shuffle, num_workers=NUM_WORKERS)

train_loader = subset_loader(train_dataset, train_idx, batch_size=BATCH, shuffle=True)
valid_loader = subset_loader(train_dataset, val_idx, batch_size=BATCH, shuffle=False)
test_loader  = DataLoader(test_dataset, batch_size=BATCH, shuffle=False, num_workers=NUM_WORKERS)

print("Batches: train", len(train_loader), "valid", len(valid_loader), "test", len(test_loader))


Batches: train 4281 valid 1071 test 1338
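
The batch counts printed above can be cross-checked by hand. The sketch below assumes 21,403 training rows (the `count` from `describe()`), 5,350 test rows (the support reported for the test set), a validation share of 0.2 rounded up as `train_test_split` does for a fractional `test_size`, and a final partial batch that is kept, which is the `DataLoader` default.

```python
import math

n_rows = 21403       # rows in train_df
n_test_rows = 5350   # rows in test_df
batch = 4            # same value as BATCH in the notebook

n_valid = math.ceil(0.2 * n_rows)   # validation share is rounded up
n_train = n_rows - n_valid

sizes = {"train": n_train, "valid": n_valid, "test": n_test_rows}
# The last, possibly smaller, batch still counts as a batch.
batches = {name: math.ceil(n / batch) for name, n in sizes.items()}

print(sizes)
print(batches)
```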


Define the BERT classifier. The model returns a raw logit for each text; applying a sigmoid to it gives a score from 0 to 1 for how likely the text is to be AI-generated

In [15]:
import torch.nn as nn
from transformers import BertModel

class BertBinaryClassifier(nn.Module):
    def __init__(self, backbone_name='bert-base-uncased', dropout=0.2):
        super().__init__()
        self.backbone = BertModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden, hidden//2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden//2, 1)
        )
    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:,0,:]
        logits = self.classifier(cls).squeeze(-1)
        return logits

model = BertBinaryClassifier().to(device)
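
To make the tensor shapes in `forward` concrete, the sketch below runs the same classification head on a random stand-in for `out.last_hidden_state`, so no pretrained weights are downloaded. The batch size, sequence length, and random values are assumptions for illustration; 768 is the hidden size of `bert-base-uncased`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden = 2, 6, 768

# Stand-in for out.last_hidden_state: random values, not real BERT output.
last_hidden_state = torch.randn(batch_size, seq_len, hidden)

# Same layer layout as BertBinaryClassifier.classifier.
head = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(hidden, hidden // 2),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(hidden // 2, 1),
)
head.eval()  # switch dropout off for a deterministic forward pass

with torch.no_grad():
    cls = last_hidden_state[:, 0, :]     # first token ([CLS]) of each text
    logits = head(cls).squeeze(-1)       # one raw score per text
    probs = torch.sigmoid(logits)        # squashed into (0, 1)

print(cls.shape, logits.shape, probs)
```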



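Initialize the optimizer, using a higher learning rate for the classifier head (1e-4) than for the BERT backbone (2e-5). Define a function that collects the true and predicted labels over a loader and computes F1, plus a helper that predicts on raw texts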

In [17]:
from torch.amp import GradScaler, autocast
from torch.optim import AdamW
from sklearn.metrics import f1_score, classification_report, confusion_matrix

scaler = GradScaler()
criterion = nn.BCEWithLogitsLoss()

optimizer = AdamW([
    {"params": model.backbone.parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 1e-4}
], weight_decay=0.01)

def eval_loader(loader, model):
    model.eval()
    preds, trues = [], []
    with torch.no_grad():
        for ids, mask, labels in loader:
            ids = ids.to(device)
            mask = mask.to(device)
            labels = labels.to(device)
            logits = model(ids, mask)
            probs = torch.sigmoid(logits)
            batch_preds = (probs>0.5).long().cpu().numpy()
            preds.append(batch_preds)
            trues.append(labels.cpu().numpy())
    preds = np.concatenate(preds)
    trues = np.concatenate(trues)
    f1 = f1_score(trues, preds, average='binary')
    return f1, preds, trues

def predict_texts(texts, tokenizer, model, max_len=512):
    enc = tokenizer.batch_encode_plus(texts, padding=True, truncation=True, max_length=max_len, return_tensors='pt')
    ids = enc['input_ids'].to(device)
    mask = enc['attention_mask'].to(device)
    model.eval()
    with torch.no_grad():
        logits = model(ids, mask)
        probs = torch.sigmoid(logits).cpu().numpy()
        preds = (probs > 0.5).astype(int)
    return preds, probs
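
The loss and the 0.5 threshold used above can be illustrated without torch. The sketch below evaluates binary cross-entropy directly on a logit in a commonly used numerically stable form, compares it with the naive sigmoid-then-log version, and checks that thresholding the probability at 0.5 is the same as checking the sign of the logit. The function names are hypothetical, and this is not a claim about how `BCEWithLogitsLoss` is implemented internally.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(z, y):
    """Stable binary cross-entropy for a raw logit z and a label y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_naive(z, y):
    """Same loss via an explicit sigmoid; fine for moderate logits only."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


cases = [(2.0, 1), (2.0, 0), (-3.0, 1), (0.0, 0)]
max_diff = max(abs(bce_with_logits(z, y) - bce_naive(z, y)) for z, y in cases)

# probs > 0.5 in eval_loader is equivalent to logits > 0
threshold_agrees = all((sigmoid(z) > 0.5) == (z > 0) for z in [-4.0, -0.1, 0.1, 4.0])

print(max_diff, threshold_agrees)
```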


Train BERT and save the best weights

In [20]:
import time

EPOCHS = 5
best_val_f1 = -1.0
SAVE_PATH = "best_bert_binary.pt"

for epoch in range(1, EPOCHS+1):
    model.train()
    t0 = time.time()
    running_loss = 0.0
    for step, batch in enumerate(train_loader, start=1):
        ids, mask, labels = batch
        ids = ids.to(device)
        mask = mask.to(device)
        labels = labels.to(device).float()  # BCEWithLogitsLoss expects float targets

        # Forward pass under mixed precision
        with autocast(device_type=device.type):
            logits = model(ids, mask)
            loss = criterion(logits, labels)

        # Scaled backward pass and optimizer step, then reset the gradients
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        running_loss += loss.item()

    val_f1, val_preds, val_trues = eval_loader(valid_loader, model)
    print(f"Epoch {epoch} train_loss={running_loss/len(train_loader):.4f} val_f1={val_f1:.4f} time={(time.time()-t0):.0f}s")

    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        torch.save(model.state_dict(), SAVE_PATH)
        print("Saved best model, val_f1=", best_val_f1)


Epoch 1 train_loss=0.1132 val_f1=0.9880 time=722s
Saved best model, val_f1= 0.9879844305297004
Epoch 2 train_loss=0.0393 val_f1=0.9876 time=717s
Epoch 3 train_loss=0.0258 val_f1=0.9869 time=717s
Epoch 4 train_loss=0.0203 val_f1=0.9857 time=717s
Epoch 5 train_loss=0.0146 val_f1=0.9928 time=721s
Saved best model, val_f1= 0.9928425357873211
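
The checkpointing rule in the loop keeps only the weights with the highest validation F1 seen so far. The sketch below replays that rule on the rounded validation scores from the log above, without any training, to show on which epochs a save is triggered.

```python
# Validation F1 per epoch, copied (rounded) from the training log above.
val_f1_by_epoch = [0.9880, 0.9876, 0.9869, 0.9857, 0.9928]

best_val_f1 = -1.0
saved_epochs = []
for epoch, val_f1 in enumerate(val_f1_by_epoch, start=1):
    if val_f1 > best_val_f1:          # same condition as in the training loop
        best_val_f1 = val_f1
        saved_epochs.append(epoch)    # stands in for torch.save(...)

print("checkpoint written on epochs:", saved_epochs, "best val F1:", best_val_f1)
```

Epochs 2 to 4 score slightly lower than epoch 1, so the file on disk is only overwritten again at epoch 5.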


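After 5 epochs of about 12 minutes each, the best validation F1 is 0.993. We reload the best weights and evaluate on the test set, printing the classification report and the confusion matrix. FP (38) and FN (24) are far smaller than TP (3640) and TN (1648), and precision, recall, and F1 are all about 0.98 to 0.99 for both classes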

In [21]:
model.load_state_dict(torch.load(SAVE_PATH, map_location=device))
test_f1, test_preds, test_trues = eval_loader(test_loader, model)
print("Test F1:", test_f1)
print("\nClassification report (test):")
print(classification_report(test_trues, test_preds))
print("\nConfusion matrix:")
print(confusion_matrix(test_trues, test_preds))


Test F1: 0.991555434486516

Classification report (test):
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1686
           1       0.99      0.99      0.99      3664

    accuracy                           0.99      5350
   macro avg       0.99      0.99      0.99      5350
weighted avg       0.99      0.99      0.99      5350


Confusion matrix:
[[1648   38]
 [  24 3640]]
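
The headline metrics can be recomputed from the confusion matrix alone, which is a quick way to check that the report and the matrix agree. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels) and treats class 1 (fake) as the positive class.

```python
# Confusion matrix from the test run above: [[TN, FP], [FN, TP]].
tn, fp = 1648, 38
fn, tp = 24, 3640

precision = tp / (tp + fp)   # share of predicted-fake texts that are fake
recall = tp / (tp + fn)      # share of fake texts that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The recomputed F1 should agree with the `Test F1` value printed above up to floating-point rounding.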


Show predictions for the first 10 examples from the test set

In [22]:
n_show = 10
sample_texts = x_test[:n_show].tolist()
preds, probs = predict_texts(sample_texts, tokenizer, model, max_len=512)
for i, txt in enumerate(sample_texts):
    print(f"Example {i+1} | Pred: {int(preds[i])} | Prob: {probs[i]:.4f}")
    print(txt[:600].replace("\n"," ") + ("..." if len(txt)>600 else ""))
    print("-"*100)


Example 1 | Pred: 0 | Prob: 0.0001
Modern two-dimensional imaging is of such quality that echocardiography is now capable of detecting intrapericardial formations. Three morphological types of abnormal intrapericardial echoes have been described: round masses, mattresses and linear echoes. These have been observed in effusions of various origin and seem to be lacking in aetiological specificity. In order to determine more precisely the echocardiographic signs of pericardial metastases, the authors have analyzed 7 cases of intrapericardial masses visualized in a series of 10 patients with metastatic pericardial effusion and exam...
----------------------------------------------------------------------------------------------------
Example 2 | Pred: 1 | Prob: 0.9999
Background: The optimal sequence of systemic palliative chemotherapy in metastatic breast cancer is unknown Background: The optimal sequence of systemic palliative chemotherapy in metastatic breast cancer is unknown. The obje