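# Riemannian-Geometry-Based Hyperbolic Embedding BERT Classifier

A hyperbolic-space architecture grounded in Riemannian geometry and designed for intent classification. By combining classification heads that operate in several geometric spaces, it reached 94.55% accuracy.

## 1. Mathematical Foundations: Multi-Geometry Representation

The core innovation of this model is that it combines **three different geometric spaces** to strengthen semantic representations:

### 1.1 Euclidean Space

The conventional default vector space, in which a linear classifier operates: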

$$f_E(x) = Wx + b$$

It is cheap to compute and stable to train, but because the space is flat (zero curvature), its ability to represent hierarchical relationships is limited.

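### 1.2 Hyperbolic Space

A space with constant negative curvature that represents hierarchical data more effectively. The key operations in the Poincaré ball model are:

1. **Exponential map**: maps a vector from the tangent space at the origin into hyperbolic space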

   $$\exp_0^c(x) = \tanh(\sqrt{c}\|x\|) \frac{x}{\sqrt{c}\|x\|}$$

2. **Möbius addition**: vector addition inside hyperbolic space
   $$x \oplus_c y = \frac{(1+2c\langle x,y \rangle + c\|y\|^2)x + (1-c\|x\|^2)y}{1+2c\langle x,y \rangle + c^2\|x\|^2\|y\|^2}$$

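3. **Möbius matrix-vector multiplication**:
   applies a linear transformation in hyperbolic space by mapping to the tangent space with the logarithmic map $\log_0^c$, multiplying, and mapping back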
   $$M \otimes_c x = \exp_0^c(M\log_0^c(x))$$

Here $c > 0$ is a learnable curvature parameter (the ball has curvature $-c$), so the curvature of the hyperbolic space can be fitted to the data.
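The three operations above can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration and not the notebook's PyTorch implementation: the function names are hypothetical, and `log_map0` assumes the standard inverse of the exponential map at the origin, which the text references as $\log_0^c$ without writing it out.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exp_map0(v, c):
    """Exponential map at the origin: tangent vector -> point in the Poincare ball."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

def log_map0(x, c):
    """Assumed inverse of exp_map0: ball point -> tangent vector at the origin."""
    n = norm(x)
    if n == 0.0:
        return list(x)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in x]

def mobius_add(x, y, c):
    """Mobius addition of two points inside the ball of curvature -c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    a = 1 + 2 * c * xy + c * y2
    b = 1 - c * x2
    return [(a * xi + b * yi) / denom for xi, yi in zip(x, y)]

def mobius_matvec(M, x, c):
    """Apply matrix M (list of rows) in the tangent space, then map back to the ball."""
    v = log_map0(x, c)
    return exp_map0([dot(row, v) for row in M], c)
```

Whatever the length of the input vector, `exp_map0` returns a point whose norm is strictly below $1/\sqrt{c}$, which is what keeps the later distance computations well defined.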

### 1.3 Spherical Space

A space that operates on the unit sphere and is well suited to classification based on cosine similarity:

1. **Normalization**: projects a vector onto the unit sphere
   $$\hat{x} = \frac{x}{\|x\|}$$

2. **Spherical inner product**: for unit vectors, the inner product equals the cosine of the angle between them
   $$\langle \hat{x}, \hat{y} \rangle = \cos(\theta_{xy})$$
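As a small illustration of the two steps above, the following sketch normalizes two vectors and takes their inner product. It uses plain Python lists instead of tensors, and the helper names and the `eps` guard against zero-length vectors are assumptions made for the example.

```python
import math

def normalize(v, eps=1e-12):
    """Project v onto the unit sphere (eps avoids division by zero)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / max(n, eps) for a in v]

def cosine_similarity(x, y):
    """Inner product of the normalized vectors, i.e. cos(theta_xy)."""
    xh, yh = normalize(x), normalize(y)
    return sum(a * b for a, b in zip(xh, yh))
```

Because both inputs are normalized first, the score depends only on direction: scaling either vector by a positive constant leaves it unchanged.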

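## 2. Prototype-Based Learning

The model learns several prototype vectors per class so that it can represent complex class distributions more faithfully:

1. **Hyperbolic distance**: the distance between the input and a prototype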

   $$d_{\mathbb{H}}(x, p_i) = \frac{1}{\sqrt{c}} \cosh^{-1}\left(1 + \frac{2c\|x-p_i\|^2}{(1-c\|x\|^2)(1-c\|p_i\|^2)}\right)$$

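2. **Prototype logits**: for class $j$, the negative mean distance to that class's prototypes $p_{j,k}$, so closer inputs score higher
   $$\text{logit}_{P,j}(x) = -\frac{1}{K}\sum_{k=1}^K d_{\mathbb{H}}(x, p_{j,k})$$

Here $K$ is the number of prototypes per class (3 in this description; the `HybridModel` code in this notebook allocates 4).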
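The distance and logit formulas can be checked numerically with a short sketch. This is an illustrative plain-Python version under the assumption that all points already lie inside the ball ($c\|x\|^2 < 1$); the function names and the nested-list layout of `prototypes` are hypothetical.

```python
import math

def sq_norm(v):
    return sum(a * a for a in v)

def poincare_distance(x, p, c):
    """Hyperbolic distance between two points of the Poincare ball with curvature -c."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, p))
    denom = (1 - c * sq_norm(x)) * (1 - c * sq_norm(p))
    arg = 1 + 2 * c * diff2 / denom
    # max() guards against rounding pushing the argument slightly below 1.
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)

def prototype_logits(x, prototypes, c):
    """prototypes[j] holds the K prototype vectors of class j; returns one logit per class."""
    return [
        -sum(poincare_distance(x, p, c) for p in protos) / len(protos)
        for protos in prototypes
    ]
```

Every logit is non-positive, and the class whose prototypes lie closest to the input receives the largest (least negative) value.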

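## 3. Attention Pooling Mechanism

The model augments the [CLS] token embedding with an attention-weighted average over all tokens in the sentence:

1. **Attention scores**:
   $$e_i = v^T \tanh(W h_i)$$

2. **Masked normalization**: padding positions ($\text{mask}_i = 0$) are excluded before normalizing, so the weights over real tokens sum to 1
   $$\alpha_i = \frac{\text{mask}_i \exp(e_i)}{\sum_j \text{mask}_j \exp(e_j)}$$

3. **Weighted sum**:
   $$a = \sum_i \alpha_i h_i$$

4. **Combination**:
   $$\text{emb} = \frac{\text{cls} + a}{2}$$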
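The four steps can be traced with a small plain-Python sketch. It is an illustration only: the function names are hypothetical, `W` is passed as a list of rows, and the code assumes every sequence contains at least one unmasked token. Padding is excluded before normalization so that the attention weights of the real tokens sum to 1.

```python
import math

def attention_pool(hidden, mask, W, v):
    """hidden: token vectors h_i; mask: 1 for real tokens, 0 for padding."""
    scores = []
    for h in hidden:
        proj = [math.tanh(sum(w * a for w, a in zip(row, h))) for row in W]
        scores.append(sum(vi * pi for vi, pi in zip(v, proj)))
    # Softmax restricted to unmasked positions (shifted by the max for stability).
    m = max(s for s, keep in zip(scores, mask) if keep)
    exp = [math.exp(s - m) if keep else 0.0 for s, keep in zip(scores, mask)]
    z = sum(exp)
    alpha = [e / z for e in exp]
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]
    return alpha, pooled

def combine(cls_vec, pooled):
    """Average the [CLS] embedding with the attention-pooled vector."""
    return [(a + b) / 2 for a, b in zip(cls_vec, pooled)]
```

With two real tokens that receive equal scores and one padding token, the weights come out as 0.5, 0.5, and 0, regardless of how large the padding vector is.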

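## 4. Dynamically Weighted Ensemble

The model adjusts each head's contribution dynamically during training:

$$\text{logits} = w_E \cdot \text{logits}_E + w_H \cdot \text{logits}_H + w_S \cdot \text{logits}_S + w_P \cdot \text{logits}_P$$

The weights $w$ change with training progress, where $\text{progress} \in [0, 1]$ is the fraction of training completed: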

$$w_E = 0.5 - 0.1 \cdot \text{progress}$$
$$w_H = 0.2 + 0.1 \cdot \text{progress}$$
$$w_S = 0.2$$
$$w_P = 0.1$$

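Early in training the more stable Euclidean head receives the largest weight, and the weight of the hyperbolic head grows as training proceeds; the four weights always sum to 1. Note that the `HybridModel` code in this notebook does not implement this schedule: it learns the four weights as a parameter initialized to [0.4, 0.3, 0.2, 0.1] and normalizes them with a softmax.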
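The schedule in the formulas above can be written out as a short sketch. The names `head_weights` and `ensemble_logits` and the dictionary keys are hypothetical, and `progress` is assumed to be the fraction of training completed, clipped to [0, 1].

```python
def head_weights(progress):
    """Weights of the Euclidean, hyperbolic, spherical, and prototype heads."""
    p = min(max(progress, 0.0), 1.0)
    return {"E": 0.5 - 0.1 * p, "H": 0.2 + 0.1 * p, "S": 0.2, "P": 0.1}

def ensemble_logits(logits, progress):
    """logits maps each head key to a list of per-class logits of equal length."""
    w = head_weights(progress)
    n = len(logits["E"])
    return [sum(w[k] * logits[k][i] for k in w) for i in range(n)]
```

Since the Euclidean weight falls by exactly the amount the hyperbolic weight rises, the combined logits stay on the same scale throughout training.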

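## 5. Optimization Strategy

The main strategies used during optimization:

1. **Adaptive curvature learning**: the curvature is log-parameterized, so the model learns $\log c$ and recovers $c = e^{\log c}$, which keeps $c$ positive
2. **MixUp training**: MixUp data augmentation to learn the boundaries between classes
3. **Gradient clipping**: prevents overly steep gradients ($\|g\| \leq 1.0$)
4. **OneCycle learning rate**: an initial warmup followed by cosine decay
5. **Stochastic Weight Averaging (SWA)**: averages several checkpoints from the later part of training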
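To make the MixUp item concrete, here is a minimal sketch of mixing two inputs and interpolating their losses with the same coefficient. It is a simplified, dependency-free illustration: the notebook mixes whole batches of embeddings with a random permutation and uses a label-smoothed loss, whereas this example mixes a single pair and uses plain cross-entropy. The function names and the `rng` parameter are assumptions.

```python
import math
import random

def mixup(x_a, x_b, alpha=0.4, rng=random):
    """Draw lam ~ Beta(alpha, alpha) and interpolate the two inputs."""
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam

def cross_entropy(logits, label):
    """Negative log-softmax probability of the given label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def mixup_loss(logits, y_a, y_b, lam):
    """Interpolate the losses for both labels with the mixing coefficient."""
    return lam * cross_entropy(logits, y_a) + (1 - lam) * cross_entropy(logits, y_b)
```

When `lam` is 1 the mixed example is just the first input and the loss reduces to the ordinary cross-entropy for its label.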

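## Model Architecture Summary

* **Backbone**: RoBERTa-large (355M parameters)
* **Pooling**: [CLS] token + attention pooling
* **Classification heads**: Euclidean + hyperbolic + spherical + prototype (ensemble)
* **Loss function**: MixUp + label smoothing
* **Final accuracy**: 94.55% (Banking77 validation set)

By combining the strengths of the classical Euclidean approach with those of non-Euclidean geometry, this model achieved state-of-the-art performance on the intent classification task.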

In [None]:
!pip install transformers datasets tqdm

In [None]:
!pip install -q datasets transformers torch tqdm scikit-learn

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader, TensorDataset
from tqdm.auto import tqdm

class SAM(torch.optim.Optimizer):
    def __init__(self, params, base_optimizer, rho=0.05, adaptive=False, **kwargs):
        defaults = dict(rho=rho, adaptive=adaptive, **kwargs)
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer(self.param_groups, **kwargs)
        self.rho = rho
        self.adaptive = adaptive
    @torch.no_grad()
    def first_step(self, zero_grad=True):
        grad_norm = self._grad_norm()
        scale = self.rho / (grad_norm + 1e-12)
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None: continue
                e_w = (torch.abs(p) if self.adaptive else 1.0) * p.grad * scale
                p.add_(e_w)
                self.state[p]['e_w'] = e_w
        if zero_grad: self.zero_grad()
    @torch.no_grad()
    def second_step(self, zero_grad=True):
        for group in self.param_groups:
            for p in group['params']:
                if 'e_w' not in self.state[p]: continue
                p.sub_(self.state[p]['e_w'])
        self.base_optimizer.step()
        if zero_grad: self.zero_grad()
    def step(self, *args, **kwargs):
        raise NotImplementedError("Use first_step and second_step instead")
    def zero_grad(self):
        self.base_optimizer.zero_grad()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()
    def _grad_norm(self):
        shared_device = self.param_groups[0]['params'][0].device
        norm = torch.norm(torch.stack([
            ((torch.abs(p) if self.adaptive else 1.0) * p.grad).norm(p=2).to(shared_device)
            for group in self.param_groups for p in group['params']
            if p.grad is not None
        ]), p=2)
        return norm

class LabelSmoothingLoss(nn.Module):
    def __init__(self, classes, smoothing=0.1):
        super().__init__()
        self.cls, self.smooth = classes, smoothing
    def forward(self, pred, target):
        logp = F.log_softmax(pred, dim=1)
        with torch.no_grad():
            true = logp.clone().fill_(self.smooth/(self.cls-1))
            true.scatter_(1, target.unsqueeze(1), 1-self.smooth)
        return (-(true*logp).sum(1)).mean()

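def r_drop_loss(l1, l2, alpha=0.4):
    # R-Drop regularizer: symmetric KL divergence between two dropout passes.
    # Working in log space avoids log(0) when a softmax probability underflows.
    lp1, lp2 = F.log_softmax(l1, 1), F.log_softmax(l2, 1)
    kl1 = F.kl_div(lp1, lp2, reduction='batchmean', log_target=True)
    kl2 = F.kl_div(lp2, lp1, reduction='batchmean', log_target=True)
    return (kl1+kl2)/2 * alpha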

def mixup(x, y, alpha=0.4):
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0), device=x.device)
    return lam*x + (1-lam)*x[idx], y, y[idx], lam

class AttentionPooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, dim//2),
            nn.Tanh(),
            nn.Linear(dim//2, 1)
        )
    def forward(self, h, mask):
        scores = self.attn(h.float()).masked_fill(mask.unsqueeze(-1)==0, -1e4)
        weights = F.softmax(scores, dim=1).to(h.dtype)
        return (h * weights).sum(1)

class HybridModel(nn.Module):
    def __init__(self, backbone, num_labels, dropout=0.3):
        super().__init__()
        self.bert     = backbone
        dim            = backbone.config.hidden_size
        self.pool     = AttentionPooling(dim)
        self.eucl     = nn.Linear(dim, num_labels)
        self.curv     = nn.Parameter(torch.log(torch.tensor(2.0)))
        self.h1       = nn.Linear(dim, dim)
        self.h2       = nn.Linear(dim, num_labels)
        self.sph_proj = nn.Sequential(nn.LayerNorm(dim), nn.Dropout(dropout), nn.Linear(dim,dim))
        self.sph_head = nn.Linear(dim, num_labels)
        self.proto    = nn.Parameter(torch.randn(num_labels,4,dim)*1e-2)
        self.w        = nn.Parameter(torch.tensor([0.4,0.3,0.2,0.1]))
    def forward(self, x, mask, rd=False, inputs_embeds=None):
        if inputs_embeds is not None:
            out = self.bert(inputs_embeds=inputs_embeds, attention_mask=mask, output_hidden_states=True, return_dict=True)
        else:
            out = self.bert(x, mask, output_hidden_states=True, return_dict=True)
        cls = out.last_hidden_state[:,0]
        att = self.pool(out.last_hidden_state, mask)
        emb = 0.5*cls + 0.5*att
        if rd and self.training:
            emb = F.dropout(emb, 0.15)
        return self.head_from_emb(emb)
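    def head_from_emb(self, emb):
        # Curvature is stored as log(c), so c = exp(log c) stays positive.
        c    = torch.exp(self.curv).clamp(0.1,5.0)
        # Euclidean head.
        e    = self.eucl(emb)
        # Hyperbolic head. NOTE: the norm is divided out twice here, so
        # ||x|| = tanh(sqrt(c)*norm)/(sqrt(c)*norm), not the tanh(sqrt(c)*norm)/sqrt(c)
        # of the exponential map in Section 1.2; h1/h2 are ordinary linear layers.
        norm = torch.norm(emb,2,dim=1,keepdim=True).clamp_min(1e-6)
        x    = emb/norm * torch.tanh(torch.sqrt(c)*norm)/(torch.sqrt(c)*norm)
        h    = torch.tanh(self.h1(x)); h = self.h2(h)
        # Spherical head: project onto the unit sphere, then classify.
        s    = F.normalize(self.sph_proj(emb),2,1); s = self.sph_head(s)
        # Prototype head: negative mean Poincare distance to each class's K prototypes.
        # NOTE: this is the c = 1 form of the distance in Section 2 (c is not used).
        B,D  = x.size(); C,K,_ = self.proto.size()
        pf   = self.proto.view(C*K,D)
        u2   = (x*x).sum(1,True); v2 = (pf*pf).sum(1).unsqueeze(0)
        d2   = ((x.unsqueeze(1)-pf.unsqueeze(0))**2).sum(2)
        # Clamp the whole denominator, not just the (1-v2) factor.
        z    = 1+2*d2/((1-u2)*(1-v2)).clamp_min(1e-6)
        p    = -torch.acosh(z.clamp_min(1+1e-6)).view(B,C,K).mean(2)
        # Learnable head weights, normalized with softmax.
        w    = F.softmax(self.w, dim=0)
        return w[0]*e + w[1]*h + w[2]*s + w[3]*p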

def make_loader(txt, lbl, tok, bs, shuffle):
    enc = tok(txt, padding=True, truncation=True, max_length=192, return_tensors='pt')
    return DataLoader(TensorDataset(enc.input_ids, enc.attention_mask, torch.tensor(lbl)),
                      batch_size=bs, shuffle=shuffle)

def train_fold(model, train_dl, val_dl, device):
    criterion = LabelSmoothingLoss(model.bert.config.num_labels, smoothing=0.15)
    swa       = AveragedModel(model)
    optim_sam = SAM(model.parameters(), base_optimizer=AdamW,
                    rho=0.05, adaptive=True, lr=1e-5, weight_decay=0.01)
    scheduler = OneCycleLR(optim_sam.base_optimizer, max_lr=5e-5,
                           total_steps=len(train_dl)*25, pct_start=0.1)
    scheduler_swa = SWALR(optim_sam.base_optimizer, swa_lr=1e-5)
    best_acc, patience = 0, 0
    accum_steps = 2
    for epoch in range(1, 26):
        model.train()
        running_loss = 0.0
        pbar = tqdm(train_dl, desc=f"Epoch {epoch}/25")
        for i, (x, mask, y) in enumerate(pbar, 1):
            x, mask, y = x.to(device), mask.to(device), y.to(device)
            # R-Drop: two stochastic forward passes per batch plus a symmetric KL term.
            if np.random.rand() < 0.5:
                # MixUp on word embeddings only. The backbone adds position embeddings
                # and LayerNorm itself when given inputs_embeds, so passing the output
                # of model.bert.embeddings(x) would apply them twice.
                x_emb = model.bert.get_input_embeddings()(x)
                x_mix, y_a, y_b, lam = mixup(x_emb, y)
                logits1 = model(None, mask, rd=True, inputs_embeds=x_mix)
                logits2 = model(None, mask, rd=True, inputs_embeds=x_mix)
                loss = lam * criterion(logits1, y_a) + (1-lam) * criterion(logits1, y_b) + \
                       lam * criterion(logits2, y_a) + (1-lam) * criterion(logits2, y_b) + \
                       r_drop_loss(logits1, logits2)
            else:
                logits1 = model(x, mask, rd=True)
                logits2 = model(x, mask, rd=True)
                loss = criterion(logits1, y) + criterion(logits2, y) + r_drop_loss(logits1, logits2)
            loss = loss / accum_steps
            loss.backward()
            if i % accum_steps == 0:
                # SAM step 1: move the weights to the nearby point of highest loss.
                optim_sam.first_step(zero_grad=True)
                # Recompute the loss at the perturbed weights (current batch only).
                if np.random.rand() < 0.5:
                    x_emb = model.bert.get_input_embeddings()(x)
                    x_mix, y_a, y_b, lam = mixup(x_emb, y)
                    logits1 = model(None, mask, rd=True, inputs_embeds=x_mix)
                    logits2 = model(None, mask, rd=True, inputs_embeds=x_mix)
                    loss2 = lam * criterion(logits1, y_a) + (1-lam) * criterion(logits1, y_b) + \
                            lam * criterion(logits2, y_a) + (1-lam) * criterion(logits2, y_b) + \
                            r_drop_loss(logits1, logits2)
                else:
                    logits1 = model(x, mask, rd=True)
                    logits2 = model(x, mask, rd=True)
                    loss2 = criterion(logits1, y) + criterion(logits2, y) + r_drop_loss(logits1, logits2)
                loss2 = loss2 / accum_steps
                loss2.backward()
                # SAM step 2: restore the weights and apply the base optimizer update.
                optim_sam.second_step(zero_grad=True)
                scheduler.step()
                if epoch > 12:
                    swa.update_parameters(model)
                    scheduler_swa.step()
            running_loss += loss.item() * accum_steps
            pbar.set_postfix(loss=f"{running_loss/i:.6f}")
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, mask, y in val_dl:
                x,mask,y = x.to(device),mask.to(device),y.to(device)
                preds = model(x, mask).argmax(1)
                correct += (preds==y).sum().item()
                total   += y.size(0)
        acc = correct/total*100
        print(f"Epoch {epoch} Val Acc: {acc:.2f}%")
        if acc > best_acc:
            best_acc, patience = acc, 0
        else:
            patience += 1
            if patience >= 7:
                print("Early stopping.")
                break
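    # update_bn only recomputes BatchNorm statistics; the RoBERTa backbone and the
    # heads use LayerNorm, so this call returns without changing the model.
    update_bn(train_dl, swa)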
    return swa

def main():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    tok    = AutoTokenizer.from_pretrained('roberta-large')
    ds     = load_dataset('banking77')
    texts, labels = np.array(ds['train']['text']), np.array(ds['train']['label'])
    test_txt, test_lbl = ds['test']['text'], ds['test']['label']
    kf = StratifiedKFold(7, shuffle=True, random_state=42)
    ensemble_models = []
    for fold, (tr, va) in enumerate(kf.split(texts, labels), 1):
        print(f"\n--- Fold {fold}/7 ---")
        bert  = AutoModel.from_pretrained('roberta-large').to(device)
        bert.config.num_labels = len(set(labels))
        model = HybridModel(bert, len(set(labels))).to(device)
        for p in model.bert.parameters(): p.requires_grad=False
        for layer in model.bert.encoder.layer[-6:]:
            for p in layer.parameters(): p.requires_grad=True
        tr_dl = make_loader(texts[tr].tolist(), labels[tr].tolist(), tok, bs=8, shuffle=True)
        va_dl = make_loader(texts[va].tolist(), labels[va].tolist(), tok, bs=32, shuffle=False)
        swa_model = train_fold(model, tr_dl, va_dl, device)
        ensemble_models.append(swa_model)
    enc = tok(test_txt, padding=True, truncation=True, max_length=192, return_tensors='pt')
    test_dl = DataLoader(TensorDataset(enc.input_ids, enc.attention_mask, torch.tensor(test_lbl)), batch_size=32)
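    correct, total = 0, 0
    # Inference only: disable autograd so activations are not kept for backward.
    with torch.no_grad():
        for x, mask, y in test_dl:
            x, mask, y = x.to(device), mask.to(device), y.to(device)
            probs = []
            for m in ensemble_models:
                # Three passes per model. eval() is never called on the SWA copies,
                # so dropout appears to stay active and the passes can differ.
                for _ in range(3):
                    probs.append(m(x, mask).softmax(1))
            probs = sum(probs) / len(probs)
            preds = probs.argmax(1)
            correct += (preds==y).sum().item()
            total   += y.size(0)
    print(f"\nEnsemble Test Acc: {correct/total*100:.2f}%")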

if __name__=='__main__':
    main()


--- Fold 1/7 ---
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1 Val Acc: 40.10%
Epoch 2 Val Acc: 77.54%
Epoch 3 Val Acc: 87.82%
Epoch 4 Val Acc: 88.17%
Epoch 5 Val Acc: 89.08%
Epoch 6 Val Acc: 88.66%
Epoch 7 Val Acc: 90.62%
Epoch 8 Val Acc: 91.60%
Epoch 9 Val Acc: 91.74%
Epoch 10 Val Acc: 91.95%
Epoch 11 Val Acc: 92.44%
Epoch 12 Val Acc: 92.30%
Epoch 13 Val Acc: 92.86%
Epoch 14 Val Acc: 93.42%
Epoch 15 Val Acc: 93.14%
Epoch 16 Val Acc: 93.70%
Epoch 17 Val Acc: 93.49%
Epoch 18 Val Acc: 93.35%
Epoch 19 Val Acc: 93.70%
Epoch 20 Val Acc: 93.21%
Epoch 21 Val Acc: 93.42%
Epoch 22 Val Acc: 93.84%
Epoch 23 Val Acc: 93.28%
Epoch 24 Val Acc: 93.56%
Epoch 25 Val Acc: 93.77%

--- Fold 2/7 ---
Epoch 1 Val Acc: 22.46%
Epoch 2 Val Acc: 73.69%
Epoch 3 Val Acc: 88.03%
Epoch 4 Val Acc: 89.01%
Epoch 5 Val Acc: 90.48%
Epoch 6 Val Acc: 90.76%
Epoch 7 Val Acc: 91.60%
Epoch 8 Val Acc: 92.79%
Epoch 9 Val Acc: 92.79%
Epoch 10 Val Acc: 92.09%
Epoch 11 Val Acc: 92.86%
Epoch 12 Val Acc: 92.58%
Epoch 13 Val Acc: 93.70%
Epoch 14 Val Acc: 93.77%
Epoch 15 Val Acc: 93.63%
Epoch 16 Val Acc: 93.98%
Epoch 17 Val Acc: 93.91%
Epoch 18 Val Acc: 93.91%
Epoch 19 Val Acc: 93.84%
Epoch 20 Val Acc: 93.63%
Epoch 21 Val Acc: 93.77%
Epoch 22 Val Acc: 93.42%
Epoch 23 Val Acc: 93.77%
Early stopping.

--- Fold 3/7 ---
Epoch 1 Val Acc: 20.36%
Epoch 2 Val Acc: 74.60%
Epoch 3 Val Acc: 85.23%
Epoch 4 Val Acc: 88.52%
Epoch 5 Val Acc: 89.15%
Epoch 6 Val Acc: 91.53%
Epoch 7 Val Acc: 91.53%
Epoch 8 Val Acc: 92.30%
Epoch 9 Val Acc: 92.30%
Epoch 10 Val Acc: 91.25%
Epoch 11 Val Acc: 92.09%
Epoch 12 Val Acc: 92.37%
Epoch 13 Val Acc: 93.21%
Epoch 14 Val Acc: 93.77%
Epoch 15 Val Acc: 93.98%
Epoch 16 Val Acc: 93.91%
Epoch 17 Val Acc: 93.91%
Epoch 18 Val Acc: 93.84%
Epoch 19 Val Acc: 93.98%
Epoch 20 Val Acc: 93.98%
Epoch 21 Val Acc: 93.84%
Epoch 22 Val Acc: 94.05%
Epoch 23 Val Acc: 93.35%
Epoch 24 Val Acc: 93.91%
Epoch 25 Val Acc: 93.77%

--- Fold 4/7 ---
Epoch 1 Val Acc: 14.42%
Epoch 2 Val Acc: 75.58%
Epoch 3 Val Acc: 87.54%
Epoch 4 Val Acc: 89.29%
Epoch 5 Val Acc: 90.83%
Epoch 6 Val Acc: 92.51%
Epoch 7 Val Acc: 92.86%
Epoch 8 Val Acc: 91.11%
Epoch 9 Val Acc: 92.58%
Epoch 10 Val Acc: 92.72%
Epoch 11 Val Acc: 93.14%
Epoch 12 Val Acc: 93.70%
Epoch 13 Val Acc: 94.19%
Epoch 14 Val Acc: 94.19%
Epoch 15 Val Acc: 93.84%
Epoch 16 Val Acc: 93.77%
Epoch 17 Val Acc: 94.19%
Epoch 18 Val Acc: 94.33%
Epoch 19 Val Acc: 94.33%
Epoch 20 Val Acc: 94.54%
Epoch 21 Val Acc: 94.12%
Epoch 22 Val Acc: 94.40%
Epoch 23 Val Acc: 94.26%
Epoch 24 Val Acc: 94.54%
Epoch 25 Val Acc: 94.05%

--- Fold 5/7 ---
Epoch 1 Val Acc: 15.89%
Epoch 2 Val Acc: 74.18%
Epoch 3 Val Acc: 86.21%
Epoch 4 Val Acc: 87.19%
Epoch 5 Val Acc: 90.20%
Epoch 6 Val Acc: 89.78%
Epoch 7 Val Acc: 91.60%
Epoch 8 Val Acc: 92.79%
Epoch 9 Val Acc: 92.02%
Epoch 10 Val Acc: 92.58%
Epoch 11 Val Acc: 93.28%
Epoch 12 Val Acc: 93.00%
Epoch 13 Val Acc: 93.98%
Epoch 14 Val Acc: 94.05%
Epoch 15 Val Acc: 93.91%
Epoch 16 Val Acc: 94.82%
Epoch 17 Val Acc: 94.40%
Epoch 18 Val Acc: 93.77%
Epoch 19 Val Acc: 94.75%
Epoch 20 Val Acc: 94.19%
Epoch 21 Val Acc: 94.40%
Epoch 22 Val Acc: 94.68%
Epoch 23 Val Acc: 94.05%
Early stopping.

--- Fold 6/7 ---
Epoch 1 Val Acc: 11.97%
Epoch 2 Val Acc: 75.16%
Epoch 3 Val Acc: 87.05%
Epoch 4 Val Acc: 89.01%
Epoch 5 Val Acc: 91.39%
Epoch 6 Val Acc: 91.39%
Epoch 7 Val Acc: 90.76%
Epoch 8 Val Acc: 91.53%
Epoch 9 Val Acc: 92.37%
Epoch 10 Val Acc: 92.79%
Epoch 11 Val Acc: 91.67%
Epoch 12 Val Acc: 92.44%
Epoch 13 Val Acc: 92.79%
Epoch 14 Val Acc: 92.86%
Epoch 15 Val Acc: 93.14%
Epoch 16 Val Acc: 93.35%
Epoch 17 Val Acc: 93.70%
Epoch 18 Val Acc: 93.49%
Epoch 19 Val Acc: 92.93%
Epoch 20 Val Acc: 93.28%
Epoch 21 Val Acc: 93.07%
Epoch 22 Val Acc: 93.42%
Epoch 23 Val Acc: 93.28%
Epoch 24 Val Acc: 93.21%
Early stopping.

--- Fold 7/7 ---
Epoch 1 Val Acc: 35.97%
Epoch 2 Val Acc: 74.81%
Epoch 3 Val Acc: 84.95%
Epoch 4 Val Acc: 89.43%
Epoch 5 Val Acc: 90.27%
Epoch 6 Val Acc: 91.32%
Epoch 7 Val Acc: 92.44%
Epoch 8 Val Acc: 91.74%
Epoch 9 Val Acc: 92.44%
Epoch 10 Val Acc: 92.79%
Epoch 11 Val Acc: 92.86%
Epoch 12 Val Acc: 91.88%
Epoch 13 Val Acc: 93.70%
Epoch 14 Val Acc: 93.35%
Epoch 15 Val Acc: 93.98%
Epoch 16 Val Acc: 93.70%
Epoch 17 Val Acc: 93.98%
Epoch 18 Val Acc: 93.84%
Epoch 19 Val Acc: 93.84%
Epoch 20 Val Acc: 94.33%
Epoch 21 Val Acc: 93.91%
Epoch 22 Val Acc: 93.77%
Epoch 23 Val Acc: 93.56%
Epoch 24 Val Acc: 93.70%
Epoch 25 Val Acc: 94.05%

Ensemble Test Acc: 95.26%
