<a href="https://colab.research.google.com/github/ny-yo/kaggle-CommonLit-Readability-Prize/blob/main/01_commonlit_roberta_andrey_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Base kernel  
https://www.kaggle.com/andretugan/lightweight-roberta-solution-in-pytorch

Change history  
07/03 ver01 Initial version

In [1]:
!pip install kaggle



In [2]:
from google.colab import drive  # import the Colab Drive helper
drive.mount('/content/drive/')  # mount Google Drive

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [3]:
# download API key from google drive
## Original: https://colab.research.google.com/drive/1eufc8aNCdjHbrBhuy7M7X6BGyzAyRbrF#scrollTo=y5_288BYp6H1
## When you run for the first time, you will see a link to authenticate.

from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth

auth.authenticate_user()

drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

filename = "/root/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)

request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
# File modes are octal: 0o600 gives the owner read/write access only.
os.chmod(filename, 0o600)

Download 100%.
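
`os.chmod` takes the mode as an integer that is conventionally written in octal: owner-only read/write is `0o600` (decimal 384), while the decimal literal `600` equals octal `0o1130` and sets a different group of bits. A minimal sketch of the difference, using a temporary file as a stand-in for `kaggle.json` (the temporary path is an assumption for illustration, and the permission bits are only meaningful on POSIX systems):

```python
import os
import stat
import tempfile

# Decimal 600 and octal 0o600 are different numbers.
decimal_mode_as_octal = oct(600)  # the bits that the literal 600 would set
owner_rw = 0o600                  # read/write for the owner only

# Hypothetical stand-in for /root/.kaggle/kaggle.json.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp_path = tmp.name

os.chmod(tmp_path, owner_rw)
applied_mode = stat.S_IMODE(os.stat(tmp_path).st_mode)
os.remove(tmp_path)

print(decimal_mode_as_octal, oct(owner_rw), oct(applied_mode))
```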


In [4]:
!kaggle competitions list

ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                   2030-07-01 23:59:00  Getting Started     Prizes        201           False  
gan-getting-started                            2030-07-01 23:59:00  Getting Started     Prizes        339           False  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started  Knowledge        987           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       6167           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      51833            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge      13543            True  
connectx

In [5]:
!kaggle competitions download -c commonlitreadabilityprize

sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [6]:
!pip install transformers



In [7]:
import os
import math
import random
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from transformers import AdamW
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import AutoConfig
from transformers import get_cosine_schedule_with_warmup

from sklearn.model_selection import KFold

import gc
gc.enable()

In [8]:
NUM_FOLDS = 5
NUM_EPOCHS = 3
BATCH_SIZE = 16
MAX_LEN = 248
EVAL_SCHEDULE = [(0.5, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1, 1)]
ROBERTA_PATH = "roberta-base"
TOKENIZER_PATH = "roberta-base"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
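
`EVAL_SCHEDULE` pairs a validation-RMSE threshold with an evaluation period in training steps. The training loop further down picks the first entry whose threshold the latest `val_rmse` meets or exceeds, so validation runs more often as the score improves. A minimal sketch of that lookup (the helper name `pick_eval_period` and the sample RMSE values other than 0.9384 are hypothetical):

```python
EVAL_SCHEDULE = [(0.5, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1, 1)]


def pick_eval_period(val_rmse, schedule=EVAL_SCHEDULE):
    """Return the period of the first entry whose threshold val_rmse reaches."""
    for threshold, period in schedule:
        if val_rmse >= threshold:
            return period
    # Only reached if no threshold matches; the final (-1, 1) entry
    # normally acts as the catch-all.
    return schedule[-1][1]


# One sample value per schedule entry, from worst to best.
periods = [pick_eval_period(r) for r in (0.9384, 0.495, 0.485, 0.475, 0.46)]
print(periods)
```

With this schedule, a model above 0.5 RMSE is validated every 16 steps, and one below 0.47 is validated after every step.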

In [9]:
def set_random_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)
    os.environ["PYTHONHASHSEED"] = str(random_seed)

    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)

    torch.backends.cudnn.deterministic = True

In [10]:
train_df = pd.read_csv("/content/train.csv.zip")
print(train_df.shape)
# Remove incomplete entries if any.
train_df.drop(train_df[(train_df.target == 0) & (train_df.standard_error == 0)].index,
              inplace=True)
train_df.reset_index(drop=True, inplace=True)
print(train_df.shape)

test_df = pd.read_csv("/content/test.csv")
submission_df = pd.read_csv("/content/sample_submission.csv")

(2834, 6)
(2833, 6)


In [11]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

Dataset

In [12]:
class LitDataset(Dataset):
    def __init__(self, df, inference_only=False):
        super().__init__()

        self.df = df
        self.inference_only = inference_only
        self.text = df.excerpt.tolist()

        if not self.inference_only:
            self.target = torch.tensor(df.target.values, dtype=torch.float32)
        
        # Explanation of batch_encode_plus:
        # https://qiita.com/ichiroex/items/6e305a5d5bed7d715c2f
        self.encoded = tokenizer.batch_encode_plus(
            self.text,
            padding = "max_length",
            max_length = MAX_LEN,
            truncation = True,
            return_attention_mask = True
        )

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        input_ids = torch.tensor(self.encoded["input_ids"][index])
        attention_mask = torch.tensor(self.encoded["attention_mask"][index])

        if self.inference_only:
            return (input_ids, attention_mask)
        else:
            target = self.target[index]
            return (input_ids, attention_mask, target)

Model

In [13]:
class LitModel(nn.Module):
    def __init__(self):
        super().__init__()
        
        config = AutoConfig.from_pretrained(ROBERTA_PATH)
        config.update({"output_hidden_states":True, 
                       "hidden_dropout_prob": 0.0,
                       "layer_norm_eps": 1e-7})
        
        self.roberta = AutoModel.from_pretrained(ROBERTA_PATH, config=config)

        self.attention = nn.Sequential(
            nn.Linear(768, 512),
            nn.Tanh(),
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )

        self.regressor = nn.Sequential(
            nn.Linear(768, 1)
        )
    def forward(self, input_ids, attention_mask):
        roberta_output = self.roberta(input_ids=input_ids, attention_mask=attention_mask)

        # Take only the last hidden layer of RoBERTa.
        last_layer_hidden_states = roberta_output.hidden_states[-1]

        weights = self.attention(last_layer_hidden_states)

        context_vector = torch.sum(weights * last_layer_hidden_states, dim=1)  

        return self.regressor(context_vector)
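
The pooling step in `LitModel.forward` can be read as follows: the attention head produces one score per token, the softmax over `dim=1` turns those scores into weights that sum to 1 across the sequence, and the context vector is the weighted sum of the last-layer hidden states. Below is a minimal plain-Python sketch of that arithmetic for a single excerpt. The scores are taken as given rather than computed by the `Linear`/`Tanh` layers, and the 2-dimensional toy vectors are an assumption for illustration (the model uses 768 dimensions).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-token scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, scores):
    """Weighted sum of token vectors, with weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]


# Three toy token vectors; equal scores reduce to plain mean pooling.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled_equal = attention_pool(hidden_states, [0.0, 0.0, 0.0])

# A dominant score makes the pooled vector approach that token's vector.
pooled_peaked = attention_pool(hidden_states, [50.0, 0.0, 0.0])
print(pooled_equal, pooled_peaked)
```

Note that the model as written applies the softmax over all `MAX_LEN` positions without consulting `attention_mask`, so padded positions also receive some weight.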

In [14]:
def eval_mse(model, data_loader):
    """Evaluates the mean squared error of the |model| on |data_loader|"""
    model.eval()            
    mse_sum = 0

    with torch.no_grad():
        for batch_num, (input_ids, attention_mask, target) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)                        
            target = target.to(DEVICE)           
            
            pred = model(input_ids, attention_mask)                       

            mse_sum += nn.MSELoss(reduction="sum")(pred.flatten(), target).item()
                

    return mse_sum / len(data_loader.dataset)
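
`eval_mse` sums squared errors with `reduction="sum"` and divides once by the dataset size, which gives the exact mean even when the last batch is smaller than the others. Averaging per-batch means would over-weight that short batch. A minimal plain-Python sketch of the difference (the function name `rmse_from_batches` and the toy numbers are hypothetical):

```python
import math


def rmse_from_batches(batches):
    """batches: iterable of (predictions, targets) lists of equal length."""
    squared_error_sum = 0.0
    num_samples = 0
    for preds, targets in batches:
        squared_error_sum += sum((p - t) ** 2 for p, t in zip(preds, targets))
        num_samples += len(targets)
    return math.sqrt(squared_error_sum / num_samples)


# Two batches of unequal size, as happens with drop_last=False.
batches = [([1.0, 2.0], [0.0, 0.0]), ([3.0], [0.0])]
rmse = rmse_from_batches(batches)

# Averaging the per-batch MSEs instead gives a different (biased) number.
mean_of_batch_means = ((1.0 + 4.0) / 2 + 9.0 / 1) / 2
print(rmse ** 2, mean_of_batch_means)
```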

In [15]:
def predict(model, data_loader):
    """Returns an np.array with predictions of the |model| on |data_loader|"""
    model.eval()

    result = np.zeros(len(data_loader.dataset))    
    index = 0
    
    with torch.no_grad():
        for batch_num, (input_ids, attention_mask) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
                        
            pred = model(input_ids, attention_mask)                        

            result[index : index + pred.shape[0]] = pred.flatten().to("cpu")
            index += pred.shape[0]

    return result

In [16]:
def train(model, model_path, train_loader, val_loader,
          optimizer, scheduler=None, num_epochs=NUM_EPOCHS):    
    best_val_rmse = None
    best_epoch = 0
    step = 0
    last_eval_step = 0
    eval_period = EVAL_SCHEDULE[0][1]    

    start = time.time()

    for epoch in range(num_epochs):                           
        val_rmse = None         

        for batch_num, (input_ids, attention_mask, target) in enumerate(train_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)            
            target = target.to(DEVICE)                        

            optimizer.zero_grad()
            
            model.train()

            pred = model(input_ids, attention_mask)
                                                        
            mse = nn.MSELoss(reduction="mean")(pred.flatten(), target)
                        
            mse.backward()

            optimizer.step()
            if scheduler:
                scheduler.step()
            
            if step >= last_eval_step + eval_period:
                # Evaluate the model on val_loader.
                elapsed_seconds = time.time() - start
                num_steps = step - last_eval_step
                print(f"\n{num_steps} steps took {elapsed_seconds:0.3} seconds")
                last_eval_step = step
                
                val_rmse = math.sqrt(eval_mse(model, val_loader))                            

                print(f"Epoch: {epoch} batch_num: {batch_num}", 
                      f"val_rmse: {val_rmse:0.4}")

                for rmse, period in EVAL_SCHEDULE:
                    if val_rmse >= rmse:
                        eval_period = period
                        break                               
                
                if best_val_rmse is None or val_rmse < best_val_rmse:
                    best_val_rmse = val_rmse
                    best_epoch = epoch
                    torch.save(model.state_dict(), model_path)
                    print(f"New best_val_rmse: {best_val_rmse:0.4}")
                else:       
                    print(f"Still best_val_rmse: {best_val_rmse:0.4}",
                          f"(from epoch {best_epoch})")                                    
                    
                start = time.time()
                                            
            step += 1
                        
    
    return best_val_rmse

In [17]:
def create_optimizer(model):
    named_parameters = list(model.named_parameters())    
    
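    # Slice by position in named_parameters(). Indices 197-198 are skipped
    # (presumably the pooler layer, which the forward pass does not use).
    roberta_parameters = named_parameters[:197]
    attention_parameters = named_parameters[199:203]
    regressor_parameters = named_parameters[203:]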
        
    attention_group = [params for (name, params) in attention_parameters]
    regressor_group = [params for (name, params) in regressor_parameters]

    parameters = []
    parameters.append({"params": attention_group})
    parameters.append({"params": regressor_group})

    for layer_num, (name, params) in enumerate(roberta_parameters):
        weight_decay = 0.0 if "bias" in name else 0.01

        lr = 2e-5

        if layer_num >= 69:        
            lr = 5e-5

        if layer_num >= 133:
            lr = 1e-4

        parameters.append({"params": params,
                           "weight_decay": weight_decay,
                           "lr": lr})

    return AdamW(parameters)
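
`create_optimizer` sets the learning rate by a parameter's position in `named_parameters()`: indices below 69 get 2e-5, indices 69 to 132 get 5e-5, and indices 133 to 196 get 1e-4, while any parameter with "bias" in its name gets no weight decay. Assuming the usual `roberta-base` layout of 5 embedding parameters followed by 16 parameters per encoder layer (an assumption, not something the notebook checks), those cut-offs fall on the start of encoder layers 4 and 8. A minimal sketch of the mapping, with hypothetical helper names:

```python
EMBEDDING_PARAMS = 5   # assumed number of embedding parameters in roberta-base
PARAMS_PER_LAYER = 16  # assumed number of parameters per encoder layer


def roberta_lr(param_index):
    """Learning rate for a RoBERTa parameter, mirroring create_optimizer."""
    lr = 2e-5
    if param_index >= 69:
        lr = 5e-5
    if param_index >= 133:
        lr = 1e-4
    return lr


def roberta_weight_decay(param_name):
    """Weight decay rule from create_optimizer: none for bias parameters."""
    return 0.0 if "bias" in param_name else 0.01


def encoder_layer_of(param_index):
    """Encoder layer index under the assumed layout, None for embeddings."""
    if param_index < EMBEDDING_PARAMS:
        return None
    return (param_index - EMBEDDING_PARAMS) // PARAMS_PER_LAYER


boundaries = [(i, encoder_layer_of(i), roberta_lr(i)) for i in (0, 68, 69, 132, 133, 196)]
print(boundaries)
```

Under that layout, the embeddings and layers 0 to 3 train at 2e-5, layers 4 to 7 at 5e-5, and layers 8 to 11 at 1e-4, so layers closer to the output are updated more aggressively.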

In [18]:
gc.collect()

SEED = 1000
list_val_rmse = []

kfold = KFold(n_splits=NUM_FOLDS, random_state=SEED, shuffle=True)

for fold, (train_indices, val_indices) in enumerate(kfold.split(train_df)):    
    print(f"\nFold {fold + 1}/{NUM_FOLDS}")
    model_path = f"/content/drive/MyDrive/01-andrey-base-model_{fold + 1}.pth"
    
    set_random_seed(SEED + fold)
    
    train_dataset = LitDataset(train_df.loc[train_indices])    
    val_dataset = LitDataset(train_df.loc[val_indices])    
        
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                              drop_last=True, shuffle=True, num_workers=2)    
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                            drop_last=False, shuffle=False, num_workers=2)    
        
    set_random_seed(SEED + fold)    
    
    model = LitModel().to(DEVICE)
    
    optimizer = create_optimizer(model)                        
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_training_steps=NUM_EPOCHS * len(train_loader),
        num_warmup_steps=50)    
    
    list_val_rmse.append(train(model, model_path, train_loader,
                               val_loader, optimizer, scheduler=scheduler))

    del model
    gc.collect()
    
    print("\nPerformance estimates:")
    print(list_val_rmse)
    print("Mean:", np.array(list_val_rmse).mean())


Fold 1/5


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



16 steps took 6.89 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9384
New best_val_rmse: 0.9384

16 steps took 6.34 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.8438
New best_val_rmse: 0.8438

16 steps took 6.34 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.6332
New best_val_rmse: 0.6332

16 steps took 6.34 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6437
Still best_val_rmse: 0.6332 (from epoch 0)

16 steps took 6.34 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.5707
New best_val_rmse: 0.5707

16 steps took 6.33 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5282
New best_val_rmse: 0.5282

16 steps took 6.33 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5513
Still best_val_rmse: 0.5282 (from epoch 0)

16 steps took 6.34 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5392
Still best_val_rmse: 0.5282 (from epoch 0)

16 steps took 6.44 seconds
Epoch: 1 batch_num: 3 val_rmse: 0.5223
New best_val_rmse: 0.5223

16 steps took 6.33 seconds
Epoch: 1 batch_num: 19 val_rmse: 0.5129
New best_val_rmse: 0.5129

16 step

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



16 steps took 6.84 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9605
New best_val_rmse: 0.9605

16 steps took 6.34 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.7348
New best_val_rmse: 0.7348

16 steps took 6.33 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.7169
New best_val_rmse: 0.7169

16 steps took 6.33 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6632
New best_val_rmse: 0.6632

16 steps took 6.34 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.6239
New best_val_rmse: 0.6239

16 steps took 6.34 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.6574
Still best_val_rmse: 0.6239 (from epoch 0)

16 steps took 6.34 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.582
New best_val_rmse: 0.582

16 steps took 6.33 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5612
New best_val_rmse: 0.5612

16 steps took 6.46 seconds
Epoch: 1 batch_num: 3 val_rmse: 0.537
New best_val_rmse: 0.537

16 steps took 6.33 seconds
Epoch: 1 batch_num: 19 val_rmse: 0.5049
New best_val_rmse: 0.5049

16 steps took 6.33 seconds
Epoch: 1 batch_num

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



16 steps took 6.85 seconds
Epoch: 0 batch_num: 16 val_rmse: 1.021
New best_val_rmse: 1.021

16 steps took 6.34 seconds
Epoch: 0 batch_num: 32 val_rmse: 1.065
Still best_val_rmse: 1.021 (from epoch 0)

16 steps took 6.34 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.6681
New best_val_rmse: 0.6681

16 steps took 6.33 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6229
New best_val_rmse: 0.6229

16 steps took 6.33 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.6371
Still best_val_rmse: 0.6229 (from epoch 0)

16 steps took 6.34 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5993
New best_val_rmse: 0.5993

16 steps took 6.34 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5311
New best_val_rmse: 0.5311

16 steps took 6.33 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5808
Still best_val_rmse: 0.5311 (from epoch 0)

16 steps took 6.46 seconds
Epoch: 1 batch_num: 3 val_rmse: 0.5132
New best_val_rmse: 0.5132

16 steps took 6.33 seconds
Epoch: 1 batch_num: 19 val_rmse: 0.5562
Still best_val_rmse: 0.5132 (from epoc

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



16 steps took 6.86 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9691
New best_val_rmse: 0.9691

16 steps took 6.33 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.8864
New best_val_rmse: 0.8864

16 steps took 6.34 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.8383
New best_val_rmse: 0.8383

16 steps took 6.33 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.7234
New best_val_rmse: 0.7234

16 steps took 6.33 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.6247
New best_val_rmse: 0.6247

16 steps took 6.34 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.625
Still best_val_rmse: 0.6247 (from epoch 0)

16 steps took 6.34 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5602
New best_val_rmse: 0.5602

16 steps took 6.33 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.6244
Still best_val_rmse: 0.5602 (from epoch 0)

16 steps took 6.47 seconds
Epoch: 1 batch_num: 3 val_rmse: 0.548
New best_val_rmse: 0.548

16 steps took 6.33 seconds
Epoch: 1 batch_num: 19 val_rmse: 0.5286
New best_val_rmse: 0.5286

16 steps took 6.34 seconds


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



16 steps took 6.87 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9208
New best_val_rmse: 0.9208

16 steps took 6.34 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.7644
New best_val_rmse: 0.7644

16 steps took 6.34 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.6432
New best_val_rmse: 0.6432

16 steps took 6.33 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6143
New best_val_rmse: 0.6143

16 steps took 6.33 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.5717
New best_val_rmse: 0.5717

16 steps took 6.33 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5819
Still best_val_rmse: 0.5717 (from epoch 0)

16 steps took 6.33 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5923
Still best_val_rmse: 0.5717 (from epoch 0)

16 steps took 6.33 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5499
New best_val_rmse: 0.5499

16 steps took 6.48 seconds
Epoch: 1 batch_num: 3 val_rmse: 0.6323
Still best_val_rmse: 0.5499 (from epoch 0)

16 steps took 6.33 seconds
Epoch: 1 batch_num: 19 val_rmse: 0.5454
New best_val_rmse: 0.5454

16 step
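
For reference, the learning-rate schedule built by `get_cosine_schedule_with_warmup` in the training cell can be sketched as a multiplier applied to each parameter group's base rate: a linear ramp from 0 to 1 over the warmup steps, then half a cosine wave down to 0. The function below is a plain-Python approximation under the library's default half-cycle setting, not the library code itself. The figure of 141 batches per epoch is inferred from the log above, where step 144 appears as `Epoch: 1 batch_num: 3`.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Approximate LR multiplier: linear warmup, then cosine decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(
        1, num_training_steps - num_warmup_steps
    )
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 3 * 141  # NUM_EPOCHS * len(train_loader), inferred from the log
WARMUP_STEPS = 50      # num_warmup_steps used in the notebook

multipliers = [
    cosine_warmup_multiplier(s, WARMUP_STEPS, TOTAL_STEPS)
    for s in (0, 25, 50, TOTAL_STEPS)
]
print(multipliers)
```

The multiplier peaks at step 50, so the top encoder layers reach their full 1e-4 rate roughly a third of the way through the first epoch and decay from there.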