<a href="https://colab.research.google.com/github/konumaru/commonLit_readability_prize/blob/main/notebook/RoBERta_Attention_Head_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CommonLit Training with roberta-base

## Experiments


## How to prevent timeouts

Run the following in the browser's developer console:

```javascript
function ClickConnect(){
  console.log("Reconnecting every 60s");
  document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect, 1000 * 60);
```

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!nvidia-smi

Fri Jul 30 11:16:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

## Download Raw Data

In [4]:
%%bash

pip install -q kaggle
mkdir -p ~/.kaggle
cp drive/MyDrive/kaggle/kaggle.json ~/.kaggle/
chmod 600 /root/.kaggle/kaggle.json

kaggle competitions download -c commonlitreadabilityprize

unzip train.csv.zip

Downloading train.csv.zip to /content

Downloading sample_submission.csv to /content

Downloading test.csv to /content

Archive:  train.csv.zip
  inflating: train.csv               




In [5]:
import os
os.makedirs("working", exist_ok=True)

In [15]:
import gc
import math
import os
import pathlib
import random
import time

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn import model_selection
from sklearn.metrics import mean_squared_error
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, AutoTokenizer, get_cosine_schedule_with_warmup

## Parameters

In [16]:
MAX_LEN = 256  # 248

NUM_FOLDS = 5
NUM_EPOCHS = 3
BATCH_SIZE = 16

ROBERTA_PATH = "roberta-base"  # roberta-base  deepset/roberta-base-squad2  roberta-large
EVAL_SCHEDULE = [(0.50, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1., 1)]

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

SEEDS = [42, 422, 12, 123, 7]
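
`EVAL_SCHEDULE` pairs a validation-RMSE threshold with an evaluation period in steps: the better the score, the more often the model is evaluated. The sketch below isolates that lookup in a standalone helper. The name `next_eval_period` and the sample scores are illustrative assumptions, not part of the notebook.

```python
# Mirrors the schedule from the Parameters cell: (rmse_threshold, eval_period_in_steps).
EVAL_SCHEDULE = [(0.50, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1.0, 1)]


def next_eval_period(val_rmse, current_period, schedule=EVAL_SCHEDULE):
    """Return the evaluation period to use after observing val_rmse.

    The first threshold that val_rmse meets or exceeds selects a period;
    taking min() with the current period means the period never grows back.
    """
    for threshold, period in schedule:
        if val_rmse >= threshold:
            return min(period, current_period)
    return current_period


period = EVAL_SCHEDULE[0][1]  # training starts by evaluating every 16 steps
history = []
for rmse in [0.62, 0.495, 0.51, 0.475, 0.46]:  # made-up validation scores
    period = next_eval_period(rmse, period)
    history.append(period)
print(history)  # → [16, 8, 8, 2, 1]
```

Because of the `min()`, a temporarily worse score (0.51 after 0.495 here) does not relax the period back to 16.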

## Split Data

In [17]:
def load_data():
    data = pd.read_csv("train.csv", usecols=["id", 'excerpt', 'target', "standard_error"])
    data.drop(data[(data.target == 0) & (data.standard_error == 0)].index, inplace=True)
    data.reset_index(drop=True, inplace=True)
    return data

In [18]:
def split_fold(data: pd.DataFrame, num_splits: int = 5, seed: int = 42):
    dump_dir = pathlib.Path("working/split")
    dump_dir.mkdir(parents=True, exist_ok=True)

    num_bins = int(np.floor(1 + np.log2(len(data))))
    target_bins = pd.cut(data["target"], bins=num_bins, labels=False)
    cv = model_selection.StratifiedKFold(n_splits=num_splits, shuffle=True, random_state=seed)
    for n_fold, (train_idx, valid_idx) in enumerate(cv.split(data, target_bins)):

        train = data.loc[train_idx, :]
        valid = data.loc[valid_idx, :]

        fold_dump_dir = dump_dir / f"{n_fold}-fold"
        fold_dump_dir.mkdir(exist_ok=True)

        train.to_pickle(fold_dump_dir / "train.pkl")
        valid.to_pickle(fold_dump_dir / "valid.pkl")

        print("Fold:", n_fold)
        print(f"\tTrain Target Average: {train.target.mean():.06f}" + f"\tTrain Size={train.shape[0]}")
        print(f"\tValid Target Average: {valid.target.mean():.06f}" + f"\tValid Size={valid.shape[0]}")

In [19]:
data = load_data()
split_fold(data, seed=422)

Fold: 0
	Train Target Average: -0.958474	Train Size=2266
	Valid Target Average: -0.964388	Valid Size=567
Fold: 1
	Train Target Average: -0.960312	Train Size=2266
	Valid Target Average: -0.957042	Valid Size=567
Fold: 2
	Train Target Average: -0.959457	Train Size=2266
	Valid Target Average: -0.960460	Valid Size=567
Fold: 3
	Train Target Average: -0.960964	Train Size=2267
	Valid Target Average: -0.954425	Valid Size=566
Fold: 4
	Train Target Average: -0.959081	Train Size=2267
	Valid Target Average: -0.961967	Valid Size=566
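
`split_fold` stratifies a continuous target by first cutting it into equal-width bins, with the bin count given by Sturges' rule, `floor(1 + log2(n))`. Below is a minimal pure-Python sketch of that binning step. The helper names and the sample targets are assumptions for illustration, and `pd.cut` treats bin edges slightly differently (right-closed intervals with a small extension of the range).

```python
import math


def sturges_bins(n_samples):
    """Number of histogram bins by Sturges' rule: floor(1 + log2(n))."""
    return int(math.floor(1 + math.log2(n_samples)))


def equal_width_bin(value, lo, hi, num_bins):
    """Index of the equal-width bin over [lo, hi] that contains value."""
    if value >= hi:
        return num_bins - 1  # the maximum falls into the last bin
    return int((value - lo) / (hi - lo) * num_bins)


# 2833 rows = 2266 train + 567 valid, as reported in the fold log above.
num_bins = sturges_bins(2833)
print(num_bins)  # → 12

targets = [-3.5, -2.0, -1.0, -0.5, 0.0, 1.5]  # made-up readability scores
labels = [equal_width_bin(t, min(targets), max(targets), num_bins) for t in targets]
print(labels)
```

The resulting integer labels are what `StratifiedKFold` receives in place of the raw target, so each fold gets a similar target distribution.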


## Dataset & Dataloader

In [20]:
%%bash
sudo pip install -q -U pip
sudo pip install -q git+https://github.com/huggingface/transformers


In [22]:
class LitDataset(Dataset):
    def __init__(self, df, model_name_or_path="roberta-base", inference_only=False):
        super().__init__()

        self.df = df        
        self.inference_only = inference_only
        self.text = df.excerpt.tolist()
        #self.text = [text.replace("\n", " ") for text in self.text]
        
        if not self.inference_only:
            self.target = torch.tensor(df.target.values, dtype=torch.float32)        

        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
        self.encoded = tokenizer.batch_encode_plus(
            self.text,
            padding = 'max_length',
            max_length = MAX_LEN,
            truncation = True,
            return_attention_mask=True
        )
 

    def __len__(self):
        return len(self.df)

    
    def __getitem__(self, index):        
        input_ids = torch.tensor(self.encoded['input_ids'][index])
        attention_mask = torch.tensor(self.encoded['attention_mask'][index])
        
        if self.inference_only:
            return (input_ids, attention_mask)            
        else:
            target = self.target[index]
            return (input_ids, attention_mask, target)
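
With `padding='max_length'` and `truncation=True`, every encoded excerpt ends up exactly `MAX_LEN` tokens long, and the attention mask marks which positions are real tokens. The following is a simplified sketch of that contract only. `pad_or_truncate` is a hypothetical helper, `pad_id=1` is assumed to match the `<pad>` id of `roberta-base`, and the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long.

    pad_id=1 is an assumption matching roberta-base's <pad> token id.
    """
    ids = token_ids[:max_length]                   # truncation=True
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # padding='max_length'
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_length=6)
print(short_ids, short_mask)  # → [0, 100, 200, 2, 1, 1] [1, 1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(long_ids, long_mask)    # → [0, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```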

## Define Model Architectures

In [23]:
%%writefile model_arch.py

import torch
import torch.nn as nn
import transformers
from transformers import (
    AutoConfig,
    AutoModel,
)

class LitModel(nn.Module):
    def __init__(self, model_name_or_path="roberta-base"):
        super().__init__()

        self.config = AutoConfig.from_pretrained(model_name_or_path)
        self.config.update({
            "output_hidden_states":True, 
            "hidden_dropout_prob": 0.0,
            "layer_norm_eps": 1e-7
        })                       
        
        self.roberta = AutoModel.from_pretrained(model_name_or_path, config=self.config)  
        
        hidden_size = self.config.hidden_size
        self.attention = nn.Sequential(            
            nn.Linear(hidden_size, 512),            
            nn.Tanh(),                       
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )        

        self.regressor = nn.Sequential(                   
            nn.Linear(hidden_size, 1)                        
        )

        self._init_embed_layers(reinit_layers=4)

    def _init_embed_layers(self, reinit_layers: int = 4):
        """Re-initialise the weights of the last `reinit_layers` encoder layers."""
        if reinit_layers > 0:
            for layer in self.roberta.encoder.layer[-reinit_layers:]:
                for module in layer.modules():
                    if isinstance(module, nn.Linear):
                        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
                        if module.bias is not None:
                            module.bias.data.zero_()
                    elif isinstance(module, nn.Embedding):
                        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
                        if module.padding_idx is not None:
                            module.weight.data[module.padding_idx].zero_()
                    elif isinstance(module, nn.LayerNorm):
                        module.bias.data.zero_()
                        module.weight.data.fill_(1.0)

    def forward(self, input_ids, attention_mask):
        roberta_output = self.roberta(input_ids=input_ids, attention_mask=attention_mask)

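        # hidden_states[-1] is the final encoder layer: (batch, seq_len, hidden_size).
        last_layer_hidden_states = roberta_output.hidden_states[-1]
        # Per-token weights, softmax-normalised over seq_len: (batch, seq_len, 1).
        weights = self.attention(last_layer_hidden_states)
        # Weighted sum over tokens gives one vector per excerpt: (batch, hidden_size).
        context_vector = torch.sum(weights * last_layer_hidden_states, dim=1)
        # Reduce the context vector to the prediction score: (batch, 1).
        return self.regressor(context_vector)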

Overwriting model_arch.py
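
The attention head in `LitModel` scores every token, normalises the scores with a softmax over the sequence dimension, and sums the hidden states with those weights. Here is a minimal pure-Python sketch of that pooling step for a single excerpt. `attention_pool` and the toy values are illustrative assumptions, and in the model the scores come from the `Linear -> Tanh -> Linear` stack rather than being passed in.

```python
import math


def attention_pool(hidden_states, scores):
    """Softmax the per-token scores over the sequence, then take the weighted sum.

    hidden_states: list of token vectors, shape [seq_len][hidden_size]
    scores: one unnormalised score per token, shape [seq_len]
    """
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    hidden_size = len(hidden_states[0])
    context = [
        sum(w * token[d] for w, token in zip(weights, hidden_states))
        for d in range(hidden_size)
    ]
    return weights, context


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states: 3 tokens x 2 dims
weights, context = attention_pool(tokens, scores=[0.0, 0.0, 0.0])
print(weights, context)
```

With equal scores the pooling reduces to a plain mean over tokens. Note that the softmax in `forward` runs over all `MAX_LEN` positions, since `attention_mask` is not applied to the pooling weights there.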


In [24]:
!cp model_arch.py /content/working/

In [25]:
%load_ext autoreload
%autoreload 2

from model_arch import LitModel

In [26]:
def eval_mse(model, data_loader):
    """Evaluates the mean squared error of the |model| on |data_loader|"""
    model.eval()            
    mse_sum = 0

    preds = []
    with torch.no_grad():
        for batch_num, (input_ids, attention_mask, target) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)                        
            target = target.to(DEVICE)           
            
            pred = model(input_ids, attention_mask)
            preds.append(pred.flatten())

            mse_sum += nn.MSELoss(reduction="sum")(pred.flatten(), target).item()

    preds = torch.cat(preds, dim=0)
    mse = mse_sum / len(data_loader.dataset)      
    return mse, preds


def predict(model, data_loader):
    """Returns an np.array with predictions of the |model| on |data_loader|"""
    model.eval()

    result = np.zeros(len(data_loader.dataset))    
    index = 0
    
    with torch.no_grad():
        for batch_num, (input_ids, attention_mask) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
                        
            pred = model(input_ids, attention_mask)                        

            result[index : index + pred.shape[0]] = pred.flatten().to("cpu")
            index += pred.shape[0]

    return result


def train(model, model_path, train_loader, val_loader,
          optimizer, scheduler=None, num_epochs=NUM_EPOCHS):    
    best_pred = None
    best_val_rmse = None
    best_epoch = 0
    step = 0
    last_eval_step = 0
    eval_period = EVAL_SCHEDULE[0][1]    

    start = time.time()

    for epoch in range(num_epochs):                           
        val_rmse = None         

        for batch_num, (input_ids, attention_mask, target) in enumerate(train_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)            
            target = target.to(DEVICE)                        

            optimizer.zero_grad()
            
            model.train()

            pred = model(input_ids, attention_mask)

            mse = nn.MSELoss(reduction="mean")(pred.flatten(), target)
                        
            mse.backward()

            optimizer.step()
            if scheduler:
                scheduler.step()
            
            if step >= last_eval_step + eval_period:
                # Evaluate the model on val_loader.
                elapsed_seconds = time.time() - start
                num_steps = step - last_eval_step
                print(f"\n{num_steps} steps took {elapsed_seconds:0.3} seconds")
                last_eval_step = step
                
                val_mse, val_preds = eval_mse(model, val_loader)
                val_rmse = math.sqrt(val_mse)                            

                print(f"Epoch: {epoch} batch_num: {batch_num}", 
                      f"val_rmse: {val_rmse:0.4}")

                for rmse, period in EVAL_SCHEDULE:
                    if val_rmse >= rmse:
                        eval_period = min(period, eval_period)
                        break                               
                
                if best_val_rmse is None or val_rmse < best_val_rmse:
                    best_pred = val_preds
                    best_val_rmse = val_rmse
                    best_epoch = epoch
                    torch.save(model.state_dict(), model_path)
                    print(f"New best_val_rmse: {best_val_rmse:0.4}")
                else:       
                    print(f"Still best_val_rmse: {best_val_rmse:0.4}",
                          f"(from epoch {best_epoch})")                                    
                    
                start = time.time()
                                            
            step += 1

    return best_val_rmse, best_pred


def create_optimizer(model):
    named_parameters = list(model.named_parameters())    
    roberta_parameters = [(n, p) for n, p in named_parameters if 'roberta' in n]
    not_roberta_parameters = [(n, p) for n, p in named_parameters if 'roberta' not in n]

    not_roberta_group = [p for n, p in not_roberta_parameters]

    parameters = []
    parameters.append({"params": not_roberta_group})

    # The trailing dot stops "layer.1" from also matching "layer.10" and "layer.11".
    group_1 = [f"layer.{i}." for i in range(0, 5)]
    group_2 = [f"layer.{i}." for i in range(5, 9)]
    group_3 = [f"layer.{i}." for i in range(9, 12)]
    for name, params in roberta_parameters:
        weight_decay = 0.0 if "bias" in name else 0.01

        if any([(g in name) for g in group_1]):
            lr = 2e-5
        elif any([(g in name) for g in group_2]):
            lr = 5e-5
        elif any([(g in name) for g in group_3]):
            lr = 1e-4
        else:
            lr = 1e-4

        parameters.append({"params": params, "weight_decay": weight_decay, "lr": lr})
        
    return AdamW(parameters)
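
Layer-wise learning rates are assigned by looking for a layer tag in each parameter name. Substring matching needs care here, because `"layer.1"` is also contained in `"layer.10"` and `"layer.11"`. The sketch below shows one way to make the lookup unambiguous by including the trailing dot. `layer_lr` is a hypothetical helper, and the parameter names are assumed to follow the `roberta.encoder.layer.<i>.` pattern.

```python
def layer_lr(param_name):
    """Pick a learning rate from the encoder layer index found in param_name."""
    groups = [
        (range(0, 5), 2e-5),
        (range(5, 9), 5e-5),
        (range(9, 12), 1e-4),
    ]
    for layers, lr in groups:
        # The trailing dot keeps "layer.1." from also matching "layer.10."
        if any(f"layer.{i}." in param_name for i in layers):
            return lr
    return 1e-4  # embeddings, pooler and anything outside the encoder layers


names = [
    "roberta.encoder.layer.1.attention.self.query.weight",
    "roberta.encoder.layer.7.output.dense.weight",
    "roberta.encoder.layer.10.output.dense.bias",
    "roberta.embeddings.word_embeddings.weight",
]
for name in names:
    print(f"{layer_lr(name):.0e}  {name}")

# The pitfall: without the trailing dot, layer 10 looks like layer 1.
print("layer.1" in names[2], "layer.1." in names[2])  # → True False
```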


## Train

In [27]:
def set_random_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)
    os.environ["PYTHONHASHSEED"] = str(random_seed)

    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)

    torch.backends.cudnn.deterministic = True

In [28]:
data = load_data()

metric_rsa = []
for SEED in SEEDS:
    print("SEED =", SEED)
    oof = np.zeros(data.shape[0])
    list_val_rmse = []
    os.makedirs(f"working/seed{SEED}/models", exist_ok=True)
    for fold in range(NUM_FOLDS):    
        print(f"\nFold {fold + 1}/{NUM_FOLDS}")
        model_path = f"working/seed{SEED}/models/model_{fold}.pth"
            
        set_random_seed(SEED + fold)
        
        train_df = pd.read_pickle(f"/content/working/split/{fold}-fold/train.pkl")
        val_df = pd.read_pickle(f"/content/working/split/{fold}-fold/valid.pkl")

        train_dataset = LitDataset(train_df, model_name_or_path=ROBERTA_PATH)    
        val_dataset = LitDataset(val_df, model_name_or_path=ROBERTA_PATH)    
            
        train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=True, num_workers=2)
        val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, drop_last=False, shuffle=False, num_workers=2)    
            
        set_random_seed(SEED + fold)    
        
        model = LitModel(ROBERTA_PATH).to(DEVICE)
        optimizer = create_optimizer(model)                        
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_training_steps=NUM_EPOCHS * len(train_loader),
            num_warmup_steps=50,
        )    
        
        best_val_rmse, best_pred = train(model, model_path, train_loader, val_loader, optimizer, scheduler=scheduler, num_epochs=NUM_EPOCHS)
        list_val_rmse.append(best_val_rmse)
        oof[val_df.index] = best_pred.detach().cpu().numpy()

        del model
        gc.collect()
        
    target = data["target"].to_numpy().ravel()
    # Take the root explicitly; the `squared` argument is deprecated in recent scikit-learn releases.
    metric = float(np.sqrt(mean_squared_error(target, oof)))
    metric_rsa.append(metric)

    print(f"oof: {metric:.6f}")

    np.save(f"working/seed{SEED}/oof.npy", oof)
    with open(f"working/seed{SEED}/metric-{metric:.6f}", "w") as f:
        f.write("")

with open(f"working/metric-{np.mean(metric_rsa):.6f}", "w") as f:
    f.write("")

SEED = 42

Fold 1/5



KeyboardInterrupt: ignored
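
The training loop steps a cosine schedule with 50 warmup steps once per batch. As a rough sketch of the shape this produces, the function below computes the learning-rate multiplier by hand. It is a reimplementation under the assumption that the default half-cosine setting of `get_cosine_schedule_with_warmup` is used, not the library code itself, and `cosine_warmup_multiplier` is a hypothetical name.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """LR multiplier: linear warmup from 0 to 1, then half-cosine decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps_per_epoch = 2266 // 16        # drop_last=True discards the incomplete final batch
total_steps = 3 * steps_per_epoch   # NUM_EPOCHS * len(train_loader)
print(total_steps)  # → 423

for step in (0, 25, 50, total_steps):
    print(step, round(cosine_warmup_multiplier(step, 50, total_steps), 3))
```

Each parameter group's own base learning rate is scaled by this multiplier, so the layer-wise ratios stay fixed throughout training.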

## Upload Model to Kaggle Dataset

In [None]:
import os
import json
import datetime
from kaggle.api.kaggle_api_extended import KaggleApi

def upload_to_kaggle_dataset(
    user_id: str,
    dataset_title: str,
    upload_dir: str,
    message: str,
    delete_old_versions: bool = True,
):
    dataset_metadata = {}
    dataset_metadata["id"] = f"{user_id}/{dataset_title}"
    dataset_metadata["licenses"] = [{"name": "CC0-1.0"}]
    dataset_metadata["title"] = dataset_title

    with open(os.path.join(upload_dir, "dataset-metadata.json"), "w") as f:
        json.dump(dataset_metadata, f, indent=4)

    api = KaggleApi()
    api.authenticate()

    if dataset_metadata["id"] not in [str(d) for d in api.dataset_list(user=user_id, search=dataset_title)]:
        # If the dataset does not exist yet, create it.
        print("Create new dataset.")
        api.dataset_create_new(
            folder=upload_dir,
            convert_to_csv=False,
            dir_mode="zip",
        )
    else:
        print("Upload to dataset.")
        api.dataset_create_version(
            folder=upload_dir,
            version_notes=message,
            convert_to_csv=False,
            delete_old_versions=delete_old_versions,
            dir_mode="zip",
        )
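
Before calling the Kaggle API, `upload_to_kaggle_dataset` writes a `dataset-metadata.json` file into the upload directory. The sketch below reproduces only that step so the file layout can be checked without credentials. `write_dataset_metadata` is a hypothetical helper, and the user and dataset names are placeholders.

```python
import json
import os
import tempfile


def write_dataset_metadata(upload_dir, user_id, dataset_title):
    """Write dataset-metadata.json into upload_dir and return its path."""
    metadata = {
        "id": f"{user_id}/{dataset_title}",
        "licenses": [{"name": "CC0-1.0"}],
        "title": dataset_title,
    }
    path = os.path.join(upload_dir, "dataset-metadata.json")
    with open(path, "w") as f:
        json.dump(metadata, f, indent=4)
    return path


with tempfile.TemporaryDirectory() as tmp:
    # "example-user" and "example-dataset" are placeholder values.
    path = write_dataset_metadata(tmp, "example-user", "example-dataset")
    with open(path) as f:
        loaded = json.load(f)

print(loaded["id"])  # → example-user/example-dataset
```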

In [None]:
# title = "commonlit-finetuned-roberta-base"
# message = """
#     fork best notebook
# """ + datetime.datetime.now().strftime("%Y-%m-%d %H-%M-%S")

# upload_to_kaggle_dataset(
#     user_id="konumaru",
#     dataset_title=title,
#     upload_dir=f"/content/working/",
#     message=message,
#     delete_old_versions=False,
# )

Create new dataset.
Starting upload for file model_arch.py
Upload successful: model_arch.py (2KB)
Starting upload for file seed7.zip
Upload successful: seed7.zip (2GB)
Starting upload for file seed123.zip
Upload successful: seed123.zip (2GB)
Starting upload for file metric-0.480379
Upload successful: metric-0.480379 (0B)
Starting upload for file seed422.zip
Upload successful: seed422.zip (2GB)
Starting upload for file seed42.zip
Upload successful: seed42.zip (2GB)
Starting upload for file split.zip
Upload successful: split.zip (6MB)
Starting upload for file seed12.zip
Upload successful: seed12.zip (2GB)
