### Pretrain task

**MLM**

#### Read Feedback Prize - Evaluating Student Writing data

#### Log

* 09.01: do not join the text together, use short text to do MLM task, and reduce max_seq from 2048 to 1024, do not use Feedback Prize dataset
* 09.10: do not join the text together, reduce max_len from 1024 to 512

In [1]:
import os
import re
data_files = os.listdir('./input/pretrain/train')
train_text = []
for data_file in data_files:
    with open('./input/pretrain/train/'+ data_file, 'r') as reader:
        # join all text in a file together
        #train_text += [' '.join(re.split(r'\n+', reader.read()))]
        # do not join the split text segments, use shorter texts for pretraining
        train_text +=re.split(r'\n+', reader.read())

In [2]:
len(train_text)

86707

In [3]:
import random
train_text[random.randint(0,len(train_text))]

'A very appropriate setting for the Facial Action Coding System would be online schooling and home schooling because both of those are mainly based on the one student taking that course or class. If the camera on the computer sent a signal to the online schooler or home schooler, it would be easy to make a change in the lesson to try and get the student more engaged. Having a student pay more attention and be more engaged in a lesson could actually help raise test scores because if someone was doing something that was actually fun and they could actually retain the knowledge, it might be effective.'

In [4]:
len(train_text[random.randint(0,len(train_text))])

320

#### Read Feedback Prize - English Language Learning data

In [5]:
import pandas as pd
pretrain_df = pd.read_csv('./input/pretrain/train.csv')

In [6]:
list(filter(bool, 'Line 1\n\nLine 3\rLine 4\r\n'.splitlines()))

['Line 1', 'Line 3', 'Line 4']

In [7]:
train_lines = []

for text in pretrain_df.full_text:
    ## join all text in a file together
    train_lines += list(filter(bool, text.splitlines()));
    
    ## do not join the split text segments, use shorter texts for pretraining
    #train_lines += [' '.join(re.split(r'\n+', text))]

# print(len(train_lines))    
# train_lines += train_text
print(len(train_lines))
train_lines = pd.DataFrame(train_lines, columns = ['train_lines'])

train_lines.to_csv('./input/pretrain/train_MLM_ver3.csv', index = False)

21661


In [8]:
train_lines.iloc[random.randint(0,len(train_lines))].values

array(["But this happen also when the students have a good communication and they are friendly. However sometimes can be difficult even the sudents don't try to study with the group or only due the part of oneself."],
      dtype=object)

In [9]:
import torch
import pandas as pd;
import numpy as np;
import os
from torch.utils.data import Dataset, DataLoader
from torch.utils.checkpoint import checkpoint
from transformers import AutoTokenizer, AutoModelWithLMHead
from transformers import AdamW
from tqdm import tqdm
import os

In [10]:
class CFG:
    seed = 42;
    model_name = 'microsoft/deberta-v3-large'
    epochs = 3;
    batch_size = 8;
    lr = 1e-6;
    weight_decay = 1e-6
    max_len = 512 # use max length as 2048 #01.09, use max_length as 1024
    mask_prob = 0.15;
    n_accumulate = 4
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [11]:
import numpy as np;
import os
def set_seed(seed = CFG.seed):
    np.random.seed(seed);
    torch.manual_seed(seed);
    torch.cuda.manual_seed(seed);
    torch.backends.cudnn.deterministic = True;
    torch.backends.cudnn.benchmark = True;
    os.environ['PYTHONHASHSEED'] = str(seed)
set_seed()

In [12]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model_name);
model = AutoModelWithLMHead.from_pretrained(CFG.model_name);

'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /microsoft/deberta-v3-large/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f017927f310>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 25048fde-ec1d-4be3-99ed-7c9a165fbdc5)')' thrown while requesting HEAD https://huggingface.co/microsoft/deberta-v3-large/resolve/main/tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
'(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /microsoft/deberta-v3-large/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f01721a9850>, 'Connection to hugg

In [13]:
special_tokens = tokenizer.encode_plus('[CLS] [SEP] [MASK] [PAD]',
                                      add_special_tokens = False,
                                      return_tensors='pt')
special_tokens = torch.flatten(special_tokens['input_ids'])
special_tokens

tensor([     1,      2, 128000,      0])

In [14]:
def getMaskedLabels(input_ids):
    rand = torch.rand(input_ids.shape);
    mask_arr = (rand < CFG.mask_prob);
    
    for special_token in special_tokens:
        token = special_token.item();
        mask_arr *= (input_ids != token);
    selection = torch.flatten(mask_arr[0].nonzero()).tolist()
    input_ids[selection] = 128000
    
    return input_ids

In [15]:
class MLMDataset:
    def __init__(self, data, tokenizer):
        self.data = data;
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.data);
    
    def __getitem__(self, idx):
        text = self.data[idx]
        
        tokenized_data = self.tokenizer.encode_plus(
                            text,
                            max_length = CFG.max_len,
                            truncation = True,
                            padding = 'max_length',
                            add_special_tokens = True,
                            return_tensors = 'pt'
                        )
        input_ids = torch.flatten(tokenized_data.input_ids);
        attention_mask = torch.flatten(tokenized_data.input_ids);
        labels = getMaskedLabels(input_ids)
        
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

In [16]:
len(train_lines.train_lines.unique())

21386

In [17]:
train_data = MLMDataset(train_lines.train_lines.unique(), tokenizer)
dataloader = DataLoader(train_data, batch_size = CFG.batch_size, shuffle = True)

In [18]:
len(train_data), len(dataloader)

(21386, 2674)

In [19]:
optimizer = AdamW(model.parameters(), lr = CFG.lr, weight_decay = CFG.weight_decay);



In [20]:
def train_loop(model, device):
    model.train()
    batch_losses = []
    loop = tqdm(dataloader, leave=True)
    for batch_num, batch in enumerate(loop):
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        loss = outputs.loss
        batch_loss = loss / CFG.n_accumulate
        batch_losses.append(batch_loss.item())
    
        loop.set_description(f"Epoch {epoch + 1}")
        loop.set_postfix(loss=batch_loss.item())
        batch_loss.backward()
        
        if batch_num % CFG.n_accumulate == 0 or batch_num == len(dataloader):
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
            optimizer.step()
            model.zero_grad()

    return np.mean(batch_losses)

In [21]:
device = CFG.device
model.to(device)
history = []
best_loss = np.inf
prev_loss = np.inf
model.gradient_checkpointing_enable()
print(f"Gradient Checkpointing: {model.is_gradient_checkpointing}")

for epoch in range(CFG.epochs):
    loss = train_loop(model, device)
    history.append(loss)
    print(f"Loss: {loss}")
    if loss < best_loss:
        print("New Best Loss {:.4f} -> {:.4f}, Saving Model".format(prev_loss, loss))
        # torch.save(model.state_dict(), "./deberta_mlm.pt")
        model.save_pretrained('./input/pretrain/pretrained_model_1009/')
        best_loss = loss
    prev_loss = loss

Gradient Checkpointing: True


Epoch 1: 100%|██████████| 2674/2674 [1:19:23<00:00,  1.78s/it, loss=0.871]


Loss: 1.5585583276327815
New Best Loss inf -> 1.5586, Saving Model


Epoch 2: 100%|██████████| 2674/2674 [1:19:22<00:00,  1.78s/it, loss=0.75] 


Loss: 0.7744891889610661
New Best Loss 1.5586 -> 0.7745, Saving Model


Epoch 3: 100%|██████████| 2674/2674 [1:19:22<00:00,  1.78s/it, loss=0.454]


Loss: 0.5360053671257836
New Best Loss 0.7745 -> 0.5360, Saving Model


In [None]:
Gradient Checkpointing: True
Epoch 1: 100%|██████████| 4848/4848 [9:03:47<00:00,  6.73s/it, loss=0.753]  
Loss: 1.3033109624615007
New Best Loss inf -> 1.3033, Saving Model
Epoch 2: 100%|██████████| 4848/4848 [9:03:44<00:00,  6.73s/it, loss=0.404]  
Loss: 0.6164927809336299
New Best Loss 1.3033 -> 0.6165, Saving Model
Epoch 3: 100%|██████████| 4848/4848 [9:03:37<00:00,  6.73s/it, loss=0.359]  
Loss: 0.46205416228314833

In [None]:
device

### Try to load the model

In [20]:
# from transformers import AutoConfig
# config = AutoConfig.from_pretrained('./input/pretrain/pretrained_model/')

In [22]:
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained('./input/pretrain/pretrained_model/', config = config)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at ./input/pretrain/pretrained_model/ and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
