<a href="https://colab.research.google.com/github/kwanhong66/PyTorchKaggle/blob/master/Toxic_comment_classification_bert_simple.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PyTorch x Kaggle

- kaggle: Toxic comment classification challenge
  - https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

- notebook
  - https://www.kaggle.com/hawkeoni/pytorch-simple-bert

## Dataset with Kaggle API

In [1]:
!pip install -q kaggle

In [2]:
!wget 'https://raw.githubusercontent.com/kwanhong66/KaggleShoveling/master/token/kaggle.json'

--2020-11-25 05:06:52--  https://raw.githubusercontent.com/kwanhong66/KaggleShoveling/master/token/kaggle.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63 [text/plain]
Saving to: ‘kaggle.json’


2020-11-25 05:06:52 (3.28 MB/s) - ‘kaggle.json’ saved [63/63]



In [3]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle

In [4]:
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
!kaggle datasets list

ref                                                          title                                           size  lastUpdated          downloadCount  
-----------------------------------------------------------  ---------------------------------------------  -----  -------------------  -------------  
babyoda/women-entrepreneurship-and-labor-force               Women Entrepreneurship and Labor Force           1KB  2020-11-21 08:38:51            103  
sakshigoyal7/credit-card-customers                           Credit Card customers                          379KB  2020-11-19 07:38:44            179  
imoore/2020-us-general-election-turnout-rates                2020 US General Election Turnout rates           4KB  2020-11-19 17:13:32             72  
szymonjanowski/internet-articles-data-with-users-engagement  Internet news data with readers engagement       3MB  2020-11-21 17:09:57             44  
alexgude/california-traffic-collision-data-from-switrs       California Traffic Collisio

In [6]:
!kaggle competitions download jigsaw-toxic-comment-classification-challenge

Downloading test.csv.zip to /content
 98% 23.0M/23.4M [00:00<00:00, 11.1MB/s]
100% 23.4M/23.4M [00:00<00:00, 32.1MB/s]
Downloading sample_submission.csv.zip to /content
  0% 0.00/1.39M [00:00<?, ?B/s]
100% 1.39M/1.39M [00:00<00:00, 95.3MB/s]
Downloading train.csv.zip to /content
 65% 17.0M/26.3M [00:00<00:00, 20.5MB/s]
100% 26.3M/26.3M [00:00<00:00, 41.5MB/s]
Downloading test_labels.csv.zip to /content
  0% 0.00/1.46M [00:00<?, ?B/s]
100% 1.46M/1.46M [00:00<00:00, 100MB/s]


In [7]:
!mkdir input

In [8]:
!unzip '*.zip' -d ./input/

Archive:  train.csv.zip
  inflating: ./input/train.csv       

Archive:  test.csv.zip
  inflating: ./input/test.csv        

Archive:  test_labels.csv.zip
  inflating: ./input/test_labels.csv  

Archive:  sample_submission.csv.zip
  inflating: ./input/sample_submission.csv  

4 archives were successfully processed.


## Setup

install trasnformers for using bert model

In [9]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
[K     |████████████████████████████████| 1.3MB 9.1MB/s 
[?25hCollecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 27.1MB/s 
[?25hCollecting sentencepiece==0.1.91
[?25l  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
[K     |████████████████████████████████| 1.1MB 33.9MB/s 
Collecting tokenizers==0.9.3
[?25l  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)

In [10]:
import os

from typing import Tuple, List
from functools import partial

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils.data import Dataset, DataLoader, RandomSampler
from torch.nn.utils.rnn import pad_sequence

from transformers import BertTokenizer, BertModel, AdamW, get_linear_schedule_with_warmup, BertPreTrainedModel

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from tqdm import tqdm 

In [16]:
input_file_path = './input/'

train_df = pd.read_csv(os.path.join(input_file_path, 'train.csv'))

print(train_df.shape)
print(list(train_df.columns))

(159571, 8)
['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


In [17]:
train_df.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


In [18]:
train_split_df, val_split_df = train_test_split(train_df, test_size=0.05)

print(train_split_df.shape)
print(val_split_df.shape)

(151592, 8)
(7979, 8)


In [12]:
bert_model_name = 'bert-base-cased' #@param {type:"string"}
device = torch.device('cpu')
if torch.cuda.is_available():
  device = torch.device('cuda:0')
tokenizer = BertTokenizer.from_pretrained(bert_model_name)

assert tokenizer.pad_token_id == 0, "Padding vlaue used in masks is set to zero, please change it everywhere"

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=213450.0, style=ProgressStyle(descripti…




## Preparing dataset

- Using torch `Dataset`, creates custom dataset for input data
- inheriting `torch.utils.data.Dataset`
  - \_\_init\_\_ 
  - \_\_len\_\_ 
  - \_\_getitem\_\_

In [19]:
class ToxicDataset(Dataset):

    def __init__(self, tokenizer: BertTokenizer, dataframe: pd.DataFrame, lazy: bool=False):
      self.tokenizer = tokenizer
      self.pad_idx = tokenizer.pad_token_id
      self.lazy = lazy  # data conversion laziness

      if not self.lazy:
        self.X = []
        self.Y = []
        for i, (row) in tqdm(dataframe.iterrows()):  # convert data into tensor
          x, y = self.row_to_tensor(self.tokenizer, row)
          self.X.append(x)
          self.Y.append(y)
      else:
        self.df = dataframe

    @staticmethod  
    def row_to_tensor(tokenizer: BertTokenizer, row: pd.Series) -> Tuple[torch.LongTensor, torch.LongTensor]:
      tokens = tokenizer.encode(row['comment_text'], add_special_tokens=True)
      if len(tokens) > 120:
        tokens = tokens[:119] + [tokens[-1]]
      x = torch.LongTensor(tokens)
      y = torch.FloatTensor(row[['toxic', 'severe_toxic', 'obscene', 'threat',
                                 'insult', 'identity_hate']])
      return x, y

    def __len__(self):
      if self.lazy:
        return len(self.df)
      else:
        return len(self.X)

    def __getitem__(self, index: int) -> Tuple[torch.LongTensor, torch.LongTensor]:
      if not self.lazy:
        return self.X[index], self.Y[index]
      else:
        return self.row_to_tensor(self.tokenizer, self.df.iloc[index])

* collate_fn
  - mini-batch를 구성하기 위해 데이터를 묶어주는 함수

In [22]:
# merges a list of samples to form a mini-batch
def collate_fn(batch: List[Tuple[torch.LongTensor, torch.LongTensor]], device: torch.device) \
      -> Tuple[torch.LongTensor, torch.LongTensor]:
      x, y = list(zip(*batch))
      x = pad_sequence(x, batch_first=True, padding_value=0)  # if batch_fist, B x T x *
      y = torch.stack(y)
      return x.to(device), y.to(device)

* Sampler는 index를 컨트롤하는 방법
* dataset에서 data loading시에 indice/keys의 순서를 지정하는 방법

In [23]:
train_dataset = ToxicDataset(tokenizer, train_split_df, lazy=True)
dev_dataset = ToxicDataset(tokenizer, val_split_df, lazy=True)
collate_fn = partial(collate_fn, device=device)

BATCH_SIZE = 32
train_sampler = RandomSampler(train_dataset)
dev_sampler = RandomSampler(dev_dataset)

# Dataset, Sampler, collate_fn -> DataLoader
train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, 
                            sampler=train_sampler, collate_fn=collate_fn)
dev_iterator = DataLoader(dev_dataset, batch_size=BATCH_SIZE,
                          sampler=dev_sampler, collate_fn=collate_fn)

## Simple BERT model

In [35]:
class BertClassifier(nn.Module):

  def __init__(self, bert: BertModel, num_classes: int):
    super().__init__()
    self.bert = bert
    self.classifier = nn.Linear(bert.config.hidden_size, num_classes)  # in_features, out_features

  def forward(self, input_ids, attention_mask=None, token_type_ids=None, 
              position_ids=None, head_mask=None, labels=None):
    outputs = self.bert(input_ids,
                        attention_mask=attention_mask,
                        position_ids=position_ids,
                        head_mask=head_mask)
    cls_output = outputs[1]  # batch, hidden
    cls_output = self.classifier(cls_output)  # batch, 6(classes)
    cls_output = torch.sigmoid(cls_output)  # sigmoid from logit to probability
    criterion = nn.BCELoss()  # loss function
    loss = 0
    if labels is not None:
      loss = criterion(cls_output, labels)
    return loss, cls_output

In [36]:
model = BertClassifier(BertModel.from_pretrained(bert_model_name), 6).to(device)

* Training and evaluation loops

* Training
  - set train mode: model.train()
  - dataset iterator loop
  - model forwarding with data
  - backpropagation: loss.backward()
  - perform optimization step: optimizer.step()

In [40]:
def train(model, iterator, optimizer, scheduler):
  model.train()  # set train mode
  total_loss = 0 
  for x, y in tqdm(iterator):
    optimizer.zero_grad()  # set gradients to zero
    mask = (x != 0).float()
    loss, outputs = model(x, attention_mask=mask, labels=y)
    total_loss += loss.item()
    loss.backward()
    optimizer.step()
    scheduler.step()
  print(f"Train loss {total_loss / len(iterator)}")

def evaluate(model, iterator):
  model.eval()  # set eval mode
  pred = []
  true = []
  with torch.no_grad():
    total_loss = 0
    for x, y in tqdm(iterator):
      mask = (x != 0).float()
      loss, outputs = model(x, attention_mask=mask, labels=y)
      total_loss += loss
      true += y.cpu().numpy().tolist()
      pred += outputs.cpu().numpy().tolist()
  true = np.array(true)
  pred = np.array(pred)
  for i, name in enumerate(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
                            'identity_hate']):
    print(f"{name} roc_auc {roc_auc_score(true[:, i], pred[:, i])}")
  print(f"Evaluate loss {total_loss / len(iterator)}")

In [33]:
no_decay = ['bias', 'LayerNorm.weight']  # no deacy parameters
optimizer_grouped_parameters = [
  {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
   'weight_decay': 0.01},
  {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
   'weight_decay': 0.0}               
]

EPOCH_NUM = 2

# https://paperswithcode.com/method/slanted-triangular-learning-rates
# triangular learning rate; linearly grows until half of first epoch, then linearly decays
warmup_steps = 10 ** 3
total_steps = len(train_iterator) * EPOCH_NUM - warmup_steps
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5, eps=1e-8)
# https://huggingface.co/transformers/main_classes/optimizer_schedules.html#transformers.get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

In [41]:
for i in range(EPOCH_NUM):
  print('=' * 50, f"EPOCH {i}", '=' * 50)
  train(model, train_iterator, optimizer, scheduler)
  evaluate(model, dev_iterator)

  0%|          | 0/4738 [00:00<?, ?it/s]



100%|██████████| 4738/4738 [33:50<00:00,  2.33it/s]
  0%|          | 1/250 [00:00<00:46,  5.33it/s]

Train loss 0.7879722559940165


100%|██████████| 250/250 [00:49<00:00,  5.04it/s]
  0%|          | 0/4738 [00:00<?, ?it/s]

toxic roc_auc 0.4645959231425188
severe_toxic roc_auc 0.2811907170861262
obscene roc_auc 0.5619153504552318
threat roc_auc 0.6512165057025574
insult roc_auc 0.4484191559688391
identity_hate roc_auc 0.62461988590918
Evaluate loss 0.7879172563552856


100%|██████████| 4738/4738 [33:25<00:00,  2.36it/s]
  0%|          | 0/250 [00:00<?, ?it/s]

Train loss 0.7879551610975841


100%|██████████| 250/250 [00:48<00:00,  5.20it/s]

toxic roc_auc 0.4645959231425188
severe_toxic roc_auc 0.2811907170861262
obscene roc_auc 0.5619153504552318
threat roc_auc 0.6512165057025574
insult roc_auc 0.4484191559688391
identity_hate roc_auc 0.6246198859091802
Evaluate loss 0.7879485487937927





In [42]:
model.eval()

test_df = pd.read_csv(os.path.join(input_file_path, 'test.csv'))
submission = pd.read_csv(os.path.join(input_file_path, 'sample_submission.csv'))
columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for i in tqdm(range(len(test_df) // BATCH_SIZE + 1)):
  batch_df = test_df.iloc[i * BATCH_SIZE: (i + 1) * BATCH_SIZE]
  assert (batch_df["id"] == submission["id"][i * BATCH_SIZE: (i + 1) * BATCH_SIZE]).all(), f"Id mismatch"

  texts = []
  for text in batch_df['comment_text'].tolist():
    text = tokenizer.encode(text, add_special_tokens=True)
    if len(text) > 120:
      text = text[:119] + [tokenizer.sep_token_id]
    texts.append(torch.LongTensor(text))
  
  x = pad_sequence(texts, batch_first=True, padding_value=tokenizer.pad_token_id).to(device)
  mask = (x != tokenizer.pad_token_id).float().to(device)
  
  with torch.no_grad():
    _, outputs = model(x, attention_mask=mask)
  outputs = outputs.cpu().numpy()
  submission.iloc[i * BATCH_SIZE: (i + 1) * BATCH_SIZE][columns] = outputs

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
100%|██████████| 4787/4787 [13:06<00:00,  6.09it/s]
