## Genre prediction of NHK shows using BERT

I have done clustering of NHK shows using TF-IDF and K-Means. Now I am going to predict show's genre using BERT! <br>
Note: This code was taken from the article "Multi Class Text Classification With Deep Learning Using BERT" by Susan Li at <br>
https://towardsdatascience.com/multi-class-text-classification-with-deep-learning-using-bert-b59ca2f5c613

In [1]:
!pip install transformers
!pip install fugashi
!pip install ipadic

You should consider upgrading via the '/Users/mihohunter/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/mihohunter/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m
You should consider upgrading via the '/Users/mihohunter/opt/anaconda3/bin/python -m pip install --upgrade pip' command.[0m


In [2]:
import pandas as pd
import random
import numpy as np

import torch
from tqdm import tqdm
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data import TensorDataset

import transformers
from transformers import BertJapaneseTokenizer
from transformers import BertForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/mhunter13/NLP_final_project/main/nhk_0626_0703_0710.csv")

In [4]:
df['genre1'] = df['genres'].str[2:4] # new "genre1" column with just one genre code
label_dict = {'00':'ニュース/報道','01':'スポーツ','02':'情報/ワイドショー','03':'ドラマ','04':'音楽','05':'バラエティ','06':'映画','07':'アニメ/特撮','08':'ドキュメンタリー/教養','09':'劇場・公演','10':'趣味/教育','11':'福祉','15':'その他'} 
# create label dict
df['genre_name'] = df.genre1.replace(label_dict) # creating label column with category names

In [5]:
df.genre1.unique()

array(['08', '11', '04', '00', '03', '02', '05', '01', '10', '09', '07',
       '15', '06', ''], dtype=object)

In [6]:
df.drop(df[df['genre1']==''].index, inplace=True)

In [7]:
df = df.drop(df.iloc[:, 9:-3],axis=1)
df['all_content'] = df['title'] + df['subtitle'].fillna('')
df.head(2)

Unnamed: 0,id,event_id,start_time,end_time,title,subtitle,content,act,genres,service.logo_l.height,genre1,genre_name,all_content
0,2022062633674,33674,2022-06-26T04:13:00+09:00,2022-06-26T04:15:00+09:00,インターミッション,,,,['0815'],200,8,ドキュメンタリー/教養,インターミッション
1,2022062633676,33676,2022-06-26T04:15:00+09:00,2022-06-26T04:20:00+09:00,５分でみんなの手話「祖父と祖母と両親ときょうだいが４人なんだ」,２０２１年度に放送したＥテレ「みんなの手話」から１つのキーフレーズをピックアップ。「祖父と祖...,２０２１年度に放送したＥテレ「みんなの手話」から１つのキーフレーズをピックアップ。「祖父と祖...,【出演】三宅健，森田明，那須善子，那須映里，寺澤英弥，【声】黒柳徹子,['1104'],200,11,福祉,５分でみんなの手話「祖父と祖母と両親ときょうだいが４人なんだ」２０２１年度に放送したＥテレ「...


In [8]:
# input_key = "all_content"
# label_key = "genre_name"

In [9]:
## remove duplicates
print(df.shape)
df = df.drop_duplicates("event_id").reset_index(drop=True)
print(df.shape)
df = df[['all_content', 'genre_name']]

(9592, 13)
(7454, 13)


In [10]:
df['genre_name'].value_counts()

ニュース/報道        2236
趣味/教育          1931
ドキュメンタリー/教養    1007
音楽              704
情報/ワイドショー       442
スポーツ            296
ドラマ             245
バラエティ           227
アニメ/特撮          153
その他             117
福祉               58
劇場・公演            24
映画               14
Name: genre_name, dtype: int64

In [11]:
possible_labels = df['genre_name'].unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [12]:
df['label'] = df['genre_name'].replace(label_dict)

In [13]:
df.head()

Unnamed: 0,all_content,genre_name,label
0,インターミッション,ドキュメンタリー/教養,0
1,５分でみんなの手話「祖父と祖母と両親ときょうだいが４人なんだ」２０２１年度に放送したＥテレ「...,福祉,1
2,名曲アルバム「ソルヴェイグの歌」グリーグ作曲「ソルヴェイグの歌」（ソプラノ）天羽明惠，（ピア...,音楽,2
3,みんなのうた「くじらのあくび」「くじらのあくび」うた：ザ・ジェイド,音楽,2
4,イッピン・選「色彩ゆたかに　あざやかに」「西の西陣・東の桐生」と並び称された織物の産地、群馬...,ドキュメンタリー/教養,0


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.2, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

In [16]:
df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [18]:
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese')

In [19]:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train']['all_content'].values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=128, 
    return_tensors='pt',
    truncation=True
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val']['all_content'].values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=128, 
    return_tensors='pt',
    truncation=True
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)




In [20]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [21]:
len(label_dict)

13

In [22]:
model = BertForSequenceClassification.from_pretrained("cl-tohoku/bert-base-japanese-v2",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)


Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-v2 were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification wer

In [23]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)


In [24]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)
                  



In [25]:
epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)

In [26]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [27]:
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [28]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cpu


In [29]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [30]:
# training

for epoch in tqdm(range(1, epochs+1)):

    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
       
        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))}, refresh=False)
         
        
    torch.save(model.state_dict(), f'data_volume/finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')
    tqdm.write("------")

  0%|          | 0/5 [00:00<?, ?it/s]
Epoch 1:   0%|          | 0/187 [00:00<?, ?it/s][A
Epoch 1:   1%|          | 1/187 [00:55<2:50:44, 55.08s/it, training_loss=0.831][A
Epoch 1:   1%|          | 2/187 [01:50<2:50:30, 55.30s/it, training_loss=0.802][A
Epoch 1:   2%|▏         | 3/187 [02:46<2:49:53, 55.40s/it, training_loss=0.777][A
Epoch 1:   2%|▏         | 4/187 [03:41<2:49:06, 55.44s/it, training_loss=0.753][A
Epoch 1:   3%|▎         | 5/187 [04:36<2:48:09, 55.44s/it, training_loss=0.730][A
Epoch 1:   3%|▎         | 6/187 [05:32<2:47:10, 55.42s/it, training_loss=0.725][A
Epoch 1:   4%|▎         | 7/187 [06:27<2:46:16, 55.43s/it, training_loss=0.690][A
Epoch 1:   4%|▍         | 8/187 [07:23<2:45:26, 55.46s/it, training_loss=0.679][A
Epoch 1:   5%|▍         | 9/187 [08:18<2:44:36, 55.48s/it, training_loss=0.680][A
Epoch 1:   5%|▌         | 10/187 [09:14<2:43:39, 55.48s/it, training_loss=0.627][A
Epoch 1:   6%|▌         | 11/187 [10:09<2:42:50, 55.51s/it, training_loss=0.686


Epoch 1
Training loss: 1.372821256119937


 20%|██        | 1/5 [3:32:08<14:08:33, 12728.34s/it]

Validation loss: 0.8695322579525887
F1 Score (Weighted): 0.7100091957230805
------



Epoch 2:   0%|          | 0/187 [00:00<?, ?it/s][A
Epoch 2:   1%|          | 1/187 [00:58<3:00:12, 58.13s/it, training_loss=0.198][A
Epoch 2:   1%|          | 2/187 [01:54<2:56:55, 57.38s/it, training_loss=0.203][A
Epoch 2:   2%|▏         | 3/187 [02:51<2:54:59, 57.06s/it, training_loss=0.334][A
Epoch 2:   2%|▏         | 4/187 [03:48<2:53:47, 56.98s/it, training_loss=0.376][A
Epoch 2:   3%|▎         | 5/187 [04:45<2:52:21, 56.82s/it, training_loss=0.244][A
Epoch 2:   3%|▎         | 6/187 [05:41<2:51:13, 56.76s/it, training_loss=0.231][A
Epoch 2:   4%|▎         | 7/187 [06:38<2:50:43, 56.91s/it, training_loss=0.289][A
Epoch 2:   4%|▍         | 8/187 [07:37<2:51:08, 57.37s/it, training_loss=0.356][A
Epoch 2:   5%|▍         | 9/187 [08:34<2:49:59, 57.30s/it, training_loss=0.217][A
Epoch 2:   5%|▌         | 10/187 [09:32<2:49:33, 57.48s/it, training_loss=0.241][A
Epoch 2:   6%|▌         | 11/187 [10:29<2:48:22, 57.40s/it, training_loss=0.334][A
Epoch 2:   6%|▋         | 12/187


Epoch 2
Training loss: 0.7701135443493644


 40%|████      | 2/5 [6:39:49<9:53:16, 11865.53s/it] 

Validation loss: 0.6126995635159472
F1 Score (Weighted): 0.8284324453603168
------



Epoch 3:   0%|          | 0/187 [00:00<?, ?it/s][A
Epoch 3:   1%|          | 1/187 [00:56<2:54:09, 56.18s/it, training_loss=0.219][A
Epoch 3:   1%|          | 2/187 [01:51<2:52:10, 55.84s/it, training_loss=0.180][A
Epoch 3:   2%|▏         | 3/187 [02:47<2:50:24, 55.57s/it, training_loss=0.223][A
Epoch 3:   2%|▏         | 4/187 [03:42<2:49:28, 55.57s/it, training_loss=0.151][A
Epoch 3:   3%|▎         | 5/187 [04:38<2:48:28, 55.54s/it, training_loss=0.220][A
Epoch 3:   3%|▎         | 6/187 [05:33<2:47:21, 55.48s/it, training_loss=0.226][A
Epoch 3:   4%|▎         | 7/187 [06:28<2:46:26, 55.48s/it, training_loss=0.132][A
Epoch 3:   4%|▍         | 8/187 [07:24<2:45:20, 55.42s/it, training_loss=0.143][A
Epoch 3:   5%|▍         | 9/187 [08:19<2:44:14, 55.36s/it, training_loss=0.375][A
Epoch 3:   5%|▌         | 10/187 [09:14<2:43:17, 55.35s/it, training_loss=0.262][A
Epoch 3:   6%|▌         | 11/187 [10:10<2:42:45, 55.48s/it, training_loss=0.304][A
Epoch 3:   6%|▋         | 12/187


Epoch 3
Training loss: 0.4899129242023682


 60%|██████    | 3/5 [9:45:09<6:24:09, 11524.96s/it]

Validation loss: 0.42634263809056994
F1 Score (Weighted): 0.8876997228325014
------



Epoch 4:   0%|          | 0/187 [00:00<?, ?it/s][A
Epoch 4:   1%|          | 1/187 [00:56<2:53:54, 56.10s/it, training_loss=0.100][A
Epoch 4:   1%|          | 2/187 [01:51<2:51:47, 55.72s/it, training_loss=0.109][A
Epoch 4:   2%|▏         | 3/187 [02:47<2:50:38, 55.64s/it, training_loss=0.144][A
Epoch 4:   2%|▏         | 4/187 [03:42<2:49:10, 55.47s/it, training_loss=0.246][A
Epoch 4:   3%|▎         | 5/187 [04:37<2:48:00, 55.39s/it, training_loss=0.080][A
Epoch 4:   3%|▎         | 6/187 [05:32<2:47:06, 55.39s/it, training_loss=0.221][A
Epoch 4:   4%|▎         | 7/187 [06:28<2:46:06, 55.37s/it, training_loss=0.146][A
Epoch 4:   4%|▍         | 8/187 [07:23<2:45:02, 55.32s/it, training_loss=0.090][A
Epoch 4:   5%|▍         | 9/187 [08:18<2:44:08, 55.33s/it, training_loss=0.216][A
Epoch 4:   5%|▌         | 10/187 [09:14<2:43:20, 55.37s/it, training_loss=0.170][A
Epoch 4:   6%|▌         | 11/187 [10:09<2:42:30, 55.40s/it, training_loss=0.086][A
Epoch 4:   6%|▋         | 12/187


Epoch 4
Training loss: 0.34368940075251825


 80%|████████  | 4/5 [12:50:26<3:09:23, 11363.80s/it]

Validation loss: 0.34002479982185874
F1 Score (Weighted): 0.9174618364116568
------



Epoch 5:   0%|          | 0/187 [00:00<?, ?it/s][A
Epoch 5:   1%|          | 1/187 [00:56<2:53:35, 56.00s/it, training_loss=0.137][A
Epoch 5:   1%|          | 2/187 [01:51<2:51:52, 55.74s/it, training_loss=0.118][A
Epoch 5:   2%|▏         | 3/187 [02:47<2:50:40, 55.65s/it, training_loss=0.064][A
Epoch 5:   2%|▏         | 4/187 [03:42<2:49:25, 55.55s/it, training_loss=0.076][A
Epoch 5:   3%|▎         | 5/187 [04:37<2:48:10, 55.44s/it, training_loss=0.132][A
Epoch 5:   3%|▎         | 6/187 [05:33<2:47:34, 55.55s/it, training_loss=0.134][A
Epoch 5:   4%|▎         | 7/187 [22:06<18:05:50, 361.95s/it, training_loss=0.149][A
Epoch 5:   4%|▍         | 8/187 [38:33<27:53:26, 560.93s/it, training_loss=0.090][A
Epoch 5:   5%|▍         | 9/187 [56:15<35:28:54, 717.61s/it, training_loss=0.153][A
Epoch 5:   5%|▌         | 10/187 [1:02:42<30:16:03, 615.61s/it, training_loss=0.062][A
Epoch 5:   6%|▌         | 11/187 [1:03:37<21:42:31, 444.04s/it, training_loss=0.076][A
Epoch 5:   6%|▋   


Epoch 5
Training loss: 0.27252744488856373


100%|██████████| 5/5 [21:21:41<00:00, 15380.36s/it]  

Validation loss: 0.32584634137914537
F1 Score (Weighted): 0.9195400462068966
------





In [31]:
model = BertForSequenceClassification.from_pretrained("cl-tohoku/bert-base-japanese-v2",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)



Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-v2 were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification wer

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32768, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [37]:
model.load_state_dict(torch.load('data_volume/finetuned_BERT_epoch_5.model', map_location=torch.device('cpu')))

<All keys matched successfully>

In [38]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [39]:
accuracy_per_class(predictions, true_vals)

Class: ドキュメンタリー/教養
Accuracy: 175/202

Class: 福祉
Accuracy: 0/12

Class: 音楽
Accuracy: 131/141

Class: ニュース/報道
Accuracy: 446/447

Class: ドラマ
Accuracy: 45/49

Class: 情報/ワイドショー
Accuracy: 85/88

Class: バラエティ
Accuracy: 29/45

Class: スポーツ
Accuracy: 47/59

Class: 趣味/教育
Accuracy: 375/386

Class: 劇場・公演
Accuracy: 0/5

Class: アニメ/特撮
Accuracy: 25/31

Class: その他
Accuracy: 23/23

Class: 映画
Accuracy: 0/3

