1. Multi-Class: multi-class classification (binary, three-class, many-class, etc.)

```
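Binary: decide whether an email is spam or not spam
Binary: decide whether a news article was written by a machine or by a human
Three-class: decide which of {positive, neutral, negative} a text's sentiment falls into
Multi-class: decide which category a news article belongs to, e.g. finance, sports, entertainment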
```

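2022 Sohu Campus Sentiment Analysis × Recommendation Ranking Algorithm Competition
https://www.biendata.xyz/competition/sohu_2022/data/
(2 = strongly positive, 1 = positive, 0 = neutral, -1 = negative, -2 = strongly negative)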


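2. Multi-Label: multi-label classification

- A text may touch on any of religion, politics, finance or education at the same time, or on none of them.

- A movie can be assigned genres such as action, comedy and romance based on its synopsis, and it may belong to several at once, e.g. a romcom (romance and comedy).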
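
The two settings differ mainly in how raw scores become predictions. The following is a minimal, dependency-free sketch (the function names and toy scores are illustrative assumptions, not part of the notebook): multi-class picks exactly one class from a softmax, while multi-label thresholds an independent sigmoid per label and can return several labels or none.

```python
import math

def softmax(logits):
    """Turn scores into probabilities that sum to 1 (classes compete)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_multiclass(logits, classes):
    """Exactly one class: the one with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best]

def predict_multilabel(logits, labels, threshold=0.5):
    """Zero or more labels: every label whose sigmoid exceeds the threshold."""
    return [lab for lab, z in zip(labels, logits) if sigmoid(z) > threshold]

# Toy scores, made up for illustration
sentiment = predict_multiclass([2.0, 0.1, -1.0], ["positive", "neutral", "negative"])
genres = predict_multilabel([1.5, 0.8, -2.0], ["comedy", "romance", "action"])
no_genre = predict_multilabel([-3.0, -1.0, -2.0], ["comedy", "romance", "action"])
print(sentiment, genres, no_genre)
```

The multi-label function is the same decision rule the notebook applies later with `torch.sigmoid` and a 0.50 threshold.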

### Multi-label classification of toxic comments (Wikipedia talk pages)

Toxic Comment Classification Challenge 

Identify and classify toxic online comments

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge


## Import packages

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import torch
from torch.nn import BCEWithLogitsLoss, BCELoss  # losses for multi-label classification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix, f1_score, accuracy_score
import pickle
from transformers import AutoModel,AutoConfig,AutoTokenizer
from tqdm import tqdm, trange
from ast import literal_eval

### Check whether a GPU is available

In [3]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'NVIDIA GeForce RTX 3090'

In [5]:
device

device(type='cuda')

## 加载数据与预处理

In [None]:
# df = pd.read_csv('data/03/train.csv') #jigsaw-toxic-comment-classification-challenge
# df.head()

In [6]:
df = pd.read_csv('/home/mw/input/task031964/train.csv') #jigsaw-toxic-comment-classification-challenge
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [7]:
df['toxic'].value_counts()

0    144277
1     15294
Name: toxic, dtype: int64

In [8]:
print('Unique comments: ', df.comment_text.nunique() == df.shape[0]) 
print('Null values: ', df.isnull().values.any())
# df[df.isna().any(axis=1)]

Unique comments:  True
Null values:  False


In [None]:
# Every comment in the training set is unique
# The training set contains no null values

In [9]:
df.shape[0]-df.count()  # number of missing values per column (all zero)

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

In [11]:
# Rough statistics on comment length in words: mean and standard deviation
print('average sentence length: ', df.comment_text.str.split().str.len().mean())
print('stdev sentence length: ', df.comment_text.str.split().str.len().std())

average sentence length:  67.27352714465661
stdev sentence length:  99.2307021928862


In [10]:
df.comment_text.str.split().str.len().describe()

count    159571.000000
mean         67.273527
std          99.230702
min           1.000000
25%          17.000000
50%          36.000000
75%          75.000000
max        1411.000000
Name: comment_text, dtype: float64

In [20]:
sum(df.comment_text.str.split().str.len()>200) # number of comments with more than 200 words

10087
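
To judge how much text a length cap would cut off, the share of comments above a word limit is more telling than the raw count. A small sketch, assuming a hypothetical helper `share_longer_than` and made-up toy comments; note that whitespace word counts tend to understate WordPiece token counts, so the truncated share at a given `max_length` in tokens is likely higher.

```python
def share_longer_than(texts, limit):
    """Fraction of texts with more than `limit` whitespace-separated words."""
    lengths = [len(text.split()) for text in texts]
    return sum(n > limit for n in lengths) / len(lengths)

# Toy data: one of the three comments has 250 words
toy_comments = ["short comment", "a " * 250, "medium length comment here"]
toy_share = share_longer_than(toy_comments, 200)

# Same ratio from the counts reported above: 10087 of 159571 comments
reported_share = 10087 / 159571
print(f"{reported_share:.1%} of comments exceed 200 words")
```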

In [19]:
df.shape

(159571, 9)

In [13]:
cols = df.columns # column names of the dataset
label_cols = list(cols[2:]) # names of the label columns
num_labels = len(label_cols) # number of labels
print('Label columns: ', label_cols)

Label columns:  ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


In [14]:
print('Count of 1 per label: \n', df[label_cols].sum(), '\n') # number of positive (1) rows per label
print('Count of 0 per label: \n', df[label_cols].eq(0).sum()) # number of negative (0) rows per label

Count of 1 per label: 
 toxic            15294
severe_toxic      1595
obscene           8449
threat             478
insult            7877
identity_hate     1405
dtype: int64 

Count of 0 per label: 
 toxic            144277
severe_toxic     157976
obscene          151122
threat           159093
insult           151694
identity_hate    158166
dtype: int64
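
The counts above show a strong imbalance: every label is positive in well under 10% of rows. The sketch below turns the reported counts into positive rates and negative-to-positive ratios. One common option, which this notebook does not use, is to pass such ratios as the `pos_weight` argument of `BCEWithLogitsLoss`; the helper name here is a hypothetical one chosen for illustration.

```python
# Positive counts per label, copied from the output above
label_counts = {
    "toxic": 15294,
    "severe_toxic": 1595,
    "obscene": 8449,
    "threat": 478,
    "insult": 7877,
    "identity_hate": 1405,
}
n_rows = 159571  # number of training rows reported by describe()

def positive_rate_and_ratio(positives, total):
    """Return (share of positive rows, negatives per positive)."""
    negatives = total - positives
    return positives / total, negatives / positives

stats = {name: positive_rate_and_ratio(count, n_rows)
         for name, count in label_counts.items()}

for name, (rate, ratio) in stats.items():
    print(f"{name:14s} positive rate {rate:.4f}   neg/pos ratio {ratio:.1f}")
```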


In [15]:
df = df.sample(frac=1).reset_index(drop=True) #shuffle rows

In [16]:
df['one_hot_labels'] = list(df[label_cols].values) # pack the six label columns into one multi-hot vector per row
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,one_hot_labels
0,0c88373b4a0aac92,"""\n\nThe removed paragraph \nSlovakization is ...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
1,cd564a2df8bdd285,Mini-Contra Battle section \n\nWould it better...,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
2,012b305351d49596,""", 20 January 2013 (UTC)\nThe argument at WP:N...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
3,29b87947f2deedbc,I am unable to thank you for your review using...,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
4,270deaeac50bbdfc,"""/hollyoaks/scoop/a327249/hollyoaks-pj-brennan...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"


In [17]:
labels = list(df.one_hot_labels.values)
comments = list(df.comment_text.values)

Use the tokenizer that matches each model

```
BERT:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) 

XLNet:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=False) 

RoBERTa:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=False)
```


In [22]:
max_length = 150
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)  # lowercase all text
# Background on the three main subword models in NLP (BPE, WordPiece, ULM):
# https://zhuanlan.zhihu.com/p/191648421
encodings = tokenizer.batch_encode_plus(
    comments,  # list of raw comment strings
    max_length=max_length,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True        # cut comments longer than max_length
)  # tokenizer's encoding method
print('tokenizer outputs: ', encodings.keys())  # fields returned by the tokenizer

tokenizer outputs:  dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [23]:
input_ids = encodings['input_ids'] # tokenized and encoded sentences
token_type_ids = encodings['token_type_ids'] # token type ids
attention_masks = encodings['attention_mask'] # attention masks
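
The three fields can be understood from a simplified, pure-Python sketch of what fixed-length encoding does for a single sentence. This is an approximation under stated assumptions, not the tokenizer's actual code: `encode_fixed_length` is a hypothetical helper, and the special ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (padding) are those of `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special token ids

def encode_fixed_length(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad with zeros."""
    body = token_ids[: max_length - 2]           # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * max_length,       # single-sentence input: all zeros
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# 1996, 3718, 20423 are the ids of 'the', 'removed', 'paragraph' in the output shown later
short = encode_fixed_length([1996, 3718, 20423], max_length=8)
long_ = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short["input_ids"], short["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed to the model alongside `input_ids` in the training loop.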

In [25]:
input_ids[0]

[101,
 1000,
 1996,
 3718,
 20423,
 12747,
 3989,
 2003,
 2411,
 11581,
 3550,
 2004,
 1037,
 3433,
 2000,
 3140,
 23848,
 13380,
 3989,
 1010,
 2029,
 3047,
 3701,
 2044,
 1996,
 17151,
 9354,
 7033,
 1006,
 12014,
 1997,
 7517,
 1007,
 1012,
 2096,
 2053,
 2028,
 2085,
 10592,
 23848,
 13380,
 3989,
 1037,
 3893,
 9874,
 1010,
 2005,
 2195,
 4436,
 1996,
 11581,
 3989,
 1997,
 12747,
 3989,
 2004,
 1037,
 3433,
 2000,
 2009,
 2003,
 21068,
 1024,
 23848,
 13380,
 3989,
 3047,
 2076,
 1996,
 2335,
 1997,
 1996,
 12078,
 1010,
 1998,
 2009,
 3047,
 1999,
 1037,
 2051,
 2558,
 2043,
 2529,
 2916,
 2020,
 2025,
 2641,
 3053,
 2004,
 5415,
 2004,
 2085,
 1012,
 12747,
 3989,
 2003,
 6230,
 2144,
 2059,
 1998,
 2009,
 2001,
 2004,
 5729,
 2104,
 4750,
 18944,
 2004,
 2009,
 2003,
 2085,
 1999,
 3537,
 1045,
 2031,
 3718,
 2023,
 20423,
 1010,
 2025,
 2138,
 1045,
 2052,
 9352,
 21090,
 1010,
 2021,
 2138,
 2023,
 2003,
 1037,
 23250,
 20219,
 1006,
 1037,
 1000,
 1000,
 13433,
 2615,
 1000

In [26]:
tokenizer.convert_ids_to_tokens(input_ids[0])

['[CLS]',
 '"',
 'the',
 'removed',
 'paragraph',
 'slovak',
 '##ization',
 'is',
 'often',
 'rational',
 '##ized',
 'as',
 'a',
 'response',
 'to',
 'forced',
 'mag',
 '##yar',
 '##ization',
 ',',
 'which',
 'happened',
 'mainly',
 'after',
 'the',
 'aus',
 '##gle',
 '##ich',
 '(',
 'compromise',
 'of',
 '1867',
 ')',
 '.',
 'while',
 'no',
 'one',
 'now',
 'considers',
 'mag',
 '##yar',
 '##ization',
 'a',
 'positive',
 'trend',
 ',',
 'for',
 'several',
 'reasons',
 'the',
 'rational',
 '##ization',
 'of',
 'slovak',
 '##ization',
 'as',
 'a',
 'response',
 'to',
 'it',
 'is',
 'questionable',
 ':',
 'mag',
 '##yar',
 '##ization',
 'happened',
 'during',
 'the',
 'times',
 'of',
 'the',
 'monarchy',
 ',',
 'and',
 'it',
 'happened',
 'in',
 'a',
 'time',
 'period',
 'when',
 'human',
 'rights',
 'were',
 'not',
 'considered',
 'nearly',
 'as',
 'universal',
 'as',
 'now',
 '.',
 'slovak',
 '##ization',
 'is',
 'happening',
 'since',
 'then',
 'and',
 'it',
 'was',
 'as',
 'severe',
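
In the token list above, a `##` prefix marks a WordPiece that continues the previous token, so `slovak` + `##ization` is one word. A minimal sketch of merging such pieces back into words, assuming a hypothetical helper `merge_wordpieces` (the library's own `tokenizer.decode` does more, such as cleaning up spacing):

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding token and drop special tokens."""
    words = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # continuation piece: append without the '##'
        else:
            words.append(tok)      # start of a new word or punctuation mark
    return words

# First tokens of the output shown above
sample_tokens = ['[CLS]', '"', 'the', 'removed', 'paragraph', 'slovak',
                 '##ization', 'is', 'often', 'rational', '##ized']
words = merge_wordpieces(sample_tokens)
print(words)
```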


In [27]:
comments[0]

'"\n\nThe removed paragraph \nSlovakization is often rationalized as a response to forced magyarization, which happened mainly after the Ausgleich (Compromise of 1867). While no one now considers magyarization a positive trend, for several reasons the rationalization of slovakization as a response to it is questionable: magyarization happened during the times of the monarchy, and it happened in a time period when human rights were not considered nearly as universal as now. Slovakization is happening since then and it was as severe under communist dictatorship as it is now in democratic\n\nI have removed this paragraph, not because I would necessarily disagree, but because this is a speculative formulation (a ""POV"" in wikipedia terminology) not suitable for an encyclopaedia. If this was an undisputed topic, maybe such formulations could work, but since that is not the case, it is definitely better to stick to pure sourced data.  "'

In [28]:
label_counts = df.one_hot_labels.astype(str).value_counts()
one_freq = label_counts[label_counts==1].keys()
one_freq_idxs = sorted(list(df[df.one_hot_labels.astype(str).isin(one_freq)].index), reverse=True)
print('df label indices with only one instance: ', one_freq_idxs)

df label indices with only one instance:  [63106, 40974]


In [29]:
# Gathering single instance inputs to force into the training set after stratified split
one_freq_input_ids = [input_ids.pop(i) for i in one_freq_idxs]
one_freq_token_types = [token_type_ids.pop(i) for i in one_freq_idxs]
one_freq_attention_masks = [attention_masks.pop(i) for i in one_freq_idxs]
one_freq_labels = [labels.pop(i) for i in one_freq_idxs]
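
The reason for this step is that a stratified split needs at least two examples of every label combination, so combinations that occur once are held out and added to the training set afterwards. The sketch below reproduces the idea on toy data (all values are made up); popping indices in descending order matters because removing an element shifts every later position.

```python
from collections import Counter

# Toy label combinations and matching features
labels = [(0, 0), (1, 1), (0, 0), (1, 0), (0, 1), (1, 0), (0, 0)]
features = ["a", "b", "c", "d", "e", "f", "g"]

counts = Counter(labels)

# Indices of combinations seen exactly once, highest index first
rare_idxs = sorted(
    (i for i, lab in enumerate(labels) if counts[lab] == 1),
    reverse=True,
)

# Pop from the end so earlier indices stay valid while removing
held_out = [(features.pop(i), labels.pop(i)) for i in rare_idxs]

print("held out:", held_out)
print("remaining:", list(zip(features, labels)))
```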

#### Split into training and validation sets

In [30]:
# Split into training and validation sets, stratified by label combination

(train_inputs, validation_inputs,
 train_labels, validation_labels,
 train_token_types, validation_token_types,
 train_masks, validation_masks) = train_test_split(
    input_ids, labels, token_type_ids, attention_masks,
    random_state=2020, test_size=0.10, stratify=labels)

# Add one frequency data to train data
train_inputs.extend(one_freq_input_ids)
train_labels.extend(one_freq_labels)
train_masks.extend(one_freq_attention_masks)
train_token_types.extend(one_freq_token_types)

# Convert the id lists to torch tensors
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)
train_token_types = torch.tensor(train_token_types)

validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)
validation_token_types = torch.tensor(validation_token_types)

In [31]:
# Batch size; common choices are 8, 16, 32, 64, 128, 256
batch_size = 32

# Training set
train_data = TensorDataset(train_inputs, train_masks, train_labels, train_token_types)
train_sampler = RandomSampler(train_data)  # draw training samples in random order
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Validation set
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels, validation_token_types)
validation_sampler = SequentialSampler(validation_data)  # iterate in order
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

In [32]:
# Save the prepared data loaders
torch.save(validation_dataloader,'validation_data_loader')
torch.save(train_dataloader,'train_data_loader')

## Define the model and settings

```
AutoModel:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

BERT:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

XLNet:
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=num_labels)

RoBERTa:
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels)
```



In [33]:
from transformers import AutoModelForSequenceClassification

In [34]:
# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
# num_labels is 6 here; the default is 2 (binary classification)
model.cuda()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
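# Any *ForSequenceClassification model returns two outputs when labels are passed:
# loss
# logits
# Without labels, only the logits are returned (outputs[0]).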

Set up the optimizer

https://huggingface.co/transformers/main_classes/optimizer_schedules.html

In [36]:
paras=[para for para in model.named_parameters()]

In [38]:
# paras

In [41]:
from transformers import AdamW

In [42]:
# Apply weight decay to all parameters except biases and LayerNorm weights
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']  # parameter names used by the PyTorch BERT implementation
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},  # the optimizer reads 'weight_decay'; unknown keys are silently ignored
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
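
The grouping is plain substring matching on parameter names, which can be checked without loading a model. A small sketch, assuming PyTorch-style BERT parameter names such as `LayerNorm.weight` (the list below is an illustrative subset, and `split_decay_groups` is a hypothetical helper); it also shows that the TensorFlow-era patterns `gamma` and `beta` do not match those names.

```python
def split_decay_groups(param_names, no_decay):
    """Split names into (decayed, not decayed) by substring match."""
    decayed = [n for n in param_names if not any(nd in n for nd in no_decay)]
    skipped = [n for n in param_names if any(nd in n for nd in no_decay)]
    return decayed, skipped

# Illustrative subset of BERT parameter names
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

decayed, skipped = split_decay_groups(names, ("bias", "LayerNorm.weight"))
legacy_decayed, _ = split_decay_groups(names, ("bias", "gamma", "beta"))

print("decayed:", decayed)
print("skipped:", skipped)
```

With the legacy patterns, `bert.embeddings.LayerNorm.weight` lands in the decayed group, which is usually not what is intended.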

In [43]:
optimizer = AdamW(optimizer_grouped_parameters,lr=2e-5,correct_bias=True)
# Typical learning rates to try: 1e-5, 2e-5, 5e-5
# optimizer = AdamW(model.parameters(),lr=2e-5)  # default optimizer without parameter groups



## Train the model

In [None]:
# Store our loss and accuracy for plotting
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 3 # number of epochs; with ~150k training examples and a fairly simple task, 5 at most

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):

  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train() # switch to training mode

  # Tracking variables
  tr_loss = 0 #running loss
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):  # iterate over batches
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    # The order matches the TensorDataset: input ids, attention masks, labels, token type ids
    b_input_ids, b_input_mask, b_labels, b_token_types = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()

    # # Forward pass for multiclass classification
    # outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    # loss = outputs[0]
    # logits = outputs[1]

    # Forward pass for multilabel classification
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs[0]
    loss_func = BCEWithLogitsLoss() # sigmoid and binary cross-entropy in one numerically stable step
    loss = loss_func(logits.view(-1,num_labels),b_labels.type_as(logits).view(-1,num_labels)) #convert labels to float for calculation
    # Equivalent alternative: apply the sigmoid explicitly, then use BCELoss
    # loss_func = BCELoss() 
    # loss = loss_func(torch.sigmoid(logits.view(-1,num_labels)),b_labels.type_as(logits).view(-1,num_labels)) #convert labels to float for calculation
    train_loss_set.append(loss.item())  # record the loss

    # Backward pass
    loss.backward() # backpropagate the loss
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    # scheduler.step()
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))

###############################################################################

  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Variables to gather full output
  logit_preds,true_labels,pred_labels,tokenized_texts = [],[],[],[]

  # Predict
  for i, batch in enumerate(validation_dataloader):
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels, b_token_types = batch
    with torch.no_grad():
      # Forward pass
      outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      b_logit_pred = outs[0]
      pred_label = torch.sigmoid(b_logit_pred)

      b_logit_pred = b_logit_pred.detach().cpu().numpy()
      pred_label = pred_label.to('cpu').numpy()
      b_labels = b_labels.to('cpu').numpy()

    tokenized_texts.append(b_input_ids)
    logit_preds.append(b_logit_pred)
    true_labels.append(b_labels)
    pred_labels.append(pred_label)

  # Flatten outputs
  pred_labels = [item for sublist in pred_labels for item in sublist]
  true_labels = [item for sublist in true_labels for item in sublist]

  # Compute metrics: micro-averaged F1 and exact-match (subset) accuracy
  threshold = 0.50
  pred_bools = [pl>threshold for pl in pred_labels]
  true_bools = [tl==1 for tl in true_labels]
  val_f1_accuracy = f1_score(true_bools,pred_bools,average='micro')*100
  val_flat_accuracy = accuracy_score(true_bools, pred_bools)*100

  print('F1 Validation Accuracy: ', val_f1_accuracy)
  print('Flat Validation Accuracy: ', val_flat_accuracy)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Train loss: 0.05082032523087247


Epoch:  33%|███▎      | 1/3 [23:00<46:01, 1380.97s/it]

F1 Validation Accuracy:  78.08238436665262
Flat Validation Accuracy:  92.33565206492449
Train loss: 0.03456845487250288


Epoch:  67%|██████▋   | 2/3 [46:01<23:00, 1380.97s/it]

F1 Validation Accuracy:  78.78084179970972
Flat Validation Accuracy:  92.59885943473084
Train loss: 0.027323115857782174


Epoch: 100%|██████████| 3/3 [1:09:02<00:00, 1380.97s/it]

F1 Validation Accuracy:  78.47726262775885
Flat Validation Accuracy:  92.8495331202607





In [None]:
torch.save(model.state_dict(), 'bert_model_toxic')

## Load the test data and preprocess it

In [46]:
test_df = pd.read_csv('data/03/test.csv')
test_labels_df = pd.read_csv('data/03/test_labels.csv')
test_df = test_df.merge(test_labels_df, on='id', how='left')
test_label_cols = list(test_df.columns[2:])
print('Null values: ', test_df.isnull().values.any()) #should not be any null sentences or labels
print('Same columns between train and test: ', label_cols == test_label_cols) #columns should be the same
test_df.head()

Null values:  False
Same columns between train and test:  True


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,":If you have a look back at the source, the in...",-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,I don't anonymously edit articles at all.,-1,-1,-1,-1,-1,-1


In [47]:
test_df = test_df[~test_df[test_label_cols].eq(-1).any(axis=1)].copy() #remove irrelevant rows/comments with -1 values; copy so the column added below is not written to a view
test_df['one_hot_labels'] = list(test_df[test_label_cols].values)
test_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,one_hot_labels
5,0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
7,000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
11,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
13,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"
14,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0]"


In [48]:
# Gathering input data
test_labels = list(test_df.one_hot_labels.values)
test_comments = list(test_df.comment_text.values)

In [49]:
# Tokenize and encode the test set
test_encodings = tokenizer.batch_encode_plus(test_comments,max_length=max_length,padding='max_length',truncation=True)
test_input_ids = test_encodings['input_ids']
test_token_type_ids = test_encodings['token_type_ids']
test_attention_masks = test_encodings['attention_mask']



In [50]:
# Make tensors out of data
test_inputs = torch.tensor(test_input_ids)
test_labels = torch.tensor(test_labels)
test_masks = torch.tensor(test_attention_masks)
test_token_types = torch.tensor(test_token_type_ids)
# Create test dataloader
test_data = TensorDataset(test_inputs, test_masks, test_labels, test_token_types)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
# Save test dataloader
torch.save(test_dataloader,'test_data_loader')

## Prediction and evaluation

In [None]:
# Test

# Put the model in evaluation mode before predicting on the test set
model.eval()

#track variables
logit_preds,true_labels,pred_labels,tokenized_texts = [],[],[],[]

# Predict
for i, batch in enumerate(test_dataloader):
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels, b_token_types = batch
  with torch.no_grad():
    # Forward pass
    outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    b_logit_pred = outs[0]
    pred_label = torch.sigmoid(b_logit_pred)

    b_logit_pred = b_logit_pred.detach().cpu().numpy()
    pred_label = pred_label.to('cpu').numpy()
    b_labels = b_labels.to('cpu').numpy()

  tokenized_texts.append(b_input_ids)
  logit_preds.append(b_logit_pred)
  true_labels.append(b_labels)
  pred_labels.append(pred_label)

# Flatten outputs
tokenized_texts = [item for sublist in tokenized_texts for item in sublist]
pred_labels = [item for sublist in pred_labels for item in sublist]
true_labels = [item for sublist in true_labels for item in sublist]
# Converting flattened binary values to boolean values
true_bools = [tl==1 for tl in true_labels]

The sigmoid outputs lie in the range [0, 1] and have to be thresholded to obtain binary predictions. Below I use 0.50 as the threshold.

In [None]:
pred_bools = [pl>0.50 for pl in pred_labels] #boolean output after thresholding

# Print and save classification report
print('Test F1 Accuracy: ', f1_score(true_bools, pred_bools,average='micro'))
print('Test Flat Accuracy: ', accuracy_score(true_bools, pred_bools),'\n')
clf_report = classification_report(true_bools,pred_bools,target_names=test_label_cols)
with open('classification_report.txt', 'w') as f:  # the report is a plain string, so save it as text
    f.write(clf_report)
print(clf_report)

Test F1 Accuracy:  0.6792739193783389
Test Flat Accuracy:  0.8803651255118947 

               precision    recall  f1-score   support

        toxic       0.57      0.86      0.69      6090
 severe_toxic       0.40      0.47      0.43       367
      obscene       0.64      0.78      0.70      3691
       threat       0.45      0.70      0.55       211
       insult       0.69      0.70      0.69      3427
identity_hate       0.75      0.49      0.59       712

    micro avg       0.61      0.77      0.68     14498
    macro avg       0.58      0.67      0.61     14498
 weighted avg       0.62      0.77      0.68     14498
  samples avg       0.07      0.07      0.07     14498
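
The two headline numbers measure different things. Micro-averaged F1 pools true positives, false positives and false negatives over all labels, while `accuracy_score` on multi-label input is subset accuracy: a row counts as correct only if all of its labels match. A dependency-free sketch on made-up toy predictions (the helper names are assumptions):

```python
def micro_f1(y_true, y_pred):
    """F1 from counts pooled over every (row, label) cell."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Share of rows whose whole label vector is predicted exactly."""
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Toy data: 4 comments, 3 labels
y_true = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 0, 0], [0, 1, 1]]

toy_f1 = micro_f1(y_true, y_pred)
toy_acc = subset_accuracy(y_true, y_pred)
print(toy_f1, toy_acc)
```

Rows with no labels at all count toward subset accuracy when predicted as empty but contribute nothing to F1, which is one reason the two numbers can diverge on data where most comments are clean.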





## Output the results

In [None]:
idx2label = dict(zip(range(6),label_cols))
print(idx2label)

{0: 'toxic', 1: 'severe_toxic', 2: 'obscene', 3: 'threat', 4: 'insult', 5: 'identity_hate'}


In [None]:
# Getting indices of where boolean one hot vector true_bools is True so we can use idx2label to gather label names
true_label_idxs, pred_label_idxs=[],[]
for vals in true_bools:
  true_label_idxs.append(np.where(vals)[0].flatten().tolist())
for vals in pred_bools:
  pred_label_idxs.append(np.where(vals)[0].flatten().tolist())

In [None]:
# Gathering vectors of label names using idx2label
true_label_texts, pred_label_texts = [], []
for vals in true_label_idxs:
  if vals:
    true_label_texts.append([idx2label[val] for val in vals])
  else:
    true_label_texts.append(vals)

for vals in pred_label_idxs:
  if vals:
    pred_label_texts.append([idx2label[val] for val in vals])
  else:
    pred_label_texts.append(vals)

In [None]:
# Decoding input ids to comment text
comment_texts = [tokenizer.decode(text,skip_special_tokens=True,clean_up_tokenization_spaces=False) for text in tokenized_texts]

In [None]:
# Converting lists to df
comparisons_df = pd.DataFrame({'comment_text': comment_texts, 'true_labels': true_label_texts, 'pred_labels':pred_label_texts})
comparisons_df.to_csv('comparisons.csv')
comparisons_df.head()

Unnamed: 0,comment_text,true_labels,pred_labels
0,thank you for understanding . i think very hig...,[],[]
1,: dear god this site is horrible .,[],[]
2,""" : : : somebody will invariably try to add re...",[],[]
3,""" it says it right there that it is a type . t...",[],[]
4,""" = = before adding a new product to the list ...",[],[]


## F1: threshold search

In [51]:
macro_thresholds = np.array(range(1,10))/10
macro_thresholds

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [None]:



f1_results, flat_acc_results = [], []
for th in macro_thresholds:
  pred_bools = [pl>th for pl in pred_labels]
  test_f1_accuracy = f1_score(true_bools,pred_bools,average='micro')
  test_flat_accuracy = accuracy_score(true_bools, pred_bools)
  f1_results.append(test_f1_accuracy)
  flat_acc_results.append(test_flat_accuracy)

best_macro_th = macro_thresholds[np.argmax(f1_results)] #best macro threshold value

micro_thresholds = (np.array(range(10))/100)+best_macro_th #calculating micro threshold values

f1_results, flat_acc_results = [], []
for th in micro_thresholds:
  pred_bools = [pl>th for pl in pred_labels]
  test_f1_accuracy = f1_score(true_bools,pred_bools,average='micro')
  test_flat_accuracy = accuracy_score(true_bools, pred_bools)
  f1_results.append(test_f1_accuracy)
  flat_acc_results.append(test_flat_accuracy)

best_f1_idx = np.argmax(f1_results) #best threshold value

# Printing and saving classification report
print('Best Threshold: ', micro_thresholds[best_f1_idx])
print('Test F1 Accuracy: ', f1_results[best_f1_idx])
print('Test Flat Accuracy: ', flat_acc_results[best_f1_idx], '\n')

best_pred_bools = [pl>micro_thresholds[best_f1_idx] for pl in pred_labels]
clf_report_optimized = classification_report(true_bools,best_pred_bools, target_names=label_cols)
with open('classification_report_optimized.txt', 'w') as f:  # the report is a plain string, so save it as text
    f.write(clf_report_optimized)
print(clf_report_optimized)

Best Threshold:  0.6
Test F1 Accuracy:  0.6838089672413233
Test Flat Accuracy:  0.8889149395104567 

               precision    recall  f1-score   support

        toxic       0.60      0.83      0.70      6090
 severe_toxic       0.47      0.34      0.39       367
      obscene       0.68      0.74      0.71      3691
       threat       0.52      0.63      0.57       211
       insult       0.73      0.63      0.68      3427
identity_hate       0.79      0.41      0.54       712

    micro avg       0.65      0.73      0.68     14498
    macro avg       0.63      0.60      0.60     14498
 weighted avg       0.66      0.73      0.68     14498
  samples avg       0.07      0.07      0.07     14498
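
The search above is a two-stage grid: a coarse pass over 0.1 to 0.9 in steps of 0.1, then a fine pass over the ten values from the best coarse threshold upward in steps of 0.01. The sketch below reproduces that procedure without NumPy or scikit-learn on made-up toy probabilities; `coarse_to_fine_threshold` is a hypothetical name.

```python
def micro_f1_at(y_true, probs, threshold):
    """Micro-averaged F1 after thresholding the probabilities."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, probs):
        for t, p in zip(t_row, p_row):
            pred = p > threshold
            if pred and t:
                tp += 1
            elif pred and not t:
                fp += 1
            elif t and not pred:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def coarse_to_fine_threshold(y_true, probs):
    """Best of 0.1..0.9, then best of that value + 0.00..0.09."""
    coarse = [i / 10 for i in range(1, 10)]
    best_coarse = max(coarse, key=lambda th: micro_f1_at(y_true, probs, th))
    fine = [best_coarse + i / 100 for i in range(10)]
    best = max(fine, key=lambda th: micro_f1_at(y_true, probs, th))
    return best, micro_f1_at(y_true, probs, best)

# Toy data: 5 comments, 2 labels
y_true = [[1, 0], [0, 0], [1, 1], [0, 1], [0, 0]]
probs = [[0.9, 0.35], [0.45, 0.1], [0.65, 0.8], [0.3, 0.55], [0.525, 0.05]]

best_th, best_f1 = coarse_to_fine_threshold(y_true, probs)
f1_at_half = micro_f1_at(y_true, probs, 0.5)
print(best_th, best_f1, f1_at_half)
```

Because the fine grid only extends upward from the coarse optimum, a better threshold slightly below it would be missed; widening the fine grid to both sides is a simple variation.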



