# Retrieving the Stack Overflow data

- https://github.com/pnageshkar/NLP/blob/master/Medium/Multi_label_Classification_BERT_Lightning.ipynb
- https://curiousily.com/posts/multi-label-text-classification-with-bert-and-pytorch-lightning/

Retrieving data from Stack Overflow is somewhat involved: the data has to be partitioned and extracted one batch at a time.
An initial look at the database structure led us to filter the posts and keep only the questions. Because post identifiers are not contiguous, we use the `dense_rank()` function to assign each row a new, gap-free index, so that every extraction returns the same number of rows. We also use a `with` clause, because the database's SQL interpreter does not allow nested selects.

Below is an overview of the query used to extract the data:

with extract_post as 
( select * , dense_rank() over (order by id asc) as rank from posts 
where PostTypeId = 1 
)
select * from extract_post
where rank >= 30000
and rank < 60000
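The windowing logic behind this query can be sketched in Python. This is a minimal illustration, not the extraction script actually used: `dense_rank`, `rank_windows` and `build_query` are hypothetical helpers, and the batch size of 30000 is inferred from the bounds in the query above.

```python
def dense_rank(ids):
    """Return the 1-based dense rank of each id (ties share a rank, no gaps)."""
    order = {value: rank for rank, value in enumerate(sorted(set(ids)), start=1)}
    return [order[value] for value in ids]


def rank_windows(total_rows, batch_size):
    """Yield half-open [low, high) rank windows covering ranks 1..total_rows."""
    for low in range(0, total_rows + 1, batch_size):
        yield low, low + batch_size


def build_query(low, high):
    """Fill in the extraction query for one rank window."""
    return (
        "with extract_post as\n"
        "( select * , dense_rank() over (order by id asc) as rank from posts\n"
        "where PostTypeId = 1\n"
        ")\n"
        "select * from extract_post\n"
        f"where rank >= {low}\n"
        f"and rank < {high}"
    )


# Non-contiguous ids are mapped to a gap-free index
print(dense_rank([4, 9, 11, 25]))        # [1, 2, 3, 4]
# Successive windows for a hypothetical table of 70000 questions
print(list(rank_windows(70000, 30000)))  # [(0, 30000), (30000, 60000), (60000, 90000)]
```

Running `build_query(30000, 60000)` reproduces the query shown above; each window is then extracted separately.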

## Importing the data

In [1]:
! pip install -q pytorch-lightning
! pip install -q transformers

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.2+zzzcolab20220719082949 requires tensorboard<2.9,>=2.8, but you have tensorboard 2.10.0 which is incompatible.

In [2]:
!nvidia-smi

Fri Aug 19 00:55:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# Imports used for the data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
from datetime import datetime
from tqdm.auto import tqdm

pd.options.display.max_rows = 40

In [4]:
# Import all libraries
import pandas as pd
import numpy as np
import re

# Huggingface transformers
import transformers
from transformers import BertModel,BertTokenizer,AdamW, get_linear_schedule_with_warmup

import torch
from torch import nn ,cuda
from torch.utils.data import DataLoader,Dataset,RandomSampler, SequentialSampler

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

#handling html data
from bs4 import BeautifulSoup

import tensorflow as tf

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present
device

device(type='cuda')

In [5]:
def print_now() :
    # datetime object containing current date and time
    now = datetime.now()
    print("now =", now)

In [6]:
%%time
df = pd.read_csv("https://www.dropbox.com/s/b0rb0xk3nzopniz/20Kcleaned_nlp_questions_30.csv?dl=1")

CPU times: user 8.82 s, sys: 1.73 s, total: 10.5 s
Wall time: 24.7 s


In [7]:
%%time
import ast
df['Tags_new'] = df['Tags_new'].apply(ast.literal_eval)

CPU times: user 3.06 s, sys: 73.9 ms, total: 3.13 s
Wall time: 6.04 s


In [10]:
df.head()

Unnamed: 0,Score,Body,Title,Tags,Tags_new,Clean_text
0,9685,"<p>After reading <a href=""http://groups.google...","What is the ""-->"" operator in C/C++?",c operators code-formatting standards-compliance,[c],what is the -- operator in c c++ after reading...
1,8117,"<p>Recently, I ran some of my JavaScript code ...","What does ""use strict"" do in JavaScript, and w...",javascript syntax jslint use-strict,[javascript],what does use strict do in javascript and what...
2,7710,<p>How can I redirect the user from one page t...,How do I redirect to another webpage?,javascript jquery redirect,"[javascript, jquery]",how do i redirect to another webpage how can i...
3,7417,<p>Usually I would expect a <code>String.conta...,How to check whether a string contains a subst...,javascript string substring string-matching,[javascript],how to check whether a string contains a subst...
4,7364,<p>I've recently started maintaining someone e...,var functionName = function() {} vs function f...,javascript function syntax idioms,[javascript],var functionname function vs function function...


In [None]:
numberToKeep = 50000
df = df.nlargest(numberToKeep,"Score")

In [11]:
print_now()

now = 2022-08-18 14:44:45.010411


## Supervised learning

Convert the tags into binary label columns, one column per tag.

In [8]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb  = MultiLabelBinarizer()

In [9]:
ybin = mlb.fit_transform(df['Tags_new'])
ybin.shape

(200000, 30)

In [10]:
# Getting a sense of what the tag data looks like
print(ybin[0])
print(mlb.inverse_transform(ybin[0].reshape(1,-1)))
print(mlb.classes_)
print(len(mlb.classes_))

[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[('c',)]
['.net' 'ajax' 'android' 'asp.net' 'asp.net-mvc' 'c' 'c#' 'c++' 'css'
 'database' 'discussion' 'django' 'html' 'iphone' 'java' 'javascript'
 'jquery' 'mysql' 'objective-c' 'php' 'python' 'regex' 'ruby'
 'ruby-on-rails' 'sql' 'sql-server' 'vb.net' 'windows' 'wpf' 'xml']
30
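What the binarizer does can be sketched without scikit-learn. The helpers below (`fit_classes`, `binarize`, `inverse`) are hypothetical names used only for illustration; they mimic the behaviour seen above, where the sorted distinct tags become the columns and each question becomes a row of 0/1 flags.

```python
def fit_classes(tag_lists):
    """Collect the sorted set of distinct tags; this fixes the column order."""
    return sorted({tag for tags in tag_lists for tag in tags})


def binarize(tag_lists, classes):
    """Turn each list of tags into a row of 0/1 flags, one flag per class."""
    index = {tag: position for position, tag in enumerate(classes)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1
        rows.append(row)
    return rows


def inverse(row, classes):
    """Recover the tuple of tags whose flag is set."""
    return tuple(tag for tag, flag in zip(classes, row) if flag)


# Tag lists taken from the first rows of the dataframe shown earlier
tags = [["c"], ["javascript"], ["javascript", "jquery"]]
classes = fit_classes(tags)
rows = binarize(tags, classes)
print(classes)                    # ['c', 'javascript', 'jquery']
print(rows)                       # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
print(inverse(rows[2], classes))  # ('javascript', 'jquery')
```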


In [15]:
# Compute the number of words in each question
import plotly.express as px

word_cnt = [len(quest.split()) for quest in df['Clean_text']]
# Plot the distribution of word counts per question
fig = px.histogram(word_cnt, nbins=100)
fig.show()

Most questions contain fewer than 300 words.

In [11]:
from sklearn.model_selection import train_test_split
# First split: 90% train+validation, 10% test
x_train,x_test,y_train,y_test = train_test_split(df['Clean_text'], ybin, test_size=0.1, random_state=RANDOM_SEED,shuffle=True)
# Second split: 80% of the remainder for training, 20% for validation
x_tr,x_val,y_tr,y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=RANDOM_SEED,shuffle=True)

In [12]:
x_tr = x_tr.reset_index().drop(columns = ["index"])
x_val = x_val.reset_index().drop(columns = ["index"])
x_test = x_test.reset_index().drop(columns = ["index"])

In [13]:
x_val.Clean_text.head()

0    is there a way to define the timezone for an a...
1    how to use content tag in a lib class my probl...
2    dynamic keyword enables maybe monad so i have ...
3    divide and conquer of large objects for gc per...
4    100 edits in one evening. is it wrong yesterda...
Name: Clean_text, dtype: object

In [14]:
print(len(x_tr) ,len(x_val), len(x_test))

144000 36000 20000
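These sizes follow directly from the two split ratios. The short check below is only arithmetic: `split_sizes` is a hypothetical helper, and it assumes the test share is rounded up to a whole row, which makes no difference for these round numbers.

```python
def split_sizes(n_rows, test_pct):
    """Return (n_train, n_test) for a test share given in whole percent."""
    n_test = -(-n_rows * test_pct // 100)  # integer ceiling division
    return n_rows - n_test, n_test


n_total = 200000                                # rows in ybin
n_trainval, n_test = split_sizes(n_total, 10)   # first split, test_size=0.1
n_tr, n_val = split_sizes(n_trainval, 20)       # second split, test_size=0.2
print(n_tr, n_val, n_test)                      # 144000 36000 20000
```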


In [15]:
from sklearn.metrics import hamming_loss

def avg_jacard(y_true,y_pred):
    '''
    Mean per-sample Jaccard index, in percent.
    see https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics
    '''
    jacard = np.minimum(y_true,y_pred).sum(axis=1) / np.maximum(y_true,y_pred).sum(axis=1)
    
    return jacard.mean()*100

def print_score(y_pred, clf):
    # Relies on the global y_test produced by the train/test split
    print("Clf: ", clf.__class__.__name__)
    print("Jacard score: {}".format(avg_jacard(y_test, y_pred)))
    print("Hamming loss: {}".format(hamming_loss(y_test, y_pred)*100))
    print("---")
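A small worked example may help to read the two metrics. The sketch below recomputes them with NumPy only; `avg_jaccard_safe` and `hamming_pct` are hypothetical names, and treating a row with no true and no predicted tag as a perfect match is an assumption of this sketch (the function above would divide by zero on such a row).

```python
import numpy as np


def avg_jaccard_safe(y_true, y_pred):
    """Mean per-row Jaccard index in percent; empty rows count as a match."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    inter = np.minimum(y_true, y_pred).sum(axis=1)
    union = np.maximum(y_true, y_pred).sum(axis=1)
    # Avoid dividing by zero when a row has no tag at all
    scores = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return scores.mean() * 100


def hamming_pct(y_true, y_pred):
    """Share of individual label slots that differ, in percent."""
    return (np.asarray(y_true) != np.asarray(y_pred)).mean() * 100


y_true_demo = [[1, 0, 1], [0, 1, 0], [0, 0, 0]]
y_pred_demo = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
# Row scores are 1/2, 1 and 1, so the mean Jaccard is 5/6 of 100
print(avg_jaccard_safe(y_true_demo, y_pred_demo))
# One slot out of nine differs
print(hamming_pct(y_true_demo, y_pred_demo))
```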

First, create a QTagDataset class, derived from PyTorch's Dataset class, that prepares the text in the format the BERT model expects.

In [16]:
class QTagDataset (Dataset):
    def __init__(self,quest,tags, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = quest
        self.labels = tags
        self.max_len = max_len
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, item_idx):
        text = self.text[item_idx]
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True, # Add [CLS] [SEP]
            max_length= self.max_len,
            padding = 'max_length',
            return_token_type_ids= False,
            return_attention_mask= True, # Differentiates padded vs normal token
            truncation=True, # Truncate data beyond max length
            return_tensors = 'pt' # PyTorch Tensor format
          )
        
        input_ids = inputs['input_ids'].flatten()
        attn_mask = inputs['attention_mask'].flatten()
        #token_type_ids = inputs["token_type_ids"]
        
        return {
            'input_ids': input_ids ,
            'attention_mask': attn_mask,
            'label': torch.tensor(self.labels[item_idx], dtype=torch.float)
            
        }
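The effect of `padding='max_length'` and `truncation=True` on a single example can be sketched in plain Python. This is a simplified illustration with made-up token ids, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and the real tokenizer additionally keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut or pad a list of token ids to exactly max_len entries.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    ids = list(token_ids)[:max_len]
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask


# A short sequence is padded on the right
print(pad_or_truncate([11, 12, 13], 5))          # ([11, 12, 13, 0, 0], [1, 1, 1, 0, 0])
# A long sequence is cut to max_len
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))    # ([1, 2, 3, 4], [1, 1, 1, 1])
```

Every item returned by the dataset therefore has the same length, which is what lets the DataLoader stack items into a batch.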

Since we are using PyTorch Lightning for model training, we set up a QTagDataModule class derived from LightningDataModule.

In [17]:
class QTagDataModule (pl.LightningDataModule):
    
    def __init__(self,x_tr,y_tr,x_val,y_val,x_test,y_test,tokenizer,batch_size=16,max_token_len=200):
        super().__init__()
        self.tr_text = x_tr
        self.tr_label = y_tr
        self.val_text = x_val
        self.val_label = y_val
        self.test_text = x_test
        self.test_label = y_test
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        self.train_dataset = QTagDataset(quest=self.tr_text, tags=self.tr_label, tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.val_dataset  = QTagDataset(quest=self.val_text,tags=self.val_label,tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.test_dataset  = QTagDataset(quest=self.test_text,tags=self.test_label,tokenizer=self.tokenizer,max_len = self.max_token_len)
        
        
    def train_dataloader(self):
        return DataLoader (self.train_dataset,batch_size = self.batch_size,shuffle = True, num_workers = 1 )

    def val_dataloader(self):
        return DataLoader (self.val_dataset, batch_size = self.batch_size)

    def test_dataloader(self):
        return DataLoader (self.test_dataset, batch_size = self.batch_size)

In [18]:
# Initialize the Bert tokenizer
BERT_MODEL_NAME = "bert-base-cased" # we will use the BERT base model(the smaller one)
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

Downloading vocab.txt:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

%%time
max_word_cnt = 300
quest_cnt = 0

# For every question...
for question in df['Clean_text']:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = Bert_tokenizer.encode(question, add_special_tokens=True)

    # Count the questions whose token count exceeds the limit.
    if len(input_ids) > max_word_cnt:
        quest_cnt +=1

print(f'# Questions with more than {max_word_cnt} tokens: {quest_cnt}')

In [19]:
# Initialize the parameters that will be used for training
N_EPOCHS = 35
BATCH_SIZE = 32
MAX_LEN = 252
LR = 2e-05

In [20]:
# Instantiate and set up the data_module
QTdata_module = QTagDataModule(x_tr.Clean_text,y_tr,x_val.Clean_text,y_val,x_test.Clean_text,y_test,Bert_tokenizer,BATCH_SIZE,MAX_LEN)
QTdata_module.setup()

In [21]:
class QTagClassifier(pl.LightningModule):
    # Set up the classifier
    def __init__(self, n_classes=10, steps_per_epoch=None, n_epochs=3, lr=2e-5 ):
        super().__init__()

        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size,n_classes) # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        self.criterion = nn.BCEWithLogitsLoss()
        
    def forward(self,input_ids, attn_mask):
        output = self.bert(input_ids = input_ids ,attention_mask = attn_mask)
        output = self.classifier(output.pooler_output)
                
        return output
    
    
    def training_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('train_loss',loss , prog_bar=True,logger=True)
        
        return {"loss" :loss, "predictions":outputs, "labels": labels }


    def validation_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('val_loss',loss , prog_bar=True,logger=True)
        
        return loss

    def test_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('test_loss',loss , prog_bar=True,logger=True)
        
        return loss
    
    
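    def configure_optimizers(self):
        optimizer = AdamW(self.parameters() , lr=self.lr)
        # Warm up over a third of the first epoch, then decay linearly to zero
        warmup_steps = self.steps_per_epoch//3
        # num_training_steps is the total step count, warmup included
        total_steps = self.steps_per_epoch * self.n_epochs

        scheduler = get_linear_schedule_with_warmup(optimizer,warmup_steps,total_steps)

        # The schedule is defined per optimizer step; Lightning steps schedulers once per epoch by default
        return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}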
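The shape of a linear warmup schedule can be sketched in plain Python. The function below follows the documented behaviour of `get_linear_schedule_with_warmup` (a linear rise from 0 during warmup, then a linear decay to 0 at the last training step); `lr_multiplier` is a hypothetical name, and taking the total as `steps_per_epoch * n_epochs` with one scheduler update per optimizer step is an assumption of this sketch. The numbers come from the notebook's settings (144000 training rows, batch size 32, 35 epochs, learning rate 2e-05).

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch = 144000 // 32          # 4500
warmup_steps = steps_per_epoch // 3     # 1500
total_steps = steps_per_epoch * 35      # 157500
base_lr = 2e-05

for step in (0, 750, 1500, 79500, 157500):
    factor = lr_multiplier(step, warmup_steps, total_steps)
    print(step, factor, base_lr * factor)
```

The factor reaches 1.0 at the end of warmup (step 1500), is back to 0.5 halfway through the decay (step 79500) and hits 0 at the final step.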

In [22]:
# Instantiate the classifier model
steps_per_epoch = len(x_tr)//BATCH_SIZE
model = QTagClassifier(n_classes= len((mlb.classes_)), steps_per_epoch=steps_per_epoch,n_epochs=N_EPOCHS,lr=LR)

Downloading pytorch_model.bin:   0%|          | 0.00/416M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [23]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [24]:
#Initialize Pytorch Lightning callback for Model checkpointing

# saves a file like: /content/drive/MyDrive/TAG_NLP/QTag8K-epoch=02-val_loss=0.32.ckpt
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',# monitored quantity
    filename='QTag8K-{epoch:02d}-{val_loss:.2f}',
    save_top_k=2, #  save the top 2 models
    mode='min', # mode of the monitored quantity  for optimization
    verbose=True,
    dirpath= '/content/drive/MyDrive/TAG_NLP', 
)

In [25]:
# Instantiate the Model Trainer
trainer = pl.Trainer(max_epochs = N_EPOCHS , callbacks=[checkpoint_callback], 
                     resume_from_checkpoint = "/content/drive/MyDrive/TAG_NLP/QTag8K-epoch=34-val_loss=0.05.ckpt", 
                     accelerator  = 'gpu', devices = 1 ,enable_progress_bar=True)

  "Setting `Trainer(resume_from_checkpoint=)` is deprecated in v1.5 and"
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [31]:
%%time
# Train the Classifier Model
trainer.fit(model, QTdata_module)

  ckpt_path = ckpt_path or self.resume_from_checkpoint
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/drive/MyDrive/TAG_NLP/QTag8K-epoch=29-val_loss=0.05.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name       | Type              | Params
-------------------------------------------------
0 | bert       | BertModel         | 108 M 
1 | classifier | Linear            | 23.1 K
2 | criterion  | BCEWithLogitsLoss | 0     
-------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
433.333   Total estimated model params size (MB)
INFO:pytorch_lightning.utilities.rank_zero:Restored all states from the checkpoint file at /content/drive/MyDrive/TAG_NLP/QTag8K-epoch=29-val_loss=0.05.ckpt


Sanity Checking: 0it [00:00, ?it/s]

Training: 4500it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 30, global step 73800: 'val_loss' reached 0.05330 (best 0.05330), saving model to '/content/drive/MyDrive/TAG_NLP/QTag8K-epoch=30-val_loss=0.05.ckpt' as top 2


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 31, global step 78300: 'val_loss' reached 0.05223 (best 0.05223), saving model to '/content/drive/MyDrive/TAG_NLP/QTag8K-epoch=31-val_loss=0.05.ckpt' as top 2


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 32, global step 82800: 'val_loss' reached 0.05124 (best 0.05124), saving model to '/content/drive/MyDrive/TAG_NLP/QTag8K-epoch=32-val_loss=0.05.ckpt' as top 2


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 33, global step 87300: 'val_loss' reached 0.05039 (best 0.05039), saving model to '/content/drive/MyDrive/TAG_NLP/QTag8K-epoch=33-val_loss=0.05.ckpt' as top 2


Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 34, global step 91800: 'val_loss' reached 0.04950 (best 0.04950), saving model to '/content/drive/MyDrive/TAG_NLP/QTag8K-epoch=34-val_loss=0.05.ckpt' as top 2
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=35` reached.


CPU times: user 4h 19min 35s, sys: 33min 43s, total: 4h 53min 18s
Wall time: 4h 57min 13s


In [26]:
# Evaluate the model performance on the test dataset
trainer.test(model,datamodule=QTdata_module)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss           0.7473402619361877
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.7473402619361877}]
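For scale, the binary cross-entropy of a classifier that outputs a probability of 0.5 for every tag is ln 2, whatever the labels are. The sketch below only computes that reference value; `bce` is a hypothetical helper. The reported test loss (0.747) sits near this level while the validation loss logged during training was about 0.05, which may indicate that the weights evaluated here are not the trained ones; the source does not confirm this.

```python
import math


def bce(prob, label):
    """Binary cross-entropy of one predicted probability against a 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


# An uninformative prediction costs the same for a positive and a negative label
uninformed_pos = bce(0.5, 1)
uninformed_neg = bce(0.5, 0)
print(round(uninformed_pos, 4), round(uninformed_neg, 4))  # 0.6931 0.6931
```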

In [27]:
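# best_model_path is '' when the callback has not yet saved a checkpoint, e.g. if fit() has not been run in this session
model_path = checkpoint_callback.best_model_path
model_path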

''

In [28]:
len(y_test), len(x_test)

(20000, 20000)

In [29]:
from torch.utils.data import TensorDataset

# Tokenize all questions in x_test
input_ids = []
attention_masks = []


for quest in x_test.Clean_text:
    encoded_quest =  Bert_tokenizer.encode_plus(
                    quest,
                    None,
                    add_special_tokens=True,
                    max_length= MAX_LEN,
                    padding = 'max_length',
                    return_token_type_ids= False,
                    return_attention_mask= True,
                    truncation=True,
                    return_tensors = 'pt'      
    )
    
    # Add the input_ids from encoded question to the list.    
    input_ids.append(encoded_quest['input_ids'])
    # Add its attention mask 
    attention_masks.append(encoded_quest['attention_mask'])
  

In [30]:
# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(y_test)

# Set the batch size.  
TEST_BATCH_SIZE = 64  

# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)

In [31]:
flat_pred_outs = 0
flat_true_labels = 0

In [32]:
device

device(type='cuda')

In [33]:
# Put model in evaluation mode
model = model.to(device) # moving model to cuda
model.eval()

QTagClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tru

In [34]:
# Tracking variables 
pred_outs, true_labels = [], []

# Predict 
for batch in pred_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
  
    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch
 
    with torch.no_grad():
        # Forward pass: the model returns logits, sigmoid maps them to per-tag probabilities
        pred_out = model(b_input_ids,b_attn_mask)
        pred_out = torch.sigmoid(pred_out)
        # Move predicted output and labels to CPU
        pred_out = pred_out.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    pred_outs.append(pred_out)
    true_labels.append(label_ids)

In [35]:
len(pred_outs), len(pred_outs[0]), len(pred_outs[0][0])

(313, 64, 30)
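The first number is the batch count, which follows from the test set size and the batch size. A quick check, using only the values shown above (20000 test rows, `TEST_BATCH_SIZE = 64`):

```python
import math

n_samples = 20000
batch_size = 64

n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 313 32
```

Only the first 312 batches hold 64 rows; the last one holds 32, so `len(pred_outs[0])` is not representative of every batch.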

In [36]:
79*64

5056

In [58]:
pred_outs[0][0]

array([0.6059645 , 0.4354556 , 0.54056865, 0.4453591 , 0.5718665 ,
       0.43433276, 0.50064343, 0.6078752 , 0.3511153 , 0.5439792 ,
       0.5982042 , 0.31513053, 0.7134746 , 0.52414644, 0.5469207 ,
       0.51537544, 0.54228944, 0.50514805, 0.7714528 , 0.513471  ,
       0.25459626, 0.3523429 , 0.46378592, 0.4989294 , 0.53962594,
       0.6834876 , 0.5882423 , 0.39475718, 0.54431146, 0.55760443],
      dtype=float32)

In [38]:
true_labels[0][0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

In [39]:
# Combine the results across all batches. 
flat_pred_outs = np.concatenate(pred_outs, axis=0)

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)

In [49]:
#define candidate threshold values
threshold  = np.arange(0.10,0.80,0.01)
threshold

array([0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ,
       0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31,
       0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42,
       0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53,
       0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64,
       0.65, 0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75,
       0.76, 0.77, 0.78, 0.79])

In [50]:
# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob,thresh):
    y_pred = []

    for tag_label_row in pred_prob:
        temp=[]
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1) # Infer tag value as 1 (present)
            else:
                temp.append(0) # Infer tag value as 0 (absent)
        y_pred.append(temp)

    return y_pred
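The same thresholding can be written as a single NumPy comparison, which avoids the nested Python loops on a 20000 x 30 array. This is a sketch of an equivalent, not the code used for the results in this notebook; `classify_vectorized` is a hypothetical name and it returns an array rather than a list of lists.

```python
import numpy as np


def classify_vectorized(pred_prob, thresh):
    """Return 1 where the probability is at or above thresh, else 0."""
    return (np.asarray(pred_prob) >= thresh).astype(int)


probs_demo = np.array([[0.61, 0.44, 0.54],
                       [0.25, 0.71, 0.50]])
labels_demo = classify_vectorized(probs_demo, 0.5)
print(labels_demo.tolist())  # [[1, 0, 1], [0, 1, 1]]
```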

In [51]:
from sklearn import metrics
scores=[] # Store the list of f1 scores for prediction on each threshold

#convert labels to 1D array
y_true = flat_true_labels.ravel() 

for thresh in threshold:
    
    #classes for each threshold
    pred_bin_label = classify(flat_pred_outs,thresh) 

    #convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()

    scores.append(metrics.f1_score(y_true,y_pred))

In [52]:
# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')

Optimal Threshold Value = 0.45999999999999985
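The threshold search can be illustrated on a toy example without scikit-learn. The data and candidate values below are invented for the illustration, and `f1_binary` and `best_threshold` are hypothetical helpers; the idea is the same as above: score every candidate on the flattened labels and keep the one with the highest F1.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 score of the positive class for flat 0/1 arrays."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, y_true, candidates):
    """Return the candidate with the highest F1 and the list of all scores."""
    scores = [f1_binary(y_true, (probs >= t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(scores))], scores


probs_toy = np.array([0.9, 0.8, 0.4, 0.3, 0.2])
labels_toy = np.array([1, 1, 1, 0, 0])
candidates_toy = [0.25, 0.35, 0.5, 0.7]

best, scores_toy = best_threshold(probs_toy, labels_toy, candidates_toy)
print(best)        # 0.35
print(scores_toy)
```

The value printed by the notebook, 0.45999999999999985, is 0.46 with the floating-point error that `np.arange` accumulates on fractional steps; rounding it for display does not change the result.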


In [59]:
# predictions at a fixed threshold of 0.7 (not the opt_thresh value found above)
y_pred_labels = classify(flat_pred_outs,0.7)
y_pred = np.array(y_pred_labels).ravel() # Flatten

In [60]:
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.93      0.94    573988
           1       0.03      0.05      0.03     26012

    accuracy                           0.89    600000
   macro avg       0.49      0.49      0.49    600000
weighted avg       0.92      0.89      0.90    600000



In [61]:
y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)

df = pd.DataFrame({'Body':x_test.Clean_text,'Actual Tags':y_act,'Predicted Tags':y_pred})

In [62]:
y_act[:5], y_pred[:5]

([('django', 'python'),
  ('javascript',),
  ('django', 'python'),
  ('c#',),
  ('python',)],
 [('html', 'objective-c'),
  ('html', 'objective-c', 'sql-server'),
  ('html', 'objective-c', 'sql-server'),
  ('objective-c',),
  ('objective-c',)])

In [63]:
df.sample(10)

Unnamed: 0,Body,Actual Tags,Predicted Tags
15697,comparing two enum types for equivalence in my...,"(c#,)","(html, objective-c, sql-server)"
10102,base64 encode file using nsdata chunks update ...,"(iphone, objective-c)","(html, objective-c, sql-server)"
2278,what is the size of an enum in c i m creating ...,"(c,)","(html, objective-c, sql-server)"
7822,how to serialize python objects in a human-rea...,"(python,)","(html, objective-c)"
15961,why should i use exit select here are a couple...,"(vb.net,)","(html, objective-c, sql-server)"
16823,memory bandwidth usage how do you calculate me...,"(c, c#, c++)","(objective-c,)"
12912,using rubyzip to add files and nested director...,"(ruby,)","(objective-c,)"
14203,sql server 2005 isnumeric not catching 0310d45...,"(sql, sql-server)","(objective-c,)"
6994,php s preg match and preg match all functions ...,"(php,)","(html, objective-c, sql-server)"
9417,handling extended characters in windows comman...,"(windows,)","(html, objective-c, sql-server)"
