# Retrieving Stack Overflow data

- https://github.com/pnageshkar/NLP/blob/master/Medium/Multi_label_Classification_BERT_Lightning.ipynb
- https://curiousily.com/posts/multi-label-text-classification-with-bert-and-pytorch-lightning/

Retrieving data from Stack Overflow is somewhat involved: we have to partition the data and extract it piece by piece.
An initial look at the database structure led us to filter the posts and keep only the questions. Because post identifiers are not contiguous, we use the dense_rank() function, which assigns each row a new, gap-free index, so that every extraction covers the same number of rows. We also use a with clause, because the database's SQL interpreter does not allow nested selects.

Below is an overview of the query used to extract the data:

with extract_post as 
( select * , dense_rank() over (order by id asc) as rank from posts 
where PostTypeId = 1 
)
select * from extract_post
where rank >= 30000
and rank < 60000
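
The same query can be generated for each successive window of ranks. The sketch below is a minimal illustration, assuming a fixed chunk size of 30,000 (the width of the window in the query above). The helper name `build_chunk_query` and the zero-based `chunk_index` are hypothetical and do not come from the original notebook.

```python
# Hypothetical helper: builds the extraction query for one window of ranks.
CHUNK_SIZE = 30_000  # assumed from the 30000..60000 window shown above

QUERY_TEMPLATE = """with extract_post as
( select * , dense_rank() over (order by id asc) as rank from posts
where PostTypeId = 1
)
select * from extract_post
where rank >= {low}
and rank < {high}"""


def build_chunk_query(chunk_index: int, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the query selecting ranks in [chunk_index * size, (chunk_index + 1) * size)."""
    low = chunk_index * chunk_size
    high = low + chunk_size
    return QUERY_TEMPLATE.format(low=low, high=high)


# chunk_index=1 reproduces the window used in the query above
print(build_chunk_query(1))
```

Since dense_rank() starts at 1, the first window (chunk_index=0) holds one row fewer than the others under this numbering.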

## Importing the data

In [1]:
! pip install -q pytorch-lightning
! pip install -q transformers

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.2+zzzcolab20220719082949 requires tensorboard<2.9,>=2.8, but you have tensorboard 2.10.0 which is incompatible.

In [2]:
! pip install -q plotly_express

In [3]:
import plotly.express as px

In [65]:
!nvidia-smi

Mon Aug 22 15:51:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    30W /  70W |   2668MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
# Imports used for the data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
from datetime import datetime
from tqdm.auto import tqdm

pd.options.display.max_rows = 40

In [6]:
# Import all libraries
import pandas as pd
import numpy as np
import re

# Huggingface transformers
import transformers
from transformers import BertModel,BertTokenizer,AdamW, get_linear_schedule_with_warmup

import torch
from torch import nn ,cuda
from torch.utils.data import DataLoader,Dataset,RandomSampler, SequentialSampler

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

#handling html data
from bs4 import BeautifulSoup

import tensorflow as tf

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda")
device

device(type='cuda')

In [7]:
def print_now() :
    # datetime object containing current date and time
    now = datetime.now()
    print("now =", now)

In [8]:
%%time
df = pd.read_csv("https://www.dropbox.com/s/8acv2mufr6g4mha/20Kcleaned_nlp_questions_10.csv?dl=1")

CPU times: user 7.73 s, sys: 1.22 s, total: 8.95 s
Wall time: 16.2 s


In [9]:
%%time
import ast
df['Tags_new'] = df['Tags_new'].apply(ast.literal_eval)

CPU times: user 2.34 s, sys: 59 ms, total: 2.4 s
Wall time: 3 s


In [10]:
df.head()

Unnamed: 0,Score,Body,Title,Tags,Tags_new,Clean_text
0,8117,"<p>Recently, I ran some of my JavaScript code ...","What does ""use strict"" do in JavaScript, and w...",javascript syntax jslint use-strict,[javascript],what does use strict do in javascript and what...
1,7710,<p>How can I redirect the user from one page t...,How do I redirect to another webpage?,javascript jquery redirect,"[javascript, jquery]",how do i redirect to another webpage how can i...
2,7417,<p>Usually I would expect a <code>String.conta...,How to check whether a string contains a subst...,javascript string substring string-matching,[javascript],how to check whether a string contains a subst...
3,7364,<p>I've recently started maintaining someone e...,var functionName = function() {} vs function f...,javascript function syntax idioms,[javascript],var functionname function vs function function...
4,7332,"<p>Given the following code, what does the <co...","What does if __name__ == ""__main__"": do?",python namespaces main python-module idioms,[python],what does if name main do given the following ...


In [11]:
#numberToKeep = 50000
#df = df.nlargest(numberToKeep,"Score")

In [12]:
print_now()

now = 2022-08-22 15:07:41.530779


## Supervised learning

The tags are turned into binary labels, one column per tag.

In [13]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb  = MultiLabelBinarizer()

In [14]:
ybin = mlb.fit_transform(df['Tags_new'])
ybin.shape

(200000, 10)

In [15]:
# Getting a sense of what the tags data looks like
print(ybin[0])
print(mlb.inverse_transform(ybin[0].reshape(1,-1)))
print(mlb.classes_)
print(len(mlb.classes_))

[0 0 0 0 0 0 1 0 0 0]
[('javascript',)]
['.net' 'asp.net' 'c#' 'c++' 'iphone' 'java' 'javascript' 'jquery' 'php'
 'python']
10
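
To make the encoding concrete, here is a small pure-Python sketch of what the binarization produces. It is only an illustration, assuming the classes are ordered as in the `mlb.classes_` output above. The helper `to_multi_hot` is hypothetical and is not how `MultiLabelBinarizer` is implemented.

```python
# Class order copied from the mlb.classes_ output above
CLASSES = ['.net', 'asp.net', 'c#', 'c++', 'iphone', 'java',
           'javascript', 'jquery', 'php', 'python']


def to_multi_hot(tags, classes=CLASSES):
    """Return one 0/1 entry per class: 1 if the question carries that tag."""
    tag_set = set(tags)
    return [1 if c in tag_set else 0 for c in classes]


print(to_multi_hot(['javascript']))            # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(to_multi_hot(['javascript', 'jquery']))  # [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
```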


In [16]:
# compute no. of words in each question
word_cnt = [len(quest.split()) for quest in df['Clean_text']]
# Plot the distribution
fig = px.histogram(word_cnt,nbins = 100)
fig.show()
#plt.xlabel('Word Count/Question')
#plt.ylabel('# of Occurences')
#plt.title("Frequency of Word Counts/sentence")
#plt.show()

Most questions contain fewer than 300 words. We also face a technical constraint that prevents us from going beyond 300: past that length, the available GPU cannot hold everything needed to train the model.
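
The cells that create `x_tr`, `y_tr`, `x_val`, `y_val`, `x_test` and `y_test` do not appear in this export. The sketch below shows one way such a split could be built. The fractions are assumptions: 10% for test and 20% of the remainder for validation would be consistent with the 20,000 test rows and the 4,500 training steps per epoch at batch size 32 reported further down, but the source does not state them. `split_indices` is a hypothetical helper; `train_test_split` from scikit-learn, which the notebook already imports, could be used instead.

```python
import numpy as np


def split_indices(n_rows, test_frac=0.10, val_frac=0.20, seed=42):
    """Shuffle row indices and cut them into train / validation / test parts.

    val_frac is applied to what remains once the test rows are set aside.
    Both fractions are assumed values, not taken from the notebook.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_frac))
    n_val = int(round((n_rows - n_test) * val_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(200_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 144000 36000 20000
```

With a pandas DataFrame, each part would be selected with `df.iloc[train_idx]` and the matching labels with `ybin[train_idx]`. Resetting the index afterwards (`reset_index(drop=True)`) keeps integer lookups such as `self.text[item_idx]` in `QTagDataset` valid.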

In [20]:
from sklearn.metrics import hamming_loss

def avg_jacard(y_true,y_pred):
    '''
    Mean per-sample Jaccard index, in percent.
    see https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics
    '''
    jacard = np.minimum(y_true,y_pred).sum(axis=1) / np.maximum(y_true,y_pred).sum(axis=1)
    
    return jacard.mean()*100

def print_score(y_pred, clf):
    # Relies on the global y_test holding the true labels
    print("Clf: ", clf.__class__.__name__)
    print("Jacard score: {}".format(avg_jacard(y_test, y_pred)))
    print("Hamming loss: {}".format(hamming_loss(y_pred, y_test)*100))
    print("---")
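
For a single question, the Jaccard index is the number of tags that are both true and predicted, divided by the number of tags that are true or predicted. The worked example below is a sketch on toy rows. The helper `jaccard_per_row` is hypothetical, and its handling of rows with no tag at all (scored as 1.0) is an assumption that avoids a division by zero.

```python
import numpy as np


def jaccard_per_row(y_true, y_pred):
    """Per-sample Jaccard index for binary multi-label rows."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    intersection = np.minimum(y_true, y_pred).sum(axis=1)
    union = np.maximum(y_true, y_pred).sum(axis=1)
    # Assumption: a row with no true and no predicted tag counts as a perfect match.
    return np.where(union == 0, 1.0, intersection / np.maximum(union, 1))


# Toy rows: one tag missed, one exact match, one row without any tag
y_true = [[1, 0, 1], [0, 1, 0], [0, 0, 0]]
y_pred = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
scores = jaccard_per_row(y_true, y_pred)
print(scores)               # per-row values: 0.5, 1.0, 1.0
print(scores.mean() * 100)  # mean in percent, as avg_jacard reports it
```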

First, create a QTagDataset class, derived from Dataset, that prepares the text in the format the BERT model expects.

In [21]:
class QTagDataset (Dataset):
    def __init__(self,quest,tags, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = quest
        self.labels = tags
        self.max_len = max_len
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, item_idx):
        text = self.text[item_idx]
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True, # Add [CLS] [SEP]
            max_length= self.max_len,
            padding = 'max_length',
            return_token_type_ids= False,
            return_attention_mask= True, # Differentiates padded vs normal token
            truncation=True, # Truncate data beyond max length
            return_tensors = 'pt' # PyTorch Tensor format
          )
        
        input_ids = inputs['input_ids'].flatten()
        attn_mask = inputs['attention_mask'].flatten()
        #token_type_ids = inputs["token_type_ids"]
        
        return {
            'input_ids': input_ids ,
            'attention_mask': attn_mask,
            'label': torch.tensor(self.labels[item_idx], dtype=torch.float)
            
        }
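
To see what `padding='max_length'` and `truncation=True` do to one question, here is a toy sketch in plain Python. It is a simplified illustration, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the ids 101, 102 and 0 are assumed to be the `[CLS]`, `[SEP]` and `[PAD]` ids of the BERT vocabulary, and the other ids are placeholders.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids


def pad_or_truncate(token_ids, max_len):
    """Toy version of max_length padding with truncation for a single sequence."""
    body = token_ids[:max_len - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks a real token
    n_pad = max_len - len(input_ids)
    # Padding positions get PAD_ID and a 0 in the attention mask
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


print(pad_or_truncate([7, 8, 9], max_len=6))
# ([101, 7, 8, 9, 102, 0], [1, 1, 1, 1, 1, 0])
print(pad_or_truncate([7, 8, 9, 10, 11], max_len=5))
# ([101, 7, 8, 9, 102], [1, 1, 1, 1, 1])
```

Every item therefore has exactly `max_len` entries, which is what lets the DataLoader stack them into a batch.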

Since we train the model with PyTorch Lightning, we set up a QTagDataModule class derived from LightningDataModule.

In [22]:
class QTagDataModule (pl.LightningDataModule):
    
    def __init__(self,x_tr,y_tr,x_val,y_val,x_test,y_test,tokenizer,batch_size=16,max_token_len=200):
        super().__init__()
        self.tr_text = x_tr
        self.tr_label = y_tr
        self.val_text = x_val
        self.val_label = y_val
        self.test_text = x_test
        self.test_label = y_test
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        self.train_dataset = QTagDataset(quest=self.tr_text, tags=self.tr_label, tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.val_dataset  = QTagDataset(quest=self.val_text,tags=self.val_label,tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.test_dataset  = QTagDataset(quest=self.test_text,tags=self.test_label,tokenizer=self.tokenizer,max_len = self.max_token_len)
        
        
    def train_dataloader(self):
        return DataLoader (self.train_dataset,batch_size = self.batch_size,shuffle = True, num_workers = 1 )

    def val_dataloader(self):
        return DataLoader (self.val_dataset, batch_size = self.batch_size)

    def test_dataloader(self):
        return DataLoader (self.test_dataset, batch_size = self.batch_size)

In [23]:
# Initialize the Bert tokenizer
BERT_MODEL_NAME = "bert-base-cased" # we will use the BERT base model(the smaller one)
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

In [None]:
%%time
max_word_cnt = 300
quest_cnt = 0

# For every question...
for question in df['Clean_text']:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = Bert_tokenizer.encode(question, add_special_tokens=True)

    # Count the questions whose token sequence exceeds the limit.
    if len(input_ids) > max_word_cnt:
        quest_cnt +=1

print(f'# Questions with more than {max_word_cnt} tokens: {quest_cnt}')

In [24]:
# Initialize the parameters that will be used for training
N_EPOCHS = 15
BATCH_SIZE = 32
MAX_LEN = 300
LR = 2e-05

In [25]:
# Instantiate and set up the data_module
QTdata_module = QTagDataModule(x_tr.Clean_text,y_tr,x_val.Clean_text,y_val,x_test.Clean_text,y_test,Bert_tokenizer,BATCH_SIZE,MAX_LEN)
QTdata_module.setup()

In [26]:
class QTagClassifier(pl.LightningModule):
    # Set up the classifier
    def __init__(self, n_classes=10, steps_per_epoch=None, n_epochs=3, lr=2e-5 ):
        super().__init__()

        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size,n_classes) # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        self.criterion = nn.BCEWithLogitsLoss()
        
    def forward(self,input_ids, attn_mask):
        output = self.bert(input_ids = input_ids ,attention_mask = attn_mask)
        output = self.classifier(output.pooler_output)
                
        return output
    
    
    def training_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('train_loss',loss , prog_bar=True,logger=True)
        
        return {"loss" :loss, "predictions":outputs, "labels": labels }


    def validation_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('val_loss',loss , prog_bar=True,logger=True)
        
        return loss

    def test_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('test_loss',loss , prog_bar=True,logger=True)
        
        return loss
    
    
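    def configure_optimizers(self):
        optimizer = AdamW(self.parameters() , lr=self.lr)
        # Warm up over a third of an epoch
        warmup_steps = self.steps_per_epoch//3
        # Passed as num_training_steps: the step at which the linear decay reaches zero
        total_steps = self.steps_per_epoch * self.n_epochs - warmup_steps

        scheduler = get_linear_schedule_with_warmup(optimizer,warmup_steps,total_steps)

        # Returned as a bare list, the scheduler is stepped once per epoch (Lightning's default interval)
        return [optimizer], [scheduler]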
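
The shape of a linear schedule with warmup can be reproduced in a few lines. The sketch below follows the formula documented for `get_linear_schedule_with_warmup` as we understand it. The helper name `linear_warmup_multiplier` is hypothetical, and the value 4500 for `steps_per_epoch` is taken from the training log further down. Steps here are scheduler steps, which need not coincide with optimizer steps.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given scheduler step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 at warmup_steps to 0 at total_steps, then flat at 0
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch, n_epochs, base_lr = 4500, 15, 2e-5
warmup_steps = steps_per_epoch // 3                      # 1500
total_steps = steps_per_epoch * n_epochs - warmup_steps  # 66000

for step in (0, 750, 1500, 33750, 66000):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(step, lr)
```

The second argument of the decay is the step at which the rate reaches zero, counted from the very first step, warmup included.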

In [27]:
# Instantiate the classifier model
steps_per_epoch = len(x_tr)//BATCH_SIZE
model = QTagClassifier(n_classes= len((mlb.classes_)), steps_per_epoch=steps_per_epoch,n_epochs=N_EPOCHS,lr=LR)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [37]:
#Initialize Pytorch Lightning callback for Model checkpointing

# saves a file like: QTag20K_10-epoch=02-val_loss=0.32.ckpt
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',# monitored quantity
    filename='QTag20K_10-{epoch:02d}-{val_loss:.2f}',
    save_top_k=2, #  save the top 2 models
    mode='min', # lower val_loss is better
    verbose=True,
    dirpath= '/content/drive/MyDrive/TAG_NLP', 
)

In [38]:
# Instantiate the Model Trainer
trainer = pl.Trainer(max_epochs = N_EPOCHS , callbacks=[checkpoint_callback], 
                     resume_from_checkpoint = "/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt", 
                     accelerator  = 'gpu', devices = 1 ,enable_progress_bar=True)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [39]:
%%time
# Train the Classifier Model
trainer.fit(model, QTdata_module)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name       | Type              | Params
-------------------------------------------------
0 | bert       | BertModel         | 108 M 
1 | classifier | Linear            | 7.7 K 
2 | criterion  | BCEWithLogitsLoss | 0     
-------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
433.272   Total estimated model params size (MB)


INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 4500: 'val_loss' reached 0.76302 (best 0.76302), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=00-val_loss=0.76.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 9000: 'val_loss' reached 0.64367 (best 0.64367), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=01-val_loss=0.64.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 13500: 'val_loss' reached 0.48659 (best 0.48659), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=02-val_loss=0.49.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 18000: 'val_loss' reached 0.37430 (best 0.37430), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=03-val_loss=0.37.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 22500: 'val_loss' reached 0.34781 (best 0.34781), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=04-val_loss=0.35.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 5, global step 27000: 'val_loss' reached 0.32918 (best 0.32918), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=05-val_loss=0.33.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 6, global step 31500: 'val_loss' reached 0.29145 (best 0.29145), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=06-val_loss=0.29.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 7, global step 36000: 'val_loss' reached 0.25622 (best 0.25622), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=07-val_loss=0.26.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 8, global step 40500: 'val_loss' reached 0.22562 (best 0.22562), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=08-val_loss=0.23.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 9, global step 45000: 'val_loss' reached 0.20054 (best 0.20054), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=09-val_loss=0.20.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 10, global step 49500: 'val_loss' reached 0.17946 (best 0.17946), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=10-val_loss=0.18.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 11, global step 54000: 'val_loss' reached 0.16231 (best 0.16231), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=11-val_loss=0.16.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 12, global step 58500: 'val_loss' reached 0.14825 (best 0.14825), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=12-val_loss=0.15.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 13, global step 63000: 'val_loss' reached 0.13651 (best 0.13651), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=13-val_loss=0.14.ckpt' as top 2


INFO:pytorch_lightning.utilities.rank_zero:Epoch 14, global step 67500: 'val_loss' reached 0.12674 (best 0.12674), saving model to '/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt' as top 2
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=15` reached.


CPU times: user 16h 14min 55s, sys: 1h 48min 29s, total: 18h 3min 25s
Wall time: 18h 15min 38s


In [41]:
model_path = checkpoint_callback.best_model_path
model_path

'/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt'

In [32]:
# Evaluate the model performance on the test dataset
trainer.test(model,datamodule=QTdata_module
             ,ckpt_path="/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt")

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss           0.12820042669773102
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.12820042669773102}]

In [33]:
len(y_test), len(x_test)

(20000, 20000)

In [None]:
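# load_from_checkpoint is a classmethod that returns a new model, so keep its result
model = QTagClassifier.load_from_checkpoint(checkpoint_path='/content/drive/MyDrive/TAG_NLP/QTag20K_10-epoch=14-val_loss=0.13.ckpt')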

In [34]:
from torch.utils.data import TensorDataset

# Tokenize all questions in x_test
input_ids = []
attention_masks = []


for quest in x_test.Clean_text:
    encoded_quest =  Bert_tokenizer.encode_plus(
                    quest,
                    None,
                    add_special_tokens=True,
                    max_length= MAX_LEN,
                    padding = 'max_length',
                    return_token_type_ids= False,
                    return_attention_mask= True,
                    truncation=True,
                    return_tensors = 'pt'      
    )
    
    # Add the input_ids from encoded question to the list.    
    input_ids.append(encoded_quest['input_ids'])
    # Add its attention mask 
    attention_masks.append(encoded_quest['attention_mask'])
  

In [35]:
# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(y_test)

# Set the batch size.  
TEST_BATCH_SIZE = 64  

# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)

In [36]:
flat_pred_outs = 0
flat_true_labels = 0

In [37]:
device

device(type='cuda')

In [38]:
# Put model in evaluation mode
model = model.to(device) # moving model to cuda
model.eval()

QTagClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tru

In [39]:
# Tracking variables 
pred_outs, true_labels = [], []

# Predict 
for batch in pred_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
  
    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch
 
    with torch.no_grad():
        # Forward pass: logits, then sigmoid to get one probability per tag
        pred_out = model(b_input_ids,b_attn_mask)
        pred_out = torch.sigmoid(pred_out)

    # Move predicted output and labels to CPU
    pred_out = pred_out.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    pred_outs.append(pred_out)
    true_labels.append(label_ids)

In [40]:
len(pred_outs), len(pred_outs[0]), len(pred_outs[0][0])

(313, 64, 10)

In [41]:
313 * 64

20032
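
The product 313 * 64 overshoots the 20,000 test rows because the last batch is only partly filled: a DataLoader keeps the final, smaller batch unless `drop_last=True` is set. A quick check of the arithmetic, using only the figures shown above:

```python
import math

n_rows, batch_size = 20_000, 64

# Number of batches needed to cover every row
n_batches = math.ceil(n_rows / batch_size)
# Rows left for the final batch once the full batches are counted
last_batch_rows = n_rows - (n_batches - 1) * batch_size

print(n_batches, last_batch_rows)  # 313 32
```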

In [42]:
pred_outs[0][0]

array([0.2714007 , 0.04550791, 0.7560344 , 0.01421181, 0.01180358,
       0.13909444, 0.01074913, 0.00819525, 0.05320427, 0.06574861],
      dtype=float32)

In [43]:
true_labels[0][0]

array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [44]:
# Combine the results across all batches. 
flat_pred_outs = np.concatenate(pred_outs, axis=0)

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)

In [45]:
#define candidate threshold values
threshold  = np.arange(0.10,0.80,0.01)
threshold

array([0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ,
       0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31,
       0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42,
       0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53,
       0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64,
       0.65, 0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75,
       0.76, 0.77, 0.78, 0.79])

In [46]:
# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob,thresh):
    y_pred = []

    for tag_label_row in pred_prob:
        temp=[]
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1) # Infer tag value as 1 (present)
            else:
                temp.append(0) # Infer tag value as 0 (absent)
        y_pred.append(temp)

    return y_pred
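
The same thresholding can be written as a single NumPy comparison. This is an alternative sketch, not the notebook's code: `classify_vectorized` is a hypothetical name, and it returns an array rather than a list of lists. The probabilities are the first prediction row printed above.

```python
import numpy as np


def classify_vectorized(pred_prob, thresh):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (np.asarray(pred_prob) >= thresh).astype(int)


# First prediction row shown above (pred_outs[0][0])
probs = np.array([[0.2714007, 0.04550791, 0.7560344, 0.01421181, 0.01180358,
                   0.13909444, 0.01074913, 0.00819525, 0.05320427, 0.06574861]],
                 dtype=np.float32)

print(classify_vectorized(probs, 0.38))  # [[0 0 1 0 0 0 0 0 0 0]]
```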

In [75]:
from sklearn import metrics
scores=[] # Store the list of f1 scores for prediction on each threshold

#convert labels to 1D array
y_true = flat_true_labels.ravel() 

for thresh in threshold:
    
    #classes for each threshold
    pred_bin_label = classify(flat_pred_outs,thresh) 

    #convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()

    scores.append(metrics.f1_score(y_true,y_pred))

In [48]:
# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')

Optimal Threshold Value = 0.3799999999999999
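
Computing F1 on the flattened arrays amounts to a micro-averaged F1 over every (question, tag) cell. The sketch below reproduces the sweep on toy data without scikit-learn. The helpers `micro_f1` and `best_threshold` are hypothetical names, and the numbers are made up for illustration rather than taken from the notebook.

```python
import numpy as np


def micro_f1(y_true, y_pred):
    """F1 over all (sample, label) cells, as obtained on raveled arrays."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, candidates):
    """Return the candidate with the highest micro F1, and that score."""
    scores = [micro_f1(y_true, np.asarray(probs) >= t) for t in candidates]
    best = int(np.argmax(scores))  # first maximum, like scores.index(max(scores))
    return candidates[best], scores[best]


# Toy data, not taken from the notebook
y_true = np.array([[1, 0, 1], [0, 1, 0]])
probs = np.array([[0.30, 0.10, 0.80], [0.20, 0.60, 0.35]])
print(best_threshold(y_true, probs, [0.2, 0.3, 0.4, 0.5]))
```

On this toy data the threshold 0.3 wins: it keeps all three true tags and lets through a single false positive.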


In [76]:
#predictions for optimal threshold
y_pred_labels = classify(flat_pred_outs,opt_thresh)

In [78]:
y_pred_labels[0] , flat_true_labels[0]

([0, 0, 1, 0, 0, 0, 0, 0, 0, 0], array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))

In [84]:
print(metrics.classification_report(flat_true_labels, y_pred_labels, target_names= mlb.classes_,zero_division=0))

              precision    recall  f1-score   support

        .net       0.76      0.29      0.42      2347
     asp.net       0.81      0.61      0.70      1307
          c#       0.76      0.78      0.77      4892
         c++       0.94      0.83      0.88      2269
      iphone       0.97      0.81      0.88      1276
        java       0.94      0.85      0.89      3513
  javascript       0.87      0.67      0.76      1959
      jquery       0.91      0.85      0.88      1381
         php       0.95      0.80      0.87      1989
      python       0.97      0.86      0.91      2040

   micro avg       0.88      0.74      0.80     22973
   macro avg       0.89      0.74      0.80     22973
weighted avg       0.87      0.74      0.79     22973
 samples avg       0.81      0.78      0.78     22973



In [87]:
print("Jacard score: {}".format(avg_jacard(flat_true_labels, y_pred_labels))) 

Jacard score: 75.80708333333332


In [88]:
y_pred = np.array(y_pred_labels).ravel() # Flatten
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98    177027
           1       0.88      0.74      0.80     22973

    accuracy                           0.96    200000
   macro avg       0.92      0.86      0.89    200000
weighted avg       0.96      0.96      0.96    200000



In [51]:
y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)

df = pd.DataFrame({'Body':x_test.Clean_text,'Actual Tags':y_act,'Predicted Tags':y_pred})

In [64]:
df.sample(30)

Unnamed: 0,Body,Actual Tags,Predicted Tags
1735,expression generated based on interface i m ge...,"(c#,)","(c#,)"
6705,java - distinct list of objects i have a list ...,"(java,)","(java,)"
6593,can i authenticate an iframe page from the par...,"(javascript,)","(javascript, jquery)"
9612,exception handling on background threads using...,"(c#,)","(java,)"
9436,spring transaction propagation issue i am usin...,"(java,)","(java,)"
10860,disabling authenticode signature verification ...,"(.net,)","(.net, c#)"
18593,technique to limit number of instances of our ...,"(python,)","(python,)"
13127,porting node.js server-side code to html5 webs...,"(javascript,)","(javascript,)"
7195,displaying camel cased words as such with unde...,"(java,)","(java,)"
19347,hibernate mapping through another entity consi...,"(java,)","(java,)"
