# Récupération des données Stack OverFlow

- https://github.com/pnageshkar/NLP/blob/master/Medium/Multi_label_Classification_BERT_Lightning.ipynb
- https://curiousily.com/posts/multi-label-text-classification-with-bert-and-pytorch-lightning/

La récupération des données de Stack Overflow est un peu compliqué nous devons partitionner nos données et faire des extractions petit à petit.
Une première analyse de la structure de la base nous a ammené a vouloir filtrer les posts en ne selectionnant que les questions. Pour pouvoir extraire un même nombre de paquet (car les identifiants n'ont pas des valeurs continues) nous utilisons la fonction dense_rank() qui permet de réaffecter un nouvel index a chaque ligne de façon continu. Nous utilisons aussi une fonction with, car l'interpreteur SQL de la base ne permet pas les select imbriqués. 

Ci-dessous un aperçu de la requête utilisée pour l'extraction de données

with extract_post as 
( select * , dense_rank() over (order by id asc) as rank from posts 
where PostTypeId = 1 
)
select * from extract_post
where rank >= 30000
and rank < 60000

## Import des données

In [2]:
! pip install -q pytorch-lightning
! pip install -q bs4
! pip install -q transformers

[K     |████████████████████████████████| 701 kB 5.2 MB/s 
[K     |████████████████████████████████| 596 kB 40.4 MB/s 
[K     |████████████████████████████████| 419 kB 63.1 MB/s 
[K     |████████████████████████████████| 5.9 MB 48.0 MB/s 
[K     |████████████████████████████████| 141 kB 60.0 MB/s 
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.2+zzzcolab20220719082949 requires tensorboard<2.9,>=2.8, but you have tensorboard 2.10.0 which is incompatible.[0m
[K     |████████████████████████████████| 4.7 MB 5.1 MB/s 
[K     |████████████████████████████████| 6.6 MB 55.2 MB/s 
[K     |████████████████████████████████| 101 kB 13.5 MB/s 
[?25h

In [3]:
!nvidia-smi

Sun Aug 14 10:35:57 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
#Declaration des imports utilise pour l'analyse des donnees
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
from datetime import datetime
from tqdm.auto import tqdm

pd.options.display.max_rows = 40

In [5]:
# Import all libraries
import pandas as pd
import numpy as np
import re

# Huggingface transformers
import transformers
from transformers import BertModel,BertTokenizer,AdamW, get_linear_schedule_with_warmup

import torch
from torch import nn ,cuda
from torch.utils.data import DataLoader,Dataset,RandomSampler, SequentialSampler

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

#handling html data
from bs4 import BeautifulSoup

import tensorflow as tf

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda")
device

device(type='cuda')

In [6]:
def print_now() :
    # datetime object containing current date and time
    now = datetime.now()
    print("now =", now)

In [7]:
%%time
df = pd.read_csv("https://www.dropbox.com/s/br1seajznkzd97m/cleaned_nlp_questions_30.csv?dl=1")

CPU times: user 8.51 s, sys: 2.07 s, total: 10.6 s
Wall time: 23 s


In [8]:
%%time
import ast
df['Tags_new'] = df['Tags_new'].apply(ast.literal_eval)

CPU times: user 1.77 s, sys: 29.9 ms, total: 1.8 s
Wall time: 1.81 s


In [9]:
df.head()

Unnamed: 0,Id,Score,Body,Title,Tags,AnswerCount,rank,NumberTags,Tags_new,Body_new,Title_new,Clean_text
0,927358,23972,<p>I accidentally <strong>committed the wrong ...,How do I undo the most recent local commits in...,git version-control git-commit undo,101,158258,4,[git],I accidentally committed the wrong files to Gi...,how do i undo the most recent local commits in...,how do i undo the most recent local commits in...
1,2003505,18993,<p>I want to delete a branch both locally and ...,How do I delete a Git branch locally and remot...,git version-control git-branch git-push git-re...,41,403484,5,[git],I want to delete a branch both locally and rem...,how do i delete a git branch locally and remotely,how do i delete a git branch locally and remot...
2,292357,13052,<p>What are the differences between <code>git ...,What is the difference between 'git pull' and ...,git version-control git-pull git-fetch,38,39164,4,[git],What are the differences between git pull and ...,what is the difference between git pull and gi...,what is the difference between git pull and gi...
3,477816,11080,"<p>I've been messing around with <a href=""http...",What is the correct JSON content type?,json rest http-headers mime-types content-type,36,69935,5,[json],I've been messing around with JSON for some ti...,what is the correct json content type,what is the correct json content type I've bee...
4,348170,10368,<p>I mistakenly added files to Git using the c...,How do I undo 'git add' before commit?,git undo git-revert git-add,38,48655,4,[git],I mistakenly added files to Git using the comm...,how do i undo git add before commit,how do i undo git add before commit I mistaken...


In [9]:
numberToKeep = 100000
df = df.nlargest(numberToKeep,"Score")

In [10]:
print_now()

now = 2022-08-14 10:36:50.800624


## Apprentissage supervisé

Transformation des tags en label binaire en colonne. 

In [11]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb  = MultiLabelBinarizer()

In [12]:
ybin = mlb.fit_transform(df['Tags_new'])
ybin.shape

(50000, 100)

In [13]:
# Getting a sense of how the tags data looks like
print(ybin[0])
print(mlb.inverse_transform(ybin[0].reshape(1,-1)))
print(mlb.classes_)
print(len(mlb.classes_))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[('git',)]
['.net' 'actionscript-3' 'ajax' 'algorithm' 'android' 'apache'
 'apache-flex' 'arrays' 'asp.net' 'asp.net-mvc' 'asp.net-mvc-2' 'bash'
 'bug' 'c' 'c#' 'c++' 'class' 'cocoa' 'cocoa-touch' 'css' 'database'
 'debugging' 'delphi' 'design-patterns' 'discussion' 'django' 'eclipse'
 'email' 'entity-framework' 'events' 'excel' 'facebook' 'feature-request'
 'file' 'flash' 'forms' 'generics' 'git' 'google-app-engine' 'hibernate'
 'html' 'http' 'image' 'internet-explorer' 'iphone' 'java' 'javascript'
 'jquery' 'json' 'linq' 'linq-to-sql' 'linux' 'macos' 'multithreading'
 'mysql' 'nhibernate' 'objective-c' 'oop' 'oracle' 'parsing' 'performance'
 'perl' 'php' 'python' 'qt' 'regex' 'ruby' 'ruby-on-rails' 'security'
 'sharepoint' 'silverlight' 'spring' 'sql' 'sql-server' 'sql-server-2005'
 

In [None]:
# compute no. of words in each question
word_cnt = [len(quest.split()) for quest in df['Clean_text']]
# Plot the distribution
fig = px.histogram(word_cnt,nbins = 100)
fig.show()
#plt.xlabel('Word Count/Question')
#plt.ylabel('# of Occurences')
#plt.title("Frequency of Word Counts/sentence")
#plt.show()

KeyboardInterrupt: 

La plupart des questions ont moins de 300 mots

In [14]:
from sklearn.model_selection import train_test_split
# First Split for Train and Test
x_train,x_test,y_train,y_test = train_test_split(df['Clean_text'], ybin, test_size=0.1, random_state=RANDOM_SEED,shuffle=True)
# Next split Train in to training and validation
x_tr,x_val,y_tr,y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=RANDOM_SEED,shuffle=True)

In [15]:
x_tr = x_tr.reset_index().drop(columns = ["index"])
x_val = x_val.reset_index().drop(columns = ["index"])
x_test = x_test.reset_index().drop(columns = ["index"])

In [16]:
x_val.Clean_text.head()

0    how do you make the application window open wh...
1    use of sqlparameter in sql like clause not wor...
2    django display image in admin interface I've d...
3    css - position absolute - auto height I am hav...
4    how to print time in format 2009 08 10 18 17 5...
Name: Clean_text, dtype: object

In [17]:
print(len(x_tr) ,len(x_val), len(x_test))

36000 9000 5000


In [18]:
def avg_jacard(y_true,y_pred):
    '''
    see https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics
    '''
    jacard = np.minimum(y_true,y_pred).sum(axis=1) / np.maximum(y_true,y_pred).sum(axis=1)
    
    return jacard.mean()*100

def print_score(y_pred, clf):
    print("Clf: ", clf.__class__.__name__)
    print("Jacard score: {}".format(avg_jacard(y_test, y_pred)))
    print("Hamming loss: {}".format(hamming_loss(y_pred, y_test)*100))
    print("---")

First create QTagDataset class based on the Dataset class,that readies the text in a format needed for the BERT Model

In [19]:
class QTagDataset (Dataset):
    def __init__(self,quest,tags, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = quest
        self.labels = tags
        self.max_len = max_len
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, item_idx):
        text = self.text[item_idx]
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True, # Add [CLS] [SEP]
            max_length= self.max_len,
            padding = 'max_length',
            return_token_type_ids= False,
            return_attention_mask= True, # Differentiates padded vs normal token
            truncation=True, # Truncate data beyond max length
            return_tensors = 'pt' # PyTorch Tensor format
          )
        
        input_ids = inputs['input_ids'].flatten()
        attn_mask = inputs['attention_mask'].flatten()
        #token_type_ids = inputs["token_type_ids"]
        
        return {
            'input_ids': input_ids ,
            'attention_mask': attn_mask,
            'label': torch.tensor(self.labels[item_idx], dtype=torch.float)
            
        }

Since we are using Pytorch Lightning for Model training - we will setup the QTagDataModule class that is derived from the LightningDataModule

In [20]:
class QTagDataModule (pl.LightningDataModule):
    
    def __init__(self,x_tr,y_tr,x_val,y_val,x_test,y_test,tokenizer,batch_size=16,max_token_len=200):
        super().__init__()
        self.tr_text = x_tr
        self.tr_label = y_tr
        self.val_text = x_val
        self.val_label = y_val
        self.test_text = x_test
        self.test_label = y_test
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        self.train_dataset = QTagDataset(quest=self.tr_text, tags=self.tr_label, tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.val_dataset  = QTagDataset(quest=self.val_text,tags=self.val_label,tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.test_dataset  = QTagDataset(quest=self.test_text,tags=self.test_label,tokenizer=self.tokenizer,max_len = self.max_token_len)
        
        
    def train_dataloader(self):
        return DataLoader (self.train_dataset,batch_size = self.batch_size,shuffle = True, num_workers = 1 )

    def val_dataloader(self):
        return DataLoader (self.val_dataset, batch_size = self.batch_size)

    def test_dataloader(self):
        return DataLoader (self.test_dataset, batch_size = self.batch_size)

In [21]:
# Initialize the Bert tokenizer
BERT_MODEL_NAME = "bert-base-cased" # we will use the BERT base model(the smaller one)
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

Downloading vocab.txt:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

%%time
max_word_cnt = 300
quest_cnt = 0

# For every sentence...
for question in df['Clean_text']:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = Bert_tokenizer.encode(question, add_special_tokens=True)

    # Update the maximum sentence length.
    if len(input_ids) > max_word_cnt:
        quest_cnt +=1

print(f'# Question having word count > {max_word_cnt}: is  {quest_cnt}')

In [22]:
# Initialize the parameters that will be use for training
N_EPOCHS = 10
BATCH_SIZE = 32
MAX_LEN = 252
LR = 2e-05

In [23]:
# Instantiate and set up the data_module
QTdata_module = QTagDataModule(x_tr.Clean_text,y_tr,x_val.Clean_text,y_val,x_test.Clean_text,y_test,Bert_tokenizer,BATCH_SIZE,MAX_LEN)
QTdata_module.setup()

In [24]:
class QTagClassifier(pl.LightningModule):
    # Set up the classifier
    def __init__(self, n_classes=10, steps_per_epoch=None, n_epochs=3, lr=2e-5 ):
        super().__init__()

        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size,n_classes) # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        self.criterion = nn.BCEWithLogitsLoss()
        
    def forward(self,input_ids, attn_mask):
        output = self.bert(input_ids = input_ids ,attention_mask = attn_mask)
        output = self.classifier(output.pooler_output)
                
        return output
    
    
    def training_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('train_loss',loss , prog_bar=True,logger=True)
        
        return {"loss" :loss, "predictions":outputs, "labels": labels }


    def validation_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('val_loss',loss , prog_bar=True,logger=True)
        
        return loss

    def test_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('test_loss',loss , prog_bar=True,logger=True)
        
        return loss
    
    
    def configure_optimizers(self):
        optimizer = AdamW(self.parameters() , lr=self.lr)
        warmup_steps = self.steps_per_epoch//3
        total_steps = self.steps_per_epoch * self.n_epochs - warmup_steps

        scheduler = get_linear_schedule_with_warmup(optimizer,warmup_steps,total_steps)

        return [optimizer], [scheduler]

In [25]:
# Instantiate the classifier model
steps_per_epoch = len(x_tr)//BATCH_SIZE
model = QTagClassifier(n_classes= len((mlb.classes_)), steps_per_epoch=steps_per_epoch,n_epochs=N_EPOCHS,lr=LR)

Downloading pytorch_model.bin:   0%|          | 0.00/416M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [26]:
#Initialize Pytorch Lightning callback for Model checkpointing

# saves a file like: input/QTag-epoch=02-val_loss=0.32.ckpt
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',# monitored quantity
    filename='QTag10K-{epoch:02d}-{val_loss:.2f}',
    save_top_k=3, #  save the top 3 models
    mode='min', # mode of the monitored quantity  for optimization
)

In [29]:
# Instantiate the Model Trainer
trainer = pl.Trainer(max_epochs = N_EPOCHS , callbacks=[checkpoint_callback], num_sanity_val_steps=0, 
                     accelerator  = 'gpu', devices = 1 ,enable_progress_bar=True)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [30]:
%%time
# Train the Classifier Model
trainer.fit(model, QTdata_module)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name       | Type              | Params
-------------------------------------------------
0 | bert       | BertModel         | 108 M 
1 | classifier | Linear            | 76.9 K
2 | criterion  | BCEWithLogitsLoss | 0     
-------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
433.549   Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


CPU times: user 3h 46min 30s, sys: 1h 9min 32s, total: 4h 56min 2s
Wall time: 4h 58min 39s


In [31]:
# Evaluate the model performance on the test dataset
trainer.test(model,datamodule=QTdata_module)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss           0.11846496909856796
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 0.11846496909856796}]

In [32]:
model_path = checkpoint_callback.best_model_path
model_path

'/content/lightning_logs/version_0/checkpoints/QTag5K-epoch=09-val_loss=0.12.ckpt'

In [34]:
!zip -r /content/lightning_logs.zip /content//lightning_logs

  adding: content//lightning_logs/ (stored 0%)
  adding: content//lightning_logs/version_0/ (stored 0%)
  adding: content//lightning_logs/version_0/checkpoints/ (stored 0%)
  adding: content//lightning_logs/version_0/checkpoints/QTag5K-epoch=08-val_loss=0.15.ckpt (deflated 13%)
  adding: content//lightning_logs/version_0/checkpoints/QTag5K-epoch=09-val_loss=0.12.ckpt (deflated 13%)
  adding: content//lightning_logs/version_0/checkpoints/QTag5K-epoch=07-val_loss=0.19.ckpt (deflated 13%)
  adding: content//lightning_logs/version_0/events.out.tfevents.1660474448.f54e1f7a0b60.69.0 (deflated 68%)
  adding: content//lightning_logs/version_0/events.out.tfevents.1660492557.f54e1f7a0b60.69.1 (deflated 17%)
  adding: content//lightning_logs/version_0/hparams.yaml (stored 0%)


In [37]:
from google.colab import files
files.download('/content/lightning_logs.zip')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [38]:
len(y_test), len(x_test)

(5000, 5000)

In [39]:
from torch.utils.data import TensorDataset

# Tokenize all questions in x_test
input_ids = []
attention_masks = []


for quest in x_test.Clean_text:
    encoded_quest =  Bert_tokenizer.encode_plus(
                    quest,
                    None,
                    add_special_tokens=True,
                    max_length= MAX_LEN,
                    padding = 'max_length',
                    return_token_type_ids= False,
                    return_attention_mask= True,
                    truncation=True,
                    return_tensors = 'pt'      
    )
    
    # Add the input_ids from encoded question to the list.    
    input_ids.append(encoded_quest['input_ids'])
    # Add its attention mask 
    attention_masks.append(encoded_quest['attention_mask'])
  

In [40]:
# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(y_test)

# Set the batch size.  
TEST_BATCH_SIZE = 64  

# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)

In [41]:
flat_pred_outs = 0
flat_true_labels = 0

In [42]:
device

device(type='cuda')

In [43]:
# Put model in evaluation mode
model = model.to(device) # moving model to cuda
model.eval()

QTagClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tru

In [44]:
# Tracking variables 
pred_outs, true_labels = [], []
#i=0
# Predict 
for batch in pred_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
  
    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch
 
    with torch.no_grad():
        # Forward pass, calculate logit predictions
        pred_out = model(b_input_ids,b_attn_mask)
        pred_out = torch.sigmoid(pred_out)
        # Move predicted output and labels to CPU
        pred_out = pred_out.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        #i+=1
        # Store predictions and true labels
        #print(i)
        #print(outputs)
        #print(logits)
        #print(label_ids)
    pred_outs.append(pred_out)
    true_labels.append(label_ids)

In [45]:
len(pred_outs), len(pred_outs[0]), len(pred_outs[0][0])

(79, 64, 100)

In [46]:
79*64

5056

In [47]:
pred_outs[0][0]

array([0.07886031, 0.06549477, 0.10160048, 0.09932765, 0.10509416,
       0.06035288, 0.06301752, 0.06991319, 0.07086246, 0.05096096,
       0.06676254, 0.06773514, 0.0815134 , 0.05452433, 0.12293115,
       0.10983247, 0.04802525, 0.08324813, 0.05755777, 0.12075527,
       0.07491882, 0.09444433, 0.07674796, 0.10451932, 0.15724702,
       0.10338587, 0.07140759, 0.06251292, 0.04470817, 0.12802437,
       0.06922557, 0.06188386, 0.08410352, 0.07833665, 0.05909397,
       0.06628044, 0.09427049, 0.07877872, 0.0498284 , 0.09402348,
       0.06415045, 0.06372312, 0.05475028, 0.09663416, 0.07724487,
       0.1409877 , 0.09198173, 0.06226265, 0.06218348, 0.11365787,
       0.04378047, 0.12780261, 0.09132635, 0.09085774, 0.1230332 ,
       0.06216977, 0.07291014, 0.07247474, 0.12965178, 0.07274061,
       0.08244441, 0.0999212 , 0.07772429, 0.1368389 , 0.08504378,
       0.08717851, 0.05576034, 0.07018624, 0.06794223, 0.10679739,
       0.06538279, 0.06214009, 0.0773861 , 0.06997795, 0.08636

In [48]:
true_labels[0][0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [49]:
# Combine the results across all batches. 
flat_pred_outs = np.concatenate(pred_outs, axis=0)

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)

In [50]:
#define candidate threshold values
threshold  = np.arange(0.10,0.30,0.01)
threshold

array([0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ,
       0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29])

In [51]:
# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob,thresh):
    y_pred = []

    for tag_label_row in pred_prob:
        temp=[]
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1) # Infer tag value as 1 (present)
            else:
                temp.append(0) # Infer tag value as 0 (absent)
        y_pred.append(temp)

    return y_pred

In [52]:
from sklearn import metrics
scores=[] # Store the list of f1 scores for prediction on each threshold

#convert labels to 1D array
y_true = flat_true_labels.ravel() 

for thresh in threshold:
    
    #classes for each threshold
    pred_bin_label = classify(flat_pred_outs,thresh) 

    #convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()

    scores.append(metrics.f1_score(y_true,y_pred))

In [53]:
# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')

Optimal Threshold Value = 0.13


In [54]:
#predictions for optimal threshold
y_pred_labels = classify(flat_pred_outs,opt_thresh)
y_pred = np.array(y_pred_labels).ravel() # Flatten

In [55]:
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.97      0.98    492515
           1       0.08      0.18      0.11      7485

    accuracy                           0.96    500000
   macro avg       0.53      0.57      0.54    500000
weighted avg       0.97      0.96      0.96    500000



In [58]:
y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)

df = pd.DataFrame({'Body':x_test.Clean_text,'Actual Tags':y_act,'Predicted Tags':y_pred})

In [59]:
y_act[:5], y_pred[:5]

([('delphi', 'string'),
  ('discussion',),
  ('android',),
  ('.net', 'asp.net', 'c#', 'performance', 'string'),
  ('discussion',)],
 [('discussion', 'java', 'python'),
  ('discussion', 'java', 'python'),
  ('discussion', 'java', 'python'),
  ('discussion', 'java', 'python'),
  ('discussion', 'java', 'python')])

In [61]:
df.sample(10)

Unnamed: 0,Body,Actual Tags,Predicted Tags
1078,shortcut for echo pre print r myarray echo pre...,"(php,)","(discussion, java, python)"
2889,unique key with nulls This question requires s...,"(database, mysql)","(discussion, java, python)"
2303,how can i parse a time string containing milli...,"(python,)","(discussion, java, python)"
2301,how to programmatically select an item in a wp...,"(.net, c#, wpf)","(discussion, java, oracle, python)"
107,remove comments from c c++ code Is there an ea...,"(c, c++)","(discussion, java, python)"
4026,would you like to test the 2017 developer surv...,"(discussion,)","(discussion, java, python)"
4133,pass a list to a function to act as multiple a...,"(python,)","(discussion, java, python)"
3206,somebody is storing credit card data - how are...,"(security,)","(discussion, java, python)"
1183,parse a .py file read the ast modify it then w...,"(python,)","(discussion, java, python)"
3831,how can i ask a second question inside a quest...,"(discussion,)","(discussion, java, python)"
