<a href="https://colab.research.google.com/github/ivancorrales/colab-notebooks/blob/main/Quotes_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install and import libraries

In [None]:
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install transformers
!pip install pytorch-lightning

Collecting transformers
  Downloading transformers-4.12.2-py3-none-any.whl (3.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  At

In [None]:
# Import all libraries
import pandas as pd
import numpy as np
import re
import os

# Huggingface transformers
import transformers
from transformers import BertModel,BertTokenizer,AdamW, get_linear_schedule_with_warmup

import torch
from torch import nn ,cuda
from torch.utils.data import DataLoader,Dataset,RandomSampler, SequentialSampler

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

is_gpu_available = torch.cuda.is_available()
device = torch.device("cuda:0" if is_gpu_available else "cpu")
if is_gpu_available:
  !nvidia-smi

Tue Nov  2 14:06:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    28W / 149W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Download the dataset

We will use a dataset that is hosted on Kaggle. To download and use Kaggle data within Google Colab, follow the steps below:

1. Sign in to https://kaggle.com/, then click on your profile picture on the top right and select “My Account” from the menu.

2. Scroll down to the “API” section and click “Create New API Token”. This will download a file kaggle.json.

3. Upload the downloaded kaggle.json file in the next cell.


In [None]:
# The Kaggle dataset path
KAGGLE_DATASET = 'akmittal/quotes-dataset'

# Install the Kaggle CLI and upload the kaggle.json API token
!pip install -q kaggle
from google.colab import files
files.upload()

# Move the token to where the Kaggle CLI expects it and restrict its permissions
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json

# Download and extract the dataset
!kaggle datasets download "{KAGGLE_DATASET}"
!mkdir -p /content/dataset
!unzip -q /content/quotes-dataset.zip -d /content/dataset
!ls /content/dataset

Saving kaggle.json to kaggle.json
kaggle.json
Downloading quotes-dataset.zip to /content
  0% 0.00/3.88M [00:00<?, ?B/s]
100% 3.88M/3.88M [00:00<00:00, 95.5MB/s]
quotes.json


## Load dataset

Load the JSON file into a pandas DataFrame.

In [None]:
import pandas as pd

df = pd.read_json('/content/dataset/quotes.json')
df.head(6)

Unnamed: 0,Quote,Author,Tags,Popularity,Category
0,"Don't cry because it's over, smile because it ...",Dr. Seuss,"[attributed-no-source, cry, crying, experience...",0.155666,life
1,"Don't cry because it's over, smile because it ...",Dr. Seuss,"[attributed-no-source, cry, crying, experience...",0.155666,happiness
2,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe,"[attributed-no-source, best, life, love, mista...",0.129122,love
3,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe,"[attributed-no-source, best, life, love, mista...",0.129122,life
4,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe,"[attributed-no-source, best, life, love, mista...",0.129122,truth
5,Be yourself; everyone else is already taken.,Oscar Wilde,"[attributed-no-source, be-yourself, honesty, i...",0.113223,inspiration


Next, clean the dataset by keeping only the columns we need and removing duplicate quotes.

In [None]:
# Keep only the required columns and discard the others
df = df[['Quote','Tags']]
# Drop duplicate records, keeping the first occurrence of each quote
df = df.drop_duplicates(['Quote'])
print(f'The data frame contains {len(df)} records.')
df.head(6)

The data frame contains 36937 records.


Unnamed: 0,Quote,Tags
0,"Don't cry because it's over, smile because it ...","[attributed-no-source, cry, crying, experience..."
2,"I'm selfish, impatient and a little insecure. ...","[attributed-no-source, best, life, love, mista..."
5,Be yourself; everyone else is already taken.,"[attributed-no-source, be-yourself, honesty, i..."
6,Two things are infinite: the universe and huma...,"[attributed-no-source, human-nature, humor, in..."
9,"Be who you are and say what you feel, because ...","[ataraxy, be-yourself, confidence, fitting-in,..."
10,You've gotta dance like there's nobody watchin...,"[dance, heaven, hurt, inspirational, life, lov..."


## Normalize dataset

In this step we normalize the tags (convert them to lowercase and strip surrounding whitespace). Additionally, we will work only with the 15 most frequently used tags.

In [None]:
df.Tags = df.Tags.transform(lambda tags: [tag.lower().strip() for tag in tags])

# Flatten the per-quote tag lists into a single list of tag occurrences (already normalized above)
tags = [element for list_ in df.Tags for element in list_]

print(f'There are {len(tags)} tags.')

There are 215664 tags.


As the output of the cell above shows, the quotes carry more than 200k tag occurrences in total. Since this notebook is mainly for educational purposes, we will keep only the 15 most frequent tags, discard the rest, and drop the quotes that are left without any tag.

In [None]:
classes = pd.Series(tags).value_counts()[:15].index
classes = list(set(classes))
classes.sort()
df['Tags'] = df.Tags.transform(lambda tags: list(set(tags).intersection(classes)))
df = df[df.Tags.transform(lambda tags: len(tags)>0)]

print(f'We will only consider the following tags: {classes}.')
print(f'The data frame contains {len(df)} records with one or more tags.')

We will only consider the following tags: ['death', 'faith', 'god', 'happiness', 'hope', 'humor', 'inspirational', 'inspirational-quotes', 'life', 'love', 'philosophy', 'poetry', 'relationships', 'truth', 'wisdom'].
The data frame contains 23632 records with one or more tags.
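The normalization and filtering steps above can be tried on a small example. The sketch below uses a made-up four-row DataFrame (the `toy` data and the choice of keeping the top 2 tags are assumptions for illustration only) and follows the same logic: normalize, count tag occurrences, keep the most frequent tags, and drop rows left without tags.

```python
import pandas as pd

# Hypothetical toy data standing in for the real Quote/Tags columns
toy = pd.DataFrame({
    "Quote": ["q1", "q2", "q3", "q4"],
    "Tags": [[" Love ", "life"], ["LIFE", "humor"], ["love"], ["rare-tag"]],
})

# Normalize: lowercase and strip surrounding whitespace
toy["Tags"] = toy["Tags"].apply(lambda tags: [t.lower().strip() for t in tags])

# Count every tag occurrence and keep the 2 most frequent tags
all_tags = [t for tags in toy["Tags"] for t in tags]
top = sorted(pd.Series(all_tags).value_counts().head(2).index)

# Keep only the top tags per quote, then drop quotes left without any tag
toy["Tags"] = toy["Tags"].apply(lambda tags: sorted(set(tags) & set(top)))
toy = toy[toy["Tags"].apply(len) > 0]

print(top)
print(toy)
```

The quote tagged only with `rare-tag` disappears, which is the same mechanism that reduces the real dataset to the records with one or more of the 15 kept tags.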


To work on our multi-label tag classification we will convert the field 'Tags' (with arrays of tags) into 15 columns (one per tag) with binary values.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Tags')),
                          columns=mlb.classes_,
                          index=df.index))
df.head(n=6)

Unnamed: 0,Quote,death,faith,god,happiness,hope,humor,inspirational,inspirational-quotes,life,love,philosophy,poetry,relationships,truth,wisdom
0,"Don't cry because it's over, smile because it ...",0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
2,"I'm selfish, impatient and a little insecure. ...",0,0,0,0,0,0,0,0,1,1,0,0,0,1,0
5,Be yourself; everyone else is already taken.,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
6,Two things are infinite: the universe and huma...,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0
10,You've gotta dance like there's nobody watchin...,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0
13,You know you're in love when you can't fall as...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
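To see what `MultiLabelBinarizer` does in isolation, here is a minimal sketch on three made-up tag lists (the `toy_tags` data is an assumption for illustration). It shows that the columns follow the sorted tag names and that the encoding can be reversed with `inverse_transform`, which we rely on later to turn predictions back into tag names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists, one per quote
toy_tags = [["life", "love"], ["humor"], ["life"]]

mlb_toy = MultiLabelBinarizer()
encoded = mlb_toy.fit_transform(toy_tags)

print(mlb_toy.classes_)  # column order: sorted tag names
print(encoded)           # one row per quote, one 0/1 column per tag

# The encoding is reversible: rows map back to tuples of tag names
decoded = mlb_toy.inverse_transform(encoded)
print(decoded)
```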


## Splitting dataset into train, test and validation data

We have more than 20k records in our DataFrame. Since our goal is purely educational, let's work with a subset of "only" the first 2k records.

In [None]:
df = df[:2000]

We will use 70% of the records to train the model and split the remaining 30% evenly between the test and validation sets.

In [None]:
RANDOM_STATE = 167

train_data,temp_data = train_test_split(
    df,
    test_size=.3, 
    random_state=RANDOM_STATE,
    shuffle=True,
)


test_data, val_data = train_test_split(
    temp_data, 
    test_size=0.5, 
    random_state=RANDOM_STATE, 
    shuffle=True,
)
print(f' - dataset for training model: {train_data.shape[0]}.')
print(f' - dataset for validate trained model: {val_data.shape[0]}.')
print(f' - dataset for test the model {test_data.shape[0]}.')

 - dataset for training model: 1400.
 - dataset for validate trained model: 300.
 - dataset for test the model 300.
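The two-stage split can be checked on plain numbers. The sketch below replaces the DataFrame with a list of 2,000 integers (an assumption for illustration) and confirms that splitting off 30% and then halving that part yields a 70/15/15 partition with no overlap.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 2,000 selected rows
records = list(range(2000))

# Stage 1: 70% train, 30% held out
train, temp = train_test_split(records, test_size=0.3, random_state=167, shuffle=True)
# Stage 2: split the held-out part evenly into test and validation
test, val = train_test_split(temp, test_size=0.5, random_state=167, shuffle=True)

print(len(train), len(val), len(test))

# The three subsets do not share any record
assert not (set(train) & set(val)) and not (set(train) & set(test)) and not (set(val) & set(test))
```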


## Preparing the Dataset and DataModule



In [None]:
class QuoteTagDataset (Dataset):
  def __init__(self, data, tokenizer, max_len):
      self.tokenizer = tokenizer
      self.data      = data
      self.max_len   = max_len
      
  def __len__(self):
      return len(self.data)
  
  def __getitem__(self, item_idx):
      item   = self.data.iloc[item_idx]
      quote  = item['Quote']
      # Binary tag indicators, in the order given by the global `classes` list
      labels = item[classes].to_numpy(dtype=np.float32)

      # Tokenize the quote, padding or truncating it to exactly max_len tokens
      inputs = self.tokenizer.encode_plus(
          quote,
          None,
          max_length=self.max_len,
          padding='max_length',
          add_special_tokens=True,
          return_token_type_ids=False,
          return_attention_mask=True,
          truncation=True,
          return_tensors='pt'
      )

      # encode_plus returns tensors of shape (1, max_len); flatten them to (max_len,)
      input_ids = inputs['input_ids'].flatten()
      attn_mask = inputs['attention_mask'].flatten()

      return {
          'input_ids': input_ids,
          'attention_mask': attn_mask,
          'label': torch.tensor(labels, dtype=torch.float)
      }

In [None]:
class QuoteTagDataModule (pl.LightningDataModule):
    
    def __init__(self, train_data, val_data, test_data,tokenizer,train_batch_size=8, val_batch_size=8, test_batch_size=8, max_token_len=150):
        super().__init__()
        self.train_data = train_data
        self.test_data  = test_data
        self.val_data   = val_data
        self.tokenizer = tokenizer
        self.train_batch_size = train_batch_size
        self.test_batch_size = test_batch_size
        self.val_batch_size = val_batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        self.train_dataset = QuoteTagDataset(data=self.train_data, tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.val_dataset  = QuoteTagDataset(data=self.val_data,tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.test_dataset  = QuoteTagDataset(data=self.test_data,tokenizer=self.tokenizer,max_len = self.max_token_len)
        
        
    def train_dataloader(self):
        return DataLoader (self.train_dataset, batch_size = self.train_batch_size, shuffle = True , num_workers=0)

    def val_dataloader(self):
        return DataLoader (self.val_dataset,batch_size=self.val_batch_size)

    def test_dataloader(self):
        return DataLoader (self.test_dataset,batch_size=self.test_batch_size)

In [None]:
# Initialize the Bert tokenizer
BERT_MODEL_NAME = 'bert-base-cased'
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

# Initialize the parameters that will be used for training
TRAIN_BATCH_SIZE = 8
TEST_BATCH_SIZE  = 8
VAL_BATCH_SIZE   = 8
MAX_LEN          = 128

# Instantiate and set up the data_module
data_module = QuoteTagDataModule(train_data,val_data,test_data,Bert_tokenizer,TRAIN_BATCH_SIZE,VAL_BATCH_SIZE,TEST_BATCH_SIZE,MAX_LEN)
data_module.setup()

## Train the Model

In [None]:
class QuoteTagClassifier(pl.LightningModule):
    
    def __init__(self, n_classes=15, steps_per_epoch=None, n_epochs=3, lr=2e-5 ):
        super().__init__()

        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size,n_classes) # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        self.criterion = nn.BCEWithLogitsLoss()
        
    def forward(self,input_ids, attn_mask):
        output = self.bert(input_ids = input_ids ,attention_mask = attn_mask)
        output = self.classifier(output.pooler_output)
        return output
    
    
    def training_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('train_loss',loss , prog_bar=True,logger=True)
        
        return {"loss" :loss, "predictions":outputs, "labels": labels }


    def validation_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('val_loss',loss , prog_bar=True,logger=True)
        
        return loss

    def test_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('test_loss',loss , prog_bar=True,logger=True)
        return loss
    
    
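    def configure_optimizers(self):
        optimizer = AdamW(self.parameters() , lr=self.lr)
        warmup_steps = self.steps_per_epoch//3
        # num_training_steps is the total number of optimizer steps, warm-up included
        total_steps = self.steps_per_epoch * self.n_epochs
        scheduler = get_linear_schedule_with_warmup(optimizer,warmup_steps,total_steps)
        # Step the scheduler after every batch; Lightning's default interval is once per epoch
        return [optimizer], [{'scheduler': scheduler, 'interval': 'step'}]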

In [None]:
N_EPOCHS   = 20
LR         = 2e-05
steps_per_epoch = len(train_data)//TRAIN_BATCH_SIZE

model = QuoteTagClassifier(n_classes=len(classes), steps_per_epoch=steps_per_epoch,n_epochs=N_EPOCHS,lr=LR)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
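The classifier asks `get_linear_schedule_with_warmup` for a learning rate that rises linearly from zero during warm-up and then decays linearly back to zero. The pure-Python sketch below reproduces that shape so the numbers can be inspected without training; `linear_warmup_decay` is a hypothetical helper written for this illustration (it is not part of transformers), and `total_steps` is assumed here to cover every optimizer step of the run.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier (between 0 and 1) at a given optimizer step."""
    if step < warmup_steps:
        # Warm-up: grow linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 at the end of warm-up to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Values taken from this notebook: 1400 training rows, batch size 8, 20 epochs
steps_per_epoch = 1400 // 8               # 175
n_epochs = 20
warmup_steps = steps_per_epoch // 3       # 58
total_steps = steps_per_epoch * n_epochs  # 3500 (assumption: the whole run)
base_lr = 2e-5

for step in (0, 29, 58, 1779, 3500):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The peak learning rate is reached at the end of warm-up (step 58 here), and the rate falls to zero only at `total_steps`.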


In [None]:
# Instantiate the Model Trainer
trainer = pl.Trainer(max_epochs = N_EPOCHS , gpus = 1, progress_bar_refresh_rate = 20)
# Train the Classifier Model
trainer.fit(model, data_module)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
  f"DataModule.{name} has already been called, so it will not be called again. "
  f"DataModule.{name} has already been called, so it will not be called again. "
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type              | Params
-------------------------------------------------
0 | bert       | BertModel         | 108 M 
1 | classifier | Linear            | 11.5 K
2 | criterion  | BCEWithLogitsLoss | 0     
-------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
433.287   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: -1it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

  f"DataModule.{name} has already been called, so it will not be called again. "


In [None]:
# Evaluate the model performance on the test dataset
trainer.test(model,datamodule=data_module)

  f"DataModule.{name} has already been called, so it will not be called again. "
  f"DataModule.{name} has already been called, so it will not be called again. "
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss': 0.2259049415588379}
--------------------------------------------------------------------------------


[{'test_loss': 0.2259049415588379}]
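The `test_loss` above comes from `nn.BCEWithLogitsLoss`, which applies a sigmoid to each logit and, with its default `reduction='mean'`, averages the binary cross-entropy over every quote/tag pair. The NumPy sketch below computes the same quantity on made-up logits and targets (both arrays are assumptions for illustration) in two ways: a numerically stable formula and the naive sigmoid-then-log version.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (quote, tag) pair."""
    # Numerically stable form: max(x, 0) - x*y + log(1 + exp(-|x|))
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

# Hypothetical raw model outputs for 2 quotes and 3 tags, with their true labels
logits = np.array([[2.0, -1.0, 0.0], [-3.0, 0.5, 1.5]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])

# Naive version: explicit sigmoid followed by the cross-entropy formula
probs = 1.0 / (1.0 + np.exp(-logits))
naive = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)).mean()

stable = bce_with_logits(logits, targets)
print(stable, naive)
```

Both values agree; the stable form only avoids overflow for large logits.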

## Evaluate Model Performance on Test Set

In [None]:
from torch.utils.data import TensorDataset

# Tokenize all quotes in test_data
input_ids = []
attention_masks = []


for quote in test_data.Quote:
    encoded_quote =  Bert_tokenizer.encode_plus(
      quote,
      None,
      add_special_tokens=True,
      max_length= MAX_LEN,
      padding = 'max_length',
      return_token_type_ids= False,
      return_attention_mask= True,
      truncation=True,
      return_tensors = 'pt'      
    )
    
    input_ids.append(encoded_quote['input_ids'])
    attention_masks.append(encoded_quote['attention_mask'])
    
# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(test_data[classes].values)

# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)

In [None]:
flat_pred_outs = 0
flat_true_labels = 0

In [None]:
# Move the model to the selected device and put it in evaluation mode
model = model.to(device)
model.eval()

# Tracking variables 
pred_outs, true_labels = [], []
# Predict 
for batch in pred_dataloader:
    # Move the batch to the same device as the model
    batch = tuple(t.to(device) for t in batch)
  
    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch
 
    with torch.no_grad():
        pred_out = model(b_input_ids,b_attn_mask)
        pred_out = torch.sigmoid(pred_out)
        pred_out = pred_out.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

    pred_outs.append(pred_out)
    true_labels.append(label_ids)

In [None]:
flat_pred_outs = np.concatenate(pred_outs, axis=0)
flat_true_labels = np.concatenate(true_labels, axis=0)

## Predictions of Tags in Test set

First of all, we need to identify the threshold that performs best on the test dataset.

In [None]:
threshold  = np.arange(0.4,0.51,0.01)
threshold

array([0.4 , 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ])

Let's define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.

In [None]:
# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob,thresh):
    y_pred = []

    for tag_label_row in pred_prob:
        temp=[]
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1) 
            else:
                temp.append(0)
        y_pred.append(temp)

    return y_pred
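The nested loops in `classify` can also be written as a single NumPy comparison. The sketch below is an equivalent alternative, not a replacement used elsewhere in this notebook; `classify_vectorized` is a hypothetical name, and it returns a NumPy array rather than a list of lists.

```python
import numpy as np

def classify_vectorized(pred_prob, thresh):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (np.asarray(pred_prob) >= thresh).astype(int)

# Hypothetical probabilities for 2 quotes and 3 tags
probs = np.array([[0.10, 0.45, 0.40],
                  [0.39, 0.80, 0.05]])

binary = classify_vectorized(probs, 0.4)
print(binary)
```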

In [None]:
from sklearn import metrics
scores=[] # Store the list of f1 scores for prediction on each threshold

#convert labels to 1D array
y_true = flat_true_labels.ravel() 

for thresh in threshold:
    
    #classes for each threshold
    pred_bin_label = classify(flat_pred_outs,thresh) 

    #convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()

    scores.append(metrics.f1_score(y_true,y_pred))

In [None]:
# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')

Optimal Threshold Value = 0.4
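The threshold search above can be reproduced end to end on a tiny example. In the sketch below the labels, probabilities, and threshold grid are all made up for illustration; the logic is the same as in the cells above: binarize at each threshold, flatten, compute F1 on the positive class, and keep the threshold with the highest score.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true labels and predicted probabilities (3 quotes, 3 tags)
y_true_toy = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1]])
probs_toy = np.array([[0.72, 0.31, 0.18],
                      [0.22, 0.56, 0.37],
                      [0.44, 0.12, 0.63]])

# Candidate thresholds to evaluate
thresholds = np.arange(0.2, 0.81, 0.1)

# F1 on the flattened label matrix for each threshold
toy_scores = [
    f1_score(y_true_toy.ravel(), (probs_toy >= t).astype(int).ravel())
    for t in thresholds
]

best_idx = int(np.argmax(toy_scores))
print(f"best threshold = {thresholds[best_idx]:.1f}, F1 = {toy_scores[best_idx]:.3f}")
```

Note that `np.argmax` returns the first maximum, so ties are resolved in favor of the lowest threshold, just like `scores.index(max(scores))`.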


## Performance Score Evaluation

In [None]:
#predictions for optimal threshold
y_pred_labels = classify(flat_pred_outs,opt_thresh)
y_pred = np.array(y_pred_labels).ravel() # Flatten

In [None]:
print(metrics.classification_report(y_true,y_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.96      4151
           1       0.64      0.20      0.31       349

    accuracy                           0.93      4500
   macro avg       0.79      0.60      0.63      4500
weighted avg       0.91      0.93      0.91      4500
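The report above is computed on the flattened label matrix (300 quotes × 15 tags = 4,500 cells), so it treats every quote/tag cell as one binary decision. If a per-tag breakdown is of interest, `classification_report` also accepts the two-dimensional indicator matrices directly. The sketch below illustrates this on made-up data (the tag names and arrays are assumptions); in this notebook the corresponding inputs would be `flat_true_labels`, `np.array(y_pred_labels)`, and `classes`.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical tags and label matrices (4 quotes, 3 tags)
tag_names = ["humor", "life", "love"]
y_true_tags = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred_tags = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0], [0, 1, 1]])

# One row per tag; zero_division=0 avoids warnings for tags that are never predicted
print(classification_report(y_true_tags, y_pred_tags,
                            target_names=tag_names, zero_division=0))

# The same numbers as a dictionary, e.g. for plotting
report = classification_report(y_true_tags, y_pred_tags,
                               target_names=tag_names, zero_division=0,
                               output_dict=True)
```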



In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
yt = mlb.fit_transform([classes])
yt.shape

y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)

df = pd.DataFrame({'Body':test_data['Quote'],'Actual Tags':y_act,'Predicted Tags':y_pred})
df.sample(40)

Unnamed: 0,Body,Actual Tags,Predicted Tags
7964,Puns are the highest form of literature.,"(humor,)",()
900,A day without laughter is a day wasted.,"(philosophy,)",()
7491,The bond forged between us was not one that co...,"(love,)","(life, love)"
7620,"I won't ever leave you, even though you're alw...","(love,)",()
4245,The definition of success to me is not necessa...,"(life,)","(life,)"
3234,I feel like if you enjoyed the 119 hours that ...,"(love,)","(life,)"
10,You've gotta dance like there's nobody watchin...,"(inspirational, life, love)","(love,)"
2326,Indifference and neglect often do much more da...,"(relationships,)",()
3993,"Love doesn't just sit there, like a stone, it ...","(love,)",()
3480,Remember: the time you feel lonely is the time...,"(life,)","(life,)"


## Try the model

In [None]:

def predict(quote):
    # Tokenize the quote the same way as the training data
    quote_enc = Bert_tokenizer.encode_plus(
            quote,
            None,
            add_special_tokens=True,
            max_length= MAX_LEN,
            padding = 'max_length',
            return_token_type_ids= False,
            return_attention_mask= True,
            truncation=True,
            return_tensors = 'pt'      
    )
    # Run the model on the device it lives on, without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(quote_enc['input_ids'].to(device), quote_enc['attention_mask'].to(device))
    # Turn logits into probabilities before comparing them with the probability threshold
    probs = torch.sigmoid(outputs[0]).cpu().numpy()
    preds = (probs >= opt_thresh).astype(int).reshape(1, -1)
    pred_tags = mlb.inverse_transform(preds)
    return pred_tags

In [None]:
sentence = 'After all, life’s better when we’re happy, healthy, and successful.'

tags = predict(sentence)
if not tags[0]:
    print('This sentence cannot be associated with any known tag. Please review whether a new tag is required.')
else:
    print(f'The following tags are associated:\n {tags}')

In [None]:
classes

['death',
 'faith',
 'god',
 'happiness',
 'hope',
 'humor',
 'inspirational',
 'inspirational-quotes',
 'life',
 'love',
 'philosophy',
 'poetry',
 'relationships',
 'truth',
 'wisdom']