<a href="https://colab.research.google.com/github/infinitylogesh/How-to-do-more-with-less-data/blob/master/Active_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## How to do more with less data ?- Active learning

This notebook is a demo of the active learning technique explained on this blog post. Here I try to use Active learning to utilize only 23% of actual training dataset ( [ATIS intent classification dataset](https://www.kaggle.com/hassanamin/atis-airlinetravelinformationsystem#)) to achieve the same / close result as training on 100% of dataset.

**Active learning :**

<img src="https://i.imgur.com/8FDnlut.png"/>

In [None]:
! pip install transformers==2.8.0
! pip install pytorch-lightning

## Dataset Preparation

We will be using the ATIS intent classification dataset for the demo. Download the data from kaggle [here](https://www.kaggle.com/hassanamin/atis-airlinetravelinformationsystem#).

In [1]:
# Add column names to data file

import pandas as pd

train_df = pd.read_csv("/content/atis_intents_train.csv",header=None)
test_df = pd.read_csv("/content/atis_intents_test.csv",header=None)

train_df.columns = ['label','sentence']
test_df.columns = ['label','sentence']

train_df.to_csv("./train.csv",index=False)
test_df.to_csv("./test.csv",index=False)


In [2]:
from transformers.data.processors.utils import DataProcessor,InputExample,InputFeatures
import csv

class ATISIntentProcessor(DataProcessor):
    """Processor for the ATIS intent classification task."""

    def get_example_from_tensor_dict(self, tensor_dict):
        return InputExample(
            tensor_dict["idx"].numpy(),
            tensor_dict["sentence"].numpy().decode("utf-8"),
            None,
            str(tensor_dict["label"].numpy()),
        )

    def get_train_examples(self, data_dir):
        """ Reads train csv file and converts to list of InputExample"""
        return self._create_examples(self._read_csv(os.path.join(data_dir,"train.csv"),quotechar='"'), "train")

    def get_dev_examples(self, data_dir):
        """ Reads dev csv file and converts to list of InputExample"""
        return self._create_examples(self._read_csv(os.path.join(data_dir,"test.csv"),quotechar='"'), "test")
      
    def get_examples_from_csv(self,csv_file):
        return self._create_examples(self._read_csv(csv_file,quotechar='"'), "custom")

    def _read_csv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8-sig") as f:
            return list(csv.reader(f, delimiter=",", quotechar=quotechar))

    def get_labels(self):
        """ list of labels """
        return ['atis_flight', 'atis_flight_time', 'atis_airfare', 'atis_aircraft',
       'atis_ground_service', 'atis_airline', 'atis_abbreviation',
       'atis_quantity']

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[1]
            label = line[0]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples


In [3]:
from torch.utils.data.sampler import RandomSampler

def get_train_test_data_loaders(data_folder,
                     max_seq_length=128,
                     output_mode="classification",
                     train_batch_size=32,
                     test_batch_size=32):
  """ Creates pytorch dataloaders from train and test files"""
  processor = ATISIntentProcessor()
  train_examples = processor.get_train_examples(data_folder) 
  test_examples = processor.get_dev_examples(data_folder)
  train_features = convert_examples_to_features(examples=train_examples,
                                          tokenizer=tokenizer,
                                          label_list=processor.get_labels(),
                                          output_mode=output_mode,
                                          max_length=max_seq_length)
  test_features = convert_examples_to_features(examples=test_examples,
                                          tokenizer=tokenizer,
                                          label_list=processor.get_labels(),
                                          output_mode=output_mode,
                                          max_length=max_seq_length)

  train_dataset = TensorDataset(torch.tensor([f.input_ids for f in train_features], dtype=torch.long), 
                                  torch.tensor([f.attention_mask for f in train_features], dtype=torch.long), 
                                  torch.tensor([f.token_type_ids for f in train_features], dtype=torch.long), 
                                  torch.tensor([f.label for f in train_features], dtype=torch.long))

  test_dataset = TensorDataset(torch.tensor([f.input_ids for f in test_features], dtype=torch.long), 
                                  torch.tensor([f.attention_mask for f in test_features], dtype=torch.long), 
                                  torch.tensor([f.token_type_ids for f in test_features], dtype=torch.long), 
                                  torch.tensor([f.label for f in test_features], dtype=torch.long))
  train_sampler = RandomSampler(train_dataset)

  train_dataloader = DataLoader(train_dataset,sampler=train_sampler,batch_size=train_batch_size)
  test_dataloader = DataLoader(test_dataset,sampler=None,batch_size=test_batch_size,shuffle=False)

  return train_dataloader,test_dataloader

In [4]:
def get_data_loader_from_file(csv_file,
                     max_seq_length=128,
                     output_mode="classification",
                     batch_size=32):
  processor = ATISIntentProcessor()
  examples = processor.get_examples_from_csv(csv_file) 
  features = convert_examples_to_features(examples=examples,
                                          tokenizer=tokenizer,
                                          label_list=processor.get_labels(),
                                          output_mode=output_mode,
                                          max_length=max_seq_length)

  dataset = TensorDataset(torch.tensor([f.input_ids for f in features], dtype=torch.long), 
                                  torch.tensor([f.attention_mask for f in features], dtype=torch.long), 
                                  torch.tensor([f.token_type_ids for f in features], dtype=torch.long), 
                                  torch.tensor([f.label for f in features], dtype=torch.long))

  dataloader = DataLoader(dataset,sampler=None,batch_size=batch_size,shuffle=False)
  return dataloader

## Trainer



In [5]:
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import BertTokenizer
from torch.utils.data import TensorDataset, RandomSampler, DataLoader, random_split
import torch,os
from transformers import BertForSequenceClassification,BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)


In [11]:
from sklearn.metrics import accuracy_score,f1_score
from transformers import AutoConfig,AutoTokenizer,AutoModelForSequenceClassification
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from torch import nn
import numpy as np

pl.trainer.seed_everything(seed=42)

class TransformersFinetuner(pl.LightningModule):
    """
    Module to finetune transformers - sequence classification models. 
    """

    def __init__(self,model_name_or_path,num_labels,train_dataloader,test_dataloader,**config_kwargs):
        super(TransformersFinetuner, self).__init__()
        self.config = AutoConfig.from_pretrained(
            model_name_or_path,
            **({"num_labels": num_labels} if num_labels is not None else {}),
            **config_kwargs,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path
        )
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name_or_path,
            from_tf=bool(".ckpt" in model_name_or_path),
            config=self.config
        )
        self._train_dataloader = train_dataloader
        self._test_dataloader = test_dataloader


    def forward(self,**inputs):
        return self.model(**inputs)

    def training_step(self, batch, batch_nb):
        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
        if self.config.model_type != "distilbert":
            inputs["token_type_ids"] = batch[2] if self.config.model_type in ["bert", "xlnet", "albert"] else None
        outputs = self(**inputs)
        loss = outputs[0]
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_nb):
        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}

        if self.config.model_type != "distilbert":
            inputs["token_type_ids"] = batch[2] if self.config.model_type in ["bert", "xlnet", "albert"] else None

        outputs = self(**inputs)
        tmp_eval_loss, logits = outputs[:2]
        preds = logits.detach().cpu().numpy()
        out_label_ids = inputs["labels"].detach().cpu().numpy()
        preds = np.argmax(preds,axis=-1)
        val_acc = accuracy_score(preds,out_label_ids)
        val_f1 = torch.tensor(f1_score(y_true=out_label_ids,y_pred=preds,average="micro"))
        val_acc = torch.tensor(val_acc)
        return {"val_loss": tmp_eval_loss.detach().cpu(),"val_acc":val_acc,"val_f1":val_f1, "pred": preds, "target": out_label_ids}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        avg_val_f1 = torch.stack([x['val_f1'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss, 'avg_val_acc': avg_val_acc,'avg_val_f1':avg_val_f1}
        return {'val_loss': avg_loss, 'progress_bar': tensorboard_logs,'avg_val_acc':avg_val_acc ,'avg_val_f1':avg_val_f1}

    def test_step(self, batch, batch_nb):
        return self.validation_step(batch,batch_nb)

    def test_epoch_end(self, outputs):
        return self.validation_epoch_end(outputs)
    
    def configure_optimizers(self):
        return torch.optim.Adam([p for p in self.parameters() if p.requires_grad], lr=2e-05, eps=1e-08)

    def train_dataloader(self):
        return self._train_dataloader

    def val_dataloader(self):
        return self._test_dataloader

    def test_dataloader(self):
        return self._test_dataloader

In [12]:
def train(data_folder,num_labels):
  """
  Training wrapper
  """
  train_dataloader,test_dataloader = get_train_test_data_loaders(data_folder)
  bert_finetuner = TransformersFinetuner(model_name_or_path='bert-base-uncased',
                                         num_labels=num_labels,
                                         train_dataloader=train_dataloader,
                                         test_dataloader=test_dataloader)
  early_stop_callback = EarlyStopping(
    monitor='avg_val_acc',
    min_delta=0.00,
    patience=3,
    verbose=False,
    mode='max'
  )
  # most basic trainer, uses good defaults (1 gpu)
  trainer = pl.Trainer(gpus=1,max_epochs=10,early_stop_callback=early_stop_callback,weights_summary=None,logger=None)    
  trainer.fit(bert_finetuner) 
  return bert_finetuner,trainer

In [None]:
#avg_val_acc=0.991, avg_val_f1=0.991

In [13]:
from tqdm import tqdm_notebook

def predict(model,dataloader,label_list=None,device='cuda'):
  """
  Returns Model predictions and probabilities for a given dataloader 
  """
  device = torch.device(device)
  model.to(device)
  model.eval()
  probabilities=[]
  predictions=[]
  for dl_idx, batch in tqdm_notebook(enumerate(dataloader)):
    with torch.no_grad():
      inputs = {"input_ids": batch[0].to(device), "attention_mask": batch[1].to(device), "labels": batch[3].to(device)}
      output = model(**inputs)
      logits = output[1]
      probs = torch.softmax(logits,dim=-1)
      probs = probs.detach().cpu().numpy()
      probabilities.extend(probs)
      if label_list:
        batch_predictions = list(map(lambda idx:label_list[idx],np.argmax(probs,axis=-1)))
      else:
        batch_predictions = np.argmax(probs,axis=-1)
      predictions.extend(batch_predictions)
  return predictions,probabilities


In [14]:
# train_dataloader,test_dataloader = get_data_loaders("./data")
# predictions,probs = predict(bert_finetuner,test_dataloader)

## Training on 100% data

To later compare on the performance on the full training dataset, we will do the training once on full training dataset.

In [15]:
label_list = ATISIntentProcessor().get_labels()
bert_finetuner,trainer = train("./",num_labels=len(label_list))

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validation sanity check', layout=Layout…



HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Training', layout=Layout(flex='2'), max…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

Saving latest checkpoint..





In [16]:
test_results = trainer.test(bert_finetuner,verbose=False)
print(f"Accuracy :{test_results[0]['avg_val_acc']}, F1: {test_results[0]['avg_val_f1']}")

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Testing', layout=Layout(flex='2'), max=…


Accuracy :0.99375, F1: 0.99375


# Active learning Iterations

Now, that we know what is possible with 100% of labelled training data. We will try to achieve the same with only part of the data using Active learning.


### Active learning iteration utilities

In [17]:
from sklearn.model_selection import train_test_split
from scipy.stats import entropy

def active_learning_iteration(iteration_folder):
  """
  Runs training on the iteration train dataset and returns predictions, entropies of
  the unlabelled dataset.
  """
  label_list = ATISIntentProcessor().get_labels()
  model,trainer = train(iteration_folder,len(label_list))
  unlabelled_dataloader = get_data_loader_from_file(os.path.join(iteration_folder,"unlabelled.csv"))
  # get predictions and probabilities of the trained model on unlabbeled dataset
  predictions,probs = predict(model,unlabelled_dataloader,label_list)
  unlabelled_df = pd.read_csv(os.path.join(iteration_folder,"unlabelled.csv"))
  unlabelled_df["predictions"] = predictions
  unlabelled_df["probs"] = probs
  unlabelled_df["entropy"] = entropy(probs,axis=-1)
  return unlabelled_df,trainer,model

def do_label(unlabelled_df,iter_perc=0.1):
  """
  A function that mimicks the realworld labelling task. This takes top n % of 
  most uncertain (high entropy) rows from unlabelled df and returns 
  the labells for them
  """
  sorted_unlabelled_df = unlabelled_df.sort_values(by=['entropy'],ascending=False,ignore_index=True)
  labeled_rows_count = round(len(unlabelled_df)*(0.1))
  labelled_rows,unlabelled_df = sorted_unlabelled_df.loc[:labeled_rows_count,:],sorted_unlabelled_df.loc[labeled_rows_count:,:]
  return labelled_rows,unlabelled_df


## Iteration 1

We will select only 5% of the training dataset to train in the first iteration, Remaining training dataset is considered as unlabbeled.

In [18]:
train_df = pd.read_csv("./train.csv")
test_df = pd.read_csv("./test.csv")

# Select only 5% of the train dataset to train in the first iteration, Remaining 
# dataset is considered as unlablled.
unlabelled_df,train_iter_1_df = train_test_split(train_df,random_state=42,test_size=0.05)

iteration_folder = "./data/iteration_1"
if not os.path.exists(iteration_folder):
  os.makedirs(iteration_folder)

# saving the iteration files in folder data folder corresponding to iteration.
unlabelled_df.to_csv(os.path.join(iteration_folder,"unlabelled.csv"),index=False)
train_iter_1_df.to_csv(os.path.join(iteration_folder,"train.csv"),index=False)
test_df.to_csv(os.path.join(iteration_folder,"test.csv"),index=False)

print(f"Number of samples in train dataset : {len(train_iter_1_df)}")
print(f"Percentage of samples from original training set : {(len(train_iter_1_df)/len(train_df))*100} %")
print(f"Number of unlabelled samples : {len(unlabelled_df)}")
print(f"Number of test samples : {len(test_df)}")

# Run active learning iteration - #1, Training is done on 5% of original dataset 
# and prediction is done on the unlabbeled data split to pick the uncertain
# predicitons
unlabelled_df,trainer,model = active_learning_iteration(iteration_folder)

Number of samples in train dataset : 242
Percentage of samples from original training set : 5.0062060405461315 %
Number of unlabelled samples : 4592
Number of test samples : 800


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validation sanity check', layout=Layout…



HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Training', layout=Layout(flex='2'), max…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

Saving latest checkpoint..





Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  if sys.path[0] == '':


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




#### Uncertainity sampling and labelling

We pick the top 10% most unlablled samples that our model is uncertain about using entropy and label those predictions.

In [19]:
# Top 10% most uncertain samples are labelled from the unlabbeled dataset 
labelled_df,unlabbeled_iter_1_df = do_label(unlabelled_df,iter_perc=0.1)

#### Iteration 1 - Test results

In [20]:
test_results = trainer.test(model,verbose=False)
print(f"Accuracy :{test_results[0]['avg_val_acc']}, F1: {test_results[0]['avg_val_f1']}")

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Testing', layout=Layout(flex='2'), max=…


Accuracy :0.915, F1: 0.915


## Iteration 2

We add the newly labelled (most uncertain) samples to the previous iteration training dataset and re-run our training and predict loop.

In [21]:

# Append newly labelled dataset from previous iteration to the previous iteration training dataset
train_iter_2_df = train_iter_1_df.append(labelled_df)
iteration_folder = "./data/iteration_2"
if not os.path.exists(iteration_folder):
  os.makedirs(iteration_folder)
unlabbeled_iter_1_df.to_csv(os.path.join(iteration_folder,"unlabelled.csv"),index=False)
train_iter_2_df.to_csv(os.path.join(iteration_folder,"train.csv"),index=False)
test_df.to_csv(os.path.join(iteration_folder,"test.csv"),index=False)

print(f"Number of samples in train dataset : {len(train_iter_2_df)}")
print(f"Percentage of samples from original training set : {(len(train_iter_2_df)/len(train_df))*100} %")
print(f"Number of unlabelled samples : {len(unlabelled_df)}")
print(f"Number of test samples : {len(test_df)}")

# Iteration is run with newly labelled dataset + previous iteration train dataset
unlabelled_df,trainer,model = active_learning_iteration(iteration_folder)

Number of samples in train dataset : 702
Percentage of samples from original training set : 14.52213487794787 %
Number of unlabelled samples : 4592
Number of test samples : 800


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validation sanity check', layout=Layout…



HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Training', layout=Layout(flex='2'), max…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

Saving latest checkpoint..





Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  if sys.path[0] == '':


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




#### Uncertainity sampling and labelling

We pick the top 10% most unlablled samples that our model is uncertain about using entropy and label those predictions.

In [22]:
labelled_df,unlabbeled_iter_2_df = do_label(unlabelled_df,iter_perc=0.1)

#### Iteration 2 - Test results

In [23]:
test_results = trainer.test(model,verbose=False)
print(f"Accuracy :{test_results[0]['avg_val_acc']}, F1: {test_results[0]['avg_val_f1']}")

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Testing', layout=Layout(flex='2'), max=…


Accuracy :0.98, F1: 0.98


## Iteration 3

In [27]:
train_iter_3_df = train_iter_2_df.append(labelled_df)
iteration_folder = "./data/iteration_3"
if not os.path.exists(iteration_folder):
  os.makedirs(iteration_folder)
unlabelled_df.to_csv(os.path.join(iteration_folder,"unlabelled.csv"),index=False)
train_iter_3_df.to_csv(os.path.join(iteration_folder,"train.csv"),index=False)
test_df.to_csv(os.path.join(iteration_folder,"test.csv"),index=False)

print(f"Number of samples in train dataset : {len(train_iter_3_df)}")
print(f"Percentage of samples from original training set : {(len(train_iter_3_df)/len(train_df))*100} %")
print(f"Number of unlabelled samples : {len(unlabelled_df)}")
print(f"Number of test samples : {len(test_df)}")

# Iteration is run with newly labelled dataset + previous iteration train dataset
unlabelled_df,trainer,model = active_learning_iteration(iteration_folder)

Number of samples in train dataset : 1116
Percentage of samples from original training set : 23.086470831609432 %
Number of unlabelled samples : 4133
Number of test samples : 800


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validation sanity check', layout=Layout…



HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Training', layout=Layout(flex='2'), max…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Validating', layout=Layout(flex='2'), m…

Saving latest checkpoint..





Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  if sys.path[0] == '':


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




#### Uncertainity sampling and labelling

We pick the top 10% most unlablled samples that our model is uncertain about using entropy and label those predictions.

In [25]:
labelled_df,unlabbeled_iter_3_df = do_label(unlabelled_df,iter_perc=0.1)

#### Iteration 3 - Test results

In [26]:
test_results = trainer.test(model,verbose=False)
print(f"Accuracy :{test_results[0]['avg_val_acc']}, F1: {test_results[0]['avg_val_f1']}")

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Testing', layout=Layout(flex='2'), max=…


Accuracy :0.99, F1: 0.99


# Conclusion

By labelling and training on just **23%** of the trianing data , We were able to achieve the same accuracy as labelling the full dataset. When starting a labelling from scratch, It is always a good idea to consider active learning to reduce the labelling effort.

One thing to note here is that this was possible by the combination of active learning, powerful pretrained model like BERT and the nature of the dataset . The results could vary based on the dataset.

# Credits

The technique demoed here and discussed on the blog post is based on the book [Human in the loop machine learning](https://www.manning.com/books/human-in-the-loop-machine-learning?utm_source=affiliates&utm_medium=affiliates&a_aid=logesh) by Robert Munro