<a href="https://colab.research.google.com/github/madhavjk/AI/blob/main/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a simple classifier on top of BERT

In [None]:
# We'll use Hugging Face's Transformers package
!pip install transformers
import transformers
from transformers import BertPreTrainedModel, BertTokenizer, BertModel, BertConfig
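try:
    # Module path in transformers 4.x and later
    from transformers.models.bert.modeling_bert import BertPooler
except ImportError:
    # Module path in transformers 3.x and earlier
    from transformers.modeling_bert import BertPooler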
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset



First, we prepare a small dataset: ten sentences about COVID-19 and ten sentences about BERT. By leveraging the information already contained in the pretrained model, we aim to train a classifier that generalizes well beyond these few training examples. We store the data in a pandas DataFrame.

In [None]:
import pandas as pd

covid_texts = [
  "Approximately 90 days of the SARS-CoV-2 (COVID-19) spreading originally from Wuhan, China, and across the globe has led to a widespread chain of events with imminent threats to the fragile relationship between community health and economic health.",
  "Despite near hourly reporting on this crisis, there has been no regular, updated, or accurate reporting of hospitalizations for COVID-19.",
  "It is known that many test-positive individuals may not develop symptoms or have a mild self-limited viral syndrome consisting of fever, malaise, dry cough, and constitutional symptoms.",
  "However some individuals develop a more fulminant syndrome including viral pneumonia, respiratory failure requiring oxygen, acute respiratory distress syndrome requiring mechanical ventilation, and in substantial fractions leading to death attributable to COVID-19.",
  "The pandemic is evolving in a clustered, non-inform fashion resulting in many hospitals with preparedness but few or no cases, and others that are completely overwhelmed.",
  "Thus, a considerable risk of spread when personal protection equipment becomes exhausted and a large fraction of mortality in those not offered mechanical ventilation are both attributable to a crisis due to maldistribution of resources.",
  "The pandemic is amenable to self-reporting through a mobile phone application that could obtain critical information on suspected cases and report on the results of self testing and actions taken.",
  "The only method to understand the clustering and the immediate hospital resource needs is mandatory, uniform, daily reporting of hospital censuses of COVID-19 cases admitted to hospital wards and intensive care units.",
  "Current reports of hospitalizations are delayed, uncertain, and wholly inadequate.",
  "This paper urges all the relevant stakeholders to take up self-reporting and reporting of hospitalizations of COVID-19 as an urgent task in combating this devastating pandemic."]
bert_texts = [
  "Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP (Natural Language Processing) pre-training developed by Google.",
  "BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google.",
  "Google is leveraging BERT to better understand user searches.",
  "The reasons for BERT's state-of-the-art performance on these natural language understanding tasks are not yet well understood.",
  "Current research has focused on investigating the relationship behind BERT's output as a result of carefully chosen input sequences,[6][7] analysis of internal vector representations through probing classifiers,[8][9] and the relationships represented by attention weights.",                  
  "BERT has its origins from pre-training contextual representations including Semi-supervised Sequence Learning,[10] Generative Pre-Training, ELMo,[11] and ULMFit.[12]",
  "Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.",
  "Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, where BERT is deeply bidirectional.",
  "On October 25, 2019, Google Search announced that they had started applying BERT models for English language search queries within the US.[13]",
  "On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages.[14]"
]

raw_data = pd.DataFrame({"text": covid_texts + bert_texts,
                      "label": ["COVID-19"]*len(covid_texts) + ["machine-learning"]*len(bert_texts)})
raw_data

Unnamed: 0,text,label
0,Approximately 90 days of the SARS-CoV-2 (COVID...,COVID-19
1,"Despite near hourly reporting on this crisis, ...",COVID-19
2,It is known that many test-positive individual...,COVID-19
3,However some individuals develop a more fulmin...,COVID-19
4,"The pandemic is evolving in a clustered, non-i...",COVID-19
5,"Thus, a considerable risk of spread when perso...",COVID-19
6,The pandemic is amenable to self-reporting thr...,COVID-19
7,The only method to understand the clustering a...,COVID-19
8,Current reports of hospitalizations are delaye...,COVID-19
9,This paper urges all the relevant stakeholders...,COVID-19


Create a Torch [map-style dataset](https://pytorch.org/docs/stable/data.html#map-style-datasets) that processes our dataframe using the appropriate BERT tokenizer.

In [None]:
class ClassificationDataset(Dataset):
  def __init__(self, dataframe, tokenizer):
    super(ClassificationDataset, self).__init__()
    self.data = dataframe
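    # NOTE: set iteration order for strings can change between Python runs (hash randomization),
    # so these label ids are not reproducible; use sorted(set(...)) if stable ids matter.
    self.label_mapping = {label: i for i, label in enumerate(set(dataframe["label"]))}
    self.id2label = {i: label for i, label in enumerate(set(dataframe["label"]))}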

    self.tokenizer = tokenizer

    # Determine the maximum length of a sequence in the training data.
    # This approach encodes the dataset twice, which could be avoided if necessary.
    encoded_texts = [tokenizer.encode(text) for text in dataframe["text"]]
    max_len = max(len(text) for text in encoded_texts)
    # padding="max_length" replaces the deprecated pad_to_max_length=True (transformers >= 3.0)
    self.encoded_seqs = [tokenizer.encode_plus(text, padding="max_length", max_length=max_len)
                         for text in dataframe["text"]]

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    seq = self.encoded_seqs[index]
    return {"text": torch.tensor(seq["input_ids"]),
            "attention_mask": torch.tensor(seq["attention_mask"]),
            "label": torch.tensor(self.label_mapping[self.data.loc[index].label])}

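To make the padding step concrete, here is a minimal plain-Python sketch of what padding every sequence to the longest one produces. The whitespace tokenizer `toy_encode`, its vocabulary, and `pad_batch` are hypothetical stand-ins for the BERT tokenizer (which additionally inserts special tokens such as `[CLS]` and `[SEP]`). Only the shape of the result, `input_ids` plus a 0/1 `attention_mask`, mirrors the dataset above.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding in this toy vocabulary


def toy_encode(text, vocab):
    """Hypothetical whitespace tokenizer: each new word gets the next free id (starting at 1)."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]


def pad_batch(texts):
    """Encode every text, then right-pad all sequences to the longest one."""
    vocab = {}
    encoded = [toy_encode(text, vocab) for text in texts]
    max_len = max(len(ids) for ids in encoded)
    batch = []
    for ids in encoded:
        n_pad = max_len - len(ids)
        batch.append({
            "input_ids": ids + [PAD_ID] * n_pad,
            # 1 marks a real token, 0 marks padding the model should ignore
            "attention_mask": [1] * len(ids) + [0] * n_pad,
        })
    return batch


for item in pad_batch(["BERT reads text", "BERT reads much longer text here"]):
    print(item)
```

Because every item ends up with the same length, the default `DataLoader` collation can stack them into rectangular tensors without a custom collate function.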

Here we create our own custom "head" that wraps the model.

Using `BertForSequenceClassification` here instead would also be fine. It would be worth comparing, side by side, how quickly this randomly initialized classifier converges against `BertForSequenceClassification`; the difference may well be small.

In [None]:
class BertClassifier(BertPreTrainedModel):
    """ A modification of BertForSequenceClassification which doesn't use the parameters trained on NSP,
    instead re-initializing params.
    """
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)
        self.pooler = BertPooler(config)

        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        ):

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        hiddens = outputs[0]
        pooled_output = self.pooler(hiddens)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
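
Stripped of the framework, the head above does three things: it takes the hidden vector of the first token, passes it through a dense layer followed by tanh (which is what `BertPooler` does), and applies a linear layer to get one logit per label. The sketch below shows that in plain Python. The helper names and the tiny weight matrices are made up for illustration, and dropout is left out.

```python
import math


def matvec(weights, bias, vec):
    """Compute weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def pool_and_classify(hidden_states, pool_w, pool_b, clf_w, clf_b):
    """hidden_states holds one sequence as a list of per-token vectors."""
    first_token = hidden_states[0]  # position 0 is the [CLS] token in BERT inputs
    pooled = [math.tanh(v) for v in matvec(pool_w, pool_b, first_token)]
    return matvec(clf_w, clf_b, pooled)  # one logit per label


def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
hidden = [[0.5, -0.5], [9.0, 9.0]]  # two tokens, hidden size 2 (made-up values)
logits = pool_and_classify(hidden, identity, zeros, identity, zeros)
print(logits, cross_entropy(logits, label=0))
```

Note that only the first token's vector reaches the classifier; the other token vectors influence the prediction only through the attention layers inside BERT.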

Best practice is to shuffle the training examples, especially since our dataframe lists all COVID-19 sentences before all BERT sentences. We leave that as an exercise for the reader.

We'd also run this on a GPU in the real world.

In [None]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertClassifier.from_pretrained(model_name)
config = BertConfig.from_pretrained(model_name)

torch_dataset = ClassificationDataset(raw_data, tokenizer)
torch_dataloader = DataLoader(torch_dataset, batch_size=4) #Hint: Shuffle most easily applied here

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()

for epoch in range(10):
    print("epoch: {}".format(epoch))
    for batch in torch_dataloader:
        optimizer.zero_grad()
        loss, logits =  model(batch["text"], attention_mask=batch["attention_mask"], labels=batch["label"])
        loss.backward()
        optimizer.step()
    # Report the loss of the last batch in this epoch
    print("loss: {}".format(loss.item()))

epoch: 0
loss: 1.374000072479248
epoch: 1
loss: 0.32587191462516785
epoch: 2
loss: 0.06642933189868927
epoch: 3
loss: 0.025280989706516266
epoch: 4
loss: 0.012373460456728935
epoch: 5
loss: 0.007129683159291744
epoch: 6
loss: 0.006250674836337566
epoch: 7
loss: 0.004350499715656042
epoch: 8
loss: 0.0033491668291389942
epoch: 9
loss: 0.0033215307630598545
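
On the shuffling exercise: with 20 examples ordered by class and a batch size of 4, four of the five unshuffled batches contain a single class. In PyTorch the usual fix is passing `shuffle=True` to `DataLoader`. Conceptually that amounts to drawing a fresh permutation of the example indices each epoch before batching, as in this plain-Python sketch (the `shuffled_batches` helper is hypothetical, not part of PyTorch):

```python
import random


def shuffled_batches(n_examples, batch_size, rng):
    """Yield lists of example indices covering every example once, in random order."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
for epoch in range(2):
    batches = list(shuffled_batches(n_examples=20, batch_size=4, rng=rng))
    print("epoch", epoch, batches)
```

Each epoch sees every example exactly once, but the composition of the batches changes from epoch to epoch.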


# Applying the trained model to sentences with words not seen in classification training

Now let's create some novel sentences that don't use vocabulary from the training set!

In [None]:
all_covid = " ".join(covid_texts)
all_covid

'Approximately 90 days of the SARS-CoV-2 (COVID-19) spreading originally from Wuhan, China, and across the globe has led to a widespread chain of events with imminent threats to the fragile relationship between community health and economic health. Despite near hourly reporting on this crisis, there has been no regular, updated, or accurate reporting of hospitalizations for COVID-19. It is known that many test-positive individuals may not develop symptoms or have a mild self-limited viral syndrome consisting of fever, malaise, dry cough, and constitutional symptoms. However some individuals develop a more fulminant syndrome including viral pneumonia, respiratory failure requiring oxygen, acute respiratory distress syndrome requiring mechanical ventilation, and in substantial fractions leading to death attributable to COVID-19. The pandemic is evolving in a clustered, non-inform fashion resulting in many hospitals with preparedness but few or no cases, and others that are completely ove

Do any salient words in the following test sentence occur in the COVID-19 training data?

In [None]:
covid_test = "coronaviruses afflict the lungs and can be deadly"

In [None]:
fmt = "{: <14}{}"
print(fmt.format("Word", "In train?"))
print("-----------------------")
for tok in covid_test.split():
  print(fmt.format(tok, tok in all_covid))

Word          In train?
-----------------------
coronaviruses False
afflict       False
the           True
lungs         False
and           True
can           False
be            True
deadly        False


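The only words from this sentence found in the COVID-19 training data are stopwords. (The check above is a substring test on the concatenated text, so if anything it overstates the overlap.)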
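
Because `tok in all_covid` tests for a substring of one long string, a short token such as "be" counts as present whenever it occurs inside a longer word such as "been". A stricter whole-word variant could look like the sketch below. The `words_in_corpus` helper and its simple regex tokenizer are assumptions made for illustration; they are not BERT's WordPiece tokenization, and the corpus string is a shortened example.

```python
import re


def words_in_corpus(sentence, corpus):
    """Return (token, seen) pairs using whole-word, case-insensitive matching."""
    vocabulary = set(re.findall(r"[a-z0-9]+", corpus.lower()))
    return [(tok, tok.lower() in vocabulary) for tok in sentence.split()]


corpus = "There has been no regular reporting of the cases."
for tok, seen in words_in_corpus("the lungs can be affected", corpus):
    print("{: <14}{}".format(tok, seen))
```

This tokenizer splits on anything that is not a letter or digit, so hyphenated terms are broken into their parts; that is good enough for a vocabulary overlap check but not a substitute for the model's own tokenizer.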

Now let's create a sentence about machine learning that doesn't contain any relevant words from the classification training data.

In [None]:
ml_test = "Engineers at Amazon have also made numerous automatic speech recognition advances"
all_bert = " ".join(bert_texts)
print(fmt.format("Word", "In train?"))
print("-----------------------")
for tok in ml_test.split():
  print(fmt.format(tok, tok in all_bert))

Word          In train?
-----------------------
Engineers     False
at            True
Amazon        False
have          False
also          False
made          False
numerous      False
automatic     False
speech        False
recognition   False
advances      False


Let's run these through the classifier and see what we get.

In [None]:
test_dataframe = pd.DataFrame({"text": [covid_test, ml_test],
                               "label": ["COVID-19", "machine-learning"]})
test_dataset = ClassificationDataset(test_dataframe, tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size=2)

In [None]:
model.eval()

for batch in test_dataloader:
    # Inference only: wrapping this call in `with torch.no_grad():` would skip gradient tracking.
    logits, *_ = model(batch["text"], attention_mask=batch["attention_mask"])
    print(torch.softmax(logits, dim=1))
    for seq_logits in logits:
        label_id = torch.argmax(seq_logits)
        print(test_dataset.id2label[label_id.item()])

tensor([[0.0040, 0.9960],
        [0.9963, 0.0037]], grad_fn=<SoftmaxBackward>)
COVID-19
machine-learning
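
One fragile spot in the cells above: `test_dataset` builds its own `id2label` from its own labels, so predictions decode correctly only if that mapping happens to match the one used during training. A sturdier pattern is to build the mapping once from the sorted training labels and reuse it for decoding. The sketch below uses hypothetical helper names and made-up probability rows, not the notebook's actual output.

```python
def build_label_maps(labels):
    """Create deterministic label<->id maps by sorting the unique labels."""
    unique = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(unique)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def decode(prob_rows, id2label):
    """Map each row of class probabilities to its highest-scoring label."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]


train_labels = ["COVID-19"] * 10 + ["machine-learning"] * 10
label2id, id2label = build_label_maps(train_labels)
print(label2id)
print(decode([[0.9, 0.1], [0.2, 0.8]], id2label))
```

With a mapping like this stored alongside the model, a test set containing only one of the labels, or none at all, can still be decoded consistently.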
