# ALBERT Fine-Tuning for Quora Dataset

Switching from BERT to ALBERT requires very few changes relative to the notebook that uses BERT for sequence classification.

There are only two code changes required to switch the code from using BERT to ALBERT:
1. From the HuggingFace `transformers` library, we've replaced the classes:
    * `BertTokenizer` --> `AlbertTokenizer`
    * `BertForSequenceClassification` --> `AlbertForSequenceClassification`
2. For our pre-trained model, we have replaced `"bert-base-uncased"` with `"albert-xxlarge-v1"`, the checkpoint loaded in the code below.

In [None]:
import os
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
df = pd.read_csv("gdrive/MyDrive/train.csv.zip", compression='zip', low_memory=False)
print(f"Number of texts: {df.shape[0]}")

Number of texts: 1306122


Training on the full dataset would take too long, so we train on only 10% of it. As we will see, that is enough to reach a good score.

Because the classes in our dataset are skewed, we perform a stratified split.

In [None]:
df, test_df = train_test_split(df, random_state=42, train_size=.1, stratify=df.target.values)
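
To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of stratified sampling. It is not how scikit-learn implements it; the helper `stratified_sample` and the 94% / 6% toy skew are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, fraction, seed=42):
    """Pick `fraction` of the indices from every class separately (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        # Sample the same share from each class so the class ratio is preserved.
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels with an illustrative skew (not measured from the Quora data).
toy_labels = [0] * 940 + [1] * 60
sample_idx = stratified_sample(toy_labels, fraction=0.1)
sample_counts = Counter(toy_labels[i] for i in sample_idx)
print(sample_counts)
```

The 10% sample keeps the same class ratio as the full toy list, which is the property we want when the positive class is rare.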

In [None]:
df.head()

Unnamed: 0,qid,question_text,target
104651,147ea801de098a0e692f,If we trade in hourly timefram how we can pred...,0
416131,518d27683385952ea3b6,Is there any testing or coaching that helps pe...,0
1218668,eed982cc6e78e2b7dfd9,What is Norton 360 useful for?,0
341531,42e655c6a196c459c0d0,Cell wall of fungi made up of which substance?,0
1056479,cf03b5238f820d580187,"As a parent, which martial arts class would yo...",0


In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device.

In [None]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
sentences = df.question_text.values
labels = df.target.values

In [None]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)

In [None]:
from transformers import AlbertTokenizer

# Load the ALBERT tokenizer.
print('Loading ALBERT tokenizer...')
tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v1')

Loading ALBERT tokenizer...






In [None]:
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  If we trade in hourly timefram how we can predict what happen in 15 minutes timefram?
Tokenized:  ['▁if', '▁we', '▁trade', '▁in', '▁hour', 'ly', '▁time', 'fra', 'm', '▁how', '▁we', '▁can', '▁predict', '▁what', '▁happen', '▁in', '▁15', '▁minutes', '▁time', 'fra', 'm', '?']
Token IDs:  [100, 95, 1238, 19, 1671, 102, 85, 8691, 79, 184, 95, 92, 9584, 98, 2384, 19, 357, 902, 85, 8691, 79, 60]


In [None]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# For every sentence...
for sent in tqdm(sentences):
    # `encode` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    encoded_sent = tokenizer.encode(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'

                        # This function also supports truncation and conversion
                        # to pytorch tensors, but we need to do padding, so we
                        # can't use these features :( .
                        #max_length = 128,          # Truncate all sentences.
                        #return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    input_ids.append(encoded_sent)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

100%|██████████| 130612/130612 [00:28<00:00, 4643.14it/s]

Original:  If we trade in hourly timefram how we can predict what happen in 15 minutes timefram?
Token IDs: [2, 100, 95, 1238, 19, 1671, 102, 85, 8691, 79, 184, 95, 92, 9584, 98, 2384, 19, 357, 902, 85, 8691, 79, 60, 3]





In [None]:
print('Max sentence length: ', max([len(sen) for sen in input_ids]))

Max sentence length:  178


In [None]:
# Set the maximum sequence length.
# I've chosen 128 for speed

MAX_LEN = 128

print('\nPadding/truncating all sentences to %d values...' % MAX_LEN)

print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

# Pad our input tokens with value 0.
# "post" indicates that we want to pad and truncate at the end of the sequence,
# as opposed to the beginning.
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")

print('\nDone.')


Padding/truncating all sentences to 128 values...

Padding token: "<pad>", ID: 0

Done.
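
As a rough illustration of what `pad_sequences(..., truncating="post", padding="post")` does to each sequence, here is a minimal pure-Python sketch. The helper `pad_or_truncate` is hypothetical and only mirrors the behavior for a single list of IDs.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut the sequence at max_len, then fill the tail with pad_id (hypothetical helper)."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

short_ids = [2, 100, 95, 3]          # [CLS] ... [SEP], shorter than the limit
long_ids = [2] + list(range(10, 20)) + [3]  # longer than the limit

print(pad_or_truncate(short_ids, 8))  # [2, 100, 95, 3, 0, 0, 0, 0]
print(pad_or_truncate(long_ids, 8))   # [2, 10, 11, 12, 13, 14, 15, 16]
```

Note that cutting at the end also removes the trailing `[SEP]` token (ID 3) from any sequence longer than the limit. With a maximum length of 178 in our data and `MAX_LEN = 128`, this affects the longest questions.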


In [None]:
# Create attention masks
attention_masks = []

# For each sentence...
for sent in tqdm(input_ids):
    
    # Create the attention mask.
    #   - If a token ID is 0, then it's padding, set the mask to 0.
    #   - If a token ID is > 0, then it's a real token, set the mask to 1.
    att_mask = [int(token_id > 0) for token_id in sent]
    
    # Store the attention mask for this sentence.
    attention_masks.append(att_mask)

100%|██████████| 130612/130612 [00:09<00:00, 13318.48it/s]
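
The `token_id > 0` test works here because the padding ID printed earlier is 0. A slightly more defensive sketch compares against the padding ID explicitly; the helper name `attention_mask` is an assumption, and in the notebook we would pass `tokenizer.pad_token_id` as `pad_id`.

```python
def attention_mask(ids, pad_id):
    """Return 1 for real tokens and 0 for padding positions (hypothetical helper)."""
    return [int(token_id != pad_id) for token_id in ids]

padded = [2, 100, 95, 3, 0, 0]
print(attention_mask(padded, pad_id=0))  # [1, 1, 1, 1, 0, 0]
```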


In [None]:
# Use 95% for training and 5% for validation (test_size=0.05).
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, 
                                                            random_state=2021, test_size=0.05)
# Do the same for the masks.
train_masks, validation_masks, _, _ = train_test_split(attention_masks, labels,
                                             random_state=2021, test_size=0.05)
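
The two calls above stay aligned only because they share the same `random_state` and the same number of rows. An alternative that makes the alignment explicit is to split a list of indices once and reuse it for every array. The sketch below uses only the standard library; `split_indices` is a hypothetical helper, and its shuffle order does not reproduce scikit-learn's.

```python
import random

def split_indices(n_items, test_fraction, seed=2021):
    """Shuffle 0..n_items-1 once and return (train_idx, val_idx) (hypothetical helper)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = round(n_items * test_fraction)
    return indices[n_val:], indices[:n_val]

# Toy data: the second token of each row equals the row number.
toy_inputs = [[2, i, 3] for i in range(20)]
toy_masks = [[1, 1, 1] for _ in range(20)]
toy_labels = [i % 2 for i in range(20)]

train_idx, val_idx = split_indices(len(toy_inputs), test_fraction=0.05)

# The same index lists select matching rows from every array.
toy_train_inputs = [toy_inputs[i] for i in train_idx]
toy_train_masks = [toy_masks[i] for i in train_idx]
toy_train_labels = [toy_labels[i] for i in train_idx]
print(len(train_idx), len(val_idx))  # 19 1
```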

In [None]:
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

In [None]:
batch_size = 16

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

## ALBERT for sequence classification

*Note: This section has been revised for ALBERT.*

For this task, we first modify the pre-trained ALBERT model to produce classification outputs, and then continue training it on our dataset until the entire model, end-to-end, is well-suited to our task.

Thankfully, the Hugging Face PyTorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of the same pre-trained ALBERT model, each has different top layers and output types designed to accommodate its specific NLP task.

Here is the current list of classes provided for fine-tuning:
* AlbertForMaskedLM
* **AlbertForSequenceClassification** - The one we'll use.
* AlbertForQuestionAnswering

The documentation for these can be found [here](https://huggingface.co/transformers/model_doc/albert.html).

In [None]:
from transformers import AlbertForSequenceClassification, AdamW, AlbertConfig

# Load AlbertForSequenceClassification, the pretrained ALBERT model with a
# single linear classification layer on top.
model = AlbertForSequenceClassification.from_pretrained(
    "albert-xxlarge-v1", # The xxlarge checkpoint (hidden size 4096), matching
                         # the tokenizer loaded earlier.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Move the model to the device selected earlier (the GPU, if one is available).
model.to(device)





Some weights of the model checkpoint at albert-xxlarge-v1 were not used when initializing AlbertForSequenceClassification: ['predictions.bias', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.dense.bias', 'predictions.decoder.weight', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-xxlarge-v1 and are newly initialized: ['classifier.weight', 'classifier.bias']
Y

AlbertForSequenceClassification(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=4096, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((4096,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=4096, out_features=4096, bias=True)
                (key): Linear(in_features=4096, out_features=4096, bias=True)
                (value): Linear(in_features=4096, out_featur

Just for curiosity's sake, we can browse all of the model's parameters by name here.

In the cell below, I've printed out the names and dimensions of the weights for:

1. The embedding layer.
2. The transformer encoder layer. 
    * In ALBERT, this layer is replicated 12 times, with the same weights used in every layer!
3. The output layer.
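
To get a feel for how much ALBERT's design saves, here is a small worked calculation. It uses only the shapes visible in the model summary above (`Embedding(30000, 128)`, `Linear(128 -> 4096)`, and the `4096 x 4096` query projection) together with the 12 layers mentioned in the list. The "unfactorized" and "unshared" figures are hypothetical comparisons, not measurements of a real model.

```python
vocab_size, embed_dim, hidden_dim, num_layers = 30000, 128, 4096, 12

# Factorized embeddings: a small embedding table plus a projection (weight + bias).
factorized = vocab_size * embed_dim + (embed_dim * hidden_dim + hidden_dim)
# Hypothetical alternative: embed tokens directly at the hidden size.
unfactorized = vocab_size * hidden_dim

# Cross-layer sharing: the query projection (weight + bias) is stored once...
query_shared = hidden_dim * hidden_dim + hidden_dim
# ...instead of once per layer.
query_unshared = num_layers * query_shared

print(f"Embedding parameters, factorized:   {factorized:,}")
print(f"Embedding parameters, unfactorized: {unfactorized:,}")
print(f"Query projection, shared:           {query_shared:,}")
print(f"Query projection, one per layer:    {query_unshared:,}")
```

The same sharing applies to every other weight in the encoder layer, which is why the parameter listing is so short for such a wide model.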

In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The ALBERT-base model has only {:} different named parameters.\n'.format(len(params)))

total_params = 0

# For each parameter...
for i in range(0, len(params)):
    
    # Look up its name and size.
    p_name = params[i][0]
    p_size = params[i][1].size()

    # Tally up the total number of individual weights.
    num_elements = params[i][1].numel()
    total_params += num_elements

    # Print section headers between the three groups.
    if i == 0:
        print('==== Embedding Layer ====\n')
    elif i == 7:
        print('\n==== Transformer Encoder ====\n')
    elif i == 23:
        print('\n==== Output Layer ====\n')

    # Print out the parameter's index, name, and size.
    print("{:>2}. {:<82} {:>12}".format(i, p_name, str(p_size)))

print('\nALBERT-base has {:,} unique parameter values.'.format(total_params))

The ALBERT-base model has only 27 different named parameters.

==== Embedding Layer ====

 0. albert.embeddings.word_embeddings.weight                                           torch.Size([30000, 128])
 1. albert.embeddings.position_embeddings.weight                                       torch.Size([512, 128])
 2. albert.embeddings.token_type_embeddings.weight                                     torch.Size([2, 128])
 3. albert.embeddings.LayerNorm.weight                                                 torch.Size([128])
 4. albert.embeddings.LayerNorm.bias                                                   torch.Size([128])
 5. albert.encoder.embedding_hidden_mapping_in.weight                                  torch.Size([4096, 128])
 6. albert.encoder.embedding_hidden_mapping_in.bias                                    torch.Size([4096])

==== Transformer Encoder ====

 7. albert.encoder.albert_layer_groups.0.albert_layers.0.full_layer_layer_norm.weight  torch.Size([4096])
 8. albert.enco

In [None]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch).
# I believe the 'W' stands for 'Weight Decay fix'.
optimizer = AdamW(model.parameters(),
                  lr = 1e-5, # From run_glue.sh
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )

In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs
epochs = 1 # Estimated based on 5,336 training steps in run_glue.sh

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 320, # From run_glue.sh
                                            num_training_steps = total_steps)
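
As a rough picture of what this scheduler does, the learning rate rises linearly from 0 during warmup and then decays linearly back to 0 by the final step. The sketch below reimplements that multiplier in plain Python as an illustration; `linear_warmup_multiplier` is a hypothetical name, and 7,756 is the number of batches per epoch reported in the training log further down.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given step (illustrative sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 over the warmup steps.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 1e-5
warmup_steps = 320
total_steps = 7756

for step in (0, 160, 320, 4038, 7756):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

The peak learning rate of `1e-5` is reached at step 320, exactly when warmup ends.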

In [None]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
import random

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128


# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate the elapsed time, formatted as hh:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # This will return the loss (rather than the model output) because we
        # have provided the `labels`.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        # Because we passed `labels`, the first element of the returned
        # output is the loss (in transformers 4.x it is also available as
        # `outputs.loss`).
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # This will return the logits rather than the loss because we have
            # not provided labels.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of validation sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...
  Batch    40  of  7,756.    Elapsed: 0:02:35.
  Batch    80  of  7,756.    Elapsed: 0:05:10.
  Batch   120  of  7,756.    Elapsed: 0:07:44.
  Batch   160  of  7,756.    Elapsed: 0:10:19.
  Batch   200  of  7,756.    Elapsed: 0:12:54.
  Batch   240  of  7,756.    Elapsed: 0:15:29.
  Batch   280  of  7,756.    Elapsed: 0:18:04.
  Batch   320  of  7,756.    Elapsed: 0:20:38.
  Batch   360  of  7,756.    Elapsed: 0:23:13.
  Batch   400  of  7,756.    Elapsed: 0:25:48.
  Batch   440  of  7,756.    Elapsed: 0:28:23.
  Batch   480  of  7,756.    Elapsed: 0:30:58.
  Batch   520  of  7,756.    Elapsed: 0:33:32.
  Batch   560  of  7,756.    Elapsed: 0:36:07.
  Batch   600  of  7,756.    Elapsed: 0:38:42.
  Batch   640  of  7,756.    Elapsed: 0:41:17.
  Batch   680  of  7,756.    Elapsed: 0:43:52.
  Batch   720  of  7,756.    Elapsed: 0:46:26.
  Batch   760  of  7,756.    Elapsed: 0:49:01.
  Batch   800  of  7,756.    Elapsed: 0:51:36.
  Batch   840  of  7,756.    Elapsed: 0:54:11.


In [None]:
saved_dir = './gdrive/MyDrive/model_save/'

trained_model = AlbertForSequenceClassification.from_pretrained(
    saved_dir, # Directory holding the fine-tuned weights and config that
               # were written by `save_pretrained()`.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
trained_model.to(device)

AlbertForSequenceClassification(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=4096, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((4096,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=4096, out_features=4096, bias=True)
                (key): Linear(in_features=4096, out_features=4096, bias=True)
                (value): Linear(in_features=4096, out_featur

In [None]:
test_df.head()

Unnamed: 0,qid,question_text,target
452819,58b326e49fa20fe99ea6,What could be the very last thought of a perso...,0
1193005,e9d1086f367ec17a0b91,How can I work on the impact of digital securi...,0
164018,2012196811a3046d33b5,16. Let X and Y be independent and identically...,0
784867,99c1082398f2165e0fbe,Why does the Indian Muslim can't marry Europea...,0
135624,1a8ccde5c1e68910a27f,How GST will affect life of farmer?,0


In [None]:
df, _ = train_test_split(test_df, random_state=42, train_size=.1, stratify=test_df.target.values)
df.shape

(117551, 3)

In [None]:
sentences = df.question_text.values
labels = df.target.values

In [None]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# For every sentence...
for sent in tqdm(sentences):
    # `encode` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    encoded_sent = tokenizer.encode(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'

                        # This function also supports truncation and conversion
                        # to pytorch tensors, but we need to do padding, so we
                        # can't use these features :( .
                        #max_length = 128,          # Truncate all sentences.
                        #return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    input_ids.append(encoded_sent)

100%|██████████| 117551/117551 [00:24<00:00, 4817.71it/s]


In [None]:
MAX_LEN = 128

input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")

# Create attention masks
attention_masks = []

# For each sentence...
for sent in tqdm(input_ids):
    
    # Create the attention mask.
    #   - If a token ID is 0, then it's padding, set the mask to 0.
    #   - If a token ID is > 0, then it's a real token, set the mask to 1.
    att_mask = [int(token_id > 0) for token_id in sent]
    
    # Store the attention mask for this sentence.
    attention_masks.append(att_mask)

100%|██████████| 117551/117551 [00:08<00:00, 13953.62it/s]


In [None]:
test_inputs = torch.tensor(input_ids)

test_labels = torch.tensor(labels)

test_masks = torch.tensor(attention_masks)

In [None]:
batch_size = 16

# Create the DataLoader for our test set.
test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

In [None]:
from sklearn.metrics import f1_score



In [None]:
trained_model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in tqdm(test_dataloader):
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = trained_model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

print('    DONE.')

100%|██████████| 7347/7347 [2:43:48<00:00,  1.34s/it]

    DONE.





In [None]:
predictions = np.concatenate(predictions)
true_labels = np.concatenate(true_labels)
# f1_score(predictions, true_labels)

In [None]:
f1_score(true_labels, np.argmax(predictions, axis=1))

0.7071869736103312
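
We report F1 rather than accuracy here because the classes are skewed. The sketch below shows why, using a hand-written F1 for the positive class and made-up toy labels. `binary_f1` and `accuracy` are hypothetical helpers for illustration; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 90 negatives followed by 10 positives (illustrative, not real data).
y_true = [0] * 90 + [1] * 10

# A classifier that always predicts the majority class.
always_zero = [0] * 100
# A classifier with 2 false positives, 6 true positives and 4 false negatives.
mixed = [0] * 88 + [1] * 2 + [1] * 6 + [0] * 4

print(accuracy(y_true, always_zero), binary_f1(y_true, always_zero))
print(accuracy(y_true, mixed), binary_f1(y_true, mixed))
```

The always-negative classifier looks strong on accuracy yet scores an F1 of zero, which is why F1 is the more informative number for this dataset.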

In [None]:
output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print(f"Saving model to {output_dir}")

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))

# Mount Google Drive to this Notebook instance.
drive.mount('/content/drive')

# Copy the model files to a directory in your Google Drive.
!cp -r ./model_save/ "/content/drive/MyDrive"

Saving model to ./model_save/
Mounted at /content/drive
