<a href="https://colab.research.google.com/github/matthewchung74/inference_nbs/blob/main/hugging_face_imdb_classification_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Training IMDb classification using HuggingFace
Based on https://huggingface.co/transformers/custom_datasets.html#seq-imdb. All the steps are the same as in that guide up to the **inference setup section**.

In [None]:
!pip install transformers

Collecting transformers
  Using cached https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl
Installing collected packages: transformers
Successfully installed transformers-4.4.2


In [None]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2021-04-02 17:17:34--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-04-02 17:17:36 (46.9 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [None]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            # compare strings with ==; `is` tests object identity and is unreliable here
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

In [None]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [None]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Step,Training Loss
10,0.6917
20,0.7003
30,0.6947
40,0.6869
50,0.6857
60,0.6778
70,0.6796
80,0.6612
90,0.6307
100,0.576


TrainOutput(global_step=3750, training_loss=0.18460320075402656, metrics={'train_runtime': 1713.2026, 'train_samples_per_second': 2.189, 'total_flos': 1.23411474432e+16, 'epoch': 3.0, 'init_mem_cpu_alloc_delta': 332937, 'init_mem_gpu_alloc_delta': 268953088, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1165505, 'train_mem_gpu_alloc_delta': 804118016, 'train_mem_cpu_peaked_delta': 95392469, 'train_mem_gpu_peaked_delta': 6858511872})
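
As a sanity check on the output above, the reported `global_step=3750` can be reproduced by hand. The sketch below assumes the standard 25,000-review IMDb training set, the 80/20 split made earlier, a single device, and no gradient accumulation; the helper name `expected_steps` is made up for illustration. It also shows why the first logged losses sit near 0.69: that is ln 2, the cross-entropy of an uninformed guess between two classes.

```python
import math

def expected_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation (assumed)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

train_size = 25_000 * 80 // 100   # 20,000 reviews remain after the validation split
total_steps = expected_steps(train_size, batch_size=16, epochs=3)
print(total_steps)                # 3750

chance_loss = math.log(2)         # cross-entropy when both classes get p = 0.5
print(round(chance_loss, 4))      # 0.6931
```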

Now let's export this model and test it.

In [None]:
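import shutil
import traceback

model_path = Path("model")

try:
    # Export only if the directory does not exist yet, so an earlier export is kept.
    if not model_path.exists():
        (model_path/"tokenizer").mkdir(parents=True)
        model.eval()  # switch off dropout before saving
        model.save_pretrained(model_path)
        tokenizer.save_vocabulary(str(model_path/"tokenizer"))
except Exception:
    # Remove the partial export so a rerun starts from a clean state.
    shutil.rmtree(model_path, ignore_errors=True)
    traceback.print_exc()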

In [None]:
from transformers import DistilBertTokenizer
model_new = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer_new = DistilBertTokenizer.from_pretrained(model_path/"tokenizer")

In [None]:
import torch.nn.functional as F

class_names = ["negative", "positive"]

def predict(text):
    # Truncate to the model's maximum length so long reviews do not raise an error.
    inputs = tokenizer_new(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only: no labels and no gradients needed
        outputs = model_new(**inputs)
    probabilities = F.softmax(outputs.logits, dim=1)
    confidence, predicted_class = torch.max(probabilities, dim=1)
    return f"prediction: {class_names[predicted_class.item()]}, confidence: {confidence.item()}"

In [None]:
predict('this movie is awesome') 

'prediction: positive, confidence: 0.9962551593780518'

In [None]:
predict('this movie is very bad')

'prediction: negative, confidence: 0.999083399772644'
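
The confidence values above are the largest softmax probability over the model's two output logits. A minimal, dependency-free sketch of that step follows; the logit values and the helper name `top_class` are invented for illustration and are not taken from the trained model.

```python
import math

class_names = ["negative", "positive"]

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_class(logits):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

label, confidence = top_class([-2.0, 3.0])     # hypothetical logits
print(label, round(confidence, 4))
```

A gap of five between the two logits is already enough to push the confidence above 0.99, which is why short, unambiguous reviews score so close to 1.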

### inference setup section


1.   Zip the model
2.   Download the zipped model
3.   Add your zipped model to SOMEWHERE
4.   Create an inference notebook




In [None]:
# step 1
!zip -r model.zip model

  adding: model/ (stored 0%)
  adding: model/config.json (deflated 44%)
  adding: model/tokenizer/ (stored 0%)
  adding: model/tokenizer/vocab.txt (deflated 53%)
  adding: model/pytorch_model.bin (deflated 8%)


In [None]:
# step 2
from google.colab import files
files.download("model.zip")
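
Steps 3 and 4 are not covered by this notebook. As a rough sketch of how the inference notebook might begin, assuming `model.zip` has already been copied into its working directory: unpack the archive, then load the model and tokenizer from the same `model/` layout created above. The function names `extract_model` and `load_for_inference` are hypothetical.

```python
import zipfile
from pathlib import Path

def extract_model(zip_path="model.zip", dest="."):
    """Unpack the archive and return the path of the extracted model directory."""
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest / "model"    # the archive was built with `zip -r model.zip model`

def load_for_inference(model_dir):
    """Load the saved model and tokenizer (requires `transformers` to be installed)."""
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
    model = DistilBertForSequenceClassification.from_pretrained(str(model_dir))
    tokenizer = DistilBertTokenizer.from_pretrained(str(Path(model_dir) / "tokenizer"))
    model.eval()             # disable dropout for prediction
    return model, tokenizer
```

In the inference notebook this could be called as `model_new, tokenizer_new = load_for_inference(extract_model())`, after which the `predict` function defined earlier can be reused as is.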
