In [None]:
%load_ext autoreload
%autoreload 2

# Data

> The `data.core` module contains the core pieces required to use fast.ai's low-level and/or mid-level APIs to define `Datasets` and build `DataLoaders` suitable for training transformers.

In [None]:
# |default_exp data.core
# |default_cls_lvl 3

In [None]:
# |export
from __future__ import annotations

import gc, importlib, sys, traceback

from accelerate.logging import get_logger
from dataclasses import dataclass
from dotenv import load_dotenv
from fastai.imports import *
from fastai.losses import CrossEntropyLossFlat
from fastai.data.block import TransformBlock
from fastai.data.transforms import DataLoaders, ItemTransform, Transform
from fastai.text.data import SortedDL
from fastai.torch_core import *
from fastai.torch_imports import *
from transformers import PretrainedConfig, PreTrainedTokenizerBase, PreTrainedModel, AutoModelForSequenceClassification
from transformers import logging as hf_logging
from transformers.data.data_collator import DataCollatorWithPadding

from blurr.utils import get_hf_objects

In [None]:
# |hide
import pdb, nbdev

from datasets import concatenate_datasets, load_dataset, Dataset, Value
from fastai.data.block import CategoryBlock, ColReader, ColSplitter, DataBlock, FuncSplitter, MultiCategoryBlock
from fastai.data.transforms import DataLoader, DataLoaders, Datasets, ItemTransform
from fastai.losses import BaseLoss, BCEWithLogitsLossFlat
from fastcore.test import *
from transformers import AutoConfig, AutoTokenizer

from blurr.utils import clean_memory, print_versions, set_seed

In [None]:
# |export
# silence all the HF warnings and load environment variables
warnings.simplefilter("ignore")
hf_logging.set_verbosity_error()
logger = get_logger(__name__)

load_dotenv()

False

In [None]:
# |hide
# |notest
torch.cuda.set_device(0)
print(f"Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}")

Using GPU #0: NVIDIA GeForce RTX 3090


In [None]:
# | echo: false
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print("What we're running with at the time this documentation was generated:")
print_versions("torch fastai transformers")

What we're running with at the time this documentation was generated:
torch: 1.13.1
fastai: 2.7.11
transformers: 4.26.1


## Setup

We'll use a subset of `imdb` to demonstrate how to configure **BLURR** for sequence classification tasks. BLURR is designed to work with Hugging Face `Dataset` and/or pandas `DataFrame` objects.

### Multiclass

In [None]:
imdb_dsd = load_dataset("imdb", split=["train", "test"])

# build HF `Dataset` objects
train_ds = imdb_dsd[0].add_column("is_valid", [False] * len(imdb_dsd[0])).shuffle().select(range(1000))
valid_ds = imdb_dsd[1].add_column("is_valid", [True] * len(imdb_dsd[1])).shuffle().select(range(200))
imdb_ds = concatenate_datasets([train_ds, valid_ds])

# build a `DataFrame` representation as well
imdb_df = pd.DataFrame(imdb_ds)

print(len(train_ds), len(valid_ds))
print(len(imdb_df[imdb_df["is_valid"] == False]), len(imdb_df[imdb_df["is_valid"] == True]))
imdb_df.head()

Found cached dataset imdb (/home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


1000 200
1000 200


Unnamed: 0,text,label,is_valid
0,"I've read a number of reviews on this film and I have to say ""What is wrong with you people?!?!"" This was an excellent film! I thought this film was superb from start to finish and the story was extremely well told. I'm convinced that the people that didn't like this film weren't paying very good attention to the film. There are a number of very important scenes that if you aren't paying attention you will be confused and the following scenes may not make sense. I urge anyone who didn't like this film to watch it again and watch it alone so that you can truly pay attention. The story made ...",1,False
1,"Bugs life is a good film. But to me, it doesn't really compare to movies like Toy story and stuff. Don't get me wrong, I liked this movie, but it wasn't as good as Toy story. The film has the visuals, the laughs, and others that Toy story had. But the film didn't feel quite as... I don't know, but I thought it was still a pretty good film. <br /><br />A bugs life... I don't want to say this, is a film that I don't remember. I saw it years ago. Of course, I haven't seen Toy story in years, but I still remember it. I shouldn't have reviewed this film, but I am. I am giving it a thumbs up, th...",1,False
2,"Sorry did i miss something? did i walk out early? The first ten minutes of unusual (and untrue!) stories had me thinking ""This is going to be a classic"" But it was all down hill from there! The acting was brilliant, for what it's worth William H Macy is fantastic and just gets better and better every film i watch him in. But it never seemed to connect. I was waiting for the big moment where all the stories inter connect and then suddenly..it rains frog?? it was if the writer said ""i've gone to deep how can i pull all these stories together cleverely....Oh sod it i'll just have it raining f...",0,False
3,"I think its pretty safe to say that this is the worst film ever made, When I saw the trailer on TV i knew right from second 1 that this would be a piece of **** and it would be best to avoid it, but I somehow got dragged into seeing this by some friends, I walked into the cinema with low expectations but i was hoping there would be a couple of cheap laughs to keep me awake during this film. The so-called ""jokes"" in this film bring a cringe to the face, they are mostly comprised of people taking hits to the face and balls, the baby looking weird and acting like a horny gangsta and the typic...",0,False
4,"(This is a review of the later English release by Disney, featuring Alison Lohman, Patrick Stewart, and co.) <br /><br />I really wanted this film to be good. Really, really. I'm a huge fan of Princess Mononoke and Spirited Away, and after seeing all the glowing reviews on this earlier Miyazaki film, I was eager to see it. But I was shocked, shocked I say, at the quality of this film. Those later films boast well-crafted plots, 3-dimensional characters, and the best film music since...well...ever. This film just doesn't come close.<br /><br />Might as well start w/ the positive aspects, th...",0,False
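The `is_valid` column added above is what lets a column-based splitter tell training rows from validation rows later on. As a minimal sketch of that idea, using plain Python dictionaries and made-up rows rather than the real IMDB data:

```python
# Toy rows standing in for the combined dataset (hypothetical data, not IMDB).
rows = [
    {"text": "loved it", "label": 1, "is_valid": False},
    {"text": "hated it", "label": 0, "is_valid": False},
    {"text": "it was fine", "label": 1, "is_valid": True},
]

# Partition the row indices on the boolean flag.
train_idxs = [i for i, r in enumerate(rows) if not r["is_valid"]]
valid_idxs = [i for i, r in enumerate(rows) if r["is_valid"]]

print(train_idxs, valid_idxs)
```

Keeping both subsets in one table with a flag column means the same object can be handed to either the Hugging Face or the pandas code path.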


In [None]:
label_names = imdb_dsd[0].features["label"].names
label_names

['neg', 'pos']

### Multilabel

In [None]:
civil_dsd = load_dataset("civil_comments", split=["train", "validation"])

# round the floats
civil_label_names = ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack", "sexual_explicit"]


def round_targs(example):
    for lbl in civil_label_names:
        example[lbl] = np.round(example[lbl])
    return example


# convert floats to ints
def fix_dtypes(ds):
    new_features = ds.features.copy()
    for lbl in civil_label_names:
        new_features[lbl] = Value("int32")
    return ds.cast(new_features)


# build HF `Dataset` objects
civil_train_ds = civil_dsd[0].add_column("is_valid", [False] * len(civil_dsd[0])).shuffle().select(range(1000))
civil_train_ds = civil_train_ds.map(round_targs)
civil_train_ds = fix_dtypes(civil_train_ds)

civil_valid_ds = civil_dsd[1].add_column("is_valid", [True] * len(civil_dsd[1])).shuffle().select(range(200))
civil_valid_ds = civil_valid_ds.map(round_targs)
civil_valid_ds = fix_dtypes(civil_valid_ds)

civil_ds = concatenate_datasets([civil_train_ds, civil_valid_ds])

# build a `DataFrame` representation as well
civil_df = pd.DataFrame(civil_ds)

print(len(civil_train_ds), len(civil_valid_ds))
print(len(civil_df[civil_df["is_valid"] == False]), len(civil_df[civil_df["is_valid"] == True]))
civil_df.head()

Found cached dataset civil_comments (/home/wgilliam/.cache/huggingface/datasets/civil_comments/default/0.9.0/e7a3aacd2ab7d135fa958e7209d10b1fa03807d44c486e3c34897aa08ea8ffab)


1000 200
1000 200


Unnamed: 0,text,toxicity,severe_toxicity,obscene,threat,insult,identity_attack,sexual_explicit,is_valid
0,"We live in an age of hysteria and histrionics. Critical, rational thinking is needed to deal with the deluge of propaganda emanating from the Grope & Flail and the Toronto Red Star. Head for the hills!!",0,0,0,0,0,0,0,False
1,"No excuse for poaching, no matter who you are... native, white, half white, half native, whatever.",0,0,0,0,0,0,0,False
2,Foreign Affairs Minister Chrystia Freeland says the Liberal government will make a “substantial investment” in the military because Canada can no longer rely on Washington for global leadership \n-\nsubstantial investment?\njoke?\nTrudeau Canada cannot even meet its NATO commitments of 2% of GDP for defense (spending less than 1%)\ntruly Liberal hot air,0,0,0,0,0,0,0,False
3,Relying on others. It must be Liberal thing.,0,0,0,0,0,0,0,False
4,"Let's be fair -- if there's one thing that unites people of all lands, of all colours and creeds, it's the thrill of beholding a giant rubber duck. Some might feel fondness and some even arousal (stay away from these), but all will cherish the bizarre memory.\nAlas, the more suspicious types might worry that it is in fact a Trumpian Trojan Horse, crammed with his northern militia, his online fifth column, the tiny one that usually hangs out at the online Globe.",0,0,0,0,0,0,0,False
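The `civil_comments` targets arrive as float scores, which is why the cell above rounds them and casts them to integers. The sketch below shows the same rounding step on a single made-up example using only the standard library; the two-label subset and the scores are assumptions for illustration.

```python
label_cols = ["toxicity", "insult"]  # hypothetical two-label subset


def round_example(example: dict) -> dict:
    """Round each float score to the nearest integer (0 or 1)."""
    for lbl in label_cols:
        example[lbl] = int(round(example[lbl]))
    return example


example = {"text": "some comment", "toxicity": 0.83, "insult": 0.2}
rounded = round_example(dict(example))  # copy so the original stays untouched
print(rounded)
```

Both Python's `round` and `np.round` round halves to the nearest even value, so a score of exactly `0.5` becomes `0`.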


## Base API

### Task specific functions

The functions below provide a basic way to fetch your Hugging Face objects and pretokenize your inputs.

In [None]:
# |export
def get_task_hf_objects(pretrained_model_name: str, label_names: list[str] = ["neg", "pos"], verbose: bool = False):
    """A helper function for getting the Hugging Face objects that works out of the box for most sequence classification tasks"""
    model_cls = AutoModelForSequenceClassification
    n_label_names = len(label_names)

    hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
        pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_label_names}
    )

    if verbose:
        print("=== config ===")
        print(f"# of labels:\t{hf_config.num_labels}")
        print("")
        print("=== tokenizer ===")
        print(f"Vocab size:\t\t{hf_tokenizer.vocab_size}")
        print(f"Max # of tokens:\t{hf_tokenizer.model_max_length}")
        print(f"Attributes expected by model in forward pass:\t{hf_tokenizer.model_input_names}")

    return hf_arch, hf_config, hf_tokenizer, hf_model

In [None]:
nbdev.show_doc(get_task_hf_objects)

---

[source](https://github.com/ohmeow/blurr/blob/dev-3.0.0 #master/blurr/data/token_classification.py#L130){target="_blank" style="float:right; font-size:smaller"}

### get_task_hf_objects

>      get_task_hf_objects (pretrained_model_name:str,
>                           label_names:list[str]=['neg', 'pos'],
>                           verbose:bool=False)

A helper function for getting the Hugging Face objects that works out of the box for most sequence classification tasks

In [None]:
# |eval:true
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects(
    "distilbert-base-cased", ["great", "good", "bad", "horrific"], verbose=False
)

test_eq(hf_arch, "distilbert")
test_eq(hf_config.num_labels, 4)
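The helper derives the number of labels directly from the length of `label_names`. A small sketch of that relationship follows; the `id2label` and `label2id` dictionaries are local names chosen for illustration, and nothing in the code above sets them.

```python
label_names = ["great", "good", "bad", "horrific"]

# The number of output classes is simply the number of label names.
num_labels = len(label_names)

# Index <-> name lookups, handy when reading integer predictions.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(num_labels, id2label[2], label2id["horrific"])
```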

In [None]:
# |export
def multiclass_tokenize_func(
    examples,
    hf_tokenizer: PreTrainedTokenizerBase,
    text_attr: str = "text",
    text_pair_attr: str = None,
    max_length: int = None,
    padding: bool | str = True,
    truncation: bool | str = True,
    tok_kwargs: dict = {},
):
    """A tokenization function that works out of the box for most multiclass classification tasks"""
    txts = [examples[text_attr]] if text_pair_attr is None else [examples[text_attr], examples[text_pair_attr]]
    return hf_tokenizer(*txts, max_length=max_length, padding=padding, truncation=truncation, **tok_kwargs)

In [None]:
nbdev.show_doc(multiclass_tokenize_func)

---

[source](https://github.com/ohmeow/blurr/blob/dev-3.0.0 #master/blurr/data/core.py#L61){target="_blank" style="float:right; font-size:smaller"}

### multiclass_tokenize_func

>      multiclass_tokenize_func (examples, hf_tokenizer:transformers.tokenizatio
>                                n_utils_base.PreTrainedTokenizerBase,
>                                text_attr:str='text', text_pair_attr:str=None,
>                                max_length:int=None, padding:bool|str=True,
>                                truncation:bool|str=True, tok_kwargs:dict={})

A tokenization function that works out of the box for most multiclass classification tasks

In [None]:
# |eval:true
my_dict = {"id": [0, 1, 2], "text": ["This is great!", "This is so horrible", "It was, uh, kinda meh."], "label": [1, 0, 0]}
test_ds = Dataset.from_dict(my_dict)

tokenize_func = partial(multiclass_tokenize_func, hf_tokenizer=hf_tokenizer)
proc_test_ds = test_ds.map(tokenize_func, batched=True)

In [None]:
# |eval:true
proc_example = proc_test_ds[0]

test_eq(isinstance(proc_example, dict), True)
test_eq("input_ids" in proc_example.keys(), True)
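To make the single-sequence versus sequence-pair handling concrete, here is a sketch that swaps the real tokenizer for a whitespace splitter. `fake_tokenizer` and `toy_tokenize_func` are hypothetical stand-ins; only the way `txts` is built and unpacked mirrors the function above.

```python
from functools import partial


def fake_tokenizer(text, text_pair=None):
    """Hypothetical stand-in for a tokenizer: splits each string on whitespace."""
    if text_pair is None:
        return {"tokens": [t.split() for t in text]}
    return {"tokens": [a.split() + ["[SEP]"] + b.split() for a, b in zip(text, text_pair)]}


def toy_tokenize_func(examples, tokenizer, text_attr="text", text_pair_attr=None):
    # One positional argument for single sequences, two for sequence pairs.
    txts = [examples[text_attr]] if text_pair_attr is None else [examples[text_attr], examples[text_pair_attr]]
    return tokenizer(*txts)


batch = {"text": ["a b", "c"], "other": ["x", "y z"]}
single = partial(toy_tokenize_func, tokenizer=fake_tokenizer)(batch)
paired = partial(toy_tokenize_func, tokenizer=fake_tokenizer, text_pair_attr="other")(batch)

print(single["tokens"])
print(paired["tokens"])
```

Binding the tokenizer with `partial` leaves a one-argument callable, which is the shape `Dataset.map` expects.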

In [None]:
# |export
def multilabel_tokenize_func(
    examples,
    hf_tokenizer: PreTrainedTokenizerBase,
    label_attrs: list[str],
    text_attr: str = "text",
    text_pair_attr: str = None,
    max_length: int = None,
    padding: bool | str = True,
    truncation: bool | str = True,
    tok_kwargs: dict = {},
):
    """A tokenization function that works out of the box for most multilabel classification tasks"""
    txts = [examples[text_attr]] if text_pair_attr is None else [examples[text_attr], examples[text_pair_attr]]
    inputs = hf_tokenizer(*txts, max_length=max_length, padding=padding, truncation=truncation, **tok_kwargs)

    label_names = torch.stack([tensor(examples[lbl]) for lbl in label_attrs], dim=-1)
    inputs["label"] = label_names

    return inputs

In [None]:
nbdev.show_doc(multilabel_tokenize_func)

---

[source](https://github.com/ohmeow/blurr/blob/dev-3.0.0 #master/blurr/data/core.py#L76){target="_blank" style="float:right; font-size:smaller"}

### multilabel_tokenize_func

>      multilabel_tokenize_func (examples, hf_tokenizer:transformers.tokenizatio
>                                n_utils_base.PreTrainedTokenizerBase,
>                                label_attrs:list[str], text_attr:str='text',
>                                text_pair_attr:str=None, max_length:int=None,
>                                padding:bool|str=True,
>                                truncation:bool|str=True, tok_kwargs:dict={})

A tokenization function that works out of the box for most multilabel classification tasks

In [None]:
# |eval:true
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("distilbert-base-cased", ["label1", "label2"], verbose=False)

test_eq(hf_arch, "distilbert")
test_eq(hf_config.num_labels, 2)

In [None]:
# |eval:true
my_dict = {
    "id": [0, 1, 2],
    "text": ["This is great!", "This is so horrible", "It was, uh, kinda meh."],
    "label1": [1, 0, 0],
    "label2": [0, 1, 1],
}

test_ds = Dataset.from_dict(my_dict)

tokenize_func = partial(multilabel_tokenize_func, hf_tokenizer=hf_tokenizer, label_attrs=["label1", "label2"])
proc_test_ds = test_ds.map(tokenize_func, batched=True)

In [None]:
# |eval:true
proc_example = proc_test_ds[0]

test_eq(isinstance(proc_example, dict), True)
test_eq("input_ids" in proc_example.keys(), True)
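The multilabel variant stacks one column per label into a single label vector per example. The sketch below reproduces that reshaping with plain lists instead of tensors, so it illustrates the layout rather than the exact implementation.

```python
examples = {"label1": [1, 0, 0], "label2": [0, 1, 1]}
label_attrs = ["label1", "label2"]

# Gather the per-label columns, then transpose them so that each row
# holds every label value for a single example.
columns = [examples[lbl] for lbl in label_attrs]
label_vectors = [list(row) for row in zip(*columns)]

print(label_vectors)
```

With three examples and two labels the result has three rows of two values each, matching the `(batch, n_labels)` layout produced by stacking along the last dimension.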

### `TextCollatorWithPadding`

In [None]:
# |export
@dataclass
class TextCollatorWithPadding:
    def __init__(
        self,
        # A Hugging Face tokenizer
        hf_tokenizer: PreTrainedTokenizerBase,
        # The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
        hf_arch: str = None,
        # A specific configuration instance you want to use
        hf_config: PretrainedConfig = None,
        # A Hugging Face model
        hf_model: PreTrainedModel = None,
        # The number of inputs expected by your model
        n_inp: int = 1,
        # Defaults to use Hugging Face's DataCollatorWithPadding(tokenizer=hf_tokenizer)
        data_collator_cls: type = DataCollatorWithPadding,
        # kwargs specific to the instantiation of the `data_collator`
        data_collator_kwargs: dict = {},
    ):
        """A data collation function that can be used across blurr's base, low-level, and mid-level APIs"""
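        store_attr()
        # copy so that neither the caller's dict nor the shared default `{}` is mutated by `pop`
        collator_kwargs = dict(data_collator_kwargs)
        self.hf_tokenizer = collator_kwargs.pop("tokenizer", self.hf_tokenizer)
        self.data_collator = data_collator_cls(tokenizer=self.hf_tokenizer, **collator_kwargs)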

    def __call__(self, features):
        features = L(features)
        inputs, labels, targs = [], [], []

        # features contain dictionaries
        if isinstance(features[0], dict):
            feature_keys = list(features[0].keys())
            inputs = [self._build_inputs_d(features, feature_keys)]

            input_labels = self._build_input_labels(inputs[0], features, feature_keys)
            if input_labels is not None:
                labels, targs = [input_labels], [input_labels.clone()]
        # features contains tuples, each of which can contain multiple inputs and/or targets
        elif isinstance(features[0], tuple):
            for f_idx in range(self.n_inp):
                feature_keys = list(features[0][f_idx].keys())
                inputs.append(self._build_inputs_d(features.itemgot(f_idx), feature_keys))

                input_labels = self._build_input_labels(inputs[0], features.itemgot(f_idx), feature_keys)
                labels.append(input_labels if input_labels is not None else [])

            targs = [self._proc_targets(inputs[0], list(features.itemgot(f_idx))) for f_idx in range(self.n_inp, len(features[0]))]

        return self._build_batch(inputs, labels, targs)

    # ----- utility methods -----

    # to build the inputs dictionary
    def _build_inputs_d(self, features, feature_keys):
        return {fwd_arg: list(features.attrgot(fwd_arg)) for fwd_arg in self.hf_tokenizer.model_input_names if fwd_arg in feature_keys}

    # to build the input "labels"
    def _build_input_labels(self, inputs_d, features, feature_keys):
        if "label" in feature_keys:
            labels = list(features.attrgot("label"))
            return self._proc_targets(inputs_d, labels)
        return None

    # used to give the labels/targets the right shape
    def _proc_targets(self, inputs_d, targs):
        if is_listy(targs[0]):
            targs = torch.stack([tensor(lbls) for lbls in targs])
        elif isinstance(targs[0], torch.Tensor) and len(targs[0].size()) > 0:
            targs = torch.stack(targs)
        else:
            targs = torch.tensor(targs)

        return targs

    # will properly assemble our batch given a list of inputs, labels, and targets
    def _build_batch(self, inputs, labels, targs):
        batch = []

        for input, input_labels in zip(inputs, labels):
            input_d = dict(self.data_collator(input))
            if len(input_labels) > 0:
                input_d["labels"] = input_labels

            batch.append(input_d)

        for targ in targs:
            batch.append(targ)

        return tuplify(batch)
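As a rough picture of what the collator produces for dictionary features, the sketch below pads a toy batch and returns an `(inputs, targets)` tuple. It is a simplified, pure-Python approximation: the real class delegates padding to the Hugging Face data collator, and the pad id of `0` and the `toy_collate` name are assumptions.

```python
def toy_collate(features, pad_id=0):
    """Pad `input_ids` to the longest item and split the labels off as targets."""
    max_len = max(len(f["input_ids"]) for f in features)

    input_ids, attention_mask = [], []
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_id] * n_pad)
        attention_mask.append([1] * len(f["input_ids"]) + [0] * n_pad)

    inputs = {"input_ids": input_ids, "attention_mask": attention_mask}
    if "label" not in features[0]:
        return (inputs,)

    labels = [f["label"] for f in features]
    inputs["labels"] = labels
    # The targets are a separate copy of the labels, giving a two-item batch.
    return inputs, list(labels)


features = [
    {"input_ids": [101, 7, 102], "label": 1},
    {"input_ids": [101, 102], "label": 0},
]
batch = toy_collate(features)
print(len(batch), batch[0]["input_ids"], batch[1])
```

Carrying the labels both inside the inputs dictionary and as a separate target lets the model compute its own loss while the training loop still sees a conventional `(x, y)` pair.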

## Base API: Examples

This section demonstrates how you can use standard `Dataset` objects (PyTorch and Hugging Face) to build PyTorch `DataLoader`s.

**Note** that most fast.ai-specific features, such as `DataLoaders.one_batch` and `DataLoader.show_batch`, are not available when using PyTorch.

### PyTorch

#### Multiclass

##### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `Dataset`s (PyTorch)

In [None]:
# tokenize the dataset
tokenize_func = partial(multiclass_tokenize_func, hf_tokenizer=hf_tokenizer)
proc_train_ds = train_ds.map(tokenize_func, batched=True)
proc_valid_ds = valid_ds.map(tokenize_func, batched=True)

In [None]:
# define our PyTorch Dataset class
class HFTextClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset, hf_tokenizer):
        self.hf_dataset = hf_dataset
        self.hf_tokenizer = hf_tokenizer

    def __len__(self):
        return len(self.hf_dataset)

    def __getitem__(self, idx):
        item = self.hf_dataset[idx]
        return item


# build our PyTorch training and validation Datasets
pt_proc_train_ds = HFTextClassificationDataset(proc_train_ds, hf_tokenizer=hf_tokenizer)
pt_proc_valid_ds = HFTextClassificationDataset(proc_valid_ds, hf_tokenizer=hf_tokenizer)

##### Step 3: `DataLoader`s (PyTorch)

In [None]:
# build your fastai `DataLoaders` from PyTorch `DataLoader` objects
batch_size = 4
data_collator = TextCollatorWithPadding(hf_tokenizer)
train_dl = torch.utils.data.DataLoader(pt_proc_train_ds, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
valid_dl = torch.utils.data.DataLoader(pt_proc_valid_ds, batch_size=batch_size * 2, shuffle=False, collate_fn=data_collator)

dls = DataLoaders(train_dl, valid_dl)

In [None]:
print("# of batches in train|validation dataloaders:", len(train_dl), len(valid_dl))

b = next(iter(train_dl))
print("# of items in each batch:", len(b))
print("")
print(f"Decoded input_ids: {hf_tokenizer.decode(b[0]['input_ids'][0][:10])} ... ")
print("Targets:", b[1])

# b

# of batches in train|validation dataloaders: 250 25
# of items in each batch: 2

Decoded input_ids: [CLS] When watching A Bug's Life for the ... 
Targets: tensor([1, 1, 0, 0])
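The batch counts printed above follow from the dataset sizes and batch sizes: 1,000 training items at a batch size of 4, and 200 validation items at a batch size of 8. A short worked check, where `n_batches` is a hypothetical helper written for this illustration:

```python
import math


def n_batches(n_items: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields for `n_items` items."""
    if drop_last:
        return n_items // batch_size
    return math.ceil(n_items / batch_size)


train_batches = n_batches(1000, 4)
valid_batches = n_batches(200, 8)
print(train_batches, valid_batches)
```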


In [None]:
# NOPE: Won't work with PyTorch DataLoaders
# AttributeError: 'DataLoader' object has no attribute 'show_batch'
# dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)

In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

#### Multilabel

##### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", civil_label_names, verbose=True)

=== config ===
# of labels:	7

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `Dataset`s (PyTorch)

In [None]:
# tokenize the dataset
tokenize_func = partial(multilabel_tokenize_func, hf_tokenizer=hf_tokenizer, label_attrs=civil_label_names)
proc_civil_train_ds = civil_train_ds.map(tokenize_func, batched=True)
proc_civil_valid_ds = civil_valid_ds.map(tokenize_func, batched=True)

In [None]:
# define our PyTorch Dataset class
class HFTextMultilabelClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset, hf_tokenizer, label_names):
        self.hf_dataset = hf_dataset
        self.hf_tokenizer = hf_tokenizer
        self.label_names = label_names

    def __len__(self):
        return len(self.hf_dataset)

    def __getitem__(self, idx):
        item = self.hf_dataset[idx]
        # item["label"] = [item[lbl] for lbl in self.label_names]
        return item


# build our PyTorch training and validation Datasets
pt_proc_civil_train_ds = HFTextMultilabelClassificationDataset(
    proc_civil_train_ds, hf_tokenizer=hf_tokenizer, label_names=civil_label_names
)
pt_proc_civil_valid_ds = HFTextMultilabelClassificationDataset(
    proc_civil_valid_ds, hf_tokenizer=hf_tokenizer, label_names=civil_label_names
)

##### Step 3: `DataLoader`s (PyTorch)

In [None]:
# build your fastai `DataLoaders` from PyTorch `DataLoader` objects
batch_size = 4
data_collator = TextCollatorWithPadding(hf_tokenizer)
train_dl = torch.utils.data.DataLoader(pt_proc_civil_train_ds, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
valid_dl = torch.utils.data.DataLoader(pt_proc_civil_valid_ds, batch_size=batch_size * 2, shuffle=False, collate_fn=data_collator)

dls = DataLoaders(train_dl, valid_dl)

In [None]:
print("# of batches in train|validation dataloaders:", len(train_dl), len(valid_dl))

b = next(iter(train_dl))
print("# of items in each batch:", len(b))
print("")
print(f"Decoded input_ids: {hf_tokenizer.decode(b[0]['input_ids'][0][:10])} ... ")
print("Targets:", b[1])

# b

# of batches in train|validation dataloaders: 250 25
# of items in each batch: 2

Decoded input_ids: [CLS] There's no such thing as a National ... 
Targets: tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])
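Each row of the multilabel targets is a multi-hot vector with one position per label name. A small sketch of reading such a row back into names follows; the example vector is made up and `active_labels` is a hypothetical helper.

```python
civil_label_names = ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack", "sexual_explicit"]


def active_labels(multi_hot, names):
    """Return the names whose position in the multi-hot vector is set to 1."""
    return [name for flag, name in zip(multi_hot, names) if flag == 1]


print(active_labels([1, 0, 0, 0, 1, 0, 0], civil_label_names))
print(active_labels([0] * 7, civil_label_names))
```

An all-zero row, like the ones shown above, simply means none of the labels apply to that comment.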


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

### Hugging Face

#### Multiclass

##### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `Dataset`s (Hugging Face)

In [None]:
# tokenize the dataset
tokenize_func = partial(multiclass_tokenize_func, hf_tokenizer=hf_tokenizer)
proc_train_ds = train_ds.map(tokenize_func, batched=True)
proc_valid_ds = valid_ds.map(tokenize_func, batched=True)

Loading cached processed dataset at /home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-bf198da76a267503.arrow


##### Step 3: `DataLoader`s (PyTorch)

In [None]:
# build your fastai `DataLoaders` from PyTorch `DataLoader` objects
batch_size = 4
data_collator = TextCollatorWithPadding(hf_tokenizer)
train_dl = torch.utils.data.DataLoader(proc_train_ds, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
valid_dl = torch.utils.data.DataLoader(proc_valid_ds, batch_size=batch_size * 2, shuffle=False, collate_fn=data_collator)

dls = DataLoaders(train_dl, valid_dl)

In [None]:
print("# of batches in train|validation dataloaders:", len(train_dl), len(valid_dl))

b = next(iter(train_dl))
print("# of items in each batch:", len(b))
print("")
print(f"Decoded input_ids: {hf_tokenizer.decode(b[0]['input_ids'][0][:10])} ... ")
print("Targets:", b[1])

# b

# of batches in train|validation dataloaders: 250 25
# of items in each batch: 2

Decoded input_ids: [CLS] 'One-Round' Jack Sander is called ... 
Targets: tensor([1, 0, 1, 0])


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

#### Multilabel

##### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", civil_label_names, verbose=True)

=== config ===
# of labels:	7

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `Dataset`s (Hugging Face)

In [None]:
# tokenize the dataset
tokenize_func = partial(multilabel_tokenize_func, hf_tokenizer=hf_tokenizer, label_attrs=civil_label_names)
proc_civil_train_ds = civil_train_ds.map(tokenize_func, batched=True)
proc_civil_valid_ds = civil_valid_ds.map(tokenize_func, batched=True)

Loading cached processed dataset at /home/wgilliam/.cache/huggingface/datasets/civil_comments/default/0.9.0/e7a3aacd2ab7d135fa958e7209d10b1fa03807d44c486e3c34897aa08ea8ffab/cache-3d64576ee130c580.arrow


##### Step 3: `DataLoader`s (PyTorch)

In [None]:
# build your fastai `DataLoaders` from PyTorch `DataLoader` objects
batch_size = 4
data_collator = TextCollatorWithPadding(hf_tokenizer)
train_dl = torch.utils.data.DataLoader(proc_civil_train_ds, batch_size=batch_size, shuffle=True, collate_fn=data_collator)
valid_dl = torch.utils.data.DataLoader(proc_civil_valid_ds, batch_size=batch_size * 2, shuffle=False, collate_fn=data_collator)

dls = DataLoaders(train_dl, valid_dl)

In [None]:
print("# of batches in train|validation dataloaders:", len(train_dl), len(valid_dl))

b = next(iter(train_dl))
print("# of items in each batch:", len(b))
print("")
print(f"Decoded input_ids: {hf_tokenizer.decode(b[0]['input_ids'][0][:10])} ... ")
print("Targets:", b[1])

# b

# of batches in train|validation dataloaders: 250 25
# of items in each batch: 2

Decoded input_ids: [CLS] How do you think most adults vote? Often ... 
Targets: tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

## Low-Level API

This section demonstrates how you can migrate from PyTorch/Hugging Face to fast.ai `Datasets` and `DataLoaders` to recapture many of the fast.ai-specific features that are unavailable when using basic PyTorch. These include:

- `DataLoaders.one_batch()`
- `DataLoaders.show_batch()`
- `Learner.export()`

### `TextInput` -

In [None]:
# |export
class TextInput(TensorBase):
    """The base representation of your inputs; used by the various fastai `show` methods"""

    pass

A `TextInput` object is returned from the `decodes` method of `BatchDecodeTransform` as a means to customize `@typedispatch`ed functions like `DataLoaders.show_batch` and `Learner.show_results`. Its value will be your `input_ids`.
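The idea of a marker type steering which display function runs can be sketched with the standard library's `functools.singledispatch`. This is only an analogy and not fastai's `typedispatch`; `ToyTextInput` and `show` are hypothetical names.

```python
from functools import singledispatch


class ToyTextInput(list):
    """Hypothetical marker type: behaves like a plain list of token ids."""


@singledispatch
def show(x):
    # Fallback used for any type without a registered implementation.
    return f"generic: {x!r}"


@show.register(ToyTextInput)
def _(x):
    # Chosen only when the argument is a `ToyTextInput`.
    return f"text input with {len(x)} token ids"


print(show([1, 2]))
print(show(ToyTextInput([101, 7, 102])))
```

The subclass adds no behavior of its own; its type alone selects the specialized function, which is the role `TextInput` plays for the `show` methods.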

### `BatchDecodeTransform` -

In [None]:
# |export
class BatchDecodeTransform(Transform):
    """A class used to cast your inputs as `input_return_type` for fastai `show` methods"""

    def __init__(
        self,
        # A Hugging Face tokenizer (not required if passing in an instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_tokenizer: PreTrainedTokenizerBase,
        # The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_arch: str = None,
        # A Hugging Face configuration object (not required if passing in an instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_config: PretrainedConfig = None,
        # A Hugging Face model (not required if passing in an instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_model: PreTrainedModel = None,
        # Used by typedispatched show methods
        input_return_type: type = TextInput,
        # The token ID that should be ignored when calculating the loss
        ignore_token_id: int = CrossEntropyLossFlat().ignore_index,
        # Any other keyword arguments
        **kwargs,
    ):
        store_attr()
        self.kwargs = kwargs

    def decodes(self, items: dict):
        """Returns the proper object and data for show related fastai methods"""
        return self.input_return_type(items["input_ids"])

As of fastai 2.1.5, before-batch transforms no longer have a `decodes` method, so I've introduced a standard batch transform here, `BatchDecodeTransform` (one that runs "after" the batch has been created), that does the decoding for us.

### Utility classes and methods 

These methods are used internally to get the blurr transforms associated with your `DataLoaders`.

In [None]:
# |export
def get_blurr_tfm(
    # A list of transforms (e.g., dls.after_batch, dls.before_batch, etc...)
    tfms_list: Pipeline,
    # The transform to find
    tfm_class: Transform = BatchDecodeTransform,
):
    """
    Given a fastai DataLoaders batch transforms, this method can be used to get at a transform
    instance used in your Blurr DataBlock
    """
    return next(filter(lambda el: issubclass(type(el), tfm_class), tfms_list), None)

In [None]:
nbdev.show_doc(get_blurr_tfm, title_level=3)

---

[source](https://github.com/ohmeow/blurr/blob/dev-3.0.0/blurr/data/core.py#L221){target="_blank" style="float:right; font-size:smaller"}

### get_blurr_tfm

>      get_blurr_tfm (tfms_list:fastcore.transform.Pipeline,
>                     tfm_class:fastcore.transform.Transform=<class
>                     '__main__.BatchDecodeTransform'>)

Given a fastai DataLoaders batch transforms, this method can be used to get at a transform
instance used in your Blurr DataBlock

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| tfms_list | Pipeline |  | A list of transforms (e.g., dls.after_batch, dls.before_batch, etc...) |
| tfm_class | Transform | BatchDecodeTransform | The transform to find |
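
As a rough illustration of the lookup, the sketch below reuses the same `next(filter(...), None)` expression on plain Python classes. The `Demo*` classes are hypothetical stand-ins, not fastai or blurr types; the point is that subclasses match and that a missing transform yields `None` rather than an exception.

```python
class DemoTransform:
    """Hypothetical minimal base class standing in for a fastcore `Transform`."""


class DemoDecodeTransform(DemoTransform):
    pass


class DemoSeq2SeqDecodeTransform(DemoDecodeTransform):
    """Hypothetical task-specific subclass."""


class DemoOtherTransform(DemoTransform):
    pass


def find_tfm(tfms_list, tfm_class=DemoDecodeTransform):
    # First element whose type is `tfm_class` or a subclass of it, else None
    return next(filter(lambda el: issubclass(type(el), tfm_class), tfms_list), None)


after_batch = [DemoOtherTransform(), DemoSeq2SeqDecodeTransform(), DemoDecodeTransform()]
found = find_tfm(after_batch)  # the subclass instance comes first, so it wins
missing = find_tfm([DemoOtherTransform()])  # nothing matches
```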

In [None]:
# |export
def first_blurr_tfm(
    # Your fast.ai `DataLoaders`
    dls: DataLoaders,
    # The Blurr transforms to look for in order
    tfms: list[Transform] = [BatchDecodeTransform],
):
    """
    This convenience method will find the first Blurr transform required for methods such as
    `show_batch` and `show_results`. The returned transform should have everything you need to properly
    decode and 'show' your Hugging Face inputs/targets
    """
    for tfm in tfms:
        found_tfm = get_blurr_tfm(dls.before_batch, tfm_class=tfm)
        if found_tfm:
            return found_tfm

        found_tfm = get_blurr_tfm(dls.after_batch, tfm_class=tfm)
        if found_tfm:
            return found_tfm

In [None]:
nbdev.show_doc(first_blurr_tfm, title_level=3)

---

[source](https://github.com/ohmeow/blurr/blob/dev-3.0.0/blurr/data/core.py#L234){target="_blank" style="float:right; font-size:smaller"}

### first_blurr_tfm

>      first_blurr_tfm (dls:fastai.data.core.DataLoaders,
>                       tfms:list[fastcore.transform.Transform]=[<class
>                       '__main__.BatchDecodeTransform'>])

This convenience method will find the first Blurr transform required for methods such as
`show_batch` and `show_results`. The returned transform should have everything you need to properly
decode and 'show' your Hugging Face inputs/targets

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| dls | DataLoaders |  | Your fast.ai `DataLoaders` |
| tfms | list[Transform] | [<class '__main__.BatchDecodeTransform'>] | The Blurr transforms to look for in order |
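
The search order is easy to miss: for each class in `tfms`, `before_batch` is checked first and `after_batch` second, so the order of `tfms` takes precedence over which pipeline a transform lives in. A small sketch of that precedence, using a `SimpleNamespace` as a hypothetical stand-in for `DataLoaders` and made-up `Demo*` transform classes:

```python
from types import SimpleNamespace


class DemoTokenizeTfm:
    pass


class DemoDecodeTfm:
    pass


def find_tfm(tfms_list, tfm_class):
    return next((t for t in tfms_list if isinstance(t, tfm_class)), None)


def first_tfm(dls, tfms):
    # Outer loop: requested classes in order; inner loop: before_batch, then after_batch
    for tfm_class in tfms:
        for pipeline in (dls.before_batch, dls.after_batch):
            found = find_tfm(pipeline, tfm_class)
            if found:
                return found
    return None


tok, dec = DemoTokenizeTfm(), DemoDecodeTfm()
dls = SimpleNamespace(before_batch=[tok], after_batch=[dec])

only_decode = first_tfm(dls, [DemoDecodeTfm])  # found in after_batch
tokenize_first = first_tfm(dls, [DemoTokenizeTfm, DemoDecodeTfm])  # first class in the list wins
nothing = first_tfm(dls, [int])  # no match in either pipeline
```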

### `show_batch` -

In [None]:
# |export
@typedispatch
def show_batch(
    # This typedispatched `show_batch` will be called for `TextInput` typed inputs
    x: TextInput,
    # Your targets
    y,
    # Your raw inputs/targets
    samples,
    # Your `DataLoaders`. This is required so as to get at the Hugging Face objects for
    # decoding them into something understandable
    dataloaders,
    # Your `show_batch` context
    ctxs=None,
    # The maximum number of items to show
    max_n=6,
    # Any truncation you want applied to your decoded inputs
    trunc_at=None,
    # Any other keyword arguments you want applied to `show_batch`
    **kwargs,
):
    # grab our tokenizer
    tfm = first_blurr_tfm(dataloaders)
    hf_tokenizer = tfm.hf_tokenizer

    # if we've included our label_names list, we'll use it to look up the value of our target(s)
    trg_label_names = tfm.kwargs["label_names"] if ("label_names" in tfm.kwargs) else None
    if trg_label_names is None and dataloaders.vocab is not None:
        trg_label_names = dataloaders.vocab

    res = L()
    n_inp = dataloaders.n_inp

    n_samples = min(max_n, dataloaders.bs)
    for idx in range(n_samples):
        input_ids = x[idx]
        rets = [hf_tokenizer.decode(input_ids, skip_special_tokens=True)[:trunc_at]]

        sample = samples[idx] if samples is not None else None
        for item_idx, item in enumerate(sample[n_inp:]):
            label = y[item_idx] if y is not None else item

            if torch.is_tensor(label):
                label = list(label.numpy()) if len(label.size()) > 0 else label.item()

            if is_listy(label):
                trg = [trg_label_names[int(idx)] for idx, val in enumerate(label) if (val == 1)] if trg_label_names else label
            else:
                trg = trg_label_names[int(item)] if trg_label_names else item

            rets.append(trg)
        res.append(tuplify(rets))

    cols = ["text"] + ["target" if (i == 0) else f"target_{i}" for i in range(len(res[0]) - n_inp)]
    display_df(pd.DataFrame(res, columns=cols)[:max_n])
    return ctxs
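
The target lookup inside `show_batch` handles two shapes of label: a single class index (multiclass) and a multi-hot vector (multilabel). A standalone sketch of that branch follows; `decode_target` is a hypothetical helper, and the label names are made up for illustration.

```python
def decode_target(label, label_names=None):
    """Hypothetical helper mirroring the target lookup in `show_batch`."""
    if isinstance(label, (list, tuple)):
        # Multilabel: keep the names whose position is switched on
        if not label_names:
            return list(label)
        return [label_names[i] for i, val in enumerate(label) if val == 1]
    # Multiclass: a single class index
    return label_names[int(label)] if label_names else label


sentiment_names = ["neg", "pos"]  # assumed order, for illustration only
toxicity_names = ["label_a", "label_b", "label_c"]  # placeholder names

single = decode_target(1, sentiment_names)
multi = decode_target([1, 0, 1], toxicity_names)
empty = decode_target([0, 0, 0], toxicity_names)
raw = decode_target(3)
```

An all-zero multi-hot vector decodes to an empty list, which is why a multilabel `show_batch` can display `[]` as a target.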

### `sorted_dl_func` -

In [None]:
# |export
def sorted_dl_func(
    example,
    # A Hugging Face tokenizer
    hf_tokenizer: PreTrainedTokenizerBase,
    # The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. \
    # Set this to 'True' if your inputs are pre-tokenized (not numericalized)
    is_split_into_words: bool = False,
    # Any other keyword arguments you want to include during tokenization
    tok_kwargs: dict = {},
):
    """This method is used by the `SortedDL` to ensure your dataset is sorted *after* tokenization"""
    txt = None
    if isinstance(example[0], dict):
        if "input_ids" in example[0]:
            # if inputs are pretokenized
            return len(example[0]["input_ids"])
        else:
            txt = example[0]["text"]
    else:
        txt = example[0]

    return len(txt) if is_split_into_words else len(hf_tokenizer.tokenize(txt, **tok_kwargs))

In [None]:
nbdev.show_doc(sorted_dl_func, title_level=3)

---

[source](https://github.com/ohmeow/blurr/blob/dev-3.0.0/blurr/data/core.py#L312){target="_blank" style="float:right; font-size:smaller"}

### sorted_dl_func

>      sorted_dl_func (example, hf_tokenizer:transformers.tokenization_utils_bas
>                      e.PreTrainedTokenizerBase,
>                      is_split_into_words:bool=False, tok_kwargs:dict={})

This method is used by the `SortedDL` to ensure your dataset is sorted *after* tokenization

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| example |  |  |  |
| hf_tokenizer | PreTrainedTokenizerBase |  | A Hugging Face tokenizer |
| is_split_into_words | bool | False | The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. \<br>Set this to 'True' if your inputs are pre-tokenized (not numericalized) |
| tok_kwargs | dict | {} | Any other keyword arguments you want to include during tokenization |
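
To illustrate what the sort key returns for each input shape, here is a small sketch that assumes a whitespace split as a stand-in for `hf_tokenizer.tokenize`. `toy_tokenize` and `length_key` are hypothetical names, and the examples are invented.

```python
def toy_tokenize(txt):
    # Hypothetical tokenizer: whitespace split stands in for `hf_tokenizer.tokenize`
    return txt.split()


def length_key(example, is_split_into_words=False):
    inp = example[0]
    if isinstance(inp, dict):
        if "input_ids" in inp:
            # Already tokenized: the length is known without tokenizing again
            return len(inp["input_ids"])
        inp = inp["text"]
    return len(inp) if is_split_into_words else len(toy_tokenize(inp))


examples = [
    ("a fairly long movie review text", 1),  # raw text
    ({"input_ids": [101, 102]}, 0),  # pretokenized dictionary
    ({"text": "short one here"}, 1),  # dictionary carrying raw text
]

keys = [length_key(ex) for ex in examples]
order = sorted(range(len(examples)), key=lambda i: keys[i], reverse=True)  # longest first
words_len = length_key((["already", "split", "into", "words"], 0), is_split_into_words=True)
```

Grouping examples of similar length this way keeps the amount of padding per batch small.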

## Low-Level API: Examples

### Multiclass

#### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


#### Step 2: `Datasets` (fastai)

In [None]:
# tokenize the dataset
tokenize_func = partial(multiclass_tokenize_func, hf_tokenizer=hf_tokenizer)
proc_imdb_ds = imdb_ds.map(tokenize_func, batched=True)

# turn Arrow into DataFrame (`ColSplitter` only works with `DataFrame`s)
train_df = pd.DataFrame(proc_imdb_ds)
train_df.head()

# define dataset splitter
splitter = ColSplitter("is_valid")
splits = splitter(imdb_df)


# define how we want to build our inputs and targets
def _build_inputs(example):
    return {fwd_arg_name: example[fwd_arg_name] for fwd_arg_name in hf_tokenizer.model_input_names if fwd_arg_name in list(example.keys())}


def _build_targets(example):
    return example["label"]


# create our fastai `Datasets` object
dsets = Datasets(items=train_df, splits=splits, tfms=[[_build_inputs], _build_targets], n_inp=1)

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

In [None]:
print("Items in train|validation datasets: ", len(dsets.train), len(dsets.valid))

example = dsets.valid[0]
# example

print(f"Items in each example: {len(example)}")
print(f"Example inputs: {list(example[0].keys())}")
print(f"Example target(s): {example[1]}")

Items in train|validation datasets:  1000 200
Items in each example: 2
Example inputs: ['input_ids', 'token_type_ids', 'attention_mask']
Example target(s): 0
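
The `_build_inputs` function above keeps only the keys the model's forward pass expects and silently skips any that are absent from the row. A minimal sketch of that filtering on a plain dictionary, where the row values are invented for illustration:

```python
# The names printed for the tokenizer in Step 1
model_input_names = ["input_ids", "token_type_ids", "attention_mask"]


def build_inputs(example, input_names=model_input_names):
    # Keep forward-pass arguments only; drop text, labels, and split flags
    return {name: example[name] for name in input_names if name in example}


row = {
    "text": "great film",
    "label": 1,
    "is_valid": False,
    "input_ids": [1, 2789, 814, 2],  # made-up token IDs
    "attention_mask": [1, 1, 1, 1],
}

inputs = build_inputs(row)
```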


#### Step 3: `DataLoaders` (fastai)

In [None]:
data_collator = TextCollatorWithPadding(hf_tokenizer)
sort_func = partial(sorted_dl_func, hf_tokenizer=hf_tokenizer)
batch_decode_tfm = BatchDecodeTransform(hf_tokenizer, hf_arch, hf_config, hf_model, label_names=label_names)

dls = dsets.dataloaders(
    batch_size=4,
    create_batch=data_collator,
    after_batch=batch_decode_tfm,
    dl_type=partial(SortedDL, sort_func=sort_func),
)

In [None]:
print("# of batches in train|validation dataloaders:", len(train_dl), len(valid_dl))

b = next(iter(train_dl))
print("# of items in each batch:", len(b))
print("")
print(f"Decoded input_ids: {hf_tokenizer.decode(b[0]['input_ids'][0][:10])} ... ")
print("Targets:", b[1])

# b

# of batches in train|validation dataloaders: 250 25
# of items in each batch: 2

Decoded input_ids: [CLS] It is no different than the guys who will ... 
Targets: tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1, 0, 0]])


In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)

Unnamed: 0,text,target
0,"I've read a number of reviews on this film and I have to say ""What is wrong with you people?!?!"" This was an excellent film! I thought this film was superb from start to finish and the story was extremely well told. I'm convinced that the people that didn't like this film weren't paying very good attention to the film. There are a number of very important scenes that if you aren't paying attention you will be confused and the following scenes may not make sense. I urge anyone who didn't like this film to watch it again and watch it alone so that you can truly pay attention. The story made perfect sense to me and as I said, was very well told. Every scene in the film has a point and everything fits together at the end of the film.<br /><br />All the actors did a fantastic job! Sean Connery",pos
1,"I'm not going to criticize the movie. There isn't that much to talk about. It has good animal actions scenes which were probably pretty astonishing at the time. Clyde Beatty isn't exactly a matinée idol. He's a little slight and not particularly good looking. But that's OK. He's the man in that lion cage. We know that when he can't take the time away from his lions to tend to his girlfriend, he will end up on an island with her and have to save the day. Someone said earlier that it is a history lesson. The scenes at the circus are of another day, especially the kids who hang around. I didn't realize that even back in the thirties, they sailed on three masted schooners. It looked like something out of 1860. I guess that's the stock footage they had. No wonder the thing got wrecked. They're",neg


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

### Multilabel

#### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", civil_label_names, verbose=True)

=== config ===
# of labels:	7

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


#### Step 2: `Datasets` (fastai)

In [None]:
# tokenize the dataset
tokenize_func = partial(multilabel_tokenize_func, hf_tokenizer=hf_tokenizer, label_attrs=civil_label_names)
proc_civil_ds = civil_ds.map(tokenize_func, batched=True)

# turn Arrow into DataFrame (`ColSplitter` only works with `DataFrame`s)
train_df = pd.DataFrame(proc_civil_ds)
train_df.head()

# define dataset splitter
splitter = ColSplitter("is_valid")
splits = splitter(civil_df)


# define how we want to build our inputs and targets
def _build_inputs(example):
    return {fwd_arg_name: example[fwd_arg_name] for fwd_arg_name in hf_tokenizer.model_input_names if fwd_arg_name in list(example.keys())}


def _build_targets(example):
    return example["label"]


# create our fastai `Datasets` object
dsets = Datasets(items=train_df, splits=splits, tfms=[[_build_inputs], _build_targets], n_inp=1)

Map:   0%|          | 0/1200 [00:00<?, ? examples/s]

In [None]:
print("Items in train|validation datasets: ", len(dsets.train), len(dsets.valid))

example = dsets.valid[0]
# example

print(f"Items in each example: {len(example)}")
print(f"Example inputs: {list(example[0].keys())}")
print(f"Example target(s): {example[1]}")

Items in train|validation datasets:  1000 200
Items in each example: 2
Example inputs: ['input_ids', 'token_type_ids', 'attention_mask']
Example target(s): [1, 0, 0, 0, 1, 0, 0]


#### Step 3: `DataLoaders` (fastai)

In [None]:
data_collator = TextCollatorWithPadding(hf_tokenizer)
sort_func = partial(sorted_dl_func, hf_tokenizer=hf_tokenizer)
batch_decode_tfm = BatchDecodeTransform(hf_tokenizer, hf_arch, hf_config, hf_model, label_names=civil_label_names)

dls = dsets.dataloaders(
    batch_size=4,
    create_batch=data_collator,
    after_batch=batch_decode_tfm,
    dl_type=partial(SortedDL, sort_func=sort_func),
)

In [None]:
print("# of batches in train|validation dataloaders:", len(train_dl), len(valid_dl))

b = next(iter(train_dl))
print("# of items in each batch:", len(b))
print("")
print(f"Decoded input_ids: {hf_tokenizer.decode(b[0]['input_ids'][0][:10])} ... ")
print("Targets:", b[1])

# b

# of batches in train|validation dataloaders: 250 25
# of items in each batch: 2

Decoded input_ids: [CLS] "Fr. Martin envisions a bridge that one ... 
Targets: tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])


In [None]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,target
0,"We live in an age of hysteria and histrionics. Critical, rational thinking is needed to deal with the deluge of propaganda emanating from the Grope & Flail and the Toronto Red Star. Head for the hills!!",[]
1,"JimmyJ: You, sir, are the BIGLY WINNER. Your prize. On February 3rd, at 6:00 a number of us are gathering at Planktown in Springfield for a beer or soda pop (sarsaparilla) and a visit......Your prize is a free beer (orsarsaparilla) on me. Actually, it should be fun....if you can make it, please do. Should you have any questions, please call me. best regards, Gary",[]


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

## Mid-Level API

BLURR's mid-level API provides a way to build your `DataLoaders` with fast.ai's mid-level `DataBlock` API, using one of these three tokenization strategies:

1. Pre-tokenized data (the traditional approach)

2. Batch-time tokenization (the default approach in previous versions of blurr)

3. Item-time tokenization (i.e., tokenization is applied to individual items as they are pulled from their respective `Dataset`)

### `BatchTokenizeTransform` -

In [None]:
# |export
class BatchTokenizeTransform(Transform):
    """
    Handles everything you need to assemble a mini-batch of inputs and targets, as well as
    decode the dictionary produced as a byproduct of the tokenization process in the `encodes` method.
    """

    def __init__(
        self,
        # The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
        hf_arch: str,
        # A specific configuration instance you want to use
        hf_config: PretrainedConfig,
        # A Hugging Face tokenizer
        hf_tokenizer: PreTrainedTokenizerBase,
        # A Hugging Face model
        hf_model: PreTrainedModel,
        # Controls whether the labels are included in your inputs. If they are, the loss will be calculated in \
        # the model's forward function and you can simply use `PreCalculatedLoss` as your `Learner`'s loss function to use it
        include_labels: bool = True,
        # The token ID that should be ignored when calculating the loss
        ignore_token_id: int = CrossEntropyLossFlat().ignore_index,
        # To control the length of the padding/truncation. It can be an integer or None, \
        # in which case it will default to the maximum length the model can accept. \
        # If the model has no specific maximum input length, truncation/padding to max_length is deactivated. \
        # See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)
        max_length: int = None,
        # To control the `padding` applied to your `hf_tokenizer` during tokenization. \
        # If None, will default to 'False' or 'do_not_pad'. \
        # See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)
        padding: bool | str = True,
        # To control `truncation` applied to your `hf_tokenizer` during tokenization. \
        # If None, will default to 'False' or 'do_not_truncate'. \
        # See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)
        truncation: bool | str = True,
        # The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. \
        # Set this to 'True' if your inputs are pre-tokenized (not numericalized)
        is_split_into_words: bool = False,
        # Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs
        tok_kwargs: dict = {},
        # Keyword arguments to apply to `BatchTokenizeTransform`
        **kwargs,
    ):
        store_attr()
        self.kwargs = kwargs

    def encodes(self, samples, return_batch_encoding=False):
        """
        This method performs on-the-fly, batch-time tokenization of your data. In other words, your raw inputs
        are tokenized as needed for each mini-batch of data rather than requiring pre-tokenization of your full
        dataset ahead of time.
        """
        samples = L(samples)

        # grab inputs
        is_dict = isinstance(samples[0][0], dict)
        test_inp = samples[0][0]["text"] if is_dict else samples[0][0]

        if is_listy(test_inp) and not self.is_split_into_words:
            if is_dict:
                inps = [(item["text"][0], item["text"][1]) for item in samples.itemgot(0).items]
            else:
                inps = list(zip(samples.itemgot(0, 0), samples.itemgot(0, 1)))
        else:
            inps = [item["text"] for item in samples.itemgot(0).items] if is_dict else samples.itemgot(0).items

        inputs = self.hf_tokenizer(
            inps,
            max_length=self.max_length,
            padding=self.padding,
            truncation=self.truncation,
            is_split_into_words=self.is_split_into_words,
            return_tensors="pt",
            **self.tok_kwargs,
        )

        d_keys = inputs.keys()

        # update the samples with tokenized inputs (e.g. input_ids, attention_mask, etc...), as well as any extra
        # information if the inputs are dictionaries.
        # (< 2.0.0): updated_samples = [(*[{k: inputs[k][idx] for k in d_keys}], *sample[1:]) for idx, sample in enumerate(samples)]
        updated_samples = []
        for idx, sample in enumerate(samples):
            inps = {k: inputs[k][idx] for k in d_keys}
            if is_dict:
                inps = {
                    **inps,
                    **{k: v for k, v in sample[0].items() if k not in ["text"]},
                }

            trgs = sample[1:]
            if self.include_labels and len(trgs) > 0:
                inps["label"] = trgs[0]

            updated_samples.append((*[inps], *trgs))

        if return_batch_encoding:
            return updated_samples, inputs

        return updated_samples

Inspired by this [article](https://docs.fast.ai/tutorial.transformers.html), `BatchTokenizeTransform` inputs can come in as raw **text**, **a list of words** (e.g., tasks like Named Entity Recognition (NER), where you want to predict the label of each token), or as a **dictionary** that includes extra information you want to use during post-processing.

Part of the inspiration for this derives from the mechanics of Hugging Face tokenizers; in particular, a tokenizer can return a collated mini-batch of data given a list of sequences. As such, the collating required for our inputs can be done during tokenization in a `before_batch_tfms` transform (where we get a list of examples), ***before*** our batch transforms run! This allows users of BLURR to have everything done dynamically at batch-time without prior preprocessing, with at least five potential benefits:

1. No need to pretokenize your text
2. Less code
3. Faster mini-batch creation
4. Less RAM utilization and time spent tokenizing beforehand (this really helps with very large datasets)
5. Batch level flexibility
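
The core idea can be sketched without any libraries: tokenize the raw samples of one mini-batch and pad only to the longest sequence in that batch. The toy tokenizer below (IDs assigned on first sight, `PAD_ID` of 0) is an assumption made for illustration; a real Hugging Face tokenizer does this work when called with a list of texts and `padding=True`.

```python
PAD_ID = 0  # assumed padding ID for this toy example


def toy_encode(text, vocab):
    # Hypothetical tokenizer: assign a new integer ID the first time a word is seen
    return [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]


def tokenize_batch(samples, vocab, max_length=None):
    """Tokenize raw (text, label) samples and pad to the longest sequence in this batch."""
    encoded = [toy_encode(text, vocab)[:max_length] for text, _ in samples]
    longest = max(len(ids) for ids in encoded)

    batch = []
    for ids, (_, label) in zip(encoded, samples):
        n_pad = longest - len(ids)
        inputs = {
            "input_ids": ids + [PAD_ID] * n_pad,
            "attention_mask": [1] * len(ids) + [0] * n_pad,
        }
        batch.append((inputs, label))
    return batch


samples = [("a short one", 0), ("a much longer review of the film", 1)]
batch = tokenize_batch(samples, vocab={})
```

Because the padded length depends on the batch rather than on the whole dataset, batches of short texts stay short, which is where the speed and memory benefits listed above come from.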

### `ItemTokenizeTransform` -

In [None]:
# |export
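class ItemTokenizeTransform(ItemTransform):
    """Tokenizes a single item with your Hugging Face tokenizer at item-time (e.g., when an item is pulled from a dataset)"""

    split_idx = None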

    def __init__(
        self,
        # A Hugging Face configuration object
        hf_config: PretrainedConfig = None,
        # A Hugging Face tokenizer
        hf_tokenizer: PreTrainedTokenizerBase = None,
        # Any keyword arguments you want your Hugging Face tokenizer to use during tokenization
        tok_kwargs: dict = {},
        # Any keyword arguments you want applied to `ItemTokenizeTransform`
        **kwargs,
    ) -> None:
        store_attr()

        if tok_kwargs.get("truncation", None) is None:
            tok_kwargs["truncation"] = True
        if tok_kwargs.get("max_length", None) is None:
            tok_kwargs["max_length"] = True

    def encodes(self, txt, **kwargs):
        inputs = self.hf_tokenizer(txt, **self.tok_kwargs)
        return dict(inputs)

Whereas the `BatchTokenizeTransform` allows you to tokenize text at batch-time, `ItemTokenizeTransform` allows you to tokenize text at item-time (e.g., when you get an item out of the dataset). This may be very useful if applying some form of data augmentation to your inputs.

In order for this transform to run when getting an item, it needs to be specified as one of the `type_tfms` in your `DataBlock`.

Potential benefits:

1. No need to pretokenize your text
2. The ability to apply data augmentation each time an item is pulled from your dataset
3. Less RAM utilization and time spent tokenizing beforehand (this really helps with very large datasets)
4. Item level flexibility
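
A rough sketch of why item-time tokenization pairs well with augmentation: both steps run every time an item is fetched, so the same raw text can yield different token sequences across epochs. Everything here is hypothetical (`word_dropout`, `toy_tokenize`, `ItemTimeDataset`), with word lengths standing in for real token IDs.

```python
import random


def word_dropout(text, rng, p=0.3):
    """Hypothetical augmentation: randomly drop words, keeping at least one."""
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept or words[:1])


def toy_tokenize(text):
    # Hypothetical tokenizer: word lengths stand in for real token IDs
    return {"input_ids": [len(w) for w in text.split()]}


class ItemTimeDataset:
    def __init__(self, texts, rng):
        self.texts, self.rng = texts, rng

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Augment first, then tokenize, on every access
        augmented = word_dropout(self.texts[idx], self.rng)
        return toy_tokenize(augmented)


ds = ItemTimeDataset(["the quick brown fox jumps"], rng=random.Random(0))
lengths = [len(ds[0]["input_ids"]) for _ in range(5)]  # may differ between accesses
```

With pretokenized data the augmentation would have to be baked in once, ahead of time; here it is re-sampled per access at the cost of tokenizing repeatedly.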

### `TextBlock` -

In [None]:
# |export
class TextBlock(TransformBlock):
    """The core `TransformBlock` to prepare your inputs for training in Blurr with fastai's `DataBlock` API"""

    def __init__(
        self,
        # The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an \
        # instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_arch: str = None,
        # A Hugging Face configuration object (not required if passing in an \
        # instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_config: PretrainedConfig = None,
        # A Hugging Face tokenizer (not required if passing in an \
        # instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_tokenizer: PreTrainedTokenizerBase = None,
        # A Hugging Face model (not required if passing in an \
        # instance of `BatchTokenizeTransform` to `before_batch_tfm`)
        hf_model: PreTrainedModel = None,
        # Any transforms to apply when getting an item from a dataset (useful for item-time tokenization)
        type_tfms: list[ItemTokenizeTransform] = None,
        # The "before_batch" transform you want to use if tokenizing your raw data on the fly (optional)
        tokenize_tfm: Transform = None,
        # The batch_tfm you want to decode your inputs into a type that can be used in the fastai show methods, \
        # (defaults to BatchDecodeTransform)
        batch_decode_tfm: BatchDecodeTransform = None,
        # An instance of `TextCollatorWithPadding` to use when not performing batch-time tokenization, \
        # (defaults to `TextCollatorWithPadding` when using pretokenized or item-time tokenization)
        data_collator: TextCollatorWithPadding = None,
        # Controls whether the labels are included in your inputs. If they are, the loss will be calculated in \
        # the model's forward function and you can simply use `PreCalculatedLoss` as your `Learner`'s loss function to use it
        include_labels: bool = True,
        # The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. \
        # Set this to `True` if your inputs are pre-tokenized (not numericalized)
        is_split_into_words: bool = False,
        # The return type your decoded inputs should be cast to (used by methods such as `show_batch`)
        input_return_type: type = TextInput,
        # The type of `DataLoader` you want created (defaults to `SortedDL`)
        dl_type: DataLoader = None,
        # Any keyword arguments you want applied to your `batch_decode_tfm` (will be set as a fastai `batch_tfms`)
        batch_decode_kwargs: dict = {},
        # Any keyword arguments you want your Hugging Face tokenizer to use during tokenization
        tok_kwargs: dict = {},
        # Any keyword arguments you want to have applied when generating text
        text_gen_kwargs: dict = {},
        # Any keyword arguments you want applied to `TextBlock`
        **kwargs,
    ):
        if (not all([hf_arch, hf_config, hf_tokenizer, hf_model])) and tokenize_tfm is None:
            raise ValueError("You must supply an hf_arch, hf_config, hf_tokenizer, hf_model -or- a tokenize_tfm")

        # if we are using a transform to tokenize our inputs, grab the HF objects from it
        if tokenize_tfm is not None:
            hf_arch = getattr(tokenize_tfm, "hf_arch", hf_arch)
            hf_config = getattr(tokenize_tfm, "hf_config", hf_config)
            hf_tokenizer = getattr(tokenize_tfm, "hf_tokenizer", hf_tokenizer)
            hf_model = getattr(tokenize_tfm, "hf_model", hf_model)
            is_split_into_words = getattr(tokenize_tfm, "is_split_into_words", is_split_into_words)
            include_labels = getattr(tokenize_tfm, "include_labels", include_labels)

        # configure our batch decode transform (used by show_batch/results methods)
        if batch_decode_tfm is None:
            batch_decode_tfm = BatchDecodeTransform(
                hf_arch=hf_arch,
                hf_config=hf_config,
                hf_tokenizer=hf_tokenizer,
                hf_model=hf_model,
                input_return_type=input_return_type,
                **batch_decode_kwargs.copy(),
            )

        # default to SortedDL using our custom sort function if no `dl_type` is specified
        if dl_type is None:
            dl_sort_func = partial(
                sorted_dl_func, hf_tokenizer=hf_tokenizer, is_split_into_words=is_split_into_words, tok_kwargs=tok_kwargs.copy()
            )
            dl_type = partial(SortedDL, sort_func=dl_sort_func)

        # build our custom `TransformBlock`
        if tokenize_tfm is None:
            if data_collator is None:
                data_collator = TextCollatorWithPadding(hf_tokenizer)
            dl_kwargs = {"create_batch": data_collator}
        else:
            dl_kwargs = {"before_batch": tokenize_tfm}

        return super().__init__(dl_type=dl_type, dls_kwargs=dl_kwargs, type_tfms=type_tfms, batch_tfms=batch_decode_tfm)
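
The last branch of `TextBlock.__init__` decides which `DataLoader` hook does the batching work. The sketch below isolates that decision in a hypothetical helper, `resolve_dls_kwargs`, with strings standing in for the real collator and transform objects.

```python
def resolve_dls_kwargs(tokenize_tfm=None, data_collator=None, default_collator="default-collator"):
    """Hypothetical helper mirroring the final branch of `TextBlock.__init__`."""
    if tokenize_tfm is None:
        # Pretokenized or item-time tokenization: items already hold token IDs,
        # so a collator only needs to pad them into a batch
        collator = default_collator if data_collator is None else data_collator
        return {"create_batch": collator}
    # Batch-time tokenization: the transform receives the raw samples before batching
    return {"before_batch": tokenize_tfm}


pretokenized = resolve_dls_kwargs()
custom = resolve_dls_kwargs(data_collator="my-collator")
batch_time = resolve_dls_kwargs(tokenize_tfm="my-tokenize-tfm", data_collator="my-collator")
```

Note that when a `tokenize_tfm` is supplied, any `data_collator` passed alongside it is not used.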

## Mid-Level API: Examples

### Pretokenized

#### Multiclass

##### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `DataBlock`

In [None]:
# define DataBlock splitter
def _split_func(example):
    return example["is_valid"] == True


# define how we want to build our targets
# note: we don't need to define how to build our inputs because we're using an HF `Dataset` in this example
def get_y(example):
    return example["label"]


# define the DataBlock
txt_block = TextBlock(
    hf_arch=hf_arch, hf_config=hf_config, hf_tokenizer=hf_tokenizer, hf_model=hf_model, batch_decode_kwargs={"label_names": label_names}
)

blocks = (txt_block, CategoryBlock)
dblock = DataBlock(blocks=blocks, get_y=get_y, splitter=FuncSplitter(_split_func))

##### Step 3: `DataLoaders`

In [None]:
# tokenize the dataset
tokenize_func = partial(multiclass_tokenize_func, hf_tokenizer=hf_tokenizer)
proc_imdb_ds = imdb_ds.map(tokenize_func, batched=True)

# build your `DataLoaders`
dls = dblock.dataloaders(proc_imdb_ds, bs=4)

Loading cached processed dataset at /home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-172edc75a088c5ea.arrow


In [None]:
b = dls.one_batch()
print("# of items in each batch:", len(b))
print("# of inputs in each batch:", len(b[0]["input_ids"]))
print("# of targets in each batch:", len(b[1]))
print("Shape of input_ids (bsz, seq):", b[0]["input_ids"].shape)

# of items in each batch: 2
# of inputs in each batch: 4
# of targets in each batch: 4
Shape of input_ids (bsz, seq): torch.Size([4, 1742])


Let's take a look at the actual types represented by our batch

In [None]:
explode_types(b)

{tuple: [dict, fastai.torch_core.TensorCategory]}

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,"I've read a number of reviews on this film and I have to say ""What is wrong with you people?!?!"" This was an excellent film! I thought this film was superb from start to finish and the story was extremely well told. I'm convinced that the people that didn't like this film weren't paying very good attention to the film. There are a number of very important scenes that if you aren't paying attention you will be confused and the following scenes may not make sense. I urge anyone who didn't like thi",pos
1,"this is one of the funniest shows i have ever seen. it is really refreshing to watch and i was in stitches many times. i guess there is a social awareness factor to this too which makes it quite interesting. if these were white girls would they get the same reaction? maybe they would, maybe they wouldn't? the characters know no limits (check my lyrics) and do not exclude anyone from their twisted sense of fun!There are so many funny sketches. my favorites is the bob the builder one. it's so sill",pos


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

#### Multilabel

##### Step 1: HF objects

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", civil_label_names, verbose=True)

=== config ===
# of labels:	7

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `DataBlock`

In [None]:
# define DataBlock splitter
def _split_func(example):
    return example["is_valid"] == True


# define how we want to build our targets
# note: we don't need to define how to build our inputs because we're using an HF `Dataset` in this example
def get_y(example):
    return example["label"]


# define the DataBlock
blocks = (
    TextBlock(hf_arch=hf_arch, hf_config=hf_config, hf_tokenizer=hf_tokenizer, hf_model=hf_model),
    MultiCategoryBlock(encoded=True, vocab=civil_label_names),
)
dblock = DataBlock(blocks=blocks, get_y=get_y, splitter=FuncSplitter(_split_func))

##### Step 3: `DataLoaders`

In [None]:
# tokenize the dataset
tokenize_func = partial(multilabel_tokenize_func, hf_tokenizer=hf_tokenizer, label_attrs=civil_label_names)
proc_civil_ds = civil_ds.map(tokenize_func, batched=True)

dls = dblock.dataloaders(proc_civil_ds, bs=4)

Loading cached processed dataset at /home/wgilliam/.cache/huggingface/datasets/civil_comments/default/0.9.0/e7a3aacd2ab7d135fa958e7209d10b1fa03807d44c486e3c34897aa08ea8ffab/cache-1f3711404ab09b4a.arrow


In [None]:
b = dls.one_batch()
print("# of items in each batch:", len(b))
print("# of inputs in each batch:", len(b[0]["input_ids"]))
print("# of targets in each batch:", len(b[1]))
print("Shape of input_ids (bsz, seq):", b[0]["input_ids"].shape)

# of items in each batch: 2
# of inputs in each batch: 4
# of targets in each batch: 4
Shape of input_ids (bsz, seq): torch.Size([4, 244])


Let's take a look at the actual types represented by our batch

In [None]:
explode_types(b)

{tuple: [dict, fastai.torch_core.TensorMultiCategory]}

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,"We live in an age of hysteria and histrionics. Critical, rational thinking is needed to deal with the deluge of propaganda emanating from the Grope & Flail and the Toronto Red Star. Head for the hills!!",[]
1,"Yes. Eat meat, just eat less of it and push for policies that ensure animals are treated as humanely as possible. Same for wearing animal products. Choose non animal products first. When we know better, we do better. I don't eat pate and I won't be buying down products in the future.",[]
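
Both targets above are shown as `[]`, presumably because neither comment carries any of the labels. With `encoded=True`, each target is assumed to arrive as a multi-hot vector with one slot per vocab entry, and decoding keeps only the names whose slot is set. A minimal sketch of that decoding, using a hypothetical `decode_multi_hot` helper and an illustrative three-label vocab (the notebook's `civil_label_names` has 7 entries):

```python
def decode_multi_hot(target, vocab):
    """Map a multi-hot vector to the label names whose position is set."""
    if len(target) != len(vocab):
        raise ValueError("target and vocab must have the same length")
    return [name for flag, name in zip(target, vocab) if flag == 1]


vocab = ["toxicity", "insult", "threat"]  # illustrative subset, not the full list
no_labels = decode_multi_hot([0, 0, 0], vocab)
two_labels = decode_multi_hot([1, 1, 0], vocab)
print(no_labels)   # []
print(two_labels)  # ['toxicity', 'insult']
```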


In [None]:
# |echo:false
try:
    del dls, hf_model
except NameError:
    pass
finally:
    clean_memory()

### Batch-Time Tokenization

#### Multiclass

##### Step 1: HF objects.

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `DataBlock`

In [None]:
tokenize_tfm = BatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model)

# note: we pass the label_names here because the labels in the dataset are already encoded as 0 or 1
blocks = (
    TextBlock(tokenize_tfm=tokenize_tfm, batch_decode_kwargs={"label_names": label_names}),
    CategoryBlock,
)
dblock = DataBlock(
    blocks=blocks,
    get_x=ColReader("text"),
    get_y=ColReader("label"),
    splitter=ColSplitter(),
)

##### Step 3: `DataLoaders`

In [None]:
dls = dblock.dataloaders(imdb_df, bs=4)

In [None]:
b = dls.one_batch()
print("# of items in each batch:", len(b))
print("# of inputs in each batch:", len(b[0]["input_ids"]))
print("# of targets in each batch:", len(b[1]))
print("Shape of input_ids (bsz, seq):", b[0]["input_ids"].shape)

# of items in each batch: 2
# of inputs in each batch: 4
# of targets in each batch: 4
Shape of input_ids (bsz, seq): torch.Size([4, 1742])
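
The sequence dimension of 1742 reflects the longest example in this particular batch. With batch-time tokenization, shorter examples are presumably padded up to that length when the batch is assembled. A minimal sketch of that padding step, assuming a hypothetical `pad_batch` helper, a pad token id of 0, and toy token ids rather than real tokenizer output:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the longest item in the batch, rather than to a fixed maximum, keeps batches of short texts small, which is one reason the second dimension varies from batch to batch.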


Let's take a look at the actual types represented by our batch

In [None]:
explode_types(b)

{tuple: [dict, fastai.torch_core.TensorCategory]}

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,"CitizenX(1995) is the developing world's answer to Silence of the Lambs. Where `Silence' terrorized our peace of mind, `Citizen' exhausts and saddens us instead. This dramatization of the Chikatilo case translates rather well, thanks to a Westernized friendship between two Rostov cops who become equals.<br /><br />CitizenX may also argue against(!) the death penalty far better than Kevin Spacey's The Life of David Gayle(2002).<br /><br />Humans are Machiavellian mammals, under which lie limbic b",pos
1,"In 1983, Director Brian De Palma set out to make a film about the rise and fall of an American gangster, and that he did-- with the help of a terrific screenplay by Oliver Stone and some impeccable work by an outstanding cast. The result was `Scarface,' starring Al Pacino in one of his most memorable roles. The story begins in May of 1980, when Castro opened the harbor at Mariel, Cuba, to allow Cuban nationals to join their families in the United States. 125,000 left Cuba at that time, for the g",pos


In [None]:
# |echo:false
try:
    del dls, hf_model
except NameError:
    pass
finally:
    clean_memory()

#### Multilabel

##### Step 1: HF objects.

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", civil_label_names, verbose=True)

=== config ===
# of labels:	7

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `DataBlock`

In [None]:
tokenize_tfm = BatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model)

blocks = (TextBlock(tokenize_tfm=tokenize_tfm), MultiCategoryBlock(encoded=True, vocab=civil_label_names))
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader(civil_label_names), splitter=ColSplitter())

##### Step 3: `DataLoaders`

In [None]:
dls = dblock.dataloaders(civil_df, bs=4)

In [None]:
b = dls.one_batch()
print("# of items in each batch:", len(b))
print("# of inputs in each batch:", len(b[0]["input_ids"]))
print("# of targets in each batch:", len(b[1]))
print("Shape of input_ids (bsz, seq):", b[0]["input_ids"].shape)

# of items in each batch: 2
# of inputs in each batch: 4
# of targets in each batch: 4
Shape of input_ids (bsz, seq): torch.Size([4, 244])


Let's take a look at the actual types represented by our batch

In [None]:
explode_types(b)

{tuple: [dict, fastai.torch_core.TensorMultiCategory]}

In [None]:
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,"You're much more generous than I could be on this subject. Little doubt this article (press release) is for the purpose of outrage, and intended for monies to mysteriously be available. After all, the biennial budget for WSU is $1,539,578,000, i.e. $1.54 Billion! Have a look for a laugh: http://fiscal.wa.gov/BudgetOAgy.aspx Select under ""Agency"", Washington State University. Then, have the laugh that they're dropping $13 million of stuff designed for public outcry while retaining a cool $1 milli",[]
1,"The Congress gets to choose. They have been doing that since the beginning of the country. The ability for the Congress, the President and the Attorney General to do whatever they want in their sole discretion under those statutes goes back that far. The real restrictions have come from Congress, mostly since 1952 and on. The last Immigration case was Sessions v. Morales-Santana. (2017) In that case, they stuck closely to what the Congress had actually written in the law and ruled against the im",[]


In [None]:
# |echo:false
try:
    del dls, hf_model
except NameError:
    pass
finally:
    clean_memory()

### Item-Time Tokenization

#### Multiclass

##### Step 1: HF objects.

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", label_names, verbose=True)

=== config ===
# of labels:	2

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `DataBlock`

In [None]:
tfm = ItemTokenizeTransform(hf_config=hf_config, hf_tokenizer=hf_tokenizer)

blocks = (
    TextBlock(
        hf_arch=hf_arch,
        hf_config=hf_config,
        hf_tokenizer=hf_tokenizer,
        hf_model=hf_model,
        type_tfms=[tfm],
        batch_decode_kwargs={"label_names": label_names},
    ),
    CategoryBlock,
)

dblock = DataBlock(
    blocks=blocks,
    get_x=ColReader("text"),
    get_y=ColReader("label"),
    splitter=ColSplitter(),
)
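
Because the transform is passed via `type_tfms`, it runs on each item individually as the item is fetched, and combining items into a batch happens afterwards. A rough sketch of that two-stage flow, using a toy whitespace tokenizer and hypothetical helper names (`toy_tokenize`, `collate`) rather than the real `ItemTokenizeTransform`:

```python
def toy_tokenize(text, vocab):
    """Item-level step: map whitespace-separated words to ids, growing the vocab."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def collate(items, pad_id=0):
    """Batch-level step: pad already-tokenized items to a common length."""
    max_len = max(len(ids) for ids in items)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in items]


vocab = {}
texts = ["a great film", "dreadful"]
tokenized_items = [toy_tokenize(t, vocab) for t in texts]  # one call per item
batch = collate(tokenized_items)  # one call per batch
print(batch)  # [[1, 2, 3], [4, 0, 0]]
```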

##### Step 3: `DataLoaders`

In [None]:
dls = dblock.dataloaders(imdb_df, bs=4)

In [None]:
b = dls.valid.one_batch()
print("# of items in each batch:", len(b))
print("# of inputs in each batch:", len(b[0]["input_ids"]))
print("# of targets in each batch:", len(b[1]))
print("Shape of input_ids (bsz, seq):", b[0]["input_ids"].shape)

# of items in each batch: 2
# of inputs in each batch: 4
# of targets in each batch: 4
Shape of input_ids (bsz, seq): torch.Size([4, 521])


Let's take a look at the actual types represented by our batch

In [None]:
explode_types(b)

{tuple: [dict, fastai.torch_core.TensorCategory]}

In [None]:
dls.valid.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,Dreadful! A friend of mine (who obviously thought I had an abysmal sense of humour) recommended this.<br /><br />It's bobbins. I almost switched it off. It is only my anal desire to not leave things unfinished that prevented me doing so.<br /><br />This was evidently a British attempt to make a movie with a bunch of also ran TV actors using some lame script from their mate in the business. I struggle to think of anything even approaching the paucity of this movie. Less funny than global warming.,neg
1,"First, I'm sorry for my English. Second, the true story of this episode: 39 soldiers, operation ""Magistral'"". 6 soldiers were killed. Hundreds of insurgents were killed too. Within 10 years the Soviet Army has lost less than 15 thousand person and killed over 900000 insurgents and civilians. There is no insurgents without permanent help of USA The veteran of war: ""Traditions. There are no traditions in this film. There is no military oath, there is no first jump, no farewell to the Fighting Bann",neg


In [None]:
# |echo:false
try:
    del dls, hf_model
except:
    pass
finally:
    clean_memory()

#### Multilabel

##### Step 1: HF objects.

In [None]:
hf_arch, hf_config, hf_tokenizer, hf_model = get_task_hf_objects("microsoft/deberta-v3-small", civil_label_names, verbose=True)

=== config ===
# of labels:	7

=== tokenizer ===
Vocab size:		128000
Max # of tokens:	1000000000000000019884624838656
Attributes expected by model in forward pass:	['input_ids', 'token_type_ids', 'attention_mask']


##### Step 2: `DataBlock`

In [None]:
tfm = ItemTokenizeTransform(hf_config=hf_config, hf_tokenizer=hf_tokenizer)

blocks = (
    TextBlock(hf_arch=hf_arch, hf_config=hf_config, hf_tokenizer=hf_tokenizer, hf_model=hf_model, type_tfms=[tfm]),
    MultiCategoryBlock(encoded=True, vocab=civil_label_names),
)

dblock = DataBlock(
    blocks=blocks,
    get_x=ColReader("text"),
    get_y=ColReader(civil_label_names),
    splitter=ColSplitter(),
)

##### Step 3: `DataLoaders`

In [None]:
dls = dblock.dataloaders(civil_df, bs=4)

In [None]:
b = dls.valid.one_batch()
print("# of items in each batch:", len(b))
print("# of inputs in each batch:", len(b[0]["input_ids"]))
print("# of targets in each batch:", len(b[1]))
print("Shape of input_ids (bsz, seq):", b[0]["input_ids"].shape)

# of items in each batch: 2
# of inputs in each batch: 4
# of targets in each batch: 4
Shape of input_ids (bsz, seq): torch.Size([4, 168])


Let's take a look at the actual types represented by our batch

In [None]:
explode_types(b)

{tuple: [dict, fastai.torch_core.TensorMultiCategory]}

In [None]:
dls.valid.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Unnamed: 0,text,target
0,....and useful idiots.,"[toxicity, insult]"
1,"Even some Canadians have been implicated; UN figures show that three Canadian police officers deployed to Haiti have been accused of sexual abuse or exploitation since 2015. ----- Nice. It's not enough that the RCMP are attacking female RCMP members and women in Canada, who are somewhat better equipped to stand up against the bullying, harassment and violence, the RCMP are now exporting their criminality and sexually exploitative behavior to highly vulnerable young women in 3rd world countries.","[toxicity, insult]"


In [None]:
# |echo:false
try:
    del dls, hf_model
except NameError:
    pass
finally:
    clean_memory()

## Tests

The tests below ensure that the core `DataBlock` code above works for **all** pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they take to run and how much data they would need to download.

**Note**: Feel free to modify the code below to test whatever pretrained classification models you are working with. If any of your pretrained sequence classification models fail, please submit a GitHub issue *(or a PR if you'd like to fix it yourself)*.
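
One way to structure such a check is to loop over checkpoint names, run the same smoke test on each, and record failures rather than stopping at the first one. The sketch below assumes a hypothetical `check_fn` that would build the `DataLoaders` for a checkpoint and pull one batch. A stub stands in for it here, and `example/broken-model` is a made-up name, so the loop can run without downloading anything.

```python
import traceback


def run_model_checks(model_names, check_fn):
    """Run `check_fn` on each name and collect (name, status, error) rows."""
    results = []
    for name in model_names:
        try:
            check_fn(name)
            results.append((name, "PASSED", ""))
        except Exception:
            # Keep only the last line of the traceback for a compact report.
            err = traceback.format_exc().strip().splitlines()[-1]
            results.append((name, "FAILED", err))
    return results


def stub_check(name):
    # Placeholder for real work such as building DataLoaders and calling one_batch().
    if "broken" in name:
        raise ValueError(f"could not build a batch for {name}")


results = run_model_checks(["microsoft/deberta-v3-small", "example/broken-model"], stub_check)
for row in results:
    print(row)
```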

## Export -

In [None]:
# | hide
import nbdev

nbdev.nbdev_export()