# Tokenizers

In [2]:
from transformers import BertTokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 

In [3]:
text = "Using transformers is easy!" 
tokenizer(text) 

{'input_ids': [101, 2478, 19081, 2003, 3733, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

However, the Hugging Face team recommends the `AutoTokenizer` class, which selects the matching tokenizer class from the checkpoint name.

In [4]:
from transformers import AutoTokenizer 
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') 

In [11]:
text = "Using transformers is easy!" 
encoded_input = tokenizer(text)
encoded_input

{'input_ids': [101, 2478, 19081, 2003, 3733, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
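To make the three fields of this output concrete, here is a minimal sketch that rebuilds them by hand. The tiny vocabulary is an assumption for illustration: its ids are copied from the output above and matched to tokens by position, and `toy_encode` is a hypothetical helper, not part of the library.

```python
# Toy vocabulary: ids copied from the tokenizer output above, matched to
# tokens by position. The real vocabulary is far larger.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "using": 2478, "transformers": 19081, "is": 2003, "easy": 3733, "!": 999,
}


def toy_encode(text):
    """Mimic the three fields returned for a single, unpadded sentence."""
    # Lowercase (the checkpoint is "uncased") and split "!" into its own token.
    words = text.lower().replace("!", " !").split()
    tokens = ["[CLS]"] + words + ["[SEP]"]
    return {
        "input_ids": [TOY_VOCAB[t] for t in tokens],
        # All zeros: every token belongs to the first (and only) segment.
        "token_type_ids": [0] * len(tokens),
        # All ones: there is no padding, so every position is attended to.
        "attention_mask": [1] * len(tokens),
    }


encoded = toy_encode("Using transformers is easy!")
print(encoded["input_ids"])
# [101, 2478, 19081, 2003, 3733, 999, 102]
```

The first and last ids (101 and 102) are the special `[CLS]` and `[SEP]` tokens that the tokenizer adds around the sentence.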

In [12]:
type(encoded_input['input_ids'])

list

Since we will be working in PyTorch, we generally want the results as torch tensors rather than Python lists.

In [14]:
encoded_input_pt = tokenizer(text, return_tensors="pt")

In [15]:
type(encoded_input_pt['input_ids'])

torch.Tensor

# Models

In [16]:
from transformers import AutoModel 
model = AutoModel.from_pretrained('bert-base-uncased') 

Some weights of the model checkpoint at ../../models/bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [19]:
output = model(**encoded_input_pt)

Passing the default list-based `encoded_input` here raises an error, because the model expects tensors.

# Pipelines

`pipeline` is a plug-and-play API for running inference on a variety of tasks.

In [22]:
from transformers import pipeline 

## Masked Language Model

In [20]:
unmasker = pipeline('fill-mask', model='bert-base-uncased') 
unmasker("The man worked as a [MASK].") 

Some weights of the model checkpoint at ../../models/bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'sequence': 'the man worked as a carpenter.',
  'score': 0.09747541695833206,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': 'the man worked as a waiter.',
  'score': 0.05238313227891922,
  'token': 15610,
  'token_str': 'waiter'},
 {'sequence': 'the man worked as a barber.',
  'score': 0.04962710291147232,
  'token': 13362,
  'token_str': 'barber'},
 {'sequence': 'the man worked as a mechanic.',
  'score': 0.03788599371910095,
  'token': 15893,
  'token_str': 'mechanic'},
 {'sequence': 'the man worked as a salesman.',
  'score': 0.03768078610301018,
  'token': 18968,
  'token_str': 'salesman'}]

Or, for a more readable view:

In [21]:
import pandas as pd
pd.DataFrame(unmasker("The man worked as a [MASK]."))

| | sequence | score | token | token_str |
|---|---|---|---|---|
| 0 | the man worked as a carpenter. | 0.097475 | 10533 | carpenter |
| 1 | the man worked as a waiter. | 0.052383 | 15610 | waiter |
| 2 | the man worked as a barber. | 0.049627 | 13362 | barber |
| 3 | the man worked as a mechanic. | 0.037886 | 15893 | mechanic |
| 4 | the man worked as a salesman. | 0.037681 | 18968 | salesman |


## Zero-Shot Classification

In [23]:
classifier = pipeline("zero-shot-classification", model='bert-base-uncased') 
sequence_to_classify = "I am going to france." 
candidate_labels = ['travel', 'cooking', 'dancing'] 
classifier(sequence_to_classify, candidate_labels) 

Some weights of the model checkpoint at ../../models/bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 

{'sequence': 'I am going to france.',
 'labels': ['dancing', 'cooking', 'travel'],
 'scores': [0.39708060026168823, 0.37949416041374207, 0.2234252244234085]}

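The results are poor because `bert-base-uncased` has not been fine-tuned for this task: as the warning above indicates, its classification head is newly initialized. Let's use a model fine-tuned on natural language inference (MNLI) instead.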

In [26]:
classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli') 
sequence_to_classify = "I am going to france." 
candidate_labels = ['travel', 'cooking', 'dancing'] 
classifier(sequence_to_classify, candidate_labels) 

{'sequence': 'I am going to france.',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9866884350776672, 0.007197576109319925, 0.006114049348980188]}
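An NLI model suits this task because zero-shot classification can be framed as entailment: each candidate label is turned into a hypothesis sentence, and the entailment scores are normalized across labels. The sketch below illustrates that idea in plain Python. The template string mirrors the pipeline's usual default, and the logits are made-up numbers, not model output.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ["travel", "cooking", "dancing"]
hypotheses = build_hypotheses(candidate_labels)

# Hypothetical entailment logits, one per (premise, hypothesis) pair.
# A real NLI model would produce these for the premise "I am going to france."
entailment_logits = [4.0, -1.0, -0.8]

scores = softmax(entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

A checkpoint without an NLI head, such as plain `bert-base-uncased`, has no meaningful entailment score to rank, which is consistent with the near-uniform scores in the first attempt.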

## Language Generation

In [29]:
from transformers import set_seed 
generator = pipeline("text-generation", model="gpt2") 
set_seed(42)
sequence_prompt = "Hello, I'm a language model," 
generator(sequence_prompt, max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself"},
 {'generated_text': "Hello, I'm a language model, and I'm trying to be as expressive as possible. In order to be expressive, it is necessary to know"},
 {'generated_text': "Hello, I'm a language model, so I don't get much of a license anymore, but I'm probably more familiar with other languages on that"},
 {'generated_text': "Hello, I'm a language model, a functional model... It's not me, it's me!\n\nI won't bore you with how"},
 {'generated_text': "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give language model a set of properties that"}]

# Datasets

In [30]:
from datasets import load_dataset

To access a benchmark from GLUE, we pass two arguments to `load_dataset`: the first is `'glue'` and the second is the name of the subset (task) to load.

## COLA

Let's load the 'cola' subset of GLUE as follows:

In [31]:
cola = load_dataset('glue', 'cola')
cola['train'][18:22]

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


{'sentence': ['They drank the pub.',
  'The professor talked us into a stupor.',
  'The professor talked us.',
  'We yelled ourselves hoarse.'],
 'label': [0, 1, 0, 1],
 'idx': [18, 19, 20, 21]}

The train and validation splits carry two labels (1 for acceptable, 0 for unacceptable), whereas every example in the test split has the label -1, which means that no label is provided.

In [55]:
cola['train'][12]

{'label': 1, 'idx': 12, 'sentence': 'Bill rolled out of the room.'}

In [56]:
cola['validation'][68]

{'label': 0,
 'idx': 68,
 'sentence': 'Which report that John was incompetent did he submit?'}

In [57]:
cola['test'][20]

{'label': -1, 'idx': 20, 'sentence': 'Has John seen Mary?'}
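The label convention can be handled explicitly in code. The following sketch uses the three examples printed above and a hypothetical `describe_label` helper to separate labeled from unlabeled rows.

```python
# Label names as used by CoLA: 0 = unacceptable, 1 = acceptable.
LABEL_NAMES = {0: "unacceptable", 1: "acceptable"}

# The three examples shown above, one from each split.
examples = [
    {"label": 1, "idx": 12, "sentence": "Bill rolled out of the room."},
    {"label": 0, "idx": 68,
     "sentence": "Which report that John was incompetent did he submit?"},
    {"label": -1, "idx": 20, "sentence": "Has John seen Mary?"},
]


def describe_label(example):
    """Map an integer label to its name; -1 (test split) means no label."""
    return LABEL_NAMES.get(example["label"], "no label")


# Keep only the rows that can be used for supervised training or evaluation.
labeled = [e for e in examples if e["label"] != -1]

for e in examples:
    print(e["idx"], describe_label(e))
```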

## Metadata of Datasets

Each dataset also carries metadata about itself:

* split
* description
* citation
* homepage
* license  
(see https://huggingface.co/docs/datasets/package_reference/main_classes.html)

In [49]:
print(cola["train"].split)

train


In [50]:
print(cola["train"].description)

GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.




In [51]:
print(cola["train"].citation)

@article{warstadt2018neural,
  title={Neural Network Acceptability Judgments},
  author={Warstadt, Alex and Singh, Amanpreet and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1805.12471},
  year={2018}
}
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}



In [52]:
print(cola["train"].homepage)

https://nyu-mll.github.io/CoLA/


In [54]:
print(cola["train"].license)




## XTREME

In [59]:
en_de = load_dataset('xtreme', 'MLQA.en.de')

Downloading and preparing dataset xtreme/MLQA.en.de (download: 72.21 MiB, generated: 5.39 MiB, post-processed: Unknown size, total: 77.60 MiB) to /root/.cache/huggingface/datasets/xtreme/MLQA.en.de/1.0.0/fb182342ff5c7a211ebf678cde070463acd29524b30b87f8f38c617948c2826a...


Dataset xtreme downloaded and prepared to /root/.cache/huggingface/datasets/xtreme/MLQA.en.de/1.0.0/fb182342ff5c7a211ebf678cde070463acd29524b30b87f8f38c617948c2826a. Subsequent calls will reuse this data.


In [60]:
en_de

DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4517
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 512
    })
})

In [61]:
# View dataset as a pandas data frame
import pandas as pd
pd.DataFrame(en_de['test'][0:4])

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 037e8929e7e4d2f949ffbabd10f0f860499ff7c9 | Cell culture | An established or immortalized cell line has a... | Woraus besteht die Linie? | {'answer_start': [31], 'text': ['cell']} |
| 1 | 4b36724f3cbde7c287bde512ff09194cbba7f932 | Cell culture | The 19th-century English physiologist Sydney R... | Wann hat Roux etwas von seiner Medullarplatte ... | {'answer_start': [232], 'text': ['1885']} |
| 2 | 13e58403df16d88b0e2c665953e89575704942d4 | TRIPS Agreement | After the Uruguay round, the GATT became the b... | Was muss ratifiziert werden, wenn ein Land ger... | {'answer_start': [131], 'text': ['TRIPS']} |
| 3 | d23b5372af1de9425a4ae313c01eb80764c910d8 | TRIPS Agreement | Since TRIPS came into force, it has been subje... | Welche Teile der Welt kritisierten das TRIPS a... | {'answer_start': [67], 'text': ['developing co... |


## Total number of datasets and metrics

In [63]:
from datasets import list_datasets, list_metrics
all_datasets = list_datasets()  # renamed to avoid shadowing the built-in `all`
metrics = list_metrics()

print(f"{len(all_datasets)} datasets and {len(metrics)} metrics exist on the hub\n")
print(all_datasets[:20])
print(metrics)

1663 datasets and 28 metrics exist on the hub

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity', 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'ami', 'amttl', 'anli', 'app_reviews']
['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 'matthews_correlation', 'meteor', 'pearsonr', 'precision', 'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']


# Manipulating Data with the datasets library

## Splits

The `split` parameter decides which split(s), or which portion of a split, is loaded. With the default value `None`, `load_dataset` returns a dictionary with all splits (train, test, validation, or any other). If `split` is specified, it returns a single `Dataset` rather than a dictionary.

In [68]:
cola = load_dataset('glue', 'cola')
cola

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

To get a particular element, you have to specify the split first followed by the index.

In [69]:
cola['train'][0]

{'label': 1,
 'idx': 0,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

In [70]:
cola = load_dataset('glue', 'cola', split='train[:300]+validation[-30%:]')
cola

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 613
})

In [71]:
cola[0]

{'label': 1,
 'idx': 0,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

**Other split examples**

The first 100 examples from each of train and validation:

`split='train[:100]+validation[:100]'`

The first 50% of train and the first 30% of validation:

`split='train[:50%]+validation[:30%]'`

The first 20% of train and the examples in the slice 30:50 from validation:

`split='train[:20%]+validation[30:50]'`
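To see how such expressions translate into row counts, here is a simplified sketch of the slicing arithmetic. It is an illustration only, not the library's parser: `count_rows` is a hypothetical helper, the split sizes are taken from the `DatasetDict` output above, and percentages are assumed to be rounded to the closest integer.

```python
import re

# Split sizes of GLUE/CoLA, as printed by the DatasetDict above.
SPLIT_SIZES = {"train": 8551, "validation": 1043, "test": 1063}

# Matches "name", "name[a:b]", "name[:b]", "name[a:]" with optional "%" bounds.
_PART = re.compile(r"^(\w+)(?:\[(-?\d+%?)?:(-?\d+%?)?\])?$")


def _to_index(bound, size, default):
    """Convert an absolute or percent bound into a clamped row index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        value = int(round(int(bound[:-1]) * size / 100))
    else:
        value = int(bound)
    if value < 0:  # negative bounds count from the end, as in Python slicing
        value += size
    return max(0, min(size, value))


def count_rows(split_expr, sizes=SPLIT_SIZES):
    """Return the number of rows selected by an expression such as 'a[:10]+b[-5%:]'."""
    total = 0
    for part in split_expr.split("+"):
        match = _PART.match(part.strip())
        if match is None:
            raise ValueError(f"Unsupported split expression: {part!r}")
        name, start, stop = match.groups()
        size = sizes[name]
        first = _to_index(start, size, 0)
        last = _to_index(stop, size, size)
        total += max(0, last - first)
    return total


print(count_rows("train[:300]+validation[-30%:]"))   # 613
print(count_rows("train[:100%]+validation[-30%:]"))  # 8864
```

The first result matches the `num_rows: 613` reported above: 300 rows from train plus the last 30% of the 1043 validation rows (313 after rounding).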

## Sort

To sort the dataset by a column, for example `label`, call `sort`.

In the sorted result, the first 15 elements have label 0 and the last 15 have label 1.

In [73]:
cola.sort('label')['label'][:15]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [75]:
cola.sort('label')['label'][-15:]

Loading cached sorted indices for dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f0922985b2dfc641.arrow


[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

## Index

We can use Python-style indexing and slicing to select a few elements.

In [76]:
cola[6,19,44]

{'sentence': ['Fred watered the plants flat.',
  'The professor talked us into a stupor.',
  'The trolley rumbled through the tunnel.'],
 'label': [1, 1, 1],
 'idx': [6, 19, 44]}

In [77]:
cola[42:46]

{'sentence': ['They made him to exhaustion.',
  'They made him into a monster.',
  'The trolley rumbled through the tunnel.',
  'The wagon rumbled down the road.'],
 'label': [0, 1, 1, 1],
 'idx': [42, 43, 44, 45]}

## Shuffle

`shuffle` reorders the dataset randomly, so slicing the result gives a random selection. Passing a `seed` makes the selection reproducible.

In [78]:
cola.shuffle(seed=42)[:3]

{'sentence': ['Lou forgot the umbrella in the closet.',
  'It is the problem that he is here.',
  'I met the person who left.'],
 'label': [1, 0, 1],
 'idx': [904, 1017, 885]}

## Filter

**To retrieve only the sentences that contain the term "kick" from the cola dataset**

In [83]:
from pprint import pprint # for a nicer print view

In [82]:
cola = load_dataset('glue', 'cola', split='train[:100%]+validation[-30%:]')
pprint(cola.filter(lambda s: "kick" in s['sentence'])[:3])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-76b32776b6296bf0.arrow


{'idx': [2003, 2009, 2010],
 'label': [1, 1, 1],
 'sentence': ['Jill kicked the ball from home plate to third base.',
              'Fred kicked the ball under the porch.',
              'Fred kicked the ball behind the tree.']}


**To get 3 acceptable sentences**

In [84]:
pprint(cola.filter(lambda s: s['label']== 1 )["sentence"][:3])

["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]


**To get 3 acceptable sentences (in case we know the class label as a string but not the class integer)**

In [85]:
cola.filter(lambda s: s['label']== cola.features['label'].str2int('acceptable'))["sentence"][:3]

["Our friends won't buy this analysis, let alone the next one we propose.",
 "One more pseudo generalization and I'm giving up.",
 "One more pseudo generalization or I'm giving up."]

In [86]:
cola.features['label']

ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None)
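The name-to-integer lookup used in the filter above amounts to an index into the `names` list. Below is a minimal stand-in written in plain Python. `ToyClassLabel` is a hypothetical class for illustration and only mimics the two lookups, not the real `ClassLabel` feature.

```python
class ToyClassLabel:
    """Minimal stand-in for the name <-> integer mapping of a class label."""

    def __init__(self, names):
        self.names = list(names)

    def str2int(self, name):
        # The integer label is the position of the name in the list.
        return self.names.index(name)

    def int2str(self, value):
        return self.names[value]


label_feature = ToyClassLabel(["unacceptable", "acceptable"])

rows = [
    {"sentence": "They drank the pub.", "label": 0},
    {"sentence": "We yelled ourselves hoarse.", "label": 1},
]

# Filter by class name instead of hard-coding the integer.
wanted = label_feature.str2int("acceptable")
acceptable = [r["sentence"] for r in rows if r["label"] == wanted]
print(acceptable)
# ['We yelled ourselves hoarse.']
```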

## Map

### To add new features

In [87]:
cola

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8864
})

In [88]:
cola_new=cola.map(lambda e: {'len': len(e['sentence'])})
cola_new

Dataset({
    features: ['sentence', 'label', 'idx', 'len'],
    num_rows: 8864
})

In [89]:
pprint(cola_new[0:3])

{'idx': [0, 1, 2],
 'label': [1, 1, 1],
 'len': [71, 49, 48],
 'sentence': ["Our friends won't buy this analysis, let alone the next one we "
              'propose.',
              "One more pseudo generalization and I'm giving up.",
              "One more pseudo generalization or I'm giving up."]}


In [90]:
pd.DataFrame(cola_new[0:3])

| | sentence | label | idx | len |
|---|---|---|---|---|
| 0 | Our friends won't buy this analysis, let alone... | 1 | 0 | 71 |
| 1 | One more pseudo generalization and I'm giving up. | 1 | 1 | 49 |
| 2 | One more pseudo generalization or I'm giving up. | 1 | 2 | 48 |


### To modify an existing feature (crop the length of the text)

In [93]:
cola_cut=cola_new.map(lambda e: {'sentence': e['sentence'][:20]+ '_'})
pd.DataFrame(cola_cut[0:3])

| | sentence | label | idx | len |
|---|---|---|---|---|
| 0 | Our friends won't bu_ | 1 | 0 | 71 |
| 1 | One more pseudo gene_ | 1 | 1 | 49 |
| 2 | One more pseudo gene_ | 1 | 2 | 48 |
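Both uses of `map` follow the same rule: the function receives one example as a dictionary and returns a dictionary of columns, where new keys become new features and existing keys overwrite the old values. The sketch below mimics that merge in plain Python. `toy_map` is a hypothetical helper, not the library's implementation.

```python
def toy_map(rows, fn):
    """Apply fn to every row and merge the returned columns into the row."""
    # Keys returned by fn either add a new column or overwrite an existing one.
    return [{**row, **fn(row)} for row in rows]


rows = [
    {"sentence": "Our friends won't buy this analysis, let alone the next one we propose.",
     "label": 1, "idx": 0},
    {"sentence": "One more pseudo generalization and I'm giving up.", "label": 1, "idx": 1},
    {"sentence": "One more pseudo generalization or I'm giving up.", "label": 1, "idx": 2},
]

# New key 'len' -> a new feature is added.
with_len = toy_map(rows, lambda e: {"len": len(e["sentence"])})

# Existing key 'sentence' -> the feature is overwritten; 'len' is left untouched.
cropped = toy_map(with_len, lambda e: {"sentence": e["sentence"][:20] + "_"})

print([r["len"] for r in with_len])      # [71, 49, 48]
print([r["sentence"] for r in cropped])
```

Note that `len` still holds the length of the original sentence after cropping, just as in the table above, because the second call only returns the `sentence` column.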


# Local Datasets

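Unless a `split` is given, `load_dataset` returns a `DatasetDict` for local files as well. Files that are not assigned to a named split end up under `train`.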

In [94]:
from datasets import load_dataset
data1 = load_dataset('csv', data_files='./data/a.csv', delimiter="\t")
data2 = load_dataset('csv', data_files=['./data/a.csv','./data/b.csv', './data/c.csv'], delimiter="\t")
data3 = load_dataset('csv', data_files={'train':['./data/a.csv','./data/b.csv'], 'test':['./data/c.csv']}, delimiter="\t") 

Using custom data configuration default-b74816b7681c96f5
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b74816b7681c96f5/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


Using custom data configuration default-96b9daeb57726dfd
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-96b9daeb57726dfd/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


Using custom data configuration default-50f3826d38cdd1c0
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-50f3826d38cdd1c0/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


In [95]:
data1

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 99
    })
})

In [96]:
data2

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 299
    })
})

In [97]:
data3

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 199
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 100
    })
})
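The three outputs above suggest a simple rule for how `data_files` maps to splits: a single path or a list of paths goes to `train`, while a dictionary names the splits explicitly. The sketch below encodes that rule in plain Python. `resolve_splits` is a hypothetical helper, and the per-file row counts are inferred from the `num_rows` values shown above.

```python
# Row counts per file, inferred from the num_rows values printed above
# (a: 99, a+b+c: 299, a+b: 199, c: 100).
ROWS_PER_FILE = {"./data/a.csv": 99, "./data/b.csv": 100, "./data/c.csv": 100}


def resolve_splits(data_files):
    """Map the data_files argument to {split_name: [files]}."""
    if isinstance(data_files, str):
        return {"train": [data_files]}
    if isinstance(data_files, (list, tuple)):
        return {"train": list(data_files)}
    # A dict already names its splits; normalize each value to a list.
    return {
        split: [files] if isinstance(files, str) else list(files)
        for split, files in data_files.items()
    }


def rows_per_split(data_files):
    """Sum the rows of all files that end up in each split."""
    return {
        split: sum(ROWS_PER_FILE[f] for f in files)
        for split, files in resolve_splits(data_files).items()
    }


print(rows_per_split("./data/a.csv"))
print(rows_per_split(["./data/a.csv", "./data/b.csv", "./data/c.csv"]))
print(rows_per_split({"train": ["./data/a.csv", "./data/b.csv"],
                      "test": ["./data/c.csv"]}))
```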

In [98]:
import pandas as pd
pd.DataFrame(data1["train"][:3])

| | sentence | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |


In [99]:
pd.DataFrame(data3["test"][:3])

| | sentence | label |
|---|---|---|
| 0 | inane and awful | 0 |
| 1 | told in scattered fashion | 0 |
| 2 | takes chances that are bold by studio standards | 1 |


In [100]:
# get the files in other format
# data_json = load_dataset('json', data_files='a.json')
# data_text = load_dataset('text', data_files='a.txt')

# Preparing the data for model training

Once we have the dataset, either from the datasets library or loaded from our own files, we have to process it before it can be used in our model. That processing step is tokenization.

In [102]:
from transformers import AutoTokenizer 
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') 

## Only tokenization

In [103]:
encoded_data1 = data1.map( lambda e: tokenizer(e['sentence']), batched=True, batch_size=1000)

In [104]:
data1

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 99
    })
})

In [105]:
encoded_data1

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence', 'token_type_ids'],
        num_rows: 99
    })
})

In [108]:
pprint(encoded_data1['train'][0])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
 'label': 0,
 'sentence': 'hide new secretions from the parental units ',
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


## Tokenization + Truncation

In [109]:
encoded_data3 = data3.map(lambda e: tokenizer( e['sentence'], padding=True, truncation=True, max_length=12), batched=True, batch_size=1000) 

In [110]:
data3

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 199
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 100
    })
})

In [111]:
encoded_data3

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence', 'token_type_ids'],
        num_rows: 199
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence', 'token_type_ids'],
        num_rows: 100
    })
})

In [118]:
pprint(data3['test'][90])

{'label': 1,
 'sentence': 'warm water under a red bridge is a celebration of feminine '
             'energy , a tribute to the power of women to heal . '}


In [119]:
pprint(encoded_data3['test'][90])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
               4010,
               2300,
               2104,
               1037,
               2417,
               2958,
               2003,
               1037,
               7401,
               1997,
               102],
 'label': 1,
 'sentence': 'warm water under a red bridge is a celebration of feminine '
             'energy , a tribute to the power of women to heal . ',
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


In [133]:
' '.join(tokenizer.convert_ids_to_tokens(encoded_data3['test'][90]['input_ids']))

'[CLS] warm water under a red bridge is a celebration of [SEP]'

We can see that the original sentence has indeed been truncated to a length of 12 tokens (including the special tokens `[CLS]` and `[SEP]`).
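The interplay of `max_length`, truncation and padding can be sketched without the library. In the illustration below, whitespace-separated words stand in for WordPiece tokens (an assumption that happens to hold for the first words of this sentence), and `toy_encode` and `pad_batch` are hypothetical helpers.

```python
def toy_encode(sentence, max_length):
    """Truncate to max_length tokens, counting [CLS] and [SEP]."""
    tokens = sentence.split()
    # Two positions are reserved for the special tokens.
    tokens = tokens[: max_length - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]


def pad_batch(batch, pad_token="[PAD]"):
    """Pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


long_sentence = ("warm water under a red bridge is a celebration of feminine "
                 "energy , a tribute to the power of women to heal . ")
short_sentence = "inane and awful"

batch = [toy_encode(s, max_length=12) for s in (long_sentence, short_sentence)]
padded, masks = pad_batch(batch)

print(" ".join(padded[0]))
# [CLS] warm water under a red bridge is a celebration of [SEP]
print(masks[1])
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

The long sentence keeps its first 10 words plus the two special tokens, while the short one is padded and its mask marks the padding positions with 0.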

# Speed and Memory Benchmarking

We can benchmark our HF models for computational cost, in terms of both speed and memory.

## Get CPU/GPU Memory

In [1]:
import torch 
print(f"The GPU total memory is {torch.cuda.get_device_properties(0).total_memory /(1024**3)} GB") 

AssertionError: Torch not compiled with CUDA enabled
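The error above appears because this PyTorch build has no CUDA support. A guarded variant avoids the exception on CPU-only machines. The following is a minimal sketch, with `describe_gpu_memory` as a hypothetical helper name.

```python
def describe_gpu_memory():
    """Report the total memory of GPU 0, or explain why it is unavailable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # is_available() returns False instead of raising on CPU-only builds.
    if not torch.cuda.is_available():
        return "No CUDA device available; running on CPU"

    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return f"The GPU total memory is {total_bytes / 1024**3:.2f} GB"


message = describe_gpu_memory()
print(message)
```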

In [None]:
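import matplotlib.pyplot as plt 
# `results`, `models`, `sequence_lengths` and `batch_sizes` are assumed to come
# from an earlier benchmark run that is not shown in this section.
plt.figure(figsize=(8,8)) 
t=sequence_lengths 
# One series per model: inference time at each sequence length, for the first batch size
models_perf=[list(results.time_inference_result[m]['result'][batch_sizes[0]].values()) for m in models] 
plt.xlabel('Seq Length') 
plt.ylabel('Time in Seconds') 
plt.title('Inference Speed Result') 
# The plot call below expects exactly four models
plt.plot(t, models_perf[0], 'rs--', t, models_perf[1], 'g--.', t, models_perf[2], 'b--^', t, models_perf[3], 'c--o') 
plt.legend(models)  
plt.show() 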

In [None]:
import matplotlib.pyplot as plt 
# Same assumed variables as in the speed plot, but reading the memory results
plt.figure(figsize=(8,8)) 
t=sequence_lengths 
# One series per model: inference memory at each sequence length, for the first batch size
models_perf=[list(results.memory_inference_result[m]['result'][batch_sizes[0]].values()) for m in models] 
plt.xlabel('Seq Length') 
plt.ylabel('Memory in MB') 
plt.title('Inference Memory Result') 
plt.plot(t, models_perf[0], 'rs--', t, models_perf[1], 'g--.', t, models_perf[2], 'b--^', t, models_perf[3], 'c--o') 
plt.legend(models)  
plt.show() 
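The plotting cells index into `results` as model, then `'result'`, then batch size, then sequence length. The benchmark cell that produces `results` is not part of this section, so the sketch below only mocks that nested layout with made-up numbers and hypothetical model names, to show what the list comprehension extracts.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the variables the plotting cells rely on.
models = ["model-a", "model-b"]
batch_sizes = [4]
sequence_lengths = [32, 64, 128]

# Made-up timings, laid out as: model -> 'result' -> batch size -> sequence length.
results = SimpleNamespace(
    time_inference_result={
        "model-a": {"result": {4: {32: 0.010, 64: 0.020, 128: 0.040}}},
        "model-b": {"result": {4: {32: 0.015, 64: 0.030, 128: 0.060}}},
    }
)

# Same expression as in the plotting cell: one list of values per model,
# ordered by sequence length, for the first batch size.
models_perf = [
    list(results.time_inference_result[m]["result"][batch_sizes[0]].values())
    for m in models
]

for name, series in zip(models, models_perf):
    print(name, series)
```

Each inner list has one value per entry in `sequence_lengths`, which is why it can be plotted directly against `t`.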