## HuggingFace Transformer Examples

In [1]:
from transformers import pipeline

In [3]:
sentiment_classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [5]:
sentiment_classifier("I'm so excited to learn about AI!")

[{'label': 'POSITIVE', 'score': 0.9996881484985352}]
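
The warning printed when the pipeline was created advises against relying on the default model. A minimal sketch of pinning both the model and the revision follows; the model name and revision hash are copied from that log message, while `PINNED` and `build_pinned_classifier` are hypothetical names chosen for illustration.

```python
# Model name and revision copied from the log message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_pinned_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so the settings can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(**PINNED)


# Usage (downloads the model on first call):
# classifier = build_pinned_classifier()
# classifier("I'm so excited to learn about AI!")
```

Pinning the revision should keep results reproducible even if the default checkpoint for the task changes in a later library release.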

In [8]:
ner = pipeline("ner", model="dslim/bert-base-NER")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [9]:
ner("Her name is Anna and she work for Morgan Stanley in NY")

[{'entity': 'B-PER',
  'score': 0.99564785,
  'index': 4,
  'word': 'Anna',
  'start': 12,
  'end': 16},
 {'entity': 'B-ORG',
  'score': 0.9985104,
  'index': 9,
  'word': 'Morgan',
  'start': 34,
  'end': 40},
 {'entity': 'I-ORG',
  'score': 0.9986541,
  'index': 10,
  'word': 'Stanley',
  'start': 41,
  'end': 48},
 {'entity': 'B-LOC',
  'score': 0.9993187,
  'index': 12,
  'word': 'NY',
  'start': 52,
  'end': 54}]
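
The output above is token-level: "Morgan" and "Stanley" come back as separate `B-ORG` and `I-ORG` entries. The sketch below merges such BIO-tagged tokens into whole entity spans using the character offsets. `merge_bio` is a hypothetical helper written for illustration, not part of transformers, and it assumes every token is a whole word (no `##` subword pieces). The scores and indices are dropped from the sample data for brevity.

```python
def merge_bio(tokens, text):
    """Merge token-level B-/I- predictions into entity spans."""
    spans = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        # Extend the previous span only on an I- tag with the same label.
        if prefix == "I" and spans and spans[-1]["label"] == label:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    # Recover the surface text of each span from the character offsets.
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "Her name is Anna and she work for Morgan Stanley in NY"
# Entities and offsets copied from the pipeline output above.
tokens = [
    {"entity": "B-PER", "word": "Anna", "start": 12, "end": 16},
    {"entity": "B-ORG", "word": "Morgan", "start": 34, "end": 40},
    {"entity": "I-ORG", "word": "Stanley", "start": 41, "end": 48},
    {"entity": "B-LOC", "word": "NY", "start": 52, "end": 54},
]

entities = merge_bio(tokens, text)
for entity in entities:
    print(entity["label"], entity["text"])
```

The pipeline can also do this grouping itself: passing `aggregation_strategy="simple"` when creating the `ner` pipeline returns grouped entities instead of individual tokens.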

In [12]:
zeroshot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cpu


In [13]:
sequence_to_classify = "One day I will see Las Vegas!"
candidate_labels = ['travel', 'cooking', 'dancing']

In [17]:
zeroshot_classifier(sequence_to_classify, candidate_labels)

{'sequence': 'One day I will see Las Vegas!',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9854480624198914, 0.010188054293394089, 0.004363819491118193]}

## Pre-Trained Tokenizers

In [22]:
from transformers import AutoTokenizer

In [26]:
model = "bert-base-uncased"

In [28]:
tokenizer = AutoTokenizer.from_pretrained(model)

In [30]:
sentence = "I'm so excited to learn about AI!"

In [32]:
input_ids = tokenizer(sentence)

In [34]:
print(input_ids)

{'input_ids': [101, 1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 9932, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [36]:
tokens = tokenizer.tokenize(sentence)

In [40]:
print(tokens)

['i', "'", 'm', 'so', 'excited', 'to', 'learn', 'about', 'ai', '!']


In [42]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)

In [44]:
token_ids

[1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 9932, 999]

In [48]:
decode_token_ids = tokenizer.decode(token_ids)

In [50]:
decode_token_ids

"i ' m so excited to learn about ai!"

In [52]:
tokenizer.decode(101)

'[CLS]'

In [56]:
tokenizer.decode(102)

'[SEP]'

In [58]:
model2 = "xlnet-base-cased"

In [60]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)

In [62]:
input_ids2 = tokenizer2(sentence)

In [64]:
input_ids2

{'input_ids': [35, 26, 98, 102, 5564, 22, 1184, 75, 79, 96, 136, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [71]:
tokens2 = tokenizer2.tokenize(sentence)

In [73]:
tokens2

['▁I', "'", 'm', '▁so', '▁excited', '▁to', '▁learn', '▁about', '▁A', 'I', '!']

In [75]:
token_id2 = tokenizer2.convert_tokens_to_ids(tokens2)

In [77]:
token_id2

[35, 26, 98, 102, 5564, 22, 1184, 75, 79, 96, 136]

In [79]:
tokenizer2.convert_ids_to_tokens(4)

'<sep>'

In [81]:
tokenizer2.convert_ids_to_tokens(3)

'<cls>'
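
The two tokenizers place their special tokens differently: BERT wraps the sequence as `[CLS] ... [SEP]`, while XLNet appends `<sep> <cls>` at the end. The following sketch checks this using only the ID lists printed above; the variable names are chosen for illustration.

```python
# Token IDs copied from the outputs above.
bert_full = [101, 1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 9932, 999, 102]
bert_tokens = [1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 9932, 999]
xlnet_full = [35, 26, 98, 102, 5564, 22, 1184, 75, 79, 96, 136, 4, 3]
xlnet_tokens = [35, 26, 98, 102, 5564, 22, 1184, 75, 79, 96, 136]

BERT_CLS, BERT_SEP = 101, 102   # '[CLS]' and '[SEP]' in bert-base-uncased
XLNET_SEP, XLNET_CLS = 4, 3     # '<sep>' and '<cls>' in xlnet-base-cased

# BERT: special tokens surround the sentence.
bert_ok = bert_full == [BERT_CLS] + bert_tokens + [BERT_SEP]
# XLNet: both special tokens come after the sentence.
xlnet_ok = xlnet_full == xlnet_tokens + [XLNET_SEP, XLNET_CLS]

print(bert_ok, xlnet_ok)
```

Note that ID 102 is `[SEP]` for BERT but an ordinary token (`▁so`) for XLNet, so token IDs are only meaningful together with the tokenizer that produced them.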

## HuggingFace and PyTorch/TensorFlow

In [84]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [86]:
print(sentence)
print(input_ids)

I'm so excited to learn about AI!
{'input_ids': [101, 1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 9932, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [88]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [92]:
input_ids_pt = tokenizer(sentence, return_tensors='pt')

In [94]:
print(input_ids_pt)

{'input_ids': tensor([[ 101, 1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 9932,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [96]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [98]:
with torch.no_grad():
    logits = model(**input_ids_pt).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'POSITIVE'
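
The cell above picks the label with the largest logit but does not report a confidence. A softmax over the logits turns them into probabilities, which is how a score like the one returned by the pipeline earlier can be obtained. The sketch below uses plain Python; `example_logits` holds hypothetical values (the notebook does not print the real logits), and the `id2label` mapping is assumed to match `model.config.id2label` for this checkpoint.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the max for numerical stability; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for illustration only, not the model's actual output.
example_logits = [-4.0, 4.0]
# Assumed label mapping for the SST-2 checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(example_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_id], probs[predicted_id])
```

With the real tensor, `torch.softmax(logits, dim=-1)` computes the same quantity directly.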

## Saving and Loading Models

In [101]:
model_directory = "my_saved_models"

In [103]:
tokenizer.save_pretrained(model_directory)

('my_saved_models\\tokenizer_config.json',
 'my_saved_models\\special_tokens_map.json',
 'my_saved_models\\vocab.txt',
 'my_saved_models\\added_tokens.json',
 'my_saved_models\\tokenizer.json')

In [105]:
model.save_pretrained(model_directory)

In [107]:
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)

In [109]:
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)