<a href="https://colab.research.google.com/github/lkarjun/fastai-huggingface-workouts/blob/main/notebook1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Packages

In [1]:
!pip install -qq "transformers[sentencepiece]"

!pip install -q datasets

In [2]:
from transformers import pipeline

## Tokenizer Basics

In [3]:
from transformers import BertTokenizerFast

In [4]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

In [5]:
print(type(tokenizer))

print(tokenizer)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [6]:
from transformers import AutoTokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [10]:
print(type(tokenizer))
print(tokenizer)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [11]:
inputs = tokenizer('hi my name is Arjun')

In [12]:
print(type(inputs), type(inputs.data))

print(inputs.data)

<class 'transformers.tokenization_utils_base.BatchEncoding'> <class 'dict'>
{'input_ids': [101, 20844, 1139, 1271, 1110, 138, 1197, 17936, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
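The three fields above line up position by position: `input_ids` are vocabulary indices, `token_type_ids` mark which segment a token belongs to, and `attention_mask` flags real tokens (1) versus padding (0). A minimal sketch in plain Python of how a batch gets padded, assuming right-side padding and pad id 0 (BERT's `[PAD]`); `pad_batch` is a hypothetical helper, not a transformers function:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 20844, 1139, 1271, 1110, 138, 1197, 17936, 102],  # ids from the cell above
    [101, 20844, 102],                                       # "[CLS] hi [SEP]"
])
print(batch["attention_mask"][1])
```

With the real tokenizer, passing a list of sentences together with `padding=True` should produce the same kind of structure.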


In [13]:
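vocab = tokenizer.get_vocab()  # token -> id mapping; build it once instead of on every access
for idx, (tok, tok_id) in enumerate(vocab.items()):
  print(f'{tok}: {tok_id}')
  if idx > 4: break  # stop after the first six entries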

birthplace: 15979
##ldo: 25791
##angelo: 27638
MLA: 20222
##dge: 8484
180: 7967


In [19]:
print(inputs.input_ids)
print(tokenizer.decode(inputs.input_ids))

print(tokenizer.convert_ids_to_tokens(inputs.input_ids))

[101, 20844, 1139, 1271, 1110, 138, 1197, 17936, 102]
[CLS] hi my name is Arjun [SEP]
['[CLS]', 'hi', 'my', 'name', 'is', 'A', '##r', '##jun', '[SEP]']
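The `##` prefix in the output marks a piece that continues the previous word. BERT's WordPiece tokenizer splits a word by greedy longest-match-first lookup against its vocabulary. A rough sketch of that idea with a tiny hypothetical vocabulary (the real tokenizer also handles normalization, punctuation, and a maximum word length); `wordpiece` and `toy_vocab` are illustrative names:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"hi", "my", "name", "is", "A", "##r", "##jun"}  # hypothetical subset
print(wordpiece("Arjun", toy_vocab))
print(wordpiece("name", toy_vocab))
```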


## Sentiment

In [22]:
classifier = pipeline('sentiment-analysis')

classifier("I've been waiting for you...")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9973922967910767}]

In [23]:
classifier(["I've been waiting for you", "I hate you"])

[{'label': 'POSITIVE', 'score': 0.9972347617149353},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]
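
The `score` in each result is a probability. A sequence-classification model emits one logit per class, and for a two-class model like this one the pipeline applies a softmax and reports the top class. A small sketch with hypothetical logits (the values are made up for illustration, not taken from the model):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return the most probable label with its score, shaped like a pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = top_label([-2.5, 3.4], id2label)  # hypothetical logits
print(result)
```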

In [24]:
classifier.model.name_or_path, classifier.modelcard

('distilbert-base-uncased-finetuned-sst-2-english', None)

In [25]:
# Use a specific model from the Hub instead of the pipeline default

In [32]:
classifier = pipeline("sentiment-analysis", model = 'cardiffnlp/twitter-roberta-base-sentiment')

classifier.model.name_or_path, classifier.modelcard

('cardiffnlp/twitter-roberta-base-sentiment', None)
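
This checkpoint's model card lists three classes (0 negative, 1 neutral, 2 positive), and its predictions are assumed here to come back with generic names such as `LABEL_0`. Both points are assumptions to check against the model card. A small sketch of mapping such names to readable ones; `readable` and `LABEL_NAMES` are hypothetical helpers and the prediction is invented:

```python
# Assumed mapping, taken from the model card; verify before relying on it.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def readable(predictions, names=LABEL_NAMES):
    """Replace generic label ids with readable names, leaving scores untouched."""
    return [{**p, "label": names.get(p["label"], p["label"])} for p in predictions]

# Hypothetical pipeline output, invented for illustration.
raw = [{"label": "LABEL_2", "score": 0.91}]
print(readable(raw))
```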

## Zero-Shot

In [34]:
classifier = pipeline("zero-shot-classification")

classifier("This is good course for your exams", candidate_labels = ['education', 'employement'])

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'labels': ['education', 'employement'],
 'scores': [0.9344026446342468, 0.06559735536575317],
 'sequence': 'This is good course for your exams'}
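
Zero-shot classification with an NLI model such as `facebook/bart-large-mnli` turns each candidate label into a hypothesis (the pipeline's default template is `"This example is {}."`), scores how strongly the input entails it, and, in the single-label setting, normalizes the entailment scores across labels with a softmax. A sketch of that flow; `fake_entailment_logit` is a hypothetical stand-in for the model and its numbers are invented:

```python
import math

def zero_shot(sequence, candidate_labels, entailment_logit,
              template="This example is {}."):
    """Rank labels by a softmax over per-label entailment logits."""
    hypotheses = [template.format(label) for label in candidate_labels]
    logits = [entailment_logit(sequence, hyp) for hyp in hypotheses]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scored = sorted(zip(candidate_labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return {"sequence": sequence,
            "labels": [label for label, _ in scored],
            "scores": [score for _, score in scored]}

def fake_entailment_logit(premise, hypothesis):
    # Hypothetical stand-in for an NLI model: invented logits, illustration only.
    return 2.0 if "education" in hypothesis else -0.5

out = zero_shot("This is good course for your exams",
                ["employment", "education"], fake_entailment_logit)
print(out["labels"])
```

As in the real output above, labels come back sorted by score and the scores sum to 1.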

## Text Generation

In [35]:
generator = pipeline('text-generation')

generator("In this course, we will")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will use examples from the past. In this course, we are looking at scenarios with multiple classes and have three basic classes. It is helpful to understand some of how these scenarios apply to the project development pipeline. First, let'}]
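
Text generation is autoregressive: the model scores possible next tokens given the text so far, one is appended, and the loop repeats until a length limit or an end-of-sequence token. A toy sketch of greedy decoding, with a hypothetical lookup table standing in for GPT-2 (real models output a distribution over the whole vocabulary and may sample from it rather than always taking the top token):

```python
def greedy_generate(prompt_tokens, next_token_scores, max_new_tokens=5, eos="<eos>"):
    """Repeatedly append the highest-scoring next token until eos or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens

# Hypothetical "model": scores depend only on the last token.
TABLE = {
    "will": {"learn": 0.6, "use": 0.4},
    "learn": {"about": 0.7, "<eos>": 0.3},
    "about": {"tokenizers": 0.8, "<eos>": 0.2},
    "tokenizers": {"<eos>": 1.0},
}

def toy_scores(tokens):
    return TABLE.get(tokens[-1], {"<eos>": 1.0})

print(" ".join(greedy_generate(["we", "will"], toy_scores)))
```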