<a href="https://colab.research.google.com/github/lkarjun/fastai-huggingface-workouts/blob/main/notebook1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Packages

In [1]:
!pip install -qq transformers[sentencepiece]

!pip install -qq datasets

In [2]:
from transformers import pipeline

## Tokenizer Basics

In [3]:
from transformers import BertTokenizerFast

In [4]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

In [5]:
print(type(tokenizer))

print(tokenizer)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [6]:
from transformers import AutoTokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [10]:
print(type(tokenizer))
print(tokenizer)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
PreTrainedTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [11]:
inputs = tokenizer('hi my name is Arjun')

In [12]:
print(type(inputs), type(inputs.data))

print(inputs.data)

<class 'transformers.tokenization_utils_base.BatchEncoding'> <class 'dict'>
{'input_ids': [101, 20844, 1139, 1271, 1110, 138, 1197, 17936, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
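The `attention_mask` above is all ones because a single sentence needs no padding. As a rough illustration (not the library's implementation), the sketch below pads a small batch by hand. It assumes `[PAD]` has id 0, as in BERT vocabularies. `pad_batch` is a hypothetical helper, and the second, shorter sequence is a made-up example built from ids that appear in the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 20844, 1139, 1271, 1110, 138, 1197, 17936, 102],  # 'hi my name is Arjun'
    [101, 20844, 102],                                       # hypothetical short input: 'hi'
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

With the real tokenizer, the same effect should come from passing a list of sentences together with `padding=True`.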


In [13]:
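vocab = tokenizer.get_vocab()  # token -> id mapping; fetch it once instead of on every iteration
for idx, (tok, tok_id) in enumerate(vocab.items()):
  print(f'{tok}: {tok_id}')
  if idx >= 5: break  # stop after the first six entries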

birthplace: 15979
##ldo: 25791
##angelo: 27638
MLA: 20222
##dge: 8484
180: 7967


In [19]:
print(inputs.input_ids)
print(tokenizer.decode(inputs.input_ids))

print(tokenizer.convert_ids_to_tokens(inputs.input_ids))

[101, 20844, 1139, 1271, 1110, 138, 1197, 17936, 102]
[CLS] hi my name is Arjun [SEP]
['[CLS]', 'hi', 'my', 'name', 'is', 'A', '##r', '##jun', '[SEP]']
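The split of `Arjun` into `A`, `##r`, `##jun` comes from WordPiece, which greedily takes the longest vocabulary entry it can match and marks word-internal pieces with `##`. Here is a minimal sketch of that idea, assuming a tiny hand-made vocabulary. The real tokenizer also normalizes and pre-tokenizes text, and `wordpiece_tokenize` and `toy_vocab` are hypothetical names used only for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (an assumption, not the real bert-base-cased vocabulary)
toy_vocab = {"hi", "my", "name", "is", "A", "##r", "##jun"}

sentence = "hi my name is Arjun"
tokens = ["[CLS]"]
for word in sentence.split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
tokens.append("[SEP]")
print(tokens)
```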


## Sentiment

In [22]:
classifier = pipeline('sentiment-analysis')

classifier("I've been waiting for you...")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9973922967910767}]

In [23]:
classifier(["I've been waiting for you", "I hate you"])

[{'label': 'POSITIVE', 'score': 0.9972347617149353},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]
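Each `score` can be read as a probability for the predicted class. A hedged sketch of that post-processing step follows, assuming the model emits one logit per class and the pipeline applies a softmax and keeps the top class. The logits and the `id2label` mapping below are invented for illustration.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return the most probable label together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping
fake_logits = [-2.0, 3.0]                   # invented values, not real model output
prediction = top_label(fake_logits, id2label)
print(prediction)
```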

In [24]:
classifier.model.name_or_path, classifier.modelcard

('distilbert-base-uncased-finetuned-sst-2-english', None)

In [25]:
# To use a specific model, pass its name via the `model` argument.

In [32]:
classifier = pipeline("sentiment-analysis", model = 'cardiffnlp/twitter-roberta-base-sentiment')

classifier.model.name_or_path, classifier.modelcard

('cardiffnlp/twitter-roberta-base-sentiment', None)

## Zero-Shot

In [34]:
classifier = pipeline("zero-shot-classification")

classifier("This is good course for your exams", candidate_labels = ['education', 'employement'])

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'labels': ['education', 'employement'],
 'scores': [0.9344026446342468, 0.06559735536575317],
 'sequence': 'This is good course for your exams'}
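The default model here is an NLI model (`bart-large-mnli`), so zero-shot classification amounts to asking whether the text entails a hypothesis built from each label, then normalizing across labels. The sketch below shows only that bookkeeping. The template string is assumed to match the pipeline default, the entailment logits are invented, and `build_hypotheses` and `normalize` are hypothetical helpers.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """One NLI hypothesis per candidate label."""
    return [template.format(label) for label in labels]

def normalize(entailment_logits):
    """Softmax across labels so the scores sum to 1."""
    m = max(entailment_logits)
    exps = [math.exp(x - m) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['education', 'employement']  # same candidate labels as in the cell above
hypotheses = build_hypotheses(labels)
entailment_logits = [2.5, -0.2]        # invented values standing in for model output
scores = normalize(entailment_logits)

# Sort labels by descending score, as in the pipeline output
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(hypotheses)
print(ranked)
```

The two scores in the recorded output also sum to 1, which is consistent with this kind of normalization across candidate labels.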

## Text Generation

In [39]:
generator = pipeline('text-generation', model = 'distilgpt2')

In [40]:
generator("In this course, we will", max_length = 30, num_return_sequences = 2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will look at the first half of the new album and see that the whole thing was done in earnest. We hope you enjoyed'},
 {'generated_text': 'In this course, we will have the opportunity to look at aspects and how an architect can use these features for his or her business to advance his or'}]

## Masked Language Modeling (MLM)

In [41]:
unmasker = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


In [44]:
unmasker('This course will <mask> you all about Deep Learning', top_k = 2)

[{'score': 0.9537787437438965,
  'sequence': 'This course will teach you all about Deep Learning',
  'token': 6396,
  'token_str': ' teach'},
 {'score': 0.0386032909154892,
  'sequence': 'This course will tell you all about Deep Learning',
  'token': 1137,
  'token_str': ' tell'}]
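Note that `token_str` keeps a leading space, because this tokenizer encodes the preceding space as part of the token. A small sketch of how each `sequence` relates to the input, using the two candidates from the output above. `fill` is a hypothetical helper, not something the pipeline exposes.

```python
template = 'This course will <mask> you all about Deep Learning'

# Candidates copied from the pipeline output above
candidates = [
    {'score': 0.9537787437438965, 'token_str': ' teach'},
    {'score': 0.0386032909154892, 'token_str': ' tell'},
]

def fill(template, token_str, mask_token='<mask>'):
    """Substitute the predicted token for the mask, dropping its leading space."""
    return template.replace(mask_token, token_str.strip(), 1)

# Rank by score and rebuild the completed sentences
ranked = sorted(candidates, key=lambda c: c['score'], reverse=True)
sequences = [fill(template, c['token_str']) for c in ranked]
for seq in sequences:
    print(seq)
```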

## Token Classification (e.g. NER)

In [46]:
ner = pipeline('ner', grouped_entities = True)
# grouped_entities=True merges the subword tokens of an entity back into a single span.

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


In [49]:
ner("My name is Lalkrishna and I'm a final year student.")

[{'end': 21,
  'entity_group': 'PER',
  'score': 0.993746,
  'start': 11,
  'word': 'Lalkrishna'}]

In [50]:
ner = pipeline('ner', grouped_entities = False)

ner("My name is Lalkrishna and I'm a final year student.")

# With grouped_entities=False, each subword token is reported separately.

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
  f'`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="{aggregation_strategy}"` instead.'


[{'end': 14,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.99807906,
  'start': 11,
  'word': 'Lal'},
 {'end': 15,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.99698573,
  'start': 14,
  'word': '##k'},
 {'end': 18,
  'entity': 'I-PER',
  'index': 6,
  'score': 0.99242496,
  'start': 15,
  'word': '##ris'},
 {'end': 20,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.9867569,
  'start': 18,
  'word': '##hn'},
 {'end': 21,
  'entity': 'I-PER',
  'index': 8,
  'score': 0.99448365,
  'start': 20,
  'word': '##a'}]
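The grouped result shown earlier can be recovered from these per-token predictions by gluing `##` pieces onto the previous token and averaging their scores. This is a simplified sketch. The pipeline's real aggregation strategies handle more cases (for example, merging adjacent whole words of the same entity), and `group_subwords` is a hypothetical helper.

```python
def group_subwords(tokens):
    """Merge '##' continuation pieces into the preceding token and average the scores."""
    groups = []
    for tok in tokens:
        label = tok['entity'].split('-')[-1]  # 'I-PER' -> 'PER'
        if tok['word'].startswith('##') and groups:
            group = groups[-1]
            group['word'] += tok['word'][2:]  # drop the '##' marker
            group['end'] = tok['end']
            group['scores'].append(tok['score'])
        else:
            groups.append({'entity_group': label, 'word': tok['word'],
                           'start': tok['start'], 'end': tok['end'],
                           'scores': [tok['score']]})
    return [{'entity_group': g['entity_group'],
             'score': sum(g['scores']) / len(g['scores']),
             'word': g['word'], 'start': g['start'], 'end': g['end']}
            for g in groups]

# Per-token predictions copied from the output above
token_preds = [
    {'end': 14, 'entity': 'I-PER', 'index': 4, 'score': 0.99807906, 'start': 11, 'word': 'Lal'},
    {'end': 15, 'entity': 'I-PER', 'index': 5, 'score': 0.99698573, 'start': 14, 'word': '##k'},
    {'end': 18, 'entity': 'I-PER', 'index': 6, 'score': 0.99242496, 'start': 15, 'word': '##ris'},
    {'end': 20, 'entity': 'I-PER', 'index': 7, 'score': 0.9867569, 'start': 18, 'word': '##hn'},
    {'end': 21, 'entity': 'I-PER', 'index': 8, 'score': 0.99448365, 'start': 20, 'word': '##a'},
]

grouped = group_subwords(token_preds)
print(grouped)
```

The averaged score agrees with the `0.993746` reported by the grouped pipeline above.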

## Question Answering

In [51]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [55]:
question_answerer(question = 'Where do I work?',
                  context = "My name is Lalkrishna and I working for Microsoft")

{'answer': 'Microsoft', 'end': 49, 'score': 0.9773754477500916, 'start': 40}
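The `start` and `end` fields are character offsets into `context`, so the answer is an extracted span rather than generated text. A quick check using the values from the output above:

```python
context = "My name is Lalkrishna and I working for Microsoft"

# Result copied from the pipeline output above
result = {'answer': 'Microsoft', 'end': 49, 'score': 0.9773754477500916, 'start': 40}

# Slicing the context with the reported offsets recovers the answer string
span = context[result['start']:result['end']]
print(span)
```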

## Summarization

In [56]:
text = """Lionel Messi put in a man-of-the-match performance as Paris Saint-Germain (PSG) thumped defending champions Lille 5-1 in Ligue 1 on Sunday. The Argentinian superstar scored his second league goal of the campaign, while also providing an assist on the night.

PSG needed to bounce back with a huge performance after being knocked out of the Coupe de France by Nice last week and that's exactly what Mauricio Pochettino's side did. Despite having a severely depleted squad, the French giants fielded an extremely strong lineup, with Lionel Messi, Angel Di Maria and Kylian Mbappe all starting in attack.

Lille shot themselves in the foot early on as goalkeeper Ivo Grbic spilled a routine cross for Danilo Pereira to stab home and give his side the lead. However, the French champions did well to stay in the game and their resilience was rewarded in the 28th minute when Sven Botman scored an acrobatic scissor kick after some good work by Hatem Ben Arfa down the touchline.

Their joy was shortlived, however, as Presnel Kimpembe powered home a header after connecting with Lionel Messi's corner in the 32nd minute. The Argentine would then get in on the action in vintage fashion. He skipped past the defender's challenge before dinking the ball over Grbic to score his second league goal of the campaign.

The Parisians went into half-time 3-1 up and in the ascendancy with Messi hitting the crossbar with a free-kick right before the half ended. The second half started much the same, with PSG dominating proceedings. Their pressure paid off in the 51st minute as Pereira fired home from outside the area after a scramble in the box. This was the midfielder's third goal in his last three games.

Pochettino's side came close to making it 5-1 four minutes later as Mbappe saw his effort saved after latching onto Achraf Hakimi's dangerous cross. However, the Frenchman would not be denied a second time as he curled an effort into the top corner in the 67th minute after some good work by Marco Verratti in midfield.

This killed the game as a contest as the tempo of both sides' attacking play drastically reduced. Neither side made any more significant chances and PSG came away 5-1 winners."""

In [57]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [59]:
summary = summarizer(text)

In [63]:
print(summary)

[{'summary_text': ' Paris Saint-Germain beat Lille 5-1 in Ligue 1 on Sunday . Lionel Messi scored his second league goal of the season for the French champions . Presnel Kimpembe and Danilo Pereira also on target for the hosts . Kylian Mbappe scored twice in the second half to complete the rout .'}]


## Translation

In [62]:
translator = pipeline('translation', model = 'Helsinki-NLP/opus-mt-en-ml')

In [67]:
translator("Hai friends, this is me Arjun.")

[{'translation_text': 'ഹൈ സുഹൃത്തുക്കളേ, ഇത് ഞാനാണ് അർജുൻ.'}]