<a href="https://colab.research.google.com/github/linhlinhle997/low-resource-nmt-bart/blob/develop/BART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [46]:
import torch
from transformers import (
    AutoTokenizer,
    BartForConditionalGeneration,
    BartForSequenceClassification,
    BartForQuestionAnswering,
)

## Mask Filling

In [2]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

In [13]:
# Input sentence with a masked word
TXT = "My friends are <mask> but they eat too many carbs."

# Convert text to token IDs
input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
input_ids

tensor([[    0,  2387,   964,    32, 50264,    53,    51,  3529,   350,   171,
         33237,     4,     2]])

In [16]:
# Get model predictions (logits)
logits = model(input_ids).logits

logits.shape

torch.Size([1, 13, 50265])

In [26]:
# Find the position of the <mask> token in the input sentence
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()

masked_index

4
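
The `.nonzero().item()` call above only works when the sentence contains exactly one `<mask>`. As a minimal sketch of the same lookup that tolerates zero or several masks, here is a plain-Python version. The token ids are copied from the tensor printed earlier, and `MASK_TOKEN_ID = 50264` is read off that output rather than queried from the tokenizer.

```python
# Token ids copied from the `input_ids` tensor printed above.
input_ids = [0, 2387, 964, 32, 50264, 53, 51, 3529, 350, 171, 33237, 4, 2]

# Assumption: 50264 is the id of <mask>, as the printed tensor suggests.
MASK_TOKEN_ID = 50264

# Collect every position holding the mask id (an empty list means no mask).
mask_positions = [i for i, token_id in enumerate(input_ids) if token_id == MASK_TOKEN_ID]

print(mask_positions)  # [4]
```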

In [28]:
probs = logits[0, masked_index].softmax(dim=0) # Compute probabilities over the vocabulary for <mask>
values, predictions = probs.topk(5) # Get top 5 predictions
# Decode each token ID separately so multi-token or subword pieces are not merged
top_words = [tokenizer.decode(token_id).strip() for token_id in predictions.tolist()]

print(TXT)
print(f"Top 5 Token IDs: {predictions}")
print(f"Top 5 Probabilities: {values}")
print(f"Predicted Words: {top_words}")

My friends are <mask> but they eat too many carbs.
Top 5 Token IDs: tensor([  45,  205, 2245,  372,  182])
Top 5 Probabilities: tensor([0.0929, 0.0917, 0.0855, 0.0579, 0.0412], grad_fn=<TopkBackward0>)
Predicted Words: ['not', 'good', 'healthy', 'great', 'very']
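
To make the `softmax` and `topk` steps concrete, the following is a small dependency-free sketch of what they compute. The helper names `softmax` and `top_k` and the four toy scores are hypothetical, standing in for the 50265 vocabulary logits at the masked position.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for a 4-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(toy_logits)
best = top_k(probs, 2)

for index, prob in best:
    print(index, round(prob, 3))
```

In the notebook, the indices returned by `topk` are vocabulary ids, which is why they can be passed to `tokenizer.decode`.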


## Text Classification

The `valhalla/bart-large-sst2` model is fine-tuned on the SST-2 dataset for binary sentiment analysis. It labels text as POSITIVE or NEGATIVE, which makes it useful for analyzing reviews, social media posts, and feedback.

In [30]:
tokenizer = AutoTokenizer.from_pretrained("valhalla/bart-large-sst2")
model = BartForSequenceClassification.from_pretrained("valhalla/bart-large-sst2")

In [31]:
TXT = "Hello, my dog is cute"

inputs = tokenizer(TXT, return_tensors="pt")

inputs

{'input_ids': tensor([[    0, 31414,     6,   127,  2335,    16, 11962,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [33]:
with torch.no_grad():
    logits = model(**inputs).logits # Get raw output scores

In [34]:
# Get the index of the highest probability class
predicted_class_id = logits.argmax().item()

predicted_class_id

1

In [35]:
# Convert class ID to a human-readable label
sentiment = model.config.id2label[predicted_class_id]
sentiment

'POSITIVE'
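
`argmax` returns only the winning class. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below uses plain Python; the `classify` helper, the toy logits, and the `ID2LABEL` mapping (index 0 for NEGATIVE, index 1 for POSITIVE) are assumptions for illustration, not values read from the model.

```python
import math

# Assumed label mapping for a two-class sentiment model.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def classify(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best_id = max(range(len(probs)), key=lambda i: probs[i])
    return id2label[best_id], probs[best_id]

# Hypothetical raw scores for one sentence.
label, confidence = classify([-1.0, 2.0])
print(label, round(confidence, 3))
```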

## Question Answering

The model `valhalla/bart-large-finetuned-squadv1` is fine-tuned for extractive Question Answering (QA) on the SQuAD v1 dataset. Given a question and a context passage, it predicts the start and end token positions of the answer span within the context.

In [37]:
tokenizer = AutoTokenizer.from_pretrained("valhalla/bart-large-finetuned-squadv1")
model = BartForQuestionAnswering.from_pretrained("valhalla/bart-large-finetuned-squadv1")

In [38]:
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt")

inputs

{'input_ids': tensor([[    0, 12375,    21,  2488,   289, 13919,   116,     2,     2, 24021,
           289, 13919,    21,    10,  2579, 29771,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [39]:
# Get model prediction
with torch.no_grad():
    outputs = model(**inputs)

In [40]:
# Find start and end positions of the answer
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

answer_start_index, answer_end_index

(tensor(14), tensor(15))

In [41]:
# Extract predicted answer tokens
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]

predict_answer_tokens

tensor([ 2579, 29771])

In [43]:
# Convert tokens back to text
answer = tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

print(answer)

 nice puppet
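
Taking the two `argmax` values independently works here, but nothing guarantees that the end index comes after the start index. One common remedy, sketched below in plain Python, is to score every pair with `start <= end` and keep the best one. The `best_span` helper, the `max_answer_len` cap, and the toy logits are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest combined score and start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Toy logits where the independent argmax picks an end before the start.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print(naive)                                # (2, 0), not a valid span
print(best_span(start_logits, end_logits))  # (2, 3)
```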


In [45]:
# Define the true answer span ("nice puppet")
target_start_index = torch.tensor([14]) # Start index of the answer
target_end_index = torch.tensor([15]) # End index of the answer

# Compute loss for training
outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)
loss = outputs.loss

print(round(loss.item(), 2))

0.59
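
For intuition about the number above: to my understanding, span-extraction heads of this kind treat the start and end positions as two separate classification problems over the sequence and average the two cross-entropy losses. The sketch below reproduces that arithmetic in plain Python under that assumption; `cross_entropy` and `span_loss` are hypothetical helper names, and the logits are toy values.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Average of the start-position and end-position cross-entropies."""
    start_loss = cross_entropy(start_logits, start_pos)
    end_loss = cross_entropy(end_logits, end_pos)
    return (start_loss + end_loss) / 2

# With uniform logits over 4 positions, each loss is ln(4).
uniform = [0.0, 0.0, 0.0, 0.0]
print(round(span_loss(uniform, uniform, 1, 2), 4))  # 1.3863
```

A loss near zero would mean the model puts almost all of its probability on the target start and end positions.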


## Summarization

The model `facebook/bart-large-cnn` is a BART model fine-tuned for text summarization. It was trained on the CNN/DailyMail dataset, which consists of news articles and their summaries.

In [47]:
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

In [48]:
ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)

# Tokenize input text, truncating anything beyond 1024 tokens
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, truncation=True, return_tensors="pt")

inputs

{'input_ids': tensor([[    0,  8332,   947,   717,  2305,    24,  1768,     5,   909,  4518,
            11,  1263,     7,  5876,    13,   239,  2372,  2876,  3841,  1274,
             4,    20,  4374,    16,     7,  1888,     5,   810,     9, 12584,
             4,  9221,  5735,  7673,   916,    58,  1768,     7,    28,  2132,
            30,     5,  2572, 10816,    61,    58,   421,     7,    94,   149,
            23,   513, 15372,  3859,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]])}

In [49]:
# Generate summary with beam search
summary_ids = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)

summary_ids

tensor([[   2,    0, 8332,  947,  717, 1768,    5,  909, 4518,   11, 1263,    7,
         5876,   13,  239, 2372, 2876, 3841, 1274,    2]])

In [54]:
# Decode summary tokens into text
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print("Summary:", summary)

Summary: PG&E scheduled the blackouts in response to forecasts for high winds amid dry conditions
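
The printed summary ends without a closing period, which suggests generation ran into the `max_length=20` cap rather than finishing on its own. A quick check in plain Python, using the ids copied from the `summary_ids` tensor above, is sketched here. Treating ids 0, 1 and 2 as special tokens is an assumption based on how they frame the sequences in this notebook.

```python
# Ids copied from the `summary_ids` tensor printed above.
summary_ids = [2, 0, 8332, 947, 717, 1768, 5, 909, 4518, 11, 1263, 7,
               5876, 13, 239, 2372, 2876, 3841, 1274, 2]

MAX_LENGTH = 20  # value passed to model.generate above

# Assumption: 0, 1 and 2 are the start, padding and end token ids.
SPECIAL_IDS = {0, 1, 2}

hit_limit = len(summary_ids) == MAX_LENGTH
content_ids = [t for t in summary_ids if t not in SPECIAL_IDS]

print(hit_limit)         # True
print(len(content_ids))  # 17
```

If the output is being cut short this way, raising `max_length` in the `generate` call is the natural thing to try.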
