## Here, I show some tasks using BERT, XLM-RoBERTa, and T5 models. I use some high-level API pipelines available with the models. 

BERT Documentation: https://huggingface.co/docs/transformers/model_doc/bert

BERT was proposed in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It is a bidirectional transformer pretrained using a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

BERT was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. At the time of publication it obtained new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 percentage point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement). BERT is an encoder-only model and is not well suited to text generation.
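The masked language modeling objective mentioned above can be sketched in a few lines of plain Python. This is only an illustration, not the actual pretraining code: the helper name `mask_tokens`, the toy vocabulary, and the fixed seed are assumptions made for the example. The 15% selection rate and the 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged) follow the BERT paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a toy masked-LM example.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)                      # not selected, no loss at this position
            corrupted.append(tok)
    return corrupted, labels

tokens = "my dog is cute and he likes playing in the park".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

Positions where `labels` is `None` are left untouched and do not contribute to the training loss.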

### BERT is designed for natural language understanding with tasks specific to (1) Text Classification, (2) Named Entity Recognition (NER), (3) Question Answering, and (4) Sentiment Analysis

In [1]:
from transformers import BertModel, BertConfig

# Initializing a BERT bert-base-uncased style configuration
configuration = BertConfig()

# Initializing a model (with random weights) from the bert-base-uncased style configuration
model = BertModel(configuration)

# Accessing the model configuration. BertConfig is a class that stores the model hyperparameters, and they can be changed.
configuration = model.config



In [2]:
configuration

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [3]:
configuration.hidden_size, configuration.num_attention_heads

(768, 12)

In [4]:
# BertTokenizer is also a class
from transformers import BertTokenizer

# Create an instance of the BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Access the tokenizer details
vocab_size = tokenizer.vocab_size
max_length = tokenizer.model_max_length
special_tokens = tokenizer.special_tokens_map

# Print the details
print("Vocabulary Size:", vocab_size)
print("Maximum Length:", max_length)
print("Special Tokens:", special_tokens)

Vocabulary Size: 30522
Maximum Length: 512
Special Tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}


In [5]:
# Import the necessary libraries
from transformers import BertTokenizer, BertModel
import torch

#Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
#Load the pre-trained BERT model
model = BertModel.from_pretrained('bert-base-uncased')

#The `tokenizer` function takes the input text as an argument and returns a dictionary of tokenized inputs. 
#The `return_tensors="pt"` argument specifies that the tokenizer should return PyTorch tensors.
#Tokenize the input text using the tokenizer:
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

#Pass the tokenized inputs through the BERT model:
# The two stars `**` in `model(**inputs)` are used for unpacking the dictionary `inputs` and passing its key-value pairs 
# as keyword arguments to the `model` function. This syntax is known as "dictionary unpacking" or "keyword argument 
#unpacking" in Python.
outputs = model(**inputs)

last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
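The `**inputs` unpacking used in the cell above is plain Python and can be tried without loading any model. In the sketch below, `fake_model` is a hypothetical stand-in function, not part of `transformers`; it only shows that unpacking a dictionary is equivalent to passing each key as a keyword argument.

```python
def fake_model(input_ids, token_type_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model's forward call."""
    return {
        "n_tokens": len(input_ids[0]),
        "has_mask": attention_mask is not None,
    }

inputs = {
    "input_ids": [[101, 7592, 102]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

# These two calls are equivalent.
unpacked = fake_model(**inputs)
explicit = fake_model(
    input_ids=inputs["input_ids"],
    token_type_ids=inputs["token_type_ids"],
    attention_mask=inputs["attention_mask"],
)
print(unpacked == explicit)
```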


In [6]:
# Let us see what inputs contains
inputs

{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [7]:
#inputs returns dictionary of key-value pairs for input_ids, token_type_ids, and attention_mask
# Access different keys:
print(inputs['input_ids'], inputs['token_type_ids'],inputs['attention_mask'])

tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]) tensor([[0, 0, 0, 0, 0, 0, 0, 0]]) tensor([[1, 1, 1, 1, 1, 1, 1, 1]])


In [8]:
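# Each token is mapped to its integer id in the vocabulary (input_ids). token_type_ids marks which segment a token
# belongs to; it is all 0 when every token belongs to the same segment.
# attention_mask marks which positions are real tokens (1) and which are padding to be ignored (0).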
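To make the role of `attention_mask` concrete, here is a small sketch in plain Python that pads two sequences of token ids to a common length. The helper `pad_batch` is hypothetical and only mimics what a tokenizer does when padding is enabled; the pad id 0 matches `pad_token_id` in the configuration printed earlier.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 7592, 1010, 2026, 3899, 2003, 10140, 102],  # ids printed in the cells above
    [101, 7592, 102],                                  # a shorter sequence
])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

With a single unpadded sentence, as in the cells above, the mask is all ones because every position holds a real token.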

In [9]:
# Let us see outputs
outputs 

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1144,  0.1937,  0.1250,  ..., -0.3827,  0.2107,  0.5407],
         [ 0.5308,  0.3207,  0.3665,  ..., -0.0036,  0.7579,  0.0388],
         [-0.4877,  0.8849,  0.4256,  ..., -0.6976,  0.4458,  0.1231],
         ...,
         [-0.7003, -0.1815,  0.3297,  ..., -0.4838,  0.0680,  0.8901],
         [-1.0355, -0.2567, -0.0317,  ...,  0.3197,  0.3999,  0.1795],
         [ 0.6080,  0.2610, -0.3131,  ...,  0.0311, -0.6283, -0.1994]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-7.1946e-01, -2.1445e-01, -2.9576e-01,  3.6603e-01,  2.7968e-01,
          2.2183e-02,  5.7299e-01,  6.2331e-02,  5.9585e-02, -9.9965e-01,
          5.0145e-02,  4.4756e-01,  9.7612e-01,  3.3989e-02,  8.4494e-01,
         -3.6905e-01,  9.8648e-02, -3.7169e-01,  1.7371e-01,  1.1515e-01,
          4.4133e-01,  9.9525e-01,  3.7221e-01,  8.2881e-02,  2.1402e-01,
          6.8965e-01, -6.1042e-01,  8.7136e-01,  9.4158e-01,  5.737

In [10]:
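print(type(outputs))
# Older versions of transformers returned a plain tuple here. Current versions return a
# ModelOutput object, which still supports tuple-style indexing such as outputs[0].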

<class 'transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions'>


In [11]:
outputs[0]

tensor([[[-0.1144,  0.1937,  0.1250,  ..., -0.3827,  0.2107,  0.5407],
         [ 0.5308,  0.3207,  0.3665,  ..., -0.0036,  0.7579,  0.0388],
         [-0.4877,  0.8849,  0.4256,  ..., -0.6976,  0.4458,  0.1231],
         ...,
         [-0.7003, -0.1815,  0.3297,  ..., -0.4838,  0.0680,  0.8901],
         [-1.0355, -0.2567, -0.0317,  ...,  0.3197,  0.3999,  0.1795],
         [ 0.6080,  0.2610, -0.3131,  ...,  0.0311, -0.6283, -0.1994]]],
       grad_fn=<NativeLayerNormBackward0>)

In [12]:
outputs.last_hidden_state, outputs.pooler_output,outputs.cross_attentions

(tensor([[[-0.1144,  0.1937,  0.1250,  ..., -0.3827,  0.2107,  0.5407],
          [ 0.5308,  0.3207,  0.3665,  ..., -0.0036,  0.7579,  0.0388],
          [-0.4877,  0.8849,  0.4256,  ..., -0.6976,  0.4458,  0.1231],
          ...,
          [-0.7003, -0.1815,  0.3297,  ..., -0.4838,  0.0680,  0.8901],
          [-1.0355, -0.2567, -0.0317,  ...,  0.3197,  0.3999,  0.1795],
          [ 0.6080,  0.2610, -0.3131,  ...,  0.0311, -0.6283, -0.1994]]],
        grad_fn=<NativeLayerNormBackward0>),
 tensor([[-7.1946e-01, -2.1445e-01, -2.9576e-01,  3.6603e-01,  2.7968e-01,
           2.2183e-02,  5.7299e-01,  6.2331e-02,  5.9585e-02, -9.9965e-01,
           5.0145e-02,  4.4756e-01,  9.7612e-01,  3.3989e-02,  8.4494e-01,
          -3.6905e-01,  9.8648e-02, -3.7169e-01,  1.7371e-01,  1.1515e-01,
           4.4133e-01,  9.9525e-01,  3.7221e-01,  8.2881e-02,  2.1402e-01,
           6.8965e-01, -6.1042e-01,  8.7136e-01,  9.4158e-01,  5.7372e-01,
          -3.2187e-01,  8.6672e-03, -9.8611e-01, -2.0542

In [13]:
outputs = model(**inputs)

# Shape of last_hidden_state
print(outputs.last_hidden_state.shape)

# Shape of pooler_output
print(outputs.pooler_output.shape)

# Check whether cross_attentions were returned (the attribute always exists but may be None)
if outputs.cross_attentions is not None:
    # Shapes of cross_attentions
    for i, attentions in enumerate(outputs.cross_attentions):
        print(f"Shape of cross_attentions[{i}]: {attentions.shape}")
else:
    print("cross_attentions is None for this model call.")

torch.Size([1, 8, 768])
torch.Size([1, 768])
cross_attentions is None for this model call.


The `last_hidden_state` tensor has three dimensions: `(batch_size, sequence_length, hidden_size)`. Here's what each dimension represents:

1. `batch_size`: The number of input sequences processed together in a batch. It indicates how many sequences are processed simultaneously during inference or training.

2. `sequence_length`: The length of the longest input sequence in the batch. It represents the number of tokens in each sequence.

3. `hidden_size`: The size of the hidden state representation for each token. It indicates the dimensionality of the contextualized representation for each token in the sequence.
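The three dimensions can be explored with a toy example in plain Python. The nested lists below stand in for a `last_hidden_state` tensor, and the helpers `shape` and `masked_mean` are hypothetical names introduced for this sketch. Averaging over the `sequence_length` axis is one common way to get a single vector per sequence; it is shown only to illustrate the axes and is not how `pooler_output` is computed.

```python
def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

def masked_mean(hidden, mask):
    """Average token vectors over the sequence axis, skipping padding."""
    pooled = []
    for seq_vectors, seq_mask in zip(hidden, mask):
        kept = [vec for vec, m in zip(seq_vectors, seq_mask) if m == 1]
        pooled.append([sum(col) / len(kept) for col in zip(*kept)])
    return pooled

# Toy "last_hidden_state": batch_size=2, sequence_length=3, hidden_size=4
hidden = [
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]  # the last token of the first sequence is padding

pooled = masked_mean(hidden, mask)
print(shape(hidden))  # (2, 3, 4)
print(shape(pooled))  # (2, 4)
```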

In [14]:
from transformers import BertTokenizer, BertForPreTraining
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

prediction_scores, seq_relationship_scores = outputs[:2]
prediction_scores, seq_relationship_scores

(tensor([[[ -7.8962,  -7.8105,  -7.7903,  ...,  -7.0694,  -7.1693,  -4.3590],
          [ -8.4461,  -8.4401,  -8.5044,  ...,  -8.0625,  -7.9909,  -5.7160],
          [-15.2953, -15.4727, -15.5865,  ..., -12.9857, -11.7038, -11.4293],
          ...,
          [-14.0628, -14.2535, -14.3645,  ..., -12.7151, -11.1621, -10.2317],
          [-10.6576, -10.7892, -11.0402,  ..., -10.3233, -10.1578,  -3.7722],
          [-11.3383, -11.4590, -11.1767,  ...,  -9.2152,  -9.5209,  -9.5571]]],
        grad_fn=<ViewBackward0>),
 tensor([[ 3.3474, -2.0613]], grad_fn=<AddmmBackward0>))

## Text/Sequence Classification using BERT

In [15]:
from transformers import BertForSequenceClassification, BertTokenizer

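# Load pre-trained BERT model and tokenizer.
# Note: bert-base-uncased has no trained classification head, so the head is newly initialized
# and the prediction below is arbitrary until the model is fine-tuned on labeled data.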
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define the input text
text = "I really enjoyed watching this movie. The acting and plot were excellent!"

# Tokenize input text
inputs = tokenizer(text, return_tensors='pt')

# Perform text classification
outputs = model(**inputs)
# logits are raw output scores for each class.
logits = outputs.logits

# Interpretation of logits
# We use `argmax` to select the class with the highest score (`predicted_class`).
# We also create a list of class names whose index corresponds to the class label. The names below are an
# assumption for illustration; bert-base-uncased does not define sentiment labels.
predicted_class = logits.argmax(dim=1)
class_names = ['Negative', 'Positive']
predicted_label = class_names[predicted_class.item()]

#  Finally, we retrieve the predicted label by mapping the predicted class to its corresponding class name using 
# `class_names[predicted_class.item()]`.
# Print the predicted class label
print("Predicted class: ", predicted_label)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Predicted class:  Positive
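Logits are raw scores, so it can help to turn them into probabilities before reading off a label. The sketch below does this with a softmax in plain Python. The logit values are hypothetical and chosen only for illustration; they are not the output of the model above.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.3, 0.9]                    # hypothetical scores for two classes
class_names = ["Negative", "Positive"]

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(class_names[predicted], [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, the index picked here is the same one `argmax` over the logits would return.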


### Named Entity Recognition using high-level API pipeline

In [16]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]


## Question Answering Task: the pipeline uses a DistilBERT model by default

In [17]:
from transformers import pipeline

qa_model = pipeline("question-answering")
question = "Where do I live?"
context = "My name is Merve and I live in İstanbul."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9538117051124573, 'start': 31, 'end': 39, 'answer': 'İstanbul'}
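The `start` and `end` values in the result are character offsets into the context string, so the answer can be recovered by slicing. A minimal check, with the values copied from the output above:

```python
context = "My name is Merve and I live in İstanbul."
result = {"start": 31, "end": 39, "answer": "İstanbul"}  # copied from the pipeline output

# Slice the context with the reported character offsets
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```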

## Sentiment Analysis (API pipeline)

In [18]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
t = "I like you"
pipe(t)



[{'label': '5 stars', 'score': 0.4749954640865326}]

In [19]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

## <center> XLM-RoBERTa

https://huggingface.co/xlm-roberta-base

XLM-RoBERTa is a multilingual version of RoBERTa, which is itself a robustly optimized variant of BERT. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It can predict masked tokens and, after fine-tuning, classify text.

In [20]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='xlm-roberta-base')
unmasker("Hello I'm a <mask> model.")

[{'score': 0.10563922673463821,
  'token': 54543,
  'token_str': 'fashion',
  'sequence': "Hello I'm a fashion model."},
 {'score': 0.08015299588441849,
  'token': 3525,
  'token_str': 'new',
  'sequence': "Hello I'm a new model."},
 {'score': 0.03341350704431534,
  'token': 3299,
  'token_str': 'model',
  'sequence': "Hello I'm a model model."},
 {'score': 0.03021792322397232,
  'token': 92265,
  'token_str': 'French',
  'sequence': "Hello I'm a French model."},
 {'score': 0.02643618918955326,
  'token': 17473,
  'token_str': 'sexy',
  'sequence': "Hello I'm a sexy model."}]

In [21]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
output

MaskedLMOutput(loss=None, logits=tensor([[[ 6.4861e+01,  1.6882e-02,  3.7656e+01,  ...,  2.1584e+01,
           1.4380e+01,  1.8790e+01],
         [ 2.7493e+01, -1.4091e+00,  6.4847e+01,  ...,  4.0234e+01,
           1.6296e+01,  3.0925e+01],
         [ 1.9604e+01, -1.2597e+00,  4.8981e+01,  ...,  3.5830e+01,
           1.7145e+01,  2.7173e+01],
         ...,
         [ 2.2920e+01, -1.4657e+00,  5.1211e+01,  ...,  3.8495e+01,
           1.6508e+01,  2.7687e+01],
         [ 2.8598e+01, -1.2868e+00,  6.7706e+01,  ...,  4.4857e+01,
           1.8004e+01,  3.5004e+01],
         [ 4.4955e+01, -2.1554e-01,  4.9643e+01,  ...,  2.8253e+01,
           1.6841e+01,  2.3610e+01]]], grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [22]:
from transformers import XLMRobertaConfig, XLMRobertaModel

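# Initializing a default XLM-RoBERTa configuration.
# Note: the defaults printed below (for example vocab_size 30522) are not the values of the pretrained
# xlm-roberta-base checkpoint, which uses a much larger multilingual vocabulary.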
configuration = XLMRobertaConfig()

# Initializing a model (with random weights) from the xlm-roberta-base style configuration
model = XLMRobertaModel(configuration)

# Accessing the model configuration
configuration = model.config
configuration

XLMRobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [23]:
from transformers import AutoTokenizer, XLMRobertaForCausalLM, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
config = AutoConfig.from_pretrained("roberta-base")
config.is_decoder = True
model = XLMRobertaForCausalLM.from_pretrained("roberta-base", config=config)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# XLMRobertaForCausalLM is a language model, not a classifier: logits has shape
# (batch_size, sequence_length, vocab_size), one score per vocabulary token at each position.
logits = outputs.logits

# Take the scores at the last position and decode the most likely next token.
# Note: roberta-base was not pretrained as a causal LM, so the prediction is not expected to be meaningful.
next_token_id = logits[0, -1].argmax().item()
print("Predicted next token:", tokenizer.decode([next_token_id]))


## <center> T5 Model (Text-to-Text Transformer Model)

The T5 model was developed by Google Research and currently supports English, French, German, and Romanian. T5-base has about 220M parameters; T5-small, used in the first cell below, has about 60M. https://huggingface.co/transformers/v3.0.2/notebooks.html

In [24]:
from transformers import T5Tokenizer, T5ForConditionalGeneration 

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
# The `.input_ids` attribute extracts the token ids from the dictionary-like object returned by the tokenizer.
# The input_ids are the numerical representation of the tokenized input text that the T5 model consumes.
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Das Haus ist wunderbar.

Das Haus ist wunderbar.
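The `translate English to German:` prefix in the cell above is how T5 is told which task to perform: every task is phrased as text in, text out. The sketch below only builds such input strings in plain Python. The helper `with_prefix` and the dictionary keys are hypothetical names for this example; the prefix strings are among those used by the original T5 checkpoints.

```python
# Task prefixes used by the original T5 checkpoints (a small selection).
T5_PREFIXES = {
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
    "summarize": "summarize: ",
}

def with_prefix(task, text):
    """Hypothetical helper: prepend the T5 task prefix to the input text."""
    return T5_PREFIXES[task] + text

print(with_prefix("en_to_de", "The house is wonderful."))
print(with_prefix("summarize", "The T5 model is a versatile language model."))
```

The resulting strings are what would be passed to the tokenizer before calling `model.generate`.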




### Text Summarization

In [25]:
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
text = "The T5 model is a versatile language model that can be used for various natural language processing tasks. It has been trained on a large amount of text data and can generate high-quality summaries of long documents or articles."
summary = summarizer(text, max_length=100, min_length=30, do_sample=False)
print(summary[0]['summary_text'])

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Your max_length is set to 100, but your input_length is only 51. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


the T5 model can be used for various natural language processing tasks . it can generate high-quality summaries of long documents or articles .


### Language Translation

In [26]:
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-base", tokenizer="t5-base")
english_text = "Hello, how are you?"
german_translation = translator(english_text)
print(german_translation[0]['translation_text'])

Hallo, wie sind Sie?


### Question Answering

In [27]:
from transformers import pipeline

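# No model is specified, so the pipeline falls back to its default DistilBERT model rather than T5.
question_answering = pipeline("question-answering")
# The context is only sample input for the pipeline; T5 itself was developed by Google Research.
context = "T5 is a language model developed by the Hugging Face team. It can be used for various natural language processing tasks."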
question = "What is T5?"
answer = question_answering(question=question, context=context)
print(answer['answer'])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


a language model


### Text Completion or Generation

In [28]:
from transformers import pipeline

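# T5 is an encoder-decoder model, which the "text-generation" pipeline (intended for causal LMs) does not support.
# This explains the warning and the repetitive output below; "text2text-generation" is the matching pipeline for T5.
text_completion = pipeline("text-generation", model="t5-base", tokenizer="t5-base")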
text = "I am going to the "
completed_text = text_completion(text, max_length=100, do_sample=False)
print(completed_text[0]['generated_text'])

The model 'T5ForConditionalGeneration' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPre

I am going to the  I am going to the I am going to the I am going to the I am going to the I am going to the I am going to the I am going to the I am going to the I am going to the I am going to the I am going to I am going to I am going to I am going to I am going to I am going to I am going to I am going to I am going to I am going to I am going to


### Paraphrasing

In [29]:
from transformers import pipeline

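# No task prefix is given and t5-base is not fine-tuned for paraphrasing,
# so the model may simply return the input, as it does here.
paraphraser = pipeline("text2text-generation", model="t5-base", tokenizer="t5-base")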
text = "I am going for a walk."
paraphrase = paraphraser(text, max_length=100, do_sample=False)
print(paraphrase[0]['generated_text'])

I am going for a walk.


### Comparison between BERT and T5:

1. BERT (Bidirectional Encoder Representations from Transformers):
   - BERT-base: It has around 110 million parameters.
   - BERT-large: It has around 340 million parameters.

BERT has a large number of parameters, allowing it to capture intricate patterns and representations in the data. This size enables it to learn high-quality contextualized word embeddings, which can be beneficial for various NLP tasks.
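The "around 110 million" figure for BERT-base can be reproduced by hand from the configuration values printed earlier (`hidden_size`, `num_hidden_layers`, `intermediate_size`, `vocab_size`, and so on). The sketch below is a back-of-the-envelope count in plain Python; the function name `bert_encoder_params` is an assumption for this example, and it counts only the weights and biases of the encoder plus the pooling layer, without any task-specific head.

```python
def bert_encoder_params(vocab_size, hidden_size, num_layers, intermediate_size,
                        max_positions=512, type_vocab_size=2):
    """Count weights and biases of a BERT encoder with a pooling layer."""
    layer_norm = 2 * hidden_size  # scale and bias
    # Word, position and segment embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers (up and down projection), followed by a LayerNorm
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size + layer_norm)
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_layers * (attention + feed_forward) + pooler

total = bert_encoder_params(vocab_size=30522, hidden_size=768,
                            num_layers=12, intermediate_size=3072)
print(f"{total:,}")  # 109,482,240
```

Roughly a fifth of these parameters sit in the embedding table, which is why vocabulary size matters so much for the size of smaller models.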

2. T5 (Text-to-Text Transfer Transformer):
   - T5-base: It has around 220 million parameters.
   - T5-large: It has around 770 million parameters.
   - T5-3B: It has around 3 billion parameters.
   - T5-11B: It has around 11 billion parameters.

T5 is considered an LLM due to its massive number of parameters. The large size of T5 allows it to capture complex relationships between text inputs and outputs, making it highly effective for a wide range of NLP tasks.

BERT, when compared with GPT-3 and T5, is not considered an LLM because it does not have billions of parameters.