## BERT

- Popularized deep bidirectional pre-training: the representation of each word is conditioned on both its left and right context in the sentence.
- Uses a stack of bidirectional Transformer encoder layers (12 in BERT-base, 24 in BERT-large).
- Is pre-trained with a "Masked Language Model" (MLM) objective, in which tokens are masked at random and the model is trained to predict the masked tokens.
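
The masking step of MLM can be sketched in plain Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are assumptions made for the example. The 15% selection rate and the 80/10/10 replacement rule follow the description in the original BERT paper.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one MLM training example.

    Hypothetical helper for illustration only.
    labels[i] holds the original token at selected positions and
    None everywhere else (positions that do not contribute to the loss).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Select each position independently with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
print(masked)
print(labels)
```

The model only receives `masked` as input and is trained to recover the non-`None` entries of `labels`, which is what forces it to use context on both sides of each position.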

In [1]:
from transformers import BertTokenizer, BertForTokenClassification
from transformers import pipeline

# Load the pre-trained BERT model and tokenizer for NER
tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english')
model = BertForTokenClassification.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english')

# Create a pipeline for NER
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

# Sample text
text = "Hugging Face Inc. is a company based in New York City. Its technology is used by more than 5,000 organizations worldwide."

# Use the NER pipeline to find entities in the text
entities = ner_pipeline(text)

# Print detected entities and their labels
for entity in entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}")

Entity: Hu, Label: I-ORG
Entity: ##gging, Label: I-ORG
Entity: Face, Label: I-ORG
Entity: Inc, Label: I-ORG
Entity: New, Label: I-LOC
Entity: York, Label: I-LOC
Entity: City, Label: I-LOC
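
The output above lists WordPiece sub-tokens, which is why "Hugging" appears as `Hu` followed by `##gging`. A minimal sketch of gluing those pieces back into whole words is shown here. The helper `merge_wordpieces` is hypothetical, and the `sample` list is copied by hand from the printed output rather than produced by running the model.

```python
def merge_wordpieces(entities):
    """Merge '##'-prefixed WordPiece continuations into the preceding word.

    Hypothetical helper; expects dicts with 'word' and 'entity' keys,
    as printed by the loop above.
    """
    merged = []
    for entity in entities:
        word, label = entity["word"], entity["entity"]
        if word.startswith("##") and merged:
            # Continuation piece: append to the previous word, drop the '##'.
            merged[-1]["word"] += word[2:]
        else:
            merged.append({"word": word, "entity": label})
    return merged


# Hand-copied from the output above (not regenerated here).
sample = [
    {"word": "Hu", "entity": "I-ORG"},
    {"word": "##gging", "entity": "I-ORG"},
    {"word": "Face", "entity": "I-ORG"},
    {"word": "Inc", "entity": "I-ORG"},
    {"word": "New", "entity": "I-LOC"},
    {"word": "York", "entity": "I-LOC"},
    {"word": "City", "entity": "I-LOC"},
]

for entity in merge_wordpieces(sample):
    print(f"Entity: {entity['word']}, Label: {entity['entity']}")
```

The `transformers` token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups pieces into entity spans, which is likely the more convenient route in practice.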


## RoBERTa: Robustly Optimized BERT

- Optimizes BERT's original pre-training procedure and hyperparameters, for example by training longer and with larger batches.
- Uses a larger and more diverse pre-training dataset.
- Replaces BERT's static masking with dynamic masking, where a new mask pattern is sampled each time a sequence is fed to the model, and drops the next sentence prediction (NSP) objective.
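
The difference between a mask that is fixed once and a mask that is resampled on every pass can be sketched as follows. This is a toy illustration under stated assumptions: the helper `sample_mask_positions`, the 20-token sequence length, the 15% rate, and the four epochs are all chosen for the example and are not taken from RoBERTa's training code.

```python
import random


def sample_mask_positions(num_tokens, mask_prob, rng):
    """Pick the positions to mask for one pass over a sequence (hypothetical helper)."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


num_tokens = 20   # illustrative sequence length
num_epochs = 4    # illustrative number of passes over the data
mask_prob = 0.15

# Static masking: positions are sampled once during preprocessing
# and the same pattern is reused in every epoch.
static_rng = random.Random(0)
static_positions = sample_mask_positions(num_tokens, mask_prob, static_rng)
static_schedule = [static_positions for _ in range(num_epochs)]

# Dynamic masking: positions are resampled every time the sequence is seen.
dynamic_rng = random.Random(0)
dynamic_schedule = [
    sample_mask_positions(num_tokens, mask_prob, dynamic_rng)
    for _ in range(num_epochs)
]

for epoch in range(num_epochs):
    print(f"epoch {epoch}: static={static_schedule[epoch]} dynamic={dynamic_schedule[epoch]}")
```

With resampling, the model is asked to predict different tokens of the same sentence across epochs, so it sees more varied training signal from the same data.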

In [2]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import pipeline

# Load the pre-trained RoBERTa model and tokenizer for sentiment analysis
tokenizer = RobertaTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
model = RobertaForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')

# Create a pipeline for sentiment analysis
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Sample text
text = "RoBERTa models are amazing for natural language processing tasks!"

# Use the sentiment analysis pipeline to assess the sentiment of the text
sentiment = sentiment_pipeline(text)

# Print the sentiment analysis result
print(f"Sentiment: {sentiment[0]['label']}, Score: {sentiment[0]['score']:.4f}")

Sentiment: LABEL_2, Score: 0.9721
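
The checkpoint reports generic label ids such as `LABEL_2` rather than names. A small lookup table makes the result readable. The mapping below (0 = negative, 1 = neutral, 2 = positive) is an assumption based on the model card for `cardiffnlp/twitter-roberta-base-sentiment` and should be checked against it; the helper `readable` is hypothetical, and the `result` dict is copied by hand from the output above.

```python
# Assumed mapping; verify against the model card before relying on it.
LABEL_NAMES = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def readable(result):
    """Swap a generic label id for a human-readable name (hypothetical helper)."""
    name = LABEL_NAMES.get(result["label"], result["label"])
    return {"label": name, "score": result["score"]}


# Hand-copied from the output above (not regenerated here).
result = {"label": "LABEL_2", "score": 0.9721}
pretty = readable(result)
print(f"Sentiment: {pretty['label']}, Score: {pretty['score']:.4f}")
```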


## ALBERT: A Lite BERT

- Proposes an architecture that uses computational resources more efficiently, mainly by shrinking the number of parameters.
- Reduces the parameter count by sharing parameters across Transformer layers.
- Uses "Factorized Embedding Parameterization", which decouples the embedding size from the hidden size and so further reduces the number of embedding parameters.
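
A rough parameter count shows why both ideas help. The sketch below is back-of-the-envelope arithmetic, not ALBERT's implementation. The sizes (vocabulary 30,000, hidden size 768, embedding size 128, 12 layers) are illustrative assumptions in the range of a base-sized model, and the per-layer estimate of `12 * H * H` weights ignores biases and layer normalization.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding-table weights, with or without factorization."""
    if embedding_size is None:
        # Unfactorized: one V x H table.
        return vocab_size * hidden_size
    # Factorized: a V x E table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(per_layer_params, num_layers, shared):
    """Count encoder weights; shared layers reuse a single set of weights."""
    return per_layer_params if shared else per_layer_params * num_layers


V, H, E, L = 30_000, 768, 128, 12   # illustrative sizes (assumptions)
per_layer = 12 * H * H               # rough estimate: attention + feed-forward

full_embeddings = embedding_params(V, H)
factorized_embeddings = embedding_params(V, H, E)
unshared_encoder = encoder_params(per_layer, L, shared=False)
shared_encoder = encoder_params(per_layer, L, shared=True)

print(f"embeddings: {full_embeddings:,} -> {factorized_embeddings:,}")
print(f"encoder:    {unshared_encoder:,} -> {shared_encoder:,}")
```

Under these assumptions the embedding table drops from 23,040,000 to 3,938,304 weights, and sharing cuts the encoder from 84,934,656 to 7,077,888. Sharing reduces stored weights only: every layer is still executed at inference time.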

In [None]:
from transformers import AlbertTokenizer, AlbertForSequenceClassification
from transformers import pipeline

# Load the pre-trained ALBERT model and tokenizer for sentiment analysis
tokenizer = AlbertTokenizer.from_pretrained('textattack/albert-base-v2-yelp-polarity')
model = AlbertForSequenceClassification.from_pretrained('textattack/albert-base-v2-yelp-polarity')

# Create a pipeline for sentiment analysis
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Sample text
text = "The new coffee shop on the corner is fantastic!"

# Use the sentiment analysis pipeline to assess the sentiment of the text
sentiment = sentiment_pipeline(text)

# Print the sentiment analysis result
print(f"Sentiment: {sentiment[0]['label']}, Score: {sentiment[0]['score']:.4f}")