## Hugging Face Transformers Tutorial

### Introduction
* This notebook walks through the Hugging Face `transformers` library for NLP tasks: ready-made pipelines, loading pre-trained models and tokenizers, and fine-tuning with `Trainer`.

* The library gives access to a large collection of pre-trained models, plus the tooling needed to run and fine-tune them.


In [9]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

import warnings
warnings.filterwarnings('ignore')

### Using Pipelines
* A Hugging Face pipeline wraps tokenization, model inference, and post-processing in a single call. Pipelines cover common NLP tasks such as text generation, sentiment analysis, and named entity recognition (NER).

* We'll start with pipelines because they need the least setup.

#### Sentiment Analysis Pipeline
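* The `sentiment-analysis` pipeline classifies a sentence as `POSITIVE` or `NEGATIVE` and returns a confidence score. No model is specified here, so the pipeline falls back to a default checkpoint, as the log output notes.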

In [10]:
sentiment_pipeline = pipeline("sentiment-analysis")
result = sentiment_pipeline("I love Hugging Face!")
print("Sentiment Analysis:", result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Sentiment Analysis: [{'label': 'POSITIVE', 'score': 0.9998641014099121}]
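
* The log above recommends naming the model and revision explicitly and passing a `device`. Here is a minimal sketch of doing so, reusing the checkpoint name and revision hash printed in that log. The helper name `build_sentiment_pipeline` is hypothetical, and `device=-1` (CPU) is an assumed default.

```python
# Checkpoint name and revision copied from the pipeline log above.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "714eb0f"


def build_sentiment_pipeline(device=-1):
    """Hypothetical helper: build a sentiment pipeline pinned to one checkpoint.

    device=-1 runs on CPU; device=0 selects the first CUDA GPU.
    """
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
        device=device,
    )
```

* Calling `build_sentiment_pipeline()("I love Hugging Face!")` should behave like the cell above, without the "No model was supplied" warning.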


#### Text Generation Pipeline
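* The text generation pipeline continues a prompt using a causal language model such as GPT-2. `max_length=50` caps the total length in tokens (prompt plus continuation), so the sample below can stop mid-sentence.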

In [11]:
text_gen_pipeline = pipeline("text-generation", model="gpt2")
generated_text = text_gen_pipeline("Once upon a time", max_length=50, num_return_sequences=1)
print("\nGenerated Text:", generated_text[0]['generated_text'])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



Generated Text: Once upon a time, the gods had been more like the good and the evil.

The ancient world was full of these beings.

And yet their deeds were terrible, for the gods were a thing of life.

This is


#### Named Entity Recognition (NER) Pipeline
* The NER pipeline identifies entities like persons, organizations, and locations in text.

In [12]:
# `aggregation_strategy="simple"` replaces the deprecated `grouped_entities=True`
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
entities = ner_pipeline("Hugging Face is based in New York and has a partnership with Microsoft.")
print("\nNamed Entities:", entities)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is


Named Entities: [{'entity_group': 'ORG', 'score': 0.97738045, 'word': 'Hugging Face', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9990541, 'word': 'New York', 'start': 25, 'end': 33}, {'entity_group': 'ORG', 'score': 0.9995321, 'word': 'Microsoft', 'start': 61, 'end': 70}]
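
* The `start` and `end` fields are character offsets into the input string, and `score` can be used to drop low-confidence entities. The sketch below reuses the output above (scores rounded); the `MIN_SCORE` threshold is an assumed value chosen for illustration.

```python
from collections import defaultdict

text = "Hugging Face is based in New York and has a partnership with Microsoft."

# Entities copied from the output above (scores rounded).
entities = [
    {"entity_group": "ORG", "score": 0.9774, "word": "Hugging Face", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.9991, "word": "New York", "start": 25, "end": 33},
    {"entity_group": "ORG", "score": 0.9995, "word": "Microsoft", "start": 61, "end": 70},
]

MIN_SCORE = 0.9  # assumed threshold, tune it for your data

by_type = defaultdict(list)
for ent in entities:
    if ent["score"] >= MIN_SCORE:
        # Slice the original text with the character offsets.
        by_type[ent["entity_group"]].append(text[ent["start"]:ent["end"]])

print(dict(by_type))  # {'ORG': ['Hugging Face', 'Microsoft'], 'LOC': ['New York']}
```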


#### Question Answering Pipeline
* The question-answering pipeline is extractive: it returns the span of the given context that best answers the question, along with a score and character offsets.

In [13]:
qa_pipeline = pipeline("question-answering")
context = "Hugging Face Inc. is a company based in New York."
question = "Where is Hugging Face based?"
answer = qa_pipeline(question=question, context=context)
print("\nQuestion Answering:", answer)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.



Question Answering: {'score': 0.9733557105064392, 'start': 40, 'end': 48, 'answer': 'New York'}
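
* Because the answer is a span of the context, `start` and `end` can be used to recover it by slicing, and `score` can gate whether to trust it. This is a small sketch using the result above; `answer_or_none` and its `min_score` default are hypothetical.

```python
context = "Hugging Face Inc. is a company based in New York."

# Result copied from the output above.
answer = {"score": 0.9733557105064392, "start": 40, "end": 48, "answer": "New York"}


def answer_or_none(result, min_score=0.5):
    """Hypothetical helper: return the answer text only if the score is high enough."""
    return result["answer"] if result["score"] >= min_score else None


# The offsets point back into the context string.
assert context[answer["start"]:answer["end"]] == answer["answer"]

print(answer_or_none(answer))        # New York
print(answer_or_none(answer, 0.99))  # None
```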


### Loading Pre-trained Models and Tokenizers
* Hugging Face provides various models for different tasks, including `bert-base-uncased`, `distilbert-base-uncased`, etc.

* Here, we'll load a `distilbert` model and tokenizer for sequence classification.

#### Load Model and Tokenizer
* We'll use `distilbert-base-uncased-finetuned-sst-2-english`, a DistilBERT checkpoint already fine-tuned on SST-2 for binary sentiment classification.

In [14]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenization
# Tokenize input text for the model.

text = "I love using Hugging Face transformers library!"
inputs = tokenizer(text, return_tensors="pt")
print("\nTokenized Input:", inputs)

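# Model Inference
# Run inference without tracking gradients to get sentiment predictions.

with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # shape: (batch_size, num_labels)
# argmax over the label dimension; .item() works because the batch holds one example
predicted_class = logits.argmax(dim=-1).item()
print("\nPredicted Class (0 = Negative, 1 = Positive):", predicted_class)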


Tokenized Input: {'input_ids': tensor([[  101,  1045,  2293,  2478, 17662,  2227, 19081,  3075,   999,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Predicted Class (0 = Negative, 1 = Positive): 1
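
* The argmax only gives the class index. To get a confidence score like the one the pipeline printed, the logits can be passed through a softmax. The sketch below uses plain Python and hypothetical logit values; the `id2label` mapping is assumed to match the 0 = Negative, 1 = Positive convention used above.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence, ordered [negative, positive].
example_logits = [-4.2, 4.5]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping

probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_class], probs[predicted_class])
```

* On the real tensor, `torch.softmax(logits, dim=-1)` performs the same computation.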


### Fine-tuning a Model on the IMDB Dataset
* We'll continue fine-tuning the DistilBERT checkpoint loaded above on IMDB, a movie-review sentiment dataset available through Hugging Face's `datasets` library.

#### Load Dataset
* Load the IMDB dataset for sentiment analysis. It has 25,000 training reviews and 25,000 test reviews.

In [15]:
dataset = load_dataset("imdb")
train_data = dataset["train"]
test_data = dataset["test"]

# Preprocess Dataset
# Tokenize and format the data for PyTorch.

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize_function, batched=True)
test_data = test_data.map(tokenize_function, batched=True)

# Set format to PyTorch tensors
train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Define Trainer
# Use `Trainer` for simplified model training.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
)
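
* To make the `truncation=True, padding="max_length"` settings in `tokenize_function` concrete, here is a toy sketch of what they do to a single sequence. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation, and `PAD_ID = 0` is an assumption (in practice, read `tokenizer.pad_token_id`).

```python
PAD_ID = 0  # assumed [PAD] id; in practice use tokenizer.pad_token_id


def pad_or_truncate(input_ids, max_length):
    """Toy version of truncation=True plus padding="max_length" for one sequence."""
    if len(input_ids) > max_length:
        # Keep the leading tokens and preserve the final [SEP] token.
        input_ids = input_ids[: max_length - 1] + [input_ids[-1]]
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding positions
    }


# Token ids copied from the "Tokenized Input" output earlier in the notebook.
ids = [101, 1045, 2293, 2478, 17662, 2227, 19081, 3075, 999, 102]

padded = pad_or_truncate(ids, max_length=12)
truncated = pad_or_truncate(ids, max_length=6)
print(padded)
print(truncated)
```

* With `max_length=128`, every review ends up as exactly 128 token ids, so the examples in a batch stack into a single tensor.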

#### Fine-tune Model

In [16]:
# One epoch took about 21 minutes in this run (train_runtime of roughly 1273 s); expect it to be slow on CPU
trainer.train()

{'loss': 0.3913, 'grad_norm': 14.22544002532959, 'learning_rate': 1.6800000000000002e-05, 'epoch': 0.16}
{'loss': 0.3578, 'grad_norm': 11.48914623260498, 'learning_rate': 1.3600000000000002e-05, 'epoch': 0.32}
{'loss': 0.3499, 'grad_norm': 7.203550338745117, 'learning_rate': 1.04e-05, 'epoch': 0.48}
{'loss': 0.3403, 'grad_norm': 7.7178473472595215, 'learning_rate': 7.2000000000000005e-06, 'epoch': 0.64}
{'loss': 0.3246, 'grad_norm': 14.340924263000488, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.8}
{'loss': 0.3134, 'grad_norm': 20.36943244934082, 'learning_rate': 8.000000000000001e-07, 'epoch': 0.96}


{'eval_loss': 0.31017592549324036, 'eval_runtime': 281.1914, 'eval_samples_per_second': 88.907, 'eval_steps_per_second': 11.113, 'epoch': 1.0}
{'train_runtime': 1273.1583, 'train_samples_per_second': 19.636, 'train_steps_per_second': 2.455, 'train_loss': 0.34508411376953124, 'epoch': 1.0}


TrainOutput(global_step=3125, training_loss=0.34508411376953124, metrics={'train_runtime': 1273.1583, 'train_samples_per_second': 19.636, 'train_steps_per_second': 2.455, 'total_flos': 827921241600000.0, 'train_loss': 0.34508411376953124, 'epoch': 1.0})
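
* The step count and the `learning_rate` values in the log can be reproduced by hand. The sketch below assumes a single device, a linear decay schedule with no warmup, and logging every 500 steps; these are assumptions that happen to match the numbers above, not settings stated in the notebook.

```python
import math

num_examples = 25_000  # size of the IMDB training split
batch_size = 8         # per_device_train_batch_size, assuming a single device
num_epochs = 1
base_lr = 2e-5

total_steps = math.ceil(num_examples / batch_size) * num_epochs


def linear_lr(step, warmup_steps=0):
    """Linear warmup followed by linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(total_steps)  # 3125, matching global_step in the output above
for step in range(500, total_steps, 500):
    # Columns: step, fraction of the epoch completed, learning rate at that step
    print(step, round(step / total_steps, 2), linear_lr(step))
```

* For example, step 500 is 16% of the way through the epoch, so the learning rate is 2e-5 × 0.84 = 1.68e-5, which is the first logged value.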

#### Evaluate Model
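* Evaluate the fine-tuned model on the test set. Since no `compute_metrics` function was passed to `Trainer`, the results contain only the loss and throughput figures.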

In [17]:
evaluation_results = trainer.evaluate()
print("\nEvaluation Results:", evaluation_results)


Evaluation Results: {'eval_loss': 0.31017592549324036, 'eval_runtime': 255.9775, 'eval_samples_per_second': 97.665, 'eval_steps_per_second': 12.208, 'epoch': 1.0}
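
* To report accuracy as well, `Trainer` accepts a `compute_metrics` callable that receives the model's logits and the true labels. The function below is one possible sketch, written in plain Python so it can be checked on small lists; the toy inputs are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); plain Python, so it also runs on nested lists."""
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Tiny hand-made check: three of the four rows are classified correctly.
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 0.2]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

* Passing `compute_metrics=compute_metrics` when constructing `Trainer` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.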


#### Save and Load Model
* Save the fine-tuned model and tokenizer to a local directory, then reload both from that directory.

In [18]:
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

# Load the model back
loaded_model = AutoModelForSequenceClassification.from_pretrained("./fine_tuned_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")

#### Inference with Fine-tuned Model
* Perform inference with the newly fine-tuned model.

In [19]:
text = "The movie was fantastic!"
inputs = loaded_tokenizer(text, return_tensors="pt")
with torch.no_grad():  # no gradients needed for inference
    outputs = loaded_model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()  # single-example batch
print("\nFine-tuned Model Prediction (0 = Negative, 1 = Positive):", predicted_class)


Fine-tuned Model Prediction (0 = Negative, 1 = Positive): 1
