# Hugging Face Transformers
Beginner-focused Jupyter Notebook introducing Hugging Face Transformer pipelines: Learn essentials of text embeddings, classification, question answering, named entity recognition and text generation using simple steps and pre-trained models. No prerequisites!

## What are Transformers?

Natural Language Processing (NLP) is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand individual words, but also to understand the context in which those words appear.

The following is a list of common NLP tasks:

- Sentiment Analysis
- Question Answering
- Named Entity Recognition
- Text Summarization
- Text Generation
- Text Classification: Zero-shot, Multi-label
- Mask Filling
- Text Translation


**Transformer models** can be used to solve all kinds of NLP tasks, like the ones mentioned above. The 🤗 Transformers library provides the functionality to create and use those shared models. The Model Hub contains thousands of pretrained models that anyone can download and use. You can also upload your own models to the Hub!


In this notebook, we will use the `transformers` library to run pre-trained models on various NLP tasks. The `pipeline()` function provides a simple, straightforward, and efficient API that covers several of these tasks.

<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/hf_transformers/hf_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
# Comment the above line to see the installation logs

# Install the latest version of the transformers library
!pip install -qU transformers

## Pipeline
The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

![Pipeline](images/transformer-pipeline.png)

Refer to the [Hugging Face Pipelines documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for more details.

In [None]:
from transformers import pipeline

### Sentiment analysis pipeline

<img src="images/sentiment_analysis.png" width="50%"/>

In [None]:
sentiment_task = pipeline("sentiment-analysis")
sentences = [
    "Dr Stone is great anime.",
    "I hate this so much!",
]
sentiment_task(sentences)

A pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:
<img src="images/sentiment_analysis_01.png" width="50%"/>
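
To make the last of those three steps more concrete, here is a minimal sketch of what postprocessing could look like for a sentiment model: raw scores (logits) are converted to probabilities with a softmax, and the most likely label is returned. The `postprocess` helper, the label map, and the logit values are assumptions made up for illustration; they are not taken from the library or from a real model run.

```python
import math


def postprocess(logits, id2label):
    """Convert raw model scores into a {"label", "score"} dictionary."""
    # Softmax, shifted by the maximum for numerical stability
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Pick the class with the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical label map and logits, for illustration only
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-1.5, 2.0]

result = postprocess(logits, id2label)
print(result)
```

The dictionary has the same `label` and `score` shape that the sentiment pipeline returns for each input sentence.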

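It is good practice to set up a cache directory for the `transformers` library so that downloaded models are reused instead of being downloaded again. Set these environment variables before importing `transformers` (restart the kernel if it is already imported), because the library reads them at import time.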

In [None]:
import os

# Setting the Hugging Face Model Cache Directory
MODEL_CACHE_DIR = "./../../../models_cache"

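# https://stackoverflow.com/questions/63312859/how-to-change-huggingface-transformers-default-cache-directory
# HF_HOME is the root cache directory for Hugging Face libraries
os.environ["HF_HOME"] = MODEL_CACHE_DIR
# TRANSFORMERS_CACHE is deprecated in recent transformers releases; kept here for older versions
os.environ["TRANSFORMERS_CACHE"] = MODEL_CACHE_DIR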

# os.environ["HF_DATASETS_CACHE"] = f"{MODEL_CACHE_DIR}/datasets"
# os.environ["TORCH_HOME"] = f"{MODEL_CACHE_DIR}/torch"
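
A relative cache path depends on the directory the notebook happens to be running from. One possible variation, sketched below, resolves the path to an absolute one before exporting it. The `./models_cache` location is a hypothetical choice, and the sketch assumes `HF_HOME` is set before `transformers` is imported.

```python
import os
from pathlib import Path

# Hypothetical cache location; any writable directory would do
cache_dir = Path("./models_cache").resolve()

# Export the absolute path before importing transformers
os.environ["HF_HOME"] = str(cache_dir)

print(os.environ["HF_HOME"])
```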

**Sentiment Analysis with Custom Model**

Let's choose a model from the [Model Hub: Text Classification](https://huggingface.co/models?pipeline_tag=text-classification&sort=likes) or [search for a `sentiment` analysis model](https://huggingface.co/models?search=sentiment).




In [None]:
# https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

sentiment_model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sentiment_task = pipeline(
    "sentiment-analysis", model=sentiment_model, tokenizer=sentiment_model
)
sentences = [
    "Dark Knight Rises was a great movie!",
    "I hate this so much!",
]
sentiment_task(sentences)

### Question Answering 

QA systems differ in the way answers are created.

- **Extractive QnA**: The model extracts the answer from a context and provides it directly to the user. It is usually solved with BERT-like models.

  <img src="images/ex_q_a_01.png" width=50%>
- **Generative QnA**: The model generates free text directly based on the context. It leverages Text Generation models and provides more flexible answers.

  <img src="images/gen_q_a_01.png" width=50%>


In [None]:
qa_model = pipeline("question-answering")

question = "What is zero?"
context = "zero is a number representing an empty quantity. Adding zero to any number leaves that number unchanged.  Multiplying any number by zero has the result zero, and consequently, division by zero has no meaning in arithmetic."

qa_response = qa_model(question=question, context=context)
print(qa_response)

The model returns a dictionary containing the keys:

- `answer`: The text extracted from the context, which should contain the answer.
- `start`: The character index in the context where the extracted answer starts.
- `end`: The character index in the context where the extracted answer ends.
- `score`: The model's confidence in the extracted answer.
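
The relationship between `start`, `end`, and `answer` can be checked by slicing the context. The sketch below builds a response dictionary by hand, in the same shape as the pipeline output, so that it runs without downloading a model. The chosen answer span and the score are assumptions for illustration, not real model output.

```python
context = (
    "zero is a number representing an empty quantity. "
    "Adding zero to any number leaves that number unchanged."
)

# Hand-built response in the shape described above (not real model output)
answer = "a number representing an empty quantity"
start = context.find(answer)
qa_response = {
    "score": 0.5,  # hypothetical confidence value
    "start": start,
    "end": start + len(answer),
    "answer": answer,
}

# Slicing the context with start and end recovers the answer text
recovered = context[qa_response["start"]:qa_response["end"]]
print(recovered)
```

The same slice can be used to highlight the answer inside the original context.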

### Named Entity Recognition

Named Entity Recognition (NER) is the NLP task of identifying key information (entities) in text. An entity is a set of contiguous words that appears in the document and refers to the same thing, for example "Kumar" or "India". Entities are usually classified into categories such as `Name` or `Country`.

  <img src="images/ner.png" width=50%>


In [None]:
ner = pipeline("ner")
ner("What was Gandhi doing in India on 15 August 1947?")

In [None]:
ner = pipeline("ner", aggregation_strategy="simple")
ner("What were Gandhi and Nehru doing in India on 15 August 1947?")
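
The first call returns one prediction per token, while `aggregation_strategy="simple"` merges tokens that belong to the same entity. The sketch below shows the general idea on hand-written token predictions. It is a simplified illustration, not the library's implementation: the `group_entities` helper, the `B-`/`I-` tags, the `##` word-piece prefix, and the scores are all assumptions.

```python
def group_entities(predictions):
    """Merge consecutive token predictions that belong to one entity."""
    groups = []
    for pred in predictions:
        # "B-PER" -> prefix "B", label "PER"
        prefix, _, label = pred["entity"].partition("-")
        word = pred["word"]
        continues = (
            bool(groups)
            and prefix == "I"
            and groups[-1]["entity_group"] == label
        )
        if continues:
            current = groups[-1]
            # Word pieces starting with "##" attach without a space
            current["word"] += word[2:] if word.startswith("##") else " " + word
            current["scores"].append(pred["score"])
        else:
            groups.append(
                {"entity_group": label, "word": word, "scores": [pred["score"]]}
            )
    # Report the mean token score for each merged entity
    return [
        {
            "entity_group": group["entity_group"],
            "word": group["word"],
            "score": sum(group["scores"]) / len(group["scores"]),
        }
        for group in groups
    ]


# Hypothetical token-level predictions, for illustration only
token_predictions = [
    {"word": "Gan", "entity": "B-PER", "score": 0.99},
    {"word": "##dhi", "entity": "I-PER", "score": 0.97},
    {"word": "India", "entity": "B-LOC", "score": 0.99},
]

grouped = group_entities(token_predictions)
print(grouped)
```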

### Text Summarization

Text summarization is the task of generating a concise and precise summary of a long text without losing its overall meaning. There are two main approaches to automatic text summarization: extractive and abstractive.

- **Extraction-based summarization**: A subset of words or sentences that represent the most important points is pulled from the long text and combined to make a summary. The results may not be grammatically accurate.

- **Abstraction-based summarization**: Advanced deep learning techniques (mainly in seq-to-seq models) are applied to paraphrase and shorten the original document, just like humans do. Since abstractive machine learning algorithms can generate new phrases and sentences that represent the most important information from the source text, they can assist in overcoming the grammatical inaccuracies of the extraction-based techniques.
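
To illustrate the extraction-based approach, here is a deliberately naive sketch that scores each sentence by the average frequency of its words and keeps the highest-scoring ones. The `extractive_summary` helper and the sample text are assumptions for illustration. Real extractive systems are considerably more sophisticated, and the summarization pipeline used in this notebook is a different approach.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Split into sentences at whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [part.strip() for part in parts if part.strip()]

    # Count how often each word occurs in the whole text
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[token] for token in tokens) / max(len(tokens), 1)

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


sample = (
    "The pot held palm leaf tickets. "
    "A child drew tickets from the pot. "
    "Thirty representatives were elected."
)
summary = extractive_summary(sample, num_sentences=1)
print(summary)
```

Because the summary is built only from sentences that already exist in the text, it cannot paraphrase or merge ideas the way an abstractive model can.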

<img src="images/text_summarization.png" width=50%>

Try Text Summarization

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")

text = """
The Uttaramerur inscriptions describe the distinctive Kudavolai system employed in the local governance of the Chola kingdom. In this administrative method, one representative per ward was selected via an electoral process. Contestant names were recorded on palm leaf tickets, which were subsequently placed inside a pot and mixed thoroughly. A young child drew out the tickets randomly, announcing the names of those chosen as representatives. Through this procedure, thirty individuals were elected to serve as ward administrators. The intriguing election technique came to be known as the Kudavolai system.
"""
summarizer(text, max_length=100, min_length=10)

### Text Generation

<img src="images/text_gen.png" width=50%>

Try Text Generation

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

In [None]:
generator("Science fiction movies are based on ",
          max_length=15, num_return_sequences=2)
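
The effect of these two arguments can be sketched with a toy generator that does not need a model. The `NEXT_WORDS` table and the `generate` function below are hypothetical stand-ins. They count whitespace-separated words, whereas the real pipeline counts tokenizer tokens, so treat this only as an illustration of the idea.

```python
import random

# Hypothetical next-word table standing in for a language model
NEXT_WORDS = {
    "on": ["science", "imagination"],
    "science": ["and", "fiction"],
    "imagination": ["and"],
    "and": ["science", "imagination"],
    "fiction": ["and"],
}


def generate(prompt, max_length, num_return_sequences, seed=0):
    """Sample several continuations, each capped at max_length words."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_return_sequences):
        tokens = prompt.split()
        # max_length limits the total length, prompt included
        while len(tokens) < max_length:
            choices = NEXT_WORDS.get(tokens[-1])
            if not choices:
                break
            tokens.append(rng.choice(choices))
        outputs.append({"generated_text": " ".join(tokens)})
    return outputs


prompt = "Science fiction movies are based on"
outputs = generate(prompt, max_length=10, num_return_sequences=2)
for output in outputs:
    print(output["generated_text"])
```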

### Text Classification: Zero-shot classification

**What is Zero-Shot Learning?**
Zero-Shot Learning is the ability to detect classes that the model has never seen during training. It resembles our ability as humans to generalize and identify new things without explicit supervision.

For example, let’s say we want to do `sentiment classification` and `news category` classification. Normally, we will train/fine-tune a new model for each dataset. In contrast, with zero-shot learning, you can perform tasks such as sentiment and news classification directly without any task-specific training.

<img src="images/zero-shot-vs-transfer.png" width=50%>

Try Zero-shot Classification

In [None]:
from transformers import pipeline

txt_to_classify = "You are learning about the transformers library in this notebook. It is part of a course on LLM Bootcamp."

classifier = pipeline("zero-shot-classification")
classifier(
    txt_to_classify,
    candidate_labels=["education", "politics", "business"],
)

### Text Classification: Multi-label classification
If more than one candidate label can be correct, pass `multi_label=True` to score each class independently:

In [None]:
txt_to_classify = "I love travelling and enjoying the cuisines of the world."
candidate_labels = ["travel", "cooking", "dancing", "exploration"]
classifier(txt_to_classify, candidate_labels, multi_label=True)
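
The difference between the two modes can be sketched with plain arithmetic. The sketch assumes each label receives a pair of scores, one for "the text contradicts this label" and one for "the text entails this label". In single-label mode the entailment scores compete across labels and sum to 1. In multi-label mode each label is normalized on its own. The logit values are made up for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical (contradiction, entailment) logits per candidate label
logits = {
    "travel": (-2.0, 3.0),
    "cooking": (-1.0, 2.0),
    "dancing": (2.0, -2.0),
    "exploration": (-0.5, 1.5),
}
labels = list(logits)

# Single-label: entailment scores compete across all labels
single = dict(zip(labels, softmax([logits[label][1] for label in labels])))

# Multi-label: each label is scored on its own
multi = {label: softmax(list(logits[label]))[1] for label in labels}

print(single)
print(multi)
```

In the multi-label scores, several labels can be close to 1 at the same time, which is the behavior wanted when more than one label applies.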

### Mask filling
The model attempts to fill in the special `<mask>` word, often referred to as a _mask token_. Other mask-filling models might use different mask tokens, so it is always good to verify the proper mask word when exploring other models. One way to check is to look at the mask word used in the widget on the model page. The `top_k` argument controls how many possibilities are displayed.

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
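
Since the mask token differs between model families, one option is to keep the prompt as a template and fill in the token separately. The `build_masked_prompt` helper below is a hypothetical convenience, not part of the library. RoBERTa-style tokenizers use `<mask>` and BERT-style tokenizers use `[MASK]`. For a loaded pipeline, the token should be available as `unmasker.tokenizer.mask_token`.

```python
def build_masked_prompt(template, mask_token):
    """Insert the model-specific mask token into a prompt template."""
    return template.format(mask=mask_token)


template = "This course will teach you all about {mask} models."

# Mask tokens for two common model families
roberta_style = build_masked_prompt(template, "<mask>")
bert_style = build_masked_prompt(template, "[MASK]")

print(roberta_style)
print(bert_style)
```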

### Text Translation
<img src="images/text_translation_03.png" width=50%>
<img src="images/text_translation_02.png" width=50%>


In [None]:
%%time
from transformers import pipeline

# French to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator(
    "Apprendre est amusant, j'aime apprendre l'intelligence artificielle.")
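
The translation pipeline returns a list of dictionaries, each with a `translation_text` key. The sketch below shows two small hypothetical helpers: one builds a model name following the `Helsinki-NLP/opus-mt-{source}-{target}` naming pattern seen above, and one pulls the translated strings out of a result. Not every language pair exists on the Hub, so check a name before using it. The sample result is a placeholder, not real model output.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a model name following the Helsinki-NLP opus-mt pattern."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def translation_texts(results):
    """Extract the translated strings from a translation pipeline result."""
    return [item["translation_text"] for item in results]


model_name = opus_mt_model_name("fr", "en")

# Placeholder result in the shape the pipeline returns
sample_result = [{"translation_text": "<translated text>"}]
texts = translation_texts(sample_result)

print(model_name)
print(texts)
```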

## Additional Reading

1.  [Hugging Face Transformers](https://huggingface.co/transformers/)
2.  [Hugging Face Transformers GitHub](https://github.com/huggingface/transformers)
3.  [Hugging Face Model Hub](https://huggingface.co/models)
4.  [LLM Bootcamp](https://github.com/miztiik/llm-bootcamp)
5.  [Deep dive into Text Generation](https://huggingface.co/blog/how-to-generate)